Optimal Transport-based Alignment of Learned Character Representations for String Similarity

String similarity models are vital for record linkage, entity resolution, and search. In this work, we present STANCE, a learned model for computing the similarity of two strings. Our approach encodes the characters of each string, aligns the encodings using Sinkhorn Iteration (alignment is posed as an instance of optimal transport) and scores the alignment with a convolutional neural network. We evaluate STANCE's ability to detect whether two strings can refer to the same entity, a task we term alias detection. We construct five new alias detection datasets (and make them publicly available). We show that STANCE (or one of its variants) outperforms both state-of-the-art and classic, parameter-free similarity models on four of the five datasets. We also demonstrate STANCE's ability to improve downstream tasks by applying it to an instance of cross-document coreference and show that it leads to a 2.8 point improvement in B^3 F1 over the previous state-of-the-art approach.


Introduction
String similarity models are crucial in record linkage, data integration, search and entity resolution systems, in which they are used to determine whether two strings refer to the same entity (Bilenko and Mooney, 2003; McCallum et al., 2005). In the context of these systems, measuring string similarity is complicated by a variety of factors including: the use of nicknames (e.g., Bill Clinton instead of William Clinton), token permutations (e.g., US Navy and Naval Forces of the US) and noise, among others. Many state-of-the-art systems employ either classic similarity models, such as Levenshtein, longest common subsequence, and Jaro-Winkler, or learned models for string similarity (Ventura et al., 2015; Kim et al., 2016a; Gan et al., 2017).
While classic and learned approaches can be effective, both have a number of shortcomings. First, the classic approaches have few parameters, making them inflexible and unlikely to succeed across languages or across domains with unique characteristics (e.g., company names, music album titles, etc.) (Needleman and Wunsch, 1970; Smith and Waterman, 1981; Winkler, 1999; Gionis et al., 1999; Bergroth et al., 2000; Cohen et al., 2003). Classic models also assume that each edit has equal cost, which is unrealistic. For example, consider the names Chun How and Chun Hao, which can refer to the same entity, and the names John A. Smith and John B. Smith, which cannot. Even though the first pair differs by 2 edits and the second pair by 1, transforming ow to ao in the first pair should cost less than transforming A to B in the second. Learned string similarity models address these problems by learning distinct costs for various edits and have thus proven successful in a number of domains (Bilenko and Mooney, 2003; McCallum et al., 2005; Gan et al., 2017). Some learned string similarity models, such as the SVM-based (Bilenko and Mooney, 2003) and CRF-based (McCallum et al., 2005) approaches, use edit patterns akin to insertions/swaps/deletions, which may lead to strong inductive biases. For example, even when costs are learned, two strings related by a token permutation (e.g., Grace Hopper and Hopper, Grace) are likely to have high cost even though they clearly refer to the same entity. Gan et al. (2017), on the other hand, provide less structure, encoding each string with a single vector embedding and measuring similarity between the embedded representations.
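The unit-cost assumption can be checked directly on the two name pairs above. The following is a minimal sketch of the classic Levenshtein distance (not part of STANCE); the function name `levenshtein` is our own choice for illustration.

```python
def levenshtein(a, b):
    """Classic edit distance: insert, delete, and substitute all cost 1."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(levenshtein("Chun How", "Chun Hao"))            # 2
print(levenshtein("John A. Smith", "John B. Smith"))  # 1
```

Under unit costs the non-alias pair looks closer than the alias pair, which is the mismatch that learned edit costs are meant to correct.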
In this paper, we present a learned string similarity model that is flexible, captures sequential dependencies of characters, and is readily able to learn a wide range of edit patterns, such as token permutations. Our approach has three components: the first encodes each character in both strings using a recurrent neural network; the second softly aligns the two encoded sequences by solving an instance of optimal transport; the third scores the alignment with a convolutional neural network. Each component is differentiable, allowing for end-to-end training. Our model is called STANCE, an acronym that stands for Similarity of Transport-Aligned Neural Character Encodings.
We evaluate STANCE's ability to capture string similarity in a task we term alias detection. The input to alias detection is a query mention (i.e., a string) and a set of candidate mentions, and the goal is to score query-candidate pairs that can refer to the same entity higher than pairs that cannot. For example, an accurate model scores the query Philips with candidates Philips Corporation and Katherine Philips higher than with M. Phelps. Alias detection differs from both coreference and entity linking in that neither the surrounding natural language context of the mention nor external knowledge is available. A similar task is studied in recent work (Gan et al., 2017).
In experiments, we compare STANCE to state-of-the-art and classic models of string similarity in alias detection on 5 newly constructed datasets, which we make publicly available. Our results demonstrate that STANCE outperforms all other approaches on 4 out of 5 datasets in terms of Hits@1 and 3 out of 5 datasets in terms of mean average precision. Of the two cases in which STANCE is outperformed by other methods in terms of mean average precision, one is by a variant of STANCE in an ablation study. We also demonstrate STANCE's capacity for supporting downstream tasks by using it in cross-document coreference for the Twitter at the Grammy's dataset (Dredze et al., 2016). Using STANCE improves upon the state-of-the-art by 2.8 points of B^3 F1. Analyzing our trained model reveals that STANCE effectively learns sequence-aware character similarities, filters noise with optimal transport, and uses the CNN scoring component to detect unconventional similarity-preserving edit patterns.

STANCE
Our goal is to learn a model, $f(\cdot, \cdot)$, that measures the similarity between two strings, called mentions. The model should produce a high score when its inputs are aliases of the same entity, where a mention is an alias of an entity if it can be used to refer to that entity. For example, the mentions Barack H. Obama and Barry Obama are both aliases of the entity wiki/Barack_Obama. Note that the alias relationship is not transitive: the mentions in each of the pairs (Obama, Barack Obama) and (Obama, Michelle Obama) are aliases of the same entity, but the mentions in the pair (Barack Obama, Michelle Obama) are not.
In this section we describe our proposed model, STANCE, which is comprised of three stages: encoding both mentions and constructing a corresponding similarity matrix, softly aligning the encoded mentions, and scoring the alignment.

Mention Encoding Similarity Matrix
A flexible string similarity model is sequence-aware, i.e., the cost of each character transformation should depend on the surrounding characters (e.g., transforming Chun How to Chun Hao should have low cost). To capture these sequential dependencies, STANCE encodes each mention using a bidirectional long short-term memory network (LSTM) (Hochreiter and Schmidhuber, 1997; Graves and Schmidhuber, 2005). In particular, each character $c_i$ in a mention $m$ is represented by a $d$-dimensional vector, $h_i$, where $h_i$ is the concatenation of the hidden states corresponding to $c_i$ produced by running the LSTM in both directions. The encoded representations of the characters are stacked to form a matrix $H^{(m)} \in \mathbb{R}^{L \times d}$, where $L$ (a hyperparameter) is the maximum string length considered by STANCE. Given a query $m$ and candidate $m'$, STANCE computes a similarity matrix of their encodings via an inner product: $S = H^{(m)} H^{(m')\top}$. Each cell in the resulting matrix measures the similarity between one pair of character encodings from $m$ and $m'$. Note that for a mention $q$, only the first $|q|$ (i.e., the length of the string $q$) rows of $H^{(q)}$ contain non-zero values.
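To make the shapes concrete, here is a minimal sketch of the padded encoding matrix and the inner-product similarity matrix. The `encode` function is a hypothetical, context-free stand-in for the bidirectional LSTM (it is not sequence-aware), and the values of `L` and `D` are arbitrary assumptions chosen for illustration.

```python
import math

L = 8  # maximum mention length (hyperparameter); value assumed for illustration
D = 4  # encoding dimensionality d; value assumed for illustration


def encode(mention):
    """Hypothetical stand-in for the BiLSTM: one D-dim vector per character,
    zero-padded to L rows. Unlike an LSTM, it ignores surrounding characters."""
    rows = [[math.sin(ord(c) * (k + 1)) for k in range(D)] for c in mention[:L]]
    rows += [[0.0] * D for _ in range(L - len(rows))]
    return rows


def similarity_matrix(h_m, h_mp):
    """S = H(m) H(m')^T: inner product of every pair of character encodings."""
    return [[sum(a * b for a, b in zip(hi, hj)) for hj in h_mp] for hi in h_m]


S = similarity_matrix(encode("obama"), encode("barack"))
print(len(S), len(S[0]))  # L rows and L columns, regardless of mention length
```

Rows beyond the length of the first mention and columns beyond the length of the second are all zero, mirroring the note about padded rows of the encoding matrix.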

Soft Alignment via Optimal Transport
The next component of our model computes a soft alignment between the characters of $m$ and $m'$. Aligning the mentions is posed as a transport problem, where the goal is to convert one mention into another while minimizing cost. In particular, we solve the Kantorovich formulation of optimal transport (OT). In this formulation, two probability measures, $p_1$ and $p_2$, are given in addition to a cost matrix, $C$. This matrix defines the cost of moving (or converting) each element in the support of $p_1$ to each element in the support of $p_2$. The solution to OT is a matrix, $\hat{P}$, called the transport plan, which defines how to completely convert $p_1$ into $p_2$. A viable transport plan is required to be non-negative and is also required to have marginals of $p_1$ and $p_2$ (i.e., if $\hat{P}$ is summed along the rows then $p_1$ is recovered and if it is summed along the columns $p_2$ is recovered). The goal is to find the plan with minimal cost,

$$\hat{P} = \operatorname*{argmin}_{P \in \mathcal{P}} \sum_{i=1}^{|p_1|} \sum_{j=1}^{|p_2|} C_{i,j} P_{i,j},$$

where $|\cdot|$ is the number of elements in the support of the corresponding distribution and $\mathcal{P}$ is the set of valid transportation plans. In this sense, a transportation plan can be thought of as a soft alignment of the supports of $p_1$ and $p_2$ (i.e., an element in $p_1$ can be aligned fractionally to multiple elements in $p_2$). A transportation plan can be computed efficiently via Sinkhorn Iteration, exploiting parallelism using GPUs (empirically it has been shown to be quadratic in $L$) (Cuturi, 2013). The transport plan is defined as $\hat{P} = \mathrm{diag}(\mathbf{u}) K \mathrm{diag}(\mathbf{v})$, where $K := e^{-\lambda C}$, $\mathbf{u}$ and $\mathbf{v}$ are found using the iterative algorithm, $\lambda$ is the entropic regularizer, and $\mathrm{diag}(\cdot)$ gives a matrix with its input argument as the diagonal (Cuturi, 2013). We specifically use the regularized objective that has been shown to be effective for training (Cuturi, 2013; Genevay et al., 2018). Optimal transport has been effectively used in several natural language-based applications such as computing the similarity between two documents as the transport cost (Kusner et al., 2015; Huang et al., 2016), in measuring distances between point cloud-based representations of words, and learning correspondences between word embedding spaces across domains/languages (Alvarez-Melis and Jaakkola, 2018; Alvarez-Melis et al., 2019).
Figure 2b visualizes the transport matrix and Figure 2c visualizes the element-wise product of the similarity and transport matrices. Many of the characters are highly similar. Multiplying by the transport matrix amplifies the alignment of the mentions while reducing noise, resulting in a clean alignment for the CNN scoring component.
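The Sinkhorn scaling described in this section can be sketched in a few lines. This is an illustrative pure-Python version, not the GPU implementation used in the experiments; the function name `sinkhorn`, the default `lam`, and the fixed iteration count are assumptions. It alternately rescales rows and columns of the kernel matrix until the marginals match.

```python
import math


def sinkhorn(C, p1, p2, lam=10.0, n_iters=200):
    """Entropy-regularized OT sketch: returns diag(u) K diag(v), K = exp(-lam * C).

    Assumes costs are scaled so that exp(-lam * C) does not underflow to zero.
    """
    n, m = len(p1), len(p2)
    K = [[math.exp(-lam * C[i][j]) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Rescale rows so that the row sums of the plan match p1 ...
        u = [p1[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        # ... then rescale columns so that the column sums match p2.
        v = [p2[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy cost matrix: rows 0 and 1 each prefer one column, row 2 is indifferent.
C = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
P = sinkhorn(C, [1 / 3] * 3, [0.5, 0.5])
print([round(sum(row), 6) for row in P])  # row sums, close to p1
```

A larger `lam` gives a sharper plan that is closer to the unregularized optimum, at the price of slower convergence and a higher risk of numerical underflow.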
In our case, $p_1$ represents the mention $m$ and $p_2$ represents $m'$. The distribution $p_1$ is defined as a point cloud consisting of the character embeddings computed by the LSTM applied to $m$, i.e., $H^{(m)}$. Formally, it is a set of evenly weighted Dirac delta functions in $\mathbb{R}^d$, where $d$ is the embedding dimensionality of the character representations. The distribution $p_2$ is defined similarly for $m'$. Transporting a character $c_i$ of $m$ to a character $c_j$ of $m'$ has cost $C_{i,j} = S_{\max} - S_{i,j}$, where $S_{\max} = \max_{i',j'} S_{i',j'}$ and $S_{i,j}$ is the inner product of $h_i$ and $h'_j$. The resulting transport plan is multiplied by the similarity matrix (Section 2.1) and subsequently fed as input to the next component of our model (Section 2.3). Despite being a soft alignment, this step helps mitigate spurious errors by reducing the similarity of character pairs that are not aligned.
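As a small worked instance of the cost construction above, the sketch below turns a toy similarity matrix into a cost matrix and builds evenly weighted marginals. The helper names are hypothetical, and placing mass only on real (unpadded) characters is an assumption on our part.

```python
def cost_from_similarity(S):
    """C[i][j] = S_max - S[i][j], so the most similar pair has zero cost."""
    s_max = max(max(row) for row in S)
    return [[s_max - s for s in row] for row in S]


def uniform_weights(n):
    """Evenly weighted point cloud: each of the n characters carries mass 1/n."""
    return [1.0 / n] * n


# Toy similarity matrix for a 2-character and a 3-character mention.
S = [[0.9, 0.1, 0.2],
     [0.0, 0.8, 0.3]]
C = cost_from_similarity(S)
p1, p2 = uniform_weights(len(S)), uniform_weights(len(S[0]))
print(C[0][0], C[1][0])  # 0.0 0.9
```

Because the largest similarity is subtracted from every entry, all costs are non-negative, which keeps the kernel entries in (0, 1] when the costs are exponentiated.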

Alignment Score
The transport plan, $\hat{P} \in \mathbb{R}_{+}^{L \times L}$, describes how the characters in $m$ are softly aligned to the characters in $m'$. We compute the element-wise product of the similarity matrix, $S$, and the transport plan: $S' = S \circ \hat{P}$. Cells containing high values in $S'$ correspond to similar character pairs from $m$ and $m'$ that are also well-aligned.
Note the distinction between this alignment and the way in which the transport cost can be used as a distance measure. Here, the alignment is used to reweight the similarity matrix. In this way, the transport plan is closely related to attention-based models (Bahdanau et al., 2015; Parikh et al., 2016; Vaswani et al., 2017; Kim et al., 2017).
Finally, we employ a two-dimensional convolutional neural network (CNN) to score $S'$ (LeCun et al., 1998). With access to the full matrix $S'$, the CNN is able to detect multiple aligned character subsequences from $m$ and $m'$ that are highly similar. By combining evidence from multiple, potentially non-contiguous, aligned character subsequences, the CNN detects long-range similarity-preserving edit patterns. This is crucial, for example, in computing a high score for the pair Obama, Barack and Barack Obama.
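To illustrate why a convolution over the aligned similarity matrix can pick up token permutations, the sketch below applies a single hand-set diagonal filter to an exact-match matrix. Both are simplifications and assumptions: STANCE learns its filters and operates on the reweighted LSTM similarities, whereas `match_matrix` is a hypothetical stand-in.

```python
def match_matrix(a, b):
    """Stand-in for S': 1.0 where characters match exactly, else 0.0."""
    return [[1.0 if ca == cb else 0.0 for cb in b] for ca in a]


def conv2d_valid(X, kernel):
    """Plain 2-D cross-correlation with stride 1 and no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(X) - kh + 1):
        row = []
        for j in range(len(X[0]) - kw + 1):
            row.append(sum(kernel[di][dj] * X[i + di][j + dj]
                           for di in range(kh) for dj in range(kw)))
        out.append(row)
    return out


# Hand-set 3x3 filter that responds to a run of three aligned matches.
DIAGONAL = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]


def diagonal_hits(a, b):
    """Count windows whose main diagonal is fully matched (a shared trigram)."""
    response = conv2d_valid(match_matrix(a, b), DIAGONAL)
    return sum(1 for row in response for r in row if r == 3.0)


print(diagonal_hits("barack obama", "obama, barack"))  # 7
print(diagonal_hits("barack obama", "m. phelps"))      # 0
```

The permuted pair produces two separate diagonal runs (one per token), and a later layer or the final linear scorer can combine such evidence regardless of where in the matrix the runs occur.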
The architecture of the alignment-scoring CNN is a three layer network with filters of fixed size. A linear model is used to score the final output of the CNN. See Figure 1 for a visual representation of the STANCE architecture.
Training. We train on mention triples, $(q, p, n)$, where there exists an entity for which $q$ and $p$ are both aliases (i.e., $(q, p)$ is a positive example), and there does not exist an entity for which both $q$ and $n$ are aliases (i.e., $(q, n)$ is a negative example). We use the Bayesian Personalized Ranking objective (Rendle et al., 2009): $\sigma(f(q, p) - f(q, n))$.
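The ranking objective above can be written as a loss to minimize. The sketch below assumes the common negative-log form of Bayesian Personalized Ranking, i.e., minimizing $-\log \sigma(f(q, p) - f(q, n))$; the function name `bpr_loss` is our own, and the scores are plain floats rather than model outputs.

```python
import math


def bpr_loss(score_pos, score_neg):
    """-log(sigmoid(f(q, p) - f(q, n))), computed in a numerically stable way."""
    diff = score_pos - score_neg
    # -log(sigmoid(diff)) equals log(1 + exp(-diff)); branch to avoid overflow.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


print(bpr_loss(2.0, 0.5))  # small loss: positive already ranked above negative
print(bpr_loss(0.5, 2.0))  # larger loss: the pair is ranked the wrong way
```

The loss depends only on the score difference, so the model is free to choose any score scale as long as positives are ranked above negatives.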

Alias Detection
String similarity is a crucial piece of data integration, search and entity resolution systems, yet there are few large-scale datasets for training and evaluating domain-specific string similarity models. Unlike in coreference resolution, a high quality model should return high scores for mention pairs in which both strings are aliases of (i.e., can refer to) the same entity. For example, the mention Clinton should exhibit high score with both B. Clinton and H. Clinton.
We construct five datasets for training and evaluating string similarity models derived from four large-scale public knowledge bases, which encompass a diverse range of entity types. The five datasets are summarized below:
1. Wikipedia (W) - We consider pages in Wikipedia to be entities. For each entity, we extract spans of text hyperlinked to that entity's page and use these as aliases.
2. Wikipedia-People (WP) - The Wikipedia dataset restricted to entities with type person in Freebase (Bollacker et al., 2008).
3. Patent Assignee (A) - Aliases of assignees (mostly organizations, some persons) found by combining entity information with non-disambiguated assignees in patents.
4. Music Artist (M) - MusicBrainz (Swartz, 2002) contains alternative names for music artists.
5. Diseases (D) - The Comparative
For each dataset, entities are divided into training, development, and testing sets, such that each entity appears in only one set. This partitioning scheme is meant to ensure that performant models capture a general notion of similarity, rather than learning to recognize the aliases of particular entities. Dataset statistics can be found in Table 1.
Most mention pairs selected uniformly at random are not aliases of the same entity. A model trained on such pairs may learn to always predict "Non-alias." To avoid learning such degenerate models and to avoid test sets for which degenerate models are performant, we carefully construct the training, development and test sets by including a mix of positive and negative examples and by generating negative examples designed to be difficult and practical. We use a mixture of the following five heuristics to generate negative examples:
1. Small Edit Distance - mentions with Levenshtein distance of 1 or 2 from the query.
2. Character Overlap - mentions that share a 4-gram word prefix or suffix with the query.
3. 4-Hop Aliases - first, construct a bipartite graph of mentions and entities where an edge between a mention and an entity denotes that the mention is an alias of the entity. Then, sample a mention that is not an alias of an entity for which the query is also an alias, and whose shortest path to the query requires 4 hops in the graph. Note that all mentions 2 hops from the query are aliases of an entity for which the query is also an alias.
4. 6-Hop Aliases - sample a mention whose shortest path to the query in the bipartite mention-entity graph is 6 hops.
5. Random - randomly sample mentions that are not aliases of the entity for which the query is also an alias. We do this by first sampling an entity and then sampling an alias of that entity uniformly at random.
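The hop-based heuristics above amount to a breadth-first search over the bipartite mention-entity graph. The following is a minimal sketch under our own assumptions: the alias table is a hypothetical toy example, and the function names are not taken from the released code.

```python
from collections import deque


def hop_distances(query, mention_to_entities):
    """BFS over the bipartite graph; returns hops from `query` to each mention."""
    entity_to_mentions = {}
    for mention, entities in mention_to_entities.items():
        for entity in entities:
            entity_to_mentions.setdefault(entity, set()).add(mention)
    dist = {query: 0}
    queue = deque([query])
    while queue:
        mention = queue.popleft()
        for entity in mention_to_entities.get(mention, ()):
            for neighbor in entity_to_mentions[entity]:
                if neighbor not in dist:
                    # mention -> entity -> neighbor is two edges in the graph
                    dist[neighbor] = dist[mention] + 2
                    queue.append(neighbor)
    return dist


def k_hop_negatives(query, mention_to_entities, k):
    """Mentions whose shortest path to the query is exactly k hops."""
    dist = hop_distances(query, mention_to_entities)
    return sorted(m for m, d in dist.items() if d == k)


# Hypothetical toy alias table (mention -> entities it can refer to).
ALIASES = {
    "Barack Obama": {"Barack_Obama"},
    "Obama": {"Barack_Obama", "Michelle_Obama"},
    "Michelle Obama": {"Michelle_Obama"},
    "First Lady": {"Michelle_Obama", "Laura_Bush"},
    "Laura Bush": {"Laura_Bush"},
    "Mrs. Bush": {"Laura_Bush", "Barbara_Bush"},
    "Barbara Bush": {"Barbara_Bush"},
}

print(k_hop_negatives("Barack Obama", ALIASES, 4))  # ['First Lady', 'Michelle Obama']
print(k_hop_negatives("Barack Obama", ALIASES, 6))  # ['Laura Bush', 'Mrs. Bush']
```

Mentions at distance 2 share an entity with the query and are therefore positives, which is why negatives start at 4 hops.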
In all cases, we sample such that entities that appear more frequently in the corpus and entities that have a larger number of aliases are more likely to be sampled (intuitively, these entities are more relevant and more challenging). For the Wikipedia-based datasets, we sample entities proportionally to the number of hyperlink spans linking to the entity. For the Assignee dataset, we estimate entity frequency by the number of patents held by the entity. For the Music Artist dataset, entity frequency is estimated by the number of entity occurrences in the Last-FM-1k dataset (Last.fm; Celma, 2010). For the disease dataset, we do not have frequency information and so sampling is performed uniformly at random. For each dataset, 300 queries are selected for use in the development set and 4000 queries for use in the test set. Each query is paired with up to 1000 negative examples of each type mentioned above. For training, we also construct datasets using the approaches above for creating negative examples. Figure 3 illustrates how negative (and positive) examples are generated for the query peace agreement (which is used to refer to the entities wiki/Peace_Treaty and wiki/Lancaster_House_Agreement). 4-Hop (negative) aliases include Peace Support Operations and peacekeeping troops and 6-Hop (negative) examples include UN Peacekeeping and Blue beret. Note that for each type of negative example, any mention that is a true positive alias of the query is excluded from being a negative example, even if it satisfies one of the above heuristics.
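Frequency-proportional sampling of entities, as described above, can be sketched with the standard library. The counts, the function name `sample_entities`, and the fixed seed are assumptions made for illustration only.

```python
import random


def sample_entities(entity_counts, n, seed=0):
    """Draw n entities with probability proportional to their frequency counts."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    entities = list(entity_counts)
    weights = [entity_counts[e] for e in entities]
    return rng.choices(entities, weights=weights, k=n)


# Hypothetical frequency counts (e.g., hyperlink spans or patents per entity).
counts = {"frequent_entity": 90, "rare_entity": 10, "unseen_entity": 0}
draws = sample_entities(counts, 1000)
print(draws.count("frequent_entity"), draws.count("rare_entity"))
```

Passing equal weights for every entity reduces this to uniform sampling, which corresponds to the case where no frequency information is available.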

Experiments
We evaluate STANCE directly via alias detection and also indirectly via cross document coreference. We also conduct an ablation study in order to understand the contribution of each of STANCE's three components to its overall performance.

Alias Detection
In the first experiment, we compare STANCE with both classic and learned similarity models in alias detection. Specifically, we compare STANCE to the following approaches:
• Deep Conflation Model (DCM) - state-of-the-art model that encodes each string using a 1-dimensional CNN applied to character n-grams and computes cosine similarity (Gan et al., 2017). We use the available code.
• Learned Dynamic Time Warping (LDTW) - encodes mentions using a bidirectional LSTM and computes similarity via dynamic time warping (DTW). We note the equivalence between LDTW and weighted finite state transducers where the transducer topology is the edit distance (insert, delete, swap) program. Parameters are learned such that DTW distance is meaningful (Cuturi and Blondel, 2017).
• LSTM - represents each mention using the final hidden state of a bidirectional LSTM. Similarity is the dot product of mention representations (i.e., $S_{|m|,|m'|}$).
• Classic Approaches - Levenshtein Distance (Lev), Jaro-Winkler distance (JW), Longest Common Subsequence (LCS).
• Phonetic Relaxation (Sdx) - transforms mentions using the Soundex phonetic mapping and then computes Levenshtein distance.
• CRF - an implementation of the model defined by McCallum et al. (2005).
Given a query mention, $q$, and a set of candidate mentions, we use each model to rank candidates by similarity to $q$. We compute the mean average precision (MAP) and hits at $k \in \{1, 10, 50\}$ of the ranking with respect to a set of ground truth labeled aliases. We report MAP and hits at $k$ averaged over all test queries. The set of candidates for query $q$ includes all corresponding positive and negative examples from the test set (Section 3).
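The ranking metrics used in this evaluation can be sketched as follows. The sketch assumes that every true alias appears in the candidate list and that hits at k is a binary indicator of a true alias in the top k; the scores in the example are hypothetical and do not come from any model.

```python
def average_precision(ranked, positives):
    """Mean of precision@rank over the ranks at which a true alias appears."""
    hits, total = 0, 0.0
    for rank, cand in enumerate(ranked, start=1):
        if cand in positives:
            hits += 1
            total += hits / rank
    return total / len(positives) if positives else 0.0


def hits_at_k(ranked, positives, k):
    """1.0 if any true alias appears in the top k candidates, else 0.0."""
    return 1.0 if any(c in positives for c in ranked[:k]) else 0.0


def rank_candidates(scores):
    """Sort candidates by descending similarity score."""
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical scores for the query "Philips".
scores = {"Philips Corporation": 0.9, "M. Phelps": 0.6, "Katherine Philips": 0.4}
positives = {"Philips Corporation", "Katherine Philips"}
ranked = rank_candidates(scores)
print(round(average_precision(ranked, positives), 4))  # 0.8333
print(hits_at_k(ranked, positives, 1))                 # 1.0
```

MAP is then the mean of `average_precision` over all test queries, and the reported hits at k is the mean of `hits_at_k` over the same queries.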
For models with hyperparameters, we tune the hyperparameters on the dev set using a grid search over: embedding dimension, learning rate, hidden state dimension, and number of filters (for the CNN). All models were implemented in PyTorch, utilizing SinkhornAutoDiff, and optimized with Adam (Kingma and Ba, 2015). Our implementation is publicly available.

Ablation Study
Our second experiment is designed to reveal the purpose of each of STANCE's components. To do so, we compare variants of STANCE with components removed and/or modified. Specifically, we compare the following variants:
• WITHOUT-OT (-OT) - STANCE with LSTM encodings and CNN scoring but without optimal transport-based alignment.
• [...] and CNN scoring model, designed to assess the importance of the initial mention encodings. Once more, the optimal transport-based alignment is removed.
We evaluate each model variant using MAP and hits at k on the 5 datasets as in the first experiment. Results can be found in Table 2 and Table 3, respectively. We note that these ablations are equivalent to the models proposed by Traylor et al. (2017).
Table 2 and Table 3 contain the MAP and hits at k (respectively) for each method and dataset (for alias detection and ablation experiments). The results reveal that, with the exception of the disease dataset, STANCE (or one of its variants) performs best in terms of both metrics. The results suggest that the optimal transport and CNN-based alignment scoring components of STANCE lead to a more robust model of similarity than inner-product based models, like LSTM and DCM. We hypothesize that using n-grams as opposed to individual character embeddings is advantageous on the disease dataset, leading to DCM's top performance. Surprisingly, -OT is best on the assignee dataset. We hypothesize that this is due to many corporate acronyms.

Results and Analysis
To better understand STANCE's performance and improvement over the baseline methods we provide analysis of particular examples highlighting two advantages of the model: it leverages optimal transport for noise reduction, and it uses its CNN-based scoring function to learn non-standard similarity-preserving string edit patterns that would be difficult to learn with classic edit operations (i.e., insert, delete and substitute).
Noise Reduction. Since the model leverages distributed representations for characters, it often discovers many similarities between the characters in two mentions. For example, Figure 4a shows two strings that are not aliases of the same entity. Despite this, there are many regions of high similarity due to multiple instances of the character bigrams aa, an and en in both mentions. In experiments, we find that this leads the -OT model astray. However, STANCE's optimal transport component constructs a transport plan that contains little alignment between the characters in the mentions as seen in Figure 4b, which displays the product of the similarity matrix and the transportation plan. Ultimately, this leads STANCE to correctly predict that the two strings are not similar.
Token Permutation. A natural and frequently occurring similarity-preserving edit pattern in our datasets is token permutation, i.e., the tokens of two aliases of the same entity are ordered differently in each mention. For example, consider the similarity matrix in Figure 5b. The CNN easily learns that two strings may be aliases of the same entity even if one is a token permutation of the other. This is because it identifies multiple contiguous "diagonal lines" in the similarity matrix. Classic and learned string similarity measures do not learn this relationship easily.

Cross Document Coreference
We evaluate the impact of using STANCE in cross-document coreference on the Twitter at the Grammy's dataset (Dredze et al., 2016). This dataset consists of 4577 mentions of 273 entities in tweets published close in time to the 2013 Grammy awards. We use the same train/dev/test partition with data provided by the authors (bitbucket.org/mdredze/tgx). The dataset is notable for having significant variation in the spellings of mentions that refer to the same entity. We design a simple cross-document coreference model that ignores the mention context and simply uses the STANCE model trained on WikiPPL. We perform average linkage hierarchical agglomerative clustering using STANCE scores as the linkage function and halt agglomerations according to a threshold (i.e., no agglomerations with linkage below the threshold are performed). We tune the threshold on the development set by finding the value which gives the highest evaluation score (B^3 F1). We compare our method to the previously published state-of-the-art methods, Green (Green et al., 2012) and Phylo (Andrews et al., 2014). Both of these methods report numbers using their name spelling features alone as well as with context features. We find that our approach outperforms both methods (including those using context features) on the test dataset in terms of B^3 F1 (Table 4).
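The clustering procedure described above can be sketched as a greedy loop that merges the pair of clusters with the highest average pairwise score until that score falls below the threshold. The similarity function here is a hypothetical bigram-Jaccard stand-in for STANCE scores, and the mentions and threshold are toy values chosen for illustration.

```python
def bigram_jaccard(a, b):
    """Hypothetical stand-in for a STANCE score: Jaccard overlap of bigrams."""
    ga = {a[i:i + 2] for i in range(len(a) - 1)}
    gb = {b[i:i + 2] for i in range(len(b) - 1)}
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0


def average_linkage_clusters(mentions, sim, threshold):
    """Greedy average-linkage agglomeration that halts below `threshold`."""
    clusters = [[m] for m in mentions]

    def linkage(a, b):
        return sum(sim(x, y) for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best, (i, j) = max(
            (linkage(clusters[i], clusters[j]), (i, j))
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters)))
        if best < threshold:
            break  # no remaining pair of clusters is similar enough to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


mentions = ["kanye west", "kanye", "adele", "adelle"]
print(average_linkage_clusters(mentions, bigram_jaccard, threshold=0.4))
# [['kanye west', 'kanye'], ['adele', 'adelle']]
```

This quadratic-time loop is only meant to show the stopping rule; in practice the threshold would be tuned on the development set as described in the text.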

Related Work
Classic string similarity methods based on string alignment include Levenshtein distance, Longest Common Subsequence, Needleman and Wunsch (1970), and Smith and Waterman (1981). Sequence modeling and alignment is a widely studied problem in both theoretical and applied computer science, and the literature is too vast to cover in full. The most relevant prior work focuses on learned string edit models and includes the work of McCallum et al. (2005), which uses a model based on CRFs, and Bilenko and Mooney (2003), which uses an SVM-based model. Andrews et al. (2014) developed a generative model, which is used for joint cross-document coreference and string edit modeling tasks. Closely related work also appears in the field of computational morphology (Dreyer et al., 2008; Faruqui et al., 2016; Rastogi et al., 2016). Much of this work uses WFSTs with learned parameters. JRC-Names (Steinberger et al., 2011; Ehrmann et al., 2017) is a dataset that stores multilingual aliases of person and organization entities.
Neural network architectures similar to ours have been used for related sequence alignment problems. Santos et al. (2017) use an RNN to encode toponyms before using a multi-layer perceptron to determine if a pair of toponyms match. The Match-SRNN computes a similarity matrix over two sentence representations and uses an RNN applied to the matrix in a manner akin to the classic dynamic program for question answering and IR tasks (Wan et al., 2016). A similar RNN-based alignment approach was also used for phoneme recognition (Graves, 2012). Many previous works have studied character-level models (Kim et al., 2016b; Sutskever et al., 2011).
Alias detection also bears similarity to natural language inference tasks, where instead of aligning characters to determine if two mentions refer to the same entity, the task is to align words to determine if two sentences are semantically equivalent (Bowman et al., 2015; Williams et al., 2018).
Optimal transport and the related Wasserstein distance are studied in mathematics, optimization, and machine learning (Peyré et al., 2017; Villani, 2008). It has notably been used in the NLP community for modeling the distances between documents (Kusner et al., 2015; Huang et al., 2016) as the cost of transporting embedded representations of the words in one document to the words of the an-

Conclusion
In this work, we present STANCE, a neural model of string similarity that is trained end-to-end. The main components of our model are: a character-level bidirectional LSTM for character encoding, a soft alignment mechanism via optimal transport, and a powerful CNN for scoring alignments. We evaluate our model on 5 datasets created from publicly available knowledge bases and demonstrate that it outperforms the baselines in almost all cases. We also show that using STANCE improves upon state-of-the-art performance in cross-document coreference on the Twitter at the Grammy's dataset. We analyze our trained model and show that its optimal transport component helps to filter noise and that it has the capacity to learn non-standard similarity-preserving string edit patterns.
In future work, we hope to further study the connections between our optimal transport-based alignment method and methods based on attention. We also hope to consider connections to work on probabilistic latent representations of permutations and matchings. Additionally, we hope to apply STANCE to a wider range of entity resolution tasks, for which string similarity is one component of a model that considers additional features such as the natural language context of the entity mention.