Learning Text Similarity with Siamese Recurrent Networks

This paper presents a deep architecture for learning a similarity metric on variable-length character sequences. The model combines a stack of character-level bidirectional LSTMs with a Siamese architecture. It learns to project variable-length strings into a fixed-dimensional embedding space using only information about the similarity between pairs of strings. The model is applied to the task of job title normalization based on a manually annotated taxonomy. A small data set is incrementally expanded and augmented with new sources of variance. The model learns a representation that is selective to differences in the input that reflect semantic differences (e.g., "Java developer" vs. "HR manager") but invariant to non-semantic string differences (e.g., "Java developer" vs. "Java programmer").


Introduction
Text representation plays an important role in natural language processing (NLP). Tasks in this field rely on representations that can express the semantic similarity and dissimilarity between textual elements, whether they are viewed as sequences of words or of characters. Such representations and their associated similarity metrics have many applications. For example, word similarity models based on dense embeddings have recently been applied in diverse settings, such as sentiment analysis (dos Santos and Gatti, 2014) and recommender systems (Barkan and Koenigstein, 2016). Semantic textual similarity measures have been applied to tasks such as automatic summarization (Ponzanelli et al., 2015), debate analysis (Boltuzic and Šnajder, 2015) and paraphrase detection (Socher et al., 2011).
Measuring the semantic similarity between texts is also a fundamental problem in Information Extraction (IE) (Martin and Jurafsky, 2000). An important step in many applications is normalization, which puts pieces of information in a standard format so that they can be compared to other pieces of information. Normalization relies crucially on semantic similarity. An example of normalization is formatting dates and times in a standard way, so that "12pm", "noon" and "12.00h" all map to the same representation. Normalization is also important for string values. Person names, for example, may be written in different orderings or character encodings depending on their country of origin. A sophisticated search system may need to understand that the strings "李小龙", "Lee, Junfan" and "Bruce Lee" all refer to the same person, and so these strings need to be represented in a way that indicates their semantic similarity. Normalization is essential for retrieving actionable information from free, unstructured text.
In this paper, we present a system for job title normalization, a common task in information extraction for recruitment and social network analysis (Javed et al., 2014; Malherbe et al., 2014). The task is to receive an input string and map it to one of a finite set of externally predefined job codes. For example, the string "software architectural technician Java/J2EE" might need to be mapped to "Java developer". This task can be approached as a classification problem with a very large number of classes, but in this study we instead focus on learning a representation of the strings such that synonymous job titles are close together. This approach has the advantage of flexibility: the representation can serve as the input space to a subsequent classifier, but can also be used to find closely related job titles or to explore job title clusters. In addition, the architecture of the learning model allows us to learn useful representations with limited supervision.

Related Work
The use of (deep) neural networks for NLP has recently received much attention, starting from the seminal papers employing convolutional networks on traditional NLP tasks (Collobert et al., 2011) and the availability of high-quality semantic word representations. In the last few years, neural network models have been applied to tasks ranging from machine translation (Zou et al., 2013; Cho et al., 2014) to question answering (Weston et al., 2015). Central to these models, which are usually trained on large amounts of labeled data, is feature representation. Word embedding techniques such as word2vec and GloVe (Pennington et al., 2014) have seen much use in such models, but some go beyond the word level and represent text as a sequence of characters (Kim et al., 2015; Ling et al., 2015). In this paper we take the latter approach for the flexibility it affords us in dealing with out-of-vocabulary words.
Representation learning through neural networks has received interest since autoencoders (Hinton and Salakhutdinov, 2006) were shown to produce features that satisfy the two desiderata of representations: invariance to differences in the input that do not matter for the task, and selectivity to differences that do (Anselmi et al., 2015).
The Siamese network (Bromley et al., 1993) is an architecture for non-linear metric learning from similarity information. Through explicit information about the similarity between pairs of objects, it naturally learns representations that embody the invariance and selectivity desiderata. An autoencoder, in contrast, learns invariance only through added noise and dimensionality reduction in the bottleneck layer, and selectivity solely through the condition that the input should be reproduced by the decoding part of the network; a Siamese network learns an invariant and selective representation directly from similarity and dissimilarity information.
Originally applied to signature verification (Bromley et al., 1993), the Siamese architecture has since been widely used in vision applications. Siamese convolutional networks were used to learn complex similarity metrics for face verification (Chopra et al., 2005) and dimensionality reduction on image features (Hadsell et al., 2006). A variant of the Siamese network, the triplet network (Hoffer and Ailon, 2015), was used to learn an image similarity measure based on ranking data (Wang et al., 2014).
The task of job title normalization is often framed as a classification task (Javed et al., 2014; Malherbe et al., 2014). Given the large number of classes (often in the thousands), multi-stage classifiers have shown good results, especially if information outside the string can be used (Javed et al., 2015). There are several disadvantages to this approach. The first is the expense of data acquisition for training. With many thousands of groups of job titles, often not very dissimilar from one another, manually classifying large amounts of job title data becomes prohibitively expensive. A second disadvantage of this approach is its lack of corrigibility. Once a classification error has been discovered or a new example has been added to a class, the only option to improve the system is to retrain the entire classifier with the new sample added to the correct class in the training set. The last disadvantage is that using a traditional classifier does not allow for transfer learning, i.e., reusing the learned model's representations for a different task.
A different approach is the use of string similarity measures to classify input strings by proximity to an element of a class (Spitters et al., 2010). The advantage of this approach is that there is no need to train the system, so that improvements can be made by adding job title strings to the data. The disadvantages are that data acquisition still needs to be performed by manually classifying strings and that the bulk of the problem is now shifted to constructing a good similarity metric.
By modeling similarity directly based on pairs of inputs, Siamese networks lend themselves well to the semantic invariance phenomena present in job title normalization: typos (e.g. "Java developeur"), near-synonymy (e.g., "developer" and "programmer") and extra words (e.g., "experienced Java developer"). This is the approach we take in this study.

Siamese Recurrent Neural Network
Recurrent Neural Networks (RNNs) are neural networks adapted for sequence data (x_1, ..., x_T). At each time step t ∈ {1, ..., T}, the hidden-state vector h_t is updated by the equation

  h_t = σ(W x_t + U h_{t-1}),

in which x_t is the input at time t, W is the weight matrix from inputs to the hidden-state vector, U is the weight matrix on the hidden-state vector from the previous time step h_{t-1}, and σ (here and below) denotes the logistic function.

Standard RNNs suffer from the vanishing gradient problem, in which the backpropagated gradients become vanishingly small over long sequences (Pascanu et al., 2013). The Long Short-Term Memory (LSTM) model (Hochreiter and Schmidhuber, 1997) was proposed as a solution to this problem, and this variant in particular has had success in tasks related to natural language processing, such as text classification (Graves, 2012) and language translation. Like the standard RNN, the LSTM sequentially updates a hidden-state representation, but it introduces a memory state c_t and three gates that control the flow of information through the time steps. An output gate o_t determines how much of c_t should be exposed to the next node. An input gate i_t controls how much the input x_t matters at this time step. A forget gate f_t determines whether the previous time step's memory should be forgotten. An LSTM is parametrized by weight matrices from the input and the previous state for each of the gates, in addition to the memory cell. We use the standard formulation of LSTMs, with the logistic function (σ) on the gates and the hyperbolic tangent (tanh) on the activations. In equations (1) below, ∘ denotes the Hadamard (elementwise) product:

  i_t = σ(W_i x_t + U_i h_{t-1})
  f_t = σ(W_f x_t + U_f h_{t-1})
  o_t = σ(W_o x_t + U_o h_{t-1})
  c̃_t = tanh(W_c x_t + U_c h_{t-1})        (1)
  c_t = f_t ∘ c_{t-1} + i_t ∘ c̃_t
  h_t = o_t ∘ tanh(c_t)

Bidirectional RNNs (Schuster and Paliwal, 1997) incorporate both future and past context by running the reverse of the input through a separate RNN.
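As a concrete illustration, a single LSTM time step can be sketched in NumPy. The weight shapes, the absence of bias terms, and the random initialization below are illustrative assumptions, not the configuration used in this paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U):
    """One LSTM time step. W and U map the gate names 'i', 'f', 'o'
    and the cell candidate 'c' to their weight matrices."""
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev)      # input gate
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev)      # forget gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev)      # output gate
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev)  # candidate memory
    c_t = f_t * c_prev + i_t * c_tilde                 # Hadamard products
    h_t = o_t * np.tanh(c_t)                           # exposed hidden state
    return h_t, c_t

# Toy dimensions: 3-dimensional input, 4-dimensional hidden state
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((4, 3)) for k in "ifoc"}
U = {k: rng.standard_normal((4, 4)) for k in "ifoc"}
h, c = np.zeros(4), np.zeros(4)
h, c = lstm_step(rng.standard_normal(3), h, c, W, U)
```

Because h_t is the product of a gate in (0, 1) and a tanh activation in (-1, 1), every component of the hidden state stays strictly inside (-1, 1).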
The output of the combined model at each time step is simply the concatenation of the outputs from the forward and backward networks. Bidirectional LSTM models in particular have recently shown good results on standard NLP tasks like Named Entity Recognition (Huang et al., 2015;Wang et al., 2015) and so we adopt this technique for this study.
Siamese networks (Chopra et al., 2005) are dual-branch networks with tied weights, i.e., they consist of the same network copied and merged with an energy function. Figure 1 shows an overview of the network architecture in this study. The training set for a Siamese network consists of triplets (x_1, x_2, y), where x_1 and x_2 are character sequences and y ∈ {0, 1} indicates whether x_1 and x_2 are similar (y = 1) or dissimilar (y = 0). The aim of training is to minimize the distance in an embedding space between similar pairs and maximize the distance between dissimilar pairs.

Contrastive loss function
The proposed network contains four layers of Bidirectional LSTM nodes. The activations at each timestep of the final BLSTM layer are averaged to produce a fixed-dimensional output. This output is projected through a single densely connected feedforward layer.
Let f_W(x_1) and f_W(x_2) be the projections of x_1 and x_2 into the embedding space computed by the network function f_W. We define the energy of the model E_W to be the cosine similarity between the embeddings of x_1 and x_2:

  E_W(x_1, x_2) = cos(f_W(x_1), f_W(x_2)) = <f_W(x_1), f_W(x_2)> / (||f_W(x_1)|| ||f_W(x_2)||)

For brevity of notation, we will denote E_W(x_1, x_2) by E_W. The total loss function over a data set X = {(x_1^(i), x_2^(i), y^(i))}_{i=1}^{N} is the sum of the instance losses:

  L(X) = Σ_{i=1}^{N} L_W(x_1^(i), x_2^(i), y^(i))

The instance loss function L_W is a contrastive loss function, composed of terms for the similar (y = 1) and dissimilar (y = 0) cases:

  L_W(x_1, x_2, y) = y L_+(x_1, x_2) + (1 − y) L_−(x_1, x_2)

The loss functions for the similar and dissimilar cases are given by:

  L_+(x_1, x_2) = (1/4) (1 − E_W)^2
  L_−(x_1, x_2) = E_W^2 if E_W > m, and 0 otherwise,

where m is a margin on the energy of dissimilar pairs. Figure 2 gives a geometric perspective on the loss function, showing the positive and negative components separately. Note that the positive loss is scaled down to compensate for the sampling ratio of positive and negative pairs (see below).
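The cosine energy and contrastive loss can be sketched as follows. The margin value m = 0.5 and the hinged form of the negative term are illustrative assumptions:

```python
import numpy as np

def cosine_energy(e1, e2):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))

def contrastive_loss(e1, e2, y, margin=0.5):
    """y = 1 for similar pairs, y = 0 for dissimilar pairs.
    The positive term pulls similar pairs toward E_W = 1 (scaled by 1/4);
    the negative term pushes dissimilar pairs below the margin."""
    E = cosine_energy(e1, e2)
    loss_pos = 0.25 * (1.0 - E) ** 2
    loss_neg = E ** 2 if E > margin else 0.0
    return y * loss_pos + (1 - y) * loss_neg

a = np.array([1.0, 0.0])   # toy embeddings
b = np.array([1.0, 0.0])
c = np.array([0.0, 1.0])
```

An identical similar pair (a, b, y=1) and an orthogonal dissimilar pair (a, c, y=0) both incur zero loss; mismatched pairs are penalized.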
The network used in this study contains four BLSTM layers with 64-dimensional hidden vectors h t and memory c t . There are connections at each time step between the layers. The outputs of the last layer are averaged over time and this 128-dimensional vector is used as input to a dense feedforward layer. The input strings are padded to produce a sequence of 100 characters, with the input string randomly placed in this sequence. The parameters of the model are optimized using the Adam method (Kingma and Ba, 2014) and each model is trained until convergence. We use the dropout technique (Srivastava et al., 2014) on the recurrent units (with probability 0.2) and between layers (with probability 0.4) to prevent overfitting.
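The input padding scheme, with the string placed at a random offset within the fixed-length sequence, can be sketched as follows. The padding character and the truncation of over-long strings are assumptions:

```python
import random

def pad_randomly(s, length=100, pad_char=" ", rng=None):
    """Pad s to `length` characters, placing it at a random offset."""
    rng = rng or random.Random(0)
    if len(s) >= length:
        return s[:length]          # assumed handling of over-long inputs
    offset = rng.randrange(length - len(s) + 1)
    return pad_char * offset + s + pad_char * (length - len(s) - offset)

padded = pad_randomly("Java developer")
```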

Experiments
We conduct a set of experiments to test the model's capabilities. We start from a small data set based on a hand-made taxonomy of job titles. In each subsequent experiment the data set is augmented by adding new sources of variance. We test the model's behavior in a set of unit tests reflecting desired capabilities of the model, taking our cue from Weston et al. (2015). This section discusses the data augmentation strategies, the composition of the unit tests, and the results of the experiments.

Baseline
Below we compare the performance of our model against a baseline n-gram matcher (Daelemans et al., 2004). Given an input string, this matcher looks up the closest neighbor from the base taxonomy by maximizing a similarity scoring function, and subsequently labels the input string with that neighbor's group label. The similarity scoring function is defined as follows. Let Q = q_1, ..., q_M be the query as a sequence of characters, C = c_1, ..., c_N a candidate match from the taxonomy, and G_n(·) the set of character n-grams of a string. The similarity function is the n-gram overlap normalized by the size of the candidate's n-gram set:

  sim(Q, C) = |G_n(Q) ∩ G_n(C)| / |G_n(C)|

This (non-calibrated) similarity function has the properties that it is easy to compute, requires no learning, and is particularly insensitive to extra words appended to the input string, one of the desiderata listed below.
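A minimal sketch of such an n-gram matcher, assuming character trigrams and normalization by the candidate's n-gram count (one simple way to obtain the stated insensitivity to extra words in the query):

```python
def char_ngrams(s, n=3):
    """Set of character n-grams of s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(query, candidate, n=3):
    """Shared n-grams normalized by the candidate's n-gram count,
    so extra words in the query do not lower the score."""
    q, c = char_ngrams(query, n), char_ngrams(candidate, n)
    return len(q & c) / len(c) if c else 0.0

def match(query, taxonomy):
    """Return the taxonomy entry with the highest similarity score."""
    return max(taxonomy, key=lambda cand: ngram_similarity(query, cand))

taxonomy = ["java developer", "hr manager", "machine operator"]
best = match("experienced java developer (urgent)", taxonomy)
```

Because "java developer" is a substring of the query, every candidate trigram is covered and the score is 1.0 despite the extra words.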
In the experiments listed below, the test sets consist of pairs of strings, the first of which is the input string and the second a target group label from the base taxonomy. The network model projects the input string into the embedding space and searches for its nearest neighbor under cosine distance from the base taxonomy. The test records a hit if and only if the neighbor's group label matches the target.
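The nearest-neighbor lookup under cosine distance can be sketched as follows; the embeddings and group labels are toy values:

```python
import numpy as np

def nearest_label(query_emb, base_embs, labels):
    """Return the group label of the nearest taxonomy embedding
    under cosine distance (equivalently, highest cosine similarity)."""
    q = query_emb / np.linalg.norm(query_emb)
    B = base_embs / np.linalg.norm(base_embs, axis=1, keepdims=True)
    sims = B @ q                     # cosine similarity to every entry
    return labels[int(np.argmax(sims))]

base = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = ["Software Engineer", "HR Manager"]
pred = nearest_label(np.array([0.9, 0.1]), base, labels)
```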

Data and Data Augmentation
The starting point for our data is a hand-made, proprietary job title taxonomy. This taxonomy partitions a set of 19,927 job titles into 4,431 groups. Table 1 gives some examples of the groups in the taxonomy. The job titles were manually and semi-automatically collected from résumés and vacancy postings. Each was manually assigned a group, such that the job titles in a group are close together in meaning. In some cases this closeness is an expression of a (near-)synonymy relation between the job titles, as in "developer" and "developer/programmer" in the "Software Engineer" category. In other cases a job title in a group is a specialization of another, for example "general operator" and "buzz saw operator" in the "Machine Operator" category. In yet other cases two job titles differ only in their expression of seniority, as in "developer" and "senior developer" in the "Software Engineer" category. In all cases, the relation between the job titles is one of semantic similarity and not necessarily surface form similarity. So while "Java developer" and "J2EE programmer" are in the same group, "Java developer" and "real estate developer" should not be.
Note that some groups are close together in meaning, like the "Production Employee" and "Machine Operator" groups. Some groups could conceivably be split into two groups, depending on the level of granularity that is desired. We make no claim to completeness or consistency of these groupings, but instead regard the wide variety of different semantic relations between and within groups as an asset that should be exploited by our model.
The groups are not equal in size; the sizes follow a broken power-law distribution. The largest group contains 130 job titles, while the groups at the other end of the distribution have only one. This affects the amount of information we can give to the system with regard to the semantic similarity between job titles in a group. The long tail of the distribution may impact the model's ability to accurately learn to represent the smallest groups. Figure 3 shows the distribution of the group sizes of the original taxonomy.
We proceed from the base taxonomy of job titles in four stages. At each stage we introduce (1) an augmentation of the data which focuses on a particular property and (2) a test that probes the model for behavior related to that property. Each stage builds on the previous one, so earlier augmentations are always included. Initially, the data set consists of pairs of strings sampled from the taxonomy in a 4:1 ratio of between-class (negative) pairs to within-class (positive) pairs. This ratio was determined empirically, but other studies have found a similar optimal ratio of negative to positive pairs in Siamese networks (Synnaeve and Dupoux, 2016). In the subsequent augmentations, we keep this ratio constant.

1. Typo and spelling invariance. Users of the system may supply job titles that differ in spelling from what is present in the taxonomy (e.g., "laborer" vs. "labourer"), or they may make a typo and insert, delete or substitute a character. To induce invariance to these, we augment the base taxonomy with positive sample pairs consisting of job title strings paired with the same string with 20% of its characters randomly substituted and 5% randomly deleted. These typo pairs make up 10% of the resulting training set. The corresponding test set (Typos) consists of all 19,927 job title strings in the taxonomy with 5% of their characters randomly substituted or deleted. This corresponds to an approximate upper bound on the proportion of spelling errors (Salthouse, 1986).
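The typo augmentation of stage 1 can be sketched as follows. The substitution alphabet and the independence of the per-character corruption decisions are assumptions:

```python
import random
import string

def add_typos(s, p_sub=0.20, p_del=0.05, rng=None):
    """Randomly substitute a fraction p_sub and delete a fraction p_del
    of the characters in s."""
    rng = rng or random.Random(42)
    out = []
    for ch in s:
        r = rng.random()
        if r < p_del:
            continue                                       # delete
        if r < p_del + p_sub:
            out.append(rng.choice(string.ascii_lowercase))  # substitute
        else:
            out.append(ch)                                  # keep
    return "".join(out)

noisy = add_typos("java developer")
```

A (title, add_typos(title)) pair is then added to the training set as a positive example.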

2. Synonyms. Furthermore, the model must be invariant to synonym substitution. To continue the example given above, the similarity between "Java developer" and "Java programmer" shows that in the context of computer science "developer" and "programmer" are synonyms. This entails that, given the same context, "developer" can be substituted for "programmer" in any string in which it occurs without altering the meaning of that string. So "C++ developer" can be changed into "C++ programmer" and still refer to the same job. Together with the selectivity constraint, the invariance to synonym substitution constitutes a form of compositionality over the component parts of job titles. A model with this compositionality property will be able to generalize over the meanings of parts of job titles to form useful representations of unobserved inputs. We augment the data set by substituting words in job titles with synonyms from two sources. The first source is a manually constructed job title synonym set, consisting of around 1,100 job titles, each with between one and ten synonyms, for a total of around 7,000 synonyms. The second source of synonyms is induction. As in the example above, we look through the taxonomy for groups in which two job titles share one or two words, e.g., "C++". The complements of the matching strings form a synonym candidate, e.g., "developer" and "programmer". If the candidate meets certain requirements (neither part occurs in isolation, the parts do not contain special characters like '&', and the parts consist of at most two words), the candidate is accepted as a synonym and is substituted throughout the group. The effect of this augmentation on the group sizes is shown in Figure 3. The corresponding test set (Composition) consists of a held-out set of 7,909 pairs constructed in the same way.
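The synonym induction procedure can be sketched as follows. This simplified version applies the special-character and length filters from the text but omits the check that neither part occurs in isolation elsewhere in the data:

```python
def induce_synonyms(group):
    """From the job titles in one taxonomy group, propose synonym
    candidates: whenever two titles share a word, the differing
    remainders form a candidate pair, subject to filtering rules."""
    candidates = set()
    titles = [t.split() for t in group]
    for i, a in enumerate(titles):
        for b in titles[i + 1:]:
            shared = set(a) & set(b)
            if not shared:
                continue
            rest_a = " ".join(w for w in a if w not in shared)
            rest_b = " ".join(w for w in b if w not in shared)
            ok = (rest_a and rest_b and rest_a != rest_b
                  # reject parts with special characters such as '&' or '+'
                  and all(ch.isalnum() or ch == " " for ch in rest_a + rest_b)
                  # each part may consist of at most two words
                  and len(rest_a.split()) <= 2 and len(rest_b.split()) <= 2)
            if ok:
                candidates.add(frozenset((rest_a, rest_b)))
    return candidates

group = ["c++ developer", "c++ programmer", "java developer"]
pairs = induce_synonyms(group)
```

For this toy group, "developer"/"programmer" is accepted (shared word "c++"), while "c++"/"java" is rejected by the special-character filter.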
3. Extra words. To be useful in real-world applications, the model must also be invariant to the presence of superfluous words. Due to parsing errors or user mistakes, the input to the normalization system may contain strings like "looking for C++ developers (urgent!)", or references to technologies, certifications or locations that are not present in the taxonomy. Table 2 shows some examples of real input. We augment the data set by extracting examples of superfluous words from real-world data. We construct a set by selecting those input strings for which there is a job title in the base taxonomy that is a complete, strict substring of the input and which the baseline n-gram matcher selects as the normalization. As an example, in Table 2, the input string "public relations consultant business business b2c" contains the taxonomy job title "public relations consultant". Part of this set (N = 1949) is held out from training and forms the corresponding test set (Extra Words).
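The substring condition used to harvest these extra-word examples can be sketched as follows; the additional condition that the baseline matcher selects the contained title as the normalization is omitted here:

```python
def is_extra_words_example(input_string, taxonomy_title):
    """True when the taxonomy title is a complete, strict substring
    of the input, i.e. the input only adds surrounding words."""
    return taxonomy_title in input_string and input_string != taxonomy_title

ex = is_extra_words_example(
    "public relations consultant business business b2c",
    "public relations consultant")
```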
Input string
supervisor dedicated services share plans
part II architectural assistant or architect at
geography teacher 0.4 contract now
customer relationship management developer super user
forgot password
public relations consultant business business b2c
teaching assistant degree holders only contract
Table 2: Example input strings to the system.

4. Feedback. Lastly, and importantly for industrial applications, we would like our model to be corrigible, i.e., when the model displays undesired behavior or our knowledge about the domain increases, we want the model to facilitate manual intervention. As an example, if the trained model assigns a high similarity score to the strings "Java developer" and "Coffee expert (Java, Yemen)" based on the corresponding substrings, we would like to be able to signal to the model that these particular instances do not belong together. To test this behavior, we manually scored a set of 11,929 predictions. This set was subsequently used for training. The corresponding test set (Annotations) consists of a different set of 1,000 manually annotated, held-out input strings.

Table 3 shows the results of the experiments. It compares the baseline n-gram system and the proposed neural network models on the four tests outlined above. Each of the neural network models (1)-(4) was trained on augmentations of the data set that the previous model was trained on.

Results
The first thing to note is that both the n-gram matching system and the proposed models have near-complete invariance to simple typos. This is of course expected behavior, but this test functions as a good sanity check on the surface form mapping to the representations that the models learn.
On all tests except the Annotations test, we see a strong effect of the associated augmentation. Model (1) shows a 0.04 improvement over model (0) on the Typos test. This indicates that the proposed architecture is suitable for learning invariance to typos, but that the addition of typos and spelling variants to the training input only produces marginal improvements over the already high accuracy on this test.
Model (2) shows a 0.29 improvement over model (1) on the Composition test. This indicates that model (2) has successfully learned to combine the meanings of individual words in the job titles into new meanings. This is an important property for a system that aims to learn semantic similarity between text data. Compositionality is arguably the most important property of human language, and it is a defining characteristic of the way we construct compound terms such as job titles. Note also that the model learned this behavior based largely on observations of combinations of words, while having little direct evidence about the meanings of the individual words.
Model (3) shows a 0.45 improvement over model (2) on the Extra Words test, jumping from 0.29 accuracy to 0.76. This indicates, firstly, that the proposed model can successfully learn to ignore large portions of the input sequence and, secondly, that evidence of extra words around the job title is crucial for the system to do so. Being able to ignore subsequences of an input sequence is an important ability for information extraction systems.
The improvement on the Annotations test is also greatest when the extra words are added to the training data.

Discussion
In this paper, we presented a model architecture for learning text similarity based on Siamese recurrent neural networks. With this architecture, we learned a series of embedding spaces, each based on a specific augmentation of the data set used to train the model. The experiments demonstrated that these embedding spaces capture important invariances of the input; the models showed themselves invariant to spelling variation, synonym replacement and superfluous words. The proposed architecture makes no assumptions about the input distribution and naturally scales to a large number of classes. The ability of the system to learn these invariances stems from the contrastive loss function combined with the stack of recurrent layers. Using separate loss functions for similar and dissimilar samples helps the model maintain selectivity while learning invariance over different sources of variability. The experiments show that the explicit use of prior knowledge to add these sources of invariance to the system was crucial for learning. Without this knowledge, extra words and synonyms negatively affect the performance of the system.
We would like to explore several directions in future work. The possibility space around the proposed network architecture could be explored more fully, for example by incorporating convolutional layers in addition to the recurrent layers, or by investigating a triplet loss function instead of the contrastive loss used in this study.
The application used here is a good use case for the proposed system, but in future work we would also like to explore the behavior of the Siamese recurrent network on standard textual similarity and semantic entailment data sets. In addition, the baseline used in this paper is relatively weak. A comparison to a stronger baseline would serve the further development of the proposed models.
Currently negative samples are selected randomly from the data set. Given the similarity between some groups and the large differences in group sizes, a more advanced selection strategy is likely to yield good results. For example, negative samples could be chosen such that they always emphasize minimal distances between groups. In addition, new sources of variation as well as the sampling ratios between them can be explored.
Systems like the job title taxonomy used in the current study often exhibit a hierarchical structure that we did not exploit or attempt to model in the current study. Future research could attempt to learn a single embedding which would preserve the separations between groups at different levels in the hierarchy. This would enable sophisticated transfer learning based on a rich embedding space that can represent multiple levels of similarities and contrasts simultaneously.