TextFlow: A Text Similarity Measure based on Continuous Sequences

Text similarity measures are used in multiple tasks such as plagiarism detection, information ranking and recognition of paraphrases and textual entailment. While recent advances in deep learning highlighted the relevance of sequential models in natural language generation, existing similarity measures do not fully exploit the sequential nature of language. Examples of such similarity measures include n-gram and skip-gram overlap, which rely on distinct slices of the input texts. In this paper we present a novel text similarity measure inspired by a common representation in DNA sequence alignment algorithms. The new measure, called TextFlow, represents input text pairs as continuous curves and uses both the actual position of the words and sequence matching to compute the similarity value. Our experiments on eight different datasets show very encouraging results in paraphrase detection, textual entailment recognition and ranking relevance.


Background
The number of pages required to print the content of the World Wide Web was estimated at 305 billion in a 2015 article (http://goo.gl/p9lt7V). While a big part of this content consists of visual information such as pictures and videos, texts also continue growing at a very high pace. A recent study shows that the average webpage weighs 1,200 KB, with plain text accounting for up to 16% of that size (http://goo.gl/c41wpa).
While efficient distribution of textual data and computations is the key to dealing with the increasing scale of textual search, similarity measures still play an important role in refining search results to more specific needs such as the recognition of paraphrases and textual entailment, plagiarism detection and fine-grained ranking of information. These tasks are also often performed on dedicated document collections for domain-specific applications where text similarity measures can be directly applied.
Finding relevant approaches to compute text similarity motivated a lot of research in the last decades (Sahami and Heilman, 2006; Hatzivassiloglou et al., 1999), and more recently with deep learning methods (Socher et al., 2011; Yih et al., 2011; Severyn and Moschitti, 2015). However, most of the recent advances focused on designing high-performance classification methods, trained and tested for specific tasks, and did not offer a standalone similarity measure that could be applied (i) regardless of the application domain and (ii) without requiring training corpora.
For instance, Yih and Meek (2007) presented an approach to improve text similarity measures for web search queries with a length ranging from one word to short sequences of words. The proposed method is tailored to this specific task, and the trained models are not expected to perform well on different kinds of data such as sentences, questions or paragraphs. In a more general study, Achananuparp et al. (2008) compared several text similarity measures for paraphrase recognition, textual entailment, and the TREC 9 question variants task. In their experiments the best performance was obtained with a linear combination of semantic and lexical similarities, including a word order similarity proposed in (Li et al., 2006). This word order similarity is computed by first constructing two vectors representing the common words between two given sentences, using their respective positions in the sentences as term weights. The similarity value is then obtained by subtracting the two vectors and taking the absolute value. While such a representation takes into account the actual positions of the words, it does not allow detecting sub-sequence matches and takes into account missing words only by omission.
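The word order similarity described above can be illustrated with a minimal sketch. It assumes whitespace-tokenized input, uses the first occurrence of each common word, and normalizes the difference of the two position vectors by the norm of their sum, which is our reading of the formulation in (Li et al., 2006); the function name `word_order_similarity` is hypothetical.

```python
from math import sqrt


def word_order_similarity(s1, s2):
    """Position-based word order similarity over the words common to s1 and s2.

    Assumption: similarity = 1 - ||r1 - r2|| / ||r1 + r2||, where r1 and r2
    hold the 1-based positions of the common words in each sentence.
    """
    # Common words, in the order they first appear in s1.
    common = [w for w in dict.fromkeys(s1) if w in s2]
    if not common:
        return 0.0
    r1 = [s1.index(w) + 1 for w in common]
    r2 = [s2.index(w) + 1 for w in common]
    diff = sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2)))
    total = sqrt(sum((a + b) ** 2 for a, b in zip(r1, r2)))
    return 1.0 - diff / total


# A word missing from one sentence is simply left out of both vectors.
same_order = word_order_similarity(["a", "b", "c"], ["a", "b", "c", "d"])
reversed_order = word_order_similarity(["a", "b", "c"], ["c", "b", "a"])
```

In this sketch the extra word `d` has no effect on the score, which illustrates the point that missing words are handled only by omission.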
More generally, existing standalone (or traditional) text similarity measures rely on the intersections between token sets and/or text sizes and frequency, including measures such as the Cosine similarity, Euclidean distance, Levenshtein (Sankoff and Kruskal, 1983), Jaccard (Jain and Dubes, 1988) and Jaro (Jaro, 1989). The sequential nature of natural language is taken into account mostly through word n-grams and skip-grams, which capture distinct slices of the analysed texts but do not preserve the order in which they appear.
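To make the limitation concrete, the following minimal sketch computes the Jaccard and Cosine similarities over bags of words. The helper names and the example sentences are illustrative assumptions, not part of the evaluated systems. Both measures return a perfect score for two sentences that contain the same words in a different order.

```python
from collections import Counter
from math import sqrt


def jaccard(a, b):
    """Jaccard similarity between the token sets of a and b."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def cosine(a, b):
    """Cosine similarity between the term-frequency vectors of a and b."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


s1 = "the dog bit the man".split()
s2 = "the man bit the dog".split()
order_blind_jaccard = jaccard(s1, s2)
order_blind_cosine = cosine(s1, s2)
```

Both values equal 1 (up to floating-point rounding for the cosine) although the two sentences describe opposite events.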
In this paper, we use intuitions from a common representation in DNA sequence alignment to design a new standalone similarity measure called TextFlow (XF). The proposed measure uses at the same time the full sequence of input texts in a natural sub-sequence matching approach together with individual token matches and mismatches. Our contributions can be detailed further as follows:
• A novel standalone similarity measure which:
  - exploits the full sequence of words in the compared texts.
  - is asymmetric in a way that allows it to provide the best performance on different tasks (e.g., paraphrase detection, textual entailment and ranking).
  - when required, can be trained with a small set of parameters controlling the impact of sub-sequence matching, position gaps and unmatched words.
  - provides consistent high performance across tasks and datasets compared to traditional similarity measures.
• A neural network architecture to train TextFlow parameters for specific tasks.
• An empirical study on both performance consistency and standard evaluation measures, performed with eight datasets from three different tasks.
• A new evaluation measure, called CORE, used to better show the consistency of a system at high performance using both its rank average and rank variance when compared to competing systems over a set of datasets.

The TextFlow Similarity
XF is inspired by a dot matrix representation commonly used in pairwise DNA sequence alignment (cf. figure 1). We use a similar dot matrix representation for text pairs and draw a curve oscillating around the diagonal (cf. figure 2). The area under the curve is considered to be the distance between the two texts, which is then normalized by the matrix surface. For practical computation, we transform this first intuitive representation using the delta of positions as in figure 3. In this setting, the Y axis is the delta of positions of a word occurring in the two texts being compared. If the word does not occur in the target text, the delta is set to a maximum reference value (l in figure 2). The semantics are: the bigger the area under the curve, the lower the similarity between the compared texts. XF values are real numbers in the [0,1] interval, with 1 indicating a perfect match, and 0 indicating that the compared texts do not have any common tokens. With this representation, we are able to take into account all matched words and sub-sequences at the same time.
The exact value for the XF similarity between two texts X = {x_1, x_2, .., x_n} and Y = {y_1, y_2, .., y_m} is therefore computed as:
$$XF(X, Y) = 1 - \frac{1}{nm} \sum_{i=2}^{n} \frac{1}{S_i} T_{i,i-1}(X, Y) - \frac{1}{nm} \sum_{i=2}^{n} \frac{1}{S_i} R_{i,i-1}(X, Y)$$
with T_{i,i-1}(X, Y) corresponding to the triangular area in the [i-1, i] step and R_{i,i-1}(X, Y) corresponding to the rectangular component. They are expressed as:
$$T_{i,i-1}(X, Y) = \frac{|\Delta P(x_i, X, Y) - \Delta P(x_{i-1}, X, Y)|}{2}$$
and:
$$R_{i,i-1}(X, Y) = \min\big(\Delta P(x_i, X, Y), \Delta P(x_{i-1}, X, Y)\big)$$
With:
• ∆P(x_i, X, Y) the minimum difference between x_i positions in X and Y. The position of x_i in X is multiplied by the factor |Y|/|X| for normalization. If x_i does not occur in Y, ∆P(x_i, X, Y) is set to the same reference value equal to m (i.e., the cost of a missing word is set by default to the length of the target text), and:
• S_i is the length of the longest matching sequence between X and Y including the word x_i (1 if x_i has no match in Y).
XF computation is performed in O(nm) in the worst case, where we have to check all tokens in the target text Y for all tokens in the input text X. XF is an asymmetric similarity measure. Its asymmetric aspect has interesting semantic applications as we show in the example below (cf. figure 2).
The minimum value of XF provided the best differentiation between positive and negative text pairs when looking for semantic equivalence (i.e., paraphrases), the maximum value was among the top three for the textual entailment example. We conduct this comparison at a larger scale in the evaluation section.
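The computation can be sketched as follows. This is a minimal illustration under stated assumptions, not the reference Java implementation: the area under the position-delta curve is accumulated step by step as a triangle (half the absolute difference of two consecutive deltas) plus a rectangle (their minimum), each step is divided by the length of the longest contiguous matching run containing the current token (1 for unmatched tokens), and the total is normalized by nm. The function names are hypothetical.

```python
def delta_positions(x, y):
    """Minimum position difference of each token of x in y (m if absent)."""
    n, m = len(x), len(y)
    scale = m / n  # normalizes positions in x to the length of y
    deltas = []
    for i, tok in enumerate(x, start=1):
        hits = [j for j, t in enumerate(y, start=1) if t == tok]
        deltas.append(min(abs(i * scale - j) for j in hits) if hits else float(m))
    return deltas


def longest_runs(x, y):
    """For each token of x, length of the longest contiguous run shared
    with y that contains this token (1 if the token does not occur in y)."""
    n, m = len(x), len(y)
    back = [[0] * (m + 2) for _ in range(n + 2)]  # run length ending at (i, j)
    fwd = [[0] * (m + 2) for _ in range(n + 2)]   # run length starting at (i, j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                back[i][j] = back[i - 1][j - 1] + 1
    for i in range(n, 0, -1):
        for j in range(m, 0, -1):
            if x[i - 1] == y[j - 1]:
                fwd[i][j] = fwd[i + 1][j + 1] + 1
    return [max([1] + [back[i][j] + fwd[i][j] - 1
                       for j in range(1, m + 1) if x[i - 1] == y[j - 1]])
            for i in range(1, n + 1)]


def textflow(x, y):
    """Canonical XF sketch: 1 minus the normalized area under the delta curve."""
    n, m = len(x), len(y)
    if n == 0 or m == 0:
        return 0.0
    d, s = delta_positions(x, y), longest_runs(x, y)
    area = 0.0
    for i in range(1, n):
        triangle = abs(d[i] - d[i - 1]) / 2
        rectangle = min(d[i], d[i - 1])
        area += (triangle + rectangle) / s[i]
    return 1.0 - area / (n * m)


def aggregate(x, y, how=min):
    """Symmetric aggregation of the two asymmetric values (min, max, ...)."""
    return how(textflow(x, y), textflow(y, x))


short, long_ = ["a", "b"], ["a", "b", "c", "d"]
forward, backward = textflow(short, long_), textflow(long_, short)
```

With these assumptions the short text is scored as largely contained in the long one (0.90625), while the reverse direction is penalized for the two missing words (0.515625). Note that, because the sum starts at the second token, two texts without any shared token score 1/n in this sketch rather than exactly 0.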
We add three parameters to XF in order to represent the importance that should be given to position deltas (position factor α), missing words (sensitivity factor β), and sub-sequence matching (sequence factor γ), such that: With: and:
• α < β: forces missing words to always cost more than matched words.
The γ factor increases or decreases the impact of sub-sequence matching, α applies to individual token matches whether inside or outside a sequence, and β increases or decreases the impact of missing words.
Positive Entailment
E1 Under a blue sky with white clouds, a child reaches up to touch the propeller of a plane standing parked on a field of grass.
E2 A child is reaching to touch the propeller of a plane.
Negative Entailment
E3 Two men on bicycles competing in a race.
E4 Men are riding bicycles on the street.
Positive Paraphrase
P1 The most serious breach of royal security in recent years occurred in 1982 when 30-year-old Michael Fagan broke into the queen's bedroom at Buckingham Palace.
P2 It was the most serious breach of royal security since 1982 when an intruder, Michael Fagan, found his way into the Queen's bedroom at Buckingham Palace.
Negative Paraphrase
P3 "Americans don't cut and run, we have to see this misadventure through," she said.
P4 She also pledged to bring peace to Iraq: "Americans don't cut and run, we have to see this misadventure through."

Parameter Training
By default XF has canonical parameters set to 1. However, when needed, α, β, and γ can be learned on training data for a specific task. We designed a neural network to perform this task, with a hidden layer dedicated to computing the exact XF value. To do so we compute, for each input text pair, the coefficients vector that would lead exactly to the XF value when multiplied by the vector ⟨α/β, α/(βγ), 1⟩. Figure 5 presents the training neural network considering several types of sequences (or translations) of the input text pairs (e.g., lemmas, words, synsets).
We use identity as activation function in the dedicated XF layer in order to have a correct comparison with the other similarity measures, including canonical XF where the similarity value is provided in the input layer (cf. figure 6).

Evaluation
Datasets. This evaluation was performed on 8 datasets from 3 different classification tasks: Textual Entailment Recognition, Paraphrase Detection, and ranking relevance. The datasets are as follows:
• RTE 1, 2, and 3: the first three datasets from the Recognizing Textual Entailment (RTE) challenge (Dagan et al., 2006). Each dataset consists of sentence pairs which are annotated with 2 labels: entailment and non-entailment. They contain respectively (200, 800), (800, 800), and (800, 800) (train, test) pairs. Negative examples were collected from the same source by selecting consecutive sentences and random sentences.
• SNLI: a recent RTE dataset consisting of 560K training sentence pairs annotated with three labels: entailment, neutral, and contradiction. We discarded the contradiction pairs as they do not necessarily represent dissimilar sentences and therefore act as random noise w.r.t. our similarity measure evaluation.
• MSRP: the Microsoft Research Paraphrase corpus, consisting of 5,800 sentence pairs annotated with a binary label indicating whether the two sentences are paraphrases or not.
• Semeval-16-3B: a dataset of question-question similarity collected from StackOverflow (Nakov et al., 2016). The dataset contains 3,169 training pairs and 700 test pairs. Three labels are considered: "Perfect Match", "Relevant" or "Irrelevant". We combined the first two into the same positive category for our evaluation.
• Semeval-14-1: a corpus of Sentences Involving Compositional Knowledge (Marelli et al., 2014) consisting of 10,000 English sentence pairs annotated with both similarity scores and relevance labels.
Features. After a preprocessing step where we removed stopwords, we computed the similarity values using 7 different types of sequences constructed, respectively, with the following value from each token:
• Word (plain text value)
• Lemma
• Part-Of-Speech (POS) tag
• WordNet Synset OR Lemma
• WordNet Synset OR Lemma for Nouns
• WordNet Synset OR Lemma for Verbs
• WordNet Synset OR Lemma for Nouns and Verbs.
In the last 4 types of sequences the lemma is used when there is no corresponding WordNet synset. In a first experiment we compare different aggregation functions on top of XF (minimum, maximum and average) in table 1. We used the LibLinear SVM classifier for this task.
In the second part of the evaluation, we use neural networks to compare the efficiency of XF_c, XF_t and other similarity measures within the same setting. We use the neural net described in figure 5 for the trained version XF_t and the equivalent architecture presented in figure 6 for all other similarity measures. For canonical XF_c we use by default the best aggregation for the task as observed in table 3.
(Sankoff and Kruskal, 1983), and Longest Common Subsequence (LCS) (Friedman and Sideli, 1992).
Implementation. XF was implemented in Java as an extension of the Simmetrics package (https://github.com/Simmetrics/simmetrics), and is made available at https://github.com/ymrabet/TextFlow. The neural networks were implemented in Python with TensorFlow (https://www.tensorflow.org/). We also share the training sets used for both parameter training and evaluation. The evaluation was performed on a 4-core laptop with 32GB of RAM. The initial parameters for XF_t were chosen with a random function.
Evaluation Measures. We use the standard accuracy values and F1, precision and recall for the positive class (i.e., entailment, paraphrase, and ranking relevance). We also study the relative rank in performance of each similarity measure across all datasets using the average rank, the rank variance and a new evaluation measure called Consistent peRformancE (CORE), computed as follows for a system m, a set of datasets D, a set of systems S, and an evaluation measure E ∈ {F1, Precision, Recall, Accuracy}:
$$CORE_D(m) = \frac{\min_{p \in S} \big( Avg_{d \in D}(R_S(E_d(p))) + V_{d \in D}(R_S(E_d(p))) \big)}{Avg_{d \in D}(R_S(E_d(m))) + V_{d \in D}(R_S(E_d(m)))}$$
With R_S(E_d(m)) the rank of m according to the evaluation measure E on dataset d w.r.t. competing systems S. Avg_{d∈D}(R_S(E_d(m))) is the average rank of m over datasets and V_{d∈D}(R_S(E_d(m))) is the rank variance of m over datasets. The results in tables 2, 3, and 4 demonstrate the intuition. Basically, CORE tells us how consistent a system/method is in having high performance, relative to the set of competing systems S. The maximum value of CORE is 1 for the best performing system according to its rank.
It also allows quantifying how consistently a system achieves high performance for the remaining systems.
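A minimal sketch of a CORE-style score is given below. It assumes that the score of a system is the smallest (average rank + rank variance) observed among all systems divided by the same quantity for the system at hand, that ties share the best rank, and that the population variance is used; the function names and the toy scores are hypothetical and are not results from this evaluation.

```python
from statistics import mean, pvariance


def ranks_on_dataset(scores):
    """Rank systems on one dataset (1 = best); ties share the best rank."""
    return {s: 1 + sum(other > v for other in scores.values())
            for s, v in scores.items()}


def core(results, system):
    """CORE-style consistency score of `system`.

    results: dict mapping dataset name -> dict mapping system name -> score.
    """
    per_system = {}
    for scores in results.values():
        for s, r in ranks_on_dataset(scores).items():
            per_system.setdefault(s, []).append(r)
    # Assumed cost: average rank plus (population) rank variance.
    cost = {s: mean(r) + pvariance(r) for s, r in per_system.items()}
    return min(cost.values()) / cost[system]


# Toy scores for three hypothetical systems on two hypothetical datasets.
toy = {
    "d1": {"A": 0.80, "B": 0.70, "C": 0.60},
    "d2": {"A": 0.90, "B": 0.50, "C": 0.70},
}
core_a, core_b = core(toy, "A"), core(toy, "B")
```

In this toy setting system A is ranked first on both datasets and obtains a score of 1, while B (ranks 2 and 3) is penalized both for its lower average rank and for its rank variance.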
TextFlow outperformed the results obtained with a combination of word order similarity and semantic similarities tested in (Achananuparp et al., 2008), with gaps of +1.0 in F1 and +6.1 in accuracy on MSRP, and +4.2 in F1 and +2.7% in accuracy on RTE 3.

Canonical Text Flow
XF_c had the best average and micro-average accuracy on the 8 classification datasets, with a gap of +0.4 to +6.3 when compared to the state-of-the-art measures. It also reached the best precision average with a gap of +1.8 to +6.3. On the F1 score level XF_c achieved the second best performance with a gap of -1.7, mainly caused by its underperformance in recall, where it had the third best performance with a gap of -6.3.
Table 5: Recall values. The best result is highlighted, the second best is underlined.
accuracy, F1 and precision, and the second best for recall.

Trained Text Flow
When compared to state-of-the-art measures and to canonical XF, the trained version, XF_t, obtained the best accuracy with a gap ranging from +1.4 to +7.8. XF_t also obtained the second best F1 average with a -1.0 gap, but with clear inconsistencies across datasets. XF_t obtained the best precision with a gap ranging from +0.8 to +7.1. XF_t did not perform well on recall, with a 64.5% micro-average compared to 72% for WordOverlap. Both its recall and F1 performance can be explained by the fact that the measure was trained to optimize accuracy, and not the F1 score for the positive class, which also suggests that the approach could be adapted to F1 optimization if needed.

Synthesis
Canonical XF was more consistent than trained XF on all dimensions except accuracy, for which XF_t was optimized. We argue that this consistency was made possible through the asymmetry of XF, which allowed it to adapt to different kinds of similarities (i.e., equivalence/paraphrase, inference/entailment, and mutual distance/ranking). These results also show that the actual position difference is a relevant factor for text similarity. We explain it mainly by the natural flow of language, where the important entities and relations are often expressed first, in contrast with a purely logic-driven approach which has to consider, for instance, that active forms and passive forms are equivalent in meaning. The difference in positions is also not read literally, both because of the higher impact associated with missed words and because of the α parameter, which allows tuning their impact in the trained version.

Additional Experiments
In additional experiments, we compared XF_c and XF_t with the other similarity measures when applied to bi-grams and tri-grams instead of individual tokens. The results were significantly lower on all datasets (between 3 and 10 points loss in accuracy) for both the state-of-the-art measures and the TextFlow variants. This result could be explained by the fact that n-grams are too rigid when a sub-sequence varies even slightly, e.g., the insertion of a new word inside a three-word sequence leads to a tri-gram mismatch and reduces bi-gram overlap from 100% to 50% for the considered sub-sequence. This issue is not encountered with TextFlow as it relies on the token level, and such an insertion will not cancel or significantly reduce the gains from the correct ordering of the words. It must be noted here that not all languages grant the same level of importance to sequences and that additional multilingual tests have to be carried out.
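The rigidity of n-grams in the insertion example can be checked with a short sketch. The helper names and the example sentence are illustrative assumptions; the overlap is measured as the fraction of n-grams of the original sequence that are still found after the insertion.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_overlap(source, target, n):
    """Fraction of the n-grams of `source` that also occur in `target`."""
    src = ngrams(source, n)
    if not src:
        return 0.0
    tgt = set(ngrams(target, n))
    return sum(g in tgt for g in src) / len(src)


original = ["men", "ride", "bicycles"]
edited = ["men", "ride", "red", "bicycles"]  # one word inserted

bigram_before = ngram_overlap(original, original, 2)
bigram_after = ngram_overlap(original, edited, 2)
trigram_after = ngram_overlap(original, edited, 3)
```

Inserting a single word keeps all three original tokens in the same relative order, yet the bi-gram overlap drops from 1.0 to 0.5 and the tri-gram overlap drops to 0.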
In addition to binary classification output such as textual entailment and paraphrase recognition, text similarity measures can be evaluated more precisely when we consider the correlation of their values for ranking purposes.
We conducted ranking correlation experiments on three test datasets provided at the semantic text similarity task at Semeval 2012, with gold score values for their text pairs. The datasets have 750 sentence pairs each, and are extracted from the Microsoft Research video descriptions corpus, MSRP and SMTeuroparl. When compared to the traditional similarity measures, TextFlow had the best correlation on the first two datasets with, for instance, 0.54 and 0.46 Pearson correlation on the lemma sequences. It had the second best correlation on the MSRP extract, where the Cosine similarity had the best performance with 0.71 vs 0.68. Note that the Cosine similarity uses word frequencies, whereas the evaluated version of TextFlow did not use word-level weights.
Including word weights is one of the promising perspectives in line with this work, as it could be done simply by making the deltas vary according to the weight/importance of the (un)matched word. Also, in such a setting, the impact of a sequence of N words will naturally increase or decrease according to the word weights (cf. figure 3). We conducted a preliminary test using the inverse document frequency of the words as extracted from Wikipedia with Gensim, which led to an improvement of up to 2% for most datasets, with performance decreasing slightly on two of them. Other kinds of weights could also be included just as easily, such as contextual word relatedness using embeddings or other semantic relatedness factors such as WordNet distances (Pedersen et al., 2004).
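For reference, the Pearson correlation between the values of a similarity measure and the gold scores can be computed as in the following minimal sketch. The function name and the toy values are hypothetical and are not taken from the Semeval 2012 data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


# Toy values: similarity scores in [0, 1] against gold scores on a 0-5 scale.
similarities = [0.2, 0.5, 0.9]
gold_scores = [1.0, 2.5, 4.5]
correlation = pearson(similarities, gold_scores)
```

Because the correlation is invariant to linear rescaling, similarity values in [0,1] can be compared directly with gold scores expressed on a different scale.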

Conclusion
We presented a novel standalone similarity measure that takes into account continuous word sequences. An evaluation on eight datasets shows promising results for textual entailment recognition, paraphrase detection and ranking. Among the potential extensions of this work are the inclusion of different kinds of weights such as TF-IDF, embedding relatedness and semantic relatedness. We also intend to test other variants around the same concept, including considering the matched words and sequences to have a negative weight to further balance the weight of missing words.