Sentence Alignment using Unfolding Recursive Autoencoders

In this paper, we propose a novel two-step algorithm for sentence alignment in monolingual corpora using unfolding recursive autoencoders. First, we use unfolding recursive autoencoders (RAEs) to learn feature vectors for the phrases in the syntactic tree of a sentence. To compare two sentences, we build a similarity matrix whose dimensions are proportional to the sizes of the two sentences. Since this matrix varies in dimension with sentence length, a dynamic pooling layer is used to map it to a matrix of fixed dimension. The resulting matrix is used to calculate the similarity score between the two sentences. The second step of the algorithm captures the contexts in which the sentences occur in their documents by using a dynamic programming algorithm for global alignment.


Introduction
Neural-network-based architectures are increasingly being used to capture the semantics of natural language (Pennington et al., 2014). We put them to use for aligning the sentences of monolingual corpora. Sentence alignment can be formally defined as a mapping of sentences from one document to another such that a sentence pair belongs to the mapping iff both sentences convey the same semantics in their respective texts. The mapping can be many-to-many, as a sentence in one document could be split into multiple sentences in the other to convey the same information. Note that this task differs from paraphrase identification: we are not just considering the similarity between two individual sentences, but also the context, in the sense that we make use of the order in which the sentences occur in the documents.
Text alignment in machine translation (MT) tasks differs considerably from sentence alignment in monolingual corpora, as MT tasks deal with bilingual corpora that exhibit a very strong level of alignment. Two comparable documents in a monolingual corpus, such as two articles written about a common entity or two newspaper reports about an event, use widely divergent forms to express the same information content. They may contain paraphrases, alternate wording, changes of sentence and paragraph order, etc. As a result, surface-based techniques that rely on comparing sentence lengths, sentence ordering, etc. are less likely to be useful for monolingual sentence alignment, in contrast to their effectiveness on bilingual corpora.
Sentence alignment finds its use in applications such as plagiarism detection (Clough et al., 2002), information retrieval and question answering (Marsi and Krahmer, 2005). It can also be used to generate training set data for tasks such as text summarization.

Related Work
A lot of work on the problem of sentence alignment relies on surface properties of the text, such as word overlap (Hatzivassiloglou et al., 2001; Barzilay and Elhadad, 2003) and the bag-of-words model (Nelken and Shieber, 2006), and falls mainly in the field of statistical machine learning (Barzilay and Elhadad, 2003). Little has been done to improve upon this task by capturing the semantics of the text. Barzilay and Elhadad show that a similarity measure combined with contextual information outperforms methods based on sentence similarity functions alone. Nelken and Shieber improved upon the sentence similarity function by borrowing TF-IDF based scoring from the information retrieval literature and outperformed all other methods.
Their work can be summarized in 4 steps:
1. TF-IDF: Treat each sentence as a document and compute its TF-IDF vector. For a word t in sentence s, TF_s(t) denotes the number of times t occurs in s, N is the number of sentences in the document, and DF(t) indicates the number of sentences of the document in which t occurs. The weight is
w_s(t) = TF_s(t) * log(N / DF(t)),
where w_s(t) denotes the value of the dimension corresponding to word t in the TF-IDF vector of sentence s.
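This per-sentence TF-IDF scoring can be sketched as follows (a minimal implementation; the tokenized input format and the function name are our own):

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Treat each sentence as a document and compute its TF-IDF vector.

    `sentences` is a list of token lists. Returns one {word: weight} dict
    per sentence, with weight w_s(t) = TF_s(t) * log(N / DF(t)), where N
    is the number of sentences and DF(t) the number of sentences
    containing t.
    """
    n = len(sentences)
    df = Counter()
    for s in sentences:
        df.update(set(s))          # each sentence counts once per word
    vectors = []
    for s in sentences:
        tf = Counter(s)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors
```

Words occurring in every sentence receive weight 0, so only discriminative words contribute to the similarity.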
2. The previous step gave the similarity measure of 2 sentences. It was converted to an appropriate probability measure denoting the P r(align(s i , s j ) = 1) by using logistic regression on the training data.
3. Heuristic Alignment : They simply choose sentence pairs between two documents with pr(align) > th,where th is the threshold. Additionally heuristics such as mapping the first sentences of two documents (as justified by Quirk et al. (Dolan et al., 2004) ) and allowing 2-to-1 mapping of adjacent sentences are followed.
4. Global Alignment with Dynamic Programming: They compute the optimal alignment between sentences 1..i of one text and sentences 1..j of the elementary version by using a dynamic programming approach similar to Needleman and Wunsch (1970).
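The global alignment step can be sketched as follows (a minimal Needleman-Wunsch-style recurrence, assuming the pairwise scores have already been computed; the `skip` score for leaving a sentence unaligned is a free parameter here):

```python
def global_alignment_score(sim, skip=0.0):
    """Needleman-Wunsch-style dynamic programming over sentence scores.

    sim[i][j] is the precomputed alignment score of sentence i of one
    text with sentence j of the other; `skip` is the score for leaving
    a sentence unmatched. Returns the optimal global alignment score.
    """
    n, m = len(sim), len(sim[0])
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = M[i - 1][0] + skip
    for j in range(1, m + 1):
        M[0][j] = M[0][j - 1] + skip
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = max(M[i - 1][j - 1] + sim[i - 1][j - 1],  # match
                          M[i - 1][j] + skip,                   # skip in text 1
                          M[i][j - 1] + skip)                   # skip in text 2
    return M[n][m]
```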

Approach
In this section, we briefly visit the neural network models and other techniques that would be used in our task.

Neural Embeddings
The idea of using neural embeddings is to obtain n-dimensional vector-space representations for the words in a vocabulary V. We define a mapping which embeds words into a semantic vector space where the metric approximates semantic similarity. The idea of neural embeddings was first introduced by Bengio et al. (2003) and later worked upon by Turian et al. (2010). Mikolov et al. (2013) point out that words with similar meaning are mapped closer together in this feature space, and that directions in the vector space correspond to different semantic concepts. Turian et al. (2010) gave us an encoding from a given word to a vector in the semantic space. Now, we want an embedding from a whole sentence to a vector in the same semantic space. To get such a mapping, we use autoencoders recursively on the parse tree representation of the sentence. Each node in the parse tree then carries a vector of dimension n corresponding to that word or phrase of the sentence.
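The recursive composition at each internal parse-tree node can be sketched as a single encoder step (illustrative only: W and b stand for the trained RAE parameters, and the full unfolding RAE is additionally trained with a reconstruction objective not shown here):

```python
import numpy as np

def encode(c1, c2, W, b):
    """One RAE composition step: combine the n-dimensional vectors of
    two children into an n-dimensional parent p = tanh(W [c1; c2] + b).

    W has shape (n, 2n) and b shape (n,); applied bottom-up over the
    parse tree, this yields a vector for every phrase node and finally
    for the whole sentence.
    """
    return np.tanh(W @ np.concatenate([c1, c2]) + b)
```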

Unfolding Recursive Autoencoders with Dynamic Pooling
Socher et al. (2011) first used unfolding recursive autoencoders with dynamic pooling for the purpose of paraphrase identification. We use their method in our paper for sentence alignment. We learn the embeddings of all the phrases in the parse tree of a sentence using the unfolding RAE. For a given sentence with N words, we have a total of 2N − 1 nodes in the parse tree of the sentence: N for the words and N − 1 for the internal nodes, or phrases, as determined by the parsing of the sentence. For computing the similarity matrix for two sentences, the rows and columns first list the words in their original sentence order. We then add to the rows and columns the nonterminal nodes of the parse tree in a depth-first, right-to-left order.
For a sentence with N words, with word embeddings x_1:N and RAE encodings y_1:(N−1) for its phrases, we form the sequence (x_1, ..., x_N, y_1, ..., y_(N−1)). For two sentences (s_1, s_2), the similarity matrix S contains the Euclidean distance between every pair (s_1)_i and (s_2)_j.
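The construction of S can be sketched as follows (assuming the word and phrase vectors of each sentence have already been stacked row-wise in the order just described):

```python
import numpy as np

def similarity_matrix(nodes1, nodes2):
    """Pairwise Euclidean distances between all word and phrase vectors
    of two sentences.

    nodes1 has shape (2n-1, d) and nodes2 shape (2m-1, d): the word
    embeddings followed by the RAE phrase vectors. Returns the
    (2n-1) x (2m-1) similarity matrix S.
    """
    # Broadcast to all pairs, then reduce over the embedding dimension.
    diff = nodes1[:, None, :] - nodes2[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))
```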
For sentence s 1 of size n and sentence s 2 of size m, the matrix has dimension (2n − 1) × (2m − 1). Since the resulting similarity matrix has dimension which depends on the lengths of the given sentences, we would use dynamic pooling to convert it into a matrix of fixed dimension.
We use dynamic min-pooling to convert the variable-sized matrix into a matrix of size n_p × n_p. As Socher et al. (2011) reported, the best-suited value for n_p is 15. For dynamic pooling, we divide each dimension of the 2D matrix into n_p chunks of size floor(len/n_p), where len is the length of that dimension. If len is less than n_p, we duplicate the matrix entries along that dimension until len becomes greater than or equal to n_p. If there are l leftover entries, where l = len − n_p * floor(len/n_p), we distribute them to the last l chunks. We do this for both dimensions.
We use min-pooling because the closer two phrases are semantically, the smaller the Euclidean distance between them. Min-pooling captures this relationship by retaining, for each window, the distance of the closest pair of phrases within it.
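One way to realize the chunking described above is sketched below (the handling of duplication and leftover entries follows our reading of the text):

```python
import numpy as np

def dynamic_min_pool(S, n_p=15):
    """Dynamic min-pooling: map a variable-size similarity matrix S to
    a fixed n_p x n_p matrix.

    Each axis shorter than n_p is first lengthened by duplicating its
    entries; each axis is then split into n_p chunks of size
    floor(len/n_p), the l = len - n_p*floor(len/n_p) leftover entries
    going to the last l chunks, and the minimum is taken over each
    chunk-by-chunk window.
    """
    for axis in (0, 1):
        while S.shape[axis] < n_p:       # duplicate until long enough
            S = np.concatenate([S, S], axis=axis)

    def chunks(length):
        base, left = divmod(length, n_p)
        sizes = [base] * (n_p - left) + [base + 1] * left
        idx, out = 0, []
        for sz in sizes:
            out.append(list(range(idx, idx + sz)))
            idx += sz
        return out

    rows, cols = chunks(S.shape[0]), chunks(S.shape[1])
    pooled = np.empty((n_p, n_p))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            pooled[i, j] = S[np.ix_(r, c)].min()
    return pooled
```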

Alignment using similarity scores
The fixed-dimension matrix obtained in the previous step is fed to a softmax classifier to get a confidence score for the similarity between two sentences. We then use a dynamic programming algorithm to find the optimal alignment of sentences between the documents. This approach relies on the comparability of the documents and on the (albeit weak) linearity of sentence ordering in the two documents. We find the optimal alignment score between the two documents and then backtrack through the alignment matrix M to find the sentences that were aligned. Here, M(i, j) denotes the maximum alignment score between sentences 1..i of one document and sentences 1..j of the other, and sim(i, j) denotes the confidence score given by the softmax classifier for the similarity between sentences i and j of the two documents. The offdiag constant is used to skip a match between two sentences if the similarity between them is very low. The value of the offdiag constant was chosen to be 0.1 for our experiment.
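The alignment step can be sketched as follows. The exact recurrence is not spelled out in the text; in this reading, skip moves score offdiag, so a pair is matched only when its similarity beats skipping, which is what effectively suppresses very low-similarity matches:

```python
def align_sentences(sim, offdiag=0.1):
    """Global sentence alignment by dynamic programming, with
    backtracking to recover the aligned pairs.

    sim[i][j] is the softmax confidence that sentence i of one document
    and sentence j of the other are similar. Returns (score, pairs).
    """
    n, m = len(sim), len(sim[0])
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = max(M[i - 1][j - 1] + sim[i - 1][j - 1],  # match
                          M[i - 1][j] + offdiag,                # skip i
                          M[i][j - 1] + offdiag)                # skip j
    # Backtrack through M to recover which sentences were matched.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if M[i][j] == M[i - 1][j - 1] + sim[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif M[i][j] == M[i - 1][j] + offdiag:
            i -= 1
        else:
            j -= 1
    return M[n][m], list(reversed(pairs))
```

With offdiag = 0.1, a pair whose confidence falls below the combined skip scores along an alternative path is left unmatched.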

Experiment
We list below the detailed steps of our experiment.

Unfolding RAE's training
We used a pre-trained RAE model provided by Socher et al. (2011), trained on a subset of 150,000 sentences from the NYT and AP sections of the Gigaword corpus. They used the Stanford parser (De Marneffe et al., 2006) to create the parse trees for all sentences. The word vectors were 100-dimensional, computed via the unsupervised method of Collobert and Weston (2008) and provided by Turian et al. (2010). The RAE used had two encoding layers, with a hidden layer of 200 units.

Softmax Classifier
To train the softmax classifier that produces the similarity score between two sentences, we used a dataset for the related task of paraphrase identification, as the two tasks coincide when only individual sentences, irrespective of their context, are considered. The Microsoft Research Paraphrase Corpus (MSRPC) consists of 5,801 pairs of sentences extracted from news sources on the web, along with human annotations indicating whether each pair captures a paraphrase/semantic equivalence relationship. All sentences were labeled by two annotators, who agreed in 83% of the cases, and a third annotator resolved the conflicts. A total of 3,900 sentence pairs are labeled as paraphrases. We used a standard 70-30 split for training and testing.
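The classifier itself can be sketched as a two-class softmax over the flattened pooled matrix (W and b stand in for the parameters that would be trained on MSRPC; here they are illustrative placeholders):

```python
import numpy as np

def softmax_confidence(S_pooled, W, b):
    """Similarity confidence from the pooled matrix.

    The flattened n_p x n_p matrix is fed to a two-class softmax layer;
    W has shape (2, n_p**2) and b shape (2,). Returns the probability
    of the "similar" class.
    """
    z = W @ S_pooled.ravel() + b
    z = z - z.max()                      # numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return p[1]                          # class 1 = "similar"
```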

Dataset
For testing our algorithm, we took articles from the Literacynet archives. It maintains a collection of stories from CNN and CBF5, intended for promoting literacy. Each story in the archive has an abridged or shorter version. We took 5 such pairs of stories and their abridged versions, leading to 2,033 sentence pairs that could potentially be aligned, and manually annotated the dataset to obtain the ground truth. The alignment diversity measure (ADM) for two texts T_1, T_2 is defined to be 2 * matches / (|T_1| + |T_2|), where matches denotes the actual number of aligned sentence pairs between the two documents. Intuitively, for closely aligned document pairs, as prevalent in bilingual alignment or MT tasks, one would expect an ADM value close to 1. The average ADM in our dataset is 0.61.
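The ADM computation is straightforward (a sketch following Nelken and Shieber's definition of the measure):

```python
def adm(matches, len1, len2):
    """Alignment diversity measure: 2 * matches / (|T1| + |T2|).

    matches is the number of aligned sentence pairs; len1 and len2 are
    the sentence counts of the two texts. Fully 1-to-1 aligned texts of
    equal length yield 1.0; unrelated texts yield 0.0.
    """
    return 2.0 * matches / (len1 + len2)
```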

Algorithm
1. Given two texts T_1, T_2, we split each into its sentences. For all sentences s_i in T_1 and all sentences s_j in T_2, we generate the embedding vectors for all the words and phrases in the sentences using the unfolding RAE.
2. The similarity matrix S is generated for s_i and s_j by taking the Euclidean distance between all possible words and phrases of the two sentences, as mentioned earlier.
3. Each similarity matrix is converted to a fixed-size matrix S_pooled by dynamic min-pooling and is fed to the softmax classifier, which assigns a confidence score for the two sentences being similar. We then have a matrix P for all sentence pairs in T_1 and T_2 such that P_i,j represents a measure of similarity between s_i in T_1 and s_j in T_2.
4. Let M_i,j denote the maximum similarity score obtained by aligning the sentences s_1:i of T_1 with the sentences s_1:j of T_2. We then use a dynamic programming algorithm to maximize this score. We also store the choices made at each step of the dynamic programming algorithm and backtrack to find the optimum sentence alignment.
5. Additionally, we can use heuristics such as allowing multiple sentences in the vicinity of a given sentence to map to the corresponding sentence in the other document, so as to cover cases where one sentence is split into multiple sentences or vice versa. But such cases occur rarely, and this step can safely be omitted.

Results
To evaluate our results, we also implemented Nelken and Shieber's (2006) approach, to compare their results with ours and get a better idea of our method's performance. We chose their approach because they have shown that it outperforms all other methods. We tested our algorithm on the dataset and found that our approach yielded a precision of 78.84% at a recall of 67.21%, giving an F1-score of 0.7256. On the same dataset, Nelken and Shieber's approach gave 65.95% precision at a recall of 50.81%, and thus an F1-score of 0.5739. Thus, our approach clearly outperforms Nelken and Shieber's approach. It is to be noted that Nelken and Shieber report an F1-score of 0.6676 at a recall of 0.558, while our implementation of their approach achieved an F1-score of 0.5739 at a recall of 0.508. The difference may be due to the different types of dataset used in the two experiments. Nelken and Shieber had used the Britannica encyclopedia and its elementary version, containing information about cities. We have used news reports and their abridged versions, which use widely divergent language forms, such as abundant changes of tense, grammatical person, and writing style, which could not be captured by their TF-IDF based similarity. Fig. 1 shows one instance of alignment of a document pair by our approach vs. the gold alignment.
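The reported F1-scores follow from precision and recall as the harmonic mean:

```python
def f1(precision, recall):
    """F1-score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```

For example, precision 0.7884 at recall 0.6721 gives an F1-score of about 0.7256, matching the figure above.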

Conclusion
We have presented a novel algorithm for aligning the sentences of monolingual corpora of comparable documents. We used a neural network model to arrive at a measure of similarity between sentences. The contextual information present in the document was leveraged by using a dynamic programming algorithm to align the sentences. Our algorithm performed better than the baseline implementation. It takes into account the semantics conveyed by the sentences rather than just relying on a bag-of-words model as the sentence similarity function.

Figure 1: Gold alignment vs. our approach on an example. The orange circles with a blue dot denote true positives, orange circles denote false positives, and blue dots denote false negatives.