Structured Alignment Networks for Matching Sentences

Many tasks in natural language processing involve comparing two sentences to compute some notion of relevance, entailment, or similarity. Typically this comparison is done either at the word level or at the sentence level, with no attempt to leverage the inherent structure of the sentence. When sentence structure is used for comparison, it is obtained during a non-differentiable pre-processing step, leading to propagation of errors. We introduce a model of structured alignments between sentences, showing how to compare two sentences by matching their latent structures. Using a structured attention mechanism, our model matches candidate spans in the first sentence to candidate spans in the second sentence, simultaneously discovering the tree structure of each sentence. Our model is fully differentiable and trained only on the matching objective. We evaluate this model on two tasks, natural entailment detection and answer sentence selection, and find that modeling latent tree structures results in superior performance. Analysis of the learned sentence structures shows they can reflect some syntactic phenomena.


Introduction
There are many tasks in natural language processing that require matching two sentences: natural language inference (Bowman et al., 2015; Nangia et al., 2017) and paraphrase detection (Wang et al., 2017b) are classification tasks over sentence pairs, and question answering often requires an alignment between a question and a passage of text that may contain the answer (Tan et al., 2016a; Rajpurkar et al., 2016; Joshi et al., 2017).
Most neural models for these tasks perform comparisons between the two sentences either at the word level (Parikh et al., 2016) or at the sentence level (Bowman et al., 2015). Word-level comparisons ignore the inherent structure of the sentences being compared, at best relying on a recurrent neural network such as an LSTM (Hochreiter and Schmidhuber, 1997) to incorporate some amount of context from neighboring words into each word's representation. Sentence-level comparisons can incorporate the structure of each sentence individually (Bowman et al., 2016; Tai et al., 2015), but cannot easily compare substructures between the sentences, as these are all squashed into a single vector. Some models do incorporate sentence structure by comparing subtrees between the two sentences (Zhao et al., 2016; Chen et al., 2017), but require pipelined approaches where a parser is run in a non-differentiable preprocessing step, losing the benefits of end-to-end training.
In this paper we propose a method, which we call structured alignment networks, that compares substructures in two sentences in a more interpretable way and without relying on an external, non-differentiable parser. We use a structured attention mechanism (Kim et al., 2017; Liu and Lapata, 2018) to compute a structured alignment between the two sentences, jointly learning a latent tree structure for each sentence and aligning spans between the two sentences.
Our method constructs a CKY chart for each sentence using the inside-outside algorithm (Manning et al., 1999), which is fully differentiable (Li and Eisner, 2009; Gormley et al., 2015). This chart has a node for each possible span in the sentence, and a score for the likelihood of that span being a constituent in a parse of the sentence, marginalized over all possible parses. We take these two charts and find alignments between them, representing each span in each sentence with structured attention over spans in the other sentence. These span representations, weighted by the span's likelihood, are then used to compare the two sentences. In this way, we can perform comparisons between sentences by leveraging their internal structure in an end-to-end, fully differentiable model, trained only on one final objective. Our model helps obtain more precise representations of the sentence pair, with the learned tree structures and the alignment between them, and provides better interpretability, which most neural models lack in sentence matching tasks.
We evaluate this model on two sentence comparison datasets: SNLI (Bowman et al., 2015) and TREC-QA (Voorhees and Tice, 2000). We find that comparing sentences at the span level consistently outperforms comparing at the word level. Additionally, the learned sentence structures represent well-formed trees that reflect some syntactic phenomena.

Word-level Comparison Baseline
We first describe a common word-level comparison model, called decomposable attention (Parikh et al., 2016). This model was first proposed for the natural language inference task, but similar mechanisms have been used in many other tasks, such as for aligning question and passage words in the bi-directional attention model for question answering (Seo et al., 2017). This model serves as our main point of comparison, as our latent tree matching model simply replaces the word-level comparisons in the decomposable attention model with span comparisons.
The decomposable attention model consists of three steps: attend, compare, and aggregate. As input, the model takes two sentences a and b represented by sequences of word embeddings $[a_1, \dots, a_m]$ and $[b_1, \dots, b_n]$. In the attend step, the model computes attention scores for each pair of words across the two input sentences and normalizes them as a soft alignment from a to b (and vice versa):
$$e_{ij} = F_1(a_i)^\top F_1(b_j)$$
$$B_i = \sum_{j=1}^{n} \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})} \, b_j$$
$$A_j = \sum_{i=1}^{m} \frac{\exp(e_{ij})}{\sum_{k=1}^{m} \exp(e_{kj})} \, a_i$$
where $F_1$ is a feed-forward neural network, $B_i$ is the weighted summation of the words in b that are softly aligned to word $a_i$, and vice versa for $A_j$.
In the compare step, the input vectors $a_i$ and $b_j$ are concatenated with their corresponding attended vectors $B_i$ and $A_j$, and fed into a feed-forward neural network $F_2$, giving a comparison between each word and the words it aligns to in the other sentence:
$$v_{a_i} = F_2([a_i; B_i]), \qquad v_{b_j} = F_2([b_j; A_j])$$
The aggregate step is a simple summation of $v_{a_i}$ and $v_{b_j}$ over the words in sentences a and b. The two resulting fixed-length vectors are concatenated and fed into a linear layer with $W_y$ as the weight matrix, followed by a softmax layer for predicting the distribution y:
$$v_a = \sum_{i=1}^{m} v_{a_i}, \qquad v_b = \sum_{j=1}^{n} v_{b_j}$$
$$y = \mathrm{softmax}(W_y [v_a; v_b])$$
The decomposable attention model completely ignores the order and context of words in the sequence. There have been some efforts to strengthen it with a recurrent neural network (Liu and Lapata, 2018) or intra-sentence attention (Parikh et al., 2016). However, these models amount to simply changing the input vectors a and b, and still only perform a word-level alignment between the two sentences.
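The attend, compare, and aggregate steps can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the helper names (`softmax`, `dot`, `mix`, `decomposable_attention`) are hypothetical, and the learned networks F 1 and F 2 and the matrix W y are replaced by toy stand-ins (an identity map, a ReLU, and fixed weights) so that the example runs without any training.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def mix(weights, vectors):
    """Weighted sum of equal-length vectors."""
    return [sum(w * v[d] for w, v in zip(weights, vectors))
            for d in range(len(vectors[0]))]

def decomposable_attention(a, b, F1, F2, W_y):
    # Attend: score every word pair, then soft-align in both directions.
    fa, fb = [F1(x) for x in a], [F1(x) for x in b]
    e = [[dot(u, v) for v in fb] for u in fa]
    B = [mix(softmax(row), b) for row in e]        # words of b aligned to each a_i
    A = [mix(softmax([row[j] for row in e]), a)    # words of a aligned to each b_j
         for j in range(len(b))]
    # Compare: F2 over each word concatenated with its attended vector.
    v_a = [F2(a_i + B_i) for a_i, B_i in zip(a, B)]
    v_b = [F2(b_j + A_j) for b_j, A_j in zip(b, A)]
    # Aggregate: sum over words, concatenate, linear layer, softmax.
    features = [sum(col) for col in zip(*v_a)] + [sum(col) for col in zip(*v_b)]
    return softmax([dot(row, features) for row in W_y])

# Toy usage; identity, relu, and W_y are assumptions standing in for learned parts.
identity = lambda v: list(v)
relu = lambda v: [max(0.0, x) for x in v]
a = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
W_y = [[0.1] * 8, [0.0] * 8, [-0.1] * 8]   # 3 classes, 8 = 2 * (2 + 2) features
print(decomposable_attention(a, b, identity, relu, W_y))
```

Because every operation acts on a word pair or on a sum over words, permuting the words of either sentence leaves the prediction unchanged, which is the order-insensitivity discussed above.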

Structured Alignment Networks
Language is inherently tree structured, and the meaning of sentences comes largely from composing the meanings of subtrees (Chomsky, 2002). It is natural, then, to compare the meaning of two sentences by comparing their substructures (MacCartney and Manning, 2009). For example, when determining the relationship between the two sentences in Figure 1, the ideal units of comparison are spans determined by subtrees: "is in Seattle" compared to "based in Washington state".
The challenge with comparing spans drawn from subtrees is that the tree structure of the sentence is latent and must be inferred, either during pre-processing or in the model itself. In this section we present a model that operates on the latent tree structure of each sentence, comparing all possible spans in one sentence with all possible spans in the second sentence, weighted by how likely each span is to appear as a constituent in a parse of the sentence. We use the non-terminal nodes of a binary constituency parse to represent spans. Because of this choice of representation, we can use the nodes in a CKY parsing chart to efficiently marginalize span likelihood over all possible parses for each sentence, and compare nodes in each sentence's chart.
A: the headquarter of BOEING is in Seattle
B: Boeing is a company based in Washington state
Figure 1: Example span alignments of a sentence pair, where different colors indicate matching spans. Note that some spans overlap, which cannot happen in a single tree; our model considers all possible span comparisons, weighted by the spans' marginal likelihood.

Learning Latent Constituency Trees
A constituency parser can be partially formalized as a graphical model with the following cliques (Klein and Manning, 2004): the latent variables $c_{ikj} \in \{0, 1\}$ for all $i < j$, indicating whether the span from the i-th token to the j-th token ($\mathrm{span}_{ij}$) is a constituency node built from the merging of sub-nodes $\mathrm{span}_{ik}$ and $\mathrm{span}_{(k+1)j}$. Given a sentence $x = [x_1, \dots, x_n]$, the probability of a tree z is obtained by normalizing the score of z over Z, the set of all possible constituency trees for x. The parameters for the graph-based CRF constituency parser are $\delta_{ikj}$, reflecting the score of $\mathrm{span}_{ij}$ forming a binary constituency node with k as the splitting point. It is possible to calculate the marginal probability of each constituency node $p(c_{ikj} = 1 \mid x)$ using the inside-outside algorithm (Klein and Manning, 2003). Although the inside-outside algorithm is constrained to generate a binary tree, this is not a severe limitation, as most structures can be easily binarized (Finkel et al., 2008).
In a typical constituency parser, the score $\delta_{ikj}$ is parameterized according to the production rules of a grammar, e.g., with normalized categorical distributions for each non-terminal. Our unlabeled grammar effectively has only a single production rule, so we instead parameterize these scores as bilinear functions operating on the representations of the two subtrees being merged. For the inside pass, as illustrated in Figure 2a, the inside score $\alpha_{ij}$ for the span from position i to j is marginalized over the splitting points k:
$$\delta_{ikj} = sp_{ik}^\top W \, sp_{(k+1)j}$$
$$\alpha_{ij} = \sum_{i \le k < j} \delta_{ikj} \, \alpha_{ik} \, \alpha_{(k+1)j}$$
where $sp_{ij} \in \mathbb{R}^d$ is the representation for the span, and $W \in \mathbb{R}^{d \times d}$ is the weight matrix. This process is calculated recursively from bottom to root, generating the score for each possible constituent. For the outside pass, the outside score $\beta_{ij}$ is:
$$\beta_{ij} = \sum_{k=1}^{i-1} \delta_{k(i-1)j} \, \alpha_{k(i-1)} \, \beta_{kj} + \sum_{k=j+1}^{n} \delta_{ijk} \, \alpha_{(j+1)k} \, \beta_{ik}$$
where the first term is the score for $\mathrm{span}_{ij}$ being the right child of a non-terminal node and the second term is the score for $\mathrm{span}_{ij}$ being the left child. In Figure 2b, we illustrate the outside process with the target span $\mathrm{span}_{ij}$ being the right child of a non-terminal node. This process is calculated recursively from root to bottom. The normalized marginal probability $\rho_{ij}$ for each span $\mathrm{span}_{ij}$, where $1 \le i < n$ and $i < j \le n$, can be calculated by:
$$\rho_{ij} = \frac{\alpha_{ij} \, \beta_{ij}}{\alpha_{1n}}$$
Figure 2b: Part of the process for calculating the outside score $\beta_{ij}$, with target span $\mathrm{span}_{ij}$ as the right child of a non-terminal. The blue space indicates $\beta_{kj}$ and the two yellow spaces indicate $\alpha_{k(i-1)}$ and $\delta_{k(i-1)j}$.
To compute the representations of all possible spans, we use Long Short-Term Memory networks (LSTMs; Hochreiter and Schmidhuber, 1997) with max-pooling and minus features (Cross and Huang, 2016; Liu and Lapata, 2017). We represent each sentence as a sequence of word embeddings $[w_{sos}, w_1, \dots, w_t, \dots, w_n, w_{eos}]$ and run a bidirectional LSTM to obtain the output vectors:
$$h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$$
where $h_t$ is the output vector for the t-th word, and $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ are the output vectors from the forward and backward directions, respectively. We represent a constituent from position i to j with a span vector $sp_{ij}$:
$$sp_{ij} = [\max(h_i, \dots, h_j); \; \overrightarrow{h}_j - \overrightarrow{h}_{i-1}; \; \overleftarrow{h}_i - \overleftarrow{h}_{j+1}]$$
where $\max(h_i, \dots, h_j)$ is the max-pooling operation over the sequence of output vectors within this constituent.
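As an illustration of a span representation built from max-pooling and minus features, the sketch below assembles one span vector from precomputed forward and backward LSTM outputs. The function name `span_vector`, the padding convention (index 0 for the start symbol, index n + 1 for the end symbol), and the order in which the three parts are concatenated are assumptions made for the example; the toy numbers stand in for real LSTM outputs.

```python
def span_vector(fwd, bwd, i, j):
    """Span vector for words i..j (1-based, inclusive).

    fwd[t] and bwd[t] are the forward and backward LSTM outputs at position t,
    with t = 0 for the start symbol and t = n + 1 for the end symbol.
    """
    # Concatenate the two directions to get one output vector per position.
    h = [f + b for f, b in zip(fwd, bwd)]
    # Element-wise max-pooling over the positions inside the span.
    pooled = [max(col) for col in zip(*h[i:j + 1])]
    # Minus features: what each direction accumulated while crossing the span.
    fwd_minus = [x - y for x, y in zip(fwd[j], fwd[i - 1])]
    bwd_minus = [x - y for x, y in zip(bwd[i], bwd[j + 1])]
    return pooled + fwd_minus + bwd_minus

# Toy 1-dimensional LSTM outputs for a 3-word sentence (plus start/end padding).
fwd = [[0.0], [1.0], [3.0], [6.0], [10.0]]
bwd = [[9.0], [7.0], [4.0], [2.0], [0.0]]
print(span_vector(fwd, bwd, 2, 3))
```

The start and end symbols matter here: without them the minus features for spans touching either sentence boundary would index outside the sequence.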
After applying the parsing process on two sentences, we obtain the marginal probabilities for all potential spans of the two constituency trees, which can then be used for aligning.
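The inside and outside passes over an unlabeled binary grammar can be sketched as follows. This is a plain-Python illustration rather than the authors' code: `inside_outside` is a hypothetical name, the merge score is passed in as a function `delta(i, k, j)` that takes the split position k explicitly (span i..k merged with span k+1..j), scores are assumed non-negative, and single-word spans are given an inside score of 1, which is an assumption about the base case.

```python
def inside_outside(n, delta):
    """Span marginals for a sentence of n words (1-based positions).

    delta(i, k, j) scores merging span (i, k) with span (k + 1, j).
    Returns a dict mapping each span (i, j) to its marginal probability.
    """
    # Inside pass, from single words up to the whole sentence.
    alpha = {(i, i): 1.0 for i in range(1, n + 1)}
    for width in range(1, n):
        for i in range(1, n - width + 1):
            j = i + width
            alpha[i, j] = sum(delta(i, k, j) * alpha[i, k] * alpha[k + 1, j]
                              for k in range(i, j))
    # Outside pass, from the whole sentence down to single words.
    beta = {(1, n): 1.0}
    for width in range(n - 2, -1, -1):
        for i in range(1, n - width + 1):
            j = i + width
            # (i, j) as a right child: parent (k, j), left sibling (k, i - 1).
            as_right = sum(delta(k, i - 1, j) * alpha[k, i - 1] * beta[k, j]
                           for k in range(1, i))
            # (i, j) as a left child: parent (i, k), right sibling (j + 1, k).
            as_left = sum(delta(i, j, k) * alpha[j + 1, k] * beta[i, k]
                          for k in range(j + 1, n + 1))
            beta[i, j] = as_right + as_left
    z = alpha[1, n]
    return {span: alpha[span] * beta[span] / z for span in alpha}

# With uniform scores, a 3-word sentence has two equally likely binary trees.
rho = inside_outside(3, lambda i, k, j: 1.0)
print(rho[1, 2], rho[2, 3], rho[1, 3])
```

A useful sanity check is that the marginals of all multi-word spans sum to n - 1, since every binary tree over n words has exactly n - 1 internal nodes.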

Learning Structured Alignments
After learning latent constituency trees for each sentence, we are able to perform span-level comparisons between the two sentences, instead of the word-level comparisons done by the decomposable attention model. The structure of these two comparison models is the same, but the basic elements of our structured alignment model are spans instead of words, and the marginal probabilities obtained from the inside-outside algorithm are used as a re-normalization value for incorporating structural information into the alignments.
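For sentence a, we have the representation $sp^a_{ij}$ for each $\mathrm{span}_{ij}$ and its marginal probability $\rho^a_{ij}$; for sentence b, we likewise have $sp^b_{kl}$ and $\rho^b_{kl}$. The attention scores are computed between all pairs of spans across the two sentences, and the attended vectors can be calculated as:
$$e_{ij,kl} = F_1(sp^a_{ij})^\top F_1(sp^b_{kl})$$
$$B_{ij} = \sum_{k,l} \frac{\exp(e_{ij,kl} + \ln \rho^b_{kl})}{\sum_{s,t} \exp(e_{ij,st} + \ln \rho^b_{st})} \, sp^b_{kl}$$
$$A_{kl} = \sum_{i,j} \frac{\exp(e_{ij,kl} + \ln \rho^a_{ij})}{\sum_{s,t} \exp(e_{st,kl} + \ln \rho^a_{st})} \, sp^a_{ij}$$
Then, the span vectors are concatenated with the attended vectors and fed into a feed-forward neural network:
$$v^a_{ij} = F_2([sp^a_{ij}; B_{ij}]), \qquad v^b_{kl} = F_2([sp^b_{kl}; A_{kl}])$$
To aggregate these vectors, instead of using direct summation, we apply weighted summation with the marginal probabilities as weights:
$$v_a = \sum_{i,j} \rho^a_{ij} \, v^a_{ij}, \qquad v_b = \sum_{k,l} \rho^b_{kl} \, v^b_{kl}$$
where $\rho^a$ and $\rho^b$ work like the self-attention mechanism of Lin et al. (2017) to replace the summation pooling step. We use a softmax function to compute the predicted distribution y of the input sentence pair:
$$y = \mathrm{softmax}(W_y [v_a; v_b])$$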
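A span-level alignment step with marginal-probability re-normalization might look like the sketch below. It is an illustration under stated assumptions, not the authors' code: the marginals are folded into the attention by multiplying each exponentiated score by the target span's marginal before normalizing (one natural reading of "re-normalization"), the names `reweighted_softmax` and `structured_align` are hypothetical, and F1 and F2 are toy stand-ins for the learned networks. At least one span per sentence is assumed to have a non-zero marginal.

```python
import math

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def mix(weights, vectors):
    """Weighted sum of equal-length vectors."""
    return [sum(w * v[d] for w, v in zip(weights, vectors))
            for d in range(len(vectors[0]))]

def reweighted_softmax(scores, rho):
    """Softmax in which each item is additionally weighted by its marginal."""
    m = max(scores)
    weights = [r * math.exp(s - m) for s, r in zip(scores, rho)]
    total = sum(weights)
    return [w / total for w in weights]

def structured_align(sp_a, rho_a, sp_b, rho_b, F1, F2):
    """sp_*: list of span vectors; rho_*: matching list of span marginals."""
    fa, fb = [F1(s) for s in sp_a], [F1(s) for s in sp_b]
    e = [[dot(u, v) for v in fb] for u in fa]
    # Attend: unlikely spans in the other sentence receive little attention.
    B = [mix(reweighted_softmax(row, rho_b), sp_b) for row in e]
    A = [mix(reweighted_softmax([row[j] for row in e], rho_a), sp_a)
         for j in range(len(sp_b))]
    # Compare, then aggregate with the marginals as pooling weights.
    v_a = mix(rho_a, [F2(s + t) for s, t in zip(sp_a, B)])
    v_b = mix(rho_b, [F2(s + t) for s, t in zip(sp_b, A)])
    return v_a + v_b   # concatenated features for the final classifier

identity = lambda v: list(v)
features = structured_align([[1.0, 0.0]], [1.0],
                            [[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0],
                            identity, identity)
print(features)
```

In the toy call, the second span of sentence b has marginal 0, so it is ignored both as an attention target and in the pooled vector for b.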

Experiments
We evaluate our structured alignment model on two natural language matching tasks: question answering as sentence selection and natural language inference. We view our approach as a module for replacing the widely-used word-level alignment which can be plugged into other neural models. For that reason, our experiments are not intended to show performance improvements over state-of-the-art neural network architectures. Rather, our evaluation studies aim to address three questions: (a) whether our methods can be trained effectively in an end-to-end fashion; (b) whether they yield improvements over standard word-level alignment models; and (c) whether they can learn plausible latent constituency tree structures.
For both tasks, we initialize our model with 300D 840B GloVe word embeddings (Pennington et al., 2014). The hidden size for the BiLSTM is 150. The feed-forward networks $F_1$ and $F_2$ are two-layer perceptrons with ReLU as the hidden activation function, and the size of the hidden and output layers is set to 300. All hyperparameters are selected based on the model's performance on the development set.
Table 1 (surviving rows): results on TREC-QA.
Model                                                        MAP     MRR
(model name missing)                                         0.777   0.836
Lexical Decomposition and Composition (Wang et al., 2016)    0.771   0.845
Noise-Contrastive Estimation (Rao et al., 2016)              0.801   0.877
BiMPM (Wang et al., 2017b)                                   0.802   0.875

Answer Sentence Selection
We first study the effectiveness of our model for answer sentence selection tasks. Given a question, answer sentence selection aims to rank a list of candidate answer sentences based on their relatedness to the question. We experiment on the TREC-QA dataset (Wang et al., 2007), in which all questions with only positive or negative answers are removed. This leaves us with 1,162 training questions, 65 development questions and 68 test questions. Experimental results are listed in Table 1. We measure performance by the mean average precision (MAP) and mean reciprocal rank (MRR) using the standard TREC evaluation script.
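For readers unfamiliar with the two metrics, the sketch below computes MAP and MRR from per-question candidate scores and binary labels. It is a simplified illustration, not the TREC evaluation script: the function names are hypothetical, and the official script may handle ties and questions without correct answers differently.

```python
def average_precision(labels):
    """labels: 0/1 relevance of the candidates in ranked order, best first."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(labels, start=1):
        if relevant:
            hits += 1
            total += hits / rank   # precision at this correct answer
    return total / hits if hits else 0.0

def reciprocal_rank(labels):
    """1 / rank of the first correct candidate, or 0 if there is none."""
    for rank, relevant in enumerate(labels, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0

def map_mrr(questions):
    """questions: one list of (score, label) pairs per question."""
    aps, rrs = [], []
    for candidates in questions:
        ranked = [label for _, label in
                  sorted(candidates, key=lambda c: c[0], reverse=True)]
        aps.append(average_precision(ranked))
        rrs.append(reciprocal_rank(ranked))
    return sum(aps) / len(aps), sum(rrs) / len(rrs)

# Two toy questions with model scores and gold labels.
questions = [
    [(0.9, 1), (0.8, 0), (0.7, 1)],
    [(0.2, 1), (0.6, 0)],
]
print(map_mrr(questions))
```

MRR only rewards the position of the first correct sentence, while MAP also accounts for where the remaining correct sentences are ranked.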
In the first block of Table 1, we compare our model and variants thereof against several baselines. The first baseline is the Word-level Decomposable Attention model strengthened with a bidirectional LSTM for obtaining a contextualized representation for each word. The second baseline is a Simple Span Alignment model; we use an MLP layer over the LSTM outputs to calculate the unnormalized scores and replace the inside-outside algorithm with a simple softmax function to obtain the probability distribution over all candidate spans. We also introduce a pipelined baseline where we extract constituents from trees parsed by the CoreNLP constituency parser, and use the Simple Span Alignment model to only align these constituents.
As shown in Table 1, we use two variants of the Structured Alignment model, since the structure of the question and the answer sentence may be different; the first model shares parameters across the question and the answer for computing the structures, while the second one uses separate parameters. We view the sentence selection task as a binary classification problem and the final ranking is based on the predicted probability of the sentence containing the correct answer (positive label). We apply dropout to the output of the BiLSTM with dropout ratio set to 0.2. All parameters (including word embeddings) are updated with AdaGrad (Duchi et al., 2011), and the learning rate is set to 0.05.
Table 1 (second block) also reports the performance of various comparison systems and state-of-the-art models. As can be seen, on both MAP and MRR metrics, structured alignment models perform better than the decomposable attention model, showing that structural bias is helpful for matching a question to the correct answer sentence. We also observe that using separate parameters achieves higher scores on both metrics. The simple span alignment model obtains results similar to the decomposable attention model, suggesting that the shallow softmax distribution is ineffective for capturing structural information and may even introduce redundant noise. The pipelined model with an external parser also slightly improves upon the baseline, but still cannot outperform the end-to-end trained structured alignment model, which achieves results comparable with several strong baselines with fewer parameters. As mentioned earlier, our model could be used as a plug-in component for other more complex models, and may boost their performance by modeling the latent structures. At the same time, the structured alignment can provide better interpretability for sentence matching tasks, which most neural models lack.
Table 2 (surviving rows): results on SNLI.
Model                                                        Accuracy (%)   Parameters
(model name missing) (Bowman et al., 2015)                   78.2           -
LSTM encoders (Bowman et al., 2015)                          80.6           3.0M
LSTM with inter-attention (Rocktäschel et al., 2016)         83.5           252K
Matching LSTMs (Wang and Jiang, 2015)                        86.1           1.9M
LSTMN with deep attention fusion (Cheng et al., 2016)        86.3           3.4M
Enhanced BiLSTM Inference Model (Chen et al., 2016)          88.0           4.3M
Densely Interactive Inference Network (Gong et al., 2017)    88.0           -

Natural Language Inference
The second task we consider is natural language inference, where the input is a pair of premise and hypothesis sentences, and the goal is to predict whether the premise entails the hypothesis, contradicts the hypothesis, or neither. For this task, we use the Stanford NLI dataset (Bowman et al., 2015). After removing sentences with unknown labels, we obtain 549,367 pairs for training, 9,842 for development and 9,824 for testing. We compare our model against the same baselines used in the question answering task. All parameters (including word embeddings) are updated with AdaGrad (Duchi et al., 2011), and the learning rate is set to 0.05. Dropout is used with ratio 0.2. The structured alignment model in this experiment uses shared parameters for computing latent tree structures, since both the premise and hypothesis are declarative sentences.
The results of our experiments are shown in Table 2. Similar to the answer selection task, the tree matching model outperforms the decomposable model. Our structured alignment model gains 0.5% in accuracy over the baseline word-level comparison model without any additional annotation, simply from introducing a structural bias in the alignment between the sentences. Simple span alignment, however, is not helpful and even slightly degrades the performance relative to the word-level model.

Analysis of Learned Tree Structures
In this section, we give a brief qualitative analysis of the learned tree structures. We present the CKY charts for two randomly-selected sentence pairs in the SNLI test set in Figure 3. Recall that the CKY chart shows the likelihood of each span appearing as a constituent in the parse of the sentence, marginalized over all possible parses. By visualizing these span probabilities, we can see that the model learns structures which correspond to known syntactic structures.
In subfigure (a), we can see that "band is playing" is a very likely span, as is "at a large venue". In subfigure (b), the phrases "performing at a local bar" and "at a local bar or club" also receive high probabilities. For the second sentence pair, we see that the model can even resolve some attachment ambiguities correctly. The prepositional phrase "with green feathers" has a very low score for being attached to "women". Instead, the model prefers to attach it to "lingerie", forming the span "lingerie with green feathers". We also present the top-5 spans and their alignments in subfigures (c) and (d), which can be used to interpret model decisions for sentence matching tasks.
The analysis above and our experimental results in the previous section suggest that our model is able to learn tree structures which are closely related to syntax, and in addition reflect the semantic-level characteristics of the task at hand. In both question answering and natural language inference tasks, we observe that structured alignment leads to performance improvements over word-level models. This is in contrast to prior work, where the discovery of tree structures based on a semantic objective is not helpful. Although we use the same supervision signal in our model, a difference between the two approaches is that they are trying to learn tree structures for each sentence independently, performing comparisons at the sentence level only. Comparing spans directly forces the model to induce trees with comparable constituents, giving the model a stronger inductive bias.
Although our main goal is not to induce a grammar, we perform some simple experiments to compare the learned latent trees with parser-generated ones. We parse the sentences in both test sets with the CoreNLP constituency parser to obtain silver trees. Based on the parsing part of the trained structured alignment model, we compute the marginal probabilities of test sentences and feed them into the CKY algorithm (Younger, 1967) to find the most likely constituency trees. We then convert both silver and latent trees to sets of constituent brackets, and calculate the accuracy of the learned brackets against the silver parses. We use different combinations of training and test sets to examine the transferability of the learned tree structures. The results are shown in Table 3. We can see that although our model does not have any tree-structured input during training, it can still outperform the left-branching (LB) and right-branching (RB) baselines and achieve some consistency with the parser-generated trees.
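The bracket comparison described above can be sketched in a few functions. This is an illustration under explicit assumptions rather than the evaluation code used for Table 3: the CKY decoder here picks the binary tree that maximizes the sum of span marginals, accuracy is taken to be the fraction of predicted multi-word brackets that also occur in the reference tree, and all function names are hypothetical.

```python
def best_tree(n, rho):
    """Brackets (i, j), i < j, of the binary tree maximizing summed marginals.

    rho maps 1-based spans (i, j) to marginal probabilities.
    """
    best = {(i, i): 0.0 for i in range(1, n + 1)}
    split = {}
    for width in range(1, n):
        for i in range(1, n - width + 1):
            j = i + width
            k = max(range(i, j), key=lambda s: best[i, s] + best[s + 1, j])
            split[i, j] = k
            best[i, j] = rho.get((i, j), 0.0) + best[i, k] + best[k + 1, j]
    # Read the brackets back from the stored split points.
    brackets, stack = set(), [(1, n)]
    while stack:
        i, j = stack.pop()
        if i < j:
            brackets.add((i, j))
            stack += [(i, split[i, j]), (split[i, j] + 1, j)]
    return brackets

def left_branching(n):
    return {(1, j) for j in range(2, n + 1)}

def right_branching(n):
    return {(i, n) for i in range(1, n)}

def bracket_accuracy(predicted, reference):
    """Fraction of predicted brackets that also appear in the reference."""
    return len(predicted & reference) / len(predicted) if predicted else 0.0

rho = {(1, 2): 0.9, (2, 3): 0.1, (1, 3): 1.0}
tree = best_tree(3, rho)
print(sorted(tree), bracket_accuracy(tree, right_branching(3)))
```

Because every binary tree over the same sentence contains the full-sentence bracket, even a purely left- or right-branching baseline scores above zero, which is why these baselines are a meaningful floor for comparison.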

Related Work
Sentence comparison models The Stanford natural language inference dataset (Bowman et al., 2015), and the expanded multi-genre natural language inference dataset (Nangia et al., 2017) are the most well-known recent sentence comparison tasks. The literature on this comparison task is far too extensive to include here, although the recent shared task on Multi-NLI gives a good survey of sentence-level comparison models (Nangia et al., 2017). Some of these models use sentence structures, which are obtained either in a latent fashion (Bowman et al., 2016) or during preprocessing (Zhao et al., 2016), but they squash all of the structure into a single vector, losing the ability to easily compare substructures between the two sentences. For models doing a word-level comparison, the decomposable attention model (Parikh et al., 2016), which we have discussed already in this paper, is the most salient example, although many similar models exist in the literature (Chen et al., 2017; Wang et al., 2017b). The idea of word-level alignments between a question and a passage is also pervasive in the recent question answering literature (Seo et al., 2017; Wang et al., 2017a).
Finally, and most similar to our approach, several models have been proposed that directly compare subtrees between two sentences (Chen et al., 2017;Zhao et al., 2016). However, all of these models are pipelined; they obtain the sentence structure in a non-differentiable preprocessing step, losing the benefits of end-to-end training. Ours is the first model to allow comparison between latent tree structures, trained end-to-end on the comparison objective.
Structured attention While it has long been known that inference in graphical models is differentiable (Li and Eisner, 2009;Domke, 2011), and using inference in, e.g., a CRF (Lafferty et al., 2001) as the last layer in a neural network is common practice (Liu and Lapata, 2017;Lample et al., 2016), the use of inference algorithms as intermediate layers in end-to-end neural networks is a recent development. Kim et al. (2017) were the first to use inference to compute structured attentions over latent sentence variables, inducing tree structures trained on the end-to-end objective. Liu and Lapata (2018) showed how to do this more efficiently, although their work is still limited to structured attention over a single sentence. Our model is the first to include latent structured alignments between two sentences.
Grammar Induction Unsupervised grammar induction is a well-studied problem (Cohen and Smith, 2009). The most recent work in this direction was the Neural E-DMV model. While our goal is not to induce a grammar, we do produce a probabilistic grammar as a byproduct of our model. Our results suggest that training on more complex objectives may be a good way to pursue grammar induction in the future; forcing the model to construct consistent, comparable subtrees between the two sentences is a strong signal for grammar induction. Very recently, a few models have attempted to infer latent dependency tree structures with neural models in sentence modeling tasks (Yogatama et al., 2017; Choi et al., 2018).

Conclusions
In this paper we have considered the problem of comparing two sentences in natural language processing models. We have shown how to move beyond word- and sentence-level comparison to comparing spans between two sentences, without the need for an external parser. Through experiments on sentence comparison datasets, we have seen that span comparisons consistently outperform word-level comparisons, with no additional supervision. The proposed model can be trained effectively in an end-to-end fashion and is able to induce plausible tree structures.
Our results have several implications for future work. First, the success of span comparisons over word-level comparisons suggests that it may be advantageous to include such comparisons in more complex models, either for comparing two sentences directly, or as intermediate parts of models for more complex tasks, such as reading comprehension. Second, our model's ability to infer trees from a semantic objective is intriguing, and suggestive of future opportunities in grammar induction research. The use of the inside-outside algorithm unavoidably renders the full model slower (by 5-8 times) compared to the decomposable attention model. We hope to find a more efficient way to accelerate this dynamic programming method on a GPU.