Task-Specific Attentive Pooling of Phrase Alignments Contributes to Sentence Matching

This work comparatively studies two typical sentence matching tasks, textual entailment (TE) and answer selection (AS), and observes that weaker phrase alignments are more critical in TE, while stronger phrase alignments deserve more attention in AS. The key to reaching this observation lies in phrase detection, phrase representation, phrase alignment and, more importantly, how to connect aligned phrases of different matching degrees with the final classifier. Prior work (i) has limitations in phrase generation and representation, (ii) conducts alignment at the word and phrase levels using handcrafted features, or (iii) utilizes a single alignment framework without considering the characteristics of specific tasks, which limits the framework's effectiveness across tasks. We propose an architecture based on the Gated Recurrent Unit (GRU) that supports (i) representation learning of phrases of arbitrary granularity and (ii) task-specific attentive pooling of phrase alignments between two sentences. Experimental results on TE and AS match our observation and show the effectiveness of our approach.


Introduction
How to model a pair of sentences is a critical issue in many NLP tasks, including textual entailment (Marelli et al., 2014a; Bowman et al., 2015a; Yin et al., 2016a) and answer selection (Yu et al., 2014; Yang et al., 2015; Santos et al., 2016). A key challenge common to these tasks is the lack of explicit alignment annotation between the sentences of the pair. Thus, inferring and assessing the semantic relations between words and phrases in the two sentences is a core issue.
Figure 1: Alignment examples in TE (top) and AS (bottom). Green color: identical (subset) alignment; blue color: relatedness alignment; red color: unrelated alignment. Q: the first sentence in TE or the question in AS; $C^+$, $C^-$: the correct or incorrect counterpart in the sentence pair (Q, C).
Figure 1 shows examples of human annotated phrase alignments. In the TE example, we try to figure out whether Q entails $C^+$ (positive) or $C^-$ (negative). As human beings, we discover the relationship of two sentences by studying the alignments between linguistic units. We see that some phrases are kept: "are playing outdoors" (between Q and $C^+$), "are playing" (between Q and $C^-$). Some phrases are changed into related semantics on purpose: "the young boys" (Q) → "the kids" ($C^+$ & $C^-$), "the man is smiling nearby" (Q) → "near a man with a smile" ($C^+$) or → "an old man is standing in the background" ($C^-$). We can see that the kept parts have stronger alignments (green color), and the changed parts have weaker alignments (blue color). Here, by "strong" / "weak" we mean how semantically close the two aligned phrases are. To successfully identify the relationships of (Q, $C^+$) or (Q, $C^-$), studying the changed parts is crucial. Hence, we argue that TE should pay more attention to weaker alignments.
In AS, we try to figure out whether sentence $C^+$ or sentence $C^-$ answers question Q. Roughly, the content in candidates $C^+$ and $C^-$ can be classified into an aligned part (e.g., repeated or relevant parts) and a negligible part. This differs from TE, in which it is hard to claim that some parts are negligible or play a minor role, as TE requires making clear whether each part can entail or be entailed. Hence, TE is considerably sensitive to those "unseen" parts. In contrast, AS is more tolerant of negligible parts and less related parts. From the AS example in Figure 1, we see that "Auburndale Florida" (Q) can find the related parts "the city" ($C^+$) and "Auburndale", "a city" ($C^-$); "how big" (Q) also matches "had a population of 12,381" ($C^+$) very well. Some unaligned parts also exist, denoted by red color. Hence, we argue that stronger alignments in AS deserve more attention.
The above analysis suggests that: (i) alignments connecting two sentences can happen between phrases of arbitrary granularity; (ii) phrase alignments can have different intensities; (iii) tasks of different properties require paying different attention to alignments of different intensities.
Alignments at the word level (Yih et al., 2013) and at the phrase level (Yao et al., 2013) have both been studied before. For example, Yih et al. (2013) make use of WordNet (Miller, 1995) and Probase (Wu et al., 2012) for identifying hypernymy and hyponymy. Yao et al. (2013) use POS tags, WordNet and a paraphrase database for alignment identification. Their approaches rely on manual feature design and linguistic resources. We develop a deep neural network (DNN) to learn representations of phrases of arbitrary lengths. As a result, alignments can be searched in a more automatic and exhaustive way.
DNNs have been intensively investigated in sentence pair classification (Blacoe and Lapata, 2012; Socher et al., 2011; Yin and Schütze, 2015b), and attention mechanisms have also been applied to individual tasks (Santos et al., 2016; Rocktäschel et al., 2016; Wang and Jiang, 2016); however, most attention-based DNNs make the implicit assumption that stronger alignments deserve more attention (Yin et al., 2016a; Santos et al., 2016; Yin et al., 2016b). Our examples in Figure 1, instead, show that this assumption does not always hold. Weaker alignments in certain tasks such as TE can be the indicator of the final decision. Our inspiration comes from the analysis of some prior work. For TE, Yin et al. (2016a) show that considering the pairs in which overlapping tokens are removed can give a boost. This simple trick matches our motivation that weaker alignments should be given more attention in TE. However, Yin et al. (2016a) remove overlapping tokens completely, potentially obscuring complex alignment configurations. In addition, Yin et al. (2016a) use the same attention mechanism for TE and AS, which is suboptimal according to our observations.
This motivates us in this work to introduce DNNs with a flexible attention mechanism that is adaptable for specific tasks. For TE, it can make our system pay more attention to weaker alignments; for AS, it enables our system to focus on stronger alignments. We can treat the pre-processing in (Yin et al., 2016a) as a hard way, and ours as a soft way, as our phrases have more flexible lengths and the existence of overlapping phrases decreases the risk of losing important alignments. In experiments, we will show that this attention scheme is very effective for different tasks.
We make the following contributions. (i) We use GRU (Gated Recurrent Unit) to learn representations for phrases of arbitrary granularity. Based on phrase representations, we can detect phrase alignments of different intensities. (ii) We propose attentive pooling to achieve flexible choice among alignments, depending on the characteristics of the task. (iii) We achieve state-of-the-art performance on the TE task.

Related Work
Non-DNN for sentence pair modeling. Heilman and Smith (2010) describe tree edit models that generalize tree edit distance by allowing operations that better account for complex reordering phenomena and by learning from data how different edits should affect the model's decisions about sentence relations. Wang and Manning (2010) cope with the alignment between a sentence pair by using a probabilistic model that models tree-edit operations on dependency parse trees. Their model treats alignments as structured latent variables, and offers a principled framework for incorporating complex linguistic features. Guo and Diab (2012) identify the degree of sentence similarity by modeling the missing words (words that are not in the sentence) so as to relieve the sparseness issue of sentence modeling. Yih et al. (2013) try to improve the shallow semantic component, lexical semantics, by formulating sentence pair as a semantic matching problem with a latent word-alignment structure as in (Chang et al., 2010). More fine-grained word overlap and alignment between two sentences are explored in (Lai and Hockenmaier, 2014), in which negation, hypernym/hyponym, synonym and antonym relations are used. Yao et al. (2013) extend word-to-word alignment to phrase-to-phrase alignment by a semi-Markov CRF. Such approaches often require more computational resources. In addition, using syntactic/semantic parsing during run-time to find the best matching between structured representations of sentences is not trivial.
DNN for sentence pair classification. There recently has been great interest in using DNNs for classifying sentence pairs as they can reduce the burden of feature engineering.
For AS, Yu et al. (2014) present a bigram CNN (convolutional neural network (LeCun et al., 1998)) to model question and answer candidates. Yang et al. (2015) extend this method and get state-of-the-art performance on the WikiQA dataset. Feng et al. (2015) test various setups of a bi-CNN architecture on an insurance domain QA dataset. Tan et al. (2015) explore bidirectional LSTM on the same dataset. Other sentence matching tasks such as paraphrase identification (Socher et al., 2011; Yin and Schütze, 2015a), question-Freebase fact matching (Yin et al., 2016b) etc. are also investigated.
Some prior work aims to solve a general sentence matching problem. Hu et al. (2014) present two CNN architectures for paraphrasing, sentence completion (SC) and tweet-response matching tasks. Yin and Schütze (2015b) propose the MultiGranCNN architecture to model general sentence matching based on phrase matching on multiple levels of granularity. Wan et al. (2016) try to match two sentences in AS and SC by multiple sentence representations, each coming from the local representations of two LSTMs.
Attention-based DNN for alignment. DNNs have been successfully developed to detect alignments, e.g., in machine translation (Bahdanau et al., 2015) and text reconstruction (Li et al., 2015; Rush et al., 2015). In addition, attention-based alignment is also applied in natural language inference (e.g., Rocktäschel et al. (2016), Wang and Jiang (2016)). However, most of this work aligns word-by-word. As Figure 1 shows, many sentence relations can be better identified through phrase level alignments. This is one motivation of our work.
Figure 2: Gated Recurrent Unit

Model
This section first gives a brief introduction of GRU and how it performs phrase representation learning, then describes the different attentive poolings for phrase alignments w.r.t TE and AS tasks.

GRU Introduction
GRU is a simplified version of LSTM. Both are found effective in sequence modeling, as they are order-sensitive and can capture long-range context. The tradeoffs between GRU and its competitor LSTM have not been fully explored yet. According to empirical evaluations in (Chung et al., 2014; Jozefowicz et al., 2015), there is no clear winner. In many tasks both architectures yield comparable performance, and tuning hyperparameters like layer size is probably more important than picking the ideal architecture. GRUs have fewer parameters and thus may train a bit faster or need less data to generalize. Hence, we use GRU, as shown in Figure 2, to model text: x is the input sentence with token $x_t \in \mathbb{R}^d$ at position t, and $s_t \in \mathbb{R}^h$ is the hidden state at t, supposed to encode the history $x_1, \cdots, x_t$. z and r are two gates. All $U \in \mathbb{R}^{d \times h}$ and $W \in \mathbb{R}^{h \times h}$ are parameters in GRU.
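Because Figure 2 is not reproduced in this text, the block below writes out a GRU update in the notation of this section. It is a sketch that assumes the commonly used formulation: row-vector inputs, the logistic sigmoid $\sigma$, the elementwise product $\circ$, superscripts on U and W to distinguish the three parameter pairs, and a candidate state $h_t$ (a symbol that does not appear in the text above). The figure may differ in details such as bias terms or whether $z_t$ or $1 - z_t$ weights the previous state.

```latex
\begin{align}
z_t &= \sigma\left(x_t U^{z} + s_{t-1} W^{z}\right) \\
r_t &= \sigma\left(x_t U^{r} + s_{t-1} W^{r}\right) \\
h_t &= \tanh\left(x_t U^{h} + (s_{t-1} \circ r_t)\, W^{h}\right) \\
s_t &= (1 - z_t) \circ h_t + z_t \circ s_{t-1}
\end{align}
```

With $x_t \in \mathbb{R}^d$, $U \in \mathbb{R}^{d \times h}$ and $W \in \mathbb{R}^{h \times h}$, every term on the right-hand side lies in $\mathbb{R}^h$, which matches the stated dimensionality of the hidden state.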

Representation Learning for Phrases
For a general sentence s with five consecutive words: ABCDE, with each word represented by a word embedding of dimensionality d, we first create four fake sentences, $s_1$: "BCDEA", $s_2$: "CDEAB", $s_3$: "DEABC" and $s_4$: "EABCD", then put them together with s in a matrix (Figure 3, left). We run GRUs on each row of this matrix in parallel. As GRU is able to encode the whole sequence up to the current position, this step generates representations for all consecutive phrases in the original sentence s. For example, the GRU hidden state at position "E" at coordinates (1,5) (i.e., 1st row, 5th column) denotes the representation of the phrase "ABCDE", which in fact is s itself; the hidden state at "E" (2,4) denotes the representation of the phrase "BCDE"; . . . ; the hidden state at "E" (5,1) denotes the representation of "E" itself. Hence, for each token, we can learn the representations for all phrases ending with this token. Finally, all phrases of any length in s get a representation vector. GRUs in those rows are set to share weights so that all phrase representations are comparable in the same space. Now, we reformat sentence "ABCDE" into $s^*$, as shown by arrows in Figure 3 (right); the arrow direction indicates phrase order. Each sequence in parentheses is a phrase (we use parentheses just to make the phrase boundaries clear). Taking the phrase "CDE" as an example, its representation comes from the hidden state at "E" (3,3) in Figure 3 (left). Shaded parts are discarded. The main advantage of reformatting sentence "ABCDE" into the new sentence $s^*$ is to create phrase-level semantic units while at the same time maintaining the order information.
Hence, the sentence "how big is Auburndale Florida" in Figure 1 will be reformatted into a sequence of phrases that includes "(how)", "(how big)", "(big is)", "(Auburndale Florida)" and every other consecutive phrase of the sentence. We can see that phrases are exhaustively detected and represented.
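The rotation scheme above can be traced without any neural network by recording which phrase each matrix cell stands for. The following is a minimal sketch in plain Python; the function names, the `max_len` argument and the ordering used in `reformat_sentence` (phrases grouped by their last token, longest first) are assumptions made for illustration, since Figure 3 is not available here.

```python
from typing import Dict, List, Tuple


def rotations(tokens: List[str]) -> List[List[str]]:
    """Row i (0-based) is the sentence rotated left by i tokens."""
    return [tokens[i:] + tokens[:i] for i in range(len(tokens))]


def phrase_grid(tokens: List[str], max_len: int = 7) -> Dict[Tuple[int, int], List[str]]:
    """Map 1-based (row, column) cells to the phrase a GRU state there would encode.

    Row `row` starts at token `row`, so only its first n - row + 1 columns stay
    inside the original sentence; later cells wrap around and are discarded
    (the shaded cells). `max_len` caps the phrase length.
    """
    n = len(tokens)
    grid: Dict[Tuple[int, int], List[str]] = {}
    for row, rotated in enumerate(rotations(tokens), start=1):
        for col in range(1, min(n - row + 1, max_len) + 1):
            grid[(row, col)] = rotated[:col]
    return grid


def reformat_sentence(tokens: List[str], max_len: int = 7) -> List[List[str]]:
    """List all phrases grouped by end token, longest first (assumed ordering)."""
    grid = phrase_grid(tokens, max_len)
    ordered: List[List[str]] = []
    for end in range(1, len(tokens) + 1):
        for row in range(1, end + 1):
            cell = (row, end - row + 1)
            if cell in grid:
                ordered.append(grid[cell])
    return ordered


if __name__ == "__main__":
    for phrase in reformat_sentence("how big is Auburndale Florida".split()):
        print("(" + " ".join(phrase) + ")")
```

In a real model each cell would hold a GRU hidden state rather than a token list; the bookkeeping of which cells are kept is the same.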
In the experiments of this work, we explore the phrases of maximal length 7 instead of arbitrary lengths.

Attentive Pooling
As each sentence $s^*$ consists of a sequence of phrases, and each phrase is denoted by a representation vector generated by GRU, we can compute an alignment matrix A between two sentences $s^*_1$ and $s^*_2$ by comparing each pair of phrases, one from $s^*_1$ and one from $s^*_2$. Let $s^*_1$ and $s^*_2$ also denote the respective lengths; thus $A \in \mathbb{R}^{s^*_1 \times s^*_2}$. While there are many ways of computing the entries of A, we found that cosine similarity works well in our setting.
The first step then is to detect the best alignment for each phrase by leveraging A. To be concrete, for sentence $s^*_1$, we do row-wise max-pooling over A to obtain the attention vector $a_1$: $a_{1,i} = \max(A[i, :])$ (Equation 5). In $a_1$, the entry $a_{1,i}$ denotes the best alignment for the i-th phrase in sentence $s^*_1$. Similarly, we can do column-wise max-pooling to generate the attention vector $a_2$ for sentence $s^*_2$. Now, the question is whether we should pay most attention to the phrases that are aligned very well or to the phrases that are aligned badly. According to the analysis of the two examples in Figure 1, we need to pay more attention to weaker (resp. stronger) alignments in TE (resp. AS). To this end, we adopt a different second step over the attention vector $a_i$ (i = 1, 2) for TE and AS.
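As an illustration of the alignment matrix described above, the sketch below fills A with cosine similarities between two lists of phrase vectors. It uses plain Python lists in place of GRU outputs; the function names and the choice to return 0 for a zero vector are assumptions of this sketch, not details given in the text.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch only
    return dot / (norm_u * norm_v)


def alignment_matrix(phrases_1: Sequence[Vector], phrases_2: Sequence[Vector]) -> List[List[float]]:
    """A[i][j] compares the i-th phrase of sentence 1 with the j-th phrase of sentence 2."""
    return [[cosine(p, q) for q in phrases_2] for p in phrases_1]


if __name__ == "__main__":
    A = alignment_matrix([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]],
                         [[3.0, 0.0], [0.0, 1.0]])
    for row in A:
        print([round(value, 3) for value in row])
```

The matrix has one row per phrase of the first sentence and one column per phrase of the second, so its shape follows the two reformatted sentence lengths.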
For TE, in which weaker alignments are supposed to contribute more, we do k-min-pooling over $a_i$, i.e., we only keep the k phrases which are aligned worst. For the (Q, $C^+$) pair in the TE example of Figure 1, we expect this step to put most of our attention on the phrases "the kids", "the young boys", "near a man with a smile" and "and the man is smiling nearby", as they have relatively weaker alignments and their relations are the indicator of the final decision.
For AS, in which stronger alignments are supposed to contribute more, we do k-max-pooling over $a_i$, i.e., we only keep the k phrases which are aligned best. For the (Q, $C^+$) pair in the AS example of Figure 1, we expect this k-max-pooling to put most of our attention on the phrases "how big", "Auburndale Florida", "the city" and "had a population of 12,381", as they have relatively stronger alignments and their relations are the indicator of the final decision. We keep the original order of the extracted phrases after k-min/max-pooling.
In summary, for TE, we first do row-wise max-pooling over the alignment matrix, then do k-min-pooling over the generated alignment vector; we use k-min-max-pooling to denote the whole process. In contrast, we use k-max-max-pooling for AS. We refer to this method of using two successive min or max pooling steps as attentive pooling.
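The two pooling steps summarized above can be sketched as follows. This is a minimal plain-Python illustration that returns the indices of the selected phrases rather than their vectors; the function names, the `task` argument and the tie-breaking rule (earlier phrase first) are assumptions of this sketch.

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def best_alignments(A: Matrix) -> Tuple[List[float], List[float]]:
    """First step: row-wise max for sentence 1, column-wise max for sentence 2."""
    a1 = [max(row) for row in A]
    a2 = [max(col) for col in zip(*A)]
    return a1, a2


def k_pooling(a: Sequence[float], k: int, mode: str) -> List[int]:
    """Second step: indices of the k worst ('min') or best ('max') aligned phrases.

    The indices are returned in increasing order, so the original phrase order
    is preserved.
    """
    if mode not in ("min", "max"):
        raise ValueError("mode must be 'min' or 'max'")
    ranked = sorted(range(len(a)), key=lambda i: a[i], reverse=(mode == "max"))
    return sorted(ranked[:k])


def attentive_pooling(A: Matrix, k: int, task: str) -> Tuple[List[int], List[int]]:
    """k-min-max-pooling for TE, k-max-max-pooling for AS."""
    if task not in ("TE", "AS"):
        raise ValueError("task must be 'TE' or 'AS'")
    mode = "min" if task == "TE" else "max"
    a1, a2 = best_alignments(A)
    return k_pooling(a1, k, mode), k_pooling(a2, k, mode)


if __name__ == "__main__":
    A = [[0.9, 0.1, 0.2],
         [0.3, 0.4, 0.2],
         [0.1, 0.8, 0.5],
         [0.2, 0.1, 0.3]]
    print(attentive_pooling(A, k=2, task="TE"))
    print(attentive_pooling(A, k=2, task="AS"))
```

The selected indices would then be used to gather the corresponding phrase representations into the new feature map of each sentence.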

The Whole Architecture
Now, we present the whole system in Figure 4. We take sentences $s_1$ "ABC" and $s_2$ "DEFG" as illustration. Each token, i.e., A to G, in the figure is denoted by an embedding vector, hence each sentence is represented as an order-3 tensor as input (they are depicted as rectangles just for simplicity). Based on the tensor-style sentence input, we have described the phrase representation learning by GRU$_1$ in Section 3.2 and attentive pooling in Section 3.3.
Attentive pooling generates a new feature map for each sentence, as shown in Figure 4 (the third layer from the bottom), and each column representation in the feature map denotes a key phrase in this sentence that, based on our modeling assumptions, should be a good basis for the correct final decision. For instance, we expect such a feature map to contain representations of "the young boys", "outdoors" and "and the man is smiling nearby" for the sentence Q in the TE example of Figure 1. Now, we do another GRU$_2$ step for: 1) the new feature map of each sentence mentioned above, to encode all the key phrases as the sentence representation; 2) a concatenated feature map of the two new sentence feature maps, to encode all the key phrases in the two sentences sequentially as the representation of the sentence pair. As GRU generates a hidden state at each position, we always choose the last hidden state as the representation of the sentence or sentence pair. In Figure 4 (the fourth layer), these final GRU-generated representations for sentence $s_1$, $s_2$ and the sentence pair are depicted as green columns: $s_1$, $s_2$ and $s_p$ respectively.
As for the input of the final classifier, it can be flexible, such as representation vectors (rep), similarity scores between $s_1$ and $s_2$ (simi), and extra linguistic features (extra). This can vary based on the specific tasks. We give details in Section 4.

Experiments
We test the proposed architectures on TE and AS benchmark datasets.

Common Setup
For both TE and AS, words are initialized by 300-dimensional GloVe embeddings (nlp.stanford.edu/projects/glove/; Pennington et al., 2014) and not changed during training. A single randomly initialized embedding is created for all unknown words by uniform sampling from [−.01, .01]. We use ADAM (Kingma and Ba, 2015), with a first momentum coefficient of 0.9 and a second momentum coefficient of 0.999 (the standard configuration recommended by Kingma and Ba), $L_2$ regularization and Diversity Regularization (Xie et al., 2015). Table 1 shows the values of the hyperparameters, tuned on dev.
Classifier. Following Yin et al. (2016a), we use three classifiers, namely logistic regression in DNN, and logistic regression and linear SVM with default parameters (http://scikit-learn.org/stable/ for both) directly on the feature vector, and report performance of the best.
Common Baselines. (i) Addition. We sum up word embeddings element-wise to form the sentence representation, then concatenate the two sentence representation vectors ($s^0_1$, $s^0_2$) as classifier input. (ii) A-LSTM. The pioneering attention based LSTM system for a specific sentence pair classification task, "natural language inference" (Rocktäschel et al., 2016). A-LSTM has the same dimensionality as our GRU system in terms of initialized word representations and the hidden states. (iii) ABCNN (Yin et al., 2016a). The state-of-the-art system in both TE and AS.
Based on the motivation in Section 1, the main hypothesis to be tested in experiments is: k-min-max-pooling is superior for TE and k-max-max-pooling is superior for AS. In addition, we would like to determine whether the second pooling step in attentive pooling, i.e., the k-min/max-pooling, is more effective than a "full-pooling" in which all the generated phrases are forwarded into the next layer.
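The Addition baseline listed above is simple enough to sketch directly. The following plain-Python illustration sums word embeddings per sentence and concatenates the two sums; the function names, the dictionary-based embedding table and the `"<unk>"` key for the shared unknown-word embedding are hypothetical choices made for this example.

```python
from typing import Dict, List, Sequence


def addition_representation(tokens: Sequence[str],
                            embeddings: Dict[str, Sequence[float]],
                            dim: int) -> List[float]:
    """Element-wise sum of the word embeddings of one sentence."""
    total = [0.0] * dim
    for token in tokens:
        # All unknown words share one embedding; "<unk>" is a hypothetical key.
        vector = embeddings.get(token, embeddings["<unk>"])
        for j in range(dim):
            total[j] += vector[j]
    return total


def addition_features(sentence_1: Sequence[str],
                      sentence_2: Sequence[str],
                      embeddings: Dict[str, Sequence[float]],
                      dim: int) -> List[float]:
    """Concatenate the two sentence sums to form the classifier input."""
    return (addition_representation(sentence_1, embeddings, dim)
            + addition_representation(sentence_2, embeddings, dim))


if __name__ == "__main__":
    toy = {"a": [1.0, 2.0], "b": [3.0, 4.0], "<unk>": [0.5, 0.5]}
    print(addition_features(["a", "b"], ["a", "zzz"], toy, dim=2))
```

The resulting vector has twice the embedding dimensionality and carries no word-order information, which is what makes it a useful lower bound for the GRU-based systems.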

Textual Entailment
SemEval 2014 Task 1 (Marelli et al., 2014a) evaluates system predictions of textual entailment (TE) relations on sentence pairs from the SICK dataset (Marelli et al., 2014b). The three classes are entailment, contradiction and neutral. The sizes of the SICK train, dev and test sets are 4,439, 495 and 4,906 pairs, respectively. We choose the SICK benchmark dataset so that our result is directly comparable with that of Yin et al. (2016a), in which non-overlapping text is utilized explicitly to boost the performance. That trick inspires this work.
Following Lai and Hockenmaier (2014), we train our final system (after fixing of hyperparameters) on train and dev (4,934 pairs). Our evaluation measure is accuracy.

Feature Vector
The final feature vector as input of classifier contains three parts: rep, simi, extra.
Rep. Five vectors in total: three are the top sentence representations $s_1$, $s_2$ and the top sentence pair representation $s_p$ (shown in green in Figure 4), and two are $s^0_1$, $s^0_2$ from the Addition baseline.
Simi. Four similarity scores: cosine similarity and Euclidean distance between $s_1$ and $s_2$, and cosine similarity and Euclidean distance between $s^0_1$ and $s^0_2$. Euclidean distance $\|\cdot\|$ is transformed into $1/(1+\|\cdot\|)$.
Extra. We include the same 22 linguistic features as Yin et al. (2016a). They cover 15 machine translation metrics between the two sentences; whether or not the two sentences contain negation tokens like "no", "not" etc.; whether or not they contain synonyms, hypernyms or antonyms; and the two sentence lengths. See Yin et al. (2016a) for details.
Table 2: Results on SICK. Significant improvement over both k-max-max-pooling and full-pooling is marked with * (test of equal proportions, p < .05).
method                                       acc
SemEval Top3: (Jimenez et al., 2014)         83.1
SemEval Top3: (Zhao et al., 2014)            83.6
SemEval Top3: (Lai and Hockenmaier, 2014)

Results
Table 2 shows that GRU with k-min-max-pooling gets state-of-the-art performance on SICK and significantly outperforms k-max-max-pooling and full-pooling. Full-pooling feeds in more phrases than k-max-max-pooling and k-min-max-pooling combined, which might bring two problems: (i) noisy alignments increase; (ii) the sentence pair representation $s_p$ is no longer discriminative: $s_p$ does not know whether its semantics come from phrases of $s_1$ or $s_2$, because sentences have different lengths and the boundary location separating the two sentences therefore varies across pairs. However, knowing this boundary is crucial to determine whether $s_1$ entails $s_2$.
ABCNN (Yin et al., 2016a) is based on assumptions similar to k-max-max-pooling: words/phrases with higher matching values should contribute more in this task. However, ABCNN gets its optimal performance by combining a reformatted SICK version in which overlapping tokens in the two sentences are removed. This instead hints that non-overlapping units are very helpful for this task, which is exactly what our k-min-max-pooling exploits.
Table 3: Results on WikiQA. Significant improvement over both k-min-max-pooling and full-pooling is marked with * (t-test, p < .05). State of the art: 74.17 (MAP)/75.88 (MRR) in (Tymoshenko et al., 2016).

Answer Selection
We use the WikiQA subtask that assumes there is at least one correct answer for a question. This dataset consists of 20,360, 1,130 and 2,352 question-candidate pairs in train, dev and test, respectively. Following Yang et al. (2015), we truncate answers to 40 tokens and report mean average precision (MAP) and mean reciprocal rank (MRR).
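For readers unfamiliar with the two ranking measures, the sketch below computes MAP and MRR from per-question candidate lists using their usual definitions. The data layout (a dictionary from question id to `(score, label)` pairs) and the function names are assumptions for illustration; official evaluation scripts may handle ties or questions without a correct answer differently.

```python
from typing import Dict, List, Sequence, Tuple


def average_precision(labels: Sequence[int]) -> float:
    """AP over 0/1 labels that are already sorted by descending model score."""
    hits = 0
    total = 0.0
    for rank, label in enumerate(labels, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def reciprocal_rank(labels: Sequence[int]) -> float:
    """1 / rank of the first correct candidate, or 0 if there is none."""
    for rank, label in enumerate(labels, start=1):
        if label:
            return 1.0 / rank
    return 0.0


def map_mrr(scored: Dict[str, List[Tuple[float, int]]]) -> Tuple[float, float]:
    """Mean AP and mean RR over questions; each value lists (score, label) pairs."""
    if not scored:
        raise ValueError("no questions to evaluate")
    aps, rrs = [], []
    for candidates in scored.values():
        ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
        labels = [label for _, label in ranked]
        aps.append(average_precision(labels))
        rrs.append(reciprocal_rank(labels))
    return sum(aps) / len(aps), sum(rrs) / len(rrs)


if __name__ == "__main__":
    toy = {"q1": [(0.9, 0), (0.8, 1), (0.1, 1)],
           "q2": [(0.2, 0), (0.7, 1)]}
    print(map_mrr(toy))
```

Both measures are averaged per question, so a question with many candidates counts as much as one with few.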
Apart from the common baselines Addition, A-LSTM and ABCNN, we compare further with: (i) CNN-Cnt (Yang et al., 2015): combine CNN with two linguistic features "WordCnt" (the number of non-stopwords in the question that also occur in the answer) and "WgtWordCnt" (reweight the counts by the IDF values of the question words); (ii) AP-CNN (Santos et al., 2016).
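The two count features named above can be sketched as follows. This is an illustrative reading rather than the exact feature extractor of Yang et al. (2015): the function names, the IDF formula log(N / df), the decision to count each distinct question word once, and the stopword set passed in by the caller are all assumptions of this sketch.

```python
import math
from typing import Dict, Iterable, Sequence, Set, Tuple


def idf_table(documents: Iterable[Sequence[str]]) -> Dict[str, float]:
    """IDF as log(N / document frequency); one common variant, assumed here."""
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)
    doc_freq: Dict[str, int] = {}
    for words in doc_sets:
        for word in words:
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / count) for word, count in doc_freq.items()}


def word_count_features(question: Sequence[str],
                        answer: Sequence[str],
                        stopwords: Set[str],
                        idf: Dict[str, float]) -> Tuple[float, float]:
    """Return (WordCnt, WgtWordCnt) for one question-answer pair."""
    answer_words = set(answer)
    # Distinct non-stopword question words that also occur in the answer.
    shared = [w for w in set(question) if w not in stopwords and w in answer_words]
    word_cnt = float(len(shared))
    wgt_word_cnt = sum(idf.get(w, 0.0) for w in shared)
    return word_cnt, wgt_word_cnt


if __name__ == "__main__":
    question = "how big is auburndale florida".split()
    answer = "auburndale is a city in florida".split()
    stop = {"how", "is", "a", "in"}
    toy_idf = {"auburndale": 2.0, "florida": 1.5, "big": 1.0}
    print(word_count_features(question, answer, stop, toy_idf))
```

In the AS feature vector these two values sit alongside the two sentence lengths in the extra part.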

Feature Vector
The final feature vector in AS has the same (rep, simi, extra) structure as TE, except that simi consists of only two cosine similarity scores, and extra consists of four entries: two sentence lengths, WordCnt and WgtWordCnt.
Table 3 shows that GRU with k-max-max-pooling is significantly better than its k-min-max-pooling and full-pooling versions. GRU with k-max-max-pooling makes an assumption similar to that of ABCNN (Yin et al., 2016a) and AP-CNN (Santos et al., 2016): units with higher matching scores are supposed to contribute more in this task. Our improvement may be due to two factors: (i) our linguistic units cover phrases more exhaustively, which enables alignments in a wider range; (ii) we have two max-pooling steps in our attentive pooling, and the second one in particular is able to remove some noisily aligned phrases. Both ABCNN and AP-CNN are based on convolutional layers, so their phrase detection is constrained by filter sizes. Even though ABCNN tries a second CNN layer to detect phrases of larger granularity, its phrases in different CNN layers cannot be aligned directly as they are in different spaces. GRU in this work uses the same weights to learn representations of phrases of arbitrary granularity; hence, all phrases share the same representation space and can be compared directly.

Visual Analysis
In this subsection, we visualize the attention distributions over phrases, i.e., $a_i$ in Equation 5, of example sentences in Figure 1 (due to space limitations, we only show this for the TE example). Figures 5(a) and 5(b) show the attention values of each phrase in Q and $C^+$, respectively, for the (Q, $C^+$) pair of the TE example in Figure 1. We can see that k-min-pooling over these distributions can indeed detect some key phrases that are supposed to determine the pair relations. Taking Figure 5(a) as an example, the phrase "young boys", phrases ending with "and", the phrases "smiling", "is smiling", "nearby" and a couple of phrases ending with "nearby" have the lowest attention values. According to our k-min-pooling step, these phrases will be detected as key phrases. Considering further Figure 5(b), the phrase "kids", phrases ending with "near", and a couple of phrases ending with "smile" are detected as key phrases.
If we look at the key phrases in both sentences, we can see that the discovery of those key phrases matches our analysis in Section 1 for the TE example: "kids" corresponds to "young boys", and "smiling nearby" corresponds to "near...smile".
Another interesting phenomenon is that, even though "are playing outdoors" can be well aligned as it appears in both sentences, the visualization figures show that the attention values of "are playing outdoors and" in Q and "are playing outdoors near" in $C^+$ drop dramatically. This hints that our model can get rid of some surface matching, as the key token "and" or "near" makes the semantics of "are playing outdoors and" and "are playing outdoors near" quite different from their sub-phrase "are playing outdoors". This is important as "and" or "near" is a crucial unit to connect the following key phrases "smiling nearby" in Q or "a smile" in $C^+$. If we connect those key phrases sequentially as a new fake sentence, as we did in the attentive pooling layer of Figure 4, we can see that the fake sentence roughly "reconstructs" the meaning of the original sentence while it is now composed of phrase-level semantic units.

Effects of Pooling Size k
The key idea of the proposed method is achieved by the k-min/max pooling. We show how the hyperparameter k influences the results by tuning on the dev sets.
In Figure 6, we can see the performance trends of changing k value between 1 and 10 in the two tasks. Roughly k > 4 can give competitive results, but larger values bring performance drop.

Conclusion
In this work, we investigate the contribution of different intensities of phrase alignments for different tasks. We argue that it is not true that stronger alignments always matter more. We found TE task prefers weaker alignments while AS task prefers stronger alignments. We proposed flexible attentive poolings in GRU system to satisfy the different requirements of different tasks. Experimental results show the soundness of our argument and the effectiveness of our attention pooling based GRU systems.
As future work, we plan to investigate phrase representation learning in context and how to conduct the attentive pooling automatically regardless of the categories of the tasks.