Modelling Interaction of Sentence Pair with coupled-LSTMs

Recently, there has been rising interest in modelling the interactions of two sentences with deep neural networks. However, most existing methods encode the two sequences with separate encoders, in which a sentence is encoded with little or no information from the other sentence. In this paper, we propose a deep architecture to model the strong interaction of a sentence pair with two coupled-LSTMs. Specifically, we introduce two ways of coupling to model the interdependences of two LSTMs, coupling the local contextualized interactions of the two sentences. We then aggregate these interactions and use dynamic pooling to select the most informative features. Experiments on two very large datasets demonstrate the efficacy of our proposed architecture and its superiority to state-of-the-art methods.


Introduction
Distributed representations of words or sentences have been widely used in many natural language processing (NLP) tasks, such as text classification (Kalchbrenner et al., 2014; Liu et al., 2015), question answering and machine translation (Sutskever et al., 2014). Among these tasks, a common problem is modelling the relevance or similarity of a sentence pair, which is also called text semantic matching.
Recently, deep learning based models have attracted substantial interest in text semantic matching and have achieved great progress (Hu et al., 2014; Wan et al., 2016).
According to the phases of interaction between two sentences, previous models can be classified into three categories.
Weak Interaction Models Some early works focus on sentence-level interactions, such as ARC-I (Hu et al., 2014), CNTN and so on. These models first encode the two sequences separately with basic (neural bag-of-words, BOW) or advanced (RNN, CNN) neural network components, and then compute the matching score from the distributed vectors of the two sentences. In this paradigm, the two sentences do not interact until the final phase.
Semi-interaction Models Some improved methods utilize multi-granularity representations (word, phrase and sentence level), such as MultiGranCNN and Multi-Perspective CNN (He et al., 2015). Another kind of model uses a soft attention mechanism to obtain the representation of one sentence conditioned on the representation of the other, such as ABCNN and Attention LSTM (Rocktäschel et al., 2015; Hermann et al., 2015). These models can alleviate the weak interaction problem, but are still insufficient to model the contextualized interaction at the word and phrase levels.
Strong Interaction Models These models directly build an interaction space between two sentences and model the interaction at different positions, such as ARC-II (Hu et al., 2014), MV-LSTM (Wan et al., 2016) and DF-LSTMs (Liu et al., 2016). These models can easily capture the difference between the semantic capacity of the two sentences.
In this paper, we propose a new deep neural network architecture to model the strong interactions of two sentences. Unlike approaches that model two sentences with separate LSTMs, we use two interdependent LSTMs, called coupled-LSTMs, which fully affect each other at different time steps. The output of the coupled-LSTMs at each step depends on both sentences. Specifically, we propose two ways of making the coupled-LSTMs interdependent: a loosely coupled model (LC-LSTMs) and a tightly coupled model (TC-LSTMs). Similar to the bidirectional LSTM for a single sentence (Schuster and Paliwal, 1997; Graves and Schmidhuber, 2005), there are four directions that can be used in coupled-LSTMs. To utilize the information from all four directions, we aggregate them and adopt a dynamic pooling strategy to automatically select the most informative interaction signals. Finally, we feed them into a fully connected layer, followed by an output layer that computes the matching score.
The contributions of this paper can be summarized as follows.
1. Unlike architectures that use a similarity matrix, our proposed architecture directly models the strong interactions of two sentences with coupled-LSTMs, which can capture the useful local semantic relevance of the two sentences. Our architecture can also capture interactions at multiple granularities through several stacked coupled-LSTMs layers.
2. Compared to previous works on text matching, we perform extensive empirical studies on two very large datasets. The massive scale of the datasets allows us to train a very deep neural network and to present an elaborate qualitative analysis of our models, which gives an intuitive understanding of how our model works.

Sentence Modelling with LSTM
Long short-term memory network (LSTM) (Hochreiter and Schmidhuber, 1997) is a type of recurrent neural network (RNN) (Elman, 1990) that specifically addresses the issue of learning long-term dependencies. We define the LSTM units at each time step $t$ to be a collection of vectors in $\mathbb{R}^d$: an input gate $i_t$, a forget gate $f_t$, an output gate $o_t$, a memory cell $c_t$ and a hidden state $h_t$. $d$ is the number of LSTM units. The elements of the gating vectors $i_t$, $f_t$ and $o_t$ are in $[0, 1]$.
The LSTM is precisely specified as follows.
$$\begin{bmatrix} \tilde{c}_t \\ o_t \\ i_t \\ f_t \end{bmatrix} = \begin{bmatrix} \tanh \\ \sigma \\ \sigma \\ \sigma \end{bmatrix} T_{A,b} \begin{bmatrix} x_t \\ h_{t-1} \end{bmatrix}, \tag{1}$$
$$c_t = \tilde{c}_t \odot i_t + c_{t-1} \odot f_t, \tag{2}$$
$$h_t = o_t \odot \tanh(c_t), \tag{3}$$
where $x_t$ is the input at the current time step, $T_{A,b}$ is an affine transformation which depends on the network parameters $A$ and $b$, $\sigma$ denotes the logistic sigmoid function and $\odot$ denotes elementwise multiplication.
The update of each LSTM unit can be written precisely as follows.
$$(h_t, c_t) = \mathrm{LSTM}(h_{t-1}, c_{t-1}, x_t). \tag{4}$$
Here, the function $\mathrm{LSTM}(\cdot, \cdot, \cdot)$ is a shorthand for Eq. (1-3).
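As an illustration of the LSTM update described in this section, the following is a minimal plain-Python sketch of a single step. The function names, the ordering of the rows of `A` (candidate, output gate, input gate, forget gate) and the helper `init_params` are assumptions made for this example only; the uniform initialization range follows the experimental setup reported later in the paper.

```python
import math
import random


def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(A, b, v):
    """Return A v + b for a matrix A (list of rows), bias b and vector v."""
    return [sum(a * x for a, x in zip(row, v)) + bi for row, bi in zip(A, b)]


def lstm_step(A, b, h_prev, c_prev, x):
    """One LSTM update: maps (h_{t-1}, c_{t-1}, x_t) to (h_t, c_t).

    A has 4*d rows and len(x) + d columns. The row blocks are assumed to be
    ordered as [candidate memory, output gate, input gate, forget gate].
    """
    d = len(h_prev)
    z = affine(A, b, list(x) + list(h_prev))
    cand = [math.tanh(v) for v in z[0:d]]
    o = [sigmoid(v) for v in z[d:2 * d]]
    i = [sigmoid(v) for v in z[2 * d:3 * d]]
    f = [sigmoid(v) for v in z[3 * d:4 * d]]
    # New memory: gated candidate plus gated previous memory.
    c = [ct * it + cp * ft for ct, it, cp, ft in zip(cand, i, c_prev, f)]
    # Hidden state: output gate applied to the squashed memory.
    h = [ot * math.tanh(cv) for ot, cv in zip(o, c)]
    return h, c


def init_params(d_in, d, seed=0):
    """Hypothetical helper: weights sampled uniformly from [-0.1, 0.1]."""
    rng = random.Random(seed)
    A = [[rng.uniform(-0.1, 0.1) for _ in range(d_in + d)] for _ in range(4 * d)]
    b = [0.0] * (4 * d)
    return A, b
```

Encoding a sentence then amounts to calling `lstm_step` once per word, starting from zero vectors for `h_prev` and `c_prev`.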

Coupled-LSTMs for Strong Sentence Interaction
To deal with two sentences, one straightforward method is to model them with two separate LSTMs. However, it is difficult for this method to model the local interactions of the two sentences. An improved way is to introduce an attention mechanism, which has been used in many tasks, such as machine translation (Bahdanau et al., 2014) and question answering (Hermann et al., 2015). Inspired by the multi-dimensional recurrent neural network (Graves et al., 2007; Graves and Schmidhuber, 2009; Byeon et al., 2015) and the grid LSTM (Kalchbrenner et al., 2015) from the computer vision community, we propose two models to capture the interdependences between two parallel LSTMs, called coupled-LSTMs (C-LSTMs).
To facilitate the description of our models, we first give some definitions. Given two sequences $X = x_1, x_2, \cdots, x_n$ and $Y = y_1, y_2, \cdots, y_m$, we let $x_i \in \mathbb{R}^d$ denote the embedded representation of the word $x_i$. The standard LSTM has one temporal dimension. When dealing with a sentence, the LSTM regards the position as the time step. At position $i$ of sentence $x_{1:n}$, the output $h_i$ reflects the meaning of the subsequence $x_{0:i}$. To model the interaction of two sentences as early as possible, we define $h_{i,j}$ to represent the interaction of the subsequences $x_{0:i}$ and $y_{0:j}$. Figure 1(c) and 1(d) illustrate our two proposed models. For an intuitive comparison with weak interaction models, we also show parallel LSTMs and attention LSTMs in Figure 1(a) and 1(b).¹
We describe our two proposed models as follows.

Loosely Coupled-LSTMs (LC-LSTMs)
To model the local contextual interactions of two sentences, we enable two LSTMs to be interdependent at different positions. Inspired by Grid LSTM (Kalchbrenner et al., 2015) and word-by-word attention LSTMs (Rocktäschel et al., 2015), we propose a loosely coupled model for two interdependent LSTMs.
More concretely, we refer to $h^{(1)}_{i,j}$ as the encoding of subsequence $x_{0:i}$ in the first LSTM influenced by the output of the second LSTM on subsequence $y_{0:j}$. Meanwhile, $h^{(2)}_{i,j}$ is the encoding of subsequence $y_{0:j}$ in the second LSTM influenced by the output of the first LSTM on subsequence $x_{0:i}$.
¹ In the model of Rocktäschel et al. (2015), a conditioned LSTM was used, meaning that h
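One way to write the loose coupling described above is sketched below in LaTeX. This is an assumed formalization that follows the prose (two LSTMs with separate memory cells, each reading the merged hidden states of both LSTMs at the preceding position along its own axis). The superscripts $(1)$ and $(2)$, the symbols $H^{(1)}_{i-1}$ and $H^{(2)}_{j-1}$, and the use of concatenation for merging are notational assumptions, not a quotation of the original equations.

```latex
\begin{aligned}
\bigl(h^{(1)}_{i,j},\, c^{(1)}_{i,j}\bigr) &= \mathrm{LSTM}^{1}\bigl(H^{(1)}_{i-1},\, c^{(1)}_{i-1,j},\, x_i\bigr), \\
\bigl(h^{(2)}_{i,j},\, c^{(2)}_{i,j}\bigr) &= \mathrm{LSTM}^{2}\bigl(H^{(2)}_{j-1},\, c^{(2)}_{i,j-1},\, y_j\bigr), \\
H^{(1)}_{i-1} &= \bigl[h^{(1)}_{i-1,j};\; h^{(2)}_{i-1,j}\bigr], \\
H^{(2)}_{j-1} &= \bigl[h^{(1)}_{i,j-1};\; h^{(2)}_{i,j-1}\bigr].
\end{aligned}
```

Under this reading, the first LSTM advances along sentence $X$ and the second along sentence $Y$, and each sees the other's hidden state only through the merged vector, while the memory cells stay separate.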

Tightly Coupled-LSTMs (TC-LSTMs)
The hidden states of LC-LSTMs are the combination of the hidden states of two interdependent LSTMs, whose memory cells are separate. Inspired by the configuration of the multi-dimensional LSTM (Byeon et al., 2015), we further conflate both the hidden states and the memory cells of the two LSTMs. We assume that $h_{i,j}$ directly models the interaction of the subsequences $x_{0:i}$ and $y_{0:j}$, which depends on the two previous interactions $h_{i-1,j}$ and $h_{i,j-1}$, where $i$ and $j$ are the positions in sentences $X$ and $Y$. We define the tightly coupled-LSTMs unit as follows.
where the gating units $i_{i,j}$ and $o_{i,j}$ determine which memory units are affected by the inputs through $\tilde{c}_{i,j}$, and which memory cells are written to the hidden units $h_{i,j}$. $T_{A,b}$ is an affine transformation which depends on the network parameters $A$ and $b$. In contrast to the standard LSTM defined over time, each memory unit $c_{i,j}$ of the tightly coupled-LSTMs has two preceding states $c_{i,j-1}$ and $c_{i-1,j}$ and two corresponding forget gates $f^{1}_{i,j}$ and $f^{2}_{i,j}$.
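The description above suggests a unit of the following form, given here in LaTeX as a sketch. It is an assumed reconstruction from the prose: the stacking order of the inputs and the pairing of $f^{1}_{i,j}$ with $c_{i,j-1}$ and of $f^{2}_{i,j}$ with $c_{i-1,j}$ are assumptions based on the order in which the text lists them.

```latex
\begin{aligned}
\begin{bmatrix} \tilde{c}_{i,j} \\ o_{i,j} \\ i_{i,j} \\ f^{1}_{i,j} \\ f^{2}_{i,j} \end{bmatrix}
&= \begin{bmatrix} \tanh \\ \sigma \\ \sigma \\ \sigma \\ \sigma \end{bmatrix}
T_{A,b} \begin{bmatrix} x_i \\ y_j \\ h_{i,j-1} \\ h_{i-1,j} \end{bmatrix}, \\
c_{i,j} &= \tilde{c}_{i,j} \odot i_{i,j} + c_{i,j-1} \odot f^{1}_{i,j} + c_{i-1,j} \odot f^{2}_{i,j}, \\
h_{i,j} &= o_{i,j} \odot \tanh(c_{i,j}).
\end{aligned}
```

Compared with a standard LSTM, the only structural change in this sketch is that the affine map reads both words and both neighbouring hidden states, and the memory update sums two gated predecessors instead of one.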

Figure 2: Architecture of coupled-LSTMs for sentence-pair encoding. Inputs are fed to four C-LSTMs followed by an aggregation layer. Blue cuboids represent different contextual information from four directions. Stages shown in the figure: Input Layer, Stacked C-LSTMs, Pooling Layer, Output Layer.
Here, C-LSTMs denotes either TC-LSTMs or LC-LSTMs.
The input consists of two types of information at step $(i, j)$ in coupled-LSTMs: the temporal dimension and the depth dimension. The difference between TC-LSTMs and LC-LSTMs lies in how they depend on information from the temporal and depth dimensions.

Interaction Between Temporal Dimensions
The TC-LSTMs model the interactions at position $(i, j)$ by merging the internal memories $c_{i-1,j}$, $c_{i,j-1}$ and the hidden states $h_{i-1,j}$, $h_{i,j-1}$ along the row and column dimensions. In contrast, LC-LSTMs first use two standard LSTMs in parallel, producing hidden states $h^{(1)}_{i,j}$ and $h^{(2)}_{i,j}$ along the row and column dimensions respectively, which are then merged and passed on to the next step.

Interaction Between Depth Dimension
In TC-LSTMs, each hidden state $h_{i,j}$ at a higher layer receives a fusion of the information $x_i$ and $y_j$ flowing from the lower layer. In LC-LSTMs, however, the information $x_i$ and $y_j$ is accepted separately by the two corresponding LSTMs at the higher layer.
The two architectures have their own characteristics: TC-LSTMs give stronger interactions among different dimensions, while LC-LSTMs ensure that the two sequences interact closely without being conflated, by using two separate LSTMs.

Comparison of LC-LSTMs and word-by-word Attention LSTMs
The characteristic of attention LSTMs is that they obtain the attention-weighted representation of one sentence by considering the alignment between the two sentences, which is an asymmetric, unidirectional encoding. In contrast, in LC-LSTMs, each hidden state at each step is obtained by considering the interaction between the two sequences in a symmetrical encoding fashion.

End-to-End Architecture for Sentence Matching
In this section, we present an end-to-end deep architecture for matching two sentences, as shown in Figure 2.

Embedding Layer
To model the sentences with a neural model, we first need to transform the one-hot representation of each word into a distributed representation. All words of the two sequences $X = x_1, x_2, \cdots, x_n$ and $Y = y_1, y_2, \cdots, y_m$ are mapped into low-dimensional vector representations, which are taken as the input of the network.

Stacked Coupled-LSTMs Layers
A basic block consists of five layers. We first use four directional coupled-LSTMs to model the local interactions with different information flows. We then sum the outputs of these LSTMs in an aggregation layer. To increase the learning capacity of the coupled-LSTMs, we stack basic blocks on top of each other.

Four Directional Coupled-LSTMs Layers
The C-LSTMs are defined along a certain predefined direction; we can extend them to access the surrounding context in all directions. Similar to the bi-directional LSTM, there are four directions in coupled-LSTMs.

Aggregation Layer
The aggregation layer sums the outputs of the four directional coupled-LSTMs into a vector:
$$h_{i,j} = \sum_{t=1}^{4} h^{t}_{i,j},$$
where the superscript $t$ of $h_{i,j}$ denotes the different directions.
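The four directional passes and their aggregation can be sketched as follows. This sketch assumes that a directional pass is equivalent to running the same grid computation on reversed copies of one or both sentences and then flipping the result back so that positions align; `grid_fn` is a hypothetical stand-in for a C-LSTMs layer, and `prefix_sum_grid` is a toy replacement used only to make the example runnable.

```python
def run_four_directions(grid_fn, X, Y):
    """Run grid_fn along four scan directions and sum the aligned outputs.

    grid_fn(X, Y) must return a nested list H where H[i][j] is a vector
    describing the interaction of X[:i+1] and Y[:j+1].
    """
    total = None
    for flip_x in (False, True):
        for flip_y in (False, True):
            Xd = X[::-1] if flip_x else X
            Yd = Y[::-1] if flip_y else Y
            H = grid_fn(Xd, Yd)
            # Undo the flips so H[i][j] refers to the original position (i, j).
            if flip_x:
                H = H[::-1]
            if flip_y:
                H = [row[::-1] for row in H]
            if total is None:
                total = [[list(v) for v in row] for row in H]
            else:
                for i, row in enumerate(H):
                    for j, v in enumerate(row):
                        total[i][j] = [a + b for a, b in zip(total[i][j], v)]
    return total


def prefix_sum_grid(X, Y):
    """Toy stand-in for a C-LSTMs layer: H[i][j] = [sum(X[:i+1]) + sum(Y[:j+1])]."""
    H = []
    sx = 0.0
    for x in X:
        sx += x
        row, sy = [], 0.0
        for y in Y:
            sy += y
            row.append([sx + sy])
        H.append(row)
    return H
```

With the toy layer, each aggregated cell combines prefix and suffix information from both sentences, which mirrors how the summed directional states give every position access to its full surrounding context.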

Stacking C-LSTMs Blocks
To increase the capability of the network to learn multiple granularities of interactions, we stack several blocks (four C-LSTMs layers and one aggregation layer) to form deep architectures.

Pooling Layer
The output of the stacked coupled-LSTMs layers is a tensor $H \in \mathbb{R}^{n \times m \times d}$, where $n$ and $m$ are the lengths of the sentences, and $d$ is the number of hidden neurons. We apply dynamic pooling to automatically extract a subsampled matrix in $\mathbb{R}^{p \times q}$ from each slice $H_i \in \mathbb{R}^{n \times m}$, similar to Socher et al. (2011).
More formally, for each slice matrix $H_i$, we partition the rows and columns of $H_i$ into $p \times q$ roughly equal, non-overlapping grids. Then we select the maximum value within each grid, thereby obtaining a $p \times q \times d$ tensor.
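The partition-and-max procedure above can be sketched in a few lines of plain Python. The function names and the specific rule for choosing grid boundaries (integer division of the index range) are assumptions for illustration; the sketch also assumes that both sentences are at least as long as the pooling size, and does not cover how shorter sentences would be handled.

```python
def split_points(length, parts):
    """Boundaries that cut range(length) into `parts` roughly equal chunks."""
    return [(k * length) // parts for k in range(parts + 1)]


def dynamic_max_pool(H, p, q):
    """Max-pool an n x m x d nested list H down to a p x q x d nested list.

    Assumes n >= p and m >= q so that every grid cell is non-empty.
    """
    n, m, d = len(H), len(H[0]), len(H[0][0])
    if n < p or m < q:
        raise ValueError("sentence lengths must be at least the pooling size")
    rows, cols = split_points(n, p), split_points(m, q)
    pooled = []
    for a in range(p):
        out_row = []
        for b in range(q):
            # Collect all d-dimensional vectors that fall into grid cell (a, b).
            cell = [H[i][j]
                    for i in range(rows[a], rows[a + 1])
                    for j in range(cols[b], cols[b + 1])]
            # Take the maximum separately in each of the d slices.
            out_row.append([max(v[k] for v in cell) for k in range(d)])
        pooled.append(out_row)
    return pooled
```

Because the grid boundaries scale with `n` and `m`, sentence pairs of different lengths all map to the same fixed-size `p x q x d` output, which is what allows a fully-connected layer to follow.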

Fully-Connected Layer
The vector obtained by the pooling layer is fed into a fully-connected layer to obtain a final, more abstract representation.

Output Layer
The form of the output layer depends on the type of task. There are two popular types of text matching tasks in NLP. One is the ranking task, such as community question answering. The other is the classification task, such as textual entailment.
1. For the ranking task, the output is a scalar matching score, which is obtained by a linear transformation after the last fully-connected layer.
2. For the classification task, the outputs are the probabilities of the different classes, which are computed by a softmax function after the last fully-connected layer.
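The two output forms can be illustrated with a small sketch. The parameter names `w`, `W` and `b`, and the function names, are hypothetical; `features` stands for the output of the last fully-connected layer.

```python
import math


def ranking_score(w, b, features):
    """Scalar matching score: a linear map of the last hidden representation."""
    return sum(wi * fi for wi, fi in zip(w, features)) + b


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def class_probabilities(W, b, features):
    """Class probabilities: softmax over one linear score per class."""
    logits = [sum(wi * fi for wi, fi in zip(row, features)) + bi
              for row, bi in zip(W, b)]
    return softmax(logits)
```

Only this final layer changes between the two task types; everything up to the fully-connected layer is shared.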

Training
Our proposed architecture can deal with different sentence matching tasks. The loss function varies with the task. More concretely, we use max-margin loss (Bordes et al., 2013; Socher et al., 2013) for the ranking task and cross-entropy loss for the classification task.
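For concreteness, the two loss functions can be sketched as below. The margin value of 1.0 and the choice to sum the hinge terms over all negative candidates are assumptions for this example; the text does not state them.

```python
import math


def max_margin_loss(pos_score, neg_scores, margin=1.0):
    """Hinge-style ranking loss: each negative should score at least
    `margin` below the positive (margin value is an assumption)."""
    return sum(max(0.0, margin - pos_score + s) for s in neg_scores)


def cross_entropy_loss(probs, label):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[label])
```

For example, a positive pair scored 2.0 against negatives scored 0.5, 1.5 and 3.0 incurs no penalty for the first negative and penalties of 0.5 and 2.0 for the other two.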
To minimize the objective, we use stochastic gradient descent with the diagonal variant of AdaGrad (Duchi et al., 2011). To prevent exploding gradients, we perform gradient clipping by scaling the gradient when the norm exceeds a threshold (Graves, 2013).
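A minimal sketch of one optimization step under these choices is given below: the gradient is rescaled when its norm exceeds a threshold, then a diagonal AdaGrad update is applied. The function names and the small constant `eps` are assumptions for illustration.

```python
import math


def clip_by_norm(grad, threshold):
    """Rescale grad so that its L2 norm does not exceed threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in grad]
    return list(grad)


def adagrad_update(params, grad, accum, lr, threshold, eps=1e-8):
    """In-place diagonal AdaGrad step using the clipped gradient.

    accum holds the running sum of squared gradients per parameter.
    """
    g = clip_by_norm(grad, threshold)
    for k, gk in enumerate(g):
        accum[k] += gk * gk
        params[k] -= lr * gk / (math.sqrt(accum[k]) + eps)
    return params, accum
```

Because each parameter is divided by its own accumulated gradient magnitude, frequently updated parameters take smaller steps over time, while clipping bounds the size of any single update.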

Experiment
In this section, we investigate the empirical performance of our proposed model on two different text matching tasks: a classification task (recognizing textual entailment) and a ranking task (matching of question and answer).

Hyperparameters and Training
The word embeddings for all of the models are initialized with the 100d GloVe vectors (840B token version; Pennington et al., 2014) and fine-tuned during training to improve the performance. The other parameters are initialized by randomly sampling from a uniform distribution in $[-0.1, 0.1]$.
For each task, we take the hyperparameters which achieve the best performance on the development set via a small grid search over combinations of the initial learning rate [0.05, 0.0005, 0.0001], $l_2$ regularization [0.0, 5E−5, 1E−5, 1E−6] and the threshold value of the gradient norm [5, 10, 100]. The final hyperparameters are listed in Table 1.
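The grid search over these three hyperparameters can be sketched as follows. The candidate values are those listed above; `dev_score` is a hypothetical callable that would train a model with the given setting and return its development-set performance.

```python
import itertools

LEARNING_RATES = [0.05, 0.0005, 0.0001]
L2_WEIGHTS = [0.0, 5e-5, 1e-5, 1e-6]
CLIP_THRESHOLDS = [5, 10, 100]


def grid_search(dev_score):
    """Return the (lr, l2, clip) setting with the highest dev_score.

    dev_score(lr, l2, clip) is a hypothetical function supplied by the
    caller; it is assumed to train and evaluate one configuration.
    """
    best_cfg, best = None, float("-inf")
    for lr, l2, clip in itertools.product(
            LEARNING_RATES, L2_WEIGHTS, CLIP_THRESHOLDS):
        score = dev_score(lr, l2, clip)
        if score > best:
            best_cfg, best = (lr, l2, clip), score
    return best_cfg, best
```

The full grid contains 3 × 4 × 3 = 36 configurations per task.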

Competitor Methods
• Neural bag-of-words (NBOW): Each sequence is represented as the sum of the embeddings of the words it contains; the two representations are then concatenated and fed to an MLP.
• Single LSTM: A single LSTM encodes the two sequences, as used in Rocktäschel et al. (2015).
• Parallel LSTMs: The two sequences are encoded by two LSTMs separately; the two representations are then concatenated and fed to an MLP.
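As a point of reference for the simplest baseline, the NBOW pair representation can be sketched as below. The function names and the dictionary-based embedding table are assumptions for illustration, and the sketch assumes every token is in the vocabulary; the MLP that consumes the features is omitted.

```python
def nbow_encode(tokens, embeddings):
    """Sum the embedding vectors of the tokens in one sequence."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for tok in tokens:
        total = [a + b for a, b in zip(total, embeddings[tok])]
    return total


def nbow_pair_features(tokens_x, tokens_y, embeddings):
    """Concatenate the two sentence vectors; this is the input to the MLP."""
    return nbow_encode(tokens_x, embeddings) + nbow_encode(tokens_y, embeddings)
```

In this baseline the two sentences meet only at the concatenation, which is why it serves as a weak interaction reference point.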

Experiment-I: Recognizing Textual Entailment
Recognizing textual entailment (RTE) is a task to determine the semantic relationship between two sentences. We use the Stanford Natural Language Inference Corpus (SNLI) (Bowman et al., 2015). This corpus contains 570K sentence pairs, and all of the sentences and labels stem from human annotators. SNLI is two orders of magnitude larger than all other existing RTE corpora. Therefore, the massive scale of SNLI allows us to train powerful neural networks such as our proposed architecture in this paper. Table 2 shows the evaluation results on SNLI. The 3rd column of the table gives the number of parameters of different models without the word embeddings.

Results
Our two proposed C-LSTMs models with four stacked blocks outperform all the competitor models, which indicates that our thinner and deeper network works effectively. Besides, both LC-LSTMs and TC-LSTMs benefit from the multi-directional layer, while the latter gains more than the former. We attribute this discrepancy between the two models to their different mechanisms for controlling the information flow from the depth dimension.
Compared with attention LSTMs, our two models achieve comparable results using far fewer parameters (nearly 1/5). By stacking C-LSTMs, their performance improves significantly, and the four stacked TC-LSTMs achieve 85.1% accuracy on this dataset.
Moreover, TC-LSTMs achieve better performance than LC-LSTMs on this task, which needs fine-grained reasoning over pairs of words as well as phrases.

Understanding Behaviors of Neurons in C-LSTMs
To get an intuitive understanding of how the C-LSTMs work on this problem, we examined the neuron activations in the last aggregation layer while evaluating the test set using TC-LSTMs. We find that some cells are bound to certain roles.
Let $h_{i,j,k}$ denote the activation of the $k$-th neuron at position $(i, j)$, where $i \in \{1, \ldots, n\}$ and $j \in \{1, \ldots, m\}$. By visualizing the hidden state $h_{i,j,k}$ and analyzing the maximum activation, we can find that there exist multiple interpretable neurons. For example, when some contextualized local perspectives are semantically related at point $(i, j)$ of the sentence pair, the activation value of hidden neuron $h_{i,j,k}$ tends to be maximal, meaning that the model can capture some reasoning patterns. Figure 3 illustrates this phenomenon. In Figure 3(a), a neuron shows its ability to monitor the local contextual interactions about color. The activation in the patch that includes the word pair "(red, green)" is much higher than the others. This is an informative pattern for the relation prediction of these two sentences, whose ground truth is contradiction. Interestingly, there are two words describing color in the sentence "A person in a red shirt and black pants hunched over.". Our model ignores the useless word "black", which indicates that this neuron selectively captures patterns by contextual understanding, not just word-level interaction.
In Figure 3(b), another neuron shows that it can capture local contextual interactions such as "(walking down the street, outside)". These patterns can be easily captured by the pooling layer and provide strong support for the final prediction. Table 3 illustrates multiple interpretable neurons and some representative word or phrase pairs which can activate these neurons. These cases show that our models can capture contextual interactions beyond the word level.

Table 3: Interpretable neurons and representative word or phrase pairs that activate them.
Index of Cell | Word or Phrase Pairs
3rd | (in a pool, swimming), (near a fountain, next to the ocean), (street, outside)
9th | (doing a skateboard, skateboarding), (sidewalk with, inside), (standing, seated)
17th | (blue jacket, blue jacket), (wearing black, wearing white), (green uniform, red uniform)
25th | (a man, two other men), (a man, two girls), (an old woman, two people)

Error Analysis
Although our C-LSTMs models are more sensitive to the discrepancy of the semantic capacity between two sentences, some semantic mistakes at the phrasal level still exist. For example, our models failed to capture the key informative pattern when predicting the entailment sentence pair "A girl takes off her shoes and eats blue cotton candy/The girl is eating while barefoot." Besides, despite the large size of the training corpus, it is still very difficult to solve some cases that depend on the combination of world knowledge and context-sensitive inferences. For example, given an entailment pair "a man grabs his crotch during a political demonstration/The man is making a crude gesture", all models predict "neutral". This analysis suggests that some architectural improvements or external world knowledge are necessary to eliminate all errors, instead of simply scaling up the basic model.

Experiment-II: Matching Question and Answer
Matching question answering (MQA) is a typical task for semantic matching. Given a question, we need to select a correct answer from some candidate answers.
In this paper, we use the dataset collected from Yahoo! Answers with the getByCategory function provided in the Yahoo! Answers API, which produces 963,072 questions and corresponding best answers. We then select the pairs in which the lengths of the question and the answer are both in the interval [4, 30], thus obtaining 220,000 question-answer pairs to form the positive pairs. For negative pairs, we first use each question's best answer as a query to retrieve the top 1,000 results from the whole answer set with Lucene, from which 4 or 9 answers are selected randomly to construct the negative pairs.
The whole dataset is divided into training, validation and testing data with proportion 20 : 1 : 1. Moreover, we give two test settings: selecting the best answer from 5 and 10 candidates respectively.
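The candidate construction and the 20 : 1 : 1 split can be sketched as follows. Here `retrieved` stands for the list of answers returned by the retrieval step (the Lucene query itself is not shown), and excluding the best answer from the negative pool is an assumption of this sketch; the function names are hypothetical.

```python
import random


def build_candidates(best_answer, retrieved, num_negatives, rng):
    """Return [best_answer] followed by randomly sampled negative answers.

    `retrieved` is assumed to be the top results for the best-answer query;
    the best answer itself is removed from the pool (an assumption).
    """
    pool = [a for a in retrieved if a != best_answer]
    negatives = rng.sample(pool, num_negatives)
    return [best_answer] + negatives


def split_20_1_1(items, rng):
    """Shuffle and split items into train/validation/test with ratio 20:1:1."""
    items = list(items)
    rng.shuffle(items)
    unit = len(items) // 22
    valid = items[:unit]
    test = items[unit:2 * unit]
    train = items[2 * unit:]
    return train, valid, test
```

Calling `build_candidates` with `num_negatives=4` or `num_negatives=9` yields the 5-candidate and 10-candidate test settings, respectively.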

Results
Results of MQA are shown in Table 4. For our models, since stacking more than three blocks does not bring significant improvements on this task, we use just three stacked C-LSTMs.
By analyzing the evaluation results of question-answer matching in Table 4, we can see that strong interaction models (attention LSTMs, our C-LSTMs) consistently outperform the weak interaction models (NBOW, parallel LSTMs) by a large margin, which suggests the importance of modelling the strong interaction of two sentences.
Our two proposed C-LSTMs surpass the competitor methods, and C-LSTMs augmented with multi-directional layers and multiple stacked blocks fully utilize multiple levels of abstraction to directly boost the performance.
Additionally, LC-LSTMs are superior to TC-LSTMs. The reason may be that MQA is a relatively simple task, which requires less reasoning ability than the RTE task. Moreover, LC-LSTMs have fewer parameters than TC-LSTMs, which helps the former avoid overfitting on a relatively smaller corpus.

Related Work
Our architecture for sentence pair encoding can be regarded as strong interaction models, which have been explored in previous models.
An intuitive paradigm is to compute similarities between all the words or phrases of the two sentences. Socher et al. (2011) first used this paradigm for paraphrase detection. The representations of words or phrases are learned based on recursive autoencoders.
A major limitation of this paradigm is that the interaction of the two sentences is captured by a pre-defined similarity measure. Thus, it is not easy to increase the depth of the network. In contrast, we can stack our C-LSTMs to model multiple-granularity interactions of two sentences. Rocktäschel et al. (2015) used two LSTMs equipped with an attention mechanism to capture the interaction between two sentences. This architecture is asymmetrical for the two sentences, and the final representation obtained is sensitive to the order of the two sentences.
Compared with the attentive LSTM, our proposed C-LSTMs are symmetrical and model the local contextual interaction of two sequences directly.

Conclusion and Future Work
In this paper, we propose an end-to-end deep architecture to capture the strong interaction information of a sentence pair. Experiments on two large-scale text matching tasks demonstrate the efficacy of our proposed model and its superiority to competitor models. Besides, we present an elaborate qualitative analysis of our models, which gives an intuitive understanding of how our model works.
In future work, we would like to incorporate some gating strategies into the depth dimension of our proposed models, like highway or residual networks, to enhance the interactions between depth and other dimensions, thus training deeper and more powerful neural networks.