Convolutional Interaction Network for Natural Language Inference

Attention-based neural models have achieved great success in natural language inference (NLI). In this paper, we propose the Convolutional Interaction Network (CIN), a general model to capture the interaction between two sentences, which can be an alternative to the attention mechanism for NLI. Specifically, CIN encodes one sentence with the filters dynamically generated based on another sentence. Since the filters may be designed to have various numbers and sizes, CIN can capture more complicated interaction patterns. Experiments on three large datasets demonstrate CIN’s efficacy.


Introduction
Natural language inference (NLI) is a pivotal and challenging natural language processing (NLP) task. The goal of NLI is to identify the logical relationship (entailment, neutral, or contradiction) between a premise and a corresponding hypothesis. NLI is also related to many other NLP tasks under the paradigm of semantic matching of two texts, such as question answering Hu et al. (2014); Wan et al. (2016) and information retrieval Liu et al. (2015). An essential challenge is to capture the semantic relevance of two sentences, which remains difficult because of the semantic gap (or lexical chasm) problem.
Recently, deep learning has attracted substantial interest in natural language inference and has achieved great progress Hu et al. (2014); Parikh et al. (2016); Chen et al. (2017a). To model the complicated semantic relationship between two sentences, previous models rely heavily on various attention mechanisms Bahdanau et al. (2014); Vaswani et al. (2017) to build the interaction at different granularities (word, phrase and sentence level), such as ABCNN Yin et al. (2016), Attention LSTM Rocktäschel et al. (2015), bidirectional attention LSTM Chen et al. (2017a), and so on. While attention is very successful in natural language inference, its mechanism is quite simple and can be regarded as a weighted sum of the target vectors. This paradigm lacks the flexibility needed to model more complicated interactions.
In this paper, we propose a new interaction module, called Convolutional Interaction Network (CIN), which can serve as an alternative to the attention mechanism. Specifically, CIN utilizes a convolutional neural network to extract valuable features (or representations) from sentences. In the case of NLI, whether a feature of one sentence is important depends on the other sentence. Inspired by the idea of using one network to generate the parameters of another network Ha et al. (2016a), we introduce a filter generation network to dynamically generate convolutional filters. Each sentence is convolved with a filter dynamically generated from the other sentence. Thus, the convolved features of one sentence can be regarded as context-aware representations under the influence of the other sentence.
* Corresponding Author. † Contribution during internship at Fudan University.
The contributions of this paper can be summarized as follows.
1. CIN is a new interaction model, proposed as an alternative to the attention model. CIN can capture both the intra-sentence and the inter-sentence interactions of two sentences.
2. Compared to the attention model, CIN is more general and more flexible in capturing complicated interactions. As discussed in Section 3.3, the attention model is approximately equivalent to a special case of CIN.
3. We perform extensive empirical studies on three very large datasets. Experimental results demonstrate that our proposed architecture is effective for natural language inference.

Attentive Interaction for Natural Language Inference
Currently, the dominant approach to natural language inference is to use an attention mechanism to model the interaction between two sentences. Given two input sentences $x = [x_1, x_2, \cdots, x_m]$ and $y = [y_1, y_2, \cdots, y_n]$ with lengths $m$ and $n$ respectively, we first encode them into two vectorial sequences $X = [x_1, x_2, \cdots, x_m] \in \mathbb{R}^{d \times m}$ and $Y = [y_1, y_2, \cdots, y_n] \in \mathbb{R}^{d \times n}$. The encoder usually consists of one or several CNN/RNN layers to get the context-aware token representations.

Word2word Attentive Interaction
The word2word attention captures the dependency between two words $x_i$ and $y_j$, one from each of the two sentences.
The word2word attention computes a similarity matrix $M \in \mathbb{R}^{m \times n}$, in which each element $m_{i,j}$ is the alignment score between $x_i$ and $y_j$,

$$m_{i,j} = f(x_i, y_j),$$

where $f$ is a score function.
The two most prevalent attention functions are multiplicative attention and additive attention. Multiplicative attention is

$$f(x_i, y_j) = x_i^\top y_j.$$

Additive attention computes a compatibility function by a feed-forward network with a single hidden layer,

$$f(x_i, y_j) = w^\top \tanh(W_1 x_i + W_2 y_j + b),$$

where $w$, $W_1$, $W_2$ and $b$ are learnable parameters.
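To make the additive score concrete, the following is a minimal NumPy sketch, not the authors' code. It assumes a tanh hidden layer (a common choice, not stated in the text) and a hypothetical hidden size `h`; the function name `additive_scores` is ours.

```python
import numpy as np

def additive_scores(X, Y, W1, W2, w, b):
    """Additive alignment scores for all word pairs.

    X: (d, m), Y: (d, n); W1, W2: (h, d); w, b: (h,). Returns M of shape (m, n).
    The tanh non-linearity is an assumption, not taken from the paper.
    """
    hx = W1 @ X                                   # (h, m) projection of every x_i
    hy = W2 @ Y                                   # (h, n) projection of every y_j
    # Broadcast to one hidden vector per (i, j) pair: shape (h, m, n).
    hidden = np.tanh(hx[:, :, None] + hy[:, None, :] + b[:, None, None])
    return np.einsum("h,hmn->mn", w, hidden)      # project each hidden vector onto w

rng = np.random.default_rng(0)
d, h, m, n = 4, 6, 3, 5
X, Y = rng.normal(size=(d, m)), rng.normal(size=(d, n))
W1, W2 = rng.normal(size=(h, d)), rng.normal(size=(h, d))
w, b = rng.normal(size=h), rng.normal(size=h)
M = additive_scores(X, Y, W1, W2, w, b)
```

The broadcasted version builds an `(h, m, n)` tensor, which illustrates why additive attention tends to need more memory than a single matrix product.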
While these two kinds of attention have similar performance, multiplicative attention is more popular in practice since it requires less computation and less memory when implemented with optimized matrix multiplication. With multiplicative attention, we can compute the mimic representations for both $X$ and $Y$,

$$\tilde{X} = Y \, \mathrm{softmax}(M^\top), \tag{6}$$
$$\tilde{Y} = X \, \mathrm{softmax}(M), \tag{7}$$

where $\mathrm{softmax}(\cdot)$ is the column-wise normalization function. Each vector $\tilde{x}_i \in \tilde{X}$ is called a mimic vector, which is a weighted summation of $\{y_j\}_{j=1}^{n}$. Intuitively, the mimic vector $\tilde{x}_i$ provides the information related to token $x_i$ that is extracted from sentence $Y$.
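The mimic representations can be sketched in a few lines of NumPy. This is an illustrative sketch that assumes the dot-product score $m_{i,j} = x_i^\top y_j$ and the column layout used in the text (one token per column); the helper names are ours.

```python
import numpy as np

def softmax_columns(a):
    """Column-wise softmax: every column of the result sums to 1."""
    a = a - a.max(axis=0, keepdims=True)   # subtract the column max for stability
    e = np.exp(a)
    return e / e.sum(axis=0, keepdims=True)

def mimic_representations(X, Y):
    """X: (d, m), Y: (d, n). Returns mimic matrices of shape (d, m) and (d, n)."""
    M = X.T @ Y                            # (m, n) scores, assumed dot product
    X_tilde = Y @ softmax_columns(M.T)     # column i is a weighted sum of the y_j
    Y_tilde = X @ softmax_columns(M)       # column j is a weighted sum of the x_i
    return X_tilde, Y_tilde

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
Y = rng.normal(size=(4, 5))
X_tilde, Y_tilde = mimic_representations(X, Y)
```

Because the weights in each column are non-negative and sum to one, every mimic vector is a convex combination of the other sentence's token vectors.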

Prediction
After interaction, a prediction module is used to aggregate the interaction information and extract fixed-length representations of the two sentences. Finally, the sentence representations are fed into a feed-forward network to predict the relationship between the two sentences.

Convolutional Interaction Network
In this section, we propose a new interaction method that utilizes dynamic convolutional filters, called Convolutional Interaction Network (CIN). CIN can serve as an alternative to the attention mechanism.
We first briefly introduce how convolution works over a text sequence, then describe the proposed model and its connection to the attention model.

Convolution over Sequence
Convolution is an effective operation in deep neural networks, which convolves the input with a set of filters to extract non-linear compositional features. Although originally designed for computer vision, convolutional models have subsequently been shown to be effective for NLP and have achieved excellent performance in sentence modeling Kim (2014); Kalchbrenner et al. (2014), and other traditional NLP tasks Hu et al. (2014); Zeng et al. (2014); Gehring et al. (2017).
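Given a sentence representation $X = [x_1, x_2, \cdots, x_m] \in \mathbb{R}^{d \times m}$ and a convolutional filter $W^{(f)} \in \mathbb{R}^{d \times kd}$, the convolution process is defined as

$$z_i = f\left(W^{(f)} x_{i:i+k-1} + b^{(f)}\right),$$

where $x_{i:i+k-1} \in \mathbb{R}^{kd}$ is the concatenation of the $k$ token vectors in the window, $f(\cdot)$ is a non-linear activation function, such as ReLU, $k$ indicates the size of the convolution window, and $b^{(f)} \in \mathbb{R}^{d}$ is a bias vector.
The convolution can be abbreviated as

$$Z = W^{(f)} \otimes X,$$

where $\otimes$ denotes the convolutional operation. To ensure that the output of the convolution has the same length as the input, we pad $\lfloor k/2 \rfloor$ zero vectors on both sides of the input.
[Figure: the filter generation network produces the filters $W_x$ and $W_y$.]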
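As an illustration of the padded convolution described above, here is a minimal NumPy sketch. It assumes an odd window size `k` (so that padding `k // 2` zeros on each side preserves the length) and ReLU as the activation; the output name `Z` and the function name are our own choices.

```python
import numpy as np

def conv_over_sequence(X, W, b, k):
    """Same-length convolution over time.

    X: (d, m) token vectors as columns; W: (d, k*d) filter; b: (d,) bias.
    Assumes odd k. Returns Z of shape (d, m).
    """
    d, m = X.shape
    pad = k // 2
    Xp = np.pad(X, ((0, 0), (pad, pad)))            # zero vectors on both sides
    Z = np.empty((d, m))
    for i in range(m):
        # Stack the k token vectors of the window into one vector of length k*d.
        window = Xp[:, i:i + k].T.reshape(-1)
        Z[:, i] = np.maximum(0.0, W @ window + b)   # ReLU activation
    return Z

X = np.array([[1.0, -2.0, 3.0],
              [0.5, 4.0, -1.0]])                    # d = 2, m = 3
center = np.hstack([np.zeros((2, 2)), np.eye(2), np.zeros((2, 2))])
Z = conv_over_sequence(X, center, np.zeros(2), k=3)
```

With the filter `center`, which only reads the middle slot of each window, the output reduces to `ReLU(X)`, a quick sanity check that the padding and window indexing line up.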

Convolutional Interaction Network
Convolution is very effective at extracting useful features from a sentence. For NLI, however, whether a word (or feature) in one sentence is important depends on the other sentence. Therefore, a better convolution operation should be able to extract substantial features from one sentence according to the other sentence, which means that the convolutional filter should change dynamically. Inspired by Ha et al. (2016b), we propose a filter generation network (FGN) to generate a dynamic filter, which is used to extract the context-aware information.
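Given two sentences $x$, $y$, and their representations $X = [x_1, x_2, \cdots, x_m] \in \mathbb{R}^{d \times m}$ and $Y = [y_1, y_2, \cdots, y_n] \in \mathbb{R}^{d \times n}$, the filters are generated by

$$W_x^{(f)} = \mathrm{FGN}(X), \tag{10}$$
$$W_y^{(f)} = \mathrm{FGN}(Y), \tag{11}$$

where $W_x^{(f)}, W_y^{(f)} \in \mathbb{R}^{d \times \tau d}$, $\tau$ is the width of the filter, and $\mathrm{FGN}(\cdot)$ is the filter generation network. A detailed implementation of FGN is presented in Section 3.4. Now we can convolve each sentence with the filter generated from the other sentence,

$$\bar{X} = W_y^{(f)} \otimes X, \tag{12}$$
$$\bar{Y} = W_x^{(f)} \otimes Y, \tag{13}$$

where the resulting matrices $\bar{X}$ and $\bar{Y}$ can be regarded as the context-aware representations of sentences $x$ and $y$, each depending on the other. Figure 1 gives an illustration of CIN.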
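The data flow of CIN can be sketched as follows. This is a hedged illustration, not the authors' implementation: `toy_fgn` is a hypothetical stand-in for the filter generation network, the convolution omits the bias and assumes an odd filter width `k` with ReLU, and all function names are ours.

```python
import numpy as np

def conv(X, W, k):
    """Same-length convolution over time (odd k, no bias, ReLU). X: (d, m), W: (d, k*d)."""
    m = X.shape[1]
    Xp = np.pad(X, ((0, 0), (k // 2, k // 2)))   # zero-pad both ends
    # Column i stacks the k token vectors in the window around position i.
    windows = np.stack([Xp[:, i:i + k].T.reshape(-1) for i in range(m)], axis=1)
    return np.maximum(0.0, W @ windows)

def cin(X, Y, fgn, k):
    """Convolve each sentence with the filter generated from the other one."""
    W_x, W_y = fgn(X, k), fgn(Y, k)              # each of shape (d, k*d)
    X_bar = conv(X, W_y, k)                      # x read through a filter built from y
    Y_bar = conv(Y, W_x, k)                      # y read through a filter built from x
    return X_bar, Y_bar

def toy_fgn(S, k):
    """Hypothetical stand-in, NOT the paper's FGN: tiles an outer product of the mean vector."""
    s = S.mean(axis=1)
    return np.tile(np.outer(s, s), (1, k))

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(4, 3)), rng.normal(size=(4, 5))
X_bar, Y_bar = cin(X, Y, toy_fgn, k=3)
```

The point of the sketch is that `X_bar` has the shape of `X` but its values change whenever `Y` changes, which is what makes the representation context-aware.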

Connection to Attentive Interaction
CIN is more general than the attention model. Assume that we set $k = 1$ and take FGN to be the function $\mathrm{FGN}(X) = XX^\top$. Ignoring the non-linearity and the bias of the convolution, Eq. (12) and (13) of CIN can be written as

$$\bar{X} = YY^\top X,$$
$$\bar{Y} = XX^\top Y.$$

Compared to Eq. (6) and (7), under the above assumption, CIN is equivalent to the word2word multiplicative attention model without softmax normalization.
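The claimed equivalence can be checked numerically. The sketch below assumes a linear convolution (identity activation, zero bias) with $k = 1$, so that the convolution reduces to a matrix product, and a dot-product score for the attention side.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))    # d = 4, m = 3
Y = rng.normal(size=(4, 5))    # n = 5

# CIN with k = 1: each "convolution" is a product with a (d, d) filter.
W_x, W_y = X @ X.T, Y @ Y.T    # FGN(X) = X X^T, FGN(Y) = Y Y^T
X_bar = W_y @ X                # x convolved with the filter from y
Y_bar = W_x @ Y                # y convolved with the filter from x

# Multiplicative attention with the softmax removed.
M = X.T @ Y                    # (m, n) alignment scores
X_att = Y @ M.T
Y_att = X @ M

same = np.allclose(X_bar, X_att) and np.allclose(Y_bar, Y_att)
```

The identity follows from associativity: $YY^\top X = Y(Y^\top X) = Y M^\top$, and likewise $XX^\top Y = XM$.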

An Implementation of Filter Generation Network (FGN)
To generate the dynamic filters, the key factor is how to choose the filter generation network $\mathrm{FGN}(\cdot)$ in Eq. (10) and (11). Although many sophisticated networks can be employed, we give a simple implementation in this paper. For ease of presentation, we only describe how we generate the dynamic filter according to sentence $x$. The same procedure is used for sentence $y$.
Firstly, we summarize the information of sentence $x$ with an over-time k-max pooling over $U_x$, where $U_x$ is a non-linear transformation of $X$ by a convolution filter $W_u \in \mathbb{R}^{d \times d}$. The idea of k-max pooling is to capture the most important features (the $k$ highest values) from sentence $x$.
Then we generate $k$ filters $W_x^j$ for $j = 1, \cdots, k$ from the pooled features. The final filter is obtained by concatenating the $k$ generated filters, $W_x^{(f)} = [W_x^1, W_x^2, \cdots, W_x^k]$. Similar to $x$, we can also obtain the dynamic filter $W_y^{(f)}$ according to sentence $y$.
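The pooling step can be illustrated with a short NumPy sketch. It assumes that k-max pooling keeps, for each feature dimension, the $k$ largest values in their original temporal order (as in Kalchbrenner et al. (2014)); whether the order is preserved here is our assumption, and the mapping from pooled features to filters is not shown because the text above does not specify it.

```python
import numpy as np

def k_max_pooling(U, k):
    """Over-time k-max pooling.

    U: (d, m) with m >= k. Returns (d, k): for every row, the k largest
    values, kept in their original left-to-right order (an assumption).
    """
    idx = np.argpartition(U, -k, axis=1)[:, -k:]   # column indices of the k largest per row
    idx.sort(axis=1)                               # restore temporal order
    return np.take_along_axis(U, idx, axis=1)

U = np.array([[1.0, 5.0, 3.0, 4.0],
              [9.0, 0.0, 8.0, 7.0]])
pooled = k_max_pooling(U, k=2)
```

For the first row the two largest values are 5 and 4, and for the second row 9 and 8, so `pooled` has shape `(2, 2)` regardless of the sentence length.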

Incorporating CIN into a Deep Network Architecture for NLI
Our overall network architecture for NLI is based on a successful model proposed by Chen et al. (2017a). The major difference is that we use CIN, instead of bi-directional attention, to capture the interaction.

Encoding Layer
The input of the natural language inference task is a pair of sentences $x$ and $y$. Since each word in a sentence is a symbol that cannot be directly processed by neural networks, we first need to map each word to a $d_e$-dimensional embedding vector.
Thus, the two sentences are mapped to two matrices $E_x \in \mathbb{R}^{d_e \times m}$ and $E_y \in \mathbb{R}^{d_e \times n}$ respectively. We also use syntactic and lexical information such as part-of-speech (POS) tags, an exact match feature and character representations. In this paper, the exact match value of a word is set to 1 (0 by default) if the word shares the same stem or lemma with any word in the counterpart sentence. The character representation of a word is obtained using a convolutional neural network followed by max pooling along the sequence length dimension, in the same way as Kim (2014). The final representation of a word is a concatenation of the word embedding, the character-encoded vector, the POS tag embedding and the exact match feature. Both the character embeddings and the POS tag embeddings are randomly initialized. All embeddings are updated during training.
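The exact match feature can be sketched in plain Python. The `normalize` argument is a hypothetical stand-in for a real stemmer or lemmatizer; lowercasing is used here only to keep the example self-contained.

```python
def exact_match_features(sentence, counterpart, normalize=str.lower):
    """Return one 0/1 flag per word of `sentence`.

    A flag is 1 when the normalized word also occurs, after normalization,
    in `counterpart`. `normalize` stands in for a stemmer or lemmatizer.
    """
    counterpart_keys = {normalize(word) for word in counterpart}
    return [1 if normalize(word) in counterpart_keys else 0 for word in sentence]

premise = "A girl playing a violin".split()
hypothesis = "A girl is playing an instrument".split()
premise_flags = exact_match_features(premise, hypothesis)
hypothesis_flags = exact_match_features(hypothesis, premise)
```

With a real lemmatizer in place of `str.lower`, inflected forms such as "plays" and "playing" would also match.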
We use a bi-directional LSTM (BiLSTM) Hochreiter and Schmidhuber (1997) to obtain the phrase-level encodings $X \in \mathbb{R}^{d \times m}$ and $Y \in \mathbb{R}^{d \times n}$ of sentences $x$ and $y$ respectively.

Convolutional Interaction Layers
In the interaction layers, we use our proposed CIN to model the interaction between two sentences. We first dynamically generate context-aware filters $W_x^{(f)}$ and $W_y^{(f)}$ based on the sentence encodings $X$ and $Y$ respectively, which are further used for both intra-sentence and inter-sentence interaction.

Intra-Sentence Interaction
The intra-sentence convolutional interaction convolves a sentence with the filter generated from itself,

$$X^{intra} = W_x^{(f)} \otimes X,$$
$$Y^{intra} = W_y^{(f)} \otimes Y.$$

The role of the intra-sentence convolutional interaction is the same as that of self-attention Shen et al. (2017), which is also very useful in NLI.

Inter-Sentence Interaction
The inter-sentence interaction takes filters generated from the counterpart sentence to convolve the inputs,

$$X^{inter} = W_y^{(f)} \otimes X,$$
$$Y^{inter} = W_x^{(f)} \otimes Y.$$

The inter-sentence convolutional interaction plays a role similar to the cross-attention between two sentences.

Fusion Layer
After CIN, we can fuse the two kinds of context-aware representations of each sentence. For sentence $x$, $X^{intra}$ and $X^{inter}$ represent the features extracted with respect to the sentence itself and to sentence $y$ respectively.
To efficiently utilize $X^{intra}$ and $X^{inter}$, a fusion layer is used. We use the comparing operation proposed in Chen et al. (2017a) to fuse the two kinds of representation. Let $u_i$ and $v_i$ be the intra and inter attentive vectors of the $i$-th word in sentence $x$; a heuristic and effective composition operator is used to combine the two vectors.
Thus, we can obtain two fused representations $X^{(c)}$ and $Y^{(c)}$ for the two sentences, which are further fed into the prediction layer or another stacked interaction layer. The interaction layers can be stacked $N$ times to capture complicated matching information.
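One plausible form of the fusion step is sketched below. It assumes the heuristic matching commonly attributed to Chen et al. (2017a), that is, concatenating the two vectors with their difference and element-wise product; whether a projection layer follows, and whether this is exactly the operator used here, is not stated in the text.

```python
import numpy as np

def fuse(X_intra, X_inter):
    """Assumed fusion: [u; v; u - v; u * v] for every position.

    Both inputs have shape (d, m); the result has shape (4*d, m).
    """
    return np.concatenate(
        [X_intra, X_inter, X_intra - X_inter, X_intra * X_inter], axis=0
    )

u = np.array([[1.0, 2.0],
              [3.0, 4.0]])       # d = 2, m = 2
v = np.ones((2, 2))
fused = fuse(u, v)
```

The difference and product terms give the following layers direct access to how much the intra-sentence and inter-sentence views of a word disagree.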

Prediction Layer
After the interaction layers, an aggregation layer is employed to aggregate the two sequences of fusion vectors $X^{(c)}$ and $Y^{(c)}$ into fixed-length matching vectors. The aggregation component usually consists of another BiLSTM layer followed by a pooling layer. We then perform max pooling over time for both $X^{(c)}$ and $Y^{(c)}$ to get two fixed-length representation vectors for the two sentences, $p$ and $h$:

$$p = \max(X^{(c)}), \qquad h = \max(Y^{(c)}),$$

where the function $\max$ is the max pooling operation over time steps. Finally, the pooled vectors are composed into one relation vector and fed into a feed-forward network to predict the relationship between the two sentences. Specifically, the two-layer feed-forward network has one hidden layer with tanh activation.
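The prediction step can be sketched as follows. This is a simplified illustration under stated assumptions: the BiLSTM aggregation is omitted, the relation vector is assumed to be the plain concatenation of the two pooled vectors (the text does not specify the composition), and the output is assumed to be a softmax over the three labels.

```python
import numpy as np

def predict(Xc, Yc, W1, b1, W2, b2):
    """Xc: (d, m), Yc: (d, n). Returns a probability vector over the classes."""
    p = Xc.max(axis=1)                    # max pooling over time steps
    h = Yc.max(axis=1)
    r = np.concatenate([p, h])            # assumed composition of the relation vector
    hidden = np.tanh(W1 @ r + b1)         # single hidden layer with tanh
    logits = W2 @ hidden + b2
    e = np.exp(logits - logits.max())     # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
d, hidden_size, classes = 4, 6, 3
Xc, Yc = rng.normal(size=(d, 3)), rng.normal(size=(d, 5))
W1, b1 = rng.normal(size=(hidden_size, 2 * d)), np.zeros(hidden_size)
W2, b2 = rng.normal(size=(classes, hidden_size)), np.zeros(classes)
probs = predict(Xc, Yc, W1, b1, W2, b2)
```

Max pooling makes the pooled vectors independent of the sentence lengths, so the same feed-forward weights apply to every sentence pair.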

Training
Given a training set $\{x^{(i)}, y^{(i)}, t^{(i)}\}_{i=1}^{N}$, the objective is to minimize a cross-entropy loss $J(\theta)$, where $\theta$ represents all the connection weights. We use the Adam optimizer Kingma and Ba (2014) with an initial learning rate of 0.0004. The default L2 regularization strength $\lambda$ is set to $10^{-6}$. To avoid overfitting, dropout is applied after each fully connected, recurrent or convolutional layer.
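A common form of this objective is sketched below. It assumes that the cross-entropy is averaged over the $N$ examples and that the L2 penalty is added directly to the loss; neither detail is confirmed by the text, and the function name is ours.

```python
import numpy as np

def cross_entropy_loss(probs, targets, params, l2=1e-6):
    """probs: (N, C) predicted class probabilities; targets: (N,) integer labels.

    Returns the mean negative log-likelihood plus an L2 penalty on `params`.
    Averaging and the placement of the penalty are assumptions.
    """
    n = probs.shape[0]
    nll = -np.log(probs[np.arange(n), targets]).mean()
    penalty = l2 * sum(float((p ** 2).sum()) for p in params)
    return nll + penalty

uniform = np.full((2, 3), 1.0 / 3.0)      # a model that is maximally unsure
targets = np.array([0, 2])
loss = cross_entropy_loss(uniform, targets, params=[np.zeros((2, 2))])
```

For a uniform prediction over three classes the loss equals $\log 3$, which is a useful reference point when reading early training curves.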

Initialization
We take advantage of pre-trained word embeddings such as GloVe Pennington et al. (2014) to transfer more knowledge from vast unlabeled data. For words that do not appear in GloVe, we randomly initialize their embeddings from a normal distribution with mean 0.0 and standard deviation 0.1.
The network weights are initialized with Xavier initialization Glorot and Bengio (2010) to maintain the variance of activations throughout the forward and backward passes. Biases are set to zero when the network is constructed.
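The initialization scheme can be sketched as follows. The sketch assumes the normal variant of Xavier initialization, with standard deviation $\sqrt{2 / (\text{fan\_in} + \text{fan\_out})}$ (the text does not say whether the normal or the uniform variant is used), and represents the pre-trained vectors as a plain dictionary, which is a hypothetical stand-in for a loaded GloVe file.

```python
import numpy as np

def xavier_normal(fan_out, fan_in, rng):
    """Weight matrix with std sqrt(2 / (fan_in + fan_out)); normal variant assumed."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_out, fan_in))

def init_embeddings(vocab, pretrained, dim, rng):
    """Copy pre-trained vectors where available, otherwise sample from N(0, 0.1^2)."""
    E = np.empty((len(vocab), dim))
    for i, word in enumerate(vocab):
        if word in pretrained:
            E[i] = pretrained[word]
        else:
            E[i] = rng.normal(0.0, 0.1, size=dim)   # out-of-vocabulary word
    return E

rng = np.random.default_rng(0)
W = xavier_normal(300, 500, rng)
E = init_embeddings(["girl", "zzz"], {"girl": np.ones(5)}, dim=5, rng=rng)
bias = np.zeros(300)                                # biases start at zero
```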

Datasets
For quantitative evaluation, we evaluate our model on three well-known datasets: the Stanford Natural Language Inference dataset (SNLI), the MultiNLI dataset and the Quora Question Pair dataset (Quora). Detailed statistical information on these datasets is shown in Table 1.

Overall Results
We use the accuracy to evaluate the performance of our convolutional interaction network (CIN) and other models on SNLI, MultiNLI and Quora.

SNLI
Table 2 shows the results of different models on the train set and test set of SNLI. The first row gives a baseline model with handcrafted features presented by Bowman et al. (2015). All the other models are attention-based neural networks. Wang and Jiang (2016) exploits the long short-term memory (LSTM) for NLI. Parikh et al. (2016) uses attention to decompose the problem into subproblems that can be solved separately. Chen et al. (2017a) incorporates the chain LSTM and tree LSTM jointly. Zhiguo Wang (2017) proposes a bilateral multi-perspective matching for NLI.
In Table 2, the second block gives the single models. As we can see, our proposed model CIN achieves an accuracy of 88.0% on the SNLI test set, which is competitive with previous work.
To further improve the performance of NLI systems, researchers have built ensemble models, and ensemble systems have obtained the best performance on SNLI. Our ensemble model obtains an accuracy of 89.1% and outperforms the current state-of-the-art model.
Overall, the single CIN model performs competitively, and the ensemble CIN model outperforms the previous models on the natural language inference task.

MultiNLI
Table 3 shows the performance of different models on MultiNLI. The original aim of this dataset is to evaluate the quality of sentence representations. Recently, this dataset has also been used to evaluate interaction models involving attention mechanisms.
The first line of Table 3 gives a baseline model without interaction. The second block of Table 3 gives the attention-based models. The proposed [...]

Quora
Table 4 shows the performance of different models on the Quora test set. The baselines in Table 4 are all implemented in Zhiguo Wang (2017). The Siamese-CNN model and the Siamese-LSTM model encode sentences with a CNN and an LSTM respectively, and then predict the relationship between them based on the cosine similarity. Multi-Perspective-CNN and Multi-Perspective-LSTM are derived from Siamese-CNN and Siamese-LSTM respectively by replacing the cosine similarity calculation layer with their multi-perspective cosine matching function. The L.D.C is a general compare-aggregate framework that performs word-level matching followed by an aggregation of convolutional neural networks. As we can see, our model outperforms the baselines and achieves 88.62% accuracy on the test set of the Quora corpus.

Model Ablation
To better understand the performance of our model, we analyze the effect of each key component of the proposed model. As illustrated in Table 5, the first row is the full CIN model. When the convolutional interaction layers are dropped, the performance decreases to 85.1% on the test set, which indicates that the interaction information is crucial for NLI. When only the intra-sentence interaction layer is dropped, the performance decreases to 87.7% on the test set. According to the results, all of the components contribute positively to the final performance.

Case Study
To give an intuitive understanding of how our model works, we give an analysis on the following case from the test set.
Premise: A girl playing a violin along with a group of people. Hypothesis: A girl is playing an instrument. Label: Entailment.
The visualization results are produced from a model with two stacked CINs. $X$, $Y$ are the hidden states at the encoding layer, and $X^{(c)}$, $Y^{(c)}$ are the hidden states at the first CIN layer. For a hidden state $\mathbf{x}_i$ of word $x_i$, we can calculate its gradient scale $\left\| \frac{\partial J}{\partial \mathbf{x}_i} \right\|_2$ to show its contribution to the final prediction. Table 6 shows the gradient scales of the hidden states of each word in the encoding layer and the first CIN layer. As we can see, some phrases (like playing a violin and playing an instrument) instead of isolated words (like violin and instrument) become more focused after a CIN layer. This implies that CIN can capture some higher-level patterns. Figure 3 gives a visualization of the correlations between the hidden states of the two sentences. (a) shows the correlations after the encoding layer, where the same words are most correlated. This is because the embedding layer and the encoding layer are shared between premise and hypothesis. (b) shows the correlations after the first CIN layer, where the correlation exists between phrases {playing a violin vs. playing an instrument} instead of between the same words. The interaction layer connects playing in the premise to instrument in the hypothesis, and connects playing in the hypothesis to violin in the premise. Thus, the correlation between instrument in the hypothesis and violin in the premise is boosted; both words are important for reasoning.
[Figure 3 (b): Correlation of $X^{(c)\top} Y^{(c)}$ at the first CIN layer.]

Related Work
There are mainly two threads of work related to ours.
One thread of work uses attention-based models for natural language inference (NLI). NLI has been widely investigated for many years. Benefiting from the development of deep learning and the availability of large-scale annotated datasets, deep neural models have achieved great success. Rocktäschel et al. (2015) were the first to use an LSTM with attention for the text matching task. Wang and Jiang (2016) use word-by-word attention to exploit the word-level match. Parikh et al. (2016) propose a new framework to model the relationship between two sentences using an interact-compare-aggregate architecture. Chen et al. (2017a) incorporates the chain LSTM and tree LSTM jointly. Zhiguo Wang (2017) use a self-attention mechanism to capture contextual information from the whole sentence.
Unlike the above models, we use an alternative model to capture the complicated interaction information of two sentences.
Another thread of work is the idea of using one network to generate the parameters of another network. De  proposed the dynamic filter network to implicitly learn a variety of filtering operations. Ha et al. (2016a) proposed the model hypernetwork, which uses a small network to generate the weights for a larger network.
Unlike these models, our dynamic filter is employed for interaction. Therefore, a filter generation function is proposed to capture the related intra-sentence and inter-sentence dependencies of a sentence pair.

Conclusion
In this paper, we propose an alternative interaction model, Convolutional Interaction Network (CIN), for natural language inference. CIN utilizes dynamic convolutional filters to model the interaction between two sentences. Specifically, each sentence is convolved with dynamic filters generated from the other sentence. CIN is more general and flexible since the filters may have various numbers and sizes, thereby capturing more complicated interaction patterns. Experiments on three very large datasets demonstrate the efficacy of our proposed model.
In future work, we hope to improve the extensibility of CIN and apply it to other NLP tasks, such as machine comprehension.