A Sentence Interaction Network for Modeling Dependence between Sentences

Modeling interactions between two sentences is crucial for a number of natural language processing tasks, including Answer Selection and Dialogue Act Analysis. While deep learning methods such as Recurrent Neural Networks and Convolutional Neural Networks have proven powerful for sentence modeling, prior studies have paid less attention to interactions between sentences. In this work, we propose a Sentence Interaction Network (SIN) for modeling the complex interactions between two sentences. By introducing “interaction states” for word and phrase pairs, SIN is powerful and flexible in capturing sentence interactions for different tasks. We obtain significant improvements on Answer Selection and Dialogue Act Analysis without any feature engineering.


Introduction
There exist complex interactions between sentences in many natural language processing (NLP) tasks such as Answer Selection (Yu et al., 2014; Yin et al., 2015) and Dialogue Act Analysis (Kalchbrenner and Blunsom, 2013). For instance, consider the question and two candidate answers below:
Q: What do cats look like?
A1: Cats have large eyes and furry bodies.
A2: Cats like to play with boxes and bags.
Though all three sentences talk about cats, only the first answer correctly answers the question about cats' appearance. It is important to appropriately model the relation between two sentences in such cases.
For sentence pair modeling, some methods first project the two sentences to fixed-size vectors separately, without considering the interactions between them, and then feed the sentence vectors to other classifiers as features for a specific task (Kalchbrenner and Blunsom, 2013; Tai et al., 2015). Such methods are unable to encode context information during sentence embedding.
A more reasonable way to capture sentence interactions is to introduce a mechanism that utilizes information from both sentences at the same time. Some methods introduce an attention matrix, which contains similarity scores between words and phrases, to approximate sentence interactions (Socher et al., 2011; Yin et al., 2015). However, since the meaning of words and phrases may drift from context to context, simple similarity scores may be too weak to capture the complex interactions, and a more powerful interaction mechanism is needed.
In this work, we propose a Sentence Interaction Network (SIN) focusing on modeling sentence interactions. The main idea behind this model is that each word in one sentence may potentially influence every word in another sentence to some degree (the word "influence" here may refer to "answer" or "match" in different tasks). We therefore introduce a mechanism that allows information to flow from every word (or phrase) in one sentence to every word (or phrase) in another sentence. These "information flows" are real-valued vectors describing how words and phrases interact with each other. For example, a word (or phrase) in one sentence can modify the meaning of a word (or phrase) in another sentence through such "information flows".
Specifically, given two sentences $s_1$ and $s_2$, for every word $x_t$ in $s_1$, we introduce a "candidate interaction state" for every word $x_\tau$ in $s_2$. This state is regarded as the "influence" of $x_\tau$ on $x_t$, and is in fact the "information flow" from $x_\tau$ to $x_t$ mentioned above. By summing over all the "candidate interaction states", we generate an "interaction state" for $x_t$, which represents the influence of the whole sentence $s_2$ on word $x_t$. When feeding the "interaction state" and the word embedding together into a Recurrent Neural Network (with Long Short-Term Memory units in our model), we obtain a sentence vector with context information encoded. We also add a convolution layer on the word embeddings so that interactions between phrases can also be modeled.
SIN is powerful and flexible for modeling sentence interactions in different tasks. First, the "interaction state" is a vector; compared with a single similarity score, it is able to encode more information for word or phrase interactions. Second, the interaction mechanism in SIN can be adapted to different functions for different tasks during training, such as "word meaning adjustment" for Dialogue Act Analysis or "answering" for Answer Selection.
Our main contributions are as follows:
• We propose a Sentence Interaction Network (SIN) which utilizes a new mechanism to model sentence interactions.
• We add convolution layers to SIN, which improves the ability to model interactions between phrases.
• We obtain significant improvements on Answer Selection and Dialogue Act Analysis without any handcrafted features.
The rest of the paper is structured as follows: We survey related work in Section 2, introduce our method in Section 3, present the experiments in Section 4, and summarize our work in Section 5.

Related Work
Our work is mainly related to deep learning for sentence modeling and sentence pair modeling.
For sentence modeling, we first have to represent each word as a real-valued vector (Mikolov et al., 2010; Pennington et al., 2014), and then compose word vectors into a sentence vector. Several methods have been proposed for sentence modeling. Recurrent Neural Network (RNN) (Elman, 1990; Mikolov et al., 2010) introduces a hidden state to represent contexts, and repeatedly feeds the hidden state and word embeddings to the network to update the context representation. RNN suffers from gradient vanishing and exploding problems which limit the length of reachable context. RNN with Long Short-Term Memory units (LSTM) (Hochreiter and Schmidhuber, 1997; Gers, 2001) addresses such problems by introducing a "memory cell" and "gates" into the network. Recursive Neural Network (Socher et al., 2013; Qian et al., 2015) and LSTM over tree structures (Tai et al., 2015) are able to utilize some syntactic information for sentence modeling. Kim (2014) proposed a Convolutional Neural Network (CNN) for sentence classification which models a sentence at multiple granularities.
For sentence pair modeling, a simple idea is to first project the sentences to two sentence vectors separately with sentence modeling methods, and then feed these two vectors into other classifiers for classification (Tai et al., 2015; Yu et al., 2014). The drawback of such methods is that modeling the two sentences separately cannot capture the complex sentence interactions. Socher et al. (2011) model the two sentences with Recursive Neural Networks (Unfolding Recursive Autoencoders), and then feed similarity scores between words and phrases (syntax tree nodes) to a CNN with dynamic pooling to capture sentence interactions. Hu et al. (2014) first create an "interaction space" (matching score matrix) by feeding word and phrase pairs into a multilayer perceptron (MLP), and then apply a CNN to such a space for interaction modeling. Yin et al. (2015) proposed an Attention-based Convolutional Neural Network (ABCNN) for sentence pair modeling. ABCNN introduces an attention matrix between the convolution layers of the two sentences, and feeds the matrix back to the CNN to model sentence interactions. There are also some methods that make use of rich lexical semantic features for sentence pair modeling (Yih et al., 2013), but these methods cannot be easily adapted to different tasks.
Our work is also related to context modeling. Hermann et al. (2015) proposed an LSTM-based method for reading comprehension. Their model is able to effectively utilize the context (given by a document) to answer questions. Ghosh et al. (2016) proposed a Contextual LSTM (CLSTM) which introduces a topic vector into LSTM for context modeling. The topic vector in CLSTM is computed from the words already seen, and therefore reflects the underlying topic of the current word.

Background: RNN and LSTM
Recurrent Neural Network (RNN) (Elman, 1990; Mikolov et al., 2010), as depicted in Figure 1(a), is proposed for modeling long-distance dependence in a sequence. Its hidden layer is connected to itself so that previous information is considered at later time steps. RNN can be formalized as
$$h_t = \sigma(W_h x_t + U_h h_{t-1} + b_h)$$
where $x_t$ is the input at time step $t$ and $h_t$ is the hidden state. Though an RNN is theoretically able to capture dependence of arbitrary length, it tends to suffer from the gradient vanishing and exploding problems, which limit the length of reachable context. In addition, an additive function of the previous hidden layer and the current input is too simple to describe the complex interactions within a sequence.
RNN with Long Short-Term Memory units (LSTM, Figure 1(b)) (Hochreiter and Schmidhuber, 1997; Gers, 2001) addresses such problems by introducing a "memory cell" and "gates" into the network. Each time step is associated with a subnet known as a memory block, in which a "memory cell" stores the context information and "gates" control which information should be added, discarded, or retained. LSTM can be formalized as
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$\tilde{C}_t = \tanh(W_C x_t + U_C h_{t-1} + b_C)$$
$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$
$$h_t = o_t * \tanh(C_t)$$
where $*$ means element-wise multiplication; $f_t$, $i_t$, $o_t$ are the forget, input and output gates that control which information should be forgotten, input and output, respectively; $\tilde{C}_t$ is the candidate information to be added to the memory cell state $C_t$; and $h_t$ is the hidden state, which is regarded as a representation of the current time step with contexts.
In this work, we use LSTM with peephole connections, namely adding $C_{t-1}$ to compute the forget gate $f_t$ and the input gate $i_t$, and adding $C_t$ to compute the output gate $o_t$.
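The LSTM update described above can be illustrated with a minimal pure-Python sketch of a single time step. It implements the standard formulation without peephole connections. The parameter layout (a dict mapping each gate name to a hypothetical `(W, U, b)` triple) and the helper names are assumptions made for illustration, not the authors' implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, x, U, h, b):
    """Return W x + U h + b, with matrices given as lists of rows."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + b_k
        for W_row, U_row, b_k in zip(W, U, b)
    ]

def gate(params, name, x, h, activation):
    """Apply one gate: activation(W x + U h + b) for the named parameter triple."""
    W, U, b = params[name]
    return [activation(a) for a in affine(W, x, U, h, b)]

def lstm_step(params, x_t, h_prev, C_prev):
    """One LSTM step (no peepholes); params maps 'f', 'i', 'o', 'c' to (W, U, b)."""
    f_t = gate(params, "f", x_t, h_prev, sigmoid)        # forget gate
    i_t = gate(params, "i", x_t, h_prev, sigmoid)        # input gate
    o_t = gate(params, "o", x_t, h_prev, sigmoid)        # output gate
    C_tilde = gate(params, "c", x_t, h_prev, math.tanh)  # candidate information
    # Memory cell: keep part of the old cell and add gated candidate information.
    C_t = [f * c + i * g for f, c, i, g in zip(f_t, C_prev, i_t, C_tilde)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, C_t)]
    return h_t, C_t
```

A peephole variant would additionally feed the cell state into the gate pre-activations; that extension is omitted here to keep the sketch short.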

Sentence Interaction Network (SIN)
Sentence Interaction Network (SIN, Figure 2) models the interactions between two sentences in two steps.
First, we use an LSTM (referred to as LSTM$_1$) to model the two sentences $s_1$ and $s_2$ separately, and the hidden states related to the $t$-th word in $s_1$ and the $\tau$-th word in $s_2$ are denoted as $z^{(1)}_t$ and $z^{(2)}_\tau$ respectively. For simplicity, we will use the positions $(t, \tau)$ to denote the corresponding words hereafter.
Second, we propose a new mechanism to model the interactions between $s_1$ and $s_2$ by allowing information to flow between them. Specifically, word $t$ in $s_1$ may potentially be influenced by all words in $s_2$ to some degree. Thus, for word $t$ in $s_1$, a candidate interaction state $\tilde{c}^{(i)}_{t\tau}$ and an input gate $i^{(i)}_{t\tau}$ are introduced for each word $\tau$ in $s_2$, both computed from the hidden states $z^{(1)}_t$ and $z^{(2)}_\tau$. The interaction state of word $t$ is then the gated sum
$$c^{(i)}_t = \sum_{\tau=1}^{|s_2|} \tilde{c}^{(i)}_{t\tau} * i^{(i)}_{t\tau}$$
where $|s_2|$ is the length of sentence $s_2$, and $c^{(i)}_t$ can be viewed as the total interaction information received by word $t$ in $s_1$ from sentence $s_2$. The interaction states of words in $s_2$ can be computed similarly.
We now introduce the interaction states into another LSTM (referred to as LSTM$_2$) to compute the sentence vectors. Therefore, information can flow between the two sentences through these states. For sentence $s_1$, at timestep $t$, LSTM$_2$ receives the interaction state $c^{(i)}_t$ together with the word embedding of word $t$. By averaging all hidden states of LSTM$_2$, we obtain the sentence vector $v_{s_1}$ of $s_1$, and the sentence vector $v_{s_2}$ of $s_2$ can be computed similarly. $v_{s_1}$ and $v_{s_2}$ can then be used as features for different tasks.
In SIN, the candidate interaction state $\tilde{c}^{(i)}_{t\tau}$ represents the potential influence of word $\tau$ in $s_2$ on word $t$ in $s_1$, and the related input gate $i^{(i)}_{t\tau}$ controls the degree of the influence. The element-wise product $\tilde{c}^{(i)}_{t\tau} * i^{(i)}_{t\tau}$ is then the actual influence. By summing over all words in $s_2$, the interaction state $c^{(i)}_t$ gives the influence of the whole sentence $s_2$ on word $t$.
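The gated summation described above can be sketched as follows. The exact parameterization of the candidate state and the input gate is not given in this text, so the sketch assumes a tanh candidate and a sigmoid gate, each an affine function of the two LSTM$_1$ hidden states. The parameter triples `cand_params` and `gate_params` and all function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine2(W, z1, U, z2, b):
    """Return W z1 + U z2 + b, with matrices given as lists of rows."""
    return [
        sum(w * a for w, a in zip(W_row, z1))
        + sum(u * c for u, c in zip(U_row, z2))
        + b_k
        for W_row, U_row, b_k in zip(W, U, b)
    ]

def interaction_states(Z1, Z2, cand_params, gate_params):
    """For each word t in s1, sum the gated candidate states over all words in s2.

    Z1, Z2: LSTM_1 hidden states of s1 and s2 (lists of vectors).
    cand_params, gate_params: assumed (W, U, b) triples for candidate and gate.
    """
    Wc, Uc, bc = cand_params
    Wi, Ui, bi = gate_params
    states = []
    for z_t in Z1:
        c_t = [0.0] * len(bc)
        for z_tau in Z2:
            # Potential influence of word tau on word t (assumed tanh form).
            cand = [math.tanh(a) for a in affine2(Wc, z_t, Uc, z_tau, bc)]
            # Input gate controlling the degree of that influence.
            g = [sigmoid(a) for a in affine2(Wi, z_t, Ui, z_tau, bi)]
            # Accumulate the actual influence (element-wise product).
            c_t = [acc + c * gi for acc, c, gi in zip(c_t, cand, g)]
        states.append(c_t)
    return states
```

Because the result is one vector per word of $s_1$, it has the same length as the sentence and can be consumed step by step by a second recurrent layer.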

SIN with Convolution (SIN-CONV)
SIN is good at capturing the complex interactions of words in two sentences, but not strong enough for phrase interactions. Since convolutional neural network is widely and successfully used for modeling phrases, we add a convolution layer before SIN to model phrase interactions between two sentences.
Let $v_1, v_2, \ldots, v_{|s|}$ be the word embeddings of a sentence $s$, and let $c_i \in \mathbb{R}^{wd}$, $1 \le i \le |s| - w + 1$, be the concatenation of $v_{i:i+w-1}$, where $w$ is the window size. The representation $p_i$ for phrase $v_{i:i+w-1}$ is computed by applying the convolution filter $F \in \mathbb{R}^{d \times wd}$ to $c_i$, where $d$ is the dimension of the word embeddings.
In SIN-CONV, we first use a convolution layer to obtain phrase representations for the two sentences $s_1$ and $s_2$, and the SIN interaction procedure is then applied to these phrase representations as before to model phrase interactions. The averages of all hidden states are treated as the sentence vectors $v^{cnn}_{s_1}$ and $v^{cnn}_{s_2}$. Thus, SIN-CONV is SIN with word vectors substituted by phrase vectors. The two phrase-based sentence vectors are then fed to a classifier together with the two word-based sentence vectors for classification.
The LSTM and interaction parameters are not shared between SIN and SIN-CONV.
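The phrase representations used by SIN-CONV can be sketched as a sliding window over the word embeddings. The text specifies the filter shape $d \times wd$ and the concatenation $c_i$. The tanh nonlinearity and the bias vector `b` used below are assumptions, as are the function and argument names.

```python
import math

def phrase_representations(embeddings, F, b, w):
    """Map every window of w consecutive word vectors to a d-dimensional phrase vector.

    embeddings: list of d-dimensional word vectors.
    F: d x (w*d) convolution filter as a list of rows.
    b: bias of length d (assumed; not specified in the text).
    """
    phrases = []
    for i in range(len(embeddings) - w + 1):
        # c_i is the concatenation of the w word vectors in the window.
        c_i = [x for v in embeddings[i:i + w] for x in v]
        p_i = [
            math.tanh(sum(f * x for f, x in zip(row, c_i)) + b_k)
            for row, b_k in zip(F, b)
        ]
        phrases.append(p_i)
    return phrases
```

A sentence of length $|s|$ yields $|s| - w + 1$ phrase vectors of dimension $d$, so they can replace the word vectors in the interaction procedure without changing any other shapes.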

Experiments
In this section, we test our model on two tasks: Answer Selection and Dialogue Act Analysis. Both tasks require modeling interactions between sentences. We also conduct auxiliary experiments to analyze the interaction mechanism in our SIN model.

Answer Selection
Selecting correct answers from a set of candidates for a given question is quite crucial for a number of NLP tasks including question-answering, natural language generation, information retrieval, etc.
The key challenge for answer selection is to appropriately model the complex interactions between the question and the answer, and hence our SIN model is suitable for this task.
We treat Answer Selection as a classification task, namely classifying each question-answer pair as "correct" or "incorrect". Given a question-answer pair $(q, a)$, after generating the question and answer vectors $v_q$ and $v_a$ using SIN, we feed them to a logistic regression layer to output a probability $p_\theta(q, a)$. We maximize the following objective function:
$$\sum_{(q,a)} \left[ \hat{y}_{q,a} \log p_\theta(q, a) + (1 - \hat{y}_{q,a}) \log(1 - p_\theta(q, a)) \right]$$
where $\hat{y}_{q,a}$ is the true label for the question-answer pair $(q, a)$ (1 for correct, 0 for incorrect). For SIN-CONV, the sentence vectors $v^{cnn}_q$ and $v^{cnn}_a$ are also fed to the logistic regression layer.
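The classification objective above can be written out as a short sketch. How the two sentence vectors are combined before the logistic regression layer is not stated, so the concatenation below is an assumption, and the names `pair_probability` and `log_likelihood` are hypothetical.

```python
import math

def pair_probability(v_q, v_a, weights, bias):
    """Logistic regression on the concatenation [v_q; v_a] (concatenation is assumed)."""
    features = v_q + v_a
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-score))

def log_likelihood(probs, labels):
    """Sum of y*log(p) + (1-y)*log(1-p) over all question-answer pairs."""
    return sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for p, y in zip(probs, labels)
    )
```

Maximizing this quantity is equivalent to minimizing the binary cross-entropy summed over all pairs.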
During evaluation, we rank the answers of a question $q$ according to the probability $p_\theta(q, a)$. The evaluation metrics are mean average precision (MAP) and mean reciprocal rank (MRR).
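For reference, the two ranking metrics can be computed as in the following sketch. It assumes every question has at least one correct candidate, which matches the filtered development and test sets, and that average precision is normalized by the number of correct answers among the candidates. The input format, a list of `(probability, label)` pairs per question, is an illustrative choice.

```python
def average_precision(labels):
    """labels: 0/1 correctness of candidates, already sorted by descending score."""
    hits, total = 0, 0.0
    for rank, label in enumerate(labels, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this correct answer
    return total / hits if hits else 0.0

def reciprocal_rank(labels):
    """Inverse rank of the first correct answer, or 0 if there is none."""
    for rank, label in enumerate(labels, start=1):
        if label:
            return 1.0 / rank
    return 0.0

def map_mrr(questions):
    """questions: one list of (probability, label) pairs per question."""
    aps, rrs = [], []
    for candidates in questions:
        ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
        labels = [label for _, label in ranked]
        aps.append(average_precision(labels))
        rrs.append(reciprocal_rank(labels))
    return sum(aps) / len(aps), sum(rrs) / len(rrs)
```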

Dataset
The WikiQA 2   correct answers from the development and test set. Some statistics are shown in Table 1.

Setup
We use the 100-dimensional GloVe vectors (Pennington et al., 2014) to initialize our word embeddings, and words that do not appear in the GloVe vocabulary are treated as unknown. The dimension of all hidden states is set to 100 as well. The window size of the convolution layer is 2. To avoid overfitting, dropout is applied to the sentence vectors, namely randomly setting dimensions of the sentence vectors to 0 with probability $p$ (0.5 in our experiment). No handcrafted features are used in our methods and the baselines. Mini-batch Gradient Descent (30 question-answer pairs per mini-batch), with AdaDelta tuning the learning rate, is used for model training. We update model parameters after every mini-batch, and check validation MAP and save the model after every 10 batches. We run 10 epochs in total; the model with the highest validation MAP is treated as the optimal model, and we report the corresponding test MAP and MRR.
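The dropout applied to the sentence vectors can be sketched in a few lines. Whether the surviving dimensions are rescaled is not stated, so this sketch leaves them unchanged, which is an assumption. The `rng` argument is a hypothetical convenience for reproducibility.

```python
import random

def dropout(vector, p=0.5, rng=random):
    """Zero each dimension independently with probability p (no rescaling; assumed)."""
    return [0.0 if rng.random() < p else x for x in vector]

# Usage: a seeded generator makes the mask reproducible across runs.
masked = dropout([0.2, -1.3, 0.7, 0.05], p=0.5, rng=random.Random(0))
```

Dropout of this kind is applied only during training; at evaluation time the sentence vectors are used unmasked.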

Baselines
We compare our SIN and SIN-CONV models with 5 baselines listed below:
• LCLR: The model utilizes rich semantic and lexical features (Yih et al., 2013).
• PV: The cosine similarity score of paragraph vectors of the two sentences is used to rank answers (Le and Mikolov, 2014).
• LSTM: The question and answer are modeled by a simple LSTM. Different from SIN, there is no interaction between sentences.

Results
Results are shown in Table 2. SIN performs much better than LSTM, PV and CNN, which indicates that the proposed interaction mechanism captures the complex interactions between the question and the answer well. However, SIN performs slightly worse than ABCNN because it is not strong enough at modeling phrases. By introducing a simple convolution layer to improve its phrase-modeling ability, SIN-CONV outperforms all the other models. For SIN-CONV, we do not observe much improvement from using larger convolution filters (window size ≥ 3) or stacking more convolution layers. The reason may be that interactions between long phrases are relatively rare; in addition, the QA pairs in the WikiQA dataset may be insufficient for training such a complex model with long convolution windows.

Dialogue Act Analysis
Dialogue acts (DA), such as Statement, Yes-No-Question and Agreement, indicate the pragmatic role of a sentence as well as the intention of the speaker (Williams, 2012). They are widely used in natural language generation, speech and meeting summarization (Murray et al., 2006; Murray et al., 2010), etc. In a dialogue, the DA of a sentence is highly relevant to its own content and to the previous sentences. As a result, modeling the interactions and long-range dependence between sentences in a dialogue is crucial for dialogue act analysis.
Given a dialogue ($n$ sentences) $d = [s_1, s_2, \ldots, s_n]$, we first use an LSTM (LSTM$_1$) to model all the sentences independently. The hidden states of sentence $s_i$ obtained at this step are used to compute the interaction states of sentence $s_{i+1}$, and SIN generates a sentence vector $v_{s_i}$ using another LSTM (LSTM$_2$) for each sentence $s_i$ in the dialogue (see Section 3.2). These sentence vectors can be used as features for dialogue act analysis. We refer to this method as SIN (or SIN-CONV when a convolution layer is added).
Table 2: Results on answer selection. With extra handcrafted features, ABCNN's performance is MAP 0.692 and MRR 0.711.
For dialogue act analysis, we add a softmax layer on the sentence vector $v_{s_i}$ to predict the probability distribution:
$$p_\theta(y_j \mid v_{s_i}) = \frac{\exp(v_{s_i}^\top w_j + b_j)}{\sum_k \exp(v_{s_i}^\top w_k + b_k)}$$
where $y_j$ is the $j$-th DA tag, and $w_j$ and $b_j$ are the weight vector and bias corresponding to $y_j$. We maximize the following objective function:
$$\sum_{d \in D} \sum_{i=1}^{|d|} \log p_\theta(\hat{y}_{s_i} \mid v_{s_i})$$
where $D$ is the training set, namely a set of dialogues, $|d|$ is the length of the dialogue, $s_i$ is the $i$-th sentence in $d$, and $\hat{y}_{s_i}$ is the true dialogue act label of $s_i$.
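The softmax layer and the log-likelihood objective for one dialogue can be sketched as follows. Tags are represented by integer indices into a hypothetical weight list `W` (one weight vector per tag) and bias list `b`. Subtracting the maximum score before exponentiating is a standard numerical-stability step added here, not something stated in the text.

```python
import math

def softmax_probs(v, W, b):
    """p(y_j | v) proportional to exp(v . w_j + b_j); W holds one weight vector per tag."""
    scores = [sum(w * x for w, x in zip(w_j, v)) + b_j for w_j, b_j in zip(W, b)]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dialogue_log_likelihood(sentence_vectors, gold_tags, W, b):
    """Sum over the sentences of one dialogue of log p(gold tag | sentence vector)."""
    return sum(
        math.log(softmax_probs(v, W, b)[y])
        for v, y in zip(sentence_vectors, gold_tags)
    )
```

Summing `dialogue_log_likelihood` over all dialogues in the training set gives the full objective.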
In order to capture long-range dependence in the dialogue, we can further join up the sentence vectors $v_{s_i}$ with another LSTM (LSTM$_3$). The hidden states $h_{s_i}$ of LSTM$_3$ are treated as the final sentence vectors, and the probability distribution is given by substituting $v_{s_i}$ with $h_{s_i}$ in $p_\theta(y_j \mid v_{s_i})$. We refer to this method as SIN-LD (or SIN-CONV-LD when a convolution layer is added), where LD means long-range dependence. Figure 3 shows the whole structure (LSTM$_1$ is not shown here for simplicity).

Dataset
We use the SwDA corpus (Calhoun et al., 2010) in our experiments. SwDA contains the transcripts of several people discussing a given topic on the telephone. There are 42 dialogue act tags in SwDA, and we list the 10 most frequent tags in Table 3.
The same data split as in Stolcke et al. (2000) is used in our experiments. There are 1,115 dialogues in the training set and 19 dialogues in the test set. We also randomly split the original training set into a new training set (1,085 dialogues) and a validation set (30 dialogues).

Setup
The setup is the same as that in Answer Selection except: (1) Only the 10,000 most common words are used; all other words are treated as unknown.
(2) Each mini batch contains all sentences from 3 dialogues for Mini-batch Gradient Descent. (3) The evaluation metric is accuracy. (4) We run 30 epochs in total. (5) We use the last hidden state of LSTM 2 as sentence representation since the sentences here are much shorter compared with those in Answer Selection.
• RCNN: Recurrent Convolutional Neural Networks (Kalchbrenner and Blunsom, 2013). Sentences are first separately embedded with CNN, and then joined up with RNN.
• LSTM: All sentences are modeled separately by one LSTM. Different from SIN, there are no sentence interactions in this method.

Results
Results are shown in Table 4. HMM variants, RCNN and LSTM model the sentences separately during sentence embedding, and are unable to capture the sentence interactions. With our interaction mechanism, SIN outperforms LSTM, which shows that modeling the interactions between sentences in a dialogue well is important for dialogue act analysis. After introducing a convolution layer, SIN-CONV performs slightly better than SIN. SIN-LD and SIN-CONV-LD model the long-range dependence in the dialogue with another LSTM, and obtain further improvements.
Table 5 (question-answer pair):
Q: what creates a cloud
A: in meteorology , a cloud is a visible mass of liquid droplets or frozen crystals made of water or various chemicals suspended in the atmosphere above the surface of a planetary body.

Interaction Mechanism Analysis
We investigate the interaction states of SIN for Answer Selection to see how our proposed interaction mechanism works. Given the question-answer pair in Table 5, for SIN, there is a candidate interaction state $\tilde{c}^{(i)}_{\tau t}$ and an input gate $i^{(i)}_{\tau t}$ from each word $t$ in the question to each word $\tau$ in the answer. We examine the L2-norm $\|\tilde{c}^{(i)}_{\tau t} * i^{(i)}_{\tau t}\|_2$ to see how words in the two sentences interact with each other. Note that we have linearly mapped the original L2-norm value $x$ to $[0, 1]$ as follows:
$$\frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
As depicted in Figure 4, the word "what" in the question has little impact on the answer through interactions. This is reasonable since "what" appears frequently in questions, and does not carry much information for answer selection. On the contrary, the phrase "creates a cloud", especially the word "cloud", transmits much information through interactions to the answer. This conforms with human knowledge, since we rely on these words to answer the question as well.
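The normalization used for this visualization can be sketched as follows. The sketch assumes that the quantity being measured is the gated candidate state for each (answer word, question word) pair and that the minimum and maximum are taken over the whole matrix of norms. Both points, along with the function name, are assumptions.

```python
import math

def normalized_norms(states):
    """states[tau][t] is a gated interaction vector; returns L2 norms rescaled to [0, 1]."""
    norms = [[math.sqrt(sum(x * x for x in vec)) for vec in row] for row in states]
    flat = [n for row in norms for n in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    # Linear min-max mapping; a constant matrix maps to all zeros.
    return [[(n - lo) / span if span else 0.0 for n in row] for row in norms]
```

The resulting matrix can be rendered directly as a heat map with answer words on one axis and question words on the other.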
In the answer, interactions concentrate on the phrase "a cloud is a visible mass of liquid droplets" which seems to be a good and complete answer to the question. Although there are also other highly related words in the answer, they are almost ignored. The reason may be failing to model such a complex phrase (three relatively simple sentences joined by "or") or the existence of the previous phrase which is already a good answer.
This experiment clearly shows how the interaction mechanism works in SIN. Through interaction states, SIN is able to figure out what the question is asking about, namely to detect those highly informative words in the question, and which part in the answer can answer the question.

Conclusion and Future Work
In this work, we propose Sentence Interaction Network (SIN) which utilizes a new mechanism for modeling interactions between two sentences. We also introduce a convolution layer into SIN (SIN-CONV) to improve its phrase modeling ability so that phrase interactions can be handled. SIN is powerful and flexible to model sentence interactions for different tasks. Experiments show that the proposed interaction mechanism is effective, and we obtain significant improvements on Answer Selection and Dialogue Act Analysis without any handcrafted features.
Previous work has shown that it is important to utilize syntactic structures for modeling sentences. We also find that LSTM is sometimes unable to model complex phrases. We therefore plan to extend SIN to a tree-based SIN for sentence modeling as future work. Moreover, applying the models to other tasks, such as semantic relatedness measurement and paraphrase identification, would also be an interesting direction.