Discourse Marker Augmented Network with Reinforcement Learning for Natural Language Inference

Natural Language Inference (NLI), also known as Recognizing Textual Entailment (RTE), is one of the most important problems in natural language processing. It requires to infer the logical relationship between two given sentences. While current approaches mostly focus on the interaction architectures of the sentences, in this paper, we propose to transfer knowledge from some important discourse markers to augment the quality of the NLI model. We observe that people usually use some discourse markers such as “so” or “but” to represent the logical relationship between two sentences. These words potentially have deep connections with the meanings of the sentences, thus can be utilized to help improve the representations of them. Moreover, we use reinforcement learning to optimize a new objective function with a reward defined by the property of the NLI datasets to make full use of the labels information. Experiments show that our method achieves the state-of-the-art performance on several large-scale datasets.


Introduction
In this paper, we focus on the task of Natural Language Inference (NLI), which is known as a significant yet challenging task for natural language understanding. In this task, we are given two sentences which are respectively called premise and hypothesis. The goal is to determine whether the logical relationship between them is entailment, neutral, or contradiction.
Recently, performance on NLI (Chen et al., 2017b;Gong et al., 2018;Chen et al., 2017c) * corresponding author Premise: A soccer game with multiple males playing. Hypothesis: Some men are playing a sport. Label: Entailment Premise: An older and younger man smiling. Hypothesis: Two men are smiling and laughing at the cats playing on the floor. Label: Neutral Premise: A black race car starts up in front of a crowd of people Hypothesis: A man is driving down a lonely road. Label: Contradiction has been significantly boosted since the release of some high quality large-scale benchmark datasets such as SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2017). Table 1 shows some examples in SNLI. Most state-of-the-art works focus on the interaction architectures between the premise and the hypothesis, while they rarely concerned the discourse relations of the sentences, which is a core issue in natural language understanding.
People usually use some certain set of words to express the discourse relation between two sentences 1 . These words, such as "but" or "and", are denoted as discourse markers. These discourse markers have deep connections with the intrinsic relations of two sentences and intuitively correspond to the intent of NLI, such as "but" to "contradiction", "so" to "entailment", etc.
Very few NLI works utilize this information revealed by discourse markers.  proposed to use discourse markers to help rep-resent the meanings of the sentences. However, they represent each sentence by a single vector and directly concatenate them to predict the answer, which is too simple and not ideal for the largescale datasets.
In this paper, we propose a Discourse Marker Augmented Network for natural language inference, where we transfer the knowledge from the existing supervised task: Discourse Marker Prediction (DMP) , to an integrated NLI model. We first propose a sentence encoder model that learns the representations of the sentences from the DMP task and then inject the encoder to the NLI network. Moreover, because our NLI datasets are manually annotated, each example from the datasets might get several different labels from the annotators although they will finally come to a consensus and also provide a certain label. In consideration of that different confidence level of the final labels should be discriminated, we employ reinforcement learning with a reward defined by the uniformity extent of the original labels to train the model. The contributions of this paper can be summarized as follows.
• Unlike previous studies, we solve the task of the natural language inference via transferring knowledge from another supervised task. We propose the Discourse Marker Augmented Network to combine the learned encoder of the sentences with the integrated NLI model.
• According to the property of the datasets, we incorporate reinforcement learning to optimize a new objective function to make full use of the labels' information.
• We conduct extensive experiments on two large-scale datasets to show that our method achieves better performance than other stateof-the-art solutions to the problem.
2 Task Description 2.1 Natural Language Inference (NLI) In the natural language inference tasks, we are given a pair of sentences (P, H), which respectively means the premise and hypothesis. Our goal is to judge whether their logical relationship between their meanings by picking a label from a small set: entailment (The hypothesis is definitely a true description of the premise), neutral (The hypothesis might be a true description of the premise), and contradiction (The hypothesis is definitely a false description of the premise).

Discourse Marker Prediction (DMP)
For DMP, we are given a pair of sentences (S 1 , S 2 ), which is originally the first half and second half of a complete sentence. The model must predict which discourse marker was used by the author to link the two ideas from a set of candidates.

Sentence Encoder Model
Following , we use BookCorpus  as our training data for discourse marker prediction, which is a dataset of text from unpublished novels, and it is large enough to avoid bias towards any particular domain or application. After preprocessing, we obtain a dataset with the form (S 1 , S 2 , m), which means the first half sentence, the last half sentence, and the discourse marker that connected them in the original text. Our goal is to predict the m given S 1 and S 2 . We first use Glove (Pennington et al., 2014) to transform {S t } 2 t=1 into vectors word by word and subsequently input them to a bi-directional LSTM: where Glove(w) is the embedding vector of the word w from the Glove lookup table, |S t | is the length of the sentence S t . We apply max pooling on the concatenation of the hidden states from both directions, which provides regularization and shorter back-propagation paths (Collobert and Weston, 2008), to extract the features of the whole sequences of vectors: where Max dim means that the max pooling is performed across each dimension of the concatenated vectors, [; ] denotes concatenation. Moreover, we combine the last hidden state from both directions and the results of max pooling to represent our sentences: where r t is the representation vector of the sentence S t . To predict the discource marker between S 1 and S 2 , we combine the representations of them with some linear operation: where is elementwise product. Finally we project r to a vector of label size (the total number of discourse markers in the dataset) and use softmax function to normalize the probability distribution.

Discourse Marker Augmented Network
As presented in Figure 1, we show how our Discourse Marker Augmented Network incorporates the learned encoder into the NLI model.

Encoding Layer
We denote the premise as P and the hypothesis as H. To encode the words, we use the concatenation of following parts: Word Embedding: Similar to the previous section, we map each word to a vector space by using pre-trained word vectors GloVe. Character Embedding: We apply Convolutional Neural Networks (CNN) over the characters of each word. This approach is proved to be helpful in handling out-of-vocab (OOV) words (Yang et al., 2017). POS and NER tags: We use the part-of-speech (POS) tags and named-entity recognition (NER) tags to get syntactic information and entity label of the words. Following (Pan et al., 2017b), we apply the skip-gram model (Mikolov et al., 2013) to train two new lookup tables of POS tags and NER tags respectively. Each word can get its own POS embedding and NER embedding by these lookup tables. This approach represents much better geometrical features than common used one-hot vectors. Exact Match: Inspired by the machine comprehension tasks (Chen et al., 2017a), we want to know whether every word in P is in H (and H in P ). We use three binary features to indicate whether the word can be exactly matched to any question word, which respectively means original form, lowercase and lemma form. For encoding, we pass all sequences of vectors into a bi-directional LSTM and obtain: is the concatenation of the embedding vectors and the feature vectors of the word x, n = |P |, m = |H|.

Interaction Layer
In this section, we feed the results of the encoding layer and the learned sentence encoder into the attention mechanism, which is responsible for linking and fusing information from the premise and the hypothesis words.
We first obtain a similarity matrix A ∈ R n×m between the premise and hypothesis by where v 1 is the trainable parameter, r p and r h are sentences representations from the equation (3) learned in the Section 3, which denote the premise and hypothesis respectively. In addition to previous popular similarity matrix, we incorporate the relevance of each word of P (H) to the whole sentence of H(P ). Now we use A to obtain the attentions and the attended vectors in both directions.
To signify the attention of the i-th word of P to every word of H, we use the weighted sum of u j by A i: :ũ whereũ i is the attention vector of the i-th word of P for the entire H. In the same way, thep j is obtained via:p To model the local inference between aligned word pairs, we integrate the attention vectors with the representation vectors via: where f is a 1-layer feed-forward neural network with the ReLU activation function,p i andû j are local inference vectors. Inspired by (Seo et al., 2016) and (Chen et al., 2017b), we use a modeling layer to capture the interaction between the premise and the hypothesis. Specifically, we use bi-directional LSTMs as building blocks: Here, p M i and u M j are the modeling vectors which contain the crucial information and relationship among the sentences.
We compute the representation of the whole sentence by the weighted average of each word:  where v 2 , v 3 are trainable vectors. We don't share these parameter vectors in this seemingly parallel strucuture because there is some subtle difference between the premise and hypothesis, which will be discussed later in Section 5.

Output Layer
The NLI task requires the model to predict the logical relation from the given set: entailment, neutral or contradiction. We obtain the probability distribution by a linear function with softmax function: where W is a trainable parameter. We combine the representations of the sentences computed above with the representations learned from DMP to obtain the final prediction.

Training
As shown in Table 2, many examples from our datasets are labeled by several people, and the choices of the annotators are not always consistent. For instance, when the label number is 3 in SNLI, "total=0" means that no examples have 3 annotators (maybe more or less); "correct=8748" means that there are 8748 examples whose number of correct labels is 3 (the number of annotators maybe 4 or 5, but some provided wrong labels). Although all the labels for each example will be unified to a final (correct) label, diversity of the labels for a single example indicates the low confidence of the result, which is not ideal to only use the final label to optimize the model. We propose a new objective function that combines both the log probabilities of the ground-truth label and a reward defined by the property of the datasets for the reinforcement learning. The most widely used objective function for the natural language inference is to minimize the negative log cross-entropy loss: where Θ are all the parameters to optimize, N is the number of examples in the dataset, d l is the probability of the ground-truth label l. However, directly using the final label to train the model might be difficult in some situations, where the example is confusing and the labels from the annotators are different. For instance, consider an example from the SNLI dataset: • P : "A smiling costumed woman is holding an umbrella." • H: "A happy woman in a fairy costume holds an umbrella." The final label is neutral, but the original labels from the five annotators are neural, neural, entailment, contradiction, neural, in which case the relation between "smiling" and "happy" might be under different comprehension. The final label's confidence of this example is obviously lower than an example that all of its labels are the same. To simulate the thought of human being more closely, in this paper, we tackle this problem by using the REINFORCE algorithm (Williams, 1992) to minimize the negative expected reward, which is defined as: where π(l|P, H) is the previous action policy that predicts the label given P and H, {l * } is the set of annotated labels, and is the reward function defined to measure the distance to all the ideas of the annotators. To avoid of overwriting its earlier results and further stabilize training, we use a linear function to integrate the above two objective functions: where λ is a tunable hyperparameter.  Table 3: Statistics of discouse markers in our dataset from BookCorpus.

Datasets
BookCorpus: We use the dataset from BookCorpus  to pre-train our sentence encoder model. We preprocessed and collected discourse markers from BookCorpus as . We finally curated a dataset of 6527128 pairs of sentences for 8 discourse markers, whose statistics are shown in Table 3. SNLI: Stanford Natural Language Inference(Bowman et al., 2015) is a collection of more than 570k human annotated sentence pairs labeled for entailment, contradiction, and semantic independence. SNLI is two orders of magnitude larger than all other resources of its type. The premise data is extracted from the captions of the Flickr30k corpus (Young et al., 2014), the hypothesis data and the labels are manually annotated. The original SNLI corpus contains also the other category, which includes the sentence pairs lacking consensus among multiple human annotators. We remove this category and use the same split as in (Bowman et al., 2015) and other previous work. MultiNLI: Multi-Genre Natural Language Inference (Williams et al., 2017) is another large-scale corpus for the task of NLI. MultiNLI has 433k sentences pairs and is in the same format as SNLI, but it includes a more diverse range of text, as well as an auxiliary test set for cross-genre transfer evaluation. Half of these selected genres appear in training set while the rest are not, creating in-domain (matched) and cross-domain (mismatched) development/test sets.

Implementation Details
We use the Stanford CoreNLP toolkit  to tokenize the words and generate POS and NER tags. The word embeddings are initialized by 300d Glove (Pennington et al., 2014), the dimensions of POS and NER embeddings are 30 and 10. The dataset we use to train the embeddings of POS tags and NER tags are the training set given by SNLI. We apply Tensorflow r1.3 as our neural network framework. We set the hidden size as 300 for all the LSTM layers and apply dropout (Srivastava et al., 2014) between layers with an initial ratio of 0.9, the decay rate as 0.97 for every 5000 step. We use the AdaDelta for optimization as described in (Zeiler, 2012) with ρ as 0.95 and as 1e-8. We set our batch size as 36 and the initial learning rate as 0.6. The parameter λ in the objective function is set to be 0.2. For DMP task, we use stochastic gradient descent with initial learning rate as 0.1, and we anneal by half each time the validation accuracy is lower than the previous epoch. The number of epochs is set to be 10, and the feedforward dropout rate is 0.2. The learned encoder in subsequent NLI task is trainable.

Results
In  (2016) proposed a simple baseline that uses LSTM to encode the whole sentences and feed them into a MLP classifier to predict the final inference relationship, they achieve an accuracy of 80.6% on SNLI. Nie and Bansal (2017) test their model on both SNLI and MiltiNLI, and achieves competitive results.
In the medium part, we show the results of other neural network models. Obviously, the performance of most of the integrated methods are better than the sentence encoding based models above. Both DIIN (Gong et al., 2018) and We present the ensemble results on both datasets in the bottom part of the table 4. We build an ensemble model which consists of 10 single models with the same architecture but initialized with different parameters. The performance of our model achieves 89.6% on SNLI, 80.3% on matched MultiNLI and 79.4% on mismatched MultiNLI, which are all state-of-the-art results.

Ablation Analysis
As shown in Table 5, we conduct an ablation experiment on SNLI development dataset to evaluate the individual contribution of each component of our model. Firstly we only use the results of the sentence encoder model to predict the answer, in other words, we represent each sentence by a single vector and use dot product with a linear function to do the classification. The result is obviously not satisfactory, which indicates that only using sentence embedding from discourse markers to predict the answer is not ideal in large-scale datasets. We then remove the sentence encoder model, which means we don't use the knowledge transferred from the DMP task and thus the representations r p and r h are set to be zero vectors in the equation (6) and the equation (12). We observe that the performance drops significantly to 87.24%, which is nearly 1.5% to our DMAN model, which indicates that the discourse markers have deep connections with the logical relations between two sentences they links. When Figure 2: Performance when the sentence encoder is pretrained on different discourse markers sets. "NONE" means the model doesn't use any discourse markers; "ALL" means the model use all the discourse markers. we remove the character-level embedding and the POS and NER features, the performance drops a lot. We conjecture that those feature tags help the model represent the words as a whole while the char-level embedding can better handle the outof-vocab (OOV) or rare words. The exact match feature also demonstrates its effectiveness in the ablation result. Finally, we ablate the reinforcement learning part, in other words, we only use the original loss function to optimize the model (set λ = 1). The result drops about 0.5%, which proves that it is helpful to utilize all the information from the annotators.

Semantic Analysis
In Figure 2, we show the performance on the three relation labels when the model is pre-trained on different discourse markers sets. In other words, we removed discourse marker from the original set each time and use the rest 7 discourse markers to pre-train the sentence encoder in the DMP task and then train the DMAN. As we can see, there is a sharp decline of accuracy when removing "but", "because" and "although". We can intuitively speculate that "but" and "although" have direct connections with the contradiction label (which drops most significantly) while "because" has some links with the entailment label. We observe that some discourse markers such as "if" or "before" contribute much less than other words which have strong logical hints, although they actually improve the performance of the model. Compared to the other two categories, the "contradiction" label examples seem to benefit the most from the pre-trained sentence encoder.

Visualization
In Figure 3, we also provide a visualized analysis of the hidden representation from similarity matrix A (computed in the equation (6)) in the situations that whether we use the discourse markers or not. We pick a sentence pair whose premise is "3 young man in hoods standing in the middle of a quiet street facing the camera." and hypothesis is "Three people sit by a busy street bareheaded." We observe that the values are highly correlated among the synonyms like "people" with "man", "three" with "3" in both situations. However, words that might have contradictory meanings like "hoods" with "bareheaded", "quiet" with "busy" perform worse without the discourse markers augmentation, which conforms to the conclusion that the "contradiction" label examples benefit a lot which is observed in the Section 5.5. 6 Related Work

Discourse Marker Applications
This work is inspired most directly by the DisSent model and Discourse Prediction Task of , which introduce the use of the discourse markers information for the pretraining of sentence encoders. They follow  to collect a large sentence pairs corpus from Book-Corpus  and propose a sentence representation based on that. They also apply their pre-trained sentence encoder to a series of natural language understanding tasks such as sentiment analysis, question-type, entailment, and relatedness. However, all those datasets are provided by Conneau et al. (2017) for evaluating sentence embeddings and are almost all small-scale and are not able to support more complex neural network. Moreover, they represent each sentence by a single vector and directly combine them to predict the answer, which is not able to interact among the words level.
In closely related work, Jernite et al. (2017) propose a model that also leverage discourse relations. However, they manually group the discourse markers into several categories based on human knowledge and predict the category instead of the explicit discourse marker phrase. However, the size of their dataset is much smaller than that in , and sometimes there has been disagreement among annotators about what exactly is the correct categorization of discourse relations (Hobbs, 1990).
Unlike previous works, we insert the sentence encoder into an integrated network to augment the semantic representation for NLI tasks rather than directly combining the sentence embeddings to predict the relations.

Natural Language Inference
Earlier research on the natural language inference was based on small-scale datasets (Marelli et al., 2014), which relied on traditional methods such as shallow methods (Glickman et al., 2005), natural logic methods(MacCartney and Manning, 2007), etc. These datasets are either not large enough to support complex deep neural network models or too easy to challenge natural language.
Large and complicated networks have been successful in many natural language processing tasks (Zhu et al., 2017;Chen et al., 2017e;Pan et al., 2017a). Recently, Bowman et al. (2015) released Stanford Natural language Inference (SNLI) dataset, which is a high-quality and large-scale benchmark, thus inspired many significant works (Bowman et al., 2016;Mou et al., 2016;Vendrov et al., 2016;Conneau et al., 2017;Gong et al., 2018;McCann et al., 2017;Chen et al., 2017b;Choi et al., 2017;Tay et al., 2017). Most of them focus on the improvement of the interaction architectures and obtain competitive results, while transfer learning from external knowledge is popular as well. Vendrov et al. (2016) incorpated Skipthought , which is an unsupervised sequence model that has been proven to generate useful sentence embedding. McCann et al. (2017) proposed to transfer the pre-trained encoder from the neural machine translation (NMT) to the NLI tasks.
Our method combines a pre-trained sentence encoder from the DMP task with an integrated NLI model to compose a novel framework. Furthermore, unlike previous studies, we make full use of the labels provided by the annotators and employ policy gradient to optimize a new objective function in order to simulate the thought of human being.

Conclusion
In this paper, we propose Discourse Marker Augmented Network for the task of the natural language inference. We transfer the knowledge learned from the discourse marker prediction task to the NLI task to augment the semantic representation of the model. Moreover, we take the various views of the annotators into consideration and employ reinforcement learning to help optimize the model. The experimental evaluation shows that our model achieves the state-of-the-art results on SNLI and MultiNLI datasets. Future works involve the choice of discourse markers and some other transfer learning sources.

Acknowledgements
This work was supported in part by the National Nature Science Foundation of China (Grant Nos: 61751307), in part by the grant ZJU Research 083650 of the ZJUI Research Program from Zhejiang University and in part by the National Youth Top-notch Talent Support Program. The experiments are supported by Chengwei Yao in the Experiment Center of the College of Computer Science and Technology, Zhejiang university.