Jiangnan at SemEval-2018 Task 11: Deep Neural Network with Attention Method for Machine Comprehension Task

This paper describes our submission for the International Workshop on Semantic Evaluation (SemEval-2018) shared task 11: Machine Comprehension using Commonsense Knowledge (Ostermann et al., 2018b). We use a deep neural network model to choose the correct answer from a pair of candidate answers, given a document and a question. The interactions between document, question and answers are modeled with an attention mechanism, and a variety of manual features are used to improve model performance. We also use CoVe (McCann et al., 2017) as an external source of knowledge that is not mentioned in the document. Our system achieves 80.91% accuracy on the test data, ranking third on the leaderboard.


Introduction
In recent years, machine reading comprehension (MRC), which attempts to enable machines to answer questions given a set of documents, has attracted great attention. Several MRC datasets have been released, such as the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016) and the Microsoft MAchine Reading COmprehension Dataset (MS-MARCO) (Nguyen et al., 2016). These datasets provide large amounts of manually created data and have greatly inspired research in this field. A series of neural network models, such as BiDAF (Seo et al., 2016) and R-Net (Wang et al., 2017), have achieved promising results on these evaluation tasks. However, machine reading comprehension is still a difficult task because, without knowledge, machines cannot really understand the question and give a correct answer.
To discover how machine reading comprehension systems benefit from commonsense knowledge, Ostermann et al. (2018b) developed the Machine Comprehension using Commonsense Knowledge task. In this task, commonsense knowledge takes the form of script knowledge, defined as knowledge about everyday activities that is mentioned in narrative documents. For each document, a series of questions is asked, and each question is associated with a pair of candidate answers. Machines have to choose the correct one. To make correct decisions, machines need both explicit information found in the document and external commonsense knowledge. Table 1 shows an example from the dataset.
In this paper, we describe our submission system for the task. The system is based on a deep neural network model. The input of the model is a (document, question, answer) triple and the output is the probability that the answer is the correct one for the given document and question. We also combine the neural network model with a variety of manual features, including word exact match features and token features such as part-of-speech (POS), named entity recognition (NER) and term frequency (TF). These manual features help with questions whose correct answer can be easily found in the given document.
Furthermore, for the more complicated case in which the answer is not explicitly mentioned in the document, we model the interactions between document, question and answer by computing attention scores of the question to the document and of the question to the answer, as described in Lee et al. (2016). These features add soft alignments between similar but non-identical words (Chen et al., 2017). We evaluate our system on the shared task and obtain 80.91% accuracy on the test set, ranking third on the leaderboard.
The rest of this paper is organized as follows. Section 2 describes the submission system. Section 3 presents and discusses the experimental results. Section 4 concludes our work.

Table 1: An example from the dataset (Ostermann et al., 2018b). The first line shows the document and the following lines show the questions and their answer pairs. The answer to question 1 can be easily found in the text, while answering question 2 requires external knowledge that is not mentioned in the text.

Model
In this task, a document ($D$), a question ($Q$) and a pair of answers ($A_0$, $A_1$) are given, and a machine comprehension system should choose the correct answer from the pair. We approach this problem with a deep neural network model that produces the probability $p_\theta(A_i \mid D, Q)$, $i \in \{0, 1\}$, that the input answer is correct for the given document and question, where $\theta$ denotes the set of all trainable parameters of the model. The system predicts this probability for each answer in ($A_0$, $A_1$) and decides which is correct by comparing the two scores. The model consists of three parts: an encode layer, an interaction layer and a final inference layer, as depicted in Figure 1. Below we discuss the model in more detail.

Encode layer
We first represent all tokens of the document $\{d_1, ..., d_m\}$, question $\{q_1, ..., q_n\}$ and answer $\{a_1, ..., a_l\}$ as sequences of word embeddings, where $m$, $n$ and $l$ are the sequence lengths of the document, question and answer respectively. In this task, we use the 300-dimensional 840B GloVe word embeddings (Pennington et al., 2014). We then pass each sequence through a multi-layer bidirectional long short-term memory network (BiLSTM), whose layers we index by $j$, to get word-level semantic representations of each sequence. We concatenate the outputs of all BiLSTM layers to get the final word-level representations $h^d$, $h^q$ and $h^a$. The BiLSTM layers that encode the document, question and answer sequences share the same parameters, which reduces the number of trainable parameters and makes the model less prone to overfitting.

Interaction layer
This layer models the interactions between document, question and answer. We first align the word representation vectors of the question sequence to the document and the answer with an attention mechanism, which gives question-aware representations $Att^d$ and $Att^a$ for the document and answer respectively. The attention score $s^d_{i,j}$ captures the similarity between the word representation vectors $d_i$ and $q_j$ of the document sequence and the question sequence, and $s^a_{i,j}$ captures the similarity between the answer vector $a_i$ and the question vector $q_j$. We obtain $s^d_{i,j}$ and $s^a_{i,j}$ by computing the dot product between nonlinear mappings of the two word representation vectors, where the mapping $\alpha(\cdot)$ is a single dense layer with ReLU nonlinearity. We append $Att^d_i$ and $Att^a_i$ to each $h^d_i$ and $h^a_i$ to get new word representation vectors $r^d$ and $r^a$ for the document and answer.
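The alignment step can be illustrated with a minimal pure-Python sketch. It assumes that the scores are normalized with a softmax over question positions and that the question-aware vector is the resulting weighted sum of question vectors, in the style of Lee et al. (2016) and Chen et al. (2017); it also assumes that the same mapping $\alpha(\cdot)$ is applied to both sides. The function names and the toy weights are hypothetical and are not taken from the submitted system.

```python
import math

def relu_dense(vec, weight, bias):
    """alpha(.): a single dense layer followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, vec)) + b)
            for row, b in zip(weight, bias)]

def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def question_aware(sequence, question, weight, bias):
    """For each vector in `sequence`, return an attention-weighted sum of question vectors."""
    mapped_q = [relu_dense(q, weight, bias) for q in question]
    dim = len(question[0])
    attended = []
    for vec in sequence:
        mapped_x = relu_dense(vec, weight, bias)
        # s_{i,j} = alpha(x_i) . alpha(q_j)
        scores = [sum(a * b for a, b in zip(mapped_x, mq)) for mq in mapped_q]
        weights = softmax(scores)  # assumption: normalize over question positions
        attended.append([sum(w * q[k] for w, q in zip(weights, question))
                         for k in range(dim)])
    return attended

# Toy inputs: two document vectors and two question vectors of size 2.
h_d = [[1.0, 0.0], [0.0, 1.0]]
h_q = [[2.0, 0.0], [0.0, 2.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_bias = [0.0, 0.0]

att_d = question_aware(h_d, h_q, identity, zero_bias)
# Concatenate Att^d_i behind each h^d_i to form r^d_i.
r_d = [h + a for h, a in zip(h_d, att_d)]
```

In this toy case the first document vector attends mostly to the first question vector, and each $r^d_i$ has twice the size of $h^d_i$. The same function would be applied to the answer sequence to obtain $Att^a$.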
Following Chen et al. (2017), we combine the model with a variety of manual features, including word exact match features and token features. For exact match features, we use three binary features indicating whether a token in $d$ or $a$ exactly matches a token in $q$ in its original, lowercase or lemma form. For token features, we use part-of-speech (POS), named entity recognition (NER) and term frequency (TF).
For the document and answer, we collect the manual features into vectors $f^d_i$ and $f^a_i$ and concatenate them to $r^d_i$ and $r^a_i$, which gives the updated word-level representation vectors $r^d_i$ and $r^a_i$.
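As an illustration of the three exact match features, the following sketch marks each document (or answer) token by whether it matches a question token in original, lowercase or lemma form. The lemmatizer here is a hypothetical toy lookup table standing in for a real lemmatizer such as the one in Stanford CoreNLP, and the function names are assumptions made for this example.

```python
def exact_match_features(tokens, question_tokens, lemmatize):
    """Return one [original, lowercase, lemma] binary feature triple per token."""
    q_original = set(question_tokens)
    q_lower = {tok.lower() for tok in question_tokens}
    q_lemma = {lemmatize(tok) for tok in question_tokens}
    return [
        [float(tok in q_original),
         float(tok.lower() in q_lower),
         float(lemmatize(tok) in q_lemma)]
        for tok in tokens
    ]

# Hypothetical toy lemmatizer: a lookup table with a lowercase fallback.
TOY_LEMMAS = {"planted": "plant", "plants": "plant", "trees": "tree"}

def toy_lemmatize(token):
    return TOY_LEMMAS.get(token.lower(), token.lower())

document = ["She", "planted", "a", "Tree"]
question = ["Who", "plants", "the", "tree", "?"]
features = exact_match_features(document, question, toy_lemmatize)
```

Here "planted" matches "plants" only through the lemma feature, and "Tree" matches "tree" through the lowercase and lemma features but not the original form. In the full model these triples would be joined with the POS, NER and TF features to form $f^d_i$ and $f^a_i$.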

Inference layer
In the inference layer, we first convert the document and answer sequences $r^d$ and $r^a$ into fixed-length vectors with a weighted pooling method, which gives the sequence-level representation vectors $R^d$ and $R^a$. The weight vectors $w^d$ and $w^a$ are learnable parameters of the model. Since the model up to this point uses no external source of knowledge, we add a pre-trained model as external knowledge in order to capture implicit information that is not mentioned in the document. Here we apply CoVe (McCann et al., 2017) to the document and answer sequences. The GloVe embedding of each token is passed through a pre-trained BiLSTM layer, which outputs sequences of CoVe vectors $c^d$ and $c^a$ for the document and answer. We then convert these sequences into fixed-length vectors $C^d$ and $C^a$ with the weighted pooling method described above.
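A minimal sketch of weighted pooling is given below. It assumes that each position is scored by the dot product of its vector with the learnable weight vector, that the scores are normalized with a softmax, and that the pooled vector is the weighted sum of the sequence; this is one common form of such pooling and is not confirmed to be the exact formula used in the system. The name `weighted_pool` is hypothetical.

```python
import math

def weighted_pool(sequence, weight):
    """Pool a sequence of vectors into one vector using softmax(weight . r_i) weights."""
    scores = [sum(w * x for w, x in zip(weight, vec)) for vec in sequence]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    dim = len(sequence[0])
    return [sum(e / total * vec[k] for e, vec in zip(exps, sequence))
            for k in range(dim)]

r_d = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# A zero weight vector scores every position equally, so pooling is a plain average.
uniform = weighted_pool(r_d, [0.0, 0.0])

# A weight vector that favours the first dimension nearly ignores the second position.
focused = weighted_pool(r_d, [10.0, 0.0])
```

The output has the same dimensionality as a single input vector regardless of sequence length, which is what allows the document and answer to be compared directly in the final scoring step.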
We fuse the pooled CoVe vectors with the sequence-level representation vectors using a semantic fusion unit (SFU) (Hu et al., 2017) to get the final sequence-level representation vectors $R^d$ and $R^a$. Finally, we compute the probability that the answer is correct as the bilinear match score of the document and answer vectors, using a trainable matrix $W$ and the sigmoid function $\sigma(\cdot)$. In this task, we use this model to predict the probability for each answer in ($A_0$, $A_1$) and select the answer with the higher probability score.
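The fusion and scoring steps can be sketched as follows. The SFU is written in the gated form of Hu et al. (2017): a tanh candidate and a sigmoid gate are both computed from the concatenated inputs, and the gate interpolates between the candidate and the primary input. The bilinear score is read from the prose as $\sigma({R^d}^\top W R^a)$. All function names, shapes and toy weights are assumptions for illustration only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(matrix, bias, vec):
    return [sum(m * v for m, v in zip(row, vec)) + b
            for row, b in zip(matrix, bias)]

def sfu(primary, extra, w_r, b_r, w_g, b_g):
    """Gated fusion: gate * tanh candidate + (1 - gate) * primary."""
    joined = primary + extra  # list concatenation [primary; extra]
    candidate = [math.tanh(v) for v in affine(w_r, b_r, joined)]
    gate = [sigmoid(v) for v in affine(w_g, b_g, joined)]
    return [g * c + (1.0 - g) * p
            for g, c, p in zip(gate, candidate, primary)]

def bilinear_probability(doc_vec, w, ans_vec):
    """sigma(doc^T W ans): probability that the answer is correct."""
    projected = [sum(m * a for m, a in zip(row, ans_vec)) for row in w]
    return sigmoid(sum(d * p for d, p in zip(doc_vec, projected)))

def choose_answer(doc_vec, w, answer_vecs):
    """Score every candidate answer and return (index of best answer, probabilities)."""
    probs = [bilinear_probability(doc_vec, w, a) for a in answer_vecs]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs

# Toy fusion: a strongly negative gate bias keeps the primary vector unchanged.
pooled = [0.5, -0.5]
cove = [1.0]
w_r = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
w_g = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
closed = sfu(pooled, cove, w_r, [0.0, 0.0], w_g, [-50.0, -50.0])
opened = sfu(pooled, cove, w_r, [0.0, 0.0], w_g, [50.0, 50.0])

# Toy scoring: the second candidate aligns with the document vector.
identity = [[1.0, 0.0], [0.0, 1.0]]
best, probs = choose_answer([1.0, 0.0], identity, [[-1.0, 0.0], [2.0, 0.0]])
```

With a closed gate the fused vector equals the pooled representation, and with an open gate it equals the tanh candidate computed from the CoVe input. The two answer probabilities are computed independently, so they need not sum to one; only their order matters for the final decision.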

Datasets
The statistics of the official training, development and test data are shown in Table 2.

                 Training  Dev    Test
Num of examples  9,731     1,411  2,797

We remove words occurring fewer than 2 times, which leaves about 12,000 words in the vocabulary. We keep most pre-trained word embeddings fixed during training and fine-tune only those of the 100 most frequent words. For the manual features, we obtain POS and NER tags with the Stanford CoreNLP toolkit.
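The vocabulary filtering and the choice of fine-tuned words can be sketched with the standard library. The function name `build_vocab` and its parameters are hypothetical, and details such as casing or tokenization are not specified in the text and are left out here.

```python
from collections import Counter

def build_vocab(token_lists, min_count=2, num_tuned=100):
    """Keep words seen at least `min_count` times; mark the most frequent ones for fine-tuning."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # most_common() sorts by descending frequency.
    vocab = [word for word, count in counts.most_common() if count >= min_count]
    tuned = set(vocab[:num_tuned])
    return vocab, tuned

texts = [["the", "cat", "the"], ["cat", "sat", "the"]]
vocab, tuned = build_vocab(texts, min_count=2, num_tuned=1)
```

In the toy input, "sat" occurs once and is dropped, and only the embedding of "the" would be updated during training while the rest stay fixed.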

Experimental Settings
We implement our model in PyTorch. The model is trained on the given training set, and we select the model that performs best on the development set across training epochs. We train the model with a mini-batch size of 32. We use a two-layer BiLSTM with 128 hidden units. A dropout rate of 0.4 is applied to the word embeddings and to all hidden units in the BiLSTM layers. We use logistic loss as the loss function and optimize it with the Adamax optimizer (Kingma and Ba, 2014) with learning rate η = 0.002.
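For reference, the settings above can be collected in one place together with a sketch of the logistic loss on a single (document, question, answer) triple. The dictionary keys are hypothetical names, and treating the correct answer as label 1 and the incorrect one as label 0 is an assumption about how training examples are formed.

```python
import math

# Hyperparameters as stated in the text; the key names are illustrative only.
SETTINGS = {
    "batch_size": 32,
    "bilstm_layers": 2,
    "hidden_units": 128,
    "dropout": 0.4,
    "optimizer": "Adamax",
    "learning_rate": 0.002,
}

def logistic_loss(prob, label, eps=1e-12):
    """Binary cross-entropy between a predicted probability and a 0/1 label."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))

confident_right = logistic_loss(0.9, 1)
confident_wrong = logistic_loss(0.9, 0)
```

A confident correct prediction gives a small loss and a confident wrong one a large loss, which pushes the score of the correct answer above that of the incorrect one.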

Results
The performance of our model is shown in Table 3. The single model achieves an accuracy of 85.05% on the development data and 79.03% on the test data. The ensemble model, which we submitted to the shared task, achieves an accuracy of 87.30% on the development data and 80.91% on the test data. For both the single model and the ensemble model there is a gap between development and test accuracy: the model overfits the development data and performs less well on the test data. This shows that the robustness of our model needs to be improved.
We conduct an ablation analysis of the different features used in the model on the development data. Table 4 shows the results, from which we can see that all the features we use contribute to model performance. Without manual features, the model accuracy is 83.70%, about 1.3 percentage points lower than the full model, and without CoVe the accuracy drops by 1.8 points. The accuracy drops by 6.6 points when neither manual features nor CoVe are used. The results show that the model requires both explicit information found in the document and an external source of knowledge to make correct decisions.

Conclusion
In this paper, we describe the system we submitted to SemEval-2018 shared task 11. The system is based on a deep neural network model that chooses the correct answer from a pair of answers, given a document and a question. We combine the model with a variety of manual features, which help with questions whose correct answer can be easily found in the given document. For questions whose answer is not explicitly mentioned in the document, we model the interactions between document, question and answers with an attention mechanism. We also use CoVe as an external source of knowledge. Our experiments show that the features we use contribute to model performance.
Our system achieves 80.91% accuracy on the test data, ranking third on the leaderboard.