ECNU at SemEval-2018 Task 11: Using Deep Learning Method to Address Machine Comprehension Task

This paper describes the system we submitted to Task 11 of SemEval 2018, i.e., Machine Comprehension Using Commonsense Knowledge. Given a passage and questions that each have two candidate answers, this task requires the participating system to select, for each question, the answer that best matches the meaning of the original text or commonsense knowledge. For this task, we use a deep learning method to obtain the final predicted answer by computing the relevance between the choice representations and a question-aware document representation.


Introduction
In recent years, the release of challenging, large-scale reading comprehension corpora has driven the development of technology for machine reading comprehension, and most of these machine comprehension datasets do not require commonsense knowledge to answer questions. The purpose of the Machine Comprehension Using Commonsense Knowledge task in SemEval 2018 is to provide a platform for finding ways for machines to better understand text and answer questions based on it, and to encourage participants to make use of external resources (e.g., DeScript, narrative chains, Wikipedia, etc.) to improve system performance (Ostermann et al., 2018b). Task 11 is a multiple-choice machine comprehension task, which requires a system to read a narrative text about everyday activities (Ostermann et al., 2018a) and then answer multiple-choice questions based on this text. Some questions must be answered according to the original text, while others can be answered with commonsense knowledge. Each question is associated with a set of two answers. Table 1 gives an example of the dataset.
To address this machine comprehension task, we utilized rule-based methods and a deep learning method. Our final submission uses the Gated-Attention Reader (Dhingra et al., 2016) to fuse question information into the document and acquire a question-aware document representation; the degree of interaction between the choices and the document is regarded as the probability of each choice being returned as the answer. Neither method uses additional commonsense knowledge, which may explain the poor performance of our system. In future work, we may explore more methods to integrate commonsense knowledge into models.
The rest of this paper is organized as follows. Section 2 describes our systems. Section 3 describes the datasets and experimental settings and analyzes the results. Finally, Section 4 concludes this work.

Task Description
Formally, this multiple-choice machine comprehension task can be expressed as a quadruple <D, Q, A, a>, where D is a narrative text about everyday activities, Q is a question about the content of the narrative text, A is the set of candidate answer choices for the question (in this task, two candidate choices a_0 and a_1), and a is the correct answer. The system is expected to select the answer from A that best answers Q according to the evidence in document D or commonsense knowledge.

Two Rule-based Baselines
First of all, we implemented a rule-based system proposed in (Richardson et al., 2013), which uses the sliding-window (SW) and word distance-based (WD) algorithms to calculate answer scores according to simple rules and return the highest-scoring answer. We also tried the improved SW and WD algorithms proposed in (Smith et al., 2015), which further improved system performance. The two algorithms are described as follows.

Sliding-Window: Given a data sample <D, Q, a_0 (or a_1), a>, we first calculate the inverse word count of each word in document D. We then set a window that slides word by word from the beginning of the document to the end. At each position, the sum of the inverse word counts of all window words that appear in the question Q or the candidate choice a_0 (or a_1) is the score of the window. Once the window has reached the end of the passage, we take the highest window score as the final score of the candidate choice a_0 (or a_1). The window size is the size of the union of the question Q and the choice a_0 (or a_1), and the window slides over the full passage only once. In the improved SW algorithm, the window size ranges from 2 to 30; the window passes over the full passage several times, increasing its size by one after each pass, and the sum of the scores over all passes serves as the improved sliding-window score.
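As a concrete illustration, the basic SW algorithm can be sketched in Python as follows. The function names and the log-based inverse-count weighting are our own rendering of the description above, not the submitted code:

```python
from collections import Counter
import math

def inverse_counts(doc_tokens):
    """Inverse word counts IC(w) = log(1 + 1/count(w)), as in Richardson et al. (2013)."""
    counts = Counter(doc_tokens)
    return {w: math.log(1.0 + 1.0 / c) for w, c in counts.items()}

def sliding_window_score(doc_tokens, question_tokens, choice_tokens):
    """Score one candidate choice with the basic SW algorithm."""
    ic = inverse_counts(doc_tokens)
    target = set(question_tokens) | set(choice_tokens)
    window = len(target)  # window size = |Q ∪ choice|
    best = 0.0
    # slide the window word by word over the full passage, once
    for start in range(max(1, len(doc_tokens) - window + 1)):
        score = sum(ic[w] for w in doc_tokens[start:start + window] if w in target)
        best = max(best, score)
    return best
```

The improved variant would wrap this in a loop over window sizes 2 to 30 and sum the scores of all passes.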
Word Distance-based: Given a data sample <D, Q, a_0 (or a_1), a>, we first define two sets, set_dq and set_dc: set_dq is the intersection of the question words and the document words, and set_dc is the intersection of the choice words and the document words. If neither set_dq nor set_dc is empty, we calculate the shortest distance in the document between a word of set_dq and a word of set_dc, denote this shortest distance as d_min, and the word distance score of the choice is (d_min + 1) / (|D| − 1); otherwise, the word distance score of the choice is zero.
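The WD score can be sketched as follows. The names `set_dq`, `set_dc`, and `d_min` follow the text; the helper itself is illustrative:

```python
def word_distance_score(doc_tokens, question_tokens, choice_tokens):
    """WD score: normalized shortest in-document distance between a
    question word and a choice word (a sketch of Richardson et al., 2013)."""
    positions = {}
    for i, w in enumerate(doc_tokens):
        positions.setdefault(w, []).append(i)
    set_dq = set(question_tokens) & set(positions)  # question ∩ document
    set_dc = set(choice_tokens) & set(positions)    # choice ∩ document
    if not set_dq or not set_dc:
        return 0.0
    d_min = min(abs(i - j)
                for wq in set_dq for i in positions[wq]
                for wc in set_dc for j in positions[wc])
    return (d_min + 1) / (len(doc_tokens) - 1)
```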
The sliding-window score minus the word distance-based score is the final score of the choice. We separately calculate the scores of the two choices for each question and then select the choice with the higher score as the answer.

Deep Learning Model
Both of the above unsupervised methods score the overlap between each answer and the document by sliding a window over the document. We therefore roughly counted the proportion of correct answers whose words all appear in the document[1] and found that this proportion is not high. This shows that there is a limit to how much the above methods can improve system performance. Hence, we use a deep learning approach to model passage representations. Inspired by (Lai et al., 2017), we use the state-of-the-art Gated-Attention Reader, which performs well on several datasets. Given a data sample <D, Q, A, a>, the model processes it in the steps described below; Figure 1 shows the system.
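The coverage estimate described here can be sketched as follows. The helper names are hypothetical; `samples` pairs each tokenized document with its tokenized correct answer:

```python
def fully_covered(answer_tokens, doc_tokens):
    """True if every answer word also occurs in the document, i.e.
    |answer words ∩ document words| / |answer words| == 1."""
    return set(answer_tokens) <= set(doc_tokens)

def coverage_rate(samples):
    """Fraction of correct answers whose words all appear in the document.
    samples: list of (doc_tokens, correct_answer_tokens) pairs."""
    covered = sum(fully_covered(ans, doc) for doc, ans in samples)
    return covered / len(samples)
```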

Passage, Question and Choice Encoder
First, each word in D, Q, and the choices (the two choices in candidate answer set A) is mapped to a d-dimensional vector; the 300-dim GloVe embeddings (Pennington et al., 2014) are used. For the input word vectors of D, we also include a 5-dim binary feature that indicates the overlap between the document and the question (or choices), inspired by (Chen et al., 2017). The five dimensions of the binary match feature represent whether the word appears in the question, in choice a_0, in choice a_1, in both the question and choice a_0, and in both the question and choice a_1, respectively. Taking the passage as an example, we have document D: x^D_1, x^D_2, ..., x^D_m ∈ R^{|D| × dim}, and we use a bi-directional GRU to encode each document word embedding. As for the choices, we concatenate the last forward hidden state h(fwd)^C_n and the last backward hidden state h(bwd)^C_1 to form a vector representing each choice, so we obtain C_0 ∈ R^{2d} and C_1 ∈ R^{2d}.

[1] We estimate how many answers appear entirely in the document as follows: if |answer words ∩ document words| / |answer words| = 1, the answer appears entirely in the document, where |A| denotes the size of set A. We then calculate |ans_ce| / |ans_c|, where ans_ce is the set of correct answers that appear entirely in the document and ans_c is the set of all correct answers. The percentage of correct answers that appear entirely in the document is about 24%.
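The 5-dim binary match feature can be sketched as follows. The function name and the set-based token matching are our own; the actual system may match on lowercased or lemmatized tokens:

```python
def match_features(doc_tokens, question_tokens, c0_tokens, c1_tokens):
    """5-dim binary feature per document word: in question, in choice a_0,
    in choice a_1, in both question and a_0, in both question and a_1."""
    q, c0, c1 = set(question_tokens), set(c0_tokens), set(c1_tokens)
    feats = []
    for w in doc_tokens:
        in_q, in_c0, in_c1 = w in q, w in c0, w in c1
        feats.append([int(in_q), int(in_c0), int(in_c1),
                      int(in_q and in_c0), int(in_q and in_c1)])
    return feats
```

These 5-dim vectors would be concatenated to the 300-dim GloVe embeddings before the bi-GRU encoder.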

Summarize Question-aware Passage Representation

The interaction layer of the Gated-Attention Reader is an l-layer multi-hop architecture with gated-attention units. Each multi-hop layer contains a bi-GRU and a gated-attention unit. As shown in Figure 1, we send Q_e ∈ R^{|Q| × 2d} and D_e ∈ R^{|D| × 2d} into a gated-attention unit. The gated-attention unit fuses information from the question into each document token and generates a set of vectors D^{GA(l)} = {d^{(l)}_1, ..., d^{(l)}_m}, where the superscript (l) denotes the l-th multi-hop layer. To generate D^{GA(l)}, the question first soft-attends to each document word to obtain attention weights α_i; then we use α_i to compute a weighted question representation q_i for the i-th word in D; finally, the weighted question representation q_i is element-wise multiplied by h_i to produce d_i. The specific calculation steps of a gated-attention unit are as follows:

α_i = softmax(Q_e^T h_i)    (1)
q_i = Q_e α_i               (2)
d_i = h_i ⊙ q_i             (3)

After obtaining the current layer's question-aware document representation, we feed this representation into the next hop layer; after l multi-hop layers, we obtain the final set of question-aware vectors D^{GA(l)} for the document. Finally, we send D^{GA(l)} into one more bi-GRU layer and concatenate the last outputs of each direction to obtain the ultimate question-aware document representation vector D ∈ R^{2d}.

Figure 1: Architecture of our system.
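The three steps of a gated-attention unit can be sketched in NumPy as follows. Shapes follow the text; this is an illustrative sketch of the unit from Dhingra et al. (2016), not the submitted code:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_attention(D_e, Q_e):
    """One gated-attention unit.
    D_e: (|D|, 2d) document token encodings; Q_e: (|Q|, 2d) question encodings."""
    # (1) attention weights alpha_i over question tokens, for each document token
    alpha = softmax(D_e @ Q_e.T, axis=-1)   # (|D|, |Q|)
    # (2) weighted question representation q_i for each document token
    q = alpha @ Q_e                          # (|D|, 2d)
    # (3) gate: element-wise product d_i = h_i ⊙ q_i
    return D_e * q                           # (|D|, 2d)
```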

Answer Selection
Now we have a question-aware document representation D and two choice representations C_0 and C_1. We estimate the probability of each choice being the correct answer by equation (4), and the choice with the higher probability is returned as the predicted answer.
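A minimal sketch of the selection step, assuming the relevance in equation (4) is the dot product C_i^T D followed by a softmax over the two choices (the dot product is our assumption; the paper's prose only says "relevance"):

```python
import numpy as np

def select_answer(D, C0, C1):
    """Score each choice vector against the question-aware document vector D,
    softmax over the two scores, return (probabilities, predicted index)."""
    scores = np.array([C0 @ D, C1 @ D])
    e = np.exp(scores - scores.max())  # numerically stable softmax
    probs = e / e.sum()
    return probs, int(probs.argmax())
```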
Experiments

Table 2 shows the statistics of articles and questions in the training, development, and test sets of this task. Here "#text" and "#commonsense" represent the question types, which are unknown during testing and were officially provided by the organizers after evaluation. Therefore, we do not use the class information of questions for system construction. Clearly, around 70% of the questions are from text and 30% from commonsense. Without the aid of an additional commonsense knowledge base, the commonsense questions make this task a huge challenge. The official evaluation criterion is accuracy.

Preprocessing and Experimental Setting
For the rule-based baselines, we first lowercased all words and then performed tokenization and stemming using Stanford CoreNLP (https://stanfordnlp.github.io/CoreNLP/). For the deep learning system, we use the 300-dim pretrained word vectors provided by GloVe (http://nlp.stanford.edu/data/wordvecs/glove.6B.zip) as initial word embeddings, which are fine-tuned during training. The encoding layer uses a one-layer bi-GRU with a 128-dimensional hidden size to encode texts. The learning rate is 0.3, the dropout rate is 0.5, the number of epochs is 100, and the number of multi-hop layers is 2. We use cross entropy loss and vanilla stochastic gradient descent (SGD) to train our models.

Table 3 shows the results on the development set with different methods, where "GA(biGRU)" denotes the final system we submitted, "GA(biLSTM)" denotes the experiment in which we replace all bi-GRU units in the system with bi-LSTM units, "GA −f_match" denotes the system without the 5-dim match feature, and "#text" and "#commonsense" denote the accuracy on the two question types, respectively. Based on these experimental results, we find that the GA system performs much better than the rule-based approaches, because the multi-hop structure merges the information of the question and the document repeatedly, which helps select the final answer, unlike the rule-based approaches, which consider only word matching within a window-size distance. Furthermore, we find that the improved SW + WD algorithm is better than the SW + WD algorithm, because the improved version considers the degree of word matching at different distances. From the GA system results, we find that bi-GRU units perform better than bi-LSTM units and that the match features also improve system performance. Comparing the accuracy on different question types under different methods, we find that the rule-based approaches, which consider only word-matching features, have lower accuracy on commonsense-type questions.
GA systems perform better than rule-based systems on both question types, because the GA system takes into account the semantic similarity between the question-aware document and the choices. Further, for some commonsense-type questions, the document content does not clearly indicate the correct answer but clearly contradicts the wrong answer. This may be why the accuracy of the GA system on commonsense-type questions improved even though we did not use external resources. The final result we submitted was generated by the GA system with bi-GRU units, with the configuration described in Section 3.2. Compared with the top-ranked systems, there is much room for improvement in our work. In addition, the use of external knowledge resources would also affect system performance, because about 26% of the questions in the dataset are commonsense-type questions. This is where our system falls short.

Conclusion
In this paper, we implemented rule-based and deep learning approaches to address the Machine Comprehension Using Commonsense Knowledge task in SemEval 2018. We explored two rule-based algorithms, i.e., the sliding-window and word distance-based algorithms. We also utilized a deep learning method that uses a multi-hop architecture (the Gated-Attention Reader). Neither method uses additional commonsense knowledge; this is a point we need to improve.