Yuanfudao at SemEval-2018 Task 11: Three-way Attention and Relational Knowledge for Commonsense Machine Comprehension

This paper describes our system for SemEval-2018 Task 11: Machine Comprehension using Commonsense Knowledge. We use Three-way Attentive Networks (TriAN) to model interactions between the passage, question and answers. To incorporate commonsense knowledge, we augment the input with relation embeddings derived from ConceptNet, a large graph of general knowledge. As a result, our system achieves state-of-the-art performance with 83.95% accuracy on the official test data. Code is publicly available at https://github.com/intfloat/commonsense-rc.


Introduction
It is well known that humans possess a vast amount of commonsense knowledge acquired from everyday life. For machine reading comprehension, natural language inference and many other NLP tasks, commonsense reasoning is one of the major obstacles to making machines as intelligent as humans.
A large portion of previous work focuses on commonsense knowledge acquisition, either with unsupervised learning (Chambers and Jurafsky, 2008; Tandon et al., 2017) or with crowdsourcing approaches (Singh et al., 2002; Wanzare et al., 2016). ConceptNet (Speer et al., 2017), WebChild (Tandon et al., 2017) and DeScript (Wanzare et al., 2016) are all publicly available knowledge resources. However, resources built with unsupervised learning tend to be noisy, while crowdsourcing approaches have scalability issues. There is also research on incorporating knowledge into NLP tasks such as reading comprehension (Lin et al., 2017; Yang and Mitchell, 2017), neural machine translation (Zhang et al., 2017a) and text classification (Zhang et al., 2017b). Though experiments show performance gains over baselines, these gains are often quite marginal compared with state-of-the-art systems that use no external knowledge. Modeling commonsense knowledge remains an open problem.
In this paper, we present Three-way Attentive Networks (TriAN) for multiple-choice commonsense reading comprehension. The task requires modeling interactions between the passage, question and answers. Since different questions need to focus on different parts of the passage, an attention mechanism is a natural choice and turns out to be effective for reading comprehension. Due to the relatively small size of the training data, TriAN uses word-level attention and consists of only one LSTM layer (Hochreiter and Schmidhuber, 1997); empirically, deeper models suffer from serious overfitting and poor generalization.
To explicitly model commonsense knowledge, relation embeddings based on ConceptNet (Speer et al., 2017) are used as additional features. ConceptNet is a large-scale graph of general knowledge drawn from both crowdsourced and expert-created resources. It consists of over 21 million edges and 8 million nodes, and achieves state-of-the-art performance on tasks such as word analogy and word relatedness.
Besides, we also find that pretraining our network on other datasets improves overall performance. The NLP community has contributed several multiple-choice English reading comprehension datasets, such as MCTest (Richardson et al., 2013) and RACE (Lai et al., 2017). Although these datasets do not focus specifically on commonsense comprehension, they provide a convenient source of augmented data, which can be used to learn regularities shared across reading comprehension tasks.
Combining all of the aforementioned techniques, our system achieves competitive performance on the official test set.
Model
Our model takes a passage P, a question Q, a candidate answer A and a label y* ∈ {0, 1} as input. P, Q and A are all sequences of word indices. For a word P_i in the given passage, the input representation of P_i is the concatenation of several vectors, including its pretrained 300-dimensional GloVe vector E^glove_{P_i}; the full list of components is given in Equation (1).
Attention Layer. First, we define a sequence attention function (Chen et al., 2017):

Att_seq(u, {v_i}^n_{i=1}) = Σ_i α_i v_i,  α_i = softmax_i( f(W_1 u)^T f(W_1 v_i) ),

where u and the v_i are vectors, W_1 is a matrix, and f is a non-linear activation function, set to ReLU. The question-aware passage representation {w^q_{P_i}}^{|P|}_{i=1} can then be calculated as w^q_{P_i} = Att_seq(E^glove_{P_i}, {E^glove_{Q_j}}^{|Q|}_{j=1}). Similarly, we obtain the passage-aware answer representation {w^p_{A_i}}^{|A|}_{i=1} and the question-aware answer representation {w^q_{A_i}}^{|A|}_{i=1}. Three BiLSTMs are then applied to the concatenations of these vectors to model temporal dependencies:

h^p = BiLSTM({[w_{P_i}; w^q_{P_i}]}),  h^q = BiLSTM({w_{Q_i}}),  h^a = BiLSTM({[w_{A_i}; w^p_{A_i}; w^q_{A_i}]}),

where h^p, h^q and h^a are the new representation sequences that incorporate more context information.
Output Layer. The question and answer sequences h^q and h^a are summarized into fixed-length vectors with self-attention (Yang et al., 2016), defined as:

Att_self({u_i}^n_{i=1}) = Σ_i α_i u_i,  α_i = softmax_i( W_2^T u_i ).

We then have the question representation q = Att_self({h^q_i}) and the answer representation a = Att_self({h^a_i}); the passage representation p is obtained by attending over h^p with q: α_i = softmax_i(q^T W_3 h^p_i), p = Σ_i α_i h^p_i. The final output y is based on their bilinear interactions:

y = σ( p^T W_4 a + q^T W_5 a ).
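A minimal NumPy sketch of the two attention functions above may make the shapes concrete (function and variable names are ours; this is an illustration, not the released implementation):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def att_seq(u, V, W1):
    # Sequence attention Att_seq(u, {v_i}):
    # alpha_i = softmax_i(f(W1 u)^T f(W1 v_i)), output = sum_i alpha_i v_i,
    # with f = ReLU.
    pu = np.maximum(W1 @ u, 0.0)        # f(W1 u)
    pV = np.maximum(V @ W1.T, 0.0)      # rows are f(W1 v_i)
    alpha = softmax(pV @ pu)            # attention weights over the v_i
    return alpha @ V                    # weighted sum of the v_i

def att_self(U, w2):
    # Self-attention Att_self({u_i}):
    # alpha_i = softmax_i(w2^T u_i), output = sum_i alpha_i u_i.
    alpha = softmax(U @ w2)
    return alpha @ U

rng = np.random.default_rng(0)
d, n = 6, 4
u = rng.standard_normal(d)
V = rng.standard_normal((n, d))
W1 = rng.standard_normal((d, d))
w2 = rng.standard_normal(d)
print(att_seq(u, V, W1).shape, att_self(V, w2).shape)  # (6,) (6,)
```

Both functions return a single vector in the same space as the attended sequence, which is why the question-aware passage representation has one vector per passage word.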
Model Learning. We first pretrain TriAN on the RACE dataset for 10 epochs; the model is then fine-tuned on the official training data. Standard cross entropy is used as the loss function.
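As an illustration, the binary cross entropy loss over the label y* can be written as follows (the function name is ours):

```python
import math

def cross_entropy(y_true, y_prob, eps=1e-12):
    # Standard cross entropy for a binary label y* in {0, 1}
    # against a predicted probability y_prob = P(y* = 1).
    # eps guards against log(0) at the boundaries.
    p = min(max(y_prob, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

print(round(cross_entropy(1, 0.9), 4))  # 0.1054
```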

Setup
Data. For data preprocessing, we use spaCy for tokenization, part-of-speech tagging and named-entity recognition. The official dataset is MCScript (Ostermann et al., 2018a).
Hyperparameters. Our model TriAN is implemented in PyTorch. Models are trained on a single GPU (Tesla P40), and each epoch takes about 80 seconds. Only the word embeddings of the top 10 most frequent words are fine-tuned during training. The dimension of both the forward and backward LSTM hidden states is set to 96. The dropout rate is set to 0.4 for both input embeddings and BiLSTM outputs (Srivastava et al., 2014). For parameter optimization, we use Adamax (Kingma and Ba, 2014) with an initial learning rate of 2 × 10^-3; the learning rate is halved after 10 and again after 15 training epochs, and the model converges after 50 epochs. Gradients are clipped to a maximum L2 norm of 10, and minibatches of size 32 are used. Hyperparameters are optimized with a random search strategy (Bergstra and Bengio, 2012). Our model is quite robust over a wide range of hyperparameter configurations.
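The learning-rate schedule and gradient clipping described above can be sketched in plain Python (function names are ours; that the two halvings compound is our reading of the text):

```python
def learning_rate(epoch, base_lr=2e-3):
    # Initial learning rate 2e-3, halved after 10 and again
    # after 15 training epochs.
    lr = base_lr
    if epoch > 10:
        lr /= 2.0
    if epoch > 15:
        lr /= 2.0
    return lr

def clip_l2(grads, max_norm=10.0):
    # Scale all gradients down if their global L2 norm exceeds max_norm.
    norm = sum(g * g for g in grads) ** 0.5
    if norm > max_norm:
        grads = [g * (max_norm / norm) for g in grads]
    return grads

print(learning_rate(5), learning_rate(12), learning_rate(20))
# 0.002 0.001 0.0005
```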

Main Results
The experimental results are shown in Table 2. Human performance was provided by the task organizers. For TriAN-ensemble, we average the output probabilities of 9 models trained with the same datasets and network architecture but different random seeds. From Table 2, we can see that even though the RACE dataset contains nearly 100k questions, TriAN-RACE achieves quite poor results: its accuracy on the development set is only 64.78%, worse than most participants' systems. However, pretraining acts as a form of implicit knowledge transfer and is beneficial for overall performance, as will be seen in Section 3.3. The accuracy of our system TriAN-ensemble is very close to that of the 1st place team HFL, with a difference of 0.18%. Yet there is still a large gap between machine learning models and humans.
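The ensemble step can be sketched as follows (a toy example with 3 models and 2 candidate answers; the actual submission averages 9 models):

```python
import numpy as np

def ensemble_predict(model_probs):
    # model_probs: one probability vector over candidate answers per model.
    # The ensemble averages the probabilities, then picks the answer
    # with the highest mean probability.
    mean = np.mean(model_probs, axis=0)
    return int(np.argmax(mean))

probs = [np.array([0.6, 0.4]),
         np.array([0.45, 0.55]),
         np.array([0.7, 0.3])]
print(ensemble_predict(probs))  # 0
```

Averaging probabilities (rather than hard votes) lets confident models outweigh uncertain ones, which is one common motivation for this kind of seed-only ensemble.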
We also compared the performance of shallow and deep TriAN models. On datasets such as SQuAD (Rajpurkar et al., 2016), deep models typically work better than shallow ones. Notice that the attention layer in our proposed TriAN model can be stacked multiple times if we treat the output vectors of the BiLSTMs as new input representations.
Perhaps somewhat surprisingly, Table 3 shows that the 2-layer TriAN model performs worse than the 1-layer one. One possible explanation is that the labeled dataset is relatively small and deeper models tend to overfit.

Ablation Study
The input representation consists of several components: part-of-speech embeddings, relation embeddings, handcrafted features, etc. We conduct an ablation study to investigate the effect of each component; the results are shown in Table 4. Pretraining on the RACE dataset turns out to be the most important factor: without it, accuracy drops by more than 1% on both the development and test sets. Relation embeddings based on ConceptNet make approximately a 1% difference, and part-of-speech and named-entity embeddings are also helpful. In fact, combining input representations from multiple sources has become standard practice for reading comprehension tasks.
At the attention layer, our proposed TriAN applies several attention functions to model interactions between different text sequences, so it is interesting to examine the importance of each attention function, as shown in Table 5. Interestingly, removing any single word-level sequence attention does not hurt performance much; in fact, removing the passage-question attention even results in higher test-set accuracy than TriAN-single. However, if we remove all word-level attentions, performance drops drastically, by 1.9% on the development set and 1.3% on the test set.

Discussion
Even though our system is built for commonsense reading comprehension, it has no explicit knowledge reasoning component; relation embeddings based on ConceptNet only serve as additional input features. Methods like event calculus (Mueller, 2014) are more rigorous mathematically and better resemble how the human brain works, but event calculus requires large amounts of domain-specific axioms and therefore does not scale well. Another limitation is that our system relies on hard-coded commonsense knowledge bases, as do most systems for commonsense reasoning. For humans, commonsense knowledge comes from constant interaction with the real-world environment; in our view, it is quite hopeless to enumerate all of it.
There are many reading comprehension datasets available. When the training data is relatively small, as in this SemEval-2018 task, transfer learning across datasets is a useful technique. This paper shows that pretraining is a simple and effective method, though it remains to be seen whether there are better alternative approaches.

Conclusion
In this paper, we present the core ideas and design philosophy of our system TriAN for SemEval-2018 Task 11: Machine Comprehension using Commonsense Knowledge. We build upon recent progress on neural models for reading comprehension and incorporate commonsense knowledge from ConceptNet. Pretraining and handcrafted features also prove helpful. As a result, our proposed model TriAN achieves near state-of-the-art performance.
• Part-of-speech and named-entity embeddings. Randomly initialized 12-dimensional part-of-speech embedding E^pos_{P_i} and 8-dimensional named-entity embedding E^ner_{P_i}.
• Relation embeddings. Randomly initialized 10-dimensional relation embedding E^rel_{P_i}. The relation is determined by querying ConceptNet for an edge between P_i and any word in the question {Q_j}^{|Q|}_{j=1} or the answer {A_j}^{|A|}_{j=1}.
The input representation for P_i is w_{P_i}:

w_{P_i} = [E^glove_{P_i}; E^pos_{P_i}; E^ner_{P_i}; E^rel_{P_i}; f_{P_i}]   (1)

In a similar way, we obtain the input representations for the question, w_{Q_i}, and the answer, w_{A_i}.
Attention Layer. We use word-level attention to model interactions between the given passage {P_i}^{|P|}_{i=1}, the question {Q_j}^{|Q|}_{j=1} and the answer {A_j}^{|A|}_{j=1}.
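A toy sketch of the relation lookup and the concatenation in Equation (1) follows; the edge table and the 2-dimensional handcrafted feature vector f are illustrative stand-ins for the real ConceptNet graph and features:

```python
import numpy as np

# Toy edge table standing in for ConceptNet; a real system would query
# the actual graph. Relation id 0 means "no relation".
EDGES = {("knife", "cut"): 1, ("dog", "animal"): 2}

def relation_id(word, context_words):
    # Return the relation id of an edge between `word` and any word in
    # the question or answer, or 0 if no such edge exists.
    for c in context_words:
        if (word, c) in EDGES:
            return EDGES[(word, c)]
    return 0

def input_representation(glove, pos, ner, rel, feats):
    # Equation (1): w_{P_i} = [E_glove; E_pos; E_ner; E_rel; f_{P_i}]
    return np.concatenate([glove, pos, ner, rel, feats])

w = input_representation(np.zeros(300), np.zeros(12), np.zeros(8),
                         np.zeros(10), np.zeros(2))
print(relation_id("dog", ["cat", "animal"]), w.shape)  # 2 (332,)
```

The relation id would then index the randomly initialized 10-dimensional relation embedding table, analogous to the part-of-speech and named-entity lookups.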
TriAN-ensemble is the model we used for the official submission.

Table 2 :
Main results. TriAN-RACE only uses the RACE dataset for training; HFL is the 1st place team for SemEval-2018 Task 11. The evaluation metric is accuracy.
Table 3 :
Accuracy comparison of shallow and deep TriAN models.

Table 4 :
Ablation study for input representation.
Table 5 :
Ablation study for attention. The last row, "w/o attention", removes all word-level attentions.