IIT-KGP at COIN 2019: Using pre-trained Language Models for modeling Machine Comprehension

In this paper, we describe our system for COIN 2019 Shared Task 1: Commonsense Inference in Everyday Narrations. We show the power of leveraging state-of-the-art pre-trained language models such as BERT (Bidirectional Encoder Representations from Transformers) and XLNet over commonsense knowledge base resources such as ConceptNet and NELL for modeling machine comprehension. We used an ensemble of BERT-Large and XLNet-Large. Experimental results show that our model gives substantial improvements over the baseline and other systems incorporating knowledge bases. We secured 2nd position on the final test set leaderboard with an accuracy of 90.5%.


Introduction
Machine Reading Comprehension (MRC) has recently been one of the most explored topics in the field of natural language processing. MRC consists of various sub-tasks, such as cloze-style reading comprehension Hermann et al. (2015); Hill et al. (2015), span-extraction reading comprehension Rajpurkar et al. (2016) and open-domain reading comprehension Chen et al. (2017). Earlier approaches to machine reading and comprehension were based either on hand-engineered grammars Riloff and Thelen (2000) or on information extraction methods that detect predicate-argument triples which can later be queried as a relational database Poon et al. (2010). These methods are effective, but they rely on feature extraction and language tools. Recently, with the advances and huge success of neural networks over traditional feature-based models, there has been great interest in building neural architectures for various NLP tasks Li and Zhou (2018), including several pieces of work on machine comprehension Hermann et al. (2015); Hill et al. (2015); Yin et al. (2016); Kadlec et al. (2016), which have achieved significant performance gains in the machine comprehension domain.
Machine comprehension using commonsense reasoning requires answering multiple-choice questions based on narrative texts about the daily activities of human beings Yuan et al. (2018). The answer to many questions does not appear directly in the text but requires simple reasoning to obtain. By the nature of the problem, this task can be considered binary classification: for each question, the candidate answers are divided into two categories, the correct answers and the wrong answers.
In this paper, we show that pre-trained language models alone can model commonsense reasoning better than models that integrate commonsense knowledge base resources such as ConceptNet and NELL with deep neural architectures. We propose an ensemble architecture consisting of BERT and XLNet for this task, which achieves an accuracy of 91.0% on the dev set and 90.5% on the test set, outperforming the Attentive Reader baseline by a large margin of 25.4%.

Task Description & Dataset
Formally, this Shared Task: Commonsense Inference in Everyday Narrations Ostermann et al. (2019), organized within COIN 2019, is a multiple-choice machine comprehension task that can be expressed as a quadruple < D, Q, A, a > Sheng et al. (2018), where D represents a narrative text about everyday activities, Q represents a question about the content of the narrative text, A is the set of candidate answers to the question (this task provides two candidate answers, a0 and a1) and a represents the correct answer. The system is expected to select the answer from A that best answers Q according to the evidence in document D or commonsense knowledge. This task assesses how the inclusion of commonsense knowledge in the form of script knowledge benefits machine comprehension systems. Script knowledge is defined as knowledge about everyday activities, i.e. sequences of events describing stereotypical human activities (also called scenarios), for example baking a cake or taking a bus. In addition to what is mentioned in the text, a substantial number of questions require inference using script knowledge about different scenarios, i.e. answering them requires knowledge beyond the facts mentioned in the text.
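Each dataset instance can be held in a small structure mirroring the quadruple; the sketch below is purely illustrative (the class, field names and example text are our own, not part of the official task format):

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Instance:
    # One quadruple <D, Q, A, a> from the shared task.
    document: str                # D: narrative text about an everyday activity
    question: str                # Q: question about the narrative
    candidates: Tuple[str, str]  # A: the two candidate answers (a0, a1)
    label: int                   # index of the correct answer a in candidates

ex = Instance(
    document="I dug a small hole in the garden and planted the seedling.",
    question="What did they most likely use to dig the hole?",
    candidates=("a shovel", "a spoon"),
    label=0,
)
```

A system then scores each (document, question, candidate) pair independently and selects the higher-scoring candidate, which matches the binary-classification view described above.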
Answers are short and limited to a few words. The texts used in this task cover more than 100 everyday scenarios and hence include a wide variety of human activities. In the example passage, while it is easy to find the correct answer to question A ("to get enough sunshine") in the text, questions B and C are more complicated to answer: for a person, it is clear that the most plausible answers are "a shovel" and "the gardener", although neither is explicitly mentioned in the text.
Recently, a number of datasets have been proposed for machine comprehension. One example is MCTest Richardson et al. (2013), a small curated dataset of 660 stories, with 4 multiple choice questions per story. The stories are crowdsourced and not limited to a domain. Answering questions in MCTest requires drawing inferences from multiple sentences from the text passage. Another recently published multiple choice dataset is RACE Lai et al. (2017), which contains more than 28,000 passages and nearly 100,000 questions. The dataset is collected from English examinations in China, which are designed for middle school and high school students.

System Overview
We created an ensemble of two systems, BERT Devlin et al. (2018) and XLNet Yang et al. (2019), each of which independently calculates the probability of each candidate answer being correct.

Finetuned BERT
BERT is designed to train deep bidirectional representations by jointly conditioning on both left and right context in all layers. We chose BERT Large, uncased as our underlying BERT model. It consists of 24 layers, a hidden size of 1024, 16 attention heads and 340M parameters, and was trained on the BookCorpus (800M words) and English Wikipedia (2,500M words). The context, questions and options were first tokenized with BertTokenizer, which performs punctuation splitting, lower casing and invalid character removal. The sequence fed into the model is generated in the form [CLS] + context + [SEP] + question + answer + [SEP] for every possible answer; each sequence was assigned the label 1 for the correct answer and 0 otherwise. The maximum sequence length was set to 500 on the COIN dataset, with shorter sequences padded and, for longer sequences, the context truncated to fit context + question + answer within this length. We first fine-tuned BERT on the RACE dataset using a maximum sequence length of 350 for 2 epochs.
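The sequence construction above can be sketched with a toy, tokenizer-agnostic function; this is a simplified illustration of the truncation/padding scheme, not our actual preprocessing code, and the token strings are placeholders:

```python
def build_sequence(context_toks, question_toks, answer_toks,
                   max_len=500, pad="[PAD]"):
    """Build [CLS] + context + [SEP] + question + answer + [SEP],
    truncating only the context and padding shorter sequences."""
    fixed = question_toks + answer_toks
    budget = max_len - len(fixed) - 3        # room left for the context
    context_toks = context_toks[:budget]     # truncate long contexts
    seq = ["[CLS]"] + context_toks + ["[SEP]"] + fixed + ["[SEP]"]
    seq += [pad] * (max_len - len(seq))      # pad short sequences
    return seq
```

One such sequence is built per candidate answer and labeled 1 if the candidate is correct, 0 otherwise.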
We used the PyTorch implementation of BERT from the transformers library, which provides the BERT tokenizer, positional embeddings and the pre-trained BERT model. Following the fine-tuning recommendations of the original BERT paper Devlin et al. (2018), we trained our model with a batch size of 8 for 8 epochs. The dropout probability was set to 0.1 for all layers, and the Adam optimizer was used with a learning rate of 1e-5.

Semi-Finetuned XLNet
Both the RACE and COIN datasets contain relatively long passages, with an average sequence length greater than 300. Since the Transformer-XL architecture improves the capability of modeling long sequences, in addition to the autoregressive (AR) objective mentioned in Yang et al. (2019), we focused our attention on the XLNet model, a pre-trained language model built upon the Transformer-XL architecture. We used the XLNet Large, cased model, which has 24 layers, a hidden size of 1024 and 16 attention heads. The input to XLNet is similar to BERT's: [A, SEP, B, SEP, CLS], with the small difference that the [CLS] token is placed at the end instead of the beginning. Here A and B are the two segments: A represents the context and B represents the question + answer. We used the Tensorflow implementation of XLNet from zihangdai/xlnet (https://github.com/zihangdai/xlnet). We call our model semi-fine-tuned because, although we used a Google Colab TPU for fine-tuning XLNet, we had to limit the maximum sequence length to 312 owing to the huge computational capacity required by XLNet Large, cased, so we could not properly fine-tune XLNet on RACE. After fine-tuning on the RACE dataset for a few epochs, we fine-tuned further on the COIN dataset, keeping the maximum sequence length close to 400. The maximum number of training steps was set to 12,000, the batch size to 8, and the Adam optimizer was used with a learning rate of 1e-5.
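The XLNet input layout differs from BERT's only in where the special tokens go; a hypothetical sketch (placeholder token strings, not the real SentencePiece pieces):

```python
def xlnet_sequence(context_toks, qa_toks, max_len=312):
    """Build A + <sep> + B + <sep> + <cls>: segment A is the context,
    segment B is the question + answer, and <cls> sits at the END."""
    budget = max_len - len(qa_toks) - 3      # room left for the context
    return (context_toks[:budget] + ["<sep>"]
            + qa_toks + ["<sep>", "<cls>"])
```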

Figure 2: Ensemble Model
Ensemble
Ensemble learning is an effective approach to improving model generalization, and has been used to achieve new state-of-the-art results in a wide range of natural language understanding (NLU) tasks Devlin et al. (2018); Liu et al. (2019b). For the COIN 2019 shared task, we adopt a simple ensemble approach: we average the softmax outputs from both BERT Large, uncased and XLNet Large, cased, and make predictions based on these averaged class probabilities. Our final submission follows this ensemble strategy.
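The averaging step can be sketched as follows; the function names and toy logits are ours, and in practice the logits come from the fine-tuned BERT and XLNet classification heads:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the candidate-answer axis.
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def ensemble_predict(bert_logits, xlnet_logits):
    # Average the two models' class probabilities per question,
    # then pick the candidate with the highest mean probability.
    probs = (softmax(bert_logits) + softmax(xlnet_logits)) / 2.0
    return probs.argmax(axis=-1)

# Toy example: BERT slightly prefers a0, XLNet strongly prefers a1,
# so the averaged probabilities favor a1.
bert = np.array([[2.0, 0.0]])
xlnet = np.array([[0.0, 3.0]])
```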

Additional Experiments
We applied several approaches to the problem that did not generalize as well on the development data and were not included in the final ensemble. Due to space constraints, we do not describe the simpler models, which include rule-based and feature-based classification models.
TriAN: We started with the previous state-of-the-art model for SemEval '18 Task 11, TriAN Wang et al. (2018), as our baseline. We used both the SemEval '18 Task 11 and COIN 2019 datasets for training. The input layer uses GloVe word embeddings concatenated with part-of-speech tag, named-entity and relation embeddings. An attention layer then models three-way attention between context, question and answer, producing a question-aware passage representation, a passage-aware answer representation and a question-aware answer representation. These representations are concatenated with the input embeddings and fed into three BiLSTMs to model temporal dependencies, yielding new representation vectors h^p, h^q and h^a that incorporate more context information. The question and answer sequences are then summarized into fixed-length vectors with self-attention, the passage is summarized with question-conditioned attention, and the final output y is computed from bilinear interactions between these summary vectors.
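To make the last two steps concrete, here is a minimal numpy sketch of self-attention pooling and the bilinear output, following the general form in Wang et al. (2018); the weight names and shapes are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def self_attn_pool(H, w):
    # Summarize a sequence H (T x d) into one fixed-length vector:
    # alpha_i proportional to exp(w . h_i),  v = sum_i alpha_i * h_i
    alpha = softmax(H @ w)
    return alpha @ H

def bilinear_output(p, q, a, W1, W2):
    # y = sigmoid(p^T W1 a + q^T W2 a): bilinear interactions between
    # the passage, question and answer summary vectors.
    s = p @ W1 @ a + q @ W2 @ a
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
d = 4
p, q = rng.normal(size=d), rng.normal(size=d)
a = self_attn_pool(rng.normal(size=(3, d)), rng.normal(size=d))
y = bilinear_output(p, q, a, rng.normal(size=(d, d)), rng.normal(size=(d, d)))
```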

Results and Discussion
This section discusses the results of the various approaches we applied to this task. First, as a starting point, we ran the best performing model from SemEval '18 Task 11, TriAN, using the same hyper-parameter settings as stated in the paper. It achieved an accuracy close to 69.0%. We fine-tuned a single BERT Large, uncased model on the COIN + SemEval '18 Task 11 dataset, which achieved an accuracy of 83.4% on the dev set. Further fine-tuning it on the commonsense dataset RACE for a few epochs increased the accuracy by 1%. We also fine-tuned the XLNet Large, cased model on the COIN + SemEval '18 dataset, which alone achieved an accuracy of 90.6% on the dev set, although we could not fully fine-tune it on the RACE dataset, as mentioned in Section 3.2. We finally submitted our ensemble system, which achieves an accuracy of 91.0% on the dev set and 90.5% on the hidden test set.
The accuracy of our final ensemble model on the hidden test set is close to that on the dev set, which shows that our model generalizes well to new/unseen data.

Model                       Dev Accuracy
TriAN (baseline)            69.0%
BERT Large, uncased         84.4%
XLNet Large, cased          90.6%
BERT + XLNet (ensemble)     91.0%

The main problem with commonsense knowledge bases is that they are hard-coded and do not generalize well to unseen data. This is also evident from the dev set results above.

Conclusion & Future Work
In this paper, we presented our system for the Commonsense Inference in Everyday Narrations shared task at COIN 2019. We built upon the recent success of pre-trained language models and applied them to reading comprehension. Our system achieves close to state-of-the-art performance on this task.
As future work, we plan to explore an in-depth, layer-by-layer analysis of BERT and XLNet attention, similar to prior analyses, and to study how attention helps in commonsense reasoning.