YNU_Deep at SemEval-2018 Task 11: An Ensemble of Attention-based BiLSTM Models for Machine Comprehension

We firstly use GloVe to learn the distributed representations automatically from the instance, question and answer triples. Then an attentionbased Bidirectional LSTM (BiLSTM) model is used to encode the triples. We also perform a simple ensemble method to improve the effectiveness of our model. The system we developed obtains an encouraging result on this task. It achieves the accuracy 0.7472 on the test set. We rank 5th according to the official ranking.


Introduction
Machine comprehension of text is one of the ultimate goals of natural language processing. The machine comprehension problem can be formulated as follows: Given an instance i, a question q and an answer candidate pool {a 1 , a 2 , . . ., a s }, the aim is to search for the best answer candidate a k , where 1 ≤ k ≤ s. The major challenge of this task is that the words in the answer do not necessarily appear in the instance.
In recent years, deep learning models are widely used in the field of NLP, such as semantic analysis (Tang et al., 2015), machine translation (Bahdanau et al., 2014) and text summarization (Rush et al., 2015). (Bahdanau et al., 2014) also introduced the attention mechanism into NLP task for the first time. This attention-based model yielded stateof-the-art performance on the machine translation task. (Hermann et al., 2015) built a supervised reading comprehension data set, the CNN/Daily Mail data sets 1 . They also presented Attentive Reader for machine comprehension, which allows a model to focus on the aspects of an instance that 1 http://www.github.com/deepmind/rc-data/ can help to answer a question, and also allows us to visualize its inference process (Hermann et al., 2015). The key point of the attention-based models is the design of attention function. Compared to Attentive Reader, Attention Sum Reader (Kadlec et al., 2016) used the dot products instead of a tanh layer to compute the attention between question and contextual embeddings. Stanford Attention Reader (Chen et al., 2016) took a bilinear term as the attention function and obtained stateof-the-art results on the CNN/Daily Mail data sets.
The reasoning process was implemented in some models for machine comprehension. Memory Networks (Sukhbaatar et al., 2015) was the first model to propose reasoning process, which had important influence on other follow-up models. Compared to the traditional attention model, Memory Networks additionally uses a function t that constantly updates the representation of the instance and the question so as to realize the reasoning process. (Tseng et al., 2016) proposed an attention-based multi-hop recurrent neural network which achieved good performance on the machine listening comprehension test of TOEFL. Other reasoning models (Dhingra et al., 2017;Sordoni et al., 2016) shared the same idea as previous models, i.e., the representations of the instance and the question embedding were updated through continuous conversion of attention. Some more complex models (Hu et al., 2017; were proposed based on SQuAD data set (Rajpurkar et al., 2016). Their performances have been very close to or even exceeded the human performance on this dataset.
In this paper, we introduce a simple ensemble method on multiple identical attention-based BiL-STM models, only changing the dropout parameters in each model. We use each model to generate a soft prediction, and sum each result, then take the sum as the final prediction result. Experiments show that the ensemble model is about 2% higher than the single model in terms of accuracy on both development and test sets. Besides, we also made our code available online 2 .

System Description
Our model is called an ensemble of attentionbased BiLSTM models. Firstly, we use an embedding layer to obtain the distributed representations of the instance, question and answer triples. They are encoded by three different BiLSTM layers. The attention mechanism is implemented by dot products via a merge layer. Finally, we assign the same weight to each model when ensembling. The final result is the sum of the soft probabilities yielded by each single model. We keep the structure of each model the same, just fine tune the dropout parameters. The model architecture is shown in Figure 1. The attention mechanism is developed by calculating the dot product of the outputs from two BiLSTM layers. Then we use '||' operation to concatenate two matrices from the previous layer in the specified dimension. Finally, a Dense fully connected layer with activation sof tmax is used to get the predicted probabilities.

BiLSTM
Single direction LSTM (Hochreiter and Schmidhuber, 1997) suffers a weakness of not using the contextual information from the future tokens. Bidirectional LSTM (BiLSTM) exploits both the previous and future context by processing the sequence on two directions and generates two independent sequences of LSTM output vectors. One processes the input sequence in the forward direction, while the other processes the input in the backward direction. The words in the instances, questions and answers are represented by the concatenation of the hidden layer outputs in both directions at each time step.

Word Embedding
Word embedding is arguably the most widely known technology in the recent history of NLP. It is well-known that using pre-trained embedding helps (Kim, 2014). We try two word embedding tools, GloVe (Pennington et al., 2014) and Word2Vec (Mikolov et al., 2013)

Attention Mechanism
The LSTM model can alleviate the problem of gradient vanishing, but this problem persists in long range reading comprehension contexts. The attention mechanism breaks the constraint on fix-length vector as the context vector, enables the model to focus on those more helpful to outputs. (Luong et al., 2015) presented several attention computation ways, such as dot, general, concat. In our model, we adopt the dot mode to compute the attention. After BiLSTM layer, we implement a dot product operation on the output vectors produced by previous layer. It is proven effective to improve the performance of our model. The attention mechanism in our model uses a matching function f to associate the target module with the source module, the function f is implemented as follows: where m t and m s correspond to instance and question vectors, or answer and question vectors produced by previous BiLSTM layers, respectively.

Model Ensemble
Combining multiple models into an ensemble by averaging their predictions is a proven strategy to improve model performance. While predicting with an ensemble is expensive at test time, recent advances in distillation allow us to compress an expensive ensemble into a much smaller model (Hinton et al., 2015;Kuncoro et al., 2016;Kim and Rush, 2016).
In our model, each single model will yield a soft probability to determine if the candidate answer is correct or not. As is shown in figure 2, we train several models and sum the results produced by them. Then we use the sum of the probabilities as the final prediction. We found that the performance of the ensemble model is always better than that of a single model.

Experiments
We run each model 10 times, taking the average results as the final experimental results to enhance reliability. In all single models, the dropout parameter is taken as 0.3, and the ensemble model is trained through 5 BiLSTM models. Their dropout parameters are changed from 0.2 to 0.6, respectively, and then the results of the 5 models are summed as the final prediction. We set epoch = 6, batch size = 512 and LSTM Units = 64. Optimization is carried out using Adaptive Moment Estimation (Adam).

Data Processing
The organizers provided training, development, and test sets, containing 9837, 1417, 2797 questions, respectively. Each question corresponds two answers, only one is correct.
We firstly substitute the abbreviation characters and remove the meaningless characters. Then we combine the instance, question and answer as the supervised training set. Labels are represented by 0 (False) or 1 (True). The TweetTokenizer 3 in NLTK is adopted for word segmentation. Furthermore, we find the maximum length of instances is much longer than that of questions and answers, so we remove the stop words in the instances. Our experiments show that doing so not only does not harm the accuracy, but also drastically reduces the training time.

Experiments and Result Analysis
We compare two word embedding tools, Word2Vec and GloVe, and the experimental results show that GloVe almost always outperforms Word2Vec on this task. Although the vocabularies in GloVe are less than those in Word2Vec, GloVe contains more abbreviations, which are especially useful after tokenizing the instance, question and answer triples, and greatly reduce the number of unknown words in word embedding, making the context semantics better learned by the model. We make random assignments on unknown words, ranging from -0.25 to 0.25.  We can see GloVe performs better, but its loading time is much longer than that of Word2Vec.
As seen in Table 3, we compare two network architectures, LSTM and BiLSTM. The results show that the BiLSTM model performs better than the LSTM model on this task.
Based on Glove word embedding and BiLSTM architecture, we train 5 single models for ensemble. The only difference between them is the difference in dropout parameters, which increases from 0.2 to 0.6. In our experiments, we train the   Table 4: Results on single and ensemble models. All models adopt GloVe + Attention-based BiLSTM architecture. The dropout layer is behind the BiLSTM layer.

Conclusion and Future Work
In this paper, we present an ensemble of attentionbased BiLSTM models for machine comprehension task. We find GloVe is superior to Word2Vec on this task, a simple ensemble method can significantly enhance the overall performance.
In the future, we plan to explore more ways to compute the attention, such as a bilinear term. Future work also involves using more external knowledge and deeper network to improve model performance. We will explore the ensemble method in greater depth, trying ensemble on the models with more structural difference.