Inspecting Unification of Encoding and Matching with Transformer: A Case Study of Machine Reading Comprehension

Most machine reading comprehension (MRC) models handle encoding and matching separately with different network architectures. In contrast, pretrained language models built from Transformer layers, such as GPT (Radford et al., 2018) and BERT (Devlin et al., 2018), have achieved competitive performance on MRC. A research question that naturally arises is: apart from the benefits of pre-training, how much of the performance gain comes from the unified network architecture? In this work, we evaluate and analyze unifying the encoding and matching components with Transformer layers for the MRC task. Experimental results on SQuAD show that the unified model outperforms previous networks that treat encoding and matching separately. We also introduce a metric to inspect whether a Transformer layer tends to perform encoding or matching. The analysis shows that the unified model learns modeling strategies different from those of previous manually-designed models.


Introduction
Despite differences in neural network structure, encoding and matching components are two basic building blocks for many NLP tasks, such as machine reading comprehension (Rajpurkar et al., 2016; Joshi et al., 2017). In a widely used paradigm, the input texts are first encoded into vectors, and matching layers then aggregate these vectors to model the interactions between them. Figure 1(a) shows a typical machine reading comprehension model: encoding components separately encode the question and the passage into vector representations. Then, context-sensitive representations of the input words are obtained by modeling the interactions between the question and the passage. Finally, an output layer predicts the probability of each token being the start or end position of the answer span. The encoding layers are usually built upon recurrent neural networks (Hochreiter and Schmidhuber, 1997; Cho et al., 2014) or self-attention networks (Yu et al., 2018). For the matching component, various model components have been developed to fuse question and passage vector representations, such as match-LSTM (Wang and Jiang, 2016), coattention (Seo et al., 2016; Xiong et al., 2016), and self-matching (Wang et al., 2017). Recently, Devlin et al. (2018) employed Transformer networks to pretrain a bidirectional language model (called BERT) and then fine-tune its layers on specific tasks, obtaining state-of-the-art results on MRC. A research question is: apart from the benefits of pretraining, how much of the performance gain comes from the unified network architecture?
In this paper, we evaluate and analyze unifying the encoding and matching components with Transformer layers (Vaswani et al., 2017), using MRC as a case study. As shown in Figure 1(b), compared with previous specially-designed MRC networks, we do not explicitly distinguish encoding stages from matching stages. We first concatenate the input question and passage into one sequence, and add segment embeddings to the word vectors to indicate whether each token belongs to the question or the passage. Next, the packed sequence is fed into a multi-layer Transformer network, which uses the self-attention mechanism to obtain contextualized representations of both the question and the passage. The first advantage is that the unified architecture enables the model to learn the encoding and matching strategy automatically, rather than empirically specifying layers one by one. Second, the proposed method is conceptually simpler than previous systems, which simplifies the model implementation.
We conduct experiments on the SQuAD v1.1 dataset (Rajpurkar et al., 2016), an extractive reading comprehension benchmark. Experimental results show that the unified model outperforms previous state-of-the-art models that treat encoding and matching separately. The results indicate that part of the improvement of BERT (Devlin et al., 2018) is attributable to the architecture used for end tasks. Moreover, we introduce a metric to inspect the ratio of encoding and matching for each layer. The analysis illustrates that the unified model learns different strategies to handle questions and passages, which sheds light on future model design for MRC.

Unified Encoding and Matching Model
We focus on extractive reading comprehension in this work. Given an input passage x^P and question x^Q, our goal is to predict the correct answer span a = x^P_s ⋯ x^P_e in the passage. The SQuAD v1.1 dataset guarantees that the correct answer span exists in the passage. Figure 1(b) shows an overview of the unified model (the implementation and models are available at github.com/addf400/UnifiedModelForSQuAD). We first pack the question and passage into a single sequence. Then multiple Transformer (Vaswani et al., 2017) layers are employed to compute the vector representations of the question and passage together. Finally, an output layer is used to predict the start and end positions of the answer span. Compared with the previous specially-designed networks illustrated in Figure 1(a), the model unifies encoding layers and matching layers by using multiple Transformer blocks. The self-attention mechanism is expected to automatically learn question-to-question encoding, passage-to-passage encoding, question-to-passage matching, and passage-to-question matching.

Embedding Layer
For each word in the question and passage, the vector representation x is constructed from a word embedding x_w, a character embedding x_c, and a segment embedding x_s. The character-level embeddings are computed in a similar way to Yu et al. (2018). The segment embeddings indicate whether the word belongs to the question or the passage. The final representation is computed via x = ϑ([x_w; x_c]) + x_s, where ϑ denotes a Highway network (Srivastava et al., 2015).
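To make the construction concrete, below is a minimal PyTorch sketch of the embedding layer. The single-layer Highway form, the projection down to the model dimension, and all class and parameter names are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class Highway(nn.Module):
    """A single-layer Highway network: a gated mix of a transform and the identity."""
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):
        t = torch.sigmoid(self.gate(x))    # transform gate
        h = torch.relu(self.transform(x))  # candidate representation
        return t * h + (1.0 - t) * x       # gated combination


class EmbeddingLayer(nn.Module):
    """Sketch of x = Highway([x_w ; x_c]) + x_s as described above (hypothetical)."""
    def __init__(self, word_dim=300, char_dim=64, hidden=128):
        super().__init__()
        # Project the concatenated word+char vector to the model dimension
        # before the Highway layer; this projection is an assumption.
        self.proj = nn.Linear(word_dim + char_dim, hidden)
        self.highway = Highway(hidden)
        # Segment embedding dimension equals the hidden size (128 in the paper).
        self.segment = nn.Embedding(2, hidden)  # 0 = question token, 1 = passage token

    def forward(self, x_word, x_char, segment_ids):
        x = torch.cat([x_word, x_char], dim=-1)  # [batch, seq_len, word_dim + char_dim]
        x = self.highway(self.proj(x))
        return x + self.segment(segment_ids)     # add segment embeddings
```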

Unified Encoder
Given the question embeddings x^Q and passage embeddings x^P, we first pack them together into a single sequence, denoted as h^0 = [x^Q; x^P]. Then an L-layer Transformer encoder is used to encode the packed representations: h^l = Transformer_l(h^{l-1}), where l ∈ [1, L] is the depth.
Transformer blocks use a self-attention mechanism to compute attention weights between each pair of tokens in the packed question and passage, which automatically learns the importance of encoding and matching. Specifically, for each token, the attention scores are normalized over the whole sequence. The weights between two question tokens can be regarded as question encoding. Similarly, the attention scores between two passage tokens can be viewed as passage encoding. The attention weights across the question segment and the passage segment can be considered as question-to-passage or passage-to-question matching.
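The following sketch shows the packing and the joint L-layer encoding using PyTorch's built-in nn.TransformerEncoder. It is only a structural approximation: the paper's blocks use relative position embeddings (Shaw et al., 2018) and other custom details that the standard module does not implement.

```python
import torch
import torch.nn as nn


class UnifiedEncoder(nn.Module):
    """Packs question and passage into one sequence and encodes them jointly."""
    def __init__(self, hidden=128, num_layers=12, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, q_emb, p_emb, pad_mask=None):
        # h^0 = [x^Q ; x^P]: concatenate along the sequence dimension.
        h0 = torch.cat([q_emb, p_emb], dim=1)  # [batch, |Q| + |P|, hidden]
        # Self-attention over the packed sequence covers all four interactions:
        # Q->Q and P->P (encoding), Q->P and P->Q (matching).
        return self.encoder(h0, src_key_padding_mask=pad_mask)  # pad_mask: True marks padding
```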

Output Layer
Inspired by Yu et al. (2018), the hidden vectors of different Transformer layers h_i, h_j, h_k (i = 6, j = 9, k = 12 in our implementation) are used to represent the input. Moreover, we employ a self-attentive method as in Wang et al. (2017) over the question vectors to obtain a question attentive vector v_q. Finally, we predict the probability of each token being the start (p^s) or end (p^e) position of the answer span; the scores are computed from these vectors using element-wise multiplication (denoted ⊙) and parameter matrices W_1 and W_2.
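The exact output-layer equations are not reproduced above. As one common instantiation of such a self-attentive method, the sketch below pools the question vectors into v_q with a single linear scorer; this scorer and its shape are assumptions, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn


class SelfAttentivePooling(nn.Module):
    """Collapses the question token vectors into a single attentive vector v_q."""
    def __init__(self, hidden=128):
        super().__init__()
        self.scorer = nn.Linear(hidden, 1)  # one scalar relevance score per token (assumption)

    def forward(self, question_states, question_mask):
        # question_states: [batch, |Q|, hidden]; question_mask: [batch, |Q|], 1 for real tokens.
        scores = self.scorer(question_states).squeeze(-1)
        scores = scores.masked_fill(question_mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1)  # attention distribution over question tokens
        return torch.bmm(weights.unsqueeze(1), question_states).squeeze(1)  # v_q: [batch, hidden]
```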
To train the model, we maximize the log likelihood of ground-truth start and end positions given input passage and question. At test time, we predict answer spans approximately by greedy search.
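As an illustration of the decoding step (the paper does not spell out the procedure), one simple greedy variant picks the best start position first and then the best end position at or after it:

```python
import torch


def greedy_span(start_logits, end_logits):
    """Greedy span decoding: choose the best start, then the best end at or after it.

    start_logits, end_logits: 1-D tensors of per-token scores over the passage.
    """
    s = int(torch.argmax(start_logits))
    e = s + int(torch.argmax(end_logits[s:]))
    return s, e
```

A more exhaustive alternative would score every (start, end) pair with start ≤ end, but the greedy rule above keeps decoding linear in the passage length.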

Experimental Setup
Dataset The Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016) is composed of over 100,000 instances created by crowdworkers. Every answer is constrained to be a contiguous sub-span of the passage.
Settings We use the spaCy toolkit to preprocess the data. We use 300-dimensional GloVe embeddings (Pennington et al., 2014) to initialize the word vectors of both questions and passages, and keep them fixed during training. A special trainable token <UNK> represents out-of-vocabulary words. During training, we randomly replace words in the passage with <UNK> with probability 0.2. The dimensions of the character embedding and segment embedding are 64 and 128, respectively. The number of Transformer layers in our model is 12. For each Transformer layer, we set the hidden size to 128 and use relative position embeddings (Shaw et al., 2018) with a clipping distance of 16. The number of attention heads is 8.
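A minimal sketch of the <UNK> masking step, assuming it is applied at the word level before embedding lookup:

```python
import random

UNK_TOKEN = "<UNK>"


def mask_passage_tokens(passage_tokens, unk_prob=0.2):
    """Randomly replaces passage words with <UNK> during training (a form of word dropout)."""
    return [UNK_TOKEN if random.random() < unk_prob else token for token in passage_tokens]


# Example: mask_passage_tokens("The cat sat on the mat".split())
```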
Table 1: Results (EM / F1).
BiDAF (Seo et al., 2016): 68.0 / 77.3
R-Net (Wang et al., 2017): 72.3 / 80.7
QANet (Yu et al., 2018): 73

During training, the batch size is 32 and the maximum number of training epochs is 80. We use Adam (Kingma and Ba, 2015) as the optimizer with β_1 = 0.9, β_2 = 0.999, ε = 10^-6. We warm up the learning rate over the first 4,000 steps and keep it fixed at 6 × 10^-4 for the remainder of training. We apply an exponential moving average over all trainable variables with a decay rate of 0.9999. Layer dropout (Huang et al., 2016) is used in the Transformer layers with a survival probability of 0.95. We also apply dropout to the word embeddings, character embeddings, and each layer, with rates of 0.1, 0.05, and 0.1, respectively.
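The warmup-then-constant schedule can be summarized by a small helper. The linear shape of the warmup is an assumption; the paper only states that warmup is applied before the rate is held fixed.

```python
def learning_rate(step, peak_lr=6e-4, warmup_steps=4000):
    """Warm up over the first 4,000 steps, then keep the learning rate constant."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps  # assumed linear warmup
    return peak_lr
```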
Comparison Models Apart from comparing with previous state-of-the-art models (Seo et al., 2016; Wang et al., 2017; Yu et al., 2018), we implement a baseline model that performs encoding and matching separately, using the same settings as above. Its first three Transformer layers encode the passage and question separately. We then add a passage-question matching layer following Yu et al. (2018), with nine more Transformer layers used to compute the question-sensitive passage representations. For a fair comparison, we only compare with models that do not rely on pre-trained language models (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2018).

Results
Exact match (EM) and F1 are the two evaluation metrics for SQuAD. EM measures the percentage of predictions that match the ground-truth answer exactly, while F1 measures the token overlap between the predicted answer and the ground-truth answer. Scores on the development set are computed with the official evaluation script. As shown in Table 1, the unified model outperforms previous state-of-the-art models and the baseline model. Our unified model brings a 1.1/1.1 absolute improvement in EM/F1 over the baseline that conducts encoding and matching separately. The results indicate that the unified model not only simplifies the model architecture, but also improves performance on SQuAD.
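For reference, the two metrics can be computed roughly as follows. The official script additionally normalizes answers (lowercasing, removing articles and punctuation) and takes the maximum over multiple reference answers, which this sketch omits.

```python
from collections import Counter


def exact_match(prediction, ground_truth):
    """Returns 1 if the predicted answer string equals the ground truth, else 0."""
    return int(prediction == ground_truth)


def f1_score(prediction, ground_truth):
    """Token-level F1 between the predicted and ground-truth answer strings."""
    pred_tokens, gold_tokens = prediction.split(), ground_truth.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)  # per-token overlap counts
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```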

Analysis
We introduce a passage encoding ratio e_p and a question encoding ratio e_q to quantify the encoding and matching strategies of each layer of the unified encoder. Take the question encoding ratio of an attention head in the l-th Transformer layer as an example. Given the attention head's self-attention weight matrix A, the ratio e_q is computed via e_q = s_{q|q} / (s_{q|q} + s_{q|p}), where s_{q|q} is the average question-to-question attention weight and s_{q|p} is the average passage-to-question attention weight (see the code sketch after this list). If e_q is close to 1, the layer tends to perform question-to-question encoding; if e_q is close to 0, the layer performs more passage-to-question matching. The passage encoding ratio e_p is computed analogously. As shown in Figure 2, we compute the passage encoding ratio e_p and the question encoding ratio e_q for all attention heads on the development set, and plot their density distributions for each Transformer layer. We find that the unified model learns strategies that are clearly different from manually-designed architectures:
• Figure 2(a) shows that the first three layers perform question-to-passage matching and the fourth layer conducts passage-to-passage encoding, whereas most previous models perform passage encoding first.
• Figure 2(a) indicates that upper layers tend to conduct more encoding than matching.
• Figure 2(b) shows that all layers tend to perform more question-to-question encoding than passage-to-question matching.
• Some layers automatically learn to perform encoding and matching at the same time rather than modeling them separately.
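Below is a minimal sketch of how the two ratios can be computed from one attention head's weight matrix A and the segment ids, following the formula reconstructed above; it reflects our reading of the metric rather than the authors' analysis code.

```python
import torch


def encoding_ratios(attn, segment_ids):
    """Computes the question encoding ratio e_q and passage encoding ratio e_p for one head.

    attn:        [seq, seq] attention weights (rows are queries, columns are keys).
    segment_ids: [seq] tensor with 0 for question tokens and 1 for passage tokens.
    """
    q = segment_ids == 0
    p = segment_ids == 1
    # Average attention directed at question tokens, from question vs. passage queries.
    s_q_given_q = attn[q][:, q].mean()  # question-to-question encoding weight
    s_q_given_p = attn[p][:, q].mean()  # passage-to-question matching weight
    e_q = s_q_given_q / (s_q_given_q + s_q_given_p)
    # The passage encoding ratio is computed symmetrically.
    s_p_given_p = attn[p][:, p].mean()  # passage-to-passage encoding weight
    s_p_given_q = attn[q][:, p].mean()  # question-to-passage matching weight
    e_p = s_p_given_p / (s_p_given_p + s_p_given_q)
    return e_q.item(), e_p.item()
```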

Conclusion
In this work, we evaluate and analyze unifying encoding and matching components with Transformer layers for the MRC task. Experimental results on the SQuAD dataset show that the unified model outperforms previous networks that treat encoding and matching separately. We further introduce a metric to inspect whether a layer behaves more like an encoding layer or a matching layer. The analysis shows that the unified Transformer layers automatically learn strategies that are clearly different from those of previous specially-designed models.