Token-level Dynamic Self-Attention Network for Multi-Passage Reading Comprehension

Multi-passage reading comprehension requires the ability to combine cross-passage information and reason over multiple passages to infer the answer. In this paper, we introduce the Dynamic Self-attention Network (DynSAN) for the multi-passage reading comprehension task, which processes cross-passage information at token-level while avoiding substantial computational costs. The core module of the dynamic self-attention is a proposed gated token selection mechanism, which dynamically selects important tokens from a sequence. These chosen tokens then attend to each other via a self-attention mechanism to model long-range dependencies. Besides, convolutional layers are combined with the dynamic self-attention to enhance the model's capacity for extracting local semantics. The experimental results show that the proposed DynSAN achieves new state-of-the-art performance on the SearchQA, Quasar-T and WikiHop datasets. A further ablation study also validates the effectiveness of our model components.


Introduction
As a critical approach for evaluating the ability of an intelligent agent to understand natural language, reading comprehension (RC) is a challenging research direction that attracts many researchers' interest. In real application scenarios, such as web search, the supporting passages may be numerous and long, and may contain a mixture of relevant and irrelevant content. This setting gives rise to the problem of multi-passage reading comprehension.
Great efforts have been made to develop models for the multi-passage task, such as Wang et al. (2018b); Song et al. (2018). The common practice of these approaches is that all the embeddings in a passage or a span are integrated into a single vector, and cross-passage information interactions are based on these coarse-grain semantic representations. However, this may cause potential issues. As pointed out in Bahdanau et al. (2015); Cho et al. (2014), compressing all the necessary information into a single vector may sacrifice some critical information, because capacity is allocated to remembering other information. This problem is also prevalent in Neural Machine Translation (NMT); recent models, such as the Transformer (Vaswani et al., 2017), work around this issue by decoding over token-level context encodings of the source text. As such, we hypothesize that fine-grain representations may preserve precise semantic information and may be beneficial to cross-passage information interactions in RC tasks. In this paper, we focus on an architecture which deals with cross-passage information at token-level.
The proposed architecture is a variant of the Self-attention Network (SAN) (Vaswani et al., 2017; Shen et al., 2018a). Our model employs a self-attention mechanism to combine token-level supportive information from all passages in a multi-step process. Directly applying self-attention over all tokens is computationally expensive. Instead, in each step, the most important tokens are dynamically selected from all passages, and information interaction only happens over these chosen tokens via the self-attention mechanism. The motivation behind this is the observation that the information used to answer the question is usually concentrated in a few words.
Our experiments verify this observation to a certain extent. We expect that our model can automatically find these important tokens; thus we propose a gated token selection mechanism and equip it with the self-attention module. We intend the model to achieve a balance in speed, memory, and accuracy. While the self-attention mechanism is widely used in end-to-end models to capture long-range dependency, it is intrinsically inefficient in memory usage. Shen et al. (2018b) elaborate on the memory issue: the memory required to store the attention matrix grows quadratically with the sequence length. In real scenarios, such as web search, the retrieval system may return hundreds of articles, each containing hundreds or thousands of words, so applying self-attention over all tokens in the supporting passages is computationally expensive. Compared to recurrent neural networks, such as LSTM (Hochreiter and Schmidhuber, 1997), SAN is highly parallelizable and usually faster on long sequences (Vaswani et al., 2017). The proposed method accomplishes the necessary cross-passage information interaction with a time/memory complexity linear in the sequence length and does not add much extra computational burden.
Our contributions in this work are as follows: (1) We propose Dynamic Self-attention (DynSA) for information interaction in a long sequence. (2) Token-level cross-passage information interaction is implemented through the proposed DynSA at relatively low computational cost. (3) Our Dynamic Self-attention Network (DynSAN) achieves new state-of-the-art performance compared with previously published results on the SearchQA, Quasar-T and WikiHop benchmarks.

Dynamic Self-attention Block
This section introduces the Dynamic Self-Attention Block (DynSA Block), which is central to the proposed architecture. The overall architecture is depicted in Figure 1.
The core idea of this module is a gated token selection mechanism combined with self-attention. We expect that a gate can estimate each token's importance in an input sequence, and we use this estimated importance to extract the most important K tokens. Then we run self-attention: instead of computing the full self-attention matrix over all the tokens, only the chosen K tokens are taken into account. This results in lower memory consumption and makes the self-attention focus on the active part of a long input sequence. The above idea is implemented through stacking two structures: a local encoder and a dynamic self-attention module.

Local Encoder
In the architecture, a local encoder is used to encode local information, such as short-range context, which is useful for disambiguation. The reasons for the local encoder are that (1) computing self-attention over only a few tokens of a long sequence may cause the self-attention to lose the capability of modeling short-range context for every position in the sequence, (2) after a position receives attended information from long-range positions, the local encoder is needed to spread this information to its neighboring positions, and (3) previous works have shown that combining a local encoder with self-attention is beneficial in some tasks. A natural candidate for the local encoder is local convolution, which is widely used as a local feature extractor; restricted self-attention (Vaswani et al., 2017) is also a choice. In this work, we adopt 1D convolution as the local encoder. Specifically, let X ∈ R^{D×L} be the input matrix of an L-token sequence, where each token embedding is D-dimensional. The output of a convolutional layer is calculated with a residual connection: Conv(LN(X)) + X, where LN is layer normalization (Ba et al., 2016) and Conv denotes a convolutional layer. To reduce computational costs, we adopt depth-wise separable convolutions (Chollet, 2017) throughout this paper. The local encoder consists of a stack of 2 convolutional layers.
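To make the computation concrete, the pre-norm residual layer Conv(LN(X)) + X with a depth-wise separable convolution can be sketched in NumPy as below; the kernel size and weight shapes here are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector (column of the D x L matrix).
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def depthwise_separable_conv(x, dw_kernel, pw_weight):
    # x: (D, L); dw_kernel: (D, k), one 1D filter per channel;
    # pw_weight: (D, D), the 1x1 "pointwise" mixing across channels.
    D, L = x.shape
    k = dw_kernel.shape[1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    out = np.zeros_like(x)
    for i in range(L):  # "same"-padded depthwise convolution
        out[:, i] = (xp[:, i:i + k] * dw_kernel).sum(axis=1)
    return pw_weight @ out

def conv_layer(x, dw_kernel, pw_weight):
    # Pre-norm residual block: Conv(LN(x)) + x
    return depthwise_separable_conv(layer_norm(x), dw_kernel, pw_weight) + x
```

With a zero depthwise kernel the layer reduces to the identity via the residual path, which makes the residual formulation easy to check.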

Dynamic Self-attention
Since our self-attention is performed over a set of tokens which are determined dynamically, we call it Dynamic Self-Attention (DynSA). DynSA is based on the hypothesis that, in a long sequence, the number of important tokens is much smaller than the sequence length. Here, a token being important means that the token contains information necessary for the model to predict the answer, or that the token is non-negligible for modeling long-range semantic dependency. DynSA aims to find the most important tokens by a token selection mechanism and then performs self-attention only over these chosen tokens.
In DynSA, we use a gate to control how much of the output, which includes non-linear transformations and attended vectors, passes through this layer. A large gate activation value implies that the corresponding output is important in this layer; thus, we use the gate activation as the basis of token selection. Given the output of the local encoder U ∈ R^{D×L}, the gate activation is computed via

G = F_G(F_U(U)),    (1)

where F_U denotes a non-linear fully connected layer and F_G denotes an affine transformation with a sigmoid activation function. In our work, we allow the use of multi-head attention (Vaswani et al., 2017). Equation 1 outputs G ∈ R^{H×L}, which contains H heads, and we use g_h ∈ R^L (the h-th row of G) to represent the gate output of the h-th head. The element g_{h,i} of g_h is the gate activation corresponding to the token at the i-th position. Then, in each head, we select the top K tokens according to their corresponding gate activations in g_h, where K is a hyper-parameter. In case the actual sequence length is less than K, we select all the tokens. We obtain the chosen tokens' embeddings U_h = [u_{i_{h,1}}, ..., u_{i_{h,j}}, ..., u_{i_{h,K}}] ∈ R^{D×K}, where i_{h,j} ∈ {1, 2, ..., L} is the position index of the chosen token in the input sequence. We refer to this as the gated token selection mechanism.
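A minimal NumPy sketch of the gated token selection follows, assuming F_U is a ReLU fully connected layer and F_G a sigmoid affine map; the weight names W_u and W_g are hypothetical and biases are omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_token_selection(U, W_u, W_g, K):
    # U: (D, L) local-encoder output.
    # W_u: (D, D) weights of the non-linear FC layer F_U (ReLU assumed).
    # W_g: (H, D) weights of the affine gate F_G (sigmoid activation).
    # Returns the (H, L) gate activations and, per head, the indices
    # of the top-K tokens by gate activation.
    G = sigmoid(W_g @ np.maximum(W_u @ U, 0.0))   # (H, L)
    L = U.shape[1]
    K = min(K, L)  # if the sequence is shorter than K, keep all tokens
    idx = np.argsort(-G, axis=1)[:, :K]           # (H, K) chosen positions
    return G, idx
```

In practice a partial sort such as `np.argpartition` (or `torch.topk`) would be used instead of a full `argsort` for efficiency.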
Scaled dot-product attention is adopted over the chosen tokens:

A_h = V_h softmax(K_h^T Q_h / √(D/H)),    (2)

where Q_h, K_h, V_h ∈ R^{(D/H)×K} are the query, key, and value respectively; they are linear projections of the input U_h. A_h ∈ R^{(D/H)×K} is the attended output matrix of the h-th head.
Next, we pad the unchosen positions with zero embeddings to restore the full sequence length, obtaining Ã_h ∈ R^{(D/H)×L}. The gated output of the h-th head is

Y_h = (g_h / max(g_h)) ⊙ (Ã_h + F_h),    (3)

where F_h ∈ R^{(D/H)×L} is a linear projection of the input embeddings. Since zero embeddings are padded at unchosen positions, adding F_h avoids gradient vanishing when updating the parameters of the gate in the training phase. In Equation 3, the maximum operation selects the maximum element in the vector g_h, and the division normalizes these elements so that the maximum activation is always one.
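As a sketch of one head's output: the attended vectors of the chosen tokens are zero-padded back to the full length, combined with a projection F_h of the input, and scaled by the normalized gate. This is an illustrative reading of the description in the text, not the exact implementation.

```python
import numpy as np

def gated_head_output(A_h, F_h, g_h, idx_h):
    # A_h: (Dh, K) attended vectors for the chosen tokens;
    # F_h: (Dh, L) projection of the input embeddings (keeps gradients
    #      flowing to the gate even for unchosen positions);
    # g_h: (L,) gate activations; idx_h: (K,) chosen positions.
    Dh, L = F_h.shape
    A_pad = np.zeros((Dh, L))
    A_pad[:, idx_h] = A_h          # zero embeddings at unchosen positions
    scale = g_h / g_h.max()        # normalize so the max activation is 1
    return scale * (A_pad + F_h)
```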
Finally, the output Y ∈ R^{D×L} of a DynSA block is the fusion of all heads:

Y = F_Y([Y_1; ...; Y_H]),

in which F_Y denotes a linear projection and [·; ·] is the concatenation of the outputs of all heads. Optionally, we suggest adding a regularization on the gate activation to make it sparser, so that the activation values of unimportant tokens are almost zero and the model generates more discriminative gate activations. Experiments show that this regularization can produce small gains in performance. Specifically, we jointly optimize the following regularization term when training the model:

Ω = β ||G||_1,    (6)
where G represents the gate activation, ||·||_1 denotes the ℓ1-norm, and β is a small hyper-parameter, which is set to 10^{-5} in our experiments.

Figure 2: Architecture of the Dynamic Self-Attention Network (DynSAN) for multi-passage reading comprehension.

Token-level Dynamic Self-attention Network
This section introduces the application of our proposed Dynamic Self-attention Network (DynSAN) to the multi-passage RC task. Given a question and M passages, the model is required to predict a span from the passages to answer the question. Figure 2 illustrates the architecture of DynSAN.

Input Encoding
At the bottom of DynSAN, the input texts are first converted into distributional representations. We use the concatenation of word embeddings and character encodings for every single token. For word embeddings, we adopt the pre-trained 300-dimensional fastText (Mikolov et al., 2018) word embeddings and fix them during training. Character encodings are obtained by performing convolution and max-pooling over 15-dimensional randomly initialized character embeddings, following Kim (2014); the character embeddings are trainable. On top of the embeddings, we adopt a 2-layer highway network (Srivastava et al., 2015) for deep transformation. The output of the highway network is immediately mapped to D dimensions through a linear projection, and we add sinusoidal positional embeddings (Vaswani et al., 2017) to each token's vector to expose position information to the model. Then, the vectors are fed into a layer of DynSA blocks. These DynSA blocks are in charge of independently encoding context information inside the question and every passage, with the parameters of the DynSA blocks shared within the layer. We use DynSA rather than full multi-head self-attention to avoid the massive memory consumption caused by exceptionally long passages.
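The sinusoidal positional embeddings added here follow Vaswani et al. (2017); a small NumPy sketch using the (D, L) column-per-token layout of this paper (D is assumed even):

```python
import numpy as np

def sinusoidal_positions(D, L):
    # Standard sinusoidal positional embeddings, returned as a (D, L)
    # matrix so they can be added directly to the token vectors.
    pos = np.arange(L)[None, :]             # (1, L) positions
    i = np.arange(D // 2)[:, None]          # (D/2, 1) frequency indices
    angles = pos / (10000 ** (2 * i / D))   # (D/2, L)
    pe = np.zeros((D, L))
    pe[0::2, :] = np.sin(angles)            # even dims: sine
    pe[1::2, :] = np.cos(angles)            # odd dims: cosine
    return pe
```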

Alignment
Alignment is a common and necessary step to generate question-aware context vectors for each passage. Here, we adopt the strategy used in , which includes a trilinear co-attention (Weissenborn et al., 2017) and a heuristic combination with query-to-context attention (Seo et al., 2017). Due to limited space, we refer readers to the cited works for detailed descriptions. The question-aware context vectors are then projected into the standard dimension D through a linear layer and are encoded by another layer of DynSA blocks to further build semantic representations inside each passage.
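As an illustration of the trilinear similarity commonly used in such co-attention, a NumPy sketch follows; the weight vectors w_p, w_q, and w_pq are hypothetical names for the learned parameters of the trilinear form f(p, q) = w_p·p + w_q·q + w_pq·(p ⊙ q).

```python
import numpy as np

def trilinear_similarity(P, Q, w_p, w_q, w_pq):
    # P: (D, Lp) passage tokens, Q: (D, Lq) question tokens.
    # Returns S with S[i, j] = w_p . p_i + w_q . q_j + w_pq . (p_i * q_j).
    s_p = w_p @ P                        # (Lp,) passage-only term
    s_q = w_q @ Q                        # (Lq,) question-only term
    s_pq = (w_pq[:, None] * P).T @ Q     # (Lp, Lq) interaction term
    return s_p[:, None] + s_q[None, :] + s_pq
```

Row-wise and column-wise softmaxes of S then give the context-to-query and query-to-context attention distributions used in the heuristic combination.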

Cross-Passage Attention
Thus far, each passage aligns with the question independently, and the DynSA blocks generate contextual embeddings inside each passage independently, so there is no interaction between passages. For multi-passage reading comprehension, cross-passage information interaction is beneficial for problems such as multihop reasoning and multi-passage verification. Previous works either omit the cross-passage interaction (Clark and Gardner, 2018) or implement it at a relatively coarse granularity (Dehghani et al., 2019a). For example, in Dehghani et al. (2019a), each passage is encoded into a single vector and self-attention is performed over these passage vectors. Instead of passage-level or block-level interaction (Shen et al., 2018b), in this work we focus on modeling cross-passage long-range dependencies at token-level through a cross-passage attention layer, expecting that fine-grain self-attention may preserve precise semantic information. This layer consists of N stacked DynSA blocks. Specifically, as shown in Figure 2, we concatenate the vector sequences of all passages end to end, and then stack N layers of DynSA blocks on top of this long vector sequence. If the passages are given in order, for instance ranked by a search engine, we add a rank embedding to each passage before the concatenation. The rank embeddings are randomly initialized, and the i-th rank embedding is added to every token vector in the i-th ranked passage.
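The concatenation with rank embeddings can be sketched as follows; names and shapes are illustrative, and in the model the rank embeddings are learned rather than fixed.

```python
import numpy as np

def concat_with_rank(passages, rank_emb):
    # passages: list of (D, L_i) matrices, one per ranked passage;
    # rank_emb: (M, D), one vector per rank. The i-th rank vector is
    # added to every token of the i-th ranked passage, then all
    # passages are concatenated into one long sequence.
    cols = [p + rank_emb[i][:, None] for i, p in enumerate(passages)]
    return np.concatenate(cols, axis=1)   # (D, sum_i L_i)
```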

Prediction Layer
The prediction layer is used to extract the answer based on the output of the previous layers. Depending on the type of task, different architectures are chosen. In this work, we investigate extractive QA and multiple choice QA.

Extractive QA
Extractive QA is challenging since we have to extract the answer span from the passages without any given candidate answers. In this paper, we adopt the Hierarchical Answer Spans (HAS) model (Pang et al., 2019) to solve this problem. Details are included in Pang et al. (2019), and we do not repeat them here due to limited space. In our implementation, the differences from Pang et al. (2019) are that the start/end probability distribution is calculated over all tokens as in Equation 7, the RNN is replaced with a DynSA block, and the paragraph quality estimator mentioned in Pang et al. (2019) is not used.

Multiple Choice QA
In this type of task, a list of candidate answers is provided. Here, we denote by S ∈ R^{D×L} the output of the cross-passage attention layer, where L is the total length of the M passages, q denotes the question, and P = {p_1, ..., p_M} denotes the set of passages. We first convert the token vectors into a probability distribution r ∈ R^L over all tokens, r = softmax(F_S(S)), where F_S is a linear projection. The probability of choosing a candidate c as the answer is computed via

Pr(c | q, P) = Σ_{t ∈ T_c} r_t,

where T_c is the set of positions at which the mentions of candidate c appear. During training, we maximize the log-likelihood of the correct answer.
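A sketch of this scoring scheme, under our reading of the aggregation over T_c: each candidate's probability is the summed token probability over its mention positions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def candidate_probabilities(scores, candidate_positions):
    # scores: (L,) per-token logits F_S(S);
    # candidate_positions: dict mapping each candidate answer to the
    # set of token positions where its mentions occur (T_c).
    r = softmax(scores)
    return {c: r[list(pos)].sum() for c, pos in candidate_positions.items()}
```

A candidate mentioned at several places thus accumulates probability mass from all of its mentions.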

Datasets
We conduct experiments to study the performance of the proposed approach on three publicly available multi-passage RC datasets.
SearchQA (Dunn et al., 2017) is an open-domain QA dataset including about 140k questions crawled from J! Archive, with about 50 web page snippets retrieved from the Google search engine as supporting passages for each question. The authors of SearchQA provide a processed version of this dataset, in which all words are lower-cased and tokenization has been completed. Our experiments are based on this processed version.
Quasar-T (Dhingra et al., 2017) is an open-domain QA dataset including about 43k trivia questions collected from various internet sources, with 100 supporting passages for each question. These supporting passages are given in an order ranked by a search engine.
WikiHop (Welbl et al., 2018) is a multiple choice QA dataset constructed using a structured knowledge base. One has to submit the model and work with the authors to obtain the test score. For this dataset, a binary feature is concatenated with the word embeddings and character embeddings to indicate whether a token belongs to any candidate answer.
The above three datasets have official train/dev/test splits, so we do not split them ourselves. Some of the datasets provide additional meta-data; we do not use this additional information in our experiments. We observe that low-ranked passages play a critical role in improving accuracy, so we retain all supporting passages as inputs to our model. The averages/medians of the total length of the concatenation of all supporting passages for each question are around 1.9k/2k, 2.4k/2.4k, and 1.2k/1k tokens in SearchQA, Quasar-T, and WikiHop respectively. Thus, we limit the maximum length to 5k tokens and discard a few exceptionally long cases. Tokenization is performed using spaCy (https://spacy.io) during preprocessing.

Experimental Setup
In DynSAN, the kernel size is 7 for all convolutional layers, the standard dimension D is 128, the number of heads H is 8, and the number of chosen tokens K is 256. In the cross-passage attention layer, we stack N = 4 layers of DynSA blocks. The mini-batch size is set to 32. For regularization, we apply dropout between every two layers with a dropout rate of 0.1. Adam (Kingma and Ba, 2015) with learning rate 0.001 is used for tuning the model parameters. We use a learning rate warm-up scheme in which the learning rate increases linearly from 0 to 0.001 over the first 500 steps. The models for multi-passage reading comprehension are trained on four 12GB K80 GPUs using synchronous SGD (Das et al., 2016). An exponential moving average of the parameters is adopted with a decay rate of 0.9999.
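The warm-up scheme described above amounts to a simple piecewise-linear function of the step count; a sketch:

```python
def warmup_lr(step, base_lr=0.001, warmup_steps=500):
    # Linear warm-up: the learning rate grows from 0 to base_lr over
    # the first `warmup_steps` steps, then stays at base_lr.
    if step >= warmup_steps:
        return base_lr
    return base_lr * step / warmup_steps
```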

Main Results
The performance of our model and competing approaches is summarized in Table 1 and Table 2.

Table 2: Accuracy on WikiHop (Dev / Test) — BiDAF (Seo et al., 2017): – / 42.9; Coref GRU (Dhingra et al., 2018): 56.0 / 59.3; MHQA-GRN (Song et al., 2018); (Welbl et al., 2018): – / 74.1.

For extractive QA, the standard metrics Exact Match (EM) and F1 score (Rajpurkar et al., 2016) are used; the scores are evaluated by the official script of Rajpurkar et al. (2016). For multiple choice QA, performance is evaluated by the accuracy of choosing the correct answer. As shown, the proposed model clearly outperforms all previously published approaches and achieves new state-of-the-art performance on the three datasets, which validates the effectiveness of the dynamic self-attention network for multi-passage RC. It is noteworthy that the competing approaches use coarse-grain representations for cross-passage information interaction or omit cross-passage information interaction entirely.

Ablations
In order to evaluate the individual contribution of each model component, we conduct an ablation study. Specifically, we remove or replace model components and report the performance on the SearchQA test set in Table 3. In (a), we remove the cross-passage attention. In (b), we remove all self-attention, i.e., the context information is modeled by the convolutional layers only. In (c), we remove all convolutional layers in DynSA blocks. In (d), we remove the gated token selection mechanism in DynSA blocks; in other words, which K tokens are selected is decided randomly rather than by the gate activation. Further, in (e) we remove the gate itself from (d). In (f), we remove the regularization on the gate activation by setting β = 0. In (g), we replace the DynSA block with Bi-BloSA (Shen et al., 2018b), which is also proposed for long-sequence modeling but is a block-level self-attention; the Bi-BloSA is implemented using the authors' open-source code. On the basis of (g), we combine Bi-BloSA with convolutional layers in (h).
As shown in Table 3, cross-passage attention is most critical to the performance (almost a 10% drop); this result proves the necessity of information interaction between passages. Since we set K = 256, and most individual passages are within 256 tokens, the DynSA models local context for every position before the concatenation of all passages; therefore, removing convolutional layers does not degrade the model entirely in (c). Self-attention and convolutional layers account for 4.7% and 2.9% performance drops respectively, which illustrates that self-attention plays a more critical role than convolutional layers in modeling context information. In (d), the performance drops significantly, proving the effectiveness of the gated token selection mechanism in the proposed architecture. Comparing (e) to (d), and (f) to the full architecture, we conclude that the gate itself and the regularization also bring slight benefits to the model. From (g) and (h), we learn that the token-level DynSA block outperforms the block-level Bi-BloSA by a large margin, verifying the superiority of fine-grain representation.

Qualitative Analysis
We conduct a case study to show which tokens are selected as important by the gated token selection mechanism. In a DynSA block, we define the maximum gate activation over all heads as a token's activity; the activity reflects the estimated importance of a token. In this subsection, all tokens are ranked according to the sum of their activities over all DynSA blocks in the cross-passage attention layer. In Figure 3, two question-answering instances are given, and the top-ranked tokens are shaded. As we can see, the model tends to mark cue words and plausible answers as the important tokens in DynSA blocks. We conjecture that information interactions between plausible answers may play an answer verification role, while information interactions between cue words may be considered multihop reasoning. We also observe that in many mispredicted instances the correct answer never obtains large gate activations in the cross-passage attention layers; perhaps this is one reason for the mispredictions.

We also count the average number of active tokens on the Quasar-T dev set, where a token is defined as active when its activity is greater than 0.01. Figure 4(b) reports the statistics. In general, the activity values tend to be polarized, i.e., either near zero or near one, which is probably caused by the normalization in Equation 3 and the regularization term in Equation 6. Besides, the intra-passage DynSA blocks (layer -1 and layer 0) have more active tokens, while the cross-passage blocks have fewer. This suggests that more tokens take effect in understanding a single passage, while only a few important tokens are necessary for cross-passage information interaction. The results verify the observation mentioned in Section 1.
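The activity statistics described here can be sketched as follows: a token's activity in a block is its maximum gate activation over heads, tokens are ranked by summed activity over blocks, and the 0.01 active threshold follows the text.

```python
import numpy as np

def token_activity(gate_stack, threshold=0.01):
    # gate_stack: (B, H, L) gate activations from B DynSA blocks.
    # A token's activity in one block is its max gate activation over
    # the H heads; tokens are ranked by the sum over all B blocks.
    per_block = gate_stack.max(axis=1)       # (B, L) activity per block
    activity = per_block.sum(axis=0)         # (L,) summed activity
    ranking = np.argsort(-activity)          # most important tokens first
    # Average number of active tokens per block.
    avg_active = (per_block > threshold).sum(axis=1).mean()
    return activity, ranking, avg_active
```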

Time Cost & Memory Consumption
We also conduct experiments to measure the computational costs of the proposed model and other baseline models. Specifically, we replace the DynSA blocks in Figure 2 with Bi-LSTM (Hochreiter and Schmidhuber, 1997), full SAN, and Bi-BloSAN (Shen et al., 2018b) respectively. Note that the full SAN refers to the model encoder block in QANet, which is a combination of global multi-head self-attention and local convolution. It is a strong baseline, and we use it to illustrate the situation of full self-attention over all tokens.
To avoid the long running time of Bi-LSTM and the out-of-memory issue of full SAN on multi-passage RC tasks, we select SQuAD 1.1 (Rajpurkar et al., 2016) as the benchmark dataset. Since SQuAD is a single-passage RC task, we consider it a special case of multi-passage RC in which the number of passages M equals 1. In this experiment, the top K = 32 tokens are chosen in DynSAN. Models are trained on a single 12GB K80 GPU.
The results are shown in Table 4. Compared with the full SAN and Bi-LSTM, DynSAN has a slight accuracy drop, while Bi-BloSAN degrades significantly. In terms of time cost, DynSAN reaches 4.3x and 3.3x speedups, and it has a memory consumption similar to Bi-LSTM. Given the characteristics of Bi-LSTM and the full SAN, the advantage of DynSAN in speed and memory consumption would become more significant as the sequence length increases. Although DynSAN has a small accuracy drop relative to the full SAN, DynSAN is a relatively balanced model with respect to speed, memory, and accuracy.

Model Analysis
Effect of Token Selection Figure 5(a) shows the effects of the token selection. As the number of chosen tokens increases, performance improves as expected. When the number of chosen tokens is large enough, the gain becomes marginal. The choice of this hyper-parameter has an impact on the balance in speed, memory, and accuracy.
Number of Passages Figure 5(b) answers the following research question: "How would the performance change with respect to the number of passages?" As more supporting passages are taken into consideration, both the F1 and EM scores of our model continuously increase. The results verify that low-ranked passages play a critical role in answering the questions.

Related Works
As far as multi-passage reading comprehension is concerned, many powerful deep learning approaches have been introduced to solve this problem. De Cao et al. (2019); Song et al. (2018) introduce graph convolutional networks (GCN) and graph recurrent networks (GRN) into this task. Dhingra et al. (2018) use co-reference annotations extracted from an external system to connect entity mentions for multihop reasoning. Zhong et al. (2019) propose an ensemble of coarse-grain and fine-grain co-attention networks. Pang et al. (2019) propose a hierarchical answer spans model to tackle the problem of multiple answer spans. Clark and Gardner (2018) use a shared-normalization objective to produce accurate per-passage confidence scores and marginalize the probability of an answer candidate over all passages. While it outperforms most single-passage RC models by a large margin, it processes each passage independently, omitting multi-passage information interaction completely. In Wang et al. (2018b), cross-passage answer verification is explicitly proposed: all the word embeddings in a passage are summed through an attention mechanism to represent an answer candidate, and then each answer candidate attends to the other candidates to collect supportive information. In Dehghani et al. (2019a), multihop reasoning is implemented by a Universal Transformer (Dehghani et al., 2019b), which is mainly based on multi-head self-attention (Vaswani et al., 2017) and a transition function.
Our work is related to the Self-attention Network (SAN) (Vaswani et al., 2017; Shen et al., 2018a). Vaswani et al. (2017) were the first to explore completely replacing the recurrent neural network with self-attention to model context dependencies. Several papers propose variants of the self-attention mechanism, such as Shen et al. (2018c); Hu et al. (2018); Shaw et al. (2018); Yang et al. (2019). Besides, Shen et al. (2018b) explore reducing the computational complexity of self-attention.

Conclusion
In this paper, we proposed a new Dynamic Self-attention (DynSA) architecture, which dynamically determines which tokens are important for constructing intra-passage or cross-passage token-level semantic representations. The proposed approach has the advantage of retaining fine-grain semantic information while reaching a balance among time, memory, and accuracy. We showed the effectiveness of the proposed method on multi-passage reading comprehension using three benchmark datasets: SearchQA, Quasar-T, and WikiHop. Experimental results showed state-of-the-art performance.