Read and Comprehend by Gated-Attention Reader with More Belief

Gated-Attention (GA) Reader has been effective for reading comprehension. GA Reader makes two assumptions: (1) a uni-directional attention that uses an input query to gate token encodings of a document; (2) only the encoding at the cloze position of an input query is considered for answer prediction. In this paper, we propose Collaborative Gating (CG) and Self-Belief Aggregation (SBA) to address these two assumptions respectively. In CG, we first use an input document to gate token encodings of an input query so that the influence of irrelevant query tokens may be reduced. Then the filtered query is used to gate token encodings of a document in a collaborative fashion. In SBA, we conjecture that query tokens other than the cloze token may be informative for answer prediction. We apply self-attention to link the cloze token with other tokens in a query so that the importance of each query token with respect to the cloze position is weighted. Their evidence is then propagated and aggregated for better reading comprehension. Experiments show that our approaches advance the state-of-the-art results on the CNN, Daily Mail, and Who Did What public test sets.


Introduction
Recently, machine reading has received a lot of attention in the research community. Several large-scale datasets of cloze-style query-document pairs have been introduced to measure machine reading capability. Deep learning has been used for text comprehension, with state-of-the-art approaches using attention mechanisms. One simple and effective approach is based on Gated Attention (GA) (Dhingra et al., 2017). Viewing the attention mechanism as word alignment, GA uses document-to-query attention to align each word position of a document with a word token in a query in a "soft" manner. Then the expected encoding of the query, which can be viewed as a masking vector, is computed for each word position of a document. Through a gating function such as the element-wise product, each dimension of a token encoding in a document interacts with the query for information filtering. Intuitively, each token of a document becomes query-aware. Through the gating mechanism, only relevant information in the document is kept for further processing. Moreover, multi-hop reasoning is applied, performing layer-wise information filtering to improve machine reading performance.
In this paper, we propose Collaborative Gating (CG), which attempts to model bi-directional information filtering between query-document pairs. We first apply query-to-document attention so that each token encoding of a query becomes document-aware. Then we use the filtered query and apply the usual document-to-query attention to filter the document. The two attention directions are thus applied in a collaborative manner. Multi-hop reasoning is then applied as in GA Reader. Intuitively, bi-directional attention may capture complementary information for better machine comprehension (Seo et al., 2017; Cui et al., 2017). By filtering both sides of a query-document pair, we hope that the feature representation at the final layer will be more precise for answer prediction. Our experiments show that CG yields further improvement over GA Reader.
Another contribution is the introduction of a self-attention mechanism in GA Reader. One assumption made by GA Reader is that at the final layer for answer prediction, only the cloze position of a query is considered for computing the evidence scores of entity candidates. We conjecture that surrounding words in a query may be related to the cloze position and thus provide additional evidence for answer prediction. Therefore, we employ self-attention to weight each token of the query with respect to the cloze token. Our proposed Self-Belief Aggregation (SBA) amounts to computing the expected encoding at the cloze position, which can be viewed as evidence propagation from other word positions. Then similarity scores between the expected cloze token and the candidate entities of the document are computed and aggregated at the final layer. Our experiments show that SBA can improve machine reading performance over GA Reader.
This paper is organized as follows: In Section 2, we briefly describe related work. Section 3 gives our proposed approaches to improve GA Reader. We present experimental results in Section 4. In Section 5, we summarize and conclude with future work.

Related Work
The cloze-style reading comprehension task can be formulated as follows: given a document-query pair (d, q), select c ∈ C that answers the cloze position in q, where C is the candidate set. Each candidate answer c appears at least once in the document d. Below are related approaches to the reading comprehension problem. Hermann et al. (2015) employed Attentive Reader, which computes a document vector via attention using q, giving a joint representation g(d(q), q). In some sense, d(q) becomes a query-aware representation of a document. Impatient Reader was proposed in the same paper to model the joint representation in an incremental fashion. Stanford Reader (Chen et al., 2016) further simplified Attentive Reader with shallower recurrent units and a bilinear attention. Attention-Sum (AS) Reader introduced a bias towards frequently occurring entity candidates via summation of the probabilities of the same entity instances in a document (Kadlec et al., 2016). Cui et al. (2017) proposed Attention-over-Attention (AoA) Reader, which employed a two-way attention for reading comprehension. Multi-hop architectures for text comprehension were also investigated (Hill et al., 2016; Sordoni et al., 2016; Shen et al., 2017; Munkhdalai and Yu, 2017; Dhingra et al., 2017). Kobayashi et al. (2016) built dynamic representations for candidate answers while reading the document, sharing the same spirit as GA Reader (Dhingra et al., 2017), where token encodings of a document become query-aware. Brarda et al. (2017) proposed sequential attention to make the alignment of query and document tokens context-aware. Wang et al. (2017a) showed that additional linguistic features improve reading comprehension.
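The candidate-selection step of the task formulation, combined with the attention-sum idea of Kadlec et al. (2016), can be illustrated with a minimal sketch. The function name `attention_sum` and the toy tokens and scores are hypothetical and chosen only for illustration; the sketch assumes that per-position scores over the document have already been produced by some reader.

```python
from collections import defaultdict

def attention_sum(doc_tokens, scores, candidates):
    """Sum per-position scores over all occurrences of each candidate
    and return the highest-scoring candidate with the totals.

    doc_tokens: list of document tokens
    scores:     list of per-position scores (assumed to sum to 1)
    candidates: list of candidate answers C, each appearing in the document
    """
    totals = defaultdict(float)
    for token, score in zip(doc_tokens, scores):
        if token in candidates:
            totals[token] += score  # repeated mentions accumulate mass
    best = max(candidates, key=lambda c: totals[c])
    return best, dict(totals)

# Toy example: @entity18 occurs twice, so its two scores are summed.
doc = ["@entity4", "said", "@entity18", "and", "@entity18"]
pos_scores = [0.4, 0.05, 0.3, 0.05, 0.2]
answer, totals = attention_sum(doc, pos_scores, ["@entity4", "@entity18"])
```

In this toy case the single highest-scoring position belongs to @entity4, but the summed mass favours @entity18, which is the bias towards frequently occurring candidates described above.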
Self-attention has been successfully applied in various NLP applications including neural machine translation (Vaswani et al., 2017), abstractive summarization (Paulus et al., 2017) and sentence embedding (Lin et al., 2017). Self-attention links different positions of a sequence to generate a structural representation for the sequence. Self-attention has also been investigated in the reading comprehension literature. Wang et al. (2017b) proposed a Gated Self-Matching mechanism which produced context-enhanced token encodings in a document. In this paper, we apply self-attention from a different angle: we employ it to weight and propagate evidence from different positions of a query to the cloze position to enhance reading comprehension performance.

Proposed Approaches
To enhance the performance of GA Reader, we propose (1) Collaborative Gating and (2) Self-Belief Aggregation, described in Section 3.1 and Section 3.2 respectively. The notation is consistent with that of the original GA Reader paper (see Appendix A).

Collaborative Gating
In GA Reader, document-to-query attention is applied to obtain query-aware token encodings of a document. The attention flow is thus uni-directional. Seo et al. (2017) and Cui et al. (2017) showed that bi-directional attention can be helpful for reading comprehension. Inspired by their idea, we propose a Collaborative Gating (CG) approach under GA Reader, where query-to-document and document-to-query attention are applied in a collaborative manner. We first use query-to-document attention to generate document-aware query token encodings. Intuitively, we use the document to create a mask for each query token. In this step, the query is said to be "filtered" by the document. Then we use the filtered query to gate document tokens as in GA Reader. The document is said to be "filtered" by the filtered query from the previous step. The output document token encodings are fed into the next computation layer.
Figure 1 illustrates CG under a multi-hop architecture, showing that CG fits naturally into GA Reader. The mathematical notation is consistent with GA Reader as described in Appendix A. Dashed lines represent dropout connections. CG modules are circled.
At each layer, document tokens X and query tokens Y are fed into Bi-GRUs to obtain token encodings D and Q respectively. Then we apply query-to-document attention, GA(Q, D), to obtain a document-aware query representation, which gives the filtered query tokens Z = [z_1, z_2, ..., z_|Q|]. Next we apply document-to-query attention using Z, GA(D, Z), to obtain a query-aware document representation. The resulting sequence X = [x_1, x_2, ..., x_|D|] is fed into the next layer. We also explore another way to compute the term z in equation 5. In particular, we may replace Z by Q in equation 5, since Q is in the unmodified encoding space compared to Z. We will study this effect in detail in Section 4.
At the final layer of GA Reader, the encoding at the cloze position is used to calculate a similarity score for each word token in a document. We evaluate whether applying query-to-document attention to filter the query is crucial before computing the similarity scores. In other words, we use D^(K) to filter the query, producing Z^(K). Then the score vector s over document positions is calculated using the filtered encoding at index l of Z^(K), where l is the cloze position. Similar to GA Reader, the prediction can then be obtained using equation 19 and equation 20 in Appendix A. We will study the effect of this final filtering in detail in Section 4.
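The two gating steps of one CG layer can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the GA operator of Dhingra et al. (2017) with dot-product attention, softmax normalization and element-wise product gating, and it interprets the "by Q" variant as computing attention weights from Z while averaging over the original Q. The helper names (`expected_encoding`, `gate`, `collaborative_gating`, `gate_by_q`) are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def expected_encoding(target, source, values=None):
    """For each target token, attend over `source` with dot-product
    attention and return the weighted average of `values`
    (defaults to `source`). The result acts as a mask per target token."""
    values = source if values is None else values
    masks = []
    for t in target:
        alpha = softmax([dot(s, t) for s in source])
        masks.append([sum(a * v[k] for a, v in zip(alpha, values))
                      for k in range(len(t))])
    return masks

def gate(tokens, masks):
    """Element-wise product gating of each token by its mask."""
    return [[a * b for a, b in zip(t, m)] for t, m in zip(tokens, masks)]

def collaborative_gating(D, Q, gate_by_q=False):
    """One CG layer. D and Q are lists of token encodings (lists of floats)."""
    # Step 1: the document filters the query, Z = GA(Q, D).
    Z = gate(Q, expected_encoding(Q, D))
    # Step 2: the filtered query filters the document, X = GA(D, Z).
    # Assumption: with gate_by_q=True, weights still come from Z but the
    # mask is averaged over the original Q.
    masks = expected_encoding(D, Z, values=Q if gate_by_q else Z)
    X = gate(D, masks)
    return X, Z

# Single-token toy case, where every softmax weight equals 1.
X_z, Z = collaborative_gating([[2.0, 3.0]], [[1.0, 1.0]])
X_q, _ = collaborative_gating([[2.0, 3.0]], [[1.0, 1.0]], gate_by_q=True)
```

In the toy case the query token is first scaled by the document token, so gating by Z compounds the document's own magnitude while gating by Q does not. This is one way to see that Q remains in the unmodified encoding space.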

Self Belief Aggregation
In this section, we introduce self-attention for GA Reader to aggregate beliefs from positions other than the cloze position. The motivation is that words surrounding the cloze position of a query may be informative, so beliefs from the surrounding positions can be propagated into the cloze position in a weighted manner. We employ self-attention to measure the weight between the cloze and surrounding positions. Figure 2 shows the Self-Belief Aggregation module at the final layer of GA Reader. Query Y is fed into the SBA module, which uses another bi-directional GRU to obtain token encodings. Attention weights λ are then computed from these encodings with respect to the cloze position l, so that λ measures the importance of each query word with respect to the cloze position. We compute the weighted sum of the query token encodings Q^(K) using λ so that beliefs from surrounding words can be propagated and aggregated upon similarity score computation. Finally, scores at word positions of a document are calculated using this aggregated encoding. When CG is applied jointly with SBA, the filtered query Z^(K) is used instead of Q^(K) in equation 10. Note that self-attention can also be applied to documents to model correlation among words in documents. Consider a document sentence "Those efforts helped him earn the 2013 CNN Hero of the Year" and the query "@placeholder was the 2013 CNN Hero of the Year". Obviously, the entity co-referenced by "him" is the answer, so we hope that self-attention may have a co-reference resolution effect for "him". We will provide empirical results in Section 4.
Figure 2: Self Belief Aggregation.
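The aggregation step can be sketched as follows. This is a minimal illustration under assumptions: the encodings of the extra SBA encoder are called `B` here (a hypothetical name), attention is the dot-product variant with softmax normalization, and the final scores are a softmax over inner products with the document encodings, as in GA Reader. The exact equations of the paper are not reproduced.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def self_belief_scores(B, Q_K, D_K, cloze):
    """B:     query encodings from the separate SBA encoder (assumed name)
    Q_K:   final-layer query encodings (or Z_K when CG is used)
    D_K:   final-layer document encodings
    cloze: index l of the cloze position in the query
    Returns (lam, s): query attention weights and document position scores."""
    # Weight of every query position with respect to the cloze position.
    lam = softmax([dot(b, B[cloze]) for b in B])
    # Expected encoding at the cloze position: weighted sum of Q_K.
    dim = len(Q_K[0])
    q_agg = [sum(w * q[k] for w, q in zip(lam, Q_K)) for k in range(dim)]
    # Similarity of the aggregated encoding with each document position.
    s = softmax([dot(q_agg, d) for d in D_K])
    return lam, s

# Uninformative SBA encodings give uniform weights over the query.
lam_u, s_u = self_belief_scores([[0.0, 0.0], [0.0, 0.0]],
                                [[1.0, 0.0], [0.0, 1.0]],
                                [[1.0, 1.0], [1.0, 1.0]], cloze=0)
# A dominant cloze encoding pushes almost all weight onto position l.
lam_p, _ = self_belief_scores([[10.0, 0.0], [0.0, 0.0]],
                              [[1.0, 0.0], [0.0, 1.0]],
                              [[1.0, 1.0], [1.0, 1.0]], cloze=0)
```

When the weight at the cloze position approaches one, the aggregated encoding reduces to the cloze encoding itself, which recovers the original GA Reader scoring.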

Experiments
In this section, we provide an experimental evaluation of our proposed approaches on public datasets.

Datasets
News stories from CNN and Daily Mail (Hermann et al., 2015) were used to evaluate our approaches. In particular, a query was generated by replacing an entity in the summary with @placeholder. Furthermore, entities in the news articles were anonymized to erase the effect of world knowledge and co-occurrence on reading comprehension. Word embeddings of these anonymized entities are thus less informative.
Another dataset was Who Did What (WDW) (Onishi et al., 2016), constructed from the LDC English Gigaword newswire corpus. Pairs of articles that appeared around the same time period and shared entities were chosen. Then, one article was selected as the document and the other formed a cloze-style query. Queries that were answered easily by the baseline were removed to make the task more challenging. Two versions of the WDW dataset were considered for experiments: a smaller "strict" version and a larger but noisy "relaxed" version. Both shared the same validation and test sets.

Collaborative Gating Results
We evaluated Collaborative Gating under various settings. Recall from Section 3.1 that we proposed two schemes for calculating the gates: using Q or Z in equation 5. When Z is used, the semantics of the query are altered; when the original Q is used, they are not. Moreover, we investigated whether to apply query filtering at the final layer (denoted as "+final filtering" in Table 2).
Results show that CG helps compared to the baseline GA Reader. This may be due to the effect of query-to-document attention, which makes the token encodings of a query more discriminative. Moreover, it is crucial to apply query filtering at the final layer. Using the original Q to compute the gates brought the best results, with an absolute gain of 0.7% over GA Reader on both the validation and test sets. Empirically, we found that CG using Z for gate computation seemed to overfit more easily. Therefore, we use CG with the setting "by Q, +final filtering" for further comparison.

Self-Belief Aggregation Results
To study the effect of SBA, we disabled CG in the reported experiments of this section. Furthermore, we compare the attention functions using dot product and a feed forward neural network with tanh() activation (Wang et al., 2017b). Results are shown in Table 3.
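The two attention scoring functions being compared can be sketched as follows. This is an illustration under assumptions: the feed-forward variant is written in the additive form v^T tanh(W_a a + W_b b), in the style of Wang et al. (2017b), and the parameter names `W_a`, `W_b` and `v` are hypothetical. The exact parametrization used in the experiments is not given in this section.

```python
import math

def dot_score(a, b):
    """Dot-product attention score between two encodings."""
    return sum(x * y for x, y in zip(a, b))

def additive_score(a, b, W_a, W_b, v):
    """Feed-forward (additive) attention score with tanh activation.

    W_a and W_b are weight matrices given as lists of rows; v is the
    output weight vector. All three are learned parameters in practice."""
    hidden = [math.tanh(dot_score(row_a, a) + dot_score(row_b, b))
              for row_a, row_b in zip(W_a, W_b)]
    return dot_score(v, hidden)

identity = [[1.0, 0.0], [0.0, 1.0]]
d = dot_score([1.0, 2.0], [3.0, 4.0])
f = additive_score([1.0, 0.0], [0.0, 0.0], identity, identity, [1.0, 0.0])
```

The dot-product score has no parameters of its own, whereas the additive score introduces extra weights that must be learned jointly with the reader.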
SBA yielded performance gain on all settings when the attention function was dot product. On the other hand, the attention function using a feed-forward neural network degraded accuracy compared to the baseline GA Reader, which was surprising to us. Although SBA on Q^(K) and D^(K) individually yielded performance gain, combining them did not bring further improvement. Even a slight drop in test accuracy was observed. Applying SBA on both query and document may make the training more difficult. From the empirical results, it seems that the learning process was led solely by document self-attention. In future work, we will consider a stepwise approach where the previous best model of a simpler network architecture will be used for initialization to avoid joint training from scratch. Self-attention over a long document may be difficult. Constraints such as locality may be imposed to restrict the number of word candidates in self-attention. We conjecture that modeling co-reference between entities and pronouns may be helpful compared to full-blown self-attention over all word tokens in a document.

Table 3: Accuracy of SBA settings.
Model                      Val    Test
GA Reader                  77.9   77.9
SBA on Q^(K) (tanh)        77.1   77.1
SBA on Q^(K)               78.5   78.9
SBA on D^(K)               78.1   78.3
SBA on D^(K) & Q^(K)       78.1   78.2

Sample query and document passage with the prediction of each model:
Query: in a video , @placeholder says he is sick of @entity3 being discriminated against in @entity5 (Correct Answer: @entity18)
Document: @entity4 , the leader of the @entity5 @entity9 ( @entity9 ) , complains that @entity5 's membership of the @entity11 means it is powerless to stop a flow of foreign immigrants , many from impoverished @entity15 , into his " small island " nation . in a video posted on @entity20 , prince @entity18 said he was fed up with discrimination against @entity3 living in @entity5 .
GA Reader prediction: @entity4
Collaborative Gating prediction: @entity18
Self Belief Aggregation prediction: @entity18

Figure 4 shows self-attention on two sample queries using a trained model. Surprisingly, the attention weight at the cloze position is almost equal to unity. As a result, the weighted sum of encodings reduces to the encoding at the cloze position, which is the assumption of GA Reader. This may imply that SBA somehow contributes to better GA Reader training. Since the attention weight at the cloze position is almost unity, SBA can be removed at test time. On the other hand, SBA did not work well on smaller datasets such as WDW.

Figure 4: Self beliefs on each query position with respect to @placeholder. Sample queries: "in a video , @placeholder says he is sick of @entity3 being discriminated against in @entity5" and "@placeholder @entity0 built a vast business empire".

Overall Results
We compare our approaches with previously published models in Table 1. CG and SBA use the best settings reported in the previous sections, and CG+SBA denotes the combination of those best settings. Overall, our approaches achieved the best validation and test accuracies on all datasets. On CNN and Daily Mail, CG and SBA performed similarly, but their combination did not always yield additional gain. CG exploits information from both query and document, while SBA only uses the query. Although these two approaches are quite different, the empirical results suggest that CG and SBA may not be strongly complementary when combined.

Significance Testing
We conducted McNemar's test on the best results we achieved using the sclite toolkit. The test showed that the gains we achieved were all significant at the 95% confidence level. To complete the test, we re-ran the baseline GA Reader. Our re-run yielded almost the same accuracies as those reported in the original GA Reader paper.
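For readers unfamiliar with McNemar's test, the following minimal sketch computes the exact two-sided p-value from the counts of test examples on which the two systems disagree. It is an illustration only: the function name `mcnemar_exact` and the counts are hypothetical, and the sclite implementation may differ in detail.

```python
from math import comb

def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value.

    b: number of examples answered correctly by system A only
    c: number of examples answered correctly by system B only
    Examples on which both systems agree do not enter the test."""
    n = b + c
    if n == 0:
        return 1.0  # the systems never disagree
    k = min(b, c)
    # Under the null hypothesis, disagreements split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical counts: 10 disagreements, all in favour of one system.
p_value = mcnemar_exact(10, 0)
significant = p_value < 0.05  # 95% confidence level
```

Because only the disagreements matter, two systems with close overall accuracies can still differ significantly if one system wins most of the examples on which they differ.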

Conclusion
We presented Collaborative Gating and Self-Belief Aggregation to improve Gated-Attention Reader. Collaborative Gating employs document-to-query and query-to-document attention in a collaborative and multi-hop manner. With the gating mechanism, both document and query are filtered to achieve a more fine-grained feature representation for machine reading. Self-Belief Aggregation attempts to propagate encodings of other query words into the cloze position using self-attention to relax the assumption of GA Reader. We evaluated our approaches on standard datasets and achieved state-of-the-art results compared to previously published results. Collaborative Gating performed well on all datasets, while SBA seems to work better on large datasets. The combination of Collaborative Gating and Self-Belief Aggregation did not bring significant additive improvements, which may imply that they are not complementary. We hoped that the self-attention mechanism might capture the effect of co-reference among words. So far, the experimental gains have been smaller than we had hoped. Perhaps more constraints on self-attention should be imposed to learn a better model in future work. Another future investigation would be to apply SBA at each layer of GA Reader and to explore better interaction with Collaborative Gating.