Discourse-Aware Semantic Self-Attention for Narrative Reading Comprehension

In this work, we propose to use linguistic annotations as a basis for a Discourse-Aware Semantic Self-Attention encoder that we employ for reading comprehension on narrative texts. We extract relations between discourse units, events, and their arguments as well as coreferring mentions, using available annotation tools. Our empirical evaluation shows that the investigated structures improve the overall performance (up to +3.4 Rouge-L), especially intra-sentential and cross-sentential discourse relations, sentence-internal semantic role relations, and long-distance coreference relations. We show that dedicating self-attention heads to intra-sentential relations and relations connecting neighboring sentences is beneficial for finding answers to questions in longer contexts. Our findings encourage the use of discourse-semantic annotations to enhance the generalization capacity of self-attention models for reading comprehension.


Introduction
Transformer-based self-attention models (Vaswani et al., 2017) have been shown to work well on many natural language tasks that require large-scale training data, such as Machine Translation (Vaswani et al., 2017; Dai et al., 2019), Language Modeling (Radford et al., 2018a; Devlin et al., 2019; Dai et al., 2019; Radford et al., 2019) or Reading Comprehension (Yu et al., 2018), and can even be trained to perform surprisingly well in several multi-modal tasks (Kaiser et al., 2017b).
Recent work (Strubell et al., 2018) has shown that for downstream semantic tasks with much smaller datasets, such as Semantic Role Labeling (SRL) (Palmer et al., 2005), self-attention models greatly benefit from the use of linguistic information such as dependency parsing annotations. Motivated by this work, we examine to what extent we can use discourse and semantic information to extend self-attention-based neural models for a higher-level task such as Reading Comprehension. Reading Comprehension is a task that requires a model to answer natural language questions, given a text as context: a paragraph or even full documents. Many datasets have been proposed for the task, starting with a small multi-choice dataset (Richardson et al., 2013), followed by large-scale automatically created cloze-style datasets (Hermann et al., 2015; Hill et al., 2016) and big manually annotated datasets such as Onishi et al. (2016); Rajpurkar et al. (2016); Joshi et al. (2017); Kocisky et al. (2018). Previous research has shown that some datasets are not challenging enough, as simple heuristics work well with them (Chen et al., 2016; Weissenborn et al., 2017b; Chen et al., 2017). In this work we focus on the recent NarrativeQA (Kocisky et al., 2018) dataset, which was designed not to be easy to answer and which requires a model to read narrative stories and answer questions about them.
In terms of model architecture, previous work in reading comprehension and question answering has focused on integrating external knowledge (linguistic and/or knowledge-based) into recurrent neural network models using Graph Neural Networks (Song et al., 2018), Graph Convolutional Networks (Sun et al., 2018; De Cao et al., 2019), attention (Das et al., 2017; Mihaylov and Frank, 2018; Bauer et al., 2018) or pointers to coreferent mentions (Dhingra et al., 2017).
In contrast, in this work we examine the impact of discourse-semantic annotations (Figure 1) in a self-attention architecture. We build on the QANet (Yu et al., 2018) model by modifying the encoder of its self-attention modeling layer. In particular, we specialize self-attention heads to focus on specific discourse-semantic annotations, e.g., an ARG1 relation in SRL, a CAUSATION relation holding between clauses in shallow discourse parsing, or coreference relations holding between entity mentions.
Our contributions are the following:
• To our knowledge, we are the first to explicitly introduce discourse information into a neural model for reading comprehension.
• We design a Discourse-Aware Semantic Self-Attention mechanism, an extension to the standard self-attention models, without a significant increase in computational complexity.
• We analyze the impact of different discourse and semantic annotations for narrative reading comprehension and report improvements of up to 3.4 Rouge-L over the base model.
• We perform an empirical fine-grained evaluation of the discourse-semantic annotations on specific question types and context sizes.

Discourse-aware Semantic Annotations
Understanding narrative stories requires the ability to identify events and their participants and to identify how these events are related in discourse (e.g., by causation, contrast, or temporal sequence) (Mani, 2012). Our aim is to extract structured knowledge about these phenomena from long texts and to integrate this information in a neural self-attention model, in order to examine to what extent such knowledge can enhance the efficiency of a strong reading comprehension model applied to NarrativeQA.
Specifically, we enhance self-attention with knowledge about entity coreference (Coref), their participation in events (SRL) and the relation between events in narrative discourse (Shallow Discourse Parsing (Xue et al., 2016), DR).
All these linguistic information types are relational in nature. For integrating relational knowledge into the self-attention mechanism, we follow a two-step approach: i) we extract such relations from a multi-sentence paragraph and project them down to the token level, specifically to the tokens of the text fragments that they involve; ii) we design a neural self-attention model that uses the interaction information between these tokens in a multi-head self-attention module.
To be able to map the extracted linguistic knowledge to paragraph tokens, we need annotations that are easy to map to the token level (see Figure 2). This can be achieved with tools for annotation of span-based Semantic Role Labeling, Coreference Resolution, and Shallow Discourse Parsing.
Events and Their Participants Relations between characters in a story are expressed in text through their participation in states or actions in which they fill a particular event argument with a specific semantic role (see Figure 2). For annotation of events and their participants we use the state-of-the-art SRL system of He et al. (2017) implemented in AllenNLP. The system splits paragraphs into sentences and tokens, performs part-of-speech (POS) tagging, and for each verb token V it predicts semantic tags such as ARG0, ARG1 (Argument Role 0, 1 of verb V), etc. When several argument-taking predicates are realized in a sentence, we obtain more than a single semantic argument structure, and each token in the sentence can be involved in the argument structure of more than one verb. We refer to these annotations as different semantic views (Khashabi et al., 2018a), e.g., 'semantic view for verb 1'. Different self-attention heads will be able to attend to individual semantic views.
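The notion of one semantic view per verb can be illustrated with a minimal sketch. It assumes, hypothetically, that the SRL tool returns one BIO tag sequence per predicate; the function name `semantic_views`, the `max_views` cap, and the example tags are illustrative and not taken from the paper.

```python
def semantic_views(verb_tag_sequences, max_views=3):
    """Turn per-verb BIO tag sequences into flat token labels, one view per verb.

    verb_tag_sequences: list of tag lists such as ["B-ARG0", "B-V", "B-ARG1", "O"]
    (an assumed input format). max_views is a hypothetical cap on how many
    verb views are kept, e.g. one per dedicated attention head.
    """
    views = []
    for tags in verb_tag_sequences[:max_views]:
        # Drop the B-/I- prefix so every token of an argument carries the same label.
        views.append([tag.split("-", 1)[1] if "-" in tag else tag for tag in tags])
    return views


# Tokens: Peter amuses them and plays pipes
tags_for_amuses = ["B-ARG0", "B-V", "B-ARG1", "O", "O", "O"]
tags_for_plays = ["B-ARG0", "O", "O", "O", "B-V", "B-ARG1"]
views = semantic_views([tags_for_amuses, tags_for_plays])
```

Each inner list is one semantic view, so the token "Peter" carries an ARG0 label in both views while "them" is an argument only in the first.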
Coreference Resolution Narrative texts abound in entity mentions that refer to the same entity in the discourse. We hypothesize that by directing the self-attention to this specific coreference information, we can encourage the model to focus on tokens that refer to the same entity mention. Although token-based self-attention models are able to attend over wide-ranged context spans, we hypothesize that it will be beneficial to allow the model to focus directly on the parts of the text that refer to the same entity. For coreference annotation we use the medium size model from the neuralcoref spaCy extension available at https://github.com/huggingface/neuralcoref. For each token we give as information the label of the corresponding coreference cluster (see Figure 2) that it belongs to. Therefore, tokens from the same coreference cluster get the same label as input.
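The projection of coreference clusters to token labels can be sketched as follows. The cluster format (lists of inclusive token spans), the label strings, and the function name are assumptions made for illustration; the sketch does not call the neuralcoref API.

```python
def coref_token_labels(num_tokens, clusters, no_label="O"):
    """Give every token the label of the coreference cluster it belongs to.

    clusters: list of clusters, each a list of (start, end) token spans with
    inclusive ends (a hypothetical format). Tokens outside any mention keep
    no_label. If mentions overlap, the later cluster overwrites the earlier one.
    """
    labels = [no_label] * num_tokens
    for cluster_id, mentions in enumerate(clusters):
        for start, end in mentions:
            for i in range(start, end + 1):
                labels[i] = f"COREF_{cluster_id}"
    return labels


tokens = "Peter met the fairies . He amused them .".split()
clusters = [
    [(0, 0), (5, 5)],  # Peter, He
    [(2, 3), (7, 7)],  # the fairies, them
]
labels = coref_token_labels(len(tokens), clusters)
```

Tokens of the same cluster end up with identical labels, which is the property the text relies on.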
Discourse Relations In narrative texts, events are connected by discourse relations such as causation, temporal succession, etc. (Mani, 2012). In this work we adopt the 15 fine-grained discourse relation sense types from the annotation scheme of the Penn Discourse Tree Bank (PDTB) (Prasad et al., 2008). For producing discourse relation annotations we use the discourse relation sense disambiguation system from Mihaylov and Frank (2016), which is trained on the data provided by the CoNLL Shared Task on Shallow Discourse Parsing (Xue et al., 2016). In this annotation scheme discourse relations are divided into two main types: Explicit and Non-Explicit. Explicit relations are usually connected with an explicit discourse connective, such as because, but, if. Non-Explicit relations are not explicitly marked with a discourse connective and the arguments are usually contained in two consecutive sentences (see Figure 2). To extract explicit discourse relations we take into account only arguments that are in the same sentence. We consider as separate arguments (ARG1 and ARG2) the text sequences that are on the left and right of an explicit discourse connective (CONN), e.g., '[Jeff went home] ARG1 CCR [because] CONN [he was hungry.] ARG2 CCR', where CCR is Contingency.Cause.Reason. To provide Non-Explicit discourse relation sense annotations, we annotate every consecutive pair of sentences with a predicted discourse relation sense type.
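The left/right argument split around an explicit connective, and the pairing of consecutive sentences for Non-Explicit relations, can be sketched as below. The label format and function names are hypothetical, and the relation sense is passed in as given; predicting it is the job of the sense disambiguation system and is not modelled here.

```python
def explicit_relation_labels(sentence_tokens, conn_start, conn_end, sense):
    """Label tokens left of the connective as Arg1 and right of it as Arg2.

    conn_start and conn_end are inclusive token indices of the connective
    inside the sentence. The label strings are illustrative only.
    """
    labels = []
    for i in range(len(sentence_tokens)):
        if i < conn_start:
            role = "Arg1"
        elif i <= conn_end:
            role = "Conn"
        else:
            role = "Arg2"
        labels.append(f"DR_Exp_{sense}_{role}")
    return labels


def consecutive_sentence_pairs(num_sentences):
    """Sentence index pairs that receive a Non-Explicit relation sense."""
    return [(i, i + 1) for i in range(num_sentences - 1)]


tokens = "Jeff went home because he was hungry .".split()
labels = explicit_relation_labels(tokens, 3, 3, "Contingency.Cause.Reason")
pairs = consecutive_sentence_pairs(4)
```

For the example sentence, the three tokens before "because" receive the Arg1 label and the four tokens after it receive the Arg2 label.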

A Discourse-Aware Semantic Self-Attention Neural Model

QANet
As a base reading comprehension model we use QANet (Yu et al., 2018). QANet is a standard token-based self-attention model with the following components, which are common across many recent models: 1. Input Embedding Layer uses pre-trained word embeddings and convolutional character embeddings; 2. Encoder Layer consists of stacked Encoder Blocks (see Figure 3, A) based on Multi-Head Self-Attention (Vaswani et al., 2017) and depth-wise separable convolution (Chollet, 2016; Kaiser et al., 2017a); 3. Context-to-Query Attention Layer is a standard layer that builds a token-wise attention-weighted question-aware context representation; 4. Modeling Layer has the same structure as 2. above but uses as input the output of layer 3.; 5. Output Layer is used for prediction of start and end answer pointers. For detailed information about these layers, please refer to Yu et al. (2018). In this work we replace the standard Multi-Head Self-Attention with Discourse-Aware Semantic Self-Attention, using several different semantic and discourse annotation types. We describe this below and explain the differences to the standard Encoder Block.

Discourse-Aware Semantic Self-Attention
In Figure 3 we show the difference between the Base Multi-Head Self-Attention Encoder Block (A) and the Discourse-Aware Semantic Self-Attention Encoder Block (B). Both consist of positional-encoding + convolutional-layer ×K + multi-head-self-attention + feed-forward layer.
The difference is that B is provided additional inputs that are used by multi-head self-attention.
The multi-head self-attention is a concatenation of outputs from multiple single self-attention heads $h_i$ followed by a linear layer. A single head of the extended multi-head self-attention is shown in Figure 3C. In its formal definition, the query, key and value projections are the components of the query-key-value attention, and $\sqrt{d_h}$ is used for weight scaling as originally proposed in Vaswani et al. (2017).
Here $i \in 1..H$, $r^{l-1}$ is the input from the previous encoder block, $s^t$ is an embedding vector for the linguistic annotation type $t$ ('SRL Arg1', 'DiscRel Cause.Reason Arg2', etc.), and $a^{h_i}$ is the output of head $h_i$. $M^t$ is a sentence-wise attention mask as shown in Figure 3D. $s^t$ and $M^t$ are the main difference compared to the standard self-attention (Figure 3C).
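One form of a single head that is consistent with the symbols named here can be written as follows. This is a hedged sketch rather than the paper's own equation: the concatenation of $r^{l-1}$ with $s^t$, the projection matrices $W_Q^{h_i}$, $W_K^{h_i}$, $W_V^{h_i}$, and the additive form of the mask $\tilde{M}^t$ are all assumptions.

```latex
\begin{align}
x &= [\, r^{l-1} \,;\, s^{t} \,] \\
Q^{h_i} &= x W_Q^{h_i}, \qquad K^{h_i} = x W_K^{h_i}, \qquad V^{h_i} = x W_V^{h_i} \\
a^{h_i} &= \mathrm{softmax}\!\left( \frac{Q^{h_i} (K^{h_i})^{\top}}{\sqrt{d_h}} + \tilde{M}^{t} \right) V^{h_i},
\qquad
\tilde{M}^{t}_{jk} =
\begin{cases}
0 & \text{if } M^{t}_{jk} = 1 \\
-\infty & \text{otherwise}
\end{cases}
\end{align}
```

Under these assumptions, setting $s^t$ to a constant and $M^t$ to all ones recovers the standard scaled dot-product head of Vaswani et al. (2017).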
In principle, representing edges of a graph (e.g., the V-ARG1 role from SRL) requires memory of $n^2 d_h H$, where $n$ is the length of the context, which would be a bottleneck for computation on a GPU with limited memory (8-16GB). Instead, we adopt a strategy where the relation is represented as a source and target node and an attention scope (one sentence for SRL; two sentences for DR (NonE); full context for Coref). The latter is controlled using the attention mask. The combination of flat token labels and mask reduces the maximum memory required for representing the information in the knowledge-enhanced head to $2 n d_h H$. The attention masks, which we use for reducing the attention scope of the different semantic and discourse annotations, are shown in Figure 3D. These masks ensure that the corresponding attention heads will only attend to tokens from the corresponding scope (SRL: single sentence; DR (NonE): two sentences, etc.). The attention masks are symmetric about the matrix diagonal. Therefore, they can easily be computed on-the-fly given only the sentence boundaries (corresponding to the horizontal lines in Figure 2). To reduce the model memory further and still benefit from the full-context self-attention, we use the Discourse-Aware Semantic Self-Attention encoder (Figure 3B) only for blocks [1,3,5] of the Modeling Layer, which consists of 7 stacked encoder blocks (indexed 0 to 6). Blocks [0,2,4,6] are set as the base encoders that look at the entire context (Figure 3A).
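Computing such a scope mask from sentence boundaries alone can be sketched as follows. The banded reading of the two-sentence scope (a token may attend to tokens at most `window - 1` sentences away) is an assumption about how the masks in Figure 3D are built, and the function names are illustrative.

```python
def sentence_ids(sentence_lengths):
    """Map every token position to the index of the sentence that contains it."""
    ids = []
    for sent_idx, length in enumerate(sentence_lengths):
        ids.extend([sent_idx] * length)
    return ids


def scope_mask(sentence_lengths, window=1):
    """Build an n x n 0/1 attention mask from sentence lengths only.

    window=1    -> same-sentence scope (e.g. SRL, explicit DR)
    window=2    -> a sentence and its direct neighbours (assumed NonE scope)
    window=None -> full context (e.g. Coref)
    """
    ids = sentence_ids(sentence_lengths)
    n = len(ids)
    if window is None:
        return [[1] * n for _ in range(n)]
    return [
        [1 if abs(ids[i] - ids[j]) < window else 0 for j in range(n)]
        for i in range(n)
    ]


# Three sentences with 2, 3 and 1 tokens.
srl_mask = scope_mask([2, 3, 1], window=1)
none_mask = scope_mask([2, 3, 1], window=2)
full_mask = scope_mask([2, 3, 1], window=None)
```

Because a mask entry depends only on the distance between two sentence indices, the result is symmetric about the diagonal and nothing beyond the sentence lengths has to be stored.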

Data and Task Description
NarrativeQA We perform experiments with the NarrativeQA (Kocisky et al., 2018) reading comprehension dataset. This dataset requires understanding of narrative stories (English) in order to provide answers for a given question. It offers two sub-tasks: (i) answering questions about a long narrative summary (up to 1150 tokens) of a book or movie, or (ii) answering questions about entire books or movie scripts of lengths up to 110k tokens. We focus on the summary setting (i) and refer to the summary as document or context. The dataset contains 1572 documents in total, divided into Train (1102 documents, 32.7k questions), Dev (115 documents, 3.5k questions) and Test (355 documents, 10.5k questions) sets.
Generative QA as Span Prediction An interesting aspect of the NarrativeQA dataset is that in contrast to most other RC datasets, the two answers provided for each question are written by human annotators. Therefore, answers typically differ in form from the context passages that license them. To map the human-generated answers to answer candidate spans from the context, we use Rouge-L (Lin, 2004) to calculate a similarity score between token n-grams from the provided answer and token n-grams from candidate answers selected from the context (we select candidate spans of the same length as the given answer). If two answer candidates have the same Rouge-L score, we calculate the score between the candidates' surrounding tokens (window size: 15 tokens to the left and right) and the question tokens, and choose the candidate with the higher score. We retrieve the best candidate answer span for each answer and use the candidate with the higher Rouge-L score as supervision for training. We refer to this method for answer retrieval as Oracle (Ours).
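The span retrieval just described can be sketched in a few lines. This is a simplified illustration under stated assumptions: Rouge-L is computed as the LCS-based F1 (the exact precision/recall weighting is not given in the text), candidates have the answer's length, and the tie-break on surrounding tokens and question tokens is omitted, so the first best-scoring span wins. All function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """LCS-based F1 between two token lists (assumed equal P/R weighting)."""
    if not candidate or not reference:
        return 0.0
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


def best_span(context, answer):
    """Return (score, start, end) of the best context span of the answer's length."""
    length = min(len(answer), len(context))
    best = (0.0, 0, 0)
    if length == 0:
        return best
    for start in range(len(context) - length + 1):
        score = rouge_l(context[start:start + length], answer)
        if score > best[0]:
            best = (score, start, start + length)
    return best


def oracle_span(context, answers):
    """Pick the higher-scoring span over all human answers."""
    return max((best_span(context, a) for a in answers), key=lambda t: t[0])


context = "the children are saved by Jacob Armitage , a local verderer".split()
answers = [["Jacob", "Armitage"], ["a", "verderer"]]
score, start, end = oracle_span(context, answers)
```

Here `context[start:end]` is the span that would serve as training supervision for the start and end pointers.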

Related Work
Reading Comprehension with Knowledge Recent work has proposed different approaches for integrating external knowledge into neural models for the high-level downstream tasks reading comprehension (RC) and question answering (QA). One line of work leverages external knowledge from knowledge bases for RC (Xu et al., 2016; Weissenborn et al., 2017a; Ostermann et al., 2018; Mihaylov and Frank, 2018; Bauer et al., 2018; Wang et al., 2018b) and QA (Das et al., 2017; Sun et al., 2018; Tandon et al., 2018). These approaches make use of implicit (Weissenborn et al., 2017a) or explicit (Mihaylov and Frank, 2018; Sun et al., 2018; Bauer et al., 2018) attention-based knowledge aggregation or leverage features from knowledge base relations (Wang et al., 2018b).
Another line of work integrates linguistic knowledge from downstream tasks, such as coreference resolution (Dhingra et al., 2017), notions of co-occurring candidate mentions (De Cao et al., 2019) and OpenIE triples (Khot et al., 2017), into RNN-based encoders. Recently, several pretrained language models (Radford et al., 2018b; Devlin et al., 2019) have been shown to incrementally boost the performance of well-performing models for several short paragraph reading comprehension tasks (Devlin et al., 2019) and question answering (Sun et al., 2019), as well as many tasks from the GLUE benchmark (Wang et al., 2018a). Approaches based on BERT (Devlin et al., 2019) usually perform best when the weights are fine-tuned for the specific training task. Earlier, many papers that do not use self-attention models or even neural methods have also tried to use semantic parse labels (Yih et al., 2016) or annotations from upstream tasks (Khashabi et al., 2018b).
Table 1 (fragment): (Hu et al., 2018b) 48.40 51.50 | RMR (Ens) (Hu et al., 2018b) 50.10 53.90 | RMR + A2D (Hu et al., 2018b) 50
Self-Attention Models in NLP Vanilla self-attention models (Vaswani et al., 2017) use positional encoding, sometimes combined with local convolutions (Yu et al., 2018), to model the token order in text. Although they are scalable due to their recurrence-free nature, most self-attention models do not work well when trained with fixed-length context, because they often learn global token positions observed during training rather than relative ones. To address this issue, Shaw et al. (2018) propose relative position encoding to model the distance between tokens in the context. Dai et al. (2019) address the problem of moving beyond fixed-length context by adding recurrence to the self-attention model. They argue that the fixed-length segments used for language modeling hurt performance because they do not respect sentence or any other semantic boundaries. In this work we also support the claim that the lack of semantic and discourse boundaries is an issue, and we therefore aim to introduce structured linguistic information into the self-attention model. We hypothesize that the lack of local discourse context is a problem for answering narrative questions where the answer is contained inside the same sentence or in neighbouring sentences. Therefore, by offering discourse-level semantic structure to the attention heads, we offer ways to restrict or focus the model to wider or narrower structures, depending on what is needed for specific questions. Self-attention architectures can be seen as graph architectures (imagine the token (node) interactions as an adjacency matrix) and are applied to graph problems (Veličković et al., 2018; Li et al., 2019). Accordingly, in very recent work Koncel-Kedziorski et al. (2019) have used a self-attention encoder as a graph encoder for text generation, in a dual encoder model.
A dual-encoder model similar to Koncel-Kedziorski et al. (2019) is suitable for a setting where the input is knowledge from a graph knowledge-base. For a text-based setting like ours, where word order is important and the tokens are part of semantic arguments, an approach that tries to encode linguistic information in the same architecture (Strubell et al., 2018) is more appropriate. Therefore our method is most related to LISA (Strubell et al., 2018), which uses joint multi-task learning of POS and Dependency Parsing to inject syntactic information for Semantic Role Labeling. In contrast, we do not use multi-task learning, but directly encode semantic information extracted by pre-processing with existing tools.
NarrativeQA The summary setting of the NarrativeQA dataset (Kocisky et al., 2018) has in the past been addressed with attention mechanisms by the following models: BiAtt + MRU (Tay et al., 2018a) is similar to BiDAF (Seo et al., 2017). It is bi-attentive (attends from context to query and vice versa) but enhanced with MRU (Multi-Range Reasoning Units). MRU is a compositional encoder that splits the context tokens into ranges (n-grams) of different sizes and combines them in summed n-gram representations and fully-connected layers. DecaProp (Tay et al., 2018b) is a neural architecture for reading comprehension that densely connects all pairwise layers, modeling relationships between passage and query across all hierarchical levels. Bauer et al. (2018) observed that some of the questions require external commonsense knowledge and developed MHPGM-NOIC, a seq2seq generative model with a copy mechanism that also uses commonsense knowledge and ELMo contextual representations. Hu et al. (2018b) used an implementation of Reinforced Mnemonic Reader (RMR) (Hu et al., 2018a). They also proposed RMR + A2D, a novel teacher-student attention distillation method to train a model to mirror the behavior of the ensemble model RMR (Ens).

Experiments and Results
In this section we describe the experiments and results of our proposed model in different configurations. We compare the different models using overall results (Table 1) on the dataset, but also the performance for different question types (Figure 4) and context sizes (Figure 5).

Overall Results
Table 1 compares our baselines and proposed model to prior work. We report Bleu-1 and Rouge-L scores. The first section lists results on the NarrativeQA dataset as reported in Kocisky et al. (2018). Oracle (original) uses the gold answers as queries to match a token sequence (with the answer length) in the context that has the highest Rouge-L. In contrast, using Oracle (Ours), described in Section 4, we report a +11 Rouge-L score improvement (Table 1: This work). The Oracle performance in this setting is important since the produced annotations are used for training the span-prediction systems, and it is considered an upper bound. (Previous work that uses span-prediction models does not report the Oracle model used for training supervision.) Seq2Seq (no context) is an encoder-decoder RNN model trained only on the question. ASR is a version of the Attention Sum Reader (Kadlec et al., 2016) implemented as a pointer-generator that reads the question and points to words in the context that are contained in the answer. BiDAF is Bi-Directional Attention Flow (Seo et al., 2017) trained either with the Oracle (original) or Oracle (Ours). The models from Previous Work are described in Section 5. In the last section of Table 1 we present the results of our experiments (This work). Here, BiDAF and QANet are implementations available in the AllenNLP framework. In the last two rows we give the results of QANet extended with the proposed Discourse-Aware Semantic Self-Attention, using intra-sentential, Explicit discourse relations (DR (Exp); EMA is Exponential Moving Average).

Fine-grained Evaluation
We further analyze the performance of different configurations of our model by conducting a fine-grained evaluation in view of question types (Figure 4) and context length (Figure 5).
We define a range of system configurations using attention heads enhanced with different combinations of linguistic annotation types, including Explicit (referred to as Exp or E) and Non-Explicit (NonE, NE) Discourse Relations (DiscRel, DR), Semantic Role Labeling (SRL), and Coreference (Coref), and configurations without any such additional information (No). We also experiment with a setting where instead of using specific discourse relation types (such as DiscRel Exp Cause Arg1), we only identify that a token is part of any (NoSense) discourse relation (e.g., DiscRel Exp Arg1) or simply a multi-sentence attention span Sent span 3 with labels Sent1, Sent2, Sent3 for each sentence. This is to examine whether the type of discourse relation is important or rather the attention scope (intra-sentential; cross-sentence with 2 or 3 neighbouring sentences; full context).
Question Type Different question types might profit from different linguistic annotation types. We thus examine the performance on different question types, and analyze how it correlates with the presence of specific Semantic Self-Attention signals. We classify the questions into question types using a simple heuristic based on the question words as an indicator of their type (How / Where / Why / Who / What ...), and calculate the average Rouge-L for each question type. The resulting scores are displayed in Figure 4. In the first two columns of the figure, we report the Oracle score and the baseline (QANet) score. In the remaining columns we report the improvement over the QANet baseline (i) of BiDAF and (ii) of our models with different combinations of discourse-aware semantic self-attention. In the first row we report the score for each of the models on all questions. We observe that the best performing models on all questions are the ones that include Explicit DR and/or SRL. In terms of hardness, how and why questions usually have the lowest score. This is not surprising since Oracle performance is also low. For these types of questions, the RNN-based encoder (BiDAF) and self-attention with DR (Exp) or DR (NonE) perform best. Almost all models with additional linguistic information improve over the baseline on when questions, led by SRL+DR (Exp) and SRL + DR (All) + Coref. What questions are improved most by DR (Exp) and SRL, alone or combined. Who questions gain most from discourse relations and all models that contain SRL.
Context Length In Figure 5 we present the performance on documents of different lengths, in number of tokens. All presented models are trained on the examples from the Train set with context up to 800 tokens. Again, the models DR (Exp) and SRL+DR (Exp) show clear improvement across all context lengths. It is clear that all models show improvement for context lengths of 800-1000 tokens. This supports our hypothesis that discourse information is required for generalizing to longer contexts. One reason is that some of the questions can be answered with a local context (one or two sentences), which is better represented given a short discourse scope (one to three sentences), or with long dependencies, which are better represented given coreference.
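The question-word heuristic used for the per-type analysis can be sketched as follows. The exact rule is not specified in the text, so this sketch assumes that the first question word found in the question decides the type; the word list, the "other" fallback, and the example scores are illustrative.

```python
from collections import defaultdict

# Question words named in the text; "other" is a hypothetical fallback type.
QUESTION_WORDS = ("how", "where", "why", "who", "what", "when")


def question_type(question):
    """Return the first question word that occurs in the question, else 'other'."""
    for token in question.lower().split():
        word = token.strip("?,.")
        if word in QUESTION_WORDS:
            return word
    return "other"


def average_by_type(examples):
    """examples: iterable of (question, rouge_l_score) pairs."""
    totals = defaultdict(lambda: [0.0, 0])
    for question, score in examples:
        entry = totals[question_type(question)]
        entry[0] += score
        entry[1] += 1
    return {qtype: total / count for qtype, (total, count) in totals.items()}


# Scores below are made up for illustration only.
examples = [
    ("After scaring the fairies, how does Peter win them over ?", 0.25),
    ("Why does Jeff committ suicide ?", 0.5),
    ("Why is the house burned ?", 1.0),
    ("Who rescues the children from fire at Arnwood ?", 1.0),
]
averages = average_by_type(examples)
```

Scanning for the first question word rather than only the first token lets questions with a fronted clause, such as the first example, still be typed as how questions.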
In the evaluation of multiple model configurations we notice that in some cases a single discourse/semantic type (e.g., DR (Exp)) performs better than in combination with others (e.g., SRL+DR (Exp)). We hypothesize that the reason is that the linguistic annotations work well in combination with free No attention heads (see Table 2). Currently, we place multiple annotations on the same Encoder Block, which reduces the number of free attention heads. For instance, for SRL+DR (Exp), each knowledge-enhanced encoder block has 3 SRL + 2 DR (Exp) + 3 No heads. In future work we plan to use different annotation heads per Encoder Block (EB): e.g., EB0 has 3 SRL + 5 No; EB1 has 2 DR (Exp) + 6 No; etc.
Success and Failure Examples In Figures 6, 7, 8 we show examples of context and questions, together with the answers from human annotators and some of the examined models.

Context: Although he terrifies the fairies when he first arrives , Peter quickly gains favour with them . He amuses them with his human ways and agrees to play the panpipes at the fairy dances . Eventually , Queen Mab grants him the wish of his heart , and he decides to return home to his mother .
Question: After scaring the fairies, how does Peter win them over ?
Human 1: he agrees to play the panpipes at all of the fairy dances.
Human 2: He amuses them with his human ways and plays the pipes at their dances.
Oracle: human ways and agrees to play the panpipes at the fairy dances
QANet: gains favour
DR (Exp), DR (NE): quickly gains favour with them
Coref, SRL, SRL+DR(Exp): He amuses them with his human ways and agrees to play the panpipes
SRL+DR(NE): He amuses them with his human ways and agrees to play the panpipes at the fairy dances
Rationale: To find the correct answer we need to know that (i) 'gains favor' is a synonym of 'win' in this context (commonsense); (ii) the following (2nd) sentence is the reason for the previous (1st) (DR; the model fails in this case); (iii) 'them' are 'the fairies', 'he' is Peter (Coref).
Figure 6: Example of positive impact of SRL and Coref and negative impact from discourse relations (DR).

Context: Jacob frequently visits Jeff and Kenny , who are serving time in a juvenile hall . Jacob initially threatens them , until eventually Jeff commits suicide . Jacob befriends Kenny , soon learning he has an early release and is illegally moving to New Mexico .
Question: Why does Jeff committ suicide ?
Human 1: Jacob threatened them
Human 2: He is threatened by Jacob.
Oracle: site which he says is
QANet: Jeff and Kenny , who are serving time in a juvenile hall
DR (Exp), DR (NE), SRL, SRL+DR(Exp), SRL+DR(NE): Jacob initially threatens them ,
Coref: Jacob initially threatens them , until eventually Jeff commits suicide . Jacob befriends Kenny , soon learning he has an early release and is illegally moving to New Mexico
Rationale: To find the correct answer we need to understand that 'until eventually' suggests that the suicide of Jeff is caused by Jacob threatening 'them' (DR) and that Jeff is part of 'them' (Coref).

Context: The four orphan children of the house , Edward , Humphrey , Alice and Edith , are believed to have died in the flames . However , they are saved by Jacob Armitage , a local verderer , who hides them in his isolated cottage and disguises them as his grandchildren . Under Armitage 's guidance , the children from an aristocratic lifestyle to that of simple foresters .
Question: Who rescues the children from fire at Arnwood ?
Human 1, Human 2: Jacob Armitage
Oracle: Jacob Armitage
DR (Exp), DR (NE), Coref: Jacob Armitage
QANet, SRL, SRL+DR(Exp): Pablo
SRL+DR(NE): Patience
Rationale: To find the correct answer we need to understand at least that 'they' are 'the children' (Coref) and 'who did what to whom' in the context (SRL).

Conclusion and Future Work
In this work we use linguistic annotations as a basis for a Discourse-Aware Semantic Self-Attention encoder that we employ for reading comprehension on narrative texts.
The provided annotations of discourse relations, events and their arguments, as well as coreferring mentions, are obtained using available annotation tools. Our empirical evaluation shows that discourse-semantic annotations combined with self-attention yield a significant (+3.43 Rouge-L) improvement over QANet's token-based self-attention when applied to NarrativeQA reading comprehension. We analyzed the impact of different semantic annotation types on specific question types and context regions. We find, for instance, that SRL greatly improves who and when questions, and that discourse relations also improve the performance on why and where questions. While all examined annotation types contribute, particularly strong and constant gains are seen with intra-sentential DR (all context ranges), followed by SRL (short to mid-sized contexts). Coreference shows positive, but weaker impact, mostly in mid-sized contexts. A promising future direction would be to include additional external knowledge such as commonsense and world knowledge, and to learn all annotations jointly with the downstream task.
6 The examples are selected from NarrativeQA Test in such a way that they depict the strengths and weaknesses of the different models, corresponding to the empirical evaluation in Figure 4, and that they fit in the space limit.