No Need to Pay Attention: Simple Recurrent Neural Networks Work!

First-order factoid question answering assumes that the question can be answered by a single fact in a knowledge base (KB). While this does not seem like a challenging task, many recent attempts that apply either complex linguistic reasoning or deep neural networks achieve 65%–76% accuracy on benchmark sets. Our approach formulates the task as two machine learning problems: detecting the entities in the question, and classifying the question as one of the relation types in the KB. We train a recurrent neural network to solve each problem. On the SimpleQuestions dataset, our approach yields substantial improvements over previously published results — even neural networks based on much more complex architectures. The simplicity of our approach also has practical advantages, such as efficiency and modularity, that are valuable especially in an industry setting. In fact, we present a preliminary analysis of the performance of our model on real queries from Comcast’s X1 entertainment platform with millions of users every day.


Introduction
First-order factoid question answering (QA) assumes that the question can be answered by a single fact in a knowledge base (KB). For example, "How old is Tom Hanks" is about the [age] of [Tom Hanks]. Such questions are also referred to as simple questions by Bordes et al. (2015). Recent attempts that apply either complex linguistic reasoning or attention-based complex neural network architectures achieve up to 76% accuracy on benchmark sets (Golub and He, 2016; Yin et al., 2016). While it is tempting to study QA systems that can handle more complicated questions, it is hard to reach reasonably high precision for unrestricted questions. For more than a decade, successful industry applications of QA have focused on first-order questions. This raises the question: are users even interested in asking questions beyond first-order (or are these use cases more suitable for interactive dialogue)? Based on voice logs from a major entertainment platform with millions of users every day, Comcast X1, we find that most existing use cases of QA fall into the first-order category.
Our strategy is to tailor our approach to first-order QA by making strong assumptions about the problem structure. In particular, we assume that the answer to a first-order question is a single property of a single entity in the KB, and decompose the task into two subproblems: (a) detecting entities in the question and (b) classifying the question as one of the relation types in the KB. We simply train a vanilla recurrent neural network (RNN) to solve each subproblem (Elman, 1990). Despite its simplicity, our approach (RNN-QA) achieves the highest reported accuracy on the SimpleQuestions dataset. While recent literature has focused on building more complex neural network architectures with attention mechanisms, attempting to generalize to broader QA, we enforce stricter assumptions on the problem structure, thereby reducing complexity. This also means that our solution is efficient, another critical requirement for real-time QA applications. In fact, we present a performance analysis of RNN-QA on Comcast's X1 entertainment system, used by millions of customers every day.

Related work
If knowledge is presented in a structured form (e.g., a knowledge base (KB)), the standard approach to QA is to transform the question and knowledge into a compatible form, and perform reasoning to determine which fact in the KB answers a given question. Examples of this approach include pattern-based question analyzers (Buscaldi et al., 2010), combination of syntactic parsing and semantic role labeling (Bilotti et al., 2007, 2010), as well as lambda calculus (Berant et al., 2013) and combinatory categorial grammars (CCG) (Reddy et al., 2014). A downside of these approaches is the reliance on linguistic resources/heuristics, making them language- and/or domain-specific. Even though Reddy et al. (2014) claim that their approach requires less supervision than prior work, it still relies on many English-specific heuristics and hand-crafted features. Also, their most accurate model uses a corpus of paraphrases to generalize to linguistic diversity. Linguistic parsers can also be too slow for real-time applications.
In contrast, an RNN can detect entities in the question with high accuracy and low latency. The only required resources are word embeddings and a set of questions with entity words tagged. The former can be easily trained for any language/domain in an unsupervised fashion, given a large text corpus without annotations (Mikolov et al., 2013; Pennington et al., 2014). The latter requires a relatively simple annotation task, for which data exists in many languages and domains, and it can also be synthetically generated. Many researchers have explored similar techniques for general NLP tasks (Collobert et al., 2011), such as named entity recognition (Lu et al., 2015; Hammerton, 2003), sequence labeling (Graves, 2008; Chung et al., 2014), part-of-speech tagging (Huang et al., 2015; Wang et al., 2015), and chunking (Huang et al., 2015).
Deep learning techniques have been studied extensively for constructing parallel neural networks that model a joint probability distribution for question-answer pairs (Hsu et al., 2016; Yang et al., 2014; Mueller and Thyagarajan, 2016) and for re-ranking answers output by a retrieval engine (Rao et al., 2016; Yang et al., 2016). These more complex approaches might be needed for general-purpose QA and sentence similarity, where one cannot make assumptions about the structure of the input or knowledge. However, as noted in Section 1, first-order factoid questions can be represented by an entity and a relation type, and the answer is usually stored in a structured knowledge base. Dong et al. (2015) similarly assume that the answer to a question is at most two hops away from the target entity. However, they do not propose how to obtain the target entity, since it is provided as part of their dataset. Bordes et al. (2014) take advantage of the KB structure by projecting entities, relations, and subgraphs into the same latent space. In addition to the target entity, the other key piece of information for first-order QA is the relation type corresponding to the question. Many researchers have shown that classifying the question into one of the predefined types (e.g., based on patterns (Zhang and Lee, 2003) or support vector machines (Buscaldi et al., 2010)) improves QA accuracy.

Approach
(a) From Question to Structured Query. Our approach relies on a knowledge base containing a large set of facts, each one representing a binary [subject, relation, object] relationship. Since we assume first-order questions, the answer can be retrieved from a single fact. For instance, "How old is Sarah Michelle Gellar?" can be answered by the fact [Sarah Michelle Gellar, bornOn, 4/14/1977]. The main idea is to dissect a first-order factoid natural-language question by converting it into a structured query: {entity: "Sarah Michelle Gellar", relation: bornOn}. The process can be modularized into two machine learning problems, namely entity detection and relation prediction. In the former, the objective is to tag each question word as either entity or not. In the latter, the objective is to classify the question into one of the k relation types. We modeled both using an RNN.
We use a standard RNN architecture: Each word in the question passes through an embedding lookup layer E, projecting the one-hot vector into a d-dimensional vector x_t. A recurrent layer combines this input representation with the hidden layer representation from the previous word and applies a non-linear transformation to compute the hidden layer representation for the current word. The hidden representation of the final recurrent layer is projected to the output space of k dimensions and normalized into a probability distribution via softmax.
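The forward pass described above can be sketched in a few lines of plain Python. This is a minimal illustration only: it assumes a single-layer Elman-style cell with a tanh non-linearity, whereas the tuned models in this paper are two-layer bidirectional LSTM/GRU networks. The weight names (W, U, b, V, c), the toy dimensions, and the choice to read the last step for question-level classification are assumptions made for the example.

```python
import math
import random

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    top = max(z)
    exps = [math.exp(s - top) for s in z]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_forward(token_ids, E, W, U, b, V, c):
    """Return one k-dimensional probability distribution per input token."""
    h = [0.0] * len(b)                      # initial hidden state
    outputs = []
    for t in token_ids:
        x_t = E[t]                          # embedding lookup (one-hot times E)
        pre = [wx + uh + bi
               for wx, uh, bi in zip(matvec(W, x_t), matvec(U, h), b)]
        h = [math.tanh(p) for p in pre]     # assumed non-linearity
        logits = [vh + ci for vh, ci in zip(matvec(V, h), c)]
        outputs.append(softmax(logits))     # project to k dims and normalize
    return outputs

def rand_matrix(rows, cols, rng):
    """Small random weights, only for demonstration."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
vocab, d, hidden, k = 10, 4, 3, 2           # toy sizes, not the paper's
E = rand_matrix(vocab, d, rng)
W = rand_matrix(hidden, d, rng)
U = rand_matrix(hidden, hidden, rng)
b = [0.0] * hidden
V = rand_matrix(k, hidden, rng)
c = [0.0] * k

per_token = rnn_forward([1, 5, 2, 7, 7], E, W, U, b, V, c)
tag_distributions = per_token               # entity detection: one per word
question_distribution = per_token[-1]       # relation prediction (assumed: last step)
```

The same cell serves both tasks in this sketch; only the way the outputs are read differs (every step for tagging, a single step for classification).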
In relation prediction, the question is classified into one of the 1837 classes (i.e., relation types in Freebase). In the entity detection task, each word is classified as either entity or context (i.e., k = 2). Given a new question, we run the two RNN models to construct the structured query. Once every question word is classified as entity (denoted by E) or context (denoted by C), we can extract entity phrase(s) by grouping consecutive entity words. For example, for the question "How old is Michelle Gellar", the output of entity detection is [C C C E E], from which we can extract a single entity "Michelle Gellar". The output of relation prediction is bornOn. The inferred structured query q becomes the following: {entityText: "michelle gellar", relation: bornOn}
(b) Entity Linking. The textual reference to the entity (entityText in q) needs to be linked to an actual entity node in our KB. In order to achieve that, we build an inverted index I_entity that maps all n-grams of an entity (n ∈ {1, 2, 3}) to the entity's alias text (e.g., name or title), each with an associated TF-IDF score. We also map the exact text (n = ∞) to be able to prioritize exact matches.
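The step in (a) that turns per-word E/C tags into a structured query can be sketched as follows. The function name extract_entity_phrases is hypothetical, and the relation is hard-coded here in place of the output of the relation prediction model.

```python
def extract_entity_phrases(tokens, tags):
    """Group consecutive tokens tagged "E" into entity phrases."""
    phrases, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "E":
            current.append(token)
        elif current:
            # A context word closes the entity phrase collected so far.
            phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases

tokens = "how old is michelle gellar".split()
tags = ["C", "C", "C", "E", "E"]          # output of entity detection
phrases = extract_entity_phrases(tokens, tags)

# Structured query; "bornOn" stands in for the relation prediction output.
query = {"entityText": phrases[0], "relation": "bornOn"}
```

A question with two separate runs of E tags yields two phrases, which is why the text speaks of entity phrase(s).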
Following our running example, let us demonstrate how we construct I_entity. Let us assume there is a node e_i in our KB that refers to the actress "Sarah Michelle Gellar". The alias of this entity node is the name, which has three unigrams ("sarah", "michelle", "gellar"), two bigrams ("sarah michelle", "michelle gellar") and a single trigram (i.e., the entire name). Each one of these n-grams gets indexed in I_entity with TF-IDF weights. Here is how the weights would be computed for unigram "sarah" and bigram "michelle gellar" (⇒ denotes mapping):
I_entity("sarah") ⇒ {node: e_i, score: TF-IDF("sarah", "sarah michelle gellar")}
I_entity("michelle gellar") ⇒ {node: e_i, score: TF-IDF("michelle gellar", "sarah michelle gellar")}
This is performed for every n-gram (n ∈ {1, 2, 3, ∞}) of every entity node in the KB. Assuming there is an entity node, say e_j, for the actress "Sarah Jessica Parker", we would end up creating a second mapping from unigram "sarah":
I_entity("sarah") ⇒ {node: e_j, score: TF-IDF("sarah", "sarah jessica parker")}
In other words, "sarah" would be linked to both e_i and e_j, with corresponding TF-IDF weights.
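A minimal sketch of building such an index is given below. The exact TF-IDF formula is not specified in the text, so the weighting here is an assumption: term frequency is taken as the share of alias tokens covered by the n-gram, and inverse document frequency uses a simple smoothed logarithm. The key layout (n, n-gram), with math.inf marking the exact alias text, is likewise a choice made for this example.

```python
import math
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_entity_index(aliases, orders=(1, 2, 3)):
    """Map (n, n-gram) keys to lists of (node, score) pairs.

    aliases: dict from node id to alias text.
    n = math.inf marks the exact alias text.
    """
    keys_per_node = {}
    doc_freq = Counter()
    for node, alias in aliases.items():
        tokens = alias.lower().split()
        keys = Counter()
        for n in orders:
            keys.update((n, gram) for gram in ngrams(tokens, n))
        keys[(math.inf, " ".join(tokens))] += 1
        keys_per_node[node] = (keys, max(len(tokens), 1))
        doc_freq.update(keys.keys())     # each alias counts once per key

    index = defaultdict(list)
    for node, (keys, length) in keys_per_node.items():
        for (n, gram), count in keys.items():
            tf = count * len(gram.split()) / length             # assumed TF
            idf = math.log(1 + len(aliases) / doc_freq[(n, gram)])  # assumed IDF
            index[(n, gram)].append((node, tf * idf))
    return index

index = build_entity_index({
    "e_i": "Sarah Michelle Gellar",
    "e_j": "Sarah Jessica Parker",
})
```

With these two aliases, the unigram "sarah" maps to both nodes, while the bigram "michelle gellar" maps only to e_i, mirroring the running example.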
Once the index I_entity is built, we can link entityText from the structured query (e.g., "michelle gellar") to the intended entity in the KB (e.g., e_i). Starting with n = ∞, we iterate over n-grams of entityText and query I_entity, which returns all matching entities in the KB with associated TF-IDF relevance scores. For each n-gram, retrieved entities are appended to the candidate set C. We continue this process with decreasing value of n (i.e., n ∈ {∞, 3, 2, 1}). Early termination happens if C is non-empty and n is less than or equal to the number of tokens in entityText. The latter criterion is to avoid cases where we find an exact match but there are also partial matches that might be more relevant: For "jurassic park", for n = ∞, we get an exact match to the original movie "Jurassic Park". But we would also like to retrieve "Jurassic Park II" as a candidate entity, which is only possible if we keep processing until n = 2.
(c) Answer Selection. Once we have a list of candidate entities C, we use each candidate node e_cand as a starting point to reach candidate answers.
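The lookup loop in (b), including its early-termination rule, might look like the sketch below. The index contents and scores are made up for illustration, and keeping the maximum score when a node is retrieved more than once is an assumption, since the text does not say how repeated hits are combined.

```python
import math

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def link_entity(entity_text, index):
    """Return {node: score} candidates for entity_text."""
    tokens = entity_text.lower().split()
    candidates = {}
    for n in (math.inf, 3, 2, 1):
        grams = [" ".join(tokens)] if n == math.inf else ngrams(tokens, n)
        for gram in grams:
            for node, score in index.get((n, gram), []):
                # Assumed combination rule: keep the best score per node.
                candidates[node] = max(score, candidates.get(node, 0.0))
        # Stop once something was found and n no longer exceeds the
        # number of tokens in the query text.
        if candidates and n <= len(tokens):
            break
    return candidates

# Toy index with hypothetical node names and scores.
index = {
    (math.inf, "jurassic park"): [("jurassic_park", 1.0)],
    (2, "jurassic park"): [("jurassic_park", 0.9), ("jurassic_park_ii", 0.6)],
    (1, "jurassic"): [("jurassic_park", 0.4), ("jurassic_world", 0.4)],
    (1, "park"): [("jurassic_park", 0.3), ("central_park", 0.5)],
}

candidates = link_entity("jurassic park", index)
```

For the two-token query "jurassic park", the loop does not stop at the exact match (n = ∞) but continues to n = 2, which brings in the sequel, and then terminates before the noisier unigram matches are considered.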
A graph reachability index I_reach is built for mapping each entity node e to all nodes e′ that are reachable, with the associated path p(e, e′). For the purpose of the current approach, we limit our search to a single hop away, but this index can be easily expanded to support a wider search. We use I_reach to retrieve all nodes e′ that are reachable from e_cand, where the path from e_cand is consistent with the predicted relation r (i.e., r ∈ p(e_cand, e′)). These are added to the candidate answer set A. For example, in the example above, node e i 2 would have been added to the answer set A, since the path [bornOn] matches the predicted relation in the structured query. After repeating this process for each entity in C, the highest-scored node in A is our best answer to the question.
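Answer selection over a single-hop reachability index can be sketched as below. The data layout (a dict from entity node to a list of (path, target) pairs) and the node names are hypothetical, and each answer is assumed to inherit the linking score of the entity it was reached from.

```python
def select_answer(candidates, reach_index, relation):
    """Pick the best (score, entity, answer) triple, or None if A is empty.

    candidates:  {entity_node: linking score}
    reach_index: {entity_node: [(path, reachable_node), ...]},
                 where path is a tuple of relation names
    """
    answers = []                                   # candidate answer set A
    for entity, score in candidates.items():
        for path, target in reach_index.get(entity, []):
            if relation in path:                   # r is on the path p(e_cand, e')
                answers.append((score, entity, target))
    if not answers:
        return None
    return max(answers, key=lambda a: a[0])

# Toy single-hop index; "e_k" is a hypothetical lower-ranked candidate.
reach_index = {
    "e_i": [(("bornOn",), "4/14/1977"), (("profession",), "actress")],
    "e_k": [(("profession",), "singer")],
}
candidates = {"e_i": 0.8, "e_k": 0.3}

best = select_answer(candidates, reach_index, "bornOn")
```

Supporting a wider search would only require storing longer paths in reach_index; the membership test on the path stays the same.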

Experimental Setup
Data. Evaluation of RNN-QA was carried out on SimpleQuestions, which uses a subset of Freebase containing 17.8 million facts, 4M unique entities, and 7523 relation types. Indexes I_entity and I_reach are built based on this knowledge base.
SimpleQuestions was built by Bordes et al. (2015) to serve as a larger and more diverse factoid QA dataset. Freebase facts are sampled in a way that ensures a diverse set of questions, then given to human annotators, who write a question for each fact; each question is labeled with the corresponding entity and relation type. There are a total of 1837 unique relation types that appear in SimpleQuestions.
Training. We fixed the embedding layer based on the pre-trained 300-dimensional Google News embedding, since the data size is too small for training embeddings. Out-of-vocabulary words were assigned to a random vector (sampled from a uniform distribution). Parameters were learned via stochastic gradient descent, using categorical cross-entropy as the objective. In order to handle variable-length input, we limit the input to N tokens and prepend a special pad word if the input has fewer. We tried a variety of configurations for the RNN: four choices for the type of RNN layer (GRU or LSTM, bidirectional or not); depth from 1 to 3; and drop-out ratio from 0 to 0.5, yielding a total of 48 possible configurations. For each possible setting, we trained the model on the training portion and used the validation portion to avoid over-fitting. After running all 48 experiments, the best setting was selected by micro-averaged F-score of predicted entities (entity detection) or accuracy (relation prediction) on the validation set. We concluded that the optimal model is a 2-layer bidirectional LSTM (BiLSTM2) for entity detection and BiGRU2 for relation prediction. Drop-out was 10% in both cases.
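Two details of this setup are easy to make concrete: the left-padding of short questions and the size of the configuration grid. In the sketch below, the pad token name, the truncation side for over-long input, and the four drop-out values are assumptions; the text only states the range 0 to 0.5, the selected value of 0.1, and the total of 48 configurations.

```python
from itertools import product

PAD = "<pad>"    # hypothetical name for the special pad word

def pad_question(tokens, max_len):
    """Limit a question to max_len tokens, prepending pad words if shorter."""
    tokens = tokens[:max_len]            # assumption: keep the first max_len tokens
    return [PAD] * (max_len - len(tokens)) + tokens

layer_types = ["GRU", "LSTM", "BiGRU", "BiLSTM"]   # four layer choices
depths = [1, 2, 3]
dropouts = [0.0, 0.1, 0.3, 0.5]                    # assumed grid values

configs = list(product(layer_types, depths, dropouts))
padded = pad_question("how old is tom hanks".split(), 8)
```

Four layer types, three depths, and four drop-out values give 4 x 3 x 4 = 48 settings, matching the count reported above.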

Results
End-to-End QA. For evaluation, we apply the relation prediction and entity detection models on each test question, yielding a structured query q = {entityText: t_e, relation: r} (Section 3a). Entity linking gives us a list of candidate entity nodes (Section 3b). Given these candidates, we can limit our relation choices to the set of unique relation types that some candidate entity e_cand is associated with. This helps eliminate the artificial ambiguity due to overlapping relation types as well as the spurious ambiguity due to redundancies in a typical knowledge base. Even though there are 1837 relation types in Freebase, the number of relation types that we need to consider per question (on average) drops to 36. The highest-scored answer node is selected by finding the highest-scored entity node e that has an outward edge of type r (Section 3c). We follow Bordes et al. (2015) in comparing the predicted entity-relation pair to the ground truth. A question is counted as correct if and only if the entity we select (i.e., e) and the relation we predict (i.e., r) match the ground truth. Table 1 summarizes end-to-end experimental results. We use the best models based on validation set accuracy and compare them to three prior approaches: a specialized network architecture that explicitly memorizes facts (Bordes et al., 2015), a network that learns how to convolve sequences of characters in the question (Golub and He, 2016), and a complex network with attention mechanisms to learn the most important parts of the question (Yin et al., 2016). Our approach outperforms the state of the art in accuracy (i.e., precision at top 1) by 11.9 points (15.6% relative).
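The restriction of relation choices to those attached to some candidate entity can be sketched as follows. The function name, the probabilities, and the fallback to the unrestricted argmax when no relation is allowed are assumptions made for illustration.

```python
def predict_relation(relation_probs, candidate_entities, entity_relations):
    """Return the most probable relation among those of the candidates.

    relation_probs:   {relation: probability} from the relation model
    entity_relations: {entity_node: set of relation types on its edges}
    """
    allowed = set()
    for entity in candidate_entities:
        allowed |= entity_relations.get(entity, set())
    scored = {r: p for r, p in relation_probs.items() if r in allowed}
    if not scored:
        # Assumed fallback: use the unrestricted prediction.
        return max(relation_probs, key=relation_probs.get)
    return max(scored, key=scored.get)

# Toy model output: the top class is not available for the candidate.
relation_probs = {"releaseDate": 0.5, "bornOn": 0.4, "profession": 0.1}
entity_relations = {"e_i": {"bornOn", "profession"}}

restricted = predict_relation(relation_probs, ["e_i"], entity_relations)
unrestricted = max(relation_probs, key=relation_probs.get)
```

In this toy case the unrestricted model would answer releaseDate, which the person node e_i does not have, while the restricted choice falls back to bornOn.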
The last three rows quantify the impact of each component via an ablation study, in which we replace either entity detection (ED) or relation prediction (RP) models with a naive baseline: (i) we assign the relation that appears most frequently in training data (i.e., bornOn), and/or (ii) we tag the entire question as an entity (and then perform the n-gram entity linking). Results confirm that RP is absolutely critical, since both datasets include a diverse and well-balanced set of relation types. When we applied the naive ED baseline, our results drop significantly, but they are still comparable to prior results. Given that most prior work does not use the network to detect entities, we can deduce that our RNN-based entity detection is the reason our approach performs so well.
Error Analysis. In order to better understand the weaknesses of our approach, we performed a blame analysis: Among 2537 errors in the test set, 15% can be blamed on entity detection: the relation type was correctly predicted, but the detected entity did not match the ground truth. The reverse is true for 48% of cases. We manually labeled a sample of 50 instances from each blame scenario. When entity detection is to blame, 20% was due to spelling inconsistencies between question and KB, which can be resolved with better text normalization during indexing (e.g., "la kings" refers to "Los Angeles Kings"). We found 16% of the detected entities to be correct, even though they were not the same as the ground truth (e.g., either "New York" or "New York City" is correct in "what can do in new york?"); 18% are inherently ambiguous and need clarification (e.g., "where bin laden got killed?" might mean "Osama" or "Salem"). When blame is on relation prediction, we found that the predicted relation is reasonable (albeit different from the ground truth) 29% of the time (e.g., "what was nikola tesla known for" can be classified as profession or notable for).
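The blame assignment used in the error analysis can be expressed as a small function. The category labels and the example predictions below are made up; the "both" category for errors where neither component matches the ground truth is an assumption about how the remaining cases are counted.

```python
from collections import Counter

def blame(pred_entity, pred_relation, gold_entity, gold_relation):
    """Attribute a prediction to the component that caused the error."""
    entity_ok = pred_entity == gold_entity
    relation_ok = pred_relation == gold_relation
    if entity_ok and relation_ok:
        return "correct"
    if relation_ok:
        return "entity detection"       # relation right, entity wrong
    if entity_ok:
        return "relation prediction"    # entity right, relation wrong
    return "both"                       # assumed label for the remainder

# Hypothetical (pred_entity, pred_relation, gold_entity, gold_relation) tuples.
examples = [
    ("e_i", "bornOn", "e_i", "bornOn"),
    ("e_j", "bornOn", "e_i", "bornOn"),
    ("e_i", "profession", "e_i", "notableFor"),
    ("e_j", "profession", "e_i", "notableFor"),
]
counts = Counter(blame(*example) for example in examples)
```

Dividing the per-category counts by the number of errors gives percentages of the kind reported above.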

RNN-QA in Practice
In addition to matching the state of the art in effectiveness, we also claimed that our simple architecture provides an efficient and modular solution. We demonstrate this by applying our model (without any modifications) to the entertainment domain and deploying it to the Comcast X1 platform serving millions of customers every day. Training data was generated synthetically based on an internal entertainment KB. For evaluation, 295 unique question-answer pairs were randomly sampled from real usage logs of the platform.
We can draw two important conclusions from Table 2: First of all, we find that almost all of the user-generated natural-language questions (278/295 ∼ 95%) are first-order questions, supporting the significance of first-order QA as a task. Second, we show that even if we simply use an open-sourced deep learning toolkit (keras.io) for implementation and limit the computational resources to 2 CPU cores per thread, RNN-QA answers 75% of questions correctly with very reasonable latency.

Correct                     220
Incorrect entity             16
Incorrect relation           42
Not first-order question     17
Total Latency          76±16 ms

Table 2: Evaluation of RNN-QA on real questions from the X1 platform.

Conclusions and Future work
We described a simple yet effective approach for QA, focusing primarily on first-order factual questions. Although we understand the benefit of exploring task-agnostic approaches that aim to capture semantics in a more general way (e.g., Kumar et al., 2015), it is also important to acknowledge that there is no "one-size-fits-all" solution as of yet.
One of the main novelties of our work is to decompose the task into two subproblems, entity detection and relation prediction, and provide solutions for each in the form of an RNN. In both cases, we have found that bidirectional networks are beneficial, and that two layers are sufficiently deep to balance the model's ability to fit versus its ability to generalize. Drop-out was tuned to 10% on the validation set.
While an ablation study revealed the importance of both entity detection and relation prediction, we are hoping to further study the degree to which improvements in either component affect QA accuracy. While we have not implemented attention directly on our model, we can compare our results side by side on the same benchmark task against prior work with complex attention mechanisms (e.g., Yin et al., 2016). Given the proven strength of attention mechanisms, we were surprised to find our simple approach to be clearly superior on SimpleQuestions.
Even though deep learning has opened the potential for more generic solutions, we believe that taking advantage of problem structure yields a more accurate and efficient solution. While first-order QA might seem like a solved problem, there is clearly still room for improvement. By revealing that 95% of real use cases fit into this paradigm, we hope to convince the reader that this is a valuable problem that requires more attention.