EviNets: Neural Networks for Combining Evidence Signals for Factoid Question Answering

A critical task for question answering is the final answer selection stage, which has to combine multiple signals available about each answer candidate. This paper proposes EviNets: a novel neural network architecture for factoid question answering. EviNets scores candidate answer entities by combining the available supporting evidence, e.g., structured knowledge bases and unstructured text documents. EviNets represents each piece of evidence with a dense embedding vector, scores its relevance to the question, and aggregates the support for each candidate to predict its final score. Each of the components is generic and allows plugging in a variety of models for semantic similarity scoring and information aggregation. We demonstrate the effectiveness of EviNets in experiments on the existing TREC QA and WikiMovies benchmarks, and on the new Yahoo! Answers dataset introduced in this paper. EviNets can be extended to other information types and could facilitate future work on combining evidence signals for joint reasoning in question answering.


Introduction
Most recent work in Question Answering (QA) has focused on the problem of semantic matching between a question and candidate answer sentences (Rao et al., 2016; Yang et al., 2016). The datasets used in these works, such as the Answer Sentence Selection Dataset (Wang et al., 2007) and WikiQA (Yang et al., 2015), typically contain a relatively small set of sentences, and the task is to select those that state the answer to the question. However, for many questions, a single sentence does not provide sufficient information, and it may not be reliable in isolation. At the same time, the redundancy of information in large corpora, such as the Web, has been shown to be useful for improving information retrieval approaches to QA (Clarke et al., 2001).
This work focuses on factoid questions, which can be answered with an entity, i.e., an object in a Knowledge Base (KB) such as Freebase. Knowledge Base Question Answering (KBQA) techniques, such as Berant et al. (2013); Bast and Haussmann (2015), can be used to answer some user questions directly from a KB. However, KBs are inherently incomplete (Dong et al., 2014), and do not have sufficient information to answer many other questions (Fader et al., 2014).
Earlier feature-engineering approaches that combine different data sources to improve answer retrieval were shown to be quite effective for QA (Sun et al., 2015; Xu et al., 2016; Savenkov and Agichtein, 2016). Alternatively, Memory Networks (Sukhbaatar et al., 2015) and their extensions (Miller et al., 2016) use embeddings to represent relevant data as memories and summarize them into a single vector, thereby losing information about the provenance of answers.
In this paper, we introduce EviNets, a novel neural network architecture for factoid question answering, which provides a unified framework for aggregating the evidence that supports answer candidates. Given a question, EviNets retrieves a set of relevant pieces of information, e.g., sentences from a corpus or knowledge base triples, and extracts the mentioned entities as candidate answers. All the evidence signals are then embedded into the same vector space, scored, and aggregated using multiple strategies for each answer candidate. Experiments on the TREC QA, WikiMovies and new Yahoo! Answers datasets demonstrate the effectiveness of EviNets and its ability to handle both unstructured text and structured KB triples.
Figure 1: The EviNets neural network architecture for combining evidence in factoid question answering.

EviNets Question Answering Model
The high-level architecture of EviNets is illustrated in Figure 1. For a given question, we extract potentially relevant information, e.g., sentences from documents retrieved from text corpora using a search system. Next, we can use an entity linking system, such as TagMe (Ferragina and Scaiella, 2010), to identify entities mentioned in the extracted information, which become candidate answers. EviNets can further incorporate additional supporting evidence, e.g., textual descriptions of candidate answer entities, and potentially useful KB triples, such as types (Sun et al., 2015). Finally, the question, answer candidates and supporting evidence are given as input to the EviNets neural network.
Let us denote a question by $q$, and by $\{q_t \in \mathbb{R}^{|V|}\}$ the one-hot encodings of its tokens from a fixed vocabulary $V$. $a_i$ is a candidate answer from the set $A$, and we assume that each answer is represented as a single entity. For each question, we have a fixed set $E = E_{text} \cup E_{KB}$ of evidence statements $e^{(i)}$, $i = 1..M$, and their tokens $e_t$. In our experiments, we use the same embedding matrix for questions, evidence, and answers. KB entities are considered to be individual tokens, while predicates and type names are tokenized into constituent words.

Memory Matching Module
Evidence matching is responsible for estimating the relevance of each piece of evidence to the question, i.e., $w_e = \mathrm{softmax}(\mathrm{match}(q, e))$. The function $\mathrm{match}(q, e)$ can be implemented using any of the recently proposed semantic similarity estimation architectures. One of the simplest approaches is to average the token embeddings of the question and of each piece of evidence, and to score the similarity using the dot product: $q_{emb} = \frac{1}{L_q} \sum_t q_{emb,t}$ and $e_{emb} = \frac{1}{L_e} \sum_t e_{emb,t}$, with $\mathrm{match}(q, e) = q_{emb} \cdot e_{emb}$.
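The averaging-and-dot-product variant of the matching step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the two-dimensional toy embeddings, and the example tokens are all hypothetical, and the softmax is taken over the evidence pieces of a single question.

```python
import math

def average(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def evidence_weights(question_tokens, evidence_token_lists, embeddings):
    """w_e = softmax(match(q, e)), with match = dot product of averaged embeddings."""
    q_emb = average([embeddings[t] for t in question_tokens])
    scores = [
        dot(q_emb, average([embeddings[t] for t in tokens]))
        for tokens in evidence_token_lists
    ]
    return softmax(scores)

# Toy 2-dimensional embeddings (hypothetical values, for illustration only).
toy_embeddings = {
    "who": [0.1, 0.0],
    "directed": [0.9, 0.1],
    "alien": [0.2, 0.8],
    "ridley_scott": [0.8, 0.3],
    "released": [0.0, 0.5],
    "1979": [0.1, 0.4],
}
question = ["who", "directed", "alien"]
evidence = [
    ["alien", "directed", "ridley_scott"],
    ["alien", "released", "1979"],
]
weights = evidence_weights(question, evidence, toy_embeddings)
```

With these toy values the first evidence piece, which shares the word "directed" with the question, receives the larger weight. Any other similarity model could replace `dot(...)` without changing the surrounding code.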

Evidence Aggregation Module
After all the evidence signals have been scored, EviNets aggregates the support for each answer candidate. Table 1 summarizes the evidence signals used. With these features, EviNets captures different aspects, i.e., how well individual sentences match the question, how frequently the candidate is mentioned, and how well the set of answer evidences covers the information requested in the question.
Table 1 (partially recovered): weighted memory similarity to the answer (Sukhbaatar et al., 2015), whose formula involves a rotation matrix $R \in \mathbb{R}^{D \times D}$; weighted memory answer mentions similarity to the answer (Miller et al., 2016).
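As a rough illustration of per-candidate aggregation, the sketch below computes three simple signals from the evidence weights: how often a candidate is mentioned, the best weight among the evidence mentioning it, and the total weight of that evidence. These three features are assumptions chosen to mirror the aspects named in the text; they are not the exact feature set of Table 1, and all names and values are hypothetical.

```python
def aggregate_support(weights, evidence_entities, candidates):
    """Collect simple per-candidate signals from scored evidence.

    weights           -- relevance weight of each evidence piece
    evidence_entities -- set of entities mentioned in each evidence piece
    candidates        -- candidate answer entities to score
    """
    features = {}
    for candidate in candidates:
        mentioning = [
            w for w, entities in zip(weights, evidence_entities)
            if candidate in entities
        ]
        features[candidate] = {
            "mention_count": len(mentioning),           # how often it is mentioned
            "max_weight": max(mentioning, default=0.0),  # best single supporting piece
            "sum_weight": sum(mentioning),               # total weighted support
        }
    return features

# Hypothetical evidence weights and entity mentions.
weights = [0.5, 0.3, 0.2]
evidence_entities = [{"ridley_scott", "alien"}, {"alien"}, {"ridley_scott"}]
features = aggregate_support(
    weights, evidence_entities, ["ridley_scott", "james_cameron"]
)
```

Because the links from evidence to candidates are kept explicit, a candidate that is never mentioned simply receives zero support instead of sharing a single summary vector with all other candidates.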

Answer Scoring Module
Finally, EviNets uses the aggregated signals to predict the answer scores, rank the candidates, and return the best candidate as the final answer to the question. For this purpose, we use two fully-connected neural network layers with the ReLU activation function, with 32 and 8 hidden units respectively. The model was trained end-to-end by optimizing the cross-entropy loss function using the Adam algorithm (Kingma and Ba, 2014).
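The forward pass of such a scorer can be sketched as follows. The hidden sizes (32 and 8) and the ReLU activation come from the text; the final linear layer producing one scalar per candidate, the softmax over candidates inside the loss, the random initialization, and the three input features are assumptions made for illustration. Training with Adam is omitted.

```python
import math
import random

def dense(x, weights, bias):
    """Affine layer; `weights` holds one row of input weights per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (initialization scheme is an assumption)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def score_candidate(features, params):
    h1 = relu(dense(features, *params["hidden1"]))  # 32 hidden units
    h2 = relu(dense(h1, *params["hidden2"]))        # 8 hidden units
    return dense(h2, *params["output"])[0]          # assumed scalar output layer

def cross_entropy(scores, gold_index):
    """Negative log-probability of the gold candidate under a softmax over scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[gold_index]

rng = random.Random(0)
n_features = 3  # hypothetical number of aggregated signals per candidate
params = {
    "hidden1": init_layer(n_features, 32, rng),
    "hidden2": init_layer(32, 8, rng),
    "output": init_layer(8, 1, rng),
}
candidate_features = [[2.0, 0.5, 0.7], [0.0, 0.0, 0.0]]
scores = [score_candidate(f, params) for f in candidate_features]
loss = cross_entropy(scores, gold_index=0)
best = max(range(len(scores)), key=lambda i: scores[i])
```

In an end-to-end setup the gradient of `loss` would flow back through the aggregation and matching steps into the shared embeddings.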

Experimental Evaluation
To test our framework, we used the TREC QA (Sun et al., 2015) and WikiMovies (Miller et al., 2016) benchmarks and the new Yahoo! Answers dataset, derived from factoid questions posted on the CQA website (Table 2). The dataset is available for research purposes at http://ir.mathcs.emory.edu/software-data/. In all experiments, embeddings were initialized with 300-dimensional vectors pre-trained with GloVe (Pennington et al., 2014). Embeddings for multi-word entity names were obtained by averaging the word vectors of constituent words.
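The initialization of multi-word entity embeddings can be illustrated with a short sketch. The function name, the lowercasing and whitespace tokenization, and the choice to skip out-of-vocabulary words are assumptions; the toy vectors are two-dimensional stand-ins for the 300-dimensional pre-trained vectors.

```python
def entity_embedding(entity_name, word_vectors):
    """Average the vectors of the words that make up an entity name.

    Words missing from `word_vectors` are skipped (an assumption; the
    handling of out-of-vocabulary words is not specified in the text).
    """
    words = entity_name.lower().split()
    known = [word_vectors[w] for w in words if w in word_vectors]
    if not known:
        raise KeyError(f"no known words in entity name: {entity_name!r}")
    return [sum(column) / len(known) for column in zip(*known)]

# Toy vectors standing in for pre-trained word embeddings.
toy_vectors = {"ridley": [1.0, 0.0], "scott": [0.0, 1.0]}
vector = entity_embedding("Ridley Scott", toy_vectors)
```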

Baselines
As baselines for different experiments, depending on the availability and specifics of a dataset, we considered the following methods:
• IR-based QA systems: AskMSR (Brill et al., 2002) and AskMSR+, which select the best answer based on the frequency of entity mentions in retrieved text snippets.
• KBQA systems: SemPre (Berant et al., 2013) and Aqqu (Bast and Haussmann, 2015), which identify possible topic entities of the question, and select the answer from the candidates in the neighborhood of these entities in a KB.
• Hybrid system: QuASE (Sun et al., 2015), which detects mentions of knowledge base entities in text passages, and uses the types and description information from the KB to support answer selection.
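The frequency-based selection used by the IR-based baselines can be approximated with a very small sketch. This is a simplification for illustration only and does not reproduce the actual AskMSR pipeline; the function name and the toy snippets are hypothetical.

```python
from collections import Counter

def most_frequent_entity(snippet_entities):
    """Return the entity mentioned most often across snippets, or None if empty.

    snippet_entities -- one list of linked entity names per retrieved snippet
    """
    counts = Counter(
        entity for entities in snippet_entities for entity in entities
    )
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical entity mentions in three retrieved snippets.
snippets = [
    ["ridley_scott", "alien"],
    ["ridley_scott"],
    ["james_cameron", "ridley_scott"],
]
answer = most_frequent_entity(snippets)
```

Such a baseline ignores how well each snippet matches the question, which is the information the evidence weights in EviNets are meant to add.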

TREC QA dataset
The TREC QA dataset is composed of factoid questions that can be answered with an entity and that were used in the TREC 8-12 question answering tracks. Similarly to Sun et al. (2015), we used web search (via the Microsoft Bing Web Search API) to retrieve the top 50 documents, parsed them, extracted sentences, and ranked them using tf-idf similarity to the question. To compare our results with the existing state-of-the-art, we used the same set of candidate entities as used by the QuASE model. We note that the extracted evidence differs between the models, and we were unable to match some of the candidates to our sentences. For the text+KB experiment, just as QuASE, we used entity descriptions and types from the Freebase knowledge base. Table 3 summarizes the results. EviNets achieves competitive results on the dataset, beating KV MemN2N by 13% in F1 score, and, unlike QuASE, does not rely on expensive feature engineering and does not require any external resources to train.
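The sentence ranking step can be sketched with a small tf-idf cosine similarity routine. The exact tf-idf variant is not specified in the text, so the smoothed idf formula, the whitespace tokenizer, and the toy sentences below are assumptions.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()

def rank_sentences(question, sentences):
    """Order sentences by tf-idf cosine similarity to the question, best first."""
    docs = [Counter(tokenize(s)) for s in sentences]
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in doc)

    def idf(term):
        # Smoothed idf, so terms unseen in the sentences still get a finite weight.
        return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0

    def vectorize(counts):
        return {term: count * idf(term) for term, count in counts.items()}

    def cosine(u, v):
        numerator = sum(w * v.get(term, 0.0) for term, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return numerator / (norm_u * norm_v) if norm_u and norm_v else 0.0

    q_vec = vectorize(Counter(tokenize(question)))
    scored = [(cosine(q_vec, vectorize(doc)), s) for doc, s in zip(docs, sentences)]
    return [s for _, s in sorted(scored, key=lambda pair: pair[0], reverse=True)]

ranked = rank_sentences(
    "who directed alien",
    [
        "alien was released in 1979",
        "ridley scott directed alien",
        "the weather is nice",
    ],
)
```

The top-ranked sentences would then serve as the textual evidence passed to the matching module.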

WikiMovies dataset
The WikiMovies dataset contains questions in the movies domain along with relevant Wikipedia passages and the OMDb knowledge base. Since KV MemN2N already achieves an almost perfect result answering the questions using the KB, we focus on using the provided movie articles from Wikipedia. We followed the preprocessing procedures described in Miller et al. (2016). Unlike TREC QA, where there are often multiple relevant supporting pieces of evidence, answers in the WikiMovies dataset usually have a single relevant sentence, which, however, mentions multiple entities. To help the model distinguish the correct answer, and to explore its ability to encode structured and unstructured data, we generated additional entity type triples. For example, if an entity E appears as an object of the predicate directed by in OMDb, we added the [E, type, director] triple. As baselines, we used the MemN2N and KV MemN2N models, and the results are presented in Table 4. As we can see, with the same setup, using individual sentences as evidence/memories, EviNets significantly outperforms the KV MemN2N model by 27%. It is important to emphasize that the best-reported results of memory networks were obtained using entity-centered windows as memories, which requires special pre-processing and increases the number of memories. Additionally, these models used all of the KB entities as candidate answers, whereas EviNets relies only on the mentioned ones, which is a more scalable scenario for open-domain question answering, where it is not realistic to score millions of candidate answers in real-time.
Table 4: Accuracy of EviNets and baseline models on the WikiMovies dataset. The results marked * are obtained using a different setup, i.e., they use pre-processed entity window memories, and the whole set of entities as candidates.
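The generation of entity type triples described above can be sketched as a simple pass over KB triples. Only the mapping from the directed by predicate to the director type comes from the text; the underscore spelling of predicates, the second mapping entry, and the toy triples are hypothetical.

```python
def type_triples(kb_triples, predicate_to_type):
    """Derive (entity, "type", type_name) triples from the objects of KB triples."""
    seen = set()
    derived = []
    for _subject, predicate, obj in kb_triples:
        type_name = predicate_to_type.get(predicate)
        if type_name is None:
            continue  # predicate does not imply a type for its object
        triple = (obj, "type", type_name)
        if triple not in seen:  # avoid emitting duplicates
            seen.add(triple)
            derived.append(triple)
    return derived

# "directed_by" -> "director" follows the example in the text;
# the "starred_actors" entry is a hypothetical addition.
predicate_to_type = {"directed_by": "director", "starred_actors": "actor"}
kb = [
    ("alien", "directed_by", "ridley_scott"),
    ("blade_runner", "directed_by", "ridley_scott"),
    ("alien", "release_year", "1979"),
]
extra_evidence = type_triples(kb, predicate_to_type)
```

The derived triples can be added to the evidence pool alongside sentences, since both are embedded into the same vector space.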

Yahoo! Answers dataset
Yahoo! recently released a dataset with search queries that led to clicks on factoid Yahoo! Answers questions, identified as questions whose best answer contains fewer than 3 words and specifies a Wikipedia page as the source of information. This dataset contains 15K queries, which correspond to 4725 unique Yahoo! Answers questions (Table 2). We took these questions and mapped answers to KB entities using the TagMe entity linking library (Ferragina and Scaiella, 2010). We filtered out questions, for [...] We applied the TagMe entity linker to the extracted sentences, and considered all entities of mentions with a confidence score above the 0.2 threshold as candidate answers. For candidate entities we also retrieved relevant KB triples, such as entity types and descriptions, which extended the original pool of evidence. Table 5 summarizes the results of EviNets and some baseline methods on the created Yahoo! Answers dataset. As we can see, knowledge base data is not enough to answer most of these questions, and the state-of-the-art KBQA system Aqqu achieves a precision of only 0.116. Adding textual data helps significantly, and Text2KB improves the precision to 0.17, roughly matching the results of the AskMSR system, which ranks candidate entities by their popularity in the retrieved documents. Using text along with KB evidence gave higher performance metrics, boosting F1 from 0.271 to 0.291. EviNets significantly improves over the baseline approaches, beating AskMSR by 28% and KV MemN2N by almost 80% in F1 score.
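The candidate selection step based on the linker confidence can be sketched as a filter over mention annotations. The 0.2 threshold comes from the text; the annotation format (dictionaries with `entity` and `rho` fields) and the sample values are hypothetical and do not represent the actual TagMe response format.

```python
def candidate_entities(annotations, threshold=0.2):
    """Keep entities of mentions whose confidence is strictly above the threshold.

    annotations -- list of dicts with hypothetical keys "entity" and "rho"
    """
    seen = set()
    candidates = []
    for annotation in annotations:
        entity = annotation["entity"]
        if annotation["rho"] > threshold and entity not in seen:
            seen.add(entity)  # each entity becomes a candidate only once
            candidates.append(entity)
    return candidates

# Hypothetical linker output for a few extracted sentences.
annotations = [
    {"entity": "Ridley Scott", "rho": 0.45},
    {"entity": "Alien (film)", "rho": 0.31},
    {"entity": "Scott", "rho": 0.08},
    {"entity": "Ridley Scott", "rho": 0.52},
]
candidates = candidate_entities(annotations)
```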

Related Work
Footnote 4: A minimum ρ score of 0.2 from TagMe was required.
The success of deep neural network architectures in computer vision and NLP applications motivated researchers to investigate applying these techniques to answer sentence selection, evaluated on TREC QA (Wang et al., 2007), WikiQA (Yang et al., 2015) and other datasets. A number of models proposed in recent years explore different ways of matching questions and answer sentences (Yang et al., 2016; Rao et al., 2016). Our EviNets architecture makes it easy to plug these sentence matching networks into the evidence matching module, and provides the aggregation layer, which helps to make a decision based on all available information.
Our evidence representation module is based on the ideas of memory networks (Sukhbaatar et al., 2015; Kumar et al., 2015; Miller et al., 2016), which also embed relevant information into a vector space. However, they use a soft attention mechanism to retrieve the memories, and do not use links from memories to the corresponding answer candidates, which means that all relevant information is squeezed into a fixed-dimensional vector. This limitation has been partially addressed in Wang et al. (2016) and Henaff et al. (2016), which accumulate evidence for each answer separately using a recurrent neural network. In contrast, the evidence aggregation in our EviNets model uses multiple different features, which is more flexible and can be extended with other signals.

Conclusions
We presented EviNets, a neural network for question answering, which encodes and aggregates multiple evidence signals to select answers. Experiments on the TREC QA, WikiMovies and Yahoo! Answers datasets demonstrate that EviNets can be trained end-to-end to use both the available textual and knowledge base information. EviNets improves over the baselines both when there are many relevant pieces of evidence and when there are just a few, by helping build an aggregate picture and distinguish between candidates mentioned together in a relevant memory, as is the case for the WikiMovies dataset. The results of our experiments also demonstrate that EviNets can incorporate signals from different data sources, e.g., adding KB triples helps to improve the performance over the text-only setup. As a limitation of this work and a direction for future research, EviNets could be extended to support dynamic evidence retrieval, which would allow retrieving additional answer candidates and evidence as needed.