An End-to-End Deep Framework for Answer Triggering with a Novel Group-Level Objective

Given a question and a set of answer candidates, answer triggering determines whether the candidate set contains any correct answers. If yes, it then outputs a correct one. In contrast to existing pipeline methods, which first consider individual candidate answers separately and then make a prediction based on a threshold, we propose an end-to-end deep neural network framework, trained by a novel group-level objective function that directly optimizes answer triggering performance. Our objective function penalizes three potential types of error and allows training the framework in an end-to-end manner. Experimental results on the WikiQA benchmark show that our framework outperforms the state of the art by an absolute gain of 6.6% in F1.


Introduction
Question Answering (QA) aims at automatically responding to natural language questions with direct answers (Heilman and Smith, 2010; Severyn and Moschitti, 2013; Yao et al., 2013; Berant and Liang, 2014; Sun et al., 2015; Miller et al., 2016; Sun et al., 2016). Most existing QA systems always output an answer for any question, regardless of whether their answer candidate set contains a correct answer (Feng et al., 2015; Severyn and Moschitti, 2015; Yang et al., 2016; Rao et al., 2016). In practice, however, this can greatly hurt user experience, especially when it is hard for users to judge answer correctness. In this paper, we study the critical yet under-addressed Answer Triggering (Yang et al., 2015) problem: Given a question and a set of answer candidates, determine whether the candidate set contains any correct answer, and if so, select a correct answer as system output.
The answer triggering problem can be logically divided into two sub-problems. P1: Build an individual-level model to rank answer candidates so that a correct one (if it exists) gets the highest score. P2: Make a group-level binary prediction on the existence of correct answers within the candidate set. Previous work (Yang et al., 2015; Jurczyk et al., 2016) attacks the problem via a pipeline approach: first solve P1 as a ranking task, and then solve P2 by choosing an optimal threshold on the highest ranking score from the previous step. However, the resulting answer triggering performance is far from satisfactory, with F1 between 32% and 36%. An alternative pipeline approach is to first solve P2 and then P1, i.e., first determine whether there is a correct answer in the candidate set and then rank all candidates to find a correct one. However, as we will show using state-of-the-art Multiple Instance Learning (MIL) algorithms in Section 4, P2 by itself is currently a very challenging task, partly because of the difficulty of extracting features from a set of candidate answers that are effective for answer triggering. Because performance on both P1 and P2 is far from perfect, the above pipeline approaches also suffer from error propagation (Finkel et al., 2006; Zeng et al., 2015).
We propose Group-level Answer Triggering (GAT), an end-to-end framework for jointly optimizing P1 and P2. Our key contribution in GAT is a novel group-level objective function, which aggregates individual-level information and penalizes three potential error types in answer triggering as a group-level task. By optimizing this objective function, we can directly back-propagate the final answer triggering errors to the entire framework and learn all the parameters simultaneously. We conduct evaluation using the same dataset and measure as in previous work (Yang et al., 2015; Jurczyk et al., 2016), and our framework improves the F1 score by 6.6% absolute (from 36.65% to 43.27%) compared with the state of the art.

Framework
Notations. Let $i$ and $j$ be the indices of a question and an answer candidate, respectively. Let $l_{i,j}$ be the binary label of the $j$-th answer candidate for question $q_i$, and $l_i$ be the group label of the answer candidate set of $q_i$ (1 if it contains any correct answer; 0 otherwise). $m_{i,j}$ denotes an individual-level matching score, measuring how likely it is that question $q_i$ can be correctly answered by its $j$-th answer candidate.
The GAT framework, illustrated in Figure 1, consists of three components:
(1) Encoder. Two separate encoders process questions and answer candidates respectively, mapping them from token sequences into two different vector spaces.
(2) QA Matching. For each question and answer candidate pair, we concatenate their encoded vectors and pass the result through a feed-forward neural network with a binary softmax output layer. The output is an individual-level matching score, i.e., $m_{i,j}$.
(3) Signed Max Pooling. Max pooling is applied to all the matching scores in a candidate set. During training, when each candidate is labeled as positive or negative according to whether it answers the question, we use the labels to divide the scores into two disjoint subsets and perform max pooling separately:
$$m_i^+ = \max_{j:\, l_{i,j}=1} m_{i,j}, \qquad m_i^- = \max_{j:\, l_{i,j}=0} m_{i,j}$$
where $m_i^+$ is the maximum score among correct answers (if there are any) and $m_i^-$ is that among wrong ones. At testing time, when labels are unavailable, this reduces to normal max pooling and yields a single score $m_i = \max_j m_{i,j}$. The answer triggering prediction is then made by comparing $m_i$ with a predefined threshold (0.5) to decide whether to return the top-scored answer candidate to the user.
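The pooling step and the test-time decision can be illustrated with a minimal sketch. It operates on plain Python lists of matching scores rather than on network outputs; the function names `signed_max_pool` and `trigger`, and the use of `None` for an empty subset, are illustrative assumptions and not part of the framework's implementation.

```python
from typing import List, Optional, Tuple

THRESHOLD = 0.5  # fixed triggering threshold stated in the text


def signed_max_pool(scores: List[float],
                    labels: List[int]) -> Tuple[Optional[float], Optional[float]]:
    """Training-time pooling: max score over correct (label 1) and wrong (label 0) candidates.

    Returns (m_pos, m_neg); an entry is None when its subset is empty
    (an assumption made here for illustration).
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    m_pos = max(pos) if pos else None
    m_neg = max(neg) if neg else None
    return m_pos, m_neg


def trigger(scores: List[float], threshold: float = THRESHOLD) -> Optional[int]:
    """Test-time decision: index of the top-scored candidate if its score is
    above the threshold, otherwise None (no answer is returned)."""
    if not scores:
        return None
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] > threshold else None
```

For a group with scores `[0.2, 0.7, 0.4]` and labels `[0, 1, 0]`, the pooled pair is `(0.7, 0.4)` and `trigger` returns index 1; for a group whose top score is 0.3, `trigger` returns `None`.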
The GAT framework design is generic in that the Encoder component can be instantiated with different network architectures. In this paper, we implement it with Bidirectional RNNs (Bi-RNN) (Schuster and Paliwal, 1997) with GRU cells (Cho et al., 2014), and use the temporal average pooling over the hidden states as the encoding representation. We choose Bi-RNN mainly because of its good performance in many QA problems (Wang and Nyberg, 2015;Wang et al., 2016).
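To make the encoding step concrete, the sketch below averages Bi-RNN hidden states over time. It takes the per-step forward and backward states as already computed lists of floats; concatenating the two directions at each step before averaging is an assumption of this sketch, and `temporal_average_pool` is a hypothetical helper name.

```python
from typing import List


def temporal_average_pool(forward_states: List[List[float]],
                          backward_states: List[List[float]]) -> List[float]:
    """Average per-step hidden states over time.

    Each step's forward and backward states are concatenated first
    (an assumption of this sketch), then averaged dimension by dimension.
    """
    if not forward_states or len(forward_states) != len(backward_states):
        raise ValueError("need the same non-zero number of forward and backward steps")
    steps = [f + b for f, b in zip(forward_states, backward_states)]
    dim = len(steps[0])
    return [sum(step[k] for step in steps) / len(steps) for k in range(dim)]
```

The result is a fixed-size vector regardless of the number of tokens, which is what allows question and answer encodings to be concatenated in the QA Matching component.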

Learning
The cost functions for negative groups (answer candidate sets without correct answers) and positive groups (those with correct answers) are defined differently. For each negative group, the highest QA matching score is penalized by a hinge loss $O_1$, in which the maximum matching score $m_i^-$ is compared with 0.5, a fixed threshold for our framework. The variable $d^-$ here, as well as $d^+$ and $d^\pm$ that appear shortly, are margin hyperparameters. $O_1$ is normalized by $N_{neg}$, the number of negative groups (those with $l_i = 0$). We use $O_1$ to reduce false-positive answer existence predictions by penalizing any top matching score that is not safely below the 0.5 threshold.
A positive group is more complicated, because an answer triggering prediction can fail in two ways: (1) the top matching score is below the threshold, or (2) the top-ranked answer candidate is a wrong answer. We design loss terms $O_2$ and $O_3$ to penalize these two types of error, respectively. $O_2$ is a hinge loss that penalizes the case where the highest score among the correct answers in a group is not large enough to signify answer existence. $O_3$ penalizes the case where the highest score is obtained by an incorrect candidate answer. Finally, the overall objective function (Equation 1) is a linear combination of the three loss terms and a standard $\ell_2$-regularization term. $\Theta$ denotes all the trainable parameters in the framework; $\alpha$, $\beta$ and $\lambda$ are hyperparameters.
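One form of $O_1$ consistent with this description is given below. It is a reconstruction from the prose, not a quotation of the original equation; in particular, the exact placement of the margin $d^-$ inside the hinge is an assumption.

```latex
O_1 = \frac{1}{N_{neg}} \sum_{i \,:\, l_i = 0} \max\left(0,\; d^- - 0.5 + m_i^-\right)
```

Under this form the loss for a negative group vanishes only when $m_i^- \le 0.5 - d^-$, that is, when the top matching score lies below the threshold by at least the margin.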
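The three loss terms can be sketched on plain score lists as follows. The hinge forms used here are assumptions reconstructed from the prose: $\max(0, d^- - 0.5 + m_i^-)$ for negative groups, $\max(0, d^+ + 0.5 - m_i^+)$ and $\max(0, d^\pm - m_i^+ + m_i^-)$ for positive groups, combined as $O_1 + \alpha O_2 + \beta O_3$. The default margin and weight values are placeholders, the regularization term is omitted, and `group_objective` is a hypothetical name.

```python
from typing import Dict, List, Sequence, Tuple

Group = Tuple[Sequence[float], Sequence[int]]  # (matching scores, 0/1 labels)


def hinge(x: float) -> float:
    return max(0.0, x)


def group_objective(groups: List[Group],
                    d_neg: float = 0.1, d_pos: float = 0.1, d_rank: float = 0.1,
                    alpha: float = 1.0, beta: float = 1.0,
                    threshold: float = 0.5) -> Dict[str, float]:
    """Group-level loss without regularization (assumed hinge forms, see lead-in)."""
    o1 = o2 = o3 = 0.0
    n_neg = n_pos = 0
    for scores, labels in groups:
        pos = [s for s, l in zip(scores, labels) if l == 1]
        neg = [s for s, l in zip(scores, labels) if l == 0]
        if pos:  # positive group: contains at least one correct answer
            n_pos += 1
            m_pos = max(pos)
            # O2: best correct score should exceed the threshold by a margin
            o2 += hinge(d_pos + threshold - m_pos)
            if neg:
                # O3: best correct score should beat the best wrong score by a margin
                o3 += hinge(d_rank - m_pos + max(neg))
        elif neg:  # negative group: no correct answer
            n_neg += 1
            # O1: top score should stay below the threshold by a margin
            o1 += hinge(d_neg - threshold + max(neg))
    o1 = o1 / n_neg if n_neg else 0.0
    o2 = o2 / n_pos if n_pos else 0.0
    o3 = o3 / n_pos if n_pos else 0.0
    return {"O1": o1, "O2": o2, "O3": o3, "O": o1 + alpha * o2 + beta * o3}
```

In a real training setup these quantities would be computed on differentiable tensors so that the gradient reaches the encoder and matching layers; the list-based version only shows which scores each term looks at.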

A Naive Objective Baseline
For comparison, we provide an alternative objective formulation (Equation 2), which treats positive and negative groups in the same way and does not explicitly penalize cases where an incorrect candidate answer obtains the highest QA matching score in a positive group.
In this formulation, $d^+$ is a margin and $\alpha^*$, $\lambda^*$ are weights. We hypothesize that this formulation will work worse than the objective in Equation 1, and we test this hypothesis experimentally.

Dataset
We use the WIKIQA dataset (Yang et al., 2015) for evaluation. It contains 3,047 questions from Bing query logs, each associated with a group of candidate answer sentences from Wikipedia and manually labeled via crowdsourcing. Several intuitive features are also included in WIKIQA: two word matching features (IDF-weighted and unweighted word-overlap counts between questions and candidate answers, denoted as Cnt), the length of a question (QLen), and the length of a candidate answer (SLen). As in previous work, we also test the effect of these features by combining them with other features as input to the Softmax layer in our framework. We use the standard 70% (train), 10% (dev), and 20% (test) split of WIKIQA. We also use the same data pre-processing steps for fair comparison: truncate questions and sentences to a maximum of 40 tokens and initialize the 300-dimensional word vectors using pretrained word2vec embeddings (Mikolov et al., 2013).
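As an illustration of the truncation step and the two Cnt features, a minimal sketch follows. It assumes pre-tokenized input, counts distinct shared word types, and uses IDF defined as log(N / df); these choices, as well as the helper names, are assumptions of the sketch and may differ from the feature definitions shipped with WIKIQA.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

MAX_TOKENS = 40  # truncation length stated in the text


def truncate(tokens: List[str], max_len: int = MAX_TOKENS) -> List[str]:
    """Keep at most max_len leading tokens."""
    return tokens[:max_len]


def idf_table(documents: List[List[str]]) -> Dict[str, float]:
    """IDF as log(N / document frequency); this definition is an assumption."""
    n = len(documents)
    df: Counter = Counter()
    for doc in documents:
        df.update(set(doc))
    return {word: math.log(n / count) for word, count in df.items()}


def cnt_features(question: List[str], answer: List[str],
                 idf: Dict[str, float]) -> Tuple[float, float]:
    """Return (unweighted, IDF-weighted) overlap over distinct shared words."""
    shared = set(question) & set(answer)
    unweighted = float(len(shared))
    weighted = sum(idf.get(word, 0.0) for word in shared)
    return unweighted, weighted
```

Both values can then be appended to the other inputs of the Softmax layer for a given question and candidate pair.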

Implementation Details
We implement our full framework using TensorFlow (Abadi et al., 2016).

Evaluation Metrics
We use precision, recall, and F1, defined in the same way as in previous work. A question is treated as a positive case only if it contains one or more correct answers in its candidate set. For the prediction of a question, only the candidate with the highest matching score is considered. A true positive prediction must meet two criteria: (1) the score is above a threshold (0.5 for our framework; tuned on the dev set in other work), and (2) the candidate is labeled as a correct answer to the question.
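The question-level metrics can be sketched as below. The true-positive rule follows the two criteria above; taking the number of triggered questions as the precision denominator and the number of positive questions as the recall denominator is an assumption based on the usual definitions, and `triggering_prf` is a hypothetical name.

```python
from typing import List, Tuple

# One entry per question: (top score, top candidate is correct, group has a correct answer)
Prediction = Tuple[float, bool, bool]


def triggering_prf(predictions: List[Prediction],
                   threshold: float = 0.5) -> Tuple[float, float, float]:
    """Question-level precision, recall and F1 for answer triggering."""
    true_pos = sum(1 for score, correct, _ in predictions
                   if score > threshold and correct)
    triggered = sum(1 for score, _, _ in predictions if score > threshold)
    positives = sum(1 for _, _, has_answer in predictions if has_answer)
    precision = true_pos / triggered if triggered else 0.0
    recall = true_pos / positives if positives else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Note that a question whose top candidate is wrong but scores above the threshold lowers precision even when the group does contain a correct answer elsewhere.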

Results

a. Comparison with Baselines
We evaluate the effectiveness of the proposed GAT framework by comparing it with several baseline models. To the best of our knowledge, there has been only limited work on answer triggering so far, and it forms the first two baselines below. (1) Yang et al. (2015) propose CNN-Cnt, which is a combination of the CNN model from Yu et al. (2014) and two Cnt features. We use their best reported result, which is achieved when CNN-Cnt is combined with the QLen feature.
(2) Jurczyk et al. (2016) extend the previous work with various network structures and add some more sophisticated features. Here we compare with their best model on WIKIQA, which is a CNN model combined with carefully designed tree-matching features, extracted from expensive dependency parsing results. (3) We include a third Naive baseline where the objective function in Equation 2 is used to train our architecture in Figure 1. Due to space limits, we show its best result obtained among various feature combinations. The results are summarized in Table 1.
We can see that GAT combined with Cnt features improves the F1 score over Yang et al. (2015) and Jurczyk et al. (2016) by around 11.1% and 6.6% absolute, respectively (from 32.17 and 36.65 to 43.27), which shows the effectiveness of our framework. We denote this configuration as our full framework. Through the comparison between Naive and GAT, we can see that our proposed objective function has a clear advantage over the Naive one, which does not model the complexity of answer triggering for positive candidate sets. Unlike in the results of Yang et al. (2015), adding the QLen feature does not further improve performance in our case, possibly because we choose Bi-RNN as our encoder, which may capture some question characteristics better than a length feature.

Table 1:
Model                     Prec    Rec     F1
(Yang et al., 2015)       27.96   37.86   32.17
(Jurczyk et al., 2016)

b. Framework Breakdown
Now we conduct further analysis in order to better understand the contribution of each component in our full framework. Since the code from Yang et al. (2015) is available, we use it (rather than that of Jurczyk et al. (2016)) to assist our analysis. We first test a variant of our full framework by replacing the Encoder and QA Matching components with the CNN-based model from Yang et al. (2015), in which the QA matching score is obtained first through CNN encoding and then a bilinear model. We denote this variant as GAT w/ CNN and train it with our objective. From the first two rows in Table 2, we observe that: (1) Our current design, with Bi-RNN and a feed-forward NN, improves F1 from 35.03% to 43.27% in comparison with the CNN-based model, partly because their CNN consists of only one convolution layer and one average pooling layer. However, we leave more advanced encoder and QA matching designs for future work, and anticipate that more complex CNN-based models can achieve similar or better results than our current design, as in many other QA-related works (Hu et al., 2014). (2) Compared with the best result from Yang et al. (2015) in Table 1, training the CNN-based model end-to-end using our objective improves F1 from 32.17% to 35.03%. This directly shows that an end-to-end learning strategy works better than the pipeline approach in Yang et al. (2015).
Now we detach the Encoder component (ENC) from our end-to-end full framework. To obtain semantic vectors of questions and candidate answers as input to the subsequent QA Matching component, we leverage the code released by Yang et al. (2015) to train the Encoder component (with CNN) through their well-tuned individual-level optimization, and use their learnt semantic vectors. Then our framework without ENC, denoted as -ENC, is trained and tested as before. We further detach the QA Matching component (QAM) in a similar way: we directly use the matching score between a question and a candidate answer obtained by Yang et al. (2015), and concatenate it with Cnt features as input to the Softmax layer. This is our framework without ENC or QAM, denoted as -ENC -QAM, and it is trained by our group-level objective. By comparing these variants with our end-to-end frameworks on both dev and test sets, we can see that it is beneficial to jointly train the entire framework.

Error Analysis
We now illustrate some typical mistakes made by our framework to inspire future improvements.
Q: What city was the convention when Gerald Ford was nominated? A: Held in Kemper arena in Kansas City , Missouri , the convention nominated president Gerald Ford for a full term, but only after narrowly defeating a strong challenge from former California governor Ronald Reagan.
In this case, A is correct, but our framework makes a false negative prediction. Although A is already ranked highest among the 4 candidate answers, it only receives a score of 0.134, possibly due to its complicated semantic structure (an attributive clause) and the extra irrelevant information (defeating Reagan).
Q: What can SQL 2005 do? A1: Microsoft SQL server is a relational database management system developed by Microsoft.
A2: As a database , it is a software product whose primary function is to store and retrieve data as requested by other software applications, be it those on the same computer or those running on another computer across a [TRUNCATED END]
The incorrect answer A1 is ranked higher than the correct answer A2, both with scores above 0.5. This is a false positive case, with incorrect ranking as well. Possible reasons are that the detailed functionality of SQL explained in A2 is hard to capture and relate to the question, and that A2 is truncated to 40 tokens in our experiments. On the other hand, the "database management system" phrase in A1 sounds close to an explanation of functionality unless the two are carefully distinguished.
Both cases above show that the semantic relation between a question and its answer is hard to capture. For future research, more advanced models can be incorporated in the Encoder and QA Matching components of our framework.

Related Work
Answer Selection. Answer selection (a.k.a. answer sentence selection) is the task of assigning individual-level ranking scores to answer candidates given a question, which is similar to P1 defined in Section 1. Existing QA systems based on answer selection simply select the top-scored candidate as the answer, without considering the possibility that a correct answer does not exist. However, many neural network models recently explored in the answer selection literature (Hu et al., 2014; Wang and Nyberg, 2015; Feng et al., 2015) could be utilized for answer triggering as well in the future. For example, prior work explores the respective advantages of different network architectures such as Long Short-Term Memory Networks (LSTMs) and CNNs, and develops hybrid models for answer selection. Various attention mechanisms have been proposed, such as (Wang et al., 2016) for RNNs and (Yin et al., 2015) for CNNs. Answer selection is also formulated as a sentence similarity measurement problem or a pairwise ranking problem, as in (Severyn and Moschitti, 2015; Yang et al., 2016; Rao et al., 2016).
Multiple Instance Learning. We have briefly mentioned MIL (Babenko et al., 2011; Amores, 2013; Cheplygina et al., 2015) in Section 1. Many MIL algorithms cannot be directly applied to answer triggering, because individual-level annotations and predictions are often assumed unavailable and unnecessary in MIL (Maron and Lozano-Pérez, 1998; Babenko et al., 2011; Amores, 2013; Cheplygina et al., 2015), but not in the answer triggering setting, where the correctness of each answer candidate is annotated during training and needs to be predicted during testing. We experimented with two popular MIL algorithms that explicitly discriminate individual-level labels: MI-SVM (Andrews et al., 2003) and Sb-MIL (Bunescu and Mooney, 2007), as implemented in one of the state-of-the-art MIL toolkits (Doran and Ray, 2014), where we represented each question/answer pair with encoder vectors as in Section 3.4. Unfortunately, both algorithms predict that no correct answer exists for any question, possibly because the training data are biased towards negative groups and the input features are not effective enough. This indicates that using MIL for answer triggering is challenging and remains open for future research.

Conclusion
In conclusion, we address the critical answer triggering challenge with an effective framework based on deep neural networks. We propose a novel objective function to optimize the entire framework end-to-end, where we focus more on the group-level prediction and take into account multiple important factors. In particular, the objective function explicitly penalizes three potential errors in answer triggering: (1) false-positive and (2) false-negative predictions of the existence of a correct answer, as well as (3) ranking incorrect answers higher than correct ones. We experiment with different objective function settings and show that our GAT framework outperforms the previous state of the art by a remarkable margin.