Improving Answer Selection and Answer Triggering using Hard Negatives

In this paper, we establish the effectiveness of using hard negatives, coupled with a siamese network and a suitable loss function, for the tasks of answer selection and answer triggering. We show that the choice of sampling strategy is key to achieving improved performance on these tasks. Evaluating on recent answer selection datasets (InsuranceQA, SelQA, and an internal QA dataset), we show that using hard negatives with relatively simple model architectures (bag-of-words and LSTM-CNN) drives significant performance gains. On InsuranceQA, this strategy alone improves over previously reported results by a minimum of 1.6 points in P@1. Using hard negatives with a Transformer encoder provides a further improvement of 2.3 points. Further, we propose to use a quadruplet loss for answer triggering, with the aim of producing globally meaningful similarity scores. We show that the quadruplet loss, coupled with the selection of hard negatives, enables bag-of-words models to improve F1 score by 2.3 points over previous baselines on the SelQA answer triggering dataset. Our results provide key insights into the answer selection and answer triggering tasks.


Introduction
Question answering (QA) is an active field of research, drawing attention from the natural language processing (NLP) and information retrieval (IR) communities. Selection-based QA is the task of selecting an answer for a given question from a set of candidate answers. Two tasks have been proposed for selection-based QA. Given a question and an answer candidate pool, answer selection is the task of ranking valid answers higher than irrelevant answers, where it is assumed that there is at least one valid answer in the candidate pool. Answer triggering, defined for cases where the candidate pool may or may not contain a valid answer, is the task of finding a valid answer while being allowed to abstain. Recently released datasets (Feng et al., 2015; Yang et al., 2015; Jurczyk et al., 2016) have led to significant development in end-to-end neural networks for selection-based QA. These networks are learned by posing answer selection as either a ranking problem or a classification problem. When posed as a ranking problem, a common choice is to use a triplet loss defined over a question, a correct answer and a (usually sampled) negative answer. Triplet loss penalises relative distances between positive and negative pairs, making the choice of negatives critical. Recent progress in image understanding tasks suggests that learning in ranking tasks can be enhanced by focusing on sampling strategies in conjunction with the loss function employed (Hermans et al., 2017). In this work, we investigate the expressive power of siamese architectures (Bromley et al., 1994) coupled with hard negatives (i.e., difficult negative samples for the model being learnt), against more sophisticated models. Specifically, all our models ignore any interaction between question and answer representations, and use a shared encoder to independently encode questions and answers.
Siamese architectures provide low-latency solutions in the presence of large candidate answer pools (by caching candidate representations).
In this paper, we first show that using hard negatives with vanilla neural architectures improves over previously reported results on the InsuranceQA answer selection task (Feng et al., 2015). The proposed strategy achieves P@1 of 73.3%, compared to the previously reported 71.1% (dos Santos et al., 2016). Further, using a Transformer encoder coupled with the proposed strategy achieves P@1 of 75.6%. Next, we propose the use of quadruplet loss for answer triggering. We show that by employing online selection of hard negatives with quadruplet loss, bag-of-words models improve upon previous baselines on the SelQA answer triggering task (Jurczyk et al., 2016). The proposed strategy achieves an F1 score of 53.21, compared to the previously reported 50.89 (Jurczyk et al., 2016).
We note that the performance gains demonstrated in this paper are obtained with a restricted class of models. Specifically, we restrict ourselves to siamese networks and do not model any interaction between question and answer representations. The previously reported results we compare against use architectures that are not restricted to be siamese, and that model the interaction between question and answer representations.

Related Work
For answer selection, CNN, LSTM and LSTM-CNN architectures (Tan et al., 2015; dos Santos et al., 2016), trained with a triplet loss, have been explored. Recent focus has been on designing the interaction layer between a question and a candidate answer through various models of attention (Yu et al., 2014; Rao et al., 2016; Bian et al., 2017). Compare-aggregate architectures have also been studied as sentence matching models and applied to answer selection (Wang and Jiang, 2016). Similarly, for answer triggering, different architectural changes have been proposed (Acheampong et al., 2016; Gupta et al., 2018; Li and Wu, 2017). Often, the architectural advances are coupled with subtle changes in the training process. For example, dos Santos et al. (2016) incorporate a form of negative mining. This makes it difficult to separate the benefits gained from the mining strategy from the architectural advances.
Triplet loss and mining of hard negatives have been studied for computer vision tasks, with triplet networks being popular for estimating feature embeddings for images. The key issue with triplet networks is that, for a training set with N samples, the number of triplets is cubic in N. Training becomes intractable even for modest training set sizes. A solution in the form of importance sampling has been studied. Schroff et al. (2015) learn image embeddings using triplet loss, trained using moderately hard negatives. While other approaches that combine classification with a verification loss have been studied for the task of person re-identification, Hermans et al. (2017) show that a vanilla CNN with triplet loss and the right sampling strategy could outperform the best models at the time. Wu et al. (2017) present a distance-weighted sampling approach.
Triplet loss, while useful for ranking tasks, does not produce globally meaningful scores for tasks such as person re-identification. This can be attributed to the fact that triplet loss does not try to learn a global threshold separating all inter-class pairs from all intra-class pairs, and learns only relative distances with respect to an anchor (a question, in the context of this paper). This has been addressed by accounting for the global structure of the embedding space (Kumar et al., 2016; Ustinova and Lempitsky, 2016), and by directly optimizing for inter-class distances to be larger than intra-class distances. Quadruplet loss provides another way to produce globally meaningful scores, and is suitable for our setting. Quadruplet loss extends triplet loss to ensure smaller intra-class distances and larger inter-class distances. This is achieved by additionally penalizing positive and negative pairs with different probes (questions, in the context of this paper).
While selection of hard negatives has been used for problems in computer vision, its usefulness has not been evaluated for QA tasks. In this work, we first present the selection of hard negatives with triplet loss for answer selection. Next, we show that quadruplet loss is suitable for the task of answer triggering, and when coupled with online selection of hard negatives, improves over previous baselines. To the best of our knowledge, we are the first to use quadruplet loss for question answering. InsuranceQA (Feng et al., 2015), a domain-specific non-factoid QA dataset, is suitable for evaluating answer selection. SelQA (Jurczyk et al., 2016), a recent open-domain factoid QA dataset, provides data for evaluating both answer selection and answer triggering.
While these datasets provide standardised comparison, we also evaluate our methods on a large internal answer selection dataset, LargeQA, created using Community Question-Answers (CQnA) asked on a website. The dataset was constructed similarly to InsuranceQA, with the size of the candidate pool fixed at 100 answers in the test sets. Table 1 presents statistics on the training data used in these datasets.
We do not include the WikiQA (Yang et al., 2015) and TrecQA (Wang et al., 2007) datasets, which have previously been used to report improvements in answer selection. This is due to the large variance we observed in our experiments, as well as in reproductions of existing methods, also noted by Crane (2018), perhaps owing to the smaller sizes of these datasets.

Method
For selection-based QA, the training data X can be characterized as a list of questions Q = {q_1, q_2, ..., q_s}, and a set of correct answer(s) A_q = {a_1, a_2, ..., a_p} for each question q ∈ Q.
In the following, we discuss online selection of hard negatives with triplet loss for answer selection, and with quadruplet loss for answer triggering. In each case, a siamese architecture is used where questions and answers are encoded independently by identical copies of a neural network denoted by f. Cosine similarity is used to compute the similarity between a question and an answer: S(x, y) = cosine(f(x), f(y)).
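This siamese scoring scheme can be sketched as follows. The encoder below is a toy stand-in for the shared network f (not the paper's model); the point of the sketch is that, with no question-answer interaction, candidate answer embeddings can be precomputed and cached, which is what makes siamese architectures low-latency over large answer pools.

```python
import hashlib
import numpy as np

def cosine(u, v, eps=1e-8):
    # Cosine similarity between two vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

def toy_encoder(text, dim=16):
    # Stand-in for the shared encoder f: averages deterministic
    # per-token random vectors (NOT the paper's actual model).
    vecs = []
    for tok in text.lower().split():
        seed = int(hashlib.md5(tok.encode()).hexdigest()[:8], 16)
        vecs.append(np.random.default_rng(seed).standard_normal(dim))
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Because f is shared and there is no interaction layer, the candidate
# answer embeddings can be computed once and cached.
answers = ["coverage includes water damage", "premiums are paid monthly"]
answer_cache = [toy_encoder(a) for a in answers]

def score_question(question):
    # S(q, a) = cosine(f(q), f(a)) against every cached candidate.
    q_emb = toy_encoder(question)
    return [cosine(q_emb, a_emb) for a_emb in answer_cache]

scores = score_question("is water damage covered")
```

At query time only the question is encoded; the per-answer work reduces to one cosine similarity per cached candidate.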

Answer Selection
Siamese networks can be trained for answer selection using triplets (q, a+, a−), where a+ ∈ A_q is a correct answer and a− ∉ A_q is an incorrect answer chosen randomly from the entire answer pool. A triplet loss with margin m is formulated as:

L(q, a+, a−) = max(0, m − S(q, a+) + S(q, a−))    (1)

In this work, we employ online selection of hard negatives within a batch of sampled question-answer pairs. The model is trained using batch gradient descent, with batches B = {(q_i, a_i, a_i^H) : 1 ≤ i ≤ b}, where b is the batch size, a_i ∈ A_{q_i}, and a_i^H is the hardest negative answer chosen as:

a_i^H = argmax_{1 ≤ j ≤ b, a_j ∉ A_{q_i}} S(q_i, a_j)    (2)

Selection of negatives from within a batch of sampled question-answer pairs has several advantages. First, it does not require any extra computation from the embedding model. Second, there is no need to employ additional heuristics to avoid the hardest negatives. Selection from within a stochastic batch ensures that the selected hardest negatives are not dominated by false negatives or noisy hard negatives during training (see Appendix C for experimental results).
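The batch-hard triplet loss described above can be sketched in a few lines of numpy. The sketch assumes one positive answer per question in the batch (so the only in-batch positive for q_i is a_i) and L2-normalized embeddings, so that a matrix product gives cosine similarities; it is an illustration, not the paper's training code.

```python
import numpy as np

def batch_hard_triplet_loss(q_emb, a_emb, margin=0.2):
    # q_emb, a_emb: (b, d) L2-normalized embeddings, where a_emb[i]
    # is a correct answer for q_emb[i].
    sims = q_emb @ a_emb.T               # sims[i, j] = S(q_i, a_j)
    pos = np.diag(sims)                  # S(q_i, a_i)
    neg = sims.copy()
    np.fill_diagonal(neg, -np.inf)       # exclude the positive pair
    hard_neg = neg.max(axis=1)           # S(q_i, a_i^H), hardest in-batch negative
    losses = np.maximum(0.0, margin - pos + hard_neg)
    return float(losses.mean())

# Perfectly separated toy batch: zero loss.
q = np.eye(2)
loss_good = batch_hard_triplet_loss(q, np.eye(2))
# Misranked batch (answers swapped): positive hinge loss.
loss_bad = batch_hard_triplet_loss(q, np.eye(2)[::-1])
```

Note that the in-batch hardest negative requires no extra forward passes: the (b, b) similarity matrix is computed from embeddings the model already produced for the batch.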

Answer Triggering
Answer triggering differs from answer selection in that we need to identify whether a valid answer is present in the candidate pool. Triplet loss is not suitable, as it promotes scores that are discriminative only relative to a given question. While classification-based methods have been studied, we hypothesize that globally meaningful scores can be obtained in a ranking setup. Essentially, we need the similarity scores for all wrongly paired question-answer pairs to be smaller than the scores for correct question-answer pairs. We propose to use quadruplet loss for answer triggering. Quadruplet loss, introduced by Chen et al. (2017) for person re-identification, aims at keeping all inter-class distances larger than all intra-class distances. This, we believe, is what we need to obtain globally meaningful scores for answer triggering.
The network can be trained using quadruplets of the form (q, a+, a−, q′), where a+ ∈ A_q, a− ∉ A_q, and q′ is a negative question for both a+ and a−, chosen randomly from the entire question pool such that a+ ∉ A_{q′} and a− ∉ A_{q′}. Quadruplet loss for answer triggering can be formulated as:

L(q, a+, a−, q′) = max(0, m_1 − S(q, a+) + S(q, a−)) + max(0, m_2 − S(q, a+) + S(q′, a−))    (3)

where m_1 and m_2 are fixed margins.
As with triplet loss, we believe the selection of negatives is critical for learning with quadruplet loss. We propose to select both the negative answer and the negative question using online selection of hard negatives. The model is trained using batch gradient descent with batches B = {(q_i, a_i, a_i^H, q_i^H) : 1 ≤ i ≤ b}, where a_i^H is chosen as in Equation 2, and q_i^H is selected to be the hardest negative question for a_i^H:

q_i^H = argmax_{1 ≤ j ≤ b, a_i^H ∉ A_{q_j}} S(q_j, a_i^H)    (4)

During inference, for a given question, we predict the highest scoring answer, provided the score exceeds a threshold. The optimal threshold is obtained using performance on the development set.
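The quadruplet objective with in-batch hard negative answers and questions can be sketched as follows. As before, this is a simplified numpy illustration assuming one positive answer per question, L2-normalized embeddings, and a batch size of at least 3 (so a valid negative question exists); the margins m1 and m2 are illustrative values, not the paper's tuned hyperparameters.

```python
import numpy as np

def batch_hard_quadruplet_loss(q_emb, a_emb, m1=0.2, m2=0.1):
    # q_emb, a_emb: (b, d) L2-normalized embeddings; a_emb[i] answers q_emb[i].
    b = len(q_emb)
    sims = q_emb @ a_emb.T                    # sims[i, j] = S(q_i, a_j)
    pos = np.diag(sims)                       # S(q_i, a_i)

    # Hardest in-batch negative answer a_i^H for each question.
    neg = sims.copy()
    np.fill_diagonal(neg, -np.inf)
    hard_idx = neg.argmax(axis=1)
    hard_neg = neg[np.arange(b), hard_idx]    # S(q_i, a_i^H)

    # Hardest negative question q_i^H for each a_i^H: the most similar
    # in-batch question that is neither q_i nor the question a_i^H answers.
    M = sims[:, hard_idx].copy()              # M[j, i] = S(q_j, a_i^H)
    M[np.arange(b), np.arange(b)] = -np.inf   # exclude j = i
    M[hard_idx, np.arange(b)] = -np.inf       # exclude a_i^H's own question
    hard_q = M.max(axis=0)                    # S(q_i^H, a_i^H)

    # Triplet-style term plus the cross-probe term that pushes
    # negative pairs with different probes below the positive pair.
    losses = (np.maximum(0.0, m1 - pos + hard_neg)
              + np.maximum(0.0, m2 - pos + hard_q))
    return float(losses.mean())

# Orthogonal toy batch (each question matches exactly its own answer).
loss_good = batch_hard_quadruplet_loss(np.eye(3), np.eye(3))
# Misranked batch: answers shuffled away from their questions.
loss_bad = batch_hard_quadruplet_loss(np.eye(3), np.eye(3)[::-1])
```

The second hinge term is what encourages a single global threshold on the score S to separate valid from invalid question-answer pairs, which is then tuned on the development set for triggering.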

Results and Analysis
We present quantitative results using MRR and P@1 for answer selection on InsuranceQA, LargeQA and SelQA, and F1 score (Yang et al., 2015) for answer triggering on SelQA, followed by an ablation study of the proposed method for answer triggering.

Experimental Setup
Following Tan et al. (2015), we experiment with an LSTM-CNN model, where a CNN layer is employed on top of LSTM encoding of the input sentence, followed by Max Pooling to obtain a fixed length representation for the sentence.
Motivated by the performance of bag-of-words-like sentence representations (Arora et al., 2016), we also experiment with distributional bag-of-words models, where sentence embeddings are obtained by pooling across the dimensions of the corresponding word embeddings. In particular, we used Max-Pooling and Max-Min-Pooling, the latter being obtained by concatenating the outputs of Max and Min Pooling.
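These pooling operations are straightforward to state concretely. In the sketch below, the input is a sentence's word-embedding matrix of shape (num_words, dim), which in practice would come from pretrained word vectors:

```python
import numpy as np

def max_pool(word_embs):
    # Element-wise max over word embeddings: (num_words, dim) -> (dim,).
    return np.asarray(word_embs, dtype=float).max(axis=0)

def max_min_pool(word_embs):
    # Concatenation of element-wise max- and min-pooling -> (2 * dim,).
    w = np.asarray(word_embs, dtype=float)
    return np.concatenate([w.max(axis=0), w.min(axis=0)])

# Toy sentence of two words with 2-dimensional embeddings.
emb = max_min_pool([[1.0, -2.0], [3.0, 0.0]])
```

Max-Min-Pooling doubles the embedding dimensionality but keeps the model parameter-free apart from the word embeddings themselves.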
Training details are available in Appendix A.

Answer Selection
InsuranceQA: The LSTM-CNN model coupled with hard negatives outperforms the best reported numbers on both test sets by a significant margin (1.6 and 2.7 points gain in P@1, with p-values of 0.066 and 0.008 on Test1 and Test2 respectively) on InsuranceQA (Table 2). Finally, the Transformer encoder model with hard negatives provides significant performance gains for the task (3.9 and 7.0 points gain in P@1), while still learning with siamese networks without any interaction between question and answer representations.

LargeQA: The gains due to hard negatives are also observed on LargeQA (Table 3), with an improvement of 2.5 in P@1 and 0.015 in MRR.
Key Observations: First, siamese bag-of-words models, coupled with hard negatives, are competitive with models which rely on various forms of question-answer interaction. Second, siamese networks coupled with hard negatives are particularly suited for domain-specific QA (InsuranceQA). We believe open-domain QA has a greater need for interaction between question and answer representations. Third, we believe employing hard negatives is especially useful for the Transformer encoder model, which has a much larger set of parameters. With the Transformer encoder model, there is an improvement of 8.5 points on the InsuranceQA Test1 set when using hard negatives as opposed to random negatives. With the LSTM-CNN model, the corresponding gain is 3.0 points.

Answer Triggering
Method                                                Dev F1   Test F1
CNN (Jurczyk et al., 2016)                             49.16     50.89
RNN + AP (Jurczyk et al., 2016)                        44.02     45.67
Max-Min-Pooling + Quadruplet loss + Hard negatives     54.95     53.21

Figure 1: Ablation study for answer triggering. The columns on the left indicate whether online selection of the hardest negative answer/question was done. To obtain the best results with quadruplet loss, it is critical to sample both hard negative answers and hard negative questions.
An ablation study (Figure 1) reveals that selecting both the hardest negative answers and the hardest negative questions is needed to obtain the majority of the improvement. Quadruplet loss with the Max-Min-Pooling model, even without hard negatives, is competitive with baselines (Table 5), justifying its usage for answer triggering.
Further, we also compared against using triplet loss with the Max-Min-Pooling model for answer triggering. We found that quadruplet loss outperforms triplet loss (F1 of 51.89 on the development set, and 50.08 on the test set) by 3.13 points in F1.

Conclusion & Future Work
We have shown that selection of hard negatives is a powerful tool for answer selection. We improve over previously reported results on recent benchmarks using siamese architectures and hard negatives, outperforming interaction-based models. We show the generality of the approach using shallow as well as deep neural network models.
For answer triggering, we presented results supporting the hypothesis that quadruplet loss with hard negatives is suitable for the task, and improves upon previous baselines. Our ablation study confirms the importance of using hard negatives.
As future work, we plan to further investigate the generality of the approach with other tasks and base models.