AVA: an Automatic eValuation Approach to Question Answering Systems

We introduce AVA, an automatic evaluation approach for Question Answering, which, given a set of questions associated with Gold Standard answers, can estimate system Accuracy. AVA uses Transformer-based language models to encode question, answer, and reference text. This allows for effectively measuring the similarity between the reference and an automatic answer, biased towards the question semantics. To design, train, and test AVA, we built multiple large training, development, and test sets on both public and industrial benchmarks. Our innovative solutions achieve up to 74.7% in F1 score in predicting human judgement for single answers. Additionally, AVA can be used to evaluate the overall system Accuracy, with an RMSE ranging from 0.02 to 0.09 depending on the availability of multiple references.


Introduction
Accuracy evaluation is essential both to guide system development and to estimate system quality, which is important for researchers, developers, and users. It is often conducted using benchmarking datasets, containing a data sample, possibly representative of the target data distribution, provided with Gold Standard (GS) labels (typically produced by a human annotation process). The evaluation is done by comparing the system output with the expected labels using some metric.
This approach unfortunately falls short when dealing with generation tasks, for which the system output may span a large, possibly infinite, set of correct items. For example, in the case of Question Answering (QA) systems, the set of correct answers for the question, Where is Rome located?, is large. As it is impossible, not least for cost reasons, to annotate all possible pieces of system output, the standard approach is to manually re-evaluate the new output of the system. This dramatically limits the experimentation velocity, while significantly increasing the development costs.
Another viable solution in specific domains consists in automatically generating an evaluation score between the system and the reference answers, which correlates with human judgement. The BLEU score, for example, is one popular measure in Machine Translation (Papineni et al., 2002). This, however, can only be applied to specific tasks, and even in those cases it typically shows limitations (Way, 2018). As a consequence, there is active research on learning methods to automatically evaluate MT systems (Ma et al., 2019), while human evaluation has become a requirement in machine translation benchmarking (Barrault et al., 2019).
QA would clearly benefit from a similar approach, but its automatic evaluation is technically more complex for several reasons. First, segment-overlap metrics such as BLEU, METEOR, or ROUGE do not work, since the correctness of an answer only loosely depends on the match between the reference and candidate answers. For example, of two text candidates, one can be correct and the other incorrect even if they differ by only one word (or even one character): for the question, Who was the 43rd president of the USA?, a correct answer is George W. Bush, while the very similar answer, George H. W. Bush, is wrong.
Second, the matching between the answer candidate and the reference must be carried out at the semantic level, and it is radically affected by the question semantics. For example, match(t, r | q_1) can be true while match(t, r | q_2) is false, where t and r are an answer candidate and a reference, and q_1 and q_2 are two different questions. This can especially happen for so-called non-factoid questions, e.g., those asking for a description, opinion, manner, etc., which are typically answered by fairly long explanatory text. Table 1 shows an example of a non-factoid question:

Question: What does cause left arm pain?
Reference: Arm pain can be caused by a wide variety of problems, ranging from joint injuries to compressed nerves; if it radiates into your left arm, it can even be a sign of a heart attack.
Answer 1: It is possible for left arm pain to be caused from straining the muscles of the arm, a pending heart attack, or it can also be caused from indigestion.
Answer 2: Anxiety can cause muscles in the arm to become tense, and that tension could lead to pain.
Answer 3: In many cases, arm pain actually originates from a muscular problem in your neck or upper spine.

In this paper, we study the design of models for measuring the Accuracy of QA systems. In particular, we design several pre-trained Transformer models (Devlin et al., 2018; Liu et al., 2019) that encode the triple of question q, candidate t, and reference r in different ways.
Most importantly, we built (i) two datasets for training and testing the point-wise estimation of QA system output, i.e., judging whether an answer is correct or not, given a GS answer; and (ii) two datasets constituted by a set of outputs from several QA systems, for which AVA is supposed to estimate the Accuracy.
The results show a high Accuracy for point-wise models, up to 75%. Regarding the overall Accuracy estimation, AVA can almost always replicate the ranking of systems, in terms of Accuracy, performed by humans. Finally, the RMSE with respect to human evaluation depends on the dataset, ranging from 2% to 10%, with an acceptable standard deviation, lower than 3-4%.
The structure of the paper is as follows: we begin with the description of the problem in Sec. 3. This is followed by the details of the model design and data construction, which are key aspects for system development, in Sections 4 and 5. We study the performance of our models in three different evaluation scenarios in Sec. 6.

Related Work
Automatic evaluation has been an interesting research topic for decades (Papineni et al., 2002; Magnini et al., 2002). There are two typical strategies to design an automatic evaluator: supervised and unsupervised. In machine translation, for example, BLEU (Papineni et al., 2002) has been a very popular unsupervised evaluation method for the task.

Table 2: An example of input data.
q: What is the population of California?
r: With slightly more than 39 million people (according to 2016 estimates), California is the nation's most populous state; its population is almost one and a half times that of second-place Texas (28 million).
s: 39 million
t: The resident population of California has been steadily increasing over the past few decades and has increased to 39.56 million people in 2018.
There are also other supervised methods proposed recently, most notably by Ma et al. (2019). For dialog systems, neural-based automatic evaluators have also been studied (Ghazarian et al., 2019; Lowe et al., 2017; Tao et al., 2017; Kannan and Vinyals, 2017). QA has been studied since the early literature (Green et al., 1961), and has recently been used to evaluate a summarization task (Eyal et al., 2019). Automatic evaluation for QA was addressed by Magnini et al. (2002) and also for multiple subdomain QA systems (Leidner and Callison-Burch, 2003; Lin and Demner-Fushman, 2006; Shah and Pomerantz, 2010; Gunawardena et al., 2015). However, little progress has been made in the past two decades towards a standard method. Automating QA evaluation is still an open problem, and there is no recent work supporting it.

Problem Definition
We target the automatic evaluation of QA systems, for which system Accuracy (the percentage of correct answers) is the most important measure. We also consider more complex measures, such as MAP and MRR, in the context of Answer Sentence Reranking/Selection.

Answer Sentence Selection (AS2)
The task of reranking the answer-sentence candidates provided by a retrieval engine can be modeled with a classifier scoring the candidates. Let q be a question and T_q = {t_1, ..., t_n} a set of answer-sentence candidates for q; we define R as a ranking function, which orders the candidates in T_q according to a score, p(q, t_i), indicating the probability that t_i is a correct answer for q. Popular methods modeling R include Compare-Aggregate (Yoon et al., 2019), inter-weighted alignment networks (Shen et al., 2017), and BERT (Garg et al., 2020).
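As a rough illustration, the ranking function R can be sketched as follows; the scoring function here is a hypothetical token-overlap stand-in for the trained models (Compare-Aggregate, BERT) that the paper actually uses.

```python
# A minimal sketch of the AS2 ranking step: given a scoring function
# p(q, t) (here a toy stand-in), order the candidates by score.
def rank_candidates(question, candidates, score_fn):
    """Return the candidates sorted by p(q, t_i), highest first."""
    scored = [(score_fn(question, t), t) for t in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in scored]

# Toy score: token overlap with the question (a real AS2 system would
# use a trained model such as BERT here).
def toy_score(q, t):
    q_tokens, t_tokens = set(q.lower().split()), set(t.lower().split())
    return len(q_tokens & t_tokens) / max(len(t_tokens), 1)

ranked = rank_candidates(
    "where is rome located",
    ["Rome is located in Italy.", "Paris is in France."],
    toy_score,
)
```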

Automatic Evaluation of QA Accuracy
The evaluation of system Accuracy can be approached in two ways: (i) evaluation of the single answer provided by the target system, which we call point-wise evaluation; and (ii) the aggregated evaluation of a set of questions, which we call system-wise evaluation.
We define the former as a function, A(q, r, t_i) → {0, 1}, where r is a reference (GS) answer and the output is simply a correct/incorrect label. Table 2 shows an example question associated with a reference, a system answer, and a short answer.
A configuration of A is applied to compute the final Accuracy of a system using an aggregator function. In other words, to estimate the overall system Accuracy, we simply treat the point-wise AVA predictions as if they were the GS. For example, in the case of the Accuracy measure, we simply average the AVA predictions over the question set Q, i.e., Accuracy = (1/|Q|) Σ_{q∈Q} A(q, r, t_q), where t_q is the answer the system selects for q. (Here s denotes a short answer, e.g., as used in machine reading; it is an optional input, which we only use for a baseline, described in Section 4.1.)
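The aggregation step can be sketched as follows; `ava_predict` is a hypothetical stand-in for the point-wise evaluator A described above.

```python
# A minimal sketch of the system-wise aggregation: the point-wise
# evaluator A(q, r, t) returns 0/1, and the estimated Accuracy is the
# mean over the test questions.
def estimate_accuracy(examples, ava_predict):
    """examples: list of (question, reference, system_answer) triples."""
    labels = [ava_predict(q, r, t) for q, r, t in examples]
    return sum(labels) / len(labels)

# Toy predictor: exact match with the reference (AVA instead uses a
# Transformer-based semantic model).
def toy_predict(q, r, t):
    return int(r == t)

acc = estimate_accuracy(
    [("q1", "a", "a"), ("q2", "b", "c"), ("q3", "d", "d"), ("q4", "e", "f")],
    toy_predict,
)  # 2 correct out of 4
```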

Model for AVA
The main intuition behind building an automatic evaluator for QA is that the model should (i) capture the same information a standard QA system uses, while (ii) exploiting the semantic similarity between the system answer and the reference, biased by the information asked by the question. We build two types of models: (i) a linear classifier, which is more interpretable and can help us verify our design hypothesis; and (ii) Transformer-based models, which have been successfully used in several language-understanding tasks.

Linear Classifier
Given an input example, (q, r, s, t), our classifier uses the following similarity features: x_1 = sim-token(s, r), x_2 = sim-text(r, t), x_3 = sim-text(r, q), and x_4 = sim-text(q, t), where sim-token between s and r is a binary feature testing if r is included in s, sim-text is a sort of Jaccard similarity, sim-text(a, b) = |tok(a) ∩ tok(b)| / |tok(a) ∪ tok(b)|, and tok(s) is a function that splits s into tokens.
Let x = f(q, r, s, t) = (x_1, x_2, x_3, x_4) be the similarity feature vector describing our evaluation tuple. We train a weight vector w on a dataset D = {d_i = (x_i, l_i)} using an SVM, where l_i is a binary label indicating whether t answers q or not. We compute the point-wise evaluation of t as the test x · w > α, where α is a threshold trading off Precision for Recall, as in standard classification approaches.
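A minimal sketch of the feature extraction, assuming sim-text is plain Jaccard similarity over token sets (the paper only says it is "a sort of Jaccard similarity") and that tokenization is whitespace splitting; the containment direction in sim-token follows the text's description.

```python
# Assumed tokenizer: lowercase whitespace splitting.
def tok(s):
    return set(s.lower().split())

# Assumed Jaccard similarity over token sets.
def sim_text(a, b):
    ta, tb = tok(a), tok(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def sim_token(s, r):
    # Binary feature: the text describes this as testing whether r is
    # included in s (s is the short answer, r the reference).
    return float(r.lower() in s.lower())

def features(q, r, s, t):
    return [sim_token(s, r), sim_text(r, t), sim_text(r, q), sim_text(q, t)]

# Toy values, chosen so the containment feature fires; a real reference
# would usually be a full sentence.
x = features(
    "what is the population of california",
    "39 million",
    "about 39 million people",
    "California reached 39.56 million people in 2018",
)
# The decision is then the test x . w > alpha for a learned weight vector w.
```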

Transformer-based models
Transformer-based architectures have proved to be powerful language models, which can capture complex similarity patterns. Thus, they are suitable methods for improving the basic approach described in the previous section. Following the linear classifier modeling, we propose three different ways to exploit the relations among the members of the tuple (q, r, s, t).
Let B be a pre-trained language model, e.g., the recently proposed BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), XLNet (Yang et al., 2019), or ALBERT (Lan et al., 2020). We use a language model to compute the embedding representation of the tuple members: B(a, a') → x ∈ R^d, where (a, a') is a sentence pair, x is the output representation of the pair, and d is the dimension of the output representation. The classification layer is a standard feedforward network, A(x) = Wx + b, where W and b are parameters we learn by fine-tuning the model on a dataset D.
We describe different designs for A as follows.

A_0: Text-Pair Embedding. We build a language-model representation for pairs of members of the tuple x = (q, r, t) by simply inputting them to a Transformer model B in the standard sentence-pair fashion. We consider four different configurations of A_0: one for each of the pairs (q, r), (q, t), and (r, t), and one for the triplet, (q, r, t), modeled as the concatenation of the previous three. The representation for each pair p is produced by a different and independent BERT instance, B_p. More formally, we have the three models A_0(B_p(p)), ∀p ∈ D_0, where D_0 = {(q, r), (q, t), (r, t)}. Additionally, we design a model over (q, r, t) as A_0(∪_{p∈D_0} B_p(p)), where ∪ denotes concatenation of the representations. We do not use the short answer, s, as its contribution is minimal when using powerful Transformer-based models.

A_1: Improved Text-Triple Embedding. The models of the previous section are limited to pair representations. We improve on this by designing B models that can capture pattern dependencies across q, r, and t. To achieve this, we concatenate pairs of the three pieces of text above, indicating string concatenation with the • operator. Specifically, we consider D_1 = {(q, r•t), (r, q•t), (t, q•r)} and propose the following A_1 models. As before, we have the individual models, A_1(B_p(p)), ∀p ∈ D_1, as well as the combined model, A_1(∪_{p∈D_1} B_p(p)), where again we use different instances of B and fine-tune them together accordingly.
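The construction of the A_1 inputs can be sketched as plain string manipulation; the separator token used to realize the • concatenation is an assumption, not something the text specifies.

```python
# Assumed separator for the "•" string concatenation.
SEP = " [SEP] "

def build_a1_pairs(q, r, t):
    """Return D_1 = {(q, r•t), (r, q•t), (t, q•r)} as sentence pairs,
    each of which is fed to its own Transformer instance B_p."""
    return [
        (q, r + SEP + t),
        (r, q + SEP + t),
        (t, q + SEP + r),
    ]

pairs = build_a1_pairs("question text", "reference text", "candidate text")
```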
A_2: Peer Attention for Pairs of Transformer-based Models. Our previous designs instantiate a different B for each pair, learning the feature representations of the target pair, and the relations between its members, during the fine-tuning process. This individual optimization prevents the model from capturing patterns across the representations of different pairs, as there is no strong connection between the B instances. Indeed, the combination of feature representations only happens in the last classification layer.
We propose peer-attention to encourage feature transfer between different B instances. The idea, similar to the encoder-decoder setting in Transformer-based models (Vaswani et al., 2017), is to introduce an additional decoding step for each pair. Figure 1 depicts our proposed setting for learning the representations of two different pairs, a_0 = (a, a') and g_0 = (g, g'). The standard approach learns representations for these two in one pass, via B_{a_0} and B_{g_0}. In the peer-attention setting, the representation output after processing one pair, captured in H_[CLS], is input to a second pass of fine-tuning for the other pair. Thus, the representation of one pair can attend over the representation of the other pair during the decoding stage. This allows the feature representations from each B instance to be shared during both the training and prediction stages.

Dataset Creation
We describe the datasets we created to develop AVA. First, we build two large-scale datasets for the standard QA task, namely AS2-NQ and AS2-GPD, derived from the Google Natural Questions dataset and our internal dataset, respectively. The construction of these datasets is described in Section 5.1. Second, we describe our approach to generating labelled data for AVA from the QA datasets, in Section 5.2. Finally, we build an additional dataset constituted by a set of systems and their output on target test sets. This can be used to evaluate the ability of AVA to estimate end-to-end system performance (system-wise evaluation), as described in Section 5.3.

AS2-NQ: AS2 Dataset from NQ
Google Natural Questions (NQ) is a large-scale dataset for the machine reading task (Kwiatkowski et al., 2019). Each question is associated with a Wikipedia page and at least one long paragraph (long answer) that contains the answer to the question. The long answer may contain additional annotations of a short answer, a succinct extractive answer from the long paragraph. A long answer usually consists of multiple sentences, thus NQ is not directly applicable to our setting.
We create AS2-NQ from NQ by leveraging both long-answer and short-answer annotations. In particular, the correct answers for a given question are the sentences in the long-answer paragraphs that contain annotated short answers. The other sentences from the Wikipedia page are considered incorrect. The negative examples can be of the following types: (i) sentences that are in the long answer but do not contain annotated short answers (it is possible that these sentences still contain the short answer); (ii) sentences that are not part of the long answer but contain a short answer as a subphrase (such occurrences are generally accidental); and (iii) all the other sentences in the document.
The generation of negative examples impacts the robustness of the trained model when selecting the correct answer out of the incorrect ones. AS2-NQ has four labels that describe the possible confusion levels of a sentence candidate. We apply the same processing to both the training and development sets of NQ. This dataset enables an effective transfer step (Garg et al., 2020). Table 3 shows the statistics of the dataset.

AS2-GPD: General Purpose Dataset
A search engine using a large index can retrieve more relevant documents than those available in Wikipedia. Thus, we retrieved likely-relevant candidates as follows: we (i) retrieved the top 500 relevant documents; (ii) automatically extracted the top 100 sentences, ranked by a BERT model over all sentences of the documents; and (iii) had all the top 100 sentences manually annotated as correct or incorrect answers. This process does not guarantee that we have all correct answers, but the probability of missing them is much lower than for other datasets. In addition, this dataset is richer than AS2-NQ, as it consists of answers from multiple sources. Furthermore, the average number of answers per question is also higher than in AS2-NQ. Table 4 shows the statistics of the dataset.

AVA Datasets
The AS2 datasets from the previous section consist of a set of questions Q. Each q ∈ Q has candidates T_q = {t_1, ..., t_n}, comprised of both correct answers C_q and incorrect answers C̄_q, with T_q = C_q ∪ C̄_q. We construct the dataset for point-wise automatic evaluation (described in Section 4) as follows: to have positive and negative examples for AVA, we first filter the QA dataset to only keep questions that have at least two correct answers. This is critical to build positive and negative examples.
Formally, let (q, r, t, l) be an input for AVA. We build positive examples as AVA-Positives = {(q, r, t, 1) : (r, t) ∈ C_q × C_q, r ≠ t}. We also build negative examples as AVA-Negatives = {(q, r, t, 0) : (r, t) ∈ C_q × C̄_q}. We create AVA-NQ and AVA-GPD from the QA datasets AS2-NQ and AS2-GPD, respectively. The statistics are presented on the right side of Tables 3 and 4.
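A sketch of this example-construction procedure, assuming the positives pair two distinct correct answers (one as reference, one as candidate) and the negatives pair a correct reference with an incorrect candidate:

```python
from itertools import permutations, product

def build_ava_examples(question, correct, incorrect):
    """Build AVA-Positives and AVA-Negatives for one question.
    correct / incorrect are the answer sets C_q and (its complement)."""
    # Ordered pairs of distinct correct answers: (reference, candidate, 1).
    positives = [(question, r, t, 1) for r, t in permutations(correct, 2)]
    # Correct reference paired with an incorrect candidate: label 0.
    negatives = [(question, r, t, 0) for r, t in product(correct, incorrect)]
    return positives, negatives

pos, neg = build_ava_examples(
    "q", correct=["ans A", "ans B"], incorrect=["bad C"]
)
# Questions with fewer than two correct answers yield no positives,
# hence the filtering step described above.
```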

AVA Datasets from Systems (ADS)
To test AVA at the level of overall system Accuracy, we need a sample of systems and their output on different test sets. We create a dataset of candidate answers collected from eight systems for a set of 1,340 questions. The questions were sampled from an anonymized set of user utterances. We only considered information-inquiry questions.

Experiments
We study the following performance aspects of AVA in predicting: (i) the correctness of the individual answers provided by systems to questions (point-wise estimation); and (ii) the overall system Accuracy. We evaluated QA Accuracy as well as passage-reranking performance, in comparison with the human labeling.
The first aspect studies the capacity of our different machine learning models, whereas the second provides a perspective on the practical use of AVA to develop QA systems.

Datasets
We train and test models using the AVA-NQ and AVA-GPD datasets, described in Section 5.2. We also evaluate the point-wise performance on the WikiQA and TREC-QA datasets.

Models
Table 5 summarizes the configurations we consider for training and testing. For the linear classifier baseline, we built a vanilla SVM classifier using scikit-learn. We set the probability parameter to enable Platt-scaling calibration on the SVM score.
We developed our Transformer-based evaluators on top of HuggingFace's Transformers library (Wolf et al., 2019). We use RoBERTa-Base as the initial pre-trained model for each B instance (Liu et al., 2019). We use the default hyperparameter setting of typical GLUE trainings. This includes (i) the AdamW variant (Loshchilov and Hutter, 2017) as optimizer, (ii) a learning rate of 1e-06 for all fine-tuning exercises, and (iii) a maximum sequence length of 128. The number of iterations is set to 2, and we use a development set to enable early stopping, based on the F1 measure, after the first iteration. We fix the same batch-size setting across experiments to avoid possible performance discrepancies caused by different batch sizes.

Metrics
We study the performance of AVA in evaluating passage reranker systems, which differ not only in methods but also in domains and application settings. We employ the following evaluation strategies to benchmark AVA.

Point-wise Evaluation
We study the performance of AVA on point-wise estimation using traditional Precision, Recall, and F1. These metrics indicate the performance of AVA in predicting whether an answer candidate is correct or not.

System-wise evaluation
We measure AVA when used with a simple aggregator to compute the overall system performance over a test set. The metrics we consider are Precision-at-1 (P@1), Mean Average Precision (MAP), and Mean Reciprocal Rank (MRR) when computing the performance on TREC-QA and WikiQA, since these datasets contain ranks of answers. In contrast, we only use P@1 on the ADS dataset, as it only includes the selected answer for each system. We use Kendall's Tau-b to measure the correlation between the ranking produced by AVA and the one available in the GS: τ = (c − d) / (c + d), where c and d are the numbers of concordant and discordant pairs between the two rankings.
We additionally analyze the gap between each performance figure given by AVA and the one computed with the GS, using the root mean square error, RMSE = sqrt((1/n) Σ_i (a_i − h_i)^2), where a and h are the measures given by AVA and by human annotation, respectively.
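Both agreement measures can be sketched directly from their definitions; note that this simple tau ignores ties, whereas the paper uses the tau-b variant, which corrects for them.

```python
import math

# Kendall's tau from concordant (c) and discordant (d) pairs:
# tau = (c - d) / (c + d). Ties are ignored here (unlike tau-b).
def kendall_tau(ranking_a, ranking_b):
    n = len(ranking_a)
    c = d = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (ranking_a[i] - ranking_a[j]) * (ranking_b[i] - ranking_b[j])
            if sign > 0:
                c += 1
            elif sign < 0:
                d += 1
    return (c - d) / (c + d)

# RMSE between AVA-estimated measures and human-annotated measures.
def rmse(ava_scores, human_scores):
    return math.sqrt(
        sum((a - h) ** 2 for a, h in zip(ava_scores, human_scores))
        / len(ava_scores)
    )

tau = kendall_tau([1, 2, 3, 4], [1, 2, 4, 3])  # one discordant pair
err = rmse([0.70, 0.65], [0.72, 0.60])
```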

Results on Point-wise Evaluation
We evaluate the performance of AVA in predicting whether an answer t is correct for a question q, given a reference r. Table 6 shows the results: column 1 reports the names of the systems described in Section 4, while columns 2 and 3 show the F1 measured on AVA-NQ and AVA-GPD, respectively. We note that: (i) the F1 on AVA-GPD is much higher than on AVA-NQ, because the former dataset is much larger than the latter; (ii) A_0({(q, r)}) cannot predict if an answer is correct, as it does not use the answer in its representation, thus its Accuracy is lower than 7%; (iii) A_0({(r, t)}) is already a reasonable model, mainly based on paraphrasing between r and t; (iv) A_0({(q, t)}) is also a good model, as it is as powerful as a QA system; (v) the A_1 models that take the entire triplet q, r, and t are the most accurate, achieving an F1 of almost 74%; (vi) the use of combinations of triplets, e.g., A_1({(r, q•t), (t, q•r)}), provides an even more accurate model; and finally, (vii) the peer-attention model, A_2((r, q•t), (t, q•r)), reaches almost 75%.

Results on system-wise evaluation
We evaluate the ability of AVA to predict the Accuracy of QA systems, as well as their performance on answer-sentence reranking tasks. We conduct two evaluation studies: one with two public datasets, TREC-QA and WikiQA, and one with the internal ADS dataset.

Results on public datasets
For TREC-QA and WikiQA, we applied a bag of different models to the development and test sets and compared the results with the performance measured by AVA, using one of the best models according to the point-wise evaluation, i.e., A_2((r, q•t), (t, q•r)).
More specifically, we apply each model m to select the best answer t from the list of candidates for q in the dataset. We first compute the performance of model m based on the provided annotations; the metrics include Accuracy or Precision-at-1 (P@1), MAP, and MRR. We then run AVA on (q, t), using the GS answers of q as references r. The final AVA score is the average of the AVA scores computed over the different references for q. Before computing the Accuracy on the test set, we tune the AVA threshold, on the development set of each dataset, to minimize the RMSE between the Accuracy (P@1) measured by AVA and the one computed with the GS.
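The threshold-tuning step can be sketched as a grid search; with a single system and a single measure, minimizing the RMSE on the dev set reduces to minimizing the absolute gap between the AVA-estimated and GS Accuracy.

```python
# Sweep a decision threshold on the dev set and keep the one that
# minimizes the gap between AVA's estimated Accuracy and the GS one.
def tune_threshold(ava_scores, gold_labels, grid=None):
    grid = grid or [i / 100 for i in range(1, 100)]
    gold_acc = sum(gold_labels) / len(gold_labels)
    best_thr, best_gap = None, float("inf")
    for thr in grid:
        # Accuracy as estimated by AVA at this threshold.
        ava_acc = sum(s > thr for s in ava_scores) / len(ava_scores)
        gap = abs(ava_acc - gold_acc)
        if gap < best_gap:
            best_thr, best_gap = thr, gap
    return best_thr, best_gap

# Toy dev set: AVA scores for four answers and their human 0/1 labels.
thr, gap = tune_threshold([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
```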
We use these thresholds to evaluate the results on the test sets. We considered six different models, including one Compare-Aggregate (CNN) trained model and five other Transformer-based models, four of which are collected from public resources (Garg et al., 2020). These models differ in architecture and training data, thus their output is rather different. We removed questions that have no correct or no incorrect answers.
Table 7 reports the overall results, averaged over the six models. We note that: (i) setting the right threshold on the dev. set, the error on P@1 is 0; (ii) this is not the case for MAP, which is a much harder value to predict, as it requires estimating an entire ranking; (iii) on the TREC-QA test set, AVA has an error ranging from 2 to 4.1 points on any measure; (iv) on the WikiQA test set, the error is higher, reaching 10%, probably due to the larger complexity of the questions; and (v) the std. dev. is low, suggesting that AVA can be used to estimate system performance. Additionally, we compute the Kendall's Tau-b correlation between the rankings of the six systems, sorted in order of performance (P@1), according to the GS and to AVA. We observe a perfect correlation on TREC-QA and a rather high correlation on WikiQA. This means that AVA can be used to determine if a model is better than another, which is desirable when developing new systems. The low p-values indicate the reliability of our results.
Finally, Table 8 shows the comparison between the performance evaluated with the GS (Human) and with AVA for all six models. The predictions of AVA are close to those from human judgement.

Results on ADS
We use the ADS dataset in this evaluation. The task is more challenging, as AVA only receives the one best answer each system selected from its own candidate pool, and there was no control over the sources of the candidates. Table 9 shows the results. We note a lower correlation, due to the fact that the eight evaluated systems have very close Accuracy. On the other hand, the RMSE is rather low, 3.1%, and the std. dev. is also acceptable, < 0.02, suggesting an error of less than 7% with probability > 95%.

Qualitative Analysis
Table 10 reports some example questions from the TREC-QA test set, the top candidate selected by the TANDA system (Garg et al., 2020), the classification score of the latter, and the AVA score. AVA judges an answer correct if the score is larger than 0.5. We note that even when the score of the TANDA system is low, AVA may assign the answer a very high score, indicating that it is correct (see the first three examples). Conversely, a wrong answer can be classified as such by AVA, even if TANDA assigned it a very large score (see the last two examples).

Conclusion
We presented AVA, an automatic evaluation method for QA systems. Specifically, we discussed our data collection strategy and model design for enabling AVA development. First, we collected seven different datasets, classified into three different types, which we used to develop AVA in different stages. Second, we proposed different Transformer-based designs for AVA to exploit the feature signals relevant to the problem. Our extensive experimentation has shown the effectiveness of AVA for different types of evaluation: point-wise and system-wise, over Accuracy, MAP, and MRR.

Table 1: Example of a non-factoid question and three different valid answers, which share similarity with respect to the question. However, if the question were, what may cause anxiety?, Answer 1 and Answer 3 would intuitively look less related than Answer 2.

Table 5: The AVA configurations used in training.

Table 8: Details of the system-wise evaluation on TREC-QA and WikiQA, using the AVA model A_2 and the GS.

Table 9: Details of the system-wise evaluation on the ADS benchmark dataset.

Table 10: Examples showing that AVA can detect failures of the state-of-the-art model by Garg et al. (2020).