Machine Translation Evaluation Meets Community Question Answering

We explore the applicability of machine translation evaluation (MTE) methods to a very different problem: answer ranking in community Question Answering. In particular, we adopt a pairwise neural network (NN) architecture, which incorporates MTE features, as well as rich syntactic and semantic embeddings, and which efficiently models complex non-linear interactions. The evaluation results show state-of-the-art performance, with sizeable contribution from both the MTE features and from the pairwise NN architecture.


Introduction and Motivation
In a community Question Answering (cQA) task, we are given a question from a community forum and a thread of associated text comments intended to answer the given question; the goal is to rank the comments according to their appropriateness to the question. cQA forum threads are noisy (e.g., because over time people tend to engage in discussion and to deviate from the original question), so many comments are not answers to the question. The challenge thus lies in learning to rank all good comments above all bad ones.
Here, we adopt the definition and the datasets from SemEval-2016 Task 3 on "Community Question Answering", focusing on subtask A (Question-Comment Similarity) only. See the task description paper and the task website for more detail. An annotated example is shown in Figure 1.
In this paper, we tackle the task from a novel perspective: by using ideas from machine translation evaluation (MTE) to decide on the quality of a comment. In particular, we extend our MTE neural network framework from (Guzmán et al., 2015), showing that it is applicable to the cQA task as well. We believe that this neural network is interesting for the cQA problem because: (i) it works in a pairwise fashion, i.e., given two translation hypotheses and a reference translation to compare to, the network decides which translation hypothesis is better, which is appropriate for a ranking problem; (ii) it allows for an easy incorporation of rich syntactic and semantic embedded representations of the input texts, and it efficiently models complex non-linear relationships between them; (iii) it uses a number of machine translation evaluation measures that have not been explored for the cQA task before, e.g., TER (Snover et al., 2006), METEOR (Lavie and Denkowski, 2009), and BLEU (Papineni et al., 2002).
The analogy we apply to adapt the neural MTE architecture to the cQA problem is the following: given two comments $c_1$ and $c_2$ from the question thread, which play the role of the two competing translation hypotheses, we have to decide whether $c_1$ is a better answer than $c_2$ to question $q$, which plays the role of the translation reference. If we have a function $f(q, c_1, c_2)$ to make this decision, then we can rank the finite list of comments in the thread by comparing all possible pairs and by accumulating for each comment the scores for it given by $f$.
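The ranking-by-accumulation idea can be illustrated with a minimal sketch. It assumes that the decision function returns a score in [0, 1] for "the first comment is better than the second"; the names `rank_comments` and `toy_f` are hypothetical, and `toy_f` is only a word-overlap stand-in for the trained network, not the model described in this paper.

```python
from itertools import permutations
from typing import Callable, Dict, List


def rank_comments(question: str, comments: List[str],
                  f: Callable[[str, str, str], float]) -> List[str]:
    """Rank comments by summing, for each comment, the pairwise scores f(q, c_i, c_j)."""
    scores: Dict[int, float] = {i: 0.0 for i in range(len(comments))}
    # Visit every ordered pair (i, j) with i != j and credit comment i.
    for i, j in permutations(range(len(comments)), 2):
        scores[i] += f(question, comments[i], comments[j])
    order = sorted(scores, key=lambda i: scores[i], reverse=True)
    return [comments[i] for i in order]


def toy_f(q: str, c1: str, c2: str) -> float:
    """Hypothetical stand-in for the network: prefer the comment sharing more words with q."""
    q_words = set(q.lower().split())

    def overlap(c: str) -> int:
        return len(q_words & set(c.lower().split()))

    return 1.0 if overlap(c1) > overlap(c2) else 0.0


ranked = rank_comments(
    "where can I buy a car",
    ["buy a car at the souq", "lol", "a car dealer"],
    toy_f,
)
```

With n comments in a thread, this makes n(n-1) calls to the decision function, which is affordable for short forum threads.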
From a general point of view, MTE and the cQA task addressed in this paper seem similar: both reason about the similarity of two competing texts against a reference text in order to decide which one is better. However, there are some profound differences, which have implications on how each task is solved. In MTE, the goal is to decide whether a hypothesis translation conveys the same meaning as the reference translation. In cQA, it is to determine whether the comment is an appropriate answer to the question. Furthermore, in MTE we can expect shorter texts, which are typically much more similar. In contrast, in cQA, the question and the intended answers might differ significantly both in terms of length and in lexical content. Thus, it is not clear a priori whether the MTE network can work well to address the cQA problem. Here, we show that the analogy is not only convenient, but also that using it can yield state-of-the-art results for the cQA task.
To validate our intuition, we present a series of experiments using the publicly available SemEval-2016 Task 3 datasets, with focus on subtask A. We show that a naïve application of the MTE architecture and features on the cQA task already yields results that are largely above the task baselines. Furthermore, by adapting the models with in-domain data, and adding lightweight task-specific features, we are able to boost our system to reach state-of-the-art performance.
More interestingly, we analyze the contribution of several features and parts of the NN architecture by performing an ablation study. We observe that every single piece contributes important information to achieve the final performance. While task-specific features are crucial, other aspects of the framework are relevant as well: syntactic embeddings, machine translation evaluation measures, and pairwise training of the network.
The rest of the paper is organized as follows: Section 2 introduces some related work. Section 3 presents the overall architecture of our MTE-inspired NN framework for cQA. Section 4 summarizes the features we use in our experiments. Section 5 describes the experimental settings and presents the results. Finally, Section 6 offers further discussion and presents the main conclusions.

Related Work
Recently, many neural network (NN) models have been applied to cQA tasks: e.g., question-question similarity (dos Santos et al., 2015; Lei et al., 2016) and answer selection (Severyn and Moschitti, 2015; Wang and Nyberg, 2015; Shen et al., 2015; Feng et al., 2015; Tan et al., 2015). Most of these papers concentrate on providing advanced neural architectures in order to better model the problem at hand. However, our goal here is different: we extend and reuse an existing pairwise NN framework from a different but related problem.
There is also work that uses machine translation models as features for cQA (Berger et al., 2000; Echihabi and Marcu, 2003; Jeon et al., 2005; Soricut and Brill, 2006; Riezler et al., 2007; Li and Manandhar, 2011; Surdeanu et al., 2011; Tran et al., 2015), e.g., a variation of IBM Model 1 to compute the probability that the question is a possible "translation" of the candidate answer. Unlike that work, here we port an entire MTE framework to the cQA problem. A preliminary version of this work was presented in (Guzmán et al., 2016).
[Figure 2. Labels in the figure: sentences ($q$, $c_1$, $c_2$), embeddings ($x_q$, $x_{c_1}$, $x_{c_2}$), pairwise nodes, pairwise features, output layer.]

Neural Model for Answer Ranking
The NN model we use for answer ranking is depicted in Figure 2. It is a direct adaptation of our feed-forward NN for MTE (Guzmán et al., 2015). Technically, we have a binary classification task with input $(q, c_1, c_2)$, which should output 1 if $c_1$ is a better answer to $q$ than $c_2$, and 0 otherwise. The network computes a sigmoid function $f(q, c_1, c_2) = \mathrm{sig}(w_v^T \phi(q, c_1, c_2) + b_v)$, where $\phi(x)$ transforms the input $x$ through the hidden layer, $w_v$ are the weights from the hidden layer to the output layer, and $b_v$ is a bias term.
We first map the question and the comments to a fixed-length vector $[x_q, x_{c_1}, x_{c_2}]$ using syntactic and semantic embeddings. Then, we feed this vector as input to the neural network, which models three types of interactions, using different groups of nodes in the hidden layer. There are two evaluation groups $h_{q1}$ and $h_{q2}$ that model how good each comment $c_i$ is for the question $q$. The inputs to these groups are the concatenations $[x_q, x_{c_1}]$ and $[x_q, x_{c_2}]$, respectively. The third group of hidden nodes $h_{12}$, which we call the similarity group, models how close $c_1$ and $c_2$ are. Its input is $[x_{c_1}, x_{c_2}]$. This might be useful as highly similar comments are likely to be comparable in appropriateness, irrespective of whether they are good or bad answers in absolute terms.
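In summary, the transformation $\phi(q, c_1, c_2) = [h_{q1}, h_{q2}, h_{12}]$ can be written as

$$h_{qi} = g(W_{qi} [x_q, x_{c_i}] + b_{qi}), \quad i = 1, 2$$
$$h_{12} = g(W_{12} [x_{c_1}, x_{c_2}] + b_{12})$$

where $g(\cdot)$ is a non-linear activation function (applied component-wise), $W \in \mathbb{R}^{H \times N}$ are the associated weights between the input layer and the hidden layer, and $b$ are the corresponding bias terms.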
We use tanh as an activation function, rather than sig, to be consistent with how the word embedding vectors we use were generated.
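The hidden-layer groups and the sigmoid output described in this section can be sketched as a forward pass. This is a minimal illustration in plain Python, not the Theano implementation used in the experiments: the function names, the dictionary layout of the parameters, and the small uniform initialization range are assumptions made for the example, and the skip arcs are left out.

```python
import math
import random


def affine_tanh(W, b, x):
    """Compute tanh(W x + b) component-wise; W is a list of rows, x and b are lists."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]


def forward(x_q, x_c1, x_c2, p):
    """Return f(q, c1, c2) in (0, 1) for one triple of embedding vectors (Python lists)."""
    # Two evaluation groups: each comment paired with the question.
    h_q1 = affine_tanh(p["W_q1"], p["b_q1"], x_q + x_c1)
    h_q2 = affine_tanh(p["W_q2"], p["b_q2"], x_q + x_c2)
    # Similarity group: the two comments paired with each other.
    h_12 = affine_tanh(p["W_12"], p["b_12"], x_c1 + x_c2)
    phi = h_q1 + h_q2 + h_12  # list concatenation
    z = sum(w * h for w, h in zip(p["w_v"], phi)) + p["b_v"]
    return 1.0 / (1.0 + math.exp(-z))


def init_params(dim, hidden, seed=0):
    """Hypothetical initializer: small uniform weights, zero biases."""
    rng = random.Random(seed)
    p = {}
    for name in ("q1", "q2", "12"):
        p["W_" + name] = [[rng.uniform(-0.1, 0.1) for _ in range(2 * dim)]
                          for _ in range(hidden)]
        p["b_" + name] = [0.0] * hidden
    p["w_v"] = [rng.uniform(-0.1, 0.1) for _ in range(3 * hidden)]
    p["b_v"] = 0.0
    return p


params = init_params(dim=4, hidden=3)
x = [0.5, -0.5, 0.25, 0.0]
score = forward(x, x, x, params)
```

Each group sees only its own pair of inputs, so the three interactions are kept separate until the output layer combines them.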
The model further allows us to incorporate external sources of information in the form of skip arcs that go directly from the input to the output, skipping the hidden layer. These arcs represent pairwise similarity feature vectors between $q$ and either $c_1$ or $c_2$. In these feature vectors, we encode MT evaluation measures (e.g., TER, METEOR, and BLEU), cQA task-specific features, etc. See Section 4 for details about the features implemented as skip arcs. In the figure, we indicate these pairwise external feature sets as $\psi(q, c_1)$ and $\psi(q, c_2)$. When including the external features, the activation at the output is $f(q, c_1, c_2) = \mathrm{sig}(w_v^T [\phi(q, c_1, c_2), \psi(q, c_1), \psi(q, c_2)] + b_v)$.

A. Embedding Features
We used two types of vector-based embeddings to encode the input texts $q$, $c_1$ and $c_2$:
(1) GOOGLE VECTORS: 300-dimensional embedding vectors, trained on 100 billion words from Google News (Mikolov et al., 2013). The encoding of the full text is just the average of the word embeddings.
(2) SYNTAX: We parse the entire question/comment using the Stanford neural parser (Socher et al., 2013), and we use the final 25-dimensional vector that is produced internally as a by-product of parsing.
Also, we compute cosine similarity features with the above vectors: $\cos(q, c_1)$ and $\cos(q, c_2)$.

B. MTE features
BLEUCOMP. We further use as features various components involved in the computation of BLEU: n-gram precisions, n-gram matches, total number of n-grams (n=1,2,3,4), lengths of the hypotheses and of the reference, length ratio between them, and BLEU's brevity penalty.
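The BLEU components used as BLEUCOMP features (n-gram precisions, matches and totals, lengths, length ratio, and brevity penalty) can be sketched as follows. This is a simplified illustration under stated assumptions: whitespace tokenization, a single reference, and the standard clipped n-gram matching; the function names and feature keys are hypothetical, and the actual feature extraction may differ in tokenization and smoothing. In the cQA analogy, the comment plays the role of the hypothesis and the question plays the role of the reference.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_components(hypothesis, reference, max_n=4):
    """Return the per-order and length-based components that BLEU is built from."""
    hyp, ref = hypothesis.split(), reference.split()
    feats = {}
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Counter intersection keeps the minimum count, i.e. clipped matches.
        matches = sum((hyp_counts & ref_counts).values())
        total = sum(hyp_counts.values())
        feats[f"matches_{n}"] = matches
        feats[f"total_{n}"] = total
        feats[f"precision_{n}"] = matches / total if total else 0.0
    feats["hyp_len"] = len(hyp)
    feats["ref_len"] = len(ref)
    feats["len_ratio"] = len(hyp) / len(ref) if ref else 0.0
    if not hyp:
        feats["brevity_penalty"] = 0.0
    elif len(hyp) >= len(ref):
        feats["brevity_penalty"] = 1.0
    else:
        feats["brevity_penalty"] = math.exp(1.0 - len(ref) / len(hyp))
    return feats


feats = bleu_components("the cat sat", "the cat sat on the mat")
```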
C. Task-specific features
First, we train domain-specific vectors using WORD2VEC on all available QatarLiving data, both annotated and raw (QL VECTORS).
Second, we compute various easy task-specific features (TASK FEATURES), most of them proposed for the 2015 edition of the task (Nicosia et al., 2015).
This includes some comment-specific features: (1) number of URLs/images/emails/phone numbers; (2) number of occurrences of the string "thank"; (3) number of tokens/sentences; (4) average number of tokens; (5) type/token ratio; (6) number of nouns/verbs/adjectives/adverbs/pronouns; (7) number of positive/negative smileys; (8) number of single/double/triple exclamation/interrogation symbols; (9) number of interrogative sentences (based on parsing); (10) number of words that are not in WORD2VEC's Google News vocabulary.
We also use some question-comment pair features: (1) question to comment count ratio in terms of sentences/tokens/nouns/verbs/adjectives/adverbs/pronouns; (2) question to comment count ratio of words that are not in WORD2VEC's Google News vocabulary.
Finally, we also have two meta features: (1) is the person answering the question the one who asked it; (2) reciprocal rank of the comment in the thread.
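A few of the lightweight features listed above can be computed with very little code. The sketch below covers only a subset (URL count, occurrences of "thank", token count, type/token ratio, exclamation marks, and the two meta features); the function name, the argument names, the URL pattern, and whitespace tokenization are assumptions made for illustration, and the exclamation count is a simplification of the single/double/triple counts.

```python
import re


def comment_features(comment, rank_in_thread, comment_user, question_user):
    """Compute a small, illustrative subset of the task-specific features."""
    tokens = comment.split()  # assumption: whitespace tokenization
    lowered = [t.lower() for t in tokens]
    return {
        "num_urls": len(re.findall(r"https?://\S+", comment)),
        "num_thank": comment.lower().count("thank"),
        "num_tokens": len(tokens),
        "type_token_ratio": len(set(lowered)) / len(tokens) if tokens else 0.0,
        "num_exclamations": comment.count("!"),
        # Meta features: same author as the question, and position in the thread.
        "same_user": 1.0 if comment_user == question_user else 0.0,
        "reciprocal_rank": 1.0 / rank_in_thread,  # rank_in_thread starts at 1
    }


example = comment_features(
    "Thanks! See http://example.com thanks", 2, "user_a", "user_b"
)
```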

Experiments and Results
We experiment with the data from SemEval-2016 Task 3. The task offers a higher-quality training dataset TRAIN-PART1, which includes 1,412 questions and 14,110 answers, and a lower-quality TRAIN-PART2 with 382 questions and 3,790 answers. We train our model on TRAIN-PART1 with hidden layers of size 3 for 100 epochs with minibatches of size 30, regularization of 0.005, and a decay of 0.0001, using stochastic gradient descent with AdaGrad (Duchi et al., 2011); we use Theano (Bergstra et al., 2010) for learning. We normalize the input feature values to the [−1, 1] interval using min-max scaling, and we initialize the network weights by sampling from a uniform distribution as in (Bengio and Glorot, 2010). We train the model using all pairs of good vs. bad comments, in both orders, ignoring ties. At test time, we get the full ranking by scoring all possible pairs, and we accumulate the scores at the comment level.
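Two of the data preparation steps described above, scaling feature values to [−1, 1] and building training pairs of good vs. bad comments in both orders, can be sketched as follows. The function names and the (comment_id, is_good) input format are hypothetical; the handling of a constant feature column (mapped to 0) is an assumption, since the text does not specify it.

```python
from itertools import product


def minmax_scale(column):
    """Scale a list of feature values linearly to the interval [-1, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant feature carries no information, so map it to 0.
        return [0.0 for _ in column]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in column]


def training_pairs(comments):
    """Build (c1, c2, label) examples from one thread.

    comments: list of (comment_id, is_good) tuples.
    Only good-vs-bad pairs are produced, in both orders; good-good and
    bad-bad pairs (ties) are skipped.
    """
    good = [c for c, is_good in comments if is_good]
    bad = [c for c, is_good in comments if not is_good]
    pairs = []
    for g, b in product(good, bad):
        pairs.append((g, b, 1))  # first comment is the better one
        pairs.append((b, g, 0))  # same pair, reversed order
    return pairs


scaled = minmax_scale([0, 5, 10])
pairs = training_pairs([("a", True), ("b", False), ("c", False)])
```

Presenting each pair in both orders keeps the two classes balanced.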
We evaluate the model on TRAIN-PART2 after each epoch, and ultimately we keep the model that achieves the highest accuracy; in case of a tie, we prefer the parameters from an earlier epoch. We selected the above parameter values on the DEV dataset (244 questions and 2,440 answers) using the full model, and we used them for all experiments below, where we evaluate on the official TEST dataset (329 questions and 3,270 answers). We report mean average precision (MAP), which is the official evaluation measure, and also average recall (AvgRec) and mean reciprocal rank (MRR).
Table 1 shows the evaluation results for three configurations of our MTE-based cQA system. We can see that the vanilla MTE system (MTE_vanilla), which only uses features from our original MTE model, i.e., it does not have any task-specific features (TASK FEATURES and QL VECTORS), performs surprisingly well despite the differences in the MTE and cQA tasks. It outperforms a random baseline (Baseline_rand) and a chronological baseline that assumes that early comments are better than later ones (Baseline_time) by large margins: by about 11 and 17 MAP points absolute, respectively. For the other two measures the results are similar.
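For reference, MAP and MRR can be computed from ranked relevance labels as in the sketch below. It uses the textbook definitions over the full ranked list; the official task scorer may apply details (such as a rank cutoff) that are not reproduced here, and AvgRec is omitted. The function names are hypothetical.

```python
def average_precision(ranked_labels):
    """Average precision for one thread; ranked_labels[k] is True if the comment at rank k+1 is good."""
    hits, total = 0, 0.0
    for k, is_good in enumerate(ranked_labels, start=1):
        if is_good:
            hits += 1
            total += hits / k  # precision at this rank
    return total / hits if hits else 0.0


def reciprocal_rank(ranked_labels):
    """1 / rank of the first good comment, or 0 if there is none."""
    for k, is_good in enumerate(ranked_labels, start=1):
        if is_good:
            return 1.0 / k
    return 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


threads = [[True, False, True], [False, True]]
map_score = mean(average_precision(t) for t in threads)
mrr_score = mean(reciprocal_rank(t) for t in threads)
```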

Results
We can further see that adding the task-specific features in MTE-CQA_pairwise improves the results by another 8 MAP points absolute. Finally, the second line shows that adapting the network to do classification (MTE-CQA_classification), giving it a question and a single comment as input, yields a performance drop of 0.6 MAP points absolute compared to the proposed pairwise learning model. Thus, the pairwise training strategy is confirmed to be better for the ranking task, although not by a large margin.

[Table 2: Results of the ablation study.]

Table 2 presents the results of an ablation study, where we analyze the contribution of various features and feature groups to the performance of the overall system. For this purpose, we study ΔMAP, i.e., the absolute drop in MAP when the feature group is excluded from the full system.
Not surprisingly, the most important features turn out to be the TASK FEATURES (contributing over five MAP points), as they handle important information sources that are not available to the system from other feature groups; e.g., the reciprocal rank alone contributes about two points.
Next in terms of importance come word embeddings, QL VECTORS (contributing over 2 MAP points), trained on text from the target forum, QatarLiving. Then come the GOOGLE VECTORS (contributing over one MAP point), which are trained on 100 billion words, and thus are still useful even in the presence of the domain-specific QL VECTORS, which are in turn trained on four orders of magnitude less data.
Interestingly, the MTE-motivated SYNTAX vectors contribute half a MAP point, which shows the importance of modeling syntax for this task. The other two MTE features, MTFEATS and BLEUCOMP, together contribute 0.8 MAP points. It is interesting that the BLEU components manage to contribute on top of the MTFEATS, which already contain several state-of-the-art MTE measures, including BLEU itself. This is probably because the other features we have do not model n-gram matches directly.
Finally, Table 3 puts the results in perspective. We can see that our system MTE-CQA would rank first on MRR, second on MAP, and fourth on AvgRec in the SemEval-2016 Task 3 competition. (The full results can be found on the task website: http://alt.qcri.org/semeval2016/task3/index.php?id=results) These results are also 5 and 16 points above the average and the worst systems, respectively. This is remarkable given the lightweight task-specific features we use, and confirms the validity of the proposed neural approach to produce state-of-the-art systems for this particular cQA task.

Conclusion
We have explored the applicability of machine translation evaluation methods to answer ranking in community Question Answering, a seemingly very different task, where the goal is to rank the comments in a question-answer thread according to their appropriateness to the question, placing all good comments above all bad ones.
In particular, we have adopted a pairwise neural network architecture, which incorporates MTE features, as well as rich syntactic and semantic embeddings of the input texts that are non-linearly combined in the hidden layer. The evaluation results on benchmark datasets have shown state-of-the-art performance, with sizeable contribution from both the MTE features and from the network architecture. This is an interesting and encouraging result, as given the difference in the tasks, it was not a priori clear that an MTE approach would work well for cQA.
In future work, we plan to incorporate other similarity measures and better task-specific features into the model. We further want to explore the application of this architecture to other semantic similarity problems such as question-question similarity, and textual entailment.