MTE-NN at SemEval-2016 Task 3: Can Machine Translation Evaluation Help Community Question Answering?

We present a system for answer ranking (SemEval-2016 Task 3, subtask A) that is a direct adaptation of a pairwise neural network model for machine translation evaluation (MTE). In particular, the network incorporates MTE features, as well as rich syntactic and semantic embeddings, and it efficiently models complex non-linear interactions between them. With the addition of lightweight task-specific features, we obtained very encouraging experimental results, with sizeable contributions from both the MTE features and from the pairwise network architecture. We also achieved good results on subtask C.


Introduction
We present a system for SemEval-2016 Task 3 on Community Question Answering (cQA), subtask A (English). In that task, we are given a question from a community forum and a thread of associated text comments intended to answer the question, and the goal is to rank the comments according to their appropriateness to the question. Since cQA forum threads are noisy, as many comments are not answers to the question, the challenge lies in learning to rank all good comments above all bad ones. In this work, we approach subtask A from a novel perspective: by using notions of machine translation evaluation (MTE) to decide on the quality of a comment. In particular, we extend the MTE neural network framework from Guzmán et al. (2015).
We believe that this neural network is interesting for the cQA problem because: (i) it works in a pairwise fashion, i.e., given two translation hypotheses and a reference translation to compare to, the network decides which translation hypothesis is better; this is appropriate for a ranking problem; (ii) it allows for an easy incorporation of rich syntactic and semantic embedded representations of the input texts, and it efficiently models complex non-linear relationships among them; (iii) it uses a number of MT evaluation measures that have not been explored for the cQA task (e.g., TER, Meteor and BLEU).
The analogy we apply to adapt the neural MTE architecture to the cQA problem is the following: given two comments $c_1$ and $c_2$ from the question thread (which play the role of the two translation hypotheses), we have to decide whether $c_1$ is a better answer than $c_2$ to question $q$ (which plays the role of the translation reference).
The two tasks seem similar: both reason about the similarity of two competing texts against a reference text, to decide which one is better. However, there are some profound differences. In MTE, the goal is to decide whether a hypothesis translation conveys the same meaning as the reference translation. In cQA, it is to determine whether the comment is an appropriate answer to the question. Furthermore, in MTE we can expect shorter texts, which are much more similar to each other. In cQA, the question and the intended answers might differ significantly both in length and in lexical content. Thus, it is not clear a priori whether the MTE network can work well for cQA. Here, we show that the analogy is convenient, allowing us to achieve competitive results.
At competition time, we achieved the sixth best result on the task from a set of twelve systems. Right after the competition we introduced some minor improvements and extra features, without changing the fundamental architecture of the network, which improved the MAP result by almost two points. We also performed a more detailed experimental analysis of the system, checking the contribution of several features and parts of the NN architecture. We observed that every single piece contributes important information to achieve the final performance. While task-specific features are crucial, other aspects of the framework are relevant too: syntactic embeddings, MT evaluation measures, and pairwise training of the network.
Finally, we used our system for subtask A to solve subtask C, which asks us to find good answers to a new question that has not been asked before in the forum by reranking the answers to related questions. For this purpose, we weighted the subtask A scores by the reciprocal rank of the related questions (following the order given by the organizers, i.e., the ranking by Google). Without any subtask C specific addition, we achieved the fourth best result in the task.

Related Work
Recently, many neural network (NN) models have been applied to cQA tasks: e.g., question-question similarity (dos Santos et al., 2015; Lei et al., 2016) and answer selection (Severyn and Moschitti, 2015; Wang and Nyberg, 2015; Shen et al., 2015; Feng et al., 2015; Tan et al., 2015). Also, other participants in the SemEval 2016 Task 3 applied NNs to solve some of the subtasks. However, our goal was different: we were interested in extending an existing pairwise NN framework from a different but related problem.
There is also work that uses scores from machine translation models as features for cQA (Berger et al., 2000; Echihabi and Marcu, 2003; Jeon et al., 2005; Soricut and Brill, 2006; Riezler et al., 2007; Li and Manandhar, 2011; Surdeanu et al., 2011; Tran et al., 2015), e.g., a variation of IBM model 1, to compute the probability that the question is a "translation" of the candidate answer. Unlike that work, here we use machine translation evaluation (MTE) instead of machine translation models. Another relevant work is that of Madnani et al. (2012), who applied MTE metrics as features for paraphrase identification. However, here we have a different problem: cQA. Moreover, instead of using MTE metrics as features, we port an entire MTE framework to the cQA problem.

Neural Model for Answer Ranking
The NN model we use for answer ranking is depicted in Figure 1. It is a direct adaptation of the feed-forward NN for MTE described in (Guzmán et al., 2015). Technically, we have a binary classification task with input $(q, c_1, c_2)$, which should output 1 if $c_1$ is a better answer to $q$ than $c_2$, and 0 otherwise. The network computes a sigmoid function
$$f(q, c_1, c_2) = \mathrm{sig}\left(w_v^T \phi(q, c_1, c_2) + b_v\right),$$
where $\phi(x)$ transforms the input $x$ through the hidden layer, $w_v$ are the weights from the hidden layer to the output layer, and $b_v$ is a bias term.
We first map the question and the comments to a fixed-length vector $[x_q, x_{c_1}, x_{c_2}]$, using syntactic and semantic embeddings. Then, we feed this vector as input to the neural network, which models three types of interactions, using different groups of nodes in the hidden layer. There are two evaluation groups $h_{q1}$ and $h_{q2}$ that model how good each comment $c_i$ is to the question $q$. The inputs to these groups are the concatenations $[x_q, x_{c_1}]$ and $[x_q, x_{c_2}]$, respectively. The third group of hidden nodes $h_{12}$, which we call the similarity group, models how close $c_1$ and $c_2$ are. Its input is $[x_{c_1}, x_{c_2}]$. This might be useful as highly similar comments are likely to be comparable in appropriateness, irrespective of whether they are good or bad answers in absolute terms.
In summary, the transformation $\phi(q, c_1, c_2) = [h_{q1}, h_{q2}, h_{12}]$ can be written as follows:
$$h_{q1} = g(W_{q1}[x_q, x_{c_1}] + b_{q1})$$
$$h_{q2} = g(W_{q2}[x_q, x_{c_2}] + b_{q2})$$
$$h_{12} = g(W_{12}[x_{c_1}, x_{c_2}] + b_{12})$$
where $g(\cdot)$ is a non-linear activation function (applied component-wise), $W \in \mathbb{R}^{H \times N}$ are the associated weights between the input layer and the hidden layer, and $b$ are the corresponding bias terms. We use tanh as an activation function, rather than the sigmoid (sig), to be consistent with how parts of our input vectors (the word embeddings) are generated.
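As an illustration only, the hidden-layer transformation φ described above can be sketched in plain Python. The function names, the parameter dictionary layout, and the small uniform initialization range are assumptions made for this sketch; they are not taken from the original Theano implementation.

```python
import math
import random

def affine_tanh(W, b, x):
    """Return tanh(W x + b), with W given as a list of rows."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def phi(x_q, x_c1, x_c2, params):
    """Concatenate the three hidden groups [h_q1, h_q2, h_12]."""
    h_q1 = affine_tanh(params["W_q1"], params["b_q1"], x_q + x_c1)   # evaluation group for c1
    h_q2 = affine_tanh(params["W_q2"], params["b_q2"], x_q + x_c2)   # evaluation group for c2
    h_12 = affine_tanh(params["W_12"], params["b_12"], x_c1 + x_c2)  # similarity group
    return h_q1 + h_q2 + h_12

def init_params(input_dim, hidden=3, seed=0):
    """Hypothetical initializer: small uniform weights, zero biases."""
    rng = random.Random(seed)
    params = {}
    for name in ("q1", "q2", "12"):
        params["W_" + name] = [[rng.uniform(-0.1, 0.1) for _ in range(2 * input_dim)]
                               for _ in range(hidden)]
        params["b_" + name] = [0.0] * hidden
    return params

# Toy 4-dimensional inputs (made-up values, for illustration only).
params = init_params(input_dim=4)
x_q, x_c1, x_c2 = [0.1, 0.2, 0.3, 0.4], [0.0, 0.1, 0.0, 0.1], [0.5, 0.5, 0.5, 0.5]
h = phi(x_q, x_c1, x_c2, params)
```

With three units per group, as in the settings reported later, the resulting hidden vector has nine components.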
The model further allows us to incorporate external sources of information in the form of skip arcs that go directly from the input to the output, skipping the hidden layer. These arcs represent pairwise similarity feature vectors between $q$ and either $c_1$ or $c_2$. In these feature vectors, we encode MT evaluation measures (e.g., TER, Meteor, and BLEU), cQA task-specific features, etc. See Section 4.3 for details about the features implemented as skip arcs. In the figure, we indicate these pairwise external feature sets as $\psi(q, c_1)$ and $\psi(q, c_2)$. When including the external features, the activation at the output is
$$f(q, c_1, c_2) = \mathrm{sig}\left(w_v^T [\phi(q, c_1, c_2), \psi(q, c_1), \psi(q, c_2)] + b_v\right).$$
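A minimal sketch of the output unit with skip arcs follows, assuming the hidden vector and the two pairwise feature vectors are simply concatenated before the final sigmoid. The function and argument names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pairwise_output(phi_vec, psi_1, psi_2, w_v, b_v):
    """Score in (0, 1): how strongly the network prefers c1 over c2.

    phi_vec: hidden-layer output [h_q1, h_q2, h_12]
    psi_1, psi_2: skip-arc feature vectors for (q, c1) and (q, c2)
    w_v, b_v: output weights and bias (one weight per concatenated feature)
    """
    features = phi_vec + psi_1 + psi_2
    if len(features) != len(w_v):
        raise ValueError("w_v must have one weight per feature")
    return sigmoid(sum(w * x for w, x in zip(w_v, features)) + b_v)

# Toy example: with all weights at zero the network has no preference.
neutral = pairwise_output([0.2, -0.1], [0.7], [0.3], [0.0, 0.0, 0.0, 0.0], 0.0)
```

In this sketch, a positive weight on a ψ(q, c1) feature and a negative weight on the matching ψ(q, c2) feature pushes the output towards 1 whenever the feature is larger for the first comment.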

Learning Features
We experiment with three kinds of features: (i) input embeddings, (ii) features motivated by previous work on Machine Translation Evaluation (MTE) (Guzmán et al., 2015), and (iii) task-specific features, mostly proposed by participants in the 2015 edition of the task.

Embedding Features
We use the following vector-based embeddings of $(q, c_1, c_2)$ as input to the NN:
• GOOGLE VEC: We use the pre-trained, 300-dimensional embedding vectors, which Tomas Mikolov trained on 100 billion words from Google News (Mikolov et al., 2013).
• SYNTAX VEC: We parse the entire question/comment text using the Stanford neural parser (Socher et al., 2013), and we use the final 25-dimensional vector that is produced internally as a by-product of parsing.
Moreover, we use the above vectors to calculate pairwise similarity features. More specifically, given a question $q$ and a pair of comments $c_1$ and $c_2$ for it, we calculate the following features: $\psi(q, c_1) = \cos(q, c_1)$ and $\psi(q, c_2) = \cos(q, c_2)$.
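The cosine features can be sketched as follows. The sketch assumes the question and the comment have already been mapped to fixed-length vectors; returning 0.0 for a zero vector is a convention chosen here, not something stated in the text.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for empty or all-zero vectors
    return dot / (norm_u * norm_v)

# Toy vectors standing in for the embeddings of q, c1 and c2.
x_q, x_c1, x_c2 = [1.0, 0.0], [2.0, 0.0], [0.0, 3.0]
psi_1 = cosine(x_q, x_c1)  # same direction as the question
psi_2 = cosine(x_q, x_c2)  # orthogonal to the question
```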

MTE features
MTFEATS (in MTE-NN-improved only). We use (as skip-arc pairwise features) the following six machine translation evaluation features, to which we refer as MTFEATS, and which measure the similarity between the question and a candidate answer:
• BLEU: This is the most commonly used measure for machine translation evaluation, which is based on n-gram overlap and length ratios (Papineni et al., 2002).
• NIST: This measure is similar to BLEU, and is used at evaluation campaigns run by NIST (Doddington, 2002).
• TER: Translation error rate; it is based on the edit distance between a translation hypothesis and the reference (Snover et al., 2006).
• METEOR: A measure that matches the hypothesis and the reference using synonyms and paraphrases (Lavie and Denkowski, 2009).
• PRECISION: a measure originating in information retrieval.
• RECALL: another measure originating in information retrieval.
BLEUCOMP. Following (Guzmán et al., 2015), we further use as features various components that are involved in the computation of BLEU: n-gram precisions, n-gram matches, total number of n-grams (n=1,2,3,4), lengths of the hypotheses and of the reference, length ratio between them, and BLEU's brevity penalty. We will refer to the set of these features as BLEUCOMP.
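To make the BLEUCOMP feature group more concrete, here is a rough sketch of how such components could be computed for one question-comment pair. It assumes whitespace-tokenized input, treats the comment as the hypothesis and the question as the reference, and uses the standard clipped n-gram counts and brevity penalty of BLEU. The feature names are invented for this sketch, and the exact feature set of the original system may differ.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_components(hyp, ref, max_n=4):
    """Return BLEU building blocks for a hypothesis/reference token pair."""
    feats = {}
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngram_counts(hyp, n), ngram_counts(ref, n)
        # Clipped matches: each hypothesis n-gram counts at most as often as in the reference.
        matches = sum(min(count, ref_ngrams[gram]) for gram, count in hyp_ngrams.items())
        total = sum(hyp_ngrams.values())
        feats["match_%d" % n] = matches
        feats["total_%d" % n] = total
        feats["prec_%d" % n] = matches / total if total else 0.0
    feats["hyp_len"] = len(hyp)
    feats["ref_len"] = len(ref)
    feats["len_ratio"] = len(hyp) / len(ref) if ref else 0.0
    if not hyp:
        feats["brevity_penalty"] = 0.0
    elif len(hyp) >= len(ref):
        feats["brevity_penalty"] = 1.0
    else:
        feats["brevity_penalty"] = math.exp(1.0 - len(ref) / len(hyp))
    return feats

comment = "the cat sat".split()
question = "the cat sat on the mat".split()
feats = bleu_components(comment, question)
```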

Task-specific features
QL VEC (in MTE-NN-improved only). Similarly to the GOOGLE VEC, but on task-specific data, we train word vectors using WORD2VEC on all available cQA training data (Qatar Living) and use them as input to the NN.
QL+IWSLT VEC (in MTE-NN-{primary, contrastive1/2} only). We also use word vectors trained on the concatenation of the cQA training data and the English portion of the IWSLT data, which consists of TED talks (Cettolo et al., 2012) and is thus informal and somewhat similar to cQA data.
TASK FEAT. We further extract various task-specific skip-arc features, most of them proposed for the 2015 edition of the task. This includes some comment-specific features:
• number of URLs/images/emails/phone numbers;
• number of occurrences of the string thank (when an author thanks somebody, the post is typically a bad answer to the original question);
• number of tokens/sentences;
• average number of tokens;
• type/token ratio;
• number of nouns/verbs/adjectives/adverbs/pronouns;
• number of positive/negative smileys;
• number of single/double/triple exclamation/interrogation symbols;
• number of interrogative sentences (based on parsing);
• number of words that are not in word2vec's Google News vocabulary (this can detect slang, foreign language, etc., which would indicate a bad answer).
And also some question-comment pair features:
• question to comment count ratio in terms of sentences/tokens/nouns/verbs/adjectives/adverbs/pronouns;
• question to comment count ratio of words that are not in word2vec's Google News vocabulary.
We also have two meta features:
• is the person answering the question the one who asked it;
• reciprocal rank of the comment in the thread.
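A few of the task-specific features listed above are simple enough to sketch directly. The snippet below covers only a small subset (URL count, occurrences of the string thank, token count, type/token ratio, and the two meta features). The URL pattern, the whitespace tokenization, and all names are simplifying assumptions for illustration.

```python
import re

# Simplified URL pattern, assumed for this sketch only.
URL_RE = re.compile(r"https?://\S+|www\.\S+")

def comment_features(text, comment_author, question_author, position):
    """Extract a small subset of comment-level and meta features.

    position is the 1-based position of the comment in the thread.
    """
    tokens = text.lower().split()  # naive whitespace tokenization
    return {
        "num_urls": len(URL_RE.findall(text)),
        "num_thank": text.lower().count("thank"),
        "num_tokens": len(tokens),
        "type_token_ratio": len(set(tokens)) / len(tokens) if tokens else 0.0,
        "same_author": 1.0 if comment_author == question_author else 0.0,
        "reciprocal_rank": 1.0 / position,
    }

feats = comment_features("Thanks! See http://example.com thanks",
                         comment_author="user_a",
                         question_author="user_b",
                         position=2)
```

The part-of-speech counts, smiley counts, and vocabulary-based features would need a tagger, a smiley lexicon, and the embedding vocabulary, respectively, so they are left out here.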

Experiments and Results
Below we explain which part of the available data we used for training, as well as our basic settings. Then, we present in detail our experiments and the evaluation results.

Data and Settings
We experiment with the data from SemEval-2016 Task 3. The task offers a higher-quality training dataset TRAIN-PART1, which includes 1,412 questions and 14,110 answers, and a lower-quality TRAIN-PART2 with 382 questions and 3,790 answers. We train our model on TRAIN-PART1 with hidden layers of size 3 for 63 epochs with minibatches of size 30, regularization of 0.0015, and a decay of 0.0001, using stochastic gradient descent with AdaGrad (Duchi et al., 2011); we use Theano (Bergstra et al., 2010) for learning. We normalize the input feature values to the [−1; 1] interval using minmax, and we initialize the network weights by sampling from a uniform distribution as in (Bengio and Glorot, 2010). We train the model using all pairs of good and bad comments, ignoring ties. At test time, we get the full ranking by scoring all possible pairs and accumulating the scores at the comment level.
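The pairwise training and test-time ranking procedure can be sketched as follows. Two details are assumptions of this sketch rather than facts from the text: each good/bad pair is emitted in both orders, and a comment's accumulated score is the sum of the network outputs over all ordered pairs in which it appears first. The `pair_score` callback stands in for the trained network.

```python
from itertools import permutations

def training_pairs(comments):
    """comments: list of (comment_id, is_good). Return (c1, c2, label) triples.

    Pairs with the same gold label (ties) are skipped.
    """
    pairs = []
    for (id1, good1), (id2, good2) in permutations(comments, 2):
        if good1 == good2:
            continue
        pairs.append((id1, id2, 1 if good1 else 0))
    return pairs

def rank_comments(comment_ids, pair_score):
    """Rank comments by accumulating pairwise scores at the comment level."""
    totals = {c: 0.0 for c in comment_ids}
    for c1, c2 in permutations(comment_ids, 2):
        totals[c1] += pair_score(c1, c2)
    return sorted(comment_ids, key=lambda c: totals[c], reverse=True)

# Toy thread with made-up labels and a stand-in scorer.
thread = [("a", True), ("b", False), ("c", True)]
pairs = training_pairs(thread)
quality = {"a": 0.9, "b": 0.1, "c": 0.5}
ranking = rank_comments(["a", "b", "c"],
                        lambda c1, c2: 1.0 if quality[c1] > quality[c2] else 0.0)
```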
We evaluate the model on TRAIN-PART2 after each epoch, and ultimately we keep the model that achieves the highest Kendall's Tau (τ ); in case of a tie, we prefer the parameters from a later epoch. We selected the above parameter values on the DEV dataset (244 questions and 2,440 answers) using the full model, and we use them for all experiments below, where we evaluate on the official TEST dataset (329 questions and 3,270 answers).
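The model selection rule (highest Kendall's Tau on the held-out set, later epoch preferred on ties) amounts to a single pass with a non-strict comparison. The function name and the per-epoch values below are made up for illustration.

```python
def select_epoch(tau_per_epoch):
    """Return (epoch, tau) of the best epoch; epochs are numbered from 1.

    Using >= rather than > means a later epoch wins when the scores are tied.
    """
    best_epoch, best_tau = None, float("-inf")
    for epoch, tau in enumerate(tau_per_epoch, start=1):
        if tau >= best_tau:
            best_epoch, best_tau = epoch, tau
    return best_epoch, best_tau

# Hypothetical validation scores: epochs 2 and 4 tie, so epoch 4 is kept.
chosen = select_epoch([0.10, 0.30, 0.20, 0.30])
```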
For evaluation, we use mean average precision (MAP), which is the official evaluation measure. We further report scores using average recall (AvgRec), mean reciprocal rank (MRR), Precision (P), Recall (R), F-measure (F 1 ), and Accuracy (Acc). Note that the first three are ranking measures, to which we directly give our ranking scores. However, the latter four measures require Good vs. Bad categorical predictions. We generate them based on the ranking scores using a threshold: if the score is above 0.95 (chosen on the DEV set), we consider the comment to be Good, otherwise it is Bad.
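For reference, textbook versions of two of the ranking measures and of the thresholding step are sketched below. This is not the official scorer, which may differ in details such as rank cutoffs or the handling of threads without good comments. The 0.95 threshold is the value reported above.

```python
def average_precision(ranked_labels):
    """ranked_labels: booleans in ranked order, True for a Good comment."""
    hits, total = 0, 0.0
    for rank, is_good in enumerate(ranked_labels, start=1):
        if is_good:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_average_precision(threads):
    """Average of the per-thread average precision values."""
    return sum(average_precision(t) for t in threads) / len(threads)

def reciprocal_rank(ranked_labels):
    """1 / rank of the first Good comment, or 0.0 if there is none."""
    for rank, is_good in enumerate(ranked_labels, start=1):
        if is_good:
            return 1.0 / rank
    return 0.0

def classify(scores, threshold=0.95):
    """Turn ranking scores into Good/Bad labels (Good only if strictly above)."""
    return ["Good" if s > threshold else "Bad" for s in scores]

ap = average_precision([True, False, True])   # (1/1 + 2/3) / 2
labels = classify([0.96, 0.95, 0.10])
```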

Contrastive Runs
We submitted two contrastive runs, which differ from the general settings above as follows:
• MTE-NN-contrastive1: a different network architecture with 50 units in the hidden layer (instead of 3 for each of $h_{q1}$, $h_{q2}$, $h_{12}$) and higher regularization (0.03, i.e., twenty times bigger). On the development data, it performed very similarly to the primary run, and we wanted to try a bigger NN.
• MTE-NN-contrastive2: the same architecture as the primary but different training. We put together TRAIN-PART1 and DEV and randomly split them into 90% for training and 10% for model selection. The idea here was to have some training examples from development, which was supposed to be a cleaner dataset (and so more similar to the test set).

Official Results
Table 1 shows the results for our submissions for subtask A. Our primary submission was ranked sixth out of twelve teams on MAP. Note, however, that it was third on MRR and F 1 . It is also 3 and 14 points above the average and the worst systems, respectively, and well above the baselines. Both our contrastive submissions performed slightly better, but neither would have changed the overall ranking had we chosen it as primary.
For subtask C, we multiplied (i) our scores for subtask A for the related question by (ii) the given reciprocal rank of the related question in the list of related questions. That is, we did not try to address question-question similarity (subtask B). We achieved 4th place with a MAP of 49.38, which is well above the baseline of 40.36. Our contrastive2 run performed slightly better at 49.49.
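The subtask C scoring described above can be sketched in a few lines. The input layout (a list of related questions in the given order, each with its comments and subtask A scores) and the names are assumptions of this sketch.

```python
def subtask_c_ranking(related_questions):
    """Rerank all comments of all related questions for a new question.

    related_questions: list, in the organizers' order, of lists of
    (comment_id, subtask_a_score) pairs.
    """
    scored = []
    for rank, comments in enumerate(related_questions, start=1):
        for comment_id, score_a in comments:
            # Weight the subtask A score by the reciprocal rank of the related question.
            scored.append((comment_id, score_a / rank))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy example with made-up scores.
ranking = subtask_c_ranking([
    [("a", 0.4), ("b", 0.9)],  # comments of the first related question
    [("c", 1.0)],              # comments of the second related question
])
```

In this toy example, comment c has the highest subtask A score but ends up second, since its related question is ranked second.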

Post-submission Analysis on the Test Set
After the competition, we produced a refined version of the system (MTE-NN-improved) where the settings changed as follows: (i) using QL VEC instead of QL+IWSLT VEC, (ii) adding MTFEATS to the set of features, (iii) optimizing accuracy instead of Kendall's tau, (iv) training for 100 epochs instead of 63, and (v) regularization of 0.005 instead of 0.0015.
Table 2: Ablation study of our improved system on the test data.
Note that the training and development sets remained unchanged. MTE-NN-improved showed notable improvements on the DEV set over our primary submission. In Table 2, we present the results on the TEST set. To gain additional insight about the contribution of various features and feature groups to the performance of the overall system, we also present the results of an ablation study where we removed different feature groups one by one. For this purpose, we study ∆MAP, i.e., the absolute change in MAP when the feature or feature group is excluded from the full system.
Not surprisingly, the most important turn out to be the TASK FEATS (contributing over 5 MAP points), as they handle important information sources that are not available to the system from other feature groups, e.g., the reciprocal rank of the comment in the comment thread, which alone contributes 2.12 MAP points, and the feature checking whether the person who asked the question is the one who answered, which contributes 1.60 MAP points. Next in terms of importance come the word embeddings QL VEC (contributing over 2 MAP points), trained on text from the target forum, Qatar Living. Then come the GOOGLE VEC (contributing over 1 MAP point), which are trained on 100 billion words, and thus are still useful even in the presence of the domain-specific QL VEC, which are in turn trained on four orders of magnitude less data. Interestingly, the MTE-motivated SYNTAX VEC vectors contribute half a MAP point, which shows the importance of modeling syntax for this task. Next, we can see that using just the vectors is not enough, and adding cosines as pairwise features for the three kinds of vectors contributes over one MAP point.
Finally, the two MTE feature groups, MTFEATS and BLEUCOMP, together contribute 0.8 MAP points. It is interesting that the BLEU components manage to contribute on top of the MTFEATS, which already contain several state-of-the-art MTE measures, including BLEU itself. This is probably because the other features we have do not model n-gram matches directly.
We further used the output of our MTE-NN-improved system to generate predictions for subtask C, as explained above. This yielded improvements from 49.38 to 49.87 on MAP, from 55.44 to 56.08 on AvgRec, and from 51.56 to 52.16 on MRR.

Conclusion
We have explored the applicability of machine translation evaluation metrics to answer ranking in community Question Answering, a seemingly very different task (compared to MTE). In particular, with ranking in mind, we have adopted a pairwise neural network architecture, which incorporates MTE features, as well as rich syntactic and semantic embeddings of the input texts that are non-linearly combined in the hidden layer.
Our post-competition improvements have shown state-of-the-art performance (Guzmán et al., 2016), with sizeable contributions from both the MTE features and from the network architecture. This is an encouraging result as it was not a priori clear that an MTE approach would work well for cQA.
In future work, we plan to incorporate fine-tuned word embeddings as in the SemanticZ system (Mihaylov and Nakov, 2016b), and information from entire threads (Nicosia et al., 2015; Joty et al., 2016). We also want to add more knowledge sources, e.g., as in the SUper Team system (Mihaylova et al., 2016), including veracity, sentiment, complexity, and troll user features as inspired by (Mihaylov et al., 2015a; Mihaylov et al., 2015b; Mihaylov and Nakov, 2016a), and PMI-based goodness polarity lexicons as in the PMI-cool system.
We further plan to explore the application of our NN architecture to subtasks B and C, and to study the interactions among the three subtasks in order to solve the primary subtask C. Furthermore, we would like to try a similar neural network for other semantic similarity problems, such as textual entailment.