Beihang-MSRA at SemEval-2017 Task 3: A Ranking System with Neural Matching Features for Community Question Answering

This paper presents our system for SemEval-2017 Task 3, Community Question Answering (CQA). We develop a ranking system that captures semantic relations between text pairs even when they share little word overlap. In addition to traditional NLP features, we introduce several neural network based matching features that enable our system to measure text similarity beyond the lexical level. Our system significantly outperforms baseline methods and ranks second in Subtask A and fifth in Subtask B, which demonstrates its efficacy on answer selection and question retrieval.


Introduction
In task 3 of SemEval 2017, participants are required to address typical problems in modern CQA forums.
We participate in two subtasks: question-comment similarity (Subtask A) and question-question similarity (Subtask B). In Subtask A, given a question and 10 comments in its comment thread, one is required to re-rank the 10 comments according to their relevance to the question. Subtask B gives a question and asks participants to re-rank 10 related questions according to their similarity to the input question.
The challenge of both subtasks is that two natural language sentences often express similar meanings with different but semantically related words, which results in semantic gaps between them. To bridge these gaps, we build a ranking system with a variety of features. In addition to traditional NLP features such as tf-idf (Salton and Buckley, 1988), the longest common subsequence (Allison and Dix, 1986), translation models (Jeon et al., 2005), and tree kernels (Schölkopf et al., 2003; Collins and Duffy, 2002; Moschitti, 2006), which match sentences based on word overlap, syntax (tree kernels), and word-to-word translations (translation models), we also introduce neural network based matching models into the system as features. The neural matching features, including a long short term memory network (LSTM) (Schuster and Paliwal, 1997) and a 2D matching network that is a variant of a model we proposed previously, can extract high-level matching signals from distributed representations of the sentences and capture their similarity beyond the lexical level. We also design some task-specific features for each subtask. All the features are combined into a ranking model by gradient boosted regression trees implemented with XGBoost (Chen and Guestrin, 2016).

Our system significantly outperforms baseline methods on the two subtasks. On Subtask A, it ranks second and is comparable with the best system. On Subtask B, it ranks fifth. The results demonstrate that our system can alleviate the semantic gaps in CQA tasks and effectively rank relevant comments and similar questions at high positions.

System Description
Our system is built under a learning-to-rank framework (Liu et al., 2009). It takes a question and a group of candidates (comments or related questions) as input, and outputs a ranking of the candidates based on scores of question-candidate pairs. The ranking scores are calculated in three steps: text preprocessing, feature extraction, and feature combination. In preprocessing, we replace special characters and punctuation with spaces, lowercase all letters, remove stop words, and conduct stemming and syntactic analysis. Subsequently, we extract a variety of features from text pairs, including traditional NLP features and neural matching features for both subtasks, as well as some task-specific features. Finally, we feed the features to a ranking model, trained under a pairwise loss on the training data provided in the subtasks, to calculate the ranking scores.
In the following, we will describe details of preprocessing, features, and feature combination.

Preprocessing
We use the NLTK toolkit (Loper and Bird, 2002) for stemming, tokenization, and POS tagging, and the Stanford PCFG parser (Klein and Manning, 2003) to obtain the parse tree of each sentence.

Traditional NLP Features
The following features are designed based on words and syntactic analysis.
Tf-idf cosine: each piece of text is converted to a one-hot representation weighted by tf-idf values, where tf is the term frequency in the text and idf is calculated on the unannotated Qatar corpus (Nakov et al., 2017). The cosine of the representations of the two pieces of text is used as a feature.
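As a concrete sketch of this feature (pure Python; the toy idf table below is hypothetical and stands in for idf values estimated on the unannotated Qatar corpus):

```python
import math
from collections import Counter

def tfidf_vector(tokens, idf):
    """Sparse tf-idf vector: term frequency times inverse document frequency."""
    tf = Counter(tokens)
    return {w: tf[w] * idf.get(w, 0.0) for w in tf}

def tfidf_cosine(tokens_a, tokens_b, idf):
    u, v = tfidf_vector(tokens_a, idf), tfidf_vector(tokens_b, idf)
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy idf values; the system estimates these from the unannotated corpus.
idf = {"visa": 2.0, "qatar": 1.5, "need": 0.5, "a": 0.1}
score = tfidf_cosine("need a visa".split(), "visa qatar".split(), idf)
```

Rare terms (high idf) dominate the score, so two texts sharing a rare content word score higher than two sharing only function words.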
Longest common subsequence: we measure the lexical similarity of each text pair with the term-level longest common subsequence (LCS) (Allison and Dix, 1986). The length of the LCS is normalized by dividing by the maximum length of the two pieces of text.
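One way to implement the normalized LCS feature over token sequences is the classic dynamic program:

```python
def lcs_length(a, b):
    # dp[i][j] = length of the LCS of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def lcs_feature(a, b):
    """LCS length normalized by the longer of the two token sequences."""
    return lcs_length(a, b) / max(len(a), len(b))
```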
Tree kernels: tree kernels are similarity functions used to measure the syntactic similarity of a text pair. We compute the subtree kernel (ST) (Schölkopf et al., 2003), the subset tree kernel (SST) (Collins and Duffy, 2002), and the partial tree kernel (PTK) (Moschitti, 2006) on the parse trees of a text pair.
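As an illustrative sketch of the simplest of these, the subtree (ST) kernel counts pairs of identical complete subtrees. Here trees are nested tuples of the form (label, child, ...) with string leaves; the actual system computes the cited kernels over Stanford parse trees:

```python
def collect_subtrees(tree, bag):
    """Canonicalize every complete subtree rooted at an internal node
    and count its occurrences in `bag`."""
    if isinstance(tree, str):  # terminal symbol: not counted as a subtree here
        return tree
    canon = (tree[0],) + tuple(collect_subtrees(c, bag) for c in tree[1:])
    bag[canon] = bag.get(canon, 0) + 1
    return canon

def st_kernel(t1, t2):
    """Subtree kernel: number of pairs of identical complete subtrees."""
    b1, b2 = {}, {}
    collect_subtrees(t1, b1)
    collect_subtrees(t2, b2)
    return sum(n * b2.get(s, 0) for s, n in b1.items())
```

For example, the parses of "dogs bark" and "dogs run" share only the NP subtree, so their kernel value is 1.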
Translation probability: we learn word-to-word translation probabilities using GIZA++ with the unannotated Qatar Living data. In training, we regard questions as the source language and their answers as the target language. Following (Jeon et al., 2005), we use the translation probabilities p(question A|question B) and p(comment|question) as features for a question-question pair and a question-comment pair respectively.
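The feature can be scored in the style of IBM Model 1 with a uniform alignment assumption; the word-translation table below is a hypothetical toy, whereas the system learns the real table with GIZA++:

```python
import math

# Toy table T[(s, t)] = p(t | s); in the system these probabilities
# are learned with GIZA++ on the unannotated Qatar Living data.
T = {
    ('install', 'setup'): 0.4,
    ('install', 'install'): 0.5,
    ('ubuntu', 'ubuntu'): 0.9,
}

def ibm1_logprob(source, target, floor=1e-6):
    """log p(target | source) under IBM Model 1 with uniform alignments."""
    lp = 0.0
    for t in target:
        # each target word may be generated by any source word
        lp += math.log(sum(T.get((s, t), floor) for s in source) / len(source))
    return lp
```

A semantically related pair scores higher than an unrelated one even without exact word overlap, which is exactly the gap this feature is meant to bridge.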
In Subtask A, we compute the features on both (question body, comment) and (question subject, comment), and in Subtask B, we compute the features on (question body, question body) and (question subject, question subject).

Neural Matching Features
In addition to the traditional NLP features, we also use neural matching features to measure text similarity based on distributed representations. These neural network based models have proven effective in previous work (Zhang et al., 2016).
Word embedding cosine: we employ pre-trained word embeddings from https://github.com/tbmihailov/semeval2016-task3-cqa, where the dimensionality of the word vectors is 200. We average the embeddings of the words in a piece of text as its representation, and compute the cosine of the representations of two pieces of text as a feature.
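A minimal sketch of this feature, with toy 3-dimensional vectors standing in for the 200-dimensional pre-trained embeddings:

```python
import math

# Hypothetical toy embeddings; the system loads 200-d pre-trained vectors.
emb = {
    "laptop":   [0.9, 0.1, 0.0],
    "notebook": [0.8, 0.2, 0.1],
    "visa":     [0.0, 0.9, 0.4],
}

def avg_embedding(tokens):
    """Average the embeddings of in-vocabulary tokens."""
    vecs = [emb[w] for w in tokens if w in emb]
    dim = len(next(iter(emb.values())))
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def emb_cosine(tokens_a, tokens_b):
    u, v = avg_embedding(tokens_a), avg_embedding(tokens_b)
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```

Unlike the tf-idf cosine, this score is nonzero for pairs like "laptop"/"notebook" that share no surface form but lie close in embedding space.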
Bi-LSTM: long short term memory (LSTM) is an advanced type of recurrent neural network which leverages memory cells and gates to learn long-term dependencies within a sequence (Hochreiter and Schmidhuber, 1997). We use a bidirectional LSTM (bi-LSTM) with a multi-layer perceptron (MLP) to calculate a matching score for a text pair as a feature.
Specifically, given a text pair (S_x, S_y), the model looks up an embedding table to convert S_x and S_y to S_x = [e_{x,1}, ..., e_{x,i}, ..., e_{x,I}] and S_y = [e_{y,1}, ..., e_{y,j}, ..., e_{y,J}] respectively, where e_{x,i} and e_{y,j} are the embeddings of the i-th word of S_x and the j-th word of S_y. Then S_x and S_y are encoded into hidden sequences by a bi-LSTM which consists of a forward LSTM and a backward LSTM. The forward LSTM reads S_x in its order (i.e., from e_{x,1} to e_{x,I}) and transforms it into a forward hidden sequence [h_{x,1}, ..., h_{x,I}], where the t-th hidden state is computed as

i_t = σ(W^{(i)} e_{x,t} + U^{(i)} h_{x,t-1} + b^{(i)}),
f_t = σ(W^{(f)} e_{x,t} + U^{(f)} h_{x,t-1} + b^{(f)}),
o_t = σ(W^{(o)} e_{x,t} + U^{(o)} h_{x,t-1} + b^{(o)}),
u_t = tanh(W^{(u)} e_{x,t} + U^{(u)} h_{x,t-1} + b^{(u)}),
c_t = f_t ⊙ c_{t-1} + i_t ⊙ u_t,
h_{x,t} = o_t ⊙ tanh(c_t),

where σ(·) is a sigmoid function, tanh(·) is a hyperbolic tangent function, ⊙ denotes element-wise multiplication, and W^{(·)}, U^{(·)}, and b^{(·)} are parameters. Similarly, the backward LSTM reads S_x in its reverse order (i.e., from e_{x,I} to e_{x,1}) and transforms it into a backward hidden sequence. The final states of the two directions are concatenated as the representation v_x of S_x. Following the same procedure, we obtain v_y as the representation of S_y. Finally, we concatenate (v_x, v_y) as the input of a multi-layer perceptron (MLP) to calculate a score.
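To make the recurrence concrete, here is a toy scalar LSTM (hidden size 1, hand-set weights; the real model uses learned vector-valued parameters over 200-dimensional embeddings) run forward and backward as in a bi-LSTM:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy scalar weights (w_x, w_h, b) for the input, forget, output,
# and candidate gates; these values are illustrative only.
W = {g: (0.5, 0.5, 0.0) for g in "ifou"}

def lstm_step(x, h_prev, c_prev):
    i = sigmoid(W["i"][0] * x + W["i"][1] * h_prev + W["i"][2])
    f = sigmoid(W["f"][0] * x + W["f"][1] * h_prev + W["f"][2])
    o = sigmoid(W["o"][0] * x + W["o"][1] * h_prev + W["o"][2])
    u = math.tanh(W["u"][0] * x + W["u"][1] * h_prev + W["u"][2])
    c = f * c_prev + i * u       # cell state mixes memory and new input
    h = o * math.tanh(c)         # hidden state is a gated view of the cell
    return h, c

def run_lstm(seq):
    h = c = 0.0
    hidden = []
    for x in seq:
        h, c = lstm_step(x, h, c)
        hidden.append(h)
    return hidden

# A bi-LSTM pairs a forward pass with a pass over the reversed sequence.
forward = run_lstm([1.0, 2.0, 3.0])
backward = run_lstm([3.0, 2.0, 1.0])[::-1]
```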
2D matching network: the model is a variant of one we proposed previously, which has proven effective on SemEval-2015 data. That model leverages prior knowledge and performs text matching with multiple channels. In our system, we use only two channels, meaning we do not take prior knowledge such as a knowledge base (Zheng et al., 2016) or topic information into consideration. The architecture is shown in Figure 1.

Given a text pair (S_x, S_y), we construct two matching matrices M_1 and M_2 from their word embedding representations and their bi-LSTM hidden representations respectively. The (i, j)-th element of M_1 is defined by

M_1(i, j) = e_{x,i}^T e_{y,j},

where e_{x,i} and e_{y,j} are the i-th and j-th word embeddings of S_x and S_y. The (i, j)-th element of M_2 is defined by

M_2(i, j) = h_{x,i}^T A h_{y,j},

where h_{x,i} and h_{y,j} are bi-LSTM hidden states and A is a parameter. After that, a convolutional neural network (CNN) takes M_1 and M_2 as input channels, and alternates convolution and max-pooling operations (the system has only one convolution and one pooling layer). Suppose that z^{(l,f)} ∈ R^{I^{(l,f)} × J^{(l,f)}} denotes the output of the f-th feature map on layer l, where z^{(0,f)} = M_f, ∀f = 1, 2. On convolution layers, we employ a 2D convolution operation with a window size r^{(l,f)}_w × r^{(l,f)}_h:

z^{(l,f)}_{i,j} = σ( Σ_{f'=1}^{F_{l-1}} Σ_{s=0}^{r^{(l,f)}_w - 1} Σ_{t=0}^{r^{(l,f)}_h - 1} W^{(l,f)}_{s,t} · z^{(l-1,f')}_{i+s, j+t} + b^{(l,f)} ),

where σ(·) is a ReLU (Nair and Hinton, 2010), W^{(l,f)} ∈ R^{r^{(l,f)}_w × r^{(l,f)}_h} and b^{(l,f)} are the parameters of the f-th feature map on the l-th layer, and F_{l-1} is the number of feature maps on the (l-1)-th layer. A max-pooling operation can be formulated as

z^{(l,f)}_{i,j} = max_{0 ≤ s < p_w, 0 ≤ t < p_h} z^{(l-1,f)}_{i+s, j+t},

where p_w × p_h is the size of the pooling window. Feature vectors at the last pooling layer are concatenated into a similarity vector v, which is fed to an MLP to predict the final similarity score.

We learn the bi-LSTM and the 2D matching network by minimizing cross entropy on training data. Let Θ denote the parameters; the objective function can be formulated as

L(Θ) = - Σ_{i=1}^{N} [ l_i log f(S_{x,i}, S_{y,i}) + (1 - l_i) log(1 - f(S_{x,i}, S_{y,i})) ],

where l_i ∈ {0, 1} is a label, f(S_{x,i}, S_{y,i}) is the neural network we want to learn, and N is the number of instances in the training data.
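As a small illustration of the two matching channels and max pooling (pure Python; toy 2-d embeddings, an identity matrix standing in for the learned parameter A, and the convolution layer omitted for brevity):

```python
def match_matrices(Ex, Ey, A):
    """Channel 1: M1[i][j] = dot(e_x_i, e_y_j).
    Channel 2: M2[i][j] = e_x_i^T A e_y_j (here e stands in for the
    bi-LSTM hidden states the system actually uses for M2)."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    M1 = [[dot(ex, ey) for ey in Ey] for ex in Ex]
    M2 = [[dot(ex, [dot(row, ey) for row in A]) for ey in Ey] for ex in Ex]
    return M1, M2

def max_pool(M, r):
    """Non-overlapping r x r max pooling over a 2D matching matrix."""
    I, J = len(M), len(M[0])
    return [[max(M[a][b] for a in range(i, min(i + r, I))
                         for b in range(j, min(j + r, J)))
             for j in range(0, J, r)]
            for i in range(0, I, r)]

# Two 2-word "sentences" with 2-d embeddings; A is identity, so M1 == M2.
Ex = [[1.0, 0.0], [0.0, 1.0]]
Ey = [[1.0, 0.0], [1.0, 1.0]]
A = [[1.0, 0.0], [0.0, 1.0]]
M1, M2 = match_matrices(Ex, Ey, A)
pooled = max_pool(M1, 2)
```

Pooling keeps the strongest local matching signals regardless of where in the two sentences they occur.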
We use two data sets to learn the neural networks, which means we obtain two features from each model. The first is the training data provided by SemEval-2017 Task 3, and the other is 2 million Yahoo! Answers question-answer pairs we crawled, which were released in (Zhang et al., 2016). In both data sets, question subjects and question bodies are concatenated. In the SemEval-2017 data, comments in Subtask A are annotated as Good, PotentiallyUseful, or Bad, and we treat Good as 1 and the others as 0. In Subtask B, each related question is annotated as PerfectMatch, Relevant, or Irrelevant, and we treat PerfectMatch and Relevant as 1 and Irrelevant as 0. The Yahoo! Answers data is only used to learn the neural networks for Subtask A: we take a question and its best answer as a positive instance, and randomly sample an answer from another question as a negative instance. The motivation for leveraging external data is that the training data of SemEval-2017 is small, which may cause overfitting when learning neural networks.

Table 1: Statistics of the datasets

                      |                Subtask A                  |                Subtask B
                      | 2016-train  2016-dev  2016-test  2017-test | 2016-train  2016-dev  2016-test  2017-test
Original questions    |     -          -         -          -      |    267         50        70         88
Related questions     |   6154        244       327        293     |   2670        500       700        880
Comments              |  37848       2440      3270       2930     |     -          -         -          -

Task Specific Features
The features described above are used in both Subtask A and Subtask B. In addition to them, we also design some specific features for each subtask.
In Subtask A, we design some features based on heuristic rules which might indicate whether a comment is good or not: (i) whether a comment is written by the author of the question. (ii) the length of a comment. (iii) whether a comment contains URLs or email addresses. (iv) whether a comment contains positive or negative smileys, e.g., ;), :), ;(, :(.
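A sketch of these rule-based features (pure Python; the regular expression and smiley list are illustrative simplifications, not the exact patterns used in the system):

```python
import re

URL_OR_EMAIL = re.compile(r"https?://\S+|\b[\w.+-]+@[\w-]+\.[\w.]+\b")
SMILEYS = {":)": 1, ";)": 1, ":(": -1, ";(": -1}

def comment_features(comment_text, comment_author, question_author):
    """Heuristic indicators of comment quality for Subtask A."""
    return {
        "same_author": int(comment_author == question_author),
        "length": len(comment_text.split()),
        "has_url_or_email": int(bool(URL_OR_EMAIL.search(comment_text))),
        "smiley_polarity": sum(v for s, v in SMILEYS.items() if s in comment_text),
    }
```

For instance, a short comment from the question's author ("thanks!") is unlikely to be a good answer, while a comment containing a URL often is.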
In Subtask B, a related question has a metadata field that shows its relative rank in an external search engine by considering its similarity with the original question. We use the relative rank as a feature for subtask B.

Feature Combination
Since both Subtask A and Subtask B are ranking problems, we learn gradient boosted regression trees with XGBoost (Chen and Guestrin, 2016) as ranking models to combine all the features. The ranking models are learned by minimizing a pairwise loss on the training instances provided by the subtasks.

Data Sets and Evaluation Metrics
We used the data sets provided by SemEval-2017 (Nakov et al., 2017). Table 1 gives the statistics. We employed Mean Average Precision (MAP), Average Recall (AveRec), and Mean Reciprocal Rank (MRR) as evaluation metrics.
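MAP and MRR can be sketched in a few lines of pure Python over binary relevance labels in ranked order (the official scorer also computes AveRec, omitted here):

```python
def average_precision(ranked_labels):
    """AP over a single ranked list of 0/1 relevance labels."""
    hits, score = 0, 0.0
    for k, rel in enumerate(ranked_labels, 1):
        if rel:
            hits += 1
            score += hits / k  # precision at each relevant position
    return score / max(hits, 1)

def mean_average_precision(queries):
    return sum(average_precision(q) for q in queries) / len(queries)

def mean_reciprocal_rank(queries):
    """Average of 1/rank of the first relevant item per query."""
    total = 0.0
    for q in queries:
        for k, rel in enumerate(q, 1):
            if rel:
                total += 1.0 / k
                break
    return total / len(queries)
```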

Parameter Tuning and Feature Selection
We tuned parameters according to the average MAP on 5-fold cross validation (CV) with a grid search. Three sensitive parameters of XGBoost need tuning in training, namely gamma, subsample, and colsample_bytree. The best parameters for the two subtasks are shown in Table 2. We adopted Adagrad (Duchi et al., 2011), a stochastic gradient descent method, to optimize the neural network models. To prevent overfitting, we used early stopping (Lawrence and Giles, 2000) and dropout (Srivastava et al., 2014) with a rate of 0.5. In the bi-LSTM and the 2D matching network (2D MN), the dimensionality of word embeddings is 200. Word embeddings were initialized with word2vec (Mikolov et al., 2013) trained on the unannotated Qatar data (Nakov et al., 2017) and updated during training. We set the initial learning rate and batch size to 0.001 and 30 respectively. The other parameters of the two models are listed in Table 3.
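For reference, a single Adagrad update can be sketched as follows (scalar toy parameters; the real optimizer applies this per-coordinate to all network weights, with the learning rate of 0.001 used above):

```python
import math

def adagrad_update(params, grads, accum, lr=0.001, eps=1e-8):
    """One Adagrad step: divide each parameter's step by the root of
    its accumulated squared gradients, so frequently-updated parameters
    get smaller steps."""
    for k in params:
        accum[k] += grads[k] ** 2
        params[k] -= lr * grads[k] / (math.sqrt(accum[k]) + eps)

params, accum = {"w": 1.0}, {"w": 0.0}
adagrad_update(params, {"w": 2.0}, accum)
```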
We conducted feature selection by 5-fold CV to filter out useless features for the two subtasks. We first used all features and obtained an MAP on 5-fold CV, then removed the features one by one and checked how the MAP changed. If the MAP increased significantly after removing a feature, we dropped that feature. As a result, we preserved all features for Subtask A and removed the neural matching features for Subtask B. Details of feature contributions are described in Section 3.5.
Apart from the primary submission, we also submitted two contrastive runs. The only difference among them is the parameter setting of XGBoost. In the primary submission, we selected the parameters with which our system achieved the best performance on 5-fold CV, while in the two contrastive submissions, we selected the two parameter combinations corresponding to the smallest and the second smallest variance of MAP over 5 runs.

Baseline
We selected the relative rank provided by the Google search engine as a baseline method, denoted as the IR baseline.

Overall results
We show the primary and contrastive results of Subtask A and Subtask B in Table 6 and Table 7 respectively. There is no significant difference between the primary and the contrastive results on the task of answer selection. Our improvement on Subtask B is smaller, only 3 points of MAP, because we only use shallow features on this task and the neural matching features are useless according to our experiments. There are two reasons why neural matching fails on this task: (1) the training data provided by SemEval-2017 is too small to train a neural network, and our external data only consists of question-answer pairs, which does not support learning neural networks for question-question similarity; (2) a question and its related question often share most of their words and differ only in a small proportion of function words. Neural matching models, however, are not good at capturing such differences.

Feature Contribution
We conducted ablation experiments on the training data with 5-fold CV and on the test data to examine the usefulness of the features. The conclusion is that traditional NLP features are effective on both subtasks, while neural matching features only improve system performance on Subtask A. In Table 4, we present the results on Subtask A, including our system with all features and with each feature excluded in turn. We observe that all features are useful on the training data, but the system achieves a better result on the test data if we exclude the word overlap feature. Neural matching features are important on Subtask A: with them we obtain a 5-point gain on the training data and a 3-point gain on the test data. Metadata features are also useful, indicating that they are a good complement to the similarity-based features.
In Table 5, we show the results of the ablation experiments on Subtask B. Neural matching features caused a performance drop on this task, so we did not include them in our submitted system. Although all the traditional NLP features are useful on the training data, the word overlap, tree kernel, and metadata features hurt performance on the test data. It is also worth noting that our system could be further improved on the test data if the metadata feature, i.e., the relative rank from Google, were excluded.

Conclusion
We developed a ranking system with neural matching features for Subtask A and Subtask B of SemEval-2017 Task 3. The system ranks second in Subtask A and fifth in Subtask B, which demonstrates its efficacy on answer selection and similar question retrieval.