bunji at SemEval-2017 Task 3: Combination of Neural Similarity Features and Comment Plausibility Features

This paper describes a text-ranking system developed by the bunji team for SemEval-2017 Task 3: Community Question Answering, Subtasks A and C. The goal of the task is to re-rank the comments in a question-and-answer forum such that comments useful for answering the question are ranked high. We propose a method that combines neural similarity features with hand-crafted comment plausibility features, and we model the inter-comment relationship using a conditional random field. Our approach obtained fifth place in Subtask A and second place in Subtask C.


Introduction
This paper explains the participation of the bunji team in SemEval-2017 Task 3 on Community Question Answering (CQA) (Nakov et al., 2017), Subtask A and Subtask C. The goal of the task is to re-rank the comments in a question-and-answer forum such that useful comments for answering the question are ranked high. Subtask A is extraction of relevant answers from comments in a question thread. Given a question and its comments, the system must re-rank the comments according to their relevance with respect to the question. Subtask C is extraction of relevant answers from comments in different question threads. Given a question (the original question), questions that are possibly related to the original question (the relevant questions) and comments to the relevant questions, the system must re-rank the comments according to their relevance with respect to the original question. Since the task is ranking, the primary metric is mean average precision (MAP).
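As an illustration of the primary metric, the following is a minimal sketch of average precision and MAP over ranked binary relevance labels. The function names are hypothetical, and details of the official scorer (such as a rank cut-off or the treatment of questions without any relevant comment) are not reproduced here.

```python
def average_precision(ranked_labels):
    """Average precision for one question.

    ranked_labels: 1 (relevant) or 0 (irrelevant), best-ranked comment first.
    Returns 0.0 when there is no relevant comment (an assumption of this sketch).
    """
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(ranked_label_lists):
    """Mean of the per-question average precision values."""
    if not ranked_label_lists:
        return 0.0
    total = sum(average_precision(labels) for labels in ranked_label_lists)
    return total / len(ranked_label_lists)
```

A ranking that places relevant comments first for every question yields a MAP of 1.0; pushing a relevant comment down the list lowers the score.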
Our model consists of three elements: similarity features, comment plausibility features, and a supervised scoring method that models the inter-comment relationship. The similarity features are designed to capture the similarities between a question and a comment, because a valid answer should be on the same topic as the question. Similarity features were utilized by many teams in SemEval-2016. In this work, we take a deep learning approach to extract similarity features. The comment plausibility features are designed to capture characteristics that relevant answers tend to have. A similar concept was proposed by Mihaylova et al. (2016), who tried to model readability, credibility, sentiment and trollness. The comment plausibility features were hand-crafted to incorporate human knowledge about CQA.
In past CQA tasks, some teams incorporated the inter-comment relationship. An example of such a relationship is acknowledgement, where a good answer is likely to be followed by an acknowledgement from the questioner. Barrón-Cedeño et al. (2015) modeled the inter-comment relationship by taking the distance to the nearest acknowledgement as a feature and using a Conditional Random Field (CRF) to model the transition probability between relevant and irrelevant comments. In our work, we model the inter-comment relationship in a much simpler way: by concatenating features of adjacent comments and by utilizing a CRF as the final ranking function.

Method
Our proposed method is constructed in the following steps: (i) a neural network is trained to extract similarity features independently of the rest of the system, (ii) comment plausibility features are extracted with hand-crafted rules, (iii) neural similarity features and comment plausibility features are concatenated to form the combined features, and (iv) a CRF is optimized on the combined features while keeping the neural network fixed. We used an almost identical method for Subtask C. The differences between the systems for Subtask A and Subtask C are discussed in Section 2.4.

Neural Similarity Features
One of the challenges in the CQA task is that question and comment texts tend to be long. This makes the use of recurrent neural networks difficult, because they are known to be less effective on long sequences (Lai et al., 2015). In this work, we assume that only a very small region of a question and a comment is needed to decide whether the comment is relevant. For example, given a 62-word question, "... and would like to know the typical business dress code in Doha for Non Nationals. [Is it OK for men to wear short sleeve shirts?] For women; I am assuming the more conservative; ..." and a 50-word comment, "I agree with MR M; [its not much to worry of your dress].. its not an issue over here ;just be modest...", we only need the bracketed parts of the question and the comment to identify that the comment is relevant.
We propose a feature extraction method based on a decomposable attention model (Parikh et al., 2016). This method is designed to model the alignment between two sequences of text, allowing the system to jointly identify informative regions and predict whether the comment is relevant.
The overview of our neural network is shown in Figure 1. Each question-comment pair in a thread (one question and multiple comments) is mapped to a real-valued score using a decomposable attention model. The loss for stochastic gradient descent is calculated for each thread using a list-wise ranking loss.
As preprocessing, we remove HTML tags, apply tokenization and lowercase all characters. Named entities, image tags, URLs and numerics are each converted to special symbols. The question subject text is prepended to the corresponding question text. We truncate question and comment texts to the first 50 tokens.
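The preprocessing steps above can be sketched as follows. This is a minimal illustration rather than the original pipeline: the regular expressions, the symbol names `__img__`, `__url__` and `__num__`, and the simple tokenizer are all assumptions, and named entity replacement is omitted because it needs an external tagger.

```python
import re

# All patterns and placeholder symbols below are illustrative assumptions.
IMG_RE = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
NUM_RE = re.compile(r"\b\d+(?:[.,]\d+)*\b")
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def preprocess(text, subject="", max_tokens=50):
    """Return a list of at most max_tokens lowercased tokens."""
    if subject:
        text = subject + " " + text        # prepend the question subject
    text = IMG_RE.sub(" __img__ ", text)   # image tags -> special symbol
    text = TAG_RE.sub(" ", text)           # strip remaining HTML tags
    text = URL_RE.sub(" __url__ ", text)   # URLs -> special symbol
    text = NUM_RE.sub(" __num__ ", text)   # numerics -> special symbol
    tokens = TOKEN_RE.findall(text.lower())
    return tokens[:max_tokens]             # keep the first 50 tokens by default
```

Image tags are replaced before the generic tag stripping so that they are not lost, and the placeholder symbols contain no digits so that the numeric rule does not rewrite them.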
The $c$-th token of the $j$-th comment text ($1 \le j \le N$) in the $i$-th thread is then mapped to a word vector representation $x^C_{i,j,c} \in \mathbb{R}^M$, and the $q$-th token of the question text in the $i$-th thread to $x^Q_{i,q} \in \mathbb{R}^M$. The word vectors were pretrained on the raw forum text provided by the organizers, which contained approximately 100 million words. We only use the 50,000 most frequent words; the rest of the words are mapped to an averaged vector of the 50 least frequent words.
Each question-comment pair is mapped to a question-comment vector $z_{i,j}$ using the decomposable attention model. First, following Parikh et al. (2016), the model compares and calculates attention $e_{i,j,c,q}$ for each token combination,
\[ e_{i,j,c,q} = F(x^C_{i,j,c})^\top F(x^Q_{i,q}), \]
where $F$ is a feed forward neural network. Then, the model extracts the subphrase of $x^C_{i,j}$ that is soft-aligned against $x^Q_i$ using the attention mechanism:
\[ \bar{x}^Q_{i,j,q} = \sum_{c} \frac{\exp(e_{i,j,c,q})}{\sum_{c'} \exp(e_{i,j,c',q})} \, x^C_{i,j,c}. \]
Then we compare the word vector to the soft-aligned subphrase and aggregate all the combinations:
\[ v^Q_{i,j} = \sum_{q} G([x^Q_{i,q}, \bar{x}^Q_{i,j,q}]), \]
where $G$ is a feed forward neural network and $[\cdot, \cdot]$ denotes the concatenation of vectors. This is calculated vice versa for $v^C_{i,j}$. Finally, we map $v^Q_{i,j}$ and $v^C_{i,j}$ to the representation $z_{i,j}$ and to a score $y_{i,j} \in \mathbb{R}$ using a feed forward neural network $H$, an activation function $\sigma$, and model parameters $W$ and $b$. The representation $z_{i,j}$ is used as the neural features, which are combined with the comment plausibility features to form our final model.
The scores $y_i = \{y_{i,j}\}$ are optimized to predict the ground truth label sequence $t_i = \{t_{i,j}\}$ with the permutation probability loss (Cao et al., 2007). A ground truth label is set to 1 if the comment is labeled "Good" and to 0 if it is labeled "PotentiallyUseful" or "Bad", in accordance with the task rules (Nakov et al., 2017). We use the $k = 1$ permutation probability distribution function $P$:
\[ P(y_i)_j = \frac{\exp(y_{i,j})}{\sum_{j'=1}^{N} \exp(y_{i,j'})}. \]
The permutation probability loss is defined as $D_{KL}(P(t_i) \,\|\, P(y_i))$, where $D_{KL}$ is the Kullback-Leibler divergence between two distributions.
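The soft-alignment and aggregation steps of the decomposable attention model described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `F` and `G` are passed in as arbitrary callables standing in for the feed forward networks, the function names are hypothetical, and batching, dropout and the final network $H$ are left out.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_align(queries, keys, F):
    """For each query token, return the attention-weighted average of key tokens."""
    projected_keys = [F(k) for k in keys]
    dim = len(keys[0])
    aligned = []
    for query in queries:
        fq = F(query)
        weights = softmax([dot(fq, fk) for fk in projected_keys])
        aligned.append([sum(w * k[d] for w, k in zip(weights, keys))
                        for d in range(dim)])
    return aligned


def compare_and_aggregate(tokens, aligned, G):
    """Sum G([token, soft-aligned subphrase]) over all tokens."""
    outputs = [G(list(x) + list(a)) for x, a in zip(tokens, aligned)]
    return [sum(column) for column in zip(*outputs)]


def pair_vectors(question, comment, F, G):
    """Return (v_Q, v_C) for one question-comment pair of word-vector lists."""
    v_q = compare_and_aggregate(question, soft_align(question, comment, F), G)
    v_c = compare_and_aggregate(comment, soft_align(comment, question, F), G)
    return v_q, v_c
```

Because the aggregation is a sum over tokens, the resulting vectors have a fixed size regardless of the text length, which is what allows the model to avoid a recurrent encoder.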
Since the decomposable attention model and the permutation probability loss are fully differentiable, we can optimize the whole network with minibatch stochastic gradient descent and backpropagation. We use RMSProp with momentum (Graves, 2013) and reduce the learning rate by 1% every 100 batches. Dropout (Srivastava et al., 2014) and L2-norm regularization are applied to each layer of the feed forward neural networks to avoid overfitting. Batch normalization (Ioffe and Szegedy, 2015) is applied and the gradient norm is clipped to 5.0 to improve training stability. We use the leaky rectified linear unit for the activation function σ, as shown in Equation (8), to stabilize the training.
Other hyperparameters are shown in Table 1.
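For illustration, the list-wise training objective for a single thread can be sketched as below: the top-1 (k = 1) permutation probability of Cao et al. (2007) is a softmax over the scores of the thread, and the loss is the Kullback-Leibler divergence between the distributions induced by the labels and by the predicted scores. The function names are hypothetical, and this sketch computes only the loss value, not its gradient.

```python
import math


def top1_probabilities(scores):
    """Top-1 permutation probabilities, i.e. a softmax over one thread."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def permutation_probability_loss(labels, scores):
    """KL divergence D_KL(P(labels) || P(scores)) for one thread.

    labels: 1 for "Good", 0 otherwise; scores: predicted real values.
    Extremely large score gaps could underflow to a zero probability;
    that case is not handled in this sketch.
    """
    p_true = top1_probabilities(labels)
    p_pred = top1_probabilities(scores)
    return sum(p * math.log(p / q) for p, q in zip(p_true, p_pred))
```

The loss is zero when the scores induce the same distribution as the labels and grows as irrelevant comments receive higher scores than relevant ones.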

Comment Plausibility Features
Comment plausibility features are designed to extract information that is not captured by neural similarity features. These features are divided into five groups: (1) function of a comment, (2) answer adequacy, (3) dialog structure, (4) answerer's meta-information, and (5) miscellaneous. Part of speech tagging and named entity recognition for comment plausibility features are carried out using Stanford CoreNLP (Manning et al., 2014).

Function of a Comment
This group of 39 features is designed to capture the function of a comment; e.g. trying to answer the question, making remarks, or asking the questioner for more information. This group of features is extracted from each comment.
The occurrence of each word in Table 2 within each comment is extracted as a binary feature. We use the part of speech tags of the first and the final word of the comment. These are expressed as one-hot representations of whether the first/final word is a noun, adjective, adverb, verb, auxiliary verb, conjunction (for the final word only) or interjection. We also added a feature indicating whether the first word is "is". We also use the ratio of each part of speech tag to the number of tokens.

Answer Adequacy
This group of 27 features is designed to capture whether the comment has adequate information to answer the question. For this purpose, this group of features is extracted from each question-answer pair.
The presence of each word (what, which, who, where, when, why, whom, how, hi, and any of do, does, or did) within a question is extracted as a binary feature. The presence of each type of named entity (location, person, organization, money, percent, date and time) and the presence of any type of named entity, numerics, image tags and URLs in each comment are also extracted.
The relative length of a comment to a question is also extracted. This is based on the idea that the answer tends to be long when the question is long. The relative length is calculated in six variants, i.e., the number of words/characters in a comment divided by (i) the total number of words/characters in the question and the comment, (ii) the total number of words/characters in the question subject, the question text and the comment text, and (iii) the total number of words/characters in the question subject and the comment text.
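The six relative length variants can be sketched as follows. This is an illustration under assumptions: words are counted by whitespace splitting, "the question" in variant (i) is taken to be the question text, and the function name is hypothetical.

```python
def count_words(text):
    return len(text.split())


def relative_length_features(subject, question, comment):
    """Return six ratios: word-based (i)-(iii), then character-based (i)-(iii)."""
    features = []
    for measure in (count_words, len):
        c = measure(comment)
        q = measure(question)
        s = measure(subject)
        for denominator in (q + c,       # (i) question and comment
                            s + q + c,   # (ii) subject, question text and comment
                            s + c):      # (iii) subject and comment
            features.append(c / denominator if denominator else 0.0)
    return features
```

Each ratio lies between 0 and 1, so comments that are long relative to the question approach 1 regardless of the absolute lengths involved.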

Dialog Structure
This group of four features is designed to capture the dialog structure of comments. For this purpose, this group of features is extracted for each comment using the whole thread. Dialog structure features include binary features for each of the following statements: (i) If the comment is posted by the question author. (ii) If the comment contains the name of the question author. (iii) If the comment contains the name of a user other than the question author (the comment contains a string with an "@" prefix). In addition, we use the reciprocal chronological order (e.g. 1/3 for the third comment) to capture the global position of a comment.
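A minimal sketch of the four dialog structure features is given below. The argument names, the case-insensitive substring match for the author name, and the regular expression for "@" mentions are assumptions made for illustration.

```python
import re

MENTION_RE = re.compile(r"@(\w+)")


def dialog_structure_features(comment_text, comment_author, question_author, position):
    """Return the four dialog structure features for one comment.

    position is the 1-based chronological index of the comment in its thread.
    """
    text = comment_text.lower()
    asker = question_author.lower()
    mentions = MENTION_RE.findall(text)
    return [
        float(comment_author == question_author),  # (i) posted by the asker
        float(asker in text),                      # (ii) names the asker
        float(any(m != asker for m in mentions)),  # (iii) "@" mention of another user
        1.0 / position,                            # reciprocal chronological order
    ]
```

For example, the third comment in a thread, written by the asker and addressing another user with "@", yields 1 for features (i) and (iii) and 1/3 for the position feature.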

Answerer's Meta-information
This group of two features is designed to capture the answerer's meta-information. For this purpose, this group of features is extracted for each comment using the whole dataset.
For example, whether a comment is written by the author of the question is important information because he or she hardly ever knows the answer.
Answerer's meta-information features are binary features for each of the following statements: (i) If the comment author is anonymous. (ii) If the comment author has posted a comment elsewhere in the dataset.

Miscellaneous
To further improve the performance, we adopted a lexicon of 23 words with the lowest semantic orientation in CQA and extracted the occurrence of these words from the comments. We also use the cosine distance between the term frequency-inverse document frequency (TF-IDF) vectors of a question and a comment. We use eight types of TF-IDF as listed in Table 3, each characterized by the document blocks (question subject, question text or comment text) that are compared and that are used to calculate IDF. We also use the presence of word overlap in the question-comment and the subject-comment pairs as binary features. Although these features are redundant to the neural similarity features, redundant features increase the overall performance by acting like an ensemble.
We use the cosine distance between the TF-IDF vector of a comment and an averaged TF-IDF vector of all comments in the thread. The same distance is also extracted against an averaged TF-IDF vector of all comments in the dataset. These features are intended to capture the amount of distinctive information that each comment contains.
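The TF-IDF cosine distance features can be illustrated with the following sketch over pre-tokenized texts. The particular IDF formula (logarithm of the inverse document frequency plus one) and the function names are assumptions; the eight variants in Table 3 would differ in which document blocks are passed in.

```python
import math
from collections import Counter


def idf_table(documents):
    """IDF per term from a list of token lists (weighting variant is an assumption)."""
    n = len(documents)
    document_frequency = Counter(t for doc in documents for t in set(doc))
    return {t: math.log(n / df) + 1.0 for t, df in document_frequency.items()}


def tfidf(tokens, idf):
    """Sparse TF-IDF vector as a dict; terms without an IDF entry get weight 0."""
    return {t: count * idf.get(t, 0.0) for t, count in Counter(tokens).items()}


def cosine_distance(u, v):
    """1 - cosine similarity of two sparse vectors; 1.0 if either vector is all zero."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)
```

A distance near 0 indicates that the two texts share their most distinctive terms, while a distance of 1 indicates no shared vocabulary.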

Combined features
The neural features and the comment plausibility features are concatenated to form the features of the Primary run for Subtasks A and C. The features are further extended by concatenating the features of the two comments before and the two comments after the target comment, resulting in concatenated features over five comments. This allows extending the dialog structure features (Section 2.2.3) without adding too many features, as described in Section 1. We use a first-order linear-chain CRF, regarding each comment as an observation and a thread as a sequence. Along with the concatenated features, the CRF allows non-local optimization of the inter-comment relationship. For example, the presence of "yes" after a good answer is likely to be an acknowledgement by the questioner. In this case, the effect of "yes" is conditioned on the label of the previous comment.
The CRF is trained using L-BFGS with an L1 regularization coefficient of 1.0 and an L2 regularization coefficient of 0.001. We use CRFsuite (Okazaki, 2007) as the implementation of the CRF.
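The concatenation over five comments can be sketched as follows. Padding the missing neighbors at the beginning and end of a thread with zero vectors is an assumption of this sketch, as is the function name.

```python
def concatenate_window(thread_features, window=2):
    """Concatenate each comment's features with those of its neighbors.

    thread_features: list of equal-length feature lists in chronological order.
    Neighbors outside the thread are replaced by zero vectors (an assumption).
    """
    if not thread_features:
        return []
    padding = [0.0] * len(thread_features[0])
    combined = []
    for j in range(len(thread_features)):
        row = []
        for offset in range(-window, window + 1):
            k = j + offset
            in_thread = 0 <= k < len(thread_features)
            row.extend(thread_features[k] if in_thread else padding)
        combined.append(row)
    return combined
```

With the default window of two, each output vector is five times as long as the input vector, and the target comment always occupies the middle slot.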
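As a rough illustration of this training setup, the sketch below uses the third-party python-crfsuite bindings for CRFsuite. Whether the original system used these bindings or the CRFsuite command line tool is not stated, so the wrapper, the feature naming scheme (`f0`, `f1`, ...), and the label strings are assumptions.

```python
# Regularization coefficients from the text: c1 is L1, c2 is L2.
CRF_PARAMS = {"c1": 1.0, "c2": 0.001}


def to_crfsuite_items(thread_features):
    """Convert dense feature vectors of one thread to CRFsuite-style dicts."""
    return [{"f%d" % k: float(value) for k, value in enumerate(vector)}
            for vector in thread_features]


def train_crf(threads, label_sequences, model_path):
    """Train a first-order linear-chain CRF with L-BFGS.

    threads: list of threads, each a list of feature vectors.
    label_sequences: list of label lists, e.g. ["Good", "Bad", ...].
    Requires the python-crfsuite package (imported lazily here).
    """
    import pycrfsuite

    trainer = pycrfsuite.Trainer(algorithm="lbfgs", verbose=False)
    trainer.set_params(CRF_PARAMS)
    for features, labels in zip(threads, label_sequences):
        trainer.append(to_crfsuite_items(features), labels)
    trainer.train(model_path)
```

Each thread is appended as one sequence, so the learned transition weights capture dependencies between the labels of adjacent comments.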

Modification for Subtask C
For the neural similarity features, hyperparameters were manually tuned for Subtask C as shown in Table 1. When training neural models for Subtask C, we added all the question-comment pairs from Subtask A to augment the data.
For the comment plausibility features, we applied greedy stepwise backward elimination using the SemEval-2016 test data as validation data. We tested the deletion of each feature and removed the feature whose deletion gave the best MAP improvement. We repeated the process until MAP no longer improved. The process removed the following features: (i) Presence of any of the words ⟨what, which, who, where, when, why, whom, how, hi⟩. (ii) The relative length of a comment (Section 2.2.2, (ii)). (iii) Reference to the question author (Section 2.2.3, (i) and (ii)). (iv) Answerer's meta-information.
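The greedy stepwise backward elimination can be sketched as follows. The `evaluate` callable is hypothetical: it is assumed to train the model on the given feature subset and return the validation MAP.

```python
def backward_elimination(features, evaluate):
    """Greedily remove features while the validation score improves.

    features: list of feature (or feature group) names.
    evaluate: hypothetical callable mapping a feature list to a score such as MAP.
    Returns the selected features and their score.
    """
    selected = list(features)
    best_score = evaluate(selected)
    while len(selected) > 1:
        # Score every candidate subset obtained by deleting one feature.
        candidates = [
            (evaluate([f for f in selected if f != removed]), removed)
            for removed in selected
        ]
        score, removed = max(candidates, key=lambda pair: pair[0])
        if score <= best_score:
            break  # no deletion improves the score any further
        selected.remove(removed)
        best_score = score
    return selected, best_score
```

Because every step retrains the model once per remaining feature, the cost grows quadratically with the number of features, and the selected set can over-fit the validation data when that data is small.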

Experiments
Our Primary submission was CRF with combined features. Contrastive 1 was CRF with only the comment plausibility features. Contrastive 2 was CRF with only the neural similarity features.
The official results for the 2017 test data are shown in Table 4. The Primary run ranked fifth in Subtask A and second in Subtask C.
The combined features (Primary) were much better than Contrastive 1 and 2 in Subtask A, as expected. The large increase of 1.29 in MAP score from Contrastive 1 to the Primary implies that the neural features and the comment plausibility features were capturing different aspects of the problem.
On the other hand, Contrastive 1 performed poorly in Subtask C. This was partially because similarity was more important in Subtask C, which contained many unrelated comments. Thus the neural similarity features performed much better than in Subtask A and the comment plausibility features did much worse. Another reason for Contrastive 1's poor performance may have been over-fitting to the development dataset, as implied by the large performance drop from the 2016 dataset (Table 5).
See Table 6 (Subtask A) and Table 7 (Subtask C). From the result, the comment plausibility features seem to work as a blacklist for comments that are unlikely to be an answer. For example, occurrences of "?", "do", "does", "did" and "what" all contribute to identifying a comment that is itself a question, which is less likely to be an answer.
Our neural similarity features performed worse than the previous application of a recurrent neural network to Subtask A (MAP score of 75.7 against our 71.4) and to Subtask C (MAP score of 47.2 against our 28.0) (Wu and Lan, 2016). The inferior performance may be due to the very large vocabulary of CQA, which caused the neural network to fall back to using only commonly appearing words in many cases. As a supporting observation, attention weights seem to localize on very few commonly appearing words instead of on more meaningful regions of text (Figure 2). Use of a sub-word vocabulary can help overcome this problem (Yoon Kim et al., 2016). Also, we manually tuned the hyperparameters of the neural network. Random search for better hyperparameters can improve the overall performance.
Figure 2: Visualization of attention ($\bar{e}^Q_{i,j,c,q}$ on left and $\bar{e}^C_{i,j,c,q}$ on right) in a failing case. Attention had concentrated on commonly appearing words rather than more informative regions.

Conclusions
This paper explains our participation in SemEval-2017 Task 3, Subtask A and Subtask C, which is a problem of ranking the comments in a community question answering forum according to their relevance to the question. We proposed a method that combines neural similarity features and comment plausibility features, and modeled the inter-comment relationship. Our approach obtained fifth place in Subtask A and second place in Subtask C.
For future work, we will improve the neural method so that it can better handle the large vocabulary of CQA. We will also incorporate systematic end-to-end tuning of both the feature selection and the neural method to deal with the over-fitting problem.