VectorSLU: A Continuous Word Vector Approach to Answer Selection in Community Question Answering Systems

Continuous word and phrase vectors have proven useful in a number of NLP tasks. Here we describe our experience using them as a source of features for SemEval-2015 Task 3, which consists of two community question answering subtasks: Answer Selection, for categorizing answers as potential, good, and bad with regard to their corresponding questions; and YES/NO inference, for predicting a yes, no, or unsure response to a YES/NO question using all of its good answers. Our system ranked 6th and 1st in the English answer selection and YES/NO inference subtasks respectively, and 2nd in the Arabic answer selection subtask.

To evaluate the effectiveness of continuous vector representations for community question answering (CQA), we focused on using simple features derived from vector similarity as input to a multi-class linear SVM classifier. Our approach is language independent and was evaluated on both English and Arabic. Most of the vectors we use are domain-independent.
CQA services provide forums for users to ask or answer questions on any topic, resulting in answers of highly variable quality (Màrquez et al., 2015). Searching for good answers among the many responses can be time-consuming for participants. This is illustrated by the following example of a question and subsequent answers.

Q: Can I obtain Driving License my QID is written Employee?
A1: the word employee is a general term that refers to all the staff in your company ... you are all considered employees of your company
A2: your qid should specify what is the actual profession you have. I think for me, your chances to have a drivers license is low.
A3: his asking if he can obtain. means he have the driver license.
Answer selection aims to automatically categorize answers as: good if they completely answer the question, potential if they contain useful information about the question but do not completely answer it, and bad if irrelevant to the question. In the example, answers A1, A2, and A3 are respectively classified as potential, good, and bad. The Arabic answer selection task uses the labels direct, related, and irrelevant.
YES/NO inference infers a yes, no, or unsure response to a question through its good answers, which might not explicitly contain yes or no keywords. For example, the inferred response to Q above is no, since its good answer A2 can be interpreted as a no answer to the question.
The remainder of this paper describes our features and our rationale for choosing them, followed by an analysis of the results, and a conclusion.

Table 1: Features used in our system.

Text-based features: text-based similarities; yes/no/probably-like words existing
Vector-based features: Q&A vectors; OOV Q&A; yes/no/probably-based cosine similarity
Metadata-based features: Q&A identical user
Rank-based features: normalized ranking scores
Our system analyzes questions and answers with a DkPro (Eckart de Castilho and Gurevych, 2014) uimaFIT (Ogren and Bethard, 2009) pipeline. The DkPro OpenNLP (Apache Software Foundation, 2014) segmenter and chunker tokenize and find sentences and phrases in the English questions and answers, followed by lemmatization with the Stanford lemmatizer. In Arabic, we only apply lemmatization, with no chunking, using MADAMIRA (Pasha et al., 2014). Stop words are removed in both languages.
As shown in Table 1, we compute text-based, vector-based, metadata-based and rank-based features from the pre-processed data. The features are used by a linear SVM classifier for the answer selection and YES/NO answer inference tasks. YES/NO answer inference is performed only on the good answers of YES/NO questions, predicting the majority class among yes and no answers, and unsure otherwise. SVM parameters are set by grid search and cross-validation.
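The majority-class rule for YES/NO inference can be sketched as follows (a minimal illustration under our reading of the rule, not the authors' actual code; `infer_yes_no` and its label inputs are hypothetical names):

```python
from collections import Counter

def infer_yes_no(good_answer_labels):
    """Infer a question-level yes/no/unsure response from the
    per-answer yes/no/unsure labels of a question's *good* answers:
    take the majority class among yes and no, and fall back to
    unsure on a tie or when no labels are available."""
    counts = Counter(good_answer_labels)
    yes, no = counts.get("yes", 0), counts.get("no", 0)
    if yes > no:
        return "yes"
    if no > yes:
        return "no"
    return "unsure"  # tie, or no labeled good answers

print(infer_yes_no(["yes", "no", "no"]))  # majority of good answers is "no"
```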
Text-based features These features are mainly computed using text similarity metrics that measure the string overlap between questions and answers: the Longest Common Substring measure (Gusfield, 1997) identifies uninterrupted common strings; the Longest Common Subsequence measure (Allison and Dix, 1986) and the Longest Common Subsequence Norm identify common strings with interruptions and text replacements; and the Greedy String Tiling measure (Wise, 1996) additionally allows reordering of the subsequences. Other measures which treat text as sequences of characters and compute similarities include the Monge Elkan Second String (Monge and Elkan, 1997) and Jaro Second String (Jaro, 1989) measures. A Cosine Similarity-type measure based on term frequency within the text is also used. Sets of (1-4)-grams from the question and answer are compared with the Jaccard coefficient (Lyon et al., 2004) and Containment measures (Broder, 1997). 1 Another group of text-based features identifies answers that contain yes-like (e.g., "yes", "oh yes", "yeah", "yep"), no-like (e.g., "no", "none", "nope", "never") and unsure-like (e.g., "possibly", "conceivably", "perhaps", "might") words. These word groups were determined by selecting the top 20 nearest neighbor words to the words yes, no and probably based on the cosine similarity of their Word2Vec vectors. These features are particularly useful for the YES/NO answer inference task.
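As an illustration of the n-gram overlap measures, here is a minimal sketch of the Jaccard coefficient and Containment computed over token n-gram sets (function names and the toy inputs are ours, not from the system):

```python
def ngrams(tokens, n):
    """Set of token n-grams of a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(q_tokens, a_tokens, n):
    """Jaccard coefficient |Q ∩ A| / |Q ∪ A| over n-gram sets."""
    q, a = ngrams(q_tokens, n), ngrams(a_tokens, n)
    if not q and not a:
        return 0.0
    return len(q & a) / len(q | a)

def containment(q_tokens, a_tokens, n):
    """Containment |Q ∩ A| / |Q|: how much of the question's
    n-grams reappear in the answer."""
    q, a = ngrams(q_tokens, n), ngrams(a_tokens, n)
    if not q:
        return 0.0
    return len(q & a) / len(q)

q = "can i obtain a driving license".split()
a = "you can obtain a driving license with your qid".split()
print(jaccard(q, a, 1))      # unigram Jaccard: 5 shared / 10 total = 0.5
print(containment(q, a, 2))  # bigram containment: 3 of 5 question bigrams
```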
Vector-based features Our vector-based features are computed from Word2Vec vectors (Mikolov et al., 2013a;Mikolov et al., 2013b;Mikolov et al., 2013d). For English word vectors we use the GoogleNews vectors dataset, available on the Word2Vec web site, 2 which has a 3,000,000 word vocabulary of 300-dimensional word vectors trained on about 100 billion words. For Arabic word vectors we use Word2Vec to train 100-dimensional vectors with default settings on a lemmatized version of the Arabic Gigaword (Linguistic Data Consortium, 2011), obtaining a vocabulary of 120,000 word lemmas.
We also use Doc2Vec, 3 an implementation of (Le and Mikolov, 2014) in the gensim toolkit (Řehůřek and Sojka, 2010). Doc2Vec provides vectors for text of arbitrary length, so it allows us to directly model answers and questions. The Doc2Vec vectors were trained on the CQA English data, creating a single vector for each question or answer. These are the only vectors that were trained specifically for the CQA domain.
We implemented a UIMA annotator that associates a Word2Vec word vector with each in-vocabulary token (or lemma). No vectors are assigned for out-of-vocabulary tokens. Another annotator computes the average of the vectors for the entire question or answer, with no vector assigned if all tokens are out of vocabulary.
We initially used the cosine similarity of the question and answer vectors as a feature for the SVM classifier, but we found that we had better results using the normalized vectors themselves. We hypothesize that the SVM was able to tune the importance of the components of the vectors, whereas cosine similarity weights each component equally. If the question or answer has no vector, we use a 0 vector. To make it easier for the classifier to ignore the vectors in these cases, we add boolean features indicating out of vocabulary, OOV Question and OOV Answer.
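The averaging and OOV handling can be sketched as follows (toy 3-dimensional vectors stand in for the 300-dimensional GoogleNews vectors; all names and values here are illustrative, not the system's actual code):

```python
# Hypothetical toy vocabulary; the real system uses 300-d GoogleNews vectors.
WORD_VECS = {"license": [0.1, 0.3, 0.5], "driving": [0.2, 0.1, 0.4]}
DIM = 3

def avg_vector(tokens):
    """Average the vectors of in-vocabulary tokens; return a zero
    vector plus an OOV flag when no token is in the vocabulary,
    so the classifier can learn to ignore the zero vector."""
    vecs = [WORD_VECS[t] for t in tokens if t in WORD_VECS]
    if not vecs:
        return [0.0] * DIM, True  # all tokens out of vocabulary
    avg = [sum(component) / len(vecs) for component in zip(*vecs)]
    return avg, False

vec, oov = avg_vector(["driving", "license", "qid"])  # "qid" is skipped
print(vec, oov)
print(avg_vector(["qatar"]))  # fully OOV: zero vector, flag set
```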
Even though the bag of words approach showed encouraging results, we found it to be too coarse, so we also compute average vectors for each sentence. For English, we also compute average vectors for each chunk. Then we look for the best matches between sentences (and chunks) in the question and answer in terms of cosine similarity, and use the pairs of (unnormalized) vectors as features. 4 More formally, given a question with sentence vectors {q_i} and an answer with sentence vectors {a_j}, we take as features the values of the vector pair (q̂, â) defined as:

(q̂, â) = argmax_{(q_i, a_j)} cos(q_i, a_j)

We also have six features corresponding to the greatest cosine similarity between the comment word vectors and the vectors for the words yes, Yes, no, No, probably and Probably. These features are more effective for the YES/NO classification task.
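The best-match selection described above can be sketched as follows, assuming sentence vectors are plain lists of floats (a minimal illustration, not the authors' implementation):

```python
import math

def cos(u, v):
    """Cosine similarity of two vectors; 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def best_pair(q_vecs, a_vecs):
    """Return the (question, answer) sentence-vector pair with the
    highest cosine similarity over all cross-product pairs."""
    return max(((q, a) for q in q_vecs for a in a_vecs),
               key=lambda pair: cos(*pair))

q_vecs = [[1.0, 0.0], [0.0, 1.0]]
a_vecs = [[0.9, 0.1], [0.5, 0.5]]
q_hat, a_hat = best_pair(q_vecs, a_vecs)
print(q_hat, a_hat)  # [1.0, 0.0] matches [0.9, 0.1] most closely
```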

4 Post-evaluation testing showed no significant difference between using normalized or unnormalized vectors.
Metadata-based features As a metadata-based indicator, the Q&A identical user identifies if the user who posted the question is the same user who wrote the answer. This indicator is useful for detecting irrelevant dialogue answers.
Rank-based features We employ SVM Rank 5 to compute ranking scores of answers with respect to their corresponding questions. After generating all other features, SVM Rank is run to produce a ranking score for each possible answer. For training SVM Rank, we convert answer labels to ranks according to the following heuristic: good answers are ranked first, potential ones second, and bad ones third. The normalized ranking scores are then used as rank-based features to provide more information to the classifier, although they are also used on their own, without any other features, as explained in Section 3.
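The label-to-rank conversion can be sketched by writing SVMlight-style training lines of the form `<target> qid:<id> <index>:<value> ...`, the input format consumed by SVM Rank; in this format a larger target value marks a preferred item, so good answers receive the highest target. This is a sketch under those assumptions, not the authors' pipeline, and the function name is ours:

```python
# Larger target value = preferred answer in SVMlight-style ranking input,
# so the "good first, potential second, bad third" heuristic maps to 3/2/1.
RANKS = {"good": 3, "potential": 2, "bad": 1}

def svmrank_line(label, qid, features):
    """Format one answer as an SVMlight-style ranking example:
    target, query id, then 1-indexed feature:value pairs."""
    feats = " ".join(f"{i}:{v}" for i, v in enumerate(features, 1))
    return f"{RANKS[label]} qid:{qid} {feats}"

print(svmrank_line("good", 7, [0.3, 0.8]))  # -> "3 qid:7 1:0.3 2:0.8"
```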

Evaluation and Results
We evaluate our approach on the answer selection and YES/NO answer inference tasks. We use the CQA datasets provided by the SemEval-2015 task, which contain 2,600 training and 300 development questions and their corresponding answers (a total of 16,541 training and 1,645 development answers). About 10% of these questions are of the YES/NO type. We combined the training and development datasets for training purposes. The test dataset includes 329 questions and 1,976 answers. About 9% of the test questions are of the YES/NO type.
We also evaluate our performance on the Arabic answer selection task. The dataset contains 1300 training questions, 200 development questions, and 200 test questions. This dataset does not include YES/NO questions.
English answer selection Our approach for the answer selection task in English ranked 6th out of 12 submissions and its results are shown in Table 2. VectorSLU-Primary shows the results when we include all the features listed in Table 1, while VectorSLU-Contrastive excludes the rank-based and text-based features. Interestingly, VectorSLU-Contrastive leads to better performance than VectorSLU-Primary. The lower performance of VectorSLU-Primary could be due to the high overlap of text-based features across classes, which can mislead the classifier. For example, A1, A2 and A3 (see Section 1) all have considerable word overlap with their question, while only A2 is a good answer. The last two rows of the table respectively show the best performance among all submissions and the majority class baseline that always predicts good.
Arabic answer selection Our approach for answer selection in Arabic ranked 2nd out of 4 submissions. Table 3 shows the results. In these experiments, we employ all features listed in Table 1 except for yes/no/probably-based features, since the Arabic task does not include YES/NO answer inference. Vectors were trained from the Arabic Gigaword (Linguistic Data Consortium, 2011). We found lemma vectors to work better than token vectors.
We computed ranking scores with SVM Rank for both VectorSLU-Contrastive and VectorSLU-Primary. In the case of VectorSLU-Contrastive, we used these scores to predict labels according to the following heuristic: the top scoring answer is labeled as direct, the second scoring answer as related, and all other answers as irrelevant. This decision mechanism is based on the label distribution in the training and development data, and proved to work well on the test data. However, for our primary submission we were interested in a more principled mechanism. Thus, in the VectorSLU-Primary system we computed 10 extra classification features from the ranking scores. These features provide prior knowledge about the relative ranking of answers with respect to their corresponding questions.
To compute these features, we first rank answers with respect to questions and then scale the resulting scores into the [0,1] range. We then use 10 binary features that indicate whether the score of each input answer lies in the range [0,0.1), [0.1,0.2), ..., [0.9,1), respectively. Note that each feature vector contains exactly one 1 and nine 0s.
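The binning step can be sketched as follows (placing a score of exactly 1.0 in the top bin is our assumption, since the stated intervals are half-open; the function name is ours):

```python
def score_bins(score, n_bins=10):
    """One-hot indicator for which [k/10, (k+1)/10) interval a
    normalized ranking score falls into. A score of exactly 1.0
    is placed in the last bin (our assumption for the edge case)."""
    assert 0.0 <= score <= 1.0
    idx = min(int(score * n_bins), n_bins - 1)
    return [1 if i == idx else 0 for i in range(n_bins)]

print(score_bins(0.37))  # indicator set for the [0.3, 0.4) bin
print(score_bins(1.0))   # top bin
```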
The last two rows of the table are related to the best performance and the majority class baseline that always predicts irrelevant.
English YES/NO inference For the indirect YES/NO answer inference task, we achieved the best performance, ranking 1st out of 8 submissions. Table 4 shows the results. VectorSLU-Primary and VectorSLU-Contrastive have the same definitions as in Table 2. Both approaches, with or without the text-based features, outperform the other submissions and the baseline that always predicts yes as the majority class. This indicates the effectiveness of the vector-based features.

Related Work
We are not aware of any previous CQA work using continuous word vectors. Our vector features were partly motivated by existing text-based features, taken from the QCRI baseline system, replacing text-similarity heuristics with cosine similarity. Other approaches to classifying answers can be found in the general CQA literature, such as Toba et al. (2014).

Conclusion

In summary, we represented words, phrases, sentences, and whole questions and answers in vector space, and computed various features from them for a classifier, for both English and Arabic. We showed the utility of these vector-based features for addressing the answer selection and YES/NO answer inference tasks in community question answering.