ECNU: Using Multiple Sources of CQA-based Information for Answers Selection and YES/NO Response Inference

This paper reports our submissions to the community question answering task in SemEval-2015, which consists of two subtasks: (1) predict the quality of answers to a given question as good, bad, or potentially relevant, and (2) identify a yes, no, or unsure response to a given YES/NO question based on the good answers identified in subtask 1. For both subtasks, we adopted a supervised classification method and examined the effects of heterogeneous features generated from community question answering data, such as bag-of-words, string matching, semantic similarity, answerer information, answer-specific features, and question-specific features. Our submitted primary systems ranked fourth and second on the two subtasks for the English data, respectively.


Introduction
Community Question Answering (CQA) systems such as Yahoo! Answers rely on users to provide answers (i.e., user-generated content) for the questions posted. Generally such systems are quite open, and the answers provided by users are not always of high quality. For example, a bad answer may present irrelevant opinions or issues, contain only URL links without a direct answer, or even be written informally. Therefore, in order to achieve a high-quality user experience and maintain high levels of adherence, it is critical to present high-quality answers and provide direct responses for users.
The CQA task in SemEval-2015 (Màrquez et al., 2015) provides a common platform for researchers to compare different approaches. This task consists of two subtasks: (1) subtask A is to classify the quality of answers as good, potential or bad, which is also known as answer quality prediction (Jeon et al., 2006; Agichtein et al., 2008); (2) subtask B is to infer the global answer to a YES/NO question as yes, no or unsure based on the individual good answers.
Most of the previous research on answer quality prediction has focused on extracting various features for use with ranking or classification methods (Surdeanu et al., 2011; Shah and Pomerantz, 2010), such as textual features (Agichtein et al., 2008; Blooma et al., 2010) including the length of an answer, the overlapped words between a question-answer (QA) pair, etc. Another kind of widely used feature is extracted from answerer profile information (Shah and Pomerantz, 2010), such as the number of best answers, the achieved levels and the earned points. However, such information is often not available in the real world. Moreover, a recent study (Toba et al., 2014) has taken question type into consideration when predicting answer quality.
In this paper, we built two classification systems, one for each task. For Task A, we extracted six types of features from multiple sources of CQA-based information to predict the answer quality, such as answer-, question- and answerer-specific information, surface word similarity and semantic similarity between the question-answer pair, etc. For Task B, the global answer to a YES/NO question is summarized only from the individual good answers identified by Task A. Specifically, we first built a classifier to predict Yes/No/Unsure labels for each predicted good answer, then we performed majority voting to summarize the global answer for each question.
The rest of this paper is structured as follows. Section 2 describes our systems, including features, algorithms, etc. Section 3 shows experiments on training data and results on test data. Finally, conclusions and future work are given in Section 4.

Our Systems
For both tasks we adopted supervised classification methods and extracted various features from multiple sources to predict answer quality and infer YES/NO response.

Data Extraction
The English data is extracted from the Qatar Living Forum and provided in XML format. Each data file consists of a list of question tags, where each question is followed by a list of answer tags for this question.
Each question or answer has a subject, a body, and a list of attributes from which we can extract significant features. For example, a question has the attributes question category (27 categories overall, e.g., Education, Cars, etc.), asker identifier and question type (GENERAL or YES/NO), while an answer also has an answerer identifier.
To obtain the complete content of a question or an answer, we merged the contents extracted from its subject and body. As an exception, if the subject is a substring of the body, or the subject of an answer starts with "RE:", we extracted the content from the body only.
Moreover, to reduce the influence of Not English answers on the subsequent classification, we filtered them out of the data. To discover such answers, we identified the unusual words in each answer by comparing its word set with an English vocabulary of 235,887 words from the NLTK words corpus. If the number of unusual words was over 10 and their ratio to the answer length was above 60%, we regarded the answer as Not English.
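The filtering rule above can be sketched as follows. This is a minimal illustration rather than our exact implementation: the function name, the letter-only tokenization, and the choice to count tokens rather than distinct words are assumptions, and the vocabulary is passed in as a plain set (in practice it would be the NLTK words corpus).

```python
import re

def is_not_english(answer, vocabulary, min_unusual=10, min_ratio=0.6):
    """Flag an answer as Not English using the two thresholds from the text.

    `vocabulary` is a set of lowercase English words. Counting tokens
    (not distinct words) and the regex tokenizer are assumptions.
    """
    tokens = re.findall(r"[a-z']+", answer.lower())
    if not tokens:
        return False
    unusual = [t for t in tokens if t not in vocabulary]
    # Both conditions must hold: more than 10 unusual words AND ratio above 60%.
    return len(unusual) > min_unusual and len(unusual) / len(tokens) > min_ratio
```

Requiring both thresholds keeps short answers with a few misspellings or proper nouns from being discarded.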

Pre-processing
After data extraction we performed the following preprocessing operations. Firstly, HTML character encodings are substituted by the actual characters (e.g., "&nbsp;" is converted into whitespace). Then HTML tags, URLs, emoticons, ending signatures and repeating punctuation are removed from the data. After that, we collected a slang list from the Internet and replaced the informal words with formal words (e.g., "u r" is converted into "you are"). For the processed data, we performed tokenization and POS tagging using the Penn Treebank tokenizer and the POS tagger in NLTK. The words are lemmatized using the WordNet-based lemmatizer implemented in NLTK.
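A minimal sketch of the cleaning steps is given below, using only the Python standard library. The two-entry slang dictionary and the regular expressions are illustrative assumptions. Emoticon and signature removal are omitted, and repeated punctuation is collapsed to a single mark, which is one possible reading of "removed".

```python
import html
import re

# Hypothetical two-entry slang list; the real list was collected from the Internet.
SLANG = {"u": "you", "r": "are"}

def preprocess(text):
    text = html.unescape(text)                         # decode HTML character references
    text = text.replace("\u00a0", " ")                 # non-breaking space -> plain space
    text = re.sub(r"<[^>]+>", " ", text)               # drop HTML tags
    text = re.sub(r"(https?://|www\.)\S+", " ", text)  # drop URLs
    text = re.sub(r"([!?.,])\1+", r"\1", text)         # collapse repeated punctuation
    words = [SLANG.get(w.lower(), w) for w in text.split()]
    return " ".join(words)
```

Tokenization, POS tagging and lemmatization would then be applied to the returned string.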

Features of Task A
We extracted six types of features from multiple sources of CQA-based information, i.e., bag-of-words (BoW) and answer-specific features (AS) from the answer, string matching (SM) and semantic similarity (SS) from the QA pair, answerer information features (AI) from the answerer profile, and question-specific features (QS) from the question.

Bag-of-Words for Answer (BoW)
We collected words from the training and development answer sets and adopted a binary BoW representation. To reduce data sparsity, we selected the words with frequency higher than four, resulting in 5,730 words.
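The representation can be sketched as below. The function names are illustrative, and treating "frequency" as the total token count over all answers (rather than document frequency) is an assumption.

```python
from collections import Counter

def build_vocabulary(tokenized_answers, min_freq=5):
    """Keep words whose frequency is higher than four (i.e., at least five)."""
    counts = Counter(tok for answer in tokenized_answers for tok in answer)
    return sorted(w for w, c in counts.items() if c >= min_freq)

def binary_bow(tokens, vocabulary):
    """Return a 0/1 vector with one entry per vocabulary word."""
    present = set(tokens)
    return [1 if w in present else 0 for w in vocabulary]
```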

Answer-Specific Features (AS)
For each answer, we extracted three answer-specific features. The first is answer length, which is computed at three levels, i.e., word, sentence and paragraph. We applied L1 normalization over the global answer set. To gain insight into the effect of answer length for each individual question, we also designed a length ratio feature that records the ratio of the length of each answer to the maximal answer length for the same question.
A good answer is generally supposed to answer a question explicitly instead of starting a new question or suggesting other ways of finding help. Therefore, the second feature is a binary feature that represents whether an answer contains a question mark or not. In addition, we manually collected eight words and phrases from the training set that express a suggestion (i.e., "suggest", "recommend", "advise", "try", "call", "you may", "may be", "you could"). The third feature is thus a binary feature that represents whether a given answer contains at least one of these suggestion words.
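The per-question length ratio and the two binary features can be sketched as follows. Only the word-level length is shown, and the globally normalized lengths are omitted. Matching the suggestion terms on whole surface words is an assumption: inflected forms such as "suggested" would need lemmatized input.

```python
import re

SUGGESTION_TERMS = ["suggest", "recommend", "advise", "try", "call",
                    "you may", "may be", "you could"]
_SUGGESTION_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in SUGGESTION_TERMS) + r")\b")

def answer_specific_features(answers):
    """answers: the raw answer strings posted for one question."""
    if not answers:
        return []
    lengths = [len(a.split()) for a in answers]
    longest = max(lengths) or 1  # avoid division by zero if all answers are empty
    features = []
    for text, n in zip(answers, lengths):
        features.append({
            "length_ratio": n / longest,
            "has_question_mark": int("?" in text),
            "has_suggestion": int(_SUGGESTION_RE.search(text.lower()) is not None),
        })
    return features
```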

String Matching between QA (SM)
The above two types of features are both extracted from the answer regardless of the question asked. In contrast, the string matching features consider the overlapped words in a given QA pair.
Word: This feature group records the proportions of co-occurring words in a QA pair, which are calculated using six measures: |A∩B|/|A|, |A∩B|/|B|, |A−B|/|A|, |B−A|/|B|, |A∩B|/|A∪B| and 2*|A∩B|/(|A|+|B|), where |A| and |B| denote the number of distinct words in question A and answer B. Since the same word can take different forms in different contexts, normalizing words may yield more accurate overlap proportions, so we computed each measure on three word forms: original, lemmatized and stemmed.
POS: This feature is similar to the above word feature. We used three measures, |A∩B|/|A|, |A∩B|/|B| and |A∩B|/|A∪B|, to compute the overlap proportion of POS tags for nouns, verbs, adjectives and adverbs.
n-gram: Unlike the above two features measuring the overlap of single words or POS without considering multiple continuous words, the n-gram feature is to calculate the Jaccard similarity of overlapped n-grams between each QA pair. The n-grams are obtained at word level (n = 2, 3) and character level (n = 2, 3, 4). In addition, the n-grams at word level are obtained from original form and lemmatized form respectively.
Longest Common Sequence (LCS): The LCS feature is to measure the LCS similarity for a QA pair on the original and lemmatized form. It is calculated as the length of the LCS between each QA pair at word level divided by the length of question.
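The word overlap, n-gram and LCS features above can be sketched as follows. The function names are illustrative. Returning zeros for empty inputs is an assumption, as is reading LCS as the longest common subsequence (not substring).

```python
def overlap_measures(question_tokens, answer_tokens):
    """The six set-overlap measures, with A = question words and B = answer words."""
    a, b = set(question_tokens), set(answer_tokens)
    if not a or not b:
        return [0.0] * 6
    inter = len(a & b)
    return [
        inter / len(a),
        inter / len(b),
        len(a - b) / len(a),
        len(b - a) / len(b),
        inter / len(a | b),             # Jaccard
        2 * inter / (len(a) + len(b)),  # Dice
    ]

def ngrams(seq, n):
    """Set of n-grams over a token list (word level) or a string (character level)."""
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}

def ngram_jaccard(question_seq, answer_seq, n):
    q, a = ngrams(question_seq, n), ngrams(answer_seq, n)
    union = q | a
    return len(q & a) / len(union) if union else 0.0

def lcs_ratio(question_tokens, answer_tokens):
    """Length of the longest common subsequence divided by the question length."""
    prev = [0] * (len(answer_tokens) + 1)
    for q in question_tokens:
        cur = [0]
        for j, t in enumerate(answer_tokens, 1):
            cur.append(prev[j - 1] + 1 if q == t else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1] / len(question_tokens) if question_tokens else 0.0
```

The same functions would be applied to the original, lemmatized and stemmed token sequences to obtain the full feature group.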

Semantic Similarity between QA (SS)
The previous string matching feature only considers the overlapped surface words or substrings in a QA pair and it may not capture the semantic information between a QA pair. Therefore, we presented the following semantic similarity features, which are borrowed from previous work.
The semantic similarity of sentences is commonly determined using measures of semantic similarity between individual words. We used knowledge-based and corpus-based word similarity features. The knowledge-based similarity estimation relies on a semantic network of words such as WordNet. In this work, we employed four WordNet-based word similarity metrics: Path (Banea et al., 2012), WUP (Wu and Palmer, 1994), LCH (Leacock and Chodorow, 1998) and Lin (Lin, 1998) similarity. Following Zhu and Man (2013), the best alignment strategy and the aggregation strategy are employed to propagate the word similarity to the text similarity. Moreover, Latent Semantic Analysis (LSA) (Landauer et al., 1997) is a widely used corpus-based measure for evaluating textual similarity. We used the vector space sentence similarity proposed by Šarić et al. (2012), which represents each sentence as a single distributional vector by summing up the LSA vectors of the words in the sentence. In this work, two corpora are used to compute the LSA vectors of words: the New York Times Annotated Corpus (NYT) and Wikipedia.
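The vector space sentence similarity can be illustrated with the sketch below. The `word_vectors` dictionary stands in for the LSA vectors and is a toy placeholder. Skipping out-of-vocabulary words and comparing the two sentence vectors with cosine are assumptions.

```python
import math

def sentence_vector(tokens, word_vectors, dim):
    """Sum the vectors of all known words in the sentence."""
    total = [0.0] * dim
    for tok in tokens:
        vec = word_vectors.get(tok)
        if vec is not None:  # words without a vector are skipped (assumption)
            total = [x + y for x, y in zip(total, vec)]
    return total

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0
```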
In addition, following previous work, we adopted weighted textual matrix factorization (WTMF) (Guo and Diab, 2012) to model the semantic representations of sentences and then employed the new representations to calculate the semantic similarity between QA pairs using the Cosine, Manhattan, Euclidean, Pearson, Spearman and Kendall tau measures, respectively.
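Given two sentence representations of equal dimension, the six measures can be sketched in plain Python as below. This is only an illustration under stated assumptions: the Kendall coefficient is the simple tau-a variant without tie correction, and Spearman is computed as Pearson on average ranks. Library implementations may differ when ties are present.

```python
import math
from itertools import combinations

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def manhattan(u, v):
    return sum(abs(x - y) for x, y in zip(u, v))

def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def pearson(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((x - mu) * (y - mv) for x, y in zip(u, v))
    su = math.sqrt(sum((x - mu) ** 2 for x in u))
    sv = math.sqrt(sum((y - mv) ** 2 for y in v))
    return cov / (su * sv) if su and sv else 0.0

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(u, v):
    return pearson(ranks(u), ranks(v))

def kendall_tau(u, v):
    """Tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(u)), 2):
        s = (u[i] - u[j]) * (v[i] - v[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    pairs = len(u) * (len(u) - 1) / 2
    return (concordant - discordant) / pairs if pairs else 0.0
```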

Answerer Information (AI)
Previous work (Zhou et al., 2012) showed that information about the answerer has a great impact on answer ranking in CQA. Inspired by this work, we designed two answerer-specific features to represent the answerer level and the answerer's expert domain. To calculate the answerer level feature, we used the number of answers and the percentage of good answers for each answerer. For the expert domain feature, we employed the question categories in which the answerer is an expert. Specifically, for each answerer, let G be the number of good answers the answerer has given and G_i be the number of good answers given to the i-th question category (i ≤ 27). We then used G_i/G to measure the answerer's expert domain. Besides, for each of the 27 question categories (e.g., Education, Cars), we recorded the maximal value M_i over the G_i values of all answerers and then calculated the score G_i/M_i to measure the expert level of an answerer in the current domain relative to all answerers. In total, we adopted 54 features to indicate the expert domain of each answerer.
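The two groups of expert domain scores, G_i/G and G_i/M_i, can be computed as in the following sketch. The input format (one pair of answerer identifier and category index per good answer) and the function name are assumptions made for illustration.

```python
from collections import defaultdict

def expert_domain_features(good_answers, num_categories=27):
    """good_answers: iterable of (answerer_id, category_index), one per good answer.

    Returns, per answerer, the list [G_i/G for each i] + [G_i/M_i for each i].
    """
    counts = defaultdict(lambda: [0] * num_categories)
    for answerer, category in good_answers:
        counts[answerer][category] += 1
    # M_i: the largest G_i over all answerers for category i.
    max_per_category = [max((g[i] for g in counts.values()), default=0)
                        for i in range(num_categories)]
    features = {}
    for answerer, g in counts.items():
        total = sum(g)  # G > 0, since only answerers with a good answer appear here
        share = [gi / total for gi in g]
        relative = [gi / m if m else 0.0 for gi, m in zip(g, max_per_category)]
        features[answerer] = share + relative
    return features
```

With 27 categories each answerer receives 2 × 27 = 54 values. Answerers without any good answer do not appear in the result and would need a default vector, for example all zeros.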

Question-Specific Features (QS)
Since the domain of a question may also affect the performance of answer selection, we used 27 binary features to indicate the question category. In addition, we manually collected 9 question words (i.e., where, what, when, which, who, whom, whose, why and how) and used 9 binary features to indicate whether each of these question words occurs in the question.

Features of Task B
To address Task B, we performed two steps. Firstly, we extracted features from the good answers identified by Task A and trained a classifier to predict the Yes, No or Unsure label for each good answer. Secondly, for each given YES/NO question, we counted the Yes, No and Unsure answer labels and used majority voting to obtain the global answer.
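The second step can be sketched as a simple vote over the predicted labels. How ties and questions without any predicted good answer are resolved is not specified above, so falling back to Unsure in both cases is an assumption of this sketch.

```python
from collections import Counter

def global_answer(predicted_labels, fallback="Unsure"):
    """Majority vote over the Yes/No/Unsure labels of one question's good answers."""
    if not predicted_labels:
        return fallback  # no good answer was predicted for this question
    counts = Counter(predicted_labels)
    best = max(counts.values())
    winners = [label for label, c in counts.items() if c == best]
    return winners[0] if len(winners) == 1 else fallback  # ties fall back
```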
We used three types of features for this task, all of which are extracted from the answer: (1) Bag-of-Words from answer (BoW), the same as in Task A; (2) Semantic Word2Vec (W2V): this feature is a vector representation of the answer. We used the word2vec tool to obtain word vectors with dimension d = 300 and then summed up all the word vectors to obtain the answer vector. (3) Yes/No Word List (YN): we manually collected 50 affirmative words and 45 negation words by starting from several seed words (e.g., "yes", "sure", "definitely", "no", "seldom", "never", etc.) and then expanding the list by snowballing with the aid of WordNet synsets. Besides, several phrases were manually added to the list (e.g., "beyond a doubt", "beyond question", "not at all", "only just", etc.). We utilized 2 binary features to indicate whether an answer contains at least one of these affirmative words and at least one of these negation words, respectively.
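The two YN features can be sketched as follows. The lists below contain only the example words and phrases quoted above, not the full 50 and 45 entries. Assigning "not at all" and "only just" to the negation list, and matching on whole words, are assumptions.

```python
import re

# Example entries only; the assignment of the phrases to each list is assumed.
AFFIRMATIVE = ["yes", "sure", "definitely", "beyond a doubt", "beyond question"]
NEGATION = ["no", "seldom", "never", "not at all", "only just"]

def _contains_any(text, terms):
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
    return int(re.search(pattern, text.lower()) is not None)

def yes_no_features(answer):
    """Return [has_affirmative, has_negation] for one answer."""
    return [_contains_any(answer, AFFIRMATIVE), _contains_any(answer, NEGATION)]
```

Whole-word matching prevents "no" from firing inside words such as "know".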

Evaluation Measures
The official evaluation measure for both tasks is macro-averaged F1. For Task A the official score is calculated on three labels: Good, Bad and Potential (where Bad includes Dialogue, Not English and Other).
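For reference, macro-averaged F1 is the unweighted mean of the per-label F1 scores. The sketch below is an illustration rather than the official scorer. Scoring a label with no true positives as 0 is the usual convention and is assumed here.

```python
def macro_f1(gold, predicted, labels):
    """Unweighted mean of per-label F1 over the given label set."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, predicted) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, predicted) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)
```

Because every label carries the same weight, a system that ignores the rare Potential class is penalized heavily even if its accuracy is high.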

English Data Set
The English training and development sets together contain 2,900 questions with 18,186 answers, and the test set contains 329 questions with 1,976 answers, consisting of around 50% good, 40% bad and 10% potential answers. The YES/NO questions make up about 10% of all the questions, which means that there is much less data for Task B than for Task A. For both tasks we used the training set with 2,600 questions to build classifiers and validated the performance on the development set with 300 questions for algorithm comparison and feature selection.

Algorithm Choosing Experiments
We performed algorithm selection experiments using all the designed features. All the parameters of the algorithms are set to the default values in scikit-learn (Pedregosa et al., 2011). Table 1 lists the results of the preliminary algorithm comparison. We found that SVM with a linear kernel outperforms the other algorithms for both tasks. Moreover, we tuned the trade-off parameter c of the SVM, and with c set to 0.8 we obtained better scores of 54.78% and 58.82% for Task A and Task B, respectively. Therefore, in the following experiments on training and test data, we used SVM with a linear kernel.

Feature Comparison Experiments
We performed a series of experiments for both tasks to explore the effects of the various feature types using SVM (linear). In Task B we always chose the predicted good answers from the system with the best macro-F1 in Task A. First, for both tasks the most effective feature type is bag-of-words from the answer: this feature alone achieves 48.91% for Task A and 47.82% for Task B, both of which outperform the respective baseline systems provided by the organizers. The Task A baseline, which predicts all answers as good, achieves just 22.36%, and the Task B baseline, which predicts all answers as yes, achieves 25.0%. Moreover, in Task A the performance of each of the other five feature types alone is far lower than that of bag-of-words, ranging from approximately 23% to 38%.
Second, for Task A, the system achieves the best performance when all the features are combined, which indicates that every feature type makes some contribution. In particular, among the six types of features, answerer information and semantic similarity between QA pairs contribute more than the others. This indicates that answerer profile information is important, which is consistent with the findings of Zhou et al. (2012). Besides, semantic similarity captures a deeper relationship between the QA pair than surface words do, which helps to improve performance. In Task B, we observed similar findings, i.e., the system using all types of features achieves the best performance. Moreover, the YES/NO word list feature contributes substantially to the performance improvement, which is consistent with our expectation. Finally, although the word vector feature improves the performance in this work, the improvement is smaller than we expected. A possible reason is the simple way in which the word vectors are used, i.e., by only summing them up.

Results on Test Data
According to the above experiments on training data, we configured one primary and two contrastive systems for both tasks. The only difference between these systems lies in the features and the SVM parameters. Table 3 lists the configurations of the three systems and their corresponding results on test data. We also list the top three results officially released by the organizers.
Table 3: Configurations and results of our three submitted systems and the top three results. The numbers in brackets are the official rankings out of all submitted systems.
Our primary system ranked 4th out of 12 participants in Task A and 2nd out of 7 participants in Task B. For both tasks the performance of the primary system is higher than that of the two contrastive systems, which is consistent with the results on training data.

Conclusion
We built two supervised classification systems for answer selection and YES/NO response inference in CQA. Specifically, we extracted heterogeneous features from various information sources, i.e., the answer, the question, the answer-question pair and the answerer. Our experiments reveal that all of our designed features are effective and that the system achieves the best performance when all types of features are combined.
Although multiple features are extracted from CQA data, the way these features are used is quite simple. Besides, due to the huge number of bag-of-words features, the effects of the other specific features are weakened. For future work, we may explore other potentially useful features, such as the fine-grained semantic relationship between question and answer, and more advanced ways of integrating these features to further improve the performance.