UH-PRHLT at SemEval-2016 Task 3: Combining Lexical and Semantic-based Features for Community Question Answering

In this work we describe the system built for the three English subtasks of the SemEval 2016 Task 3 by the Department of Computer Science of the University of Houston (UH) and the Pattern Recognition and Human Language Technology (PRHLT) research center - Universitat Politècnica de València: UH-PRHLT. Our system represents instances by using both lexical and semantic-based similarity measures between text pairs. Our semantic features include the use of distributed representations of words, knowledge graphs generated with the BabelNet multilingual semantic network, and the FrameNet lexical database. Experimental results outperform the random and Google search engine baselines in the three English subtasks. Our approach obtained the highest results of subtask B compared to the other task participants.


Introduction
The key role that the Internet plays in today's society has fostered thousands of new social activities on the Web. Among those, forums have emerged with special relevance, following the paradigm of Community Question Answering (CQA). This type of social network allows people to post a question to other users of the community. Usage is simple, with few restrictions, and is infrequently moderated. The popularity of CQA is a strong indicator that users receive some good and valuable answers.
However, there are several issues related to this type of community. First, the large number of answers received makes it difficult and time-consuming for users to search for and distinguish the good ones. This is exacerbated by the amount of noise that these questions contain. It is not uncommon to have wrong or misleading answers that produce more unrelated answers, discussions, and sub-threads. Finally, there is a lot of redundancy, since questions may be repeated or closely related to previously asked questions.
Details of the SemEval 2016 Task 3 on CQA can be found in the overview paper (Nakov et al., 2016). In this work we evaluate the three English-related Task 3 subtasks on CQA. We first represent each instance to rank (question versus (vs.) comments, question vs. related questions, or question vs. comments of related questions) with a set of similarities computed at two different levels: lexical and semantic. This representation allows us to estimate the relatedness between text pairs in terms of what is explicitly stated and what it means. Our lexical similarities employ representations such as word and character n-grams, and bag-of-words (BOW). The semantic similarities include the use of distributed word bidirectional alignments, distributed representations of text, knowledge graphs, and frames from the FrameNet lexical database (Baker et al., 1998). This type of dual representation has been successfully employed for question answering by the highest performing system in the previous edition of this SemEval task (Tran et al., 2015). Other Natural Language Processing (NLP) tasks such as cross-language document retrieval and categorization have also benefited from similar representations (Franco-Salvador et al., 2014). In this task, if the question or comment includes multiple text fields, e.g. body and subject, similarities are estimated using all possible combinations (see Section 3.2). Finally, the ranking of instances is performed using a state-of-the-art machine-learned ranking algorithm: SVM rank.

Related Work
Automatic question answering has been a popular research topic in NLP, from the early days of the Internet to more recent work that incorporates voice interfaces (Rosso et al., 2012). The use of BOW representations made it possible to correctly answer 60% of the questions of the first large-scale question answering evaluation at the TREC-8 Question Answering track (Voorhees, 1999). More complex systems used inference rules to connect expressions between questions and answers (Lin and Pantel, 2001). Similarly, Ravichandran and Hovy (2002) employed bootstrapping to generate surface text patterns in order to successfully answer questions. Other works such as Buscaldi et al. (2010) are based on the redundancy of n-grams in order to find one or more text fragments that include tokens of the original question and the answer. Jeon et al. (2005) studied the semantic relatedness between texts for question answering. They used translation obfuscation to paraphrase the text and to detect which terms are closer in context. Probabilistic topic models have also been useful for detecting the semantics in this task. Celikyilmaz et al. (2010) used Latent Dirichlet Allocation (LDA) (Blei et al., 2003) for representing questions by means of latent topics.
The previous edition of the SemEval CQA task included two English subtasks (Nakov et al., 2015). The first one focused on classifying answers as good, bad, or potentially relevant with respect to one question. The second subtask answered a question with yes, no, or unsure based on the list of all answers. In addition, the first subtask was also available in Arabic. Several teams experimented with complex solutions that included meta-learning, external resources, and linguistic features such as syntactic relations and distributed word representations. Similar to our work, the highest performing approach employed a combination of lexical and semantic-based similarity measures (Tran et al., 2015). Its semantic features included the use of probabilistic topic models, translation obfuscation-based alignments, and pre-computed distributed representations of words generated with both the word2vec and GloVe toolkits. Its lexical features included BOW, word alignments, and noun matching. It employed a regression model for classification. Another interesting approach, Hou et al. (2015), included textual features, such as word lengths and punctuation, in addition to syntactic features based on Part-of-Speech (PoS) tags.
In this work we aim to differentiate our approach from the others by enhancing our ranking model with new similarity measures. These include the use of knowledge graphs obtained using the largest multilingual semantic network (BabelNet), frames from the FrameNet lexical database, and bidirectional distributed word alignments.

Lexical and Semantic-based Community Question Answering
In this section we detail the system that we designed for this CQA task. First, in Section 3.1 we describe our lexical and semantic-based features. Next, in Section 3.2 we detail the specific adaptation that we employed for each subtask and the ranking algorithm that we used. We note that all our features are similarity scores obtained with different text similarity measures. More details and examples can be found in their respective papers.

Feature Description
Our system exploits both the verbatim and the contextual similarities between texts, i.e., questions and comments. In Section 3.1.1 we detail our lexical features and in Section 3.1.2 our semantic-based features.

Lexical Features
The lexical features that we employed are the following:
• Cosine Similarity. We used cosine similarity to measure the lexical similarity between two text snippets. We calculated it based on word n-grams (n = 1, 2), character 3-grams, and tf-idf (Salton and McGill, 1986) scores of words.
• Word Overlap. We used the count of common words between two texts. This count was normalized by the length.
• Noun Overlap. We used NLTK to part-of-speech tag the text and computed the normalized count of overlapping nouns in two texts as a similarity measure.
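To make the lexical measures above concrete, the following is a minimal Python sketch, not the system's actual code. It computes cosine similarity over raw n-gram counts (the tf-idf weighting is left out for brevity) and a word overlap score. The function names are hypothetical, and normalizing the overlap by the length of the longer text is an assumption, since the exact normalization is not specified here.

```python
import math
from collections import Counter


def word_ngrams(tokens, n):
    """Return the list of word n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n=3):
    """Return the list of character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def cosine(items_a, items_b):
    """Cosine similarity between the count vectors of two item lists."""
    a, b = Counter(items_a), Counter(items_b)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def word_overlap(tokens_a, tokens_b):
    """Shared word types divided by the longer text length (assumed normalization)."""
    shared = set(tokens_a) & set(tokens_b)
    longest = max(len(tokens_a), len(tokens_b))
    return len(shared) / longest if longest else 0.0


question = "how do i renew my visa".split()
comment = "you can renew your visa online".split()
features = [
    cosine(word_ngrams(question, 1), word_ngrams(comment, 1)),
    cosine(word_ngrams(question, 2), word_ngrams(comment, 2)),
    cosine(char_ngrams(" ".join(question)), char_ngrams(" ".join(comment))),
    word_overlap(question, comment),
]
```

Each entry of `features` is one similarity score for the text pair; a noun overlap score could be obtained in the same way after filtering the tokens by their part-of-speech tags.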

Semantic Features
The semantic features that we employed are the following:
• Distributed representations of texts. We used the continuous Skip-gram model (Mikolov et al., 2013) of the word2vec toolkit to generate distributed representations of the words of the complete English Wikipedia. Next, for each text, e.g. question or comment, we averaged its word vectors in order to have a single representation of its content, as this setting has shown good results in other NLP tasks (e.g. for language variety identification (Franco-Salvador et al., 2015a) and discriminating similar languages (Franco-Salvador et al., 2015b)). Finally, the similarity between texts, e.g. question vs. comment, is estimated using the cosine similarity.
• Distributed word alignments. Word alignment strategies have been employed in the past for textual semantic relatedness (Hassan and Mihalcea, 2011). Tran et al. (2015) employed distributed representations to align the words of the question with the words of the comment. A more recent work introduced the Continuous Word Alignment-based Similarity Analysis (CWASA) (Franco-Salvador et al., 2016a). CWASA uses distributed representations to measure similarity by aligning the words of two texts in both directions. In this work we selected as a feature the similarity provided by CWASA between questions and comments.
• Knowledge graphs. A knowledge graph is a labeled, weighted, and directed graph that expands and relates the concepts belonging to a text. Knowledge Graph Analysis (KGA) (Franco-Salvador et al., 2016b) measures semantic relatedness between texts by means of their knowledge graphs. In this work we used the BabelNet (Navigli and Ponzetto, 2012) multilingual semantic network to generate knowledge graphs from questions and comments, and measured their similarity using KGA.
• Common frames. We used FrameNet (Baker et al., 1998) to extract the frames associated with the lexical items in the text. For each frame present in the text, we calculated the common lexical items between sentences associated with this frame. The goal is to allow inference of similarity at the level of semantic roles.
As an additional feature, for Subtasks A and C we also used the ranking provided by the Google search engine for the questions related to the original questions.
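The distributed representation of texts described above (average the word vectors of each text, then compare the averages with cosine similarity) can be sketched as follows. This is an illustrative sketch only: the two-dimensional `embeddings` dictionary holds made-up toy values standing in for Skip-gram vectors trained on Wikipedia, and all function names are hypothetical.

```python
import math


def average_vector(tokens, embeddings):
    """Average the vectors of the tokens that have an embedding."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def text_similarity(tokens_a, tokens_b, embeddings):
    """Cosine similarity between the averaged word vectors of two texts."""
    a = average_vector(tokens_a, embeddings)
    b = average_vector(tokens_b, embeddings)
    if a is None or b is None:
        return 0.0  # assumed fallback when a text has no known words
    return cosine(a, b)


# Toy vectors for illustration only (not real word2vec output).
embeddings = {
    "visa": [1.0, 0.0],
    "permit": [0.8, 0.6],
    "beach": [0.0, 1.0],
}
score = text_similarity(["visa", "permit"], ["permit", "beach"], embeddings)
```

Out-of-vocabulary words are simply skipped here, which is one possible design choice; how the original system handled them is not stated.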

Data Representation and Ranking
Because questions are composed of subject and body fields, whereas answers consist of a single comment field, we adapted our system to each of the English subtasks:
• Subtask A (question-comment similarity ranking): we used the aforementioned similarity-based features at three levels: question subject vs. comment, question body vs. comment, and full question vs. comment.
• Subtask B (question-related question similarity ranking): for this subtask we measured the similarities at body, subject, and full question level.
• Subtask C (question-external comment similarity ranking): we employed all the features of Subtasks A and B, plus the similarities of the original question (at subject, body, and full levels) with the comments of the related questions.
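The Subtask A representation above can be sketched as applying every similarity measure to each view of the question. The code below is a hypothetical illustration: the dictionary keys, the `jaccard` stand-in measure, and the choice of building the full question by concatenating subject and body are all assumptions rather than details of the actual system.

```python
def jaccard(text_a, text_b):
    """Toy stand-in similarity measure: Jaccard overlap of lowercased tokens."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def field_pair_features(question, comment, measures):
    """Apply every measure to subject, body, and full question vs. the comment."""
    views = {
        "subject": question["subject"],
        "body": question["body"],
        # Assumption: the full question is the subject followed by the body.
        "full": question["subject"] + " " + question["body"],
    }
    features = {}
    for view_name, text in views.items():
        for measure_name, measure in measures.items():
            features[f"{view_name}:{measure_name}"] = measure(text, comment)
    return features


question = {"subject": "visa renewal", "body": "how long does it take"}
features = field_pair_features(question, "visa renewal", {"jaccard": jaccard})
```

With k similarity measures this yields 3k features per question-comment pair. Subtasks B and C would extend the same idea to the fields of the related question and its comments.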
In order to rank the questions and comments, we selected a variant of Support Vector Machines (SVM) (Hearst et al., 1998) optimized for ranking problems: SVM rank (Joachims, 2002). In our evaluation in Section 4, we refer to our system by the combination of the acronyms of our member institutions: UH-PRHLT.
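As an illustration of how similarity features could be handed to SVM rank, the sketch below writes instances in the SVM-light style text format that, to our understanding, the tool reads: a target value, a `qid` that groups the instances of one ranking, and 1-based `index:value` feature pairs. The function name and the example values are hypothetical and are not taken from the system.

```python
def to_svmrank_line(relevance, query_id, features, comment=""):
    """Format one instance as '<target> qid:<id> <index>:<value> ... # <comment>'."""
    pairs = " ".join(
        f"{index}:{value:.6g}" for index, value in enumerate(features, start=1)
    )
    line = f"{relevance} qid:{query_id} {pairs}"
    return f"{line} # {comment}" if comment else line


# Two comments of the same question (qid 1); a higher target means more relevant.
lines = [
    to_svmrank_line(2, 1, [0.5, 0.25], "comment_1"),
    to_svmrank_line(1, 1, [0.125, 0.0], "comment_2"),
]
training_data = "\n".join(lines)
```

Only the relative order of the targets within one `qid` matters for a ranking learner of this kind, so graded labels and binary labels can both be encoded this way.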
Preprocessing steps included stopword removal, lemmatization, and stemming. However, we did not employ stemming for the distributed representation and knowledge graph-based features. These decisions were motivated by performance considerations during our prototyping.
Note that each subtask allows each team to submit three runs: primary, contrastive 1 (contr. 1), and contrastive 2 (contr. 2). We used a linear kernel and optimized the cost factor parameter using Bayesian optimization (Snoek, 2013). Our three runs differ only in the value of that parameter and correspond to the three best, and considerably distant, values. In addition to the ranking, the task also requires a label for each instance that reflects whether the question or comment is relevant to the compared question. For each subtask we optimized a threshold, based on our predicted relevance ranking, to determine the relevance of each instance. In other words, we binarize our ranking.
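The binarization step can be illustrated with a small threshold search. The sketch below assumes that the threshold is applied to the predicted ranking scores and that it is chosen to maximize F1 on held-out data; both are assumptions, since neither the thresholded quantity nor the selection criterion is specified above. All names are hypothetical.

```python
def f1_score(gold, predicted):
    """F1 of boolean predictions against boolean gold labels."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, gold):
    """Try each observed score as a threshold and keep the one with highest F1."""
    best, best_f1 = None, -1.0
    for candidate in sorted(set(scores)):
        predicted = [s >= candidate for s in scores]
        current = f1_score(gold, predicted)
        if current > best_f1:
            best, best_f1 = candidate, current
    return best, best_f1


# Toy development data: ranking scores and gold relevance labels.
scores = [0.9, 0.7, 0.4, 0.1]
gold = [True, True, False, False]
threshold, f1 = best_threshold(scores, gold)
labels = [s >= threshold for s in scores]
```

Instances scoring at or above the selected threshold are labeled relevant, which turns the ranking into the binary labels that the classification measures evaluate.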

Evaluation
This section presents the evaluation of the SemEval 2016 Task 3 on CQA. Details about this task, the datasets, and the three subtasks can be found in the task overview (Nakov et al., 2016). Note that for our system we did not use data from SemEval 2015 CQA as we did not observe gains in performance.
We compared the results of our approach with those provided by the random baseline and the Google search engine when ranking the questions and comments. The official measure of the task is the Mean Average Precision (MAP), but we also included two alternative ranking measures: Average Recall (AvgRec) and Mean Reciprocal Rank (MRR). In addition, we included four classification measures: Accuracy (acc.), Precision (P), Recall (R), and F1-measure (F1).
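For readers unfamiliar with the ranking measures, the following sketch computes MAP and MRR using their textbook definitions. It is a simplified illustration and not the official task scorer, which may differ in details such as rank cutoffs or the handling of questions without relevant items. The function names and toy rankings are hypothetical.

```python
def average_precision(ranked_labels):
    """Mean of the precision values at each rank that holds a relevant item."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(ranked_labels, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def reciprocal_rank(ranked_labels):
    """Inverse of the rank of the first relevant item (0 if there is none)."""
    for rank, relevant in enumerate(ranked_labels, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# Gold relevance of the items of two questions, listed in predicted rank order.
rankings = [[True, False, True], [False, True]]
map_score = mean(average_precision(r) for r in rankings)
mrr_score = mean(reciprocal_rank(r) for r in rankings)
```

In this toy case the first question has relevant items at ranks 1 and 3 and the second at rank 2, so both measures reward the system for placing relevant items near the top of each list.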

Results and Discussion
The best results per partition and subtask are highlighted in bold. In addition, we always refer to the run with the highest performance. Finally, our percentage comparisons always use absolute values. We can see the results of Subtask A (question-comment similarity ranking) in Table 1. In terms of ranking measures, our system outperformed both the random and the search engine baselines. Using the development set, we observed a MAP improvement of 9.4% compared with the results obtained by the search engine. We can see similar differences with respect to the other two ranking measures. Classification results are also superior. We obtain improvements in accuracy and F1 of 24.9% and 5.2% respectively. These results demonstrate the potential of the selected lexical and semantic-based features for this subtask.
Similar to Subtask A, the performance of our approach was also superior in Subtask B (question-related question similarity ranking).
As we can see in Table 2, using the development set, the improvements in MAP, AvgRec, and MRR were 4.6%, 5%, and 6.4% respectively compared to the search engine baseline. In this case, the similarity between questions was easier to estimate, also for the baselines, and the improvements in performance were slightly smaller. With respect to the classification measures, we outperformed the random baseline by 27.4% and 16.1% in accuracy and F1-measure respectively.
In Table 3 we can see the results of Subtask C (question-external comment similarity ranking). In this case, we are ranking 100 comments (10 times more than in the other subtasks). Therefore, this has been the most difficult subtask. However, we obtained improvements in line with those reported for the other subtasks. Compared to the search engine baseline, the MAP, AvgRec, and MRR improved by 8.7%, 8.5%, and 7.5% respectively when using the development partition. The accuracy and F1-measure improved by 61.5% and 12.2% respectively. The larger number of comments to rank, and the use of the top 10 results when measuring performance, benefited our approach with this especially high difference in accuracy.
After the analysis of results in the three English subtasks, we highlight that the combination of lexical and semantic-based features that we employ in this work offers a competitive performance for the CQA task. This is also true when comparing results with other task participants. Our approach obtained the highest results, by a considerable margin (1.04%), for subtask B. It is worth mentioning that we designed our system for subtask B and adapted it later for the other subtasks. However, for the other two subtasks, we obtained a low ranking position. At this point we have not discovered any coding error that could explain this difference. In addition, we analysed the information gain ratio of the features for the three subtasks. Those results showed an average decrease of ∼66% for subtasks A and C. Therefore, we conclude that our approach is more adequate for tasks of similarity rather than question answering. That analysis also shows that the most relevant features are the word n-gram ones, followed by the CWASA, distributed representation-based, and knowledge graph-based ones. The comparison of results of all the submitted systems and task participants can be found in the task overview (Nakov et al., 2016).

Conclusions
In this work we evaluated the three English subtasks of the SemEval 2016 Task 3 on CQA. In order to measure similarities, our proposed approach combined lexical and semantic-based features. We included simple yet effective representations based on BOW, and on character and word n-grams. We also employed semantic features which used distributed representations of words to represent documents or to directly measure similarity by means of distributed word bidirectional alignments. We also exploited knowledge graphs generated with the BabelNet multilingual semantic network. Experimental results showed that our system was able to outperform, with considerable differences, the random and Google search engine baselines in all the evaluated subtasks. In addition, our approach obtained the highest results in subtask B compared to the other task participants. This fact demonstrates the potential of our combination of lexical and semantic features for the CQA subtask.
As future work we will continue studying how to approach CQA with knowledge graphs and distributed representations. In addition, we will further explore how to employ this type of lexical and semantic-based representations for other NLP tasks such as plagiarism detection.