IIT-UHH at SemEval-2017 Task 3: Exploring Multiple Features for Community Question Answering and Implicit Dialogue Identification

In this paper we present our system for Answer Selection and Ranking in Community Question Answering, built as part of our participation in SemEval-2017 Task 3. We develop a Support Vector Machine (SVM) based system that makes use of textual, domain-specific, word-embedding and topic-modeling features. In addition, we propose a novel method for dialogue chain identification in comment threads. Our primary submission won subtask C, outperforming all other systems in every primary evaluation metric. We also performed well in the other English subtasks, ranking third in subtask A and eighth in subtask B. We release open source toolkits for all three English subtasks under the name cQARank [https://github.com/TitasNandi/cQARank].


Introduction
This paper presents the system built for participation in the SemEval-2017 Shared Task 3 on Community Question Answering (CQA). The task aims to classify and rank a candidate text c by its relevance to a target text t. Based on the nature of the candidate and target texts, the main task is subdivided into three subtasks, in which teams are expected to solve the problems of Question-Comment similarity, Question-Question similarity and Question-External Comment similarity (Nakov et al., 2017).

In this work, we propose a rich feature-based system for solving these problems. We create an architecture that integrates textual, semantic and domain-specific features to achieve good results on the proposed task. Due to the extremely noisy nature of the social forum data, we also develop a customized preprocessing pipeline rather than using standard tools. We use a Support Vector Machine (SVM) (Cortes and Vapnik, 1995) for classification, and its confidence score for ranking. We initially define a generic set of features to develop a robust system for all three subtasks, then include additional features based on the nature of each subtask. To adapt the system to subtasks B and C, we include features extracted from the scores of the other subtasks, propagating meaningful information that is essential in an incremental setting. We further propose a novel method for identifying dialogue groups in a comment thread by constructing a user interaction graph, and we incorporate features from this graph in our system. Our algorithm outputs mutually disjoint groups of users who are in conversation with each other in the comment thread.

The rest of the paper is organized as follows: Section 2 describes related work. Sections 3, 4, and 5 elucidate the system architecture, the features used and the algorithms developed. Section 6 provides experimentation details and reports the official results.

Related Work
In Question Answering, answer selection and ranking have been major research concerns in Natural Language Processing (NLP) over the past few years. The problem becomes more interesting for Community Question Answering due to the highly unstructured and noisy nature of the data. Domain knowledge also plays a major role in such an environment, where user metadata and context-based learning can capture trends well. The task on Community Question Answering in SemEval began in 2015, where the objective was to classify comments in a thread as Good, Bad or PotentiallyUseful. In subsequent years, the task was extended and modified to focus on ranking and on duplicate question detection in a cross-domain setting.
In their 2015 system, Belinkov (2015) used word vectors of the question and of the comment, various text-based similarities, and metadata features. Nicosia (2015) derived features from a comment in the context of the entire thread, and also modelled potential dialogues by identifying interlacing comments between users. Establishing similarity between questions and external comments (subtask C) is quite challenging; it can be tackled by propagating useful context and information from the other subtasks. Filice (2016) introduced an interesting approach of stacking classifiers across subtasks, and Wu and Lan (2016) proposed a method for reducing the errors propagated as a result of this stacking.

System Pipeline
The system architecture of our submission to subtask A is depicted in Figure 1. We explain the preprocessing pipeline in the next subsection. The cleaned data is fed into our supervised machine learning framework. We train our word-embedding model on the unannotated and training data provided by the organizers, and train a probabilistic topic model on the training data. A detailed description of the features is provided in the following section. After obtaining the feature vectors, we perform feature selection using wrapper methods to maximize accuracy on the development set. We Z-score normalize the feature vectors and feed them to an SVM. We tune the hyperparameters of the SVM and generate classification labels and probabilities, the latter being used for computing the MAP score.

Preprocessing Pipeline
Due to the highly unstructured nature of the data and its abundance of spelling and grammatical errors, adopting a standard tokenization pipeline was not well motivated. We instead customized the preprocessing according to the nature of the data. We unescape HTML special characters and remove URLs, emails, HTML tags, image description tags, punctuation and slang words (from a predefined dictionary). Finally, we expand apostrophe words (contractions) and remove stopwords. The cleaned data is used in all further experiments.
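The pipeline above can be sketched as follows. This is an illustrative sketch, not the exact implementation: the slang dictionary, stopword list and contraction map are small placeholders for the actual resources.

```python
import html
import re

# Placeholder resources; the real system used larger, curated lists.
SLANG = {"u", "ur", "lol"}
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of"}
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot"}

def preprocess(text: str) -> str:
    text = html.unescape(text)                          # unescape HTML entities
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"\S+@\S+", " ", text)                # remove emails
    text = re.sub(r"<[^>]+>", " ", text)                # remove HTML/image tags
    for contraction, expansion in CONTRACTIONS.items():  # expand apostrophe words
        text = text.replace(contraction, expansion)
    text = re.sub(r"[^\w\s]", " ", text)                # remove punctuation
    tokens = [t for t in text.lower().split()
              if t not in SLANG and t not in STOPWORDS]  # drop slang and stopwords
    return " ".join(tokens)

print(preprocess("Check &amp; visit <b>http://qatarliving.com</b>, it's great! lol"))
```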

Features
We use a rich set of features to capture the textual and semantic relevance between two snippets of text. These features are categorized into several broad classes:

String Similarity Features
This set of features makes use of a number of string matching algorithms to compute the string similarity between the question and the comment. This generates a continuous set of values for every comment and is apt for a baseline system. The bag of algorithms used is a careful combination of string similarity, metric distance and normalized string distance methods, capturing an overall profile of the texts. The string similarity functions used are Longest Common Subsequence (LCS), Q-Gram (q = 1,2,3), Weighted Levenshtein and Optimal String Alignment. The normalized similarity algorithms used are Jaro-Winkler, Normalized Levenshtein, n-gram (n = 1,2,3), cosine similarity (n = 1,2,3), Jaccard index (n = 1,2,3) and Sorensen-Dice coefficient (n = 1,2,3). The metric distance methods implemented are Levenshtein, Damerau and Metric LCS.
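As an illustration, a few of the listed measures can be implemented in plain Python as below; this is a sketch of two representative functions (character n-gram Jaccard and normalized Levenshtein), not the full bag of algorithms used in the system.

```python
def ngrams(text, n):
    """Set of character n-grams of a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b, n=2):
    """Jaccard index over character n-grams."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 1.0

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def normalized_levenshtein(a, b):
    """Edit distance rescaled to a similarity in [0, 1]."""
    m = max(len(a), len(b))
    return 1.0 - levenshtein(a, b) / m if m else 1.0

q = "how to get a work visa in qatar"
c = "getting a work visa for qatar"
print(jaccard(q, c), normalized_levenshtein(q, c))
```

Each such function contributes one continuous feature value per question-comment pair.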

Word Embedding Features
Semantic features constitute the core of our feature engineering pipeline. They try to capture the proximity between the meanings encoded in the word sequences of the question and the comments. We train word embeddings using Word2Vec (Mikolov et al., 2013) on the unannotated and the given training data. The unannotated data is in the same format as the training data, except that the comments are not annotated. We performed experiments with different vector sizes (N = 100, 200, 300) and finally settled on 100-dimensional word vectors. We also used a model pre-trained on the Google News dataset in order to compare the performance of the two models. Interestingly, the domain-specific model trained on the unannotated and training data proved better than the Google News model, so we used the former in building our final system. Since we want a feature vector corresponding to each comment in the thread, we transform the trained word vectors into sentence vectors. Two approaches were considered:
• Construct the sentence vector by taking the average of the vectors of all words that constitute the sentence.
• Construct the sentence vector as a weighted average of all the word vectors constituting the sentence, where the weight of a word is its Inverse Document Frequency (IDF) value in the thread.
Although the first approach has the evident disadvantage of not assigning extra importance to keywords in the sentence (which motivated the IDF-based weighted averaging), it yielded better results, so we included it in our final system. We extract two sets of features from these sentence vectors:
• The element-wise subtraction of the comment vector from the vector of the question at the head of the thread, used as the scoring vector for that comment.
• The cosine similarity, Euclidean distance and Manhattan distance between the question and comment vectors.
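The two sentence-vector constructions and the derived features can be sketched as follows. The toy two-dimensional embedding table stands in for the 100-dimensional Word2Vec vectors trained on the Qatar Living data.

```python
import math

# Toy embedding table; the real system used 100-dimensional Word2Vec vectors.
EMB = {"visa": [0.9, 0.1], "work": [0.7, 0.3], "permit": [0.8, 0.2]}

def avg_vector(tokens, emb):
    """Plain average of the word vectors of a sentence."""
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return [0.0] * len(next(iter(emb.values())))
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def idf_weighted_vector(tokens, emb, idf):
    """IDF-weighted average of the word vectors of a sentence."""
    pairs = [(idf.get(t, 1.0), emb[t]) for t in tokens if t in emb]
    total = sum(w for w, _ in pairs)
    if not pairs or total == 0:
        return [0.0] * len(next(iter(emb.values())))
    dim = len(pairs[0][1])
    return [sum(w * v[i] for w, v in pairs) / total for i in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

q_vec = avg_vector(["work", "visa"], EMB)
c_vec = avg_vector(["work", "permit"], EMB)
diff = [a - b for a, b in zip(q_vec, c_vec)]  # feature set 1: scoring vector
print(cosine(q_vec, c_vec))                   # feature set 2 (distances analogous)
```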

Topic Modeling Features
To capture the thematic similarity between the question and comment texts, we train an LDA topic model on the training data using Mallet (McCallum, 2002). We perform experiments varying the number of topics (n = 10, 20, 50, 100) and obtain the best performance with 20 topics. We generate a topic vocabulary of 50 words for each topic class. The following features are derived from these topic distributions and topic words:
• The vector subtraction of the question and comment topic vectors, measuring the topical distance between the two snippets of text.
• The cosine, Euclidean and Manhattan distances between the topic vectors.
• A vocabulary for each text, built by taking the union of the topic words of its 10 most probable topic classes:

V_T = vocab(t_1) ∪ vocab(t_2) ∪ ... ∪ vocab(t_10)

where each t_i represents one of the top 10 topic classes for comment or question T. We then determine the word overlap of the topic vocabulary of the question with (i) the entire comment string and (ii) the topic vocabulary of the comment.
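A minimal sketch of these topic features, with toy four-topic distributions standing in for the 20-topic LDA distributions inferred with Mallet:

```python
import math

def topic_features(tq, tc):
    """Distance features between question and comment topic distributions."""
    diff = [a - b for a, b in zip(tq, tc)]                      # vector subtraction
    cos = sum(a * b for a, b in zip(tq, tc)) / (
        math.sqrt(sum(a * a for a in tq)) * math.sqrt(sum(b * b for b in tc)))
    euclid = math.sqrt(sum((a - b) ** 2 for a, b in zip(tq, tc)))
    manhattan = sum(abs(a - b) for a, b in zip(tq, tc))
    return diff, cos, euclid, manhattan

def topic_vocab(top_topics, topic_words):
    """Union of the topic words of the text's most probable topic classes."""
    vocab = set()
    for t in top_topics[:10]:
        vocab |= topic_words[t]
    return vocab

tq = [0.6, 0.2, 0.1, 0.1]   # toy topic distribution for the question
tc = [0.5, 0.3, 0.1, 0.1]   # toy topic distribution for the comment
diff, cos, euclid, manhattan = topic_features(tq, tc)
print(cos, euclid, manhattan)
```

The word-overlap features then compare `topic_vocab` of the question against the comment string and against `topic_vocab` of the comment.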

Domain Specific Features
In CQA sites, comments in a thread typically reflect an underlying discussion about a question, and there is usually a strong correlation among nearby comments in the thread. Users reply to each other, ask for further details, acknowledge others' answers or tease other users. Therefore, as discussed in (Barrón-Cedeño et al., 2015), comments in a common thread are strongly interconnected. We extract various features from the metadata of the thread and from surface observations of the thread's structure and properties. One feature captures whether the comment is written by the asker of the question.
When the asker posts repeated comments, we monitor whether the comment is an acknowledgement (thanks, thank you, appreciate) or a further question. Under the plausible assumption that comments at the beginning of a thread are more relevant to the question, we include a feature capturing the position of the comment in the thread. We also compute the coverage (the ratio of overlapping tokens) of the question by the comment and of the comment by the question.

We further try to model explicit conversations among users in the thread, in two ways:
• Repeated and interlacing comments by a user in the same thread.
• Explicit mention of the name of a previous user in the comment.
The case of implicit dialogues (where the intent of the conversation has to be inferred solely from the context of the comment by a user) is discussed in a separate section later. These domain-specific features proved quite effective in classification and form an integral part of our system.
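A few of these thread-level features can be sketched as below. The thread dictionaries are a simplified stand-in for the Qatar Living thread format, and the acknowledgement word list mirrors the cue words mentioned above.

```python
# Cue words for detecting acknowledgements by the asker.
ACK_WORDS = {"thanks", "thank", "appreciate"}

def domain_features(thread, idx):
    """Extract thread-level features for the comment at position idx."""
    comment = thread["comments"][idx]
    by_asker = comment["user"] == thread["asker"]       # written by the asker?
    tokens = set(comment["text"].lower().split())
    is_ack = by_asker and bool(tokens & ACK_WORDS)      # acknowledgement?
    position = idx + 1                                  # position in the thread
    q_tokens = set(thread["question"].lower().split())
    coverage = len(q_tokens & tokens) / len(q_tokens)   # question coverage
    return {"by_asker": by_asker, "is_ack": is_ack,
            "position": position, "coverage": coverage}

thread = {
    "asker": "user1",
    "question": "best bank for salary account",
    "comments": [
        {"user": "user2", "text": "Try the bank near the corniche"},
        {"user": "user1", "text": "Thanks a lot, appreciate it"},
    ],
}
print(domain_features(thread, 1))
```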

Keyword and Named Entity Features
Finding the focus of the question and the comment is important for measuring whether the comment specifically covers the aspects of the question. We extract keywords from the texts using the RAKE keyword extraction algorithm (Rose et al., 2010) and derive features from the keyword match between question and comment. We also use the relative importance of common keywords as feature values. For factoid questions, and especially in subtask B, Named Entity Recognition becomes an important tool for computing the relevance of a text. We extract named entities using the Stanford Named Entity Recognizer (Finkel et al., 2005), which classifies words into seven entity categories: PERSON, LOCATION, ORGANIZATION, DATE, MONEY, PERCENT and TIME.
We compute whether both the question and the comment contain named entities, whether these belong to the same classes, and whether a named entity answers a Wh-type question.
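The overlap features can be sketched as follows. For simplicity, the keywords here are given directly rather than produced by RAKE, and the entity labels are given directly rather than produced by the Stanford NER.

```python
def keyword_overlap(q_keywords, c_keywords):
    """Fraction of the question's keywords that also appear in the comment."""
    common = set(q_keywords) & set(c_keywords)
    return len(common) / len(set(q_keywords)) if q_keywords else 0.0

def entity_features(q_entities, c_entities):
    """q_entities / c_entities: dicts mapping entity text -> NER class."""
    both_have = bool(q_entities) and bool(c_entities)
    same_class = bool(set(q_entities.values()) & set(c_entities.values()))
    return {"both_have_entities": both_have, "shared_entity_class": same_class}

print(keyword_overlap(["work visa", "qatar"], ["work visa", "sponsor"]))
print(entity_features({"Qatar": "LOCATION"}, {"Doha": "LOCATION"}))
```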

Implicit Dialogue Identification
Data-driven error analysis on the Qatar Living data indicated the presence of implicit dialogue chains: users were almost always engaging in conversations with each other, but this could often only be inferred from the content of their comments. Here we propose a novel algorithm based on the construction of a user interaction graph to model these potential dialogues. The components of our construction are as follows:
• Vertices - the users in the comment thread and the question (node indexed 0).
• Edges - weighted links between users, with weights computed from their comments (Algorithm 2).

Algorithm 1 Dialogue Identification
1: U ← user graph with only the question node Q (indexed 0); D ← empty dialogue graph
2: procedure DIALOGUE IDENTIFICATION
3:   for each comment cx in the thread do
4:     ui ← user who commented cx
5:     if ui is a new user then add ui to U and D end if
6:     for Q and each previous comment cy do
7:       uj ← user who commented cy
8:       if i ≠ j and eij does not exist in E(U) then
9:         eij ← COMPUTE WEIGHT(cx, cy, i, j); add eij to E(U)
10:      end if
11:    end for
12:    add the maximum-weight outgoing edge from ui to some previous user to E(D)
13:  end for
14:  return the weakly connected components of D
15: end procedure

The algorithm to construct this dynamic graph is given in Algorithm 1. We simultaneously construct two graphs, a user graph and a dialogue graph. Initially, the user graph contains only the question node and the dialogue graph is empty. We add new users to the graphs in order of the timestamps of their first comments in the thread. For each new comment, we add edges in the user graph from the commenting user to each previous user and to the question. We then pick the maximum-weight outgoing edge from the commenting user to some previous user, and add that edge to the dialogue graph. Finally, we find the weakly connected components (WCCs) of the dialogue graph; the users in each such WCC are in mutual dialogue. Note that the user graph at the end of each iteration depicts the current conversational interaction of the commenting user with respect to all other users in the thread.
Algorithm 2 Compute Weight Function
1: procedure COMPUTE WEIGHT(cx, cy, i, j)
2:   ui ← user who commented cx
3:   uj ← user who commented cy
4:   eij ← 0.0
5:   if user ui explicitly mentions user uj in the comment then
6:     eij ← eij + 1.0                                          ▷ Explicit dialogue
7:   end if
8:   cx → {w1, w2, ..., wk}                                     ▷ wm is the m-th word of cx
9:   cy → {w'1, w'2, ..., w'l}
10:  tr_score ← (Σ_{1≤m≤k} max_{1≤n≤l} cos(v_wm, v_w'n)) / k    ▷ Reformulation score
11:  tx ← topic vector for cx
12:  ty ← topic vector for cy
13:  to_score ← cos(tx, ty)                                     ▷ Topic similarity score
14:  eij ← eij + tr_score + to_score                            ▷ Edge weight
15:  return eij
16: end procedure

The core of the algorithm is the computation of the edge weight between a pair of users after each comment; see Algorithm 2 for details. Three components constitute the weight:
• whether the user mentions the other user explicitly;
• the score for reformulating one comment from the other, computed by closest word matching based on cosine similarities of word vectors (tr_score);
• the cosine similarity of the topic vectors of the pair of comments (to_score).
In addition to identifying latent dialogue groups, we also extract features from this graph, and these features prove to be very helpful in classification.
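The dialogue-graph step can be sketched in pure Python with a union-find structure standing in for WCC detection. This is a simplification of Algorithm 1: the question node is omitted, edge weights are supplied by a caller-provided function rather than computed by Algorithm 2, and a commenter is linked only when the strongest edge exceeds a threshold, which is our assumption rather than the paper's exact linking rule.

```python
def dialogue_groups(comments, weight, threshold=0.5):
    """comments: list of user ids in thread order.
    weight(ui, uj): edge weight from commenting user ui to earlier user uj
    (stand-in for Algorithm 2). Returns disjoint dialogue groups."""
    parent = {}

    def find(u):                       # union-find with path halving
        parent.setdefault(u, u)
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    def union(u, v):
        parent[find(u)] = find(v)

    for x, ui in enumerate(comments):
        find(ui)                                           # register the user
        prev_users = {uj for uj in comments[:x] if uj != ui}
        if prev_users:
            best = max(prev_users, key=lambda uj: weight(ui, uj))
            if weight(ui, best) >= threshold:
                union(ui, best)        # keep only the max outgoing edge
    groups = {}
    for u in list(parent):             # connected components = dialogue groups
        groups.setdefault(find(u), set()).add(u)
    return list(groups.values())

# Users 1 and 2 reply to each other, as do users 3 and 4; cross-pair
# weights are low, so two disjoint dialogue groups emerge.
w = {(1, 2): 0.9, (2, 1): 0.9, (3, 4): 0.8, (4, 3): 0.8}
print(dialogue_groups([1, 2, 3, 4, 1, 3], lambda a, b: w.get((a, b), 0.1)))
```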

Classifier
We use an SVM classifier as implemented in LibSVM (Chang and Lin, 2011). We experiment with different kernels (Hsu et al., 2003) and achieve the best results with the RBF kernel, which we use to train the model for our primary submission. We also achieve comparable results with the linear kernel and with L2-regularized logistic regression. The ranking score for a question-comment pair in subtask A is the computed probability of the pair being classified as Good.
The ranking score for subtask B is the SVM probability score for the original question-related question pair, multiplied by the reciprocal of the search engine rank provided in the data. For subtask C, the scoring value is the sum of the logarithms of the SVM probability scores from all three subtasks:

final_score = log(svm_A) + log(svm_B) + log(svm_C)
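The three ranking scores can be written down directly; the probability values and search engine rank below are illustrative, not taken from the data.

```python
import math

def rank_score_A(p_good):
    # Subtask A: SVM probability of the comment being Good.
    return p_good

def rank_score_B(p_related, search_rank):
    # Subtask B: SVM probability weighted by the reciprocal search engine rank.
    return p_related * (1.0 / search_rank)

def rank_score_C(svm_a, svm_b, svm_c):
    # Subtask C: final_score = log(svm_A) + log(svm_B) + log(svm_C)
    return math.log(svm_a) + math.log(svm_b) + math.log(svm_c)

print(rank_score_B(0.8, 2))
print(rank_score_C(0.9, 0.8, 0.7))
```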

Stacking features for other subtasks
We implemented a generic system for measuring the semantic similarity of any two snippets of text, and fine-tuned it with domain-specific features for subtask A. For subtasks B and C, we again adopted this generic system with slight modifications. However, the strong interconnectivity and incremental nature of the subtasks motivated a stacking strategy in which we propagate useful information from the other subtasks as features for the present subtask and re-run the classifier. Filice (2016) developed a stacking strategy that we adopt with modifications. For subtask B, we treat the scores for subtasks A and C as probability distributions and calculate various features and correlation coefficients (Spearman, Kendall, Pearson) over these distributions.
For subtask C, we calculate feature values from the SVM scores of all three subtasks and re-run our system with these stacking features. The features include the average, minimum and maximum of the subtask A and B scores, and binary features capturing whether these probability scores are above 0.5.
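The summary-statistic part of these stacking features can be sketched as below; the score lists are illustrative, and the correlation-coefficient features for subtask B are omitted here.

```python
def stacking_features(scores_a, scores_b):
    """Summary statistics over the SVM probability scores of subtasks A and B,
    used as stacking features for subtask C."""
    feats = {}
    for name, scores in (("A", scores_a), ("B", scores_b)):
        feats[f"avg_{name}"] = sum(scores) / len(scores)
        feats[f"min_{name}"] = min(scores)
        feats[f"max_{name}"] = max(scores)
        # Binary indicator: does any score exceed 0.5? (one possible reading
        # of the binary features described above)
        feats[f"above_half_{name}"] = int(max(scores) > 0.5)
    return feats

print(stacking_features([0.2, 0.6, 0.4], [0.9, 0.7]))
```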

Experimentation and Results
We experimented extensively with feature engineering. Notable features that were discarded in the feature ablation process are:
• Statistical Paraphrasing: We found the top 10 semantically related words for every word in the comment, based on word vectors, and did an n-gram matching (n = 1,2,3) on the extended wordlist.
• Doc2Vec: We also used Doc2Vec (Le and Mikolov, 2014) to generate sentence vectors directly, but these degraded the results.
Our primary submission for subtasks A and B uses an SVM with an RBF kernel for classification, as this yielded the best results on the development set. We achieved similar results with the linear kernel and L2-regularized logistic regression, and we use these for our contrastive submissions. All submissions use the same set of features. For subtask C, we oversample the training data using the SMOTE (Chawla et al., 2002) technique from the ImbalancedLearn toolkit [https://github.com/scikit-learn-contrib/imbalanced-learn], due to the highly skewed distribution of labels. We use regular SMOTE for our primary submission and SMOTE SVM for our first contrastive submission. For the second contrastive submission, we integrate the feature sets of subtasks A and B directly into the feature set of subtask C. The feature ablation results on the development set and the results of the different runs on the test set are presented in Table 1, which reports system performance on all evaluation metrics: Mean Average Precision (MAP), Average Recall (AvgRec), Mean Reciprocal Rank (MRR), Precision (P), Recall (R), F1-score (F1) and Accuracy.

Conclusion
We establish the importance of domain-specific and dialogue identification features in tackling the given task. In future work, we would like to focus on extracting more information from inter-comment dependencies, which should improve our algorithm for dialogue group detection and model conversational activity better. We also wish to explore a Deep Learning architecture for this, as in (Wu and Lan, 2016) and (Guzmán et al., 2016). The problem can also be modeled as a semi-supervised classification task, where the unannotated data aids supervised classification. Subtask C still presents a challenging research problem, and we will investigate novel methods to integrate results from the other subtasks to tackle it better.