PMI-cool at SemEval-2016 Task 3: Experiments with PMI and Goodness Polarity Lexicons for Community Question Answering

We describe our submission to SemEval-2016 Task 3 on Community Question Answering. We participated in subtask A, which asks to rerank the comments from the thread for a given forum question from good to bad. Our approach focuses on the generation and use of goodness polarity lexicons, similar to the sentiment polarity lexicons that are very popular in sentiment analysis. In particular, we use a combination of bootstrapping and pointwise mutual information to estimate the strength of association between a word (from a large unannotated set of question-answer threads) and the class of good/bad comments. We then use various features based on these lexicons to train a regression model, whose predictions we use to induce the final comment ranking. While our system was not very strong as it lacked important features, our lexicons contributed to the strong performance of another top-performing system.


Introduction
Online forums have been gaining a lot of popularity in recent years. In these forums, one can ask a question and, drawing on the wisdom of the crowd, expect to get some good answers. In practice, unless there is strong moderation, most such forums get populated with bad answers, which is annoying for users, as it takes time to read through all the answers in a long thread. As the research community recognized the importance of the problem, two shared tasks on Community Question Answering arose, at SemEval-2015 and SemEval-2016.
Here we describe the PMI-cool system, which we developed to participate in SemEval-2016 Task 3, subtask A, which asks to rerank the answers in a question-answer thread, ordering them from good to bad. As the name of our system suggests, our approach is heavily based on pointwise mutual information (PMI), which we use to estimate the association strength between a word and a class, e.g., the class of good or the class of bad comments. Based on this association strength, we perform bootstrapping in a large unannotated set of question-answer threads to generate specialized goodness polarity lexicons. We then use various features based on these lexicons to train a regression model, whose predictions we use to induce the final comment ranking.
While our PMI-cool system did not perform very well at the competition as it lacked important features and as we found a bug in our submission, our goodness polarity lexicons proved useful and contributed to the strong performance of another top-performing system at SemEval-2016 Task 3: SUper team (Mihaylova et al., 2016).

Method
Our solution can be separated into two phases: (i) feature extraction, and (ii) machine learning. The feature extraction phase consists of extracting various PMI-based and other features, which we describe in the following sections. In the second phase, we apply a support vector machine (SVM) regression model (Drucker et al., 1997), taking the features as an input and returning the similarity score for each question-answer pair as an output.
At test time, we generate regression scores for each answer in a question-answer thread and we rerank the answers accordingly. Before exploring our features, we will first introduce PMI and how we use it to generate goodness polarity lexicons.

Pointwise Mutual Information and Strength of Association
The pointwise mutual information (PMI) is a notion from information theory: given two random variables A and B, the mutual information of A and B measures the "amount of information" (in units such as bits) obtained about A through B (Church and Hanks, 1990). Let a and b be two values from the sample spaces of A and B, respectively. The pointwise mutual information between a and b is defined as follows:

pmi(a, b) = log [ p(a, b) / (p(a) · p(b)) ]

PMI is central to a popular approach for bootstrapping sentiment lexicons proposed by Turney (2002). The idea is to start with a small set of seed positive (e.g., excellent) and negative (e.g., bad) words, and then to use these seeds to induce the sentiment polarity orientation of new words in a large unannotated set of texts (in his case, product reviews). The intuition is that words that co-occur in the same text with positive seed words are likely to be positive, while those that tend to co-occur with negative seed words are likely to be negative. To quantify this intuition, Turney defines the notion of semantic orientation (SO) for a term w as follows:

SO(w) = pmi(w, pos) − pmi(w, neg)

where pos and neg stand for any positive and negative seed word, respectively.
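The Turney-style SO computation above can be sketched as follows. This is a minimal illustration on a toy corpus, not the authors' code: the function names, the add-one smoothing, and the treatment of each text as a set of words are our own simplifying assumptions.

```python
import math

def pmi(p_ab, p_a, p_b):
    """Pointwise mutual information: log2 of observed vs. expected co-occurrence."""
    return math.log2(p_ab / (p_a * p_b))

def semantic_orientation(word, texts, pos_seeds, neg_seeds):
    """Turney-style SO: PMI with positive seeds minus PMI with negative seeds.
    Co-occurrence means appearing in the same text; add-one smoothing avoids log(0)."""
    n = len(texts)
    def p_word(w):
        return (sum(1 for t in texts if w in t) + 1) / (n + 1)
    def p_joint(w, seeds):
        return (sum(1 for t in texts if w in t and seeds & t) + 1) / (n + 1)
    p_pos = (sum(1 for t in texts if pos_seeds & t) + 1) / (n + 1)
    p_neg = (sum(1 for t in texts if neg_seeds & t) + 1) / (n + 1)
    return (pmi(p_joint(word, pos_seeds), p_word(word), p_pos)
            - pmi(p_joint(word, neg_seeds), p_word(word), p_neg))

# Toy corpus: each "text" is a set of words (invented for illustration).
texts = [{"excellent", "helpful"}, {"excellent", "clear"},
         {"bad", "spam"}, {"bad", "thanks"}]
so = semantic_orientation("helpful", texts, {"excellent"}, {"bad"})  # positive
```

Note that in the SO difference the log p(w) term cancels, so only the ratio of the two joint probabilities actually matters.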
The idea was later used by other researchers, e.g., Mohammad et al. (2013) built several lexicons based on PMI between words and tweet categories. Here the categories (positive and negative) were defined by a seed set of emotional hashtags, e.g., #happy, #sad, #angry, etc. or by simple positive and negative smileys, e.g., ;), :), ;(, :(. In this case, the resulting lexicons included not only words, but also bigrams and discontinuous pairs of words. Another related work is that of Severyn and Moschitti (2015), who proposed an approach to lexicon induction, which, instead of using PMI for SO, assigns positive/negative labels to the unlabeled tweets (based on the seeds), and then trains an SVM classifier on them, using word n-grams as features. These n-grams are then used as lexicon entries with the learned classifier weights as polarity scores. While this is an interesting approach, in our experiments below, we will stick to PMI as a more established method to estimate SO.
Finally, there is a related task at SemEval-2016 on predicting the out-of-context sentiment intensity of phrases (Kiritchenko et al., 2016), but there the focus is on multiword phrases.

Features
We used a variety of features based on the textual content of the question and of the answer and on metadata about the question and about the question-answer pair.

Metadata features
All the metadata features we used are represented by their SO with respect to the good/bad classes. We used the following features:
• SameAuthor. This feature checks whether the target answer is given by the same user who asked the question.
The assumption here is that the author of the question is unlikely to provide a good answer to his/her own question. We do not use this boolean feature directly, but we use the SO between it and the good/bad classes.
• AnswerNumber. This is the rank of the answer (e.g., first, second, third, . . ., tenth). The assumption is that most discussions tend to degenerate and to lose focus over time. This is also visible in the baseline that ranks the answers based on their chronological rank, which performs better than random. The feature value is the SO of the answer rank and of the good/bad classes.
• AnswerAuthor. This is the ID of the person who gave the answer. The idea is that some users might tend to give mostly good/bad answers. Thus, the SO between the author ID and the good/bad classes is useful for user modeling.

Word PMI
The main feature of our PMI-cool system is based on the lexicon constructed by computing the SO of each word used in the answers in the training corpus. Using this technique, we identify words commonly used in good versus bad answers in general, regardless of the question; we used words without stemming, as stemming lowered the performance. For instance, bad answers often contain variants of thank-you statements, insults, words generally used in off-topic comments, interjections, etc. Table 1 shows some of the top words most strongly associated with bad answers in terms of SO.
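Building such a goodness polarity lexicon can be sketched as below. This is a hypothetical reimplementation on toy data, not the system's actual code; the `min_count` cutoff and add-one smoothing are our own assumptions, and the p(w) terms of the PMI difference cancel as before:

```python
import math
from collections import Counter

def build_goodness_lexicon(answers, labels, min_count=2):
    """SO(w) over word occurrences in labeled answers; rare words are
    dropped as their PMI estimates are unreliable."""
    good_counts, bad_counts = Counter(), Counter()
    for words, label in zip(answers, labels):
        (good_counts if label == "good" else bad_counts).update(words)
    n_good, n_bad = sum(good_counts.values()), sum(bad_counts.values())
    lexicon = {}
    for w in set(good_counts) | set(bad_counts):
        g, b = good_counts[w], bad_counts[w]
        if g + b < min_count:
            continue
        # add-one smoothing; log p(w) cancels in the PMI difference
        lexicon[w] = math.log2(((g + 1) / (n_good + 1)) / ((b + 1) / (n_bad + 1)))
    return lexicon

def answer_score(words, lexicon):
    """Feature value: sum of SO(w) over the words of the answer."""
    return sum(lexicon.get(w, 0.0) for w in words)

# Toy labeled answers (tokenized, invented for illustration).
answers = [["great", "info"], ["great", "try"], ["thanks", "lol"], ["lol", "spam"]]
labels = ["good", "good", "bad", "bad"]
lexicon = build_goodness_lexicon(answers, labels)
```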

Sentiment Lexicon
Another resource we used is the Sentiment140 lexicon, which was constructed by Mohammad et al. (2013) using SO for word weighting, as we mentioned above. Our assumption here is that good/bad sentiment expressed in the answer suggests good/bad answers, as previously suggested in (Nicosia et al., 2015). The feature we calculate is the sum of the sentiment scores of the sentiment-bearing words in the answer.

Bootstrapped PMI
As explained above, we used bootstrapping to induce larger goodness polarity lexicons from the large unannotated corpus provided for the task. For this purpose, we first used PMI to build a lexicon from the labeled training data, removing rarely occurring words and keeping the top and the bottom 5% of the rest, based on the SO score. Then, we used PMI again, using these words as good/bad seeds to generate the large lexicon. Unfortunately, due to implementation issues before the submission deadline, this feature was not included in the submitted system's feature set.
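The two-pass bootstrapping procedure can be sketched as follows. This is only an illustration of the idea on toy data: the seed-selection rule, the majority-overlap silver labeling, and all function names are our own assumptions, not the authors' implementation.

```python
import math
from collections import Counter

def so_lexicon(answers, labels):
    """Minimal PMI-based SO: smoothed log-ratio of a word's relative
    frequency in good vs. bad answers (the p(w) terms of PMI cancel)."""
    good, bad = Counter(), Counter()
    for words, label in zip(answers, labels):
        (good if label == "good" else bad).update(words)
    n_g, n_b = sum(good.values()) + 1, sum(bad.values()) + 1
    return {w: math.log2(((good[w] + 1) / n_g) / ((bad[w] + 1) / n_b))
            for w in set(good) | set(bad)}

def select_seeds(lexicon, fraction=0.05):
    """Top/bottom fraction of the labeled-data lexicon by SO score."""
    ranked = sorted(lexicon, key=lexicon.get)
    k = max(1, int(len(ranked) * fraction))
    return set(ranked[-k:]), set(ranked[:k])  # (good seeds, bad seeds)

def bootstrap(labeled, labels, unlabeled, fraction=0.05):
    """Pass 1: seed lexicon from labeled data. Pass 2: silver-label the
    unannotated answers by seed overlap and recompute SO on them."""
    good_seeds, bad_seeds = select_seeds(so_lexicon(labeled, labels), fraction)
    silver, silver_labels = [], []
    for words in unlabeled:
        g = len(set(words) & good_seeds)
        b = len(set(words) & bad_seeds)
        if g == b:
            continue  # skip ambiguous or seed-free answers
        silver.append(words)
        silver_labels.append("good" if g > b else "bad")
    return so_lexicon(silver, silver_labels)

# Toy data: two labeled answers and two unannotated ones.
labeled = [["great", "answer"], ["thanks", "lol"]]
gold = ["good", "bad"]
unlabeled = [["great", "info", "useful"], ["lol", "spam", "thanks"]]
final = bootstrap(labeled, gold, unlabeled, fraction=0.5)
```

The payoff is that words never seen in the labeled data (here "useful" and "spam") receive SO scores from the silver-labeled corpus.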

Ineffective Features
We further used some features that turned out to be ineffective. Still, we describe them here as we believe this might be useful for other researchers.

Personality Trait Features
We used the lexicons of Schwartz et al. (2013), which were designed to measure a user's big-five personality traits: openness to experience, conscientiousness, extraversion, agreeableness, and neuroticism. These lexicons were also computed using PMI and SO metrics between words in a large collection of user-generated content on social media: 75,000 Facebook user profiles with personality values and 700 million words, phrases, and topic instances. There are two lexicons for each trait, with the 100 most and the 100 least characteristic words for the target trait. We calculated a personality score for each of the five traits by summing the scores of all matching words in all answers written by the author of the answer, thus modeling his/her personality profile. However, this feature did not yield improvements.

Topic Features
The text features used in PMI-cool are based on words only. We also tried to build a Latent Dirichlet Allocation (LDA) model (Blei et al., 2003) on the question and on the answers and to add the resulting topic distributions as features. However, this did not help on the development dataset, and thus we did not use it in our final submission.

Experiments and Evaluation
For training the prediction model for good versus bad answers, we used an SVM with a linear kernel as implemented in LibLinear (Fan et al., 2008).
We treated each answer as a separate instance with all the above features, merging the PotentiallyUseful and the Bad labels under the bad class, and we ranked the answers based on the SVM score. Here is the list of features we experimented with:
• SO SameAuthor;
• SO AnswerNumber;
• SO AnswerAuthor;
• sum of SO(w) for the answer words;
• sum of the sentiment scores for the answer words;
• one feature for each personality trait: sum of the scores of the lexicon words for that trait in all posts by the answer's author;
• number of words with positive BootstrappedSO in the answer;
• number of words with negative BootstrappedSO in the answer;
• fraction of words with positive BootstrappedSO in the answer;
• fraction of words with negative BootstrappedSO in the answer;
• sum of the positive BootstrappedSO scores for the answer words;
• sum of the negative BootstrappedSO scores for the answer words;
• maximum BootstrappedSO value for a word in the answer;
• minimum BootstrappedSO value for a word in the answer;
• sum of the BootstrappedSO scores for all answer words.
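The train-then-rerank step can be illustrated as below. All feature values and the least-squares fit are our own toy stand-in for the actual SVM regression trained with LibLinear; only the shape of the pipeline (one feature vector per answer, rank by predicted score) reflects the description above.

```python
import numpy as np

# Toy training data: one row per answer, columns are a subset of the features
# listed above (SO SameAuthor, SO AnswerNumber, sum of word SO, sentiment sum).
# All values are invented for illustration.
X = np.array([[ 0.8,  0.5,  2.1,  0.3],
              [-1.2,  0.4, -1.5, -0.2],
              [ 0.6,  0.1,  1.8,  0.5],
              [-0.9, -0.3, -2.0, -0.4]])
y = np.array([1.0, 0.0, 1.0, 0.0])  # 1 = good answer, 0 = bad answer

# Least-squares linear model as a stand-in for the SVR (bias column appended).
Xb = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)

def rerank(features):
    """Score each answer in a thread and return their indices, best first."""
    fb = np.hstack([features, np.ones((len(features), 1))])
    return np.argsort(-(fb @ w))

# A toy two-answer thread: the second answer has "good-looking" features.
thread = np.array([[-1.0, 0.3, -1.7, -0.3],
                   [ 0.7, 0.4,  2.0,  0.4]])
order = rerank(thread)  # the second answer should be ranked first
```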
The MAP scores resulting from our experiments on the development dataset are shown in Table 2. The first row shows the maximum possible score: it is lower than 1, as 33 of the 244 dev threads had no good answers. Next, we show the MAP score when all features are enabled; we can see that it outperforms both the chronological and the random baseline, by 4 and 10 MAP points absolute, respectively. The following four rows show results with some class of features disabled. We can see that the personality features had virtually no impact on the results, sentiment had a minimal impact (0.2 MAP points), metadata had a real impact (1.8 MAP points), while the word PMI features had the largest impact (6.6 MAP points).

Post-Submission Analysis
After the competition ended, we fixed a bug in the bootstrapped lexicon construction, which resulted in sizable improvements. We further replaced the SVR with an SVC, and we excluded the PotentiallyUseful comments from training. These changes collectively yielded a boost in MAP to 74.67 on the test dataset. As Table 3 shows, this is 6 MAP points absolute higher than the score of the system we submitted to the competition. It is also only 4.5 MAP points behind the best team, and only 3 points behind the second-best team.

Conclusion and Future Work
We have described our PMI-cool system for SemEval-2016 Task 3 on Community Question Answering, subtask A, which asks to rerank the comments from the thread for a given forum question from good to bad. Our approach relied on using SO scores based on PMI to construct various features, the most important of which were our goodness polarity lexicons, which are based on an idea we borrowed from sentiment analysis. In particular, we used a combination of bootstrapping and pointwise mutual information to estimate the strength of association between a word (from a large unannotated set of question-answer threads) and the class of good/bad comments. We then used various features based on these lexicons to train a regression model, whose predictions we used to induce the final comment ranking.
While our PMI-cool system did not perform very well at the competition as it lacked important features and as we had a bug at submission time, our goodness polarity lexicons proved useful and contributed to the strong performance of another top-performing system at SemEval-2016 Task 3: SUper team (Mihaylova et al., 2016).
In future work, we plan to strengthen our system with more features. In particular, we would like to incorporate rich knowledge sources, e.g., semantic similarity features based on fine-tuned word embeddings and topic similarities as in the SemanticZ system (Mihaylov and Nakov, 2016b). There are also plenty of interesting features to borrow from the SUper Team system (Mihaylova et al., 2016), including veracity, text complexity, and troll user features as inspired by (Mihaylov et al., 2015a; Mihaylov et al., 2015b; Mihaylov and Nakov, 2016a). It would be interesting to combine these in a deep learning architecture, e.g., as in the MTE-NN system (Guzmán et al., 2016a; Guzmán et al., 2016b), which borrowed an entire neural network framework and architecture from previous work on machine translation evaluation (Guzmán et al., 2015).
We further plan to use information from entire threads to make better predictions, as using thread-level information for answer classification has already been shown useful for SemEval-2015 Task 3, subtask A, e.g., by using features modeling the thread structure and dialogue (Nicosia et al., 2015), or by applying thread-level inference using the predictions of local classifiers (Joty et al., 2016). How to use such models efficiently in the ranking setup of 2016 is an interesting research question.

Table 3: Comparison to the official results on SemEval-2016 Task 3, subtask A. The first column shows the rank of the primary runs with respect to the official MAP score. The second column contains the team's name and its submission type. The following columns show the results for the primary, and then for other, unofficial evaluation measures. The subindices show the rank of the primary runs with respect to the evaluation measure in the respective column.