UINSUSKA-TiTech at SemEval-2017 Task 3: Exploiting Word Importance Levels for Similarity Features for CQA

Most core techniques for solving problems in Community Question Answering (CQA) tasks rely on similarity computation. This work focuses on the similarity between two sentences (or questions in subtask B) based on word embeddings. We exploit the importance levels of words in sentences or questions to derive similarity features for classification and ranking with machine learning. Using only two types of similarity metric, our proposed method achieves results comparable to those of more complex systems. On the subtask B 2017 dataset, the method ranked 7th out of 13 participants. On the 2016 dataset, it would rank 8th out of 12, outperforming some complex systems. This finding can be explored further, and the method has potential as a baseline that can be extended to many CQA tasks and to other systems based on textual similarity.


Introduction
Community Question Answering (CQA) is becoming popular as a way to request valid information from experienced people. However, waiting for good answers to a newly submitted question is tedious for users of online community forums. An IR system can use the threads in an online community forum to answer question queries. Even so, the appropriate answers are often mixed among snippets of many irrelevant documents, and opening full articles is still required. A post-processing system is needed in order to obtain the most relevant answers. CQA tasks address this need: they help users get the most favorable answers by improving IR system results.
SemEval Task 3 on CQA is designed to gather possible solutions in five coherent subtasks (Nakov et al., 2017). Since some subtasks are related, we focus only on subtask B, with the goal of providing a good basic framework for solving the problems in other subtasks.
In Task 3 of the previous year, word embeddings obtained with a tool such as word2vec (Mikolov et al., 2013b) contributed to the best systems for all subtasks. In addition, machine learning based methods were mostly ranked in the top positions for all subtasks. The most popular machine learning approach was SVM for classification, regression and ranking, while neural networks, even though widely used, did not win any subtask. Most machine learning approaches rely on several similarity features as the basis. Various techniques to compute semantic similarity based on word embeddings were used by Franco-Salvador et al. (2016), Filice et al. (2016), Mohtarami et al. (2016), and Wu and Lan (2016), among others. They also used various lexical and semantic similarities, including simple match counts on words or n-grams. Specifically, Franco-Salvador et al. (2016) also used noun and n-gram overlaps, distributed word alignments, knowledge graphs, and common frames.
Interestingly, Mihaylova et al. (2016) used the cosine distance between topic pairs and text distance as SVM learning features, rather than using similarity features. They also implemented Boolean features and features about Qatar Living Forum users as task-specific features. Filice et al. (2016) constructed many types of similarity based on text pairs, e.g. n-grams of word lemmas, n-grams of POS tags, parse trees, and LCS, as SVM learning features. They then stacked the classifiers across subtasks to solve subtasks B and C in a way that uses the results of other subtasks. These task-specific features seem to be the key to the team's success in achieving the best relative performance on all English subtasks.
In this CQA task, we focus on machine learning approaches with a small number of features. We attempt to find an effective way to use word embeddings as the basis of our similarity features. We also make use of the words (lemmas) that are frequent in a thread or small document collection (i.e. the original and the 10 related questions), in the calculation of similarity between sentences. We create several sets of words with different 'word importance levels', from which we derive similarity features for machine learning methods.
The experiment on the 2017 shared task (subtask B) shows good results with respect to MAP scores. Our method surpasses the IR baseline and achieved 7th position out of 13 teams for the primary submission.

System Description
The framework of our system contains three main phases, i.e. (1) pre-processing, (2) feature generation, and (3) training and classification.

Pre-processing
From each dataset, i.e. the development, train and test sets, we extract the questions to form threads for subtask B. Each thread contains one original question (orgQ) and its 10 related questions (relQ). We use the term 'collection of documents' for the thread, which contains questions (each with a subject and a body) as the documents.
From each collection of documents, we extract all lemmas and select only content words: nouns, verbs, adjectives, named entities, question words, and foreign words. For this purpose we use the lemmatizer, POS tagger and Named Entity Recognizer from Stanford CoreNLP (Manning et al., 2014). We also count each lemma's frequency within the collection of documents of each thread, not over the whole dataset.
Intuitively, in a QA forum, if the frequency of a word is high in a certain thread, the word is likely to be an important matter in the conversation, discussed by the majority of users. For this reason, we rank the words by their frequencies and keep the top-N ranked words for the next step. In our experiments, we set N to 4.
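The per-thread frequency ranking described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `top_n_lemmas`, the toy lemmas, and the tie-breaking rule (first-seen order) are assumptions, and the lemmas are assumed to have been extracted and filtered to content words beforehand.

```python
from collections import Counter

def top_n_lemmas(thread_lemmas, n=4):
    """Return the n most frequent lemmas of one thread.

    thread_lemmas: list of documents (orgQ and its relQs), each given as a
    list of content-word lemmas. Frequencies are counted per thread only.
    """
    counts = Counter(lemma for doc in thread_lemmas for lemma in doc)
    # most_common sorts by descending count; ties keep first-seen order
    # (an assumption: the paper does not say how ties are broken).
    return [lemma for lemma, _ in counts.most_common(n)]

# Hypothetical toy thread with three questions.
thread = [
    ["visa", "renew", "qatar"],
    ["visa", "cost", "family"],
    ["visa", "renew", "office"],
]
top = top_n_lemmas(thread, n=2)
print(top)
```

The paper sets N to 4; the example uses n=2 only to keep the toy data small.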

Word Importance Level
We first derive several sets of content words from orgQ_subj (the set of words in the subject of orgQ), orgQ_body (the set of words in the body of orgQ), and TopN, which consists of the top N words in the ranking obtained in Section 2.1. Specifically, we define the sets L1 to L6, which are supposed to have different levels of importance. For example, the words in L1 belong to both orgQ_subj and TopN, and are thus supposed to be very important.
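As a rough illustration, the three levels that the text characterizes explicitly (L1, L2 and L5) can be built with plain set operations. The function name and toy words are hypothetical, and the reading of L2 as "top-N words that occur in the subject or the body" is an assumption. The remaining levels (L3, L4, L6) are not sketched because their definitions are not given here.

```python
def importance_levels(orgq_subj, orgq_body, top_n):
    """Build word sets for the levels described in the text.

    L1: top-N words that occur in the orgQ subject.
    L2: top-N words that occur in the orgQ subject or body (assumed reading).
    L5: all content words of orgQ (baseline level).
    """
    subj, body, top = set(orgq_subj), set(orgq_body), set(top_n)
    return {
        "L1": subj & top,
        "L2": (subj | body) & top,
        "L5": subj | body,
    }

# Hypothetical toy question and top-N list.
levels = importance_levels(
    orgq_subj=["visa", "renew"],
    orgq_body=["renew", "office", "cost"],
    top_n=["visa", "renew", "family", "cost"],
)
print(levels)
```

Under this reading L1 is always a subset of L2, which in turn is a subset of L5, matching the idea that higher levels are smaller and more selective.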

Similarity Feature
We next calculate a number of similarities between two sets of content words: one representing orgQ, such as L1 or L2, and one representing relQ, such as relQ_subj, relQ_body, or their union. We later use these similarities as features for the classifier, as in Table 1.

Semantic Similarity
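The first semantic similarity type in this work is the cosine similarity between the sums of the word embeddings of the words in the two sets. For a set $S$, the resultant $R_S$ is the sum of the embeddings $\mathbf{e}(w)$ of its words $w$:

$$R_S = \sum_{w \in S} \mathbf{e}(w) \qquad (1)$$

The similarity between two sets $S_1$ and $S_2$ is then the cosine of the angle between their resultants:

$$\mathrm{sim}(S_1, S_2) = \frac{R_{S_1} \cdot R_{S_2}}{\lVert R_{S_1} \rVert \, \lVert R_{S_2} \rVert} \qquad (2)$$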
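A minimal sketch of this feature in plain Python is shown below. The two-dimensional embeddings are toy values rather than word2vec vectors, the function names are hypothetical, and skipping out-of-vocabulary words and returning 0 for a zero vector are assumptions the paper does not state.

```python
import math

def resultant(words, emb):
    """Sum the embeddings of the words in a set (the resultant vector)."""
    dim = len(next(iter(emb.values())))
    total = [0.0] * dim
    for w in words:
        if w in emb:  # assumption: words without an embedding are skipped
            for k, x in enumerate(emb[w]):
                total[k] += x
    return total

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def semantic_similarity(set_a, set_b, emb):
    """Cosine between the resultants of two word sets."""
    return cosine(resultant(set_a, emb), resultant(set_b, emb))

# Hypothetical toy embeddings.
emb = {"visa": [1.0, 0.0], "renew": [0.0, 1.0], "cost": [1.0, 1.0]}
sim_same_direction = semantic_similarity({"visa", "renew"}, {"cost"}, emb)
sim_orthogonal = semantic_similarity({"visa"}, {"renew"}, emb)
print(sim_same_direction, sim_orthogonal)
```

In the first call the summed vector of "visa" and "renew" points in the same direction as "cost", so the similarity is close to 1 even though the sets share no word.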

Lexical Semantic Similarity
For the second type of similarity, we use lexical semantic similarity, which is similar to that of Konopík et al. (2016). We denote the union of the two sets $S_1$ and $S_2$ by $U$ (i.e., $U = S_1 \cup S_2$), which consists of $m$ unique words $\{b_1, \dots, b_m\}$.
Given two sets $S_1$ and $S_2$, we derive their $m$-dimensional lexical vector representations $\mathbf{v}_1$ and $\mathbf{v}_2$, respectively. For each word $b_i$ in $U$, we calculate the maximum cosine similarity score between the embedding of $b_i$ and the embedding of a word in $S_1$, which we regard as the $i$-th element of $\mathbf{v}_1$: $v_{1,i} = \max_{w \in S_1} \cos(\mathbf{e}(b_i), \mathbf{e}(w))$. Similarly, we calculate each element of $\mathbf{v}_2$ from $S_2$. Lastly, we calculate the cosine similarity between $\mathbf{v}_1$ and $\mathbf{v}_2$ to form a new feature.
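The procedure above can be sketched as follows, again with toy embeddings. The function names are hypothetical, and two details are assumptions: the union is sorted only to fix a dimension order, and a union word with no usable embedding contributes 0.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def lexical_vector(union_words, words, emb):
    """For each union word, the best cosine match against a word of `words`."""
    vec = []
    for b in union_words:
        scores = [cosine(emb[b], emb[w]) for w in words if b in emb and w in emb]
        vec.append(max(scores) if scores else 0.0)  # assumption: 0 if no match
    return vec

def lexical_semantic_similarity(set_a, set_b, emb):
    """Cosine between the two m-dimensional lexical vectors."""
    union_words = sorted(set(set_a) | set(set_b))  # fixed dimension order
    vec_a = lexical_vector(union_words, set_a, emb)
    vec_b = lexical_vector(union_words, set_b, emb)
    return cosine(vec_a, vec_b)

# Hypothetical toy embeddings with two orthogonal words.
emb = {"visa": [1.0, 0.0], "renew": [0.0, 1.0]}
lex_identical = lexical_semantic_similarity({"visa"}, {"visa"}, emb)
lex_unrelated = lexical_semantic_similarity({"visa"}, {"renew"}, emb)
print(lex_identical, lex_unrelated)
```

Unlike the summed-embedding similarity, this measure keeps one dimension per word of the union, so it reflects how well each individual word is covered by the other set.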

Feature Generation
For our supervised learning, we compose feature sets as in Table 1. Semantic cosine similarity features are indexed with i in {1, ..., 10} and lexical semantic similarity features with j in {11, ..., 20}. As additional features, we investigated the influence of named entities (NE): we extract only the sentences or questions containing NE words in the orgQ subject and body.

Learning, Classification and Ranking
We use machine learning for the relevance classification and ranking tasks on the same feature combinations. We extract gold annotations (i.e., relevance and score) from the training set and compose separate SVM input files for the two tasks. We then run the training to produce a model for each task. For the classification task, an SVM binary classifier with a linear kernel (Joachims, 1999) is used to assign a label to each relQ in the test set: relevant (true) or not relevant (false). For the ranking task, SVM rank (Joachims, 2002) is used to produce scores. The score assigned to each relQ is regarded as its rank, where a higher score means that the relQ is more related to the orgQ. Finally, we write both results (relevance and score) into a system prediction file.
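As an illustration of composing the SVM input files, the sketch below writes feature vectors in the SVM-light text format, in which each line has a target, an optional `qid:` field used for ranking, and ascending `index:value` pairs starting at index 1. The helper name, the feature values, and the target encodings are hypothetical and not taken from the paper.

```python
def svmlight_line(target, features, qid=None, comment=None):
    """Format one example as an SVM-light input line.

    target:   class label (e.g. 1 / -1) or a rank value for ranking.
    features: list of similarity feature values; indices start at 1.
    qid:      query (thread) id, required only for ranking input.
    """
    parts = [str(target)]
    if qid is not None:
        parts.append(f"qid:{qid}")
    parts.extend(f"{i}:{v:.6f}" for i, v in enumerate(features, start=1))
    line = " ".join(parts)
    if comment:
        line += f" # {comment}"
    return line

# Hypothetical relQ with two similarity features.
clf_line = svmlight_line(1, [0.5, 0.25])
rank_line = svmlight_line(2, [0.5, 0.25], qid=3, comment="relQ_1")
print(clf_line)
print(rank_line)
```

The same feature vector is written twice, once per task, which matches the use of separate input files for classification and ranking.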

Dataset
We use the 2016 Task 3 datasets provided by the organizers, i.e. TRAIN-part1, DEV and TEST. We do not use TRAIN-part2 because it is less reliable and contains more noise, as stated in the readme file. We also conduct experiments on the TEST-2016 dataset to test our system's performance and compare it with the published official scores, as seen in Table 4.

Feature Selection
We create a simple baseline, which uses only a single similarity feature: the semantic cosine similarity computed using all content words in orgQ and relQ (word importance level L5). For tuning the parameters and seeking the best combination of features, we train an SVM with a linear kernel on the TRAIN dataset and apply the model to the DEV dataset. We choose the two best cost parameters C with specific feature combinations in Table 2 (Nakov et al., 2017). To analyze the influence of each feature, we conducted experiments on many possible combinations, as in Table 2. We combine each word importance level, starting from the lowest (i.e. L6, L4, L3, L2, L1), with the baseline (L5) and see how it influences the MAP score. Generally, combining the baseline with any other single word importance level feature increases the MAP score. The feature set that adds word importance level L2 (top-N words that appear in the orgQ subject and body) improves the MAP score by about 1 point compared with the single baseline feature L5. Moreover, we get a larger improvement when the baseline is combined with word importance level L1, i.e. top-N words in the orgQ subject only.
We also experiment with joining more word importance level features, computing them with both similarity types, and using different content words of relQ, e.g. content words that appear in the subject only or in both the subject and the body. Some interesting results are also reported in Table 2.
When adding the similarity between L1 and the relQ subject only, the MAP score decreases slightly for C=100, but decreases by more than 1 point for C=1. Interestingly, adding one more feature from L2 gives a better score than the aforementioned feature sets.
L1 and L2 tend to have a higher influence on the MAP score than L3, L4, and L6. When combined with their lexical semantic similarity features, L1 and L2 increase the MAP score for C=100, but slightly decrease it for C=1. Considering that each of L3 to L6 makes its own contribution to the improvement over the baseline, we incorporate all features and use both similarity types. The results give the two best MAP scores among all our experiments in this parameter tuning and feature selection phase.
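For reference, the MAP scores compared in this section can be understood through the following simplified sketch of mean average precision over threads. It is not the official task scorer, which may differ in details such as rank cut-offs or the handling of threads without relevant questions; the function names and the toy rankings are hypothetical.

```python
def average_precision(ranked_labels):
    """Average precision of one ranked list of relevance labels (True/False)."""
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_labels, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision at this relevant position
    # Assumption: a thread without relevant items scores 0.
    return total / hits if hits else 0.0

def mean_average_precision(threads):
    """Mean of the per-thread average precisions."""
    return sum(average_precision(t) for t in threads) / len(threads)

# Two hypothetical threads, each listing relQ labels in ranked order.
ap_first = average_precision([True, False, True])
ap_second = average_precision([False, True])
map_score = mean_average_precision([[True, False, True], [False, True]])
print(ap_first, ap_second, map_score)
```

The scores in the paper are reported on a 0 to 100 scale, so a value from this sketch would be multiplied by 100 for comparison.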

Final Results
For our participation in Subtask B, we use the best-performing feature combination from the feature selection phase and TRAIN-part1 for training. We choose C=1 and C=100 for the primary and the contrastive (Con-1) submissions, respectively. For the second contrastive submission (Con-2), we use C=1 and join TRAIN-part1 and TEST-2016 for training. Our system achieved 7th position out of 13 teams for the primary submission, with a MAP score of 43.44. Our Contrastive-1 submission has the best score among our three submissions, i.e. 44.29, which is 0.85 points higher than the primary submission. We also conduct an experiment to test our system's performance on the TEST-2016 dataset. We use the model trained on the TRAIN-part1 dataset with C=100 (our best result in Table 3, i.e. Contrastive-1). With respect to the previous year's results, this result would achieve 8th position out of 12 teams if it were put on the leaderboard. In terms of scores, our results on the 2017 and 2016 datasets are consistently in the middle of the range between the top and the lowest MAP scores, as seen in Table 4.

Conclusion and Future Work
As many CQA tasks rely on a similarity measure as their basis, using word importance levels in this way for semantic similarity metrics can increase the MAP score significantly. Taking into consideration the top-N words in a thread can help to find alternative words that are unseen in the original question.
Our future work is to implement this method as a baseline for other subtasks, and later to combine it with rich features that involve various task-specific operations to solve the main problem in CQA.