FuRongWang at SemEval-2017 Task 3: Deep Neural Networks for Selecting Relevant Answers in Community Question Answering

We describe deep neural network frameworks in this paper to address the community question answering (cQA) ranking task (SemEval-2017 Task 3). Convolutional neural networks and bi-directional long short-term memory networks are applied in our methods to extract semantic information from questions and answers (comments). In addition, to take full advantage of the question-comment semantic relevance, we deploy an interaction layer and augmented features before calculating the similarity. The results show that our methods are highly effective for both subtask A and subtask C.


Introduction
Answer selection is regarded as a key step in question answering tasks, especially for community question answering (cQA), which is greatly valuable for users to retrieve information. Some cQA forums, such as Quora, Qatar Living and Stack Overflow, are quite open to users, providing a convenient platform for asking and answering questions. Hence, the growth of cQA forums makes it urgent to answer questions automatically.
SemEval-2017 (Bethard et al., 2017) Task 3 (Nakov et al., 2017) is the task of selecting relevant answers (or comments) for questions in community question answering (cQA). The data is collected from an online cQA forum, Qatar Living, which is close to real application needs. The task consists of several subtasks, described briefly as follows:
• Question-Comment Similarity: Given one question and ten candidate comments {c_1, c_2, ..., c_10}, the goal is to rank these comments according to their relevance to the question.

Figure 1: One Basic Deep Learning cQA Framework. The question and the comment are mapped into fixed-length word vectors. After that, they are fed into the neural network framework to extract features. Through a fully connected neural network, the framework outputs the similarity of the question and the comment.
• Question-External Comment Similarity: Given one question and ten candidate questions, each with ten candidate comments of its own. The candidate questions can be relevant or completely irrelevant to the original question. The target is to rank the top ten relevant comments from these 100 candidate comments.
The relevance between the question and the comment is labeled as "Good", "PotentiallyUseful" or "Bad". "PotentiallyUseful" and "Bad" are not clearly distinguished. "Good" comments are considered useful and should be ranked before "PotentiallyUseful" and "Bad" comments. From the above description, these tasks can be regarded as binary classification tasks: the relation between the question and the comment is divided into "Relevant" or "Irrelevant".
Deep neural network techniques have accelerated the development of automatic question answering, owing to their great capability of capturing the semantic meaning of texts. Our work is mainly motivated by previous work (Feng et al., 2015; Cui et al., 2016; Hu et al., 2014), which utilizes neural networks to extract features from texts. We deploy a deep neural network framework to measure the relevance of questions and comments in the cQA task, and then rank the candidates according to their similarity to the original question. In this study, methods based on deep neural networks extract semantic features from the question and the comment respectively, which involve little manual feature engineering and show an advantage on large amounts of data.
In addition, to strengthen the connection between the question and the comment, we apply an interaction layer before calculating the similarity. Then, we add augmented features to improve the performance of the deep neural networks. Finally, the systems we propose are used to address subtask A and subtask C, and the results significantly surpass the baselines provided by the organizers.

Methodology
Neural networks are increasingly applied in a variety of natural language processing tasks, due to their capability of capturing the semantic meaning of texts. Our models are fundamentally motivated by previous work in question answering.
One basic deep learning architecture for cQA is shown in Figure 1, which is used in a great many papers (Tan et al., 2015; Feng et al., 2015; Cui et al., 2016; Hu et al., 2014). The question and the comment are mapped into fixed-length word vectors. After that, they are fed into the neural network framework to extract features. The framework outputs semantic vectors of the question and the comment, which are then concatenated into a single vector. Through a fully connected neural network, the architecture outputs the similarity of the question and the comment. Finally, the similarity value serves as the criterion to rank the candidate comments with respect to the original question.
Currently, convolutional neural networks (CNNs) have proved superior in a variety of tasks due to their ability to learn representations of short texts or sentences. Meanwhile, recurrent neural networks (RNNs), especially their variant, long short-term memory networks (LSTMs), successfully model the long- and short-term information of a sequence.

CNN-based Architecture
The question and the comment are represented by word embedding sequences with a fixed length l: {w_1, ..., w_l}, where each element is a real-valued vector w_i ∈ R^d. Each sentence is normalized to a fixed-length sequence by adding paddings if the sentence is short or truncating the excess otherwise. After embedding, each sentence can be represented by a matrix S ∈ R^{l×d}.
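The padding/truncation and embedding lookup described above can be sketched as follows. This is a minimal NumPy sketch, not the system's actual TensorFlow code; the function names and the zero-vector treatment of unknown and padding tokens are our assumptions:

```python
import numpy as np

def pad_or_truncate(tokens, max_len, pad_token="<pad>"):
    """Normalize a token sequence to a fixed length l."""
    if len(tokens) >= max_len:
        return tokens[:max_len]          # truncate the excess
    return tokens + [pad_token] * (max_len - len(tokens))  # add paddings

def embed(tokens, vectors, d):
    """Map tokens to a sentence matrix S of shape (l, d).

    `vectors` maps a token to a d-dimensional embedding; unknown
    tokens (including padding) map to the zero vector here.
    """
    return np.stack([vectors.get(t, np.zeros(d)) for t in tokens])
```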
In order to capture higher-level semantic information of sentences, a convolutional layer is applied after the embeddings, which consists of several convolutional feature maps. Suppose we have k feature maps z_i ∈ R^{s×d}; after the convolutional operation, the output of the CNN is C ∈ R^{(l−s+1)×k}.
A pooling layer is added after the convolutional layer. Max pooling and average pooling are commonly used, choosing the max or average value of the features extracted by the former layer to reduce the representation. In this study, we use 1-max pooling to select the max value of each filter, so the question and comment vectors generated by the neural networks are q, c ∈ R^k respectively.
In order to extract features at different scales, we use several types of feature maps together. These feature maps have different widths to capture information from contexts of different sizes, which contributes to the feature extraction of CNNs.
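The convolution over the sentence matrix S with 1-max pooling, and the concatenation of maps of several widths, can be sketched in NumPy as follows. This is an illustrative re-implementation under our own conventions (the actual system uses TensorFlow, and the filter shapes here are hypothetical):

```python
import numpy as np

def conv_1max(S, filters):
    """Valid convolution of k feature maps over sentence matrix S (l x d),
    followed by 1-max pooling over positions.

    `filters` has shape (k, s, d): k maps of width s. The convolution
    yields C of shape (l - s + 1, k); pooling keeps the max of each map,
    giving a k-dimensional sentence vector.
    """
    l, d = S.shape
    k, s, _ = filters.shape
    C = np.empty((l - s + 1, k))
    for i in range(l - s + 1):
        window = S[i:i + s]  # s x d slice of the sentence
        C[i] = np.tensordot(filters, window, axes=([1, 2], [0, 1]))
    return C.max(axis=0)     # 1-max pooling

def multi_width_conv(S, filter_sets):
    """Concatenate pooled outputs from feature maps of different widths."""
    return np.concatenate([conv_1max(S, f) for f in filter_sets])
```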

LSTM-based Architecture
The long short-term memory (LSTM) network is a variant of the recurrent neural network (RNN) which has recently achieved great results on a variety of sequence modeling tasks. LSTM overcomes the shortcoming of RNNs in handling "long-term dependencies". Memory cells and forget gates are the key points of the LSTM: they make it capable of handling both long and short sequences by controlling the information flow of a sequence. An LSTM network is made up of several cells with the following formulas:

f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)
C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t    (2)
o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
h_t = o_t ⊙ tanh(C_t)

where the W and b are parameters which are trained and shared by all cells, and ⊙ indicates element-wise multiplication. C_t is the memory cell, which stores previous values. The forget gate f_t and the input gate i_t control the proportion of information from the previous memory and the new inputs respectively, while the output gate o_t controls the output of the hidden state. σ(·) indicates the sigmoid function.
A unidirectional LSTM only utilizes past information of a sequence, while bidirectional LSTMs utilize information both forward and backward. Usually, in order to capture more information of the sequence, bidirectional LSTMs are adopted: one direction processes the sequence from front to end while the other processes it in reverse. The output of a bidirectional LSTM is the concatenation of the output vectors from both directions, i.e. h_t = →h_t || ←h_t.
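The gate updates and the bidirectional concatenation described above can be sketched in NumPy. The packed weight layout (one matrix holding the forget, input, candidate and output transforms) is our own convention for brevity, not a detail from the system:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell update.

    W has shape (4h, d + h) and b shape (4h,), packing the forget, input,
    candidate and output transforms applied to [h_prev, x_t].
    """
    h = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f_t = sigmoid(z[:h])              # forget gate
    i_t = sigmoid(z[h:2 * h])         # input gate
    c_hat = np.tanh(z[2 * h:3 * h])   # candidate memory
    o_t = sigmoid(z[3 * h:])          # output gate
    c_t = f_t * c_prev + i_t * c_hat  # memory cell update
    h_t = o_t * np.tanh(c_t)          # hidden state
    return h_t, c_t

def bilstm(xs, W, b, h):
    """Run the same cell forward and backward over the sequence and
    concatenate the hidden states of both directions at each step."""
    def run(seq):
        h_t, c_t, outs = np.zeros(h), np.zeros(h), []
        for x in seq:
            h_t, c_t = lstm_step(x, h_t, c_t, W, b)
            outs.append(h_t)
        return outs
    fwd = run(xs)
    bwd = run(xs[::-1])[::-1]
    return [np.concatenate([f, bk]) for f, bk in zip(fwd, bwd)]
```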

Augmented Features
In general, neural networks are able to extract features automatically. However, Fu et al. (2016) and Yu et al. (2014) have shown that augmented features contribute to the performance of neural networks. Commonly used augmented features include word overlap indices, part-of-speech tags, position indices, etc.
In this work, we use word overlap indices as augmented features.
Given a question q = (x^q_1, x^q_2, ..., x^q_m) and a comment c = (x^c_1, x^c_2, ..., x^c_n), the overlap features q_feat and c_feat are calculated as follows:

q_feat^(i) = 1 if x^q_i appears in c, otherwise 0
c_feat^(i) = 1 if x^c_i appears in q, otherwise 0

where q_feat^(i) is the i-th element of q_feat, and likewise for c_feat^(i). As shown in Figure 2, q_feat and c_feat are appended at the tail of the sequence embeddings. Also, their concatenation x_feat is used as augmented features before feeding into the fully connected networks.
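The binary overlap indicators above can be computed with a short sketch (a plain-Python illustration; the function name is ours, and any tokenization or lowercasing the system applies beforehand is not shown):

```python
def overlap_features(q_tokens, c_tokens):
    """Word-overlap indicators: the i-th element of q_feat is 1 if the
    i-th question word also appears in the comment, and symmetrically
    for c_feat."""
    q_set, c_set = set(q_tokens), set(c_tokens)
    q_feat = [1.0 if w in c_set else 0.0 for w in q_tokens]
    c_feat = [1.0 if w in q_set else 0.0 for w in c_tokens]
    return q_feat, c_feat
```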

Interaction Layer
In order to make full use of the connection of the question and the comment, we design an interaction layer to capture the relevance of them, which is shown in Figure 2.
Given a question vector q ∈ R^k and a comment vector c ∈ R^k produced by the neural networks, the interaction layer calculates the bilinear product as follows:

z_int = f(q^T M c)

where z_int ∈ R is the output of the interaction layer, M ∈ R^{k×k} is the parameter of the layer, updated during training, and f(·) is the activation function.
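The interaction layer is a single bilinear form and can be sketched in one line of NumPy (the choice of tanh as the default activation is an assumption for illustration):

```python
import numpy as np

def interaction(q, c, M, f=np.tanh):
    """Bilinear interaction z_int = f(q^T M c) between the question
    vector q and the comment vector c; M is a trainable k x k matrix."""
    return f(q @ M @ c)
```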

Objective Function and Optimizer
Features extracted by the deep neural networks are concatenated with the extra features, and the result is fed into a fully connected neural network:

o_h = f(W_o · x + b_o)

where x is the concatenated feature vector, o_h,i is the output of hidden layer node i, x_feat is the augmented features, and W_o and b_o are the weight and the bias of the hidden layer respectively.
After that, the softmax function is applied to obtain the similarity of the question and the comment:

o_i = exp(W_s,i · o_h + b_s,i) / Σ_j exp(W_s,j · o_h + b_s,j)

where o_i ∈ [0, 1] is the output of the network, which satisfies Σ_i o_i = 1, and W_s and b_s are the weight and the bias of the output layer respectively.
The objective function in this study is the cross entropy:

L = −(1/N) Σ_n Σ_i y_i log o_i

where y_i ∈ {0, 1} is the ground-truth label of the question-comment relation and N is the number of samples.
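The softmax output and the per-sample cross-entropy loss can be sketched in NumPy (an illustration of the standard formulas, not the system's TensorFlow graph):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(logits, y):
    """Cross-entropy loss for one sample; y is a one-hot label vector."""
    o = softmax(logits)
    return -float(np.sum(y * np.log(o)))
```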
The relation of the question and the comment is divided into two classes, "Good" and "Bad" ("PotentiallyUseful" is regarded as "Bad"). To train the parameters of the networks, the Adagrad optimizer (Duchi et al., 2011) is applied. Adagrad is an algorithm designed for gradient-based optimization, which adapts the learning rate for better convergence during training. Dean et al. (2012) point out that Adagrad increases the robustness of stochastic gradient descent when training large-scale neural networks.
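The Adagrad update accumulates squared gradients per parameter and divides the learning rate by their root. A minimal NumPy sketch of one step (the epsilon constant and function name are our assumptions):

```python
import numpy as np

def adagrad_update(w, g, cache, lr=0.01, eps=1e-8):
    """One Adagrad step (Duchi et al., 2011): accumulate g^2 into `cache`
    and scale the per-parameter learning rate by 1 / sqrt(cache)."""
    cache = cache + g * g
    w = w - lr * g / (np.sqrt(cache) + eps)
    return w, cache
```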

Experimental Results
In this section, we describe the details of training the deep neural networks, including the datasets, metrics, baselines and parameter settings. Then, we present the results and give a brief analysis of them.

Data Description
Datasets provided by the organizers are collected from the Qatar Living forum, an online cQA website. The components of the datasets are shown in Table 3.1. In our experiments, the datasets "train 1", "train 2" and "dev" are used to train the neural networks. In addition, the dataset "test 2016" is used for development and parameter searching, and the dataset "test 2017" is submitted for evaluation.

Metrics and Baselines
The official evaluation metric in this task is Mean Average Precision (MAP), which is used for ranking submissions from different teams. MAP is often used to measure the quality of ranking in information retrieval. In addition, Average Recall (AvgRec), Mean Reciprocal Rank (MRR), Accuracy (Acc), etc. are also reported by the official scorer. Baselines are given by the organizers, consisting of an Information Retrieval (IR) baseline and a random baseline. The IR baseline is the ranking of candidates given by a search engine, such as Google or Bing. The random baseline ranks each candidate by a random number (from 0 to 1).
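For intuition, MAP and MRR over ranked candidate lists can be sketched as follows. This is a textbook formulation, not the official scorer, which may differ in details such as cutoffs and tie handling; `relevance` marks "Good" comments as 1:

```python
def average_precision(relevance):
    """AP of one ranked list; `relevance` is 1 for relevant items, else 0."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)  # precision at each hit
    return sum(precisions) / max(hits, 1)

def mean_average_precision(lists):
    return sum(average_precision(r) for r in lists) / len(lists)

def mean_reciprocal_rank(lists):
    rr = []
    for relevance in lists:
        rr.append(next((1.0 / r for r, rel in enumerate(relevance, 1) if rel), 0.0))
    return sum(rr) / len(lists)
```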

Experimental Setup
Our models in this work are mainly implemented with TensorFlow v1.0 (Abadi et al., 2016). All texts from questions and comments are first used to train Word2Vec vectors (Mikolov et al., 2013a,b) with the Python package gensim 1 , whose dimension is fixed to 100. The max sequence length of the question and the comment is fixed to 200. We add paddings if the sequence is short or truncate the excess otherwise.
The single-type CNN networks have a filter size of 3 and 800 feature maps, while the multi-type CNN networks have filter sizes of 1, 2, 3 and 5 with 800 feature maps each. The bi-LSTMs have an output length of 400 in each direction, and the hidden states are output directly to the higher layer. The number of nodes in the hidden layer is 256 and the activation function used in the fully connected neural networks is ReLU. The optimizer is set to AdagradOptimizer and the learning rate is initially set to 0.01.

1 https://radimrehurek.com/gensim/

Results on subtask A

Table 3.1 summarizes the results on subtask A: Question-Comment Similarity. The first two rows illustrate the random baseline and the IR baseline, followed by four rows of CNN results. The last three rows are the results of the LSTMs. As shown in the table, our results significantly outperform the baselines. The neural network based methods perform quite stably: the differences between their results are less than 1%.
The multiCNN along with the augmented features and the interaction layer achieves the best MAP score among these methods, although it does not rank first in terms of MRR. It is also clear that the interaction layer and the augmented features contribute to the performance of the neural networks.

Results on subtask C

Table 3.1 illustrates the results on subtask C: Question-External Comment Similarity. It should be noted that we use the reciprocal rank of the questions to improve the ranking of the comments. From the table, our methods surpass the baselines, and the best method obtains a MAP of 13.55. The situation is quite similar to subtask A, which shows that our deep neural architectures are quite stable. However, the LSTM-based architectures fall far behind the CNN-based architectures in terms of MAP and AvgRec scores, although they have an advantage in terms of MRR scores.

Conclusion
In this paper, we present deep neural network frameworks to address community question answering tasks. CNNs and biLSTMs are used to extract the semantic features of questions and comments. We add an interaction layer and augmented features to improve the performance of the framework. The results illustrate that our methods are greatly superior to the baselines provided by the organizers in both subtask A and subtask C.