SwissAlps at SemEval-2017 Task 3: Attention-based Convolutional Neural Network for Community Question Answering

In this paper we propose a system for re-ranking the answers to a given question. Our method builds on a siamese CNN architecture which is extended by two attention mechanisms. The approach was evaluated on the datasets of the SemEval-2017 competition for Community Question Answering (cQA), where it achieved 7th place with a MAP score of 86.24 points on the Question-Comment Similarity subtask.


Introduction
Community Question Answering (cQA) describes the task of finding a relevant answer to a never-before-seen question (Nakov et al., 2017). The cQA task in SemEval-2017 is subdivided into three subtasks: (a) Question-Comment Similarity, (b) Question-Question Similarity, and (c) Question-External Comment Similarity. We participated in the Question-Comment Similarity subtask, which consists of re-ranking a set of 10 answers to a given question such that all the relevant answers are ranked higher than the irrelevant ones. We evaluated our system on the dataset provided by SemEval-2017 for the Question-Comment Similarity subtask, which consists of approximately 2000 questions with 10 answers each. Our system ranked 7th, achieving a MAP score of 86.24, 2 points below the 1st-ranked system. In this paper we describe the implementation details of our system, which follows a siamese CNN architecture based on (Severyn and Moschitti, 2015), extended by the attention mechanisms introduced by (Yin et al., 2015).
Siamese Architecture
Siamese architectures usually consist of two parallel CNNs, each processing one sentence, whose representations are then used for classification. Siamese architectures have been proposed for various tasks, e.g. (Bromley et al., 1993) used the structure for signature verification, and they have been shown to be very useful for modelling sentence pairs: (He et al., 2015), (Severyn and Moschitti, 2015), and (Tan et al., 2015) used the siamese architecture to generate representations for both sentences which are then used for classification.

Attention Mechanisms
Recently, the notion of attention has been introduced into neural network architectures to mimic human behaviour: when reading, we tend to focus on the key parts of a sentence to extract the relevant information. Most of the work on attention mechanisms focuses on LSTMs: for instance, (Bahdanau et al., 2014) use an attention mechanism for machine translation, and (Vinyals et al., 2015) use it for generating parse trees. Regarding attention mechanisms for CNNs, we are only aware of (Yin et al., 2015), on which our system is based.
The rest of the paper is structured as follows: in Section 2 we present our model showing the siamese architecture augmented with two different attention mechanisms. In Section 3 we describe our experimental setup and show the results obtained with our system. We conclude our discussion in Section 4.

Model
The input to the system is a pair consisting of a question and an answer candidate, and the model has to classify whether the answer candidate is relevant to the question; this makes it a binary classification problem. Given such a pair, the parallel CNNs produce a representation for each of the two sentences. These representations are concatenated and fed into a fully connected hidden layer, followed by a softmax layer for classification (see Figure 1). The model is extended with two attention mechanisms: the first modifies the input to the convolution, the second modifies the output of the convolution. Both methods aim at giving more weight to the relevant parts of the sentences.

Language Model
We use word embeddings based on word2vec (Mikolov et al., 2013) as input to the convolutions. As described in (Mikolov et al., 2013), we first learn representations for phrases by detecting which bigrams and trigrams frequently appear together. These n-grams are replaced by a unique token, e.g. 'New York' is replaced by the token 'New_York'. The word embeddings are generated using the skip-gram model with a context window of 5 and a dimensionality of d = 200.
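The phrase-learning step can be sketched as follows for the bigram case; the discounting coefficient delta and the merging threshold are illustrative choices, not the settings used for our embeddings:

```python
from collections import Counter

def find_phrases(sentences, delta=1, threshold=0.1):
    """Score adjacent token pairs as in word2vec phrase learning:
    score(a, b) = (count(a, b) - delta) / (count(a) * count(b)).
    Pairs scoring above `threshold` are merged into a single token 'a_b'.
    delta and threshold are illustrative values."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        unigrams.update(s)
        bigrams.update(zip(s, s[1:]))
    merged = []
    for s in sentences:
        out, i = [], 0
        while i < len(s):
            if i + 1 < len(s):
                a, b = s[i], s[i + 1]
                score = (bigrams[(a, b)] - delta) / (unigrams[a] * unigrams[b])
                if score > threshold:
                    out.append(a + "_" + b)  # merge into a unique token
                    i += 2
                    continue
            out.append(s[i])
            i += 1
        merged.append(out)
    return merged
```

Running several passes of this procedure also merges trigrams and longer phrases, as the merged bigram tokens can themselves participate in new pairs.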
The data used to create the word embeddings is a large corpus of 200M English Twitter messages. The word embeddings are stored as a matrix E ∈ R^{n×d}, where n is the number of tokens in the vocabulary. We generate a vocabulary mapping V, where V(t) denotes the index of the word vector of token t in the matrix E.

Siamese CNN Architecture
Input Layer A minimal preprocessing is applied to both sentences. First, each sentence is lower-cased and tokenized. Each token t is then replaced with the corresponding vocabulary index V(t). Thus, each sentence is represented as a vector s of indices. We denote the length of this vector by s_q for the question and s_a for the answer.
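A minimal sketch of this preprocessing, assuming whitespace tokenization and a designated '&lt;unk&gt;' index for out-of-vocabulary tokens (neither is specified above):

```python
def sentence_to_indices(sentence, V):
    """Lower-case, whitespace-tokenize, and map each token to its row
    index in the embedding matrix E via the vocabulary mapping V.
    Out-of-vocabulary handling via a '<unk>' entry is an assumption."""
    tokens = sentence.lower().split()
    return [V.get(t, V["<unk>"]) for t in tokens]
```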
Embedding Layer The embedding layer uses the indices provided by the input layer to select and concatenate the corresponding vectors from the embedding matrix E, thus creating a matrix representation S of the sentence. For the question we have S_q ∈ R^{s_q×d} and for the answer candidate S_a ∈ R^{s_a×d}.
Convolution Layer This layer applies a set of m convolutional filters of length h over the sentence matrix S ∈ {S_q, S_a}. Let S_{[i:i+h]} denote the concatenation of the word vectors S_i to S_{i+h−1}. A feature c_i is generated for a given filter F ∈ R^{h×d} by summing the entries of the element-wise product of the filter and the window: c_i = Σ (F ⊙ S_{[i:i+h]}). The concatenation of all features of a sentence defines a feature vector c ∈ R^{s−h+1}, where s denotes the sentence length. The vectors from all m filters are then aggregated into a feature map matrix C ∈ R^{m×(s−h+1)}. The output of the convolutional layer is passed through the relu activation function (Nair and Hinton, 2010) before entering the pooling layer.
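The feature map computation can be sketched in plain numpy as follows (a naive loop implementation for clarity, not the implementation used in our system):

```python
import numpy as np

def convolution_layer(S, filters):
    """Narrow convolution over a sentence matrix S (s x d) with m
    filters, each of shape (h x d), followed by relu.
    Returns the feature map C of shape (m, s - h + 1)."""
    m = len(filters)
    h = filters[0].shape[0]
    s = S.shape[0]
    C = np.empty((m, s - h + 1))
    for j, F in enumerate(filters):
        for i in range(s - h + 1):
            # feature c_i: element-wise product of the filter with the
            # window S[i:i+h], summed over all entries
            C[j, i] = np.sum(F * S[i:i + h])
    return np.maximum(C, 0.0)  # relu activation
```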
Zero Padding When computing the convolution at the boundary of a sentence, the convolutional filter extends over the edge. Zero padding is therefore applied by adding h − 1 zero vectors at the beginning and the end of the sentence matrix. The padded sentence matrices are of the form S_zq ∈ R^{(s_q+2(h−1))×d} and S_za ∈ R^{(s_a+2(h−1))×d} for the question and the answer candidate, respectively. Note that the feature map matrix has the form C ∈ R^{m×(s+h−1)} if the input is padded.
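A minimal sketch of the padding step:

```python
import numpy as np

def zero_pad(S, h):
    """Prepend and append h - 1 zero vectors to the sentence matrix S
    (s x d), so that a filter of length h can slide over the sentence
    boundaries.  The padded matrix has s + 2*(h - 1) rows."""
    d = S.shape[1]
    pad = np.zeros((h - 1, d))
    return np.vstack([pad, S, pad])
```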
Pooling Layer The pooling layer aggregates the vectors in the feature map matrix C by taking the maximum value of each feature vector. This reduces the representations of both the question and the answer candidate to c_{q,pooled}, c_{a,pooled} ∈ R^m.
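The pooling step then reduces each feature map to a single vector:

```python
import numpy as np

def max_pooling(C):
    """Max-pool each of the m feature vectors in the feature map
    C (m x n) down to a single value, giving c_pooled in R^m."""
    return C.max(axis=1)
```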
Hidden Layer The two vectors c_{q,pooled} and c_{a,pooled} are concatenated into a vector x ∈ R^{2m} and passed into a fully connected hidden layer, which computes the transformation h(x) = α(Wx + b), where W ∈ R^{2m×2m} is the weight matrix, b ∈ R^{2m} the bias vector, and α a non-linear activation function.
Softmax Finally, the output of the previous layer x ∈ R^{2m} is fully connected to a softmax regression layer, which returns the class ŷ ∈ [1, K] with the largest probability, i.e. ŷ = argmax_j P(y = j | x) = argmax_j (e^{x·w_j + a_j} / Σ_{k=1}^{K} e^{x·w_k + a_k}), where w_j and a_j denote the weights and bias of class j.
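The classification head can be sketched as follows; the choice of relu as the hidden-layer activation is an assumption made for illustration:

```python
import numpy as np

def classify(c_q, c_a, W, b, w, a):
    """Concatenate the pooled question/answer vectors, apply the fully
    connected hidden layer (relu here is an assumption), and return
    the softmax class probabilities.  w (K x 2m) and a (K,) are the
    softmax weights and biases."""
    x = np.concatenate([c_q, c_a])        # x in R^{2m}
    hid = np.maximum(W @ x + b, 0.0)      # hidden layer, W in R^{2m x 2m}
    logits = w @ hid + a                  # one logit per class
    e = np.exp(logits - logits.max())     # numerically stable softmax
    return e / e.sum()
```

The predicted class ŷ is the argmax of the returned probability vector.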

Attention Mechanism
We implemented two different ways of introducing an attention mechanism into the siamese structure: the first manipulates the input to the convolution directly, the second modifies the output of the convolution. Both approaches are based on an attention matrix.

Attention Matrix
The attention matrix A ∈ R^{s_q×s_a} is derived from the sentence matrices S_q and S_a by computing the pairwise Euclidean similarity between the word embeddings of S_q and those of S_a. Thus, A_{i,j} = (1 + ‖S_{q,i} − S_{a,j}‖)^{−1} denotes the similarity of the i-th word in the question to the j-th word in the answer candidate.
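A direct numpy sketch of the attention matrix computation:

```python
import numpy as np

def attention_matrix(S_q, S_a):
    """A[i, j] = 1 / (1 + ||S_q[i] - S_a[j]||), the Euclidean
    similarity between the i-th question word embedding and the
    j-th answer word embedding.  Shape: (s_q, s_a)."""
    diff = S_q[:, None, :] - S_a[None, :, :]   # broadcast to (s_q, s_a, d)
    dist = np.linalg.norm(diff, axis=2)        # pairwise Euclidean distances
    return 1.0 / (1.0 + dist)
```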

Convolution Modification
The first mechanism modifies the input to the convolution by applying a linear transformation to the attention matrix A in order to create so-called attention features. For this, two weight matrices are used: one for the question, W_q ∈ R^{s_a×d}, and one for the answer candidate, W_a ∈ R^{s_q×d}. The attention matrix is multiplied with these weight matrices to generate the attention features: A_q = A·W_q and A_a = A^T·W_a with A_q ∈ R^{s_q×d} and A_a ∈ R^{s_a×d}, where the weight matrices are learned during the training phase. The attention features are stacked on top of the sentence matrices, creating order-3 tensors: S^2_q ∈ R^{s_q×d×2} for the question and S^2_a ∈ R^{s_a×d×2} for the answer candidate. These tensors are used as the input to the convolution layer, giving more weight to the relevant regions of the sentence. As in (Yin et al., 2015), we refer to this architecture as ABCNN 1.
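The construction of the ABCNN 1 input tensors can be sketched as:

```python
import numpy as np

def abcnn1_input(S_q, S_a, A, W_q, W_a):
    """Stack attention features on the sentence matrices, producing
    order-3 tensors of shape (s, d, 2) that replace the plain sentence
    matrices as convolution input.  W_q has shape (s_a, d) and W_a
    shape (s_q, d); both are learned during training."""
    A_q = A @ W_q        # (s_q, d) attention features for the question
    A_a = A.T @ W_a      # (s_a, d) attention features for the answer
    S2_q = np.stack([S_q, A_q], axis=-1)
    S2_a = np.stack([S_a, A_a], axis=-1)
    return S2_q, S2_a
```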

Attention Based Pooling
The second mechanism modifies the output of the convolution. First, a sliding window is applied over h consecutive columns of the feature map matrix: C^w_{:,i} = Σ_{k=i}^{i+h−1} C_{:,k}, where i ∈ [1..s_q] and the window size equals the filter length h used for the convolution. The values of the resulting feature map matrix C^w ∈ R^{m×s_q} are then weighted to include the attention values. The attention values are generated by summing the attention matrix over its columns for the question and over its rows for the answer candidate: a_q ∈ R^{s_q} with a_{q,i} = Σ_j A_{i,j}, and a_a ∈ R^{s_a} with a_{a,j} = Σ_i A_{i,j}, represent the attention values of each token in the question and the answer candidate, respectively. These vectors are used to weight the feature map matrices: C^q_{:,i} = a_{q,i}·C^{wq}_{:,i} for the question and C^a_{:,j} = a_{a,j}·C^{wa}_{:,j} for the answer candidate, where C^{wq} and C^{wa} denote the window-aggregated feature map matrices for the question and the answer candidate, respectively. Finally, standard max pooling is applied to the attention-weighted feature map matrices. As in (Yin et al., 2015), we refer to this architecture as ABCNN 2.
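The attention-based pooling of ABCNN 2 can be sketched as follows, assuming the feature map C was computed from a zero-padded sentence:

```python
import numpy as np

def abcnn2_pooling(C, a, h):
    """Attention-based pooling over a feature map C of shape
    (m, s + h - 1) produced from a zero-padded sentence.  a (length s)
    holds the per-token attention sums; h is the filter length.
    Returns the pooled vector in R^m."""
    m, n = C.shape
    s = n - h + 1
    # window sum: C_w[:, i] = sum of C[:, i .. i+h-1]
    C_w = np.stack([C[:, i:i + h].sum(axis=1) for i in range(s)], axis=1)
    # weight each column by the attention value of its token
    C_att = C_w * a[None, :]
    # standard max pooling over the attention-weighted feature map
    return C_att.max(axis=1)
```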

Experiments
For the experiments we compared three different architectures: (i) the siamese architecture without the attention mechanism, which we refer to as siamese CNN (sCNN); (ii) the ABCNN 1 architecture; and (iii) the ABCNN 2 architecture.

Setup
For all experiments we used the same pre-trained 200-dimensional word embeddings introduced in Section 2.1. We employ AdaDelta (Zeiler, 2012) as optimizer and L2 regularization to avoid overfitting. Table 1 summarizes the hyperparameter settings. The ranking of the answer candidates for a question is derived from the softmax probability, i.e. the answer candidates are sorted by their probability of being relevant.
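For illustration, a single AdaDelta parameter update with the L2 penalty folded into the gradient can be sketched as follows; the values of rho, eps, and the regularization strength are illustrative, not our actual settings:

```python
import numpy as np

def adadelta_step(w, grad, Eg2, Edx2, rho=0.95, eps=1e-6, l2=1e-4):
    """One AdaDelta update (Zeiler, 2012) with L2 regularization folded
    into the gradient.  rho, eps and l2 are illustrative values."""
    g = grad + l2 * w                      # L2 penalty adds lambda * w
    Eg2 = rho * Eg2 + (1 - rho) * g**2     # running average of g^2
    # update scaled by the ratio of the two running RMS values
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g
    Edx2 = rho * Edx2 + (1 - rho) * dx**2  # running average of dx^2
    return w + dx, Eg2, Edx2
```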

Data
The training data provided by SemEval consist of approx. 2000 questions with 10 answer candidates each. Each answer candidate is manually labelled as either Relevant, Irrelevant, or Potentially Useful. Table 2 gives an overview of the data. For the training phase we combined Training Part 1, Training Part 2, and Dev 2016, and we used Test 2016 as validation set for early stopping. Furthermore, we aggregated the Irrelevant and the Potentially Useful pairs to reduce the problem to a binary classification task.

Results
Table 3 shows the results obtained on the Test 2016 set. Based on these results we decided to use ABCNN 2 as our primary submission and ABCNN 1 as the contrastive submission. Table 4 shows the results obtained on the SemEval-2017 test set. We observe the same pattern as with Test 2016, i.e. ABCNN 2 outperforms ABCNN 1 by 1 point. We included the scores of the 1st-, 2nd-, and 3rd-placed submissions for comparison. Our system is outperformed by 2 points by the KeLP and the Beihang-MSRA submissions and by only 0.6 points by the IIT-UHH submission.

Conclusion
We described a deep learning approach to question answering. The proposed architecture is based on parallel CNNs that compute a sentence representation for the question and the answer. These representations are then concatenated and used to predict whether the answer is relevant to the question. The architecture is augmented with two different attention mechanisms, which improve the performance. Our system was evaluated in the SemEval-2017 competition for Community Question Answering, where it ranked 7th on the Question-Comment Similarity subtask. Our system performed poorly on the other two subtasks; improving its performance on these tasks is left for future work.