Semantic Linking in Convolutional Neural Networks for Answer Sentence Selection

State-of-the-art networks that model relations between two pieces of text often use complex architectures and attention. In this paper, instead of focusing on architecture engineering, we take advantage of small amounts of labelled data that model semantic phenomena in text to encode matching features directly in the word representations. This greatly boosts the accuracy of our reference network, while keeping the model simple and fast to train. Our approach also beats a tree kernel model that uses similar input encodings, and neural models which use advanced attention and compare-aggregate mechanisms.


Introduction
Modeling a match between pieces of text is at the core of many NLP tasks. Recently, manual feature engineering methods have been overshadowed by neural network approaches. These networks model the interaction of two pieces of text, or word-to-word interactions across sentences, using sophisticated attention mechanisms (Wang et al., 2016a; Santos et al., 2016) and compare-aggregate frameworks.
Architectural complexity is tied to longer training times (see the DAWN benchmark, http://dawn.cs.stanford.edu/benchmark/). Meaningful features may take a long time to emerge when leveraging only word representations and the training data of the task at hand. This is especially problematic with little data, as often happens in question answering (QA) tasks, e.g., answer sentence selection (Wang et al., 2007; Yang et al., 2015). Thus, effective word representations are crucial for neural network models to reach state-of-the-art performance.

* Now at Google. † This work was partially carried out when the author was at the University of Trento.

In this work, we try to answer the following research questions: (i) in addition to lexical links, can we incorporate higher-level semantic links between the words in a question and a candidate answer passage, and (ii) can we show that such information improves the quality of our model while allowing us to keep the architecture simple?
We show that modeling semantic relations improves the performance of a neural network for answer sentence selection with (i) a small number of semantic annotations, and (ii) a small increase in training time w.r.t. more complex architectures.

Related Work
Traditional work on QA makes heavy use of syntactic and semantic features (Hickl et al., 2007; Ferrucci et al., 2010). A different direction consists in using structural kernels on text encoded as trees (Severyn and Moschitti, 2012; Severyn et al., 2013a,b; Tymoshenko et al., 2014; Tymoshenko and Moschitti, 2015). Recently, deep learning methods have been very successful in NLP tasks. Words and sentences are mapped into low-dimensional representations using convolutional (Krizhevsky et al., 2012) and recurrent networks (Schuster and Paliwal, 1997), and then used for classification. Complex networks for this task include attentive networks and compare-aggregate networks.
Attentive Networks (Bahdanau et al., 2015;Parikh et al., 2016;Yin et al., 2016) build a sentence representation by also considering the other sentence, weighting the contribution of its parts with the so-called attention mechanism.
Compare-Aggregate Networks (Wang and Jiang, 2017) apply several decompositions to each sentence in a pair. The resulting vectors are compared or composed with multiple functions, and possibly some attention mechanisms. All the intermediate results are then aggregated into a fixed size vector to quantify the final match.
In this work, we take some elements of traditional QA research, i.e., semantic features, and use them to model relationships between sentence pairs in the context of a neural network that is less complex than its attentive and compare-aggregate counterparts.

Question Analysis
Question Analysis is an important part of a QA system (Lally et al., 2012): it can give us syntactic and semantic clues that greatly help in scoring answer passages and in identifying the final answer. Leveraging a relatively small number of annotated examples, we can automatically extract question properties that a QA model may exploit to increase the accuracy of its answers. We use classifiers to extract the question category and the question focus.

Question Category. Questions can be broadly classified into categories according to a given taxonomy. When the category is indicative of the answer type, the latter can be further characterized by the Lexical Answer Type (LAT), which, according to Lally et al. (2012), is a word or noun phrase in the question that specifies the type of the answer without any attempt to understand its semantics.

Question Focus. The literature offers multiple definitions of question focus. According to Ferrucci et al. (2010), the focus is the part of the question that, substituted with the answer, renders the question a stand-alone statement. According to Bunescu and Huang (2010), the focus is the "set of all maximal noun phrases in the question that corefer with the answer". Their definition allows a question to have multiple focuses or an implicit focus. Additionally, it is more tied to the LAT, and indeed the focus can be used to infer the answer type. We adopt this definition since we build our question focus identifier using the annotated data they provide. Note that we do not consider multi-word or implicit focuses.

Answer Sentence Selection with CNNs
Given a query or question q and a candidate answer passage a, the task of answer selection can be defined as learning a function f(q, a) that outputs a relevancy probability s ∈ [0, 1]. Multiple answers associated with a question are sorted in descending order by the score s. A good answer selection system places the highest number of correct answers at the top of a candidate answer list. In this paper, we use convolutional neural networks, referred to as CNNs (Kim, 2014; Kalchbrenner et al., 2014), to (i) classify a question into a category, (ii) identify the focus word in a question, and (iii) build question and answer representations for QA.
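The task definition above can be made concrete with a minimal sketch: a scorer f(q, a) (here a toy token-overlap stand-in, not our trained CNN) returns a relevancy score, and candidates are ranked in descending order of s.

```python
def rank_answers(question, candidates, scorer):
    """Sort candidate answers by the relevancy score s = f(q, a), descending."""
    scored = [(scorer(question, a), a) for a in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored

# Toy scorer standing in for the trained network: fraction of question
# tokens that also appear in the answer (a purely lexical baseline).
def toy_scorer(q, a):
    q_tokens, a_tokens = set(q.lower().split()), set(a.lower().split())
    return len(q_tokens & a_tokens) / max(len(q_tokens), 1)

ranking = rank_answers(
    "who wrote hamlet",
    ["hamlet is a tragedy", "shakespeare wrote hamlet in 1601"],
    toy_scorer,
)
```

A good system, in these terms, is one whose scorer places the correct candidates at the head of `ranking`.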

Sentence Matrix Encoding
A sentence s of length n is a sequence of words (w_1, ..., w_n) drawn from a vocabulary V. Each word is encoded with an integer id from 1 to |V|, and then represented as a vector w ∈ R^d, looked up in an embedding matrix E ∈ R^{d×|V|}. The matrix E is obtained by concatenating the embeddings of all the words in V. The id 0 is used for padding and is mapped to the zero vector. The i-th column of E corresponds to the word with integer id i, to facilitate the lookup.
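The encoding above can be sketched as follows (toy vocabulary and dimensions; E is stored as a list of column vectors, with column 0 reserved for padding):

```python
import random

def build_embeddings(vocab_size, d, seed=0):
    """Embedding table with |V| + 1 columns; column 0 is the padding zero vector."""
    rng = random.Random(seed)
    E = [[0.0] * d]  # id 0 -> zero vector (padding)
    for _ in range(vocab_size):
        E.append([rng.uniform(-0.1, 0.1) for _ in range(d)])
    return E

def encode_sentence(word_ids, E):
    """Look up each word id; the result is the n x d sentence matrix."""
    return [E[i] for i in word_ids]

vocab = {"who": 1, "wrote": 2, "hamlet": 3}
E = build_embeddings(vocab_size=3, d=4)
matrix = encode_sentence([vocab["who"], vocab["wrote"], 0], E)  # trailing 0 pads
```

In practice E is initialized with pre-trained word vectors rather than random values; the lookup itself is the same.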

Question Analysis Networks
We use CNNs for question analysis. The question category network applies convolutions of different widths and then pooling on the question. The results are concatenated and fed to a multilayer perceptron (MLP) that outputs a probability distribution over the possible categories seen during training. The question focus network applies convolutions that operate on windows centered on each question word. Therefore, the input and output resolutions are the same. We stack a number of convolutions to increase the receptive field. Every output vector from the last convolution of the stack is passed through an MLP, which produces a scalar value. All those values are normalized across each sentence with a softmax, to form a probability distribution over the sentence tokens.
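The focus network can be sketched schematically: same-resolution (zero-padded) convolutions over windows centered on each token, a per-token score, and a softmax across the sentence. The kernel weights here are toy values, not trained parameters, and each token carries a single scalar feature instead of a vector.

```python
import math

def conv_same(seq, kernel):
    """1-D convolution with zero padding so input and output lengths match."""
    half = len(kernel) // 2
    padded = [0.0] * half + seq + [0.0] * half
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(seq))]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def focus_distribution(token_features, kernel=(0.2, 0.6, 0.2)):
    """Stack two convolutions, then normalize token scores across the sentence."""
    h = conv_same(token_features, list(kernel))
    h = conv_same(h, list(kernel))  # second convolution widens the receptive field
    return softmax(h)               # probability distribution over the tokens

probs = focus_distribution([0.1, 0.9, 0.2, 0.1])  # one scalar feature per token
```

The argmax of the resulting distribution is taken as the focus word; in the real network the per-token score comes from an MLP over the last convolution's output vectors.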

Answer Sentence Selection Network
Our neural model is based on the model of Severyn and Moschitti (2015, 2016) (S&M from now on), shown in Figure 1. This model is simple, fast, and well studied. It has also been reproduced in other work (Chen et al., 2017; Sequiera et al., 2017).
The S&M model embeds the question and the answer passage and applies independent convolutional and max-pooling layers to each. A bilinear transformation (Bordes et al., 2014) produces a similarity value x_sim for the pair. The similarity, the encoded question and passage, and a vector of real-valued features x_feat are concatenated in the join layer. The latter is fed to a hidden layer with a non-linearity, and a final softmax layer outputs the matching probability. The word vectors of the question and the answer are augmented with an additional feature, which is embedded in a small-dimensional space. This feature signals whether a word appears in both the question and the answer. We found that the real-valued features and the similarity matrix do not increase the network accuracy, and we removed them from our model. This finding is consistent with recent reproduction papers (Sequiera et al., 2017).
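The matching head, under our simplification (no bilinear similarity, no real-valued features), reduces to: concatenate the pooled question and answer encodings in the join layer, apply a hidden layer with a non-linearity, and output a softmax over match/no-match. A minimal sketch with toy random weights:

```python
import math
import random

def hidden_layer(x, W, b):
    """Affine map followed by a tanh non-linearity."""
    return [math.tanh(sum(wi * xi for wi, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def match_probability(x_q, x_a, W, b, W_out, b_out):
    joined = x_q + x_a                     # join layer: plain concatenation
    h = hidden_layer(joined, W, b)
    logits = [sum(wi * hi for wi, hi in zip(row, h)) + bo
              for row, bo in zip(W_out, b_out)]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    return exps[1] / sum(exps)             # softmax probability of class "match"

rng = random.Random(0)
dim, hid = 3, 4                            # toy pooled-encoding / hidden sizes
W = [[rng.uniform(-0.5, 0.5) for _ in range(2 * dim)] for _ in range(hid)]
b = [0.0] * hid
W_out = [[rng.uniform(-0.5, 0.5) for _ in range(hid)] for _ in range(2)]
b_out = [0.0, 0.0]
p = match_probability([0.1, 0.2, 0.3], [0.3, 0.1, 0.0], W, b, W_out, b_out)
```

In the full model the inputs x_q and x_a are the outputs of the convolution and max-pooling layers, and all weights are learned.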

Our QA Network with Semantic Overlap
We propose to add semantic features to the sentence matrix to establish links between words that go beyond lexical matching. Figure 2 describes our network. The key addition to the S&M model is the semantic overlap vector. Each word is therefore represented by concatenating three vectors: the word embedding vector, a feature embedding vector, which can represent two values (whether or not a word is contained in both question and answer), and the semantic overlap embedding vector. The semantic vector w_so, with dimensionality s, embeds a feature so that can assume C + 1 values, if we consider the C question classes plus a no-match value. Each feature value is looked up in an embedding matrix W_so ∈ R^{s×(C+1)}. Analogously, the word overlap binary feature is looked up in an embedding matrix W_wo ∈ R^{r×2}. The final word representation is the concatenation of all these vectors: w = [w; w_wo; w_so].
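The augmented representation w = [w; w_wo; w_so] can be sketched as three table lookups followed by a concatenation (all dimensions below are toy values; the real matrices are learned parameters):

```python
import random

rng = random.Random(0)
d, r, s, C = 5, 2, 3, 6                      # embedding sizes; C question classes

# Tables stored row-wise: one row per feature value.
W_word = [[rng.random() for _ in range(d)] for _ in range(10)]     # toy vocab of 10
W_wo = [[rng.random() for _ in range(r)] for _ in range(2)]        # 2 overlap values
W_so = [[rng.random() for _ in range(s)] for _ in range(C + 1)]    # C classes + no-match

def word_repr(word_id, wo, so):
    """w = [w ; w_wo ; w_so]: concatenate the three embedding lookups."""
    return W_word[word_id] + W_wo[wo] + W_so[so]

w = word_repr(word_id=3, wo=1, so=2)  # an overlapping word matching class id 2
```

The sentence matrix fed to the convolutions is then built from these (d + r + s)-dimensional vectors instead of plain word embeddings.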
Here we describe how the semantic word overlap feature is computed. For each question we collect the output of our question analysis CNNs: the question focus CNN determines which word in the question is the focus, and the question category CNN assigns a class to the question. After that, each word is associated with a semantic overlap feature so (which will eventually be embedded using W_so) according to the following strategy:

1. for each word in the question that is not the question focus, so is equal to 0. For the question focus word, so is equal to the id of the question category (the question focus and category are output by our CNN classifiers);

2. for each answer word, so is equal to 0, with the exception of words covered by named entities (NEs), for which so is equal to the id of the question category that is compatible with their entity type, according to the mapping in Table 1.
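The two rules above can be sketched as follows. The entity-type-to-category mapping stands in for Table 1 and is a hypothetical stub, as are the category ids (1 = HUM, etc.); 0 is the no-match value.

```python
# Hypothetical stand-in for the Table 1 mapping from NE types to
# question-category ids (0 is reserved for "no match").
TYPE_TO_CATEGORY = {"PERSON": 1, "GPE": 2, "DATE": 3}

def question_so(tokens, focus_index, category_id):
    """Rule 1: only the focus word carries the question-category id."""
    return [category_id if i == focus_index else 0
            for i in range(len(tokens))]

def answer_so(entity_types, category_id):
    """Rule 2: answer words whose NE type is compatible with the
    question category get the category id; all other words get 0."""
    return [category_id if TYPE_TO_CATEGORY.get(t) == category_id else 0
            for t in entity_types]

# "who is the boss" -> focus "boss", HUM category (id 1);
# answer "claire runs the firm" with "claire" tagged PERSON.
q_so = question_so(["who", "is", "the", "boss"], focus_index=3, category_id=1)
a_so = answer_so(["PERSON", None, None, None], category_id=1)
```

This links the focus word and the type-compatible answer entities through a shared so value, which the network then embeds via W_so.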
The W_wo and W_so matrices are parameters of the model, and they are learned during training. The question category and question focus annotations for the QA datasets are produced by our neural network classifiers. The NEs are obtained with an off-the-shelf processor, trained on OntoNotes (Weischedel et al., 2012).

Experimental Results
Here we describe how we train our networks for question analysis and then we present the answer sentence selection experiments. More details about preprocessing, training and hyperparameter choice can be found in the appendix.

Question Classification
Dataset. The CNN question classifier is trained on the UIUC dataset (Li and Roth, 2006). We use the 6 coarse classes to train the classifier.

[Figure 2 caption: The semantic overlap vectors of the question focus word boss and the answer word claire are the same, because the latter is an entity of type Person. The question has HUM category. Ignoring stopwords, the word boss appears in both the question and the answer, and this is reflected in the word overlap embedding space.]
Results. The classifier reaches an accuracy of 91.2% on the UIUC test set. Our goal is to annotate new questions with reasonable accuracy. Since the model converges well, we annotate the questions in the QA datasets after training on the UIUC data, selecting the best model on the test data.

Question Focus Identification
Dataset. The CNN focus identifier is trained on the dataset from Bunescu and Huang (2010), which contains the first 2,000 UIUC questions annotated with focus information. After removing the questions with implicit or multiple focuses, we end up with 1,030 questions.

Results. The cross-validation accuracy of the classifier is 92.3%. After convergence, we annotate the focus words in the QA datasets.

TrecQA
Dataset. We test our model on TrecQA (Wang et al., 2007), one of the most popular benchmarks for answer selection. The dataset contains factoid questions and candidate answer sentences. We use the same splits of the original data, but we run our experiments using the larger provided training set (TRAIN-ALL). This data is noisier but gives us more examples for training, including training instances with difficult negative examples. We remove from the dev. and test sets questions without answers, and questions with only correct or only incorrect answer candidates.

Results. MAP and MRR of published systems on TrecQA:

System                        | MAP   | MRR
Yin et al. (2016)             | 69.21 | 71.08
Severyn and Moschitti (2016)  | 69.51 | 71.07
Chen et al. (2017)            | 70.10 | 71.80
Rao et al. (2016)             | 70.90 | 72.30
Tymoshenko et al. (2016)      | 71.25 | 72.30
Guo et al. (2017)             | 71.71 | 73.36
[...]                         | 71.80 | 73.10
Shen et al. (2017)            | 73.30 | 75.00
Wang et al. (2016a)           | 73.41 | 74.18
Wang and Jiang (2017)         | 74... | ...

Our system beats several others that use word alignments and attention mechanisms. The better-performing systems employ expensive bidirectional networks, sophisticated attention mechanisms, and extract multiple views of questions and answers for comparing and aggregating them.

WikiQA
Dataset. TrecQA and its test set are small, so results may be unstable. In addition, lexical overlap between questions and answer candidates is high (Yih et al., 2013), which means that simple lexical similarity features are highly discriminative. Therefore, we also experiment with WikiQA (Yang et al., 2015), which is an order of magnitude larger than TrecQA. We use the setting of Yin et al. (2016).

Results. Our network is able to make better use of the provided semantic clues. Surprisingly, CNN_WO+SO also achieves higher MAP than a state-of-the-art complex approach mixing attention and interaction factors of multiple sentence perspectives.

Discussion
The results with the CNN_WO+SO model suggest that the semantic overlap vectors are an effective way of linking questions and answers. This is especially true given the results on WikiQA, where the questions and answers have little lexical overlap. With the additional semantic information, the CNN is able to better model the relevancy of candidate passages. It also surpasses the accuracy of more complex systems, which have higher training times.

The annotation networks (which need to be trained only once) and the answer selection networks take little time to train: from 10 to 20 minutes in total, depending on the number of question/answer pairs. CNNs are faster at training and inference time than RNNs, especially when the latter incorporate attention mechanisms, which increase the number of computations. We argue that annotating a relatively small number of examples with semantic information can be time well spent to increase model accuracy without increasing its architectural complexity. We also experimented with RNNs (LSTM and GRU) in place of the CNN sentence model: such encoders easily overfitted, requiring careful regularization, and did not yield better results for us.

Conclusion and Future Work
In this paper, we presented a neural network that models semantic links between questions and answers, in addition to lexical links. The annotations for establishing such links are produced by a set of fast neural components for question analysis, trained on publicly available datasets. The evaluation on two QA datasets shows that our approach can achieve state-of-the-art performance using a simple CNN, keeping both complexity and training time low. Our approach is an interesting first step towards a future architecture in which we will jointly optimize the semantic annotators and the answer sentence selection model, in an end-to-end fashion.