FA3L at SemEval-2017 Task 3: A ThRee Embeddings Recurrent Neural Network for Question Answering

In this paper we present ThReeNN, a model for SemEval-2017 Task 3 on Community Question Answering. The proposed model exploits both syntactic and semantic information to build a single, meaningful embedding space. Using a dependency parser in combination with word embeddings, the model creates sequences of inputs for a Recurrent Neural Network, which are then used for the ranking purposes of the Task. The scores obtained on the official test data show promising results.


Introduction
Community Question Answering (cQA) systems have proven useful for a long time and are still an invaluable source of information. However, due to their rapid growth and the large amount of data they provide, it is not easy to find a relevant answer or a good related question among all the others. For these reasons we present a model which tries to tackle these problems. The subtasks we have worked on can be described as follows: A) Question-Comment Similarity: given a question q and 10 comments c_1, ..., c_10, rank the comments from the most relevant to the least relevant with respect to q, and assign to each one a label which can be "Good" or "Bad". B) Question-Question Similarity: given a question q and a set of 10 related questions q_1, ..., q_10, rank the 10 questions from "Relevant" to "Irrelevant" with respect to q. A more detailed description of the task can be found in (Nakov et al., 2017).

Our work has been inspired by studies regarding embedding spaces. Indeed, GloVe embeddings (Pennington et al., 2014) have been used to solve the same subtasks as ours, achieving good results using just word embeddings, which encode semantic information into a vector. Moreover, the model proposed in (Yu et al., 2013), where autoencoders are used to build an embedding space, has been exploited to propose an approach that mixes semantic and syntactic information through the use of word embeddings and dependency parsing. These are then put together and become the input for the neural network, enhancing the capability of the learning system. In principle, our approach aims at enriching semantic information with the syntactic relations holding between elements of the pairs (question-comment or question-question). This should serve well for both subtasks A and B, since the model will learn relations between a question and a comment, or between a question and another question.
However, further research would be useful to understand to what extent there exist differences in the kinds of relations learnt, and therefore in the subtasks.

The paper is organised as follows: Section 2 outlines the preprocessing and additional features used by the model, while Section 3 describes the key models used. Section 4 shows the model selection strategy and the alternatives we explored with respect to word embeddings and their combination. Finally, Section 5 reports the performance of different models and Section 6 wraps everything up and discusses future work. From now on, we will use "comment" to indicate both a comment (Subtask A) and a related question (Subtask B), since our model does not distinguish between them. We participated in SemEval-2017, ranking 8th in Subtask A and 10th in Subtask B.

Figure 1: Dependency parsing of two sentences taken from a question and a comment in the training set. In this example the first input x(t) of the RNN is going to be: <"is",SUBJ,"there","is",SUBJ,"It">.

Data Preprocessing
We applied standard preprocessing to the question and comment bodies, so as to achieve better performance during syntactic parsing and a better alignment of our vocabulary to the GloVe one. Each question also includes the subject of the topic. Preprocessing included the following steps:
• Portions of text that include HTML tags and special sequences were removed or substituted with simpler strings.
• Using a set of regular expressions, we replaced URLs, nicknames, email addresses with a placeholder for each category.
• Overly long repetitions of characters inside tokens were replaced by a single character (e.g. loooot became lot). Indeed, in the language used on community forums, letters are often repeated to emphasize words; with our approach we were able to reconstruct their standard forms. Moreover, repeated punctuation was also collapsed.
• Standard use of spacing after punctuation was restored, in order to avoid problems during tokenization.
• Using a hand-written dictionary, the most common abbreviations were replaced with the corresponding extended form.
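The steps above can be sketched with a few regular expressions. This is only an illustrative sketch: the exact patterns, placeholder tokens, and abbreviation dictionary used in our implementation are not reproduced here.

```python
import re

# Illustrative patterns; placeholder tokens (<URL>, <EMAIL>) are assumptions.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
REPEAT_RE = re.compile(r"(.)\1{2,}")      # 3+ repeats of the same character
MULTIPUNCT_RE = re.compile(r"([!?.])\1+") # residual doubled punctuation
SPACING_RE = re.compile(r"([!?,])(?=\w)") # restore space after punctuation

def preprocess(text):
    text = URL_RE.sub("<URL>", text)       # URLs -> placeholder
    text = EMAIL_RE.sub("<EMAIL>", text)   # email addresses -> placeholder
    text = REPEAT_RE.sub(r"\1", text)      # "loooot" -> "lot"
    text = MULTIPUNCT_RE.sub(r"\1", text)  # "!!" -> "!"
    text = SPACING_RE.sub(r"\1 ", text)    # "hi,there" -> "hi, there"
    return text
```

A single pass of `preprocess` over each question and comment body approximates the normalization described above.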
We then performed sentence splitting and tokenization using nltk (Bird et al., 2009). During the tokenization step, we performed spelling corrections.
Finally, texts were analyzed using Tanl pipeline (Attardi et al., 2007), adding morpho-syntactic and syntactic information (i.e., part of speech tagging and dependency parsing). Figure 1 shows an example of a question and a comment which are parsed accordingly.

Additional Features
After parsing, we generated several features, representing both metadata and properties of the Question-Comment pair. These features have been commonly used in the literature, both with Neural Networks (as in (Mohtarami et al., 2016)) and with linear or SVM models (as in (Mihaylova et al., 2016)), in order to include additional and potentially relevant information not easily conveyed through semantic representations. In our case, they are used as additional input beside the RNN output. The features can be grouped as follows:
• Features encoding standard similarity between question and comment (all measures are expressed in terms of number of tokens):
  - size of the intersection between question and comment
  - Jaccard Coefficient (ratio between intersection size and union size of question and comment)
  - comment length
  - ratio between comment length and question length
  - length of the longest common subsequence between question and comment
• Features encoding metadata information, in particular:
  - position of the comment in the default ordering
  - whether the comment was posted by the same user asking the question
  - whether the user posting the comment had already posted a comment for the same question
• Features encoding the presence of certain elements in the comment body; in particular we looked for:
  - presence of question marks
  - presence of URLs (through regex)
  - presence of usernames (through regex)
  - presence of a username among the authors of the comments preceding the considered one
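The token-based similarity features in the first group can be computed directly from the two token lists; a minimal sketch (helper names are ours, not from the paper's implementation):

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def similarity_features(question_tokens, comment_tokens):
    q, c = set(question_tokens), set(comment_tokens)
    return {
        "intersection_size": len(q & c),
        "jaccard": len(q & c) / len(q | c) if q | c else 0.0,
        "comment_length": len(comment_tokens),
        "length_ratio": len(comment_tokens) / max(len(question_tokens), 1),
        "lcs_length": lcs_length(question_tokens, comment_tokens),
    }
```

The resulting feature vector is concatenated with the metadata and presence features before being passed to the feed-forward layers.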

Model
The proposed model (an implementation is available at https://github.com/AntonioCarta/ThreeRNN) makes use of the output of the previous steps (i.e. the dependency tree produced by the parser) to generate a sequence of triples. The i-th triple is <W_i, rel, W_r>, where W_i is the i-th word of the text and W_r is the word associated to it through rel (i.e. the dependency relation extracted by the parser). Then triples <e_i, rel, e_r> are generated, where e_i and e_r are the word-embedding vectors for the two words, and rel is a one-hot encoding of the dependency relation. The k-th input fed to the RNN is simply made by concatenating the k-th embedding triple of the comment with the k-th one of the question. Figure 1 shows an example of how to obtain a valid input for our model. Our goal is to let the system learn the correct composition rule through syntactic dependencies.

Hence, the input of our model is dual: a sequence of triples which represents the question and another sequence for the comment. These are then passed to a sentence encoder, a Recurrent Neural Network (RNN), which returns a single output aiming to represent the entire sequence. In particular we use a Long Short-Term Memory (Hochreiter and Schmidhuber, 1997), which is capable of learning long-term dependencies. Given the input x^(t) at time t, we have:

f^(t) = σ(W_f x^(t) + U_f h^(t-1) + b_f)
g^(t) = σ(W_g x^(t) + U_g h^(t-1) + b_g)
o^(t) = σ(W_o x^(t) + U_o h^(t-1) + b_o)
s^(t) = f^(t) ⊙ s^(t-1) + g^(t) ⊙ tanh(W_s x^(t) + U_s h^(t-1) + b_s)
h^(t) = o^(t) ⊙ tanh(s^(t))

where f^(t) is the forget gate, g^(t) the input gate, s^(t) the state, o^(t) the output gate and h^(t) the hidden state. U and W are the weight matrices for each gate (e.g., U_o refers to the matrix for the output gate) and ⊙ is the Hadamard product. The RNN output, along with a vector made up of the additional features, becomes the input passed to the final feed-forward layers, which perform the scoring. Each layer of the final network uses a sigmoid activation function.
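The assembly of the RNN inputs from the parser output can be sketched as follows. The relation inventory and the embedding lookup are toy assumptions for illustration, not the paper's actual resources.

```python
import numpy as np

RELATIONS = ["SUBJ", "OBJ", "ROOT", "MOD"]  # toy dependency tag set (assumption)

def one_hot(rel):
    """1-hot encoding of a dependency relation."""
    v = np.zeros(len(RELATIONS))
    v[RELATIONS.index(rel)] = 1.0
    return v

def triple_vector(word_emb, w_i, rel, w_r):
    """Concatenate <e_i, rel, e_r> into a single vector."""
    return np.concatenate([word_emb[w_i], one_hot(rel), word_emb[w_r]])

def rnn_input(word_emb, question_triples, comment_triples):
    """k-th RNN input = k-th question triple || k-th comment triple."""
    return [np.concatenate([triple_vector(word_emb, *q),
                            triple_vector(word_emb, *c)])
            for q, c in zip(question_triples, comment_triples)]
```

For the example of Figure 1, the first input would pair ("is", SUBJ, "there") from the question with ("is", SUBJ, "It") from the comment.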
Hence, given the layer input x, with W and b the layer weight matrix and bias, the layer output y is defined as

y = σ(Wx + b).

The final output o of the network uses a softmax activation, thus we have:

o_i = exp(y_i) / Σ_{j=1}^{n} exp(y_j)

where n is the length of the vector y. The output provides a distribution over two classes: 'Good' and 'Bad/Partially Useful' for Subtask A, 'PerfectMatch/Relevant' and 'Irrelevant' for Subtask B. To obtain the final ranking we took the probability of a given input being labeled with the positive class. The entire network is trained with back-propagation using a cross-entropy loss function. Figure 2 shows the conceptual schema of the model.
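As a concrete illustration, the recurrent encoder and the scoring head described above can be sketched in plain NumPy. Dimensions and the random initialization are placeholders, not the values or trained weights of the actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def lstm_encode(xs, d_in, d_hid):
    """Run the LSTM equations above over a sequence; return the final h."""
    W = {k: rng.normal(0, 0.1, (d_hid, d_in)) for k in "fgos"}
    U = {k: rng.normal(0, 0.1, (d_hid, d_hid)) for k in "fgos"}
    b = {k: np.zeros(d_hid) for k in "fgos"}
    h = s = np.zeros(d_hid)
    for x in xs:
        f = sigmoid(W["f"] @ x + U["f"] @ h + b["f"])  # forget gate
        g = sigmoid(W["g"] @ x + U["g"] @ h + b["g"])  # input gate
        o = sigmoid(W["o"] @ x + U["o"] @ h + b["o"])  # output gate
        s = f * s + g * np.tanh(W["s"] @ x + U["s"] @ h + b["s"])
        h = o * np.tanh(s)
    return h

def score(h, extra_features, n_classes=2):
    """Scoring head: sigmoid layer on [h; features], then softmax."""
    z = np.concatenate([h, extra_features])
    y = sigmoid(rng.normal(0, 0.1, (n_classes, z.size)) @ z)
    e = np.exp(y)
    return e / e.sum()  # distribution over the two classes
```

The first component of the returned distribution (the positive class) would then be used as the ranking score.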

Experiments
To perform model selection, we merged the training and development files provided by the SemEval organisers, then shuffled them and extracted a training and a validation set. We selected various hyper-parameters, shown with their values in Table 4, such as the learning rate, the number of hidden units and hidden layers for the recurrent and feed-forward layers, dropout (Srivastava et al., 2014), L2 regularization, activation functions (i.e. ReLU (Nair and Hinton, 2010), sigmoid and hyperbolic tangent), and optimization algorithms (i.e. Adam (Kingma and Ba, 2014) and RMSprop (Tieleman and Hinton, 2012)). The length threshold for the number of triples in input to the RNN has also been added as a hyper-parameter (i.e., Max length); if the comment/question is shorter, it is filled up with zeros ("null triples"). Since each training run required quite a large amount of time, we opted for a random search technique (Bergstra and Bengio, 2012). The embeddings layer uses pretrained embeddings which are kept fixed during the training phase. We tried to update them together with the entire network during training, but the resulting network always ended up over-fitting. Two different types of embeddings have been evaluated: GloVe (Pennington et al., 2014), which is trained on Wikipedia, and embeddings trained directly on questions and answers extracted from the Qatar Living forum. However, both embeddings worked well in our model, so with the latter we did not obtain any particular improvement.
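The Max length threshold amounts to a simple pad-or-truncate step over the triple sequences; a minimal sketch (the threshold value in the usage below is arbitrary, not the tuned one):

```python
import numpy as np

def pad_or_truncate(triples, max_len, dim):
    """Clip a sequence of triple vectors to max_len, or pad it with
    zero vectors ("null triples") when it is shorter."""
    out = list(triples[:max_len])
    out += [np.zeros(dim)] * (max_len - len(out))
    return np.stack(out)
```

This keeps every RNN input sequence at a fixed length, trading representation length for computational cost.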
To encode the RNN input into a single embedding we compared three different approaches: SUM (which sums all the triples given as input), LSTM (Hochreiter and Schmidhuber, 1997) and GRU (Cho et al., 2014). Finally, the neural network model was implemented using Keras (Chollet, 2015), which provides efficient and easy-to-use deep learning utilities.

Results
The results obtained on the test set, for both Subtasks A and B, are summarized in Table 2. The primary submission uses LSTM for Subtask A and GRU for Subtask B, while the contrastive model uses SUM as aggregation and was submitted only for Subtask A. Using the SUM model, which is computationally less expensive than an RNN, we obtained only a slightly worse MAP (around 0.5%), which suggests we could further improve performance by making the RNN better exploit the input sequence. Moreover, there is a trade-off between representation length and computational cost, controlled through the length threshold; this may be regarded as a crucial choice for our model.

Conclusions
To sum up, we have developed a model which tries to combine semantic and syntactic information into a single vector space. We will further investigate this combination through the use of syntactic relations holding between content words, rather than exploiting the whole set of dependency relations (e.g. different tag sets, partial or shallow parsing of sentences, etc.). Our experiments explored different choices of word embeddings; all of them proved in the end to achieve similar results. However, it may be worth trying to build an ad-hoc embedding space which mixes parsing and lexical information, aiming to improve the performance of our model. Future work may include improvements to the RNN in order to better represent longer sentences, or the use of recursive neural networks that directly use the tree structure given by the dependency parsing, with different weight matrices for each dependency relation.