LCT-MALTA’s Submission to RepEval 2017 Shared Task

Our system uses a BiLSTM with max pooling. The input word embeddings are enhanced with POS, character and dependency information.


Introduction
The RepEval 2017 Shared Task aims to evaluate fixed-length vector representations (or embeddings) of sentences on the basis of a natural language understanding task, viz. natural language inference (NLI), also known as recognising textual entailment. Given two sentences, the first being the premise and the second the hypothesis, the goal of NLI is to train a classifier to predict whether the relation of the hypothesis to the premise is one of entailment, contradiction or neutrality. The training and test data for this 3-way classification task at RepEval 2017 are drawn from the Multi-Genre NLI (MultiNLI) corpus (see Williams et al. (2017) for details). Task participants are provided with both training and development datasets, where parts of the development data match the training data in terms of genre, topic etc. (referred to as matched examples) and other parts do not (referred to as mismatched examples).
This paper presents Team LCT-MALTA's submission to the shared task. In line with previous research, we obtain a single vector which is the combined representation of both the premise and the hypothesis and feed it into a Multilayer Perceptron (MLP) for the actual 3-way classification.

Related Work
Various works in recent years have dealt with the creation of distributed sentence representations, typically based on existing word embeddings such as word2vec (Mikolov et al., 2013) or GloVe (Pennington et al., 2014). The baseline models at the shared task use GloVe vectors and present three approaches to obtaining sentence embeddings (Williams et al., 2017): a) taking the sum of the embeddings of all the words in the sentence (continuous bag of words, CBOW); b) taking the average of the hidden state outputs of a bidirectional LSTM (BiLSTM; Hochreiter and Schmidhuber, 1997) across all the words; and c) the Enhanced Sequential Inference Model (ESIM) by Chen et al. (2017). ESIM, however, relies on cross-sentence attention, which submissions to the shared task may not use.
Instead of the BiLSTM architecture, Tai et al. (2015) propose a tree-structured LSTM to capture the hierarchical structure of natural language sentences. Conneau et al. (2017) use a BiLSTM with max pooling and achieve state-of-the-art results when testing their sentence representations on an NLI task based on the Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015). Lin et al. (2017) introduce a self-attention mechanism with multiple hops of attention on top of a BiLSTM, where the different hops attend to different parts of the input sentence. Their approach represents sentence embeddings as 2-D matrices instead of vectors.

Our Approach
We present in our submission a simple BiLSTM-based approach related to the second baseline model. Crucially, however, we make the following alterations:
• Our input word vectors are enhanced with part-of-speech (POS), dependency and word character information.
• Instead of taking the average of the BiLSTM outputs across all words (mean pooling), we use max pooling.

Enhanced Word Embeddings
A central component of our approach is the enhancement of the pre-trained GloVe vectors with additional linguistic information. The main motivation is the following: BiLSTM proceeds linearly, processing an input sentence word by word. The structure of natural language sentences, however, is hierarchical in nature.¹ We wish to encode some information on linguistic structure while keeping the simple, standard BiLSTM architecture. Therefore, we attach such additional information to the representations of individual words.
In concrete terms, we initialise our model with 300-D pre-trained GloVe vectors (as is done in the BiLSTM baseline model) and enhance them with the following content:

POS-tag Embeddings
Part-of-speech (POS) tagging is a common first step in syntactic sentence processing. We postulate that explicit knowledge of the input words' syntactic categories might help represent the meaning of the input sentence. Thus, using modules from UDPipe (Straka et al., 2016), we tokenise and tag input sentences with universal POS tags. We then generate randomly initialised 20-D embeddings for all POS tags.
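The lookup described above can be sketched as follows. This is a minimal illustration, not our implementation: the tag list is the Universal Dependencies v2 inventory (the tagger's actual tag set may differ), the initialisation range of ±0.1 is an assumption, and the example tags are written by hand rather than produced by UDPipe.

```python
import random

# Universal POS tag inventory (Universal Dependencies v2); assumed here,
# the tag set actually produced by the tagger may differ.
UPOS_TAGS = [
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
]
POS_DIM = 20  # embedding size stated in the text


def init_pos_embeddings(tags, dim, seed=0):
    """Create one randomly initialised vector per POS tag."""
    rng = random.Random(seed)
    # The uniform range is an assumption; the text only says "randomly initialised".
    return {tag: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for tag in tags}


pos_table = init_pos_embeddings(UPOS_TAGS, POS_DIM)

# Hand-written (token, tag) pairs standing in for real tagger output.
tagged_sentence = [("The", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]
pos_vectors = [pos_table[tag] for _, tag in tagged_sentence]
```

In a trained model these vectors would be updated together with the other parameters; the sketch only shows the table and the per-token lookup.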

Dependency Label Embeddings
Dependency parsing captures the binary dependency relations between words in a sentence and determines the central word of an input sentence (the head word of the sentence) (Kübler et al., 2009). In particular, it is able to encode longer dependencies across multiple words. As such, dependency parsing provides vital information on the sentence's structure.
¹ As mentioned in Section 2, Tai et al. (2015) propose a tree-structured LSTM for similar reasons.
Hence, we apply UDPipe's (Straka et al., 2016) state-of-the-art dependency parser to the input sentence and subsequently equip each pre-trained word embedding with that word's dependency information, which in turn consists of the word's head word and its dependency relation to the head. In concrete terms, for each word $w_i$, we map the embedding of its head word $w_j$ to a 50-D vector and generate a 50-D randomly initialised embedding for the dependency relation from $w_i$ to its head. We then take the element-wise product between these two 50-D vectors to obtain the full dependency embedding for the word $w_i$. Formally, the dependency embedding $w_i^d$ for any token $w_i$ is computed as follows:
$$w_i^d = E_d(d_{ij}) \odot (W_d w_j)$$
where $d_{ij}$ is the dependency relation between token $w_i$ and the head token $w_j$, $E_d(d_{ij})$ is the 50-D embedding of that dependency relation, $W_d$ is a matrix of size $50 \times 300$ which maps $w_j$ (originally a 300-D GloVe vector) to a 50-D vector, and $\odot$ is the element-wise product.
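As a rough sketch of this computation, the snippet below projects a head-word vector to 50 dimensions and multiplies it element-wise with a relation embedding. The random vector standing in for the GloVe embedding of the head, the two relation labels, the initialisation range and all function names are assumptions made for illustration. How the root word (which has no head) is handled is not specified above and is left out here.

```python
import random

GLOVE_DIM, DEP_DIM = 300, 50
rng = random.Random(0)


def rand_vec(n):
    return [rng.uniform(-0.1, 0.1) for _ in range(n)]


def rand_mat(rows, cols):
    return [rand_vec(cols) for _ in range(rows)]


def matvec(matrix, vector):
    """Matrix-vector product for plain nested lists."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def hadamard(a, b):
    """Element-wise product of two equal-length vectors."""
    return [x * y for x, y in zip(a, b)]


# W_d: 50x300 projection of the head word's GloVe vector.
W_d = rand_mat(DEP_DIM, GLOVE_DIM)
# E_d: one 50-D randomly initialised vector per dependency relation
# (only two illustrative labels here).
E_d = {rel: rand_vec(DEP_DIM) for rel in ["nsubj", "det"]}


def dependency_embedding(head_glove, relation):
    """Element-wise product of the relation embedding and the projected head vector."""
    return hadamard(E_d[relation], matvec(W_d, head_glove))


# "cat" attached to its head "sleeps" via nsubj; the head's GloVe vector
# is replaced by a random stand-in.
head_glove = rand_vec(GLOVE_DIM)
dep_vec = dependency_embedding(head_glove, "nsubj")
```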

Character Embeddings
The use of character embeddings is mainly inspired by various works which incorporate character-level embeddings into the distributed representation of words to yield improved word embeddings (Santos and Zadrozny, 2014; Bojanowski et al., 2016; Kim et al., 2016). For each token, we run an LSTM over its characters. The LSTM input at each time step is one character, i.e. one letter, and the output is a 100-D hidden state. The last 100-D hidden state vector is then taken as the full character embedding for that token.
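The following is a minimal, untrained sketch of such a character-level encoder, using a textbook LSTM cell written with plain lists. The size of the per-character input vectors (16), the initialisation range and the helper names are assumptions; only the 100-D hidden state and the use of the last state come from the description above.

```python
import math
import random

HIDDEN = 100   # hidden state size stated in the text
CHAR_DIM = 16  # size of the per-character input vectors: an assumption

rng = random.Random(0)


def rand_mat(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# One weight matrix and bias per gate, acting on the concatenation [x; h].
GATES = ("input", "forget", "output", "cell")
W = {k: rand_mat(HIDDEN, CHAR_DIM + HIDDEN) for k in GATES}
b = {k: [0.0] * HIDDEN for k in GATES}
char_table = {}


def char_vec(ch):
    """Look up (or lazily create) the input vector of one character."""
    if ch not in char_table:
        char_table[ch] = [rng.uniform(-0.1, 0.1) for _ in range(CHAR_DIM)]
    return char_table[ch]


def lstm_step(x, h, c):
    """One step of a standard LSTM cell."""
    xh = x + h
    pre = {
        k: [sum(w * v for w, v in zip(row, xh)) + bias
            for row, bias in zip(W[k], b[k])]
        for k in GATES
    }
    i = [sigmoid(p) for p in pre["input"]]
    f = [sigmoid(p) for p in pre["forget"]]
    o = [sigmoid(p) for p in pre["output"]]
    g = [math.tanh(p) for p in pre["cell"]]
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new


def char_embedding(token):
    """Run the LSTM over the token's characters and return the last hidden state."""
    h, c = [0.0] * HIDDEN, [0.0] * HIDDEN
    for ch in token:
        h, c = lstm_step(char_vec(ch), h, c)
    return h


cat_vec = char_embedding("cat")
```

Because the state is updated character by character, tokens with the same letters in a different order (e.g. "cat" and "act") generally receive different embeddings.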

Final Input Word Embeddings
Finally, for each word, we concatenate its original GloVe embedding with all of the above-mentioned additional linguistic information. Thus, our final embedding for each word, which we input to the BiLSTM to compute sentence embeddings, consists of the concatenation of the word's 300-D GloVe embedding, its 20-D POS-tag embedding, its 50-D dependency embedding and its 100-D character embedding. As mentioned, all of the embeddings except for the GloVe [...] Figure 1 illustrates an instance of our final enhanced word embeddings.
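The concatenation amounts to simple bookkeeping of dimensions: 300 + 20 + 50 + 100 = 470. A small sketch, with random stand-in vectors and a hypothetical function name `enhance_word`:

```python
import random

# Component sizes as stated in the text.
GLOVE_DIM, POS_DIM, DEP_DIM, CHAR_DIM = 300, 20, 50, 100


def enhance_word(glove, pos, dep, char):
    """Concatenate the four component vectors of one word."""
    sizes = (len(glove), len(pos), len(dep), len(char))
    if sizes != (GLOVE_DIM, POS_DIM, DEP_DIM, CHAR_DIM):
        raise ValueError(f"unexpected component sizes: {sizes}")
    return glove + pos + dep + char


rng = random.Random(0)


def stand_in(n):
    # Random values used in place of real embeddings.
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


word_vec = enhance_word(
    stand_in(GLOVE_DIM), stand_in(POS_DIM), stand_in(DEP_DIM), stand_in(CHAR_DIM)
)
# len(word_vec) is 470
```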

BiLSTM Sentence Embedding with Max Pooling
The enhanced word embeddings described above are fed into a standard BiLSTM architecture, with a 100-D hidden state output vector for each unidirectional LSTM. Concatenating the forward ($\overrightarrow{h_t}$) and backward ($\overleftarrow{h_t}$) output vectors for each node, we obtain $n$ hidden state vectors $h_t$, where $n$ is the number of words in the input sentence and each $h_t$ is a 200-D vector corresponding to one word. Max pooling over these $n$ vectors, taken separately for each of the 200 dimensions, then yields the fixed-length sentence embedding.
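To make the pooling step concrete, the sketch below concatenates per-word forward and backward states and pools over the word axis. It uses toy 2-D states per direction instead of 100-D ones, assumes the backward states are already aligned by word position, and includes mean pooling only for comparison with the baseline. All function names are illustrative.

```python
def concat_directions(forward, backward):
    """Join forward and backward states word by word (both aligned by position)."""
    return [f + b for f, b in zip(forward, backward)]


def max_pool(states):
    """Dimension-wise maximum over all word positions."""
    return [max(column) for column in zip(*states)]


def mean_pool(states):
    """Dimension-wise average over all word positions (the baseline's choice)."""
    n = len(states)
    return [sum(column) / n for column in zip(*states)]


# Toy example: a two-word sentence with 2-D states per direction.
forward = [[0.1, -0.5], [0.4, 0.2]]
backward = [[0.3, 0.0], [-0.1, 0.7]]

states = concat_directions(forward, backward)
sentence_max = max_pool(states)    # [0.4, 0.2, 0.3, 0.7]
sentence_mean = mean_pool(states)
```

Both pooling functions return a vector whose length is independent of the number of words, which is what makes the sentence representation fixed-length.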

Combining Premise and Hypothesis Representations
We separately obtain the previously described sentence embeddings for the premise and the hypothesis. Mou et al. (2015) examine multiple heuristics for combining the same-length embeddings of two sentences in NLI tasks, including concatenating the two vectors and taking their element-wise difference or product. We use a combination of some of these heuristics. More specifically, we concatenate the two sentence embeddings and then append to this concatenation a) their element-wise maximum and b) their element-wise product. As a result, we obtain a single 800-D vector that is the combined representation of the premise and the hypothesis.
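A compact sketch of this combination, with the hypothetical function name `combine` and toy inputs; with two 200-D sentence embeddings the result has 4 × 200 = 800 dimensions:

```python
def combine(premise, hypothesis):
    """Concatenate both vectors with their element-wise maximum and product."""
    if len(premise) != len(hypothesis):
        raise ValueError("sentence embeddings must have the same length")
    maximum = [max(p, h) for p, h in zip(premise, hypothesis)]
    product = [p * h for p, h in zip(premise, hypothesis)]
    return premise + hypothesis + maximum + product


# Toy 2-D example.
toy = combine([1.0, -2.0], [0.5, 3.0])
# toy == [1.0, -2.0, 0.5, 3.0, 1.0, 3.0, 0.5, -6.0]

# With 200-D sentence embeddings the combined vector is 800-D.
pair_vec = combine([0.0] * 200, [1.0] * 200)
```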

Classification
Finally, we feed the single vector representation of the two sentences to a tanh layer with a softmax layer on top of it to perform the 3-way classification into the classes entailment, contradiction and neutral. Our complete model is illustrated in Fig
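The classification head can be sketched as a forward pass through one tanh layer followed by a softmax. The weights below are random and untrained, so the predicted label is meaningless; the hidden size of 64, the label order, the initialisation range and the function names are assumptions, since none of them is stated above.

```python
import math
import random

LABELS = ["entailment", "contradiction", "neutral"]  # order is an assumption
INPUT_DIM = 800   # size of the combined premise-hypothesis vector
HIDDEN_DIM = 64   # hypothetical; the size of the tanh layer is not stated

rng = random.Random(0)


def rand_mat(rows, cols):
    return [[rng.uniform(-0.05, 0.05) for _ in range(cols)] for _ in range(rows)]


def dense(weights, bias, x):
    """Affine map: weights @ x + bias."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(scores):
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


W1, b1 = rand_mat(HIDDEN_DIM, INPUT_DIM), [0.0] * HIDDEN_DIM
W2, b2 = rand_mat(len(LABELS), HIDDEN_DIM), [0.0] * len(LABELS)


def classify(pair_vec):
    """Return the most probable label and the full probability distribution."""
    hidden = [math.tanh(a) for a in dense(W1, b1, pair_vec)]
    probs = softmax(dense(W2, b2, hidden))
    best = max(range(len(LABELS)), key=probs.__getitem__)
    return LABELS[best], probs


# Random stand-in for a combined 800-D sentence-pair vector.
pair_vec = [rng.uniform(-1.0, 1.0) for _ in range(INPUT_DIM)]
label, probs = classify(pair_vec)
```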

Experiments & Results
We experimented with BiLSTM-based sentence encoders with and without our enhanced word embeddings, each in combination with max pooling or average pooling. We use L2 regularization and set the dropout rate to 0.1 to prevent overfitting. The models are trained for 10 epochs using the Adam optimizer with a learning rate of $10^{-3}$. The models that perform best on the development set are selected for evaluation on the test set. Furthermore, we implemented two systems presented in the literature, viz. the self-attentive embeddings approach of Lin et al. (2017) and the convolutional neural network (CNN) approach of Kim (2014), and compared their performance with that of our models.
[Results table only partly recoverable. Surviving entries: CNN (Kim, 2014): 67.3, 68.0; automatically learned self-attentive embeddings (Lin et al., 2017): values missing.]

Discussion
Our results favour max pooling over average pooling, which is in agreement with the findings of Conneau et al. (2017). Moreover, our enhanced word embeddings are shown to be effective. Their addition alone produces accuracy scores superior to those obtained by incorporating the automatically learned self-attention matrix of Lin et al. (2017). The combination of max pooling and enhanced word embeddings, both of which are extremely simple alterations to the BiLSTM baseline, yields results which clearly beat the baseline.
Thus, the system we submitted to the RepEval 2017 shared task demonstrates that simple alterations to the standard BiLSTM architecture for computing sentence embeddings can yield visible improvements. In particular, linguistic information is shown to be useful for the present NLI task. Therefore, with respect to distributed representations of sentence meaning, more sophisticated systems which take into account linguistic and grammatical relationships are worth further investigation.