UTA DLNLP at SemEval-2016 Task 1: Semantic Textual Similarity: A Unified Framework for Semantic Processing and Evaluation

In this paper, we propose a deep neural network based natural language processing system for semantic textual similarity prediction. We use a multi-layer bidirectional LSTM to learn sentence representations. We then construct matching features that are fed into a Highway Multilayer Perceptron to make predictions. Experimental results show that this approach does not achieve better results than competing systems on the standard evaluation datasets.


Introduction
Traditional approaches (Lai and Hockenmaier, 2014; Zhao et al., 2014; Jimenez et al., 2014) for semantic textual similarity prediction usually build supervised models using a variety of hand-crafted features. Hundreds of features generated at different linguistic levels are exploited to boost prediction performance. With the success of deep learning in many machine learning applications, there has been much interest in applying deep neural network based techniques to further improve prediction tasks in natural language processing (NLP) (Socher et al., 2011b; Iyyer et al., 2014; Tai et al., 2015).
A key component of deep neural networks is the word embedding, which serves as a lookup table for word representations. Deep word representation learning has demonstrated its importance for tasks ranging from low-level NLP tasks such as language modeling, POS tagging, named entity recognition, and semantic role labeling, to high-level tasks such as machine translation, information retrieval, and semantic analysis (Kalchbrenner and Blunsom, 2013; Socher et al., 2011a; Tai et al., 2015). All of these tasks gain performance improvements by further learning either word-level or sentence-level representations.
In this work, we focus on deep neural network based semantic textual similarity prediction. We use a multi-layer bidirectional LSTM (Long Short-Term Memory) (Graves et al., 2013) to learn sentence representations. After that, we construct matching features, followed by a Highway Multilayer Perceptron, to learn high-level hidden matching feature representations. Below, we briefly introduce the multi-layer bidirectional LSTM.
Multi-Layer Bidirectional LSTM

RNN vs LSTM
Recurrent neural networks (RNNs) are capable of modeling sequences of varying lengths via the recursive application of a transition function on a hidden state. For example, at each time step t, an RNN takes the input vector $x_t \in \mathbb{R}^n$ and the hidden state vector $h_{t-1} \in \mathbb{R}^m$, then applies an affine transformation followed by an element-wise nonlinearity such as the hyperbolic tangent function to produce the next hidden state vector $h_t$:

$$h_t = \tanh(W x_t + U h_{t-1} + b)$$

A major issue of RNNs using such transition functions is that it is difficult to learn long-range dependencies during training because the components of the gradient vector can grow or decay exponentially (Bengio et al., 1994).
The LSTM architecture (Hochreiter and Schmidhuber, 1997) addresses the problem of learning long-range dependencies by introducing a memory cell that is able to preserve state over long periods of time. Concretely, at each time step t, the LSTM unit can be defined as a collection of vectors in $\mathbb{R}^d$: an input gate $i_t$, a forget gate $f_t$, an output gate $o_t$, a memory cell $c_t$, and a hidden state $h_t$. We refer to d as the memory dimensionality of the LSTM. One step of an LSTM takes as input $x_t$, $h_{t-1}$, $c_{t-1}$ and produces $h_t$, $c_t$ via the following transition equations:

$$i_t = \sigma(W^{(i)} x_t + U^{(i)} h_{t-1} + b^{(i)})$$
$$f_t = \sigma(W^{(f)} x_t + U^{(f)} h_{t-1} + b^{(f)})$$
$$o_t = \sigma(W^{(o)} x_t + U^{(o)} h_{t-1} + b^{(o)})$$
$$u_t = \tanh(W^{(u)} x_t + U^{(u)} h_{t-1} + b^{(u)})$$
$$c_t = i_t \odot u_t + f_t \odot c_{t-1}$$
$$h_t = o_t \odot \tanh(c_t)$$

where $\sigma(\cdot)$ and $\tanh(\cdot)$ are the element-wise sigmoid and hyperbolic tangent functions, and $\odot$ is the element-wise multiplication operator.
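For concreteness, the LSTM transition above can be sketched in NumPy as follows. This is a minimal illustration only; the parameter names (W_i, U_i, b_i, etc.) are introduced here for readability and are not taken from the system's actual implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM transition step.
    Shapes: x_t (n,), h_prev/c_prev (d,), W_* (d, n), U_* (d, d), b_* (d,)."""
    i_t = sigmoid(params["W_i"] @ x_t + params["U_i"] @ h_prev + params["b_i"])  # input gate
    f_t = sigmoid(params["W_f"] @ x_t + params["U_f"] @ h_prev + params["b_f"])  # forget gate
    o_t = sigmoid(params["W_o"] @ x_t + params["U_o"] @ h_prev + params["b_o"])  # output gate
    u_t = np.tanh(params["W_u"] @ x_t + params["U_u"] @ h_prev + params["b_u"])  # candidate cell
    c_t = i_t * u_t + f_t * c_prev        # new memory cell
    h_t = o_t * np.tanh(c_t)              # new hidden state
    return h_t, c_t
```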

Model Description
One shortcoming of conventional RNNs is that they are only able to make use of previous context. In the semantic textual similarity prediction task, the decision is made after the whole sentence pair has been digested, so exploiting future context should yield better sequence meaning representations. The bidirectional RNN architecture (Graves et al., 2013) offers a solution by also conditioning on future words. At each time step t, the model maintains two hidden states, one for the left-to-right propagation $\overrightarrow{h}_t$ and the other for the right-to-left propagation $\overleftarrow{h}_t$. The hidden state of the bidirectional LSTM is the concatenation of the forward and backward hidden states. The following equations illustrate the main idea:

$$\overrightarrow{h}_t = \mathcal{H}(W_{x\overrightarrow{h}} x_t + W_{\overrightarrow{h}\overrightarrow{h}} \overrightarrow{h}_{t-1} + b_{\overrightarrow{h}})$$
$$\overleftarrow{h}_t = \mathcal{H}(W_{x\overleftarrow{h}} x_t + W_{\overleftarrow{h}\overleftarrow{h}} \overleftarrow{h}_{t+1} + b_{\overleftarrow{h}})$$

Deep RNNs can be created by stacking multiple RNN hidden layers on top of each other, with the output sequence of one layer forming the input sequence for the next. Assuming the same hidden layer function $\mathcal{H}$ is used for all N layers in the stack, the hidden vectors $h^n$ are computed iteratively from n = 1 to N and t = 1 to T:

$$h^n_t = \mathcal{H}(W_{h^{n-1}h^n} h^{n-1}_t + W_{h^n h^n} h^n_{t-1} + b^n_h)$$

where $h^0_t = x_t$. Multilayer bidirectional RNNs can be implemented by replacing each hidden vector $h^n$ with the forward and backward vectors $\overrightarrow{h}^n$ and $\overleftarrow{h}^n$, and ensuring that every hidden layer receives input from both the forward and backward layers at the level below. Furthermore, we can apply the LSTM memory cell to the hidden layers to construct a multilayer bidirectional LSTM.
Finally, we can concatenate the forward sequence hidden matrix $\overrightarrow{M} \in \mathbb{R}^{n \times d}$ and the reversed sequence hidden matrix $\overleftarrow{M} \in \mathbb{R}^{n \times d}$ to form the sentence representation, where n is the number of layers and d is the memory dimensionality of the LSTM. In the next section, we will use the two matrices to generate matching feature planes via linear algebra operations.
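As a rough sketch of this encoder, the snippet below uses PyTorch's built-in nn.LSTM with two layers and bidirectional=True. PyTorch, the module name, and the dimension choices are assumptions made for illustration; the paper's original implementation is not specified.

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """Multi-layer bidirectional LSTM sentence encoder (illustrative sketch)."""

    def __init__(self, embed_dim=300, mem_dim=100, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size=embed_dim, hidden_size=mem_dim,
                            num_layers=num_layers, bidirectional=True,
                            batch_first=True)

    def forward(self, embedded):              # embedded: (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(embedded)     # h_n: (num_layers * 2, batch, mem_dim)
        layers, batch, mem_dim = h_n.size(0) // 2, h_n.size(1), h_n.size(2)
        # Separate directions, then concatenate the forward and backward final
        # states per layer to obtain the sentence matrix M of shape (n, 2d).
        h_n = h_n.view(layers, 2, batch, mem_dim)
        M = torch.cat([h_n[:, 0], h_n[:, 1]], dim=-1)    # (layers, batch, 2*mem_dim)
        return M.transpose(0, 1)                          # (batch, layers, 2*mem_dim)
```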

Learning from Matching Features
Inspired by (Tai et al., 2015), we apply element-wise merging to the first sentence matrix $M_1 \in \mathbb{R}^{n \times 2d}$ and the second sentence matrix $M_2 \in \mathbb{R}^{n \times 2d}$. Following that method, we define two simple matching feature planes (FPs) with the equations below:

$$FP_1 = M_1 \odot M_2$$
$$FP_2 = |M_1 - M_2|$$

where $\odot$ is the element-wise multiplication. The $FP_1$ measure can be interpreted as an element-wise comparison of the signs of the input representations. The $FP_2$ measure can be interpreted as the element-wise distance between the input representations.
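A minimal sketch of these two feature planes, assuming the two sentence matrices are available as PyTorch tensors (continuing the illustrative setup from the previous sketch):

```python
import torch

def matching_feature_planes(M1: torch.Tensor, M2: torch.Tensor) -> torch.Tensor:
    """Build the two matching feature planes and stack them into one tensor.
    M1, M2: (batch, n_layers, 2 * mem_dim) sentence representations."""
    fp1 = M1 * M2                # element-wise product: sign agreement of the two sentences
    fp2 = torch.abs(M1 - M2)     # element-wise absolute difference: distance between them
    return torch.cat([fp1, fp2], dim=-1)   # (batch, n_layers, 4 * mem_dim)
```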

Highway MLP
Inspired by (Kim et al., 2016), we build a Highway Multilayer Perceptron (HMLP) layer to further enhance representation learning. A conventional MLP applies an affine transformation followed by a nonlinearity to obtain a new set of features:

$$z = g(W y + b)$$

One layer of a highway network instead does the following:

$$z = t \odot g(W_H y + b_H) + (1 - t) \odot y$$

where g is a nonlinearity, $t = \sigma(W_T y + b_T)$ is called the transform gate, and $(1 - t)$ is called the carry gate. Similar to the memory cells in LSTM networks, highway layers allow some dimensions of the input to be carried adaptively and directly to the output, which eases the training of deep networks.
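A small sketch of one highway layer in PyTorch, assuming the input and output share the same dimensionality (a requirement of the highway formulation); the module name and the choice of ReLU for g are illustrative assumptions:

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """One highway layer: z = t * g(W_H y + b_H) + (1 - t) * y."""

    def __init__(self, dim: int):
        super().__init__()
        self.transform = nn.Linear(dim, dim)   # W_H, b_H
        self.gate = nn.Linear(dim, dim)        # W_T, b_T

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        t = torch.sigmoid(self.gate(y))        # transform gate
        h = torch.relu(self.transform(y))      # nonlinear candidate (g = ReLU here)
        return t * h + (1.0 - t) * y           # carry gate passes part of y through
```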

Experiments
We use all of the datasets from previous STS tasks to train our LSTM model. The total number of training examples is 12,912, and the number of development examples is 680. Note that we did not use cross-validation to find the best model. Table 1 shows the Pearson correlation results for the STS task.

Hyperparameters and Training Details
We initialize our word representations with the publicly available 300-dimensional GloVe word vectors (http://nlp.stanford.edu/projects/glove/). The LSTM memory dimension is 100 and the number of layers is 2. Training is done through stochastic gradient descent over shuffled mini-batches with the AdaGrad update rule (Duchi et al., 2011). The learning rate is set to 0.05 and the mini-batch size is 25. The model parameters were regularized with a per-minibatch L2 regularization strength of 10^-4. Note that the word embeddings were fixed during training.
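Under the same illustrative PyTorch assumptions as the earlier sketches, the training setup described above could look roughly like the following; using AdaGrad's weight_decay to approximate the per-minibatch L2 term is an assumption, not a statement about the original system:

```python
import torch
import torch.nn as nn

# SentenceEncoder and HighwayLayer are the illustrative modules sketched earlier;
# this composition is hypothetical.
model = nn.ModuleDict({
    "encoder": SentenceEncoder(embed_dim=300, mem_dim=100, num_layers=2),
    "highway": HighwayLayer(dim=4 * 100),
})

# AdaGrad update rule, learning rate 0.05, L2 strength 1e-4, mini-batches of 25;
# fixed word embeddings are simply excluded from model.parameters().
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.05, weight_decay=1e-4)
batch_size = 25
```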

Objective Functions
The semantic similarity prediction task measures the degree of semantic similarity of a sentence pair by assigning a similarity score ranging from 1 (completely unrelated) to 5 (semantically equivalent). Inspired by (Tai et al., 2015), given a sentence pair, we wish to predict a real-valued similarity score in the range [1, K], where K > 1 is an integer. The sequence 1, 2, ..., K is the ordinal scale of similarity, where higher scores indicate greater degrees of similarity. We predict the similarity score $\hat{y}$ by predicting the probability that the learned hidden representation $x_h$ belongs to each point on the ordinal scale. This is done by projecting the input representation onto a set of hyperplanes, each of which corresponds to a class; the distance from the input to a hyperplane reflects the probability that the input is located at the corresponding point on the scale. Mathematically, the similarity score $\hat{y}$ can be written as:

$$\hat{p}_\theta = \mathrm{softmax}(W x_h + b)$$
$$\hat{y} = r^T \hat{p}_\theta$$

where $r^T = [1\ 2\ \ldots\ K]$, and the weight matrix W and bias b are parameters.
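Assuming the hidden matching representation x_h is available as a PyTorch tensor, the score prediction can be sketched as below; the class name and dimensions are illustrative:

```python
import torch
import torch.nn as nn

class OrdinalScorer(nn.Module):
    """Predict a real-valued similarity score in [1, K] (illustrative sketch)."""

    def __init__(self, hidden_dim: int, K: int = 5):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, K)   # W, b
        self.register_buffer("r", torch.arange(1, K + 1, dtype=torch.float32))

    def forward(self, x_h: torch.Tensor) -> torch.Tensor:
        p_hat = torch.softmax(self.proj(x_h), dim=-1)   # class probabilities over 1..K
        return p_hat @ self.r                            # expected score y_hat = r^T p_hat
```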
In order to define the task objective function, we construct a sparse target distribution p that satisfies $y = r^T p$:

$$p_i = \begin{cases} y - \lfloor y \rfloor, & i = \lfloor y \rfloor + 1 \\ \lfloor y \rfloor - y + 1, & i = \lfloor y \rfloor \\ 0, & \text{otherwise} \end{cases}$$

for $1 \le i \le K$. The objective function is then defined as the regularized KL-divergence between p and $\hat{p}_\theta$:

$$J(\theta) = \frac{1}{m} \sum_{k=1}^{m} \mathrm{KL}\left(p^{(k)} \,\big\|\, \hat{p}^{(k)}_\theta\right) + \frac{\lambda}{2} \lVert\theta\rVert_2^2$$

where m is the number of training pairs and the superscript k indicates the k-th sentence pair (Tai et al., 2015).
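A hedged sketch of the target construction and the KL term, assuming the gold scores y are given as a float tensor in [1, K] and that the L2 regularizer is handled separately (e.g., by the optimizer):

```python
import torch

def sparse_target(y: torch.Tensor, K: int = 5) -> torch.Tensor:
    """Build the sparse distribution p with r^T p = y, for scores y in [1, K].
    y: (batch,) float tensor; returns (batch, K) target distributions."""
    batch = y.size(0)
    idx = torch.arange(batch)
    floor = y.floor().long()                  # ordinal bin at or just below y
    frac = y - floor.float()                  # probability mass for the bin above
    p = torch.zeros(batch, K)
    p[idx, floor.clamp(max=K - 1)] = frac             # bin floor(y)+1 (0-indexed: floor)
    p[idx, (floor - 1).clamp(max=K - 1)] = 1.0 - frac  # bin floor(y)   (0-indexed: floor-1)
    return p

def kl_objective(p: torch.Tensor, p_hat: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Mean KL(p || p_hat) over the mini-batch."""
    return (p * ((p + eps).log() - (p_hat + eps).log())).sum(dim=-1).mean()
```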

Conclusions and Discussions
In this paper, we propose a deep neural network architecture that leverages pre-trained word embeddings to learn sentence meanings. Our approach first generates word sequence representations, which are fed into a multilayer bidirectional LSTM to learn sentence representations. After that, we construct matching features followed by a highway MLP to learn high-level hidden matching feature representations. Experimental results on the benchmark datasets show that our model did not achieve state-of-the-art performance compared with other approaches; it is above the median score only on the question-question domain. We suspect that our model has limited domain adaptation capability, and that the highway MLP may increase model complexity and thereby hurt performance.

Table 1: Pearson correlation results of the STS task (columns: Method, All, ...).