An Empirical Study on End-to-End Sentence Modelling

Accurately representing the meaning of a piece of text, otherwise known as sentence modelling, is an important component in many natural language inference tasks. We survey the spectrum of these methods, which lie along two dimensions: input representation granularity and composition model complexity. Using this framework, we reveal in our quantitative and qualitative experiments the limitations of the current state-of-the-art model in the context of sentence similarity tasks.


Introduction
Accurately representing the meaning of a piece of text remains an open problem. To illustrate why it is difficult, consider the pair of sentences A and B below in the context of a sentence similarity task.
A: The shares of the company dropped. B: The organisation's stocks slumped.
If we use a very naïve model such as bag-of-words to represent a sentence and count the words common to the two sentences to determine their similarity, the score would be very low even though the sentences are highly similar. How then do we represent the meaning of sentences? Firstly, we must be able to represent them in ways that computers can understand. Based on the Principle of Compositionality (Frege, 1892), we define the meaning of a sentence as a function of the meaning of its constituents (i.e., words, phrases, morphemes). Generally, there are two main approaches to representing constituents: localist and distributed representations. With the localist representation, we represent each constituent with a unique representation usually taken from its position in a vocabulary $V$. The best example of this sparse representation is the "one-hot" representation (see Appendix A for details). However, this kind of representation suffers from the curse of dimensionality and does not consider the syntactic relationship of a constituent with other constituents. These two shortcomings are addressed by the distributed representation (Hinton, 1984), which encodes a constituent, based on its co-occurrence with other constituents appearing within its context, into a dense $n$-dimensional vector where $n \ll |V|$. Estimating the distributed representation has been an active research topic in itself. Baroni et al. (2014) conducted a systematic comparative evaluation of context-counting and context-predicting models for generating distributed representations and concluded that the latter outperform the former, but Levy et al. (2015) later showed that simple pointwise mutual information (PMI) methods perform similarly if they are properly tuned. To date, the most popular architectures for efficiently estimating these distributed representations are word2vec (Mikolov et al., 2013a) and GloVe (Pennington et al., 2014).
Subsequent developments estimate distributed representations at other levels of granularity (see Section 2.1).
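To make the bag-of-words failure on sentences A and B concrete, the following minimal sketch counts shared words. The tokeniser and the use of Jaccard overlap as the score are our own illustrative assumptions; the text only speaks of "counting common words".

```python
import re

def bag_of_words(sentence):
    """Lower-case the sentence and return its set of word tokens."""
    return set(re.findall(r"[a-z']+", sentence.lower()))

def overlap_similarity(a, b):
    """Jaccard overlap (an assumed scoring choice): shared / all distinct words."""
    words_a, words_b = bag_of_words(a), bag_of_words(b)
    return len(words_a & words_b) / len(words_a | words_b)

sentence_a = "The shares of the company dropped."
sentence_b = "The organisation's stocks slumped."

# Only the function word "the" is shared, so the score is low
# even though the two sentences are near-paraphrases.
shared = bag_of_words(sentence_a) & bag_of_words(sentence_b)
score = overlap_similarity(sentence_a, sentence_b)
print(shared, score)
```

The only overlap is a function word, which is why a purely lexical count cannot recognise that "shares" and "stocks", or "dropped" and "slumped", are related.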
While much research has been directed into constructing representations for constituents, there has been far less consensus regarding the representation of larger semantic structures such as phrases and sentences (Blacoe and Lapata, 2012). A simple approach is to look up the vector representations of the constituents (i.e., embeddings) and take their sum or average, which yields a single vector of the same dimension. This strategy is effective in simple tasks but loses word order information and syntactic relations in the process (Mitchell and Lapata, 2008; Turney et al., 2010). Most modern neural network models have a sentence encoder that learns the representation of sentences more efficiently while preserving word order and compositionality (see Section 2.1).
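The loss of word order under summing or averaging can be seen in a small sketch. The three-dimensional embeddings below are made-up values for illustration only, not vectors from word2vec or GloVe.

```python
# Toy embeddings (hypothetical values chosen only for illustration).
EMBEDDINGS = {
    "dog": [1.0, 0.25, 0.0],
    "bites": [0.25, 1.0, 0.5],
    "man": [0.0, 0.5, 1.0],
}

def average_embedding(tokens):
    """Average the embeddings of the tokens into one fixed-size vector."""
    dim = len(next(iter(EMBEDDINGS.values())))
    total = [0.0] * dim
    for tok in tokens:
        for k, value in enumerate(EMBEDDINGS[tok]):
            total[k] += value
    return [value / len(tokens) for value in total]

v1 = average_embedding(["dog", "bites", "man"])
v2 = average_embedding(["man", "bites", "dog"])

# Both word orders map to the same vector, so the two
# sentences become indistinguishable after composition.
print(v1, v2)
```

Because addition is commutative, any permutation of the same words yields the same sentence vector, which is the limitation the sequence-based encoders discussed later aim to address.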
In this work, we present a generalised framework for sentence modelling based on a survey of state-of-the-art methods. Using the framework as a guide, we conducted preliminary experiments by implementing an end-to-end version of the state-of-the-art model, and we reveal its limitations through evaluation on sentence similarity tasks.

Related Work
The best way to evaluate sentence models is to assess how they perform on actual natural language inference (NLI) tasks. In this work, we examine three related tasks which are central to natural language understanding: paraphrase detection (Dolan et al., 2004; Xu et al., 2015), semantic similarity measurement (Marelli et al., 2014; Xu et al., 2015; Agirre et al., 2016a) and interpretable semantic similarity measurement (Agirre et al., 2016b). We refer the reader to the respective papers for the task descriptions and dataset details.
Among the four broad types of methods we have identified in the literature (see Appendix C.1), we focus in this paper on deep learning (DL) methods because they support end-to-end learning, i.e., they use few hand-crafted features or none at all, making them easier to adapt to new domains. More importantly, these methods have obtained comparable performance relative to other top-ranking methods.

Sentence Modelling Framework
As a contribution of this work, we survey the spectrum of DL methods, which lie on two dimensions: input representation granularity and composition model complexity, which are both central to sentence modelling (see Appendix Figure C.2 for a graphical illustration).
The first dimension (see horizontal axis of Appendix Figure C.2) is the granularity of input representation. This dimension characterises a trade-off between the syntactic dependencies captured in the representation and data sparsity. On the one hand, character-based methods (Vosoughi et al., 2016; dos Santos and Zadrozny, 2014) are not faced with the data sparsity problem; however, it is not straightforward to determine whether composing sentences based on individual character representations would represent the originally intended semantics. On the other hand, while sentence embeddings (Kiros et al., 2015), which are learned by predicting the previous and next sentences given the current sentence, could intuitively represent the actual semantics, they suffer from data sparsity.
The second dimension (see vertical axis of Appendix Figure C.2) is the spectrum of composition models, ranging from bag-of-items-driven architectures to compositionality-driven ones, to account for the morphological, lexical, syntactic, and compositional aspects of a sentence. Some of the popular methods are based on Bag-of-Item models, which represent a sentence by performing algebraic operations (e.g., addition or averaging) over the vector representations of individual constituents (Blacoe and Lapata, 2012). However, these models have received criticism as they use a linear bag-of-words context and thus do not take syntax into account. Spatial neural networks, e.g., Convolutional Neural Networks or ConvNets (LeCun et al., 1998), have been shown to capture morphological variations in short subsequences (dos Santos and Zadrozny, 2014; Chiu and Nichols, 2016). However, this architecture still does not capture the overall syntactic information. Thus, Sutskever et al. (2014) proposed the use of sequence-based neural networks, e.g., Recurrent Neural Networks and Long Short-Term Memory models (Hochreiter and Schmidhuber, 1997), because they can capture long-range temporal dependencies. Tai et al. (2015) introduced Tree-LSTM, a generalisation of LSTMs to tree-structured network topologies, e.g., Recursive Neural Networks (Socher et al., 2011). However, this type of network requires input from an external resource (i.e., a dependency or constituency parser).
More complex models involve stacked architectures of the three basic forms above (Yin et al., 2015; Cheng and Kartsaklis, 2015; Zhang et al., 2015; He et al., 2015), which capture the syntactic and semantic structure of a language. However, in addition to being computationally intensive, most of these architectures model sentences as vectors of a fixed size and thus risk losing information, especially when input sentences are of varying lengths. Recently, Bahdanau et al. (2014) introduced the concept of attention, originally in the context of machine translation, where the network learns to align parts of the source sentence that match the constituents of the target sentence, without having to explicitly form these parts as hard segments. This enables phrase alignments between sentences, as described by Yin and Schütze (2016) in the context of a textual entailment recognition task.
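The idea of soft alignment behind attention can be sketched as follows. This is a simplified illustration that scores source vectors with a dot product; Bahdanau et al. (2014) learn the scoring function with a small neural network, so the scoring choice, the function names and the toy vectors here are all assumptions.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to one."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, source_vectors):
    """Soft-align one target-side query vector against all source vectors."""
    # Dot-product scoring is an assumed simplification of learned scoring.
    scores = [sum(q * s for q, s in zip(query, vec)) for vec in source_vectors]
    weights = softmax(scores)
    dim = len(source_vectors[0])
    # The context vector is the weighted average of the source vectors.
    context = [
        sum(w * vec[k] for w, vec in zip(weights, source_vectors))
        for k in range(dim)
    ]
    return weights, context

source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy source-side states
query = [2.0, 0.0]                              # toy target-side state
weights, context = attend(query, source)
print(weights, context)
```

Every source position receives some weight, so no hard segmentation is needed; positions that match the query simply receive more of it.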

Preliminary Experiments
In this section, we describe the preliminary experiments we conducted in order to gain a deeper understanding of the limitations of the state-of-the-art model.
Firstly, we define sentence similarity as a supervised learning task where each training example consists of a pair of sentences $(x^a_1, \dots, x^a_{T_a})$ and $(x^b_1, \dots, x^b_{T_b})$, denoting the constituent vectors from each sentence, respectively (which may be of different lengths, $T_a \neq T_b$), along with a single real-valued label $y$ for the pair. We evaluated the performance of the state-of-the-art model on this task.

Model Overview
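Since we focus on end-to-end sentence modelling, we implement a simplified version (see Table 1) of MaLSTM (Mueller and Thyagarajan, 2016), the state-of-the-art model on this task (see Appendix Figure C.1). The model uses a siamese architecture of Long Short-Term Memory (LSTM) networks to read the word vectors representing each input sentence. Each LSTM cell has four components: input gate $i_t$, forget gate $f_t$, memory state $c_t$, and output gate $o_t$, which decide what information to retain or forget in a sequence of inputs. Equations 1-6 are the updates performed at each LSTM cell for a sequence of inputs $(x_1, \dots, x_T)$, where $W$ and $U$ are weight matrices, $b$ are bias vectors, $\sigma$ is the logistic sigmoid and $\odot$ denotes element-wise multiplication:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \tag{1}$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \tag{2}$$
$$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c) \tag{3}$$
$$c_t = i_t \odot \tilde{c}_t + f_t \odot c_{t-1} \tag{4}$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \tag{5}$$
$$h_t = o_t \odot \tanh(c_t) \tag{6}$$

This model computes the sentence similarity based on the Manhattan distance between the final hidden state representations for each sentence:

$$g(h^a_{T_a}, h^b_{T_b}) = \exp\left(-\lVert h^a_{T_a} - h^b_{T_b} \rVert_1\right) \in [0, 1]$$

(Mueller and Thyagarajan, 2016).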
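The siamese encoding and Manhattan similarity can be sketched in plain Python as below. This is a minimal illustration assuming the standard LSTM update, randomly initialised (untrained) weights and toy dimensions; it is not the TensorFlow implementation used in our experiments, and all function names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, U, b, x, h):
    """Compute W x + U h + b with list-of-lists matrices and list vectors."""
    return [
        sum(w * xi for w, xi in zip(W[r], x))
        + sum(u * hi for u, hi in zip(U[r], h))
        + b[r]
        for r in range(len(b))
    ]

def lstm_step(params, x, h_prev, c_prev):
    """One LSTM update (standard formulation, assumed here)."""
    i = [sigmoid(v) for v in affine(*params["i"], x, h_prev)]    # input gate
    f = [sigmoid(v) for v in affine(*params["f"], x, h_prev)]    # forget gate
    o = [sigmoid(v) for v in affine(*params["o"], x, h_prev)]    # output gate
    g = [math.tanh(v) for v in affine(*params["c"], x, h_prev)]  # candidate memory
    c = [ii * gg + ff * cp for ii, gg, ff, cp in zip(i, g, f, c_prev)]
    h = [oo * math.tanh(cc) for oo, cc in zip(o, c)]
    return h, c

def encode(params, sequence, hidden_size):
    """Read a sequence of input vectors and return the final hidden state."""
    h, c = [0.0] * hidden_size, [0.0] * hidden_size
    for x in sequence:
        h, c = lstm_step(params, x, h, c)
    return h

def manhattan_similarity(h_a, h_b):
    """exp(-L1 distance): 1 for identical states, approaching 0 as they diverge."""
    return math.exp(-sum(abs(a - b) for a, b in zip(h_a, h_b)))

def random_params(input_size, hidden_size, rng):
    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {
        gate: (mat(hidden_size, input_size), mat(hidden_size, hidden_size),
               [0.0] * hidden_size)
        for gate in ("i", "f", "o", "c")
    }

rng = random.Random(0)
params = random_params(input_size=3, hidden_size=4, rng=rng)  # shared by both branches
sent_a = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(5)]  # T_a = 5
sent_b = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(7)]  # T_b = 7
sim_ab = manhattan_similarity(encode(params, sent_a, 4), encode(params, sent_b, 4))
sim_aa = manhattan_similarity(encode(params, sent_a, 4), encode(params, sent_a, 4))
print(sim_ab, sim_aa)
```

Both sentences pass through the same parameters, which is what makes the architecture siamese, and sentences of different lengths are still mapped to hidden states of the same size, so their distance is well defined.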

Training Details
We use the 300-dimensional pre-trained word2vec (Mikolov et al., 2013b) word embeddings and compare the performance with that of GloVe (Pennington et al., 2014) embeddings. Out-of-embedding-vocabulary (OOEV) words are replaced with an <unk> token. We retain the word cases and keep the digits. For the character representation, we fine-tune the 50-dimensional initial embeddings, updating them by back-propagating gradients during training of the neural network model. The chosen size of the embeddings was found to perform best after initial experiments with different sizes. Our model uses 50-dimensional hidden representations $h_t$ and memory cells $c_t$. Optimisation of the parameters is done using the SGD-based Adam method (Kingma and Ba, 2014) and we perform gradient clipping to prevent exploding gradients. We tune the hyper-parameters on the validation set by random search, since an exhaustive search across the full hyper-parameter space is infeasible due to time constraints. After conducting initial experiments, we found the optimal training parameters to be the following: batch size = 30, learning rate = 0.01, learning rate decay = 0.98, dropout = 0.5, number of LSTM layers = 1, maximum epochs = 10, patience = 5 epochs. Patience is the early stopping condition based on performance on the validation sets. We used the TensorFlow library to implement and train the model.
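The patience-based early stopping rule can be sketched as follows. We assume here that patience counts consecutive epochs without improvement of the validation score; the function name and the validation scores are hypothetical, and a real loop would train for one epoch before each evaluation.

```python
def run_with_patience(validation_scores, max_epochs=10, patience=5):
    """Return (best_epoch, last_epoch_run) under patience-based early stopping."""
    best_score = float("-inf")
    best_epoch = 0
    last_epoch = 0
    stale = 0  # consecutive epochs without a new best validation score
    for epoch, score in enumerate(validation_scores[:max_epochs], start=1):
        last_epoch = epoch
        if score > best_score:
            best_score, best_epoch, stale = score, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break  # no improvement for `patience` epochs: stop training
    return best_epoch, last_epoch

# Hypothetical validation scores, one per epoch.
scores = [0.60, 0.65, 0.70, 0.69, 0.68, 0.70, 0.69, 0.68, 0.67, 0.66]
best_epoch, last_epoch = run_with_patience(scores)
print(best_epoch, last_epoch)
```

With these scores the best model is found at epoch 3 and training halts at epoch 8, two epochs before the maximum of 10.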

Dataset and Evaluation
We measure the model's performance on three benchmark datasets, i.e., SICK 2014 (Marelli et al., 2014), STS (Agirre et al., 2016a) and PIT (Xu et al., 2015), using Pearson correlation. We assert that a robust model should perform consistently well on these three datasets. Furthermore, using the framework described in Section 2.1, we chose to compare the model performance at two levels of input representation (i.e., character-level vs word-level) and with two composition models (i.e., LSTM vs vector sum) in order to eliminate the need for external tools such as parsers. Table 2 shows the performance across input representations and composition models. As expected, our simplified model performs worse (Pearson correlation = 0.7355) than what was reported in the original MaLSTM paper (Pearson correlation = 0.8822) on the SICK dataset (using word2vec). This performance difference (around 0.15 in absolute Pearson correlation) could be attributed to the additional features (see Table 1) that the authors of the state-of-the-art model added to their system.
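For reference, Pearson correlation between gold and predicted similarity scores can be computed as in the sketch below. The score lists are invented for illustration and are not taken from any of the datasets.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

gold = [1.0, 2.0, 3.0, 4.0, 5.0]       # hypothetical gold similarity scores
predicted = [1.2, 1.9, 3.4, 3.8, 4.9]  # hypothetical model predictions
r = pearson(gold, predicted)
print(round(r, 4))
```

Pearson correlation rewards predictions that vary linearly with the gold scores, so a model can score well even if its predictions are on a different scale from the annotations.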

Results and Discussion
With respect to input representation, the word-based one yields better performance than the character-level representation on all datasets, for the obvious reason that it carries more semantic information. Furthermore, the character-level representation performs better with LSTM than with Vector Sum (VS) because LSTM is able to retain sequential information. Regarding word embeddings, GloVe resulted in higher performance compared to word2vec in all datasets and models, except with VS on the SICK dataset, where word2vec is slightly better. Table 3 shows the percentage of OOEV words in each dataset with respect to its vocabulary size. Upon closer inspection, we found that word2vec does not have embeddings for stopwords (e.g., a, to, of, and). With respect to token-based statistics, these OOEVs comprise 95% (SICK), 67% (PIT) and 44% (STS) in the respective datasets. Although further work is needed to ascertain the effect of this type of OOEV, we hypothesise that GloVe's superior performance could be attributed to it, if not to its word vector quality as claimed by Pennington et al. (2014). With respect to the composition model, LSTM performs better than VS only on the SICK dataset, while VS dominates on both the PIT and STS datasets. Specifically, Figure 1 shows the overall and the per-category performance of the model on the STS dataset. VS outperforms LSTM by a considerable margin overall and in every category except Postediting and Headlines. On the one hand, this suggests that simple compositional models can perform competitively on clean and noisy datasets (e.g., with fewer OOEVs). On the other hand, this shows the ability of LSTM models to capture long-term dependencies, especially on clean datasets (e.g., the SICK dataset), because these contain sufficient semantic information, while their performance decreases dramatically on noisy data or on datasets with a high proportion of OOEVs (e.g., the PIT and STS datasets).
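The difference between vocabulary-based and token-based OOEV statistics can be illustrated with the sketch below. The sentence and the embedding vocabulary are invented for illustration; the vocabulary deliberately omits the stopwords "a" and "and" to mimic the behaviour we observed for word2vec.

```python
from collections import Counter

def ooev_rates(tokens, embedding_vocab):
    """Return (type_rate, token_rate) of words missing from the embeddings."""
    counts = Counter(tokens)
    missing_types = [w for w in counts if w not in embedding_vocab]
    type_rate = len(missing_types) / len(counts)                        # per distinct word
    token_rate = sum(counts[w] for w in missing_types) / len(tokens)    # per occurrence
    return type_rate, token_rate

# Hypothetical corpus and embedding vocabulary.
tokens = "a man is cutting a potato and a woman is peeling a carrot".split()
embedding_vocab = {"man", "is", "cutting", "potato", "woman", "peeling", "carrot"}

type_rate, token_rate = ooev_rates(tokens, embedding_vocab)
print(round(type_rate, 3), round(token_rate, 3))
```

Because stopwords are few in type but frequent in use, missing them affects a much larger share of tokens (5 of 13 here) than of vocabulary entries (2 of 9).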
The worst performance was obtained on the PIT dataset with both the baseline (in which each sentence is represented with term-frequency vectors) and the composition models. Aside from the PIT dataset's comparatively higher percentage of OOEV words (see Table 3), its diverse, short and noisy user-generated text (Strauss et al., 2016), typical of social media, makes it a very challenging dataset.
To better understand the reason behind the performance drop of the model, we extracted the 100 most difficult sentence pairs in each dataset by ranking all of the pairs in the test set according to the absolute difference between the gold standard and predicted similarity scores.
We observed that around 60% of the difficult sentence pairs share many similar words (except for a word or two) or contain OOEV words that led to a complete change in meaning. Meanwhile, the rest are sentence pairs which are topically similar but differ completely in meaning.
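The selection of the most difficult pairs can be sketched as below. The pair identifiers and scores are placeholders, and the function name is hypothetical; only the ranking criterion (absolute difference between gold and predicted scores) comes from the procedure described above.

```python
def hardest_pairs(pairs, gold, predicted, k=100):
    """Rank sentence pairs by absolute prediction error and keep the top k."""
    errors = [abs(g - p) for g, p in zip(gold, predicted)]
    order = sorted(range(len(pairs)), key=lambda idx: errors[idx], reverse=True)
    return [(pairs[idx], errors[idx]) for idx in order[:k]]

# Placeholder test set of four sentence pairs.
pairs = [("s1a", "s1b"), ("s2a", "s2b"), ("s3a", "s3b"), ("s4a", "s4b")]
gold = [1.0, 4.5, 2.0, 5.0]
predicted = [4.0, 4.0, 2.5, 1.5]

top2 = hardest_pairs(pairs, gold, predicted, k=2)
print(top2)
```

Ranking by absolute error treats over-prediction and under-prediction alike, so the extracted set contains both dissimilar pairs scored as similar and similar pairs scored as dissimilar.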
In Table 4, we show examples from each dataset and their corresponding scores (i.e., Pearson correlation) from the gold standard and the composition models. The two sentences come from an actual pair in the dataset.
Example 1 (from the SICK dataset) shows a pair of sentences which, although they can be interpreted as coming from the same domain (food preparation), are semantically different in their verb, subject, and direct object, for which, presumably, they were labelled in the gold standard as highly dissimilar. However, both of the word-based models predicted them to be highly similar (in varying degrees). This limitation can be attributed to the relatedness of their words (e.g., person vs woman, cutting vs scrubbing). Under the distributional hypothesis assumption (Harris, 1940; Firth, 1957), two words will have high similarity if they occur in similar contexts, even if they have neither the same nor similar meanings. Since word embeddings are typically generated based on this assumption, the relatedness aspect is captured more than genuine similarity. Furthermore, the higher similarity obtained by the LSTM model over Vector Sum can be attributed to its ability to capture syntactic structure in sequences such as sentences.
Examples 2 and 3 (from STS dataset) show sentence pairs which were labelled as completely dissimilar but were predicted with high similarity in both models. This shows the inability of the models to put more weight on semantically rich words which change the overall meaning of a sentence when compared with another.
Example 4 (from the PIT dataset) shows a sentence pair which was labelled as completely dissimilar, presumably because it lacks sufficient context for meaningful interpretation. However, the sentences were predicted to be similar to some degree, possibly because some words are common to both sentences and some are likely related by virtue of co-occurrence in the same context (e.g., England, Europe). See Appendix B for more examples.

Future Work
This work is intended to serve as an initial study on end-to-end sentence modelling to identify the limitations associated with it. The models and representations compared, while typical of current sentence modelling methods, are not an exhaustive set and some variations exist. A natural extension to this study is to explore other input granularity representations and composition models presented in the framework. For example, in this study we did not go beyond word representations; however, multi-word expressions are common occurrences in the English language. This could be addressed by modelling sentence constituents using recursive tree structures (Tai et al., 2015) or by learning phrase representations (Wieting et al., 2015).
The limitations of the current word embeddings revealed in this paper have been studied in the context of word similarity tasks (Levy and Goldberg, 2014; Hill et al., 2016) but, to our knowledge, have not been investigated explicitly in the context of sentence similarity tasks. For example, Kiela et al. (2015) have shown that specialising semantic spaces to downstream tasks and applications requiring similarity or relatedness can improve performance. Furthermore, some studies (Faruqui et al., 2014; Yu and Dredze, 2014; Ono et al., 2015; Ettinger et al., 2016) have proposed to learn word embeddings by going beyond the distributional hypothesis assumption, either through a retrofitting or a joint-learning process, with some using semantic resources such as ontologies and entity relation databases. Thus, we will explore this direction, as it will be particularly important in semantic processing since entities encode much of the semantic information in a language.
Furthermore, the inability of the state-of-the-art model to encode semantically rich words (e.g., socket, bug in Example 2) with higher weights relative to other words supports the assertion of Blacoe and Lapata (2012) that distributive semantic representation and composition must be mutually learned. Wieting et al. (2015) have shown that this kind of weighting for semantic importance can be learned automatically when training on a paraphrase database. Recent models (Hashimoto et al., 2016) propose end-to-end joint modelling at different linguistic levels of a sentence (i.e., morphology, syntax, semantics) on a hierarchy of tasks (i.e., POS tagging, dependency parsing, semantic role labelling), which are often done separately, with the assumption that higher-level tasks benefit from lower-level ones.