The Lifted Matrix-Space Model for Semantic Composition

Abstract

Tree-structured neural network architectures for sentence encoding draw inspiration from the approach to semantic composition generally seen in formal linguistics, and have shown empirical improvements over comparable sequence models by doing so. Moreover, adding multiplicative interaction terms to the composition functions in these models can yield significant further improvements. However, existing compositional approaches that adopt such a powerful composition function scale poorly, with parameter counts exploding as model dimension or vocabulary size grows. We introduce the Lifted Matrix-Space model, which uses a global transformation to map vector word embeddings to matrices, which can then be composed via an operation based on matrix-matrix multiplication. Its composition function effectively transmits a larger number of activations across layers with relatively few model parameters. We evaluate our model on the Stanford NLI corpus, the Multi-Genre NLI corpus, and the Stanford Sentiment Treebank and find that it consistently outperforms TreeLSTM (Tai et al., 2015), the previous best known composition function for tree-structured models.

Introduction
Contemporary theoretical accounts of natural language syntax and semantics consistently hold that sentences are tree-structured, and that the meaning of each node in each tree is calculated from the meaning of its child nodes using a relatively simple semantic composition process which is applied recursively bottom-up (Chierchia and McConnell-Ginet, 1990; Dowty, 2007). In tree-structured recursive neural networks (TreeRNN; Socher et al., 2010), a similar procedure is used to build representations for sentences for use in natural language understanding tasks, with distributed representations for words repeatedly fed through a neural network composition function according to a binary tree structure supplied by a parser. The success of a tree-structured model largely depends on the design of its composition function.
It has been repeatedly shown that a composition function that captures multiplicative interactions between the two items being composed yields better results (Rudolph and Giesbrecht, 2010; Socher et al., 2012) than do otherwise-equivalent functions based on simple linear interactions. This paper presents a novel model that advances this line of research, the Lifted Matrix-Space model. We utilize a tensor-parameterized LIFT layer that learns to produce matrix representations of words that are dependent on the content of pre-trained word embedding vectors. Composition of two matrix representations is carried out by a composition layer, into which the two matrices are sequentially fed. Figure 1 illustrates the model design. Our model was inspired by Continuation Semantics (Barker and Shan, 2014; Charlow, 2014), in which the symbolic representation of each word is converted to a higher-order function. There is a consensus in linguistic semantics that a subset of natural language expressions correspond to higher-order functions. Inspired by work in programming language theory, Continuation Semantics goes a step further and claims that all expressions must be converted into higher-order functions before they participate in semantic composition. The theory bridges a gap between linguistic semantics and programming language theory, and reinterprets various linguistic phenomena from the viewpoint of computation. While we do not directly implement Continuation Semantics, we follow its rough contours: we convert low-level representations (vectors) to higher-order functions (matrices), and composition only takes place between the higher-order functions.
A number of models have been developed to capture the multiplicative interactions between distributed representations. While it shares this objective, the proposed model requires fewer parameters than its predecessors because it does not learn a separate matrix representation for each word, and the number of parameters in its composition function is not proportional to the cube of the hidden state dimension. Because of this, it can be trained with larger vocabularies and more hidden state activations than was possible with its predecessors. We evaluate our model primarily on the task of natural language inference (NLI; MacCartney, 2009). The task consists of determining the inferential relation between a given pair of sentences. It is a principled and widely-used evaluation task for natural language understanding, and knowing the inferential relations of an expression is closely related to understanding its meaning (Chierchia and McConnell-Ginet, 1990). While other tasks such as question answering or machine translation require a model to learn task-specific behavior that goes beyond understanding sentence meaning, NLI results highlight sentence understanding performance in isolation. We also include an evaluation on sentiment classification for comparison with some earlier work.
We find that our model outperforms existing approaches to tree-structured modeling on all three tasks, though it does not set the state of the art on any of them, falling behind other, more complex model types. We nonetheless expect that this method will be a valuable ingredient in future models for sentence understanding and a valuable platform for research on compositionality in learned representations.

Related work
Composition functions for tree-structured models have been thoroughly studied in recent years (Mitchell and Lapata, 2010; Baroni and Zamparelli, 2010; Zanzotto et al., 2010; Socher et al., 2011). While this line of research has been successful, the majority of the existing models ultimately rely on the additive linear combination of vectors. The Tree-structured recursive neural networks (TreeRNN) of Socher et al. (2010) compose two child node vectors h_l and h_r using this method:

(1) h = tanh(W[h_l; h_r] + b)

where W ∈ R^{d×2d} and b ∈ R^d. Throughout this paper, d stands for the number of activations of a given model. However, there is no reason to believe that the additive linear combination of vectors is adequate for modeling semantic composition. Formal work in linguistic semantics has shown that many linguistic expressions are well-represented as functions. Accordingly, composing two meanings typically requires feeding an argument into a function (function application; Heim and Kratzer, 1998). Such an operation involves a complex interaction between the two meanings, but the classic TreeRNN does not supply any additional means to capture this interaction. Rudolph and Giesbrecht (2010) report that matrix multiplication, as opposed to element-wise addition, is more suitable for semantic composition. Their Compositional Matrix-Space model (CMS) represents words and phrases as matrices, which are composed via a simple matrix multiplication:

(2) P = AB

where A, B, P ∈ R^{d×d} are matrix representations of the words. They provide a formal proof that element-wise addition/multiplication of vectors can be subsumed under matrix multiplication. Moreover, they claim that the order-sensitivity of matrix multiplication is adequate for capturing semantic composition because natural language is order-sensitive.
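As an illustration, the two composition modes can be sketched in NumPy (the dimensions and random initialization here are illustrative assumptions, not trained parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# TreeRNN composition (eq. 1): additive-linear combination of children.
W = rng.standard_normal((d, 2 * d)) * 0.1
b = np.zeros(d)
h_l, h_r = rng.standard_normal(d), rng.standard_normal(d)
h = np.tanh(W @ np.concatenate([h_l, h_r]) + b)

# CMS composition (eq. 2): plain matrix multiplication of word matrices.
A, B, C = (rng.standard_normal((d, d)) for _ in range(3))
P = A @ B

# Matrix multiplication is order-sensitive (AB != BA in general)...
assert not np.allclose(A @ B, B @ A)
# ...but it is also associative, which causes the problem discussed next.
assert np.allclose((A @ B) @ C, A @ (B @ C))
```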
However, as Socher et al. (2012) note, CMS loses syntactic information during composition due to the associative character of matrix multiplication. For instance, the following two tree structures in (3) are syntactically distinct, but CMS would produce the same result for both structures because its mode of composition is associative.
(3) a. [[A B] C]
    b. [A [B C]]

CMS cannot distinguish the meanings of the two tree structures and invariably produces ABC (a sequence of matrix multiplications) for both. The information on syntactic constituency is therefore lost. This makes the model less desirable for handling semantic composition of natural language expressions for two reasons: First, the principle of compositionality is violated. Much of the success of tree-structured models can be credited to the shared hypothesis that the meaning of every tree node is derived from the meanings of its child nodes. Abandoning this principle of compositionality gives up that advantage. Second, it cannot handle structural ambiguities exemplified in (4).
(4) John saw a man with binoculars.

The sentence has two interpretations that can be disambiguated with the following paraphrases: (i) John saw a man via binoculars, and (ii) John saw a man who has binoculars. The common syntactic analysis of the ambiguity is that the prepositional phrase with binoculars can attach at two different locations. If it attaches to the verb phrase saw a man, the first interpretation arises. If it instead attaches to the noun man, the second interpretation results. However, if the structural information is lost, we have no way to disambiguate the two readings. Socher et al.'s (2012) Matrix-Vector RNN (MV-RNN) is another attempt to capture the multiplicative interactions between two vectors while conforming to the principle of compositionality. They hypothesize that representing operators as matrices can better reflect operator semantics. For each lexical item, a matrix (a trained parameter) is assigned in addition to the pre-trained word embedding vector. The model aims to assign the right matrix representations to operators while assigning an identity matrix to words with no operator meaning. One step of semantic composition is defined as follows:

(5) p = g(W[Ba; Ab]), P = W_M[A; B]

where a and b are the child word vectors, A and B their associated matrices, and g a non-linearity. MV-RNN is computationally costly, as it needs to learn an additional d × d matrix for each lexical item. It is empirically known that the size of the vocabulary grows with the size of the corpus (Heaps' law; Herdan, 1960), so the number of model parameters increases as the corpus gets bigger. This makes the model less than ideal for handling a large corpus: having a huge number of parameters causes problems both for memory usage and for learning efficiency. Chen et al. (2013) and Socher et al. (2013) present the recursive neural tensor network (RNTN), which reduces the computational complexity of MV-RNN while capturing the multiplicative interactions between child vectors.
The model introduces a third-order tensor V that interacts with the child node vectors as follows:

(6) h = tanh(h_l^T V h_r + W[h_l; h_r])

where V ∈ R^{d×d×d}, so that each output unit is computed from its own bilinear form over h_l and h_r. RNTN improves on MV-RNN in that its parameter count is not proportional to the size of the corpus. However, the addition of the third-order tensor V of dimension d × d × d still requires a number of parameters that grows with the cube of d.
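A minimal sketch of one composition step for each of the two tensor-based predecessors (dimensions and initialization are illustrative; the MV-RNN step follows Socher et al. (2012), and the RNTN step uses the d × d × d tensor form described above):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
a, b = rng.standard_normal(d), rng.standard_normal(d)            # word vectors
A, B = rng.standard_normal((d, d)), rng.standard_normal((d, d))  # word matrices

# MV-RNN step: each word's matrix transforms the other word's vector
# before a standard TreeRNN-style combination.
W = rng.standard_normal((d, 2 * d)) * 0.1
W_M = rng.standard_normal((d, 2 * d)) * 0.1
p = np.tanh(W @ np.concatenate([B @ a, A @ b]))  # parent vector
P = W_M @ np.vstack([A, B])                      # parent matrix

# RNTN step: one bilinear form per output unit via a d x d x d tensor V,
# plus the usual linear term.
V = rng.standard_normal((d, d, d)) * 0.1
bilinear = np.einsum('i,kij,j->k', a, V, b)      # a^T V[k] b for each unit k
h = np.tanh(bilinear + W @ np.concatenate([a, b]))

# MV-RNN needs a d x d matrix per vocabulary item; RNTN's tensor alone
# costs d**3 parameters.
assert V.size == d ** 3
```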
The last composition function relevant to this paper is the Tree-structured long short-term memory network (TreeLSTM; Tai et al., 2015; Zhu et al., 2015; Le and Zuidema, 2015), particularly the version that operates over a constituency tree. It is an extension of TreeRNN that adapts long short-term memory (LSTM; Hochreiter and Schmidhuber, 1997) networks. It shares the advantage of LSTM networks in that it prevents the vanishing gradient problem (Hochreiter et al., 2001). Unlike TreeRNN, the output hidden state h of TreeLSTM is not directly calculated from the hidden states of its child nodes, h_l and h_r. Rather, each node in TreeLSTM maintains a cell state c that keeps track of important information from its child nodes. The output hidden state h is drawn from the cell state c by passing it through an output gate o.
The cell state is calculated in three steps: (i) Compute a new candidate g from h_l and h_r. TreeLSTM selects which values to take from the new candidate g by passing it through an input gate i. (ii) Choose which values to forget from the cell states of the child nodes, c_l and c_r. For each child node, an element-wise product (⊙) between its cell state and the corresponding forget gate (f_l or f_r, depending on the child node) is calculated. (iii) Lastly, sum up the results of (i) and (ii).
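The three steps above can be sketched as a single TreeLSTM node update (a simplified sketch: one weight matrix per gate over the concatenated child states, rather than Tai et al.'s full parameterization, and biases omitted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
d = 4
# One weight matrix per gate, each acting on the concatenated child states.
Wg, Wi, Wfl, Wfr, Wo = (rng.standard_normal((d, 2 * d)) * 0.1 for _ in range(5))

def treelstm_node(h_l, c_l, h_r, c_r):
    x = np.concatenate([h_l, h_r])
    g = np.tanh(Wg @ x)                            # (i) candidate from children
    i = sigmoid(Wi @ x)                            #     input gate
    f_l, f_r = sigmoid(Wfl @ x), sigmoid(Wfr @ x)  # (ii) per-child forget gates
    c = i * g + f_l * c_l + f_r * c_r              # (iii) new cell state
    h = sigmoid(Wo @ x) * np.tanh(c)               # output gate draws h from c
    return h, c

h, c = treelstm_node(*[rng.standard_normal(d) for _ in range(4)])
```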
TreeLSTM achieves the best performance among tree-structured models on various tasks, including natural language inference and sentiment classification. However, there exist non-tree-structured models that outperform TreeLSTM. Our goal is to design a stronger composition function that improves the performance of tree-structured models. We develop a composition function that captures the multiplicative interaction between distributed representations. At the same time, we improve on the predecessors in terms of scalability, making the model more suitable for larger datasets.
To recapitulate, TreeRNN and TreeLSTM reflect the principle of compositionality but cannot capture the multiplicative interaction between two expressions. In contrast, CMS incorporates multiplicative interaction but violates the principle of compositionality. MV-RNN is compositional and also captures multiplicative interaction, but it requires a learned d × d matrix for each vocabulary item. RNTN is likewise compositional and multiplicative, and while it requires fewer parameters than MV-RNN, it still requires significantly more than TreeRNN or TreeLSTM. Table 1 gives an overview of the discussed models.

Base model
We present the Lifted Matrix-Space model (LMS), which renders semantic composition in a novel way. Our model consists of three subparts: the LIFT layer, the composition layer, and the TreeLSTM wrapper. The LIFT layer takes a word embedding vector and outputs a corresponding √d × √d matrix (eq. 13).
The resulting H matrix serves as an input for the composition layer.
Given the matrix representations of two child nodes, H_l and H_r, the composition layer first takes H_l and returns a hidden state H_inner ∈ R^{√d×√d} (eq. 14). Since H_inner is also a matrix, it can function as the weight matrix for H_r. The composition layer multiplies H_inner with H_r, adds a bias, and feeds the result into a non-linear activation function (eq. 15). This yields H_cand ∈ R^{√d×√d}, which for the base model is the output of semantic composition.
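A schematic NumPy rendering of the LIFT and composition layers follows. The precise parameterization (how the LIFT weights are shaped, where biases enter) is an illustrative assumption consistent with the description above, not the paper's exact equations:

```python
import numpy as np

rng = np.random.default_rng(5)
D, d = 6, 16                    # embedding size and activation count
s = int(np.sqrt(d))             # matrices are sqrt(d) x sqrt(d)

# LIFT layer (cf. eq. 13): map a D-dim embedding to an s x s matrix.
W_lift = rng.standard_normal((s * s, D)) * 0.1

def lift(x):
    return np.tanh(W_lift @ x).reshape(s, s)

W_inner = rng.standard_normal((s, s)) * 0.1
b_inner, b_cand = np.zeros((s, s)), np.zeros((s, s))

def compose(H_l, H_r):
    H_inner = np.tanh(W_inner @ H_l + b_inner)   # cf. eq. 14
    return np.tanh(H_inner @ H_r + b_cand)       # cf. eq. 15: matrix product

H_l, H_r = lift(rng.standard_normal(D)), lift(rng.standard_normal(D))
H_cand = compose(H_l, H_r)

# Unlike CMS, the interleaved nonlinearities make composition non-associative,
# so distinct bracketings yield distinct results.
H_a, H_b, H_c = (lift(rng.standard_normal(D)) for _ in range(3))
assert not np.allclose(compose(compose(H_a, H_b), H_c),
                       compose(H_a, compose(H_b, H_c)))
```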
As in CMS, the primary mode of semantic composition is matrix multiplication. However, LMS improves on CMS in that it avoids associativity. LMS differs from MV-RNN in that it does not learn a d × d matrix for each vocabulary item. Compared to RNTN, LMS transmits a larger number of activations across layers, given the same parameter count. In both models, the size of the third-order tensor is the dominant factor in determining the number of model parameters. The parameter count of LMS is approximately proportional to the number of activations (d), but the parameter count of RNTN is approximately proportional to the cube of the number of activations (d^3). Therefore, LMS can transmit the same number of activations with fewer model parameters.
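A back-of-the-envelope comparison of the composition-machinery parameter counts. The specific vocabulary size V, embedding size D, and activation count d below are illustrative assumptions, and biases and small constants are ignored:

```python
# Rough parameter counts for the composition machinery alone;
# V = vocabulary size, D = embedding size, d = activation count.
def params(model, V=30000, D=300, d=1024):
    if model == "TreeRNN":
        return d * 2 * d                # W in R^{d x 2d}
    if model == "MV-RNN":
        return V * D * D               # one D x D matrix per word
    if model == "RNTN":
        return d * d * d               # d x d x d tensor dominates
    if model == "LMS":
        return d * D + d               # LIFT tensor + composition weights

for m in ("TreeRNN", "LMS", "MV-RNN", "RNTN"):
    print(m, params(m))

# LMS stays linear in d; RNTN is cubic; MV-RNN scales with the vocabulary.
assert params("LMS") < params("TreeRNN") < params("RNTN") < params("MV-RNN")
```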

LMS augmented with LSTM components
We augment the base model with LSTM components (LMS-LSTM) to circumvent the problem of long-term dependencies. As in the case of TreeLSTM, we additionally manage cell states (c_l, c_r). Since the LSTM components operate on vectors, we reshape H_cand, H_l, and H_r into d × 1 column vectors, producing g, h_l, and h_r respectively. The output of the LSTM components is calculated from these vectors, and is reshaped back into a √d × √d matrix (eq. 22).
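The reshape round-trip can be sketched as follows (the gating itself is abbreviated to a stand-in update, since the full LSTM equations are as in TreeLSTM):

```python
import numpy as np

rng = np.random.default_rng(6)
d = 16
s = int(np.sqrt(d))

H_cand, H_l, H_r = (rng.standard_normal((s, s)) for _ in range(3))

# The LSTM components operate on vectors, so each s x s matrix is flattened
# into a d x 1 column before gating and reshaped back afterwards (cf. eq. 22).
g, h_l, h_r = (M.reshape(d, 1) for M in (H_cand, H_l, H_r))
c = np.tanh(g)                  # stand-in for the full LSTM cell update
H = c.reshape(s, s)             # back to a sqrt(d) x sqrt(d) matrix

assert H.shape == (s, s)
```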

Simplified variants
We implement two LMS-LSTM variants with a simpler composition function as an ablation study. The first variant replaces the equations in (14) and (15) with a single, simpler composition step (eq. 23). The second variant is more complex than the first in that a weight matrix W_COMB ∈ R^{√d×√d} is added to the equation (eq. 24). But unlike the full LMS-LSTM, which has two tanh layers, it utilizes only one.

Implementation details
As our interest is in the performance of composition functions, we compare LMS-LSTM with TreeLSTM, the previous best known composition function for tree-structured models. To allow for efficient batching, we use the SPINN-PI-NT approach (Bowman et al., 2016), which implements standard TreeLSTM using stack and buffer data structures borrowed from parsing, rather than tree structures. We implement our model by replacing SPINN-PI-NT's composition function with ours and adding the LIFT layer.
We use the 300D reference GloVe vectors (840B token version; Pennington et al., 2014) for word representations. We fine-tune the word embeddings for improved results. We follow Bowman et al. (2016) and other prior work in our use of an MLP with product and difference features to classify pairs of sentences.
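One common instantiation of such product-and-difference features is sketched below; the exact feature set used is an assumption based on Bowman et al.'s description (concatenation, elementwise difference, and elementwise product of the two sentence encodings):

```python
import numpy as np

rng = np.random.default_rng(7)
d = 8
h_premise, h_hypothesis = rng.standard_normal(d), rng.standard_normal(d)

# Sentence-pair feature vector for the classifier.
features = np.concatenate([
    h_premise,
    h_hypothesis,
    h_premise - h_hypothesis,    # difference features
    h_premise * h_hypothesis,    # product features
])
```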
The feature vector is fed into an MLP that consists of two ReLU neural network layers and a softmax layer. In both models, the objective function is the sum of a cross-entropy loss and an L2 regularization term. Both models use the Adam optimizer (Kingma and Ba, 2014). Dropout (Srivastava et al., 2014) is applied to the classifier and to the word embeddings. The MLP layer also utilizes Layer Normalization (Ba et al., 2016).
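For concreteness, the objective for a single example can be sketched as follows (the regularization strength and the logit values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(9)

# Objective for one example: cross-entropy plus an L2 term over the
# model parameters.
def objective(logits, label, weights, lam=1e-5):
    z = logits - logits.max()                     # numerically stable log-softmax
    log_probs = z - np.log(np.exp(z).sum())
    cross_entropy = -log_probs[label]
    l2 = lam * sum((w ** 2).sum() for w in weights)
    return cross_entropy + l2

logits = np.array([2.0, 0.5, -1.0])   # 3-way NLI scores for one sentence pair
weights = [rng.standard_normal((4, 4))]

# The loss is lower when the correct class has the highest logit.
assert objective(logits, 0, weights) < objective(logits, 2, weights)
```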

Datasets
We first train and test our models on the Stanford Natural Language Inference corpus (SNLI; Bowman et al., 2015). The SNLI corpus contains 570,152 pairs of natural language sentences that are labeled for entailment, contradiction, and neutral. It consists of sentences that were written and validated by humans. Along with the MultiNLI corpus introduced below, it is two orders of magnitude larger than other human-authored resources for NLI tasks. The following example illustrates the general format of the corpus.
(26) PREMISE: A soccer game with multiple males playing.
     HYPOTHESIS: Some men are playing a sport.
     LABEL: Entailment
We also test our models on the Multi-Genre Natural Language Inference corpus (MultiNLI; Williams et al., 2017). The corpus consists of 433k pairs of examples, and each pair is labeled for entailment, contradiction, and neutral. MultiNLI has the same format as SNLI, so it is possible to train on both datasets at the same time (as we do when testing on MultiNLI). Two notable features distinguish MultiNLI from SNLI: (i) It is collected from ten distinct genres of spoken and written English, making the dataset more representative of human language use. (ii) The examples in MultiNLI are considerably longer than the ones in SNLI. These two features make MultiNLI classification considerably more difficult than SNLI. The pair of sentences in (27) is an illustrative example; the sentences are from the section of the corpus that is transcribed verbatim from telephone speech. The MultiNLI training set consists of five different genres of spoken and written English, the matched test set contains sentence pairs from only those five genres, and the mismatched test set contains sentence pairs from additional genres.
We also experiment on the Stanford Sentiment Treebank (SST; Socher et al., 2013), which is constructed by extracting movie review excerpts written in English from rottentomatoes.com and labeling them with Amazon Mechanical Turk. Each example in SST is paired with a parse tree, and each node of the tree is tagged with a fine-grained sentiment label (5 classes). Table 2 summarizes the results on SNLI and MultiNLI classification. We use the same preprocessing steps for all results we report, including loading the parse trees supplied with the datasets. Dropout rate, size of activations, number and size of MLP layers, and the L2 regularization term are tuned using repeated random search. MV-RNN and RNTN, introduced in the earlier sections, are extremely expensive in terms of computational resources, and training the models with comparable hyperparameter settings quickly runs out of memory on a high-end GPU. We do not include them in the comparison for this reason. TreeLSTM performs best with one MLP layer, while LMS-LSTM performs best with two MLP layers. The difference in parameter count is largely driven by this choice, and in principle one model does not demand notably more computational resources than the other.

Results and Analysis
On the SNLI test set, LMS-LSTM shows an additional 1.3% gain over TreeLSTM. Both of the simplified variants of LMS-LSTM also outperform TreeLSTM, but by a smaller margin. On the MultiNLI test sets, LMS-LSTM scores 1.3% higher on the matched test set and 1.9% higher on the mismatched test set. We cite the state-of-the-art results of non-tree-structured models, although these models are only relevant for our absolute performance numbers. The Shortcut-Stacked sentence encoder achieves the state-of-the-art result among non-attention-based models, outperforming LMS-LSTM. While this paper focuses on the design of the composition function, we expect that adding depth along the lines of Irsoy and Cardie (2014) and shortcut connections to LMS-LSTM would offer comparable results. Gong et al.'s (2018) attention-based Densely Interactive Inference Network (DIIN) displays the state-of-the-art performance among all models. Applying various attention mechanisms to tree-structured models is left for future research.
We inspect the performance of the models on certain subsets of the MultiNLI corpus that manifest linguistically difficult phenomena, as categorized by Williams et al. (2017). The phenomena include pronouns, quantifiers (e.g., every, each, some), modals (e.g., must, can), negation, wh-terms (e.g., who, where), belief verbs (e.g., believe, think), time terms (e.g., then, today), discourse markers (e.g., but, thus), presupposition triggers (e.g., again, too), and so on. In linguistic semantics, these phenomena are known to involve complex interactions that are more intricate than a simple merger of concepts. For instance, modals express possibilities or necessities that are beyond "here and now". One can say 'John might be home' to convey that there is a possibility that John is home. The utterance is perfectly compatible with a situation in which John is in fact at school, so modals like might let us reason about things that are potentially false in the real world. We use the same code as Williams et al. (2017) to extract these subsets.
In addition to the categories offered by Williams et al. (2017), we inspect whether sentences containing adjectives/adverbs affect the performance of the models. Baroni and Zamparelli (2010) show that adjectives are better represented as matrices, as opposed to vectors. We also inspect whether the presence of a determiner in the hypothesis that refers back to a salient referent in the premise affects model performance. Determiners are known to encode intricate properties in linguistic semantics and have been a major research topic (Elbourne, 2005; Charlow, 2014). Lastly, we examine whether the performance of the models varies with respect to sentence length, as longer sentences are harder to comprehend. Table 3 summarizes the results of the inspection on linguistically difficult phenomena. We see gains uniformly across the board, but with particularly clear gains on negation (+2% on the matched set/+1.7% on the mismatched set), quantifiers (+2.3%/+1.2%), time terms (+2.7%/+1.7%), tense matches (+2.3%/+1.3%), adjectives/adverbs (+2.3%/+1.3%), and longer sentences (length 15-19: +1.7%/+2.5%; length 20+: +5.4%/-0.8%). Table 4 summarizes the results on SST classification, particularly on the fine-grained task with 5 classes. While our implementation does not exactly reproduce Tai et al.'s (2015) TreeLSTM results, a comparison between our trained TreeLSTM and LMS-LSTM is consistent with the patterns seen in the NLI tasks.
We examine how well the constituent representations produced by LMS-LSTM and TreeLSTM encode syntactic category information. As mentioned earlier, there is a consensus in linguistic semantics that semantic composition involves function application (i.e., feeding an argument to a function), which goes beyond a simple merger of two concepts. Given that the syntactic category of a node determines whether the node serves as a function or an argument in semantic composition, we hypothesize that the distributed representation of each node would encode syntactic category information if the models learned how to do function application. To assess the quality of the representations, we first split the SNLI development set into training and test sets. From the training set, we extract the hidden state of every constituent (i.e., phrase) produced by the best performing models. For each of the models, we train linear classifiers on the following two tasks: (i) 3-way classification, which trains and tests exclusively on noun phrases, verb phrases, and prepositional phrases, and (ii) 19-way classification, which trains and tests on all 19 category labels attested in the SNLI development set. The distribution of the 19 category labels is provided in Table 5. We opt for a linear classifier to keep the classification process simple, so that we can properly assess the quality of the constituent representations. Table 6 summarizes the results on the syntactic category classification task. As a baseline, we train a bag-of-words (BOW) model that produces the hidden state of a given phrase by summing the GloVe embeddings of the words in the phrase, and we train and test on the hidden states produced by BOW as well. The hidden state representations produced by LMS-LSTM yield the best results on both the 3-way and 19-way classification tasks.
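The probing setup can be emulated end-to-end with a plain multinomial logistic regression; the data below are random stand-ins for the extracted hidden states, used only to show the shape of the procedure:

```python
import numpy as np

rng = np.random.default_rng(8)

# Toy stand-in for the probe: constituent hidden states paired with
# syntactic category labels, fit with a linear classifier.
n, d, n_classes = 300, 16, 3
X = rng.standard_normal((n, d))
W_true = rng.standard_normal((d, n_classes))
y = (X @ W_true).argmax(axis=1)          # synthetic, linearly separable labels

# Multinomial logistic regression by plain gradient descent.
W = np.zeros((d, n_classes))
for _ in range(500):
    z = X @ W
    p = np.exp(z - z.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    onehot = np.eye(n_classes)[y]
    W -= 0.1 * X.T @ (p - onehot) / n

acc = ((X @ W).argmax(axis=1) == y).mean()   # high on this separable toy data
```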
Comparing LMS-LSTM and TreeLSTM representations, we see a 5.1% gain on the 3-way classification and a 5.5% gain on the 19-way classification. Figure 2 depicts the corresponding confusion matrices for the 19-way classification task. We show the most frequent eight classes due to space limitations. We observe notable gains on adverbial phrases (ADVP; +24%), adjectival phrases (ADJP; +12%), verb phrases (VP; +9%), and clauses introduced by subordinate conjunction (SBAR; +9%). We also observe a considerable gain on non-terminal declarative clauses (S; +9%), although the absolute number is fairly low compared to other categories. While we do not have a full understanding of the drop in classification accuracy, we speculate that the ambiguity of infinitival clauses and gerund phrases is one of the culprits. As exemplified in (28) and (29) respectively, infinitival clauses and gerund phrases are labeled not only as a VP but also as an S in the SNLI dataset. Given that our experiment is set up so that each constituent is assigned exactly one category label, a good number of infinitival clauses and gerund phrases that are labeled as an S could have been classified as a VP, resulting in a drop in classification accuracy. On the other hand, VP constituents are less affected by the ambiguity because the majority of them are neither an infinitival clause nor a gerund phrase, as shown in (30). With the exclusion of non-terminal declarative clauses, the categories in which we see notable gains are known to play the role of a function in semantic composition. On the other hand, both models are effective at identifying noun phrases (NP), which are typically arguments of a function in semantic composition. We speculate that the results are indicative of LMS-LSTM's ability to identify functions and arguments, and this hints that the model is learning to do function application.

Conclusion
In this paper, we propose a novel model for semantic composition that utilizes matrix multiplication. Experimental results indicate that, while our model does not reach the state of the art on any of the three datasets under study, it does substantially outperform all known tree-structured models, and it lays a strong foundation for future work on tree-structured compositionality in artificial neural networks.