Learning Semantically and Additively Compositional Distributional Representations

This paper connects a vector-based composition model to a formal semantics, the Dependency-based Compositional Semantics (DCS). We show theoretical evidence that the vector compositions in our model conform to the logic of DCS. Experimentally, we show that vector-based composition brings a strong ability to calculate similar phrases as similar vectors, achieving near state-of-the-art performance on a wide range of phrase similarity tasks and relation classification; meanwhile, DCS can guide building vectors for structured queries that can be directly executed. We evaluate this utility on the sentence completion task and report a new state-of-the-art.


Introduction
A major goal of semantic processing is to map natural language utterances to representations that facilitate calculation of meanings, execution of commands, and/or inference of knowledge. Formal semantics supports such representations by defining words as functional units and combining them via a specific logic. A simple and illustrative example is the Dependency-based Compositional Semantics (DCS) (Liang et al., 2013). DCS composes meanings from denotations of words (i.e. sets of things to which the words apply). For example, the denotations of the concept drug and the event ban are shown in Figure 1b, where drug is a list of drug names and ban is a list of the subject-complement pairs in any ban event. A list of banned drugs can then be constructed by first taking the COMP column of all records in ban (projection "$\pi_{\mathrm{COMP}}$"), and then intersecting the results with drug (intersection "∩"). This procedure defines how words can be combined to form a meaning.
Better yet, the procedure can be concisely illustrated by the DCS tree of "banned drugs" (Figure 1a), which is similar to a dependency tree but possesses precise procedural and logical meaning (Section 2). DCS has been shown useful in question answering (Liang et al., 2013) and textual entailment recognition (Tian et al., 2014).
Orthogonal to the formal semantics of DCS, distributional vector representations are useful in capturing lexical semantics of words (Turney and Pantel, 2010; Levy et al., 2015), and progress has been made in combining word vectors to form meanings of phrases and sentences (Mitchell and Lapata, 2010; Baroni and Zamparelli, 2010; Grefenstette and Sadrzadeh, 2011; Socher et al., 2012; Paperno et al., 2014; Hashimoto et al., 2014). However, less effort has been devoted to finding a link between vector-based compositions and the composition operations in any formal semantics. We believe that if a link can be found, then symbolic formulas in the formal semantics will be realized by vectors composed from word embeddings, such that similar things are realized by similar vectors; meanwhile, vectors will acquire formal meanings that can directly be used in execution or inference. Still, finding a link is challenging because any vector compositions that realize such a link must conform to the logic of the formal semantics.
In this paper, we establish a link between DCS and certain vector compositions, achieving a vector-based DCS by replacing denotations of words with word vectors, and realizing the composition operations such as intersection and projection as addition and linear mapping, respectively. For example, to construct a vector for "banned drugs", one takes the word vector $v_{\mathrm{ban}}$ and multiplies it by a matrix $M_{\mathrm{COMP}}$, corresponding to the projection $\pi_{\mathrm{COMP}}$; then, one adds the result to the word vector $v_{\mathrm{drug}}$ to realize the intersection operation (Figure 1c). We provide a method to train the word vectors and linear mappings (i.e. matrices) jointly from unlabeled corpora.

Figure 1: (a) The DCS tree of "banned drugs", which controls (b) the calculation of its denotation. In this paper, we learn word vectors and matrices such that (c) the same calculation is realized in distributional semantics. The constructed query vector can be used to (d) retrieve a list of coarse-grained candidate answers to that query.
The rationale for our model is as follows. First, recent research has shown that additive composition of word vectors is an approximation to the situation where two words have overlapping context (Tian et al., 2015); therefore, it is suitable to implement an "and" or intersection operation (Section 3). We design our model such that the resulting distributional representations are expected to have additive compositionality. Second, when intersection is realized as addition, it is natural to implement projection as linear mapping, as suggested by the logical interactions between the two operations (Section 3). Experimentally, we show that vectors and matrices learned by our model exhibit favorable characteristics as compared with vectors trained by GloVe (Pennington et al., 2014) or those learned from syntactic dependencies (Section 5.1). Finally, additive composition brings our model a strong ability to calculate similar vectors for similar phrases, whereas syntactic-semantic roles (e.g. SUBJ, COMP) can be distinguished by different projection matrices (e.g. $M_{\mathrm{SUBJ}}$, $M_{\mathrm{COMP}}$). We achieve near state-of-the-art performance on a wide range of phrase similarity tasks (Section 5.2) and relation classification (Section 5.3).
Furthermore, we show that a vector as constructed above for "banned drugs" can be used as a query vector to retrieve a coarse-grained candidate list of banned drugs, by sorting its dot products with answer vectors that are also learned by our model (Figure 1d). This is due to the ability of our approach to provide a language model that can find likely words to fill in the blanks such as "___ is a banned drug" or "the drug is banned by . . . ". A highlight is that the calculation is done as if a query were "executed" by the DCS tree of "banned drugs". We quantitatively evaluate this utility on the sentence completion task (Zweig et al., 2012) and report a new state-of-the-art (Section 5.4).

Figure 2: DCS tree for a sentence

DCS Trees
DCS composes meanings from denotations, or sets of things to which words apply. A "thing" (i.e. element of a denotation) is represented by a tuple of features of the form Field=Value, with a fixed inventory of fields. For example, a denotation ban might be a set of tuples ban = {(SUBJ=Canada, COMP=Thalidomide), . . .}, in which each tuple records participants of a banning event (e.g. Canada banning Thalidomide).
Operations are applied to sets of things to generate new denotations, for modeling semantic composition. An example is the intersection of pet and fish giving the denotation of "pet fish". Another necessary operation is projection; by $\pi_N$ we mean a function mapping a tuple to its value of the field $N$. For example, $\pi_{\mathrm{COMP}}(\mathrm{ban})$ is the value set of the COMP fields in ban, which consists of banned objects (i.e. {Thalidomide, . . .}). In this paper, we assume a field ARG to be names of things representing themselves, hence for example $\pi_{\mathrm{ARG}}(\mathrm{drug})$ is the set of names of drugs.
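As an illustration only, the two operations can be sketched with Python sets. The representation of a tuple as a frozenset of (field, value) pairs, the helper names `thing` and `project`, and the toy data are all assumptions made for this sketch, not part of DCS itself.

```python
def thing(**fields):
    """A 'thing' is a tuple of Field=Value features; a frozenset makes it hashable."""
    return frozenset(fields.items())

def project(field, denotation):
    """pi_field: the set of values of `field` over all tuples that have that field."""
    return {dict(t)[field] for t in denotation if field in dict(t)}

# Toy denotations (illustrative data only).
ban = {thing(SUBJ="Canada", COMP="Thalidomide"),
       thing(SUBJ="IOC", COMP="Steroid")}
pet = {thing(ARG="Goldfish"), thing(ARG="Dog")}
fish = {thing(ARG="Goldfish"), thing(ARG="Shark")}

banned_objects = project("COMP", ban)  # value set of the COMP field in ban
pet_fish = pet & fish                  # set intersection models "pet fish"
```

Here `banned_objects` plays the role of $\pi_{\mathrm{COMP}}(\mathrm{ban})$, and `project("ARG", pet_fish)` gives the names of things that are both pets and fish.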
For a value set $V$, we also consider the inverse image $\pi_N^{-1}(V) := \{x \mid \pi_N(x) \in V\}$. For example,

$$D_1 := \pi_{\mathrm{SUBJ}}^{-1}(\pi_{\mathrm{ARG}}(\mathrm{man}))$$

consists of all tuples of the form (SUBJ=x, . . .), where x is a man's name (i.e. $x \in \pi_{\mathrm{ARG}}(\mathrm{man})$). Thus, $\mathrm{sell} \cap D_1$ denotes men's selling events (i.e. {(SUBJ=John, COMP=Aspirin), . . .} as in Figure 2). Similarly, the denotation of "banned drugs" as in Figure 1b is formally written as

$$D_2 := \mathrm{drug} \cap \pi_{\mathrm{ARG}}^{-1}(\pi_{\mathrm{COMP}}(\mathrm{ban})).$$

Hence the following denotation

$$D_3 := \mathrm{sell} \cap D_1 \cap \pi_{\mathrm{COMP}}^{-1}(\pi_{\mathrm{ARG}}(D_2))$$

consists of selling events such that the SUBJ is a man and the COMP is a banned drug.

Figure 3: DCS trees in this work

The calculation above can proceed in a recursive manner controlled by DCS trees. The DCS tree for the sentence "a man sells banned drugs" is shown in Figure 2. Formally, a DCS tree is defined as a rooted tree in which nodes are denotations of content words and edges are labeled by fields at each end. Assume a node $x$ has children $y_1, \ldots, y_n$, and the edges $(x, y_1), \ldots, (x, y_n)$ are labeled by $(P_1, L_1), \ldots, (P_n, L_n)$, respectively. Then, the denotation $[\![x]\!]$ of the subtree rooted at $x$ is recursively calculated as

$$[\![x]\!] := x \cap \bigcap_{i=1}^{n} \pi_{P_i}^{-1}(\pi_{L_i}([\![y_i]\!])). \quad (1)$$

As a result, the denotation of the DCS tree in Figure 2 is the denotation $D_3$ of "a man sells banned drugs" as calculated above.

DCS can be further extended to handle phenomena such as quantifiers or superlatives (Liang et al., 2013; Tian et al., 2014). In this paper, we focus on the basic version, but note that it is already expressive enough to at least partially capture the meanings of a large portion of phrases and sentences.

DCS trees can be learned from question-answer pairs and a given database of denotations (Liang et al., 2013), or they can be extracted from dependency trees if no database is specified, by taking advantage of the observation that DCS trees are similar to dependency trees (Tian et al., 2014). We use the latter approach, obtaining DCS trees by rule-based conversion from universal dependency (UD) trees (McDonald et al., 2013).
Therefore, nodes in a DCS tree are content words in a UD tree, which are in the form of lemma-POS pairs (Figure 3). The inventory of fields is designed to be ARG, SUBJ, COMP, and all prepositions. Unlike content words, which denote sets of things, prepositions act as relations, which we treat similarly to SUBJ and COMP. For example, a prepositional phrase attached to a verb (e.g. play on the grass) is treated as in Figure 3a. The presence of two field labels on each edge of a DCS tree makes it convenient for modeling semantics in several cases, such as a relative clause (Figure 3b).
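The recursive calculation of a DCS tree's denotation can be sketched as follows. The sketch takes the recursion to be $[\![x]\!] = x \cap \bigcap_i \pi_{P_i}^{-1}(\pi_{L_i}([\![y_i]\!]))$; the `Node` class, the helper names, and the toy database are assumptions for illustration. Because the inverse image is always intersected with the node's own denotation, it is computed only over that finite set.

```python
def thing(**fields):
    """A 'thing' as a hashable set of (field, value) pairs."""
    return frozenset(fields.items())

def project(field, denotation):
    """pi_field over a denotation."""
    return {dict(t)[field] for t in denotation if field in dict(t)}

def inverse_image(field, values, universe):
    """pi^{-1}_field(values), restricted to the finite set `universe`."""
    return {t for t in universe if dict(t).get(field) in values}

class Node:
    def __init__(self, denotation, children=()):
        self.denotation = denotation
        # Each child is (P, L, node): P labels the parent end of the edge, L the child end.
        self.children = list(children)

def denote(node):
    """Denotation of the subtree rooted at `node`, computed bottom-up."""
    result = set(node.denotation)
    for p, l, child in node.children:
        values = project(l, denote(child))
        result &= inverse_image(p, values, node.denotation)
    return result

# Toy database for "a man sells banned drugs" (illustrative data only).
man = {thing(ARG="John")}
drug = {thing(ARG="Aspirin"), thing(ARG="Thalidomide")}
ban = {thing(SUBJ="Canada", COMP="Thalidomide")}
sell = {thing(SUBJ="John", COMP="Aspirin"),
        thing(SUBJ="John", COMP="Thalidomide"),
        thing(SUBJ="Mary", COMP="Thalidomide")}

banned_drugs = Node(drug, [("ARG", "COMP", Node(ban))])
sentence = Node(sell, [("SUBJ", "ARG", Node(man)),
                       ("COMP", "ARG", banned_drugs)])
```

With this toy data, `denote(banned_drugs)` keeps only the drug that appears in the COMP column of ban, and `denote(sentence)` keeps only the selling event whose SUBJ is a man and whose COMP is a banned drug.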

Vector-based DCS
For any content word $w$, we use a query vector $v_w$ to model its denotation, and an answer vector $u_w$ to model a prototypical element in that denotation. Query vector $v$ and answer vector $u$ are learned such that $\exp(v \cdot u)$ is proportional to the probability of $u$ answering the query $v$. The learning source is a collection of DCS trees, based on the idea that the DCS tree of a declarative sentence usually has a non-empty denotation. For example, "kids play" means there exists some kid who plays. Consequently, some element in the play denotation belongs to $\pi_{\mathrm{SUBJ}}^{-1}(\pi_{\mathrm{ARG}}(\mathrm{kid}))$, and some element in the kid denotation belongs to $\pi_{\mathrm{ARG}}^{-1}(\pi_{\mathrm{SUBJ}}(\mathrm{play}))$. This is a signal to increase the dot product of $u_{\mathrm{play}}$ and the query vector of $\pi_{\mathrm{SUBJ}}^{-1}(\pi_{\mathrm{ARG}}(\mathrm{kid}))$, as well as the dot product of $u_{\mathrm{kid}}$ and the query vector of $\pi_{\mathrm{ARG}}^{-1}(\pi_{\mathrm{SUBJ}}(\mathrm{play}))$. When optimized on a large corpus, the "typical" elements of play and kid should be learned by $u_{\mathrm{play}}$ and $u_{\mathrm{kid}}$, respectively. In general, one has

Theorem 1 Assume the denotation of a DCS tree is not empty. Given any path from node $x$ to $y$, assume edges along the path are labeled by $(P, L), \ldots, (K, N)$. Then, an element in the denotation $y$ belongs to $\pi_N^{-1}(\pi_K(\ldots(\pi_L^{-1}(\pi_P(x)))\ldots))$.

Therefore, for any two nodes in a DCS tree, the path from one to another forms a training example, which signals increasing the dot product of the corresponding query and answer vectors.
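To make the notion of "every path is a training example" concrete, the following sketch enumerates all ordered node pairs of a small DCS tree together with the edge labels read along the connecting path. The edge-list format `(x, y, P, L)` and the function names are assumptions for illustration; the tree is the one for "kids play balls".

```python
from collections import defaultdict

def build_adjacency(edges):
    """edges: (x, y, P, L) where P labels the x end of the edge and L the y end."""
    adj = defaultdict(list)
    for x, y, p, l in edges:
        adj[x].append((y, (p, l)))  # walking x -> y reads the labels as (P, L)
        adj[y].append((x, (l, p)))  # walking y -> x reads them reversed
    return adj

def all_paths(edges):
    """Map each ordered pair (start, end) to the label pairs along the path."""
    adj = build_adjacency(edges)
    paths = {}

    def walk(start, node, parent, labels):
        for nxt, lab in adj[node]:
            if nxt == parent:        # a tree has no cycles, so skipping the parent suffices
                continue
            new_labels = labels + [lab]
            paths[(start, nxt)] = new_labels
            walk(start, nxt, node, new_labels)

    for start in list(adj):
        walk(start, start, None, [])
    return paths

# "kids play balls": play --(SUBJ, ARG)-- kid and play --(COMP, ARG)-- ball
edges = [("play", "kid", "SUBJ", "ARG"), ("play", "ball", "COMP", "ARG")]
paths = all_paths(edges)
```

For instance, `paths[("kid", "play")]` is `[("ARG", "SUBJ")]`, which corresponds to the query $\pi_{\mathrm{SUBJ}}^{-1}(\pi_{\mathrm{ARG}}(\mathrm{kid}))$ being answered by an element of play.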
It is noteworthy that the above formalization happens to be closely related to the skip-gram model (Mikolov et al., 2013b). The skip-gram learns a target vector $v_w$ and a context vector $u_w$ for each word $w$. It assumes the probability of a word $y$ co-occurring with a word $x$ in a context window is proportional to $\exp(v_x \cdot u_y)$. Hence, if $x$ and $y$ co-occur within a context window, then one gets a signal to increase $v_x \cdot u_y$. If the context window is taken as the same DCS tree, then the learning of skip-gram and vector-based DCS will be almost the same, except that the target vector $v_x$ becomes the query vector $v$, which is no longer assigned to the word $x$ but to the path from $x$ to $y$ in the DCS tree (e.g. the query vector for $\pi_{\mathrm{SUBJ}}^{-1}(\pi_{\mathrm{ARG}}(\mathrm{kid}))$ instead of $v_{\mathrm{kid}}$). Therefore, our model can also be regarded as extending skip-gram to take account of the changes of meanings caused by different syntactic-semantic roles.
Additive Composition Word vectors trained by skip-gram are known to be semantically additive, as exhibited in word analogy tasks. The effect of adding up two skip-gram vectors is further analyzed in Tian et al. (2015). Namely, the target vector $v_w$ can be regarded as encoding the distribution of context words surrounding $w$. If another word $x$ is given, $v_w$ can be decomposed into two parts: one encodes context words shared with $x$, and the other encodes context words not shared. When $v_w$ and $v_x$ are added up, the non-shared parts tend to cancel out, because they have nearly independent distributions. As a result, the shared part gets reinforced. An error bound is derived to estimate how close $\frac{1}{2}(v_w + v_x)$ gets to the distribution of the shared part. We can see that the same mechanism exists in vector-based DCS. In a DCS tree, two paths share a context word if they lead to the same node $y$; semantically, this means some element in the denotation $y$ belongs to both denotations of the two paths (e.g. given the sentence "kids play balls", $\pi_{\mathrm{SUBJ}}^{-1}(\pi_{\mathrm{ARG}}(\mathrm{kid}))$ and $\pi_{\mathrm{COMP}}^{-1}(\pi_{\mathrm{ARG}}(\mathrm{ball}))$ both contain a playing event whose SUBJ is a kid and COMP is a ball). Therefore, addition of query vectors of two paths approximates their intersection because the shared context $y$ gets reinforced.
Projection Generally, for any two denotations $X_1$, $X_2$ and any projection $\pi_N$, we have

$$\pi_N(X_1 \cap X_2) \subseteq \pi_N(X_1) \cap \pi_N(X_2). \quad (2)$$

And the "⊆" can often become "=", for example when $\pi_N$ is a one-to-one map or $X_1 = \pi_N^{-1}(V)$ for some value set $V$. Therefore, if intersection is realized by addition, it will be natural to realize projection by linear mapping because

$$(v_1 + v_2) M_N = v_1 M_N + v_2 M_N$$

holds for any vectors $v_1$, $v_2$ and any matrix $M_N$, which is parallel to (2). If $\pi_N$ is realized by a matrix $M_N$, then $\pi_N^{-1}$ should correspond to the inverse matrix $M_N^{-1}$, because $\pi_N(\pi_N^{-1}(V)) = V$ for any value set $V$. So we have realized all composition operations in DCS.
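Both sides of this parallel can be checked on a small example. The sketch below uses row vectors multiplied on the right by a matrix, as in the rest of the paper; the helper names and the toy numbers and tuples are assumptions for illustration.

```python
def matvec(v, M):
    """Row vector times matrix: (v M)_j = sum_i v_i * M[i][j]."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def add(a, b):
    """Component-wise vector addition."""
    return [x + y for x, y in zip(a, b)]

# Vector side: a linear map distributes over addition.
M = [[2, 0], [1, 3]]
v1, v2 = [1, 2], [4, -1]
lhs = matvec(add(v1, v2), M)
rhs = add(matvec(v1, M), matvec(v2, M))

# Set side: projecting an intersection gives a subset, which can be strict.
# Tuples are (SUBJ, COMP) pairs; the data is illustrative only.
def project_comp(X):
    return {comp for _, comp in X}

X1 = {("Canada", "Thalidomide")}
X2 = {("Japan", "Thalidomide")}
proj_of_intersection = project_comp(X1 & X2)
intersection_of_proj = project_comp(X1) & project_comp(X2)
```

In this example `lhs` equals `rhs`, while `proj_of_intersection` is empty and `intersection_of_proj` still contains Thalidomide, a case where the inclusion is strict because the projection is not one-to-one.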
Query vector of a DCS tree Now, we can define the query vector of a DCS tree as parallel to (1):

$$v_{[\![x]\!]} := v_x + \frac{1}{n} \sum_{i=1}^{n} v_{[\![y_i]\!]} M_{L_i} M_{P_i}^{-1}.$$
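A minimal sketch of this recursive composition follows. It assumes the query vector of a subtree is the root word vector plus the average of the children's query vectors, each mapped by $M_{L_i}$ and then $M_{P_i}^{-1}$; the averaging factor, the tree encoding, and the toy 2-dimensional vectors and matrices are assumptions for illustration, and the inverse matrices are stored separately, as the model learns them independently.

```python
def matvec(v, M):
    """Row vector times matrix."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def query_vector(node, vec, mat, inv):
    """node = (word, [(P, L, child_node), ...]); returns the composed query vector."""
    word, children = node
    v = list(vec[word])
    n = len(children)
    for p, l, child in children:
        child_q = query_vector(child, vec, mat, inv)
        mapped = matvec(matvec(child_q, mat[l]), inv[p])  # apply M_L, then M_P^{-1}
        v = [a + b / n for a, b in zip(v, mapped)]
    return v

# Toy parameters (illustrative values only).
identity = [[1, 0], [0, 1]]
swap = [[0, 1], [1, 0]]          # its own inverse
vec = {"drug": [1, 0], "ban": [0, 2]}
mat = {"ARG": identity, "COMP": swap}
inv = {"ARG": identity, "COMP": swap}

# "banned drugs": drug --(ARG, COMP)-- ban
banned_drugs = ("drug", [("ARG", "COMP", ("ban", []))])
q = query_vector(banned_drugs, vec, mat, inv)
```

Here `q` is $v_{\mathrm{drug}} + v_{\mathrm{ban}} M_{\mathrm{COMP}} M_{\mathrm{ARG}}^{-1}$ computed with the toy parameters.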

Training
As described in Section 3, vector-based DCS assigns a query vector $v_w$ and an answer vector $u_w$ to each content word $w$. For each field $N$, it assigns two matrices $M_N$ and $M_N^{-1}$. For any path from node $x$ to $y$ sampled from a DCS tree, assume the edges along it are labeled by $(P, L), \ldots, (K, N)$. Then, the dot product $v_x M_P M_L^{-1} \cdots M_K M_N^{-1} \cdot u_y$ gets a signal to increase.
Formally, we adopt the noise-contrastive estimation (Gutmann and Hyvärinen, 2012) as used in the skip-gram model, and mix the paths sampled from DCS trees with artificially generated noise. Then,

$$\sigma(v_x M_P M_L^{-1} \cdots M_K M_N^{-1} \cdot u_y)$$

models the probability of a training example coming from DCS trees, where $\sigma(\theta) = 1/\{1 + \exp(-\theta)\}$ is the sigmoid function. The vectors and matrices are trained by maximizing the log-likelihood of the mixed data. We use stochastic gradient descent (Bottou, 2012) for training. Some important settings are discussed below.
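The objective can be sketched as follows, assuming the score of an example is the dot product between the transformed query vector and the answer vector, and that the log-likelihood of the mixed data sums log σ(score) over paths from DCS trees and log(1 − σ(score)) over noise. The function names are hypothetical, and the scores are passed in directly rather than computed from vectors.

```python
import math

def sigmoid(theta):
    """sigma(theta) = 1 / (1 + exp(-theta))."""
    return 1.0 / (1.0 + math.exp(-theta))

def mixed_log_likelihood(true_scores, noise_scores):
    """Log-likelihood of labeling true paths as data and noise paths as noise."""
    ll = sum(math.log(sigmoid(s)) for s in true_scores)
    ll += sum(math.log(1.0 - sigmoid(s)) for s in noise_scores)
    return ll

# Raising the score of a true path and lowering that of a noise path
# both increase the objective.
baseline = mixed_log_likelihood([0.0], [0.0])
improved = mixed_log_likelihood([2.0], [-2.0])
```

Maximizing this quantity by gradient ascent pushes scores of sampled paths up and scores of noise down, which is the signal described in Section 3.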
Noise For any $v_x M_1 M_2^{-1} \cdots M_{2l-1} M_{2l}^{-1} \cdot u_y$ obtained from a path of a DCS tree, we generate noise by randomly choosing an index $i \in [2, 2l]$, and then replacing $M_j$ or $M_j^{-1}$ (for all $j \ge i$) and $u_y$ by $M_{N(j)}$ or $M_{N(j)}^{-1}$ and $u_z$, respectively, where $N(j)$ and $z$ are independently drawn from the marginal (i.e. unigram) distributions of fields and words.
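The noise generation step can be sketched as below. The sketch represents a path by its list of $2l$ field labels (odd positions use a matrix, even positions its inverse) and assumes the index $i$ is chosen uniformly, which the text does not specify; the function name and the toy unigram distributions are hypothetical.

```python
import random

def make_noise(labels, field_dist, word_dist, rng):
    """Keep positions 1..i-1 of `labels`; redraw positions i..2l and the answer word."""
    two_l = len(labels)
    i = rng.randint(2, two_l)              # inclusive on both ends (uniform choice assumed)
    fields, field_weights = zip(*field_dist.items())
    words, word_weights = zip(*word_dist.items())
    noisy = list(labels[:i - 1])           # the untouched "target" part
    noisy += rng.choices(fields, weights=field_weights, k=two_l - i + 1)
    z = rng.choices(words, weights=word_weights, k=1)[0]
    return i, noisy, z

# Toy unigram distributions (illustrative values only).
field_dist = {"ARG": 0.5, "SUBJ": 0.3, "COMP": 0.2}
word_dist = {"kid": 0.4, "play": 0.4, "ball": 0.2}

rng = random.Random(0)
labels = ["ARG", "SUBJ", "COMP", "ARG"]    # path from kid to ball via play
i, noisy_labels, z = make_noise(labels, field_dist, word_dist, rng)
```

The first label is always kept, since $i \ge 2$, so every noise example shares at least the first transformation with the path it was derived from.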
Update For each data point, when $i$ is the chosen index above for generating noise, we view indices $j < i$ as the "target" part, and $j \ge i$ as the "context", which is completely replaced by the noise, by analogy with the skip-gram model. Then, at each step we only update one vector and one matrix from each of the target, context, and noise parts; more specifically, the only vectors updated at the step are $v_x$, $u_y$ and $u_z$. This is much faster than always updating all matrices.
Initialization Matrices are initialized as $\frac{1}{2}(I + G)$, where $I$ is the identity matrix; and $G$ and all [...]

Learning Rate We find that the initial learning rate for vectors can be set to 0.1. But for matrices, it should be less than 0.0005; otherwise the model diverges. For stable training, we rescale gradients when their norms exceed a threshold.
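Rescaling a gradient whose norm exceeds a threshold can be sketched as follows. The function name and the threshold value used in the example are assumptions, since the text does not state the threshold.

```python
import math

def rescale_gradient(grad, threshold):
    """Scale `grad` down to norm `threshold` if its Euclidean norm is larger."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold:
        return list(grad)          # small gradients pass through unchanged
    scale = threshold / norm
    return [g * scale for g in grad]

clipped = rescale_gradient([3.0, 4.0], threshold=1.0)    # norm 5 is scaled to norm 1
unchanged = rescale_gradient([0.3, 0.4], threshold=1.0)  # norm 0.5 is kept
```

The direction of the update is preserved; only its length is bounded, which limits the damage of an occasional very large gradient on the matrices.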
Regularizer During training, $M_N$ and $M_N^{-1}$ are treated as independent matrices. However, we use a regularizer weighted by $\gamma$ to drive $M_N^{-1}$ close to the inverse of $M_N$, and a regularizer weighted by $\kappa$ to prevent $M_N$ from having too different scales at different directions (i.e., to drive $M_N$ close to orthogonal). We set $\gamma = 0.001$ and $\kappa = 0.0001$. Despite the rather weak regularizers, we find that $M_N^{-1}$ can be learned to be exactly the inverse of $M_N$, and $M_N$ can actually be an orthogonal matrix, showing some semantic regularity (Section 5.1).

Experiments
For training vector-based DCS, we use Wikipedia Extractor 2 to extract texts from the 2015-12-01 dump of English Wikipedia 3 . Then, we use Stanford Parser 4 (Klein and Manning, 2003) to parse all sentences and convert the UD trees into DCS trees by handwritten rules. We assign a weight to each path of the DCS trees as follows.
Table 2 (columns for learn): answer vectors with the highest dot products

$\pi_{\mathrm{ARG}}^{-1}(\pi_{\mathrm{SUBJ}}(\mathrm{learn}))$    $\pi_{\mathrm{ARG}}^{-1}(\pi_{\mathrm{COMP}}(\mathrm{learn}))$    $\pi_{\mathrm{about}}^{-1}(\pi_{\mathrm{ARG}}(\mathrm{learn}))$
teacher/N      skill/N          otherness/N
skill/N        lesson/N         intimacy/N
he/P           technique/N      femininity/N
she/P          experience/N     self-awareness/N
therapist/N    ability/N        life/N
student/N      something/N      self-expression/N
they/P         knowledge/N      sadomasochism/N
mother/N       language/N       emptiness/N
lesson/N       opportunity/N    criminality/N
father/N       instruction/N    masculinity/N

For any path $P$ passing through $k$ intermediate nodes of degrees $n_1, \ldots, n_k$, respectively, we set

$$\mathrm{Weight}(P) := \prod_{i=1}^{k} \frac{1}{n_i - 1}. \quad (5)$$

Note that $n_i \ge 2$ because there is a path $P$ passing through the node; and $\mathrm{Weight}(P) = 1$ if $P$ consists of a single edge. Equation (5) is intended to downweight long paths which pass through several high-valency nodes. We use a random walk algorithm to sample paths such that the expected number of times a path is sampled equals its weight. As a result, the sampled path lengths range from 1 to 19, with an average of 2.1 and an exponential tail. We convert all words which are sampled fewer than 1000 times to *UNKNOWN*/POS, and all prepositions occurring fewer than 10000 times to an *UNKNOWN* field. As a result, we obtain a vocabulary of 109k words and 211 field names.

Using the sampled paths, vectors and matrices are trained as in Section 4 (vecDCS). The vector dimension is set to $d = 250$. We compare with three baselines: (i) all matrices are fixed to identity ("no matrix"), in order to investigate the effects of meaning changes caused by syntactic-semantic roles and prepositions; (ii) the regularizer enforcing $M_N^{-1}$ to be the inverse matrix of $M_N$ is disabled by setting $\gamma = 0$ ("no inverse"), in order to investigate the effects of a semantically motivated constraint; and (iii) applying the same training scheme to UD trees directly, by modeling UD relations as matrices ("vecUD"). In this case, one edge is assigned one UD relation rel, so we implement the transformation from child to parent by $M_{\mathrm{rel}}$, and from parent to child by $M_{\mathrm{rel}}^{-1}$.

Table 3: Spearman's ρ on phrase similarity
The same hyper-parameters are used to train vecUD. By comparing vecDCS with vecUD we investigate if applying the semantics framework of DCS makes any difference. Additionally, we compare with the GloVe (6B, 300d) vectors 5 (Pennington et al., 2014). Norms of all word vectors are normalized to 1 and Frobenius norms of all matrices are normalized to $\sqrt{d}$.
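The path weighting described earlier in this section can be sketched as follows, assuming the weight of a path is the product of $1/(n_i - 1)$ over the degrees $n_i$ of its intermediate nodes, which agrees with the stated properties ($n_i \ge 2$, and weight 1 for a single edge). The function name is hypothetical; exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def path_weight(intermediate_degrees):
    """Product of 1 / (n_i - 1) over the degrees of the intermediate nodes."""
    weight = Fraction(1)
    for n in intermediate_degrees:
        if n < 2:
            raise ValueError("an intermediate node on a path has degree at least 2")
        weight *= Fraction(1, n - 1)
    return weight

single_edge = path_weight([])       # no intermediate nodes
chain = path_weight([2, 2])         # passing through non-branching nodes
branching = path_weight([3, 4])     # passing through high-valency nodes
```

A path through non-branching nodes keeps weight 1, whereas each high-valency node on the way divides the weight by the number of alternative continuations.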

Qualitative Analysis
We observe several special properties of the vectors and matrices trained by our model.
Words are clustered by POS In terms of cosine similarity, word vectors trained by vecDCS and vecUD are clustered by POS tags, probably due to their interactions with matrices during training. This is in contrast to the vectors trained by GloVe or "no matrix" (Table 1).
Matrices show semantic regularity Matrices learned for ARG, SUBJ and COMP are exactly orthogonal, and those for some of the most frequent prepositions 6 are remarkably close to orthogonal. For these matrices, the corresponding $M^{-1}$ also converge exactly to their inverses. This suggests regularities in the semantic space, especially because orthogonal matrices preserve cosine similarity: if $M_N$ is orthogonal, two words $x$, $y$ and their projections $\pi_N(x)$, $\pi_N(y)$ will have the same similarity measure, which is semantically reasonable. In contrast, matrices trained by vecUD are only orthogonal for three UD relations, namely conj, dep and appos.
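That orthogonal matrices preserve cosine similarity, while general matrices do not, can be checked numerically. The sketch below uses a 2-dimensional rotation as the orthogonal matrix and an axis scaling as a non-orthogonal one; all values are toy assumptions.

```python
import math

def matvec(v, M):
    """Row vector times matrix."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

theta = 0.7
rotation = [[math.cos(theta), math.sin(theta)],
            [-math.sin(theta), math.cos(theta)]]   # orthogonal
scaling = [[2.0, 0.0], [0.0, 0.5]]                 # different scales per direction

x, y = [1.0, 2.0], [3.0, -1.0]
before = cosine(x, y)
after_rotation = cosine(matvec(x, rotation), matvec(y, rotation))
after_scaling = cosine(matvec(x, scaling), matvec(y, scaling))
```

`after_rotation` agrees with `before` up to floating-point error, while `after_scaling` differs substantially, illustrating why a matrix with very different scales in different directions would distort similarities.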
Words transformed by matrices To illustrate the matrices trained by vecDCS, we start from the query vectors of two words, house and learn, applying different matrices to them, and show the 10 answer vectors with the highest dot products (Table 2). These are the lists of likely words which: take house as a subject, take house as a complement, fill in "___ in house", serve as a subject of learn, serve as a complement of learn, and fill in "learn about ___", respectively. As the table shows, matrices in vecDCS are appropriately learned to map word vectors to their syntactic-semantic roles.

5 http://nlp.stanford.edu/projects/glove/
6 of, in, to, for, with, on, as, at, from

Phrase Similarity
To test if vecDCS has the composition ability to calculate similar things as similar vectors, we conduct evaluation on a wide range of phrase similarity tasks. In these tasks, a system calculates similarity scores for pairs of phrases, and the performance is evaluated as its correlation with human annotators, measured by Spearman's ρ.
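For reference, Spearman's ρ is the Pearson correlation computed on ranks. The following self-contained sketch assigns average ranks to ties; the function names and the toy scores are assumptions, and it is not the evaluation script used in the experiments.

```python
def average_ranks(values):
    """1-based ranks of `values`, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks

def pearson(a, b):
    """Pearson correlation; assumes neither input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def spearman_rho(system_scores, human_scores):
    return pearson(average_ranks(system_scores), average_ranks(human_scores))

# Toy example: cosine similarities from a system vs. human ratings on a 1-7 scale.
rho = spearman_rho([0.1, 0.4, 0.9], [1, 3, 7])
```

Because only ranks matter, a system is rewarded for ordering phrase pairs the same way as the annotators, regardless of the scale of its similarity scores.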
Datasets Mitchell and Lapata (2010) create datasets 7 for pairs of three types of two-word phrases: adjective-nouns (AN) (e.g. "black hair" and "dark eye"), compound nouns (NN) (e.g. "tax charge" and "interest rate") and verb-objects (VO) (e.g. "fight war" and "win battle"). Each dataset consists of 108 pairs and each pair is annotated by 18 humans (i.e., 1,944 scores in total). Similarity scores are integers ranging from 1 to 7. Another dataset 8 is created by extending VO to Subject-Verb-Object (SVO), and then assessing similarities by crowdsourcing (Kartsaklis and Sadrzadeh, 2014). The dataset GS11 created by Grefenstette and Sadrzadeh (2011) (100 pairs, 25 annotators) is also of the form SVO, but in each pair only the verbs are different (e.g. "man provide/supply money"). The dataset GS12 described in Grefenstette (2013a) (194 pairs, 50 annotators) is of the form Adjective-Noun-Verb-Adjective-Noun (e.g. "local family run/move small hotel"), where only verbs are different in each pair.
Our method We calculate the cosine similarity of query vectors corresponding to phrases. For example, the query vector for "fight war" is calculated as $v_{\mathrm{war}} M_{\mathrm{ARG}} M_{\mathrm{COMP}}^{-1} + v_{\mathrm{fight}}$. For vecUD we use $M_{\mathrm{nsubj}}$ and $M_{\mathrm{dobj}}$ instead of $M_{\mathrm{SUBJ}}$ and $M_{\mathrm{COMP}}$, respectively. For GloVe we use additive compositions.

Results As shown in Table 3, vecDCS is competitive on AN, NN, VO, SVO and GS12, consistently outperforming "no inverse", vecUD and GloVe, showing strong compositionality. The weakness of "no inverse" suggests that relaxing the constraint of inverse matrices may hurt compositionality, though our preliminary examination on word similarities did not find any difference. The GS11 dataset appears to favor models that can learn from interactions between the subject and object arguments, such as the non-linear model Wadd nl in Hashimoto et al. (2014) and the entanglement model in Kartsaklis and Sadrzadeh (2014). However, these models do not show particular advantages on other datasets. The recursive autoencoder (RAE) proposed in Socher et al. (2011) shares an aspect with vecDCS in that it constructs meanings from parse trees. It is tested by Blacoe and Lapata (2012) [...] Nevertheless, we note that "no matrix" performs as well as vecDCS, suggesting that meaning changes caused by syntactic-semantic roles might not be major factors in these datasets, because the syntactic-semantic relations are all fixed in each dataset.

Table 5: F1 on relation classification
[...] (2015)        84.1
Xu et al. (2015)    85.6

Relation Classification
In a relation classification task, the relation between two words in a sentence needs to be classified; we expect vecDCS to perform better than "no matrix" on this task because vecDCS can distinguish the different syntactic-semantic roles of the two slots the two words fit in. We confirm this conjecture in this section.
Dataset We use the dataset of SemEval-2010 Task 8 (Hendrickx et al., 2009), in which 9 directed relations (e.g. Cause-Effect) and 1 undirected relation Other are annotated, with 8,000 instances for training and 2,717 for testing. Performance is measured by the 9-class direction-aware Macro-F1 score excluding the Other class.
Our method For any sentence with two words marked as $e_1$ and $e_2$, we construct the DCS tree of the sentence, and take the subtree $T$ rooted at the common ancestor of $e_1$ and $e_2$. We construct four vectors from $T$, namely: the query vector for the subtree rooted at $e_1$ (resp. $e_2$), and the query vector of the DCS tree obtained from $T$ by rerooting it at $e_1$ (resp. $e_2$) (Figure 4). The four vectors are normalized and concatenated to form the only feature used to train a classifier. For vecUD, we use the corresponding vectors calculated from UD trees. For GloVe, we use the word vector of $e_1$ (resp. $e_2$), and the sum of vectors of all words within the span $[e_1, e_2)$ (resp. $(e_1, e_2]$) as [...]

Table 6: Top-ranked answers for composed query vectors

"banned drugs"    "banned movies"    "banned books"
drug/N            bratz/N            publish/N
marijuana/N       porn/N             unfair/N
cannabis/N        indecent/N         obscene/N
trafficking/N     blockbuster/N      samizdat/N
thalidomide/N     movie/N            book/N
smoking/N         idiots/N           responsum/N
narcotic/N        blacklist/N        illegal/N
botox/N           grindhouse/N       reclaiming/N
doping/N          doraemon/N         redbook/N

Results VecDCS outperforms baselines on relation classification (Table 5). It makes 16 errors in misclassifying the direction of a relation, as compared to 144 such errors made by "no matrix", 23 by "no inverse", 30 by vecUD, and 161 by GloVe. This suggests that models with syntactic-semantic transformations (i.e. vecDCS, "no inverse", and vecUD) are indeed good at distinguishing the different roles played by $e_1$ and $e_2$. VecDCS scores moderately lower than the state-of-the-art (Xu et al., 2015); however, we note that these results are achieved by adding additional features and training task-specific neural networks (dos Santos et al., 2015; Xu et al., 2015). Our method only uses features constructed from unlabeled corpora. From this point of view, it is comparable to the MV-RNN model (without features) in Socher et al. (2012), and vecDCS actually does better.
Table 4 shows an example of clustered training instances as assessed by cosine similarities between their features. It suggests that the features used in our method can actually cluster similar relations.

Sentence Completion
If vecDCS can compose query vectors of DCS trees, one should be able to "execute" the vectors to get a set of answers, as the original DCS trees can do. This is done by taking dot products with answer vectors and then ranking the answers. Examples are shown in Table 6. Since query vectors and answer vectors are trained from unlabeled corpora, we can only obtain a coarse-grained candidate list. However, it is noteworthy that despite a common word "banned" shared by the phrases, their answer lists are largely different, suggesting that composition actually can be done. Moreover, some words indeed answer the queries (e.g. Thalidomide for "banned drugs" and Samizdat for "banned books").

Quantitatively, we evaluate this utility of executing queries on the sentence completion task. In this task, a sentence is presented with a blank that needs to be filled in. Five possible words are given as options for each blank, and a system needs to choose the correct one. The task can be viewed as a coarse-grained question answering or an evaluation for language models (Zweig et al., 2012). We use the MSR sentence completion dataset 10 which consists of 1,040 test questions and a corpus for training language models. We train vecDCS on this corpus and use it for evaluation.

Table 7: Accuracy (%) on sentence completion
vecDCS                         50
-no matrix                     60
-no inverse                    46
vecUD                          31
N-gram (Various)               39-41
Zweig et al. (2012)            52
Mnih and Teh (2012)            55
Gubbins and Vlachos (2013)     50
Mikolov et al. (2013a)         55

9 https://www.csie.ntu.edu.tw/~cjlin/libsvm/
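"Executing" a query vector amounts to ranking answer vectors by dot product, and sentence completion then reduces to picking the highest-scoring option. The sketch below assumes the query vector for the blank has already been composed from the DCS tree; the function names, the toy 2-dimensional vectors, and the three options (the actual task has five) are assumptions for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def rank_answers(query, answer_vectors):
    """All candidate words, sorted by decreasing dot product with the query."""
    return sorted(answer_vectors,
                  key=lambda w: dot(query, answer_vectors[w]),
                  reverse=True)

def choose_option(query, answer_vectors, options):
    """Pick the option whose answer vector scores highest against the query."""
    return max(options, key=lambda w: dot(query, answer_vectors[w]))

# Toy answer vectors and a toy query for "banned drugs" (illustrative values only).
answer_vectors = {
    "thalidomide": [2.0, 1.0],
    "movie": [-1.0, 3.0],
    "book": [0.0, -1.0],
}
query = [1.0, 0.5]

ranking = rank_answers(query, answer_vectors)
best = choose_option(query, answer_vectors, ["movie", "thalidomide", "book"])
```

No denotation database is consulted at this point: the ranking depends only on the composed query vector and the learned answer vectors, which is why the resulting candidate list is coarse-grained.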
Results As shown in Table 7, vecDCS scores better than the N-gram model and demonstrates promising performance. However, to our surprise, "no matrix" shows an even better result which is the new state-of-the-art. Here we might be facing the same problem as in the phrase similarity task (Section 5.2); namely, all choices in a question fill into the same blank and the same syntactic-semantic role, so the transforming matrices in vecDCS might not be able to distinguish different choices; on the other hand, vecDCS would suffer more from parsing and POS-tagging errors. Nonetheless, we believe the result by "no matrix" reveals a new horizon of sentence completion, and suggests that composing semantic vectors according to DCS trees could be a promising direction.

Discussion
We have demonstrated a way to link a vector composition model to a formal semantics, combining the strength of vector representations to calculate phrase similarities, and the strength of formal semantics to build up structured queries. In this section, we discuss several lines of previous research related to this work.
Logic and Distributional Semantics Logic is necessary for implementing the functional aspects of meaning and organizing knowledge in a structured and unambiguous way. In contrast, distributional semantics provides an elegant methodology for assessing semantic similarity and is well suited for learning from data. There have been repeated calls for combining the strength of these two approaches (Coecke et al., 2010; Liang and Potts, 2015), and several systems (Lewis and Steedman, 2013; Beltagy et al., 2014; Tian et al., 2014) have contributed to this direction. In the remarkable work by Beltagy et al. (to appear), word and phrase similarities are explicitly transformed to weighted logical rules that are used in a probabilistic inference framework. However, this approach requires a considerable amount of engineering, including the generation of rule candidates (e.g. by aligning sentence fragments), converting distributional similarities to weights, and efficiently handling the rules and inference. What if the distributional representations are equipped with a logical interface, such that the inference can be realized by simple vector calculations? We have shown that it is possible to realize semantic composition in this way; we believe this may lead to significant simplification of the system design for combining logic and distributional semantics.
Compositional Distributional Models There has been active exploration on how to combine word vectors such that adequate phrase/sentence similarities can be assessed (Mitchell and Lapata, 2010, inter alia), and there is nothing new in using matrices to model changes of meanings. However, previous model designs mostly rely on linguistic intuitions (Paperno et al., 2014, inter alia), whereas our model has an exact logic interpretation. Furthermore, by using additive composition we enjoy a learning guarantee (Tian et al., 2015).
Vector-based Logic Models This work also shares the spirit with Grefenstette (2013b) and Rocktaeschel et al. (2014), in exploring vector calculations that realize logic operations. However, the previous works did not specify how to integrate contextual distributional information, which is necessary for calculating semantic similarity.

Formal Semantics Our model implements a fragment of logic capable of semantic composition, largely due to the simple framework of Dependency-based Compositional Semantics (Liang et al., 2013). It fits in a long tradition of logic-based semantics (Montague, 1970; Dowty et al., 1981; Kamp and Reyle, 1993), with extensive studies on extracting semantics from syntactic representations such as HPSG (Copestake et al., 2001; Copestake et al., 2005) and CCG (Baldridge and Kruijff, 2002; Bos et al., 2004; Steedman, 2012; Artzi et al., 2015; Mineshima et al., 2015).
Logic for Natural Language Inference The pursuit of a logic more suitable for natural language inference is also not new. For example, MacCartney and Manning (2008) have implemented a model of natural logic (Lakoff, 1970). We would not have reached the current formalization of the logic of DCS without reading the work by Calvanese et al. (1998), which is an elegant formalization of database semantics in description logic.
Semantic Parsing DCS-related representations have been actively used in semantic parsing and we see potential in applying our model. For example, Berant and Liang (2014) convert λ-DCS queries to canonical utterances and assess paraphrases at the surface level; an alternative could be using vector-based DCS to bring distributional similarity directly into calculation of denotations. We also borrow ideas from previous work. For example, our training scheme is similar to Guu et al. (2015) in using paths and composition of matrices, and our method is similar to Poon and Domingos (2009) in building structured knowledge from clustering syntactic parses of unlabeled data.
Further Applications Regarding the usability of distributional representations learned by our model, a strong point is that the representation takes into account syntactic/structural information of context. Unlike several previous models (Padó and Lapata, 2007; Levy and Goldberg, 2014; Pham et al., 2015), our approach simultaneously learns matrices that can extract the information according to different syntactic-semantic roles. A related application is selectional preference (Baroni and Lenci, 2010; Lenci, 2011; Van de Cruys, 2014), wherein our model might have potential for smoothly handling composition.