Composition of Sentence Embeddings: Lessons from Statistical Relational Learning

Various NLP problems – such as the prediction of sentence similarity, entailment, and discourse relations – are all instances of the same general task: the modeling of semantic relations between a pair of textual elements. A popular model for such problems is to embed sentences into fixed size vectors, and use composition functions (e.g. concatenation or sum) of those vectors as features for the prediction. At the same time, composition of embeddings has been a main focus within the field of Statistical Relational Learning (SRL) whose goal is to predict relations between entities (typically from knowledge base triples). In this article, we show that previous work on relation prediction between texts implicitly uses compositions from baseline SRL models. We show that such compositions are not expressive enough for several tasks (e.g. natural language inference). We build on recent SRL models to address textual relational problems, showing that they are more expressive, and can alleviate issues from simpler compositions. The resulting models significantly improve the state of the art in both transferable sentence representation learning and relation prediction.


Introduction
Predicting relations between textual units is a widespread task, essential for discourse analysis, dialog systems, information retrieval, or paraphrase detection. Since relation prediction often requires a form of understanding, it can also be used as a proxy to learn transferable sentence representations. Several tasks that are useful to build sentence representations are derived directly from text structure, without human annotation: sentence order prediction (Logeswaran et al., 2016; Jernite et al., 2017), the prediction of previous and subsequent sentences (Kiros et al., 2015; Jernite et al., 2017), or the prediction of explicit discourse markers between sentence pairs (Nie et al., 2017; Jernite et al., 2017). Human-labeled relations between sentences can also be used for that purpose, e.g. inferential relations (Conneau et al., 2017). While most work on sentence similarity estimation, entailment detection, answer selection, or discourse relation prediction seemingly uses task-specific models, they all involve predicting whether a relation $R$ holds between two sentences $s_1$ and $s_2$. This genericity has been noticed in the literature before (Baudiš et al., 2016) and it has been leveraged for the evaluation of sentence embeddings within the SentEval framework (Conneau et al., 2017).
A straightforward way to predict the probability of $(s_1, R, s_2)$ being true is to represent $s_1$ and $s_2$ with $d$-dimensional embeddings $h_1$ and $h_2$, and to compute sentence pair features $f(h_1, h_2)$, where $f$ is a composition function (e.g. concatenation, product, ...). A softmax classifier $g_\theta$ can learn to predict $R$ with those features. $g_\theta \circ f$ can be seen as a reasoning based on the content of $h_1$ and $h_2$ (Socher et al., 2013).
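To make this pipeline concrete, here is a minimal Python sketch of $g_\theta \circ f$ with concatenation as the composition $f$. The function names, the toy embeddings and the weight values are hypothetical and chosen only for illustration; a trained model would learn $\theta$ and would typically include a bias term.

```python
import math

def concat(h1, h2):
    """Composition f: concatenation of the two sentence embeddings."""
    return list(h1) + list(h2)

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_relation(h1, h2, f, theta):
    """g_theta(f(h1, h2)): theta holds one weight row per relation."""
    features = f(h1, h2)
    logits = [sum(w * x for w, x in zip(row, features)) for row in theta]
    return softmax(logits)

# Toy 2-dimensional sentence embeddings (made-up values).
h1, h2 = [0.2, -1.0], [0.5, 0.3]
# Hypothetical softmax weights for two relations over the 4 features.
theta = [[0.1, 0.0, -0.2, 0.4],
         [0.0, 0.3, 0.1, -0.1]]
probs = predict_relation(h1, h2, concat, theta)  # one probability per relation
```

Swapping `concat` for another composition changes the features seen by the classifier without touching the sentence encoder, which is the design choice studied in the rest of the paper.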
Our contributions are as follows:
- we review composition functions used in textual relational learning and show that they lack expressiveness (section 2);
- we draw analogies with existing SRL models (section 3) and design new compositions inspired from SRL (section 4);
- we perform extensive experiments to test composition functions and show that some of them can improve the learning of representations and their downstream uses (section 6).

Composition functions for relation prediction
We review here popular composition functions used for relation prediction based on sentence embeddings. Ideally, they should simultaneously fulfill the following minimal requirements:
- make use of interactions between representations of sentences to relate;
- allow for the learning of asymmetric relations (e.g. entailment, order);
- be usable with high dimensionalities (parameters $\theta$ and $f$ should fit in GPU memory).
Additionally, if the main goal is transferable sentence representation learning, compositions should also incentivize gradually changing sentences to lie on a linear manifold, since transfer usually uses linear models. Another goal can be learning of transferable relation representation. Concretely, a sentence encoder and $f$ can be trained on a base task, and $f(h_1, h_2)$ can be used as features for transfer in another task. In that case, the geometry of the sentence embedding space is less relevant, as long as the $f(h_1, h_2)$ space works well for transfer learning. Our evaluation will cover both cases.
A straightforward instantiation of $f$ is concatenation (Hooda & Kosseim, 2017):

$$f_{[,]}(h_1, h_2) = [h_1, h_2] \qquad (1)$$

However, interactions between $s_1$ and $s_2$ cannot be modeled with $f_{[,]}$ followed by a softmax regression. Indeed, $f_{[,]}(h_1, h_2)\theta$ can be rewritten as a sum of independent contributions from $h_1$ and $h_2$, namely $h_1\theta_{[0,d]} + h_2\theta_{[d,2d]}$, where $\theta_{[0,d]}$ and $\theta_{[d,2d]}$ denote the first and last $d$ rows of $\theta$. Using a multilayer perceptron before the softmax would solve this issue, but it also harms sentence representation learning (Conneau et al., 2017; Logeswaran & Lee, 2018), possibly because the perceptron allows for accurate predictions even if the sentence embeddings lie in a convoluted space. To promote interactions between $h_1$ and $h_2$, element-wise product has been used in Baudiš et al. (2016):

$$f_\odot(h_1, h_2) = h_1 \odot h_2 \qquad (2)$$

Absolute difference is another solution for sentence similarity (Mueller & Thyagarajan, 2016), and its element-wise variation may equally be used to compute informative features:

$$f_-(h_1, h_2) = |h_1 - h_2| \qquad (3)$$

The latter two were combined into a popular instantiation, sometimes referred to as heuristic matching (Tai et al., 2015; Kiros et al., 2015; Mou et al., 2015):

$$f_{\odot,-}(h_1, h_2) = [h_1 \odot h_2, |h_1 - h_2|] \qquad (4)$$

Although effective for certain similarity tasks, $f_{\odot,-}$ is symmetrical, and should be a poor choice for tasks like entailment prediction or prediction of discourse relations. For instance, if $R_e$ denotes entailment and $(s_1, s_2)$ = ("It just rained", "The ground is wet"), $(s_1, R_e, s_2)$ should hold but not $(s_2, R_e, s_1)$. The $f_{\odot,-}$ composition function is nonetheless used to train/evaluate models on entailment (Conneau et al., 2017) or discourse relation prediction (Nie et al., 2017).

Sentence representations $[h_1, h_2]$ are sometimes concatenated with $f_{\odot,-}(h_1, h_2)$ (... et al., 2016; Conneau et al., 2017). While the resulting composition is asymmetrical, the asymmetrical component involves no interaction as noted previously. We note that this composition is very commonly used. On the SNLI benchmark, 12 out of the 25 listed sentence embedding based models use it, and 7 use a weaker form (e.g. omitting $f_\odot$).
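The two limitations discussed in this section (concatenation yields no interaction under a linear classifier, and heuristic matching is symmetrical) can be checked numerically. The Python sketch below uses hypothetical helper names, toy embeddings and made-up weights for a single relation; it is an illustration of the argument, not the code used in the experiments.

```python
import math

def f_concat(h1, h2):
    """Concatenation [h1, h2]."""
    return list(h1) + list(h2)

def f_prod(h1, h2):
    """Element-wise product of h1 and h2."""
    return [a * b for a, b in zip(h1, h2)]

def f_absdiff(h1, h2):
    """Element-wise absolute difference |h1 - h2|."""
    return [abs(a - b) for a, b in zip(h1, h2)]

def f_heuristic(h1, h2):
    """Heuristic matching: product features followed by absolute difference."""
    return f_prod(h1, h2) + f_absdiff(h1, h2)

def logit(features, theta):
    """Linear score for one relation (one column of the softmax weights)."""
    return sum(w * x for w, x in zip(theta, features))

h1, h2 = [1.0, -2.0, 0.5], [0.25, 4.0, -1.0]   # toy embeddings
zero = [0.0, 0.0, 0.0]
theta = [0.3, -0.1, 0.7, 0.2, 0.5, -0.4]       # hypothetical weights

# With concatenation, the score splits into a term that depends only on h1
# and a term that depends only on h2: there is no interaction.
joint = logit(f_concat(h1, h2), theta)
separate = logit(f_concat(h1, zero), theta) + logit(f_concat(zero, h2), theta)
no_interaction = math.isclose(joint, separate)

# Heuristic matching gives identical features for (h1, h2) and (h2, h1),
# so no linear classifier on top of it can tell the two orders apart.
symmetric = f_heuristic(h1, h2) == f_heuristic(h2, h1)
```

Both `no_interaction` and `symmetric` evaluate to `True` for any choice of embeddings and weights, which is why neither composition alone satisfies all the requirements listed at the start of this section.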
The outer product $\otimes$ has been used instead for asymmetric multiplicative interaction (Jernite et al., 2017):

$$f_\otimes(h_1, h_2) = h_1 \otimes h_2 \qquad (5)$$

This formulation is expressive but it forces $g_\theta$ to have $d^2$ parameters per relation, which is prohibitive when there are many relations and $d$ is high.
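As a rough order of magnitude, the sketch below compares the number of softmax weights per relation for the outer product and for heuristic matching, using the embedding size $d = 4096$ reported in the experimental setup. The helper name `f_outer` is hypothetical, and the count ignores bias terms.

```python
def f_outer(h1, h2):
    """Flattened outer product: one feature per pair of dimensions (i, j)."""
    return [a * b for a in h1 for b in h2]

d = 4096                      # sentence embedding size used in the setup
outer_params = d * d          # weights per relation with the outer product
heuristic_params = 2 * d      # product plus absolute difference features

# The outer product is asymmetric: swapping the inputs reorders the features.
asymmetric = f_outer([1.0, 2.0], [3.0, 4.0]) != f_outer([3.0, 4.0], [1.0, 2.0])
ratio = outer_params // heuristic_params  # how many times more weights
```

With these numbers the outer product needs 16,777,216 weights per relation against 8,192 for heuristic matching, i.e. 2,048 times more.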
The problems outlined above are well known in SRL. Thus, existing compositions (except $f_\otimes$) can only model relations superficially for tasks currently used to train state of the art sentence encoders, like NLI or discourse connectives prediction.

Statistical Relational Learning models
In this section we introduce the context of statistical relational learning (SRL) and relevant models.
Recently, SRL has focused on efficient and expressive relation prediction based on embeddings. A core goal of SRL (Getoor & Taskar, 2007) is to induce whether a relation $R$ holds between two arbitrary entities $e_1$, $e_2$. As an example, we would like to assign a score to $(e_1, R, e_2)$ = (Paris, LOCATED IN, France) that reflects a high probability.

Table 1: Unstructured is from (Bordes et al., 2013a), TransE from (Bordes et al., 2013b), RESCAL from (Nickel et al., 2011), DistMult from (Yang et al., 2015) and ComplEx from (Trouillon et al., 2016). Following the latter, $\langle a, b, c \rangle$ denotes $\sum_k a_k b_k c_k$. $\mathrm{Re}(x)$ is the real part of $x$, and $p$ is commonly set to 1.

Table 1 presents an overview of a number of state of the art relational models. We can distinguish two families of models: subtractive and multiplicative.
The TransE scoring function is motivated by the idea that translations in latent space can model analogical reasoning and hierarchical relationships. Dense word embeddings trained on tasks related to the distributional hypothesis naturally allow for analogical reasoning with translations without explicit supervision (Mikolov et al., 2013). TransE generalizes the older Unstructured model. We call them subtractive models.
The RESCAL, DistMult, and ComplEx scoring functions can be seen as dot product matching between $e_1$ and a relation-specific linear transformation of $e_2$ (Liu et al., 2017). This transformation helps checking whether $e_1$ matches with some aspects of $e_2$. RESCAL allows a full linear mapping $W_r e_2$ but has a high complexity, while DistMult is restricted to a component-wise weighting $w_r \odot e_2$. ComplEx has fewer parameters than RESCAL but still allows for the modeling of asymmetrical relations. As shown in Liu et al. (2017), ComplEx boils down to a restriction of RESCAL where $W_r$ is a block diagonal matrix. These blocks are 2-dimensional, antisymmetric and have equal diagonal terms. Using such a form, even and odd indexes of $e$'s dimensions play the roles of real and imaginary numbers respectively. The ComplEx model (Trouillon et al., 2016) and its variations (Lacroix et al., 2018) yield state of the art performance on knowledge base completion on numerous evaluations.
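For reference, the five scoring functions discussed in this section can be sketched in a few lines of Python. This is a minimal illustration in their commonly published forms, with hypothetical function names and toy values. For the subtractive models the norm is a distance (lower means more plausible), and some presentations negate it so that higher is better.

```python
import math

def lp_norm(v, p=1):
    """L_p norm of a vector (p = 1 is the common default)."""
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

def unstructured(e1, e2, p=1):
    """Subtractive, relation-independent: ||e1 - e2||_p."""
    return lp_norm([a - b for a, b in zip(e1, e2)], p)

def transe(e1, w_r, e2, p=1):
    """Subtractive with a relation translation: ||e1 + w_r - e2||_p."""
    return lp_norm([a + w - b for a, w, b in zip(e1, w_r, e2)], p)

def rescal(e1, W_r, e2):
    """Multiplicative with a full relation matrix: e1^T W_r e2."""
    return sum(e1[i] * W_r[i][j] * e2[j]
               for i in range(len(e1)) for j in range(len(e2)))

def distmult(e1, w_r, e2):
    """Multiplicative with a diagonal relation: <e1, w_r, e2>."""
    return sum(a * w * b for a, w, b in zip(e1, w_r, e2))

def complex_score(e1, w_r, e2):
    """ComplEx on complex-valued vectors: Re(<e1, w_r, conj(e2)>)."""
    return sum(a * w * b.conjugate() for a, w, b in zip(e1, w_r, e2)).real

e1, e2 = [1.0, 2.0], [0.5, -1.0]          # toy entity embeddings
w = [0.3, 0.7]                            # toy relation weights
W_diag = [[0.3, 0.0], [0.0, 0.7]]         # the same weights as a diagonal matrix

distmult_is_symmetric = math.isclose(distmult(e1, w, e2), distmult(e2, w, e1))
rescal_diag_is_distmult = math.isclose(rescal(e1, W_diag, e2), distmult(e1, w, e2))
transe_zero_is_unstructured = transe(e1, [0.0, 0.0], e2) == unstructured(e1, e2)

# ComplEx distinguishes the two argument orders when w_r has an imaginary part.
z1, z2, w_c = [1 + 2j], [0.5 - 1j], [1j]
forward = complex_score(z1, w_c, z2)    # -2.0
backward = complex_score(z2, w_c, z1)   # 2.0
```

The three boolean checks illustrate the relationships stated in the text: DistMult is symmetric, it coincides with RESCAL restricted to a diagonal matrix, and TransE with a zero translation reduces to Unstructured.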

Embeddings composition as SRL models
We claim that several existing models (Conneau et al., 2017; Nie et al., 2017; Baudiš et al., 2016) boil down to SRL models where the sentence embeddings $(h_1, h_2)$ act as entity embeddings $(e_1, e_2)$. This framework is depicted in figure 1. In this article we focus on sentence embeddings, although our framework can straightforwardly be applied to other levels of language granularity (such as words, clauses, or documents). Some models (Chen et al., 2017b; Seo et al., 2016; Gong et al., 2018; Radford, 2018; Devlin et al., 2018) do not rely on explicit sentence encodings to perform relation prediction. They combine information of input sentences at earlier stages, using conditional encoding or cross-attention. There is however no straightforward way to derive transferable sentence representations in this setting, and so these models are out of the scope of this paper. They sometimes make use of composition functions, so our work could still be relevant to them in some respect.
In this section we will make a link between sentence composition functions and SRL scoring functions, and propose new scoring functions drawing inspiration from SRL.

Linking composition functions and SRL models
The composition function $f_\odot$ from equation 2 followed by a softmax regression yields a score whose analytical form is identical to the DistMult model score described in section 3. Let $\theta_R$ denote the softmax weights for relation $R$. The logit score for the truth of $(s_1, R, s_2)$ is $f_\odot(h_1, h_2)\theta_R = (h_1 \odot h_2)\theta_R$, which is equal to the DistMult scoring function $\langle h_1, \theta_R, h_2 \rangle$.

Similarly, the composition $f_-$ from equation 3 followed by a softmax regression can be seen as an element-wise weighted score of Unstructured (both are equal if softmax weights are all unitary). Thus, $f_{\odot,-}$ from equation 4 (with softmax regression) can be seen as a weighted ensemble of Unstructured and DistMult. These two models are respectively outperformed by TransE and ComplEx on knowledge base link prediction by a large margin (Trouillon et al., 2016; Bordes et al., 2013a). We therefore propose to change the Unstructured and DistMult in $f_{\odot,-}$ such that they match their respective state of the art variations in the following sections. We will also show the implications of these refinements.

[Figure 2 panels: (a) Score map of $(s_1, R_{\text{to the past}}, s_2)$ over possible sentences $s_2$ using Unstructured composition. (b) Score map of $(s_1, R_{\text{to the past}}, s_2)$ over possible sentences $s_2$ using TransE composition. (d) Score map of $(s_1, R_{\text{entailment}}, s_2)$ over possible sentences $s_2$ using ComplEx composition.]
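The two correspondences described in this section can be verified numerically. In the Python sketch below, the helper names, the toy embeddings and the weights are hypothetical; the check only illustrates that a linear score over the product features has the DistMult form, and that absolute difference features with unit weights give the Unstructured score with $p = 1$.

```python
import math

def f_prod(h1, h2):
    """Element-wise product features."""
    return [a * b for a, b in zip(h1, h2)]

def f_absdiff(h1, h2):
    """Element-wise absolute difference features."""
    return [abs(a - b) for a, b in zip(h1, h2)]

def logit(features, theta_R):
    """Logit for relation R: dot product with that relation's softmax weights."""
    return sum(w * x for w, x in zip(theta_R, features))

def distmult(e1, w_r, e2):
    """DistMult trilinear score <e1, w_r, e2>."""
    return sum(a * w * b for a, w, b in zip(e1, w_r, e2))

def unstructured_l1(e1, e2):
    """Unstructured score with p = 1."""
    return sum(abs(a - b) for a, b in zip(e1, e2))

h1, h2 = [1.0, -2.0, 0.5], [0.25, 4.0, -1.0]   # toy sentence embeddings
theta_R = [0.3, -0.1, 0.7]                     # hypothetical softmax weights
ones = [1.0, 1.0, 1.0]                         # unit weights

same_as_distmult = math.isclose(
    logit(f_prod(h1, h2), theta_R), distmult(h1, theta_R, h2))
same_as_unstructured = math.isclose(
    logit(f_absdiff(h1, h2), ones), unstructured_l1(h1, h2))
```

Here the sentence embeddings play the role of entity embeddings and the softmax weights play the role of the relation weights.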

Simply replacing $|h_1 - h_2|$ with

$$f_t(h_1, h_2) = |h_1 - h_2 + t|, \quad t \in \mathbb{R}^d$$

would make the model analogous to TransE. $t$ is learned and is shared by all relations. A relation-specific translation $t_R$ could be used but it would make $f$ relation-specific. Instead, here, each dimension of $f_t(h_1, h_2)$ can be weighted according to a given relation. Non-zero $t$ makes $f_t$ asymmetrical and also yields features that allow for the checking of an analogy between $s_1$ and $s_2$. Sentence embeddings often rely on pre-trained word embeddings which have demonstrated strong capabilities for analogical reasoning. Some analogies, such as part-whole, are computable with off-the-shelf word embeddings (Chen et al., 2017a) and should be very informative for natural language inference tasks. As an illustration, let us consider an artificial semantic space (depicted in figures 2a and 2b) where we posit that there is a "to the past" translation $t$ so that $h_1 + t$ is the embedding of a sentence $s_1$ changed to the past tense. Unstructured is not able to leverage this semantic space to correctly score $(s_1, R_{\text{to the past}}, s_2)$ while TransE is well tailored to provide highest scores for sentences near $h_1 + \hat{t}$ where $\hat{t}$ is an estimation of $t$ that could be learned from examples.
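The "to the past" illustration can be reproduced on a toy 2-dimensional space. The sketch below assumes the TransE-like form $f_t(h_1, h_2) = |h_1 - h_2 + t|$ and scores a pair with uniform negative weights, so that a smaller distance gives a higher score. The translation, the embeddings and the scoring rule are all made up for illustration; in the model, $t$ and the weights would be learned.

```python
def f_absdiff(h1, h2):
    """Unstructured-style features |h1 - h2|."""
    return [abs(a - b) for a, b in zip(h1, h2)]

def f_t(h1, h2, t):
    """TransE-style features |h1 - h2 + t| with a shared translation t."""
    return [abs(a - b + c) for a, b, c in zip(h1, h2, t)]

def score(features):
    """Hypothetical scorer: uniform negative weights on distance features."""
    return -sum(features)

t = [0.0, 1.0]            # assumed "to the past" translation
h_present = [0.5, 0.0]    # toy embedding of a present-tense sentence
h_past = [0.5, 1.0]       # toy embedding of its past-tense version (h_present + t)

# Unstructured prefers s2 = s1 and scores both orders identically.
unstructured_same = score(f_absdiff(h_present, h_present))   # highest possible
unstructured_fwd = score(f_absdiff(h_present, h_past))
unstructured_bwd = score(f_absdiff(h_past, h_present))

# The translated composition peaks at h_present + t and is asymmetric.
transe_fwd = score(f_t(h_present, h_past, t))
transe_bwd = score(f_t(h_past, h_present, t))
```

On this toy space the translated composition gives its maximal score (zero distance) to the pair (present, past) and a strictly lower score to the reversed pair, whereas the absolute difference cannot separate the two directions.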

Casting ComplEx as a composition
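Let us partition $h$ dimensions into two equally sized sets $\mathcal{R}$ and $\mathcal{I}$, e.g. even and odd dimension indices of $h$, and write $h^{\mathcal{R}}$ and $h^{\mathcal{I}}$ for the corresponding sub-vectors. We propose a new function $f_C$ as a way to fit the ComplEx scoring function into a composition function:

$$f_C(h_1, h_2) = [h_1^{\mathcal{R}} \odot h_2^{\mathcal{R}} + h_1^{\mathcal{I}} \odot h_2^{\mathcal{I}},\; h_1^{\mathcal{R}} \odot h_2^{\mathcal{I}} - h_1^{\mathcal{I}} \odot h_2^{\mathcal{R}}]$$

In the artificial semantic space of figures 2c and 2d, the first dimension is high when a sentence means that it just rained, and the second dimension is high when the ground is wet. Over this semantic space, DistMult is only able to detect entailment for paraphrases whereas ComplEx is also able to naturally model that the score of ("it just rained", $R_{\text{entailment}}$, "the ground is wet") should be high while that of its converse should not.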
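We also propose two more general versions of $f_C$, using the same partition of dimensions into $\mathcal{R}$ and $\mathcal{I}$:

$$f_{C_\alpha}(h_1, h_2) = [h_1^{\mathcal{R}} \odot h_2^{\mathcal{R}},\; h_1^{\mathcal{I}} \odot h_2^{\mathcal{I}},\; h_1^{\mathcal{R}} \odot h_2^{\mathcal{I}} - h_1^{\mathcal{I}} \odot h_2^{\mathcal{R}}]$$

$$f_{C_\beta}(h_1, h_2) = [h_1^{\mathcal{R}} \odot h_2^{\mathcal{R}},\; h_1^{\mathcal{I}} \odot h_2^{\mathcal{I}},\; h_1^{\mathcal{R}} \odot h_2^{\mathcal{I}},\; h_1^{\mathcal{I}} \odot h_2^{\mathcal{R}}]$$

$f_{C_\alpha}$ can be seen as DistMult concatenated with the asymmetrical part of ComplEx and $f_{C_\beta}$ can be seen as RESCAL with unconstrained block diagonal relation matrices.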
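A minimal Python sketch of the ComplEx-inspired compositions follows. It assumes the forms $f_C = [h_1^{\mathcal{R}} \odot h_2^{\mathcal{R}} + h_1^{\mathcal{I}} \odot h_2^{\mathcal{I}},\ h_1^{\mathcal{R}} \odot h_2^{\mathcal{I}} - h_1^{\mathcal{I}} \odot h_2^{\mathcal{R}}]$, with $f_{C_\alpha}$ keeping the two symmetric products separate and $f_{C_\beta}$ keeping all four products separate, and it uses even and odd indices for the partition. Function names and numeric values are hypothetical. The sketch checks that a linear score over $f_C$ equals the ComplEx score when the first half of the weights is read as the real part and the second half as the imaginary part of the relation.

```python
import math

def split(h):
    """Partition h into even-indexed (R) and odd-indexed (I) dimensions."""
    return h[0::2], h[1::2]

def f_C(h1, h2):
    r1, i1 = split(h1)
    r2, i2 = split(h2)
    sym = [a * c + b * d for a, b, c, d in zip(r1, i1, r2, i2)]
    asym = [a * d - b * c for a, b, c, d in zip(r1, i1, r2, i2)]
    return sym + asym

def f_C_alpha(h1, h2):
    r1, i1 = split(h1)
    r2, i2 = split(h2)
    rr = [a * c for a, c in zip(r1, r2)]
    ii = [b * d for b, d in zip(i1, i2)]
    asym = [a * d - b * c for a, b, c, d in zip(r1, i1, r2, i2)]
    return rr + ii + asym

def f_C_beta(h1, h2):
    r1, i1 = split(h1)
    r2, i2 = split(h2)
    return ([a * c for a, c in zip(r1, r2)] + [b * d for b, d in zip(i1, i2)]
            + [a * d for a, d in zip(r1, i2)] + [b * c for b, c in zip(i1, r2)])

def complex_score(h1, theta, h2):
    """Re(<z1, w, conj(z2)>) with z = h_R + i*h_I and w built from theta halves."""
    r1, i1 = split(h1)
    r2, i2 = split(h2)
    k = len(theta) // 2
    total = 0j
    for a, b, c, d, wr, wi in zip(r1, i1, r2, i2, theta[:k], theta[k:]):
        total += complex(a, b) * complex(wr, wi) * complex(c, d).conjugate()
    return total.real

def logit(features, theta):
    return sum(w * x for w, x in zip(theta, features))

h1, h2 = [1.0, -2.0, 0.5, 3.0], [0.25, 4.0, -1.0, 2.0]   # toy embeddings
theta = [0.3, -0.1, 0.7, 0.2]                            # hypothetical weights
matches_complex = math.isclose(logit(f_C(h1, h2), theta),
                               complex_score(h1, theta, h2))

# Toy "rained / wet" space: dimension 0 = it just rained, 1 = ground is wet.
rained, wet = [1.0, 0.0], [0.0, 1.0]
forward = f_C(rained, wet)     # [0.0, 1.0]
backward = f_C(wet, rained)    # [0.0, -1.0]
product = [a * b for a, b in zip(rained, wet)]   # [0.0, 0.0] in both orders
```

On the toy space, the second feature of `f_C` changes sign when the sentences are swapped, so a positive weight on it scores the pair ("it just rained", "the ground is wet") above its converse, while the plain element-wise product gives zero features in both directions.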

On the evaluation of relational models
The SentEval framework (Conneau et al., 2017) provides a general evaluation for transferable sentence representations, with open source evaluation code. One only needs to specify a sentence encoder function, and the framework performs classification tasks or relation prediction tasks using cross-validated logistic regression on embeddings or composed sentence embeddings. Tasks include sentiment analysis, entailment, textual similarity, textual relatedness, and paraphrase detection. These tasks are a rich way to train or evaluate sentence representations since in a triple $(s_1, R, s_2)$, we can see $(R, s_2)$ as a label for $s_1$ (Baudiš et al., 2016). Unfortunately, the relational tasks hard-code the composition function from equation 4. From our previous analysis, we believe this composition function favors the use of contextual/lexical similarity rather than high-level reasoning and can penalize representations based on more semantic aspects. This bias could harm research since semantic representation is an important next step for sentence embedding. Training/evaluation datasets are also arguably flawed with respect to relational aspects since several recent studies (Dasgupta et al., 2018; Poliak et al., 2018; Gururangan et al., 2018; Glockner et al., 2018) show that InferSent, despite being state of the art on SentEval evaluation tasks, has poor performance when dealing with asymmetrical tasks and non-additive composition of words. In addition to providing new ways of training sentence encoders, we will also extend the SentEval evaluation framework with a more expressive composition function when dealing with relational transfer tasks, which improves results even when the sentence encoder was not trained with it.

Experiments
Our goal is to show that transferable sentence representation learning and relation prediction tasks can be improved when our expressive compositions are used instead of the composition from equation 4. We train our relational model adaptations on two relation prediction base tasks ($T$), one supervised ($T$ = NLI) and one unsupervised ($T$ = Disc) described below, and evaluate sentence/relation representations on base and transfer tasks using the SentEval framework in order to quantify the generalization capabilities of our models. Since we use minor modifications of InferSent and SentEval, our experiments are easily reproducible.

Training tasks
The goal of natural language inference ($T$ = NLI) is to predict whether the relation between two sentences (premise and hypothesis) is Entailment, Contradiction or Neutral. We use the combination of the SNLI dataset (Bowman et al., 2015) and the MNLI dataset (Williams et al., 2017). We call AllNLI the resulting dataset of 1M examples. Conneau et al. (2017) claim that NLI data allows universal sentence representation learning. They used the $f_{\odot,-}$ composition function with concatenated sentence representations in order to train their InferSent model. We also train on the prediction of discourse connectives between sentences/clauses ($T$ = Disc). Discourse connectives make discourse relations between sentences explicit. In the sentence "I live in Paris but I'm often elsewhere", the word "but" highlights that there is a contrast between the two clauses it connects. We use Malmi et al.'s (2017) dataset of 400k selected instances with 20 discourse connectives (e.g. however, for example) with the provided train/dev/test split. This dataset has no other supervision than the list of 20 connectives. Nie et al. (2017) used $f_{\odot,-}$ concatenated with the sum of sentence representations to train their model, DisSent, on a similar task and showed that their encoder was general enough to perform well on SentEval tasks. They use a dataset that is, at the time of writing, not publicly available.

Evaluation tasks
Table 2 provides an overview of different transfer tasks that will be used for evaluation. We added another relation prediction task, the PDTB coarse-grained implicit discourse relation task, to SentEval. This task involves predicting a discursive link between two sentences among {Comparison, Contingency, Entity based coherence, Expansion, Temporal}. We followed the setup of Pitler et al. (2009), without sampling negative examples in training. MRPC, PDTB and SICK will be tested with two composition functions: besides the SentEval composition $f_{\odot,-}$, we will use $f_{C_\beta,-}$ for transfer learning evaluation, since it has the most general multiplicative interaction and it does not penalize models that do not learn a translation. For all tasks except STS14, a cross-validated logistic regression is used on the sentence or relation representation. The evaluation of the STS14 task relies on Pearson or Spearman correlation between cosine similarity and the target. We force the composition function to be symmetrical on the MRPC task since paraphrase detection should be invariant to permutation of input sentences.
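The STS14-style protocol (correlating cosine similarity of the two sentence embeddings with the gold score) can be sketched as follows. The embeddings and gold similarity values below are made up for illustration and are not taken from the dataset; the helper names are hypothetical, and only the Pearson variant is shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Made-up sentence pair embeddings and gold similarity judgments.
pairs = [([1.0, 0.0], [1.0, 0.1]),
         ([1.0, 0.0], [0.5, 0.5]),
         ([1.0, 0.0], [0.0, 1.0])]
gold = [4.8, 2.5, 0.2]

similarities = [cosine(u, v) for u, v in pairs]
correlation = pearson(similarities, gold)
```

No classifier or composition function is trained in this protocol, so the score reflects the geometry of the sentence embedding space directly.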

Setup
We want to compare the different instances of $f$. We follow the setup of InferSent (Conneau et al., 2017): we learn to encode sentences into $h$ with a bi-directional LSTM using element-wise max pooling over time. The dimension size of $h$ is 4096. Word embeddings are fixed GloVe with 300 dimensions, trained on Common Crawl 840B. Optimization is done with SGD and decreasing learning rate until convergence.
The only difference with regard to InferSent is the composition. Sentences are composed with six different compositions for training according to the following template:

$$f_{m,s,1,2}(h_1, h_2) = [f_m(h_1, h_2), f_s(h_1, h_2), h_1, h_2]$$

where the multiplicative interaction $f_m$ is in $\{f_\odot, f_{C_\alpha}, f_{C_\beta}\}$ and the subtractive interaction $f_s$ is in $\{f_-, f_t\}$. We do not consider $f_C$ since it yielded inferior results in our early experiments using NLI and SentEval development sets. $f_{m,s,1,2}(h_1, h_2)$ is fed directly to a softmax regression. Note that InferSent uses a multi-layer perceptron before the softmax, but uses only linear activations, so $f_{\odot,-,1,2}(h_1, h_2)$ is analytically equivalent to InferSent when $T$ = NLI.
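The six training compositions can be enumerated with a short Python sketch. It assumes that the template concatenates one multiplicative block, one subtractive block and both sentence embeddings, with the multiplicative block being the element-wise product or one of the two generalized ComplEx compositions, and the subtractive block being the absolute difference with or without a translation. The dictionary keys, helper names and the zero placeholder for the learned translation `t` are hypothetical.

```python
from itertools import product

def split(h):
    """Even-indexed and odd-indexed dimensions of h."""
    return h[0::2], h[1::2]

def f_prod(h1, h2):
    return [a * b for a, b in zip(h1, h2)]

def f_C_alpha(h1, h2):
    r1, i1 = split(h1)
    r2, i2 = split(h2)
    return ([a * c for a, c in zip(r1, r2)] + [b * d for b, d in zip(i1, i2)]
            + [a * d - b * c for a, b, c, d in zip(r1, i1, r2, i2)])

def f_C_beta(h1, h2):
    r1, i1 = split(h1)
    r2, i2 = split(h2)
    return ([a * c for a, c in zip(r1, r2)] + [b * d for b, d in zip(i1, i2)]
            + [a * d for a, d in zip(r1, i2)] + [b * c for b, c in zip(i1, r2)])

def f_absdiff(h1, h2):
    return [abs(a - b) for a, b in zip(h1, h2)]

def make_f_t(t):
    """Translated absolute difference |h1 - h2 + t| for a given vector t."""
    return lambda h1, h2: [abs(a - b + c) for a, b, c in zip(h1, h2, t)]

def compose(f_m, f_s, h1, h2):
    """Template: multiplicative block, subtractive block, then h1 and h2."""
    return f_m(h1, h2) + f_s(h1, h2) + list(h1) + list(h2)

d = 4
t = [0.0] * d   # placeholder; in the model this vector would be learned
MULTIPLICATIVE = {"prod": f_prod, "C_alpha": f_C_alpha, "C_beta": f_C_beta}
SUBTRACTIVE = {"absdiff": f_absdiff, "translation": make_f_t(t)}

h1, h2 = [1.0, -2.0, 0.5, 3.0], [0.25, 4.0, -1.0, 2.0]   # toy embeddings
feature_sizes = {
    (m, s): len(compose(MULTIPLICATIVE[m], SUBTRACTIVE[s], h1, h2))
    for m, s in product(MULTIPLICATIVE, SUBTRACTIVE)
}
```

Under these assumptions the feature vector has size $4d$ with the element-wise product, $4.5d$ with $f_{C_\alpha}$ and $5d$ with $f_{C_\beta}$, so the more expressive compositions add only a modest number of softmax weights.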

Results
Across several experiments run with different initializations, the standard deviations between runs do not seem to be negligible. We decided to take these into account when reporting scores, contrary to previous work (Kiros et al., 2015; Conneau et al., 2017): we average the scores of 6 distinct runs for each task and use standard deviations under normality assumption to compute significance. Table 3 shows model scores for $T$ = NLI, while Table 4 shows scores for $T$ = Disc. For comparison, Table 5 shows a number of important models from previous work. Finally, in Table 6, we present results for sentence relation tasks that use an alternative composition function ($f_{C_\beta,-}$) instead of the standard composition function used in SentEval.
For sentence representation learning, the baseline $f_{\odot,-,1,2}$ composition already performs rather well, being on par with the InferSent scores of the original paper, as would be expected. However, macro-averaging all accuracies, it is the second worst performing model. $f_{C_\alpha,t,1,2}$ is the best performing model, and all three best models use the translation ($s = t$). On relational transfer tasks, training with $f_{C_\alpha,t,1,2}$ and using the ComplEx-based $f_{C_\beta,-}$ for transfer (Table 6) always outperforms the baseline ($f_{\odot,-,1,2}$ with the $f_{\odot,-}$ composition in Tables 3 and 4). Averaging accuracies of those transfer tasks, this result is significant for both training tasks at level $p < 0.05$ (using Bonferroni correction accounting for the 5 comparisons). On base tasks and the average of non-relational transfer tasks (MR, MPQA, SUBJ, TREC), our proposed compositions are on average slightly better than $f_{\odot,-,1,2}$. Representations learned with our proposed compositions can still be compared with simple cosine similarity: all three methods using the translational composition ($s = t$) very significantly outperform the baseline (significant at level $p < 0.01$ with Bonferroni correction) on STS14 for $T$ = NLI. Thus, we believe $f_{C_\alpha,t,1,2}$ has more robust results and could be a better default choice than $f_{\odot,-,1,2}$ as composition for representation learning. Additionally, using $f_{C_\beta,-}$ (Table 6) instead of $f_{\odot,-}$ (Tables 3 and 4) for transfer learning in relational transfer tasks (PDTB, MRPC, SICK) yields a significant improvement on average, even when $m = \odot$ was used for training ($p < 0.001$). Therefore, we believe $f_{C_\beta,-}$ is an interesting composition for inference or evaluation of models regardless of how they were trained.

Related work
There are numerous interactions between SRL and NLP. We believe that our framework merges two specific lines of work: relation prediction and modeling textual relational tasks.
Some previous NLP work focused on composition functions for relation prediction between text fragments, even though they ignored SRL and only dealt with word units. Word2vec (Mikolov et al., 2013) has sparked a great interest for this task with word analogies in the latent space. Levy & Goldberg (2014) explored different scoring functions between words, notably for analogies. Hypernymy relations were also studied, by Chang et al. (2017) and Fu et al. (2014). Levy et al. (2015) proposed tailored scoring functions. Even the skipgram model (Mikolov et al., 2013) can be formulated as finding relations between context and target words. We did not empirically explore textual relational learning at the word level, but we believe that it would fit in our framework, and could be tested in future studies. Numerous approaches (Chen et al., 2017b; Seo et al., 2016; Gong et al., 2018; Joshi et al., 2018) were proposed to predict inference relations between sentences, but do not explicitly use sentence embeddings. Instead, they encode sentences jointly, possibly with the help of previously cited word compositions; it would therefore also be interesting to try applying our techniques within their framework.
Some modeling aspects of textual relational learning have been formally investigated by Baudiš et al. (2016). They noticed the genericity of relational problems and explored multi-task and transfer learning on relational tasks. Their work is complementary to ours since their framework unifies tasks while ours unifies composition functions. Subsequent approaches use relational tasks for training and evaluation on specific datasets (Conneau et al., 2017; Nie et al., 2017).

Conclusion
We have demonstrated that a number of existing models used for textual relational learning rely on composition functions that are already used in Statistical Relational Learning. By taking into account previous insights from SRL, we proposed new composition functions and evaluated them. These composition functions are all simple to implement and we hope that it will become standard to try them on relational problems. Larger scale data might leverage these more expressive compositions, as well as more compositional, asymmetric, and arguably more realistic datasets (Dasgupta et al., 2018; Gururangan et al., 2018). Finally, our compositions can also be helpful to improve interpretability of embeddings, since they can help measure relation prediction asymmetry. Analogies through translations helped interpreting word embeddings, and perhaps analyzing our learned translation $t$ could help interpreting sentence embeddings.

Figure 1: Implicit SRL model in text relation prediction

Figure 2: Possible scoring function values according to different composition functions. $s_1$ and $R$ are fixed and color brightness reflects likelihood of $(s_1, R, s_2)$ for each position of embedding $s_2$. (b) and (d) are respectively more expressive than (a) and (c).
In the DistMult reading of $f_\odot$, $h_1$ and $h_2$ act as entity embeddings and $\theta_R$ as the relation weight $w_R$.
Multiplying $f_C(h_1, h_2)$ by softmax weights $\theta_r$ is equivalent to the ComplEx scoring function $\mathrm{Re}\langle h_1, \theta_r, \bar{h}_2 \rangle$. The first half of $\theta_r$ weights corresponds to the real part of ComplEx relation weights while the last half corresponds to the imaginary part. $f_C$ is to the ComplEx scoring function what $f_\odot$ is to the DistMult scoring function. Intuitively, ComplEx is a minimal way to model interactions between distinct latent dimensions while DistMult only allows for identical dimensions to interact. Figures 2c and 2d illustrate this on a new artificial semantic space.

Table 2: Transfer evaluation tasks. N = number of training examples; C = number of classes if applicable. $h_1$, $h_2$ are sentence representations, $f_{m,s}$ a composition function from section 4.

Table 3: SentEval and base task evaluation results for the models trained on natural language inference ($T$ = NLI); AllNLI is used for training. All scores are accuracy percentages, except STS14, which is Pearson correlation percentage. AVG denotes the average of the SentEval scores.

Table 4: SentEval and base task evaluation results for the models trained on discourse connective prediction ($T$ = Disc). All scores are accuracy percentages, except STS14, which is Pearson correlation percentage. AVG denotes the average of the SentEval scores.

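Table 5: Comparison models from previous work. InferSent represents the original results from Conneau et al. (2017), SkipT is SkipThought from Kiros et al. (2015), and BoW is our re-evaluation of GloVe Bag of Words from Conneau et al. (2017). AVG denotes the average of the SentEval scores.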

Table 6: Results for sentence relation tasks using an alternative composition function ($f_{C_\beta,-}$) during evaluation. AVG denotes the average of the three tasks.