Improving Semantic Composition with Offset Inference

Count-based distributional semantic models suffer from sparsity due to unobserved but plausible co-occurrences in any text collection. This problem is amplified for models such as Anchored Packed Trees (APTs), which take the grammatical type of a co-occurrence into account. We therefore introduce a novel form of distributional inference that exploits the rich type structure in APTs and infers missing data by the same mechanism that is used for semantic composition.


Introduction
Anchored Packed Trees (APTs) is a recently proposed approach to distributional semantics that takes distributional composition to be a process of lexeme contextualisation (Weir et al., 2016). A lexeme's meaning, characterised as knowledge concerning co-occurrences involving that lexeme, is represented with a higher-order dependency-typed structure (the APT) where paths associated with higher-order dependencies connect vertices associated with weighted lexeme multisets. The central innovation in the compositional theory is that the APT's type structure enables the precise alignment of the semantic representation of each of the lexemes being composed. Like other count-based distributional spaces, however, it is prone to considerable data sparsity, caused by not observing all plausible co-occurrences in the given data. Recently, Kober et al. (2016) introduced a simple unsupervised algorithm to infer missing co-occurrence information by leveraging the distributional neighbourhood and ease the sparsity effect in count-based models.
In this paper, we generalise distributional inference (DI) in APTs and show how precisely the same mechanism that was introduced to support distributional composition, namely "offsetting" APT representations, gives rise to a novel form of distributional inference, allowing us to infer co-occurrences from neighbours of these representations. For example, by transforming a representation of white to a representation of "things that can be white", inference of unobserved, but plausible, co-occurrences can be based on finding near neighbours (which will be nouns) of the "things that can be white" structure. This furthermore exposes an interesting connection between distributional inference and distributional composition. Our method is unsupervised and maintains the intrinsic interpretability of APTs.

Offset Representations
The basis of how composition is modelled in the APT framework is the way that the co-occurrences are structured. In characterising the distributional semantics of some lexeme w, rather than just recording a co-occurrence between w and w' within some context window, we follow Padó and Lapata (2007) and record the dependency path from w to w'. This syntagmatic structure makes it possible to appropriately offset the semantic representations of each of the lexemes being composed in some phrase. For example, many nouns will have distributional features starting with the type amod, which cannot be observed for adjectives or verbs. Thus, when composing the adjective white with the noun clothes, the feature spaces of the two lexemes need to be aligned first. This can be achieved by offsetting one of the constituents, which we will explain in more detail in this section.
We will make use of the following notation throughout this work. A typed distributional feature consists of a path and a lexeme such as in amod:white. Inverse paths are denoted by a horizontal bar above the dependency relation such as in $\overline{dobj}$:prefer and higher-order paths are separated by a dot such as in amod.compound:dress.
Offset representations are the central component in the composition process in the APT framework. Figure 1 shows the APT representations for the adjective white (left) and the APT for the noun clothes (right), as might have been observed in a text collection. Each node holds a multiset of lexemes and the anchor of an APT reflects the current perspective of a lexeme at the given node. An offset representation can be created by shifting the anchor along a given path. For example, the lexeme white is at the same node as other adjectives such as black and clean, whereas nouns such as shoes or noise are typically reached via the amod edge.
Offsetting in APTs only involves a change in the anchor; the underlying structure remains unchanged. By offsetting the lexeme white by amod the anchor is shifted along the amod edge, which results in creating a noun view for the adjective white. We denote the offset view of a lexeme for a given path by superscripting the offset path; for example, the amod offset of the adjective white is denoted as white^amod. The offsetting procedure changes the starting points of the paths, as visible in Figure 1 between the anchors for white and white^amod, since paths always begin at the anchor. The red dashed line in Figure 1 reflects that anchor shift. The lexeme white^amod represents a prototypical "white thing", that is, a noun that has been modified by the adjective white. We note that all edges in the APT space are bi-directional, as exemplified in the coloured amod and $\overline{amod}$ edges in the APT for white; however, for brevity we only show uni-directional edges in Figure 1.
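The anchor shift described above can be illustrated on a vectorised APT. The following is a minimal sketch, not the authors' implementation: it assumes that features are stored as (path, lexeme) pairs, that an inverse relation is written with a leading underscore as a plain-text stand-in for the bar, and that offsetting prepends the offset path to each feature path and cancels a relation that meets its own inverse. The function names and the toy weights are hypothetical.

```python
INVERSE_PREFIX = "_"  # assumed stand-in for the bar above an inverse relation


def invert(relation):
    """Return the inverse of a dependency relation, e.g. amod <-> _amod."""
    if relation.startswith(INVERSE_PREFIX):
        return relation[len(INVERSE_PREFIX):]
    return INVERSE_PREFIX + relation


def offset_path(offset, path):
    """Prepend the offset to a path and cancel adjacent relation/inverse pairs."""
    reduced = []
    for relation in tuple(offset) + tuple(path):
        if reduced and reduced[-1] == invert(relation):
            reduced.pop()  # a relation followed by its inverse leads back to the same node
        else:
            reduced.append(relation)
    return tuple(reduced)


def offset_view(features, offset):
    """Re-anchor a vectorised APT: weights are kept, paths are re-expressed from the new anchor."""
    shifted = {}
    for (path, lexeme), weight in features.items():
        key = (offset_path(offset, path), lexeme)
        shifted[key] = shifted.get(key, 0.0) + weight
    return shifted


# Toy features for the adjective "white" (weights are made up for illustration).
white = {
    ((), "clean"): 2.0,                  # other adjectives share the anchor node
    (("_amod",), "shoes"): 3.0,          # nouns modified by "white"
    (("_amod", "_dobj"), "wear"): 1.5,   # verbs taking such nouns as objects
}

white_amod = offset_view(white, ("amod",))
for (path, lexeme), weight in sorted(white_amod.items()):
    print(".".join(path) + ":" + lexeme, weight)
```

In this sketch the noun shoes ends up at the anchor of the offset view (empty path), while the adjective clean is now reached via amod, which mirrors the noun view of white described in the text.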
By considering the APT representations for the lexemes white and clothes in Figure 1, it becomes apparent that lexemes with different parts of speech are located in different areas of the semantic space. If we want to compose the adjective-noun phrase white clothes, we need to offset one of the two constituents to align the feature spaces in order to leverage their distributional commonalities. This can be achieved by either creating a noun offset view of white, by shifting the anchor along the amod edge, or by creating an adjective offset representation of clothes by shifting its anchor along amod. In this work we follow Weir et al. (2016) and always offset the dependent in a given relation. Table 1 shows a subset of the features of Figure 1 as would be represented in a vectorised APT. Vectorising the whole APT lexicon results in a very high-dimensional and sparse typed distributional space. The features for white^amod (middle column) highlight the change in feature space caused by offsetting the adjective white. The features of the offset view white^amod are now aligned with the noun clothes such that the two can be composed. Composition can be performed by either selecting the union or intersection of the aligned features.
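Once the feature spaces are aligned, composition by union or intersection can be sketched as below. This is an illustrative assumption rather than the paper's code: the feature strings, the toy weights, the function name `compose`, and the choice of pointwise minimum or addition for combining shared weights are all hypothetical.

```python
import operator


def compose(left, right, mode="intersection", combine=min):
    """Compose two aligned sparse vectors (dicts mapping typed feature -> weight).

    mode="intersection" keeps only features present in both vectors;
    mode="union" keeps every feature from either vector.
    Weights of shared features are merged with `combine`.
    """
    if mode == "intersection":
        keys = left.keys() & right.keys()
    elif mode == "union":
        keys = left.keys() | right.keys()
    else:
        raise ValueError("mode must be 'intersection' or 'union'")

    composed = {}
    for key in keys:
        if key in left and key in right:
            composed[key] = combine(left[key], right[key])
        else:
            # Only reachable in union mode: copy the weight from whichever side has it.
            composed[key] = left[key] if key in left else right[key]
    return composed


# Toy aligned vectors: the amod offset view of "white" and the noun "clothes".
white_amod = {"amod:clean": 2.0, ":shoes": 3.0, "_dobj:wear": 1.5}
clothes = {"amod:wet": 1.0, ":dress": 2.5, "_dobj:wear": 4.0}

intersection = compose(white_amod, clothes, mode="intersection", combine=min)
union = compose(white_amod, clothes, mode="union", combine=operator.add)
print(intersection)
print(sorted(union.items()))
```

Intersection acts as a filter that keeps only what both constituents agree on, whereas union retains everything observed for either constituent.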

Qualitative Analysis of Offset Representations
Any offset view of a lexeme is behaviourally identical to a "normal" lexeme. It has an associated part of speech, a distributional representation which locates it in semantic space, and we can find neighbours for it in the same way that we find neighbours for any other lexeme. In this way, a single APT data structure is able to provide many different views of any given lexeme. These views reflect the different ways in which the lexeme is used. For example, law^nsubj is the nsubj offset representation of the noun law. This lexeme is a verb and represents an action carried out by the law. This contrasts with law^dobj, which is the dobj offset representation of the noun law. It is also a verb, but represents actions done to the law. Table 2 lists the 10 nearest neighbours for a number of lexemes, offset by amod, dobj and nsubj respectively. For example, the neighbourhood of the lexeme ancient in Table 2 shows that the offset view ancient^amod is a prototypical representation of an "ancient thing", with neighbours easily associated with the property ancient. Furthermore, Table 2 illustrates that nearest neighbours of offset views are often other offset representations. This means that, for example, actions carried out by a mother tend to be similar to actions carried out by a father or a parent.

Offset Inference
Our approach generalises the unsupervised algorithm proposed by Kober et al. (2016), henceforth "standard DI", as a method for inferring missing knowledge into an APT representation. Rather than simply inferring potentially plausible, but unobserved co-occurrences from near distributional neighbours, inferences can be made involving offset APTs. For example, the adjective white can be offset so that it represents a noun, a prototypical "white thing". This allows inferring plausible co-occurrences from other "things that can be white", such as shoes or shirts. Our algorithm therefore reflects the contextualised use of a word. This has the advantage of being able to make flexible and fine-grained distinctions in the inference process. For example, if the noun law is used as a subject, our algorithm allows inferring plausible co-occurrences from "other actions carried out by the law". This contrasts with the use of law as an object, where offset inference is able to find co-occurrences on the basis of "other actions done to the law". This is a crucial advantage over the method of Kober et al. (2016), which only supports inference on uncontextualised lexemes.
A sketch of how offset inference for a lexeme w works is shown in Algorithm 1. Our algorithm requires a distributional model M, an APT representation for the lexeme w for which to perform offset inference, a dependency path p describing the offset for w, and the number of neighbours k. The offset representation of w is then enriched with the information from its distributional neighbours by some merge function. We note that if the offset path p is the empty path, we would recover the algorithm presented by Kober et al. (2016). Our algorithm is unsupervised, and agnostic to the input distributional model and the neighbour retrieval function. An interesting observation is the similarity between distributional inference and distributional composition, as both operations are realised by the same mechanism: an offset followed by inferring plausible co-occurrence counts for a single lexeme in the case of distributional inference, or for a phrase in the case of composition. The merging of co-occurrence dimensions for distributional inference can also be any of the operations commonly used for distributional composition such as pointwise minimum, maximum, addition or multiplication.
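The procedure described above (offset w by p, retrieve the k nearest neighbours of the offset view, merge their features) might look as follows in Python. This is a hedged sketch, not a reproduction of Algorithm 1: the sparse dictionary model, the underscore notation for inverse relations, cosine similarity as the neighbour retrieval function, pointwise addition as the default merge, and copying a neighbour's weight for features the offset view lacks are all assumptions made for illustration, and the toy data is invented.

```python
import math
from collections import Counter


def invert(rel):
    """Inverse of a relation; a leading underscore stands in for the bar (assumed notation)."""
    return rel[1:] if rel.startswith("_") else "_" + rel


def offset(vector, path):
    """Shift the anchor along `path`: prepend it to each feature path and cancel inverse pairs."""
    shifted = Counter()
    for (feature_path, lexeme), weight in vector.items():
        reduced = []
        for rel in tuple(path) + tuple(feature_path):
            if reduced and reduced[-1] == invert(rel):
                reduced.pop()
            else:
                reduced.append(rel)
        shifted[(tuple(reduced), lexeme)] += weight
    return shifted


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(feature, 0.0) for feature, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def offset_inference(model, w, path, k, merge=lambda a, b: a + b):
    """Enrich the offset view of `w` with the features of its k nearest neighbours."""
    w_offset = offset(model[w], path)
    candidates = [n for n in model if n != w]
    neighbours = sorted(candidates, key=lambda n: cosine(w_offset, model[n]), reverse=True)[:k]
    enriched = dict(w_offset)
    for n in neighbours:
        for feature, weight in model[n].items():
            # Unobserved features are inferred by copying the neighbour's weight.
            enriched[feature] = merge(enriched[feature], weight) if feature in enriched else weight
    return enriched


# Toy model with made-up weights.
model = {
    "white": {((), "clean"): 2.0, (("_amod",), "shoes"): 3.0, (("_amod", "_dobj"), "wear"): 1.0},
    "shirt": {(("amod",), "clean"): 1.0, (("_dobj",), "wear"): 2.0, (("_dobj",), "iron"): 1.0},
    "law": {(("_dobj",), "break"): 3.0, (("amod",), "new"): 1.0},
}

enriched = offset_inference(model, "white", ("amod",), k=1)
for (path, lexeme), weight in sorted(enriched.items()):
    print(".".join(path) + ":" + lexeme, weight)
```

With an empty offset path the `offset` call leaves the vector unchanged, so the same function then performs inference from the neighbours of the uncontextualised lexeme, in line with the remark that the empty path recovers standard DI.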
This relation creates an interesting dynamic between distributional inference and composition when used in a complementary manner as in this work. The former can be used as a process of co-occurrence embellishment, which adds missing information, albeit with the risk of introducing some noise. The latter, on the other hand, can be used as a process of co-occurrence filtering, that is, leveraging the enriched representations while also sieving out the previously introduced noise.

Experiments
For our experiments we re-implemented the standard DI method of Kober et al. (2016) for a direct comparison. We built an order 2 APT space on the basis of the concatenation of ukWaC, Wackypedia and the BNC (Baroni et al., 2009), pre-parsed with the Malt parser (Nivre et al., 2006). We PPMI-transformed the raw co-occurrence counts prior to composition, using a negative SPPMI shift of log 5 (Levy and Goldberg, 2014b). We also experimented with composing normalised counts and applying the PPMI transformation after composition as done by Weeds et al. (2017), but found composing PPMI scores to work better for this task. We evaluate our offset inference algorithm on two popular short phrase composition benchmarks by Mitchell and Lapata (2008) and Mitchell and Lapata (2010), henceforth ML08 and ML10 respectively. The ML08 dataset consists of 120 distinct verb-object (VO) pairs and the ML10 dataset contains 108 adjective-noun (AN), 108 noun-noun (NN) and 108 verb-object pairs. The goal is to compare a model's similarity estimates to human-provided judgements. For both tasks, each phrase pair has been rated by multiple human annotators on a scale between 1 and 7, where 7 indicates maximum similarity. Comparison with human judgements is achieved by calculating Spearman's ρ between the model's similarity estimates and the scores of each human annotator individually. We performed composition by intersection and tuned the number of neighbours by a grid search over {0, 10, 30, 50, 100, 500, 1000} on the ML10 development set, selecting 10 neighbours for NNs, 100 for ANs and 50 for VOs for both DI algorithms. We calculate statistical significance using the method of Steiger (1980).
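The shifted PPMI weighting mentioned above can be sketched as follows. The sketch uses the standard definition SPPMI(w, c) = max(PMI(w, c) - log k, 0) with k = 5; the data layout, the function name `shifted_ppmi` and the toy counts are assumptions for illustration and do not reflect the actual experimental pipeline.

```python
import math
from collections import defaultdict


def shifted_ppmi(counts, shift=5):
    """Map raw co-occurrence counts {(word, feature): count} to shifted PPMI weights.

    A weight is max(PMI - log(shift), 0); entries whose weight is zero are dropped
    so that the result stays sparse.
    """
    total = sum(counts.values())
    word_totals = defaultdict(float)
    feature_totals = defaultdict(float)
    for (word, feature), count in counts.items():
        word_totals[word] += count
        feature_totals[feature] += count

    weights = {}
    for (word, feature), count in counts.items():
        if count <= 0:
            continue
        pmi = math.log(count * total / (word_totals[word] * feature_totals[feature]))
        value = pmi - math.log(shift)
        if value > 0:
            weights[(word, feature)] = value
    return weights


# Toy counts, invented for illustration.
counts = {
    ("white", "_amod:shoes"): 8,
    ("white", "_amod:noise"): 1,
    ("loud", "_amod:noise"): 8,
    ("loud", "_amod:shoes"): 1,
}

plain = shifted_ppmi(counts, shift=1)    # shift of log 1 = 0, i.e. plain PPMI
shifted = shifted_ppmi(counts, shift=5)  # shift of log 5
print(plain)
print(shifted)
```

On this tiny example the shift of log 5 removes every association, which illustrates how the shift prunes weakly associated features; on a large corpus only the strongly associated co-occurrences survive.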

Effect of the number of neighbours
Figure 2 shows the effect of the number of neighbours for AN, NN and VO phrases, using offset inference, on the ML10 development set. Interestingly, NN compounds exhibit an early saturation effect, while VOs and ANs require more neighbours for optimal performance. One explanation for the observed behaviour is that up to some threshold, the neighbours being added contribute actually missing co-occurrence events, whereas past that threshold distributional inference degrades to just generic smoothing that is simply compensating for sparsity, but overwhelming the representations with non-plausible co-occurrence information. A similar effect has also been observed by Erk and Padó (2010) in an exemplar-based model.

Results
Table 3 shows that both forms of distributional inference significantly outperform a baseline without DI. On average, offset inference outperforms the method of Kober et al. (2016) by a statistically significant margin on both datasets.

Related Work
Distributional inference has its roots in the work of Dagan et al. (1993, 1994), who aim to find probability estimates for unseen words in bigrams, and Schütze (1992, 1998), who leverages the distributional neighbourhood through clustering of contexts for word-sense discrimination. Recently, Kober et al. (2016) revitalised the idea for compositional distributional semantic models.
Composition with distributional semantic models has become a popular research area in recent years. Simple yet competitive methods are based on pointwise vector addition or multiplication (Mitchell and Lapata, 2008, 2010). However, these approaches neglect the structure of the text, defining composition as a commutative operation.
Perhaps the most popular approach in the literature to evaluating compositional distributional semantic models is to compare human word and phrase similarity judgements with similarity estimates of composed meaning representations, under the assumption that better distributional representations will perform better at these tasks (Blacoe and Lapata, 2012; Dinu et al., 2013; Erk and Padó, 2008; Hashimoto et al., 2014; Hermann and Blunsom, 2013; Kiela et al., 2014; Turney, 2012).

Conclusion
In this paper we have introduced a novel form of distributional inference that generalises the method introduced by Kober et al. (2016). We have shown its effectiveness for semantic composition on two benchmark phrase similarity tasks where we achieved state-of-the-art performance while retaining the interpretability of our model. We have furthermore highlighted an interesting connection between distributional inference and distributional composition.
In future work we aim to apply our novel method to improve the modelling of selectional preferences and lexical inference, and to scale up to longer phrases and full sentences.

Figure 1: Structured distributional APT space. Different colours reflect different parts of speech. Boxes denote the current anchor of the APT, circles represent nodes in the APT space, holding lexemes, and edges represent their relationship within the space.

Figure 2: Effect of the number of neighbours on the ML10 development set.

Table 1: Sample of vectorised features for the APTs shown in Figure 1. Offsetting white by amod creates an offset view, white^amod, representing a noun, and has the consequence of aligning the feature space with clothes.

Table 3: Comparison of DI algorithms. ‡ denotes statistical significance at p < 0.01 in comparison to the method without DI, * denotes statistical significance at p < 0.01 in comparison to standard DI, and † denotes statistical significance at p < 0.05 in comparison to standard DI.

Table 4: Comparison with existing methods.