Improving Sparse Word Representations with Distributional Inference for Semantic Composition

Distributional models are derived from co-occurrences in a corpus, where only a small proportion of all possible plausible co-occurrences will be observed. This results in a very sparse vector space, requiring a mechanism for inferring missing knowledge. Most methods face this challenge in ways that render the resulting word representations uninterpretable, with the consequence that semantic composition becomes hard to model. In this paper we explore an alternative which involves explicitly inferring unobserved co-occurrences using the distributional neighbourhood. We show that distributional inference improves sparse word representations on several word similarity benchmarks and demonstrate that our model is competitive with the state-of-the-art for adjective-noun, noun-noun and verb-object compositions while being fully interpretable.


Introduction
The aim of distributional semantics is to derive meaning representations based on observing co-occurrences of words in large text corpora. However, not all plausible co-occurrences will be observed in any given corpus, resulting in word representations that only capture a fragment of the meaning of a word. For example, the verbs "walking" and "strolling" may occur in many different and possibly disjoint contexts, although both verbs would be equally plausible in numerous cases. This subsequently results in incomplete representations for both lexemes. In addition, models based on counting co-occurrences face the general problem of sparsity in a very high-dimensional vector space. The most common approaches to these challenges have involved the use of various techniques for dimensionality reduction (Bullinaria and Levy, 2012; Lapesa and Evert, 2014) or the use of low-dimensional and dense neural word embeddings (Mikolov et al., 2013; Pennington et al., 2014). The common problem in both of these approaches is that composition becomes a black-box process due to the lack of interpretability of the representations. Count-based models are therefore a very attractive line of work with regard to a number of important long-term research challenges, most notably the development of an adequate model of distributional compositional semantics. In this paper we propose the use of distributional inference (DI) to inject unobserved but plausible distributional semantic knowledge into the vector space by leveraging the intrinsic structure of the distributional neighbourhood. This results in richer word representations and furthermore mitigates the sparsity effect common in high-dimensional vector spaces, while remaining fully interpretable. Our contributions are as follows: we show that typed and untyped sparse word representations, enriched by distributional inference, lead to performance improvements on several word similarity benchmarks, and that a higher-order dependency-typed vector space
model, based on "Anchored Packed Dependency Trees (APTs)" (Weir et al., 2016), is competitive with the state-of-the-art for adjective-noun, noun-noun and verb-object compositions. Using our method, we are able to bridge the gap in performance between high-dimensional interpretable models and low-dimensional non-interpretable models, and offer evidence to support a possible explanation of why high-dimensional models usually perform worse, together with a simple, practical method for overcoming this problem. We furthermore demonstrate that intersective approaches to composition benefit more from distributional inference than composition by union, and highlight the ability of composition by intersection to disambiguate the meaning of a phrase in a local context. The remainder of this paper is structured as follows: we discuss related work in section 2, followed by an introduction of the APT framework for semantic composition in section 3. We describe distributional inference in section 4 and present our experimental work, together with our results, in section 5. We conclude this paper and outline future work in section 6.

Related Work
Our method follows the distributional smoothing approach of Dagan et al. (1994) and Dagan et al. (1997). In these works the authors are concerned with smoothing the probability estimate for unseen words in bigrams. This is achieved by measuring which unobserved bigrams are more likely than others on the basis of the Kullback-Leibler divergence between bigram distributions. This has led to significantly improved performance on a language modelling for speech recognition task, as well as for word-sense disambiguation in machine translation (Dagan et al., 1994; Dagan et al., 1997). More recently, Padó et al. (2013) used a distributional approach for smoothing derivationally related words, such as oldish - old, as a back-off strategy in case of data sparsity. However, none of these approaches have used distributional inference as a general technique for directly enriching sparse distributional vector representations, or have explored its behaviour for semantic composition. Compositional models of distributional semantics have become an increasingly popular topic in the research community. Approaches range from simple pointwise additive and multiplicative composition, such as Mitchell and Lapata (2008; 2010) and Blacoe and Lapata (2012), to tensor-based models, such as Baroni and Zamparelli (2010), Coecke et al. (2010), Grefenstette et al. (2013) and Paperno et al. (2014), and neural network based approaches, such as Socher et al. (2012), Le and Zuidema (2015), Mou et al. (2015) and Tai et al. (2015). Zanzotto et al. (2015) provide a decompositional analysis of how similarity is affected by distributional composition, and link compositional models to convolution kernels. Most closely related to our approach of composition are the works of Thater et al. (2010), Thater et al. (2011) and Weeds et al.
(2014), which aim to provide a general model of compositionality in a typed distributional vector space. In this paper we adopt the approach to distributional composition introduced by Weir et al. (2016), whose APT framework is based on a higher-order dependency-typed vector space; however, they do not address the issue of sparsity in their work.

Background
Distributional vector space models can broadly be categorised into untyped proximity-based models and typed models (Baroni and Lenci, 2010). Examples of the former include Deerwester et al. (1990); Lund and Burgess (1996); Curran (2004); Sahlgren (2006); Bullinaria and Levy (2007) and Turney and Pantel (2010). These models count the number of times every word in a large corpus co-occurs with other words within a specified spatial context window, without leveraging the structural information of the text. Typed models, on the other hand, take the grammatical relation between two words for a co-occurrence event into account. Early proponents of that approach are Grefenstette (1994) and Lin (1998). More recent work by Padó and Lapata (2007), Erk and Padó (2008) and Weir et al. (2016) uses dependency paths to build a structured vector space model. In both kinds of models, the raw counts are usually transformed by Positive Pointwise Mutual Information (PPMI) or a variant of it (Church and Hanks, 1990; Niwa and Nitta, 1994; Scheible et al., 2013; Levy and Goldberg, 2014). In the following we give an explanation of the theory of composition with APTs as introduced by Weir et al. (2016), which we adopt in this paper. In addition to direct relations between two words, the APT model also considers inverse and higher-order relations. Inverse relations are denoted with a horizontal bar above the dependency relation, such as amod for an inverse adjectival modifier. Higher-order dependencies are separated by a colon, as in the second-order distributional feature dobj:nsubj. The example below illustrates how raw text is processed to retrieve elementary representations in our APT model. As an example we consider a lowercased corpus consisting of the sentences:

  we folded the clean clothes
  i like your clothes
  we bought white shoes yesterday
  he folded the white sheets

We dependency parse the raw sentences and, following Weir et al.
(2016), align and aggregate the resulting parse trees according to their dependency type as shown in Figure 1. For example, the lexeme clothes has the distributional features amod:clean and dobj:nsubj:we among others. Over a large corpus, this results in a very high-dimensional and sparse vector space, which due to its typed nature is much sparser than for untyped models.

Composition with APTs
Composition is linguistically motivated by the principle of compositionality, which states that the meaning of a complex expression is fully determined by its structure and the meanings of its constituents (Frege, 1884). Many simple approaches to semantic composition neglect the structure and lose information in the composition process. For example, the phrases house boat and boat house have the exact same representation when composition is done via a pointwise arithmetic operation. Despite performing well in a number of studies, this commutativity is not desirable for a fine-grained understanding of the semantics of natural language. When performing composition with APTs, we adopt the method introduced by Weir et al. (2016), which views distributional composition as a process of contextualisation. For composing the adjective white with the noun clothes via the dependency relation amod, we need to consider how the adjective interacts with the noun in the vector space. The distributional features of white describe things that are white via their first-order relations such as amod, and things that can be done to white things, such as bought via amod:dobj in the example above.
Table 1 shows a number of features extracted from the aligned dependency trees in Figure 1 and highlights that adjectives and nouns would not share many features if only first-order dependencies were considered. However, through the inclusion of inverse and higher-order dependency paths we can observe that the second-order features of the adjective align with the first-order features of the noun. For composition, the adjective white needs to be offset by its inverse relation to clothes, making it distributionally similar to a noun that has been modified by white. Offsetting can be seen as shifting the current viewpoint in the APT data structure and is necessary for aligning the feature spaces for composition (Weir et al., 2016). We are then in a position to compose the offset representation of white with the vector for clothes by the union or the intersection of their features.
Table 2 shows the resulting feature spaces of the composed vectors. It is worth noting that any arithmetic operation can be used to combine the counts of the aligned features; however, for this paper we use pointwise addition for both composition functions. One of the advantages of this approach to composition is that the inherent interpretability of count-based models naturally expands beyond the word level, allowing us to study the distributional semantics of phrases in the same space as words. Due to offsetting one of the constituents, the composition operation is not commutative and hence avoids identical representations for house boat and boat house.
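The two composition functions just described, union and intersection of typed features with counts combined by pointwise addition, can be sketched over sparse dictionary-backed vectors. This is an illustrative reconstruction, not the authors' code; the feature names and weights in the toy example are invented:

```python
# Sparse typed vectors as {dependency-typed feature: weight} dictionaries.

def compose_union(u, v):
    """Union composition: pointwise addition over all features of either vector."""
    out = dict(u)
    for feat, weight in v.items():
        out[feat] = out.get(feat, 0.0) + weight
    return out

def compose_intersection(u, v):
    """Intersective composition: add weights only for features both vectors share."""
    return {feat: u[feat] + v[feat] for feat in u.keys() & v.keys()}

# Toy example: an (already offset) adjective vector and a noun vector.
white_offset = {":shoes": 1.0, "dobj:buy": 1.0}
clothes      = {"dobj:fold": 2.0, "dobj:buy": 1.0, "amod:clean": 1.0}

union = compose_union(white_offset, clothes)         # keeps all 4 features
inter = compose_intersection(white_offset, clothes)  # keeps only the shared dobj:buy
```

Because the adjective is offset before composing, swapping the arguments yields a different result, which is how the framework avoids identical representations for house boat and boat house.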
However, the typed nature of our vector space results in extreme sparsity: while the untyped VSM has 130k dimensions, our APT model can have more than 3m dimensions. We therefore need to enrich the elementary vector representations with the distributional information of their nearest neighbours to ease the sparsity effect and infer missing information. Due to the syntactic nature of our composition operation it is not straightforward to apply common dimensionality reduction techniques such as SVD, as the type information needs to be preserved.

Distributional Inference
Following Dagan et al. (1994) and Dagan et al. (1997), we propose a simple unsupervised algorithm for enriching sparse vector representations with their nearest neighbours. We show that our distributional inference algorithm improves performance for untyped and typed models on several word similarity benchmarks, as well as being competitive with the state-of-the-art on semantic composition. As shown in Algorithm 1 below, we iterate over all word vectors w in a given distributional model M, and add the vector representations of the nearest neighbours n, determined by cosine similarity, to the representation of the enriched word vector w′. The parameter α in line 4 scales the contribution of the original word vector to the resulting enriched representation. In this work we always chose α to be identical to the number of neighbours used for distributional inference. For example, if we used 10 neighbours for DI, we would set α = 10, which we found sufficient to prevent the neighbours from dominating the vector representation. In our experiments we kept the input distributional model fixed; however, it is equally possible to update the given model in an online fashion, adding some amount of stochasticity to the enriched word vector representations. There are a number of possibilities for the neighbour retrieval function neighbours() and we explore several options in this paper. The algorithm is furthermore agnostic to the input distributional model; for example, it is possible to use completely different vector space models for querying neighbours and enrichment.
Algorithm 1 Distributional Inference
1: Input: distributional model M, scaling parameter α
2: Output: enriched distributional model M′
3: for all w in M do
4:   w′ ← α · w
5:   for all n in neighbours(M, w) do
6:     w′ ← w′ + n

Static Top n Neighbour Retrieval

Perhaps the simplest way is to choose the top n most similar neighbours for each word in the vector space and enrich the respective vector representations with them.
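Algorithm 1, with the static top n retrieval function, can be sketched in pure Python over sparse feature-to-weight dictionaries. This is an illustrative reconstruction, not the authors' implementation; the toy model at the bottom is invented:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse {feature: weight} vectors."""
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def top_n_neighbours(model, word, n):
    """Static top-n retrieval: the n words most cosine-similar to `word`."""
    sims = [(other, cosine(model[word], vec)) for other, vec in model.items() if other != word]
    sims.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in sims[:n]]

def distributional_inference(model, n=30):
    """Enrich every vector with its top-n neighbours; alpha = n as in the paper."""
    alpha = n
    enriched = {}
    for word, vec in model.items():
        new_vec = {f: alpha * w for f, w in vec.items()}  # scale original vector (line 4)
        for neighbour in top_n_neighbours(model, word, n):
            for f, w in model[neighbour].items():          # add neighbour counts (line 6)
                new_vec[f] = new_vec.get(f, 0.0) + w
        enriched[word] = new_vec
    return enriched

# Toy model: "walk" inherits the unobserved feature prep:park from "stroll".
model = {
    "walk":   {"amod:slow": 1.0, "nsubj:man": 1.0},
    "stroll": {"amod:slow": 1.0, "prep:park": 1.0},
    "eat":    {"dobj:food": 1.0},
}
enriched = distributional_inference(model, n=1)
```

Swapping `top_n_neighbours` for another retrieval function (density window, WordNet) leaves the enrichment loop unchanged, which is the sense in which the algorithm is agnostic to the neighbour source.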

Density based Neighbour Retrieval
This approach has its roots in kernel density estimation (Parzen, 1962); however, instead of defining a static global Parzen window, we set the window size for every word individually, depending on the distance to its nearest neighbour, plus a threshold ε. For example, if the cosine distance between the target vector and its top neighbour is 0.5, we use a window size of 0.5 + ε for that word. In our experiments we typically define ε to be proportional to the distance of the nearest neighbour (e.g. ε = 0.5 × 0.1).
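A minimal sketch of this retrieval function, assuming sparse vectors stored as feature-to-weight dictionaries; the model and the default proportionality constant are illustrative, not taken from the paper:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse {feature: weight} vectors."""
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def density_window_neighbours(model, word, rel_epsilon=0.1):
    """Per-word Parzen-style window: keep every neighbour whose cosine distance
    to the target is at most d1 + eps, where d1 is the distance to the nearest
    neighbour and eps = rel_epsilon * d1 (the proportional threshold in the text)."""
    dists = sorted(
        (1.0 - cosine(model[word], vec), other)
        for other, vec in model.items() if other != word
    )
    d1 = dists[0][0]                      # distance to the nearest neighbour
    window = d1 + rel_epsilon * d1        # e.g. d1 = 0.5 -> window = 0.55
    return [other for dist, other in dists if dist <= window]

# Toy model: "n1" falls inside the window around "t", "n2" does not.
model = {"t": {"a": 1.0}, "n1": {"a": 1.0, "b": 1.0}, "n2": {"b": 1.0}}
```

Unlike static top n, the number of neighbours returned here varies per word: dense regions of the space contribute more neighbours than sparse ones.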

WordNet based Neighbour Retrieval
Instead of leveraging the intrinsic structure of our distributional vector space, we retrieve neighbours by querying WordNet (Fellbaum, 1998), and treat synsets with agreeing PoS tags as the nearest neighbours of any target vector.This restricts the retrieved neighbours to synonyms only.

Experiments
Our model is based on a cleaned October 2013 Wikipedia dump, which excludes all pages with fewer than 20 page views, resulting in a corpus of approximately 0.6 billion tokens (Wilson, 2015). The corpus is lowercased, tokenised, lemmatised, PoS tagged and dependency parsed with the Stanford NLP tools, using universal dependencies (Manning et al., 2014; de Marneffe et al., 2014). We then build our APT model with first, second and third order relations. We remove distributional features with a count of less than 10, and vectors containing fewer than 50 non-zero entries. The raw counts are subsequently transformed to PPMI weights. The untyped vector space model is built from the same lowercased, tokenised and lemmatised Wikipedia corpus. We discard terms with a frequency of less than 50 and apply PPMI to the raw co-occurrence counts.

Shifted PPMI
We explore a range of different values for shifting the PPMI scores, as these have a significant impact on the performance of the APT model. The effect of shifting PPMI scores for untyped vector space models has already been explored in Levy and Goldberg (2014) and Levy et al. (2015), thus we only present results for the APT model. As shown in equation 1, PMI is defined as the log of the ratio of the joint probability of observing a word w and a context c together, and the product of the respective marginals of observing them separately:

  PMI(w, c) = log [ P(w, c) / (P(w) P(c)) ]    (1)

In our APT model, a context c is defined as a dependency relation together with a word.
As PMI is negatively unbounded, PPMI is used to ensure that all values are greater than or equal to 0. Shifted PPMI (SPPMI) subtracts a constant from each PMI score before applying the PPMI threshold. We experiment with values of 1, 5, 10, 40 and 100 for the shift parameter k.
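A small sketch of the SPPMI transformation over raw (word, context) counts. This follows the Levy and Goldberg (2014) formulation, which subtracts log k from each PMI score; whether the shift applied here is k itself or log k is a detail the text above does not fix, so treat the exact shift as an assumption:

```python
import math
from collections import Counter

def sppmi(cooc, k=1.0):
    """Shifted PPMI over raw (word, context) co-occurrence counts.
    SPPMI(w, c) = max(PMI(w, c) - log k, 0), keeping only positive entries."""
    total = sum(cooc.values())
    w_counts, c_counts = Counter(), Counter()
    for (w, c), n in cooc.items():
        w_counts[w] += n
        c_counts[c] += n
    shift = math.log(k)
    out = {}
    for (w, c), n in cooc.items():
        # PMI = log( P(w,c) / (P(w) P(c)) ), computed directly from counts.
        pmi = math.log((n * total) / (w_counts[w] * c_counts[c]))
        val = pmi - shift
        if val > 0:
            out[(w, c)] = val
    return out
```

With k = 1 the shift is zero and SPPMI reduces to plain PPMI; larger k prunes weakly associated (word, context) pairs, which matches the "cleaning" effect described for the APT vectors below.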

Word Similarity Experiments
We first evaluate our models on three word similarity benchmarks: MEN (Bruni et al., 2014), which tests for relatedness (e.g. meronymy or holonymy) between terms; SimLex-999 (Hill et al., 2015), which tests for substitutability (e.g. synonymy, antonymy, hyponymy and hypernymy); and WordSim-353 (Finkelstein et al., 2001), where we use the version of Agirre et al. (2009), who split the dataset into a relatedness and a substitutability subset. Baroni and Lenci (2011) have shown that untyped models are typically better at capturing relatedness, whereas typed models are better at encoding substitutability. Performance is measured by computing Spearman's ρ between the cosine similarities of the vector representations and the corresponding aggregated human similarity judgements. For these experiments we keep the number of neighbours that a word vector can consume fixed at 30. This value is based on preliminary experiments on WordSim-353 (see Figure 2) using the static top n neighbour retrieval function and a PPMI shift of k = 40. Figure 2 shows that distributional inference improves performance for any number of neighbours over a model without DI (marked as horizontal dashed lines for each WordSim-353 subset) and peaks at a value of 30. Performance slightly degrades with more neighbours. For the untyped VSM we use a symmetric window of 5 on either side of the target word. Table 3 highlights the effect of the SPPMI shift parameter k, while keeping the number of neighbours fixed at 30 and using the static top n neighbour retrieval function. For the APT model, a value of k = 40 performs best (except for SimLex-999, where smaller shifts give better results), with a performance drop-off for larger shifts. In our experiments we find that a shift of k = 1 results in top performance for the untyped vector space model. It appears that shifting the PPMI scores in the APT model has the effect of cleaning the vectors of noisy PPMI artefacts, which reinforces the predominant sense, while
other senses get suppressed. Subsequently, this results in a cleaner neighbourhood around the word vector, dominated by a single sense. This explains why distributional inference slightly degrades performance for smaller values of k.
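The evaluation protocol used throughout this section, Spearman's ρ between model similarities and aggregated human judgements, can be sketched in pure Python. This is a generic implementation with average ranks for ties, not the authors' evaluation code:

```python
def _ranks(xs):
    """Average ranks, 1-based; tied values share the mean of their rank positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks

def spearman_rho(model_scores, human_scores):
    """Spearman's rho: Pearson correlation computed on the ranks of the scores."""
    rx, ry = _ranks(model_scores), _ranks(human_scores)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return cov / var
```

In the experiments, `model_scores` would hold the cosine similarities of the word (or composed phrase) pairs and `human_scores` the corresponding aggregated judgements from the benchmark.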
Table 4 shows that distributional inference successfully infers missing information for both model types, resulting in improved performance over models without the use of DI on all datasets. The improvements are typically larger for the APT model, suggesting that it is missing more distributional knowledge in its elementary representations than untyped models. The density window and static top n neighbour retrieval functions perform very similarly; however, the static approach is more consistent and never underperforms the baseline for either model type on any dataset. The WordNet based neighbour retrieval function performs particularly well on SimLex-999. This can be explained by the fact that antonyms, which frequently happen to be among the nearest neighbours in distributional vector spaces, are regarded as dissimilar in SimLex-999, whereas the WordNet neighbour retrieval function only returns synonyms. The results furthermore confirm the effect that untyped models perform better on datasets modelling relatedness, whereas typed models work better for substitutability tasks (Baroni and Lenci, 2011).

Composition Experiments
Our approach to semantic composition as described in section 3 requires the dimensions of our vector space models to be meaningful and interpretable. However, the problem of missing information is amplified in compositional settings, as many compatible dimensions between words are not observed in the source corpus. It is therefore crucial that distributional inference is able to inject some of the missing information in order to improve the composition process. For the experiments involving semantic composition, we enrich the elementary representations of the phrase constituents before composition. We first conduct a qualitative analysis for our APT model and observe the effect of distributional inference on the nearest neighbours of composed adjective-noun, noun-noun and verb-object compounds. In these experiments, we show how distributional inference changes the neighbourhood in which composed phrases are embedded, and highlight the difference between composition by union and composition by intersection. For this experiment we use the static top n neighbour retrieval function with 30 neighbours and k = 40.
Table 5 shows a small number of example phrases together with their top 3 nearest neighbours, computed from the union of all words in the Wikipedia corpus and all phrase pairs in the Mitchell and Lapata (2010) dataset. As can be seen, nearest neighbours of phrases can be either single words or other composed phrases. Words or phrases marked with "*" in Table 5 mean that DI introduced, or failed to downrank, a spurious neighbour, while boldface means that performing distributional inference resulted in a neighbourhood more coherent with the query phrase than without DI.
Table 5 shows that composition by union is unable to downrank unrelated neighbours introduced by distributional inference. For example, large quantity is incorrectly introduced as a top-ranked neighbour for the phrase small house, due to the proximity of small and large in the vector space. The phrases market leader and television programme are two examples of incoherent neighbours which the composition function was unable to downrank and where DI could not improve the neighbourhood. Composition by intersection, on the other hand, vastly benefits from distributional inference. Due to the increased sparsity induced by the composition process, a neighbourhood without DI produces numerous spurious neighbours, as in the case of the verb have as a neighbour for win battle. Distributional inference introduces qualitatively better neighbours for almost all phrases. For example, government leader and opposition member are introduced as top-ranked neighbours for the phrase party leader, and stress importance and underline are introduced as new top neighbours for the phrase emphasise need.
These results show that composition by union does not have the ability to disambiguate the meaning of a word in a given phrasal context, whereas composition by intersection has that ability but requires distributional inference to unleash its full potential.
For a quantitative analysis of distributional inference for semantic composition, we evaluate our model on the composition dataset of Mitchell and Lapata (2010), consisting of 108 adjective-noun, 108 noun-noun, and 108 verb-object pairs. The task is to compare the model's similarity estimates with the human judgements by computing Spearman's ρ. For comparing the performance of the different neighbour retrieval functions, we choose the same parameter settings as in the word similarity experiments (k = 40 and using 30 neighbours for DI).
Table 6 shows that the static top n and density window neighbour retrieval functions again perform very similarly. The density window retrieval function outperforms static top n for composition by intersection and vice versa for composition by union. The WordNet approach is competitive for composition by union, but significantly underperforms the other approaches for composition by intersection. For further experiments we use the static top n approach, as it is computationally cheap and easy to interpret due to the fixed number of neighbours. Table 6 also shows that while composition by intersection is significantly improved by distributional inference, composition by union does not appear to benefit from it.

Composition by Union or Intersection
Both model types in this study support composition by union as well as composition by intersection. In untyped models, composition by union and composition by intersection can be achieved by pointwise addition and pointwise multiplication respectively. The major difference between composition in the APT model and the untyped model is that in the former, composition is not commutative, due to offsetting the modifier in a dependency relation (see section 3). Blacoe and Lapata (2012) showed that an intersective composition function such as pointwise multiplication represents a competitive and robust approach in comparison to more sophisticated composition methods. For the final set of experiments on the Mitchell and Lapata (2010) dataset, we present results for the APT model and the untyped model, using composition by union and composition by intersection, with and without distributional inference. We compare our models with the best performing untyped VSMs of Mitchell and Lapata (2010) and Blacoe and Lapata (2012), the best performing APT model of Weir et al. (2016), as well as with the recently published state-of-the-art methods by Hashimoto et al. (2014) and Wieting et al.
(2015), who are using neural network based approaches. For our models, we use the static top n approach as the neighbour retrieval function and tune the remaining parameters, the SPPMI shift k (1, 5, 10, 40, 100) and the number of neighbours (10, 30, 50, 100, 500, 1000, 5000), for both model types, and the sliding window size for the untyped VSM (1, 2, 5), on the development portion of the Mitchell and Lapata (2010) dataset. We keep the vector configuration (k and window size) fixed for all phrase types and only tune the number of neighbours used for DI individually. The best vector configuration for the APT model is achieved with k = 10 and for the untyped VSM with k = 1. For composition by intersection, best performance on the dev set was achieved with 1000 neighbours for ANs, 10 for NNs and 50 for VOs with DI. For composition by union, top performance was obtained with 100 neighbours for ANs, 30 neighbours for NNs and 50 for VOs. The best results for the untyped model on the dev set are achieved with a symmetric window size of 1, using 5000 neighbours for ANs, 10 for NNs and 1000 for VOs with composition by pointwise multiplication, and 30 neighbours for ANs, 5000 for NNs and 5000 for VOs for composition by pointwise addition. The numbers of neighbours validated on the development set show that the problem of missing information appears to be more severe for semantic composition than for word similarity tasks. Even though a neighbour at rank 1000 or lower does not appear to have a close relationship to the target word, it can still contribute useful co-occurrence information not observed in the original vector.
Table 7 shows that composition by intersection with distributional inference considerably improves upon the best results for APT models without distributional inference and for untyped count-based models, and is competitive with the state-of-the-art neural network based models of Hashimoto et al. (2014) and Wieting et al. (2015). Distributional inference also improves upon the performance of an untyped VSM where composition by pointwise multiplication is outperforming the models of Mitchell and Lapata (2010) and Blacoe and Lapata (2012). Table 7 furthermore shows that DI has a smaller effect on the APT model based on composition by union and the untyped model based on composition by pointwise addition. The reason, as pointed out in the discussion for Table 5, is that the composition function has no disambiguating effect and thus cannot eliminate unrelated neighbours introduced by distributional inference. An intersective composition function, on the other hand, is able to perform the disambiguation locally in any given phrasal context. This furthermore suggests that for the APT model it is not necessary to explicitly model different word senses in separate vectors, as composition by intersection is able to disambiguate any word in context individually. Unlike the models of Hashimoto et al. (2014) and Wieting et al. (2015), the elementary word representations, as well as the representations for composed phrases and the composition process in our models, are fully interpretable.

Conclusion and Future Work
One of the major challenges in count-based models is dealing with extreme sparsity and missing information. This paper contributes a number of findings relating to this challenge, in particular a simple unsupervised algorithm for enriching sparse word representations by leveraging their distributional neighbourhood. We have demonstrated its benefit to typed and untyped vector space models on a range of word similarity datasets. We have shown that distributional inference improves the performance of typed and untyped VSMs for semantic composition, and that our APT model is competitive with the state-of-the-art for adjective-noun, noun-noun and verb-object compositions while being fully interpretable. With our method, we are able to bridge the gap in performance between low-dimensional non-interpretable and high-dimensional interpretable representations. Lastly, we have investigated the different behaviour of composition by union and composition by intersection and have shown that an intersective composition function, together with distributional inference, has the ability to locally disambiguate the meaning of a phrase.
In future work we aim to scale our approach to semantic composition with distributional inference to longer phrases and full sentences.We furthermore plan to investigate whether the number of neighbours required for improving elementary vector representations remains as high for other compositional tasks and longer phrases as in this study.

Figure 1 :
Figure 1: Aligned Packed Dependency Tree representation of the example sentences.

Figure 2 :
Figure 2: Effect of the number of neighbours on WordSim-353.

Table 1 :
Example feature spaces for the lexemes white and clothes extracted from the dependency tree of Figure 1. Not all features are displayed for space reasons. Offsetting amod:shoes by amod results in an empty dependency path, leaving just the word co-occurrence :shoes as a feature.

Table 2 :
Comparison of composition by union and composition by intersection. Not all features are displayed for space reasons.

Table 3 :
Effect of the magnitude of the shift parameter k in SPPMI on the word similarity tasks. Boldface means best performance per dataset.

Table 4 :
Neighbour retrieval function comparison. Boldface means best performance on a dataset per VSM type. *) With 3 significant figures, the density window approach (0.713) is slightly better than the baseline without DI (0.708), static top n (0.710) and WordNet (0.710).

Table 5 :
Nearest neighbours of composed AN, NN and VO pairs in the Mitchell and Lapata (2010) dataset, with and without distributional inference. Words and phrases marked with * denote spurious neighbours; boldfaced words and phrases mark improved neighbours.

Table 6 :
Neighbour retrieval function comparison. Underlined means best performance per phrase type, boldface means best average performance.

Table 7 :
Results on the Mitchell and Lapata (2010) dataset. Results in brackets denote the performance of the respective models without the use of distributional inference. Underlined means best within group, boldfaced means best overall.