Fusion of Compositional Network-based and Lexical Function Distributional Semantic Models

Distributional Semantic Models (DSMs) have been successful at modeling the meaning of individual words, with interest recently shifting to compositional structures, i.e., phrases and sentences. Network-based DSMs represent and handle semantics via operators applied on word neighborhoods, i.e., semantic graphs containing a target’s most similar words. We extend network-based DSMs to address compositionality using an activation model (motivated by psycholinguistics) that operates on fused neighborhoods of variable size. The proposed method is evaluated against and combined with the lexical function method proposed by Baroni and Zamparelli (2010). We show that, by fusing a network-based with a lexical function model, performance gains can be achieved.


Introduction
Vector Space Models (VSMs) have proven their efficiency at representing word semantics, which are vital components for numerous natural language applications, such as paraphrasing and textual entailment (Androutsopoulos and Malakasiotis, 2010), affective text analysis (Malandrakis et al., 2013), etc. VSMs constitute the most widely used implementation of Distributional Semantic Models (DSMs) (Baroni and Lenci, 2010). A fundamental task addressed in the framework of DSMs is the computation of semantic similarity between words, adopting the distributional hypothesis of meaning, i.e., "similarity of context implies similarity of meaning" (Harris, 1954). DSMs have been successful when applied to the representation of word lexical semantics, enabling the computation of word semantic similarity (Turney and Pantel, 2010). However, the application of DSMs for representing the semantics of more complex structures, e.g., phrases or sentences, is not trivial since the meaning of such structures is the result of various compositional phenomena (Pelletier, 1994) that are inherent properties of natural language creativity. The key idea behind current approaches in semantic composition (using DSMs) is the combination of word vectors using simple functions, e.g., vector addition or multiplication (Mitchell and Lapata, 2008; Mitchell and Lapata, 2010), or other transformational functions. Regardless of the function used, the resulting representations adhere to the paradigm of VSMs, while the cosine between the (composed) vectors is used for estimating similarity. Such efforts proved to be effective when computing the similarity between two-word phrases; however, their limitations were revealed for the case of longer structures (Polajnar et al., 2014), where the composition of meaning becomes more complex. Bengio and Mikolov (2003; 2013) proposed an approach based on deep learning for building language models that address the problem of language creativity. These models appear to be steadily gaining support in comparison with traditional DSMs. A preliminary comparative analysis of them is provided in (Baroni et al., 2014b) with respect to a number of tasks related to lexical semantics.
In this work, we extend a recent network-based implementation of DSMs (Iosif and Potamianos, 2015) in order to represent the semantics of compositional structures. The framework used consists of activation models motivated by semantic priming (McNamara, 2005). For each structure, an activation area (i.e., semantic neighborhood) is computed, which is regarded as a sub-space within the network. The novelty of the present work is twofold. First, we propose various approaches for the creation of activation areas for compositional structures, within a framework alternative to VSMs. Second, we investigate the fusion of the proposed network-based model with VSM-based transformational approaches from the literature. In addition, we investigate the role of words as operators on the meaning of the structures they occur in by measuring their transformative degree.
The remainder of this paper is organized as follows: in Section 2 we describe work related to DSMs. In Section 3 we describe the work on which we based the proposed models. We present the proposed models in Section 4. The lexical function model is described in Section 5, and a fusion model integrating the former with network-based models is proposed. We describe the experimental procedure that we followed and evaluate the proposed models in Section 6. We elaborate on the effects of modifiers in compositional structures in Section 7, concluding in Section 8.

Related Work
Word-level DSMs can be categorized into unstructured models, which employ a bag-of-words model, and structured models, which employ syntactic relationships between words (Grefenstette, 1994; Baroni and Lenci, 2010). DSMs are typically constructed from co-occurrence statistics of word tuples. An unstructured approach for the construction of network-based DSMs was proposed in (Iosif and Potamianos, 2015), where nodes represent words, and edges are formulated according to the semantic similarity of the connected nodes. For each node, the notion of semantic neighborhood (i.e., the most semantically similar words) is utilized for estimating an improved similarity between the nodes. Moving beyond the word-level, Turney (2012) proposed a "dual-space" model that combines relational and compositional methods for representing phrasal semantics. This approach utilized two complementary models in an attempt to address a series of phenomena that apply to compositional semantics, namely, "linguistic creativity", "order sensitivity", "adaptive capacity", and "information scalability". Three types of phrases were investigated: noun-noun (NN), adjective-noun (AN), and verb-object (VO). In (Baroni and Zamparelli, 2010), particular focus was given to the AN type, where adjectives were represented as matrices acting as functions on the vectorial representation of head nouns. Recent research efforts have been expanded to longer text segments such as sentences (Agirre et al., 2012; Agirre et al., 2013; Polajnar et al., 2014). In (Socher et al., 2012), based on the functional space proposed in (Baroni and Zamparelli, 2010), phrase constituents were treated as both a continuous vector and a parameter matrix, where the representation of sentence semantics was constructed via a recursive bottom-up procedure.

Baseline Network-based Model
In this section, we generalize the ideas regarding network-based DSMs presented in (Iosif and Potamianos, 2015) to the case of more complex structures. The network consists of two layers: 1) an activation layer, and 2) a similarity layer. Given a lexical unit, the first layer represents an activation area that includes a set of lexical units that are semantically related to it. The notion of "lexical unit" refers to any semantically coherent lexical structure, spanning from words (unigrams) up to word sequences (n-grams). The second layer is used for the computation of semantic similarity between two lexical units, based on their respective activation layers. The network can be defined as a graph $Q = (V, E)$ whose set of vertices $V$ includes the lexical units under investigation and whose set of edges $E$ contains links between the vertices. The links between the lexical units in the network are weighted according to their pairwise semantic similarity.

Layer 1: Activation Model
The activation layer of a lexical unit, $\xi$, can be regarded as a sub-graph of $Q$, $Q_{\xi}$, also referred to as the semantic neighborhood of $\xi$. Its vertices (neighbors of $\xi$) are determined according to their semantic similarity with $\xi$. Given a set of lexical units, the most similar to $\xi$ are selected as neighbors. The activation layer is motivated by the phenomenon of semantic priming (McNamara, 2005), especially for highly coherent lexical units, such as unigrams and bigrams. In the framework of DSMs, activation layers were computed for the case of unigrams in (Iosif and Potamianos, 2015), and were extended to short phrases (bigrams) in (Iosif, 2013). Consider a phrase, $i = (i_1 i_2)$, where $i_1$ and $i_2$ denote its first and second constituent. Assuming that the $N_{i_1}$ and $N_{i_2}$ sets represent neighborhoods of $i_1$ and $i_2$, respectively, the neighborhood of $i$, $N_i$, was computed by taking the intersection of $N_{i_1}$ and $N_{i_2}$.

Layer 2: Semantic Model
Two similarity metrics are defined for computing the similarity between two lexical units, $i$ and $j$. The metrics are defined on top of their respective activation models, $N_i$ and $N_j$, computed in the previous layer. This approach relies on two assumptions, namely, maximum sense and attributional similarity, for unigrams. In this work, we extend these metrics to bigrams (see Fig. 1 and Fig. 2) in order to compute the semantic similarity between two phrases, $i = (i_1 i_2)$ and $j = (j_1 j_2)$, exploiting their respective activation layers $N_i$ and $N_j$.
Maximum Neighborhood Similarity. The key idea of this metric, $M$, is the computation of similarities between the constituents of phrase $i$ ($i_1$ and $i_2$) and the members of $N_j$. The same is done for $j_1$ and $j_2$ and the members of $N_i$. The similarity between $i$ and $j$ (e.g., "assistant manager" and "board member" in Fig. 1) is computed by taking the maximum of the aforementioned similarities (0.50 in Fig. 1). The underlying hypothesis is that the neighborhoods encode senses that are shared between the constituents. The selection of the maximum score suggests that the similarity between $i$ and $j$ can be approximated by considering their closest senses (Iosif and Potamianos, 2015).
Attributional Neighborhood Similarity. In this metric, $R$, similarities between $i_1$ and $i_2$ and the members of $N_j$ are computed and stored into a vector. This is also done for $j_1$ and $j_2$ and the members of $N_j$. The correlation coefficient between the two vectors (e.g., the two right-most vectors in Fig. 2) is computed. The process is repeated, using $N_i$ in the place of $N_j$, which results in another correlation coefficient. The similarity between $i$ and $j$ is estimated by selecting the maximum correlation coefficient. The underlying motivation is attributional similarity, i.e., the hypothesis that the neighborhoods encode semantic or affective features. Semantically similar phrases are expected to exhibit correlated similarities with respect to such features (Iosif and Potamianos, 2015).
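The two baseline metrics can be illustrated with a minimal sketch. The similarity scores, the neighborhoods, and all function names below are invented for illustration; the order in which similarities are stored in the vectors used by $R$ (constituent by constituent, then neighbor by neighbor) is also an assumption, since the text does not fix it.

```python
from math import sqrt
from typing import Callable, List, Sequence

Similarity = Callable[[str, str], float]

# Invented toy scores (illustration only); unlisted pairs default to 0.1.
TOY_SCORES = {
    ("manager", "director"): 0.5,
    ("member", "director"): 0.3,
    ("assistant", "deputy"): 0.4,
    ("board", "committee"): 0.45,
}


def toy_sim(a: str, b: str) -> float:
    """Symmetric lookup into the toy score table."""
    if a == b:
        return 1.0
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), 0.1))


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def profile(phrase: Sequence[str], hood: Sequence[str], sim: Similarity) -> List[float]:
    """Similarities between each phrase constituent and each neighbor."""
    return [sim(c, n) for c in phrase for n in hood]


def metric_m(i, j, n_i, n_j, sim: Similarity) -> float:
    """Maximum Neighborhood Similarity: best constituent-to-neighbor score."""
    return max(profile(i, n_j, sim) + profile(j, n_i, sim))


def metric_r(i, j, n_i, n_j, sim: Similarity) -> float:
    """Attributional Neighborhood Similarity: larger of two correlations."""
    r_wrt_j = pearson(profile(i, n_j, sim), profile(j, n_j, sim))
    r_wrt_i = pearson(profile(i, n_i, sim), profile(j, n_i, sim))
    return max(r_wrt_i, r_wrt_j)


phrase_i = ("assistant", "manager")
phrase_j = ("board", "member")
hood_i = ["deputy", "director"]     # hypothetical N_i
hood_j = ["committee", "director"]  # hypothetical N_j

print(metric_m(phrase_i, phrase_j, hood_i, hood_j, toy_sim))
print(metric_r(phrase_i, phrase_j, hood_i, hood_j, toy_sim))
```

In practice `toy_sim` would be replaced by the word-level similarity used to weight the network edges.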

Extended Network-based Model
The major limitation of the model presented in Section 3 is that the neighborhoods of phrase constituents (e.g., $N_{i_1}$ and $N_{i_2}$) are of fixed size. As a result, the neighborhood computed for the phrase (e.g., $N_i$) may be empty when there is no overlap between the neighborhoods of its constituents.
In this section, we propose an extension of the aforementioned model by relaxing the hard constraint regarding the fixed size of neighborhoods. The intuition behind this idea is that the activation areas are not of the same size for all words. For example, a semantically abstract word, such as "democracy", is expected to have a larger neighborhood compared to semantically concrete words, e.g., "computer". Given a phrase, e.g., $i = (i_1 i_2)$, in order to compute the activation $N_i$, we gradually extend the activation areas (i.e., sizes) of $N_{i_1}$ and $N_{i_2}$ until a minimum size $\theta$ for $N_i$ is reached.
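The gradual extension can be sketched as follows for the intersection case. This is only one possible reading: the sketch assumes that both constituent neighborhoods grow in lockstep, one neighbor at a time, and that growth stops once the vocabulary is exhausted. The vocabulary, the position-based similarity, and the function names are hypothetical.

```python
from typing import Callable, List, Sequence, Set, Tuple

Similarity = Callable[[str, str], float]

# Hypothetical 1-D "semantic positions": closer words are more similar.
POSITION = {"football": 0, "league": 1, "coach": 2,
            "boss": 3, "executive": 4, "manager": 5}


def toy_sim(a: str, b: str) -> float:
    return 1.0 / (1.0 + abs(POSITION[a] - POSITION[b]))


def ranked_neighbors(word: str, vocab: Sequence[str], sim: Similarity) -> List[str]:
    """All other vocabulary words, most similar first."""
    return sorted((w for w in vocab if w != word),
                  key=lambda w: sim(word, w), reverse=True)


def grow_until_overlap(i1: str, i2: str, vocab: Sequence[str],
                       sim: Similarity, theta: int) -> Tuple[Set[str], int]:
    """Enlarge both constituent neighborhoods until their intersection
    holds at least `theta` members (assumed lockstep growth)."""
    r1 = ranked_neighbors(i1, vocab, sim)
    r2 = ranked_neighbors(i2, vocab, sim)
    limit = max(len(r1), len(r2))
    size = 1
    while True:
        shared = set(r1[:size]) & set(r2[:size])
        if len(shared) >= theta or size >= limit:
            return shared, size
        size += 1


vocab = list(POSITION)
hood, size = grow_until_overlap("football", "manager", vocab, toy_sim, theta=2)
print(sorted(hood), size)
```

With a fixed neighborhood size of 2, the intersection for this toy phrase would be empty, which is the failure case described above.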

Layer 1: Activation Model
We propose three different schemes for the computation of neighborhoods. An example of those schemes is depicted in Fig. 3.
Scheme 1. The phrase neighborhood is computed by taking the intersection of the constituent neighborhoods, i.e., $N_i = N_{i_1} \cap N_{i_2}$. This adheres to findings from the literature of psycholinguistics suggesting that the phrase activation (and, thus, the respective meaning) should be more specific than those of its constituents (Osherson and Smith, 1981).
Scheme 2. The union of neighborhoods is used, i.e., $N_i = N_{i_1} \cup N_{i_2}$. This is motivated by the idea that, in some cases, a phrase may be associated with a larger activation area, compared to those of its constituents.
Scheme 3. The members of the phrase neighborhood are selected based on their average semantic similarity with respect to the phrase constituents. The $N_i$ set can be regarded as a list, which is ranked according to $\frac{1}{2}(S(n_m, i_1) + S(n_m, i_2))$, where $S(\cdot)$ stands for a metric of semantic similarity. This scheme is motivated by the idea that different areas of $N_{i_1}$ and $N_{i_2}$ may be activated given the context of words $i_1$ and $i_2$, respectively. The scheme also addresses the issue of scalability: the phrase neighborhood has the same size as the constituents' neighborhoods, enabling the recursive application of the model over longer structures.
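A compact sketch of the three schemes is given below. The similarity table and the function name are invented. For Scheme 3, the sketch assumes that candidates are drawn from the union of the constituent neighborhoods; the text does not state the candidate pool, so this is an assumption.

```python
from typing import Callable, List, Sequence

Similarity = Callable[[str, str], float]

# Invented (neighbor, constituent) similarity scores for illustration.
TOY_SCORES = {
    ("league", "football"): 0.8, ("league", "manager"): 0.1,
    ("coach", "football"): 0.6, ("coach", "manager"): 0.5,
    ("boss", "football"): 0.3, ("boss", "manager"): 0.7,
    ("executive", "football"): 0.1, ("executive", "manager"): 0.6,
}


def toy_sim(neighbor: str, constituent: str) -> float:
    return TOY_SCORES.get((neighbor, constituent), 0.0)


def compose_neighborhood(n_i1: Sequence[str], n_i2: Sequence[str],
                         i1: str, i2: str, sim: Similarity,
                         scheme: int, size: int) -> List[str]:
    """Build the phrase neighborhood N_i from N_i1 and N_i2."""
    if scheme == 1:  # intersection
        return sorted(set(n_i1) & set(n_i2))
    if scheme == 2:  # union
        return sorted(set(n_i1) | set(n_i2))
    if scheme == 3:  # rank by average similarity to both constituents
        candidates = set(n_i1) | set(n_i2)  # ASSUMPTION: candidate pool
        ranked = sorted(
            candidates,
            key=lambda n: (-0.5 * (sim(n, i1) + sim(n, i2)), n),
        )
        return ranked[:size]
    raise ValueError("scheme must be 1, 2, or 3")


hood_football = ["league", "coach", "boss"]    # hypothetical N_i1
hood_manager = ["boss", "executive", "coach"]  # hypothetical N_i2

for scheme in (1, 2, 3):
    print(scheme, compose_neighborhood(hood_football, hood_manager,
                                       "football", "manager", toy_sim,
                                       scheme, size=3))
```

Only Scheme 3 returns a neighborhood whose size is controlled directly, which is what allows it to be re-applied over longer structures.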

Layer 2: Semantic Model
An extension of the $M$ metric (described in Section 3) is proposed, along with two more metrics for computing the semantic similarity between lexical units utilizing their respective neighborhoods. The metrics are defined with respect to two lexical units, $i$ and $j$, which are represented by their neighborhoods, $N_i$ and $N_j$, respectively.
Average of top-k similarities ($M_k$). This metric extends the $M$ metric (see Section 3) by considering the top $k$ similarity scores instead of the maximum score. Similarity between $i$ and $j$, $M_k(i, j)$, is computed by taking the arithmetic mean of the $k$ scores.
Average of top-k pairwise similarities ($P_k$). Let $C$ be a ranked list including the pairwise similarities computed between the members of $N_i$ and $N_j$:
$$C = \{ S(x, y) : x \in N_i,\ y \in N_j \} \tag{1}$$
where $S(\cdot)$ stands for a metric of semantic similarity. The similarity between $i$ and $j$ is computed as:
$$P_k(i, j) = \frac{1}{k} \sum_{l=1}^{k} c_l \tag{2}$$
where $c_l$ is the $l$-th member of $C$.
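The two top-k metrics can be sketched as below. The toy scores, neighborhoods, and function names are invented; averaging over fewer than $k$ scores when fewer are available is an assumption made here to keep the sketch well defined.

```python
from typing import Callable, List, Sequence

Similarity = Callable[[str, str], float]

# Invented toy scores (illustration only); unlisted pairs default to 0.1.
TOY_SCORES = {
    ("manager", "director"): 0.5,
    ("member", "director"): 0.3,
    ("assistant", "deputy"): 0.4,
    ("board", "committee"): 0.45,
}


def toy_sim(a: str, b: str) -> float:
    if a == b:
        return 1.0
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), 0.1))


def top_k_mean(scores: Sequence[float], k: int) -> float:
    """Arithmetic mean of the k largest scores (fewer if not available)."""
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / len(top) if top else 0.0


def m_k(i, j, n_i, n_j, sim: Similarity, k: int) -> float:
    """Top-k version of M: constituents of one phrase vs. the other's neighbors."""
    scores: List[float] = [sim(c, n) for c in i for n in n_j]
    scores += [sim(c, n) for c in j for n in n_i]
    return top_k_mean(scores, k)


def p_k(n_i, n_j, sim: Similarity, k: int) -> float:
    """Mean of the k largest pairwise similarities between N_i and N_j."""
    pairwise = [sim(x, y) for x in n_i for y in n_j]
    return top_k_mean(pairwise, k)


phrase_i = ("assistant", "manager")
phrase_j = ("board", "member")
hood_i = ["deputy", "director"]     # hypothetical N_i
hood_j = ["committee", "director"]  # hypothetical N_j

print(m_k(phrase_i, phrase_j, hood_i, hood_j, toy_sim, k=2))
print(p_k(hood_i, hood_j, toy_sim, k=2))
```

Setting `k=1` in `m_k` recovers the maximum-based $M$ metric.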
Hausdorff-based similarity ($H$). This metric is motivated by the Hausdorff distance (Hung and Yang, 2004). Let $h(N_i, N_j)$ be defined as where $S(\cdot)$ is a semantic similarity metric. The similarity between $i$ and $j$ is computed as:

Fusion of Lexical Function with Network-based Models
The representation of phrase semantics requires the consideration of the constituents' functional influence on the composed meaning. For example, when considering an adjective-noun phrase, such as "bad cat", the former word ("bad") acts as an operator, i.e., modifier, on the latter word ("cat"), modifying its meaning. In (Baroni and Zamparelli, 2010; Baroni et al., 2014a), it was proposed that such modifications can be implemented via the use of functions that act as linear transformations in VSMs. Application of these functions is realized via matrix-by-vector multiplication as (Baroni et al., 2014a):
$$\mathbf{b} = F\,\mathbf{a} \tag{5}$$
where $F$ is the matrix-encoded function $f$, $\mathbf{a}$ is the vectorial representation of the argument $\alpha$, and $\mathbf{b}$ is the compositional vector output. The $F$ function is learnt according to examples of observed input and output (distributional) representations. The input is the representation of the head word, and the output is the representation of the phrase. Regression is employed for calculating the set of weights in the matrix that best approximate the observed vectors. For example, the function for the modifier "bad" is learnt by regressing over phrase examples and their head nouns, such as <pet, bad pet>, <dog, bad dog>, <bird, bad bird>. Using the trained set of weights and the vectorial representation of the head noun, e.g., "cat", the composite representation for the phrase "bad cat" is induced.
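Learning a modifier matrix by regression can be sketched in a few lines of dependency-free Python. This is not the DISSECT implementation: it is a minimal ridge regression without an intercept, solved in closed form, and the two-dimensional vectors, the penalty value, and the function names are all invented for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]
Matrix = List[List[float]]


def solve(g: Matrix, r: Matrix) -> Matrix:
    """Solve G W = R by Gauss-Jordan elimination with partial pivoting."""
    d = len(g)
    aug = [list(g[p]) + list(r[p]) for p in range(d)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda p: abs(aug[p][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for p in range(d):
            if p != col:
                factor = aug[p][col]
                aug[p] = [v - factor * u for v, u in zip(aug[p], aug[col])]
    return [row[d:] for row in aug]


def train_lexical_function(heads: Sequence[Vector], phrases: Sequence[Vector],
                           alpha: float = 1.0) -> Matrix:
    """Ridge estimate of F such that F a ≈ b for (head, phrase) pairs.

    Solves (X^T X + alpha I) W = X^T Y and returns F = W^T.
    alpha must be positive so that the system is always solvable.
    """
    d, m = len(heads[0]), len(phrases[0])
    gram = [[sum(x[p] * x[q] for x in heads) + (alpha if p == q else 0.0)
             for q in range(d)] for p in range(d)]
    rhs = [[sum(x[p] * y[q] for x, y in zip(heads, phrases))
            for q in range(m)] for p in range(d)]
    w = solve(gram, rhs)
    return [[w[p][q] for p in range(d)] for q in range(m)]


def apply_function(f: Matrix, a: Vector) -> List[float]:
    """Compose a phrase vector: b = F a."""
    return [sum(fv * av for fv, av in zip(row, a)) for row in f]


def training_mse(f: Matrix, heads: Sequence[Vector],
                 phrases: Sequence[Vector]) -> float:
    """Mean squared error of the fitted function on its training pairs."""
    errs = [(pb - tb) ** 2
            for x, y in zip(heads, phrases)
            for pb, tb in zip(apply_function(f, x), y)]
    return sum(errs) / len(errs)


# Invented toy vectors: head nouns and the observed "bad <noun>" phrases.
head_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
phrase_vectors = [[2.0, 0.0], [0.0, 0.5], [2.0, 0.5]]

F_bad = train_lexical_function(head_vectors, phrase_vectors, alpha=1e-9)
print(apply_function(F_bad, [3.0, 4.0]))  # composed vector for a new head noun
print(training_mse(F_bad, head_vectors, phrase_vectors))
```

The training MSE returned by `training_mse` is the quantity that the fusion described in the next section uses to quantify how well the linear transformation fits a given modifier.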

Fusion
The proposed network-based model, presented in Section 4, exploits the merging of word senses for computing activation areas for phrases. The model defined by (5) utilizes the transformational function of an operator for changing the meaning of a phrase. Both models (intuitively) seem to be aligned with the human process of phrase comprehension; however, there are cases in which one of the models applies better than the other. Consider two example phrases, "football manager" and "successful engineer". The transformational model is expected to perform better for the latter phrase, while for the first phrase an intersection of word senses (i.e., a network-based model) seems to be more appropriate. Based on the above considerations, we propose a fusion of the lexical function (lf), defined by (5), with the proposed network-based models. The fusion aims to model the semantic representations of complex structures more accurately. To do so, we measure the Mean Squared Error (MSE) when training the lexical function model, in order to quantify the transformative degree of the modifier under investigation. The transformative degree is used for deciding whether a network-based or a transformational model is more appropriate. Given two phrases, $i = (i_1 i_2)$ and $j = (j_1 j_2)$, the transformative degree $T(i, j)$ is defined as: where $MSE(i_1)$ and $MSE(j_1)$ are the MSEs that correspond to modifiers $i_1$ and $j_1$, respectively. The proposed fusion metric, $\Phi^{lf}_{net}(i, j)$, used for estimating the similarity between the $i$ and $j$ phrases, is defined as: (7) where $S_N$ and $S_{LF}$ are similarity scores computed by the network-based and lexical function models, respectively. $\lambda$ is a function of $i$ and $j$, computed using a sigmoid function as: The sigmoid function is applied in order to smooth and normalize (within [0,1]) the values of $T(i, j)$.
Finally, in addition to the aforementioned fusion, we also implement a fusion combining the lf and the widely-used additive (add) (Mitchell and Lapata, 2008; Mitchell and Lapata, 2010) model. This fusion metric, $\Phi^{lf}_{add}$, is defined similarly to (7).
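The general shape of an MSE-driven fusion can be sketched as follows. The exact forms of $T$, $\lambda$ and $\Phi$ are not reproduced here, so every formula in the sketch is an assumption: $T$ is taken as the mean of the two modifier MSEs, $\lambda$ as a plain logistic function of $T$, and $\lambda$ is assumed to weight the network-based score (on the reasoning that a high training error suggests the lexical function fits the modifier poorly). Function names are hypothetical.

```python
from math import exp


def sigmoid(x: float) -> float:
    """Standard logistic function, mapping real values into (0, 1)."""
    return 1.0 / (1.0 + exp(-x))


def transformative_degree(mse_i1: float, mse_j1: float) -> float:
    # ASSUMPTION: the two modifier MSEs are combined by their mean.
    return 0.5 * (mse_i1 + mse_j1)


def fuse(s_net: float, s_lf: float, mse_i1: float, mse_j1: float) -> float:
    """Convex combination of network-based and lexical function scores.

    ASSUMPTION: lambda weights the network-based score, so a higher
    training MSE shifts weight away from the lexical function model.
    """
    lam = sigmoid(transformative_degree(mse_i1, mse_j1))
    return lam * s_net + (1.0 - lam) * s_lf


print(fuse(s_net=0.8, s_lf=0.4, mse_i1=0.0, mse_j1=0.0))
print(fuse(s_net=0.8, s_lf=0.4, mse_i1=2.0, mse_j1=3.0))
```

Because the combination is convex, the fused score always lies between the two input scores; only the weighting depends on the assumed form of $T$ and $\lambda$.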

Experiments and Evaluation
The procedure for creating the network and conducting the experiments is described in Section 6.1. In Section 6.2, we evaluate the proposed models and compare them with results from the literature.

Experimental Procedure
We defined our vocabulary (network nodes) by intersecting the English vocabulary found in the ASPELL dictionary and the Wikipedia dump to derive an English vocabulary of approximately 135K words. Using it, a corpus comprising web-harvested document snippets was constructed by downloading 1000 snippets for each word in the vocabulary. Word-level similarities were computed among all pairs of vocabulary entries. To this end, the Normalized Google Distance ($G$) was utilized, proposed in (Vitanyi, 2005; Cilibrasi and Vitanyi, 2007) and motivated by Kolmogorov complexity. Let $G$ be defined as
$$G(w_1, w_2) = \frac{\max\{A\} - \log |D|w_1, w_2|}{\log |D| - \min\{A\}} \tag{9}$$
where $w_1$ and $w_2$ are two vocabulary words under investigation, $|D|$ is the total number of documents in the corpus, $|D|w_1, w_2|$ is the total number of documents containing both $w_1$ and $w_2$, and $A = \{\log |D|w_1|, \log |D|w_2|\}$. We used a variation of (9), proposed in (Gracia et al., 2006), referred to as "Google-based Semantic Relatedness" ($G'$). This variation defines a similarity measure, bounded within the [0, 1] range and defined as
$$G'(w_1, w_2) = e^{-2\,G(w_1, w_2)} \tag{10}$$
where $G(w_1, w_2)$ is computed according to (9). In this work, $D$ denotes the sentence rather than the document, as the co-occurrence of words was defined at sentence-level. This metric was adopted based on its good performance in word-level semantic similarity tasks (Iosif and Potamianos, 2015).
Network-based model. We used sizes of θ = {10, 25, 50, 100, 150, 500} for the case of fixed-size neighborhoods, and θ = {1, 5, ..., 40} for the extended activation models described in Section 4.1. We used both the baseline and the extended activation layers for the M model, the latter being defined as M . For $M_k$ and $P_k$, we set k = {1, ..., 5}.
Transformational model. For the lf model described in (5), we computed co-occurrence counts for bigrams occurring at least 50 times in the corpus. Positive Pointwise Mutual Information (PPMI) was applied to reweigh them. We used a) Singular Value Decomposition (SVD), and b) Non-Negative Matrix Factorization (NMF) (Lee and Seung, 2001) to reduce the dimensionality of the space down to a) 300, and b) 500 dimensions. To train lf, we selected corpus bigrams comprising a modifier and a noun. We used a) Least Squares (LSR), and b) Ridge (RR) (Hastie et al., 2009) regression. The DIStributional SEmantics Composition Toolkit (DISSECT, (Dinu et al., 2013)) was used to implement lf, as well as the widely-used additive (add) and multiplicative (mult) models proposed in (Mitchell and Lapata, 2008; Mitchell and Lapata, 2010).
Fusion model. We combined the best performing model configurations on NNs (see Section 6.2) in order to implement the proposed fusion models.
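The word-level similarity used to weight the network can be sketched from co-occurrence counts. The counts below are invented, and the function names are hypothetical; the bounded variant is written as the exponential form $e^{-2G}$ commonly attributed to Gracia et al. (2006), which should be treated as an assumption here.

```python
from math import exp, log


def normalized_google_distance(df_w1: int, df_w2: int,
                               df_both: int, n_units: int) -> float:
    """Normalized Google Distance from unit (here: sentence) frequencies.

    df_w1, df_w2: number of units containing each word
    df_both:      number of units containing both words
    n_units:      total number of units in the corpus
    Assumes neither word occurs in every unit (non-zero denominator).
    """
    if df_both == 0:
        return float("inf")  # words never co-occur: maximal distance
    a = (log(df_w1), log(df_w2))
    return (max(a) - log(df_both)) / (log(n_units) - min(a))


def google_relatedness(df_w1: int, df_w2: int,
                       df_both: int, n_units: int) -> float:
    """Bounded similarity in [0, 1]; ASSUMED form exp(-2 * distance)."""
    return exp(-2.0 * normalized_google_distance(df_w1, df_w2, df_both, n_units))


# Invented counts for two words in a corpus of one million sentences.
print(normalized_google_distance(1000, 100, 10, 1_000_000))
print(google_relatedness(1000, 100, 10, 1_000_000))
```

Words that always co-occur obtain a distance of 0 and a relatedness of 1, while words that never co-occur obtain a relatedness of 0.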

Evaluation Results
For evaluation purposes, we used the widely-used Mitchell & Lapata (2010) datasets comprising 108 noun-noun (NN), adjective-noun (AN), and verb-object (VO) phrase pairs, rated by human judges, with the ratings averaged per phrase pair. The models were evaluated using Spearman's correlation coefficient. Evaluation results are presented in Table 1. Due to space limitations, only the best performing network-based model configurations are reported here. Also, since the mult model performs poorly when the composed vectors contain negative values, as is the case with SVD, we only report results for the NMF variations for it. Finally, since training the lf model with RR had significantly superior performance over LSR in all configurations, we only report evaluations of the former. The lf model, when using RR in combination with NMF, performs best (.76) for the case of NNs. Best performances for ANs and VOs are obtained by the add model (.63 and .59, respectively). Regarding network-based models, performance is improved when using the extended activation model over the baseline. This is confirmed by the absolute 5%, 11% and 10% increase for the case of NN, AN, and VO pairs, respectively, for the $M$ metric. All the extended network-based models perform consistently better than the baseline of $M$ in the case of NNs, although their performance drops for the case of ANs and VOs. In the case of $P_k$, the scheme that constructs neighborhoods via the selection of the most similar neighbors performs better than the intersection- or the union-based scheme.
$\Phi^{lf}_{add}$ yields no relative improvements over the best performances of the separate models. $\Phi^{lf}_{net}$ provides an improvement for the case of NNs, reaching .80, which is also the best observed performance overall. However, $\Phi^{lf}_{net}$ does not improve performance in the case of ANs and VOs.
The performance improvements obtained when using the extended activation layer for compositional structures are consistent with experimental observations from psycholinguistics (Osherson and Smith, 1981), and show that the activation area for phrases might be adaptive to the degree of relatedness between words.
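The evaluation measure used throughout this section, Spearman's correlation between model scores and averaged human ratings, can be sketched without external libraries. The ratings and scores below are invented, and the function names are hypothetical; ties receive average ranks.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2.0 + 1.0
        for t in range(pos, end + 1):
            ranks[order[t]] = shared
        pos = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


# Invented example: averaged human ratings vs. model similarity scores.
human_ratings = [1.0, 2.0, 3.0, 4.0, 5.0]
model_scores = [0.1, 0.3, 0.2, 0.5, 0.4]
print(spearman(human_ratings, model_scores))
```

Because only ranks matter, a model is rewarded for ordering phrase pairs as humans do, regardless of the scale of its similarity scores.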

Discussion
The results displayed in Table 1 for the fusion models provide an indication of the different ways in which the operator changes the meaning of a phrase. In this section, we investigate the transformational properties of phrases as defined by their modifiers. By observing the properties of modifiers, we discuss whether their use in a phrase has mainly a transformational or a merely compositional effect, based on the goodness of fit of each model, estimated during model training.

The Transformative Effect of Modifiers
Early research on compositionality involved applying word-level semantic similarity estimation techniques to phrases using context-based, bag-of-words models, i.e., defining the structures' meaning as a function of the words in their context. Though simple and cost-effective, the aforementioned techniques fail to detect the effect that a word has on its linguistic context and the semantic changes to its meaning, e.g., a "nice" table is still a table but a "fake" or "broken" table is not.
Depending on context, a modifier can affect the meaning of the encompassing phrase in different ways. For example, the modifier "normal" changes the meaning of "normal cat" much less than the modifier "dead" in "dead cat". Moreover, the modifier effect may vary for each syntactic category. For example, verbs can be transitive or intransitive, nouns can be abstract or concrete, and adjectives can be intensional or not (Boleda et al., 2013). Words that act as functions on their linguistic context have attracted much interest, and have recently been successfully handled by computational models.

Estimating the Transformative Degree
We categorise modifiers based on their regression performance, when training them for the lf model. Specifically, we acquire the MSE of their training as a measure for deciding the degree of their transformative effect on a given head noun. Taking the MSE is a sensible approach, since regression tries to derive a close approximation to observed vectorial representations of phrases and head nouns by means of transforming the head noun vector; high error in training indicates that the lf model is a poor match for this modifier. We trained the lf model using Ridge Regression and estimated the MSE for each modifier. In Table 2  transformative modifiers have a more functional influence, when used in bigram structures. For example, "efficient" has a greater effect on the meaning of "efficient machine" than, e.g., "new" has on that of "new machine". A "new machine" retains the properties of a generic machine. However, an "efficient machine" should contain mechanisms that account for optimization of speed, cost, etc. Our observations suggest that modifiers affect the structure in which they occur in different ways. Some modifiers have a stronger effect on the meaning of the head noun, while others act merely as constituents of simple compositions. The proposed fusion of the transformational lf model with network-based or simple compositional models indicates that combining different models can yield improved performance when the transformative degree of modifiers is used as a fusion criterion.

Conclusions
We presented a network-based model that operates on neighborhoods of variable size to calculate similarity of compositional structures. We investigated various methods for composing neighborhoods of adjacent words and presented three metrics, motivated by psycholinguistics and metric space algebra, for estimating similarity between activation areas. Employing variable size activation improves semantic similarity performance, revealing a different activational behavior among bigrams. We also presented a fusion of the proposed models with the lexical function model based on the transformative degree of modifiers, achieving an improvement of performance for noun-noun compositions, reaching a state-of-the-art Spearman correlation of .80 with human judgements. We further investigated the transformative degree of modifiers, and elaborated on their role as mostly compositional or transformational.
In future work, we will further investigate the role of modifiers and their application in the proposed activation composition approaches, and also explore the criteria for deriving activations and deciding on fusion strategies. We also plan to apply network-based models to longer semantic structures.