Adaptive Joint Learning of Compositional and Non-Compositional Phrase Embeddings

We present a novel method for jointly learning compositional and non-compositional phrase embeddings by adaptively weighting both types of embeddings using a compositionality scoring function. The scoring function is used to quantify the level of compositionality of each phrase, and the parameters of the function are jointly optimized with the objective for learning phrase embeddings. In experiments, we apply the adaptive joint learning method to the task of learning embeddings of transitive verb phrases, and show that the compositionality scores have strong correlation with human ratings for verb-object compositionality, substantially outperforming the previous state of the art. Moreover, our embeddings improve upon the previous best model on a transitive verb disambiguation task. We also show that a simple ensemble technique further improves the results for both tasks.


Introduction
Representing words and phrases in a vector space has proven effective in a variety of language processing tasks (Pham et al., 2015; Sutskever et al., 2014). In most previous work, phrase embeddings are computed from word embeddings by using various kinds of composition functions. Such composed embeddings are called compositional embeddings. An alternative way of computing phrase embeddings is to treat phrases as single units and assign a unique embedding to each candidate phrase (Mikolov et al., 2013; Yazdani et al., 2015). Such embeddings are called non-compositional embeddings.
Relying solely on non-compositional embeddings has the obvious problem of data sparsity (i.e., rare or unknown phrases). At the same time, however, using compositional embeddings is not always the best option since some phrases are inherently non-compositional. For example, the phrase "bear fruits" means "to yield results", but it is hard to infer its meaning by composing the meanings of "bear" and "fruit". Treating all phrases as compositional also has a negative effect on learning the composition function, because the words in such idiomatic phrases are not just uninformative but can act as noisy training samples. These problems have motivated us to adaptively combine both types of embeddings.
Most of the existing methods for learning phrase embeddings can be divided into two approaches. One approach is to learn compositional embeddings by regarding all phrases as compositional (Pham et al., 2015; Socher et al., 2012). The other approach is to learn both types of embeddings separately and use the better ones (Kartsaklis et al., 2014; Muraoka et al., 2014). Kartsaklis et al. (2014) show that non-compositional embeddings are better suited for a phrase similarity task, whereas Muraoka et al. (2014) report the opposite results on other tasks. These results suggest that we should not stick to either of the two types of embeddings unconditionally and could learn better phrase embeddings by considering the compositionality levels of the individual phrases in a more flexible fashion.
In this paper, we propose a method that jointly learns compositional and non-compositional embeddings by adaptively weighting both types of phrase embeddings using a compositionality scoring function. The scoring function is used to quantify the level of compositionality of each phrase and is learned in conjunction with the target task for learning phrase embeddings. In experiments, we apply our method to the task of learning transitive verb phrase embeddings and demonstrate that it allows us to achieve state-of-the-art performance on standard datasets for compositionality detection and verb disambiguation.

Figure 1: The overview of our method and examples of the compositionality scores. Given a phrase p, our method first computes the compositionality score α(p) (Eq. (3)), and then computes the phrase embedding v(p) using the compositional and non-compositional embeddings, c(p) and n(p), respectively (Eq. (2)).

Method
In this section, we describe our approach in the most general form, without specifying the function to compute the compositional embeddings or the target task for optimizing the embeddings.
Figure 1 shows the overview of our proposed method. At each iteration of the training (i.e., gradient calculation) of a certain target task (e.g., language modeling or sentiment analysis), our method first computes a compositionality score for each phrase. Then the score is used to weight the compositional and non-compositional embeddings of the phrase in order to compute the expected embedding of the phrase, which is to be used in the target task. Some examples of the compositionality scores are also shown in the figure.

Compositional Phrase Embeddings
The compositional embedding c(p) ∈ R^{d×1} of a phrase p = (w_1, ..., w_L) is computed as
$$c(p) = f(v(w_1), \ldots, v(w_L)), \tag{1}$$
where d is the dimensionality, L is the phrase length, v(·) ∈ R^{d×1} is a word embedding, and f(·) is a composition function. The function can be a simple one, such as element-wise addition or multiplication (Mitchell and Lapata, 2008). More complex ones, such as recurrent neural networks (Sutskever et al., 2014), are also commonly used. The word embeddings and the composition function are jointly learned on a certain target task. Since compositional embeddings are built on word-level (i.e., unigram) information, they are less prone to the data sparseness problem.
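As a minimal sketch of the simplest composition functions mentioned above (element-wise addition and multiplication), the following Python snippet composes toy word embeddings. The vectors and the function names `compose_add` and `compose_mul` are illustrative assumptions, not part of the method's released code.

```python
from typing import List, Sequence

Vector = List[float]


def compose_add(word_vectors: Sequence[Vector]) -> Vector:
    """Element-wise addition of the word embeddings v(w_1), ..., v(w_L)."""
    return [sum(components) for components in zip(*word_vectors)]


def compose_mul(word_vectors: Sequence[Vector]) -> Vector:
    """Element-wise multiplication of the word embeddings."""
    result = list(word_vectors[0])
    for vec in word_vectors[1:]:
        result = [r * x for r, x in zip(result, vec)]
    return result


# Toy 3-dimensional embeddings (made-up values, for illustration only).
v_buy = [0.5, -1.0, 2.0]
v_car = [1.5, 0.5, -1.0]

c_add = compose_add([v_buy, v_car])  # [2.0, -0.5, 1.0]
c_mul = compose_mul([v_buy, v_car])  # [0.75, -0.5, -2.0]
```

Both functions have no trainable parameters of their own, so in this simple case only the word embeddings would be updated by the target task.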

Non-Compositional Phrase Embeddings
In contrast to the compositional embedding, the non-compositional embedding of a phrase n(p) ∈ R^{d×1} is independently parameterized, i.e., the phrase p is treated just like a single word. Mikolov et al. (2013) show that non-compositional embeddings are preferable when dealing with idiomatic phrases. Some recent studies (Kartsaklis et al., 2014; Muraoka et al., 2014) have discussed the (dis)advantages of using compositional or non-compositional embeddings. However, in most cases, a phrase is neither completely compositional nor completely non-compositional. To the best of our knowledge, there is no method that allows us to jointly learn both types of phrase embeddings by incorporating the levels of compositionality of the phrases as real-valued scores.

Adaptive Joint Learning
To simultaneously consider both compositional and non-compositional aspects of each phrase, we compute a phrase embedding v(p) by adaptively weighting c(p) and n(p) as follows:
$$v(p) = \alpha(p)\,c(p) + (1 - \alpha(p))\,n(p), \tag{2}$$
where α(·) is a scoring function that quantifies the compositionality levels and outputs a real value ranging from 0 to 1. What we expect from the scoring function is that large scores indicate high levels of compositionality. In other words, when α(p) is close to 1, the compositional embedding is mainly considered, and vice versa. For example, we expect α(buy car) to be large and α(bear fruit) to be small, as shown in Figure 1. We parameterize the scoring function α(p) as logistic regression:
$$\alpha(p) = \sigma(W \cdot \phi(p)), \tag{3}$$
where φ(p) ∈ R^{N×1} is a feature vector of the phrase p, W ∈ R^{N×1} is a weight vector, N is the number of features, and σ(·) is the logistic function. The weight vector W is jointly optimized in conjunction with the objective J for the target task of learning phrase embeddings v(p).
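The adaptive weighting can be sketched in a few lines of Python, assuming the phrase embedding is the convex combination v(p) = α(p) c(p) + (1 − α(p)) n(p) with α(p) = σ(W · φ(p)). All vectors below are made-up toy values, and the helper names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def compositionality_score(W: Vector, phi: Vector) -> float:
    """alpha(p) = sigma(W . phi(p)): logistic regression over phrase features."""
    return sigmoid(sum(w * f for w, f in zip(W, phi)))


def phrase_embedding(alpha: float, c: Vector, n: Vector) -> Vector:
    """v(p) = alpha * c(p) + (1 - alpha) * n(p)."""
    return [alpha * ci + (1.0 - alpha) * ni for ci, ni in zip(c, n)]


W = [0.0, 0.0, 0.0]      # zero weights, so every phrase starts at alpha = 0.5
phi = [1.0, 0.0, 0.5]    # toy feature vector
alpha = compositionality_score(W, phi)

c_p = [1.0, 2.0]         # toy compositional embedding
n_p = [3.0, 0.0]         # toy non-compositional embedding
v_p = phrase_embedding(alpha, c_p, n_p)  # [2.0, 1.0] when alpha = 0.5
```

With α(p) = 1 the sketch reduces to the purely compositional embedding, and with α(p) = 0 to the purely non-compositional one.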
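
Updating the model parameters
Given the partial derivative δ_p = ∂J/∂v(p) ∈ R^{d×1} for the target task, we can compute the partial derivative for updating W by applying the chain rule to Eq. (2) and (3):
$$\frac{\partial J}{\partial W} = \alpha(p)\,(1 - \alpha(p))\,\bigl\{\delta_p \cdot (c(p) - n(p))\bigr\}\,\phi(p).$$
The partial derivatives of J with respect to c(p) and n(p) follow directly from Eq. (2):
$$\frac{\partial J}{\partial c(p)} = \alpha(p)\,\delta_p, \tag{7}$$
$$\frac{\partial J}{\partial n(p)} = (1 - \alpha(p))\,\delta_p. \tag{8}$$
As mentioned above, Eq. (7) and (8) show that the non-compositional embeddings are mainly updated when α(p) is close to 0, and vice versa. The partial derivative ∂J/∂c(p) is used to update the model parameters in the composition function via the backpropagation algorithm. Any differentiable composition function can be used in our method.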
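The gradients can be checked numerically. The sketch below assumes v(p) = α(p) c(p) + (1 − α(p)) n(p) with α(p) = σ(W · φ(p)), and uses a hypothetical squared-error objective J = ½‖v(p) − t‖² purely as a stand-in for a real target task. It compares the chain-rule gradient with respect to W against central finite differences.

```python
import math
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def mix(W: Vector, phi: Vector, c: Vector, n: Vector) -> Tuple[float, Vector]:
    """Return alpha(p) and v(p) = alpha * c + (1 - alpha) * n."""
    alpha = sigmoid(dot(W, phi))
    return alpha, [alpha * ci + (1.0 - alpha) * ni for ci, ni in zip(c, n)]


def toy_objective(W, phi, c, n, target) -> float:
    """Hypothetical target objective: J = 0.5 * ||v(p) - target||^2."""
    _, v = mix(W, phi, c, n)
    return 0.5 * sum((vi - ti) ** 2 for vi, ti in zip(v, target))


def analytic_gradients(W, phi, c, n, target):
    alpha, v = mix(W, phi, c, n)
    delta_p = [vi - ti for vi, ti in zip(v, target)]  # dJ/dv(p) for the toy objective
    diff = [ci - ni for ci, ni in zip(c, n)]
    delta_alpha = alpha * (1.0 - alpha) * dot(delta_p, diff)
    grad_W = [delta_alpha * f for f in phi]           # dJ/dW
    grad_c = [alpha * d for d in delta_p]             # dJ/dc(p)
    grad_n = [(1.0 - alpha) * d for d in delta_p]     # dJ/dn(p)
    return grad_W, grad_c, grad_n


def numeric_grad_W(W, phi, c, n, target, eps: float = 1e-6) -> Vector:
    """Central finite differences with respect to each entry of W."""
    grads = []
    for i in range(len(W)):
        plus, minus = list(W), list(W)
        plus[i] += eps
        minus[i] -= eps
        diff = toy_objective(plus, phi, c, n, target) - toy_objective(minus, phi, c, n, target)
        grads.append(diff / (2.0 * eps))
    return grads


W, phi = [0.3, -0.2], [1.0, 0.5]
c, n, target = [1.0, 2.0], [-1.0, 0.5], [0.0, 0.0]
grad_W, grad_c, grad_n = analytic_gradients(W, phi, c, n, target)
max_error = max(abs(a - b) for a, b in zip(grad_W, numeric_grad_W(W, phi, c, n, target)))
```

A small `max_error` indicates that the analytic expression for ∂J/∂W agrees with the numerical estimate on this toy problem.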

Expected behavior of our method
The training of our method depends on the target task; that is, the model parameters are updated so as to minimize the cost function as described above. More concretely, α(p) for each phrase p is adaptively adjusted so that the corresponding parameter updates contribute to minimizing the cost function. As a result, different phrases will have different α(p) values depending on their compositionality. If the size of the training data were almost infinitely large, α(p) for all phrases would become nearly zero, and the non-compositional embeddings n(p) would be dominantly used (since that would allow the model to better fit the data). In reality, however, the amount of training data is limited, and thus the compositional embeddings c(p) are effectively used to overcome the data sparseness problem.

Learning Verb Phrase Embeddings
This section describes a particular instantiation of our approach presented in the previous section, focusing on the task of learning the embeddings of transitive verb phrases.

Word and Phrase Prediction in Predicate-Argument Relations
Acquisition of selectional preference using embeddings has been widely studied, where word and/or phrase embeddings are learned based on syntactic links (Bansal et al., 2014; Hashimoto and Tsuruoka, 2015; Levy and Goldberg, 2014; Van de Cruys, 2014). As with language modeling, these methods perform word (or phrase) prediction using (syntactic) contexts.
In this work, we focus on verb-object relationships and employ a phrase embedding learning method presented in Hashimoto and Tsuruoka (2015). The task is a plausibility judgment task for predicate-argument tuples. They extracted Subject-Verb-Object (SVO) and SVO-Preposition-Noun (SVOPN) tuples using a probabilistic HPSG parser, Enju (Miyao and Tsujii, 2008), from the training corpora. Transitive verbs and prepositions are extracted as predicates with two arguments. For example, the extracted tuples include (S, V, O) = ("importer", "make", "payment") and (SVO, P, N) = ("importer make payment", "in", "currency"). The task is to discriminate between observed and unobserved tuples, such as the (S, V, O) tuple mentioned above and (S, V', O) = ("importer", "eat", "payment"), which is generated by replacing "make" with "eat". The (S, V', O) tuple is unlikely to be observed.
For each tuple (p, a_1, a_2) observed in the training data, a cost function is defined as follows:
$$-\log s(p, a_1, a_2) - \log\bigl(1 - s(p', a_1, a_2)\bigr) - \log\bigl(1 - s(p, a_1', a_2)\bigr) - \log\bigl(1 - s(p, a_1, a_2')\bigr), \tag{9}$$
where s(·) is a plausibility scoring function, and p, a_1, and a_2 are a predicate and its arguments, respectively. Each of the three unobserved tuples (p', a_1, a_2), (p, a_1', a_2), and (p, a_1, a_2') is generated by replacing one of the entries with a random sample.
In their method, each predicate p is represented with a matrix M(p) ∈ R^{d×d} and each argument a with an embedding v(a) ∈ R^{d×1}. The matrices and embeddings are learned by minimizing the cost function using AdaGrad (Duchi et al., 2011). The scoring function is parameterized as
$$s(p, a_1, a_2) = \sigma\bigl(v(a_1) \cdot (M(p)\,v(a_2))\bigr), \tag{10}$$
and the VO and SVO embeddings are computed as
$$v(VO) = M(V)\,v(O), \tag{11}$$
$$v(SVO) = v(S) \odot v(VO), \tag{12}$$
as proposed by Kartsaklis et al. (2012). The operator ⊙ denotes element-wise multiplication. In summary, the scores are computed as
$$s(V, S, O) = \sigma\bigl(v(S) \cdot v(VO)\bigr), \tag{13}$$
$$s(P, SVO, N) = \sigma\bigl(v(SVO) \cdot (M(P)\,v(N))\bigr). \tag{14}$$
With this method, the word and composed phrase embeddings are jointly learned based on co-occurrence statistics of predicate-argument structures.
Using the learned embeddings, they achieved state-of-the-art accuracy on a transitive verb disambiguation task (Grefenstette and Sadrzadeh, 2011).
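The plausibility judgment task can be illustrated with a small sketch. It assumes that the scoring function has the form s(p, a_1, a_2) = σ(v(a_1) · (M(p) v(a_2))) and that the cost for one observed tuple is a negative log-likelihood over that tuple and three corrupted ones. The matrices, embeddings, and helper names are toy assumptions.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]
Triple = Tuple[Matrix, Vector, Vector]  # (M(p), v(a_1), v(a_2))


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def matvec(M: Matrix, v: Vector) -> Vector:
    return [dot(row, v) for row in M]


def plausibility(M_p: Matrix, v_a1: Vector, v_a2: Vector) -> float:
    """s(p, a_1, a_2) = sigma(v(a_1) . (M(p) v(a_2)))."""
    return sigmoid(dot(v_a1, matvec(M_p, v_a2)))


def tuple_cost(observed: Triple, corrupted: Sequence[Triple]) -> float:
    """Negative log-likelihood of one observed tuple and its corrupted variants."""
    cost = -math.log(plausibility(*observed))
    for negative in corrupted:
        cost -= math.log(1.0 - plausibility(*negative))
    return cost


# Toy 2-dimensional parameters (made-up values).
M_make = [[1.0, 0.0], [0.0, 1.0]]
M_eat = [[-1.0, 0.0], [0.0, -1.0]]
v_importer, v_payment = [1.0, 0.5], [0.8, 0.4]
v_chef, v_banana = [-0.7, 0.9], [-0.5, 1.0]

observed = (M_make, v_importer, v_payment)      # ("importer", "make", "payment")
corrupted = [
    (M_eat, v_importer, v_payment),             # predicate replaced
    (M_make, v_chef, v_payment),                # first argument replaced
    (M_make, v_importer, v_banana),             # second argument replaced
]
cost = tuple_cost(observed, corrupted)
```

In actual training the replacements would be drawn at random from the vocabulary; the fixed replacements here only keep the example deterministic.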

Applying the Adaptive Joint Learning
In this section, we apply our adaptive joint learning method to the task described in Section 3.1. We here redefine the computation of v(VO) by first replacing v(VO) in Eq. (11) with c(VO) as
$$c(VO) = M(V)\,v(O), \tag{15}$$
and then assigning VO to p in Eq. (2) and (3):
$$v(VO) = \alpha(VO)\,c(VO) + (1 - \alpha(VO))\,n(VO), \tag{16}$$
$$\alpha(VO) = \sigma(W \cdot \phi(VO)). \tag{17}$$
The v(VO) in Eq. (16) is used in Eq. (12) and (13). We assume that the candidates of the phrases are given in advance. For the phrases not included in the candidates, we set v(VO) = c(VO). This is analogous to the way a human guesses the meaning of an idiomatic phrase she does not know. We should note that φ(VO) can be computed for phrases not included in the candidates, using partial features among the features described below. If no feature fires, α(VO) becomes 0.5 according to the logistic function.
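The handling of candidate and non-candidate phrases can be sketched as follows, assuming c(VO) = M(V) v(O) and a lookup table of non-compositional embeddings for the candidate phrases only. The dictionaries, the constant scoring function, and the name `vo_embedding` are illustrative assumptions.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Phrase = Tuple[str, str]  # (verb, object)


def matvec(M: Matrix, v: Vector) -> Vector:
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def vo_embedding(
    verb: str,
    obj: str,
    verb_matrices: Dict[str, Matrix],
    noun_vectors: Dict[str, Vector],
    noncomp: Dict[Phrase, Vector],
    alpha_fn: Callable[[Phrase], float],
) -> Vector:
    """Return v(VO); fall back to c(VO) for phrases without n(VO)."""
    c = matvec(verb_matrices[verb], noun_vectors[obj])  # c(VO) = M(V) v(O)
    phrase = (verb, obj)
    if phrase not in noncomp:
        return c  # not a candidate phrase: purely compositional
    a = alpha_fn(phrase)
    n = noncomp[phrase]
    return [a * ci + (1.0 - a) * ni for ci, ni in zip(c, n)]


# Toy parameters (made-up values).
verb_matrices = {"bear": [[1.0, 0.0], [0.0, 1.0]], "buy": [[2.0, 0.0], [0.0, 2.0]]}
noun_vectors = {"fruit": [1.0, 1.0], "car": [0.5, -0.5]}
noncomp = {("bear", "fruit"): [-1.0, 3.0]}  # only "bear fruit" is a candidate


def toy_alpha(phrase: Phrase) -> float:
    return 0.25  # stand-in for sigma(W . phi(VO))


v_bear_fruit = vo_embedding("bear", "fruit", verb_matrices, noun_vectors, noncomp, toy_alpha)
v_buy_car = vo_embedding("buy", "car", verb_matrices, noun_vectors, noncomp, toy_alpha)
```

Here "bear fruit" mixes both embeddings with weight 0.25 on the compositional part, while "buy car" has no entry in the candidate table and therefore uses c(VO) directly.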
For the feature vector φ(VO), we use the following simple binary and real-valued features:
• indices of V, O, and VO
• frequency and Pointwise Mutual Information (PMI) values of VO.
More concretely, the first set of the features (indices of V, O, and VO) is the concatenation of traditional one-hot vectors. The second set of features, frequency and PMI (Church and Hanks, 1990) features, have proven effective in detecting the compositionality of transitive verbs in McCarthy et al. (2007) and Venkatapathy and Joshi (2005). Given the training corpus, the frequency feature for a VO pair is computed as
$$\mathrm{freq}(VO) = \log(\mathrm{count}(VO)), \tag{18}$$
where count(VO) counts how many times the VO pair appears in the training corpus, and the PMI feature is computed as
$$\mathrm{PMI}(VO) = \log\frac{\mathrm{count}(VO)\,\mathrm{count}(*)}{\mathrm{count}(V)\,\mathrm{count}(O)}, \tag{19}$$
where count(V), count(O), and count(*) are the counts of the verb V, the object O, and all VO pairs in the training corpus, respectively. We normalize the frequency and PMI features so that their maximum absolute value becomes 1.
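The real-valued features can be computed from raw pair counts. The sketch below assumes freq(VO) = log count(VO) and PMI(VO) = log(count(VO) count(*) / (count(V) count(O))), with verb and object counts taken over the same list of VO pairs, and divides each feature by its maximum absolute value. The toy corpus and the function name `vo_features` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Tuple

Pair = Tuple[str, str]  # (verb, object)


def vo_features(pairs: Iterable[Pair]) -> Dict[Pair, Tuple[float, float]]:
    """Return normalized (frequency, PMI) features for every VO pair type."""
    pairs = list(pairs)
    pair_count = Counter(pairs)
    verb_count = Counter(verb for verb, _ in pairs)
    obj_count = Counter(obj for _, obj in pairs)
    total = len(pairs)  # count(*): number of all VO pair tokens

    freq = {p: math.log(c) for p, c in pair_count.items()}
    pmi = {
        p: math.log(c * total / (verb_count[p[0]] * obj_count[p[1]]))
        for p, c in pair_count.items()
    }

    def normalize(values: Dict[Pair, float]) -> Dict[Pair, float]:
        # Scale so that the maximum absolute value becomes 1 (guard against all zeros).
        scale = max(abs(x) for x in values.values()) or 1.0
        return {k: x / scale for k, x in values.items()}

    freq, pmi = normalize(freq), normalize(pmi)
    return {p: (freq[p], pmi[p]) for p in pair_count}


# Toy corpus of VO pair tokens (made-up counts).
corpus = [("make", "payment")] * 4 + [("make", "noise")] * 2 + [("buy", "car")] * 2
features = vo_features(corpus)
```

In this toy corpus "make payment" is the most frequent pair and receives the largest frequency feature, whereas "buy car" receives the largest PMI feature because "buy" and "car" only occur together.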

Experimental Settings

Training Data
As the training data, we used two datasets, one small and one large: the British National Corpus (BNC) (Leech, 1992) and the English Wikipedia. More concretely, we used the publicly available data preprocessed by Hashimoto and Tsuruoka (2015). The BNC data consists of 1.38 million SVO tuples and 0.93 million SVOPN tuples. The Wikipedia data consists of 23.6 million SVO tuples and 17.3 million SVOPN tuples. Following the provided code, we used exactly the same train/development/test split (0.8/0.1/0.1) for training the overall model. As the third training data, we also used the concatenation of the two data, which is hereafter referred to as BNC-Wikipedia. We applied our adaptive joint learning method to verb-object phrases observed more than K times in each corpus. K was set to 10 for the BNC data and 100 for the Wikipedia and BNC-Wikipedia data. Consequently, the non-compositional embeddings were assigned to 17,817, 28,933, and 30,682 verb-object phrase types in the BNC, Wikipedia, and BNC-Wikipedia data, respectively.

Training Details
The model parameters consist of d-dimensional word embeddings for nouns, non-compositional phrase embeddings, d×d-dimensional matrices for verbs and prepositions, and a weight vector W for α(VO). All the model parameters are jointly optimized. We initialized the embeddings and matrices with zero-mean Gaussian random values with a variance of 1/d and 1/d², respectively, and W with zeros. Initializing W with zeros forces the initial value of each α(VO) to be 0.5 since we use the logistic function to compute α(VO).
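The initialization scheme can be sketched as follows, assuming variances of 1/d for embeddings and 1/d² for matrices (i.e., standard deviations of 1/√d and 1/d). The random seed, the number of features, and the toy feature vector are assumptions made only for this example.

```python
import math
import random
from typing import List


def init_embedding(d: int, rng: random.Random) -> List[float]:
    """Zero-mean Gaussian values with variance 1/d."""
    std = math.sqrt(1.0 / d)
    return [rng.gauss(0.0, std) for _ in range(d)]


def init_matrix(d: int, rng: random.Random) -> List[List[float]]:
    """Zero-mean Gaussian values with variance 1/d^2 (standard deviation 1/d)."""
    std = 1.0 / d
    return [[rng.gauss(0.0, std) for _ in range(d)] for _ in range(d)]


d = 25
rng = random.Random(0)  # fixed seed: an assumption, for reproducibility only

noun_embedding = init_embedding(d, rng)
verb_matrix = init_matrix(d, rng)

num_features = 4                       # hypothetical feature count
W = [0.0] * num_features               # weight vector initialized with zeros
phi = [1.0, 0.0, 0.3, -0.2]            # toy feature vector
initial_alpha = 1.0 / (1.0 + math.exp(-sum(w * f for w, f in zip(W, phi))))
```

Because W starts at zero, `initial_alpha` is 0.5 regardless of the feature values, so training begins with equal weight on both types of embeddings.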
We fixed d to 25 and the mini-batch size to 100. We set candidate values for the learning rate ε to {0.01, 0.02, 0.03, 0.04, 0.05}. For the weight vector W, we employed L2-norm regularization and set the coefficient λ to {10⁻³, 10⁻⁴, 10⁻⁵, 10⁻⁶, 0}. For selecting the hyperparameters, each training process was stopped when the evaluation score on the development split decreased. Then the best performing hyperparameters were selected for each training dataset. Consequently, ε was set to 0.05 for all training datasets, and λ was set to 10⁻⁶, 10⁻³, and 10⁻⁵ for the BNC, Wikipedia, and BNC-Wikipedia data, respectively. Once the training is finished, we can use the learned embeddings and the scoring function in downstream target tasks.

Evaluation on the Compositionality Detection Function

Evaluation Settings
Datasets
First, we evaluated the learned compositionality detection function on two datasets, VJ'05 and MC'07, provided by Venkatapathy and Joshi (2005) and McCarthy et al. (2007), respectively. VJ'05 consists of 765 verb-object pairs with human ratings for the compositionality. MC'07 is a subset of VJ'05 and consists of 638 verb-object pairs. For example, the rating of "buy car" is 6, which is the highest score, indicating the phrase is highly compositional. The rating of "bear fruit" is 1, which is the lowest score, indicating the phrase is highly non-compositional.

Evaluation metric
The evaluation was performed by calculating Spearman's rank correlation scores between the averaged human ratings and the learned compositionality scores α(VO).
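For illustration, Spearman's rank correlation can be computed as the Pearson correlation of the ranks, with tied values receiving their average rank. The sketch below is a plain-Python version under that standard definition; the ratings and scores are made-up values, not results from the experiments.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    return pearson(average_ranks(x), average_ranks(y))


# Toy example: averaged human ratings vs. model scores (made-up values).
human_ratings = [6.0, 1.0, 4.5, 2.0]
model_scores = [0.71, 0.10, 0.60, 0.33]
rho = spearman(human_ratings, model_scores)
```

Since only the ordering matters, the compositionality scores in [0, 1] can be compared directly against the 1-to-6 human rating scale without rescaling.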

Ensemble technique
We also produced the result by employing an ensemble technique. More concretely, we used the averaged compositionality scores from the results of the BNC and Wikipedia data for the ensemble result.
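The averaging step is simple enough to show directly. This sketch assumes that each trained model yields a mapping from phrases to scores and that only phrases scored by every model are averaged; the scores and the name `ensemble_scores` are made up for illustration.

```python
from typing import Dict


def ensemble_scores(*score_maps: Dict[str, float]) -> Dict[str, float]:
    """Average the scores of phrases that every model has scored."""
    shared = set(score_maps[0]).intersection(*score_maps[1:])
    return {
        phrase: sum(scores[phrase] for scores in score_maps) / len(score_maps)
        for phrase in shared
    }


# Toy compositionality scores from two separately trained models (made-up values).
bnc_scores = {"bear fruit": 0.25, "buy car": 0.75}
wikipedia_scores = {"bear fruit": 0.5, "buy car": 0.25, "take toll": 0.125}

averaged = ensemble_scores(bnc_scores, wikipedia_scores)
```

The same function accepts more than two score maps, which is how a three-model ensemble could be formed.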

Result Overview
Table 1 shows our results and the state of the art. Our method outperforms the previous state of the art in all settings. The result denoted as Ensemble is the one that employs the ensemble technique, and achieves the strongest correlation with the human-annotated datasets. Even without the ensemble technique, our method performs better than all of the previous methods. Kiela and Clark (2013) used window-based co-occurrence vectors and improved their score using WordNet hypernyms. By contrast, our method does not rely on such external resources, and only needs parsed corpora. We should note that Kiela and Clark (2013) reported that their score did not improve when using parsed corpora. Our method also outperforms DSPROTO+, which used a small amount of the labeled data, while our method is fully unsupervised.
We calculated confidence intervals (P < 0.05) using bootstrap resampling (Noreen, 1989). For example, for the results using the BNC-Wikipedia data, the intervals on MC'07 and VJ'05 are (0.455, 0.574) and (0.475, 0.579), respectively. These results show that our method significantly outperforms the previous state-of-the-art results.
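A percentile bootstrap over paired samples is one common way to obtain such intervals; whether it matches the exact procedure used here is an assumption. The sketch below resamples (x, y) pairs with replacement and takes the empirical quantiles of a user-supplied statistic. For brevity it uses a Pearson correlation as the statistic, whereas the evaluation above would plug in Spearman's rank correlation. The data and names are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Statistic = Callable[[Sequence[float], Sequence[float]], float]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    denom = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return cov / denom if denom > 0.0 else 0.0  # guard against degenerate resamples


def bootstrap_ci(
    x: Sequence[float],
    y: Sequence[float],
    statistic: Statistic,
    n_resamples: int = 1000,
    level: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for a paired statistic."""
    rng = random.Random(seed)
    n = len(x)
    stats: List[float] = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample pairs with replacement
        stats.append(statistic([x[i] for i in idx], [y[i] for i in idx]))
    stats.sort()
    lo_index = int((1.0 - level) / 2.0 * n_resamples)
    hi_index = n_resamples - 1 - lo_index
    return stats[lo_index], stats[hi_index]


# Toy paired data (made-up values): y is a noisy increasing function of x.
x = [float(i) for i in range(20)]
y = [2.0 * i + (-1.0) ** i for i in range(20)]
lower, upper = bootstrap_ci(x, y, pearson)
```

If the lower bound of such an interval lies above a competing system's score, the difference would be considered significant at the chosen level.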

Analysis of Compositionality Scores
Figure 2 shows how α(VO) changes for the seven phrases during the training on the BNC data. As shown in the figure, starting from 0.5, α(VO) for each phrase converges to its corresponding value. The differences in the trends indicate that our method can adaptively learn compositionality levels for the phrases. Table 2 shows the learned compositionality scores for the three groups of the examples along with the gold-standard scores given by the annotators. The group (A) is considered to be consistent with the gold-standard scores, the group (B) is not, and the group (C) shows examples for which the difference between the compositionality scores of our results is large.

Characteristics of light verbs
The verbs "take", "make", and "have" are known as light verbs, and the scoring function tends to assign low scores to light verbs. In other words, our method can recognize that the light verbs are frequently used to form idiomatic (i.e., non-compositional) phrases. To verify the assumption, we calculated the average compositionality score for each verb by averaging the compositionality scores of the phrases it forms with its candidate objects. Here we used 135 verbs which take more than 30 types of objects in the BNC data. Table 3 shows the 10 highest and lowest average scores with the corresponding verbs. We see that relatively low scores are assigned to the light verbs as well as other verbs which often form idiomatic phrases. As shown in the group (B) in Table 2, however, light verb phrases are not always non-compositional. Despite this, the learned function assigns low scores to compositional phrases formed by the light verbs. These results suggest that using a more flexible scoring function may further strengthen our method.
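The per-verb averaging can be sketched as a simple aggregation, assuming the learned scores are available as a mapping from (verb, object) pairs to α(VO). The scores below are made up, and the threshold parameter `min_object_types` is a hypothetical name for the "more than 30 types of objects" filter.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Phrase = Tuple[str, str]  # (verb, object)


def average_score_per_verb(
    scores: Dict[Phrase, float], min_object_types: int = 30
) -> Dict[str, float]:
    """Average alpha(VO) over the objects of each verb.

    Only verbs with more than `min_object_types` distinct objects are kept.
    """
    by_verb: Dict[str, List[float]] = defaultdict(list)
    for (verb, _obj), score in scores.items():
        by_verb[verb].append(score)
    return {
        verb: sum(values) / len(values)
        for verb, values in by_verb.items()
        if len(values) > min_object_types
    }


# Toy scores (made-up values); a threshold of 1 keeps the example small.
toy_scores = {
    ("take", "toll"): 0.25,
    ("take", "place"): 0.5,
    ("take", "bus"): 0.75,
    ("buy", "car"): 0.875,
}
verb_averages = average_score_per_verb(toy_scores, min_object_types=1)
```

Sorting the resulting dictionary by value would give a ranking of verbs from the least to the most compositional on average.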

Context dependence
Both our method and the two datasets, VJ'05 and MC'07, assume that the compositionality score can be computed for each phrase with no contextual information. However, in general, the compositionality level of a phrase depends on its context. For example, the idiomatic phrase "bear fruit" can be interpreted compositionally as "to yield fruit" when it refers to a plant or tree. We manually inspected the BNC data to check whether the phrase "bear fruit" is used in its compositional sense or in its idiomatic sense ("to yield results"), and found that most of its occurrences are idiomatic. In the model training, our method is affected by the majority usage and fits the evaluation datasets, where the phrase "bear fruit" is regarded as highly non-compositional. Incorporating contextual information into the compositionality scoring function is a promising direction for future work.

Effects of Ensemble
We used the two different corpora for constructing the training data, and our method achieves the state-of-the-art results in all settings. To inspect the results on VJ'05, we calculated the correlation score between the outputs from our results of the BNC and Wikipedia data. The correlation score is 0.674; that is, the two different corpora lead to reasonably consistent results, which indicates the robustness of our method. However, the correlation score is still much lower than perfect correlation; in other words, there are disagreements between the outputs learned with the corpora. The group (C) in Table 2 shows two such examples. In these cases, the ensemble technique is helpful in improving the results, as shown in the examples. Another interesting observation in our results is that the result of the ensemble technique outperforms that of the BNC-Wikipedia data, as shown in Table 1. This shows that separately using training corpora of different nature and then performing the ensemble technique can yield better results. By contrast, many of the previous studies on embedding-based methods combine different corpora into a single dataset, or use multiple corpora just separately and compare them (Hashimoto and Tsuruoka, 2015; Muraoka et al., 2014; Pennington et al., 2014). It would be worth investigating whether the results in the previous work can be improved by ensemble techniques.

Evaluation on the Phrase Embeddings

Evaluation Settings
Dataset
Next, we evaluated the learned embeddings on the transitive verb disambiguation dataset GS'11 provided by Grefenstette and Sadrzadeh (2011). GS'11 consists of 200 pairs of transitive verbs and each verb pair takes the same subject and object. For example, the transitive verb "run" is known as a polysemous word and this task requires one to identify the meanings of "run" and "operate" as similar to each other when taking "people" as their subject and "company" as their object. In the same setting, however, the meanings of "run" and "move" are not similar to each other. Each pair has multiple human ratings indicating how similar the phrases of the pair are.

Evaluation metric
The evaluation was performed by calculating Spearman's rank correlation scores between the human ratings and the cosine similarity scores of v(SVO) in Eq. (12). Following the previous studies, we used the gold-standard ratings in two ways: averaging the human ratings for each SVO tuple (GS'11a) and treating each human rating separately (GS'11b).
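The two ways of using the gold-standard ratings can be sketched as follows, assuming v(SVO) = v(S) ⊙ v(VO) and cosine similarity between the two SVO embeddings of each verb pair. All embeddings and ratings are made-up toy values, and the variable names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def svo_embedding(v_s: Vector, v_vo: Vector) -> Vector:
    """v(SVO) = v(S) (element-wise product) v(VO)."""
    return [s * x for s, x in zip(v_s, v_vo)]


# Toy embeddings (made-up values).
v_people = [1.0, 2.0]
v_run_company = [0.5, 1.0]
v_operate_company = [0.6, 0.9]
v_move_company = [-1.0, 0.2]

run = svo_embedding(v_people, v_run_company)
items = [
    {"sim": cosine(run, svo_embedding(v_people, v_operate_company)), "ratings": [7, 6, 6]},
    {"sim": cosine(run, svo_embedding(v_people, v_move_company)), "ratings": [2, 1, 3]},
]

# GS'11a style: one data point per pair, using the averaged human rating.
sims_a = [item["sim"] for item in items]
gold_a = [sum(item["ratings"]) / len(item["ratings"]) for item in items]

# GS'11b style: one data point per individual human rating.
sims_b = [item["sim"] for item in items for _ in item["ratings"]]
gold_b = [rating for item in items for rating in item["ratings"]]
```

Spearman's rank correlation would then be computed between `sims_a` and `gold_a`, and separately between `sims_b` and `gold_b`.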

Ensemble technique
We used the same ensemble technique described in Section 5.1. In this task we produced two ensemble results: Ensemble A and Ensemble B. The former used the averaged cosine similarity from the results of the BNC and Wikipedia data, and the latter further incorporated the result of the BNC-Wikipedia data.

Baselines
We compared our adaptive joint learning method with two baseline methods. One is the method in Hashimoto and Tsuruoka (2015), which is equivalent to fixing α(VO) to 1 in our method. The other is fixing α(VO) to 0.5 in our method, which serves as a baseline to evaluate how effective the proposed adaptive weighting method is.

Result Overview
Table 4 shows our results and the state of the art; our method outperforms almost all of the previous methods on both datasets. Again, the ensemble technique further improves the results, and overall, Ensemble B yields the best results.
The scores in Hashimoto and Tsuruoka (2015), i.e., the baseline results with α(VO) = 1 in our method, have been the best to date. As shown in Table 4, our method outperforms the baseline results with α(VO) = 0.5 as well as those with α(VO) = 1. We see that our method improves the baseline scores by adaptively combining compositional and non-compositional embeddings. Along with the results in Table 1, these results show that our method allows us to improve the composition function by jointly learning non-compositional embeddings and the scoring function for compositionality detection.

Analysis of the Learned Embeddings
We inspected the effects of adaptively weighting the compositional and non-compositional embeddings. Table 5 shows the five closest neighbor phrases in terms of the cosine similarity for the three idiomatic phrases "take toll", "catch eye", and "bear fruit" as well as the two non-idiomatic phrases "make noise" and "buy car". The examples trained with the Wikipedia data are shown for our method and the two baselines, i.e., α(VO) = 1 and α(VO) = 0.5. As shown in Table 2, the compositionality levels of the first three phrases are low and their non-compositional embeddings are dominantly used to represent their meaning. One observation with α(VO) = 1 is that head words (i.e., verbs) are emphasized in the shown examples except "take toll" and "make noise". As with other embedding-based methods, the compositional embeddings are highly affected by their component words. As a result, phrases consisting of the same verb and similar objects are often listed as the closest neighbors. By contrast, our method flexibly allows us to adaptively omit the information about the component words. Therefore, our method puts more weight on capturing the idiomatic aspects of the example phrases by adaptively using the non-compositional embeddings.
The results of α(VO) = 0.5 are similar to those with our proposed method, but we can see some differences. For example, the phrase list for "make noise" of our proposed method captures offensive meanings, whereas that of α(VO) = 0.5 is somewhat ambiguous. As another example, the phrase lists for "buy car" show that our method better captures the semantic similarity between the objects than α(VO) = 0.5. This is achieved by adaptively assigning a relatively large compositionality score (0.71) to the phrase to use the information about the object "car".
We should note that "make noise" is highly compositional but our method outputs α(make noise) = 0.33, and the phrase list of α(VO) = 1 is the most appropriate in this case. Improving the compositionality detection function should thus further improve the learned embeddings.

Related Work
Learning embeddings of words and phrases has been widely studied, and phrase embeddings have proven effective in many language processing tasks, such as machine translation (Cho et al., 2014; Sutskever et al., 2014), sentiment analysis and semantic textual similarity (Tai et al., 2015). Most phrase embeddings are constructed from word-level information via various kinds of composition functions, such as long short-term memory (Hochreiter and Schmidhuber, 1997) recurrent neural networks. Such composition functions should be powerful enough to efficiently encode information about all the words into the phrase embeddings. By simultaneously considering the compositionality of the phrases, our method would be helpful in saving the composition models from having to be powerful enough to perfectly encode the non-compositional phrases. As a first step towards this purpose, in this paper we have shown the effectiveness of our method on the task of learning verb phrase embeddings.
Many studies have focused on detecting the compositionality of a variety of phrases (Lin, 1999), including the ones on verb phrases (Diab and Bhutada, 2009; McCarthy et al., 2003) and compound nouns (Farahmand et al., 2015; Reddy et al., 2011). Compared to statistical feature-based methods (McCarthy et al., 2007; Venkatapathy and Joshi, 2005), recent methods use word and phrase embeddings (Kiela and Clark, 2013; Yazdani et al., 2015). The embedding-based methods assume that word embeddings are given in advance and, as a post-processing step, learn or simply employ composition functions to compute phrase embeddings. In other words, there is no distinction between compositional and non-compositional phrases. Yazdani et al. (2015) further proposed to incorporate latent annotations (binary labels) for the compositionality of the phrases. However, binary judgments cannot consider numerical scores of the compositionality. By contrast, our method adaptively weights the compositional and non-compositional embeddings using the compositionality scoring function.

Conclusion and Future Work
We have presented a method for adaptively learning compositional and non-compositional phrase embeddings by jointly detecting compositionality levels of phrases. Our method achieves the state of the art on a compositionality detection task of verb-object pairs, and also improves upon the previous state-of-the-art method on a transitive verb disambiguation task. In future work, we will apply our method to other kinds of phrases and tasks.

Figure 2: Trends of α(VO) during the training on the BNC data.

Table 2: Examples of the compositionality scores.

Table 3: The 10 highest and lowest average compositionality scores with the corresponding verbs on the BNC data.

Table 5: Examples of the closest neighbors in the learned embedding space. All of the results were obtained by using the Wikipedia data, and the values of α(VO) are the same as those in Table 2.

Table 4: Transitive verb disambiguation task. The results for α(VO) = 1 are reported in Hashimoto and Tsuruoka (2015).