Syntactic Dependencies and Distributed Word Representations for Analogy Detection and Mining

Distributed word representations capture relational similarities by means of vector arithmetic, giving high accuracies on analogy detection. We empirically investigate the use of syntactic dependencies for improving Chinese analogy detection based on distributed word representations, showing that dependency-based embeddings do not perform better than ngram-based embeddings, but that dependency structures can be used to improve analogy detection by filtering candidates. In addition, we show that distributed representations of dependency structures can be used for measuring relational similarities, thereby helping analogy mining.


Introduction
Relational similarity measures the correspondence between word-word relations (Medin et al., 1990). It is relevant to many tasks in NLP (Turney, 2006), such as word sense disambiguation, information extraction, question answering, information retrieval, semantic role identification and metaphor detection. Typical tasks on relational similarity include analogy detection, which measures the degree of relational similarities, and analogy mining, which extracts analogous word pairs from unstructured text.
Recently, distributed word representations (i.e. embeddings) (Mikolov et al., 2013a; Mikolov et al., 2013b; Levy and Goldberg, 2014b) have been used for unsupervised analogy detection. Mikolov et al. use attributional similarities between words in a relation to compute relational similarities, and show that the method outperforms the best system in the SemEval 2012 shared task on analogy detection. Levy and Goldberg (2014b) further improve the relational similarity measure of Mikolov et al. using novel arithmetic combinations of attributional similarities. For simplicity, we call the method of Mikolov et al. embedding-based analogy detection, without stressing the difference between distributed and distributional (i.e. counting-based) word representations.
Most work on embedding-based analogy detection uses relational similarities as a measure of the quality of embeddings. However, relatively little has been done in the opposite direction, exploring how to leverage embeddings for improving relational similarity algorithms. We empirically study the use of word embeddings for Chinese analogy detection and mining, leveraging syntactic dependencies, which have been shown to be closely associated with semantic relations (Levin, 1993; Chiu et al., 2007). Compared with many other languages, this association is particularly strong for Chinese, which is fully configurational and lacks morphology. To our knowledge, relatively little work has been reported on Chinese relational similarities, compared to other tasks in Chinese NLP, including syntactic parsing, information extraction and machine translation.
We work on three specific problems. First, we study the effect of dependency-based word embeddings for analogy detection. There are two variations of the skip-gram embedding model of Mikolov et al., one training the distributed representation of a word using its context words in a local ngram window (Mikolov et al., 2013a), and the other training the distributed representation of a word using words in a syntactic dependency context (Levy and Goldberg, 2014b; Bansal et al., 2014). The latter has attracted much recent attention due to its potential in capturing more syntactic regularities. It has been shown to outperform the former in a variety of NLP tasks, and can potentially also improve relational similarity measurement. Our experiments on both English and Chinese show that the dependency-context embeddings consistently under-perform ngram-context embeddings. We give some theoretical justifications for these findings.
Second, we propose to use syntactic dependencies as a context for improving embedding-based analogy detection, pruning the search space and filtering noise using syntactic dependencies. While highly useful for measuring relational similarities, attributional similarities between words are not the only source of information for analogy detection. Traditional methods, such as Turney and Littman (2005), Turney (2006), Chiu et al. (2007) and Ó Séaghdha and Copestake (2009), also leverage the context between word pairs in a corpus for better accuracies, which current embedding-based methods ignore. Results show that our proposed method achieves significant improvements for this task.
Third, we show that a novel distributed representation of syntactic dependencies between word pairs can be used to mine analogous dependencies from a large Chinese corpus. Inspired by the fact that distributed word representations can be used to measure word similarities, we use our distributed dependency representations to measure relation similarities. We propose a bootstrapping algorithm for analogy mining using dependency embeddings, and experiments on a large Chinese corpus show that the method can achieve a precision of 95.2% at a recall of 56.8%.
Our automatically-parsed corpus, trained embeddings and evaluation datasets are released publicly at http://people.sutd.edu.sg/ yue_zhang/publication.html. To our knowledge, we are the first to present results on Chinese analogy detection and to release large-scale Chinese word embeddings.

Relational Similarity Tasks
There are three main tasks for relational similarity. The first is relation classification, which has been used in Task 2 of SemEval 2012 (Jurgens et al., 2012). In this task, all four words in two word pairs are given, and one needs to judge whether they belong to the same relation type. In order to address this task, various supervised methods have been used (Bollegala et al., 2008; Herdağdelen and Baroni, 2009; Turney, 2013).
The second task is analogy detection (Mikolov et al., 2013b), which takes three words in two word pairs, and searches for a most suitable word from the vocabulary to recover the hidden word. This task has been addressed using word embeddings (Mikolov et al., 2013b;Levy and Goldberg, 2014b).
The third task is analogy mining (Chiu et al., 2007), which takes one word pair belonging to a certain semantic relation as a seed, and searches for all the word pairs that share the same relation with the seed. Compared with relation classification and analogy detection, analogy mining can be practically more useful because it requires less given information, and provides a large quantity of analogous word pairs automatically.

Skip-gram Word Embeddings
As a by-product of neural language models (Bengio et al., 2003;Mnih and Hinton, 2007), word embeddings are distributed vector representations of words, trained using local contexts. They capture linguistic regularities in languages (Mikolov et al., 2013b) and have been used in various tasks (Collobert and Weston, 2008;Turian et al., 2010;Socher et al., 2011).
In this paper, we apply the Skip-gram method of Mikolov et al. (2013a) for training embeddings, which works by maximizing the probability of the context words given a target word. Mikolov et al. (2013b) use an ngram window as the context, and observe that the resulting embeddings are highly useful for unsupervised analogy detection.

Embedding-based Analogy Detection
Formally, the task of analogy detection is to find a word b* given a pair of words a:b and a word a* such that a*:b* is analogous to a:b. Mikolov et al. (2013b) show that the task can be solved by finding a word that maximizes:
\[ \arg\max_{b^* \in V} \; \mathrm{sim}(b^*,\; b - a + a^*) \tag{1} \]
where V is the vocabulary and sim is a similarity measure, typically the cosine function. Levy and Goldberg (2014b) show that, for unit-length vectors, Equation 1 is equivalent to:
\[ \arg\max_{b^* \in V} \; \bigl( \cos(b^*, b) - \cos(b^*, a) + \cos(b^*, a^*) \bigr) \tag{2} \]
As a result, the goal of analogy detection is to find a word b* which is similar to b and a* but different from a. Levy and Goldberg (2014b) further propose to substitute the additive functions in Equation 2 with multiplicative functions:
\[ \arg\max_{b^* \in V} \; \frac{\cos(b^*, b)\,\cos(b^*, a^*)}{\cos(b^*, a) + \varepsilon} \tag{3} \]
Here ε = 0.001 is used to prevent division by zero. Their experiments show that the use of Equation 3 can improve the state of the art. Following Levy and Goldberg (2014b), we refer to Equations 1 and 2 as 3COSADD and to Equation 3 as 3COSMUL.
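The two objectives can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in the experiments: the toy vectors are hypothetical, excluding the three question words from the candidates is an assumption, and so is the mapping of cosines from [-1, 1] to [0, 1] in the multiplicative variant (a common way to keep the ratio non-negative).

```python
import math

def cos(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def three_cos_add(emb, a, b, a_star):
    """3COSADD (Equation 2): maximize cos(b*, b) - cos(b*, a) + cos(b*, a*)."""
    def score(w):
        return cos(emb[w], emb[b]) - cos(emb[w], emb[a]) + cos(emb[w], emb[a_star])
    # Assumption: the three question words are excluded from the candidates.
    return max((w for w in emb if w not in (a, b, a_star)), key=score)

def three_cos_mul(emb, a, b, a_star, eps=0.001):
    """3COSMUL (Equation 3): maximize cos(b*, b) * cos(b*, a*) / (cos(b*, a) + eps)."""
    def shifted(w, x):
        # Assumption: cosines are mapped from [-1, 1] to [0, 1].
        return (cos(emb[w], emb[x]) + 1.0) / 2.0
    def score(w):
        return shifted(w, b) * shifted(w, a_star) / (shifted(w, a) + eps)
    return max((w for w in emb if w not in (a, b, a_star)), key=score)

# Hypothetical 3-d toy vectors: [city-ness, country A, country B].
toy = {
    "Sarajevo": [1.0, 1.0, 0.0],
    "Bosnia":   [0.0, 1.0, 0.0],
    "London":   [1.0, 0.0, 1.0],
    "England":  [0.0, 0.0, 1.0],
    "Paris":    [1.0, 0.0, 0.2],
}
print(three_cos_add(toy, "Sarajevo", "Bosnia", "London"))
print(three_cos_mul(toy, "Sarajevo", "Bosnia", "London"))
```

On these toy vectors both objectives prefer "England" over the distractor "Paris", because "Paris" is penalized for its similarity to a ("Sarajevo").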
The frequent correlation between semantic relations and syntactic dependencies can be due to the lack of morphology and function words in Chinese. In fact, Chinese syntactic ambiguities often need to be resolved by leveraging semantic information (Xiong et al., 2005). Although not all occurrences of semantically-related word pairs must also form a syntactic dependency in a corpus, we show that syntactic dependencies can effectively improve analogy detection.

Dependency-context Word Embeddings for Analogy Detection
A first use of syntactic dependencies for embedding-based analogy detection is to use them directly for embeddings. Recently, a dependency context has been used for the skip-gram method, for capturing more syntactic regularities. Taking the sentence in Figure 1 for example (tokens are given by their English glosses), a bi-gram context for the word "graduate" can be "Obama, President, from, Harvard", while a dependency context of the same word can be "in 1991/ADV, President/SBV, from/CMP, [prepositional object]/POB_from", where "ADV, SBV, CMP, POB" indicate adverbial modifier, subject, complement and prepositional object, respectively.
It has been shown that a dependency context leads to embeddings that better help parsing (Bansal et al., 2014) and measuring word similarity (Levy and Goldberg, 2014a), compared with ngram contexts. However, little previous work has systematically compared dependency contexts with ngram contexts in analogy detection. We empirically study this problem (cf. Section 6.3), finding that a dependency context leads to significantly worse analogy detection results for both Chinese and English using state-of-the-art embedding-based methods (Levy and Goldberg, 2014b). We give an analysis in Section 6.4.

Search Space Pruning Using Syntactic Dependencies
We study an alternative way of making use of syntactic dependencies, by using them to prune the vocabulary-sized search space of analogy detection. Given two word pairs a:b and a*:b*, where b* is hidden and a is the head word, we search for dependencies, taking a* as the head word. The dependent words in the search candidates need to share the POS tag of b. If there are several types of dependencies between a and b, only the one with the highest frequency is used. We rank all resulting dependencies using the 3COSMUL objective, and take the word b* in the highest-scored dependency as the answer. For example, given the word pair (Sarajevo:Bosnia and Herzegovina), whose most frequent dependency is <Sarajevo, Bosnia and Herzegovina, ATT>, and the unknown pair (London:b*), we acquire a list of dependencies, including <London, USA, ATT>, <London, Paris, COO>, <London, Canada, ATT> and <London, England, ATT>. Some of these dependencies, such as <London, Paris, COO>, are parsed as the coordinate relation (COO), and thus pruned because the target syntactic relation is ATT. From the resulting list, the 3COSMUL objective successfully ranks the triple <London, England, ATT> as the top candidate. In contrast, Levy and Goldberg's method takes "South Africa" as the answer, which does not form an attributive-head phrase with "London".
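The pruning step described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the dependency counts, the POS tag "NS" and the `toy_scores` table (a stand-in for 3COSMUL scores) are all hypothetical, and returning `None` when no candidate survives is a choice made here, not something the text specifies.

```python
from collections import Counter

def most_frequent_relation(dep_counts, head, dependent):
    """Most frequent dependency relation observed between head and dependent."""
    rels = Counter()
    for (h, d, r), n in dep_counts.items():
        if h == head and d == dependent:
            rels[r] += n
    return rels.most_common(1)[0][0] if rels else None

def pruned_candidates(dep_counts, pos, a, b, a_star):
    """Dependents of a* that share the relation of a:b and the POS tag of b."""
    rel = most_frequent_relation(dep_counts, a, b)
    if rel is None:
        return []
    return sorted({d for (h, d, r) in dep_counts
                   if h == a_star and r == rel and pos.get(d) == pos.get(b)})

def answer(dep_counts, pos, score, a, b, a_star):
    """Rank the surviving candidates with a scoring function such as 3COSMUL."""
    candidates = pruned_candidates(dep_counts, pos, a, b, a_star)
    return max(candidates, key=score) if candidates else None

# Hypothetical corpus statistics: (head, dependent, relation) -> frequency.
toy_deps = {
    ("Sarajevo", "Bosnia", "ATT"): 40,
    ("Sarajevo", "Bosnia", "COO"): 3,
    ("London", "USA", "ATT"): 2,
    ("London", "Paris", "COO"): 25,
    ("London", "Canada", "ATT"): 1,
    ("London", "England", "ATT"): 30,
}
toy_pos = {w: "NS" for w in ("Bosnia", "USA", "Paris", "Canada", "England")}
# Hypothetical stand-in for the 3COSMUL score of each candidate.
toy_scores = {"England": 0.90, "USA": 0.50, "Canada": 0.40, "Paris": 0.95}

print(pruned_candidates(toy_deps, toy_pos, "Sarajevo", "Bosnia", "London"))
print(answer(toy_deps, toy_pos, toy_scores.get, "Sarajevo", "Bosnia", "London"))
```

In the toy data "Paris" has the highest raw score but is removed because it only occurs with London under COO, so the ranking is restricted to the ATT dependents.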

Analogy Mining Using Dependency Embeddings
Formally, analogy mining is the task of mining analogous dependencies <x_1, y_1, r>, <x_2, y_2, r>, ..., <x_n, y_n, r> that share the same relation r with a given dependency <a, b, r>. We mine analogous dependencies by considering relational similarity and attributional similarity simultaneously using the skip-gram model for embeddings.

Dependency Embedding
Inspired by the fact that word similarities can be measured using distributed word representations, we hypothesize that relational similarities can be measured by distributed relation representations. As observed in Section 2.4, semantically analogous word pairs typically stand in syntactic dependencies. We therefore use the skip-gram algorithm to train distributed representations of syntactic dependencies, and use them for mining analogous word pairs.
Algorithm 1 (Mine). Input: dependency embeddings DT, word embeddings DW, seed dependency s, thresholds α and β. Output: set of ranked dependencies WP.
With respect to the skip-gram model, words are the most common target for embeddings (Levy and Goldberg, 2014b;Levy and Goldberg, 2014a;Mikolov et al., 2013a), although continuous vector representations can be trained for other structures. For example, Mikolov et al. (2013a) take idiomatic phrases as embedding targets. Dependencies, which consist of a modifier word, a head word and a syntactic relation between them, can also be represented by continuous embeddings using the same algorithm.
To induce dependency embeddings, we take the union of the dependency contexts of both the dependent and the head of a dependency as the context. For instance, in the example sentence, the context of the dependency <President, graduate, SBV> consists of four tokens: "in 1991/ADV", "Obama/ATT", "from/CMP" and "[prepositional object]/POB_from". The same skip-gram algorithm is used to train embeddings for dependency structures.
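A minimal sketch of this context construction is given below. It assumes that a word's dependency context consists of its dependents labelled with their relations, and that the two words of the dependency itself are left out of the union; both are assumptions read off the example rather than stated rules. Tokens are English glosses, and "X" is a hypothetical placeholder for the prepositional object.

```python
def dependency_contexts(arcs, dependency):
    """Union of the labelled dependents of both words in a dependency triple."""
    w1, w2, _rel = dependency
    contexts = []
    for word in (w1, w2):
        for head, dep, rel in arcs:
            # Skip the arc that links the two words of the dependency itself.
            if head == word and dep not in (w1, w2):
                ctx = f"{dep}/{rel}"
                if ctx not in contexts:
                    contexts.append(ctx)
    return contexts

# Arcs of the example sentence as (head, dependent, relation).
arcs = [
    ("graduate", "in 1991", "ADV"),
    ("graduate", "President", "SBV"),
    ("graduate", "from", "CMP"),
    ("graduate", "X", "POB_from"),
    ("President", "Obama", "ATT"),
]
target = ("President", "graduate", "SBV")
contexts = dependency_contexts(arcs, target)

# One (target, context) training pair per context token.
training_pairs = [("|".join(target), ctx) for ctx in contexts]
for pair in training_pairs:
    print(pair)
```

Each resulting pair treats the whole dependency as a single embedding target, so the same skip-gram training procedure can be applied to it unchanged.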

Analogy Mining by Bootstrapping
A bootstrapping algorithm is used to mine analogous word pairs based on dependency-context word embeddings and dependency embeddings. Algorithm 1 shows pseudocode of the recursive bootstrapping algorithm.
The recursive function Mine (Algorithm 1) contains three steps with six parameters, including the dependency embeddings DT, word embeddings DW, a seed dependency s, and two thresholds α and β.
Step 1 (lines 3 to 5) is an initialization process, where the dependency embedding is used to return up to 100 most similar dependencies for the given seed s. These dependencies are stored in SimDT, and the candidate analogous dependency set DTSet is initialized to an empty set.

In Step 2 (lines 6 to 16), an analogy score ScoreXY is computed for each dependency Triple in SimDT by multiplying the similarity scores between the two dependents and the two heads in Triple and s, respectively. Triple is stored in the set DTSet if its ScoreXY ranks among the top α. The top score in DTSet is referred to as MScore.
In Step 3 (lines 17 to 24), if the score of a dependency Triple in DTSet is larger than β×MScore, it is used as a new seed for mining more analogous dependencies, by calling the function Mine recursively.
We take the seed dependency <play, piano, VOB> as an example to illustrate the workflow of the Mine function. In Step 1, a set of similar dependencies (e.g., <play, guitar, VOB>, <play, lyra, VOB>) is calculated using the dependency embeddings DT and stored in SimDT. Each dependency in SimDT is scored in Step 2, and those with the top α scores are put into the set DTSet. Finally, a dependency is used as a seed to mine new analogous dependencies if its score is larger than a threshold (β×MScore). For instance, the dependency <play, lyra, VOB> is used to mine the new dependency <play, zheng, VOB>, which is then used to mine other dependencies such as <blow, cucurbit flute, VOB> and <blow, sax, VOB>.
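The three steps can be sketched as a short recursive function. This is a reconstruction from the prose description, not the original Algorithm 1: the callables `most_similar` (nearest neighbours under DT) and `word_sim` (similarity under DW) are hypothetical interfaces, the `mined` dictionary that prevents revisiting a dependency is an added assumption to guarantee termination, and only dependencies that pass the β threshold are kept in the output. The toy neighbours and similarities are invented for illustration.

```python
def mine(most_similar, word_sim, seed, alpha, beta, mined=None, topn=100):
    """Recursively mine dependencies analogous to `seed`; returns {triple: score}."""
    if mined is None:
        mined = {seed: 1.0}
    # Step 1: up to `topn` dependencies most similar to the seed.
    sim_dt = most_similar(seed, topn)
    # Step 2: score each triple by the product of word similarities, keep the top alpha.
    scored = [(word_sim(t[0], seed[0]) * word_sim(t[1], seed[1]), t) for t in sim_dt]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    dt_set = scored[:alpha]
    if not dt_set:
        return mined
    m_score = dt_set[0][0]
    # Step 3: triples scoring above beta * MScore become new seeds.
    for score, triple in dt_set:
        if score > beta * m_score and triple not in mined:
            mined[triple] = score
            mine(most_similar, word_sim, triple, alpha, beta, mined, topn)
    return mined

# Hypothetical nearest neighbours in the dependency-embedding space.
NEIGHBOURS = {
    ("play", "piano", "VOB"): [("play", "guitar", "VOB"), ("use", "scissors", "VOB")],
    ("play", "guitar", "VOB"): [("play", "piano", "VOB"), ("blow", "sax", "VOB")],
    ("blow", "sax", "VOB"): [("play", "guitar", "VOB")],
}
# Hypothetical word similarities (symmetric; unknown pairs default to 0).
SIMS = {("play", "blow"): 0.8, ("play", "use"): 0.5, ("piano", "guitar"): 0.9,
        ("piano", "scissors"): 0.1, ("guitar", "sax"): 0.8}

def toy_most_similar(seed, topn):
    return NEIGHBOURS.get(seed, [])[:topn]

def toy_word_sim(w1, w2):
    if w1 == w2:
        return 1.0
    return SIMS.get((w1, w2), SIMS.get((w2, w1), 0.0))

result = mine(toy_most_similar, toy_word_sim, ("play", "piano", "VOB"), alpha=20, beta=0.6)
for triple, score in sorted(result.items(), key=lambda kv: -kv[1]):
    print(triple, round(score, 2))
```

With these toy values the guitar and sax dependencies are reached through successive seeds, while the scissors dependency falls below β×MScore and is discarded.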

Word Embeddings
We train three sets of word embeddings: NG5 (ngram context with 5 words to the left of the target word and 5 words to the right), NG2 (2 words to the left and right) and DEP (dependency context), and one set of dependency embeddings DT (dependency context), using the Skip-Gram model. WORD2VEC is used to train NG5 and NG2, and WORD2VECF is used to train DEP and DT. The negative-sampling parameter is set to 15 in all the training processes.
All embeddings are trained on a free Chinese news archive that contains about 170 million sentences and 3.4 billion words. We segment and parse these sentences using the MVT implementation of ZPar 0.7 (Zhang and Clark, 2011), which is trained on a large-scale annotated corpus and achieves state-of-the-art analysis accuracy on contemporary Chinese (Qiu et al., 2014). Targets and contexts for word and dependency embeddings are filtered with a minimum frequency of 100 and 10, respectively, and all four types of embeddings are trained with 200 dimensions.

Datasets and Evaluation Metrics
Three datasets are used for evaluating Chinese embeddings. First, we construct a set of semantic analogy questions. This set contains five types of semantic analogy questions: capital-country (136 word pairs, 18354 analogy questions), provincial capital-province (28, 756), city-province (637, 386262), family member (male-female) (18, 306) and currency-country (62, 3782). We collect the five types of word pairs and then produce analogy questions automatically by concatenating two word pairs. The resulting analogy dataset contains about 400K analogy questions. We refer to this dataset as the Chinese Analogy Question Set (CAQS).
Because embeddings are central to analogy detection, yet there are few large-scale evaluation results on Chinese embeddings in the literature, we also evaluate the embeddings themselves on two datasets. The first one is the Chinese WordSim (CWS), translated from the English WordSim-353 set and re-scored by native Chinese speakers. This dataset consists of 297 word pairs.
The second one is the Chinese thesaurus Tongyicicilin (Cilin) (Che et al., 2010), which groups 74,000 Chinese words into a five-layer hierarchy and has been used for evaluating the accuracy of word similarity by traditional sparse vector space models (Qiu et al., 2011). The third level of Cilin, which contains 1428 classes, is used to evaluate whether two words are semantically similar. For comparison between Chinese and English, we also use an English analogy question dataset, the Google dataset (Mikolov et al., 2013a), to evaluate the English word embeddings of Levy and Goldberg (2014a) on analogy detection.
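The construction of CAQS questions by concatenating two word pairs can be sketched as below. The sketch assumes that every ordered combination of two distinct pairs of the same type yields one question, which gives n(n-1) questions for n pairs; the pair names are hypothetical placeholders.

```python
from itertools import permutations

def build_questions(pairs):
    """One question a:b :: a*:b* for every ordered combination of two distinct pairs."""
    return [(a, b, a_star, b_star)
            for (a, b), (a_star, b_star) in permutations(pairs, 2)]

# Hypothetical placeholder pairs standing in for 28 provincial capital-province pairs.
pairs = [(f"capital{i}", f"province{i}") for i in range(28)]
questions = build_questions(pairs)
print(len(questions))
print(questions[0])
```

Under this assumption the counts agree with those reported for three of the five types (28 pairs give 756 questions, 18 give 306, 62 give 3782). The capital-country and city-province counts are lower than n(n-1), presumably because some combinations are excluded.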
On both the CAQS and the Google datasets, the 3COSMUL method (Levy and Goldberg, 2014b) is used to answer analogy questions based on the given embeddings. The results on the CWS dataset are evaluated using the two standard metrics for the task, namely Spearman's ρ and Kendall's τ rank correlation coefficients. The results on Cilin are evaluated using Precision@K: the percentage of the top-K candidates that belong to the Cilin category of the target word. A candidate word is counted as correct if it belongs to the same third-level category in Cilin as the target word.
Table 3: English results on the Google set.
Table 1 shows the results of the three Chinese embeddings on Cilin and CWS, where NG2 performs much better than NG5 on both datasets. This demonstrates that one does not need to use large window sizes in training word-based embeddings for capturing word similarities. The result is similar to the finding of Shi et al. (2010), which indicates that a window size of 2 is better than a window size of 4 for capturing word similarity using distributional word representations.
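The Precision@K metric used on Cilin can be illustrated with a short sketch. It assumes that a word may belong to several third-level categories and counts a candidate as correct when it shares at least one of them with the target; the category codes and words below are hypothetical.

```python
def precision_at_k(ranked_candidates, target, categories, k):
    """Fraction of the top-k candidates sharing a category with the target word."""
    top = ranked_candidates[:k]
    if not top:
        return 0.0
    target_cats = categories.get(target, set())
    hits = sum(1 for w in top if categories.get(w, set()) & target_cats)
    return hits / len(top)

# Hypothetical third-level category codes.
categories = {
    "target": {"C1"},
    "w1": {"C1"},
    "w2": {"C2"},
    "w3": {"C1", "C3"},
}
ranked = ["w1", "w2", "w3", "w4"]  # "w4" is not in the thesaurus
for k in (1, 2, 4):
    print(k, precision_at_k(ranked, "target", categories, k))
```

Averaging this value over all target words gives the per-dataset P@K figures of the kind reported in the tables.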

Dependency-based and
DEP performs slightly worse than NG2 on CWS and on Cilin in P@1 and P@5. However, it achieves better results on Cilin in P@10 to P@100, when more candidate similar words are evaluated. This suggests that DEP groups semantically similar words more consistently, whereas NG5 and NG2 mix in more semantically related words. This finding is consistent with that of Levy and Goldberg (2014a).
Table 2 shows the results of the three Chinese embeddings on CAQS. Unlike on Cilin and CWS, NG5 outperforms DEP, and is also slightly better than NG2. A similar tendency is shown in Table 3 for the three English embeddings evaluated on the Google dataset. These results show that dependency embeddings are relatively weak for answering analogy questions. On the other hand, the performance also varies across different relation types.
Table 4: Comparison between NG2, NG5 and DEP Embeddings. (P: personal name, C: city name)

Analysis
To analyze the difference between the three Chinese embedding methods qualitatively, we manually inspect the words "wear", "Guan Yu" (a character in the novel Romance of the Three Kingdoms) and "Zhengzhou" (a city). Their most similar words are shown in Table 4.

Analogy Detection
As mentioned in Section 2.3, both 3COSADD and 3COSMUL seek a word b* that is similar to b and a* but dissimilar to a. Ideally, the two word pairs b:b* and a:a* should be semantically similar while the two word pairs a:b and a*:b* should be semantically related. Therefore, 3COSADD and 3COSMUL require the embeddings to give higher cosine scores for both semantically similar and related words.
Our analysis above shows that word-context embeddings tend to mix semantically related and similar words, but dependency-context embeddings only capture semantic similarity. This partly explains why dependency-context word embeddings are weak for analogy detection.
It has also been shown in Section 6.3 that the performance of analogy detection varies across different types of relations, which indicates that there are more sophisticated underlying factors. One intuitive explanation is that different semantic relations correspond to different syntactic dependency structures. For example, the male-female family member relation is expected to stand less frequently in a syntactic dependency relation, compared with geographic relations such as city-country, which stand frequently in attributive syntactic relations (e.g. "London, England"). As a result, where the coupling between syntactic and semantic relations is weak, our analysis in Section 6.3, like other work based on syntactic relations, may be of limited applicability.

Syntactic Dependencies for Improved Analogy Detection
The results on CAQS using the method in Section 4 are shown in the IMP rows of Table 2. The method achieves significant improvements (from 80.0% to 90.9% using NG5) compared with Levy and Goldberg's method. In addition, DEP also performs significantly better with IMP than with MUL, with an increase from 22.0% to 89.8%. The main reason for this improvement is that the filtering process using syntactic dependencies successfully prunes noisy words.
Error analysis shows that the main errors of the improved method are quite different from those of the baseline. For instance, the main errors of Levy and Goldberg's method for the city-province relation are caused by giving another province as the answer, while the improved method gives the name of the country as the answer. This is because irrelevant provinces do not co-occur frequently with the city in syntactic dependencies, and hence can be filtered by our method. On the other hand, both the country name and the province name co-occur frequently with the city name in syntactic dependencies, and our method cannot make a choice between them.

Dependency Structure Embeddings for Analogy Mining
As shown in Table 5, we use six seeds to mine analogous dependencies. The first seed is used for development and the others for testing. The first three seeds, the fourth seed and the last two seeds belong to the Use:Thing, Produce:Thing and Sub-Location:Location relations, respectively. α and β are set to 20 and 0.6, respectively. Each set of mined dependencies, together with the seed dependency and relation type, is shown to two human evaluators, who are required to give a Yes/No answer to each dependency in the set. We take the average scores of the two evaluators (the average inter-annotator agreement is 0.95) as the final precision scores.
As shown in the table, the precisions using different seeds are quite different, ranging from 40% to 96%. One possible reason is that different relations have different numbers of analogous dependencies, ranging from dozens to thousands, and thus the fixed thresholds tuned on a development seed do not apply equally well to all test cases. For instance, the seed verb "play" and its analogous actions, "blow" and a second verb also glossed as "play", are all human actions on musical instruments, while the actions "eat" and "write" can apply to many patients. For the seed <play, piano, VOB>, irrelevant results such as <use, scissors, VOB> and <use, flashlight, VOB> have the verb "use", which is also a human action, yet cannot be considered as usage of the patients "scissors" and "flashlight". Because of the stricter selectional preference of "play", its precision of analogy mining is lower.
We tentatively measure the recall of the algorithm by taking the first three types of word pairs in CAQS as the gold set, which contains 801 word pairs. All three types of word pairs belong to the relation Sub-Location:Location. The recall is computed as the percentage of the gold word pairs covered by the mined dependencies. When using the two seeds <Wuhan, Hubei, ATT> and <Beijing, China, ATT> for analogy mining, the recalls are 50.2% and 11.3%, respectively. Their union recall is 56.8%. When the precision of each seed is similar, we can achieve better recall without precision loss by using more seeds.
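The recall computation, including the union over several seeds, can be sketched as follows. The gold and mined pairs here are hypothetical toy data chosen only to show why the union recall is smaller than the sum of the per-seed recalls when the mined sets overlap.

```python
def recall(gold_pairs, mined_pairs):
    """Fraction of gold word pairs covered by the mined pairs."""
    gold = set(gold_pairs)
    return len(gold & set(mined_pairs)) / len(gold)

# Hypothetical gold set of ten (sub-location, location) pairs.
gold = {(f"city{i}", f"region{i}") for i in range(10)}
# Hypothetical results mined from two different seeds; they overlap on city4.
mined_seed_a = {(f"city{i}", f"region{i}") for i in range(5)}
mined_seed_b = {("city4", "region4"), ("city7", "region7"), ("cityX", "regionX")}

print(recall(gold, mined_seed_a))
print(recall(gold, mined_seed_b))
print(recall(gold, mined_seed_a | mined_seed_b))
```

In the toy data the per-seed recalls are 0.5 and 0.2 while the union reaches 0.6, mirroring how 50.2% and 11.3% combine to 56.8% rather than 61.5%.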

Related Work
Turney (2006) introduces a latent relational analysis (LRA) model to measure relational similarity, and applies a novel co-occurrence-based method for analogy filtering. The model can be used for both analogy detection and relation classification, yet cannot scale up well to large datasets due to the complexity of Singular Value Decomposition. Recently, distributed word representations using the skip-gram model (Mikolov et al., 2013a) have been shown to give competitive results on analogy detection. Levy and Goldberg (2014a) extend the skip-gram method with dependency-context embeddings. We study the effect of Levy and Goldberg's embeddings on analogy detection, and further extend their embeddings to dependency-context dependency structure embeddings for analogy mining. Chiu et al. (2007) present a similarity graph traversal (SGT) method to mine analogous relations from raw English text automatically, using syntactic dependencies to find candidate relations. The method is unsupervised, and can scale up well to large datasets. However, Chiu et al. (2007) mainly focus on relations between subjects and objects because of their word-pair extraction method. Ó Séaghdha and Copestake (2009) propose a supervised method, which combines lexical similarity and relational similarity to classify semantic relations. These methods are based on distributional word representation models and are suited to classifying noun-noun word pairs. In contrast, our methods are based on distributed word representation models, and can mine noun-noun word pairs as well as verb-noun word pairs. In addition, our analogy mining method is unsupervised, while the methods of both Turney (2006) and Ó Séaghdha and Copestake (2009) are supervised.

Conclusion
We studied several Chinese relational similarity tasks in the context of distributed word representations trained using the skip-gram model and syntactic dependencies. For Chinese analogy detection, we compared word-context and dependency-context embeddings, finding that the former results in much better accuracies. Observing that common relations in Chinese are frequently represented by syntactic dependencies, we improved Chinese analogy detection using a dependency context. Further, we empirically studied Chinese analogy mining by proposing a bootstrapping algorithm using a novel distributed representation of syntactic dependencies.