Graph-based Clustering of Synonym Senses for German Particle Verbs

In this paper, we address the automatic induction of synonym paraphrases for the empirically challenging class of German particle verbs. Similarly to Cocos and Callison-Burch (2016), we incorporate a graph-based clustering approach for word sense discrimination into an existing paraphrase extraction system, (i) to improve the precision of synonym identification and ranking, and (ii) to enlarge the diversity of synonym senses. Our approach significantly improves over the standard system, but does not outperform an extended baseline integrating a simple distributional similarity measure.


Introduction
Alignments in parallel corpora provide a straightforward basis for the extraction of paraphrases by means of re-translating pivots and then ranking the obtained set of candidates. For example, if the German verb aufsteigen is aligned with the English pivot verbs rise and climb up, and the two English verbs are in turn aligned with the German verbs aufsteigen, ansteigen and hochklettern, then ansteigen and hochklettern represent two paraphrase candidates for the German verb aufsteigen. Bannard and Callison-Burch (2005) were the first to apply this method to gather paraphrases for individual words and multi-word expressions, using translation probabilities as criteria for ranking the obtained paraphrase candidates.
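The pivot step above can be sketched as a toy example (the alignment tables below are invented for illustration, not real Europarl alignments):

```python
# Toy alignment tables (illustrative, not real Europarl alignments):
de_to_en = {"aufsteigen": ["rise", "climb up"]}     # German -> English pivots
en_to_de = {"rise": ["aufsteigen", "ansteigen"],    # English -> German
            "climb up": ["aufsteigen", "hochklettern"]}

def paraphrase_candidates(target):
    """Back-translate all pivots of the target; the target itself is excluded."""
    return {back
            for pivot in de_to_en.get(target, [])
            for back in en_to_de.get(pivot, [])
            if back != target}

print(sorted(paraphrase_candidates("aufsteigen")))  # ['ansteigen', 'hochklettern']
```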
This standard re-translation approach, however, suffers from a major re-translation sense problem, because the method cannot distinguish between the various senses of the target word or phrase. Consequently, (i) the different senses of the original word or phrase are merged when the back-translations of all pivot words are collected within one set of paraphrase candidates; and (ii) the ranking step does not guarantee that all senses of a target are covered by the top-ranked candidates, as more frequent senses amass higher translation probabilities and are favoured.
Recently, Cocos and Callison-Burch (2016) proposed two approaches to distinguish between paraphrase senses (i.e., aiming to solve problem (i) above). In this paper, we address both facets (i) and (ii) of the re-translation sense problem, while focusing on an empirically challenging class of multi-word expressions, i.e., German particle verbs (PVs). German PVs can appear morphologically joint or separated (such as steigt ... auf), and are often highly ambiguous. For example, the 138 PVs we use in this paper have an average number of 5.3 senses according to the Duden dictionary. Table 1 illustrates the re-translation sense problem for German PVs. It lists the 10 top-ranked paraphrases for the target verb ausrichten obtained with the standard method. Four synonyms in the 10 top-ranked candidates were judged valid according to the Duden, covering three out of five senses listed there. Synonyms for a fourth sense, "to tell" (sagen, übermitteln, weitergeben), existed in the candidate list, but were ranked low.
Our approach to incorporating word senses into the standard paraphrase extraction applies a graph-based clustering to the set of paraphrase candidates, based on a method described in (Apidianaki and He, 2010; Apidianaki et al., 2014). It divides the set of candidates into clusters by reducing the edges in an originally fully-connected graph to those exceeding a dynamic similarity threshold. The resulting clusters are taken as paraphrase senses, and different parameters from the graph-based clustering (such as connectedness within clusters and cluster centroid positions) are supposed to enhance the paraphrase ranking step. With this setting, we aim to achieve higher precision in the top-ranked candidates, and to cover a wider range of senses than the original re-translation method.

Related Work

Bannard and Callison-Burch (2005) introduced the idea of extracting paraphrases with the re-translation method. Their work controls for word senses with regard to specific test sentences, but not on the type level. Subsequent approaches improved the basic re-translation method, including Callison-Burch (2008), who restrict paraphrases by syntactic type, and Wittmann et al. (2014), who add the distributional similarity between paraphrase candidate and target word as a ranking feature. Applications of paraphrases extracted with the re-translation method include the evaluation of SMT (Zhou et al., 2006) and query expansion in Q&A systems (Riezler et al., 2007).

Most recently, Cocos and Callison-Burch (2016) proposed two clustering algorithms to address one of the sense problems: They discriminate between target word senses, exploiting hierarchical graph factorization clustering and spectral clustering. The approaches cluster all words in the Paraphrase Database (Ganitkevitch et al., 2013) and focus on English nouns in their evaluation.
A different line of research on synonym extraction has exploited distributional models, by relying on the contextual similarity of two words or phrases, e.g. Sahlgren (2006), van der Plas and Tiedemann (2006), Padó and Lapata (2007), Erk and Padó (2008). Typically, these methods do not incorporate word sense discrimination.

Synonym Extraction Pipeline
This section lays out the process of extracting, clustering and ranking synonym candidates.

Synonym Candidate Extraction
Following the basic approach for synonym extraction outlined by Bannard and Callison-Burch (2005), we gather all translations (i.e., pivots) of an input particle verb, and then re-translate the pivots. The back translations constitute the set of synonym candidates for the target particle verb.
In order to rank the candidates according to how likely they represent synonyms, each candidate is assigned a probability. The synonym probability p(e2|e1), with e2 ≠ e1, for a synonym candidate verb e2 given a target particle verb e1 is calculated as the product of two translation probabilities: the pivot probability p(f_i|e1), i.e. the probability of the English pivot f_i being a translation of the particle verb e1, and the return probability p(e2|f_i), i.e. the probability that the synonym candidate e2 is a translation of the English pivot f_i. The final synonym score for e2 is the sum over all pivots f_1..n that re-translate into the candidate:

p(e2|e1) = Σ_{i=1..n} p(f_i|e1) · p(e2|f_i)   (1)

The translation probabilities are based on relative frequencies of the counts in a parallel corpus, cf. section 4.1.
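The score of equation (1) can be sketched as follows; the probability tables are invented toy values, not real Europarl estimates:

```python
def synonym_score(target, candidate, p_pivot, p_return):
    """Eq. (1): sum over all pivots f of p(f|target) * p(candidate|f)."""
    return sum(p_f * p_return.get(f, {}).get(candidate, 0.0)
               for f, p_f in p_pivot[target].items())

# Toy probability tables (illustrative only):
p_pivot = {"aufsteigen": {"rise": 0.6, "climb up": 0.4}}
p_return = {"rise": {"aufsteigen": 0.5, "ansteigen": 0.5},
            "climb up": {"aufsteigen": 0.7, "hochklettern": 0.3}}

synonym_score("aufsteigen", "ansteigen", p_pivot, p_return)     # 0.6*0.5 ≈ 0.30
synonym_score("aufsteigen", "hochklettern", p_pivot, p_return)  # 0.4*0.3 ≈ 0.12
```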
Filtering We apply filtering heuristics at the pivot probability step and at the return probability step: obviously useless pivots containing only stop-words (e.g. articles) or punctuation are discarded. In the back-translation step, synonym candidates that do not include a verb are removed. Furthermore, we remove pivots (pivot probability step) and synonym candidates (return probability step) consisting only of light verbs, due to their lack of semantic content and their tendency to be part of multi-word expressions. If left unfiltered, light verbs often become super-nodes in the graphs later on (see section 3.2), due to their high distributional similarity with a large number of other synonym candidates. This makes it difficult to partition the graphs into meaningful clusters with the algorithm used here.
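The two heuristics can be sketched as simple predicates; the stop-word and light-verb lists below are illustrative stand-ins, and POS tags are assumed to be supplied by a tagger:

```python
STOPWORDS = {"the", "a", "an", "of", "to"}   # illustrative stop-word list
LIGHT_VERBS = {"machen", "tun", "haben"}     # illustrative German light verbs

def keep_pivot(tokens):
    """Discard pivots consisting only of stop-words or punctuation."""
    return any(t.isalpha() and t.lower() not in STOPWORDS for t in tokens)

def keep_candidate(tokens, pos_tags):
    """Keep back-translations that contain a verb which is not a light verb."""
    verbs = [t for t, pos in zip(tokens, pos_tags) if pos == "VERB"]
    return bool(verbs) and not all(v in LIGHT_VERBS for v in verbs)
```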
Distributional Similarity We add distributional information as an additional feature for the ranking of synonym candidates: weighting the score from equation (1) by simple multiplication with the distributional similarity between the candidate and the target (as obtained from large corpus data, cf. section 4.1) has been found to improve the ranking (Wittmann et al., 2014).
Properties of the clusters:
C(#(cand)): number of synonym candidates in a cluster
C(av-sim(cand,c)): average distributional similarity between the synonym candidates in a cluster and the cluster centroid
C(av(#(e))): average number of edges in the clusters of the cluster analysis
C(#(e)): total number of edges in a cluster
C(av-sim(cand,v)): average distributional similarity between the synonym candidates in a cluster and the target PV
C(av-sim(cand,gc)): average distributional similarity between all synonym candidates and the global centroid
C(sim(c,v)): distributional similarity between a cluster centroid and the target PV
C(con): connectedness of a cluster

Properties of the synonym candidates:
S(tr): translation probability of a synonym candidate
S(#(e)): number of edges of a synonym candidate
S(cl%(#(e))): proportion of cluster edges for a synonym candidate
S(sim(cand,v)): distributional similarity between a synonym candidate and the target PV
S(sim(cand,c)): distributional similarity between a synonym candidate and the cluster centroid
S(sim(cand,gc)): distributional similarity between a synonym candidate and the global centroid

Table 2: Properties of synonym candidates and clusters.

Graph-Based Clustering of Candidates
The clustering algorithm suggested by Apidianaki et al. (2014) is adopted for clustering all extracted synonym candidates for a specific particle verb target. In a first step, a fully connected undirected graph of all synonym candidates is created as a starting point, with nodes corresponding to synonym candidates and edges connecting two candidates; the edge weights are set according to the candidates' distributional similarity. In a second step, a similarity threshold is calculated, in order to delete edges with weights below the threshold. The threshold is initialized with the mean over all edge weights in the fully connected graph. Subsequently, the threshold is updated iteratively:

1. The synonym candidate pairs are partitioned into two groups: P1 contains pairs with similarities below the current threshold, and P2 contains pairs with similarities above the current threshold and sharing at least one pivot.

2. A new threshold is set: T = (A_P1 + A_P2) / 2, where A_Pi is the mean over all similarities in Pi.
After convergence, the resulting graph consists of disconnected clusters of synonym candidates. Singleton clusters are ignored. The sub-graphs represent the cluster analysis to be used in the ranking of synonyms for the target particle verb.
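A minimal sketch of one clustering pass under these assumptions (the similarity function and the pivot-sharing test are assumed to be supplied by the extraction pipeline):

```python
from itertools import combinations

def cluster(cands, sim, share_pivot):
    """One pass of the threshold-based clustering: iterate the threshold
    update until convergence, prune edges below the threshold, and return
    the connected components (singletons are dropped)."""
    pairs = list(combinations(cands, 2))
    if not pairs:
        return []
    t = sum(sim(a, b) for a, b in pairs) / len(pairs)  # init: mean edge weight
    while True:
        p1 = [sim(a, b) for a, b in pairs if sim(a, b) < t]
        p2 = [sim(a, b) for a, b in pairs
              if sim(a, b) >= t and share_pivot(a, b)]
        if not p1 or not p2:
            break
        new_t = (sum(p1) / len(p1) + sum(p2) / len(p2)) / 2
        if abs(new_t - t) < 1e-9:  # converged
            break
        t = new_t
    # Build the pruned graph and collect its connected components.
    adj = {c: set() for c in cands}
    for a, b in pairs:
        if sim(a, b) >= t:
            adj[a].add(b)
            adj[b].add(a)
    clusters, seen = [], set()
    for c in cands:
        if c in seen or not adj[c]:
            continue
        comp, stack = set(), [c]
        while stack:
            x = stack.pop()
            if x not in comp:
                comp.add(x)
                seen.add(x)
                stack.extend(adj[x] - comp)
        clusters.append(comp)
    return clusters
```

With two highly similar candidate pairs and low cross-similarities, a single pass already separates the graph into two clusters.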

Iterative Application of Clustering Algorithm
Because the resulting clusterings of the synonym candidates typically contain one very large cluster (and many small ones), we extend the original algorithm and iteratively re-apply the clustering: After one pass of the clustering algorithm as described above (T1), the resulting set of connected synonym candidates becomes the input to another iteration of the algorithm (T2...n). Each iteration of the algorithm results in a smaller and more strongly partitioned sub-graph of the initially fully connected graph, because the similarity threshold for edges becomes successively higher.
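The re-application can be sketched as a wrapper around a single clustering pass; here cluster_fn stands in for one pass of the algorithm from section 3.2 and is assumed to return a list of non-singleton clusters:

```python
def iterative_cluster(cands, cluster_fn, n_iter=3):
    """Re-apply the clustering: after each pass, only candidates that ended
    up in some (non-singleton) cluster are clustered again."""
    analyses = []
    for _ in range(n_iter):
        clusters = cluster_fn(cands)
        if not clusters:
            break
        analyses.append(clusters)
        # Candidates surviving this pass form the input to the next one.
        cands = sorted(c for comp in clusters for c in comp)
    return analyses
```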

Synonym Candidate Ranking
Assuming that clusters represent senses, we hypothesize that combining properties of individual synonym candidates with properties of the graph-based clusters of synonym candidates results in a ranking of the synonym candidates that overcomes both facets of the re-translation sense problem: Including synonym candidates from various clusters should ensure that more senses of the target particle verb are represented in the top-ranked list; and identifying salient clusters should improve the ranking. Table 2 lists the properties of the individual synonym candidates S and the properties of the graph-based cluster analyses C that we consider potentially useful. For the experiments in section 4, we use all combinations of S and C properties.
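As an illustration, one such combination, S(tr) · S(sim(cand,v)) · C(#(e)), can be computed per candidate; the feature tables below are invented toy values:

```python
def rank(cands, tr, sim_v, cluster_edges):
    """Rank candidates by the product S(tr) * S(sim(cand,v)) * C(#(e))."""
    return sorted(cands,
                  key=lambda c: tr[c] * sim_v[c] * cluster_edges[c],
                  reverse=True)

tr = {"x": 0.2, "y": 0.1}           # translation probabilities, S(tr)
sim_v = {"x": 0.5, "y": 0.9}        # similarity to the target PV, S(sim(cand,v))
edges = {"x": 3, "y": 4}            # edges of the candidate's cluster, C(#(e))
rank(["x", "y"], tr, sim_v, edges)  # y: 0.1*0.9*4 ≈ 0.36 > x: 0.2*0.5*3 ≈ 0.30
```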
The distributional similarity sim is determined by cosine similarities between vectors relying on co-occurrences in a window of 20 words. We use the German web corpus DECOW14AX (Schäfer and Bildhauer, 2012; Schäfer, 2015) containing 12 billion tokens, with the 10,000 most common nouns as vector dimensions. The feature values are calculated as Local Mutual Information (LMI), cf. Evert (2005).
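A minimal sketch of the weighting and similarity computation, representing vectors as sparse dicts over noun dimensions (the frequency counts in the usage below are illustrative):

```python
import math

def lmi(cooc, w_freq, dim_freq, corpus_size):
    """Local Mutual Information: O * log(O * N / (f(w) * f(dim)))."""
    if cooc <= 0:
        return 0.0
    return cooc * math.log(cooc * corpus_size / (w_freq * dim_freq))

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(weight * v.get(dim, 0.0) for dim, weight in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Illustrative counts: the PV co-occurs 10 times with a noun dimension that
# has frequency 100, the PV itself has frequency 100, corpus size 10,000.
weight = lmi(10, 100, 100, 10000)  # positive: observed > expected
```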
Experiments

Our dataset contains the same 138 German particle verbs from Europarl as in previous work (Wittmann et al., 2014): all PVs with a frequency f ≥ 15 and at least 30 synonyms listed in the Duden dictionary. For the evaluation, we also rely on the Duden, which provides synonyms for the target particle verbs and groups the synonyms by word sense. We consider four evaluation measures, and compare the ranking formulas by macro-averaging each of the evaluation measures over all 138 particle verbs:

• Precision among the 10 and among the 20 top-ranked synonym candidates.
• Number and proportion of senses represented among the 10 top-ranked synonyms.

Results
The basic system (line 1 in table 3) relies only on the translation probabilities (S(tr)). It is extended by incorporating the distributional similarity between the target particle verb and the synonym candidates (line 2). Our five best rankings with one iteration of the graph-based clustering (T1) are shown in lines 3-7. All of these include the translation probability and the distributional similarity between candidate and particle verb; only one makes use of cluster information. Thus, the simple distributional extension is so powerful that additional cluster information cannot improve the system any further. The most relevant cluster measure is the number of edges of the cluster, C(#(e)), an indication of cluster size and connectedness.
While the best three clustering systems outperform the extended basic system (line 2) in terms of top-10/top-20 precision, none of the improvements is significant. Also, the number and proportion of senses remain the same as in the basic approach with distributional extension. Further iterations of the clustering step (T2...n) up to n = 8 lead to increasingly worse precision scores and sense detection, cf. figure 1 for T1...5.

Discussion
Overall, the distributional similarity between the target word and the synonym candidates represents the strongest extension of the basic re-translation approach, and the cluster graphs do not provide further useful information. A breakdown of the cluster analyses revealed that the cluster sizes are very unevenly distributed: typically, there is one very large cluster and several considerably smaller clusters, as shown by the first part of table 4, which contrasts the proportion of candidates in the largest cluster with the proportion of candidates in the remaining clusters.

Table 4: Distribution of candidates, synonyms and senses in the largest cluster vs. all other clusters in the iterations T1-T5.
In addition, we found that most correct synonyms are also in the largest cluster (middle part of table 4). Accordingly, the cluster analyses do not represent partitions of the target verb senses; instead, most senses are concentrated in the largest cluster (bottom part of table 4).
Consequently, while the synonym features are useful for ranking the set of candidates, cluster-level features are ineffective, as they are derived from effectively meaningless cluster analyses. While re-applying the clustering step gradually overcomes the uneven cluster distribution (iterations T2-T5 in table 4), the sizes of the graphs decrease dramatically. For example (not depicted in table 4), on average only 169 candidates are left in T5, compared to 1,792 in T1, with an average of 2.8 correct synonyms instead of 22.5, and an average of 1.7 senses instead of 4.5.
We assume that partitioning the candidate set according to senses, in combination with the cluster-level measures, is a valid approach to deal with the word sense problem, but based on our analysis we conclude either (i) that the context vectors are not suitable for differentiating between senses, or (ii) that the clustering algorithm is ill-suited to this scenario. A possible solution might be to apply the algorithms suggested by Cocos and Callison-Burch (2016). Finally, no weighting was applied to any of the properties listed in table 2. This could be improved by using a held-out development set, and a greater number of particle verbs (we use only 138) would probably be needed as well.

Summary
We hypothesized that graph-based clustering properties in addition to synonym candidate properties should improve the precision of synonym identification and ranking, and extend the diversity of synonym senses. Unfortunately, our extensions failed, and analyses of cluster properties revealed that future work should improve the vector representations and compare other clustering algorithms. One should keep in mind, however, that we focused on a specifically challenging class of multi-word expressions: highly ambiguous German particle verbs.