Annotation and Extraction of Multiword Expressions in Turkish Treebanks

Multiword expressions (MWEs) have distinctive semantic properties, so their automatic extraction receives special attention from the natural language processing (NLP) and corpus linguistics communities and remains an active research area. Unfortunately, creating the resources needed for this task is laborious, and many languages, Turkish among them, lack such resources. This study presents our MWE annotations on recently introduced Turkish Treebanks, which cover various types of linguistic units and expressions, including named entities, numerical expressions, idiomatic phrases, verb phrases with auxiliaries and duplications. The paper aims to provide a benchmark and pave the way towards further MWE extraction research for Turkish. To this end, the paper also reports our experimental results on these annotated corpora with seven baseline approaches, a dependency parser and a previously introduced rule-based extractor. The highest F-scores we achieve on these resources are about 60%.

Introduction
Automatic extraction of multiword expressions (MWEs) is an important and challenging task in natural language processing (NLP). MWEs have been described as a key problem for the development of large-scale NLP technology (Sag et al., 2002). Multiword expressions are lexical items that can be decomposed into single words, where these single words most of the time carry a totally different meaning from that of the word sets within which they occur. Thus, MWEs pose a significant problem for NLP and machine translation (MT) applications. The effect and the importance of MWE extraction techniques are being investigated by the NLP and CL communities. A recent ICT-Cost Action (IC1207-PARSEME "PARSing and Multi-word Expressions") focuses exclusively on MWEs, at a multidisciplinary level and from different perspectives.
In the literature, some studies focus on deriving automatic MWE extraction techniques without using annotated data. Attia (2006) investigates the automatic acquisition of Arabic MWEs and proposes three complementary approaches to extract related MWEs automatically. Piao et al. (2006) propose similar approaches for automatically identifying Chinese MWEs and achieve precision ranging from 61.16% to 93.96% for different types. Schone and Jurafsky (2001) seek a knowledge-free method for inducing MWEs from text corpora and provide two major evaluations of nine existing collocation finders. Metin and Karaoglan (2010) explore Turkish collocations by using standard statistical methods (e.g. the Chi-square hypothesis test and mutual information). Tsvetkov and Wintner (2012) extract MWEs by using monolingual and parallel corpora (Hebrew-English), and then use the outcome to train a machine translation system. As mentioned in most of these studies, although it may be feasible to identify MWEs automatically using these approaches, they still need to be improved further. The need for manually annotated large-scale data for MWE extraction is therefore not negligible. There are many recent works on creating language resources for MWEs, e.g. MWE databases, corpora and treebanks. The French corpora (Laporte et al., 2008a; Laporte et al., 2008b) and the Prague Dependency Treebank (Bejček and Straňák, 2010) may be given as examples of these studies among many others.
Dependency parsers are capable of providing quite acceptable performance for MWE extraction. Nivre and Nilsson (2004), Eryigit et al. (2011), Vincze et al. (2013) and Candito and Constant (2014) investigate the impact of dependency parsers on Swedish, Turkish and Hungarian MWE extraction. Vincze et al. (2013) show that their results outperform those achieved by state-of-the-art techniques for Hungarian LVC detection. Eryigit et al. (2011) show that, in the training stage, the unification of MWEs of a certain type, namely compound verb and noun formations, has a negative effect on parsing accuracy by increasing the lexical sparsity. In spite of their syntactic relations, MWEs still need special treatment in terms of semantic relations.
Inspired by these recent studies, and in order to provide a direction for future studies on adequate MWE extraction techniques for Turkish, in this paper we present our annotation of MWEs on recently introduced Turkish Treebanks. We focus on annotating various types of linguistic units and expressions, including named entities, numerical expressions, idiomatic phrases, verb phrases with auxiliaries and duplications. The paper experiments with different lexical approaches together with automatic named entity recognition (NER). The results are compared with those of an available collocation extraction tool (Oflazer et al., 2004) and a dependency parser (Eryigit et al., 2008). Although the newly introduced methods improve on the previous results by almost 20 percentage points (yielding ∼60% F-score), we treat these results as the state-of-the-art baselines for Turkish.
The paper is structured as follows: Section 2 introduces the language resources used, Section 3 discusses MWEs in Turkish, Section 4 presents models for MWE extraction, Section 5 gives the experimental results and discussions, and Section 6 presents the conclusion.

Language Resources
We use four different treebanks in our experiments, three of which have been annotated within this study. The first treebank, the METU-Sabancı Treebank (MST) (Oflazer et al., 2003), is taken from Eryigit et al. (2011), where the authors state that most of the MWEs in the original treebank are not annotated. They annotate these MWEs semi-automatically. To this end, they first extracted a MWE list consisting of the 30,150 MWEs available in the Turkish Dictionary (TDK, 2011) and then automatically listed all the treebank sentences in which the lemmas of co-occurring words matched the lemmas of the MWE constituents in the list. They then manually marked the sentences in which the co-occurring words could actually be accepted as a MWE (but were somehow missed during the construction of the original treebank). This semi-automatic annotation approach is incapable of detecting non-adjacent MWE constituents. IMST, IVS and IWT are recently introduced Turkish treebanks annotated with a new dependency scheme (Sulubacak and Eryigit, 2014).
IMST contains exactly the same sentences, and thus the same MWEs, as MST. Unlike in the previous work, however, the MWEs are annotated fully manually, without the semi-automatic selection explained above. The MWEs are annotated with a specific dependency label (MWE) regardless of their category. In this study, we present our MWE annotations on these three treebanks: IVS with 300 sentences, IMST with 5,635 sentences collected from formally written data and IWT with 5,009 sentences collected from Web 2.0. Table 1 presents the resulting MWE statistics on each of these datasets. Since a MWE may consist of two or more words, the table provides both the exact number of MWEs (in the second line) and the total number of MWE relations between MWE constituents (in the first line). As may be noticed from this table, IMST contains almost 50% more MWE annotations than the MST of Eryigit et al. (2011).

MWEs in Turkish
[...] languages are typically fusional or analytic, Turkish is an agglutinative language, meaning that it is possible to derive and inflect words indefinitely through cascading suffixes. In fact, derivation is so common that most sentences contain several derived words incorporating one or more suffixes, even in the colloquial language. The constituents of MWEs also commonly undergo inflection (Oflazer et al., 2004; Savary, 2008), giving rise to numerous forms of the same expression, each appropriate for a different syntactic function. Furthermore, many idiomatic MWEs may also be interpreted literally; that is, there are permissible expressions used in their literal meaning that are morphosyntactically identical to a MWE. Another point is that the constituents of a MWE may occur at non-adjacent positions in the sentence. Figure 1 gives an example for the MWE "ekmeğini yemek" (to gain one's livelihood from (someone)).
In the given sentence, the words composing the MWE are both inflected (the first word "ekmek" (bread) with 1st person possessive agreement suffix in accusative form and the second word "yemek" (to eat) in past tense with 2nd person singular agreement) and written separately from each other.
For these reasons, ordered surface word form matches do not suffice in properly assessing the semantic quality of expressions. Therefore, the disambiguation of MWEs is a more complicated problem than could be resolved by use of look-up tables.
In the rest of this section, we describe the extent of MWEs we specified in our framework. We specify six major categories for MWEs, considering common idiosyncratic formations in Turkish in addition to well-recognized global conventions. We consider any expression falling under these categories to be a MWE, as we later build our extraction models around them. The categories are given below:
Named Entities: Proper names and titles of unique persons such as "Genel Sekreter Ban Ki-moon" (Secretary-General Ban Ki-moon), organizations such as "Avrupa İnsan Hakları Mahkemesi" (European Court of Human Rights) and locations such as "Papua Yeni Gine" (Papua New Guinea) occur very frequently in both edited and unedited texts. Commonly recognized as named entities, these expressions often span multiple words, thereby forming a category of MWEs.
Numerical Expressions: We mark any group of contiguous tokens denoting a numerical expression as MWEs, including spelled out numbers, quantities such as currency values and percentages, and temporal expressions such as date and time phrases. Such expressions are often considered to be a subgroup of named entities, but since they are among the most frequently encountered MWEs, we handle them under a separate category to emphasize their importance.
Idiomatic Phrases: Many common idiomatic phrases in Turkish are also occasionally used in their literal meanings, such as "yola düşmek" (hit the road, or lit. fall on the road). Since both meanings of the phrase would appear morphosyntactically similar, such cases lead to ambiguities in meaning that must be resolved using contextual information. For this reason, we consider idiomatic phrases to be a most challenging category of MWEs.
Light Verb Constructions: Turkish has a way of forming verb phrases using auxiliary verbs such as "olmak" (to be), "etmek" (to do), "yapmak" (to make) and "kılmak" (to render). Among the examples, especially the first two are extremely productive and often used in very common expressions like "teşekkür etmek" (to thank, or lit. to do thank). Although the figurative meanings of such phrases are usually predictable, they still comprise idiomatic phrases. We handle these outside the previous category due to their prevalence, much like numerical expressions.
Compound Function Words: We include any compound particles, multi-word interjections and other function word compounds under MWEs. This category excludes function words modified by intensifiers such as "de" and "ise", which also regularly modify content words, as in "ya da" (or). Ultimately, there are few permissible function word compounds in Turkish, but they are often commonly used phrases, and warrant a category of MWEs.
Duplications: It is common to use word duplication as a grammatical mechanism in both formal and informal Turkish. Duplicating an adjective allows the word to be used as an adverb, much like affixation, as in "yavaş yavaş" (slowly, or lit. slow slow). Onomatopoeic or gibberish (and usually rhyming) pairs of words such as "allak bullak" (topsy-turvy) are also used fairly often to the same effect. Furthermore, there is the 'm'-duplication, a common mechanism in colloquial Turkish in which a word is repeated and an 'm' is prefixed to the duplicate (replacing the initial consonant) in order to add the meaning 'and so on', as in "form morm" (forms and so on). We evaluate all such duplications as MWEs.
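As an illustration of the duplication patterns described above, the following is a minimal sketch of how full repetitions and 'm'-duplications between adjacent tokens might be detected. The function names and the exact rule for vowel-initial words are our assumptions for illustration, not the module used in the experiments; rhyming pairs such as "allak bullak" are not covered by this sketch.

```python
# Hypothetical sketch: detect full repetitions and m-duplications.
# Note: str.lower() is not Turkish-aware for dotted/dotless I; this is
# acceptable for a toy example but not for real Turkish text.

TURKISH_VOWELS = "aeıioöuü"


def is_duplication(first: str, second: str) -> bool:
    """Return True if two adjacent tokens form a full or m-duplication."""
    a, b = first.lower(), second.lower()
    if len(a) < 2 or not a.isalpha() or not b.isalpha():
        return False
    if a == b:  # full repetition, e.g. "yavaş yavaş"
        return True
    # m-duplication: the initial consonant (if any) is replaced by 'm';
    # for vowel-initial words we assume 'm' is simply prefixed.
    base = a if a[0] in TURKISH_VOWELS else a[1:]
    return b == "m" + base


def find_duplications(tokens):
    """Return (i, i + 1) index pairs of adjacent tokens forming a duplication."""
    return [
        (i, i + 1)
        for i in range(len(tokens) - 1)
        if is_duplication(tokens[i], tokens[i + 1])
    ]
```

For instance, `find_duplications("Onu yavaş yavaş sakinleştirdi".split())` marks the token pair at positions 1 and 2, and `is_duplication("form", "morm")` holds under the stated rule.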

Models for MWE Extraction
For our MWE extraction experiments, we test a Turkish dependency parser (Eryigit et al., 2008), an existing collocation extraction tool (Oflazer et al., 2004), which we call MorphoColl from this point on, and seven lexical models. The lexical models build on the previous work of Eryigit et al. (2011): three of them are identical to the models described in that study, and the rest integrate different lexical approaches and a NER module into these models. The rest of this section gives the details of our extraction models and their methodologies.

Dependency Parser
This model comprises a generic dependency parser which includes MWE as one of the dependency relations. We extract MWEs by traversing these relations represented in the output dependency graphs.
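A minimal sketch of this traversal is given below. It assumes a CoNLL-style representation in which each token is a tuple of (id, form, head id, dependency label) and in which each MWE constituent is attached to another constituent by an arc labelled MWE; the tuple layout, the function name and the non-MWE labels in the usage example are our assumptions for illustration. Tokens connected by MWE arcs are grouped with a small union-find structure, so that an expression of n words linked by n - 1 MWE relations is returned as a single MWE.

```python
from collections import defaultdict


def extract_mwes(tokens):
    """Group tokens connected by MWE-labelled arcs.

    tokens: list of (token_id, form, head_id, deprel) tuples.
    Returns a list of MWEs, each a list of word forms in sentence order.
    """
    parent = {tid: tid for tid, _, _, _ in tokens}

    def find(x):
        # Find the representative of x's group, with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the two endpoints of every arc labelled MWE.
    for tid, _, head, deprel in tokens:
        if deprel == "MWE" and head in parent:
            parent[find(tid)] = find(head)

    groups = defaultdict(list)
    for tid, form, _, _ in tokens:
        groups[find(tid)].append((tid, form))

    # Keep only groups with at least two tokens, ordered by token id.
    return [[form for _, form in sorted(g)] for g in groups.values() if len(g) > 1]


# Usage with hypothetical parser output (labels other than MWE are invented):
sentence = [
    (1, "Milli", 2, "MWE"),
    (2, "Savunma", 3, "MWE"),
    (3, "Bakanlığı'nın", 4, "POSSESSOR"),
    (4, "toplantısı", 6, "SUBJECT"),
    (5, "bugün", 6, "MODIFIER"),
    (6, "yapılacak", 0, "PREDICATE"),
]
mwes = extract_mwes(sentence)
```

Here two MWE relations yield one three-word MWE, which mirrors the distinction between the number of MWE relations and the number of MWEs.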

MorphoColl
This model attempts to automatically extract collocations making use of lexical information and morphosyntactic rules. It is composed of three sequential layers, where each layer has its own set of rules and produces the input to the next layer as its output.

Lexical Models
We first filtered MWEs from a Turkish dictionary (TDK, 2011) into a list and used this list as a look-up table. We used the list in three elementary models with different validation criteria, as introduced previously in Eryigit et al. (2011).
Model #0: The first MWE extraction model selects the sequences of words whose surface forms match those of the constituents of a MWE in the referenced list. Thus, this model extracts lexicalized collocations, which are considered fixed MWEs (Oflazer et al., 2004). An example for this case is given below:
• "Arka arkaya iki operasyon geçirdi." lit. (Back to back) (two) (operations) (he/she had). (He/she had two operations consecutively.)
Model #1: The second model selects the sequences of words whose surface forms, except for the last word (which may undergo inflection), are the same as the constituents of a MWE in the referenced list. For the last constituent, the stem of the word is required to match. This model extracts collocations belonging to the semi-lexicalized category of Oflazer et al. (2004). Below is an example for this case:
• "Geleceğini haber vermedi." lit. (that he/she was coming) (news) (he/she didn't give). (He/she didn't inform.)
Model #2: The third model checks only the stems of the words and selects the sequences of words matching the stems of a MWE in the referenced list. Non-lexicalized collocations (Oflazer et al., 2004), each of whose constituents can undergo inflection, are extracted by this model. The following example demonstrates this case:
• "Asla umudunu kesmeyeceksin." lit. (Never) (your hope) (you will not cut). (You will never despair.)
In summary, Model #0 does not allow any inflections or derivations in the MWE candidate, whereas Model #1 allows them only for the last word and Model #2 allows them for all of its words. Since the dictionary used does not include proper names, the models introduced above are incapable of detecting named entities. Thus, our following two models, which we name "Model #1 + NER" and "Model #2 + NER", use a Turkish named entity recognizer (Şeker and Eryigit, 2012) on top of the mentioned models.
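The three matching criteria of Models #0 to #2 can be sketched as follows. This is only an illustration under stated assumptions: `STEMS` is a hypothetical hand-written stem table standing in for a real morphological analyzer, `MWE_LIST` is a toy look-up list rather than the dictionary-derived list, and only contiguous token windows are considered.

```python
# Hypothetical stem table; a real system would use a morphological analyzer.
STEMS = {
    "vermek": "ver", "vermedi": "ver",
    "umut": "umut", "umudunu": "umut",
    "kesmek": "kes", "kesmeyeceksin": "kes",
}

# Toy look-up list of MWE entries (assumed format: tuples of words).
MWE_LIST = [("arka", "arkaya"), ("haber", "vermek"), ("umut", "kesmek")]


def stem(word):
    """Return the stem of a word, falling back to the lowercased form."""
    word = word.lower()
    return STEMS.get(word, word)


def matches(window, entry, model):
    """Check one token window against one list entry under a given model."""
    words = [w.lower() for w in window]
    if model == 0:  # all surface forms must match
        return words == list(entry)
    if model == 1:  # surface match, except stem match on the last word
        return (words[:-1] == list(entry[:-1])
                and stem(words[-1]) == stem(entry[-1]))
    if model == 2:  # stem match on every word
        return [stem(w) for w in words] == [stem(w) for w in entry]
    raise ValueError("model must be 0, 1 or 2")


def extract(tokens, model):
    """Return all contiguous token windows matching some entry in MWE_LIST."""
    found = []
    for entry in MWE_LIST:
        n = len(entry)
        for i in range(len(tokens) - n + 1):
            window = tuple(tokens[i:i + n])
            if matches(window, entry, model):
                found.append(window)
    return found
```

With this toy setup, "haber vermedi" is found by Models #1 and #2 but not by Model #0, and "umudunu kesmeyeceksin" is found only by Model #2, which reflects the increasing tolerance to inflection across the three models.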
Since the NER module may also return single-word entities, only the extracted entities with multiple words are accepted as MWEs in these models. Below are some examples of the MWEs which are extracted by the NER in both models:
• "Milli Savunma Bakanlığı'nın toplantısı bugün yapılacak." lit. (National) (Defense) (of the Ministry) (the meeting) (today) (is to be held) (The Ministry of National Defense meeting is to be held today.)
• "Bayındır Sokak'taki evimden çıktım."
The NER tool used, which is trained on a data set following the MUC guidelines (Chinchor and Robinson, 1997) for named entity annotation, does not extract the titles of proper names as part of the entity. For example, in "Başkan Barack Obama" (President Barack Obama), the word 'president' is not extracted as part of the MWE. In our annotations on the Turkish Treebanks, on the other hand, these words are also annotated as part of the MWEs. With the aim of detecting the missing title words, Model #1 + Enlarged NER attaches the previous and/or the next word of the proper name to the extracted MWE if its first character is an uppercase letter. The following example shows a MWE consisting of titles and proper names as would be extracted by this model:
It is impractical to expect a dictionary list to contain duplications (especially m-duplications), because there is a theoretically infinite number of duplications (Section 3).
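The multiword filter on the NER output and the title-attachment heuristic of Model #1 + Enlarged NER might be sketched as below. The NER output is assumed to be a list of (start, end) token spans with an exclusive end; this span format, the function names and the choice to enlarge a span before applying the multiword filter are our assumptions for illustration.

```python
def enlarge(tokens, span):
    """Attach the previous and/or next token if it starts with an uppercase letter."""
    start, end = span
    if start > 0 and tokens[start - 1][:1].isupper():
        start -= 1
    if end < len(tokens) and tokens[end][:1].isupper():
        end += 1
    return start, end


def ner_mwes(tokens, entity_spans, enlarged=False):
    """Return entities spanning at least two tokens, optionally enlarged.

    entity_spans: hypothetical NER output as (start, end) pairs, end exclusive.
    """
    result = []
    for span in entity_spans:
        start, end = enlarge(tokens, span) if enlarged else span
        if end - start >= 2:  # single-word entities are not MWEs
            result.append(tokens[start:end])
    return result


# Usage: the NER is assumed to return only "Barack Obama" as an entity.
tokens = "Dün Başkan Barack Obama konuştu".split()
plain = ner_mwes(tokens, [(2, 4)])
with_titles = ner_mwes(tokens, [(2, 4)], enlarged=True)
```

In this example the plain variant returns "Barack Obama", while the enlarged variant also attaches the capitalized title "Başkan". A sentence-initial word is capitalized regardless of whether it is a title, so a practical implementation would likely need to treat that position separately.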
Our last model, Model #1 + Enlarged NER + Dup, contains an additional module which detects these repetitions on top of the previous model. Below is an example showing a MWE formed by word repetition handled by this model:
• "Onu yavaş yavaş sakinleştirdi." lit. (him/her) (slow slow) (he/she calmed down). (He/she slowly calmed him/her down.)

Experimental Results and Discussions
Table 2 gives the precision, recall and F-scores (based on the number of MWEs) for the evaluation of the presented models on the introduced datasets. As stated previously, IMST, which contains a higher number of annotated MWEs (Section 2), yields lower recall scores than MST for all of the models. This is because of the newly annotated MWEs with non-adjacent constituents (Section 3). On the other hand, all of the models give higher precision scores on IMST, where the MWE annotations missing from MST have been added through careful manual annotation.
Although Model #1 is a very straightforward lexical matching approach, it outperforms MorphoColl and the dependency parser on the newly annotated datasets. The reason is that the literal interpretation of MWEs with adjacent constituents is less probable than their idiomatic usage. For example, the MWE "ayvayı yemek", which is close in meaning to "to be in hot water" (slang for "to be in trouble"), may also be used literally for eating a quince, but this is a much less probable usage.
Adding a NER layer improves the results by almost 10 percentage points. Our Enlarged NER adds almost 10 percentage points on top of this, and the impact of duplication detection (∼2 percentage points) is also promising, although not as high as that of the previous two. Our best-performing model, Model #1 + Enlarged NER + Dup, achieves F-scores of 60.32%, 62.93%, 57.44% and 60.1% on MST, IMST, IVS and IWT, respectively.
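For reference, precision, recall and F-score over whole MWEs can be computed as in the sketch below. It assumes that each MWE is identified by a sentence id and the tuple of its token positions and that only exact matches count as correct; both the identifier format and the exact-match criterion are our assumptions, and the spans in the usage example are invented for illustration.

```python
def precision_recall_f(gold, predicted):
    """Compute precision, recall and F-score over sets of MWE identifiers."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Usage with invented spans: (sentence_id, token_positions).
gold = {(1, (1, 2)), (1, (4, 5, 6)), (2, (0, 1))}
predicted = {(1, (1, 2)), (2, (3, 4))}
p, r, f = precision_recall_f(gold, predicted)
```

In this toy case one of the two predicted MWEs is correct and one of the three gold MWEs is found, so precision is 1/2, recall is 1/3 and the F-score is 0.4.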
The extractors that we presented in this paper are limited to an individual dependency parser, a rule-based model and dictionary-based models with rule-based additions. Since these models do not go beyond considering the lexical forms and syntactic structures of constituents, they have an equally limited performance in determining MWEs, which are essentially semantic entities. As such, our models should only be considered baseline models. We expect the models to be a benchmark for future work on more sophisticated MWE extraction systems for Turkish and to facilitate comparison with studies on other languages analogous to Turkish in their morphosyntactic structure, such as other agglutinative languages like Finnish and Hungarian, as well as various morphologically rich languages like French and Arabic.
Our premise is that, in order to properly pick out MWEs from within texts, a model needs to integrate morpho-lexical, syntactic and semantic modules all in one, in order to respectively extract critical constituents, identify the grammatical relations between them, and determine the nature of the extracted phrases. One of our future plans is to design and implement such a model following this study, making use of machine learning and incorporating sequential modules, each working out a separate aspect of the candidate expressions. Additionally, we aim to expand our survey and test our new model on other languages besides Turkish for a more thorough performance evaluation.

Conclusion
In this study, we described the various challenges in annotating and extracting MWEs in Turkish, which stem from the typology and certain idiosyncratic features of the language. We outlined the framework we established on what constitutes a MWE, along with the exceptional cases that have been considered. Afterwards, we discussed our elementary approach to extracting MWEs in Turkish, then presented the basic extraction models we developed and tested on four Turkish treebanks. Our best model, which combines a lexical look-up approach allowing the inflection of the final MWE constituent, an enhanced named entity recognition module and a duplication extraction module, obtains about 60% F-measure on these treebanks.