Joint Dependency Parsing and Multiword Expression Tokenization

Complex conjunctions and determiners are often treated as pretokenized units in parsing. This is not always realistic, since they can be ambiguous. We propose a model for joint dependency parsing and multiword expression identification, in which complex function words are represented as individual tokens linked by morphological dependencies. Our graph-based parser includes standard second-order features and verbal subcategorization features derived from a syntactic lexicon. We train it on a modified version of the French Treebank enriched with morphological dependencies. It recognizes 81.79% of ADV+que conjunctions with 91.57% precision, and 82.74% of de+DET determiners with 86.70% precision.


Introduction
Standard NLP tool suites for text analysis are often made of several processes organized as a pipeline, in which the input of a process is the output of the preceding one. Among these processes, one commonly finds a tokenizer, which segments a sentence into words, a part-of-speech (POS) tagger, which associates a part-of-speech tag with every word, and a syntactic parser, which builds a parse tree for the sentence. 1 These three processes correspond to three formal operations on the string: segmentation into linguistically relevant units (words), tagging the words with POS tags, and linking the (word, POS) pairs by means of syntactic dependencies.
This setup is clearly not ideal, as some decisions are made too early in the pipeline (Branco and Silva, 2003). More specifically, some tokenization and tagging choices are difficult to make without taking syntax into account. To avoid the pitfall of premature decisions, probabilistic tokenizers and taggers can produce several solutions in the form of lattices (Green and Manning, 2010; Goldberg and Elhadad, 2011). Such approaches usually lead to severe computational overhead, due to the huge search space in which the parser looks for the optimal parse tree. Besides, the parser might be biased towards short solutions, as it compares scores of trees associated with sequences of different lengths (De La Clergerie, 2013).
This problem is particularly hard when parsing multiword expressions (MWEs), that is, groups of tokens that must be treated as single units (Baldwin and Kim, 2010). The solution we present in this paper is different from the usual pipeline. We propose to jointly parse and tokenize MWEs, transforming segmentation decisions into linking decisions. Our experiments concentrate on two difficult tokenization cases. Hence, it is the parser that chooses, in such cases, whether or not to group several tokens.
Our first target phenomenon is the family of ADV+que constructions, a type of complex conjunction in French. They are formed by adverbs like bien (well) or ainsi (likewise) followed by the subordinating conjunction que (that). They function like the English complex conjunctions so that and now that. Due to their structure, ADV+que constructions are generally ambiguous, as in the following examples:

1. Je mange bien que je n'aie pas faim
   I eat although I am not hungry

2. Je pense bien que je n'ai pas faim
   I think indeed that I am not hungry

In example 1, the sequence bien que forms a complex conjunction (although), whereas in example 2, the adverb bien (indeed) modifies the verb pense (think), and the conjunction que (that) introduces the sentential complement je n'ai pas faim (I am not hungry). In treebanks, the different readings are distinguished by representing complex conjunctions as words-with-spaces.
Our second target phenomenon is the family of partitive articles, which are made of the preposition de (of) followed by the definite determiner le, la, l' or les 2 (the). These de+DET constructions are ambiguous, as shown in the following examples:

3. Il boit de la bière
   He drinks some beer

4. Il parle de la bière
   He talks about the beer

In example 3, the sequence de la forms a determiner (some), whereas in example 4, de is a preposition (about) and la is the determiner (the) of the noun bière (beer).
We focus on these constructions for two reasons. First, because they are extremely frequent. For instance, in the frWaC corpus, out of a total of 54.8M sentences, 1.15M sentences (2.1%) contain one or more occurrences of our target ADV+que constructions and 26.7M sentences (48.6%) contain a de+DET construction (see Tables 1 and 2). Moreover, in a French corpus of 370M words, 3 des is the 7th most frequent word. Second, because they are perfect examples of phenomena that are difficult for a tokenizer to process. In order to decide, in example 1, that bien que is a complex subordinating conjunction, non-trivial morphological, lexical and syntactic clues must be taken into account, such as the subcategorization frame of the verb of the principal clause and the mood of the subordinate clause. All these clues are difficult to exploit during tokenization, when the syntactic structure of the sentence is not yet explicit.
Asking the parser to perform tokenization will not always solve the problem. Even state-of-the-art parsers can fail to predict the right structure for the cases we are dealing with. The main reason is that they are trained on treebanks of limited size, and some lexico-syntactic phenomena cannot be well modeled. This brings us to the second topic of this paper, which is the integration of external linguistic resources into a treebank-trained probabilistic parser. We show that, in order to correctly solve the two problems at hand, the parser must have access to lexico-syntactic information that can be found in a syntactic lexicon. We propose a simple way to introduce such information into the parser by defining new linguistic features that blend smoothly with the treebank features used by the parser when looking for the optimal parse tree.

The paper is organized as follows: Section 2 describes related work on MWE parsing. Section 3 proposes a way to represent multiword units by means of syntactic dependencies. In Section 4, we briefly describe the parser used in this work, and in Section 5, we propose a way to integrate a syntactic lexicon into the parser. Section 6 describes the data sets used in the experiments, whose results are presented and discussed in Section 7. Section 8 concludes the paper.

2 Sequences de le and de les do not appear as such in French. They have undergone a morpho-phonetic process known as amalgamation and are represented as the tokens du and des. In our pipeline, they are artificially detokenized.
3 Newspaper Le Monde from 1986 to 2002.

Related Work
The famous "pain-in-the-neck" article by Sag et al. (2002) discusses MWEs in parsers, contrasting two representation alternatives in the LinGO ERG HPSG grammar of English: compositional rules and words-with-spaces. The use of compositional rules for flexible MWEs was tested in a small-scale experiment, in which adding 21 new MWEs to the grammar yielded significant coverage improvements in HPSG parsing (Villavicencio et al., 2007).
It has been demonstrated that pre-grouping MWEs as words-with-spaces can improve the performance of shallow parsing for English (Korkontzelos and Manandhar, 2010). Nivre and Nilsson (2004) obtained similar results for dependency parsing of Swedish. They compare models trained on two representations: one where MWEs are linked by a special ID dependency, and another one based on gold pre-tokenization. Their results show that the former model can recognize MWEs with F1=71.1%, while the latter can significantly improve parsing accuracy and robustness in general. However, the authors admit that "it remains to be seen how much of theoretically possible improvement can be realized when using automatic methods for MWU recognition".
Several methods of increasing complexity have been proposed for fully automatic MWE tokenization: simple lexicon projection onto a corpus (Kulkarni and Finlayson, 2011), synchronous lexicon lookup and parsing (Wehrli et al., 2010; Seretan, 2011), token-based classifiers trained using association measures and other contextual features (Vincze et al., 2013a), or contextual sequence models like conditional random fields (Constant and Sigogne, 2011; Constant et al., 2013b; Vincze et al., 2013b) and the structured perceptron (Schneider et al., 2014). In theory, compound function words like ADV+que and de+DET allow no internal variability and should thus be represented as words-with-spaces. However, to date no satisfactory solution has been proposed for automatically tokenizing ambiguous MWEs. Green et al. (2013) propose a constituency parsing model which, as a by-product, performs MWE identification. They propose a flat representation for contiguous expressions in which all elements are attached to a special node, and then they compare several parsing models, including an original factored-lexicon PCFG and a tree substitution grammar. These generic parsing models can be used for parsing in general, but they have interesting memorization properties which favor MWE identification. Their experiments on French and Arabic show that the proposed models beat the baseline in MWE identification while producing acceptable general parsing results.
Candito and Constant (2014) and Vincze et al. (2013c) present the experiments on dependency parsing for MWE identification which are closest to our settings. Vincze et al. (2013c) focus on light verb constructions in Hungarian. They propose distinguishing regular verbal dependencies from light verbs and their complements through four special labels prefixed by LCV-. Then, they train the Bohnet parser (Bohnet, 2010) using standard parameters and features, and evaluate on a gold test set. They report no significant changes in attachment scores, whereas F1 for light verb identification is 75.63%, significantly higher than the baseline methods of lexicon projection (21.25%) and classification (74.45%). Candito and Constant (2014) compare several architectures for dependency parsing and MWE identification in French. For regular MWEs like noun compounds, they use regular expressions to automatically generate an internal syntactic structure, combining standard and MWE-dedicated dependency labels. Irregular expressions like complex conjunctions are represented as separate tokens, with a special DEP CPD dependency that links all tokens to the first MWE word (Constant et al., 2013a). They compare different architectures for MWE identification before, during and after parsing, showing that the best architecture depends on whether the target MWEs are regular or irregular.

Similarly to these two papers, we use a special dependency to model MWEs and evaluate parsing and identification accuracy. Our work departs from theirs in three important respects. First, we concentrate on syntactically irregular compounds, which we represent with a new kind of dependency. Second, we integrate a syntactic lexicon into the parser in order to help disambiguate ADV+que and de+DET constructions. Third, we built a specific evaluation corpus to better estimate the performance of our model on ADV+que and de+DET constructions.

The MORPH Dependency
In order to let the parser make the tokenization decisions, we propose not to group sequences of tokens of the form ADV+que and de+DET at tokenization time. Instead, we transform segmentation decisions into parsing decisions.
We associate a syntactic structure with ADV+que and de+DET constructions by introducing a new type of dependency that we call MORPH. It is not a standard syntactic dependency, but is reminiscent of the morphological dependencies of Mel'čuk (1988), similar to the DEP CPD label proposed by Candito and Constant (2014) or the ID dependency of Nivre and Nilsson (2004), except that we focus on syntactically-motivated MWEs, proposing a regular structure for them.
The syntactic structures of examples 1 and 2, introduced in Section 1, are represented below 4 . In example 1, the complex conjunction bien que is represented by the presence of the MORPH dependency, whereas, in example 2, the adverb bien modifies the verb pense and que introduces its object. From an NLP perspective, the two readings are treated the same way by the tokenizer and the tagger. It is only at parsing time that the presence of the complex conjunction is predicted.
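The contrast between the two readings can be made concrete by listing their dependency arcs. The following sketch encodes examples 1 and 2 as (governor, label, dependent) triples; the exact label inventory other than MORPH (here MOD and OBJ) is simplified for illustration:

```python
# Illustrative encoding of the two readings of examples 1 and 2 as
# (governor, label, dependent) triples. Labels other than MORPH are
# simplified for this sketch.

# Example 1: "Je mange bien que je n'aie pas faim"
# "bien que" is a complex conjunction: que attaches to bien via MORPH.
reading_1 = {("mange", "MOD", "bien"), ("bien", "MORPH", "que")}

# Example 2: "Je pense bien que je n'ai pas faim"
# bien modifies the verb and que introduces its sentential object.
reading_2 = {("pense", "MOD", "bien"), ("pense", "OBJ", "que")}

def is_complex_conjunction(reading):
    # The MWE reading is signalled solely by the presence of a MORPH arc.
    return any(label == "MORPH" for _, label, _ in reading)
```

Note that the tokens and POS tags are identical in both readings; only the arcs differ, which is precisely what lets the parser carry the tokenization decision.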
The syntactic structures of examples 3 and 4 are represented below. In example 3, the partitive article de la is represented by means of the MORPH dependency. Example 4 exhibits a standard prepositional phrase structure.

Parsing
The parser used in this study is a second-order graph-based parser (Kübler et al., 2009). Given a sentence $W = w_1 \ldots w_l$, the parser looks for the dependency tree $\hat{T}$ of $W$ that maximizes the score $s$:

$$\hat{T} = \arg\max_{T \in \mathcal{T}(W)} \sum_{F \in \mathcal{F}(T)} s(F)$$

where $\mathcal{T}(W)$ is the set of all possible dependency trees for sentence $W$, $\mathcal{F}(T)$ is the set of all relevant subparts, called factors, of tree $T$, and $s(F)$ is the score of factor $F$. The values of these scores are parameters estimated during training.
We can define different models of increasing complexity depending on the decomposition of the tree into factors. The simplest one is the arc-factored or first-order model, which decomposes a tree into single dependencies and assigns each a score, independently of its context. We used a second-order parser which decomposes a tree into factors of three types:

1. first-order factors, made of one dependency;
2. sibling factors, made of two dependencies sharing a common governor;
3. grandchildren factors, made of two dependencies where the dependent of one of them is the governor of the other one.
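The factor decomposition above can be sketched in a few lines. This is a minimal illustration, not the parser's actual implementation: a tree is represented as a map from dependent index to governor index, and the scoring function is a toy lookup table.

```python
# Sketch of the second-order factor decomposition: first-order,
# sibling and grandchildren factors of a dependency tree.

def factors(heads):
    """Enumerate the factors of a tree.

    heads[d] = g means token d depends on governor g (0 is the root).
    """
    fs = [("arc", g, d) for d, g in heads.items()]          # first-order
    for d1, g1 in heads.items():
        for d2, g2 in heads.items():
            if d1 < d2 and g1 == g2:                        # same governor
                fs.append(("sibling", g1, d1, d2))
            if g2 == d1:                                    # d2 hangs under d1
                fs.append(("grandchild", g1, d1, d2))
    return fs

def score_tree(heads, s):
    # The tree score is the sum of its factor scores: the parser
    # searches for argmax_T sum over F in F(T) of s(F).
    return sum(s.get(f, 0.0) for f in factors(heads))
```

For instance, a tree where tokens 2 and 3 both depend on token 1 yields one sibling factor (1, 2, 3) and two grandchildren factors anchored at the root.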

Integration with a Syntactic Lexicon
Although this kind of parser achieves state-of-the-art performance (Bohnet, 2010), its predictions are limited to the phenomena that occur in the treebanks it is trained on. In particular, such parsers often fail to correctly distinguish elements that are subcategorized by a verb (henceforth complements) from others (modifiers). This is due to the fact that the nature and number of the complements are specific to each verb. If the verb did not occur, or did not occur often enough, in the treebank, the nature and number of its complements will not be correctly modeled by the parser. A precise description of verb complements plays an important role in the task of predicting the MORPH dependency, as illustrated by example 1. In this example, the verb manger (eat) does not accept an object introduced by the subordinating conjunction que (that). This is vital information for predicting the correct syntactic structure of the sentence. If the parser cannot link the conjunction que to the verb manger with an OBJ dependency, then it has to link it with a MOD dependency (it has no other reasonable solution). But que by itself cannot be a MOD of the verb unless it is a complex conjunction. The parser therefore has no other choice than linking que to the adverb with a MORPH dependency.
In order to help the parser build the right solution in such cases, we have introduced into the parser information derived from a syntactic lexicon. The syntactic lexicon associates with each verb lemma the features +/-QUE and +/-DE, which indicate respectively whether the verb accepts an object introduced by the subordinating conjunction que or by the preposition de. For instance, manger is -QUE while penser is +QUE. We will call such features subcat features (SFs). The semantics of positive feature values is quite different from that of negative ones. The former indicate that a verb may (but does not need to) license a complement introduced by the conjunction que or the preposition de, whereas the latter indicate that the verb cannot license such a complement. Negative feature values therefore have a higher predictive power.
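A minimal sketch of this lookup is shown below. The entries follow the paper's own examples (manger rejects a que-object, penser accepts one, parler takes a de-complement, boire does not); the dictionary format is our assumption for illustration, not the lexicon's actual file format.

```python
# Toy subcat-feature lexicon: per-lemma flags for the QUE and DE features.
# Values shown are inferred from the examples discussed in the text.
SUBCAT = {
    "manger": {"QUE": False},   # cannot take a que-object (example 1)
    "penser": {"QUE": True},    # takes a que-complement (example 2)
    "boire":  {"DE": False},    # de la is a determiner here (example 3)
    "parler": {"DE": True},     # takes a de-complement (example 4)
}

def subcat_features(lemma):
    """Return the SF strings attached to a verb lemma, e.g. ['-QUE']."""
    entry = SUBCAT.get(lemma)
    if entry is None:
        return []               # verb not covered by the lexicon
    return [("+" if v else "-") + k for k, v in sorted(entry.items())]
```

In the real setting these values come from DicoValence (Section 6.1) rather than a hand-written table.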
Every verbal lemma occurrence in the treebank is enriched with subcat features, and three new factor templates have been defined in the parser in order to model the co-occurrence of subcat features and certain syntactic configurations. These templates are represented in Figure 1. The first one is a first-order template and the others are grandchildren templates. In the template descriptions, G, D and GD stand respectively for governor, dependent and grand-dependent; SF, POS, FCT and LEM stand respectively for subcat feature, part of speech, syntactic function and lemma. Two factors, of types 1 and 3, are represented in Figure 2. The first one models the co-occurrence of the subcat feature -QUE and an object introduced by a subordinating conjunction. Such a factor will receive a negative score at the end of training, since a verb having the -QUE feature should not license a direct object introduced by a subordinating conjunction. The second one models the co-occurrence of the feature -QUE and a modifier introduced by the subordinating conjunction que and having an adverb as a dependent. Such a factor will receive a positive score.
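The templates can be pictured as feature-string generators. The sketch below is our own guess at how such strings might be instantiated; the template names, field order, and string format are assumptions based on the description (SF, POS, FCT), not the parser's actual feature encoding.

```python
# Hypothetical instantiation of two of the factor templates described
# above as feature strings fed to the linear model.

def t1_features(gov_sf, dep_pos, fct):
    # First-order template: governor subcat feature x function x dep POS.
    return [f"T1:SF={sf}:FCT={fct}:DPOS={dep_pos}" for sf in gov_sf]

def t_grand_features(gov_sf, dep_fct, grand_pos):
    # Grandchildren template: governor SF x dependent function x
    # grand-dependent POS (e.g. -QUE with a MOD que-clause over an adverb).
    return [f"T2:SF={sf}:DFCT={dep_fct}:GDPOS={grand_pos}" for sf in gov_sf]
```

Each generated string becomes one weight in the model, so the -QUE/OBJ combination and the -QUE/MOD-over-ADV combination can receive opposite scores during training, as described above.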

Experimental Setup
We test the proposed model to verify the linguistic plausibility and computational feasibility of using MORPH links to represent syntactically idiosyncratic MWEs in a dependency parser enriched with subcat features. Therefore, we train a probabilistic dependency parsing model on a modified treebank, representing ADV+que and de+DET constructions using this special syntactic relation instead of pretokenization. Furthermore, in addition to regular features learned from the treebank, we also introduce and evaluate subcat features based on a lexicon of verbal valency, which help identify subordinate clauses and de prepositional phrases (see Section 5). We evaluate parsing precision and MWE identification on a test treebank and, more importantly, on a dataset built specifically to study the representation of our target constructions. All experiments used the NLP tool suite MACAON 5 , which comprises a second-order graph-based parser.

Data Sets and Resources
French Treebank (FTB) The parser was trained on the French Treebank, a syntactically annotated corpus of news articles from Le Monde (Abeillé et al., 2003). We used the version that was transformed into dependency trees by Candito et al. (2009), which was also used by Candito and Constant (2014) for experiments on MWE parsing. We used a standard split of 9,881 sentences (278K words) for training and 1,235 sentences (36K words) for test. We applied simple rules to transform the flat representation of ADV+que and de+DET constructions into MORPH-linked individual tokens. All other MWEs are kept unchanged in the training and test data: they are represented as single tokens, not decomposed into individual words.

MORPH Dataset The test portion of the FTB contains relatively few instances of our target constructions (see Tables 4 and 6). Thus, we created two specific data sets to evaluate the prediction of MORPH links. For ADV+que constructions, we manually selected the 7 most potentially ambiguous combinations from the top-20 most frequent combinations in the French Web as Corpus, frWaC (Baroni and Bernardini, 2006). 6 For de+DET constructions, we selected all 4 possible combinations. For each target ADV+que and de+DET construction, we randomly selected 1,000 sentences from the frWaC based on two criteria: (1) sentences should contain only one occurrence of the target construction, and (2) sentences should have between 10 and 20 words, to avoid distracting the annotators while still providing enough context. Additionally, for de+DET we selected only sentences in which a verb preceded the construction, in order to minimize the occurrence of nominal complements (président de la république, president of the republic) and focus on the determiner/preposition ambiguity. Two expert French native speakers annotated around 100 sentences per construction. Malformed or ambiguous sentences were discarded.
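The two selection criteria can be sketched as a simple filter. Whitespace tokenization is a simplification here (the real pipeline uses MACAON's tokenizer), and the function name is ours:

```python
# Sketch of the sentence-selection criteria for the MORPH dataset:
# exactly one occurrence of the target bigram, and 10 to 20 words.

def eligible(sentence, target):
    """target is a token bigram such as ('bien', 'que')."""
    toks = sentence.lower().split()
    n_occ = sum(
        1 for i in range(len(toks) - 1) if (toks[i], toks[i + 1]) == target
    )
    return n_occ == 1 and 10 <= len(toks) <= 20
```

The de+DET-specific condition (a verb preceding the construction) would additionally require POS tags and is omitted from the sketch.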
Disagreements were either discussed and resolved, or the sentence was discarded. 7 We can see in Table 1 that ADV+que constructions are highly ambiguous, with 56.4% of the cases being complex conjunctions. However, they also present high variability: even though they share identical syntactic behavior, some of them tend to form complex conjunctions very often (alors) while others occur more often in other syntactic configurations (tant and encore). As one can see in Table 2, de+DET sequences tend to function as a preposition followed by a determiner, with the notable exception of de les. The reason is that de les (actually the amalgam des) is the plural of the indefinite article (un), used with any plural noun, while the other determiners are partitives that tend to be used only with mass nouns. The last column of these tables shows the number of occurrences of each construction in the frWaC corpus. We can see that they are very frequent combinations, especially de+DET constructions, which account for 3.7% of the total number of bigrams in the corpus. This underlines the importance of correctly predicting their syntactic structure in a parser.
DicoValence Lexicon DicoValence (van den Eynde and Mertens, 2003) is a lexical resource that lists the subcategorization frames of more than 3,700 French verbs. 8 It describes more specifically the number and nature of the verbs' complements. DicoValence gives a more fine-grained description of the complements than what is needed in our feature templates. As described in Section 5, we have only kept the subcat features -QUE, +QUE, -DE and +DE of each verb. Table 3 shows the number of verbal entries having each of our four subcat features. Although the number of verbs described in DicoValence is moderate, its coverage on our data sets is high: 97.82% on the FTB test set and 95.48% on the MORPH dataset.

Evaluation
We evaluate our models on two aspects: parsing quality and MWE identification (Nivre and Nilsson, 2004; Vincze et al., 2013c; Candito and Constant, 2014). First, we use standard attachment scores to verify whether our models impact parsing performance in general. We compare the generated dependency trees with the reference in the test portion of the FTB, reporting the proportion of matched links, both in terms of structure (unlabeled attachment score, UAS) and of labeled links (labeled attachment score, LAS). Since our focus is on MWE parsing, we are also interested in MWE identification metrics. We focus on words whose dependency label is MORPH and calculate the proportion of correctly predicted MORPH links among those in the parser output (precision), among those in the reference (recall), and their harmonic mean (F1). Since some of the phenomena are quite rare in the FTB test portion, we focus on the MORPH dataset, which contains around 100 instances of each target construction.

We compare our approach with two simple baselines. The first one consists in systematically pretokenizing ADV+que as a single token, while de+DET is systematically left as two separate tokens. This baseline emulates the behavior of most parsing pipelines, which deal with complex function words during tokenization. This corresponds to choosing the majority classes in the last row of Tables 1 and 2. For ADV+que, the precision of the baseline is 56.4%; if we assume recall is 100%, this yields an F1 score of 72.2%. For de+DET, however, recall is 0%, since no MORPH link is predicted at all. Therefore, we only look at the baseline's precision of 63.5%. A second, slightly more sophisticated baseline consists in choosing the majority class for each individual construction and averaging precision over the constructions. In this case, the average precision is 75.3% for ADV+que and 76.6% for de+DET. We also compare our model to the one proposed by Green et al. (2013).
We used the pretrained model available as part of the Stanford parser 9 . Their model outputs constituent trees, which were automatically converted to unlabeled dependency structures. We ignore the nature of the dependency link, only checking whether the target construction elements are linked in the correct order.
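The MORPH identification metrics described above can be sketched as follows, treating each MORPH link as a (governor, dependent) pair. This mirrors the metric definitions, not MACAON's actual scorer:

```python
# Precision, recall and F1 over predicted vs. gold MORPH links,
# each link represented as a (governor, dependent) index pair.

def morph_prf(gold, pred):
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                      # correctly predicted links
    p = tp / len(pred) if pred else 0.0        # among parser output
    r = tp / len(gold) if gold else 0.0        # among reference
    f1 = 2 * p * r / (p + r) if p + r else 0.0 # harmonic mean
    return p, r, f1
```

Under this formulation the first baseline's behavior for de+DET is immediate: with no predicted MORPH links, recall is 0 by definition, so only precision is meaningful.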
Our experiments use the MACAON tool suite. For the FTB, gold POS tags and gold lemmas are given as input to the parser. In the case of the MORPH dataset, for which we do not have gold POS tags and lemmas, they are predicted by MACAON, and the first-best prediction is given as input to the parser. The number of occurrences of ADV+que constructions in the FTB test set is small (27). It is therefore difficult to draw clear conclusions concerning the task of predicting the MORPH dependency.
The precision and recall have nevertheless been reported. Recall is perfect (all MORPH dependencies have been predicted) and precision is reasonable (the parser overpredicts a little). The table also shows that the use of subcat features is not beneficial here, as the attachment scores as well as precision decrease. The decrease in precision is misleading, though, due to the small number of occurrences it has been computed on. Table 5 displays the precision, recall and F1 of the prediction of the MORPH dependency on the 730 ADV+que sentences of the MORPH dataset, without and with the use of subcat features. The scores obtained are lower than in the same experiments on the FTB. Precision is higher than recall, which indicates that the parser has a tendency to underpredict. We also present the precision of the two baselines described in Section 6.2. Only in two cases does the per-construction majority baseline (indiv.) outperform our parser without subcat features. These two constructions do not tend to form complex conjunctions, that is, the parser overgenerates MORPH dependencies for them. Here, subcat features help increase precision, systematically outperforming the baselines.
The introduction of subcat features has a beneficial but limited impact on the results, increasing precision while slightly lowering recall, augmenting the tendency of the parser to underpredict MORPH dependencies. Overall, our models are more precise than the Stanford parser at predicting MORPH links, especially for bien que and encore que. Table 6 reports the results of the same experiments on de+DET constructions. It shows that the frequency of de+DET constructions is higher than that of ADV+que constructions. It also shows that the introduction of subcat features has a positive impact on the prediction of the MORPH dependency, but a negative effect on the attachment scores. Table 7 reveals that predicting the correct structure of de+DET constructions is more difficult for the parser than that of ADV+que constructions.
Here, not only is the majority class the non-MWE analysis (63.5%), but there is also higher ambiguity, because nominal and adverbial complements have the same structure. This impacts the performance of the Stanford parser, which overgenerates MORPH links, achieving the lowest precision for all constructions except des. The results also show that the introduction of subcat features has an important impact on the quality of the prediction (F1 jumps from 75% to 84.67%). The use of subcat features slightly improves the identification of de les, which is a determiner most of the time. On the other hand, it greatly improves F1 for the other constructions, which appear less often as determiners. We believe that the higher impact of subcat features on de+DET is mainly due to the fact that the number of verbs licensing complements introduced by the preposition de is higher than the number of verbs licensing complements introduced by the conjunction que (see Table 3). Therefore, the parser trained without subcat features can only rely on the examples present in the FTB, which are proportionally fewer in the first case than in the second.

Conclusions
This paper introduced and evaluated a joint parsing and MWE identification model that can effectively detect and represent ambiguous complex function words. The difficulty of processing such expressions is underestimated because of their limited variability: they are often pregrouped as words-with-spaces in many parsing architectures (Sag et al., 2002). However, we did not use gold tokenization, which is unrealistic for ambiguous MWEs (Nivre and Nilsson, 2004; Korkontzelos and Manandhar, 2010). We proposed to deal with these constructions during parsing, when the syntactic information required to disambiguate them is available. Thus, we trained a graph-based dependency parser on a modified treebank where complex function words were linked with a MORPH dependency. Our results demonstrate that a standard parsing model can correctly learn such special links and predict them for unseen constructions. Nonetheless, the model is more accurate when we integrate external information from a syntactic lexicon. This improved precision for ADV+que and especially de+DET constructions. For the latter, F1 improved by almost 10 points, going from 75% to 84.61%.

Table 7: MORPH link prediction for de+DET constructions: precision of the global majority baseline, precision of the individual per-construction baseline, precision of the Green et al. (2013) constituent parser, and precision, recall and F1 of our dependency parser without and with subcat features.
This study raised several linguistic and computational questions. Some complex function words include more than two elements, like si bien que (so much that) and d'autant (plus) que (especially as). Moreover, they may contain nested expressions with different meanings and structures, e.g. tant que (as long as) is a conjunction but en tant que (as) is a preposition. The same applies to quantified partitive determiners, like beaucoup de (much) and un (petit) peu de (a (little) bit of). Their identification and representation is planned as a future extension of this work.
We would also like to compare our approach to sequence models (Schneider et al., 2014). Careful error analysis could help us understand in which cases syntactic features can help. Moreover, different variants of the syntactic features and more sophisticated representations for syntactic lexicons could help improve MWE parsing further. For instance, we represent the subcat features of pronominal verbs and their simple counterparts with the same features, but they should be distinguished, e.g. se rappeler (remember) is +DE but rappeler (remind) is -DE.