Deep Lexical Segmentation and Syntactic Parsing in the Easy-First Dependency Framework

We explore the consequences of representing token segmentations as hierarchical structures (trees) for the task of Multiword Expression (MWE) recognition, in isolation or in combination with dependency parsing. We propose a novel representation of token segmentation as trees on tokens, resembling dependency trees. Given this new representation, we present and evaluate two different architectures to combine MWE recognition and dependency parsing in the easy-first framework: a pipeline and a joint system, both taking advantage of lexical and syntactic dimensions. We experimentally validate that MWE recognition significantly helps syntactic parsing.


Introduction
Lexical segmentation is a crucial task for natural language understanding as it detects semantic units of texts. One of the main difficulties comes from the identification of multiword expressions [MWE] (Sag et al., 2002), which are sequences made of multiple words displaying multidimensional idiomaticity (Nunberg et al., 1994). Such expressions may exhibit syntactic freedom and varying degrees of compositionality, and many studies show the advantages of combining MWE identification with syntactic parsing (Savary et al., 2015), for both tasks (Wehrli, 2014). Indeed, MWE detection may help parsing, as it reduces the number of lexical units, and in turn parsing may help detect MWEs with syntactic freedom (syntactic variations, discontinuity, etc.).
In the dependency parsing framework, some previous work incorporated MWE annotations within syntactic trees, in the form of complex subtrees either with flat structures (Nivre and Nilsson, 2004; Eryigit et al., 2011; Seddah et al., 2013) or deeper ones (Vincze et al., 2013; Candito and Constant, 2014). However, these representations do not capture deep lexical analyses like nested MWEs. In this paper, we propose a two-dimensional representation that separates lexical and syntactic layers with two distinct dependency trees sharing the same nodes. This representation facilitates the annotation of complex lexical phenomena like embedding of MWEs (e.g. I will (take a (rain check))). Given this representation, we present two easy-first dependency parsing systems: one based on a pipeline architecture and another as a joint parser.

Deep Segmentation and Dependencies
This section describes a lexical representation able to handle nested MWEs, extended from Constant and Le Roux (2015) which was limited to shallow MWEs. Such a lexical analysis is particularly relevant to perform deep semantic analysis.
A lexical unit [LU] is a subtree of the lexical segmentation tree composed of either a single token unit or an MWE. In the case of a single token unit, the subtree is limited to a single node. In the case of an MWE, the subtree is rooted at its leftmost LU, from which there are arcs to every other LU of the MWE. For instance, the MWE in spite of, made of three single token units, is a subtree rooted at in. It comprises two arcs: in → spite and in → of. The MWE make big deal is more complex, as it is formed of a single token unit make and an MWE big deal. It is represented as a subtree whose root is make, connected to the root of the MWE subtree corresponding to big deal. The subtree associated with big deal is made of two single token units; it is rooted at big, with an arc big → deal. Such structuring makes it possible to find nested MWEs when the root is not an MWE itself, as for make big deal. The situation is different for the MWE Los Angeles Lakers, comprising the MWE Los Angeles and the single token unit Lakers. In that case, the subtree has a flat structure, with two arcs from the node Los, and is structurally equivalent to in spite of, which has no nested MWEs. Therefore, some extra information is needed in order to distinguish these two cases. We use arc labels. Labeling requires maintaining a counter l that indicates the embedding level in the leftmost LU of the encompassing MWE. Labels have the form sub_l_mwe for l ≥ 0. Let $U = U_1 \ldots U_n$ be an LU composed of n LUs. If n = 1, it is a single token unit. Otherwise, subtree(U, 0), the lexical subtree for U (footnote 2), is recursively constructed by adding arcs $subtree(U_1, l+1) \xrightarrow{sub\_l\_mwe} subtree(U_i, 0)$ for $i \neq 1$. In the case of a shallow representation, all LUs of U are single token units.
Once the LU subtrees (the internal dependencies) are built, arcs must be created to connect them and form a complete tree: we call these external dependencies. LUs are sequentially linked together: each pair of consecutive LUs with roots $(w_i, w_j)$, $i < j$, gives an arc $w_i \xrightarrow{lex} w_j$. Figure 1 and Figure 2 respectively display the deep and shallow lexical segmentations of the sentence The Los Angeles Lakers made a big deal out of it.
For readability, we write mwe for sub_0_mwe and submwe for sub_1_mwe.
Footnote 2: The second argument l corresponds to the embedding level.
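To make the construction concrete, the following Python sketch builds the internal and external arcs for a sequence of LUs. It is one reading of the recursive definition given in this section, not the authors' implementation: the nested-list encoding of LUs, the function names subtree and lexical_tree, and the use of token strings as node identifiers (which assumes no token is repeated) are assumptions made for illustration.

```python
from typing import List, Tuple, Union

# Assumed encoding: an LU is a token (str) or an MWE given as a list of LUs.
LU = Union[str, list]
Arc = Tuple[str, str, str]  # (head token, label, dependent token)


def subtree(unit: LU, level: int, arcs: List[Arc]) -> str:
    """Append the internal arcs of `unit` to `arcs` and return its root token."""
    if isinstance(unit, str):  # single token unit: one node, no arc
        return unit
    # The root is the root of the leftmost LU, one embedding level deeper.
    root = subtree(unit[0], level + 1, arcs)
    for other in unit[1:]:
        other_root = subtree(other, 0, arcs)
        arcs.append((root, f"sub_{level}_mwe", other_root))
    return root


def lexical_tree(units: List[LU]) -> List[Arc]:
    """Internal arcs of every LU, then `lex` arcs between consecutive LU roots."""
    arcs: List[Arc] = []
    roots = [subtree(unit, 0, arcs) for unit in units]
    for left, right in zip(roots, roots[1:]):
        arcs.append((left, "lex", right))
    return arcs


flat = lexical_tree([["in", "spite", "of"]])
nested_right = lexical_tree([["make", ["big", "deal"]]])
nested_left = lexical_tree([[["Los", "Angeles"], "Lakers"]])
linked = lexical_tree(["The", [["Los", "Angeles"], "Lakers"]])
```

Under this encoding, Los Angeles Lakers yields an arc Los → Angeles labeled sub_1_mwe and an arc Los → Lakers labeled sub_0_mwe, whereas in spite of yields two sub_0_mwe arcs, so the labels separate two subtrees that have the same flat shape.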

Easy-first parsing
Informally, easy-first parsing, proposed by Goldberg and Elhadad (2010), predicts easier dependencies before riskier ones. It decides for each token whether it must be attached to the root of an adjacent subtree and how this attachment should be labeled. The order in which these decisions are made is not fixed in advance: the highest-scoring decisions are made first and constrain the following decisions.
This framework looks appealing in order to test our assumption that segmentation and parsing are mutually informative, while leaving the exact flow of information to be learned by the system itself: we do not postulate any priority between the tasks nor that all attachment decisions must be taken jointly. On the contrary, we expect most decisions to be made independently except for some difficult cases that need both lexical and syntactic knowledge.
We now present two adaptations of this strategy to build both lexical and parse trees from a unique sequence of tokens (footnote 4). The key component is to use features linking information from the two dimensions.

Pipeline Architecture
In this trivial adaptation, two parsers are run sequentially. The first one builds a structure in one dimension (i.e. for segmentation or syntax). The second one builds a structure in the other dimension, with the result of the first parser available as features.

Joint Architecture
The second adaptation is more substantial and takes the form of a joint parsing algorithm. This adaptation is provided in Algorithm 1. It uses a single classifier to predict lexical and syntactic actions. As in easy-first, each iteration predicts the most certain head attachment action given the currently predicted subtrees, but here it may belong to any dimension. This action can be mapped to an edge in the appropriate dimension via function EDGE. Function score(a, i) computes the dot product of feature weights and features at position i using surrounding subtrees in both dimensions (footnote 5).
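The main loop of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' code: the scoring function is passed in as a plain callable standing in for the learned dot product, tokens are integers with 0 playing the role of w0, the direction "<-" is assumed to mean that the left element of the pair becomes the child, and the root is assumed never to be attached as a child. The names edge, joint_easy_first, toy_score and toy_actions are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

Action = Tuple[str, str, str]  # (direction, label, dimension)
Arc = Tuple[int, str, int]     # (parent, label, child)
Pending = Dict[str, List[int]]


def edge(pending: Pending, action: Action, i: int) -> Tuple[int, str, int, str]:
    """Map an action at position i to (parent, label, child, dimension)."""
    direction, label, dim = action
    left, right = pending[dim][i], pending[dim][i + 1]
    if direction == "<-":  # assumption: the arrow points at the child
        return right, label, left, dim
    return left, label, right, dim


def joint_easy_first(
    n_tokens: int,
    actions: List[Action],
    score: Callable[[Action, int, Pending], float],
) -> Tuple[Set[Arc], Set[Arc]]:
    """Greedy joint loop; token 0 plays the role of w0 (artificial root)."""
    pending: Pending = {"syn": list(range(n_tokens)), "lex": list(range(n_tokens))}
    arcs: Dict[str, Set[Arc]] = {"syn": set(), "lex": set()}
    while any(len(roots) > 1 for roots in pending.values()):
        candidates = [
            (score(action, i, pending), action, i)
            for action in actions
            for i in range(len(pending[action[2]]) - 1)
            # assumption: the root token is never attached as a child
            if not (action[0] == "<-" and pending[action[2]][i] == 0)
        ]
        # `actions` must offer "->" in every dimension, or this may be empty.
        _, best_action, best_i = max(candidates, key=lambda c: c[0])
        parent, label, child, dim = edge(pending, best_action, best_i)
        arcs[dim].add((parent, label, child))
        pending[dim].remove(child)
    return arcs["lex"], arcs["syn"]


def toy_score(action: Action, i: int, pending: Pending) -> float:
    # Hypothetical stand-in for the learned model: prefer "->" and low positions.
    return (0.5 if action[0] == "->" else 0.0) - i


toy_actions = [("->", "dep", "syn"), ("<-", "dep", "syn"), ("->", "lex", "lex")]
lex_arcs, syn_arcs = joint_easy_first(3, toy_actions, toy_score)
```

A single candidate list mixes actions from both dimensions, so the order in which lexical and syntactic attachments are made is decided by the scores alone. Because each dimension only ever attaches adjacent pending roots, each tree is built projectively on its own.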
Algorithm 1 Joint Easy-first parsing
function JOINT EASY-FIRST PARSING(w_0 ... w_n)
    Let A be the set of possible actions
    arcs_s, arcs_l := (∅, ∅)
    h_s, h_l := w_0 ... w_n, w_0 ... w_n
    while |h_l| > 1 ∨ |h_s| > 1 do
        â, î := argmax_{a ∈ A, i ∈ [|h_d|]} score(a, i)
        (par, lab, child, dim) := EDGE((h_s, h_l), â, î)
        arcs_dim := arcs_dim ∪ (par, lab, child)
        h_dim := h_dim \ {child}
    end while
    return (arcs_l, arcs_s)
end function
function EDGE((h_s, h_l), (dir, lab, dim), i)
    if dir = ← then    ▷ we have a left edge
    [...]
We can reuse the reasoning from Goldberg and Elhadad (2010) and derive a worst-case time com- [...]
Footnote 4: It is straightforward to add any number of tree structures.
Footnote 5: Let us note that the algorithm builds projective trees for each dimension, but their union may contain crossing arcs.
[...] Schneider et al. (2014a). The FTB contains annotations of contiguous MWEs. We generated the dataset from the version described in Candito and Constant (2014) and used the shallow lexical representation, in the official train/dev/test split of the SPMRL shared task (Seddah et al., 2013). The Sequoia treebank contains some limited annotations of MWEs (usually, compounds having an irregular syntax). We manually extended the coverage to all types of MWEs including discontiguous ones. We also included deep annotation of MWEs (in particular, nested ones). We used a 90%/10% train/test split in our experiments. Some statistics about the data sets are provided in table 4.1. Tokens were enriched with their predicted part-of-speech (POS) and information from MWE lexicon lookup as in Candito and Constant (2014).

Parser and features
Parser. We implemented our systems by modifying the parser of Y. Goldberg (footnote 7), which we also used as a baseline. We trained all models for 20 iterations with a dynamic oracle, using the following exploration policy: always choose an oracle transition in the first 2 iterations (k = 2), then choose the model prediction with probability p = 0.9.
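The exploration policy can be written down in a few lines. The sketch below only illustrates the rule stated above, with k = 2 and p = 0.9. The function name choose_transition, its signature, and the convention that iterations are numbered from 0 are assumptions, and it leaves out everything else a dynamic-oracle trainer does (computing zero-cost transitions, updating weights).

```python
import random


def choose_transition(iteration, oracle_action, predicted_action,
                      k=2, p=0.9, rng=random):
    """Pick the transition to follow during training.

    Assumes iterations are numbered from 0, so the first k iterations
    are 0 .. k-1. `rng` only needs a `random()` method returning a
    float in [0, 1).
    """
    if iteration < k:
        # Early iterations: always follow the oracle.
        return oracle_action
    if rng.random() < p:
        # Later iterations: follow the model with probability p.
        return predicted_action
    return oracle_action
```

Following the model's own predictions during training exposes it to the configurations it will reach after its own mistakes, which is the motivation usually given for exploration with a dynamic oracle.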
Features. One-dimensional features were taken directly from the code supporting [...]. We added information on typographical cues (hyphenation, digits, capitalization, ...) and the existence of substrings in MWE dictionaries in order to help lexical analysis. Following Constant et al. (2012) and Schneider et al. (2014a), we used dictionary lookups to build a first naive segmentation and incorporated it as a set of features. Two-dimensional features were used in both pipeline and joint strategies. We first added syntactic path features to the lexical dimension, so that syntax can guide segmentation. Conversely, we also added lexical path features to the syntactic dimension to provide information about lexical connectivity. For instance, two nodes being checked for attachment in the syntactic dimension can be associated with information describing whether one of the corresponding nodes is an ancestor of the other one in the lexical dimension (i.e. indicating whether the two syntactic nodes are linked via internal or external paths). We also selected automatically generated features combining information from both dimensions. We chose a simple data-driven heuristic to select combined features. We ran one learning iteration over the FTB training corpus, adding all possible combinations of syntactic and lexical features. We picked the templates of the 10 combined features whose scores had the greatest absolute values. Although this heuristic may not favor the most discriminant features, we found that the chosen features helped accuracy on the development set.
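As an illustration of a lexical path feature used in the syntactic dimension, the sketch below tests whether one node dominates the other in the lexical tree and whether the connecting path is internal to an MWE or goes through an external lex arc. The encoding of the lexical tree as a child-to-(parent, label) map, the function names, and the feature strings are hypothetical; the actual feature templates are not specified here.

```python
from typing import Dict, Optional, Tuple

# Assumed encoding of the lexical tree: child -> (parent, arc label).
LexHeads = Dict[int, Tuple[int, str]]


def path_kind(lex_heads: LexHeads, ancestor: int, node: int) -> Optional[str]:
    """Return 'internal' or 'external' if `ancestor` dominates `node`, else None."""
    labels = []
    while node in lex_heads:
        node, label = lex_heads[node]  # climb one arc towards the root
        labels.append(label)
        if node == ancestor:
            # A path is external as soon as it uses one `lex` arc.
            return "external" if "lex" in labels else "internal"
    return None


def lexical_path_feature(lex_heads: LexHeads, a: int, b: int) -> str:
    """Feature string for a pair of nodes considered for syntactic attachment."""
    kind = path_kind(lex_heads, a, b) or path_kind(lex_heads, b, a)
    return f"lexpath={kind or 'none'}"


# Tokens: 0 The, 1 Los, 2 Angeles, 3 Lakers, 4 made (illustrative segmentation).
example_heads: LexHeads = {
    1: (0, "lex"), 2: (1, "submwe"), 3: (1, "mwe"), 4: (1, "lex"),
}
```

With this example, the pair (Los, Lakers) is linked by an internal path, the pair (The, Lakers) by an external one, and (Angeles, Lakers) by neither, since neither node dominates the other.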

Results
For each dataset, we carried out four experiments. First, we trained and ran independently two distinct baseline easy-first parsers using one-dimensional features: one producing a lexical segmentation, the other predicting a syntactic parse tree. We also trained and ran a joint easy-first system predicting lexical segmentations and syntactic parse trees, using two-dimensional features. We also experimented with the pipeline system for each dimension, which consists in applying the baseline parser to one dimension and using the resulting tree as a source of two-dimensional features in a standard easy-first parser applied to the other dimension. Since pipeline architectures are known to be prone to error propagation, we also ran an experiment where the second stage of the pipeline is fed with oracle first-stage trees.
Footnote 7: We started from the version available at the time of writing at https://bitbucket.org/yoavgo/tacl2013dynamicoracles
Results on the test sets are provided in Table 2, where LAS and UAS are computed with punctuation. Overall, we can see that lexical information tends to help syntactic prediction, while the converse is unclear.
Table 2: Results on our three test sets. Statistically significant differences (p-value < 0.05) from the corresponding "distinct" setting are indicated with †. Rows -oracle trees are the same as pipeline but using oracle, instead of predicted, trees.
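For reference, unlabeled and labeled attachment scores counted over all tokens, punctuation included, can be computed as in the following sketch. The data layout (one (head, label) pair per token) and the function name attachment_scores are assumptions made for illustration; this is not the evaluation script used for Table 2, and the labels in the example are placeholders.

```python
from typing import List, Tuple

Token = Tuple[int, str]  # (head index, dependency label)


def attachment_scores(gold: List[Token], predicted: List[Token]) -> Tuple[float, float]:
    """Return (UAS, LAS) over all tokens, punctuation included."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and aligned")
    # UAS: the head is correct; LAS: head and label are both correct.
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    correct_arcs = sum(g == p for g, p in zip(gold, predicted))
    return correct_heads / len(gold), correct_arcs / len(gold)


gold_tokens = [(2, "det"), (0, "root"), (2, "ponct")]
pred_tokens = [(2, "mod"), (0, "root"), (2, "ponct")]
uas, las = attachment_scores(gold_tokens, pred_tokens)
```

In this example every head is correct but one label is wrong, so UAS is 1.0 while LAS is 2/3. No token is filtered out, which is what "computed with punctuation" means here.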

Discussion
The first striking observation is that the syntactic dimension does not help the predictions in the lexical dimension, contrary to what could be expected. In practice, we can observe that variations and discontinuity of MWEs are not frequent in our data sets. For instance, Schneider et al. (2014a) notice that only 15% of the MWEs in EWT are discontiguous and most of them have gaps of one token. This could explain why syntactic information is not useful for segmentation. On the other hand, the lexical dimension tends to help syntactic predictions. More precisely, while the pipeline and the joint approach reach comparable scores on the FTB and Sequoia, the joint system has disappointing results on EWT. The good scores for Sequoia could be explained by the larger MWE coverage.
In order to get a better intuition of the real impact of each of the three approaches, we broke down the syntax results by dependency label. Some labels are particularly informative. First of all, the precision on the modifier label mod, which is the most frequent one, is greatly improved using the pipeline approach as compared with the baseline (around 1 point). This can be explained by the fact that many nominal MWEs have the form of a regular noun phrase, to which its internal adjectival or prepositional constituents are attached with the mod label. Recognizing a nominal MWE on the lexical dimension may therefore give a relevant clue on its corresponding syntactic structure. Then, the dep_cpd label connects components of MWEs with irregular syntax that cannot receive standard labels. We can observe that the pipeline (resp. the joint) approach clearly improves the precision (resp. recall) as compared with the baseline (+1.6 points). This means that the combination of a preliminary lexical segmentation and a possibly partial syntactic context helps improve the recognition of syntax-irregular MWEs. Coordination labels (dep.coord and coord) are particularly interesting, as the joint system outperforms the other two on them. Coordination is known to be a very complex phenomenon: these scores would tend to show that the lexical and syntactic dimensions mutually help each other.
When comparing this work to state-of-the-art systems on data sets with shallow annotation of MWEs, we can see that we obtain MWE recognition scores comparable to systems of equivalent complexity and/or available information. This means that our novel representation which allows for the annotation of more complex lexical phenomena does not deteriorate scores for shallow annotations.

Conclusions and Future Work
In this paper we presented a novel representation of deep lexical segmentation in the form of trees, forming a dimension distinct from syntax. We experimented with strategies to predict both dimensions in the easy-first dependency parsing framework. We showed empirically that joint and pipeline processing are beneficial for syntactic parsing while hardly impacting deep lexical segmentation.
The presented combination of parsing and segmenting does not enforce any structural constraint over the two trees. We plan to address this issue in future work. We will explore less redundant, more compact representations of the two dimensions since some annotations can be factorized between the two dimensions (e.g. MWEs with irregular syntax) and some can easily be induced from others (e.g. sequential linking between lexical units).