A Transition-Based System for Joint Lexical and Syntactic Analysis

We present a transition-based system that jointly predicts the syntactic structure and lexical units of a sentence by building two structures over the input words: a syntactic dependency tree and a forest of lexical units including multiword expressions (MWEs). This combined representation allows us to capture both the syntactic and semantic structure of MWEs, which in turn enables deeper downstream semantic analysis, especially for semi-compositional MWEs. The proposed system extends the arc-standard transition system for dependency parsing with transitions for building complex lexical units. Experiments on two different data sets show that the approach significantly improves MWE identification accuracy (and sometimes syntactic accuracy) compared to existing joint approaches.


Introduction
Multiword expressions (MWEs) are sequences of words that form non-compositional semantic units. Their identification is crucial for semantic analysis, which is traditionally based on the principle of compositionality. For instance, the meaning of cut the mustard cannot be compositionally derived from the meaning of its elements and the expression therefore has to be treated as a single unit. Since Sag et al. (2002), MWEs have attracted growing attention in the NLP community.
Identifying MWEs in running text is challenging for several reasons (Baldwin and Kim, 2010; Seretan, 2011; Ramisch, 2015). First, MWEs encompass very diverse linguistic phenomena, such as complex grammatical words (in spite of, because of), nominal compounds (light house), non-canonical prepositional phrases (above board), verbal idiomatic expressions (burn the midnight oil), light verb constructions (have a bath), multiword names (New York), and so on. They can also be discontiguous in the sense that the sequence can include intervening elements (John pulled Mary's leg). They may also vary in their morphological forms (hot dog, hot dogs), in their lexical elements (lose one's mind/head), and in their syntactic structure (he took a step, the step he took).
The semantic processing of MWEs is further complicated by the fact that there exists a continuum between entirely non-compositional expressions (piece of cake) and almost free expressions (traffic light). Many MWEs are indeed semicompositional. For example, the compound white wine denotes a type of wine, but the color of the wine is not white, so the expression is only partially transparent. In the light verb construction take a nap, nap keeps its usual meaning but the meaning of the verb take is bleached. In addition, the noun can be compositionally modified as in take a long nap. Such cases show that MWEs may be decomposable and partially analyzable, which implies the need for predicting their internal structure in order to compute their meaning.
From a syntactic point of view, MWEs often have a regular structure and do not need special syntactic annotation. Some MWEs have an irregular structure, such as by and large, which on the surface is a coordination of a preposition and an adjective. They are syntactically as well as semantically non-compositional and cannot be represented with standard syntactic structures, as stated in Candito and Constant (2014). Many of these irregular MWEs are complex grammatical words like because of, in spite of and in order to, which are fixed (grammatical) MWEs in the sense of Sag et al. (2002). In some treebanks, these are annotated using special structures and labels because they cannot be modified or decomposed. We hereafter use the term fixed MWE to refer to either fixed or irregular MWEs.
In this paper, we present a novel representation that allows both regular and irregular MWEs to be adequately represented without compromising the syntactic representation. We then show how this representation can be processed using a transition-based system that is a mild extension of a standard dependency parser. This system takes as input a sentence consisting of a sequence of tokens and predicts its syntactic dependency structure as well as its lexical units (including MWEs). The resulting structure combines two factorized substructures: (i) a standard tree representing the syntactic dependencies between the lexical elements of the sentence and (ii) a forest of lexical trees including MWEs identified in the sentence. Each MWE is represented by a constituency-like tree, which permits complex lexical units like MWE embeddings (for example, [[Los Angeles] Lakers], I will [take a [rain check]]). The syntactic and lexical structures are factorized in the sense that they share lexical elements: both tokens and fixed MWEs. The proposed parsing model is an extension of a classical arc-standard parser, integrating specific transitions for MWE detection. In order to deal with the two linguistic dimensions separately, it uses two stacks (instead of one). It is synchronized by using a single buffer, in order to handle the factorization of the two structures. It also includes different hard constraints on the system in order to reduce ambiguities artificially created by the addition of new transitions. To the best of our knowledge, this system is the first transition-based parser that includes a specific mechanism for handling MWEs in two dimensions.
Previous related research has usually proposed either pipeline approaches, with MWE identification performed before or after dependency parsing (Kong et al., 2014; Vincze et al., 2013a), or workaround joint solutions using off-the-shelf parsers trained on dependency treebanks where MWEs are annotated by specific subtrees (Nivre and Nilsson, 2004; Eryigit et al., 2011; Vincze et al., 2013b; Candito and Constant, 2014; Nasr et al., 2015).

Syntactic and Lexical Representations
A standard dependency tree represents syntactic structure by establishing binary syntactic relations between words. This is an adequate representation of both syntactic and lexical structure on the assumption that words and lexical units are in a one-to-one correspondence. However, as argued in the introduction, this assumption is broken by the existence of MWEs, and we therefore need to distinguish lexical units as distinct from words.
In the new representation, each lexical unit, whether a single word or an MWE, is associated with a lexical node, which has linguistic attributes such as surface form, lemma, part-of-speech tag and morphological features. With an obvious reuse of terminology from context-free grammar, lexical nodes corresponding to MWEs are said to be non-terminal, because they have other lexical nodes as children, while lexical nodes corresponding to single words are terminal (and do not have any children).
Some lexical nodes are also syntactic nodes, that is, nodes of the syntactic dependency tree. These nodes are either non-terminal nodes corresponding to (complete) fixed MWEs or terminal nodes corresponding to words that do not belong to a fixed MWE. Syntactic nodes are connected into a tree structure by binary, asymmetric dependency relations pointing from a head node to a dependent node. Figure 1 shows the representation of the sentence the prime minister made a few good decisions. It contains three non-terminal lexical nodes: one fixed MWE (a few), one contiguous non-fixed MWE (prime minister) and one discontiguous non-fixed MWE (made decisions). Of these, only the first is also a syntactic node. Note that, for reasons of clarity, we have suppressed the lexical children of the fixed MWE in Figure 1. (The non-terminal node corresponding to a few has the lexical children a and few.) For the same reason, we are not showing the linguistic attributes of lexical nodes. For example, the node made-decisions has the following set of features: surface-form='made decisions', lemma='make decision', POS='V'. Non-fixed MWEs have regular syntax and their components might have some autonomy. For example, in the light verb construction made-decisions, the noun decisions is modified by the adjective good, which is not an element of the MWE.
The proposed representation of fixed MWEs is an alternative to using special dependency labels as has often been the case in the past (Nivre and Nilsson, 2004; Eryigit et al., 2011). In addition to special labels, MWEs are then represented as a flat subtree of the syntactic tree. The root of the subtree is the left-most or right-most element of the MWE, and all the other elements are attached to this root with dependencies having special labels. Despite the special labels, these subtrees look like ordinary dependency structures and may confuse a syntactic parser. In our representation, fixed MWEs are instead represented by nodes that are atomic with respect to syntactic structure (but complex with respect to lexical structure), which makes it easier to store linguistic attributes that belong to the fixed MWE and cannot be derived from its components. The new representation also allows us to represent the hierarchical structure of embedded MWEs. Figure 2 provides an analysis of she took a rain check that includes such an embedding. The lexical node took-rain-check corresponds to a light verb construction where the object is a compound noun that keeps its semantic interpretation whereas the verb has a neutral value. One of its children is the lexical node rain-check corresponding to a compound noun.
Let us now define the representation formally. Given a sentence x = x_1, ..., x_n consisting of n tokens, the syntactic and lexical representation is a quadruple (V, F, N, A), where
1. V is the set of terminal nodes, corresponding one-to-one to the tokens x_1, ..., x_n,
2. F is a set of n-ary trees on V, with each tree corresponding to a fixed MWE and the root labeled with the part-of-speech tag for the MWE,
3. N is a set of n-ary trees on F, with each tree corresponding to a non-fixed MWE and the root labeled with the part-of-speech tag for the MWE,
4. A is a set of labeled dependency arcs defining a tree over F.
This is a generalization of the standard definition of a dependency tree (see, for example, Kübler et al. (2009)), where the dependency structure is defined over an intermediate layer of lexical nodes (F) instead of directly on the terminal nodes (V), with an additional layer of non-fixed MWEs added on top. To exemplify the definition, here are the formal structures corresponding to the representation visualized in Figure 1.
where V contains the terminal and (F ∪ N) − V the non-terminal lexical nodes. The set of syntactic nodes is simply F.
It is worth noting that the representation imposes some limitations on what MWEs can be represented. In particular, we can only represent overlapping MWEs if they are cases of embedding, that is, cases where one MWE is properly contained in the other. For example, in an example like she took a walk then a bath, it might be argued that took should be part of two lexical units: took-walk and took-bath. This cannot currently be represented. By contrast, we can accommodate cases where two lexical units are interleaved, as in the French example il prend un cachet et demi, with the two units prend-cachet and un-et-demi, which occur in the crossed pattern A1 B1 A2 B2. However, while these cases can be represented in principle, the parsing model we propose will not be capable of processing them.
Finally, it is worth noting that, although our representation in general allows lexical nodes with arbitrary branching factor for flat MWEs, it is often convenient for parsing to assume that all trees are binary (Crabbé, 2014). For the rest of the paper, we therefore assume that non-binary trees are always transformed into equivalent binary trees using either right or left binarization. Such transformations add intermediate temporary nodes that are only used for internal processing.
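As an illustration of the binarization step, the following Python sketch builds a flat lexical node and right-binarizes it. The names LexNode and binarize_right, and the "*" suffix used to mark temporary nodes, are hypothetical choices made for this example only; the paper does not prescribe a particular data structure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LexNode:
    """A lexical node: terminal (a token) or non-terminal (an MWE)."""
    label: str                       # surface form for tokens, POS tag for MWEs
    children: List["LexNode"] = field(default_factory=list)
    temporary: bool = False          # True for nodes introduced by binarization

    def is_terminal(self) -> bool:
        return not self.children

    def leaves(self) -> List[str]:
        if self.is_terminal():
            return [self.label]
        return [leaf for child in self.children for leaf in child.leaves()]


def binarize_right(node: LexNode) -> LexNode:
    """Return an equivalent tree whose nodes have at most two children."""
    if node.is_terminal():
        return node
    kids = [binarize_right(child) for child in node.children]
    # Fold the two rightmost children into a temporary node until two remain.
    while len(kids) > 2:
        folded = LexNode(node.label + "*", kids[-2:], temporary=True)
        kids = kids[:-2] + [folded]
    return LexNode(node.label, kids, node.temporary)


# Flat fixed MWE "in spite of" with (assumed) tag P.
flat = LexNode("P", [LexNode("in"), LexNode("spite"), LexNode("of")])
binary = binarize_right(flat)   # P(in, P*(spite, of))
```

The temporary flag makes it possible to remove the intermediate nodes again after parsing, so that the original n-ary tree can be restored.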

Transition-Based Model
A transition-based parser is based on three components: a transition system for mapping sentences to their representation, a model for scoring different transition sequences (derivations), and a search algorithm for finding the highest scoring transition sequence for a given input sentence. Following Nivre (2008), we define a transition system as a quadruple S = (C, T, c_s, C_t) where:
1. C is a set of configurations,
2. T is a set of transitions, each of which is a (partial) function t : C → C,
3. c_s is an initialization function that maps each input sentence x to an initial configuration c_s(x) ∈ C,
4. C_t ⊆ C is a set of terminal configurations.
A transition sequence for a sentence x is a sequence of configurations C_{0,m} = c_0, ..., c_m such that c_0 = c_s(x), c_m ∈ C_t, and for every c_i (0 ≤ i < m) there is some transition t ∈ T such that t(c_i) = c_{i+1}. Every transition sequence defines a representation for the input sentence. Training a transition-based parser means training the model for scoring transition sequences. This requires an oracle that determines what is an optimal transition sequence given an input sentence and the correct output representation (as given by a treebank). Static oracles define a single unique transition sequence for each input-output pair. Dynamic oracles allow more than one optimal transition sequence and can also score non-optimal sequences. Once a scoring model has been trained, parsing is usually performed as best-first search under this model, using greedy search or beam search.

Arc-Standard Dependency Parsing
Our starting point is the arc-standard transition system for dependency parsing first defined in Nivre (2004) and represented schematically in Figure 3. A configuration in this system consists of a triple c = (σ, β, A), where σ is a stack containing partially processed nodes, β is a buffer containing remaining input nodes, and A is a set of dependency arcs. The initialization function maps each input sentence x to an initial configuration c_s(x). The system has three transitions:
• Shift takes the first node in the buffer and pushes it onto the stack.
• Right-Arc(k) adds a dependency arc (i, k, j) to A, where j is the first and i the second element of the stack, and removes j from the stack.
• Left-Arc(k) adds a dependency arc (j, k, i) to A, where j is the first and i the second element of the stack, and removes i from the stack.
A transition sequence in the arc-standard system builds a projective dependency tree over the set of terminal nodes in V . The tree is built bottom-up by attaching dependents to their head and removing them from the stack until only the root of the tree remains on the stack.
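The three transitions can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in our experiments: the class name ArcStandard is hypothetical, preconditions are not checked, and the initial configuration (empty stack, all tokens in the buffer) and the terminal test are assumptions based on the standard formulation of the system.

```python
from typing import List, Set, Tuple

Arc = Tuple[int, str, int]  # (head, label, dependent)


class ArcStandard:
    """Arc-standard configuration (stack, buffer, arcs) over token indices 1..n."""

    def __init__(self, n_tokens: int) -> None:
        # Assumption: parsing starts with an empty stack and a full buffer.
        self.stack: List[int] = []
        self.buffer: List[int] = list(range(1, n_tokens + 1))
        self.arcs: Set[Arc] = set()

    def shift(self) -> None:
        self.stack.append(self.buffer.pop(0))

    def right_arc(self, k: str) -> None:
        j = self.stack.pop()   # first (top) element becomes the dependent
        i = self.stack[-1]     # second element is the head and stays on the stack
        self.arcs.add((i, k, j))

    def left_arc(self, k: str) -> None:
        j = self.stack.pop()   # first (top) element is the head
        i = self.stack.pop()   # second element becomes the dependent
        self.arcs.add((j, k, i))
        self.stack.append(j)

    def is_terminal(self) -> bool:
        # Assumption: done when the buffer is empty and only the root remains.
        return not self.buffer and len(self.stack) == 1


# "she ate fish" with hypothetical labels; ate (token 2) is the root.
parser = ArcStandard(3)
parser.shift()            # stack [1]
parser.shift()            # stack [1, 2]
parser.left_arc("subj")   # ate -> she, stack [2]
parser.shift()            # stack [2, 3]
parser.right_arc("obj")   # ate -> fish, stack [2]
```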

Joint Syntactic and Lexical Analysis
To perform joint syntactic and lexical analysis we need to be able to build structure in two parallel dimensions: the syntactic dimension, represented by a dependency tree, and the lexical dimension, represented by a forest of (binary) trees. The two dimensions share the token-level representation, as well as the level of fixed MWEs, but the syntactic tree and the non-fixed MWEs are independent. We extend the parser configuration to use two stacks, one for each dimension, but only one buffer. In addition, we need not only a set of dependency arcs, but also a set of lexical units. A configuration in the new system therefore consists of a quintuple c = (σ_1, σ_2, β, A, L), where σ_1 and σ_2 are stacks containing partially processed nodes (which may now be complex MWEs), β is a buffer containing remaining input nodes, A is a set of dependency arcs, and L is a set of lexical units. The transitions of the new system are as follows:
• Shift takes the first node in the buffer and pushes it onto both stacks. This guarantees that the two dimensions are synchronized at the token level.
• Right-Arc(k) adds a dependency arc (x, k, y) to A, where y is the first and x the second element of the syntactic stack (σ_1), and removes y from this stack. It does not affect the lexical stack (σ_2). (We use the variables x and y, instead of i and j, because the stack elements can now be complex lexical units as well as simple tokens.)
• Left-Arc(k) adds a dependency arc (y, k, x) to A, where y is the first and x the second element of the syntactic stack (σ_1), and removes x from this stack. Like Right-Arc(k), it does not affect the lexical stack (σ_2).
• Merge_F(t) applies in a configuration where the two top elements x and y are identical on both stacks and combines these elements into a tree t(x, y) representing a fixed MWE with part-of-speech tag t. Since it operates on both stacks, the new element will be a syntactic node as well as a lexical node.
• Merge_N(t) combines the two top elements x and y on the lexical stack (σ_2) into a tree t(x, y) representing a non-fixed MWE with part-of-speech tag t. Since it only operates on the lexical stack, the new element will not be a syntactic node.
• Complete moves the top element x on the lexical stack (σ_2) to L, making it a final lexical unit in the output representation. Note that x can be a simple token, a fixed MWE (created on both stacks), or a non-fixed MWE (created only on the lexical stack).
A transition sequence in the new system derives the set of lexical nodes and simultaneously builds a projective dependency tree over the set of syntactic nodes. By way of example, Figure 5 shows the transition sequence for the example in Figure 1.
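The two-stack system can be sketched in Python as follows. The class name JointConfig, the encoding of MWE trees as tuples (tag, left, right), the dependency labels and the part-of-speech tags are all hypothetical. The transition sequence below is one valid derivation that we constructed for the sentence of Figure 1 (the prime minister made a few good decisions, tokens 1 to 8); it is not claimed to be identical to the sequence shown in Figure 5.

```python
from typing import List, Set, Tuple, Union

Node = Union[int, tuple]       # token index, or MWE tree (tag, left, right)
Arc = Tuple[Node, str, Node]   # (head, label, dependent)


class JointConfig:
    """Configuration (sigma_1, sigma_2, buffer, A, L) of the joint system."""

    def __init__(self, n_tokens: int) -> None:
        self.syn: List[Node] = []     # sigma_1, the syntactic stack
        self.lex: List[Node] = []     # sigma_2, the lexical stack
        self.buffer: List[Node] = list(range(1, n_tokens + 1))
        self.arcs: Set[Arc] = set()   # A
        self.units: List[Node] = []   # L, kept in completion order

    def shift(self) -> None:
        node = self.buffer.pop(0)
        self.syn.append(node)
        self.lex.append(node)

    def right_arc(self, k: str) -> None:
        y = self.syn.pop()
        self.arcs.add((self.syn[-1], k, y))

    def left_arc(self, k: str) -> None:
        y = self.syn.pop()
        x = self.syn.pop()
        self.arcs.add((y, k, x))
        self.syn.append(y)

    def merge_f(self, t: str) -> None:
        if len(self.syn) < 2 or self.syn[-2:] != self.lex[-2:]:
            raise ValueError("Merge_F needs identical top elements on both stacks")
        unit = (t, self.lex[-2], self.lex[-1])
        self.syn[-2:] = [unit]   # the fixed MWE becomes a syntactic node ...
        self.lex[-2:] = [unit]   # ... and a lexical node

    def merge_n(self, t: str) -> None:
        y = self.lex.pop()
        x = self.lex.pop()
        self.lex.append((t, x, y))   # lexical stack only

    def complete(self) -> None:
        self.units.append(self.lex.pop())


cfg = JointConfig(8)
steps = [
    ("shift",), ("complete",),                                   # the
    ("shift",), ("shift",), ("merge_n", "N"), ("complete",),     # prime minister
    ("left_arc", "mod"), ("left_arc", "det"),
    ("shift",), ("left_arc", "subj"),                            # made
    ("shift",), ("shift",), ("merge_f", "D"), ("complete",),     # a few
    ("shift",), ("complete",),                                   # good
    ("shift",), ("merge_n", "V"), ("complete",),                 # made ... decisions
    ("left_arc", "mod"), ("left_arc", "det"), ("right_arc", "obj"),
]
for name, *args in steps:
    getattr(cfg, name)(*args)
```

Note that the discontiguous unit made-decisions can only be merged once the intervening units (a few, good) have been completed and removed from the lexical stack, while they remain available on the syntactic stack for attachment.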

Implicit Completion
The system presented above has one potential drawback: it needs a separate Complete transition for every lexical unit, even in the default case when a lexical unit is just a token. This makes sequences much longer and increases the inherent ambiguity. One way to deal with this problem is to make the Complete transition implicit and deterministic, so that it is not scored by the model (or predicted by a classifier in the case of deterministic parsing) but is performed as a side effect of the Right-Arc and Left-Arc transitions. Every time we apply one of these transitions, we check whether the dependent x of the new arc is part of a unit y on the lexical stack satisfying one of the following conditions: (i) x = y; (ii) x is a lexical child of y and every lexical node z in y either has a syntactic head in A or is the root of the dependency tree. If (i) or (ii) is satisfied, we move y from the lexical stack to the set L of lexical units as a side effect of the arc transition.
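The check performed after each arc transition can be sketched as follows. This is one possible reading of conditions (i) and (ii), under several assumptions that go beyond the text: lexical units are encoded as tuples (kind, tag, left, right) with kind "F" for fixed and "N" for non-fixed MWEs, "lexical child" is read as "contained anywhere in y", the nodes z are taken to be the syntactic nodes inside y, and the test for the root of the dependency tree is supplied by the caller as a hypothetical is_root function.

```python
from typing import Callable, List, Set, Tuple, Union

Node = Union[int, tuple]   # token index, or (kind, tag, left, right)
Arc = Tuple[Node, str, Node]


def syntactic_parts(unit: Node) -> List[Node]:
    """Syntactic nodes inside a lexical unit: tokens and fixed MWEs."""
    if isinstance(unit, int) or unit[0] == "F":
        return [unit]
    _, _, left, right = unit
    return syntactic_parts(left) + syntactic_parts(right)


def implicit_complete(x: Node, lex_stack: List[Node], units: List[Node],
                      arcs: Set[Arc], is_root: Callable[[Node], bool]) -> bool:
    """Call after an arc transition with dependent x; True if a unit was moved to L."""
    has_head = {dep for (_, _, dep) in arcs}
    for pos in range(len(lex_stack) - 1, -1, -1):
        y = lex_stack[pos]
        parts = syntactic_parts(y)
        if x not in parts:
            continue
        # (i) x is the unit itself; (ii) every part of y is attached or is the root.
        done = (x == y) or all(z in has_head or is_root(z) for z in parts)
        if done:
            units.append(lex_stack.pop(pos))
        return done
    return False


# prime (2) and minister (3) form a non-fixed MWE on the lexical stack.
lex_stack: List[Node] = [("N", "N", 2, 3)]
units: List[Node] = []
arcs: Set[Arc] = {(3, "mod", 2)}
never_root = lambda z: False
first = implicit_complete(2, lex_stack, units, arcs, never_root)   # minister still unattached
arcs.add((4, "subj", 3))
second = implicit_complete(3, lex_stack, units, arcs, never_root)  # now every part has a head
```

Under this reading, a unit is released only once its last component has been attached, so Merge_N must be applied before the arcs that attach the components of the unit.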

Experiments
This section provides experimental results obtained with a simple implementation of our system using a greedy search parsing algorithm and a linear model trained with an averaged perceptron with shuffled examples and a static oracle. More precisely, the static oracle is defined using the following transition priorities: Merge_F > Merge_N > Complete > Left-Arc > Right-Arc > Shift. At each step of training, the static oracle selects the valid transition that has the highest priority.
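The priority-based selection can be sketched as follows. The function name oracle_choice is hypothetical, and the set of valid transitions (those that are permitted in the current configuration and consistent with the gold representation) is assumed to be computed elsewhere.

```python
from typing import Iterable

# Transition priorities of the static oracle, highest first.
PRIORITY = ["Merge_F", "Merge_N", "Complete", "Left-Arc", "Right-Arc", "Shift"]


def oracle_choice(valid: Iterable[str]) -> str:
    """Return the highest-priority transition among the valid ones."""
    valid_set = set(valid)
    for name in PRIORITY:
        if name in valid_set:
            return name
    raise ValueError("no valid transition in this configuration")


choice = oracle_choice({"Shift", "Left-Arc", "Complete"})
```

Because the ordering is total, the oracle yields a single transition sequence for each input-output pair, as required of a static oracle.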
We evaluated the two variants of the system, namely Explicit and Implicit, with explicit and implicit completion, respectively. They were compared against the joint approach proposed in Candito and Constant (2014), which we applied to an arc-standard parser instead of a graph-based parser. The parser is trained on a treebank where MWE status and grammatical function are concatenated in arc labels. We refer to it as the Baseline. We used classical transition-based parsing features consisting of patterns combining linguistic attributes of nodes on the stacks and the buffer, as well as processed subtrees and transition history. Note that the joint systems do not contain features sharing elements of both stacks. Preliminary tuning experiments did not show gains when using such features.
We also compared these systems against weaker ones, obtained by disabling some transitions and using one stack only. Two systems, namely Syntactic-baseline and Syntactic, predict only the syntactic nodes and the dependency structure, using respectively a baseline parser and our system with the lexical stack and the Merge_N and Complete transitions disabled. The latter is an implementation of the proposal in Nivre (2014). Two systems are devoted only to the lexical layer: Lexical recognizes only the lexical units (only the lexical stack and the Merge_N and Complete transitions are activated), and Fixed identifies only the fixed expressions. We also implemented pipeline systems where: (i) fixed MWEs are identified by applying only the Fixed system; (ii) elements of predicted MWEs are merged into single tokens; (iii) the retokenized text is parsed using the Baseline or Implicit systems trained on a dataset where fixed MWEs consist of single tokens.
We carried out our experiments on two different datasets annotating both the syntactic structure and the MWEs: the French Treebank [FTB] (Abeillé et al., 2003) and the STREUSLE corpus (Schneider et al., 2014b) combined with the English Web Treebank [EWT] (Bies et al., 2012). They are commonly used for evaluating the most recent MWE-aware dependency parsers and supervised MWE identification systems. Concerning the FTB, we used the dependency version developed in Candito and Constant (2014), derived from the SPMRL shared task version. Fixed and non-fixed MWEs are distinguished, but are limited to contiguous ones only. The STREUSLE corpus corresponds to a subpart of the EWT. It consists of reviews and is comprehensively annotated with contiguous and discontiguous MWEs. Fixed and non-fixed expressions are not distinguished, though the distinction between non-compositional and collocational MWEs is made. This implies that the Merge_F transition is not used on this dataset. In practice, we used the LTH converter (Johansson and Nugues, 2007) to obtain the dependency version of the EWT constituent version. We also used the predicted linguistic attributes used in Constant and Le Roux (2015) and in Constant et al. (2016). Both datasets include predicted POS tags, lemmas and morphology, as well as features computed from compound dictionary lookup. Neither dataset is entirely satisfactory with respect to our model, but they allow us to evaluate the feasibility of the approach. Statistics on the two datasets are provided in Table 1.
Results are provided in Table 2 for French and in Table 3 for English. In order to evaluate the syntactic layer, we used the classical UAS and LAS metrics. Before evaluation, merged units were automatically decomposed into flat subtrees using specific arcs, so that all systems can be evaluated and compared at the token level. MWE identification is evaluated with the F-score of the MWE segmentation, namely MWE for all MWEs and FMWE for fixed MWEs only. An MWE segment corresponds to the set of its component positions in the input token sequence.
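The metrics can be sketched as follows. The function names and input formats are hypothetical, and details of the actual evaluation (for example the treatment of punctuation) are not specified here, so this is only an illustration of the definitions above: attachment scores are computed per token, and an MWE counts as correct only if its full set of token positions matches a gold segment.

```python
from typing import Dict, FrozenSet, Set, Tuple


def attachment_scores(gold: Dict[int, Tuple[int, str]],
                      pred: Dict[int, Tuple[int, str]]) -> Tuple[float, float]:
    """Return (UAS, LAS) given token -> (head, label) maps over the same tokens."""
    n = len(gold)
    uas = sum(pred[t][0] == gold[t][0] for t in gold) / n
    las = sum(pred[t] == gold[t] for t in gold) / n
    return uas, las


def mwe_f_score(gold: Set[FrozenSet[int]], pred: Set[FrozenSet[int]]) -> float:
    """F-score over MWE segments, each a set of token positions."""
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy example: the discontiguous unit {4, 8} is missed and {7, 8} is wrongly predicted.
gold_mwes = {frozenset({2, 3}), frozenset({5, 6}), frozenset({4, 8})}
pred_mwes = {frozenset({2, 3}), frozenset({5, 6}), frozenset({7, 8})}
f_score = mwe_f_score(gold_mwes, pred_mwes)

gold_deps = {1: (2, "det"), 2: (0, "root")}
pred_deps = {1: (2, "mod"), 2: (0, "root")}
uas, las = attachment_scores(gold_deps, pred_deps)
```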
First, results show that our joint system consistently and significantly outperforms the baseline in terms of MWE identification on both datasets. The merge transitions play a key role. In terms of syntax, the Explicit system does not have any positive impact (on par or degraded scores), whereas the Implicit system allows us to obtain slightly better results on French and a significant improvement on English. The strong performance on English might be explained by the fact that the English dataset contains a non-negligible set of discontiguous MWEs, which complicates the prediction of explicit Complete transitions.
When compared with weaker systems, we can see that the addition of the lexical layer helps improve the prediction of the syntactic layer, which confirms results on symbolic parsing (Wehrli, 2014). The syntactic layer does not seem to impact the lexical layer prediction: we observe comparable results. This might be due to the fact that syntax is helpful for long-distance discontiguity only, which does not appear in our datasets (the English dataset contains MWEs with small gaps). Another explanation could be that syntactic parsing accuracy is rather low due to the use of a simple greedy algorithm. Developing more advanced transition-based parsing methods such as beam search may help improve both syntactic parsing accuracy and MWE identification. When comparing joint systems with pipeline ones, we can see that pre-identifying fixed MWEs seems to help MWE identification whereas syntactic parsing accuracy tends to be slightly lower. One hypothesis could be that Merge_F transitions may confuse the prediction of Merge_N transitions.
When compared with existing state-of-the-art systems, we can see that the proposed systems achieve MWE identification scores that are comparable with the pipeline and joint approaches used in Candito and Constant (2014). These scores are obtained on the SPMRL shared task version, though they are not entirely comparable with our system as they do not distinguish fixed from non-fixed MWEs.

Related work
The present paper proposes a new representation for lexical and syntactic analysis in the framework of syntactic dependency parsing. Most existing MWE-aware dependency treebanks represent an MWE as a flat subtree of the syntactic tree with special labels, as in the UD treebanks (Nivre et al., 2016), in the SPMRL shared task, or in other individual treebanks (Nivre and Nilsson, 2004; Eryigit et al., 2011). Such a representation enables MWE discontinuity, but the internal syntactic structure is not annotated. Candito and Constant (2014) proposed a representation where the irregular and regular MWEs are distinguished: irregular MWEs are integrated in the syntactic tree as above; regular MWEs are annotated in their component attributes while their internal structure is annotated in the syntactic tree. The Prague Dependency Treebank (Bejček et al., 2013) has several interconnected annotation layers: morphological (m-layer), syntactic (a-layer) and semantic (t-layer). All these layers are trees that are interconnected. MWEs are annotated on the t-layer and are linked to an MWE lexicon (Bejček and Straňák, 2010). Constant and Le Roux (2015) proposed a dependency representation of lexical segmentation allowing annotations of deeper phenomena like MWE nesting. More details on MWE-aware treebanks (including constituent ones) can be found in Rosén et al. (2015).
Statistical MWE-aware dependency parsing has received growing interest since Nivre and Nilsson (2004). The main challenge resides in finding the best orchestration strategy. Past research has explored either pipeline or joint approaches. Pipeline strategies consist in positioning the MWE recognition either before or after the parser itself, as in Nivre and Nilsson (2004), Eryigit et al. (2011), Constant et al. (2013), and Kong et al. (2014) for pre-identification and as in Vincze et al. (2013a) for post-identification. Joint strategies have mainly consisted in using off-the-shelf parsers and integrating MWE annotation in the syntactic structure, so that MWE identification is blind for the parser (Nivre and Nilsson, 2004; Eryigit et al., 2011; Vincze et al., 2013b; Candito and Constant, 2014; Nasr et al., 2015).
Our system includes a special treatment of MWEs using specific transitions in a classical transition-based system, in line with the proposal of Nivre (2014). Constant et al. (2016) also proposed a two-dimensional representation in the form of dependency trees anchored by the same words. The annotation of fixed MWEs is redundant on both dimensions, while they are shared in our representation. They propose, along with this representation, an adaptation of an easy-first parser able to predict both dimensions. Contrary to our system, there are no special mechanisms for treating MWEs.
The use of multiple stacks to capture partly independent dimensions is inspired by the multiplanar dependency parser of Gómez-Rodríguez and Nivre (2013). Our parsing strategy for (hierarchical) MWEs is very similar to the deterministic constituency parsing method of Crabbé (2014).

Conclusion
This paper proposes a transition-based system that extends a classical arc-standard parser to handle both lexical and syntactic analysis. It is based on a new representation having two linguistic layers sharing lexical nodes. Experimental results show that MWE identification is greatly improved with respect to the mainstream joint approach. This can be a useful starting point for several lines of research: implementing more advanced transition-based techniques (beam search, dynamic oracles, deep learning); extending other classical transition systems like arc-eager and hybrid as well as handling non-projectivity.