Peking at MRP 2019: Factorization- and Composition-Based Parsing for Elementary Dependency Structures

We design, implement and evaluate two semantic parsers, which represent factorization- and composition-based approaches respectively, for Elementary Dependency Structures (EDS) at the CoNLL 2019 Shared Task on Cross-Framework Meaning Representation Parsing. The detailed evaluation of the two parsers yields a new insight into parsing into linguistically enriched meaning representations: current neural EDS parsers are able to reach an accuracy at the inter-annotator agreement level in the same-epoch-and-domain setup.


Introduction
For the CoNLL 2019 Shared Task on Cross-Framework Meaning Representation Parsing (MRP), we concentrate on Elementary Dependency Structures (EDS; Oepen and Lønning, 2006), the graph-based meaning representations derived from English Resource Semantics (ERS; Flickinger et al., 2014b; see http://moin.delph-in.net/ErgSemantics), the richly detailed semantic annotation associated with the English Resource Grammar (ERG; Flickinger, 2000), a domain-independent, linguistically deep and broad-coverage HPSG grammar. The full ERS and EDS annotations include not only basic predicate-argument structures, but also information about quantifiers and scopal operators, e.g. negation, as well as analyses of linguistically complex phenomena such as time and date expressions, conditionals, and comparatives.
Following earlier practice, we divide existing work on string-to-semantic-graph parsing into four types, namely factorization-, composition-, transition- and translation-based approaches. Our previous studies (Chen et al., 2018b; Cao et al., 2019), as well as investigations on other graph banks, indicate that the factorization- and composition-based approaches currently obtain superior accuracies. In this paper, we fine-tune our factorization- and composition-based parsers and present a detailed evaluation on the MRP data.
Our factorization-based system obtains an overall accuracy of 94.47 in terms of the official MRP evaluation metrics, and outperforms the other submitted systems by a large margin with respect to the prediction of labels, properties, anchors and edges. We highlight a new insight: current neural parsers are able to reach an accuracy at the inter-annotator agreement level (Bender et al., 2015) for the linguistically enriched EDS representations in the same-epoch-and-domain setup. Given the information depth of ERS, we think many NLP applications may benefit from a revisit of classic discrete semantic representations. The composition-based system reaches a score of 91.84. We do not think the performance gap suggests a weakness of the latter approach, but take it as a reflection of the fact that a composition-based parser involves more individual modules that have not been fully optimized yet.

Parsing to Semantic Graphs
In this section, we present a summary of factorization-, composition-, transition- and translation-based parsing approaches.
Factorization-Based Approach. This type of approach is inspired by the successful design of graph-based dependency tree parsing (McDonald, 2006). A factorization-based parser explicitly models the target semantic structures by defining a score function that is able to evaluate the goodness of any candidate graph. Usually, the set of possible graphs that can be assigned to an input sentence is extremely large. Therefore, a parser also needs to know how to find the highest-scoring graphs from a large set.
To the best of our knowledge, McDonald and Pereira (2006) present the first graph-based syntactic dependency parsing algorithm that removes the tree-shape constraint. In the scenario of semantic dependency parsing, Kuhlmann and Jonsson (2015) generalize the graph-based framework (aka Maximum Spanning Tree parsing) and propose Maximum Subgraph parsing. Assume a directed graph $G = (V, E)$ that corresponds to an input sentence $x = w_0, \ldots, w_{n-1}$, and a score function SCOREG. String-to-graph parsing is formulated as the problem of searching for a subset $E' \subseteq E$ with the maximum score. Formally, we have the following optimization problem:

\[ G' = (V, E') = \arg\max_{G^* = (V, E^*),\; E^* \subseteq E} \mathrm{SCOREG}(G^*) \tag{1} \]

For semantic dependency parsing, $V$ is the set of surface tokens, and $G$ is, usually, the corresponding complete graph.
It is relatively straightforward to extend Kuhlmann and Jonsson's framework to cover more types of semantic graphs as follows,

\[ G' = \arg\max_{G^* \in \mathrm{GEN}(x)} \mathrm{SCOREG}(G^*) \tag{2} \]

where GEN(x) denotes all plausible semantic graphs that can be assigned to $x$. To make the above combinatorial optimization problems solvable, people usually employ a factorization strategy, i.e. defining a decomposable score function that enumerates all sub-parts of a candidate graph. This view matches a classic solution to structured prediction which captures elemental and structural information through partwise factorization. For example, the following formula defines a first-order factorization model for semantic dependency parsing,

\[ \mathrm{SCOREG}(G^* = (V, E^*)) = \sum_{(i, j) \in E^*} \mathrm{SCOREEDGE}(i, j) \tag{3} \]

where SCOREEDGE(i, j) scores the candidate edge from $i$ to $j$. The essential computational module in this architecture is the score function, which is usually induced from a moderate-sized set of annotated sentences. Various deep learning models, together with vector-based encodings induced from large-scale raw texts, have significantly advanced the shaping of such score functions (Dozat and Manning, 2018). We will detail our factorization-based parser in §3.
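The first-order case can be illustrated with a minimal sketch. It assumes that no structural constraint is placed on the output graph, so that the maximum subgraph is simply the set of edges with positive scores; the edge scores below are hypothetical values rather than outputs of our model.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[int, int]  # (head index, dependent index)


def score_graph(edges: Set[Edge], score_edge: Dict[Edge, float]) -> float:
    """First-order factorization: a graph's score is the sum of its edge scores."""
    return sum(score_edge[e] for e in edges)


def decode_first_order(score_edge: Dict[Edge, float]) -> Set[Edge]:
    """With no structural constraint, the best subgraph keeps exactly
    the edges whose individual score is positive."""
    return {e for e, s in score_edge.items() if s > 0}


# Hypothetical scores over some candidate edges of a 3-token sentence.
scores = {(0, 1): 2.5, (1, 0): -1.0, (1, 2): 0.7, (2, 1): -0.2, (0, 2): -3.0}
best = decode_first_order(scores)
print(sorted(best), score_graph(best, scores))
```

Under stronger constraints (e.g. noncrossing or tree-shaped graphs) this thresholding no longer suffices and a dedicated search algorithm is needed.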
Composition-Based Approach. Compositionality is a cornerstone for many formal semantic theories. Following a principle of compositionality, a semantic graph can be viewed as the result of a derivation process, in which a set of lexical and syntactico-semantic rules are iteratively applied and evaluated. On the linguistic side, such rules extensively encode explicit knowledge about natural languages. On the computational side, such rules must be governed by a well-defined grammar formalism. In particular, to manipulate graph construction in a principled way, Hyperedge Replacement Grammar (HRG; Drewes et al., 1997) and AM Algebra (Groschwitz et al., 2017) have been applied to build semantic parsers for various graph banks (Chen et al., 2018b;Groschwitz et al., 2018;Lindemann et al., 2019).
A composition-based parser explicitly models derivations that yield semantic graphs by defining a score function SCORED. Assume a derivation $D = r_1, r_2, \ldots, r_m$ is a sequence of rules. Formally, we have the following optimization problem:

\[ G' = \arg\max_{G^* \in \mathrm{GEN}(x)} \sum_{D \text{ yields } G^*} \mathrm{SCORED}(D) \tag{4} \]

To make the above problem solvable, people usually employ a decomposition strategy, i.e. summing over local scores that correspond to individual derivation steps:

\[ \mathrm{SCORED}(D) = \sum_{k=1}^{m} \mathrm{SCORERULE}(r_k) \]

where SCORERULE($r_k$) denotes the local score of the $k$-th derivation step. Again, this matches many structured prediction models. Deep learning has been shown to be very powerful at associating scores with individual rule applications, and thus at providing strong models for evaluating a derivation. The general form of (4) is a very complex combinatorial optimization problem. The approximating strategy of searching for the best derivation instead has been shown to be practical yet effective for ERS parsing (Chen et al., 2018b). Formally, we solve the problem below,

\[ D' = \arg\max_{D^* \in \mathrm{GENDERIV}(x)} \mathrm{SCORED}(D^*) \]

where GENDERIV(x) denotes all sound derivations that yield $x$. Then we get a target graph by evaluating $D'$. We will detail our composition-based parser in §4.
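The decomposition over derivation steps can be sketched as follows. This is an illustration only: the rule names and scores are hypothetical, and the candidate derivations are enumerated explicitly, whereas a practical parser has to search the derivation space (e.g. with beam search).

```python
from typing import Dict, List, Sequence


def score_derivation(derivation: Sequence[str], rule_score: Dict[str, float]) -> float:
    """Score a derivation as the sum of the local scores of its rule applications."""
    return sum(rule_score[r] for r in derivation)


def best_derivation(candidates: List[List[str]], rule_score: Dict[str, float]) -> List[str]:
    """Pick the highest-scoring derivation from an explicit candidate list."""
    return max(candidates, key=lambda d: score_derivation(d, rule_score))


# Hypothetical rule identifiers and local scores.
rule_score = {"r_subj": 1.5, "r_obj": 0.5, "r_ctrl": 2.0, "r_mod": -1.0}
candidates = [
    ["r_subj", "r_obj"],   # total 2.0
    ["r_subj", "r_ctrl"],  # total 3.5
    ["r_mod", "r_ctrl"],   # total 1.0
]
chosen = best_derivation(candidates, rule_score)
print(chosen, score_derivation(chosen, rule_score))
```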
Transition-Based Approach. This type of approach is inspired by the successful design of transition-based dependency tree parsing (Yamada and Matsumoto, 2003; Nivre, 2008). To the best of our knowledge, Sagae and Tsujii (2008) were the first to apply this type of approach to predict predicate-argument structures grounded in HPSG (Miyao et al., 2005). A number of new transition systems and disambiguation models have been discussed for parsing into different graphs (Wang et al., 2015; Zhang et al., 2016; Buys and Blunsom, 2017).
Translation-Based Approach. This type of approach is inspired by the success of sequence-to-sequence (seq2seq for short) models, which are at the heart of modern Neural Machine Translation. A translation-based parser treats a family of semantic graphs as a foreign language, in that a semantic graph is encoded and then viewed as a string from another language (Peng et al., 2017b; Konstas et al., 2017; Buys and Blunsom, 2017). Such a parser therefore needs to know how to linearize a graph. Data augmentation has been shown to be very helpful (Konstas et al., 2017), partially reflecting the data-hungry nature of seq2seq models.
Simple application of seq2seq models is not successful. However, some basic models can be integrated with other types of approaches: previous work has combined the translation-based approach with the transition-based approach, and with the factorization-based approach.

Elements in EDS Graphs
The key idea underlying the factorization-based approach is to explicitly model what is expected as elements in target structures. Therefore, before introducing the technical details of our parser, we roughly sketch the key elements in EDS graphs. Refer to Flickinger et al. (2014a) for more information about the design of ERS.
We distinguish three kinds of elements: (1) labeled nodes, (2) node properties and (3) labeled edges. Nodes are sometimes called concepts, where their labels reflect conceptual meaning. The node labels can be divided into two classes: (1) surface concepts that are exclusively introduced by lexical entries, whose orthography is the source form of a core part of a concept symbol, and (2) abstract concepts that are used to represent the semantic contribution of grammatical constructions or more specialized lexical entries. Take the output structure in Figure 1 for example: _go_v_1 and _want_v_1 indicate surface concepts, while proper_q and named indicate abstract concepts.
To avoid proliferation of concepts, some concepts are parameterized. The parameters can be viewed as properties of nodes. For example, named("Tom") is a named concept with a CARG property of "Tom". For every EDS graph, there exists a top concept, which relates to the top handle in its original ERS annotation. In Figure 1, for example, _want_v_1 is the top. In this paper, we treat whether a node is top as a property whose value can be either true or false.
Edges are called relations. An edge links exactly two nodes and mainly reflects predicate-argument relations. Edge labels are drawn from a small, fixed inventory of role labels (e.g. ARG1, ARG2, . . . ).
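The three kinds of elements can be held in a small data structure, sketched below. The class and field names are hypothetical, and the node and edge inventory for "Tom wants to go." (including the BV edge from the quantifier) is an illustrative assumption about the EDS analysis rather than a copy of the gold annotation.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Node:
    label: str                  # concept label, e.g. "_want_v_1" or "named"
    carg: Optional[str] = None  # CARG property of parameterized concepts
    top: bool = False           # whether this node is the top concept


@dataclass
class EDSGraph:
    nodes: Dict[int, Node] = field(default_factory=dict)
    edges: Dict[Tuple[int, int], str] = field(default_factory=dict)  # (source, target) -> role

    def top_node(self) -> int:
        """Return the index of the unique node marked as top."""
        return next(i for i, n in self.nodes.items() if n.top)


# Illustrative graph for "Tom wants to go." (assumed analysis).
graph = EDSGraph(
    nodes={
        0: Node("proper_q"),
        1: Node("named", carg="Tom"),
        2: Node("_want_v_1", top=True),
        3: Node("_go_v_1"),
    },
    edges={(0, 1): "BV", (2, 1): "ARG1", (2, 3): "ARG2", (3, 1): "ARG1"},
)
print(graph.nodes[graph.top_node()].label)
```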

The Architecture
We employ a four-stage pipeline to incrementally construct an EDS graph. Figure 1 illustrates the four steps with a simple sentence. The core idea is to identify concepts from surface strings, and then detect the relations between them.

Tokenization
Automatic tokenization for English has been widely viewed as a solved problem for quite a long time. Taking the risk of oversimplifying the situation, tokenization does not have a significant impact on downstream NLP tasks, e.g. POS tagging and syntactic parsing. When we consider semantic parsing, however, it is still a controversial issue which unit is the most basic one that triggers conceptual meaning and semantic construction. Therefore, we need to rethink the tokenization problem in which tokens may not be fully consistent with their traditional definitions. Moreover, when we consider other languages like German or Chinese, tokenization brings other issues.
In this paper, we take as the most basic word-level units the strings that are separated by whitespace and punctuation markers. (When we introduce our neural parsing models, we still use the term "word" for these units, to relate them to word embeddings.) In an EDS graph, a surface concept may be aligned with a sub-unit, a single unit or multiple units. Table 2 shows some examples. During concept identification (§3.4), there should exist a surjection from surface concepts to the input tokens. Therefore, tokenization is important for obtaining a reasonable alignment between concepts and input tokens. We adopt the character-based word segmentation approach for Chinese (Sun, 2010) to find suitable tokens. We first split an input sentence into a sequence of basic elements with simply defined regular expressions. The core part of our tokenizer is a sequence labeling model over this sequence. In particular, each element is assigned a positional label that indicates token boundaries. The label is either B, which means the unit is at the beginning of a target token, or I, which means the unit is inside a token. For sequential classification, we utilize a multi-layer BiLSTM network. Tokens can be retrieved from the predicted labels. See Figure 1 for an example. Note that Dridan and Oepen (2012) showed that regular expressions are quite powerful for dealing with the tokenization problem for different styles.

Input string    Assets of these short-term funds surged more than $5.5 billion in September.
RegEx match     Assets of these short -term funds surged more than $ 5 . 5 billion in September .
Our tokens      Assets of these short -term funds surged more than $ 5 . 5 billion in September .
PTB tokens      Assets of these short -term funds surged more than $ 5 . 5 billion in September .

Table 2
Concept           Type          String
_more+than_p      multi-unit    more than
_asset_n_1        single unit   Assets
_short_a_of       sub-unit      short-term
_term_n_of        sub-unit      short-term
_mis-_a_error     sub-unit      misinterpreted
_interpret_v_1    sub-unit      misinterpreted
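The two steps of the tokenizer, splitting into basic elements and merging them according to B/I labels, can be sketched as below. The regular expression is a simplified stand-in for the actual rules, and the labels are written by hand here, whereas in our system they are predicted by the BiLSTM tagger; the sketch also assumes that the elements inside one token are adjacent in the input string.

```python
import re
from typing import List


def split_elements(text: str) -> List[str]:
    """Split into basic elements: runs of word characters or single punctuation marks.
    (Assumed pattern for illustration only.)"""
    return re.findall(r"\w+|[^\w\s]", text)


def labels_to_tokens(elements: List[str], labels: List[str]) -> List[str]:
    """Merge basic elements into tokens: B starts a new token, I extends the current one."""
    tokens: List[str] = []
    for element, label in zip(elements, labels):
        if label == "B" or not tokens:
            tokens.append(element)
        else:
            tokens[-1] += element
    return tokens


elements = split_elements("more than $5.5 billion")
labels = ["B", "B", "B", "B", "I", "I", "B"]  # hand-written, stands in for tagger output
tokens = labels_to_tokens(elements, labels)
print(elements)
print(tokens)
```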

Concept Identification
Surface concepts (e.g. the quantifier _some_q) and some of the abstract concepts (e.g. the named entity concept named) have a more transparent connection to surface forms and are relatively easier to identify. We call such concepts lexicalized concepts, which include, but are not limited to, all surface concepts. We cast the identification of lexicalized concepts as a token-based tagging problem. A lexicalized concept usually includes lemma information in its label. For example, _boy_n_1 consists of a lemma (boy) and a type, denoted as *_n_1. As lemmas are much easier to analyze, our concept identifier targets the type part only.
Some of the remaining abstract concepts are triggered by phrasal constructions. For example, compound is associated with the combination of multiple words. In this case, a concept is originally aligned to a sequence of contiguous words. Considering that concepts of this type make up only a small portion, we propose to handle them in a word-level tagger. To this end, we re-align them to specific tokens with a small set of heuristic rules. For example, compound is re-aligned to the first word of a compound. Re-aligning these concepts means discarding their original anchors. To fully fit the MRP goals, we treat anchors as properties of concepts, and recover them by predicting the start/end boundaries with a classification model, as described in §3.6.
We employ a neural sequence labeling model to predict concepts. A multi-layer BiLSTM is utilized to encode tokens, and two softmax layers predict concept-related labels: one for lexicalized concepts and the other for the rest. We also use widely used contextualized word representation models, including ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018). Figure 2 shows the neural network for concept identification.
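The split between lemma and type can be made concrete with a small sketch. It assumes the underscore-delimited predicate naming of the ERG, in which a surface predicate starts with an underscore followed by the lemma; the function names are hypothetical.

```python
from typing import Optional, Tuple


def split_label(label: str) -> Tuple[Optional[str], str]:
    """Split a concept label into (lemma, type).
    Labels without a leading underscore are treated as carrying no lemma."""
    if not label.startswith("_"):
        return None, label
    lemma, _, rest = label[1:].partition("_")
    return lemma, "*_" + rest


def join_label(lemma: str, concept_type: str) -> str:
    """Rebuild a full label from a lemma and a predicted type such as '*_n_1'."""
    return "_" + lemma + concept_type[1:]


print(split_label("_boy_n_1"))
print(join_label("boy", "*_n_1"))
```

With this split, the tagger only has to choose among the comparatively few types, and the lemma is filled in afterwards.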

Relation Detection
After finding a set of concepts, the next step is to link them together. Each semantic dependency is treated independently. We use integers as indices to refer to concept nodes. For any two nodes i and j, we assign a score SCOREEDGE(i, j) to the possible arc i → j. An arc is included in the final graph if and only if its score is greater than 0. We use a first-order model as described in Eq. (3). Figure 2 briefly summarizes the neural network for relation detection.
Following Dozat and Manning (2016, 2018), we use a deep biaffine attention to evaluate a candidate edge:

\[ \mathrm{SCOREEDGE}(i, j) = \mathrm{BIAFFINE}(c_i, c_j) \]

where $c_i$/$c_j$ is the vector associated with $i$/$j$. We consider two information sources to calculate $c$: a textual part $r_{c2w(i)}$ and a conceptual part $n_i$, as follows:

\[ c_i = r_{c2w(i)} \oplus n_i \]

Due to our concept identification method, we have a function "c2w" that takes as input the index of a node and returns as output the index of its anchored word. $r_{c2w(i)}$ is the contextual vector of the word aligned to $i$, which is calculated by the word embedding layer and the encoder layers. $n_i$ is the randomly initialized embedding of $i$'s concept type, e.g. *_v_1. We also use the deep biaffine attention function to calculate each edge's scores for all labels, and select the label with the maximum score. For training, we use a margin-based approach to compute the loss from the gold graph $G^*$ and the best predicted graph $\hat{G}$ according to the current model parameters. We define the loss term as:

\[ \max\left(0,\; \Delta(G^*, \hat{G}) - \mathrm{SCOREG}(G^*) + \mathrm{SCOREG}(\hat{G})\right) \]

The margin objective $\Delta$ measures the similarity between $G^*$ and $\hat{G}$. Following Peng et al. (2017a), we define $\Delta$ as weighted Hamming to trade off between precision and recall.

Figure 2: The network architecture for our concept identification and relation detection models which share the same architecture in word embedding and contextual encoder layers but with the same sets of parameters. A softmax layer is used for concept identification. To determine whether the dependency pron ← _go_v_1 exists, i.e. unlabeled dependency parsing, the corresponding embeddings $c_1$ and $c_4$, which are the concatenation of textual embeddings (in red) and conceptual embeddings (in yellow), are biaffinely transformed into a score.
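A minimal sketch of the two ingredients above, in plain Python. The biaffine form used here (a bilinear term plus a linear term over the concatenated vectors plus a bias) is the commonly used one and is an assumption about the exact parameterization; the weighted Hamming cost counts false positive and false negative edges with separate, hypothetical weights. All vectors and scores are toy values.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int]


def biaffine(ci: List[float], cj: List[float],
             U: List[List[float]], w: List[float], b: float) -> float:
    """Assumed form: ci^T U cj + w . (ci concatenated with cj) + b."""
    bilinear = sum(ci[p] * U[p][q] * cj[q]
                   for p in range(len(ci)) for q in range(len(cj)))
    linear = sum(wk * xk for wk, xk in zip(w, ci + cj))
    return bilinear + linear + b


def weighted_hamming(gold: Set[Edge], pred: Set[Edge],
                     fp_cost: float = 1.0, fn_cost: float = 1.0) -> float:
    """Cost of a prediction: weighted counts of spurious and missing edges."""
    return fp_cost * len(pred - gold) + fn_cost * len(gold - pred)


def margin_loss(gold: Set[Edge], pred: Set[Edge], score_edge: Dict[Edge, float],
                fp_cost: float = 1.0, fn_cost: float = 1.0) -> float:
    """Structured hinge loss under first-order (edge-sum) graph scores."""
    def total(graph: Set[Edge]) -> float:
        return sum(score_edge[e] for e in graph)
    delta = weighted_hamming(gold, pred, fp_cost, fn_cost)
    return max(0.0, delta - total(gold) + total(pred))


edge_score = biaffine([1.0, 0.0], [0.0, 1.0],
                      U=[[0.0, 2.0], [0.0, 0.0]], w=[0.5, 0.0, 0.0, -0.5], b=0.1)
loss = margin_loss(gold={(0, 1)}, pred={(0, 1), (1, 2)},
                   score_edge={(0, 1): 2.0, (1, 2): 0.5})
print(edge_score, loss)
```

Raising `fn_cost` relative to `fp_cost` penalizes missing edges more heavily and thus favors recall, and vice versa.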

Property Prediction
The final stage is to predict properties for each concept that is generated in the previous stages. For the EDS representation at CoNLL2019, we consider three types of properties and apply different strategies.
Anchors (spans). String anchors are treated as properties of concepts. For a given concept, a classification model is utilized to select two tokens among all input tokens as the start and end boundaries of the concept, respectively. We use exactly the same neural architecture as in §3.5 to encode input tokens. See Figure 3 for a visualized illustration. The score of token $w_j$ being the start/end boundary of node $i$ is computed with PROJ(·), a feed-forward network with LeakyReLU activation. The anchors provided by the training dataset are all character-based, so a transformation is required before training this model. In the same manner, after retrieving the start/end word of a concept, we need to convert word-based anchors back to character-based anchors. A margin-based loss is used again when training this model, and the total loss is the sum of the losses for both boundaries.
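The conversion between character-based and word-based anchors can be sketched as follows. The helper names are hypothetical, character spans are taken to be half-open intervals, and the sketch assumes that every token occurs verbatim in the input string.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open character interval [start, end)


def token_char_spans(text: str, tokens: List[str]) -> List[Span]:
    """Locate each token in the text, scanning left to right."""
    spans, cursor = [], 0
    for tok in tokens:
        start = text.index(tok, cursor)
        cursor = start + len(tok)
        spans.append((start, cursor))
    return spans


def words_to_chars(spans: List[Span], first: int, last: int) -> Span:
    """Word-based anchor (first..last token, inclusive) -> character-based anchor."""
    return spans[first][0], spans[last][1]


def chars_to_words(spans: List[Span], start: int, end: int) -> Tuple[int, int]:
    """Character-based anchor -> indices of the first and last overlapping tokens."""
    first = next(i for i, (_, e) in enumerate(spans) if e > start)
    last = max(i for i, (s, _) in enumerate(spans) if s < end)
    return first, last


text = "Tom wants to go."
spans = token_char_spans(text, ["Tom", "wants", "to", "go", "."])
print(words_to_chars(spans, 1, 1), chars_to_words(spans, 13, 15))
```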
The CARG property. Since the main function of the CARG attribute is to reduce the size of predicate names by parameterizing them with regularized surface strings, a rule-based system could be effective to predict the CARG information.
Firstly, we decide whether a concept has the CARG property according to its label. For example, named, card and ord need CARGs, but _the_q does not.
Secondly, we use a dictionary which is extracted automatically from the training dataset. Entries of the dictionary are of the form ⟨label, string, CARG⟩. For example, a concept named whose anchoring string is D.C. will be mapped to WashingtonDC. Based on a close observation of the data, we introduce several heuristic rules for the case where the dictionary has no applicable entry for a concept. For example, one widely applicable rule is to use 1 as the CARG value for concepts labeled card and aligned to a float number which is less than 1.
Finally, if no rule is available, we remove punctuation markers at left or right boundaries of anchoring strings and use the remaining part.
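The three steps can be sketched as a small rule cascade. Only the rules and the dictionary entry mentioned in the text are included; the set of labels that take a CARG is limited to the three examples above, and the function name is hypothetical.

```python
import string
from typing import Dict, Optional, Tuple

NEEDS_CARG = {"named", "card", "ord"}  # examples from the text; the real inventory is larger


def predict_carg(label: str, anchor: str,
                 dictionary: Dict[Tuple[str, str], str]) -> Optional[str]:
    """Rule-based CARG prediction: label check, dictionary lookup, heuristics, fallback."""
    if label not in NEEDS_CARG:
        return None
    if (label, anchor) in dictionary:
        return dictionary[(label, anchor)]
    if label == "card":
        try:
            if float(anchor) < 1:
                return "1"  # heuristic from the text for float numbers below 1
        except ValueError:
            pass  # anchor is not a plain number; fall through to the default
    # Fallback: strip punctuation at the left and right boundaries.
    return anchor.strip(string.punctuation)


carg_dict = {("named", "D.C."): "WashingtonDC"}
print(predict_carg("named", "D.C.", carg_dict))
print(predict_carg("card", "0.5", carg_dict))
print(predict_carg("named", "Tom,", carg_dict))
```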
Top concept. We cast the prediction of top as a binary classification problem over all nodes in a final graph. This strategy matches a recent research interest in graph neural networks (Li et al., 2015; Veličković et al., 2017; Defferrard et al., 2016; Chen et al., 2018a), one goal of which is to associate vectors with graph nodes. Such vectors can be more easily integrated into neural networks for various purposes. We employ a graph-based LSTM to encode an EDS graph and a multi-layer feed-forward network to determine whether a node is top. As in §3.5, a margin-based approach is used to compute the loss term.

The Composition-Based Parser
Our composition-based parser is based on our previous work (Chen et al., 2018b). The core engine is a graph rewriting system that explicitly explores the syntactico-semantic recursive derivations that are governed by a synchronous HRG (SHRG). See Figure 4 for an example. Our parser constructs EDS graphs by explicitly modeling such derivations. In particular, it utilizes a constituent parser to build a syntactic derivation, and then selects semantic HRG rules associated to syntactic CFG rules to generate a graph. When multiple rules are applicable for a single phrase, a neural network is used to rank them.
One main difference between our submitted parser and the parser introduced in Chen et al. (2018b) is that the syntactic parsing model is a reimplementation of Kitaev and Klein (2018). It utilizes transformer layers to capture words' contextual information, denoted as $r_i$. After encoding an input sentence, a multi-layer perceptron (MLP) is employed to get span scores. The score of span $(i, j)$ with label $L$ is $\mathrm{MLP}(s_{i,j})[L]$, where the span embedding $s_{i,j}$ is derived from the contextual vectors of the two endpoints, $r_i$ and $r_{j-1}$, and the operator [] denotes index selection. We perform CKY decoding to retrieve the highest-scoring constituent tree that agrees with the syntactic CFG grammar. When a phrase structure tree is available, semantic interpretation can be regarded as translating this tree into a derivation of graph construction. As multiple candidate subgraphs are available at each node, a beam search strategy is used to balance search complexity and quality.

Figure 4: An SHRG-based syntactico-semantic derivation. The derivation can be viewed as a syntactic tree enriched with semantic interpretation rules that are defined by an HRG. Each phrase in the syntactic tree is also assigned a graph which corresponds to a subpart of the final semantic graph. Moreover, some particular nodes (filled nodes) in a sub-graph are marked as communication channels to other meaning parts in the same sentence. In HRG, these nodes are summarized as a hyperedge. Gluing two sub-graphs according to a construction rule follows the graph substitution principle of HRG. The application of the top rule that introduces a reentrancy structure is such an example. The "X" node in the graph of the left branching phrase is unified with the "X" node in the rule, and so are the "Y" and "Z" nodes.
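The span-based decoding step can be illustrated with a simplified CKY-style dynamic program. Unlike our parser, the sketch is unlabeled, ignores the constraint that the tree must agree with the CFG grammar, and only builds binary trees; spans are half-open intervals (i, j), and the span scores are hypothetical values standing in for the MLP output.

```python
from functools import lru_cache
from typing import Dict, Tuple


def cky(n: int, span_score: Dict[Tuple[int, int], float]):
    """Return (score, tree) of the best binary bracketing of n words.
    A leaf is (i, i + 1); an internal node is ((i, j), left, right)."""

    @lru_cache(maxsize=None)
    def best(i: int, j: int):
        own = span_score.get((i, j), 0.0)  # unscored spans contribute nothing
        if j - i == 1:
            return own, (i, j)
        # Try every split point and keep the best pair of sub-trees.
        score, split = max((best(i, k)[0] + best(k, j)[0], k)
                           for k in range(i + 1, j))
        return own + score, ((i, j), best(i, split)[1], best(split, j)[1])

    return best(0, n)


# Hypothetical span scores for a 3-word sentence.
tree_score, tree = cky(3, {(0, 2): 1.0, (1, 3): 0.5})
print(tree_score, tree)
```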
To score subgraphs, we use two types of features. The first type is the node feature. For a concept $n$ aligned with span $(i, j)$, we use the span embedding $s_{i,j}$ as features, and score it with a non-linear transformation. The second type is the edge feature. Consider a semantic dependency with label $L$ from conceptual node $n_a$ to $n_b$, which are aligned to constituents $(i_1, j_1)$ and $(i_2, j_2)$ respectively. We calculate this part of the score with a non-linear transformation of the span embeddings $s_{i_1,j_1}$, $s_{i_2,j_2}$ and the randomly initialized concept embeddings $n_a$, $n_b$. For training, again, we use the margin-based loss.

Experiments
The MRP2019 training data consists of 35656 sentences in total. For convenience, the composition- and factorization-based parsers share the same tokenization model. Gold token position labels are extracted from DeepBank (Flickinger et al., 2012). For the composition-based parser, we leverage the syntactic information provided by DeepBank to extract synchronous grammars. Therefore, all sentences in the MRP2019 data that do not appear in DeepBank 1.1 are removed. Following the same preprocessing of semantic graphs as in Chen et al. (2018b) and using the recommended setup in DeepBank, there are 33722 samples for training and 1689 samples for validation. The synchronous grammars are extracted from the training data using coarse-grained labels (Chen et al., 2018b). For the factorization-based parser, we use heuristic rules to re-align the non-lexicalized concepts to input tokens. We remove all sentences that do not receive results in this step from our training set. After re-alignment, 33580 sentences are left for training and 1689 for validation. Table 3 shows the results of both parsers on the validation data using the official evaluation tool mtool. Table 4 shows the intermediate results during parsing for both parsers.
For factorization-based parsing, we combine 4 models for concept identification and 5 models for relation detection. We ensemble models by averaging the score functions across all stand-alone models. These models use different initial random seeds, different pretraining methods (ELMo or BERT) or different encoder architectures (Transformer or BiLSTM). All these models achieve similar performance individually, but the ensemble achieves much better performance, as the evaluation results show.

Table 4: Results of each stage for both parsers on the development data. Gold tokenization has the same meaning as in Table 3. Columns in the right block are the SMATCH scores ignoring all the node and edge properties for generated graphs. For the factorization-based parser, columns in the middle block include the F1 scores of concept identification with respect to lexicalized, non-lexicalized and all concepts respectively. For the composition-based parser, columns in the middle block are the syntactic parsing results using the standard metric, and POS concerns the prediction of preterminals.
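Ensembling by averaging score functions can be sketched as follows for the relation detection stage. The per-model edge scores are hypothetical, and the sketch assumes that all models score the same set of candidate edges.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[int, int]


def ensemble_scores(model_scores: List[Dict[Edge, float]]) -> Dict[Edge, float]:
    """Average the edge scores of several stand-alone models."""
    return {edge: sum(m[edge] for m in model_scores) / len(model_scores)
            for edge in model_scores[0]}


def decode(score_edge: Dict[Edge, float]) -> Set[Edge]:
    """Keep an edge if and only if its (averaged) score is greater than 0."""
    return {e for e, s in score_edge.items() if s > 0}


# Hypothetical scores from two models that disagree on both edges.
model_a = {(0, 1): 1.0, (1, 2): -0.5}
model_b = {(0, 1): 0.0, (1, 2): 1.5}
averaged = ensemble_scores([model_a, model_b])
print(averaged, decode(averaged))
```

Averaging before thresholding lets a confident model outvote a marginally negative one, which a vote over the individual models' decisions would not capture.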
Our factorization-based parser achieves relatively satisfactory performance in all basic evaluation items except top. In the in-domain evaluation, its performance nearly reaches the inter-annotator agreement reported in Bender et al. (2015). To find top concepts, our model encodes the semantic graphs and ignores the input sentences. We take the unsatisfactory result as a confirmation of the challenge of encoding complex discrete structures into vectors.
The evaluation results of our composition-based parser are not as good as those of the factorization-based one. We believe that the disagreement between our SHRG grammar and the original ERG accounts for a major part of the performance gap.

Conclusion
Current neural ERS parsers work rapidly and reliably, with an MRP accuracy of over 94% in the same-epoch-and-domain setup. It is comparable to the inter-annotator agreement (in Elementary Dependency Match) reported in Bender et al. (2015). As ERS parsers become more and more accurate, efficient and robust, they have extensive application prospects in downstream deep language understanding-related tasks.