Accurate Polyglot Semantic Parsing With DAG Grammars

Semantic parses are directed acyclic graphs (DAGs), but in practice most parsers treat them as strings or trees, mainly because models that predict graphs are far less understood. This simplification, however, comes at a cost: there is no guarantee that the output is a well-formed graph. Recent work by Fancellu et al. (2019) addressed this problem by proposing a graph-aware sequence model that uses a DAG grammar to guide graph generation. We significantly improve upon this work by proposing a simpler architecture as well as more efficient training and inference algorithms that can always guarantee the well-formedness of the generated graphs. Importantly, unlike Fancellu et al., our model does not require language-specific features, and can hence harness the inherent suitability of DAG-grammar parsing for multilingual settings. We perform monolingual as well as multilingual experiments on the Parallel Meaning Bank (Abzianidze et al., 2017). Our parser outperforms previous graph-aware models by a large margin, and closes the performance gap between string-based and DAG-grammar parsing.


Introduction
Semantic parsers map a natural language utterance into a machine-readable meaning representation, thus helping machines understand and perform inference and reasoning over natural language data. Various semantic formalisms have been explored as the target meaning representation for semantic parsing, including dependency-based compositional semantics (Liang et al., 2013), abstract meaning representation (AMR, Banarescu et al., 2013), minimum recursion semantics (MRS, Copestake et al., 2005), and discourse representation theory (DRT, Kamp, 1981). Despite meaningful differences across formalisms or parsing models, a representation in any of these formalisms can be expressed as a directed acyclic graph (DAG).
Consider for instance the sentence 'We barred the door and locked it', whose meaning representation as a Discourse Representation Structure (DRS) is shown in Figure 1. A DRS is usually represented as a set of nested boxes (e.g. b1), containing variable-bound discourse referents (e.g. 'lock(e2)'), semantic constants (e.g. 'speaker'), predicates (e.g. AGENT) expressing relations between variables and constants, and discourse relations between the boxes (e.g. CONTINUATION). This representation can be expressed as a DAG by turning referents and constants into vertices, and predicates and discourse relations into connecting edges, as shown in Figure 2.
How can we parse a sentence into a DAG? Commonly-adopted approaches view graphs as strings or trees (e.g. Zhang et al., 2019a; Liu et al., 2018), taking advantage of the linearized graph representations provided in annotated data (e.g. Figure 3, where the graph in Figure 2 is represented in PENMAN notation (Goodman, 2020)). An advantage of these linearized representations is that they allow for the use of well-understood sequential decoders and provide a general framework to parse into any arbitrary formalism. However, these representations are unaware of the overall graph structure they build, as well as of reentrant semantic relations, such as coordination, coreference, and control, that are widespread in language. Parsers such as that of Zhang et al. (2019b), although able to generate reentrancies in their output, do so by simply predicting pointers back to already generated nodes. Parsing directly into DAGs, although desirable, is less straightforward than string-based parsing. Whereas probabilistic models of strings and trees are ubiquitous in NLP, it is at present an active problem in modern formal language theory to develop formalisms that make it possible to define probability distributions over DAGs of practical interest (see Gilroy (2019) for an extensive review of the issue).

A successful line of work derives semantic graphs using graph grammars, which generate a graph by rewriting non-terminal symbols with graph fragments. Among these, hyperedge replacement grammar (HRG) has been explored for parsing into semantic graphs (Habel, 1992; Chiang et al., 2013). However, parsing with HRGs is not practical due to their complexity and the large number of possible derivations per graph (Groschwitz et al., 2015). Work has therefore looked at ways of constraining the space of possible derivations, usually in the form of alignment or syntax (Peng et al., 2015). For example, Groschwitz et al. (2018) and Donatelli et al. (2019) extracted fine-grained typed grammars whose productions are aligned to the input sentence and combined over a dependency-like structure. Similarly, Chen et al. (2018) draw on constituent parses to combine HRG fragments.

Björklund et al. (2016) show that there exists a restricted subset of HRGs, Restricted DAG grammars (RDGs), that provides a unique derivation per graph. A unique derivation means that a graph is generated by a unique sequence of productions, which can then be predicted using sequential decoders, without the need for an explicit alignment model or an underlying syntactic structure. Furthermore, the grammar places hard constraints on the rewriting process, which can be used to guarantee the well-formedness of output graphs during decoding. Drawing on this result, Fancellu et al. (2019) introduce recurrent neural network RDGs, a sequential decoder that models graph generation as a rewriting process with an underlying RDG. However, despite this promising framework, the approach of FA19 (our shorthand for Fancellu et al. (2019) throughout this paper) falls short in several respects.
In this paper, we address these shortcomings, and propose an accurate, efficient, polyglot model for neural RDG parsing. Specifically, our contributions are as follows:

Grammar: In practice, RDGs extracted from training graphs can be large and sparse. We introduce a novel factorization of the RDG production rules that reduces the sparsity of the extracted grammars. Furthermore, we use RDGs extracted from fully human-annotated training data to filter out samples from a larger, noisy, machine-generated dataset that cannot be derived with such grammars. We find that this strategy not only drastically reduces the size of the grammar, but also improves final performance.

Model: FA19 use an architecture inspired by syntactic parsing, a stackLSTM, trained on a gamut of syntactic and semantic features. We replace this with a novel architecture that allows for batched input, adding a multilingual transformer encoder that relies on word-embedding features only.

Constrained decoding: We identify a limitation in the decoding algorithm presented by FA19, in that it only partially makes use of the well-formedness constraints of an RDG. We describe the source of this error, implement a correction, and show that we can always guarantee well-formed DAGs.

Multilinguality: Training data in languages other than English is often small and noisy. FA19 addressed this issue with cross-lingual models using features available only for a small number of languages, but did not observe improvements over monolingual baselines in languages other than English. We instead demonstrate the flexibility of RDGs by extracting a joint grammar from graph annotations in different languages. At the same time, we make full use of our multilingual encoder to build a polyglot model that can accept training data in any language, allowing us to experiment with different combinations of data.
Our results tell a different story: models that use combined training data from multiple languages always substantially outperform monolingual baselines.
We test our approach on the Parallel Meaning Bank (PMB, Abzianidze et al., 2017), a multilingual graph bank. Our experimental results demonstrate that our new model outperforms that of FA19 by a large margin on English, while fully exploiting the power of RDGs to always guarantee a well-formed graph. We also show that the ability to train on multiple languages simultaneously substantially improves performance for each individual language. Importantly, we close the performance gap between graph-aware parsing and state-of-the-art string-based models.

Restricted DAG Grammar
We model graph generation as a process of graph rewriting with an underlying grammar. Our grammar is a restricted DAG grammar (RDG, Björklund et al., 2016), a type of context-free grammar designed to model linearized DAGs. For ease of understanding, we represent fragments in grammar productions as strings. This is shown in Figure 4, where the right-hand-side (RHS) fragment can be represented as its left-to-right linearization, with reentrant nodes flagged by a dedicated $ symbol.
An RDG is a tuple ⟨P, N, Σ, S, V⟩ where P is a set of productions of the form α → β; N is the set of non-terminal symbols {L, T_0, ..., T_n} up to a maximum rank n; Σ is the set of terminal symbols; S is the start symbol; and V is an unbounded set of variable references {$1, $2, ...}, whose role is described below. The left-hand side (LHS) α of a production p ∈ P is a function T_i ∈ N (where i is the rank) that takes i variable references as arguments. Variable references are what ensure the well-formedness of a generated graph in an RDG, by keeping track of how many reentrancies are expected in a derivation, as well as of how they are connected to their neighbouring nodes. Rank, in turn, is an indication of how many reentrancies are present in a graph derivation. For instance, in the graph fragment in Figure 4, given that there are two variable references and a non-terminal of rank 2, we are expecting two reentrant nodes at some point in the derivation. The RHS β is a typed fragment made up of three parts: a variable v describing the semantic type, a label non-terminal L, and a list of tuples ⟨e, s⟩ where e is an edge label from a set of labels E and s is either a non-terminal function T or a variable reference. The non-terminal L can only be rewritten as a terminal symbol l ∈ Σ. If a node is reentrant, we mark it with a superscript * on v. Variable references are percolated down the derivation and are replaced once a reentrant variable v* is found on the RHS. Following FA19, we show in Figure 5 a complete derivation that reconstructs the graph in Figure 2. Our grammar derives strings by first rewriting the start symbol S, a non-terminal function T_0. At each subsequent step, the leftmost non-terminal function in the partially derived string is rewritten, with special handling for variable references described below. A derivation is complete when no non-terminals remain.
When a reentrant variable is generated, it is mapped to a reference, as shown for production r_4, where the variable c is mapped to $1. Once this mapping is performed, all instances of $1 in the RHS are replaced by the corresponding variable name. In this way, the reference to c is kept track of during the derivation, becoming the target of AGENT in r_6. The same applies in r_5, where x is mapped to $2. All our fragments are delexicalized. This is achieved through the separate non-terminal L, which at every step is rewritten as the corresponding terminal label (e.g. 'bar'). Delexicalization reduces the size of the grammar and allows us to factorize the prediction of fragments and labels separately.
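As a minimal illustration of this rewriting process, the sketch below implements rank-checked leftmost rewriting over string-encoded fragments. The notation, productions, and labels are invented for the example and are much simpler than the actual PMB grammar; in particular, we check only the rank of the non-terminal, not the full variable-reference bookkeeping.

```python
import re

# A non-terminal of rank n is written "Tn", optionally with its reference
# arguments, e.g. "T1($1)". Fragments and edge labels below are invented.
NT = re.compile(r"T(\d+)(\([^)]*\))?")

# Productions as (LHS rank, RHS fragment).
P = [
    (0, "(b :CONTINUATION T1($1) :CONTINUATION T1($1))"),  # introduces $1
    (1, "(e bar :AGENT $1 :THEME T0)"),                    # consumes $1
]

def leftmost_rewrite(partial, production):
    """Rewrite the leftmost non-terminal in `partial`, refusing productions
    whose LHS rank does not match the non-terminal's rank."""
    lhs_rank, rhs = production
    m = NT.search(partial)
    if m is None:
        raise ValueError("derivation complete: no non-terminal left")
    if int(m.group(1)) != lhs_rank:
        raise ValueError("rank mismatch: rewriting would break well-formedness")
    return partial[:m.start()] + rhs + partial[m.end():]

g = "T0"                      # start symbol S = T0
g = leftmost_rewrite(g, P[0])
g = leftmost_rewrite(g, P[1])
# g == "(b :CONTINUATION (e bar :AGENT $1 :THEME T0) :CONTINUATION T1($1))"
```

Attempting to rewrite T0 with the rank-1 production raises an error, which is the hard constraint the decoder in § 3.3 exploits.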
However, DAG grammars can still be large, due to the many combinations in which edge labels and their corresponding non-terminals can appear in a fragment. For this reason, we propose a further simplification where edge labels are replaced with placeholders ê_1 ... ê_|e|, which we exemplify using the production in Figure 4. After a fragment is predicted, the placeholders are replaced with actual edge labels by a dedicated module (see § 3.2 for more details).
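The sketch below shows this delexicalization of edge labels on two invented fragments: stripping the labels collapses them into a single template, which is how the factorization shrinks the grammar.

```python
import re

EDGE = re.compile(r":([A-Za-z][\w-]*)")  # edge labels in our toy notation

def strip_edges(fragment):
    """Replace edge labels with positional placeholders :E1 ... :E|e|,
    returning the delexicalized template and the removed labels."""
    labels = []
    def repl(m):
        labels.append(m.group(1))
        return ":E%d" % len(labels)
    return EDGE.sub(repl, fragment), labels

# Two invented fragments that differ only in their edge labels ...
frags = ["(e L :AGENT $1 :THEME $2)", "(e L :AGENT $1 :PATIENT $2)"]
templates = {strip_edges(f)[0] for f in frags}
# ... collapse into the single template "(e L :E1 $1 :E2 $2)"
```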
Comparison with Groschwitz et al. (2018)'s AM algebra. RDG is very similar to other graph grammars proposed for semantic parsing, in particular to Groschwitz et al. (2018)'s AM algebra used for AMR parsing. Their framework relies on a fragment extraction process similar to ours, where each node in a graph, along with its outgoing edges, makes up a fragment. However, the two grammars differ mainly in how typing, and as a consequence composition, is handled: whereas in the AM algebra both the fragments themselves and the non-terminal edges are assigned thematic types (e.g. S[ubject], O[bject], MOD[ifier]), we only place rank information on the non-terminals and assign a more generic semantic type to the fragment.
The fine-grained thematic types in the AM algebra add a level of linguistic sophistication that RDG lacks, in that fragments fully specify the roles a word is expected to fill. This ensures that the output graphs are always semantically well-formed; in this respect, the AM algebra behaves very similarly to CCG. However, this sophistication not only requires ad-hoc heuristics tailored to a specific formalism (AMR, in this case) but also relies on alignment information with the source words.
Our grammar, on the other hand, is designed to predict a graph structure in sequential models. Composition is constrained by the rank of a non-terminal, so as to ensure that at each decoding step the model is always aware of the placement of reentrant nodes. However, we do not ensure semantic well-formedness, in that words are predicted separately from their fragments and we do not rely on alignment information. In turn, our grammar extraction algorithm does not rely on any heuristics and can easily be applied to any semantic formalism.

Architecture
Our model is an encoder-decoder architecture that takes as input a sentence and generates a DAG G as a sequence of fragments with their corresponding labels, using the rewriting system in § 2. In what follows we describe how we obtain the logits for each target prediction, all of which are normalized with the softmax function to yield probability distributions. A detailed diagram of our architecture is shown in Figure 7 in Appendix A.

Encoder
We encode the input sentence w_1, ..., w_n using a pre-trained multilingual BERT (mBERT) model (Devlin et al., 2018). The final word-level representations are obtained through mean-pooling the sub-word representations of mBERT, computed using the WordPiece algorithm (Schuster and Nakajima, 2012). We do not rely on any additional (language-specific) features, hence making the encoder polyglot. The word vectors are then fed to a two-layer BiLSTM encoder, whose forward and backward states are concatenated to produce the final token encodings s^enc_1, ..., s^enc_n.
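For concreteness, the pooling step can be sketched as follows. The vectors and the subword-to-word alignment are invented; a real implementation would operate on mBERT's output tensors rather than Python lists.

```python
def mean_pool(subword_vecs, word_ids):
    """Average the subword vectors that belong to each word. `word_ids[i]`
    is the word index of subword i, as given by the tokenizer alignment."""
    n_words = max(word_ids) + 1
    dim = len(subword_vecs[0])
    sums = [[0.0] * dim for _ in range(n_words)]
    counts = [0] * n_words
    for vec, w in zip(subword_vecs, word_ids):
        for j, x in enumerate(vec):
            sums[w][j] += x
        counts[w] += 1
    return [[s / c for s in row] for row, c in zip(sums, counts)]

# 'barred' split into two subwords ('barr', '##ed'), 'it' kept whole:
subs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
pooled = mean_pool(subs, word_ids=[0, 0, 1])
# pooled == [[0.5, 0.5], [2.0, 2.0]]
```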

Decoder
The backbone of the decoder is a two-layer LSTM, with a separate attention mechanism for each layer. Our decoding strategy follows steps similar to those in Figure 5: at each step we first predict a delexicalized fragment f_t, and then substitute a terminal label l_t in place of L. We initialize the decoder LSTM with the encoder's final state s^enc_n. At each step t, the network takes as input [f_{t-1}; l_{t-1}], the concatenation of the embeddings of the fragment and of its label output at the previous time step. At t = 0, we initialize both fragment and label encodings with a START token. The first layer in the decoder is responsible for predicting fragments. The second layer takes as input the output representations of the first layer, and predicts terminal labels. The following paragraphs provide details on fragment and label prediction.
Fragment prediction. We make the prediction of a fragment dependent on the embedding of the parent fragment and on the decoder history. We define as parent fragment the fragment containing the non-terminal that the current fragment rewrites; for instance, in Figure 5, the fragment in step 1 is the parent of the fragment underlined in step 2. Following this intuition, at time t we concatenate the hidden state of the first layer, h^1_t, with a context vector c^1_t and the embedding of the parent fragment u_t. The logits for fragment f_t are predicted with a single linear layer: W_f [c^1_t; u_t; h^1_t] + b_f. We compute c^1_t using a standard soft attention mechanism (Xu et al., 2015) over s^enc_{1:N}, the concatenation of all encoder hidden states.

Label prediction. Terminal labels in the output graph can either correspond to a lemma in the input sentence (e.g. 'bar', 'lock') or to a semantic constant (e.g. 'speaker'). We make use of this distinction by incorporating a selection mechanism that learns to predict either a lemma from the input or a token from a label vocabulary L. We concatenate the hidden state of the second layer, h^2_t, with the embedding of the fragment predicted at the current time step, f_t, and the second-layer context vector c^2_t; we refer to this representation as z_t. The context vector for the second layer is computed in the same way as c^1_t, but using h^2_t in place of h^1_t and separate attention MLP parameters. To compute the logits for label prediction, we apply a linear transformation to the encoder representations, e = W_s s^enc_{1:N}. We concatenate the resulting vectors with the label embedding matrix L and compute the dot product z_t^T [e; L] to obtain the final unnormalized scores jointly over all tokens in the input and all labels in L.
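The selection mechanism can be sketched with invented toy vectors (a real model would use the actual encoder states and embedding matrices): scoring both the transformed input representations and the label vocabulary with the same dot product puts copying and generation in a single softmax.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def label_scores(z, input_reprs, label_embeddings):
    """Unnormalized scores over (i) lemmas copied from the input and
    (ii) the label vocabulary, i.e. the dot product z^T [e; L]."""
    return [dot(z, v) for v in input_reprs + label_embeddings]

z = [1.0, 0.0]                        # stands in for z_t
lemmas = [[2.0, 0.0], [0.0, 1.0]]     # stands in for e = W_s s_enc
vocab = [[0.5, 0.5]]                  # stands in for rows of L
scores = label_scores(z, lemmas, vocab)
# scores == [2.0, 0.0, 0.5]: the decoder would copy the first input lemma
```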
In the PMB, each label is also annotated with its sense tag and with information about whether it is presupposed in the context or not. We predict the former, s_t, over a class of sense tags S extracted from the training data, and the latter, p_t, a binary variable, by passing z_t through two distinct linear layers to obtain the logits for each.
Edge factorization. In § 2, we discussed how we make grammars even less sparse by replacing the edge labels in a production fragment with placeholders. From a modelling perspective, this allows us to factorize edge label prediction: the decoder first predicts all the fragments in the graph and then predicts the edge labels e_1 ... e_|e| that substitute in place of the placeholders.
To do so, we cache the intermediate representations z_t over time. We use these as features to replace the edge placeholders ê_i with the corresponding true edge labels e_i. To obtain the edge-label logits, we pass the second-layer representations for the child fragment, z_c, and the parent fragment, z_p, to a pairwise linear layer: W_e [W_c z_c; W_p z_p].
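The pairwise scoring layer can be sketched as below; we read the bracket as concatenation of the two projections, and the tiny matrices are invented for illustration only.

```python
def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def edge_logits(z_parent, z_child, W_p, W_c, W_e):
    """Score edge labels for a (parent, child) fragment pair:
    W_e applied to the concatenation [W_c z_c ; W_p z_p]."""
    h = matvec(W_c, z_child) + matvec(W_p, z_parent)  # list + = concatenation
    return matvec(W_e, h)

# Tiny invented parameters: identity projections, two candidate edge labels.
I2 = [[1.0, 0.0], [0.0, 1.0]]
W_e = [[1.0, 0.0, 0.0, 0.0],   # logit for, say, :AGENT
       [0.0, 0.0, 0.0, 1.0]]   # logit for, say, :THEME
logits = edge_logits([1.0, 2.0], [3.0, 4.0], I2, I2, W_e)
# logits == [3.0, 2.0]
```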

Graph-aware decoding
At inference time, our graph decoder rewrites non-terminals left-to-right by choosing the fragment with the highest probability, and then predicts terminal and/or edge labels. The rank of a non-terminal and the variable references it takes as arguments place a hard constraint on the fragment that rewrites in its place (as shown in § 2). Only by satisfying these constraints can the model ensure the well-formedness of generated graphs.
By default, our decoder does not explicitly follow these constraints and can substitute a non-terminal with any fragment in the grammar. This lets us assess whether a vanilla decoder can learn to substitute in a fragment that correctly matches a non-terminal. On top of the vanilla decoder, we then exploit these hard constraints in two different ways, as follows:

Rank prediction. We incorporate information about rank as a soft constraint during learning by having the model predict it at each time step. This means that the model can still predict a fragment whose rank and variable references do not match those of a non-terminal, but it is guided not to do so. We treat rank prediction as a classification task, using the same features as for fragment prediction, which we pass to a linear layer. Note that the range of predicted ranks is determined by the training grammar, so it is not possible to generate a rank that has not been observed and has no associated rules.
Constrained decoding. We explicitly restrict the model to choose only amongst those fragments that match the rank and variable references of a non-terminal. This may override model predictions, but always ensures that a graph is well-formed. To ensure well-formedness, FA19 only check for rank. This can have infelicitous consequences. Consider for instance the substitution in Figure 6. Both fragments at the bottom of the middle and right representations are of rank 2, but whereas the first allows the edges to refer back to the reentrant nodes, the second introduces an extra reentrant node, therefore leaving one of the existing reentrant nodes disconnected. Checking just for rank is therefore not enough; one also needs to check whether a reentrant node that will substitute in a variable reference has already been generated. If not, any fragment of the same rank can be accepted. If such a node already exists, only fragments that do not introduce another reentrant node can be accepted. This constrained decoding strategy is what allows us to always generate well-formed graphs; we integrate this validation step into the decoding algorithm when selecting the candidate fragment.
Finally, we integrate these hard constraints in the softmax layer as well. Instead of normalizing the logits across all fragment types with a single softmax operation, we normalize them separately for each rank. The errors are only propagated through the subset of parameters in W_f and b_f responsible for the logits within the target rank r_t.
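A sketch of this rank-restricted normalization (fragment logits and ranks invented): fragments whose rank does not match the non-terminal being rewritten receive probability 0, and the remaining logits are renormalized among themselves.

```python
import math

def rank_softmax(logits, frag_ranks, target_rank):
    """Softmax restricted to fragments of `target_rank`: everything else
    gets probability 0, so only rank-compatible rewrites compete."""
    idx = [i for i, r in enumerate(frag_ranks) if r == target_rank]
    m = max(logits[i] for i in idx)                   # for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in idx}
    z = sum(exps.values())
    return [exps.get(i, 0.0) / z for i in range(len(logits))]

# Three fragments with invented logits; the non-terminal has rank 1,
# so the rank-0 fragment is masked out entirely.
probs = rank_softmax([2.0, 1.0, 0.5], frag_ranks=[0, 1, 1], target_rank=1)
```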

Training objective
Our objective is to maximize the log-likelihood of the full graph, P(G|s), approximated by decomposing it over each prediction task separately:

log P(G|s) ≈ Σ_t [ log P(f_t) + log P(l_t) + log P(r_t) + log P(s_t) + log P(p_t) + Σ_i log P(e_i) ]

where f_t is the fragment; l_t is the terminal label; r_t is the (optional) rank of f_t; s_t and p_t are the sense and presupposition labels of the terminal label l_t; e_1 ... e_|e| are the edge labels of f_t; and each probability is conditioned on the decoding history (omitted for brevity). To prevent our model from overfitting, rather than directly optimizing the log-likelihoods, we apply label smoothing to each prediction term (Szegedy et al., 2016).
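Label smoothing for a single prediction term can be sketched as follows: the gold class keeps 1 − ε of the target mass and the remainder is spread uniformly over all classes. This is a minimal scalar version; ε = 0.1 is an illustrative default, not the paper's reported value.

```python
import math

def label_smoothed_nll(log_probs, gold, eps=0.1):
    """Negative log-likelihood against a smoothed target distribution:
    (1 - eps) on the gold class plus eps spread uniformly over all classes."""
    n = len(log_probs)
    smooth = eps / n
    loss = 0.0
    for i, lp in enumerate(log_probs):
        weight = (1.0 - eps) + smooth if i == gold else smooth
        loss -= weight * lp
    return loss

# Under a uniform model distribution the loss equals log(n) for any gold
# class, since the smoothed target weights sum to 1:
uniform = [math.log(1.0 / 3.0)] * 3
loss = label_smoothed_nll(uniform, gold=0)
# loss == log(3)
```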

Experimental setup

Data
We evaluate our parser on the Parallel Meaning Bank (Abzianidze et al., 2017), a multilingual graph bank where sentences in four languages (English (en), Italian (it), German (de) and Dutch (nl)) are annotated with their semantic representations in the form of Discourse Representation Structures (DRS). We test on v.2.2.0 to compare with previous work, and present the first results on v.3.0 on all four languages. We also present results when training on both gold and silver data, where the latter is ∼10x larger but contains machine-generated parses, of which only a small fraction has been manually edited. Statistics for both versions of the PMB are reported in Appendix B.
Our model requires an explicit grammar, which we obtain by automatically converting each DAG in the training data into a sequence of productions. This conversion follows the one in FA19 with minor changes; we include details in Appendix C.
Statistics on the grammars extracted from the PMB are presented in Table 1, where along with the number of training instances and fragments, we report the average rank, an indication of how many reentrancies are (on average) present in the graphs. RDGs can be large, especially in the case of silver data, where incorrect parses lead to a larger number of extracted fragments and to more complex, noisy constructions, as attested by the higher average ranks. More importantly, we show that removing the edge labels from the fragments leads to a drastic reduction in the number of fragments, especially for the silver corpora.

Evaluation metrics
To evaluate our parser, we need to compare its output DRSs to the gold-standard graph structures. For this, we use the Counter tool, which calculates an F-score by searching for the best match between the variables of the predicted and the gold-standard graphs. Counter's search algorithm is similar to the SMATCH evaluation system for AMR parsing.
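The metric can be sketched as follows: given predicted and gold graphs as sets of triples, find the variable mapping that maximizes the number of matched triples and compute an F1 over them. We brute-force the mapping, which is feasible only for toy-sized graphs (Counter and SMATCH use hill-climbing), and we adopt the invented convention that variable names start with 'x'.

```python
from itertools import permutations

def smatch_f1(pred, gold):
    """Best-mapping F1 over triples. Variables are names starting with 'x'
    (an invented convention); everything else is treated as a constant.
    Brute force over mappings -- fine for toy graphs only."""
    def variables(triples):
        return sorted({t for tr in triples for t in tr if t.startswith("x")})
    pv, gv = variables(pred), variables(gold)
    candidates = gv + [None] * max(0, len(pv) - len(gv))
    best = 0
    for perm in permutations(candidates, len(pv)):
        m = dict(zip(pv, perm))
        mapped = {tuple(m.get(t, t) for t in tr) for tr in pred}
        best = max(best, len(mapped & set(gold)))
    p = best / len(pred) if pred else 0.0
    r = best / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

pred = [("x1", "AGENT", "speaker"), ("x1", "instance", "bar")]
gold = [("x9", "AGENT", "speaker"), ("x9", "instance", "bar")]
# mapping x1 -> x9 matches both triples, so smatch_f1(pred, gold) == 1.0
```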
There are occurrences where a graph is deemed ill-formed by Counter; we assign these graphs a score of 0. The ill-formedness is however not due to the model itself, but to specific requirements placed on the output DRS by the Counter script. Note also that our DAGs are different from the Discourse Representation Graphs (DRGs) of Basile and Bos (2013); we elaborate on this in Appendix C.

Experimental Results
We first present the results of ablation experiments to understand which model configuration performs best (§ 5.1). We then compare our best-performing model with several existing semantic parsers (§ 5.2), and present our model's performance in multilingual settings (§ 5.3).

Table 2 shows results for our model in various settings. Our baseline is trained on gold data alone, uses a full grammar, and performs unconstrained decoding, with and without rank prediction. Note that unconstrained decoding can lead to ill-formed graphs. To better understand the effect of this, we compare the performance of the baseline with a model that uses constrained decoding and thus always generates well-formed graphs. We train all our models on a single TitanX GPU v100. We report hyperparameters and other training details in Appendix D.

Ablation experiments
Our results differ from those of FA19, who show that a baseline model outperforms one with constrained decoding. Not only do we find that constrained decoding outperforms the baseline, but we also observe that without it, 26 graphs (∼4%) are ill-formed. In addition, our results show that predicting edge labels separately from fragments (edge factorization) leads to a substantial improvement in performance, while also drastically reducing the size of the grammar (as shown in Table 1).
van Noord et al. (2019), Liu et al. (2019) and FA19 found that models trained on silver data require additional fine-tuning on gold data alone to achieve the best performance; we follow this strategy in our experiments. Overall, the results show that adding silver data improves performance, and that filtering the input silver data leads to only a slight loss in performance while keeping the size of the grammar small.

Comparison to previous work
We compare our best-performing model against previous work on PMB2.2.0. We first compare performance for models trained solely on gold data. Besides the DAG-grammar parser of FA19, we compare with the transition-based stackLSTM of Evang (2019), which utilizes a buffer-stack architecture to predict a DRS fragment for each input token using the alignment information in the PMB; our graph parser does not make use of such information and relies solely on attention.
We then compare our best-performing model with two models trained on gold plus silver data. van Noord et al. (2019) is a seq2seq parser that decodes an input sentence into a concatenation of clauses, essentially a flattened version of the boxes in Figure 1. Similar to FA19, their model also uses a wide variety of language-dependent features, including part-of-speech, dependency and CCG tags, while ours relies solely on word embeddings. Results are shown in Table 3. When trained on gold data alone, our model outperforms previous models by a large margin, without relying on alignment information or extra features besides word embeddings. When trained on silver+gold, we close the performance gap with state-of-the-art models that decode into strings, despite relying solely on multilingual word embeddings.

Table 4 shows the results for languages other than English. In our multilingual experiments, we first train and test monolingual models for each language. In addition, we perform zero-shot experiments by training a model on English and testing it on other languages (cross-lingual). We also take full advantage of the fact that our models rely solely on multilingual word embeddings, and experiment with two other multilingual settings: the bilingual models are trained on data in English plus data in a target language (and tested on the target language), while the polyglot models combine training data of all four languages (and are tested on each language). Parameters for all languages in the bilingual and polyglot models are fully shared. FA19 only experiment with a cross-lingual model trained with additional language-dependent features, some of which are available only for a small number of languages (on PMB2.2.0). We therefore compare our cross-lingual models with theirs on PMB2.2.0. We then present the first results on PMB3, where we experiment with the other two multilingual settings.

Multilingual experiments
Our results tell a different story from FA19: all of our multilingual models (bilingual, polyglot and cross-lingual) outperform the corresponding monolingual baselines. We hypothesize that this is mainly due to the fact that, for languages other than English, only small amounts of silver training data are available, and adding the large gold English dataset may help dramatically with performance. This hypothesis is reinforced by the fact that a cross-lingual model trained on English data alone can reach a performance comparable to that of the other two models.

Conclusions
In this paper, we have introduced a graph parser that can fully harness the power of DAG grammars in a seq2seq architecture. Our approach is efficient, fully multilingual, always guarantees well-formed graphs, and can rely on small grammars, while outperforming previous graph-aware parsers in English, Italian, German and Dutch by a large margin. At the same time, we close the gap between string-based and RDG-based decoding. In the future, we plan to extend this work to other semantic formalisms (e.g. AMR, UCCA), as well as to test on other languages, so as to encourage work in languages other than English.
A System architecture

An illustration of our system architecture is shown in Figure 7.

C DAG-grammar extraction
Our grammar extraction consists of three steps:

Preprocess the DRS. First, we treat all constants as lexical elements and bind them to a variable c. For instance, in Figure 1 we bind 'speaker' to a variable c1 and change the relations AGENT(e1, 'speaker') and AGENT(e2, 'speaker') into AGENT(e1, c1) and AGENT(e2, c1), respectively. Second, we deal with multiple lexical elements that map to the same variable (e.g. cat(x1) ∧ entity(x1), where the second predicate specifies the 'nature' of the first) by renaming the second variable as i and creating a dummy relation OF that maps from the first to the second. Finally, we remove relations that generate cycles. We found 25 cycles in the PMB, all related to the same phenomenon, where the relations 'Role' and 'Of' have inverted source and target (e.g. person(x1) -Role- mother(x4), mother(x4) -Of- person(x1)). We remove the cyclicity by merging the two relations into one edge label. All these changes are reverted before evaluation.
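The first preprocessing step (binding constants to shared variables) can be sketched as follows, over an invented triple encoding of the relations in which constants appear as quoted strings.

```python
def bind_constants(relations):
    """Bind quoted constants to fresh shared variables c1, c2, ...
    Relations are (label, source, target) triples; quoting constants
    is an invented convention for this sketch."""
    table, out = {}, []
    for rel, src, tgt in relations:
        if tgt.startswith('"'):
            if tgt not in table:
                table[tgt] = "c%d" % (len(table) + 1)
            tgt = table[tgt]
        out.append((rel, src, tgt))
    return out, table

rels = [("AGENT", "e1", '"speaker"'), ("AGENT", "e2", '"speaker"')]
bound, table = bind_constants(rels)
# bound == [("AGENT", "e1", "c1"), ("AGENT", "e2", "c1")]:
# both AGENT edges now point at the shared variable c1
```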
Convert the DRS into a DAG. We convert all main boxes, lexical predicates and constants (now bound to a variable) into nodes, whereas binary relations between predicates and boxes are treated as edges. For each box, we identify a root variable (if any) and attach it as a child of the box node with an edge :DRS. A root variable is defined as a variable belonging to a box that is *not* at the receiving end of any binary predicate; in Figure 1, these are e1 and e2 for b2 and b3 respectively. We then follow the binary relations to expand the graph. In doing so, we also incorporate presuppositional boxes in the graph (i.e. b4 in Figure 1). Each of these boxes contains predicates that are presupposed in context (usually definite descriptions like 'the door'). To link presupposed boxes to the main boxes (i.e. to obtain a fully connected DAG), we assign a (boolean) presupposition feature to the root variable of the presupposed box (marked with the superscript p in Figure 2). Any descendant predicates of this root variable are considered part of the presupposed DRS. During post-processing, when we need to reconstruct the DRS from a DAG, we rebuild the presupposed box around the variables for which presupposition is predicted as 'True', and their descendants.

Note that Basile and Bos (2013) proposed a similar conversion to generate Discourse Representation Graphs (DRGs), exemplified in Figure 8 using our working example. We argue that our representation is more compact in that: 1) we omit the 'in' edges with which a DRG explicitly marks each variable as part of a box by means of a dedicated edge; this is possible since each box (the square nodes) has a main predicate and all its descendants belong to the box; 2) we treat binary predicates (e.g. AGENT) as edge labels and not as nodes; 3) we remove presupposition boxes (in Figure 8, the subgraph rooted in a P-labelled edge) and assign a (boolean) presupposition variable to the presupposed predicates.
Convert the DAGs into derivation trees. DAGs are converted into derivation trees in two passes, following the algorithm of Björklund et al. (2016), which we summarize here; the reader is referred to the original paper for further details. The algorithm consists of two steps: first, for each node n we traverse the graph post-order and store information about the reentrant nodes in the subgraph rooted at n.
More precisely, each outgoing edge ei of n defines a subgraph si, along which we extract a list of all the reentrant nodes we encounter. This list also includes the node n itself, if it is reentrant.
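The first pass can be sketched as follows, under the assumption that the DAG is stored as an adjacency dict; the function and variable names are illustrative and not from Björklund et al.'s implementation:

```python
def collect_reentrancies(dag, root):
    """Post-order traversal that records, for every node, the set of
    reentrant nodes (in-degree > 1) in the subgraph rooted at it."""
    indegree = {}
    for children in dag.values():
        for c in children:
            indegree[c] = indegree.get(c, 0) + 1
    reentrant = {n for n, d in indegree.items() if d > 1}

    table = {}
    def visit(n):
        if n in table:           # subgraph already summarized (shared node)
            return table[n]
        found = {n} & reentrant  # include n itself if it is reentrant
        for c in dag.get(n, []):
            found |= visit(c)
        table[n] = found
        return found

    visit(root)
    return table
```

On a toy DAG where two event nodes share one constant node, the shared node is reported in the reentrancy set of both parents and of the root.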
Figure 7: Overview of our architecture, following the description in § 3. Our encoder (left) computes multilingual word embeddings using mBERT, which then feed into a 2-layer BiLSTM. At time step t, a 2-layer decoder LSTM (right) reconstructs a graph G by predicting a fragment ft and a terminal label lt. Additionally, parsing on the PMB requires predicting, for each label lt, a sense tag st and presupposition information pt (a boolean flag). To predict ft, we use the hidden state of the first decoder layer (blue) along with a context vector cft and information about the parent fragment ut (yellow edges). All other predictions use the hidden state of the second decoder layer (red) along with a separate context vector clt. Both context vectors are computed using soft attention over the input representations (top left). Predicted fragments substitute the leftmost non-terminal in the partial graph G (pink), as shown at the top for G2...G5. For G1, the first predicted fragment initializes the graph (this corresponds to substituting the start symbol S). The edge labels in the fragments above are replaced with placeholders e1...e|e| to show how edge factorization works. We assume here, for brevity, that G5 is our final output graph and show the prediction of two edges that substitute in place of the placeholders (box at the bottom). For edge prediction, we use a bundle of features collected during decoding, namely the parent and children fragment embeddings ft, the second-layer hidden state (red), and the context vector cl at time t.

We then traverse the tree depth-first to collect the grammar fragments and build the derivation tree. Each node contains information about its variable (and type), lexical predicate and features, as well as a list of the labels on its outgoing edges, which we plug into the fragments. In order to add variable references, if any, we need to know whether there are reentrant nodes shared across the subgraphs s1...s|e|. If so, these become variable references. If the node n itself is reentrant, we flag it with * so that we know that its variable name can substitute a variable reference.
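The variable-reference rule just described could be sketched per node as follows; the argument names and return format are our own illustration:

```python
from collections import Counter

def mark_references(child_reentrancies, node_is_reentrant):
    """A reentrant node shared by two or more of the child subgraphs
    s1...s|e| becomes a variable reference; the node itself is flagged
    '*' when reentrant, so that its variable name can later substitute
    such a reference."""
    counts = Counter(v for sub in child_reentrancies for v in sub)
    references = sorted(v for v, k in counts.items() if k > 1)
    flag = "*" if node_is_reentrant else ""
    return references, flag
```

A node whose two child subgraphs both contain the reentrant node c1 thus introduces a reference to c1, while the node carrying c1 itself receives the * flag.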

D Implementation Details
We use the pre-trained uncased multilingual BERT base model from Wolf et al. (2019). All models trained on English data, monolingual or multilingual, share the same hyper-parameter settings. Languages other than English in the PMB v3.0 have less training data, especially Dutch and Italian. Hence, we reduce model capacity across the board and increase dropout to avoid over-fitting. Hyperparameter settings are shown in Table 7. We found fine-tuning the BERT model necessary to achieve good performance. Following Sun et al. (2019) and Howard and Ruder (2018), we experimented with different fine-tuning strategies, all applied after model performance plateaued: 1) setting a constant learning rate for the BERT layers; 2) gradually unfreezing BERT layer by layer with a decaying learning rate; 3) slanted-triangular learning rate scheduling, following Howard and Ruder (2018).
We concluded that strategy 1 works best for our task, with a fine-tuning learning rate of 2e-5 for English and a smaller learning rate of 1e-5 for the other languages.
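Strategy 1 amounts to splitting the parameters into two optimizer groups with different (constant) learning rates. A minimal sketch, where the 'bert.' name prefix and the task-side learning rate are assumptions of ours (2e-5 is the English setting reported above):

```python
def param_groups(named_params, bert_lr=2e-5, task_lr=1e-3):
    """Split (name, parameter) pairs into a BERT group with a constant
    fine-tuning learning rate and a group for the rest of the model."""
    bert, task = [], []
    for name, param in named_params:
        (bert if name.startswith("bert.") else task).append(param)
    return [{"params": bert, "lr": bert_lr},
            {"params": task, "lr": task_lr}]
```

The returned list follows the per-parameter-group format accepted by PyTorch optimizers, e.g. `torch.optim.Adam(param_groups(model.named_parameters()))`.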

E Multilingual Experiments: Full Results
All results for the multilingual experiments, including precision and recall, are shown in Table 6.