MaskParse@Deskin at SemEval-2019 Task 1: Cross-lingual UCCA Semantic Parsing using Recursive Masked Sequence Tagging

This paper describes our recursive system for SemEval-2019 Task 1: Cross-lingual Semantic Parsing with UCCA. Each recursive step consists of two parts. We first perform semantic parsing using a sequence tagger to estimate the probabilities of the UCCA categories in the sentence. Then, we apply a decoding policy which interprets these probabilities and builds the graph nodes. Parsing is done recursively: we perform a first inference on the sentence to extract the main scenes and links, and then we recursively apply our model on the sentence using a masking feature that reflects the decisions made in previous steps. The process continues until the terminal nodes are reached. We chose a standard neural tagger and we focus on our recursive parsing strategy and on the cross-lingual transfer problem to develop a robust model for the French language, using only a few training samples.


Introduction
Semantic representation is an essential part of NLP. For this reason, several semantic representation paradigms have been proposed. Among them we find PropBank (Palmer et al., 2005), FrameNet Semantics (Baker et al., 1998), Abstract Meaning Representation (AMR) (Banarescu et al., 2013), Universal Decompositional Semantics (White et al., 2016) and Universal Conceptual Cognitive Annotation (UCCA) (Abend and Rappoport, 2013). These constantly improving representations, along with the advances in semantic parsing, have proven to be beneficial in many NLU tasks such as question answering (Shen and Lapata, 2007), text summarization (Genest and Lapalme, 2011), dialog systems (Tur et al., 2005), information extraction (Bastianelli et al., 2013) and machine translation (Liu and Gildea, 2010). UCCA is a cross-lingual semantic representation scheme that has demonstrated applicability in English, French and German (with pilot annotation projects on Czech, Russian and Hebrew). Despite the newness of UCCA, it has proven useful for defining semantic evaluation measures in text-to-text generation and machine translation (Birch et al., 2016). UCCA represents the semantics of a sentence using directed acyclic graphs (DAGs), where terminal nodes correspond to text tokens and non-terminal nodes to higher-level semantic units. Edges are labelled, indicating the role of a child in the relation to its parent. UCCA parsing is a recent task, and since UCCA has several unique properties, adapting syntactic parsers or parsers from other semantic representations is not straightforward. The current state-of-the-art parser, TUPA (Hershcovich et al., 2017), uses transition-based parsing to build UCCA representations.
Building on previous work on FrameNet semantic parsing (Marzinotto et al., 2018a,b), we chose to perform UCCA parsing using sequence tagging methods along with a graph decoding policy. To do this, we propose a recursive strategy in which we perform a first inference on the sentence to extract the main scenes and links, and then recursively apply our model on the sentence with a masking mechanism at the input in order to feed in information about the previous parsing decisions.

Model
Our system consists of a sequence tagger that is first applied on the sentence to extract the main scenes and links, and is then recursively applied on the extracted elements to build the semantic graph. At each step of the recursion we use a masking mechanism to feed information about the previous stages into the model. In order to convert the sequence labels into nodes of the UCCA graph, we also apply a decoding policy at each stage.
Our tagger is implemented using a deep bi-directional GRU (biGRU). This simple architecture is frequently used in semantic parsers across different representation paradigms. Besides its flexibility, it is a powerful model, with close to state-of-the-art performance on both PropBank (He et al., 2017) and FrameNet semantic parsing (Yang and Mitchell, 2017; Marzinotto et al., 2018b).
More precisely, the model consists of a 4-layer bi-directional Gated Recurrent Unit (GRU) with highway connections (Srivastava et al., 2015). Our model uses a rich set of features, including syntactic, morphological, lexical and surface features, which have been shown to be useful in language-abstracted representations. The list is given below:
• Word embeddings of 300 dimensions.
• Syntactic dependencies of each token.
• Part-of-speech and morphological features such as gender, number, voice and degree.
• Capitalization and word length encoding.
• Prefixes and suffixes of 2 and 3 characters.
• A language indicator feature.
• A Boolean indicator of idioms and multi-word expressions, detailed in section 3.2.
• A masking mechanism, which indicates, for a given node in the graph, the tokens within the span as well as the arc label between the node and its parent. See details in section 2.1.
Except for words, where we use pre-trained embeddings, we use randomly initialized embedding layers for the categorical features.

Masking Mechanism
We introduce an original masking mechanism in order to feed information about the previous parsing stages into the model. During parsing, we first do an initial inference step to extract the main scenes and links. Then, for each resulting node, we build a new input which is essentially the same, but with a categorical sequence masking feature. For the input tokens in the node span, this feature is equal to the label of the arc between the node and its parent. Outside of the node span, this mask is equal to O. A diagram of this masking process is shown in figure 1. The process continues and the model recursively extracts the inner semantic structures (the node's children) in the graph, until the terminal nodes are reached.
To train such a model, we build a new training corpus in which the sentences are repeated several times. More precisely, a sentence appears N times (N being the number of non-terminal nodes in the UCCA graph), each time with a different mask.
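The construction of the masked training instances can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Node` class, the function names and the toy sentence are hypothetical, nodes without children stand in for terminals, and the all-`O` mask used for the root is an assumption, since the text does not specify the mask of the first inference step.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical UCCA node covering a contiguous token span."""
    start: int                  # index of the first token (inclusive)
    end: int                    # index after the last token (exclusive)
    label: Optional[str]        # label of the arc to the parent, None for the root
    children: List["Node"] = field(default_factory=list)


def mask_for(node: Node, n_tokens: int) -> List[str]:
    """Arc label inside the node span, 'O' outside (root: all 'O', an assumption)."""
    if node.label is None:
        return ["O"] * n_tokens
    return [node.label if node.start <= i < node.end else "O"
            for i in range(n_tokens)]


def training_instances(root: Node, n_tokens: int) -> List[Tuple[Node, List[str]]]:
    """One (node, mask) pair per non-terminal node, visited top-down, left to right."""
    instances = []
    stack = [root]
    while stack:
        node = stack.pop()
        if not node.children:   # treated as a terminal here: nothing left to parse
            continue
        instances.append((node, mask_for(node, n_tokens)))
        stack.extend(reversed(node.children))
    return instances


# Toy sentence: "John left because he was tired" (6 tokens).
scene_1 = Node(0, 2, "H", [Node(0, 1, "A"), Node(1, 2, "P")])
link = Node(2, 3, "L")
scene_2 = Node(3, 6, "H", [Node(3, 4, "A"), Node(4, 5, "F"), Node(5, 6, "S")])
root = Node(0, 6, None, [scene_1, link, scene_2])

for node, mask in training_instances(root, 6):
    print(node.label, mask)
```

In this toy graph the sentence is repeated three times, once for the root and once for each of the two scenes, which matches the N-copies-per-sentence construction described above.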

Multi-Task UCCA Objective
Along with the UCCA-XML graph representations, a simplified tree representation in CoNLL format was also provided. Our model combines both representations using a multi-task objective with two tasks. TASK1 consists of predicting, for a given node and its corresponding mask, the children and their arc labels. TASK1 encodes the children spans using a BIO scheme. TASK2 consists of predicting the CoNLL simplified UCCA structure of the sentence. More precisely, TASK2 is a sequence tagger that predicts the UCCA-CoNLL function of each token. TASK2 is not used for inference purposes. It is only an auxiliary task that helps the model extract relevant features, allowing it to model the whole sentence even when parsing small pre-terminal nodes.

Label Encoding
We have previously stated that TASK1 uses BIO-encoded labels to model the structure of the children of each node in the semantic graph. In some rare cases, the BIO encoding scheme is not sufficient to model the interaction between parallel scenes. This happens, for example, when we have two parallel scenes and one of them appears as a clause inside the other. In such cases, BIO encoding does not allow us to determine whether the last part of the sentence belongs to the first scene or to the clause. Despite this issue, prior experiments testing more complete label encoding schemes (BIEO, BIEOW) showed that BIO outperforms the other schemes on the validation sets.
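The BIO encoding of a node's children, and the ambiguity described above, can be illustrated with a small sketch. The function name and the input format are hypothetical, and the choice to open every contiguous piece of a discontinuous unit with a `B-` tag is an assumption about how such units would be encoded.

```python
from typing import List, Tuple

# A child unit is (arc label, list of contiguous (start, end) pieces, end exclusive).
Unit = Tuple[str, List[Tuple[int, int]]]


def encode_children_bio(n_tokens: int, children: List[Unit]) -> List[str]:
    """Encode the children of one node as a BIO tag sequence."""
    tags = ["O"] * n_tokens
    for label, pieces in children:
        for start, end in pieces:
            tags[start] = "B-" + label          # first token of each piece
            for i in range(start + 1, end):
                tags[i] = "I-" + label          # remaining tokens of the piece
    return tags


# Reading (a): three separate parallel scenes.
three_scenes = [("H", [(0, 2)]), ("H", [(2, 4)]), ("H", [(4, 6)])]
# Reading (b): the first scene is interrupted by an embedded clause and resumes after it.
embedded_clause = [("H", [(0, 2), (4, 6)]), ("H", [(2, 4)])]

tags_a = encode_children_bio(6, three_scenes)
tags_b = encode_children_bio(6, embedded_clause)
print(tags_a)
print(tags_a == tags_b)  # both readings collapse to the same tag sequence
```

Because the two readings produce identical tags, a decoder that only sees the BIO sequence cannot tell which scene the final segment belongs to.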

Graph Decoding
During the decoding phase, we convert the BIO labels into graph nodes. To do so, we add a few constraints to ensure the outputs are feasible UCCA graphs that respect the sentence's structure:
• We merge parallel scenes (H) that have neither a verb nor an action noun into the nearest previous scene that has one.
• Within each parallel scene, we force the existence of one and only one State (S) or Process (P) by selecting the token with the highest probability of State or Process.
• For scenes (H) and arguments (A), we do not allow multi-word expressions (MWEs) and chunks to be split across different graph nodes. If the boundary between two segments lies inside a chunk or MWE, the segments are merged.

Figure 1 (panels: Step 1, Step 2.A, Step 2.B, Step 3): Step 1 parses the sentence to extract parallel scenes (H) and links (L). Then Steps 2.A and 2.B each use a different mask to parse these scenes and extract arguments (A) and processes (P), which are recursively parsed until terminal nodes are reached.
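The conversion from BIO labels to node spans, together with the constraint on chunks and MWEs, can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, a stray `I-` tag is leniently treated as the start of a new span, the merged segment keeps the label of the earlier segment, and the caller is expected to pass only the H and A segments to which the rule applies.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]   # (start, end exclusive, label)


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Group a BIO tag sequence into labelled contiguous spans."""
    spans: List[Span] = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):          # trailing "O" closes the last span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


def merge_inside_chunks(spans: List[Span], chunks: List[Tuple[int, int]]) -> List[Span]:
    """Merge adjacent spans whose shared boundary lies strictly inside a chunk or MWE."""
    merged: List[Span] = []
    for start, end, label in spans:
        if merged:
            p_start, p_end, p_label = merged[-1]
            splits_chunk = p_end == start and any(
                c_start < start < c_end for c_start, c_end in chunks)
            if splits_chunk:
                merged[-1] = (p_start, end, p_label)   # keep the earlier label (assumption)
                continue
        merged.append((start, end, label))
    return merged


tags = ["B-A", "I-A", "B-A", "I-A", "B-P"]
spans = bio_to_spans(tags)
print(spans)
# A chunk covering tokens 1..2 straddles the boundary between the two A segments.
print(merge_inside_chunks(spans, chunks=[(1, 3)]))
```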

Remote Edges
Our approach easily handles remote edges. We consider remote arguments to be those detected outside the parent node's span (see REM in Fig. 1). Our earlier models showed low recall on remotes. To fix this, we introduced a detection threshold on the output probabilities. This increased the recall at the cost of some precision. The detection threshold was tuned on the validation set.
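Tuning such a detection threshold can be sketched as a simple sweep over candidate values. The selection criterion used here, F1 on remote edges, is an assumption, since the text only states that the threshold was chosen on the validation set; the function name and the toy scores are hypothetical.

```python
from typing import List, Sequence, Tuple


def tune_threshold(scores: Sequence[float],
                   gold: Sequence[bool],
                   candidates: Sequence[float]) -> Tuple[float, float]:
    """Return the candidate threshold with the highest F1, and that F1.

    scores: predicted probability that each candidate token is a remote argument.
    gold:   whether each candidate token is a remote argument in the reference.
    """
    best_t, best_f1 = candidates[0], -1.0
    for t in candidates:
        pred = [s >= t for s in scores]
        tp = sum(1 for p, g in zip(pred, gold) if p and g)
        fp = sum(1 for p, g in zip(pred, gold) if p and not g)
        fn = sum(1 for p, g in zip(pred, gold) if not p and g)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:                 # ties keep the earlier candidate
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Toy validation scores: lowering the threshold from 0.5 to 0.3 recovers a
# missed remote (higher recall) while admitting one false positive.
val_scores = [0.9, 0.4, 0.35, 0.1]
val_gold = [True, True, False, False]
print(tune_threshold(val_scores, val_gold, candidates=[0.5, 0.3, 0.2]))
```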

UCCA Task Data
For French, we enriched the treebank with XPOS tags from our lexicon. Finally, since tokenization is pre-established in the UCCA corpus, we projected the improved POS tags and dependency parses onto the original tokenization of the task.

Supplementary lexicon
We observed that a major difficulty in UCCA parsing is analyzing idioms and phrases. Unawareness of these expressions, which are mostly used as links between scenes, misleads the model during the early stages of the inference, and the errors get propagated through the graph. To boost the performance of our model when detecting links and parallel scenes, we developed an internal list of about 500 expressions for each language. These lists include prepositional, adverbial and conjunctive expressions and are used to compute Boolean features indicating the words in the sentence which are part of an expression.
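Computing such a Boolean feature from an expression list can be sketched as follows. The matching strategy (case-insensitive, exact token match) and the example expressions are assumptions made for illustration; the actual lists and matching rules are not described in the text.

```python
from typing import Iterable, List, Tuple


def expression_flags(tokens: List[str],
                     expressions: Iterable[Tuple[str, ...]]) -> List[bool]:
    """Return one Boolean per token: True if the token is part of a listed expression."""
    lowered = [t.lower() for t in tokens]
    flags = [False] * len(tokens)
    for expr in expressions:
        n = len(expr)
        # Slide a window of the expression's length over the sentence.
        for i in range(len(lowered) - n + 1):
            if tuple(lowered[i:i + n]) == expr:
                for j in range(i, i + n):
                    flags[j] = True
    return flags


# Hypothetical lexicon entries, stored as lowercase token tuples.
lexicon = {("in", "order", "to"), ("as", "well", "as")}
sentence = ["She", "left", "In", "order", "to", "rest"]
print(expression_flags(sentence, lexicon))
```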

Multilingual Training
This model uses multilingual word embeddings trained using fastText (Bojanowski et al., 2017) and aligned using MUSE (Conneau et al., 2017). This is done in order to ease cross-lingual training.
In prior experiments we introduced an adversarial objective similar to Kim et al. (2017) and Marzinotto et al. (2019) to build a language-independent representation. However, the language imbalance in the training data did not allow us to take advantage of this technique. Hence, we simply merged the training data from the different languages.

Experiments
We focus on obtaining the model that best generalizes to the French language. We trained our model for 50 epochs and selected the best one on the validation set. In our experiments we did not use any product of experts or bagging technique, and we did not run any hyperparameter optimization.
We trained several models on training corpora built from different language combinations. We obtained our best model using the training data for all the languages. This model, FR+DE+EN, achieved 63.6% avg. F1 on the French validation set, compared to 63.1% for FR+DE, 62.9% for FR+EN and 50.8% for FR only.

Main Results
In Table 2 we provide the performance of our model for all the open tracks, along with the results of the TUPA baseline for comparison. Our model finishes 4th in the French Open Track with an average F1 score of 65.4%, very close to the 3rd place, which had an F1 of 65.6%. For languages with larger training corpora, our model did not outperform the monolingual TUPA.

Error Analysis
In Table 3 we give the performance by arc type. We observe that the main performance bottleneck is in the parallel scene segmentation (H). Due to our recursive parsing approach, this kind of error is particularly harmful to the model performance, because scene segmentation errors at the early steps of the parsing may induce errors in the rest of the graph. To verify this, we used the validation set to compare the performance on mono-scene sentences (with no potential scene segmentation problems) with that on multi-scene sentences. For the French track we obtained 67.2% avg. F1 on the 114 mono-scene sentences compared to 61.9% avg. F1 on the 124 multi-scene sentences.

Conclusions
We described an original approach to recursively build the UCCA semantic graph using a sequence tagger along with a masking mechanism and a decoding policy. Even though this approach did not yield the best results in the UCCA task, we believe that our original recursive, mask-based parsing can be helpful for low-resource languages. Moreover, we believe that this model could be further improved by introducing a global criterion and by performing further hyperparameter tuning.