Discourse Representation Structure Parsing with Recurrent Neural Networks and the Transformer Model

We describe the systems we developed for Discourse Representation Structure (DRS) parsing as part of the IWCS-2019 Shared Task of DRS Parsing. Our systems are based on sequence-to-sequence modeling. To implement our models, we use OpenNMT-py, an open-source neural machine translation system implemented in PyTorch. We experimented with a variety of encoder-decoder models based on recurrent neural networks and the Transformer model. We conduct experiments on the standard benchmark of the Parallel Meaning Bank (PMB 2.2). Our best system achieves an F1 score of 84.8% in the DRS parsing shared task.


Introduction
Discourse Representation Theory is a popular theory of meaning representation designed to account for a variety of linguistic phenomena, including the interpretation of pronouns and temporal expressions within and across sentences (Kamp and Reyle, 1993). The Groningen Meaning Bank (GMB; Bos et al. 2017) provides a large collection of English texts annotated with Discourse Representation Structures (DRSs), while the Parallel Meaning Bank (PMB; Abzianidze et al. 2017) provides DRSs in English, German, Italian and Dutch. Furthermore, the PMB introduces a clause representation, as shown at the top of Figure 1.
With the recent adoption of neural network learning in the Natural Language Processing community, several neural parsers have been developed for DRS parsing, i.e. the problem of taking a document or a sentence as input and outputting the corresponding DRS. Liu et al. (2018) convert box-style DRSs to tree-style DRSs and propose a three-step tree DRS parser on the GMB, while van Noord et al. (2018) adopt a neural machine translation approach to parse sentences to their clause-style DRSs on the PMB. Because the annotation standards of the GMB and the PMB differ, and because the IWCS-2019 Shared Task of DRS Parsing (Abzianidze et al., 2019) mainly focuses on the relatively short sentences of the PMB, our systems take sentences as input and output clause-style PMB DRSs represented as sequences.

Preprocessing
The Preprocess step is applied to the sentences and DRSs of the training data and to the sentences of the development and test data. We tried two levels of preprocessing: character level and word level.
Character Level. We use the scripts of van Noord et al. (2018) to perform character-level preprocessing for sentences and their DRSs. Each sentence is separated into characters, where a special symbol "|||" is used to mark a word boundary. The clauses are represented as character sequences, except for the semantic roles, DRS operators and deictic constants, as shown in Figure 3(a). For example, "b1 REF e1" is preprocessed to "$NEW ||| REF", which means that a new box (b1) is constructed and a new referent (e1) is introduced by the box; "b2 ground floor "n.01" x1" is preprocessed to "$0 ||| g r o u n d f l o o r ||| " n . 0 1 " ||| @0", which means that the sense ground floor.n.01 is constructed and then assigned to the referent @0, the most recently introduced referent. Here @n (n ∈ Z) denotes the |n|-th most recently introduced referent. Similarly, $n (n ∈ Z) denotes the |n|-th most recently constructed box.
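The character-level encoding of sentences and the relative naming of referents described above can be illustrated with a minimal sketch. This is not the script of van Noord et al. (2018); the function names are hypothetical, and we assume that referents that have already been introduced receive non-positive offsets, so that @0 is the latest referent and @-1 the one before it.

```python
WORD_BOUNDARY = "|||"


def sentence_to_chars(sentence):
    """Split a whitespace-tokenized sentence into characters,
    marking each word boundary with the special symbol."""
    words = sentence.split()
    return f" {WORD_BOUNDARY} ".join(" ".join(word) for word in words)


def chars_to_sentence(encoded):
    """Invert sentence_to_chars for sentences that do not contain the symbol."""
    words = encoded.split(f" {WORD_BOUNDARY} ")
    return " ".join(word.replace(" ", "") for word in words)


def relative_referent(introduced, target):
    """Name a referent relative to the most recently introduced one.

    `introduced` lists the referents in order of introduction.
    Assumption: the latest referent is @0, the previous one @-1, and so on.
    """
    offset = introduced.index(target) - (len(introduced) - 1)
    return f"@{offset}"


encoded = sentence_to_chars("i live on the ground floor .")
# encoded starts with "i ||| l i v e ||| o n ||| ..."
restored = chars_to_sentence(encoded)
```

The round trip through `chars_to_sentence` shows that the word-boundary symbol keeps the encoding lossless for tokenized input.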
Word Level. Each sentence is tokenized using the Moses script and then lowercased. Clauses are represented as sequences without changing their order, where a special symbol "|||" is used to start a new clause. We remove quotation marks from clauses (e.g. "tom" is converted to tom).
Figure 3: (a) character-level preprocessing of the sentence "i live on the ground floor ." and its DRS.
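The word-level linearization can be sketched as follows. This is a simplified illustration with hypothetical function names: the sentence is assumed to be tokenized already (the Moses tokenizer is not called here), every clause, including the first, is assumed to be preceded by "|||", and the rewriting of variables to relative indices is omitted.

```python
CLAUSE_START = "|||"


def lowercase_sentence(tokenized_sentence):
    """Lowercase an already tokenized sentence."""
    return tokenized_sentence.lower()


def linearize_clauses(clauses):
    """Join clauses into one target sequence, in their original order.

    Quotation marks are removed; clause tokens keep their case.
    """
    parts = []
    for clause in clauses:
        parts.append(CLAUSE_START)  # assumed to precede every clause
        parts.extend(token.replace('"', "") for token in clause.split())
    return " ".join(parts)


source = lowercase_sentence("I live on the ground floor .")
target = linearize_clauses(['b1 REF x1', 'b1 Name x1 "tom"'])
```

The two example clauses are illustrative only and are not taken from the PMB.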

Neural Models
We adopt Recurrent Neural Networks (RNNs) equipped with Long Short-Term Memory (LSTM; Hochreiter and Schmidhuber 1997) units and the Transformer model (Vaswani et al., 2017) as our neural models.
We use the implementations provided by the OpenNMT-py toolkit (Klein et al., 2017).
The hyperparameters we used are shown in Table 1; they were set intuitively, without optimization.

Fine-tuning
We propose a fine-tuning approach that enables the system to make effective use of additional training data of varying quality, i.e. bronze and silver data. The system is first trained to convergence on one dataset (e.g. silver and gold data) and then trained further to convergence on another dataset (e.g. gold data), with the optimizers reset between the two stages.
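The staged training schedule can be illustrated on a toy problem. The sketch below is not our OpenNMT-py setup: it fits a single parameter by gradient descent with momentum, and all names and data values are hypothetical. Its only purpose is to show the two ingredients of the approach, namely that each stage runs to convergence and that each stage starts with a fresh optimizer state while keeping the learned parameter.

```python
class MomentumSGD:
    """Gradient descent with momentum; `velocity` is the optimizer state."""

    def __init__(self, lr=0.1, beta=0.9):
        self.lr = lr
        self.beta = beta
        self.velocity = 0.0

    def step(self, w, grad):
        self.velocity = self.beta * self.velocity + grad
        return w - self.lr * self.velocity


def mse_gradient(w, data):
    """Gradient of the mean squared error between w and the data points."""
    return sum(2.0 * (w - x) for x in data) / len(data)


def train_to_convergence(w, data, tol=1e-9, max_steps=10_000):
    """Train until gradient and momentum vanish, starting from parameter w.

    A new optimizer is created on every call, i.e. the optimizer is reset
    between stages while the parameter w is carried over.
    """
    optimizer = MomentumSGD()
    for _ in range(max_steps):
        grad = mse_gradient(w, data)
        if abs(grad) < tol and abs(optimizer.velocity) < tol:
            break
        w = optimizer.step(w, grad)
    return w


silver_and_gold = [1.0, 2.0, 3.0, 6.0]  # toy stand-in for sg-data
gold = [5.0, 7.0]                       # toy stand-in for g-data

w = train_to_convergence(0.0, silver_and_gold)  # stage 1
w = train_to_convergence(w, gold)               # stage 2: fine-tuning
```

After the second stage the parameter sits at the optimum of the second dataset, which mirrors how the final stage determines the data the model is adapted to.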

Postprocessing and Evaluation
We adopt the postprocessing scripts of van Noord et al. (2018) to transform the output of our models back to the clause format, and then use Counter (van Noord et al., 2018) as our evaluation metric.
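To make the reported F1 scores concrete, the following sketch computes precision, recall and F1 over exactly matching clauses. It is a simplification and not the Counter tool: Counter additionally searches for a mapping between the variables of the two DRSs before counting matches, which is skipped here by assuming that variable names already agree. The function name and the example clauses are hypothetical.

```python
from collections import Counter as Multiset


def clause_f1(predicted, gold):
    """Return (precision, recall, F1) over exactly matching clauses."""
    pred_counts = Multiset(predicted)
    gold_counts = Multiset(gold)
    # Multiset intersection counts each clause at most as often as it
    # appears in both the prediction and the gold standard.
    matched = sum((pred_counts & gold_counts).values())
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2.0 * precision * recall / (precision + recall)


predicted = ["b1 REF x1", "b1 person n.01 x1", "b1 REF e1"]
gold = ["b1 REF x1", "b1 person n.01 x1", "b1 REF e1", "b1 live v.01 e1"]
precision, recall, f1 = clause_f1(predicted, gold)
```

In this example three of the four gold clauses are produced and no spurious clause is added, so precision is 1, recall is 3/4 and F1 is 6/7.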

Experiments
In this section, we introduce the training data that we used and the results on the PMB benchmarks.

Data
The training data consists of all of the bronze data (bronze), all of the silver data (silver), and the training section of the gold data (gold). All data is preprocessed. We refer to the mixture of bronze, silver and gold as bsg-data, to the mixture of silver and gold as sg-data, and to the training section of the gold data alone as g-data. We also use GloVe (Pennington et al., 2014) pre-trained word embeddings to initialize the representations of input tokens.

Results
Table 2 shows the results on the test data, where sg-data means that the models are trained only on sg-data, and + g-data means that the models are further fine-tuned on g-data. With the LSTM, the character model performs marginally better than the word model. With the Transformer, however, the word model performs significantly better than the character model. With both the LSTM and the Transformer, fine-tuning on g-data significantly improves performance. Although the character LSTM is marginally better than the word Transformer, we still prefer the word Transformer as our final model, because it can be trained faster.
Table 3 shows the improved results on the test dataset obtained by the word Transformer with bronze data, where bsg-data means that the model is trained only on bsg-data, + sg-data means that the model is further fine-tuned on sg-data, and + g-data means that the model is then fine-tuned on g-data. As shown in Tables 2 and 3, the improvement from fine-tuning on sg-data after bsg-data (3.24% F1) is smaller than the improvement from fine-tuning on g-data after sg-data (8.84% F1). Fine-tuning on g-data may therefore be the key to improving performance on the test dataset. We believe this is due to the high similarity between g-data and the test data. We also find that a model trained on bsg-data and then fine-tuned directly on g-data performs well, although slightly worse than the final model.
We submitted the word Transformer trained on bsg-data + sg-data + g-data as our final model to the DRS parsing shared task. On the test dataset of the shared task, our model achieves an F1 score of 84.80.

Analysis
We further analyze the output of the parsers trained on sg-data + g-data to see which components of the meaning representation are challenging. Table 4 shows the detailed results of Counter, where DRS Operators (e.g. negation), Roles (e.g. Agent), Concepts (i.e. predicates) and Synsets (e.g. "n.01") are scored separately.
We compare four parsing models: LSTM with character-level preprocessing (char-LSTM), LSTM with word-level preprocessing (word-LSTM), Transformer with character-level preprocessing (char-transformer) and Transformer with word-level preprocessing (word-transformer). The char-LSTM and word-transformer models both achieve good performance: word-transformer performs best on DRS Operators, Concepts, Synsets-Noun and Synsets-Adjectives, and char-LSTM performs best on Roles and Synsets-Verbs. The word-LSTM model is mediocre overall, but it outperforms the other models on Synsets-Adverbs by a large margin, 35.14% F1 on average.

Conclusions
In this paper, we describe our systems for the IWCS-2019 Shared Task of DRS Parsing. We found that the character-level LSTM and the word-level Transformer are competitive in the task. The training time of the LSTM models increases with the length of the input sequences, while the training time of the Transformer is not sensitive to input length. The outputs of the LSTM models and the Transformers have different error distributions. There is still considerable room for improvement for sequential models.

Figure 2
Figure 2 shows the data pipeline in our system for both training and parsing. There are three main parts: (a) the Preprocess component, which prepares the input data to make it suitable for training and parsing; (b) the Neural Model component, which is based on OpenNMT; (c) the Postprocess component, which contains rules to ensure that the system output is a well-formed DRS.

Figure 1: The clause representation (top) and box-style representation (bottom) for the sentence "I live on the ground floor."

Table 1: Choice of hyperparameters for our neural network models.
In the word-level preprocessing, clauses remain case-sensitive. Following previous work (van Noord et al., 2018), the indices of variables in clauses are relative, as shown in Figure 3(b), which is the same as in the character-level preprocessing.

Table 2: Results on the test partition of the Parallel Meaning Bank.

Table 3: Results on the test dataset with the word Transformer.

Table 4: F1 scores of the fine-grained evaluation on the test dataset.