Linguistic Information in Neural Semantic Parsing with Multiple Encoders

Recently, sequence-to-sequence models have achieved impressive performance on a number of semantic parsing tasks. However, they often do not exploit available linguistic resources, even though these, when employed correctly, are likely to increase performance further. Research in neural machine translation has shown that employing this information has a lot of potential, especially when using a multi-encoder setup. We employ a range of semantic and syntactic resources to improve performance on the task of Discourse Representation Structure parsing. We show that (i) linguistic features can be beneficial for neural semantic parsing and (ii) the best method of adding these features is by using multiple encoders.

1 Introduction

2 Data and Methodology

Discourse Representation Structures
DRSs are formal meaning representations based on Discourse Representation Theory (Kamp and Reyle, 1993). We use the version of DRT provided in the Parallel Meaning Bank (PMB), a semantically annotated parallel corpus with texts in English, Italian, German and Dutch. DRSs are rich meaning representations covering quantification, negation, reference resolution, comparison operators, discourse relations, concepts based on WordNet, and semantic roles based on VerbNet.
All experiments are performed using the data of the PMB. In our experiments, we only use the English texts and corresponding DRSs. We use PMB release 2.2.0, which contains gold standard (fully manually annotated) data of which we use 4,597 as train, 682 as dev and 650 as test instances. It also contains 67,965 silver (partially manually annotated) and 120,662 bronze (no manual annotations) instances. Most sentences are between 5 and 15 tokens in length. Since we will compare our results mainly to Van Noord, Abzianidze, Toral, and Bos (2018), we will only employ the gold and silver data.
Figure 1: DRS in box format (a), gold clause representation (b) and example system output (c) for "I am not working for Tom", with a precision of 5/8 and a recall of 5/9, resulting in an F-score of 58.8.

Representing Input and Output
We represent the source and target data in the same way as Van Noord, Abzianidze, Toral, and Bos (2018), who represent the source sentence as a sequence of characters, with a special character indicating uppercase characters. The target DRS is also represented as a sequence of characters, with the exception of DRS operators, thematic roles and DRS variables, which are represented as super characters (Van Noord and Bos, 2017b), i.e. individual tokens. Since the variable names themselves are meaningless, the DRS variables are rewritten to a more general representation, using the De Bruijn index (de Bruijn, 1972). In a post-processing step, the original clause structure is restored.

To include morphological and syntactic information, we apply a lemmatizer, POS-tagger and dependency parser using Stanford CoreNLP (Manning et al., 2014), similar to Sennrich and Haddow (2016) for machine translation. The lemmas and POS-tags are added as a token after each word. For the dependency parse, we add the incoming arc for each word. We also apply the easyCCG parser of Lewis and Steedman (2014), using the supertags.

Finally, we exploit semantic information by using semantic tags (Bjerva et al., 2016). Semantic tags are language-neutral semantic categories, which are assigned to a word in a similar fashion as part-of-speech tags. They are able to express important semantic distinctions, such as negation, modals and types of quantification. We train a semantic tagger with the TnT tagger (Brants, 2000) on the gold and silver standard data in the PMB release. Examples of the input to the model for each source of information are shown in Table 1.

There are two ways to add the linguistic information: (1) merging all the information (i.e., input text and linguistic information) in a single encoder, or (2) using multiple encoders (i.e., encoding the input text and the linguistic information separately).
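As an illustration of the variable rewriting step, the sketch below maps DRS variable names to order-of-introduction indices, so the names themselves carry no information. This is a simplified, absolute-index variant of the De Bruijn-style rewriting; the variable-name pattern and the `@` token are assumptions for illustration only.

```python
def rewrite_variables(clauses):
    """Rewrite DRS variable names (e.g. b1, x1, e2) to generic
    order-of-introduction indices (simplified sketch; the actual
    rewriting uses relative De Bruijn indices)."""
    mapping = {}
    out = []
    for clause in clauses:
        new_clause = []
        for tok in clause.split():
            # Heuristic: a variable is a known type letter plus digits.
            if tok[0] in "bxestp" and tok[1:].isdigit():
                if tok not in mapping:
                    mapping[tok] = f"@{len(mapping)}"
                new_clause.append(mapping[tok])
            else:
                new_clause.append(tok)
        out.append(" ".join(new_clause))
    return out
```

In a post-processing step, the inverse mapping restores concrete variable names so the clause structure can be evaluated.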
Multi-source encoders were initially introduced for multilingual translation (Zoph and Knight, 2016; Firat et al., 2016; Libovický and Helcl, 2017), but have recently been used to introduce syntactic information to the model (Currey and Heafield, 2018). Table 2 shows examples of how the input is structured when using one or more encoders. Experiments showed that using more than two encoders drastically decreased performance. Therefore, we merge all the linguistic information in a single additional encoder (see the last row of Table 2).
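The two input layouts can be sketched as follows. The token layout is an illustration, not the exact PMB format; `features` is assumed to hold one word-aligned tag sequence per source of linguistic information.

```python
def build_inputs(words, features, multi_encoder=True):
    """Arrange input text and linguistic features for the encoder(s).
    Single encoder: features are interleaved with the words in one stream.
    Multiple encoders: each source of information stays a separate stream."""
    if multi_encoder:
        return [" ".join(words)] + [" ".join(f) for f in features]
    merged = []
    for i, w in enumerate(words):
        merged.append(w)
        merged.extend(f[i] for f in features)
    return [" ".join(merged)]
```

For example, `build_inputs(["I", "work"], [["PRP", "VBP"]], multi_encoder=False)` yields the single interleaved stream `["I PRP work VBP"]`, while the multi-encoder variant keeps `"I work"` and `"PRP VBP"` as two separate streams.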

Neural Architecture
We employ a recurrent sequence-to-sequence neural network with attention (Bahdanau et al., 2014) and two bi-LSTM layers, similar to the one used by Van Noord, Abzianidze, Toral, and Bos (2018). However, their model was trained with OpenNMT (Klein et al., 2017), which does not support multiple encoders. Therefore, we switch to the sequence-to-sequence framework implemented in Marian (Junczys-Dowmunt et al., 2018). We use model-type s2s (for a single encoder) or multi-s2s (for multiple encoders).
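A hypothetical Marian invocation illustrating the model-type switch; the file names are placeholders and any flags beyond `--type` and `--train-sets` would need to be filled in from the actual configuration (Table 3).

```shell
# single encoder: one source file, one target file
marian --type s2s --train-sets input.chars.txt drs.txt

# multiple encoders: one source file per encoder, one shared target
marian --type multi-s2s --train-sets input.chars.txt features.txt drs.txt
```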
For the latter, the multiple inputs are encoded separately by identical RNNs (without parameter sharing). The encoders share a single decoder, in which the resulting context vectors are concatenated. An attention layer is then applied to selectively give more attention to certain parts of the vector (e.g., it can learn that the words themselves are more important than the POS-tags). A detailed overview of our parameter settings, found after a search on the dev set, can be found in Table 3. When only using gold data, training is stopped after 15 epochs. For gold + silver data, we stop training after 6 epochs, after which we restart the training process from that checkpoint to fine-tune on only the gold data, also for 6 epochs.
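A toy illustration of how the shared decoder can attend over the concatenated encoder states; the dot-product scoring here is a simplified stand-in for the actual attention layer, and the vectors are made up.

```python
import math

def attend(decoder_state, annotations):
    """Score each annotation vector against the decoder state (dot
    product), normalize with a softmax, and return the weighted
    context vector together with the attention weights."""
    scores = [sum(d * a for d, a in zip(decoder_state, ann))
              for ann in annotations]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(decoder_state)
    context = [sum(w * ann[i] for w, ann in zip(weights, annotations))
               for i in range(dim)]
    return context, weights

# Annotations from the two encoders are simply concatenated along the
# time axis, so attention can weight text positions against feature
# positions within one distribution:
text_enc = [[1.0, 0.0], [0.0, 1.0]]
feat_enc = [[0.5, 0.5]]
context, weights = attend([1.0, 0.0], text_enc + feat_enc)
```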

Evaluation Procedure
Produced DRSs are compared with the gold standard representations by using COUNTER (Van Noord, Abzianidze, Haagsma, and Bos, 2018). This is a tool that calculates micro precision, recall and F-score over matching clauses, similar to the SMATCH (Cai and Knight, 2013) evaluation tool for AMR parsing. All clauses have the same weight in matching, except for REF clauses, which are ignored. An example of the matching procedure is shown in Figure 1. The produced DRSs go through a strict syntactic and semantic validation process, as described in Van Noord, Abzianidze, Toral, and Bos (2018). If a produced DRS is invalid, it is replaced by a dummy DRS, which gets an F-score of 0.0.
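The clause-level scoring can be sketched as below. This uses exact set overlap for simplicity; the real COUNTER additionally searches over variable mappings between the two DRSs, so this is only an illustration of the metric itself.

```python
def clause_f_score(produced, gold):
    """Micro precision, recall and F-score over matching clauses,
    ignoring REF clauses as described above (simplified: exact
    overlap, no variable-mapping search)."""
    prod = {c for c in produced if "REF" not in c.split()}
    ref = {c for c in gold if "REF" not in c.split()}
    matched = len(prod & ref)
    p = matched / len(prod) if prod else 0.0
    r = matched / len(ref) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

With 5 matching clauses out of 8 produced and 9 gold, as in Figure 1, this yields P = 0.625, R ≈ 0.556 and F ≈ 0.588.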
We check whether two systems differ significantly by performing approximate randomization (Noreen, 1989), with α = 0.05, R = 1000 and F(model1) > F(model2) as the test statistic for each DRS pair.
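A sketch of the significance test under these settings, taking the paired per-DRS F-scores of the two systems as input. The add-one smoothed p-value is a common convention for this test, not necessarily the paper's exact implementation.

```python
import random

def approx_randomization(scores_a, scores_b, R=1000, seed=0):
    """Approximate randomization test (Noreen, 1989): randomly swap
    the paired per-DRS scores of the two systems and count how often
    the shuffled difference is at least the observed one."""
    rng = random.Random(seed)
    observed = sum(scores_a) - sum(scores_b)
    hits = 0
    for _ in range(R):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                a, b = b, a  # swap this pair
            diff += a - b
        if diff >= observed:
            hits += 1
    return (hits + 1) / (R + 1)  # smoothed p-value
```

A p-value below α = 0.05 then indicates a significant difference between the two systems.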

Results and Discussion
We perform all our experiments twice: (i) only using gold data for training and (ii) with both gold (fully manually annotated) and silver (partially manually annotated) data.
The results of adding external sources of linguistic information are shown in Table 4. We clearly see that using an additional encoder for the linguistic information is superior to merging all the information in a single encoder. For two encoders and only using gold data, the scores increase by at least 0.7 for each source of information individually. Lemmatization shows the highest improvement, most likely because the DRS concepts that need to be produced are often lemmatized versions of the source words. When we stack the linguistic features, we observe an improvement for each addition, resulting in a final 2.7 point F-score increase over the baseline.
If we also employ silver data, we again observe that the multi-encoder setup is preferable to a single encoder, both when isolating and when stacking the linguistic features. In isolation, the results are similar to only using gold data, with the exception of the semantic tags, which now even hurt performance. Interestingly, when stacking the linguistic features, there is no improvement over only using the lemmas of the source words.
We now compare our best models to previous parsers (Bos, 2015; Van Noord, Abzianidze, Toral, and Bos, 2018) and two baseline systems, SPAR and SIM-SPAR. As previously indicated, Van Noord, Abzianidze, Toral, and Bos (2018) used a similar sequence-to-sequence model as our current approach, but implemented in OpenNMT and without the linguistic features. Boxer (Bos, 2008, 2015) is a DRS parser that uses a statistical CCG parser for syntactic analysis and a compositional semantics based on λ-calculus, followed by pronoun and presupposition resolution. SPAR is a baseline system that outputs the same DRS for each test instance, while SIM-SPAR outputs the DRS of the most similar sentence in the training set, based on a simple word embedding metric.

The results are shown in Table 5. Our model clearly outperforms the previous systems, even when only using gold standard data. When compared to Van Noord, Abzianidze, Toral, and Bos (2018), retrained on the same data used in our systems, the largest improvement (3.6 and 3.5 for dev and test) comes from switching frameworks and changing certain parameters such as the optimizer and learning rate. However, the linguistic features are clearly still beneficial when using only gold data (increases of 2.7 and 1.9 for dev and test), and also still help when employing additional silver data (increases of 1.1 and 0.3 for dev and test, both significant).

Conclusions
In this paper we have shown that a range of linguistic features can improve the performance of sequence-to-sequence models for the task of parsing Discourse Representation Structures. We have shown empirically that the best method of adding these features is a multi-encoder setup, as opposed to merging the sources of linguistic information in a single encoder. We believe that this method can also be beneficial for other semantic parsing tasks in which sequence-to-sequence models do well.