Character-level Representations Improve DRS-based Semantic Parsing Even in the Age of BERT

We combine character-level and contextual language model representations to improve performance on Discourse Representation Structure parsing. Character representations can easily be added in a sequence-to-sequence model, either in a single encoder or as a fully separate encoder, with improvements that are robust to different language models, languages and data sets. For English, these improvements are larger than adding individual sources of linguistic information or adding non-contextual embeddings. A new method of analysis based on semantic tags demonstrates that the character-level representations improve performance across a subset of selected semantic phenomena.


Introduction
Character-level models have obtained impressive performance on a number of NLP tasks, ranging from classic POS-tagging (Santos and Zadrozny, 2014) to complex tasks such as Discourse Representation Structure (DRS) parsing (van Noord et al., 2018b). However, this was before the large pretrained language models (Peters et al., 2018; Devlin et al., 2019) took over the field, with the consequence that for most NLP tasks, state-of-the-art performance is now obtained by fine-tuning one of these models (e.g., Conneau et al., 2020).
Does this mean that, despite a long tradition of being used in language-related tasks (see Section 2.1), character-level representations are no longer useful? We try to answer this question by looking at semantic parsing, specifically DRS parsing (Abzianidze et al., 2017; van Noord et al., 2018a). We aim to answer the following research questions:

1. Do pretrained language models (LMs) outperform character-level models for DRS parsing?
2. Can character and LM representations be combined to improve performance, and if so, what is the best method of combining them?
3. How do these improvements compare to adding linguistic features?
4. Are the improvements robust across different pretrained language models, languages, and data sets?
5. On what type of sentences do character-level representations specifically help?
Why semantic parsing? Semantic parsing is the task of automatically mapping natural language utterances to interpretable meaning representations. The produced meaning representations can then potentially be used to improve downstream NLP applications (e.g., Issa et al., 2018; Song et al., 2019; Mihaylov and Frank, 2019), though the introduction of large pretrained language models has shown that explicit formal meaning representations might not be a necessary component for achieving high accuracy. However, it is now known that these models lack reasoning capabilities, often simply exploiting statistical artifacts in the data sets instead of actually understanding language (Niven and Kao, 2019; McCoy et al., 2019). Moreover, Ettinger (2020) found that the popular BERT model (Devlin et al., 2019) completely failed to acquire a general understanding of negation. Relatedly, Bender and Koller (2020) contend that meaning cannot be learned from form alone, and argue for approaches that focus on grounding language (communication) in the real world. We believe formal meaning representations therefore have an important role to play in future semantic applications, as semantic parsers produce an explicit model of a real-world interpretation.
Why Discourse Representation Structures? DRS parsing is a task that combines logical, pragmatic and lexical components of semantics in a single meaning representation.

[Figure 1: an example DRS for the sentence "I haven't been to Boston since 2013."]
In semantic parsing, if character-level representations are employed, they are commonly used in combination with non-contextual word-level representations (Lewis et al., 2016; Ballesteros and Al-Onaizan, 2017; Groschwitz et al., 2018; Cai and Lam, 2019). There are a few recent studies that did use character-level representations in combination with BERT (Zhang et al., 2019a,b; Cai and Lam, 2020), though only Zhang et al. (2019a) provided an ablation score without the characters; moreover, it is not clear whether this small improvement was significant. van Noord and Bos (2017) and van Noord et al. (2018b), on the other hand, used solely character-level representations in an end-to-end fashion, using a bi-LSTM sequence-to-sequence model, which outperformed word-based models that employed non-contextual embeddings.

Discourse Representation Structures
DRSs are formal meaning representations introduced by Discourse Representation Theory (Kamp and Reyle, 1993) with the aim to capture the meaning of texts (Figure 1). The Parallel Meaning Bank (PMB; Abzianidze et al., 2017) annotates texts with DRSs in which concepts are disambiguated by senses from WordNet (Fellbaum, 1998) and thematic roles from VerbNet (Bonial et al., 2011). Moreover, its releases contain gold standard DRSs. For these reasons, we take the PMB as our corpus of choice to evaluate our DRS parsers.

DRS parsing Early approaches to DRS parsing employed rule-based systems for small English texts (Johnson and Klein, 1986; Wada and Asher, 1986; Bos, 2001). The first open-domain DRS parser is Boxer (Bos, 2008, 2015), which is a combination of rule-based and statistical models. Le and Zuidema (2012) used a probabilistic parsing model that exploited dependency structures to parse GMB data as graphs. More recently, Liu et al. (2018) proposed a neural model that produces (tree-structured) DRSs in three steps, first learning the general (box) structure of a DRS, after which specific conditions and referents are filled in. In follow-up work (Liu et al., 2019a), they extended this model by adding an improved attention mechanism and constraining the decoder to ensure well-formed output. This model achieved impressive performance on both sentence-level and document-level DRS parsing on GMB data. Fu et al. (2020) in turn improved on this work by employing a Graph Attention Network during both encoding and decoding. The introduction of gold standard DRSs in the PMB enabled a principled comparison of approaches. In our previous work (van Noord et al., 2018b), we showed that sequence-to-sequence models can successfully learn to produce DRSs, with characters as the preferred representation. In follow-up work, we improved on these scores by adding linguistic features (van Noord et al., 2019). The first shared task on DRS parsing (Abzianidze et al., 2019) sparked more interest in the topic, with a system based on stack-LSTMs (Evang, 2019) and a neural graph-based system (Fancellu et al., 2019).
The best system (Liu et al., 2019b) used a similar approach to van Noord et al. (2018b), but swapped the bi-LSTM encoder for a Transformer. We will compare our approach to these models in Section 4.
Method

Neural Architecture

As our baseline system, we start from a fairly standard sequence-to-sequence model with attention (Bahdanau et al., 2015), implemented in AllenNLP. We improve on this model in a number of ways, mainly based on Nematus (Sennrich et al., 2017): (i) we initialize the decoder hidden state with the mean of all encoder states, (ii) we add an extra linear layer between this mean encoder state and the initial decoder state and (iii) we add an extra linear layer after each decoder state.
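Improvements (i) and (ii) can be sketched as follows. This is a minimal NumPy illustration, not the actual AllenNLP implementation; the dimensions and the weight matrix `W` are toy values:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_decoder_state(encoder_states, W_init):
    """Initialize the decoder with the mean of all encoder states,
    passed through an extra linear layer with a tanh non-linearity."""
    c = encoder_states.mean(axis=0)   # mean over all source positions
    return np.tanh(W_init @ c)        # d_0 = tanh(W_init c)

# toy dimensions: 5 source positions, encoder size 6, decoder size 4
H = rng.standard_normal((5, 6))       # encoder hidden states
W = rng.standard_normal((4, 6))       # learnable initialization matrix
d0 = init_decoder_state(H, W)
print(d0.shape)  # (4,)
```

In the real model the encoder states come from a bi-LSTM and `W_init` is learned jointly with the rest of the network.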
Specifically, given a source sequence $(s_1, \ldots, s_l)$ of length $l$ and a target sequence $(t_1, \ldots, t_k)$ of length $k$, let $e_i$ be the embedding of source symbol $i$, let $h_i$ be the encoder hidden state at source position $i$ and let $d_j$ be the decoder state at target position $j$. A single forward encoder state is obtained as follows:

$\overrightarrow{h}_i = \text{LSTM}(e_i, \overrightarrow{h}_{i-1})$

The final state is obtained by concatenating the forward and backward hidden states:

$h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$

The decoder is initialized with the average over all encoder states:

$c^{tok} = \frac{1}{l}\sum_{i=1}^{l} h_i, \qquad d_0 = \tanh(W_{init}\, c^{tok})$

Characters in one encoder We will experiment with adding character-level information in either one or two encoders. For one encoder, we use the char-CNN (Kim et al., 2016), which runs a Convolutional Neural Network (LeCun et al., 1990) over the characters of each token. It applies convolution layers of certain widths, which in essence select n-grams of characters. For each width, it does this a predefined number of times, referred to as the number of filters. The filter vectors form a matrix, which is then pooled to a vector by taking the max value of each initial filter vector. A detailed schematic overview of this procedure is shown in Appendix A. However, we usually do not look at only a single width, but at a range of widths, e.g., [1, 2, 3, 4, 5]. In that case, we simply concatenate the resulting vectors to obtain our final char-CNN embedding: $e^{char}_i = [e^{w_1}; e^{w_2}; e^{w_3}; e^{w_4}; e^{w_5}]$. Each width-filter combination has independent learnable parameters. Finally, the char-CNN embedding is concatenated to the token-level representation, which is fed to the encoder: $e_i = [e^{tok}_i; e^{char}_i]$.
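The char-CNN procedure above can be sketched as follows (a simplified NumPy illustration of the Kim et al. (2016) architecture; the filter values, dimensions and the absence of a bias term are toy simplifications):

```python
import numpy as np

rng = np.random.default_rng(1)

def char_cnn_embed(char_vecs, filters_by_width):
    """Minimal char-CNN: for each width w, slide each filter over all
    character n-grams of that width, max-pool over positions, and
    concatenate the pooled vectors of all widths."""
    pooled = []
    for w, filters in filters_by_width.items():           # filters: (n_filters, w*dim)
        ngrams = [char_vecs[i:i + w].ravel()              # all w-grams of characters
                  for i in range(len(char_vecs) - w + 1)]
        scores = np.stack([filters @ g for g in ngrams])  # (n_positions, n_filters)
        pooled.append(scores.max(axis=0))                 # max over positions
    return np.concatenate(pooled)

dim, n_filters = 8, 10
chars = rng.standard_normal((4, dim))   # character embeddings, e.g. "h a v e"
filters = {w: rng.standard_normal((n_filters, w * dim)) for w in (1, 2, 3)}
emb = char_cnn_embed(chars, filters)
print(emb.shape)  # (30,) = 3 widths * 10 filters
```

In the full model this embedding is concatenated to the token embedding before the encoder; each width-filter combination is a learnable parameter.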
Characters in two encoders In the two-encoder setup, we run separate (but structurally identical) bi-LSTM encoders over the tokens and the characters, and concatenate the resulting context vectors before feeding them to the decoder:

$d_0 = \tanh(W_{init}\, [c^{tok}; c^{char}])$

In the decoder, we replace the LSTM with a doubly-attentive LSTM, based on the doubly-attentive GRU (Calixto et al., 2017). We apply soft-dual attention (Junczys-Dowmunt and Grundkiewicz, 2017) to be able to attend over both encoders in the decoder (also see Figure 2):

$\tilde{d}_j = \text{LSTM}(e_{t_{j-1}}, d_{j-1})$

$a^{tok}_j = \text{ATT}(C^{tok}, \tilde{d}_j), \qquad a^{char}_j = \text{ATT}(C^{char}, \tilde{d}_j)$

$d_j = \tanh(W_{dec}\, [\tilde{d}_j; a^{tok}_j; a^{char}_j])$

Here, $e_{t_{j-1}}$ is the embedding of the previously decoded symbol $t_{j-1}$, $C$ the set of encoder hidden states for either the tokens or the characters, ATT the (dot-product) attention function and $d_j$ the final decoder hidden state at step $j$. This model can easily be extended to more than two encoders, which we will experiment with in Section 4.
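The soft-dual attention step can be sketched as follows. This is a simplified NumPy illustration with plain dot-product attention; the LSTM recurrence is omitted and the combination matrix `W_out` is an assumption for the sketch:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dual_attention_step(d_prev, C_tok, C_char, W_out):
    """One decoder step attending over two encoders: compute a separate
    context vector from the token and character encoder states, then
    combine both with the decoder state."""
    def attend(C):                     # C: (src_len, dim)
        weights = softmax(C @ d_prev)  # dot-product attention scores
        return weights @ C             # weighted sum of encoder states
    a_tok, a_char = attend(C_tok), attend(C_char)
    return np.tanh(W_out @ np.concatenate([d_prev, a_tok, a_char]))

rng = np.random.default_rng(2)
dim = 6
d = rng.standard_normal(dim)                        # previous decoder state
C_tok = rng.standard_normal((5, dim))               # 5 token positions
C_char = rng.standard_normal((12, dim))             # 12 character positions
W_out = rng.standard_normal((dim, 3 * dim))
d_new = dual_attention_step(d, C_tok, C_char, W_out)
print(d_new.shape)  # (6,)
```

Note how the character sequence can be much longer than the token sequence; the two attention distributions are computed independently.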
This type of multi-source model is commonly used to represent different languages, e.g., in machine translation (Zoph and Knight, 2016;Firat et al., 2016) and semantic parsing (Susanto and Lu, 2017;Duong et al., 2017), though it has also been successfully applied in multi-modal translation (Libovický and Helcl, 2017), multi-framework semantic parsing (Stanovsky and Dagan, 2018) and adding linguistic information (Currey and Heafield, 2018;van Noord et al., 2019). To the best of our knowledge, we are the first to represent the characters as a source of extra information in a multisource sequence-to-sequence model.

Transformer
We also experiment with the Transformer model (Vaswani et al., 2017), using the stacked self-attention model as implemented in AllenNLP. A possible advantage of this model is that it might handle longer sentences and documents better. However, it might be harder to tune (Popel and Bojar, 2018) and its improved performance has mainly been shown for large data sets, as opposed to the generally smaller semantic parsing data sets (Section 3.3). Indeed, we cannot outperform the LSTM architecture (see Section 4), even when tuning more extensively. We therefore do not experiment with adding character-level representations to this architecture, though the char-CNN could be added in the same way as for the LSTM model.

Hyper-parameters To make a fair comparison, we conduct an independent hyper-parameter search on the development set for all nine input text representations (see Section 3.2) across the two neural architectures, starting from the settings of van Noord et al. (2019). We found that the best settings were very close for all systems, the only notable difference being that the learning rate of the Transformer models is considerably smaller than for the bi-LSTM models (0.0002 vs 0.001). For the char-CNN model, we use 100 filters, an embedding size of 75 and n-gram filter sizes of [1, 2, 3] for English and [1, 2, 3, 4, 5] for German, Italian and Dutch. For experiments where we add characters or linguistic features, the only extra search we do is over the size of the hidden vector of the RNN encoder (300-600), since this vector now has to contain more information and could potentially benefit from a larger size. Note that (possible) improved performance is not simply due to larger model capacity: during tuning of the baseline models, a larger RNN hidden size did not result in better performance.

Representations
We will experiment with five well-known pretrained language models: ELMO (Peters et al., 2018), BERT base/large (Devlin et al., 2019) and ROBERTA base/large (Liu et al., 2019c). The performance of these five large LMs is contrasted with the results of a character-level model and three word-based models. The word-based models either learn the embeddings from scratch or use non-contextual GLOVE (Pennington et al., 2014) or FASTTEXT (Grave et al., 2018) embeddings. Pre- and postprocessing of the DRSs is done using the method described in van Noord et al. (2018b). The DRSs are linearized, after which the variables are rewritten to a relative representation. The character-level model has character representations for the DRS concepts and constants, but not for variables, roles and operators. For all word-level models, the DRS concepts are initialized with GLOVE embeddings, while the other target tokens are learned from scratch.

BERT specifics For the BERT models, we obtained the best performance by only keeping the vector of the first WordPiece per original token (e.g., only keeping play out of play ##ing). For ROBERTA, it was best to use the WordPiece tokenization as is. Since linguistic features are added on the token level, we duplicate the semantic tags for multi-piece tokens of ROBERTA in Table 5. Interestingly, we found that for both BERT and ROBERTA, it was best to keep the pretrained weights frozen. This was not a small difference: models using fine-tuning always obtained considerably lower scores (45 to 60).
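The first-WordPiece selection can be sketched as follows. In BERT's WordPiece scheme, continuation pieces are prefixed with `##`, so keeping the first piece per original token amounts to dropping all `##`-prefixed pieces (the example piece list below is illustrative):

```python
def first_piece_indices(wordpieces):
    """Return the indices of the first WordPiece of each original token:
    pieces starting with '##' continue the previous token, so we keep
    only the non-continuation pieces."""
    return [i for i, p in enumerate(wordpieces) if not p.startswith("##")]

pieces = ["he", "was", "play", "##ing", "chess"]
idx = first_piece_indices(pieces)
print([pieces[i] for i in idx])  # ['he', 'was', 'play', 'chess']
```

The same indices would then be used to select rows from the BERT output matrix, yielding one vector per original token.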

Data and Evaluation
We use PMB releases 2.2.0 and 3.0.0 (https://pmb.let.rug.nl/data.php) in our experiments (Table 1). The latter is a larger and more diverse extension of 2.2.0 and will be used for most of our experiments; we use 2.2.0 to compare to previous work and to verify that our results are robust across data sets. The PMB releases contain DRSs for four languages (English, German, Italian and Dutch) at three levels of annotation: gold (fully manually checked), silver (partially manually corrected) and bronze (no manual corrections). To make a fair comparison to previous work, we only employ the gold and silver data, by pretraining on gold + silver data and subsequently fine-tuning on only the gold data. If no gold training data is available, we train on silver + bronze and fine-tune on silver. Unless otherwise indicated, our results are on the English development set of release 3.0.0.

Linguistic features We want to contrast our method of adding character-level information with adding sources of linguistic information. Based on van Noord et al. (2019), we employ these five sources: part-of-speech tags (POS), dependency parses (DEP), lemmas (LEM), CCG supertags (CCG) and semantic tags (SEM). For the first three sources, we use Stanford CoreNLP to parse the documents in our data set. The CCG supertags are obtained by using easyCCG (Lewis and Steedman, 2014). For semantic tagging, we train our own trigram-based tagger using TnT (Brants, 2000). Table 2 shows a tagged example sentence for all five sources of information. Moreover, we also include non-contextual GLOVE and FASTTEXT embeddings as an extra source of information.
We add these sources of linguistic information in the same way as we add the character-level information, in either one or two encoders (see Section 3.1). In two encoders, we can use the exact same architecture. For one encoder, we (obviously) do not use the char-CNN, but learn a separate embedding for the tags (of size 200), which is then concatenated to the token-level representation, i.e., $e_i = [e^{tok}_i; e^{ling}_i]$. If we use two encoders with an LM, characters and linguistic information (e.g., Table 4), the characters are added separately in the second encoder, while the LM and linguistic information representations are added in the first encoder.

Evaluation We compare the produced DRSs to the gold standard using Counter (van Noord et al., 2018a), which calculates micro precision, recall and F1-score based on the number of matching clauses. We use Referee (van Noord et al., 2018b) to ensure that the produced DRSs are syntactically and semantically well-formed (i.e., no free variables, no loops in subordinate relations) and form a connected graph. DRSs that are ill-formed get an F1-score of 0.0. All shown scores are F1-scores averaged over five training runs of the system, in which the same five random seeds are used. For significance testing we use approximate randomization (Noreen, 1989), with α = 0.05 and R = 1000.
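A paired approximate randomization test can be sketched as follows (a generic illustration of the Noreen (1989) procedure, not the exact implementation used in the paper; the per-document scores are made up):

```python
import random

def approximate_randomization(scores_a, scores_b, R=1000, seed=0):
    """Paired approximate randomization test: randomly swap the per-document
    scores of the two systems and count how often the absolute difference
    in means is at least as large as the observed one."""
    rng = random.Random(seed)
    mean = lambda xs: sum(xs) / len(xs)
    observed = abs(mean(scores_a) - mean(scores_b))
    hits = 0
    for _ in range(R):
        swapped_a, swapped_b = [], []
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:   # swap this pair with probability 0.5
                a, b = b, a
            swapped_a.append(a)
            swapped_b.append(b)
        if abs(mean(swapped_a) - mean(swapped_b)) >= observed:
            hits += 1
    return (hits + 1) / (R + 1)      # p-value with add-one smoothing

# hypothetical per-document F1-scores for two parsers
a = [0.90, 0.85, 0.88, 0.92, 0.80, 0.87, 0.91, 0.86]
b = [0.70, 0.72, 0.69, 0.75, 0.68, 0.71, 0.74, 0.70]
p = approximate_randomization(a, b)
print(p < 0.05)  # these two systems are clearly separated
```

With α = 0.05, the difference is called significant when the returned p-value is below 0.05.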
We also introduce and release DRS-JURY. This program provides a detailed overview of the performance of a DRS parser, but can also compare experiments, possibly over multiple runs. Features include significance testing, semantic tag analysis (Section 5.1), sentence length plotting (Section 5.2), new detailed Counter scores (Appendix D), and analysing (relative) best/worst produced DRSs (Appendix E). We hope this is a step in the direction of a more principled way of evaluating DRS parsers.

Results
LMs vs char-level models DRS parsing is no exception to the general trend in NLP: it is indeed the case that the pretrained language models outperform the char-only model (Table 3). Interestingly, the Transformer model performs worse for all representations. Surprisingly, we find that BERT-BASE is the best model, though the differences are small. We use this model in further experiments (referred to as BERT).

Adding characters to BERT We can see the impact of adding characters to BERT in the first row of results in Table 4. For both methods, it results in a clear and significant improvement over the BERT-only baseline: 87.6 versus 88.1.

Adding linguistic features to BERT Another common method of improving performance is adding linguistic features to the token-level representations. We try a range of linguistic features (described in Section 3.3), added in either one or two encoders. We see in the first two columns of results of Table 4 that even though linguistic information sources do indeed improve performance (up to 0.4 absolute), no single source can beat adding just the character-level representations (88.1).

Combining characters and linguistic features
An obvious follow-up question is whether we still see improvements for character-level models when also adding linguistic information. In a single encoder, adding characters (third column of results in Table 4) is beneficial for 6 out of 7 linguistic sources (i.e., compared to the first column of results). The scores are, however, not higher than simply adding characters on their own, suggesting that linguistic features are not always beneficial if character-level features are also included. For two encoders, the pattern is less clear, but we do find our highest score thus far when we combine characters and semantic tags (88.4). Using three encoders did not yield clear improvements over two encoders. Therefore, we do not experiment with using more than three encoders.

Robustness to different LMs We want to verify that the character improvements are robust to using different language models (Table 5). Combining characters with semantic tags also results in an improvement over just using characters for all the LMs considered.

Robustness across languages
We train systems for German, Italian and Dutch with four models: char-only, BERT-ONLY, BERT + char in one encoder, and BERT + char in two encoders. The BERT model we use is bert-multilingual-uncased. The results for both PMB releases are shown in Figure 3. For all languages, adding characters leads to a clear improvement for both one and two encoders, though for Dutch the improvement is smaller than for German and Italian. Interestingly, the two-encoder setup seems to be preferable for these smaller, non-English data sets. For 2.2.0, we outperform the system of Fancellu et al. (2019) for German and Italian and obtain competitive scores for Dutch.

Comparison to previous work
To check whether the improvements hold on unseen data, we run our best models on the test set and compare the scores to previous work (Table 6). Note that for the non-English languages we do not train a model that uses semantic tags as features, since there is not enough gold semantic tag data available to train a good tagger for any of these languages.
For the detailed Counter scores, see Appendix D.

Table 7: F-scores on subsets of sentences that contain a certain phenomenon, based on semantic tags, for the combined dev and test set of PMB release 3.0.0. Full scores are shown for BERT and absolute differences for the remaining systems.

Semantic tag analysis
We are also interested in finding out why the character-level representations help improve performance. As a start, we investigate on what type of sentences and semantic phenomena the character representations are the most beneficial. We introduce a novel method of analysis: selecting subsets of sentences based on the occurrence of certain semantic tags. In the PMB release, each token is annotated with a semantic tag, which indicates the semantic properties of the token in the given context (Abzianidze and Bos, 2017). This allows us to easily select all sentences that contain certain (semantic) phenomena and evaluate the performance of the different models on those sentences. The selected phenomena and corresponding F-scores for our four best models (see Table 6) are shown in Table 7. Our best model (+ch+sem) has the best performance on six of the seven selected phenomena, even though the differences are small. The character-level representations seem to help across the board; the +char models improve on the baseline (BERT) in almost all instances.
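The selection step itself is simple; a sketch is shown below. The document format and tag inventory here are illustrative (NOT is the PMB semantic tag for negation; the other tags and sentences are made-up examples):

```python
def select_by_semtag(documents, tags_of_interest):
    """Select all documents that contain at least one token annotated with
    any of the given semantic tags, so that a parser can be evaluated on
    just the sentences exhibiting that phenomenon."""
    tags_of_interest = set(tags_of_interest)
    return [doc for doc in documents
            if tags_of_interest & set(doc[1])]

# hypothetical (sentence, per-token semtags) pairs
docs = [
    ("I have not been to Boston", ["PRO", "ENS", "NOT", "EXS", "REL", "GPE"]),
    ("Tom plays chess",           ["PER", "ENS", "CON"]),
]
negation_docs = select_by_semtag(docs, {"NOT"})
print(len(negation_docs))  # 1
```

The F-score of each model is then computed by running Counter only on the selected subset.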
For Numerals and Named Entities we expected the characters to help specifically, since (i) BERT representations might not be optimal for all individual numerals (Wallace et al., 2019), and (ii) the character representations might attend more to capital letters, which often indicate the presence of a named entity. Indeed, the character representations clearly help for Numerals, but less so for Named Entities. (Note that this method of analysis can easily be used for other NLP tasks as well; the only requirement is that a semantic tagger is available to produce the tags.) Of course, this analysis only scratches the surface as to why the character-level representations improve performance; we leave a more detailed investigation to future work.

Sentence length analysis
We are also interested in finding out which models perform well on longer documents. When the Transformer model was introduced, one of its advantages was a smaller decrease in performance for longer sentences (Vaswani et al., 2017). Also, since Boxer is partly rule-based and not trained in an end-to-end fashion, it might be able to handle longer sentences better. Figure 4 shows the performance over sentence length for seven of our trained systems. We see a similar trend for all models: a decrease in performance for longer sentences. We also create a regression model that predicts F-score, with parser and document length in tokens as predictors, similar to van Noord et al. (2018b). We do not find a significant interaction of any model with sentence length, i.e., none of the models degrades significantly more or less than any other model.

To get some idea of how well our models would do on longer (possibly multi-sentence) documents, we create a new evaluation set. We select all silver documents of between 15 and 50 tokens that have at least the semantic tagging or CCG layer marked as gold standard. This resulted in a set of 128 DRSs, which should contain the higher-quality silver documents. We retrain our models with those sentences removed and plot the performance over sentence length in Figure 5. We see that performance still decreases for longer sentences, though less markedly after 30 tokens per document. The Transformer model does not catch up with the bi-LSTM models, even for longer documents. The addition of characters is still beneficial for longer documents, though only in one encoder.
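Such a regression with an interaction term can be sketched as follows (a minimal NumPy least-squares illustration for two parsers, not the actual analysis code; the toy scores are constructed, and in practice one would also test the interaction coefficient for significance):

```python
import numpy as np

def fit_fscore_regression(lengths, parser_ids, fscores):
    """Least-squares fit of F-score on document length, parser identity and
    their interaction; the interaction coefficient indicates whether one
    parser degrades faster with length than the other."""
    x = np.asarray(lengths, dtype=float)
    p = np.asarray(parser_ids, dtype=float)   # 0/1 dummy for the parser
    X = np.column_stack([np.ones_like(x), x, p, x * p])
    coef, *_ = np.linalg.lstsq(X, np.asarray(fscores, dtype=float), rcond=None)
    return coef  # [intercept, length, parser, length*parser]

# toy data: both parsers lose 0.005 F-score per token; parser 1 is +0.02 overall
lengths = [5, 10, 15, 20, 25, 30] * 2
parsers = [0] * 6 + [1] * 6
fscores = [0.95 - 0.005 * l for l in lengths[:6]] + \
          [0.97 - 0.005 * l for l in lengths[6:]]
coef = fit_fscore_regression(lengths, parsers, fscores)
print(abs(coef[3]) < 1e-6)  # interaction is ~0: no differential length effect
```

A near-zero interaction coefficient corresponds to the finding above that no model degrades significantly faster with sentence length than any other.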

Discussion
We found that adding character-level representations generally improved performance, though we did not find a clear preference for either the one-encoder or two-encoder model. We believe that, given the better performance of the two-encoder model on the fairly short documents of the non-English languages (see Figure 3), this model is likely the most useful in semantic parsing tasks with single sentences, such as SQL parsing (Zelle and Mooney, 1996; Iyer et al., 2017; Finegan-Dollak et al., 2018), while the one-encoder char-CNN model has more potential for tasks with longer sentences/documents, such as AMR (Banarescu et al., 2013), UCCA (Abend and Rappoport, 2013) and GMB-based DRS parsing (Bos et al., 2017; Liu et al., 2018, 2019a). The latter model also has more potential to be applicable to other (semantic parsing) systems, as it can be applied to all systems that form token-level representations from a document. In this sense, we hope that our findings here also carry over to other, more structured, encoder-decoder models developed for semantic parsing (e.g., Yin and Neubig, 2017; Dong and Lapata, 2018; Liu et al., 2019a).

An unexpected finding is that the BERT models outperformed the larger ROBERTA models. In addition, it was even preferable to use BERT only as initial token embedder, instead of fine-tuning the full model. Perhaps this is an indication that certain NLP tasks cannot be solved by simply training ever larger language models. Moreover, the Transformer model did not improve performance for any of the input representations, while also being harder to tune. We are a bit hesitant about drawing strong conclusions here, though, since we only experimented with a vanilla Transformer, while recent extensions (e.g., Dehghani et al., 2019; Guo et al., 2019; Press et al., 2020) might be more promising for smaller data sets.

Conclusion
We performed a range of experiments on Discourse Representation Structure parsing using neural sequence-to-sequence models, in which we varied the neural representation of the input documents. We show that, not surprisingly, using pretrained contextual language models is better than simply using characters as input (RQ1). However, characters can still be used to improve performance, in both a single encoder and two encoders (RQ2). The improvements are larger than those from individual sources of linguistic information, and performance still improves in combination with these sources (RQ3). The improvements are also robust across different language models, languages and data sets (RQ4) and hold across a range of semantic phenomena (RQ5). These methods should be applicable to other semantic parsing tasks and perhaps other natural language analysis tasks.

A Char-CNN

Figure 6 shows a schematic overview of using the char-CNN (Kim et al., 2016) to encode the word have with a width of 2. A width of 2 selects the bigrams ha, av and ve, returning a scalar for each bigram operation, which in turn form a vector $f_1$ for filter 1. We then take the max value of this vector to obtain the first value of our width-2 ($w_2$) char-CNN embedding, $e^{w_2}_1$. The final vector $e^{w_2}$ is thus of length $n$, the number of filters.

B Experimental settings
Tuning Table 8 gives an overview of the hyper-parameters we used and/or experimented with in the tuning stage. This table only gives an overview of the settings for the BERT-BASE model, though the settings for the other representations (described in Section 3.2) are usually very similar. We performed manual tuning, selecting the settings with the highest F1-score. The number of tuning runs was between 10 and 40 for each representation type and model combination (see Table 3). Output, evaluation (containing F1-scores, standard deviation and confidence interval) and configuration files for our four best models (see Table 6) are available at https://github.com/RikVN/Neural_DRS/.
Data filtering We filtered ill-formed DRSs from the PMB data sets, which only occur in the silver and bronze data (< 0.1% of DRSs). For the bi-LSTM models, the filtering of source and target tokens (see Table 8) only removes three very large documents from training. This was done for efficiency and memory purposes; it did not make a difference in terms of F1-score. However, for the Transformer model this improved the F1-score by around 0.5.
Training time and model size A single run of the baseline BERT model takes about 5 hours to train on a single NVIDIA V100 GPU, with around 17 million trainable parameters. Adding character-level representations in one encoder (using the char-CNN) uses around 55 million trainable parameters, with a runtime of around 6 hours. Using a two-encoder setup increases this to around 8 hours, but with only 34 million trainable parameters.

New evaluation set When training models that are evaluated on the silver-standard evaluation set of longer documents, we do not perform fine-tuning on the gold standard data. Also, we run Counter with the --default-sense setting (not punishing models that get the word sense wrong), since the word senses of the evaluation set are not gold standard. This yields a similar increase of around 1.0 for all models.

Table 9: Semantic tags that were used to select sentences that contain a certain phenomenon. The example sentence in Table 2 is included in the categories Modality, Pronouns, Named Entities and Numerals.

Perfect sense is the F-score when we ignore word senses during matching, i.e., be.v.01 can match with be.v.02. The last 9 rows are not in the original detailed Counter scores, but are produced by DRS-JURY. Character-level representations help to produce fewer ill-formed and more perfect DRSs, especially on 3.0.0.

Table 11 shows the sentences for which our best model (on the 3.0.0 English dev set) produced the lowest-quality DRSs, with a possible explanation. In Table 12, we show the sentences for which our best model has the best performance relative to the BERT-ONLY baseline model. It is harder to give an explanation in this case, though we indicate which clauses were (in)correctly predicted by the models.

Table 11: Sentences of the English 3.0.0 dev set for which our best model (+char +sem) produced the worst DRSs.

Document                                                Diff   Comment
Fish surface for air.                                   0.554  Correctly produced Goal
Oil this bicycle.                                       0.482  Correctly produced oil as a verb
I'm fed up with this winter, I want spring right now!   0.404  Correctly produced CONTINUATION and Pivot
He's Argentinian.                                       0.386  BERT-ONLY failed to produce country and Name
Alas!                                                   0.364  Odd sentence, but correctly produced state.v.01
Fire burns.                                             0.300  Bad performance for both, BERT-ONLY got a score of 0.0
All journeys begin with a first step.                   0.300  BERT-ONLY produced a lot of non-matching clauses
How heavy you are!                                      0.299  BERT-ONLY produced a lot of non-matching clauses
One plus two is equal to three.                         0.252  Correctly produced summation.n.04
He's not like us.                                       0.246  Correctly produced Theme and Co-Theme

Table 12: Sentences of the English 3.0.0 dev set for which our best model (+char +sem) produced the best DRSs, relative to the BERT-ONLY baseline.