DagoBERT: Generating Derivational Morphology with a Pretrained Language Model

Can pretrained language models (PLMs) generate derivationally complex words? We present the first study investigating this question, taking BERT as the example PLM. We examine BERT's derivational capabilities in different settings, ranging from using the unmodified pretrained model to full finetuning. Our best model, DagoBERT (Derivationally and generatively optimized BERT), clearly outperforms the previous state of the art in derivation generation (DG). Furthermore, our experiments show that the input segmentation crucially impacts BERT's derivational knowledge, suggesting that the performance of PLMs could be further improved if a morphologically informed vocabulary of units were used.


Introduction
What kind of linguistic knowledge is encoded by pretrained language models (PLMs) such as ELMo (Peters et al., 2018), GPT-2 (Radford et al., 2019), and BERT (Devlin et al., 2019)? This question has attracted a lot of attention in NLP recently, with a focus on syntax (e.g., Goldberg, 2019) and semantics (e.g., Ethayarajh, 2019). It is much less clear what PLMs learn about other aspects of language. Here, we present the first study on the knowledge of PLMs about derivational morphology, taking BERT as the example PLM. Given an English cloze sentence such as this jacket is ___ . and a base such as wear, we ask: can BERT generate correct derivatives such as unwearable?
The motivation for this study is twofold. On the one hand, we add to the growing body of work on the linguistic capabilities of PLMs. Most PLMs segment words into subword units (Bostrom and Durrett, 2020), e.g., unwearable is segmented into un, ##wear, ##able by BERT's WordPiece tokenizer (Wu et al., 2016). The fact that many of these subword units are derivational affixes suggests that PLMs might acquire knowledge about derivational morphology (Table 1), but this has not been tested. On the other hand, we are interested in derivation generation (DG) per se, a task that has so far been addressed only with LSTMs (Cotterell et al., 2017; Vylomova et al., 2017; Deutsch et al., 2018), not with Transformer-based models like BERT.

Figure 1: Basic experimental setup. We input sentences such as this jacket is unwearable . to BERT, mask out derivational affixes (this jacket is [MASK] wear [MASK] .), and recover them (un, ##able) using a derivational classification layer (DCL).
Contributions. We develop the first framework for generating derivationally complex English words with a PLM, specifically BERT, and analyze BERT's performance in different settings. Our best model, DagoBERT (Derivationally and generatively optimized BERT), clearly outperforms an LSTM-based model, the previous state of the art.
We find that DagoBERT's errors are mainly due to syntactic and semantic overlap between affixes. Furthermore, we show that the input segmentation impacts how much derivational knowledge is available to BERT, both during training and inference. This suggests that the performance of PLMs could be further improved if a morphologically informed vocabulary of units were used. We also publish the largest dataset of derivatives in context to date.

Derivational Morphology
Linguistics divides morphology into inflection and derivation. Given a lexeme such as wear, while inflection produces word forms such as wears, derivation produces new lexemes such as unwearable. There are several differences between inflection and derivation (Haspelmath and Sims, 2010), two of which are particularly important for the task of DG. First, derivation covers a much larger spectrum of meanings than inflection (Acquaviva, 2016), and it is not possible to predict in general with which of them a particular lexeme is compatible. This is different from inflectional paradigms, where it is automatically clear whether a certain form will exist (Bauer, 2019). Second, the relationship between form and meaning is more varied in derivation than in inflection. On the one hand, derivational affixes tend to be highly polysemous, i.e., individual affixes can represent a number of related meanings (Lieber, 2019). On the other hand, several affixes can represent the same meaning, e.g., ity and ness. While such competing affixes are often not completely synonymous, as in the case of hyperactivity and hyperactiveness, there are examples like purity and pureness or exclusivity and exclusiveness where a semantic distinction is more difficult to gauge (Bauer et al., 2013; Plag and Balling, 2020). These differences make learning functions from meaning to form harder for derivation than for inflection.
Derivational affixes differ in how productive they are, i.e., how readily they can be used to create new lexemes (Plag, 1999). While the suffix ness, e.g., can attach to practically all English adjectives, the suffix th is much more limited in its scope of applicability. In this paper, we focus on productive affixes such as ness and exclude unproductive affixes such as th. Morphological productivity has been the subject of much work in psycholinguistics since it reveals implicit cognitive generalizations (see Dal and Namer (2016) for a review), making it an interesting phenomenon to explore in PLMs. Furthermore, in the context of NLP applications such as sentiment analysis, productively formed derivatives are challenging because they tend to have very low frequencies and often only occur once (i.e., they are hapaxes) or a few times in large corpora (Mahler et al., 2017). Our focus on productive derivational morphology has crucial consequences for dataset design (Section 3) and model evaluation (Section 4) in the context of DG.

Dataset of Derivatives
We base our study on a new dataset of derivatives in context similar in form to the one released by Vylomova et al. (2017), i.e., it is based on sentences with a derivative (e.g., this jacket is unwearable .) that are altered by masking the derivative (this jacket is ___ .). Each item in the dataset consists of (i) the altered sentence, (ii) the derivative (unwearable), and (iii) the base (wear). The task is to generate the correct derivative given the altered sentence and the base. We use sentential contexts rather than tags to represent derivational meanings because they better reflect the semantic variability inherent in derivational morphology (Section 2). While Vylomova et al. (2017) use Wikipedia, we extract the dataset from Reddit. Since productively formed derivatives are not part of the language norm initially (Bauer, 2001), social media is a particularly fertile ground for our study.
For determining derivatives, we use the algorithm introduced by Hofmann et al. (2020a), which takes as input a set of prefixes, suffixes, and bases and checks for each word in the data whether it can be derived from a base using a combination of prefixes and suffixes. The algorithm is sensitive to morpho-orthographic rules of English (Plag, 2003), e.g., when ity is removed from applicability, the result is applicable, not applicabil. Here, we use BERT's prefixes, suffixes, and bases as input to the algorithm. Drawing upon a comprehensive list of 52 productive prefixes and 49 productive suffixes in English (Crystal, 1997), we find that 48 and 44 of them, respectively, are contained in BERT's vocabulary. We assign all fully alphabetic words with more than 3 characters in BERT's vocabulary, except for stopwords and previously identified affixes, to the set of bases, yielding a total of 20,259 bases. We then extract from all publicly available Reddit posts every sentence including a word that is derivable from one of the bases using at least one of the prefixes or suffixes. The sentences are filtered to contain between 10 and 100 words, i.e., they provide more contextual information than the example sentence above. See Appendix A.1 for details about data preprocessing. The resulting dataset comprises 413,271 distinct derivatives in 123,809,485 context sentences, making it more than two orders of magnitude larger than the one released by Vylomova et al. (2017). To get a sense of segmentation errors in the dataset, we randomly pick 100 derivatives for each affix and manually count missegmentations. We find that the average precision of segmentations in the sample is .960±.074, with higher values for prefixes (.990±.027) than suffixes (.930±.093).

Table 2: Data summary statistics. The table shows statistics of the data used in the study by frequency bin and affix type. We also provide example derivatives with anti (P), ness (S), and un##able (PS) for the different bins. µ_f: mean frequency per billion words; n_d: number of distinct derivatives; n_s: number of context sentences.
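The derivability check at the core of the extraction pipeline can be sketched as follows. This is a simplified illustration with made-up toy inventories, not the algorithm of Hofmann et al. (2020a), which additionally reverses morpho-orthographic rules (e.g., recovering applicable, not applicabil, from applicability):

```python
def find_derivation(word, bases, prefixes, suffixes):
    """Check whether `word` can be derived from a known base by stripping
    at most one prefix and one suffix. Returns (prefix, base, suffix)
    with empty strings for absent affixes, or None. Simplified sketch:
    ignores morpho-orthographic changes."""
    candidates = [("", word, "")]
    for p in prefixes:                       # try stripping one prefix
        if word.startswith(p):
            candidates.append((p, word[len(p):], ""))
    with_suffix = []
    for p, stem, _ in candidates:            # then try stripping one suffix
        for s in suffixes:
            if stem.endswith(s):
                with_suffix.append((p, stem[:-len(s)], s))
    for p, stem, s in candidates + with_suffix:
        if stem in bases and (p or s):       # at least one affix required
            return (p, stem, s)
    return None

bases = {"wear", "allow"}
prefixes = ["un", "anti"]
suffixes = ["able", "ness"]
print(find_derivation("unwearable", bases, prefixes, suffixes))  # ('un', 'wear', 'able')
```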
For this study, we extract all derivatives with a frequency f ∈ [1, 128) from the dataset. We divide the derivatives into 7 frequency bins with f ∈ [1, 2) (B1), f ∈ [2, 4) (B2), f ∈ [4, 8) (B3), f ∈ [8, 16) (B4), f ∈ [16, 32) (B5), f ∈ [32, 64) (B6), and f ∈ [64, 128) (B7). Notice that we focus on low-frequency derivatives since we are interested in productive derivational morphology (Section 2). In addition, BERT is likely to have seen high-frequency derivatives multiple times during pretraining and might be able to predict the affix because it has memorized the connection between the base and the affix, not because it has knowledge of derivational morphology. BERT's pretraining corpus has 3.3 billion words, i.e., words in the lower frequency bins are very unlikely to have been seen by BERT before. This observation also holds for average speakers of English, who have been shown to encounter at most a few billion word tokens in their lifetime (Brysbaert et al., 2016).

We also extract the preceding and following sentence for future studies on long-range dependencies in derivation; however, we do not exploit them in this work. Due to the large number of prefixes, suffixes, and bases, the dataset can be valuable for any study on derivational morphology, irrespective of whether or not it focuses on DG.
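Assuming the bins double in width (B1 = [1, 2), B2 = [2, 4), ..., B7 = [64, 128)), mapping a derivative frequency to its bin is a one-liner; `frequency_bin` is a hypothetical helper for illustration:

```python
def frequency_bin(f: int) -> int:
    """Map a derivative frequency f in [1, 128) to bin index 1..7,
    assuming power-of-two bins B1 = [1, 2), ..., B7 = [64, 128)."""
    if not 1 <= f < 128:
        raise ValueError("frequency out of range")
    return f.bit_length()  # equals floor(log2(f)) + 1 for positive ints

print(frequency_bin(1), frequency_bin(5), frequency_bin(100))  # 1 3 7
```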
Regarding the number of affixes, we confine ourselves to three cases: derivatives with one prefix (P), derivatives with one suffix (S), and derivatives with one prefix and one suffix (PS). We treat these cases separately because they are known to have different linguistic properties. In particular, since suffixes in English can change the POS of a lexeme, the syntactic context is more affected by suffixation than by prefixation. Table 2 provides summary statistics for the seven frequency bins as well as example derivatives for P, S, and PS. For each bin, we randomly split the data into 60% training, 20% development, and 20% test. Following Vylomova et al. (2017), we distinguish the lexicon settings SPLIT (no overlap between bases in train, dev, and test) and SHARED (no constraint on overlap).

Setup
To examine whether BERT can generate derivationally complex words, we use a cloze test: given a sentence with a masked word such as this jacket is ___ . and a base such as wear, the task is to generate the correct derivative such as unwearable. The cloze setup has been previously used in psycholinguistics to probe derivational morphology (Pierrehumbert, 2006; Apel and Lawrence, 2011) and was introduced to NLP in this context by Vylomova et al. (2017).
In this work, we frame DG as an affix classification task, i.e., we predict which affix is most likely to occur with a given base in a given context sentence. More formally, given a base $b$ and a context sentence $x$ split into left and right contexts $x^{(l)} = (x_1, \ldots, x_{d-1})$ and $x^{(r)} = (x_{d+1}, \ldots, x_n)$, with $x_d$ being the masked derivative, we want to find the affix $\hat{a}$ such that

$\hat{a} = \operatorname{argmax}_a \, P\left(x_d = \psi(b, a) \mid x^{(l)}, b, x^{(r)}\right),$

where $\psi$ is a function mapping bases and affixes onto derivatives, e.g., $\psi$(wear, un##able) = unwearable. Notice we do not model the function $\psi$ itself, i.e., we only predict derivational categories, not the morpho-orthographic changes that accompany their realization in writing. One reason for this is that, as opposed to previous work, our study focuses on low-frequency derivatives, for many of which $\psi$ is not right-unique, e.g., ungoogleable and ungooglable or celebrityness and celebritiness occur as competing forms in the data.
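A toy sketch of this classification step: restrict a model's output distribution to the affix inventory and take the argmax. Vocabulary and logits below are made-up illustrative values, not BERT's:

```python
import numpy as np

# Given logits over a vocabulary, compute the softmax only over the
# affix inventory and take the most likely affix (toy example).
vocab = ["the", "un##able", "ness", "anti", "jacket"]
affixes = {"un##able", "ness", "anti"}
logits = np.array([3.1, 2.7, 0.4, -1.2, 2.9])  # illustrative values

idx = [i for i, tok in enumerate(vocab) if tok in affixes]
affix_logits = logits[idx]
probs = np.exp(affix_logits) / np.exp(affix_logits).sum()
best = vocab[idx[int(np.argmax(probs))]]
print(best)  # un##able
```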
These predictions reflect increasing degrees of derivational knowledge. A priori, it is not clear where to draw the line between correct and incorrect predictions on this continuum, especially with respect to the last two cases. (In the case of PS, we predict which affix bundle, e.g., un##able, is most likely to occur.) Here, we apply the most conservative criterion: a prediction $\hat{a}$ is only judged correct if $\psi(b, \hat{a}) = x_d$, i.e., if $\hat{a}$ is the affix in the masked derivative. Thus, we ignore affixes that might potentially produce equally possible derivatives such as superwearable.
We use mean reciprocal rank (MRR), macro-averaged over affixes, as the evaluation measure (Radev et al., 2002). We calculate the MRR value of an individual affix $a$ as

$\mathrm{MRR}_a = \frac{1}{|D_a|} \sum_{i \in D_a} \frac{1}{R_i},$

where $D_a$ is the set of derivatives containing $a$, and $R_i$ is the predicted rank of $a$ for derivative $i$. Denoting with $A$ the set of all affixes, the final MRR value is given by

$\mathrm{MRR} = \frac{1}{|A|} \sum_{a \in A} \mathrm{MRR}_a.$
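The macro-averaged MRR can be computed directly from predicted ranks; a minimal sketch with a hypothetical `macro_mrr` helper and toy ranks:

```python
def macro_mrr(ranks_by_affix):
    """Macro-averaged MRR: average 1/R_i within each affix's derivative
    set D_a, then average the per-affix values over all affixes."""
    per_affix = [sum(1.0 / r for r in ranks) / len(ranks)
                 for ranks in ranks_by_affix.values()]
    return sum(per_affix) / len(per_affix)

# Toy ranks: the affix "ness" was ranked 1st and 2nd for its two
# derivatives, "un##able" 1st and 4th.
ranks = {"ness": [1, 2], "un##able": [1, 4]}
print(macro_mrr(ranks))  # 0.6875
```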

Segmentation Methods
Since BERT distinguishes word-initial (wear) from word-internal (##wear) tokens, predicting prefixes requires the word-internal form of the base. However, only 795 bases in BERT's vocabulary have a word-internal form. Take as an example the word unallowed: both un and allowed are in the BERT vocabulary, but we need the token ##allowed, which does not exist (BERT tokenizes the word into una, ##llo, ##wed). To overcome this problem, we test the following four segmentation methods.

HYP. We insert a hyphen between the prefix and the base in its word-initial form, yielding the tokens un, -, allowed in our example. Since both prefix and base are guaranteed to be in the BERT vocabulary (Section 3), and since there are no tokens starting with a hyphen in the BERT vocabulary, BERT always tokenizes words of the form prefix-hyphen-base into prefix, hyphen, and base, making this a natural segmentation for BERT.

INIT. We simply use the word-initial instead of the word-internal form, segmenting the derivative into the prefix followed by the base, i.e., un, allowed in our example. Notice that this looks like two individual words to BERT since allowed is a word-initial unit.

TOK. To overcome the problem of INIT, we segment the base into word-internal tokens, i.e., our example is segmented into un, ##all, ##owed. This means that we use the word-internal counterpart of the base in cases where it exists.

PROJ. We train a projection matrix that maps embeddings of word-initial forms of bases to word-internal embeddings. More specifically, we fit a matrix $T \in \mathbb{R}^{m \times m}$ ($m$ being the embedding size) via least squares, minimizing $\|ET - E_{\#\#}\|^2$, where $E, E_{\#\#} \in \mathbb{R}^{n \times m}$ are the word-initial and word-internal token input embeddings of bases with both forms. We then map bases with no word-internal form and a word-initial input token embedding $e$, such as allow, onto the projected word-internal embedding $eT$.

We evaluate the four segmentation methods on the SHARED test data for P with pretrained BERT_BASE, using its pretrained language modeling head for prediction and filtering for prefixes. The HYP segmentation method performs best (Table 3) and is adopted for BERT models on P and PS.

Table 4: Performance (MRR) of prefix (P) models. Best score per column in gray, second-best in light gray.

Table 5: Performance (MRR) of suffix (S) models. Best score per column in gray, second-best in light gray.
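The PROJ fit is an ordinary least-squares problem and can be sketched in a few lines of NumPy; random matrices stand in for BERT's input embeddings, and m = 4 replaces BERT's 768 dimensions:

```python
import numpy as np

# Sketch of PROJ: fit T minimizing ||E T - E_##||^2 by least squares,
# then map a word-initial embedding e to its projected word-internal
# counterpart e @ T. Data here is random, purely for illustration.
rng = np.random.default_rng(0)
n, m = 50, 4
E = rng.normal(size=(n, m))       # word-initial embeddings of bases
E_int = rng.normal(size=(n, m))   # word-internal (##) embeddings
T, *_ = np.linalg.lstsq(E, E_int, rcond=None)

e = rng.normal(size=m)            # base with no word-internal form
e_projected = e @ T               # projected word-internal embedding
print(e_projected.shape)          # (4,)
```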

Models
All BERT models use BERT_BASE and add a derivational classification layer (DCL) with softmax activation for prediction (Figure 1). We examine three BERT models and two baselines. See Appendix A.2 for details about implementation, hyperparameter tuning, and runtime.
DagoBERT. We finetune both BERT and DCL on DG, a model that we call DagoBERT (short for Derivationally and generatively optimized BERT).
Notice that since BERT cannot capture statistical dependencies between masked tokens (Yang et al., 2019), all BERT-based models predict prefixes and suffixes independently in the case of PS.
BERT+. We keep the model weights of pretrained BERT fixed and only train DCL on DG. This is similar in nature to a probing task.
BERT. We use pretrained BERT and leverage its pretrained language modeling head as DCL, filtering for affixes, e.g., we compute the softmax only over prefixes in the case of P.
LSTM. We adapt the approach described in Vylomova et al. (2017), which combines the left and right contexts x^(l) and x^(r) of the masked derivative by means of two BiLSTMs with a character-level representation of the base. To allow for a direct comparison with BERT, we do not use the character-based decoder proposed by Vylomova et al. (2017) but instead add a dense layer for the prediction. For PS, we treat prefix-suffix bundles as units (e.g., un##able).
In order to provide a strict comparison to Vylomova et al. (2017), we also evaluate our LSTM and best BERT-based model on the suffix dataset released by Vylomova et al. (2017) against the reported performance of their encoder-decoder model. Notice that Vylomova et al. (2017) show that providing the LSTM with the POS of the derivative increases performance. Here, we focus on the more general case where the POS is not known and hence do not consider this setting.
Random Baseline (RB). The prediction is a random ranking of all affixes.

The dataset of Vylomova et al. (2017) is available at https://github.com/ivri/dmorph. While Vylomova et al. (2017) take morpho-orthographic changes into account, we only predict affixes, not the accompanying changes in orthography (Section 4.1).

Overall Performance
Results are shown in Tables 4, 5, and 6. For P and S, DagoBERT clearly performs best. Pretrained BERT is better than the LSTM on SPLIT but worse on SHARED. BERT+ performs better than pretrained BERT, even on SPLIT (except for S on B7). S has higher scores than P for all models and frequency bins, which might be due to the fact that suffixes carry POS information and hence are easier to predict given the syntactic context. Regarding frequency effects, the models benefit from higher frequencies on SHARED since they can connect bases with certain groups of affixes. For PS, DagoBERT also performs best in general but is beaten by the LSTM on one bin. The smaller performance gap as compared to P and S can be explained by the fact that DagoBERT, as opposed to the LSTM, cannot learn statistical dependencies between two masked tokens (Section 4).
The results on the dataset released by Vylomova et al. (2017) confirm the superior performance of DagoBERT (Table 7). DagoBERT beats the LSTM by a large margin, both on SHARED and SPLIT. We also notice that our LSTM (which predicts derivational categories) has a very similar performance to the LSTM encoder-decoder proposed by Vylomova et al. (2017).

Patterns of Confusion
We now analyze in more detail the performance of the best performing model, DagoBERT, and contrast it with the performance of pretrained BERT. As a result of our definition of correct predictions (Section 4.1), the set of incorrect predictions is heterogeneous and potentially contains affixes resulting in equally possible derivatives. We are hence interested in patterns of confusion in the data. We start by constructing the row-normalized confusion matrix $C$ for the predictions of DagoBERT on the hapax derivatives (B1, SHARED) for P and S. Based on $C$, we create a confusion graph $\mathcal{G}$ with adjacency matrix $G$, whose elements are

$G_{ij} = \begin{cases} 1 & \text{if } C_{ij} > \theta \text{ and } i \neq j, \\ 0 & \text{otherwise,} \end{cases}$

i.e., there is a directed edge from affix $i$ to affix $j$ if $i$ was misclassified as $j$ with a probability greater than $\theta$. We set $\theta$ to 0.08. To uncover the community structure of $\mathcal{G}$, we use the Girvan-Newman algorithm (Girvan and Newman, 2002), which clusters the graph by iteratively removing the edge with the highest betweenness centrality. The resulting clusters reflect linguistically interpretable groups of affixes (Table 8). In particular, the suffixes are clustered in groups with common POS. These results are confirmed by plotting the confusion matrix with an ordering of the affixes induced by all clusterings of the Girvan-Newman algorithm (Figure 2, Figure 3). They indicate that even when DagoBERT does not predict the affix occurring in the sentence, it tends to predict an affix semantically and syntactically congruent with the ground truth (e.g., ness for ity, ify for ize, inter for intra). In such cases, it is often a more productive affix that is predicted in lieu of a less productive one. Furthermore, DagoBERT frequently confuses affixes denoting points on the same scale, often antonyms (e.g., pro and anti, pre and post, under and over). This can be related to recent work showing that BERT has difficulties with negated expressions (Ettinger, 2020; Kassner and Schütze, 2020).
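Constructing the confusion graph from a count matrix can be sketched as follows, with a toy 3-affix matrix (illustrative values only; θ = 0.08 as above). Community detection on the result could then be run, e.g., with an off-the-shelf Girvan-Newman implementation such as the one in NetworkX:

```python
import numpy as np

# Row-normalize a confusion count matrix and draw an edge i -> j
# whenever affix i is misclassified as j with probability > theta.
counts = np.array([[89.0,  9.0,  2.0],   # toy counts, rows = true affix
                   [12.0, 85.0,  3.0],
                   [ 1.0,  1.0, 98.0]])
C = counts / counts.sum(axis=1, keepdims=True)

theta = 0.08
off_diag = ~np.eye(3, dtype=bool)        # ignore correct predictions
G = ((C > theta) & off_diag).astype(int)
print(G)
```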
Pretrained BERT shows similar confusion patterns overall but overgenerates several affixes much more strongly than DagoBERT, in particular re, non, y, ly, and er, which are among the most productive affixes in English (Plag, 1999, 2003). To probe the impact of productivity more quantitatively, we measure the cardinality of the set of hapaxes formed by means of a particular affix $a$ in the entire dataset, $|H_a|$, and calculate a linear regression to predict the MRR values of affixes based on $|H_a|$. $|H_a|$ is a common measure of morphological productivity (Baayen and Lieber, 1991; Pierrehumbert and Granell, 2018). This analysis shows a significant positive correlation for both prefixes ($R^2 = .566$, $F(1, 43) = 56.05$, $p < .001$) and suffixes ($R^2 = .410$, $F(1, 41) = 28.49$, $p < .001$): the more productive an affix, the higher its MRR value. This also holds for DagoBERT's predictions of prefixes ($R^2 = .423$, $F(1, 43) = 31.52$, $p < .001$) and suffixes ($R^2 = .169$, $F(1, 41) = 8.34$, $p < .01$), but the correlation is weaker, particularly in the case of suffixes (Figure 4).
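The productivity regression can be reproduced in outline with NumPy; the hapax counts and MRR values below are illustrative, not the paper's data:

```python
import numpy as np

# Regress per-affix MRR on the hapax count |H_a| and report R^2.
# Toy data: five hypothetical affixes.
hapax_counts = np.array([10.0, 200.0, 500.0, 1200.0, 3000.0])
mrr = np.array([0.05, 0.18, 0.30, 0.42, 0.61])

slope, intercept = np.polyfit(hapax_counts, mrr, 1)
pred = slope * hapax_counts + intercept
ss_res = ((mrr - pred) ** 2).sum()
ss_tot = ((mrr - mrr.mean()) ** 2).sum()
r2 = 1.0 - ss_res / ss_tot                 # coefficient of determination
print(slope > 0, round(r2, 3))
```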

Impact of Input Segmentation
We have shown that BERT can generate derivatives if it is provided with the morphologically correct segmentation. At the same time, we observed that BERT's WordPiece tokenizations are often morphologically incorrect, an observation that led us to impose the correct segmentation using hyphenation (HYP). We now examine more directly how BERT's derivational knowledge is affected by using the original WordPiece segmentations versus the HYP segmentations.
We draw upon the same dataset as for DG (SPLIT) but perform binary instead of multi-class classification, i.e., the task is to predict whether, e.g., unwearable is a possible derivative in the context this jacket is ___ . or not. As negative examples, we combine the base of each derivative (e.g., wear) with a randomly chosen affix different from the original affix (e.g., ##ation) and keep the sentence context unchanged, resulting in a balanced dataset. We only use prefixed derivatives for this experiment. We train binary classifiers using BERT_BASE and one of two input segmentations, the morphologically correct segmentation or BERT's WordPiece tokenization. The BERT output embeddings for all subword units belonging to the derivative in question are max-pooled and fed into a dense layer with a sigmoid activation. We examine two settings: training only the dense layer while keeping BERT's model weights frozen (FROZEN), or finetuning the entire model (FINETUNED). See Appendix A.3 for details about implementation, hyperparameter tuning, and runtime.
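The classifier head described above (max-pooling over the derivative's subword outputs, then a dense layer with sigmoid) can be sketched as follows; random vectors stand in for BERT outputs, and m = 4 replaces BERT's 768 dimensions:

```python
import numpy as np

# Max-pool the output embeddings of the derivative's subword units and
# score them with a dense layer + sigmoid (toy, randomly initialized).
rng = np.random.default_rng(1)
m = 4
subword_outputs = rng.normal(size=(3, m))   # e.g. un, ##wear, ##able
pooled = subword_outputs.max(axis=0)        # element-wise max-pooling

w, b = rng.normal(size=m), 0.0              # dense layer (m + 1 params)
p = 1.0 / (1.0 + np.exp(-(pooled @ w + b))) # P(derivative is possible)
print(0.0 < p < 1.0)                        # True
```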
Morphologically correct segmentation consistently outperforms WordPiece tokenization, both on FROZEN and FINETUNED (Table 9). We interpret this in two ways. Firstly, the type of segmentation used by BERT impacts how much derivational knowledge can be learned, with positive effects of morphologically valid segmentations. Secondly, the fact that there is a performance gap even for models with frozen weights indicates that a morphologically invalid segmentation can blur the derivational knowledge that is in principle available and causes BERT to force semantically unrelated words to have similar representations. Taken together, these findings provide further evidence for the crucial importance of morphologically valid segmentation strategies in language model pretraining (Bostrom and Durrett, 2020).

Related Work
Recent work has examined the linguistic knowledge encoded by PLMs (Ethayarajh, 2019; Wiedemann et al., 2019; Ettinger, 2020), including a recent study of morphosyntactic information in a PLM, specifically BERT (Edmiston, 2020). There has been relatively little recent work on derivational morphology in NLP, however. Both Cotterell et al. (2017) and Deutsch et al. (2018) propose neural architectures that represent derivational meanings as tags. More closely related to our study, Vylomova et al. (2017) develop an encoder-decoder model that uses the context sentence for predicting deverbal nouns. Hofmann et al. (2020b) propose a graph auto-encoder that models the morphological well-formedness of derivatives.

Conclusion
We show that a PLM, specifically BERT, can generate derivationally complex words. Our best model, DagoBERT, clearly beats an LSTM-based model, the previous state of the art in DG. DagoBERT's errors are mainly due to syntactic and semantic overlap between affixes. Furthermore, we demonstrate that the input segmentation impacts how much derivational knowledge is available to BERT. This suggests that the performance of PLMs could be further improved if a morphologically informed vocabulary of units were used.

A.1 Data Preprocessing
We filter the posts for known bots and spammers (Tan and Lee, 2015). We exclude posts written in a language other than English and remove strings containing numbers, references to users, and hyperlinks. Sentences are filtered to contain between 10 and 100 words. We ensure that derivatives do not appear more than once in a sentence.

A.2 Hyperparameters
We tune hyperparameters on the development data separately for each frequency bin (selection criterion: MRR). Models are trained with categorical cross-entropy as the loss function and Adam (Kingma and Ba, 2015) as the optimizer. Training and testing are performed on a GeForce GTX 1080 Ti GPU (11GB).
LSTM. We initialize word embeddings with 300-dimensional GloVe (Pennington et al., 2014) vectors and character embeddings with 100-dimensional random vectors. The BiLSTMs consist of three layers and have a hidden size of 100.
We use a batch size of 64 and perform grid search for the learning rate $l \in \{1 \times 10^{-4}, 3 \times 10^{-4}, 1 \times 10^{-3}, 3 \times 10^{-3}\}$ and the number of epochs $n_e \in \{1, \ldots, 40\}$ (number of hyperparameter search trials: 160). The number of trainable parameters varies with the type of the model due to different sizes of the output layer and is 2,354,345 for P, 2,354,043 for S, and 2,542,038 for PS models. Table 10 lists statistics of the validation performance over hyperparameter search trials and provides information about the best validation performance as well as corresponding hyperparameter configurations. We also report runtimes for the hyperparameter search.
For the models trained on the Vylomova et al. (2017) dataset, the hyperparameter search is identical to that for the main models, except that we use accuracy as the selection criterion. Runtimes for the hyperparameter search in minutes are 754 for SHARED and 756 for SPLIT in the case of DagoBERT, and 530 for SHARED and 526 for SPLIT in the case of LSTM. Best validation accuracy is .943 ($l = 3 \times 10^{-6}$, $n_e = 7$) for SHARED and .659 ($l = 1 \times 10^{-5}$, $n_e = 4$) for SPLIT in the case of DagoBERT, and .824 ($l = 1 \times 10^{-4}$, $n_e = 38$) for SHARED and .525 ($l = 1 \times 10^{-4}$, $n_e = 33$) for SPLIT in the case of LSTM.

A.3 Hyperparameters
We use the HYP segmentation method for models with morphologically correct segmentation. We tune hyperparameters on the development data separately for each frequency bin (selection criterion: accuracy). Models are trained with binary cross-entropy as the loss function and Adam as the optimizer. Training and testing are performed on a GeForce GTX 1080 Ti GPU (11GB).

Notes on A.2: Since models are trained separately on the frequency bins, slight variations in the number of trainable parameters are possible if an affix does not appear in a particular bin; the reported numbers are for B1. Since expected validation performance (Dodge et al., 2019) may not be correct for grid search, we report mean and standard deviation of the performance instead.
For FROZEN, we use a batch size of 16 and perform grid search for the learning rate $l \in \{1 \times 10^{-4}, 3 \times 10^{-4}, 1 \times 10^{-3}, 3 \times 10^{-3}\}$ and the number of epochs $n_e \in \{1, \ldots, 8\}$ (number of hyperparameter search trials: 32). The number of trainable parameters is 769. For FINETUNED, we use a batch size of 16 and perform grid search for the learning rate $l \in \{1 \times 10^{-6}, 3 \times 10^{-6}, 1 \times 10^{-5}, 3 \times 10^{-5}\}$ and the number of epochs $n_e \in \{1, \ldots, 8\}$ (number of hyperparameter search trials: 32). The number of trainable parameters is 109,483,009. All other hyperparameters are as for BERT_BASE. Table 11 lists statistics of the validation performance over hyperparameter search trials and provides information about the best validation performance as well as corresponding hyperparameter configurations. We also report runtimes for the hyperparameter search.

Table 10: Mean (µ), standard deviation (σ), and maximum (max) of the validation performance (MRR) on all hyperparameter search trials for prefix (P), suffix (S), and prefix-suffix (PS) models. It also gives the learning rate (l) and number of epochs (n_e) with the best validation performance as well as the runtime (τ) in minutes averaged over P, S, and PS for one full hyperparameter search (32 trials for DagoBERT and BERT+, 160 trials for LSTM).

Table 11: Mean (µ), standard deviation (σ), and maximum (max) of the validation performance (accuracy) on all hyperparameter search trials for classifiers using morphological and WordPiece segmentations. It also gives the learning rate (l) and number of epochs (n_e) with the best validation performance as well as the runtime (τ) in minutes for one full hyperparameter search (32 trials for both morphological and WordPiece segmentations).