Artificially Evolved Chunks for Morphosyntactic Analysis

We introduce a language-agnostic evolutionary technique for automatically extracting chunks from dependency treebanks. We evaluate these chunks on a number of morphosyntactic tasks, namely POS tagging, morphological feature tagging, and dependency parsing. We test the utility of these chunks in a host of different ways. We first learn chunking as one task in a shared multi-task framework together with POS and morphological feature tagging. The predictions from this network are then used as input to augment sequence-labelling dependency parsing. Finally, we investigate the impact chunks have on dependency parsing in a multi-task framework. Our results from these analyses show that these chunks improve performance at different levels of syntactic abstraction on English UD treebanks and a small, diverse subset of non-English UD treebanks.


Introduction
Shallow parsing, or chunking, consists of identifying constituent phrases (Abney, 1997). As such, it is fundamentally associated with constituency parsing, as it can be used as a first step for finding a full constituency tree (Ciravegna and Lavelli, 1999; Tsuruoka and Tsujii, 2005). However, chunking information can also be beneficial for dependency parsing (Attardi and Dell'Orletta, 2008; Tammewar et al., 2015), and vice versa (Kutlu and Cicekli, 2016). More recently, Lacroix (2018) explored the efficacy of noun phrase (NP) chunking with respect to Universal Dependencies (UD) parsing and POS tagging for English treebanks. As UD treebanks do not contain chunking annotation, they deduced chunks by adopting linguistically motivated phrase rules. They observed improvements on POS and morphological feature tagging in a shared multi-task framework for the English treebanks in UD version 2.1 (Nivre et al., 2017). However, an increase in parsing performance was obtained for only one treebank.
Contribution
1. We first relax the standard definition of chunks and present an evolutionary method to automatically deduce chunks for any language given a dependency treebank.
2. We show that chunking information can improve performance for POS tagging, morphological feature tagging, and dependency parsing, both in a multi-task and a single-task framework.

Chunks and chunking rules
While Lacroix (2018) described a method to obtain chunks from sentences with UD annotations, their approach is limited to NP chunks and requires hand-crafted linguistic rules, meaning that it cannot be transferred to other languages without language-specific knowledge. In contrast, we introduce a fully automatic approach to obtain chunks from UD-annotated sentences in a language-agnostic way. Figure 1 depicts our method of extracting candidate chunk types.
Describing chunks with rules For each subtree in the training set that meets the above criteria, the corresponding sequence of POS tags of its words is saved as a candidate rule. The rules collected for a given treebank form a ruleset of unique candidate chunk types. When more than one overlapping subtree meets these conditions, the maximal substring is used, e.g. in Figure 1 PRON AUX ADV is chosen instead of PRON AUX or AUX ADV. We allow any chunk type with the exception of those containing the PUNCT POS tag, and we apply a mild frequency cut of 5 to make the problem more tractable. The English-EWT treebank, for example, results in a ruleset consisting of 512 candidates.
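The collection of candidate rules can be illustrated with a minimal sketch. It assumes that the POS-tag sequences of the qualifying subtrees have already been extracted (the extraction criteria themselves are not implemented here), that the frequency cut keeps rules seen at least 5 times, and that punctuation carries the UD tag `PUNCT`. The function name `build_ruleset` and its parameters are hypothetical.

```python
from collections import Counter


def build_ruleset(candidate_spans, min_freq=5, banned_tags=("PUNCT",)):
    """Return the unique POS-tag sequences that survive the frequency cut.

    candidate_spans: iterable of POS-tag sequences, one per subtree that
    already met the extraction criteria (assumed to be computed elsewhere).
    """
    counts = Counter()
    for span in candidate_spans:
        rule = tuple(span)
        if any(tag in banned_tags for tag in rule):
            continue  # chunk types containing punctuation are not allowed
        counts[rule] += 1
    # Assumption: a cut of 5 means "keep rules seen at least 5 times".
    return sorted(rule for rule, n in counts.items() if n >= min_freq)


spans = [("DET", "NOUN")] * 5 + [("PRON", "AUX", "ADV")] * 4 + [("NOUN", "PUNCT")] * 9
ruleset = build_ruleset(spans)
```

In this toy input only DET NOUN survives: PRON AUX ADV falls below the cut and NOUN PUNCT is excluded because it contains punctuation.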
Annotating with rulesets This ruleset (or any subset of it) can be applied to a UD treebank to obtain chunks, by using its rules as patterns that generate a chunk when they are matched by a sequence of POS tags and meet the criteria described above. In particular, we can apply it to the training set to obtain a set of chunks on which to train a statistical chunker to process arbitrary texts and help morphosyntactic tasks. When annotating a treebank, the POS tag of the head is used as a suffix for the chunk type, e.g. DET ADJ NOUN would result in IOB tags of B-NOUN and I-NOUN, assuming the head of this phrase corresponds to the NOUN tag (Ramshaw and Marcus, 1999).
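A minimal sketch of this annotation step follows. It is not the authors' implementation: it assumes greedy left-to-right longest matching, CoNLL-U style 1-based head indices (0 for the root), and that a matched span counts as a chunk only if exactly one of its tokens has its head outside the span, which makes the span connected and identifies the chunk head. The name `annotate_iob` is hypothetical.

```python
def annotate_iob(pos_tags, heads, ruleset):
    """Label tokens with IOB chunk tags suffixed by the POS tag of the chunk head.

    pos_tags: POS tag per token.
    heads: 1-based head index per token (0 = root), as in CoNLL-U.
    ruleset: set of POS-tag tuples.
    """
    n = len(pos_tags)
    labels = ["O"] * n
    max_len = max((len(rule) for rule in ruleset), default=0)
    i = 0
    while i < n:
        step = 1
        # Try the longest possible rule first (maximal match).
        for length in range(min(max_len, n - i), 0, -1):
            if tuple(pos_tags[i:i + length]) not in ruleset:
                continue
            # Tokens whose head lies outside the span [i, i + length).
            external = [j for j in range(i, i + length)
                        if not (i < heads[j] <= i + length)]
            if len(external) != 1:
                continue  # assumption: a chunk must be a connected span with one head
            head_tag = pos_tags[external[0]]
            labels[i] = "B-" + head_tag
            for j in range(i + 1, i + length):
                labels[j] = "I-" + head_tag
            step = length
            break
        i += step
    return labels


rules = {("DET", "ADJ", "NOUN")}
tags = annotate_iob(["DET", "ADJ", "NOUN", "VERB"], [3, 3, 4, 0], rules)
```

For the DET ADJ NOUN example, the three chunk tokens receive B-NOUN, I-NOUN, I-NOUN and the verb stays outside any chunk.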
However, not all candidate rules are useful, and some can impair the ability of a chunker to make sensible predictions. For this reason, we do not use the whole candidate ruleset obtained from a training corpus, but instead try to find a subset of the ruleset whose resulting set of chunks strikes a good balance between the following criteria: (i) coverage (i.e. there should be enough chunks to maximize their informativeness for morphosyntactic tasks) and (ii) consistency and learnability (i.e. the chunks should follow patterns predictable enough to be easily learnable by a machine learning model, so that our approach is not undermined by low chunking accuracy). Our hypothesis is that these two characteristics (which we quantify with a fitness function in the next section) are reasonable proxies for the usefulness of a particular set of chunks for morphosyntactic tasks.
Note that to achieve this, it is not possible to merely remove error-prone rules from the ruleset because there is a complicated interplay between rules, i.e. if the 10% most error-prone rules are removed, the overall accuracy of the system is not guaranteed to improve. Furthermore, with so many candidate rules, it is not possible to try every combination as this results in an astronomical number ($2^n$). Therefore, we aim to use an evolutionary method to find optimal subsets of rules to be used when annotating treebanks.

Evolutionary search for chunk rules
Evolutionary algorithms aim to optimise an objective (fitness) function by evaluating a population of individuals and subsequently generating a new population based on the best performing individuals from the population (Back, 1996). This process is then repeated until a set number of generations is reached or until the fitness function converges. Each individual consists of a set of parameters and its corresponding objective function value, or fitness. The fitness of an individual is used to decide whether to use it as a parent for subsequent generations or to remove it from the population. We introduce the techniques used to select parents and how they are then used to generate offspring (Algorithm 1 in Appendix A).
K-best parent selection The selection operator makes the population converge. We used the simple k-best method where the top k individuals of a population are selected as the parents.
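As a minimal sketch, k-best selection amounts to ranking individuals by fitness and keeping the top k. The function name `select_k_best` is hypothetical; DEAP offers comparable functionality through `deap.tools.selBest`, which is not used here so that the example stays self-contained.

```python
def select_k_best(population, fitnesses, k):
    """Return the k individuals with the highest fitness (ties keep input order)."""
    order = sorted(range(len(population)), key=lambda i: fitnesses[i], reverse=True)
    return [population[i] for i in order[:k]]


population = [[1, 0], [0, 1], [1, 1]]
parents = select_k_best(population, [0.2, 0.9, 0.5], k=2)
```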
Mutation Mutation is a genetic operator which prevents a population becoming too genetically similar by randomly altering individuals. This ensures that at least some level of genetic diversity is maintained from generation to generation. Our individuals have binary genes, so our mutation operator flips each gene with a probability $P_{\text{mutate gene}}$.
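Bit-flip mutation on a binary individual can be sketched as below. The function name `mutate_flip_bits` and the `rng` parameter are illustrative assumptions; DEAP provides a similar operator as `deap.tools.mutFlipBit`.

```python
import random


def mutate_flip_bits(individual, p_gene, rng=random):
    """Return a copy of a binary individual with each gene flipped with probability p_gene."""
    return [1 - gene if rng.random() < p_gene else gene for gene in individual]


original = [1, 0, 1, 1, 0]
mutant = mutate_flip_bits(original, p_gene=0.05, rng=random.Random(0))
```

The input list is left unchanged, so the parent can still be reused for other offspring.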
Crossover Crossover is a genetic operator which also preserves genetic variety in a population. In single-point crossover, a random index κ is chosen and the substring 0-κ of parent x is replaced with the corresponding part of parent y and vice-versa. This results in two offspring. Single-point crossover can be extended to x-point crossover, where x points are used to cut individuals.
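Single-point crossover can be sketched in a few lines. The cut index is drawn so that both offspring inherit at least one gene from each parent, which is an assumption of this sketch rather than something stated in the text; the name `single_point_crossover` is hypothetical, and DEAP offers a comparable operator as `deap.tools.cxOnePoint`.

```python
import random


def single_point_crossover(parent_x, parent_y, rng=random):
    """Swap the prefixes of two equal-length parents at a random cut point."""
    if len(parent_x) != len(parent_y) or len(parent_x) < 2:
        raise ValueError("parents must have the same length of at least 2")
    kappa = rng.randint(1, len(parent_x) - 1)  # cut index, inclusive bounds
    child_a = parent_y[:kappa] + parent_x[kappa:]
    child_b = parent_x[:kappa] + parent_y[kappa:]
    return child_a, child_b


child_a, child_b = single_point_crossover([0] * 6, [1] * 6, rng=random.Random(0))
```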
We used the DEAP framework for our implementation (Fortin et al., 2012), with the parameters in Table 6 (Appendix B). We represented each ruleset as a binary vector, where 1 meant a rule was used and 0 meant it was not. Our fitness function was obtained by combining the F1-score of a chunker implemented with the sequence-labelling framework NCRF++ (Yang and Zhang, 2018) and the proportion of the maximum compression rate, weighted 1.0 and 0.5 respectively. The compression rate, $r$, is defined as:
$$r = \frac{C_{\text{tokens}}}{C_{\text{chunks}} + C_{\text{out}}}$$
where $C_{\text{tokens}}$ is the number of tokens in a treebank, $C_{\text{chunks}}$ the number of chunks a ruleset creates, and $C_{\text{out}}$ the number of tokens outside of chunks. The proportion of the maximum compression rate, $r_{\%}$, is then defined as:
$$r_{\%} = \frac{r_{\text{subset}}}{r_{\text{all}}}$$
where $r_{\text{subset}}$ is the compression rate of the current rule subset and $r_{\text{all}}$ is the compression rate of the full ruleset.
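The two ingredients of the fitness can be sketched as follows. Two assumptions are made: the compression rate is taken to be the number of tokens divided by the number of units left once every chunk is collapsed into one unit (chunks plus tokens outside chunks), and the two components are combined as a weighted sum. Both are illustrative readings, not a statement of how the original implementation combines them, and all function names are hypothetical.

```python
def compression_rate(n_tokens, n_chunks, n_outside):
    """Tokens per unit after collapsing each chunk into a single unit (assumed definition)."""
    return n_tokens / (n_chunks + n_outside)


def fitness(chunker_f1, r_subset, r_all, w_f1=1.0, w_rate=0.5):
    """Weighted sum of chunker F1 and the proportion of the maximum compression rate."""
    r_proportion = r_subset / r_all
    return w_f1 * chunker_f1 + w_rate * r_proportion


# 10 tokens, 3 chunks and 2 tokens outside chunks give 5 units in total.
r_subset = compression_rate(10, 3, 2)
score = fitness(0.9, r_subset=1.5, r_all=2.0)
```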
We used a small network for chunking due to the considerable computational costs of evolutionary algorithms. For each individual in each population, we trained a chunker for 5 epochs (see Table 7 in Appendix B for the parameters), and the model's best performance on the development set was taken as that individual's fitness together with the proportion of the maximum compression rate, $r_{\%}$. Including $r_{\%}$ prevents the algorithm from favouring rulesets that generate few chunks, which would minimise their potential impact. The convergence over 40 generations for English-EWT and Japanese-GSD can be seen in Figure 2. As a final step, we took the 100 best rulesets from across generations and extracted the rules that appeared in at least 75% and 95% of these sets, as the evolutionary algorithm only managed to find a single set with a fairly low performance. Rulesets were obtained this way for each treebank, except that the rulesets extracted from English-EWT were also used for the other English UD treebanks. The statistics for the resulting chunks for the respective test data can be seen in Table 1.
Table 1: Chunking statistics on test data for each treebank used, where # rules is the number of rules in a ruleset for a given threshold and C/sent corresponds to the number of chunks per sentence found.
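The final consensus step over the best rulesets can be sketched as follows, assuming each ruleset is a binary vector over the same candidate rules. The function name `consensus_rules` is hypothetical.

```python
def consensus_rules(rulesets, threshold):
    """Return indices of rules switched on in at least `threshold` of the rulesets."""
    n_sets = len(rulesets)
    n_rules = len(rulesets[0])
    return [j for j in range(n_rules)
            if sum(ruleset[j] for ruleset in rulesets) / n_sets >= threshold]


best_rulesets = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [1, 1, 0]]
rules_75 = consensus_rules(best_rulesets, 0.75)
rules_95 = consensus_rules(best_rulesets, 0.95)
```

In this toy example rule 0 appears in every set, rule 1 in three of four, and rule 2 in one of four, so the 75% threshold keeps rules 0 and 1 while the 95% threshold keeps only rule 0.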
(BiLSTMs) networks (Hochreiter and Schmidhuber, 1997; Schuster and Paliwal, 1997). The inputs to the network are continuous word representations and character embeddings.
In this paper we used NCRF++ (Yang and Zhang, 2018), which uses stacked BiLSTMs to generate a contextualised hidden representation $\mathbf{h}_i$ for every word in the input sentence. For decoding, it uses a feed-forward layer followed by a softmax activation:
$$P(y \mid \mathbf{h}_i) = \mathrm{softmax}(\mathbf{W} \mathbf{h}_i + \mathbf{b})$$
where $\mathbf{W}$ and $\mathbf{b}$ are the weights and bias of the feed-forward layer. The single-task models are optimised with cross-entropy loss, $\mathcal{L}$, defined as:
$$\mathcal{L} = -\sum_{i} \log P(y_i \mid \mathbf{h}_i)$$
where $y_i$ is the gold label of word $i$. For the multi-task learning models, we implemented a hard-sharing architecture, where all the stacked BiLSTMs are shared across all tasks (Søgaard and Goldberg, 2016). A separate feed-forward layer (like the one used in the single-task setup) is used to decode the output for each task. Under the multi-task learning (MTL) setup, the loss, $\mathcal{L}_{MTL}$, is defined as:
$$\mathcal{L}_{MTL} = \sum_{t \in T} \beta_t \mathcal{L}_t$$
where $t$ is a task from the set of all tasks, $T$; $\beta_t$ is the corresponding weight for task $t$; and $\mathcal{L}_t$ is the cross-entropy loss for task $t$. A schematic of the network can be seen in Figure 3.
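The loss computation can be illustrated with a small pure-Python sketch. It assumes the cross-entropy is summed (not averaged) over the words of a sentence and that each task's scores come from its own feed-forward layer; the function names and the example task weights are hypothetical and are not taken from NCRF++.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of label scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(score_rows, gold_labels):
    """Sum of negative log-probabilities of the gold label for each word."""
    return -sum(math.log(softmax(row)[gold])
                for row, gold in zip(score_rows, gold_labels))


def multitask_loss(task_losses, task_weights):
    """Weighted sum of per-task losses; both arguments are dicts keyed by task name."""
    return sum(task_weights[task] * loss for task, loss in task_losses.items())


single = cross_entropy([[0.0, 0.0]], [0])  # one word, two equally likely labels
total = multitask_loss({"parse": 2.0, "pos": 1.0, "chunk": 4.0},
                       {"parse": 1.0, "pos": 0.5, "chunk": 0.25})
```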

Dependency parsing as sequence labelling
In order to more readily utilise the multi-task framework for dependency parsing, we have cast dependency parsing as a sequence-labelling task. This was done by using the relative position encoding scheme introduced by Strzyz et al. (2019). We opted to use this encoding as it was the highest performing labelling scheme they evaluated. For each word in a sentence the dependency relation label is combined with the relative position of its head based on the POS tag of the head, e.g. a noun which is the subject of a verb (son in the input sentence in Figure 3) would have a label of +1,nsubj,VERB, where +1 indicates the head is the next VERB in the sentence and nsubj is the relation label.
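A minimal sketch of this encoding and its decoding is given below. It follows the label layout used in the example above (offset, relation, head POS tag) and assumes CoNLL-U style 1-based head indices with 0 for the root. The label used for the root word and the fallback to the root for unresolvable labels are assumptions of this sketch, as are the function names.

```python
def encode(pos_tags, heads, relations):
    """Turn a dependency tree into one label per word: offset,relation,head POS."""
    labels = []
    for i, (head, rel) in enumerate(zip(heads, relations)):
        if head == 0:
            labels.append(f"-1,{rel},ROOT")  # assumed convention for the root word
            continue
        h = head - 1
        head_pos = pos_tags[h]
        if h > i:  # head is to the right: count matching tags up to and including it
            offset = sum(1 for j in range(i + 1, h + 1) if pos_tags[j] == head_pos)
            labels.append(f"+{offset},{rel},{head_pos}")
        else:      # head is to the left
            offset = sum(1 for j in range(h, i) if pos_tags[j] == head_pos)
            labels.append(f"-{offset},{rel},{head_pos}")
    return labels


def decode(pos_tags, labels):
    """Recover 1-based head indices from labels using the POS tags."""
    heads = []
    for i, label in enumerate(labels):
        offset, _rel, head_pos = label.split(",")
        head = 0  # root, also the fallback if the label cannot be resolved
        if head_pos != "ROOT":
            step = 1 if offset.startswith("+") else -1
            remaining = abs(int(offset))
            j = i + step
            while 0 <= j < len(pos_tags):
                if pos_tags[j] == head_pos:
                    remaining -= 1
                    if remaining == 0:
                        head = j + 1
                        break
                j += step
        heads.append(head)
    return heads


pos = ["DET", "NOUN", "VERB", "NOUN"]
gold_heads = [2, 3, 0, 3]
labels = encode(pos, gold_heads, ["det", "nsubj", "root", "obj"])
```

Decoding depends on the POS tags supplied at runtime, which is why the parser needs POS tags to resolve heads.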

Experiments
Data The analyses were undertaken using the English treebanks (EWT, GUM, LinES, and ParTUT) and also Bulgarian-BTB, German-GSD, and Japanese-GSD from UD v2.3 (Nivre et al., 2018). No results are given for Japanese-GSD for morphological feature tagging as it does not contain this information.
Network hyperparameters We used the framework described above with the hyperparameters shown in Table 8 in Appendix B. The standard input to the system consisted of word embeddings concatenated with character embeddings. All embeddings were randomly initialised.
Figure 3: Multi-task architecture shown with sequence-labelling dependency parsing (as described in subsection 4.1), POS tagging, and chunking as shared tasks. Network input is a concatenation of word embeddings (circles) and character-level word embeddings (triangles) obtained from a character-based LSTM layer. The network is constructed of BiLSTM layers followed by a softmax layer for inference.

Experiment 1
We tested the impact of our chunks on POS and morphological feature tagging in a shared multi-task setting. This entails feeding word and character embeddings as input to the network with the output being some combination of POS tags, morphological feature tags, and chunk labels. These results were compared against the baseline taggers (single-task networks and POS and morphological features shared only). Tasks were equally weighted. As a further baseline we include results for POS and morphological feature tagging using UDPipe 2.2 (Straka and Straková, 2019).

Experiment 2
We used the best predictions (when using chunking) from experiment 1 as additional features for a sequence-labelling dependency parser (Strzyz et al., 2019). Network input therefore consisted of word and character embeddings plus some combination of POS tags, morphological feature tags, or chunk labels, with the sole output being a dependency parsing label. We used gold tags and labels as input during training, but at runtime we used predicted tags and labels. As baselines, we trained a model with no features, whose output is decoded with POS tags predicted by UDPipe 2.2 (as the sequence-labelling encoding we are using requires POS tags to resolve dependency heads), and a model trained with gold POS tags as features that also uses UDPipe 2.2 predicted POS tags at runtime.

Experiment 3
We tested the impact of our chunks on a sequence-labelling dependency parser in a multi-task framework with and without the other tasks. POS tagging was treated as a secondary main task with a weight of 0.5 (as POS tags are needed to decode the sequence-labelling scheme for the dependency parser), and chunks and morphological features were considered auxiliary tasks with a weight of 0.25 when used. The only inputs in this experiment were word and character embeddings. An example is shown in Figure 3, where the shared tasks are chunking, POS tagging, and dependency parsing. The baseline used here is a model trained solely to predict dependency parsing labels, which are then decoded using predicted POS tags from UDPipe 2.2.

Results and discussion
As seen in Table 2, the multi-task framework with chunks improves the performance of both POS and morphological feature tagging for all English treebanks. The same table shows that chunks do not aid Bulgarian, but they do improve POS tagging performance for German and Japanese. Table 3 shows that chunking performance consistently improves in the multi-task setting. Parsing performance is improved across all treebanks when the predictions from experiment 1 are used as features (Table 4), but the predicted chunks explicitly improve performance only for English-EWT (the largest treebank) and ParTUT (the smallest); for the other treebanks, only the other predicted features help. This is in contrast to the findings of Nguyen and Verspoor (2018), who obtained higher performance for larger treebanks.
Table 2: Multi-task tagging performance on English (en-ewt, en-gum, en-lines, and en-partut), Bulgarian-BTB (bg), German-GSD (de), and Japanese-GSD (ja) UD treebanks: single, single-task training; pos, with POS tagging; feats, with morphological feature tagging (except Japanese (ja), which has no morphological features); and chunks$_x$, with chunks with threshold $x$.
Table 3: Chunker F1 scores in the multi-task setting, where baseline is obtained by training the chunker for a given ruleset with threshold 75% or 95% as a single task, and multi is obtained by training with POS and morphological feature tagging, except for Japanese (ja), which has no morphological features.
Table 4: Feature input ablation for the dependency parser with English (en-ewt, en-gum, en-lines, and en-partut), Bulgarian-BTB (bg), German-GSD (de), and Japanese-GSD (ja) UD treebanks: no features$_{udpipe}$, no features but UDPipe predicted POS tags used to decode; pos, gold POS tags for training and predicted POS tags at runtime (pos$_{udpipe}$, UDPipe predicted POS tags used); feats, gold morphological feature tags for training and predicted feature tags at runtime; and chunks$_x$, gold chunks with threshold $x$ at training time and predicted chunks at runtime.
Table 5: Multi-task parsing results for English (en-ewt, en-gum, en-lines, and en-partut), Bulgarian-BTB (bg), German-GSD (de), and Japanese-GSD (ja) UD treebanks: single$_{udpipe}$, parsing as a single task with UDPipe predicted POS tags used to decode parser output; pos, with POS tagging as an auxiliary task; feats, with morphological feature tagging as an auxiliary task; and chunks$_x$, with chunking as an auxiliary task for threshold $x$.
In the multi-task setting for the dependency parser (Table 5), chunks improve performance, with a meaningful increase in accuracy observed over baseline models for each treebank. As can be seen in Figure 4, the change in performance when using the predicted chunks as a feature for parsing is less pronounced than in the multi-task experiments. Only two English treebanks explicitly benefit from predicted chunks, whereas all treebanks benefit from at least one feature. So the performance is at least implicitly improved by using our chunks, except for the more morphologically rich (especially with respect to verbal inflection) Bulgarian. The treebank used for Japanese, generally an agglutinative language, does not contain morphological features, so perhaps it too would not improve with chunks if these features had been available. Therefore, it would be interesting to evaluate whether the impact of chunking information is predicated on certain linguistic features.
Furthermore, the increase in performance for each treebank for the multi-task experiments suggests that the performance when using the chunks as input would improve with better predicted chunks, which corroborates the findings of Lacroix (2018).

Conclusion
We have introduced a language-agnostic method for extracting chunks from dependency treebanks. We have also shown the efficacy of these chunks with respect to improving POS tagging, morphological feature tagging, and dependency parsing for a number of UD treebanks.
hyperparameter                 value
population size                100
number of generations          4
k-best                         5
$P_{\text{mutate}}$            0.5
$P_{\text{mutate gene}}$       0.05
$P_{\text{crossover}}$         0.5
decay (linear)                 0.1
Table 6: Hyperparameters for the evolutionary algorithm: k-best, the number of best parents chosen to seed the next generation; $P_{\text{mutate}}$, the probability an individual will mutate; $P_{\text{mutate gene}}$, the probability a given gene will mutate; $P_{\text{crossover}}$, the probability a pair of individuals will cross over; and decay, how much $P_{\text{mutate}}$ and $P_{\text{crossover}}$ decrease after each generation.