On the Choice of Auxiliary Languages for Improved Sequence Tagging

Recent work showed that embeddings from related languages can improve the performance of sequence tagging, even for monolingual models. In this analysis paper, we investigate whether the best auxiliary language can be predicted based on language distances and show that the most related language is not always the best auxiliary language. Further, we show that attention-based meta-embeddings can effectively combine pre-trained embeddings from different languages for sequence tagging and set new state-of-the-art results for part-of-speech tagging in five languages.


Introduction
State-of-the-art methods for sequence tagging tasks, such as named entity recognition (NER) and part-of-speech (POS) tagging, exploit embeddings as input representation. Recent work suggested including embeddings trained on related languages as auxiliary embeddings to improve model performance: Catalan and Portuguese embeddings, for instance, help NER models on Spanish-English code-switching data (Winata et al., 2019a). In this paper, we analyze whether auxiliary embeddings should be chosen from related languages, or whether embeddings from more distant languages can also help.
For this, we revisit current language distance measures (Gamallo et al., 2017) and adapt them to the embeddings and training data used in our experiments. We investigate whether the best auxiliary language can be predicted based on these language distance measures. Our results suggest that there is no strong correlation between language distance and performance, and that even less related languages can be a good choice as auxiliary languages.
In our experiments, we explore both monolingual and multilingual pre-trained byte-pair encoding embeddings (Heinzerling and Strube, 2018) and FLAIR embeddings (Akbik et al., 2018). For combining monolingual subword embeddings from different languages, we investigate two methods: the concatenation of embeddings and attention-based meta-embeddings (Kiela et al., 2018; Winata et al., 2019a).
We perform experiments on CoNLL and Universal Dependencies datasets for NER and POS tagging in five languages and show that meta-embeddings are a promising alternative to the concatenation of additional auxiliary embeddings, as they learn to decide on the auxiliary languages in an unsupervised way. Moreover, the inclusion of more languages is often beneficial, and meta-embeddings can be used effectively to leverage a larger number of embeddings, achieving new state-of-the-art performance on all five POS tagging tasks. Finally, we propose guidelines for deciding which auxiliary languages to use in which setting.

Related Work
Combination of Embeddings. Previous work has reported performance gains from combining, e.g., various types of word embeddings (Tsuboi, 2014) or variants of the same embedding type trained on different corpora (Luo et al., 2014). Several combination methods have been proposed, such as different input channels (Zhang et al., 2016), concatenation (Yin and Schütze, 2016), averaging of embeddings (Coates and Bollegala, 2018), and attention (Kiela et al., 2018). In this paper, we compare the inclusion of auxiliary languages via concatenation to the dynamic combination with attention.

Auxiliary Languages. Winata et al. (2019a) showed that Catalan and Portuguese embeddings help for Spanish-English NER. Later work showed that more distant languages can also be beneficial (Winata et al., 2019b), but only in the special setting of code-switching NER, and the connection between language relatedness and the performance increase was not analyzed. In contrast, our work shows that the inclusion of auxiliary languages increases performance in monolingual settings as well, and we analyze whether language distance measures can be used to select the best auxiliary language in advance.

Language Distance Measures. Gamallo et al. (2017) proposed to measure distances between languages using the perplexity of language models trained on one language and applied to another language. Campos et al. (2019) used a similar method to retrace changes in multilingual diachronic corpora over time. Another popular similarity measure is based on vocabulary overlap, assuming that similar languages share a large portion of their vocabulary (Brown et al., 2008).

Embeddings
Each input word is represented with a pre-trained word vector. We experiment with byte-pair encoding embeddings (BPEmb) (Heinzerling and Strube, 2018) and FLAIR embeddings (Akbik et al., 2018), as pre-trained embeddings for both of them are publicly available for all the languages we consider.

Combination of Embeddings
As we experiment with multiple word embeddings, we compare two combination methods: simple concatenation e_CONCAT and attention-based meta-embeddings e_ATT, as shown in Figures 1b and 1c, respectively, and described next.
In both cases, the input consists of n embeddings e_i, 1 ≤ i ≤ n, that should be combined. In our experiments, we use embeddings from n different languages.
For concatenation, we simply stack the individual embeddings into a single vector: e_CONCAT = [e_1, e_2, ..., e_n].
In the case of meta-embeddings, we follow Kiela et al. (2018) and compute the combination as a weighted sum of embeddings. For this, all n embeddings e_i first need to be mapped to the same size. In contrast to previous work, we use a non-linear mapping, as this yielded better performance in our experiments. Thus, we compute x_i = tanh(Q_i · e_i + b_i), with Q_i ∈ R^(E×E_i) and b_i ∈ R^E, mapping each embedding to the size E of the largest embedding. The attention weight α_i for each embedding x_i is then computed with a fully-connected hidden layer of size H with parameters W ∈ R^(H×E) and V ∈ R^(1×H), followed by a softmax layer.
The parameters of the meta-embedding layer are randomly initialized and learnt during training. Finally, the embeddings x_i are weighted using the attention weights α_i, resulting in the word representation e_ATT = Σ_i α_i · x_i.
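The attention-based combination can be illustrated with a minimal NumPy sketch (function and variable names are ours, not from the paper's code; parameters would be trained jointly with the tagger rather than fixed as here):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def meta_embed(embeddings, Qs, bs, W, V):
    """Combine n embeddings of (possibly) different sizes into one vector.

    embeddings: list of n vectors e_i with sizes E_i
    Qs, bs:     per-embedding non-linear projections to the common size E
    W, V:       attention parameters, W in R^(H x E), V in R^(1 x H)
    """
    # Project each embedding to the common size E with a non-linear mapping.
    xs = [np.tanh(Q @ e + b) for e, Q, b in zip(embeddings, Qs, bs)]
    # One scalar attention score per embedding, then softmax over the n scores.
    scores = np.array([float(V @ np.tanh(W @ x)) for x in xs])
    alphas = softmax(scores)
    # Weighted sum of the projected embeddings.
    return sum(a * x for a, x in zip(alphas, xs)), alphas

# Toy example: two "languages" with embedding sizes 4 and 3, common size E = 4.
rng = np.random.default_rng(0)
e1, e2 = rng.normal(size=4), rng.normal(size=3)
Qs = [rng.normal(size=(4, 4)), rng.normal(size=(4, 3))]
bs = [np.zeros(4), np.zeros(4)]
W, V = rng.normal(size=(8, 4)), rng.normal(size=(1, 8))
e_att, alphas = meta_embed([e1, e2], Qs, bs, W, V)
```

Note that the attention weights sum to one, which is what later allows inspecting how much each language contributes to a word's representation.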

Experiments and Results
We perform NER and POS experiments on five languages: German (De), English (En), Spanish (Es), Finnish (Fi), and Dutch (Nl). Note that we assume at least a character overlap in order to use auxiliary embeddings from another language. Thus, languages with a different character set, e.g., Asian languages, cannot be used in this setting. Future work could investigate the inclusion of languages with different character sets, e.g., by using bilingual dictionaries or machine translation.
For NER, we use the CoNLL 2002/03 datasets (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003) and the FiNER corpus (Ruokolainen et al., 2019). For POS tagging, we experiment with the Universal Dependencies treebanks. For each language, we report results for the following methods:
Monolingual (Mono). Only embeddings from the source language are used. This is the baseline setting. We also compare our results to multilingual embeddings (Multi), which have been used successfully in monolingual settings as well (Heinzerling and Strube, 2019). To ensure comparability, we use the multilingual versions of BPEmb and FLAIR, which were trained simultaneously on 275 and 300 languages, respectively.
Mono + X. A second set of embeddings from a different language X is concatenated with the original monolingual embeddings. While embeddings from a related language are typically chosen for this, we report results for all language combinations and investigate in particular whether relatedness is necessary for improvement.
Mono + All & Meta-Embeddings. We also experiment with the combination of embeddings from all languages in our experiments. In this setting, we use all six embeddings (five monolingual embeddings and the multilingual embeddings) and combine them either using concatenation (Mono + All) or meta-embeddings. We have chosen these settings, which are mainly based on monolingual embeddings, as the current state of the art for named entity recognition is based on monolingual FLAIR embeddings (Akbik et al., 2019). In addition, multilingual embeddings, such as multilingual BERT (Devlin et al., 2019), tend to perform worse than their monolingual counterparts in monolingual experiments. For completeness, we include experiments with multilingual embeddings as mentioned before.

Results
Following Reimers and Gurevych (2017), we report all experimental results as the mean of five runs and their standard deviation in Table 1 for experiments with byte-pair encoding embeddings.
The results with FLAIR embeddings can be found in the appendix. We performed statistical significance testing to check whether the concatenation (Mono + All) and meta-embedding models are better than the best Mono + X model. We used paired permutation testing with 2^20 permutations and a significance level of 0.05 and applied the Fisher correction following Dror et al. (2017). For meta-embeddings, we found statistically significant differences in 12 out of 20 settings (6 with BPEmb, 6 with FLAIR) against the best Mono + X model, while we found statistically significant differences for Mono + All in only 7 out of 20 cases. This suggests that meta-embeddings are superior both to monolingual settings with one auxiliary language and to the concatenation of all embeddings.

Further, we found that the combination of monolingual and multilingual byte-pair encoding embeddings is always superior to either monolingual or multilingual embeddings alone for both tasks. Even though the multilingual embeddings have seen many languages during pre-training, they can still benefit from the high performance of monolingual embeddings and vice versa. As the meta-embeddings assign attention weights to each embedding, we can inspect the importance the models give to the different embeddings. An analysis of an example sentence can be found in Section D in the appendix. Table 2 compares BPEmb and FLAIR meta-embeddings to the state of the art, showing that we set a new state of the art for POS tagging.
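The core of the paired permutation test can be sketched as follows (a Monte-Carlo approximation for illustration only; the paper's exact procedure uses 2^20 permutations plus the Fisher correction, and the scores below are toy numbers):

```python
import random

def paired_permutation_test(scores_a, scores_b, n_permutations=10000, seed=0):
    """Two-sided paired permutation test on matched per-run scores.

    Under the null hypothesis, swapping the two models' scores within a pair
    is equally likely, which flips the sign of that pair's difference.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_permutations):
        # Randomly flip the sign of each paired difference.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(permuted) >= observed:
            hits += 1
    return hits / n_permutations  # approximate p-value

# Toy per-run F1 scores of two models evaluated on the same splits.
model_a = [0.91, 0.92, 0.90, 0.93, 0.91, 0.92, 0.90, 0.92]
model_b = [0.85, 0.86, 0.84, 0.86, 0.85, 0.85, 0.84, 0.86]
p_value = paired_permutation_test(model_a, model_b)
```

With consistently higher scores for model A, the test returns a small p-value; with interleaved scores it approaches 1.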

Analysis of Language Distances
To evaluate how useful language distances are for predicting the best auxiliary language, we compare rankings based on language distances to the observed performance rankings based on Table 1. For this, we take the language distance d_P from Gamallo et al. (2017), which is based on the perplexity PP of a unigram language model LM trained on one language and applied to text CH of a foreign language. Lower language model perplexity on a foreign dataset indicates higher language relatedness.
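The perplexity-based measure can be sketched with an add-one-smoothed unigram model (a simplified illustration of the idea, not Gamallo et al.'s exact setup; the token lists are toy data):

```python
import math
from collections import Counter

def unigram_perplexity(train_tokens, test_tokens):
    """Perplexity of an add-one-smoothed unigram LM trained on one
    language's text and evaluated on (possibly foreign) text."""
    counts = Counter(train_tokens)
    total = sum(counts.values())
    vocab_size = len(set(train_tokens) | set(test_tokens))
    log2_prob = 0.0
    for token in test_tokens:
        # Add-one smoothing so unseen foreign words get non-zero probability.
        p = (counts[token] + 1) / (total + vocab_size)
        log2_prob += math.log2(p)
    return 2 ** (-log2_prob / len(test_tokens))

# Lower perplexity on foreign text signals a more related language.
train = "the cat sat on the mat".split()
related = "the cat on the mat".split()
unrelated = "el gato en la alfombra".split()
ppl_related = unigram_perplexity(train, related)
ppl_unrelated = unigram_perplexity(train, unrelated)
```

The overlapping vocabulary of the related text yields a lower perplexity, and d_P turns such perplexities into a distance ranking over candidate languages.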
We also test language similarities based on vocabulary overlap, d_V, with W(L1|L2) being the number of words of L1 that are shared with L2 and N(L1) the number of words of L1 shared with any other language.
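Under one possible reading of this definition, the overlap-based similarity can be sketched as the ratio W(L1|L2) / N(L1); the function name and the toy vocabularies below are ours:

```python
def vocab_overlap_similarity(vocab_l1, vocab_l2, other_vocabs):
    """W(L1|L2) / N(L1): the share of L1's cross-lingually shared words
    that are shared with L2 in particular."""
    # N(L1): words of L1 that appear in at least one other language.
    shared_with_any = vocab_l1 & set().union(*other_vocabs)
    if not shared_with_any:
        return 0.0
    # W(L1|L2): words of L1 that also appear in L2.
    return len(vocab_l1 & vocab_l2) / len(shared_with_any)

# Toy vocabularies (real measures would use large corpus vocabularies).
nl = {"de", "in", "water", "hand", "fiets"}
en = {"in", "water", "hand", "the"}
de = {"de", "in", "hand", "und"}
sim_nl_en = vocab_overlap_similarity(nl, en, [en, de])
```

Here four Dutch words are shared with some language and three of them with English, giving a similarity of 0.75; a distance can be derived as one minus this ratio.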
For our experiments, we further adapt d_P to use the perplexity of the FLAIR forward language models on the test data provided by Gamallo et al. (2017). To analyze the correlation between language distance measures and the performance of our model, we compute Spearman's rank correlation coefficient between the real rankings based on performance and the predicted rankings from our language distances. The results are shown in Figure 2. We conclude that predicting the auxiliary language ranking is a hard task and see that the most related language is not always the best auxiliary language in practice (cf. Table 1). This holds in particular for POS tagging, where the performance differences between models are quite small.
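For rankings without ties, Spearman's coefficient reduces to a simple formula over squared rank differences; a self-contained sketch with illustrative rankings (the language ranks below are invented for the example):

```python
def spearman_rho(ranking_a, ranking_b):
    """Spearman's rank correlation between two tie-free rankings of the
    same items, each given as a dict mapping item -> rank position."""
    n = len(ranking_a)
    d_squared = sum((ranking_a[item] - ranking_b[item]) ** 2
                    for item in ranking_a)
    # Closed form for tie-free rankings: rho = 1 - 6 * sum(d^2) / (n(n^2-1)).
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical example: distance-based vs. performance-based ranking of
# auxiliary languages for one target language.
distance_rank = {"nl": 1, "de": 2, "es": 3, "fi": 4}
performance_rank = {"nl": 2, "de": 1, "es": 3, "fi": 4}
rho = spearman_rho(distance_rank, performance_rank)
```

A coefficient of 1 means the distance measure ranks the auxiliary languages exactly as the observed performance does; values near 0 mean the prediction is uninformative.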
In general, d*_P shows a higher correlation with model performance than d_P and d_V, indicating that not only word overlap plays a role but also context information. The majority voting d_MV achieves the highest correlation and often predicts the best auxiliary language for NER models using byte-pair encoding. However, the actual ranking of all languages does not match the performance ranking, which results in a relatively low correlation of only a little above 0.5.

Finally, we propose a small guide in Figure 3 for deciding which auxiliary languages to use to improve performance over monolingual embeddings. If enough data is available, it is recommended to train multiple monolingual embeddings and combine them using meta-embeddings, which was the best method in our experiments. If not enough data is available to train monolingual embeddings, the best solution is the inclusion of multilingual embeddings, assuming high-quality embeddings exist, such as multilingual byte-pair encoding. If none of the above applies, language distance measures, in particular the combination of multiple distances, can help to identify the most promising auxiliary embeddings. Despite not always predicting the best model, the predicted auxiliary language often led to improvements over the monolingual baseline in our experiments.
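One simple way to combine several distance rankings by majority voting, shown here purely as an illustration (the paper does not specify the exact voting procedure, and the rankings below are toy data):

```python
from collections import Counter

def majority_vote_ranking(rankings):
    """Combine rankings (lists of languages, best first): at each rank
    position, place the not-yet-placed language chosen by most rankings."""
    combined = []
    for pos in range(len(rankings[0])):
        votes = Counter(r[pos] for r in rankings if r[pos] not in combined)
        for lang, _ in votes.most_common():
            if lang not in combined:
                combined.append(lang)
                break
    # Append any language every ranking had already placed earlier.
    for lang in rankings[0]:
        if lang not in combined:
            combined.append(lang)
    return combined

# Four toy rankings of auxiliary languages (e.g., from d_P, d*_P, d_V, d*_V).
rankings = [
    ["nl", "en", "fi", "es"],
    ["nl", "en", "es", "fi"],
    ["en", "nl", "fi", "es"],
    ["nl", "fi", "en", "es"],
]
combined = majority_vote_ranking(rankings)
```

Three of the four rankings agree on "nl" first, so it wins the top position even though the individual measures disagree further down.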

Conclusion
In this paper, we investigated the benefits of auxiliary languages for sequence tagging. We showed that it is hard to predict the best auxiliary language based on language distances. We further showed that meta-embeddings can effectively leverage multiple embeddings for these tasks and set a new state of the art for part-of-speech tagging across different languages. Finally, we proposed a guide for deciding which method of including auxiliary languages to use.

A Hyperparameters and Training
We use the byte-pair encoding embeddings (Heinzerling and Strube, 2018) with 300 dimensions and a vocabulary size of 200k tokens for all languages. For FLAIR, we use the embeddings provided by the FLAIR framework (Akbik et al., 2018) with 2048 dimensions per language model, resulting in a total embedding size of 4096 dimensions. The bidirectional LSTM has a hidden size of 256 units.
For training, we use stochastic gradient descent with a learning rate of 0.1 and a batch size of 64 sentences.The learning rate is halved after 3 consecutive epochs without improvement on the development set.We apply dropout with probability 0.1 after the input layer.
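The learning-rate annealing described above can be sketched as standalone schedule logic (a sketch of the halving rule only, not the actual training loop; the dev scores are toy numbers):

```python
def lr_schedule(dev_scores, lr=0.1, patience=3, factor=0.5):
    """Return the learning rate used in each epoch: halve the rate after
    `patience` consecutive epochs without improvement on the dev score."""
    best = float("-inf")
    bad_epochs = 0
    lrs = []
    for score in dev_scores:
        lrs.append(lr)  # rate in effect for this epoch
        if score > best:
            best, bad_epochs = score, 0
        else:
            bad_epochs += 1
            if bad_epochs == patience:
                lr *= factor  # halve after `patience` flat epochs
                bad_epochs = 0
    return lrs

# Dev accuracy stagnates after epoch 2, so the rate halves for epoch 6.
lrs = lr_schedule([0.90, 0.92, 0.92, 0.92, 0.92, 0.93])
```

The same behavior is available in common frameworks as a reduce-on-plateau scheduler; this sketch just makes the halving rule explicit.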

B Language Distances
We report the language rankings of the individual metrics d_P, d*_P, d_V, and d*_V in Table 4.

C Results on NER and POS tagging with FLAIR embeddings
We performed the same experiments as in Section 4.1 with FLAIR embeddings and report the results for NER and POS tagging in Table 5.
In contrast to the BPE experiments reported in the paper, we do not include multilingual embeddings in the Mono + All and meta-embedding versions of FLAIR. The reason is prior experiments in which multilingual embeddings led to reduced performance, which is also reflected in the poor performance of the multilingual FLAIR embeddings alone. It seems that the multilingual BPE embeddings are more effective in downstream tasks than the multilingual FLAIR embeddings.

D Analysis of Attention Weights
As the meta-embeddings assign attention weights to each embedding, we can inspect the importance the models give to the different embeddings. Figure 4 shows the assigned weights for an English sentence. In general, the model assigns most weight to the English embeddings. However, we observe an increased weight for the German and multilingual embeddings for the German word Bayerische. Even though Vereinsbank is also a German word, the model focuses more on English for this word, probably because the subword bank has the same meaning in English.

E Study: Increased Number of Parameters vs. Auxiliary Language
To investigate whether the performance increase comes from the increased number of parameters rather than from the inclusion of more embeddings, we also investigated including the same embedding type twice (Mono + Mono). However, we found that this does not help: the performance is comparable to the monolingual baseline. Thus, the performance increase for Mono + X really comes from the additional information provided by the embeddings of the auxiliary language.

Figure 1: Overview of our model architecture (left). The embedding combination e can be computed either using the concatenation e_CONCAT (middle) or the meta-embedding method e_ATT (right).

Figure 2: Spearman's rank correlation between language distance and model performance rankings for NER and POS tasks for different language distances.

Figure 4: Learned attention weights of the meta-embeddings model with byte-pair encoding embeddings for English NER.

Table 2: Comparison with state of the art.
Table 1: Results of NER (F1, above) and POS (Accuracy, below) experiments with BPE embeddings. † marks models that are statistically significantly different from the best Mono + X model. One box highlights the closest auxiliary language according to the language distance measure d_MV, the other the best auxiliary language according to performance.

Table 3: Language ranking according to the majority voting distance d_MV.
We call the adapted perplexity measure d*_P. Similarly, we adapt d_V to compute the overlap of words in our training data and call it d*_V. Note that both variants, d*_P and d*_V, are based on properties of either our model or our training data and are therefore specific to our setting. Finally, we create a ranking d_MV which combines the rankings from d_P, d*_P, d_V, and d*_V with majority voting. The ranking of d_MV is provided in Table 3; the rankings of the individual distance measures are given in the appendix.

Table 4: Language distances. Languages marked with * are ranked the same.

Table 5: Results of NER (F1, above) and POS (Accuracy, below) experiments with FLAIR embeddings. † marks models that are statistically significantly different from the best Mono + X model. One box highlights the closest auxiliary language according to the language distance measure d_MV, the other the best auxiliary language according to performance.