More Embeddings, Better Sequence Labelers?

Recent work proposes a family of contextual embeddings that significantly improves the accuracy of sequence labelers over non-contextual embeddings. However, there is no definite conclusion on whether we can build better sequence labelers by combining different kinds of embeddings in various settings. In this paper, we conduct extensive experiments on 3 tasks over 18 datasets and 8 languages to study the accuracy of sequence labeling with various embedding concatenations and make three observations: (1) concatenating more embedding variants leads to better accuracy in rich-resource and cross-domain settings and in some conditions of low-resource settings; (2) concatenating contextual sub-word embeddings with contextual character embeddings hurts accuracy in extremely low-resource settings; (3) building on (1), concatenating additional similar contextual embeddings does not lead to further improvements. We hope these conclusions can help build stronger sequence labelers in various settings.


Introduction
In recent years, sequence labelers equipped with contextual embeddings have achieved significant accuracy improvements. Different types of embeddings have different inductive biases that guide the learning process. However, little work has been done on how to concatenate these contextual and non-contextual embeddings to build better sequence labelers in multilingual, low-resource, or cross-domain settings over various sequence labeling tasks. In this paper, we empirically investigate the effectiveness of concatenating various kinds of embeddings for multilingual sequence labeling and try to answer the following questions:

1. In rich-resource settings, does combining different kinds of contextual embeddings result in a better sequence labeler? Are non-contextual embeddings helpful when the models are already equipped with contextual embeddings?
2. When we train models in low-resource and cross-domain settings, do the conclusions from the rich-resource settings still hold?
3. Can sequence labelers automatically learn the importance of each kind of embedding when multiple kinds are concatenated?
Model Architecture

Sequence Labeling
We use the BiLSTM architecture for all sequence labeling tasks, as it is one of the most popular approaches to sequence labeling (Huang et al., 2015; Ma and Hovy, 2016). Given a sentence of $n$ words $\mathbf{x} = \{x_1, \cdots, x_n\}$ and $L$ kinds of embeddings, we feed the sentence into each embedder to generate the $l$-th kind of word embeddings $\{\mathbf{e}^l_1, \cdots, \mathbf{e}^l_n\}$. We concatenate these embeddings to generate the word representations $\{\mathbf{r}_1, \cdots, \mathbf{r}_n\}$ used as input to the BiLSTM layer:

$$\mathbf{r}_i = \mathbf{e}^1_i \oplus \cdots \oplus \mathbf{e}^L_i$$

where $\oplus$ denotes the vector concatenation operation. We feed the word representations into a single-layer BiLSTM to generate a contextual hidden state for each word. We then apply either a Softmax layer (the MaxEnt approach) or a Conditional Random Field layer (the CRF approach) (Lafferty et al., 2001; Lample et al., 2016; Ma and Hovy, 2016) over the hidden states to produce the conditional probability $p(\mathbf{y}|\mathbf{x})$. Given the corresponding sequence of gold labels $\mathbf{y}^* = \{y^*_1, \cdots, y^*_n\}$ for the input sentence, the loss function for a model with parameters $\theta$ is the negative log-likelihood:

$$\mathcal{L}(\theta) = -\log p(\mathbf{y}^*|\mathbf{x}; \theta)$$
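For concreteness, the following is a minimal PyTorch sketch of this architecture with the MaxEnt (softmax) head. It is illustrative rather than our exact experimental code: the embeddings are assumed to be pre-computed tensors, and all class names and dimensions are placeholders.

```python
import torch
import torch.nn as nn

class ConcatBiLSTMTagger(nn.Module):
    """Concatenate L kinds of pre-computed word embeddings, encode with a
    single-layer BiLSTM, and score tags with a softmax (MaxEnt) head."""

    def __init__(self, embed_dims, hidden_size, num_tags):
        super().__init__()
        input_dim = sum(embed_dims)  # r_i = e^1_i ⊕ ... ⊕ e^L_i
        self.bilstm = nn.LSTM(input_dim, hidden_size, num_layers=1,
                              bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden_size, num_tags)

    def forward(self, embeddings_list):
        # embeddings_list: L tensors, each of shape (batch, seq_len, d_l)
        r = torch.cat(embeddings_list, dim=-1)  # word representations
        h, _ = self.bilstm(r)                   # contextual hidden states
        return self.classifier(h)               # per-word tag scores

def loss_fn(logits, gold_tags):
    # negative log-likelihood -log p(y* | x) for the MaxEnt head
    return nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)), gold_tags.view(-1))
```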

Embeddings
There are mainly four kinds of embeddings that have proven effective for sequence labeling: contextual sub-word embeddings (CSEs), contextual character embeddings (CCEs), non-contextual word embeddings (NWEs), and non-contextual character embeddings (NCEs). As we conduct our experiments in multilingual settings, we need to select suitable embeddings from each category for the concatenation.

Experiments and Results
For simplicity, we use M to denote M-BERT embeddings, F to denote Flair embeddings, W to denote fastText embeddings, C to denote non-contextual character embeddings, and All to denote the concatenation of all types of embeddings; the operator "+" denotes the concatenation operation. We use the MaxEnt approach for all experiments. Due to the space limit, some detailed experiment settings, extra experiments, and discussions are included in the appendix.
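To make the notation concrete, the following sketch builds these combinations with the Flair library (our training setup follows the official Flair release; see the appendix). The class and model identifiers are plausible examples for a recent Flair version, not necessarily our exact configuration.

```python
from flair.embeddings import (StackedEmbeddings, TransformerWordEmbeddings,
                              FlairEmbeddings, WordEmbeddings,
                              CharacterEmbeddings)

# M: multilingual BERT sub-word embeddings (CSEs)
M = TransformerWordEmbeddings('bert-base-multilingual-cased')
# F: Flair contextual character embeddings (CCEs), forward and backward
F_fwd, F_bwd = FlairEmbeddings('multi-forward'), FlairEmbeddings('multi-backward')
# W: fastText word embeddings (NWEs); 'en' shown here, swapped per language
W = WordEmbeddings('en')
# C: non-contextual character embeddings (NCEs), trained with the tagger
C = CharacterEmbeddings()

# "+" in our notation is vector concatenation; StackedEmbeddings implements it
M_F_W = StackedEmbeddings([M, F_fwd, F_bwd, W])   # M+F+W
All = StackedEmbeddings([M, F_fwd, F_bwd, W, C])  # All
```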

Rich-resource and Low-resource Settings
How to build better sequence labelers through embedding concatenation in both rich-resource and low-resource settings is the most important concern for users. We report the results of various embedding concatenations for the tasks in Table 1 for rich-resource settings and in Figure 1 for low-resource settings. From the results, we make the following observations.

Observation #1. Concatenating more embedding variants results in better sequence labelers: In rich-resource settings, concatenating more embedding variants (M+F+W and All) results in the best scores in most cases, which indicates that the inductive biases of the various kinds of embeddings help train a better sequence labeler. In low-resource settings, M+F+W and All perform worse than F+W when the number of training sentences is below 100. However, as the training set grows, the gap between these concatenations shrinks; it reverses once the training set exceeds 100 sentences for NER and POS tagging, and it disappears for chunking. A possible reason is that CSEs make the model sample-inefficient, so CSEs require more training samples than CCEs to improve accuracy. This observation suggests that concatenating more embedding variants performs better as long as the training set is not extremely small.

Observation #2. NCEs become less effective when concatenated with CSEs and CCEs: Concatenating NCEs with CSEs only marginally improves accuracy, and there is almost no improvement when NCEs are concatenated with both CSEs and CCEs; however, NCEs do not hurt accuracy either. A possible reason is that CSEs and CCEs largely subsume the information in NCEs.

Observation #3. NWEs are significantly helpful on top of contextual embeddings: Although models based on contextual embeddings have proven stronger than models based on NWEs for sequence labeling, concatenating NWEs with contextual embeddings still improves accuracy significantly. This implies that contextual embeddings capture more contextual information over the input but lack static word information.

From these observations, we find that in most rich-resource and low-resource settings, concatenating all embedding variants, or all variants except NCEs, is the simplest recipe for a better sequence labeler.

Cross-domain Settings
Another concern for users is building better sequence labelers not only in in-domain settings but also in cross-domain settings. The results (Table 2) are almost consistent with the rich-resource settings, suggesting that concatenating more embedding variants results in better sequence labelers in cross-domain settings as well.

Importance of Embeddings
To study the effectiveness of concatenating embeddings from another perspective, we preserve only one kind of embedding in All and mask out the other embeddings with zeros to study how much the models rely on each kind of embedding. To remove the impact of differing embedding dimensions, we train the model with each kind of embedding linearly projected to the same dimension of 4096. The results (Figure 2) show that the accuracy of each preserved embedding correlates positively with the results in Table 1. For example, M achieves higher accuracy than the other embeddings in NER, and Table 1 likewise shows that the model with F alone performs worse than the model with M alone. The models with concatenated embeddings rely very little on NCEs and rely mostly on CSEs or CCEs depending on the task. These results show that models with concatenated embeddings can extract helpful information from each kind of embedding to improve accuracy.

Table 4: Comparison of F+W, All, and F+W+proj (F+W with the hidden size linearly projected to the hidden size of All) on the three tasks in the 10-sentence low-resource setting. The accuracy is averaged over tasks.
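The probing setup above can be sketched as follows. This is an illustrative PyTorch snippet (the 4096 projection dimension is from our setup; everything else is a placeholder), not our exact experimental code.

```python
import torch
import torch.nn as nn

class ProjectedConcat(nn.Module):
    """Project each embedding type to a common dimension (4096 in our setup)
    before concatenation, so that masking out all but one type compares the
    embedding kinds on an equal footing."""

    def __init__(self, embed_dims, proj_dim=4096):
        super().__init__()
        self.projections = nn.ModuleList(
            nn.Linear(d, proj_dim) for d in embed_dims)

    def forward(self, embeddings_list, keep=None):
        projected = [p(e) for p, e in zip(self.projections, embeddings_list)]
        if keep is not None:
            # mask out every embedding type except the one being probed
            projected = [x if i == keep else torch.zeros_like(x)
                         for i, x in enumerate(projected)]
        return torch.cat(projected, dim=-1)  # input to the BiLSTM tagger
```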

On Concatenating Similar Embeddings
Since concatenating more embedding variants results in better sequence labelers, we additionally concatenate multilingual Flair embeddings (M-Flair) or English BERT embeddings (En-BERT) with the All embeddings to test whether concatenating more embeddings of the same category can further improve accuracy. We evaluate the addition of En-BERT on English and of M-Flair on all languages in each task. The results are shown in Table 3. Additionally concatenating embeddings of the same category does not further improve accuracy in most cases; the only exception is concatenating En-BERT on English WikiAnn NER. A possible reason for this improvement is that BERT is trained on the same domain (Wikipedia) as WikiAnn, so the inductive biases of the BERT embeddings help improve accuracy. For comparison, we conduct the same concatenation on the CoNLL English NER dataset. The results in Table 7 show that concatenating En-BERT with All does not further improve accuracy on CoNLL English NER.

English BERT vs. M-BERT
We use English BERT embeddings instead of M-BERT embeddings to see whether language-specific CSEs change our observations. The results (Table 5) show that our observations hold in both rich-resource and low-resource settings. Using language-specific BERT embeddings can even yield better sequence labelers for the POS tagging and chunking tasks in rich-resource settings.

Hidden Sizes and Accuracy
In the 10-sentence low-resource setting, we find that models with All perform worse than models with F+W. One possible concern is whether the larger hidden size of All introduces more parameters and makes the model over-fit the training set. To test this, we linearly project the hidden size of F+W (4396) to the same hidden size as All (5214). Table 4 shows that with this linear projection, F+W performs even better. Therefore, the inferior accuracy of All is caused not by over-fitting but possibly by the sample inefficiency of CSEs.
Another concern is whether we can project each embedding to a larger hidden size to improve accuracy. Since we already project each embedding to 4096 dimensions in F+W+proj (Section 3.4), we further project each embedding variant to see how the projection affects accuracy in rich-resource settings. The results (Table 6) show that the linear projection of each embedding significantly decreases the accuracy of the models.
From the two experiments, we find that the hidden sizes of concatenated embeddings do not impact the observations.

Conclusion
In this paper, we analyze how to build a better sequence labeler by concatenating various kinds of embeddings. We make several empirical observations that we hope can guide future work on building better sequence labelers: (1) in most settings, concatenating more embedding variants leads to better results, while in extremely low-resource settings, using only CCEs and NWEs performs better; (2) NCEs become less effective when concatenated with contextual embeddings, while NWEs remain beneficial; (3) neural models can automatically learn which embeddings are beneficial to the task; (4) additionally concatenating contextual embeddings similar to those in the best concatenations from (1) cannot further improve accuracy in most cases.

A Appendix
In this appendix, we use ISO 639-1 codes to represent each language for brevity.

A.1 Settings
Datasets. We use 18 datasets across 8 languages covering the three tasks (NER, POS tagging, and chunking); the NER datasets include WikiAnn and CoNLL.

Training. We run our experiments on a GPU server with an NVIDIA Tesla V100 GPU. For model training, we set the mini-batch size to 2,000 tokens for better GPU utilization. Following the official release of Flair, we use an SGD optimizer with a learning rate of 0.1 to train all models and set the hidden size of the BiLSTM to 256. We anneal the learning rate by 0.5 if there is no improvement on the development set for 10 epochs (rich-resource) or 100 epochs (low-resource). We fix these hyper-parameters for all experiments because we find that tuning them does not change our observations and usually results in lower accuracy. We average over 5 runs for each experiment and report the macro-average score over all languages for each task.
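For reference, a sketch of this training configuration in Flair follows. The trainer API details vary across Flair versions; the dataset loader and sentence-level batching below are simplifications, since our experiments batch by roughly 2,000 tokens.

```python
from flair.datasets import CONLL_03
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

corpus = CONLL_03()  # placeholder corpus; our experiments cover 18 datasets
tag_dictionary = corpus.make_label_dictionary(label_type='ner')

tagger = SequenceTagger(hidden_size=256,   # BiLSTM hidden size from above
                        embeddings=All,    # e.g. the embedding stack built earlier
                        tag_dictionary=tag_dictionary,
                        tag_type='ner',
                        use_crf=False)     # MaxEnt (softmax) head

trainer = ModelTrainer(tagger, corpus)
trainer.train('resources/taggers/concat',
              learning_rate=0.1,   # SGD, as in our setup
              anneal_factor=0.5,   # halve the LR when the dev score plateaus
              patience=10,         # 10 epochs rich-resource, 100 low-resource
              mini_batch_size=32)  # sentences; we batch by ~2,000 tokens instead
```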

A.2 Detailed Results
For the models using the CRF layer, we plot the results in the rich-resource and low-resource settings in Figure 3, analogously to the main paper. The figures show trends similar to those of the MaxEnt models, indicating that the output structure does not affect our observations. Table 10 shows the importance of each kind of embedding for each language and task (Section 3.4 in the main paper). Tables 11, 13, and 15 show the average scores over each language for each task in the rich-resource and low-resource settings (Section 3.2); Tables 12, 14, and 16 show the corresponding scores for the models with the CRF layer. Table 9 shows the average scores for each language in our cross-domain experiments (Section 3.3). Table 17 shows the detailed comparison when additionally concatenating M-Flair embeddings with All for all datasets (Section 3.5).