Low-Resource Sequence Labeling via Unsupervised Multilingual Contextualized Representations

Previous work on cross-lingual sequence labeling tasks either requires parallel data or bridges the two languages through word-by-word matching. Such requirements and assumptions are infeasible for most languages, especially for languages with large linguistic distances, e.g., English and Chinese. In this work, we propose a Multilingual Language Model with deep semantic Alignment (MLMA) to generate language-independent representations for cross-lingual sequence labeling. Our methods require only monolingual corpora with no bilingual resources at all and take advantage of deep contextualized representations. Experimental results show that our approach achieves new state-of-the-art NER and POS performance across European languages, and is also effective on distant language pairs such as English and Chinese.


Introduction
Sequence labeling tasks such as named entity recognition (NER) and part-of-speech (POS) tagging are fundamental problems in natural language processing (NLP). Recent sequence labeling models achieve state-of-the-art performance by combining both character-level and word-level information (Chiu and Nichols, 2016; Ma and Hovy, 2016; Lample et al., 2016). However, these models rely heavily on large-scale annotated training data, which is not available for most languages. Cross-lingual transfer learning addresses this label scarcity problem by transferring annotations from high-resource languages (source languages) to low-resource languages (target languages). In this scenario, a major challenge is how to bridge inter-lingual gaps with modest resource requirements.
There is a large body of work exploring cross-lingual transfer through language-independent features, such as morphological features and universal POS tags for cross-lingual NER (Tsai et al., 2016) and dependency parsers (McDonald et al., 2011). However, these approaches require linguistic knowledge for language-independent feature engineering, which is expensive in low-resource settings. Other work relies on bilingual resources to transfer knowledge from source languages to target languages. Parallel corpora are widely used to project annotations from the source to the target side (Yarowsky et al., 2001; Ehrmann et al., 2011; Kim et al., 2012; Wang and Manning, 2014). These methods can achieve strong performance, but require a large amount of bilingual data, which is scarce in low-resource settings.
Recent research leverages cross-lingual word embeddings (CLWEs) to establish inter-lingual connections and reduce the requirements of parallel data to a small lexicon or even no bilingual resource (Ni et al., 2017;Fang and Cohn, 2017;Xie et al., 2018). However, word embedding spaces may not be completely isomorphic due to language-specific linguistic properties, and therefore cannot be perfectly aligned. For example, different from English, Chinese nouns do not distinguish singular and plural forms, while Spanish nouns distinguish masculine and feminine.
On the other hand, NER tags such as person names, organizations, and locations are shared across different languages. Language-independent frameworks such as universal conceptual cognitive annotation (Abend and Rappoport, 2013), universal POS (Petrov et al., 2011a), and universal dependencies (Nivre et al., 2016) are defined to represent different languages in a unified form. This line of work motivates our assumption that the semantic meanings of words from different languages can be roughly aligned at a conceptual level, and that it is more reasonable to align deep semantic representations than shallow word embeddings. Meanwhile, monolingual contextualized embeddings derived from language models have been shown to be effective for extracting semantic information and have achieved significant improvements on several NLP tasks (Peters et al., 2018).
In this paper, we propose a Multilingual Language Model with deep semantic Alignment (MLMA). We train MLMA on monolingual corpora from each language and align its internal states across different languages. MLMA is then utilized to generate language-independent representations and to bridge the gaps between high-resource and low-resource languages. For evaluation, we conduct extensive experiments on NER and POS benchmark datasets under cross-lingual settings. The experimental results show that our methods achieve substantial improvements compared with previous state-of-the-art methods on European languages. We also validate our approaches on a distant language pair, English-Chinese, where our results are competitive with previous methods that use large-scale parallel corpora. Our contributions are as follows:
1. Instead of word-level alignment, we propose MLMA, which uses contextualized representations to bridge inter-lingual gaps.
2. We propose three methods to align contextualized representations without any bilingual resources.
3. Our methods achieve new state-of-the-art performance on cross-lingual NER and POS tasks in European languages, and very competitive results for English-Chinese NER, where previous work uses large parallel data.

Approach
Our approach belongs to model transfer (Section 5.2) and mainly consists of three steps:
1. Training a multilingual language model with alignment (MLMA) on monolingual corpora of the source and target languages. (Sections 2.1, 2.2, and 2.3)
2. Building a cross-lingual sequence labeling model based on the language-independent representations from the MLMA. (Sections 2.4 and 2.5)
3. Learning the cross-lingual sequence labeling model (with MLMA fixed) on annotated data from the source languages and directly applying it to the target languages.
The architecture of MLMA is shown in Figure 1. In the following sections, we focus on Steps 1 and 2. We first present the architecture of MLMA and describe how we build the unsupervised multilingual alignment. Next, we propose effective methods for collapsing the multi-layer hidden states of MLMA into a single representation. Finally, we introduce the sequence labeling model used in the experiments.

Language Model Architecture
MLMA is a language model with a multi-head self-attention mechanism (Vaswani et al., 2017). The architecture is similar to Radford et al. (2018), except that we combine both a forward and a backward Transformer decoder to build a bidirectional language model. Taking the forward direction as an example, given a sentence with N tokens $W = [w_1, w_2, \cdots, w_N]^T$ as input, we first map the sequence of tokens $W$ to token embeddings:

$$\overrightarrow{H}_0 = W E_e + E_p \quad (1)$$

where $E_e$ and $E_p$ are the embedding matrix and the positional encoding matrix, and $d$ is the dimension of embeddings and hidden states. Then $n$ blocks of Transformer layers are stacked above the token embeddings:

$$\overrightarrow{H}_l = \mathrm{TransformerBlock}(\overrightarrow{H}_{l-1}), \quad l = 1, \ldots, n \quad (2)$$

where $\overrightarrow{H}_l$ refers to the output of the $l$-th Transformer block. Each block contains a masked multi-head self-attention and a position-wise feedforward layer; the detailed implementation is the same as in Vaswani et al. (2017). Finally, the output distribution over the next tokens is calculated through a softmax function with a tied embedding matrix:

$$\overrightarrow{P} = \mathrm{softmax}(\overrightarrow{H}_n E_e^T) \quad (3)$$
For the backward direction, we calculate $\overleftarrow{H}_l$ and $\overleftarrow{P}$ in an analogous way. Finally, we jointly minimize the negative log likelihood of the forward and backward directions:

$$\mathcal{L}_{lm} = -\sum_{t=1}^{N} \big( \log p(w_t \mid w_1, \ldots, w_{t-1}) + \log p(w_t \mid w_{t+1}, \ldots, w_N) \big) \quad (4)$$

In a multilingual setting, we share all parameters in the Transformer layers across different languages to facilitate language-agnostic representations, except that we adopt an individual embedding matrix $E_e$ for each language.

[Figure 1: The architecture of MLMA consists of language-specific embedding layers and language-agnostic Transformer layers. MLMA is jointly learned through language modeling loss and alignment loss, and its internal representations are utilized to bridge the gap between source and target languages.]
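As a concrete illustration, the joint bidirectional objective can be sketched as follows. This is a minimal NumPy sketch, not the paper's implementation: the probability tables and function names are illustrative, and the per-position distributions stand in for the forward and backward Transformer outputs.

```python
import numpy as np

def bidirectional_lm_loss(fwd_probs, bwd_probs, tokens):
    """Joint negative log likelihood of a bidirectional LM.

    fwd_probs[t] is a distribution over token t given w_1..w_{t-1};
    bwd_probs[t] is a distribution over token t given w_{t+1}..w_N.
    (Toy interface for illustration only.)
    """
    nll = 0.0
    for t, w in enumerate(tokens):
        # Sum the forward and backward NLL terms for each position.
        nll -= np.log(fwd_probs[t][w]) + np.log(bwd_probs[t][w])
    return nll
```

In training, this loss would be computed per language and combined with the alignment regularizer described below.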

Unsupervised Distribution Alignment
We find that sharing Transformer layers alone is not enough to force hidden representations from different languages into a common space, as our experiments suggest (Section 3.4). Therefore, we propose three methods to build cross-lingual representations, based on identical strings, mean/variance, and average linkage.
To simplify the description, we take the alignment between two languages s and t as an example, but our methods can be directly extended to a scenario with multiple languages by adding the alignment between each pair of languages.

Notation
For the language model, given a sentence with N tokens, the forward internal representation $\overrightarrow{H}_l$ in Eq (2) can be expanded as

$$\overrightarrow{H}_l = [\overrightarrow{h}_{l,1}, \overrightarrow{h}_{l,2}, \cdots, \overrightarrow{h}_{l,N}]^T$$

where $\overrightarrow{h}_{l,k}$ refers to the forward hidden representation of the $k$-th token in the sentence. Then we concatenate the forward and backward hidden representations for each token, $h_{l,k} = \overrightarrow{h}_{l,k} \oplus \overleftarrow{h}_{l,k}$. We denote the collection of the token representations $h_{l,k}$ at layer $l$ from the whole corpus of language $s$ as $C^s_l$, which can be regarded as a sample from the deep semantic space of language $s$. Similarly, $C^t_l$ is used for language $t$.

Identical Strings
Similar language pairs such as English and Spanish have a large number of identical strings shared between their vocabularies, which are utilized as the seed dictionary for embedding alignment in previous work (Smith et al., 2017). Similarly, we treat identical strings as explicit supervision signals and align the embeddings of identical strings between different languages. The matching of the embeddings from different languages will lead to an implicit alignment of internal representations.
In the experiments, we directly minimize the Euclidean distance between the embeddings of each identical string across different languages:

$$\mathcal{L}_{iden} = \frac{\lambda_{id}}{|W^{(s,t)}_{iden}|} \sum_{w \in W^{(s,t)}_{iden}} \| e^s_w - e^t_w \|_2$$

where $W^{(s,t)}_{iden}$ is the set of identical strings shared by the vocabularies of languages $s$ and $t$, and $|W^{(s,t)}_{iden}|$ refers to the number of its members. $\lambda_{id}$ is a scaling weight, and $e^s_w$ ($e^t_w$) is the embedding of word $w$ from the embedding matrix $E^s_e$ ($E^t_e$) of language $s$ ($t$).
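The identical-string regularizer can be sketched as follows. This is an illustrative NumPy version with toy embedding matrices and vocabulary dictionaries; the names are our own, not the paper's.

```python
import numpy as np

def identical_string_loss(E_s, E_t, vocab_s, vocab_t, lam_id=100.0):
    """L_iden: scaled mean Euclidean distance between the embeddings of
    strings shared by both vocabularies (illustrative sketch).

    E_s, E_t: (vocab, d) embedding matrices; vocab_s, vocab_t map
    strings to row indices; lam_id is the scaling weight lambda_id.
    """
    shared = [w for w in vocab_s if w in vocab_t]
    if not shared:
        return 0.0
    dists = [np.linalg.norm(E_s[vocab_s[w]] - E_t[vocab_t[w]])
             for w in shared]
    return lam_id * float(np.mean(dists))
```

Pulling the embeddings of shared strings together implicitly aligns the Transformer states built on top of them.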

Mean and Variance
In this section, we propose another approach that directly aligns the distributions of internal representations between different languages. In particular, we leverage the mean and variance of the internal distributions for alignment. We denote the mean and variance of $C^s_l$ as $m^s_l$ and $v^s_l$; similarly, $m^t_l$ and $v^t_l$ refer to the mean and variance of $C^t_l$. We minimize the Euclidean distance between the mean and variance of language $s$ and language $t$ for all layers:

$$\mathcal{L}_{mv} = \sum_{l} \left( \lambda^m_l \frac{\| m^s_l - m^t_l \|_2}{|m^s_l| + |m^t_l|} + \lambda^v_l \frac{\| v^s_l - v^t_l \|_2}{|v^s_l| + |v^t_l|} \right)$$

where $\lambda^m_l$ and $\lambda^v_l$ are scaling weights, and $|\cdot|$ is the L1 norm of a vector. Without the denominators, the model could escape this regularization by learning a mean and variance with low absolute values. In practice, rather than calculating the mean and variance over the whole source and target corpora, we use the mean and variance of the source and target inner states $h_{l,k}$ in the current mini-batch as an approximation.
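A per-layer, per-mini-batch version of this regularizer can be sketched as below. This is an assumption-laden NumPy sketch (names and default weights are illustrative), not the paper's code; it treats the rows of each matrix as the batch of hidden states for one layer and one language.

```python
import numpy as np

def mean_var_loss(H_s, H_t, lam_m=0.1, lam_v=0.01):
    """Mean/variance alignment for one layer on the current mini-batch.

    H_s, H_t: (batch, 2d) concatenated hidden states for the source and
    target language. The L1-norm denominators stop the model from
    escaping the regularizer by shrinking both distributions to zero.
    """
    m_s, m_t = H_s.mean(axis=0), H_t.mean(axis=0)
    v_s, v_t = H_s.var(axis=0), H_t.var(axis=0)
    m_term = np.linalg.norm(m_s - m_t) / (np.abs(m_s).sum() + np.abs(m_t).sum())
    v_term = np.linalg.norm(v_s - v_t) / (np.abs(v_s).sum() + np.abs(v_t).sum())
    return lam_m * m_term + lam_v * v_term
```

The full loss would sum this quantity over all layers l with layer-specific weights.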

Average Linkage
In this method, we employ another metric, average linkage, to perform a more precise point-wise matching. Average linkage is a widely used metric for calculating the similarity of clusters and networks (Yim and Ramdeen, 2015; Seifoddini, 1989; Newman, 2012; Moseley and Wang, 2017). It is sensitive to the shape of a distribution and thus serves as a better choice than mean and variance. Average linkage measures the similarity of two sets X and Y by calculating the average distance between all members of each set:

$$\mathrm{avl}(X, Y) = \frac{1}{n_X n_Y} \sum_{x \in X} \sum_{y \in Y} f(x, y)$$

where $n_X$ ($n_Y$) is the number of members in X (Y), and $f$ is a distance function. We take the Euclidean distance as the distance function $f$ and minimize the average linkage between $C^s_l$ and $C^t_l$:

$$\mathcal{L}_{avl} = \sum_{l} \lambda^{al}_l \left[ \frac{\mathrm{avl}(C^s_l, C^t_l)}{\mathrm{avl}(C^s_l, C^s_l) + \mathrm{avl}(C^t_l, C^t_l)} \right]$$

Similarly to the mean/variance case, the terms $\mathrm{avl}(C^s_l, C^s_l)$ and $\mathrm{avl}(C^t_l, C^t_l)$ are used to prevent the model from escaping this regularization. In practice, we calculate $\mathcal{L}_{avl}$ between the source and target inner states $h_{l,k}$ inside the mini-batch as an approximation.
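The computation for one layer can be sketched as follows. This is a NumPy sketch under assumptions: we take the self-linkage terms to act as a normalizing denominator, mirroring the role of the denominators in the mean/variance regularizer, and the function names are our own.

```python
import numpy as np

def avl(X, Y):
    """Average linkage: mean pairwise Euclidean distance between X and Y."""
    # Broadcasting (n_X, 1, d) - (1, n_Y, d) yields all pairwise differences.
    diffs = X[:, None, :] - Y[None, :, :]
    return float(np.sqrt((diffs ** 2).sum(-1)).mean())

def avl_loss(C_s, C_t, lam=1.0):
    """L_avl for one layer: cross-lingual average linkage normalized by
    the within-language linkages, so the model cannot trivially satisfy
    the regularizer by shrinking all representations (illustrative)."""
    return lam * avl(C_s, C_t) / (avl(C_s, C_s) + avl(C_t, C_t))
```

In practice this would run over the hidden states of one mini-batch rather than the whole corpora.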
The regularization $\mathcal{L}_{avl}$ is similar to the maximum mean discrepancy (MMD), which is often employed in domain adaptation (Tzeng et al., 2014; Long et al., 2015) and style transfer (Li et al., 2017) for images. However, different from MMD, our method directly uses the Euclidean distance instead of a kernel function.

Training of MLMA
During the training stage of MLMA, we sample an equal number of sentences from the monolingual corpora of each language for each mini-batch. MLMA is then optimized through a combination of the language modeling loss $\mathcal{L}_{lm}$ and the alignment regularization loss $\mathcal{L}_{reg}$; for each alignment method, we use its corresponding alignment loss:

$$\mathcal{L} = \sum_{i} \lambda^{lm}_i NLL_i + \mathcal{L}_{reg}$$

where $\lambda^{lm}_i$ is used for balancing the convergence speed of different languages, and $NLL_i$ is the negative log likelihood of language $i$ in Eq (4).
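The combined objective is a simple weighted sum, sketched below (names and the default weight of 1.0 per language follow the experimental setup; this is an illustration, not the training code).

```python
def mlma_loss(nll_per_lang, reg_loss, lm_weights=None):
    """Total MLMA objective: weighted per-language LM losses plus the
    alignment regularizer (L_iden, L_mv, or L_avl) chosen for the run.
    Illustrative sketch; weights default to 1.0 as in the experiments."""
    if lm_weights is None:
        lm_weights = [1.0] * len(nll_per_lang)
    return sum(w * nll for w, nll in zip(lm_weights, nll_per_lang)) + reg_loss
```

For example, two languages with NLLs 2.0 and 3.0 and a regularizer of 0.5 yield a total loss of 5.5 under unit weights.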

Cross-lingual Representations
After the MLMA is trained, we fix its parameters and extract the hidden states as cross-lingual contextualized representations (CLCRs). In this section, we propose two effective strategies for integrating these multi-layer, high-dimensional representations into downstream models.

Self-Weighted Sum. For each token, we concatenate all layers of hidden states and feed them into a multi-layer perceptron (MLP) to calculate an $(n+1)$-dimensional weight vector, $s = \mathrm{softmax}(\mathrm{MLP}(h_{0,k} \oplus \cdots \oplus h_{n,k}))$. Then we calculate a weighted sum of these layers according to the weight vector, $\mathrm{CLCR}_k = \sum_{l=0}^{n} s_l \cdot h_{l,k}$.

Fully-Weighted Sum. We introduce a weight matrix $F \in \mathbb{R}^{(n+1) \times 2d}$ with separate weights for each hidden dimension. The weight matrix $F$ is softmax-normalized by column and used to calculate a weighted sum of all layers for each hidden dimension, $\mathrm{CLCR}_k = \sum_{l=0}^{n} F_l \odot h_{l,k}$, where $\odot$ is the element-wise product.
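The two collapsing strategies can be sketched as below. This is an illustrative NumPy version: `mlp` stands in for the learned perceptron, and all names are ours rather than the paper's.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_weighted_sum(H_k, mlp):
    """SWS: one scalar weight per layer, recomputed for each token.

    H_k: (n+1, 2d) hidden states of a single token across all layers;
    mlp maps their concatenation to n+1 logits (stand-in callable).
    """
    s = softmax(mlp(H_k.reshape(-1)))        # (n+1,) layer weights
    return (s[:, None] * H_k).sum(axis=0)    # (2d,) collapsed CLCR

def fully_weighted_sum(H_k, F):
    """FWS: a separate weight per layer AND per hidden dimension.

    F: (n+1, 2d) weight matrix, softmax-normalized over the layer axis.
    """
    return (softmax(F, axis=0) * H_k).sum(axis=0)
```

With a zero-logit MLP or an all-zero F, both reduce to a uniform average over layers, which makes the behavior easy to check.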
The parameters of the MLP and F are trained during the learning of sequence labeling model.

Sequence Labeling Model
The sequence labeling model is then built on the CLCRs. For both NER and POS tasks, we use an LSTM-CRF model following Lample et al. (2016), which consists of a character-level LSTM, a word-level LSTM, and a linear-chain CRF.
More specifically, we are given a sequence of words $[w_1, w_2, \ldots, w_N]$, where each word $w_k$ is composed of a sequence of characters $[c_{k,1}, c_{k,2}, \ldots, c_{k,m}]$. First, for each word $w_k$, the character-level LSTM takes its character sequence as input and outputs a vector $e_k$ to represent the word. Then the pre-trained $\mathrm{CLCR}_k$ is concatenated with $e_k$ to form a word-level embedding $x_k$. Finally, the sequence of word-level embeddings $[x_1, x_2, \ldots, x_N]$ is fed into the word-level LSTM, and the linear-chain CRF is employed to predict the probability distribution over all possible output label sequences.
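The input construction for the word-level LSTM can be sketched as a simple concatenation (illustrative NumPy sketch; the function name and shapes are assumptions for this example):

```python
import numpy as np

def word_level_inputs(char_reprs, clcrs):
    """Build x_k = e_k (+) CLCR_k for every word in a sentence.

    char_reprs: (N, d_char) outputs of the character-level LSTM;
    clcrs: (N, 2d) fixed pre-trained contextualized representations.
    Returns an (N, d_char + 2d) matrix fed to the word-level LSTM.
    """
    return np.concatenate([char_reprs, clcrs], axis=1)
```

Since the CLCRs stay fixed during training, only the character-level LSTM, the word-level LSTM, and the CRF receive gradient updates.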

Experiments
We first introduce the datasets used in the experiment and then the implementation details of our models, before presenting the results on NER and POS tasks.
In all cases, the sequence labeling model is trained on the source language (English) training data and is tested on the target language test data.

Details of MLMA
We adopt a 6-layer bi-directional Transformer decoder with 8 attention heads. The dimensions of the hidden states and inner states are 512 and 2048, respectively. The dropout rates after attention and residual connections are both 0.1. We use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.0001 and a gradient clipping norm of 5.0. The vocabulary size of each language is 200,000, and we train the model with a sampled softmax (Jean et al., 2015) of 8,192 samples. We only keep sentences containing fewer than 200 tokens for training and group them into batches by length. Each batch contains around 4,096 tokens for each language. The language modeling weight $\lambda^{lm}_i$ is set to 1.0 for each language. For alignment, $\lambda^m_l$, $\lambda^v_l$, and $\lambda^{al}_l$ are set to 0.1, 0.01, and 1.0 for every layer $l$, and $\lambda_{id}$ is set to 100.
For languages except English, the latest dump of Wikipedia is used as monolingual corpora. For English, we use 1B Word Benchmark (Chelba et al., 2013) to reduce the effects of potential internal alignment in Wikipedia (Zirikly and Hagiwara, 2015;Tsai et al., 2016).
All characters are lowercased, and Chinese text is converted to simplified characters using OpenCC. The corpora of the European languages are tokenized with NLTK (Loper and Bird, 2002), and the Chinese text is segmented with LTP.

Details of Sequence Labeling Model
In our experiments, we set the hidden sizes of the word-level LSTM and the character-level LSTM to 300 and 100, respectively. The character embedding size is set to 100. We apply dropout with a rate of 0.5 at both the input and the output of the word-level LSTM to prevent overfitting. We train the sequence labeling model for 20 epochs using the Adam optimizer with a batch size of 20 and perform early stopping when there is no improvement for 3 epochs. We set the initial learning rate to 0.001 and decay it by a factor of 0.1 each epoch. We do not update the pre-trained cross-lingual deep representations from MLMA during training. We run each model five times and report the mean and standard deviation. We disable the character-level LSTM for English-German and English-Chinese NER, as German and Chinese have different character patterns from English. For POS, we disable the character-level LSTM following Fang and Cohn (2017).

Results for NER
We first train a multilingual language model without alignment (MLM) and report its performance on cross-lingual NER in Table 1. The poor performance demonstrates that merely sharing part of the parameters in a language model is far from enough for cross-lingual transfer.
As shown in Table 1, the mean/variance alignment strategy (MLMA-Mv) is competitive with previous work that utilizes extra bilingual resources (Section 5.1). The average linkage strategy (MLMA-Avl) performs a more precise alignment and gains a further improvement. We also experimented with using all three alignments together; the results show no significant improvement over average linkage alone. These results agree with our claim that average linkage performs a more precise matching and thus subsumes the benefits brought by the other methods.
To demonstrate the strengths of the proposed cross-lingual contextualized representations (CLCRs) over cross-lingual word embeddings (CLWEs), we also report the results of using CLWEs for direct model transfer in Table 1. Specifically, we compare with the unsupervised method MUSE from Conneau et al. (2017), whose results demonstrate its effectiveness for cross-lingual sequence labeling. The alignment method using identical strings (MLMA-Iden) outperforms MUSE, suggesting that contextual-level representations are more effective than word-level ones. The other proposed methods (MLMA-Mv and MLMA-Avl) achieve significant improvements over MUSE and MLMA-Iden, which shows the benefit of directly aligning the contextualized representations.

Combination with CLWEs. We further demonstrate that CLWEs are compatible with our methods by using MUSE embeddings to initialize the embedding layer of our multilingual language model. The results of MLMA-Avl (init) in Table 1 indicate that the CLWEs provide a better initialization and improve performance.

Multi-source Transfer. We conduct multi-source transfer experiments based on MLMA-Avl and report the performance as MLMA-Avl (multi) in Table 1. The experimental settings largely follow Mayhew et al. (2017), who employ two source languages for each target language and use syntactic features to choose related source languages. For Spanish and German, we use English and Dutch as source languages; English and Spanish are adopted for Dutch. The multi-source transfer leads to a significant improvement for Spanish and German, but a slight decline for Dutch. In a follow-up experiment, we find that a model trained on the Spanish training set achieves poor cross-lingual performance on Dutch. Similar results are observed in the experiments of Spanish to English and Dutch to English. These results suggest that cross-lingual transfer may be directional, and we leave this issue for future work.

Comparison with BERT. We also compare the performance of our MLMA with the released multilingual BERT (Devlin et al., 2018). As shown in Table 1, our MLMA-Avl achieves better performance on Spanish and Dutch. For German, BERT achieves high performance as it exploits effective subword information through BPE; the architecture of BERT also performs better than the LSTM.

[Table 1 caption fragment: "(2017) report different results according to different resource requirements. We only list their best results in each setting. Methods with mark , †, ‡ require parallel corpora, bilingual lexicons, and training data respectively."]
It is worth mentioning that, in previous work and this work, the corpora used in the experiments are limited to the source and the target language. In contrast, the multilingual BERT is jointly learned on Wikipedia of 102 languages and may benefit from a multi-hop transfer. BERT employs a shared BPE vocabulary for different languages, which implicitly performs a subword alignment similar to MLMA-Iden. Meanwhile, the proposed MLMA-Mv and MLMA-Avl methods are compatible with BERT and can be used to align the inner states of BERT.

A Case Study of Chinese NER
We conduct experiments to evaluate our approaches on a distant language pair, English-Chinese. The experimental results are shown in Table 2. Wang and Manning (2014) utilize 80K parallel sentences for annotation projection and report strong performance. As Chinese and English do not share an alphabet, the number of identical strings is significantly smaller than for similar language pairs such as English-Spanish. Therefore, MLMA-Iden achieves a lower result compared with MUSE, which uses adversarial training. The MLMA-Avl method performs a direct alignment of internal representations and achieves a significant improvement over the word-level methods. The initialization from CLWEs also proves effective for distant language pairs, gaining a further improvement and reaching a result comparable to Wang and Manning (2014). This experiment suggests that cross-lingual transfer remains challenging between distant language pairs.

Results for POS
We evaluate our methods on another sequence labeling task, POS tagging, and the results are shown in Table 3. We compare with previous studies using unsupervised cross-lingual clustering (Fang and Cohn, 2017) and large-scale parallel corpora (Das and Petrov, 2011). As shown in Table 3, our models with deep semantic alignment outperform previous lexicon-based cross-lingual clustering by a large margin. Compared with the previous method that uses a small amount of training data, the MLMA-Avl method obtains improved accuracy without any training data in the target languages. For further comparison, we also list the performance of applying the method from Xie et al. (2018) and multilingual BERT to the POS task. POS tagging mainly relies on the information of each single word, and parallel corpora providing word alignments are effective for cross-lingual POS. Thus, previous annotation projection methods based on parallel corpora are strong approaches for cross-lingual POS and often achieve significantly better performance than previous unsupervised methods. The experimental results show that the proposed CLCRs are competitive and even achieve better average accuracy.

Self-Weighted vs. Fully-Weighted Sum
As shown in Tables 1, 2, and 3, we observe that the Self-Weighted Sum (SWS) generally outperforms the Fully-Weighted Sum (FWS) on NER tasks, while the opposite is true for POS tasks. SWS allows weights to vary at each position in a sequence, while FWS imposes adaptive weights on each hidden dimension. We hypothesize that NER is more context-sensitive and requires models to adapt to different context information, which makes SWS the better option. On the other hand, the POS of a word is more independent of its context, but certain feature dimensions in the contextualized representations may be critical for making a judgment. Therefore, FWS has the edge over SWS thanks to its ability to pick out these dimensions.

What is Connected during Alignment?
In this section, we look inside the MLMA and investigate what is connected between different languages during the alignment. From the English 1B Word Benchmark and Spanish Wikipedia, we randomly select 1,000 sentences for each language and extract their cross-lingual contextualized representations using our MLMA-Avl model. We calculate the nearest neighbors in cosine distance for each word, and some of them are listed in Table 4.
In these cases, the MLMA can disambiguate word senses according to context information. For example, for the word brown in English, the MLMA groups the color brown with verde (green), and the name Brown with Neira (a Spanish person name) in the Spanish corpus. The proposed method differs from unsupervised translation in that, instead of learning a precise matching between English and Spanish words, the CLCRs establish a high-level semantic connection between the source and the target language. The next example demonstrates that the MLMA is able to distinguish the part of speech of words: it connects the English verb chair with the Spanish verb presidir (preside), and the noun chair with the noun asiento (seat). For comparison with unsupervised cross-lingual word embeddings, we list the top 5 similar words calculated using MUSE. As shown in Table 4, MUSE successfully groups the English word brown with Spanish words related to colors. However, without the help of contextual information, its ability to disambiguate word senses is limited.

Related Work
Previous work in cross-lingual transfer learning can be roughly divided into two main branches: annotation projection and model transfer.

Annotation Projection
In annotation projection approaches, parallel or comparable corpora are commonly used (Yarowsky et al., 2001; Ehrmann et al., 2011; Das and Petrov, 2011; Li et al., 2012; Täckström et al., 2013; Wang and Manning, 2014; Ni et al., 2017). The source-language sentences of parallel corpora are first annotated, either manually or by a pretrained tagger. Then, annotations on the source side are projected to the target side through word alignment to generate distantly supervised training data. Finally, a model of the target language is trained on the generated data. Wikipedia contains multilingual articles on various topics and can thus be used to generate parallel/comparable corpora or even weakly annotated target-language sentences (Kim et al., 2012).
However, parallel corpora and Wikipedia can be rare for truly low-resource languages. Mayhew et al. (2017) reduce the resource requirement by proposing a cheap translation method, which "translates" the training data from the source to the target language word by word through a bilingual lexicon. Xie et al. (2018) further reduce the requirement of bilingual lexicons via unsupervised word-by-word translation through CLWEs.

Model Transfer
Model transfer methods train a model on the source language with language-independent features. Thus, the trained model can be directly applied to the target language.
McDonald et al. (2011) design a cross-lingual parser based on delexicalized features such as universal POS tags. Täckström et al. (2012) show that cross-lingual word cluster features induced from large parallel corpora are useful. Lexicons and Wikipedia have also demonstrated effectiveness for language-independent feature engineering. Zirikly and Hagiwara (2015) generate multilingual gazetteers from source-language gazetteers and comparable corpora. Page categories and linkage information to entries from Wikipedia are extracted as strong language-independent features (wikifier features) (Tsai et al., 2016). Bharadwaj et al. (2016) facilitate cross-lingual transfer through phonetic features, which work well between languages like Turkish, Uzbek, and Uyghur, but are not strictly language-independent. Recently, CLWEs have been used as language-invariant representations for direct model transfer in NER (Ni et al., 2017) and POS (Fang and Cohn, 2017).
Some previous work also proposes sequence labeling models with parameters shared between languages for cross-lingual knowledge transfer (Lin et al., 2018; Cotterell and Duh, 2017; Yang et al., 2017; Ammar et al., 2016; Kim et al., 2017). However, these models are usually obtained through joint learning and require annotated data in the target language.

Conclusion
In this paper, we focused on a low-resource cross-lingual setting and proposed transfer learning methods based on the alignment of deep semantic spaces between different languages. The proposed multilingual language model bridges different languages by automatically learning cross-lingual disambiguated representations. Extensive NER and POS experiments were conducted on benchmark datasets. The results show that our approaches, using only monolingual corpora, achieve improved performance compared with previous strong cross-lingual studies that use extra resources.