Improving Low-Resource Named Entity Recognition using Joint Sentence and Token Labeling

Exploiting sentence-level labels, which are easy to obtain, is a plausible way to improve low-resource named entity recognition (NER), where token-level labels are costly to annotate. Current models for jointly learning sentence and token labeling are limited to binary classification. We present a joint model that supports multi-class classification and introduce a simple variant of self-attention that allows the model to learn scaling factors. Our model produces 3.78%, 4.20%, and 2.08% improvements in F1 over the BiLSTM-CRF baseline on e-commerce product titles in three different low-resource languages: Vietnamese, Thai, and Indonesian, respectively.


Introduction
Neural named entity recognition (NER) has become a mainstream approach due to its superior performance (Huang et al., 2015; Lample et al., 2016; Ma and Hovy, 2016; Chiu and Nichols, 2016; Akbik et al., 2018). However, neural NER typically requires a large amount of manually labeled training data, which is not always available in low-resource languages. Training neural NER with limited labeled data can be very challenging. In this paper, we consider bridging multi-task learning (MTL) (Caruana, 1993; Ruder, 2017) and pre-training (Peters et al., 2018; Devlin et al., 2019) to leverage training signals of an auxiliary task that has a sufficiently large amount of labeled data.
Researchers have investigated a wide variety of auxiliary tasks and resources to boost the performance of neural NER, e.g., training coarse-grained NER (Aguilar et al., 2017), fine-tuning bilingual word embeddings (Wang et al., 2017), applying language models (Rei, 2017), integrating part-of-speech (POS) tagging (Lin et al., 2018), using cross-lingual knowledge (Feng et al., 2018), and learning paraphrases (Watanabe et al., 2019). While most of the previous studies have exploited token-level information from auxiliary tasks, a few of them have tried to use sentence-level information (Rei and Søgaard, 2018; Devlin et al., 2019). Our work is closely related to the joint labeling framework of Rei and Søgaard (2019). However, they only focused on binary classification, while we attempt to handle multi-class classification at both the sentence and token levels.
In this work, we focus on improving low-resource NER by exploiting large amounts of data that have only sentence-level labels. Figure 1 shows examples of product titles on an e-commerce website in Vietnamese. While the product titles with NER annotation done by our annotators are limited, those with product categories (e.g., ELECTRONICS) labeled by sellers are abundant and can be used to train a sentence-level classifier. A key challenge is to pass useful training signals from the sentence-level classification to the token-level NER.
Our contributions are as follows. We present a joint sentence and token labeling framework that enables multi-class classification, equipped with a pre-training strategy (§2.1). We show that current attention mechanisms can produce suboptimal results and propose a simple approach that allows the model to learn scaling factors to obtain a proper attention distribution (§2.2). Results on product title texts indicate that the proposed method is effective for low-resource NER across three different languages: Vietnamese, Thai, and Indonesian.

[Figure 2: Architecture of our joint sentence and token labeling model, trained with the joint loss $\mathcal{L}_{\text{JOINT}} = \mathcal{L}_{\text{NER}} + \lambda \mathcal{L}_C$; max-pooling over the shared hidden states feeds the sentence classifier. The attention layer is optional and can be skipped or replaced with the desired approach.]
Proposed method
Figure 2 shows the architecture of our joint sentence and token labeling model. Our model is based on hard parameter sharing (Ruder, 2017), in which the hidden layers are shared between the two tasks. The task-specific layers include a conditional random field (CRF) layer for NER and a linear layer for sentence classification. Unlike standard MTL, which trains multiple tasks at once and expects the model to perform well on all of them (Hashimoto et al., 2017; Rei and Søgaard, 2019), the goal of our work is to improve the performance of the main task (NER) by using the auxiliary task (sentence classification) to create pre-trained representations and to act as a regularizer.

Joint learning framework for multi-class classification
Shared layers: Let $w_1, \ldots, w_T$ be an input token sequence, where $w_t$ denotes the $t$-th token in the sequence. We represent each $w_t$ using a pre-trained word embedding $e_t \in \mathbb{R}^{d_e}$, where $d_e$ is the dimensionality of the word embeddings. We do not fine-tune the word embeddings but project them into a new space, $x_t = W_1 e_t$, where $W_1 \in \mathbb{R}^{d_e \times d_e}$ is a trainable weight matrix. We then feed the projected embedding sequence $X = [x_1, \ldots, x_T] \in \mathbb{R}^{T \times d_e}$ to a bidirectional long short-term memory (BiLSTM) layer to obtain a forward hidden state sequence $[\overrightarrow{h}_1, \ldots, \overrightarrow{h}_T]$ and a backward hidden state sequence $[\overleftarrow{h}_1, \ldots, \overleftarrow{h}_T]$. We concatenate the hidden states of both directions to obtain the final hidden representation $H = [h_1, \ldots, h_T] \in \mathbb{R}^{T \times d_h}$, where $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$. We can either use $H$ for both the sentence classification and NER tasks directly or apply an attention mechanism on it to help the model focus on particular tokens (detailed in §2.2).

Sentence classification: We create a fixed-size vector by applying max-pooling (Collobert et al., 2011; Conneau et al., 2017) over $H$, which encourages the model to capture the most useful local features encoded in the hidden states. We feed the fixed-size global feature vector to a linear layer to obtain the unnormalized predicted scores for each class. Let $K$ be the number of target classes, $s_k$ be the $k$-th normalized predicted score after applying a softmax function, and $t \in \mathbb{R}^K$ be the one-hot encoded true label. To train the sentence classification model, we minimize the multi-class cross-entropy loss:

$$\mathcal{L}_C = -\sum_{i=1}^{N} \sum_{k=1}^{K} t_k^{(i)} \log s_k^{(i)}, \quad (1)$$

where $i$ denotes the sentence index, and $N$ is the number of training examples.
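To make the shared layers and the sentence classification head concrete, here is a minimal PyTorch sketch. Our actual implementation builds on Flair; the class names, default sizes, and the split of $d_h$ across the two directions are illustrative assumptions rather than exact implementation details.

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Projection + BiLSTM layers shared by sentence classification and NER."""
    def __init__(self, d_e=300, d_h=512):
        super().__init__()
        self.proj = nn.Linear(d_e, d_e, bias=False)   # x_t = W_1 e_t
        self.bilstm = nn.LSTM(d_e, d_h // 2, batch_first=True,
                              bidirectional=True)     # concat of both directions = d_h

    def forward(self, emb):        # emb: (batch, T, d_e), frozen pre-trained embeddings
        x = self.proj(emb)
        h, _ = self.bilstm(x)      # h: (batch, T, d_h)
        return h

class SentenceHead(nn.Module):
    """Max-pooling over hidden states followed by a linear layer."""
    def __init__(self, d_h=512, num_classes=6):
        super().__init__()
        self.out = nn.Linear(d_h, num_classes)

    def forward(self, h):          # h: (batch, T, d_h)
        g, _ = h.max(dim=1)        # max-pooling over the token dimension
        return self.out(g)         # unnormalized scores; softmax is folded into the loss
```

The multi-class cross-entropy in Eq. (1) can then be computed with `nn.CrossEntropyLoss`, which applies the softmax internally.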
We not only train the sentence classification and NER models jointly but also pre-train the sentence classification model using a sufficiently large number of training examples with sentence-level labels only. We expect that the pre-trained hidden representations help the model generalize better on our main task, as described below.

NER: Following Huang et al. (2015) and Lample et al. (2016), we feed $H$ to a CRF layer to obtain the probability of a label sequence $y$. To train the NER model, we minimize the negative log-likelihood of the correct label sequences over the training set:

$$\mathcal{L}_{\text{NER}} = -\sum_{i} \log p(y^{(i)} \mid X^{(i)}). \quad (2)$$

Joint labeling objective: Combining Eqs. (1) and (2), we obtain:

$$\mathcal{L}_{\text{JOINT}} = \mathcal{L}_{\text{NER}} + \lambda \mathcal{L}_C, \quad (3)$$

where $\lambda$ is the balancing parameter. The $\mathcal{L}_C$ term acts as a regularizer, which helps reduce the risk of overfitting on our main task.
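A hedged sketch of one training step with the joint objective in Eq. (3) follows; `encoder` and `sent_head` are the modules sketched above, and `crf_head` is assumed to wrap a linear emission layer plus a CRF that exposes the sequence log-likelihood (for instance via the pytorch-crf package), so the names and call signatures are assumptions rather than our exact implementation.

```python
import torch.nn.functional as F

def joint_loss(encoder, sent_head, crf_head, emb, tags, sent_label, lam=1.0):
    """L_JOINT = L_NER + lambda * L_C (Eq. (3)); lam = 1 in our experiments."""
    h = encoder(emb)                                     # shared hidden states
    loss_ner = -crf_head.log_likelihood(h, tags)         # Eq. (2): CRF negative log-likelihood (assumed API)
    loss_c = F.cross_entropy(sent_head(h), sent_label)   # Eq. (1): sentence cross-entropy
    return loss_ner + lam * loss_c
```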

Revisiting attention mechanisms
We first consider a soft attention mechanism (Shen and Lee, 2016), which is used in Rei and Søgaard (2018, 2019). This method is computationally efficient because the attention distribution $a \in \mathbb{R}^T$ over tokens in a sentence is computed from the final hidden representation without considering relationships between hidden states. Specifically, the new final representation $\tilde{H} \in \mathbb{R}^{T \times d_h}$ can be derived as follows:

$$a = \mathrm{softmax}\bigl(w_3^\top \tanh(W_2 H^\top)\bigr), \qquad \tilde{H} = a \otimes H, \quad (4)$$

where $W_2 \in \mathbb{R}^{d_h \times d_h}$ and $w_3 \in \mathbb{R}^{d_h}$ are trainable parameters, and $\otimes$ denotes the column-wise matrix-vector multiplication. We use a residual connection (He et al., 2016) between the input hidden representation and the attention output, as shown in Figure 2. $\tilde{H}$ can then be fed to the NER and sentence classification layers.

We further explore attention mechanisms that take into account the relationships between hidden states. In particular, we apply the multi-head self-attention mechanism of the Transformer (Vaswani et al., 2017), which has shown promising results in many applications (Radford et al., 2018; Devlin et al., 2019). We replace Eq. (4) with:

$$\tilde{H} = [\mathrm{head}_1; \ldots; \mathrm{head}_n]\, W^O, \qquad \mathrm{head}_j = \mathrm{Attention}(H W_j^Q, H W_j^K, H W_j^V), \quad (5)$$

where $W_j^Q, W_j^K, W_j^V \in \mathbb{R}^{d_h \times d_h/n}$ and $W^O \in \mathbb{R}^{d_h \times d_h}$ are trainable parameters, and $n$ is the number of parallel heads. The attention function can be computed by:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\alpha}\right) V. \quad (6)$$

We drop the head index $j$ for simplicity and introduce the scaling factor $\alpha \in \mathbb{R}$. When setting $\alpha = \sqrt{d_h/n}$, Eq. (6) falls back to the standard scaled dot-product attention in the Transformer. Yan et al. (2019) observed that the scaled dot-product attention produces poor results for NER and proposed the un-scaled dot-product attention, where $\alpha = 1$.
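The scaled and un-scaled variants of Eq. (6) differ only in the constant divisor, as in this small PyTorch sketch (single-head notation, toy tensors):

```python
import math
import torch

def dot_product_attention(q, k, v, alpha):
    """softmax(Q K^T / alpha) V with an explicit scaling factor (Eq. (6))."""
    scores = q @ k.transpose(-2, -1) / alpha             # (batch, T, T)
    return torch.softmax(scores, dim=-1) @ v

d_h, n = 512, 8
q = k = v = torch.randn(2, 10, d_h // n)                 # toy (batch, T, d_h/n) tensors
out_scaled = dot_product_attention(q, k, v, math.sqrt(d_h / n))  # Vaswani et al. (2017)
out_unscaled = dot_product_attention(q, k, v, 1.0)               # Yan et al. (2019)
```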
In this work, we view $\alpha$ as the softmax temperature (Hinton et al., 2015), which allows adjusting the probability distribution of the softmax output. Using a higher temperature yields a softer attention distribution. However, a sharper attention distribution might be more suitable for NER because only a few tokens in a sentence are named entities. Instead of setting $\alpha$ to $1$ or $\sqrt{d_h/n}$, we propose to learn scaling factors $\delta \in \mathbb{R}^T$, one for each token. We modify Eq. (6) to:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\delta}\right) V, \quad (7)$$

where each row $t$ of $Q K^\top$ is divided by its own scaling factor $\delta_t$, computed from the hidden state $h_t$ through a ReLU-activated linear transformation with trainable parameters $w_4 \in \mathbb{R}^{d_h}$ and $b_4 \in \mathbb{R}$. Since the ReLU activation function produces output values in the range $[0, \infty)$, the $t$-th element of $\delta$ is bounded in the range $[1, 1 + \sqrt{d_h/n}]$. This allows the model to adapt $\delta$ dynamically without adding much computational cost.
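As a sketch, one parameterization consistent with the constraints above (a ReLU-activated linear layer playing the role of $w_4, b_4$, with outputs mapped into $[1, 1 + \sqrt{d_h/n}]$) is shown below; the exact mapping is illustrative, not a definitive specification.

```python
import math
import torch
import torch.nn as nn

class LearnedScaling(nn.Module):
    """Per-token scaling factors delta_t (illustrative parameterization).

    Maps a ReLU-activated score in [0, inf) into the interval (1, 1 + sqrt(d_h/n)],
    so each token can sharpen or soften its own attention row.
    """
    def __init__(self, d_h=512, n_heads=8):
        super().__init__()
        self.score = nn.Linear(d_h, 1)                    # plays the role of w_4, b_4
        self.c = math.sqrt(d_h / n_heads)

    def forward(self, h):                                 # h: (batch, T, d_h)
        z = torch.relu(self.score(h))                     # (batch, T, 1), in [0, inf)
        return 1.0 + self.c / (1.0 + z)                   # delta: (batch, T, 1)

def attention_with_learned_scaling(q, k, v, delta):
    """softmax(Q K^T / delta) V, dividing each query row by its own delta_t (Eq. (7))."""
    scores = q @ k.transpose(-2, -1) / delta              # delta broadcasts over the key axis
    return torch.softmax(scores, dim=-1) @ v
```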

Datasets
The data used in our experiments are product titles obtained from major e-commerce websites in Southeast Asian countries during May-June, 2019. They cover three languages: Vietnamese (VI), Thai (TH), and Indonesian (ID). A product title is a brief, information-rich description (less than 200 characters) written by the sellers. We hired annotators and linguists for each language to annotate the product titles based on our definitions and annotation guidelines. After the annotation process, we obtained 2,000 product titles per language labeled with 6 product-attribute NER tags: PRODUCT, BRAND, CONSUMER_GROUP, MATERIAL, PATTERN, and COLOR. For each language, we split the data into 1,000/500/500 training/development/test sets. The statistics of NER tags can be found in Table 3 (see Appendix A).
For some NER tags, especially PRODUCT, the number of tag occurrences is much larger than the number of examples. One reason is that sellers writing a product title tend to include multiple different expressions referring to the same entity (near-synonyms), with the likely intention of getting more hits from potential customers. Using English to illustrate: in "Genuine Leather Sling Bag Crossbody Bag Messenger bag for Men Women Office Laptop", the expressions "Sling Bag", "Crossbody Bag", and "Messenger bag" are 3 PRODUCT entities, and "Men" and "Women" are 2 CONSUMER_GROUP entities.
The other reason is that, within one product title, it is common to find repeated identical expressions in the same language, as well as the same entity words appearing in English. Using a VI example to illustrate: in "T-Shirt - Áo thun in phản quang - Ao thun Nam - Ao thun nữ - Áo thun phong cách Nam Nữ", the expressions "T-Shirt", "Áo thun", and "Ao thun" all refer to the same product (t-shirt), appearing multiple times in VI as well as in English.

Training details
We implement our model on top of the Flair framework (Akbik et al., 2019), which has recently achieved state-of-the-art results in various sequence labeling tasks. Following Lample et al. (2016), we use the IOBES tagging scheme. We use the pre-trained fastText word embeddings (Bojanowski et al., 2016) with $d_e = 300$ dimensions for each language and a single-layer BiLSTM with $d_h = 512$ hidden units. We apply locked dropout (Merity et al., 2018) with a probability of 0.5 before and after the BiLSTM layer and to the attention output before the residual connection. For the multi-head self-attention layer, we adapt the implementation of "The Annotated Transformer" (Rush, 2018) and use its default hyperparameters.
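For illustration (not an excerpt from our data), a fragment of the English example from the Datasets section would receive IOBES tags like the following, where S- marks a single-token entity, B-/I-/E- mark the beginning, inside, and end of a multi-token entity, and O marks non-entity tokens:

```python
# Illustrative IOBES tags for a fragment of the English example in the Datasets section.
tokens = ["Crossbody", "Bag",       "for", "Women"]
tags   = ["B-PRODUCT", "E-PRODUCT", "O",   "S-CONSUMER_GROUP"]
```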
We train all models using Adam (Kingma and Ba, 2015) with a batch size of 32, a learning rate of 1e-3, and gradient clipping of 5. We initialize all model parameters by sampling from $U(-0.1, 0.1)$. We set $\lambda$ in Eq. (3) to 1 and use the same hyperparameter settings for all languages. We apply early stopping in which the learning rate decays by 0.5 if the F1 score on the NER development set does not improve for 3 consecutive evaluations. We train until the learning rate drops below 1e-5 or the number of training epochs reaches 100.
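The schedule above can be approximated outside Flair with a plateau-based scheduler; `model`, `train_loader`, and `dev_f1` are assumed to exist, so this is a sketch of the training loop rather than our exact trainer configuration.

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=3)       # halve LR when dev F1 stalls

for epoch in range(100):                                  # at most 100 epochs
    for batch in train_loader:
        optimizer.zero_grad()
        loss = model(batch)                               # returns L_JOINT (Eq. (3))
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)
        optimizer.step()
    scheduler.step(dev_f1(model))                         # anneal on the NER dev F1
    if optimizer.param_groups[0]["lr"] < 1e-5:            # stop once the LR is too small
        break
```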

Pre-trained classification models
We collect unannotated product titles for each language and group them into 6 main categories: FASHION, HEALTH_BEAUTY, ELECTRONICS, HOME_FURNITURE, MOTORS, and OTHER. Since the number of available product titles differs from one language to another, we create 360k/30k, 1.2M/60k, and 864k/60k training/development sets for VI, TH, and ID, respectively. Since product titles are not word-segmented in TH, we segment them using a character-cluster-based method simplified from the hybrid model of Kruengkrai et al. (2009). We implement our word segmenter based on CRFsuite (Okazaki, 2007) and train it on the BEST corpus (Kosawat et al., 2009).
We pre-train the classification models for each language. Since our batch size is relatively small compared to the training data size, we find it suffices to train for 2 epochs. The F1 scores on the development sets are 90.08%, 89.79%, and 91.91% for VI, TH, and ID, respectively. The pre-trained model parameters are used to initialize the projection and BiLSTM layers.
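Transferring the pre-trained classifier weights into the joint model amounts to a partial state-dict load; in the sketch below the checkpoint path and submodule names are hypothetical, and `model.encoder` is assumed to be the shared encoder of the joint model.

```python
import torch

state = torch.load("pretrained_classifier.pt", map_location="cpu")   # hypothetical path
shared = {k: v for k, v in state.items()
          if k.startswith(("proj.", "bilstm."))}      # keep only projection + BiLSTM weights
model.encoder.load_state_dict(shared, strict=False)   # task-specific layers stay randomly initialized
```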

Main results
We run each experiment 10 times using different random seeds and report the average F1 score. All experiments are run on NVIDIA Tesla P100 GPUs. Table 1 shows the results of various models on the test sets. The Joint models consistently show improvements over the NER-only models, while the Joint + Pre-trained models further boost the F1 scores. These results suggest that the proposed framework is effective for all three languages. The Joint + Pre-trained model with the Self + Learned attention mechanism achieves the best F1 scores at 62.16%, 61.54%, and 76.10% (i.e., 3.78%, 4.20%, and 2.08% improvements over the NER-only baselines) for VI, TH, and ID, respectively.
In addition, we experiment with simple data augmentation. The "+10k" and "+50k" rows in Table 1 show these results; we observe no clear benefit and hence do not pursue this idea further with the attention mechanisms.

[Table 1 caption (abbreviations): Soft = soft attention (Shen and Lee, 2016; Rei and Søgaard, 2019); Self = multi-head self-attention described in §2.2, where Scaled = scaled dot-product (Vaswani et al., 2017), Un-scaled = un-scaled dot-product (Yan et al., 2019), and Learned = our learned scaling factors.]

Table 2 shows the model ablations for our best configuration, the Joint + Pre-trained model with the Self + Learned attention mechanism. Feeding the attention output to the CRF layer without the residual connection leads to a consistent drop in the F1 scores, although the effect is less pronounced on TH. The results indicate that the residual connection is a useful component in our architecture. Adding the attention output to the hidden representation without applying the locked dropout (i.e., setting the dropout probability to 0) hurts the F1 scores on VI and TH but improves them on ID, suggesting that fine-tuning the dropout rate could further boost the F1 scores.

Discussion
Our Self + Learned scaling approach shows competitive results for the NER-only model and achieves the best results when trained in tandem with the Joint + Pre-trained model. The Soft attention mechanism (Shen and Lee, 2016; Rei and Søgaard, 2019) shows slight or no improvements, suggesting that considering relationships between hidden states when computing the attention distribution is crucial for the NER task. The Self + Un-scaled approach (Yan et al., 2019) yields better F1 scores than the Self + Scaled approach (Vaswani et al., 2017) in all configurations, suggesting that a sharper attention distribution is helpful for the NER task.
Although VI, TH, and ID are all used in Southeast Asia, they belong to different language families (VI: Austroasiatic; TH: Kra-Dai; ID: Austronesian) and have different writing systems. Handling these three languages without much engineering effort reflects the generalizability of our method. Furthermore, we examine whether our method still provides improvements when the NER training data size increases. We create an additional set of 2k labeled examples for VI and add them to the training set (3k in total). The NER-only baseline produces 66.81% F1, while Joint + Pre-trained with Self + Learned achieves 69.26% F1 (i.e., a 2.45% improvement).

Conclusion
We have shown that the proposed joint sentence and token labeling model is remarkably effective for low-resource NER in three different languages: Vietnamese, Thai, and Indonesian. Our model supports multi-class classification where the sentence and token labels can be only weakly related, which indicates its potential for many other real-world applications. Using a larger amount of general-domain text to build pre-trained representations (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019; Clark et al., 2020) can complement our model and is one of the directions we plan to pursue in future work.