Low-Resource Name Tagging Learned with Weakly Labeled Data

Name tagging in low-resource languages or domains suffers from inadequate training data. Existing work relies heavily on additional information, while leaving the noisy annotations that exist extensively on the web unexplored. In this paper, we propose a novel neural model for name tagging based solely on weakly labeled (WL) data, so that it can be applied in any low-resource setting. To take the best advantage of all WL sentences, we split them into high-quality and noisy portions for two modules, respectively: (1) a classification module focusing on the large portion of noisy data efficiently and robustly pretrains the tag classifier by capturing textual context semantics; and (2) a costly sequence labeling module focusing on high-quality data utilizes Partial-CRFs with non-entity sampling to achieve a global optimum. The two modules are combined via shared parameters. Extensive experiments involving five low-resource languages and a fine-grained food domain demonstrate our superior performance (6% and 7.8% F1 gains on average) as well as efficiency.


Introduction
Name tagging is the task of identifying the boundaries of entity mentions in text and classifying them into pre-defined entity types (e.g., person). It plays a fundamental role by providing essential inputs for many IE tasks, such as Entity Linking (Cao et al., 2018a) and Relation Extraction (Lin et al., 2017).
Many recent methods utilize a neural network (NN) with Conditional Random Fields (CRFs) (Lafferty et al., 2001) by treating name tagging as a sequence labeling problem (Lample et al., 2016), which has become a basic architecture due to its superior performance. Nevertheless, NN-CRFs require exhaustive human effort for training annotations, and may not perform well in low-resource settings (Ni et al., 2017). Many approaches thus focus on transferring cross-domain, cross-task and cross-lingual knowledge into name tagging (Yang et al., 2017; Peng and Dredze, 2016; Mayhew et al., 2017; Pan et al., 2017; Lin et al., 2018; Xie et al., 2018). However, they are usually limited by the extra knowledge resources that are effective only in specific languages or domains. Actually, in many low-resource settings, there are extensive noisy annotations that naturally exist on the web yet to be explored (Ni et al., 2017). In this paper, we propose a novel model for name tagging that maximizes the potential of weakly labeled (WL) data. As shown in Figure 1, s2 is weakly labeled, since only Formula shell and Barangay Ginebra are annotated, leaving the remaining words unannotated.

1 Our project can be found in https://github.com/zig-kwin-hu/Low-Resource-Name-Tagging.
2 Some may call it Named Entity Recognition (NER).
WL data is more practical to obtain, since it is difficult for people to accurately annotate entities that they do not know or are not interested in. We can construct such data from online resources, such as the anchors in Wikipedia. However, the following properties of WL data make learning name tagging from it more challenging:

Partially-Labeled Sequences Automatically derived WL data does not contain complete annotations, and thus cannot be directly used for training. Ni et al. (2017) select the sentences with the highest confidence and assume missing labels are O (i.e., non-entity), but this introduces a bias toward recognizing mentions as non-entity. Another line of work replaces CRFs with Partial-CRFs (Täckström et al., 2013), which assign unlabeled words all possible labels and maximize the total probability (Yang et al., 2018; Shang et al., 2018). However, these methods still rely on seed annotations or domain dictionaries for high-quality training.
Massive Noisy Data WL corpora are usually generated with massive noisy data, including missing labels and incorrect boundaries and types. Previous work filtered out WL sentences by statistical methods (Ni et al., 2017) or the output of a trainable classifier (Yang et al., 2018). However, abandoning training data may exacerbate the issue of inadequate annotation. Therefore, maximizing the potential of the massive noisy data as well as the high-quality part, while remaining efficient, is challenging.
To address these issues, we first differentiate noisy data from high-quality WL sentences via a lightweight scoring strategy, which accounts for both the annotation confidence and the coverage of all mentions in a sentence. To take the best advantage of all WL data, we then propose a unified neural framework that approaches name tagging from two perspectives, sequence labeling and classification, for the two types of data, respectively.
Specifically, the classification module focuses on noisy data to efficiently pretrain the tag classifier by capturing textual context semantics. It is trained only on annotated words, excluding the noisy unannotated words, and is thus robust and efficient during training. The costly sequence labeling module aims to achieve a sequential optimum among word tags. It further alleviates the burden of seed annotations in Partial-CRFs and increases randomness via a Non-entity Sampling strategy, which samples O words according to several linguistic properties. The two modules are combined via shared parameters. Our main contributions are as follows:

• We propose a novel neural name tagging model that relies merely on WL data without feature engineering. It can thus be adapted to both low-resource languages and domains, while no previous work deals with both at the same time.
• We consider name tagging from the two perspectives of sequence labeling and classification, to efficiently take the best advantage of both high-quality and noisy WL data.
• We conduct extensive experiments in five low-resource languages and a fine-grained domain. Since little work has been done in both types of low-resource settings simultaneously, we derive two types of baselines from state-of-the-art methods. Our model achieves significant improvements (6% and 7.8% F1 on average) while remaining efficient, as demonstrated in further ablation studies.

Related Work
Name tagging is a fundamental task of extracting entity information, which benefits many applications, such as information extraction (Zhang et al., 2017; Kuang et al., 2019; Cao et al., 2019a) and recommendation (Wang et al., 2019; Cao et al., 2019b). It can be treated as either a multi-class classification problem (Hammerton, 2003; Xu et al., 2017) or a sequence labeling problem (Collobert et al., 2011), but very little work has combined the two. The difference between them mainly lies in whether the method models sequential label constraints, which have been demonstrated effective in many NN-CRFs models (Lample et al., 2016; Ma and Hovy, 2016; Chiu and Nichols, 2016). However, these models require large human-annotated corpora, which are usually expensive to obtain.
The above issue motivates much work on name tagging in low-resource languages or domains. A typical line of effort focuses on introducing external knowledge via transfer learning (Fritzler et al., 2018; Hofer et al., 2018), such as the use of cross-domain (Yang et al., 2017), cross-task (Peng and Dredze, 2016; Lin et al., 2018) and cross-lingual resources (Ni et al., 2017; Xie et al., 2018; Zafarian et al., 2015; Zhang et al., 2016; Mayhew et al., 2017; Tsai et al., 2016; Feng et al., 2018; Pan et al., 2017). Although these approaches achieve promising results, there are extensive weak annotations on the Web that have not been well studied (Nothman et al., 2008; Ehrmann et al., 2011). Yang et al. (2018) and Shang et al. (2018) utilized Partial-CRFs (Täckström et al., 2013) to model incomplete annotations for specific domains, but they still rely on seed annotations or a domain dictionary. Therefore, we aim to fill the gap in low-resource name tagging research by using only WL data, adapting to arbitrary low-resource languages or domains, which can be further improved by the above transfer-based methods.
Preliminaries and Framework

Preliminaries
We formally define the name tagging task as follows: given a sequence of words $X = x_1, \cdots, x_i, \cdots, x_{|X|}$, the task aims to infer a sequence of labels $Y = y_1, \cdots, y_i, \cdots, y_{|X|}$, where each label $y_i \in \mathcal{Y}$ encodes boundary and type information (e.g., B-ORG marks the beginning word of an ORGanization mention).

Framework
The goal of our method is to extract WL data from Wikipedia and use it as a training corpus for name tagging. As shown in Figure 2, there are two steps in our framework:

Weakly Labeled Data Generation produces as much WL data as possible for higher tagging recall. It contains two components: label induction and a data selection scheme. First, label induction assigns each word a label based on Wikipedia anchors and the taxonomy. Then, the data selection scheme computes quality scores for the WL sentences by considering the coverage of mentions as well as the label confidence. According to the scores, we split the entire set into two parts: a small set of high-quality data for the sequence labeling module, and a large amount of noisy data for the classification module.
The Neural Name Tagging Model aims at efficiently and robustly utilizing both high-quality and noisy WL data, ensuring satisfactory tagging precision. It makes the best use of labeled words via the sequence labeling module and the classification module. More specifically, we pre-train the classification module to capture textual context semantics from the massive noisy data, and then the sequence labeling module further fine-tunes the shared neural networks using a Partial-CRFs layer with Non-Entity Sampling.

Weakly Labeled Data Generation
Existing methods use Wikipedia (Ni et al., 2017; Pan et al., 2017; Geiß et al., 2017) to train an extra classifier that predicts entity categories for name tagging training. Instead, we aim at lowering the requirements on additional resources in order to support more low-resource settings. We thus utilize a lightweight strategy to generate WL data, consisting of label induction and a data selection scheme.

Label Induction
Given a sentence X including anchors A(X) and a taxonomy T, we aim at inducing a label ỹ ∈ Ỹ for each word x ∈ X. Obviously, the words outside of anchors should be labeled with UN, indicating that they are unlabeled and could be O or unannotated mentions. For the words in an anchor ⟨m, e⟩, we label them according to the entity categories. For example, the words Formula and Shell (Figure 1) in s2 are labeled as B-ORG and I-ORG, respectively, because the mention Formula Shell is linked to the entity Shell Turbo Chargers, which belongs to the category Basketball teams. We trace it along the taxonomy T: Basketball teams→...→Organizations, and find that it is a child of Organizations. According to a manually defined mapping Γ(Y) → C (e.g., Γ(ORG) = Organizations), we assign all the classes and their children the same type (e.g., ORG).
However, there are two minor issues. First, for entities without category information (C(e) = ∅), we label their words as B-NT or I-NT, indicating that they have no type information. Second, for entities referring to multiple categories, we induce the label that maximizes the conditional probability:

$$\tilde{y} = \arg\max_{c \in C(e)} p(c \mid e) \quad (1)$$

By doing so, we obtain a set of WL sentences D = {(X, Ỹ)}. However, the induction process may introduce incorrect boundaries and types of labels due to the crowdsourcing nature of the source data. We thus design a data selection scheme to deal with these issues.
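The induction step above can be sketched as follows. This is an illustrative Python reconstruction, not the released code: the toy taxonomy fragment, the `TYPE_OF_CATEGORY` mapping (our stand-in for Γ) and the function names are our own assumptions.

```python
# Sketch of label induction: trace an anchor's categories up the taxonomy
# until a mapped top-level category is found, then emit BIO labels.

TYPE_OF_CATEGORY = {  # stand-in for the mapping Γ: tag type <-> top category
    "Organizations": "ORG",
    "People": "PER",
    "Places": "LOC",
}

PARENT = {  # toy fragment of the Wikipedia taxonomy T (child -> parent)
    "Basketball teams": "Sports clubs",
    "Sports clubs": "Organizations",
}

def induce_type(categories):
    """Walk each candidate category toward the taxonomy root; return the
    tag type of the first mapped ancestor, or 'NT' when none is found."""
    for cat in categories:
        node = cat
        while node is not None:
            if node in TYPE_OF_CATEGORY:
                return TYPE_OF_CATEGORY[node]
            node = PARENT.get(node)
    return "NT"  # entity without usable category information

def label_anchor(mention_words, categories):
    """BIO-label the words of one anchor according to the induced type."""
    t = induce_type(categories)
    return ["B-" + t] + ["I-" + t] * (len(mention_words) - 1)
```

For the running example, `label_anchor(["Formula", "Shell"], ["Basketball teams"])` traces Basketball teams → Sports clubs → Organizations and yields `["B-ORG", "I-ORG"]`.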

Data Selection Scheme
Following Ni et al. (2017), we compute quality scores for sentences to distinguish high-quality and noisy data from two aspects: the annotation confidence and the annotation coverage.
The annotation confidence measures the likelihood of the text spans being mentions (i.e., the correctness of boundaries) and of being assigned the correct types. We define it as follows:

$$q(X, \tilde{Y}) = \frac{1}{|X|} \sum_{i=1}^{|X|} p(C(e) \mid x_i) \quad (2)$$

where $p(C(e) \mid x_i)$ is the conditional probability of $x_i$ linking to an entity belonging to category $C(e)$; we compute it based on its statistical frequency among Wikipedia anchors.
The annotation coverage measures the ratio of words that are labeled in the sentence:

$$n(X, \tilde{Y}) = \frac{|\{x_i \mid \tilde{y}_i \neq UN\}|}{|X|} \quad (3)$$

We select the high-quality sentences $D_{hq}$ satisfying:

$$q(X, \tilde{Y}) > \theta_q, \quad n(X, \tilde{Y}) > \theta_n \quad (4)$$

where $\theta_q$ and $\theta_n$ are hyperparameters. The remaining sentences form the noisy set $D_{noise}$. For example (Figure 2), the sentence ... Barangay Ginebra and Formula Shell ... is high-quality, and The team is owned by Ginebra is noisy. This is because more anchors link Formula Shell to an organization entity, and the anchors within the sentence account for a large proportion of it, leading to a higher quality score. Note that Barangay and Ginebra are labeled with B-NT and I-NT, indicating that the type information is missing. Our model may learn the textual semantics for classifying Ginebra as ORG from the noisy sentence, where Ginebra is labeled with B-ORG.
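The selection scheme can be sketched as below. The scoring formula is a reconstruction of Equations (2)–(4) and the default thresholds are purely illustrative; function names and the input format (per-word tags plus per-word linking probabilities) are our own assumptions.

```python
# Sketch of the data selection scheme: score each WL sentence by annotation
# confidence (mean linking probability over the sentence) and annotation
# coverage (fraction of labeled words), then split on two thresholds.

def quality_scores(labels, link_probs):
    """labels: per-word WL tags ('UN' = unlabeled); link_probs: per-word
    p(C(e)|x_i) for labeled words, 0.0 elsewhere."""
    n = len(labels)
    covered = [i for i, y in enumerate(labels) if y != "UN"]
    coverage = len(covered) / n
    confidence = sum(link_probs[i] for i in covered) / n
    return confidence, coverage

def split_sentences(corpus, theta_q=0.4, theta_n=0.1):
    """corpus: list of (labels, link_probs). Returns (high_quality, noisy)."""
    hq, noisy = [], []
    for sent in corpus:
        conf, cov = quality_scores(*sent)
        (hq if conf > theta_q and cov > theta_n else noisy).append(sent)
    return hq, noisy
```

A sentence with two confidently linked anchor words out of four passes both thresholds, while a sentence with a single low-confidence anchor falls into the noisy set.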

Neural Name Tagging Model
Our neural model contains two modules that share the same NN architecture except for the Partial-CRFs layer. Given D_hq and D_noise, we first pre-train the classification module on the massive noisy data D_noise to efficiently capture textual semantics. Then, we use the sequence labeling module to fine-tune the shared networks on the high-quality data D_hq by considering the transitional constraints among sequential labels.

Sequence Labeling Module
Before describing the NN of the classification module, we first introduce the sequence labeling module. Different from conventional NN-CRFs models, we utilize a Partial-CRFs layer to maximize the probability of all possible label sequences for the sentence under transitional constraints, where the probability of missing word labels is controlled by non-entity sampling.

Partial-CRFs
Partial-CRFs (PCRFs) were first proposed in the field of Part-of-Speech Tagging (Täckström et al., 2013). They can be trained when the coupled word and label constraints provide only a partial signal, by assuming that the uncoupled words may take multiple labels. Given (X, Ỹ), we traverse all possible labels in Y for each unannotated word {x_i | ỹ_i ∈ {UN, B-NT, I-NT}} (e.g., the red paths in Figure 2), and compute the total probability of the possible fully labeled sentences Y(X, Ỹ) = {(X, Y)}:

$$p(\mathcal{Y}(X, \tilde{Y}) \mid X) = \sum_{Y \in \mathcal{Y}(X, \tilde{Y})} p(Y \mid X) \quad (5)$$

where $p(Y \mid X) = \mathrm{softmax}(s(X, Y))$, the same as in CRFs, and the score function $s(X, Y)$ is:

$$s(X, Y) = \sum_{i=0}^{|X|} A_{y_i, y_{i+1}} + \sum_{i=1}^{|X|} P_{x_i, y_i} \quad (6)$$

where $P_{x_i, y_i}$ is the score indicating how likely $x_i$ is to be labeled with $y_i$; it is defined as the output of the NN and will be detailed in the next section. $A_{y_i, y_{i+1}}$ is the transition score from label $y_i$ to $y_{i+1}$, which is learned in this layer.
Instead of maximizing the single correct label sequence as in CRFs, the loss function of Partial-CRFs minimizes the negative log-probability of the ground truth over all possible labeled sequences:

$$\mathcal{L}_{s} = -\log \sum_{Y \in \mathcal{Y}(X, \tilde{Y})} p(Y \mid X) \quad (7)$$
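The PCRF loss can be computed with two runs of the standard forward algorithm: one partition over all label sequences and one over only the sequences consistent with the partial annotation. The sketch below uses plain floats for the emission scores P and transition scores A instead of NN outputs; it is an illustration of the technique, not the paper's implementation.

```python
# Partial-CRF loss via two constrained forward passes:
#   L = log Z_all - log Z_constrained = -log p(Y(X,~Y) | X)

import math

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def forward_logZ(emit, trans, allowed):
    """emit[i][y]: score P_{x_i,y}; trans[p][y]: score A_{p,y};
    allowed[i]: set of labels permitted at position i."""
    alpha = {y: emit[0][y] for y in allowed[0]}
    for i in range(1, len(emit)):
        alpha = {
            y: emit[i][y] + logsumexp([alpha[p] + trans[p][y] for p in alpha])
            for y in allowed[i]
        }
    return logsumexp(list(alpha.values()))

def pcrf_loss(emit, trans, labels, tagset):
    """labels[i] is a concrete tag, or 'UN' for an unconstrained position."""
    n = len(emit)
    all_allowed = [set(tagset)] * n
    partial = [{y} if y != "UN" else set(tagset) for y in labels]
    return forward_logZ(emit, trans, all_allowed) - forward_logZ(emit, trans, partial)
```

When every position is UN, the constrained set equals the full set and the loss is exactly zero; any observed label shrinks the constrained partition and yields a positive loss.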

Non-entity Sampling
A crucial drawback of using Partial-CRFs on WL sentences is that no words are labeled with O (i.e., non-entity) for training (Section 6.5).
To further alleviate the reliance on seed annotations, we introduce non-entity sampling, which samples O labels from the unlabeled words as follows:

$$p(\tilde{y}_i = O) \propto \lambda_1 f_1(x_i) + \lambda_2 \big(1 - f_2(x_i)\big) + \lambda_3 f_3(x_i) \quad (8)$$

where α is the non-entity ratio that balances how many unlabeled words are sampled as O; we set α = 0.9 in experiments according to Augenstein et al. (2017).

The weighting parameters satisfy 0 ≤ λ1, λ2, λ3 ≤ 1, and f1, f2, f3 are feature scores. We define f1 = 1(x_i adjoins an entity), which implies that the words around a mention are likely to be O; f2 is the ratio of the number of occurrences of x_i labeled within entities to its total occurrences, reflecting how frequently the word appears in mentions; and f3 = tf · df, where tf is the term frequency and df the document frequency in Wikipedia articles.
As shown in Figure 2, the three words and, forming and an are labeled with UN since they are outside of anchors. During training, they would be expanded to all labels in Y by the Partial-CRFs, but we sample some of them as O words according to Equation 8. Thus, and and an are instead treated as O words, because they do not appear in any anchor and are too general, yielding a high f3 value.
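A minimal sketch of non-entity sampling follows. The exact scoring direction of Equation (8) is a reconstruction (we invert f2, assuming words frequently seen inside mentions should rarely become O), and deterministic top-k selection replaces random sampling for reproducibility; both choices are ours.

```python
# Non-entity sampling sketch: relabel the top α fraction of unlabeled (UN)
# words as O, ranked by the combined feature score of Equation (8).

def o_score(feats, lambdas=(0.0, 0.9, 0.1)):
    """feats = (f1, f2, f3). f2 is inverted: frequent-in-mention words
    should not be sampled as non-entities."""
    f1, f2, f3 = feats
    l1, l2, l3 = lambdas
    return l1 * f1 + l2 * (1.0 - f2) + l3 * f3

def sample_non_entities(labels, features, alpha=0.9):
    """labels: WL tags; features[i]: feature triple for word i.
    Returns a new label list with some 'UN' replaced by 'O'."""
    unlabeled = [i for i, y in enumerate(labels) if y == "UN"]
    k = int(alpha * len(unlabeled))
    chosen = sorted(unlabeled, key=lambda i: o_score(features[i]), reverse=True)[:k]
    out = list(labels)
    for i in chosen:
        out[i] = "O"
    return out
```

Function words like and or an (never in anchors, so f2 = 0, and very frequent, so f3 is high) receive the largest scores and are the first to be relabeled O.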

Classification Module
To efficiently utilize the noisy WL sentences, this module regards name tagging as a multi-label classification problem. On one hand, it predicts each word's label separately, naturally addressing the issue of non-consecutive labels. On the other hand, it focuses only on the labeled words, so the module is robust to noise (since most noise arises from the unlabeled words) and enjoys an efficient training procedure.
Formally, given a noisy sentence (X, Ỹ) ∈ D_noise, we classify the words {x_i | ỹ_i ∈ Y} by capturing the textual semantics of their context. To stay independent of languages and domains, we combine character and word embeddings for each word, then feed them into an encoder layer that captures contextual information for the classification layer.

Character and Word Embeddings
As inputs, we introduce character information to enhance the word representations, improving robustness to morphological variation and misspelling noise, following Ma and Hovy (2016). Concretely, we represent a word x by concatenating its word embedding w with a Convolutional Neural Network (CNN) (LeCun et al., 1989) based character embedding c, obtained through convolution operations over the characters of the word followed by max pooling and dropout.
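The character encoder can be illustrated without any deep-learning framework. This toy sketch uses random fixed vectors and filters (a real system learns them end-to-end); dimensions, padding symbol and function names are illustrative assumptions.

```python
# Dependency-free sketch of the CNN character encoder: look up a small
# vector per character, slide width-3 filters over the sequence, and
# max-pool each filter's responses into one number per filter.

import random

random.seed(0)
CHAR_DIM, N_FILTERS, WIDTH = 4, 3, 3
char_vecs = {}  # lazily created character "embeddings"
filters = [[random.uniform(-1, 1) for _ in range(WIDTH * CHAR_DIM)]
           for _ in range(N_FILTERS)]

def char_embedding(word):
    """Return an N_FILTERS-dim vector for one word via conv + max pooling."""
    pad = "#"  # pad so even 1-character words yield at least one window
    chars = list(pad * (WIDTH - 1) + word + pad * (WIDTH - 1))
    for c in chars:
        char_vecs.setdefault(c, [random.uniform(-1, 1) for _ in range(CHAR_DIM)])
    pooled = []
    for f in filters:
        best = float("-inf")
        for start in range(len(chars) - WIDTH + 1):
            window = [v for c in chars[start:start + WIDTH] for v in char_vecs[c]]
            best = max(best, sum(w * x for w, x in zip(f, window)))
        pooled.append(best)
    return pooled
```

The output vector would then be concatenated with the word embedding before the encoder layer.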

Encoder Layer
Given a sentence X of arbitrary length, this component encodes the semantics of the words as well as their compositionality into a low-dimensional vector space. The most common encoders are CNNs, Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) and the Transformer (Vaswani et al., 2017). We use the bi-directional LSTM (Bi-LSTM) due to its superior performance, as discussed in Section 6.2.
Bi-LSTM (Graves et al., 2013) has been widely used for modeling sequential words, capturing both past and future input features for a given word. It stacks a forward LSTM and a backward LSTM, so that the output for a word is the concatenation of the two directions' hidden states: $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$.

Classification Layer
The classification layer makes independent labeling decisions for each word, so that we can focus only on labeled words, while robustly and efficiently skipping the noisy unlabeled words.
In this layer, we estimate the score $P_{x_i, y_i}$ (Equation 6) for word $x_i$ taking label $y_i$. We use a fully connected layer followed by a softmax to output a probability-like score:

$$P_{x_i, y_i} = \mathrm{softmax}(W h_i)_{y_i} \quad (9)$$

where $W \in \mathbb{R}^{|\mathcal{Y}| \times d}$ projects the encoder output $h_i$ to the label space. Note that we have no training instances for O words; thus, we also use non-entity sampling here (Section 5.1). Given $(X, \tilde{Y}) \in D_{noise}$, this module is trained to minimize the cross-entropy between the predictions and the ground truth:

$$\mathcal{L}_{c} = -\sum_{\tilde{y}_i \in \mathcal{Y}} \log P_{x_i, \tilde{y}_i} \quad (10)$$
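The key property of this loss, skipping unlabeled positions entirely, can be sketched as follows. Raw score rows stand in for the linear layer's output $W h_i$; names and the UN convention are our assumptions.

```python
# Classification-module loss sketch: a per-word softmax over tag scores,
# with cross-entropy accumulated ONLY at labeled positions, so noisy
# unlabeled (UN) words contribute nothing.

import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def masked_cross_entropy(score_rows, labels, tagset):
    """score_rows[i]: per-tag scores for word i; labels[i]: a tag or 'UN'.
    Unlabeled positions are skipped entirely."""
    loss, count = 0.0, 0
    for scores, y in zip(score_rows, labels):
        if y == "UN":
            continue
        probs = softmax(scores)
        loss -= math.log(probs[tagset.index(y)])
        count += 1
    return loss / max(count, 1)
```

Because UN positions never enter the sum, missing annotations cannot push the classifier toward predicting O, which is exactly the bias this module avoids.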

Training and Inference
To distill the knowledge derived from noisy data, we first pre-train the classification module, then share the overall NN with the sequence labeling module. If we choose loose thresholds θ_q and θ_n, there is no noisy data and our model degrades to the sequential model without the pre-trained classifier. When the thresholds are strict, there is no high-quality data and our model degrades to the classification module alone (Section 6.4).
For inference, we use the sequence labeling module to predict the output label sequence with the largest score, as in Equation 6.
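Decoding the highest-scoring sequence is standard Viterbi over the emission (P) and transition (A) scores. The sketch below is a generic reference implementation with toy scores, not the paper's code.

```python
# Viterbi decoding sketch: return the single label sequence maximizing
# s(X, Y) = sum of emission scores P plus transition scores A.

def viterbi(emit, trans, tagset):
    """emit[i][y]: score of tag y at position i; trans[p][y]: transition
    score from tag p to tag y."""
    best = {y: (emit[0][y], [y]) for y in tagset}  # tag -> (score, path)
    for i in range(1, len(emit)):
        new = {}
        for y in tagset:
            score, path = max(
                (best[p][0] + trans[p][y] + emit[i][y], best[p][1] + [y])
                for p in tagset
            )
            new[y] = (score, path)
        best = new
    return max(best.values())[1]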

Experiment
We verify our model on five low-resource languages and a specific domain. Furthermore, we investigate the impact of the main components as well as the hyperparameters in ablation studies.

Experiment Settings
Datasets Since most datasets for low-resource languages are not publicly available, we use Wikipedia data as the "ground truth", following Pan et al. (2017). Thus, we can test name tagging in low-resource languages as well as domains. We choose five languages at different low-resource levels: Welsh, Bengali, Yoruba, Mongolian and Egyptian Arabic (CY, BN, YO, MN and ARZ for short), and select 3 types: Person, Location and Organization. For the food domain, we reorganize the entities in the Wikipedia category Food and drink into 5 types: Drinks, Meat, Vegetables, Condiments and Breads, and extract sentences containing those entities from all English Wikipedia articles to obtain as much data as possible. We use the 20190120 Wikipedia dump for WL data construction, where the ratio of words in anchors to the whole sentence is roughly 0.12, 0.07, 0.14, 0.07 and 0.06 for CY, BN, YO, MN and ARZ, and 0.13 for the food domain, demonstrating that unlabeled words are dominant. By heuristically setting θ_q = 0.1, θ_n = 0.9, we obtain 56,571, 16,718, 4,131, 8,332, 6,266 and 11,297 high-quality and 49,970, 50,197, 32,417, 10,918, 12,434 and 16,501 noisy WL sentences for CY, BN, YO, MN, ARZ and the food domain, respectively. For correctness, we then pick as test data the 25% of sentences with the highest annotation confidence that also exceed 0.3 coverage. We randomly choose 25% of the high-quality data as validation for early stopping, and use the rest for training. The statistics³ are in Table 1.

Training Details
For hyper-parameter tuning, we set the non-entity feature weights to λ1 = 0, λ2 = 0.9, λ3 = 0.1 heuristically. We pre-train word embeddings using GloVe (Pennington et al., 2014) and fine-tune the embeddings during training. We set the dimensions of word and character embeddings to 100 and 30, respectively. We use 30 filter kernels of size 3 in the character CNN, and the dropout rate is set to 0.5. For the Bi-LSTM, the hidden state has 150 dimensions. The batch size is set to 32 and 64 for the sequence labeling module and the classification module, respectively. We adopt Adam with L2 regularization for optimization, setting the learning rate to 0.001 and the weight decay to 1e-9.

Baselines Since most low-resource name tagging methods introduce external knowledge (Section 2), which has limited availability and is out of scope for this paper, we derive two types of baselines from weakly supervised models:

Typical NN-CRFs models (Ni et al., 2017), which select high-quality WL data and regard unlabeled words as O, usually achieving very competitive results. NN denotes CNN, Transformer (Trans for short) or Bi-LSTM.
NN-PCRFs models (Yang et al., 2018; Shang et al., 2018). Although they achieve state-of-the-art performance, methods of this type are only evaluated in specific domains and require a small set of seed annotations or a domain dictionary. We thus carefully adapt them to low-resource languages and domains by selecting the highest-quality WL data (θ_n > 0.3) as seeds⁴.

³ The statistics include noisy data, which greatly increases the size but cannot be used for evaluation.
⁴ We adopt the common part of their models related to handling weakly labeled data, removing the other parts that are specifically designed for domains, such as the instance selector (Yang et al., 2018), which performs worse since we have already selected the high-quality data.

Results on Low-Resource Languages
Table 2 shows the overall performance of our proposed model as well as the baseline methods (P and R denote Precision and Recall). We can see that our method consistently outperforms all baselines in the five languages w.r.t. F1, mainly because we greatly improve recall (2.7% to 9.34% on average) by taking the best advantage of WL data and being robust to noise via the two modules. As for precision, Partial-CRFs perform poorly compared with CRFs due to the uncertainty of unlabeled words, while our method alleviates this issue by introducing linguistic features in non-entity sampling. An exception occurs in CY, because it has the most training data, which may bring more accurate information than sampling. Actually, we can tune the non-entity ratio α to improve precision⁵; more studies can be found in Section 6.5. Besides, the sampling technique can utilize more prior features if available, which we leave to future work.
Among all encoders, Bi-LSTM has the greatest capacity for feature abstraction and achieves the highest precision in most languages. An unexpected exception is Yoruba, where CNN achieves higher performance. This indicates that the three encoders capture textual semantics from different perspectives; thus it is better to choose the encoder by considering the linguistic nature of the language.
As for the impact of resources, all the models perform worst in Yoruba. Interestingly, we conclude that name tagging performance in low-resource languages does not depend entirely on the absolute number of mentions in the training data, but largely on the average number of annotations per sentence. For example, Bengali has 1.9 mentions per sentence and all methods achieve their best results there, while the opposite holds for Welsh with 1.4 mentions per sentence. This verifies our data selection scheme (e.g., the annotation coverage n(·)); we give more discussion in Section 6.4.

Table 3 shows the overall performance in the food domain, where D, M, V, C and B denote Drinks, Meat, Vegetables, Condiments and Breads. We can observe a performance drop compared to the low-resource languages, mainly because of the larger number of types and sparser training data. Our model outperforms all of the baselines in all food types by 7.8% on average. The performance on condiments is relatively low, because most of them are composed of meat or vegetables, such as steak sauce, which overlaps with other types and makes recognition more difficult. A representative case demonstrates that our model is robust to noise induced by unlabeled words: in Figure 4, the sentence is from the noisy WL training data of the food domain, and only Maize is labeled, as B-V. Although our model is trained on this sentence, it successfully predicts yams as B-V. This example shows that our two-module design can utilize the noisy data while avoiding the side effects caused by incomplete annotation.

⁵ In this table, we show the performance using the same hyper-parameters in different languages for fairness.

Efficiency Analysis
We utilize θ_n, the main factor in annotation quality (Section 6.2), to trade off between high-quality and noisy WL data. As shown in Figure 3(a), the red curve denotes the training time and the blue curve F1. We can see that the performance of our model is relatively stable when θ_n ∈ [0, 0.15), while the time cost drops dramatically (from 90 to 20 minutes), demonstrating the robustness and efficiency of the two-module design. When θ_n ∈ [0.15, 0.3], the performance decreases greatly due to less available high-quality data for the sequence labeling module, while little further time is saved through the classification module. Thus, we choose θ_n = 0.1 in our experiments. A special case occurs when θ_n = 0: our model degrades to sequence labeling without the pre-trained classifier. Its performance is worse than that of θ_n = 0.1 due to the massive noisy data.

Impact of Non-Entity Sampling Ratio
We use the non-entity ratio α to control sampling; a higher α means more unlabeled words are labeled with O. As shown in Figure 3(b), precision increases as more words are assigned labels, while recall peaks twice (α = 0.4, 0.9), leading to the highest F1 at α = 0.9, which conforms to the statistics in Augenstein et al. (2017). There are two special cases. When α = 0, our model degrades to an NN-PCRFs model without non-entity sampling, and there are no seed annotations for training; the model performs poorly due to the dominant unlabeled words (Section 5.1). When α = 1, indicating that all unlabeled words are sampled as O, our model degrades to an NN-CRFs model, which has higher precision at the cost of recall. Clearly, that model suffers from a bias toward O labeling.

Impact of Non-Entity Features
We propose three features for non-entity sampling: nearby entities (f1), occurrence within entities (f2) and term/document frequency (f3). We now investigate how effective each feature is. Figure 3(c) shows the performance of our model when sampling non-entity words using each feature as well as their combinations. The first bar denotes the performance of sampling without any features. It is not satisfactory but still competitive, indicating the importance of non-entity sampling to Partial-CRFs. f2 alone contributes the most, and is enhanced by f3 because they provide complementary information. Surprisingly, f1 seems better than f3, but makes the model worse when combined with f2 and f3; we thus set λ1 = 0.

Conclusions
In this paper, we propose a novel name tagging model that consists of two modules, sequence labeling and classification, combined via shared parameters. We automatically construct WL data from Wikipedia anchors and split it into high-quality and noisy portions for training each module. The sequence labeling module focuses on the high-quality data and is costly due to the Partial-CRFs layer with non-entity sampling, which models all possible label combinations. The classification module focuses on the annotated words in the noisy data to pretrain the tag classifier efficiently. Experimental results on five low-resource languages and a specific domain demonstrate the model's effectiveness and efficiency.
In the future, we are interested in incorporating entity structural knowledge to enhance text representations (Cao et al., 2017, 2018b), applying transfer learning (Sun et al., 2019) to deal with the massive rare words and entities in low-resource name tagging, and introducing external knowledge for further improvement.

Figure 1 :
Figure 1: Example of weakly labeled data. B-NT and I-NT denote incomplete labels without types.

Figure 2 :
Figure 2: Framework. Rectangles denote the main components of the two steps, and rounded rectangles comprise the two modules of the neural model. In the input sentences, bold fonts denote labeled words, and the corresponding outputs are at the top. We use Partial-CRFs to model all possible label sequences (red paths from left to right, picking one label per column), controlled by non-entity sampling (struck-through labels according to the distribution). We replace the "UN" and "x-NT" labels with the corresponding possible labels to clarify the principle of PCRFs.
Name tagging aims to infer a sequence of labels $Y = y_1, \cdots, y_i, \cdots, y_{|X|}$, where $|X|$ is the length of the sequence and $y_i \in \mathcal{Y}$ is the label of the word $x_i$. Each label consists of boundary and type information; for example, B-ORG indicates that the word is the Beginning of an ORGanization entity. To keep the notation consistent, we use $\tilde{\mathcal{Y}} = \mathcal{Y} \cup \{UN, B\text{-}NT, I\text{-}NT\}$ to denote the label set of WL data, where UN indicates that the word is unlabeled, and NT denotes that only the type is unlabeled. In other words, a word with UN may take any label in $\mathcal{Y}$, and a word with NT may take any type. We define $\tilde{\mathcal{Y}}$ for notational clarity. To deal with the issue of limited annotations, we construct WL data $D = \{(X, \tilde{Y})\}$ based on Wikipedia anchors and taxonomy, where $\tilde{Y} = \tilde{y}_1, \cdots, \tilde{y}_i, \cdots, \tilde{y}_{|X|}$ and $\tilde{y}_i \in \tilde{\mathcal{Y}}$. An anchor $\langle m, e \rangle \in A$ links a mention $m$ to an entity $e \in E$, where $m$ contains one or several consecutive words of length $|m|$. In particular, we define $A(X)$ as the set of anchors in $X$. Most entities are mapped to hierarchically organized categories, namely the taxonomy $T$, which provides category information $C = \{c\}$. We define $C(e)$ as the category set of $e$, and $T_{\downarrow}(c)$ as the children of $c$.
(a) Efficiency analysis. (b) Impact of non-entity sampling ratio. (c) Impact of non-entity features.

Figure 3 :
Figure 3: Ablation study of our model in Mongolian.

Figure 4 :
Figure 4: Our predictions on a noisy WL sentence.

Table 1 :
The statistics of weakly labeled dataset.

Table 2 :
Performance (%) on low-resource languages.