Name Tagging for Low-resource Incident Languages based on Expectation-driven Learning

In this paper we tackle a challenging name tagging problem in an emergent setting: the tagger must be built within a few hours for a new incident language (IL) using very few resources. Inspired by observing how human annotators attack this challenge, we propose a new expectation-driven learning framework. In this framework we rapidly acquire, categorize, structure and zoom in on IL-specific expectations (rules, features, patterns, gazetteers, etc.) from various non-traditional sources: consulting and encoding linguistic knowledge from native speakers, mining and projecting patterns from both mono-lingual and cross-lingual corpora, and typing based on cross-lingual entity linking. We also propose a cost-aware combination approach to compose expectations. Experiments on seven low-resource languages demonstrate the effectiveness and generality of this framework: we are able to set up a name tagger for a new IL within two hours, and achieve 33.8%-65.1% F-score.


Introduction: "Tibetan Room"
In many emergent situations such as disease outbreaks and natural disasters, there is great demand to rapidly develop a Natural Language Processing (NLP) system, such as a name tagger, for a "surprise" Incident Language (IL) with very few resources. Traditional supervised learning methods that rely on large-scale manual annotation would be too costly.
Let's start by investigating how a human would discover information in a foreign IL environment. When we are in a foreign country, even if we don't know the language, we would still be able to guess the word "gate" from the airport broadcast based on its frequency and position in a sentence; guess the word "station" by pattern mining of many subway station labels; and guess the word "left" or "right" from a taxi driver's GPS speaker by matching movement actions. We designed a "Tibetan Room" game, similar to the "Chinese Room" (Searle, 1980), by asking human users who don't know Tibetan to find persons, locations and organizations in some Tibetan documents. We designed an interface where test sentences are presented to the player one by one. When the player clicks a token, the interface displays up to 100 manually labeled Tibetan sentences that include this token. The player can also see translations of some common words and a small gazetteer of common names (800 entries) in the interface.
14 players who don't know Tibetan joined the game. Their name tagging F-scores ranged from 0% to 94%. We found that good players usually bring in some kind of "expectations" derived from their own native languages, general linguistic knowledge, or background knowledge about the scenario. They then actively search, confirm, adjust and update these expectations during tagging. For example, they know from English that location names often end with suffix words such as "city" and "country", so they search for phrases starting or ending with the translations of these suffix words. After they successfully tag some seeds, they continue to discover more names based on further expectations.
For example, if they already tagged an organization name A, and now observe a sequence matching a common English pattern "[A (Organization)]'s [Title] [B (Person)]", they will tag B as a person name. And if they know the scenario is about Ebola, they will look for a phrase with a translation similar to "West Africa" and tag it as a location. Similarly, based on the knowledge that names appearing in a conjunction structure often have the same type, they propagate high-confidence types across multiple names. They also keep gathering and synthesizing common contextual patterns and rules (such as position, frequency and length information) about names and non-names to expand their expectations. For example, after observing a token frequently appearing between a subsidiary and a parent organization, they will predict it as a preposition similar to "of" in English, and tag the entire string as a nested organization.
Based on the lessons learned from this game, we propose to automatically acquire and encode expectations about what will appear in IL data (names, patterns, rules), and use those expectations to drive IL name tagging. We explored various ways of systematically discovering and unifying latent and expressed expectations from non-traditional resources:
• Language Universals: language-independent rules and patterns;
• Native Speakers: interaction with native speakers through a machine-readable survey and supervised active learning;
• Prior Mining: IL entity prior knowledge mined from both mono-lingual and cross-lingual corpora and knowledge bases.
Furthermore, in emergent situations these expectations might not all be available at once, and they may have different costs, so we need to organize and prioritize them to yield optimal performance within given time bounds. Therefore we also experimented with various cost-aware composition methods, which take the acquired expectations plus a time bound for development (1 hour, 2 hours) as input, and output a wall-time schedule that determines the best sequence of applying modules and maximizes the use of all available resources. Experiments on seven low-resource languages demonstrate that our framework can create an effective name tagger for an IL within a couple of hours using very few resources.

Starting Time: Language Universals
First we use some language-universal rules, gazetteers and patterns to generate a binary feature vector F = {f_1, f_2, ...} for each token. Table 1 shows these universal features.
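As a minimal illustration, such a per-token binary feature vector might be computed as follows. This is a simplified sketch under our own assumptions: it covers only the surface features from Table 1 that can be checked on a single token, and the function name is ours.

```python
import re

def universal_features(token: str) -> dict:
    """Binary, language-universal surface features for one token
    (a simplified sketch of part of the Table 1 feature set)."""
    return {
        "Capitalized": token[:1].isupper() and not token.isupper(),
        "AllUppercased": token.isupper() and token.isalpha(),
        "MixedCase": bool(re.search(r"[a-z][A-Z]", token)),
        "InternalPeriod": "." in token[1:-1],  # e.g. "I.B.M."
        "Digits": token.isdigit(),
    }
```

Context-dependent features such as LongLength, TF-IDF and MultipleOccurrence would additionally need the sentence, document and corpus as input.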

Approach Overview
Figure 1 illustrates our overall approach of acquiring various expectations, by simulating the strategies human players adopted during the Tibetan Room game. Next we will present details about discovering expectations from each source.

Case: Capitalized; AllUppercased; MixedCase
Punctuation: InternalPeriod (includes an internal period)
Digit: Digits (consists of digits)
Length: LongLength (a name including more than 4 tokens is likely to be an ORG)
TF-IDF: TF-IDF (if a capitalized word appears at the beginning of a sentence and has a low TF-IDF score, it is unlikely to be a name)
Patterns: Pattern1: "Title ⟨PER Name⟩"; Pattern2: "⟨PER Name⟩, 00*," where 00 are two digits; Pattern3: "[⟨Name_1⟩ ...], ⟨Name_n-1⟩ ⟨single term⟩ ⟨Name_n⟩" where all names have the same type
Multi-occurrences: MultipleOccurrence (if a word appears in both uppercased and lowercased forms in a single document, it is unlikely to be a name)

Table 1: Universal Name Tagger Features

Acquiring Expectations from Native Speakers

We design a human-in-the-loop process to acquire knowledge from native speakers. To meet the needs of the emergent setting, we design a comprehensive survey that aims to acquire a wide range of IL-specific knowledge from native speakers in an efficient way. The survey categorizes questions and organizes them into a tree structure, so that the order of questions is chosen based on the answers to previous questions. The survey answers are then automatically translated into rules, patterns or gazetteers in the tagger. Some example questions are shown in Table 2.

Mono-lingual Expectation Mining
We use a bootstrapping method to acquire IL patterns from unlabeled mono-lingual IL documents. Following the same idea as (Agichtein and Gravano, 2000; Collins and Singer, 1999), we first use names identified by high-confidence rules as seeds, and generalize patterns from the contexts of these seeds. Then we evaluate the patterns and apply high-quality ones to find more names as new seeds. This process is repeated iteratively.
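The bootstrapping loop can be sketched as follows. This is a simplified illustration under our own assumptions, not the authors' implementation: contexts are fixed-size token windows, and the pattern-quality filtering step is omitted.

```python
def bootstrap(sentences, seed_names, rounds=2):
    """Skeleton of seed-based pattern bootstrapping (simplified sketch).
    `sentences` is a list of token lists; `seed_names` is a set of
    single-token names found by high-confidence rules."""
    names, patterns = set(seed_names), set()
    for _ in range(rounds):
        # Step 1: generalize patterns from the contexts of known names.
        for sent in sentences:
            for i, tok in enumerate(sent):
                if tok in names:
                    left = tuple(sent[max(0, i - 2):i])   # 2 tokens of left context
                    right = tuple(sent[i + 1:i + 2])      # 1 token of right context
                    patterns.add((left, right))
        # Step 2: apply the patterns to discover new names as new seeds.
        for sent in sentences:
            for i, tok in enumerate(sent):
                left = tuple(sent[max(0, i - 2):i])
                right = tuple(sent[i + 1:i + 2])
                if (left, right) in patterns:
                    names.add(tok)
    return names, patterns
```

For example, with the seed "Sin" and the Hausa contexts discussed below, the pattern ("gwamnatin", "kasar") _ ("ta",) is learned and then matches "Fiji" in a new sentence.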
We define a pattern as a triple ⟨left, name, right⟩, where name is a name string, and left and right are context vectors with weighted terms (the weight is computed based on each token's tf-idf score). For example, from the Hausa sentence "gwamnatin kasar Sin ta samar wa kasashen yammacin Afirka ... (the Government of China has given ... products to the West African countries)", we can discover a pattern:

⟨{gwamnatin, kasar}, Sin, {ta}⟩

This pattern matches strings like "gwamnatin kasar Fiji ta (by the government of Fiji)".
For any two triples t_i = ⟨l_i, name_i, r_i⟩ and t_j = ⟨l_j, name_j, r_j⟩, we compute their similarity by:

sim(t_i, t_j) = l_i · l_j + r_i · r_j

We use this similarity measure to cluster all triples and select the centroid triple of each cluster as a candidate pattern.
Similar to (Agichtein and Gravano, 2000), we evaluate the quality of a candidate pattern P by:

Conf(P) = P_positive / (P_positive + P_negative)

where P_positive is the number of positive matches for P and P_negative is the number of negative matches. Due to the lack of syntactic and semantic resources to refine these lexical patterns, we set a conservative confidence threshold of 0.9.
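The triple similarity and the pattern confidence score can be sketched directly from the definitions above. The function names and the sparse-dictionary representation of context vectors are our own assumptions.

```python
def dot(u: dict, v: dict) -> float:
    """Dot product of two sparse term-weight vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def triple_similarity(t_i, t_j) -> float:
    """sim(t_i, t_j) = l_i . l_j + r_i . r_j, where each triple is
    (left_vector, name, right_vector)."""
    (l_i, _, r_i), (l_j, _, r_j) = t_i, t_j
    return dot(l_i, l_j) + dot(r_i, r_j)

def pattern_confidence(positive: int, negative: int) -> float:
    """Conf(P) = P_positive / (P_positive + P_negative),
    following Agichtein and Gravano (2000)."""
    total = positive + negative
    return positive / total if total else 0.0
```

A pattern would be kept as high-quality only when `pattern_confidence` reaches the conservative 0.9 threshold mentioned above.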

Cross-lingual Expectation Projection
Name tagging research has been conducted for high-resource languages such as English for over twenty years, so we have learned a lot about them. We collected 1,362 patterns from the English name tagging literature. Some examples are listed below:
1. The names of people, organizations and locations start with a capitalized (uppercased) letter.
2. The first word of a sentence starts with a capitalized (uppercased) letter.
3. Some periods indicate name abbreviations, e.g., St. = Saint, I.B.M. = International Business Machines.
4. Locations usually include designators, e.g., country: United States; city: Washington.
5. Some prepositions are part of names.

Text input:
1. Morphology: please enter as many preposition suffixes as you can (e.g., "da" in "Ankara da yaşıyorum (I live in Ankara)" is a preposition suffix which means "in").
Translation:
1. Please translate the following English words and phrases:
- organization suffixes: agency, group, council, party, school, hospital, company, office, ...
- time expressions: January, ..., December; Monday, ..., Sunday; ...

Table 2: Survey Question Examples
Besides static knowledge like patterns, we can also dynamically acquire expected names from topically-related English documents for a given IL document. We apply the Stanford name tagger (Finkel et al., 2005) to the English documents to obtain a list of expected names. Then we translate the English patterns and expected names into the IL. When no human-constructed English-to-IL lexicon is available, we derive a word-for-word translation table from a small parallel data set using the GIZA++ word alignment tool (Och and Ney, 2003). We also convert IL text to Latin characters based on a Unicode mapping, and then apply Soundex codes (Mortimer and Salathiel, 1995; Raghavan and Allan, 2004) to find the IL name equivalent whose pronunciation is most similar to each English name. For example, the Bengali name "টনি ব্লেয়ার" and "Tony Blair" have the same Soundex code "T500 B460".
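A minimal Soundex implementation illustrates the matching step; this sketch omits the standard H/W adjacency rule of full Soundex, which does not affect the example below.

```python
def soundex(word: str) -> str:
    """Minimal Soundex code (simplified: the H/W rule is omitted)."""
    groups = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3",
              "L": "4", "MN": "5", "R": "6"}
    def digit(c):
        return next((d for letters, d in groups.items() if c in letters), "")
    word = word.upper()
    code, prev = word[0], digit(word[0])
    for c in word[1:]:
        d = digit(c)
        if d and d != prev:   # skip vowels and collapse repeated digits
            code += d
        prev = d
    return (code + "000")[:4]  # pad to letter + 3 digits

print(" ".join(soundex(w) for w in "Tony Blair".split()))  # T500 B460
```

Each romanized IL token sequence would be encoded the same way and compared against the codes of the expected English names.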

Mining Expectations from KB
In addition to unstructured documents, we also leverage structured English knowledge bases (KBs) such as DBpedia. Each entry is associated with a set of types such as Company, Actor and Agent. We utilize the Abstract Meaning Representation corpus (Banarescu et al., 2013), which contains both entity type and linked KB title annotations, to automatically map 9,514 entity types in DBpedia to three main entity types of interest: Person (PER), Location (LOC) and Organization (ORG).
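Conceptually, the mapping collapses fine-grained DBpedia types into PER/LOC/ORG. The handful of entries below are only illustrative assumptions; the actual mapping covers 9,514 types and is induced automatically from AMR annotations.

```python
# Illustrative fragment of the DBpedia-type -> main-entity-type mapping;
# the full table is induced from AMR data, so these entries are examples only.
DBPEDIA_TO_MAIN = {
    "Person": "PER", "Actor": "PER", "Politician": "PER",
    "Place": "LOC", "City": "LOC", "Country": "LOC",
    "Organisation": "ORG", "Company": "ORG", "PoliticalParty": "ORG",
}

def main_entity_type(dbpedia_types):
    """Map the DBpedia types of a linked entity to PER/LOC/ORG."""
    for t in dbpedia_types:
        if t in DBPEDIA_TO_MAIN:
            return DBPEDIA_TO_MAIN[t]
    return None  # no mapped type found
```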
Then we adopt a language-independent cross-lingual entity linking system (Wang et al., 2015) to link each IL name mention to English DBpedia. This linker is based on an unsupervised quantified collective inference approach. It constructs knowledge networks from the IL source documents based on entity mention co-occurrence, and knowledge networks from the KB. Each IL name is matched with candidate entities in the English KB using name translation pairs derived from inter-lingual KB links in Wikipedia and DBpedia. We also apply the word-for-word translation tables constructed from parallel data, as described in Section 3.4, to translate some uncommon names. The linker then performs semantic comparison between the two knowledge networks based on three criteria: salience, similarity and coherence. Finally, we map the DBpedia types associated with the linked entity candidates to obtain the entity type for each IL name.

Supervised Active Learning
We anticipated that not all expectations can be encoded as explicit rules and patterns, or covered by projected names; therefore, for comparison, we introduce a supervised method with pool-based active learning to learn implicit expectations (features, new names, etc.) directly from human data annotation. We exploit basic lexical features including n-grams, adjacent tokens, casing information, punctuation and frequency to train a Conditional Random Fields (CRFs) (Lafferty et al., 2001) model through active learning (Settles, 2010).
We segment documents into sentences and use each sentence as a training unit. Let x*_b be the most informative instance according to a query strategy ϕ(x), a function used to evaluate each instance x in the unlabeled pool U. Algorithm 1 illustrates the procedure.
Algorithm 1 Pool-based Active Learning
1: L ← labeled set, U ← unlabeled pool
2: ϕ(·) ← query strategy, B ← query batch size
3: M ← maximum number of tokens
4: while Length(L) < M do
5:   θ = train(L)
6:   for b ∈ {1, 2, ..., B} do
7:     x*_b = argmax_{x∈U} ϕ(x)
8:     L ← L ∪ {⟨x*_b, label(x*_b)⟩}
9:     U ← U − {x*_b}
10:  end for
11: end while

Jing et al. (2004) proposed an entropy measure for active learning in an image retrieval task. We compared it with other measures proposed by Settles and Craven (2008) and found that sequence entropy (SE) is the most effective for our name tagging task. We use ϕ_SE to represent how informative a sentence is:

ϕ_SE(x) = −Σ_{t=1}^{T} Σ_m P_θ(y_t = m) log P_θ(y_t = m)

where T is the length of x, m ranges over all possible token labels, and P_θ(y_t = m) is the probability that y_t is tagged as m.
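The sequence entropy query strategy can be sketched as follows; this is a simplified illustration under our own assumptions, taking per-token marginal label probabilities (as produced by a CRF) as plain dictionaries.

```python
import math

def sequence_entropy(token_marginals):
    """phi_SE(x): summed token-label entropy of one sentence.
    `token_marginals` has one dict per token, mapping each label m
    to the model's marginal probability P_theta(y_t = m)."""
    return -sum(p * math.log(p)
                for marginals in token_marginals
                for p in marginals.values() if p > 0)

def query_batch(pool, batch_size):
    """Select the batch of unlabeled sentences with the highest phi_SE."""
    return sorted(pool, key=sequence_entropy, reverse=True)[:batch_size]
```

A sentence whose token labels the model is certain about scores zero, so annotation effort is directed toward the most ambiguous sentences.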

Cost-aware Combination
A new requirement for IL name tagging is a Linguistic Workflow Generator, which can generate an activity schedule to organize and maximize the use of acquired expectations to yield optimal F-scores within given time bounds. Therefore, the input to the IL name tagger is not only the test data, but also a time bound for development (1 hour, 2 hours, 24 hours, 1 week, 1 month, etc.). Figure 2 illustrates our cost-aware expectation composition approach. Given some IL documents as input, as the clock ticks, the system delivers name tagging results at time 0 (immediately), time 1 (e.g., in one hour) and time 2 (e.g., in two hours). At time 0, name tagging results are provided by the universal tagger described in Section 2. During the first hour, we can either ask the native speaker to annotate a small amount of data for supervised active learning of a CRFs model, or fill in the survey to build a rule-based tagger. We estimate the confidence value of each expectation-driven rule based on its precision score on a small development set of ten documents. Then we apply these rules in the priority order of their confidence values. When the results of two taggers conflict on either mention boundary or type, if the applied rule has high confidence we trust its output; otherwise we adopt the CRFs model's output.
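The conflict resolution step can be sketched as follows. This is a simplified illustration under our own assumptions: mentions are keyed by exact (start, end) spans, so boundary conflicts are reduced to span mismatches, and the high-confidence threshold value is assumed, not taken from the paper.

```python
HIGH_CONFIDENCE = 0.8  # assumed value; the paper estimates each rule's
                       # confidence as its precision on a 10-document dev set

def combine_outputs(rule_outputs, crf_output):
    """Merge rule-based and CRF name tagging for one document (a sketch).
    `rule_outputs`: list of (confidence, {(start, end): type}) per rule;
    `crf_output`:   {(start, end): type} from the CRFs model."""
    merged = dict(crf_output)
    # Apply rules in priority order of their confidence values.
    for confidence, mentions in sorted(rule_outputs, key=lambda r: -r[0]):
        for span, etype in mentions.items():
            if span not in merged:
                merged[span] = etype
            elif merged[span] != etype and confidence >= HIGH_CONFIDENCE:
                merged[span] = etype  # trust the high-confidence rule
    return merged
```

Low-confidence rules thus only fill gaps the CRF missed, while high-confidence rules may override the CRF's type decisions.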

Experiments
In this section we will present our experimental details, results and observations.

Data
We evaluate our framework on seven low-resource incident languages: Bengali, Hausa, Tagalog, Tamil, Thai, Turkish and Yoruba, using the ground-truth name tagging annotations from the DARPA LORELEI program. Table 3 shows the data statistics.

Cost-aware Overall Performance
We test with three checkpoints: starting time, within one hour, and within two hours. Based on the combination approach described in Section 5, there are three possible combinations of the expectation-driven learning and supervised active learning methods during the two hours: (1) expectation-driven learning + supervised active learning; (2) supervised active learning + expectation-driven learning; and (3) supervised active learning for two hours. Figure 3 compares the overall performance of these combinations for each language.
We can see that our approach is able to rapidly set up a name tagger for an IL and achieves promising performance. During the first hour, there is no clear winner between expectation-driven learning and supervised active learning. But it is clear that supervised active learning for two hours is generally not the optimal solution. Using Hausa as a case study, we take a closer look at the supervised active learning curve shown in Figure 4. Supervised active learning based on simple lexical features tends to converge quickly. As time goes by, it reaches its own upper bound in learning and generalizing linguistic features. In these cases our proposed expectation-driven learning method can compensate by providing more explicit and deeper IL-specific linguistic knowledge.

Table 4 shows the performance gain of each type of expectation acquisition method. IL gazetteers covered some common names, especially when the universal case-based rules failed to identify names in non-Latin-script languages. IL name patterns were mainly effective for classification. For example, the Tamil name "கத் தோலிக் கன் சிரியன் வங் கியில (Catholic Syrian Bank)" was classified as an organization because it ends with an organization suffix word "வங் கியில (bank)". The patterns projected from English proved very effective at identifying name boundaries. For example, some non-names such as titles are also capitalized in Turkish, so simple case-based patterns produced many spurious names, but the projected patterns can fix many of them; in one Turkish sentence, a projected pattern successfully identified "Catherine Ashton" as a person. Cross-lingual entity linking based typing enhanced classification accuracy, especially for languages where names often appear in the same form as in English, so entity linking achieved high accuracy. For example, "George Bush" is written the same in Hausa, Tagalog and Yoruba as in English.
Figure 5 shows the comparison of supervised active learning and passive learning (random sampling in training data selection). We asked a native speaker to annotate Chinese news documents for one hour, and estimated the human annotation speed as approximately 7,000 tokens per hour. Therefore we set the number of tokens to 7,000 for one hour and 14,000 for two hours. We can clearly see that supervised active learning significantly outperforms passive learning for all languages, especially for Tamil, Tagalog and Yoruba. Because of the rich morphology of Turkish, the gain of supervised active learning is relatively small there, because simple lexical features cannot capture name-specific characteristics regardless of the size of the labeled data. For example, some prepositions (e.g., "nin (in)") can be part of names, so it is difficult to determine name boundaries, as in "<ORG Ludian bölgesi hastanesi>nin (in <ORG Ludian Hospital>)".

Table 5 presents the detailed break-down scores for all languages. We can see that name identification, especially organization identification, is the main bottleneck for all languages. For example, many organization names in Hausa are very long, nested or all lowercased, such as "makarantar horas da Malaman makaranta ta Bawa Jan Gwarzo (Bawa Jan Gwarzo Memorial Teachers College)" and "kungiyar masana'antu da tattalin arziki ta kasar Sin (China's Association of Business and Industry)". Our name tagger will further benefit from more robust universal word segmentation, rich morphological analysis and IL-specific knowledge. For example, in Tamil "ஃ" is a visarga used as a diacritic to write foreign sounds, so we can infer that a phrase including it (e.g., "ஹெய் ஃபாவின் (Haifa)") is likely to be a foreign name. Therefore our survey should be enriched by exercising it on many languages to capture more categories of linguistic phenomena.
Related Work

Name taggers have been developed for many high-resource languages (… et al., 2000), including German (Thielen, 1995), Italian (Cucchiarelli et al., 1998), Greek (Karkaletsis et al., 1999), Spanish (Arévalo et al., 2002), Portuguese (Hana et al., 2006), Serbo-Croatian (Nenadić and Spasić, 2000), Swedish (Dalianis and Åström, 2001) and Turkish (Tür et al., 2003). However, most previous work relied on a substantial amount of resources such as language-specific rules, basic tools such as part-of-speech taggers, large amounts of labeled data, or huge amounts of Web n-gram data, which are usually unavailable for low-resource ILs. In contrast, in this paper we place the name tagging task in a new emergent setting where we need to process a surprise IL within a very short time using very few resources.

Impact of Supervised Active Learning
The TIDES 2003 Surprise Language Hindi Named Entity Recognition task had a similar setting: a name tagger was required to be finished within a time bound (five days). However, 628 labeled documents were provided in the TIDES task, while in our setting no labeled documents are available at the starting point. Therefore we applied active learning to efficiently annotate about 40 documents for each language, and proposed new methods to learn expectations. The results on the tested ILs are still far from perfect, but we hope our detailed comparison and result analysis can introduce new ideas for balancing the quality and cost of name tagging.

Conclusions and Future Work
Name tagging for a new IL is a very important but also challenging task. We conducted a thorough study on various ways of acquiring, encoding and composing expectations from multiple nontraditional sources. Experiments demonstrate that this framework can be used to build a promising name tagger for a new IL within a few hours. In the future we will exploit broader and deeper entity prior knowledge to improve name identification. We will aim to make the framework more transparent for native speakers so the survey can be done in an automatic interactive question-answering fashion. We will also develop methods to make the tagger capable of active self-assessment to produce the best workflow within time bounds.