Platforms for Non-speakers Annotating Names in Any Language

We demonstrate two annotation platforms that allow an English speaker to annotate names in any language without knowing the language. These platforms provided high-quality "silver-standard" annotations for low-resource language name taggers (Zhang et al., 2017) that achieved state-of-the-art performance on two surprise languages (Oromo and Tigrinya) at LoreHLT2017 1 and ten languages at TAC-KBP EDL2017 (Ji et al., 2017). We discuss strengths and limitations and compare other methods of creating silver- and gold-standard annotations using native speakers. We will make our tools publicly available for research use.


Introduction
Although researchers have been working on unsupervised and semi-supervised approaches to alleviate the demand for training data, most state-of-the-art models for name tagging, especially neural network-based models, still rely on a large amount of training data to achieve good performance. When applied to low-resource languages, these models suffer from data sparsity. Traditionally, native speakers of a language have been asked to annotate a corpus in that language. This approach is uneconomical for several reasons. First, for some languages with extremely low resources, it is not easy to access native speakers for annotation. For example, Chechen is spoken by only 1.4 million people and Rejiang is spoken by 200,000 people. Second, it is costly in both time and money to write an annotation guideline for a low-resource language and to train native speakers (who are usually not linguists) to learn the guidelines and qualify for annotation tasks. Third, we observed poor annotation quality and low inter-annotator agreement among newly trained native speakers in spite of their high language proficiency. For example, under DARPA LORELEI, 2 the performance of two native Uighur speakers on name tagging was only 69% and 73% F1-score, respectively.
Acknowledgments: We thank Kevin Blissett and Tongtao Zhang from RPI for their contributions to the annotations used for the experiments. This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contracts No. HR0011-15-C-0115 and No. HR0011-16-C-0102. The views, opinions and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.
1 https://www.nist.gov/itl/iad/mig/lorehlt-evaluations
Previous efforts to generate "silver-standard" annotations used Web search (An et al., 2003), parallel data (Wang and Manning, 2014), Wikipedia markups (Nothman et al., 2013; Tsai et al., 2016), and crowdsourcing (Finin et al., 2010). Annotations produced by these methods are usually noisy and specific to a particular writing style (e.g., Wikipedia articles), yielding unsatisfactory results and poor portability.
It is even more expensive to teach English-speaking annotators new languages. But can we annotate names in a language we don't know? Let's examine a Somali sentence:
"Sida uu saxaafadda u sheegay Dr Jaamac Warsame Cali oo fadhigiisu yahay magaalada Baardheere hadda waxaa shuban caloolaha la yaalla xarumaha caafimaadka 15-cunug oo lagu arkay fuuq bax joogto ah, wuxuu xusay dhakhtarku in ay wadaan dadaallo ay wax kaga qabanayaan xaaladdan"
Without knowing anything about Somali, an English speaker can guess that "Jaamac Warsame Cali" is a person name because it is capitalized, the word on its left, "Dr," is similar to "Dr." in English, and its spelling looks similar to the English "Jamac Warsame Ali." Similarly, we can identify "Baardheere" as a location name if we know from a common word dictionary that "magaalada" means "town" in English, and its spelling is similar to the English name "Bardhere."
What about languages that are not written in Roman (Latin) script? Fortunately, language-universal romanization (Hermjakob et al., 2018) or transliteration 3 tools are available for most living languages. For example, given a Tigrinya sentence and its romanized form, an English speaker can guess that "ዓብደልፈታሕ አል-ሲሲ" is a person name because its romanized form "aabedalefataahhe 'ale-sisi" sounds similar to the English name "Abdel-Fattah el-Sissi," and the romanized form of the word on its left, "ፕረዝደንት," (perazedanete) sounds similar to the English word "president."
Moreover, annotators may gradually acquire language-specific patterns and rules during annotation; e.g., a capitalized word preceded by "magaalaa" is likely to be a city name in Oromo, such as "magaalaa Adaamaa" (Adama city). Synchronizing such knowledge among annotators both improves annotation quality and boosts productivity.
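The spelling-similarity cue used in the Somali example can be sketched in a few lines of Python. This is only an illustration, not part of either platform: the function names and the small list of known English names are hypothetical, and the standard-library `difflib.SequenceMatcher` ratio stands in for the annotator's intuition that two spellings "look similar."

```python
from difflib import SequenceMatcher


def spelling_similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(candidate: str, known_names: list[str]) -> tuple[str, float]:
    """Return the known English name most similar to the candidate, with its score."""
    scored = [(name, spelling_similarity(candidate, name)) for name in known_names]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical list of English names, assumed for illustration only.
known_names = ["Jamac Warsame Ali", "Bardhere", "Adama"]

for candidate in ["Jaamac Warsame Cali", "Baardheere"]:
    name, score = best_match(candidate, known_names)
    print(f"{candidate!r} is closest to {name!r} (ratio {score:.2f})")
```

The same comparison could be applied to romanized forms of non-Latin scripts, although a plain character ratio is a much weaker signal than a human judgment of how a word sounds.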
The Information Sciences Institute (ISI) developed a "Chinese Room" interface 4 to allow a non-native speaker to translate foreign language text into English, based on a small set of parallel sentences that include overlapped words. Inspired by this, RPI and JHU developed two collaborative annotation platforms that exploit linguistic intuitions and resources to allow non-native speakers to perform name tagging efficiently and effectively.
Word recognition. Presentation of text in a familiar alphabet makes it easier to see similarities and differences between text segments, to learn aspects of the target language morphology, and to remember sequences previously seen.
Word pronunciation. Because named entities often are transliterated into another language, access to the sound of the words is particularly important for annotating names. Sounds can be exposed either through a formal expression language such as IPA, 5 or by transliteration into the appropriate letters of the annotator's native language.
Word and sentence meaning. The better the annotator understands the full meaning of the text being annotated, the easier it will be both to identify which named entities are likely to be mentioned in the text and what the boundaries of those mentions are. Meaning can be conveyed in a variety of ways: dictionary lookup to provide fixed meanings for individual words and phrases; description of the position of a word or phrase in a semantic space (e.g., Brown clusters or embedding space) to define words that are not found in a dictionary; and full sentence translation.
Word context. Understanding how a word is used in a given instance can benefit greatly from understanding how that word is used broadly, either across the document being annotated, or across a larger corpus of monolingual text. For example, knowing that a word frequently appears adjacent to a known person name suggests it might be a surname, even if the adjacent word in the current context is not known to be a name.
World knowledge. Knowledge of some of the entities, relations, and events referred to in the text allows the annotator to form a stronger model of what the text as a whole might be saying (e.g., a document about disease outbreak is likely to include organizations like Red Cross), leading to better judgments about components of the text.
History. Annotations previously applied to a use of a word form a strong prior on how a new instance of the word should be tagged. While some of this knowledge is held by the annotator, it is difficult to maintain such knowledge over time. Programmatic support for capturing prior conclusions (linguistic patterns, word translations, possible annotations for a mention along with their frequency) and making them available to the annotator is essential for large collaborative annotation efforts.
Adjudication. Disagreements among annotators can indicate cases that require closer examination. An adjudication interface is beneficial to enhance precision (see Section 4).
The next section discusses how we embody these requirements in two annotation platforms.
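As a rough illustration of the History requirement above, the sketch below records past labels per word and returns their relative frequencies as a prior for new instances. The class and method names are hypothetical and the labels are made-up examples; neither platform is claimed to store its history this way.

```python
from collections import Counter, defaultdict


class AnnotationHistory:
    """Keeps a count of how often each word form received each label."""

    def __init__(self):
        # word form (lowercased) -> Counter of labels
        self._counts = defaultdict(Counter)

    def record(self, token: str, label: str) -> None:
        """Store one annotation decision for a token."""
        self._counts[token.lower()][label] += 1

    def prior(self, token: str) -> dict:
        """Return label -> relative frequency for a token, or {} if unseen."""
        counts = self._counts.get(token.lower())
        if not counts:
            return {}
        total = sum(counts.values())
        return {label: n / total for label, n in counts.items()}


history = AnnotationHistory()
for label in ["GPE", "GPE", "PER", "GPE"]:  # illustrative decisions only
    history.record("Adaamaa", label)

print(history.prior("adaamaa"))  # {'GPE': 0.75, 'PER': 0.25}
```

Sharing one such store among annotators is one simple way to surface "possible annotations for a mention along with their frequency" when a word reappears.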

Annotation Platforms
We developed two annotation tools to explore the range of ways the desiderata might be fulfilled: ELISA and Dragonfly. After describing these interfaces, we show in Figure 1 how they fulfill the desiderata outlined in Table 2.
Annotation Panel. For each sentence in a document, we show the text in the original language, its English translation if available, and automatic romanization results generated with a language-universal transliteration library. 7 To label a name mention, the annotator clicks its first and last tokens, then chooses the desired entity type in the annotation panel. Clicking a token in the document shows its full definition in lexicons and bilingual example sentences containing that token. A floating pop-up displaying the romanization and a simple definition appears instantly when hovering over a token.
Rule Editor. Annotators may discover useful heuristics to identify and classify names, such as personal designators and suffixes indicative of locations. They can encode such clues as rules in the rule editor. Once created, each rule is rendered as a strikethrough line in the text and is shared among annotators. For example (Figure 1), if an annotator marks "agency" as an organization, all annotators will see a triangular sign below each occurrence of this word.
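To make the idea of a shared designator rule concrete, here is a minimal Python sketch. The `DesignatorRule` type and `suggest` function are hypothetical names introduced for illustration; the actual rule format of the rule editor is not described in this paper, so this only shows the general heuristic of flagging a capitalized token that follows a known designator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignatorRule:
    trigger: str  # designator word, e.g. "magaalaa" or "Dr"
    label: str    # entity type suggested for the token that follows


def suggest(tokens, rules):
    """Return (index, token, label) for capitalized tokens right after a trigger."""
    triggers = {rule.trigger.lower(): rule.label for rule in rules}
    hits = []
    for i in range(1, len(tokens)):
        label = triggers.get(tokens[i - 1].lower())
        if label and tokens[i][:1].isupper():
            hits.append((i, tokens[i], label))
    return hits


# Rules taken from the examples discussed earlier in the paper.
rules = [DesignatorRule("magaalaa", "GPE"), DesignatorRule("Dr", "PER")]

print(suggest(["magaalaa", "Adaamaa"], rules))
print(suggest("Dr Jaamac Warsame Cali".split(), rules))
```

The sketch only flags the first token after the designator; extending the suggestion to a multi-token name would still be left to the annotator.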

ELISA
Adjudication Interface. If multiple users process the same document, we can consolidate their annotations through an adjudication interface (Figure 3). This interface is similar to the annotation interface, except that competing annotations are displayed as blocks below the text. Clicking a block accepts the associated annotation. The adjudicator can accept all annotations from either annotator, or accept all agreed cases at once, by clicking one of the three interface buttons. The adjudicator then needs to focus only on disputed cases, which are highlighted with a red background.
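The split between agreed and disputed cases can be sketched as simple set operations. This assumes, purely for illustration, that each annotation is a `(start_token, end_token, entity_type)` tuple and that agreement means an exact match on all three fields; the function name is hypothetical and the interface's internal representation is not specified here.

```python
def split_for_adjudication(ann_a, ann_b):
    """Split two annotators' mention sets into agreed and disputed mentions.

    Each mention is a (start_token, end_token, entity_type) tuple.
    """
    agreed = ann_a & ann_b                 # identical span and type from both
    disputed = (ann_a | ann_b) - agreed    # proposed by only one annotator
    return agreed, disputed


annotator_a = {(0, 1, "GPE"), (5, 5, "PER")}
annotator_b = {(0, 1, "GPE"), (5, 6, "PER"), (9, 9, "ORG")}

agreed, disputed = split_for_adjudication(annotator_a, annotator_b)
print("accept at once:", sorted(agreed))
print("needs review:", sorted(disputed))
```

Under this definition a boundary disagreement such as `(5, 5, "PER")` versus `(5, 6, "PER")` produces two disputed entries, which is the kind of case an adjudicator would inspect.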

Dragonfly
Dragonfly, developed at the Johns Hopkins University Applied Physics Laboratory, takes a more word-centric approach to annotation. Each sentence to be annotated is laid out in a row, each column of which shows a word augmented with a variety of information about that word.
Figure 4 shows a screenshot of a portion of the Dragonfly tool being used to annotate text written in the Kannada language. The top entry in each column is the Kannada word. Next is a Romanization of the word (Hermjakob et al., 2018). The third entry is one or more dictionary translations, if available. The fourth entry is a set of dictionary translations of other words in the word's Brown cluster (Brown et al., 1992). While these tend to be less accurate than translations of the word itself, they can give a strong signal that a word falls into a particular category. For example, a Brown cluster containing translations such as "Paris," "Rome" and "Vienna" is likely to refer to a city, even if no translation exists to indicate which city. Finally, if automated labels for the sentence have been generated, e.g., by a trained name tagger, those labels are also displayed.
In addition to word-specific information, Dragonfly can present sentence-level information. In Figure 4, an automatic English translation of the sentence is shown above the words of the sentence (in this example, from Google Translate). Translations might also be available when annotating a parallel document collection. Other sentence-level information that might prove useful in this slot includes a topic model description, or a bilingual embedding of the entire sentence.
Figure 4 shows a short sentence that has been annotated with two name mentions. The first word of the sentence (Romanization "uttara") has translations of "due north," "northward," "north," etc. The second word has no direct translations or Brown cluster entries. However, its Romanization, "koriyaavannu," begins with a sequence that suggests the word "Korea" with a morphological ending.
Even without the presence of the phrase "North Korea" in the MT output, an annotator likely has enough information to draw the conclusion that the GPE "North Korea" is mentioned here. The presence of the phrase "North Korea" in the machine translation output confirms this choice.
The sentence also contains a word whose Romanization is "ttramp." This is a harder call. There is no translation, and the Brown cluster translations do not help. Knowledge of world events, examination of other sentences in the document, the translation of the following word, and the MT output together suggest that this is a mention of "Donald Trump;" it can thus be annotated as a person.

Experiments
We asked ten non-speakers to annotate names using our annotation platforms on documents in various low-resource languages released by the DARPA LORELEI program and the NIST TAC-KBP2017 EDL Pilot. The genres of these documents include newswire, discussion forum and tweets. Using non-speaker annotations as "silver-standard" training data, we trained name taggers (Lample et al., 2016). The lexicons loaded into the ELISA IE annotation platform were acquired from Panlex, 8 Geonames 9 and Wiktionary. 10 Dragonfly used bilingual lexicons by Rolston and Kirchhoff (2016).

Overall Performance
The agreement between non-speaker annotations from the ELISA annotation platform and gold-standard annotations from LDC native speakers on the same documents is between 72% and 85% for various languages. The ELISA platform enabled us to develop cross-lingual entity discovery and linking systems that achieved state-of-the-art performance at both the NIST LoreHLT2017 11 evaluation and, for ten languages, the TAC-KBP EDL2017 evaluation.
Four annotators used two platforms (two each) to annotate 50 VOA news documents for each of the five languages listed in Table 2. Their annotations were then adjudicated through the ELISA adjudication interface. The process took about one week. For each language we used 40 documents for training and 10 documents for testing in the TAC-KBP2017 EDL Pilot. In Table 2 we see that the languages with more annotated names (i.e., Albanian and Swahili) achieved higher performance.
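The paper does not state how the agreement figures were computed. One common choice for name tagging, shown here only as an assumption, is precision, recall and F1 over exactly matching mentions, treating the native-speaker annotations as gold. The function name and the `(doc_id, start, end, type)` mention format are hypothetical, and the sample mentions are made up.

```python
def span_prf(predicted, gold):
    """Return (precision, recall, F1) over exactly matching mentions.

    Each mention is a hashable tuple such as (doc_id, start, end, entity_type).
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up mentions for illustration: one of two silver mentions matches gold.
silver = {("doc1", 0, 1, "GPE"), ("doc1", 5, 5, "PER")}
gold = {("doc1", 0, 1, "GPE"), ("doc1", 5, 6, "PER")}

precision, recall, f1 = span_prf(silver, gold)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Exact matching is strict: a mention with the right type but a wrong boundary counts as both a false positive and a missed name.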

Silver Standard Creation
We compare our method with Wikipedia-based silver-standard annotations on Oromo and Tigrinya, two low-resource languages in the LoreHLT2017 evaluation. Table 3 shows the data statistics. We can see that with the ELISA annotation platform we were able to acquire many more topically relevant training sentences and thus achieved much higher performance.
Figure 5: Russian Name Tagging Performance Using Native-speaker and Non-speaker Annotations (curves: Native-speaker Annotation, Non-speaker Annotation).
Figure 5 compares the performance of Russian name taggers trained on gold-standard annotations by LDC native speakers and on silver-standard annotations by non-speakers through our annotation platforms, testing on 1,952 sentences with ground truth annotated by LDC native speakers. Annotations from our platforms got off to a good start, initially yielding higher performance than annotations from native speakers, because non-speakers quickly capture common names, which can be synthesized as effective features and patterns for our name tagger. However, after all the low-hanging fruit was picked, it became difficult for non-speakers to discover many uncommon names because of the limited coverage of the lexicons and romanization; thus the performance of the name tagger converged quickly and hit an upper bound. For example, the names most frequently missed by non-speakers include organization abbreviations and uncommon person names. Table 5 shows that the adjudication process significantly improved precision because annotators were able to fix annotation errors after extensive discussions on disputed cases and also gradually learned annotation rules and linguistic patterns. Most missed names remained unfixed during adjudication, so recall did not improve.