Phonologically Aware Neural Model for Named Entity Recognition in Low Resource Transfer Settings

Named Entity Recognition is a well established information extraction task with many state of the art systems existing for a variety of languages. Most systems rely on language specific resources, large annotated corpora, gazetteers and feature engineering to perform well monolingually. In this paper, we introduce an attentional neural model which only uses language universal phonological character representations with word embeddings to achieve state of the art performance in a monolingual setting using supervision and which can quickly adapt to a new language with minimal or no data. We demonstrate that phonological character representations facilitate cross-lingual transfer and outperform orthographic representations, and that incorporating both attention and phonological features improves the statistical efficiency of the model in 0-shot and low data transfer settings with no task specific feature engineering in the source or target language.


Introduction
Named Entity Recognition (NER) (Nadeau and Sekine, 2007; Marrero et al., 2013) is an information extraction task that deals with finding and classifying entities in text into a fixed set of types of interest. It is challenging for a variety of reasons. Named Entities (NEs) can be arbitrarily synthesized (e.g. people's/organizations' names). NEs are often not subject to uniform cross-linguistic spelling conventions: compare France (English) and Frantsiya (Uzbek). NEs occur rarely in data, which makes generalization difficult. Skewed class statistics necessitate measures to prevent models from merely favoring a majority class.
Named entities must also be annotated in context (e.g. "[New York Times] ORG" vs. "[New York] LOC"). Lexical ambiguity (Turkey-country vs. bird), semantic ambiguity ("I work at the [New York Times] ORG" vs. "I read the New York Times") and sparsity induced by morphology add complexity.
Despite the challenges mentioned above, competent monolingual systems have been developed that rely on having sufficient annotated data, knowledge and resources available for engineering features. A more challenging task is to design a model that retains competence in monolingual scenarios and can easily be transferred to a low resource language with minimum overhead in terms of data annotation requirements and feature engineering. This transfer setting introduces additional challenges such as varying character usage conventions across languages with the same script, differing scripts, lack of NE transliteration, varying morphology, different lexicons and mutual non-intelligibility, to name a few.
We propose the following changes over prior work to address the challenges of the low-resource transfer setting. We use:
1. Language universal phonological character representations instead of orthographic ones.
2. Attention over characters of a word while labeling it with an NE category.
We show that using phonological character representations instead of orthographic ones does not negatively impact performance on two languages: Spanish and Turkish. We then show that using global phonological representations enables model transfer from one or more source languages to a target language with no extra effort, even when the languages use different scripts. We demonstrate that while attention over characters of words has marginal utility in monolingual and high resource settings, it greatly improves the statistical efficiency of the model in 0-shot and low resource transfer settings. We do require a mapping from a language's script to phonological feature space, which is script specific and not task specific. This presents little or no overhead due to the existence of tools like PanPhon (Littell et al., 2016).
Figure 1 provides a high level overview of our model. We model the words of a sentence at the type level and the token level. At the type level (ignorant of sentential context), we use bidirectional character LSTMs as in figure 2 to compose characters of a word to obtain its word representation and concatenate this with a word embedding that captures distributional semantics. This can memorize entities or capture morphological and suffixal clues that can help in a discriminative task like NER. We compose type level word representations with bidirectional LSTMs to obtain token level (cognizant of sentential context) representations. Using token level word representations along with an attentional context vector for each word based on the sequence of characters it contains, we generate score functions used by a Conditional Random Field (CRF) for inference. To facilitate transfer across languages with different scripts, we use Epitran and PanPhon (Littell et al., 2016).

Our Approach
Epitran is a straightforward orthography-to-IPA (International Phonetic Alphabet [language universal]) system including a collection of preprocessors and grapheme-to-phoneme mappings for a variety of language-script pairs. It converts a word from its native script into a sequence of IPA characters, each of which approximately corresponds to a phoneme. PanPhon is a database of IPA-to-phonological feature vector mappings and a library for querying, manipulating, and exploiting this database. It consumes the output of Epitran and produces feature vectors (21 dimensions) of phonological characteristics such as whether a phoneme is articulated with (accompanied by) vibration of the vocal folds (voiced), with the tongue in a high, low, back, or front position, with the lips rounded or unrounded, with the tongue tip or blade (coronal), etc. Figure 3 shows an example in Uyghur (Perso-Arabic script) and Turkish (Latin script), making the equivalence across scripts apparent. We concatenate the feature vectors from PanPhon and 1-hot encodings of the corresponding IPA characters and use these as inputs to the character bi-LSTMs.
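The construction of the character inputs can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the three-feature table `TOY_FEATURES`, its values, and the helper `char_inputs` are hypothetical stand-ins for PanPhon's 21-dimensional vectors and for the IPA segments that Epitran would produce.

```python
# Toy stand-in for a phonological feature table (assumption: the real
# PanPhon vectors have 21 dimensions; the 3 features and values here
# are purely illustrative).
TOY_FEATURES = {
    "t": [-1, -1, -1],
    "y": [1, 1, 1],
    "r": [1, -1, -1],
    "k": [-1, 1, -1],
}
IPA_VOCAB = sorted(TOY_FEATURES)  # fixed ordering for the 1-hot index


def one_hot(segment, vocab):
    """Return a 1-hot list marking the position of `segment` in `vocab`."""
    return [1 if s == segment else 0 for s in vocab]


def char_inputs(ipa_segments):
    """Concatenate [phonological features ; 1-hot IPA id] for each segment."""
    return [TOY_FEATURES[s] + one_hot(s, IPA_VOCAB) for s in ipa_segments]


# One input vector per IPA segment; each has 3 + 4 = 7 dimensions here.
inputs = char_inputs(["t", "y", "r", "k"])
```

Keeping the 1-hot identity alongside the features lets the character LSTM distinguish segments whose feature vectors coincide, while the shared features expose similarity between segments.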
This shift to IPA space is motivated by prior work (Tsvetkov et al., 2015; Tsvetkov and Dyer, 2015) which demonstrated the value of projecting orthographic surface forms of words into a phonological space for detecting loan words, transliteration and cognates even in language pairs that exhibit significant typological, morphological and phonological differences. Our underlying assumption is that named entities are likely to be transliterated or retain pronunciation patterns across languages. Additionally, phenomena such as vowel harmony manifest explicitly in IPA representation and can potentially be helpful for NER. Foreign named entities, for example, need not obey vowel harmony rules prevalent in languages like Turkish. A powerful sequence model such as an LSTM could be tolerant to the noise created by lexical aberrations, lack of spelling conventions, vowel raising etc. when given a phonological representation of an NE in different languages.
Our second proposed change is to incorporate attention over the sequence of IPA segments in a word when predicting scores for its possible labels. Attention is an unsupervised alternative to convolution or feature engineering to capture helpful localized phenomena like capitalization of first letter, presence of case markers, special characters, helpful morphological suffixes etc. or the conjunction of multiple such phenomena. Such features have been the mainstay of most prior work for NER. Most of these features are subtle and occur at the type level, whereas the CRF performs inference at the token level. We show (empirically) that attention makes the CRF more sensitive to such useful type level phenomena during inference and improves the statistical efficiency of the model in certain scenarios. Having described our intuitions, we now provide mathematical details of our model.

LSTM
Long Short Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) belongs to a class of neural networks called Recurrent Neural Networks (RNNs), which are capable of processing sequential input of unbounded and arbitrary length. This makes them suitable for a sequence labeling task. LSTMs incorporate gating functions at each time step to allow the network to forget, remember and update contextual memory and to mitigate problems like vanishing gradients. In our implementation, $\odot$ indicates element-wise product and $\sigma$ indicates the element-wise sigmoid function.
In practice we use bi-directional LSTMs (each with its own parameters) to mitigate cases where the LSTM may be biased towards the last few inputs it reads. This is done both at the word level and the character level. Let the hidden state at time step $t$ of the forward LSTM be denoted by $\overrightarrow{h}_t$ and the corresponding state of the backward LSTM be denoted by $\overleftarrow{h}_t$. At the character level, for a word with $L$ characters, we only take the final hidden state of each direction and concatenate them to get a representation of the word that comprises these characters. At the word level, we concatenate corresponding forward and backward LSTM states, $w_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$, which is the token level representation for the $t$-th word in a sentence $X$. We use this to generate un-normalized energy/score functions for the CRF layer.
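The two ways of combining forward and backward states can be sketched as below. This is an illustration only: `toy_cell` is a hypothetical running-sum recurrence standing in for an LSTM cell, and all function names are assumptions introduced for this example.

```python
def run_rnn(cell, inputs, h0):
    """Run a recurrent cell over `inputs` in order; return all hidden states."""
    states, h = [], h0
    for x in inputs:
        h = cell(h, x)
        states.append(h)
    return states


def bidirectional(cell_fwd, cell_bwd, inputs, h0):
    """Return forward and backward states, both aligned to input positions."""
    fwd = run_rnn(cell_fwd, inputs, h0)
    # The backward pass reads the sequence reversed; re-reverse to align.
    bwd = run_rnn(cell_bwd, inputs[::-1], h0)[::-1]
    return fwd, bwd


def word_from_chars(fwd, bwd):
    """Character level: concatenate the final state of each direction.

    The forward pass ends at the last character, the backward pass at the first.
    """
    return fwd[-1] + bwd[0]


def token_states(fwd, bwd):
    """Word level: concatenate forward and backward states at every step t."""
    return [f + b for f, b in zip(fwd, bwd)]


def toy_cell(h, x):
    """Hypothetical stand-in for an LSTM cell: a 1-d running sum."""
    return [h[0] + x[0]]


fwd, bwd = bidirectional(toy_cell, toy_cell, [[1.0], [2.0], [3.0]], [0.0])
word_vector = word_from_chars(fwd, bwd)
tokens = token_states(fwd, bwd)
```

In a real model the two directions would have separate parameters; the same toy cell is reused here only to keep the sketch short.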

Attention
Let $w_t$ indicate the concatenated word bi-LSTM output (of dimension $d_1$) at step $t$ corresponding to the $t$-th word ($X_t$) in the input sequence $X$. Let $M_t$ be the matrix containing the concatenated bidirectional character LSTM outputs for each character of $X_t$. It has dimensions $(l_t, d_2)$ where $d_2$ is the dimension of the concatenated bi-directional character LSTM hidden states and $l_t$ is the number of characters in $X_t$. Let $m_t^i$ denote the $i$-th row of $M_t$. Let $P$ be a parameter matrix of dimension $(d_1, d_2)$ and $p$ be a bias vector of length $d_2$. We follow  in the formulation of the attention context vector $a_t$. The attentional context vector $a_t$ is then appended to $w_t$ to obtain the concatenated vector $u_t = [a_t ; w_t]$. We apply a linear transform $U$ (a matrix of dimension $(d_1 + d_2, k)$ where $k$ is the number of unique NER tags). This gives us:

$$e_t = U^\top u_t \qquad (1)$$

where $e_t$ is a vector of un-normalized energy/score functions indicating the compatibility between word $X_t$ and each of the $k$ possible NER labels it can be given. Note that each word has a separate attention context vector obtained using the character LSTM hidden states generated by its constituent characters.
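A sketch of one attention formulation that is consistent with the dimensions stated above is given below. The exact scoring function is an assumption: the code uses the bilinear form $\alpha_t = \mathrm{softmax}(M_t (P^\top w_t + p))$ and $a_t = M_t^\top \alpha_t$, chosen only because it matches the shapes of $P$, $p$ and $M_t$; the helper names and toy values are hypothetical.

```python
import math


def matvec_T(A, v):
    """Compute A^T v for A given as a list of rows, with len(A) == len(v)."""
    return [sum(A[i][j] * v[i] for i in range(len(A))) for j in range(len(A[0]))]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_energy(w_t, M_t, P, p, U):
    """Shapes: w_t (d1), M_t (l_t x d2), P (d1 x d2), p (d2), U ((d1+d2) x k)."""
    # Assumed scoring: project the word state into character space.
    query = [q + b for q, b in zip(matvec_T(P, w_t), p)]
    # One score per character (row of M_t), normalised over the word.
    scores = [sum(m * q for m, q in zip(row, query)) for row in M_t]
    alpha = softmax(scores)
    a_t = matvec_T(M_t, alpha)   # weighted sum of character states, length d2
    u_t = a_t + w_t              # concatenation [a_t ; w_t]
    e_t = matvec_T(U, u_t)       # e_t = U^T u_t, one energy per tag
    return alpha, a_t, e_t


# Toy sizes (assumptions): d1 = 2, d2 = 2, l_t = 3 characters, k = 3 tags.
w_t = [1.0, 0.0]
M_t = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
P = [[1.0, 0.0], [0.0, 1.0]]
p = [0.0, 0.0]
U = [[0.1, 0.2, 0.3], [0.0, 0.1, 0.0], [0.5, 0.0, 0.1], [0.2, 0.2, 0.2]]
alpha, a_t, e_t = attention_energy(w_t, M_t, P, p, U)
```

Because the weights are normalised per word, each word receives its own context vector regardless of how many characters it contains.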

Conditional Random Field
Unlike Hidden Markov Models, CRFs do not enforce any independence assumptions among observed data and directly model the likelihood of a labeling hypothesis discriminatively. They also model adjacency compatibility between NER tags, which can capture strong constraints like an 'I-label' tag not following an 'O' tag without a 'B-label' tag in between them (see section 2.6). In our model, the CRF is parametrized as follows. For a word sequence $X = (x_1, x_2, \ldots, x_N)$, let $E$ be a matrix of dimension $(k, N)$ where $k$ is the number of unique NER tags and $N$ is the sequence length. The $t$-th column is the vector $e_t$ obtained in equation 1. Let $T$ be the square transition matrix of size $(k+2, k+2)$ which captures transition scores between the $k$ NER tags, the start symbol and the end symbol. Let $Y = (y_1, y_2, \ldots, y_N)$ be the label sequence associated with the input word sequence, each $y_i$ being an index into the ordered set of unique NER tags. Let $y_0$ be the start symbol and $y_{N+1}$ be the end symbol. The score of this sequence is evaluated as:

$$S(X, Y) = \sum_{t=0}^{N} T_{y_t, y_{t+1}} + \sum_{t=1}^{N} E_{y_t, t}$$

Let $\mathcal{Y}_X$ indicate the exponential space of all possible labelings of this sequence $X$. The partition function associated with this CRF is then evaluated as:

$$Z(X) = \sum_{Y' \in \mathcal{Y}_X} \exp\big(S(X, Y')\big)$$

The probability of a specific labeling $Y \in \mathcal{Y}_X$ is:

$$P(Y \mid X) = \frac{\exp\big(S(X, Y)\big)}{Z(X)}$$

The training objective is to maximize the conditional log probability of the correct labeling sequence $Y^*$:

$$\log P(Y^* \mid X) = S(X, Y^*) - \log Z(X) \qquad (2)$$

The decoding criterion for an input sequence $X$ is:

$$\hat{Y} = \arg\max_{Y \in \mathcal{Y}_X} P(Y \mid X)$$

Normally, evaluating the partition function over the exponential space of all possible labelings would be intractable. However, as described in (Lafferty et al., 2001), this can be done efficiently for linear chain CRFs using the forward-backward algorithm.
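The scoring, normalisation and decoding steps of a linear chain CRF can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the convention that the start and end symbols occupy indices `k` and `k + 1` of `T`, and the toy scores are all assumptions. A brute-force enumeration is included only as a sanity check for the forward recursion.

```python
import itertools
import math


def sequence_score(E, T, y, start, end):
    """Sum of transition scores along start -> y -> end plus emission scores."""
    path = [start] + list(y) + [end]
    trans = sum(T[a][b] for a, b in zip(path, path[1:]))
    emit = sum(E[tag][t] for t, tag in enumerate(y))
    return trans + emit


def logsumexp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(E, T, k, start, end):
    """log Z(X) by the forward recursion: O(N k^2) rather than O(k^N)."""
    n = len(E[0])
    alpha = [T[start][j] + E[j][0] for j in range(k)]
    for t in range(1, n):
        alpha = [logsumexp([alpha[i] + T[i][j] for i in range(k)]) + E[j][t]
                 for j in range(k)]
    return logsumexp([alpha[j] + T[j][end] for j in range(k)])


def brute_force_log_partition(E, T, k, start, end):
    """Enumerate every labeling; feasible only for tiny inputs."""
    n = len(E[0])
    return logsumexp([sequence_score(E, T, y, start, end)
                      for y in itertools.product(range(k), repeat=n)])


def viterbi(E, T, k, start, end):
    """Highest-scoring labeling via max-product dynamic programming."""
    n = len(E[0])
    delta = [T[start][j] + E[j][0] for j in range(k)]
    back = []
    for t in range(1, n):
        prev = [max(range(k), key=lambda i: delta[i] + T[i][j]) for j in range(k)]
        delta = [delta[prev[j]] + T[prev[j]][j] + E[j][t] for j in range(k)]
        back.append(prev)
    last = max(range(k), key=lambda j: delta[j] + T[j][end])
    path = [last]
    for prev in reversed(back):
        path.append(prev[path[-1]])
    return path[::-1]


# Toy problem (assumed values): k = 2 tags, N = 3 words.
K, START, END = 2, 2, 3
T = [[0.5, -1.0, 0.0, 0.2],
     [-0.3, 0.8, 0.0, -0.1],
     [0.1, -0.2, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]
E = [[1.0, 0.2, -0.5],
     [0.3, 0.9, 1.1]]
log_z = log_partition(E, T, K, START, END)
best = viterbi(E, T, K, START, END)
log_prob_best = sequence_score(E, T, best, START, END) - log_z
```

Hard constraints such as forbidding an 'I-label' directly after 'O' can be encoded in this setting by a very negative entry in `T`, although in the model described here the transition scores are learned.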

Word Representations
The inputs to our model are in the form of type level word representations (figure 2). Motivated by the distributional hypothesis (Harris, 1954; Firth, 1957), we use word embeddings as inputs. In the monolingual scenario, we use structured skipgram word embeddings (Ling et al., 2015a). For the transfer scenario, embeddings can optionally be trained using techniques like multi CCA described in (Ammar et al., 2016). By learning a linear transformation from a shared vector space between languages, the model may acquire some transfer capability to the target language.
We use character bi-LSTMs to handle the Out Of Vocabulary (OOV) problem as in (Ling et al., 2015b). However, just as a distributional hypothesis exists for words, prior work (Tsvetkov and Dyer, 2015; Tsvetkov et al., 2015) suggests phonological character representations capture inherent similarities between characters that are not apparent from orthogonal one-hot orthographic character representations and can serve as a language universal surrogate for character representations. This is also useful for multi-lingual named entity recognition as explained in section 2. Therefore we make use of Epitran and PanPhon as in figure 3 to obtain both 1-hot IPA character encodings and phonological feature vectors capturing similarity between IPA characters. These form the inputs to the character bi-LSTMs. The mapping from orthographic character segments to IPA is sometimes many to one (ambiguous), and unarticulated characters (like periods in NE abbreviations) and capitalization information are lost by default. Given the importance of such orthographic features for NER and the ambiguity introduced, a drop in performance may be expected when using phonological character representations.

Training
Our model parametrization is similar to . The weights to be trained include the CRF transition matrix T, the projection parameters used to generate matrix E (P and U), the LSTM parameters and the word and character embedding matrices. The objective is to maximize the log probability of the correct labeling sequence as given in equation 2. This objective is fully differentiable and standard back-propagation is used to learn weights. We use Stochastic Gradient Descent with a learning rate of 0.05 and gradients clipped at 5.0, which provided the best performance. Using Adadelta or Adam leads to faster convergence but slightly worse performance.
Word level LSTMs use a hidden layer of size 100, orthographic character LSTMs (if used) use a hidden layer of size 25, and phonological character LSTMs use a hidden layer of size 15 due to the smaller phonetic alphabet. Tuning these did not have a major effect on performance. Dropout of 0.5 is applied after concatenation of the word embeddings and character LSTM outputs. The best dropout value was chosen through ablation studies. For Spanish, we use word embeddings pre-trained on the Spanish Gigaword version 3. For transfer experiments, we use multilingual word embeddings trained using multi CCA described in (Ammar et al., 2016).
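For reference, the hyper-parameters reported above can be collected in a single configuration object. The sketch below only restates values from the text; the class name `TrainConfig`, its field names and the helper `char_hidden_size` are hypothetical and do not correspond to any released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyper-parameters as reported in the text; field names are assumptions."""
    optimizer: str = "sgd"
    learning_rate: float = 0.05
    gradient_clip: float = 5.0
    word_lstm_hidden: int = 100
    ortho_char_lstm_hidden: int = 25
    phono_char_lstm_hidden: int = 15
    dropout: float = 0.5


def char_hidden_size(config, phonological):
    """Pick the character LSTM size for the chosen character representation."""
    if phonological:
        return config.phono_char_lstm_hidden
    return config.ortho_char_lstm_hidden


config = TrainConfig()
phono_size = char_hidden_size(config, phonological=True)
```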

Entity Types and Tagging Schemes
In all of the datasets in

Organizations (ORG)
Names of entities with organization and managerial structure. E.g. Democratic Party, Google, JetBlue, etc.
A BIO tagging scheme is used for all annotations. All non-entity tokens are annotated as 'O'. The first token of an entity span is annotated as 'B-label' and all remaining tokens in the entity span are annotated as 'I-label'.
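The BIO scheme can be illustrated with a short sketch. The helper names `spans_to_bio` and `is_valid_bio` and the span format (start index, exclusive end index, label) are assumptions made for this example.

```python
def spans_to_bio(tokens, spans):
    """Convert entity spans to BIO tags.

    `spans` is a list of (start, end_exclusive, label) token-index triples.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags


def is_valid_bio(tags):
    """Check that every I-label continues a B-label or I-label of the same type."""
    prev = "O"
    for tag in tags:
        if tag.startswith("I-"):
            label = tag[2:]
            if prev not in ("B-" + label, "I-" + label):
                return False
        prev = tag
    return True


tokens = ["I", "work", "at", "the", "New", "York", "Times"]
tags = spans_to_bio(tokens, [(4, 7, "ORG")])
```

The validity check corresponds to the adjacency constraint that a CRF transition matrix can capture: an 'I-label' tag should not directly follow an 'O' tag.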
This demonstrates the monolingual competence of phonological character representations vs. orthographic representations.
2. Turkish NER using the Linguistic Data Consortium's LDC2014E115 BOLT Turkish Language Pack. This demonstrates the utility of phonological character representations and attention in a morphologically rich, low resource language. We compare against a state-of-the-art monolingual model  that uses orthographic character LSTMs.
3. Transfer experiments from Uzbek to Turkish using the LDC2014E112 BOLT data pack for Uzbek and the LDC2014E115 BOLT data pack for Turkish. These two languages have similar syntax and word order (Dryer, 2013)

Results
Tables 2 and 3 report results from monolingual experiments in Spanish. In table 3, we report the performance of our best model against other state-of-the-art models for the Spanish CoNLL 2002 NER task (Sang, 2002). Our model performs marginally better than other benchmarks with the optimal configuration of hyper-parameters and using pre-trained word embeddings. Table 2 reports ablation study results, which reveal that using pre-trained word embeddings without using character LSTMs yields a very strong baseline that already outperforms various previous benchmarks. Using character LSTMs that compose orthographic character representations yields a +0.91 improvement in F1 score and a further +0.12 F1 with attention. Using phonological character representations instead yields an improvement of +0.47 F1 and a further +0.8 F1 with attention. Thus, phonological representations benefit more from attention applied over them, enough to beat orthographic representations in that scenario. Using sparse features indicating the character category (alphabet vs. number vs. punctuation/non-phonetic) and capitalization in conjunction with phonological character representations and word embeddings, with attention over phonological characters, yields the best configuration, which slightly outperforms the best published models so far. Given that many previous benchmarks used features that rely heavily on orthography, this is an encouraging result since one would expect to lose some performance by using more abstract phonological representations as explained in section 2.4.
Tables 4 and 5 highlight results from monolingual experiments on Turkish. This dataset is much more challenging since the annotated training corpus is significantly smaller than the CoNLL 2002 shared task dataset and because Turkish is an agglutinative language exhibiting sparsity-inducing morphology, which leads to a huge vocabulary size. As a competitive baseline, we train the LSTM CRF described in  due to its documented ability to obtain state-of-the-art monolingual results for many languages with minimal feature engineering. Our best model from the Turkish ablation study outperforms this baseline. We also see a stark contrast between the ablation study results for Turkish compared to Spanish. Firstly, word embeddings alone perform rather poorly due to the challenges of reliably estimating them for a large vocabulary given a small dataset. Character composed representations of words provide a significant performance boost (+17.27 F1 for the best model). Secondly, usage of sparse character features (like capitalization) seems to hurt performance in all but the last model in table 4. Thirdly, phonological and orthographic character representations are complementary in the case of Turkish, unlike Spanish. This is not too surprising since Turkish exhibits phonological phenomena like vowel harmony. Lack of vowel harmony in a word could give away a foreign word or a named entity, for example. Results show that attention is helpful as well. We would also like to point out that the only models in the ablation studies eligible for transfer are those that do not use orthographic character representations. Among these, the model that uses phonological representations with attention and word vectors performs the best and also outperforms the baseline system.
Our next experiments on model transfer are arguably the most interesting. Tables 7 and 8 document single source (Uzbek to Turkish) transfer results. We find that using sparse character category and capitalization features in conjunction with attention and phonological features yields the best 0-shot transfer performance (no training labels in the target language). Specifically, attention provides +6 F1 in 0-shot and 5% labeled target language data scenarios. It is interesting to note that using multilingual word embeddings for the source and target languages alone yields very poor transfer. We also find that with as little as 20% of the data, we approach the performance of a monolingual target model that was trained on all the data. We also notice that the transfer models retain a consistent advantage over a monolingually trained target language model across all data availability scenarios. Lastly, we note that while attention provides a significant improvement in 0-shot and 5% data availability scenarios, a model without attention (or sparse features like capitalization) eventually does better with more data. This indicates that the model is competent enough to leverage transfer via phonology alone. This could also possibly be because attention patterns from Uzbek could bring in a bias that is eventually sub-optimal for Turkish due to different morphology and phonology. In future work, we shall perform more insightful error analysis to explain these trends. Table 6 documents NIST evaluation results on an unseen Uyghur test set (with gold annotations) for the best transfer model configuration jointly trained on Turkish and Uzbek gold annotations and Uyghur training annotations produced by a non-speaker linguist (non-gold).
Since Uyghur lacks helpful type-level orthographic features such as capitalization, our transfer model in table 6 does not use any sparse features or attention but benefits from transfer via the phonological character representations we have proposed. Despite the noisy supervision provided in the target language, transferring from Turkish and Uzbek provides a +14.1 F1 improvement over a state of the art monolingual model trained on the same Uyghur annotations. It is worth pointing out that this transfer was achieved across 3 languages, each with different scripts, morphology, phonology and lexicons.

Prior Work
NER is a well-studied sequence-labeling problem for which many different approaches have been proposed. Early works had a monolingual focus and relied heavily on feature engineering. Approaches include maximum entropy models (Chieu and Ng, 2003), hierarchically smoothed tries (Cucerzan and Yarowsky, 1999), decision trees (Carreras et al., 2002) and models incorporating syntactic, semantic and world knowledge (Wakao et al., 1996). Each of these models brings in a bias of its own. Florian et al. (2003) successfully tried ensembling multiple classifiers and improved performance.
Since NER is a sequence labeling problem, there are local dependencies both among NE labels associated with words and among the words themselves that could aid the labeling process. To explicitly deal with these sequential characteristics, Hidden Markov Models (HMMs) and Conditional Random Fields (CRFs) became very popular (Klein et al., 2003; Florian et al., 2003; McCallum and Li, 2003; Ratinov and Roth, 2009; Chandra et al., 1981; Lin and Wu, 2009; Yang et al., 2016; Ma and Hovy, 2016). CRFs eventually became more popular because they are discriminative models that directly model the required posterior probability of a labeling sequence using parametrized functions of features. They do not model the probability of the observed sentence itself, avoid Markovian independence assumptions made by HMMs and avoid the label bias problem.
Most of the work cited so far makes use of hand engineered features. The following approaches minimize the use of features while still maintaining a monolingual focus. Collobert et al. (2011), Turian et al. (2010), and Ando and Zhang (2005) Huang et al. (2015) instead use bi-directional LSTMs over the sequence of words, along with spelling and orthographic features.
The most recent work eliminates feature engineering altogether and combines CRFs with LSTMs which can model long sequences while remembering appropriate past context.  proposed an architecture that uses both character and word level LSTMs to produce score functions for CRF inference conditioned on global context. Ma and Hovy (2016) replace the character LSTMs of  with a CNN instead. Yang et al. (2016) follow a very similar architecture to , replacing the LSTMs with Gated Recurrent Units . However, Yang et al. (2016) also tackle multi task and multi-lingual joint training scenarios.
Most of the models cited so far are monolingual either because they use hand crafted features and language specific resources or because of deep-seated assumptions. For example, a change in orthography, lexicon, script or word order, or the addition of complex morphology, makes transfer impossible. This is the central challenge that we tackle. There has been much less work catering to this scenario. Kim et al. (2012) use weak annotations from Wikipedia metadata and parallel data for multilingual NER. Yang et al. (2016) address the use case of multilingual joint training, which assumes there is sufficient data available in all languages. Nothman et al. (2013) also operate under the assumption of availability of Wikipedia data.
To the best of our knowledge, a scenario involving transfer of a model trained in one (or more) source language(s) to another language with little or no labeled data, different script, different morphology, different lexicon, lack of transliteration, non-mutual intelligibility etc. has not been addressed recently.

Conclusion
In this paper, we presented two improvements over a state-of-the-art monolingual model to address Named Entity Recognition in transfer settings. The first seeks to reconcile various dimensions of variability between languages such as varying script, orthographic conventions, phonological phenomena etc. by representing words as sequences of IPA (International Phonetic Alphabet) segments consistent across all languages, rather than sequences of characters specific to a particular language. Secondly, we exploit the one-to-one mapping between input sequence words and output labels and advocate for the use of attention over the character/IPA sequence of a word when predicting its label. We show empirically that these two improvements 1) achieve at least state-of-the-art performance on a monolingual NER task in Spanish, 2) handle complex morphology in languages such as Turkish, Uzbek and Uyghur better than state of the art, 3) provide 0-shot performance in a transfer scenario to a related new language, well above that possible using multilingual word embeddings alone, and 4) are capable of generalizing to a new language with much less training data than a monolingually trained model. Moreover, all of this is achieved without any extra feature engineering specific to the task or language, apart from mapping characters in that language to IPA. We believe these results to be encouraging and look forward to replicating these results on more language pairs in different language families and further automating the process of obtaining phonological character representations.

Acknowledgement
This work is sponsored by Defense Advanced Research Projects Agency Information Innovation Office (I2O). Program: Low Resource Languages for Emergent Incidents (LORELEI). Issued by DARPA/I2O under Contract No. HR0011-15-C-0114. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation here on.