Neural Cross-Lingual Coreference Resolution And Its Application To Entity Linking

We propose an entity-centric neural cross-lingual coreference model that builds on multi-lingual embeddings and language-independent features. We perform both intrinsic and extrinsic evaluations of our model. In the intrinsic evaluation, we show that our model, when trained on English and tested on Chinese and Spanish, achieves results competitive with models trained directly on Chinese and Spanish respectively. In the extrinsic evaluation, we show that our English model helps achieve higher entity linking accuracy on Chinese and Spanish test sets than the top 2015 TAC system, without using any annotated data from Chinese or Spanish.


Introduction
Cross-lingual models for NLP tasks are important since they can be used on data from a new language without requiring annotation from the new language (Ji et al., 2014, 2015). This paper investigates the use of multi-lingual embeddings (Faruqui and Dyer, 2014; Upadhyay et al., 2016) for building cross-lingual models for the task of coreference resolution (Ng and Cardie, 2002; Pradhan et al., 2012). Consider the following text from a Spanish news article: "Tormenta de nieve afecta a 100 millones de personas en EEUU. Unos 100 millones de personas enfrentaban el sábado nuevas dificultades tras la enorme tormenta de nieve de hace días en la costa este de Estados Unidos." The mentions "EEUU" ("US" in English) and "Estados Unidos" ("United States" in English) are coreferent. A coreference model trained on English data is unlikely to link these two mentions in Spanish, since they did not appear in the English data and a regular English-style abbreviation of "Estados Unidos" would be "EU" instead of "EEUU". But in the bilingual English-Spanish word embedding space, the word embedding of "EEUU" sits close to the word embedding of "US", and the sum of the word embeddings of "Estados Unidos" sits close to the sum of the word embeddings of "United States". Therefore, a coreference model trained on English data using English-Spanish bilingual word embeddings has the potential to make the correct coreference decision between "EEUU" and "Estados Unidos" without ever encountering these mentions in the training data.
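The intuition in the example above can be sketched with a few lines of Python. This is only an illustration: the three-dimensional vectors in `EMB` are invented stand-ins for a shared English-Spanish embedding space, and the helper names `phrase_vector` and `cosine` are hypothetical, not part of the model described in this paper.

```python
import math

# Toy vectors standing in for a shared English-Spanish embedding space.
# The numbers are invented for illustration only.
EMB = {
    "US": [0.9, 0.1, 0.0],
    "EEUU": [0.85, 0.15, 0.05],
    "United": [0.5, 0.4, 0.1],
    "States": [0.4, -0.3, 0.0],
    "Estados": [0.45, -0.25, 0.05],
    "Unidos": [0.5, 0.35, 0.05],
    "nieve": [-0.2, 0.1, 0.9],
}


def phrase_vector(words):
    """Sum the word vectors of a (possibly multi-word) mention."""
    dim = len(next(iter(EMB.values())))
    return [sum(EMB[w][k] for w in words) for k in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Abbreviation vs. abbreviation, full name vs. full name, and an unrelated word.
sim_abbrev = cosine(phrase_vector(["EEUU"]), phrase_vector(["US"]))
sim_full = cosine(phrase_vector(["Estados", "Unidos"]),
                  phrase_vector(["United", "States"]))
sim_unrelated = cosine(phrase_vector(["EEUU"]), phrase_vector(["nieve"]))
```

With real bilingual embeddings the same comparison would be made on pre-trained vectors; the toy values here are chosen only so that the related pairs score higher than the unrelated one.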
The contributions of this paper are two-fold. Firstly, we propose an entity-centric neural cross-lingual coreference model. This model, when trained on English and tested on Chinese and Spanish from the TAC 2015 Trilingual Entity Discovery and Linking (EDL) Task (Ji et al., 2015), achieves results competitive with models trained directly on Chinese and Spanish respectively. Secondly, a pipeline consisting of this coreference model and an Entity Linking (henceforth EL) model achieves higher linking accuracy than the official top-ranking system in 2015 on the Chinese and Spanish test sets, without using any supervision in Chinese or Spanish.
Although most active coreference research addresses noun phrase coreference resolution on the Ontonotes data set, invigorated by the 2011 and 2012 CoNLL shared tasks (Pradhan et al., 2011, 2012), there are many important applications and end tasks where the mentions of interest are not noun phrases. Consider the sentence, "(U.S. president Barack Obama who started ((his) political career) in (Illinois)), was born in (Hawaii)." The bracketing represents the Ontonotes style noun phrases and underlines represent the phrases that should be linked to Wikipedia by an EL system. Note that mentions like "U.S." and "Barack Obama" do not align with any noun phrase. Therefore, in this work, we focus on coreference between the mentions that arise in our end task of entity linking and conduct experiments on the TAC TriLingual 2015 data sets, which consist of English, Chinese and Spanish.

Coreference Model
Each mention has a mention type (m type) of either name or nominal and an entity type (e type) of person (PER), location (LOC), geo-political entity (GPE), facility (FAC) or organization (ORG), following the standard TAC notation (Ji et al., 2015).
The objective of our model is to compute a function that decides whether two partially constructed entities should be coreferenced. We gradually merge the mentions in a given document to form entities. Mentions are considered in the order of names and then nominals and, within each group, in the order they appear in the document. Suppose the sorted order of mentions is $m_1, \ldots, m_{N_1}, m_{N_1+1}, \ldots, m_{N_1+N_2}$, where $N_1$ and $N_2$ are the numbers of named and nominal mentions respectively. A singleton entity is created from each mention. Let the order of entities be $e_1, \ldots, e_{N_1}, e_{N_1+1}, \ldots, e_{N_1+N_2}$. We merge the named entities with other named entities, then nominal entities with named entities in the same sentence, and finally nominal entities across sentences, as follows:
Step 1: For each named entity $e_i$ ($1 \le i \le N_1$), the antecedents are all entities $e_j$ ($1 \le j \le i-1$) such that $e_j$ and $e_i$ have the same e type. Training examples are triplets of the form $(e_i, e_j, y_{ij})$. If $e_i$ and $e_j$ are coreferent (that is, $y_{ij} = 1$), they are merged.
Step 2: For each nominal entity $e_i$ ($N_1+1 \le i \le N_1+N_2$), we consider antecedents $e_j$ such that $e_i$ and $e_j$ have the same e type and $e_j$ has some mention that appears in the same sentence as some mention in $e_i$. Training examples are generated and entities are merged as in the previous step.
Step 3: This is similar to the previous step, except that $e_i$ and $e_j$ have no sentence restriction.
Features: For each training triplet $(e_1, e_2, y_{12})$, the network takes the entity pair $(e_1, e_2)$ as input and tries to predict $y_{12}$ as output. Since each entity represents a set of mentions, the entity-pair embedding is obtained from the embeddings of the mention pairs generated from the cross product of the entity pair. Let $M(e_1, e_2)$ be the set of mention pairs $\{(m_i, m_j) \mid m_i \in e_1, m_j \in e_2\}$, and let $\phi_{m_i,m_j}$ be the feature vector of the pair $(m_i, m_j)$. Then, every feature in $\phi_{m_i,m_j}$ is embedded as a vector in the real space. Let $v_{m_i,m_j}$ denote the concatenation of the embeddings of all features in $\phi_{m_i,m_j}$. Embeddings of all features except the words are learned in the training process. Word embeddings are pre-trained. $v_{m_i,m_j}$ includes the following language-independent features:
String match: whether $m_i$ is a substring or exact match of $m_j$ or vice versa (e.g. $m_i$ = "Barack Obama" and $m_j$ = "Obama")
Distance: word distance and sentence distance between $m_i$ and $m_j$, discretized into bins
m type: concatenation of the m types of $m_i$ and $m_j$
e type: concatenation of the e types of $m_i$ and $m_j$
Acronym: whether $m_i$ is an acronym of $m_j$ or vice versa (e.g. $m_i$ = "United States" and $m_j$ = "US")
First name mismatch: whether $m_i$ and $m_j$ both have e type PER with the same last name but different first names (e.g. $m_i$ = "Barack Obama" and $m_j$ = "Michelle Obama")
Speaker detection: whether $m_i$ and $m_j$ both occur in the context of words indicating speech, e.g. "say", "said"
In addition, $v_{m_i,m_j}$ includes the average of the word embeddings of $m_i$ and the average of the word embeddings of $m_j$.
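Several of the language-independent features listed above are simple string or distance tests. The following is a minimal sketch of how they might be computed, not the feature extractor used in our experiments: the function names, the tokenization by whitespace, and the bin edges in `distance_bin` are all assumptions made for illustration.

```python
def string_match(mi, mj):
    """True if either mention string contains the other (exact match included)."""
    a, b = mi.lower(), mj.lower()
    return a in b or b in a


def is_acronym(mi, mj):
    """True if one mention equals the initials of the other multi-word mention."""
    def initials(s):
        return "".join(w[0] for w in s.split()).upper()

    def norm(s):
        # Drop periods so that "U.S." and "US" compare equal.
        return s.replace(".", "").upper()

    return ((len(mj.split()) > 1 and norm(mi) == initials(mj)) or
            (len(mi.split()) > 1 and norm(mj) == initials(mi)))


def first_name_mismatch(mi, mj, etype_i, etype_j):
    """True for two PER mentions sharing a last name but not a first name."""
    ti, tj = mi.split(), mj.split()
    if etype_i != "PER" or etype_j != "PER":
        return False
    if len(ti) < 2 or len(tj) < 2:
        return False
    return ti[-1].lower() == tj[-1].lower() and ti[0].lower() != tj[0].lower()


def distance_bin(d, edges=(0, 1, 2, 4, 8, 16, 32)):
    """Map a non-negative distance to a bin index; the edges are hypothetical."""
    return sum(1 for e in edges if d >= e) - 1
```

Each returned value would then be looked up in its own embedding table to form part of $v_{m_i,m_j}$; that lookup is omitted here.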

Network Architecture
The network architecture from the input to the output is shown in Figure 1.
Embedding Layer: For each training triplet $(e_1, e_2, y)$, a sequence of vectors $v_{m_i,m_j}$ (one for each $(m_i, m_j) \in M(e_1, e_2)$) is given as input to the network.
To generate the entity-pair embedding, we need to combine the embeddings of the mention pairs generated from the entity pair. Consider two entities $e_1 = \{\text{President}^1, \text{Obama}\}$ and $e_2 = \{\text{President}^2, \text{Clinton}\}$. Here the superscripts are used to indicate two different mentions with the same surface form. Since the named mention pair (Obama, Clinton) has no string overlap, $e_1$ and $e_2$ should not be coreferenced even though the nominal mention pair ($\text{President}^1$, $\text{President}^2$) has full string overlap. So, while combining the embeddings of the mention pairs, mention pairs with m type (name, name) should get a higher weight than mention pairs with m type (nominal, nominal). The entity-pair embedding is the weighted sum of the mention-pair embeddings. We introduce four parameters $a_{name,name}$, $a_{name,nominal}$, $a_{nominal,nominal}$ and $a_{nominal,name}$ as weights for mention-pair embeddings with m types of (name, name), (name, nominal), (nominal, nominal) and (nominal, name) respectively. Each vector $v_{m_i,m_j}$ with $(m_i, m_j) \in M(e_1, e_2)$ thus enters the sum with the weight $a$ that matches the m types of $m_i$ and $m_j$.
The training objective is to maximize $L$:
$$L = \prod_{d \in D} \prod_{(e_1, e_2, y_{12}) \in S_d} P(y_{12} \mid e_1, e_2; W^{(1)}, W^{(2)}, a, w_s) \quad (1)$$
Here $D$ is the corpus and $S_d$ is the set of training triplets generated from document $d$.
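The weighting of mention pairs and the training objective can be illustrated with a small sketch. It is written under explicit assumptions: the weight values in `A` are invented (in the model they are learned), the sum is left un-normalized, the two-dimensional mention-pair vectors are toy inputs, and the layers between the entity-pair embedding and the output probability are not modelled; `log_likelihood` simply takes the predicted probabilities as given.

```python
import math

# Hypothetical values for the four weights a; in the model these are learned.
A = {
    ("name", "name"): 1.0,
    ("name", "nominal"): 0.5,
    ("nominal", "name"): 0.5,
    ("nominal", "nominal"): 0.2,
}


def entity_pair_embedding(mention_pairs, weights=A):
    """Weighted sum of mention-pair vectors.

    mention_pairs: list of (m_type_i, m_type_j, vector) for every pair in
    M(e1, e2). The sum is un-normalized, which is an assumption of this sketch.
    """
    dim = len(mention_pairs[0][2])
    out = [0.0] * dim
    for type_i, type_j, vec in mention_pairs:
        w = weights[(type_i, type_j)]
        for k in range(dim):
            out[k] += w * vec[k]
    return out


def log_likelihood(examples):
    """Log of the product of P(y | e1, e2) over (p_coref, y) examples,
    where p_coref is the predicted probability that y = 1."""
    return sum(math.log(p if y == 1 else 1.0 - p) for p, y in examples)


# e1 = {President^1, Obama}, e2 = {President^2, Clinton}.
# Toy vectors: dimension 0 = "strings match", dimension 1 = "strings differ".
pairs = [
    ("nominal", "nominal", [1.0, 0.0]),  # (President^1, President^2)
    ("nominal", "name", [0.0, 1.0]),     # (President^1, Clinton)
    ("name", "nominal", [0.0, 1.0]),     # (Obama, President^2)
    ("name", "name", [0.0, 1.0]),        # (Obama, Clinton)
]
embedding = entity_pair_embedding(pairs)
ll = log_likelihood([(0.9, 1), (0.2, 0)])
```

With these toy weights the mismatch evidence from the name pair dominates the match evidence from the nominal pair, which is the behaviour the weighting is meant to allow. Maximizing the product in Equation (1) is equivalent to maximizing the sum of logs computed by `log_likelihood`.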
Decoding proceeds similarly to the training algorithm, except that at each of the three steps, for each entity $e_i$, the highest-scoring antecedent $e_j$ is selected and, if the score is above a threshold, $e_i$ and $e_j$ are merged.
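The three-step merging procedure and the decoding rule can be sketched as follows. This is a simplified illustration rather than our implementation: the dictionary layout of a mention, the threshold of 0.5, and the scorer `toy_score` (which replaces the neural network) are assumptions, as is the choice to restrict antecedents in Steps 2 and 3 to entities earlier in the sorted order.

```python
def order_mentions(mentions):
    """Names first, then nominals; document order within each group."""
    return sorted(mentions, key=lambda m: (m["m_type"] != "name", m["position"]))


def same_sentence(ei, ej):
    """True if some mention of ei shares a sentence with some mention of ej."""
    return bool({m["sent"] for m in ei} & {m["sent"] for m in ej})


def decode(mentions, score, threshold=0.5):
    ordered = order_mentions(mentions)
    n_named = sum(m["m_type"] == "name" for m in ordered)
    entities = [[m] for m in ordered]  # slot i holds e_i, or None once merged

    def run_step(indices, allowed):
        for i in indices:
            ei = entities[i]
            if ei is None:
                continue
            best_j, best_score = None, threshold
            for j in range(i):  # antecedents precede e_i in the sorted order
                ej = entities[j]
                if ej is None or ej[0]["e_type"] != ei[0]["e_type"]:
                    continue
                if not allowed(ei, ej):
                    continue
                s = score(ei, ej)
                if s > best_score:
                    best_j, best_score = j, s
            if best_j is not None:  # merge e_i into its best antecedent
                entities[best_j].extend(ei)
                entities[i] = None

    named = range(n_named)
    nominal = range(n_named, len(ordered))
    run_step(named, lambda ei, ej: True)    # Step 1: names with names
    run_step(nominal, same_sentence)        # Step 2: nominals, same sentence
    run_step(nominal, lambda ei, ej: True)  # Step 3: nominals, any sentence
    return [e for e in entities if e is not None]


def toy_score(ei, ej):
    """Hypothetical stand-in for the network: string overlap, then proximity."""
    texts_i = [m["text"].lower() for m in ei]
    texts_j = [m["text"].lower() for m in ej]
    if any(a in b or b in a for a in texts_i for b in texts_j):
        return 0.9
    return 0.6 if same_sentence(ei, ej) else 0.1


MENTIONS = [
    {"text": "Barack Obama", "m_type": "name", "e_type": "PER", "sent": 0, "position": 0},
    {"text": "the president", "m_type": "nominal", "e_type": "PER", "sent": 0, "position": 1},
    {"text": "Obama", "m_type": "name", "e_type": "PER", "sent": 1, "position": 2},
    {"text": "Hawaii", "m_type": "name", "e_type": "GPE", "sent": 1, "position": 3},
]
clusters = decode(MENTIONS, toy_score)
```

On this toy document the two name mentions of Obama are merged in Step 1, the nominal "the president" joins them in Step 2 because it shares a sentence with "Barack Obama", and "Hawaii" stays a singleton because no other entity has its e type.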

A Zero-shot Entity Linking model
We use our recently proposed cross-lingual EL model, described in Sil et al. (2018) (henceforth, SIL18), where our target is to perform "zero-shot learning" (Socher et al., 2013; Palatucci et al., 2009). We train an EL model on English and use it to decode on any other language, provided that we have access to multi-lingual embeddings from English and the target language. We briefly describe our techniques here and direct interested readers to the paper. The EL model computes several similarity/coherence scores S in a "feature abstraction layer", which measures the similarity between the context of the mention m in the query document and the context of the candidate link's Wikipedia page. These scores are fed to a feed-forward neural layer that acts as a binary classifier to predict the correct link for m. Specifically, the feature abstraction layer computes cosine similarities (Sil and Florian, 2016) between the representations of the source query document and the target Wikipedia pages over various granularities. These representations are computed by running CNNs and LSTMs over the context of the entities. The similarities are then fed into a Multi-perspective Binning layer, which maps each similarity into a higher-dimensional vector. We also train fine-grained similarities and dissimilarities between the query and candidate document from multiple perspectives, combined with convolution and tensor networks.
The model achieves state-of-the-art (SOTA) results on English benchmark EL datasets and also performs surprisingly well on Spanish and Chinese. However, although the EL model is "zero-shot", the within-document coreference resolution in the system is a language-dependent SOTA coreference system that has won multiple TAC-KBP (Ji et al., 2015; Sil et al., 2015) evaluations but is trained on the target language. Hence, our aim is to apply our proposed coreference model to the EL system to perform an extrinsic evaluation of our proposed algorithm.

Experiments
We evaluate cross-lingual transfer of coreference models on the TAC 2015 Tri-Lingual EL datasets. They contain mentions annotated with their grounded Freebase links (if such links exist) or corpus-wide clustering information for three languages: English (henceforth, En), Chinese (henceforth, Zh) and Spanish (henceforth, Es). Table 1 shows the size of the training and test sets for the three languages. The documents come from two genres: newswire and discussion forums. The mentions in this dataset are either named entities or nominals that belong to five types: PER, ORG, GPE, LOC and FAC.
Hyperparameters: Every feature is embedded in a 50-dimensional space except the words, which reside in a 300-dimensional space. The ReLU and Sigmoid layers have 100 and 500 neurons respectively. We use SGD for optimization with an initial learning rate of 0.05, which is linearly reduced to 0.0001. Our mini-batch size is 32; we train for 50 epochs and keep the best model based on the development set.
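As an illustration of the learning-rate schedule, the following sketch interpolates linearly from 0.05 to 0.0001 over the 50 epochs. Updating the rate once per epoch (rather than per mini-batch) is an assumption of this sketch, and `linear_lr` is a hypothetical helper name.

```python
def linear_lr(epoch, num_epochs=50, lr_start=0.05, lr_end=0.0001):
    """Learning rate for a 0-based epoch index under linear decay.

    Assumes one update of the rate per epoch, reaching lr_end in the last epoch.
    """
    if num_epochs <= 1:
        return lr_start
    frac = epoch / (num_epochs - 1)
    return lr_start + frac * (lr_end - lr_start)


schedule = [linear_lr(e) for e in range(50)]
```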
Coreference Results: For each language, we follow the official train-test splits made in the TAC 2015 competition, except that a small portion of the training set is held out as a development set for tuning the models. All experimental results on all languages reported in this paper were obtained on the official test sets. We used the official CoNLL 2012 evaluation script and report MUC, $B^3$ and CEAF scores and their average (CoNLL score); see Pradhan et al. (2011, 2012). To test the competitiveness of our model with other SOTA models, we train the publicly available system of Clark and Manning (2016) (henceforth, C&M16) on the TAC 15 En training set and test on the TAC 15 En test set. The C&M16 system normally outputs both noun phrase mentions and their coreference and is trained on Ontonotes. To ensure a fair comparison, we changed the configuration of the system to accept gold mention boundaries during both training and testing. Since the system was unable to deal with partially overlapping mentions, we excluded such mentions from the evaluation. Table 2 shows that our model outperforms C&M16 by 8 points.
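To make the metrics concrete, the sketch below computes $B^3$ precision, recall and F1 for the simple case where the key and the response contain exactly the same mentions (as with gold mention boundaries), and averages three F1 scores into a CoNLL score. It is not the official scorer, which also handles mismatched mention sets; the function names and the example clusters are illustrative assumptions.

```python
def b_cubed(key, response):
    """B^3 precision, recall and F1.

    key, response: lists of clusters (sets of mention ids) that partition
    the same set of mentions.
    """
    key_of = {m: cluster for cluster in key for m in cluster}
    resp_of = {m: cluster for cluster in response for m in cluster}
    mentions = list(key_of)
    precision = sum(len(key_of[m] & resp_of[m]) / len(resp_of[m])
                    for m in mentions) / len(mentions)
    recall = sum(len(key_of[m] & resp_of[m]) / len(key_of[m])
                 for m in mentions) / len(mentions)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def conll_score(muc_f1, b3_f1, ceaf_f1):
    """Unweighted average of the three F1 scores."""
    return (muc_f1 + b3_f1 + ceaf_f1) / 3.0


key = [{"a", "b", "c"}, {"d"}]
response = [{"a", "b"}, {"c", "d"}]
p, r, f = b_cubed(key, response)
```

In this example the response wrongly attaches mention `c` to `d`, which lowers precision for both `c` and `d` and lowers recall for `a`, `b` and `c`.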
For cross-lingual experiments, we build monolingual embeddings for En, Zh and Es using the widely used CBOW word2vec model (Mikolov et al., 2013a). Recently, Canonical Correlation Analysis (CCA) (Faruqui and Dyer, 2014), Multi-CCA (Ammar et al., 2016) and Weighted Regression (Mikolov et al., 2013b) have been proposed for building the multi-lingual embedding space from monolingual embeddings. In our preliminary experiments, the technique of Mikolov et al. (2013b) performed the best, so we used it to project the embeddings of Zh and Es onto En. In Table 3, "En Model" refers to the model that was trained on the En training set of TAC 15 using multi-lingual embeddings and tested on the Es and Zh test sets of TAC 15. "Es Model" refers to the model trained on the Es training set of TAC 15 using Es embeddings. "Zh Model" refers to the model trained on the Zh training set of TAC 15 using Zh embeddings. The En model performs 0.5 points below the Es model on the Es test set. On the Zh test set, the En model performs only 0.3 points below the Zh model. Hence, we show that without using any target language training data, the En model with multi-lingual embeddings gives results comparable to models trained on the target language.
EL Results: We replace the in-document coreference system (trained on the target language) of SIL18 with our En model to investigate the performance of our proposed algorithm on an extrinsic task. Table 4 shows the EL results on the Es and Zh test sets. "EL - Coref" refers to the case where the first step of coreference is not used and EL is used to link the mentions directly to Freebase. "EL + En Coref" refers to the case where the neural English coreference model is first used on Zh or Es data, followed by the EL model. The former is 3 points below the latter on Es and 2.6 points below it on Zh, implying that coreference is a vital task for EL.
Our "EL + En Coref" outperforms the 2015 TAC best system by 0.7 points on Es and 0.8 points on Zh, without requiring any training data for coreference on Es and Zh respectively. Finally, we show the SOTA results on these two data sets recently reported by SIL18. Although their EL model does not use any supervision from Es or Zh, their coreference resolution model is trained on a large internal data set on the same language as

Related Work
Rule-based (Raghunathan et al., 2010) and statistical coreference models (Bengtson and Roth, 2008; Rahman and Ng, 2009; Fernandes et al., 2012; Durrett et al., 2013; Clark and Manning, 2015; Martschat and Strube, 2015; Björkelund and Kuhn, 2014) are hard to transfer across languages due to their use of lexical features or patterns in the rules. Neural coreference is promising since it allows cross-lingual transfer using multi-lingual embeddings. However, most of the recent neural coreference models (Wiseman et al., 2015, 2016; Clark and Manning, 2015, 2016; Lee et al., 2017) have focused on training and testing on the same language. In contrast, our model performs cross-lingual coreference. There have been some recent promising results regarding such cross-lingual models for other tasks, most notably mention detection (Ni et al., 2017) and EL (Tsai and Roth, 2016; Sil and Florian, 2016). In this work, we show that such promise exists for coreference as well. The tasks of EL and coreference are intrinsically related, prompting joint models (Durrett and Klein, 2014; Hajishirzi et al., 2013). However, the recent SOTA was obtained using pipeline models of coreference and EL (Sil et al., 2018). Compared to a joint model, pipeline models are easier to implement, improve and adapt to a new domain.

Conclusion
The proposed cross-lingual coreference model was found to be empirically strong in both intrinsic and extrinsic evaluations in the context of an entity linking task.