Neural Cross-Lingual Relation Extraction Based on Bilingual Word Embedding Mapping

Relation extraction (RE) seeks to detect and classify semantic relationships between entities, which provides useful information for many NLP applications. Since state-of-the-art RE models require large amounts of manually annotated data and language-specific resources to achieve high accuracy, it is very challenging to transfer an RE model of a resource-rich language to a resource-poor language. In this paper, we propose a new approach for cross-lingual RE model transfer based on bilingual word embedding mapping. It projects word embeddings from a target language to a source language, so that a well-trained source-language neural network RE model can be directly applied to the target language. Experimental results show that the proposed approach achieves very good performance for a number of target languages on both in-house and open datasets, using a small bilingual dictionary with only 1K word pairs.


Introduction
Relation extraction (RE) is an important information extraction task that seeks to detect and classify semantic relationships between entities such as persons, organizations, geo-political entities, locations, and events. It provides useful information for many NLP applications such as knowledge base construction, text mining and question answering. For example, the entity Washington, D.C. and the entity United States have a CapitalOf relationship, and extraction of such relationships can help answer questions like "What is the capital city of the United States?" Traditional RE models (e.g., Zelenko et al. (2003); Kambhatla (2004); Li and Ji (2014)) require careful feature engineering to derive and combine various lexical, syntactic and semantic features. Recently, neural network RE models (e.g., Zeng et al. (2014); dos Santos et al. (2015); Miwa and Bansal (2016); Nguyen and Grishman (2016)) have become very successful. These models employ a certain level of automatic feature learning by using word embeddings, which significantly simplifies the feature engineering task while considerably improving the accuracy, achieving state-of-the-art performance for relation extraction.
All the above RE models are supervised machine learning models that need to be trained with large amounts of manually annotated RE data to achieve high accuracy. However, annotating RE data by hand is expensive and time-consuming, and can be quite difficult for a new language. Moreover, most RE models require language-specific resources such as dependency parsers and part-of-speech (POS) taggers, which also makes it very challenging to transfer an RE model of a resource-rich language to a resource-poor language.
There are a few existing weakly supervised cross-lingual RE approaches that require no human annotation in the target languages, e.g., Kim et al. (2010); Kim and Lee (2012); Faruqui and Kumar (2015); Zou et al. (2018). However, the existing approaches require aligned parallel corpora or machine translation systems, which may not be readily available in practice.
In this paper, we make the following contributions to cross-lingual RE:
• We propose a new approach for direct cross-lingual RE model transfer based on bilingual word embedding mapping. It projects word embeddings from a target language to a source language (e.g., English), so that a well-trained source-language RE model can be directly applied to the target language, with no manually annotated RE data needed for the target language.
• We design a deep neural network architecture for the source-language (English) RE model that uses word embeddings and generic language-independent features as the input. The English RE model achieves state-of-the-art performance without using language-specific resources.
• We conduct extensive experiments which show that the proposed approach achieves very good performance (up to 79% of the accuracy of the supervised target-language RE model) for a number of target languages on both in-house and ACE05 datasets (Walker et al., 2006), using a small bilingual dictionary with only 1K word pairs. To the best of our knowledge, this is the first work that includes empirical studies for cross-lingual RE on several languages across a variety of language families, without using aligned parallel corpora or machine translation systems.
We organize the paper as follows. In Section 2 we provide an overview of our approach. In Section 3 we describe how to build monolingual word embeddings and learn a linear mapping between two languages. In Section 4 we present a neural network architecture for the source-language (English) RE model. In Section 5 we evaluate the performance of the proposed approach for a number of target languages. We discuss related work in Section 6 and conclude the paper in Section 7.

Overview of the Approach
We summarize the main steps of our neural cross-lingual RE model transfer approach as follows.
1. Build word embeddings for the source language and the target language separately using monolingual data.
2. Learn a linear mapping that projects the target-language word embeddings into the source-language embedding space using a small bilingual dictionary.
3. Build a neural network source-language RE model that uses word embeddings and generic language-independent features as the input.
4. For a target-language sentence and any two entities in it, project the word embeddings of the words in the sentence into the source-language embedding space using the linear mapping, and then apply the source-language RE model on the projected word embeddings to classify the relationship between the two entities. An example is shown in Figure 1, where the target language is Portuguese and the source language is English.
We will describe each component of our approach in the subsequent sections.
A monolingual word embedding model maps words in the vocabulary $\mathcal{V}$ of a language to real-valued vectors in $\mathbb{R}^{d \times 1}$. The dimension of the vector space $d$ is normally much smaller than the size of the vocabulary $V = |\mathcal{V}|$ for efficient representation. It also aims to capture semantic similarities between the words based on their distributional properties in large samples of monolingual data.
Cross-lingual word embedding models try to build word embeddings across multiple languages (Upadhyay et al., 2016; Ruder et al., 2017). One approach builds monolingual word embeddings separately and then maps them to the same vector space using a bilingual dictionary (Mikolov et al., 2013b; Faruqui and Dyer, 2014). Another approach builds multilingual word embeddings in a shared vector space simultaneously, by generating mixed language corpora using aligned sentences (Luong et al., 2015; Gouws et al., 2015).
In this paper, we adopt the technique of Mikolov et al. (2013b) because it only requires a small bilingual dictionary of aligned word pairs, and does not require parallel corpora of aligned sentences, which could be more difficult to obtain.

Monolingual Word Embeddings
To build monolingual word embeddings for the source and target languages, we use a variant of the Continuous Bag-of-Words (CBOW) word2vec model (Mikolov et al., 2013a).
The standard CBOW model has two matrices, the input word matrix $\tilde{X} \in \mathbb{R}^{d \times V}$ and the output word matrix $X \in \mathbb{R}^{d \times V}$. For the $i$th word $w_i$ in $\mathcal{V}$, let $e(w_i) \in \mathbb{R}^{V \times 1}$ be a one-hot vector with 1 at index $i$ and 0s at other indexes, so that $\tilde{x}_i = \tilde{X} e(w_i)$ (the $i$th column of $\tilde{X}$) is the input vector representation of word $w_i$, and $x_i = X e(w_i)$ (the $i$th column of $X$) is the output vector representation (i.e., word embedding) of word $w_i$.
Given a sequence of training words $w_1, w_2, \ldots, w_N$, the CBOW model seeks to predict a target word $w_t$ using a window of $2c$ context words surrounding $w_t$, by maximizing the following objective function:
$$\sum_{t=1}^{N} \log P(w_t \mid w_{t-c}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+c})$$
The conditional probability is calculated using a softmax function:
$$P(w_t \mid w_{t-c}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+c}) = \frac{\exp(x_t^{\top} \tilde{x}_{c(t)})}{\sum_{i=1}^{V} \exp(x_i^{\top} \tilde{x}_{c(t)})}$$
where $x_t = X e(w_t)$ is the output vector representation of word $w_t$, and
$$\tilde{x}_{c(t)} = \sum_{-c \le j \le c,\; j \ne 0} \tilde{X} e(w_{t+j})$$
is the sum of the input vector representations of the context words.
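The standard CBOW prediction step described above can be illustrated with a minimal NumPy sketch. The function name `cbow_probs`, the toy dimensions and the random matrices are assumptions made for illustration only; this is not the training code used for the embeddings.

```python
import numpy as np

def cbow_probs(X_in, X_out, context_ids):
    """Return P(w | context) for every word w in the vocabulary.

    X_in, X_out: (d, V) input and output word matrices.
    context_ids: vocabulary indices of the 2c context words.
    """
    # Sum of the input vector representations of the context words.
    x_ctx = X_in[:, context_ids].sum(axis=1)
    # One score per vocabulary word: output vector dotted with the context vector.
    scores = X_out.T @ x_ctx
    # Numerically stable softmax over the vocabulary.
    scores = scores - scores.max()
    probs = np.exp(scores)
    return probs / probs.sum()

rng = np.random.default_rng(0)
d, V = 8, 20                          # toy sizes (assumed)
X_in = rng.normal(size=(d, V))
X_out = rng.normal(size=(d, V))
# Toy context: four word indices standing in for a window with c = 2.
probs = cbow_probs(X_in, X_out, context_ids=[3, 4, 6, 7])
```

Training would adjust both matrices to raise the probability of the observed target word; only the forward computation is shown here.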
In our variant of the CBOW model, we use a separate input word matrix $\tilde{X}_j$ for a context word at position $j$, $-c \le j \le c$, $j \ne 0$. In addition, we employ weights that decay with the distances of the context words to the target word. Under these modifications, the context representation becomes a distance-weighted sum of the position-specific input vector representations of the context words. We use the variant to build monolingual word embeddings because experiments on named entity recognition and word similarity tasks showed that this variant leads to small improvements over the standard CBOW model (Ni et al., 2017).

Bilingual Word Embedding Mapping
Mikolov et al. (2013b) observed that word embeddings of different languages often have similar geometric arrangements, and suggested learning a linear mapping between the vector spaces.
Let $\mathcal{D}$ be a bilingual dictionary with aligned word pairs $(w_i, v_i)_{i=1,\ldots,D}$ between a source language $s$ and a target language $t$, where $w_i$ is a source-language word and $v_i$ is the translation of $w_i$ in the target language. Let $x_i \in \mathbb{R}^{d \times 1}$ be the word embedding of the source-language word $w_i$, and $y_i \in \mathbb{R}^{d \times 1}$ be the word embedding of the target-language word $v_i$.
We find a linear mapping (matrix) $M_{t \to s}$ such that $M_{t \to s} y_i$ approximates $x_i$, by solving the following least squares problem using the dictionary as the training set:
$$M_{t \to s} = \arg\min_{M \in \mathbb{R}^{d \times d}} \sum_{i=1}^{D} \| x_i - M y_i \|^2 \tag{4}$$
Using $M_{t \to s}$, for any target-language word $v$ with word embedding $y$, we can project it into the source-language embedding space as $M_{t \to s} y$.
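As a rough illustration of the least squares problem above, the following NumPy sketch fits a mapping on a synthetic, noise-free dictionary. The helper name `learn_mapping` and the toy data are assumptions; the dictionary embeddings are stored as columns, matching the $d \times 1$ convention of the text.

```python
import numpy as np

def learn_mapping(X_src, Y_tgt):
    """Least-squares mapping M such that M @ y_i approximates x_i.

    X_src, Y_tgt: (d, D) matrices whose i-th columns are the embeddings
    of the i-th dictionary pair (source word, target word).
    """
    # Solve min_M ||Y^T M^T - X^T||_F^2 with a standard least-squares solver.
    M_T, *_ = np.linalg.lstsq(Y_tgt.T, X_src.T, rcond=None)
    return M_T.T

rng = np.random.default_rng(0)
d, D = 5, 50                          # toy embedding size and dictionary size
M_true = rng.normal(size=(d, d))      # hidden mapping used to generate toy data
Y = rng.normal(size=(d, D))           # toy target-language embeddings
X = M_true @ Y                        # toy source-language embeddings
M = learn_mapping(X, Y)
projected = M @ Y[:, 0]               # project one target-language embedding
```

On real embeddings the fit is only approximate, so `M @ y` lands near, not exactly on, the embedding of the translation.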

Length Normalization and Orthogonal Transformation
To ensure that all the training instances in the dictionary $\mathcal{D}$ contribute equally to the optimization objective in (4) and to preserve vector norms after projection, we have tried length normalization and orthogonal transformation for learning the bilingual mapping as in (Xing et al., 2015; Artetxe et al., 2016; Smith et al., 2017). First, we normalize the source-language and target-language word embeddings to be unit vectors: $\bar{x} = x / \|x\|$ for each source-language word embedding $x$, and $\bar{y} = y / \|y\|$ for each target-language word embedding $y$.
Next, we add an orthogonality constraint to (4) such that $M$ is an orthogonal matrix, i.e., $M^{\top} M = I$ where $I$ denotes the identity matrix:
$$M^{O}_{t \to s} = \arg\min_{M:\, M^{\top} M = I} \sum_{i=1}^{D} \| \bar{x}_i - M \bar{y}_i \|^2 \tag{5}$$
$M^{O}_{t \to s}$ can be computed using singular-value decomposition (SVD).
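A minimal sketch of the SVD-based solution, assuming the standard orthogonal Procrustes result: if $\bar{X} \bar{Y}^{\top} = U \Sigma V^{\top}$, the minimizer is $U V^{\top}$. The function names and the synthetic data below are illustrative assumptions.

```python
import numpy as np

def unit_columns(E):
    """Scale every column (word embedding) to unit Euclidean length."""
    return E / np.linalg.norm(E, axis=0, keepdims=True)

def learn_orthogonal_mapping(X_src, Y_tgt):
    """Orthogonal M minimizing sum_i ||x_i - M y_i||^2 on length-normalized pairs."""
    X, Y = unit_columns(X_src), unit_columns(Y_tgt)
    # Orthogonal Procrustes: SVD of the cross-covariance matrix X Y^T.
    U, _, Vt = np.linalg.svd(X @ Y.T)
    return U @ Vt

rng = np.random.default_rng(0)
d, D = 5, 50
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))   # random orthogonal matrix (toy ground truth)
Y = rng.normal(size=(d, D))
X = Q @ Y
M = learn_orthogonal_mapping(X, Y)
```

Because an orthogonal matrix preserves norms, projected unit vectors remain unit vectors, which is the norm-preservation property mentioned above.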

Semi-Supervised and Unsupervised Mappings
The mapping learned in (4) or (5) requires a seed dictionary. To relax this requirement, Artetxe et al. (2017) proposed a self-learning procedure that can be combined with a dictionary-based mapping technique. Starting with a small seed dictionary, the procedure iteratively 1) learns a mapping using the current dictionary; and 2) computes a new dictionary using the learned mapping. Artetxe et al. (2018) proposed an unsupervised method to learn the bilingual mapping without using a seed dictionary. The method first uses a heuristic to build an initial dictionary that aligns the vocabularies of two languages, and then applies a robust self-learning procedure to iteratively improve the mapping. Another unsupervised method based on adversarial training was proposed in Conneau et al. (2018).
We compare the performance of different mappings for cross-lingual RE model transfer in Section 5.3.2.
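The two-step self-learning loop described above can be sketched in a heavily simplified form. This is an illustration only and not the vecmap implementation: the least-squares fit, the cosine nearest-neighbor dictionary induction, the fixed iteration count and all function names are assumptions.

```python
import numpy as np

def fit_mapping(X, Y):
    """Least-squares M with M @ Y close to X (columns are aligned pairs)."""
    M_T, *_ = np.linalg.lstsq(Y.T, X.T, rcond=None)
    return M_T.T

def induce_pairs(M, X_all, Y_all):
    """Pair every target word with its nearest source word (cosine similarity)."""
    P = M @ Y_all
    P = P / np.linalg.norm(P, axis=0, keepdims=True)
    S = X_all / np.linalg.norm(X_all, axis=0, keepdims=True)
    nearest_src = (S.T @ P).argmax(axis=0)     # one source index per target word
    return [(int(s), t) for t, s in enumerate(nearest_src)]

def self_learn(X_all, Y_all, seed_pairs, n_iter=3):
    """Alternate between fitting a mapping and re-inducing the dictionary."""
    pairs = list(seed_pairs)
    M = None
    for _ in range(n_iter):
        src, tgt = zip(*pairs)
        M = fit_mapping(X_all[:, list(src)], Y_all[:, list(tgt)])   # step 1
        pairs = induce_pairs(M, X_all, Y_all)                       # step 2
    return M, pairs

rng = np.random.default_rng(0)
d, V = 4, 30
X_all = rng.normal(size=(d, V))                # toy source embeddings
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
Y_all = Q.T @ X_all                            # toy target embeddings: word i translates word i
M, pairs = self_learn(X_all, Y_all, seed_pairs=[(i, i) for i in range(8)])
```

Practical implementations typically iterate until a convergence criterion is met and add safeguards against poor local optima, which this sketch omits.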

Neural Network RE Models
For any two entities in a sentence, an RE model determines whether these two entities have a relationship, and if so, classifies the relationship into one of the pre-defined relation types. We focus on neural network RE models since these models achieve state-of-the-art performance for relation extraction. Most importantly, neural network RE models use word embeddings as the input, which are amenable to cross-lingual model transfer via cross-lingual word embeddings. In this paper, we use English as the source language.
Our neural network architecture has four layers. The first layer is the embedding layer which maps input words in a sentence to word embeddings. The second layer is a context layer which transforms the word embeddings to context-aware vector representations using a recurrent or convolutional neural network layer. The third layer is a summarization layer which summarizes the vectors in a sentence by grouping and pooling. The final layer is the output layer which returns the classification label for the relation type.

Embedding Layer
For an English sentence with $n$ words $s = (w_1, w_2, \ldots, w_n)$, the embedding layer maps each word $w_t$ to a real-valued vector (word embedding) $x_t \in \mathbb{R}^{d \times 1}$ using the English word embedding model (Section 3.1). In addition, for each entity $m$ in the sentence, the embedding layer maps its entity type to a real-valued vector (entity label embedding) $l_m \in \mathbb{R}^{d_m \times 1}$ (initialized randomly). In our experiments we use $d = 300$ and $d_m = 50$.

Context Layer
Given the word embeddings $x_t$ of the words in the sentence, the context layer tries to build a sentence-context-aware vector representation for each word. We consider two types of neural network layers that aim to achieve this.

Bi-LSTM Context Layer
The first type of context layer is based on Long Short-Term Memory (LSTM) type recurrent neural networks (Hochreiter and Schmidhuber, 1997; Graves and Schmidhuber, 2005). Recurrent neural networks (RNNs) are a class of neural networks that operate on sequential data such as sequences of words. LSTM networks are a type of RNN designed to better capture long-range dependencies in sequential data.
We pass the word embeddings $x_t$ to a forward and a backward LSTM layer. A forward or backward LSTM layer consists of a set of recurrently connected blocks known as memory blocks.
The memory block at the $t$-th word in the forward LSTM layer contains a memory cell $\overrightarrow{c}_t$ and three gates (an input gate, a forget gate and an output gate), whose updates are computed with the element-wise sigmoid function $\sigma$ and the element-wise multiplication $\odot$.
The hidden state vector $\overrightarrow{h}_t$ in the forward LSTM layer incorporates information from the left (past) tokens of $w_t$ in the sentence. Similarly, we can compute the hidden state vector $\overleftarrow{h}_t$ in the backward LSTM layer, which incorporates information from the right (future) tokens of $w_t$ in the sentence. The concatenation of the two vectors, $h_t = [\overrightarrow{h}_t, \overleftarrow{h}_t]$, is a good representation of the word $w_t$ with both left and right contextual information in the sentence.
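For reference, one common formulation of the forward LSTM memory-block update (without peephole connections) is sketched below, with the direction arrows omitted. The parameter names $W_\ast$, $U_\ast$, $b_\ast$ and the choice of this particular variant are assumptions made for illustration; the model described here may use a different LSTM variant.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(memory cell)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```

The backward layer applies the same form of update while reading the sentence from right to left, with its own parameters.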

CNN Context Layer
The second type of context layer is based on Convolutional Neural Networks (CNNs) (Zeng et al., 2014; dos Santos et al., 2015), which apply a convolution-like operation on successive windows of size $k$ around each word in the sentence. Let $z_t = [x_{t-(k-1)/2}, \ldots, x_{t+(k-1)/2}]$ be the concatenation of $k$ word embeddings around $w_t$. The convolutional layer computes a hidden state vector
$$h_t = \tanh(W z_t + b)$$
for each word $w_t$, where $W$ is a weight matrix and $b$ is a bias vector, and $\tanh(\cdot)$ is the element-wise hyperbolic tangent function.
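A small NumPy sketch of this window-based computation follows. The zero padding at sentence boundaries, the requirement that $k$ is odd, and the name `cnn_context` are assumptions, since boundary handling is not specified above.

```python
import numpy as np

def cnn_context(X, W, b, k=3):
    """Compute h_t = tanh(W z_t + b) for every word.

    X: (n, d) word embeddings, one row per word.
    W: (d_h, k * d) weight matrix; b: (d_h,) bias vector; k: odd window size.
    """
    n, d = X.shape
    pad = (k - 1) // 2
    # Assumed boundary handling: pad the sentence with zero vectors.
    Xp = np.vstack([np.zeros((pad, d)), X, np.zeros((pad, d))])
    H = np.empty((n, W.shape[0]))
    for t in range(n):
        z_t = Xp[t:t + k].reshape(-1)   # concatenation of the k embeddings around word t
        H[t] = np.tanh(W @ z_t + b)
    return H

rng = np.random.default_rng(0)
n, d, d_h, k = 6, 4, 5, 3               # toy sizes
X = rng.normal(size=(n, d))
W = rng.normal(size=(d_h, k * d))
b = np.zeros(d_h)
H = cnn_context(X, W, b, k)             # (n, d_h) context-aware vectors
```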

Summarization Layer
After the context layer, the sentence $(w_1, w_2, \ldots, w_n)$ is represented by $(h_1, \ldots, h_n)$. Suppose $m_1 = (w_{b_1}, \ldots, w_{e_1})$ and $m_2 = (w_{b_2}, \ldots, w_{e_2})$ are two entities in the sentence where $m_1$ is on the left of $m_2$ (i.e., $e_1 < b_2$). As different sentences and entities may have various lengths, the summarization layer tries to build a fixed-length vector that best summarizes the representations of the sentence and the two entities for relation type classification.
We divide the hidden state vectors $h_t$ into 5 groups:
• $G_1 = \{h_1, \ldots, h_{b_1 - 1}\}$ includes vectors that are to the left of the first entity $m_1$.
• $G_2 = \{h_{b_1}, \ldots, h_{e_1}\}$ includes vectors that are in the first entity $m_1$.
• $G_3 = \{h_{e_1 + 1}, \ldots, h_{b_2 - 1}\}$ includes vectors that are between the two entities.
• $G_4 = \{h_{b_2}, \ldots, h_{e_2}\}$ includes vectors that are in the second entity $m_2$.
• $G_5 = \{h_{e_2 + 1}, \ldots, h_n\}$ includes vectors that are to the right of the second entity $m_2$.
We perform element-wise max pooling among the vectors in each group:
$$h_{G_i}(j) = \max_{h \in G_i} h(j), \quad 1 \le j \le d_h, \; 1 \le i \le 5$$
where $d_h$ is the dimension of the hidden state vectors. Concatenating the $h_{G_i}$'s we get a fixed-length vector $h_s = [h_{G_1}, \ldots, h_{G_5}]$.
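The grouping and pooling step can be sketched as follows. The indices are 0-based and inclusive, and an empty group (for example when the first entity starts the sentence) is pooled to a zero vector; both conventions, like the name `summarize`, are assumptions made for this illustration.

```python
import numpy as np

def summarize(H, b1, e1, b2, e2):
    """Max-pool hidden vectors in 5 groups and concatenate the results.

    H: (n, d_h) hidden state vectors; (b1, e1) and (b2, e2) are the inclusive,
    0-based spans of the two entities, with e1 < b2.
    """
    d_h = H.shape[1]
    groups = [
        H[:b1],            # left of the first entity
        H[b1:e1 + 1],      # first entity
        H[e1 + 1:b2],      # between the two entities
        H[b2:e2 + 1],      # second entity
        H[e2 + 1:],        # right of the second entity
    ]
    # Element-wise max over each group; empty groups become zero vectors (assumed).
    pooled = [g.max(axis=0) if len(g) else np.zeros(d_h) for g in groups]
    return np.concatenate(pooled)       # fixed length 5 * d_h

H = np.arange(12, dtype=float).reshape(6, 2)   # toy sentence: n = 6, d_h = 2
h_s = summarize(H, b1=1, e1=1, b2=3, e2=4)
```

The output length depends only on $d_h$, not on the sentence or entity lengths, which is what allows a fixed-size output layer.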

Output Layer
The output layer receives inputs from the previous layers (the summarization vector $h_s$, and the entity label embeddings $l_{m_1}$ and $l_{m_2}$ for the two entities under consideration) and returns a probability distribution over the relation type labels.
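One plausible realization of such an output layer is a single affine transformation of the concatenated inputs followed by a softmax. The sketch below assumes that form; the parameters `W` and `b`, the function name `relation_probs` and the toy sizes are hypothetical and are not taken from the model description.

```python
import numpy as np

def relation_probs(h_s, l_m1, l_m2, W, b):
    """Softmax over relation labels from the summarization and entity label vectors."""
    features = np.concatenate([h_s, l_m1, l_m2])
    logits = W @ features + b
    logits = logits - logits.max()      # numerical stability
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(0)
n_labels = 7                            # e.g. 6 ACE05 relation types plus the "O" label
h_s = rng.normal(size=20)               # toy summarization vector (5 * d_h with d_h = 4)
l_m1 = rng.normal(size=3)               # toy entity label embeddings
l_m2 = rng.normal(size=3)
W = rng.normal(size=(n_labels, 26))
b = np.zeros(n_labels)
p = relation_probs(h_s, l_m1, l_m2, W, b)
predicted_label = int(p.argmax())
```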

Cross-Lingual RE Model Transfer
Given the word embeddings of a sequence of words in a target language $t$, $(y_1, \ldots, y_n)$, we project them into the English embedding space by applying the linear mapping $M_{t \to s}$ learned in Section 3.2: $(M_{t \to s} y_1, M_{t \to s} y_2, \ldots, M_{t \to s} y_n)$. The neural network English RE model is then applied on the projected word embeddings and the entity label embeddings (which are language independent) to perform relationship classification. Note that our models do not use language-specific resources such as dependency parsers or POS taggers because these resources might not be readily available for a target language. Also, our models do not use precise word position features since word positions in sentences can vary a lot across languages.
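The projection step amounts to one matrix product per sentence, as the following sketch shows. The 2-dimensional Portuguese embeddings and the matrix `M` are made-up stand-ins for illustration, the helper name `project_sentence` is hypothetical, and every word is assumed to be in the embedding vocabulary.

```python
import numpy as np

def project_sentence(words, tgt_embeddings, M):
    """Look up target-language embeddings and map them with M.

    Returns an (n, d) array whose t-th row is M @ y_t.
    """
    Y = np.stack([tgt_embeddings[w] for w in words])   # (n, d)
    return Y @ M.T

# Hypothetical toy embeddings, not real Portuguese word vectors.
tgt_embeddings = {
    "ele": np.array([1.0, 0.0]),
    "nasceu": np.array([0.0, 1.0]),
    "em": np.array([1.0, 1.0]),
    "lisboa": np.array([2.0, -1.0]),
}
M = np.array([[0.0, 1.0],
              [1.0, 0.0]])              # stand-in for the learned mapping
X_proj = project_sentence(["ele", "nasceu", "em", "lisboa"], tgt_embeddings, M)
# X_proj, together with the entity label embeddings, is what the English RE model consumes.
```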

Experiments
In this section, we evaluate the performance of the proposed cross-lingual RE approach on both an in-house dataset and the ACE (Automatic Content Extraction) 2005 multilingual dataset (Walker et al., 2006).
The ACE05 dataset includes manually annotated RE data for 3 languages: English, Arabic and Chinese. It defines 7 entity types (Person, Organization, Geo-Political Entity, Location, Facility, Weapon, Vehicle) and 6 relation types between the entities (Agent-Artifact, General-Affiliation, ORG-Affiliation, Part-Whole, Personal-Social, Physical).
For both datasets, we create a class label "O" to denote that the two entities under consideration do not have a relationship belonging to one of the relation types of interest.

Source (English) RE Model Performance
We build 3 neural network English RE models under the architecture described in Section 4:
• The first neural network RE model does not have a context layer and the word embeddings are directly passed to the summarization layer. We call it Pass-Through for short.
• The second neural network RE model has a Bi-LSTM context layer. We call it Bi-LSTM for short.
• The third neural network RE model has a CNN context layer. We call it CNN for short.
First we compare our neural network English RE models with the state-of-the-art RE models on the ACE05 English data. The ACE05 English data can be divided into 6 different domains: broadcast conversation (bc), broadcast news (bn), telephone conversation (cts), newswire (nw), usenet (un) and weblogs (wl). We apply the same data split as in (Plank and Moschitti, 2013; Gormley et al., 2015; Nguyen and Grishman, 2016), which uses news (the union of bn and nw) as the training set, half of bc as the development set and the remaining data as the test set.
We learn the model parameters using Adam (Kingma and Ba, 2015). We apply dropout (Srivastava et al., 2014) to the hidden layers to reduce overfitting. The development set is used for tuning the model hyperparameters and for early stopping.
In Table 1 we compare our models with the best models in (Gormley et al., 2015) and (Nguyen and Grishman, 2016). Our Bi-LSTM model outperforms the best model (single or ensemble) in (Gormley et al., 2015) and the best single model in (Nguyen and Grishman, 2016), without using any language-specific resources such as dependency parsers.
Table 3: Performance of the supervised English RE models on the in-house and ACE05 English test data.
While the data split in the previous works was motivated by domain adaptation, the focus of this paper is on cross-lingual model transfer, and hence we apply a random data split as follows.For the source language English and each target language, we randomly select 80% of the data as the training set, 10% as the development set, and keep the remaining 10% as the test set.The sizes of the sets are summarized in Table 2.
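The random 80/10/10 split described above can be sketched with the standard library. Splitting at the document level, the fixed seed and the name `split_documents` are assumptions for illustration.

```python
import random

def split_documents(doc_ids, seed=0):
    """Randomly split documents into 80% training, 10% development, 10% test."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)    # seeded for reproducibility
    n_train = len(ids) * 8 // 10
    n_dev = len(ids) // 10
    train = ids[:n_train]
    dev = ids[n_train:n_train + n_dev]
    test = ids[n_train + n_dev:]
    return train, dev, test

train, dev, test = split_documents(range(100))
```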
We report the Precision, Recall and $F_1$ score of the 3 neural network English RE models in Table 3. Note that adding an additional context layer with either Bi-LSTM or CNN significantly improves the performance of our English RE model, compared with the simple Pass-Through model. Therefore, we will focus on the Bi-LSTM model and the CNN model in the subsequent experiments.

Cross-Lingual RE Performance
We apply the English RE models to the 7 target languages across a variety of language families.

Dictionary Size
The bilingual dictionary includes the most frequent target-language words and their translations in English. To determine how many word pairs are needed to learn an effective bilingual word embedding mapping for cross-lingual RE, we first evaluate the performance ($F_1$ score) of our cross-lingual RE approach on the target-language development sets with an increasing dictionary size, as plotted in Figure 2.
We found that for most target languages, once the dictionary size reaches 1K, further increasing the dictionary size may not improve the transfer performance.Therefore, we select the dictionary size to be 1K.

Comparison of Different Mappings
We compare the performance of cross-lingual RE model transfer under the following bilingual word embedding mappings:
• Regular-1K: the regular mapping learned in (4) using 1K word pairs;
• Orthogonal-1K: the orthogonal mapping with length normalization learned in (5) using 1K word pairs (in this case we train the English RE models with the normalized English word embeddings);
• Semi-Supervised-1K: the mapping learned with 1K word pairs and improved by the self-learning method in (Artetxe et al., 2017);
• Unsupervised: the mapping learned by the unsupervised method in (Artetxe et al., 2018).
The results are summarized in Table 4. The regular mapping outperforms the orthogonal mapping consistently across the target languages. While the orthogonal mapping was shown to work better than the regular mapping for the word translation task (Xing et al., 2015; Artetxe et al., 2016; Smith et al., 2017), our cross-lingual RE approach directly maps target-language word embeddings to the English embedding space without conducting word translations. Moreover, the orthogonal mapping requires length normalization, but we observed that length normalization adversely affects the performance of the English RE models (a drop of about 2.0 $F_1$ points).
We apply the vecmap toolkit to obtain the semi-supervised and unsupervised mappings. The unsupervised mapping has the lowest average accuracy over the target languages, but it does not require a seed dictionary. Among all the mappings, the regular mapping achieves the best average accuracy over the target languages using a dictionary with only 1K word pairs, and hence we adopt it for the cross-lingual RE task.

Performance on Test Data
The cross-lingual RE model transfer results for the in-house test data are summarized in Table 5 and the results for the ACE05 test data are summarized in Table 6, using the regular mapping learned with a bilingual dictionary of size 1K. In the tables, we also provide the performance of the supervised RE model (Bi-LSTM) for each target language, which is trained with a few hundred thousand tokens of manually annotated RE data in the target language, and may serve as an upper bound for the cross-lingual model transfer performance.
Among the 2 neural network models, the Bi-LSTM model achieves a better cross-lingual RE performance than the CNN model for 6 out of the 7 target languages. In terms of absolute performance, the Bi-LSTM model achieves $F_1$ scores over 40.0 for German, Spanish, Portuguese and Chinese. In terms of relative performance, it reaches over 75% of the accuracy of the supervised target-language RE model for German, Spanish, Italian and Portuguese. While Japanese and Arabic appear to be more difficult to transfer, it still achieves 55% and 52% of the accuracy of the supervised Japanese and Arabic RE models, respectively, without using any manually annotated RE data in Japanese or Arabic.
We apply model ensemble to further improve the accuracy of the Bi-LSTM model. We train 5 Bi-LSTM English RE models initialized with different random seeds, apply the 5 models on the target languages, and combine the outputs by selecting the relation type labels with the highest probabilities among the 5 models. This Ensemble approach improves the single model by 0.6-1.9 $F_1$ points, except for Arabic.
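One reading of this combination rule is to take, for each instance, the label that receives the highest probability from any of the models. The sketch below implements that reading; it is an interpretation of the sentence above with made-up probabilities, and the name `ensemble_labels` is hypothetical.

```python
import numpy as np

def ensemble_labels(probs):
    """Pick, per instance, the label with the highest probability among all models.

    probs: (n_models, n_instances, n_labels) predicted distributions.
    Returns an (n_instances,) array of label indices.
    """
    best_per_label = probs.max(axis=0)          # best probability any model assigns to each label
    return best_per_label.argmax(axis=1)

# Made-up outputs of 2 models on 2 instances with 3 labels.
probs = np.array([
    [[0.5, 0.3, 0.2], [0.40, 0.35, 0.25]],      # model 1
    [[0.1, 0.8, 0.1], [0.20, 0.30, 0.50]],      # model 2
])
labels = ensemble_labels(probs)
```

An alternative design would average the distributions before taking the argmax, which weighs agreement between models rather than the single most confident prediction.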

Discussion
Since our approach projects the target-language word embeddings to the source-language embedding space while preserving the word order, it is expected to work better for a target language whose word order is more similar to that of the source language. This has been verified by our experiments. The source language, English, belongs to the SVO (Subject, Verb, Object) language family where in a sentence the subject comes first, the verb second, and the object third. Spanish, Italian, Portuguese, German (in conventional typology) and Chinese also belong to the SVO language family, and our approach achieves over 70% relative accuracy for these languages. On the other hand, Japanese belongs to the SOV (Subject, Object, Verb) language family and Arabic belongs to the VSO (Verb, Subject, Object) language family, and our approach achieves lower relative accuracy for these two languages.

Related Work
There are a few weakly supervised cross-lingual RE approaches. Kim et al. (2010) and Kim and Lee (2012) project annotated English RE data to Korean to create weakly labeled training data via aligned parallel corpora. Faruqui and Kumar (2015) translate a target-language sentence into English, perform RE in English, and then project the relation phrases back to the target-language sentence. Zou et al. (2018) propose an adversarial feature adaptation approach for cross-lingual relation classification, which uses a machine translation system to translate source-language sentences into target-language sentences. Unlike the existing approaches, our approach does not require aligned parallel corpora or machine translation systems. There are also several multilingual RE approaches, e.g., Verga et al. (2016); Min et al. (2017); Lin et al. (2017), where the focus is to improve monolingual RE by jointly modeling texts in multiple languages.
Many cross-lingual word embedding models have been developed recently (Upadhyay et al., 2016; Ruder et al., 2017). An important application of cross-lingual word embeddings is to enable cross-lingual model transfer. In this paper, we apply the bilingual word embedding mapping technique in (Mikolov et al., 2013b) to cross-lingual RE model transfer. Similar approaches have been applied to other NLP tasks such as dependency parsing (Guo et al., 2015), POS tagging (Gouws and Søgaard, 2015) and named entity recognition (Ni et al., 2017; Xie et al., 2018).

Conclusion
In this paper, we developed a simple yet effective neural cross-lingual RE model transfer approach, which has very low resource requirements (a small bilingual dictionary with 1K word pairs) and can be easily extended to a new language.Extensive experiments for 7 target languages across a variety of language families on both in-house and open datasets show that the proposed approach achieves very good performance (up to 79% of the accuracy of the supervised target-language RE model), which provides a strong baseline for building cross-lingual RE models with minimal resources.

Figure 1: Neural cross-lingual relation extraction based on bilingual word embedding mapping (target language: Portuguese, source language: English).

Figure 2: Cross-lingual RE performance ($F_1$ score) vs. dictionary size (number of bilingual word pairs for learning the mapping (4)) under the Bi-LSTM English RE model on the target-language development data.

Table 2: Number of documents in the training/dev/test sets of the in-house and ACE05 datasets.

Table 4: Comparison of the performance ($F_1$ score) using different mappings on the target-language development data under the Bi-LSTM model.

Table 6: Performance of the cross-lingual RE approach on the ACE05 target-language test data.