Neural Cross-Lingual Event Detection with Minimal Parallel Resources

The scarcity of annotated data poses a great challenge for event detection (ED). Cross-lingual ED aims to tackle this challenge by transferring knowledge between different languages to boost performance. However, previous cross-lingual methods for ED demonstrate a heavy dependency on parallel resources, which might limit their applicability. In this paper, we propose a new method for cross-lingual ED that demonstrates a minimal dependency on parallel resources. Specifically, to construct a lexical mapping between different languages, we devise a context-dependent translation method; to address the word order difference problem, we propose a shared syntactic order event detector for multilingual co-training. The effectiveness of our method is studied through extensive experiments on two standard datasets. Empirical results indicate that our method is effective in 1) performing cross-lingual transfer concerning different directions and 2) tackling the extremely annotation-poor scenario.


Introduction
Event detection (ED) is a crucial natural language processing task that aims to identify event triggers in texts (Ahn, 2006; Nguyen and Grishman, 2015). For example, in the sentence "A man died when a tank fired on the hotel", ED requires a system to identify two event triggers, died and fired, along with their types Die and Attack.
Generally, training an ED system requires a considerably large amount of labeled data. However, owing to the complexity and high cost of annotation, existing event resources are scarce and unbalanced across languages (Hsi et al., 2016), which prevents us from building an ED system in languages with insufficient training data. Cross-lingual ED (Ji, 2009; Chen and Ji, 2009; Zhu et al., 2014; Hsi et al., 2016; Liu et al., 2018a) aims to tackle this challenge by transferring knowledge across languages to boost performance. However, previous cross-lingual ED methods rely on either high-performance machine translation (MT) systems trained on large numbers of parallel sentences or manually aligned documents to achieve decent performance; the required parallel resources may only exist for a small fraction of language pairs (Koehn et al., 2007), which greatly limits the applicability of these methods.
In this paper, we propose a new simple but effective method for cross-lingual ED, which can overcome the data scarcity problem in annotation-poor languages by jointly training with resources in other languages. Compared with previous methods, our approach demonstrates a minimal dependency on parallel resources, making it suitable for language pairs that do not have large bitexts.
To achieve cross-lingual transfer, two challenges exist: 1) how to build lexical mappings between different languages, and 2) how to handle the word order difference problem (Xie et al., 2018). For the first challenge, previous studies (Guo et al., 2015; Ni et al., 2017; Mayhew et al., 2017; Xie et al., 2018; Lample et al., 2018) have investigated embedding projection-based methods in cross-lingual applications and achieved promising results. For example, (Xie et al., 2018) proposed a novel "cheap translation" based method which has greatly advanced the performance of zero-shot named entity recognition (NER). However, these methods may not directly fit cross-lingual ED, as the lexical mapping in ED is usually context-dependent rather than deterministic as in other tasks. Consider the following English-to-Chinese lexical mapping examples. To preserve its original meaning, the trigger word "fire" in "gangsters fire at a policeman" (which evokes an Attack event) and in "the house caught fire" (which evokes an NA event) should be translated as different Chinese words, "开火" (open fire) and "着火" (be on fire) respectively. But in previous lexical mapping methods, "fire" always receives the same transferred representation irrespective of its context. This problem might introduce noise into cross-lingual ED.
To address the above issues, in this paper, we devise a context-dependent lexical mapping method for cross-lingual ED. Similar to (Xie et al., 2018), for each source word, we first project it into a shared embedding space, but instead of adopting a deterministic word-to-word translation, we retrieve different translation candidates by looking for its nearest neighbors, and we then adopt a context-aware selective attention mechanism to rank these candidates and find the best-suited translation. Compared with previous methods, our approach can obtain context-dependent translations for each word in the source training data, which may be more suitable for cross-lingual ED.
Considering the word order difference, to the best of our knowledge, (Xie et al., 2018) is the only work which adopted a self-attention mechanism (Vaswani et al., 2017) to tackle this problem (in cross-lingual NER). Different from them, we propose a shared syntactic order event detector for cross-lingual ED, which exploits the syntactic similarity of resources in different languages and circumvents the word order difference problem in multilingual co-training.
To illustrate our motivation, consider an English example E1 and its parallel Chinese translation C1 in Figure 1. As shown, E1 and C1 have rather different word orders (Figure 1a), but they share a similar syntactic structure which captures enough generality for identifying event triggers (Figure 1b). This observation motivates us to exploit the syntactic similarity to achieve multilingual co-training. To this end, we propose a shared syntactic order event detector based on Graph Convolutional Networks (GCNs) (Kipf and Welling, 2016), which provides each word with a contextual feature based on its immediate neighbors in the syntactic graph, irrespective of its original position in the sentence. This detector allows us to train a model on multilingual resources effectively, circumventing the word order difference problem.
To evaluate our method, we have conducted extensive experiments on two standard datasets, using English, Chinese, and Spanish as experimental languages. The experimental results demonstrate that: 1) our model can perform cross-lingual transfer between different language pairs. In particular, the improvement in Chinese ED is large, with an absolute gain of 3.8% in F1 over the monolingual approaches. 2) Our model is robust in the extreme annotation-poor scenario where a language has very limited training data, demonstrating a definite advantage over previous monolingual models. Additionally, compared with MT-based methods, our model achieves competitive results while requiring far fewer parallel resources. This paper is organized as follows: Section 2 briefly introduces the task description and terminologies used in ED; Section 3 elaborates the details of our approach; Section 4 reports our experimental results and analysis; Section 5 reviews related work; Section 6 concludes the paper and outlines future work.

Task Description
Event detection (ED) is a subtask defined in the overall Event Extraction (EE) evaluation of the Automatic Content Extraction (ACE) 2005 program. We first introduce some ACE terminologies to facilitate the understanding of the ED task.
In ACE, 1) an event mention refers to a phrase/sentence within which an event is described. 2) Event trigger refers to a specific word in an event mention which is considered the most representative of the event. Each event trigger has a certain type corresponding to the event mention.
3) Event arguments are participants of the event.
With these definitions, the goal of ED is to locate event triggers and categorize their types. For example, in sentence "The old man died in the hospital", ED requires a system to detect a Die event along with the event trigger died. The detection of event arguments The old man (role=Person) and hospital (role=Place) is not involved in the ED task.
Following previous work (Nguyen and Grishman, 2015), we formulate ED as a token-level multi-class classification task. Namely, given a sentence, we treat every token in it as a trigger candidate, and we aim to classify each candidate into one of 34 categories (33 event types defined in ACE plus an NA type indicating "not an event trigger").
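As a minimal sketch of this token-level formulation (with hypothetical helper and variable names, not the paper's code), every token becomes one classification instance whose gold label defaults to NA:

```python
# Toy illustration of the token-level ED formulation: every token is a
# trigger candidate classified into one of 34 classes (33 ACE event types
# plus "NA"). Function and variable names here are hypothetical.
def to_classification_instances(tokens, gold_triggers):
    """Pair each token with its gold event type, defaulting to 'NA'."""
    return [(tok, gold_triggers.get(i, "NA")) for i, tok in enumerate(tokens)]

sentence = "A man died when a tank fired on the hotel".split()
# Gold annotation: token 2 ("died") evokes Die, token 6 ("fired") evokes Attack.
instances = to_classification_instances(sentence, {2: "Die", 6: "Attack"})
```

A real system would feed these per-token instances to the classifier described in Section 3.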

Methodology
This study focuses on cross-lingual ED, which aims to transfer knowledge from a source language with abundant labeled data to a target language with insufficient training data. Figure 2 visualizes the overall architecture of our model, which consists of three main components: (1) a monolingual embedding layer, which transforms each token into a continuous vector representation; (2) context-dependent lexical mapping, which maps each word in the source language to its best-suited translation in the target language by examining its contextual representation and imposing selective attention over different translation candidates; and (3) a shared syntactic order event detector, which employs Graph Convolutional Networks (GCNs) to exploit the syntactic similarity of resources in different languages, in order to achieve multilingual co-training.
For the sake of convenience, in the following illustrations, we assume the source language is English and the target language is Chinese, and we use an English sentence s = {w 1 , w 2 , . . . , w n } to illustrate our idea.

Monolingual Embedding Layer
In the monolingual embedding layer, each word is assigned a distributed vector as its representation. Specifically, we first train English/Chinese word embeddings on the corresponding Wikipedia dumps via the Skip-gram model (Mikolov et al., 2013) with dimension d = 300. Then we transform each token into its word embedding as its vectorized feature representation.
In this way, s is transformed into an embedding matrix E_s = [x_1, x_2, ..., x_n]^T, where x_i indicates the word embedding of the token w_i.
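A minimal sketch of this lookup step, using a tiny random vocabulary as a stand-in for Wikipedia-trained Skip-gram embeddings (all names here are illustrative, not the paper's code):

```python
import numpy as np

# Each token is mapped to its d-dimensional word vector; stacking the
# vectors yields the sentence's embedding matrix E_s of shape (n, d).
rng = np.random.default_rng(0)
d = 300
vocab = {w: rng.normal(size=d)
         for w in "a man died when tank fired on the hotel".split()}

def embed_sentence(tokens):
    """Stack per-token word vectors into E_s (one row per token)."""
    return np.stack([vocab[t.lower()] for t in tokens])

E_s = embed_sentence("A man died when a tank fired on the hotel".split())
```

Out-of-vocabulary handling (e.g., a shared unknown-word vector) is omitted for brevity.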

Context-Dependent Lexical Mapping
For each token w i in s, context-dependent lexical mapping aims to search for its best-suited word translation according to its contextual representation. This process involves: 1) learning multilingual alignment, 2) retrieving translation candidates, and 3) ranking translation words via a selective attention mechanism.

Learning Multilingual Alignment
Let X and Y be the English and Chinese embedding spaces. In order to achieve multilingual alignment, we learn a mapping W ∈ R^{d×d} from X to Y via a seed dictionary of size m, by optimizing:

W* = argmin_W ||W X_D − Y_D||_F,

where X_D, Y_D ∈ R^{d×m} are two matrices containing the aligned embeddings of the words in the seed dictionary, and ||·||_F indicates the Frobenius norm. To get a closed-form solution, following (Xing et al., 2015; Lample et al., 2018), we impose an orthogonality constraint on W (i.e., W W^T = W^T W = I); in this way, the optimized solution of W corresponds to the singular value decomposition of Y_D X_D^T:

W* = U V^T, with U Σ V^T = SVD(Y_D X_D^T).
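The closed-form solution above (the orthogonal Procrustes problem) can be sketched numerically as follows; the dimensions and data are toy stand-ins, not the real embedding spaces:

```python
import numpy as np

def procrustes(X_D, Y_D):
    """Orthogonal W minimizing ||W X_D - Y_D||_F: W* = U V^T,
    where U S V^T = SVD(Y_D X_D^T)."""
    U, _, Vt = np.linalg.svd(Y_D @ X_D.T)
    return U @ Vt

# Toy check: if the target embeddings are an exact rotation of the source
# ones, the Procrustes solution recovers that rotation.
rng = np.random.default_rng(1)
d, m = 4, 50
X_D = rng.normal(size=(d, m))
R = np.linalg.qr(rng.normal(size=(d, d)))[0]  # a random orthogonal map
Y_D = R @ X_D
W = procrustes(X_D, Y_D)
```

Because W is constrained to be orthogonal, it preserves cosine similarities, which matters for the CSLS retrieval step that follows.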

Retrieving Translation Candidates
Next, we retrieve translation candidates for each token w i in s. Specifically, we first project w i into the aligned embedding space (i.e., by applying W on x i ), and then we explore its neighborhood to find the nearest Chinese words as its translation candidates. In order to measure the distance

Figure 2: The overall architecture of our model. The figure illustrates the process of performing cross-lingual transfer for the English sentence "A man died when a tank fired on the hotel" into Chinese and using the shared syntactic order event detector to predict the event type for the word "fired".
of w_i and a Chinese word y in the aligned space, we adopt the cross-domain similarity local scaling (CSLS) metric (Lample et al., 2018):

CSLS(W x_i, y) = 2 cos(W x_i, y) − r_Y(W x_i) − r_X(y),

where y denotes the (Chinese) word embedding of y; r_Y(W x_i) indicates the mean cosine similarity between W x_i and its K nearest neighbors in Y, which is defined as:

r_Y(W x_i) = (1/K) Σ_{y' ∈ N_Y(W x_i)} cos(W x_i, y'),

and r_X(y) is defined analogously over the projected source space. In our method, for w_i, we take the J Chinese words with the largest CSLS scores (i.e., its nearest neighbors under CSLS) as its translation candidates. We denote by T(w_i) the set of translation candidates for w_i, where T(w_i)_j indicates the jth element of T(w_i).
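A self-contained sketch of CSLS-based candidate retrieval under the definitions above (toy vectors and illustrative names; the hub penalties r_Y and r_X are computed over the K nearest neighbors):

```python
import numpy as np

def cosine_matrix(A, B):
    """Pairwise cosine similarities between the rows of A and of B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

def csls_candidates(proj_src, tgt, J=3, K=10):
    """Indices of the J target words with the highest CSLS score for each
    projected source vector W x_i."""
    sim = cosine_matrix(proj_src, tgt)                    # cos(W x_i, y)
    # r_Y(W x_i): mean similarity to the K nearest target neighbors.
    r_src = np.sort(sim, axis=1)[:, -K:].mean(axis=1, keepdims=True)
    # r_X(y): mean similarity of each target word to its K nearest
    # projected source neighbors.
    r_tgt = np.sort(sim, axis=0)[-K:, :].mean(axis=0, keepdims=True)
    csls = 2 * sim - r_src - r_tgt
    return np.argsort(-csls, axis=1)[:, :J]

# Toy usage: with one-hot "embeddings", each source vector's top candidate
# is the matching basis vector.
cands = csls_candidates(np.eye(8)[[2, 5]], np.eye(8), J=3, K=5)
```

Relative to plain cosine retrieval, the r terms penalize "hub" words that are near many queries, which is why CSLS is preferred for bilingual lexicon induction.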

Context-Aware Selective Attention
Finally, for each token w_i, we perform a context-aware selective attention mechanism to weigh each translation candidate in T(w_i) and obtain the best-suited translation for it.
Learning Contextual Representation. We employ the self-attention mechanism (Vaswani et al., 2017) to learn the contextual representation of w_i. Specifically, given E_s = [x_1, x_2, ..., x_n]^T, we use different single-layer neural networks to learn the queries Q, keys K, and values V respectively. For example, Q = tanh(E_s W_m + b_m), where W_m ∈ R^{d×d} and b_m ∈ R^d are a parameter matrix and bias respectively. Then, we compute the self-attention output as:

C = softmax(Q K^T / √d) V,

where d indicates the word embedding dimension. We take c_i, the ith row of C, as the contextual representation of w_i.

Learning Selective Attention. For each token w_i, after obtaining its translation candidate list T(w_i) and its contextual representation c_i, we impose a selective attention mechanism to automatically weigh each candidate. Specifically, the weight of the jth candidate T(w_i)_j is computed as:

α_j = exp(m_j) / Σ_k exp(m_k),

where m_j measures the semantic relatedness of c_i and T(w_i)_j, computed by:

m_j = [c_i ; y_j^{(w_i)}] W_r + b_r,

where [;] indicates the concatenation operation; y_j^{(w_i)} denotes the Chinese word embedding of T(w_i)_j; W_r ∈ R^{2d×1} and b_r ∈ R are a parameter matrix and bias respectively. Finally, we select the candidate with the maximal attention weight as the best-suited translation for w_i, denoted by w̃_i. In this way, the original sentence s is transferred into a Chinese word sequence t = {w̃_1, w̃_2, ..., w̃_n} of the same length.
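The two attention steps can be sketched together as follows; all weight matrices are random stand-ins for the learned parameters (W_m, W_r, etc.), so this illustrates shapes and data flow rather than a trained model:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def contextual_reps(E_s, W_q, W_k, W_v):
    """Single-head self-attention: row i of the result is c_i, the
    contextual representation of token w_i."""
    Q, K, V = np.tanh(E_s @ W_q), np.tanh(E_s @ W_k), np.tanh(E_s @ W_v)
    d = E_s.shape[1]
    return softmax(Q @ K.T / np.sqrt(d), axis=-1) @ V

def select_translation(c_i, cand_vecs, W_r, b_r):
    """Score each candidate embedding against c_i via a concatenation-based
    scorer and return (best index, attention weights)."""
    feats = np.concatenate([np.tile(c_i, (len(cand_vecs), 1)), cand_vecs],
                           axis=1)
    weights = softmax(feats @ W_r + b_r)
    return int(np.argmax(weights)), weights

rng = np.random.default_rng(3)
n, d = 5, 6
E_s = rng.normal(size=(n, d))
C = contextual_reps(E_s, *(rng.normal(size=(d, d)) for _ in range(3)))
cand_vecs = rng.normal(size=(3, d))          # embeddings of T(w_1)
best, alpha = select_translation(C[0], cand_vecs,
                                 rng.normal(size=2 * d), 0.1)
```

In the real model the chosen candidate is used as the translated token w̃_i; here the selection is over random vectors and only demonstrates the mechanism.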

Shared Syntactic Order Event Detector
As English and Chinese usually have different word orders, the transferred result t might be seen as a corrupted Chinese sentence, which could introduce noise for multilingual co-training. We tackle this problem by proposing a syntactic order event detector based on Graph Convolutional Networks (GCNs) (Kipf and Welling, 2016), which provides each word with a feature vector based on its immediate neighbors in the syntactic graph, irrespective of its position in the sentence. This allows our model to train with the translated data t and the other labeled data in Chinese indiscriminately.

Extracting Graph Convolution Feature
Specifically, for each token w i , our model computes a graph convolution feature vector based on its immediate neighbors in the syntactic graph. Figure 3 illustrates the process of extracting the feature for "fired".
Let N(w_i) denote the set of neighbors of w_i in the syntactic graph, and let L(w_i, v) indicate the label of the dependency arc (w_i → v) (for example, L("fired", "hotel") = nmod in Figure 3). The original GCNs compute a graph convolution vector for w_i at the (k+1)th layer by:

h^{k+1}_{w_i} = g( Σ_{v ∈ N(w_i)} ( W^k_{L(w_i,v)} h^k_v + b^k_{L(w_i,v)} ) ),

where g denotes the ReLU function, and W^k_{L(w_i,v)} and b^k_{L(w_i,v)} are the parameters of the dependency label L(w_i, v) in the kth layer. However, retaining parameters for every dependency label is space-consuming and computationally expensive (there are approximately 50 labels); in our model, we therefore limit L(w_i, v) to only three types of labels: 1) an original edge, 2) a self-loop edge, and 3) an added inverse edge, as suggested in (Nguyen and Grishman, 2018). Additionally, since the generated syntactic parsing structures usually contain noise, we apply attention gates on the edges to weigh their individual importance:

g^k_{(w_i,v)} = σ( h^k_v U^k_{L(w_i,v)} + d^k_{L(w_i,v)} ),

where σ is the logistic sigmoid function, and U^k and d^k are the weight vector and the bias of the gate. With this gating mechanism, the final syntactic GCN computation in our model is:

h^{k+1}_{w_i} = g( Σ_{v ∈ N(w_i)} g^k_{(w_i,v)} ( W^k_{L(w_i,v)} h^k_v + b^k_{L(w_i,v)} ) ).

Figure 3: The illustration of using GCNs to compute the order-invariant feature for the word "fired".
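One gated layer with the three edge types described above can be sketched as follows (random stand-in parameters and illustrative names; not the paper's implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gcn_layer(H, edges, W, b, U, d_gate):
    """One gated syntactic-GCN layer. H: (n, dim) node features.
    edges: (i, j, t) triples meaning node i aggregates from node j through
    edge type t (0 = original arc, 1 = inverse arc, 2 = self loop)."""
    out = np.zeros_like(H)
    for i, j, t in edges:
        gate = sigmoid(H[j] @ U[t] + d_gate[t])   # scalar gate per edge
        out[i] += gate * (H[j] @ W[t] + b[t])
    return np.maximum(out, 0.0)                   # ReLU

rng = np.random.default_rng(2)
n, dim = 3, 4
H = rng.normal(size=(n, dim))
W = rng.normal(size=(3, dim, dim))
b = rng.normal(size=(3, dim))
U = rng.normal(size=(3, dim))
d_gate = rng.normal(size=3)
# Dependency arcs 0->1 and 1->2, plus inverse edges and self loops.
edges = ([(0, 1, 0), (1, 2, 0), (1, 0, 1), (2, 1, 1)]
         + [(i, i, 2) for i in range(n)])
H1 = gcn_layer(H, edges, W, b, U, d_gate)
```

Because each node's update depends only on its graph neighbors, the output is unchanged under any reordering of the tokens that preserves the edges, which is the order-invariance the detector relies on.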
We set the initial vector h^0_{w_i} for w_i as the Chinese word embedding of w̃_i (its translated word), and we stack 2 layers of GCNs (i.e., k = 2) to obtain the final feature for w_i, denoted as f_i.

Event Type Classification
Our model incorporates a logistic regression classifier to predict w_i's event type. Specifically, we compute a prediction vector for w_i by taking f_i as the input:

out = softmax(f_i W_o + b_o),

where W_o ∈ R^{d×c} and b_o ∈ R^c are parameters, and c is the total number of event types (i.e., 34 in this study). The probability of the tth class is denoted as P(t|w_i), which corresponds to the tth element of out.

Multilingual Co-Training
To enable multilingual co-training, we adopt the cross-entropy loss, and we use λ to balance the contribution of the multilingual resources (set to 0.7 through a grid search):

J(Θ) = − Σ_{w_e} log P(l_{w_e} | w_e; Θ) − λ Σ_{w_c} log P(l_{w_c} | w_c; Θ),

where Θ denotes all the parameters of our model; w_e ranges over the tokens in the translated examples and w_c over the tokens in the original Chinese training set; l_{w_e} and l_{w_c} denote the ground-truth event types of w_e and w_c respectively. We adopt the Adam update rules (Kingma and Ba, 2014) to optimize the parameters and add dropout layers to prevent over-fitting.
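Under the objective above, the joint loss can be sketched as follows; the probability tables are toy model outputs, and the function names are illustrative:

```python
import numpy as np

def token_nll(probs, labels):
    """Summed negative log-likelihood of gold labels; probs is
    (n_tokens, n_classes), labels holds gold class indices."""
    return -np.log(probs[np.arange(len(labels)), labels]).sum()

def joint_loss(probs_translated, labels_translated,
               probs_target, labels_target, lam=0.7):
    """Cross-entropy on translated source tokens plus lambda times
    cross-entropy on original target-language tokens."""
    return (token_nll(probs_translated, labels_translated)
            + lam * token_nll(probs_target, labels_target))

# Toy usage with uniform predictions over the 34 event types.
probs = np.full((2, 34), 1.0 / 34)
labels = np.array([0, 5])
loss = joint_loss(probs, labels, probs, labels)
```

Setting lam below 1 down-weights one corpus relative to the other; the paper's grid search arrived at 0.7.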

Datasets and Evaluation
Our main experiments are conducted on ACE 2005 and TAC KBP 2017, two widely used ED datasets which contain annotated documents in English and Chinese (the documents are not parallel). For ACE English ED, we split the dataset into training/development/test sets as suggested in (Li et al., 2013; Nguyen and Grishman, 2015). For ACE Chinese ED, we perform a 10-fold cross-validation as suggested in (Chen and Ji, 2009; Lin et al., 2018). For TAC KBP 2017 English and Chinese, we use the official test sets for testing, and we split the remaining data with a ratio of 9:1 for training and development. The bilingual dictionary is obtained from the MUSE project, with a size of 5k. The number of candidate translations J (in Section 3.2.2) is set to 3. We use Stanford CoreNLP (Manning et al., 2014) to obtain syntactic trees for each language. Precision (Pre.), Recall (Rec.), and F1-score (F1) are used as evaluation metrics, following previous ED studies to ensure comparability. Significance tests (with p = 0.05) were conducted using the method described in (Yeh, 2000).

Experimental Results
We conduct two groups of experiments to investigate the ability of our model in 1) performing cross-lingual transfer concerning different language directions, and 2) handling the annotation-poor scenario.

Cross-Lingual Transfer Concerning Different Language Directions
We investigate both English-to-Chinese and Chinese-to-English transfer to examine whether cross-lingual transfer is feasible in different language directions. In these experiments, we jointly train on the translated data together with all the labeled data in the target language. We compare our cross-lingual approach (CL Trans) with our monolingual approach (Monolingual, which only uses the training data of the target language) and the existing state-of-the-art monolingual ED models (Monolingual SOTA). In ACE ED, we take the models proposed in (Nguyen and Grishman, 2015) and (Lin et al., 2018) as the SOTA English and Chinese ED systems respectively. In TAC KBP ED, we take the top systems reported in the official evaluation (Mitamura et al., 2017) as Monolingual SOTA. We also include a vanilla embedding-projection-based method for comparison (denoted as Embedding proj). Figure 4 summarizes the results. From the results: 1) Our cross-lingual approach CL Trans outperforms the two monolingual systems (+2.95% on F1 on average) and the vanilla embedding-projection-based method (+1.60% on F1) in the four evaluations. This justifies the effectiveness of our approach for cross-lingual transfer in different language directions. 2) Additionally, our cross-lingual approach is more effective for Chinese (+3.80% on F1) than for English (+2.10% on F1). This is understandable, as the number of English examples is much larger than that of Chinese examples (5,285 vs. 2,710 in ACE 2005, and 24,979 vs. 10,630 in TAC KBP 2017). 3) We obtain interesting findings by investigating each event type. For example, in TAC KBP 2017, the type "contact/correspondence" has only 167 samples in Chinese but 996 samples in English. By adopting cross-lingual training, our approach leads to an improvement of 15.3% (from 10.3% to 25.6%) in Chinese ED for this type, compared with the monolingual approach. This indicates that our method can handle the annotation sparseness problem in the target language.

Exploring the Annotation-Scarce Scenario
We next investigate the annotation-poor scenario, where the source language is set as English and the target language is set as Chinese to allow comparison with previous work. In this scenario, we assume that only a few annotated documents are available in Chinese.
In Comparison to Monolingual ED Models. We first compare our cross-lingual approach with existing monolingual ED models, including CNN (Nguyen and Grishman, 2015), which employs Convolutional Neural Networks for the task, and Hybrid (Feng et al., 2016), which combines CNNs with Recurrent Neural Networks (RNNs) for ED. Figure 5 presents the experimental results, where the number of available Chinese documents ranges from 0 to 50. From the results, our approach demonstrates a definite advantage over monolingual ED models in the annotation-poor scenario. In particular, when there is no Chinese training document available (i.e., in the unsupervised cross-lingual transfer scenario), our model achieves an F1 of 27.0% in ACE and 28.7% in TAC KBP, while supervised ED methods completely fail. Additionally, our approach consistently outperforms the embedding-projection method.
In Comparison to Cross-Lingual Models. We next compare our model to existing cross-lingual ED methods, including 1) LexMap (Hsi et al., 2016), which combines the embedding-projection method with multilingual feature extraction to perform cross-lingual ED, and 2) MTED (Zhu et al., 2014), which uses an MT system to translate the training examples in the source language to obtain additional data for training. In our re-implementation, we employ OpenNMT (Klein et al.) as the translation model and use OpenSubtitles (Lison and Tiedemann, 2016) to train it.
To ensure comparability of results, we use the setting of (Hsi et al., 2016) (i.e., using one fold of data (64 annotated documents) for training) to conduct the experiments. Table 1 gives the results. From the results: 1) Our method outperforms LexMap by a rather large margin (+5.8% on F1). The poor performance of LexMap might be attributed to its feature engineering process, which is often very difficult and requires expert knowledge. 2) Our model behaves competitively with the machine translation based method (which is trained on 400k parallel sentences) yet relies on far fewer parallel resources (a dictionary with a size of 5k).

Ablation Study
We conduct an ablation study to explore the effects of our different model components. We limit our study to the extremely annotation-poor scenario, that is, we assume there is no training data in the target language (Chinese).
Exploring the Lexical Mapping Method. To explore our lexical mapping method, we compare the performance of several variant systems retrieving different numbers of candidates (ranging from 1 to 5) with the embedding-projection method (Embedding proj). Note that the system retrieving only one candidate simply takes the nearest Chinese neighbor as the word translation; its lexical mapping is thus still context-independent. Table 2 summarizes the results.
From the results, we observe that 1) even though both CL Trans (1 cand.) and Embedding proj are context-independent mapping methods, the former outperforms the latter by a clear margin (+3.2% on F1). This implies that the embedding-projection method might suffer from misalignment in the shared embedding space, and enforcing a word-to-word alignment (as in CL Trans (1 cand.)) could alleviate this problem to some extent. 2) Retrieving more translation candidates consistently improves Recall. But when too many candidates (e.g., 5) are added, Precision drops, which harms the overall F1 measure.
Exploring the Syntactic Order Event Detector. We compare our syntactic order event detector (CL Trans GCN) with several event detectors, including 1) CL Trans MLP, which employs a feed-forward network as the event detector; 2) CL Trans CNN, which uses CNNs as the event detector; and 3) CL Trans Hybrid, which uses a hybrid network (Feng et al., 2016) for event detection.
We also compared our model with several variants including 4) CL Trans Self., which replaces the GCNs with a self-attention network, and 5) CL Trans GCN Self, which combines GCNs with a self-attention network. We train these models on the same translated English data. Table 3 shows the results.
From the results: 1) CL Trans MLP, CL Trans CNN, and CL Trans Hybrid behave poorly, as expected. The reason might be that these models employ order-sensitive structures (e.g., CNNs) for ED, which suffer from the word order inconsistency problem when trained on the translated data. 2) CL Trans Self. yields relatively good performance. The reason might be that the self-attention network provides each word with a feature vector based on all the words of a sentence, which is also irrespective of the words' positions; this could address the word order difference to some extent. 3) Our syntactic order event detector yields the best performance. However, we do not observe salient advantages from combining GCNs with a self-attention network (comparing CL Trans GCN with CL Trans GCN Self).

Beyond English-Chinese Pair
We conduct additional experiments on Spanish to investigate cross-lingual transfer beyond the English-Chinese pair. The Spanish corpus appears only in TAC KBP, with a much smaller size and fewer published evaluations. Experimental results demonstrate that our model, without any modification, surpasses the best-reported Spanish system (42.8 on F1; Mitamura et al., 2017), achieving F1 scores of 44.0 and 43.8 for EN → SP and CH → SP transfer respectively. The scores are 20.8 and 18.9 in the zero-shot transfer scenarios. This suggests that our approach is language-independent.

Case Study
We give a case study of the cross-lingual transfer process on a real example in ACE: "36,000 people died every year from the flu". Table 4 and Figure 6 give the Chinese translation candidates and the learned attention weights respectively. From the results, the best-suited translations indeed often correspond to larger attention weights, which implies the validity of our approach. In the above example, the Chinese words "从" (from) and "流感" (flu) do not correspond to the nearest neighbors of "from" and "flu", but our context-dependent lexical mapping method enables us to successfully obtain them as the translations.
The case study also poses several future directions for this work. For example, one is how to address the one-to-many mapping between different languages. In the above example, the correct Chinese translation of "every year" should be one single word "每年", not the combination of two words "每个(every)" and "年(year)". This calls for more advanced lexical mapping methods.

Related Work
Event detection (ED) is a hot topic in natural language processing, which has attracted extensive attention in the past few years. Traditionally, the study of ED has focused on monolingual training. The proposed models can be divided into feature-based methods which employ fine-grained features (Ahn, 2006;Ji and Grishman, 2008;Liao and Grishman, 2010;Hong et al., 2011;Li et al., 2013;Li and Ji, 2014), and deep learning-based methods which employ neural networks to automatically learn features for the task (Chen et al., 2015;Nguyen and Grishman, 2015;Nguyen et al., 2016;Liu et al., 2018b;Orr et al., 2018;Liu et al., 2019). Usually, their performance is limited by the amount of labeled data in a specific language.
Cross-lingual ED attempts to transfer knowledge between different languages to boost performance.
To name a few, (Chen and Ji, 2009) used an English detector to label events on parallel documents to obtain additional data for boosting Chinese ED; (Zhu et al., 2014; Liu et al., 2018a) used machine translation to obtain additional labeled data for training; (Hsi et al., 2016) combined the embedding-projection method with multilingual feature extraction for bilingual ED. Nevertheless, the heavy dependency on parallel resources often limits the applicability of these methods.
Our study also relates to cross-lingual studies in other applications (Guo et al., 2015; Ni et al., 2017; Mayhew et al., 2017; Xie et al., 2018; Lample et al., 2018). These approaches adopted embedding projection based methods to achieve cross-lingual transfer and achieved promising results. However, since the lexical mapping in these methods is usually deterministic and irrespective of context, they might not directly fit cross-lingual ED, where the cross-lingual transfer should be context-dependent.

Conclusions and Future Work
In this paper, we propose a new cross-lingual approach for event detection, which demonstrates a minimal dependency on parallel resources. Specifically, we propose a context-dependent lexical mapping method to obtain context-dependent translations, and we devise a shared syntactic order event detector that exploits syntactic similarity for multilingual co-training. Experiments demonstrate the effectiveness of our method.
Currently, as our approach is predicated on the availability of syntax trees for the training examples, it might not fit languages which lack syntactic parsers. In the future, we plan to investigate more language-independent patterns in cross-lingual transfer to circumvent this dependency.