MT/IE: Cross-lingual Open Information Extraction with Neural Sequence-to-Sequence Models

Cross-lingual information extraction is the task of distilling facts from foreign-language text (e.g., Chinese) into representations in another language preferred by the user (e.g., English tuples). Conventional pipeline solutions decompose the task into machine translation followed by information extraction (or vice versa). We propose a joint solution with a neural sequence model, and show that it outperforms the pipeline in a cross-lingual open information extraction setting by 1-4 BLEU and 0.5-0.8 F1.


Introduction
Suppose an English-speaking user is faced with the daunting task of distilling facts from a collection of Chinese documents. One solution is to first translate the Chinese documents into English using a Machine Translation (MT) service, then extract the facts using an English-based Information Extraction (IE) engine. Unfortunately, imperfect translations negatively impact the IE engine, which may have been trained to expect natural English input (Sudo et al., 2004). Another approach is to first run a Chinese-based IE engine and then translate the results, but this relies on IE resources in the source language. Such problems with pipeline systems compound when the IE engine relies on parsers or other analytics as features.
We propose to solve the cross-lingual IE task with a joint approach. Further, we focus on Open IE, which allows for an open set of semantic relations between a predicate and its arguments. Open IE in the monolingual setting has been shown to be useful in a wide range of tasks, such as question answering (Fader et al., 2014), ontology learning (Suchanek, 2014), and summarization (Christensen et al., 2013). A variety of work has achieved compelling results at monolingual Open IE (Banko et al., 2007; Fader et al., 2011; Angeli et al., 2015). However, we are not aware of efforts that focus on both the cross-lingual and open aspects of cross-lingual Open IE, despite significant work in related areas, such as cross-lingual IE on a closed, pre-defined set of events/entities (Sudo et al., 2004; Parton et al., 2009; Ji, 2009; Snover et al., 2011; Ji et al., 2016), or bootstrapping of monolingual Open IE systems in multiple languages (Faruqui and Kumar, 2015; Kozhevnikov and Titov, 2013; van der Plas et al., 2014). Inspired by the recent success of neural models in machine translation (Kalchbrenner and Blunsom, 2013), syntactic parsing (Vinyals et al., 2015; Choe and Charniak, 2016), and semantic parsing (Dong and Lapata, 2016), we propose a sequence-to-sequence model that enables end-to-end cross-lingual Open IE. Essentially, we recast the problem as structured translation: the model encodes natural-language sentences and decodes predicate-argument forms (Figure 1). We show that the joint approach outperforms the pipeline on various metrics, and that the neural model is critical for the joint approach because of its capability in generating complex Open IE patterns.

Cross-lingual Open IE Framework
Open IE involves the extraction of relations whose schema need not be specified in advance; typically the relation name is represented by the text linking the arguments, which can be identified by manually-written patterns and/or parse trees. We define our extractions based on PredPatt 1 (White et al., 2016), a lightweight tool for identifying predicate-argument structures with a set of Universal Dependencies (UD) based patterns.
PredPatt represents predicates and arguments in a tree structure where a special dependency ARG is built between a predicate head token and its arguments' head tokens, and the original UD dependencies within predicate phrases and argument phrases are kept. For example, Figure 1b shows a tree structure identified by PredPatt from the sentence: "Chris wants to build a boat." Our framework assumes the availability of a bitext, e.g. a corpus of Chinese sentences and their English translations. We run PredPatt on the target side (e.g. English) to obtain (Chinese sentence, English PredPatt) pairs. These pairs are used to train a cross-lingual Open IE system that maps directly from Chinese sentences to English PredPatt representations. Besides the UD parser required for running PredPatt on the target side, our framework requires no additional resources.
Compared to existing Open IE systems (Banko et al., 2007; Fader et al., 2011; Angeli et al., 2015), the use of manual patterns on Universal Dependencies means that the rules are interpretable, extensible and language-agnostic, which makes PredPatt a linguistically well-founded component for cross-lingual Open IE. Note that our joint model is agnostic to the IE representation, and can be adapted to other Open IE frameworks.

Proposed Method
Our goal is to learn a model that directly maps an input sentence A in the source language to predicate-argument structures B in the target language. Formally, we regard the input as a sequence $A = x_1, \cdots, x_{|A|}$, and use a linearized representation of the predicate-argument structure as the output sequence $B = y_1, \cdots, y_{|B|}$. While tree-based decoders are conceivable, linearization of structured outputs to sequences simplifies decoding and has been shown to be effective (e.g., Vinyals et al., 2015), especially when a model with strong memory capabilities (e.g., an LSTM) is employed. Our model maps A into B using a conditional probability which is decomposed as:
$$P(B \mid A) = \prod_{i=1}^{|B|} P(y_i \mid y_{<i}, A)$$
1 https://github.com/hltcoe/PredPatt

Linearized PredPatt Representations
We begin by defining a linear form for our PredPatt predicate-argument structures. To convert a tree structure such as Figure 1b to a linear sequence, we first take an in-order traversal of every node (token). We then label each token with the type it belongs to: $p$ for a predicate token, $a$ for an argument token, $p_h$ for a predicate head token, and $a_h$ for an argument head token. We insert parentheses at the beginning and the end of each argument, and brackets at the beginning and the end of each predicate. To recover the predicate-argument tree structure, we simply build it recursively from the outermost brackets. At each layer of the tree, parentheses help recover argument nodes. The labels $p_h$ and $a_h$ help identify the head token of a predicate and an argument, respectively. We define an auto-generated linearized PredPatt as malformed if it has unmatched brackets or parentheses, or if a predicate (or an argument) has zero or more than one head token.
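The malformedness criterion can be illustrated with a small checker. This is only a sketch under stated assumptions: the `token:label` encoding (e.g. `wants:p_h`), the whitespace-separated bracket tokens, the example linearization, and the function name `is_well_formed` are all hypothetical and not taken from PredPatt or from the paper's figure.

```python
def is_well_formed(tokens):
    """Return True if brackets/parentheses match and every predicate
    ("[ ... ]") and argument ("( ... )") has exactly one head token.

    Assumed (hypothetical) encoding: structural tokens are "[", "]",
    "(", ")"; every other token looks like "word:label" with label in
    {"p", "a", "p_h", "a_h"}.
    """
    stack = []  # each frame is [opening symbol, number of heads seen]
    closer_to_opener = {"]": "[", ")": "("}
    head_to_opener = {"p_h": "[", "a_h": "("}
    for tok in tokens:
        if tok in ("[", "("):
            stack.append([tok, 0])
        elif tok in closer_to_opener:
            if not stack or stack[-1][0] != closer_to_opener[tok]:
                return False  # unmatched closing symbol
            _, heads = stack.pop()
            if heads != 1:
                return False  # zero or more than one head token
        else:
            label = tok.rpartition(":")[2]
            opener = head_to_opener.get(label)
            if opener is not None:
                # Credit the head to the innermost enclosing span of its type.
                for frame in reversed(stack):
                    if frame[0] == opener:
                        frame[1] += 1
                        break
                else:
                    return False  # head token outside any span of its type
    return not stack  # leftover frames mean unmatched opening symbols


good = "[ ( Chris:a_h ) wants:p_h [ ( Chris:a_h ) to:p build:p_h ( a:a boat:a_h ) ] ]".split()
unclosed = "[ ( Chris:a_h ) wants:p_h".split()
two_heads = "[ ( Chris:a_h ) focus:p_h related:p_h ]".split()
```

The stack mirrors the recursive recovery procedure: each closing symbol finalizes one predicate or argument node, which is the point at which its head count can be checked.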

Seq2Seq Model
Our sequence-to-sequence (Seq2Seq) model consists of an encoder, which encodes an input sentence A into a vector representation, and a decoder, which learns to decode a linearized PredPatt output sequence B conditioned on the encoded vectors. We adopt a model similar to that used in neural machine translation. The encoder uses an L-layer bidirectional RNN (Schuster and Paliwal, 1997), which consists of a forward RNN reading inputs from $x_1$ to $x_{|A|}$ and a backward RNN reading inputs in reverse from $x_{|A|}$ to $x_1$. Let $\overrightarrow{h}_i^l \in \mathbb{R}^n$ denote the forward hidden state at time step $i$ and layer $l$; it is computed from the states at the previous time step and at the lower layer:
$$\overrightarrow{h}_i^l = f(\overrightarrow{h}_{i-1}^l, \overrightarrow{h}_i^{l-1})$$
where $f$ is an LSTM unit (Hochreiter and Schmidhuber, 1997). The lowest layer $\overrightarrow{h}_i^0$ is the word embedding of the token $x_i$. The backward hidden state $\overleftarrow{h}_i^l$ is computed similarly using another LSTM, and the representation of each token $x_i$ is the concatenation of the top layers:
$$h_i = [\overrightarrow{h}_i^L ; \overleftarrow{h}_i^L]$$
The decoder is an L-layer RNN which predicts the next token $y_i$ given all the previous tokens $y_{<i} = y_1, \cdots, y_{i-1}$ and the context vector $c_i$ that captures the attention to the encoder side (Luong et al., 2015), computed as a weighted sum of hidden representations:
$$c_i = \sum_{j=1}^{|A|} a_{ij} h_j$$
The weight $a_{ij}$ is computed by a learned attention function, where $v_a \in \mathbb{R}^n$, $W_a^l \in \mathbb{R}^{n \times n}$ and $U_a \in \mathbb{R}^{n \times 2n}$ are its weights.
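To make the weighted-sum step concrete, the following sketch computes a context vector from a list of encoder representations. It assumes the unnormalized attention scores are already given by some scoring function and that the weights are obtained by softmax normalization; both are assumptions made for illustration, and the names `softmax` and `context_vector` are hypothetical helpers rather than part of the described system.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(scores, encoder_states):
    """Return (weights, context), where context = sum_j weights[j] * h_j.

    `scores` are assumed attention scores for one decoder step;
    `encoder_states` is a list of equally sized vectors h_j.
    """
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context


# Toy input: three encoder vectors and equal scores, so the context
# vector is the plain average of the three vectors.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = context_vector([0.0, 0.0, 0.0], encoder_states)
```

With unequal scores the context vector shifts toward the encoder positions with larger weights, which is what lets the decoder focus on different source tokens at each step.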
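The conditional probability of the next token $y_i$ is defined as:
$$P(y_i \mid y_{<i}, A) = \mathrm{softmax}(U_o s_i^L + C_o c_i)[y_i] \tag{3}$$
where $U_o \in \mathbb{R}^{|V_B| \times n}$ and $C_o \in \mathbb{R}^{|V_B| \times 2n}$ are weight matrices.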
[j] indexes the jth element of a vector. $s_i^L$ is the top-layer hidden state at time step $i$, computed recursively by the decoder LSTM; the decoder's lowest-layer input is the word vector of the previous token $y_{i-1}$, with $W_B \in \mathbb{R}^{|V_B| \times n}$ being a parameter matrix.
Training: The objective is to minimize the negative log-likelihood of the target linearized PredPatt given the sentence input:
$$J = -\sum_{(A,B) \in D} \sum_{i=1}^{|B|} \log P(y_i \mid y_{<i}, A)$$
where $D$ is the batch of training pairs, and $P(y_i \mid y_{<i}, A)$ is computed by Eq. (3).
Inference: We use greedy search to decode tokens one by one:
$$\hat{y}_i = \arg\max_{y_i \in V_B} P(y_i \mid \hat{y}_{<i}, A)$$
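The training objective and greedy search can be sketched with a stand-in for the model. Here `next_token_probs` is a hypothetical callable that returns a next-token distribution for a given prefix; the end-of-sequence symbol `EOS`, the `max_len` cutoff, and the toy probability table are assumptions for illustration only, not details of the trained system.

```python
import math

EOS = "</s>"  # assumed end-of-sequence symbol


def greedy_decode(next_token_probs, max_len=50):
    """Pick the most probable next token at each step until EOS."""
    prefix = []
    for _ in range(max_len):
        probs = next_token_probs(prefix)
        token = max(probs, key=probs.get)
        if token == EOS:
            break
        prefix.append(token)
    return prefix


def negative_log_likelihood(next_token_probs, target):
    """Sum of -log P(y_i | y_<i) over the target tokens plus EOS."""
    nll = 0.0
    for i, token in enumerate(target + [EOS]):
        nll -= math.log(next_token_probs(target[:i])[token])
    return nll


# Toy stand-in for a trained model: the distribution depends only on
# how many tokens have been generated so far.
TOY_TABLE = [
    {"[": 0.9, "(": 0.1},
    {"wants:p_h": 0.6, "]": 0.4},
    {"]": 0.8, "wants:p_h": 0.2},
    {EOS: 0.95, "]": 0.05},
]


def toy_model(prefix):
    return TOY_TABLE[min(len(prefix), len(TOY_TABLE) - 1)]


decoded = greedy_decode(toy_model)
nll = negative_log_likelihood(toy_model, decoded)
```

Greedy search commits to one token per step, so it is fast but does not guarantee the globally most probable output sequence.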

Experiments
We describe the data for evaluation, hyperparameters, comparison systems and evaluation results.
Data: We choose Chinese as the source language and English as the target language. To prepare the data for evaluation, we first collect about 2M Chinese-English parallel sentences. We then tokenize Chinese sentences using the Stanford Word Segmenter (Chang et al., 2008), and generate English linearized PredPatt by running the SyntaxNet Parser (Andor et al., 2016) and PredPatt (White et al., 2016) on English sentences. After removing long sequences (length>50), we obtain 990K pairs of Chinese sentences and English linearized PredPatt, which are then randomly divided for training (940K), validation (10K) and test (40K).
Hyperparameters: Our proposed model (Joint-Seq2Seq) is trained using the Adam optimizer (Kingma and Ba, 2014), with mini-batch size 64 and step size 200. Both encoder and decoder have 2 layers and hidden state size 512, but different LSTM parameters sampled from U(-0.05, 0.05). Vocabulary size is 40K for both sides. Dropout (rate=0.5) is applied to non-recurrent connections (Srivastava et al., 2014). Gradients are clipped when their norm is bigger than 5 (Pascanu et al., 2013). We use sampled softmax to speed up training (Jean et al., 2015).
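The length filtering and random split can be sketched as follows. This is a minimal illustration, not the authors' preprocessing script: the function name `filter_and_split`, the fixed seed, and the choice to apply the length limit to both the source and target side are assumptions, and the toy split sizes are scaled down.

```python
import random


def filter_and_split(pairs, max_len=50, n_valid=10, n_test=40, seed=0):
    """Drop pairs with an over-long side, shuffle, and split.

    `pairs` is a list of (source_tokens, target_tokens). Applying the
    limit to both sides is an assumption of this sketch.
    """
    kept = [
        (src, tgt)
        for src, tgt in pairs
        if len(src) <= max_len and len(tgt) <= max_len
    ]
    rng = random.Random(seed)  # fixed seed for a reproducible split
    rng.shuffle(kept)
    valid = kept[:n_valid]
    held_out = kept[n_valid:n_valid + n_test]
    train = kept[n_valid + n_test:]
    return train, valid, held_out


# Toy corpus: 110 pairs with source lengths 1..55 (each length twice),
# so the 10 pairs longer than 50 tokens are removed.
pairs = [(["w"] * (i % 55 + 1), ["t"] * 5) for i in range(110)]
train, valid, held_out = filter_and_split(pairs)
```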
Comparisons: As an alternative, we train a phrase-based machine translation system, Moses (Koehn et al., 2007), directly on the same data we used to train Joint-Seq2Seq, i.e. pairs of Chinese sentences and English linearized PredPatt. We call this system Joint-Moses. We also train a Pipeline system, which consists of a Moses system that translates Chinese sentences into English sentences, followed by the SyntaxNet Parser (Andor et al., 2016) for Universal Dependency parsing on English, and PredPatt for predicate-argument identification.
Results: We regard the generation of linearized PredPatt or linearized predicates 4 as a translation problem, and use BLEU score (Papineni et al., 2002) for evaluation. As shown in Table 1, Joint-Seq2Seq outperforms Pipeline by 1-4 BLEU. We also evaluate predicates in the same vein as event detection evaluation, using the weighted F1 score. 5 There are a total of 9,535 predicate tokens in the test data. To enable a coarser-grained evaluation, we also partitioned these predicates into k clusters (k ∈ {150, 1252}) and evaluated F1 on the cluster identities. The clusters are obtained by running the Bisecting k-Means algorithm on pre-trained word embeddings (Rastogi et al., 2015). 6 Table 2 shows the F1 scores: Joint-Seq2Seq outperforms Pipeline by 0.5-0.8 at different granularities.
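A frequency-weighted F1 can be sketched as below. This is a simplified illustration: it matches predicates by exact string over corpus-level counts and does not reproduce the token-level partial-credit scoring of Liu et al. (2015); the function name `weighted_f1` and the toy predicate lists are hypothetical.

```python
from collections import Counter


def weighted_f1(gold, predicted):
    """Average per-predicate F1, weighted by gold frequency.

    `gold` and `predicted` are lists of predicate strings (or cluster
    ids). Matching is exact and count-based, a simplification.
    """
    gold_counts, pred_counts = Counter(gold), Counter(predicted)
    total = sum(gold_counts.values())
    score = 0.0
    for label, n_gold in gold_counts.items():
        n_pred = pred_counts[label]
        true_pos = min(n_gold, n_pred)
        if true_pos == 0:
            continue  # F1 is 0 for this predicate
        precision = true_pos / n_pred
        recall = true_pos / n_gold
        f1 = 2 * precision * recall / (precision + recall)
        score += (n_gold / total) * f1
    return score


gold = ["focus", "related", "build", "build"]
predicted = ["focus", "build", "want"]
score = weighted_f1(gold, predicted)
```

Replacing each predicate by its cluster id before calling the function gives the coarser-grained variant, since predictions then only need to land in the same cluster as the gold predicate.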
An important aspect of the auto-generated linearized PredPatt is its recoverability. Table 3 shows the number of unrecoverable outputs (including empty or malformed ones). Since the last step in Pipeline is to run PredPatt, Pipeline generates no malformed output. However, 15% of its ...
Table 4 shows an example output. While some arguments (e.g., "The focus of focus" in Table 4) are not correct, the output of Joint-Seq2Seq is closest to the gold in terms of translation. Pipeline has higher precision in predicting the same predicate head tokens as the gold, but its overall meaning is less close. Joint-Moses often generates unrecoverable outputs (e.g., the predicate in Table 4 has two head tokens: "focus" and "related").

Table 4:
zh sent: 重点 审计 关注 与 老百姓 生活 密切 相 关 的 专项 资金 .
en sent: The focus of the auditing will be on special item funds that are closely related to people 's living .
gold: [(The focus of the auditing) will be on special special funds [(special item funds) are closely related to (people 's living)]]
Pipeline: [(the key auditing concern and ordinary people) are closely related to (the life of the special funds)]
Joint-Moses: [(the auditing focus (attention) to (life) with (ordinary people) are closely related to (the special funds)]
Joint-Seq2Seq: [(The focus of focus) focused on (the special collection of the specific funds) [(the special funds) related to (people 's lives)]]

4 In linearized predicates, arguments are replaced by placeholders. For example, the linearized PredPatt in ...
5 Weighted F1 is the weighted average of individual F1 for each predicate, with weights proportional to predicate frequencies in the test data. We use token-level F1 score (Liu et al., 2015), which gives partial credit to partial matches.
6 Downloaded from: https://github.com/se4u/mvlsa

Conclusions
We focus on the problem of cross-lingual Open IE, and propose a joint solution based on a neural sequence-to-sequence model. Our joint approach outperforms the pipeline solution by 1-4 BLEU and 0.5-0.8 F1. Future work includes minimum risk training (Shen et al., 2016) for directly optimizing the cross-lingual Open IE metrics of interest. Furthermore, as PredPatt works on any language that has UD parsers available, we plan to evaluate cross-lingual Open IE on other target languages. We are also interested in exploring how our cross-lingual Open IE output, which contains rich information about predicates and arguments, can be used to facilitate existing IE tasks like relation extraction, event detection, and named entity recognition in a cross-lingual setting.