Halo: Learning Semantics-Aware Representations for Cross-Lingual Information Extraction

Cross-lingual information extraction (CLIE) is an important and challenging task, especially in low-resource scenarios. To tackle this challenge, we propose a training method, called Halo, which enforces the local region around each hidden state of a neural model to generate only target tokens with the same semantic structure tag. This simple but powerful technique enables a neural model to learn semantics-aware representations that are robust to noise, without introducing any extra parameters, thus yielding better generalization in both high- and low-resource settings.


Introduction
Cross-lingual information extraction (CLIE) is the task of distilling and representing factual information in a target language from textual input in a source language (Sudo et al., 2004; Zhang et al., 2017c). For example, Fig. 1 illustrates an input Chinese sentence paired with its English predicate-argument information, where predicate and argument are widely used semantic structure tags.
Solving this task is of great importance, as it provides viable solutions for extracting information from the text of languages that have no or few existing information extraction tools. Neural models have empirically proven successful at this task (Zhang et al., 2017c,b), but still remain unsatisfactory in low-resource (i.e., small number of training samples) settings. These neural models learn to summarize a given source sentence and target prefix into a hidden state, which aims to generate the correct next target token after being passed through an output layer.

Figure 1: Chinese input text (a) and linearized English PredPatt output (b), where ':p' and blue stand for predicate while ':a' and purple denote argument.

As each member of the target vocabulary is essentially either a predicate or an argument, a random perturbation of the hidden state should still yield a token with the same semantic structure tag. This inductive bias motivates an extra term in the training objective, as shown in Fig. 2, which enforces the surroundings of any learned hidden state to generate tokens with the same semantic structure tag (either predicate or argument) as the centroid. We call this technique Halo, because the process of each hidden state taking up its surroundings is analogous to how a halo forms around the sun. The method helps the model generalize better by learning more semantics-aware and noise-insensitive hidden states, without introducing extra parameters.

The Problem
We are interested in learning a probabilistic model that directly maps an input sentence x_1 x_2 ... x_I of the source language S into an output sequence y_1 y_2 ... y_T of the target language T, where S can be any natural human language (e.g., Chinese) and T is English PredPatt (White et al., 2016). In the latter vocabulary, each type is tagged as either predicate or argument: those ending with ":p" are predicates while those ending with ":a" are arguments.
For any distribution P in our proposed family, the log-likelihood of the model P given any (x, y) pair is

ℓ = Σ_{t=1}^{T} log P(y_t | x_1 ... x_I, y_0 ... y_{t-1}),

where y_0 is a special beginning-of-sequence token. We denote vectors by bold lowercase Roman letters such as h, and matrices by bold capital Roman letters such as W, throughout the paper. Subscripted bold letters denote distinct vectors or matrices (e.g., p_t). Scalar quantities, including vector and matrix elements such as h_d and p_{t,y_t}, are written without bold. Capitalized scalars represent upper limits on lowercase scalars, e.g., 1 ≤ d ≤ D. Function symbols are notated like their return type. All R → R functions are extended to apply elementwise to vectors and matrices.

The Method
In this section, we first briefly review how baseline neural encoder-decoder models work on this task, and then introduce our novel and well-suited training method, Halo.

Baseline Neural Models
Previous neural models for this task (Zhang et al., 2017c,b) all adopt an encoder-decoder architecture with recurrent neural networks, particularly LSTMs (Hochreiter and Schmidhuber, 1997). At each step t of decoding, the models summarize the input x_1 ... x_I and output prefix y_1, ..., y_{t-1} into a hidden state h_t ∈ (−1, 1)^D, and then project it with a transformation matrix W ∈ R^{|V|×D} to a distribution p_t over the target English PredPatt vocabulary V:

p_t = exp(W h_t) / (1^T exp(W h_t)),

where 1 is a |V|-dimensional vector of ones, so that p_t is a valid distribution. Suppose that the ground-truth target token at this step is y_t; the probability of generating y_t under the current model is p_{t,y_t}, obtained by accessing the y_t-th element of the vector p_t. The log-likelihood is then constructed as ℓ = Σ_{t=1}^{T} log p_{t,y_t}, and the model is trained by maximizing this objective over all training pairs.
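The projection step above can be sketched as follows. This is a minimal NumPy illustration with toy dimensions, not the paper's actual implementation; the names `decode_step` and the random hidden state are assumptions for the sake of the example.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax: subtract the max before exponentiating.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def decode_step(h_t, W):
    """Project a decoder hidden state h_t in (-1, 1)^D to a distribution
    p_t over the |V| target tokens, i.e. p_t = softmax(W h_t)."""
    return softmax(W @ h_t)

rng = np.random.default_rng(0)
D, V = 4, 6                         # toy hidden dimension and vocabulary size
h_t = np.tanh(rng.normal(size=D))   # stands in for the LSTM hidden state
W = rng.normal(size=(V, D))         # transformation matrix, |V| x D
p_t = decode_step(h_t, W)
assert abs(p_t.sum() - 1.0) < 1e-9  # p_t is a valid distribution
```

The log-likelihood for a sequence would then accumulate `np.log(p_t[y_t])` over the decoding steps.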

Halo
Our method exploits a property of this task: the vocabulary V is partitioned into P, the set of predicates that end with ":p", and A, the set of arguments that end with ":a". Since a neural model summarizes everything known up to step t into h_t, would a perturbed state h̃_t near h_t still generate the same token y_t? This bias seems too strong, but we can still reasonably assume that h̃_t would generate a token with the same semantic structure tag (i.e., predicate or argument). That is, the prediction made by h̃_t should end with ":p" if y_t is a predicate, and with ":a" otherwise.
This inductive bias provides us with another level of supervision. Suppose that at step t, a neighbor h̃_t is randomly sampled around h_t and is then used to generate a distribution p̃_t in the same way as equation (2). We can then obtain a distribution q_t over C = {predicate, argument} by summing all the probabilities of predicates and all those of arguments:

q_{t,predicate} = Σ_{y ∈ P} p̃_{t,y},   q_{t,argument} = Σ_{y ∈ A} p̃_{t,y}.

This aggregation is shown in Fig. 3. The extra objective is then ℓ̃ = Σ_{t=1}^{T} log q_{t,c_t}, where c_t = predicate if the target token y_t ∈ P (i.e., it ends with ":p") and c_t = argument otherwise.
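The aggregation from p̃_t to q_t can be sketched in a few lines. The toy vocabulary and the helper names (`aggregate`, `halo_term`) are illustrative assumptions, not from the paper:

```python
import numpy as np

# Toy vocabulary: tokens tagged ':p' are predicates, ':a' are arguments.
vocab = ["fired:p", "attacked:p", "rebels:a", "mortars:a", "town:a"]
is_pred = np.array([w.endswith(":p") for w in vocab])

def aggregate(p):
    """Collapse a distribution p over V into a distribution q over
    C = {predicate, argument} by summing within each partition."""
    return np.array([p[is_pred].sum(), p[~is_pred].sum()])

def halo_term(p_neighbor, y_gold):
    """log q_{t,c_t}: the extra objective at one step, where c_t is the
    semantic structure tag of the gold token y_gold."""
    q = aggregate(p_neighbor)
    c = 0 if vocab[y_gold].endswith(":p") else 1
    return np.log(q[c])

# Distribution produced by a sampled neighbor of the hidden state.
p = np.array([0.1, 0.2, 0.3, 0.25, 0.15])
q = aggregate(p)  # q = [0.3, 0.7]: total mass on predicates vs. arguments
```

For the gold token "mortars:a" (index 3), `halo_term(p, 3)` rewards any argument mass, not only the exact token, which is precisely the weaker supervision Halo asks of the neighbor.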
Therefore, we obtain the joint objective to maximize by adding ℓ and ℓ̃:

ℓ_joint = ℓ + ℓ̃,

which enables the model to learn more semantics-aware and noise-insensitive hidden states by enforcing the hidden states within a region to share the same semantic structure tag.

Sampling Neighbors
Sampling a neighbor around h_t is essentially equivalent to adding noise to it. Note that in the LSTM decoders used in previous work, h_t ∈ (−1, 1)^D because h_t = o_t ⊙ tanh(c_t), where o_t ∈ (0, 1)^D and tanh(c_t) ∈ (−1, 1)^D. Therefore, extra care is needed to ensure h̃_t ∈ (−1, 1)^D. For this purpose, we follow this recipe:
• Sample h̄_t ∈ (−1, 1)^D by independently sampling each entry from a uniform distribution over (−1, 1);
• Sample a scalar λ_t ∈ (0, 1) from a Beta distribution B(α, β), where α and β are hyperparameters to be tuned;
• Set h̃_t = λ_t h̄_t + (1 − λ_t) h_t, which lies in (−1, 1)^D on the line segment between h̄_t and h_t.
Note that the sampled hidden state h̃_t is only used to compute q_t, not to update the LSTM hidden state, i.e., h_{t+1} is independent of h̃_t.
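The three-step recipe above can be sketched as follows; the function name and the toy state are assumptions, and NumPy's generator stands in for whatever sampler the actual implementation uses:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_neighbor(h_t, alpha, beta):
    """Sample h~_t on the line segment between h_t and a uniform draw
    from (-1, 1)^D, so that h~_t also stays inside (-1, 1)^D."""
    h_bar = rng.uniform(-1.0, 1.0, size=h_t.shape)  # uniform point in (-1, 1)^D
    lam = rng.beta(alpha, beta)                     # interpolation weight in (0, 1)
    return lam * h_bar + (1.0 - lam) * h_t

h_t = np.tanh(rng.normal(size=8))                   # stands in for the LSTM state
h_tilde = sample_neighbor(h_t, alpha=1.0, beta=19.0)
assert np.all(np.abs(h_tilde) < 1.0)                # still a valid LSTM-style state
```

Because h̃_t is a convex combination of two points in (−1, 1)^D, it is guaranteed to remain in (−1, 1)^D without any clipping.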

Roles of Hyperparameters
The Halo technique adds an inductive bias into the model, and its magnitude is controlled by λ_t:
• λ_t → 0 makes h̃_t collapse onto h_t, providing no extra supervision on the model;
• λ_t → 1 makes h̃_t uniformly sampled in the entire (−1, 1)^D, and causes underfitting, much as when an L2 regularization coefficient goes to infinity.
We sample a valid λ_t from a Beta distribution with α > 0 and β > 0, whose magnitudes can be tuned on the development set:
• When α → 0 and β is finite, or α is finite and β → ∞, we have λ_t → 0;
• When α → ∞ and β is finite, or α is finite and β → 0, we have λ_t → 1;
• Smaller α and β yield larger variance of λ_t, and setting λ_t to a constant is the special case in which α → ∞, β → ∞, and α/β is fixed.
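These limiting behaviors follow from the standard moments of the Beta distribution, mean = α/(α+β) and variance = αβ/((α+β)²(α+β+1)). A small numeric check (with illustrative hyperparameter values):

```python
def beta_mean(a, b):
    # Mean of Beta(a, b).
    return a / (a + b)

def beta_var(a, b):
    # Variance of Beta(a, b).
    return a * b / ((a + b) ** 2 * (a + b + 1))

# Two settings with the same mean 0.05 but different concentration a + b:
assert beta_mean(1, 19) == beta_mean(10, 190) == 0.05
# For a fixed mean, variance shrinks as a + b grows, so scaling both
# a and b up drives lambda_t toward a constant.
assert beta_var(10, 190) < beta_var(1, 19)
```

This is why, for a fixed mean, taking α, β → ∞ with α/β fixed recovers a deterministic λ_t.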
Besides α and β, the way V is partitioned (i.e., the definition of C) also serves as a knob for tuning the strength of the bias. Although on this task the predicate and argument tags naturally partition the vocabulary, we are still able to explore other possibilities. For example, one extreme is to partition V into |V| singletons, meaning that C = V: a perturbation around h_t should then still predict the same token. But this extreme case does not work well in our experiments, verifying the importance of the semantic structure tags for this task.

Related Work
Cross-lingual information extraction has drawn a great deal of attention from researchers. Some (Sudo et al., 2004; Parton et al., 2009; Ji, 2009; Snover et al., 2011; Ji and Nothman, 2016) worked in closed domains, i.e., on a predefined set of events and/or entities. Zhang et al. (2017c) explored this problem in the open domain, and their attentional encoder-decoder model significantly outperformed a baseline system that performs translation and parsing in a pipeline. Zhang et al. (2017b) further improved the results by inventing a hierarchical architecture that learns to first predict the next semantic structure tag and then select a tag-dependent decoder for token generation. Orthogonal to these efforts, Halo aims to help all neural models on this task, rather than any specific model architecture.
Halo can be understood as a data augmentation technique (Chapelle et al., 2001; Van der Maaten et al., 2013; Srivastava et al., 2014; Szegedy et al., 2016; Gal and Ghahramani, 2016). Such techniques have been used in training neural networks to achieve better generalization, in applications like image classification (Simard et al., 2000; Simonyan and Zisserman, 2015; Arpit et al., 2017; Zhang et al., 2017a) and speech recognition (Graves et al., 2013; Amodei et al., 2016). Halo differs from these methods in that 1) it makes use of task-specific information: the vocabulary is partitioned by semantic structure tags; and 2) it makes use of the human belief that the hidden representations of tokens with the same semantic structure tag should stay close to each other.

Experiments
We evaluate our method on several real-world CLIE datasets, measured by BLEU (Papineni et al., 2002) and F1, as proposed by Zhang et al. (2017c). For the generated linearized PredPatt outputs and their references, the former metric measures their n-gram similarity, and the latter measures their token-level overlap. F1 is computed separately for predicates and arguments, as F1 PRED and F1 ARG respectively.

Datasets
Multiple datasets were used to demonstrate the effectiveness of our proposed method; one sample in each dataset is a source-language sentence paired with its linearized English PredPatt output. These datasets were first introduced as the DARPA LORELEI Language Packs (Strassel and Tracey, 2016), and then used for this task by Zhang et al. (2017c,b). As shown in table 1, the CHINESE dataset has almost one million training samples and a high token/type ratio, while the others are low-resource, meaning they have much fewer samples and lower token/type ratios. (For BLEU, the MOSES implementation (Koehn et al., 2007) was used, as in all previous work on this task.)

Model Implementation
Before applying our Halo technique, we first improved the current state-of-the-art neural model of Zhang et al. (2017b) by adding residual connections (He et al., 2016) and multiplicative attention (Luong et al., 2015), which effectively improved model performance. We refer to the model of Zhang et al. (2017b) and our improved version as ModelZ and ModelP respectively.

Experimental Details
In our experiments, instead of using the full vocabularies shown in table 1, we set a minimum count threshold for each dataset, replacing rare words with a special out-of-vocabulary symbol. These thresholds were tuned on the dev sets.
The Beta distribution is very flexible. In general, its variance is a decreasing function of α + β, and when α + β is fixed, its mean is an increasing function of α. In our experiments, we fixed α + β = 20 and only lightly tuned α on the dev sets; optimal values of α stay close to 1. Additionally, experiments were also conducted on two other low-resource datasets, AMHARIC and YORUBA, that Zhang et al. (2017b) included, and α = 0 in Halo was found optimal on the dev sets. In such cases, the regularization was not helpful, so no comparison needed to be made on the held-out test sets.

Conclusion and Future Work
We present Halo, a simple and effective training technique for the task of cross-lingual information extraction. Our method enforces the local surroundings of each hidden state of a neural model to generate only tokens with the same semantic structure tag, thus making the learned hidden states more aware of semantics and robust to random noise. Our method provides new state-of-the-art results on several benchmark cross-lingual information extraction datasets, in both high- and low-resource scenarios.
As future work, we plan to extend this technique to similar tasks such as POS tagging and semantic role labeling. One straightforward way to handle these tasks is to define the vocabularies as sets of 'word-type:POS-tag' (so c_t = POS tag) or 'word-type:SR' (so c_t = semantic role), such that our method is directly applicable. It would also be interesting to apply Halo more widely to other tasks as a general regularization technique.

Figure 2: Visualization of the Halo method. While a neural model learns to summarize the current known information into a hidden state and predict the next target token, the surroundings of this hidden state in the same space (two-dimensional in this example) are supervised to generate tokens with the same semantic structure tag. For example, at the last shown step, the centroid of the purple area is the summarized hidden state and learns to predict 'mortars:a', while a randomly sampled neighbor is enforced to generate an argument, although it may not be 'mortars' (thus denoted by '?'). Similar remarks apply to the blue regions.

Figure 3: Visualization of how q (the distribution over C) is obtained by aggregating p (the distribution over V).

Table 2: BLEU and F1 scores of different models on all these datasets, where PRED stands for predicate and ARG for argument. Best numbers are highlighted in bold.

As shown in Table 2, ModelP outperforms ModelZ on all the datasets by all the metrics, except for F1 PRED on the CHINESE dataset. Our Halo technique consistently boosts the performance of ModelP, except for F1 PRED on TURKISH.