Adversarial Alignment of Multilingual Models for Extracting Temporal Expressions from Text

Although temporal tagging is still dominated by rule-based systems, there have been recent attempts at neural temporal taggers. However, all of them focus on monolingual settings. In this paper, we explore multilingual methods for the extraction of temporal expressions from text and investigate adversarial training for aligning embedding spaces to one common space. With this, we create a single multilingual model that can also be transferred to unseen languages, setting the new state of the art in these cross-lingual transfer experiments.


Introduction
The extraction of temporal expressions from text is an important processing step for many applications, such as topic detection and question answering (Strötgen and Gertz, 2016). However, there is a lack of multilingual models for this task. While recent temporal taggers, such as the work by Laparra et al. (2018), focus on English, little work has been dedicated to multilingual temporal tagging so far. Strötgen and Gertz (2015) proposed to automatically generate language resources for the rule-based temporal tagger HeidelTime, but all of these models are language-specific and can only process texts from a fixed language. In this paper, we propose to overcome this limitation by training a single model on multiple languages to extract temporal expressions from text. We experiment with recurrent neural networks using FastText embeddings (Bojanowski et al., 2017) and the multilingual version of BERT (Devlin et al., 2019). In order to process multilingual texts, we investigate an unsupervised alignment technique based on adversarial training, making it applicable to zero- or low-resource scenarios, and compare it to standard dictionary-based alternatives (Mikolov et al., 2013). We demonstrate that it is possible to achieve competitive performance with a single multilingual model trained jointly on English, Spanish, and Portuguese. Further, we demonstrate that this multilingual model can be transferred to new languages, for which the model has not seen any labeled sentences during training, by applying it to unseen French, Catalan, Basque, and German data. Our model shows superior performance compared to HeidelTime (Strötgen and Gertz, 2015) and sets new state-of-the-art results in the cross-lingual extraction of temporal expressions.

Related Work
Temporal Tagging. The current state of the art for temporal tagging is rule-based systems, such as HeidelTime (Strötgen and Gertz, 2013) or SUTime (Chang and Manning, 2012). In particular, HeidelTime uses a different set of rules depending on the language and domain. Strötgen and Gertz (2015) automatically generated HeidelTime rules for more than 200 languages in order to support many languages. However, the quality of these rules does not match the high quality of manually created rules, and the models are still language-specific. Aside from rule-based systems, Lee et al. (2014) proposed to learn context-dependent semantic parsers for extracting temporal expressions from text. Laparra et al. (2018) made a first step towards neural models by using recurrent neural networks. However, they only performed experiments on English corpora using monolingual models. In contrast, we propose a truly multilingual model.

Multilingual Embeddings.
Recently, it has become popular to train embedding models jointly on resources from many languages (Lample and Conneau, 2019; Conneau et al., 2019). For example, multilingual BERT (Devlin et al., 2019) was trained on Wikipedia articles from more than 100 languages. Although performance improvements show the possibility of using multilingual BERT in monolingual (Hakala and Pyysalo, 2019), multilingual (Tsai et al., 2019), and cross-lingual settings (Wu and Dredze, 2019), it has been questioned whether multilingual BERT is truly multilingual (Pires et al., 2019; Singh et al., 2019; Libovický et al., 2019). Therefore, we will investigate the benefits of aligning its embeddings in our experiments.

Figure 1: Overview of our multilingual system with adversarial training for improving the embedding space.
Aligning Embedding Spaces. A common method to create multilingual embedding spaces is the alignment of monolingual embeddings (Mikolov et al., 2013;Joulin et al., 2018). Smith et al. (2017) proposed to align embedding spaces by creating orthogonal transformation matrices based on bilingual dictionaries, which we use as baseline alignment method.
It was shown that BERT can also benefit from alignment, e.g., in cross-lingual (Schuster et al., 2019; Liu et al., 2019) or multilingual settings (Cao et al., 2020). In contrast to prior work, we experiment with aligning BERT using adversarial training, which is related to using adversarial training for domain adaptation (Ganin et al., 2016), coping with bias or confounding variables (Li et al., 2018; Raff and Sylvester, 2018; Zhang et al., 2018; Barrett et al., 2019; McHardy et al., 2019), or transferring models from a source to a target language (Zhang et al., 2017; Keung et al., 2019; Wang et al., 2019). Similar to Chen and Cardie (2018), we use a multinomial discriminator in our setting.

Methods
We model the task of extracting temporal expressions as a sequence tagging problem and explore the performance of state-of-the-art recurrent neural networks with FastText and BERT embeddings, respectively. In particular, we train multilingual models that process all languages in the same model. To create and improve the multilingual embedding spaces, we propose an unsupervised alignment approach based on adversarial training and compare it to two baseline approaches. Figure 1 provides an overview of the system. The different components are described in detail in the following.

Temporal Expression Extraction Model
Following previous work, e.g., Lample et al. (2016), we train a bidirectional long short-term memory network (BiLSTM) (Hochreiter and Schmidhuber, 1997) with a conditional random field (CRF) (Lafferty et al., 2001) output layer. As input, we experiment with two embedding methods: (i) pre-trained FastText (Bojanowski et al., 2017) word embeddings from multiple languages, and (ii) multilingual BERT (Devlin et al., 2019) embeddings. For BERT, we use the averaged output of the last four layers as input to the BiLSTM and fine-tune the whole model during training on the temporal extraction task. We also experimented with a BERT setup similar to Devlin et al. (2019), in which the embeddings are mapped directly to the label space and the softmax function is used to compute the label probabilities instead of a CRF. However, we found superior performance with the BiLSTM-CRF models.
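The layer-averaging step for the BERT input representation can be sketched as follows (a minimal numpy sketch with toy values; the function name and shapes are our own illustration, not the paper's implementation):

```python
import numpy as np

def average_last_four_layers(hidden_states):
    """hidden_states: list of (seq_len, hidden_dim) arrays, one per layer
    (embedding layer + 12 transformer layers for multilingual BERT base).
    Returns the element-wise mean of the last four layers, which is then
    fed to the BiLSTM-CRF instead of only the top layer."""
    return np.mean(np.stack(hidden_states[-4:], axis=0), axis=0)

# Toy example: 13 "layers" of a 5-token sentence with hidden size 768,
# where layer i is filled with the constant i.
states = [np.full((5, 768), float(i)) for i in range(13)]
avg = average_last_four_layers(states)
# Mean of layers 9..12 -> 10.5 everywhere
```

Averaging several top layers is a common heuristic for token-level tasks, as the final layer alone tends to be overly specialized to the pre-training objective.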

Alignment of Embeddings
We propose an unsupervised approach based on adversarial training to align multilingual embeddings in a common space (Section 3.2.2) and compare it with two approaches from related work based on linear transformation matrices (Section 3.2.1).

Baseline Alignment
Embedding spaces are typically aligned using a linear transformation based on bilingual dictionaries. We follow the work of Smith et al. (2017) and align embedding spaces based on orthogonal transformation matrices. These matrices can either be constructed in an unsupervised way by using words that appear in the vocabularies of both languages, i.e., identical words that can be found using string matching, or in a supervised way based on real-world dictionaries (Mikolov et al., 2013; Joulin et al., 2018). For the latter method, we build dictionaries based on translations from Wiktionary. For both methods, we reduce the vocabularies to the most frequent 5k words per language and treat English as the pivot language.
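The orthogonal transformation in the style of Smith et al. (2017) reduces to the orthogonal Procrustes problem, which has a closed-form SVD solution. A sketch under simplified assumptions (synthetic data; variable names are ours):

```python
import numpy as np

def orthogonal_alignment(X, Y):
    """Given row-aligned matrices X (source-language vectors) and Y
    (pivot-language vectors) for dictionary word pairs, return the
    orthogonal map W minimizing ||XW - Y||_F (orthogonal Procrustes):
    W = U V^T, where X^T Y = U S V^T."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)
Y = rng.normal(size=(5000, 300))            # e.g. 5k most frequent words
R, _ = np.linalg.qr(rng.normal(size=(300, 300)))
X = Y @ R.T                                  # source space = rotated pivot space
W = orthogonal_alignment(X, Y)               # recovers the rotation: X @ W ~ Y
```

Restricting W to be orthogonal preserves distances and angles within the source space, which is why it tends to generalize better than an unconstrained least-squares map.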

Adversarial Alignment
We propose to use gradient reversal training to align embeddings from different (sub)spaces in an unsupervised way. Note that neither dictionaries nor other language resources are needed for this approach, making it applicable to zero- or low-resource scenarios. In particular, we extend the extraction model C with a discriminator D. Both model parts are trained alternately in a multi-task fashion. The feature extractor F is shared among them and consists of the embedding layer E, followed by a non-linear mapping: F(x) = tanh(W E(x)), with x being the current word, W ∈ R^{S×S}, and S being the embedding dimensionality.
The discriminator D is a multinomial non-linear classifier consisting of one hidden layer with ReLU activation (Hahnloser et al., 2000): D(x) = softmax(T ReLU(V F(x))), with V ∈ R^{S×H}, T ∈ R^{H×O}, H being a hyperparameter, and O the number of different languages.
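A toy forward pass through F and D can be written out as follows (numpy sketch with random weights and a row-vector convention matching the shapes above; all concrete values are illustrative, not the paper's):

```python
import numpy as np

S, H, O = 300, 100, 3                     # embedding dim, hidden dim, #languages
rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(S, S))    # mapping inside the feature extractor F
V = rng.normal(scale=0.1, size=(S, H))    # discriminator input layer
T = rng.normal(scale=0.1, size=(H, O))    # discriminator output layer

def F(e):
    """Shared feature extractor: e = E(x), the word's embedding vector."""
    return np.tanh(e @ W)

def softmax(z):
    z = z - z.max()                       # numerically stable softmax
    p = np.exp(z)
    return p / p.sum()

def D(e):
    """Multinomial discriminator: probability of each origin language."""
    return softmax(np.maximum(F(e) @ V, 0.0) @ T)   # ReLU hidden layer

probs = D(rng.normal(size=S))             # a distribution over the O languages
```

The same features F(e) are fed both to the temporal tagger and to this discriminator, which is what couples the two objectives.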
In total, we distinguish three sets of parameters: θ_C, the parameters of the downstream classification model (i.e., the temporal tagger); θ_D, the parameters of the discriminator; and θ_F, the parameters of the feature extractor. The loss functions of the temporal tagger L_C and of the discriminator L_D are cross-entropy loss functions. While θ_C and θ_D are updated using standard gradient descent, gradient reversal training updates θ_F as follows:

θ_F ← θ_F − η (∂L_C/∂θ_F − λ ∂L_D/∂θ_F)

with η being the learning rate and λ a hyperparameter to control the discriminator influence. Thus, θ_F is updated in the opposite direction of the gradients from the discriminator loss, making the discriminator an adversary. With this, the discriminator is optimized to predict the correct origin language of a given sentence, but at the same time the feature extractor is updated with gradient reversal, such that language detection becomes harder and the discriminator cannot easily distinguish the word representations of different languages.
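The gradient-reversal update of θ_F, which descends on the tagger loss while ascending on the discriminator loss, can be illustrated with a toy numeric step (all gradient values made up for illustration):

```python
import numpy as np

def gradient_reversal_step(theta_F, grad_LC, grad_LD, eta, lam):
    """theta_F <- theta_F - eta * (dL_C/dtheta_F - lam * dL_D/dtheta_F):
    descend on the tagger loss L_C, but move *against* the discriminator
    gradient, so language identity becomes harder to recover from F."""
    return theta_F - eta * (grad_LC - lam * grad_LD)

theta = np.array([0.5, -0.2])             # toy feature-extractor parameters
g_tagger = np.array([0.1, 0.3])           # dL_C/dtheta_F
g_discr = np.array([2.0, -1.0])           # dL_D/dtheta_F
new_theta = gradient_reversal_step(theta, g_tagger, g_discr, eta=0.01, lam=0.001)
# -> approximately [0.49902, -0.20301]
```

In frameworks with automatic differentiation, the same effect is usually obtained by inserting a "gradient reversal layer" that is the identity in the forward pass and multiplies gradients by −λ in the backward pass.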

Evaluation Metrics and Datasets
For evaluation, we use the TempEval-3 evaluation script and report strict and relaxed extraction F1 scores for complete and partial overlap with gold standard annotations, respectively. We also report the type F1 score for the classification into the four temporal types: Date, Time, Duration, and Set. Our multilingual models are trained using the Portuguese TimeBank (Costa and Branco, 2012) and TempEval-3 (UzZaman et al., 2013) for Spanish and English (TimeBank subset). To demonstrate that our model is able to generalize to unseen languages, we perform tests using the French (Bittar et al., 2011), Catalan (Saurí and Badia, 2012), and Basque (Altuna et al., 2016) TimeBanks and the Zeit subset of the German KRAUTS corpus (Strötgen et al., 2018). Corpus statistics are shown in Table 1.
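The distinction between strict and relaxed matching can be sketched as follows (a simplified illustration with made-up spans; the official TempEval-3 scorer additionally handles type scoring and computes the full F1):

```python
def match_counts(gold, pred):
    """gold, pred: lists of (start, end) spans, end exclusive.
    Strict counts exact span matches; relaxed counts predicted spans
    that overlap any gold span by at least one position."""
    strict = len(set(gold) & set(pred))
    relaxed = sum(
        any(g[0] < p[1] and p[0] < g[1] for g in gold) for p in pred
    )
    return strict, relaxed

gold = [(3, 7), (10, 12)]        # gold temporal-expression spans
pred = [(3, 7), (10, 11)]        # second prediction only partially overlaps
strict, relaxed = match_counts(gold, pred)
# strict = 1, relaxed = 2
```

Precision, recall, and F1 are then computed from these match counts against the numbers of predicted and gold spans.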

Hyperparameters and Model Training
We use the AdamW optimizer (Loshchilov and Hutter, 2019) with a learning rate of 1e−5 for the BiLSTM-CRF model part and 1e−6 for BERT. The model is trained for a maximum of 50 epochs using early stopping on the development set. The BiLSTM has a hidden size of 128 units per direction. The labels are encoded in the IOB2 format. For regularization, we apply dropout with a rate of 10% after the input embeddings. The discriminator for adversarial training has a hidden size H of 100 units and is trained after every 10th batch of the sequence tagger, with λ set to 0.001.
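For readers unfamiliar with IOB2, the label encoding can be illustrated as follows (hypothetical sentence and helper function; tag names follow the four temporal types above):

```python
def to_iob2(tokens, spans):
    """spans: list of (start, end, type) with token indices, end exclusive.
    IOB2: the first token of every span gets B-<type>, all following
    tokens of that span get I-<type>, everything else gets O."""
    labels = ["O"] * len(tokens)
    for start, end, ttype in spans:
        labels[start] = f"B-{ttype}"
        for i in range(start + 1, end):
            labels[i] = f"I-{ttype}"
    return labels

tokens = ["She", "arrived", "on", "June", "3", ",", "2020", "."]
labels = to_iob2(tokens, [(3, 7, "Date")])
# -> ["O", "O", "O", "B-Date", "I-Date", "I-Date", "I-Date", "O"]
```

The B-/I- distinction lets the CRF separate two adjacent expressions of the same type, which a plain inside/outside scheme cannot.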

Results
The results for the multilingual experiments are shown in Table 2. We trained three models with different random seeds and report the performance of the model with median performance on the combined development set of all languages. The current state of the art for English (Lee et al., 2014) achieves 83.1/91.4/85.4 strict/relaxed/type F1. However, this is a monolingual model that can only be applied to English. The effects of aligning FastText embeddings are clearly visible in Table 2. The supervised alignment using a dictionary is always superior to the unsupervised alignment without a dictionary and to the unaligned embeddings. Our proposed adversarial alignment (w/ AT) leads to the best results across languages. The performance of BERT is close to that of the best FastText model. (Additional experiments with the multilingual XLM model (Lample and Conneau, 2019), trained on 100 languages, led to results similar to those of the multilingual BERT model.) Aligning BERT with adversarial training also increases performance. The improvements are smaller compared to FastText but still statistically significant for English.

Table 3 provides transfer results of the models with BERT embeddings to languages without labeled training data. (The results of the FastText models were considerably lower for cross-lingual transfer.) In particular, the model using the Wikipedia data for training the discriminator is effective in generalizing to languages without training resources for temporal expression extraction, as these languages are also aligned during model training. It outperforms the state-of-the-art HeidelTime models by a large margin. The impressive performance of multilingual BERT in the cross-lingual setting can be explained by the fact that the model has seen many sentences in our target languages during pre-training, which can now be effectively leveraged in this new setting.

Table 3: Results for the unsupervised cross-lingual setting. We compare to HeidelTime with automatically generated resources, which resembles a similar setting.

Analysis
The embedding spaces of BERT before and after alignment are shown in Figure 2. The left sub-figure presents the original BERT embeddings without any fine-tuning. In this visualization, clear clusters for each language exist. After fine-tuning on multilingual temporal expression extraction and adversarial alignment (right sub-figure), the clusters for each language mostly disappear.

Conclusion
In this paper, we investigated how a multilingual neural model with FastText or BERT embeddings can be used to extract temporal expressions from text. We investigated adversarial training for creating multilingual embedding spaces. The model can effectively be transferred to unseen languages in a cross-lingual setting and outperforms a state-of-the-art model by a large margin.