SEx BiST: A Multi-Source Trainable Parser with Deep Contextualized Lexical Representations

We describe the SEx BiST parser (Semantically EXtended Bi-LSTM parser) developed at Lattice for the CoNLL 2018 Shared Task (Multilingual Parsing from Raw Text to Universal Dependencies). The main characteristic of our work is the encoding of three different modes of contextual information for parsing: (i) Treebank feature representations, (ii) Multilingual word representations, (iii) ELMo representations obtained via unsupervised learning from external resources. Our parser performed well in the official end-to-end evaluation (73.02 LAS – 4th/26 teams, and 78.72 UAS – 2nd/26); remarkably, we achieved the best UAS scores on all the English corpora by applying the three suggested feature representations. Finally, we were also ranked 1st at the optional event extraction task, part of the 2018 Extrinsic Parser Evaluation campaign.


Introduction
Feature representation methods are an essential element of neural dependency parsing. Methods such as Feed Forward Neural Network (FFN) (Chen and Manning, 2014) or LSTM-based word representations (Kiperwasser and Goldberg, 2016) have been proposed to provide fine-grained token representations, and these methods provide state-of-the-art performance. However, learning efficient feature representations is still challenging, especially for under-resourced languages. One way to cope with the lack of training data is a multilingual approach, which makes it possible to use different corpora in different languages as training data. In most cases, for instance in the CoNLL 2017 shared task (Zeman et al., 2017), the teams that adopted this approach used a multilingual delexicalized parser (i.e. a multi-source parser trained without taking into account lexical features). However, delexicalized parsing cannot capture contextual features that depend on the meaning of words within the sentence.
Following previous proposals promoting a model-transfer approach with lexicalized feature representations (Guo et al., 2016; Ammar et al., 2016; Lim and Poibeau, 2017), we have developed the SEx BiST parser (Semantically EXtended Bi-LSTM parser), a multi-source trainable parser using three different contextualized lexical representations:
• Corpus representation: a vector representation of each training corpus.
• Multilingual word representation: a multilingual word representation obtained by the projection of several pre-trained monolingual embeddings into a unique semantic space (following a linear transformation of each embedding).
• ELMo representation: a contextualized word representation obtained via unsupervised learning from external resources.
In this paper, we extend the multilingual graph-based parser proposed by Lim and Poibeau (2017) with the three above representations.
Our parser is open source and available at: https://github.com/CoNLL-UD-2018/LATTICE/. Our parser performed well in the official end-to-end evaluation (73.02 LAS, 4th out of 26 teams, and 78.72 UAS, 2nd out of 26). We obtained very good results for French, English and Korean, where we were able to extensively exploit the three above features (for these languages, we obtained the best UAS performance on all the treebanks, and were among the best in LAS as well). Unfortunately, we were not able to apply the same strategy to all the languages due to a lack of GPU resources and, correspondingly, of time for training, and also due to a lack of training data for some languages.
The structure of the paper is as follows. We first describe the feature extraction and representation methods (Sections 2 and 3) and then present our POS tagger and our parser based on multi-task learning (Section 4). We then give some details on our implementation (Section 5) and finally provide an analysis of our official results (Section 6).

Deep Contextualized Token Representations
The architecture of our parser follows the multilingual LATTICE parser presented in Lim and Poibeau (2017), with the addition of the three feature representations presented in the introduction.
The basic token representation is as follows. Given a sentence of tokens $s = (t_1, t_2, \ldots, t_n)$, the $i$-th token $t_i$ can be represented by a vector $x_i$, which is the result of the concatenation ($\bullet$) of a word vector $w_i$ and a character-level vector $c_i$ of $t_i$:

$$x_i = w_i \bullet c_i$$

When the approach is monolingual, $w_i$ corresponds to the external word embeddings provided by Facebook (Bojanowski et al., 2016). Otherwise, we used our own multilingual strategy based on multilingual embeddings (see Section 3.2).

Character-Level Word Representation
Token $t_i$ can be decomposed as a vector of characters $(ch_1, ch_2, \ldots, ch_m)$ where $ch_j$ is the $j$-th character of $t_i$. The function Char (that generates the character-level word vector $c_i$) corresponds to a vector obtained from the hidden state representation $h_j$ of the LSTM, with an initial state $h_0$ ($m$ is the length of token $t_i$):

$$h_j = \mathrm{LSTM}^{(ch)}(h_0, (ch_1, ch_2, \ldots, ch_m))_j$$
$$c_i = w_c h_m$$

Note that $i$ refers to the $i$-th token in the sentence and that $j$ refers to the $j$-th character of the $i$-th token. Here, we use lowercase italics for vectors and uppercase italics for matrices.

For LSTM-based character-level representations, previous studies have shown that the last hidden state $h_m$ represents a summary of all the information in the input character sequence (Shi et al., 2017). It is then possible to linearly transform this with a parameter $w_c$ so as to get the desired dimensionality. Another representation method involves applying an attention-based linear transformation of the hidden layer matrix $H_i$, for which attention weights $a_i$ are calculated as follows:

$$a_i = \mathrm{Softmax}(w_{att} H_i^{\top})$$
$$c_i = a_i H_i$$

Since we apply the Softmax function, making the weights sum to 1 after a linear transformation of $H_i$ with attention parameter $w_{att}$, the self-attention weight $a_i$ intuitively corresponds to the most informative characters of token $t_i$ for parsing. Finally, by summing up the hidden states $H_i$ of each word according to its attention weights $a_i$, we obtain our character-level word representation vector for token $t_i$. Most recently, Dozat et al. (2017) suggested an enhanced character-level representation based on the concatenation of $h_m$ and $a_i H_i$ so as to capture both the summary and context information in one go for parsing. This is an option that could be explored in the future.
After some empirical experiments, we chose bidirectional LSTM encoders rather than a unidirectional one, and we then fed the hidden state matrix $H_i$ into a two-layered Multi-Layer Perceptron (MLP) without bias terms to compute the attention weights $a_i$. For training, we used the character-level word representations for all the languages except Kazakh and Thai (see Section 5).
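The attention pooling described above can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' implementation: the BiLSTM is left out, so the hidden states `H` are taken as given, the `tanh` non-linearity between the two bias-free layers is an assumption, and the names `char_attention`, `W_att1` and `w_att2` as well as all numeric values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def char_attention(H, W_att1, w_att2):
    """Pool per-character hidden states H (m x d) into one d-dimensional vector.

    Assumed form: score_j = w_att2 . tanh(W_att1 h_j), a = softmax(scores),
    c = sum_j a_j * h_j. No bias terms are used; tanh is an assumption.
    """
    scores = []
    for h_j in H:
        hidden = [math.tanh(z) for z in matvec(W_att1, h_j)]
        scores.append(sum(w * z for w, z in zip(w_att2, hidden)))
    a = softmax(scores)
    dim = len(H[0])
    c = [sum(a[j] * H[j][k] for j in range(len(H))) for k in range(dim)]
    return a, c

# Toy example: a 3-character token with 2-dimensional hidden states.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_att1 = [[0.5, -0.5], [0.25, 0.75]]  # first layer (2 x 2), hypothetical values
w_att2 = [1.0, -1.0]                  # second layer (length 2), hypothetical values
weights, c_i = char_attention(H, W_att1, w_att2)
```

Because the weights come out of a softmax, `c_i` is a convex combination of the character states, which is what lets the model emphasize the characters that matter most for parsing.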

Corpus Representation
Following Lim and Poibeau (2017), we used a one-hot treebank representation strategy to encode a set of language-specific features. In other words, each language has its own set of specific lexical features. (Regarding notation, the hidden state $H_i$ is a matrix stacked on $m$ characters, and in this paper all the letters $w$ and $W$ denote parameters that the system has to learn.)
For languages with several training corpora (e.g., French-GSD and French-Spoken), our parser computes an additional feature vector taking into account corpus specificities at word level. Following the recent work of Stymne et al. (2018), who proposed a similar approach for treebank representations, we chose to use a 12-dimensional vector for corpus representation. This representation $tr_i$ is concatenated with the token representation $x_i$:

$$x_i = w_i \bullet c_i \bullet tr_i$$

We used this approach (corpus representation) for 24 corpora, and its effectiveness will be discussed in Section 5.
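As an illustration of the corpus representation, the sketch below keeps one 12-dimensional vector per treebank and appends it to every token of that treebank. It is only a sketch under assumptions: in the real system these vectors are learned during training, whereas here they are merely initialized at random; the function names, the treebank keys and the 100-dimensional word vector are hypothetical.

```python
import random

TREEBANK_DIM = 12  # dimensionality reported in the text

def build_treebank_table(treebank_ids, dim=TREEBANK_DIM, seed=0):
    """Create one randomly initialized vector per treebank (trained in practice)."""
    rng = random.Random(seed)
    return {tb: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for tb in treebank_ids}

def token_representation(w_i, c_i, treebank_id, table):
    """Concatenate the word vector, the character-level vector and the treebank vector."""
    return list(w_i) + list(c_i) + list(table[treebank_id])

table = build_treebank_table(["fr_gsd", "fr_spoken"])
w_i = [0.1] * 100  # hypothetical 100-dimensional word vector
c_i = [0.2] * 100  # 100-dimensional character-level vector
x_i = token_representation(w_i, c_i, "fr_spoken", table)
```

All tokens from the same treebank share the same trailing 12 values, so the downstream BiLSTM can condition on annotation or genre differences between corpora of the same language.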

Contextualized Representation
ELMo (Embeddings from Language Models; Peters et al., 2018) is a function that provides a representation based on the entire input sentence. ELMo contextualized embedding is a recent technique for word representation that has achieved state-of-the-art performance across a wide range of language understanding tasks. This approach is able to capture both subword and contextual information. As stated in the original paper by Peters et al. (2018), the goal is to "learn a linear combination of the vectors stacked above each input word for each end task, which markedly improves performance over just using the top LSTM layer".
We trained our language model with a bidirectional LSTM, using ELMo as an intermediate layer in the bidirectional language model (biLM), and we used the ELMo embeddings to further improve the performance of our model:

$$R_i = \{x^{LM}_i, \overleftrightarrow{h}^{LM}_{i,j} \mid j = 1, \ldots, L\} = \{h^{LM}_{i,j} \mid j = 0, \ldots, L\} \qquad (1)$$
$$\mathrm{ELMo}_i = \gamma \sum_{j=0}^{L} s_j h^{LM}_{i,j} \qquad (2)$$

In (1), $x^{LM}_i$ and $h^{LM}_{i,0}$ are word embedding vectors corresponding to the token layer, and $L$ is the number of LSTM layers. $\overleftrightarrow{h}^{LM}_{i,j}$ is a hidden LSTM vector produced by a multi-layer bidirectional LSTM. $h^{LM}_{i,j}$ is a concatenated vector composed of $x^{LM}_i$ and $\overleftrightarrow{h}^{LM}_{i,j}$. We computed our model with all the biLM layers weighted. In (2), $s_j$ is a trainable softmax weight that normalizes the contribution of the LSTM layers, and $\gamma$ is a scalar parameter that helps to train the model efficiently. We used a 1024-dimensional ELMo embedding.
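The weighting of the biLM layers can be illustrated with a small sketch: raw layer weights are normalized with a softmax, the layers are summed with those weights, and the result is scaled by a scalar. The function name `scalar_mix`, the 4-dimensional toy vectors and the parameter values are hypothetical; the real embeddings have 1024 dimensions, and both the layer weights and the scalar are learned.

```python
import math

def scalar_mix(layers, s, gamma):
    """Return gamma * sum_j softmax(s)_j * layers[j] for one token."""
    top = max(s)
    exps = [math.exp(v - top) for v in s]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(layers[0])
    return [gamma * sum(weights[j] * layers[j][k] for j in range(len(layers)))
            for k in range(dim)]

# Toy example: the token layer plus two LSTM layers, 4 dimensions instead of 1024.
layers = [
    [1.0, 0.0, 0.0, 0.0],  # token layer
    [0.0, 1.0, 0.0, 0.0],  # first LSTM layer
    [0.0, 0.0, 1.0, 0.0],  # second LSTM layer
]
s = [0.0, 0.0, 0.0]        # equal raw weights, so the softmax gives 1/3 each
elmo_i = scalar_mix(layers, s, gamma=1.5)
```

With equal raw weights every layer contributes one third, so each of the first three coordinates equals 1.5 / 3 = 0.5 and the last one stays at 0.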

Multilingual Feature Representations
The supervised, monolingual approach to parsing, based on syntactically annotated corpora, has long been the most common one. However, thanks to recent developments involving powerful word representation methods (a.k.a. word embeddings), it is now possible to develop accurate multilingual lexical models by mapping several monolingual embeddings into a single vector space. This multilingual approach to parsing has yielded encouraging results for both low- (Guo et al., 2015) and high-resource languages (Ammar et al., 2016). In this work, we extend the recent multilingual dependency parsing approach proposed by Lim and Poibeau (2017) that achieved state-of-the-art performance during the last CoNLL shared task by using multilingual embeddings mapped based on bilingual dictionaries.

Embedding Projection
There are different strategies to produce multilingual word embeddings (Ruder et al., 2018), but a very efficient one consists in simply projecting one word embedding on top of the other to make both representations share the same semantic space (Artetxe et al., 2016). The alternative involves directly generating bilingual word embeddings from bilingual corpora (Gouws and Søgaard, 2015), but this requires a large amount of bilingual data aligned at sentence or document level. This kind of resource is not available for most language pairs, especially for under-resourced languages.
We thus chose to train monolingual word embeddings independently and then map these word embeddings onto one another. This approach is powerful since monolingual word embeddings generally share a similar structure (especially if they have been trained on similar corpora) and so can be superimposed with little information loss.
To project embeddings, we applied the linear transformation method using bilingual dictionaries proposed by Artetxe et al. (2017). We took the bilingual dictionaries from OPUS and Wikipedia.
The projection method can be described as follows. Let $X$ and $Y$ be the source and target word embedding matrices, so that $x_i$ refers to the $i$-th word embedding of $X$ and $y_j$ refers to the $j$-th word embedding of $Y$. Let $D$ be a binary matrix where $D_{ij} = 1$ if $x_i$ and $y_j$ are aligned. Our goal is to find a transformation matrix $W$ such that $W x$ approximates $y$. This is done by minimizing the sum of squared errors:

$$\arg\min_{W} \sum_{i} \sum_{j} D_{ij} \lVert W x_i - y_j \rVert^2$$

The method is relatively simple since converting a bilingual dictionary into $D$ is quite straightforward. The size of the dictionary used for training is around 250 pairs, and the projected word embedding is around 1.8GB. The dictionaries and the projected word embeddings are publicly available on GitHub.
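To make the projection step concrete, here is a minimal sketch that fits a transformation by ordinary least squares over the aligned dictionary pairs, solving the normal equations with a small Gauss-Jordan routine. It is an unconstrained simplification written for illustration: the cited method may add constraints (for example orthogonality) that are not modelled here, and the function names and toy vectors are hypothetical.

```python
def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    aug = [list(A[r]) + list(B[r]) for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: more dictionary pairs are needed")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def fit_projection(X, Y, pairs):
    """Return W minimizing the sum over aligned (i, j) of ||W x_i - y_j||^2."""
    d_src, d_tgt = len(X[0]), len(Y[0])
    A = [[0.0] * d_src for _ in range(d_src)]  # sum of x x^T
    B = [[0.0] * d_tgt for _ in range(d_src)]  # sum of x y^T
    for i, j in pairs:                         # pairs lists the entries with D_ij = 1
        for r in range(d_src):
            for c in range(d_src):
                A[r][c] += X[i][r] * X[i][c]
            for c in range(d_tgt):
                B[r][c] += X[i][r] * Y[j][c]
    W_t = solve(A, B)                          # normal equations: A W^T = B
    return [[W_t[r][c] for r in range(d_src)] for c in range(d_tgt)]

# Toy dictionary: the target space is the source space rotated by 90 degrees.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = [[0.0, -1.0], [1.0, 0.0], [1.0, -1.0]]
pairs = [(0, 0), (1, 1), (2, 2)]
W = fit_projection(X, Y, pairs)
```

On this toy dictionary the fitted matrix is the rotation that generated the target vectors, so every source word lands exactly on its translation.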

Training with Multilingual Embedding
After having trained the multilingual embeddings, we associate them with the word representation $w_i$. We applied the multilingual embeddings mostly to train the nine low-resource languages of the 2018 CoNLL evaluation, for which only a handful of annotated sentences were provided.

Multi-Task Learning for Tagging and Parsing
In this section, we describe our Part-Of-Speech (POS) tagger and dependency parser using the encoded token representation x i based on Multi-Task Learning (MTL) (Zhang and Yang, 2017).

Part-Of-Speech Tagger
As presented in Sections 2 and 3, our parser is based on models trained with a combination of features encoding different contextual information. However, the attention mechanism for the character-level word vector $c_i$ focuses only on a limited number of features within the token, so the word representation $w_i$ is also needed, and the resulting token representation is transformed by a bidirectional LSTM as a way to capture the overall context of the sentence. Finally, a token is encoded as a vector $g_i$:

$$g_i = \mathrm{BiLSTM}^{(pos)}(g_0, (x_1, x_2, \ldots, x_n))_i$$

We transform the token vector $g_i$ into a vector of the desired dimensionality with a two-layered MLP with a bias term, in order to select the best universal part-of-speech (UPOS) candidate $y_i$. Finally, we randomly initialize the UPOS embeddings and map the predicted UPOS $y_i$ to a POS vector $p_i$.
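The tagging head described above (MLP scores, best UPOS candidate, then lookup of a POS vector) can be sketched as follows. This is an illustrative sketch only: the ReLU activation, the 4-dimensional POS vectors, the three-tag tagset and all parameter values are assumptions, and the names `mlp_scores` and `predict_upos` are hypothetical.

```python
import random

POS_DIM = 4  # hypothetical size of the POS vector

def mlp_scores(g_i, W1, b1, W2, b2):
    """Two-layer MLP with bias terms; ReLU is an assumed activation."""
    hidden = [max(0.0, sum(w * g for w, g in zip(row, g_i)) + b)
              for row, b in zip(W1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W2, b2)]

def predict_upos(g_i, params, tagset, pos_table):
    """Pick the highest-scoring UPOS tag and return it with its embedding."""
    scores = mlp_scores(g_i, *params)
    best = max(range(len(scores)), key=scores.__getitem__)
    tag = tagset[best]
    return tag, pos_table[tag]

tagset = ["NOUN", "VERB", "ADJ"]
rng = random.Random(0)
# Randomly initialized UPOS embeddings (updated by back-propagation in practice).
pos_table = {t: [rng.uniform(-0.1, 0.1) for _ in range(POS_DIM)] for t in tagset}

W1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]                     # toy first layer
W2, b2 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]], [0.0, 0.0, 0.0]  # toy output layer
g_i = [0.2, 0.9]                                                  # toy BiLSTM output
tag, p_i = predict_upos(g_i, (W1, b1, W2, b2), tagset, pos_table)
```

In this toy setting the second output unit receives the largest score, so the token is tagged VERB and `p_i` is the embedding stored for that tag.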

Dependency Parser
To take into account the predicted POS vector in the main target task (i.e. parsing), we concatenate the predicted POS vector $p_i$ with the word representation $w_i$, and we then encode the resulting vector via a BiLSTM. This enriches the syntactic representations of the token by back-propagation during training:

$$v_i = \mathrm{BiLSTM}^{(dep)}(v_0, (x_1, x_2, \ldots, x_n))_i$$

Following Dozat and Manning (2016), we used a deep bi-affine classifier to score all the possible head and modifier pairs $Y = (h, m)$. We then selected the best dependency graph based on Eisner's algorithm (Eisner and Satta, 1999). This algorithm tries to find the maximum spanning tree among all the possible graphs:

$$\arg\max_{\text{valid } Y} \sum_{(h, m) \in Y} \mathrm{Score}(h, m)$$

With this algorithm, it has been observed that parsing results (for some sentences) can have multiple roots, which is not a desirable feature. We thus followed an empirical method that selects a unique root based on the word order of the sentence, as already proposed by Lim and Poibeau (2017), to ensure tree well-formedness. After the selection of the best-scored tree, another bi-affine classifier is applied for the classification of relation labels, based on the predicted tree. We trained our tagger and parser simultaneously using a single objective function with penalized terms:

$$\mathrm{loss} = \alpha\, \mathrm{CrossEntropy}(p', p^{(gold)}) + \beta\, \mathrm{CrossEntropy}(arc', arc^{(gold)}) + \gamma\, \mathrm{CrossEntropy}(dep', dep^{(gold)})$$

where $arc'$ and $dep'$ refer to the predicted arc (head) and dependency (modifier) results.
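For readers unfamiliar with bi-affine scoring, the sketch below computes a score for every (modifier, head) pair from a bilinear term plus a head-only bias term, in the spirit of Dozat and Manning (2016). It is a simplified illustration: the MLPs that normally produce the modifier and head vectors from the BiLSTM states are omitted, tree decoding is not shown, and the names `biaffine_arc_scores`, `U` and `u` and the toy values are hypothetical.

```python
def biaffine_arc_scores(dep, head, U, u):
    """scores[m][h] = dep[m]^T U head[h] + u . head[h].

    dep[m] is the vector of token m seen as a modifier, head[h] the vector of
    token h seen as a candidate head. U is the bilinear weight matrix and u the
    weight vector of the head-only bias term.
    """
    scores = []
    for d in dep:
        # Project the modifier vector once, then compare it with every head.
        dU = [sum(d[r] * U[r][c] for r in range(len(d))) for c in range(len(U[0]))]
        row = []
        for h in head:
            bilinear = sum(x * y for x, y in zip(dU, h))
            bias = sum(x * y for x, y in zip(u, h))
            row.append(bilinear + bias)
        scores.append(row)
    return scores

# Toy example with two tokens and 2-dimensional vectors.
dep = [[1.0, 0.0], [0.0, 1.0]]
head = [[1.0, 0.0], [0.0, 1.0]]
U = [[0.0, 2.0], [3.0, 0.0]]
u = [0.5, 0.0]
scores = biaffine_arc_scores(dep, head, U, u)
```

The resulting score matrix is what a tree decoder would consume. Picking the best head for each token independently is not enough, since it can produce cycles or several roots, which is why a decoding step and a root selection step follow in the parser.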
Since UAS directly affects LAS, we assumed that UAS would be crucial for parsing unseen corpora such as Finnish PUD, as well as other corpora from low-resource languages. Therefore, we gave more weight to the parameters predicting arcs ($\beta$) than to those predicting dependency labels ($\gamma$) and POS tags ($\alpha$), since arc prediction directly affects UAS. We set $\alpha = 0.1$, $\beta = 0.7$ and $\gamma = 0.2$. Unfortunately, during the testing phase, we did not adjust the weight parameters in a way that would have benefited LAS for the 61 big treebanks, and this made our results on big treebanks suffer a bit (7th) compared to those we obtained on the small and PUD treebanks (3rd) regarding LAS. This also explains the gap between the UAS and LAS scores in our overall results.
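The weighted objective can be written down directly. The sketch below combines three cross-entropy terms with the weights reported above; averaging each term over tokens is an assumption, the function names are hypothetical, and the probability values are toy numbers chosen for illustration.

```python
import math

ALPHA, BETA, GAMMA = 0.1, 0.7, 0.2  # weights reported in the text

def cross_entropy(probs, gold):
    """Mean negative log-likelihood of the gold indices (averaging is an assumption)."""
    return -sum(math.log(p[g]) for p, g in zip(probs, gold)) / len(gold)

def multitask_loss(pos_probs, pos_gold, arc_probs, arc_gold, dep_probs, dep_gold):
    """Weighted sum of the tagging, arc and dependency-label losses."""
    return (ALPHA * cross_entropy(pos_probs, pos_gold)
            + BETA * cross_entropy(arc_probs, arc_gold)
            + GAMMA * cross_entropy(dep_probs, dep_gold))

# Toy example with a single token.
loss = multitask_loss(
    pos_probs=[[0.5, 0.5]], pos_gold=[0],    # uncertain POS prediction
    arc_probs=[[0.25, 0.75]], arc_gold=[1],  # fairly confident head prediction
    dep_probs=[[1.0, 0.0]], dep_gold=[0],    # perfect label prediction
)
```

Because the arc term carries a weight of 0.7, an error in head prediction moves the loss seven times more than an equally large error in POS tagging, which is the behaviour the weighting is meant to produce.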

Implementation Details
In this section, we provide some details on our implementation for the CoNLL 2018 shared task (Zeman et al., 2018b).

Training
We trained both monolingual and multilingual models for parsing. In the first case, we simply used the available Universal Dependencies 2.2 corpora for training (Zeman et al., 2018a). In the second case, since the multilingual approach requires both multilingual word embeddings and corresponding training corpora (in the Universal Dependencies 2.2 format), we concatenated the corresponding available Universal Dependencies 2.2 corpora to artificially create multilingual training corpora.
The number of epochs was set to 200, with one epoch processing the entire training corpus in each language and with a batch size of 32. We then picked the five best-performing models to parse the test corpora on TIRA (Potthast et al., 2014). The five models were used as an ensemble run (described in Section 5.2).
Hyperparameters. Each deep learning parser has a number of hyperparameters that can boost the overall performance of the system. In our implementation, most hyperparameter settings were identical to those of Dozat et al. (2017), except of course those concerning the additional features introduced above. We used 100-dimensional character-level word representations with a 200-dimensional MLP, as presented in Section 2, and a 12-dimensional vector for corpus representation. We set the learning rate to 0.002 with Adam optimization.
Multilingual Embeddings. As described in Section 3, we specifically trained multilingual embedding models for nine low-resource languages. Table 2 gives the list of languages for which we adopted this approach, along with the language used for knowledge transfer. We selected language pairs based on previous studies (Lim and Poibeau, 2017) for bxr, kk, kmr, sme, and hsb, and the others were chosen based on the public availability of bilingual dictionaries (this explains why we chose to map several languages with English, even when there was no real linguistically motivated reason to do so). Since we could not find any pre-trained embeddings for pcm_nsc, we applied a delexicalized parsing approach based on an English monolingual model.
ELMo. We used ELMo weights to train specific models for five languages: Korean, French, English, Japanese and Chinese. ELMo weights were pre-trained using the CoNLL resources provided. We used AllenNLP for training, with the default hyperparameters. We included ELMo only at the level of the input layer for both training and inference (we set dropout to 0.5 and used 1024 dimensions for the ELMo embedding layer in our model). All the other hyperparameters are the same as for our other models (without ELMo).

Testing
All the tests were done on the TIRA platform provided by the shared task organizers. During the test phase, we applied an ensemble mechanism using five models trained with two different "seeds". The seeds are integers randomly produced by the Python random library and are used to initialize the two parameters W and w (see Section 2). Generally, an ensemble mechanism combines the best performing models obtained from different seeds, so as to ensure robustness and efficiency. In our case, due to a lack of GPU resources, the different models were trained with only two different seeds. Finally, the five best performing models produced by the two seeds were put together to form the ensemble model. This improved performance by up to 0.6%, but further improvements could be expected by testing with a larger set of seeds.

Hardware Resources
The training process for all the language models with the ensemble and ELMo was done using 32 CPUs and 7 GPUs (Geforce 1080Ti) in approximately two weeks. The memory usage of each model depends on the size of external word embeddings (3GB RAM by default plus the amount needed for loading the external embeddings). In the testing phase on the TIRA platform, we submitted our models separately, since testing with a model trained with ELMo takes around three hours. Testing took 46.2 hours for the 82 corpora using 16 CPUs and 16GB RAM.

Results
In this section, we discuss the results of our system and the relative contributions of the different features to the global results.
Overall results. The official evaluation results are given in Table 1. Our system achieved 73.02 LAS (4th out of 26 teams) and 78.71 UAS (2nd out of 26).
The comparison of our results with those obtained by other teams shows that there is room for improvement regarding preprocessing. For example, our system is 0.86 points below HIT-SCIR (Harbin) for sentence segmentation and 1.03 for tokenization (HIT-SCIR obtained the best overall results). Those two preprocessing tasks (sentence segmentation and tokenization) affect tagging and parsing performance directly. As a result, our parser ranked second on small treebanks (LAS), where most teams used the default segmenter and tokenizer, which removes the differences in this respect. In contrast, we ranked 7th on the big treebanks, probably because there is a more significant gap (1.72) here at the tokenization level.
Corpus Representation. Results with corpus representation (corpora marked tr in column Method of Table 1) exhibit relatively better performance than those without it, since tr makes it possible to capture corpus-oriented features. Results were positive not only for small treebanks (e.g., cs_fictree and ru_taiga) but also for big treebanks (e.g., cs_cac and ru_syntagrus). Corpus representation with ELMo shows the best performance for parsing English and French.
Multilinguality. As described in Section 3, we applied the multilingual approach to most of the low-resource languages. The best result is obtained for hy_armtdp, while sme_giella and hsb_ufal also gave satisfactory results. We only applied the delexicalized approach to pcm_nsc since we could not find any pre-trained embeddings for this language. We got a relatively poor result for pcm_nsc, despite testing different strategies and different feature combinations (we assume that the English model is not fit for it).
Additionally, we found that character-level representation is not always helpful, even in the case of some low-resource languages. When we tested kk_ktb (Kazakh) trained with a Turkish corpus, with multilingual word embeddings and character-level representations, the performance dramatically decreased. We suspect this has to do with the writing systems (Arabic versus Latin), but this hypothesis should be further investigated.
sme_giella is another exceptional case since we chose to use a multilingual model trained with three different languages. Although Russian and Finnish do not use the same writing system, applying character and corpus representation improves the results. This is because the size of the training corpus for sme_giella is around 900 sentences, which seems to be enough to capture its main characteristics.
Language Model (ELMo). We used ELMo embeddings for five languages: Korean, French, English, Japanese and Chinese (they are marked with el in the Method column in Table 1). The experiments with ELMo models showed excellent overall performance. All the English corpora, fr_gsd and fr_sequoia in French, and Korean ko_kaist obtained the best UAS. We also obtained the best LAS for English en_gum and en_pud, and for fr_sequoia in French.
Contributions of the Different System Components to the General Results. To analyze the effect of the proposed representation methods on parsing, we evaluated four different models with different components. We set our baseline model with a token representation as

$$x_i = w_i \bullet c_i \bullet p_i$$

where $w_i$ is a randomly initialized word vector, $c_i$ is a character-level word vector and $p_i$ is a POS vector predicted by UDPipe 1.1 (note that we did not apply our 2018 POS tagger here, since it is trained jointly with the parser and that affects the overall feature representation). We then initialized the word vector $w_i$ with the external word embeddings provided by the CoNLL shared task organizers. We also re-ran the experiment by adding treebank and ELMo representations. The results are shown in Table 3 (em denotes the use of the external word embedding, and tr and el denote treebank and ELMo representations respectively). We observe that each representation improves the overall results. This is especially true regarding LAS when using ELMo (el), which means this representation has a positive effect on relation labeling.
Extrinsic Parser Evaluation (EPE 2018). Participants in the CoNLL shared task were invited to also participate in the 2018 Extrinsic Parser Evaluation (EPE) campaign (Fares et al., 2018), as a way to confirm the applicability of the developed methods on practical tasks. Three downstream tasks were proposed this year in the EPE: biomedical event extraction, negation resolution and opinion analysis (each task was run independently from the others). For this evaluation, participants were only required to send a parsed version of the different corpora received as input back to the organizers using a UD-type format (the organizers then ran the different scripts related to the different tasks and computed the corresponding results).
We trained one single English model for the three tasks using the three English corpora provided (en_lines, en_ewt, en_gum) without treebank embeddings (tr), since we did not know which corpus embedding would perform better. In addition, we did not apply our ensemble process on TIRA since it would have been too time consuming.
Our results are listed in Table 4. They include an intrinsic evaluation (overall performance of the parser on the different corpora considered as a whole) (Nivre and Fang, 2017) and task-specific evaluations (i.e. results for the three different tasks). In the intrinsic evaluation, we obtained the best LAS among all the participating systems, which confirms the portability of our approach across different domains. As for the task-specific evaluations, we obtained the best result for event extraction, but our parser did not perform so well on negation resolution and opinion analysis. This means that specific developments, taking semantics into consideration, would be required to properly address these two tasks.

Conclusion
In this paper, we described the SEx BiST parser (Semantically EXtended Bi-LSTM parser) developed at Lattice for the CoNLL 2018 Shared Task. Our system was an extension of our 2017 parser (Lim and Poibeau, 2017) with three deep contextual representations (multilingual word representation, corpus representation, ELMo representation). It also included a multi-task learning process able to simultaneously handle tagging and parsing. SEx BiST achieved 73.02 LAS (4th out of 26 teams) and 78.72 UAS (2nd out of 26) over the 82 test corpora of the evaluation. In the future, we hope to improve our sentence segmenter and our tokenizer, since this seems to be the most obvious target for improvements to our system. The generalization of the ELMo representation to new languages (beyond what we could do for the 2018 evaluation) should also have a positive effect on the results.