Big Data Small Data, In Domain Out-of Domain, Known Word Unknown Word: The Impact of Word Representations on Sequence Labelling Tasks

Word embeddings -- distributed word representations that can be learned from unlabelled data -- have been shown to have high utility in many natural language processing applications. In this paper, we perform an extrinsic evaluation of five popular word embedding methods in the context of four sequence labelling tasks: POS-tagging, syntactic chunking, NER and MWE identification. A particular focus of the paper is analysing the effects of task-based updating of word representations. We show that when using word embeddings as features, as few as several hundred training instances are sufficient to achieve competitive results, and that word embeddings lead to improvements over OOV words and out of domain. Perhaps more surprisingly, our results indicate there is little difference between the different word embedding methods, and that simple Brown clusters are often competitive with word embeddings across all tasks we consider.


Introduction
Recently, distributed word representations have grown to become a mainstay of natural language processing (NLP), and have been shown to have empirical utility in a myriad of tasks (Collobert and Weston, 2008; Turian et al., 2010; Baroni et al., 2014; Andreas and Klein, 2014). The underlying idea behind distributed word representations is simple: to map each word w in our vocabulary V onto a continuous-valued vector of dimensionality d ≪ |V|. Words that are similar (e.g., with respect to syntax or lexical semantics) will ideally be mapped to similar regions of the vector space, implicitly supporting both generalisation across in-vocabulary (IV) items, and countering the effects of data sparsity for low-frequency and out-of-vocabulary (OOV) items.
Without some means of automatically deriving the vector representations without reliance on labelled data, however, word embeddings would have little practical utility. Fortunately, it has been shown that they can be "pre-trained" from unlabelled text data using various algorithms to model the distributional hypothesis (i.e., that words which occur in similar contexts tend to be semantically similar). Pre-training methods have been refined considerably in recent years, and scaled up to increasingly large corpora.
As with other machine learning methods, it is well known that the quality of the pre-trained word embeddings depends heavily on factors including parameter optimisation, the size of the training data, and the fit with the target application. For example, Turian et al. (2010) showed that the optimal dimensionality for word embeddings is task-specific. One factor which has received relatively little attention in NLP is the effect of "updating" the pre-trained word embeddings as part of the task-specific training, based on self-taught learning (Raina et al., 2007). Updating leads to word representations that are task-specific, but often at the cost of over-fitting low-frequency and OOV words.
In this paper, we perform an extensive evaluation of five word embedding approaches under fixed experimental conditions, applied to four sequence labelling tasks: POS-tagging, full-text chunking, named entity recognition (NER), and multiword expression (MWE) identification. In this, we explore the following research questions: RQ1: are word embeddings better than one-hot unigram features and Brown clusters? RQ2: do word embedding features require less training data? RQ3: does task-specific updating improve all word embeddings across all tasks? RQ4: what is the impact of word embeddings cross-domain and for OOV words? RQ5: overall, are some word embeddings better than others?


Word Representations

Distributional representation methods map each word w to a context word vector C_w, which is constructed directly from co-occurrence counts between w and its context words. The learning methods either store the co-occurrence counts between two words w and i directly in C_wi (Sahlgren, 2006; Turney et al., 2010; Honkela, 1997), or project the co-occurrence counts between words into a lower-dimensional space (Řehůřek and Sojka, 2010; Lund and Burgess, 1996), using dimensionality reduction techniques such as SVD (Dumais et al., 1988) and LDA (Blei et al., 2003).
Cluster-based representation methods build clusters of words by applying either soft or hard clustering algorithms (Lin and Wu, 2009; Li and McCallum, 2005). Some of them also rely on a co-occurrence matrix of words (Pereira et al., 1993). The Brown clustering algorithm (Brown et al., 1992) is the best-known method in this category.
Distributed representation methods map each word into a dense, low-dimensional, continuous-valued vector x ∈ R^d, where d is referred to as the word dimension.

Selected Word Representations
Over a range of sequence labelling tasks, we evaluate five methods for inducing word representations: Brown clustering (Brown et al., 1992) ("BROWN"), the neural language model of Collobert & Weston ("CW") (Collobert et al., 2011), the continuous bag-of-words model ("CBOW") (Mikolov et al., 2013a), the continuous skip-gram model ("SKIP-GRAM") (Mikolov et al., 2013b), and Global Vectors ("GLOVE") (Pennington et al., 2014). With the exception of CW, all have been shown to be at or near state-of-the-art in recent empirical studies (Turian et al., 2010; Pennington et al., 2014). CW is included because it was highly influential in earlier research, and the pre-trained embeddings are still used to some degree in NLP. The training of these word representations is unsupervised: the common underlying idea is to predict the occurrence of words in the neighbouring context. Their training objectives share the same form, namely a sum of local training factors:

J = Σ_{w∈V} J(w, ctx(w))

where V is the vocabulary of a given corpus, and ctx(w) denotes the local context of word w. The local context of a word can either be its previous k words, or the k words surrounding it. Local training factors are designed to capture the relationship between w and its local contexts of use, either by predicting w based on its local context, or by using w to predict the context words. Other than BROWN, which utilises a cluster-based representation, all of the methods employ a distributed representation.
The starting point for CBOW and SKIP-GRAM is to employ softmax to predict word occurrence:

J(w, ctx(w)) = log p(w | ctx(w)) = log [ exp(v_w^T v_ctx(w)) / Σ_{w'∈V} exp(v_{w'}^T v_ctx(w)) ]

where v_ctx(w) denotes the distributed representation of the local context of word w. CBOW derives v_ctx(w) by averaging over the context words; that is, it estimates the probability of each w given its local context. In contrast, SKIP-GRAM applies softmax to each context word of a given occurrence of word w; in this case, v_ctx(w) corresponds to the representation of one of its context words, and the model can be characterised as predicting context words based on w. In practice, softmax is too expensive to compute over large corpora, and thus Mikolov et al. (2013b) use hierarchical softmax and negative sampling to scale up the training.

CW considers the local context of a word w to be the m words to the left and the m words to the right of w. The concatenation of the embeddings of w and all its context words is taken as input to a neural network with one hidden layer, which produces a higher-level score f(w) ∈ R. The learning procedure then replaces the embedding of w with that of a randomly sampled word w', and generates a second score f(w') with the same neural network. The training objective is to maximise the difference between the two via a margin loss:

J(w, ctx(w)) = max(0, 1 − f(w) + f(w'))

This approach can be regarded as negative sampling with only one negative example.
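The CBOW local factor can be made concrete with a toy example. The sketch below (with invented 2-d vectors and helper names of our own, not the original implementation) computes log p(w | ctx(w)) with a full softmax over a four-word vocabulary, representing the context as the average of its word vectors as CBOW does:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cbow_local_factor(target, context, emb, vocab):
    """log p(target | context) under a full softmax, with the context
    represented as the average of its word vectors (as in CBOW)."""
    d = len(emb[target])
    v_ctx = [sum(emb[c][i] for c in context) / len(context) for i in range(d)]
    scores = {w: math.exp(dot(emb[w], v_ctx)) for w in vocab}
    return math.log(scores[target] / sum(scores.values()))

# Toy 2-d embeddings, illustrative values only.
emb = {"dog": [1.0, 0.2], "cat": [0.9, 0.1], "runs": [0.0, 1.0], "the": [0.1, 0.0]}
vocab = list(emb)
lp = cbow_local_factor("dog", ["the", "runs"], emb, vocab)  # a log-probability, always < 0
```

As the text notes, this exhaustive normalisation over V is exactly the step that becomes too expensive at corpus scale, motivating hierarchical softmax and negative sampling.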
GLOVE assumes that the dot product of two word embeddings should be similar to the logarithm of the co-occurrence count X_ij of the two words. As such, the local factor J(w, ctx(w)) becomes:

J(w_i, w_j) = g(X_ij) (v_i^T v_j + b_i + b_j − log X_ij)^2

where b_i and b_j are the bias terms of words i and j, respectively, and g(X_ij) is a weighting function based on the co-occurrence count. This weighting function controls the degree of agreement between the parametric function v_i^T v_j + b_i + b_j and log X_ij: frequently co-occurring word pairs are given larger weight than infrequent pairs, up to a threshold.
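As a sketch of this local factor, the snippet below evaluates g(X_ij)(v_i^T v_j + b_i + b_j − log X_ij)^2 for one word pair. The weighting function and its constants (x_max = 100, α = 0.75) follow the defaults of Pennington et al. (2014); the vectors, biases and count are illustrative only:

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    # g(X_ij): increases with the co-occurrence count, capped at 1 beyond x_max.
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_local_factor(v_i, v_j, b_i, b_j, x_ij):
    """Weighted squared error between v_i^T v_j + b_i + b_j and log(X_ij)."""
    score = sum(a * b for a, b in zip(v_i, v_j)) + b_i + b_j
    return glove_weight(x_ij) * (score - math.log(x_ij)) ** 2

# Illustrative vectors, biases and co-occurrence count only.
j = glove_local_factor([0.3, -0.1], [0.2, 0.4], 0.1, 0.0, x_ij=50.0)
```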
BROWN partitions words into a finite set of word classes. The conditional probability of seeing the next word is defined to be:

p(w_k | w_{k−m}^{k−1}) = p(w_k | h_k) p(h_k | h_{k−m}^{k−1})

where h_k denotes the word class of word w_k, w_{k−m}^{k−1} are the previous m words, and h_{k−m}^{k−1} are their respective word classes. Since there is no tractable method to find an optimal partition of word classes, the method uses only a bigram class model, and utilises hierarchical clustering as an approximation method to find a sufficiently good partition of words.
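The class bigram factorisation can be sketched over a toy corpus with a hypothetical hard clustering (the cluster labels below are invented for illustration, not induced by the Brown algorithm):

```python
from collections import Counter

# Toy corpus with an invented hard clustering of its vocabulary.
sentences = [["the", "dog", "runs"], ["the", "cat", "runs"], ["a", "dog", "sleeps"]]
word_class = {"the": "DET", "a": "DET", "dog": "N", "cat": "N", "runs": "V", "sleeps": "V"}

word_counts = Counter(w for s in sentences for w in s)
class_counts = Counter(word_class[w] for s in sentences for w in s)
bigram_counts = Counter(
    (word_class[s[i - 1]], word_class[s[i]]) for s in sentences for i in range(1, len(s))
)

def class_bigram_prob(w_k, h_prev):
    """p(w_k | h_{k-1}) = p(w_k | h_k) * p(h_k | h_{k-1})."""
    h_k = word_class[w_k]
    return (word_counts[w_k] / class_counts[h_k]) * (
        bigram_counts[(h_prev, h_k)] / class_counts[h_prev]
    )
```

Given a preceding class (here DET), the probabilities p(w_k | h_{k−1}) over the vocabulary sum to one, as a proper class bigram model requires.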

Building Word Representations
For a fair comparison, we train BROWN, CBOW, SKIP-GRAM, and GLOVE on a fixed corpus, comprised of freely available corpora, as detailed in Tab. 1. The joint corpus was preprocessed with the Stanford CoreNLP sentence splitter and tokeniser. All consecutive digit substrings were replaced by NUMf, where f is the length of the digit substring (e.g., 10.20 is replaced by NUM2.NUM2). Due to the computational complexity of the pre-training, for CW we simply downloaded the pre-compiled embeddings from: http://metaoptimize.com/projects/wordreprs.

Table 1 :
Corpora used to pre-train the word embeddings
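The digit normalisation step above can be reproduced with a single regular expression (the function name is ours):

```python
import re

def normalise_digits(text):
    # Replace each maximal run of digits with NUMf, where f is the run length,
    # e.g., "10.20" -> "NUM2.NUM2".
    return re.sub(r"\d+", lambda m: "NUM%d" % len(m.group()), text)
```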
The dimensionality d of the word embeddings and the size m of the context window are the key hyperparameters when learning distributed representations, and we train word embeddings on the combined corpus using all combinations of a range of values for each. BROWN requires only the number of clusters as a hyperparameter; we perform clustering with b ∈ {250, 500, 1000, 2000, 4000} clusters.

Sequence Labelling Tasks
We evaluate the different word representations over four sequence labelling tasks: POS-tagging ("POS-tagging"), full-text chunking ("Chunking"), NER ("NER") and MWE identification ("MWE"). For each task, we feed features into a first-order linear-chain graph transformer (Collobert et al., 2011) made up of two layers: the upper layer is identical to a linear-chain CRF (Lafferty et al., 2001), and the lower layer consists of word representation and hand-crafted features. If we treat the word representations as fixed, the graph transformer is a simple linear-chain CRF. On the other hand, if we treat the word representations as model parameters, the model is equivalent to a neural network with word embeddings as the input layer. We train all models using AdaGrad (Duchi et al., 2011).
As in Turian et al. (2010), at each word position we construct word representation features from the words in a context window of size two to either side of the target word, based on the pre-trained representation of each word type. For BROWN, the features are prefix features extracted from the word clusters, in the same way as Turian et al. (2010). As a baseline (and to test RQ1), we include a one-hot representation (which is equivalent to a linear-chain CRF with only lexical context features).
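A minimal sketch of this feature extraction is given below. The zero-padding at sentence boundaries and for unseen types is our own assumption, while the bit-string prefix lengths (4, 6, 10 and 20) follow Turian et al. (2010):

```python
def window_features(tokens, i, emb, win=2, dim=2):
    """Concatenate the pre-trained embeddings of the tokens in [i-win, i+win],
    padding with zero vectors at sentence boundaries and for out-of-vocabulary
    types (the padding scheme is our assumption)."""
    zero = [0.0] * dim
    feats = []
    for j in range(i - win, i + win + 1):
        vec = emb.get(tokens[j], zero) if 0 <= j < len(tokens) else zero
        feats.extend(vec)
    return feats

def brown_prefix_features(bitstring, lengths=(4, 6, 10, 20)):
    """Path-prefix features from a word's Brown cluster bit string
    (prefix lengths as in Turian et al. (2010))."""
    return ["p%d=%s" % (l, bitstring[:l]) for l in lengths if len(bitstring) >= l]

# Toy 2-d embeddings, illustrative values only.
emb = {"the": [0.1, 0.2], "dog": [0.3, 0.4]}
feats = window_features(["the", "dog"], 0, emb)  # 5 positions x 2 dimensions
```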
Our hand-crafted features for POS-tagging, Chunking and MWE are those used by Collobert et al. (2011), Turian et al. (2010) and Schneider et al. (2014b), respectively. For NER, we use the same feature space as Turian et al. (2010), except for the previous two predictions, because we want to evaluate all word representations with the same type of model: a first-order graph transformer.
In training the distributed word representations, we consider two settings: (1) the word representations are fixed during sequence model training; and (2) the graph transformer updates the token-level word representations during training. As outlined in Tab. 2, for each sequence labelling task, we experiment over the de facto corpus, based on pre-existing training-dev-test splits where available:

POS-tagging: the Wall Street Journal portion of the Penn Treebank (Marcus et al. (1993): "WSJ") with Penn POS tags

Chunking: the Wall Street Journal portion of the Penn Treebank ("WSJ"), converted into IOB-style full-text chunks using the CoNLL conversion scripts for training and development, and the WSJ-derived CoNLL-2000 full-text chunking test data for testing (Tjong Kim Sang and Buchholz, 2000)

NER: the English portion of the CoNLL-2003 English Named Entity Recognition data set, for which the source data was taken from Reuters newswire articles (Tjong Kim Sang and De Meulder (2003): "Reuters")

MWE: the MWE dataset of Schneider et al. (2014b), over a portion of text from the English Web Treebank ("EWT")

For all tasks other than MWE, we additionally have an out-of-domain test set, in order to evaluate the out-of-domain robustness of the different word representations, with and without updating. These datasets are as follows:

POS-tagging: the English Web Treebank with Penn POS tags ("EWT")

Chunking: the Brown Corpus portion of the Penn Treebank ("Brown"), converted into IOB-style full-text chunks using the CoNLL conversion scripts

NER: the MUC-7 named entity recognition corpus ("MUC7")

For reproducibility, we tuned the hyperparameters with random search over the development data for each task (Bergstra and Bengio, 2012). In this, we randomly sampled 50 distinct hyperparameter sets with the same random seed for the non-updating models (i.e., the models that do not update the word representations), and sampled 100 distinct hyperparameter sets for the updating models (i.e., the models that do). For each set of hyperparameters and task, we train a model over the training set and choose the best one based on its performance on the development data (Turian et al., 2010). We also tune the word representation hyperparameters, namely the word vector size d and context window size m (distributed representations), and, in the case of BROWN, the number of clusters.
For the updating models, we found that the results over the test data were always inferior to those of models that do not update the word representations, due to the higher number of hyperparameters and the small sample size (i.e., 100). Since the two-layer model of the graph transformer contains a distinct set of hyperparameters for each layer, we reuse the best-performing hyperparameter settings from the non-updating models, and only tune the hyperparameters of AdaGrad for the word representation layer. This method requires only 32 additional runs, and achieves consistently better results than 100 random draws.
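Seeded random search of the kind described can be sketched as follows. The search space shown is hypothetical (the paper's actual ranges are not listed here); the point is that a fixed seed makes the sampled configurations exactly reproducible:

```python
import random

def sample_configs(space, n, seed=42):
    """Draw n distinct hyperparameter sets from `space` with a fixed seed,
    so that the search can be replicated exactly."""
    rng = random.Random(seed)
    seen, configs = set(), []
    while len(configs) < n:
        cfg = tuple(sorted((k, rng.choice(v)) for k, v in space.items()))
        if cfg not in seen:
            seen.add(cfg)
            configs.append(dict(cfg))
    return configs

# Hypothetical search space for illustration only.
space = {
    "learning_rate": [0.001, 0.01, 0.1, 1.0],
    "l2": [0.0, 1e-6, 1e-4],
    "d": [25, 50, 100, 200],
}
configs = sample_configs(space, 10)
```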
In order to test the impact of the volume of training data on the different models (RQ2), we split the training set into 10 partitions based on a base-2 log scale (i.e., the second-smallest partition is twice the size of the smallest), created 10 successively larger training sets by merging the partitions from smallest to largest, and used each of these to train a model. From these, we construct learning curves over each task.
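The base-2 log-scale splits can be constructed by placing cumulative cut points proportional to 2^k − 1; a sketch in our own formulation:

```python
def log_scale_splits(instances, n_parts=10):
    """Split `instances` into n_parts partitions whose sizes follow a base-2
    log scale (each partition roughly double the previous one), and return
    the successively merged training sets."""
    n = len(instances)
    total = 2 ** n_parts - 1
    cuts = [round(n * (2 ** k - 1) / total) for k in range(n_parts + 1)]
    return [instances[: cuts[k]] for k in range(1, n_parts + 1)]

# With 10230 instances the partition sizes double exactly: 10, 20, 40, ..., 5120.
sets = log_scale_splits(list(range(10230)))
```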
For ease of comparison with previous results, we evaluate both in-domain and out-of-domain using chunk/entity/expression-level F1-measure ("F1") for all tasks except POS-tagging, for which we use token-level accuracy ("ACC"). To test performance over OOV (unknown) tokens (i.e., words that do not occur in the training set), we additionally report token-level accuracy over OOV tokens (Fig. 2).
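OOV evaluation of this kind reduces to token-level accuracy restricted to tokens outside the training vocabulary; a minimal sketch with invented toy data:

```python
def oov_accuracy(train_vocab, test_tokens, gold_tags, pred_tags):
    """Token-level accuracy over test tokens that never occur in training."""
    pairs = [
        (g, p)
        for w, g, p in zip(test_tokens, gold_tags, pred_tags)
        if w not in train_vocab
    ]
    return sum(g == p for g, p in pairs) / len(pairs) if pairs else None

# Invented toy data: "platypus" and "waddles" are OOV with respect to training.
acc = oov_accuracy(
    {"the", "dog", "runs"},
    ["the", "platypus", "waddles"],
    ["DET", "NOUN", "VERB"],
    ["DET", "NOUN", "NOUN"],
)
```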

Experimental Results and Discussion
We structure our evaluation by stepping through each of our five research questions (RQ1-5) from the start of the paper. In this, we make reference to: (1) the best-performing method both in- and out-of-domain vs. the state of the art (Tab. 3); (2) a heat map for each task indicating the convergence rate for each word representation, with and without updating (Fig. 1); (3) OOV accuracy both in-domain and out-of-domain for each task (Fig. 2); and (4) visualisation of the impact of updating on word embeddings, based on t-SNE (Fig. 3).
RQ1: Are word embeddings better than one-hot unigram features and Brown clusters? As shown in Tab. 3, the best-performing method for every task except in-domain Chunking is a word embedding method, although the precise method varies greatly. Fig. 1, on the other hand, tells a more subtle story: the difference between UNIGRAM and the other word representations is relatively modest, especially as the amount of training data increases. Additionally, the difference between BROWN and the word embedding methods is modest across all tasks. So, the overall answer would appear to be: yes relative to unigrams when there is little training data, but not really relative to BROWN.
RQ2: Do word embedding features require less training data? Fig. 1 shows that for POS-tagging and NER, with only several hundred training instances, word embedding features achieve superior results to UNIGRAM. For example, when trained with 561 instances, the POS-tagging model using SKIP-GRAM+UP embeddings is 5.3% above UNIGRAM; and when trained with 932 instances, the NER model using SKIP-GRAM is 11.7% above UNIGRAM. Similar improvements are also found for other types of word embeddings and BROWN when the training set is small. However, all word representations perform similarly for Chunking regardless of training data size. For MWE, BROWN performs slightly better than the other methods when trained with approximately 25% of the training instances. Therefore, we conjecture that the POS-tagging and NER tasks benefit more from distributional similarity than Chunking and MWE.
RQ3: Does task-specific updating improve all word embeddings across all tasks? Based on Fig. 1, updating word representations can both correct poorly-learned representations and harm well pre-trained representations, the latter due to overfitting. For example, CW and GLOVE perform significantly worse than SKIP-GRAM on both POS-tagging and NER without updating, but with updating, the gap between their results and the best-performing method becomes smaller. In contrast, SKIP-GRAM performs worse over the test data with updating, despite its results on the development set improving by 1%.
To further investigate the effects of updating, we sampled 60 words and plotted the changes in their word embeddings under updating, using 2-d vector fields generated with matplotlib and t-SNE (van der Maaten and Hinton, 2008). Half of the words were chosen manually to include known word clusters such as days of the week and names of countries; the other half were selected randomly. Additional plots with 100 randomly-sampled words and the top-100 most frequent words, for all the methods and all the tasks, can be found in the supplementary material and at https://123abc123abd.wordpress.com/. In each plot, a single arrow signifies one word, pointing from the position of the original word embedding to the updated representation.
In Fig. 3, we show vector field plots for Chunking and NER using SKIP-GRAM embeddings. For Chunking, most of the vectors changed with similar magnitude, but in very different directions, including within the clusters of days of the week and country names. In contrast, for NER, there was more homogeneous change in word vectors belonging to the same cluster. This greater consistency is further evidence that semantic homogeneity appears to be more beneficial for NER than for Chunking.

Table 3 :
In-domain and out-of-domain test results for the best-performing method for each task, against the benchmark. POS-tagging (ACC): benchmark 0.972 (Toutanova et al., 2003); in-domain 0.959 (SKIP-GRAM+UP); out-of-domain 0.910 (SKIP-GRAM).
RQ4: What is the impact of word embeddings cross-domain and for OOV words? As shown in Tab. 3, results predictably drop when we evaluate out of domain. The difference is most pronounced for Chunking, where there is an absolute drop in F1 of around 30% for all methods, indicating that word embeddings and unigram features provide similar information for Chunking.
Another interesting observation is that updating often hurts out-of-domain performance, because the data distribution differs between domains. This suggests that, if the objective is to optimise performance across domains, it is best not to perform updating.
We also analyse performance on OOV words, both in-domain and out-of-domain, in Fig. 2. As expected, word embeddings and BROWN excel in out-of-domain OOV performance. Consistent with our overall observations about cross-domain generalisation, the OOV results are better when updating is not performed.
RQ5: Overall, are some word embeddings better than others? Comparing the different word embedding techniques over our four sequence labelling tasks, for the different evaluations (overall, out-of-domain and OOV), there is no clear winner among the word embeddings; for POS-tagging, SKIP-GRAM appears to have a slight advantage, but no single method is consistently superior.

While the aim of this paper was not to achieve the state of the art over the respective tasks, it is important to concede that our best (in-domain) results for NER, POS-tagging and Chunking are slightly below the state of the art (Tab. 3). The 2.7% difference between our NER system and the best-performing system is due to the fact that we use a first-order instead of a second-order CRF (Ando and Zhang, 2005); for the other tasks, there are similar differences in the learner and the complexity of the features used. Another difference is that we tuned the hyperparameters with random search, to enable replication using the same random seed. In contrast, the hyperparameters of the state-of-the-art methods were tuned more extensively by experts, making them more difficult to reproduce.
Related Work
Bansal et al. (2014) reported that direct use of word embeddings in dependency parsing did not show an improvement. They achieved an improvement only when they performed hierarchical clustering of the word embeddings, and used features extracted from the cluster hierarchy. In a similar vein, Andreas and Klein (2014) explored the use of word embeddings for constituency parsing, and concluded that the information contained in word embeddings may duplicate that acquired by a syntactic parser, unless the training set is extremely small. Other syntactic parsing studies that have reported improvements through the use of word embeddings include Koo et al. (2008), Koo et al. (2010), Haffari et al. (2011) and Tratz and Hovy (2011).
Word embeddings have also been applied to other (non-sequential NLP) tasks like grammar induction (Spitkovsky et al., 2011), and semantic tasks such as semantic relatedness, synonymy detection, concept categorisation, selectional preference learning and analogy (Baroni et al., 2014).

Conclusions
We have performed an extensive extrinsic evaluation of five word embedding methods under fixed experimental conditions, and evaluated their applicability to four sequence labelling tasks: POS-tagging, Chunking, NER and MWE identification. We found that word embedding features reliably outperformed unigram features, especially with limited training data, but that there was relatively little difference over Brown clusters, and no one embedding method was consistently superior across the different tasks and settings. Word embeddings and Brown clusters were also found to improve performance out-of-domain and for OOV words. We expected a performance gap between the fixed and task-updated embeddings, but the observed difference was marginal. Indeed, we found that updating can result in overfitting. We also carried out preliminary analysis of the impact of updating on the vectors, a direction which we intend to pursue further.

Figure 1 :
Figure 1: Results for each type of word representation over POS-tagging, Chunking, NER and MWE, optionally with updating ("+UP"). The y-axis indicates the training data sizes (on a log scale). Green = high performance, and red = low performance, based on a linear scale of the best- to worst-result for each task.

Figure 2 :
Figure 2: ACC over out-of-vocabulary (OOV) words for in-domain and out-of-domain test sets.

Figure 3 :
Figure 3: A t-SNE plot of the impact of updating on SKIP-GRAM

Table 2 :
Training, development and test (in-and out-of-domain)data for each sequence labelling task.
Previous work has found that BROWN clustering enhances Twitter POS tagging and MWE identification (Schneider et al., 2014a). Compared to previous work, we consider a wider range of word representations, including the most recent methods, and evaluate them over more sequence labelling tasks, wherein the models are trained with training sets of varying size.