Online Updating of Word Representations for Part-of-Speech Tagging

We propose online unsupervised domain adaptation (DA), which is performed incrementally as data comes in and is applicable when batch DA is not possible. In a part-of-speech (POS) tagging evaluation, we find that online unsupervised DA performs as well as batch DA.


Introduction
Unsupervised domain adaptation is a scenario that practitioners often face when having to build robust NLP systems. They have labeled data in the source domain, but wish to improve performance in the target domain by making use of unlabeled data alone. Most work on unsupervised domain adaptation in NLP uses batch learning: it assumes that a large corpus of unlabeled data of the target domain is available before testing. However, batch learning is not possible in many real-world scenarios where incoming data from a new target domain must be processed immediately. More importantly, in many real-world scenarios the data does not come with neat domain labels and it may not be immediately obvious that an input stream is suddenly delivering data from a new domain.
Consider an NLP system that analyzes emails at an enterprise. There is a constant stream of incoming emails and it changes over time, without any clear indication that the models in use should be adapted to the new data distribution. Because the system needs to work in real time, it is also desirable to do any adaptation of the system online, without the need to stop the system, change it and restart it as is done in batch mode.
In this paper, we propose online unsupervised domain adaptation as an extension to traditional unsupervised DA. In online unsupervised DA, domain adaptation is performed incrementally as data comes in. Specifically, we adopt a form of representation learning. In our experiments, the incremental updating will be performed for representations of words. Each time a word is encountered in the stream of data at test time, its representation is updated.
To the best of our knowledge, the work reported here is the first study of online unsupervised DA. More specifically, we evaluate online unsupervised DA for the task of POS tagging. We compare POS tagging results for three distinct approaches: static (the baseline), batch learning and online unsupervised DA. Our results show that online unsupervised DA is comparable in performance to batch learning while requiring no retraining or prior data in the target domain.

Experimental setup
Tagger. We reimplemented the FLORS tagger (Schnabel and Schütze, 2014), a fast and simple tagger that performs well in DA. It treats POS tagging as a window-based (as opposed to sequence classification), multilabel classification problem. FLORS is ideally suited for online unsupervised DA because its representation of words includes distributional vectors; these vectors can be easily updated in both batch learning and online unsupervised DA. More specifically, a word's representation in FLORS consists of four feature vectors: one each for its suffix, its shape and its left and right distributional neighbors. Suffix and shape features are standard features used in the literature; our use of them is exactly as described by Schnabel and Schütze (2014).
Distributional features. The $i$-th entry $x_i$ of the left distributional vector of $w$ is the weighted number of times the indicator word $c_i$ occurs immediately to the left of $w$:

$$x_i = \mathrm{tf}(\mathrm{freq}(\mathrm{bigram}(c_i, w)))$$

where $c_i$ is the word with frequency rank $i$ in the corpus, $\mathrm{freq}(\mathrm{bigram}(c_i, w))$ is the number of occurrences of the bigram "$c_i\ w$" and we weight non-zero frequencies logarithmically: $\mathrm{tf}(x) = 1 + \log(x)$. The right distributional vector is defined analogously. We restrict the set of indicator words to the $n = 500$ most frequent words. To avoid zero vectors, we add an entry $x_{n+1}$ to each vector that counts omitted contexts:

$$x_{501} = \mathrm{tf}\Big(\sum_{j:\, j>n} \mathrm{freq}(\mathrm{bigram}(c_j, w))\Big)$$

Let $f(w)$ be the concatenation of the two distributional vectors and the suffix and shape vectors of word $w$. Then FLORS represents token $v_i$ as follows:

$$f(v_{i-2}) \oplus f(v_{i-1}) \oplus f(v_i) \oplus f(v_{i+1}) \oplus f(v_{i+2})$$

where $\oplus$ is vector concatenation. FLORS then tags token $v_i$ based on this representation.
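To make the definition of the left distributional vector concrete, the following is a minimal Python sketch, not the FLORS implementation. The function names `tf` and `left_vector`, the toy corpus and the choice of n = 2 indicator words are assumptions made for illustration only; the paper uses n = 500.

```python
import math
from collections import Counter

def tf(x):
    """Logarithmic weighting of non-zero frequencies: tf(x) = 1 + log(x).
    Zero frequencies stay zero."""
    return 1.0 + math.log(x) if x > 0 else 0.0

def left_vector(word, bigram_freq, indicators):
    """Left distributional vector of `word` with n + 1 entries.

    bigram_freq: Counter mapping (left_word, right_word) -> count
    indicators:  the n indicator words c_1..c_n, ordered by frequency rank
    """
    indicator_set = set(indicators)
    # Entries 1..n: weighted count of each indicator word directly left of `word`.
    vec = [tf(bigram_freq[(c, word)]) for c in indicators]
    # Entry n+1: weighted total count of all left contexts that are not indicators.
    omitted = sum(count for (left, right), count in bigram_freq.items()
                  if right == word and left not in indicator_set)
    vec.append(tf(omitted))
    return vec

# Toy corpus (hypothetical data).
tokens = "the cat sat . the cat ran . a cat slept . my cat ate . her cat left".split()
bigram_freq = Counter(zip(tokens, tokens[1:]))
indicators = ["the", "a"]  # n = 2 here for readability
vec = left_vector("cat", bigram_freq, indicators)
```

In this toy example "the cat" occurs twice, "a cat" once, and the two non-indicator contexts "my" and "her" are pooled into the final entry, so the vector never becomes all zeros for a word that has any left context at all.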
FLORS assumes that the association between distributional features and labels does not change fundamentally when going from source to target. This is in contrast to other work, notably Blitzer et al. (2006), that carefully selects "stable" distributional features and discards "unstable" distributional features. The hypothesis underlying FLORS is that basic distributional POS properties are relatively stable across domains, in contrast to semantic and other more complex tasks. The high performance of FLORS (Schnabel and Schütze, 2014) suggests this hypothesis is true.
Data. Test set. We evaluate on the development sets of six different target domains (TDs): five SANCL (Petrov and McDonald, 2012) domains (newsgroups, weblogs, reviews, answers, emails) and sections 22-23 of WSJ for in-domain testing.
We use two training sets of different sizes. In condition l:big (big labeled data set), we train FLORS on sections 2-21 of Wall Street Journal (WSJ). Condition l:small uses 10% of l:big.
Data for word representations. We also vary the size of the datasets that are used to compute the word representations before the FLORS model is trained on the training set. In condition u:big, we compute distributional vectors on the joint corpus of all labeled and unlabeled text of source and target domains (except for the test sets). We also include 100,000 WSJ sentences from 1988 and 500,000 sentences from Gigaword (Parker, 2009). In condition u:0, only labeled training data is used.
Methods. We implemented the following modification compared to the setup in Schnabel and Schütze (2014): distributional vectors are kept in memory as count vectors. This allows us to increase the counts during online tagging.
We run experiments with three versions of FLORS: STATIC, BATCH and ONLINE. All three methods compute word representations on "data for word representations" (described above) before the model is trained on one of the two "training sets" (described above).
STATIC. Word representations are not changed during testing.
BATCH. Before testing, we update count vectors by $\mathrm{freq}(\mathrm{bigram}(c_i, w))$ += $\mathrm{freq}^*(\mathrm{bigram}(c_i, w))$, where $\mathrm{freq}^*(\cdot)$ denotes the number of occurrences of the bigram "$c_i\ w$" in the entire test set.
ONLINE. Before tagging a test sentence, both left and right distributional vectors are updated via $\mathrm{freq}(\mathrm{bigram}(c_i, w))$ += 1 for each appearance of bigram "$c_i\ w$" in the sentence. Then the sentence is tagged using the updated word representations. As tagging progresses, the distributional representations become increasingly specific to the target domain (TD), converging to the representations that BATCH uses at the end of the tagging process.
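The difference between the BATCH and ONLINE updates can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the released FLORS code: counts are held in a single `Counter` over bigrams, sentence-boundary handling is ignored, and `tag_sentence` is a hypothetical callable standing in for the trained tagger. STATIC corresponds to calling the tagger with the source counts unchanged.

```python
from collections import Counter

def bigrams(sentence):
    """All adjacent word pairs (c, w) in a tokenized sentence."""
    return list(zip(sentence, sentence[1:]))

def batch_update(counts, test_sentences):
    """BATCH: add the bigram counts of the entire test set before tagging."""
    updated = Counter(counts)  # copy, so the source counts stay untouched
    for sentence in test_sentences:
        updated.update(bigrams(sentence))
    return updated

def online_tagging(counts, test_sentences, tag_sentence):
    """ONLINE: before tagging each sentence, add 1 for each bigram occurrence
    in that sentence, then tag with the counts accumulated so far."""
    counts = Counter(counts)
    outputs = []
    for sentence in test_sentences:
        counts.update(bigrams(sentence))
        outputs.append(tag_sentence(sentence, counts))
    return outputs, counts

# Hypothetical toy data and a stub "tagger" that only reports one count,
# to show how the representation seen by the tagger changes over time.
source_counts = Counter({("the", "bank"): 3})
test = [["the", "bank", "closed"], ["the", "bank", "opened"]]
stub_tagger = lambda sentence, counts: counts[("the", "bank")]
outputs, final_counts = online_tagging(source_counts, test, stub_tagger)
batch_counts = batch_update(source_counts, test)
```

The stub shows the property described above: the counts visible to the tagger grow sentence by sentence, and after the last sentence the ONLINE counts coincide with the BATCH counts.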
In all three modes, suffix and shape features are always fully specified, for both known and unknown words. Table 1 shows that our setup with these three changes (lines BATCH and ONLINE) has state-of-the-art performance on SANCL for domain adaptation (bold numbers).

Experimental results
Table 2 investigates the effect of sizes of labeled and unlabeled data on performance of ONLINE and BATCH. We report accuracy for all (ALL) tokens, for tokens occurring in both l:big and l:small (KN), tokens occurring in neither l:big nor l:small (OOV) and tokens occurring in l:big, but not in l:small (SHFT). Except for some minor variations in a few cases, both using more labeled data and using more unlabeled data improves tagging accuracy for both ONLINE and BATCH. ONLINE and BATCH are generally better than or as good as STATIC (in bold), always on ALL and OOV, and with a few exceptions also on KN and SHFT.
ONLINE performance is comparable to BATCH performance: it is slightly worse than BATCH on u:0 (largest ALL difference is .29) and at most .02 different from BATCH for ALL on u:big. We explain below why ONLINE is sometimes (slightly) better than BATCH, e.g., for ALL and condition l:small/u:big.
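The token categories used in Table 2 can be illustrated with a short Python sketch. It assumes that the l:small vocabulary is a subset of the l:big vocabulary (l:small is 10% of l:big); the function names and the toy data are hypothetical.

```python
from collections import defaultdict

def category(word, vocab_big, vocab_small):
    """Assign a token to KN, SHFT or OOV.
    Assumes vocab_small is a subset of vocab_big."""
    if word in vocab_small:
        return "KN"    # occurs in both l:big and l:small
    if word in vocab_big:
        return "SHFT"  # occurs in l:big, but not in l:small
    return "OOV"       # occurs in neither training set

def accuracy_by_category(words, gold, pred, vocab_big, vocab_small):
    """Tagging accuracy for ALL tokens and for each category that occurs."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for word, g, p in zip(words, gold, pred):
        for key in ("ALL", category(word, vocab_big, vocab_small)):
            total[key] += 1
            correct[key] += int(g == p)
    return {key: correct[key] / total[key] for key in total}

# Hypothetical toy example.
vocab_big = {"the", "stock", "rose", "sharply"}
vocab_small = {"the", "stock"}
words = ["the", "stock", "rose", "lol"]
gold = ["DT", "NN", "VBD", "UH"]
pred = ["DT", "NN", "NN", "UH"]
acc = accuracy_by_category(words, gold, pred, vocab_big, vocab_small)
```

SHFT tokens are the interesting case when comparing l:big and l:small: they are known words under one training condition and unknown words under the other.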

Time course of tagging accuracy
The ONLINE model introduced in this paper has a property that sets it apart from most other work in statistical NLP: its predictions change as it tags text because its representations change.
To study this time course of changes, we need a large application domain because subtle changes will be too variable in the small test sets of the SANCL TDs. The only labeled domain that is big enough is the WSJ corpus. We therefore reverse the standard setup and train the model on the dev sets of the five SANCL domains (l:big) or on the first 5000 labeled words of reviews (l:small). In this reversed setup, u:big uses the five unlabeled SANCL data sets and Gigaword as before. Since variance of performance is important, we run on 100 randomly selected 50% samples of WSJ and report average and standard deviation of tagging error over these 100 trials.
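The repeated-sampling protocol can be sketched in Python as below. Several details are assumptions because the text does not specify them: sampling is done at the sentence level and without replacement, the original order of the stream is preserved within each sample, and `evaluate` is a hypothetical callable that runs a tagger on a sample and returns its error rate.

```python
import random
import statistics

def error_over_samples(sentences, evaluate, n_trials=100, fraction=0.5, seed=0):
    """Mean and standard deviation of the error rate over random subsamples.

    sentences: the full test stream (any list of items)
    evaluate:  callable mapping a sample to an error rate (hypothetical)
    """
    rng = random.Random(seed)
    k = int(len(sentences) * fraction)
    rates = []
    for _ in range(n_trials):
        # Draw k positions without replacement and keep the stream order.
        idx = sorted(rng.sample(range(len(sentences)), k))
        rates.append(evaluate([sentences[i] for i in idx]))
    return statistics.mean(rates), statistics.stdev(rates)

# Toy stand-in: each "sentence" is just an error flag (1 = tagging error),
# so the evaluator is the fraction of flagged items in the sample.
toy_stream = [1, 0, 0, 0] * 25
toy_evaluate = lambda sample: sum(sample) / len(sample)
mean_err, std_err = error_over_samples(toy_stream, toy_evaluate)
```

For ONLINE, the tagger has to be rerun from the initial representations on every sample, since its representations depend on the text it has already processed; that is why the sketch takes an evaluator rather than precomputed per-token errors.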
The results in Table 3 (significance test: test of equal proportion, p < .05) show that error rates for ONLINE are either the same as for BATCH or only slightly worse. In fact, the l:small/u:0 error rate on known words (.1186) is lower for ONLINE than for BATCH (similar to what we observed in Table 2). This will be discussed at the end of this section.
Table 3 includes results for "unseens" as well as unknowns because Schnabel and Schütze (2014) show that unseens cause at least as many errors as unknowns. We define unseens as words that occur with a tag that did not occur with them in training; we compute unseen error rates on all occurrences of unseens, i.e., occurrences with both seen and unseen tags are included. As Table 3 shows, the error rate for unknowns is greater than for unseens, which is in turn greater than the error rate on known words.
Examining the individual conditions, we can see that ONLINE fares better than STATIC in 10 out of 12 cases and only slightly worse for l:small/u:big (unseens, known words: .1086 vs .1084, .0802 vs .0801). In four conditions it is significantly better, with improvements ranging from .005 (.1404 vs .1451: l:big/u:0, unknown words) to >.06 (.3094 vs .3670: l:small/u:0, unknown words).
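For readers unfamiliar with a test of equal proportions, one common form is the two-sided two-proportion z-test with a pooled standard error, sketched below in Python. Whether the paper used exactly this variant is an assumption, and the counts in the example are invented for illustration; they are not taken from Table 3.

```python
import math

def equal_proportion_test(errors_a, n_a, errors_b, n_b):
    """Two-sided z-test for H0: both systems have the same error rate.

    errors_*: number of tagging errors, n_*: number of tokens evaluated.
    Returns the z statistic and the two-sided p-value.
    """
    p_a, p_b = errors_a / n_a, errors_b / n_b
    pooled = (errors_a + errors_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_a - p_b) / se
    # Two-sided tail probability of the standard normal: erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Hypothetical counts: 300 errors vs 370 errors on 1000 tokens each.
z, p_value = equal_proportion_test(300, 1000, 370, 1000)
significant = p_value < 0.05
```

With error rates that differ by only a few thousandths, the outcome of such a test depends heavily on the number of tokens, which is consistent with only some of the conditions above reaching significance.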
The differences between ONLINE and STATIC in the other eight conditions are negligible. For the six u:big conditions, this is not surprising: the Gigaword corpus consists of news, so the large unlabeled data set is in reality the same domain as WSJ. Thus, if large unlabeled data sets are available that are similar to the TD, then one might as well use STATIC tagging since the extra work required for ONLINE/BATCH is unlikely to pay off.
Using more labeled data (comparing l:small and l:big) always considerably decreases error rates. We did not test for significance here because the differences are so large. By the same token, using more unlabeled data (comparing u:0 and u:big) also consistently decreases error rates. The differences are large and significant for ONLINE tagging in all six cases (indicated by * in the table).
There is no large difference in variability between ONLINE and BATCH (see columns "std"). Thus, given that it has equal variability and higher performance, ONLINE is preferable to BATCH since it assumes no dataset available prior to the start of tagging.
Figure 1 shows the time course of tagging accuracy. BATCH and STATIC have constant error rates since they do not change representations during tagging. ONLINE error decreases for unknown words, approaching the error rate of BATCH, as more and more is learned with each additional occurrence of an unknown word (top). (In response to a reviewer question: the initial (leftmost) errors of ONLINE and STATIC are not identical; e.g., ONLINE has a better chance of correctly tagging the very first occurrence of an unknown word because that very first occurrence has a meaningful (as opposed to random) distributed representation.)
Interestingly, the error of ONLINE increases for unseens and known words (middle and bottom panels), even though it is always below the error rate of BATCH. The reason is that the BATCH update swamps the original training data for l:small/u:0 because the WSJ test set is bigger by a large factor than the training set. ONLINE fares better here because in the beginning of tagging the updates of the distributional representations consist of small increments. We noticed this in Table 2 too: there, ONLINE outperformed BATCH in some cases on KN for l:small/u:big. In future work, we plan to investigate how to weight distributional counts from the target data relative to those from the (labeled and unlabeled) source data.

Related work
Online learning usually refers to supervised learning algorithms that update the model each time after processing a few training examples. Many supervised learning algorithms are online or have online versions. Active learning (Lewis and Gale, 1994; Tong and Koller, 2001; Laws et al., 2011) is another supervised learning framework that processes training examples, usually obtained interactively, in small batches (Bordes et al., 2005).
All of this work on supervised online learning is not directly relevant to this paper since we address the problem of unsupervised DA. Unlike online supervised learners, we keep the statistical model unchanged during DA and adopt a representation learning approach: each unlabeled context of a word is used to update its representation.

Conclusion
We introduced online updating of word representations, a new domain adaptation method for cases where target domain data are read from a stream and BATCH processing is not possible. We showed that online unsupervised DA performs as well as batch learning. It also significantly lowers error rates compared to STATIC (i.e., no domain adaptation). Our implementation of FLORS is available at cistern.cis.lmu.de/flors

Figure 1: Error rates for unknown words, words with unseen tags and known words for l:small/u:0. The x-axis represents the number of tokens of the respective type (e.g., number of tokens of unknown words).

Table 1: BATCH and ONLINE accuracies are comparable and state-of-the-art. Best number in each column is bold.