Parser Adaptation to the Biomedical Domain without Re-Training

We present a distributional approach to the problem of inducing parameters for unseen words in probabilistic parsers. Our KNN-based algorithm uses distributional similarity over an unlabelled corpus to match unseen words to the most similar seen words, and can induce parameters for those unseen words without retraining the parser. We apply this to domain adaptation for three different parsers that employ fine-grained syntactic categories, which allows us to focus on modifying the lexicon, while leaving the structure of the parser itself intact. We demonstrate uplifts for dependency recovery of 2%-6% on novel vocabulary in biomedical text.


Introduction
Parsing is an important component in many NLP applications. Shallower analyses may allow the discovery of local relations, but handling the full complexity of speech and text requires knowledge of the hierarchical structures that parsers are designed to uncover. This is particularly true of long range dependencies such as that between "activities" and "decreased" in "the specific synthetic activities of electrophoretically purified myosin heavy chain decreased". Such dependencies have proven to be useful features in many text mining and knowledge extraction applications, for example identifying biomarkers in the biomedical literature (Seoud and Mabrouk, 2013) or extracting family history from clinical text (Lewis et al., 2011).
Correctly identifying the dependencies within a string of words is generally based on finding the most probable structure over them, and this in turn requires knowing what sort of relations each word is likely to enter into. Unfortunately, gold standard training data, annotated with these syntactic relations, is generally in short supply. The vocabulary for which we have explicitly seen examples of the type of dependencies each word supports is therefore typically small and performance on real data is often degraded in handling out-of-vocabulary items.
Although the Penn Treebank has been a vital tool in the development and evaluation of parsing technology, providing a standard dataset for comparison of parsers, practical application of these techniques usually requires adaptation to new domains. Rimell and Clark (2009), for example, examine the adaptation of a WSJ-trained CCG parser to the biomedical domain. The divergence between these two domains, news and biology, is manifest in terms of both vocabulary and also stylistic differences in the prevalence of various syntactic structures. For example, biomedical writing eschews personal pronouns and tolerates long sequences of noun modifiers, whereas the style of news articles tends to reverse these preferences. Rimell and Clark's (2009) approach to adapting to these differences is based on retraining elements of the model using biomedical texts which have been hand-tagged with gold-standard tags. While this is undoubtedly effective, achieving an overall improvement of F-score of over 5%, it requires a considerable commitment of skilled resources to manually annotate a substantial corpus with the linguistically correct tags.
Here, we consider a distributional approach to domain adaptation using the information about syntactic structure that is implicit in raw text. We estimate parameters for unseen words using a KNN approach that matches them to the nearest seen words and averages over their parameters. We explore a number of different approaches to measuring distributional similarity and find that vectors based on counts of occurrence within ngram contexts give the best results. Bag-of-words approaches and neural embeddings, which have worked well for semantic tasks, do not appear to capture the information about syntactic similarity that this task requires.
Our use of ngram contexts is inspired by psycholinguistic research into the acquisition of syntactic categories. Cartwright and Brent (1997), for example, consider how children might use a word's distribution across a range of templates, such as "the XXX is good", to infer its syntactic properties. They show, in simulations, that such distributional information can be used to infer syntactic categories from child-directed speech. Mintz (2003) analyses distributions over a simpler type of template, which he calls a frequent frame, consisting of a pair of common lexical items flanking a word of interest, e.g. "you XXX it" or "the XXX is". In addition to showing how such distributional information can be used to induce categories, he also discusses the evidence that adults and children are sensitive to these frames. Redington et al. (1998) consider even simpler contexts, based simply on bigram collocations, e.g. "the XXX". Pinker (1987), on the other hand, has long contested the possibility of using such distributional information to acquire valid grammatical categories, and proposes instead that grammatical categories are bootstrapped using semantic knowledge.
While the patterns and templates described above can be used to characterise a word's behaviour in terms of concrete occurrences in specific contexts, neural networks have recently become popular as a means to create more abstract representations. In this case, as the network adapts to the data, representations are learned that embed discrete inputs in a continuous space defined by its internal states. Researchers have been interested in the nature of such internal representations for some time (e.g., Small et al., 1995; Joanisse and Seidenberg, 1999). However, it has now become practical to induce such embeddings from large quantities of text and employ them in linguistic applications. For example, Tsuboi (2014) and Collobert et al. (2011) apply neural representations to POS tagging, and this suggests that at least some useful information about the syntax of unseen words might be gained from this source.
While POS tags can provide a coarse-grained description of words' syntactic behaviour, accurate parsing typically requires finer-grained detail. We can distinguish between two approaches, which may be combined, to specifying this additional level of detail. The first approach simply makes use of finer-grained syntactic categories, either instead of or in addition to POS tags (Steedman, 2000; Klein and Manning, 2003b; Petrov et al., 2006). These categories can then determine the missing information about the dependencies a word will take part in, such as whether a verb is intransitive or whether it takes prepositional arguments. The second approach instead increases the granularity of the production rules, by conditioning the probabilities on the heads of the phrases involved (Charniak, 2001; Collins, 2003). In this way, words are associated with probabilities for the structure of phrases that they head, determining, for example, the types of object that a verb phrase expands into.
Although the two approaches are compatible, a significant difference makes the former more conducive to our purposes. Enhancing the granularity of the syntactic categories results in a much richer lexicon containing more information about how words behave syntactically. In principle, enlarging such a lexicon should therefore have a greater impact on performance by itself. In the latter approach, of lexicalising the production rules, expanding the vocabulary of the parser may be much more complicated, requiring modifications throughout the model. In contrast, our approach simply adds new entries to the lexicon without the need to retrain the parser. In fact, our approach does not even require full sentences and can be applied to an unlabelled corpus of ngram counts.
Our KNN approach and the three parsers we modify are described in Sections 2 and 3 respectively. We then use a biomedical dependency recovery task, specified in Section 4, to evaluate the performance of the modified parsers, as reported in Section 5.

Approach
Our approach is based on the assumption that words with similar syntactic properties should have similar distributional characteristics. We evaluate both neural embeddings and raw context frequencies as the basis for measuring distributional similarity. The context vectors have components which correspond to occurrences in particular contexts within a corpus of raw biomedical text, while for the neural embeddings we employ both SENNA (Collobert et al., 2011) and Skip-gram (Mikolov et al., 2013) embeddings. In all cases, we induce parameters for unseen words by averaging the parameters from the k nearest neighbours seen in the training data.

Context Vectors
Distributional similarity is here based on comparing vectors that are constructed from raw context counts. We considered two approaches to defining these contexts: ngrams and bags-of-words (BOW).
The ngram approach counts occurrences in 2gram, 3gram and 4gram contexts that are intended to emphasise syntactic, as opposed to semantic, characteristics, following the structure of templates and frames proposed by e.g. Cartwright and Brent (1997), Mintz (2003) and Redington et al. (1998). Thus our 2gram contexts have two forms that distinguish occurrence on the left from occurrence on the right: "left_token XXX" and "XXX right_token". The 3gram contexts are equivalent to Mintz's (2003) frequent frames: "left_token XXX right_token". And the 4gram contexts extend this frame to the right, mimicking the form of templates described by Brent (1991) and Cartwright and Brent (1997): "left_token XXX right_token_1 right_token_2".
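The context shapes just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `XXX` slot marker and the handling of sentence edges (contexts that would run past the edge are simply skipped) are hypothetical choices, not details taken from the experiments.

```python
from collections import Counter

SLOT = "XXX"  # placeholder marking the position of the target word


def ngram_contexts(tokens, i):
    """Return the 2gram, 3gram and 4gram contexts of tokens[i].

    Assumption: contexts that would extend past the start or end of the
    token sequence are skipped rather than padded.
    """
    left = tokens[i - 1] if i > 0 else None
    right1 = tokens[i + 1] if i + 1 < len(tokens) else None
    right2 = tokens[i + 2] if i + 2 < len(tokens) else None

    contexts = []
    if left is not None:
        contexts.append(("2gram", (left, SLOT)))            # left_token XXX
    if right1 is not None:
        contexts.append(("2gram", (SLOT, right1)))          # XXX right_token
    if left is not None and right1 is not None:
        contexts.append(("3gram", (left, SLOT, right1)))    # frequent frame
        if right2 is not None:
            contexts.append(("4gram", (left, SLOT, right1, right2)))
    return contexts


def count_contexts(sentences):
    """Count (word, context kind, context) triples over tokenised sentences."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for kind, ctx in ngram_contexts(tokens, i):
                counts[(word, kind, ctx)] += 1
    return counts
```

For the token sequence `["the", "protein", "is", "expressed"]`, the word "protein" receives one left 2gram, one right 2gram, one 3gram frame and one 4gram template.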
The BOW approach ignores the sequential information contained in the ngram contexts and relies instead on counts of individual words that occur anywhere within a window of 5 words on each side of a target word.
In each case, we built distributional vectors using the most common of these contexts, with vector components based on a ratio of probabilities:
$$v_i = \frac{p(c_i \mid w_t)}{p(c_i)} = \frac{\mathit{freq}_{i,t} \cdot \mathit{freq}_{total}}{\mathit{freq}_i \cdot \mathit{freq}_t} \qquad (1)$$
where $c_i$ is the ith context, $w_t$ is the target word, $\mathit{freq}_{i,t}$ is the count of the number of times $w_t$ occurs in context $c_i$, $\mathit{freq}_i$ is the overall count of the number of times context $c_i$ occurs with all words, $\mathit{freq}_t$ is the overall count for $w_t$ in all contexts and $\mathit{freq}_{total}$ is the total count for all words in all contexts. Target words with $\mathit{freq}_t < 10$ were discarded as containing too little useful information.
The distance between two vectors, $u$ and $v$, was measured in terms of the city block metric:
$$d(u, v) = \sum_i |u_i - v_i| \qquad (2)$$
This appeared to work more effectively on sparse vectors than the more usual cosine metric.
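As a rough illustration of how such vectors and distances might be computed from raw counts, the sketch below assumes that each component is the ratio p(c_i | w_t) / p(c_i) estimated from the four counts defined above. That reading of "ratio of probabilities", together with the function names and the dictionary layout, is an assumption made for the example.

```python
def context_vector(word, contexts, freq_joint, freq_context, freq_word, freq_total):
    """Build a context vector for `word` over an ordered list of contexts.

    Assumed weighting: p(c | w) / p(c)
        = freq(c, w) * freq_total / (freq(c) * freq(w)).
    freq_joint maps (context, word) pairs to counts; unseen pairs count as 0.
    """
    vec = []
    for c in contexts:
        joint = freq_joint.get((c, word), 0)
        vec.append(joint * freq_total / (freq_context[c] * freq_word[word]))
    return vec


def city_block(u, v):
    """City block (L1) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))
```

With toy counts freq("the XXX", kinase) = 6, freq("the XXX") = 10, freq(kinase) = 8 and a total of 40, the corresponding component is 6 * 40 / (10 * 8) = 3.0.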
We built these representations on a corpus of 1.2 billion words of titles and abstracts from the Medline database.

SENNA
Collobert et al. (2011) trained a neural net language model on a snapshot of the English Wikipedia (≈ 631M words) and published the feature vectors induced for each word in the first hidden layer of the network. They showed that these embeddings are useful in enhancing the performance of a number of tasks, including POS tagging and semantic role labelling. Using these representations as features, Bansal et al. (2014) obtained improvements in dependency recovery in the MST Parser (McDonald and Pereira, 2006). Andreas and Klein (2014) also used these embeddings on a number of tasks, including an attempt to expand the vocabulary of the Berkeley Parser by matching unseen words to the nearest word already in the lexicon. However, instead of inducing parameters for the new vocabulary they simply replaced unseen words with their seen matches in the input. Unfortunately they did not find a reliable benefit from this approach. Like the context vectors described above, the SENNA representations were derived from large quantities of raw text and reflect the distributional behaviour of words in that data. However, unlike our context vectors, which have components derived from explicit distributional contexts, the components of their neural embeddings are abstract dimensions whose values derive from the optimization of a particular mathematical model. In this case the form of this model was based on distinguishing between real 11-word phrases drawn from the unlabelled corpus and an incorrect phrase which had the central word replaced with a randomly chosen item. The model tries to maximise the difference between these two phrases in terms of scores which are a nonlinear function of the vectors representing the words they contain.
Training involved stochastic gradient ascent optimisation of an objective function based on a ranking criterion for the two phrase scores, and resulted in each word within a 100,000 word vocabulary being assigned a vector representation. The published embeddings are of dimension 50 and we measured the similarity of these vectors in terms of the cosine measure:
$$\cos(u, v) = \frac{u \cdot v}{\|u\| \, \|v\|} \qquad (3)$$

Skip-gram
Like the SENNA model, the Skip-gram model (Mikolov et al., 2013) is trained to differentiate between the correct central word of a phrase and a random replacement, which they refer to as negative sampling. Unlike SENNA, however, the Skip-gram model tries to make this prediction using only a single one of the surrounding words at a time and ignores the ordering of those words, i.e. taking a bag-of-words approach to context. The published 300-dimensional vectors were trained on 100B words of Google News text using stochastic gradient ascent, and cover a vocabulary of 3M words. We also retrained the same 300-dimensional model on our 1.2 billion word unlabelled biomedical corpus, giving a vocabulary of around 1M words. In both cases, we measured similarity using the cosine metric, Equation 3.

KNN Parameter Induction
Our approach to inducing parser parameters for unseen words is a form of k-nearest-neighbour induction. Specifically, we constructed parameters for unseen words by finding the most similar words in the lexicon, using the distributional measures described above, and then averaging over their existing parameters in the parsing model. We did this for each parser, varying the dimensions of the context vectors, and the number of nearest neighbours to find the optimal model. To ensure that the parameters that we average over are well-estimated and reliable, we only consider words that appear more than a hundred times in the Penn Treebank when finding the nearest neighbours.
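A minimal sketch of this induction step is given below. It assumes that each seen word's parameters can be represented as a dictionary from tags to values and that the city block distance is used for matching; the function names, the data layout and the treatment of tags missing from a neighbour (counted as zero) are illustrative assumptions rather than a description of any particular parser's internals.

```python
def city_block(u, v):
    """City block (L1) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def induce_parameters(unseen_vec, seen_vecs, seen_params, seen_counts,
                      k=3, min_count=100):
    """Average the parameters of the k nearest reliably-estimated seen words.

    seen_vecs:   {word: distributional vector}
    seen_params: {word: {tag: parameter value}}
    seen_counts: {word: treebank frequency}
    Only words seen more than `min_count` times are eligible neighbours.
    """
    candidates = [w for w in seen_vecs if seen_counts.get(w, 0) > min_count]
    neighbours = sorted(
        candidates, key=lambda w: city_block(unseen_vec, seen_vecs[w])
    )[:k]
    if not neighbours:
        return {}

    tags = set()
    for w in neighbours:
        tags.update(seen_params[w])
    # Assumption: a tag absent from a neighbour contributes 0 to the average.
    return {
        t: sum(seen_params[w].get(t, 0.0) for w in neighbours) / len(neighbours)
        for t in tags
    }
```

With k = 2 and two eligible neighbours whose NN parameters are 1.0 and 0.8, the unseen word receives an NN parameter of 0.9; a rare seen word is ignored even if its vector is the closest match.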

The Parsers
We extend the vocabulary of three parsers, all of which make use of fine-grained lexical categories.
The first of these parsers induces sub-categories beneath the level of POS-tags during training while the other two require hand-annotation of the categories in the training data. In all cases, we modify the parser merely by inserting new items, along with their tag parameters, into the lexicon while leaving the rule probabilities in the rest of the parser unchanged. Sections 3.1, 3.2 and 3.3 outline these parsers, focusing particularly on the contents of the lexicon which our methods modify as described in Section 2.

The Berkeley Parser
While an unlexicalized parser that uses syntactic categories based solely on the symbols found in the Penn Treebank will generally perform poorly, a number of results show that refining these categories can substantially improve performance. Klein and Manning (2003b), for example, show that the performance of an unlexicalized model can be substantially improved by splitting the existing symbols down into finer categories. Their subcategorizations were developed by hand based on linguistic intuitions and a careful error analysis. The Berkeley Parser (Petrov et al., 2006), in contrast, is based on a method for automatically finding useful subcategorizations during training by splitting and merging the original nodes.
The model is an unlexicalized generative PCFG, but the granularity of the terminal and nonterminal categories found in training gives it a much greater sensitivity to the syntactic behaviour of words and phrases than is possible using standard POS tags. The lexicon specifies each word's association to the terminal categories, and the rest of the parser is entirely unlexicalized. Parsing is complicated by the large number of syntactic categories, which threatens to make standard techniques infeasible, due to the size of the search space and even just the amount of memory required to hold the chart. However, the hierarchical structure resulting from the split-merge process enables a form of coarse-to-fine pruning that makes the problem tractable (Petrov and Klein, 2007). Training is based on the EM algorithm along with 6 cycles of splitting each symbol into two and remerging the 50% of sub-symbols carrying the least information. Output from the Berkeley Parser consists of trees labelled with the original Penn Treebank symbols, and we use the EnglishGrammaticalStructure class from the Stanford Parser 5 to convert the trees to Stanford-style dependencies. Out-of-vocabulary items are handled by a process that uses orthography and sentence position to estimate probabilities for unseen words.
Expanding the lexicon of this model using our KNN method is complicated by the fact that it is generative, so that inserting new vocabulary with non-zero probabilities requires adjusting the probabilities of everything else in the lexicon to maintain normalization. Since the parser smooths word-given-tag probabilities only for words with a count of 100 or lower, we assigned all new vocabulary a count of 101, and partitioned this count according to the induced tag and sub-tag probabilities. In fact, our attempts to use KNN to induce probabilities over the sub-categories below the level of POS tags were fruitless, producing worse results than the original model in all experiments. Thus, we resorted to using the KNN approach to induce POS level probabilities and then basing the lower level probabilities on a 50-50 interpolation of a general profile for each POS tag and the probabilities assigned by the OOV process.
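One way to picture the partitioning of the pseudo-count is sketched below. It assumes that both the general profile and the OOV estimates can be expressed as distributions over sub-tags within each POS tag, and that the count of 101 is split as count(tag, sub-tag) = 101 × p(tag) × p(sub-tag | tag). The function name, the dictionary layout and this factorisation are assumptions for illustration and do not reflect the Berkeley Parser's actual lexicon format.

```python
def lexicon_counts(pos_probs, general_profile, oov_profile, total=101.0, mix=0.5):
    """Partition a pseudo-count for a new word over (POS tag, sub-tag) pairs.

    pos_probs:       {tag: p(tag | word)}, e.g. induced by KNN
    general_profile: {tag: {subtag: p(subtag | tag)}}, hypothetical input
    oov_profile:     {tag: {subtag: p(subtag | tag)}}, hypothetical OOV estimate
    mix:             interpolation weight on the general profile (0.5 = 50-50)
    """
    counts = {}
    for tag, p_tag in pos_probs.items():
        subtags = set(general_profile.get(tag, {})) | set(oov_profile.get(tag, {}))
        for sub in sorted(subtags):
            p_sub = (mix * general_profile.get(tag, {}).get(sub, 0.0)
                     + (1.0 - mix) * oov_profile.get(tag, {}).get(sub, 0.0))
            counts[(tag, sub)] = total * p_tag * p_sub
    return counts
```

If both profiles are proper distributions for every tag, the resulting counts sum to the chosen total, so the new word sits just above the smoothing cutoff.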

C&C
Whereas the Berkeley Parser automatically induces a set of fine-grained categories during training in an attempt to maximize parsing performance, the categories of CCG (Steedman, 2000) have been linguistically designed to represent the dependencies that words will support. In particular, they have a close correspondence to the functional types of lambda calculus representations. So, for example, an intransitive verb has the CCG category S\NP, which can be interpreted as identifying this as a syntactic structure that takes a noun phrase to its left (represented by \NP) to produce a sentence (represented by S). In other words, it is a function from entities of type NP to type S. In comparison, a transitive verb has the type (S\NP)/NP, which describes a structure that takes a noun phrase to its right (/NP) to produce a structure equivalent to an intransitive verb (S\NP), which is itself a category looking for an NP to its left to produce a sentence. Thus, the transitive verb category is a function from two NPs, one to the right and one to the left, to an entity of type S.
5 http://nlp.stanford.edu/software/lex-parser.shtml
The C&C parser (Curran et al., 2007) is a discriminative parser, which has been trained on CCGbank (Hockenmaier and Steedman, 2007), a translation of the Penn Treebank into the CCG formalism. Roughly, the parser can be split into three modules: a POS-tagger, a super-tagger and the parser itself. The POS-tagger assigns fixed POS tags to the text to be parsed, based on a window of five words centred on the word to be tagged. The super-tagger takes these POS tags and words as input and, using the same five token window, passes CCG tags to the parser. The parser in turn tries to build a derivation from the CCG tags it has been given, but can request a re-analysis from the super-tagger if this fails.
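Each module uses a log-linear model to predict which structures, $\omega$, are most likely given the input, $S$:
$$P(\omega \mid S) = \frac{1}{Z_S} \exp\Big(\sum_i \lambda_i f_i(\omega)\Big) \qquad (4)$$
where the $f_i$ are a set of features, the $\lambda_i$ are feature weights and $Z_S$ is a normalising constant.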
Here we only consider modifying the POS-tagger and super-tagger, and then only to introduce weights connecting a new lexical item with its corresponding tag. Both taggers make use of many additional features, for example features relating to the dependency of a tag on the two words to either side. However, these additional feature weights do not seem to be effectively estimated by the approach we consider here. Instead, we focus on estimating the feature weights that correspond to the likelihood of a given word taking a particular tag.
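To make the role of a word-tag weight concrete, the following sketch evaluates a generic log-linear model over a small set of candidate tags. The feature names and weights are invented for the example and the code does not use the C&C implementation; it only illustrates how adding a single weight for a new word shifts the predicted distribution.

```python
import math


def log_linear(feature_sets, weights):
    """Return P(structure | input) under a log-linear model.

    feature_sets: {structure: {feature_name: feature_value}}
    weights:      {feature_name: lambda}; missing features get weight 0.
    """
    scores = {
        omega: sum(weights.get(name, 0.0) * value for name, value in feats.items())
        for omega, feats in feature_sets.items()
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores.values())
    z = sum(math.exp(s - top) for s in scores.values())
    return {omega: math.exp(s - top) / z for omega, s in scores.items()}


# Hypothetical candidates for the unseen word "kinase": each tag fires one
# word-tag feature. Only the NN feature has been given an induced weight.
candidates = {
    "NN": {"word=kinase&tag=NN": 1.0},
    "VB": {"word=kinase&tag=VB": 1.0},
}
induced_weights = {"word=kinase&tag=NN": 2.0}
probs = log_linear(candidates, induced_weights)
```

Before any weight is induced, both tags are equally likely; once the NN weight is added, most of the probability mass moves to NN.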

EasyCCG
EasyCCG (Lewis and Steedman, 2014) is another CCG-based parser that also relies on a log-linear model, as described by Equation 4, but only within what is essentially its super-tagger. POS-tagging is avoided as it represents a bottleneck within the C&C parser, with wrongly assigned POS tags being difficult to recover from. Similarly, the probabilistic model of parse trees is discarded, and instead an A* parser (Klein and Manning, 2003a) is used to search for the valid CCG derivation that maximises the probabilities of the categories assigned to words in the input. The effectiveness of this approach depends both on the constraints imposed on derivations by the CCG formalism and also on the performance of the super-tagger, with the latter aspect being reliant on the features chosen for this model. Whereas the features used by the C&C parser are structures that are explicitly present in the training data, such as a particular sequence of tags or a CCG rule that involves particular head and dependent words, EasyCCG uses low-dimensional word vectors as features, alongside more traditional features such as capitalisation and 2-character suffixes. The CCG category of an input token is then predicted by a log-linear classifier using the features in a 7-word window surrounding it. The word vectors are initialised using the 50-dimensional embeddings induced by Turian (2010) on 37 million words of newswire text, and are further optimised during training on CCGbank. The use of these word vectors allows EasyCCG to generalise well to out-of-domain data, both because embeddings are available for a wider vocabulary than is found in CCGbank and also because the low dimensionality of the vectors counters some of the problems of sparsity.

Evaluation
We measure the performance of our parsers in terms of the ability to recover dependencies from biomedical text. Dependency recovery is not only a useful component in processing both clinical text (Lewis et al., 2011; Sohn et al., 2012) and biomedical literature (Seoud and Mabrouk, 2013; Cohen and Elhadad, 2012; Miyao et al., 2008; Poon and Vanderwende, 2010; Qian and Zhou, 2012), it also provides an evaluation metric that is independent of the particular syntactic formalism employed in the parser.
BioInfer (Pyysalo et al., 2007b) is a corpus of about 35,000 words from PUBMED abstracts, annotated with grammatical relations using a slight modification of the Stanford dependencies scheme (de Marneffe et al., 2006). Our models were tuned on a development set of 600 sentences and then evaluated on the remaining 500 sentence test set, using the same split as Pyysalo et al. (2007a) and Rimell and Clark (2009). The vocabulary in these sentences diverges considerably from that found in the WSJ, with about 27% of the tokens being unseen. Of the ≈ 3,000 unseen word types found in BioInfer, 92% occur in the unlabelled Medline corpus that we use to induce distributional representations, and over 80% are assigned parameters by the KNN method. In contrast, only about 700 of those unseen words are present in the SENNA vocabulary, all of which are assigned parameters.
Table 1 compares the performance of the Berkeley, C&C and EasyCCG parsers on the BioInfer development set, after KNN adaptation using various forms of distributional similarity. The results for each parser are grouped together with the first line in each of these groups giving the baseline F-score achieved on the BioInfer development set before expanding the vocabulary. Each subsequent line then corresponds to the best model found for each type of representation, with columns containing D, the number of dimensions in the distributional vectors, k, the number of nearest neighbours, and lastly the F-Score.

Results
The types of distributional representation used in the KNN algorithm are subdivided into those constructed on our Medline titles and abstracts and those trained by their authors on other data sources before being made publicly available. The former group consist of the ngram contexts (2gram, 3gram and 4gram), the bag-of-words contexts (BOW) and the retrained Skip-gram model (SG-bio). The downloaded Skip-gram (SG-news) and SENNA (SENNA) vectors make up the latter group.
Looking first at the differences between these approaches to constructing distributional representations, it is reasonably clear that within each parser the worst performing models tend to be those based on bag-of-words contexts (BOW, SG-news and SG-bio). Of the neural embedding models, SENNA gets the best performance, which we attribute to its preservation of sequential order in handling context. Surprisingly, the Skip-gram model retrained on biomedical data (SG-bio) fared worse than the original (SG-news), due probably in large part to the fact that the original training data was almost 100 times larger than our 1.2B word corpus. The ngram contexts achieved the best F-Scores fairly consistently for all parsers, vindicating our appeal to the psycholinguistic research of Cartwright and Brent (1997), Mintz (2003) and Redington et al. (1998).
Turning now to each parser individually, the baseline performance of the Berkeley Parser proved difficult to exceed, with only the 2gram distributional contexts giving any improvement. The best model used the 200 most frequent bigrams as contexts and averaged over 10 nearest neighbours to achieve an uplift of only 0.7% in F-Score. All other types of model resulted in the Berkeley Parser's performance degrading. For the C&C parser, in contrast, most types of representation, except SG-bio and BOW, achieved an uplift. The best model used the 500 most frequent 3gram contexts, and 3 nearest neighbours to infer parameters for unseen words, improving the F-Score by 1.43%. In comparison, the EasyCCG models achieve higher F-Scores but show smaller uplifts. Here, the best model is based on 2grams, using only 100 such contexts, but requiring 7 nearest neighbours to raise the F-Score by 0.93%.
The results of applying these best performing models to the BioInfer test set are given in Table 2. We evaluate performance on both the set of all dependencies and also the subset of dependencies involving unseen words only. All parsers show an uplift on both measures, with C&C achieving the greatest gains: 2.13% over the whole test set and 6.44% on unseen words. The other parsers obtain smaller uplifts of around 3% on the unseen words but these OOV improvements are nonetheless significant at p < 0.01 on a bootstrap test (Efron and Tibshirani, 1993) for all parsers. The improvements over the whole test set are diluted by comparison, although still positive.
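The exact form of the bootstrap test is not spelled out here, so the following is only one plausible sketch: a paired bootstrap that resamples test sentences with replacement and counts how often the adapted parser fails to beat the baseline in F-score. The function names, the per-sentence (correct, predicted, gold) count format and the fixed random seed are all assumptions made for the example.

```python
import random


def f_score(stats):
    """F-score from per-sentence (correct, predicted, gold) dependency counts."""
    correct = sum(s[0] for s in stats)
    predicted = sum(s[1] for s in stats)
    gold = sum(s[2] for s in stats)
    if correct == 0:
        return 0.0
    precision = correct / predicted
    recall = correct / gold
    return 2 * precision * recall / (precision + recall)


def paired_bootstrap(baseline, adapted, samples=1000, seed=0):
    """Estimate how often the adapted system is NOT better than the baseline.

    baseline, adapted: lists of (correct, predicted, gold) tuples, aligned
    by sentence. The same resampled sentence indices are used for both.
    """
    rng = random.Random(seed)
    n = len(baseline)
    not_better = 0
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]
        f_base = f_score([baseline[i] for i in idx])
        f_adapt = f_score([adapted[i] for i in idx])
        if f_adapt <= f_base:
            not_better += 1
    return not_better / samples
```

The returned proportion can be read as an approximate p-value for the hypothesis that adaptation gives no improvement; restricting the counts to dependencies that involve unseen words would give the OOV variant of the test.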

Discussion
We have demonstrated a KNN algorithm to estimate parameters for new lexical items that produces improvements in F-score of up to 6% in the recovery of dependencies in biomedical text. These improvements were obtained without having to retrain the parsers, based simply on distributional representations constructed on unlabelled corpora. In fact, since the context vectors comprehensively outperformed the neural embeddings, our approach achieved these gains without having to induce a clustering or other model over the unlabelled corpora and required only counts for ngrams containing the seen and unseen words. In principle, this method could be applied on the fly, as and when the parser encounters new vocabulary. The success of this ngram based approach is also consistent with psycholinguistic research into syntactic acquisition (Cartwright and Brent, 1997; Mintz, 2003; Redington et al., 1998). We were able to assign parameters to over 80% of the unseen word types. This introduction of parameters for new word types into the lexicon was the only modification made to the parsers, with the remainder of the models being left unchanged. When combined with methods that could adapt the existing model parameters to the statistics of the new domain, such as self-training (e.g., Deoskar et al., 2014), we expect further improvements to be achievable.
Nonetheless, there were substantial variations in the strength of the improvement attained, with the weak performance of the Berkeley Parser being a notable disappointment. Several differences could be invoked to explain this shortfall. Firstly, the Berkeley Parser has a strong OOV process, and it may just be difficult to beat the estimates it produces without seeing gold standard data. Secondly, it is a generative rather than a discriminative model, and this complicates the process of modifying the lexicon with questions of how much probability mass to give to unseen words and how to renormalise the lexicon afterwards. Thirdly, rather than representing a single coherent type of linguistic information, the categories induced by the splitting and merging process are simply the results of whatever splits happened to give the most improvement during training. For example, a subcategory within DT might differentiate definiteness from indefiniteness, while a subcategory in NNP might separate personal names from place names. The inhomogeneity in the type of information encoded in these subcategories probably contributed to our being unable to find distributional information which could be used to induce useful probabilities for them. Consequently, our KNN parameter induction worked only at the level of POS tags for this parser and was therefore less predictive. Andreas and Klein (2014) also struggled to obtain performance improvements for the Berkeley Parser using a distributional matching method. Their problems were also compounded by using SENNA vectors, which we found to give weaker benefits than the ngram context approach.
Our method has certain aspects in common with other approaches to domain adaptation. For example, Koo et al. (2008) train a dependency parser on features deriving from distributional clusters, with two words having similar cluster features if they have similar bigram distributions. Thus, these clusters engender a form of distributional similarity comparable to that used in our KNN algorithm.
KNN algorithms are also commonly used in Graph-Based Semi-Supervised Learning approaches (Das and Petrov, 2011; Altun et al., 2006; Subramanya et al., 2010), with the k-nearest-neighbour sets determining the edges that structure the graph. POS tags are then propagated through the graph from labelled to unlabelled data. Although similarity in these cases is commonly assessed between token sequences, as opposed to word types, the features used are similar to the ngram templates used here and the bigram distributions used by Koo et al. (2008).
A major difference in our approach is that it does not require retraining the parser or constructing a full model on the unlabelled data. We simply copy parameters from words in the existing lexicon to unseen words, based on a distributional measure of similarity. Moreover, we do not need to see the entire unlabelled corpus. Instead, we can estimate parameters for an unseen word based simply on a set of ngrams centred on it, along with the corresponding ngrams for the existing lexicon.
A reasonable direction for future work would be to develop the way we select the contexts on which our distributional representations are based. In particular, it would make sense to exploit the approach of Brent (1991) and Manning (1993) in which these contexts have an a priori linguistic association with particular syntactic frames, as opposed to a merely empirical association deriving from a k-nearest-neighbour model.