Identification of Adjective-Noun Neologisms using Pretrained Language Models

Neologism detection is a key task in the construction of lexical resources and has wider implications for NLP; however, the identification of multiword neologisms has received little attention. In this paper, we show that we can effectively distinguish between compositional and non-compositional adjective-noun pairs by using pretrained language models and comparing their representations with individual word embeddings. Our results show that the use of these models significantly improves over baseline linguistic features; however, the combination with linguistic features improves the results still further, suggesting the strength of a hybrid approach.


Introduction
In the context of the construction of lexical resources, such as WordNet (Miller, 1995; Fellbaum, 2012), a key task is the identification of terms that would be relevant for inclusion in the resource; this task is called 'neologism detection.' Detection of single-word neologisms can principally be accomplished by means of frequency statistics (McCrae et al., 2017), and even new senses of words can be identified by means of topic models (Lau et al., 2012). However, this task is much harder when we consider multiword expressions, as a multiword expression may consist of two or more words that are already in the dictionary but whose combination carries extra meaning that cannot be understood from the words that compose it. For example, a 'common viper' is not merely a viper that is 'common', but in fact refers to Vipera berus, a specific species of snake. In contrast, a 'dangerous viper' is simply a viper that is also dangerous, and as such most lexicographers would prefer not to include the term in their resources.
In this work, we focus on a particular kind of neologism, namely neologisms where the term consists of a single adjective and a noun. The reason for this focus is that the semantics of adjectives is complex in terms of semantic compositionality (McCrae et al., 2014) and can be broadly broken down into three categories: intersective, subsective and privative adjectives (Partee, 2003; Bouillon and Viegas, 1999; Morzycki, 2015). We use WordNet as the principal background knowledge and thus rely on the judgement of the WordNet lexicographers in order to deduce whether a particular adjective-noun combination is a neologism.
Our approach for detecting whether adjective-noun pairs are likely to be neological is based on the recent breakthroughs in pretrained language models, such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018), which have been shown to be effective for solving a wide variety of tasks (Radford et al., 2018). For this particular problem of neologism detection, there is clear value in the use of these pretrained models, as they easily create a vector that represents the adjective-noun combination; this can be compared with a word-based model such as GloVe (Pennington et al., 2014) to deduce whether an adjective-noun pair is compositional or neological.
The paper is structured as follows: first, in Section 2, we describe some of the related work in the identification of neologisms, terminology and semantic compositionality. We then, in Section 3, describe how we created a dataset for adjective-noun neologisms and in particular how we constructed a weak negative set for evaluation. We then describe our baseline methodologies and how we used pretrained language models in order to identify adjective-noun neologisms with increased accuracy. The results of these experiments are presented in Section 4 before we conclude in Section 5. The code and datasets used in these experiments are available at https://github.com/jmccrae/adj-noun-neologism-identification.

Related Work
Neologism identification is a basic task in the construction of a lexicon, and as lexicography is being increasingly automated (Kosem et al., 2013) in the context of infrastructures such as ELEXIS (Krek et al., 2018), it is of increasing importance. However, while the task has received some attention, most approaches so far have significant weaknesses, even though it is a major area of work for publishers in lexicography (O'Donovan and O'Neill, 2008). Some semi-automated approaches have relied on the extraction of features and the use of classifiers such as SVMs (Falk et al., 2014) or on language-specific features (Breen, 2010).
Closely related to this task is automatic term recognition, where new terms are recognized based on their occurrence in a corpus. In these works, a number of metrics for assessing 'termhood' (Spasić et al., 2013; Cram and Daille, 2016) have been introduced, and these are often developed to work in specific domains (Buitelaar et al., 2013). It has been shown that combinations of many metrics can effectively learn terms (Astrakhantsev, 2014). However, previous work (McCrae et al., 2017), as well as the results in this paper, shows that these metrics perform poorly at identifying semantic compositionality.
The semantics of adjectives have been studied not only from a logical perspective but also in terms of vector space models and word embeddings, in the context of the analysis of semantic compositionality (Mitchell and Lapata, 2008). Most works start from Mitchell and Lapata in representing the compositional vector of an adjective-noun pair with the following equation:

p = αu + βv

Where p is the vector of the compound, u and v are the vectors for the individual words, and α, β are learned weights. This has been extended by replacing the scalar values, α and β, with matrices (Boleda et al., 2013):

p = Au + Bv

Further, it has been suggested that adjectives themselves should be matrices (Baroni and Zamparelli, 2010), such that

p = Uv

However, learning a matrix to represent each word can be quite difficult. This has been further extended to an approach where each word has a matrix, giving a general approach to semantic compositionality (Socher et al., 2012). Moreover, it was shown that simpler models such as bidirectional LSTMs produce better results (Tai et al., 2015). This has led to the development of pretrained models (Devlin et al., 2018; Peters et al., 2018), which can be trained on truly massive corpora and then still be effectively applied to tasks with relatively little training data.
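The three composition schemes above can be illustrated with a minimal numpy sketch; the dimensionality, weights and matrices here are random stand-ins for illustration, not trained parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
u = rng.standard_normal(4)   # adjective vector
v = rng.standard_normal(4)   # noun vector

# Mitchell and Lapata: weighted additive composition with scalar weights.
alpha, beta = 0.6, 0.4
p_additive = alpha * u + beta * v

# Boleda et al.: the scalar weights are replaced by learned matrices.
A, B = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
p_matrix = A @ u + B @ v

# Baroni and Zamparelli: the adjective itself is a matrix acting on the noun.
U = rng.standard_normal((4, 4))
p_adj_matrix = U @ v
```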

Data Preparation
In order to develop a classifier that determines whether a particular adjective-noun pair is a neologism, we first need a set of pairs that we know to be neological and a set that we can assume is likely not to be. For the positive set, we simply took all the two-word expressions within Princeton WordNet 3.1, deduced the likely part-of-speech tags using NLTK (Loper and Bird, 2002), and selected only those that were tagged as "JJ NN" or "JJ NNS". This yielded a set of 11,474 terms that we could use as a positive set.
Developing a negative set is much harder, as we would need to ask an expert lexicographer to manually evaluate a large number of adjective-noun combinations and verify that they were not neologisms that could be put into a dictionary. As such, we rely on a weakly supervised dataset constructed from Wikipedia. In particular, we randomly chose from Wikipedia articles a list of unique adjective-noun pairs, again identified by part-of-speech tagging with NLTK, and then filtered out all those pairs which are already in WordNet. As this negative set is still likely to contain some true neologisms, we performed a quick manual analysis of 100 of these terms, which showed that 5 of them were certainly worthy of inclusion in a dictionary (e.g., 'special education', 'safe position') as they have meanings that are not deducible from the two words that compose the phrase. In contrast, most of the examples in the set were clearly compositional, e.g., 'British soldiers', 'much teamwork', 'new congregation'. One example was unclear: 'Korean language', which does not occur in WordNet, while other similar terms, such as 'English language' and 'German language', do. As such, we estimate that our weak negative set is about 94-95% negative. We acknowledge that this is a weakness of our approach; however, it would be very expensive to construct a true gold standard, and our experiments and analysis below show that the system is capable of effectively learning this task in spite of the noisy training data.
In this way, we constructed a set of weak negative examples roughly ten times larger than the positive set, as our intuition was that negative examples occur in text far more often than positive ones. We reserved two sets of 1,000 positive and negative examples for test and development, as shown in Table 1.
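The data preparation above can be sketched as follows. This is a minimal illustration over pre-tagged stand-in data; the paper itself runs NLTK's part-of-speech tagger over Princeton WordNet 3.1 entries and a Wikipedia dump.

```python
# Positive set: two-word WordNet terms tagged adjective + noun ("JJ NN"/"JJ NNS").
# Weak negative set: corpus adjective-noun pairs that are absent from WordNet.

def select_positives(tagged_terms):
    """Keep two-word terms whose tags are adjective followed by noun."""
    return [term for term, tags in tagged_terms
            if len(tags) == 2 and tags[0] == "JJ" and tags[1] in ("NN", "NNS")]

def weak_negatives(corpus_pairs, wordnet_terms):
    """Unique corpus adjective-noun pairs not already in WordNet."""
    return sorted(set(corpus_pairs) - set(wordnet_terms))

# Illustrative stand-in data (tags are what NLTK might produce).
wordnet_tagged = [
    ("common viper", ("JJ", "NN")),
    ("hot dog", ("JJ", "NN")),
    ("free people", ("JJ", "NNS")),
    ("take off", ("VB", "RP")),       # filtered out: not adjective + noun
]
positives = select_positives(wordnet_tagged)

corpus_pairs = ["British soldiers", "new congregation", "common viper"]
negatives = weak_negatives(corpus_pairs, positives)
```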

Baseline Models
A natural approach for determining whether an adjective-noun pair is compositional is to compare the frequency with which the adjective-noun pair occurs against the total frequencies of the adjective and the noun. This can be achieved by means of Pointwise Mutual Information, as follows:

PMI(uv) = log( p(uv) / (p(u) p(v)) )

Where p(uv) represents the probability of the adjective-noun pair, uv, occurring in our corpus, i.e., its total frequency divided by the length of the corpus, and p(u) and p(v) represent the probabilities of the adjective, u, and the noun, v. As our corpus we used a recent dump of Wikipedia (compiled in December 2015). We developed this into a simple classifier by learning a threshold, β, from the development dataset, accepting a pair as a neologism if

PMI(uv) > β
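The PMI threshold classifier can be sketched directly from raw counts; the counts and threshold below are invented for illustration only.

```python
import math

def pmi(pair_count, adj_count, noun_count, corpus_size):
    """Pointwise mutual information of an adjective-noun pair from raw counts:
    p(uv) = pair_count / corpus_size, and analogously for p(u), p(v)."""
    p_uv = pair_count / corpus_size
    p_u = adj_count / corpus_size
    p_v = noun_count / corpus_size
    return math.log(p_uv / (p_u * p_v))

def is_neologism(pair_count, adj_count, noun_count, corpus_size, beta):
    """Accept the pair as a neologism if PMI(uv) > beta (beta tuned on dev data)."""
    return pmi(pair_count, adj_count, noun_count, corpus_size) > beta

# A pair that co-occurs far more often than chance gets a high PMI score.
score = pmi(pair_count=50, adj_count=1000, noun_count=200, corpus_size=1_000_000)
```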
The results from this (in line with our previous experience of this task) were little better than a majority-class baseline, and as such we developed a classifier that looked only at the words in the compound and deduced whether they were neological based on the words themselves. The principal reason for this is that we are attempting to distinguish between collocations and phrases representing novel concepts, and the frequencies of these are very similar, meaning that PMI does a very poor job of distinguishing these two similar but distinct linguistic phenomena. In this case we used a naïve Bayes classifier, which predicts that a word pair is a neologism if p(Neologism|uv) > p(¬Neologism|uv), where:

p(Neologism|uv) ∝ p(u|Neologism) p(v|Neologism) p(Neologism)

The relevant probabilities, such as p(u|Neologism), were simply deduced from the frequency with which a given adjective or noun occurred in our positive or negative training set. The resulting Naïve Bayes classifier provided (surprisingly) strong results, and so we continued to use it as a feature within our complete model.
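A word-level Naïve Bayes classifier of this kind can be sketched as follows. The add-one smoothing and the frequency-based class prior are our assumptions for the sketch; the paper does not specify these details, and the training pairs are stand-ins.

```python
from collections import Counter

class PairNaiveBayes:
    """Predict whether an adjective-noun pair is a neologism from per-word
    frequencies in the positive and negative training sets.
    Laplace (add-one) smoothing is an assumption, not specified in the paper."""

    def fit(self, positive_pairs, negative_pairs):
        self.pos_words = Counter(w for u, v in positive_pairs for w in (u, v))
        self.neg_words = Counter(w for u, v in negative_pairs for w in (u, v))
        self.p_pos = len(positive_pairs) / (len(positive_pairs) + len(negative_pairs))
        return self

    def _likelihood(self, counts, word):
        total = sum(counts.values())
        return (counts[word] + 1) / (total + 1)

    def predict(self, u, v):
        """True if p(Neologism|uv) > p(not-Neologism|uv), via Bayes' rule."""
        pos = self.p_pos * self._likelihood(self.pos_words, u) * self._likelihood(self.pos_words, v)
        neg = (1 - self.p_pos) * self._likelihood(self.neg_words, u) * self._likelihood(self.neg_words, v)
        return pos > neg

# Toy training data for illustration only.
clf = PairNaiveBayes().fit(
    positive_pairs=[("common", "viper"), ("critical", "mass")],
    negative_pairs=[("dangerous", "viper"), ("new", "congregation"),
                    ("British", "soldiers"), ("much", "teamwork")],
)
```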

Using Pretrained Models
We used three pretrained models for computing a single representation of adjective-noun pairs: USE Universal Sentence Encoders (Cer et al., 2018) were introduced to provide a way to make embeddings of whole sentences. As such, they directly model semantic compositionality, and we apply them by considering our term as a sentence and generating a 512-dimensional embedding of the term.
ELMo ELMo (Peters et al., 2018) is a pretrained language model that provides a deep contextual representation of a sentence. We used the 'small' model, which generates a representation of 1,024 dimensions.
BERT BERT (Devlin et al., 2018) has further innovated on pretrained models by training in both directions. We use the final sentence encoding of our term.

In order to deduce whether there was a significant improvement in the compositional representation learnt by these models, in contrast to the individual words, we also used a pretrained model for the individual words, namely GloVe (Pennington et al., 2014), which we chose as it has been shown to have good performance across a wide number of tasks. We developed a single vector to represent the adjective-noun pair by concatenating the two vectors we have from GloVe:

g_uv = (g_u; g_v)

As we discovered that the Naïve Bayes baseline model was very strong, we also calculated for each of the examples a feature vector, f_uv, derived from the word frequencies used by the Naïve Bayes classifier. We combined all these vectors as follows:

x = A p_uv + B g_uv + C f_uv    (1)

Where x ∈ R² and A, B and C are learned matrices; this single dense layer taking the three representations as input thus compares the pretrained representation, p_uv, with the GloVe representation, g_uv. This model is depicted in Figure 1. The error function for the network was cross-entropy over the softmax of the values of x. The softmax outputs two values, which represent the probability of a term being neological and not being neological respectively. All models were trained with the Adam optimizer (Kingma and Ba, 2014) for a total of 200 epochs with a learning rate of 0.01; at the end of each epoch the accuracy on the development set was evaluated, and the final model selected for evaluation on the test set was the one with the highest development accuracy. In general, this model occurred within the first 100 epochs, so we do not expect that more training would lead to better accuracy.
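The forward pass of such a combination model can be sketched with numpy; the dimensionalities and the randomly initialized matrices below are illustrative stand-ins, not the trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative inputs: a pretrained pair embedding (e.g. 512-d from USE),
# the concatenation of two 100-d GloVe vectors, and frequency-based features.
p_uv = rng.standard_normal(512)
g_uv = np.concatenate([rng.standard_normal(100), rng.standard_normal(100)])
f_uv = rng.standard_normal(2)

# Learned parameters (random stand-ins): each maps its input to R^2.
A = rng.standard_normal((2, 512))
B = rng.standard_normal((2, 200))
C = rng.standard_normal((2, 2))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

x = A @ p_uv + B @ g_uv + C @ f_uv   # x in R^2: the two class logits
probs = softmax(x)                    # p(neological), p(not neological)
```

Training would then minimize cross-entropy between `probs` and the gold labels, as described above.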

Results
We evaluated the model given in Equation 1 in a number of settings, by varying the inclusion of the features in the model. Firstly, we considered the model without the use of pretrained language models, using only the GloVe vectors, which we term the "feed forward" model; this can be considered as fixing the corresponding matrix (A) to zero. We used the GloVe vectors trained on the 6-billion-word corpus, which come in four dimensionalities: 50, 100, 200 and 300. We evaluated all of these settings and, in addition, the case where we did not use any GloVe vectors, which we labelled as "n/a". As such, the setting "feed forward (n/a)" can be considered as another baseline that does not use any features from deep neural networks. We then evaluated all these settings with the 3 pretrained language models, USE, ELMo and BERT, and the results are presented in Table 2. Statistical significance was calculated at two levels (Yeh, 2000).

Table 2: Results for the detection of neological adjective-noun terms using our models. * and † denote a statistically significant improvement over the Naïve Bayes baseline at p = 0.05 and p = 0.01 respectively.

The strongest result in accuracy, precision and F-measure is the BERT model with GloVe vectors of dimensionality 100, although the USE and ELMo methods present similar results with GloVe dimensionality of 100 or 200, suggesting that the use of pretrained models in general is helpful for the identification of neological adjective-noun phrases. The difference in performance between the choice of models was, however, not statistically significant. Furthermore, we also observe that the larger GloVe vectors are not helpful; observations of the test set accuracy, as well as preliminary experiments with more complex neural network architectures, suggest that over-fitting is the likely cause, given the comparatively small training set.
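The accuracy, precision and F-measure reported in Table 2 are computed from confusion-matrix counts in the standard way, treating 'neologism' as the positive class; a quick sketch with invented counts:

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Invented counts for illustration only; see Table 2 for the real results.
acc, prec, rec, f1 = metrics(tp=80, fp=20, fn=10, tn=890)
```

Note that with a negative-heavy test set, accuracy alone can look strong for a trivial majority-class predictor, which is why precision and F-measure are also reported.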
We found that the inclusion of the frequency features remained helpful; to evaluate this, we re-ran our best scoring model with the frequency features and present the results in the bottom row of Table 2. We see that the results without frequency features are still significantly better than the baseline; however, the inclusion of these features gives a sizeable increase in the performance of the system. As such, this suggests that there is still a role for traditional feature engineering approaches alongside deep learning methodologies for this task.
Further, we performed a qualitative analysis of the errors made by the system; we show examples of the errors generated by the ELMo-based system in Table 3. For most results it is hard to see why the system made an error; however, there are a few patterns, in that many of the false negatives seem to contain low-frequency adjectives such as 'antigenic' or 'Sullian'. In the false positives, as expected, we see some that should not be counted as errors, in particular 'alpha interferon'; this is due to the weaknesses in our methodology that we have previously noted. We also see many cases where it would be hard even for a human to decide whether they are truly compositional, such as 'natural world', 'Korean language' or 'constitutional law', supporting our conclusion that the system is producing near-human results for this task.

Conclusion
We have presented a method for identifying adjective-noun pairs as neologisms and have shown that the usage of pretrained language models improves significantly over other baselines. This is particularly interesting as the systems presented in this paper do not require the usage of a large corpus and as such can be robustly and easily applied to a large number of domains. However, we discovered that simple frequency features are still important and this suggests that the combination of linguistically motivated features as well as deep learning models is likely to provide the best results.

Table 3: Example errors made by the ELMo-based system.

False negatives: Suillus albivelatus, critical mass, Norwegian elkhound, free people, evolutionary trend, financial backing, total depravity, fluorescent fixture, right hand, antigenic determinant

False positives: uniform button, natural world, constitutional law, single tube, religious knowledge, transitional phase, pilot error, Korean language, alpha interferon, regulatory region