Adapting predominant and novel sense discovery algorithms for identifying corpus-specific sense differences

Word senses are not static and may have temporal, spatial or corpus-specific scopes. Identifying such scopes could substantially benefit existing word sense disambiguation (WSD) systems. In this paper, we study corpus-specific word senses and adapt three existing predominant and novel-sense discovery algorithms to identify them. We use text data available in the form of millions of digitized books and newspaper archives as two different corpus sources and propose automated methods to identify corpus-specific word senses at various time points. We conduct an extensive human judgment experiment to rigorously evaluate and compare the performance of these approaches. After adaptation, the outputs of the three algorithms are in the same format and their accuracy results are comparable, with roughly 45-60% of the reported corpus-specific senses being judged as genuine.


Introduction
Human language is neither static nor uniform. Almost every aspect of language, including phonological, morphological, syntactic and semantic structure, can exhibit differences, even within the same language. These differences can be influenced by many factors such as time, location and corpus type. However, to understand these differences properly, one needs to be able to analyze large volumes of natural language text collected from diverse corpora. It is only in this Big Data era that unprecedented amounts of text data have become available in the form of millions of digitized books (Google Books project), newspaper documents, Wikipedia articles as well as tweet streams. This huge volume of time- and location-stamped data across various types of corpora now allows us to make precise quantitative linguistic predictions, which were earlier observed only through mathematical models and computer simulations.
Scope of a word sense: One of the fundamental dimensions of language change is the shift in word usage and word senses (Jones, 1986; Ide and Veronis, 1998; Schütze, 1998; Navigli, 2009). A word may possess many senses; however, not all of the senses are used uniformly, and some are more common than others. This distribution can depend heavily on the underlying time period, location or type of corpus. For example, consider the word "rock". In books, it is usually associated with the sense reflected by the words 'stone, pebble, boulder' etc., while in newspapers and magazines it is mostly used in the sense of 'rock music'.
Motivation for this work: The world of technology is changing rapidly, and it is no surprise that word senses also reflect this change. Consider the word "brand". This word is mainly used for the 'brand-name' of a product. However, it has now become a shorthand reference to the skills, actions, personality and other publicly perceived traits of individuals, or to the reputation and public face of a whole group or company. The rise of social media and the ability to self-publish and self-advertise undoubtedly led to the emergence of this new sense of "brand". To further motivate such cross-corpus sense differences, consider the word 'relay'. A simple Google search in the News section produces results that are very different from those obtained through a search in the Books section (see Fig 1). In this paper, we attempt to automatically build corpus-specific contexts of a target word (e.g., relay in this case) that can appropriately discriminate the two different senses of the target word: one that is more relevant for the News corpus (context words extracted by one of our adapted methods: team, race, event, races, sprint, men, events, record, run, win) and another that is more relevant for the Books corpus (context words extracted by one of our adapted methods: solenoid, transformer, circuitry, generator, diode, sensor, transistor, converter, capacitor, transformers). Since search engine users mostly issue generic searches without any explicit mention of books or news, the target word along with a small associated context vector might help the search engine to retrieve documents from the most relevant corpus automatically. We believe that the target word and the automatically extracted corpus-specific context vector can be further used to enhance (i) semantic and personalized search, (ii) corpora-specific search and (iii) corpora-specific word sense disambiguation.
Identifying predominant word senses specific to various corpora is an important as well as challenging task. While researchers have started exploring the temporal and spatial scopes of word senses (Cook and Stevenson, 2010; Gulordava and Baroni, 2011; Kulkarni et al., 2015; Jatowt and Duh, 2014; Mitra et al., 2014; Mitra et al., 2015), corpora-specific senses have remained mostly unexplored.
Our contributions: Motivated by the above applications, this paper studies corpora-specific senses for the first time and makes the following contributions (see footnote 1): (i) we take two different methods for novel sense discovery (Mitra et al., 2014; Lau et al., 2014) and one for predominant sense identification (McCarthy et al., 2004) and adapt these in an automated and unsupervised manner to identify corpus-specific senses for a given word (noun), and (ii) we perform a thorough manual evaluation to rigorously compare the corpus-specific senses obtained using these methods. Manual evaluation conducted using 60 candidate words for each method indicates that ∼45-60% of the corpus-specific senses identified by the adapted algorithms are genuine. Our work is unique in that it adapts three very different types of major algorithms to identify corpora-specific senses.
Footnote 1: The code and evaluation results are available at: http:
Key observations: For manual evaluation of the candidate corpus-specific senses, we focused on two aspects: a) sense representation, which tells whether the word cluster obtained from a method is a good representative of the target word, and b) sense difference, which tells whether the sense represented by the corpus-specific cluster is different from all the senses of the word in the other corpus. Some of our important findings from this study are: (i) the number of candidate senses produced by McCarthy et al. (2004) is far smaller than that of the two other methods, (ii) Mitra et al. (2014) produces the best representative sense cluster for a word in the time period 2006-2008 and McCarthy et al. (2004) produces the best representative sense cluster for a word in the time period 1987-1995, (iii) Mitra et al. (2014) is able to identify sense differences more accurately than the other methods, (iv) considering both aspects together, McCarthy et al. (2004) performs the best, and (v) for the common results produced by Lau et al. (2014) and Mitra et al. (2014), the former does better sense differentiation while the latter does better overall.

Related Work
Automatic discovery and disambiguation of word senses from a given text is an important and challenging problem that has been extensively studied in the literature (Jones, 1986; Ide and Veronis, 1998; Schütze, 1998; Navigli, 2009; Kilgarriff and Tugwell, 2001; Kilgarriff, 2004). Only recently, with the availability of enormous amounts of data, have researchers begun exploring the temporal scopes of word senses. Cook and Stevenson (2010) use corpora from different time periods to study the change in the semantic orientation of words. Gulordava and Baroni (2011) use two different time periods in the Google n-grams corpus and detect semantic change based on distributional similarity between word vectors. Kulkarni et al. (2015) propose a computational model for tracking and detecting statistically significant linguistic shifts in the meaning and usage of words. Jatowt and Duh (2014) propose a framework for exploring semantic change of words over time on the Google n-grams and COHA datasets. Lau et al. (2014) propose a fully unsupervised topic modelling-based approach to sense frequency estimation, which was used for the tasks of predominant sense learning, sense distribution acquisition, detecting senses that are not attested in the corpus, and identifying novel senses in the corpus that are not captured in the sense inventory. Two recent studies by Mitra et al. (2014; 2015) capture temporal noun sense changes by proposing a graph clustering based framework for analysis of diachronic text data available from Google books as well as tweets. Other work quantifies semantic change by evaluating word embeddings against known historical changes. Lea and Mirella (2016) develop a dynamic Bayesian model of diachronic meaning change. Pelevina (2016) develops an approach that induces a sense inventory from existing word embeddings via clustering of ego-networks of related words. Cook et al. (2013) induce word senses and then identify novel senses by comparing two different corpora: the 'focus corpora' (i.e., a recent version of the corpora) and the 'reference corpora' (an older version of the corpora). Tahmasebi et al. (2011) propose a framework for tracking senses in a newspaper corpus containing articles between 1785 and 1985. Phani et al. (2012) study 11 years' worth of Bengali newswire, which allows them to extract trajectories of salient words that are of importance in contemporary West Bengal. Few works (Dorow and Widdows, 2003; McCarthy et al., 2004) have focused on corpus-specific sense identification. Our work differs from these works in that we capture cross-corpus sense differences by comparing the senses of a particular word obtained across two different corpora. We adapt three state-of-the-art novel and predominant sense discovery algorithms and extensively compare their performance on this task.

Dataset Description
To study corpora-specific senses, we consider books and newspaper articles as two different corpora sources. We compare these corpora for the same time-periods to ensure that the sense differences are obtained only because of the change in corpus and not due to the difference in time. A brief description of these datasets is given below.

Proposed framework
To identify corpus-specific word senses, we adapt existing algorithms that have been used for related tasks. In principle, we compare all the senses of a word in one corpus against all the senses of the same word in another corpus. We therefore base this work on three different approaches, Mitra et al. (2014), Lau et al. (2014) and McCarthy et al. (2004), which can be adapted to find word senses in different corpora in an unsupervised manner. Next, we discuss these methods briefly, followed by the proposed adaptation technique and the generation of the candidate set.

Mitra's Method
Mitra et al. (2014) proposed an unsupervised method to identify noun sense changes over time. They prepare separate distributional-thesaurus-based networks (DT) (Biemann and Riedl, 2013) for the two different time periods. Once the DTs have been constructed, the Chinese Whispers (CW) algorithm (Biemann, 2006) is used for inducing word senses over each DT. For a given word, the sense clusters across two time points are compared using a split-join algorithm. Proposed adaptation: In our adaptation, we apply the same framework but over the two different corpus sources in the same time period. So, for a given word w that appears in both the books and newspaper datasets, we get two different sets of clusters, B and N, respectively for the two datasets.
A corpus-specific sense will predominantly be present only in that specific corpus and will be absent from the other corpus. To detect the book-specific sense for the word w, we compare each of the |B| book clusters against all of the |N| newspaper clusters. Thus, for each cluster s_{bi}, we identify the fraction of words that are not present in any of the |N| newspaper clusters. If this value is above a threshold, we call s_{bi} a book-specific sense cluster for the word w. This threshold has been set to 0.8 for all the experiments, as also reported in Mitra et al. (2014).
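The cluster comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' released code: the function name `corpus_specific_clusters` and the toy clusters are hypothetical, and clusters are assumed to be plain sets of words.

```python
from typing import Iterable, List, Set


def corpus_specific_clusters(
    own_clusters: Iterable[Set[str]],
    other_clusters: Iterable[Set[str]],
    threshold: float = 0.8,
) -> List[Set[str]]:
    """Return the clusters whose fraction of unseen words exceeds `threshold`."""
    # Pool every word that occurs in any sense cluster of the other corpus.
    other_vocab: Set[str] = set()
    for cluster in other_clusters:
        other_vocab |= set(cluster)

    specific = []
    for cluster in own_clusters:
        if not cluster:
            continue  # an empty cluster carries no sense information
        unseen = sum(1 for word in cluster if word not in other_vocab)
        if unseen / len(cluster) > threshold:
            specific.append(cluster)
    return specific


# Toy clusters for the word "windows" (illustrative words, not real data).
book_clusters = [
    {"doors", "walls", "roof", "shutters", "balcony"},
    {"software", "desktop", "linux", "screen", "panes"},
]
news_clusters = [{"linux", "software", "microsoft", "desktop", "browser"}]

book_specific = corpus_specific_clusters(book_clusters, news_clusters)
news_specific = corpus_specific_clusters(news_clusters, book_clusters)
```

In this toy input only the first book cluster is flagged: none of its words occur in a newspaper cluster, whereas the second book cluster and the newspaper cluster each share three of their five words with the other corpus.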
We also apply the multi-stage filtering described in their paper (majority voting after multiple runs of CW, and restriction to the POS tags 'NN' and 'NNS') to obtain the candidate words, except that we do not filter out the top 20% and bottom 20% of the words. We believe that removing the top 20% of the words would deprive us of many good cases. To take care of rare words, we consider only those corpus-specific clusters that have ≥ 10 words.
The number of candidate words obtained after this filtering is shown in Table 1. Figure 2 (a,b) illustrates two different sense clusters of the word 'windows', one specific to the books corpus and another specific to the newspaper corpus, as obtained using Mitra's method. The book-specific sense corresponds to 'an opening in the wall or roof of a building'. The newspaper-specific sense, on the other hand, is related to the computing domain, suggesting the Windows operating system.

McCarthy's Method
McCarthy et al. (2004) developed a method to find the predominant sense of a target word w in a given corpus. The method requires the nearest neighbors of the target word, along with the distributional similarity score between the target word and its neighbors. It then assigns a prevalence score to each WordNet synset ws_i of w by comparing this synset to the neighbors of w. The prevalence score PS_i for the synset ws_i is given by

$$PS_i = \sum_{n_j \in N_w} dss(w, n_j) \times \frac{wnss(ws_i, n_j)}{\sum_{ws_{i'} \in senses(w)} wnss(ws_{i'}, n_j)} \qquad (1)$$

where N_w denotes the set of neighbors of w, senses(w) denotes the set of WordNet synsets of w, and dss(w, n_j) denotes the distributional similarity between the word w and its neighbor n_j. wnss(ws_i, n_j) denotes the WordNet similarity between the synset ws_i and the word n_j, and is given by

$$wnss(ws_i, n_j) = \max_{ns_x \in senses(n_j)} ss(ws_i, ns_x) \qquad (2)$$

where ss(ws_i, ns_x) denotes the semantic similarity between the WordNet synsets ws_i and ns_x. We use the Lin similarity measure to find the similarity between two WordNet synsets. Proposed adaptation: In our adaptation of McCarthy's method to find corpus-specific senses, we use the DT networks constructed for Mitra's method to obtain the neighbors as well as the distributional similarity between a word and its neighbors. We then obtain the prevalence score for each sense of the target word for both corpus sources separately, and normalize these scores so that they add up to 1.0 for each corpus. We call these the normalized prevalence scores (NPS).
We call a sense ws_i corpus-specific if its NPS_i is greater than an upper threshold in one corpus and less than a lower threshold in the other corpus. We use 0.4 as the upper threshold and 0.1 as the lower threshold for our experiments. The number of candidate words obtained after applying these thresholds is shown in Table 2. For the purpose of distributional visualization of the senses, we denote a word sense ws_i using those neighbors of the word that make the highest contribution to the prevalence score PS_i. Figure 2 (c,d) illustrates two sense clusters of the word 'lap' thus obtained, one specific to the books corpus and another specific to the newspaper corpus. The book-specific sense corresponds to 'the top surface of the upper part of the legs of a person who is sitting down'. The news-specific sense, on the other hand, corresponds to 'a complete trip around a race track that is repeated several times during a competition'.
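The scoring and thresholding steps can be illustrated with a short Python sketch. It assumes the prevalence score takes the usual McCarthy et al. (2004) form, in which each neighbor's distributional similarity is shared among the senses in proportion to their WordNet similarity. All names (`prevalence_scores`, `normalize`, `corpus_specific_senses`), the sense labels and the similarity values are hypothetical; in practice the WordNet similarities would come from a Lin similarity implementation rather than a hand-written table.

```python
from typing import Dict, List, Mapping


def prevalence_scores(
    dss: Mapping[str, float],
    wnss: Mapping[str, Mapping[str, float]],
) -> Dict[str, float]:
    """Prevalence score per sense.

    dss[n]     : distributional similarity between the target word and neighbor n
    wnss[s][n] : WordNet similarity between sense s of the target and neighbor n
    """
    scores = {sense: 0.0 for sense in wnss}
    for neighbor, weight in dss.items():
        total = sum(wnss[sense].get(neighbor, 0.0) for sense in wnss)
        if total == 0.0:
            continue  # this neighbor is unrelated to every sense
        for sense in wnss:
            scores[sense] += weight * wnss[sense].get(neighbor, 0.0) / total
    return scores


def normalize(scores: Mapping[str, float]) -> Dict[str, float]:
    """Rescale the scores so that they sum to 1.0 (the NPS values)."""
    total = sum(scores.values())
    if total == 0.0:
        return {sense: 0.0 for sense in scores}
    return {sense: value / total for sense, value in scores.items()}


def corpus_specific_senses(
    nps_here: Mapping[str, float],
    nps_other: Mapping[str, float],
    upper: float = 0.4,
    lower: float = 0.1,
) -> List[str]:
    """Senses that are prevalent here (> upper) and rare in the other corpus (< lower)."""
    return [
        sense
        for sense, value in nps_here.items()
        if value > upper and nps_other.get(sense, 0.0) < lower
    ]


# Made-up similarities for two senses of "lap" (illustrative only).
wnss = {
    "lap.body": {"knee": 0.9, "shoulder": 0.8, "race": 0.1, "sprint": 0.05},
    "lap.circuit": {"knee": 0.1, "shoulder": 0.2, "race": 0.9, "sprint": 0.95},
}
nps_books = normalize(prevalence_scores({"knee": 0.6, "shoulder": 0.4}, wnss))
nps_news = normalize(prevalence_scores({"race": 0.6, "sprint": 0.4}, wnss))

book_specific = corpus_specific_senses(nps_books, nps_news)
news_specific = corpus_specific_senses(nps_news, nps_books)
```

With these made-up numbers the body-part sense is flagged as book-specific, while the race-track sense is not flagged as news-specific because its normalized score in books (about 0.14) stays above the lower threshold. This shows how the two thresholds act together as a conservative filter.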

Lau's Method
We also adapt the method described in Lau et al. (2014) to find corpus-specific word senses. Their method uses topic modeling to estimate word sense distributions and is based on the word sense induction (WSI) system described in Lau et al. (2012). The system is built around a Hierarchical Dirichlet Process (HDP) (Teh et al., 2006), which optimises the number of topics in a fully unsupervised fashion over the training data. For each word, they first induce topics using HDP. The words having the highest probabilities in each topic denote the sense cluster. The authors treat the novel sense identification task as identifying sense clusters that do not align well with any of the pre-existing senses in the sense inventory. They use topic-to-sense affinity to estimate the similarity of a topic to the set of senses, given as

$$\text{ts-affinity}(t_j) = \frac{\sum_{i=1}^{S} Sim(s_i, t_j)}{\sum_{l=1}^{T} \sum_{k=1}^{S} Sim(s_k, t_l)} \qquad (3)$$

where T and S represent the number of topics and senses respectively, and Sim(s_i, t_j) is defined as

$$Sim(s_i, t_j) = 1 - JS(S_i, T_j) \qquad (4)$$

where S_i and T_j denote the multinomial distributions over words for sense s_i and topic t_j, and JS(X, Y) stands for the Jensen-Shannon divergence between distributions X and Y. Proposed adaptation: In our adaptation of their method to find corpus-specific senses, a topic of a target word is called corpus-specific if its word distribution is very different from those of all the topics in the other corpus. We therefore compute the similarity of this topic to all the topics in the other corpus, and if the maximum similarity is below a threshold, the topic is called corpus-specific. We use Equation 4 to compute the similarity between two topics t_i and t_j as Sim(t_i, t_j).
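The topic comparison can be sketched as follows. This is a minimal illustration under two stated assumptions: the similarity is taken to be one minus the Jensen-Shannon divergence, and the divergence is computed with base-2 logarithms so that it lies in [0, 1]. Topics are represented as hypothetical word-to-probability dictionaries, and the function names are illustrative rather than part of any released implementation.

```python
import math
from typing import Dict, List, Sequence

Dist = Dict[str, float]


def js_divergence(p: Dist, q: Dist) -> float:
    """Jensen-Shannon divergence with base-2 logs (assumed), bounded by [0, 1]."""
    divergence = 0.0
    for word in set(p) | set(q):
        pw, qw = p.get(word, 0.0), q.get(word, 0.0)
        mw = 0.5 * (pw + qw)  # mixture distribution
        if pw > 0.0:
            divergence += 0.5 * pw * math.log2(pw / mw)
        if qw > 0.0:
            divergence += 0.5 * qw * math.log2(qw / mw)
    return divergence


def similarity(p: Dist, q: Dist) -> float:
    """Similarity between two topics, assumed to be 1 - JS divergence."""
    return 1.0 - js_divergence(p, q)


def corpus_specific_topics(
    own_topics: Sequence[Dist],
    other_topics: Sequence[Dist],
    threshold: float,
) -> List[int]:
    """Indices of topics whose best match in the other corpus is below `threshold`."""
    specific = []
    for index, topic in enumerate(own_topics):
        best = max((similarity(topic, other) for other in other_topics), default=0.0)
        if best < threshold:
            specific.append(index)
    return specific


# Toy topics for the word "lime" (probabilities are made up).
book_topics = [
    {"calcium": 0.5, "oxide": 0.3, "mortar": 0.2},
    {"juice": 0.5, "lemon": 0.5},
]
news_topics = [{"juice": 0.6, "lemon": 0.4}]

book_specific = corpus_specific_topics(book_topics, news_topics, threshold=0.35)
news_specific = corpus_specific_topics(news_topics, book_topics, threshold=0.35)
```

Here the mineral topic shares no words with the newspaper topic, so its maximum similarity is 0 and it is flagged as book-specific, while the two 'juice, lemon' topics are nearly identical and neither is flagged.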
Since Lau's method is computationally expensive to run over the whole vocabulary, we run it only for those candidate words that were flagged by Mitra's method. We then use a threshold to select only those topics that have low similarity to all the topics in the other corpus. We use 0.35 as the threshold for all 4 cases except for news-specific senses in 2006-2008, where a threshold of 0.2 was used. The number of candidate corpus-specific senses thus obtained is shown in Table 3. Note that a word may have multiple corpus-specific senses. Figure 2 (e,f) illustrates two different word clusters of the word 'lime', one specific to the books corpus and another specific to the newspaper corpus, as obtained by applying their method. The book-specific sense corresponds to 'mineral and industrial forms of calcium oxide'. The news-specific sense, on the other hand, is related to 'lemon, lime juice'.

Evaluation Framework and Results
In this section, we discuss our framework for evaluating the candidate corpus-specific senses obtained from the three methods. We perform manual evaluations using an online survey among ∼27 participants who agreed to take part (students, researchers, professors and technical staff), aged between 18 and 34 years. We randomly selected 60 candidate corpus-specific senses (combining both corpus sources) from each of the three methods (roughly 30 words from each time period). Each participant was given a set of 20 candidate words to evaluate; thus each candidate sense was evaluated by 3 different annotators. In the survey, the candidate word was presented with its corpus-specific sense cluster (represented by word clouds of the words in the cluster) and all the sense clusters in the other corpus.
Questions to the participants: The participants were asked two questions. First, is the candidate corpus-specific sense cluster a good representative sense of the target word? Second, is the sense represented by the corpus-specific cluster different from all the senses of the word in the other corpus? The participants could answer the first question as 'Yes' or 'No', and this response was taken as a measure of the "sense representation" accuracy of the underlying scheme. If this answer was 'No', the answer to the second question was set to 'NA'. If this answer was 'Yes', they would answer the second question as 'Yes' or 'No', which was taken as a measure of the "discriminative sense detection" accuracy of the underlying method for comparing the senses across the two corpora. The overall confidence of a method was obtained by combining the two responses, i.e., by checking whether both responses are 'Yes'. The accuracy values are computed in two ways: majority voting, where we take the output as 'Yes' if the majority of the responses are in agreement with the system, and average accuracy, where we find the fraction of responses that are in agreement with the system. Since each case is evaluated by 3 participants, micro- and macro-averages will be similar.
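The two accuracy measures described above can be sketched as follows. This is an illustrative sketch with made-up responses: the function names are hypothetical, a response of 'Yes' is taken to mean agreement with the system, and the overall-confidence label is derived per annotator as 'Yes' only when both answers are 'Yes'.

```python
from typing import List, Sequence


def majority_accuracy(cases: Sequence[Sequence[str]]) -> float:
    """Fraction of cases in which more than half of the annotators answered 'Yes'."""
    accepted = sum(1 for votes in cases if 2 * votes.count("Yes") > len(votes))
    return accepted / len(cases)


def average_accuracy(cases: Sequence[Sequence[str]]) -> float:
    """Fraction of all individual responses that are 'Yes'."""
    total = sum(len(votes) for votes in cases)
    agreeing = sum(votes.count("Yes") for votes in cases)
    return agreeing / total


def overall_responses(
    q1: Sequence[Sequence[str]], q2: Sequence[Sequence[str]]
) -> List[List[str]]:
    """Per-annotator overall label: 'Yes' only if both answers are 'Yes'."""
    return [
        ["Yes" if a == "Yes" and b == "Yes" else "No" for a, b in zip(r1, r2)]
        for r1, r2 in zip(q1, q2)
    ]


# Four toy cases, three annotators each (responses are made up).
q1 = [["Yes", "Yes", "Yes"], ["Yes", "Yes", "No"], ["No", "No", "Yes"], ["Yes", "Yes", "Yes"]]
q2 = [["Yes", "Yes", "No"], ["Yes", "No", "NA"], ["NA", "NA", "Yes"], ["Yes", "Yes", "Yes"]]

overall = overall_responses(q1, q2)
q1_majority = majority_accuracy(q1)
q1_average = average_accuracy(q1)
overall_majority = majority_accuracy(overall)
overall_average = average_accuracy(overall)
```

How 'NA' responses enter the denominator for the second question is not specified in the text, so the sketch only computes the first question and the combined measure.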
Accuracy results: Table 4 shows the accuracy figures for the underlying methods. Mitra's and McCarthy's methods perform better for sense representation, and Mitra's method performs very well for discriminative sense detection. For discriminative sense detection, there were a few undecided cases. In terms of overall confidence, we observe that McCarthy's method performs the best. Note that the number of candidate senses returned by McCarthy's method was much smaller than for the other methods. Mitra's method performs comparably for both time periods, while Lau's method performs comparably only for 2006-2008.
Inter-annotator agreement: The inter-annotator agreement for the three methods using Fleiss' kappa is shown in Table 6. We see that the inter-annotator agreement for Question 2 is much lower than that for Question 1. This is quite natural, since Question 2 is much more difficult to answer than Question 1 even for humans.
Comparison among methods: Further, we wanted to check the relative performance of the three approaches on a common set of words. McCarthy's output did not have any overlap with the other methods, but for Lau and Mitra, among the words selected for manual evaluation, 30 words were common. We show the comparison results in Table 5. While Lau performs better on discriminative sense detection accuracy, Mitra performs much better overall.
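For reference, the Fleiss' kappa statistic used for inter-annotator agreement can be computed as in the sketch below. It uses the standard definition for a fixed number of raters per item; the function name and the toy ratings are illustrative and unrelated to the survey data.

```python
from collections import Counter
from typing import Sequence


def fleiss_kappa(ratings: Sequence[Sequence[str]]) -> float:
    """Fleiss' kappa for items that are each labeled by the same number of raters."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    category_totals: Counter = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        category_totals.update(counts)
        # Proportion of agreeing rater pairs for this item.
        pairs = sum(c * c for c in counts.values()) - n_raters
        agreement_sum += pairs / (n_raters * (n_raters - 1))

    observed = agreement_sum / n_items
    expected = sum(
        (total / (n_items * n_raters)) ** 2 for total in category_totals.values()
    )
    if expected == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (observed - expected) / (1.0 - expected)


# Four toy items rated by three annotators each.
toy_ratings = [
    ["Yes", "Yes", "Yes"],
    ["No", "No", "No"],
    ["Yes", "Yes", "No"],
    ["Yes", "No", "No"],
]
kappa = fleiss_kappa(toy_ratings)
```

For the second question, 'NA' could be passed as a third category; whether the reported figures treat it that way is not stated in the text.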

Discussion
In this section, we discuss the results further by analyzing some of the responses. In Table 7, we provide one example entry for each of the three possible responses for the three methods. Lau's method: In Lau's method, consider the word 'navigation'. Its news-specific sense cluster corresponds to a device for accurately ascertaining one's position and for planning and following a route. The sense clusters in the books corpus relate to navigation as a passage for ships, among other senses, and are different from the news-specific sense. The participants accordingly evaluated it as a news-specific sense. For the word 'fencing', the book-specific cluster corresponds to the sense of fencing as a sport in which participants fight with swords under some rules. We can see that the first sense cluster from the news corpus has a similar sense and, accordingly, it was not judged as a corpus-specific sense. Finally, the book-specific cluster of 'stalemate' does not denote any coherent sense, as also judged by the evaluators.
McCarthy's method: In McCarthy's method, consider the word 'pisces'. The book-specific cluster corresponds to the 12th sign of the zodiac in astrology. None of the clusters in the news corpus denote this sense and it was evaluated as book-specific. For the word 'filibuster', the news-specific sense corresponds to an adventurer in a private military action in a foreign country. We can see that the cluster in the other corpus has the same sense, and it was not judged as corpus-specific. The news-specific sense cluster for the word 'agora' does not correspond to any coherent sense of the word and was judged accordingly.
Mitra's method: Finally, coming to Mitra's method, consider the word 'chain'. Its news-specific cluster corresponds to the sense of a series of establishments, such as stores, theaters, or hotels, under a common ownership or management. The sense clusters in the books corpus, on the other hand, relate to chemical bonds, series of links of metals, polymers, etc. Thus, this sense of 'chain' was evaluated as news-specific. Take the word 'divider'. Its book-specific cluster corresponds to an electrical device used for various measurements. We can see that some of the clusters in the news corpus also have a similar sense (e.g., 'pulses, amplifiers, proportional, pulse, signal, frequencies, amplifier, voltage'). Thus, this particular sense of 'divider' was not judged as a corpus-specific sense. Finally, the news-specific cluster of the word 'explanations' does not look very coherent and was judged as not representing a sense of 'explanations'.
In general, corpus-specific senses, such as 'navigation' as 'gps, device, software' being news-specific, 'pisces' as '12th sign of the zodiac' being book-specific and 'chain' as 'series of establishments' being news-specific, look quite sensible.

Table 7: Example cases from the evaluation. The first column gives the method name, the corpus to which the sense is specific, the time period and the candidate word. The second column gives the responses to the two questions. The corpus-specific sense cluster is shown in the third column, and the fourth column shows the sense clusters in the other corpus, separated by '##'.

Parameter Tuning
To make our experiments more rigorous, we performed parameter tuning on Lau's and McCarthy's methods to find the thresholds that give the best accuracy. We selected 50 words from each method to evaluate. Of these, 11 words are from the time period 1987-1995 and the rest are from the time period 2006-2008.
Lau's method: For Lau's method, the thresholds represent maximum similarity, so a lower value is more restrictive than a higher value. We selected three thresholds (0.30, 0.35, 0.40) for Lau's method for our experiment. Table 9 shows the total number of candidate words, the number of words selected and the average accuracy (overall confidence) for each threshold. First, we randomly selected 0.26% of the words from the most restrictive threshold (i.e., 0.30). For the next threshold (0.35), since it contains all the words of the lower threshold (0.30), we randomly selected 0.26% of the words from the remaining 3715 words. We did the same for the threshold 0.40. Using the 50 words thus obtained, we performed the evaluation. We used the same evaluation method as outlined in Section 5.
McCarthy's method: For McCarthy's method, we have an upper and a lower threshold. A higher value for the upper threshold and/or a lower value for the lower threshold makes the selection more restrictive. Thus, a value of 0.45 for the upper threshold and 0.05 for the lower threshold is the most restrictive setting in our set of thresholds. The total number of words, the number of words selected for evaluation and the overall confidence are shown in Table 8. We used the same technique as for Lau's method to evaluate a total of 50 words.
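The incremental sampling used for both methods, where each looser threshold contributes a sample drawn only from the words it newly admits, can be sketched as below. The function name, the rounding of the sample size, the fixed seed and the toy word lists are all assumptions made for illustration; the text does not specify how fractional sample sizes were handled.

```python
import random
from typing import List, Sequence, Set


def sample_nested(
    candidates_by_threshold: Sequence[Set[str]],
    rate: float,
    seed: int = 0,
) -> List[List[str]]:
    """Sample `rate` of the words newly admitted at each successively looser threshold.

    `candidates_by_threshold` is ordered from most to least restrictive, and each
    set is expected to contain the previous one.
    """
    rng = random.Random(seed)
    seen: Set[str] = set()
    samples = []
    for candidates in candidates_by_threshold:
        fresh = sorted(candidates - seen)  # sorted so the draw is reproducible
        size = round(rate * len(fresh))  # rounding is an assumption
        samples.append(rng.sample(fresh, size))
        seen |= candidates
    return samples


# Toy nested candidate sets for three thresholds (hypothetical words).
strict = {f"w{i}" for i in range(10)}
medium = {f"w{i}" for i in range(20)}
loose = {f"w{i}" for i in range(40)}

samples = sample_nested([strict, medium, loose], rate=0.2)
sizes = [len(sample) for sample in samples]
```

Under this reading, the words evaluated for a given threshold would be the union of the samples drawn up to that level, so that every sampled word is reused by all looser settings.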
We can see that a higher value (less restrictive) of the threshold provides better results in case of Lau. For McCarthy, we infer that a higher value (more restrictive) of upper threshold and a higher value (less restrictive) of the lower threshold is optimal.

Conclusions and future work
To summarize, we adapted three different methods for novel and predominant sense detection to identify cross-corpus word sense differences. In particular, we used multi-stage filtering to restrict the candidate senses in Mitra's method, used JS similarity across the sense clusters of two different corpus sources in Lau's method, and used thresholds on the normalized prevalence score as well as the concept of denoting a sense cluster using the most contributing neighbors in McCarthy's method. From the example cases, it is clear that after our adaptations, the outputs of the three methods have very similar formats. In the manual evaluation, the overall confidence in the methods was around 45-60% in most cases. There is certainly scope in future for using advanced methods for comparing sense clusters, which could improve the accuracy of discriminative sense detection by these algorithms. Further, it will also be interesting to look into novel ways of combining results from different approaches.