Deep Neural Models of Semantic Shift

Diachronic distributional models track changes in word use over time. In this paper, we propose a deep neural network diachronic distributional model. Instead of modeling lexical change via a time series as in previous work, we represent time as a continuous variable and model a word's usage as a function of time. Additionally, we propose a novel synthetic task that quantitatively measures how well a model captures the semantic trajectory of a word over time. Finally, we explore how the derivatives of our model can be used to measure the speed of lexical change.

Diachronic distributional models are distributional models where the vector for a word changes over time. Thus, we can calculate the cosine similarity between the vectors for a word at two different time points to measure how much that word has changed over time, and we can perform a nearest neighbor analysis to understand in what direction a word is changing. For example, diachronic distributional models can detect that the word gay has greatly changed by comparing the word vector for gay across different time points. They can also be used to discover that gay has shifted its meaning from happy to homosexual by analyzing when those words show up as nearest neighbors to gay.
Previous research in diachronic distributional semantics has used models where data is partitioned into time bins and a synchronic model is trained on each bin. A synchronic model is a vanilla, time-independent distributional model, such as skip-gram. However, data binning raises several technical issues. If the bins are too large, the model can only achieve an extremely coarse-grained representation of lexical change over time. If the bins are too small, the synchronic models are trained on insufficient data.
In this paper, we have built the first diachronic distributional model that represents time as a continuous variable instead of employing data binning. There are several advantages to treating time as continuous. The first is realism: large-scale change in the meaning of a word is the result of change happening one person at a time, so semantic change must be a gradual process. By treating time as a continuous variable, we can capture this gradual shift. The second advantage is that it allows a richer representation of the underlying causes of lexical change. Words change usage in reaction to real-world events, and multiple words can be affected by the same event. For example, the usage of gay and lesbian has changed in similar ways due to changing perceptions of homosexuality in society. By associating time with a vector and making word representations a function of that vector, we can model a single underlying cause affecting multiple words similarly.
It is difficult to evaluate how well diachronic distributional models capture semantic shift because gold data is extremely difficult to acquire. Distributional models are traditionally evaluated with word similarity judgments, which we cannot obtain for word usage in the past. Thus, evaluation of diachronic distributional models is itself a focus of research (Hellrich and Hahn, 2016; Dubossarsky et al., 2017). Our approach is to create a synthetic task that measures how well a model captures gradual semantic shifts.
We will also explore how we can use our model to predict the speed at which a word changes. Our model is differentiable with respect to time, which gives us a natural way to measure the velocity, and thus speed, of a word at a given time. We explore the capabilities and limitations of this approach.
In short, our paper provides the following contributions:
• We have developed the first continuous diachronic distributional model. This is also the first diachronic distributional model using a deep neural network.
• We have designed an evaluation of a model's ability to capture semantic shift that tracks gradual change.
• We have used the derivatives of our model as a natural way to measure the speed of word use change.

Related work
Previous research in diachronic distributional models has applied a binning approach: researchers partition the data into bins based on time and train a synchronic distributional model on each bin's data (see Figure 1). Several authors have used large bin models, such as five-year bins (Kulkarni et al., 2015), decade-sized bins (Gulordava and Baroni, 2011; Xu and Kemp, 2015; Jatowt and Duh, 2014; Hamilton et al., 2016a,b; Hellrich and Hahn, 2016, 2017), and era-sized bins (Sagi et al., 2009, 2011).

Figure 1: Difference between our approach and previous work. Previous work in diachronic distributional models (a) has trained synchronic distributional models on consecutive time bins. In our work (b), a neural network takes word and time as input and produces a time-specific word vector. In (c), we sketch that previous work produces a jagged semantic trajectory (blue, solid curve) whereas our model produces a smooth semantic trajectory (pink, dotted curve).
Others have used small bins, such as year-sized bins, where data sparsity is mitigated by preinitializing each bin's model with the model from the previous time bin (Kim et al., 2014). Bamler and Mandt (2017) developed a small bin probabilistic approach that uses transition probabilities to lessen data issues. They have two versions of their method: the first trains the distribution in each bin iteratively, and the second trains a joint distribution over all bins. In this paper, we only explore the first version, as the second does not scale well to large vocabulary sizes. Following Bamler and Mandt (2017), we compare to the models of Hamilton et al. (2016b), Kim et al. (2014), and the first version of Bamler and Mandt's model.

There have been other models of lexical change besides distributional ones. Topic modeling has been used to track how the topics associated with a word change over time (Wijaya and Yeniterzi, 2011; Frermann and Lapata, 2016). Sentiment analysis has been applied to determine how the sentiments associated with a word change over time (Jatowt and Duh, 2014).
As mentioned in the introduction, it is difficult to quantitatively evaluate diachronic distributional models due to the lack of gold data. Thus, previous research has attempted alternative routes to quantitative evaluation. One route is intrinsic evaluation, such as measuring a trajectory's smoothness (Bamler and Mandt, 2017). However, intrinsic measures do not directly measure semantic shift, which is the main use of diachronic distributional models. Hamilton et al. (2016b) use attested shifts documented by historical linguists. However, outside of first attestations, it is difficult even for historical linguists to accurately detail semantic shifts (Deo, 2015). Additionally, the task used by Hamilton et al. is unusable for model comparison, as all but one model achieved 100% accuracy on it. Kulkarni et al. (2015) used a synthetic task to evaluate how well diachronic distributional models can detect semantic shift. They took 20 copies of Wikipedia, each a synthetic version of a time bin, and changed several words in the last 10 copies. Models were then evaluated on their ability to detect when those words changed. Our evaluation improves upon this one by drawing test data from a diachronic corpus and by modeling lexical change as a gradual process rather than searching for a single change point.

Models
In this section, we describe the four diachronic distributional models that we analyze in this work. Three come from previous research and serve as benchmarks. Each of the four models is based on skip-gram with negative sampling (SGNS); they differ in how they apply SGNS to changes over time.
Skip-gram with negative sampling (SGNS) is a word embedding model that learns a latent representation of word usage (Mikolov et al., 2013). For target words w and context words c, vector representations \vec{w} and \vec{c} are learned to best predict whether c will appear in the context of w in a corpus; k negative contexts are randomly sampled for each positive context. Vector representations are computed by optimizing the following loss function:

L = - \sum_{(w,c) \in D} \Big( \log \sigma(\vec{w} \cdot \vec{c}) + k \, \mathbb{E}_{c_N \sim P_D} [\log \sigma(-\vec{w} \cdot \vec{c}_N)] \Big)   (1)

where D is a list of target-context pairs extracted from the corpus, P_D is the unigram distribution on the corpus, σ is the sigmoid function, and k is the number of negative samples.
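To make the objective concrete, the following is a minimal sketch of this loss for a single target-context pair in PyTorch; the function name and tensor layout are our own illustration, not from the original work.

```python
import torch
import torch.nn.functional as F

def sgns_loss(w_vec, c_pos, c_negs):
    """SGNS loss for one positive pair plus k sampled negatives (a sketch).
    w_vec: (d,) target vector; c_pos: (d,) context vector;
    c_negs: (k, d) negative context vectors drawn from P_D."""
    pos = F.logsigmoid(w_vec @ c_pos)            # log sigma(w . c)
    neg = F.logsigmoid(-(c_negs @ w_vec)).sum()  # sum over the k negatives
    return -(pos + neg)                          # minimize negative log-likelihood
```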

Binning by Decade
The first diachronic distributional model we will consider is a large time bin model proposed by Hamilton et al. (2016b). Here, time is partitioned into decades and an SGNS model is trained on each decade's worth of data. We label this model LargeBin.

Preinitialization
The second diachronic distributional model we will consider is a small time bin model proposed by Kim et al. (2014). Here, time is partitioned into years and an SGNS model is trained on each year's worth of data. Data issues are mitigated by preinitializing the model for a given time bin with the vectors of the preceding time bin (Kim et al., 2014). We label this model SmallBinPreInit.
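Schematically, the training procedure can be sketched as below; make_sgns and train_on are hypothetical helpers standing in for SGNS model construction and per-bin training.

```python
import copy

def train_small_bin_preinit(data_by_year, make_sgns, train_on):
    """Sketch: one SGNS model per year, each preinitialized with the
    previous year's vectors (after Kim et al., 2014)."""
    models, prev = {}, None
    for year in sorted(data_by_year):
        # First bin starts fresh; later bins start from the previous bin's model.
        model = make_sgns() if prev is None else copy.deepcopy(prev)
        train_on(model, data_by_year[year])
        models[year] = model
        prev = model
    return models
```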

Prior and Transition Probabilities
The third diachronic distributional model we will consider comes from Bamler and Mandt (2017). Bamler and Mandt take a probabilistic approach to modeling semantic change over time. The idea is to transform the SGNS loss function into a probability distribution over the target and context vectors. Then, to create a better diachronic distributional model, they apply priors to this distribution.
The first two priors are Gaussian distributions with mean zero on the vector variables, which discourage the vectors from growing too large (Barkan, 2017). More formally:

P_1 = \prod_{w \in W} \mathcal{N}(\vec{w}; 0, \alpha_1^{-1} I) \qquad P_2 = \prod_{c \in C} \mathcal{N}(\vec{c}; 0, \alpha_1^{-1} I)

where α_1 is a hyperparameter. The last two priors are also Gaussian distributions on the vector variables, with means given by the vector representations from the previous bin. The goal of these priors is to discourage a vector variable from deviating from the previous bin's vectors:
P_3 = \prod_{w \in W} \mathcal{N}(\vec{w}; \vec{w}_{prev}, \alpha_2^{-1} I) \qquad P_4 = \prod_{c \in C} \mathcal{N}(\vec{c}; \vec{c}_{prev}, \alpha_2^{-1} I)

where α_2 is a hyperparameter and \vec{w}_{prev} and \vec{c}_{prev} are the vectors from the previous time bin.
Figure 2: Diagram of DiffTime. timevec(t) encodes temporal information as a vector. M_W encodes lexical information as a matrix. The target vector for w at time t, use_W(w, t), is found by combining Trans_w and timevec(t). The context version use_C(c, t) is the same except that it has its own embedding layer. The network consists of a word component, a time component, and an integration component.

We are only exploring point models, thus we take the maximum a posteriori estimate of the joint distribution to recover the vectors for each time bin. We apply a logarithm in constructing the estimate, which transforms the joint probability into the SGNS loss function with four regularizers, each corresponding to a prior distribution: P_1 becomes \sum_{w \in W} \frac{\alpha_1}{2} \|\vec{w}\|^2, P_2 becomes \sum_{c \in C} \frac{\alpha_1}{2} \|\vec{c}\|^2, P_3 becomes \sum_{w \in W} \frac{\alpha_2}{2} \|\vec{w} - \vec{w}_{prev}\|^2, and P_4 becomes \sum_{c \in C} \frac{\alpha_2}{2} \|\vec{c} - \vec{c}_{prev}\|^2, where W and C are the sets of target and context words. We label this model SmallBinReg.
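In code, the four regularizers amount to the penalty below, added to the per-bin SGNS loss; this is a sketch under our reading of the MAP estimate, with W and C as the embedding matrices of the current bin (the α defaults follow Appendix A).

```python
import torch

def smallbinreg_penalty(W, C, W_prev, C_prev, alpha1=1000.0, alpha2=1.0):
    """MAP regularizers (a sketch): P1/P2 pull vectors toward zero,
    P3/P4 pull them toward the previous bin's vectors."""
    p12 = 0.5 * alpha1 * (W.pow(2).sum() + C.pow(2).sum())
    p34 = 0.5 * alpha2 * ((W - W_prev).pow(2).sum() + (C - C_prev).pow(2).sum())
    return p12 + p34
```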

DiffTime Model
Our model is a modification of the SGNS algorithm to accommodate a continuous time variable. The original SGNS algorithm produces a target embedding \vec{w} for target word w and a context embedding \vec{c} for context word c. Instead, we produce a differentiable function use_W(w, t) that returns a target embedding for target word w at time t and a differentiable function use_C(c, t) that produces a context embedding for context word c at time t.
Our model consists of three components. One component takes time as its input and produces an embedding that characterizes that point in time (lower right). The second component (lower left) takes a word as its input and produces a time-independent word embedding, which is then reshaped into a set of parameters that can modify the time embedding. The third component (top) combines the time embedding and the word embedding.
The first component of our model is a two-layer feed-forward neural network with tanh activation functions. These layers take a time t as input and produce a time embedding timevec(t):

h_1 = \tanh(M_1 t + b_1)
timevec(t) = \tanh(M_2 h_1 + b_2)

where M_1 and M_2 are the weights of the first two layers and b_1 and b_2 are the biases. To produce the input value t, a timepoint is scaled to a value between 0 and 1, where 0 corresponds to the year 1900 and 1 corresponds to 2009, the last year for which our corpus has data.

The second component incorporates word-specific information into our model. For use_W(w, t), each target word w has a target vector representation \vec{w}. The vector \vec{w} is then transformed into a linear transformation Trans_w, which in the third component is applied to the time embedding timevec(t). We do this via a modified linear layer where the weights are a three-dimensional tensor, the biases are a matrix, and the output is a matrix:

Trans_w = T \vec{w} + B

where T is the tensor acting as the weights and B is the matrix acting as the biases.

The third component combines the word-independent time embedding timevec(t) and the time-independent linear transformation Trans_w to produce the final result. First, Trans_w is applied to timevec(t):

h_3 = Trans_w \, timevec(t)

Then, an additional linear layer is used as the output layer, taking h_3 as input:

use_W(w, t) = M_4 h_3 + b_4

where M_4 and b_4 are the weights and biases of the output layer.

The above details the architecture of use_W(w, t). The corresponding function use_C(c, t) for context words has the same architecture as use_W(w, t) and shares weights with it. The only exception is that use_C(c, t) uses a separate set of vectors \vec{c} in the second component instead of sharing the target vectors \vec{w} with use_W(w, t).
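Putting the three components together, a minimal PyTorch sketch of use_W might look as follows; the module layout, tensor shapes, and initialization are our assumptions, with hidden size 100 and 300-dimensional word vectors following Appendix A. A use_C version would be a second instance sharing every parameter except the embedding table.

```python
import torch
import torch.nn as nn

class DiffTime(nn.Module):
    """Sketch of use_W(w, t): time component, word component, integration."""
    def __init__(self, vocab_size, d=300, h=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)            # w (time-independent)
        self.m1 = nn.Linear(1, h)                           # time layer 1
        self.m2 = nn.Linear(h, h)                           # time layer 2
        self.T = nn.Parameter(torch.randn(h, h, d) * 0.01)  # tensor weights
        self.B = nn.Parameter(torch.zeros(h, h))            # matrix biases
        self.m4 = nn.Linear(h, d)                           # output layer

    def forward(self, word_ids, t):
        # word_ids: (batch,); t: (batch, 1) time scaled to [0, 1]
        h1 = torch.tanh(self.m1(t))
        timevec = torch.tanh(self.m2(h1))                         # timevec(t)
        w = self.embed(word_ids)
        trans = torch.einsum('ijk,bk->bij', self.T, w) + self.B   # Trans_w
        h3 = torch.einsum('bij,bj->bi', trans, timevec)           # Trans_w timevec(t)
        return self.m4(h3)                                        # use_W(w, t)
```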
We train our model using a modified version of the SGNS loss function. In particular, our positive samples are now triples (w, c, t), where w is a target word, c is a context word, and t is a time, instead of the pairs (w, c) typically used in SGNS. For each positive sample (w, c, t), we sample k negative contexts from the unigram distribution P_D. P_D is estimated from all contexts in the entire corpus and is time-independent. Explicitly, the loss function is:

L = - \sum_{(w,c,t) \in D} \Big( \log \sigma(use_W(w, t) \cdot use_C(c, t)) + k \, \mathbb{E}_{c_N \sim P_D} [\log \sigma(-use_W(w, t) \cdot use_C(c_N, t))] \Big)

where D is now a list of target-context-time triples extracted from the corpus.
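A sketch of this loss over a minibatch of triples, reusing the hypothetical DiffTime modules sketched above (neg_ids holds the k negatives per positive, drawn from P_D and paired with the positive sample's time):

```python
import torch
import torch.nn.functional as F

def difftime_loss(use_w, use_c, w_ids, c_ids, neg_ids, t):
    """use_w, use_c: DiffTime-style modules for targets and contexts.
    w_ids, c_ids: (b,); neg_ids: (b, k); t: (b, 1) scaled times."""
    b, k = neg_ids.shape
    wv = use_w(w_ids, t)                                   # use_W(w, t): (b, d)
    cv = use_c(c_ids, t)                                   # use_C(c, t): (b, d)
    t_rep = t.repeat_interleave(k, dim=0)                  # one time per negative
    nv = use_c(neg_ids.reshape(-1), t_rep).view(b, k, -1)  # (b, k, d)
    pos = F.logsigmoid((wv * cv).sum(-1))                  # positive term
    neg = F.logsigmoid(-(nv * wv.unsqueeze(1)).sum(-1)).sum(-1)  # negatives
    return -(pos + neg).mean()
```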

Training
All models are trained on the same training data. We used the English Fiction section of the Google Books ngram corpus (Lin et al., 2012). We use English Fiction specifically because it is less unbalanced than the full English section and less influenced by technical texts (Pechenick et al., 2015). We only use the years 1900 to 2009, as there is limited data before 1900.
We converted the ngram data for this corpus into a set of (target word, context word, year, frequency) tuples. The frequency is the expected number of times the target word-context word pair is sampled from that year's data using skip-gram. Following Hamilton et al. (2016b), we use subsampling with t = 10^{-5}. As the number of texts published per year has increased fivefold since 1900, we weight the frequencies so that the sums across each year are equal.
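The reweighting step can be sketched as follows, using the tuple layout described above:

```python
from collections import defaultdict

def equalize_years(tuples):
    """Rescale frequencies so every year's total frequency is equal
    (a sketch). tuples: list of (target, context, year, freq)."""
    totals = defaultdict(float)
    for _, _, year, freq in tuples:
        totals[year] += freq
    target_total = sum(totals.values()) / len(totals)  # common per-year mass
    return [(w, c, y, f * target_total / totals[y]) for w, c, y, f in tuples]
```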
For the binned models, we train each bin's synchronic model on the subset of the training data corresponding to that time bin. For our model, we sample (target word, context word, year) triples from the entire training data, as the year is an input to our function.

Before we can evaluate the methods as models of diachronic semantics, we must first ensure that they model semantics accurately. To do this, we follow Hamilton et al. (2016b) in performing the MEN word similarity task on vectors extracted from a fixed time point (Bruni et al., 2012). The hope is that the word similarity predictions of a model at that point in time correlate highly with the word similarity judgments in the MEN dataset. For the binned models, we used the vectors from the bin best corresponding to 1995, to reflect the 1990s bin chosen by Hamilton et al. (2016b). DiffTime represents time as a continuous variable, so we chose a time t corresponding to the start of 1995.
The results of the MEN word similarity task are in Table 1. All of the Spearman's ρ values are comparable to those found in Levy and Goldberg (2014) and Hamilton et al. (2016b). Thus, all of these models reflect human similarity judgments as well as synchronic models do.

Synthetic Task
The goal of creating diachronic distributional models is to help us understand how words change meaning over time. To that end, we have created a synthetic task to compare models by how accurately they track semantic change.
Our task creates synthetic words that change between two senses over time via a sigmoidal path. A sigmoidal path will allow us to emulate a word starting from one sense, shifting gradually to a second sense, then stabilizing on that second sense. By using sigmoidal paths, we can explore how well a model can track words that have switched senses over time such as gay (lively to homosexual) and broadcast (scattering seeds to televising shows). A similar task is used to evaluate word sense disambiguation (Gale et al., 1992;Schütze, 1992).
The synthetic words are formed by combining two real words, e.g. banana and lobster are combined to form banana•lobster. The real words are randomly sampled from two distinct semantic classes in the BLESS dataset (Baroni and Lenci, 2011). We use BLESS classes so that we can capture how semantically similar a synthetic word is to its component words by comparing it to other words in the same BLESS class as each component word. For example, we can capture how similar banana•lobster is to banana by comparing banana•lobster to words in the fruit BLESS class. See Appendix B for preprocessing details. We denote a synthetic word with r_1•r_2, where r_1 and r_2 are the component real words.
We also randomly generate the sigmoidal path by which a synthetic word changes from one sense to another. For real words r_1 and r_2, this path is denoted shift(t; r_1•r_2) and is defined by the following equation:

shift(t; r_1•r_2) = \sigma(s(t - m))

The value s is uniformly sampled from (1.0/110, 10.0/110) and represents the steepness of the sigmoidal path. The value m is uniformly sampled from {1930, . . . , 1980} and represents the point where the synthetic word is equally both senses. For example, our synthetic word banana•lobster can transition from meaning banana to meaning lobster via the sigmoidal path σ(0.05(t − 1957)), where 1957 is the time at which banana•lobster is equally banana and lobster and 0.05 represents how gradually banana•lobster shifts senses.
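Sampling such a path is straightforward; a minimal sketch:

```python
import math
import random

def sample_shift(rng=random):
    """Sample a sigmoidal path shift(t) = sigma(s * (t - m)) with steepness s
    and midpoint year m, as described above (a sketch)."""
    s = rng.uniform(1.0 / 110, 10.0 / 110)
    m = rng.randrange(1930, 1981)   # midpoint sampled from {1930, ..., 1980}
    return lambda t: 1.0 / (1.0 + math.exp(-s * (t - m)))

# For example, s = 0.05 and m = 1957 reproduce the banana-lobster path above.
```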
We then use shift(t; r_1•r_2) to integrate r_1•r_2 into the real diachronic corpus data. Our training data is a set of (target word, context word, year, frequency) tuples extracted from a diachronic corpus (see the Training section). For every tuple where r_1 is the target word, we replace the target word with r_1•r_2 and multiply the frequency by shift(t; r_1•r_2). For every tuple where r_2 is the target word, we replace the target word with r_1•r_2 and multiply the frequency by 1 − shift(t; r_1•r_2). In other words, in the modified corpus, r_1•r_2 receives a shift(t; r_1•r_2) fraction of r_1's contexts at time t and a 1 − shift(t; r_1•r_2) fraction of r_2's contexts at time t.
We train a model mod on this modified training data, which provides a representation of r_1•r_2 over time. We can capture how much mod predicts r_1•r_2 to be more semantically similar to r_1 than to r_2 by comparing mod's representation of r_1•r_2 to words in the same semantic categories as r_1 and r_2, using BLESS classes as our notion of semantic category. If cls_1 is the BLESS class of r_1 and cls_2 is the BLESS class of r_2, then mod's prediction for how much more similar r_1•r_2 is to r_1 than r_2, rec(t; r_1•r_2, mod), is defined as follows:

rec(t; r_1•r_2, mod) = \frac{1}{|cls_1|} \sum_{r'_1 \in cls_1} sim_{mod}(r_1•r_2, r'_1, t) - \frac{1}{|cls_2|} \sum_{r'_2 \in cls_2} sim_{mod}(r_1•r_2, r'_2, t)   (10)

where sim_{mod}(r_1•r_2, r'_1, t) is the cosine similarity between mod's word vector for r_1•r_2 at time t and mod's word vector for r'_1 at time t.

To evaluate a model's ability to capture semantic shift, we use the mean sum of squares error (MSSE) between rec(t; r_1•r_2, mod) and shift(t; r_1•r_2) across all synthetic words. The gold value of rec(t; r_1•r_2, mod) is the sigmoidal path shift(t; r_1•r_2) that defines how r_1•r_2 semantically shifts from r_1 to r_2 over time. To evaluate how accurately mod predicts the semantic trajectory of r_1•r_2, we calculate the mean squared error between the two:

MSSE = \frac{1}{110} \sum_{t=1900}^{2009} \big( rec(t; r_1•r_2, mod) - shift(t; r_1•r_2) \big)^2   (11)

As rec(t; r_1•r_2, mod) and shift(t; r_1•r_2) have different scales, we Z-scale both series before calculating the mean squared error.
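In code, rec and the Z-scaled MSSE can be sketched as below; the per-year vector lookups come from whichever model mod is being evaluated.

```python
import numpy as np

def rec(synth_vec, cls1_vecs, cls2_vecs):
    """Mean cosine similarity to class-1 words minus class-2 words (Eq. 10)."""
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return (np.mean([cos(synth_vec, v) for v in cls1_vecs])
            - np.mean([cos(synth_vec, v) for v in cls2_vecs]))

def msse(rec_series, shift_series):
    """Z-scale both yearly series, then mean squared error (Eq. 11)."""
    z = lambda x: (x - x.mean()) / x.std()
    r, s = z(np.asarray(rec_series)), z(np.asarray(shift_series))
    return float(np.mean((r - s) ** 2))
```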

Method
We use three sets of 15 synthetic words, and the average is calculated over all 45 words. The synthetic words and BLESS classes we used are listed in the supplementary material. The results are in Table 2. The column AMSE is the MSSE when all years are taken into account. Kim et al. (2014) noted that small bin models require an initialization period, so the column AMSE (1950-) is the MSSE when only the years 1950 to 2009 are taken into account and the years 1900 to 1949 are used as the initialization period. From the table, we see that our model outperforms the three benchmark models in both cases. Using a paired t-test, we found that the reduction in MSSE between our model and each benchmark model is statistically significant.

Figure 3: Comparisons between shift(t; r_1•r_2) (red) and rec(t; r_1•r_2, mod) (blue) for the synthetic word pistol•elm. The x-axis shows years and the y-axis the values of the two curves; both rec(t; r_1•r_2, mod) and shift(t; r_1•r_2) have been Z-scaled.
In Figure 3, we plot shift(t; r_1•r_2) and rec(t; r_1•r_2, mod) for the synthetic word pistol•elm; each method has its own subgraph. The predictions of the large bin model LargeBin appear as a step function with large steps (top left graph). These large steps cause the predicted shift (blue curve) to correlate poorly with the gold shift (red curve). Next, we consider the small bin models SmallBinPreInit (top right graph) and SmallBinReg (bottom left graph). Both predicted shifts have an initial portion (between 1900 and 1950) that fits the generated shift poorly. As Kim et al. (2014) note, it takes several iterations for small bin models to stabilize because each bin is fed limited data. Additionally, there are fluctuations in the predicted shifts, which we attribute to the high variance of data per bin. In contrast to the other models, our predicted shift fits the gold shift tightly (bottom right graph).
Although this evaluation provides useful information on the quality of a diachronic distributional model, it has some weaknesses. First, it is a synthetic task that operates on synthetic words, so it offers limited insight into how well a model will perform on real-world data. Second, we only generate words that shift from one sense to another, which fails to account for other common changes, such as gaining or losing senses and narrowing or broadening. Finally, by using a sigmoidal function to generate how words change meaning, we may have privileged continuous models that incorporate a sigmoidal function in their architecture. We are working towards improving this evaluation to remove these issues.

Speed of word use change
In this section, we evaluate our model's ability to measure the speed at which a word is changing. Our model is differentiable with respect to time, so we can take the derivative of use_W(w, t) with respect to t to model how word w is changing usage at time t. We l2-normalize use_W(w, t) beforehand to reduce frequency effects. We then take the magnitude of this normalized derivative as the speed at which a word is changing at a given time.
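Concretely, with an autograd framework this is a single Jacobian computation; a sketch using the hypothetical DiffTime module from earlier:

```python
import torch
import torch.nn.functional as F

def speed(model, word_id, year):
    """Speed of change: norm of d/dt of the l2-normalized use_W(w, t)
    (a sketch). Time is scaled so that 1900 -> 0 and 2009 -> 1."""
    t = torch.tensor([[(year - 1900) / 109.0]])
    ids = torch.tensor([word_id])
    f = lambda tt: F.normalize(model(ids, tt), dim=-1)  # normalized use_W(w, t)
    jac = torch.autograd.functional.jacobian(f, t)      # derivative w.r.t. t
    return jac.norm().item()
```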
We explore the connection between speed and the nearest neighbors to a word in Figure 4. First, we use apple as a baseline for discussion. We chose apple because the meaning of the word has remained relatively stable throughout the 1900s. With apple, we see a low speed over time and consistent cosine similarity to apple's nearest neighbors. While it is true that apple has meanings beyond the fruit, such as referring to Apple Inc., those meanings are much rarer, especially in the fiction corpus we use.
In contrast to apple, the word gay has a very high speed and a drastic change in its nearest neighbors. This makes sense, as gay is well established to have experienced a drastic sense change in the mid to late 1900s (Harper, 2014). Next, we explore the word mail, which has a moderately high speed. This may reflect the fact that there have been incredible changes in the medium by which we send mail, e.g. the change from cables to email. A possible reason the speed is only moderately high is that, even though the medium has changed, many of the same uses of mail, e.g. sending, receiving, opening, remain the same. We see this reflected in the nearest neighbors as well: mail shifts from a high similarity to cable to a high similarity to e (as in email), yet remains consistently similar to postal and stationery.
The next word we will explore is canadian. We chose this word because we were surprised to find that canadian has one of the fastest speeds in the 1930s to 1940s. The nearest neighbors to canadian shift from geographic terms like port and railhead to civil terms like federal and national. In further analysis, we discovered that this may reflect a larger push to form a Canadian identity in the early 1900s (Francis, 1997). The nearest neighbors to canadian may thus track Canada's change from being part of the British Empire to having its own unique national identity.
The final word we will explore is cell, which also has a high speed over time. However, there is a spike in the speed during the 1980s. Analyzing the nearest neighbors, we see a rapid rise in similarity to pager and handset, which indicates that this spike may be related to the rapid rise of cell phone use. This example also demonstrates a weakness in our approach: our model predicts that the word cell changed meaning gradually and started changing meaning much earlier than expected. This prediction error comes from the smoothing of the output caused by representing time as a continuous variable.
Even though we are able to extract interesting insights from the speed of word use change, Figure 4 also exhibits some limitations. In particular, most words have a sharp rise in speed in the 1930s and a steep decline in speed in the 1980s. We believe this is an artifact of our representation of word use as a function of time, as there is a single time vector that influences all words. In the future, we will explore model variants to address this.

Automatic extraction of time periods
We can inspect h_1, the first layer in the time subnetwork, to gain further understanding of what our model is doing. We do this by analyzing the time points where a node in h_1 is zero. As the activation function in h_1 is tanh, a node in h_1 switches from positive to negative (or vice versa) at the time points where it is zero. Thus, the time points where a node is zero should indicate boundaries between time periods.
We visualize the time points where a node is zero in Figure 5. We see a fairly even distribution of points until the 1940s, a large burst of points in the 1950s-1960s, and two points in the 1980s. This suggests many time periods before the 1940s (which may be caused by noisiness of the data in the first half of the century), a big transition between time periods in the 1950s-1960s, and a transition between time periods in the 1980s. These are the time points that the model perceives as having increased semantic change.
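Since each h_1 node has a single scalar input, its zero crossing can be solved in closed form; a sketch, assuming M_1 and b_1 have been extracted from the trained time subnetwork:

```python
import numpy as np

def h1_zero_crossings(M1, b1):
    """Years where nodes of h1 = tanh(M1 * t + b1) cross zero (a sketch).
    Node i is zero at t = -b1[i] / M1[i]; keep crossings inside the
    scaled range [0, 1], i.e. the years 1900-2009."""
    w, b = np.ravel(M1), np.ravel(b1)
    years = [1900 + (-bi / wi) * 109 for wi, bi in zip(w, b)
             if wi != 0 and 0.0 <= -bi / wi <= 1.0]
    return sorted(years)
```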
However, there is a weakness to this analysis. Only 16% of the 100 nodes in h 1 are zero for time points between 1900 and 2009. Thus, a vast majority of nodes do not correspond to transitions between time periods.

Conclusion
Diachronic distributional models are a helpful tool in studying semantic shift. In this paper, we introduced our model of diachronic distributional semantics. Our model incorporates two hypotheses that help it better capture how words change usage over time: first, that semantic change is gradual, and second, that words can change usage due to common causes.
Additionally, we have developed a novel synthetic task to evaluate how accurately a model tracks the semantic shift of a word across time. This task directly measures semantic shift, is quantifiable, allows model comparison, and focuses on the trajectory of a word over time.
We have also used the fact that our model is differentiable to create a measure of the speed at which a word is changing. We then explored this measure's capabilities and limitations.

A Hyperparameters and preprocessing details
We used data from the English Fiction section of the Google Books ngram corpus (Lin et al., 2012). We use English Fiction specifically because it is less unbalanced than the full English section and less influenced by technical texts (Pechenick et al., 2015). We only use the years 1900 to 2009, as there is limited data before 1900. Both the set of target words and the set of context words are the top 100,000 words by average frequency across the decades, as generated by Hamilton et al. (2016b). We take a sampling approach to generating word vectors, so the corpus was converted into a list of (target word, context word, year, frequency) tuples. Frequency is the expected number of times the target word is in the context of the context word that year. As the number of texts published per year has increased fivefold since 1900, we weight the frequencies so that the sums across each year are equal. For every model, the representation of a word's use at time t is a 300-dimensional vector. For SmallBinReg, α_1 is set to 1000 and α_2 is set to 1; this choice of hyperparameters comes from Bamler and Mandt (2017). For DiffTime, every hidden layer is 100-dimensional, except for embed_W(w), which is 300-dimensional.
We trained each method using random minibatching with 10,000 samples per iteration and 990 epochs total. For LargeBin, since our study spans 11 decades [...]

B BLESS preprocessing details

In this section, we discuss the BLESS preprocessing details. In the original dataset, there are 200 words categorized into 17 classes. However, we remove words that do not rank in the top 20,000 by frequency in any decade of our training data, to ensure that the synthetic words do not lack context words at a given time. We then remove BLESS classes with fewer than 6 members, to ensure that there are a sufficient number of words in each class. See Table 3 for the resulting list of BLESS classes and the number of members in each class.