Contextualized Diachronic Word Representations

Diachronic word embeddings play a key role in capturing interesting patterns about how language evolves over time. Most existing work focuses on studying corpora spanning several decades, which is understandably still not possible when working on social media-based user-generated content. In this work, we address the problem of studying semantic changes in a large Twitter corpus collected over five years, a much shorter period than the norm in diachronic studies. We devise a novel attentional model, based on Bernoulli word embeddings, conditioned on contextual extra-linguistic (social) features, such as network, spatial and socio-economic variables associated with Twitter users, as well as topic-based features. We posit that these social features provide an inductive bias that helps our model overcome the narrow time-span regime. Our extensive experiments reveal that our proposed model is able to capture subtle semantic shifts without being biased towards frequency cues, and also works well when certain contextual features are absent. Our model fits the data better than current state-of-the-art dynamic word embedding models and is therefore a promising tool to study diachronic semantic changes over small time periods.


Introduction
Natural language changes over time due to a wide range of linguistic, psychological, sociocultural and encyclopedic causes (Blank and Koch, 1999; Grzega and Schoener, 2007). Studying the semantic change of a word helps us understand more about human language and build temporally aware models, which are especially complementary to the work done in the digital humanities and historical linguistics. Recently, diachronic word embeddings based on the distributional hypothesis (Harris, 1954) have been used to automatically study semantic changes in a data-driven fashion from large corpora (Kim et al., 2014; Hamilton et al., 2016; Rudolph and Blei, 2018). We refer the reader to Kutuzov et al. (2018), who survey the recent methods in this field and establish the challenges that lie ahead.

Figure 1: The diachronic embedding computed by our proposed model for the word 'BATACLAN' reveals how the term's usage changed over the years. We list the five most similar words (with English translations in parentheses) in each year by cosine similarity. The y-axis corresponds to "meaning", a one-dimensional PCA projection of the embeddings.
Currently, the literature on this problem focuses on English corpora spanning several decades. This has created a gap in extending diachronic word embeddings not only to a wider range of languages, but also to datasets spanning just a few successive years, which are common in the digital humanities and social sciences. In this work, we study French text from Twitter collected over just five years, which provides a challenging platform for building models that can capture semantic drifts in a noisy, subtly evolving language corpus. Figure 1 shows an instance of the evolution of the word 'Bataclan' (a theatre in Paris that was attacked by terrorists in November 2015) in the French corpus. It also shows that such embedding representations mostly capture the dominant sense of a word when used synchronically and can therefore only reflect the evolution of the dominant sense when used diachronically, leaving open the question of whether small, subtle changes can be captured (Tahmasebi et al., 2018).
We hypothesize that the current state-of-the-art models lack the inductive biases needed to fit data accurately in this setting. We build on the observation by Jurafsky (2018) that "it's important to consider who produced the language, in what context, for what purpose, and make sure that the models are fit to the data". Hence, we propose a novel model extending Dynamic Bernoulli word Embeddings (Rudolph and Blei, 2018) (DBE) which exploits inductive biases by conditioning on a number of contextualized features such as network, spatial and socio-economic variables, which are associated with Twitter users, as well as topic-based features.
We perform qualitative studies and show that our model can: (i) accurately capture subtle changes caused by cultural drifts, (ii) learn a smooth trajectory of word evolution despite exploiting various inductive biases. Our quantitative studies illustrate that our model can: (i) capture better semantic properties, (ii) be less sensitive to frequency cues than the DBE model, (iii) act as a better feature extractor for 2 out of 4 tweet classification tasks. Through an ablation study, we additionally find that our model can: (iv) work with a reduced set of contextualized features, (v) pass the test of the law of prototypicality (Dubossarsky et al., 2015). In sum, we believe our model is a promising tool to study diachronic semantic changes over small time periods.
Our main contributions are as follows:
• Our work is, to the best of our knowledge, the first to study diachronic word embeddings for French-language tweets. Unlike previous works, we consider a dataset from a narrow time horizon (five years).
• We propose a novel, attentional, diachronic word embedding model that derives inductive biases from several contextualized, sociodemographic features to fit the data accurately.
• Our work is also the first to estimate the usefulness of diachronic word embeddings for a downstream task like tweet classification.

Related Work

Bamler and Mandt (2017), Yao et al. (2018) and Rudolph and Blei (2018) proposed to learn word embeddings across all time periods jointly, along with their alignment, in a single step. Rudolph and Blei (2018) represent word embeddings as sequential latent variables, naturally accommodating time slices with sparse data and ensuring word embeddings are grounded across time. Our proposed model builds upon this work by conditioning on several inductive biases, using contextual extra-linguistic (social) and topic-based features, to accurately fit a dataset from a narrow time horizon.

Contextualized Features
Natural language text is inherently contextual, depending on the author, the period and the intended purpose (Jurafsky, 2018). For instance, features based on authors' demography, although incomplete, can explain some of the variance in the text (Garten et al., 2019). While the ability of diachronic word embeddings to capture semantic shifts is interesting because of its flexibility, we postulate that there is a need to capture contextualized information about tweets, such as the characteristics of their authors (spatial, network, socioeconomic and interest-based attributes) and meta-information such as their topic. To extract features, we make use of the largest French Twitter corpus to date, proposed in Abitbol et al. (2018). In this section we describe the set of contextualized features we propose to inject into our diachronic word embedding model (see Section 4).

Spatial
Users from similar geographical areas tend to share similar properties in terms of word usage and language idiosyncrasies. Among others, Hovy and Purschke (2018) for German and Abitbol et al. (2018) for French, confirmed regional variations in geolocated users' content in social media. The latter work found the southern part of France to use a more standard language than the northern part. To exploit these geographic variations, we identify geolocated users (∼ 100K) and associate each of them to their respective region (out of 22 regions) and department (out of 96 departments) within the French territory. We learn a latent embedding for each region and department which captures the spatial information with different levels of granularity.

Socioeconomic
Users of similar socioeconomic status tend to share similar online behavior in terms of circadian cycles. Specifically, Abitbol et al. (2018) found that people of higher socioeconomic status are more active during the daytime and also use a more standard language. The National Institute of Statistics and Economic Studies (INSEE) of France provides the population-level salary for each 4-hectare square patch across the whole French territory, estimated from the 2010 tax returns in France. We also use the IRIS dataset provided by the French government, which has coarser-grained annotations of socioeconomic status. This information is mapped to the geographical coordinates of users' home locations from Twitter, so we can roughly ascertain the economic status of every geolocated user. We create 9 socioeconomic classes by binning the income such that the total income is the same for each class. We learn a latent embedding for each such class, which thus captures the variation caused by status homophily.
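The equal-total-income binning can be sketched as follows; the paper does not spell out the exact procedure, so the cut placement below (segmenting the cumulative income curve into equal-mass pieces) is our assumption:

```python
import numpy as np

def equal_mass_bins(incomes, n_classes=9):
    """Assign each spatial patch to a socioeconomic class such that the
    total income summed within each class is approximately equal
    (exactly equal only in the continuous limit).

    Patches are sorted by income and the cumulative income curve is cut
    into n_classes segments of equal mass."""
    incomes = np.asarray(incomes, dtype=float)
    order = np.argsort(incomes)
    cum = np.cumsum(incomes[order])
    # Which equal-mass segment each patch's cumulative income falls into.
    labels_sorted = np.minimum(
        (cum / cum[-1] * n_classes).astype(int), n_classes - 1)
    labels = np.empty_like(labels_sorted)
    labels[order] = labels_sorted   # map back to the original order
    return labels

# Three low-income and three high-income patches, split into 3 classes.
labels = equal_mass_bins([1, 2, 3, 100, 100, 100], n_classes=3)
```

Each high-income patch ends up dominating its own class, while the low-income patches are pooled together, which is the intended equal-mass behavior.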

Network
Users who are connected to each other in social networks are usually believed to share similar interests. We construct a co-mention network with the set of geolocated users as nodes and edges connecting those users who have mentioned each other at least once. We run the LINE model (Tang et al., 2015) to embed the nodes of the graph using the connectivity information and use the resulting node embeddings as fixed features.
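Building the co-mention graph can be sketched as below; we read "mentioned each other" as mutual mention, which is our assumption, and the subsequent LINE embedding step is omitted:

```python
from collections import defaultdict

def co_mention_edges(mentions):
    """Build the undirected co-mention graph: an edge between two users
    who have each mentioned the other at least once.
    mentions: iterable of (author, mentioned_user) pairs."""
    mentioned = defaultdict(set)
    for author, target in mentions:
        mentioned[author].add(target)
    edges = set()
    for a, targets in mentioned.items():
        for b in targets:
            # Keep the pair only if the mention is reciprocated.
            if a in mentioned.get(b, set()):
                edges.add(tuple(sorted((a, b))))
    return edges
```

The resulting edge set would then be fed to a node-embedding method such as LINE.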

Interest
The interest feature corresponds to the set of important topics a user cares about. We obtain this information by composing a user document that gathers all the words used in their posts, ranking the words in the document by tf-idf score and selecting the top 50 of them. We then construct the user vector by summing the vectors (obtained by running word2vec on the entire corpus or on geolocated tweets) corresponding to these top 50 words. We use the user vectors as fixed features.

Knowledge
Knowledge features keep track of the way the user writes and, as such, are also a summary of their content on Twitter. We learn a latent embedding for each geolocated user.

Topic
This feature corresponds to the topic a tweet belongs to. Since the available corpus does not have any annotation of tweet topics, we exploit the distant supervision idea proposed by Magdy et al. (2015) and filter geolocated tweets with an accompanying YouTube video link. We then use the public YouTube API to obtain the category of the video, which is taken as the topic of the tweet. We learn a latent embedding for each YouTube category.

Proposed model
In this section we will first briefly discuss the 'Dynamic Bernoulli Embeddings' model (DBE) and then provide the details of our proposal, which uses DBE model as its backbone.

Dynamic Bernoulli Embeddings (DBE)
The DBE model is an extension of the 'Exponential Family Embeddings' model (EFE; Rudolph et al., 2016) that incorporates sequential changes in the data representation. Let the sequence of words from a text corpus be represented by (x_1, ..., x_N) over a vocabulary V. Each word x_i ∈ {0, 1}^{|V|} is a one-hot vector, with a 1 in the position corresponding to the vocabulary term and 0 elsewhere. The context c_i is the set of words surrounding the word at position i. DBE builds on Bernoulli embeddings, which provide a conditional model for each entry x_iv ∈ {0, 1} of the indicator vector, whose conditional distribution is

x_iv | x_{c_i} ~ Bernoulli(ρ_iv),   (1)

where ρ_iv ∈ (0, 1) is the Bernoulli probability and x_{c_i} is the collection of data points indexed by the context positions. Each index (i, v) in the data is associated with two parameter vectors, the embedding vector ρ_v ∈ R^K and the context vector α_v ∈ R^K. The natural parameter of the Bernoulli is given by

η_iv = ρ_v^T ( Σ_{j ∈ c_i} Σ_{v'} α_{v'} x_{jv'} ).   (2)

Since each observation x_iv is associated with a time slice t_i (a year, in our case), DBE learns a per-time-slice embedding vector ρ_v^{(t)} for every word in the vocabulary. Equation 2 thus becomes

η_iv = (ρ_v^{(t_i)})^T ( Σ_{j ∈ c_i} Σ_{v'} α_{v'} x_{jv'} ).   (3)

DBE shares the context vectors across the time slices to ground the successive embedding vectors in the same semantic space. DBE assumes a Gaussian random walk as a prior on the embedding vectors to encourage smooth changes in the estimates of each term's embedding:

ρ_v^{(0)} ~ N(0, λ_0^{-1} I),   ρ_v^{(t)} ~ N(ρ_v^{(t-1)}, λ^{-1} I).   (4)
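The time-indexed natural parameter can be sketched in NumPy: the context vectors are shared across years, while each year has its own embedding matrix. Since the words are one-hot, the inner double sum reduces to summing the context vectors of the words in the window:

```python
import numpy as np

def natural_parameter(rho_t, alpha, context_ids, v):
    """Natural parameter eta_iv for word v at a position whose time
    slice is t: inner product of the time-specific embedding of v with
    the sum of the (time-shared) context vectors of the window words.

    rho_t:       (V, K) embedding matrix for time slice t
    alpha:       (V, K) context vectors, shared across time slices
    context_ids: vocabulary indices of the words in the context window
    """
    context_sum = alpha[context_ids].sum(axis=0)   # shape (K,)
    return rho_t[v] @ context_sum

def bernoulli_prob(eta):
    """Map the natural parameter to a Bernoulli probability (sigmoid)."""
    return 1.0 / (1.0 + np.exp(-eta))
```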

Proposed model
In this work, we argue that the DBE model fails to accurately fit the data spanning across fewer years as it discards other explanatory variables (besides time) about the complicated processes in the language in terms of evolution and construction. These variables, which we defined in Section 3 as contextualized features, carry useful signals to understand subtle changes such as cultural drifts. Our proposed model extends DBE by utilizing these contextualized features as inductive biases.
In our setting, we represent a tweet as t k = (x i , . . . , x N ) belonging to user u l . Each tuple (i, c) is associated with a set of contextualized features based on either u l or t k , f i,m ∈ R dm (m = 1, . . . , |F |) (where |F | corresponds to the number of contextualized features). Each contextualized feature not only follows a different distribution but also has different degrees of noise (e.g., sparsity of co-mention network, geolocation inaccuracy). Hence, it is harder to unify them in a single model. We propose three ways to introduce inductive bias to the DBE model. Unweighted sum: The simplest approach is to project all the feature embeddings to a common space and sum them up. This approach is not agnostic to the embedding vector x i in question and consider all the contextualized features equally. Incorporating this approach, equation 3 now becomes: where w m corresponds to the learnable weights corresponding to the linear projection of f i,,m with size as K × d m . Note that K denotes the dimension of both context and target embedding. Self-attention: Considering all the features equally would be wasteful for certain embedding vector x i . Henceforth, we propose to let the network decide the important contextualized features based on self attention. This approach gives a provision to our model to handle the effect of spurious contextual signals by paying no attention. Incorporating this approach, equation 5 will now become: where α m are the scalar weights corresponding to the self-attention mechanism: where a ∈ R K and b ∈ R are learnable parameters while φ is a softmax. Contextual attention: We can also make the attention mechanism to be context-dependent, that   is, dependent on the embedding vector. Equation 7 then becomes: where a m ∈ R K corresponds to the learnable attention parameter specific to a contextualized feature f m . We fit the diachronic embeddings with the pseudo log likelihood, the sum of log conditionals. 
Particularly, we regularize the pseudo log-likelihood with the log priors and maximize the result to obtain a pseudo-MAP estimate. Our objective function can be summarized as

L(ρ, α) = L_pos + L_neg + log p(ρ) + log p(α).   (9)

The likelihoods are given by

L_pos = Σ_{(i,v)} x_iv log σ(η_iv),   L_neg = Σ_i Σ_{v' ∈ S_i} log(1 − σ(η_iv')),

where S_i is the set of negative samples drawn at random (Mikolov et al., 2013) and σ(·) denotes the sigmoid function, which maps natural parameters to probabilities. The prior is given by the Gaussian random walk,

log p(ρ) ∝ − (λ_0 / 2) Σ_v ||ρ_v^{(0)}||^2 − (λ / 2) Σ_v Σ_t ||ρ_v^{(t)} − ρ_v^{(t−1)}||^2.

Language evolution is a gradual process, and the random walk prior prevents successive embedding vectors ρ_v^{(t)} from drifting far apart. The objective function established in Equation 9 is optimized using stochastic gradients (Robbins and Monro, 1985) with the Adam optimizer (Kingma and Ba, 2014). Negative samples are re-drawn at each gradient step. Pseudo-code for training our model can be found in Appendix A.1.
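The two ingredients of the objective, the negative-sampling pseudo log-likelihood for one position and the random-walk log prior for one word (additive constants dropped), can be sketched as:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pseudo_log_likelihood(eta_pos, eta_neg):
    """L_pos + L_neg for one position: log probability of the observed
    word plus the log probability that each negative sample is absent.
    eta_pos: scalar natural parameter of the observed word
    eta_neg: array of natural parameters of the negative samples"""
    return np.log(sigmoid(eta_pos)) + np.sum(np.log(1.0 - sigmoid(eta_neg)))

def random_walk_log_prior(rho_seq, lam):
    """Gaussian random-walk log prior over the per-year embeddings of
    one word, up to an additive constant: penalizes year-to-year drift.
    rho_seq: (T, K) array of embeddings, one row per year"""
    diffs = np.diff(rho_seq, axis=0)
    return -0.5 * lam * np.sum(diffs ** 2)
```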

Experiments and Results
In this section we discuss the experimental protocol, qualitative and quantitative evaluation to understand the performance of our model.

Protocol
Data: We use the French Twitter dataset proposed in Abitbol et al. (2018), which is the largest collection of French tweets to date. The original dataset consists of 190M French tweets posted by 2.5M users between June 2014 and March 2018. To be able to use socio-geographic features and assess the validity of our model, we only considered tweets from users whose home location could be identified as being in Metropolitan France. This filtering step resulted in a dataset of 18M tweets from 110K users spread across 5 years. This dataset was then enriched using output from the constituency-based Stanford parser in its off-the-shelf French settings (Green et al., 2011) and from the dependency-based parser of Jawahar et al. (2018). We lowercased all the tweets and removed hashtags, mentions, URLs, emoticons and punctuation. We used 80% of the tweets from each year to train our model and split the rest equally to create the validation (10%) and test (10%) sets. Finally, we picked the most frequent 50K words from the train set to create our vocabulary.
Baseline models: We compare our proposed model with three baseline models: (i) Word2vec (Mikolov et al., 2013)

Qualitative Study
Embedding neighborhood: The goal of a diachronic word embedding model is to automatically discover changes in the usage of a word. The current usage of a word w at time t can be obtained by inspecting the nearest neighbors of ρ_w^{(t)}. From Table 1, we can observe that 'EMMANUEL' (first name of the current French president) is associated with his last name ('macron') and office location ('élysée') by both DBE and the proposed model. However, the proposed model captures a more interesting neighborhood by bringing words such as 'élection', 'présidentielle' and 'mélenchon' closer to 'EMMANUEL'. Since language evolution is a gradual process, the trajectory of a word tracked by a model should change smoothly, with exceptions for words undergoing cultural shifts, where the changes can be subtle and rapid. We plot the trajectory by computing the cosine similarity between a word (e.g., MACRON) and a word reflecting its known, changed usage (e.g., PRESIDENT). Figure 2 shows that models relying on Bernoulli embeddings have smoother trajectories for known relations than other models. Despite fusing different, possibly noisy contextualized features, the trajectories tracked by our proposed model and DBE are comparably smooth.
t-SNE: Alternatively, we can overlay the embeddings from all the time slices and visualize them using a dimensionality reduction technique like t-SNE (Maaten and Hinton, 2008). From Figure 3, we see a similar result, where most of the words modeled by our proposed model have experienced consistent change over time.
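The trajectory plots boil down to a per-year cosine similarity between two words, which can be sketched as:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def trajectory(rho, w, ref, years):
    """Cosine similarity between word w and a reference word in each
    year's embedding slice, tracing how their relation evolves.
    rho: dict year -> (V, K) embedding matrix; w, ref: vocab indices."""
    return [cosine(rho[t][w], rho[t][ref]) for t in years]
```

A smooth, monotonically increasing trajectory between, say, MACRON and PRESIDENT would indicate a gradual usage shift as captured by the model.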

Quantitative Study
Log Likelihood: We can evaluate models by held-out Bernoulli probability (Rudolph and Blei, 2018). Given a held-out position, a better model assigns a higher probability to the observed word and a lower probability to the rest. We report L_eval = L_pos + L_neg in Table 3. The contextual attention based model, which smartly utilizes the contextualized features, provides a better fit to the data than the rest. Interestingly, the other variants of our proposed model perform worse than the DBE model, which suggests the importance of utilizing attention appropriately. All the competing methods produce Bernoulli conditional likelihoods (Equation 1) and use n negative samples; we keep n = 20 for all methods to perform a fair comparison.
Semantic Similarity: Certain tweets are tagged with the 'category' to which they belong (as discussed in Section 3.6). Similar to Yao et al. (2018), we create the ground truth of word categories by identifying words that, in a given year, are exceptionally numerous in one particular category. In other words, if a word is most frequent in a category, we tag the word with that category and form our ground truth. For each category c and each word w in year t, we find the percentage of occurrences p in each category. We collect such word-time-category (w, t, c) triplets, avoid duplication by keeping the year of largest strength for each (w, c) combination, and remove triplets where p is less than 35%. Finally, we pick the top 200 words by strength from each category and create a dataset of 3036 triplets across 15 categories, where each word-year pair is strongly linked to its true category. We evaluate the purity of the clustering results using the Normalized Mutual Information (NMI) metric. From Table 3, we find a similar trend in the performance of our proposed model.
Note that we could not perform statistical significance studies for the log likelihood experiment, due to the large size of the test set, and for the semantic similarity experiment, due to the nature of the clustering evaluation.

(For the Word2vec model, we do not plot results for time periods where at least one of the words of interest occurs below the minimum frequency threshold.)

Table 3: Quantitative results based on log likelihood, semantic similarity and tweet classification. Higher numbers are better for all tasks. Statistically significant differences to the best baseline for each task, based on a bootstrap test, are marked with an asterisk.
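The clustering purity evaluation can be sketched with scikit-learn; the cluster assignments are assumed to come from k-means over the (word, year) embeddings, with a toy stand-in below:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

# Toy stand-in for (word, year) embeddings with known categories:
# two tight groups, one per ground-truth category.
emb = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
true_cat = [0, 0, 1, 1]     # ground-truth category per (word, year) pair

# Cluster the embeddings and score the clustering against the labels.
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(emb)
nmi = normalized_mutual_info_score(true_cat, pred)
```

NMI is invariant to label permutation, so a perfect clustering scores 1.0 regardless of which cluster id k-means assigns to which category.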
As we see in Section 6.3, the reason our contextual attention based model excels in this task is its superiority in capturing the semantic properties of a word.
Synthetic Linguistic Change: We can synthetically introduce linguistic shift by perturbing the corpus and then evaluate whether the diachronic word embedding model is able to detect those artificial drifts accurately. Following Kulkarni et al. (2015), we duplicate the data belonging to the year 2018 six times (along with the extra-linguistic information), perturb the last 3 snapshots and use the diachronic embedding model to rank all words according to their p-values. We then calculate the Mean Reciprocal Rank (MRR) for the perturbed words and expect it to be higher for models that can identify the words that have changed. To perturb the data, we sample a pair of words from the vocabulary excluding stop words, replace one of the words with the other with a replacement probability p_replacement, and repeat this step 100 times. We employ two types of perturbation: syntactic (where both sampled words have the same most frequent part-of-speech tag) and frequent (where there is no restriction on the words being sampled at each step). From Figure 4, we find that the DBE model is sensitive to frequency cues from the data and fails to model subtle semantic shifts (e.g., for words whose meaning has evolved without a substantial change in syntactic function).
Tweet Classification: We find that existing work skips evaluating diachronic word embeddings on a downstream NLP task. In this work we propose to test whether diachronic word embeddings can be used as features to build a temporally-aware tweet classifier. We obtain a representation for a tweet by summing the embeddings of the words present in the tweet (taken from the year in which the tweet was posted).
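The MRR used in the synthetic-change evaluation, given a ranking of the vocabulary by change score (most changed first), is simply:

```python
def mean_reciprocal_rank(ranked_words, perturbed):
    """MRR of the known perturbed words within a full vocabulary
    ranking by change score; ranks are 1-based, so an MRR near 1 means
    the perturbed words sit at the very top of the ranking."""
    ranks = [ranked_words.index(w) + 1 for w in perturbed]
    return sum(1.0 / r for r in ranks) / len(ranks)
```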
We then train a logistic regression model and compute the F-score on the held-out instances. We establish four tweet classification tasks through distant supervision: Sentiment Analysis, Hashtag Prediction, Topic Categorization and Conversation Prediction (predicting whether a tweet will receive a reply or not). Details of the tasks and dataset collection can be found in Appendix A.3. From Table 3, we find that our proposed model provides competitive performance with the baseline models for sentiment analysis and topic categorization, while it outperforms them on the hashtag and conversation prediction tasks by a statistically significant margin (computed using a bootstrap test (Efron and Tibshirani, 1994)). Note that there is no single best model that works for every tweet classification task.
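The tweet-classification pipeline described above (sum the year-specific word embeddings, then fit a logistic regression) can be sketched as follows, with a toy vocabulary and toy embeddings standing in for the learned diachronic ones:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def tweet_vector(tokens, rho_year, vocab):
    """Tweet representation: sum of the embeddings (for the year the
    tweet was posted) of its in-vocabulary words."""
    idx = [vocab[w] for w in tokens if w in vocab]
    return rho_year[idx].sum(axis=0) if idx else np.zeros(rho_year.shape[1])

# Toy 2-word vocabulary with 2-dimensional embeddings for one year.
vocab = {"bon": 0, "mauvais": 1}
rho = np.array([[1.0, 0.0], [0.0, 1.0]])

X = np.array([tweet_vector(t.split(), rho, vocab)
              for t in ["bon bon", "mauvais", "bon", "mauvais mauvais"]])
y = [1, 0, 1, 0]                     # toy sentiment labels
clf = LogisticRegression().fit(X, y)
```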

Analysis
In this section we perform an extended analysis of our proposed model to gain more insight into its functionality.

Figure 6: Importance score for each contextualized feature.

Ablation Study
We perform ablation studies of the proposed model by considering different sets of contextualized features as inductive biases, as illustrated in Table 4. It is interesting to find that our model can work with a limited set of contextualized features in practice.
Dubossarsky et al. (2015) state that the likelihood of change in a word's meaning correlates with the word's position relative to its cluster's center. They define a prototypicality measure based on the word's distance from its cluster centroid (e.g., sword is a more prototypical exemplar than spear or dagger), and the prototypicality score decreases when the word undergoes a change in meaning. For all our models, we correlate the distance between the word vectors of the years 2014 and 2018 with the distance the 2014 (2018) year vector moved from its cluster center. We then check whether there is a positive correlation (r > .3). From Figure 5, we observe that there exists a positive correlation for all the variants of our model, whether compared to a prototypical or an actual cluster centroid. Interestingly, when the cluster sizes are small (< 250), a word's meaning change correlates more with a prototypical exemplar than with an actual exemplar. On the other hand, this direction reverses when the cluster sizes exceed 250 and more semantic areas exist.

Interpretation via Probing Tasks
Our tweet classification experiments (Section 5.3) demonstrated the usefulness of diachronic word embeddings as features for building a diachronic tweet classifier. Understanding the underlying properties of the tweet embeddings that enable them to outperform competing models is hard. This is why, following Conneau et al. (2018), we investigate this question by training a diagnostic classifier that probes for important linguistic features on the parsed output we mentioned earlier. These probes are based on various prediction tasks (word content, sentence length, subject or object number detection, etc.) described in Conneau et al. (2018) and, succinctly, in our Appendix A.5. In 7 out of 9 tasks the use of contextual features seems to be detrimental, but the relative performance difference between our proposed models and the baseline is negligible for 5 of them. This suggests that the addition of contextualized features does not hurt the syntactic and semantic information captured by our models. Interestingly, all dynamic embedding models perform twice as well on the word content prediction task as a Word2vec baseline, but it is unclear whether those models capture language usage or actual topic prediction within a degraded language modeling task.

Interpretation via Erasure
Alternatively, we can directly compute the importance of a contextualized feature by observing the effect on the model of erasing (setting the weights to 0) that particular feature (Li et al., 2016). By subtracting the erased model's performance on the test set from the original model's performance, followed by normalization, we can establish an importance score for each feature against each version of our proposed model. Figure 6 emphasizes our finding that all contextualized features (except interest) are equally important to the performance of each variant of our proposed model.
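The erasure-based importance score can be sketched generically; the `model_score` callable and the per-feature weight layout below are our assumptions, not the paper's actual interface:

```python
import numpy as np

def erasure_importance(model_score, weights, feature_names):
    """Importance of each contextualized feature: the drop in test
    score when that feature's projection weights are zeroed out,
    normalized so the scores sum to 1.

    model_score: callable evaluating the model given a weight dict
    weights:     dict feature name -> weight array"""
    base = model_score(weights)
    drops = {}
    for name in feature_names:
        erased = {k: (np.zeros_like(v) if k == name else v)
                  for k, v in weights.items()}
        drops[name] = base - model_score(erased)
    total = sum(drops.values())
    return {k: d / total for k, d in drops.items()}
```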

Conclusion
In this work, we proposed a new family of diachronic word embedding models that utilize various contextualized features as inductive biases to provide better fits to a social media corpus. Our wide range of quantitative and qualitative studies highlights the competitive performance of our models in detecting semantic changes over a short time range. In the future, we will consider the temporal nature of some of our contextualized features when incorporating them into our models. For example, the static social network we built could instead be treated as dynamically evolving, which may model the underlying phenomenon more accurately.

A.1 Training Pseudo Code

for each epoch do
  for each time slice t do
    Sample c + 1 consecutive words from a random tweet X^(t) and construct the context; draw a set C of n negative samples.
    Update the parameters θ = (α, ρ, f_m, w, a, b) by ascending the stochastic gradient.
  end for
end for

We utilize Adam (Kingma and Ba, 2014) to set the learning rate η.

A.2 Hyperparameter settings
We follow the hyperparameter search space provided by Rudolph and Blei (2018) to find the best configuration of our model. Before training our model, we initialize the parameters with one epoch fit of non-diachronic Bernoulli embedding model (as defined in Equation 2 in the paper). We then train our model for 9 more epochs. We fix the embedding dimension to 100, context size to 2 and number of negative samples to 20. We select the initial learning rate η ∈ [0.01, 0.1, 1, 10], minibatch size m ∈ [0.001N, 0.0001N, 0.00001N ] (where N is the number of training records), the precision on context vectors and initial dynamic embeddings λ ∈ [1, 10] (λ 0 = λ/1000). We use the conditional likelihood metric (as discussed in Section 5.3) to sweep over the search space and select the best hyperparameters.

A.3 Tweet Classification Details
We list the details of the tweet classification tasks, for which the data comes from our corpus.
• Sentiment Analysis - This is a binary task to classify the sentiment of a tweet. Following Go et al. (2009), we create a balanced dataset by tagging a tweet as positive (negative) if it contains only positive (negative) emoticons. We remove the emoticons from the tweets to avoid bias.
• Hashtag Prediction - This multiclass classification task is to identify the hashtag present in the tweet. Following Weston et al. (2014), we identify the 100 most frequent hashtags in the corpus, keep the tweets that contain exactly one occurrence of a frequent hashtag, remove the hashtag from the tweet and predict it.
• Topic Categorization - This multiclass classification task is to identify the topical category to which a tweet belongs. Following Magdy et al. (2015), we filter the tweets that have a YouTube video associated with them, query the video category using the public YouTube API and associate it with the topical category of the tweet.
• Conversation Prediction - This binary task is to classify whether a tweet will receive a reply or not. Following Elazar and Goldberg (2018), we tag a tweet as conversational if it has at least one mention ('@') in it; otherwise it is non-conversational. We remove the mentions from the tweets to avoid bias.
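The emoticon-based distant supervision for the sentiment task can be sketched as below; the emoticon lists are illustrative, not the paper's actual lexicon:

```python
POS = {":)", ":-)", ":D"}    # illustrative positive emoticons (assumption)
NEG = {":(", ":-("}          # illustrative negative emoticons (assumption)

def emoticon_label(tweet):
    """Distant supervision for the sentiment task: label 1 (0) if the
    tweet contains only positive (negative) emoticons, None otherwise;
    the emoticons are stripped so the label leaves no trivial cue."""
    toks = tweet.split()
    has_pos = any(t in POS for t in toks)
    has_neg = any(t in NEG for t in toks)
    if has_pos == has_neg:   # no emoticon at all, or mixed polarity
        return None, tweet
    clean = " ".join(t for t in toks if t not in POS | NEG)
    return (1 if has_pos else 0), clean
```

Tweets returning None would simply be discarded when building the balanced dataset.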

A.4 Ablation Results
We perform ablation studies of the no-attention and self-attention variants of the proposed model by considering different sets of contextualized features as inductive biases, as illustrated in Table 6.

A.5 Probing Task Description
In this section we briefly describe the set of probing tasks (proposed in Conneau et al. (2018)) used in our study.
• SentLen -The goal for the classification task is to predict the tweet length which has been binned in 6 categories with lengths ranging in the following intervals: (5 − 8), (9 − 12), (13−16), (17−20), (21−25), (26−28). • WC -This classification task is about predicting which of the target words appear on the given tweet. • TreeDepth -In this classification task the goal is to predict the maximum depth of the tweet's syntactic tree (with values ranging from 5 to 12). • TopConst -The goal of this classification task is to predict the sequence of top constituents immediately below the sentence (S) node. The classes are given by the 19 most common top-constituent sequences in the corpus, plus a 20th category for all other structures. • BShift -In this binary classification task the goal is to predict whether two consecutive tokens within the tweet have been inverted or not. • Tense -The goal of this task is to identify the tense of the main verb of the tweet. • SubjNum -The goal of this task is to identify the number of the subject of the main clause. • ObjNum -The goal of this task is to identify the number of the subject on the direct object of the main clause. • SOMO -This task classifies whether a tweet occurs as-is in the source corpus, or whether a randomly picked noun or verb was replaced with another form with the same part of speech. • CoordInv -This task distinguishes between original tweet and tweet where the order of two coordinated clausal conjoints has been inverted purposely.

A.6 Selection of time span unit
We performed preliminary experiments with the DBE model to identify the time span unit that best fits the data.

Table 6: Ablation results based on log likelihood, semantic similarity and tweet classification.