AMBRA: A Ranking Approach to Temporal Text Classification

This paper describes the AMBRA system, entered in the SemEval-2015 Task 7: ‘Diachronic Text Evaluation’ subtasks one and two, which consist of predicting the date when a text was originally written. The task is valuable for applications in digital humanities, information systems, and historical linguistics. The novelty of this shared task consists of incorporating label uncertainty by assigning an interval within which the document was written, rather than assigning a clear time marker to each training document. To deal with non-linear effects and variable degrees of uncertainty, we reduce the problem to pairwise comparisons of the form ‘is Document A older than Document B?’, and propose a non-parametric way to transform the ordinal output into time intervals.


Introduction
Temporal text classification consists of learning to automatically predict the publication date of documents by using the information contained in their textual content. The task finds uses in fields as varied as digital humanities, where many texts have unidentified or controversial publication dates; information retrieval (Dakka et al., 2012), where temporal constraints can improve relevance; and historical linguistics, where the interpretation of the learned models can confirm and reveal insights.
From a technical point of view, the task is usually tackled either as regression or, more commonly, as a single-label multi-class problem, with classes defined as time intervals such as months, years, decades or centuries. The regression approach assumes that precise timestamps are uniformly available for each document, which is suitable for cases of social media documents (Preotiuc-Pietro, 2014), but less suitable for documents surrounded by more uncertainty. Multi-class classification, on the other hand, suffers from a coarseness tradeoff: using coarser classes is less informative, and using finer classes reduces the number of training instances in each class, making the problem more difficult. Furthermore, with a multi-class formulation, the temporal relationship between classes is lost.
The 'Diachronic Text Evaluation' subtasks one and two from SemEval-2015 are formulated similarly to a multi-class problem, where each document is assigned to an interval such as [1976, 1982]. To accommodate such labels, we propose an approach based on pairwise comparisons. We train a classifier to learn which document out of a pair is older and which is newer. If two documents come from overlapping intervals, then their order cannot be determined with certainty, so the pair is not used in training. We use the property of linear models to extend a set of pairwise decisions into a ranking of test documents (Joachims, 2006).
While previous work uses a regression-based method to map the ranking back to actual timestamps, we propose a novel non-parametric method to choose the most likely interval. In light of this, our system is named AMBRA (Anachronism Modeling by Ranking). Our implementation is available under a permissive open-source license.

Related Work
An important class of models for temporal classification employs prototype-based classification methods, using probabilistic language models and distances in distribution space to classify documents to the time period with the most similar language (de Jong et al., 2005; Kumar et al., 2011). Kanhabua and Nørvåg (2009) use temporal language models to assign timestamps to unlabeled documents.
An extension of such models for continuous time is proposed by Wang et al. (2008), who use Brownian motion as a model for topic change over time. This approach is simpler and faster than the discrete time version, but it cannot be directly applied to documents with different degrees of label uncertainty, such as interval labels. Dalli and Wilks (2006) train a classifier to date texts within a time span of nine years. The method uses lexical features and it is aided by words whose frequencies increase at some point in time, most notably named entities. Abe and Tsumoto (2010) propose similarity metrics to categorise texts based on keywords calculated by indexes such as tf-idf. Garcia-Fernandez et al. (2011) explore different NLP techniques on a digitized collection of French texts published between 1801 and 1944. Style-related markers and features, including readability features, have been shown to reveal temporal information in English as well as Portuguese (Stamou, 2005; Štajner and Zampieri, 2013).
An intersecting research direction combines diatopic (regional) and diachronic variation for French journalistic texts (Grouin et al., 2010) and for the Dutch Folktale Database, which includes texts from different dialects and varieties of Dutch, as well as historical texts (Trieschnigg et al., 2012).
More recently, Ciobanu et al. (2013) propose supervised classification using unigram features and χ² feature selection on a collection of historical Romanian texts, noting that the informative features are words having changed form over time. Niculae et al. (2014) circumvent the limitations of supervised classification by posing the problem as ordinal regression with a learning-to-rank approach. They evaluate their method on datasets in English, Portuguese and Romanian. The superior flexibility of the ranking approach makes it a better fit for the problem formulation of the 'Diachronic Text Evaluation' task, motivating us to base our implementation on it.
A different, but related, problem is to model and understand how word usage and meaning change over time. Wijaya and Yeniterzi (2011) use the Google NGram corpus, aiming to identify clusters of topics surrounding a word over time. Mihalcea and Nastase (2012) split the Google Books corpus into three wide epochs and introduce the task of word epoch disambiguation. Turning this problem around, Popescu and Strapparava (2013) use a similar approach to statistically characterize epochs by lexical and emotion features.

Methods
The 'Diachronic Text Evaluation' shared task consists of three subtasks (Popescu and Strapparava, 2015): classification of documents containing explicit references to time-specific persons or events (T1), classification of documents with time-specific language use (T2), and recognition of time-specific expressions (T3). The AMBRA system participated in T1 and T2.

Corpus
The training data released for the shared task consists of 323 documents for T1 and 4,202 documents for T2. Each document consists of a paragraph containing, on average, 71 tokens, along with a tag indicating when the text was written or published. The publication date of each text is indicated by time intervals at all three granularity levels: fine-, medium- and coarse-grained (e.g. <textM yes="1695-1707"> for a text written between the years 1695 and 1707 in the medium-grained representation).
The shared task mentions no limitation regarding the use of external corpora. Nevertheless, to avoid thematic bias, we use only the corpora provided by the organizers under the assumption that the test and training sets are sampled from the same distribution.
The released test set consists of 267 instances for T1 and 1,041 instances for T2.

Algorithm and Features
We use a ranking approach by pairwise comparisons, previously proposed for temporal text modeling by Niculae et al. (2014).
Learning. The model learns a linear function $g(x) = w \cdot x$ that preserves the temporal ordering of the texts: if document $x_i$ predates document $x_j$, which we will henceforth denote as $x_i \prec x_j$, then $g(x_i) < g(x_j)$. (We overload $x_i$ to refer to the document itself as well as its representation as a feature vector.) This step can be understood as learning to rank texts from older to newer. By making pairwise comparisons, the problem can be reduced to binary classification using a linear model.
A dataset annotated with intervals has the form $D = \{(x, [y_{first}, y_{last}])\}$, where $y_{first} < y_{last}$ are the years between which document $x$ was written. Document $x_i$ can be said to predate document $x_j$ only if its interval predates the other without overlap:
$$x_i \prec x_j \iff y^i_{last} < y^j_{first}$$
This allows us to construct a dataset consisting only of correctly-ordered pairs:
$$D_p = \{(x_i, x_j) : x_i \prec x_j\}$$
Learning the ranking then reduces to linear binary classification:
$$w \cdot x_i < w \cdot x_j \iff w \cdot (x_i - x_j) < 0$$
We form a balanced training set by flipping the order of half of the pairs in $D_p$ at random.
Prediction. Niculae et al. (2014), following Pedregosa et al. (2012), fit a monotonic function mapping from years to the space spanned by the learned linear model. In contrast, to better deal with the interval formulation, we propose a non-parametric memory-based approach. After training, we store:
$$D_{scores} = \{(w \cdot x, [y_{first}, y_{last}]) : (x, [y_{first}, y_{last}]) \in D\}$$
When queried about when a previously unseen document $x$ was written, we compute $z = w \cdot x$ and search for the $k$ closest entries in $D_{scores}$, which we denote $D^z_{scores}$. For each candidate interval $[y_{first}, y_{last}]$ for the test document, we compute its average distance to the intervals of the $k$ nearest training documents. The predicted interval is the one minimizing this average distance.
Importantly, this approach allows for even more flexibility in interval labels than needed for the 'Diachronic Text Evaluation' task. While in the task all intervals (at a given granularity level) have the same size, our method can deal with intervals of various sizes, half-lines $[-\infty, a]$ or $[a, \infty]$ for expressing only an upper or only a lower bound on the time of writing of a document, and even degenerate intervals $[a, a]$ for when the time is known exactly.
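The pair construction and the memory-based prediction described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the AMBRA implementation: the function names (`predates`, `build_pairs`, `interval_distance`, `predict_interval`) are hypothetical, the strict non-overlap test assumes closed intervals, and the distance between two intervals is assumed here to be the absolute difference of their midpoints, since the text above does not fix a particular distance. Training itself is left out: the difference vectors and labels could be passed to any linear binary classifier.

```python
import random


def predates(a, b):
    """True if closed interval a = (first, last) ends before b starts (no overlap)."""
    return a[1] < b[0]


def build_pairs(docs, seed=0):
    """docs: list of (feature_vector, (y_first, y_last)).

    Returns difference vectors and +1/-1 labels for a linear binary classifier.
    """
    # Keep only pairs whose order is certain; x_i - x_j gets label -1
    # because we want w . (x_i - x_j) < 0 when x_i predates x_j.
    diffs = [
        [a - b for a, b in zip(xi, xj)]
        for xi, yi in docs
        for xj, yj in docs
        if predates(yi, yj)
    ]
    labels = [-1] * len(diffs)
    # Balance the classes by flipping a random half of the pairs.
    rng = random.Random(seed)
    for idx in rng.sample(range(len(diffs)), len(diffs) // 2):
        diffs[idx] = [-v for v in diffs[idx]]
        labels[idx] = 1
    return diffs, labels


def interval_distance(a, b):
    """Assumed distance: absolute difference between interval midpoints."""
    return abs((a[0] + a[1]) / 2 - (b[0] + b[1]) / 2)


def predict_interval(x, w, scores, candidates, k=10):
    """Pick the candidate interval closest, on average, to the k nearest neighbours.

    scores: list of (w . x_train, interval) entries stored after training.
    """
    z = sum(wi * xi for wi, xi in zip(w, x))
    nearest = sorted(scores, key=lambda entry: abs(entry[0] - z))[:k]

    def average_distance(candidate):
        return sum(interval_distance(candidate, y) for _, y in nearest) / len(nearest)

    return min(candidates, key=average_distance)


# Toy data: a single feature that grows with time (illustrative only).
docs = [([0.0], (1700, 1710)), ([1.0], (1750, 1760)),
        ([1.1], (1755, 1765)), ([2.0], (1800, 1810))]
diffs, labels = build_pairs(docs)
w = [1.0]  # stand-in for weights learned from (diffs, labels)
scores = [(sum(wi * xi for wi, xi in zip(w, x)), y) for x, y in docs]
candidates = [(1700, 1710), (1750, 1760), (1800, 1810)]
prediction = predict_interval([1.05], w, scores, candidates, k=2)
```

In the toy data, the two documents with overlapping intervals (1750 to 1760 and 1755 to 1765) never form a training pair, yet both still serve as neighbours at prediction time.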
We use χ² feature selection with classes defined as the [50·n, 50·(n+1)] interval that overlaps the most with the true one. This coarse approach to feature selection has been shown to work well for temporal classification (Niculae et al., 2014).
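As a small illustration of how such coarse classes could be derived, the sketch below maps an interval label to the index n of the 50-year bin it overlaps most. The function name `coarse_class` and the tie-breaking rule (the earlier bin wins) are assumptions made for this example. The resulting class indices could then serve as targets for any χ² feature scorer.

```python
def coarse_class(y_first, y_last, width=50):
    """Index n of the bin [width*n, width*(n+1)] that overlaps most with
    the interval [y_first, y_last]. Ties go to the earlier bin (an assumption)."""
    best_n, best_overlap = None, -1
    for n in range(y_first // width, y_last // width + 1):
        lo, hi = width * n, width * (n + 1)
        overlap = min(hi, y_last) - max(lo, y_first)
        if overlap > best_overlap:
            best_n, best_overlap = n, overlap
    return best_n


# The interval 1695-1707 shares 5 years with [1650, 1700] and 7 with [1700, 1750].
example_class = coarse_class(1695, 1707)
```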

Results
We perform 5-fold cross-validation over the training set to estimate the task-specific score. We fix the number of neighbours used for prediction to k = 10 after cross-validation using only the number of tokens as a feature. The model parameter space consists of the logistic regression's regularization parameter C, the minimum and maximum frequency thresholds for pruning too rare and too common features, the n-gram range for tokens and for part-of-speech tags, and the number of features to keep after feature selection. We choose the best configuration after many iterations of randomized search.
We compare our ranking model to a ridge regression baseline, employing the document length meta-features and using the middle of the time intervals as target values. We also evaluate a random baseline where one of the candidate intervals is chosen with uniform probability. For evaluation, we use the task-specific metric defined by the organizers (Popescu and Strapparava, 2015), based on the number of interval divisions between the prediction and the right answer.
Table 1: Evaluation of AMBRA and the baselines on the test data. We report the task-specific score (between 0 and 1, higher is better) for the three levels of granularity, as well as the mean absolute error (MAE, lower is better) for the fine level of granularity.
For context, we also report the mean absolute error obtained by taking the center of the intervals as a point estimate of the year. Table 1 shows the performance of AMBRA and the baseline systems on the test documents. On T1, the full AMBRA system is the only one to beat the random baseline in all metrics (95% confidence). On T2, where more data is available, AMBRA with length and style features outperforms ridge regression at fine granularity (95% confidence), and the full AMBRA system outperforms all others in all metrics (99% confidence).
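The mean absolute error over interval centers can be computed as in the following sketch. The function names and the toy intervals are illustrative assumptions, not part of the task's official scorer.

```python
def interval_center(interval):
    """Midpoint of a (first, last) year interval, used as a point estimate."""
    first, last = interval
    return (first + last) / 2


def mean_absolute_error(predicted, gold):
    """Average absolute difference, in years, between interval centers."""
    errors = [abs(interval_center(p) - interval_center(g))
              for p, g in zip(predicted, gold)]
    return sum(errors) / len(errors)


# Toy example: one prediction is off by one six-year step, the other is exact.
mae = mean_absolute_error([(1700, 1706), (1800, 1806)],
                          [(1706, 1712), (1800, 1806)])
```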

Most Informative Features
To better understand the performance of our method, we analyze the most informative features selected by our best models. We use identical feature sets for both tasks, and while there are some common patterns, we observe important differences in the feature rankings, confirming that T1 and T2 are different enough in nature to warrant separate modeling. Among the features useful for both tasks, we find the length of a document in sentences highly predictive, with newer texts being longer. Also, the linguistic structure determiner + singular proper noun is predictive of older texts, while adjective + singular noun is predictive of newer texts. The decrease in use of the contraction 'd is captured in both cases. Among the lexical features, the word letters indicates older texts, corresponding to the decreasing use of mail as telecommunication became mainstream.
Words useful for T1 are more topic- and time-specific ones, such as army, emperor, troops, while the T2 model, possibly enabled by the larger amount of data, proves capable of detecting diachronic spelling variation (publick and public are both selected, with opposite signs), outdated words (upon), and more subtle stylistic changes such as the decrease in use of the Oxford comma (a comma followed by a conjunction at the end of a list).

Conclusion and Future Work
We propose a ranking-based method to handle interval prediction and account for uncertainty in temporal text classification. Our approach proved competitive in the SemEval-2015 'Diachronic Text Evaluation' subtasks one and two. The features we used are simple but effective. We expect performance to improve by including linguistic and etymological expertise in the feature engineering and selection process, as well as by including world knowledge through named entities and linked data.
Our model allows for arbitrary interval labels, which is more expressive and more realistic than the task formulation. We plan to refine collections of historical texts and tighten the annotation intervals wherever possible. Our implementation can be made more scalable by following the random sampling methodology of Sculley (2009).