Dating Documents using Graph Convolution Networks

The date of a document is essential for many important tasks, such as document retrieval, summarization, and event detection. While existing approaches for these tasks assume accurate knowledge of the document date, it is not always available, especially for arbitrary documents from the Web. Document dating is a challenging problem which requires inference over the temporal structure of the document. Prior document dating systems have largely relied on handcrafted features while ignoring such document-internal structures. In this paper, we propose NeuralDater, a Graph Convolutional Network (GCN) based document dating approach which jointly exploits the syntactic and temporal graph structures of a document in a principled way. To the best of our knowledge, this is the first application of deep learning to the problem of document dating. Through extensive experiments on real-world datasets, we find that NeuralDater significantly outperforms the state-of-the-art baseline by 19% absolute (45% relative) accuracy points.


Introduction
The date of a document, also referred to as the Document Creation Time (DCT), is at the core of many important tasks, such as information retrieval (Olson et al., 1999; Li and Croft, 2003; Dakka et al., 2008), temporal reasoning (Mani and Wilson, 2000; Llidó et al., 2001), text summarization (Wan, 2007), event detection (Allan et al., 1998), and analysis of historical text (de Jong et al., 2005a), among others. In all such tasks, the document date is assumed to be available and also accurate, which is a strong assumption, especially for arbitrary documents from the Web. Thus, there is a need to automatically predict the date of a document based on its content. This problem is referred to as Document Dating.
Figure 1: Top: An example document annotated with syntactic and temporal dependencies. In order to predict the right value of 1999 for the Document Creation Time (DCT), inference over these document structures is necessary. Bottom: Document date prediction by two state-of-the-art baselines and NeuralDater, the method proposed in this paper. While the two previous methods are misled by the temporal expression (1995) in the document, NeuralDater is able to use the syntactic and temporal structure of the document to predict the right value (1999).
Initial attempts at automatic document dating started with the generative models of de Jong et al. (2005b). This model was later improved by Kanhabua and Nørvåg (2008a), who incorporate additional features such as POS tags, collocations, etc. Chambers (2012) shows significant improvement over these prior efforts through discriminative models using handcrafted temporal features. Kotsakos et al. (2014) propose a statistical approach for document dating exploiting term burstiness (Lappas et al., 2009).
Figure 2: Overview of NeuralDater. NeuralDater exploits syntactic and temporal structure in a document to learn effective representations, which in turn are used to predict the document time. NeuralDater uses a Bi-directional LSTM (Bi-LSTM) and two Graph Convolution Networks (GCN), one over the dependency tree and the other over the document's temporal graph, along with a softmax classifier, all trained end-to-end jointly. Please see Section 4 for more details.
Document dating is a challenging problem which requires extensive reasoning over the temporal structure of the document. Let us motivate this through the example shown in Figure 1. In the document, the expression "four years after" plays a crucial role in identifying the creation time of the document. The existing approaches give higher confidence to timestamps close to the year mention 1995. NeuralDater exploits the syntactic and temporal structure of the document to predict the right timestamp (1999) for the document. With the exception of Chambers (2012), all prior works on the document dating problem ignore such informative temporal structure within the document.
Research in document event extraction and ordering has made it possible to extract such temporal structures involving events, temporal expressions, and the (unknown) document date in a document (Mirza and Tonelli, 2016; Chambers et al., 2014). While methods to perform reasoning over such structures exist (Verhagen et al., 2007, 2010; UzZaman et al., 2013; Llorens et al., 2015; Pustejovsky et al., 2003), none of them have exploited advances in deep learning (Krizhevsky et al., 2012; Goodfellow et al., 2016). In particular, recently proposed Graph Convolution Networks (GCN) (Defferrard et al., 2016; Kipf and Welling, 2017) have emerged as a way to learn graph representations while encoding the structural information and constraints represented by the graph. We adapt GCNs for the document dating problem and make the following contributions:
• We propose NeuralDater, a Graph Convolution Network (GCN)-based approach for document dating. To the best of our knowledge, this is the first application of GCNs, and more broadly deep neural network-based methods, for the document dating problem.
• NeuralDater is the first document dating approach which exploits the syntactic as well as the temporal structure of the document, all within a principled joint model.
• Through extensive experiments on multiple real-world datasets, we demonstrate NeuralDater's effectiveness over state-of-the-art baselines.
NeuralDater's source code and the datasets used in the paper are available at http://github.com/malllabiisc/NeuralDater.

Related Work
Automatic Document Dating: de Jong et al. (2005b) propose the first approach for automating document dating through a statistical language model. Kanhabua and Nørvåg (2008a) further extend this work by incorporating semantic-based preprocessing and temporal entropy (Kanhabua and Nørvåg, 2008b) based term-weighting. Chambers (2012) proposes a MaxEnt based discriminative model trained on hand-crafted temporal features. He also proposes a model to learn probabilistic constraints between year mentions and the actual creation time of the document. We draw inspiration from his work for exploiting temporal reasoning for document dating. Kotsakos et al. (2014) propose a purely statistical method which considers lexical similarity alongside burstiness (Lappas et al., 2009) of terms for dating documents. To the best of our knowledge, NeuralDater, our proposed method, is the first method to utilize deep learning techniques for the document dating problem.
Event Ordering Systems: Temporal ordering of events is a vast research topic in NLP. The problem is posed as temporal relation classification between two given temporal entities. Machine-learned classifiers with carefully crafted linguistic features are used for this task by Chambers et al. (2007) and Mirza and Tonelli (2014). D'Souza and Ng (2013) use a hybrid approach by adding 437 hand-crafted rules. Chambers and Jurafsky (2008) and Yoshikawa et al. (2009) classify under many more temporal constraints, utilizing integer linear programming and Markov logic, respectively.
CAEVO, a CAscading EVent Ordering architecture (Chambers et al., 2014), is the first to use a sieve-based architecture (Lee et al., 2013) for temporal event ordering. It mixes multiple learners according to their precision-based ranks and uses transitive closure to maintain consistency of the temporal graph. Mirza and Tonelli (2016) propose CATENA (CAusal and TEmporal relation extraction from NAtural language texts), the first integrated system for extracting temporal and causal relations between pre-annotated events and time expressions. They also incorporate a sieve-based architecture, which outperforms existing methods in temporal relation classification. We make use of CATENA for temporal graph construction in our work.

Background: Graph Convolution Networks (GCN)
In this section, we provide an overview of Graph Convolution Networks (GCN) (Kipf and Welling, 2017). A GCN learns an embedding for each node of the graph it is applied over. We first present GCNs for undirected graphs and then move on to GCNs for the directed graph setting.

GCN on Undirected Graph
Let $G = (V, E)$ be an undirected graph, where $V$ is a set of $n$ vertices and $E$ is the set of edges. The input feature matrix $X \in \mathbb{R}^{n \times m}$ has as its rows the input representations of the nodes, $x_u \in \mathbb{R}^{m}, \forall u \in V$.
The output hidden representation $h_v \in \mathbb{R}^{d}$ of a node $v$ after a single layer of graph convolution can be obtained by considering only the immediate neighbors of $v$. This can be formulated as:
$$h_v = f\left(\sum_{u \in \mathcal{N}(v)} \left(W x_u + b\right)\right), \quad \forall v \in V.$$
Here, the model parameters $W \in \mathbb{R}^{d \times m}$ and $b \in \mathbb{R}^{d}$ are learned in a task-specific setting using first-order gradient optimization. $\mathcal{N}(v)$ refers to the set of neighbors of $v$, and $f$ is any non-linear activation function. We have used ReLU as the activation function in this paper.
In order to capture nodes many hops away, multiple GCN layers may be stacked one on top of another. In particular, $h_v^{k+1}$, the representation of node $v$ after the $k$-th GCN layer, can be formulated as:
$$h_v^{k+1} = f\left(\sum_{u \in \mathcal{N}(v)} \left(W^{k} h_u^{k} + b^{k}\right)\right), \quad \forall v \in V,$$
where $h_u^{k}$ is the input to the $k$-th layer.
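The neighborhood aggregation described in this section can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the toy path graph, the identity weight matrix, and the helper names (`relu`, `matvec`, `gcn_layer`) are assumptions made for readability, and the same weights are reused for the second layer, whereas in general each layer has its own parameters.

```python
from typing import Dict, List

Vector = List[float]


def relu(x: Vector) -> Vector:
    """Element-wise ReLU, the activation used in the paper."""
    return [max(0.0, v) for v in x]


def matvec(W: List[Vector], x: Vector) -> Vector:
    """Plain matrix-vector product W x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def gcn_layer(neighbors: Dict[int, List[int]], H: Dict[int, Vector],
              W: List[Vector], b: Vector) -> Dict[int, Vector]:
    """One layer: h_v = ReLU(sum over u in N(v) of (W h_u + b))."""
    out = {}
    for v, nbrs in neighbors.items():
        acc = [0.0] * len(b)
        for u in nbrs:
            msg = matvec(W, H[u])
            acc = [a + m + b_i for a, m, b_i in zip(acc, msg, b)]
        out[v] = relu(acc)
    return out


# Toy path graph 0 - 1 - 2 with 2-dimensional input features (made-up values).
neighbors = {0: [1], 1: [0, 2], 2: [1]}
X = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
W = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, chosen only for readability
b = [0.0, 0.0]

H1 = gcn_layer(neighbors, X, W, b)   # each node sees its 1-hop neighbors
H2 = gcn_layer(neighbors, H1, W, b)  # stacking a layer reaches 2 hops
```

After one layer, node 1 holds the sum of the features of nodes 0 and 2; after two layers, node 0 also reflects node 2, which is two hops away.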

GCN on Labeled and Directed Graph
In this section, we consider a GCN formulation over graphs where each edge is labeled as well as directed. In this setting, an edge from node $u$ to $v$ with label $l(u, v)$ is denoted as $(u, v, l(u, v))$. While a few recent works focus on GCNs over directed graphs (Yasunaga et al., 2017; Marcheggiani and Titov, 2017), none of them consider labeled edges. We handle both direction and label by incorporating label- and direction-specific filters. Based on the assumption that the information in a directed edge need not propagate only along its direction, following Marcheggiani and Titov (2017) we define an updated edge set $E'$ which expands the original set $E$ by incorporating inverse as well as self-loop edges:
$$E' = E \cup \{(v, u, l(u, v)^{-1}) \mid (u, v, l(u, v)) \in E\} \cup \{(u, u, \top) \mid u \in V\}. \tag{1}$$
Here, $l(u, v)^{-1}$ is the inverse edge label corresponding to label $l(u, v)$, and $\top$ is a special empty relation symbol for self-loop edges. We now define $h_v^{k+1}$ as the embedding of node $v$ after the $k$-th GCN layer applied over the directed and labeled graph as:
$$h_v^{k+1} = f\left(\sum_{u \in \mathcal{N}(v)} \left(W^{k}_{l(u,v)} h_u^{k} + b^{k}_{l(u,v)}\right)\right), \tag{2}$$
where $\mathcal{N}(v)$ denotes the neighbors of $v$ under $E'$. We note that the parameters $W^{k}_{l(u,v)}$ and $b^{k}_{l(u,v)}$ in this case are edge label specific.
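As an illustration of how the original edge set is expanded with inverse and self-loop edges, the following sketch builds the updated edge set for a tiny labeled graph. The node names, the example relations, the `"^-1"` suffix for inverse labels, and the `SELF` marker standing in for the empty relation symbol are all assumptions made for this example.

```python
from typing import List, Set, Tuple

Edge = Tuple[str, str, str]  # (source, target, label)
SELF_LOOP = "SELF"           # stands in for the empty relation symbol


def augment_edges(nodes: List[str], edges: Set[Edge]) -> Set[Edge]:
    """Return the original edges plus one inverse edge per original edge
    and one self-loop per node."""
    augmented = set(edges)
    for u, v, label in edges:
        augmented.add((v, u, label + "^-1"))  # inverse edge with inverse label
    for u in nodes:
        augmented.add((u, u, SELF_LOOP))      # self-loop edge
    return augmented


# Toy graph with hypothetical relations.
nodes = ["adopted", "1995", "DCT"]
edges = {("adopted", "1995", "SAME"), ("DCT", "adopted", "AFTER")}
aug = augment_edges(nodes, edges)
```

With two original edges and three nodes, the updated set contains seven edges: two original, two inverse, and three self-loops. Each distinct label (including inverse labels and the self-loop marker) would then index its own weight matrix and bias in the label-specific update.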

Incorporating Edge Importance
In many practical settings, we may not want to give equal importance to all the edges. For example, in the case of automatically constructed graphs, some of the edges may be erroneous and we may want to automatically learn to discard them. Edge-wise gating may be used in a GCN to give importance to relevant edges and subdue the noisy ones. Bastings et al. (2017b) and Marcheggiani and Titov (2017) used gating for similar reasons and obtained high performance gains. At the $k$-th layer, we compute the gating value for a particular edge $(u, v, l(u, v))$ as:
$$g^{k}_{u,v} = \sigma\left(h_u^{k} \cdot \hat{w}^{k}_{l(u,v)} + \hat{b}^{k}_{l(u,v)}\right), \quad \forall (u, v, l(u, v)) \in E',$$
where $\sigma(\cdot)$ is the sigmoid function, $\hat{w}^{k}_{l(u,v)}$ and $\hat{b}^{k}_{l(u,v)}$ are label-specific gating parameters, and $E'$ is the updated edge set. Thus, gating helps to make the model robust to noisy labels and directions in the input graphs. The GCN embedding of a node while incorporating edge gating may be computed as follows:
$$h_v^{k+1} = f\left(\sum_{u \in \mathcal{N}(v)} g^{k}_{u,v} \times \left(W^{k}_{l(u,v)} h_u^{k} + b^{k}_{l(u,v)}\right)\right).$$
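To make the effect of edge-wise gating concrete, here is a small sketch of a single gated message: a sigmoid of a dot product between the source embedding and a gating vector (plus a gating bias) scales the usual linear message. All numeric values and function names are illustrative assumptions; in the model the gating parameters are learned per edge label.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gate(h_u: List[float], w_hat: List[float], b_hat: float) -> float:
    """Scalar gate in (0, 1) for one edge: sigmoid(h_u . w_hat + b_hat)."""
    return sigmoid(sum(h * w for h, w in zip(h_u, w_hat)) + b_hat)


def gated_message(h_u: List[float], W: List[List[float]], b: List[float],
                  w_hat: List[float], b_hat: float) -> List[float]:
    """Message sent along one edge: gate * (W h_u + b)."""
    g = gate(h_u, w_hat, b_hat)
    msg = [sum(w_ij * h_j for w_ij, h_j in zip(row, h_u)) + b_i
           for row, b_i in zip(W, b)]
    return [g * m for m in msg]


h_u = [1.0, 2.0]
W = [[1.0, 0.0], [0.0, 1.0]]  # identity, for readability only
b = [0.0, 0.0]

half_open = gated_message(h_u, W, b, w_hat=[0.0, 0.0], b_hat=0.0)
nearly_closed = gate(h_u, w_hat=[0.0, 0.0], b_hat=-50.0)
```

With zero gating parameters the gate is exactly 0.5, so the message is halved; a strongly negative gating bias drives the gate towards zero, which is how a noisy edge can be effectively switched off.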

NeuralDater Overview
The Document Dating problem may be cast as a multi-class classification problem (Kotsakos et al., 2014; Chambers, 2012). In this section, we present an overview of NeuralDater, the document dating system proposed in this paper. An architectural overview of NeuralDater is shown in Figure 2.
NeuralDater is a deep learning-based multi-class classification system. It takes a document as input and returns its predicted date as output by exploiting the syntactic and temporal structure of the document.
The NeuralDater network consists of three layers which learn an embedding for the Document Creation Time (DCT) node corresponding to the document. This embedding is then fed to a softmax classifier which produces a distribution over timestamps. Following prior research (Chambers, 2012; Kotsakos et al., 2014), we work with year granularity for the experiments in this paper. We note, however, that NeuralDater can be trained for finer granularity with appropriate training data. The NeuralDater network is trained end-to-end using training data. We briefly present NeuralDater's various components below. Each component is described in greater detail in subsequent sections.
• Context Embedding: In this layer, NeuralDater uses a Bi-directional LSTM (Bi-LSTM) to learn an embedding for each token in the document. Bi-LSTMs have been shown to be quite effective in capturing local context inside token embeddings (Sutskever et al., 2014).
• Syntactic Embedding: In this step, NeuralDater revises token embeddings from the previous step by running a GCN over the dependency parses of sentences in the document. We refer to this GCN as Syntactic GCN or S-GCN. While the Bi-LSTM captures immediate local context in token embeddings, S-GCN augments them by capturing syntactic context.
• Temporal Embedding: In this step, NeuralDater further refines the embeddings learned by S-GCN to incorporate cues from the temporal structure of events and times in the document. NeuralDater uses a state-of-the-art causal and temporal relation extraction algorithm (Mirza and Tonelli, 2016) for extracting a temporal graph for each document. A GCN is then run over this temporal graph to refine the embeddings from the previous layer. We refer to this GCN as Temporal GCN or T-GCN. In this step, a special DCT node is introduced whose embedding is also learned by the T-GCN.
• Classifier: The embedding of the DCT node, along with the average-pooled embeddings learned by S-GCN, is fed to a fully connected softmax classifier which makes the final prediction about the date of the document.
Even though the previous discussion is presented in a sequential manner, the whole network is trained in a joint end-to-end manner using backpropagation.

NeuralDater Details
In this section, we present a detailed description of the various components of NeuralDater.

Context Embedding (Bi-LSTM)
Let us consider a document $D$ with $n$ tokens $w_1, w_2, \ldots, w_n$. We first represent each token by a $k$-dimensional word embedding. For the experiments in this paper, we use GloVe (Pennington et al., 2014) embeddings. These token embeddings are stacked together to get the document representation $X \in \mathbb{R}^{n \times k}$. We then employ a Bi-directional LSTM (Bi-LSTM) (Hochreiter and Schmidhuber, 1997) on the input matrix $X$ to obtain a contextual embedding for each token. After stacking the contextual embeddings of all these tokens, we get the new document representation matrix $H^{cntx} \in \mathbb{R}^{n \times r_{cntx}}$. In this new representation, each token is represented in an $r_{cntx}$-dimensional space. Our choice of LSTMs for learning contextual embeddings for tokens is motivated by the previous success of LSTMs in this task (Sutskever et al., 2014).

Syntactic Embedding (S-GCN)
While the Bi-LSTM is effective at capturing the immediate local context of a token, it may not be as effective in capturing longer-range dependencies among words in a sentence. For example, in Figure 1, we would like the embedding of the token approved to be directly affected by govt, even though they are not immediate neighbors. A dependency parse may be used to capture such longer-range connections. In fact, similar features were exploited by Chambers (2012) for the document dating problem. NeuralDater captures such longer-range information by using another GCN run over the syntactic structure of the document. We describe this in detail below.
The context embedding $H^{cntx} \in \mathbb{R}^{n \times r_{cntx}}$ learned in the previous step is used as input to this layer. For a given document, we first extract its syntactic dependency structure by applying the Stanford CoreNLP dependency parser on each sentence in the document individually. We then employ a Graph Convolution Network (GCN) over this dependency graph using the GCN formulation presented in Section 3.2. We call this GCN the Syntactic GCN or S-GCN, as mentioned in Section 4.
Since S-GCN operates over the dependency graph and uses Equation 2 for updating embeddings, the number of parameters in S-GCN is directly proportional to the number of dependency edge types. Stanford CoreNLP's dependency parser returns 55 different dependency edge types. This large number of edge types would significantly over-parameterize S-GCN, thereby increasing the possibility of overfitting. In order to address this, we use only three edge types in S-GCN. For each edge connecting nodes $w_i$ and $w_j$ in the updated edge set $E'$ (see Equation 1), we determine its new type $L(w_i, w_j)$ as follows:
• $L(w_i, w_j) \equiv\ \rightarrow$ if $(w_i, w_j, l(w_i, w_j)) \in E$, i.e., if the edge is an original dependency parse edge
• $L(w_i, w_j) \equiv\ \leftarrow$ if $(w_i, w_j, l(w_i, w_j)^{-1}) \in E'$, i.e., if the edge is an inverse edge
• $L(w_i, w_j) \equiv \top$ if $(w_i, w_j, \top) \in E'$, i.e., if the edge is a self-loop with $w_i = w_j$
S-GCN now estimates an embedding $h^{syn}_{w_i} \in \mathbb{R}^{r_{syn}}$ for each token $w_i$ in the document using the formulation shown below:
$$h^{syn}_{w_i} = f\left(\sum_{w_j \in \mathcal{N}(w_i)} \left(W_{L(w_i, w_j)} h^{cntx}_{w_j} + b_{L(w_i, w_j)}\right)\right).$$
Please note S-GCN's use of the new edge types $L(w_i, w_j)$ above, instead of the $l(w_i, w_j)$ types used in Equation 2. By stacking the embeddings for all the tokens together, we get the new embedding matrix $H^{syn} \in \mathbb{R}^{n \times r_{syn}}$ representing the document.
AveragePooling: We obtain an embedding $h^{avg}_{D}$ for the whole document by average pooling of every token representation:
$$h^{avg}_{D} = \frac{1}{n} \sum_{i=1}^{n} h^{syn}_{w_i}. \tag{3}$$
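The two ideas in this subsection, collapsing the parser's many dependency labels into three edge types and average pooling the token embeddings, can be sketched as follows. The three-token sentence, the dependency labels, the type names `FWD`/`INV`/`SELF`, and the embedding values are assumptions chosen for illustration only.

```python
from typing import List, Tuple

DepEdge = Tuple[int, int, str]    # (head index, dependent index, parser label)
TypedEdge = Tuple[int, int, str]  # (source, target, one of three edge types)


def collapse_edge_types(n_tokens: int, dep_edges: List[DepEdge]) -> List[TypedEdge]:
    """Discard the fine-grained parser label and keep only direction:
    original edge, inverse edge, or self-loop."""
    typed: List[TypedEdge] = []
    for head, dep, _label in dep_edges:
        typed.append((head, dep, "FWD"))  # original dependency edge
        typed.append((dep, head, "INV"))  # inverse edge
    typed.extend((i, i, "SELF") for i in range(n_tokens))  # self-loops
    return typed


def average_pool(token_embeddings: List[List[float]]) -> List[float]:
    """Mean of the token embeddings, dimension by dimension."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(vec[d] for vec in token_embeddings) / n for d in range(dim)]


# Toy sentence "Swiss adopted taxation" with illustrative dependency labels.
dep_edges = [(1, 0, "nsubj"), (1, 2, "dobj")]
typed_edges = collapse_edge_types(3, dep_edges)
edge_types = {t for _, _, t in typed_edges}

# Made-up 2-dimensional syntactic embeddings for the three tokens.
h_avg = average_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```

Whatever the number of parser labels, only three parameter sets are needed, one per collapsed edge type, which is the point of the reduction described above.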

Temporal Embedding (T-GCN)
In this layer, NeuralDater exploits the temporal structure of the document to learn an embedding for the Document Creation Time (DCT) node of the document. First, we describe the construction of the temporal graph, followed by GCN-based embedding learning over this graph.
Temporal Graph Construction: NeuralDater uses Stanford's SUTime tagger (Chang and Manning, 2012) for date normalization and the event extraction classifier of Chambers et al. (2014) for event detection. The annotated document is then passed to CATENA (Mirza and Tonelli, 2016), the current state-of-the-art temporal and causal relation extraction algorithm, to obtain a temporal graph for each document. Since our task is to predict the creation time of a given document, we supply the DCT as unknown to CATENA. We hypothesize that the temporal relations extracted in the absence of the DCT are helpful for document dating, and we indeed find this to be true, as shown in Section 7. The temporal graph is a directed graph, where nodes correspond to events, time mentions, and the Document Creation Time (DCT). Edges in this graph represent causal and temporal relationships between them. Each edge is attributed with a label representing the type of the temporal relation. CATENA outputs 9 different types of temporal relations, out of which we selected five types, viz., AFTER, BEFORE, SAME, INCLUDES, and IS INCLUDED. The remaining four types were ignored as they were substantially infrequent.
Please note that the temporal graph may involve only a small number of tokens in the document. For example, in the temporal graph in Figure 2, there are a total of 5 nodes: two temporal expression nodes (1995 and four years after), two event nodes (adopted and approved), and a special DCT node. This graph also consists of temporal relation edges such as (four years after, approved, BEFORE).
Temporal Graph Convolution: NeuralDater employs a GCN over the temporal graph constructed above. We refer to this GCN as the Temporal GCN or T-GCN, as mentioned in Section 4. T-GCN is based on the GCN formulation presented in Section 3.2. Unlike S-GCN, here we consider label- and direction-specific parameters, as the temporal graph consists of only five types of edges.
Let $n_T$ be the number of nodes in the temporal graph. Starting with $H^{syn}$ (Section 5.2), T-GCN learns an $r_{temp}$-dimensional embedding for each node in the temporal graph. Stacking all these embeddings together, we get the embedding matrix $H^{temp} \in \mathbb{R}^{n_T \times r_{temp}}$. T-GCN embeds the temporal constraints induced by the temporal graph in $h^{temp}_{DCT} \in \mathbb{R}^{r_{temp}}$, the embedding of the DCT node of the document.
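A temporal graph of the kind described in this subsection can be represented as a set of nodes plus a list of labeled directed edges, keeping only the five retained relation types. The sketch below is an assumption-laden illustration: apart from the edge (four years after, approved, BEFORE) mentioned in the text, the relations are made up, and `HYPOTHETICAL_RARE` is a placeholder for any infrequent relation type that would be dropped; it is not an actual CATENA label.

```python
from typing import List, Set, Tuple

Relation = Tuple[str, str, str]  # (source node, target node, relation label)

# The five relation types retained in the paper.
KEPT_RELATIONS = {"AFTER", "BEFORE", "SAME", "INCLUDES", "IS INCLUDED"}


def build_temporal_graph(relations: List[Relation]) -> Tuple[Set[str], List[Relation]]:
    """Keep only the retained relation types and always include a DCT node."""
    nodes: Set[str] = {"DCT"}
    edges: List[Relation] = []
    for src, dst, label in relations:
        if label not in KEPT_RELATIONS:
            continue  # infrequent relation types are ignored
        nodes.update((src, dst))
        edges.append((src, dst, label))
    return nodes, edges


extracted = [
    ("four years after", "approved", "BEFORE"),  # edge mentioned in the text
    ("adopted", "1995", "SAME"),                 # assumed for illustration
    ("approved", "DCT", "HYPOTHETICAL_RARE"),    # placeholder label, dropped
]
nodes, edges = build_temporal_graph(extracted)
```

The resulting graph has the five nodes listed in the example above (two time expressions, two events, and the DCT node), and its edges would then be expanded with inverse and self-loop edges before running T-GCN.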

Classifier
Finally, the DCT embedding $h^{temp}_{DCT}$ and the average-pooled syntactic representation $h^{avg}_{D}$ (see Equation 3) of document $D$ are concatenated and fed to a fully connected feed-forward network followed by a softmax. This allows NeuralDater to exploit the context, syntactic, and temporal structure of the document to predict the final document date $y$.
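The classification step can be sketched as concatenation followed by a linear layer and a softmax over candidate years. This is a simplified illustration under stated assumptions: a single linear layer stands in for the fully connected network, and the embeddings, weights, and candidate years are made-up values rather than learned parameters.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_year(h_dct: List[float], h_avg: List[float],
                 W: List[List[float]], b: List[float],
                 years: List[int]) -> Tuple[int, List[float]]:
    """Concatenate the two embeddings, apply one linear layer, then softmax."""
    features = h_dct + h_avg  # list concatenation
    logits = [sum(w * x for w, x in zip(row, features)) + b_i
              for row, b_i in zip(W, b)]
    probs = softmax(logits)
    best = max(range(len(years)), key=probs.__getitem__)
    return years[best], probs


years = [1995, 1999, 2000]
h_dct = [1.0, 0.0]           # toy DCT-node embedding
h_avg = [0.0, 1.0]           # toy average-pooled document embedding
W = [[0.0, 0.0, 0.0, 0.0],   # one row of weights per candidate year
     [2.0, 0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0, 0.0]]
b = [0.0, 0.0, 0.0]

year, probs = predict_year(h_dct, h_avg, W, b, years)
```

Here the logits are 0, 3, and 1, so the sketch selects 1999. Training would minimize the cross-entropy between this distribution and the true year.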

Experimental Setup
Datasets: We experiment on the Associated Press Worldstream (APW) and New York Times (NYT) sections of the Gigaword corpus (Parker et al., 2011). The original dataset contains around 3 million APW documents and 2 million NYT documents spanning multiple years. From both sections, we randomly sample around 650k documents while maintaining balance among years. Documents belonging to years with substantially fewer documents are omitted. Details of the dataset can be found in Table 1. The dataset was randomly divided into train, validation, and test splits in an 80:10:10 ratio.
Evaluation Criteria: Given a document, the model needs to predict the year in which the document was published. We measure performance in terms of overall accuracy of the model.
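The random 80:10:10 split and the accuracy metric can be sketched as below. The function names, the fixed seed, and the toy labels are assumptions for illustration; the balancing of documents across years described above is not reproduced here.

```python
import random
from typing import List, Sequence, Tuple


def split_80_10_10(items: Sequence, seed: int = 0) -> Tuple[List, List, List]:
    """Shuffle and split into train, validation, and test portions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10  # integer arithmetic avoids float rounding
    n_valid = len(shuffled) // 10
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


def accuracy(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Fraction of documents whose predicted year equals the true year."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


train, valid, test = split_80_10_10(range(100))
acc = accuracy([1999, 1995, 2000, 1999], [1999, 1996, 2000, 1998])
```

Note that accuracy at year granularity gives no credit for near misses: predicting 1998 for a 1999 document counts the same as predicting 1990.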
Baselines: For evaluating NeuralDater, we compared against the following methods:
• BurstySimDater (Kotsakos et al., 2014): This is a purely statistical method which uses lexical similarity and term burstiness (Lappas et al., 2009) for dating documents in time frames of arbitrary length. For our experiments, we took the time frame length as 1 year. Please refer to Kotsakos et al. (2014) for more details.
• MaxEnt-Joint: Refers to MaxEnt-Time-NER combined with the year mention classifier as described in Chambers (2012).
• MaxEnt-Uni-Time: A MaxEnt-based discriminative model which takes a bag-of-words representation of the input document with normalized time expressions as its features.
• CNN: A Convolutional Neural Network (CNN) (LeCun et al., 1999) based text classification model proposed by Kim (2014), which attained state-of-the-art results in several domains.
Hyperparameters: By default, edge gating (Section 3.3) is used in all GCNs. The parameter K represents the number of layers in T-GCN (Section 5.3). We use 300-dimensional GloVe embeddings and 128-dimensional hidden state for both

Performance Comparison
In order to evaluate the effectiveness of NeuralDater, our proposed method, we compare it against existing document dating systems and text classification models. The final results are summarized in Table 2. Overall, we find that NeuralDater outperforms all other methods by a significant margin on both datasets. Compared to the previous state-of-the-art in document dating, BurstySimDater (Kotsakos et al., 2014), we get a 19% average absolute improvement in accuracy across both datasets. We observe only a slight gain in the performance of the MaxEnt-based model (MaxEnt-Time+NER) of Chambers (2012) on combining it with the temporal constraint reasoner (MaxEnt-Joint). This may be attributed to the fact that the model utilizes only year mentions in the document, thus ignoring other signals which might be relevant to the task. BurstySimDater performs considerably better in terms of precision compared to the other baselines, although it significantly underperforms in accuracy. We note that NeuralDater outperforms all these prior models both in terms of precision and accuracy. We find that even generic deep-learning based text classification models, such as CNN (Kim, 2014), are quite effective for the problem. However, since such a model does not give specific attention to temporal features in the document, its performance remains limited. From Figure 3, we observe that NeuralDater's top prediction achieves on average the lowest deviation from the true year.
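The deviation from the true year mentioned above can be measured, for instance, as the mean absolute difference in years between the top prediction and the gold label. The sketch below assumes this definition, which the text does not spell out, and uses made-up labels.

```python
from typing import Sequence


def mean_absolute_deviation(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Average |predicted year - true year| over all documents."""
    return sum(abs(g - p) for g, p in zip(gold, pred)) / len(gold)


gold_years = [1999, 1995, 2000, 2003]
pred_years = [1999, 1997, 2001, 2003]
mad = mean_absolute_deviation(gold_years, pred_years)
```

Unlike accuracy, this quantity distinguishes a prediction that is off by one year from one that is off by a decade, so the two metrics complement each other.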

Ablation Comparisons
To demonstrate the efficacy of GCNs and the BiLSTM for the problem, we evaluate different ablated variants of NeuralDater on the APW dataset. Specifically, we validate the importance of using the syntactic and temporal GCNs and the effect of eliminating the BiLSTM from the model. Overall results are summarized in Table 3. The first block of rows in the table corresponds to the case when the BiLSTM layer is excluded from NeuralDater, while the second block denotes the case when the BiLSTM is included. We also experiment with multiple stacked layers of T-GCN (denoted by K) to observe its effect on the performance of the model. We observe that embeddings from the Syntactic GCN (S-GCN) are much better than plain GloVe embeddings as input to T-GCN, since S-GCN encodes syntactic neighborhood information in the event and time embeddings, which makes them more relevant for the document dating task.
Figure 4: Evaluating performance of different methods on dating documents with and without time mentions. Please see Section 7.3 for details.

Overall, we observe that including the BiLSTM in the model improves performance significantly. The BiLSTM model alone outperforms all the models listed in the first block of Table 3. Also, some gain in performance is observed on increasing the number of T-GCN layers (K) in the absence of the BiLSTM, although the same does not hold when the BiLSTM is included in the model. This observation is consistent with Marcheggiani and Titov (2017), as multiple GCN layers become redundant in the presence of a BiLSTM. We also find that eliminating edge gating from our best model deteriorates its overall performance.
In summary, these results validate our thesis that joint incorporation of syntactic and temporal structure of a document in NeuralDater results in improved performance.

Discussion and Error Analysis
In this section, we list some of our observations from trying to identify the pros and cons of NeuralDater, our proposed method. We divided the development split of the APW dataset into two sets: documents with and documents without any mention of time expressions (year). We apply NeuralDater and other methods to these two sets of documents and report accuracies in Figure 4. We find that, overall, NeuralDater performs better than the existing baselines in both scenarios. Even though the performance of NeuralDater degrades in the absence of time mentions, it remains the best among the compared methods. Based on further analysis, we find that NeuralDater fails to identify the timestamp of documents reporting local, infrequent incidents without an explicit time mention. NeuralDater becomes confused in the presence of multiple misleading time mentions; it also loses out on documents discussing events which are outside the time range of the text on which the model was trained. In the future, we plan to eliminate these pitfalls by incorporating additional signals from Knowledge Graphs about entities mentioned in the document. We also plan to utilize free-text temporal expressions (Kuzey et al., 2016) in documents for improving performance on this problem.
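One simple way to separate documents with and without explicit year mentions is a regular expression over four-digit years, as sketched below. This is an assumption made for illustration: the text does not state how the two sets were obtained, and a temporal tagger would catch expressions that this pattern misses. The example sentences are made up.

```python
import re
from typing import Iterable, List, Tuple

# Four-digit years from 1800 to 2099; the range is an illustrative assumption.
YEAR_PATTERN = re.compile(r"\b(?:1[89]\d{2}|20\d{2})\b")


def has_year_mention(text: str) -> bool:
    return YEAR_PATTERN.search(text) is not None


def partition_by_year_mention(docs: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split documents into (with year mention, without year mention)."""
    with_year: List[str] = []
    without_year: List[str] = []
    for doc in docs:
        (with_year if has_year_mention(doc) else without_year).append(doc)
    return with_year, without_year


docs = [
    "Swiss adopted that form of taxation in 1995.",
    "The council met on Tuesday to discuss the budget.",
]
with_year, without_year = partition_by_year_mention(docs)
```

Accuracy can then be computed separately on each subset to reproduce the kind of comparison reported for documents with and without time mentions.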

Conclusion
We propose NeuralDater, a Graph Convolutional Network (GCN) based method for document dating which exploits syntactic and temporal structures in the document in a principled way. To the best of our knowledge, this is the first application of deep learning techniques for the problem of document dating. Through extensive experiments on real-world datasets, we demonstrate the effectiveness of NeuralDater over existing state-of-the-art approaches. We are hopeful that the representation learning techniques explored in this paper will inspire further development and adoption of such techniques in the temporal information processing research community.