AD3: Attentive Deep Document Dater

Knowledge of the creation date of documents facilitates several tasks such as summarization, event extraction, and temporally focused information extraction. Unfortunately, for most of the documents on the Web, the time-stamp metadata is either missing or cannot be trusted. Thus, predicting creation time from document content itself is an important task. In this paper, we propose Attentive Deep Document Dater (AD3), an attention-based neural document dating system which utilizes both context and temporal information in documents in a flexible and principled manner. We perform extensive experimentation on multiple real-world datasets to demonstrate the effectiveness of AD3 over neural and non-neural baselines.


Introduction
Many natural language processing tasks require document creation time (DCT) information as useful additional metadata. Tasks such as information retrieval (Li and Croft, 2003; Dakka et al., 2008), temporal scoping of events and facts (Allan et al., 1998; Talukdar et al., 2012b), document summarization (Wan, 2007) and analysis (de Jong et al., 2005a) require precise and validated creation time of the documents. Most of the documents obtained from the Web either contain a DCT that cannot be trusted or contain no DCT information at all (Kanhabua and Nørvåg, 2008). Thus, predicting the time of these documents based on their content is an important task, often referred to as Document Dating.
A few generative approaches (de Jong et al., 2005b; Kanhabua and Nørvåg, 2008) as well as a discriminative model (Chambers, 2012) have been previously proposed for this task. Kotsakos et al. (2014) employ term-burstiness, resulting in improved precision on this task.
The recently proposed NeuralDater (Vashishth et al., 2018) uses a graph convolution network (GCN) based approach for document dating, outperforming all previous models by a significant margin. NeuralDater extensively uses the syntactic and temporal graph structure present within the document itself. Motivated by NeuralDater, we explicitly develop two different methods: a) Attentive Context Model, and b) Ordered Event Model. The first component tries to accumulate knowledge across documents, whereas the latter uses the temporal structure of the document for predicting its DCT.
Motivated by the effectiveness of attention-based models in different NLP tasks (Yang et al., 2016a; Bahdanau et al., 2014), we incorporate attention in our method in a principled fashion. We use attention not only to capture context but also for feature aggregation in the graph convolution network (Hamilton et al., 2017). Our contributions are as follows.
• We propose Attentive Deep Document Dater (AD3), the first attention-based neural model for time-stamping documents.
• We devise a novel method for label based attentive graph convolution over directed graphs and use it for the document dating task.
• Through extensive experiments on multiple real-world datasets, we demonstrate AD3's effectiveness over previously proposed methods.

Figure 1: OE-GCN provides the probability scores over the years given the encoded DCT, while AC-GCN provides the probability scores given the context of the document. Both the models are trained separately.

Related Work
Document Time-Stamping: Earlier attempts made for the document time-stamping task include statistical language models proposed by de Jong et al. (2005b) and Kanhabua and Nørvåg (2008). Chambers (2012) uses temporal and hand-crafted features extracted from documents to predict DCT, and proposes two models: one learns the probabilistic constraints between year mentions and the actual creation time, whereas the other is a discriminative model trained on hand-crafted features. Kotsakos et al. (2014) propose a term-burstiness (Lappas et al., 2009) based statistical method for the task. Vashishth et al. (2018) propose a deep learning based model which exploits the temporal and syntactic structure in documents using graph convolutional networks (GCN).
Event Ordering System: The task of extracting temporally rich events and time expressions and ordering them was introduced in the TempEval challenge (UzZaman et al., 2013; Verhagen et al., 2010). Various approaches to this task (McDowell et al., 2017; Mirza and Tonelli, 2016) use sieve-based architectures, where multiple classifiers are ranked according to their precision and their predictions are weighted accordingly, resulting in a temporal graph structure. A method to extract temporal ordering among relational facts was proposed by Talukdar et al. (2012a).
Graph Convolutional Network (GCN): GCN (Kipf and Welling, 2016) is the extension of convolutional networks over graphs. GCNs have proved to be effective in different NLP tasks such as semantic-role labeling (Marcheggiani and Titov, 2017), neural machine translation (Bastings et al., 2017), and event detection (Nguyen and Grishman, 2018). We extensively use GCN for capturing both syntactic and temporal aspects of the document.
Attention Network: Attention networks have been well exploited for various tasks such as document classification (Yang et al., 2016b), question answering (Yang et al., 2016a), and machine translation (Bahdanau et al., 2014; Vaswani et al., 2017). Recently, attention over graph structure has been shown to work well by Veličković et al. (2018). Taking motivation from them, we deploy an attentive convolutional network on the temporal graph for the document dating problem.

Background: GCN & NeuralDater
The task of document dating can be modeled as a multi-class classification problem. Following prior work, we focus on DCT prediction at year granularity in this paper. In this section, we summarize the previous state-of-the-art model NeuralDater (Vashishth et al., 2018), before moving on to our method. An overview of the graph convolutional network (GCN) (Kipf and Welling, 2016) is also necessary, as it is used in NeuralDater as well as in our model.

Graph Convolutional Network
GCN for Undirected Graph: Consider an undirected graph $G = (V, E)$, where $V$ and $E$ are the set of $n$ vertices and the set of edges, respectively. The input feature matrix is $X \in \mathbb{R}^{n \times m}$, whose rows are the input representations $x_u \in \mathbb{R}^{m}$ of the nodes $u \in V$. The output hidden representation $h_v \in \mathbb{R}^{d}$ of a node $v$ after a single layer of graph convolution can be obtained by considering only the immediate neighbours of $v$, as formulated in (Kipf and Welling, 2016). In order to capture information at multi-hop distance, one can stack GCN layers one over another.
GCN for Directed Graph: Consider a labelled edge from node $u$ to $v$ with label $l(u, v)$, denoted collectively as $(u, v, l(u, v))$. Based on the assumption that information in a directed edge need not propagate only along its direction, Marcheggiani and Titov (2017) added opposite edges, viz., for each $(u, v, l(u, v))$, the edge $(v, u, l(u, v)^{-1})$ is added to the edge list. Self loops are also added for passing the current embedding information. When GCN is applied over this modified directed graph, the embedding of the node $v$ after the $k$-th layer will be
$$h_v^{k+1} = f\Big(\sum_{u \in \mathcal{N}(v)} \big(W^{k}_{l(u,v)} h_u^{k} + b^{k}_{l(u,v)}\big)\Big).$$
We note that the parameters $W^{k}_{l(u,v)}$ and $b^{k}_{l(u,v)}$ in this case are edge label specific, and $h_u^{k}$ is the input to the $k$-th layer. Here, $\mathcal{N}(v)$ refers to the set of neighbours of $v$ according to the updated edge list, and $f$ is any non-linear activation function (e.g., ReLU: $f(x) = \max(0, x)$).
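As an illustration of the directed, label-specific graph convolution described above, the following is a minimal pure-Python sketch, not the implementation used in NeuralDater or AD3. The helper names (`augment_edges`, `gcn_layer`), the `"^-1"` suffix for reverse labels, and the `"self"` label for self loops are assumptions made for this example.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Edge = Tuple[int, int, str]  # (source u, target v, label l(u, v))


def augment_edges(edges: List[Edge], num_nodes: int) -> List[Edge]:
    """Add a reverse edge per original edge and one self loop per node.

    The "^-1" suffix and the "self" label are naming assumptions.
    """
    reverse = [(v, u, label + "^-1") for (u, v, label) in edges]
    loops = [(u, u, "self") for u in range(num_nodes)]
    return list(edges) + reverse + loops


def matvec(W: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def gcn_layer(h: List[Vector], edges: List[Edge],
              W: Dict[str, Matrix], b: Dict[str, Vector]) -> List[Vector]:
    """One layer: h_v' = ReLU(sum over edges (u, v, l) of W[l] h_u + b[l])."""
    out_dim = len(next(iter(b.values())))
    acc = [[0.0] * out_dim for _ in h]
    for (u, v, label) in edges:  # the message flows from u into v
        msg = matvec(W[label], h[u])
        for k in range(out_dim):
            acc[v][k] += msg[k] + b[label][k]
    return [[max(0.0, x) for x in row] for row in acc]


# Toy usage: two nodes and one labelled edge 0 -> 1.
identity = [[1.0, 0.0], [0.0, 1.0]]
edges = augment_edges([(0, 1, "BEFORE")], num_nodes=2)
labels = {label for (_, _, label) in edges}
W = {label: identity for label in labels}
b = {label: [0.0, 0.0] for label in labels}
h_next = gcn_layer([[1.0, 2.0], [3.0, -1.0]], edges, W, b)
```

With identity weights and zero biases, each node simply sums its own vector and its neighbour's vector, which makes the effect of the added reverse edge and self loops easy to check by hand. Stacking calls to `gcn_layer` corresponds to capturing multi-hop information.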

NeuralDater
In this sub-section, we provide a brief overview of the components of NeuralDater (Vashishth et al., 2018). Given a document $D$ with $n$ tokens $w_1, w_2, \ldots, w_n$, NeuralDater extracts a temporally rich embedding of the document in a principled way, as explained below:

Context Embedding
A bi-directional LSTM is employed for embedding each word with its context. The GloVe representation of the words, $X \in \mathbb{R}^{n \times k}$, is transformed to a context-aware representation $H^{cntx} \in \mathbb{R}^{n \times k}$ to get the context embedding. This corresponds to the Bi-LSTM in Figure 1.

Syntactic Embedding
In this step, the context embeddings are further processed using GCN over the dependency parse trees of the sentences in the document, in order to capture long-range connections among words. The syntactic dependency structure is extracted by Stanford CoreNLP's dependency parser (Manning et al., 2014). NeuralDater follows the same formulation of GCN for directed graphs as described in Section 3.1, where additional edges are added to the graph to model the information flow. Again following Marcheggiani and Titov (2017), NeuralDater does not allocate separate weight matrices for different types of dependency edge labels; rather, it considers only three types of edges: a) edges that exist originally, b) the reverse edges that are added explicitly, and c) self loops. The S-GCN portion of Figure 1 represents this component.
More formally, $H^{cntx} \in \mathbb{R}^{n \times k}$ is transformed to $H^{syn} \in \mathbb{R}^{n \times k_{syn}}$ by applying S-GCN.

Temporal Embedding
In this layer, NeuralDater exploits the Event-Time graph structure present in the document. CATENA (Mirza and Tonelli, 2016), the current state-of-the-art temporal and causal relation extraction algorithm, produces the temporal graph from the event-time annotation of the document. The GCN applied over this Event-Time graph, namely T-GCN, selects $n_T$ tokens out of the total $n$ tokens of the document and further revises their embeddings. Note that $n_T$ is the total number of events and time mentions present in the document. A special node DCT is added to the graph and its embedding is jointly learned. Note that this layer learns both label- and direction-specific parameters.

Classifier
Finally, the DCT embedding concatenated with the average pooled syntactic embedding is fed to a softmax layer for classification. This whole procedure is trained jointly.
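The classification step just described (average pooling, concatenation, softmax) can be sketched in a few lines of pure Python. This is an illustrative sketch under assumptions: the function names and the toy weights are hypothetical, and the real model learns `W` and `b` jointly with the rest of the network.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def average_pool(H: Matrix) -> Vector:
    """Mean over tokens (rows) of the syntactic embedding matrix."""
    n = len(H)
    return [sum(row[k] for row in H) / n for k in range(len(H[0]))]


def softmax(z: Vector) -> Vector:
    """Numerically stable softmax."""
    top = max(z)
    exps = [math.exp(x - top) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def classify(h_dct: Vector, H_syn: Matrix, W: Matrix, b: Vector) -> Vector:
    """Concatenate the DCT embedding with the pooled syntactic embedding,
    then apply a linear layer followed by softmax over the year classes."""
    features = h_dct + average_pool(H_syn)  # list concatenation
    logits = [sum(w * x for w, x in zip(row, features)) + bk
              for row, bk in zip(W, b)]
    return softmax(logits)


# Toy usage: 1-dim DCT embedding, two tokens with 2-dim embeddings, 2 years.
pooled = average_pool([[0.0, 2.0], [2.0, 4.0]])
probs = classify([1.0], [[0.0, 2.0], [2.0, 4.0]],
                 W=[[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], b=[0.0, 0.0])
```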

Attentive Deep Document Dater (AD3): Proposed Method
In this section, we describe Attentive Deep Document Dater (AD3), our proposed method. AD3 is inspired by NeuralDater and shares many of its components. Like NeuralDater, AD3 leverages two main types of signals from the document, syntactic and event-time, to predict the document's timestamp. However, there are crucial differences between the two systems. Firstly, instead of concatenating embeddings learned from these two sources as in NeuralDater, AD3 treats the two models as completely separate and combines them at a later stage. Secondly, unlike NeuralDater, AD3 employs attention mechanisms in each of these two models. We call the resulting models Attentive Context Model (AC-GCN) and Ordered Event Model (OE-GCN). These two models are described in Section 4.1 and Section 4.2, respectively.

Attentive Context Model (AC-GCN)
The recent success of attention-based deep learning models for classification (Yang et al., 2016b), question answering (Yang et al., 2016a), and machine translation (Bahdanau et al., 2014) has motivated us to use attention during document dating. We extend the syntactic embedding model of NeuralDater (Section 3.2.2) by incorporating an attentive pooling layer. We call the resulting model AC-GCN. This model (right side in Figure 1) has two major components.
• Context Embedding and Syntactic Embedding: Following NeuralDater, we use Bi-LSTM and S-GCN to capture context and long-range syntactic dependencies in the document (please refer to Sections 3.2.1 and 3.2.2 for a brief description). The syntactic embedding, $H^{syn} \in \mathbb{R}^{n \times k_{syn}}$, is then fed to an attention network for further processing. Note that $k_{syn}$ is the dimension of the output of Syntactic-GCN and $n$ is the number of tokens in the document.
• Attentive Embedding: In this layer, we learn the representation for the whole document through a word-level attention network. We learn a context vector $u_s \in \mathbb{R}^{s}$ with respect to which we calculate attention for each token. Finally, we aggregate the token features according to their attention weights in order to represent the document. More formally, let $h_t^{syn} \in \mathbb{R}^{k_{syn}}$ be the syntactic representation of the $t$-th token in the document. We take a non-linear projection of it in $\mathbb{R}^{s}$ with $W_s \in \mathbb{R}^{s \times k_{syn}}$. The attention weight $\alpha_t$ for the $t$-th token is calculated with respect to the context vector $u_s$ as follows.
$$\alpha_t = \frac{\exp\big(\tanh(W_s h_t^{syn})^{T} u_s\big)}{\sum_{t'} \exp\big(\tanh(W_s h_{t'}^{syn})^{T} u_s\big)}$$
Finally, the document representation for the AC-GCN is computed as shown below.
$$h^{AC\text{-}GCN} = \sum_{t} \alpha_t \, h_t^{syn}$$
This representation is fed to a softmax layer for the final classification.
The final probability distribution over years predicted by the AC-GCN is given below.
$$P_{AC\text{-}GCN}(y \mid D) = \mathrm{Softmax}\big(W \cdot h^{AC\text{-}GCN} + b\big)$$
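The word-level attentive pooling described above can be sketched as follows. This is a minimal pure-Python illustration under assumptions: the non-linear projection is taken to be `tanh`, and the names `attentive_pool`, `W_s` and `u_s` simply mirror the symbols in the text; the toy parameter values are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def attentive_pool(h_syn: Matrix, W_s: Matrix,
                   u_s: Vector) -> Tuple[Vector, Vector]:
    """Return (attention weights, document vector) for token embeddings h_syn.

    Each token is scored by tanh(W_s h_t) . u_s (tanh is an assumption),
    the scores are normalised with a softmax over tokens, and the document
    vector is the attention-weighted sum of the token embeddings.
    """
    scores = []
    for h_t in h_syn:
        proj = [math.tanh(sum(w * x for w, x in zip(row, h_t))) for row in W_s]
        scores.append(sum(p * u for p, u in zip(proj, u_s)))
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    dim = len(h_syn[0])
    doc = [sum(a * h_t[k] for a, h_t in zip(alpha, h_syn)) for k in range(dim)]
    return alpha, doc


# Toy usage: two tokens with 2-dim embeddings, attention space of size s = 1.
tokens = [[1.0, 0.0], [3.0, 2.0]]
alpha, doc = attentive_pool(tokens, W_s=[[1.0, 0.0]], u_s=[1.0])
# A zero context vector scores every token equally, so pooling becomes a mean.
uniform_alpha, mean_doc = attentive_pool(tokens, W_s=[[1.0, 0.0]], u_s=[0.0])
```

The second call shows the degenerate case: when the context vector carries no preference, attentive pooling reduces to plain average pooling.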

Ordered Event Model (OE-GCN)
The OE-GCN model is shown on the left side of Figure 1. Just like in AC-GCN, context and syntactic embeddings are also part of OE-GCN. The syntactic embedding is fed to the Attentive Graph Convolution Network (AT-GCN), where the graph is obtained from the time-event ordering algorithm CATENA (Mirza and Tonelli, 2016). We describe these components in detail below.

Temporal Graph
We use the same process as NeuralDater (Vashishth et al., 2018) for procuring the Temporal Graph from the document; the graph is produced by CATENA (Mirza and Tonelli, 2016). Let $E_T$ be the edge list of the Temporal Graph. Similar to Vashishth et al. (2018), we also add a reverse edge for each existing edge, and self loops for passing current node information, as explained in Section 3.1. The new edge list $E'_T$ is shown below.
$$E'_T = E_T \cup \{(j, i, l(i,j)^{-1}) \mid (i, j, l(i,j)) \in E_T\} \cup \{(i, i, \top) \mid i \in V\}$$
Here, $\top$ denotes the self-loop label. The reverse edges are added with reverse labels like AFTER$^{-1}$, BEFORE$^{-1}$, etc. Finally, we get 10 labels for our temporal graph and we denote the set of edge labels by $L$.

Attentive Graph Convolution (AT-GCN)
Since the temporal graph is automatically generated, it is likely to have incorrect edges. Ideally, we would like to minimize the influence of such noisy edges while computing the temporal embedding. In order to suppress the noisy edges in the Temporal Graph and detect important edges for reasoning, we use attentive graph convolution (Hamilton et al., 2017) over the Event-Time graph. The attention mechanism learns the aggregation function jointly during training. Here, the main objective is to calculate the attention over the neighbouring nodes with respect to the current node for a given label. The embedding of the current node is then updated by mixing the neighbouring node embeddings according to their attention scores. In this respect, we propose a label-specific attentive graph convolution over directed graphs.
Let us consider an edge in the temporal graph from node $i$ to node $j$ with type $l$, where $l \in L$ and $L$ is the label set. The label set $L$ can be divided broadly into two coarse labels, as done in Section 3.2.2: original edges and the reverse edges added to the graph. The attention weights are specific to only these two types of edges, in order to reduce the number of parameters and prevent overfitting. For illustration, if there exists an edge from node $i$ to $j$, then its coarse label $L(i,j)$ is the original type, i.e., the edge is an original event-time edge, while the added edge from $j$ to $i$ has a coarse label $L(j,i)$ of the reverse type.
First, we take a linear projection ($W^{atten}_{L(i,j)} \in \mathbb{R}^{F \times k_{syn}}$) of both the nodes in $\mathbb{R}^{F}$ in order to map both of them into the same direction-specific space. The concatenated vector $[W^{atten}_{L(i,j)} h_i ; W^{atten}_{L(i,j)} h_j]$ signifies the importance of the node $j$ w.r.t. node $i$. A non-linear transformation of this concatenation can be treated as the importance feature vector between $i$ and $j$.
Now, we compute the attention weight of node $j$ for node $i$ with respect to a direction-specific context vector $a_{L(i,j)} \in \mathbb{R}^{2F}$, as follows.
$$\alpha^{l(i,j)}_{ij} = \frac{\exp\Big(a_{L(i,j)}^{T}\, \rho\big([W^{atten}_{L(i,j)} h_i ; W^{atten}_{L(i,j)} h_j]\big)\Big)}{\sum_{k \in \mathcal{N}_{l(i,\cdot)}} \exp\Big(a_{L(i,k)}^{T}\, \rho\big([W^{atten}_{L(i,k)} h_i ; W^{atten}_{L(i,k)} h_k]\big)\Big)}$$
where $\rho$ denotes the non-linear transformation, and $\alpha^{l(i,j)}_{ij} = 0$ if nodes $i$ and $j$ are not connected through label $l$. $\mathcal{N}_{l(i,\cdot)}$ denotes the subset of the neighbourhood of node $i$ with label $l$ only. Please note that, although the linear transform weight ($W^{atten}_{L(i,j)} \in \mathbb{R}^{F \times k_{syn}}$) is specific to the coarse labels, for each finer label $l \in L$ we get these convex attention weights. Figure 2 illustrates the above description w.r.t. edge type BEFORE.
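The label-wise attention described above can be sketched in pure Python as follows. This is a sketch under assumptions, not the authors' code: the non-linear transformation is taken to be a LeakyReLU, the coarse label name `"original"` is hypothetical, and an edge `(i, j, l)` is read as "j is a neighbour of i under fine label l".

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Edge = Tuple[int, int, str]  # (node i, neighbour j, fine label l)


def matvec(W: Matrix, x: Vector) -> Vector:
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def leaky_relu(x: float, slope: float = 0.2) -> float:
    # Assumed non-linearity; the text only says "non-linear transformation".
    return x if x > 0.0 else slope * x


def attention_weights(i: int, h: List[Vector], edges: List[Edge],
                      coarse: Dict[str, str], W_att: Dict[str, Matrix],
                      a: Dict[str, Vector]) -> Dict[Tuple[str, int], float]:
    """Softmax-normalised weights of node i's neighbours, one softmax per
    fine label, using parameters shared per coarse (direction) label."""
    neighbours = defaultdict(list)
    for (src, dst, label) in edges:
        if src == i:
            neighbours[label].append(dst)
    alpha = {}
    for label, nbrs in neighbours.items():
        c = coarse[label]
        z_i = matvec(W_att[c], h[i])
        scores = []
        for j in nbrs:
            # Importance feature: non-linearity over [W h_i ; W h_j].
            feat = [leaky_relu(x) for x in z_i + matvec(W_att[c], h[j])]
            scores.append(sum(ak * fk for ak, fk in zip(a[c], feat)))
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        for j, e in zip(nbrs, exps):
            alpha[(label, j)] = e / total
    return alpha


# Toy usage: node 0 has two BEFORE neighbours and one AFTER neighbour.
h = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
edges = [(0, 1, "BEFORE"), (0, 2, "BEFORE"), (0, 3, "AFTER")]
coarse = {"BEFORE": "original", "AFTER": "original"}
W_att = {"original": [[1.0, 0.0], [0.0, 1.0]]}  # F = 2
a = {"original": [0.0, 0.0, 1.0, 0.0]}          # context vector in R^{2F}
alpha = attention_weights(0, h, edges, coarse, W_att, a)
```

Because normalisation happens within each fine label's neighbourhood, the weights for BEFORE sum to one independently of the weights for AFTER, which is what makes them convex per label.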
Figure 3: Variation of validation accuracy (%) with λ (for APW dataset). We observe that AC-GCN and OE-GCN are both important for the task as we get optimal λ = 0.52.
Finally, the feature aggregation is done according to the attention weights. Prior to that, another label-specific linear transformation is applied to perform the convolution operation. Then, the updated feature for node $i$ is calculated as follows.
$$h_i^{k+1} = f\Big(\sum_{j \in \mathcal{N}(i)} \alpha^{l(i,j)}_{ij} \big(W^{k}_{l(i,j)} h_j^{k} + b^{k}_{l(i,j)}\big)\Big)$$
where $\alpha_{ii} = 1$, $\mathcal{N}(i)$ is the neighbourhood of node $i$ under the updated edge list, and $\mathcal{N}_{l(i,\cdot)}$ denotes the subset of the neighbourhood of node $i$ with label $l$ only. Note that $\alpha^{l(i,j)}_{ij} = 0$ when $j \notin \mathcal{N}_{l(i,\cdot)}$. To illustrate, from Figure 2 we see that the weights $\alpha_1$ and $\alpha_2$ are calculated specific to label type BEFORE, and the neighbours connected through BEFORE are multiplied with $W_{before}$ prior to aggregation in the ReLU block. After applying the attentive graph convolution network, we consider only the representation of the Document Creation Time (DCT) node, $h_{DCT}$, as the document representation. $h_{DCT}$ is then passed through a fully connected layer prior to softmax. The prediction of the OE-GCN for the document $D$ is given as
$$P_{OE\text{-}GCN}(y \mid D) = \mathrm{Softmax}\big(W \cdot h_{DCT} + b\big).$$
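The attention-weighted aggregation step for a single node can be sketched as below. This is an illustrative pure-Python sketch: the function name `attentive_update`, the `"self"` label for the self loop, and the dictionary layout of the attention weights are assumptions made for the example, and the weights are supplied by hand here instead of being computed.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, x: Vector) -> Vector:
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attentive_update(i: int, h: List[Vector],
                     alpha: Dict[Tuple[str, int], float],
                     W: Dict[str, Matrix], b: Dict[str, Vector]) -> Vector:
    """Updated feature of node i: ReLU of the self-loop message (weight 1)
    plus attention-weighted, label-specific messages from its neighbours.

    alpha maps (fine label, neighbour j) to the attention weight of j.
    """
    out_dim = len(b["self"])
    self_msg = matvec(W["self"], h[i])
    acc = [self_msg[k] + b["self"][k] for k in range(out_dim)]  # alpha_ii = 1
    for (label, j), weight in alpha.items():
        msg = matvec(W[label], h[j])
        for k in range(out_dim):
            acc[k] += weight * (msg[k] + b[label][k])
    return [max(0.0, x) for x in acc]  # ReLU


# Toy usage: node 0 attends to two BEFORE neighbours with weights 0.25 / 0.75.
identity = [[1.0, 0.0], [0.0, 1.0]]
h = [[1.0, -5.0], [4.0, 0.0], [0.0, 4.0]]
alpha = {("BEFORE", 1): 0.25, ("BEFORE", 2): 0.75}
W = {"self": identity, "BEFORE": identity}
b = {"self": [0.0, 0.0], "BEFORE": [0.0, 0.0]}
h0_next = attentive_update(0, h, alpha, W, b)
```

In OE-GCN, only the updated vector of the DCT node would be kept after the final layer and passed on to the classifier.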

AD3: Attentive Deep Document Dater
In this section, we propose a unified model by mixing both AC-GCN and OE-GCN. Even on validation data, we see that the performance of the two models differs to a large extent. This significant difference (McNemar test p < 0.000001) motivated the unification. We take a convex combination of the output probabilities of the two models as shown below.
$$P_{joint}(y \mid D) = \lambda P_{AC\text{-}GCN}(y \mid D) + (1 - \lambda) P_{OE\text{-}GCN}(y \mid D)$$
The combination hyper-parameter $\lambda$ is tuned on the validation data. We obtain the value of $\lambda$ to be 0.52 (Figure 3) and 0.54 for APW and NYT datasets, respectively. This indicates that the two models capture significantly different aspects of documents, resulting in a substantial improvement in performance when combined.
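The convex combination and the tuning of λ on validation data can be sketched as a simple grid search. This is a minimal sketch under assumptions: the search procedure, the grid, and the toy probabilities are hypothetical, since the text only states that λ is tuned on the validation data.

```python
from typing import List, Sequence

Probs = List[float]


def combine(p_ac: Probs, p_oe: Probs, lam: float) -> Probs:
    """Convex combination: lam * P_AC-GCN + (1 - lam) * P_OE-GCN."""
    return [lam * a + (1.0 - lam) * o for a, o in zip(p_ac, p_oe)]


def accuracy(probs_ac: List[Probs], probs_oe: List[Probs],
             gold: List[int], lam: float) -> float:
    """Fraction of documents whose arg-max joint prediction matches gold."""
    correct = 0
    for p_ac, p_oe, y in zip(probs_ac, probs_oe, gold):
        joint = combine(p_ac, p_oe, lam)
        if joint.index(max(joint)) == y:
            correct += 1
    return correct / len(gold)


def tune_lambda(probs_ac: List[Probs], probs_oe: List[Probs],
                gold: List[int], grid: Sequence[float]) -> float:
    """Pick the grid value with the best validation accuracy
    (the first such value if several tie)."""
    return max(grid, key=lambda lam: accuracy(probs_ac, probs_oe, gold, lam))


# Toy validation set: each model alone gets only one of the two documents.
val_ac = [[0.6, 0.4, 0.0], [0.5, 0.4, 0.1]]
val_oe = [[0.3, 0.3, 0.4], [0.1, 0.8, 0.1]]
val_gold = [0, 1]
best_lam = tune_lambda(val_ac, val_oe, val_gold, [0.0, 0.25, 0.5, 0.75, 1.0])
```

In this toy case either model on its own (λ = 0 or λ = 1) classifies only half of the documents correctly, while an interior λ classifies both, mirroring the argument that the two models capture complementary signals.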

Experimental Setup
Dataset: Experiments are carried out on the Associated Press Worldstream (APW) and New York Times (NYT) sections of the Gigaword corpus (Parker et al., 2011). We have used the same 8:1:1 split as Vashishth et al. (2018) for all the models. For quantitative details please refer to Table 1.
Evaluation Criteria: In accordance with prior work (Chambers, 2012;Kotsakos et al., 2014;Vashishth et al., 2018) the final task is to predict the publication year of the document. We give a brief description of the baselines below.
Baseline Methods: • MaxEnt-Joint (Chambers, 2012): This method engineers several hand-crafted temporally influenced features to classify the document using MaxEnt Classifier.
• BurstySimDater (Kotsakos et al., 2014): This is a purely statistical method which uses lexical similarity and term burstiness (Lappas et al., 2009) for dating documents in arbitrary length time frame. For our experiments, we used a time frame length of 1 year.
• NeuralDater (Vashishth et al., 2018): This is the first deep neural network based approach for the document dating task. Details are provided in Section 3.2.
Example document for the attention visualizations: Israel's consumer price index increased by 1.2 percent in December, bringing the overall inflation rate for 1995 to 8.1 percent, well within the government's target rate for the year, officials said Friday. Israel radio said that it was the lowest annual inflation rate in twenty years.
Hyperparameters: We use 300-dimensional GloVe embeddings and a 128-dimensional hidden state for both the GCNs and the Bi-LSTM, with 0.8 dropout. We use Adam (Kingma and Ba, 2014) with a learning rate of 0.001 for training. For OE-GCN we use 2 layers of AT-GCN. One layer of S-GCN is used for both the models.

Performance Analysis
In this section, we compare the effectiveness of our method with that of prior work. The deep network based NeuralDater model (Vashishth et al., 2018) outperforms previous feature-engineered (Chambers, 2012) and statistical methods (Kotsakos et al., 2014) by a large margin. We observe a similar trend in our case. Compared to the state-of-the-art model NeuralDater, we gain, on average, a 3.7% boost in accuracy on both the datasets (Table 2).
Among individual models, OE-GCN performs at par with NeuralDater, while AC-GCN outperforms it. The empirical results imply that AC-GCN by itself is effective for this task. The relatively worse performance of OE-GCN can be attributed to the fact that it only focuses on the Event-Time information and leaves out most of the contextual information. However, it captures various different (p < 0.000001, McNemar's test, 2-tailed) aspects of the document for classification, which motivated us to propose an ensemble of the two models. This explains the significant boost in performance of AD3 over NeuralDater as well as the individual models. It is worth mentioning that although AC-GCN and OE-GCN do not provide significant boosts in accuracy, their predictions have considerably lower mean-absolute-deviation as shown in Figure 4.
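The two quantities mentioned above, McNemar's test between two models and the mean absolute deviation of predicted years, can be computed as in the following sketch. It rests on assumptions: the text does not say which variant of McNemar's test was used, so the exact two-sided binomial form is shown here, and the function names and toy predictions are hypothetical.

```python
import math
from typing import List


def mcnemar_exact(gold: List[int], pred_a: List[int],
                  pred_b: List[int]) -> float:
    """Two-sided exact McNemar p-value from the discordant pairs:
    b = A right and B wrong, c = A wrong and B right."""
    b = sum(1 for y, pa, pb in zip(gold, pred_a, pred_b)
            if pa == y and pb != y)
    c = sum(1 for y, pa, pb in zip(gold, pred_a, pred_b)
            if pa != y and pb == y)
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree in correctness
    tail = sum(math.comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def mean_absolute_deviation(gold: List[int], pred: List[int]) -> float:
    """Average distance in years between predicted and true creation year."""
    return sum(abs(y - p) for y, p in zip(gold, pred)) / len(gold)


# Toy usage: model A is always right, model B is always off by one year.
gold_years = [2000] * 10
pred_a = [2000] * 10
pred_b = [2001] * 10
p_value = mcnemar_exact(gold_years, pred_a, pred_b)
mad_b = mean_absolute_deviation(gold_years, pred_b)
```

Accuracy treats a one-year miss and a ten-year miss identically, whereas the mean absolute deviation distinguishes them, which is why the two metrics can tell different stories about the same model.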
We also concatenated the DCT embedding provided by OE-GCN with the document embedding provided by AC-GCN and trained the result in an end-to-end joint fashion like NeuralDater. We see that even with a similar training method, this Attentive NeuralDater model performs, on average, 1.6% better in terms of accuracy, once again demonstrating the efficacy of attention-based models over their non-attentive counterparts.

Effectiveness of Attention
Attentive Graph Convolution (Section 4.2.2) proves to be effective for OE-GCN, giving a 2% accuracy improvement over the non-attentive T-GCN of NeuralDater (Table 3). Similarly, the efficacy of word-level attention is also evident from Table 3. We have also analyzed our models by visualizing attention over words and attention over graph nodes. Figure 5 shows that AC-GCN focuses on temporally informative words such as "said" (for tense) or time mentions like "1995", alongside important contextual words like "inflation", "Israel", etc. For OE-GCN, from Figure 6 we observe that "DCT" and the time mention "1995" receive the highest attention. Attention between "DCT" and other event verbs indicating past tense is quite prominent, which helps the model to infer 1996 (which is correct) as the most likely time-stamp of the document. These analyses provide us with a good justification for the performance of our attentive models.

Discussion
Apart from empirical improvements over previous models, we also perform a qualitative analysis of the individual models. Figure 7 shows that the performance of AC-GCN improves with the length of documents, thus indicating that richer context leads to better model prediction. Figure 8 shows how the performance of OE-GCN improves with the number of event-time mentions in the document, thus further reinforcing our claim that more temporal information improves model performance. Vashishth et al. (2018) reported that their model got confused by the presence of multiple misleading time mentions. AD3 overcomes this limitation using attentive graph convolution, which successfully filters out noisy time mentions as is evident

Conclusion
We propose AD3, an ensemble model which exploits both syntactic and temporal information in a document explicitly to predict its creation time (DCT). To the best of our knowledge, this is the first application of attention-based deep models for dating documents. Our experimental results demonstrate the effectiveness of our model over all previous models. We also visualize the attention weights to show that the model is able to choose what is important for the task and filter out noise inherent in language. As part of future work, we would like to incorporate external knowledge as side information for improved time-stamping of documents.