Citation Analysis with Neural Attention Models

Automated citation analysis (ACA) can be important for many applications including author ranking and literature-based information retrieval, extraction, summarization and question answering. In this study, we developed a new compositional attention network (CAN) model that integrates local and global attention representations with a hierarchical attention mechanism. Trained on a new benchmark corpus that we built, the CAN model performs consistently well on both the citation classification and citation sentiment analysis tasks in our evaluation.


Introduction
Citations are relations between citing and cited articles and are an important element of the scientific literature. Authors cite an article for different reasons. Identifying the purpose of a citation has important applications, including faceted navigation, citation-based information retrieval, impact factor assessment and summarization of scientific papers (Hearst and Stoica, 2009).
ACA refers to the tasks of citation function classification and citation sentiment analysis. Pioneered by Garfield and others (1965), a large body of citation-related studies have been carried out to develop categorization schemes for citation function analysis. However, most of the studies are limited to specific domains. The classification schemes are typically complex, containing multiple overlapping categories whose number ranges from three to 35 (Bornmann and Daniel, 2008). In contrast, the success of ACA depends on a small but well-defined set of citation categories. Nanba and Okumura (1999) developed a semi-automated citation analysis approach based on a 3-category scheme derived from the 15 categories of Garfield and others (1965). Similarly, Pham and Hoffmann (2003) developed rule-based approaches (cue phrases) to classify citations into one of four classes (basis, support, limitation and comparison). Teufel et al. (2009) addressed citation function classification and sentiment analysis jointly with a hierarchical scheme whose top nodes capture sentiment and whose leaf nodes capture function classes. Agarwal et al. (2010) developed a scheme of eight non-overlapping categories for citation function classification in the biomedical literature. This scheme simplifies the hierarchical overlapping categories of Yu et al. (2009). Recently, a decision-tree based scheme was introduced to facilitate citation-context based intelligent systems (Mandya, 2012). The citation function classes organic and perfunctory, proposed by Moravcsik and Murugesan (1975), were adapted for a facet-based classification scheme (Jochim and Schütze, 2012).
Machine learning (ML) approaches to ACA have mainly adapted statistical classifiers including support vector machines (SVMs), logistic regression and Naïve Bayes classifiers (Athar, 2011; Athar and Teufel, 2012; Sula and Miller, 2014). The extracted feature sets include n-grams, part-of-speech tags, word stems, cue phrases, sentence dependency components, named entity mentions, and word- and sentence-location based features. Despite the rich linguistically motivated feature sets, ACA remains a challenge, performing significantly worse than humans. One of the reasons for this could be the lack of a large training corpus.
[Table 1 (sentiment facet): Negational – citations that discuss or dispute the correctness and/or weakness of the cited work; Confirmative – citations that confirm, support or make use of outcomes of the cited work; Neutral – citations that are neither negational nor confirmative; Don't know – to be chosen if the annotator does not know which category to select.]
In this study, we report the development of a simplified citation classification schema, a subsequent large annotated corpus, and a deep learning framework for end-to-end ACA.

Citation Scheme
We developed a simple citation scheme, as shown in Table 1. Following Jochim and Schütze (2012), we defined the function classification and sentiment classification schemes as separate facets. For function classification, we followed the widely adopted IMRaD rhetorical categories in the scientific domain (Day and Gastel, 2012; Sollaci and Pereira, 2004) and introduced background, method and results/findings types. We defined the standard negational, confirmative and neutral categories for sentiment classification. We added a don't know category to both function classification and sentiment classification since previous work showed that such a category improves annotation quality (van Rooyen et al., 2015).

Machine Learning Approaches
We develop deep neural models and compare them with a baseline model for automated citation analysis.

Long Short-Term Memory
Long short-term memory (LSTM) based models are a variant of recurrent neural networks introduced to solve the vanishing gradient problem (Hochreiter, 1998). An LSTM can model long-term dependencies of a word sequence (or context) and has achieved notable success in a variety of NLP tasks such as machine translation (Sutskever et al., 2014), speech recognition (Graves et al., 2013) and textual entailment recognition (Bowman et al., 2015). In the context of citation analysis, the LSTM reads the citation context to construct a dense vector representation of the citation for classification.
Let x_t and h_t be the input and output at time step t. Given a sequence of input tokens x_1, ..., x_l (l is the number of tokens in the input text), an LSTM with hidden size k computes a sequence of output states h_1, ..., h_l as

i_t = σ(W_1 x_t + W_2 h_{t-1} + b_1)    (1)
f_t = σ(W_3 x_t + W_4 h_{t-1} + b_2)    (2)
o_t = σ(W_5 x_t + W_6 h_{t-1} + b_3)    (3)
c̃_t = tanh(W_7 x_t + W_8 h_{t-1} + b_4)    (4)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t    (5)
h_t = o_t ⊙ tanh(c_t)    (6)

where W_1, ..., W_8 ∈ R^{k×k} and b_1, ..., b_4 ∈ R^k are the training parameters, and σ and ⊙ denote the element-wise sigmoid function and element-wise vector multiplication, respectively. The memory cell c_t and hidden state h_t are updated by reading one word token x_t at a time. The memory cell c_t learns to remember the contextual information that is relevant to the task. This information is provided to the hidden state h_t through a gating mechanism, and the last hidden state h_l summarizes all the relevant information. i_t, f_t and o_t are called gates. Their values are defined by a non-linear combination of the previous hidden state h_{t-1} and the current input token x_t, and range from zero to one. The input gate i_t controls how much information flows into the memory cell, while the forget gate f_t decides what information is erased from the memory cell. The output gate o_t finally produces the hidden state for the current input token. The final representation vector h_l is subsequently given to a multi-layer perceptron (MLP) with a softmax output layer for classification.
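Equations (1)-(6) translate directly into a few lines of NumPy. The following is an illustrative re-implementation, not the code used in our experiments; the parameter layout (a list of eight matrices and a list of four bias vectors) simply mirrors the notation above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM update (Equations 1-6). W holds the eight k-by-k
    weight matrices W_1..W_8 and b the four bias vectors b_1..b_4."""
    W1, W2, W3, W4, W5, W6, W7, W8 = W
    b1, b2, b3, b4 = b
    i_t = sigmoid(W1 @ x_t + W2 @ h_prev + b1)      # input gate
    f_t = sigmoid(W3 @ x_t + W4 @ h_prev + b2)      # forget gate
    o_t = sigmoid(W5 @ x_t + W6 @ h_prev + b3)      # output gate
    c_tilde = np.tanh(W7 @ x_t + W8 @ h_prev + b4)  # candidate memory
    c_t = f_t * c_prev + i_t * c_tilde              # new memory cell
    h_t = o_t * np.tanh(c_t)                        # new hidden state
    return h_t, c_t

def lstm_encode(xs, W, b, k):
    """Read tokens x_1..x_l and return the output states h_1..h_l."""
    h, c = np.zeros(k), np.zeros(k)
    hs = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, W, b)
        hs.append(h)
    return np.stack(hs)
```

The last row of the returned matrix is h_l, the summary vector handed to the MLP classifier.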
Bi-directional LSTMs read the input sequence in both the forward and backward directions and have been shown to further improve performance on NLP tasks (Jagannatha and Yu, 2016). We implemented bi-directional LSTM models for citation classification; here we concatenate the last vector representations of the two LSTMs for the subsequent layers.
Studies have shown that LSTM-based models do not memorize long sequences well (Bahdanau et al., 2015). To overcome this limitation, we introduce the attention models.

Global Attention
Attention mechanisms allow NN models to selectively focus on the most task-relevant parts of the input sequence. As a result, rather than treating every input vector equally, attention models assign weights to the vectors. Since attention models can bring a past, and possibly distant, input vector to the current time step through a blending operation, they also mitigate the information-flow bottleneck in RNNs.
We extend the LSTM-based models with a global attention mechanism. This type of attention mechanism is implemented by a neural network that takes a sequence of vectors (usually the output vectors of LSTMs) and selectively blends them into a single attention vector. We adopt the attention architecture proposed by Hermann et al. (2015).
Concretely, the global attention considers all the output vectors h_1, ..., h_l to construct an attention-weighted representation of the input sequence. Let S ∈ R^{k×l} be a matrix of the LSTM output vectors h_1, ..., h_l and o_l ∈ R^l be a vector of ones. An attention weight vector α, an attention representation r and the final representation h are defined as

M = tanh(W_s S + W_h h_l ⊗ o_l)    (7)
α = softmax(w^T M)    (8)
r = S α^T    (9)
h = tanh(W_a r + W_x h_l)    (10)

where W_a, W_h, W_s, W_x ∈ R^{k×k} are learnable matrices and w^T is the transpose of the learnable vector w ∈ R^k. With the outer product W_h h_l ⊗ o_l we repeat the transformed vector of h_l l times and then combine the resulting matrix with the projected output vectors.
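The global attention can be sketched in NumPy as follows. This is a minimal illustrative version, with matrix names mirroring the notation above; it is not the code used in our experiments.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def global_attention(S, h_l, Wa, Wh, Ws, Wx, w):
    """Global attention over LSTM outputs (Equations 7-10).
    S is the k-by-l matrix of output vectors; h_l the last state."""
    k, l = S.shape
    # W_h h_l ⊗ o_l repeats the transformed last state l times
    M = np.tanh(Ws @ S + np.outer(Wh @ h_l, np.ones(l)))
    alpha = softmax(w @ M)          # attention weights over positions
    r = S @ alpha                   # blended attention representation
    h = np.tanh(Wa @ r + Wx @ h_l)  # final representation
    return h, alpha, r
```

The weight vector alpha sums to one, so r is a convex combination of the LSTM output vectors.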

Compositional Attention Network
The global attention introduced in the previous section does not incorporate subsequence information, as it considers the whole input as a single component. However, natural language text is composed of a set of semantic units: a document can be broken down into paragraphs, the paragraphs into sentences, and the sentences into words. Inspired by this, we propose our CAN model. The proposed attention is also hierarchical, in the sense that it consists of different attention layers. CAN attends locally over the input subsequences and globally over the whole input, and selectively composes these two types of attention representations with a second-layer attention to construct a higher-level representation. We use the standard neural attention network (Equations (7)-(10)) from the previous section as the main building block of our CAN.
Let R ∈ R^{k×z} be a matrix of the representations r_1, ..., r_z (z is the number of input subsequences, i.e. the number of sentences in the input) learned by the local attentions, and let r be the output of the global attention. Stacking the two into R' = [r_1, ..., r_z, r] ∈ R^{k×(z+1)} (Equation (11)), we then obtain the final attention representation r' and the final output h as follows:

M' = tanh(W_s R' + W_h h_l ⊗ o_{z+1})    (12)
α' = softmax(w^T M')    (13)
r' = R' α'^T    (14)

The W matrices and the w vectors of this model can be tied together. When tied, the number of parameters is equal to that of the global attention model. Therefore, this attention network introduces no additional parametric complexity compared with the classic global attention model. Figure 1 depicts the overall structure of this model (Equations (1)-(6), (7)-(9) and (12)-(14)). The input consists of three subsequences [x_1, x_2, x_3], [x_4, x_5, x_6] and [x_7, x_8, x_9]. The local attention vectors r_1, r_2 and r_3 are constructed by attending over the LSTM outputs for each subsequence. Similarly, the global attention vector r is obtained by attending over the whole output sequence h_1, ..., h_9. In the second-layer attention, these representations are composed into the higher-level representation r'. The final representation h can then be obtained according to Equation (10), with r' in place of r.
The intuition behind our CAN is to attentively compose the words within a sentence to construct a local attention vector for each sentence; these local attention vectors are then further composed in a second-layer attention to learn a representation of the whole document. We tie the parameters of the local, global and second-layer attentions so that CAN is forced to learn to compose both the word and sentence representations attentively.
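The composition can be sketched as follows. This is a minimal illustrative version under the tied-parameter setting: the local, global and second-layer attentions all share W_s, W_h and w, and each is conditioned on the last LSTM state h_l.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(S, query, Ws, Wh, w):
    """Standard attention block: blend the columns of S, conditioned
    on a query vector (here, the last LSTM state h_l)."""
    M = np.tanh(Ws @ S + np.outer(Wh @ query, np.ones(S.shape[1])))
    alpha = softmax(w @ M)
    return S @ alpha

def can_representation(H, sentence_spans, h_l, Ws, Wh, Wa, Wx, w):
    """Compositional attention: local attention per sentence, global
    attention over all tokens, then a tied second-layer attention."""
    # local attention vectors r_1..r_z, one per subsequence
    locals_ = [attend(H[:, s:e], h_l, Ws, Wh, w) for s, e in sentence_spans]
    # global attention vector r over the whole output sequence
    r_global = attend(H, h_l, Ws, Wh, w)
    # second-layer attention over [r_1, ..., r_z, r] with tied weights
    R = np.column_stack(locals_ + [r_global])
    r_prime = attend(R, h_l, Ws, Wh, w)
    return np.tanh(Wa @ r_prime + Wx @ h_l)  # final representation h
```

Because `attend` is reused at every level with the same parameters, the sketch introduces no parameters beyond those of the global attention.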
We also build the bi-directional variation of these models by feeding the concatenated outputs of the forward and backward LSTMs. Due to the concatenated outputs, the size of the W matrices and w vector become 2k × 2k and 2k respectively, increasing the number of parameters to be trained.

Baseline Classifier
We implemented a baseline model, which extracts TF-IDF statistics of n-grams (1-, 2- and 3-grams) from each citation as the feature set and uses a support vector machine (SVM) classifier with a linear kernel. For the SVM model, we performed a grid search over its hyper-parameters (including the regularization parameter C), using the development set for evaluation. Once the best parameters were found, the final SVM model was trained on both the training and development sets and tested on the test set.
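The n-gram TF-IDF features can be sketched as follows. This is a hypothetical minimal re-implementation for illustration only; in practice a standard library vectorizer, whose smoothing and normalization conventions differ, would be used.

```python
import math
from collections import Counter

def ngrams(tokens, n_max=3):
    """All 1-, 2- and 3-grams of a token list."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf_features(docs):
    """TF-IDF vectors (as sparse dicts) over n-gram features,
    using raw term frequency and idf = log(N / df)."""
    grams = [Counter(ngrams(d.lower().split())) for d in docs]
    n_docs = len(docs)
    df = Counter(g for c in grams for g in c)
    return [{g: tf * math.log(n_docs / df[g]) for g, tf in c.items()}
            for c in grams]
```

An n-gram that occurs in every citation receives weight zero, so only discriminative phrases contribute to the SVM's feature space.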

Data Collection, AMT Annotation and Gold Standard Datasets
In order to increase the generalizability of the data, we maximized the total number of selected articles. Specifically, we selected a total of 5,000 citation sentences from 2,500 randomly selected PubMed Central articles (two randomly selected citation sentences from each article). We then developed annotation guidelines and deployed the annotation task on a crowdsourcing platform, Amazon Mechanical Turk (AMT).
Each citation was labeled by five annotators. We provided the AMT annotators with the previous and next sentences of the citation sentence to enrich the context. We designed a quality control (attention-check questions) and ended an AMT session if the worker failed to answer the attention-check questions correctly. To evaluate the quality of the annotation, we asked a domain expert (an MD) to independently annotate 100 citation sentences randomly selected from our corpus and used this as the gold standard to evaluate inter-annotator agreement with the AMT workers.
We built two gold standard datasets for training and evaluation. The first dataset is composed of labels agreed on by at least three of the five annotators (three-label matching). This resulted in 3,422 citations for the function analysis and 3,624 citations for the sentiment analysis. The second dataset is more relaxed: we selected the label given by a plurality of the five annotators. In this setting, we included labels that would fail inclusion under the first approach. For example, even if only two annotators agreed on a label, we included it in our gold standard dataset when it represented a clear plurality vote (the remaining three labels all differed). As a result, this dataset included 4,426 citations for the function classification and 4,423 citations for the sentiment classification.
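The two gold-label selection criteria can be sketched as follows (label names are illustrative):

```python
from collections import Counter

def aggregate_label(labels, min_agree=3):
    """Gold-label selection from five crowd labels.
    Strict: keep a label only if at least `min_agree` annotators agree.
    Relaxed: keep any unique plurality label (e.g. a 2-1-1-1 split)."""
    counts = Counter(labels).most_common()
    top_label, top_count = counts[0]
    strict = top_label if top_count >= min_agree else None
    # unique plurality: no other label ties the top count
    is_unique = len(counts) == 1 or counts[1][1] < top_count
    relaxed = top_label if is_unique else None
    return strict, relaxed
```

A 2-2-1 split yields no gold label under either criterion, which is why the relaxed dataset is still smaller than the full 5,000 citations.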

CAN for Document-level Sentiment Analysis
In order to test the robustness of the CAN model, we also evaluated it for sentiment analysis on two publicly available large-scale datasets: the IMDB movie review and Yelp restaurant review datasets. In particular, we used the datasets pre-split by Tang et al. (2015). Each document in the datasets is associated with human ratings, and we use these ratings as gold labels for sentiment classification. Table 2 reports the statistics for the datasets.

Experimental Settings
During the experiments, citations labeled with don't know were removed from the training data. Each dataset was split into 200/200/rest for the dev/test/train sets with stratified sampling, which preserves the percentage of citations for each class in each set. We experimented with using only the citation sentence as the input example, as well as with the expansion to both the previous and the next sentences. We used ADAM (Kingma and Ba, 2014) to optimize the neural models. The size of the LSTM hidden units was set to 200. All neural models were regularized with 20% input and 30% output dropout and an l2 regularizer with strength 1e-3. A word2vec model (Mikolov et al., 2013) trained on a collection of PubMed Central documents was used to transform the citation context into word vectors of size 200 (Munkhdalai et al., 2015). The parameters of CAN are tied and thus equal in number to those of the global attention. The neural models were trained only on the training set, while the SVM model was built on both the training and development sets. We used the development set to evaluate the neural models after each epoch in order to choose the best model. Each model was given 30 epochs, which was empirically found to be enough for the models to converge to an optimum. The final performance of each method was reported on the test set. The average training time for the neural network models was approximately three hours on a single GPU (GeForce GTX 980).

Results

Table 3 lists the detailed statistics of our AMT annotated corpus. The overall agreement between the expert's annotation and the AMT annotation was 63.1% and 64.7% for the function and sentiment analysis tasks, respectively. For the function classification, a majority of citations were annotated as results and findings. As shown in Table 3, for the sentiment classification, 4.8% of the citations were labeled as negational, while 75% and 19.8% were confirmative and neutral, respectively.
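The stratified dev/test/train split described in the experimental settings can be sketched as follows (the set sizes and random seed are illustrative):

```python
import random
from collections import defaultdict

def stratified_split(labels, n_dev=200, n_test=200, seed=13):
    """Stratified split into dev/test/train index lists: each held-out
    set preserves the class proportions of the full dataset (sizes are
    approximate due to per-class rounding)."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    rng = random.Random(seed)
    dev, test, train = [], [], []
    total = len(labels)
    for y, idxs in by_class.items():
        rng.shuffle(idxs)
        k_dev = round(n_dev * len(idxs) / total)
        k_test = round(n_test * len(idxs) / total)
        dev += idxs[:k_dev]
        test += idxs[k_dev:k_dev + k_test]
        train += idxs[k_dev + k_test:]
    return dev, test, train
```

With the highly unbalanced sentiment labels, stratification ensures the rare negational class appears in the 200-citation dev and test sets in roughly its corpus-level proportion.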
This shows that citations are biased towards positive statements, resulting in a highly unbalanced class distribution.

Table 4 lists the results of the function classification using only citation sentences as input to the models. The SVM baseline obtains the lowest training error. As the models become more complex, performance increases. However, some models, such as the Bi-LSTM based global attention model, tend to overfit the training data. The unidirectional LSTM with global attention achieves the best F1-score in both settings when only the citation sentence is input. Table 5 shows the performance when the inputs are represented by the larger context of the previous, citation and next sentences. We treated each sentence related to a citation as a subsequence and applied our CAN. Here the bi-directional LSTM with CAN is the clear winner in terms of test performance. This model achieves a 75.86% F1-score, improving on the previous model's results by nearly 7% in the three-label matching setup. Unlike the compositional models, the global attention models' performance decreased when additional context was given in the input. Furthermore, the models tend to achieve a higher F1-score in the three-label matching setup because this setting has an extra annotation-noise filter in the selection of the gold labels.

Table 6 shows the evaluation results for sentiment classification when the citation sentences are the input. The LSTM based global attention models obtain the best F1-scores on the test sets. In Table 7, we report the results for the wider context input (the citation sentence plus its left and right sentences). Here the CAN models perform the best. Similar to the function classification results, the extra context information improves performance if the model is able to exploit it properly. Despite having the same number of training parameters, our compositional attention mechanism significantly improved the performance. We also analyzed whether input lengths influence the performance.

CAN for Document-level Sentiment Analysis
We split the Yelp dataset into train/dev/test sets so that the models see only documents of up to 15 sentences during training but must classify much longer documents of up to 30 sentences at test time. Figure 2 plots the test performance over different lengths. The two attention models perform identically on seen lengths, except that the global attention model obtains a performance gain on shorter documents of up to five sentences. However, for unseen lengths (the right side of the green line), the performance of the compositional attention network remains almost constant, while that of the global attention model generally starts to decrease. This demonstrates the compositional ability of our neural network.
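The per-length analysis behind Figure 2 can be sketched as a simple accuracy-by-length bucketing; this is illustrative (the bucket width is an assumption, and the exact metric used for the plot may differ):

```python
from collections import defaultdict

def accuracy_by_length(n_sentences, gold, pred, bucket=5):
    """Group test documents into length buckets (1-5, 6-10, ...)
    and compute accuracy per bucket."""
    hits, totals = defaultdict(int), defaultdict(int)
    for n, g, p in zip(n_sentences, gold, pred):
        b = (n - 1) // bucket       # bucket index for this length
        totals[b] += 1
        hits[b] += int(g == p)
    return {b: hits[b] / totals[b] for b in totals}
```

Comparing the per-bucket curves of two models on buckets beyond the maximum training length is exactly the comparison drawn to the right of the green line in Figure 2.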

Conclusion
We have developed a generic and simple categorization scheme and a new benchmark corpus for automated citation analysis. We presented several neural attention networks for the task and evaluated them using the benchmark corpus. Among these attention mechanisms, our original model, which we call the compositional attention network, performed consistently well on both the citation function and citation sentiment classification tasks by attentively composing the additional contextual information provided. In an extended experiment, we also showed that the compositional attention network generalizes better to examples of unseen, longer lengths thanks to its compositional operation.