Global Normalization of Convolutional Neural Networks for Joint Entity and Relation Classification

We introduce globally normalized convolutional neural networks for joint entity classification and relation extraction. In particular, we propose a way to utilize a linear-chain conditional random field output layer for predicting entity types and relations between entities at the same time. Our experiments show that global normalization outperforms a locally normalized softmax layer on a benchmark dataset.


Introduction
Named entity classification (EC) and relation extraction (RE) are important topics in natural language processing. They are relevant, e.g., for populating knowledge bases or answering questions from text, such as "Where does X live?" Most approaches consider the two tasks independent of each other or treat them as a sequential pipeline, first applying a named entity recognition tool and then classifying relations between entity pairs. However, named entity types and relations are often mutually dependent. If the types of the entities are known, the search space of possible relations between them can be reduced, and vice versa. This can help, for example, to resolve ambiguities, such as in the case of "Mercedes", which can be a person, an organization or a location. Knowing that, in the given context, it is the second argument of the relation "live in" helps to conclude that it is a location. Therefore, we propose a single neural network (NN) for both tasks. In contrast to joint training and multitask learning, which calculate taskwise costs, we propose to learn a joint classification layer which is globally normalized on the outputs of both tasks. In particular, we train the NN parameters based on the loss of a linear-chain conditional random field (CRF) (Lafferty et al., 2001). CRF layers for NNs have been introduced for token-labeling tasks like named entity recognition (NER) or part-of-speech tagging (Collobert et al., 2011; Lample et al., 2016; Andor et al., 2016). Instead of labeling each input token as in previous work, we model the joint entity and relation classification problem as a sequence of length three for the CRF layer. In particular, we identify the types of two candidate entities (words or short phrases) given a sentence (we call this entity classification to distinguish it from the token-labeling task NER) as well as the relation between them.
To the best of our knowledge, this architecture for combining entity and relation classification in a single neural network is novel. Figure 1 shows an example of how we model the task: For each sentence, candidate entities are identified. Every possible combination of candidate entities (query entity pair) then forms the input to our model, which predicts the classes for the two query entities as well as for the relation between them.
To sum up, our contributions are as follows: We introduce globally normalized convolutional neural networks for a sentence classification task. In particular, we present an architecture which allows us to model joint entity and relation classification with a single neural network and classify entities and relations at the same time, normalizing their scores globally. Our experiments confirm that a CNN with a CRF output layer outperforms a CNN with locally normalized softmax layers. Our source code is available at http://cistern.cis.lmu.de.

Related Work
Some work on joint entity and relation classification uses distant supervision to build its own datasets, e.g., (Yao et al., 2010; Yaghoobzadeh et al., 2016). Other studies, which are described in more detail in the following, use the "entity and relation recognition" (ERR) dataset from (Roth and Yih, 2004, 2007), as we do in this paper. Roth and Yih (2004) develop constraints and use linear programming to globally normalize entity types and relations. Giuliano et al. (2007) [...]
Several studies propose different variants of non-neural CRF models for information extraction tasks but model them as token-labeling problems (Sutton and McCallum, 2006; Sarawagi et al., 2004; Culotta et al., 2006; Zhu et al., 2005; Peng and McCallum, 2006). In contrast, we propose a simpler linear-chain CRF model which directly connects entity and relation classes instead of assigning a label to each token of the input sequence. This is more similar to the factor graph by Yao et al. (2010) but computationally simpler. Xu and Sarikaya (2013) also apply a CRF layer on top of continuous representations obtained by a CNN. However, they use it for a token-labeling task (semantic slot filling), while we apply the model to a sentence classification task, motivated by the fact that a CNN creates single representations for whole phrases or sentences.

Model
Input. Given an input sentence and two query entities, our model identifies the types of the entities and the relation between them; see Figure 1. The input tokens are represented by word embeddings trained on Wikipedia with word2vec (Mikolov et al., 2013). For identifying the class of an entity $e_k$, the model uses the context to its left, the words constituting $e_k$ and the context to its right. For classifying the relation between two entities $e_i$ and $e_j$, the sentence is split into six parts: left of $e_i$, $e_i$, right of $e_i$, left of $e_j$, $e_j$, right of $e_j$. For the example sentence in Figure 1 and the entity pair ("Anderson", "chief"), the context split is: [] [Anderson] [, 41 , was the chief Middle ...] [Anderson , 41 , was the] [chief] [Middle East correspondent for ...]
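The six-part split used for relation classification can be illustrated with a short sketch. The function name `split_context` and the representation of entities as token offsets (end exclusive) are assumptions made for this illustration; they are not taken from the released code.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # hypothetical (start, end) token offsets, end exclusive


def split_context(tokens: List[str], e_i: Span, e_j: Span) -> List[List[str]]:
    """Return [left of e_i, e_i, right of e_i, left of e_j, e_j, right of e_j]."""
    parts = []
    for start, end in (e_i, e_j):
        parts.append(tokens[:start])     # context to the left of the entity
        parts.append(tokens[start:end])  # tokens of the entity itself
        parts.append(tokens[end:])       # context to the right of the entity
    return parts


sentence = ("Anderson , 41 , was the chief Middle East correspondent "
            "for The Associated Press when he was kidnapped in Beirut").split()

# Query entity pair ("Anderson", "chief") from the running example.
parts = split_context(sentence, e_i=(0, 1), e_j=(6, 7))
for part in parts:
    print(part)
```

For entity classification alone, only the three parts belonging to the respective entity would be used.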
Sentence Representation. For representing the different parts of the input sentence, we use convolutional neural networks (CNNs). CNNs are suitable for RE since a relation is usually expressed by the semantics of a whole phrase or sentence. Moreover, they have proven effective for RE in previous work (Vu et al., 2016). We train one CNN layer for convolving the entities and one for the contexts. Using two CNN layers instead of one gives our model more flexibility. Since entities are usually shorter than contexts, the filter width for entities can be smaller than for contexts. Furthermore, this architecture simplifies changing the entity representation from words to characters in future work.
After convolution, we apply k-max pooling for both the entities and the contexts and concatenate the results. The concatenated vector $c_z \in \mathbb{R}^{C_z}$, $z \in \{EC, RE\}$, is forwarded to a task-specific hidden layer of size $H_z$, which learns patterns across the different input parts and yields the representation $h_z \in \mathbb{R}^{H_z}$.
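The pooling and concatenation step can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `k_max_pooling`, the toy feature maps and the choice to keep the selected activations in their original order (a common definition of k-max pooling) are assumptions.

```python
import numpy as np


def k_max_pooling(feature_maps: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest activations per filter, in their original order.

    feature_maps has shape (num_filters, num_positions).
    """
    num_positions = feature_maps.shape[1]
    if k > num_positions:
        raise ValueError("k must not exceed the number of positions")
    # Indices of the k largest values per filter ...
    top_k = np.argsort(-feature_maps, axis=1, kind="stable")[:, :k]
    # ... re-sorted so that the selected values keep their sequence order.
    top_k = np.sort(top_k, axis=1)
    return np.take_along_axis(feature_maps, top_k, axis=1)


# Toy convolution outputs (2 filters each); the values are made up.
entity_maps = np.array([[1.0, 5.0, 2.0, 4.0], [0.0, -1.0, 3.0, 2.0]])
context_maps = np.array([[0.5, 0.1, 0.9], [2.0, 1.0, 0.0]])

pooled_entity = k_max_pooling(entity_maps, k=2)
pooled_context = k_max_pooling(context_maps, k=2)

# Concatenated vector c_z that would be fed to the task-specific hidden layer.
c_z = np.concatenate([pooled_entity.ravel(), pooled_context.ravel()])
print(c_z)
```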

Global Normalization Layer
For global normalization, we adopt the linear-chain CRF layer by Lample et al. (2016). It expects scores for the different classes as input. Therefore, we first apply a linear layer which maps the representations $h_z \in \mathbb{R}^{H_z}$ to a vector $v_z$ of the size of the output classes $N = N_{EC} + N_{RE}$:

$$v_z = W_z^\top h_z$$

with $W_z \in \mathbb{R}^{H_z \times N}$. For a sentence classification task, the input sequence for the CRF layer is not inherently clear. Therefore, we propose to model the joint entity and relation classification problem with the following sequence of scores (cf. Figure 2):

$$d = [v_{EC}^{e_1}, v_{RE}^{r_{12}}, v_{EC}^{e_2}]$$

with $r_{ij}$ being the relation between $e_i$ and $e_j$. Thus, we approximate the joint probability of entity types $T_{e_1}$, $T_{e_2}$ and relations $R_{e_1 e_2}$ as follows:

$$P(T_{e_1}, R_{e_1 e_2}, T_{e_2}) \approx P(T_{e_1}) \cdot P(R_{e_1 e_2} \mid T_{e_1}) \cdot P(T_{e_2} \mid R_{e_1 e_2})$$

Our intuition is that the dependence between relation and entities is stronger than the dependence between the two entities.
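The length-three score sequence, together with the sequence score, the forward-algorithm normalization and the Viterbi decoding described in the following paragraph, can be sketched in NumPy. This is an illustrative sketch under stated assumptions, not the released code: the class counts, the random weights, the index layout (entity classes first, then relation classes) and all function names are hypothetical, and begin/end tags are only allowed at the padded positions.

```python
import itertools
import numpy as np

N_EC, N_RE = 4, 6            # hypothetical numbers of entity / relation classes
N = N_EC + N_RE              # every position is scored over all N classes
BEGIN, END = N, N + 1        # indices of the padding tags in Q
H = 5                        # hypothetical hidden layer size

rng = np.random.default_rng(0)
W_ec, W_re = rng.normal(size=(H, N)), rng.normal(size=(H, N))
h_e1, h_r12, h_e2 = rng.normal(size=(3, H))
Q = rng.normal(size=(N + 2, N + 2))      # transition scores, incl. begin/end

# d = [v_EC(e1), v_RE(r12), v_EC(e2)] with v_z = W_z^T h_z
d = np.stack([W_ec.T @ h_e1, W_re.T @ h_r12, W_ec.T @ h_e2])


def sequence_score(d, Q, y):
    """Sum of transition scores (with padding tags) and class scores."""
    tags = [BEGIN, *y, END]
    transitions = sum(Q[a, b] for a, b in zip(tags[:-1], tags[1:]))
    emissions = sum(d[p, q] for p, q in enumerate(y))
    return float(transitions + emissions)


def log_partition(d, Q):
    """Forward algorithm: log of the summed exp-scores of all sequences."""
    alpha = Q[BEGIN, :N] + d[0]
    for p in range(1, len(d)):
        alpha = np.logaddexp.reduce(alpha[:, None] + Q[:N, :N], axis=0) + d[p]
    return float(np.logaddexp.reduce(alpha + Q[:N, END]))


def viterbi(d, Q):
    """Return the highest-scoring label sequence and its score."""
    delta, backpointers = Q[BEGIN, :N] + d[0], []
    for p in range(1, len(d)):
        candidates = delta[:, None] + Q[:N, :N]
        backpointers.append(candidates.argmax(axis=0))
        delta = candidates.max(axis=0) + d[p]
    delta = delta + Q[:N, END]
    path = [int(delta.argmax())]
    for bp in reversed(backpointers):
        path.append(int(bp[path[-1]]))
    return path[::-1], float(delta.max())


y_hat = (0, N_EC, 1)  # made-up gold sequence: entity type, relation, entity type
log_prob = sequence_score(d, Q, y_hat) - log_partition(d, Q)
best_path, best_score = viterbi(d, Q)

# Sanity check by enumerating all N**3 sequences (feasible only for tiny N).
brute_force = max(itertools.product(range(N), repeat=3),
                  key=lambda y: sequence_score(d, Q, y))
print(log_prob, best_path, brute_force)
```

Because the sequence always has length three, exhaustive enumeration stays cheap here; the dynamic programs are shown because they are what the CRF layer description refers to.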
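The CRF layer pads its input of length $n = 3$ with begin and end tags and computes the following score for a sequence of predictions $y$:

$$s(y) = \sum_{i=0}^{n} Q_{y_i, y_{i+1}} + \sum_{i=1}^{n} d_{i, y_i}$$

with $Q_{k,l}$ being the transition score from class $k$ to class $l$ and $d_{p,q}$ being the score of class $q$ at position $p$ in the sequence; $y_0$ and $y_{n+1}$ denote the begin and end tags. The scores are summed because all the variables of the CRF layer live in the log space. The matrix of transition scores $Q \in \mathbb{R}^{(N+2) \times (N+2)}$ is learned during training. For training, the forward algorithm computes the scores for all possible label sequences $Y$ to get the log-probability of the correct label sequence $\hat{y}$:

$$\log p(\hat{y}) = s(\hat{y}) - \log \sum_{\tilde{y} \in Y} e^{s(\tilde{y})}$$

For testing, Viterbi is applied to obtain the label sequence $y^*$ with the maximum score:

$$y^* = \arg\max_{\tilde{y} \in Y} s(\tilde{y})$$

Experiments and Analysis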

Data and Evaluation Measure
We use the "entity and relation recognition" (ERR) dataset from (Roth and Yih, 2004) with the train-test split by Gupta et al. (2016). We tune the parameters on a held-out part of the training set. The data is labeled with entity types and relations (see Table 1). For entity pairs without a relation, we use the label N. Dataset statistics and model parameters are provided in the appendix.
Following previous work, we compute F1 of the individual classes for EC and RE, as well as a taskwise macro F1 score. We also report the average of scores across tasks (Avg EC+RE).
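The evaluation measure can be sketched as follows. This is a generic per-class and macro F1 computation, not the evaluation script used for the reported numbers; the function names, the toy labels and the choice of which classes enter the macro average (passed explicitly as `classes`) are assumptions.

```python
from collections import Counter
from typing import Dict, List


def per_class_f1(gold: List[str], pred: List[str], classes: List[str]) -> Dict[str, float]:
    """F1 for each class in `classes`, computed from paired gold/predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1   # predicted class p although it was not correct
            fn[g] += 1   # missed an instance of gold class g
    scores = {}
    for c in classes:
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        denominator = precision + recall
        scores[c] = 2 * precision * recall / denominator if denominator else 0.0
    return scores


def macro_f1(gold: List[str], pred: List[str], classes: List[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(gold, pred, classes)
    return sum(scores.values()) / len(classes)


# Toy example with made-up labels.
gold = ["Peop", "Loc", "Org", "Loc"]
pred = ["Peop", "Org", "Org", "Loc"]
ec_macro = macro_f1(gold, pred, classes=["Peop", "Loc", "Org"])
print(round(ec_macro, 4))
```

The cross-task score (Avg EC+RE) would then be the mean of the EC and RE macro F1 values.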

Experimental Setups
Setup 1: Entity Pair Relations. Roth and Yih (2004, 2007) and Kate and Mooney (2010) train separate models for EC and RE on the ERR dataset. For RE, they only identify relations between named entity pairs. In this setup, the query entities for our model are only named entity pairs. Note that this facilitates EC in our experiments.
Setup 2: Table Filling. Following Miwa and Sasaki (2014) and Gupta et al. (2016), we also model the joint task of EC and RE as a table filling task. For a sentence of length m, we create a square table with m rows and m columns. Cell (i, j) contains the relation between word i and word j (or N for no relation).
A diagonal cell (k, k) contains the entity type of word k. Following previous work, we only predict classes for half of the table, i.e., for m(m + 1)/2 cells. Figure 3 shows the table for the example sentence from Figure 1. In this setup, each cell (i, j) with i ≠ j is a separate input query to our model. Our model outputs a prediction for cell (i, j) (the relation between i and j) and predictions for cells (i, i) and (j, j) (the types of i and j). To fill the diagonal with entity classes, we aggregate all predictions for the particular entity by majority vote. Section 4.4 (Analysis of Entity Type Aggregation) shows that the individual predictions agree with the majority vote in almost all cases.
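The table filling procedure can be sketched as follows. The function names and the `predict` callback, which stands in for one forward pass of the model on a query pair, are hypothetical; the toy predictor only serves to show the majority vote.

```python
from collections import Counter, defaultdict
from itertools import combinations
from typing import Callable, Dict, List, Tuple


def table_queries(m: int) -> List[Tuple[int, int]]:
    """Off-diagonal cells (i, j) with i < j of the upper half of an m x m table."""
    return list(combinations(range(m), 2))


def fill_table(m: int, predict: Callable[[int, int], Tuple[str, str, str]]):
    """Query the model once per cell (i, j), i < j, and aggregate entity types.

    predict(i, j) is assumed to return (type of i, relation between i and j, type of j).
    """
    relations: Dict[Tuple[int, int], str] = {}
    votes = defaultdict(Counter)
    for i, j in table_queries(m):
        type_i, relation, type_j = predict(i, j)
        relations[(i, j)] = relation
        votes[i][type_i] += 1
        votes[j][type_j] += 1
    # Diagonal cell (k, k): majority vote over all predictions for entity k.
    diagonal = {k: votes[k].most_common(1)[0][0] for k in votes}
    return relations, diagonal


# Toy predictor with made-up labels; it disagrees once about entity 0.
TYPES = ["Peop", "O", "Org", "Loc"]


def toy_predict(i: int, j: int) -> Tuple[str, str, str]:
    type_i = "Loc" if (i, j) == (0, 3) else TYPES[i]
    return type_i, "N", TYPES[j]


relations, diagonal = fill_table(4, toy_predict)
print(len(relations), diagonal)
```

For m = 4 this issues 6 queries; together with the 4 diagonal cells this covers the m(m + 1)/2 = 10 cells of the half table.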
Setup 3: Table Filling Without Entity Boundaries. The table from setup 2 includes one row/column per multi-token entity, utilizing the given entity boundaries of the ERR dataset. In order to investigate the impact of the entity boundaries on the classification results, we also consider another table filling setup where we ignore the boundaries and assign one row/column per token. Note that this setup is also used by prior work on table filling (Miwa and Sasaki, 2014; Gupta et al., 2016). For evaluation, we follow Gupta et al. (2016) and score a multi-token entity as correct if at least one of its comprising cells has been classified correctly.
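The evaluation rule for multi-token entities in setup 3 can be made concrete with a small sketch. The data layout (gold entities as token spans with exclusive end, one predicted type per token on the diagonal) and the function name are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def multi_token_entity_hits(gold: Dict[Tuple[int, int], str],
                            diagonal: List[str]) -> Dict[Tuple[int, int], bool]:
    """Mark a gold entity as correct if any of its tokens received the gold type.

    gold maps a (start, end) token span (end exclusive) to its gold entity type;
    diagonal[k] is the predicted type of token k.
    """
    return {
        (start, end): any(diagonal[k] == gold_type for k in range(start, end))
        for (start, end), gold_type in gold.items()
    }


# Toy example with made-up predictions: the three-token entity counts as
# correct because one of its tokens was classified as "Org".
diagonal = ["Peop", "O", "Org", "O", "Loc"]
gold = {(0, 1): "Peop", (1, 4): "Org", (4, 5): "Org"}
hits = multi_token_entity_hits(gold, diagonal)
print(hits)
```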
Comparison. The most important difference between setup 1 and setup 2 is the number of entity pairs with no relation (test set: ≈3k for setup 1, ≈121k for setup 2). This makes setup 2 more challenging. The same holds for setup 3, which considers the same number of entity pairs with no relation as setup 2. To cope with this, we randomly subsample negative instances in the train set of setups 2 and 3. Setup 3 considers the most query entity pairs in total since multi-token entities are split into their comprising tokens. However, setup 3 represents a more realistic scenario than setup 1 or setup 2 because in most cases, entity boundaries are not given. In order to apply setup 1 or 2 to another dataset without entity boundaries, a preprocessing step, such as entity boundary recognition or chunking, would be required.

Figure 3: Entity-relation table

Experimental Results
Table 1 shows the results of our globally normalized model in comparison to the same model with locally normalized softmax output layers (one for EC and one for RE). For setup 1, the CRF layer performs comparably to or better than the softmax layer. For setups 2 and 3, the improvements are more apparent. We assume that the model can benefit more from global normalization in the case of table filling because it is the more challenging setup. The comparison between setup 2 and setup 3 shows that entity classification suffers when entity boundaries are not given (in setup 3). A reason could be that the model can no longer convolve the token embeddings of multi-token entities when computing the entity representation (context B and D in Figure 2). Nevertheless, the relation classification performance is comparable in setup 2 and setup 3. This indicates that the model can internally account for potentially wrong entity classification results due to missing entity boundaries.
The overall results (Avg EC+RE) of the CRF are better than the results of the softmax layer for all three setups. To sum up, the improvements of the linear-chain CRF show that (i) joint EC and RE benefit from global normalization and (ii) our way of creating the input sequence for the CRF for joint EC and RE is effective.
Comparison to State of the Art. Table 2 shows our results in the context of state-of-the-art results: (Roth and Yih, 2007), (Kate and Mooney, 2010), (Miwa and Sasaki, 2014), (Gupta et al., 2016) (footnote 5). Note that the results are not comparable because of the different setups and different train-test splits (footnote 6). Our results are best comparable with (Gupta et al., 2016) since we use the same setup and train-test splits. However, their model is more complicated, with many hand-crafted features and several iterations of modeling dependencies among entity and relation classes. In contrast, we only use pre-trained word embeddings and train our model end-to-end with only one iteration per entity pair. When we compare with their model without additional features (G et al. 2016 (2)), our model performs worse for EC but better for RE and comparable for Avg EC+RE.

Analysis of Entity Type Aggregation
As described in Section 4.2 (Experimental Setups), we aggregate the EC results by majority vote. We now analyze how often the individual predictions disagree with the aggregated label. For our best model, there are only 9 entities (0.12%) with disagreement in the test data. For those, the maximum, minimum and median disagreement with the majority label are 36%, 2% and 8%, respectively. Thus, the disagreement is negligibly small.

Footnote 5: We only show results of single models, no ensembles. Following previous studies, we omit the entity class "Other" when computing the EC score.
Footnote 6: Our results on EC in setup 1 are also not comparable

Analysis of CRF Transition Matrix
To analyze the CRF layer, we extract the transitions whose scores are above 0.5. Figure 4 shows that the layer has learned correct correlations between entity types and relations.
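Extracting the strong transitions amounts to thresholding the learned transition matrix. The following sketch uses a made-up 3 x 3 matrix and illustrative label names; only the threshold of 0.5 is taken from the text.

```python
import numpy as np


def strong_transitions(Q: np.ndarray, labels, threshold: float = 0.5):
    """List (from_label, to_label, score) for all transitions above the threshold."""
    rows, cols = np.nonzero(Q > threshold)
    return [(labels[k], labels[l], float(Q[k, l])) for k, l in zip(rows, cols)]


# Hypothetical labels and scores, for illustration only.
labels = ["Peop", "Live_in", "Loc"]
Q = np.array([
    [0.1, 0.9, -0.3],
    [0.0, 0.2, 0.8],
    [-0.5, 0.4, 0.1],
])

for source, target, score in strong_transitions(Q, labels):
    print(f"{source} -> {target}: {score:.1f}")
```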

Conclusion and Future Work
In this paper, we presented the first study on global normalization of neural networks for a sentence classification task without transforming it into a token-labeling problem. We trained a convolutional neural network with a linear-chain conditional random field output layer on joint entity and relation classification and showed that it outperformed using a locally normalized softmax layer.
An interesting future direction is the extension of the linear-chain CRF to jointly normalize all predictions for table filling in a single model pass. Furthermore, we plan to verify our results on other datasets in future work.

Table 4 provides the hyperparameters we optimized on dev ($nk_C$: number of convolutional filters for the CNN convolving the contexts, $nk_E$: number of convolutional filters for the CNN convolving the entities; $h_C$: number of hidden units for creating the final context representation, $h_E$: number of hidden units for creating the final entity representation).

A Dataset Statistics
For all models, we use a filter width of 3 for the context CNN and a filter width of 2 for the entity CNN (tuned in prior experiments and fixed for the optimization of the parameters in Table 4).
For training, we apply gradient descent with a batch size of 10 and an initial learning rate of 0.1. When the performance on dev decreases, we halve the learning rate. The model is trained with early stopping on dev, with a maximum of 20 epochs. We apply L2 regularization with $\lambda = 10^{-3}$.
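The learning rate schedule and model selection on dev can be sketched as a small training loop. The callbacks `train_epoch` and `evaluate_dev` are hypothetical placeholders for one pass of mini-batch gradient descent (batch size 10, L2 regularization) and for scoring on the development set. Early stopping is interpreted here as keeping the epoch with the best dev score, which is an assumption, as is comparing each dev score to that of the previous epoch.

```python
from typing import Callable, List, Tuple


def run_training(train_epoch: Callable[[float], None],
                 evaluate_dev: Callable[[], float],
                 initial_lr: float = 0.1,
                 max_epochs: int = 20):
    """Train for at most max_epochs, halving the learning rate when dev drops."""
    lr = initial_lr
    previous_score = float("-inf")
    best_score, best_epoch = float("-inf"), None
    history: List[Tuple[int, float, float]] = []
    for epoch in range(1, max_epochs + 1):
        train_epoch(lr)                 # one pass over the training data
        score = evaluate_dev()
        history.append((epoch, lr, score))
        if score > best_score:          # a real run would save a checkpoint here
            best_score, best_epoch = score, epoch
        if score < previous_score:      # dev performance decreased
            lr /= 2.0
        previous_score = score
    return best_epoch, best_score, history


# Toy run with made-up dev scores instead of a real model.
fake_dev_scores = iter([0.50, 0.60, 0.55, 0.70])
best_epoch, best_score, history = run_training(
    train_epoch=lambda lr: None,
    evaluate_dev=lambda: next(fake_dev_scores),
    max_epochs=4,
)
print(best_epoch, best_score, [lr for _, lr, _ in history])
```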

Figure 1: Examples of our task
Figure 2 illustrates our model.
Example context split for the entity pair ("Anderson", "chief"): [] [Anderson] [, 41 , was the chief Middle ...] [Anderson , 41 , was the] [chief] [Middle East correspondent for ...]

Figure 2: Model overview; the colors/shades show which model parts share parameters

Figure 4: Most strongly correlated entity types and relations according to CRF transition matrix
Example sentence (Figure 1): Anderson , 41 , was the chief Middle East correspondent for The Associated Press when he was kidnapped in Beirut

Table 1: F1 results for entity classification (EC) and relation extraction (RE) in the three setups

Table 2: Comparison to state of the art (S: setup)

Table 3 provides statistics of the data composition in our different setups which are described in the paper. The N class of setup 2 and setup 3 has been subsampled in the training and development set as described in the paper.

Table 3: Dataset statistics for our different experimental setups
Note that the sum of numbers of relation labels is slightly different to the numbers reported in (Roth and Yih, 2004). According to their website https://cogcomp.cs.illinois.edu/page/resource_view/43, they have updated the corpus.