AMORE-UPF at SemEval-2018 Task 4: BiLSTM with Entity Library

This paper describes our winning contribution to SemEval 2018 Task 4: Character Identification on Multiparty Dialogues. It is a simple, standard model with one key innovation, an entity library. Our results show that this innovation greatly facilitates the identification of infrequent characters. Because of the generic nature of our model, this finding is potentially relevant to any task that requires effective learning from sparse or imbalanced data.


Introduction
SemEval 2018 Task 4 is an entity linking task on multiparty dialogue.¹ It consists of predicting the referents of nominals that refer to a person, such as she, mom, Judy (henceforth mentions). The set of possible referents is given beforehand, as well as the set of mentions to resolve. The dataset used in this task is based on Chen and Choi (2016) and Chen et al. (2017), and consists of dialogue from the TV show Friends in textual form.
Our main interest is whether deep learning models for tasks like entity linking can benefit from having an explicit entity library, i.e., a component of the neural network that stores entity representations learned during training. To that end, we add such a component to an otherwise relatively basic model: a bidirectional LSTM (long short-term memory; Hochreiter and Schmidhuber 1997), the standard neural network model for sequential data like language. Training and evaluating this model on the task shows that the entity library is beneficial in the case of infrequent entities.²
* denotes equal contribution.
¹ https://competitions.codalab.org/competitions/17310

Related Work
Previous entity linking tasks concentrate on linking mentions to Wikipedia pages (Bunescu and Paşca 2006; Mihalcea and Csomai 2007 and much subsequent work; for a recent approach see Francis-Landau et al. 2016). By contrast, in the present task (based on Chen and Choi 2016; Chen et al. 2017) only a list of entities is given, without any associated encyclopedic entries. This makes the task more similar to the way in which a human audience might watch the TV show, in that they are initially unfamiliar with the characters. What also sets the present task apart from most previous tasks is its focus on multiparty dialogue (as opposed to, typically, newswire articles).
A task that is closely related to entity linking is coreference resolution, i.e., the task of clustering mentions that refer to the same entity (e.g., the CoNLL shared task of Pradhan et al. 2011). Since mention clusters essentially correspond to entities (an insight central to the approaches to coreference in Haghighi and Klein 2010; Clark and Manning 2016), the present task can be regarded as a type of coreference resolution, but one where the set of referents to choose from is given beforehand.
Since our main aim is to test the benefits of having an entity library, in other respects our model is kept more basic than existing work both on entity linking and on coreference resolution (e.g., the aforementioned approaches, as well as Wiseman et al. 2016; Lee et al. 2017; Francis-Landau et al. 2016). For instance, we avoid feature engineering, focusing instead on the model's ability to learn meaningful entity representations from the dialogue itself. Moreover, we deviate from the common strategy in entity linking of incorporating a specialized coreference resolution module (e.g., Chen et al. 2017).
² Source code for our model and for the training procedure is published on https://github.com/amore-upf/semeval2018-task4.
arXiv:1805.05370v1 [cs.CL] 14 May 2018

Model Description
We approach the task of character identification as one of multi-class classification. Our model is depicted in Figure 1, with inputs in the top left and outputs at the bottom. In a nutshell, our model is a bidirectional LSTM (long short-term memory, Hochreiter and Schmidhuber 1997) that processes the dialogue text and resolves mentions by comparing the LSTM's hidden state (for each mention) to vectors in a learned entity library.
The model is given chunks of dialogue, which it processes token by token. The i-th token t_i and its speakers S_i (typically a singleton set) are represented as one-hot vectors, embedded via two distinct embedding matrices (W_t and W_s, respectively) and finally concatenated to form a vector x_i (Eq. 1; see also Figure 1). In case S_i contains multiple speakers, their embeddings are summed.
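The input construction just described can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the sizes and the random initialization are hypothetical, the concatenation order (token first, speaker second) is a guess since the text does not fix it, and tanh is assumed to act on the whole concatenated vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration (not the paper's settings).
VOCAB_SIZE, NUM_SPEAKERS = 10, 4
TOKEN_DIM, SPEAKER_DIM = 6, 3

# Embedding matrices W_t and W_s; each column embeds one token / one speaker.
W_t = rng.normal(size=(TOKEN_DIM, VOCAB_SIZE))
W_s = rng.normal(size=(SPEAKER_DIM, NUM_SPEAKERS))


def one_hot(index, size):
    """Return a one-hot vector of the given size."""
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec


def embed_input(token_id, speaker_ids):
    """Build x_i from a token index and a non-empty set S_i of speaker indices."""
    token_emb = W_t @ one_hot(token_id, VOCAB_SIZE)
    # Several speakers: their embeddings are summed.
    speaker_emb = sum(W_s @ one_hot(s, NUM_SPEAKERS) for s in speaker_ids)
    # Concatenate (order assumed), then apply the activation f = tanh.
    return np.tanh(np.concatenate([token_emb, speaker_emb]))


x_single = embed_input(token_id=2, speaker_ids={1})
x_shared = embed_input(token_id=2, speaker_ids={1, 3})
```

In this sketch the token part of `x_single` and `x_shared` is identical; only the speaker part changes when a second speaker is added.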
We apply an activation function f (= tanh) to the input embeddings. The hidden state →h_i of a unidirectional LSTM for the i-th input is recursively defined as a combination of that input with the LSTM's previous hidden state →h_{i−1}. For a bidirectional LSTM, the hidden state h_i is a concatenation of the hidden states →h_i and ←h_i of two unidirectional LSTMs which process the data in opposite directions (Eq. 2; see also Figure 1). In principle, this enables a bidirectional LSTM to represent the entire dialogue with a focus on the current input, including for instance its relevant dependencies on the context.
In the model, learned representations of each entity are stored in the entity library E ∈ R^{N×k} (see Figure 1): E is a matrix which represents each of N entities through a k-dimensional vector, and whose values are updated (only) during training. For every token t_i that is tagged as a mention, we map the corresponding hidden state h_i to a vector e_i ∈ R^{1×k}:
e_i = W_o h_i + b
This extracted representation is used to retrieve the (candidate) referent of the mention from the entity library: the similarity of e_i to each entity representation stored in E is computed using cosine, and softmax is then applied to the resulting similarity profile to obtain a probability distribution o_i ∈ [0, 1]^{1×N} over entities ('class scores' in Figure 1):
o_i = softmax(cosine(E, e_i))   (3)
At testing time, the model's prediction ĉ_i for the i-th token is the entity with highest probability:
ĉ_i = argmax(o_i)
We train the model with backpropagation, using negative log-likelihood as loss function. Besides the BiLSTM parameters, we optimize W_t, W_s, W_o, E and b. We refer to this model as AMORE-UPF, our team name in the SemEval competition. Note that, in order for this architecture to be successful, e_i needs to be as similar as possible to the entity vector of the entity to which mention t_i refers. Indeed, the mapping W_o should effectively specialize in "extracting" entity representations from the hidden state because of the way its output e_i is used in the model: to do entity retrieval. Our entity retrieval mechanism is inspired by the attention mechanism of Bahdanau et al. (2016), which has been used in previous work to interact with an external memory (Sukhbaatar et al., 2015; Boleda [...]
[...] Figure 2), respectively, which are annotated with the ID of the entity to which they refer (e.g., 335, 183). The utterances are further annotated with the name of the speaker (e.g., JOEY TRIBBIANI). Overall there are 372 entities in the training data (test data: 106).
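The entity retrieval step of the model description (mapping h_i to e_i, cosine similarity against E, softmax, and arg-max prediction) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the sizes and random values are hypothetical, and the affine form of the mapping is an assumption based on W_o and b being listed among the optimized parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only.
NUM_ENTITIES, K, HIDDEN = 5, 4, 8

E = rng.normal(size=(NUM_ENTITIES, K))  # entity library, one row per entity
W_o = rng.normal(size=(K, HIDDEN))      # maps a hidden state into entity space
b = np.zeros(K)


def retrieve(h_i):
    """Return class scores o_i and the predicted entity index for one mention."""
    # Assumed affine mapping from hidden state to entity representation.
    e_i = W_o @ h_i + b
    # Cosine similarity between e_i and every row of E.
    sims = (E @ e_i) / (np.linalg.norm(E, axis=1) * np.linalg.norm(e_i))
    # Softmax over the similarity profile (shifted for numerical stability).
    exp = np.exp(sims - sims.max())
    o_i = exp / exp.sum()
    return o_i, int(np.argmax(o_i))


def nll_loss(o_i, gold_entity):
    """Negative log-likelihood of the gold entity under o_i."""
    return -np.log(o_i[gold_entity])


h_i = rng.normal(size=HIDDEN)
o_i, predicted = retrieve(h_i)
loss = nll_loss(o_i, gold_entity=predicted)
```

During training, the gradient of this loss would flow into E as well as into W_o and b, which is how the entity vectors come to be learned.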
Our models do not use any of the provided automatic linguistic annotations, such as PoS or named entity tags. We additionally used the publicly available 300-dimensional word vectors that were pre-trained on a Google News corpus with the word2vec Skip-gram model (Mikolov et al., 2013).⁵
⁴ The organizers also provided data divided by episodes rather than scenes, which we didn't use.
⁵ The word vectors are available at https://code.google.com/archive/p/word2vec/.
Parameter settings. Using 5-fold cross-validation on the training data, we performed a random search over the hyperparameters and chose those which yielded the best mean F1-score. Specifically, our submitted model is trained in batch mode using the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.0005. Each batch covers 24 scenes, which are given to the model in chunks of 757 tokens. The token embeddings (W_t) are initialized with the word2vec vectors. Dropout rates of 0.008 and 0.0013 are applied on the input x_i and hidden layer h_i of the LSTM, respectively. The size of h_i is set to 459 units, and the embeddings of the entity library E and speakers W_s are set to k = 134 dimensions. Other configurations, including randomly initialized token embeddings, weight sharing between E and W_s, self-attention (Bahdanau et al., 2016) on the input layer, a uni-directional LSTM, and rectifier or linear activation function f on the input embeddings did not improve performance.
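For reference, the reported settings can be gathered into a single configuration, together with a helper that splits a token sequence into fixed-size chunks. The key names and the `chunk` helper are hypothetical, the token dimension is inferred from the 300-dimensional word2vec initialization, and how the original code treats chunk boundaries is not stated in the text.

```python
# Reported hyperparameters in one place; all key names are hypothetical.
CONFIG = {
    "optimizer": "adam",
    "learning_rate": 0.0005,
    "scenes_per_batch": 24,
    "chunk_size_tokens": 757,
    "dropout_input": 0.008,    # applied on x_i
    "dropout_hidden": 0.0013,  # applied on h_i
    "hidden_size": 459,        # size of h_i as reported
    "entity_dim": 134,         # k, rows of the entity library E
    "speaker_dim": 134,        # speaker embeddings W_s
    "token_dim": 300,          # inferred from the word2vec initialization
    "cv_folds": 5,
}


def chunk(tokens, size):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


# Example: a hypothetical batch of 1600 tokens.
chunks = chunk(list(range(1600)), CONFIG["chunk_size_tokens"])
chunk_lengths = [len(c) for c in chunks]
```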
For the final submission of the answers for the test data, we created an ensemble model by averaging the output (Eq. 3) of the five models trained on the different folds.
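The ensembling step amounts to averaging the per-model class-score distributions before taking the arg-max. A minimal sketch, assuming the five models' outputs are stacked in one array (the function name and the array layout are hypothetical):

```python
import numpy as np


def ensemble_predict(fold_outputs):
    """Average class scores over models and predict the top entity.

    fold_outputs: array of shape (num_models, num_mentions, num_entities),
    where each row along the last axis is a probability distribution.
    """
    mean_scores = np.mean(fold_outputs, axis=0)
    return mean_scores, mean_scores.argmax(axis=1)


# Toy example: two models, one mention, two entities.
toy_outputs = np.array([
    [[0.6, 0.4]],
    [[0.1, 0.9]],
])
mean_scores, predictions = ensemble_predict(toy_outputs)
```

In the toy example the two models disagree, and the average favors the second entity because the second model is the more confident of the two.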

Results
Two evaluation conditions were defined by the organizers, all entities and main entities, with macro-average F1-score and label accuracy as the official metrics, and macro-average F1-score in the all entities condition applied to the leaderboard. The all entities evaluation has 67 classes: 66 for entities that are mentioned at least 3 times in the test set and one grouping all others. The main entities evaluation has 7 classes: 6 for the main characters and one for all the others. Among all four participating systems in this SemEval task, our model achieved the highest score on the all entities evaluation, and the second-highest on the main entities evaluation.
Table 1 gives our results in the two evaluations, comparing the models described in Section 4. While both models perform on a par on main entities, AMORE-UPF outperforms NoEntLib by a substantial margin when all characters are to be predicted (+15 points in F1-score, +3 points in accuracy; Table 1). The difference between the models with and without an entity library is statistically significant based on approximate randomization tests (Noreen, 1989), with the significance level p < 0.001. This shows that the use of an entity library can be beneficial for the linking of rarely mentioned characters.
Figure 3 shows that most of the target mentions in the test data fall into one of five grammatical categories. The dataset contains mostly pronouns (83%), with a very high percentage of first person pronouns (44%). Figures 4 and 5 present the accuracy and F1-score which the two models described above obtain on all entities for different categories of mentions. The entity library is beneficial when the mention is a first person pronoun or a proper noun (with an increase of 30 points in F1-score for both categories; Figure 4), and closer inspection revealed that this effect was larger for rare entities.
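To make the all entities metric concrete, the sketch below groups entities with fewer than three gold mentions into a single class and then computes macro-average F1 and accuracy. It is an illustration under assumptions, not the official scorer: the `OTHER` label, the function names, and the handling of empty classes are hypothetical.

```python
from collections import Counter

OTHER = -1  # hypothetical label for the class grouping all rare entities


def group_rare(gold, pred, min_count=3):
    """Map entities with fewer than min_count gold mentions to OTHER."""
    counts = Counter(gold)
    keep = {entity for entity, n in counts.items() if n >= min_count}
    gold_grouped = [g if g in keep else OTHER for g in gold]
    pred_grouped = [p if p in keep else OTHER for p in pred]
    return gold_grouped, pred_grouped


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over the classes present in gold."""
    scores = []
    for c in set(gold):
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(gold, pred):
    """Fraction of mentions whose predicted class matches the gold class."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Toy data: entities 3 and 4 are rare and end up in the OTHER class.
gold = [1, 1, 1, 2, 2, 2, 3, 4]
pred = [1, 1, 2, 2, 2, 2, 4, 3]
gold_g, pred_g = group_rare(gold, pred)
f1 = macro_f1(gold_g, pred_g)
acc = accuracy(gold_g, pred_g)
```

Because every class counts equally in the macro average, errors on infrequent entities weigh far more heavily here than in accuracy, which is why gains on rare characters show up mainly in the F1-score.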

Discussion
The AMORE-UPF model consists of a bidirectional LSTM linked to an entity library. Compared to an LSTM without entity library, NoEntLib, the AMORE-UPF model performs particularly well on rare entities, which explains its top score in the all entities condition of SemEval 2018 Task 4. This finding is encouraging, since rare entities are especially challenging for the usual approaches in NLP, due to the scarcity of information about them.
We offer the following explanation for this beneficial effect of the entity library, as a hypothesis for future work. Having an entity library requires the LSTM of our model to output some representation of the mentioned entity, as opposed to outputting class scores more or less directly as in the variant NoEntLib. Outputting a meaningful entity representation is particularly easy in the case of first person pronouns and nominal mentions (where the effect of the entity library appears to reside; Figure 4): the LSTM can learn to simply forward the speaker embedding unchanged in the case of pronoun I, and the token embedding in the case of nominal mentions. This strategy does not discriminate between frequent and rare entities; it works for both alike. We leave further analyses required to test this potential explanation for future work.
Future work may also reveal to what extent the induced entity representations may be useful in other tasks, to what extent they encode entities' attributes and relations (cf. Gupta et al. 2015), and to what extent a module like our entity library can be employed elsewhere, in natural language processing and beyond.