SemEval 2018 Task 4: Character Identification on Multiparty Dialogues

Character identification is a task of entity linking that finds the global entity of each personal mention in multiparty dialogue. For this task, the first two seasons of the popular TV show Friends are annotated, comprising a total of 448 dialogues, 15,709 mentions, and 401 entities. The personal mentions are detected from nominals referring to certain characters in the show, and the entities are collected from the list of all characters in those two seasons of the show. This task is challenging because it requires the identification of characters that are mentioned but may not be active during the conversation. Among 90+ participants, four of them submitted their system outputs and showed strengths in different aspects about the task. Thorough analyses of the distributed datasets, system outputs, and comparative studies are also provided. To facilitate the momentum, we create an open-source project for this task and publicly release a larger and cleaner dataset, hoping to support researchers for more enhanced modeling.


Introduction
Most of the earlier works in natural language processing (NLP) had focused on formal writing such as newswires, whereas many recent works have targeted at colloquial writing such as text messages or social media. Since the evolution of Web 2.0, the amount of user-generated contents involving colloquial writing has exceeded the one with formal writing. NLP tasks are relatively well-explored at this point for certain types of colloquial writing i.e., microblogs and reviews (Ritter et al., 2011;Kong et al., 2014;Ranganath et al., 2016;Shin et al., 2017). However, the genre of multiparty dialogue is still under-explored, even though digital contents in dialogue forms keep increasing at a faster rate than any other types of writing. 1 This inspires us to create a new task called character identification that aims to link personal mentions (e.g, she, mom) to their global entities across multiple dialogues, where the entities indicate the specific characters referred by those mentions (e.g., Judy).
Due to the nature of multiparty dialogue where several speakers take turns to complete a context, character identification is a crucial step for adapting higher-end NLP tasks (e.g., summarization, question answering, machine translation) to this genre. It can also bring another level of sophistication to intelligent personal assistants or tutoring systems. This task is challenging because it needs to process through colloquialism that includes slangs, grammar mistakes, and/or rhetorical questions, as well as to handle cross-document resolution for the identification of entities that are mentioned but may not be actively participating during the conversation. Nonetheless, we believe that models produced by this task will remarkably enhance inference on dialogue contexts (e.g., business meetings, doctorpatient conversations) by providing finer-grained information about individual characters. Section 2 illustrates the task of character identification and explains the key differences between it and other types of entity linking tasks. Section 3 describes the corpus, based on TV show transcripts, used for this task with annotation details. Section 4 gives brief overviews of the systems participated in this shared task. Section 5 explains the evaluation metrics and the results produced by those systems. Finally, Section 6 gives thorough analysis and comparative studies between these systems. This task was originally conducted at CodaLab. 2 The latest dataset and the system outputs can be found from our open source project, Emory NLP. 3 Ross I told mom and dad last night, they seemed to take it pretty well.

Monica
Oh really, so that hysterical phone call I got from a woman at sobbing 3:00 A.M., "I'll never have grandchildren, I'll never have grandchildren

Task Description
Let a mention be a nominal that refers to a singular or a collective entity (e.g., she, mom, Judy), and an entity be the actual person that the mention refers to. Given a dialogue transcribed in text where all mentions are detected, the objective is to find the entity for each mention, who can be either active or passive in the dialogue. In Figure 1, entities such as Ross, Monica, and Joey are the active speakers of the dialogue, whereas Jack and Judy are not although they are passively mentioned as mom and dad in this context. Linking such mentions to their global entities demands inferred knowledge about the kinship from other dialogues, challenging crossdocument resolution. Thus, character identification can be viewed as an entity linking task that aims for holistic understanding in multiparty dialogue. Most of previous works on entity linking have focused on Wikification, which links named entity mentions to their relevant Wikipedia articles (Mihalcea and Csomai, 2007;Ratinov et al., 2011;Guo et al., 2013). Unlike Wikification where most entities come with structured information from knowledge bases (e.g., Infobox, Freebase, DBPedia), entities in character identification have no such precom-piled information, which makes this task even more challenging. It is similar to coreference resolution in a sense that it groups mentions into entities, but distinguished because this task requires to identify each mention group as a known person. In Figure 1, coreference resolution would give a cluster of the four mentions, {mom, woman, I, I}; however, it would not identify that cluster to be the entity Judy, which in this case is not possible to identify without getting contexts from other dialogues.

Corpus
The character identification corpus was first created by collecting transcripts from the popular TV show, Friends (Chen and Choi, 2016). These transcripts were voluntarily provided by fans who made them publicly available. 4 Dialogues in this corpus mimic daily conversations that are more natural and various in topics than other dialogue corpora (Janin et al., 2003;Danescu-Niculescu-Mizil and Lee, 2011;Hu et al., 2013;Kim et al., 2015;Lowe et al., 2015). Although they are scripted, the interpretation of these dialogues is no easier than unscripted  dialogues; they not only involve as much disfluency and context switching as real dialogues do, but also include more humor, sarcasm, or metaphor. Thus, models evaluated on this corpus should give a general sense about the state of character identification on multiparty dialogue. The original transcripts collected from the fan site were formatted in HTML; we converted them into JSON so that they could be easily processed. This structured data were then manually checked for potential errors. Table 1 shows the distributions from the subset of the character identification corpus used for this shared task. The provided dataset is divided into two seasons, each season is divided into episodes, each episode is divided into scenes, each scene contains utterances, where each utterance indicates a turn of speech.

Mention Annotation
For mention annotation, a heuristic-based mention detector was developed, which utilized dependency relations (Choi and McCallum, 2013), named entity tags (Choi, 2016), and personal noun gazetteers, then automatically detected mentions for the entire corpus. In this heuristic, a noun phrase was considered a personal mention if it was either: 1. A PERSON named entity, or 2. A pronoun or a possessive pronoun excluding the pronouns it and they, or 3. One of the personal noun gazetteers that are 603 common and singular personal nouns selected from Freebase and DBPedia.
Specific mentions such as it and they were excluded because they often referred to the ambiguous entity types, collective, general, and other (Section 3.2). For the quality assurance, about 10% of this pseudo annotation were randomly sampled and manually evaluated, showing a precision, a recall, and the F1score of 97.58%, 94.34%, and 95.93%, respectively. Finally, the annotation was manually checked again while it was systematically corrected for routinely produced errors. Although mention detection was the foundational step, including it as a part of this shared task could over-complicate the evaluation. Thus, gold mentions were provided for this year's shared task such that participants could purely concentrate on the task of entity linking.

Entity Annotation
All mentions were double-annotated with their referent entities, and adjudicated upon disagreements. Annotation and adjudication tasks were conducted on Amazon Mechanical Turk. Each mention was annotated with either a primary character, that are Ross, Chandler, Joey, Rachel, Monica, and Pheobe, a secondary character (other frequently recurring characters across the show), or one of the following ambiguous types suggested by Chen et al. (2017): • Generic: indicates actual characters in the show whose identities are unknown (e.g., That waitress is really cute, I am going to ask her out). Generic entities are annotated with their group names and optional numberings (e.g., Man 1, Woman 1).
• Collective: indicates the plural use of the pronoun you, which cannot be deterministically distinguished from the singular use.
• General: indicates mentions used in reference to a general case rather than an specific entity (e.g., The ideal guy you look for doesn't exist).
• Other: indicates all the other kinds of entities.
For this year's shared task, mentions annotated with the last three ambiguous types, collective, general, and other, were excluded from the dataset to reduce the high complexity of this task (     Table 3 shows examples of these ambiguous types. About 83% were assigned to the primary and secondary characters, 1.4% were assigned to generic, and the rest were assigned to the other ambiguous types, collective, general, and other. To evaluate the annotation quality, the annotation agreement scores as well as Cohen's kappa scores were measured, showing 82.83% and 79.96%, respectively.

Data Split
The corpus was split into training and evaluation sets for this shared task (Table 4). No dedicated development set was provided; participants were encouraged to use sub-parts of the training set to create their own development sets or perform crossvalidation for the optimization of statistical models. Two types of datasets are provided for both training and evaluation sets, one treating each episode as an individual dialogue and the other treating each scene as an independent dialogue. 5 Processing a larger dialogue makes coreference resolution harder because it needs to link referential mentions that are farther apart; on the other hand, each cluster comprises a greater number of mentions which can help identifying the global entity of that cluster. The numbers of clusters grouped in each dataset are shown as Clusters E and Clusters S , implying episode-level and scene-level clusters, respectively. Our corpus includes singleton mentions, which take about 22% of all mentions.

Data Format
To help participants adapting their existing coreference resolution systems to this task, the original dataset in JSON was converted into the CoNLL'12 5 Each episode consists of about 10 scenes on average. shared task format (Pradhan et al., 2012), where each column is delimited by white spaces and represents the following: 1. Season and episode ID.
The part-of-speech tags, lemmas, and named entity tags were automatically generated by NLP4J, 6 and the phrase structure tags were produced by the Stanford parser. 7 Table 5 shows the example of the first utterance in Figure 1 in the CoNLL'12 format.

AMORE-UPF System
The AMORE-UPF system approaches this task as a multi-class classification. It uses a bidirectional Long Short-Term Memory (LSTM) that processes the input dialogue and resolves mentions, by means of a comparison between the LSTM's hidden state, for each mention, to vectors in an entity library. In this model, learned representations of each entity are stored in the entity library, that is a matrix where each row represents an entity and whose values are learned during training ( Figure 2  This is a matrix where each row vector represents an entity, and whose values are updated (only) during training. For tokens t i that are tagged as mentions, we map the hidden state to a representation that has the same dimensionality as the vectors in the entity library. 3 Its similarity to each entity representation is computed using cosine. Softmax is then applied to the resulting similarity profile to obtain a probability distribution o i over entities ('class scores' in Figure 1):

KNU-CI System
The KNU-CI system tackles this task as a sequencelabeling problem. It uses an attention-based recurrent neural network (RNN) encoder-decoder model. The input dialogue of character identification consists of several conversations, resulting a long sequence of text. The RNN encoder-decoder model suffers from poor performance when the length of the input sequence is long. To overcome this issue, this system applies an attention, position encoding, and the self-matching network to the original RNN encoder-decoder model. As a result, the best performance is achieved by the attention-based RNN depicted in Figure 3. the decoder and the hidden state of the encoder when performing decoding.

Model 1: Attention-based Enc-Dec model
The first model proposed in this paper is a general attention mechanism-based Enc-Dec model, as shown in Figure 1. The input of the encoder is one document that contains sentences (multiparty dialogue). Each sentence consists of words, and the input sequence is = { 1 , 2 , … , } . The input to the decoder is = { 0 , 1 , … , } consisting of the positions of the words given in the gold mentions, and the output sequence accordingly becomes = { 0 , 1 , … , } consisting of the character number, which is corresponded with the decoder's input mentions.
We use word embedding and adopt the Kdimensional word embedding , ∈ [1, ] for all input words, where is the word index in the input sequence. We perform feature embedding for three featuresspeaker, named entity recognition (NER) tags, and capitalizationand concatenate them to make ̃. The uppercase feature is a binary feature (1 or 0) that verifies whether the uppercase is included in the word. 10dimensional speaker embedding for a total of 205 different types of speakers included by "unknown". 19-dimensional NER embedding for a total of 19 different types of NER tags.
We use bidirectional gated recurrent unit (BiGRU) (Cho et al., 2014) for the encoder. The hidden state of the encoder for the input (word) sequence is defined as ℎ .
The inpu generated b position of quence. The receives the sponding to coder and th (ℎ , ℎ At the att attention w score for th and the enc tion layer ac gold mentio ing the atte vector . W attention which is ba attention w high score f tion.

= argm
After calc input of the we calculate tor , deco state ℎ are layer. Next, late the alig character in sponding to using the ar

Evaluation
Following Chen et al. (2017), the labeling accuracy (Acc) and the macro-average F1 score (F1) are used for the evaluation (C: the total number of characters, F 1 i : the F1-score for the i'th character): F 1 i Table 6 shows the overall scores from all submitted systems. Two types of evaluation are performed for this task. The first one is based on seven characters where six of them compose the primary characters (Section 3.2) and every other character is grouped as one entity called Others (Main + Others). The other is based on 78 characters comprising all characters appeared in the dataset, except for the ones appear either in the training or the evaluation set but not both, which is grouped to the Others (ALL).

Analysis
Based on the evaluation results, several interesting observations can be made for how different system architectures affect model performance on this task. The analysis in this section primarily focuses on the top-2 scoring systems, AMORE-UPF an KNU-CI, as their results vastly outperform the other two and the authors of those systems provide more detailed descriptions to the organizers.

Overall Performance
It is worth pointing out the significance of the two evaluation metrics proposed in Section 5 in terms of the model performance. The labeling accuracy indicates the raw predicative power of the model. This metric is biased towards more frequently appearing characters such as the primary characters, a total of which compose 70+% of the evaluation set. Thus, it is possible to achieve a relatively high labeling accuracy score without handling referents for the secondary characters well. On the contrary, the macro-average F1 score neutralizes the imbalance between frequently and not so frequently appearing characters. It reveals the model performance on a per-entity basis, which tends to favor transient and extra characters more because every character is treated equally in this metric.
For the overall performance, KNU-CI outperforms for Main + Others with the labeling accuracy of 85.10% and the macro-average F1 score of 86.00%, whereas AMORE-UPF outperforms for ALL with the labeling accuracy of 74.72% and the macroaverage F1 of 41.05% (Table 6). All systems produce better results for Main + Others than ALL, which is expected due to the fewer number of entities to classify (7 vs 78). It is possible that KNU-CI's attention model is highly optimized for the identification of the primary characters, whereas AMORE-UPF's LSTM model distributes weights for the secondary characters more evenly, but more detailed analysis needs to be made to see the comparative strengths between these two systems. Table 7 depicts the strength of the KNU-CI system for the primary characters in comparisons to the others, which is attributed to its unique sequence labeling architecture and the attention mechanism. Their encoder-decoder architecture helps consolidating sequential information of the input dialogue along with the mentions. The hidden units in RNNs enable the network to aggregate character-related information and to disambiguate timeline shifts across utterances. The encoder takes the input dialogue and provides the decoder with context-rich features. Coupled with the attention mechanism, this model focuses on the primary characters; thus, it results better performance on Main + Others. However, this architecture is not as well-adaptive as the number of characters increases for the identification, which can be observed from the system's low macro-average F1 score for All.

Conclusion
In this shared task, we propose a novel entity linking task called character identification that aims to find the global entities for all personal mentions, representing individual characters in the contexts of multiparty dialogue. Among 90+ participants signed up for this task at CodaLab, only four submitted their system outputs, which is unfortunate. However, the top-2 scoring systems depict unique strengths, allowing us to make a good analysis for this task. It would be interesting to see if the sequence labeling architecture from KNU-CI coupled with the entity library from AMORE-UPF could produce an even higher performing model for both the Main + Other and All evaluation.
To facilitate the momentum, we create an opensource project that will continuously support this task. 8 It is worth mentioning that Character Identification is a part of a bigger project called Character Mining that strives for machine comprehension on dialog. 9 Currently, this project provides more and cleaner annotation for character identification than the corpus described in Section 3, hoping to engage more researchers to this task.