Dialogue-Based Relation Extraction

We present the first human-annotated dialogue-based relation extraction (RE) dataset DialogRE, aiming to support the prediction of relation(s) between two arguments that appear in a dialogue. We further offer DialogRE as a platform for studying cross-sentence RE as most facts span multiple sentences. We argue that speaker-related information plays a critical role in the proposed task, based on an analysis of similarities and differences between dialogue-based and traditional RE tasks. Considering the timeliness of communication in a dialogue, we design a new metric to evaluate the performance of RE methods in a conversational setting and investigate the performance of several representative RE methods on DialogRE. Experimental results demonstrate that a speaker-aware extension on the best-performing model leads to gains in both the standard and conversational evaluation settings. DialogRE is available at https://dataset.org/dialogre/.


Introduction
Cross-sentence relation extraction, which aims to identify relations between two arguments that are not mentioned in the same sentence or relations that cannot be supported by any single sentence, is an essential step in building knowledge bases from large-scale corpora automatically (Ji et al., 2010; Swampillai and Stevenson, 2010; Surdeanu, 2013). It has yet to receive extensive study in natural language processing, however. In particular, although dialogues readily exhibit cross-sentence relations, most existing relation extraction tasks focus on texts from formal genres such as professionally written and edited news reports or well-edited websites (Elsahar et al., 2018; Yao et al., 2019; Mesquita et al., 2019; Grishman, 2019), while dialogues have been under-studied.

S1: Hey Pheebs.
S2: Hey!
S1: Any sign of your brother?
S2: No, but he's always late.
S1: I thought you only met him once?
S2: Yeah, I did. I think it sounds y'know big sistery, y'know, 'Frank's always late.'
S1: Well relax, he'll be here.

Argument pair | Trigger | Relation type
R1 (Frank, S2) | brother | per:siblings
R2 (S2, Frank) | brother | per:siblings
R3 (S2, Pheebs) | none | per:alternate names
R4 (S1, Pheebs) | none | unanswerable

Table 1: A dialogue and its associated instances in DialogRE. S1, S2: anonymized speakers of each utterance.

† Equal contribution.
In this paper, we take an initial step towards studying relation extraction in dialogues by constructing the first human-annotated dialogue-based relation extraction dataset, DialogRE. Specifically, we annotate all occurrences of 36 possible relation types that exist between pairs of arguments in the 1,788 dialogues originating from the complete transcripts of Friends, a corpus that has been widely employed in dialogue research in recent years (Catizone et al., 2010;Chen and Choi, 2016;Zhou and Choi, 2018;Rashid and Blanco, 2018;Yang and Choi, 2019). Altogether, we annotate 10,168 relational triples. For each (subject, relation type, object) triple, we also annotate the minimal contiguous text span that most clearly expresses the relation; this may enable researchers to explore relation extraction methods that provide fine-grained explanations along with evidence sentences. For example, the bolded text span "brother" in Table 1 indicates the PER:SIBLINGS relation (R1 and R2) between speaker 2 (S2) and "Frank".
Our analysis of DialogRE indicates that the supporting text for most (approximately 96.0%) annotated relational triples includes content from multiple sentences, making the dataset ideal for studying cross-sentence relation extraction. This is perhaps because of the higher personal pronoun frequency (Biber, 1991) and lower information density (Wang and Liu, 2011) in conversational texts than in formal written texts. In addition, 65.9% of relational triples involve arguments that never appear in the same turn, suggesting that multi-turn information may play an important role in dialogue-based relation extraction. For example, to justify that "Pheebs" is an alternate name of S2 in Table 1, the response of S2 in the second turn is required as well as the first turn.
We next conduct a thorough investigation of the similarities and differences between dialogue-based and traditional relation extraction tasks by comparing DialogRE and the Slot Filling dataset (McNamee and Dang, 2009; Ji et al., 2010, 2011; Surdeanu, 2013; Surdeanu and Ji, 2014), and we argue that a relation extraction system should be aware of speakers in dialogues. In particular, most relational triples in DialogRE (89.9%) signify either an attribute of a speaker or a relation between two speakers. The same phenomenon occurs in an existing knowledge base constructed by encyclopedia collaborators for the same dialogue corpus we use for annotation (Section 3.2). Unfortunately, most previous work directly applies existing relation extraction systems to dialogues without explicitly considering the speakers involved (Yoshino et al., 2011; Wang and Cardie, 2012).
Moreover, traditional relation extraction methods typically output a set of relations only after they have read the entire document and are free to rely on the existence of multiple mentions of a relation throughout the text to confirm its existence. However, these methods may be insufficient for powering a number of practical real-time dialogue-based applications such as chatbots, which would likely require recognition of a relation at its first mention in an interactive conversation. To encourage automated methods to identify the relationship between two arguments in a dialogue as early as possible, we further design a new performance evaluation metric for the conversational setting, which can be used as a supplement to the standard F1 measure (Section 4.1).
In addition to dataset creation and metric design, we adapt a number of strong, representative learning-based relation extraction methods (Zeng et al., 2014; Cai et al., 2016; Yao et al., 2019; Devlin et al., 2019) and evaluate them on DialogRE to establish baseline results on the dataset going forward. We also extend the best-performing method (Devlin et al., 2019) among them by letting the model be aware of the existence of arguments that are dialogue participants (Section 4.2). Experiments on DialogRE demonstrate that this simple extension nevertheless yields substantial gains on both standard and conversational RE evaluation metrics, supporting our assumption regarding the critical role of tracking speakers in dialogue-based relation extraction (Section 5).
The primary contributions of this work are as follows: (i) we construct the first human-annotated dialogue-based relation extraction dataset and thoroughly investigate the similarities and differences between dialogue-based and traditional relation extraction tasks, (ii) we design a new conversational evaluation metric that features the timeliness aspect of interactive communications in dialogue, and (iii) we establish a set of baseline relation extraction results on DialogRE using standard learning-based techniques and further demonstrate the importance of explicit recognition of speaker arguments in dialogue-based relation extraction.

Data Construction
We use the transcripts of all ten seasons (263 episodes in total) of an American television situation comedy Friends, covering a range of topics. We remove all content (usually in parentheses or square brackets) that describes non-verbal information such as behaviors and scene information.

Relation Schema
We follow the slot descriptions 1 of the Slot Filling (SF) task in the Text Analysis Conference Knowledge Base Population (TAC-KBP) (McNamee and Dang, 2009; Ji et al., 2010, 2011; Surdeanu, 2013; Surdeanu and Ji, 2014), which primarily focuses on biographical attributes of person (PER) entities and important attributes of organization (ORG) entities. As the range of topics in Friends is relatively restricted compared to large-scale news corpora such as Gigaword (Parker et al., 2011), some relation types (e.g., PER:CHARGES and ORG:SUBSIDIARIES) seldom appear in the texts. Additionally, we consider new relation types such as PER:GIRL/BOYFRIEND and PER:NEIGHBOR that frequently appear in Friends. We list all 36 relation types that have at least one relational instance in the transcripts in Table 2 and provide definitions and examples of the new relation types in Appendix A.1.

Annotation
We focus on the annotation of relational triples (i.e., (subject, relation type, object)) in which at least one of the arguments is a named entity. We regard an uninterrupted stream of speech from one speaker and the name of this speaker as a turn.
As we follow the TAC-KBP guidelines to annotate relation types and design new types, we use internal annotators (two authors of this paper) who are familiar with this task. In a pilot annotation, annotator A annotates relational triples in each scene in all transcripts and forms a dialogue by extracting the shortest snippet of contiguous turns that covers all annotated relational triples and sufficient supporting context in the scene. The guidelines are adjusted during annotation. 2 We prefer to use a speaker name (i.e., the first word or phrase of a turn, followed by a colon) as one argument of a speaker-related triple if the corresponding full name or alternate names of the speaker also appear in the same dialogue, except for the relation PER:ALTERNATE NAMES, for which both mentions are regarded as arguments. For an argument pair (i.e., (subject, object)), there may exist multiple relations between the two arguments, and we annotate all instances of all of them. For each triple, we also annotate its trigger: the smallest extent (i.e., span) of contiguous text in the dialogue that most clearly indicates the existence of the relation between the two arguments. If multiple spans can serve as triggers, we keep only one per triple. For relation types such as PER:TITLE and PER:ALTERNATE NAMES, it is difficult to identify such supportive contexts, and we therefore leave their triggers empty. For each relational triple, we annotate its inverse triple if the corresponding inverse relation type exists in the schema (e.g., PER:CHILDREN and PER:PARENTS), with the trigger unchanged.
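The inverse-triple step above can be sketched as follows. This is an illustrative sketch rather than the authors' annotation tooling; the inverse-relation map shown here is a tiny assumed subset of the full schema.

```python
# Sketch: generating inverse triples as described in the annotation
# procedure. INVERSE is an assumed partial map; symmetric relation
# types (e.g., per:siblings) invert to themselves.
INVERSE = {
    "per:children": "per:parents",
    "per:parents": "per:children",
    "per:siblings": "per:siblings",
}

def add_inverse_triples(triples):
    """triples: list of (subject, relation, object, trigger) tuples."""
    out = list(triples)
    for subj, rel, obj, trig in triples:
        inv = INVERSE.get(rel)
        if inv is not None:
            out.append((obj, inv, subj, trig))  # the trigger stays unchanged
    return out

print(add_inverse_triples([("S2", "per:children", "Frank Jr.", "my son")]))
```

Relation types without a registered inverse simply yield no extra triple, matching the rule that an inverse is annotated only when the inverse type exists in the schema.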
In the second process, annotator B annotates the possible relations between the candidate pairs annotated by annotator A (previous relation labels are hidden). Cohen's kappa between the two annotators is around 0.87. We remove the cases in which the annotators cannot reach a consensus. On average, each dialogue in DialogRE contains 4.5 relational triples and 12.9 turns, as shown in Table 3. See Table 1 for an example of an annotated dialogue.

Negative Instance Generation, Data Split, and Speaker Name Anonymization
After our first round of annotation, we use any two annotated arguments associated with each dialogue to generate candidate relational triples, in which the relation between two arguments is unanswerable based on the given dialogue or beyond our relation schema. We manually filter out candidate triples for which there is "obviously" no relation between an argument pair in consideration of aspects such as argument type constraints (e.g., relation PER:SCHOOLS ATTENDED can only exist between a PER name and an ORG name). After filtering, we keep 2,100 triples in total, whose two arguments are in "no relation", and we finally have 10,168 triples for 1,788 dialogues. We randomly split them at the dialogue level, with 60% for training, 20% for development, and 20% for testing. The focus of the proposed task is to identify relations between argument pairs based on a dialogue, rather than exploiting information in DialogRE beyond the given dialogue or leveraging external knowledge to predict the relations between arguments (e.g., characters) specific to a particular television show. Therefore, we anonymize all speaker names (Section 2.2) in each dialogue and annotated triples and rename them in chronological order within the given dialogue. For example, S1 and S2 in Table 1 represent the original speaker names "Rachel" and "Phoebe", respectively.
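The candidate-pair generation and speaker anonymization steps can be sketched as below. This is an illustrative sketch under our own naming, not the authors' pipeline; the manual filtering step is of course not reproducible in code.

```python
import itertools

# Sketch: enumerate candidate "no relation" pairs from annotated
# arguments, skipping pairs that already carry an annotated relation.
def candidate_pairs(arguments, annotated_pairs):
    for a1, a2 in itertools.combinations(arguments, 2):
        if (a1, a2) not in annotated_pairs and (a2, a1) not in annotated_pairs:
            yield (a1, a2)

# Sketch: rename speakers S1, S2, ... in order of first appearance,
# as done for the anonymized dialogues.
def anonymize_speakers(turns):
    mapping = {}
    out = []
    for speaker, text in turns:
        if speaker not in mapping:
            mapping[speaker] = "S%d" % (len(mapping) + 1)
        out.append((mapping[speaker], text))
    return out, mapping

turns = [("Rachel", "Hey Pheebs."), ("Phoebe", "Hey!"),
         ("Rachel", "Any sign of your brother?")]
anon, mapping = anonymize_speakers(turns)
print(mapping)  # {'Rachel': 'S1', 'Phoebe': 'S2'}
```

The surviving candidate pairs would then be screened manually against type constraints (e.g., PER:SCHOOLS ATTENDED requiring a PER and an ORG argument) before being kept as "no relation" instances.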

Comparison Between DialogRE and SF
As a pilot study, we examine the similarities and differences between dialogue-based and traditional relation extraction datasets that are manually annotated. We compare DialogRE with the official SF (2013-2014) dataset (Surdeanu, 2013; Surdeanu and Ji, 2014), as 47.2% of relation types in DialogRE originate from the SF relation types (Section 2.1), and 92.2% of its source documents that contain ground truth relational triples are formally written newswire reports (72.8%) or well-edited web documents (19.4%), with the remainder coming from discussion fora.
In particular, the subjects of 77.3% of relational triples are speaker names, and more than 90.0% of relational triples contain at least one speaker argument. The high percentage of "speaker-centric" relational triples and the low percentage of ORG and GPE arguments in DialogRE are perhaps because the transcripts for annotation are from a single situation comedy that involves a small group of characters in a very limited number of scenes (see more discussions in Section 5.3).
Distance Between Argument Pairs: It has been shown that there is a longer distance between two arguments in the SF dataset (Surdeanu, 2013; Huang et al., 2017) compared to that in many widely used human-annotated relation extraction datasets such as ACE (Doddington et al., 2004) and SemEval (Hendrickx et al., 2010). However, it is not trivial to compute an accurate distance between two arguments in a dialogue, especially for cases containing arguments that are speaker names. We instead consider different types of distances (e.g., average and minimum) between two argument mentions in a dialogue. We argue that DialogRE exhibits a similar level of difficulty as SF from the perspective of the distance between two arguments: 41.3% of argument pairs are separated by at least seven words even considering the minimum distance, and the percentage can reach as high as 96.5% considering the average distance, compared with 46.0% in SF (Huang et al., 2017) and 59.8% in a recently released cross-sentence relation extraction dataset, DocRED, in which Wikipedia articles serve as documents (Yao et al., 2019). Note that the provenance/evidence sentences in SF and DocRED are provided by automated systems or annotators. Also, 95.6% of relational triples from an annotated subset of DialogRE (Section 5.2) require reasoning over multiple sentences in a dialogue, compared with 40.7% in DocRED (Table 7). See Figure 3 in Appendix A.3 for more details.
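The minimum and average distances discussed above can be computed as follows. This is a sketch under our own naming; it assumes mention positions are token offsets in the concatenated dialogue, which is one reasonable reading of the text.

```python
# Sketch: minimum and average token distance between all mention pairs
# of two arguments in a dialogue. positions_a1/positions_a2 are token
# indices where each argument is mentioned.
def mention_distances(positions_a1, positions_a2):
    dists = [abs(i - j) for i in positions_a1 for j in positions_a2]
    return min(dists), sum(dists) / len(dists)

# e.g., a1 mentioned at tokens 3 and 40, a2 at token 10
mn, avg = mention_distances([3, 40], [10])
print(mn, avg)  # 7 18.5
```

For speaker-name arguments, which recur at the start of every turn the speaker utters, the minimum and average distances can diverge sharply, which is why the paper reports both.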

Comparison Between DialogRE and Existing Relational Triples
We also collect 2,341 relational triples related to Friends, which are summarized by a community of contributors, from a collaborative encyclopedia. 3 We remove triples of content-independent relation types such as DIRECTED BY, GUEST STARS, and NUMBER OF EPISODES.
3 https://friends.fandom.com/wiki/Friends.

We find that 93.8% of all 224 relation types in these triples can be mapped to one of the 36 relation types in our relation schema (e.g., HUSBAND, EX-HUSBAND, and WIFE can be mapped to PER:SPOUSE); the exceptions are relatively rare or implicit relation types such as PROM DATE, GENDER, and KISSED. This demonstrates that the relation schema we use for annotation covers most of the important relation types labeled by the encyclopedia community of contributors.
On the other hand, the relatively small number of existing triples and the moderate size of our annotated triples in DialogRE may suggest the low information density (Wang and Liu, 2011) of conversational speech in terms of relation extraction. For example, the average number of annotated triples per sentence in DialogRE is merely 0.21, compared to other exhaustively annotated datasets such as ACE (0.73) and KnowledgeNet (Mesquita et al., 2019) (1.44), in which the corpora are formal written news reports and Wikipedia articles, respectively.

Discussions on Triggers
As annotated triggers are rarely available in existing relation extraction datasets (Aguilar et al., 2014), the connections between different relation types and trigger existence are under-investigated.
Relation Type: In DialogRE, 49.6% of all relational triples are annotated with triggers. We find that argument pairs are frequently accompanied by triggers when (1) the two arguments have the same type, as in PER:FRIENDS, (2) strong emotions are involved (e.g., PER:POSITIVE(NEGATIVE) IMPRESSION), or (3) the relation type is related to death or birth (e.g., GPE:BIRTHS IN PLACE). In comparison, a relation between two arguments of different types (e.g., PER:ORIGIN and PER:AGE) is more likely to be expressed implicitly instead of relying on triggers. This is perhaps because there exist fewer possible relations between such an argument pair compared to arguments of the same type, and a relatively short distance between such an argument pair might be sufficient to help the listeners understand the message correctly. For each relation type, we report the percentage of relational triples with triggers in Table 2.
Argument Distance: We assume the existence of triggers may allow a longer distance between argument pairs in a text as they help to decrease ambiguity. This assumption may be empirically validated by the longer average distance (68.3 tokens) between argument pairs with triggers in a dialogue, compared to the distance (61.2 tokens) between argument pairs without any triggers.

Dialogue-Based Relation Extraction
Given a dialogue D = s_1: t_1, s_2: t_2, . . . , s_m: t_m and an argument pair (a_1, a_2), where s_i and t_i denote the speaker ID and text of the i-th turn, respectively, and m is the total number of turns, we evaluate the performance of approaches in extracting relations between a_1 and a_2 that appear in D in the following two settings.
Standard Setting: Following the standard setting of relation extraction tasks, we regard dialogue D as document d. The input is a_1, a_2, and d, and the expected output is the relation type(s) between a_1 and a_2 based on d. We adopt F1, the harmonic mean of precision (P) and recall (R), for evaluation.
Conversational Setting: Instead of only considering the entire dialogue, here we regard the first i (1 ≤ i ≤ m) turns of the dialogue as d. Accordingly, we propose a new metric F1_c, the harmonic mean of conversational precision (P_c) and recall (R_c), as a supplement to the standard F1. We start by introducing some notation used in the definition of F1_c. Let O_i denote the set of predicted relation types when the input is a_1, a_2, and the first i turns (i.e., d = s_1: t_1, s_2: t_2, . . . , s_i: t_i). For an argument pair (a_1, a_2), let L denote its corresponding set of relation types that are manually annotated based on the full dialogue, and let R represent the set of 36 relation types. By definition, O_i, L ⊆ R. We define the auxiliary function ℓ(x) to return m if x does not appear in D, and the index of the turn where x first appears otherwise.
We define the auxiliary function ı(r) as follows: (i) for each relation type r ∈ L, if there exists an annotated trigger λ_r for r, then ı(r) = ℓ(λ_r); otherwise, ı(r) = m. (ii) For each r ∈ R \ L, ı(r) = 1. We define the set E_i of relation types that are evaluable based on the first i turns as:

E_i = { r | max{ℓ(a_1), ℓ(a_2), ı(r)} ≤ i }    (1)

The interpretation of Equation 1 is that, given d containing the first i turns in a dialogue, a relation type r associated with a_1 and a_2 is evaluable if a_1, a_2, and the trigger for r have all been mentioned in d. The definition is based on our assumption that we can roughly estimate how many turns are required to predict the relations between two arguments based on the positions of the arguments and triggers, which most clearly express relations. See Section 5.2 for more discussions. The conversational precision and recall for an input instance (D, a_1, a_2) are defined as:

P_c(D, a_1, a_2) = Σ_{i=1}^{m} |O_i ∩ L ∩ E_i| / Σ_{i=1}^{m} |O_i ∩ E_i|
R_c(D, a_1, a_2) = Σ_{i=1}^{m} |O_i ∩ L ∩ E_i| / Σ_{i=1}^{m} |L ∩ E_i|

We average the conversational precision/recall scores of all instances to obtain the final conversational precision P_c and recall R_c, and F1_c = 2 · P_c · R_c / (P_c + R_c).
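The instance-level computation can be sketched as follows. Names are ours, and the exact form of the per-instance sums is our reading of the (partially garbled) text; it follows the stated definitions of O_i, L, R, E_i, ℓ, and ı.

```python
# Sketch: conversational precision/recall for one instance.
# O: list of predicted relation-type sets O_1..O_m (one per prefix length).
# L: gold relation types for (a1, a2); R: all relation types.
# ell_a1, ell_a2: turn indices where a1/a2 first appear (m if absent).
# trigger_turn(r): plays the role of i(r) from the definition above.
def conversational_prf(O, L, R, m, ell_a1, ell_a2, trigger_turn):
    num = p_den = r_den = 0
    for i in range(1, m + 1):
        # E_i: relation types evaluable from the first i turns
        E_i = {r for r in R if max(ell_a1, ell_a2, trigger_turn(r)) <= i}
        O_i = O[i - 1]
        num += len(O_i & L & E_i)
        p_den += len(O_i & E_i)
        r_den += len(L & E_i)
    P_c = num / p_den if p_den else 0.0
    R_c = num / r_den if r_den else 0.0
    return P_c, R_c
```

A method that predicts the gold relation as soon as it becomes evaluable scores 1.0 on both measures; late predictions lower the sums for the earlier prefixes and thus the final scores.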

Baselines
Majority: If a given argument pair does not appear in the training set, output the majority relation type in the training set as the prediction. Otherwise, output the most frequent relation type associated with the two arguments in the training set.
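The Majority baseline can be sketched in a few lines (class and method names are ours):

```python
from collections import Counter

# Sketch of the Majority baseline: predict the most frequent relation
# type seen with the argument pair in training; fall back to the
# globally most frequent relation type for unseen pairs.
class MajorityBaseline:
    def fit(self, triples):  # triples: (subject, object, relation)
        self.global_majority = Counter(r for _, _, r in triples).most_common(1)[0][0]
        per_pair = {}
        for s, o, r in triples:
            per_pair.setdefault((s, o), Counter())[r] += 1
        self.pair_majority = {k: c.most_common(1)[0][0] for k, c in per_pair.items()}
        return self

    def predict(self, a1, a2):
        return self.pair_majority.get((a1, a2), self.global_majority)

mb = MajorityBaseline().fit([("S1", "S2", "per:friends"),
                             ("S1", "S2", "per:friends"),
                             ("S2", "Frank", "per:siblings")])
print(mb.predict("S1", "S2"), mb.predict("X", "Y"))  # per:friends per:friends
```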
CNN/LSTM/BiLSTM: We adapt convolutional and recurrent baselines (Zeng et al., 2014; Cai et al., 2016; Yao et al., 2019), whose input representations are the concatenation of word embeddings (Pennington et al., 2014), mention embeddings, and type embeddings. We assign the same mention embedding to mentions of the same argument and obtain the type embeddings based on the named entity types of the two arguments. We use spaCy 4 for entity typing.

BERT: We follow the framework of fine-tuning a pre-trained language model on a downstream task (Radford et al., 2018; Devlin et al., 2019), taking the dialogue d and the argument pair (a_1, a_2), separated by special tokens, as the input sequence.

BERT_S: We propose a modification to the input sequence of the above BERT baseline with two motivations: (1) help a model locate the start positions of relevant turns based on the arguments that are speaker names, and (2) prevent a model from overfitting to the training data. Formally, given an argument pair (a_1, a_2) and its associated document d = s_1: t_1, s_2: t_2, . . . , s_n: t_n, we construct d̂ = ŝ_1: t_1, ŝ_2: t_2, . . . , ŝ_n: t_n, where ŝ_i is a reserved special token [S_j] if s_i is identical to a_j (j ∈ {1, 2}), and s_i otherwise.
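The speaker-replacement step of BERT_S can be sketched as below. Since the equation defining ŝ_i is garbled in this copy, the exact formulation is our reconstruction; the sketch covers only the speaker fields, not mentions inside the utterance text.

```python
# Sketch: replace a speaker name s_i with a reserved token [S_j]
# whenever it matches argument a_j, so the model can locate the turns
# uttered by the arguments.
def speaker_aware_turns(turns, a1, a2):
    special = {a1: "[S1]", a2: "[S2]"}
    return [(special.get(s, s), t) for s, t in turns]

turns = [("S1", "Hey Pheebs."), ("S2", "Hey!")]
print(speaker_aware_turns(turns, "S2", "Pheebs"))
# [('S1', 'Hey Pheebs.'), ('[S1]', 'Hey!')]
```

Because [S1] and [S2] are tied to argument positions rather than to particular character names, the same tokens generalize across dialogues, which is consistent with the stated goal of reducing overfitting.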
Language Model Fine-Tuning: We use the uncased base model of BERT released by Devlin et al. (2019). We truncate a document when the input sequence length exceeds 512 and fine-tune BERT using a batch size of 24 and a learning rate of 3×10^-5.

4 https://spacy.io/.

Results and Discussions
We report the performance of all baselines in both the standard and conversational settings in Table 5. We run each experiment five times and report the average F1 and F1_c along with the standard deviation (σ). The fine-tuned BERT method already outperforms the other baselines (e.g., BiLSTM, which achieves 51.1% in F1 on DocRED (Yao et al., 2019)), and our speaker-aware extension to the BERT baseline further leads to 2.7% and 2.2% improvements in F1 and F1_c, respectively, on the test set of DialogRE, demonstrating the importance of tracking speakers in dialogue-based relation extraction.
Conversational Metric: We randomly select 269 and 256 instances, associated with 50 dialogues from each of the dev and test sets, respectively. For each of the relational instances (188 in total) previously labeled with triggers in these subsets, annotator A labels the smallest turn index i* such that the first i* turns contain sufficient information to justify a relation. The average distance between i* and our estimation max{ℓ(a_1), ℓ(a_2), ı(r)} in Equation 1 (Section 4.1) is only 0.9 turns, supporting our hypothesis that the positions of arguments and triggers may be good indicators for estimating the minimum number of turns humans need to make predictions. For convenience, we use BERT for the following discussions and comparisons.
Ground Truth Argument Types: The methods in Table 5 are not provided with ground truth argument types, considering the unavailability of this kind of annotation in practical use. To study the impact of argument types on DialogRE, we report the performance of four methods, each of which additionally takes the ground truth argument types as input, following previous work (Zhang et al., 2017; Yao et al., 2019). We adopt the same baseline for a direct comparison, except that the input sequence is changed.
In Method 1, we simply extend the original input sequence of BERT (Section 4.2) with newly-introduced special tokens that represent argument types: each argument a_i in the input sequence is preceded by τ_i, a special token representing the argument type of a_i (i ∈ {1, 2}). For example, given a_1 of type PER and a_2 of type STRING, τ_1 is [PER] and τ_2 is [STRING]. In Method 2, we extend the input sequence of BERT_S with τ_i defined in Method 1. We also follow the input sequences of previous single-sentence relation extraction methods (Shi and Lin, 2019; Joshi et al., 2020) and refer to them as Methods 3 and 4, respectively. We provide the implementation details in Appendix A.5. As shown in Table 6, the best performance, achieved by Method 2, is not superior to that of BERT_S, which does not leverage ground truth argument types. We thus conjecture that ground truth argument types may provide only a limited, if at all positive, contribution to the performance on DialogRE.

Table 6: Performance (F1 (σ)) comparison of methods that consider the ground truth argument types.
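Method 1's type-augmented input can be sketched as below. Since the equation giving the exact sequence is garbled in this copy, the token layout is our assumption, consistent with the description of τ_i above.

```python
# Sketch: build a BERT-style input string where each argument is
# preceded by a special token for its ground-truth type.
def build_input(dialogue_text, a1, a2, t1, t2):
    tau = lambda t: "[%s]" % t  # e.g., [PER], [STRING]
    return "[CLS] %s [SEP] %s %s [SEP] %s %s [SEP]" % (
        dialogue_text, tau(t1), a1, tau(t2), a2)

print(build_input("S1: Hey Pheebs. S2: Hey!", "S2", "Pheebs", "PER", "STRING"))
```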

Ground Truth Triggers:
We investigate what performance would ideally be attainable if the model could identify all triggers correctly. We append the ground truth triggers to the input sequence of the BERT baseline, and the F1 of this model is 74.9%, a 16.4% absolute improvement over the BERT baseline. In particular, through the introduction of triggers, we observe a 22.9% absolute improvement in F1 on relation types that are their own inverses (e.g., PER:ROOMMATE and PER:SPOUSE). These experimental results show the critical role of triggers in dialogue-based relation extraction. However, trigger identification is perhaps as difficult as relation extraction itself, and it is labor-intensive to annotate large-scale datasets with triggers. Future research may explore how to identify triggers based on a small number of human-annotated triggers as seeds (Bronstein et al., 2015; Yu and Ji, 2016).

Error Analysis and Limitations
We analyze the outputs on the dev set and find that BERT tends to make more mistakes when the relation to be predicted has an asymmetric inverse relation than when its inverse relation is symmetric. For example, the baseline mistakenly predicts S2 as the subordinate of S1 based on the following dialogue: ". . . S2: Oh. Well, I wish I could say no, but you can't stay my assistant forever. Neither can you Sophie, but for different reasons. S1: God, I am so glad you don't have a problem with this, because if you did, I wouldn't even consider applying. . . ". Introducing triggers into the input sequence leads to a relatively small gain (11.0% in F1 on all types with an asymmetric inverse relation), perhaps because inverse relation types share the same triggers (e.g., "my assistant" serves as the trigger for both PER:BOSS and PER:SUBORDINATE). One possible solution may be the use of directed syntactic graphs constructed from the given dialogue, though the performance of coreference resolution and dependency parsing in dialogues may be relatively unsatisfying.

A major limitation of DialogRE is that all transcripts for annotation are from Friends, which may limit the diversity of scenarios and the generality of the relation distributions. Considering the time-consuming manual annotation process, it may be useful to leverage existing triples in knowledge bases (e.g., Fandom) for thousands of movies or TV shows via distant supervision (Mintz et al., 2009). In addition, dialogues in Friends present less variation in linguistic features (Biber, 1991) than natural conversations; nonetheless, compared to other registers such as personal letters and prepared speeches, there are noticeable linguistic similarities between natural conversations and the television dialogues in Friends (Quaglio, 2009).
In this paper, we do not consider relations that take relations or events as arguments and are also likely to span multiple sentences (Pustejovsky and Verhagen, 2009;Do et al., 2012;Moschitti et al., 2013).

Relation Extraction Approaches
Over the past few years, neural models have achieved remarkable success in RE (Nguyen and Grishman, 2015b,a; Adel et al., 2016; Yin et al., 2017; Levy et al., 2017; Su et al., 2018; Song et al., 2018; Luo et al., 2019), in which the input representation usually comes from shallow neural networks over pre-trained word and character embeddings (Xu et al., 2015; Zeng et al., 2015; Lin et al., 2016). Deep contextualized word representations such as ELMo (Peters et al., 2018) are also applied as additional input features to boost performance (Luan et al., 2018). A recent thread is to fine-tune pre-trained deep language models on downstream tasks (Radford et al., 2018; Devlin et al., 2019), leading to further performance gains on many RE tasks (Alt et al., 2019; Shi and Lin, 2019; Baldini Soares et al., 2019; Peters et al., 2019; Wadden et al., 2019). We propose an improved method that explicitly considers speaker arguments, which are seldom investigated in previous RE methods.
Dialogue-Based Natural Language Understanding: To advance progress in spoken language understanding, researchers have studied dialogue-based tasks such as argument extraction (Swanson et al., 2015), named entity recognition (Chen and Choi, 2016; Choi and Chen, 2018; Bowden et al., 2018), coreference resolution (Zhou and Choi, 2018), emotion detection (Zahiri and Choi, 2018), and machine reading comprehension (Ma et al., 2018; Yang and Choi, 2019). Besides, some pioneering studies focus on participating in dialogues (Yoshino et al., 2011; Hixon et al., 2015) by asking users relation-related questions, or use the outputs of existing RE methods as inputs of other tasks (Klüwer et al., 2010; Wang and Cardie, 2012). In comparison, we focus on extracting relational triples from human-human dialogues, which remains under-investigated.

Conclusions
We present the first human-annotated dialogue-based RE dataset, DialogRE. We also design a new metric to evaluate the performance of RE methods in a conversational setting and argue that tracking speakers plays a critical role in this task. We investigate the performance of several RE methods, and experimental results demonstrate that a speaker-aware extension on the best-performing model leads to substantial gains in both the standard and conversational settings.
In the future, we are interested in investigating the generality of our defined schema for other comedies and different conversational registers, identifying the temporal intervals when relations are valid (Surdeanu, 2013) in a dialogue, and joint dialogue-based information extraction as well as its potential combinations with multimodal signals from images, speech, and videos.
• per:girl/boyfriend: A relatively long-standing relationship compared to PER:POSITIVE IMPRESSION and PER:DATES, including but not limited to ex-relationships, partners, and engagement. The fact that two people dated once or several times alone cannot guarantee that there exists a PER:GIRL/BOYFRIEND relation between them; we label such an argument pair PER:DATES instead.
• per:neighbor: A neighbor could be a person who lives in your apartment building whether they are next door to you, or not. A neighbor could also be in the broader sense of a person who lives in your neighborhood.
• per:roommate: We regard that two persons are roommates if they share a living facility (e.g., an apartment or dormitory), and they are not family or romantically involved (e.g., per:spouse and per:girl/boyfriend).
• per:visited place: A person visits a place for a relatively short period of time (vs. PER:PLACE OF RESIDENCE). For example, we annotate ("Mike", per:visited place, "Barbados") in dialogue D2, with the corresponding trigger "coming to".

D2
Phoebe: Okay, not a fan of the tough love. Precious: I just can't believe that Mike didn't give me any warning. Phoebe: But he didn't really know, you know. He wasn't planning on coming to Barbados and proposing to me... Precious: He proposed to you? This is the worst birthday ever.
• per:works: The argument can be a piece of art, a song, a movie, a book, or a TV series.
• per:place of work: A location in the form of a string or a general noun phrase, where a person works such as "shop".
• per:pet: We prefer to use named entities as arguments. If there is no name associated with a pet, we keep its species (e.g., dog) mentioned in a dialogue.

A.4 Other Input Sequences
We also experiment with the following three alternative input sequences on the BERT baseline: