What You See is What You Get: Visual Pronoun Coreference Resolution in Dialogues

Grounding a pronoun to the visual object it refers to requires complex reasoning over various information sources, especially in conversational scenarios. For example, when people in a conversation talk about something all speakers can see, they often directly use pronouns (e.g., "it") to refer to it without any previous introduction. This poses a huge challenge for modern natural language understanding systems, particularly conventional context-based pronoun coreference models. To tackle this challenge, in this paper, we formally define the task of visual-aware pronoun coreference resolution (PCR) and introduce VisPro, a large-scale dialogue PCR dataset, to investigate whether and how visual information can help resolve pronouns in dialogues. We then propose a novel visual-aware PCR model, VisCoref, for this task and conduct comprehensive experiments and case studies on our dataset. Results demonstrate the importance of visual information for PCR in dialogues and show the effectiveness of the proposed model.


Introduction
The question of how human beings resolve pronouns has long been an attractive research topic in both the linguistics and natural language processing (NLP) communities, because pronouns themselves carry weak semantic meaning (Ehrlich, 1981) and the correct resolution of pronouns requires complex reasoning over various information. As a core task of natural language understanding, pronoun coreference resolution (PCR) (Hobbs, 1978) is the task of identifying the noun (phrase) that a pronoun refers to. Compared with the general coreference resolution task, string-matching methods are no longer effective for pronouns (Stoyanov et al., 2009), which makes PCR more challenging than the general coreference resolution task.
[Figure 1: An example of a visual-related dialogue. Two people are discussing a view they can both see (A: "What are they doing?" B: "Probably celebrating some holiday with such a big cake." A: "Can you read the writing on it?" B: "I can't tell." A: "What is it behind the cake?"). Pronouns and noun phrases referring to the same entity are marked in the same color: the first "it" refers to the object "the big cake" and the second "it" refers to the statue in the image.]
Recently, great efforts have been devoted to the coreference resolution task (Raghunathan et al., 2010; Clark and Manning, 2015, 2016; Lee et al., 2017), and good performance has been achieved on formal written text such as newspapers (Pradhan et al., 2012; Zhang et al., 2019a) and diagnosis records (Zhang et al., 2019b). However, when it comes to dialogues, where more abundant information is needed, the performance of existing models becomes less satisfying. The reason is that, different from formal written language, the correct understanding of spoken language often requires the support of other information sources. For example, when people chat with each other, if they intend to refer to some object that all speakers can see, they may directly use pronouns such as "it" instead of describing or mentioning the object in the first place. Sometimes, the object (name or text description) that a pronoun refers to may not even appear in the conversation, and thus one needs to ground the pronoun to something outside the text, which is extremely challenging for conventional approaches purely based on human-designed rules (Raghunathan et al., 2010) or contextual features (Lee et al., 2017).
A visual-related dialogue is shown in Figure 1. Both A and B are talking about a picture, in which several people are celebrating something. In the dialogue, the first "it" refers to "the big cake," which is relatively easy for conventional models because the correct mention appears just before the targeting pronoun. However, the second "it" refers to the statue in the image, which does not appear in the dialogue at all. Without the support of visual information, it is almost impossible to identify the coreference relation between "it" and "the statue." In this work, we focus on investigating how visual information can help better resolve pronouns in dialogues. To achieve this goal, we first create VisPro, a large-scale visual-supported PCR dataset. Different from existing datasets such as ACE (NIST, 2003) and the CoNLL shared task (Pradhan et al., 2012), VisPro is annotated on dialogues discussing given images. In total, VisPro contains annotations for 29,722 pronouns extracted from 5,000 dialogues. Once the dataset is created, we formally define a new task, visual pronoun coreference resolution (Visual PCR), and design a novel visual-aware PCR model, VisCoref, which can effectively extract information from images and leverage it to better resolve pronouns in dialogues. In particular, we align mentions in the dialogue with objects in the image and then jointly use the contextual and visual information for the final prediction.
The contribution of this paper is threefold: (1) we formally define the task of visual PCR; (2) we present VisPro, the first dataset that focuses on PCR in visual-supported dialogues; (3) we propose VisCoref, a visual-aware PCR model.
Comprehensive experiments and case studies are conducted to demonstrate the quality of VisPro and the effectiveness of VisCoref. The dataset, code, and models are available at: https://github.com/HKUST-KnowComp/Visual_PCR.

The VisPro Dataset
To generate a high-quality and large-scale visual-aware PCR dataset, we select VisDial (Das et al., 2017) as the base dataset and invite annotators to label it. In VisDial, each image is accompanied by a dialogue record discussing that image; one example is shown in Figure 1. In addition, VisDial provides a caption for each image, which brings more information for creating VisPro. In this section, we introduce the details of the dataset creation in terms of pre-processing, survey design, annotation, and post-processing.

Pre-processing
To make the annotation task clear to annotators and help them provide accurate annotations, we first extract all the noun phrases and pronouns with the Stanford Parser (Klein and Manning, 2003) and then provide the extracted noun phrases as candidate mentions for annotation. To avoid overlap between candidate noun phrases, we choose noun phrases with a height of two in the parse trees, i.e., noun phrases whose children are all preterminals. One example is shown in Figure 2. In the syntactic tree of the sentence "A man with a dog is walking on the grass," we choose "A man," "a dog," and "the grass" as candidates. If the height of noun phrases were not limited, the noun phrase "A man with a dog" would cover "A man" and "a dog," leading to confusion among the options.
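A minimal sketch of this extraction step is shown below, assuming an nltk-style Penn Treebank constituency tree (the paper uses the Stanford Parser; the helper name and the translation of the height convention are our own):

```python
import nltk

def candidate_noun_phrases(tree: nltk.Tree):
    """Return the NPs whose children are all preterminals (the paper's
    "height of two"), so that no candidate overlaps another."""
    candidates = []
    for subtree in tree.subtrees(lambda t: t.label() == "NP"):
        # Under nltk's convention, a preterminal such as (DT A) has
        # height 2, so we test each NP's children instead of the NP itself.
        if all(isinstance(c, nltk.Tree) and c.height() == 2 for c in subtree):
            candidates.append(" ".join(subtree.leaves()))
    return candidates

tree = nltk.Tree.fromstring(
    "(S (NP (NP (DT A) (NN man)) (PP (IN with) (NP (DT a) (NN dog))))"
    " (VP (VBZ is) (VBG walking) (PP (IN on) (NP (DT the) (NN grass)))))"
)
print(candidate_noun_phrases(tree))  # ['A man', 'a dog', 'the grass']
```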
Following previous work (Strube and Müller, 2003; Ng, 2005), we only select third-person personal (it, he, she, they, him, her, them) and possessive pronouns (its, his, her, their) as the targeting pronouns. In total, the VisPro dataset contains 29,722 pronouns from 5,000 dialogues selected out of the 133,351 dialogues in VisDial v1.0 (Das et al., 2017). We choose dialogues in which the number of pronouns ranges from four to ten for the following reasons. For one thing, dialogues with few pronouns are of little use to the task. For another, dialogues with too many pronouns often contain repeating pronouns referring to the same object, which makes the task too easy. The selected dialogues contain 5.94 pronouns on average. Figure 3 presents the distribution of different pronouns, from which we can see that "it" and "they" are used most frequently in the dialogues.

Survey Design
We divide the 29,722 pronouns from 5,000 dialogues into 3,304 surveys. In each survey, besides the normal questions, we include one checkpoint question to control the annotation quality. In total, each survey contains ten questions (nine normal questions and one checkpoint question), and each question is about one pronoun.
The survey consists of three parts. We begin by explaining the task to the annotators, including how to deal with particular cases such as multiword expressions. Then, we present examples to help the annotators better understand the task. Finally, we invite them to provide annotations for the pronouns in the dialogues.
One example of the annotation interface is shown in Figure 4. The text and the image of the dialogue are displayed on the left and right panels, respectively. For each targeting pronoun, the workers are asked to select all the mentions that it refers to. If any of the noun phrases is selected, the reference type of the pronoun on the right panel is set to "noun phrases in text" automatically, and vice versa. If the concept that the pronoun refers to is not available in the options, or the pronoun does not refer to anything in particular, workers are asked to choose "some concepts not present in text" or "the pronoun is non-referential" on the right panel, respectively. The nine normal questions in each survey are consecutive so that pronouns in the same dialogue are displayed sequentially.

Annotation
We employ the Amazon Mechanical Turk platform (MTurk) for our annotations.
We require that our annotators have more than 1,000 approved tasks and a task approval rate of more than 90%. They are also asked to pass a simple test of pronoun resolution to prove that they understand the task instructions. Based on these criteria, we identify 186 valid annotators. For each dialogue, we invite at least four different workers to annotate. In total, we collect 122,443 annotations at a total cost of USD 3,270.80. Annotators may complete multiple surveys; we ensure that each subsequent survey is generated from dialogues they have not previously labeled.

Post-processing
Before processing the annotation results, we first exclude the annotations of workers who fail to answer the checkpoint questions correctly. As a result, 116,300 annotations are kept, which is 95% of the overall annotation results. We then decide the gold mentions that pronouns refer to using the following procedure (a minimal sketch of the cluster-merging step follows the list):

• Step one: We look into the annotations of each worker to find the coreference clusters he or she annotates for each dialogue. To achieve this, we merge intersecting sets of noun-phrase antecedents for pronouns in the same dialogue. We observe that annotators often make the right decision for noun phrases near the anaphoric pronoun but neglect antecedents far away, which also happens in the annotation of other coreference datasets (Chen et al., 2018). Therefore, we generate clusters from different pronouns in the same dialogue rather than merging antecedents for each pronoun separately: if an entity is mentioned and referred to by pronouns multiple times in a dialogue, combining the antecedents of all its pronouns creates a more accurate coreference cluster for the entity.

• Step two: We adjudicate the coreference clusters for each dialogue by majority voting among all workers.

• Step three: We then decide the anaphoric type of each pronoun by voting. If a pronoun is considered to refer to some noun phrases in the text, we find the coreference cluster it belongs to and choose the noun phrases in the cluster that precede it as its antecedents.

• Step four: We randomly split the collected data into train/val/test sets of 4,000/500/500 dialogues, respectively.
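The cluster-merging step above amounts to taking unions of antecedent sets that share at least one mention. A minimal sketch, with an assumed data layout of one antecedent set per pronoun:

```python
def merge_clusters(antecedent_sets):
    """Merge intersecting sets of noun-phrase antecedents (one set per
    pronoun in a dialogue) into per-entity coreference clusters."""
    clusters = []
    for s in antecedent_sets:
        s = set(s)
        # Fold in every existing cluster that shares a mention with s,
        # so transitively linked antecedent sets end up together.
        for c in [c for c in clusters if c & s]:
            clusters.remove(c)
            s |= c
        clusters.append(s)
    return clusters

sets = [{"a big cake", "the cake"}, {"the cake", "such a big cake"}, {"the statue"}]
print(merge_clusters(sets))
# [{'a big cake', 'the cake', 'such a big cake'}, {'the statue'}]
```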
After collecting the data, we find that 73.43% of the pronouns act as an anaphor to some noun phrases, 5.67% do not have a suitable antecedent, and the remaining 20.90% are non-referential. Among all the pronouns that have noun phrases as antecedents, 13.45% do not have an antecedent in the dialogue context; for these, the antecedents labeled by the workers come from the caption. Each anaphoric pronoun has 2.06 antecedents on average. Finally, we calculate the inter-annotator agreement (IAA) to demonstrate the quality of the resulting dataset. Following conventional approaches (Pradhan et al., 2012), we use the average MUC score between individual workers and the adjudicated result as the evaluation metric. The final IAA score is 72.4, which indicates that the workers can clearly understand our task and provide reliable annotations.

The Task
In this work, we focus on jointly leveraging the contextual and visual information to resolve pronouns. Thus, we formally define visual-aware pronoun coreference resolution as follows: given an image I, a dialogue record D that discusses the content of I, and an external mention set M, for any pronoun p that appears in D, the goal is to optimize the following objective:

$$J = \frac{\sum_{c \in C} e^{F(c, p, I, D)}}{\sum_{s \in S} e^{F(s, p, I, D)}},$$

where $F(\cdot)$ is the overall scoring function of p referring to a mention given I and D; c and s denote a correct mention and a candidate mention, and C and S denote the correct mention set and the candidate mention set, respectively. Note that when no gold mentions are annotated, the union of all possible spans in D and M is used to form S.
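This objective is the marginal likelihood of the correct antecedents under a softmax over all candidate spans, as in end-to-end coreference models. A hedged PyTorch sketch (tensor names are ours, not from the released code):

```python
import torch

def marginal_log_likelihood(scores: torch.Tensor, gold_mask: torch.Tensor):
    """scores: (|S|,) holding F(s, p, I, D) for every candidate span s;
    gold_mask: (|S|,) with 1.0 for spans in the correct set C, else 0.0."""
    log_norm = torch.logsumexp(scores, dim=-1)                    # log-sum over S
    log_gold = torch.logsumexp(scores + gold_mask.log(), dim=-1)  # log-sum over C
    return log_gold - log_norm  # maximize this; negate to use as a loss
```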

The Model
The overall framework of the proposed model VisCoref is presented in Figure 5. [Figure 5: Overview of VisCoref. Text embedding e is acquired via an LSTM and an inner-span attention mechanism. Object labels are detected from the image, and the probability b of each text span referring to them is calculated. The contextual and visual scores F_c(n, p) and F_v(n, p) are computed on these features, and the final score F(n, p) is their weighted sum.] In VisCoref, we want to leverage both the contextual and visual information to resolve pronouns. Thus we split the scoring function into two parts:

$$F(n, p) = F_c(n, p) + \lambda_{vis} \cdot F_v(n, p),$$

where $F_c(\cdot)$ and $F_v(\cdot)$ are the scoring functions based on contextual and visual information, respectively, and $\lambda_{vis}$ is a hyper-parameter that controls the importance of visual information in the model. The details of the two scoring functions are described in the following subsections.

Contextual Scoring
Before computing F_c, we first need to encode the contextual information of all candidates and targeting pronouns through a mention representation module, shown as the dotted box in Figure 5. Following the end-to-end coreference model (Lee et al., 2017), a standard bidirectional LSTM (BiLSTM) (Hochreiter and Schmidhuber, 1997) is used to encode each span with attention (Bahdanau et al., 2015). Denote the initial embeddings of the words in a span s_i as x_1, ..., x_T and their encoded representations after the BiLSTM as x*_1, ..., x*_T. The weighted embedding of each span $\hat{x}_i$ is obtained by

$$\hat{x}_i = \sum_{t=1}^{T} a_t \cdot x_t,$$

where a_t is the inner-span attention computed by

$$a_t = \frac{e^{\alpha_t}}{\sum_{k=1}^{T} e^{\alpha_k}},$$

in which $\alpha_t = NN_\alpha(x^*_t)$ is obtained by a standard feed-forward neural network (we use NN to denote feed-forward neural networks). After that, we concatenate the embeddings of the starting word (x*_start) and the ending word (x*_end) of each span, as well as its weighted embedding ($\hat{x}_i$) and a length feature ($\phi(i)$), to form its final representation:

$$e_i = [x^*_{start}, x^*_{end}, \hat{x}_i, \phi(i)].$$

On top of the extracted mention representations, we then compute the contextual score as

$$F_c(n, p) = NN_c([e_p, e_n, e_p \odot e_n]),$$

where $[\,]$ represents concatenation, e_p and e_n are the mention representations of the targeting pronoun and the current candidate mention, and $\odot$ indicates element-wise multiplication.
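A minimal PyTorch sketch of this mention-representation module follows; the class and variable names are ours, not from the released code:

```python
import torch
import torch.nn as nn

class SpanRepresentation(nn.Module):
    """Builds e = [x*_start; x*_end; x_hat; phi(i)] with inner-span attention."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        # alpha_t = NN_alpha(x*_t): one attention logit per BiLSTM state
        self.attn_scorer = nn.Linear(2 * hidden_dim, 1)

    def forward(self, x, x_star, length_feature):
        """x: (T, emb_dim) initial word embeddings of one span;
        x_star: (T, 2*hidden_dim) their BiLSTM encodings;
        length_feature: (feat_dim,) the span length feature phi(i)."""
        alpha = self.attn_scorer(x_star).squeeze(-1)  # (T,)
        a = torch.softmax(alpha, dim=-1)              # inner-span attention a_t
        x_hat = (a.unsqueeze(-1) * x).sum(dim=0)      # weighted span embedding
        return torch.cat([x_star[0], x_star[-1], x_hat, length_feature], dim=-1)
```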

Visual Scoring
In order to align mentions in the text with objects in the image, the first step of leveraging visual information is to recognize the objects in the picture. We use an object detection module to identify object labels in each image I, such as "person," "car," or "dog." After that, we convert the identified labels into vector representations following the same encoding process as in Section 4.1. For each image, we add a label "null," indicating that a pronoun refers to none of the detected objects in I. We denote the resulting embeddings as e_{c_1}, e_{c_2}, ..., e_{c_K}, where c_k denotes a detected label and K is the total number of unique labels for the corresponding image.

After extracting objects from the image, we would like to know whether the mentions refer to them. To achieve this goal, we calculate the similarity between a mention n_i and each detected object c_k in a projected space:

$$\beta_{n_i, c_k} = NN_{proj}(e_{n_i}) \cdot NN_{proj}(e_{c_k}).$$

Then we take the softmax of $\beta_{n_i, c_k}$ as the final probability of n_i aligning with the object label c_k:

$$b_{n_i, c_k} = \frac{e^{\beta_{n_i, c_k}}}{\sum_{k'=1}^{K} e^{\beta_{n_i, c_{k'}}}}.$$

If n_i corresponds to a certain object in I, the score of that label should be large; otherwise, the probability of "null" should be the largest. Therefore, we use the maximum probability over all K labels except "null" as the probability of n_i being related to some object in I:

$$m_{n_i} = \max_{c_k \neq null} b_{n_i, c_k}.$$

Similarly, given two mentions n_i and n_j, if they refer to the same detected object c_k, then both $b_{n_i, c_k}$ and $b_{n_j, c_k}$ should be large. Thus, we use the maximum of their product over all K labels except "null" as the probability of n_i and n_j being related to the same object in I:

$$m_{n_i, n_j} = \max_{c_k \neq null} b_{n_i, c_k} \cdot b_{n_j, c_k}.$$

In the end, we define the visual scoring function as

$$F_v(n, p) = NN_v([m_p, m_n, m_p \cdot m_n, m_{p,n}]).$$
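A hedged sketch of this visual scoring pipeline, assuming index 0 of the label embeddings is reserved for "null" and using a learned linear projection for NN_proj (function names and dimensions are ours):

```python
import torch

def alignment(e_mention, e_labels, proj):
    """e_mention: (d,); e_labels: (K, d), with row 0 = "null".
    Returns b: (K,) probabilities of the mention aligning with each label."""
    beta = proj(e_labels) @ proj(e_mention)  # similarity in the projected space
    return torch.softmax(beta, dim=-1)

def visual_score(e_p, e_n, e_labels, proj, nn_v):
    """F_v(n, p) = NN_v([m_p, m_n, m_p * m_n, m_{p,n}])."""
    b_p = alignment(e_p, e_labels, proj)
    b_n = alignment(e_n, e_labels, proj)
    m_p = b_p[1:].max()               # pronoun grounds to some object
    m_n = b_n[1:].max()               # candidate grounds to some object
    m_pn = (b_p[1:] * b_n[1:]).max()  # both ground to the same object
    return nn_v(torch.stack([m_p, m_n, m_p * m_n, m_pn]))
```

Here proj could be, e.g., a torch.nn.Linear mapping into the 512-dimensional shared space, and nn_v a small feed-forward network over the 4-dimensional feature vector, matching the sizes reported in the implementation details.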

The Experiment
In this section, we introduce the implementation details, experiment setting, and baseline models.

Experiment Setting
As introduced in Section 2.4, we randomly split the data into training, validation, and test sets of 4,000, 500, and 500 dialogues, respectively. For each dialogue, a mention pool of size 30 is provided for models to detect plausible mentions outside the dialogue. The pool contains both mentions extracted from the corresponding caption and negative mention samples randomly selected from other captions. All models are evaluated based on precision (P), recall (R), and F1 score. Finally, we split the test set by whether the correct antecedents of a pronoun appear in the dialogue or not, and denote these two groups as "Discussed" and "Not Discussed."

Implementation Details
Following previous work (Lee et al., 2017), we use the concatenation of the 300d GloVe embeddings (Pennington et al., 2014) and the ELMo (Peters et al., 2018) embeddings as the initial word representations. Out-of-vocabulary words are initialized with zero vectors. We adopt the "ssd_resnet_50_fpn_coco" model from the TensorFlow detection model zoo (https://github.com/tensorflow/models/tree/master/research/object_detection) as the object detection module. The size of the hidden states in the LSTM module is set to 200, and the size of the projected embeddings for computing the similarity between text spans and object labels is 512. The feed-forward networks for contextual scoring and visual scoring have two 150-dimensional hidden layers and one 100-dimensional hidden layer, respectively. For model training, we use cross-entropy as the loss function and Adam (Kingma and Ba, 2015) as the optimizer. All parameters are initialized randomly. At inference time, each mention selects the text span with the highest overall score among all previous text spans in the dialogue and the mention pool as its antecedent, so that all mentions in one dialogue are clustered into several coreference chains; the noun phrases in the same coreference cluster as a pronoun are deemed the predicted antecedents of that pronoun. The models are trained for up to 50,000 steps, and the best one is selected based on its performance on the validation set.
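A simplified sketch of this antecedent-selection inference step is shown below; score(i, j) stands in for the overall scorer F, and the "no antecedent" convention (linking only when the best score is positive, as in end-to-end coreference models) is our assumption:

```python
def cluster_mentions(mentions, score):
    """mentions: candidate spans in order (dialogue spans + mention pool);
    score(i, j): overall score of mention i taking span j as antecedent."""
    parent = {}
    for i in range(1, len(mentions)):
        best = max(range(i), key=lambda j: score(i, j))  # best previous span
        if score(i, best) > 0:  # otherwise the mention starts its own chain
            parent[i] = best
    clusters = {}
    for i in range(len(mentions)):
        root = i
        while root in parent:  # follow antecedent links to the chain head
            root = parent[root]
        clusters.setdefault(root, []).append(mentions[i])
    return list(clusters.values())
```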

Baseline Methods
Since we are the first to propose a visual-aware model for pronoun coreference resolution, we compare our results with existing models for general coreference resolution.
• Deterministic model (Raghunathan et al., 2010) is a rule-based system that aggregates multiple functions for determining whether two mentions are coreferent based on hand-crafted features.
• Statistical model (Clark and Manning, 2015) learns from human-designed entity-level features between clusters of mentions to produce accurate coreference chains.
• Deep-RL model (Clark and Manning, 2016) applies reinforcement learning to mention-ranking models to form coreference clusters.
• End-to-end model (Lee et al., 2017) is the state-of-the-art method of coreference resolution. It predicts coreference clusters via an end-to-end neural network that leverages pretrained word embeddings and contextual information.
Last but not least, to demonstrate the effectiveness of the proposed model, we also present a variation of the End-to-end model that can use visual information as an extra baseline:

• End-to-end+Visual first extracts features from images with ResNet-152 (He et al., 2016) and then concatenates the image features with the contextual features of the original End-to-end model to make the final prediction.
As the Deterministic, Statistical, and Deep-RL models are included in the Stanford CoreNLP toolkit, we use their released models as baselines. For the End-to-end model, we also use the released code.

The Result
The experimental results are shown in Table 1. Our proposed model VisCoref outperforms all the baseline models significantly, which indicates that visual information is crucial for resolving pronouns in dialogues. Besides that, we also have the following interesting findings:

1. For all the conventional models, the "Not Discussed" pronouns, whose antecedents are absent from the dialogue, are more challenging than the "Discussed" pronouns, whose antecedents appear in the dialogue context. The reason is that if the correct mention appears in the near context of a pronoun, information about it can be aggregated to the targeting pronoun through either human-designed rules or deep neural networks (BiLSTMs). However, when the correct mention is not available in the near context, it is quite challenging for conventional models to understand the dialogue and correctly ground the pronoun to the object both speakers can see, as they do not have the support of visual information.
2. As shown by the result of the "End-to-end+Visual" model, simply concatenating the visual features with the contextual features can help resolve "Discussed" pronouns but may hurt the performance on "Not Discussed" pronouns. In contrast, the proposed VisCoref improves the resolution of both "Discussed" and "Not Discussed" pronouns. There are two main reasons: (1) the visual information in our model is first converted into textual labels and then transformed into vector representations in the same way as the dialogue context, so the vector spaces of the contextual and visual information are well aligned; (2) we introduce a hyper-parameter $\lambda_{vis}$ to balance the influence of the different information sources.
3. Even though our model outperforms all the baseline methods, we can still observe a huge gap between our model and human performance. This indicates that current models still cannot fully understand dialogues even with the support of visual information, and further proves the value and necessity of VisPro.

Hyper-parameter Analysis
We vary the weight of visual information $\lambda_{vis}$ from 0 to 1; the result is shown in Figure 6. As $\lambda_{vis}$ increases, our model puts more weight on the visual information and performs better. However, when the model focuses too much on the visual information (when $\lambda_{vis}$ is 0.9 or 1), it overfits to the visual information and performs poorly on the task. To balance the visual and contextual information, we set $\lambda_{vis}$ to 0.4.

Case Study
To further investigate how visual information can help solve PCR, we randomly select two examples and show the prediction results of VisCoref and the End-to-end model in Figure 7.
In the first example in Figure 7(a), given the pronoun "it," the End-to-end model picks "any writing" from the dialogue, while the VisCoref model chooses "a blue, white and red train" from the candidate mention set. Without looking at the picture, we cannot distinguish between these two candidates. However, when the picture is taken into consideration, we observe that there is a train in the image, so "a blue, white and red train" is the more suitable choice, which proves the importance of visual information. A similar situation occurs in Figure 7(b), where the End-to-end model connects "they" to "the people" even though there is no human being in the image at all. On the contrary, VisCoref effectively leverages the visual information and correctly decides that "they" refers to "2 zebras."

Related Work
In this section, we introduce related work on pronoun coreference resolution and visual-aware natural language processing problems.

Pronoun Coreference Resolution
As one core task of natural language understanding, pronoun coreference resolution, the task of identifying the mentions in text that a targeting pronoun refers to, plays a vital role in many downstream applications of natural language processing, such as machine translation (Guillou, 2012), summarization (Steinberger et al., 2007), and information extraction (Edens et al., 2003). Traditional studies focus on resolving pronouns in expert-annotated formal textual datasets such as ACE (NIST, 2003) or OntoNotes (Pradhan et al., 2012). However, models that perform well on these datasets might not perform as well in other scenarios such as dialogues, due to the informal language and the lack of essential information (e.g., the shared view of the two speakers). In this work, we thus focus on PCR in dialogues and show that the information contained in the shared view can be crucial for understanding the dialogues and correctly resolving the pronouns.

Visual-aware NLP
At the intersection of computer vision (CV) and natural language processing (NLP), visual-aware NLP research topics have become popular in both communities. For instance, image captioning (Xu et al., 2015) focuses on generating captions for images, visual question answering (VQA) (Antol et al., 2015) on answering questions about an image, and visual dialogue (Das et al., 2017) on generating suitable responses based on images. Although visual-aware PCR is a vital step for all of the aforementioned visual-aware natural language processing tasks (Kottur et al., 2018), it remains unexplored. To fill this gap, in this paper, we create VisPro, a large-scale visual-aware PCR dataset, and introduce VisCoref to demonstrate how to leverage the information hidden in the shared view to better resolve pronouns in dialogues and thus better understand the dialogues.
Another related line of work is the comprehension of referring expressions (Mao et al., 2016), which infers the object in an image that an expression describes. However, that task is formulated over isolated noun phrases specially designed as discriminative descriptions, without placing them in a meaningful context. Instead, our task focuses on resolving pronouns in dialogues based on images as the shared view, which enhances the understanding of dialogues based on the comprehension of expressions and images.

Conclusion
In this work, we formally define the task of visual pronoun coreference resolution (PCR) and present VisPro, the first large-scale visual-supported pronoun coreference resolution dataset. Different from conventional pronoun datasets, VisPro focuses on resolving pronouns in dialogues that discuss a view both speakers can see.
Moreover, we also propose VisCoref, the first visual-aware PCR model that aligns contextual information with visual information and jointly uses them to find the correct objects that the targeting pronouns refer to. Extensive experiments demonstrate the effectiveness of the proposed model. Further case studies also demonstrate that jointly using visual information and contextual information is an essential path for fully understanding human language, especially dialogues.