Adapting Coreference Resolution to Twitter Conversations

The performance of standard coreference resolution is known to drop significantly on Twitter texts. We improve the performance of the system of Lee et al. (2018), originally trained on OntoNotes, by retraining on manually annotated Twitter conversation data. Further experiments combining different portions of OntoNotes with Twitter data show that selecting text genres for the training data can beat merely maximizing the amount of training data. In addition, we inspect several phenomena, such as the role of deictic pronouns in conversational data, and present additional results for variant settings. Our best configuration improves the performance of the "out of the box" system by 21.6%.


Introduction and Related Work
Twitter messages present a discourse genre that includes noisy informal language with abbreviations and purposeful typos, use of nonstandard symbols such as # and @ signs, unintended misspellings, etc., which makes them challenging for NLP applications. We are here interested in the task of automated coreference resolution for nominal mentions in Twitter conversations, i.e., threads of messages that specifically reply to one another. In addition to non-standard words, Twitter conversations also show peculiar phenomena of referring, such as exophoric pointers to non-linguistic content in attached visual media, and mixed pronominal references to the same entity due to the nature of multi-user conversations (Aktaş et al., 2018).
Thus, tweets are a complicated genre for coreference resolution, but at the same time highly relevant for many applications that seek to extract information or opinions from users' messages. In this paper, we use a state-of-the-art resolution system built with the OntoNotes corpus (Pradhan et al., 2007) and experiment with adding annotated Twitter conversations to the training data. Next, we consider the different (spoken and written) genres included in the OntoNotes corpus. We thus conduct experiments with training on different portions, and we show that carefully selecting genre subsets beats the straightforward "taking as much as possible". Overall, our best configuration improves the "out of the box" performance of the system on Twitter data by 21.6%.
To our knowledge, there is no work specifically on adapting coreference resolution to Twitter, other than the aforementioned study of Aktaş et al. (2018), which showed a significant drop in performance when a system with OntoNotes models is applied to Twitter. More generally, one of the few studies on domain adaptation for coreference resolution is (Do et al., 2015), which adapts the Berkeley system (Durrett and Klein, 2013) to narrative stories. Do et al. do not retrain the system but add linguistic features of narratives as soft constraints to the resolver. At the same time, Twitter adaptation has been investigated for other NLP tasks, such as NER: in (Ritter et al., 2011), for example, performance is measured using tools trained with Twitter-related and with out-of-domain data.
Regarding OntoNotes genre differences, Uryupina and Poesio (2012) and Pradhan et al. (2013) report varying performance in coreference resolution for distinct corpus sections; this work inspired our experiments reported in the following. Section 2 describes our data sets, and Section 3 the experiments. Section 4 provides various additional analyses that shed light on the domain adaptation problem, and Section 5 concludes.

Data
For our experiments, we use the English portion of the OntoNotes benchmark used as training set in the CoNLL-2012 shared task (Pradhan et al., 2012). It has texts from spoken and written registers, and contains gold annotations at different layers, including coreference chains, i.e., sets of mentions referring to the same entity. The spoken data includes telephone conversations (tc), broadcast conversations (bc), and broadcast news (bn); the written data contains magazine (mz), newswire (nw), pivot text (pt), and web blogs (wb). An overview of these portions is given in Table 1.

Our second dataset is the Twitter Conversation corpus (TW) presented in (Aktaş et al., 2018). Conversations are tree structures where each tweet has a parent (i.e., the tweet it replies to), except for the initial tweet starting the conversation. A tree can be shallow, with many replies on just one level, or deep, when participants interact with each other across several turns. The corpus holds 1756 tweets in 185 threads, where a thread is defined as a path from the root to a leaf node of a conversation tree. 69% of the coreference chains in this dataset contain coreferential relations across tweets; hence, considering the conversation context is important. We illustrate a thread structure with one example of a coreference chain annotation in Figure 1.
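The extraction of threads as root-to-leaf paths can be sketched as follows. The dict-based tweet representation with `id` and `parent_id` fields is an illustrative assumption, not the actual TW corpus format.

```python
# Sketch: extract threads (root-to-leaf paths) from a reply tree.
# The tweet format (id/parent_id) is an illustrative assumption.

def extract_threads(tweets):
    """Return every root-to-leaf path in the reply forest."""
    children = {}
    roots = []
    for t in tweets:
        if t["parent_id"] is None:
            roots.append(t["id"])
        else:
            children.setdefault(t["parent_id"], []).append(t["id"])

    threads = []

    def walk(node, path):
        path = path + [node]
        if node not in children:        # leaf: one complete thread
            threads.append(path)
        else:
            for child in children[node]:
                walk(child, path)

    for root in roots:
        walk(root, [])
    return threads
```

A shallow tree with many direct replies yields many short threads, while a deep exchange yields fewer but longer ones, matching the distinction drawn above.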
The original TW corpus was annotated with a scheme slightly different from that of ONT. For systematic comparison, we modified the TW annotations so that they are conceptually parallel to the ONT scheme; the adjustments are described in the appendix.

Experiments
For our experiments, we chose 'e2e-coref' (Lee et al., 2018), an update of the end-to-end neural coreference resolver presented at EMNLP 2017. It introduced a refined approach based on a differentiable approximation to higher-order inference, and ELMo embeddings (Peters et al., 2018) for span scoring, which significantly improved performance on English ONT. The approach achieved 73.0 F1, representing the 2018 state of the art. Due to its cost efficiency, speed, and flexibility, it was later used as the basis for several recent state-of-the-art models, including SpanBERT (Joshi et al., 2020).

Our main goal is to see how different training set configurations affect coreference resolution performance on Twitter data. To obtain informative results (the data is not linearly distributed and is highly variable), we selected a representative test set not via random sampling but through statistical analysis of three features: the number of tokens, chains, and mentions per document. To faithfully represent threads of all lengths, we determined the documents where these variables lie either on the median or in the first and fourth quartiles of the respective distribution, while omitting obvious outliers (see Figure 2). Because of the linear correlation of the three parameters shown in Figure 3, we could ensure that we only selected documents where all three lie in the same range of their distributions. Among the pre-screened files, we inspected each document, marking features of the annotated mentions (person, number, gender) as well as Twitter phenomena (hashtags, user names, pronouns with typos, etc.). With this information, we excluded threads without enough coverage and variability of the phenomena in focus. As the threads are not evenly distributed in their total length, we compared the average, median, and sum of each of the three characteristics in the whole corpus with those of the resulting test set, confirming that all values lie under the 15% threshold of the total number.
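The quartile-based screening step can be sketched as follows: a document survives the pre-screening only if its token, chain, and mention counts all fall into the same quartile band of their respective distributions. The per-document statistics below are hypothetical; the real corpus values differ.

```python
# Sketch of quartile-based test-set screening: keep documents whose
# token, chain, and mention counts all fall into the same quartile band.
from statistics import quantiles

def quartile_band(value, cuts):
    """Return 0-3 for Q1..Q4 given the three quartile cut points."""
    for i, cut in enumerate(cuts):
        if value <= cut:
            return i
    return 3

def screen_documents(stats):
    """stats: {doc_id: (n_tokens, n_chains, n_mentions)} -> candidate ids."""
    cuts = [quantiles([s[i] for s in stats.values()], n=4) for i in range(3)]
    selected = []
    for doc, values in stats.items():
        bands = {quartile_band(v, cuts[i]) for i, v in enumerate(values)}
        if len(bands) == 1:          # all three counts in the same quartile
            selected.append(doc)
    return selected
```

Because the three counts are strongly correlated in the corpus, this filter mostly removes documents whose counts diverge, i.e., likely outliers.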
The final distribution is shown in Table 2.

Baseline Experiments
For evaluation, we use the official CoNLL-2012 scripts, measuring the average of precision, recall, and F1 for the muc, b3, and ceafe metrics. After successfully reproducing the published e2e-coref results, we measured how a model trained on ONT performs on our Twitter test set (Test A): its score (see Table 4) is almost 28% lower than the result reported on the official ONT test set. A second baseline results from using only the TW' Twitter corpus as training data, which leads to 60.8 F1 (Test B). Although this model is based on a rather small training set, it already improves significantly on baseline A and points to the difference between in-domain and out-of-domain training.
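The averaging performed by the evaluation can be sketched as follows: the overall CoNLL score is the unweighted mean of the F1 values that the official scorer reports for the three metrics.

```python
# Minimal sketch of the CoNLL-2012 score: the unweighted average of the
# MUC, B-cubed, and CEAF-e F1 values reported by the official scorer.

def f1(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def conll_score(metric_pr):
    """metric_pr: {'muc': (p, r), 'bcub': (p, r), 'ceafe': (p, r)}"""
    return sum(f1(p, r) for p, r in metric_pr.values()) / len(metric_pr)
```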

Effects of selecting training (sub-)sets
Noting that the presence of Twitter data in the training set is beneficial, for Test C we merged ONT and TW', with the latter forming 3.35% of the total size (see Table 3). The results show not only a performance increase of 17% compared to Test A, but also a 2% gain over Test B, demonstrating that combining ONT and TW' can be crucial for the learning effect. To study this in more detail, we measured how performance on the test set reacts to training on different subsets of ONT. We roughly distinguished spoken, spontaneous language from written or edited texts.
Hence, in Test D, the training set consists of Twitter data and only ONT's spoken genres, viz. broadcast conversations and telephone conversations. As a consequence, the proportion of Twitter data in the training set rises from 3.35% to 16.6%. We found an increase in overall performance of 4.3%, indicating that the written genres may add confusion rather than benefit to this task. However, it is not entirely clear whether the improvement results from excluding the written genres or from increasing the proportion of Twitter data.
To answer this question, we proceeded to Test E, which combines the proportion of Twitter data present in Test D with documents from the written genres; we chose newswires (nw) and magazines (mz). Test E scores 61.25 F1, which is 5.5% lower than Test D. This result may partly be due to the sparsity of the written data, with a smaller number of chains and mentions present in the written genres.

Additional Analyses
To gain further insight into the adaptation of coreference resolution to Twitter, we quantitatively and qualitatively compare the results of the best-performing test (D) to the baselines (see Table 5).
Mention length For all tests, the average token length of mentions additionally predicted by the system (spurious predictions) is significantly longer (p ≤ 0.05) than that of the correct predictions. The higher the proportion of ONT training data (whose mentions are on average 0.72 tokens longer than in TW'), the longer those predictions are. At the same time, they are significantly shorter (p ≤ 0.05) than the missed gold mentions. Hence there is a tendency to select longer spans (especially when training on ONT), but these are also more error-prone.
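The paper does not name the significance test behind the p ≤ 0.05 claims; one standard, assumption-free choice for comparing mean mention lengths is a two-sided permutation test, sketched here on hypothetical length samples.

```python
# Sketch: two-sided permutation test for the difference in mean mention
# length between two groups (e.g. spurious vs. correct predictions).
# This is one standard option; the paper does not specify its test.
import random

def permutation_test(a, b, n_rounds=2000, seed=0):
    """Approximate p-value for |mean(a) - mean(b)| under random relabeling."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_rounds):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(pa) / len(pa) - sum(pb) / len(pb))
        if diff >= observed:
            hits += 1
    return hits / n_rounds
```

A small p-value means the observed length difference is unlikely under random group assignment.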
Twitter-specific tokens Hashtags and usernames caused many errors in Test A. In tweets that are replies, user addresses are inserted at the beginning, so the majority of such tweet-initial usernames are not part of the syntax and have not been annotated. Table 5 shows that many of those names are incorrectly detected as mentions, while hashtags are completely ignored. With Twitter training data in Test B, identification of Twitter-specific tokens works better: tweet-initial usernames are ignored as mentions, and some usernames and hashtags are now correctly predicted. Test D shows further improvements for syntactically integrated hashtags, but usernames and non-integrated hashtags still remain unresolved.
Pronouns Although they are relatively evenly distributed in the gold annotations, more 3rd person pronouns are resolved than 1st and 2nd ps. pronouns in Test A, resulting in an overall F1 of 0.769. In Test B with Twitter training data, which is rich in pronouns, pronoun performance improves for 1st and especially 2nd ps., and remains the same for 3rd ps., improving the F1 to 0.917. In Test D, pronoun performance is slightly worse (0.905).
As the entire training data in B and D is conversational, which by nature contains many 1st and 2nd ps. pronouns, we repeated all tests after removing the chains that contain only 1st and 2nd ps. pronouns. This is to make sure that the improvement is not caused solely by easy detection of these pronouns. The results are in column F1 1 in Table 4. While deictic pronouns have a major impact on F1, we still see improvements over the baseline for all tests but C, meaning that, generally, the detection of other anaphoric expressions improves as well.
Verb annotations Verb mentions are possible in ONT if they corefer with a nominal mention (Pradhan et al., 2007), but they are not annotated in TW'. Thus, four predicted verb mentions in Test A, of which two are correctly linked with the demonstrative pronoun that, are counted as erroneous predictions. After adding training data from TW' in Test D, however, no verbal mentions are predicted. To check the influence of this annotation difference, we also ran all tests with the verbal annotations removed from ONT, which reduced the mentions by 2.4% and the chains by 3.6%. Column F1 2 in Table 4 shows the results. While training with only spoken genres outperformed training data dominated by written genres in the previous experiments, we now see the opposite, with Test D giving the worst results. These variations motivate looking further into the specific effects of different training data combinations, and into how verb annotations (both generally and depending on text genre) influence an otherwise purely nominal coreference resolution task.
Chain Linking The last section of Table 5 shows that Test B improves the number of correctly predicted chains compared to Test A, and it further increases in Test D, almost doubling from Test A. Partially correct chains also increase over the tests, and the number of missed entities (cases where not a single mention of an entity is predicted) is reduced by 51.3%. Notably, chains consisting only of identical strings profited the most from the combined training set in D.

Conclusion
We showed that the performance of a state-of-the-art "standard" coreference resolution system run on Twitter conversations can be improved by 21.6% by adding in-domain training data. In fact, even small amounts of added in-domain data can have an impact. Further, and interestingly, for the out-of-domain training data (ONT), the choice of genre can make a bigger difference than the bare amount of data. Our additional analyses considered two more variants of the main experiment design: while all results given in Table 4 indicate that adding Twitter data to the training set improves performance significantly, the best combination of in-domain and out-of-domain data can depend on specific factors, as discussed in Section 4. Also, we showed that the improvements from Twitter training data do not result just from the large proportion of 1st and 2nd ps. pronouns (as one might have suspected). Finally, we tested the effect of removing verb mentions from ONT, which exhibits different patterns than the other setups regarding the best combination of training data. This result encourages deeper exploration of training data arrangements in terms of these features. In future work, we plan to focus on the specific kinds of training data portions and examine the influence of spoken versus written register, and that of formal versus informal language (which need not necessarily coincide).
A Appendix: Adjustments to the TW Annotations

• Generic "you" instances are annotated in TW but not in OntoNotes. We removed generic "you" annotations from TW.
• In TW, reflexives are annotated as separate mentions even when they are used for focus (e.g. [The president] [himself] said this). In OntoNotes, however, focus reflexives are annotated both as a separate markable and as part of the span of the preceding co-referring noun phrase (e.g. [The president [himself]] said this). Therefore, the focus reflexives in TW were added to the span of the preceding co-referring noun phrase.
If the removal of a mention left the remaining chain a singleton (i.e., only one mention left in the chain), the whole chain was removed from the annotations, as no singleton chains are allowed in the OntoNotes scheme.
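The final cleanup step can be sketched as follows: after individual mentions (e.g. generic "you") are removed, any chain left with a single mention is dropped entirely. The chain and mention representations are illustrative, not the corpus format.

```python
# Sketch of the singleton cleanup: after removing individual mentions,
# drop any chain left with a single mention, since the OntoNotes scheme
# does not allow singleton chains.

def drop_singletons(chains, removed_mentions):
    """chains: list of mention lists; removed_mentions: set of mention ids."""
    cleaned = []
    for chain in chains:
        kept = [m for m in chain if m not in removed_mentions]
        if len(kept) > 1:            # keep only true coreference chains
            cleaned.append(kept)
    return cleaned
```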

B Appendix: Preprocessing the Data
In the TW dataset:
• We normalized parentheses, namely left and right bracket tokens, into '-LRB-' and '-RRB-', respectively.
• We converted all smiley and emoji tokens into the strings "%smiley" and "%emoji", respectively.
• We did not apply any preprocessing to hashtags and @-usernames.
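The steps above can be sketched as a token-level normalizer. The emoticon inventory and the emoji code-point check are simplified illustrations, not the full procedure used for the corpus.

```python
# Sketch of the TW preprocessing: brackets become -LRB-/-RRB-, emoticon
# and emoji tokens become placeholder strings, and hashtags and
# @-usernames pass through unchanged. The emoticon set and emoji range
# are illustrative simplifications.
EMOTICONS = {":)", ":-)", ":(", ":D"}       # illustrative subset

def normalize_token(tok):
    if tok == "(":
        return "-LRB-"
    if tok == ")":
        return "-RRB-"
    if tok in EMOTICONS:
        return "%smiley"
    if any(0x1F300 <= ord(ch) <= 0x1FAFF for ch in tok):
        return "%emoji"                      # pictographic code points
    return tok                               # hashtags, @-usernames, words

def preprocess(tokens):
    return [normalize_token(t) for t in tokens]
```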

C Appendix: Experimental Setup
The experiments were conducted on two servers equipped with GeForce GTX 1080 GPUs.