Does the Objective Matter? Comparing Training Objectives for Pronoun Resolution

Hard cases of pronoun resolution have been used as a long-standing benchmark for commonsense reasoning. In the recent literature, pre-trained language models have been used to obtain state-of-the-art results on pronoun resolution. Overall, four categories of training and evaluation objectives have been introduced. The variety of training datasets and pre-trained language models used in these works makes it unclear whether the choice of training objective is critical. In this work, we make a fair comparison of the performance and seed-wise stability of four models that represent the four categories of objectives. Our experiments show that the sequence ranking objective performs best in-domain, while the objective of semantic similarity between candidates and pronoun performs best out-of-domain. We also observe seed-wise instability of the model using sequence ranking, which does not occur with the other objectives.


Introduction
Hard cases of pronoun resolution have been a longstanding problem in natural language processing, which has served as a performance benchmark for the research community (Levesque et al., 2012; Wang et al., 2018, 2019a). For example, the WinoGrande dataset (Sakaguchi et al., 2019) consists of pronoun resolution schemas that are constructed so that resolving them requires background knowledge and commonsense reasoning. In WinoGrande, the pronoun is obscured by "_" to remove gender and number cues. The task is to find the correct candidate for "_" out of two given candidates. For example: John moved the couch from the garage to the backyard to create space. The _ is small. Candidates: garage, backyard.
Recently, supervised learning on top of pretrained language models has been established as the main approach for pronoun resolution (Kocijan et al., 2019b,a; Sakaguchi et al., 2019). Under this type of approach, we identify four categories of objectives commonly used for pronoun resolution: 1. comparing the language model probabilities for each candidate (Kocijan et al., 2019b,a; He et al., 2019), 2. using semantic similarity between the pronoun and the candidates (Wang et al., 2019b; He et al., 2019), 3. using sequence ranking among the possible substituted sentences (Opitz and Frank, 2018; Sakaguchi et al., 2019), and 4. selecting a candidate based on the attentions of the pronoun in a transformer model (Klein and Nabi, 2019).
We list one representative model from each category. For 1, Kocijan et al. (2019b) use the BERT masked language model (Devlin et al., 2018) to produce the probabilities of the pronoun to be replaced with each of the two candidates. For 2, the Unsupervised Deep Structured Semantic Model (UDSSM-I) (Wang et al., 2019b) uses contextualized word embeddings produced by a bidirectional recurrent neural network (BiRNN), and then compares the word embedding of each candidate with the word embedding of the pronoun. For 3, RoBERTa-WinoGrande (Sakaguchi et al., 2019) encodes a pair of sentences (one for each candidate substituted in the input) by using RoBERTa (Liu et al., 2019) to determine which substitution is the correct one. Finally, the zero-shot Maximum Attention Score (MAS) model (Klein and Nabi, 2019) selects a candidate based on how much the pronoun attends to each candidate internally in BERT.
The problem with all these objectives is that they have not been introduced under the same circumstances. They use different language models and word embeddings (e.g., BERT, RoBERTa, or BiRNN), and have been trained on different data (e.g., DPR (Rahman and Ng, 2012), WinoGrande, or no additional data). Therefore, it is unclear whether the choice of the objective function is essential for pronoun resolution tasks. Moreover, the seed-wise stability and the expected performance of these models have usually not been reported. However, seed-wise instability and performance variation are well-known problems when fine-tuning transformer-based models (Liu et al., 2020;Dodge et al., 2020).
In this work, we compare the performance and seed-wise stability of the four categories of training objectives for pronoun resolution on equal grounds. To do this, for category 4, we adapt the zero-shot MAS model to training. For category 2, we also introduce Coreference Semantic Similarity (CSS), which is a simplification and modification of UDSSM-I for transformer encoders. We select WinoGrande as our training and development dataset due to its large size (40,938 examples) and generalizability to other pronoun resolution tasks (Sakaguchi et al., 2019). We also use for testing the following well-established datasets: the Winograd Schema Challenge dataset (WSC) (Levesque et al., 2012) and the Definite Pronoun Resolution dataset (DPR) (Rahman and Ng, 2012). We choose RoBERTa (Liu et al., 2019) as our language model, as it significantly outperforms BERT on WinoGrande, WSC, and DPR (Sakaguchi et al., 2019).
Finally, our evaluations are done under an unprecedentedly large number of seeds (20).

Models
This section presents the four training objectives and the models 1 that represent each of them.
All four models share the RoBERTa 2 contextualized word embeddings. RoBERTa has an identical transformer architecture to BERT (Devlin et al., 2018), with the only difference being the training procedure. Hence, RoBERTa is a masked language model that outputs the probability distribution for filling a gap in the text (denoted by a "<mask>" token). Additionally, RoBERTa is a text encoder, with one output for each token of the input sentence. Three of the models (2.1, 2.3, and 2.4) use a multi-layer perceptron (MLP) classification "head", which takes part of the encoder's output as input.
All four models use binary cross-entropy loss with a pair of probabilities as input, and the following target labels: sentence correctness for 2.1 and candidate correctness for 2.2, 2.3, and 2.4.
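The shared loss setup can be sketched as follows. This is a minimal PyTorch sketch, not the authors' code; the scores are the pre-softmax outputs of any of the four models, and the shapes are assumptions:

```python
import torch
import torch.nn.functional as F

def pairwise_loss(scores, label):
    """Cross-entropy over a softmaxed pair of scores.

    scores: tensor of shape (batch, 2), one score per candidate/sentence.
    label:  tensor of shape (batch,), 0 or 1 indexing the correct option.

    Softmax over the pair yields P(option_1) and P(option_2); cross-entropy
    on that pair is equivalent to binary cross-entropy on P(correct).
    """
    return F.cross_entropy(scores, label)

# Hypothetical scores for a batch of two examples.
scores = torch.tensor([[2.0, -1.0], [0.5, 0.7]])
labels = torch.tensor([0, 1])
loss = pairwise_loss(scores, labels)
```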

WinoGrande Sequence Ranking
We refer to the RoBERTa-WinoGrande model introduced by Sakaguchi et al. (2019) as WG-SR, since it has a sequence ranking objective. This model predicts which sentence of a pair of substituted sentences is more plausible. Each of the pair of sentences in the input of WG-SR is split in two before the substituted candidate. For example, <s> The city councilmen refused the demonstrators a permit because </s> </s> _ feared violence. </s>, where "_" is filled with each of the two candidates: "the city councilmen" or "the demonstrators". The WG-SR code 3 is based on the RobertaForMultipleChoice model (Wolf et al., 2019), restricted to binary choice. This model consists of the pre-trained RoBERTa encoder and an MLP head based on the <s> (first) token of RoBERTa's output. The MLP has one hidden layer with tanh activation, hidden size matching that of the encoder, and one-dimensional output. The pair of input sentences (S 1 , S 2 ) thus produces a pair of values, which are then passed through a softmax to obtain the two sentence probabilities P (S 1 ) and P (S 2 ).
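The head described above can be sketched as follows; this is an illustrative PyTorch sketch operating on already-extracted <s> token embeddings, not the WG-SR code itself, and the hidden size is RoBERTa-base's 768 by assumption:

```python
import torch
import torch.nn as nn

class SequenceRankingHead(nn.Module):
    """Sketch of the WG-SR head: one tanh hidden layer over the <s>
    token embedding, producing one scalar score per substituted sentence."""

    def __init__(self, hidden_size):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),  # hidden size matches encoder
            nn.Tanh(),
            nn.Linear(hidden_size, 1),            # one-dimensional output
        )

    def forward(self, s1_cls, s2_cls):
        # s1_cls, s2_cls: (batch, hidden) embeddings of the first (<s>)
        # token for each of the two substituted sentences S1 and S2.
        scores = torch.cat([self.mlp(s1_cls), self.mlp(s2_cls)], dim=-1)
        return torch.softmax(scores, dim=-1)      # P(S1), P(S2)

head = SequenceRankingHead(hidden_size=768)
p = head(torch.randn(4, 768), torch.randn(4, 768))  # a batch of 4 pairs
```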

Binary Word Prediction
We denote by Binary Word Prediction (BWP) the model suggested in the accompanying code repository 4 as a modification of the model from Kocijan et al. (2019b). Instead of margin loss, BWP uses binary cross-entropy loss. We select this modified version because its authors claim it to be more robust, and it also has two fewer hyperparameters.
For a given (unsubstituted) input sentence, the BWP model estimates which of the two candidates is more likely to fill the gap "_". The input format is as in the following example, where "_" is replaced by the "<mask>" token expected by the masked language model: <s> The city councilmen refused the demonstrators a permit because <mask> feared violence. </s> With such an input, the RoBERTa masked language model returns the log-probability predictions at the "<mask>" token over the vocabulary. Of those predictions, only the ones corresponding to the two word candidates c_1 and c_2 are selected by BWP: log P_vocab(c_1) and log P_vocab(c_2).
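The candidate-selection step can be sketched as follows. This is a minimal sketch, assuming single-token candidates and already-computed masked-LM logits (here random, for illustration); the token ids are hypothetical:

```python
import torch

def bwp_scores(mask_logits, c1_id, c2_id):
    """Select the log-probabilities of the two candidate tokens.

    mask_logits: logits at the <mask> position, shape (vocab_size,).
    c1_id, c2_id: vocabulary ids of the two candidates (assumed single-token).
    """
    # normalize over the full vocabulary, then index the two candidates
    log_probs = torch.log_softmax(mask_logits, dim=-1)
    return log_probs[c1_id], log_probs[c2_id]  # log P_vocab(c1), log P_vocab(c2)

logits = torch.randn(50265)  # RoBERTa vocabulary size
lp1, lp2 = bwp_scores(logits, c1_id=100, c2_id=200)  # hypothetical ids
pred = 0 if lp1 > lp2 else 1  # pick the more probable candidate
```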

Coreference Semantic Similarity
We propose Coreference Semantic Similarity (CSS), a modification of the training objective of the Unsupervised Deep Structured Semantic Model (UDSSM-I) (Wang et al., 2019b). Like UDSSM-I, the CSS objective works by comparison in the word embedding space, such that the candidate that is more similar to the embedding of the pronoun is selected. Unlike UDSSM-I, the CSS objective is simpler, with no attention weights on the tokens of the candidates. It also uses a transformer encoder instead of a recurrent neural network, which enables it to take advantage of state-of-the-art pretrained language models.
The input format for this model is the same as for BWP (2.2). This input is used by RoBERTa to produce contextualized word embeddings. For each candidate c, we define its contextualized word embedding emb(c) by averaging the contextualized word embeddings of its tokens.
For classification, we compare the similarity scores of the embeddings of the <mask> token with each of the two candidates c 1 and c 2 , i.e., we compare sim(emb(c 1 ), emb(<mask>)) and sim(emb(c 2 ), emb(<mask>)) and select the candidate with greater similarity.
For the similarity score function, we use additive alignment (Bahdanau et al., 2014), i.e., sim(x, y) := v^T tanh(W x + U y), with the trainable parameters: a vector v and matrices W and U, with hidden size equal to that of RoBERTa and output size of one.
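The scoring function above can be sketched as a small PyTorch module; this is an illustrative sketch under the stated sizes (RoBERTa-base's 768 hidden units is an assumption), not the authors' implementation:

```python
import torch
import torch.nn as nn

class AdditiveSimilarity(nn.Module):
    """Additive alignment score sim(x, y) = v^T tanh(Wx + Uy),
    following Bahdanau et al. (2014): W, U, v are trainable,
    hidden size equals the encoder's, output size is one."""

    def __init__(self, hidden_size):
        super().__init__()
        self.W = nn.Linear(hidden_size, hidden_size, bias=False)
        self.U = nn.Linear(hidden_size, hidden_size, bias=False)
        self.v = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, x, y):
        # x: candidate embedding, y: <mask> embedding -> scalar score
        return self.v(torch.tanh(self.W(x) + self.U(y))).squeeze(-1)

sim = AdditiveSimilarity(hidden_size=768)
emb_c1, emb_c2, emb_mask = torch.randn(3, 768).unbind(0)  # placeholder embeddings
s1 = sim(emb_c1, emb_mask)
s2 = sim(emb_c2, emb_mask)
pred = 0 if s1 > s2 else 1  # candidate with greater similarity wins
```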

Maximum Attention Score
The Maximum Attention Score (MAS) model was originally developed for zero-shot evaluation of transformer models on pronoun disambiguation (Klein and Nabi, 2019). It uses the attentions of all layers of a transformer model to produce a maximum attention score for each candidate that summarizes how much the pronoun attends to a candidate. The candidate that is most attended is selected. We adapt this objective to be trainable by replacing the summary of attentions with an MLP over the concatenated masked attention tensors, followed by a binary classifier.
The input of MAS is the same as for BWP (2.2). Then, similarly to Klein and Nabi (2019), we extract the two attention tensors A_c1 and A_c2 given by the multi-layer RoBERTa attentions of the "<mask>" token to each of the two candidates c_1 and c_2, respectively. For each candidate c, the attention tensor A_c is defined as the average of the attention tensors of all tokens that form c. The two corresponding max-masking tensors M_c1 and M_c2 are then derived as follows: for i = 1, 2 and for each multi-index j of the tensor, M_ci(j) = 1 if A_ci(j) = max(A_c1(j), A_c2(j)), and M_ci(j) = 0 otherwise. We obtain the two corresponding max-masked tensors by the element-wise products B_ci = M_ci ⊙ A_ci. Unlike Klein and Nabi (2019), we introduce an MLP on top of the concatenated tensor B = [B_c1, B_c2] for binary classification. The MLP has two hidden layers, tanh activation, hidden size the same as its input, and two-dimensional output. It is followed by a binary softmax function to produce the two candidate probabilities P (c_1) and P (c_2).
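The max-masking step can be sketched as follows. This is an illustrative sketch, not the authors' code: the (layers, heads) tensor shape matches RoBERTa-large by assumption, the attention values here are random placeholders, and ties are broken toward c_1:

```python
import torch

def max_mask(a1, a2):
    """Keep each attention entry only where that candidate's attention
    is the larger of the two, then concatenate for the MLP classifier.

    a1, a2: attention tensors for candidates c1 and c2, same shape.
    """
    m1 = (a1 >= a2).float()    # M_c1(j) = 1 where A_c1(j) is the max
    m2 = 1.0 - m1              # M_c2 as its complement (ties go to c1)
    b1 = m1 * a1               # B_c1 = M_c1 ⊙ A_c1 (element-wise)
    b2 = m2 * a2               # B_c2 = M_c2 ⊙ A_c2
    return torch.cat([b1.flatten(), b2.flatten()])  # B = [B_c1, B_c2]

a1 = torch.rand(24, 16)  # e.g., RoBERTa-large: 24 layers x 16 heads
a2 = torch.rand(24, 16)
b = max_mask(a1, a2)     # input to the binary-classification MLP
```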

Experiments
For all four models, we select the best hyperparameters via grid search using 3 seeds, and then train the models with the best hyperparameters on 20 additional seeds. We do not report all 80 trained models on WG-test; instead, we report them on WG-dev. For additional verification, we include results over the hyperparameter space, where WG-dev is a true test set. We also report all models on the out-of-domain pronoun resolution datasets WSC (273 examples) and DPR (564 examples). The candidates provided in WSC were treated differently for the CSS and MAS models, as these models require precise candidate localization (see Appendix B). For all four models, we do a grid search over the learning rate {5e-6, 1e-5, 3e-5, 5e-5}, the number of training epochs {3, 4, 5, 8}, and the batch size {8, 16}, and we run each model with three different random seeds. This hyperparameter space is selected based on the union of the grid search by the original WG-SR work (Sakaguchi et al., 2019) and our observations on the other three models. The best hyperparameters (in Appendix A) are selected based on the maximum WG-dev accuracy across the three seeds.
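The grid described above can be enumerated as follows; a minimal sketch in pure Python, with hypothetical seed values:

```python
from itertools import product

# Hyperparameter grid from the search described above.
learning_rates = [5e-6, 1e-5, 3e-5, 5e-5]
epochs = [3, 4, 5, 8]
batch_sizes = [8, 16]
seeds = [0, 1, 2]  # hypothetical seed values, 3 seeds per configuration

# 4 * 4 * 2 = 32 hyperparameter combinations
grid = list(product(learning_rates, epochs, batch_sizes))

# 32 * 3 = 96 training runs per model during the search
runs = [(lr, ep, bs, seed) for (lr, ep, bs) in grid for seed in seeds]
```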
For all experiments, we use linear learning rate decay with warm-up over 10% of the training data, and the AdamW optimizer (Wolf et al., 2019), for which we only alter the learning rate. Table 1 shows the final seed-wise results for all four objectives. We see that the semantic similarity objective (CSS) outperforms the other three objectives on out-of-domain testing, with 90.2% average accuracy on WSC and 92.7% average accuracy on DPR. On the other hand, the sequence ranking objective used by WG-SR clearly outperforms the other three objectives on in-domain testing, with 78.2% average accuracy on WG-dev. This is confirmed by the contents of Table 2, where we see that WG-SR has a better mean and max accuracy on WG-dev over the entire hyperparameter space compared to the other three models. For these cases, WG-dev is a true test set, since early stopping was not used and all tested setups are reported; hence, WG-dev has not influenced the models reported in Table 2.

Results
In order to verify the statistical significance of our main results, we used Student's t-test (assuming similar variances, with different sample sizes) to compare the distributions of accuracy over the converging seeds. Comparing the accuracies of CSS and WG-SR on WG-dev, WSC, and DPR, respectively, we get the following two-tailed p-values: 0.008249, 0.003026, and 0.017441. All results are significant with p < 0.05.
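Such a comparison can be sketched as follows; the per-seed accuracy values below are purely illustrative (the paper does not list them), and scipy's `ttest_ind` with `equal_var=True` gives Student's two-tailed t-test for unequal sample sizes:

```python
from scipy.stats import ttest_ind

# Illustrative per-seed accuracies only, NOT the paper's actual values;
# the second list is shorter, mimicking non-converging seeds being dropped.
css_acc  = [90.5, 89.8, 90.9, 90.1, 90.4]
wgsr_acc = [87.9, 88.6, 88.1, 88.4]

# Student's t-test: similar variances assumed, sample sizes may differ.
stat, p = ttest_ind(css_acc, wgsr_acc, equal_var=True)
significant = p < 0.05  # two-tailed p-value
```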
We also observe that, even with the best hyperparameter combination, WG-SR exhibits seed-wise instability, as it fails to converge on 2 out of 20 seeds. This does not happen to the other three models. After considering 10 additional seeds, we obtained that WG-SR fails to converge on 10% of the seeds (3 out of 30).
Moreover, during the hyperparameter search, we observed that all models were prone to not converging for certain combinations of hyperparameters. We counted a run as not converged if it reached at most 60% accuracy on WG-dev; this threshold was chosen based on the performance distribution of all models, which perform either around 50% accuracy or at 70% accuracy or more on WG-dev, making 60% a good middle-ground threshold. Table 2 shows that MAS converged most often; however, it also had the highest performance variation, with a standard deviation of 2.5. Out of the four models, WG-SR converged least often, for only 49 out of all 96 hyperparameter combinations.
WG-SR likely performs better in-domain than CSS, MAS, and BWP, since those three repurpose existing properties of RoBERTa (namely, the comparability of contextualized embeddings, the attention structure of the model, and its pre-trained LM prediction head, respectively) for a task that they were not originally designed for (pronoun resolution). WG-SR, on the other hand, only uses the output of RoBERTa at the 0-th token, which is not tied to any pre-training objective.
We identify two possible reasons why WG-SR performs worse than CSS on out-of-domain examples. The first reason is the one mentioned above, namely, not explicitly exploiting the listed properties of the pre-trained model would lead to a better fit on a specific dataset, but worse "general knowledge". This reason is not completely warranted, since WG-SR has similar out-of-domain performance to BWP and MAS. The second possible reason is that CSS uses an explicit candidate localization and candidate-pronoun matching (by comparing the embedding of the candidate and the pronoun), whereas in WG-SR these are achieved implicitly by feeding a pair of sentences to the model, one with the correct and one with the incorrect substitution. Again, this reason is not completely warranted, since MAS also uses explicit candidate localization and candidate-pronoun matching, but has a similar out-of-domain performance to WG-SR. Further investigation on the reasons why CSS outperforms WG-SR on the out-of-domain examples is left for future work.

Summary and Outlook
In this work, we categorized four existing objectives for pronoun resolution, and compared their performance and seed-wise stability on equal grounds. Our experiments showed that, on in-domain testing, the objective of sequence ranking based on the first token in RoBERTa outperforms the other three objectives, but can exhibit convergence problems. On out-of-domain testing, the objective of semantic similarity between the pronoun and each candidate outperforms the other three objectives.
Future work may investigate whether these results translate to other language models besides RoBERTa as well as other training datasets besides WinoGrande. Also, one could analyze the strengths and weaknesses of each objective, and evaluate other variations of these objectives.