ContraCAT: Contrastive Coreference Analytical Templates for Machine Translation

Recent high scores on pronoun translation using context-aware neural machine translation have suggested that current approaches work well. ContraPro is a notable example of a contrastive challenge set for English→German pronoun translation. The high scores achieved by transformer models may suggest that they are able to effectively model the complicated set of inferences required to carry out pronoun translation. This entails the ability to determine which entities could be referred to, identify which entity a source-language pronoun refers to (if any), and access the target-language grammatical gender for that entity. We first show through a series of targeted adversarial attacks that in fact current approaches are not able to model all of this information well. Inserting small amounts of distracting information is enough to strongly reduce scores, which should not be the case. We then create a new template test set ContraCAT, designed to individually assess the ability to handle the specific steps necessary for successful pronoun translation. Our analyses show that current approaches to context-aware NMT rely on a set of surface heuristics, which break down when translations require real reasoning. We also propose an approach for augmenting the training data, with some improvements.


Introduction
Machine translation is a complex task which requires diverse linguistic knowledge. The seemingly straightforward translation of the English pronoun it into German requires knowledge at the syntactic, discourse and world knowledge levels for proper pronoun coreference resolution (cr). The German third person pronoun can have three genders, determined by its antecedent: masculine (er), feminine (sie) and neuter (es). Previous work (Hardmeier and Federico, 2010; Miculicich Werlen and Popescu-Belis, 2017; Müller et al., 2018) proposed evaluation methods for pronoun translation. This has been of special interest for context-aware nmt models that are capable of using discourse-level information. Despite promising results (Bawden et al., 2018; Müller et al., 2018; Lopes et al., 2020), the question remains: Are transformers (Vaswani et al., 2017) truly learning this task, or are they exploiting simple heuristics to make a coreference prediction?
To empirically answer this question, we extend ContraPro (Müller et al., 2018), a contrastive challenge set for automatic English→German pronoun translation evaluation, by making small adversarial changes in the contextual sentences. Our adversarial attacks on ContraPro show that context-aware transformer nmt models can easily be misled by simple and unimportant changes to the input. However, interpreting the results obtained from adversarial attacks can be difficult. The results indicate that nmt uses brittle heuristics to solve cr, but it is not clear what those heuristics are. In general, it is challenging to design attacks based on modifying ContraPro that can test specific phenomena that may be of interest. For this reason, we propose an independent set of templates for coreferential pronoun translation evaluation to systematically investigate which heuristics are being used. Inspired by previous work on cr (Raghunathan et al., 2010; Lee et al., 2011), we create a number of templates tailored to evaluating the specific steps of an idealized cr pipeline. We call this collection Contracat, Contrastive Coreference Analytical Templates. The templates are constructed in a completely controlled manner, enabling us to easily create a large number of coherent test examples and draw strong conclusions about the cr capabilities of nmt. The procedure we used in creating the templates can be adapted to many language pairs with little effort. Our results suggest that transformer models do not learn each step of a hypothetical cr pipeline.
We also present a simple data augmentation approach specifically tailored to pronoun translation. The experimental results show that this approach improves scores and robustness on some of our metrics, but it does not fundamentally change the way cr is being handled by nmt.
We publicly release Contracat and the adversarial modifications to ContraPro.

... a 1:1 correspondence between pronouns in different languages. As a result, a system translation may be correct despite not containing the exact pronoun in the reference, and incorrect even if containing the pronoun in the reference, because of differences in the translation of the referent. Moreover, introducing a separate model which needs to be trained before evaluation adds an extra layer of complexity in the evaluation setup and makes interpretability more difficult. In contrast, templates can easily be used to pinpoint specific issues of an nmt model. Our templates follow previous work (Ribeiro et al., 2018; McCoy et al., 2019; Ribeiro et al., 2020) where similar tests are proposed for diagnosing nlp models.

Do Androids Dream of Coreference Translation Pipelines?
Imagine a hypothetical coreference pipeline that generates a pronoun in a target language, as illustrated in Table 1. First, markables (entities that can be referred to by pronouns) are tagged in the source sentence (we restrict ourselves to concrete entities as we wish to detect gender). Then, the subset of animate entities is detected, and human entities are separated from other animate ones (since it cannot refer to a human entity). Second, coreferences are resolved in the source language. This entails handling phenomena such as world knowledge, pleonastic it, and event references. Third, the pronoun is translated into the target language. This requires selecting the correct gender given the referent (if there is one), and selecting the correct grammatical case for the target context (e.g., accusative, if the pronoun is the grammatical object in the target language sentence). This idealized pipeline would produce the correct pronoun in the target language. The coreference steps resemble the rule-based approach implemented in Stanford CoreNLP's CorefAnnotator (Raghunathan et al., 2010; Lee et al., 2011). However, nmt models are currently unable to decouple the individual steps of this pipeline. We propose to isolate each of these steps through targeted examples.

Model
We use a transformer model for all experiments and train a sentence-level model as a baseline. The context-aware model in our experimental setup is a concatenation model (Tiedemann and Scherrer, 2017) (concat) which is trained on a concatenation of consecutive sentences. concat is a standard transformer model and it differs from the sentence-level model only in the way that the training data is supplied to it. The training examples for this model are modified by prepending the previous source and target sentence to the main source and target sentence, respectively. The previous sentence is separated from the main sentence with a special token <SEP>, on both the source and target side. This also applies to how we prepare the ContraPro and Contracat data. We train the concatenation model on OpenSubtitles2018 data prepared in this way. We remove documents overlapping with ContraPro. Preprocessing details and model hyper-parameters are presented in the Appendix.
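The data preparation for concat can be sketched as follows. This is a minimal illustration rather than the preprocessing code used for the experiments: the function name `concat_with_previous` is hypothetical, and the handling of the first sentence of a document (an empty context) is an assumption, since the text does not specify it.

```python
SEP = "<SEP>"

def concat_with_previous(sentences):
    """Prepend the previous sentence to each sentence, separated by SEP.

    `sentences` is the ordered list of sentences of one document, on either
    the source or the target side. Giving the first sentence an empty
    context is an assumption of this sketch.
    """
    examples = []
    previous = None
    for sentence in sentences:
        if previous is None:
            examples.append(f"{SEP} {sentence}")
        else:
            examples.append(f"{previous} {SEP} {sentence}")
        previous = sentence
    return examples

# The same function is applied to both sides so that they stay aligned.
source_doc = ["I told you before.", "It is red."]
target_doc = ["Ich habe dir schonmal gesagt.", "Es ist rot."]
source_examples = concat_with_previous(source_doc)
target_examples = concat_with_previous(target_doc)
```

Because the only difference from the sentence-level baseline lies in how the examples are built, the same transformer architecture can be trained on either version of the data.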

About ContraPro
ContraPro is a contrastive challenge set for English→German pronoun translation evaluation. The set consists of English sentences containing an anaphoric pronoun "it" and the corresponding German translations. Each example comes with three contrastive translations, which differ in the gender of the translation of it: er, sie, or es. The challenge set artificially balances the number of sentences where it is translated to each of these three German pronouns. The appropriate antecedent may be in the main sentence or in a previous sentence. For evaluation, a model needs to produce scores for all three possible translations, which are compared against ContraPro's gold labels.
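Contrastive evaluation of this kind can be sketched in a few lines. The data layout (`source`, `candidates`, `gold`) and the toy scorer are assumptions made for illustration; in practice `score_fn` would return the model's score (for example a log-probability) for a target given the source.

```python
def contrastive_accuracy(examples, score_fn):
    """Fraction of examples whose gold pronoun variant gets the highest score."""
    correct = 0
    for ex in examples:
        scores = {
            pronoun: score_fn(ex["source"], target)
            for pronoun, target in ex["candidates"].items()
        }
        predicted = max(scores, key=scores.get)
        correct += int(predicted == ex["gold"])
    return correct / len(examples)

example = {
    "source": "I saw a cat. <SEP> It was big.",
    "candidates": {
        "er": "Ich habe eine Katze gesehen. <SEP> Er war groß.",
        "sie": "Ich habe eine Katze gesehen. <SEP> Sie war groß.",
        "es": "Ich habe eine Katze gesehen. <SEP> Es war groß.",
    },
    "gold": "sie",
}

def always_es(source, target):
    # Toy stand-in for a model with a strong neuter prior (hypothetical).
    return 1.0 if " Es " in target else 0.0

accuracy = contrastive_accuracy([example], always_es)
```

A scorer that always prefers es fails this example, which is the kind of behaviour the contrastive setup is meant to expose.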
We create automatic adversarial attacks on ContraPro that modify theoretically inconsequential parts of the context sentence before the occurrence of it. Contrary to expectations, we find that accuracy degrades in all adversarial attacks. Results are presented in Figure 1.

Adversarial Attack Generation
Our three modifications are Phrase Addition, Possessive Extension, and Synonym Replacement, each described below. For example, Phrase Addition appends or prepends phrases containing implausible antecedents: "The Church is merciful but that's not the point. It always welcomes the misguided lamb."

Phrase Addition
This attack modifies the previous sentence by appending phrases such as "…but he wasn't sure" or prepending phrases such as "it is true:...". A range of other simple phrases can be used, which we leave out for simplicity. In general, all phrases we tried lowered the scores. These attacks introduce a human entity or a pleonastic or event-reference it (e.g. "it is true"), none of which are plausible antecedents for the anaphoric it. We present results for appending "it is true" in Figure 1. Results with using different phrases are presented in the Appendix. In all cases, we prepend or append the same phrase to all ContraPro examples.
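The phrase addition attack amounts to simple string manipulation of the context sentence. The sketch below is illustrative only: the helper names are hypothetical, it handles the English context alone, and dropping the final punctuation before appending is an assumption based on the "C but that's not the point." pattern listed in the Appendix.

```python
def append_phrase(context, phrase):
    """Append a phrase to the context, replacing its final punctuation mark."""
    stripped = context.rstrip()
    if stripped and stripped[-1] in ".!?":
        stripped = stripped[:-1]
    return f"{stripped} {phrase}"

def prepend_phrase(context, phrase):
    """Prepend a phrase to the unchanged context."""
    return f"{phrase} {context}"

context = "The Church is merciful."
appended = append_phrase(context, "but that's not the point.")
prepended = prepend_phrase(context, "it is true:")
```

The main sentence containing it is left untouched, so any change in the model's prediction can be attributed to the added phrase.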

Possessive Extension
This attack introduces a new human entity by extending the original antecedent A with a possessive noun phrase, e.g., "the woman's A". Only two-thirds of the 12,000 ContraPro sentences are linked to an antecedent phrase. Grammar and misannotated antecedents exclude half of the remaining phrases. We put POS-tag constraints on the antecedent phrases before extending them. This reduces our subset to 3,838 modified examples. Our possessive extensions can be humans (the woman's), organisations (the company's) and names (Maria's).

Synonym Replacement
This attack modifies the original German antecedent by replacing it with a German synonym of a different gender. For this we first identify the English antecedent and its most frequent synset in WordNet (Miller, 1995). We obtain a German synonym by mapping this WordNet synset to GermaNet (Hamp and Feldweg, 1997) synsets. Finally, we modify the correct German pronoun translation to correspond to the gender of the antecedent synonym. Approximately one quarter of the nouns in our ContraPro examples are found in GermaNet. In 1,531 cases, a synonym of different gender could be identified. Scoring well on the Synonym Replacement attack cannot be done without understanding the pronoun/noun relationship. This attack gets to the core of whether nmt uses cr heuristics instead.
We evaluate a random sample of 100 auto-modified examples as a quality control metric. We note 11 issues with semantically-inappropriate synonyms. Overall, in 14 out of 100 cases, the model switches from correct to incorrect predictions because of synonym-replacement. Only 4 out of these 14 cases come from the questionable synonyms, showing that the drop in ContraPro scores is meaningful.

Adversarial Attack Results
Our model scores 75.4% on the original ContraPro. This is a very strong result compared to previous work (Müller et al., 2018), largely owing to our model being trained on OpenSubtitles. The straightforward adversarial modifications we make drop the ContraPro scores by over 10%, as shown in Figure 1. We analyze examples that are scored incorrectly. Some of the attacks introduce an entity that can in principle be referenced by it, like extending the antecedent with "the company's". In these cases, the new entity's influence on the model is expected, although ideally, the prediction should not change. More surprisingly, attacks that introduce a human entity drop the scores as well. The two most striking examples are appending "…but he wasn't sure" and extending the original antecedent with Maria's. Our synonym replacement leads to a 6% drop in scores.
Intuitively, the adversarial attacks should not cause large drops in scores, yet the empirical evidence shows that they do. Nevertheless, no attack reduces the model's scores close to the original sentence-level baseline. Thus, we conclude that the concatenation model handles cr, but likely with brittle heuristics. Although the results expose potential issues with the model, it is still difficult to pinpoint the specific problems. This reveals a larger issue with pronoun translation evaluation that cannot be addressed with simple adversarial attacks on existing general-purpose challenge sets. We propose Contracat, a more systematic approach that targets each of the previously outlined cr pipeline steps with data synthetically generated from corresponding templates.

Templates
Automatic adversarial attacks offer less freedom than templates as many systematic modifications cannot be applied to the average sentence. Thus, our templates are based on the hypothetical coreference pipeline in Section 3 and target each of its three steps: i) Markable Detection, ii) Coreference Resolution and iii) Language Translation. Our minimalistic templates draw entities from sets of 25 animals, 20 human professions (McCoy et al., 2019), 15 foods, and 5 drinks, along with associated verbs and attributes. We use these sets to fill slots in our templates. Animals and foods are natural choices for subject and object slots referenced by it. Restricting our sets to interrelated concepts with generically applicable verbs (all animals eat and drink) ensures semantic plausibility. Other object sets, such as buildings, had more semantic implausibility issues and were not included in the final corpus.

Table 2: Examples for our templates.

Step | Template | Example
Priors | Grammatical Role | The cat ate the egg. It was big.
Priors | Order | I stood in front of the cat and the dog. It was big.
Priors | Verb | Wow! She unlocked it.
Markable Detection | Filter Humans | The cat and the actress were happy. However it was happier.
Coreference Resolution | Lexical Overlap | The cat ate the apple and the owl drank the water. It ate the apple quickly.
Coreference Resolution | World Knowledge | The cat ate the cookie. It was hungry.
Coreference Resolution | Pleonastic it | The cat ate the sausage. It was raining.
Coreference Resolution | Event Reference | The cat ate the carrot. It came as a surprise.
Language Translation | Antecedent Gender | I saw a cat. It was big. → Ich habe eine Katze gesehen. Sie war groß.

Template Generation
Our templates consist of a previous sentence that introduces at least one entity and a main sentence containing the pronoun it. We use contrastive evaluation to judge anaphoric pronoun translation accuracy for each template: for each English sentence we create three translated versions, one per German gender, e.g. "The cat ate the egg. It rained." and the corresponding "Die Katze hat das Ei gegessen. Er/Sie/Es regnete". To fill a template, we only draw pairs of entities with two different genders, i.e. for animal a and food f: gender(a) ≠ gender(f). This way we can determine whether the model has picked the right antecedent. We refer to "the model picking an antecedent" as the model scoring the target sentence containing the German third person pronoun with the antecedent's gender higher than the provided alternatives. First, we create templates that analyze priors of the model for choosing a pronoun when no correct translation is obvious. Then, we create templates with correct translations, guided by the three broad coreference steps. Table 2 provides examples for our templates and the results are shown in Figure 2. Template details (entity sets, statistics, etc.) are provided in the Appendix.
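The filling procedure, including the constraint that the two entities must differ in gender, can be sketched as below. The tiny vocabularies, the article tables and the function names are illustrative assumptions, not the released entity sets or generation code, and the separator token used for the concatenation model is omitted for brevity.

```python
from itertools import product

# Toy vocabularies: English noun -> (German noun, grammatical gender).
ANIMALS = {"cat": ("Katze", "f"), "dog": ("Hund", "m")}
FOODS = {"egg": ("Ei", "n"), "apple": ("Apfel", "m")}
NOMINATIVE = {"m": "Der", "f": "Die", "n": "Das"}
ACCUSATIVE = {"m": "den", "f": "die", "n": "das"}
PRONOUNS = ("Er", "Sie", "Es")

def entity_pairs():
    """All (animal, food) pairs whose German genders differ."""
    return [
        (animal, food)
        for animal, food in product(ANIMALS, FOODS)
        if ANIMALS[animal][1] != FOODS[food][1]
    ]

def fill(animal, food):
    """Build one English source and its three contrastive German targets."""
    animal_de, animal_gender = ANIMALS[animal]
    food_de, food_gender = FOODS[food]
    source = f"The {animal} ate the {food}. It was big."
    context = (
        f"{NOMINATIVE[animal_gender]} {animal_de} hat "
        f"{ACCUSATIVE[food_gender]} {food_de} gegessen."
    )
    targets = {p.lower(): f"{context} {p} war groß." for p in PRONOUNS}
    return source, targets

pairs = entity_pairs()
source, targets = fill("cat", "egg")
```

Excluding same-gender pairs such as dog and apple matters because the highest-scoring pronoun would otherwise not reveal which entity the model treated as the antecedent.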

Priors
Prior templates do not have a correct answer, but help to understand the model's biases. We expose three priors with our templates: i) grammatical roles prior (e.g. subject) ii) position prior (e.g. first antecedent) and iii) a general prior if no antecedent and only a verb is present.
For i), we create a Grammatical Role template where both subject and object are valid antecedents. We find that in 72.3% of the template instances, the model chooses the object as the antecedent.
For ii), we create a Position template where two objects are enumerated (see Table 2). We create an additional example where the order of the entities is reversed and test whether there are priors for specific nouns or, alternatively, for positions in the sentence.
The model shows a strong prior for neuter by predicting es in most cases, even if the two entities are masculine and feminine.
For iii), we create a Verb template, expecting that certain transitive verbs trigger certain object gender choices. We use 100 frequent transitive verbs and create sentences such as the example in Table 2. As expected, it is translated to the neuter es most of the time, with notable exceptions where the verb is strongly associated with a single noun, e.g. "Sie hat sie entriegelt" is scored higher for "She unlocked it". We presume that the reason for this is that to unlock a door is very common and door (Tür) is feminine in German.

Markable Detection with a Humanness Filter
Before doing the actual cr, the model needs to identify all possible entities that it can refer to. We construct a template that contains a human and an animal which are in principle plausible antecedents, if not for the condition that it does not refer to people. For instance, the model should always choose cat in "The actress and the cat were hungry. However it was hungrier.". We find that the model instead falls back to translating it to the neuter es in all cases.

Coreference Resolution
Having determined all possible antecedents, the model has to choose the correct one, relying on semantics, syntax, and discourse. The pronoun it can in principle be used as an anaphoric (referring to entities), event reference or pleonastic pronoun (Loáiciga et al., 2017). For the anaphoric it, we identify two major ways of identifying the antecedent: lexical overlap and world knowledge. Our templates for these categories are meant to be simple and solvable.
Overlap: Broadly speaking, the subject, verb, or object can overlap from the previous sentence to the main sentence, as well as combinations of them. This gives us five templates: i) subject-overlap ii) verb-overlap iii) object-overlap iv) subject-verb-overlap and v) object-verb-overlap. We always use the same template for the context sentence, e.g. "The cat ate the apple and the owl drank the water.". For the object-verb-overlap we would then create the main sentence "It ate the apple quickly." and expect the model to choose cat as antecedent. To keep our overlap templates order-agnostic, we vary the order in the previous sentence by also creating "The owl drank the water and the cat ate the apple." However, our results in 6.2 show that the model's predictions are almost completely random and are influenced by position priors, e.g., the first mentioned subject, or a prior for the neuter es when it needs to decide between the two subjects.
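The object-verb-overlap case with both clause orders can be sketched as follows. Only this one overlap variant is shown because it is the one spelled out in the text; the tuple representation of a clause and the function names are assumptions made for illustration.

```python
def clause(subject, verb, obj):
    """Render one clause such as 'the cat ate the apple'."""
    return f"the {subject} {verb} the {obj}"

def context_sentence(first, second):
    """Join two clauses into the context sentence, capitalizing the start."""
    text = f"{clause(*first)} and {clause(*second)}."
    return text[0].upper() + text[1:]

def object_verb_overlap(target, other):
    """Return (text, expected antecedent) pairs for both clause orders.

    `target` and `other` are (subject, verb, object) tuples. The main
    sentence repeats the verb and object of `target`, so its subject is
    the expected antecedent.
    """
    subject, verb, obj = target
    main = f"It {verb} the {obj} quickly."
    contexts = [context_sentence(target, other), context_sentence(other, target)]
    return [(f"{context} {main}", subject) for context in contexts]

instances = object_verb_overlap(("cat", "ate", "apple"), ("owl", "drank", "water"))
```

Generating both orders means that a model relying on a pure position prior can be right on at most one of the two variants.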
World Knowledge: cr has been traditionally seen as challenging as it requires world knowledge. Our templates test simple forms of world knowledge by using attributes that either apply to animal or food entities, such as cooked for food or hungry for animals. We then evaluate whether the model chooses e.g. cat in "The cat ate the cookie. It was hungry." As discussed later, the model occasionally predicts answers that require world knowledge, but most predictions are guided by a prior for choosing the neuter es or a prior for the subject.
Pleonastic and Event Templates: For the other two ways of using it, event reference and pleonastic-it, we again create a default previous sentence ("The cat ate the apple."). For the main sentence, we used four typical pleonastic and event reference phrases such as "It is a shame" and "It came as a surprise". We expect the model to correctly choose the neuter es as a translation every time and the strong prior for the neuter gender causes the model to do so nearly perfectly.

Translation to German
After cr, the decoder has to translate from English to German. In our contrastive scoring approach the translation of the English antecedent to German is already given. However, the decoder is still required to know the gender of the German noun to select between er, sie, or es. We test this with a list of concrete nouns selected from Brysbaert et al. (2014), which we filter for nouns that occur more than 30 times in the training data. We are left with 2051 nouns which are plugged into the "I saw a N . It was {big, small}." template.
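The frequency filter and the instantiation of the Gender template can be sketched as below. The German frame follows the template definition given in the Appendix ("Ich sah ein/eine/einen N. Er/Sie/Es war {groß, klein}."); the function names and the token-list representation of the training data are assumptions made for illustration.

```python
from collections import Counter

INDEFINITE_ACC = {"m": "einen", "f": "eine", "n": "ein"}
PRONOUN = {"m": "Er", "f": "Sie", "n": "Es"}
ADJECTIVES = {"big": "groß", "small": "klein"}

def frequent_nouns(nouns, training_tokens, min_count=30):
    """Keep nouns that occur more than `min_count` times in the training tokens."""
    counts = Counter(training_tokens)
    return [noun for noun in nouns if counts[noun] > min_count]

def gender_instances(noun_en, noun_de, gender):
    """Build (source, candidates, gold pronoun) for each adjective."""
    instances = []
    for adj_en, adj_de in ADJECTIVES.items():
        source = f"I saw a {noun_en}. It was {adj_en}."
        candidates = {
            pronoun.lower(): f"Ich sah {INDEFINITE_ACC[gender]} {noun_de}. {pronoun} war {adj_de}."
            for pronoun in PRONOUN.values()
        }
        instances.append((source, candidates, PRONOUN[gender].lower()))
    return instances

kept = frequent_nouns(["cat", "yak"], ["cat"] * 31 + ["yak"] * 30)
cat_instances = gender_instances("cat", "Katze", "f")
```

Since the sentence contains only one noun, an error on this template cannot come from antecedent selection and must reflect missing knowledge of the noun's gender.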

Results
We find that the model performs poorly when actual cr is required. It frequently falls back to choosing the neuter es or preferring a position (e.g. first of two entities) for determining the gender. For Markable Detection the model always predicts the neuter es regardless of the actual genders of the entities.
In the Overlap template, we find that the model fails to recognize the overlap and instead, has a general preference for one of the two clauses. We also see weak performance for world knowledge. An accuracy of 55.7% is slightly above the heuristic of randomly choosing an entity (= 50.0%). With a strong bias for the neuter es, the model has a high accuracy of 96.2% for event reference and pleonastic templates, where es is always the correct answer. Based on the strong performance on the Gender template in 6.1.4, we conclude the model consistently memorized the gender of concrete nouns. Hence, cr mistakes stem from Step 1 or Step 2, suggesting that the model failed to learn proper cr.

Augmentation
We present an approach for augmenting the training data. While data augmentation is challenging for nlp in general, we focus on a narrow problem which lends itself to easier data manipulation. Our previous analyses show that our model is capable of modeling the gender of nouns. However, they also show a strong prior to translate it to es and very little cr capability. Our goal with the augmentation is to break the strong prior and test if this can give rise to better cr in the model.
We attempt to do this by augmenting our training data and call it Antecedent-free augmentation (afa). We identify candidates for augmentation as sentences where a coreferential it refers to an antecedent not present in the current or previous sentence (e.g., I told you before. <SEP> It is red. → Ich habe dir schonmal gesagt. <SEP> Es ist rot.). We create augmentations by adding two new training examples where the gender of the German translation of "it" is modified (e.g., the two new targets are "Ich habe dir schonmal gesagt. <SEP> Er ist rot." and "Ich habe dir schonmal gesagt. <SEP> Sie ist rot."). The source side remains the same. An additional example is shown in Table 3. Antecedents and coreferential pronouns are identified using a cr tool (Clark and Manning, 2016a;Clark and Manning, 2016b). We fine-tune our already trained concatenation model on a dataset consisting of the candidates and the augmented samples. As a baseline, we fine-tune on the candidates only so as to confidently say that any potential improvements come from the augmentations.
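The core of the augmentation can be sketched as follows. This is a deliberately simplified illustration: it assumes the German pronoun is the first, nominative word of the main sentence, whereas the procedure described here relies on a cr tool to find candidates. The function name `afa_augment` is hypothetical.

```python
SEP = "<SEP>"
NOMINATIVE = ("Er", "Sie", "Es")

def afa_augment(source, target):
    """Return extra (source, target) pairs with the pronoun gender swapped.

    Simplifying assumption: the pronoun is the first word of the main
    target sentence and is in the nominative case. Examples that do not
    match this pattern are left unaugmented.
    """
    context, separator, main = target.partition(f" {SEP} ")
    if not separator:
        return []
    first, _, rest = main.partition(" ")
    if first not in NOMINATIVE or not rest:
        return []
    return [
        (source, f"{context} {SEP} {alternative} {rest}")
        for alternative in NOMINATIVE
        if alternative != first
    ]

augmented = afa_augment(
    "I told you before. <SEP> It is red.",
    "Ich habe dir schonmal gesagt. <SEP> Es ist rot.",
)
```

Keeping the source fixed while varying the target pronoun gives the model evidence that, without an antecedent in the context, no single gender should be preferred.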

Adversarial Attacks
afa provides large improvements, scoring 85.3% on ContraPro. Results are shown in Figure 3. The afa baseline (fine-tuning on the augmentation candidates only) improves by 1.94%, presumably because many candidates consist of coreference chains of "it" and the model learns they are important for coreferential pronouns. However, the improvement is small compared to afa.
Results on ContraPro for each gender (see Appendix) show that performance on er and sie is substantially increased, suggesting that the augmentation successfully removes the strong bias towards es. Templates provide further evidence about this. Although the adversarial attacks lower afa scores, the model is more robust than concat and the performance degradation is substantially lower (except on the synonym attack). We experimented with different learning rates during fine-tuning and present results with the lr that obtained the best baseline ContraPro score. Detailed scores in the Appendix show how lr can balance the scores across the three different genders. Furthermore, concat and afa obtain 31.5 and 32.2 BLEU on ContraPro, respectively, showing that this fine-tuning procedure, which is tailored to pronoun translation, does not lead to any degradation in general translation quality.

Templates
From the prior templates, we observe that the prior over gender pronouns is more evenly spread and not concentrated on es. This also provides for a more even distribution on the Position and Role Prior template. The results on the prior templates are presented in the Appendix. The augmented model is also substantially better on markable detection, improving by 27.6%.
Results for templates are presented in Figure 4. No improvements are observed on the World Knowledge template. Pleonastic cases are still reasonably handled, although not perfectly as with concat. The Event template identifies a systematic issue with our augmentation. We presume this is a result of the cr tool marking cases where it refers to events. We do not apply any filtering and augment these cases as well, thus creating wrong examples (an event reference it cannot be translated to er or sie). As a result, the scores are significantly lower compared to concat. We note that this issue with our model is not visible in the ContraPro and adversarial attack results. In contrast to the Event template, afa performs on par with the unaugmented baseline on the Gender template. However, despite increasing by 3.8%, results on Overlap are still underwhelming. Our analysis shows that augmentation helps in changing the prior. We believe this provides for improved cr heuristics which in turn provide for an improvement in coreferential pronoun translation. Nevertheless, the Overlap template shows that augmented models still do not solve cr in a fundamental way.

Conclusion
In this work, we study how and to what extent cr is handled in context-aware nmt. We show that standard challenge sets can easily be manipulated with adversarial attacks that cause dramatic drops in performance, suggesting that nmt uses a set of heuristics to solve the complex task of cr. Attempting to diagnose the underlying reasons for these results, we propose targeted templates which systematically test the different aspects necessary for cr. This analysis shows that while some types of cr, such as pleonastic and event cr, are handled well, nmt does not solve the task in an abstract sense. We also propose a data augmentation approach which substantially improves performance on some metrics, but it does not change the general conclusions we infer from the templates. Future work should be evaluated on our adversarial attacks and Contracat, which we publicly release, to realistically estimate the ability of nmt to robustly do cr.

A Preprocessing Details and Model Hyper-Parameters
We use OpenSubtitles2018 (Lison and Tiedemann, 2016) as training data. We tokenize the dataset using the Moses scripts (Koehn et al., 2006). We apply BPE, learned jointly on English and German with 32K merge operations. We remove all samples where the main sentence exceeds 100 tokens on the BPE-level or the concatenated sample contains more than 200 tokens. ContraPro is built using OpenSubtitles and contains samples from it. As a result, we remove the entire documents from which the ContraPro samples originate, in order to remove any exact duplicates from ContraPro or similar contexts which may lead to unfair advantages for the models. This still leaves some exact duplicates between our training data and ContraPro which we also remove. The model is finally trained on ≈16.7M samples. We train the transformer models with a batch size of 4096. We use an initial learning rate of $10^{-4}$ and we lower it by a factor of 0.7 if there are no improvements on the validation perplexity for 8 checkpoints. We save a checkpoint every 4000 updates.
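The length filter described above can be sketched as a single predicate. The function name is hypothetical, and counting the separator token towards the concatenated length is an assumption, since the text does not say whether it is included.

```python
def keep_sample(context_bpe, main_bpe, max_main=100, max_concat=200):
    """Return True if a concatenated training sample passes both length limits.

    `context_bpe` and `main_bpe` are lists of BPE tokens for the previous
    and the main sentence. Counting the separator token (+1) is an
    assumption of this sketch.
    """
    concat_length = len(context_bpe) + 1 + len(main_bpe)
    return len(main_bpe) <= max_main and concat_length <= max_concat

short_enough = keep_sample(["a"] * 50, ["b"] * 100)
main_too_long = keep_sample(["a"] * 50, ["b"] * 101)
concat_too_long = keep_sample(["a"] * 150, ["b"] * 50)
```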
The transformer models we use are a 6 layer encoder/decoder with 8 attention heads. The model size is 512 and the size of the feed-forward layers is 2048. We tie the source, target and output embeddings. We use label smoothing with 0.1 and dropout in the transformer of 0.1. Models are trained on 2 GTX 1080 GPUs with 8GB RAM. The final model is an average of the 8 best checkpoints based on validation perplexity. The models we train are implemented in Sockeye (Hieber et al., 2017).

B.1 Adding Phrases
As a reference point for the scores shown in Table 4, our model scores 75.4% on unmodified ContraPro. C is the original context sentence from ContraPro.

B.2 Possessive Extension
These were applied to 3,838 ContraPro examples. As a reference point for the scores shown in Table 5, our model scores 72.9% on the unmodified subset of ContraPro. A refers to the original antecedent noun phrase.

Table 4: Phrase addition results.

Modification | ContraPro Score
he/she said: "C" | 66.5 / 66.3
it is true: "C" / it is true: C | 55.2 / 63.5
C and it is true. / C. it is true. / C and that is true. | 69.0 / 65.8
C but that's not the point. / C. but that's not the point. | 70.7 / 67.5
C but there is a catch. / C. but there is a catch. |

B.3 Synonym replacement
These were applied to 1,531 ContraPro examples. Our model has a score of 69.8% on the unmodified subset of ContraPro. When replacing with different-gender synonyms we drop to a score of 64.1%.

C.1 Vocabulary
Our templates draw from the sets of entities shown in Table 6 and Table 7. The translations in German are shown in brackets. We note that all entities appear in the training dataset we use to train our models. The least frequent entity ("kangaroo") appears 134 times. We also use four event-it and four pleonastic-it phrases, which are used as the main sentence in the templates and referred to later.
Event: It came as a surprise (Es kam überraschend), It actually happened (Es ist tatsächlich passiert), It resulted in chaos (Es führte zu Chaos), It was a funny situation (Es war eine lustige Situation)
Pleonastic: It was raining (Es regnete), It is a shame (Es ist eine Schande), It seemed this was unnecessary (Es schien, dass dies unnötig war), It is hard to believe this is true (Es ist schwer zu glauben, dass das wahr ist)

C.2 Template Statistics
For each template, we report the number of lines it contains in Table 8.

C.3 Template Definitions
The template definitions are shown in Table 9. We refer to animals as A, professions as P, foods as F and drinks as D. When creating a concrete animal, food or drink X_i, we use the definite article "the" ("der/die/das" in German).

Table 9: Template definitions. * We switch the position (first or second) of the two involved entities E_i and E_j.
Ich sah ein/eine/einen N_concrete. Er/Sie/Es war {groß, klein}.

C.4 Prior Results
For the templates that do have a correct answer, we show results in the main paper. In Table 10, Table 11 and Table 12 we show the results on the grammatical, position and verb prior templates.

D.1 Details
For all augmentations we use spaCy's dependency parser in order to determine the case of the pronoun. This is necessary because the feminine ("sie") and neuter ("es") pronouns have the same form in nominative and accusative, but the masculine does not ("er" and "ihn"). We fine-tuned on 207K for the antecedent-free augmentations.

D.2 Fine-Tuning Learning Rate Analysis
We conducted 3 different fine-tuning experiments where we varied the learning rate. We used learning rates of $2 \times 10^{-6}$, $2 \times 10^{-7}$ and $2 \times 10^{-8}$. The initial concatenation model was trained with an initial LR of $2 \times 10^{-4}$ and when it converged, the learning rate was $7.82 \times 10^{-8}$. Results are presented in Table 13. As before, we average 8 checkpoints before evaluating our models.
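As a sanity check on these numbers, the short computation below estimates how many learning-rate reductions separate the initial and the converged learning rate. It assumes that every reduction multiplies the rate by the factor of 0.7 stated in Appendix A; the variable names are illustrative.

```python
import math

initial_lr = 2e-4      # initial LR reported in this section
converged_lr = 7.82e-8 # LR at convergence reported in this section
factor = 0.7           # reduction factor from Appendix A (assumed to apply throughout)

# Solve initial_lr * factor**n = converged_lr for n.
reductions = math.log(converged_lr / initial_lr) / math.log(factor)
reconstructed_lr = initial_lr * factor ** round(reductions)

print(f"estimated reductions: {reductions:.2f}")
print(f"LR after {round(reductions)} reductions: {reconstructed_lr:.3e}")
```

Under this assumption the two reported values are consistent with about 22 reductions, which also places the three fine-tuning learning rates between the initial and the converged rate of the original training run or just below it.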