An Empirical Study of Contextual Data Augmentation for Japanese Zero Anaphora Resolution

One critical issue of zero anaphora resolution (ZAR) is the scarcity of labeled data. This study explores how effectively this problem can be alleviated by data augmentation. We adopt a state-of-the-art data augmentation method, called contextual data augmentation (CDA), which generates labeled training instances using a pretrained language model. CDA has been reported to work well for several other natural language processing tasks, including text classification and machine translation. This study addresses two underexplored issues of CDA, namely, how to reduce the computational cost of data augmentation and how to ensure the quality of the generated data. We propose two methods to adapt CDA to ZAR: [MASK]-based augmentation and linguistically-controlled masking. The experimental results on Japanese ZAR show that our methods contribute to both accuracy gains and computational cost reduction. Our closer analysis reveals that the proposed method can improve the quality of the augmented training data compared with the conventional CDA.


Introduction
In pro-drop languages, such as Japanese and Chinese, the arguments of a predicate are frequently omitted in sentences. Automatic recognition of such omitted arguments plays an important role in understanding the predicate-argument structure of a sentence. This task is referred to as zero anaphora resolution (ZAR). Figure 1a depicts an example of the task. In the sentence, the nominative argument of the predicate was acquitted is omitted. Such an omitted argument is referred to as a zero pronoun, often represented as φ. In Figure 1a, φ refers to the man.
One critical issue of ZAR is the scarcity of labeled data. To compensate for the scarcity, several studies have explored the direction of exploiting unlabeled data (Sasano et al., 2008; Sasano and Kurohashi, 2011; Chen and Ng, 2014; Chen and Ng, 2015; Liu et al., 2017; Yamashiro et al., 2018; Kurita et al., 2018). Another promising direction is to automatically augment labeled data (data augmentation). One state-of-the-art data augmentation method is contextual data augmentation (CDA), which augments labeled data by replacing an arbitrary token(s) with another token(s) predicted to be plausible in the surrounding context by a pretrained language model (LM). CDA works well in several natural language processing (NLP) tasks, such as text classification (Kobayashi, 2018; Wu et al., 2019) and machine translation (Gao et al., 2019). However, unlike the direction of exploiting unlabeled data, the potential of data augmentation in the context of ZAR has largely remained underexplored. This study investigates how the idea of CDA can be effectively employed for ZAR, taking Japanese ZAR as a case study.
Figure 1 illustrates a straightforward approach of applying the idea of CDA to ZAR. For each given training instance (i.e., input sentence (a) paired with its gold labels (L)), a standard CDA method would use a pretrained LM to generate a variant(s) of the original sentence, say sentence (b). The new sentence (b) is then paired with the original gold labels (L) and is added to the training set (detailed in Section 4.1). This straightforward approach raises two critical issues. One is that the computational cost of CDA can be non-trivial because it runs a huge pretrained LM for each input sentence in the training set. Note that even with a relatively small training set, this cost can be enormous when one wants to repeatedly conduct experiments with a large variety of settings. The second issue is that CDA may produce improper instances. Figure 1b shows that the original noun man is properly replaced with another noun criminal. By saying that a word is properly replaced with another, we mean that the replacement does not affect the anaphoric relation labels of the original sentence. This requirement is critical because we want to use the produced instances with the original ZAR signals to train our model. However, if not appropriately controlled, CDA may disrupt the original anaphoric relations, as in Figure 1c, where replacing the verb sued with struck affects the anaphoric relation between φ and boy. These issues can be crucial when applying CDA to ZAR, but they have not been addressed in previous works applying CDA to other NLP tasks.
To address these two issues, we newly design a CDA variant with two directions of extension: (i) [MASK]-based augmentation and (ii) linguistically-controlled masking. Unlike a standard CDA method, which replaces tokens with other tokens, the [MASK]-based augmentation replaces tokens with the [MASK] token (Section 4.2). This simple extension halves the cost of running a pretrained LM. Our experiments show that it also improves the performance of ZAR. The second extension, linguistically-controlled masking, provides a means of controlling the types of tokens to be replaced so as to inhibit improper token replacements (Section 4.3). The token replacement quality must be considered even in the [MASK]-based augmentation scheme because replacing a token with the [MASK] token may also affect the anaphoric relations. Our experiments demonstrate that controlling token replacement by part-of-speech (POS) tags effectively inhibits improper token replacements and improves the performance of ZAR (see Section 6 for details).
Through extensive experiments, we reveal the relationship between the performance gain and the replacement of each token type under various masking probabilities. We deepen our understanding of the change of antecedents through a detailed analysis of actual augmented data. In summary, our main contributions are as follows:
• Two technical modifications to a standard CDA method;
• Extensive empirical results indicating which types of tokens should (or should not) be the targets for replacement; and
• An in-depth analysis of the change of antecedents of zero pronouns, suggesting a direction for future improvements.
2 Related Work

Although most previous ZAR studies adopted supervised learning methods, the amount of labeled data is not sufficient to teach a model the knowledge needed for accurately resolving anaphoric relations. Some studies tried extracting and exploiting such knowledge from large-scale unlabeled data (e.g., case frame construction (Sasano et al., 2008; Sasano and Kurohashi, 2011; Yamashiro et al., 2018), pseudo training data generation (Liu et al., 2017), and semi-supervised adversarial training (Kurita et al., 2018)). Although findings on using unlabeled data for ZAR have been accumulated, the automatic augmentation of labeled data has been underinvestigated. One reason for this is the difficulty of producing high-quality pseudo data from the original labeled data. Some existing methods of text data augmentation take advantage of hand-crafted resources (Zhang et al., 2015; Wang and Yang, 2015) and rules (Fürstenau and Lapata, 2009; Kafle et al., 2017). With the recent advances of LMs (Peters et al., 2018; Devlin et al., 2019), CDA, a new method using LMs, has been proposed and reported to effectively increase labeled data in several NLP tasks (Kobayashi, 2018; Wu et al., 2019; Gao et al., 2019). This method offers a variety of good substitutes for original tokens without any hand-crafted resources. In ZAR, a key to performance improvement is training a model on anaphoric relations in various contexts. However, the original labeled data covers only a limited portion of the possible contextual diversity. Here, CDA is a promising tool for producing good contextual variations of each original sentence. Considering this advantage, we introduce CDA to ZAR and provide extensive empirical results to encourage future exploration of this direction.
3 Japanese Zero Anaphora Resolution

Problem Formulation and Notation
Japanese ZAR is often formulated as a part of predicate-argument structure (PAS) analysis, which includes the task of detecting syntactically dependent arguments (DEPs) in addition to ZAR. Given a sentence X and a list of predicates p, the Japanese PAS analysis task aims to identify the head words of the nominative (NOM), accusative (ACC), and dative (DAT) arguments of each predicate. Our task definition strictly follows that of previous studies (Iida et al., 2015; Ouchi et al., 2017; Matsubayashi and Inui, 2017; Matsubayashi and Inui, 2018), making our results comparable with those published previously.
In the following sections, we consider that the input sentence X = (x_1, ..., x_I) consists of a sequence of one-hot vectors x_i ∈ {0, 1}^{|V|}, each of which represents a word in the sentence, where V denotes the vocabulary set. Each sentence comprises J predicates p = (p_1, ..., p_J), where p_j ∈ N is a natural number indicating a predicate position.

Baseline Model
The baseline model employed in our experiments (hereinafter referred to as the MP-masked language model, MP-MLM) is based on the multi-predicate (MP) model proposed by Matsubayashi and Inui (2018). Figure 2a depicts an overview of the MP-MLM. The only difference is that our MP-MLM uses the sequence of final hidden states of a pretrained MLM, such as BERT (Devlin et al., 2019), as the input for the MP model, whereas the original MP model uses conventional non-contextual word embeddings as the input layer. The MP-MLM is a sequential labeling model. Formally, given a sentence X, predicate positions p, and a target predicate position p_j ∈ p, the MP-MLM models a conditional probability P(y_{i,j} | X, p, i, j), where y_{i,j} ∈ {NOM, ACC, DAT, NONE} is an argument label for the pair of the i-th word x_i and the j-th predicate.

The model first obtains the sequence of final hidden states of the pretrained MLM, (e_1, ..., e_I), where e_i ∈ R^D.² Each state e_i is then concatenated with two binary values b^{target}_i and b^{others}_i, where b^{target}_i represents whether or not the word is the target predicate, and b^{others}_i represents whether or not the word is one of the other predicates. This concatenation operation is presented as follows:

h^0_{i,j} = e_i ⊕ b^{target}_i ⊕ b^{others}_i,

where ⊕ denotes the concatenation operation, and h^0_{i,j} ∈ R^{D+2} denotes the output of this concatenation. The obtained vectors are then fed into a K-layer bi-directional RNN (BiRNN) with residual connections (He et al., 2016) and alternating directions (Zhou and Xu, 2015):

h^k_{i,j} = RNN_k(h^{k-1}_{i,j}) + h^{k-1}_{i,j},

where h^k_{i,j} ∈ R^M denotes the output of the k-th RNN layer, and RNN_k denotes the function representing the k-th RNN layer. Gated recurrent units (Cho et al., 2014) are used as the RNN cells. A four-dimensional vector representing the probability distribution P(y_{i,j} | X, p, i, j) ∈ R^4 is then obtained by applying a softmax layer to the output h^K_{i,j} ∈ R^M:

P(y_{i,j} | X, p, i, j) = softmax(W h^K_{i,j}),

where W ∈ R^{4×M} is a classification layer. For each argument slot l ∈ {NOM, ACC, DAT} of each predicate, we eventually select the word x_i with the maximum probability P(y_{i,j} = l | X, p, i, j) as the argument if the probability exceeds a threshold θ_l ∈ R; otherwise, the slot remains empty.
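As a rough illustration, the thresholded selection of argument words described above can be sketched as follows (a minimal sketch with toy probabilities; the function and variable names are ours, not from the original implementation):

```python
def select_argument(probs, label_index, threshold):
    """Select the word position with the highest probability for one
    argument slot, or None if no probability exceeds the threshold theta_l.

    probs: list of per-word distributions [P(NOM), P(ACC), P(DAT), P(NONE)].
    label_index: column of the slot label l (e.g., 0 for NOM).
    """
    slot_probs = [row[label_index] for row in probs]
    best = max(range(len(slot_probs)), key=slot_probs.__getitem__)
    return best if slot_probs[best] > threshold else None

# Toy distributions P(y_{i,j} | X, p, i, j) over three candidate words.
probs = [
    [0.7, 0.1, 0.1, 0.1],  # word 0: likely NOM
    [0.2, 0.5, 0.2, 0.1],  # word 1: likely ACC
    [0.1, 0.1, 0.2, 0.6],  # word 2: likely NONE
]
print(select_argument(probs, label_index=0, threshold=0.5))  # 0 (word 0 as NOM)
print(select_argument(probs, label_index=2, threshold=0.5))  # None (DAT slot empty)
```

With a slot-specific threshold, a slot is filled only when the model is sufficiently confident, which mirrors how empty argument slots are handled in the paper.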

Method
In this section, we first review an existing standard CDA method (Section 4.1), which is the starting point of our data augmentation for ZAR. We then discuss quantitative and qualitative aspects of CDA: (i) the computational inefficiency of symbolic replacement and (ii) the change of antecedents. Considering these aspects, we improve the standard CDA method with two techniques, namely [MASK]-based augmentation (Section 4.2) and linguistically-controlled masking (Section 4.3).

Contextual Data Augmentation (CDA)
CDA aims to replace token(s) in a given sentence with other token(s) that seem probable in the given context. Sentences containing replaced tokens are regarded as new training instances. The alternative tokens for replacement are selected on the basis of the probability distribution output by a pretrained LM. For example, existing studies have used bi-directional RNN-based LMs (Kobayashi, 2018) and MLMs (Wu et al., 2019) to compute the distribution. Our CDA formulation closely follows that proposed by Wu et al. (2019). We first replace token(s) in X with [MASK] and obtain X′ as follows:

x′_i = [MASK] with probability α; x′_i = x_i otherwise, (8)

where α is the probability of replacement with [MASK]. We then feed X′ to the MLM to obtain a sequence of final hidden states (e_1, ..., e_I), where e_i ∈ R^D. The MLM then computes the token for replacement x̂_i as

x̂_i = F(P_i), where P_i = softmax(W_MLM e_i), (11)

where W_MLM ∈ R^{|V|×D} denotes the classification layer of the pretrained MLM, P_i ∈ R^{|V|} denotes the probability distribution over the vocabulary, and F denotes a function for selecting one word from P_i.
The [MASK] tokens are then replaced with the predicted tokens to obtain a new sequence X′′:

x′′_i = x̂_i if x′_i = [MASK]; x′′_i = x_i otherwise.

Finally, X′′ is paired with the original gold labels of X and regarded as a new training instance.
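The mask-then-fill procedure of the standard CDA can be sketched as follows (a minimal sketch in which a stub function stands in for the pretrained MLM; `mask_tokens`, `fill_masks`, and `stub_mlm` are illustrative names, not the authors' code):

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, alpha, rng):
    """Replace each token with [MASK] independently with probability alpha."""
    return [MASK if rng.random() < alpha else t for t in tokens]

def fill_masks(masked, predict_distribution, vocab):
    """Fill each [MASK] with the argmax token of a distribution over the
    vocabulary; predict_distribution(masked, i) stands in for the MLM."""
    filled = list(masked)
    for i, t in enumerate(masked):
        if t == MASK:
            dist = predict_distribution(masked, i)
            filled[i] = vocab[max(range(len(vocab)), key=lambda k: dist[k])]
    return filled

# Stub distribution standing in for a pretrained MLM (assumption: a real
# system would run BERT here and use its softmax output).
vocab = ["the", "man", "criminal", "sued"]
def stub_mlm(masked, i):
    return [0.1, 0.2, 0.6, 0.1]  # always prefers "criminal"

rng = random.Random(0)
tokens = ["the", "man", "sued", "the", "boy"]
masked = mask_tokens(tokens, alpha=0.3, rng=rng)
augmented = fill_masks(masked, stub_mlm, vocab)
print(masked, augmented)
```

The augmented sentence is then paired with the original gold labels, exactly as described above; note that nothing in this procedure checks whether the filled tokens preserve the anaphoric relations.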
As mentioned in Section 1, we improve the above CDA method in two respects. First, we improve the computational efficiency. While good substitutes for the original tokens can be produced by using a gigantic MLM, the computational cost for training the model increases, and this cost cannot be ignored. Although the amount of training data for Japanese ZAR is relatively small, the computational cost can be enormous when one wants to conduct multiple experiments with various settings. In Section 5, we indeed conduct extensive experiments to explore the use of CDA for ZAR. We design a technique that halves the computational cost of incorporating the abovementioned CDA method into the baseline model. This technique improves the performance of ZAR more effectively than a standard CDA on the same amount of augmented training data. Second, we improve controllability. Our preliminary experiments suggested that replacing certain types of tokens is likely to change the antecedent of each zero pronoun. If another token (one different from the original) can be interpreted as the antecedent after the replacement, the original label will no longer make sense, which confuses the model. We therefore design a method that enables controlling the types of tokens to be replaced (Section 4.3).

[MASK]-based Augmentation
We present the [MASK]-based augmentation, which simplifies the integration of the original CDA into the baseline model (Section 3.2) for computational efficiency. Figure 2c shows an overview of the method. Instead of using the token-replaced sequence X′′, we use the masked sequence X′ (Equation 8) as the new training data. That is, while the MLM has to be run over each input sequence twice when using the original CDA (once for predicting substitutes and once for computing their feature representations), the MLM has to be run only once in our simplified scheme. This one-pass running of the MLM halves the computational cost of the original CDA. In particular, the masked sequence X′ is fed into the MLM, and the sequence of its final hidden states E = (e_1, ..., e_I) is obtained. The remaining computations are the same as those for the MP-MLM described in Section 3.2. Note that the MLM is pretrained to fill each [MASK] with a token that is probable in the given context. We can thus expect that the final hidden state of each [MASK] is an abstract representation, a mixture of contextually possible word representations for the given context. This is not the case for the original CDA approach (Figure 2b), where the tokens are symbolically replaced with other tokens.
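The cost argument above can be sketched as follows (an illustrative sketch; `mask_based_instance` and the pass-counting helpers are our own names, and the gold labels are toy values):

```python
MASK = "[MASK]"

def cda_passes(num_sentences):
    """Standard CDA: one MLM pass to predict substitute tokens, plus a second
    pass over the substituted sentence to compute its input features."""
    return 2 * num_sentences

def masking_passes(num_sentences):
    """[MASK]-based augmentation: the masked sentence itself is the training
    input, so the MLM runs only once per augmented sentence."""
    return num_sentences

def mask_based_instance(tokens, mask_positions, gold_labels):
    """Build a [MASK]-based training instance: mask the chosen positions and
    keep the original gold labels unchanged."""
    masked = [MASK if i in mask_positions else t for i, t in enumerate(tokens)]
    return masked, gold_labels

tokens = ["the", "man", "sued", "the", "boy"]
gold = {"NOM": 4}  # toy gold label (illustrative): "boy" fills the NOM slot
masked, kept = mask_based_instance(tokens, {0, 3}, gold)
print(masked)  # ['[MASK]', 'man', 'sued', '[MASK]', 'boy']
```

The masked sequence is fed directly to the MLM and the downstream MP model; no symbolic substitution step is needed, which is where the factor-of-two saving comes from.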

Linguistically-controlled Masking
We add a new function, called linguistically-controlled masking, to the abovementioned [MASK]-based augmentation framework. This function enables us to freely choose which types of tokens will be the targets for replacement. As explained with Figure 1 in Section 1, replacing certain types of tokens is likely to change the antecedent of each zero pronoun. Hence, instead of replacing arbitrary tokens, we utilize POS tags to control the types of tokens to be replaced. Specifically, we add a POS constraint to Equation 8:

x′_i = [MASK] with probability α if ψ(x_i) ∈ S; x′_i = x_i otherwise,

where S denotes a set of target POS tags to be replaced, and ψ denotes a function that returns the POS tag s_i of the i-th word x_i. In other words, x_i is replaced with [MASK] only if its POS tag belongs to the set of target POS tags. The remaining operations are the same as those for the MP-MLM described in Section 3.2.
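The POS constraint can be sketched as follows (a minimal sketch; the POS tagset and the sentence are toy examples, and `pos_controlled_mask` is an illustrative name):

```python
import random

MASK = "[MASK]"

def pos_controlled_mask(tokens, pos_tags, target_pos, alpha, rng):
    """Mask token i with probability alpha only if its POS tag is in the
    target set S; tokens with other POS tags are never masked."""
    out = []
    for tok, pos in zip(tokens, pos_tags):
        if pos in target_pos and rng.random() < alpha:
            out.append(MASK)
        else:
            out.append(tok)
    return out

# "All-but-verb" masking S_{all\verb}: every POS category except verbs
# is a target for replacement (toy tagset, illustrative).
all_pos = {"noun", "verb", "particle", "symbol", "adj"}
s_all_but_verb = all_pos - {"verb"}

tokens = ["scratches", "TOP", "fixed", "not", "noticeable"]
pos    = ["noun", "particle", "verb", "particle", "verb"]
rng = random.Random(0)
masked = pos_controlled_mask(tokens, pos, s_all_but_verb, alpha=1.0, rng=rng)
print(masked)  # ['[MASK]', '[MASK]', 'fixed', '[MASK]', 'noticeable']
```

With α = 1.0 the behavior is deterministic: the verbs survive, so the predicate-predicate relations that anchor the zero pronoun's antecedent are preserved.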

Experiments
In this section, we first investigate the effectiveness of [MASK]-based augmentation by comparing it with the baseline CDA (Section 4.1) on the validation set (Section 5.2). We then investigate the relationship between the performance gain and each POS category to be replaced under several masking probabilities, to find the best configuration of linguistically-controlled masking on the validation set (Section 5.3). Finally, we compare the best-configured model with other existing models on the test set (Section 5.4).

Experimental Configuration
The NAIST Text Corpus (NTC) 1.5 (Iida et al., 2010; Iida et al., 2017) with the standard dataset splits proposed by Taira et al. (2008) was used for the experiments. NTC 1.5 is a benchmark dataset commonly adopted by previous studies (Ouchi et al., 2017; Matsubayashi and Inui, 2018; Omori and Komachi, 2019). Table 1 presents the number of instances in NTC 1.5, and Table 2 presents the number of tokens for each POS tag in the NTC 1.5 training set.

Figure 3: The effect of changing the masking probability α on ZAR F1.
We used both the modified sentence X′ and the original sentence X in a 1:1 ratio for training. Each model was trained using 10 different random seeds. The POS tags from the NTC gold standard were used in the experiments. The argmax function was employed as F in Equation (11), which denotes the function for selecting the most probable word from the probability distribution. The average F1-scores are reported for both ZAR and DEP. BERT (Devlin et al., 2019) pretrained on Japanese Wikipedia (Shibata et al., 2019) was employed as the MLM. The BERT parameters were kept fixed throughout the experiments. The target predicate was never masked. Table 3 shows the set of employed hyper-parameters.
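The 1:1 mixing of original and augmented instances can be sketched as follows (illustrative; `build_training_set` is our own name and the instances are toy data):

```python
def build_training_set(originals, augment_fn):
    """Mix each original instance with one augmented variant (1:1 ratio),
    as in the experimental setup. augment_fn can be any of the masking
    schemes; the gold labels are shared between the pair."""
    data = []
    for tokens, labels in originals:
        data.append((tokens, labels))              # original instance
        data.append((augment_fn(tokens), labels))  # augmented, same gold labels
    return data

# Toy corpus of two labeled instances (illustrative labels).
originals = [(["a", "b"], {"NOM": 0}), (["c", "d"], {"ACC": 1})]
mixed = build_training_set(originals, lambda toks: ["[MASK]"] * len(toks))
print(len(mixed))  # 4: two originals plus two augmented variants
```

Fixing the masking positions per epoch, as the paper does, would amount to calling `augment_fn` once per epoch rather than once per batch.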

Effectiveness of [MASK]-based Augmentation
We investigated the effectiveness of the [MASK]-based augmentation, where all tokens were considered as targets for replacement. The best masking probability α was identified from {0.1, 0.3, 0.5, 0.7, 0.9, 1.0}. The masking positions were fixed for each epoch to reduce the computational cost. Hereafter, we refer to the proposed method as MASKING, which was compared to the base CDA. In the CDA, all [MASK] tokens were filled simultaneously. We also report the performance of the BASELINE (i.e., MP-MLM) with a doubled amount of training data, namely BASELINE (2X). BASELINE (2X) enables a fair comparison using the same amount of labeled data as that used by CDA and MASKING.
Table 4 shows the results. MASKING achieved the best F1 scores in all the categories. Note that BASELINE (2X) did not outperform BASELINE, which indicates that the improvement was achieved by MASKING rather than by the increased amount of identical training instances. MASKING also outperformed the base CDA, illustrating that our simplification of the base CDA contributed to both the prediction performance and the computational efficiency.

Effectiveness of Linguistically-controlled Masking
We varied the target POS tagset for masking and the masking probability to investigate the relationship between the performance gain and each POS category. We left each one out of all the POS categories and treated the rest as targets for replacement. Let S_all\verb denote that all POS categories except verbs are targets for replacement (i.e., "all-but-verb masking"); we refer to the others in the same manner. The models were trained with every possible combination of the following settings: (i) POS category: {S_all\noun, S_all\verb, S_all\particle, S_all\symbol, S_all}; and (ii) masking probability α: {0.1, 0.3, 0.5, 0.7, 0.9, 1.0}. Figure 3 shows the relationship between ZAR F1 and the number of masked tokens for each POS-controlled masking. The horizontal axis represents the proportion of masked tokens to all tokens in the training data. All-but-verb masking S_all\verb tended to achieve the best results, and all-but-symbol masking S_all\symbol was on par with or slightly better than S_all. While these three outperformed BASELINE in most cases, the others (i.e., S_all\noun and S_all\particle) did not. These performance differences were also observed under the condition of the same number of masked tokens; that is, controlling masked tokens by their POS category affected the performance. Moreover, a relatively high masking probability of approximately 0.5 tended to yield better performance.
Table 5 shows the result of the model achieving the highest ZAR F1 for each POS-controlled masking. Here, we also trained masking models with a single target POS category (e.g., S_verb). The results show that all-but-verb masking S_all\verb achieved a better ZAR F1 than BASELINE. By contrast, verb-only masking S_verb did not improve the performance compared with BASELINE. These results imply that verbs should not be targets for replacement. In this way, POS category-based analysis and observation are facilitated by our linguistically-controlled masking. A more detailed analysis is provided in Section 6.

Comparison with the Existing Models
We compared the model built by all-but-verb masking S_all\verb, which achieved the best ZAR F1 on the validation set (Table 5), with state-of-the-art models (Matsubayashi and Inui, 2018; Omori and Komachi, 2019) on the test set. Table 6 shows the results. We refer to our method combining [MASK]-based augmentation and linguistically-controlled masking in the optimal setting as MASKING (S_all\verb). Our BASELINE already outperformed the state-of-the-art results by a large margin. When new techniques are evaluated only on very basic models, it can be difficult to determine how much (if any) improvement will carry over to stronger models (Denkowski and Neubig, 2017; Suzuki et al., 2018). In contrast to this concern, MASKING (S_all\verb) consistently achieved statistically significant improvements over the strong BASELINE in both the single- and ensemble-model settings. This solid finding is likely to generalize to other strong models. In contrast, the ZAR F1 of the CDA was on par with that of BASELINE, which shows that the symbolic replacement of tokens is not effective.

Analysis of Augmented Data
In this section, we analyze actual instances augmented by linguistically-controlled masking. The model built by all-but-verb masking S_all\verb achieved the best ZAR F1; hence, we further investigated the negative effect of verb replacement. We did this by observing the actual instances produced by S_all\verb and verb-only masking S_verb. Figure 4 depicts three types of sentences for these two settings: the original sentences X, the masked sentences X′, and the sentences with symbolically replaced tokens X′′. Note that although our method does not use any symbolic substitutions X′′ for resolving zero anaphora (Section 4.2), they were produced here for the analysis. Looking at the symbolic token generated from each mask gives us an intuition about the feature vector of that mask. For example, if the token exist is generated from a mask, we can interpret that its corresponding vector approximately represents the meaning of exist.
In the original sentence X, the nominative argument of the target predicate noticeable was realized as a zero pronoun φ, and its antecedent was The scratches. In the sentence generated by S_verb, the symbolic token exist filled the [MASK] (X′′_1), and the semantic structure of the sentence, especially the original relation between the two predicates (verbs) be fixed and noticeable, was changed. The antecedent can then be interpreted as null, which differs from the original antecedent The scratches. In contrast, in the sentence generated by S_all\verb (X′′_2), the antecedent of the zero pronoun was not changed, even though half of the tokens in the sentence had been replaced. The antecedent was that, which aligned with the original token position of scratches. The original relation between the two predicates, be fixed and noticeable, still remained. In addition, the pretrained MLM generated the tokens surrounding these predicates, such that the context drastically changed from the original the scratches can't be repaired, but ... to if that gets fixed, .... Such contextual variation plays an important role in improving the robustness and the coverage of a ZAR model, even if the resulting meaning is slightly unnatural. This can explain the results in Section 5.3, which indicate that a high masking probability (approximately 0.5) combined with all-but-verb masking S_all\verb achieves better performance.

A Word Replacement Approaches
This section compares alternative approaches to the CDA method described in Section 4.1. We compare several different procedures for filling [MASK] tokens using the pretrained MLM. Specifically, we consider two aspects: (i) how many [MASK] tokens to fill simultaneously and (ii) how to determine a word symbol at each step.
For (i), how many [MASK] tokens to fill simultaneously, we compare the following two methods:
• SINGLE: An overview of this method is presented in Figure 5a. Suppose that a masked sentence X′ contains N [MASK] tokens. We first create (X′_1, ..., X′_N) from X′, where each X′_n contains a single [MASK] token and the rest of the tokens are from the original sentence X. Note that the position of the [MASK] token is unique for each X′_n. We then feed each X′_n to the MLM independently and fill its [MASK] token with another token to obtain (X′′_1, ..., X′′_N). Finally, we merge the outputs (X′′_1, ..., X′′_N) to construct a symbolically modified sentence X′′.
• MULTI: An overview of this method is presented in Figure 5b. Given a masked sentence X′, we feed X′ to the MLM and fill all [MASK] tokens simultaneously. The output of the MLM is a symbolically modified sentence X′′.
For (ii), how to determine a word symbol at each step, we compare the following two methods:
• SAMPLE: Following Wang and Cho (2019), we determine the output symbol by sampling a word from the probability distribution over the vocabulary computed by the pretrained MLM.
• ARGMAX: From the probability distribution over the vocabulary, we determine the output symbol by choosing the token with the highest probability.
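The SINGLE and MULTI filling strategies can be sketched as follows (a minimal sketch in which a stub predictor, standing in for the pretrained MLM with ARGMAX selection, maps a sequence and a position to a single token):

```python
MASK = "[MASK]"

def fill_multi(masked, predict):
    """MULTI: fill every [MASK] in one pass; predict(seq, i) sees the
    sequence with all masks still in place."""
    return [predict(masked, i) if t == MASK else t
            for i, t in enumerate(masked)]

def fill_single(masked, original, predict):
    """SINGLE: one pass per [MASK]; each pass sees the original tokens at
    all other positions, and the per-position predictions are merged."""
    filled = list(masked)
    for i, t in enumerate(masked):
        if t == MASK:
            one_mask = list(original)
            one_mask[i] = MASK  # mask only position i
            filled[i] = predict(one_mask, i)
    return filled

# Stub predictor standing in for the pretrained MLM (illustrative: a real
# MLM's prediction would depend on the surrounding tokens).
def stub_predict(seq, i):
    return "criminal" if i == 1 else "struck"

original = ["the", "man", "sued"]
masked = ["the", MASK, MASK]
print(fill_multi(masked, stub_predict))             # ['the', 'criminal', 'struck']
print(fill_single(masked, original, stub_predict))  # ['the', 'criminal', 'struck']
```

With this context-blind stub the two strategies coincide; with a real MLM they differ, because SINGLE conditions each prediction on the original tokens while MULTI conditions on the other masks.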
Table 7 shows the results for the combinations of these two aspects. In this experiment, the POS condition is identical across all models, with the setting S_all\(verb∪symbol), which achieves the best overall F1 in Table 5. Table 7 shows that the combination of MULTI and ARGMAX, namely MULTI+ARGMAX, achieves the best performance. Thus, we used the MULTI+ARGMAX setting for CDA in the experiment of Section 5.4.
² Here, e_i ∈ R^D, and D is the hidden-state size.

Figure 2 :
Figure 2: Overview of our models. (a): the baseline model (MP-MLM) for Japanese ZAR. (b): the integration of a standard CDA method into the baseline. (c): the proposed data augmentation method.

Table 1 :
Statistics on NAIST Text Corpus 1.5

Table 2 :
Number of tokens whose POS tag belongs to each set of POS tags, counted on the training set of NTC 1.5.

Table 3 :
List of hyper-parameters.

Table 4 :
F1 scores on the NTC 1.5 validation set. Bold values indicate the best results in the same column.

Table 5 :
F1 scores for each POS-controlled masking on the NTC 1.5 validation set. The method names denote the set of target POS tags for masking. Bold values indicate the best results in the same column.

Table 6 :
F1 scores on the NTC 1.5 test set. Bold values indicate the best results in the same column group.
†: result from our experiment. *: ensemble models. The improvement of MASKING over the BASELINE is statistically significant in both the overall F1 score and the ZAR score (p < 0.05) with a permutation test.
Figure 4: Examples of pseudo data created by data augmentation. X′ indicates pseudo data in MASKING; X′′ indicates pseudo data in CDA. X′_1 and X′′_1 are outputs of S_verb; X′_2 and X′′_2 are outputs of S_all\verb. Original X: キズ は 直り ませ ん が 、φ-NOM ほとんど 目立た なく なって 、… ("The scratches can't be repaired, but φ are becoming barely noticeable …"). S_verb output X′′_1 fills the [MASK] with exist: ("There are no scratches, but φ doesn't stand out anymore …").