Adversarial Training for Commonsense Inference

We apply small perturbations to word embeddings and minimize the resultant adversarial risk to regularize the model. We exploit a novel combination of two different approaches to estimate these perturbations: 1) using the true label and 2) using the model prediction. Without relying on any human-crafted features, knowledge bases, or additional datasets other than the target datasets, our model boosts the fine-tuning performance of RoBERTa, achieving competitive results on multiple reading comprehension datasets that require commonsense inference.


Introduction
Commonsense knowledge is often necessary for natural language understanding. As shown in Table 1, we can understand that the writer needs help to get dressed and seems upset with this situation, indicating that he or she is probably not a child. Thus, we can infer that a possible reason the writer needs to be dressed by other people is that he or she may have a physical disability (Huang et al., 2019). Although this is a simple task for humans, it is still challenging for computers to understand and reason about commonsense.
Table 1: An example from CosmosQA (Huang et al., 2019). The task is to identify the correct answer option; here, the correct answer is Option 2.
Paragraph: It's a very humbling experience when you need someone to dress you every morning, tie your shoes, and put your hair up. Every menial task takes an unprecedented amount of effort. It made me appreciate Dan even more. But anyway I shan't dwell on this (I'm not dying after all) and not let it detract from my lovely 5 days with my friends visiting from Jersey.
Question: What's a possible reason the writer needed someone to dress him every morning?
Option 1: The writer doesn't like putting effort into these tasks.
Option 2: The writer has a physical disability.
Option 3: The writer is bad at doing his own hair.
Option 4: None of the above choices.

Commonsense inference in natural language processing (NLP) is generally evaluated via the machine reading comprehension task, in the format of selecting plausible responses with respect to natural language queries. Recent approaches are based on pre-trained Transformer-based language models such as BERT (Devlin et al., 2019). Some approaches rely solely on these models by adopting either a single-stage or multi-stage fine-tuning approach (fine-tuning on additional datasets in a stepwise manner) (Li and Xie, 2019; Sharma and Roychowdhury, 2019; Liu and Yu, 2019; Zhou et al., 2019), while others further enhance their word representations with knowledge bases such as ConceptNet (Jain and Singh, 2019; Da, 2019; Wang et al., 2020). However, due to the often limited data from the downstream tasks and the extremely high complexity of the pre-trained model, aggressive fine-tuning can easily make the adapted model overfit the data of the target task, leaving it unable to generalize well to unseen data (Jiang et al., 2019). Moreover, such pre-trained models have been shown to be vulnerable to adversarial attacks (Jin et al., 2020). Inspired by the recent success of adversarial training in NLP (Jiang et al., 2019), our AdversariaL training algorithm for commonsense InferenCE (ALICE) focuses on improving the generalization of pre-trained language models on downstream tasks by enhancing their robustness in the embedding space. More specifically, during the fine-tuning stage of Transformer-based models, e.g. RoBERTa (Liu et al., 2019b), random perturbations are added to the embedding layer, and the model is regularized by updating its parameters on these adversarial embeddings. ALICE exploits a novel way of combining two different approaches to estimate these perturbations: 1) using the true label and 2) using the model prediction. Experiments show that we were able to boost the performance of RoBERTa on multiple reading comprehension datasets that require commonsense inference, achieving results competitive with state-of-the-art approaches.

ALICE
Given a dataset D of N training examples, D = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, the objective of supervised learning is to learn a function f(x; θ) that minimizes the empirical risk:

min_θ E_{(x,y)∼D} [ l(f(x; θ), y) ],

where f(x; θ) maps input sentences x to the output space and θ are the learnable parameters. While this objective is effective for training a neural network, it often suffers from overfitting and poor generalization to unseen cases (Goodfellow et al., 2015; Madry et al., 2018). To alleviate these issues, one can use adversarial training, which has been primarily explored in computer vision (Goodfellow et al., 2015; Madry et al., 2018). The idea is to perturb the data distribution in the embedding space by performing adversarial attacks. Specifically, the objective becomes:

min_θ E_{(x,y)∼D} [ max_δ l(f(x + δ; θ), y) ],

where δ is the perturbation added to the embeddings. One challenge of adversarial training is estimating this perturbation δ, i.e. solving the inner maximization max_δ l(f(x + δ; θ), y). A feasible solution is to approximate it with a fixed number of steps of a gradient-based optimization approach (Madry et al., 2018).
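To make the inner maximization concrete, here is a minimal sketch (not the paper's implementation) of a single-step, l_inf-bounded perturbation on a toy logistic-regression model, where the input gradient of the loss is available in closed form:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def loss(w, x, y):
    # binary cross-entropy of a logistic model on a single example
    p = sigmoid(dot(w, x))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def linf_perturbation(w, x, y, eps=0.1):
    # one gradient step on the input, projected onto the l_inf ball of
    # radius eps (a single-step approximation of the inner max);
    # for logistic regression the input gradient of the loss is
    # (p - y) * w, and the l_inf-optimal step is eps * sign(gradient)
    p = sigmoid(dot(w, x))
    grad = [(p - y) * wi for wi in w]
    return [eps * ((g > 0) - (g < 0)) for g in grad]

w, x, y = [1.0, -2.0, 0.5], [0.2, 0.1, -0.3], 1.0
delta = linf_perturbation(w, x, y)
x_adv = [xi + di for xi, di in zip(x, delta)]
# the perturbed input has strictly higher loss than the clean input
```

In a real Transformer the closed-form gradient is replaced by backpropagation to the embedding layer, but the projection onto the norm ball is the same idea.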
Based on recent successful applications of adversarial training to NLP (Jiang et al., 2019; Miyato et al., 2018), the approaches to estimate δ can be divided into two categories: adversarial training that uses the label y, and adversarial training that uses the model prediction f(x; θ), i.e. a "virtual" label (Miyato et al., 2018; Jiang et al., 2019). We hypothesize that these two categories complement each other: the first improves robustness with respect to the target label, avoiding an increase in the error on the unperturbed inputs, while the second enforces the smoothness of the model, encouraging its output not to change much when a small perturbation is injected into the input. Thus, ALICE proposes a novel algorithm combining the two approaches, defined by:

min_θ E_{(x,y)∼D} [ max_{δ_1} l(f(x + δ_1; θ), y) + α max_{δ_2} l(f(x + δ_2; θ), f(x; θ)) ],

where δ_1 and δ_2 are two perturbations, each bounded by a general l_p norm ball and estimated by a fixed number K of steps of a gradient-based optimization approach. In our experiments, we set p = ∞. It has been shown that a larger K can lead to a better estimation of δ (Qin et al., 2019; Madry et al., 2018), but this can be expensive, especially for large models such as BERT and RoBERTa. Thus, K is set to 1 for a better trade-off between speed and performance. The hyperparameter α balances the two loss terms; in our experiments, we set α to 1.
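As an illustration only, the combined objective with K = 1 can be sketched on a toy logistic-regression model; `alice_loss`, the model, and the small random restart for the virtual term (needed because the input gradient of the smoothness term vanishes exactly at x, as in virtual adversarial training) are our assumptions, not the paper's code:

```python
import math
import random

def predict(w, x):
    # logistic model: probability of the positive class
    z = sum(a * b for a, b in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def ce(target, q):
    # cross-entropy between a (possibly soft) target and a prediction q
    return -(target * math.log(q) + (1 - target) * math.log(1 - q))

def one_step_delta(w, x, target, eps):
    # K = 1: one gradient step on the input, l_inf-projected; for this
    # model the input gradient of ce is (predict(x) - target) * w
    q = predict(w, x)
    grad = [(q - target) * wi for wi in w]
    return [eps * ((g > 0) - (g < 0)) for g in grad]

def alice_loss(w, x, y, eps=0.05, alpha=1.0, seed=0):
    # term 1: adversarial loss estimated with the true label y
    d1 = one_step_delta(w, x, y, eps)
    l_adv = ce(y, predict(w, [xi + di for xi, di in zip(x, d1)]))
    # term 2: "virtual" adversarial loss, using the model prediction as a
    # soft label; the step starts from a slightly randomised point
    p = predict(w, x)
    rnd = random.Random(seed)
    x0 = [xi + 1e-3 * rnd.gauss(0, 1) for xi in x]
    d2 = one_step_delta(w, x0, p, eps)
    l_vat = ce(p, predict(w, [xi + di for xi, di in zip(x, d2)]))
    return l_adv + alpha * l_vat
```

The first term corresponds to the ADV baseline and the second to the SMART-style smoothness term; ALICE sums them with weight α.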

Datasets and Evaluation Metrics
We evaluate ALICE on three reading comprehension benchmarks that require commonsense inference: CosmosQA (Huang et al., 2019), MCScript2.0 (Ostermann et al., 2019a), and MCTACO (Zhou et al., 2019). MCTACO targets temporal commonsense, covering five phenomena: (1) event duration (how long an event takes), (2) event ordering (the typical order of events), (3) typical time (when an event occurs), (4) frequency (how often an event occurs), and (5) stationarity (whether a state is maintained for a very long time or indefinitely). It contains 13k tuples, each consisting of a sentence, a question, and a candidate answer that should be judged as plausible or not. The sentences are taken from different sources such as news, Wikipedia, and textbooks.
The summary of the datasets is in Table 2. For the MCTACO dataset, no training set is available. Following Zhou et al. (2019), we use the dev set for fine-tuning the model and perform 5-fold cross-validation for tuning the parameters.
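A minimal sketch of producing the 5 folds for this cross-validation, assuming a simple shuffled split (the paper does not specify the exact splitting procedure):

```python
import random

def kfold_indices(n, k=5, seed=0):
    # shuffle the n example indices, then deal them into k disjoint folds;
    # each fold serves once as the held-out set during tuning
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]
```

Fixing the seed keeps the folds reproducible across hyperparameter runs.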
We evaluate CosmosQA and MCScript2.0 in terms of accuracy. Following Ostermann et al. (2019a), for MCScript2.0 we also report accuracy on the commonsense-based questions and on the questions that are not commonsense based. For MCTACO, we report the exact match (EM) and F1 scores, following Zhou et al. (2019). EM measures for how many questions a system correctly labels all candidate answers, while F1 measures the average overlap between the predictions and the ground truth. Our implementation of the pairwise text classification and relevance ranking tasks is based on the MT-DNN framework (Liu et al., 2019a).
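The EM metric described above can be sketched as follows; `exact_match` and its flat per-candidate argument layout are hypothetical names for illustration:

```python
from collections import defaultdict

def exact_match(question_ids, gold, pred):
    # group per-candidate labels by question; EM counts a question as
    # correct only if every one of its candidate answers is labeled
    # correctly, then averages over questions
    all_correct = defaultdict(lambda: True)
    for qid, g, p in zip(question_ids, gold, pred):
        all_correct[qid] &= (g == p)
    return sum(all_correct.values()) / len(all_correct)

# two questions with two candidates each; the second question has one
# mislabeled candidate, so only 1 of 2 questions counts -> EM = 0.5
em = exact_match([0, 0, 1, 1], [1, 0, 1, 1], [1, 0, 1, 0])
```

This is stricter than per-candidate accuracy: a single wrong candidate zeroes out the whole question.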

Implementation Details
The RoBERTa LARGE model (Liu et al., 2019b) was used as the text encoder. We used ADAM (Kingma and Ba, 2015) as our optimizer, with a learning rate in {1 × 10^-5, 2 × 10^-5, 3 × 10^-5, 5 × 10^-5} and a batch size in {16, 32, 64}. The maximum number of epochs was set to 10. A linear learning rate decay schedule with a warm-up proportion of 0.1 was used, unless stated otherwise. We set the dropout rate of all task-specific layers to 0.1, except for MCTACO, where it was set to 0.3. To avoid exploding gradients, we clipped the gradient norm to 1. All texts were tokenized using wordpieces and were chopped to spans no longer than 512 tokens.
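The schedule above can be sketched as a step-to-learning-rate function; the function name and the exact decay-to-zero endpoint are assumptions, with only the warm-up proportion (0.1) and the linear shape taken from the text:

```python
def linear_warmup_decay(step, total_steps, peak_lr, warmup_frac=0.1):
    # linear warm-up from 0 to peak_lr over the first warmup_frac of
    # training, then linear decay back toward 0 for the remaining steps
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
```

The learning rate peaks exactly at the end of warm-up and reaches 0 at the final step.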

Baselines
We compare ALICE to a list of state-of-the-art models, as shown in Table 3. BERT + unit normalization (Zhou et al., 2019) is the BERT base model with unit normalization added to temporal expressions in the candidate answers, fine-tuned on the MCTACO dataset. RoBERTa LARGE is our re-implementation of the large RoBERTa model (Liu et al., 2019b). PSH-SJTU (Li and Xie, 2019) is based on multi-stage fine-tuning of XLNet (Yang et al., 2019) on the RACE (Lai et al., 2017), SWAG (Zellers et al., 2018), and MCScript2.0 datasets. K-ADAPTER (Wang et al., 2020) further enhances RoBERTa word representations with multiple knowledge sources, such as factual knowledge obtained from Wikipedia and Wikidata and linguistic knowledge obtained by dependency-parsing web texts. SMART (Jiang et al., 2019) is an adversarial training model for fine-tuning pre-trained language models through regularization; it uses the model prediction, f(x; θ), to estimate the perturbation δ, and recently obtained state-of-the-art results on a range of NLP tasks from the GLUE benchmark (Wang et al., 2018). We also compare ALICE with a baseline that uses only the label y to estimate the perturbation δ (called ADV hereafter) (Madry et al., 2018).

Results
The results are summarized in Table 3. Overall, we observed that the adversarial methods, i.e. ADV, SMART, and ALICE, achieved competitive results over the baselines without using any additional knowledge source or any dataset other than the target task datasets. These results suggest that adversarial training leads to a more robust model that generalizes better on unseen data. ALICE consistently outperformed SMART (which overall outperformed ADV) across all three datasets on both the dev and test sets, indicating that adversarial training that uses the label y and adversarial training that uses the model prediction indeed complement each other.

Conclusion
We proposed ALICE, a simple and efficient adversarial training algorithm for fine-tuning large-scale pre-trained language models. Our experiments demonstrated that it achieves competitive results on multiple machine reading comprehension datasets without relying on any resources other than the target task datasets. Although in this paper we focused on the machine reading comprehension task, ALICE can be generalized to other downstream tasks as well, and we will explore this direction in future work.