Finding Friends and Flipping Frenemies: Automatic Paraphrase Dataset Augmentation Using Graph Theory

Hannah Cyberey; Yangfeng Ji; David K. Evans

doi:10.18653/v1/2020.findings-emnlp.426


Abstract

Most NLP datasets are manually labeled, and so suffer from inconsistent labeling or limited size. We propose methods for automatically improving datasets by viewing them as graphs with expected semantic properties. We construct a paraphrase graph from the provided sentence pair labels, and create an augmented dataset by directly inferring labels from the original sentence pairs using a transitivity property. We use structural balance theory to identify likely mislabelings in the graph, and flip their labels. We evaluate our methods on paraphrase models trained using these datasets starting from a pretrained BERT model, and find that the automatically enhanced training sets result in more accurate models.
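The transitivity idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names `find` and `augment`, the label encoding (1 = paraphrase, 0 = non-paraphrase), and the use of union-find clustering are all assumptions made for illustration. The sketch also simplifies the structural balance step to flagging a non-paraphrase label that falls inside a cluster of paraphrases; the paper's actual criterion for choosing which labels to flip may differ.

```python
from itertools import combinations


def find(parent, x):
    # Union-find lookup with path halving.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def augment(pairs):
    """Infer extra labels from (sent_a, sent_b, label) triples.

    Assumed encoding: label 1 = paraphrase, 0 = non-paraphrase.
    Returns (labels, conflicts), where labels maps sorted sentence
    pairs to inferred labels and conflicts lists suspicious pairs.
    """
    pairs = list(pairs)
    parent = {}
    for a, b, _ in pairs:
        parent.setdefault(a, a)
        parent.setdefault(b, b)

    # Sentences linked by paraphrase edges end up in the same cluster.
    for a, b, label in pairs:
        if label == 1:
            parent[find(parent, a)] = find(parent, b)

    clusters = {}
    for s in parent:
        clusters.setdefault(find(parent, s), []).append(s)

    labels = {}
    # Transitivity: any two sentences in one cluster are paraphrases.
    for members in clusters.values():
        for a, b in combinations(sorted(members), 2):
            labels[(a, b)] = 1

    conflicts = []
    for a, b, label in pairs:
        if label != 0:
            continue
        ra, rb = find(parent, a), find(parent, b)
        if ra == rb:
            # A non-paraphrase label inside a paraphrase cluster
            # contradicts transitivity: a candidate mislabeling.
            conflicts.append((a, b))
        else:
            # A non-paraphrase edge between two clusters extends to
            # every cross-cluster pair.
            for x in clusters[ra]:
                for y in clusters[rb]:
                    labels.setdefault(tuple(sorted((x, y))), 0)
    return labels, conflicts


example = [("A", "B", 1), ("B", "C", 1), ("C", "D", 0), ("A", "C", 0)]
labels, conflicts = augment(example)
```

In this toy input, A, B, and C form one paraphrase cluster, so the original non-paraphrase label on (A, C) is flagged as a conflict, while the (C, D) non-paraphrase label propagates to (A, D) and (B, D).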

Anthology ID: 2020.findings-emnlp.426
Volume: Findings of the Association for Computational Linguistics: EMNLP 2020
Month: November
Year: 2020
Address: Online
Editors: Trevor Cohn, Yulan He, Yang Liu
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 4741–4751
URL: https://aclanthology.org/2020.findings-emnlp.426/
DOI: 10.18653/v1/2020.findings-emnlp.426
Cite (ACL): Hannah Chen, Yangfeng Ji, and David Evans. 2020. Finding Friends and Flipping Frenemies: Automatic Paraphrase Dataset Augmentation Using Graph Theory. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4741–4751, Online. Association for Computational Linguistics.
Cite (Informal): Finding Friends and Flipping Frenemies: Automatic Paraphrase Dataset Augmentation Using Graph Theory (Chen et al., Findings 2020)
PDF: https://aclanthology.org/2020.findings-emnlp.426.pdf
Code: hannahxchen/automatic-paraphrase-dataset-augmentation
Data: GLUE
