Guiding Attention for Self-Supervised Learning with Transformers

Ameet Deshpande, Karthik Narasimhan


Abstract
In this paper, we propose a simple and effective technique to allow for efficient self-supervised learning with bi-directional Transformers. Our approach is motivated by recent studies demonstrating that self-attention patterns in trained models contain a majority of non-linguistic regularities. We propose a computationally efficient auxiliary loss function to guide attention heads to conform to such patterns. Our method is agnostic to the actual pre-training objective and results in faster convergence of models as well as better performance on downstream tasks compared to the baselines, achieving state-of-the-art results in low-resource settings. Surprisingly, we also find that linguistic properties of attention heads are not necessarily correlated with language modeling performance.
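As a rough illustration of the idea of an auxiliary loss that pulls an attention head toward a fixed pattern, the following is a minimal sketch and not the authors' implementation. The abstract does not state the form of the loss, so the mean squared error used here, the "attend to the previous token" target, the weight `alpha`, and all function names are assumptions made for illustration only.

```python
def prev_token_pattern(n):
    """Hypothetical target map: token i puts all attention on token i-1
    (token 0 attends to itself, since it has no predecessor)."""
    pattern = [[0.0] * n for _ in range(n)]
    for i in range(n):
        pattern[i][max(i - 1, 0)] = 1.0
    return pattern


def guidance_loss(attention, pattern):
    """Mean squared error between one head's attention map and a target
    pattern (an assumed choice of distance, not taken from the abstract)."""
    n = len(attention)
    total = sum(
        (attention[i][j] - pattern[i][j]) ** 2
        for i in range(n)
        for j in range(n)
    )
    return total / (n * n)


def total_loss(pretraining_loss, attention, pattern, alpha=1.0):
    """Add the guidance term to any pre-training loss; `alpha` is an
    assumed weighting hyperparameter."""
    return pretraining_loss + alpha * guidance_loss(attention, pattern)


# A head that spreads attention uniformly over 4 tokens, versus the target.
uniform = [[0.25] * 4 for _ in range(4)]
target = prev_token_pattern(4)

print(guidance_loss(target, target))   # a head matching the pattern exactly
print(guidance_loss(uniform, target))  # a head far from the pattern
print(total_loss(2.0, uniform, target, alpha=0.5))
```

Because the guidance term is simply added to whatever loss the model is already trained with, a sketch like this does not depend on the choice of pre-training objective, which is consistent with the abstract's claim that the method is agnostic to it.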
Anthology ID:
2020.findings-emnlp.419
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2020
Month:
November
Year:
2020
Address:
Online
Editors:
Trevor Cohn, Yulan He, Yang Liu
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
4676–4686
URL:
https://aclanthology.org/2020.findings-emnlp.419
DOI:
10.18653/v1/2020.findings-emnlp.419
Cite (ACL):
Ameet Deshpande and Karthik Narasimhan. 2020. Guiding Attention for Self-Supervised Learning with Transformers. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4676–4686, Online. Association for Computational Linguistics.
Cite (Informal):
Guiding Attention for Self-Supervised Learning with Transformers (Deshpande & Narasimhan, Findings 2020)
PDF:
https://aclanthology.org/2020.findings-emnlp.419.pdf
Optional supplementary material:
 2020.findings-emnlp.419.OptionalSupplementaryMaterial.zip
Video:
 https://slideslive.com/38940124
Video:
 https://slideslive.com/38939446
Code:
 ameet-1997/AttentionGuidance
Data:
GLUE, MultiNLI, QNLI