基于强负采样的词嵌入优化算法(Word Embedding Optimization Based on Hard Negative Sampling)

Yuchen Wang (王雨晨), Miaozhe Lin (林淼哲), Jiefan Zhan (詹杰凡)


Abstract
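word2vec is one of the most important word embedding algorithms in natural language processing. To address the vanishing sample contribution problem that can arise when random negative sampling is used as the optimization objective, this paper proposes hard negative sampling methods that use cosine distance as the metric and can be applied to the CBOW and Skip-gram frameworks: HNS-CBOW and HNS-SG. The original random negative sampling process is decomposed into two steps: first, the cosine distance between each random negative sample and the target word is computed; then, the hard negative samples that lie closer to the target word are used to update the parameters. Using English Wikipedia data as the experimental corpus, the optimized algorithms are evaluated quantitatively on a public semantic-syntactic dataset. The experiments show that the quality of the optimized word embeddings is significantly better than that of the original methods. In addition, compared with publicly released pre-trained word vectors such as GloVe, higher accuracy can be obtained with a smaller corpus.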
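The two-step procedure described in the abstract can be illustrated with a minimal NumPy sketch for the Skip-gram case. This is not the authors' implementation. The function names (`select_hard_negatives`, `hns_sg_step`), the uniform candidate sampling, the candidate pool size, and the choice to measure cosine distance on the input embedding matrix are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def select_hard_negatives(emb, target, candidates, k):
    """Keep the k candidates with the smallest cosine distance to the target word."""
    candidates = np.asarray(candidates)
    t = emb[target]
    c = emb[candidates]
    # Cosine similarity; the epsilon guards against zero-length vectors.
    sims = (c @ t) / (np.linalg.norm(c, axis=1) * np.linalg.norm(t) + 1e-12)
    dist = 1.0 - sims  # cosine distance
    return candidates[np.argsort(dist)[:k]]

def hns_sg_step(w_in, w_out, center, target, rng, n_candidates=20, k=5, lr=0.025):
    """One Skip-gram update that uses hard negatives instead of all random ones.

    Assumptions (not taken from the paper): candidates are drawn uniformly,
    and distances are measured on the input embeddings w_in.
    """
    vocab = w_in.shape[0]
    # Step 1: draw random negative candidates, then rank them by cosine distance.
    candidates = rng.choice(vocab, size=n_candidates, replace=False)
    candidates = candidates[(candidates != target) & (candidates != center)]
    negatives = select_hard_negatives(w_in, target, candidates, k)

    # Step 2: ordinary negative-sampling update, restricted to the hard negatives.
    v = w_in[center].copy()
    grad_v = np.zeros_like(v)
    pairs = [(target, 1.0)] + [(int(n), 0.0) for n in negatives]
    for word, label in pairs:
        g = lr * (label - sigmoid(v @ w_out[word]))
        grad_v += g * w_out[word]
        w_out[word] += g * v
    w_in[center] += grad_v
    return negatives
```

In this sketch, a candidate that is already far from the target word is dropped before the update, so the gradient comes only from negatives that are close to the target and therefore harder to separate. A CBOW variant would follow the same pattern with the averaged context vector in place of `w_in[center]`.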
Anthology ID:
2020.ccl-1.20
Volume:
Proceedings of the 19th Chinese National Conference on Computational Linguistics
Month:
October
Year:
2020
Address:
Haikou, China
Editors:
Maosong Sun (孙茂松), Sujian Li (李素建), Yue Zhang (张岳), Yang Liu (刘洋)
Venue:
CCL
Publisher:
Chinese Information Processing Society of China
Pages:
207–214
Language:
Chinese
URL:
https://aclanthology.org/2020.ccl-1.20
Cite (ACL):
Yuchen Wang, Miaozhe Lin, and Jiefan Zhan. 2020. 基于强负采样的词嵌入优化算法(Word Embedding Optimization Based on Hard Negative Sampling). In Proceedings of the 19th Chinese National Conference on Computational Linguistics, pages 207–214, Haikou, China. Chinese Information Processing Society of China.
Cite (Informal):
基于强负采样的词嵌入优化算法(Word Embedding Optimization Based on Hard Negative Sampling) (Wang et al., CCL 2020)
PDF:
https://aclanthology.org/2020.ccl-1.20.pdf