Text Similarity Estimation Based on Word Embeddings and Matrix Norms for Targeted Marketing

Tim vor der Brück, Marc Pouly


Abstract
The prevalent way to estimate the similarity of two documents based on word embeddings is to apply the cosine similarity measure to the two centroids obtained from the embedding vectors associated with the words in each document. Motivated by an industrial application from the domain of youth marketing, where this approach produced only mediocre results, we propose an alternative way of combining the word vectors using matrix norms. The evaluation shows superior results for most of the investigated matrix norms in comparison to both the classical cosine measure and several other document similarity estimates.
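To make the contrast in the abstract concrete, the following is a minimal sketch in plain Python. `centroid_similarity` follows the baseline the abstract describes (cosine of the two centroids). `matrix_norm_similarity` is only one plausible reading of "combining the word vectors using matrix norms": it takes the Frobenius norm of the pairwise word-to-word cosine matrix and normalizes it by the two self-similarity matrices. The choice of norm, the normalization, and all function names are assumptions made for illustration, not details confirmed by the abstract.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length word vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def centroid_similarity(doc_a, doc_b):
    """Baseline from the abstract: cosine of the two document centroids."""
    return cosine(centroid(doc_a), centroid(doc_b))


def _frobenius_of_pairwise_cosines(doc_a, doc_b):
    """Frobenius norm of the matrix of all word-to-word cosines."""
    return math.sqrt(sum(cosine(u, v) ** 2 for u in doc_a for v in doc_b))


def matrix_norm_similarity(doc_a, doc_b):
    """Assumed variant: Frobenius norm of the cross-similarity matrix,
    normalized by the self-similarity matrices of both documents so that
    identical documents score 1."""
    cross = _frobenius_of_pairwise_cosines(doc_a, doc_b)
    self_a = _frobenius_of_pairwise_cosines(doc_a, doc_a)
    self_b = _frobenius_of_pairwise_cosines(doc_b, doc_b)
    return cross / math.sqrt(self_a * self_b)
```

Each document is passed as a list of word vectors, for example `matrix_norm_similarity([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [1.0, 1.0]])`. Unlike the centroid baseline, this variant keeps every pairwise word relation instead of averaging the vectors first. Both functions assume non-empty documents and non-zero vectors.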
Anthology ID:
N19-1181
Volume:
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
Month:
June
Year:
2019
Address:
Minneapolis, Minnesota
Editors:
Jill Burstein, Christy Doran, Thamar Solorio
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
1827–1836
URL:
https://aclanthology.org/N19-1181
DOI:
10.18653/v1/N19-1181
Cite (ACL):
Tim vor der Brück and Marc Pouly. 2019. Text Similarity Estimation Based on Word Embeddings and Matrix Norms for Targeted Marketing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1827–1836, Minneapolis, Minnesota. Association for Computational Linguistics.
Cite (Informal):
Text Similarity Estimation Based on Word Embeddings and Matrix Norms for Targeted Marketing (vor der Brück & Pouly, NAACL 2019)
PDF:
https://aclanthology.org/N19-1181.pdf