Derivation of Document Vectors from Adaptation of LSTM Language Model

Wei Li, Brian Mak


Abstract
In many natural language processing (NLP) tasks, a document is commonly modeled as a bag of words using the term frequency-inverse document frequency (TF-IDF) vector. One major shortcoming of the frequency-based TF-IDF feature vector is that it ignores word order, which carries syntactic and semantic relationships among the words in a document. This paper proposes a novel distributed vector representation of a document, labeled DV-LSTM, which is derived from the result of adapting a long short-term memory recurrent neural network language model to the document. DV-LSTM is expected to capture some high-level sequential information in the document, which other current document representations fail to do. It was evaluated on document genre classification using the Brown Corpus and the BNC Baby Corpus. The results show that DV-LSTM significantly outperforms the TF-IDF vector and the paragraph vector (PV-DM) in most cases, and that combining DV-LSTM with them may further improve classification performance.
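The word-order limitation of TF-IDF mentioned in the abstract can be seen in a small sketch. This is not code from the paper: the function name `tfidf_vectors`, the whitespace tokenizer, the smoothed IDF variant, and the toy sentences are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for whitespace-tokenized documents.

    Hypothetical helper for illustration only; it is not the paper's pipeline.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # Relative term frequency times a smoothed IDF (one common variant):
        # idf(t) = log((1 + N) / (1 + df(t))) + 1
        vec = [
            (counts[term] / len(toks))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term in vocab
        ]
        vectors.append(vec)
    return vocab, vectors


# The first two toy documents use the same words in a different order.
docs = [
    "the dog bit the man",
    "the man bit the dog",
    "a cat sat on the mat",
]
vocab, vectors = tfidf_vectors(docs)
same_vector = vectors[0] == vectors[1]
print(same_vector)
```

Because the vector depends only on term counts, the first two documents map to the same point even though they describe opposite events. This is the kind of sequential information a representation derived from a language model is expected to retain.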
Anthology ID:
E17-2073
Volume:
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers
Month:
April
Year:
2017
Address:
Valencia, Spain
Editors:
Mirella Lapata, Phil Blunsom, Alexander Koller
Venue:
EACL
Publisher:
Association for Computational Linguistics
Pages:
456–461
URL:
https://aclanthology.org/E17-2073
Cite (ACL):
Wei Li and Brian Mak. 2017. Derivation of Document Vectors from Adaptation of LSTM Language Model. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 456–461, Valencia, Spain. Association for Computational Linguistics.
Cite (Informal):
Derivation of Document Vectors from Adaptation of LSTM Language Model (Li & Mak, EACL 2017)
PDF:
https://aclanthology.org/E17-2073.pdf