Coding Textual Inputs Boosts the Accuracy of Neural Networks

Abdul Rafae Khan, Jia Xu, Weiwei Sun


Abstract
Natural Language Processing (NLP) tasks are usually performed word by word on textual inputs. We can use arbitrary symbols to represent the linguistic meaning of a word and use these symbols as inputs. As "alternatives" to a text representation, we introduce Soundex, MetaPhone, NYSIIS, and logogram codings to NLP, and develop a fixed-output-length coding along with its extension using Huffman coding. Each of these codings combines different character/digit sequences and constructs a new vocabulary based on codewords. We find that integrating these codewords with text provides more reliable inputs to neural-network-based NLP systems than text alone, thanks to redundancy. Experiments demonstrate that our approach outperforms state-of-the-art models on machine translation, language modeling, and part-of-speech tagging. The source code is available at https://github.com/abdulrafae/coding_nmt.
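As a concrete illustration (not the authors' released pipeline), the following minimal Python sketch maps a tokenized sentence to Soundex, MetaPhone, and NYSIIS codewords; it assumes the third-party jellyfish library for the three phonetic encoders.

# Minimal sketch: derive alternative codeword sequences for a tokenized sentence.
# Assumes the third-party `jellyfish` library; this is an illustration, not the
# authors' released code (see https://github.com/abdulrafae/coding_nmt).
import jellyfish

def encode_tokens(tokens):
    # Each coding yields a parallel sequence of codewords that can be used
    # alongside (or instead of) the original word sequence as model input.
    return {
        "soundex":   [jellyfish.soundex(t) for t in tokens],
        "metaphone": [jellyfish.metaphone(t) for t in tokens],
        "nysiis":    [jellyfish.nysiis(t) for t in tokens],
    }

if __name__ == "__main__":
    sentence = "coding textual inputs boosts accuracy".split()
    for name, codewords in encode_tokens(sentence).items():
        print(name, codewords)

In the paper's setting, such codeword sequences are combined with the original text so the network receives redundant views of the same input.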
Anthology ID:
2020.emnlp-main.104
Volume:
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Month:
November
Year:
2020
Address:
Online
Editors:
Bonnie Webber, Trevor Cohn, Yulan He, Yang Liu
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
1350–1360
URL:
https://aclanthology.org/2020.emnlp-main.104
DOI:
10.18653/v1/2020.emnlp-main.104
Cite (ACL):
Abdul Rafae Khan, Jia Xu, and Weiwei Sun. 2020. Coding Textual Inputs Boosts the Accuracy of Neural Networks. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1350–1360, Online. Association for Computational Linguistics.
Cite (Informal):
Coding Textual Inputs Boosts the Accuracy of Neural Networks (Khan et al., EMNLP 2020)
PDF:
https://aclanthology.org/2020.emnlp-main.104.pdf
Video:
https://slideslive.com/38939325
Code:
abdulrafae/coding_nmt