Unsupervised Cross-Lingual Part-of-Speech Tagging for Truly Low-Resource Scenarios

Ramy Eskander, Smaranda Muresan, Michael Collins


Abstract
We describe a fully unsupervised cross-lingual transfer approach for part-of-speech (POS) tagging in a truly low-resource scenario. We assume access to parallel translations between the target language and one or more source languages for which POS taggers are available. We use the Bible as parallel data in our experiments: it is small, out-of-domain, and covers many diverse languages. Our approach innovates in three ways: 1) a robust method for selecting training instances via cross-lingual annotation projection that exploits best practices of unsupervised type and token constraints, word-alignment confidence and density of projected POS tags, 2) a Bi-LSTM architecture that uses contextualized word embeddings, affix embeddings and hierarchical Brown clusters, and 3) an evaluation on 12 languages that are diverse in language family and morphological typology. Despite the limited and out-of-domain parallel data, our experiments demonstrate significant improvements in accuracy over previous work. In addition, we show that using multi-source information, either via projection or output combination, improves performance for most target languages.
Anthology ID:
2020.emnlp-main.391
Volume:
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Month:
November
Year:
2020
Address:
Online
Editors:
Bonnie Webber, Trevor Cohn, Yulan He, Yang Liu
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
4820–4831
URL:
https://aclanthology.org/2020.emnlp-main.391
DOI:
10.18653/v1/2020.emnlp-main.391
Cite (ACL):
Ramy Eskander, Smaranda Muresan, and Michael Collins. 2020. Unsupervised Cross-Lingual Part-of-Speech Tagging for Truly Low-Resource Scenarios. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4820–4831, Online. Association for Computational Linguistics.
Cite (Informal):
Unsupervised Cross-Lingual Part-of-Speech Tagging for Truly Low-Resource Scenarios (Eskander et al., EMNLP 2020)
PDF:
https://aclanthology.org/2020.emnlp-main.391.pdf
Video:
https://slideslive.com/38939256