A Framework for Shared Agreement of Language Tags beyond ISO 639

Frances Gillis-Webber, Sabine Tittel


Abstract
The identification and annotation of languages in an unambiguous and standardized way is essential for the description of linguistic data. It is the prerequisite for machine-based interpretation, aggregation, and re-use of the data with respect to different languages, which makes it a key aspect especially for Linked Data and the multilingual Semantic Web. The standard for language tags is defined by IETF’s BCP 47, and ISO 639 provides the language codes that are the tags’ main constituents. However, the ISO 639 codes are insufficient for identifying lesser-known languages, endangered languages, regional varieties, or historical stages of a language. Moreover, the optional language sub-tags compliant with BCP 47 are not fine-grained enough to represent linguistic variation. We propose a versatile pattern that extends the BCP 47 sub-tag ‘privateuse’ and can thus overcome the limits of BCP 47 and ISO 639. We demonstrate that the pattern provides sufficient coverage with the use case of linguistic Linked Data of the endangered Gascon language. We also show how to use a URI shortcode for the extended sub-tag so that its length remains compliant with BCP 47. To this end, we developed a web application and API that encode and decode the language tag.
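The length constraint mentioned in the abstract can be illustrated with a minimal sketch. It assumes only the private-use grammar of BCP 47 (RFC 5646), in which a private-use sequence starts with the singleton `x` and each following sub-tag has one to eight alphanumeric characters. The helper names and the example tags are hypothetical and are not taken from the paper or its web application.

```python
import re
from typing import Optional

# RFC 5646 (BCP 47): privateuse = "x" 1*("-" (1*8alphanum))
PRIVATE_USE = re.compile(r"x(-[A-Za-z0-9]{1,8})+")


def private_use_part(tag: str) -> Optional[str]:
    """Return the private-use sequence of a tag ('x-...'), or None if absent."""
    subtags = tag.split("-")
    lowered = [s.lower() for s in subtags]
    if "x" not in lowered:
        return None
    # Everything from the first 'x' singleton onwards is private use.
    return "-".join(subtags[lowered.index("x"):])


def is_valid_private_use(tag: str) -> bool:
    """Check only the private-use sequence against the 1-8 character rule."""
    part = private_use_part(tag)
    return part is not None and PRIVATE_USE.fullmatch(part) is not None


# Hypothetical tags, for illustration only.
print(is_valid_private_use("oc-x-gascon"))         # short sub-tag fits
print(is_valid_private_use("oc-x-oldgascon1300"))  # 13 characters, too long
```

A descriptive sub-tag quickly exceeds eight characters, which is the kind of limit a short code standing in for a longer URI is meant to work around.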
Anthology ID:
2020.lrec-1.408
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
3333–3339
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.408
Cite (ACL):
Frances Gillis-Webber and Sabine Tittel. 2020. A Framework for Shared Agreement of Language Tags beyond ISO 639. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3333–3339, Marseille, France. European Language Resources Association.
Cite (Informal):
A Framework for Shared Agreement of Language Tags beyond ISO 639 (Gillis-Webber & Tittel, LREC 2020)
PDF:
https://aclanthology.org/2020.lrec-1.408.pdf