Wiktionary Normalization of Translations and Morphological Information

We extend the Yawipa Wiktionary parser (Wu and Yarowsky, 2020) to extract and normalize translations from etymology glosses as well as morphological form-of relations, resulting in 300K unique translations and over 4 million instances of 168 annotated morphological relations. We propose a method to identify typos in translation annotations. Using the extracted morphological data, we develop multilingual neural models for predicting three types of word formation (clipping, contraction, and eye dialect) and improve upon a standard attention baseline by using copy attention.


Introduction
Wiktionary is a large, free multilingual dictionary with a wealth of information. Yawipa (Wu and Yarowsky, 2020), henceforth W&Y, is a recent Wiktionary parser billed as "comprehensive and extensible." It can extract numerous types of information from Wiktionary, including pronunciations, part of speech, translations, etymology, and a wide range of word relations, and normalize them into an easy-to-process tabular format. In particular, one of Yawipa's innovations over existing parsers was extracting translations from the definition section of an entry. Confirming its easy extensibility and improving upon its comprehensiveness, we extend Yawipa's extraction and normalization of Wiktionary in two directions: we extract translations from an unusual source, etymology glosses, and we extract morphological relations as annotated by form-of templates. This results in 282,092 new unique translations and 4,027,201 extracted morphological relations (from the 2020-04 English Wiktionary XML dump). We present an analysis that enables us to find typos in translation annotations. Using the extracted morphological data, we experiment with several new low-resource (1.5K instances) multilingual prediction tasks on clipping, contraction, and eye dialect. Our experiments with neural sequence-to-sequence models show that using copy attention can improve performance by up to 52% over a model with a standard attention mechanism.

Related Work
Though Wiktionary has existed since 2002, only relatively recently has there been a surge of interest in using Wiktionary. Navarro et al. (2009) were among the first to examine Wiktionary as a resource for NLP. This paper builds upon Yawipa (Wu and Yarowsky, 2020), an open-source, extensible Wiktionary parsing framework written in Julia with support for parsing a wide variety of data from multiple language editions of Wiktionary into a structured machine-readable format. Yawipa's goal is to be comprehensive and extensible. To that end, Yawipa goes beyond existing parsers in extracting and normalizing information, such as etymology and translations, that exists outside of structured Wiktionary markup (we further this goal in this paper), and it facilitates the creation of new parsers for other Wiktionary editions. In the literature, there are similar Wiktionary parsing efforts (e.g. knoWitiary (Nastase and Strapparava, 2015), DBnary (Sérasset, 2015), and ENGLAWI (Sajous et al., 2020)), but with different goals and coverage.
Most studies on translation extraction have utilized the translation section of an entry: Ács (2014) using a triangulation approach, Kirov et al. (2016) for morphological analysis, and Wu and Yarowsky (2020) as part of a comprehensive Wiktionary parsing effort. DBnary (Sérasset, 2015) is a similar effort at parsing certain lexical data, including translations, from Wiktionary into a structured format.
Related to the word formation mechanisms we examine, Kulkarni and Wang (2018) examine word formation in slang, specifically blends, clippings, and reduplication, and Brooke et al. (2011) predict clipping using an LSA-based approach. Contractions are not typically studied in a predictive context; Volk and Sennrich (2011) disambiguate contractions as a preprocessing step in machine translation. Researchers have recently examined eye dialect in the context of spelling correction (Eryani et al., 2020; Himoro and Pareja-Lora, 2020), but to our knowledge, this paper is the first study on eye dialect generation.

Extracting Translations from Etymology Glosses
Wiktionary contains translations in a specialized Translation section. W&Y extract these translations, as well as "translations" from the definition section of non-English word entries. Since non-English words have English definitions (in the English Wiktionary), short definitions can be regarded as viable translations. One unusual but particularly fruitful source of translations that has not been previously considered is glosses in the Etymology section of an entry. For example, in Wiktionary the etymology of the German word Marienkäfer 'ladybug' is: From Maria (given name) + Käfer ("beetle").
Glosses of each component of the compound word are given in parentheses; these are the translations that we extract. The provided glosses can help disambiguate the word in cases where a word may have multiple senses (e.g. Käfer can refer to a beetle, a wench, or the Volkswagen car).
The decomposition of Marienkäfer in the above etymology entry is encoded in MediaWiki markup as {{compound|de|Maria|pos1=given name|Käfer|t2=beetle}}. This is a Wiktionary template with arguments separated by pipes, indicating (1) the word is a compound, (2) it is a German word, (3) the 1st component is Maria, (4) the part of speech of the 1st component is "given name", (5) the 2nd component is Käfer, and (6) the translation of the 2nd component is "beetle". From this example, we would extract and normalize the second component's translation to augment the translations already extracted by Yawipa from other sources.
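To make the extraction concrete, the sketch below shows one way to split such a template into positional and named arguments and to pull out any tN/glossN translations. The function names and the simplified handling (no nested templates or piped links) are our own illustration, not Yawipa's actual Julia implementation.

```python
import re

def parse_template(template: str):
    """Split a MediaWiki template such as
    {{compound|de|Maria|pos1=given name|Käfer|t2=beetle}}
    into its name, positional arguments, and named arguments.
    Simplified: assumes no nested templates or piped links inside."""
    inner = template.strip().lstrip("{").rstrip("}")
    parts = inner.split("|")
    name, positional, named = parts[0], [], {}
    for arg in parts[1:]:
        if "=" in arg:
            key, _, value = arg.partition("=")
            named[key] = value
        else:
            positional.append(arg)
    return name, positional, named

def extract_etymology_glosses(template: str):
    """Return (language, component, gloss) triples for tN/glossN arguments.
    For {{compound|...}}, positional argument 1 is the language code and
    the remaining positional arguments are the components."""
    name, positional, named = parse_template(template)
    lang, components = positional[0], positional[1:]
    triples = []
    for key, gloss in named.items():
        match = re.fullmatch(r"(?:t|gloss)(\d+)", key)
        if match:
            idx = int(match.group(1)) - 1        # t2 refers to the 2nd component
            component = components[idx] if idx < len(components) else None
            triples.append((lang, component, gloss))
    return triples

# extract_etymology_glosses("{{compound|de|Maria|pos1=given name|Käfer|t2=beetle}}")
#   -> [("de", "Käfer", "beetle")]
```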
Analysis. The top 5 languages from which we extract translations (Table 1) are Latin, Greek, and Proto-Indo-European (common ancestor languages), and Finnish and German (highly compositional languages). We also examine specifically where in the etymology template the gloss occurs (Table 2), whether as a named argument (e.g. t2=beetle) or as a positional (non-named) argument (e.g. {{m|la|ab||from, away from}}, denoted as (none) in Table 2).

Table 2: Counts of etymology gloss template arguments.

  (none)  235,123    t3      4,450    t7      20    gloss6   2
  t1       74,792    gloss3    738    t8      11    gloss11  1
  t2       56,452    t4        476    gloss5   9    t22      1
  t        55,376    t5        117    t9       3
  gloss1   23,213    t6         53    t11      3
  gloss2   14,084    gloss4     28    t10      3

We find that the large majority of etymology glosses are annotated through positional arguments, indicating that the word is not a compound word. Following this, we see a large number of t1 and t2 arguments, which occur in compositional words such as compounds and affixal words (e.g. {{compound|de|Zeit|t1=time|Geist|t2=spirit}}). Note that glosses are by no means required and are often left out for compound words (e.g. {{compound|en|light|house}}). We observe some inconsistency in whether to use t or gloss; gloss appears to be the older standard, while t is the currently accepted convention. The larger argument numbers in this table also give an indication of the number of compound words and phrases, and their components, contained in Wiktionary.
Typos. This analysis also allows us to automatically identify potential annotation typos (Table 3). For example, the template argument t11 in Table 2 indicates a translation of the 11th component in a compound word or phrase. The three entries with a t11 are the Dutch stokhaver, Latin aequabilis, and Hungarian amit nyer a réven, elveszti a vámon. By examining unlikely template arguments and then verifying the presence of previous arguments (t1 through t10), we can automatically identify typos by annotators (who probably accidentally pressed the 1 key twice, since 11-part compound words are highly unlikely). Potential typos are then reported to the user, who can manually correct the upstream source.
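The heuristic is simple enough to state as code. The sketch below (our own illustration; the function name and threshold are hypothetical) flags a tN or glossN argument as a likely typo when its index N is implausibly large and the lower-numbered arguments it implies are absent.

```python
import re

def suspicious_gloss_args(named_args: dict, max_plausible: int = 10):
    """Flag tN/glossN arguments whose index N is implausibly large and whose
    implied predecessors (t1 .. t(N-1)) are largely missing, e.g. a lone t11
    in an entry with only one or two glossed components is likely a typo."""
    indices = set()
    for key in named_args:
        match = re.fullmatch(r"(?:t|gloss)(\d+)", key)
        if match:
            indices.add(int(match.group(1)))
    flagged = []
    for n in indices:
        if n > max_plausible:
            missing = [i for i in range(1, n) if i not in indices]
            if missing:  # predecessors absent: probably a slipped keystroke
                flagged.append((n, missing))
    return flagged

# suspicious_gloss_args({"t1": "...", "t11": "..."})
#   -> [(11, [2, 3, 4, 5, 6, 7, 8, 9, 10])]
```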

Extracting Morphological Information
Wiktionary is also a rich source of morphological information. Here we focus on one type of information, which we call "form-of relations" because they are annotated in Wiktionary using Form-Of templates. We extract 4,027,201 relations across 168 relation types, a full histogram of which is in Appendix A. While different relations have different requirements as to where they can appear in an entry (e.g. some relations can only appear in the etymology section), form-of relations are relatively straightforward to extract and normalize due to the consistency of their templates. Many inflectional relations for both nouns and verbs, including relations such as inflection-of, genitive-singular-of, or past-participle-of, are already packaged in UniMorph and have been used in tasks such as morphological inflection analysis and prediction (McCarthy et al., 2019; Kann et al., 2020). Other relations, such as plural-of and feminine-form-of, can augment training data for morphological analysis systems such as that of Nicolai and Yarowsky (2019). However, much of the rest of this form-of data has not been thoroughly explored. Below, we present preliminary experiments on clipping, contraction, and eye dialect, three understudied types of data whose further research is enabled through our extraction and normalization.
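Because form-of templates are so regular, their normalization amounts to a small amount of string handling. As an illustration (a simplified sketch with our own function name, not Yawipa's actual implementation), the snippet below turns a form-of template and the page title it appears on into a (relation, language, lemma, form) tuple.

```python
def extract_form_of(page_title: str, template: str):
    """Normalize a form-of template such as {{clipping of|en|mathematics}}
    (appearing on the page "math") into a (relation, language, lemma, form)
    tuple. Simplified: ignores named arguments and nested markup."""
    inner = template.strip().lstrip("{").rstrip("}")
    parts = [p for p in inner.split("|") if "=" not in p]
    name, lang, lemma = parts[0], parts[1], parts[2]
    relation = name.replace(" ", "-")        # "clipping of" -> "clipping-of"
    return relation, lang, lemma, page_title

# extract_form_of("math", "{{clipping of|en|mathematics}}")
#   -> ("clipping-of", "en", "mathematics", "math")
```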

Experiments
We experiment with predicting three form-of relations. Clipping is a process of word formation in which part of a word is "clipped" or truncated to form a new word that retains both the original word's meaning and its part of speech. Common examples in English include math from mathematics or phone from telephone. Contraction occurs when sounds or letters are dropped to form a new, shorter word or word group. In English, examples include I'm from I am and the bound morpheme -n't from not. Eye dialect is the use of nonstandard spelling to highlight a word's pronunciation. It is often used in literary works to draw attention to a character's particular dialect or accent. Some examples in English include aftuh for after and jokin' for joking. In Wiktionary, several eye dialect annotations include the specific dialect represented, such as African American Vernacular English (AAVE) or Southern US.
For these linguistic phenomena, Wiktionary contains annotations across a wide range of languages. However, the amount of annotated data is quite small: only around 1-2K instances per task (Table 4). While there has not been much published computational literature on these tasks, we envision interesting potential downstream applications for systems that can generate clippings, contractions, and eye dialect variations. For example, changing the language style of chatbots has been shown to increase user satisfaction (Elsholz et al., 2019).

Models
We use a character-level neural machine translation setup. Using OpenNMT-py (Klein et al., 2017), we employ a 2-layer LSTM encoder-decoder with 256-dimensional hidden and embedding sizes, batch size 64, the Adam optimizer with learning rate 0.001, and patience of 5. We train two model variants: a baseline with Luong attention (Luong et al., 2015) (the default in OpenNMT), and a second with copy attention (Gu et al., 2016). For eye dialect, we only use English data, as the overwhelming majority of annotations are English. For clipping and contraction, we employ the entire range of languages annotated, thus making our models multi-source, multi-target systems. We use a randomly shuffled 80-10-10 train-dev-test split. The input and output format of each experiment, as well as results, are presented in Table 5.
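As a concrete illustration of the data preparation, the sketch below writes a randomly shuffled 80-10-10 split as space-separated character sequences, a common input convention for character-level NMT toolkits such as OpenNMT-py. The file naming and formatting choices here are illustrative assumptions, not the exact scripts we used.

```python
import random

def write_split(pairs, prefix, seed=0):
    """Shuffle (source, target) word pairs and write an 80-10-10
    train/dev/test split as space-separated character sequences."""
    random.seed(seed)
    pairs = list(pairs)
    random.shuffle(pairs)
    n = len(pairs)
    splits = {
        "train": pairs[: int(0.8 * n)],
        "dev": pairs[int(0.8 * n): int(0.9 * n)],
        "test": pairs[int(0.9 * n):],
    }
    for split, items in splits.items():
        with open(f"{prefix}.{split}.src", "w", encoding="utf-8") as src, \
             open(f"{prefix}.{split}.tgt", "w", encoding="utf-8") as tgt:
            for source, target in items:
                src.write(" ".join(source) + "\n")   # e.g. "m a t h e m a t i c s"
                tgt.write(" ".join(target) + "\n")   # e.g. "m a t h"
```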

Results. We compute exact match accuracy and average character edit distance to the gold for each setting. Though 1-best and 5-best accuracies across all three tasks seem low, the predictions are on average only 1-2 characters away from the gold; we see the model consistently making plausible predictions with similar sounds. In addition, the models with copy attention consistently outperform the models with standard Luong attention. Due to space constraints, sample predictions are presented in Appendix B, and improvements of the copy attention model over the Luong attention model are in Appendix C.
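Both metrics are simple to reproduce; a minimal sketch (our own re-implementation, not the exact evaluation script) is below.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein (character edit) distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def evaluate(predictions, golds):
    """Exact-match accuracy and average character edit distance to gold."""
    n = len(golds)
    exact = sum(p == g for p, g in zip(predictions, golds))
    total_dist = sum(edit_distance(p, g) for p, g in zip(predictions, golds))
    return exact / n, total_dist / n

# evaluate(["spec", "ole"], ["spec", "owld"]) -> (0.5, 1.0)
```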
Analysis. Clippings tend to keep the beginning of the word (speculation → spec), which the model learned (Spotlight → Spot), albeit sometimes incorrectly (Alfredino → Alfe, gold is Dino). A large percentage of clippings are in Japanese; if the input is written in katakana, the model can sometimes make a correct prediction, but if it is written in kanji, the model gets it completely wrong, due to the rarity of the characters. These errors are corrected by the copy attention model, which learns to copy over characters that would otherwise be unlikely to be generated. Contraction is perhaps an easier form of clipping; the model learns to keep characters at the beginning and end of a word. For eye dialect, the models successfully learned the -ing → -in' mapping. We observe that many incorrect predictions would be quite acceptable to a human, depending on one's dialect of English (old → ole, gold is owld; yourself → yoself, gold is youself). Thus character-based metrics may be more informative measures of performance than accuracy. Overall, the copy attention model substantially outperforms a regular attention baseline, because the output contains many characters from the input (for clipping and contraction, the task is akin to selecting which characters to keep or discard).

Conclusion
We extend Yawipa, a comprehensive Wiktionary parser, to extract and normalize translations from etymology glosses and morphological form-of relations, resulting in substantial increases in extracted data. Our multilingual neural sequence models, trained on very small amounts of data, achieve quite low character edit distance when predicting words formed through clipping, contraction, and eye dialect. We show that copy attention works well for tasks where the output is a mutation of the input. We envision that our newly extracted data will be valuable to researchers working with multilingual text data. Data and code are available at github.com/wswu/yawipa.