Igbo Diacritic Restoration using Embedding Models

Igbo is a low-resource language spoken by approximately 30 million people worldwide. It is the native language of the Igbo people of south-eastern Nigeria. In Igbo, diacritics, both orthographic and tonal, play a major role in distinguishing the meaning and pronunciation of words. Omitting diacritics in texts often leads to lexical ambiguity. Diacritic restoration is a pre-processing task that replaces missing diacritics on words from which they have been removed. In this work, we applied embedding models to the diacritic restoration task and compared their performance to that of n-gram models. Although word embedding models have been successfully applied to various NLP tasks, they have not, to our knowledge, been used for diacritic restoration. Two classes of word embedding models were used: those projected from the English embedding space, and those trained on an Igbo bible corpus (≈ 1m). Our best result, 82.49%, is an improvement on the baseline n-gram models.


Introduction
Lexical disambiguation is at the heart of a variety of NLP tasks and systems, ranging from grammar and spelling checkers to machine translation systems. In Igbo, diacritics, both orthographic and tonal, play a major role in distinguishing the meaning and pronunciation of words (Ezeani et al., 2016, 2017). Therefore, effective restoration of diacritics not only improves the quality of corpora for training NLP systems but often improves the performance of existing ones (De Pauw et al., 2007; Mihalcea, 2002).

Diacritic Ambiguities in Igbo
There is a wide range of ambiguity classes in Igbo (Thecla-Obiora, 2012). In this paper, we focus on diacritic ambiguities. Besides orthographic diacritics (i.e. dots below and above), tone marks also determine the actual pronunciation and meaning of different words with the same latinized spelling. Table 1 shows the complexity of Igbo diacritics and its impact on word meanings and pronunciations.

Proposed Approach
As shown in section 2, previous approaches to diacritic restoration depend mostly on existing human-annotated resources (e.g. POS-tagged corpora, lexicons, morphological information). In this work, embedding models were used to restore diacritics in Igbo. For our experiments, models are created by training or projection. The evaluation method is a simple accuracy measure, i.e. the average percentage of correctly restored instances over all instance keys. An accuracy of 82.49% is achieved with the IgboBible model using Tweak3, confirming our hypothesis that the semantic relationships captured in embedding models could be exploited in the restoration of diacritics.

Related Works
Some of the key studies in diacritic restoration involve word-, grapheme-, and tag-based techniques (Francom and Hulden, 2013). Examples of word-based approaches include the works of Yarowsky (1994, 1999), which combined decision lists with morphological and collocational information. Grapheme-based models tend to support low-resource languages better by using character collocations. Mihalcea et al. (2002) proposed an approach that used character-based instances with classification algorithms for Romanian. This later inspired the works of Wagacha et al. (2006), De Pauw et al. (2011) and Scannell (2011) on a variety of relatively low-resourced languages. However, it is a common position that the word-based approach is superior to the character-based approach for well-resourced languages.
POS tags and language models have also been applied by Simard (1998) to well-resourced languages (French and Spanish), in an approach that generally involved pre-processing, candidate generation and disambiguation. Hybrid techniques are common for this task, e.g. Yarowsky (1999) used decision lists, Bayesian classification and Viterbi decoding, while Crandall (2005) applied Bayesian- and HMM-based methods. Tufiş and Chiţu (1999) used a hybrid approach that backs off to a character-based method when dealing with "unknown words".
Electronic dictionaries, where available, often augment the substitution schemes used (Šantić et al., 2009). On Maori, Cocks and Keegan (2011) used naïve Bayes algorithms with word n-grams to improve on the character-based approach of Scannell (2011).
For Igbo, however, one major challenge in applying most of the techniques mentioned above is that they depend on annotated datasets that Igbo lacks, e.g. tagged corpora, morphologically segmented corpora or dictionaries. This work aims at a resource-light approach based on a more generalisable state-of-the-art representation model, word embeddings, which could also be tested on other tasks.

Igbo Diacritic Restoration
Igbo was among the languages in a previous work (Scannell, 2011), with 89.5% accuracy on web-crawled Igbo data (31k tokens with a vocabulary size of 4.3k). Their lexicon lookup methods, LL and LL2, used the most frequent word and a bigram model to determine the right replacement. However, their training corpus was too small to be representative and there was no speaker of the language in their team to validate their results.
Ezeani et al. (2016) implemented a more complex set of n-gram models with similar techniques on a larger corpus and reported better results, but their evaluation method assumed a closed world by training and testing on the same dataset. Better results were achieved with the approach reported in Ezeani et al. (2017), but it used a non-standard data representation model which assigns a sequence of real values to the words in the vocabulary. This method is not only inefficient but also fails to capture any relationship that may exist between words in the vocabulary.
Also, for Igbo, diacritic restoration does not always eliminate the need for sense disambiguation. For example, the restored word àkwà could refer to either bed or bridge. Ezeani et al. (2017) had earlier shown that with proper diacritics on ambiguous wordkeys (e.g. akwa), a translation system like Google Translate may perform better at translating Igbo sentences to other languages. This strategy, therefore, could be more easily extended to sense disambiguation in future.

Embedding Projection
Embedding models are very generalisable and are therefore a good resource for Igbo, which has limited resources. We intend to use both trained and projected embeddings for the task. The intuition for embedding projection, illustrated in Figure 1, hinges on the concept of the universality of meaning and representation. We adopt an alignment-based projection method similar to the one described in (Guo et al., 2015). It uses an Igbo-English alignment dictionary $A^{I|E}$ with a function $f(w^I_i)$ that maps each Igbo word $w^I_i$ to all its co-aligned English words $w^E_{i,j}$ and their counts $c_{i,j}$, as defined in Equation 1. $|V^I|$ is the vocabulary size of Igbo and $n$ is the number of co-aligned English words.

$$A^{I|E} = \{(w^I_i, f(w^I_i))\}, \quad i = 1, \ldots, |V^I|; \qquad f(w^I_i) = \{(w^E_{i,j}, c_{i,j})\}, \quad j = 1, \ldots, n \tag{1}$$

The projection is formalised as assigning the weighted average of the embeddings of the co-aligned English words $w^E_{i,j}$ to the Igbo word embeddings $vec(w^I_i)$ (Guo et al., 2015):

$$vec(w^I_i) \leftarrow \frac{1}{C} \sum_{(w^E_{i,j},\, c_{i,j}) \in f(w^I_i)} c_{i,j} \cdot vec(w^E_{i,j}), \qquad C = \sum_{j=1}^{n} c_{i,j} \tag{2}$$

Table 3 shows that diacritic words, i.e. words with at least one diacritic character, make up over 50% of both the total corpus words and the word types. Over 97% of the ambiguous wordkeys have 2 or 3 variants.
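The alignment-based projection described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function name `project_embeddings`, the dictionary layout, and the toy counts and vectors are all hypothetical, and the choice to normalise only over aligned English words that actually have a vector is an assumption.

```python
def project_embeddings(alignments, english_vectors):
    """Assign each Igbo word the count-weighted average of the vectors
    of its co-aligned English words.

    alignments:      {igbo_word: {english_word: alignment_count}}
    english_vectors: {english_word: list of floats}
    """
    projected = {}
    for igbo_word, aligned in alignments.items():
        # Keep only aligned English words that have a vector (assumption).
        usable = [(english_vectors[e], c) for e, c in aligned.items()
                  if e in english_vectors and c > 0]
        if not usable:
            continue  # a "hole": no vector can be built for this word
        total = sum(c for _, c in usable)
        dim = len(usable[0][0])
        projected[igbo_word] = [
            sum(vec[k] * c for vec, c in usable) / total for k in range(dim)
        ]
    return projected


# Toy data: counts and 2-dimensional vectors are made up for illustration.
alignments = {
    "àkwà": {"bed": 3, "bridge": 1},
    "kọrịnt": {"corinth": 2},  # no English vector available below
}
english_vectors = {"bed": [1.0, 0.0], "bridge": [0.0, 1.0]}
projected = project_embeddings(alignments, english_vectors)
```

In this toy run the first entry receives a vector pulled towards "bed" in proportion to its alignment count, while the second entry is dropped because none of its aligned English words has a vector.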

Experimental Datasets
We chose 29 wordkeys that have several variants occurring in our corpus, with the wordkey itself also occurring. For each wordkey, we keep a list of sentences (excluding punctuation and numbers), each with a blank (see Table 5) to be filled with the correct variant of the wordkey.
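As a rough sketch of how such blank-filling instances could be derived from diacritic text, the snippet below strips diacritics to obtain wordkeys and replaces each ambiguous token with a blank. The function names, the whitespace tokenisation, the `_` blank symbol and the toy token sequence (not claimed to be a natural Igbo sentence) are assumptions made for illustration; the actual format is the one shown in Table 5.

```python
import string
import unicodedata


def strip_diacritics(text):
    """Remove combining marks (tone marks, dots below/above) from text."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")


def make_instances(sentence, wordkeys, blank="_"):
    """Return (wordkey, context_with_blank, correct_variant) triples."""
    tokens = [t.strip(string.punctuation) for t in sentence.lower().split()]
    # Drop numbers and empty leftovers: keep tokens that are alphabetic
    # once their diacritics are removed.
    tokens = [t for t in tokens if strip_diacritics(t).isalpha()]
    keys = [strip_diacritics(t) for t in tokens]
    instances = []
    for i, key in enumerate(keys):
        if key in wordkeys:
            # Context words carry no diacritics; the target is blanked out.
            context = keys[:i] + [blank] + keys[i + 1:]
            variant = unicodedata.normalize("NFC", tokens[i])
            instances.append((key, context, variant))
    return instances


# Toy token sequence used only to exercise the code.
instances = make_instances("ha hụrụ àkwà 2 ahụ.", {"akwa"})
```

Decomposing to NFD before filtering means the same code handles both precomposed characters and letters that carry a dot and a tone mark at once.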

Experimental Procedure
The experimental pipeline, as illustrated in Figure 2, follows three fundamental stages:

Creating embedding model
Four embedding models, two trained and two projected, were created for Igbo in the first stage of the pipeline.
Trained: The first model, IgboBible, is produced from the data described in Table 3 using the Gensim word2vec Python library (Řehůřek and Sojka, 2010). Default configurations were used, except that the dimension (default = 100) and window size (default = 5) parameters were optimised to 140 and 2 respectively on the Basic restoration method described in section 4.3.3.
Projected: Using the projection method defined above, we created the IgboGNews model from the pre-trained Google News word2vec model, while IgboEnBbl is projected from a model we trained on the English bible. IgboGNews has a lot of holes, i.e. 1101 out of 4057 (24.92%) entries in the alignment dictionary were not represented in the Google News embedding model. A quick look at the list revealed that they are mostly bible names that do not exist in the Google News model and so have no vectors for their Igbo equivalents, e.g. kọrịnt, nimshaị, manase, peletaịt, gọg, pileg, abịshag, arọna, frankịnsens.
The projection process removes these words, thereby stripping the model of a quarter of its vocabulary along with any linguistic information they carry.

Deriving diacritic embedding models
In both training and projection of the embedding model, vectors are assigned to each word in the dictionary, and that includes each diacritic variant of a wordkey. The Basic restoration process (section 4.3.3) uses this initial embedding model as is. The models are then refined by "tweaking" the variant vectors to get new ones that correlate more with context embeddings.
For example, let $mcw_v$ contain the top $n$ most co-occurring words of a certain variant $v$, together with their counts $c$. The following three tweaking methods are applied:
• Tweak1: adds to each diacritic variant vector the weighted average of the vectors of its most co-occurring words (see Equation (3)). At restoration time, all the words in the sentence are used to build the context vector.
• Tweak2: updates each variant vector as in Tweak1, but its restoration process uses only the vectors of the words that co-occur with each of the contesting variants, excluding common words.
• Tweak3: is similar to the previous methods but replaces (rather than updates) each of the variant vectors.
In these updates, $w_c$ is the 'weight' of a co-occurring word $w$, i.e. its count in $mcw_v$ normalised into a probability over $mcw_v$.
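The difference between updating and replacing a variant vector can be illustrated with a short sketch. It assumes that the weight of each co-occurring word is its count normalised over the counts in the co-occurrence list, and that the replacing variant uses the same weighted average as the updating one; both readings, along with all function names and toy values, are assumptions rather than the exact formulation used in the experiments.

```python
def weighted_average(mcw, vectors):
    """Weighted average of the vectors of the most co-occurring words.

    mcw:     {co_occurring_word: count} for one diacritic variant
    vectors: {word: list of floats}
    """
    usable = {w: c for w, c in mcw.items() if w in vectors}
    dim = len(next(iter(vectors.values())))
    total = sum(usable.values())
    average = [0.0] * dim
    if total == 0:
        return average
    for word, count in usable.items():
        weight = count / total  # normalised count, playing the role of w_c
        for k in range(dim):
            average[k] += weight * vectors[word][k]
    return average


def tweak_update(variant_vec, mcw, vectors):
    """Tweak1/Tweak2 style: add the weighted average to the variant vector."""
    average = weighted_average(mcw, vectors)
    return [a + b for a, b in zip(variant_vec, average)]


def tweak_replace(mcw, vectors):
    """Tweak3 style (as read here): use the weighted average on its own."""
    return weighted_average(mcw, vectors)


# Toy values: two context words with 2-dimensional vectors.
vectors = {"w1": [1.0, 0.0], "w2": [0.0, 1.0]}
mcw = {"w1": 3, "w2": 1}
updated = tweak_update([1.0, 1.0], mcw, vectors)
replaced = tweak_replace(mcw, vectors)
```

In the toy example the updated vector keeps the original variant information and shifts it towards the frequent context word, whereas the replaced vector is determined entirely by the co-occurring words.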

Diacritic restoration process
Algorithm 1 sketches the steps followed to apply the diacritic embedding vectors to the diacritic restoration task. This algorithm is based on the assumption that combining the vectors of words in context is likely to yield a vector that is more similar to the correct diacritic variant. In this process, a set of candidate vectors, $D^{wk} = \{d_1, \ldots, d_n\}$, is extracted from the embedding model for each wordkey $wk$. $C$ is defined as the list of the context words of a sentence containing a placeholder (examples are shown in Table 5) to be filled, and $vec_C$ is the context vector of $C$ (Equation (5)).
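The restoration step described above can be sketched as follows. The sketch assumes that the context vector is the sum of the vectors of the known context words and that similarity is measured with the cosine; neither choice is stated in the text here, and the function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def context_vector(context_words, vectors):
    """Combine (here: sum) the vectors of the context words that are known."""
    known = [vectors[w] for w in context_words if w in vectors]
    if not known:
        return None
    return [sum(vec[k] for vec in known) for k in range(len(known[0]))]


def restore(context_words, candidates, vectors):
    """Pick the candidate variant whose vector is closest to the context.

    candidates: {variant: vector} for the wordkey being restored
    """
    ctx = context_vector(context_words, vectors)
    if ctx is None:
        return None  # no usable context; a fallback would be needed
    return max(candidates, key=lambda variant: cosine(ctx, candidates[variant]))


# Toy vectors: the context word "w1" points in the direction of variant_a.
vectors = {"w1": [1.0, 0.1], "w2": [0.1, 1.0]}
candidates = {"variant_a": [0.9, 0.2], "variant_b": [0.2, 0.9]}
choice = restore(["_", "w1"], candidates, vectors)
```

Because the cosine ignores vector length, summing or averaging the context vectors gives the same ranking of candidates in this sketch.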

Evaluation Strategies
A major subtask of this project is building the dataset for training the embedding and other language models. For all of the 29 wordkeys used in the project, we extracted 38,911 instances, each with the correct variant and with no diacritics on any of the words in context. The dataset was used to optimise the parameters in the training of the Basic embedding model. Simple unigram and bigram methods were used as the baseline for the restoration task. 10-fold cross-validation was applied in the evaluation of each of the models.
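A unigram baseline with k-fold cross-validation might look like the sketch below. It reads the unigram method as "always pick the most frequent variant of the wordkey seen in training" and the accuracy measure as the mean of per-wordkey accuracies; both readings, the interleaved fold split without shuffling, and all names and toy data are assumptions for illustration.

```python
from collections import Counter, defaultdict


def train_unigram(instances):
    """Map each wordkey to its most frequent variant in the training data."""
    counts = defaultdict(Counter)
    for wordkey, _context, variant in instances:
        counts[wordkey][variant] += 1
    return {wk: c.most_common(1)[0][0] for wk, c in counts.items()}


def accuracy(model, instances):
    """Mean over wordkeys of the percentage of correctly restored instances."""
    correct, total = Counter(), Counter()
    for wordkey, _context, variant in instances:
        total[wordkey] += 1
        correct[wordkey] += int(model.get(wordkey) == variant)
    if not total:
        return 0.0
    return 100.0 * sum(correct[wk] / total[wk] for wk in total) / len(total)


def cross_validate(instances, k=10):
    """Average accuracy over k interleaved folds (no shuffling)."""
    scores = []
    for fold in range(k):
        test = instances[fold::k]
        train = [x for i, x in enumerate(instances) if i % k != fold]
        scores.append(accuracy(train_unigram(train), test))
    return sum(scores) / k


# Toy data: one wordkey "k" with a frequent variant "a" and a rare one "b".
toy = [("k", [], "a")] * 8 + [("k", [], "b")] * 2
score = cross_validate(toy, k=2)
```

The same `accuracy` and `cross_validate` functions could be reused for the embedding-based methods by swapping `train_unigram` and the lookup for a restoration function.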

Results and Discussion
Our results (Table 6) indicate that, with respect to the n-gram models, the embedding-based diacritic restoration techniques perform comparatively well. Though the projected models (IgboGNews and IgboEnBbl) appear to have struggled a bit compared to IgboBible, one can infer that being trained directly on the dataset and language of the task may have given the latter some advantage. It is also likely to capture the linguistic information relevant to Igbo better than the projected models.
Again, IgboEnBbl did better than IgboGNews, possibly because it was trained on a corpus that directly aligns with the Igbo data used in the experiment. The pre-trained IgboWiki model performed very poorly, possibly because, out of the 3,111 entries in its vocabulary, 1,930 (62.04%) were English words while only 345 (11.09%) were found in the Igbo dictionary we used. It is not yet clear why all of its results are the same across the methods. The best restoration technique across the models is Tweak3, which suggests that very frequent common words may have introduced some noise in the training process.

Conclusion and Future Research Direction
This work contributes to the IgboNLP (Onyenwe et al., 2018) project, whose ultimate goal is to build a framework that can adapt, in an effective and efficient way, existing NLP tools to support the development of Igbo. This paper addresses the issue of building and projecting embedding models for Igbo as well as applying the models to diacritic restoration. We have shown that word embeddings can be used to restore diacritics. However, there is still room for further exploration of the techniques presented here. For instance, we can investigate how generalisable the models produced are with regard to other tasks, e.g. sense disambiguation, word similarity and analogy tasks. On the restoration task, the design here appears to be simpler than real-life usage, as one may want to restore an entire sentence, and by extension a document, and not just fill a blank. Also, with Igbo being a morphologically rich language, the impact of character and sub-word embeddings as compared to word embeddings could be investigated.