Amharic-English Speech Translation in Tourism Domain

This paper describes Amharic-to-English speech translation, in particular Automatic Speech Recognition (ASR) with a post-editing feature and Amharic-English Statistical Machine Translation (SMT). The ASR experiment is conducted using a morpheme language model (LM) and a phoneme acoustic model (AM). Likewise, SMT is conducted using words and morphemes as units. Morpheme-based translation shows a 6.29 BLEU score at 76.4% recognition accuracy, while word-based translation shows a 12.83 BLEU score at 77.4% word recognition accuracy. Further, after post-editing the Amharic ASR output using a corpus-based n-gram approach, the word recognition accuracy increased by 1.42%. Since the post-edit approach reduces error propagation, the word-based translation accuracy improved by 0.25 BLEU (1.95%). We are now working towards further reducing propagated errors through different algorithms at each component of the speech translation cascade.


Introduction
Speech is one of the most natural forms of communication for humankind (Honda, 2003). Computers with the ability to understand natural language have promoted the development of man-machine interfaces. This can be extended through different digital platforms such as radio, mobile, TV, CD and others. Through these, speech translation facilitates communication between people who speak different languages.
Speech translation is the process by which spoken source phrases are translated to a target language using a computer (Gao et al., 2006). Speech translation research for major and technologically supported languages like English, European languages (like French and Spanish) and Asian languages (like Japanese and Chinese) has been conducted since 1983, starting with NEC Corporation (Kurematsu, 1996). Advances in speech translation facilitate communication between people who do not share the same language.
A state-of-the-art speech translation system can be seen as the integration of three major cascading components (Gao et al., 2006; Jurafsky and Martin, 2008): Automatic Speech Recognition (ASR), Machine Translation (MT) and Text-To-Speech (TTS) synthesis.
ASR is the process by which a machine recognizes the words spoken to it, that is, correctly transcribes a recorded audio signal. MT is the process by which a machine translates text from a source language to a target language. Finally, TTS creates a spoken version of the text in an electronic document such as a text file or web document.
As one major component of speech translation, Amharic ASR research started in 2001. A number of attempts have been made for Amharic ASR using different methods and techniques towards designing speaker-independent, large-vocabulary, continuous and spontaneous speech recognition.
In addition to ASR, a preliminary English-Amharic machine translation experiment was conducted using phonemic transcription on the Amharic corpus (Teshome et al., 2015). The result obtained from the experiment shows that it is possible to design English-Amharic machine translation using a statistical method.
As for the last component of speech translation, a number of TTS studies have been attempted using different techniques and methods, as discussed by Anberbir and Takara (2009). Among these, concatenative, cepstral, formant and syllable-based speech synthesizers were the main methods and techniques applied.
All of the above research works were conducted using different methods, techniques and data, and none of them integrated the components into a cascade. Moreover, the datasets and tools used in the above research are not accessible, which makes it difficult to evaluate the advancement of research in speech technology for local languages.
To date, there has been no attempt to integrate ASR, SMT and TTS into a speech translation system for the Amharic language. Thus, the main aim of this study is to investigate the possibility of designing an Amharic-English speech translation system that controls recognition errors propagating through the cascading components.

Amharic Language
Amharic is a Semitic language derived from Ge'ez and has the second largest number of speakers among Semitic languages in the world, next to Arabic (Simons and Fennig, 2017). The name Amharic (አማርኛ) comes from the district of Amhara (አማራ) in northern Ethiopia, which is thought to be the historic, classical and ecclesiastical language of Ethiopia. Moreover, Amharic has five dialectal variations: Addis Ababa, Gojam, Gonder, Wollo and Menz.
Amharic is the official working language of the government of Ethiopia, among the 89 languages registered in the country with up to 200 different spoken dialects (Simons and Fennig, 2017; Thompson, 2016). In addition, Amharic is used in governmental administration, public media and national commerce in some regional states of the country, including Addis Ababa, Amhara, Diredawa and Southern Nations, Nationalities and People (SNNP).
Amharic is spoken by more than 25 million people, of whom up to 22 million are native speakers. The majority of Amharic speakers are found in Ethiopia, although there are also speakers in a number of other countries, particularly Italy, Canada, the USA and Sweden.
Unlike other Semitic languages, such as Arabic and Hebrew, modern Amharic script has inherited its writing system from Ge'ez (ግዕዝ) (Yimam, 2000). Amharic uses a grapheme-based writing system called fidel (ፊደል), written and read from left to right. Amharic graphemes represent consonant-vowel (CV) pairs: the basic shape is determined by the consonant and is modified for the vowel. The Amharic writing system is composed of four distinct categories consisting of 276 different symbols: 33 core characters with 7 orders (አ, ኡ, ኢ, ኣ, ኤ, እ and ኦ), 4 labiovelars with 5 orders (ቅ, ኅ, ክ and ግ), 18 labialized consonants with 1 order (ውኣ) and 1 labiodental character with 7 orders (አ, ኡ, ኢ, ኣ, ኤ, እ and ኦ).
In the Amharic writing system, all 276 symbols are indispensable because each has a distinct orthographic representation.
However, as part of speech translation, speech recognition mainly deals with distinct sounds. Some graphemes represent the same sound; for example, ህ, ሕ, ኅ and ኽ are all pronounced as ህ /h/.
On the other hand, machine translation operates on the orthographic representation, in which the same meaning can be written with different graphemes. As a result, normalization is required to minimize grapheme variation, which leads to better translation while reducing the size of the ASR model. Table 1 presents the Amharic character set before and after normalization.
Table 1: Distribution of the Amharic character set (adopted and modified from prior work).
Graphemes that generate the same sound are therefore normalized into the seven orders of a core character. The normalization target is the character used most frequently in Amharic text documents. This includes normalization from (ህ, ሕ, ኅ and ኽ) to ህ, (እ, ዕ) to እ, (ሥ, ስ) to ስ and (ጽ, ፅ) to ጽ, along with their orders.
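The normalization described above can be sketched as a character translation table. This is only a minimal illustration: the choice of Unicode code points, the assumption that each homophone series maps order by order onto its target series, and the names `HOMOPHONE_SERIES`, `build_table` and `normalize` are all assumptions made for this example, not the implementation used in the study.

```python
# Assumed homophone series as (variant base, target base) Unicode code points.
# In the Ethiopic block each consonant series occupies consecutive code points,
# one per order, so the k-th order of a variant maps to the k-th order of its target.
HOMOPHONE_SERIES = [
    (0x1210, 0x1200),  # second /h/ series  -> first /h/ series
    (0x1280, 0x1200),  # third /h/ series   -> first /h/ series
    (0x12B8, 0x1200),  # fourth /h/ series  -> first /h/ series
    (0x12D0, 0x12A0),  # pharyngeal vowel carrier -> glottal vowel carrier
    (0x1220, 0x1230),  # second /s/ series  -> first /s/ series
    (0x1340, 0x1338),  # second /ts'/ series -> first /ts'/ series
]
ORDERS = 7  # seven vowel orders per core character


def build_table(series=HOMOPHONE_SERIES, orders=ORDERS):
    """Build a str.translate table mapping each variant grapheme to its target."""
    table = {}
    for variant_base, target_base in series:
        for k in range(orders):
            table[variant_base + k] = target_base + k
    return table


NORMALIZATION_TABLE = build_table()


def normalize(text):
    """Replace homophone graphemes with their normalized form, keeping the order."""
    return text.translate(NORMALIZATION_TABLE)
```

Because the table is keyed by code point, characters outside the listed series (including Latin text and punctuation) pass through unchanged.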

Tourism in Ethiopia
Tourism is the activity of traveling to and staying in places outside one's usual environment for not more than one year, creating direct contact between people and cultures (UNWTO, 2016). Ethiopia has much to offer international tourists, ranging from the peaks of the rugged Semien mountains to one of the lowest points on earth, the Danakil Depression, which is more than 400 feet below sea level.
In addition, tourism has become a sustainable form of economic development that serves as an alternative source of foreign exchange for countries like Ethiopia.
Moreover, the 2015 United Nations World Tourism report (UNWTO, 2016) and the World Bank report indicate that in 2015 a total of 864,000 non-resident tourists came to Ethiopia to visit different tourist attractions. These include ancient and medieval cities and world heritage sites registered by UNESCO. From 2010 to 2015, the number of tourists increased by an average of 13.05% per year.
According to Walta Information Center, citing the Ethiopian Ministry of Culture and Tourism, Ethiopia secured 872 million dollars in the first quarter of its 2016/17 fiscal year from 223,032 international tourists. The revenue came mostly from conference tourism, research, business and other activities. The majority of the tourists were from the USA, England, Germany, France and Italy. Although these tourists speak different languages, the majority can speak and communicate in English to exchange information about tourist attractions.
Language barriers are therefore a major problem for today's global communication (Nakamura, 2009). As a result, tourists look for an alternative that lets them communicate with the people around them.
Thus, a speech translation system is one of the best technologies for filling the communication gap between people who speak different languages (Nakamura, 2009). This is especially true for overcoming the language barriers of today's global communication while also supporting under-resourced languages.
However, under-resourced languages such as Amharic suffer from a lack of digital text and speech corpora to support speech translation. So, once text and speech corpora are collected, moving one step further to build a speech translation system helps to solve the language barrier problem.
Therefore, this study attempts to come up with an Amharic-English speech translation system taking tourism as a domain.

Data Preparation
At present, Amharic suffers from a lack of speech and text corpora for ASR and SMT. In addition, collecting standardized and annotated corpora is one of the most challenging and expensive tasks when working with under-resourced languages (Besacier et al., 2006). For Amharic speech recognition training and development, the 20-hour read speech corpus prepared by Abate et al. (2005) was used. However, due to the unavailability of standardized corpora for speech translation in the tourism domain, a text corpus was acquired from a resourced and technologically supported language, namely English.
Accordingly, parallel English-Arabic text data was acquired from the Basic Traveller Expression Corpus (BTEC) 2009, which is made available through the International Workshop on Spoken Language Translation (IWSLT) (Kessler, 2010). A parallel Amharic-English corpus was then prepared by having a bilingual speaker translate the English BTEC data. This data is used to develop the cascading components of speech translation, namely ASR and SMT.
The corpus has a total of 28,084 Amharic-English parallel sentences. To keep the dataset consistent, the text corpus has been further preprocessed: typing errors have been corrected, abbreviations expanded, numbers transcribed textually and concatenated words separated.
Amharic speech recognition is conducted using words and morphemes as language model units with a phoneme-based acoustic model. Similarly, words and morphemes have been used as translation units for Amharic in Amharic-English machine translation. The morpheme-based training, development and test sets are obtained by segmenting words into sub-word units with the corpus-based, language-independent and unsupervised segmentation of Morfessor 2.0 (Smit et al., 2014).
Then, the 8,112 (28.38%) test set sentences were recorded in a normal office environment by eight native Amharic speakers (4 male and 4 female) using LIG-Aikuma, a smartphone-based application (Blachon et al., 2016).
Accordingly, a total of 7.43 hours of read speech has been collected from the tourism domain, with utterances ranging from 1,020 ms to 14,633 ms and an average duration of 3,297 ms.
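As a quick consistency check on these figures, the average utterance duration can be recomputed from the total recording time and the number of recorded sentences. The helper name `average_duration_ms` is introduced here purely for illustration.

```python
def average_duration_ms(total_hours, utterances):
    """Average utterance length in milliseconds for a corpus of the given size."""
    total_ms = total_hours * 3600 * 1000
    return total_ms / utterances


# 7.43 hours of speech spread over the 8,112 recorded test sentences
avg = average_duration_ms(7.43, 8112)
print(round(avg))
```

The rounded result agrees with the reported average of 3,297 ms.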
Moreover, a morphologically rich and under-resourced language like Amharic has been reported to achieve better recognition accuracy with a morpheme-based language model and a phoneme-based acoustic model. Similarly, language model data for Amharic speech recognition has been collected from different sources. A text corpus collected for a Google project (Tachbelie and Abate, 2015) has been used in addition to the BTEC SMT training data, excluding the test data.
As for speech recognition, English language model data totalling 42,134 sentences (374,153 tokens of 8,678 types) has been used for Amharic-English machine translation. The data is collected from the same BTEC corpus, excluding the test data.
Consequently, corpus-based and language-independent segmentation has been applied to the training, development and test sets of the Amharic SMT data. Morfessor is used to segment words into sub-word units. Table 3 presents a summary of the corpus used for Amharic-English machine translation using words and morphemes as units.
Likewise, the post-edit is conducted using a corpus-based n-gram approach. Accordingly, a corpus of 681,910 sentences (11,514,557 tokens of 582,150 types) was crawled from the web, including news and magazine sources.
Then, the data is further cleaned, preprocessed and normalized. From this data, a total of 5,057,112 bigram, 8,341,966 trigram, 9,276,600 4-gram and 9,242,670 5-gram word sequences have been extracted after expanding numbers and abbreviations.
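A minimal sketch of how such an n-gram store could be built is shown below. It assumes whitespace tokenization and counts each distinct word sequence; whether the reported totals refer to distinct sequences or to all occurrences is not stated, and the function name `count_ngrams` is hypothetical.

```python
from collections import Counter


def count_ngrams(sentences, orders=(2, 3, 4, 5)):
    """Count word n-grams of each requested order over an iterable of sentences."""
    counts = {n: Counter() for n in orders}
    for sentence in sentences:
        tokens = sentence.split()  # assumption: text is already cleaned and normalized
        for n in orders:
            # a sentence with fewer than n tokens contributes no n-grams
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts


# Toy usage: number of distinct n-grams per order
store = count_ngrams(["a b c d e", "a b c"])
distinct = {n: len(c) for n, c in store.items()}
```

For a corpus of several hundred thousand sentences the counters fit in memory on a typical workstation, but an on-disk store may be preferable for larger crawls.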

System Architecture
As discussed in Section 1, state-of-the-art speech translation is built by integrating cascading components to translate speech from a source language (Amharic) to a target language (English). Within the cascade, the output of the speech recognizer contains a variety of errors. These errors propagate to the succeeding components of speech translation, which results in low performance.
Hence, in this study we propose an Amharic ASR post-editing module that can detect an error, identify possible suggestions and finally correct the error based on these suggestions. The correction is made from an n-gram data store using minimum edit distance and perplexity, before the error reaches statistical machine translation. Figure 1 presents the Amharic-English speech-to-speech translation (S2ST) architecture with and without ASR post-editing.
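The two configurations, with and without post-editing, can be sketched as a simple composition of stages. The callables `recognize`, `post_edit` and `translate` are placeholders for the ASR, post-edit and SMT components; they are assumptions made for illustration and do not correspond to any specific toolkit API.

```python
def translate_speech(audio, recognize, translate, post_edit=None):
    """Run the cascade: ASR, optional post-edit of the ASR output, then SMT."""
    hypothesis = recognize(audio)
    if post_edit is not None:
        # correct recognition errors before they propagate to translation
        hypothesis = post_edit(hypothesis)
    return translate(hypothesis)


# Toy usage with stand-in components
asr = lambda audio: "helo world"
fix = lambda text: text.replace("helo", "hello")
smt = str.upper

without_post_edit = translate_speech(b"...", asr, smt)
with_post_edit = translate_speech(b"...", asr, smt, post_edit=fix)
```

Keeping the post-edit stage optional makes it straightforward to evaluate the same ASR and SMT models in both settings.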
The post-edit process mainly consists of three phases: error detection, correction proposal and correction suggestion, as depicted in Figure 2.
The first phase of post-editing detects errors in the ASR output. To detect an error, the recognized morpheme units are concatenated to form words, and the existence of each word is checked in a unigram Amharic dictionary.
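A minimal sketch of this detection phase follows. It assumes that a morph which continues into the next one carries a trailing "+" marker; this marker convention and the names `join_morphs` and `detect_errors` are hypothetical, since the boundary notation used in the experiments is not specified.

```python
def join_morphs(morphs, marker="+"):
    """Concatenate recognized morphs into words using an assumed boundary marker."""
    words, current = [], ""
    for morph in morphs:
        if morph.endswith(marker):
            current += morph[:-len(marker)]  # word continues with the next morph
        else:
            words.append(current + morph)    # word is complete
            current = ""
    if current:
        words.append(current)                # dangling continuation at sentence end
    return words


def detect_errors(morphs, dictionary, marker="+"):
    """Return the joined words and a flag per word: True if absent from the dictionary."""
    words = join_morphs(morphs, marker)
    flags = [word not in dictionary for word in words]
    return words, flags
```

With a unigram dictionary stored as a Python set, each lookup is a constant-time membership test.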
If an error is detected during the first phase, the correction proposal phase takes the sentence with the error mark and creates (w-n+1) n-grams after adding the start "<s>" and end "</s>" symbols, where w is the number of tokens in the sentence and n specifies the n-gram order. Otherwise, the sentence is considered correct.
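The n-gram windowing can be sketched as follows. In this sketch w is taken to be the token count after the boundary symbols are added, which is one reading of the description and should be treated as an assumption; the function name `padded_ngrams` is hypothetical.

```python
def padded_ngrams(sentence, n):
    """Return the (w - n + 1) n-grams of a sentence padded with <s> and </s>."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    w = len(tokens)  # assumption: w counts tokens including the boundary symbols
    # range() is empty when n > w, so short sentences simply yield no n-grams
    return [tuple(tokens[i:i + n]) for i in range(w - n + 1)]


trigrams = padded_ngrams("a b c", 3)
```

Each window that contains the flagged word can then be looked up in the n-gram store to retrieve candidate corrections.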
Once the candidates are identified, the suggestion is made by taking the minimum edit distance between the detected error and the selected suggestion. In this phase, the maximum edit distance has been set experimentally to 16. This value was selected to provide at least one suggestion per sentence and to minimize the computation of perplexity. Table 4 depicts a sample of possible correction proposals for the sentence "Îs ¶³ …¡ -°È ¶Û°sã €ÔrÝ †∫".
Figure 2: Amharic ASR post-edit algorithm
Finally, the suggestion is made primarily using minimum edit distance and then by calculating the perplexity. The minimum edit distance is computed between the word "-°È ¶Û" and the underlined n-gram-based possible suggestions from the sentence in Table 4.
Table 4: Sample n-gram based suggestion for a sentence "Îs ¶³ …¡ -°È ¶Û°sã €ÔrÝ †∫".
If two suggestions have the same edit distance, the decision is made by selecting the one that results in lower perplexity.
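The selection rule can be sketched as below. The sketch treats the maximum edit distance as a per-candidate threshold and takes the perplexity scorer as a caller-supplied function; both choices, as well as the names `edit_distance` and `choose_correction`, are assumptions for illustration rather than the exact procedure used in the experiments.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        curr = [i]
        for j, item_b in enumerate(b, 1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def choose_correction(error, candidates, perplexity, max_distance=16):
    """Pick the closest candidate; break ties by the lower perplexity score."""
    scored = [(edit_distance(error, c), c) for c in candidates]
    scored = [(d, c) for d, c in scored if d <= max_distance]
    if not scored:
        return None  # no candidate within the allowed distance
    best = min(d for d, _ in scored)
    tied = [c for d, c in scored if d == best]
    return min(tied, key=perplexity)
```

In practice `perplexity` would score the whole sentence with the candidate substituted in, for example using a language model trained on the same corpus as the n-gram store.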

Experimental results
Speech translation experiments are conducted through cascading components of speech translation as discussed in Section 1. In speech recognition experiments, Kaldi (Povey et al., 2011), SRILM (Stolcke et al., 2002) and Morfessor 2.0 (Smit et al., 2014) have been used for Amharic speech recognition, language modeling and unsupervised segmentation, respectively.
Morfessor-based segmentation has been applied to segment the training, testing and language model data for Amharic. In addition, Moses and MGIZA++ are used to implement phrase-based statistical machine translation, and Python is used to implement the post-edit algorithm and to integrate ASR and SMT on the Linux platform.
The entire ASR experiment is conducted using a morpheme-based language model with a phoneme-based acoustic model. The experimental results are computed using the NIST Scoring Toolkit (SCTK) and presented in terms of word recognition accuracy (WRA) and morph recognition accuracy (MRA).
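For reference, recognition accuracy of this kind is commonly computed from a word-level alignment as 100 × (N − S − D − I) / N, where N is the number of reference words and S, D and I are substitutions, deletions and insertions. The sketch below assumes that definition and is not a reimplementation of SCTK; the function names are hypothetical. Applying the same function to morph sequences would give a morph-level figure.

```python
def word_edit_distance(ref_words, hyp_words):
    """Minimum number of substitutions, deletions and insertions between word lists."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref in enumerate(ref_words, 1):
        curr = [i]
        for j, hyp in enumerate(hyp_words, 1):
            cost = 0 if ref == hyp else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def recognition_accuracy(references, hypotheses):
    """Accuracy in percent over paired reference and hypothesis sentences."""
    total_words = 0
    total_errors = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words, hyp_words = ref.split(), hyp.split()
        total_words += len(ref_words)
        total_errors += word_edit_distance(ref_words, hyp_words)
    return 100.0 * (total_words - total_errors) / total_words
```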
The Amharic speech recognition experiment shows a morpheme recognition accuracy of 76.4%. After concatenating morphemes into words, a word recognition accuracy of 77.4% has been achieved.
Subsequently, Amharic-English SMT experiments have been conducted with and without the Amharic ASR result.
The first two experiments were conducted without ASR. A word-word system resulted in a BLEU score of 14.72, while the morpheme-word system achieved 11.24 BLEU. Combining Amharic ASR with Amharic-English SMT as cascading components resulted in a 6.29 BLEU score at 76.4% recognition accuracy for translation from Amharic morphemes to English words.
Similarly, translation from Amharic words to English words shows a 12.83 BLEU score at 77.4% recognition accuracy without ASR post-editing. The result achieved with ASR can be further improved by applying post-editing to the Amharic speech recognition output.
Table 6: Amharic-English Speech Translation result.
Accordingly, morpheme-based recognition followed by post-editing resulted in a speech translation BLEU score of 13.08 at 78.5% word recognition accuracy.
The result obtained from the n-gram post-edit experiment shows a relative improvement of 1.42% (from 77.4% to 78.5%) over the word recognition accuracy obtained by concatenating the output of the 76.4% morpheme-based recognition. Similarly, the BLEU score improved by a relative 1.95% (from 12.83 to 13.08).
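Both percentages can be reproduced as relative gains over the scores obtained without post-editing. The helper `relative_gain` is introduced here only to show the arithmetic.

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return 100.0 * (after - before) / before


wra_gain = round(relative_gain(77.4, 78.5), 2)     # word recognition accuracy
bleu_gain = round(relative_gain(12.83, 13.08), 2)  # BLEU score
print(wra_gain, bleu_gain)
```

In absolute terms these gains correspond to 1.1 points of word recognition accuracy and 0.25 BLEU.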

Conclusion and Future work
Speech translation has been studied for more than a decade for resourced and technologically supported languages such as English and European and Asian languages. In contrast, attempts for under-resourced languages, particularly Amharic, have not yet started. This paper presents the first Amharic speech to English text translation using the cascading components of speech translation.
For ASR, 20 hours of training speech and 7.43 hours of test speech were used with a morpheme-based language model and a phonemic acoustic model. For SMT, 19,472 sentences were used for training and 8,112 sentences for testing. Similarly, to apply ASR post-editing with the n-gram approach, a corpus consisting of 681,910 sentences was used.
Accordingly, speech translation with ASR post-editing resulted in a 0.25 BLEU (1.95%) improvement over word-based SMT. The improvement appears to result from improving ASR by 1.42% using corpus-based n-gram post-editing.
The current study shows that the performance of speech translation can be enhanced by controlling speech recognition error propagation with a post-editing algorithm.
Further work is needed to apply post-editing at both the recognition and the translation stages of speech translation.