Spoken Term Discovery for Language Documentation using Translations

Vast amounts of speech data collected for language documentation and research remain untranscribed and unsearchable, but often a small amount of speech may have text translations available. We present a method for partially labeling additional speech with translations in this scenario. We modify an unsupervised speech-to-translation alignment model and obtain prototype speech segments that match the translation words, which are in turn used to discover terms in the unlabelled data. We evaluate our method on a Spanish-English speech translation corpus and on two corpora of endangered languages, Arapaho and Ainu, demonstrating its appropriateness and applicability in an actual very-low-resource scenario.


Introduction
Language documentation efforts over the last 50-60 years have resulted in audio recordings of native speakers in a large number of languages, many of which are available online. However, due to the enormous effort required for transcription, much of the data remains unannotated and unsearchable. For example, out of the 137 unrestricted collections in the Archive of the Indigenous Languages of Latin America, about half (49%) contain no transcriptions at all, and only 7% are fully transcribed. As a result, some recent documentation efforts have begun to focus instead on annotating with translations, often with the help of bilingual native speakers themselves (Bird et al., 2014; Blachon et al., 2016; Adda et al., 2016). Nevertheless, even translation takes time and language knowledge, so there may still be little translated data relative to the amount of recorded audio. An important goal, then, is to bootstrap language technology from this small parallel corpus in order to provide tools to annotate more data or make the data more searchable.
We build on the approach of Anastasopoulos et al. (2016), who developed a system that performs joint inference to identify recurring segments of audio and cluster them while aligning them to words in a text translation. Here, we extend the method to be able to search for new instances of the latent clusters within the unlabeled audio, effectively providing keyword translations for some of the unlabeled speech. We evaluate our method on a Spanish-English corpus used in previous work, and on two datasets from endangered languages (narratives in Arapaho and Ainu). No previous computational methods have been tested on the latter data, to our knowledge. We show that in all cases, our system outperforms a recent baseline targeted at the same very low-resource setting (Bansal et al., 2017b), also showing robustness to audio quality and preprocessing decisions.

Related work
Our work joins a handful of other recent proposals aimed at low-resource speech-to-text alignment and translation. These include those of Duong et al. (2016) and Anastasopoulos et al. (2016), who performed alignment only; Bérard et al. (2016), who used synthetic rather than real speech; Adams et al. (2016) and Godard et al. (2016), who worked from phone lattices and phone sequences, respectively; and Stahlberg et al. (2013), who performed phone-to-translation alignment for pronunciation extraction. Weiss et al. (2017) presented a sequence-to-sequence neural model that learned a direct mapping from speech to translated text with impressive results, but was trained on roughly 140 hours of parallel data, far more than is available for most endangered languages.
The only previous system we know of to address the same very-low-resource scenario and provide translation terms for unlabeled audio is that of Bansal et al. (2017b) (henceforth UTD-align), who used an unsupervised term discovery system (Jansen et al., 2010) to cluster recurring audio segments into pseudowords. The pseudowords occurring in the parallel section of the corpus were then aligned to the translation text using IBM Model 1, and used to translate instances occurring in the test (audio-only) section.

Method
The main difference between our method and UTD-align is that UTD-align clusters the audio prior to aligning with the translations, whereas we start by performing joint alignment and clustering using an improved version of the method proposed by Anastasopoulos et al. (2016) (henceforth s2t). The resulting aligned clusters are represented by one or more prototype speech segments. We extend s2t to identify new instances of those prototypes in the unlabeled speech, using a modified version of ZRTools (Jansen et al., 2010), the same UTD toolkit used by UTD-align. Previous work has indicated that using translation text to inform acoustic clustering provides more accurate clusters than just using UTD (Bansal et al., 2017a), so we initially expected that this straightforward extension of s2t would work better than UTD-align. However, early experiments indicated that the text had too much influence on clustering, yielding clusters with highly diverse audio, and thus poor prototypes. Thus, we modified s2t in order to account for this issue, obtaining prototypes of higher quality (§3.1), which we search for in the unlabeled audio (§3.2).

Aligning speech to translation
The s2t model is an extension of IBM Model 2 for word alignment (Brown et al., 1993), combined with K-means clustering using Dynamic Time Warping (DTW) (Berndt and Clifford, 1994) as a distance measure. It uses expectation-maximization (EM) to align speech segments to words in the parallel text, while jointly clustering the segments. Each translation word is aligned to an acoustic segment, with overlapping alignments and unaligned speech spans being allowed.
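To make the distance measure concrete, the following is a minimal sketch of DTW between two sequences of acoustic feature vectors, together with the nearest-prototype assignment used in a K-means-style clustering step. It is not the s2t implementation: the function names, the Euclidean frame distance, and the normalization by the summed sequence lengths are all assumptions made for illustration.

```python
import math


def dtw_distance(a, b):
    """Length-normalized DTW cost between two sequences of feature vectors.

    Assumptions (not from the paper): Euclidean frame distance and
    normalization by len(a) + len(b).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest warping cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # advance in a only
                                 cost[i][j - 1],      # advance in b only
                                 cost[i - 1][j - 1])  # advance in both
    return cost[n][m] / (n + m)


def closest_prototype(segment, prototypes):
    """Index of the prototype with the smallest DTW distance to the segment."""
    return min(range(len(prototypes)),
               key=lambda k: dtw_distance(segment, prototypes[k]))
```

Because DTW allows frames to be repeated, two renditions of the same word spoken at different speeds can still receive a low distance, which is what makes it suitable for comparing speech segments of unequal length.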
In the original implementation, every translation word was represented by a fixed number (2) of acoustic sub-clusters, with a single prototype representing each. The prototypes are averages of the segments in the cluster, computed using DTW Barycenter Averaging (Petitjean et al., 2011). At the E-step, each segment was assigned to its closest sub-cluster, and at the M-step the sub-cluster's prototype was re-computed. However, the original choice of two sub-clusters was fairly arbitrary, and we found it does not sufficiently account for the wide acoustic variability due to gender or speaker. We thus modify s2t so that, before the M-step, each cluster's segments are grouped into sub-clusters using connected components clustering with a similarity threshold δ, following Park and Glass (2008). That way, the number of sub-clusters and prototypes for each translation word is determined automatically based on the acoustic similarity of the segments.
Our preliminary analysis showed that shorter alignments tend to introduce significantly more noise than longer ones. Therefore, in the final M-step of s2t, we discard all segments shorter than a length threshold t before computing the prototypes. We use the default values for the rest of the s2t parameters.
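The sub-clustering step can be sketched as follows: segments aligned to the same translation word are linked whenever their pairwise similarity reaches δ, and each connected component becomes one sub-cluster. This is an illustrative sketch only. The function name and the `similarity` callback are hypothetical, and how similarity is derived from DTW scores is left as an assumption.

```python
def connected_components(n_items, similarity, delta):
    """Group items 0..n_items-1 into sub-clusters.

    Two items are linked when similarity(i, j) >= delta; sub-clusters are
    the connected components of the resulting graph (union-find).
    `similarity` is a hypothetical callback, e.g. backed by DTW scores.
    """
    parent = list(range(n_items))

    def find(x):
        # Follow parent pointers to the root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n_items):
        for j in range(i + 1, n_items):
            if similarity(i, j) >= delta:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    groups = {}
    for i in range(n_items):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

With this formulation the number of sub-clusters is not fixed in advance: a high δ yields many small, acoustically tight groups, while a low δ merges most segments into a single group.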
Another pragmatic choice we made based on the performance of our method was to remove the stopwords from the translations, following Bansal et al. (2017b).The rationale is that translation stopwords would not be particularly useful for labelling speech in our envisioned use cases.

Keyword Search
In the second stage, we use the approximate DTW-based pattern matching method of ZRTools to search for the obtained prototypes in the test data. We require that each discovered term matches at least k% of a prototype's length and that its DTW similarity score is higher than a threshold s. By varying s we can control the number of discovered terms, trading off precision and recall. Also, we do not allow overlapping matches; in the case of an overlap, we output the match with the higher score.
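The filtering rules above can be sketched as a post-processing pass over candidate matches. The sketch below does not call ZRTools; it assumes the candidates have already been produced by some matcher. The `Match` record, its field names, the reading of the k% criterion as "matched span is at least k times the prototype duration", and the greedy highest-score-first overlap resolution are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Match:
    start: float      # match start time in the test audio (seconds)
    end: float        # match end time (seconds)
    word: str         # translation word of the matched prototype
    score: float      # DTW similarity score of the match
    proto_len: float  # duration of the matched prototype (seconds)


def select_matches(candidates, k=0.8, s=0.90):
    """Apply the length, score, and no-overlap constraints to candidates."""
    eligible = [m for m in candidates
                if (m.end - m.start) >= k * m.proto_len and m.score > s]
    kept = []
    # Visit the best-scoring matches first so that, when two matches
    # overlap, the higher-scoring one is the one that survives.
    for m in sorted(eligible, key=lambda m: m.score, reverse=True):
        if all(m.end <= o.start or m.start >= o.end for o in kept):
            kept.append(m)
    return sorted(kept, key=lambda m: m.start)
```

Raising `s` shrinks the eligible set and therefore trades recall for precision, which is how a precision-recall curve can be traced from a single set of candidates.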
The CALLHOME Spanish Speech dataset (LDC2014T23) with English translations (Post et al., 2013) has been used in almost all ground-laying previous work, treating Spanish as a low-resource language. As a collection of telephone conversations between relatives (about 20 total hours of audio), it does not match our language documentation scenario, but we use it in order to compare our method with previous work.
We shuffle the utterances and split them into training, dev, and test sets with 70%, 10%, and 20% of the data, respectively. We filter the active audio regions using energy-based voice activity detection (VAD). We obtain prototypes in the training set and tune the values of the length threshold t, the similarity threshold δ, and the partial overlap threshold k on the development set using grid search. The best parameter combination is t = 300 ms, δ = 90%, and k = 80%, while s = 0.90 returns the highest F-score. We evaluate our discovered translation terms on the test set using precision, recall, and F-score at the token level over the correct bag-of-words translations.
We also evaluate our method on two low-resource endangered languages, Arapaho and Ainu. For these experiments, we only have a training and test set, so we use the same preprocessing and hyperparameter settings as in CALLHOME.
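As an illustration of the token-level evaluation, the sketch below computes precision, recall, and F-score between a bag of predicted translation words and a bag of reference words. The function name is hypothetical, and counting a predicted token as correct only up to the number of times it occurs in the reference (clipped counts) is our assumption about how the bags are compared.

```python
from collections import Counter


def bag_of_words_prf(predicted, reference):
    """Token-level precision, recall, and F-score over bags of words."""
    pred, ref = Counter(predicted), Counter(reference)
    # Multiset intersection: each word counts min(pred count, ref count).
    correct = sum((pred & ref).values())
    precision = correct / sum(pred.values()) if pred else 0.0
    recall = correct / sum(ref.values()) if ref else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```

For instance, predicting "bear" twice when the reference contains it once yields one correct token and one error, so repeated spurious detections lower precision.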
Arapaho is an Algonquian language with about 1,000 native speakers, mostly in Wyoming. We use 8 narratives published at The Arapaho Language Project, which provides the narratives' audio along with English translations, among other language learning resources.
Hokkaido Ainu is the sole surviving member of the Ainu language family and is generally considered a language isolate. As of 2007, only ten native speakers were alive. The Glossed Audio Corpus of Ainu Folklore provides 10 narratives with audio and translations in English. More information and statistics on the Arapaho and Ainu corpora are provided in Tables 4 and 5.

Results on CALLHOME
We first evaluate the effect of our modifications to the s2t method, by calculating alignment F-score on links between speech frames and translation words. The intermediate sub-clustering step between the E- and M-steps results in a more informed selection of the number of sub-clusters that increases the alignment F-score by 1.5%. Also, removing translation stopwords further increases alignment precision by 4%. Alignment recall is lower since it is computed over the alignments of both content words and stopwords. Although both improvements are small, the higher alignment precision leads to better prototypes.
In addition, Duong et al. (2016) created "silver" standard speech-to-translation alignments by combining the forced speech-to-transcription alignments and the transcription-to-translation word alignments. These are useful for evaluating how well the prototype creation and matching could work, given oracle speech-to-translation alignments. In Table 1, we report precision, recall, and F-score on the discovered translation terms (at the token level) using prototypes from both "silver" and noisy alignments. We also report the percentage of active audio that is labelled (coverage). In both cases we outperform UTD-align. Even though there is room for improvement, using the translation information at the alignment stage certainly improves the clustering, as anticipated. Another advantage of our method over UTD-align is its significantly improved coverage of the active audio, as shown in the last column of Table 1. The precision-recall curve obtained by varying the output similarity threshold s is shown in Figure 1.
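The idea behind such "silver" alignments can be sketched as a composition of two alignments: forced alignment gives a time span for each transcription word, and word alignment links transcription words to translation words. The sketch below is only an illustration of that composition. The function name and data layout are hypothetical, and the exact procedure of Duong et al. (2016), for example how multiple spans are merged, is not reproduced here.

```python
def compose_alignments(forced_spans, word_links):
    """Project forced-alignment time spans onto translation words.

    forced_spans: list where forced_spans[i] is the (start, end) time span
                  of the i-th transcription word.
    word_links:   iterable of (transcription_index, translation_index) pairs.
    Returns a dict mapping each translation word index to the list of
    speech spans it is linked to through the transcription.
    """
    spans_by_target = {}
    for src_idx, tgt_idx in sorted(word_links):
        spans_by_target.setdefault(tgt_idx, []).append(forced_spans[src_idx])
    return spans_by_target
```

Translation words with no word-alignment link simply receive no span, which mirrors the fact that some translation words have no direct counterpart in the speech.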

Results on Arapaho and Ainu
Out of the eight Arapaho narratives, we select the longest (18 minutes of audio, 233 English word types) for training, using the other seven (32 minutes total) for evaluation. The Ainu collection provides ten narratives, so we use the first two for [...] Treating each narrative as a bag of words, the precision and recall results at the token level are shown in Tables 2 and 3. The last columns of these tables correspond to the highest possible recall that we could get if we discovered all the training terms that also appear in the test set. Precision-recall curves can be seen in Figure 1.
On both corpora, UTD-align identifies hardly any translation terms, with recall scores below 1% and average F-scores of 0.8% and 0.2% for Arapaho and Ainu, respectively. Preprocessing with the same VAD script as for our method, UTD-align produced too many spurious matches (millions); we then used more aggressive filtering, which removed more parts of the audio, but it resulted in too few discovered matches (as shown here). In principle, it should be possible to tailor the preprocessing parameters for each corpus and improve results for UTD-align.
Our method, instead, outputs several terms per [...] 2 and 3, we are generally able to identify meaningful terms. For most of the Arapaho stories we discover named entities such as Ghost and Strong Bear, content nouns like tipis and mountains, or verbs such as hunting. In Ainu we discover more terms, but the narratives are also longer. A larger domain shift between training and test (small overlap on named entities and other content words) leads to lower recall compared to Arapaho. Our method correctly identifies mostly common terms in the Ainu narratives, like village, food, as well as verbs used in narration such as said, went, or came.
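The recall upper bound mentioned above can be illustrated with a short sketch: since only words seen in the training translations can ever be output, the best achievable token-level recall on a test narrative is the fraction of its translation tokens whose word type occurs in training. The function name is hypothetical, and this token-level reading of "highest possible recall" is an assumption.

```python
def max_possible_recall(train_words, test_words):
    """Fraction of test tokens whose word type was seen in training."""
    if not test_words:
        return 0.0
    vocabulary = set(train_words)
    seen = sum(1 for word in test_words if word in vocabulary)
    return seen / len(test_words)
```

A low value of this bound indicates a large vocabulary mismatch between the training and test narratives, independent of how well the acoustic matching works.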

Conclusion
We propose a method that modifies and extends a speech-to-translation alignment method and can be used for identifying translation terms in unlabeled audio, appropriate for extremely small datasets. On CALLHOME, we show small improvements over a recent baseline. We also demonstrate the applicability of our method to language documentation scenarios, by applying it to two endangered language datasets. Speaker differences are still an issue, but our method is more robust to differences in acoustic quality than the previous method.

Figure 1: Average precision and recall curve for our discovered matches in CALLHOME and the Arapaho and Ainu test narratives (varying the output threshold s between 0.90 and 0.94).

Table 4: Statistics on the Arapaho narratives. English type and token counts do not include stopwords.

Table 5: Statistics on the Ainu narratives. English type and token counts do not include stopwords.