Adaptor Grammars for the Linguist: Word Segmentation Experiments for Very Low-Resource Languages

Computational Language Documentation attempts to make the most recent research in speech and language technologies available to linguists working on language preservation and documentation. In this paper, we pursue two main goals along these lines. The first is to improve upon a strong baseline for the unsupervised word discovery task on two very low-resource Bantu languages, taking advantage of the expertise of linguists on these particular languages. The second consists in exploring the Adaptor Grammar framework as a decision and prediction tool for linguists studying a new language. We experiment with 162 grammar configurations for each language and show that using Adaptor Grammars for word segmentation enables us to test hypotheses about a language. Specializing a generic grammar with language-specific knowledge leads to great improvements for the word discovery task, ultimately achieving a gain of about 30% token F-score over the results of a strong baseline.


Introduction
A large number of the world's languages are expected to go extinct during this century, as many as half of them according to Crystal (2002) and Janson (2003). Such predictions have fostered a growing interest in a new field, Computational Language Documentation (CLD), as it is now clear that traditional field linguistics alone will not meet the challenge of preserving and documenting all of these languages.
CLD attempts to make the most recent research in speech and language technologies available to linguists working on language preservation and documentation (e.g. Anastasopoulos and Chiang, 2017; Adams et al., 2017). A remarkable effort in this direction has improved the data collection tools used in the field (Bird et al., 2014), making it possible to collect corpora for several endangered languages. In parallel, the language technology community is investing more effort in designing methodologies tailored to the new challenges posed by the analysis of such linguistic material: the extreme variability of the orthographic representation, the scarcity of annotated data (both written and oral), as well as the modeling of complex tonal systems.
This effort could greatly benefit from a tighter collaboration between the two main research communities involved in this endeavor, which often struggle to cooperate efficiently. Knowledge background differs between linguists and computer scientists; the definition of why a problem is interesting or not may not be the same for the two communities, theoretical and experimental platforms do not intersect much, etc. Consequently, for lack of investing enough energy working on the same problems with the same tools and towards the same goals, we might not achieve the efficiency that is needed, as time is running out for many languages. This view constitutes the underlying motivation of the work reported here.
We pursue two main goals in this spirit. The first one is to improve upon a strong baseline for the unsupervised word discovery task on two low-resource languages, by teaming up with expert linguists. A natural idea to achieve this goal is to engage them in formalizing their linguistic knowledge regarding the languages or language families under study, in the hope that it will compensate for the small amount of available data. In our case, this expertise corresponds to morphological and phonotactic constraints for two Bantu languages displaying very similar structures (see Section 3). For one language, we were also able to elicit a list of prefixes and some additional knowledge regarding the consonantal system. Such expert knowledge can readily be integrated in grammar rules using the framework of Adaptor Grammars (see Section 6). Another interesting property of this framework is its compatibility with two strategies that are usually thought of as being mutually exclusive: rule-based learning, still in wide use inside the linguistics community, and statistical learning, prevalent in natural language processing circles.
Our second goal is to study ways to help linguists explore language data when little expert knowledge is available. Our proposal is to complement the grammatical description activity with task-oriented search procedures that will speed up the exploration of competing hypotheses. The intuition is that better grammars should not only truthfully match the empirical data, but also improve the quality of automatic analysis processes. The word discovery task considered below should thus be viewed as an extrinsic validation procedure, rather than a goal in and of itself. This process might also yield new linguistic insights regarding the language(s) under focus.
To sum up, the main contribution of this paper is a methodology for systematically exploring (a subpart of) the space of possible grammars, refining grammar rules (from the most generic to the most language-specific) at four levels of description (see Section 4). This results in a comparison of 162 alternative accounts of the grammar for two languages. Our results (analyzed in Section 5) show that enriching grammar rules with language-specific knowledge has a consistent positive impact on performance for the segmentation task. They validate our hypotheses that (a) improved grammatical descriptions actually correlate with better automatic analysis; (b) Adaptor Grammars provide a framework around which linguists and computer scientists can effectively collaborate, with tangible results for both communities.

Adaptor Grammars
Formal grammars, and notably Context-Free Grammars (CFGs), are a cornerstone of linguistic description and provide a model for the structural description of linguistic objects. Our grammars capture simple aspects of the syntax and some less trivial aspects of the morphological and phonological structures. As discussed below, both levels of descriptions are useful for word discovery.
A CFG is a 4-tuple G = (N, W, R, S) where N and W are respectively the sets of non-terminal and terminal symbols, R is a finite set of rules of the form A → β, with A ∈ N and β ∈ (N ∪ W)*, and S ∈ N is the start symbol. Our grammars will be used to analyze the structure of complete utterances and the start symbol S will always correspond to the sentence top-level. Assuming that S, Words, and Word belong to N, the top-level rules will typically look like: S → Words; Words → Word Words; Words → Word, the last two rules being abbreviated as Words → Word+.
Probabilistic CFGs (PCFGs) (Johnson, 1998) extend this model by associating each rule with a scalar value $\theta_{A \to \beta}$, such that for each $A \in N$, $\sum_{\beta : A \to \beta \in R} \theta_{A \to \beta} = 1$. Under some technical conditions (Chi, 1999), PCFGs define probability distributions over the set of parse trees, where the probability of a tree is the product of the probabilities of the rules it contains. PCFGs can be learned in a supervised way from treebanks or in an unsupervised manner using, for instance, the EM algorithm (Lari and Young, 1990).
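To make the PCFG definition concrete, the following minimal sketch checks the normalization constraint and computes the probability of a parse tree as the product of its rule probabilities. The toy grammar, its probability values, and the function names are illustrative assumptions, not grammars or code used in the experiments.

```python
from math import isclose, prod

# Toy rule probabilities theta[A][beta]; the values are illustrative assumptions.
theta = {
    "S": {("Words",): 1.0},
    "Words": {("Word", "Words"): 0.6, ("Word",): 0.4},
    "Word": {("a",): 0.5, ("b",): 0.5},
}

def is_normalized(theta):
    """Check that, for each non-terminal A, the rule probabilities sum to 1."""
    return all(isclose(sum(rules.values()), 1.0) for rules in theta.values())

def tree_probability(tree, theta):
    """Probability of a parse tree: the product of the probabilities of its rules.

    A tree is a pair (label, [children]); a terminal is a plain string.
    """
    if isinstance(tree, str):
        return 1.0
    label, children = tree
    # Right-hand side of the rule applied at this node.
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    return theta[label][rhs] * prod(tree_probability(c, theta) for c in children)

# Parse of the two-symbol utterance "a b" as two words.
tree = ("S", [("Words", [("Word", ["a"]), ("Words", [("Word", ["b"])])])])
print(is_normalized(theta), tree_probability(tree, theta))
```

In this toy case the tree uses five rules, so its probability is 1.0 × 0.6 × 0.5 × 0.4 × 0.5.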
PCFGs make unrealistic independence assumptions between the different subparts of a tree, an observation that has yielded many subsequent variations and extensions. Adaptor grammars (AGs) (Johnson et al., 2007) define a powerful mechanism to manipulate PCFG distributions to better match the occurrences of trees and subtrees observed in actual corpora. Informally, an AG is a CFG where non-terminals have the possibility to be adapted: when non-terminal A is adapted, all subtrees rooted in A are "reified", meaning that they are no longer viewed only as decomposable objects, but can also be manipulated and stored as a whole. In our grammars below, adapted non-terminals are underlined, and optional non-terminals appear between brackets. Following Johnson et al. (2007), we only adapt non-recursive non-terminals. AGs define a framework to implement Bayesian nonparametric learning of grammars, and are usually trained in an unsupervised manner using sampling techniques (Markov chain Monte Carlo, MCMC). A typical run will produce, for each sentence, a distribution of possible parses under the grammar, from which we can then retain the most frequent one as the "best" possible analysis.
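The caching behaviour of an adapted non-terminal can be illustrated with the predictive rule of a Pitman-Yor process in its Chinese restaurant formulation, where each table stores one reified subtree. This is only a sketch of the standard predictive probabilities under that formulation; the function name and the example counts are assumptions, and this is not the sampler of the software used in the experiments.

```python
def pitman_yor_predictive(table_counts, discount, concentration):
    """Predictive probabilities for the next expansion of an adapted non-terminal.

    table_counts[k] is the number of times cached subtree k has been reused.
    Returns (reuse_probs, new_prob): the probability of reusing each cached
    subtree as a whole, and the probability of generating a fresh subtree
    from the base PCFG.
    """
    n = sum(table_counts)
    k = len(table_counts)
    denom = n + concentration
    reuse_probs = [(c - discount) / denom for c in table_counts]
    new_prob = (concentration + discount * k) / denom
    return reuse_probs, new_prob

# Two cached subtrees, used 3 times and once (illustrative values).
reuse, new = pitman_yor_predictive([3, 1], discount=0.5, concentration=1.0)
print(reuse, new)
```

Frequently reused subtrees become more likely to be reused again, while the discount parameter reserves some probability mass for new subtrees as the cache grows.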

Word Segmentation using AGs
In this work, we are interested in the word segmentation task: from an unsegmented stream of symbols, the system must output delimited sequences corresponding to actual words in the language. For this, we assume a linguistic grammar G, which parses sequences of letters (or phones) as being organized into Words, which themselves recursively decompose into smaller units such as Morphs, Syllables, etc. To induce word segmentation from parse trees, we will consider that each span covered by the non-terminal symbol Word defines a linguistic word, even though in a fully unsupervised setting, this non-terminal might actually correspond to larger or smaller linguistic units. Likewise, when examining the output of the training process, we are in a position to collect sets of word types (or morph types, syllable types, etc.) and will do so based only on the identity of the root symbol, i.e. without any certainty regarding the linguistic status of the collected sequences.
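The step from parse trees to a segmentation can be sketched as follows: every span covered by a Word node is output as one word. The tree encoding, the function names, and the example strings are assumptions made for illustration only.

```python
def tree_yield(node):
    """Concatenate the terminal symbols covered by a node."""
    if isinstance(node, str):
        return node
    return "".join(tree_yield(child) for child in node[1])

def word_spans(tree, label="Word"):
    """Return the strings covered by the nodes carrying the given label.

    A tree is a pair (label, [children]); a terminal is a plain string.
    The search stops at the first matching node on each path, so nested
    units below a Word (morphs, syllables) are not reported.
    """
    if isinstance(tree, str):
        return []
    if tree[0] == label:
        return [tree_yield(tree)]
    words = []
    for child in tree[1]:
        words.extend(word_spans(child, label))
    return words

# Toy parse of the unsegmented sequence "wasila" (made-up string).
parse = ("S", [("Words", [
    ("Word", [("Morph", ["w", "a"]), ("Morph", ["s", "i"])]),
    ("Words", [("Word", [("Morph", ["l", "a"])])]),
])])
print(word_spans(parse))            # words
print(word_spans(parse, "Morph"))   # morph types can be collected the same way
```

Calling the same function with another root symbol collects morphs or syllables, which mirrors the remark above that such sets are gathered purely on the basis of the root symbol.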
Linguistic Material

Mboshi and Myene
We experiment with two Northwestern Bantu languages: Mboshi (Bantu C25), a language spoken in Congo-Brazzaville, and Myene (B10, Gabon), a cluster of six mutually intelligible varieties (Adyumba, Enenga, Galwa, Mpongwe, Nkomi and Orungu) spoken in the coastal areas and around the town of Lambarene in Gabon. Unlike southern Bantu relatives such as Swahili, Sotho or Zulu, Mboshi and Myene are scarcely studied, protected, and resourced. We briefly describe the main aspects related to phonetics, phonology, morphology, and tonology of these languages.
While both languages can be considered as rarely written, linguists have nonetheless defined a non-standard graphemic form for them, considered to be close to the language phonology. Affricates and prenasalized plosives are coded using multiple symbols (e.g. two symbols for dz, three for mbv). For Mboshi, short and long vowels are coded respectively as V and VV. In Myene, the transcription of the corpus involves not only the phoneme set, but also the main variants (ñ, tS, dz) and some marginal sounds found in loanwords.
Both languages display a complex set of phonological rules. The deletion of a vowel before another vowel in particular, common in Bantu languages, occurs at 40% of word junctions in Mboshi (Rialland et al., 2015). This tends to obscure word segmentation and introduces an additional challenge for automatic processing.
Morphology. Words are composed of roots and affixes, and almost always include at least one prefix, while the presence of several prefixes and one suffix is also very common. The suffix structure mostly consists of a single vowel V (e.g. -a or -i) whereas the prefix structure may be either CV or V (or CVV in Mboshi). The most common syllable structures are V and CV in both languages. CVC also occurs in Myene, and CVV in Mboshi. The noun class prefix system is another feature typical of Bantu languages. For both languages, the structure of the verbs, also common in Bantu languages, is as follows: Subject Marker - Tense/Mood Marker - Root-derivative Extensions - Final Vowel. A verb can be very short or quite long, depending on the markers involved.
Tonology. Prosodic systems for both Mboshi and Myene involve tones, but the transcribed data used for this work do not encode tone markers. Experiments to assess the usability of tonal information for word segmentation were conducted by Godard et al. (2018b).

Corpora for Mboshi and Myene
Corpora for Mboshi and Myene were collected following a real language documentation scenario, using a mobile app dedicated to fieldwork language documentation. These corpora contain manual transcriptions in a non-standard graphemic form close to the languages' phonology. The correct word segmentations for these transcripts were also annotated by linguists. Basic statistics are in Table 1.

Structuring Grammar Sets
Our starting point is the set of grammars used in previous work (e.g. Eskander et al., 2016), which we progressively specialize through an iterative refinement process involving both field linguists and computer scientists. As we wish to evaluate specific linguistic hypotheses, the initial space of interesting grammars has been generalized in a modular, systematic, and hierarchical way as follows. We distinguish four sections in each grammar: sentence, word, syllable, character. For each section, we test multiple hypotheses, gradually incorporating more linguistic structure. Every hypothesis inside a given section can be combined with every hypothesis of any other section, 7 thereby allowing us to explore a large number of grammars and to analyze the contribution of each particular hypothesis.

The Full Grammar Landscape
All the grammar sections (sentence, word, syllable, character) experimented with in this paper are detailed in Figure 1. We describe below the way each section was designed.
7 Note that if a non-terminal is absent from a hypothesis (e.g. Syllable in a word level hypothesis), the corresponding non-terminal in the subsequent hypotheses (e.g. at the syllable level) will be ignored.
• sentence level: we model 3 different hierarchies of words. We first introduce the flat variety, with two rules generating right-branching parse trees. colloc adds a single level of word collocation, aimed at capturing recurrent local word associations (such as frequent bigrams); colloc3 displays a deeper hierarchical structure with three levels of collocations. Exploring more realistic syntactic structures is left for future work.
• word level: here we propose 6 competing hypotheses. flat is similar to the flat sentence variety, but at the word level instead of the sentence level. generic corresponds to a more structured version of flat, as the specification of a sequence of 5 adapted morphemes allows, in principle, the Adaptor Grammar to learn some morphotactics. bantu defines a generic morphology for Bantu languages. basaa implements the morphology of a well-studied Bantu language, Basaa (A43; Hamlaoui and Makasso, 2015). mboshi/myene corresponds to a somewhat crude morphology of Mboshi, also applicable to Myene. Lastly, mboshi/myene_NV refines mboshi/myene with a specification of the morphology of nouns and verbs. Additionally, for basaa, mboshi/myene and mboshi/myene_NV, which introduce a notion of prefix, we also test a variant (called respectively basaa+, mboshi/myene+ and mboshi/myene_NV+) containing an explicit list of prefixes in Mboshi.
• syllable level: we contrast 3 hypotheses: flat is similar to the flat sentence and word varieties, but at the syllable level, defining the syllable as a mere sequence of characters. generic/basaa is a generic set of rules modeling phonotactics applicable to a wide scope of languages (including Basaa, mentioned in the preceding level). bantu/mboshi/myene displays a set of rules more specific to Mboshi and Myene.
• character level: rules in the chars set simply rewrite the characters (terminals) ob-
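The size of the resulting grammar space can be checked by enumerating the combinations of the four sections. The sketch below assumes that the A/B/C/D labels used in the results discussion index the hypotheses of each section in the order listed above (for instance B4, B5 and B6 being the three word-level hypotheses with a "+" variant, and D1/D1+ the two character-level variants); that mapping is an assumption made for illustration.

```python
from itertools import product

# Hypothesis labels per section (assumed to follow the order of the list above).
sentence = ["A1", "A2", "A3"]                      # flat, colloc, colloc3
word = ["B1", "B2", "B3",                          # no prefix list available
        "B4", "B4+", "B5", "B5+", "B6", "B6+"]     # with and without prefix list
syllable = ["C1", "C2", "C3"]
character = ["D1", "D1+"]

# Every hypothesis of a section combines with every hypothesis of the others.
configurations = ["-".join(c) for c in product(sentence, word, syllable, character)]
print(len(configurations))      # 3 * 9 * 3 * 2
print(configurations[:3])
```

The count is 3 × 9 × 3 × 2 = 162 configurations per language, with identifiers of the form A3-B6-C3-D1+.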
For each language, we evaluate our 162 grammar configurations using Mark Johnson's code, 10 collecting parses after 2,000 sampling steps. 11 We adapt all non-recursive non-terminals and use a Dirichlet prior to estimate the rule probabilities. We place a uniform Beta prior on the discount parameter of the Pitman-Yor process, and a vague Gamma prior on the concentration parameter. Figure 3 presents token metrics (WP, WR, WF) and type metrics (LP, LR, LF), as well as sentence exact-match (X) for both corpora on all grammars.
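For readers unfamiliar with these metrics, the sketch below follows the definitions commonly used in the word segmentation literature: token scores compare word spans sentence by sentence, type scores compare the induced and reference lexicons, and exact match counts sentences segmented entirely correctly. These definitions and all names in the code are assumptions; the evaluation script actually used may differ in its details.

```python
def spans(words):
    """Character spans (start, end) of each word in a segmented sentence."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out

def prf(n_correct, n_pred, n_gold):
    """Precision, recall and F-score from raw counts."""
    p = n_correct / n_pred if n_pred else 0.0
    r = n_correct / n_gold if n_gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def evaluate(pred, gold):
    """pred and gold are lists of sentences, each a list of word strings."""
    correct = n_pred = n_gold = exact = 0
    for p, g in zip(pred, gold):
        ps, gs = spans(p), spans(g)
        correct += len(ps & gs)   # a token is correct if both boundaries match
        n_pred += len(ps)
        n_gold += len(gs)
        exact += (p == g)
    pred_types = {w for s in pred for w in s}
    gold_types = {w for s in gold for w in s}
    token = prf(correct, n_pred, n_gold)                      # WP, WR, WF
    types = prf(len(pred_types & gold_types),
                len(pred_types), len(gold_types))             # LP, LR, LF
    return token, types, exact / len(gold)                    # X

gold = [["ab", "c"], ["d"]]
pred = [["a", "b", "c"], ["d"]]
token, types, x = evaluate(pred, gold)
print(token, types, x)
```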

Word Segmentation Results
Impact of sentence level variants. We can see in Figure 3 that A2 and A3 hypotheses globally yield better results than A1 in both languages. For Mboshi, the benefit of A3 vs. A2 appears especially on token metrics (WP, WR, WF), but this contrast is less clear on Myene. For both languages, however, our results confirm that modeling collocation-like word groups at the sentence level is important. These word dependencies seem indeed related to a universal linguistic property.
9 The exact-match metric includes single-word utterances.
10 http://web.science.mq.edu.au/~mjohnson/Software.htm
11 The large number of experiments we are dealing with did not allow us to average over several runs. Stable results were obtained on a subset of grammars. Two particular configurations in Mboshi (A3-B6-C3-D1+ and A1-B6-C1-D1) did not reach 2,000 iterations within the maximum wall clock time allowed by the cluster used for these experiments (2 weeks), and are left out of the discussion.
Impact of word level variants. If we now focus solely on the A3 hypothesis for Myene in Figure 3, we observe a general trend upwards for all metrics. The benefit of gradually using more language-specific grammars, from B1 to B6+, is clear. While this trend is also observed for Mboshi, the less specific B3 hypothesis yields the strongest results on token metrics (WP, WR, WF). Precision on types (LP) with B3 is also the strongest, but B6+ achieves better performance on type recall and F-measure (LR and LF). The contrast between B1 and B2 for all metrics on both languages (keeping a focus on A3, but this can also be seen for A1 and A2) highlights the benefit of modeling some morphotactics inside the word-level hypotheses, which seems to correspond to another universal linguistic property (the dependency between morphemes inside a word).

Impact of syllable level variants. It is difficult to see a clear trend for the impact of syllable-level variants in Figure 3. Importantly, the syllable level will only be effective when combined with word level variants B4, B5 and B6 (and their "+" versions), which model the concept of syllable: when combined with B1, B2 or B3, each C level hypothesis will default to its "Chars → Char+" rule. Figure 4 illustrates the impact of C1, C2, and C3 by averaging type and token F-measures (LF and WF) over all grammar sections with a syllable non-terminal (B4, B4+, B5, B5+, B6, and B6+). The benefit of C2/C3 vs. C1 appears more clearly, especially on type F-measures and on Myene. Nevertheless, the impact of the syllable level, and the capacity to incorporate phonotactics in our models, seems of less significance for word segmentation than choices made at the word and sentence levels.

Impact of character level variants. In Figure 3, it is also hard to see if there is any benefit in using D1+ over D1, i.e. adding digraphs or trigraphs to the consonant inventory. Averaging over all hypotheses at the A, B, and C levels does not exhibit any clearer impact. It is likely that refined models at the syllable level (C) compensate for a less accurate consonant inventory through the adaptation of their non-terminals, and do learn some phonotactics. This would explain the weak contribution of D1+. To test this hypothesis, we set the sentence level to A3 (the best compromise for Mboshi and Myene) and the word level to B1, B2, or B3 (levels without a Syllable non-terminal, which cancels the effect of the syllable level C). The token and type F-measures averaged over the considered hypotheses are shown in Figure 5. We do observe a benefit in using the D1+ character variant in Mboshi, but not in Myene. This is not surprising, as the digraph and trigraph rules added by the D1+ variant are specific to Mboshi and do not cover the inventory for Myene.
Stronger results in Myene. Segmentation performance is globally superior in Myene. This can probably be explained by corpus statistics (see Table 1), as the average number of words per sentence is 3.94 in Myene, and 5.96 in Mboshi. Since sentence boundaries are also word boundaries, the proportion of already known word boundaries is higher in Myene, which makes word segmentation a harder task in Mboshi. Figure 3 also reveals an interesting contrast: token results are higher than type results in Myene, while the converse is true in Mboshi. The token/type ratio (5.75 tokens for one type in Mboshi, and 4.30 in Myene) indicates a higher lexical diversity in Myene, which might explain weaker results on types. Strong results on types for Mboshi, on the other hand, show the capacity of AGs to generalize well on low-frequency events, a property of particular interest in the low-resource scenario.
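The argument about already known boundaries can be made concrete with a short calculation. The sketch assumes one particular counting convention, namely one boundary after each word, with the sentence-final boundary given for free; under that convention the known fraction at corpus level is simply the inverse of the average sentence length. The function name is hypothetical.

```python
def known_boundary_fraction(avg_words_per_sentence):
    """Share of word-final boundaries that coincide with a sentence end.

    Assumes one boundary after each word; at corpus level this share equals
    (number of sentences) / (number of word tokens).
    """
    return 1.0 / avg_words_per_sentence

# Average sentence lengths quoted in the text above.
for language, avg_len in [("Myene", 3.94), ("Mboshi", 5.96)]:
    print(f"{language}: {known_boundary_fraction(avg_len):.1%} of boundaries known")
```

Under this convention roughly a quarter of the boundaries are given in Myene against about a sixth in Mboshi, which is consistent with the explanation above.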

How Can This Help a Linguist?
Our second goal is to understand more precisely how such experiments can be useful for linguists, beyond the benefit of having access to better automatic word segmentation tools for their data.
Phonological status of complex consonants. In the analysis of the results (Section 5.1 above) we showed the benefit of integrating digraphs or trigraphs in the consonant inventory for Mboshi. This result is of special interest for linguists, since it is in line with the most recent phonological analyses of Mboshi (Embanga Aborobongui, 2013; Kouarata, 2014; Amboulou, 1998), which agree in recognizing complex consonants (represented by digraphs or trigraphs) in the phonological inventory of this language. The analysis of complex consonants, in particular prenasalized consonants, has generated many debates in Bantu linguistics (Odden, 2015; Herbert, 1986; Downing, 2005). The present experiments provide more substance to support the integration of complex consonants in the phonological inventory of Mboshi.
Learning prefixes without supervision. Since parses are produced to segment sentences into words, it is possible to extract the most frequent prefixes or suffixes (for B variants introducing such a concept). The precision on the 20 most frequently found prefixes for grammars without prefix supervision (B3, B4, B5 and B6) 14 reaches 58.21% in Mboshi, and 61.21% in Myene. The capacity of AGs to learn true prefixes without supervision could thus help linguists in the process of documenting a new language. On the supervised variants (B4+, B5+, and B6+), the precision achieved in Mboshi is 61.11%, and 63.07% in Myene: the benefit of the supervision is limited; token measures for Mboshi with these variants (Figure 3) nevertheless indicate a benefit for word segmentation.
14 We include the B3 variant, interpreting its non-terminal Prefixes as a prefix.

Related Work
AGs have been used to infer the structure of unsegmented sequences of symbols, offering a plausible modeling of language acquisition (Johnson, 2008b); they have also been used for the unsupervised discovery of word structure, applied to the Sesotho language by Johnson (2008a). One notable outcome of this latter study was to demonstrate the effectiveness of having an explicit hierarchical model of word-internal structure, an observation that was one of our primary motivations for using AGs in our language documentation work. In this series of studies, AGs are shown to generalize models of unsupervised word segmentation such as the Bayesian nonparametric model of Goldwater (2006), delivering hierarchical (rather than flat) decompositions for words or sentences.
While AGs are essentially viewed as an unsupervised grammatical inference tool, several authors have also tried to better inform grammar inference with external knowledge sources. This is the case of Sirts and Goldwater (2013), who study a semi-supervised learning scheme combining annotated data (parse trees) with raw sentences. Linguistic knowledge has also been used to better model function words in a language acquisition setting: explicitly representing the occurrence of these short (typically monosyllabic) tokens in front of content-bearing words was shown to improve the resulting word segmentations. The work of Eskander et al. (2016) considers the use of additional dictionaries, storing partial lists of prefixes or suffixes collected either on the Internet, or discovered during a first round of training. We study similar complementary information, which is collected in close collaboration with linguistic experts.
Various other extensions or applications of AGs are worth mentioning, such as O'Donnell et al. (2009), who generalize AGs so as to adapt fragments of subtrees (rather than entire subtrees). Botha and Blunsom (2013) consider the adaptation of grammars from a more general class than context-free grammars (mildly context-sensitive grammars), in order to model discontinuous fragments in non-concatenative morphology. Finally, Börschinger and Johnson (2014) propose to model the role of stress cues in language learning.

Conclusion
This paper had two main goals: (1) improve upon a strong baseline for the unsupervised discovery of words in two very low-resource Bantu languages; (2) explore the Adaptor Grammar framework as an analysis and prediction tool for linguists studying a new language.
Systematic experiments with 162 grammar configurations for each language have shown that using AGs for word segmentation is a way to test linguistic hypotheses during a language documentation process. Conversely, we have also shown that specializing a generic grammar with language-specific knowledge greatly improves word segmentation performance. In addition, our paper reports word segmentation results that are substantially higher than those of a Bayesian baseline. These results invite us to further this collaboration, and to analyze more thoroughly the usability of output parses in speeding up the documentation process.