A Corpus for Large-Scale Phonetic Typology

A major hurdle in data-driven research on typology is having sufficient data in many languages to draw meaningful conclusions. We present VoxClamantis v1.0, the first large-scale corpus for phonetic typology, with aligned segments and estimated phoneme-level labels in 690 readings spanning 635 languages, along with acoustic-phonetic measures of vowels and sibilants. Access to such data can greatly facilitate investigation of phonetic typology at a large scale and across many languages. However, it is non-trivial and computationally intensive to obtain such alignments for hundreds of languages, many of which have few to no resources presently available. We describe the methodology to create our corpus, discuss caveats with current methods and their impact on the utility of this data, and illustrate possible research directions through a series of case studies on the 48 highest-quality readings. Our corpus and scripts are publicly available for non-commercial use at https://voxclamantisproject.github.io.


Introduction
Understanding the range and limits of cross-linguistic variation is fundamental to the scientific study of language. In speech and particularly phonetic typology, this involves exploring potentially universal tendencies that shape sound systems and govern phonetic structure. Such investigation requires access to large amounts of cross-linguistic data. Previous cross-linguistic phonetic studies have been limited to a small number of languages with available data (Disner, 1983; Cho and Ladefoged, 1999), or have relied on previously reported measures from many studies (Whalen and Levitt, 1995; Becker-Kristal, 2010; Gordon and Roettger, 2017; Chodroff et al., 2019). Existing multilingual speech corpora have similar restrictions, with data too limited for many tasks (Engstrand and Cunningham-Andersson, 1988; Ladefoged and Maddieson, 2007) or approximately 20 to 30 recorded languages (Ardila et al., 2020; Harper, 2011; Schultz, 2002).
The recently developed CMU Wilderness corpus (Black, 2019) constitutes an exception to this rule with over 600 languages. This makes it the largest and most typologically diverse speech corpus to date. In addition to its coverage, the CMU Wilderness corpus is unique in two additional aspects: cleanly recorded, read speech exists for all languages in the corpus, and the same content (modulo translation) exists across all languages.
However, this massively multilingual speech corpus is challenging to work with directly. Copyright, computational restrictions, and sheer size limit its accessibility. Due to copyright restrictions, the audio cannot be directly downloaded with the sentence and phoneme alignments. A researcher would need to download the original MP3 audio and text through links to bible.is, then segment these with the speech-to-text sentence alignments distributed in Black (2019). For phonetic research, subsequently identifying examples of specific phonetic segments in the audio is also a near-essential step for extracting relevant acoustic-phonetic measurements. Carrying out this derivative step has allowed us to release a stable-access collection of token-level acoustic-phonetic measures to enable further research.
Obtaining such measurements requires several processing steps: estimating pronunciations, aligning them to the text, evaluating alignment quality, and finally, extracting phonetic measures. This work is further complicated by the fact that, for a sizable number of these languages, no linguistic resources currently exist (e.g., language-specific pronunciation lexicons). We adapt speech processing methods based on Black (2019) to accomplish these tasks, though not without noise: in §3.4, we identify three significant caveats when attempting to use our extended corpus for large-scale phonetic studies.
We release a comprehensive set of standoff markup of over 400 million labeled segments of continuous speech. For each segment, we provide an estimated phoneme-level label from the X-SAMPA alphabet, the preceding and following labels, and the start position and duration in the audio. Vowels are supplemented with formant measurements, and sibilants with standard measures of spectral shape.
We present a series of targeted case studies illustrating the utility of our corpus for large-scale phonetic typology. These studies are motivated by potentially universal principles posited to govern phonetic variation: phonetic dispersion and phonetic uniformity. Our studies both replicate known results in the phonetics literature and also present novel findings. Importantly, these studies investigate current methodology as well as questions of interest to phonetic typology at a large scale.

Original Speech
The CMU Wilderness corpus (Black, 2019) consists of recorded readings of the New Testament of the Bible in many languages and dialects. Following the New Testament structure, these data are broken into 27 books, each with a variable number of chapters between 1 and 25. Bible chapters contain standardized verses (approximately sentence-level segments); however, the speech is originally split only by chapter. Each chapter has an average of 13 minutes of speech for a total of ≈20 hours of speech and text per language. These recordings are clean, read speech with a sampling rate of 16 kHz. In most languages, they are non-dramatic readings with a single speaker; in some, they are dramatic multi-speaker readings with additive music.
The release from Black (2019) includes several resources for processing the corpus: scripts to download the original source data from bible.is, 'lexicons' created using grapheme-to-phoneme (G2P) conversion, and scripts to apply their generated sentence alignments, which facilitates downstream language processing tasks, including phoneme alignment.

The VoxClamantis V1.0 Corpus
Our VoxClamantis V1.0 corpus is derived from 690 audio readings of the New Testament of the Bible in 635 languages. We mark estimated speech segments labeled with phonemic labels, and phonetic measures for the tokens that are vowels or sibilants. The extraction process is diagrammed in Figure 2. In the sections below, we detail our procedures for extracting labeled audio segments and their phonetic measures, in both high- and low-resource languages. We then outline important caveats to keep in mind when using this corpus.

Extracting Phoneme Alignments
We use a multi-pronged forced alignment strategy to balance broad language coverage (§3.1.1) with utilization of existing high-quality resources (§3.1.2). We assess the quality of our approaches in §3.1.3. We release the stand-off markup for our final alignments as both text files and Praat TextGrids (Boersma and Weenink, 2019). 6
6 Corresponding audio will need to be downloaded from source and split by utterance using scripts from Black (2019).
Using scripts and estimated boundaries from Black (2019), we first download and convert the audio MP3s to waveforms, and cut the audio and text into 'sentences' (hereafter called 'utterances' as they are not necessarily sentences). This step creates shorter-length speech samples to facilitate forced alignment; utterance boundaries do not change through our processing.
To extract labeled segments, we first require pronunciations for each utterance. A pronunciation is predicted from the text alone using some grapheme-to-phoneme (G2P) method. Each word's predicted pronunciation is a sequence of categorical labels, which are 'phoneme-level' in the sense that they are usually intended to distinguish the words of the language. We then align this predicted sequence of 'phonemes' to the corresponding audio.

All Languages
Most of our languages have neither existing pronunciation lexicons nor G2P resources. To provide coverage for all languages, we generate pronunciations using the simple 'universal' G2P system Unitran (Qian et al., 2010, as extended by Black, 2019), which deterministically expands each grapheme to a fixed sequence of phones in the Extended Speech Assessment Methods Phonetic Alphabet (X-SAMPA) (Wells, 1995/2000). This naive process is error-prone for languages with opaque orthographies, as we show in §3.1.3 below and discuss further in §3.4 (Caveat B). Even so, it provides a starting point for exploring low-resource languages: after some manual inspection, a linguist may be able to correct the labels in a given language by a combination of manual and automatic methods.
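To illustrate why context-free expansion is error-prone, the following minimal sketch expands each grapheme independently of its neighbors. The table `TOY_TABLE` and the function `naive_g2p` are hypothetical stand-ins for illustration only; they are not the Unitran table or code, and only the <a> to /A/ and <s> to /s/ entries reflect mappings mentioned in this paper.

```python
# Toy grapheme-to-phone table (hypothetical; NOT the real Unitran table).
TOY_TABLE = {
    "a": ["A"],
    "s": ["s"],
    "h": ["h"],
    "i": ["i"],
    "x": ["k", "s"],  # one grapheme may expand to a fixed sequence of phones
}

def naive_g2p(word, table=TOY_TABLE):
    """Expand each grapheme independently, ignoring context and language."""
    phones = []
    for grapheme in word.lower():
        # Graphemes missing from the toy table are silently skipped here.
        phones.extend(table.get(grapheme, []))
    return phones

# The digraph <sh> is expanded as two separate phones.
print(naive_g2p("shia"))
```

Because each grapheme is looked up in isolation, a digraph such as <sh> can never yield a single phone under this scheme, which is the kind of systematic error discussed in §3.4 (Caveat B).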
For each reading, to align the pronunciation strings to the audio, we fit a generative acoustic model designed for this purpose: specifically, eHMM (Prahallad et al., 2006) as implemented in Festvox (Anumanchipalli et al., 2011) to run full Baum-Welch from a flat start for 15 to 30 iterations until the mean mel cepstral distortion score (see §3.1.3) converges. Baum-Welch does not change the predicted phoneme labels, but obtains a language-specific, reading-specific, contextual (triphone) acoustic model for each phoneme type in the language. We then use Viterbi alignment to identify an audio segment for each phoneme token.
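The final Viterbi step can be sketched as a dynamic program over audio frames and phoneme positions. This is a simplified illustration under stated assumptions, not the eHMM/Festvox implementation: it assumes a precomputed matrix `loglik[t][j]` of per-frame log-likelihoods (a hypothetical input), a single state per phoneme, and no transition probabilities.

```python
import math

def viterbi_align(loglik):
    """Best monotonic left-to-right assignment of frames to phonemes.

    loglik[t][j] is the log-likelihood of frame t under the j-th phoneme
    of the predicted pronunciation. Returns one phoneme index per frame.
    """
    T, J = len(loglik), len(loglik[0])
    if T < J:
        raise ValueError("need at least one frame per phoneme")
    neg_inf = -math.inf
    score = [[neg_inf] * J for _ in range(T)]
    back = [[0] * J for _ in range(T)]
    score[0][0] = loglik[0][0]  # the path must start in the first phoneme
    for t in range(1, T):
        for j in range(J):
            stay = score[t - 1][j]                          # remain in phoneme j
            move = score[t - 1][j - 1] if j > 0 else neg_inf  # advance from j-1
            if move > stay:
                score[t][j], back[t][j] = move + loglik[t][j], j - 1
            else:
                score[t][j], back[t][j] = stay + loglik[t][j], j
    path = [J - 1]  # the path must end in the last phoneme
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

def path_to_segments(path):
    """Collapse a frame-level path into (phoneme_index, start_frame, n_frames)."""
    segments = []
    for t, j in enumerate(path):
        if segments and segments[-1][0] == j:
            segments[-1][2] += 1
        else:
            segments.append([j, t, 1])
    return [tuple(s) for s in segments]

toy = [[0, -5], [0, -5], [-5, 0], [-5, 0], [-5, 0]]  # 5 frames, 2 phonemes
print(path_to_segments(viterbi_align(toy)))
```

Multiplying the frame indices by the analysis frame shift would give start positions and durations in seconds. The alignment changes only the boundaries, never the predicted labels.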

High-Resource Languages
A subset of the languages in our corpus are supported by existing pronunciation resources. Two such resources are Epitran (Mortensen et al., 2018), a G2P tool based on language-specific rules, available in both IPA and X-SAMPA, and WikiPron, a collection of crowd-sourced pronunciations scraped from Wiktionary. These are mapped from IPA to X-SAMPA for label consistency across our corpus. Epitran covers 29 of our languages (39 readings), while WikiPron's 'phonemic' annotations provide partial coverage of 13 additional languages (18 readings). We use Epitran for languages with regular orthographies where it provides high-quality support, and WikiPron for other languages covered by WikiPron annotations. While Unitran and Epitran provide a single pronunciation for a word from the orthography, WikiPron may include multiple pronunciations. In such cases, Viterbi alignment (see below) chooses the pronunciation of each token that best fits the audio.
For most languages covered by WikiPron, most of our corpus words are out-of-vocabulary, as they do not yet have user-submitted pronunciations on Wiktionary. We train G2P models on WikiPron annotations to provide pronunciations for these words. Specifically, we use the WFST-based tool Phonetisaurus (Novak et al., 2016). Model hyperparameters are tuned on 3 WikiPron languages from SIGMORPHON 2020 (Gorman et al., 2020) (see Appendix C for details). In general, for languages that are not easily supported by Epitran-style G2P rules, training a G2P model on sufficiently many high-quality annotations may be more accurate.
We align the speech with the high-quality labels using a multilingual ASR model (see Wiesner et al., 2019). The model is trained in Kaldi (Povey et al., 2011) on 300 hours of data from the IARPA BABEL corpora (21 languages), a subset of Wall Street Journal (English), the Hub4 Spanish Broadcast news (Spanish), and a subset of the Voxforge corpus (Russian and French). These languages use a shared X-SAMPA phoneme label set which has high coverage of the labels of our corpus.
Our use of a pretrained multilingual model here contrasts with §3.1.1, where we had to train reading-specific acoustic models to deal with the fact that the same Unitran phoneme label may refer to quite different phonemes in different languages (see §3.4). We did not fine-tune our multilingual model to each language, as the cross-lingual ASR performance in previous work (Wiesner et al., 2019) suggests that this model is sufficient for producing phoneme-level alignments.

Quality Measures
Automatically generated phoneme-level labels and alignments inherently have some amount of noise, and this is particularly true for low-resource languages. The noise level is difficult to assess without gold-labeled corpora for either modeling or assessment. However, for the high-resource languages, we can evaluate Unitran against Epitran and WikiPron, pretending that the latter are ground truth. For example, Table 1 shows Unitran's phoneme error rates relative to Epitran. Appendix B gives several more detailed analyses with examples of individual phonemes.
Unitran pronunciations may have acceptable phoneme error rates for languages with transparent orthographies and one-to-one grapheme-to-phoneme mappings. Alas, without these conditions they prove to be highly inaccurate.
That said, evaluating Unitran labels against Epitran or WikiPron may be unfair to Unitran, since some discrepancies are arguably not errors but mere differences in annotation granularity. For example, the 'phonemic' annotations in WikiPron are sometimes surprisingly fine-grained: WikiPron frequently uses /t̪/ in Cebuano where Unitran only uses /t/, though these refer to the same phoneme. These tokens are scored as incorrect. Moreover, there can be simple systematic errors: Unitran always maps grapheme <a> to label /A/, but in Tagalog, all such tokens should be /a/. Such errors can often be fixed by remapping the Unitran labels, which in these cases would reduce PER from 30.1 to 6.8 (Cebuano) and from 34.4 to 7.8 (Tagalog). Such rules are not always this straightforward and should be created on a language-specific basis; we encourage rules created for languages outside of current Epitran support to be contributed back to the Epitran project.
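The effect of label remapping on PER can be sketched as follows. We assume the standard definition of PER as total edit distance divided by total reference length; the function names and the toy sequences are hypothetical and do not reproduce the figures reported above.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two label sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]

def per(refs, hyps, remap=None):
    """Phoneme error rate (%) of hyps against refs, after optional remapping."""
    remap = remap or {}
    errors = total = 0
    for ref, hyp in zip(refs, hyps):
        hyp = [remap.get(p, p) for p in hyp]  # e.g. relabel /A/ as /a/
        errors += edit_distance(ref, hyp)
        total += len(ref)
    return 100.0 * errors / total

refs = [["b", "a", "t", "a"]]   # toy 'ground truth' pronunciation
hyps = [["b", "A", "t", "A"]]   # toy Unitran-style pronunciation
print(per(refs, hyps), per(refs, hyps, remap={"A": "a"}))
```

A single language-specific remapping rule removes every error in this toy case; real languages typically need several rules, and some discrepancies cannot be repaired by relabeling alone.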
For those languages where we train a G2P system on WikiPron, we compute the PER of the G2P system on held-out WikiPron entries treated as ground truth. The results (Appendix C) range from excellent to mediocre.
We care less about the pronunciations themselves than about the segments that we extract by aligning these pronunciations to the audio. For high-resource languages, we can again compare the segments extracted by Unitran to the higher-quality ones extracted with better pronunciations. For each Unitran token, we evaluate its label and temporal boundaries against the high-quality token that is closest in the audio, as measured by the temporal distance between their midpoints (Appendix B).
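The token matching described above can be sketched as a nearest-midpoint lookup. The segment representation (dicts with `label`, `start`, and `duration` in seconds) and the function names are assumptions made for illustration; they are not the format of the released files.

```python
from bisect import bisect_left

def midpoint(seg):
    """Temporal midpoint of a segment, in seconds."""
    return seg["start"] + seg["duration"] / 2.0

def closest_reference(token, refs_sorted, ref_mids):
    """Reference segment whose midpoint is nearest to the token's midpoint.

    refs_sorted must be non-empty and sorted by midpoint, and ref_mids must
    hold the corresponding midpoints in the same order.
    """
    m = midpoint(token)
    i = bisect_left(ref_mids, m)
    # Only the neighbors on either side of the insertion point can be closest.
    candidates = refs_sorted[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda r: abs(midpoint(r) - m))

refs = sorted([
    {"label": "s", "start": 0.00, "duration": 0.10},
    {"label": "a", "start": 0.10, "duration": 0.20},
    {"label": "t", "start": 0.30, "duration": 0.10},
], key=midpoint)
ref_mids = [midpoint(r) for r in refs]

token = {"label": "A", "start": 0.12, "duration": 0.16}  # toy Unitran token
match = closest_reference(token, refs, ref_mids)
print(match["label"], abs(match["start"] - token["start"]))
```

Once a token is paired with its reference, the label can be scored for agreement and the boundary offsets can be summarized per phoneme or per reading.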
Finally, the segmentation of speech and text into corresponding utterances is not perfect. We use the utterance alignments generated by Black (2019), in which the text and audio versions of a putative utterance may have only partial overlap. Indeed, Black (2019) sometimes failed to align the Unitran pronunciation to the audio at all, and discarded these utterances. For each remaining utterance, he assessed the match quality using Mel Cepstral Distortion (MCD), which is commonly used to evaluate synthesized spoken utterances (Kominek et al., 2008), between the original audio and a resynthesized version of the audio based on the aligned pronunciation. Each segment's audio was resynthesized given the segment's phoneme label and the preceding and following phonemes, in a way that preserves its duration, using CLUSTERGEN (Black, 2006) with the same reading-specific eHMM model that we used for alignment. We distribute Black's per-utterance MCD scores with our corpus, and show the average score for each language in Appendix E. In some readings, the MCD scores are consistently poor.
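For orientation, a commonly used definition of MCD between two time-aligned mel-cepstral frames is (10 / ln 10) · sqrt(2 · Σ_d (c_d − c'_d)²), averaged over frames. The sketch below implements that definition; whether Black (2019) uses exactly this variant (for example, which coefficients are included) is an assumption, and the function names are hypothetical.

```python
import math

def mcd_frame(c_ref, c_syn):
    """MCD in dB between two mel-cepstral vectors of equal length.

    The caller is assumed to have dropped the 0th (energy) coefficient.
    """
    sq_diff = sum((a - b) ** 2 for a, b in zip(c_ref, c_syn))
    return (10.0 / math.log(10)) * math.sqrt(2.0 * sq_diff)

def mcd_utterance(ref_frames, syn_frames):
    """Mean per-frame MCD over time-aligned frame pairs."""
    pairs = list(zip(ref_frames, syn_frames))
    return sum(mcd_frame(r, s) for r, s in pairs) / len(pairs)

original = [[1.0, 0.5], [0.8, 0.4]]       # toy cepstra, 2 frames x 2 coefficients
resynthesized = [[1.1, 0.5], [0.8, 0.4]]
print(round(mcd_utterance(original, resynthesized), 3))
```

Lower values indicate that the resynthesized audio is closer to the original, which is why MCD can serve as a proxy for how well the aligned pronunciation matches the speech.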

Phonetic measures
Using the phoneme-level alignments described in §3.1, we automatically extract several standard acoustic-phonetic measures of vowels and sibilant fricatives that correlate with aspects of their articulation and abstract representation.

Vowel measures
Standard phonetic measurements of vowels include the formant frequencies and duration information.
Formants are concentrations of acoustic energy at frequencies reflecting resonance points in the vocal tract during vowel production (Ladefoged and Johnson, 2014). The lowest two formants, F1 and F2, are considered diagnostic of vowel category identity and approximate tongue body height (F1) and backness (F2) during vowel production (Figure 3). F3 correlates with finer-grained aspects of vowel production such as rhoticity (/r/-coloring), lip rounding, and nasality (House and Stevens, 1956; Lindblom and Sundberg, 1971; Ladefoged et al., 1978), and F4 with high front vowel distinctions and speaker voice quality (Eek and Meister, 1994). Vowel duration can also signal vowel quality, and denotes lexical differences in many languages.
We extracted formant and duration information from each vowel using Praat (Boersma and Weenink, 2019) [...] step of 6.25 ms, a maximum of five formants permitted, and a formant ceiling of 5000 Hz, which is the recommended value for a male vocal tract (Boersma and Weenink, 2019). Note that the speakers in this corpus are predominantly male.
Figure 3: Vowel Chart

Sibilant measures
Standard phonetic measurements of sibilant fricatives such as /s/, /z/, /S/, and /Z/ include measures of spectral shape, and also segment duration. Measures of spectral shape frequently distinguish sibilant place of articulation: higher concentrations of energy generally reflect more anterior constriction locations (e.g., /s z/ are produced closer to the teeth than /S Z/). Segment duration can also signal contrasts in voicing status (Jongman et al., 2000).
Our release contains the segment duration, spectral peak, the spectral moments of the frequency distribution (center of gravity: COG, variance, skewness, and kurtosis), as well as two measures of the mid-frequency peak determined by sibilant quality. These are the mid-frequency peak between 3000 and 7000 Hz for alveolar sibilants, and between 2000 and 6000 Hz for post-alveolar sibilants (Koenig et al., 2013; Shadle et al., 2016). The spectral information was obtained via multitaper spectral analysis (Rahim and Burr, 2017), with a time-bandwidth parameter (nw) of 4 and 8 tapers (k) over the middle 50% of the fricative (Blacklock, 2004). Measurements were made using the methods described in Forrest et al. (1988) for spectral moments and Koenig et al. (2013) for spectral peak varieties.
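As a rough sketch of these measures, the spectral moments can be computed by treating a power spectrum as a probability distribution over frequency, and the mid-frequency peak as the strongest bin within a band. The input spectrum is assumed to be already estimated (the multitaper step is not shown), the use of excess kurtosis is an assumption about convention, and the function names are hypothetical.

```python
import math

def spectral_moments(freqs, power):
    """Center of gravity, variance, skewness and excess kurtosis of a spectrum."""
    total = sum(power)
    p = [w / total for w in power]  # normalize power to sum to 1
    cog = sum(f * w for f, w in zip(freqs, p))
    var = sum((f - cog) ** 2 * w for f, w in zip(freqs, p))
    sd = math.sqrt(var)
    skew = sum((f - cog) ** 3 * w for f, w in zip(freqs, p)) / sd ** 3
    kurt = sum((f - cog) ** 4 * w for f, w in zip(freqs, p)) / sd ** 4 - 3.0
    return cog, var, skew, kurt

def peak_in_band(freqs, power, lo_hz, hi_hz):
    """Frequency of the maximum-power bin within [lo_hz, hi_hz]."""
    band = [(w, f) for f, w in zip(freqs, power) if lo_hz <= f <= hi_hz]
    return max(band)[1]

freqs = [2000, 4000, 6000, 8000]  # toy spectrum with four bins (Hz)
power = [1.0, 5.0, 3.0, 9.0]
print(spectral_moments(freqs, power)[0])        # center of gravity
print(peak_in_band(freqs, power, 3000, 7000))   # alveolar band from the text
```

The band limits passed to `peak_in_band` would be 3000 to 7000 Hz for alveolar sibilants and 2000 to 6000 Hz for post-alveolar sibilants, following the description above.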

Computation times
Generating phoneme-level alignments and extracting subsequent phonetic measures takes significant time, computational resources, and domain knowledge. Our release enables the community to use this data directly without these prerequisites. Table 2 shows that the time to extract our resources, once methods have been developed, was more than 6 CPU years, primarily for training eHMM models.

General caveats
We caution that our labeling and alignment of the corpus contains errors. In particular, it is difficult to responsibly draw firm linguistic conclusions from the Unitran-based segments (§3.1.1). In §5 we suggest future work to address these issues.
A Quality of Utterance Pairs: For some utterances, the speech does not correspond completely to the text, due to incorrect cosegmentation. In our phonetic studies, we threshold using reading-level MCD as a heuristic for overall alignment quality, and further threshold remaining readings using utterance-level MCD. We recommend others do so as well.

B Phoneme Label Consistency and Accuracy: Phoneme-level labels are predicted from text without the aid of audio using G2P methods. This may lead to systematic errors. In particular, Unitran relies on a 'universal' table that maps grapheme <s> (for example) to phoneme /s/ in every context and every language. This is problematic for languages that use <s> in some or all contexts to refer to other phonemes such as /ʃ/ or /ʂ/, or use digraphs that contain <s>, such as <sh> for /ʃ/. Thus, the predicted label /s/ may not consistently refer to the same phoneme within a language, nor to phonetically similar phonemes across languages. Even WikiPron annotations are user-submitted and may not be internally consistent (e.g., some words use /d ʒ/ or /t/ while others use /ʤ/ or /t̪/), nor comparable across languages.
'Phoneme' inventories for Unitran and WikiPron have been implicitly chosen by whoever designed the language's orthography or its WikiPron pages; while this may reflect a reasonable folk phonology, it may not correspond to the inventory of underlying or surface phonemes that any linguist would be likely to posit.
C Label and Alignment Assessment: While alignment quality for languages with Epitran and WikiPron can be assessed and calibrated beyond this corpus, it cannot for those languages with only Unitran alignments; the error rate on languages without resources to evaluate PER is unknown to us. The Unitran alignments should be treated as a first-pass alignment which may still be useful for a researcher who is willing to perform quality control and correction of the alignments using automatic or manual procedures. Our automatically-generated alignment offers an initial label and placement of the boundaries that would hopefully facilitate downstream analysis.

Phonetic Case Studies
We present two case studies to illustrate the utility of our resource for exploration of crosslinguistic typology. Phoneticians have posited several typological principles that may structure phonetic systems. Though previous research has provided some indication as to the direction and magnitude of expected effects, many instances of the principles have not yet been explored at scale. Our case studies investigate how well they account for cross-linguistic variation and systematicity for our phonetic measures from vowels and sibilants. Below we present the data filtering methods for our case studies, followed by an introduction to and evaluation of phonetic dispersion and uniformity.

Data filtering
For quality, we use only the tokens extracted using high-resource pronunciations (Epitran and WikiPron) and only in languages with mean MCD lower than 8.0. 9 Furthermore, we only use those utterances with MCD lower than 6.0. The vowel analyses focus on F1 and F2 in ERB taken at the vowel midpoint (Zwicker and Terhardt, 1980; Glasberg and Moore, 1990). 10 The sibilant analyses focus on mid-frequency peak of /s/ and /z/, also in ERB. Vowel tokens with F1 or F2 measures beyond two standard deviations from the label- and reading-specific mean were excluded, as were tokens for which Praat failed to find a measurable F1 or F2, or whose duration exceeded 300 ms. Sibilant tokens with mid-frequency peak or duration measures beyond two standard deviations from the label- and reading-specific mean were also excluded. When comparing realizations of two labels such as /i/-/u/ or /s/-/z/, we excluded readings that did not contain at least 50 tokens of each label. We show data representation with different filtering methods in Appendix D.
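The vowel token filter can be sketched as follows for the tokens of one vowel label in one reading. The Hz-to-ERB conversion uses the ERB-rate formula of Glasberg and Moore (1990); the token field names, the order of the filtering steps, and the choice to compute the standard deviations in ERB are assumptions of this sketch rather than details confirmed by the text.

```python
import math
from statistics import mean, stdev

def hz_to_erb(f_hz):
    """ERB-rate value for a frequency in Hz (Glasberg and Moore, 1990)."""
    return 21.4 * math.log10(0.00437 * f_hz + 1.0)

def filter_vowel_tokens(tokens, n_sd=2.0, max_dur=0.300):
    """Filter tokens of ONE vowel label in ONE reading.

    Each token is a dict with 'f1' and 'f2' in Hz (None if unmeasurable)
    and 'duration' in seconds. These field names are hypothetical.
    """
    # Drop tokens without a measurable F1/F2 or longer than 300 ms.
    kept = [t for t in tokens
            if t["f1"] is not None and t["f2"] is not None
            and t["duration"] <= max_dur]
    if len(kept) < 2:
        return kept  # a standard deviation needs at least two tokens
    erb = [(hz_to_erb(t["f1"]), hz_to_erb(t["f2"])) for t in kept]
    # (mean, sd) for F1 and for F2 over the remaining tokens.
    stats = [(mean(col), stdev(col)) for col in zip(*erb)]
    # Keep tokens within n_sd standard deviations on both formants.
    return [t for t, vals in zip(kept, erb)
            if all(abs(v - m) <= n_sd * s for v, (m, s) in zip(vals, stats))]

tokens = [{"f1": 300.0, "f2": 2200.0, "duration": 0.10} for _ in range(10)]
tokens.append({"f1": 900.0, "f2": 2200.0, "duration": 0.10})  # F1 outlier
tokens.append({"f1": None, "f2": 2200.0, "duration": 0.10})   # tracking failure
tokens.append({"f1": 300.0, "f2": 2200.0, "duration": 0.40})  # too long
print(len(filter_vowel_tokens(tokens)))
```

The reading-level and utterance-level MCD thresholds and the minimum of 50 tokens per label would be applied around this function, before and after it respectively.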

9 In the high-MCD languages, even the low-MCD utterances seem to be untrustworthy.
10 The Equivalent Rectangular Bandwidth (ERB) scale is a psychoacoustic scale that better approximates human perception, which may serve as auditory feedback for the phonetic realization (Fletcher, 1923; Nearey, 1977; Zwicker and Terhardt, 1980; Glasberg and Moore, 1990). The precise equation comes from Glasberg and Moore (1990, Eq. 4).

Phonetic dispersion
Phonetic dispersion refers to the principle that contrasting speech sounds should be distinct from one another in phonetic space (Martinet, 1955; Jakobson, 1968; Flemming, 1995, 2004). Most studies investigating this principle have focused on its validity within vowel systems, as we do here. While languages tend to have seemingly well-dispersed vowel inventories such as {/i/, /a/, /u/} (Joos, 1948; Stevens and Keyser, 2010), the actual phonetic realization of each vowel can vary substantially (Lindau and Wood, 1977; Disner, 1983). One prediction of dispersion is that the number of vowel categories in a language should be inversely related to the degree of per-category acoustic variation (Lindblom, 1986). Subsequent findings have cast doubt on this (Livijn, 2000; Recasens and Espinosa, 2009; Vaux and Samuels, 2015), but these studies have been limited by the number and diversity of languages investigated.
To investigate this, we measured the correlation between the number of vowel categories in a language and the degree of per-category variation, as measured by the joint entropy of (F1, F2) conditioned on the vowel category. We model p(F1, F2 | V) using a bivariate Gaussian for each vowel type v. We can then compute the joint conditional entropy under this model as
$$H(\mathrm{F1}, \mathrm{F2} \mid V) = \sum_{v} p(v)\, H(\mathrm{F1}, \mathrm{F2} \mid V = v) = \sum_{v} p(v)\, \tfrac{1}{2} \ln \det\left(2 \pi e\, \Sigma_v\right),$$
where the sum ranges over the vowel types v in a reading, p(v) is the probability of vowel type v, and Σ_v is the covariance matrix of the Gaussian for v.
Vowel inventory sizes per reading ranged from 4 to 20 vowels, with a median of 8. Both Spearman and Pearson correlations between entropy estimate and vowel inventory size across analyzed languages were small and not significant (Spearman ρ = 0.11, p = 0.44; Pearson r = 0.11, p = 0.46), corroborating previous accounts of the relationship described in Livijn (2000) and Vaux and Samuels (2015) with a larger number of languages: a larger vowel inventory does not necessarily imply more precision in vowel category production. 11
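The entropy estimate can be sketched with the closed-form differential entropy of a bivariate Gaussian, ½ ln det(2πe Σ), weighted by the relative frequency of each vowel type. The maximum-likelihood covariance estimate, the use of token relative frequency for the vowel probabilities, and the function names are assumptions of this sketch.

```python
import math

def gaussian_entropy_2d(points):
    """Differential entropy (nats) of a bivariate Gaussian fit to (F1, F2) points."""
    n = len(points)
    m1 = sum(x for x, _ in points) / n
    m2 = sum(y for _, y in points) / n
    s11 = sum((x - m1) ** 2 for x, _ in points) / n
    s22 = sum((y - m2) ** 2 for _, y in points) / n
    s12 = sum((x - m1) * (y - m2) for x, y in points) / n
    det = s11 * s22 - s12 ** 2  # determinant of the 2x2 covariance matrix
    # det(2*pi*e*Sigma) = (2*pi*e)^2 * det(Sigma) for a 2x2 matrix.
    return 0.5 * math.log((2 * math.pi * math.e) ** 2 * det)

def conditional_entropy(tokens_by_vowel):
    """H(F1, F2 | V): per-vowel entropies weighted by relative token frequency."""
    total = sum(len(pts) for pts in tokens_by_vowel.values())
    return sum(len(pts) / total * gaussian_entropy_2d(pts)
               for pts in tokens_by_vowel.values())

square = [(1.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-1.0, -1.0)]  # identity covariance
reading = {
    "i": square,
    "u": [(x + 5.0, y + 5.0) for x, y in square],  # same spread, shifted mean
}
print(conditional_entropy(reading))
```

Because the entropy depends only on the covariance and not on the mean, the two toy vowels above contribute equally; a reading with tighter vowel categories would yield a lower value regardless of how far apart the categories are.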

11 Since differential entropy is sensitive to parameterization, we also measured this correlation using formants in hertz, instead of in ERB, as ERB is on a logarithmic scale. This change did not influence the pattern of results (Spearman ρ = 0.12, p = 0.41; Pearson r = 0.13, p = 0.39).

Phonetic uniformity
Previous work suggests that F1 is fairly uniform with respect to phonological height. Within a single language, the mean F1s of /e/ and /o/, which share a height, have been found to be correlated across speakers (Yorkshire English: Watt, 2000; French: Ménard et al., 2008; Brazilian Portuguese: Oushiro, 2019; Dutch, English, French, Japanese, Portuguese, Spanish: Schwartz and Ménard, 2019). Though it is physically possible for these vowels to differ in F1 realization, the correlations indicate a strong tendency for languages and individual speakers to yoke these two representations together.
Systematicity in the realization of sibilant place of articulation has also been observed across speakers of American English and Czech (Chodroff, 2017). Phonetic correlates of sibilant place strongly covary between /s/ and /z/, which share a [+anterior] place of articulation and are produced at the alveolar ridge, and between /S/ and /Z/, which share a [-anterior] place of articulation and are produced behind the alveolar ridge.
A principle of uniformity may account for these findings. Uniformity here refers to a principle in which a distinctive phonological feature should have a consistent phonetic realization, within a language or speaker, across different segments with that feature (Keating, 2003; Chodroff et al., 2019). Similar principles posited in the literature include Maximal Use of Available Controls, in which a control refers to an integrated perceptual and motor phonetic target (Ménard et al., 2008), as well as a principle of gestural economy (Maddieson, 1995). Phonetic realization refers to the mapping from the abstract distinctive feature to an abstract phonetic target. We approximate this phonetic target via an acoustic-phonetic measurement, but we emphasize that the acoustic measurement is not necessarily a direct reflection of an underlying phonetic target (which could be an articulatory gesture, auditory goal, or perceptuo-motor representation of the sound). We make the simplifying assumption that the acoustic-phonetic formants (F1, F2) directly correspond to phonetic targets linked to the vowel features of height and backness.
More precisely, uniformity of a phonetic measure with respect to a phonological feature means that any two segments sharing that feature will tend to have approximately equal measurements in a given language, even when that value varies across languages. We can observe whether this is true by plotting the measures of the two segments against each other by language (e.g., Figure 4).
Vowels. As shown in Figure 4 and Table 3, the strongest correlations in mean F1 frequently reflected uniformity of height (e.g., high vowels /i/-/u/: r = 0.79, p < 0.001, mid vowels /e/-/o/: r = 0.62, p < 0.01). Nevertheless, some vowel pairs that differed in height were also moderately correlated in mean F1 (e.g., /o/-/a/: r = 0.66, p < 0.001). Correlations of mean F1 were overall moderate in strength, regardless of the vowels' phonological specifications.
Correlations of mean F2 were also strongest among vowels with a uniform backness specification (e.g., back vowels /u/-/o/: r = 0.69, p < 0.001; front vowels /i/-/E/: r = 0.69, p < 0.05; Table 4). The correlation between front tense vowels /i/ and /e/ was significant and in the expected direction, but also slightly weaker than the homologous back vowel pair (r = 0.41, p < 0.05). Vowels differing in backness frequently had negative correlations, which could reflect influences of category crowding or language-/speaker-specific differences in peripheralization. We leave further exploration of those relationships to future study.
The moderate to strong F1 correlations among vowels with a shared height specification are consistent with expectations based on previous studies, and also with predictions of uniformity. Similarly, we find an expected correlation of F2 means for vowels with a shared backness specification. Vowel pairs that were predicted to have significant correlations, but did not, tended to have small sample sizes (< 14 readings).
Nevertheless, the correlations are not perfect; nor are the patterns. For instance, the back vowel correlations of F2 are stronger than the front vowel correlations. While speculative, the apparent peripheralization of /i/ (as revealed in the negative F2 correlations) could have weakened the expected uniformity relation of /i/ with other front vowels. Future research should take into account additional influences of the vowel inventory composition, as well as articulatory or auditory factors for a more complete understanding of the structural forces in the phonetic realization of vowels.
Sibilants. The mean mid-frequency peak values for /s/ and /z/ each varied substantially across readings, and were also strongly correlated with one another (r = 0.87, p < 0.001; Figure 4). 13 This finding suggests a further influence of uniformity on the realization of place for /s/ and /z/, and the magnitude is comparable to previous correlations observed across American English and Czech speakers, in which r was ≈0.90 (Chodroff, 2017).
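The correlations reported in this section can be sketched as a Pearson correlation over per-reading means of two labels, restricted to readings with enough tokens of each. The input layout (tuples of reading identifier, label, and measurement), the function names, and the toy values are hypothetical.

```python
import math
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def uniformity_correlation(tokens, label_a, label_b, min_tokens=50):
    """Correlate per-reading means of one measure for two labels.

    tokens: iterable of (reading_id, label, value), e.g. mid-frequency peak
    in ERB. Returns (r, number of readings used).
    """
    by_reading = {}
    for reading, label, value in tokens:
        if label in (label_a, label_b):
            groups = by_reading.setdefault(reading, {label_a: [], label_b: []})
            groups[label].append(value)
    # Keep only readings with enough tokens of BOTH labels.
    pairs = [(mean(g[label_a]), mean(g[label_b])) for g in by_reading.values()
             if len(g[label_a]) >= min_tokens and len(g[label_b]) >= min_tokens]
    xs, ys = zip(*pairs)
    return pearson_r(xs, ys), len(pairs)

toy = [("r1", "s", 30.0), ("r1", "z", 29.0),
       ("r2", "s", 32.0), ("r2", "z", 31.0),
       ("r3", "s", 34.0), ("r3", "z", 33.0)]
print(uniformity_correlation(toy, "s", "z", min_tokens=1))
```

Each point in a plot such as Figure 4 corresponds to one reading's pair of means; the default `min_tokens=50` mirrors the threshold stated in the data filtering section.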

13 The magnitude of this correlation did not change when using hertz (r = 0.86, p < 0.001).

Directions for Future Work
We hope our corpus may serve as a touchstone for further improvements in phonetic typology research and methodology. Here we suggest potential steps forward for known areas (§3.4) where this corpus could be improved:
A Sentence alignments were generated using Unitran, and could be improved with higher-quality G2P and verse-level text segmentation to standardize utterances across languages.
B Consistent and comparable phoneme labels are the ultimate goal. Concurrent work on universal phone recognition (Li et al., 2020) addresses this issue through a universal phone inventory constrained by language-specific PHOIBLE inventories (Moran and McCloy, 2019). However, free-decoding phones from speech alone is challenging. One exciting possibility is to use the orthography and audio jointly to guide semi-supervised learning of per-language pronunciation lexicons (Lu et al., 2013;Zhang et al., 2017).
C Reliable quality assessment for current methods remains an outstanding research question for many languages. For covered languages, using a universal label set to map additional high quality lexicons (e.g., hand-annotated lexicons) to the same label space as ours would enable direct label and alignment assessment through precision, recall, and PER.
D Curating additional resources beyond this corpus would improve coverage and balance, such as contributing additional Epitran modules. Additional readings exist for many languages on the original bible.is site and elsewhere. Annotations with speaker information are not available, but improved unsupervised speaker clustering may also support better analysis.

Conclusion
VoxClamantis V1.0 is the first large-scale corpus for phonetic typology, with extracted phonetic features for 635 typologically diverse languages. We present two case studies illustrating both the research potential and limitations of this corpus for investigation of phonetic typology at a large scale. We discuss several caveats for the use of this corpus and areas for substantial improvement. Nonetheless, we hope that directly releasing our alignments and token-level features enables greater research accessibility in this area. We hope this corpus will motivate and enable further developments in both phonetic typology and methodology for working with cross-linguistic speech corpora.

A Pairwise Correlations between Vowel Formant Measures (§4 Case Studies)
Table 3 and Table 4 respectively show Pearson correlations of mean F1 and mean F2 in ERB between vowels that appear in at least 10 readings. As formalized in the present analysis, phonetic uniformity predicts strong correlations of mean F1 among vowels with a shared height specification, and strong correlations of mean F2 among vowels with a shared backness specification. The respective "Height" and "Backness" columns in Table 3 and Table 4 indicate whether the vowels in each pair match in their respective specifications. p-values are corrected for multiple comparisons using the Benjamini-Hochberg correction and a false discovery rate of 0.25 (Benjamini and Hochberg, 1995). Significance is assessed at α = 0.05 following the correction for multiple comparisons; rows that appear in gray have correlations that are not significant according to this threshold.

Here we evaluate the quality of the Unitran dataset in more detail. The goal is to explore the variation in the quality of the labeled Unitran segments across different languages and phoneme labels.
This evaluation includes only readings in high-resource languages, where we have not only the aligned Unitran pronunciations but also aligned high-resource pronunciations (Epitran or WikiPron) against which to evaluate them. The per-token statistics used to calculate these plots are included in the corpus release to enable closer investigation of individual phonemes than is possible here.

B.1 Unitran Pronunciation Accuracy
First, in Figures 5 and 6, we consider whether Unitran's utterance pronunciations are accurate without looking at the audio. For each utterance, we compute the unweighted Levenshtein alignment between the Unitran pronunciation of the utterance and the high-resource pronunciation. For each reading, we then score the percentage of Unitran 'phoneme' tokens that were aligned to high-resource 'phoneme' tokens with exactly the same label. 14 We can see in Figure 6 that many labels are highly accurate in many readings while being highly inaccurate in many others. Some labels are noisy in some readings. 15

Figure 5: Unitran pronunciation accuracy per language, evaluated by Levenshtein alignment to WikiPron pronunciations (hatched bars) or Epitran pronunciations (plain bars). Where a language has multiple readings, error bars show the min and max across those readings.

Figure 6: Unitran pronunciation accuracy per language, for selected phonemes (x-axis: Unitran label precision, 0% to 100%). Accuracy is evaluated by Levenshtein alignment as in Figure 5. Each curve is a kernel density plot with integral 1. For the /z/ curve, the integral between 80% and 100% (for example) is the estimated probability that in a high-resource language drawn uniformly at random, the fraction of Unitran /z/ segments that align to high-resource /z/ segments falls in that range. The 'all' curve is the same, but now the uniform draw is from all pairs of (high-resource language, Unitran phoneme used in that language).
14 By contrast, PER in Table 1 aligns at the word level rather than the utterance level, uses the number of symmetric alignment errors (insertions + deletions + substitutions) rather than the number of correct Unitran phonemes, and normalizes by the length of the high-resource 'reference' pronunciation rather than by the length of the Unitran pronunciation.
15 Note that, as §3.1.3 points out, it may be unfair to require exact match of labels, since annotation schemes vary.
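The scoring described in B.1 and the contrast drawn in footnote 14 can be sketched as follows. This is a minimal illustration under stated assumptions, not the scripts released with the corpus: the function names (`levenshtein_align`, `label_precision`, `phoneme_error_rate`) and the toy pronunciations are hypothetical, and ties between equally cheap alignments are broken arbitrarily (substitution or match first).

```python
from collections import Counter


def levenshtein_align(hyp, ref):
    """Unweighted Levenshtein alignment of two token lists.

    Returns (pairs, distance), where pairs holds (hyp_token, ref_token)
    tuples and None marks an unaligned side (insertion or deletion).
    """
    n, m = len(hyp), len(ref)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Backtrace from the bottom-right corner to recover one optimal alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])):
            pairs.append((hyp[i - 1], ref[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((hyp[i - 1], None))
            i -= 1
        else:
            pairs.append((None, ref[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs, cost[n][m]


def label_precision(pairs):
    """Per-label fraction of hypothesis tokens aligned to an identical label."""
    total, correct = Counter(), Counter()
    for hyp_tok, ref_tok in pairs:
        if hyp_tok is None:
            continue  # a reference-only token does not count against any label
        total[hyp_tok] += 1
        correct[hyp_tok] += int(hyp_tok == ref_tok)
    return {label: correct[label] / total[label] for label in total}


def phoneme_error_rate(hyp, ref):
    """(insertions + deletions + substitutions) / reference length."""
    _, distance = levenshtein_align(hyp, ref)
    return distance / len(ref)


# Toy example: a hypothetical Unitran pronunciation against a reference one.
unitran = ["p", "a", "z", "a", "t"]
reference = ["p", "a", "s", "a"]
pairs, dist = levenshtein_align(unitran, reference)
precision = label_precision(pairs)
per = phoneme_error_rate(unitran, reference)
print(pairs, precision, per)
```

The precision score is normalized by the number of Unitran tokens carrying each label, whereas PER is normalized by the length of the reference pronunciation, which is the distinction footnote 14 draws.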

B.2 Unitran Segment Label Accuracy
In Figures 7 and 8, we ask the same question again, but making use of the audio data. The match for each Unitran segment is now found not by Levenshtein alignment, but more usefully by choosing the high-resource segment with the closest midpoint. For each reading, we again score the percentage of Unitran 'phoneme' tokens whose aligned high-resource 'phoneme' tokens have exactly the same label. Notice that phonemes that typically had high accuracy in Figure 6, such as /p/ and /b/, now have far more variable accuracy in Figure 8, suggesting difficulty in aligning the Unitran pronunciations to the correct parts of the audio.
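The midpoint-based matching described above can be sketched as below. Segments are represented here as hypothetical `(start, end, label)` tuples in seconds, which is an assumption about the data layout rather than the format of the corpus release; the quadratic search is chosen for clarity, not efficiency.

```python
from collections import defaultdict


def midpoint(segment):
    """Temporal midpoint of a (start, end, label) segment."""
    start, end, _ = segment
    return 0.5 * (start + end)


def match_by_midpoint(unitran_segments, reference_segments):
    """Pair each Unitran segment with the reference segment whose midpoint is closest."""
    matches = []
    for seg in unitran_segments:
        closest = min(reference_segments,
                      key=lambda ref: abs(midpoint(ref) - midpoint(seg)))
        matches.append((seg, closest))
    return matches


def label_accuracy(matches):
    """Per-label fraction of Unitran segments whose matched segment has the same label."""
    total, correct = defaultdict(int), defaultdict(int)
    for unitran_seg, ref_seg in matches:
        label = unitran_seg[2]
        total[label] += 1
        correct[label] += int(label == ref_seg[2])
    return {label: correct[label] / total[label] for label in total}


# Toy utterance with invented timings (seconds).
unitran_segments = [(0.00, 0.10, "p"), (0.10, 0.25, "a"), (0.25, 0.40, "z")]
reference_segments = [(0.00, 0.08, "p"), (0.08, 0.22, "a"),
                      (0.22, 0.30, "s"), (0.30, 0.40, "a")]

matches = match_by_midpoint(unitran_segments, reference_segments)
accuracy = label_accuracy(matches)
print(accuracy)
```

In this toy case the Unitran /z/ is matched to a reference vowel rather than to the reference /s/, showing how a label can be penalized for landing on the wrong part of the audio as well as for being the wrong label.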

B.3 Unitran Segment Boundary Accuracy
Finally, in Figures 9 and 10, we measure whether Unitran segments with the "correct" label also have the "correct" time boundaries, where "correctness" is evaluated against the corresponding segments obtained using Epitran or WikiPron+G2P.

Figure 10: Each curve is a kernel density plot with integral 1. For the /z/ curve, the integral between 50ms and 100ms (for example) is the estimated probability that in a high-resource language drawn uniformly at random, the Unitran /z/ segments whose corresponding Epitran or WikiPron segments are also labeled with /z/ have mean boundary error in that range. Small bumps toward the right correspond to individual languages where the mean error of /z/ is unusually high. The 'all' curve is the same, but now the uniform draw is from all pairs of (high-resource language, Unitran phoneme used in that language). The boundary error of a segment is evaluated as in Figure 9.
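A per-label mean boundary error of the kind plotted above could be computed along the following lines. The definition of a segment's boundary error is given with Figure 9 and is not reproduced in this section, so the definition used here (the average of the absolute start-time and end-time differences, in milliseconds) is an assumption made purely for illustration, as are the function names and the `(start, end, label)` tuple layout.

```python
def boundary_error_ms(unitran_seg, ref_seg):
    """Assumed boundary error: mean absolute start/end offset, in milliseconds."""
    start_diff = abs(unitran_seg[0] - ref_seg[0])
    end_diff = abs(unitran_seg[1] - ref_seg[1])
    return 1000.0 * (start_diff + end_diff) / 2.0


def mean_boundary_error_by_label(matches):
    """Mean boundary error per label, over matched pairs whose labels agree."""
    sums, counts = {}, {}
    for unitran_seg, ref_seg in matches:
        label = unitran_seg[2]
        if label != ref_seg[2]:
            continue  # only "correctly" labeled segments are scored
        sums[label] = sums.get(label, 0.0) + boundary_error_ms(unitran_seg, ref_seg)
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


# Hypothetical matched (Unitran, reference) segment pairs, times in seconds.
matched_pairs = [
    ((0.00, 0.10, "z"), (0.02, 0.12, "z")),  # 20 ms error
    ((0.50, 0.60, "z"), (0.50, 0.66, "z")),  # 30 ms error
    ((0.70, 0.80, "z"), (0.70, 0.80, "s")),  # label mismatch, skipped
]

errors = mean_boundary_error_by_label(matched_pairs)
print(errors)
```

Computing this once per reading and label yields the values whose distribution across languages a kernel density plot would then summarize.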
C WikiPron Grapheme-to-Phoneme (G2P) Accuracy (§3.1.3 Quality Measures)
For each language where we used WikiPron, Table 5 shows the phoneme error rate (PER) of Phonetisaurus G2P models trained on WikiPron entries, as evaluated on held-out WikiPron entries. This is an estimate of how accurate our G2P-predicted pronunciations are on out-of-vocabulary words, insofar as those are distributed similarly to the in-vocabulary words. (It is possible, however, that out-of-vocabulary words such as Biblical names are systematically easier or harder for the G2P system to pronounce, depending on how they were transliterated.) The same G2P configuration was used for all languages, with the hyperparameter settings shown in Table 6. (seq1 max and seq2 max describe how many tokens in the grapheme and phoneme sequences can align to each other.) These settings were tuned on SIGMORPHON 2020 Task

Table 7 shows what percentage of tokens would be retained after various methods are applied to filter out questionable tokens from the readings used in §4.1. In particular, the rightmost column shows the filtering that was actually used in §4.1. We compute statistics for each reading separately; in each column we report the minimum, median, mean, and maximum statistics over the readings. The top half of the table considers vowel tokens (for the vowels in Appendix A); the bottom half considers sibilant tokens (/s/ and /z/).
On the left side of the table, we consider three filtering techniques for Unitran alignments. Midpoint retains only the segments whose labels are "correct" according to the midpoint-matching methods of Appendix B. MCD retains only those utterances with MCD < 6. Outlier removes tokens that are outliers according to the criteria described in §4.1. Finally, AGG. is the aggregate retention rate after all three methods are applied in order.
On the right side of the table, we consider the same filtering techniques for the high-resource alignments that we actually use, with the exception of Midpoint, as here we have no higher-quality annotation to match against.
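The retention statistics described above can be sketched as follows. Each token is represented by a hypothetical dictionary with three precomputed fields (`midpoint_ok`, `utt_mcd`, `is_outlier`); these field names are assumptions, and the outlier flag in particular stands in for the §4.1 criteria, which are not reproduced here. Because the flags are precomputed, the aggregate is simply their conjunction; a pipeline in which outlier detection depends on the tokens surviving the earlier filters would need to apply the filters sequentially instead.

```python
import statistics

MCD_THRESHOLD = 6.0  # utterance-level threshold stated in the text


def retention_rates(tokens):
    """Percentage of one reading's tokens retained by each filter and by all three."""
    keep = {
        "Midpoint": lambda t: t["midpoint_ok"],
        "MCD": lambda t: t["utt_mcd"] < MCD_THRESHOLD,
        "Outlier": lambda t: not t["is_outlier"],
    }
    rates = {name: 100.0 * sum(1 for t in tokens if test(t)) / len(tokens)
             for name, test in keep.items()}
    kept_by_all = sum(1 for t in tokens if all(test(t) for test in keep.values()))
    rates["AGG."] = 100.0 * kept_by_all / len(tokens)
    return rates


def summarize(per_reading_rates):
    """Min, median, mean, and max of each retention rate across readings."""
    summary = {}
    for column in ("Midpoint", "MCD", "Outlier", "AGG."):
        values = [rates[column] for rates in per_reading_rates.values()]
        summary[column] = {
            "min": min(values),
            "median": statistics.median(values),
            "mean": statistics.mean(values),
            "max": max(values),
        }
    return summary


# Invented tokens for two hypothetical readings.
readings = {
    "reading_a": [
        {"midpoint_ok": True, "utt_mcd": 5.0, "is_outlier": False},
        {"midpoint_ok": False, "utt_mcd": 5.0, "is_outlier": False},
        {"midpoint_ok": True, "utt_mcd": 6.5, "is_outlier": False},
        {"midpoint_ok": True, "utt_mcd": 5.0, "is_outlier": True},
    ],
    "reading_b": [
        {"midpoint_ok": True, "utt_mcd": 4.2, "is_outlier": False},
        {"midpoint_ok": True, "utt_mcd": 5.9, "is_outlier": False},
    ],
}

per_reading = {name: retention_rates(toks) for name, toks in readings.items()}
summary = summarize(per_reading)
print(summary["AGG."])
```

For the high-resource alignments the Midpoint filter would be omitted, as noted above, since there is no higher-quality annotation to match against.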

Table 7: Summary of quality measure retention statistics for vowels and sibilants over unique readings with reading-level MCD < 8 for Unitran and high-resource alignments. Column groups: Unitran Alignments (left), High-Resource Alignments (right).
E All VoxClamantis V1.0 Languages
All 635 languages from 690 readings are presented here with their language family, ISO 639-3 code, and mean utterance alignment quality in Mel Cepstral Distortion (MCD) from Black (2019). Languages for which we release Epitran and/or WikiPron alignments in addition to Unitran alignments are marked with e and w respectively. MCD is color-coded from purple (low) through blue-green (mid) to yellow (high). Lower MCD typically corresponds to better audio-text utterance alignments and higher quality speech synthesis, but judgments regarding distinctions between languages may be subjective. ISO 639-3 is not intended to provide identifiers for dialects or other sub-language variations, which may be present here where there are multiple readings for one ISO 639-3 code. We report the most up-to-date language names from the ISO 639-3 schema (Eberhard and Fennig, 2020). Language names and codes in many schemas can be pejorative or outdated; although language codes cannot easily be updated, language names can be, and often are.