Phonetic Vector Representations for Sound Sequence Alignment

This study explores a number of data-driven vector representations of IPA-encoded sound segments for the purpose of sound sequence alignment. We test the alternative representations based on their alignment accuracy in the context of computational historical linguistics. We show that the data-driven methods consistently do better than linguistically-motivated articulatory-acoustic features. The similarity scores obtained using the data-driven representations in a monolingual context, however, perform worse than the state-of-the-art distance (or similarity) scoring methods proposed in earlier studies of computational historical linguistics. We also show that adapting representations to the task at hand improves the results, yielding alignment accuracy comparable to the state-of-the-art methods.


Introduction
Most studies in computational linguistics or natural language processing treat the phonetic segments as categorical units, which prevents analyzing or exploiting the similarities or differences between these units. Alignment of sound sequences, a crucial step in a number of different fields of inquiry, is one of the tasks that suffers if the segments are treated as distinct symbols with no notion of similarity. As a result, alignment algorithms commonly employed in practice (e.g., Needleman and Wunsch, 1970) use a scoring function based on similarity of the individual units.
The tasks that require or benefit from aligning sequences are prevalent in computational linguistics, as well as relatively unrelated fields such as bioinformatics. In this study, we focus on aligning phonetically transcribed parallel word lists in the context of computational historical linguistics, where alignment of sound sequences is interesting either on its own (demonstrating differences between language varieties) or as a necessary step in a larger application, for example, for inferring the cognacy of these words or finding synchronic or diachronic sound correspondences.
The use of similarities between the sound segments has been common in computational studies of historical linguistics (Covington, 1996, 1998; Kondrak, 2000; Kondrak and Hirst, 2002; Kondrak, 2003; List, 2012; Jäger, 2013; Jäger and Sofroniev, 2016). These studies rely on scoring functions, most of which are based on linguistic knowledge about the sound changes that typically occur across languages. Another trend shared by all of the earlier studies is the use of a reduced alphabet for representing the sound segments. Even though the standard way to encode sound sequences is the International Phonetic Alphabet (IPA), using a smaller set of symbols, such as ASJP (Brown et al., 2013; Wichmann et al., 2016), seems to help create scoring functions that are more useful for historical linguistics.
In the present study, we explore a number of methods that learn vector representations for IPA tokens from multi-lingual word-lists, either using the words in a monolingual context or making use of the fact that words represent the same concept in different languages. We use a standard similarity metric over vectors (cosine similarity) for determining the similarities between the segments, and, in turn, use these similarities for aligning IPA-transcribed sequences.
Besides providing a more principled method for measuring distances, vector representations are more useful for further analysis than distance information alone, and may yield better results in other computational tasks that rely on supervised or unsupervised machine learning techniques. Vector representations for phonetic, phonological or orthographic units have been used successfully in earlier research, e.g., for word segmentation (Ma et al., 2016), transfer learning of named entity recognition (Mortensen et al., 2016) and morphological inflection (Silfverberg et al., 2018).
We compare our methods to a one-hot-encoding baseline (which is equivalent to symbolic representations), linguistically-motivated vectors, and alignments produced using state-of-the-art scoring methods. We compare the alignment performance of these methods on a manually-annotated gold-standard corpus, using the same alignment algorithm and the same training data where applicable.

Methods
Our aim is to learn and use vector representations for the purposes of sound sequence alignment. Once we have vector representations, we align the two sequences with the Needleman-Wunsch algorithm, using the cosine similarity between the phonetic vectors as the similarity function.
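The alignment step described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the linear gap penalty of -0.5, the function names `cosine` and `align`, and the toy vectors are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align(seq_a, seq_b, vectors, gap=-0.5):
    """Globally align two token sequences (Needleman-Wunsch).

    `vectors` maps each token to its vector; `gap` is an assumed gap penalty.
    Returns (score, pairs), where "-" marks a gap.
    """
    n, m = len(seq_a), len(seq_b)
    # Pairwise similarities, computed once and reused in the traceback.
    sim = [[cosine(vectors[a], vectors[b]) for b in seq_b] for a in seq_a]
    # score[i][j]: best score for aligning seq_a[:i] with seq_b[:j].
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sim[i - 1][j - 1],  # pair two segments
                score[i - 1][j] + gap,                    # gap in seq_b
                score[i][j - 1] + gap,                    # gap in seq_a
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sim[i - 1][j - 1]:
            pairs.append((seq_a[i - 1], seq_b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or score[i][j] == score[i - 1][j] + gap):
            pairs.append((seq_a[i - 1], "-"))
            i -= 1
        else:
            pairs.append(("-", seq_b[j - 1]))
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


# Toy vectors, for illustration only.
toy_vectors = {"p": [1.0, 0.0, 0.0], "a": [0.0, 1.0, 0.0], "t": [0.0, 0.0, 1.0]}
score, pairs = align(["p", "a", "t"], ["p", "t"], toy_vectors)
print(score, pairs)
```

When several alignments share the optimal score, this sketch returns only one of them, preferring to pair two segments over inserting a gap.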

Baseline Representations
One-hot encoding is a common method for representing categorical data. Under one-hot encoding, given a vocabulary of N distinct segments, each segment is represented as a distinct binary vector of size N, such that exactly one of its dimensions has value 1 and all other dimensions have value 0. The method does not yield useful distance measures, as each segment is equidistant from all the others. We use one-hot encoding as a proxy for a purely symbolic baseline.
PHOIBLE Online is an ongoing project aiming to compile a comprehensive database of the phonological inventories of the world's languages (Moran et al., 2014). The project also maintains a table of phonological features, effectively mapping each segment encountered in the database to a unique ternary feature vector. Feature values are assigned based on Hayes (2009) and Moisik and Esling (2011), and indicate either the presence, absence, or non-applicability of an articulatory-acoustic feature for each IPA symbol. PHOIBLE feature vectors serve as a linguistically-informed baseline.
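The difference between the two baselines can be illustrated with a small sketch. The three-segment vocabulary and the ternary feature values below (for the hypothetical features voiced, labial, syllabic) are assumptions made for the example and are not taken from the PHOIBLE feature table.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def one_hot(index, size):
    """Binary vector of length `size` with a single 1 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


segments = ["p", "b", "a"]
one_hot_vectors = {s: one_hot(i, len(segments)) for i, s in enumerate(segments)}

# Hypothetical ternary features (voiced, labial, syllabic):
# +1 present, -1 absent, 0 not applicable. Illustrative values only.
feature_vectors = {"p": [-1, 1, -1], "b": [1, 1, -1], "a": [1, -1, 1]}

for name, vecs in (("one-hot", one_hot_vectors), ("features", feature_vectors)):
    # Compare [p] with [b] and [p] with [a] under each representation.
    print(name, cosine(vecs["p"], vecs["b"]), cosine(vecs["p"], vecs["a"]))
```

Under one-hot encoding both comparisons give the same similarity, whereas the feature vectors place [p] closer to [b] than to [a].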

Data-driven Vector Representations
Our proposed methods include three data-driven methods to learn vector representations for IPA-encoded sound segments.
phon2vec embeddings are obtained by applying the well-known word2vec method (Mikolov et al., 2013) to IPA-encoded phonetic segments. The method learns dense vector representations that maximize the similarity of segments that appear in similar contexts. As in the original word2vec models, the context is treated as a bag of words, ignoring the relative position of each context element.
Position-sensitive neural network embeddings (NN embeddings) are obtained using a simple feed-forward neural network architecture. Similar to the word2vec skip-gram method, the neural network tries to predict the context of a word from the word itself. The hidden layer representations are then used as the representations for the word. Unlike word2vec, however, the context is not treated as a bag of phonetic segments: the position of the elements in the context is significant.
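The contrast between the two notions of context can be made concrete with a small sketch. The window size of 1, the padding symbol "#", and the function names are assumptions made for the example; the sketch only shows how the prediction targets differ, not how either model is trained.

```python
def bag_context(tokens, i, window=1):
    """Context of tokens[i] as an unordered bag (returned sorted), as in word2vec."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return sorted(left + right)


def positional_context(tokens, i, window=1, pad="#"):
    """Context of tokens[i] with one fixed slot per relative position."""
    padded = [pad] * window + list(tokens) + [pad] * window
    j = i + window  # index of tokens[i] inside the padded sequence
    return tuple(padded[j - window:j] + padded[j + 1:j + 1 + window])


word_1 = ["t", "a", "k"]
word_2 = ["k", "a", "t"]

# The bag-of-segments context of "a" is identical in both words ...
print(bag_context(word_1, 1), bag_context(word_2, 1))
# ... while the position-sensitive context tells them apart.
print(positional_context(word_1, 1), positional_context(word_2, 1))
```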
RNN embeddings are obtained using a sequence-to-sequence recurrent neural network (Cho et al., 2014). Given a pair of sequences, the network encodes the first sequence into a vector which is then decoded into an output sequence. The first layer of the network is an embedding layer which converts the input categories to dense vector representations with a smaller number of dimensions. The network is trained to 'translate' words (as sequences of IPA tokens) between the languages in the training set, while, in the process, learning useful representations for IPA tokens. Once the network is trained, we are interested in the representations built for each IPA token by the embedding layer.
Unlike the other data-driven methods described above, the RNN embeddings require, and make use of, the multi-lingual nature of the data. Crucially, however, the method does not require any explicit alignment of the sequences in advance.

State-of-the-art Scoring Functions
We compare the alignment performance of our methods to two state-of-the-art scoring functions. The first one, sound-class-based phonetic alignment (SCA; List, 2012), employs a set of 28 sound classes. It operates on IPA sequences by converting the segments into their respective sound classes, aligning the sound class tokens, and then converting these back into IPA. The scoring function is hand-crafted to reflect the perceived probabilities of sound change transforming a segment of one class into a segment of another.
We also compare our results with the alignments obtained using the method proposed by Jäger (2013), which uses the ASJP database (Wichmann et al., 2016) to calculate pointwise mutual information (PMI) scores for each pair of ASJP segments. The method starts with an initial alignment, and re-aligns the corpus iteratively to obtain the final PMI-based scores. The method is data-driven, but heavily optimized for the task. Since it does not work with IPA-encoded sequences, we first convert the IPA sequences to the ASJP alphabet, and convert them back to IPA after alignment.
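To give an intuition for PMI-based scores, the sketch below estimates them once from a list of already aligned segment pairs, as the logarithm of the joint probability of a pair divided by the product of the segments' marginal probabilities. It is a simplified illustration under assumed toy data and does not reproduce the iterative re-alignment procedure of Jäger (2013); the function name and the symmetric counting scheme are assumptions.

```python
import math
from collections import Counter


def pmi_scores(aligned_pairs):
    """Estimate PMI(a, b) = log(p(a, b) / (p(a) * p(b))) from aligned segment pairs.

    Each pair is counted in both orders so that the scores are symmetric.
    Only pairs that were actually observed receive a score.
    """
    joint = Counter()
    for a, b in aligned_pairs:
        joint[(a, b)] += 1
        joint[(b, a)] += 1
    total = sum(joint.values())
    # Marginal count of a segment: how often it occurs on the left of a pair.
    marginal = Counter()
    for (a, _b), count in joint.items():
        marginal[a] += count
    scores = {}
    for (a, b), count in joint.items():
        p_ab = count / total
        p_a = marginal[a] / total
        p_b = marginal[b] / total
        scores[(a, b)] = math.log(p_ab / (p_a * p_b))
    return scores


# Toy aligned segment pairs, for illustration only.
toy_pairs = [("p", "p"), ("p", "b"), ("a", "a"), ("a", "a")]
toy_scores = pmi_scores(toy_pairs)
print(toy_scores[("a", "a")], toy_scores[("p", "b")])
```

A positive score means that two segments are aligned with each other more often than their individual frequencies would predict.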

Data
In order to evaluate the performance of the methods put forward in the previous section, we use the Benchmark Database for Phonetic Alignments (BDPA, List and Prokić, 2014). The database contains 7198 aligned pairs of IPA sequences collected from 12 source datasets, covering languages and dialects from 6 language families (detailed information about the data set is provided in the Appendix). The database also features the small set of 82 selected pairs used by Covington (1996) to evaluate his method, encoded in IPA.
Our training data is sourced from NorthEuraLex, a comprehensive lexicostatistical database that provides IPA-encoded lexical data for languages of, primarily but not exclusively, Northern Eurasia (Dellert and Jäger, 2017). At the time of writing, the database covers 1016 concepts from 107 languages, resulting in 121 614 IPA transcriptions.

Experimental Setup
Obtaining vector representations with the phon2vec and neural network methods involves setting the models' hyperparameters and training on a data set of IPA sequences (or pairs thereof).
We tokenize the input sequences using an open source Python package developed during this study. The phon2vec and NN embeddings are trained on the set of all tokenised transcriptions in the training set. For training the RNN, we need cognates, that is, pairs of words in different languages that share a common root. As our training set does not include cognacy information, the RNN embeddings are trained on the set of tokenised transcriptions of the word pairs constituting probable cognates: pairs in which the words belong to different languages, are linked to the same concept, and have a normalised Levenshtein distance lower than 0.5. We have also experimented with thresholds of 0.4 and 0.6, but setting the cutoff at 0.5 yields better-performing embeddings.
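The selection of probable cognates can be sketched as below. The normalisation by the length of the longer sequence is an assumption (the text does not specify it), and the entry format, function names, language labels, and toy transcriptions are hypothetical.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between two token sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (x != y)))   # substitution or match
        prev = curr
    return prev[-1]


def normalised_levenshtein(a, b):
    """Edit distance divided by the length of the longer sequence (assumed)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def probable_cognates(entries, threshold=0.5):
    """Select word pairs from different languages, linked to the same concept,
    whose normalised distance is strictly below `threshold`.

    `entries` is a list of (language, concept, tokens) triples.
    """
    pairs = []
    for (lang_1, concept_1, tokens_1), (lang_2, concept_2, tokens_2) in combinations(entries, 2):
        if lang_1 != lang_2 and concept_1 == concept_2:
            if normalised_levenshtein(tokens_1, tokens_2) < threshold:
                pairs.append((tokens_1, tokens_2))
    return pairs


# Hypothetical entries, for illustration only.
toy_entries = [
    ("lang_a", "HAND", ["h", "a", "n", "t"]),
    ("lang_b", "HAND", ["h", "a", "n", "d"]),
    ("lang_c", "HAND", ["k", "e", "s", "i"]),
    ("lang_b", "FOOT", ["h", "a", "n", "t"]),
]
print(probable_cognates(toy_entries))
```

Note that the comparison is strict, so a pair at exactly the threshold is not selected.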
For each method, we run the respective model with the Cartesian product of common values for each hyperparameter, effectively performing a grid search over the hyperparameter space. The values we have experimented with, as well as the best-performing combinations thereof, are summarized in the Appendix. Note that the models are optimized for the respective prediction task they perform, not for good alignment performance.
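Enumerating the Cartesian product of candidate values can be sketched as follows. The hyperparameter names and values shown here are hypothetical placeholders, not the ones listed in the Appendix.

```python
from itertools import product


def hyperparameter_grid(space):
    """Yield one dict per combination in the Cartesian product of `space`.

    `space` maps each hyperparameter name to a list of candidate values.
    """
    names = sorted(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


# Hypothetical search space, for illustration only.
toy_space = {"dimensions": [15, 30], "epochs": [5, 10, 20]}
combinations_to_try = list(hyperparameter_grid(toy_space))
for settings in combinations_to_try:
    print(settings)
```

Each yielded dictionary would then be passed to a single training run, so the number of runs grows multiplicatively with the number of values per hyperparameter.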
The implementation is written in Python and makes use of a number of libraries, including NumPy (Walt et al., 2011), SciPy (Jones et al., 2001), scikit-learn (Pedregosa et al., 2011), Gensim (Řehůřek and Sojka, 2010), and Keras (Chollet et al., 2015). The source code used for the experiments reported here is publicly available.

Evaluation
In order to quantify the methods' performance, we employ an intuitive evaluation scheme similar to the one used by Kondrak and Hirst (2002): if, for a given word pair, m is the number of alternative gold-standard alignments and n is the number of correctly predicted alignments, the score for that pair is n/m. In the common case of a word pair with a single gold-standard alignment and a single predicted alignment, the latter yields 1 point if it is correct and 0 points otherwise; partially correct alignments do not yield points. The percentage scores are obtained by dividing the points by the total number of pairs.
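The scoring scheme can be sketched as below. The representation of an alignment as a tuple of (segment, segment) pairs with "-" for gaps is an assumption, as is the reading that n counts the predicted alignments that exactly match one of the gold-standard alternatives.

```python
def pair_score(gold_alignments, predicted_alignments):
    """Score n/m for one word pair.

    m is the number of alternative gold-standard alignments and n is the
    number of predicted alignments that exactly match one of them.
    """
    gold = set(gold_alignments)
    n = len(gold & set(predicted_alignments))
    return n / len(gold)


def percentage_score(dataset):
    """Percentage score over a list of (gold_alignments, predicted_alignments)."""
    points = sum(pair_score(gold, predicted) for gold, predicted in dataset)
    return 100.0 * points / len(dataset)


# Toy alignments, for illustration only.
alignment_1 = (("p", "b"), ("a", "a"))
alignment_2 = (("p", "-"), ("-", "b"), ("a", "a"))

toy_dataset = [
    ([alignment_1], [alignment_1]),  # correct prediction: 1 point
    ([alignment_1], [alignment_2]),  # differs from the gold standard: 0 points
]
print(percentage_score(toy_dataset))
```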

Results and Discussion
The alignment performance of our baselines and proposed methods, as well as that of PMI and SCA, on the BDPA data sets is summarized in Table 1.
The first point we would like to draw attention to is that the one-hot encoding scores are consistently lower than those in the other columns. This is expected because, unlike the other methods, one-hot encoding cannot represent the degree of phonetic similarity between IPA segments. Viewing the one-hot encoding scores as a baseline, we conclude that the other methods' distance measures do indeed contribute to sequence alignment. The PHOIBLE feature vectors are roughly on par with the phon2vec embeddings, yielding better results than the NN embeddings on two of the datasets (Japanese and Slavic), and are otherwise outperformed by the NN and the RNN embeddings, as well as PMI and SCA. Part of the low performance of the PHOIBLE vectors may be due to the fact that PHOIBLE does not provide feature vectors for all IPA segments in the BDPA datasets. However, the similar performance of the PHOIBLE vectors and phon2vec, together with the clearly better performance achieved by the NN embeddings, indicates that we can learn (more) useful linguistic generalizations in a data-driven manner.
Of the data-driven methods, phon2vec yields the lowest scores, being outperformed by both neural network models in all datasets except Japanese. Given that both the phon2vec and the NN embeddings are trained on the same data, the consistent performance difference between them points to the usefulness of the sequential order of IPA segments. The better performance of the RNN embeddings over the other data-driven methods is not surprising, as they capture useful information from the multi-lingual data set. Furthermore, the performance of the RNN embeddings is similar to that of the PMI method, yielding better results on many data sets.
For all but the Slavic dataset, SCA yields higher scores than the other methods compared in this study. The score differences exhibit considerable variance, from less than 1 percentage point for the Andean dataset up to 26 percentage points for the Sinitic dataset. A possible explanation for this variance is the fact that not all IPA segments found in the benchmark datasets are found in the training data. For example, NorthEuraLex includes a single tonal language, Mandarin Chinese, and the models cannot produce meaningful embeddings for most of the tones encountered in the Sinitic and Bai datasets. Arguably, a larger training dataset featuring a richer set of IPA segments would produce better-performing embeddings.

Conclusion
In this study we have proposed, implemented, and evaluated three methods for obtaining vector representations of IPA segments for the purposes of pairwise IPA sequence alignment. Our methods outperform a linguistically-informed baseline as well as a trivial one-hot representation, and perform comparably to a state-of-the-art data-driven method. However, the performance of data-driven methods, including ours, seems to lag behind that of a linguistically-informed system, SCA. Nevertheless, the results of the data-driven methods are not too far off the mark, and we believe that they could be significantly improved by using larger and more diverse training data, and by better tuning of the data-driven methods. This constitutes one direction for future experiments; another possibility is to train and use embeddings specific to a particular language family or macro-area. Further investigation is also needed with respect to comparing and evaluating the methods, especially in the context of a larger application, such as cognacy identification or phylogenetic inference.