Using support vector machines and state-of-the-art algorithms for phonetic alignment to identify cognates in multi-lingual wordlists

Most current approaches in phylogenetic linguistics require as input multilingual word lists partitioned into sets of etymologically related words (cognates). Cognate identification has so far been done manually by experts, which is time-consuming and so far only available for a small number of well-studied language families. Automating this step will greatly expand the empirical scope of phylogenetic methods in linguistics, as raw wordlists (in phonetic transcription) are much easier to obtain than wordlists in which cognate words have been fully identified and annotated, even for under-studied languages. A few methods have been proposed in the past, but they either perform disappointingly or are not applicable to larger datasets. Here we present a new approach that uses support vector machines to unify different state-of-the-art methods for phonetic alignment and cognate detection within a single framework. Training and evaluating this method on a typologically broad collection of gold-standard data shows it to be superior to the existing state of the art.


Introduction
Computational historical linguistics is a relatively young sub-discipline of computational linguistics which uses computational methods to uncover how the world's 7 000 human languages have developed into their current shape. The discipline has made great strides in recent years. Exciting progress has been made with regard to automated language classification (Bowern and Atkinson, 2012; Jäger, 2015), inference regarding the time depth and geographic location of ancestral language stages (Bouckaert et al., 2012), and the identification of sound shifts and the reconstruction of ancestral word forms (Bouchard-Côté et al., 2013), to mention just a few. Most of the mentioned and related work relies on multilingual word lists manually annotated for cognacy. Unlike the classical NLP conception, cognate words are here understood as words in different languages which are etymologically related, that is, they have regularly developed from a common ancestral form, such as English tooth and German Zahn 'tooth', which both go back to an earlier Proto-Germanic word tanθ- with the same meaning. Manual cognate classification is a slow and labor-intensive task requiring expertise in historical linguistics and intimate knowledge of the language family under investigation. From a methodological perspective, it can further be problematic to build phylogenetic inference on expert judgments, as the expert annotators necessarily base their judgments on certain hypotheses regarding the internal structure of the language family in question. In this way, the human-annotated cognate sets bear the danger of circularity. Deploying automatically inferred cognate classes thus has two advantages: it avoids the bias inherent in manually collected expert judgments and it is applicable to both well-studied and under-studied language families.
In the typical scenario, the researcher has obtained a collection of multilingual word lists in phonetic transcription (e.g. from field research or from dictionaries) and wants to classify them according to cognacy. Such datasets usually cover many languages and/or dialects (from scores to hundreds or even thousands) but only a small number of concepts (often the 200-item or 100-item Swadesh list or subsets thereof). The machine learning task is to perform cross-linguistic clustering. There exists a growing body of gold standard data, i.e. multilingual word lists covering between 40 and 210 concepts which are manually annotated for cognacy (see Methods section for details). This suggests a supervised learning approach. The challenge here is quite different from most machine learning problems in NLP, though, since the goal is not to identify and deploy language-specific features based on a large amount of mono- or bilingual resources. Rather, the gold standard data have to be used to find cross-linguistically informative features that generalize across arbitrary language families. In the remainder of this paper we will propose such an approach, drawing on and expanding related work such as List (2014b) and Jäger and Sofroniev (2016).

Previous Work
Cognate detection is a partitioning task: a clustering task which does not necessarily assume a hierarchy. An early approach (Dolgopolsky, 1964) is based on the idea of sound classes: in order to reduce the phonetic space and to guarantee comparability across languages, sounds are clustered into classes which frequently occur in correspondence relation in genetically related languages. Dolgopolsky proposed a very rough sound class system, grouping all consonants into ten classes and ignoring vowels. When converting all transcriptions in the data to their respective sound classes, one can use different criteria to assign words resembling each other in their sound classes to the same set of cognate words. Turchin et al. (2010) further formalized this approach and employed a modified sound class schema of 9 consonant classes to test the Altaic hypothesis. Their Consonant Class Matching (CCM) approach was reported to produce a low rate of false positives. Unfortunately, the rate of false negatives is very high (List, 2014b). This is especially due to the lack of flexibility of the procedure, which hard-codes sounds to classes, ignoring that sound change is usually based on fine-grained transitions.
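The sound-class idea can be illustrated with a minimal sketch. The class inventory below is a hypothetical toy mapping, not the published Dolgopolsky or Turchin et al. schema, and the matching criterion (agreement of the first two consonant classes) is an assumption made for illustration, since the text above does not spell out the criterion.

```python
# Hypothetical, heavily simplified consonant classes (NOT the published inventory).
SOUND_CLASSES = {
    "p": "P", "b": "P", "f": "P", "v": "P",
    "t": "T", "d": "T",
    "s": "S", "z": "S",
    "k": "K", "g": "K", "x": "K",
    "m": "M", "n": "N",
    "l": "R", "r": "R",
    "w": "W", "j": "J",
}


def consonant_skeleton(word, length=2):
    """Map segments to sound classes, skipping vowels and unknown symbols."""
    classes = [SOUND_CLASSES[seg] for seg in word if seg in SOUND_CLASSES]
    return tuple(classes[:length])


def ccm_match(word_a, word_b, length=2):
    """Assumed criterion: the first `length` consonant classes must agree."""
    skel_a = consonant_skeleton(word_a, length)
    skel_b = consonant_skeleton(word_b, length)
    return len(skel_a) == length and skel_a == skel_b
```

With this toy mapping, `ccm_match("tand", "dent")` succeeds because both words reduce to the skeleton T-N, whereas `ccm_match("tsan", "tand")` fails because the affricate introduces an S class in second position. The second case shows how a hard-coded class assignment can miss related forms.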
An alternative family of approaches to cognate detection circumvents this problem by first calculating distances or similarities between pairs of words in the data, and then feeding those scores to a flat clustering algorithm which partitions the words into cognate sets. This workflow is very common in evolutionary biology, where it is used to detect homologous genes and proteins (Bernardes et al., 2015). Two basic families of partitioning algorithms can be distinguished: hierarchical cluster algorithms and graph-based algorithms. Hierarchical cluster algorithms are based on classical agglomerative cluster algorithms (Sokal and Michener, 1958), but terminate when a user-defined threshold of average similarities among clusters is reached. In graph-based partitioning algorithms (Andreopoulos et al., 2009), words are represented as nodes in a network and links between nodes represent similarities. When clustering, links are added and removed until the nodes are partitioned into homogeneous groups (van Dongen, 2000).
More important than the clustering algorithm one uses is the computation of pairwise similarity scores between words. Here, different measures have been tested, ranging from simple string distance metrics (Bergsma and Kondrak, 2007), via enhanced sound-class-based alignment algorithms (SCA, List 2014a), to iterative frameworks in which segmental similarities between sounds are either iteratively inferred from the data (Steiner et al., 2011), or aggregated using machine learning techniques (Hauer and Kondrak, 2011). Frameworks may differ greatly regarding their underlying workflow. While the LexStat algorithm by List (2014b) uses a permutation method to compute individual segmental similarities between individual language pairs which are then fed to an alignment algorithm, the PMI similarity approach by Jäger (2013) infers general segmental similarities between sounds from an exhaustive parameter training procedure.

Materials
Benchmark data for training and testing were assembled from different previous studies and considerably enhanced by unifying semantic and phonetic representations and correcting numerous errors in the datasets. Our collection was taken from six major sources (Greenhill et al., 2008; Dunn, 2012; List, 2014b; List et al., 2016b; Mennecier et al., 2016) and covers datasets ranging between 100 and 210 concepts translated into 5 to 100 languages from 13 different language families. Modifications introduced in the process of preparing the datasets included (a) the correction of errata (e.g. orthographic forms in place of phonetic representations), (b) the replacement of non-IPA symbols with their IPA counterparts (e.g. t → ʈ or ' → ʔ), (c) the removal of non-IPA symbols used to convey meta-information (e.g. %), (d) the removal of extraneous phonetic representation variants, and (e) the removal of morphological markers. In addition, all concept labels in the different datasets were linked to the Concepticon (http://concepticon.clld.org, List et al. 2016a), a resource which links concept labels to standardized concept sets in order to ease the exchange and standardization of cross-linguistic datasets. A small sample of the entries extracted from the IELex data is shown in Table 2 for illustration.
guage since the computational effort would have been impractical otherwise. For all data sets, only entries containing both a phonetic transcription and the cognate classification were used.
Table 2: Sample entries for woman in IELex. The cognate class identifier in the last column consists of the concept label and an arbitrary letter combination. If two words share the same cognate class identifier, they are marked as cognate.

Methods
Unlike many other supervised or semi-supervised clustering tasks, the set of cluster labels to be inferred is disjoint from the gold standard labels. Therefore we chose a two-step procedure: (1) A similarity score for each pair of synonymous words from the same dataset is inferred using supervised learning, and (2) these inferred similarities are used as input for unsupervised clustering.
As for subtask (1), the relevant gold standard information consists of the labels "cognate" and "not cognate" for pairs of synonymous words. The sub-goal is to predict a probability distribution over these labels for unseen pairs of synonymous words. This is achieved by training a Support Vector Machine (SVM), followed by Platt scaling (Platt, 1999). The SVM primarily operates on two string similarity measures from the literature, PMI similarity (Jäger, 2013) and LexStat similarity (List, 2014b), which are both known to generalize well across languages and language families. We also used some auxiliary features from Jäger and Sofroniev (2016), which are derived from string similarities. For the clustering subtask (2), we followed List et al. (2016b) and List et al. (2017) in using the Infomap algorithm (Rosvall and Bergstrom, 2008).
The gold standard data were split into a training set and a test set. Feature selection for subtask (1) and parameter training for subtask (2) were achieved via cross-validation over the training data. For evaluation, we trained an SVM on all training data and used it to perform automatic clustering on the test data.
The remainder of this section spells out these steps in detail.

String Similarity Measures
Our strategy is to first calculate string similarities and distances between pairs of words denoting the same concept and then to infer a partition of the corresponding words from those similarities or distances via a partitioning algorithm. For word comparison we utilize two recently proposed string similarity measures.
The first string similarity measure is the one underlying the above-mentioned LexStat algorithm for automatic cognate detection (List, 2014b). The core features of the string similarity produced by the LexStat algorithm include (a) an enhanced sound-class model of 28 symbols, including tone symbols for the handling of South-East Asian tone languages, (b) a linguistically informed scoring function derived from frequently recurring directional sound change processes, and (c) a prosodic tier which automatically defines a prosodic context for each sound in a word and thus allows for a rough handling of context. The LexStat algorithm for determining string similarities can be roughly divided into three stages. In a first stage, words for the same concept in each language pair are aligned, using the SCA algorithm for phonetic alignment (List, 2014b), both globally and locally, and correspondences in the word pairs with a promising score are retained. At the same time, a randomized distribution of expected sound correspondences is calculated, using a permutation method (Kessler, 2001) in which the wordlists are shuffled, so that words denoting different concepts, which are much more likely to be non-cognate, are aligned instead. In a second stage, both distributions are compared, and log-odds scores (Durbin et al., 2002) for each segment pair $s_{x,y}$ are calculated (List, 2014b, 181). In a third stage, the new scoring function is used to re-align the words, using a semi-global alignment algorithm which ignores prefixes or suffixes occurring in one of the two strings (Durbin et al., 2002), and the similarity scores produced by classical alignment algorithms are normalized to distance scores using the formula by Downey et al. (2008):
$$D_{AB} = 1 - \frac{2\, S_{AB}}{S_A + S_B}$$
where $S_{AB}$ is the similarity score of an alignment of two words A and B produced by the SCA method, and $S_A$ and $S_B$ are the similarity scores produced by the alignment of A and B with themselves.
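As a small worked illustration of the normalization step, the sketch below assumes the commonly cited form of the Downey et al. (2008) formula, D = 1 - 2·S_AB / (S_A + S_B). The function name and the input scores are hypothetical and only serve to show how raw alignment scores are turned into a bounded distance.

```python
def downey_distance(s_ab, s_a, s_b):
    """Normalize a raw alignment score into a distance.

    s_ab: score of aligning word A with word B
    s_a, s_b: scores of aligning A with itself and B with itself
    Assumed formula: D = 1 - 2 * s_ab / (s_a + s_b)
    """
    if s_a + s_b == 0:
        raise ValueError("self-alignment scores must not sum to zero")
    return 1.0 - (2.0 * s_ab) / (s_a + s_b)
```

Identical words yield a distance of 0 (`downey_distance(10, 10, 10)`), while a pair whose alignment score is 0 yields a distance of 1. A similarity in the same range can be obtained as one minus the distance.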
In Jäger (2013) a data-driven method for determining string similarities is proposed which we will refer to as PMI similarity, as it is based on the notion of Pointwise Mutual Information between phonetic segments. It has successfully been used for phylogenetic inference in Jäger (2015). The method operates on phonetic strings in ASJP transcription without diacritics, i.e., each segment is assigned one out of only 41 sound classes.
The PMI score of two sound classes a, b is defined as
$$\mathrm{PMI}(a, b) = \log \frac{s(a, b)}{q(a)\, q(b)}$$
where s(a, b) is the probability of a and b being aligned to each other in a pair of cognate words, and q(a), q(b) are the probabilities of occurrence of a and b respectively. Sound pairs with positive PMI score provide evidence for cognacy, and vice versa.
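The definition can be made concrete with a minimal sketch that estimates PMI scores from a list of aligned segment pairs. Several choices here are assumptions for illustration only: the natural logarithm, the symmetrization of pair counts, the use of the marginals of the alignment counts for q, and the absence of smoothing. The estimation procedure in Jäger (2013) is more elaborate.

```python
import math
from collections import Counter


def pmi_scores(aligned_pairs):
    """Estimate PMI(a, b) = log(s(a, b) / (q(a) * q(b))) from aligned segment pairs.

    aligned_pairs: iterable of (a, b) tuples taken from alignments of
    probable cognates. Counts are symmetrized so that PMI(a, b) == PMI(b, a).
    """
    joint = Counter()
    for a, b in aligned_pairs:
        joint[(a, b)] += 0.5
        joint[(b, a)] += 0.5
    total = sum(joint.values())

    # Marginal probability of each segment, derived from the joint counts.
    marginal = Counter()
    for (a, _), count in joint.items():
        marginal[a] += count

    return {
        (a, b): math.log((count / total)
                         / ((marginal[a] / total) * (marginal[b] / total)))
        for (a, b), count in joint.items()
    }


# Toy data: t mostly aligns with t, d with d, and occasionally t with d.
toy_pairs = [("t", "t")] * 6 + [("t", "d")] * 2 + [("d", "d")] * 2
toy_scores = pmi_scores(toy_pairs)
```

In the toy data, identical segments receive positive scores, while the pair t/d receives a negative score because it is aligned less often than its marginal frequencies would predict.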
To estimate the likelihood of sound class alignments, a corpus of probable cognate pairs was compiled from the ASJP database using two heuristics. First, a crude similarity measure between wordlists, based on Levenshtein distance, was defined and the 1% of all ASJP doculect pairs with highest similarity were kept as probably related. Second, the normalized Levenshtein distance was computed for all translation pairs from probably related doculects. Those with a distance below a certain threshold were considered as probably cognate. These probable cognate pairs were used to estimate PMI scores. Subsequently, all translation pairs were aligned via the Needleman-Wunsch algorithm (Needleman and Wunsch, 1970) using the PMI scores from the previous step as weights. This resulted in a measure of string similarity, and all pairs above a certain similarity threshold were treated as probable cognates in the next step. This procedure was repeated ten times. In the last step, approximately 1.3 million probable cognate pairs were used to estimate the final PMI scores. The PMI scores thus obtained are visualized in Figure 1a (numerical values are available from the Supplementary Material of Jäger (2015)). The aggregate PMI score of a pair of aligned strings (where gaps may be inserted at any position) is defined as the sum of the PMI scores of the aligned symbol pairs. Matching a symbol with a gap incurs a penalty, with different penalties for initial and non-initial positions in a sequence of consecutive gaps (the values of the gap penalties were taken from Jäger (2013), where the method of estimating them is described). The similarity $s(w_1, w_2)$ between two strings $w_1$, $w_2$ is then defined as the maximal aggregate PMI score over all possible alignments. It can be computed efficiently via the Needleman-Wunsch algorithm.
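The alignment score described above can be sketched as a Needleman-Wunsch variant with separate penalties for opening and extending a gap (the three-matrix formulation often attributed to Gotoh). This is an illustrative sketch, not the authors' implementation: the function name, the gap penalty defaults, and the toy scoring function are placeholders rather than the values estimated in Jäger (2013), and the score returned is that of the highest-scoring alignment.

```python
NEG_INF = float("-inf")


def nw_similarity(w1, w2, score, gap_open=-2.5, gap_extend=-1.75):
    """Best aggregate score over all global alignments of w1 and w2.

    score(a, b): segment-pair weight (e.g. a PMI score).
    gap_open / gap_extend: penalties for the first and for each further
    gap in a run of consecutive gaps (placeholder values).
    """
    n, m = len(w1), len(w2)
    # match[i][j]: w1[i-1] aligned with w2[j-1]
    # gap1[i][j]:  w1[i-1] aligned with a gap
    # gap2[i][j]:  w2[j-1] aligned with a gap
    match = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    gap1 = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    gap2 = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    match[0][0] = 0.0
    for i in range(1, n + 1):
        gap1[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        gap2[0][j] = gap_open + (j - 1) * gap_extend

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = max(match[i - 1][j - 1], gap1[i - 1][j - 1], gap2[i - 1][j - 1])
            match[i][j] = best_prev + score(w1[i - 1], w2[j - 1])
            gap1[i][j] = max(match[i - 1][j] + gap_open,
                             gap1[i - 1][j] + gap_extend,
                             gap2[i - 1][j] + gap_open)
            gap2[i][j] = max(match[i][j - 1] + gap_open,
                             gap2[i][j - 1] + gap_extend,
                             gap1[i][j - 1] + gap_open)
    return max(match[n][m], gap1[n][m], gap2[n][m])


def toy_score(a, b):
    """Placeholder weights standing in for PMI scores."""
    return 2.0 if a == b else -1.0
```

For example, `nw_similarity("tan", "an", toy_score)` aligns the initial t with a gap and the remaining segments with each other. In practice `score` would look up estimated PMI values instead of the toy weights.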
There are major conceptual differences in how the two similarity measures are derived. LexStat similarity estimates separate scores between each pair of doculects, thus utilizing regular sound correspondences, while PMI similarity uses the same PMI scores regardless of the doculects compared. LexStat alignments further capture a prosodic tier which allows for a rough modeling of phonetic context and reflects theories on the importance of phonetic strength in sound change processes (Geisler, 1992), while the parameters used for computing PMI similarities are estimated in a purely data-driven way without using specifically linguistic insights beyond the classification of sounds into ASJP sound classes. The parameters of the PMI framework are statistically estimated using a large amount (more than 1 000 000 word pairs) of cross-linguistically diverse data. In contrast, LexStat's initial alignment algorithm is based on manually assigned parameters, the final parameters are estimated empirically from the word pairs in the doculects being compared, and no external information is applied. As a result, the algorithm needs a minimum of 100 concepts to yield reliable results and it yields notably better results with more than 200 words (List, 2014b; List, 2014a).
The joint distribution of LexStat and PMI string similarities for cognate and non-cognate pairs within our training set is visualized in Figure 1b.
Despite those differences, the two measures capture a similar signal; for the data from List (2014b) and List et al. (2016b), e.g., their correlation is as high as 0.727. Also, both variables contain similar information about the binary cognate/not cognate variable. Figure 2 shows the Precision-Recall curves (cf. for instance Manning and Schütze, 1999) for LexStat and PMI similarity. While the curves are slightly different (LexStat achieves a higher precision for low recalls and PMI for high recalls), the areas under the curve are almost identical (0.893 for LexStat and 0.880 for PMI).

Workflow
In this study, we utilized both string similarity measures discussed above, as well as a collection of auxiliary predictors pertaining to the similarity of the doculects compared and the differential diachronic stability of lexical meanings, to infer cognate classifications. We chose a supervised learning approach using a Support Vector Machine (SVM) for this purpose. The overall workflow is shown in Figure 3. It consists of two major parts. During the first phase (the upper part in the figure, shown in red), an SVM is trained on a set of training data and then used to predict the probability of cognacy between pairs of words from a set of test data. During the second phase (lower part in the figure, shown in green), those probabilities are used to cluster the words from the test set into inferred cognacy classes. The system is evaluated by comparing the inferred classification with the expert classification. We used the three largest data sets at our disposal (cf. the datasets colored in red in Table 1), ABVD, Central Asian, and IELex, for testing and all other datasets for training.

Support Vector Machine Training
Each data point during the first phase is a pair of words $w_1$, $w_2$ (i.e., a pair of phonetic strings) from doculects $L_1$, $L_2$ from data set $S$, both denoting the same concept $c$. It is mapped to a vector of values for the following features (features 2-5 are taken from Jäger and Sofroniev, 2016):
1. LexStat string similarity between $w_1$ and $w_2$ (computed with LingPy, List and Forkel, 2016),
2. PMI string similarity between $w_1$ and $w_2$,
3. doculect similarity between $L_1$ and $L_2$ as defined in Jäger (2013),
4. mean word length (measured in number of segments) of words for concept $c$ within $S$,
5. correlation coefficient between PMI string similarity and doculect similarity across all word pairs denoting concept $c$ within $S$.
The marginal distributions of those features for cognate and non-cognate pairs (for the data from List (2014b) and List et al. (2016b)) are displayed in Figure 4. It can be discerned from these plots that word length is a negative predictor and the other four features are positive predictors for cognacy. The fact that word length is a negative predictor of cognacy arguably results from the interplay of two known regularities. (1) Pagel et al. (2007) present evidence that diachronic stability of concepts is positively correlated with their usage frequency in modern corpora. (2) According to Zipf's Law of Abbreviation (Zipf, 1935), there is a negative correlation between the corpus frequency of words and their lengths. Taken together, this means that concepts usually being expressed by short words tend to have a high usage frequency and therefore tend to be diachronically stable. Therefore we expect a higher proportion of cognate pairs among concepts expressed by short words than among those expressed by long words.
As the data points within the training set are mutually non-independent, we randomly chose one word pair per concept and data set for training the SVM. During the training phase, we used cross-validation over the data sets within the training set (i.e., using one training data set for validation and the other training data sets for SVM training) to identify the optimal kernel and its optimal parameters. This was carried out by completing both phases of the workflow and optimizing the Adjusted Rand Index (see the Evaluation subsection) of the resulting classification. Training and prediction were carried out using the svm module from the Python package sklearn (http://scikit-learn.org/stable/modules/svm.html), which is based on the LIBSVM library (Fan et al., 2005). Predicting class membership probabilities from a trained SVM was carried out using Platt scaling (Platt, 1999) as implemented in sklearn (http://scikit-learn.org). This results in a predicted probability of cognacy $p(w_1, w_2 \mid c, S)$ for each data point. The best cross-validation performance was achieved with a linear kernel with a penalty value of C = 0.82. Polynomial and RBF kernels performed slightly worse. Also, we found that leaving out any subset of the features decreases performance.
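The training and prediction step can be sketched with scikit-learn as follows. This is a minimal illustration, not the authors' pipeline: the five-column feature matrix is synthetic stand-in data generated from made-up distributions, and only the linear kernel, the value C = 0.82, and the use of Platt scaling (enabled in `SVC` through `probability=True`) are taken from the text.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for the five features (LexStat similarity, PMI similarity,
# doculect similarity, standardized mean word length, PMI/doculect correlation).
cognate = rng.normal(loc=[0.7, 0.7, 0.6, -0.5, 0.5], scale=0.15, size=(n, 5))
non_cognate = rng.normal(loc=[0.3, 0.3, 0.4, 0.5, 0.1], scale=0.15, size=(n, 5))
X = np.vstack([cognate, non_cognate])
y = np.array([1] * n + [0] * n)  # 1 = cognate, 0 = not cognate

# Linear kernel with C = 0.82; probability=True fits Platt scaling internally.
clf = SVC(kernel="linear", C=0.82, probability=True, random_state=0)
clf.fit(X, y)


def cognacy_probability(features):
    """Return the predicted probability of the 'cognate' class for each row."""
    col = list(clf.classes_).index(1)
    return clf.predict_proba(np.atleast_2d(features))[:, col]
```

Calling `cognacy_probability` on a new feature vector yields a value between 0 and 1 that can be passed on to the clustering step. In a real setting the feature matrix would be computed from the word pairs of the training data sets.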

Cognate Set Partitioning
In order to cluster the words into sets of potentially cognate words, we follow recent approaches by List et al. (2016b) and List et al. (2017) in using Infomap (Rosvall and Bergstrom, 2008), an algorithm which was originally designed for the detection of communities in large social networks, to detect "communities" of related words. Infomap uses random walks in undirected networks to identify the best way to assign the nodes in the network, that is, in our case, the words, to distinct groups which form a homogeneous class.
For each data set D and each concept c covered in D, a network was constructed. The vertices are all words from D denoting c. Two vertices are connected if and only if the corresponding words are predicted to be cognate with a probability ≥ θ according to SVM prediction + Platt scaling. The optimal value for θ was determined as 0.66 via cross-validation over the training data. Infomap was then applied to this network, resulting in an assignment of class labels to vertices/words.
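The network construction for a single concept can be sketched as below. Note the substitution: Infomap is not part of the Python standard library, so this sketch uses plain connected components (via union-find) as a stand-in for the community detection step. Connected components will merge any words linked by a chain of edges, whereas Infomap may split such a component further, so the output is only an approximation of the procedure described above. The function name and the input format are hypothetical.

```python
def cluster_concept(words, prob, theta=0.66):
    """Partition the words for one concept into putative cognate sets.

    words: list of word identifiers for one concept in one data set
    prob:  dict mapping frozenset({w1, w2}) to the predicted cognacy probability
    theta: edge threshold (0.66 is the value reported in the text)
    Stand-in clustering: connected components instead of Infomap.
    """
    parent = {w: w for w in words}

    def find(w):
        # Find the component representative, compressing paths on the way.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    # Connect two words if and only if their predicted probability is >= theta.
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if prob.get(frozenset((a, b)), 0.0) >= theta:
                parent[find(a)] = find(b)

    # Assign consecutive integer labels to the components.
    labels = {}
    return {w: labels.setdefault(find(w), len(labels)) for w in words}


toy_prob = {
    frozenset(("w1", "w2")): 0.70,
    frozenset(("w2", "w3")): 0.90,
    frozenset(("w1", "w3")): 0.50,
    frozenset(("w3", "w4")): 0.30,
}
toy_labels = cluster_concept(["w1", "w2", "w3", "w4"], toy_prob)
```

In the toy example, w1, w2 and w3 end up in one class even though the direct link between w1 and w3 falls below the threshold, and w4 forms a class of its own.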

Evaluation
We used two evaluation measures to compare inferred with expert classifications on the test data. The Adjusted Rand Index (ARI, Hubert 1985) assesses how much the equivalence relations induced by two partitions coincide. It assumes real values ≤ 1, where 1 means "perfect agreement" and 0 means "degree of agreement expected by chance". Negative values may result from an agreement smaller than expected by chance. B-Cubed scores (Bagga and Baldwin, 1998) measure precision and recall of a partition analysis compared against a gold standard by computing an individual accuracy score for the cluster decisions on each item in the data and then averaging the results. Hauer and Kondrak (2011) were the first to introduce this measure to test the accuracy of multilingual cognate detection algorithms. In contrast to pair scores such as ARI, B-Cubed scores have the advantage of being independent of the evaluation data itself. While pair scores tend to vary greatly depending on dataset size and cognate density, B-Cubed scores do not show this effect. They are reported as precision and recall. A low B-Cubed precision almost directly translates to the classical notion of a high amount of false positive cognate judgments made by an algorithm, while low B-Cubed recall points to a large amount of cognate sets which were missed by an algorithm.
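The item-wise averaging behind B-Cubed scores can be sketched as follows. This is a minimal illustration using the standard definition (for each item, the overlap between its gold cluster and its predicted cluster, divided by the size of the predicted cluster for precision and of the gold cluster for recall); the function name and input format are hypothetical.

```python
def b_cubed(gold, pred):
    """Return B-Cubed (precision, recall, F-score) for two partitions.

    gold, pred: dicts mapping each item to a cluster label; both must
    cover the same items.
    """
    items = list(gold)

    def cluster_of(labels):
        # Map each item to the set of items sharing its label.
        clusters = {}
        for item in items:
            clusters.setdefault(labels[item], set()).add(item)
        return {item: clusters[labels[item]] for item in items}

    gold_of, pred_of = cluster_of(gold), cluster_of(pred)
    precision = sum(len(gold_of[i] & pred_of[i]) / len(pred_of[i]) for i in items) / len(items)
    recall = sum(len(gold_of[i] & pred_of[i]) / len(gold_of[i]) for i in items) / len(items)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


toy_gold = {"a": 1, "b": 1, "c": 2, "d": 2}
lumped = {"a": 0, "b": 0, "c": 0, "d": 0}    # everything in one cluster
split = {"a": 0, "b": 1, "c": 2, "d": 3}     # every word on its own
```

Lumping all four words together gives perfect recall but a precision of 0.5 (too many false positive cognate judgments), whereas splitting every word into its own class gives perfect precision but a recall of 0.5 (missed cognate sets).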
We took the original LexStat algorithm as a baseline with which we compare our results.
LexStat provides a good baseline, since it was shown to outperform alternative approaches like the above-mentioned CCM approach (Turchin et al., 2010), or clustering based on alternative string similarity measures, like the normalized edit distance, or the normalized scores of the above-mentioned SCA algorithm (List, 2014b). The LexStat implementation in LingPy offers different methods for cognate clustering. Since we employed Infomap for our SVM approach, and since Infomap clustering was shown to work well with LexStat similarities (List et al., 2017), we also used Infomap as the cluster algorithm for the LexStat approach. Since Infomap requires a threshold, we trained the threshold on our training data, excluding short wordlists. Optimal results on the training data were obtained with θ* = 0.57.

Results and Outlook
The evaluation results are given in Table 3, and the differences to the baseline are visualized in Figure 5. On average, the SVM-based classification shows a superior performance when compared to the baseline (an improvement of 0.7% ARI and 0.5% B-cubed F-score). This is mostly due to a substantial improvement for the Austronesian data (4.3% ARI/2.1% B-cubed F-score). Our method slightly outperforms the baseline for Indo-European but is minimally inferior when applied to the Central Asian data. While this might seem a minor improvement only, it is worth exploring on what type of data our method makes progress.
The plot in Figure 6 shows the dependency of performance (ARI) on the number of concepts per database for the training data. While this result has to be taken with a grain of salt as it involves the data used for model fitting, the pattern is both plausible and striking. It shows that our method clearly outperforms LexStat if the number of concepts is smaller than 100. This finding is unsurprising since LexStat depends on regular sound correspondences. If those cannot be reliably inferred due to data sparseness, its performance drops. Our method is more robust here as it makes use of the PMI string similarity, which does not rely on language-specific information. This may also explain the performance on the Austronesian data: although it covers 210 concepts across 100 languages, the word lists contain many gaps, and many languages have only 100 words or even fewer.
In order to get a clearer impression of where our algorithm failed, we compared false positives and negatives in the Indo-European data (Dunn, 2012), which has been investigated in great detail during the last 200 years. While a quantitative comparison of part of speech and word length did not reveal any strong correlations with the accuracy of our approach, a qualitative analysis showed that false positives produced by our approach are usually due to language-specific factors. Among the factors triggering false negatives, there are specific morphological processes involving complex paradigms, such as Proto-Indo-European *séh₂wel- 'sun', which shows many suffixes in its descendant forms, and specific instances of sound change, involving words that were drastically changed (cf. English four vs. French quatre). False positives are not only due to chance similarities (compare English much with Spanish mucho), but also due to words which share morphological elements but are marked as non-cognate in our gold standard (cf. Dutch man vs. German Ehemann 'husband'), and errors in the gold standard (cf. Upper Sorbian powjaz vs. Lower Sorbian powrjoz 'rope', wrongly marked as non-cognate in the gold standard).
The classical methods for the identification of cognate words in genetically related languages are based on the general idea that relatedness can be rigorously proven. This requires that the languages under investigation have retained enough similarity to identify regular sound correspondences. The further we go back in time, however, the fewer similarities we find. The fact that an algorithm like LexStat, which closely mimics the classical comparative method in historical linguistics, needs at least 100 (if not more) concepts in order to yield a satisfying performance reflects this problem of data sparseness in historical linguistics. One could argue that a serious analysis in historical linguistics should never be carried out if data are too sparse. As an alternative to this agnostic attitude, however, one could also try to work on methods that go beyond the classical framework, adding a probabilistic component where data are too sparse to yield indisputable proof. In this paper, we have tried to make a first step in this direction by testing the power of machine learning approaches with state-of-the-art measures for string similarity in quantitative historical linguistics. The fact that our approach outperforms existing automatic approaches shows that this direction could prove fruitful in future research.
and the DFG research fellowship grant 261553824 Vertical and lateral aspects of Chinese dialect history (JML). We also thank all scholars who contributed to this study by sharing their data.