Similarity Dependent Chinese Restaurant Process for Cognate Identification in Multilingual Wordlists

We present and evaluate two similarity dependent Chinese Restaurant Process (sd-CRP) algorithms at the task of automated cognate detection. The sd-CRP clustering algorithms do not require any predefined threshold for detecting cognate sets in a multilingual word list. We evaluate the performance of the algorithms on six language families (more than 750 languages) and find that both sd-CRP variants perform as well as InfoMap and better than UPGMA at the task of inferring cognate clusters. The algorithms presented in this paper are family agnostic and can be applied to any linguistically under-studied language family.


Introduction
Cognates are related words across languages that have descended from a common ancestral language. Identification of cognates is an important step in historical linguistics when establishing genetic relations between languages that are hypothesized to have descended from a single language that existed in the past. For instance, English hound and German Hund "dog" are cognates that go back to the Proto-Germanic stage. Cognate identification requires a great amount of scholarly effort and has been carried out for some language families such as Indo-European, Dravidian, Austronesian, and Uralic, which have a long tradition of comparative linguistic research involving decades (Dravidian) to centuries (Indo-European) of scholarly effort. Automatic detection of cognates with high accuracy is therefore highly desirable for reducing the effort required in analyzing understudied language families of the world.
Typically, expert annotated cognate sets are employed to infer phylogenetic trees showing language relationships that can be used to test hypotheses about the temporal and spatial evolution of language families (Bouckaert et al., 2012; Chang et al., 2015), linguistic reconstruction of ancestral states on a tree, or lexical reconstruction (Bouchard-Côté et al., 2013). Rama et al. (2018) showed that cognates inferred from automated methods of cognate detection can be used to infer high quality phylogenetic trees. The authors noted that there is a need for more research towards developing highly accurate cognate identification methods that can be applied to data from less well-studied language families, assisting historical linguists in automating parts, if not the whole, of the comparative method.
Most of the above cognate identification methods involve a two-step workflow: first, distances between all word pairs sharing the same meaning are computed using a machine learning or sequence alignment algorithm; then, the pairwise distance matrix is clustered using a clustering algorithm such as InfoMap (Rosvall and Bergstrom, 2008) or UPGMA (Unweighted Pair Group Method with Arithmetic Mean; Sokal and Michener, 1958).
Both InfoMap and UPGMA require a predefined threshold that is either set heuristically or tuned to obtain optimal performance. On the other hand, a non-parametric clustering method such as the Chinese Restaurant Process (CRP; Gershman and Blei 2012) can form clusters directly from the data without the need for tuning a threshold. CRP has found application in different NLP tasks such as morphological segmentation (Goldwater et al., 2006), language modeling (Goldwater et al., 2011), machine translation (Ravi and Knight, 2011), part-of-speech induction (Blunsom and Cohn, 2011; Sirts et al., 2014), and language decipherment (Snyder et al., 2010).

In this paper, we present two clustering algorithms inspired by the similarity dependent Chinese Restaurant Process for the purpose of inferring cognate clusters. Our CRP-based clustering algorithms take a word pair similarity matrix as input and infer cognate clusters automatically without needing any threshold. The sd-CRP algorithms have a hyperparameter α that allows us to form new clusters. We compare the performance of the CRP algorithms on six different language families and find that the CRP algorithms perform better than UPGMA and yield better or competitive performance compared to InfoMap. We sample α so that the algorithms are robust to its initial value.
The paper is organized as follows. We describe related work in section 2. In section 3, we describe the word similarity features used to train the SVM model. We describe sd-CRP, UPGMA, and InfoMap algorithms in section 4. We describe the evaluation metrics and datasets in section 5. We present the results of our experiments in section 6. We discuss the results by analyzing the effect of features on SVM model, initial α values, and missing data on the performance of clustering in section 7. Finally, we conclude and present directions for future work in section 8.

Related work
Most of the automated cognate identification work mentioned in the previous section employed either the UPGMA or the InfoMap algorithm. Hauer and Kondrak (2011) were the first to apply the UPGMA clustering algorithm to infer cognate sets from Swadesh lists. The authors trained an SVM classifier based on string similarity features to calculate word distances between all word pairs for a meaning. The pairwise distance matrix is then supplied to UPGMA with a predefined threshold for inferring word clusters. The UPGMA algorithm is simple and yields reasonable results across various language families (List, 2012a). However, the UPGMA clustering algorithm depends on a threshold that needs to be tuned to obtain optimal performance (List et al., 2017b). The cognate identification work of Hall and Klein (2011) and Bouchard-Côté et al. (2013) requires the phylogenetic tree of the language family to be known beforehand, which is an unrealistic assumption for a large number of the world's language families. In another work, List et al. (2016) employ a weighted variant of Levenshtein distance known as SCA (see section 3) for calculating the similarity between two words. Then, they apply a community detection algorithm known as InfoMap for the purpose of discovering partial cognate sets in multiple groups of the Sino-Tibetan language family. The authors find that the InfoMap algorithm works better than UPGMA when the threshold is tuned. In this paper, we compare the CRP clustering algorithms against InfoMap and the similarity variant of the UPGMA algorithm described in section 4.3.

Word similarity model
In this section, we present the word similarity features used to train our SVM model at the binary task of classifying whether a word pair is cognate or non-cognate.
String similarity features We use length-normalized edit distance, number of common bigrams, common prefix length, individual word lengths, and the absolute difference between the word lengths as features for training an SVM classifier (Hauer and Kondrak, 2011). We refer to this feature set as HK.
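To make the feature set concrete, the following is a minimal sketch in Python; the function names and the set-based treatment of bigram overlap are our assumptions, not details taken from Hauer and Kondrak's implementation:

```python
def edit_distance(a, b):
    # standard dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def hk_features(w1, w2):
    # HK feature vector: length-normalized edit distance, number of common
    # bigrams, common prefix length, both word lengths, absolute length difference
    bigrams = lambda w: {w[i:i + 2] for i in range(len(w) - 1)}
    prefix = 0
    for c1, c2 in zip(w1, w2):
        if c1 != c2:
            break
        prefix += 1
    return [edit_distance(w1, w2) / max(len(w1), len(w2)),
            len(bigrams(w1) & bigrams(w2)),
            prefix,
            len(w1),
            len(w2),
            abs(len(w1) - len(w2))]
```

For instance, `hk_features("hund", "hound")` yields a normalized edit distance of 0.2, two shared bigrams, and a common prefix of length one.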
Point-wise Mutual Information (PMI) We include the PMI weighted Needleman-Wunsch (Needleman and Wunsch, 1970) word similarity score (Jäger, 2013) as an additional feature for training the SVM classifier. The (unweighted or vanilla) Needleman-Wunsch algorithm is the similarity counterpart of the Levenshtein distance. The vanilla Needleman-Wunsch algorithm assigns equal negative weight to a common sound correspondence such as /s/ ∼ /h/ and a highly improbable sound correspondence such as /p/ ∼ /r/. The PMI weighted sound pair matrix inferred in Jäger (2013) assigns a positive weight to common sound correspondences and a negative weight to improbable ones. The PMI weight for two sounds i and j is defined as log(p(i, j) / (q(i) · q(j))), where p(i, j) is the relative frequency of i and j occurring at the same position in the aligned word pairs and q(.) is the relative frequency of a sound in the whole word list. The similarity score for a word pair is computed using the PMI-weighted Needleman-Wunsch algorithm. We transform the word similarity score using a sigmoid function to yield a score between 0 and 1.0.
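The pieces above can be sketched as follows; in practice the `score` function would be backed by Jäger's PMI matrix, while the toy identity-based scorer used here is purely illustrative:

```python
import math

def pmi_weight(p_ij, q_i, q_j):
    # PMI weight for sounds i and j: log( p(i,j) / (q(i) * q(j)) );
    # positive for frequent correspondences, negative for improbable ones
    return math.log(p_ij / (q_i * q_j))

def needleman_wunsch(a, b, score, gap=-1.0):
    # global alignment (similarity) score under an arbitrary pairwise
    # scoring function; the PMI matrix would supply `score` in practice
    n, m = len(a), len(b)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + gap
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = max(D[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                          D[i - 1][j] + gap,
                          D[i][j - 1] + gap)
    return D[n][m]

def sigmoid(x):
    # squash the raw similarity into (0, 1) for use as an SVM feature
    return 1.0 / (1.0 + math.exp(-x))
```

A word pair's feature value is then `sigmoid(needleman_wunsch(w1, w2, score))`.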
SCA We experimented with SCA (Sound Class Based Phonetic Alignment) word distance score (List et al., 2016) as an additional feature in our SVM model and found that inclusion of this feature improves the performance of cognate clustering systems. The SCA distance score is computed using the LingPy library (List et al., 2017a).
All the above features are widely used in the cognate identification papers cited in sections 1 and 2. All the string similarity features are computed on words represented in the ASJP code, which consists of symbols found on a standard QWERTY keyboard. The ASJP code consists of 41 symbols that are used to represent common sounds of the world's languages. As such, it collapses some distinctions between similar sounds, for example by using a single 'r' symbol for all the rhotic sounds. In this paper, we used the LingPy library to convert IPA symbols to ASJP symbols. Our SVM model is implemented using scikit-learn (Buitinck et al., 2013). The trained SVM model is then used to predict the confidence scores for all the word pairs having the same meaning.

Clustering algorithms
In this section, we motivate and describe the two sd-CRP algorithms, followed by the InfoMap and UPGMA clustering algorithms.

Motivation for CRP
In the traditional CRP, the probability that a new customer i sits at a table already occupied by other customers is proportional to the number of customers sitting at that table. The probability that the new customer sits at a new table is proportional to α. Blei and Frazier (2011) extended the traditional CRP to a distance-dependent CRP model (dd-CRP) where customer i sits with a different customer j with a probability proportional to f(d_ij), where f is a decay function and d_ij is the distance between customers i and j. The new customer can sit by itself with a probability proportional to α. The dd-CRP formulation forms clusters through connections between the customers. This ability to form clusters depending on the data is directly relevant for inferring cognate clusters from a word pair distance matrix.
In a later paper, Socher et al. (2011) introduced a similarity dependent CRP (sd-CRP) algorithm that can handle arbitrary similarities between two customers. Socher et al. (2011) showed that their sd-CRP variant performs better than dd-CRP when clustering MNIST digits dataset and Newsgroup articles. A customer is a word in the context of cognate identification. We describe the two variants of sd-CRP -ns-CRP and sb-CRP -that work directly with a similarity matrix S in the next section.

sd-CRP algorithms
Given a word similarity matrix S ∈ R^{N×N} and α, the CRP algorithm clusters the N elements into K clusters, where 1 ≤ K ≤ N.

ns-CRP
The algorithm starts by placing each word into its own cluster. At each step, the algorithm assigns a word w_i to the cluster C that has the highest net similarity with w_i. We define the net similarity between w_i and a cluster C as Σ_{j=1}^{|C|} S(w_i, w_j); we call the algorithm ns-CRP after this net similarity criterion used to perform cluster assignments. w_i is assigned to a new cluster if αS(w_i, w_i) is greater than any of the similarities with the existing clusters. Any empty clusters remaining at the end of an iteration are removed. The cluster inference procedure is summarized in Algorithm 1.

Algorithm 1 ns-CRP
Input: S, α
Output: Cluster assignments
1. Initialize each word into its own cluster and set α to 0.1.
2. Repeat until convergence:
   • For each word w_i:
     - Remove w_i from its cluster.
     - Compute the net similarity s_ik between w_i and all words in each cluster k.
     - Assign w_i to the cluster with the highest net similarity, or to a new cluster if αS(w_i, w_i) is higher.
   • Sample α using a Metropolis-Hastings step.
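The iteration above can be sketched in Python as follows. This is a simplified version that omits the α-sampling step; the list-of-lists matrix layout and the label bookkeeping are our assumptions:

```python
def ns_crp(S, alpha, iters=100):
    # Sketch of ns-CRP (without the alpha-sampling step).
    # S is a symmetric similarity matrix given as a list of lists.
    n = len(S)
    assign = list(range(n))              # start: each word in its own cluster
    for _ in range(iters):
        changed = False
        for i in range(n):
            old = assign[i]
            assign[i] = -1               # remove w_i from its cluster
            clusters = {}
            for j in range(n):
                if assign[j] >= 0:
                    clusters.setdefault(assign[j], []).append(j)
            best_c, best_s = None, alpha * S[i][i]   # weight of a new cluster
            for c, members in clusters.items():
                s = sum(S[i][j] for j in members)    # net similarity s_ik
                if s > best_s:
                    best_c, best_s = c, s
            if best_c is None:
                # open a new cluster; reuse the old label if w_i was alone
                best_c = old if old not in assign else max(assign) + 1
            assign[i] = best_c
            changed |= (best_c != old)
        if not changed:                  # convergence: no reassignments
            break
    labels = {}                          # relabel clusters as 0..K-1
    return [labels.setdefault(c, len(labels)) for c in assign]
```

On a toy matrix with two similar word pairs, the sketch recovers the two clusters; a very large α drives every word into its own cluster, mirroring the role of α in the model.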
The sb-CRP algorithm forms clusters through directed links between words: each word has a single outgoing link, either to another word or to itself, and clusters are read off as the connected components of the resulting graph. SitBehind(w_j) returns the set of words that sit behind w_j, i.e., the words that reach w_j by following outgoing links. A self-link indicates that w_i is in its own cluster. The probability of forming a directed link from w_i to w_j is proportional to the sum of the similarities between w_i and all the words in the set returned by SitBehind(w_j). The weight for linking w_i to itself is computed as αS(w_i, w_i). The sb-CRP procedure is summarized in Algorithm 2.

Algorithm 2 sb-CRP
Input: S, α
Output: Cluster assignments
1. Initialize each word into its own cluster and set α to 0.1.
2. Repeat until convergence:
   • For each word w_i:
     - Remove the outgoing link from w_i.
     - Compute the net similarity s_ik between w_i and the words in the set returned by SitBehind(w_k).
     - Link w_i to the word (or to itself) with the highest weight.
   • Sample α using a Metropolis-Hastings step.
We present the result of applying the sb-CRP algorithm to the meaning fish in figure 1. The algorithm places the words correctly in their own clusters. The algorithm forms singleton clusters by forming self-loops. For instance, the algorithm links Ancient Greek ikhthis to itself, thus placing the word in its own cluster. When two words belonging to Bihari and Oriya are highly similar (maTh ∼ maTho), the algorithm links both words to each other, forming a cycle.
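Our reading of the SitBehind function can be sketched as follows; the `links` array, which holds the single outgoing link of each word, is an assumption about the underlying data structure:

```python
def sit_behind(links, j):
    # Words that reach w_j by following outgoing links, plus w_j itself.
    # links[i] is the target of word i's outgoing link; links[i] == i is a
    # self-link, i.e. w_i sits in its own cluster.
    behind, frontier = {j}, [j]
    while frontier:
        k = frontier.pop()
        for i, target in enumerate(links):
            if target == k and i not in behind:
                behind.add(i)
                frontier.append(i)
    return behind
```

With `links = [0, 0, 3, 3, 2]`, word 0 self-links (a singleton cluster joined by word 1), while words 2 and 3 link to each other, forming the kind of cycle described above, with word 4 sitting behind them.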

Underlying objective
Given K clusters out of which n are non-singleton, algorithm 1 maximizes the following objective, where k is the cluster index:

Σ_{k=1}^{n} Σ_{w_i, w_j ∈ C_k} S(w_i, w_j) − α Σ_{w_i singleton} S(w_i, w_i)    (1)
In the initial step, the objective in equation 1 is −α Σ_i S(w_i, w_i), and it increases until there is no change in the cluster assignments. The objective for algorithm 2 is similar to equation 1 and only differs in the positive part due to the SitBehind function. We use the above objective to sample α, which is explained below. We observe that the objective function given in equation 1 is similar to the CRP extension of K-Means (DP-Means) proposed by Kulis and Jordan (2011), who show that the DP-Means algorithm converges to a local optimum.

Sampling α
We sample α using a Metropolis-Hastings step at the end of each iteration. We assume an exponential prior for α with rate parameter 10; we choose an exponential prior since α should be greater than zero and the support of the exponential distribution is R^+. We use an asymmetric multiplier proposal q(α*|α) = α · e^{ε(u−0.5)}, where u ∈ [0, 1] is a uniform random number, to propose a new value α*. The Hastings ratio for this multiplier proposal is e^{ε(u−0.5)}, where ε (= 1) is the tuning parameter that controls the range of the proposed α* (Lakner et al., 2008). Since we sample α on fixed cluster assignments, the likelihood ratio is equal to α*/α. The prior ratio is equal to exp(−10α*)/exp(−10α). In this paper, we run both sd-CRP algorithms with the initial value of α set to 0.1 for 100 iterations. We found that the algorithms converge within the first ten iterations (see section 7.4). The algorithms take less than three hours to run on the Austronesian language family. We report the final iteration's B-cubed F-scores and ARI scores (see section 5.2) for each dataset.
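The sampling step can be sketched as follows, under the stated rate-10 exponential prior and the likelihood ratio α*/α from the text; the accept/reject bookkeeping is our own framing:

```python
import math, random

def sample_alpha(alpha, eps=1.0, rate=10.0):
    # One Metropolis-Hastings step for alpha with the multiplier proposal
    # alpha* = alpha * exp(eps * (u - 0.5)), u ~ Uniform(0, 1), and an
    # Exponential(rate) prior. On fixed cluster assignments the
    # likelihood ratio is alpha*/alpha.
    u = random.random()
    m = math.exp(eps * (u - 0.5))            # multiplier
    proposal = alpha * m
    likelihood_ratio = proposal / alpha
    prior_ratio = math.exp(-rate * (proposal - alpha))
    hastings_ratio = m                       # for a multiplier proposal
    accept = min(1.0, likelihood_ratio * prior_ratio * hastings_ratio)
    return proposal if random.random() < accept else alpha
```

Because the proposal multiplies the current value by a positive factor, the chain never leaves the support R^+ of the prior.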

Other Clustering algorithms
UPGMA The variant of St Arnaud et al. (2017) applies a ReLU transformation (max(0, s)) to the pairwise similarity matrix S so that the matrix consists only of non-negative similarity scores. In the initial step, each word is placed in its own cluster. The mutual score between two clusters is computed as the average of the similarity scores between all word pairs across the two clusters. In each step, the algorithm merges the two clusters with the highest pairwise score. The merging process stops when no two clusters have a positive average similarity score.
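This threshold-free merging loop can be sketched as follows (a naive O(n^3) version; the function name and matrix layout are ours):

```python
def relu_upgma(S):
    # Threshold-free UPGMA variant: ReLU the similarity matrix, then
    # repeatedly merge the cluster pair with the highest average pairwise
    # similarity, stopping when no pair has a positive average score.
    n = len(S)
    S = [[max(0.0, s) for s in row] for row in S]   # ReLU transformation
    clusters = [[i] for i in range(n)]
    while len(clusters) > 1:
        best, pair = 0.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                avg = (sum(S[i][j] for i in clusters[a] for j in clusters[b])
                       / (len(clusters[a]) * len(clusters[b])))
                if avg > best:
                    best, pair = avg, (a, b)
        if pair is None:          # no positive average similarity left
            break
        a, b = pair
        clusters[a] += clusters.pop(b)
    return clusters
```

Negative cross-cluster similarities are zeroed by the ReLU, so merging stops naturally once only dissimilar clusters remain.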
InfoMap is an information-theoretic clustering algorithm that uses random walks to detect clusters in a network (Rosvall and Bergstrom, 2008). We transform the similarity matrix into a distance matrix by applying a sigmoid transformation and then subtracting the matrix values from 1.0. Then, we apply a predefined threshold to form a (possibly disconnected) graph, which we supply as input to the InfoMap algorithm to infer clusters. We experimented with the threshold during cross-validation experiments on the training dataset and found that a threshold of 0.57 yielded slightly higher performance than a threshold of 0.5.
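The preprocessing that builds the graph handed to InfoMap can be sketched as follows (the edge-list representation is an assumption; InfoMap itself is an external algorithm and is not reimplemented here):

```python
import math

def distance_graph(S, threshold=0.57):
    # similarities -> distances d = 1 - sigmoid(s); keep edge (i, j) only
    # when d < threshold, yielding the (possibly disconnected) graph
    # that is handed to InfoMap
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    n = len(S)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 - sigmoid(S[i][j])
            if d < threshold:
                edges.append((i, j, d))
    return edges
```

Word pairs with high similarity end up connected by a short edge; dissimilar pairs are dropped entirely, disconnecting the graph.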

Materials and Evaluation
In this section, we describe the datasets and cluster evaluation metrics.

Datasets
Training dataset Wichmann and Holman (2013) and List (2014) compiled cognacy-annotated multilingual word lists for subsets of families from various scholarly sources such as comparative handbooks and historical linguistics articles. The detailed references to all the datasets are given in . Below, we provide the number of languages/number of meanings in each language group in parentheses.

Evaluation
We use B-cubed F-score (Amigó et al., 2009) and Adjusted Rand Index (Hubert and Arabie, 1985) to evaluate the quality of the inferred clusters.
B-cubed F-scores are defined for each individual item (word) as follows. The precision for an item is defined as the ratio of the number of its cognates in its cluster to the total number of items in its cluster. The recall for an item is defined as the ratio of the number of its cognates in its cluster to the total number of expert-labeled cognates. Finally, the B-cubed F-score for a meaning is computed as the harmonic mean of the items' average precision and recall. The B-cubed F-score for the whole dataset is computed as the average of the B-cubed F-scores across all the meanings.
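The per-meaning computation can be sketched as follows (the function name is ours; `pred` and `gold` hold the predicted and expert cluster label of each word for one meaning):

```python
def b_cubed_f(pred, gold):
    # For each word: precision is the fraction of its predicted cluster
    # sharing its gold label; recall is the fraction of its gold cluster
    # captured by its predicted cluster. F is the harmonic mean of the
    # averaged precision and recall.
    n = len(pred)
    prec = rec = 0.0
    for i in range(n):
        same_pred = [j for j in range(n) if pred[j] == pred[i]]
        same_gold = [j for j in range(n) if gold[j] == gold[i]]
        correct = len([j for j in same_pred if gold[j] == gold[i]])
        prec += correct / len(same_pred)
        rec += correct / len(same_gold)
    prec, rec = prec / n, rec / n
    return 2 * prec * rec / (prec + rec)
```

For example, lumping two gold clusters of size two into one predicted cluster gives perfect recall but precision 0.5, and hence an F-score of 2/3.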
Adjusted Rand Index (ARI) is a chance-corrected version of the Rand index (Hubert and Arabie, 1985). ARI scores range from −1 to +1. A score of 0 indicates that the obtained clusters are randomly labelled, whereas a score of +1 indicates a perfect match between the two clusterings. The ARI score is zero whenever the gold standard groups all the words belonging to the same meaning slot as one cluster (e.g. words for the meaning name are cognate across the daughter Indo-European languages), whereas the B-cubed F-score is not zero in such a case.
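A self-contained sketch of the ARI via the pair-counting contingency table (equivalent to what library implementations such as scikit-learn's `adjusted_rand_score` compute):

```python
from math import comb

def adjusted_rand_index(pred, gold):
    # Chance-corrected Rand index computed from the contingency table.
    n = len(pred)
    cont = {}
    for p, g in zip(pred, gold):
        cont[(p, g)] = cont.get((p, g), 0) + 1
    a, b = {}, {}                 # predicted / gold cluster sizes
    for (p, g), c in cont.items():
        a[p] = a.get(p, 0) + c
        b[g] = b.get(g, 0) + c
    index = sum(comb(c, 2) for c in cont.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:     # degenerate case, e.g. one gold cluster
        return 0.0
    return (index - expected) / (max_index - expected)
```

The degenerate branch reproduces the behaviour described above: when the gold standard places all words of a meaning in a single cluster, the ARI is zero.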

F-scores and ARI
We visualize the B-cubed F-scores and ARI scores in figure 2. The spread of the F-scores and ARI scores suggests that InfoMap and the sd-CRP variants are better than UPGMA on all the datasets except the Central Asian dataset. The box plots for InfoMap are similar to those of the sd-CRP variants across all the language families. InfoMap and the sd-CRP variants have narrower boxes than UPGMA across all the families. All the algorithms show the lowest performance, in terms of both F-scores and ARI scores, on the Austro-Asiatic dataset. Based on mean F-scores and ARI scores across the language families, we determine the ns-CRP algorithm to be the winner.

Table 3: Pearson's R between the number of predicted clusters and the number of clusters in the gold standard data. The best correlation for each language family is shaded in light gray.

Size of inferred clusters
Apart from evaluating the cluster quality using B-cubed F-scores and ARI scores, we compare the number of clusters inferred by each algorithm against the number of clusters given in the gold standard data using Pearson's R. We present the results in table 3. The correlation between the number of predicted clusters and the number of gold clusters shows that the sd-CRP variants are more successful than UPGMA at retrieving the right number of clusters. InfoMap comes close to the sd-CRP variants' performance only in the case of the Central Asian dataset. The ns-CRP algorithm is the best predictor of cluster sizes: it predicts cluster counts close to those given in the gold standard for the Austro-Asiatic and Austronesian datasets and shows the same performance as sb-CRP on the Central Asian dialects dataset.
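For completeness, the correlation measure used here is the standard sample Pearson coefficient, which can be computed as follows (the function name is ours):

```python
def pearson_r(x, y):
    # sample Pearson correlation, e.g. between the per-meaning predicted
    # cluster counts x and the gold-standard cluster counts y
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)
```

A value near +1 means an algorithm tracks the gold cluster counts meaning by meaning, even if the absolute counts differ.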

Discussion
In this section, we discuss the effect of feature selection and initial value of α on the performance of sd-CRP algorithms. We verify the effect of missing data on all the clustering algorithms and present the results. Finally, we analyze the working of sd-CRP algorithms.

Feature ablation
To ascertain which word similarity features contribute the most to the performance of the ns-CRP algorithm, we trained three simpler SVM models and evaluated the quality of the clusters inferred using these models. The first model, HK, uses only the string similarity features. The second model adds the PMI word similarity as a feature to the HK model. The third model adds the SCA word similarity as a feature to the HK model. The results presented in the previous section showed that ns-CRP performs the worst on the Austronesian and Austro-Asiatic datasets.
Therefore, we present the cluster evaluation results only for these two datasets in table 4. The HK model yields high F-scores for both datasets. Adding PMI or SCA as an additional feature always improves both the F-scores and the ARI scores. In fact, including both PMI and SCA as features yields the best results, even if the improvement is marginal in the case of the Austro-Asiatic dataset. We observe similar trends for the rest of the datasets but do not present those results due to space constraints. Finally, the ablation experiments suggest that including both the data-driven PMI feature and the linguistically guided SCA feature gives the best results at cognate clustering.

Effect of lexical coverage
In this subsection, we investigate the effect of missing data on the clustering algorithms. In the case of the Austronesian dataset, less than 50% of the languages have word forms attested in 70% of the meanings. The situation is slightly better in the case of Austro-Asiatic, where more than 80% of the languages have word forms attested in 70% of the meanings.
In a separate paper, Rama et al. (2018) presented pruned datasets for five different language families -Pama-Nyungan and Sino-Tibetan in addition to Austronesian, Austro-Asiatic, and Indo-European -consisting of only those languages that show the highest mutual lexical coverage. For each dataset, the authors pruned any language that has less than 75% mutual attestation with the rest of the languages. We attempted to prune the Central Asian dataset but found that we could only exclude a single dialect with less than 50% attestation; therefore, we did not include the Central Asian dataset in these experiments. The statistics of the pruned datasets are given in . The results of this experiment are visualized in figure 3. The sd-CRP algorithms perform better than UPGMA and InfoMap on the Pama-Nyungan and Austro-Asiatic datasets. There seems to be no difference in the performance of the algorithms on the Sino-Tibetan dataset, and no difference between the sd-CRP and InfoMap algorithms on the Austronesian dataset. Although the mean B-cubed F-scores indicate no difference between the algorithms on the Indo-European dataset, the spread of the box plots suggests that the non-UPGMA algorithms perform better than UPGMA; the B-cubed F-scores are not decisive here, whereas the ARI scores clearly show that the non-UPGMA algorithms perform better. In conclusion, both sd-CRP algorithms perform at least as well as or better than the InfoMap algorithm on the pruned datasets.

Effect of initial α
In this experiment, we test the sensitivity of the ns-CRP algorithm to the initial α by initializing α to 0.001, 0.01, and 1.0. We hypothesize that our sampling step makes the algorithm robust to the initial value of α. We run the ns-CRP clustering algorithm for 100 iterations for each starting value of α on each of the pruned datasets. The results for α = 0.001 are given in table 6.
The B-cubed F-scores and ARI scores are quite similar for the other initial values of α, and we therefore do not present those results to avoid repetition. These results suggest that the ns-CRP algorithm is not sensitive to the initial value of α.

Stability of ns-CRP
Here, we investigate the stability of the ns-CRP algorithm by plotting the B-cubed F-scores against the number of iterations for 30 random meanings from the Indo-European dataset in figure 4. The plot shows that the ns-CRP algorithm quickly moves from an initial configuration with a low F-score to configurations with high F-scores within the first 20 iterations. We observe similar behaviour of ns-CRP in the case of the other language families. In conclusion, the plot shows that the clusters inferred by the ns-CRP algorithm achieve a high F-score, and that the cluster quality does not change drastically after reaching a local optimum.

Analysis of sd-CRP algorithms
In this subsection, we analyze the difference in the behaviours of the sd-CRP algorithms. If w_i and w_j are cognate and w_j and w_k are cognate, then all three words are cognate with each other, which follows from the definition of cognacy. The sb-CRP algorithm captures this transitivity through the SitBehind function. During cluster formation, w_i only has to connect to a single word, which might have no words other than itself sitting behind it. We hypothesize that the sb-CRP algorithm would be more efficient at identifying partial cognates, where only part of the lexical material is cognate with another word. An example of a partial cognate is the meat of sweetmeat, which is cognate with Swedish mat 'food' (Campbell, 2004). In contrast, the ns-CRP algorithm is stricter than sb-CRP in that a word is assigned to the cluster with which it has the highest net similarity. If a word has a net similarity of zero with all the existing clusters, it forms its own cluster, since αS(w_i, w_i) is always positive.

Conclusion
We presented and compared the performance of two similarity dependent Chinese Restaurant Process algorithms at the task of automated cognate detection for six different language families. The sensitivity experiments suggested that the sd-CRP algorithms are not sensitive to the initial value of α or to missing data. The feature ablation experiments suggest that the inclusion of the PMI and SCA features improves the performance of the sd-CRP algorithms. We conclude that the sd-CRP algorithms perform as well as or better than the existing clustering algorithms across multiple settings.
As future work, we plan to include language relatedness as features into SVM training and also train the SVM classifier in an unsupervised fashion using the sd-CRP algorithms.