Alignment Analysis of Sequential Segmentation of Lexicons to Improve Automatic Cognate Detection

Ranking functions in information retrieval are often used in search engines to recommend relevant answers to a query. This paper carries this notion of information retrieval over to the problem domain of cognate detection. Its main contributions are: (1) positional segmentation, which incorporates the sequential notion; and (2) graphical error modelling, which deduces the transformations between cognates. Existing research focuses on the classification problem of distinguishing whether a pair of words are cognates. This paper also addresses a harder problem: predicting a possible cognate for a given input word. Our study shows that applying language modelling smoothing methods as retrieval functions, in conjunction with positional segmentation and error modelling, gives better results than competing baselines on both classification and prediction of cognates. Source code is at: https://github.com/pranav-ust/cognates


Introduction
Cognates are a collection of words in different languages deriving from the same origin. The study of cognates plays a crucial role in applying comparative approaches to historical linguistics, in particular in establishing language relatedness and tracking the interaction and evolution of multiple languages over time. A cognate instance in the Indo-European languages is the word group: night (English), nuit (French), noche (Spanish) and nacht (German).
Existing studies on cognate detection involve experiments that distinguish whether a pair of words are cognates or non-cognates (Ciobanu and Dinu, 2014; List, 2012a). These studies do not approach the problem of predicting the cognate in a target language when the cognate in the source language is given. For example, given the word nuit, could an algorithm predict the appropriate German cognate within a huge German wordlist? This paper tackles this problem by incorporating heuristics from probabilistic ranking functions in information retrieval. Information retrieval addresses the problem of scoring a document against a given query, which underlies every search engine. One can view the above problem as the construction of a suitable search engine through which we want to find the cognate counterpart of a word (the query) in a lexicon of another language (the documents). This paper thus deals with the intersection of information retrieval and approximate string similarity (such as the cognate detection problem), which is largely under-explored in the literature. Retrieval methods also provide a variety of alternative heuristics that can be chosen for the desired application area (Fang et al., 2011). Taking advantage of the flexibility of these models, combining approximate string similarity operations with an information retrieval system could be beneficial in many cases. We demonstrate how the notion of information retrieval can be incorporated into the approximate string similarity problem by breaking a word into smaller units. In this regard, Nguyen et al. (2016) have argued that segmented words are a more practical way to query large databases of sequences than conventional query methods, which further encourages the heuristic attempt of this paper to impose an information retrieval model on the cognate detection problem.
Our main contribution is the design of an information retrieval based scoring function (see section 4) that can capture the complex morphological shifts between cognates. We tackle this by proposing a shingling (chunking) scheme that incorporates positional information (see section 2) and a graph-based error modelling scheme that captures the transformations (see section 3). Our test harness evaluates not only the ability to distinguish between a pair of cognates, but also the ability to predict the cognate in a target language (see section 5).

Positional Character-based Shingling
This section examines how to convert a string into a shingle set that encodes positional information. In this paper, we denote by S the shingle set of the cognate from the source language and by T the shingle set of the cognate from the target language. The similarity between these split sets is denoted by S ∩ T. As a running example, S could be the shingle set of the Romanian word rosmarin and T the shingle set of the Italian word romarin.
K-gram shingling: Usually, set based string similarity measures compare the overlap between the shingles of two strings. Shingling is a way of viewing a string as a document by considering k characters at a time. For example, the shingle set of the word rosmarin with k = 2 is: S = {⟨s⟩r, ro, os, sm, ma, ar, ri, in, n⟨/s⟩}. Here, ⟨s⟩ is the start sentinel token and ⟨/s⟩ is the stop sentinel token. For the sake of simplicity, we drop the sentinel tokens, which yields: S = {r, ro, os, sm, ma, ar, ri, in, n}. This method splits the string into smaller k-grams without any positional information.
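The k-gram splitting above can be sketched as follows; the function name and the single-character sentinels `^`/`$` (assumed not to occur in the input word) are our own choices, not the paper's:

```python
def kgram_shingles(word, k=2):
    """Split a word into overlapping character k-grams.

    Sentinel characters are padded on, then stripped from the resulting
    grams, which leaves the shorter edge grams (e.g. the leading 'r' and
    trailing 'n' in the paper's rosmarin example).
    """
    padded = "^" + word + "$"  # single-character start/stop sentinels
    grams = [padded[i:i + k] for i in range(len(padded) - k + 1)]
    return [g.strip("^$") for g in grams]
```

For example, `kgram_shingles("rosmarin")` reproduces the split set of the running example, `["r", "ro", "os", "sm", "ma", "ar", "ri", "in", "n"]`.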

Positional Shingling from 1 End
We argue that unordered k-gram splitting can lead to inefficient matching of strings, since a shingle set is treated as a bag of words. Given this, we propose a positional k-gram shingling technique, which attaches a position number to each split to incorporate the notion of token sequence. For example, the word rosmarin is position-wise split with k = 2 as: S = {1r, 2ro, 3os, 4sm, 5ma, 6ar, 7ri, 8in, 9n}.
Thus, the member 4sm means that it is the fourth member of the set. The motivation behind this modification is that it retains the positional information which is useful in probabilistic retrieval ranking functions.
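A minimal sketch of this one-ended positional shingling (function name and sentinel characters are our own assumptions):

```python
def positional_shingles(word, k=2):
    """One-ended positional shingling: prefix each k-gram with its
    1-based position in the shingle sequence."""
    padded = "^" + word + "$"  # single-character start/stop sentinels
    grams = [padded[i:i + k].strip("^$") for i in range(len(padded) - k + 1)]
    return [f"{i + 1}{g}" for i, g in enumerate(grams)]
```

`positional_shingles("rosmarin")` yields `["1r", "2ro", "3os", "4sm", "5ma", "6ar", "7ri", "8in", "9n"]`, matching the example above.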

Positional Shingling from 2 Ends
The main disadvantage of positional shingling from a single end is that any mismatch completely disturbs the order of the remaining shingles, leading to low similarity.
We attach the position number to the left of a shingle if the numbering begins from the start, and to the right if the numbering begins from the end. Then the smaller of the two position numbers is selected; if they are equal, we select the left position number as a convention. Figure 1 illustrates this algorithm with the splits of romarin and rosmarin. On the left, the algorithm segments the Romanian word romarin into the split set {1r, 2ro, 3om, 4ma, ar4, ri3, in2, n1}. On the right, it segments rosmarin into {1r, 2ro, 3os, 4sm, 5ma, ar4, ri3, in2, n1}.
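The two-ended numbering rule can be sketched as below (function name and sentinels are our own; the tie-breaking follows the left-number convention stated above):

```python
def two_ended_shingles(word, k=2):
    """Two-ended positional shingling: number each shingle from both
    ends, keep the smaller number, and attach it on the side the
    numbering came from (left number on ties)."""
    padded = "^" + word + "$"  # single-character start/stop sentinels
    grams = [padded[i:i + k].strip("^$") for i in range(len(padded) - k + 1)]
    n = len(grams)
    out = []
    for i, g in enumerate(grams, start=1):
        left, right = i, n - i + 1
        if left <= right:              # tie goes to the left number
            out.append(f"{left}{g}")
        else:
            out.append(f"{g}{right}")
    return out
```

This reproduces both split sets of the example: `two_ended_shingles("romarin")` gives `["1r", "2ro", "3om", "4ma", "ar4", "ri3", "in2", "n1"]`.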
Graphical Error Modelling

Once shingle sets are created, common overlap measures like set intersection, Jaccard (Järvelin et al., 2007), XDice (Brew et al., 1996) or TF-IDF (Wu et al., 2008) can be used to measure the similarity between two sets. However, these methods focus only on the similarity of the two strings. For cognate detection, it is crucial to understand how substrings are transformed from the source language to the target language. This section discusses how to view this "dissimilarity" by creating a graphical error model.
Algorithm 1 explicates the process of graphical error modelling. For illustration purposes, we walk through the procedure with a Romanian-Italian cognate pair (mesia, messia). If the source language is Romanian, then S = {1m, 2me, 3es, si3, ia2, a1} is the split set of mesia. Let the target language be Italian; then the split set of the Italian word messia, denoted T, is {1m, 2me, 3es, 4ss, si3, ia2, a1}. Thus |S ∩ T| is the number of common terms, with S ∩ T = {1m, 2me, 3es, si3, ia2, a1}. We are interested in examining the "dissimilarity", i.e., the leftover terms of the two sets: S \ T = φ and T \ S = {4ss}. The graphical error model infers a pattern from these leftover sets by connecting leftover source terms to leftover target terms as edges, here φ → 4ss. Intuition: the edges created as the result of this graph are used for the probabilistic calculations detailed in section 4.2. Intuitively, φ → 4ss means that if the letter s is added at position 4 of the source word mesia, then one obtains the target word messia.
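A simplified sketch of the leftover-edge construction. The paper's Algorithm 1 is not reproduced here; the pairing strategy below (every leftover source shingle connected to every leftover target shingle, with an empty side represented by the null symbol φ) is our own assumption:

```python
def error_edges(S, T):
    """Connect leftover source shingles (S - T) to leftover target
    shingles (T - S) as directed edges; an empty side becomes 'φ'."""
    src_left = sorted(S - T) or ["φ"]
    tgt_left = sorted(T - S) or ["φ"]
    return {(s, t) for s in src_left for t in tgt_left}

S = {"1m", "2me", "3es", "si3", "ia2", "a1"}          # mesia (Romanian)
T = {"1m", "2me", "3es", "4ss", "si3", "ia2", "a1"}   # messia (Italian)
```

On the running example, `error_edges(S, T)` returns the single edge `{("φ", "4ss")}`, i.e., insert s at position 4.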

Evaluation Function
The design of our evaluation function focuses on two main properties: set based similarity (see section 4.1) and probabilistic calculation through a graphical model (see section 4.2).

Similarity Function
Usually, the similarity between two sets is computed with metrics like Jaccard, Dice and XDice (Brew et al., 1996). Dynamic programming based methods like edit distance and LCSR (Longest Common Subsequence Ratio; Melamed, 1999) are also often used to calculate the similarity between two strings. Ranking functions incorporate more complex but necessary features that are needed to distinguish between documents.
In this paper, we use BM25 and a Dirichlet smoothing based ranking function to compute the similarity. BM25 considers term frequency, inverse document frequency and length-normalization penalties in its similarity calculation. The Dirichlet smoothing function (Robertson and Zaragoza, 2009) makes use of language modelling features and a tunable parameter that provides Bayesian smoothing of unseen shingles in the split sets (Blei et al., 2003).
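To make the retrieval view concrete, here is a generic BM25 scorer over shingle-set "documents" (target-language words) against a shingle-set "query" (the source word). This is the standard BM25 formula with default k1/b, not the paper's exact parameterization, and the example words are from section 2:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each shingle-list document against a shingle-list query
    using BM25 (term frequency, IDF, and length normalization)."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequencies
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

query = ["1r", "2ro", "3om", "4ma", "ar4", "ri3", "in2", "n1"]        # romarin
docs = [
    ["1r", "2ro", "3os", "4sm", "5ma", "ar4", "ri3", "in2", "n1"],    # rosmarin
    ["1c", "2ca", "3as", "sa2", "a1"],                                # unrelated word
]
```

Here the cognate candidate rosmarin outscores the unrelated word, which shares no shingles with the query.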

Error Modelling Function
Knowledge of the common morphological transformations between cognates of two languages helps in determining whether a pair of words could be cognates. Based on the graphs of cognate pairs between Italian and Romanian (section 3), which model the morphological shifts between cognates in the two languages, we define an error modelling function on any pair of words from the two languages, where S denotes the split set from the source language and T that from the target language. In the probabilistic function, G(S, T) is the constructed graph of S and T, q is a strength parameter with range (0, ∞), and P(e) is the probability of edge e occurring between two cognates, estimated by its frequency of observation in the graphs of cognate pairs in the training set. Figure 3 illustrates the aggregation of edges in the graph and figure 4 shows the final output of the error modelling function after normalizing.
π(S, T) is called the error modelling function defined for the word pair: an intuitive calculation of the probability that a pair of words are cognates, estimated through their transformations. q is a tunable parameter that controls the effect of the probabilistic frequencies P(e) observed in the training set, which is often useful in avoiding overfitting.
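An illustrative sketch of π(S, T). The aggregation form below (a q-damped mean of edge probabilities over the leftover-edge graph, with a small floor for unseen edges) is our own assumption standing in for the paper's equation, which is not reproduced in this text; only the roles of G(S, T), q and P(e) are taken from the description above:

```python
def pi(S, T, edge_prob, q=1.0):
    """Assumed form of the error modelling function: average P(e)^(1/q)
    over the edges of the leftover graph G(S, T). Larger q flattens the
    influence of the training-set frequencies P(e)."""
    src_left = sorted(S - T) or ["φ"]
    tgt_left = sorted(T - S) or ["φ"]
    edges = [(s, t) for s in src_left for t in tgt_left]
    eps = 1e-6  # floor probability for edges unseen in training
    return sum(edge_prob.get(e, eps) ** (1.0 / q) for e in edges) / len(edges)

S = {"1m", "2me", "3es", "si3", "ia2", "a1"}          # mesia
T = {"1m", "2me", "3es", "4ss", "si3", "ia2", "a1"}   # messia
edge_prob = {("φ", "4ss"): 0.4}  # hypothetical training-set frequency
```

With q = 1 the function returns the raw edge probability; raising q pulls the value toward 1, weakening the effect of P(e).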

Combining Error Modelling and Similarity Function metrics
In this subsection, we merge the notions of similarity and dissimilarity. We combine a set-based similarity function (discussed in section 4.1) and the error modelling function (discussed in section 4.2) into a score function by a weighted sum: score(S, T) = λ · sim(S, T) + (1 − λ) · π(S, T), where λ ∈ [0, 1] is a weight hyperparameter, sim(S, T) is a set-based similarity between S and T, and π(S, T) is the graphical error modelling function defined above. Table 1 summarizes the results of the experimental setup. The elements of the test harness are as follows:
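The weighted combination can be sketched directly; here Jaccard stands in for sim(S, T) and a stub for π(S, T), both our own choices for illustration:

```python
def jaccard(S, T):
    """Set-based similarity: |S ∩ T| / |S ∪ T|."""
    return len(S & T) / len(S | T)

def combined_score(S, T, pi_fn, lam=0.5):
    """score(S, T) = λ·sim(S, T) + (1 − λ)·π(S, T)."""
    return lam * jaccard(S, T) + (1 - lam) * pi_fn(S, T)

S = {"1m", "2me", "3es", "si3", "ia2", "a1"}          # mesia
T = {"1m", "2me", "3es", "4ss", "si3", "ia2", "a1"}   # messia
```

λ trades off set overlap against the learned transformation probabilities and is tuned on held-out data.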

Setup Description
Dataset: The experiments in this paper are performed on the dataset used by Ciobanu and Dinu (2014). The dataset consists of 400 pairs of cognates and non-cognates for Romanian-French (Ro-Fr), Romanian-Italian (Ro-It), Romanian-Spanish (Ro-Es) and Romanian-Portuguese (Ro-Pt). The dataset is divided in a 3:1 ratio for training and testing. Using cross-validation, hyperparameters and thresholds for all the algorithms and baselines were tuned in a fair manner. Experiments: Two experiments are included in the test harness. 1. We provide a pair of words and the algorithms aim to detect whether they are cognates. Accuracy on the test set is the evaluation metric. 2. We provide a source cognate as input and the algorithm returns a ranked list as output. The efficiency of the algorithm depends on the rank of the desired target cognate. This is measured by MRR (Mean Reciprocal Rank), defined as the mean of 1/rank_i over all queries, where rank_i is the rank of the true cognate in the ranked list returned for the i-th query. For this experiment, the search candidates are the entire lexicon of the target language: given the input cognate, the algorithm outputs possible matches after evaluating the whole lexicon. Thus the document collection is a lexicon (the search space) and the queries are the cognates.
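The MRR metric above is a one-liner; the function name is ours:

```python
def mean_reciprocal_rank(ranks):
    """MRR = (1/|Q|) · Σ_i 1/rank_i, where ranks[i] is the rank of the
    true cognate in the list returned for the i-th query."""
    return sum(1.0 / r for r in ranks) / len(ranks)
```

For example, if the true cognates of three queries appear at ranks 1, 2 and 4, the MRR is (1 + 1/2 + 1/4) / 3 ≈ 0.583.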

Baselines
String Similarity Baselines: It is natural to compare our methods with the prevalent string similarity baselines, since the notions behind cognate detection and string similarity are closely related. Edit distance is often used as a baseline in cognate detection papers (Melamed, 1999); it computes the number of operations required to transform the source cognate into the target. We have also incorporated XDice (Brew et al., 1996), a set based similarity measure that operates on the shingle sets of two strings. Hidden alignment conditional random fields (CRFs) are often used in transliteration and serve as a generative sequential model for computing probabilities between cognates, analogous to a learnable edit distance (McCallum et al., 2012). Among these baselines, the CRF performs best in both accuracy and MRR. Orthographic Cognate Detection: Work in this direction usually aligns substrings and feeds the alignments into classifiers such as support vector machines (Ciobanu and Dinu, 2014, 2015) or hidden Markov models (Bhargava and Kondrak, 2009). We included Ciobanu and Dinu (2014) as a baseline, which employs dynamic programming based methods for sequence alignment, after which features are extracted from the mismatches in the word alignments. These features are fed into a classifier such as an SVM, which results in decent accuracy (84% on average) but only 16% MRR. This is because the large number of features leads to overfitting, and the scoring function is unable to single out the appropriate cognate. Phonetic Cognate Detection: Research in automatic cognate identification using phonetic aspects involves computing similarity by decomposing phonetically transcribed words (Kondrak, 2000), acoustic models (Mielke, 2012), phonetic encodings (Rama, 2015), and aligned segments of transcribed phonemes (List, 2012b).
We implemented Rama's (2015) approach, which employs a Siamese convolutional neural network to learn phonetic features jointly with language relatedness for cognate identification, achieved through phoneme encodings. Although it performs well on accuracy, it shows poor results on MRR, possibly for the same reason as the SVM baseline.

Ablation experiments
We experiment with variables such as the length of substrings, the ranking functions, the shingling techniques, and the graphical error model, as detailed in Table 1. Amongst the shingling techniques, we found that character bigrams with 2-ended positioning give better results. Adding trigrams to the database does not have a major effect.