Unsupervised Deep Language and Dialect Identification for Short Texts

Automatic Language Identification (LI) or Dialect Identification (DI) of short texts of closely related languages or dialects is one of the primary steps in many natural language processing pipelines. Language identification is considered a solved task in many cases; however, in the case of very closely related languages, or in an unsupervised scenario (where the languages are not known in advance), performance is still poor. In this paper, we propose the Unsupervised Deep Language and Dialect Identification (UDLDI) method, which can simultaneously learn sentence embeddings and cluster assignments from short texts. The UDLDI model captures the sentence constructions of languages by applying attention to character relations, which helps to optimize the clustering of languages. We have performed our experiments on three short-text datasets for different language families, each consisting of closely related languages or dialects, with very minimal training sets. Our experimental evaluations on these datasets have shown significant improvement over state-of-the-art unsupervised methods, and our model has outperformed state-of-the-art LI and DI systems in supervised settings.


Introduction
Automatic Language Identification (LI) and Dialect Identification (DI) have become a crucial part of natural language processing (NLP) pipelines, supporting tasks such as language modelling, categorization and analysis of code-mixed datasets. Many researchers have carried out experiments in these domains on both speech and text, starting from House and Neuburg (1977). Unsupervised LI is an under-explored area, but is highly useful as it can exploit the large amount of unlabelled data and, more importantly, can be employed when the languages to be identified are not known in advance. However, unsupervised LI for short texts is a very difficult task, for which performance is still comparatively poor. Zhang et al. (2016) have explored unsupervised language identification, but their approach does not work well with closely related short texts. Zaidan and Callison-Burch (2014) have worked on the identification of Arabic dialects, and Ciobanu et al. (2018) studied German dialect identification; these works are mostly supervised. The task is even harder when it comes to unsupervised DI.
These previous works raise some questions: What is the best way to construct sentence embeddings, and how can they be clustered efficiently for short texts when there is no labelled training data? How can the hard task of DI and LI for closely related languages be addressed in an unsupervised way, without any manual intervention? In this paper, we address these problems by taking inspiration from iterative clustering (Xie et al., 2016) and self-attention-based sentence embeddings (Lin et al., 2017). Both papers have shown good results on unsupervised clustering and sentence embeddings for a single-language dataset, whereas our goal is to design a system which can work on short texts of different closely related languages and capture every distinct feature. Our system takes different n-gram character features of a sentence and computes attention weights based on their importance in sentence construction. We then fine-tune the sentence embeddings with an iterative clustering process using Stochastic Gradient Descent (SGD) and backpropagation. We call the system Unsupervised Deep Language and Dialect Identification (UDLDI).
One of the main challenges of the model is obtaining sentence embeddings. As we do not have any labels, we cannot use typical loss functions to update character-to-sentence embeddings. Further, as the datasets are small collections of short texts, there is little scope to learn the features needed to construct efficient sentence embeddings. To overcome this challenge, we propose a loss function which considers the probability distribution of assigning sentences to each cluster and tries to maximize the unique sentence assignments in each cluster. These sentence embeddings define the initial cluster centre assignments, which we then fine-tune with the iterative clustering learning process. Our method is tested on three different datasets consisting of under-resourced and closely related languages, the language families and/or varieties in question being South African, Dravidian and Swiss German. The South African dataset consists of languages from different language families, some of which are similar. We have achieved good results on each dataset, better than the baseline results. We have also compared our sentence-embedding model against supervised LI and DI methods, where it outperformed other state-of-the-art models. The model is robust in that it does not require any manual intervention, and it works well for short texts in any dataset consisting of closely related languages, for any language family.
Our contributions are: (1) the UDLDI model for language and dialect identification in short texts of closely related languages, (2) efficient sentence-embedding construction for short texts, and (3) efficient clustering of closely related languages with a very small dataset.

Related Work
There are some LI and DI models which work efficiently for short texts and closely related languages on different datasets (Hammarström, 2007; Medvedeva et al., 2017). Jauhiainen et al. (2019) have described different LI and DI models. Language identification of closely related languages has been explored for Malay-Indonesian (Ranaivo-Malançon, 2006), Indian languages (Murthy and Kumar, 2006) and South Slavic languages (Tiedemann and Ljubešić, 2012; Ljubešic and Kranjcic, 2014; Ljubešić and Kranjčić, 2015). Zampieri and Gebre (2014) built an LI system which can identify 27 languages.
There is some research on LI for short texts. Vatanen et al. (2010) explored an LI method based on n-gram character sequences for messages that are 5-21 characters long. Bergsma et al. (2012) have explored LI for Twitter data. King and Abney (2013) used Conditional Random Fields (CRF) (Lafferty et al., 2001) to build a word-level LI method. Giwa and Davel (2013) have investigated LI in the context of code-switching. These methods are all supervised, which prevents them from exploiting the large amount of unlabelled data and from handling situations where the languages are not known in advance. Unsupervised LI is a challenging task. Selamat and Ching (2008) used Fuzzy ART neural networks for clustering documents written in Arabic, Persian, and Urdu, showing how to update clusters dynamically. Amine et al. (2010) used a character n-gram representation for text; a k-means algorithm followed by particle-swarm optimization is then applied to cluster the languages, resulting in several small clusters. Shiells and Pham (2010) worked on unsupervised tweet LI with the Chinese Whispers algorithm of Biemann (2006) and Graclus clustering. Elfardy and Diab (2012) deal with code-switching identification between Modern Standard Arabic (MSA) and dialectal Arabic. Wan (2016) used word-level k-means clustering for LI and clustered languages based on co-occurring words. Poulston et al. (2017) used word2vec word embeddings and k-means clustering to build their LI model. Rijhwani et al. (2017) investigated unsupervised language detection for short tweets, focusing mainly on seven languages. These algorithms work well for some specific languages and for long texts.
There is also some research on dialect identification. DI has been explored for Serbo-Croatian dialects (Zečević and Vujicic-Stankovic, 2013), English varieties (Simaki et al., 2017), Dutch dialects (Trieschnigg et al., 2012), German dialects (Hollenstein and Aepli, 2015), Mainland-Singaporean-Taiwanese Chinese (Huang and Lee, 2008), Arabic dialects in social media (Huang, 2015) and Portuguese varieties (Zampieri and Gebre, 2012). Arabic dialect identification in particular received much attention in 2015 (Zampieri et al., 2015). But most of these works deal with supervised dialect identification. Our model bridges the gap between DI and unsupervised learning.

Proposed Method

Unsupervised Deep Language and Dialect Identification (UDLDI) clusters the short texts of closely related languages in an efficient way using character-to-sentence embeddings with self-attention, based on an unsupervised loss function and an iterative clustering method (refer to Figure 1). The model uses the unsupervised loss function to build the character-to-sentence embeddings. The iterative clustering process then fine-tunes the sentence embeddings and enhances the cluster assignment.

Sentence Embedding
The sentence-embedding model has two parts: an n-gram character-level CNN and a self-attention mechanism, which produces the weighted summed features of sentences based on character n-grams (n ∈ {2,3,4,5,6}). The attention weights are multiplied with the feature vectors of the CNN to get the sentence embedding. The different character-level features are obtained with a 1-dimensional CNN (Zhang et al., 2015). The sentence embeddings are then fed into a dense layer to get the language-level probability distribution of the sentence.
The 1-dimensional CNN accepts the input as a character sequence. If a sentence has m characters then the input sequence is S = [w_1, w_2, ..., w_m], where w_i is a character in the sentence (including spaces). The one-dimensional convolution implements 1-dimensional filters which slide over the sentences as a feature extractor. Let the filters have a shape of 1 × k, where k is the filter size; each filter size corresponds to one n-gram order. Let x_i ∈ R^d denote the d-dimensional character embedding of the i-th character, where the character vocabulary has size |V|. For each position j in the sentence, we have a window vector w_j of k consecutive character vectors, as denoted in Equation 1.
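Equation 1 can be reconstructed from the definitions above; presumably the window vector concatenates the k character embeddings starting at position j:

$$ w_j = [x_j \oplus x_{j+1} \oplus \cdots \oplus x_{j+k-1}] \in \mathbb{R}^{kd} \tag{1} $$

where ⊕ denotes concatenation.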
The 1-dimensional k-sized filters slide over the window vectors w_j to create the feature map s, where s ∈ R^(m−k+1) and m is the input length. Each element s_j of the feature map for window vector w_j is produced according to Equation 2, where v is the filter's weight vector, b_j is the bias for the j-th position and a is a non-linear function. The new feature representation F, after rearranging, is F = [s_1, s_2, ..., s_n], where n is the number of convolution filters, s_i is the feature map generated by the i-th filter and F ∈ R^((m−k+1)×n).
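A plausible reconstruction of Equation 2, writing the filter's weight vector as v (a learned vector of length kd, matching the window vector):

$$ s_j = a\big( v \cdot w_j + b_j \big) \tag{2} $$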
To capture the different n-gram feature representations of the sentences, multiple filter sizes are used; the feature representation obtained with filter size k is denoted F_k, where k ∈ {2,3,4,5,6}.
We aim to get sentence embeddings from the different n-gram representations. We achieve this by taking a weighted summation of the T elements of the new feature representation F, performed for each F_k separately. For this we use a self-attention mechanism. The attention mechanism takes the feature representation as input and outputs a weight vector a, as shown in Equation 3, where W_h is the weight matrix and b_h is the bias of the attention mechanism. The softmax ensures that the attention weights (a_1, a_2, ..., a_T) are non-negative and sum to 1. We then sum the elements of the feature representation F according to the weight vector a to get a vector representation r of the input sentence, as in Equation 4, where a_i is the attention weight for each element of the feature representation and ⋅ is the element-wise product. Different n-gram representations carry different weighted sentence embeddings. To capture all of them, the final representation is the concatenated vector r = [r_2, r_3, r_4, r_5, r_6] (for n-grams ∈ {2,3,4,5,6}), which is then passed to the fully connected layer to obtain the probability distribution over languages or dialects.
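Equations 3 and 4 can be reconstructed from the description above and the self-attention formulation of Lin et al. (2017) on which it builds:

$$ a = \mathrm{softmax}\big( W_h F^{\top} + b_h \big) \tag{3} $$

$$ r = \sum_{i=1}^{T} a_i \cdot F_i \tag{4} $$

where T = m − k + 1 is the number of positions in the feature map F.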
The sentence embeddings are used in two phases. The first is the pre-training phase, using the loss described in the Dense Layer and Loss function subsection; the second is the fine-tuning of sentence embeddings based on self-training with the clustering algorithm, described in the Self Training and Clustering subsection.
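A minimal PyTorch sketch of the encoder described in this section is given below. The embedding dimension and layer names are assumptions; the number of filters (128), the kernel sizes {2,3,4,5,6}, stride 1 and the 1024-unit dense layer follow the Training Details section.

import torch
import torch.nn as nn

class UDLDIEncoder(nn.Module):
    # Sketch of the n-gram character CNN + self-attention sentence encoder.
    def __init__(self, vocab_size, emb_dim=64, n_filters=128,
                 kernel_sizes=(2, 3, 4, 5, 6), n_classes=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # One 1-D convolution per n-gram order k (128 filters, stride 1).
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, k) for k in kernel_sizes])
        # One attention scorer per n-gram order (Equation 3).
        self.attn = nn.ModuleList(
            [nn.Linear(n_filters, 1) for _ in kernel_sizes])
        self.fc = nn.Linear(n_filters * len(kernel_sizes), 1024)
        self.out = nn.Linear(1024, n_classes)

    def forward(self, chars):                           # chars: (batch, m) character ids
        x = self.embed(chars).transpose(1, 2)           # (batch, emb_dim, m)
        reps = []
        for conv, attn in zip(self.convs, self.attn):
            feat = torch.relu(conv(x)).transpose(1, 2)  # (batch, m-k+1, n_filters)
            a = torch.softmax(attn(feat), dim=1)        # weights sum to 1 over positions
            reps.append((a * feat).sum(dim=1))          # weighted sum, Equation 4
        r = torch.cat(reps, dim=1)                      # concatenated [r_2, ..., r_6]
        z = torch.relu(self.fc(r))                      # sentence embedding
        p = torch.softmax(self.out(z), dim=1)           # class probabilities, Equation 5
        return z, p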

Dense Layer and Loss function
The sentence embeddings r ∈ R^(N×M) are passed to a fully connected (FC) layer, which gives the output z ∈ R^(N×K), followed by a softmax layer (p ∈ R^(N×K)) returning the probability distribution over all classes, as per Equation 5, where K is the number of languages or dialects and p_ij gives the probability that sample i belongs to class j (j ∈ {1,2,...,K}). The loss function encourages sentences with the same features to be assigned to the same language; it can be seen as maximizing cluster purity based on the probability distribution. The loss is computed by means of Equation 6, where N is the batch size. In contrast to existing loss functions, ours does not rely on entropy-based calculations, as there are no labels. Instead, the loss function maximizes the probability mass assigned to each class (or cluster) based on the feature assignments, while minimizing the maximum summed squared probability of all the languages assigned to one class. The maximization term ensures the best possible assignment of the features to each individual class. But in this process there is a risk of biased feature assignment (for datasets where the features of data points are very close in distance) to a single class. To avoid this trivial solution, the model is penalised with the maximum summation of feature assignments over all samples. This stabilizes the system by distributing features across classes in the best possible way.
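Equation 5 is the standard softmax over the FC-layer outputs z. The exact form of Equation 6 is not recoverable from the text; one sketch consistent with the description (rewarding confident per-sentence assignments while penalising the single class that attracts the largest probability mass) is:

$$ p_{ij} = \frac{\exp(z_{ij})}{\sum_{k'=1}^{K} \exp(z_{ik'})} \tag{5} $$

$$ L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{K} p_{ij}^{2} \;+\; \max_{1 \le j \le K} \frac{1}{N} \sum_{i=1}^{N} p_{ij}^{2} \tag{6} $$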
After training the character and sentence embeddings, we cluster the dataset. Xie et al. (2016) have shown in the Deep Embedded Clustering (DEC) architecture how to use clustering as self-training (Nigam and Ghani, 2000) to fine-tune the mapping and perform the clustering. The DEC framework uses a parameterized non-linear mapping from the data space X to a lower-dimensional feature space Z to optimize a clustering objective. It uses stochastic gradient descent (SGD) via backpropagation on the clustering objective to learn the mapping, which is parameterized by a deep neural network. In our case, we use this iterative clustering process to fine-tune the sentence embeddings, which improves the clustering.

Self Training and Clustering
This phase starts by obtaining initial cluster centroids {u_j}_{j=1}^{k}, by applying the clustering algorithm to the sentence embeddings from the pre-training phase, where k is the number of cluster centres. The self-training phase then has two steps: (i) soft assignment between the sentence embeddings and the current cluster centroids, and (ii) fine-tuning the sentence embeddings and cluster centroids using an auxiliary target distribution.
As a first step, we calculate the soft assignment between cluster centroid u_j and sentence embedding z_i based on Student's t-distribution (Maaten and Hinton, 2008) with a single degree of freedom (α = 1), as per Equation 7, where z_i ∈ Z and q_ij can be interpreted as the probability of assigning sample i to cluster j. The second step refines the clusters by learning from their high-confidence assignments with the help of an auxiliary target distribution (Xie et al., 2016). The auxiliary target distribution p_ij helps to improve cluster purity and puts more emphasis on data points assigned with high confidence. It is calculated by means of Equation 8, where ∑_i q_ij is the soft cluster frequency. The training loss is then the KL divergence between the soft assignments q_i and the auxiliary distribution p_i, as shown in Equation 9.
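Equations 7-9 are the standard DEC formulations from Xie et al. (2016):

$$ q_{ij} = \frac{(1 + \lVert z_i - u_j \rVert^2 / \alpha)^{-\frac{\alpha+1}{2}}}{\sum_{j'} (1 + \lVert z_i - u_{j'} \rVert^2 / \alpha)^{-\frac{\alpha+1}{2}}}, \qquad \alpha = 1 \tag{7} $$

$$ p_{ij} = \frac{q_{ij}^2 / f_j}{\sum_{j'} q_{ij'}^2 / f_{j'}}, \qquad f_j = \sum_i q_{ij} \tag{8} $$

$$ L = \mathrm{KL}(P \,\Vert\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} \tag{9} $$

In code, the target distribution of Equation 8 reduces to a few lines (a sketch assuming q is an N × K tensor of soft assignments):

import torch

def target_distribution(q):
    # Equation 8: sharpen soft assignments, re-weight by soft cluster frequency f_j,
    # then normalise each row so it remains a probability distribution.
    weight = q ** 2 / q.sum(dim=0)
    return (weight.t() / weight.sum(dim=1)).t()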
Experimental Setup

Data
We have used three different datasets for training and testing our model, covering the following languages/dialects. South African languages. This dataset was used by Duvenhage et al. (2017) and consists of short texts for 11 South African languages, many of which are related: the Nguni languages (zul, xho, nbl, ssw), Afrikaans (afr) and English (en), the Sotho languages (nso, sot, tsn), tshiVenda (ven) and Xitsonga (tso). The texts are on average 15-20 characters long. We have taken 15,000 samples for training and 5,000 for testing.
Dravidian languages. The Dravidian languages are under-resourced languages spoken mainly in the southern part of India. This dataset contains four closely related Dravidian languages, including Tamil, each written in its own distinct script.
Swiss German. This dataset is based on the ArchiMob corpus (Samardžić et al., 2016) and is used for dialect identification. It contains transcriptions of 34 interviews with native speakers of various German dialects spoken in Switzerland. The subset used for German Dialect Identification contains 18 interviews (14 for training and 4 for testing) from four Swiss German dialect areas, i.e., Basel, Bern, Lucerne, and Zurich, with each interview transcribed using the 'Schwyzertütschi Dialäktschrift' writing system (Dieth, 1986). The training set contains around 14,000 instances in total and the test set contains 3,638 instances.

Evaluation Metric
We have used the standard cluster performance evaluation process. The number of clusters is set to the original number of classes in the data, and performance is evaluated with unsupervised cluster accuracy, as shown in Equation 10, where l_i is the ground-truth label, c_i is the cluster assignment produced by the algorithm and m ranges over all possible one-to-one mappings between clusters and labels. Determining the number of clusters is itself part of the evaluation in unsupervised clustering: accuracy is computed with the same number of clusters as in the ground-truth data, but in general the number of clusters also needs to be optimised and evaluated. For this we use Normalized Mutual Information (NMI), which is useful for comparing clustering results with different numbers of clusters. It is shown in Equation 11, where l is the ground-truth label, c is the predicted cluster, I is the mutual information and H is the entropy. When the data is partitioned perfectly the NMI score is 1, and when l and c are independent it is 0.
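Equation 10 is the standard unsupervised clustering accuracy; for Equation 11 we assume the common normalisation by the arithmetic mean of the two entropies (other normalisations, e.g. by the maximum, also appear in the literature):

$$ ACC = \max_{m} \frac{\sum_{i=1}^{n} \mathbf{1}\{ l_i = m(c_i) \}}{n} \tag{10} $$

$$ NMI(l, c) = \frac{I(l; c)}{\tfrac{1}{2}\,[H(l) + H(c)]} \tag{11} $$

The optimal mapping m in Equation 10 can be computed with the Hungarian algorithm; a sketch using SciPy and scikit-learn:

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def cluster_accuracy(labels, clusters):
    # Confusion matrix between predicted clusters and ground-truth labels.
    k = int(max(labels.max(), clusters.max())) + 1
    count = np.zeros((k, k), dtype=np.int64)
    for l, c in zip(labels, clusters):
        count[c, l] += 1
    # The Hungarian algorithm finds the one-to-one mapping that
    # maximises correctly mapped samples (Equation 10).
    rows, cols = linear_sum_assignment(count.max() - count)
    return count[rows, cols].sum() / labels.size

# NMI (Equation 11) is available directly:
# normalized_mutual_info_score(labels, clusters)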

Training Details
We implemented our model using PyTorch and used the SGD (Stochastic Gradient Descent) optimizer. The starting learning rates for the South African, Dravidian and Swiss German datasets were hand-tuned to 1e-4, 5e-3 and 2e-3, respectively, based on convergence on the training data. We have used LambdaLR as the learning rate scheduler, which varies the rate based on the number of convolution filters. The convolution layers have 128 filters each, with stride 1. The number of neurons in the feed-forward network is 1024. We have used k-means clustering (MacQueen, 1967); for initial cluster centroid detection on batches, we have used minibatch k-means clustering. The weight decay is 0.01 and the momentum is 0.9. For supervised learning, we have used cross-entropy loss.
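A sketch of this optimizer and scheduler setup (the LambdaLR multiplier below is an assumption, as the text only states that the schedule depends on the number of convolution filters; UDLDIEncoder refers to the encoder sketch in the Sentence Embedding section):

import torch

model = UDLDIEncoder(vocab_size=200, n_classes=4)         # hypothetical instantiation
optimizer = torch.optim.SGD(model.parameters(), lr=5e-3,  # 5e-3: Dravidian dataset
                            momentum=0.9, weight_decay=0.01)
n_filters = 128
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: min(1.0, (epoch + 1) / n_filters))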

Results and Discussions
We have compared our results with various baselines, in both supervised and unsupervised setups, as reported in Table 1 and Table 2.
Unsupervised setup. The model is compared with one of the best-known unsupervised clustering approaches, autoencoder-based k-means clustering, in which sentences are projected from the feature space X into a latent space Z and then clustered with k-means. Our second baseline is fastText word embeddings with k-means, where the sentence embedding is the average of the word embeddings. We have also compared our experimental results with the LSTM sentence-embedding model (Lin et al., 2017) combined with the DEC framework (Table 1). Our model performed substantially better than the other state-of-the-art models, achieving 37% accuracy (a 16.82% improvement over the LSTM model) on the South African dataset. It also achieved 82% accuracy (a 0.6% improvement over the LSTM model) on the Dravidian dataset. For the Swiss German dataset, 30% accuracy (a 15.83% improvement over the LSTM model) was achieved; the latter DI task can be considered one of the toughest to perform, as the dialects are very similar. This indicates that the model is well capable of identifying important character n-gram features in sentence construction. Details about the attention weight distribution are discussed in the Discussion and Analysis section.

Table 2. Supervised accuracy (South African / Dravidian / Swiss German):
Duvenhage et al. (2017): 96.00% / 97.34% / --
CLUZH (Clematide and Makarov, 2017): -- / -- / 67.00%
MAZA (Malmasi and Zampieri, 2017): -- / -- / 68.00%
tearsofjoy: -- / -- / 75.00%
LSTM (Lin et al., 2017): 96.67% / 98.51% / 68.67%
CharCNN (Zhang et al., 2015): 98...
Supervised setup. Table 2 shows different baseline models, including systems previously submitted to the Swiss German DI shared tasks. Our model has outperformed all the state-of-the-art methods, achieving 99%, 99.34% and 81% accuracy on the South African, Dravidian and Swiss German datasets, respectively. While it outperforms the state of the art in both LI and DI, the gain is most pronounced in DI, where it outperforms the highest-performing system by 8%.

Discussion and Analysis
To analyze the sentence embeddings, we take the Swiss German DI task as a reference, as it can be considered one of the hardest. The dialects, from four different areas, Basel (BS), Bern (BE), Lucerne (LU), and Zurich (ZH), are very closely related, and some of the sentence constructions are very similar to each other. For example, take two sentences from Bern and Lucerne: "aber dääìsch emaal dä" and "aber dääìsch seer schträng ggsìì mitöis", respectively. The first three words are exactly the same, consisting of the same characters, even though the sentences represent two different dialects. Without extensive morphological and linguistic knowledge, it is very difficult to separate sentences spoken by people from these regions. Character n-grams represent distinct features of the sentences of a particular region.

Our model is also able to identify dialectal (pronunciation) variants of an inflected verb form, as shown in Figure 2 (attention visualization of dialect-specific words pointed out by the model). There are two sub-columns under the "Text" column, containing different sentences formed with the same verb in different contexts. Red indicates a high attention weight; light blue indicates a less important part. The English word "have" has different forms in different dialects: in BE it is written as "hei", whereas in LU it is written as "hend". The verb form is used in different contexts in different sentences. The first example in the Text1 column shows a sentence with the words "mir hei", meaning "we have". But the word "mir" is the same across dialects; our model has given it a low attention weight, whereas the highest attention weight is given to "hei". The character n-grams include spaces as well, so in the Text2 column, although the sentence does not contain the word "mir", the sentence is classified as BE based on the word "hei" with its surrounding spaces. Here the word "söimer" is also given the highest attention weight, as it only appears in BE. The same can be observed in the other sentences for the other dialects. Conjunctions like "und" ("and") are very common in sentences and are often written identically. As shown in Figure 2, this word occurs in both BS and ZH, and as such has very low significance for dialect identification. Consequently, our model has assigned it a low attention weight.
As previous work has pointed out, the ArchiMob corpus shows variation on different levels: not only are different lexical items pronounced, and hence written, differently across dialects, the data additionally incorporates intra-speaker variability (i.e., slight variation in the pronunciation of the same word by the same speaker) and even shows transcriber-related variation. These factors may lead to overestimating the distinctive features identified by the model. Moreover, the attention weights in our model may point to yet another type of variability, namely idiosyncratic features of certain speakers, such as the repetition of words (idiolect). In Figure 3a, dialects are compared with each other in pairs. The word "ùùnd" (standard German "und") in the second example is common to both LU and BS. But in LU the word is followed by "ùnd", whereas in BS it is followed by other words, so the character attention distribution gives the highest attention weight to the n-gram "ùndùn". Though the word "ùnd" (like "und") has very low significance for dialect identification as an individual word, in the Swiss German dataset this n-gram is a distinctive feature, although probably more indicative of idiolect than dialect. The n-gram "im va" carries very little distinctiveness in the case of LU, so it has the lowest attention score. The differences in sentence construction relevant to DI or LI can be very small, and our model captures these differences efficiently. In Figure 3a, the character n-gram "u nä" exists in BE: even in the word "nächer", the first two letters are "nä", which gives rise to the n-gram "u nä". In BS the construction is different and the n-gram "u nä" does not occur. Thus, in the case of BE, the highest attention weight is assigned to "u nä", which makes the sentence embedding more robust and efficient.

Error analysis in the model
UDLDI works efficiently in comparison to other state-of-the-art methods; however, some challenges remain. As can be seen in Figure 3b, the sentence construction "a de gränze" can be found in both the Zurich and Bern regions. Although the sentences contain other words, our method is unable to differentiate the sentence-level features, as the character n-grams are also common across dialects; for example, the word "gsii" also exists in the Lucerne region. This leads to an inadequate attention weight distribution for common character n-grams. The same is true for the word sequence "aber es isch". These cases lead to biased clustering, an observation supported by the NMI score of the clustering. Figure 4 shows that the NMI is highest when the number of clusters is 3 (blue line). We suspect that, in the case of overlapping word sequences with a non-distinctive right context, the model wrongly classifies sentences as belonging to the same cluster (dialect). We can compare this with the Dravidian dataset, for which the NMI score is highest at 4 clusters (red line), equal to the number of languages present in the dataset. Though the Dravidian languages are very similar, they have different orthographies, with a unique written script for each language; the model can thus easily identify the script, which is directly captured by linguistic features at the character level. The somewhat unstable characterisation of the number of clusters in the NMI graph for the Swiss German dataset may to some extent be caused not only by linguistic features transcending one dialect area, but also, as mentioned in the Discussion and Analysis section, by intra-speaker variation, transcriber-related variation and idiolect.

Conclusion
In this paper, we propose a novel character-attention-based unsupervised deep language and dialect identification (UDLDI) model for short texts of closely related languages. We have performed experiments on three different language families and outperformed other state-of-the-art models in both supervised and unsupervised settings. We have also achieved promising results for dialect identification, which is considered one of the hardest tasks in NLP. Our experiments show that UDLDI is capable of language and even dialect identification irrespective of language family. The results in the Discussion and Analysis section show that the model correctly identifies some of the verbs that distinguish different dialects. We have also shown that, based on NMI, our unsupervised model correctly predicts the number of languages present in a dataset, opening up interesting applications in large-scale multilingual dataset processing.
Our future work will focus on improving clustering efficiency for closely related languages and dialects. Though the model has outperformed other state-of-the-art models, there is considerable scope to improve the unsupervised learning model; for example, the co-occurrence of words in different contexts has not yet been considered. In the future, we aim to capture word contexts along with n-grams in the sentence embeddings.