Tübingen system in VarDial 2017 shared task: experiments with language identification and cross-lingual parsing

This paper describes our systems and results in the VarDial 2017 shared tasks. Besides the three language/dialect discrimination tasks, we also participated in the cross-lingual dependency parsing (CLP) task, using a simple methodology which we briefly describe in this paper. For all the discrimination tasks, we used linear SVMs with character and word features. The system achieves competitive results among the systems in the shared task. We also report additional experiments with neural network models. The performance of the neural network models was close to, but always below, that of the corresponding SVM classifiers in the discrimination tasks. For the cross-lingual parsing task, we experimented with an approach based on automatically translating the source treebank to the target language and training a parser on the translated treebank. We used off-the-shelf tools for both translation and parsing. Despite achieving better-than-baseline results, our scores in the CLP task were substantially lower than the scores of the other participants.


Introduction
In this paper, we describe our efforts in two rather different tasks during our participation in the VarDial 2017 shared tasks (Zampieri et al., 2017). The first task, which we collectively call the language identification task, aims to identify closely related languages or dialects. VarDial 2017 hosted three related language identification tasks: the Discriminating between similar languages (DSL) shared task, which includes closely related languages in six groups, Arabic dialect identification (ADI), and German dialect identification (GDI). The second task, cross-lingual parsing (CLP), aims to exploit resources available for a related source language for parsing a target language for which no syntactically annotated corpus (treebank) is available. This paper focuses on language identification, while providing a brief summary of our methods and results for the CLP task as well.
Although language identification is a mostly solved problem, closely related languages and dialects still pose a challenge for language identification systems (Tiedemann and Ljubešić, 2012; Zampieri et al., 2014; Zampieri et al., 2015; Zampieri et al., 2017). For this task, we experimented with two different families of models: linear support vector machines (SVM) and (deep) neural network models. For both families we used a combination of character and word (n-gram) features. Similar to our earlier experiments in the VarDial 2016 shared task (Çöltekin and Rama, 2016), the linear models performed better than the neural network models in all language identification tasks. We describe both families of models and compare the results obtained. In the VarDial 2017 shared task campaign, the DSL and ADI shared tasks had both open and closed track submissions, while GDI had only a closed track. For all the tasks, we participated only in the closed track.
While discriminating closely related languages is a challenge for the language identification task, the similarities can be useful in other tasks. By using information or resources available for a related (source) language, one can build or improve natural language tools for a (target) language. This is particularly useful for low-resource languages, and for tasks that require difficult-to-build language-specific tools or resources. Parsing fits into this category well, since treebanks, the primary resources used for parsing, require considerable time and effort to create. Hence, transferring knowledge from one or more (not necessarily related) languages has been studied extensively in recent work and found to be useful (Yarowsky et al., 2001; Hwa et al., 2005; Zeman and Resnik, 2008; McDonald et al., 2011; Tiedemann et al., 2014a, just to name a few). In particular, it has been shown that these approaches tend to perform better than purely unsupervised methods, which are another natural choice for parsing a language without a treebank.
There are two common approaches to transfer parsing. The first one is often called model transfer, which typically involves training a delexicalized parser on the source language treebank and using it on the target language, with further adaptation or lexicalization with the help of additional monolingual or parallel corpora (McDonald et al., 2011; Naseem et al., 2012). The second method is annotation transfer, which utilizes parallel resources to map the existing annotations for the source language to the target language (Yarowsky et al., 2001; Hwa et al., 2005; Tiedemann, 2014). In this work, we use a straightforward annotation transfer method using freely available tools. As in the language identification tasks, we participated only in the closed track of the CLP task.
The remainder of the paper is organized as follows. The next section provides brief descriptions of the tasks and the data sets. Section 3 describes the methods and the systems we used for both tasks, Section 4 presents our results and we conclude in Section 5 after a brief discussion.

Task description
In this section, we provide a brief description of the tasks and the data sets. Detailed descriptions of the tasks and the data can be found in Zampieri et al. (2017).

Language identification
VarDial 2017 shared task included three language identification challenges.
The organizers provided separate training and development sets for the DSL task. The training set consists of 18 000 documents and the development set consists of 2 000 documents for each language variety. Although the data is balanced with respect to the number of documents, there is a slight variation with respect to the number of characters and tokens among different language varieties, as presented in Table 1. These differences may explain some of the biases towards certain varieties within groups. Further details about the task and the data can be found in Goutte et al. (2016).
The ADI data includes transcriptions of speech from five different Arabic varieties. Besides the transcribed words, the ADI data also includes i-vectors, fixed-length vectors representing some acoustic properties of whole utterances. The ADI data shows slightly more class imbalance than the DSL data, as shown in Table 2. The lengths of the documents in the ADI data are also more varied. More information on the data and the task can be found in Malmasi et al. (2015).
The GDI task includes data from four Swiss German dialects: Basel (bs), Bern (be), Lucerne (lu), and Zurich (zh). This data set includes much shorter documents compared to the DSL and ADI data sets. The GDI data statistics are presented in Table 3.

Cross-lingual parsing
The cross-lingual parsing tasks involved using one or more source language treebanks along with parallel texts to parse the target languages. The source-target language pairs for this task are:
• Target language: Croatian, Source language: Slovenian
• Target language: Slovak, Source language: Czech
• Target language: Norwegian, Source languages: Danish and Swedish
The source language treebanks are part of the Universal Dependencies (UD) version 1.4 (Nivre et al., 2016). The parallel texts are subtitles from the OPUS corpora collection (Tiedemann, 2012).

System descriptions

Language identification with SVMs
As in our participation last year, we submitted results using a multi-class (one-vs-one) support vector machine (SVM) model. Unlike last year's submissions (Çöltekin and Rama, 2016), where we used only character n-grams as features, we used a combination of both character and word n-grams. Both character and word n-gram features are weighted using sub-linear tf-idf scaling (Jurafsky and Martin, 2009, p.805). We did not apply any filtering or normalization (e.g., case normalization), except for removing features that occur in only a single document.
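The feature weighting described above can be illustrated with a minimal sketch. It assumes one common form of sub-linear tf-idf, (1 + log tf) · log(N / df), which may differ in detail from the variant used in our system. The function names and the `min_df` parameter are illustrative only and are not taken from our implementation.

```python
import math
from collections import Counter


def char_ngrams(text, max_n):
    """Character n-grams of length 1..max_n, tagged 'c' to keep them apart from word features."""
    return [("c", text[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(text) - n + 1)]


def word_ngrams(text, max_n):
    """Whitespace-token n-grams of length 1..max_n, tagged 'w'."""
    tokens = text.split()
    return [("w", " ".join(tokens[i:i + n]))
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def tfidf_vectors(docs, max_char_n=7, max_word_n=3, min_df=2):
    """Sparse sub-linear tf-idf vectors (one dict per document).

    Assumed weighting: (1 + ln tf) * ln(N / df). Features seen in fewer than
    min_df documents are dropped; no case normalization is applied.
    """
    counts = [Counter(char_ngrams(d, max_char_n) + word_ngrams(d, max_word_n))
              for d in docs]
    doc_freq = Counter(feature for c in counts for feature in c)
    n_docs = len(docs)
    return [{f: (1.0 + math.log(tf)) * math.log(n_docs / doc_freq[f])
             for f, tf in c.items() if doc_freq[f] >= min_df}
            for c in counts]
```

With this form of idf, a feature that occurs in every document receives weight zero. The resulting sparse vectors would then be passed to a linear classifier, which is not shown here.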
The ADI data set also included fixed-length numeric features, i-vectors, for each document. We concatenated these vectors with the tf-idf features in our best performing model for the ADI task. In all SVM models we combine the features in a flat manner and predict the varieties directly, without using a two-stage or hierarchical approach. We also tuned the maximum order of character and word n-grams, as well as the SVM margin parameter 'C', for each task separately. The SVMs were not very sensitive to changes in these parameters. Table 4 lists the configurations of the SVM models in our main submission. We present further results on the effects of these parameters in Section 4. In all of our experiments, we combined the development and training sets for the DSL and ADI tasks and used 10-fold cross validation for tuning. We also used 10-fold cross validation for tuning the parameters of the system for the GDI task, for which no designated development data was provided.

Task   word   char   C
DSL    3      7      1.8
ADI    3      10     0.5
GDI    2      7      0.7

Table 4: Maximum word and character n-grams, and the SVM margin parameter, C, used for each language identification task, for our main submission. We use all n-grams starting from unigrams up to the indicated maximum n-gram value.
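The 10-fold cross-validation used for tuning can be sketched as follows. This is a simplified illustration: folds are assigned by interleaving rather than by shuffling or stratification, and the classifier is passed in as a hypothetical `fit_predict` callable. The majority-class baseline only stands in for a real model such as an SVM.

```python
from collections import Counter


def kfold_indices(n_items, k=10):
    """Split range(n_items) into k interleaved folds whose sizes differ by at most one."""
    return [list(range(i, n_items, k)) for i in range(k)]


def cross_val_accuracy(items, labels, fit_predict, k=10):
    """Accuracy over all items, each predicted while it is held out of training.

    fit_predict(train_x, train_y, test_x) must return one label per test item.
    """
    correct = 0
    for held_out in kfold_indices(len(items), k):
        held = set(held_out)
        train_x = [x for i, x in enumerate(items) if i not in held]
        train_y = [y for i, y in enumerate(labels) if i not in held]
        test_x = [items[i] for i in held_out]
        predicted = fit_predict(train_x, train_y, test_x)
        correct += sum(p == labels[i] for p, i in zip(predicted, held_out))
    return correct / len(items)


def majority_baseline(train_x, train_y, test_x):
    """Stand-in classifier: always predict the most frequent training label."""
    most_common = Counter(train_y).most_common(1)[0][0]
    return [most_common] * len(test_x)
```

Tuning then amounts to calling `cross_val_accuracy` once per candidate setting (for example, per value of C or per maximum n-gram order) and keeping the setting with the highest score.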
We also experimented with logistic regression, using both one-vs-rest and one-vs-one multi-class strategies. Like the previous year, the SVM models always performed slightly better than logistic regression models. In this paper, we only describe the SVM models and discuss the results obtained using them.

Language identification with neural networks
The general architecture used for our hierarchical network model is presented in Figure 1. This is virtually identical to the general architecture described in Çöltekin and Rama (2016).
In this study, we use both task-specific character and word embeddings to train our model. They are trained while learning to discriminate the language varieties. As opposed to general-purpose embeddings, they are expected to capture the input features (characters or words) that are indicative of a particular language variety rather than words that are semantically similar.
The presented architecture is an instance of multi-label classification. During training, model parameters are optimized to guess both the group and the specific language variety correctly. Furthermore, we feed the model's prediction of the group to the classifier predicting the specific language variety. For instance, we would use the information that the fr-fr and fr-ca labels belong to the French group. The intuition behind this model is that it will use the highly accurate group prediction at test time to tune into features that are useful within a particular language group for predicting individual varieties. For the ADI and GDI tasks, we do not use the group prediction, since these data sets contain only a single language group.
In principle, the boxes 'Group classifier' and 'Language / variety classifier' in Figure 1 may include multiple layers, allowing the classifier to generalize based on non-linear combinations of its input features. However, in the experiments reported in this paper, we did not use multiple layers in either classifier, since doing so did not improve the results.
The dashed boxes in Figure 1 turn the sequence of word and character embeddings into fixed-size feature vectors. Any network layer or model that extracts useful features from a sequence of embeddings is useful here. Convolutional and recurrent neural networks are typical choices for this step. We have experimented with both methods, as well as with simple averaging of embeddings.
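The simplest of these feature extractors, averaging, can be sketched in a few lines. The sketch assumes that padding positions (id 0 here, an illustrative convention) are excluded from the mean; whether padding was masked in this way in our experiments is not specified above, so this is an assumption.

```python
def average_embeddings(ids, table, pad_id=0):
    """Turn a sequence of symbol ids into one fixed-size vector by averaging.

    table maps each id to its embedding (a list of floats). Positions equal
    to pad_id are ignored; an all-padding input yields a zero vector.
    """
    dim = len(next(iter(table.values())))
    vectors = [table[i] for i in ids if i != pad_id]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]
```

The output has the same dimensionality as the embeddings regardless of the input length, which is the property required of the dashed boxes.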
In the experiments reported below, the documents are padded or truncated to 512 characters for the character embedding input and to 128 tokens for the word embedding input. For both embedding layers, we used dropout with rate 0.40. Both classifiers in the figure were single-layer networks (with softmax activation function), predicting one-hot representations of groups and varieties. The network was trained using a categorical cross-entropy loss function for both outputs, with the Adam optimization algorithm. To prevent overfitting, training was stopped once validation set accuracy had not improved for two iterations. All neural network experiments were implemented using Keras (Chollet, 2015) with the Tensorflow backend (Abadi et al., 2015).
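Two of the preprocessing and training details above, fixed-length inputs and early stopping, can be sketched without any deep learning library. The sketch makes two assumptions that are not stated in the text: padding is added at the end of the sequence, and "two iterations" is read as a patience of two epochs without improvement. The function names are illustrative.

```python
def pad_or_truncate(seq, length, pad=0):
    """Return a list of exactly `length` items: cut long inputs, right-pad short ones."""
    return list(seq[:length]) + [pad] * max(0, length - len(seq))


def early_stopping_epoch(val_accuracies, patience=2):
    """Index of the epoch at which training would stop.

    Training stops at the first epoch where validation accuracy has failed to
    exceed its best value for `patience` consecutive epochs; if that never
    happens, the last epoch is returned.
    """
    best = float("-inf")
    waited = 0
    for epoch, accuracy in enumerate(val_accuracies):
        if accuracy > best:
            best, waited = accuracy, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_accuracies) - 1
```

In this setting, `pad_or_truncate` would be applied with length 512 to character id sequences and with length 128 to token id sequences.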

Cross-lingual parsing
We adopted the word-based MT approach of Tiedemann et al. (2014b) for translating the source language dependency treebank(s) to the target languages. In the first step, we used the efmaral system (Östling and Tiedemann, 2016) to word-align the OPUS parallel corpus of a source-target language pair. We word-aligned the parallel corpus in both directions, source to target and target to source, and then symmetrized the alignments using the grow-diag-final-and method. Then, we supplied the symmetric alignments to Moses (Koehn et al., 2007) and constrained the Moses system to train using phrase translations of length 1. Finally, we used the Moses decoder with the default settings to translate the source language treebank to the target language. The intuition behind this approach is that word-based translations do not require heuristics to correct the trees, unlike the translations that result from the default phrase-based settings of Moses. We used this approach to create treebanks for Norwegian, Croatian, and Slovak.
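The reason word-by-word translation needs no tree correction can be shown with a toy sketch. When every source token is replaced by exactly one target token, in the same order, token indices are unchanged, so heads and dependency labels carry over as they are. The `Token` type and the dictionary lookup below are hypothetical stand-ins for the actual Moses-based translation step, and the one-entry lexicon is a toy example.

```python
from typing import NamedTuple


class Token(NamedTuple):
    idx: int      # 1-based position in the sentence
    form: str     # surface word form
    head: int     # index of the syntactic head (0 = root)
    deprel: str   # dependency relation label


def translate_tree(tokens, lexicon):
    """Replace each word form one-to-one; unknown words are copied unchanged.

    Indices, heads, and labels are untouched, so the result is still a
    well-formed tree over the translated words.
    """
    return [tok._replace(form=lexicon.get(tok.form, tok.form)) for tok in tokens]


# Toy Swedish sentence "jag ser huset" with a hypothetical one-entry lexicon.
swedish = [Token(1, "jag", 2, "nsubj"),
           Token(2, "ser", 0, "root"),
           Token(3, "huset", 2, "obj")]
norwegian = translate_tree(swedish, {"jag": "jeg"})
```

With phrase translations longer than one word, a source token could map to zero or several target tokens, and the head indices would have to be repaired heuristically.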

Results

Language identification
In the language identification subtasks, our best performing models were SVM models with the parameters listed in Table 4. We participated in the shared task using only these models. For the ADI task, we submitted two runs, the first one using both the transcriptions and the i-vectors, and the second one using only the transcriptions. The scores of our systems in each task on the test set are presented in Table 5. According to rankings based on absolute F1 scores, our systems are in the mid-range in all tasks. More precisely, we obtained 4th, 3rd, and 6th positions in the DSL, ADI, and GDI tasks, respectively. However, for the DSL task, the difference from the best score is rather small. Our accuracy scores are behind the top scores in each task by 0.25 %, 6.57 % and 2.78 % for DSL, ADI, and GDI, respectively.
We also present the confusion matrices for each task. For the DSL task, as shown in Table 6, almost all confusions occur within the groups. Within the groups, there seems to be a slight tendency for the members of the group with shorter documents on average to be confused more often. A closer look at the confusions across language groups on the development set reveals that all such documents are difficult to classify correctly without further context. Table 9 lists a few of the documents that were assigned a label from another language group by the classifier. The confused documents mainly consist of named entities, addresses, numbers or other symbols.
The confusion tables for the ADI and GDI tasks are presented in Table 7 and Table 8, respectively. Since each of these tasks covers a single group of varieties, confusions are common in both tables. We do not observe any clear patterns in the mistakes made by the classifier in the ADI task. Similarly, the confusion matrix of the GDI task does not indicate very clear patterns, except that the Lucerne variety seems to be very difficult for our system to identify. The documents from the Lucerne area are more often recognized as being from Basel or Zurich than from Lucerne itself.
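For readers who want to relate confusion matrices to the reported scores, the following sketch derives accuracy and an averaged F1 score from a matrix of counts. It assumes rows are gold labels and columns are predictions, and it uses an unweighted (macro) average over classes; the averaging used in the official ranking may differ, so the choice here is an assumption.

```python
def scores_from_confusion(matrix):
    """Return (accuracy, macro-averaged F1) for a square confusion matrix.

    matrix[i][j] is the number of documents with gold class i that were
    predicted as class j. Per-class F1 is 2*TP / (gold count + predicted count).
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    accuracy = sum(matrix[i][i] for i in range(n)) / total
    f1_scores = []
    for i in range(n):
        true_pos = matrix[i][i]
        gold = sum(matrix[i])                          # row sum: TP + FN
        pred = sum(matrix[r][i] for r in range(n))     # column sum: TP + FP
        f1_scores.append(2 * true_pos / (gold + pred) if gold + pred else 0.0)
    return accuracy, sum(f1_scores) / n
```

A variety that is rarely predicted correctly, as Lucerne is in the GDI task, lowers the macro average more than it lowers overall accuracy, because every class contributes equally to the average.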
In our participation last year, we only used character n-grams as features. Intuitively, character n-grams are useful since they can capture parts of the morphology of languages. This helps in generalizing over suffixes or prefixes that were possibly not observed in the training data. Larger character n-grams also include words, as well as fragments of word sequences. However, very large character n-grams do not provide much help, since they suffer from data sparsity. In our experiments, we often found improvements in language discrimination up to 7-grams. Character n-grams of this length may not be able to capture most variety-specific word bigrams or trigrams. As a result, we expect word n-grams to be useful as well, despite the fact that the information from (large) character n-grams and word n-grams will overlap considerably. To investigate the relative merits of combining character and word n-grams, we present in Table 10 the best average accuracy scores obtained with 10-fold cross validation experiments on the combination of the DSL training and development sets. Increasing the maximum length of the character n-grams helps in all cases up to a character n-gram length of 7. Increasing the maximum word n-gram length also has a positive effect in all cases, although the effect diminishes after bigrams.
As in the previous year, the accuracy of the neural network model was close to that of the SVM model, but despite additional tuning efforts, the neural models did not perform better than the SVM model in any of the tasks. We performed a random search involving the type of feature extractors for characters and words, the length of embeddings for characters and words, the width of the convolutional filter (in case one of the feature extractors was a convolutional network), the length of the embedding representations (number of convolutions, or length of RNN representations), and the amount of dropout used in various parts of the network.
In the case of the DSL development set, the best accuracy score obtained by the neural network was 90.72 %, as opposed to 92.58 % from our best performing SVM model in the same setting. In general, the performance of the model was relatively stable across 200 different random configurations of the hyperparameters listed above, with accuracies between 88 % and 91 %. Convolutional networks performed well over characters, but they yielded poor scores over words, likely due to the large number of filters over words that would be needed to process a multilingual corpus. Recurrent neural network flavors (GRUs and LSTMs) were among the better options for obtaining document representations from the word embeddings. However, simple averaging of the embedding vectors performed similarly. On character features, recurrent networks were impractical in our computing environment due to the longer input sequences (512 characters).
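A random search of this kind can be sketched as drawing independent values for each hyperparameter. The search space below is purely hypothetical: the parameter names and candidate values are illustrative and do not reproduce the ranges we actually searched. Only the number of samples (200) is taken from the text.

```python
import random

# Hypothetical search space; names and values are illustrative assumptions.
SEARCH_SPACE = {
    "char_extractor": ["cnn", "average"],
    "word_extractor": ["cnn", "gru", "lstm", "average"],
    "char_embedding_dim": [16, 32, 64],
    "word_embedding_dim": [32, 64, 128],
    "conv_filter_width": [3, 5, 7],
    "representation_dim": [64, 128, 256],
    "dropout": [0.1, 0.2, 0.4],
}


def sample_configurations(space, n_samples, seed=0):
    """Draw n_samples configurations, choosing each value uniformly at random."""
    rng = random.Random(seed)
    return [{name: rng.choice(options) for name, options in space.items()}
            for _ in range(n_samples)]


configurations = sample_configurations(SEARCH_SPACE, 200)
```

Each sampled configuration would then be trained and scored on the development set, and the best scoring one retained.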

Cross-lingual parsing
We used UDPipe (Straka et al., 2016) to train our parsers on the translated treebanks. We report both the Labeled Attachment Scores (LAS) and the Unlabeled Attachment Scores (UAS) in Table 11. For Norwegian, we trained our system on both the individual and the combined treebanks translated from Swedish and Danish, and obtained the best results (9 points above the baseline) when we trained the dependency parser on the Norwegian treebank translated from Swedish. For Croatian, we obtained slightly better results than the baseline. For Slovak, we obtained an improvement of 10 points over the baseline. In all cases, our results are behind those of the other two participants, by a margin of 5 points for Croatian and Norwegian, and 14 points for Slovak.
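The two evaluation measures can be stated compactly in code. The sketch below scores every token and represents each token as a (head, label) pair; details of the official evaluation, such as the treatment of punctuation or label subtypes, are not modeled here.

```python
def attachment_scores(gold, predicted):
    """Return (LAS, UAS) as percentages.

    gold and predicted are equal-length lists of (head_index, deprel) pairs,
    one per token. UAS counts tokens with the correct head; LAS additionally
    requires the correct dependency label.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    las_hits = sum(g == p for g, p in zip(gold, predicted))
    n_tokens = len(gold)
    return 100.0 * las_hits / n_tokens, 100.0 * uas_hits / n_tokens
```

Since a labeled match implies an unlabeled one, LAS can never exceed UAS for the same parse.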

Discussion and conclusions
In this paper, we described our systems participating in the VarDial 2017 shared tasks. We participated in all four tasks offered during this shared task campaign. Although our main focus has been the language identification tasks, we also participated in the cross-lingual parsing shared task with a simple approach, and reported the results in this paper.
Our participation in the language discrimination tasks, namely Discriminating between similar languages (DSL), Arabic dialect identification (ADI), and German dialect identification (GDI), is similar to our previous year's participation (Çöltekin and Rama, 2016). We experimented with both SVMs and (deep) neural network models. Similar to our experience last year, SVMs performed better than neural networks. This is in line with

Table 11: Labeled (LAS) and unlabeled (UAS) attachment scores obtained by the translation model in comparison to the baseline provided by the organizers.
Unlike last year, when we only used character n-grams, this year we used a combination of character and word n-grams as features, and tuned the maximum order of the n-grams included for each task. We obtained scores competitive with the scores of the other participating teams. In general, all scores are slightly higher for the DSL task compared to last year. Besides the results on the shared task, we presented some results from additional experiments in Section 4. The combination of character and word n-grams seems to have made a small but consistent difference in the experiments performed on the development data.
For the cross-lingual parsing task, we followed a simple method, automatically translating the source treebank and training an off-the-shelf parser on the translated treebank. We did not perform any further adaptation or use pre-trained word representations, which may have been helpful in this task. Although we obtained results that are consistently better than the baseline, our results were substantially lower than the scores of the other two participating systems.