KMI-Panlingua-IITKGP @SIGTYP2020: Exploring rules and hybrid systems for automatic prediction of typological features

This paper describes the KMI-Panlingua-IITKGP team's participation in the SIGTYP 2020 Shared Task on the prediction of typological features. The task was to predict missing feature values for a given language, provided that the name of its language family, its genus, its location (latitude and longitude coordinates and the name of the country where it is spoken) and a set of feature-value pairs are available. The team submitted three systems: two rule-based and one hybrid. Of these three, one rule-based system achieved the best performance on the test set. All the systems were 'constrained' in the sense that no additional dataset or information, other than that provided by the organisers, was used for developing the systems.


Introduction
This paper documents the KMI-Panlingua-IITKGP team's system submissions to the SIGTYP 2020 Shared Task on the prediction of typological features. The objective of the task is to develop a computational model that predicts (missing) linguistic features of a language, given its location, language family, genus, and a set of feature-value pairs. The shared task organisers provided the dataset, which was extracted from the World Atlas of Language Structures (WALS) (see section 2 for details). Since the dataset was not large and its features were unevenly distributed, we built three different systems and compared them with each other to identify the best model. Two of these systems are rule-based: one is a frequency-based system (see subsection 3.1) and the other is a statistical system (see subsection 3.2). The third is a hybrid system (see subsection 3.3). The statistical system gives the best results of the three (see section 4). This paper makes two main contributions. First, it provides an automatic system that predicts the feature-value pairs of a given language, a tedious job if done manually. Second, it compares three different systems and provides evidence that the statistical model gives better results on the given dataset.

Dataset
The dataset used for this experiment was extracted from the World Atlas of Language Structures (WALS) (Dryer and Haspelmath, 2013). It covered the typological features of close to 2,000 languages (Bjerva et al., 2020). The data were organised in 8 columns: Language ID, Language name, Latitude, Longitude, Genus, Family, Country Codes, and feature-value pairs. The feature values were separated by ' ' for each language. The task was divided into two sub-tasks: (a) Constrained and (b)

Experiments
We carried out three experiments, and this section discusses each of the systems in detail. All three systems are based on two notions: that languages belonging to the same language family (or sub-family, represented as genus in the dataset) share typological properties, and that languages belonging to different language families share areal properties when they are in regular contact, mainly because their speakers live in close geographical proximity.

Baseline System
Our baseline system is a frequency-based system that uses language family and genus-based typological properties to predict the grammatical features of a given language. In the training phase, for each feature, the frequency of each of its values in each language family and genus is calculated and stored. During prediction, for a specific feature in a given language (with a given language family and genus), the feature value with the maximum frequency within that genus (or language family, if the genus does not occur in the training data) is predicted as the value for the feature. If neither the language family nor the genus occurs in the training data, the system predicts a default value for the feature.
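The back-off described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the submitted implementation: the record layout (dicts with 'family', 'genus' and 'features' keys) and the function names are hypothetical, the default value is assumed to be the globally most frequent value of the feature (the text does not specify it), and the sketch backs off whenever a genus-feature pair is unseen rather than only when the genus itself is unseen.

```python
from collections import Counter, defaultdict


def train_frequency_tables(languages):
    """Count feature values per genus, per family, and overall.

    `languages` is an iterable of dicts with the hypothetical keys
    'family', 'genus' and 'features' (a feature -> value mapping).
    """
    by_genus = defaultdict(Counter)   # (genus, feature)  -> value counts
    by_family = defaultdict(Counter)  # (family, feature) -> value counts
    overall = defaultdict(Counter)    # feature           -> value counts
    for lang in languages:
        for feature, value in lang["features"].items():
            by_genus[(lang["genus"], feature)][value] += 1
            by_family[(lang["family"], feature)][value] += 1
            overall[feature][value] += 1
    return by_genus, by_family, overall


def predict_baseline(tables, family, genus, feature):
    """Back off from genus to family to an assumed global default."""
    by_genus, by_family, overall = tables
    candidates = (
        by_genus.get((genus, feature)),
        by_family.get((family, feature)),
        overall.get(feature),  # assumption: default = most frequent overall
    )
    for counts in candidates:
        if counts:
            return counts.most_common(1)[0][0]
    return None  # feature never seen in training
```

Ties between equally frequent values are resolved here by `Counter.most_common`, which is another choice the description leaves open.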

Statistical System
The statistical system is an extension of our baseline system in which the absence of both the language family and the genus from the training data is handled in a more principled way. In such cases, we use the 'distance' between two languages to decide on the feature values. The training phase for this system is exactly the same as that of our baseline system. However, during prediction, the following steps are taken:
Step 1: As in the baseline system, if the genus of the language whose feature is to be predicted was seen during the training phase, and the feature to be predicted was seen in that genus, then the value that was most frequent for that genus-feature combination is predicted. If the genus of the language was not seen, or the feature was not seen in that genus, the same step is carried out with the language family. If neither the genus-feature nor the family-feature combination was seen in the training phase then we move to Step 2.

Step 2: In the second step, the Haversine distance between the language whose feature value is to be predicted and every other language in the training set is calculated from their latitude and longitude. We then take the language families and genera of the four closest languages. We look at the frequency of each feature value across these four closest language families, and the value with the maximum frequency is predicted as the feature value. The choice of four was established experimentally, by trying values from 1 to 10 and selecting the one with the best performance on the training set. If the feature is not found in these four closest language families then we move to Step 3.

Step 3: In the third step, we look at the closest language family and genus that has the feature we are trying to predict. The system takes the feature value with the maximum frequency and predicts it as the value for the feature. Note that in this and the previous step, each feature may have multiple values in a specific language family and genus; the value that occurs in the largest number of languages of that family and genus is considered the most frequent and is therefore predicted.
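Steps 2 and 3 can be illustrated with the sketch below, which is one possible reading of the description rather than the submitted code. The record keys ('lat', 'lon', 'family', 'features') and function names are hypothetical, the sketch pools votes over families only and ignores genus for brevity, and Step 3 is interpreted as walking outward from the nearest language until a family that attests the feature is found.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def family_votes(train, family, feature):
    """Count the values of `feature` across training languages of `family`."""
    return Counter(
        lang["features"][feature]
        for lang in train
        if lang["family"] == family and feature in lang["features"]
    )


def predict_by_distance(target, train, feature, k=4):
    """Step 2: vote over families of the k nearest languages; Step 3: back off."""
    ranked = sorted(
        train,
        key=lambda lang: haversine_km(
            target["lat"], target["lon"], lang["lat"], lang["lon"]
        ),
    )
    # Step 2: pool value counts over the families of the k closest languages.
    votes = Counter()
    for family in {lang["family"] for lang in ranked[:k]}:
        votes.update(family_votes(train, family, feature))
    if votes:
        return votes.most_common(1)[0][0]
    # Step 3: nearest language whose family attests the feature.
    for lang in ranked:
        votes = family_votes(train, lang["family"], feature)
        if votes:
            return votes.most_common(1)[0][0]
    return None  # feature absent from the whole training set
```

The default `k=4` mirrors the value reported above; under this reading, sweeping `k` from 1 to 10 on held-out training languages would reproduce the selection procedure.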

Hybrid System
For the hybrid system, we trained 180 different classifiers, one for each of the 180 features present in the training set. Since the dataset was not large, the distribution of each feature was quite uneven and the features available for training were limited, we used an SVM (Pedregosa et al., 2011; Buitinck et al., 2013) for training each of the classifiers. We experimented with different C values from 0.001 to 10. For each feature, there were 1,100 training data points (one data point per language in the dataset). We used the normalised Haversine distance (calculated using the coordinates), language family, genus, country and the other 179 linguistic features as features for training the classifiers. Any feature not listed for a specific language was considered absent in that language; otherwise its assigned value was used for training. As mentioned earlier, the dataset was imbalanced: some features were adequately represented while others occurred only a few times. The performance of the classifiers varied accordingly, from approximately 0.30 to 0.98 (F-score). Clearly, the classifiers with very low scores could not be used. We therefore decided to use only those classifiers that had an F-score of 0.6 or above; for the other features, the statistical method (outlined in the previous section) was employed. This threshold was again determined experimentally, by comparing multiple systems ranging from an all-classifier system to one using only those classifiers with an F1 score of 0.9 or above. The performance was measured by predicting features in the training set for different languages, i.e. the training accuracy was taken as the benchmark for deciding this value.
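The routing between classifiers and the statistical fallback can be sketched as below. This is only an illustration of the thresholding idea, assuming that each trained classifier is exposed as a callable and that its F-score has already been measured; the function and parameter names are hypothetical and no SVM training is shown.

```python
def build_hybrid_predictor(classifiers, f_scores, fallback, threshold=0.6):
    """Route each feature to its classifier or to the rule-based fallback.

    classifiers: feature -> callable(language) -> predicted value
    f_scores:    feature -> F-score measured for that feature's classifier
    fallback:    callable(language, feature) -> predicted value
    threshold:   minimum F-score for a classifier to be trusted
    """
    # Keep only classifiers that exist and reach the threshold.
    trusted = {
        feature
        for feature, score in f_scores.items()
        if score >= threshold and feature in classifiers
    }

    def predict(language, feature):
        if feature in trusted:
            return classifiers[feature](language)
        # Low-scoring or missing classifier: use the statistical method.
        return fallback(language, feature)

    return predict, trusted
```

Setting `threshold=0.0` corresponds to the all-classifier system mentioned above, and `threshold=0.9` to the strictest variant that was compared.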

Results
Among the three systems, our statistical system performed best on the test set, with a micro-averaged F-score (see Table 2) of slightly under 0.61 (a 10-point gain over the knn-imputation baseline and a 9-point gain over the frequency-based baseline). The hybrid system performed worst among the three, with a score of slightly above 0.56. We had expected the hybrid system to work better than the statistical system (since we assumed that it combined the best of both worlds), but in the final results both rule-based systems (even the more naive one) perform better than the hybrid system. This

Systems                                     Score
kmi-panlingua-iitkgp constrained rule       0.607
kmi-panlingua-iitkgp constrained hybrid     0.562
kmi-panlingua-iitkgp constrained            0.574
frequency-baseline constrained              0.513
knn-imputation-baseline constrained         0.507

• Typological and Areal Features: Typological and areal features are derived from the systematic study of multiple languages, and prior linguistic studies have shown that languages of the same family, as well as geographically close languages, share certain linguistic features. The statistical system makes use of these generalisations about human language and manages to capture these properties, at least partially. This could be one of the reasons why the statistical system performs better than the hybrid system, where the classifier did not have sufficient information to make this kind of generalisation. It also lends explicit support to the arguments for using typological and areal features to augment NLP systems, especially in low-resource situations. In this case, even with minimal data and a rather naive approach, our statistical system has outperformed an SVM-based hybrid system; this itself attests to the fact that typological and areal features capture a significant generalisation about human languages and could prove very valuable, if used judiciously, for low-resource NLP.

Conclusion
In this paper, we presented two rule-based systems and one hybrid system to predict typological features of a given language. We showed that the statistical system, one of the two rule-based systems, gave the best performance on the test set. Only the dataset provided by the organisers was used for developing the systems.