NUIG: Multitasking Self-attention based approach to SigTyp 2020 Shared Task

This paper describes our Multitasking Self-attention based approach to the constrained sub-task within the SigTyp 2020 Shared Task. Our model is a simple neural-network-based architecture inspired by the Transformer (CITATION) model. The model uses multitask learning to compute the values of all WALS features for a given input language simultaneously. Results show that our approach performs on par with the baseline approaches, even though it requires only phylogenetic and geographical attributes, namely Longitude, Latitude, Genus-index, Family-index and Country-index, and does not use any of the known WALS features of the input language to compute its missing WALS features.


Introduction
In this paper we describe our Multitasking Self-attention based approach to the SigTyp 2020 Shared Task (Constrained Sub-task) (Bjerva et al., 2020), which involves predicting the values of features from the WALS typology database for various low-resourced languages. Linguistic typology is the classification of human languages according to their syntactic, phonological and semantic features. WALS (Haspelmath, 2009) is the most popular and comprehensive database providing a list of typological features and their possible values, as well as the respective feature values for most of the world's languages. However, all the popular typological databases (Haspelmath, 2009; Collins and Kayne, 2009; Maddieson et al., 2013; Hartmann and Bradley Taylor, 2013; Bickel et al., 2017; Michaelis and Magnus Huber, 2013), including WALS, suffer from a major shortcoming: limited coverage. In fact, the values of many important typological features for most languages (especially less documented ones) are missing from these databases. This sparked a line of research on the automatic acquisition of such missing typology knowledge (Malaviya et al., 2017; Coke et al., 2016; Daumé III and Campbell, 2009; Littell et al., 2017; Bjerva et al., 2019). Our proposed model is a neural network architecture which takes as input the phylogenetic and geographical attributes of a language. The model then predicts the values of all its typology features simultaneously in a multitask learning setup (Ruder, 2017). Figure 1 depicts the architecture of our proposed model, which computes the values of all WALS typology features for a given language simultaneously. As evident in Figure 1, the architecture comprises three components, namely the Input Network Component, the Self-attention Network Component and the Multitasking Output Networks Component, described in Sections 2.1, 2.2 and 2.3 respectively.

Input Network Component
The input component is a simple two-layered feed-forward neural network. The input to the network is a 5-dimensional vector x comprising the values of five key attributes of a language, namely Longitude, Latitude, Genus-index, Family-index and Country-index, as these are the attributes provided by the train and test datasets (for all languages within the datasets) for the SigTyp 2020 Shared Task. We computed the Genus-index, Family-index and Country-index from the genus, family and countryCode attributes provided within the dataset using respective name-index dictionaries. This two-layered feed-forward network computes an output vector o ∈ R^(T×d), where T is the total number of WALS typology features to be predicted by the model:

h = f(A_1 x + a_1)    (1)
o = A_2 h^T + a_2    (2)

Here A_1 ∈ R^(d×5) and A_2 ∈ R^(T×1) are weights, a_1 ∈ R^d and a_2 ∈ R^(T×d) are biases, and f is a non-linear activation.
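The Input Network Component can be sketched in NumPy as below. The hidden size d, the feature count T, the tanh activation, the random initialisation and the sample attribute values are all illustrative assumptions, not details taken from the paper:

```python
import numpy as np

# Hypothetical dimensions: d is the hidden size, T the number of WALS features.
d, T = 32, 185

rng = np.random.default_rng(0)
A1 = rng.normal(size=(d, 5))   # first-layer weights,  A_1 in R^{d x 5}
a1 = rng.normal(size=(d,))     # first-layer bias,     a_1 in R^d
A2 = rng.normal(size=(T, 1))   # second-layer weights, A_2 in R^{T x 1}
a2 = rng.normal(size=(T, d))   # second-layer bias,    a_2 in R^{T x d}

def input_network(x):
    """Map a 5-dim attribute vector to a (T, d) output, one row per feature."""
    h = np.tanh(A1 @ x + a1)          # eq. (1): h in R^d (tanh is an assumption)
    o = A2 @ h[np.newaxis, :] + a2    # eq. (2): (T,1) @ (1,d) lifts h to R^{T x d}
    return o

# Illustrative input: longitude, latitude, genus-, family- and country-index.
x = np.array([9.0, 53.3, 12.0, 4.0, 101.0])
o = input_network(x)
assert o.shape == (T, d)
```

Each of the T rows of o serves as the initial representation of one typology feature for the downstream components.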

Self-attention Network Component
The architecture of this component is inspired by the Transformer (Vaswani et al., 2017) model. It comprises a stack of N = 6 identical layers, each with two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple fully connected feed-forward network. Hence the input to layer i is always the output of layer i−1, and the input to the first layer is the output of the preceding Input Network Component. For the i-th layer, the feed-forward and self-attention sub-layers are given by equations 3 and 4:

h_i = FFN(k_i)    (3)
k_i = SelfAttention(y_{i−1})    (4)
Here h_i and k_i are the outputs of the feed-forward and self-attention sub-layers respectively. We use the same attention mechanism as Vaswani et al. (2017). The final output of the i-th layer is y_i = h_i. The input to the first layer, y_0, is the output of the previous Input Network Component, and the output of the Self-attention Network Component is the output of the final layer, y_N. It has been observed that various WALS typology features are correlated. Thus, to predict the missing value of a particular typology feature for a specific language, knowledge about the other typology features of that language is useful. The self-attention layers allow the model to capture such knowledge.
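A minimal NumPy sketch of the stacked layers follows. It simplifies the paper's design in ways that should be read as assumptions: single-head rather than multi-head attention, no residual connections or layer normalization, and the T feature slots treated as a sequence of d-dimensional tokens:

```python
import numpy as np

d, T, N = 32, 185, 6  # hidden size, feature count, number of layers (N = 6 as in the paper)
rng = np.random.default_rng(1)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(y, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over the T feature slots."""
    Q, K, V = y @ Wq, y @ Wk, y @ Wv
    weights = softmax(Q @ K.T / np.sqrt(d), axis=-1)
    return weights @ V

def layer(y, params):
    Wq, Wk, Wv, W1, b1, W2, b2 = params
    k = self_attention(y, Wq, Wk, Wv)         # eq. (4): self-attention sub-layer
    h = np.maximum(k @ W1 + b1, 0) @ W2 + b2  # eq. (3): two-layer feed-forward sub-layer
    return h                                  # y_i = h_i

# One parameter set per layer (small random weights, purely illustrative).
params = [
    tuple(rng.normal(scale=0.1, size=s) for s in
          [(d, d), (d, d), (d, d), (d, 4 * d), (4 * d,), (4 * d, d), (d,)])
    for _ in range(N)
]

y = rng.normal(size=(T, d))  # y_0: output of the Input Network Component
for p in params:
    y = layer(y, p)          # y_i = layer_i(y_{i-1})
assert y.shape == (T, d)     # y_N feeds the output classifiers
```

Because every feature slot attends to every other slot, each row of y_N can absorb evidence from correlated typology features, which is the motivation given above.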

Multitasking Output Networks Component
The Multitasking Output Networks Component comprises T independent feed-forward neural-network classifiers. The component splits the output of the previous Self-attention Network Component, i.e. y_N ∈ R^(T×d), into T d-dimensional vectors e_1, e_2, ..., e_T, each corresponding to one of the T typology features to be predicted. The value of the j-th typology feature is computed by applying equation 6:

Pr_j = softmax(B_j e_j + b_j)    (6)
Here 1 ≤ j ≤ T, and Pr_j gives the probability of each of the possible values of the j-th typology feature being the true value. The dimensions of the weights and biases of the j-th classifier depend on the number of possible values of the j-th feature.
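The per-feature classifiers can be sketched as follows. The weight names B_j and b_j, the per-feature value counts and the random initialisation are illustrative assumptions; the key point is that each feature gets its own softmax over its own value set:

```python
import numpy as np

d, T = 32, 185
rng = np.random.default_rng(2)

# Hypothetical per-feature value counts: feature j has V[j] possible values.
V = rng.integers(2, 10, size=T)
B = [rng.normal(size=(v, d)) for v in V]  # classifier weights, one (V_j, d) matrix per feature
b = [rng.normal(size=(v,)) for v in V]    # classifier biases, one V_j-dim vector per feature

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def predict(yN):
    """Split y_N into e_1..e_T and apply an independent softmax classifier to each."""
    preds = []
    for j in range(T):
        e_j = yN[j]                         # d-dim slice for feature j
        pr_j = softmax(B[j] @ e_j + b[j])   # eq. (6): distribution over feature j's values
        preds.append(int(np.argmax(pr_j)))  # predicted value index for feature j
    return preds

yN = rng.normal(size=(T, d))  # stand-in for the Self-attention Network output
preds = predict(yN)
assert len(preds) == T
```

Keeping the T classifiers independent is what makes this a multitask setup: all heads share the same input and self-attention representations but have their own output parameters.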

Training
The parameters of the model described in Section 2 are trained by optimizing the loss function given by equation 7:

Loss = Σ_{t=1}^{T} CE(Pr_t, OH_t)    (7)
Here OH_t is the one-hot encoding of the true value of the t-th typology feature, and CE is the cross-entropy loss. Table 1 lists the hyper-parameters used during training; they were selected by minimizing the loss over the validation set.
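The summed cross-entropy objective can be illustrated with a small NumPy example; the toy distributions and the epsilon added for numerical stability are assumptions:

```python
import numpy as np

def cross_entropy(pr, one_hot):
    """CE between a predicted distribution and a one-hot true value."""
    return -float(np.sum(one_hot * np.log(pr + 1e-12)))  # epsilon guards log(0)

def total_loss(prob_dists, true_indices):
    """Eq. (7): sum of per-feature cross-entropy losses over all T features."""
    loss = 0.0
    for pr, t in zip(prob_dists, true_indices):
        oh = np.zeros_like(pr)  # OH_t: one-hot encoding of the true value
        oh[t] = 1.0
        loss += cross_entropy(pr, oh)
    return loss

# Toy check: two features with 3 and 2 possible values respectively.
dists = [np.array([0.7, 0.2, 0.1]), np.array([0.4, 0.6])]
loss = total_loss(dists, [0, 1])  # equals -log(0.7) - log(0.6)
assert loss > 0
```

Summing the per-feature losses lets a single backward pass update the shared input and self-attention parameters from all T prediction tasks at once.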

Conclusion
In this paper we described our Multitasking Self-attention based approach to the SigTyp 2020 Shared Task, Constrained Sub-task. Our model is a simple neural-network-based approach which computes the values of all WALS features for a given input language simultaneously in a multitask setting. The architecture of our network is inspired by the Transformer (Vaswani et al., 2017). Results show that our approach performs on par with the baseline approaches, even though it uses only five phylogenetic and geographical attributes, namely Longitude, Latitude, Genus-index, Family-index and Country-index, and does not use any of the known WALS features of the input language to compute its missing WALS features.