Leveraging Pre-Trained Embeddings for Welsh Taggers

While the application of word embedding models to downstream Natural Language Processing (NLP) tasks has been shown to be successful, the benefits for low-resource languages are somewhat limited due to the lack of adequate data for training the models. However, NLP research efforts for low-resource languages have focused on constantly seeking ways to harness pre-trained models to improve the performance of NLP systems built for these languages, without the need to re-invent the wheel. One such language is Welsh, and therefore, in this paper, we present the results of our experiments on learning a simple multi-task neural network model for part-of-speech and semantic tagging of Welsh using a pre-trained embedding model from FastText. Our model's performance was compared with that of the existing stand-alone rule-based part-of-speech and semantic taggers. Despite its simplicity and its capacity to perform both tasks simultaneously, our tagger compared very well with the existing taggers.


Introduction
The Welsh language can easily be classified as low-resourced in the context of natural language processing because of the lack of the commonly used resources in language research, such as large annotated corpora, as well as of the standard computational tools and techniques for processing such resources.
There is still a long way to go for Welsh, but the situation is improving. For instance, Welsh is fortunate to have a fund that supports an on-going inter-disciplinary and multi-institutional project, the National Corpus of Contemporary Welsh (Corpws Cenedlaethol Cymraeg Cyfoes, CorCenCC; http://www.corcencc.org/), which has been building a large-scale open-source language resource for the contemporary Welsh language.
Existing Welsh part-of-speech (section 2.1) and semantic (section 2.2) taggers produce good results, but their heavy dependence on hand-crafted rules and hard-coded resources may pose a maintenance challenge in future. Also, considering the speed with which languages evolve, especially on the internet, and the huge amount of unannotated corpora that can be collected from the web, we urgently need a system that is capable of learning from unstructured text, in order to guarantee the generalisability and scalability of tagging tools.
Given the potential challenges with the existing approaches, and considering the similarities between the tasks of part-of-speech (POS) and semantic (SEM) annotation, we propose to train a single neural network model that jointly learns both tasks. We aim to require as little human annotation effort as possible and to leverage the linguistic patterns acquired from unsupervised language models such as word embeddings. The main contributions of this research include: (1) the first application, to our knowledge, of multi-task learning to POS and semantic tagging for any language; (2) the ability to improve OOV coverage for the Welsh language using pre-trained embeddings for semantic category extension; (3) public release of two sets of manually checked gold-standard corpora for POS and semantic tagging of Welsh; (4) inter-annotator agreement scores for Welsh semantic tagging; (5) public release of the first Welsh semantic tagger (CySemTagger); (6) the first demonstration of multi-task learning to improve NLP task accuracy for Welsh; and (7) a demonstration of the usefulness of multi-task learning in a mono-lingual setting for a low-resource language.


Background

POS tagging is a well-studied NLP task. Much recent work on this task has moved away from English and European languages to other major languages such as Arabic (Aldarmaki and Diab, 2015), Chinese (Sun and Wan, 2016), dialects thereof (Darwish et al., 2018), and text types containing more noise, such as historical texts (Yang and Eisenstein, 2016; Janssen et al., 2017), learner language (Nagata et al., 2018), code switching (Vyas et al., 2014) and social media varieties (Horsmann and Zesch, 2016; van der Goot et al., 2017). More recently, joint and multi-task learning approaches have been applied to link POS tagging with other tasks such as segmentation or tokenisation (Al-Gahtani and McNaught, 2015; Shao et al., 2017), dependency parsing (Nguyen and Verspoor, 2018) and lemmatisation (Arakelyan et al., 2018).
Beyond these applications and levels, multi-task learning has also been applied with promising results at the semantic level in various scenarios, including cross-lingual sentiment analysis (Wang et al., 2018), opinion and semantic role labelling (Marasović and Frank, 2018), semantic parsing (Bordes et al., 2012), emotion prediction (Buechel and Hahn, 2018), irony detection (Wu et al., 2018) and rumour verification (Kochkina et al., 2018). However, there is very little research that applies multi-task learning to link Word Sense Disambiguation (WSD) or semantic tagging with another task. Here, we refer to semantic tagging as coarse-grained word sense disambiguation based on an existing taxonomy of categories, e.g. in USAS (Rayson et al., 2004). Previously, semantic tagging in multiple languages has been shown to greatly benefit from POS tagging in the NLP pipeline, since it can help to filter out inapplicable semantic fields from the set of possible candidates (Piao et al., 2015).
Over the past few years, researchers have started to port NLP tools and methods to low-resource languages using various approaches, such as porting lexicons from one language to another using bilingual dictionaries and parallel corpora (Piao et al., 2016) and cross-lingual word embeddings (Adams et al., 2017; Sharoff, 2018). Multi-task learning has also proved useful for transferring learning across languages in multilingual settings where one of the languages has only sparse resources available (Junczys-Dowmunt et al., 2018; Lin et al., 2018; Choi et al., 2018), although it has been less successful in named entity recognition settings (Enghoff et al., 2018). In our experiments, we focus on a low-resource mono-lingual setting with a small manually corrected corpus, and combine Welsh POS and SEM annotation for the first time.

CyTag
The rule-based POS tagger under consideration in our work, CyTag (Neale et al., 2018), was built on Constraint Grammar (CG) (Karlsson, 1990), in particular around the latest version of the software, VISL CG-3. The CyTag tagset contains 145 fine-grained POS tags that can be collapsed into 13 EAGLES-conformant broader categories.
CyTag utilises three steps to assign POS tags to tokens:
• A list of candidate POS tags is produced for each token.
• The list of candidate tags for each token is pruned to as few as possible (ideally one) using CG-formatted rules.
• The optimal tag for each token is selected, helped by some small additional processing steps for any cases that were still ambiguous after post-CG.
In the second step listed above, CyTag makes use of a CG-formatted 'grammar' file, currently containing 243 hand-crafted and hard-coded rules, to 'prune' the list of candidate tags to one for ambiguous tokens. The rules are formatted as follows: action (reading) if (neighbour (features)), whereby action refers to the 'operation' to be performed on the reading (e.g. 'selecting' or 'removing'), and neighbour is a nearby token of interest to the target token, on whose features the action depends. CyTag was evaluated using a gold-standard annotated corpus containing 611 sentences (14,876 tokens), as will be described in subsection 3.1.
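To make the rule format concrete, the pruning step can be sketched in Python. This is an illustrative encoding of a CG-style rule, not CyTag's actual CG-3 grammar syntax, and the Welsh readings are invented for the example:

```python
# Minimal sketch of CG-style candidate pruning. A rule is applied to the
# candidate tags of the token at position i; it fires only if the
# neighbouring token at the given offset carries the required feature.

def apply_rule(candidates, i, action, reading, offset, feature):
    """action -- 'select' keeps only `reading`; 'remove' deletes it.
    offset -- position of the neighbour relative to token i.
    feature -- tag the neighbour must carry for the rule to fire."""
    j = i + offset
    if not (0 <= j < len(candidates)) or feature not in candidates[j]:
        return candidates[i]                      # rule does not fire
    if action == "select" and reading in candidates[i]:
        return [reading]
    if action == "remove" and len(candidates[i]) > 1:
        return [t for t in candidates[i] if t != reading]
    return candidates[i]

# "y gath" ("the cat"): the second token is ambiguous between two readings.
candidates = [["det"], ["verb", "noun"]]
# Hypothetical rule: select the noun reading directly after a determiner.
candidates[1] = apply_rule(candidates, 1, "select", "noun", -1, "det")
```

After the rule fires, the ambiguous token is left with the single reading "noun", which mirrors the 'prune to one tag' behaviour described above.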
Another recently-developed POS tagger for Welsh is the WNLT-Tagger, which forms part of the Welsh Natural Language Toolkit (WNLT). WNLT-Tagger is one of the four main modules in WNLT, which is itself built on the GATE (General Architecture for Text Engineering) framework (Cunningham, 2002).

CySemTagger: The Welsh Semantic tagger
CyTag is a precursor to CySemTagger (Piao et al., 2018), an automatic semantic annotation tool that depends on POS-tagged output to assign semantic tags to tokens in Welsh texts.
CySemTagger employs the semantic tagset of Lancaster University's UCREL Semantic Analysis System (USAS). The semantic tagset, which was originally derived from Tom McArthur's Longman Lexicon of Contemporary English (McArthur and McArthur, 1981), has 21 major discourse fields and 232 tags.
CySemTagger is a knowledge-based, rule-based system with the following key components:
• lexicon look-up (both for single words and MWEs)
• part-of-speech tagging (CyTag and WNLT-Tagger)
• semantic category disambiguation
• output formatting and display
CySemTagger is designed to work with any POS tagger, but its performance has so far been assessed only on the coverage of the Welsh text presented to it, i.e. the fraction of tokens to which it is able to assign at least one valid semantic tag. The experiment presented in (Piao et al., 2018) indicates that, on the text coverage evaluation, CySemTagger works better with CyTag than with WNLT-Tagger, with respective text coverage scores of 91.78% and 72.92%.
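The interaction between the lexicon look-up and the POS tagger can be illustrated with a toy sketch. The Welsh entries and the exact look-up behaviour below are our own illustration, not CySemTagger's implementation; only the USAS tag labels (e.g. Z99 for unmatched tokens) follow the published tagset:

```python
# Toy USAS-style lexicon keyed on (word, POS). Keying on POS is what lets
# the POS tagger filter out inapplicable semantic fields: the same word
# form maps to different semantic candidates under different POS tags.
LEXICON = {
    ("ci", "noun"): ["L2"],        # L2: living creatures (animals)
    ("golau", "noun"): ["W2"],     # W2: light (the phenomenon)
    ("golau", "adj"): ["O4.3"],    # O4.3: colour and colour patterns
}

def semtags(word, pos):
    """Return candidate semantic tags for a POS-tagged token,
    or ['Z99'] (USAS 'unmatched') when the lexicon has no entry."""
    return LEXICON.get((word, pos), ["Z99"])

tags = semtags("golau", "adj")
```

Here the adjective reading of "golau" ("light-coloured") retrieves a different semantic field than the noun reading would, which is why the pipeline runs POS tagging before semantic disambiguation.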

Experiments
CyTag and CySemTagger are separate tools that use rule-based methods to achieve their results, and the semantic tagger relies heavily on a part-of-speech tagger to function. The key aim of this paper is to implement a tagging system that:
• learns from unstructured data,
• leverages available embedding models,
• performs both tasks, POS and semantic tagging, simultaneously using a multi-task learning setup.

Experimental data
As mentioned earlier in section 2.1, the instances for training the POS and semantic taggers were extracted from the manually annotated gold-standard evaluation corpus constructed in the CorCenCC project, i.e. the data used for the CyTag and CySemTagger development. This training data comprises 611 tagged sentences (14,876 tokens) stored in eight input files, which contain excerpts from a variety of existing Welsh corpora, including Kynulliad3 (Welsh Assembly proceedings), Meddalwedd (translations of software instructions), Kwici (Welsh Wikipedia articles), LERBIML (multi-domain spoken corpora) and short abstracts of three additional Welsh Wikipedia articles. The fully manually checked version of the gold-standard data, i.e. with both the POS and SEM tags, will be released along with the multi-task model for part-of-speech and semantic tagging. The dataset used for training the multi-task model was built from data instances extracted from the fully tagged version of the gold-standard data; unambiguous tokens (e.g. punctuation and numbers) and those categorised as unknown were removed from the training data. The basic statistics of the data used in our experiment are shown in Table 1.
Although the data used in this experiment is considerably smaller than what is typically used in neural network projects, we assume it is sufficient for exploratory research that aims to build a prototypical framework to support further development of Welsh language tools.

Embedding model
A key contribution of this work to Welsh NLP research is the application of pre-trained embeddings to build the model. Although most deep-learning frameworks provide an embedding layer that allows one to create embeddings from the training data, it is more beneficial to leverage existing models trained on much larger Welsh text data than to rely only on what is currently available. To that end, we used the Welsh pre-trained embedding models built by the FastText project (Grave et al., 2018).

Design of experiment
The key input data to our pipeline consists of the 611 sentences that are jointly annotated with the POS and semantic tags. The combination of the annotation tags on the gold-standard data makes it possible to extract the data in different formats, as shown in Table 3. The format used for this experiment is the last one, 3-BOTH, in which each token is tagged with a concatenation of the POS and semantic tags. The extraction of the instance features for each token is carried out in a two-stage process: chunking the target word together with its three previous tokens (i.e. 4 words in total), followed by vectorisation of the features. The chunking process proceeds with a sliding window along the sentence, with the target word being the rightmost in the chunk. The vectorisation then replaces each word in the chunk with its vector representation from the word-embedding model, forming a matrix of values that represents each training instance. The label for each instance is the tag-ID, i.e. a unique integer assigned to each tag.
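The two-stage feature extraction can be sketched as follows. This is a minimal illustration with a toy embedding table standing in for the FastText model; the zero-padding of sentence-initial positions is our assumption, as the paper does not specify how the first three tokens of a sentence are handled:

```python
import numpy as np

def make_instances(sentence, embed, nvecs, window=4):
    """Slide a window along the sentence (target word rightmost) and
    replace each word in the chunk with its (truncated) embedding vector.
    Positions before the sentence start are padded with zero vectors."""
    instances = []
    for i, _target in enumerate(sentence):
        chunk = sentence[max(0, i - window + 1): i + 1]
        chunk = ["<pad>"] * (window - len(chunk)) + chunk
        vecs = [embed.get(w, np.zeros(nvecs))[:nvecs] for w in chunk]
        instances.append(np.concatenate(vecs))   # one flat feature vector
    return np.stack(instances)

# Toy 300-dimensional embedding table standing in for the FastText model.
embed = {w: np.random.rand(300) for w in ["mae", "y", "gath", "yn", "cysgu"]}
X = make_instances(["mae", "y", "gath", "yn", "cysgu"], embed, nvecs=100)
# X has one row per target word; each row concatenates 4 x 100 values.
```

The `[:nvecs]` truncation corresponds to taking only the first values of each embedding vector, which is how the vector-size parameter described below is applied.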

Model architecture and training setup
The model we used is a simple neural network with only one hidden layer. Each instance is a concatenation of the embedding vectors of the target word and the three previous words, so the size of the input layer is the same as the length of the concatenated vectors. The key parameters required for model training and evaluation are vector size, mini-batch size and dropout rate; different values of each parameter were tested over runs of 50 epochs, as shown in Table 2.
The output layer is the size of the tagset extracted from the training data. Given the annotation format used, each token's tag is a combination of the POS and semantic tags and, as shown in Table 1, the total tagset size is 392. This is comparatively large, but it helps facilitate the multi-task learning that this work aims to achieve.
The model architecture is shallow, as only one hidden layer is used. Ideally, the size of the hidden layer should be somewhere between the sizes of the input and output layers (Reed and Marks, 1999); however, in order to reduce the number of parameters in the model, a size of 256 was chosen. The hidden layer uses the rectified linear unit (ReLU) activation function (Nair and Hinton, 2010), and the network was trained with the Adam optimiser (Kingma and Ba, 2014), as implemented in the integrated TensorFlow-Keras framework (Abadi et al., 2016; Chollet et al., 2015).
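As a rough sketch of this architecture, the forward pass of such a one-hidden-layer network can be written in plain NumPy. This is a stand-in for the actual TensorFlow-Keras implementation; the weight initialisation is illustrative, and the layer sizes follow the values reported in this section (400-dimensional input, 256 hidden units, 392 output tags):

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes from the paper: 4 x 100 input, 256 hidden, 392 output tags.
n_in, n_hidden, n_out = 400, 256, 392

W1 = rng.standard_normal((n_in, n_hidden)) * 0.01
b1 = np.zeros(n_hidden)
W2 = rng.standard_normal((n_hidden, n_out)) * 0.01
b2 = np.zeros(n_out)

def forward(x):
    """One forward pass: ReLU hidden layer, softmax output over the tagset."""
    h = np.maximum(0.0, x @ W1 + b1)              # ReLU activation
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)      # softmax over 392 tags

probs = forward(rng.standard_normal((8, n_in)))   # one mini-batch of 8
```

Each row of `probs` is a distribution over the 392 combined POS+SEM tags, from which the predicted tag-ID is the arg-max.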

Vector size
Given the small size of the training data, and in order not to introduce too many parameters, which can cause over-fitting, we tested the model with different vector sizes (i.e. 10, 50, 100, 200, 300), averaged across a range of values for the other parameters, mini-batch size and dropout rate. Training and evaluation for parameter optimisation were performed over 50 epochs.
With regard to evaluation accuracy, as shown in Figure 1, all vector sizes apart from nvecs = 10 converged within the first 30 to 40 epochs. However, the evaluation loss begins to rise within the first 10 epochs, with most vector sizes exceeding 4.5 before the 50th epoch. To balance these effects, a vector size of 100 was used, i.e. only the first 100 values were taken from each embedding vector to build the input layer, as suggested in (Brownlee, 2017). This produced an input layer of size 400.

Mini-batch size
The training set was chunked into mini-batches, as described in (Ruder, 2016). Mini-batch sizes of 8, 16, 32 and 64 were tested across the other parameter values (see Figure 2). Their average performances indicate that, while there is only a small change in evaluation accuracy across the values, a mini-batch size of 8 yields a slightly lower loss than the others.
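The mini-batch chunking described above can be sketched as a simple generator (an illustrative implementation; shuffling between epochs, which mini-batch gradient descent normally includes, is omitted for brevity):

```python
def minibatches(X, y, batch_size=8):
    """Yield successive mini-batches of instances and their labels;
    the final batch may be smaller than batch_size."""
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]

# 20 toy instances -> batches of 8, 8 and 4 with the chosen size of 8.
X = list(range(20))
y = list(range(20))
batches = list(minibatches(X, y, batch_size=8))
```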

Dropout rate
Given the small quantity of training data, the architecture also implements dropout regularisation (Srivastava et al., 2014) on the hidden layer to reduce the likelihood of over-fitting. Different dropout rates (10%, 20%, 30%, 40% and 50%) were tested, as shown in Figure 3, and a rate of 30% was chosen as the best balance between evaluation accuracy and loss.
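As a minimal sketch, dropout on a batch of hidden activations can be implemented as follows. This shows the inverted variant commonly used in modern frameworks, which rescales the surviving units during training so that nothing needs to change at inference time; whether the paper's model uses exactly this variant is an assumption based on the TensorFlow-Keras default:

```python
import numpy as np

def dropout(h, rate, rng, training=True):
    """Inverted dropout: during training, zero a fraction `rate` of the
    hidden units at random and rescale the rest by 1/(1-rate)."""
    if not training or rate == 0.0:
        return h
    mask = rng.random(h.shape) >= rate        # keep with probability 1-rate
    return h * mask / (1.0 - rate)

rng = np.random.default_rng(0)
h = np.ones((4, 256))                          # a batch of hidden activations
out = dropout(h, rate=0.3, rng=rng)            # ~30% of units zeroed
```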

Batch Normalisation
Batch normalisation addresses the problem of internal covariate shift (Ioffe and Szegedy, 2015) by normalising the inputs to the model layers, thereby increasing the training speed; in some cases, it also acts as a regulariser. A version of the model architecture described above therefore implements batch normalisation: during training, improvement in the model's evaluation accuracy slows down after the first 50 epochs while the loss continues to rise, so techniques that speed up learning were considered in order to investigate the combined impact of speed and regularisation on evaluation accuracy and loss.
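The normalisation itself can be sketched as follows (training-time batch statistics only; the running averages that frameworks keep for inference, and the learned per-feature gamma and beta parameters, are simplified to scalars here):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise each feature over the mini-batch dimension,
    then scale by gamma and shift by beta."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 256)) * 5 + 3   # a mini-batch of activations
y = batch_norm(x)
# Each feature of y now has approximately zero mean and unit variance.
```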

Loss Function
As a multi-class classification task, the standard loss function is the cross-entropy with the softmax activation function (Mannor et al., 2005):

L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} y_{i,t} \log\left(\mathrm{softmax}(X_i)_t\right)

where N is the number of instances in the training batch, T is the number of unique tags, and X_i and y_i are the input values and the corresponding label of instance i, respectively.
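With one-hot labels, the inner sum over tags reduces to the log-probability of the true tag-ID, so the loss computation can be sketched as:

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true tag-IDs under the
    softmax outputs `probs` (shape: batch size x tagset size)."""
    n = len(labels)
    return -np.log(probs[np.arange(n), labels]).mean()

# Two instances over a toy 3-tag tagset; true tag-IDs are 0 and 1.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
loss = cross_entropy(probs, np.array([0, 1]))
```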

Evaluation and discussion
With an accuracy of 93.64% and an F1 score of 95.06% reported previously, CyTag represents the state of the art in Welsh POS tagging. Although no comparable metrics were reported for CySemTagger, it is currently the only semantic tagger for the Welsh language that we are aware of. Therefore, the evaluation results of the multi-task tagger built in this experiment, which performs both POS- and SEM-tagging simultaneously, were compared against these tools.
The effects of dropout regularisation and batch normalisation were examined with the previously selected parameters: vector size = 100, mini-batch size = 8 and dropout rate = 30%. As shown in Table 4, the results indicate that both dropout and batch normalisation achieved significant reductions in evaluation loss, at the expense of some accuracy. Without them, the training accuracy and loss of the multi-task tagger are 99.23% and 0.021 respectively, while the evaluation scores are 95.24% and 6.161. With dropout only, training accuracy and loss are 98.36% and 0.050, while the evaluation scores are 94.89% and 4.880.
Batch normalisation without dropout produced training accuracy and loss scores of 95.51% and 0.144 respectively, and evaluation scores of 92.57% and 3.837. Combining the two achieved a significant reduction in evaluation loss (2.682), but with relatively poorer accuracy scores for training (88.88%) and evaluation (86.66%). Figures 4 and 5 show that, as used in this experiment, batch normalisation had a stronger regularising effect than dropout, slowing down convergence and avoiding over-fitting.

Conclusion
The main motivation for this work is to contribute a useful tool to the fledgling Welsh NLP research effort. There are two key objectives: a) to build a multi-task classifier that can match the performance of the existing rule-based Welsh POS and semantic taggers with as little human input as possible; and b) to leverage existing language models, such as word embeddings, created using unsupervised methods. Our work has demonstrated that these objectives can be achieved, although the results of a small-scale experiment cannot be conclusive. The results obtained in this work compare favourably with those of the existing rule-based models. We have also shown that, in a low-resource setting, a multi-task framework can bring improvements to mono-lingual tasks, which is complementary to previous findings from multilingual multi-task learning scenarios.
In our experiment, the neural network architecture was configured using pre-existing tools and frameworks, following suggestions from the literature. In future work, we will focus on optimising the system parameters to improve the training efficiency and performance of the tagging models, as well as on constructing larger training datasets.