Joint Prediction of Morphosyntactic Categories for Fine-Grained Arabic Part-of-Speech Tagging Exploiting Tag Dictionary Information

Part-of-speech (POS) tagging for morphologically rich languages such as Arabic is a challenging problem because of their enormous tag sets. One reason for this is that in the tagging scheme for such languages, a complete POS tag is formed by combining tags from multiple tag sets defined for each morphosyntactic category. Previous approaches in Arabic POS tagging applied one model for each morphosyntactic tagging task, without utilizing shared information between the tasks. In this paper, we propose an approach that utilizes this information by jointly modeling multiple morphosyntactic tagging tasks with a multi-task learning framework. We also propose a method of incorporating tag dictionary information into our neural models by combining word representations with representations of the sets of possible tags. Our experiments showed that the joint model with tag dictionary information results in an accuracy of 91.38% on the Penn Arabic Treebank data set, with an absolute improvement of 2.11% over the current state-of-the-art tagger.


Introduction
Part-of-speech (POS) tagging is a fundamental task in natural language processing. The granularity of the POS tag set that reflects language-specific information varies from language to language. In morphologically simple languages such as English, the size of the tag set is typically less than a hundred. On the other hand, in morphologically rich languages such as Arabic, the number of theoretically possible tags can be up to 333,000, of which only 2,200 tags might appear in an actual corpus (Habash and Rambow, 2005). One reason for this is that in the tagging scheme for such languages, a complete POS tag is formed by combining tags from multiple tag sets defined for each morphosyntactic category. For example, a complete POS tag for the word Hb ("love") can be defined as the combination of a noun from the coarse POS category, a nominative (n) from the case category, "not applicable" (na) from the mood category, and so on. The enormous number of resulting tags makes fine-grained POS tagging for Arabic more challenging.
In order to perform this task, it is beneficial to utilize information from other morphosyntactic categories when predicting a label for one category. For example, if a word is a noun, it should take one of three tags from the case category: nominative (n), accusative (a), or genitive (g), while it should take "not applicable" (na) from the mood category since mood is not defined for nominals. However, most of the previous approaches in Arabic did not utilize this information, applying one model for each task (Habash and Rambow, 2005; Pasha et al., 2014; Shahrour et al., 2015).
To make use of this information, we propose an approach that jointly models multiple morphosyntactic prediction tasks using a multi-task learning scheme. Specifically, we adopt parameter sharing in our bi-directional LSTM model in the hope that the shared parameters will store information beneficial to multiple tasks. To further boost the performance, we propose a method of incorporating tag dictionary information into our neural models by combining word representations with representations of the sets of possible tags.
Our experiments show that the joint model with tag dictionary information yields the best accuracy on the Penn Arabic Treebank data set with 91.38%, an absolute improvement of 2.11% over the current state-of-the-art.

Fine-Grained Arabic POS Tagging
POS tagging takes a sequence of $n$ words $x_{1:n}$ as input and outputs a corresponding sequence of labels $y_{1:n}$, where $x_t$ is the $t$-th word in a sentence and $y_t \in T$ is the tag of $x_t$. In English, a POS tag is typically taken from a single tag set $T$. By contrast, in morphologically rich languages such as Arabic, a complete POS tag is formed by combining tags from multiple tag sets defined for each morphosyntactic category. For example, a complete POS tag for the word Hb ("love") can be defined as the combination of a noun from the coarse POS category, a nominative (n) from the case category, "not applicable" (na) from the mood category, and so on. Formally, the fine-grained POS tag $y_t^{\mathrm{fine}}$ for a word $x_t$ is defined as the conjunction of the tags $y_t^{(1)}, y_t^{(2)}, \ldots, y_t^{(k)}$ from $k$ tag sets $T^{(1)}, T^{(2)}, \ldots, T^{(k)}$. Our purpose is then to predict all morphosyntactic categories for each word; in other words, this can be seen as a multi-class and multi-label sequential labeling problem.
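The composition of a fine-grained tag from per-category tags can be illustrated with a minimal sketch. The category list, the `m:tag` string format, and the helper names below are assumptions made for illustration only; they are not the paper's representation, and only three of the categories are shown.

```python
from typing import Dict, List

# Hypothetical category order (illustration only; the full scheme has more categories).
CATEGORIES: List[str] = ["pos", "cas", "mod"]


def compose_fine_tag(tags: Dict[str, str], categories: List[str] = CATEGORIES) -> str:
    """Join one tag per category into a single fine-grained tag string."""
    return "|".join(f"{m}:{tags[m]}" for m in categories)


def split_fine_tag(fine_tag: str) -> Dict[str, str]:
    """Recover the per-category tags from a fine-grained tag string."""
    pairs = (item.split(":", 1) for item in fine_tag.split("|"))
    return {m: tag for m, tag in pairs}


# The running example from the text: Hb ("love") as a nominative noun.
hb_tags = {"pos": "noun", "cas": "n", "mod": "na"}
hb_fine = compose_fine_tag(hb_tags)
print(hb_fine)
```

Predicting the fine-grained tag is then equivalent to predicting every entry of the per-category dictionary correctly, which is why the task can be treated as several labeling problems over the same word sequence.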

Model
In this section, we first briefly describe bi-directional LSTMs. We then present our models which use bi-LSTMs for fine-grained Arabic POS tagging. We also propose a method of incorporating tag dictionary information into our neural models by combining word representations with representations of the sets of possible tags.

Bi-directional LSTMs
Recurrent neural networks (RNNs) (Elman, 1990) are a class of neural networks that are capable of handling sequences of any length. An RNN can be seen as a function that reads the input vector $x_t$ at time step $t$ and calculates a hidden state $h_t$ using $x_t$ and the previous hidden state $h_{t-1}$. In classification tasks, the vector $h_t$ is then fed into the output layer, which produces a probability distribution over the possible classes. One of the drawbacks of basic RNNs is that they are difficult to train due to the so-called vanishing gradient problem. Long short-term memory (LSTM) networks (Hochreiter and Schmidhuber, 1997) address this issue by introducing memory cells and gate units that capture long-term dependencies.
A bi-directional LSTM network (Graves and Schmidhuber, 2005) is an extension of an LSTM network that allows modeling of past and future dependencies in arbitrary-length input sequences. The output vector $h_t$ of a bi-LSTM is calculated by concatenating the output vector of the forward directional LSTM that reads the sequence from beginning to end with the output vector of the backward directional LSTM that reads the sequence in the reverse direction.
Figure 1: Top: Our baseline model for the category "cas". We have one model for each category, resulting in 14 models in total. Bottom: How to create character-level embeddings. <w> and </w> indicate the beginning and the end of a word.
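The concatenation of forward and backward outputs can be sketched in a few lines of NumPy. This is an illustrative toy, not the DyNet implementation used in the experiments: the single-matrix gate layout, the initialization scale, and all function names are assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_lstm(input_dim, hidden_dim, rng):
    """Random parameters; one stacked matrix for the input, forget, output and candidate gates."""
    return {
        "W": rng.normal(0.0, 0.1, (4 * hidden_dim, input_dim + hidden_dim)),
        "b": np.zeros(4 * hidden_dim),
    }


def lstm_step(params, x_t, h_prev, c_prev):
    """One LSTM step: update the memory cell and emit the new hidden state."""
    H = h_prev.shape[0]
    z = params["W"] @ np.concatenate([x_t, h_prev]) + params["b"]
    i, f, o = sigmoid(z[:H]), sigmoid(z[H:2 * H]), sigmoid(z[2 * H:3 * H])
    g = np.tanh(z[3 * H:])
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t


def run_lstm(params, xs, hidden_dim):
    """Read the sequence left to right and return the hidden state at each step."""
    h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
    outputs = []
    for x_t in xs:
        h, c = lstm_step(params, x_t, h, c)
        outputs.append(h)
    return outputs


def bilstm(fwd, bwd, xs, hidden_dim):
    """Concatenate forward outputs with backward outputs aligned to the same positions."""
    forward = run_lstm(fwd, xs, hidden_dim)
    backward = run_lstm(bwd, xs[::-1], hidden_dim)[::-1]
    return [np.concatenate([hf, hb]) for hf, hb in zip(forward, backward)]


rng = np.random.default_rng(0)
xs = [rng.normal(size=8) for _ in range(5)]  # five toy input vectors
fwd, bwd = init_lstm(8, 4, rng), init_lstm(8, 4, rng)
hs = bilstm(fwd, bwd, xs, hidden_dim=4)
print(len(hs), hs[0].shape)
```

Each output vector has twice the hidden size, since the first half summarizes the words read so far and the second half summarizes the remaining words in the sentence.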

Independent Prediction Model
For our baseline method, we use a model that independently predicts each morphosyntactic category using bi-LSTMs. Our baseline is similar to the basic model in Plank et al. (2016). The top part of Figure 1 illustrates an overview of our baseline model. Given a sequence of $n$ words $x_{1:n}$, we encode each word $x_t$ into a vector representation $r_t = [w_t; c_t]$, which is the concatenation of the word embedding $w_t$ and the character-level embedding $c_t$. The character-level embedding is computed by concatenating hidden states of the character-level forward LSTM and those of the backward LSTM as depicted in the bottom part of Figure 1. The vector representation $r_t$ is then fed into our bi-LSTM model, giving the forward hidden state $\overrightarrow{h}_t$ and the backward hidden state $\overleftarrow{h}_t$, which are concatenated and fed into the output layer. Finally, we obtain the output label $y_t$ by performing a softmax over the tag set vocabulary. We train models separately for each morphosyntactic category, resulting in 14 models in total.

Joint Prediction Model
Our baseline model does not share any information between morphosyntactic prediction tasks, as the model for each category is trained separately. However, it is beneficial to utilize information from other morphosyntactic categories when predicting a label for one category. In order to do this, we adopt a multi-task learning approach (Collobert et al., 2011; Yang et al., 2016; Bingel and Søgaard, 2017; Martínez Alonso and Plank, 2017). Specifically, we use parameter sharing in the hidden layers of our bi-LSTM model so that we can generate a unified model that can carry information beneficial to each task.
Figure 2 shows an overview of our joint model. The output vectors of the bi-LSTMs are fed into multiple output layers, each performing a corresponding morphosyntactic prediction task. Our model is trained to minimize the cross-entropy loss averaged across all the tasks. The loss function for each input word is defined as follows:
$$\mathcal{L} = \frac{1}{|M|} \sum_{m \in M} L(\hat{y}^{m}, y^{m})$$
where $M = \{\mathrm{pos}, \mathrm{cas}, \mathrm{gen}, \ldots\}$ is the set of morphosyntactic prediction tasks and $L(\hat{y}^{m}, y^{m})$ is the cross-entropy loss for the category $m$.
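As a rough illustration of the per-word joint objective, the sketch below averages the cross-entropy losses of several task-specific output layers. The logits, the two example tasks, and the function names are hypothetical; in the actual model the logits would come from output layers on top of the shared bi-LSTM.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over one task's tag vocabulary."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def cross_entropy(logits, gold_index):
    """Negative log-probability of the gold tag."""
    return -np.log(softmax(logits)[gold_index])


def joint_loss(task_logits, gold):
    """Average the per-task cross-entropy losses for a single word."""
    losses = [cross_entropy(task_logits[m], gold[m]) for m in task_logits]
    return sum(losses) / len(losses)


# Hypothetical output-layer scores for two of the tasks.
task_logits = {
    "pos": np.array([2.0, 0.5, -1.0]),
    "cas": np.array([0.0, 0.0, 0.0, 0.0]),
}
gold = {"pos": 0, "cas": 2}
print(joint_loss(task_logits, gold))
```

Because all output layers read the same hidden vector, the gradient of this averaged loss reaches the shared parameters from every task at once.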

Encoding Tag Dictionary Information
One of our contributions is to incorporate tag dictionary information into our neural models by combining word representations with representations of the sets of possible tags. Unlike previous approaches that use tag dictionary information provided by a morphological analyzer as a hard constraint (Habash and Rambow, 2005; Pasha et al., 2014; Shahrour et al., 2015), we use it as a soft constraint, as well as an additional feature for our model. The drawback of using a morphological analyzer in a pipeline fashion is that the model cannot find the correct tag in the disambiguation step if the analyzer does not return any tag candidates. Previous work reports in its error analysis that 31.3% of its tagging errors were due to this problem. To cope with this issue, we propose a method of encoding tag dictionary information into our neural models instead of using a morphological analyzer in a pipeline fashion. As such, the output of our tagger is not restricted by the output candidates that are generated by the analyzer, and our method can be applied to POS tagging with an arbitrary tag set.
The bottom part of Figure 3 illustrates how to encode tag dictionary information for the word Hb ("love"). First, the input word is given to a tag dictionary that generates sets of possible tags for each morphosyntactic category. The outputs from the dictionary are then fed into the corresponding lookup tables, giving vector representations for possible tags. For each category, we sum over the outputs from the lookup table and then concatenate all the summed vectors into a single vector.
Formally, the encoded vector representation $d_t$ for the input word $x_t$ is computed by concatenating all the sub-vectors defined for each morphosyntactic category $m$:
$$d_t = [d_t^{(m_1)}; d_t^{(m_2)}; \ldots; d_t^{(m_k)}]$$
Each sub-vector $d_t^{(m)}$ is computed with the following equation:
$$d_t^{(m)} = \sum_{d \in D_t^{(m)}} W^{(m)} e_d^{(m)}$$
where $D_t^{(m)}$ is the set of possible tags for the category $m$ given the word $x_t$, $W^{(m)}$ is the embedding matrix for the category $m$, and $e_d^{(m)}$ is a one-hot vector representing the tag $d$ for the category $m$. Finally, the resulting vector $d_t$ is concatenated with the word embedding $w_t$ and the character-level embedding $c_t$, forming the input word representation $r_t = [w_t; c_t; d_t]$ for our model. The top part of Figure 3 illustrates the overall architecture of our proposed model.
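The sum-then-concatenate encoding of the tag dictionary can be sketched as follows. The tag inventories, the toy dictionary entry for Hb, and the zero vector used for words missing from the dictionary are all assumptions for illustration; the text does not specify how out-of-dictionary words are handled.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 10  # size of each per-category dictionary embedding

# Hypothetical tag inventories for three categories (illustration only).
TAGSETS = {
    "pos": ["noun", "verb", "prep"],
    "gen": ["m", "f", "na"],
    "cas": ["n", "a", "g", "na"],
}

# One lookup table per category. Selecting a row is equivalent to multiplying
# an embedding matrix by a one-hot tag vector.
TABLES = {m: rng.normal(size=(len(tags), EMB_DIM)) for m, tags in TAGSETS.items()}

# Hypothetical toy tag dictionary: word -> category -> set of possible tags.
TAG_DICT = {
    "Hb": {"pos": {"noun", "verb"}, "gen": {"m", "na"}, "cas": {"n", "a", "g", "na"}},
}


def encode_tag_dictionary(word):
    """Sum the embeddings of the possible tags per category, then concatenate."""
    entry = TAG_DICT.get(word, {})
    sub_vectors = []
    for m, tags in TAGSETS.items():
        possible = entry.get(m, set())
        rows = [TABLES[m][tags.index(tag)] for tag in possible]
        # Assumption: an empty candidate set yields a zero sub-vector.
        sub_vectors.append(np.sum(rows, axis=0) if rows else np.zeros(EMB_DIM))
    return np.concatenate(sub_vectors)


d_hb = encode_tag_dictionary("Hb")
print(d_hb.shape)
```

Since the candidate tags enter only as an input feature, the output layer remains free to predict a tag outside the candidate set, which is what makes the dictionary a soft rather than a hard constraint.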

Experiments
In this section, we present our experimental setup and results. We report tagging accuracy on two data sets: the Penn Arabic Treebank (PATB) data set and the Arabic Universal Dependencies Treebank (UD Arabic) data set. We also report the effects of tag dictionary information in both data sets.

Implementation Details
We implement all bi-LSTM models using the DyNet library (Neubig et al., 2017). We use the same hyperparameters throughout the independent and joint models, i.e., Adam with cross-entropy loss, mini-batch size of a single sentence, 100 dimensions for word embeddings, 50 for character-level embeddings, 10 for each morphosyntactic dictionary embedding, 500 hidden states, 100 dimensions for output layers, random initialization for the embeddings, and no dropout regularization. We do not use external resources for the word embeddings in order to emulate the data availability of earlier work as much as possible. The number of epochs is optimized based on evaluation over the development set, to a maximum of 10 epochs. We use ALMOR (Habash, 2007), which is part of the MADAMIRA distribution (Pasha et al., 2014), alongside the SAMA database (Maamouri et al., 2010c) to create the tag dictionary.
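For reference, the hyperparameters listed above can be collected in a small configuration object. The class and field names are hypothetical, and the `input_dim` helper rests on two assumptions: that 50 is the size of the character-level vector as it enters the word representation, and that one 10-dimensional dictionary sub-vector is used per category.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggerConfig:
    """Hyperparameters as listed in the text; names are illustrative."""
    word_emb_dim: int = 100
    char_emb_dim: int = 50
    dict_emb_dim: int = 10       # per morphosyntactic category
    hidden_dim: int = 500
    output_layer_dim: int = 100
    optimizer: str = "adam"
    batch_size: int = 1          # one sentence per mini-batch
    dropout: float = 0.0         # no dropout regularization
    max_epochs: int = 10
    num_categories: int = 14     # PATB; the UD Arabic setup uses 17

    def input_dim(self, use_dict: bool) -> int:
        """Size of the word representation under the assumptions stated above."""
        dim = self.word_emb_dim + self.char_emb_dim
        if use_dict:
            dim += self.dict_emb_dim * self.num_categories
        return dim


patb_config = TaggerConfig()
ud_config = TaggerConfig(num_categories=17)
print(patb_config.input_dim(use_dict=True), ud_config.input_dim(use_dict=True))
```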

Data Sets
The PATB Data Set. In order to compare our models with the current state-of-the-art tagger, we use the Penn Arabic Treebank (PATB, parts 1, 2 and 3) (Maamouri et al., 2010a, 2011, 2010b) with the same partitioning as Diab et al. (2013). The statistics of the data set are shown in Table 2. The data sets are pre-processed as in Pasha et al. (2014) to correct annotation inconsistencies and to obtain the morphosyntactic feature representation for each word. All the Arabic characters are transliterated according to the Buckwalter transliteration scheme (Buckwalter, 2002) and each numerical digit is substituted with 0.
Table 3: Number of sentences, tokens, and fine-grained POS tags in the UD Arabic data set.
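The digit normalization step described above can be sketched with a single regular expression. The function name is hypothetical, and the use of `\d`, which in Python 3 matches any Unicode decimal digit on `str` input, is an assumption about how non-ASCII digits should be treated.

```python
import re

# Assumption: every Unicode decimal digit is normalized, not only ASCII 0-9.
_DIGIT = re.compile(r"\d")


def normalize_digits(token: str) -> str:
    """Replace each numerical digit in a token with the character 0."""
    return _DIGIT.sub("0", token)


print([normalize_digits(tok) for tok in ["fy", "1997", "3.14"]])
```

Collapsing digits in this way keeps number-like tokens from inflating the vocabulary while preserving their length and punctuation.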
For the fine-grained POS tag set, we use the universal POS tags and 16 of the morphological features defined in the UD Arabic data set. The annotations in the UD Arabic data set are automatically converted from the Prague Arabic Dependency Treebank (Smrž et al., 2008). Table 4 shows the lists of possible values for each morphosyntactic category. The annotations in UD Arabic are different from those in PATB with regard to the choice of categories and their granularity, although there are some overlaps in categories such as gender and person. For pre-processing, each numerical digit is substituted with 0.
Table 6: Performance comparison of the different models, each of which uses a single morphosyntactic category in its tag dictionary embeddings, on the PATB data set. +m in the leftmost column indicates the use of the category m to form the tag dictionary embeddings. +all indicates the use of all categories to form the tag dictionary embeddings. Boldfaced numbers represent the largest improvement in the category to predict (minimum of 0.05% absolute).
i.e., the fine-grained POS tag (All). For comparison, we use CamelParser (Shahrour et al., 2015), the current state-of-the-art tagger. CamelParser is an improved version of the previous state-of-the-art tagger MADAMIRA (Pasha et al., 2014), which ranks the possible analyses provided by a morphological analyzer using SVMs. CamelParser adjusts the outputs of MADAMIRA by utilizing case-state classifiers that incorporate additional syntactic information provided by a dependency parser and hand-written rules. The tag set used in CamelParser is compatible with the 14 morphosyntactic categories we use.
Tagging Accuracy on the UD Arabic Data Set. For the UD Arabic data set, we report tagging accuracy over the 17 morphosyntactic categories (i.e., the universal POS tags and 16 morphological features) and their combination (All). We use independent models with and without tag dictionary information and joint models with and without tag dictionary information for this data set.
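The per-category accuracies and the combined "All" accuracy can be computed as in the following sketch, which assumes that gold and predicted analyses are available as one dictionary of category tags per token. The function name and the toy data are illustrative only.

```python
from typing import Dict, List

Tags = Dict[str, str]


def tagging_accuracy(gold: List[Tags], pred: List[Tags]) -> Dict[str, float]:
    """Per-category accuracy plus 'All', the share of tokens with every category correct."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    categories = list(gold[0])
    n = len(gold)
    scores = {
        m: 100.0 * sum(g[m] == p[m] for g, p in zip(gold, pred)) / n
        for m in categories
    }
    all_correct = sum(
        all(g[m] == p[m] for m in categories) for g, p in zip(gold, pred)
    )
    scores["All"] = 100.0 * all_correct / n
    return scores


# Toy example with two categories and four tokens.
gold = [
    {"pos": "noun", "cas": "n"},
    {"pos": "prep", "cas": "na"},
    {"pos": "noun", "cas": "g"},
    {"pos": "verb", "cas": "na"},
]
pred = [
    {"pos": "noun", "cas": "n"},
    {"pos": "prep", "cas": "na"},
    {"pos": "noun", "cas": "a"},
    {"pos": "noun", "cas": "n"},
]
print(tagging_accuracy(gold, pred))
```

"All" can never exceed the lowest per-category accuracy, which is why it is the strictest of the reported metrics.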

Most Influential Categories
For both data sets, we conduct additional experiments to investigate which morphosyntactic category in the tag dictionary embeddings contributes most to the performance. Specifically, instead of using all morphosyntactic categories to create the tag dictionary embeddings, we use only one at a time. In other words, we skip the last step of concatenating all the sub-vectors defined for each morphosyntactic category, and use only one of the sub-vectors for the tag dictionary embeddings.

The PATB Data Set
Our Models vs CamelParser. Table 5 illustrates our experimental results on the PATB data set. The best performing model was the joint model with tag dictionary embeddings (+Dict), achieving an accuracy of 91.38% on the strictest metric "All" (i.e., the fine-grained POS tag) with an absolute improvement of 2.11% over CamelParser, the current state-of-the-art tagger. This model outperforms CamelParser in every morphosyntactic category. Among these categories, the most notable improvement is the case category (cas) with an absolute improvement of 2.08% over the current state-of-the-art system. Leaving out the dictionary embeddings (+Dict) reduces the performance by 1.89% absolute, but still outperforms CamelParser without using any additional resources such as a morphological analyzer or a dependency parser, indicating the effectiveness of joint modeling of morphosyntactic categories. On the other hand, the independent model gives an accuracy of 87.74%, which is 1.53% absolute worse than CamelParser. However, adding dictionary embeddings (+Dict) enhances the performance with an absolute improvement of 2.43% and yields the second-best accuracy, showing the impact of the additional dictionary feature.
Table 7: Tagging accuracies on the UD Arabic data set. All is the percentage where all categories were correct (i.e., the fine-grained POS tag). +Dict indicates the use of the tag dictionary embeddings.
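The accuracies implied by the differences quoted above can be checked with a short calculation. Only 91.38 and 87.74 are stated directly in the text; the other values below are derived from the reported absolute differences rather than read from Table 5, so they should be treated as a consistency check only.

```python
from decimal import Decimal

joint_dict = Decimal("91.38")                 # joint model with +Dict (stated)
independent = Decimal("87.74")                # independent model (stated)

camelparser = joint_dict - Decimal("2.11")    # derived from the 2.11% improvement
joint_plain = joint_dict - Decimal("1.89")    # derived: joint model without +Dict
independent_dict = independent + Decimal("2.43")  # derived: independent model with +Dict

ranking = sorted(
    {
        "joint +Dict": joint_dict,
        "independent +Dict": independent_dict,
        "joint": joint_plain,
        "CamelParser": camelparser,
        "independent": independent,
    }.items(),
    key=lambda item: item[1],
    reverse=True,
)
for name, accuracy in ranking:
    print(f"{name}: {accuracy}")
```

The derived figures agree with the prose: the independent model trails the derived CamelParser score by exactly 1.53 points, and the independent model with dictionary embeddings ranks second.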

Most Influential Categories
Which morphosyntactic category in the tag dictionary embeddings contributes most to the performance? Table 6 compares the performance of the different models, each of which uses a single morphosyntactic category in its tag dictionary embeddings. The category that contributes most in the tag dictionary embeddings is the coarse POS category (+pos) with an absolute improvement of 1.48% on the metric "All". It is worth mentioning that case and state categories are tied for the second most contributing category, which supports CamelParser's idea that improving the prediction of case and state categories will provide further performance gains. Looking at the effects on each category to predict, the embeddings for coarse POS (+pos) give the best improvement in 5 categories: coarse POS (pos), gender (gen), case (cas), mood (mod), and voice (vox). We can see that the information carried by the coarse POS category plays a central role for predicting other morphosyntactic categories, especially for the case category. On the other hand, in 8 categories, the best improvement was achieved when the category used for the tag dictionary embeddings was the same as the category to predict. The 8 categories were: coarse POS (pos), number (num), person (per), state (stt), three of the proclitics (prc0, prc1, prc2), and enclitic (enc). This result suggests that the tag dictionary embeddings of a given category behave as a soft constraint when predicting the same category, which makes intuitive sense.

The UD Arabic Data Set
Table 7 illustrates our experimental results on the UD Arabic data set. The independent model gives an accuracy of 86.34% on the metric "All" (i.e., the fine-grained POS tag). Adding the tag dictionary embeddings (+Dict) improves the accuracy with an absolute improvement of 2.72%. Unlike the PATB data set, the joint model outperformed both independent models regardless of the use of the tag dictionary embeddings.
The best performing model was the joint model with the tag dictionary embeddings (+Dict), achieving an accuracy of 91.68%. We can observe that the overall results show similar tendencies to the results on the PATB data set in spite of the different annotation schemes.

Most Influential Categories
Table 8 compares the performance of the different models, each of which uses a single morphosyntactic category in its tag dictionary embeddings, on the UD Arabic data set. As in the results on the PATB data set, the coarse POS category (+pos) is the category that contributes the most in the tag dictionary embeddings, giving an absolute improvement of 0.92% on the metric "All". It also gives the best improvement in 8 categories: POS, Aspect, Case, Definite, Foreign, Gender, Number, Person, and Voice. This result confirmed that the possible tag information from the POS category is more effective than information from the other categories.
On the other hand, unlike in the PATB data set, we do not observe a relationship between the category used for the tag dictionary embeddings and the category to predict, presumably because of the difference in the annotation schemes.
Table 8: Performance comparison of the different models, each of which uses a single morphosyntactic category in its tag dictionary embeddings, on the UD Arabic data set. +m in the leftmost column indicates the use of the category m to form the tag dictionary embeddings. +all indicates the use of all categories to form the tag dictionary embeddings. Boldfaced numbers represent the largest improvement in the category to predict (minimum of 0.05% absolute).

Related Work
Diab et al. (2004) proposed a segmentation-based approach, in which they tag each clitic-segmented token using SVMs. Mohamed and Kübler (2010) proposed a word-based approach which takes space-delimited words as inputs and uses memory-based learning. Their experiment showed that the word-based approach performed better than the segmentation-based approach, avoiding segmentation error propagation. Zhang et al. (2015) proposed joint modeling of segmentation, POS tagging, and dependency parsing using a randomized greedy algorithm. The aforementioned studies were focused on tagging with reduced POS tag sets whose sizes ranged from 12 to 993. In contrast, we use one of the most fine-grained POS tag sets, with about 2,000 tags appearing in our training set.
In the context of fine-grained POS tagging, Mueller et al. (2013) presented an approximated higher-order CRF for morphosyntactic tagging across six languages, assuming gold clitic segmentation. Pasha et al. (2014) used an analyze-and-disambiguate approach, in which they ranked the possible analyses provided by a morphological analyzer for each space-delimited word. The state-of-the-art tagger (Shahrour et al., 2015) extended their model by adjusting the outputs of Pasha et al.'s tagger by utilizing case-state classifiers that incorporate additional syntactic information provided by a dependency parser and hand-written rules.
Compared to their approaches, our model is simple but powerful: It does not assume gold clitic segmentation, since segmentation is also modeled as part of the morphosyntactic categories, nor does it require the additional pipeline process of syntactic parsing. Nonetheless, it is more accurate than the current state-of-the-art.
Another related line of work tackles sequential labeling problems using multi-task learning with deep neural networks and investigates situations where multi-task learning leads to improvements in performance (Bingel and Søgaard, 2017; Martínez Alonso and Plank, 2017). Although our main focus is not on investigating the most effective task combination, it can be worth experimenting with various configurations in our settings.
With regard to the use of outputs from a morphological analyzer as additional features, our work is closely related to Bohnet et al. (2013) and Shen et al. (2016). Bohnet et al. (2013) presented a joint approach for morphological and syntactic analysis for morphologically rich languages, integrating additional features that encode whether a tag is in the dictionary or not. Shen et al. (2016) proposed an approach in which they encode a sequence of possible morphosyntactic tags provided by a morphological analyzer using bi-directional LSTMs. In contrast, we provide an alternative way of encoding this information, as well as an analysis on the most influential categories in the encoded tag embeddings.

Conclusions
We presented an approach for fine-grained Arabic POS tagging that jointly models each morphosyntactic tagging task using a multi-task learning framework. We also proposed a method of incorporating tag dictionary information into our neural models by combining word representations with representations of the sets of possible tags. The joint model with tag dictionary information results in the best accuracy of 91.38% with an absolute improvement of 2.11% over the current state-of-the-art tagger. In addition, our experiments showed that the proposed method of encoding tag dictionary information improves the tagging accuracy even on a data set with different annotations.
One potential future direction to explore is domain adaptation to Arabic dialects, since our approach is easily applicable as it does not require construction of a morphological analyzer for each dialect. Another direction is to make use of publicly available dictionaries such as Wiktionary to construct a tag dictionary.