Character Eyes: Seeing Language through Character-Level Taggers

Character-level models have been used extensively in recent years in NLP tasks as both supplements and replacements for closed-vocabulary token-level word representations. In one popular architecture, character-level LSTMs are used to feed token representations into a sequence tagger predicting token-level annotations such as part-of-speech (POS) tags. In this work, we examine the behavior of POS taggers across languages from the perspective of individual hidden units within the character LSTM. We aggregate the behavior of these units into language-level metrics which quantify the challenges that taggers face on languages with different morphological properties, and identify links between synthesis and affixation preference and emergent behavior of the hidden tagger layer. In a comparative experiment, we show how modifying the balance between forward and backward hidden units affects model arrangement and performance in these types of languages.


Introduction
Subword vector representations are now a standard part of neural architectures for natural language processing (e.g., Bojanowski et al., 2017; Peters et al., 2018).
In particular, character representations have been shown to handle out-of-vocabulary words in supervised tagging tasks (Ling et al., 2015; Lample et al., 2016). These advantages generalize across multiple languages, where morphological formation may differ greatly but the character composition of words remains a relatively reliable primitive (Plank et al., 2016).
While the advantages of character-level models are readily apparent, existing evaluation methods fail to explain the mechanism by which these models encode linguistic knowledge about morphology and orthography. Different languages exhibit character-word correspondence in very different patterns, and yet the bi-directional LSTM appears to be, or is assumed to be, capable of capturing them all. In large multilingual settings, it is not uncommon to tune hyperparameters on a handful of languages, and apply them to the rest (e.g., Pinter et al., 2017).
* Work done while at Georgia Institute of Technology.
In this work, we challenge this implicit generalization. We train character-based sequence taggers on a large selection of languages exhibiting various strategies for word formation, and subject the resulting models to a novel analysis of the behavior of individual units in the character-level Bi-LSTM hidden layer. This reveals differences in the ability of the Bi-LSTM architecture to identify parts-of-speech, based on typological properties: hidden layers trained on agglutinative languages find more regularities on the character level than in fusional languages; languages that are suffix-heavy give a stronger signal to the backward-facing hidden units, and vice versa for prefix-heavy languages. In short, character-level recurrent networks function differently depending on how each language expresses morphosyntactic properties in characters.
These empirical results motivate a novel Bi-LSTM architecture, in which the number of hidden units is unbalanced across the forward and backward directions. We find empirical correspondence between the analytical findings above and performance of such unbalanced Bi-LSTM models, allowing us to translate the typological properties of a language into concrete recommendations for model selection.

Related Work
Several recent papers attempt to explain neural network performance by investigating hidden state activation patterns on auxiliary or downstream tasks. On the word level, Linzen et al. (2016) trained LSTM language models, evaluated their performance on grammatical agreement detection, and analyzed activation patterns within specific hidden units. We build on this analysis strategy as we aggregate (character-) sequence activation patterns across all hidden units in a model into quantitative measures.
Substantial prior work exists on the character level as well (Karpathy et al., 2015; Vania and Lopez, 2017; Kementchedjhieva and Lopez, 2018; Gerz et al., 2018). Smith et al. (2018) examined the character component in multilingual parsing models empirically, comparing it to the contribution of POS embeddings and pre-trained embeddings. Chaudhary et al. (2018) leveraged crosslingual character-level correspondence to train NER models for low-resource languages. Most related to our work is Godin et al. (2018), who compared CNN and LSTM character models on a type-level prediction task on three languages, using the post-network softmax values to see which models identify useful character sequences. Unlike their analysis, we examine a more applied token-level task (POS tagging), and focus on the hidden states within the LSTM model in order to analyze its raw view of word composition.
Our analysis assumes a characterization of unit roles, where each hidden unit is observed to have some specific function. Findings from Linzen et al. (2016) and others suggest that a single hidden unit can learn to track complex syntactic rules. Radford et al. (2017) find that a character-level language model can implicitly assign a single unit to track sentiment, without being directly supervised. Kementchedjhieva and Lopez (2018) also examine individual units in a character model and find complex behavior by inspecting activation patterns by hand. In contrast, our metrics are motivated by discovering these units automatically, and capturing unit-level contributions quantitatively.

Tagging Task
We train a set of LSTM tagging models, following the setup of Ling et al. (2015). A word representation trained from a character-level LSTM submodule is fed into a word-level bidirectional LSTM, with each word's hidden state subsequently fed into a two-layer perceptron producing tag scores, which are then softmaxed to produce a tagging distribution. For languages with additional morphosyntactic attribute tagging, we follow the architecture in Pinter et al. (2017) where the same word-level Bi-LSTM states are used to predict each attribute's value using its own perceptron+softmax scaffolding. In order to produce character models which would be as informative as possible to our subsequent analysis, we do not include word-level embeddings, pre-trained or otherwise, in our setup.
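To make the character-level submodule concrete, the following is a minimal sketch of how a word representation can be formed by concatenating the final states of a forward and a backward character LSTM. It is an illustration only, not our DyNet implementation: the class `CharLSTM`, the function `word_representation`, the tiny dimensions, and the random, untrained weights are all assumptions chosen to keep the example short.

```python
import math
import random


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class CharLSTM:
    """Single-direction LSTM over character embeddings (random, untrained weights)."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        cols = input_dim + hidden_dim + 1  # input, previous output, bias term
        # One weight matrix per gate: input, forget, output, candidate.
        self.weights = [
            [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(hidden_dim)]
            for _ in range(4)
        ]

    def run(self, embeddings):
        """Return the per-character output vectors, one per input embedding."""
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        outputs = []
        for x in embeddings:
            z = x + h + [1.0]
            pre = [[sum(w * v for w, v in zip(row, z)) for row in gate]
                   for gate in self.weights]
            i = [_sigmoid(a) for a in pre[0]]
            f = [_sigmoid(a) for a in pre[1]]
            o = [_sigmoid(a) for a in pre[2]]
            g = [math.tanh(a) for a in pre[3]]
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
            outputs.append(h)
        return outputs


def word_representation(word, char_emb, fwd, bwd):
    """Concatenate the final forward state and the final backward state."""
    embs = [char_emb[ch] for ch in word]
    return fwd.run(embs)[-1] + bwd.run(embs[::-1])[-1]


rng = random.Random(0)
EMB_DIM = 8  # illustrative only; far smaller than the sizes used in the paper
char_emb = {ch: [rng.uniform(-0.1, 0.1) for _ in range(EMB_DIM)]
            for ch in "abcdefghijklmnopqrstuvwxyz"}
fwd = CharLSTM(EMB_DIM, 3, rng)  # 3 forward units
bwd = CharLSTM(EMB_DIM, 5, rng)  # 5 backward units (deliberately unbalanced)
rep = word_representation("tagger", char_emb, fwd, bwd)
```

Each output is a sigmoid gate multiplied by a tanh, so every activation lies strictly inside (-1, 1). The two directions need not be the same size, and the representation length is simply the sum of the two.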

Language Selection
As our goal is to examine the relationship between character-level modeling and linguistic properties, we based language selection on two morphological properties deemed relevant to the architectural effects examined. All 24 datasets were obtained from Universal Dependencies (UD) version 2.3, and linguistic properties were found in the World Atlas of Language Structures (Bickel and Nichols, 2013; Dryer, 2013).
The selected languages and their properties are presented in Table 1. We note that eleven of the 24 languages selected are not Indo-European.
Affixation. To evaluate the role of forward and backward units in a bidirectional model, we selected all languages available in UD which are not classified as either weakly or strongly suffixing in inflectional morphology (the vast majority of UD languages). This includes a single prefixing language (Coptic), two equally suffixing and prefixing languages (Basque and Irish), and two languages with little affixation (Thai and Vietnamese).
Morphological Synthesis. Linguistically functional features vary between being expressed as distinct tokens (isolating languages), detectable unique character substrings (agglutinative), fused together but still distinguishable from the stem (fusional), and non-linearly represented within the word form (introflexive). This property has previously been found to affect performance in character-level models (Pinter et al., 2017; Gerz et al., 2018; Chaudhary et al., 2018), and thus we select representatives of each group, including most available non-fusional languages.

Technical Setup
Most of our selected languages have only a single UD 2.3 treebank. For languages with multiple treebanks we selected the largest, except in the cases of Spanish and Indonesian, where we selected the GSD treebanks. The Irish IDT treebank has only a train and test split, so we used the test set for early stopping. The Thai PUD treebank only provided a single dataset with 1000 instances, which we shuffled and partitioned into an 850/150 split. Tokens were normalized to remove noisy data: tokens containing 'http' were replaced with 'URL' and tokens containing '@' were replaced with 'EMAIL'. This was most relevant (293 replacements) for the English treebank, which contained many long URLs.
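The two preprocessing steps above can be sketched as follows. The function names, the order in which the two substring checks are applied, the random seed, and the reading of the larger partition as the training portion are assumptions made for illustration; only the replacement rules and the 850/150 sizes come from the text.

```python
import random


def normalize_token(token):
    """Replace noisy tokens: anything containing 'http' or '@'."""
    if "http" in token:  # assumption: the URL check takes precedence
        return "URL"
    if "@" in token:
        return "EMAIL"
    return token


def shuffle_split(instances, n_first, seed=0):
    """Shuffle a copy of the instances and cut it into two partitions."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)  # the seed is a hypothetical choice
    return shuffled[:n_first], shuffled[n_first:]


sentence = ["See", "https://universaldependencies.org", "or", "mail", "me@example.org"]
normalized = [normalize_token(t) for t in sentence]
larger, smaller = shuffle_split(range(1000), 850)
```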
Hyperparameters. For the initial bidirectional character-level LSTM, we used a total hidden state size of 128 (64 units in each direction). The character embedding size is set to 256, initialized using the method of Glorot and Bengio (2010). The word-level bidirectional LSTM has two layers and a hidden state size of 128, with 50% dropout applied in the style of Gal and Ghahramani (2016).
Each attribute-prediction MLP has a single hidden layer that is the same size as the tagset size for that attribute, and includes a tanh nonlinearity. Models were trained for up to 80 epochs, and we select the model with the highest POS tagging accuracy on the dev set. Training used SGD with 0.9 momentum, and all models were implemented using DyNet 2.0 (Neubig et al., 2017).
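For quick reference, the settings listed in this section can be gathered into one configuration object. The sketch below is only a restatement of the values given in the text; the dictionary layout and key names are hypothetical and do not correspond to DyNet arguments.

```python
# Hypothetical key names; the values are the ones stated in the text.
CONFIG = {
    "char_lstm": {"forward_units": 64, "backward_units": 64, "total_hidden": 128},
    "char_embedding": {"size": 256, "init": "glorot"},
    "word_bilstm": {"layers": 2, "hidden_size": 128, "dropout": 0.5},
    "attribute_mlp": {"hidden_layers": 1, "hidden_size": "tagset size", "nonlinearity": "tanh"},
    "training": {
        "max_epochs": 80,
        "optimizer": "sgd",
        "momentum": 0.9,
        "model_selection": "dev POS accuracy",
    },
}

char = CONFIG["char_lstm"]
balanced = char["forward_units"] == char["backward_units"]
```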

Results
In our initial setup, we represent words using a concatenation of the final states from a bidirectional character-level LSTM with 64 forward and backward hidden units each. The results for POS tagging, presented in Table 1, are on par with similar models (Plank et al., 2016, for example) despite not including a word-level type embedding component. We attribute this success to our large character embedding size of 256, corroborating findings reported by Smith et al. (2018).

Analysis
We next analyze the models trained on the tagging task in an attempt to see how their character-level hidden states encode different manifestations of linguistic information. We suggest that individual hidden units in the character-level sequence model attune to track patterns in the words which would indicate their linguistic roles (POS and morphological properties), and so patterns in character-role regularity across typologically different languages would manifest themselves in an observable form at the individual unit activation level. This motivates us to devise metrics which would characterize languages through aggregation of individual unit behavior.

Metrics
For each language, we run the character-level Bi-LSTM from the trained tagger on POS-unambiguous word types occurring frequently in the training set, grouped into their parts of speech. This filtering was done in order to focus on the more consistent generalizations found by the taggers during training, as our goal is to qualify properties of languages. On each word $w$, we observe the activation level (output) $h_i^c$ of each hidden unit $h_i$ on each character $c$. We obtain a base measure $b(w, i)$ based on the activation pattern. For example, the average absolute base measure is defined as the average of absolute value activations:
$$b_{\mathrm{avg}|\cdot|}(w, i) = \frac{1}{|w|} \sum_{c=1}^{|w|} \left| h_i^c \right|$$
The max absolute diff base measure is defined as:
$$b_{\mathrm{mad}}(w, i) = \max_{c=2}^{|w|} \left| h_i^c - h_i^{c-1} \right|$$
Figure 1 demonstrates these two metrics for a sample (word, unit) pair, showing how the former captures the general level of activation the word caused on the unit, while the latter captures the local character pattern deemed most important by it. We intentionally did not consider metrics based on the final activation values, the direct signals used by the later layers in the model, as these bear no insight into the effect of a word's composition on the learned model.
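As an illustration of the two base measures, the sketch below computes them for a single (word, unit) pair given the unit's per-character activations. The function names, the toy activation values, and the handling of one-character words (returning 0 for the max absolute diff) are assumptions; the diff is taken here between consecutive characters within the word.

```python
def avg_abs(activations):
    """Average of the absolute activation values over the word's characters."""
    return sum(abs(a) for a in activations) / len(activations)


def max_abs_diff(activations):
    """Largest absolute change in activation between consecutive characters."""
    diffs = [abs(cur - prev) for prev, cur in zip(activations, activations[1:])]
    return max(diffs, default=0.0)  # assumption: a one-character word scores 0


# Toy activations of one hidden unit over a four-character word.
unit_activations = [0.1, -0.2, 0.7, 0.6]
b_avg = avg_abs(unit_activations)
b_mad = max_abs_diff(unit_activations)
```

Because LSTM outputs lie strictly between -1 and 1, the first measure falls in [0, 1) and the second in [0, 2). In this toy case the unit's overall activation is moderate, while the sharp jump between the second and third characters dominates the second measure.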
Next, we derive a language-level metric for each hidden unit, based on the principle of Mutual Information (MI). The base metric's range ($[0, 1)$ for $b_{\mathrm{avg}|\cdot|}$, $[0, 2)$ for $b_{\mathrm{mad}}$) is divided into $B$ bins of equal size, and base activations from each word are summed across each of the $T$ POS tag categories, then normalized to produce a joint probability distribution $\Pr(b, t)$ over bins and tags. The mutual information is computed as:
$$\mathrm{PDI}(L, i) = \sum_{b=1}^{B} \sum_{t=1}^{T} \Pr(b, t) \log \frac{\Pr(b, t)}{\Pr(b)\,\Pr(t)}$$
and we call the resulting number the POS-Discrimination Index, or PDI. Intuitively, a higher PDI implies that the unit activates differently on words of different parts of speech, i.e. it is a better discriminator for the task. At this point a language produces a set of $d_h$ PDI scores, one for each unit. We sort them from high to low, and define two language-level metrics. The mass is the sum of PDI values for all units, $M(L) := \sum_{i=1}^{d_h} \mathrm{PDI}(L, i)$, intuitively meant to quantify the degree of success the model has in assigning hidden units to discriminate POS in this language. The head forwardness is the proportion of forward-directional units before the point at which half of the mass accumulates (in a random setup, this number would tend to 0.5):
$$H_{\rightarrow}(L) := \frac{\left| \{\, j \le k^{*} : i_j \text{ is a forward unit} \,\} \right|}{k^{*}}, \qquad k^{*} := \min \Big\{ k : \sum_{j=1}^{k} \mathrm{PDI}(L, i_j) \ge \tfrac{1}{2} M(L) \Big\}$$
where $i_1, \dots, i_{d_h}$ lists the units in descending order of PDI. This metric aims to quantify the relative importance of forward and backward units in discriminating POS for $L$.
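The pipeline from base measures to language-level metrics can be sketched as below. This is a simplified reading rather than the exact procedure: each word type contributes one count to its (bin, tag) cell, the natural logarithm is used, and the unit at which half the mass is reached is included in the head. All three choices, and the function names, are assumptions.

```python
import math
from collections import Counter


def bin_index(value, upper, n_bins):
    """Map a base-measure value in [0, upper) to one of n_bins equal-width bins."""
    return min(int(value / upper * n_bins), n_bins - 1)


def pdi(observations, upper, n_bins):
    """Mutual information between activation bin and POS tag for one unit.

    observations: (base_measure_value, pos_tag) pairs, one per word type.
    """
    joint = Counter((bin_index(v, upper, n_bins), tag) for v, tag in observations)
    total = sum(joint.values())
    bin_counts, tag_counts = Counter(), Counter()
    for (b, t), n in joint.items():
        bin_counts[b] += n
        tag_counts[t] += n
    # Pr(b,t) * log(Pr(b,t) / (Pr(b) Pr(t))), written in terms of counts.
    return sum((n / total) * math.log(n * total / (bin_counts[b] * tag_counts[t]))
               for (b, t), n in joint.items())


def mass(pdi_scores):
    """Sum of PDI values over all hidden units."""
    return sum(pdi_scores)


def head_forwardness(pdi_scores, is_forward):
    """Share of forward units among the top-ranked units holding half the mass."""
    order = sorted(range(len(pdi_scores)), key=lambda i: pdi_scores[i], reverse=True)
    half, running, head = sum(pdi_scores) / 2, 0.0, []
    for i in order:
        head.append(i)
        running += pdi_scores[i]
        if running >= half:
            break
    return sum(1 for i in head if is_forward[i]) / len(head)


separating = [(0.1, "NOUN")] * 4 + [(0.9, "VERB")] * 4
uninformative = [(0.1, "NOUN"), (0.1, "VERB"), (0.9, "NOUN"), (0.9, "VERB")]
pdi_high = pdi(separating, upper=1.0, n_bins=16)
pdi_zero = pdi(uninformative, upper=1.0, n_bins=16)
toy_scores = [3.0, 3.0, 1.0, 1.0]
toy_forward = [True, False, True, True]
toy_head = head_forwardness(toy_scores, toy_forward)
```

In the toy data, a unit whose activation bin fully determines the tag reaches the maximum for two equally frequent tags (log 2), while a unit whose bins are independent of the tag scores zero.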

PDI Patterns
The PDI patterns on the b avg|·| base measure with B = 16 bins on all 24 languages are presented in Table 2. We see that agglutinative languages, where we can expect a better discrimination signal to emerge from the consistently-formed morphemes, cluster mostly at the top of the PDI mass scale, suggesting more individual character-level units extract these signals successfully. Introflexive languages, where character sequences seldom correspond to useful indications of POS or morphosyntactic attributes, cluster towards the bottom.
We present the full unit-level PDI value distributions for Coptic, a prefixing agglutinative language, and English, a suffixing fusional language, in Figure 2 (trends for b mad are similar). Consistent with other agglutinative languages, Coptic's cumulative mass is very large (M(cop) = 58.1), suggesting the predictive qualities of the sequence-based LSTM allow good discrimination from the character signal, as one might expect from an agglutinative language. Conversely, M(eng) = 16, demonstrating the difficulty presented by fusional languages. The accumulation of 71% forward (80% backward) units in the head of the Coptic (English) value ranking suggests an interesting relationship between affixation and LSTM direction: LSTM units are likely to home in on POS-indicative signals, which often occur as affixes, in the beginning of their run, causing activation values to rise (in absolute value) and stay large throughout the subsequent traversal of the stem. Unfortunately, since no other prefixing languages are available in UD, we were not able to pursue this hypothesis further.

Asymmetric Directionality
Table 3: Imbalanced models' mean POS accuracy on UD development data (differences between three averaged random runs in all models; boldfaced when significant at p < 0.05 using a paired two-tailed t-test).
Based on these observations, we conduct a directionality balance study, where we vary the number of hidden units in the forward and backward dimensions. In addition to the models analyzed above, which use 64 forward and 64 backward units (denoted hereafter 64/64), we trained models with imbalanced directionality (128/0, 96/32, 32/96, 0/128). We test the hypothesis that imbalanced models affect languages differently based on their linguistic properties and statistical metrics. We note that these settings do not maintain parameter set size: intra-direction transition operations are quadratic in that direction's hidden layer size, and so this adds a possible advantage in favor of direction-imbalanced models.
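The remark on parameter set size can be checked with a short calculation. The sketch below assumes a standard LSTM parameterization (four gates, each with input weights, recurrent weights, and one bias vector) and the character embedding size of 256; the exact count in a given toolkit may differ slightly, and the function names are hypothetical.

```python
def lstm_params(input_dim, hidden_dim):
    """Parameters of one LSTM direction: 4 gates x (input + recurrent weights + bias)."""
    return 4 * (hidden_dim * (input_dim + hidden_dim) + hidden_dim)


def bilstm_params(input_dim, fwd_units, bwd_units):
    """Total parameters of a Bi-LSTM with possibly different sizes per direction."""
    return lstm_params(input_dim, fwd_units) + lstm_params(input_dim, bwd_units)


CHAR_EMB_DIM = 256
SETTINGS = [(128, 0), (96, 32), (64, 64), (32, 96), (0, 128)]
counts = {f"{f}/{b}": bilstm_params(CHAR_EMB_DIM, f, b) for f, b in SETTINGS}

for name, n in counts.items():
    print(f"{name:>6}: {n} parameters")
```

Under these assumptions, the input-weight and bias terms are identical across settings because the total number of units is fixed at 128. Only the recurrent term (four times the squared size of each direction) changes, so 128/0 and 0/128 come out largest, 64/64 smallest, and 96/32 and 32/96 fall in between.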
The results for this study are presented in Table 3 as averages for the language categories listed in Table 1 (the full, raw results are available in Table 4).
One trend which emerges is the preference of agglutinative languages for imbalanced models, whereas the other languages are little affected by this change. This could be explained by the increase in inter-unit interaction in the larger direction of an imbalanced model: contiguous character sequences consistently code reliable linguistic features in these languages. A second finding is the slight bias of suffixing languages towards more forward units and of the prefixing language towards more backward units, indicating that hidden LSTM units are better at detecting formations close to their final state. Coupled with the findings regarding PDI mass distribution in the different directional units in § 4.2, we suggest that a subtle relation exists between morphological information and model directionality: units which end their run on the affix are more important for detecting the POS signal, but it is more challenging for them to do so, and as a result more of them are necessary. We also note the stability of isolating and little-affixing languages to directionality balance, possibly owing to the relatively small significance of contiguous character sequences in detecting word role. Lastly, we point out that the compromise sesquidirectional models 96/32 and 32/96 did not tend to stand out significantly on our tested language categories, suggesting there is no substantial middle-ground between the two popular techniques of unidirectional and bidirectional LSTMs.

Conclusion
While character-level Bi-LSTM models compute meaningful word representations across many languages, the way they do it depends on each language's typological properties. These observations can guide model selection: for example, in agglutinative languages we observe a strong preference for a single direction of analysis, motivating the use of unidirectional character-level LSTMs for at least this type of language.