Can LSTM Learn to Capture Agreement? The Case of Basque

Sequential neural network models are powerful tools in a variety of Natural Language Processing (NLP) tasks. The sequential nature of these models raises the questions: to what extent can these models implicitly learn hierarchical structures typical of human language, and what kind of grammatical phenomena can they acquire? We focus on the task of agreement prediction in Basque, as a case study for a task that requires implicit understanding of sentence structure and the acquisition of a complex but consistent morphological system. Analyzing experimental results from two syntactic prediction tasks -- verb number prediction and suffix recovery -- we find that sequential models perform worse on agreement prediction in Basque than one might expect on the basis of previous agreement prediction work in English. Tentative findings based on diagnostic classifiers suggest the network makes use of local heuristics as a proxy for the hierarchical structure of the sentence. We propose the Basque agreement prediction task as a challenging benchmark for models that attempt to learn regularities in human language.

Introduction
In recent years, recurrent neural network (RNN) models have emerged as a powerful architecture for a variety of NLP tasks (Goldberg, 2017). In particular, gated versions, such as Long Short-Term Memory networks (LSTMs) (Hochreiter and Schmidhuber, 1997) and Gated Recurrent Units (GRUs) (Chung et al., 2014), achieve state-of-the-art results in tasks such as language modeling, parsing, and machine translation.
RNNs were shown to be able to capture long-term dependencies and statistical regularities in input sequences (Karpathy et al., 2015; Linzen et al., 2016; Shi et al., 2016; Jurafsky et al., 2018; Gulordava et al., 2018). An adequate evaluation of the ability of RNNs to capture syntactic structure requires the use of established benchmarks.
A common approach is the use of an annotated corpus to learn an explicit syntax-oriented task, such as parsing or shallow parsing (Dyer et al., 2015; Kiperwasser and Goldberg, 2016; Dozat and Manning, 2016). While such an approach does evaluate the ability of the model to learn syntax, it has several drawbacks. First, the annotation process relies on human experts and is thus demanding in terms of resources. Second, by its very nature, training a model on such a corpus evaluates it on a human-dictated notion of grammatical structure, and is tightly coupled to a linguistic theory. Lastly, the supervised training process on such a corpus provides the network with explicit grammatical labels (e.g. a parse tree). While this is sometimes desirable, in some instances we would like to evaluate the ability of the model to implicitly acquire hierarchical representations.
Alternatively, one can train a language model (LM) (Graves, 2013; Józefowicz et al., 2016; Melis et al., 2017; Yogatama et al., 2018) to model the probability distribution of a language, and use common quality measures such as perplexity as an indication of the model's ability to capture regularities in language. While this approach does not suffer from the drawbacks discussed above, it conflates syntactic capacity with other factors such as world knowledge and the frequency of lexical items. Furthermore, the LM task does not provide one clear answer: one cannot be "right" or "wrong" in language modeling, only softly worse or better than other systems.
A different approach is testing the model on a grammatical task that does not require extensive grammatical annotation, but is nevertheless indicative of syntax comprehension. Specifically, previous works (Linzen et al., 2016; Bernardy and Lappin, 2017; Gulordava et al., 2018) used the task of predicting agreement, which requires detecting hierarchical relations between sentence constituents. Labeled data for such a task requires only the collection of sentences that exhibit agreement from an unannotated corpus. However, those works have focused on a relatively small set of languages: several Indo-European languages and a Semitic language (Hebrew). As we show, drawing conclusions on the model's abilities from a relatively small subset of languages can be misleading.
In this work, we test agreement prediction in a substantially different language, Basque, which is a language with ergative-absolutive alignment, rich morphology, relatively free word order, and polypersonal agreement (see Section 3). We propose two tasks, verb-number prediction (Section 6) and suffix prediction (Section 7), and show that agreement prediction in Basque is indeed harder for RNNs. We thus propose Basque agreement as a challenging benchmark for the ability of models to capture regularities in human language.

Background and Previous Work
To shed light on the question of hierarchical structure learning, previous work on English (Linzen et al., 2016) focused on subject-verb agreement: the form of third-person present-tense verbs in English depends on the number of their subject ("They walk" vs. "She walks"). Agreement prediction is an interesting case study for implicit learning of the tree structure of the input, as once the arguments of each present-tense verb in the sentence are found and their grammatical relation to the verb is established, predicting the verb form is straightforward. Linzen et al. (2016) tested different variants of the agreement prediction task: categorical prediction of the verb form based on the left context; grammatical assessment of the validity of the agreement present in a given sentence; and language modeling. Since in many cases the verb form can be predicted from the number of the preceding noun, they focused on agreement attractors: sentences in which the preceding nouns have the opposite number to that of the grammatical subject. Their model achieved very good overall performance in the first two tasks of number prediction and grammatical judgment, while in the third task of language modeling, weak supervision did not suffice to learn structural dependencies. With regard to the presence of agreement attractors, they showed that performance decays with their number, to the point of worse-than-random accuracy in the presence of 4 attractors; this suggests the network relies, at least to a certain degree, on local cues. Bernardy and Lappin (2017) evaluated agreement prediction on a larger dataset, and argued that a large vocabulary aids the learning of structural patterns. Gulordava et al. (2018) focused on the ability of LMs to capture agreement as a marker of syntactic ability, and used nonsensical sentences to control for semantic cues. They showed positive results in four languages, as well as some similarities between their models' performance and human judgment of grammaticality.

Properties of the Basque Language
Basque agreement patterns are ostensibly more complex than, and very different from, those of English. In particular, nouns inflect for case, and the verb agrees with all of its core arguments. How well can an RNN learn such agreement patterns?
We first outline key properties of Basque relevant to this work. We used two English-language grammars for reference (Laka, 1996; de Rijk, 2007).
Morphological marking of case and number on NPs The grammatical role of noun phrases is explicitly marked by nuclear case suffixes that attach after the determiner in a noun phrase; this is typically the last element in the phrase.
The nuclear cases are the ergative (ERG), the absolutive (ABS) and the dative (DAT). 1 In addition to case, the same suffixes also encode for number (singular or plural) as seen in Table 1.
Ergative-absolutive case system Unlike English and most other Indo-European languages, which have nominative-accusative morphosyntactic alignment in which the single argument of intransitive verbs and the agent of transitive verbs behave similarly to each other ("subjects") but differently from the object of transitive verbs, Basque has ergative-absolutive alignment. This means that the "subject" of an intransitive verb and the "object" of a transitive verb behave similarly to each other and receive the absolutive case, while the "subject" of a transitive verb receives the ergative case. To illustrate the difference, while in English we say "she sleeps" and "she sees them" (treating she the same in both sentences), in an imaginary ergative-absolutive version of English we would say "she sleeps" and "her sees they", inflecting "she" and "they" similarly (the absolutive), and differently from "her" (the ergative).
Examples The following sentence (1) demonstrates the use of case suffixes to encode grammatical function.
In (1), the verb eman 'give' is transitive; the ergative corresponds to the English grammatical subject and the absolutive corresponds to the English grammatical object. However, Basque is ergative-absolutive, namely, the subject of an intransitive verb is marked for case like the object of a transitive verb, and differently from the subject of a transitive verb (2).
Since the verb daude 'are' is intransitive, the word kutxazain- 'cashier' takes the plural absolutive suffix -ak, and not the plural ergative suffix -ek.
The tree sees the people.

Word-order and Polypersonal Agreement
Basque is often said to have an SOV word order, although the rules governing word order are rather complex, and word order depends on the focus and topic of the sentence. While the case marking system handles most of the word-order variation, the ambiguity between the singular ergative and the plural absolutive, which are both marked with -ak, results in sentence-level ambiguity. For instance, Example (3) can also be interpreted as "it is the tree [SG] that sees the people [PL]" (with a focus on "the tree"). Disambiguation in such cases depends on context and world knowledge.
Unlike English verbs, which only agree in number with their grammatical subject, Basque verbs agree in number with all their nuclear arguments: the ergative, the absolutive and the dative (roughly corresponding to the subject, the object and the indirect object). 2 Verbs are formed in two ways: aditz trinkoak 'synthetic verbs', such as jakin 'to know', are conjugated according to aspect, tense and agreement patterns, e.g. dakigu 'We know it' and genekien 'We knew it'. There are only about two dozen such verbs; all other verbs, such as ikusi 'to see', are composed of a non-finite stem, indicating the tense or aspect, and an auxiliary verb that is conjugated according to the number of its arguments, e.g. ikusten dugu 'We see it' and ikusi genuen 'We saw it'. There are several auxiliary verbs, including izan 'to be' and ukan 'to have'. The form of the auxiliary verb used in a sentence also depends on the transitivity of the verb, with izan being the intransitive auxiliary and ukan being the transitive auxiliary.
To summarize Noun phrases are marked for case (ergative, absolutive or dative) and number (singular or plural), and appear in relatively-free word order relative to the verb to which they are arguments. The verbs (or their auxiliaries) inflect for tense, time and number-agreement, and agree with all their arguments on number. Case syncretism results in ambiguity between the singular ergative and the plural absolutive suffixes.

Learning Basque Agreement
To assess the ability of RNNs to learn Basque agreement we perform two sets of experiments. In the first set (Section 6), we focus on the ability to learn to predict the number inflections of verbs, namely, the number of each of their arguments: the model reads the sentence with one randomly chosen verb replaced by a verb token. This is analogous to the agreement task explored in previous work on English (Linzen et al., 2016) and other languages (Gulordava et al., 2018), but in an arguably more challenging setting, as the Basque task requires the model: (a) to identify all the verb's arguments; (b) to learn the ergative-absolutive distinction; and (c) to cope with a relatively free word order and a rich morphological inflection system. As we show, the task is indeed substantially harder than in English, resulting in much lower accuracies than in Linzen et al. (2016), even without focusing on the hard cases.
However, we also identify some problems with the verb number prediction task. The presence of case suffixes presumably makes the task easier, in some sense, than in English: the grammatical role of arguments with respect to the verb is encoded in grammatical suffixes, potentially making it easier to capture surface heuristics that do not require understanding the hierarchical structure of the sentence. In addition, the ergative, whose number is encoded in the verb form, is often omitted from sentences, making the task of ergative number prediction impossible without relying on context or world knowledge. We thus propose an alternative setup (Section 7), in which, rather than predicting the agreement pattern of the verb, we remove all nuclear case suffixes from words and ask the model to recover them (or predict the absence of a suffix, for unsuffixed words). We argue that this setup is a better one for assessing models' ability to capture Basque sentence structure and agreement system: it requires the model to accurately identify the role of NPs with respect to a verb in order to assign them the correct case suffix (as marked on the verb), while not requiring the model to make up information that is not encoded in the sentence.

Experimental Setup
In contrast to more explicit grammatical tasks (e.g. tagging, parsing), training a model on the agreement prediction task does not require annotated data: the training data can be derived relatively easily from unannotated sentences. We have used the text of the Basque Wikipedia. A considerable number of the articles in the Basque Wikipedia appear to be bot-generated; we have tried to filter these from the data according to keywords. The data consists of 1,896,371 sentences; we have used 935,730 sentences for training, 129,375 for validation and 259,215 for evaluation. We make the data publicly available. 3
We use the Apertium morphological analyzer (Forcada et al., 2011; Ginest-Rosell et al., 2009) to extract the parts-of-speech (POS) and morphological marking of all words. 4 The POS information was used to detect verbs, nouns and adjectives, but was not incorporated in the word embeddings.
For Section 7.1, grammatical generalization, we used the Basque Universal Dependencies treebank (Aranzabe et al., 2015) to extract human-annotated POS, case, number and dependency edge labels. We have used their train:dev:test division, resulting in 5,173 training sentences, 1,719 development sentences and 1,719 test sentences.
Word Representation We represent each word with an embedding vector. To account for the rich morphology of Basque, our word embeddings combine the word identity, its lemma 5 as determined by the morphological analyzer, and character n-grams of lengths 1 to 5. Let $E_t$, $E_l$ and $E_{ng}$ be the token, lemma and n-gram embedding matrices, and let $t_w$, $l_w$ and $\{ng_w\}$ be the word token, the lemma and the set of all n-grams of lengths 1 to 5, for a given word $w$. The final vector representation of $w$, $e_w$, is given by $e_w = E_t[t_w] + E_l[l_w] + \sum_{ng \in \{ng_w\}} E_{ng}[ng]$. We use embedding vectors of size 150. We recorded the 100,000 most common words, n-grams and lemmas, and used them to calculate the vector representation of words. Out-of-vocabulary words, n-grams and lemmas are replaced by an unk token.
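The word representation described above can be illustrated with a minimal sketch. It assumes that the token, lemma and n-gram vectors are combined by summation; the `EmbeddingTable` class, the `<unk>` key and the random initialization are hypothetical stand-ins for trained embedding matrices, not the implementation used in the experiments.

```python
import random


def char_ngrams(word, min_n=1, max_n=5):
    """Return the set of character n-grams of lengths min_n..max_n."""
    return {word[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(word) - n + 1)}


class EmbeddingTable:
    """Toy lookup table (hypothetical); unseen keys share one <unk> vector."""

    def __init__(self, vocab, dim, seed):
        rng = random.Random(seed)
        self.dim = dim
        self.vectors = {key: [rng.uniform(-0.1, 0.1) for _ in range(dim)]
                        for key in list(vocab) + ["<unk>"]}

    def lookup(self, key):
        return self.vectors.get(key, self.vectors["<unk>"])


def embed_word(token, lemma, tokens, lemmas, ngrams):
    """Combine token, lemma and n-gram vectors (summation is an assumption)."""
    total = [a + b for a, b in zip(tokens.lookup(token), lemmas.lookup(lemma))]
    for ng in char_ngrams(token):
        total = [a + b for a, b in zip(total, ngrams.lookup(ng))]
    return total


# Embedding size 150, as in the text; the vocabularies here are toy examples.
DIM = 150
token_table = EmbeddingTable(["etxea"], DIM, seed=0)
lemma_table = EmbeddingTable(["etxe"], DIM, seed=1)
ngram_table = EmbeddingTable(char_ngrams("etxea"), DIM, seed=2)
vector = embed_word("etxea", "etxe", token_table, lemma_table, ngram_table)
```

Because unseen n-grams fall back to the shared unknown vector, a word outside the token vocabulary still receives a representation built from whatever n-grams it shares with known words.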
Model In previous studies, the agreement was between two elements, and the model was tasked with predicting a morphological property of the second one, based on a property encoded on the first. Thus, a uni-directional RNN sufficed. Here, since a single verb has to agree with several arguments while following a relatively free word order, we cannot use a uni-directional model. We opted instead for a bi-directional RNN. 6 In all tasks, we use a one-layer BiLSTM network with 150 hidden units, compared with 50 units in Linzen et al. (2016). 7 In the verb prediction task, the BiLSTM encodes the verb in the context of the entire sentence, and the numbers of the ergative, absolutive and dative are predicted by 3 independent multilayer perceptrons (MLPs) with a single hidden layer of size 128, which receive as input the hidden state of the BiLSTM over the verb token.
In the suffix prediction task, the prediction of the case suffix is performed by an MLP of size 128, which receives as input the hidden state of the BiLSTM over each word in the sentence.
The whole model, including the embedding, is trained end-to-end with the Adam optimizer (Kingma and Ba, 2014).

Verb Argument Number Prediction
In this task, the model sees the sentence with one of the auxiliary verbs replaced by a verb token, and predicts the number of its ergative, absolutive and dative. For example, in (1) above, the network sees the embeddings of the words in the sentence: 8

Kutxazain-ek bezeroa-ri liburu-ak eman verb
It is then expected to predict the number of the arguments of the missing verb, dizkiote: ergative:plural, dative:singular and absolutive:plural. Each argument can take one of three values: singular, plural or none. In order to succeed in this task, the model has to identify the arguments of the omitted verb, and detect their plurality status as well as their grammatical relation to the verb. Note that, as discussed above, these relations do not overlap with the notions of "subject" and "object" in English, as the grammatical case is also dependent on the transitivity of the verb. Since the model is exposed to the lemma of the auxiliary verb and the stem that precedes it, it can, in principle, learn to divide verbs into transitive and intransitive.
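To make the input and output of this task concrete, the following sketch builds one training instance from example (1). The placeholder symbol `<verb>`, the function name and the label dictionary layout are illustrative assumptions; only the three-way label set and the example itself come from the text.

```python
VERB_TOKEN = "<verb>"  # hypothetical placeholder symbol for the masked verb
NUMBER_VALUES = ("singular", "plural", "none")


def make_verb_example(words, verb_index, ergative, absolutive, dative):
    """Mask one verb and pair the sentence with three number labels."""
    labels = {"ergative": ergative, "absolutive": absolutive, "dative": dative}
    for role, value in labels.items():
        if value not in NUMBER_VALUES:
            raise ValueError(f"unexpected value {value!r} for {role}")
    masked = list(words)
    masked[verb_index] = VERB_TOKEN
    return masked, labels


# Example (1): the auxiliary dizkiote is replaced by the verb token.
sentence = ["kutxazainek", "bezeroari", "liburuak", "eman", "dizkiote"]
masked, labels = make_verb_example(
    sentence, verb_index=4,
    ergative="plural", absolutive="plural", dative="singular")
```

Each of the three labels is predicted independently, which is why the sketch keeps them as separate entries rather than one joint class.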

Results and Analysis
We conducted a series of experiments, as detailed below. A summary of all the results is available in Table 2.

Main results
The model achieved moderate success in this task, with accuracy of 87.1% and 93.8% and recall of 80.0% and 100% 9 in ergative and absolutive prediction, respectively. Dative accuracy was 98.0%, but the recall is low (54.9%), perhaps due to the relative rarity of dative nouns in the corpus (only around 3.5% of the sentences contain a dative). These relatively low numbers are in sharp contrast to previous results on English, in which the accuracy on general sentences was above 99%. While English agreement results drop when considering hard cases where agreement distractors or other constructions intervene between the verb and its argument, in Basque the numbers are low already for the common cases. This suggests that agreement prediction in Basque can serve as a valuable benchmark for evaluating the syntactic abilities of sequential models such as LSTMs in a relatively challenging grammatical environment, as well as for assessing the generality of results across language families.
Ablations: case suffixes vs. word forms The presence of nuclear case suffixes in Basque can, in principle, make the task of agreement prediction easier, as (ambiguous) grammatical annotation is explicit in the form of the nuclear case suffixes, which encode the type of grammatical connection to the verb. How much of the relevant information is encoded in the case suffixes? To investigate the relative importance of these suffixes, we considered a baseline in which the model is exposed only to the nuclear suffixes, ignoring the identities of the words and the character n-grams (Table 2, Suffixes only). This model achieved accuracy scores of 69.0%, 83.7% and 97.0% and recall values of 40.3%, 100% and 26% for ergative, absolutive and dative prediction, respectively. While substantially lower than when considering the word forms, the absolute numbers are not random, suggesting that agreement can in large part be predicted based on the presence of the different suffixes and their linear order in the sentence, without paying attention to specific words.
In a complementary setting the model is exposed to the sentence after the removal of all nuclear case suffixes (according to the morphological analyzer output). This setting (Table 2, No suffixes) yields accuracies of 83.8%, 87.8% and 97.3% and recall scores of 80.0%, 100% and 34.7% for ergative, absolutive and dative, respectively. Interestingly, in this setting the model succeeds to some extent in predicting the number of the verb's arguments although the number is not marked on the arguments. This suggests the model uses cues such as the existence of certain function words that imply a number, and the forms of non-nuclear suffixes, to infer the number of the arguments.
Importance of explicit case marking The verb number prediction task requires the model to identify the arguments, and hence the hierarchical structure of the sentence. However, the Basque suffixes encode not only the number but also the explicit grammatical function of the argument. This makes the model's task potentially easier, as it may make use of the explicit case information as an effective heuristic instead of modeling the sentence's syntactic structure. To control for this, we consider a neutralized version (Table 2, Neutralized case) in which we removed case information and kept only the number information: suffixes were replaced by their number, or were marked as "ambiguous" in the case of -ak. For example, the word kutxazainek was replaced with kutxazain plural, since the suffix -ek encodes plural ergative. Interestingly, in this setting the performance decreased only slightly, with accuracy scores of 86.0%, 93.3% and 97.3% and recall values of 79.3%, 100% and 38.1% for ergative, absolutive and dative, respectively. These results suggest that the network either makes little use of explicit grammatical marking in the suffixes, or compensates for the absence of grammatical annotation using other information present in the sentence.
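The neutralization step can be sketched as a simple mapping from nuclear suffixes to number tags. The sketch assumes that words arrive already segmented into (stem, suffix) pairs, for instance from a morphological analyzer; the function name and the tuple-based output format are illustrative choices, and the suffix inventory is limited to the five nuclear suffixes named in the text.

```python
# Number carried by each nuclear suffix; -ak is marked ambiguous because it
# is shared by the singular ergative and the plural absolutive.
NUMBER_OF_SUFFIX = {
    "a": "singular",    # absolutive singular
    "ak": "ambiguous",  # ergative singular or absolutive plural
    "ek": "plural",     # ergative plural
    "ari": "singular",  # dative singular
    "ei": "plural",     # dative plural
}


def neutralize_case(segmented_words):
    """Replace each nuclear suffix by its number tag, dropping case."""
    neutralized = []
    for stem, suffix in segmented_words:
        tag = NUMBER_OF_SUFFIX[suffix] if suffix is not None else None
        neutralized.append((stem, tag))
    return neutralized


# (stem, suffix) pairs are assumed to come from a morphological analyzer.
example = [("kutxazain", "ek"), ("liburu", "ak"), ("eman", None)]
result = neutralize_case(example)
```

After this step the ergative plural and the dative plural receive the same tag, so any case distinction the model recovers must come from elsewhere in the sentence.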
Performance on simple sentences The presence of multiple verbs, along with the inherent ambiguity of the suffix system, can both complicate the task of number prediction. To assess the relative importance of these factors, we considered modified training and test sets that contain only sentences with a single verb (Table 2, Single verb). This resulted in a significant improvement, with accuracy scores of 90.61%, 96.04% and 98.9% and recall values of 89.0%, 100% and 74.7% for ergative, absolutive, and dative, respectively; note that sentences with a single verb also tend to be shorter and simpler in their grammatical structure. To evaluate the influence of the ambiguous suffix, we removed all sentences that contain the ambiguous suffix -ak from the dataset (

NP Suffix Prediction
The general trend in the experiments above is a significantly higher success in absolutive number prediction, compared with ergative number prediction. This highlights a shortcoming of the verb-number prediction task: as Basque encodes the number of the verb arguments in the verb forms, the subject can be, and often is, omitted from the sentence. Additionally, the number of proper nouns is often not marked. These cases are common for the ergative: 55% of the sentences marked for ERG.PL3 agreement do not contain words suffixed with -ek. This requires the model to predict the number of the verb based on information which is not directly encoded in the sentence.
To counter these limitations, we propose an alternative prediction task that also takes advantage of the presence of case suffixes, while not requiring the model to guess based on unavailable information. In this task, the network reads the input sentence with all nuclear case suffixes removed, and has to predict the suffix (or the absence thereof) for each word in the sentence. For example, in (1) above, the model reads (5).
It is then expected to predict the omitted case and determiner suffixes (-ek, -ak, -ari, none, none). We note that we remove the suffixes only from NPs, keeping the verbs in their original forms. As the verbs encode the numbers of their arguments as well as their roles, the network is exposed to all relevant information required for predicting the missing suffixes, assuming it can recover the sentence structure. In order to succeed in this task, the model should link each argument to its verb, evaluate its grammatical relation to the verb, and choose the case suffix accordingly. Case suffixes are appended at the end of the NP. As a result, suffix recovery also requires some degree of POS tagging and NP chunking, and thus shares some similarities with shallow parsing in languages such as English. This suggests that the task of case suffix recovery in languages with a complex case system such as Basque can serve as a proxy task for full parsing, while requiring a minimal amount of annotated data.
The singular absolutive determiner suffix, -a, also appears in the base form of some words. Therefore, for -a suffixed words, we have used the morphological analyzer to detect whether or not the -a suffix is a part of the lemma. Consider the examples ur 'water'-ura 'the water-ABS' and uda 'summer'-uda 'the summer-ABS'. -a suffixed words not known to the analyzer were excluded from the experiment.
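A minimal sketch of how a suffix-recovery instance could be derived from a segmented sentence is given below. It assumes (stem, suffix) pairs are already available, with verbs and unsuffixed words carrying no nuclear suffix; the function name, the `none` label string and the segmentation of the example words are illustrative assumptions.

```python
NO_SUFFIX = "none"  # label used for words without a nuclear case suffix


def make_suffix_example(segmented_words):
    """Return the suffix-stripped input words and the per-word suffix labels."""
    inputs, labels = [], []
    for stem, suffix in segmented_words:
        inputs.append(stem)
        labels.append(suffix if suffix is not None else NO_SUFFIX)
    return inputs, labels


# Segmentation is hypothetical; verbs keep their full form and get "none".
segmented = [("kutxazain", "ek"), ("bezero", "ari"), ("liburu", "ak"),
             ("eman", None), ("dizkiote", None)]
inputs, labels = make_suffix_example(segmented)
```

The model then sees `inputs` and is trained to emit one label per position, so the sequence lengths of inputs and labels always match.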

Results and Analysis
The results for the suffix prediction task are presented in Table 3 and Table 4. The model achieves F1 scores of 78.2% and 83.2% for the ergative plural -ek and the absolutive plural/ergative singular -ak suffixes, respectively. The F1 score for the absolutive singular suffix -a is higher, at 85.5%. This might be due to the fact that this suffix is unambiguous (unlike -ak), and to the fact that the absolutive is rarely omitted (unlike words suffixed with -ek), which implies that verb forms indicating verb-absolutive singular agreement also reliably predict the presence of a word suffixed with -a in the sentence. Similarly to the trend in the first task, the model achieved relatively low F1 scores in the prediction of the dative suffixes, -ari and -ei: 78.8% and 65.0%, respectively.
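For readers who want to reproduce per-suffix scores of this kind, the following sketch computes an F1 score for each label from aligned gold and predicted suffix sequences. It uses the standard definition of F1 as the harmonic mean of precision and recall; the function name and the toy data are illustrative and unrelated to the reported results.

```python
def per_label_f1(gold, predicted):
    """Compute the F1 score of each label from aligned gold/predicted lists."""
    scores = {}
    for label in set(gold) | set(predicted):
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores[label] = 2 * precision * recall / (precision + recall)
        else:
            scores[label] = 0.0
    return scores


# Toy data only: four words with gold and predicted suffix labels.
gold = ["ak", "ek", "none", "ak"]
predicted = ["ak", "ak", "none", "ek"]
scores = per_label_f1(gold, predicted)
```

Reporting F1 per suffix rather than overall accuracy keeps the frequent "no suffix" class from masking errors on rare suffixes such as the datives.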
Importance of verb form Once the grammatical connection between verbs and their arguments is established, the nuclear suffix of each of the verb's arguments is fully determined by the form of the verb. As such, verb forms are expected to be of importance for suffix prediction. To assess this importance, we have evaluated the model in a setting in which the original verb forms are replaced by a verb token. In this setting, the model achieved F1 scores of 72.0%, 65.4%, 78.1%, 67.5%, 47.3% and 92.0% for -ak, -ek, -a, -ari, -ei, and the prediction of the presence of any nuclear suffix, respectively (Table 4, No verb). These results, which are far from random, indicate that factors such as the order of words in the sentence, the identity of the words (as certain words tend to accept certain cases irrespective of context), and the non-nuclear case suffixes (which are not omitted), all aid the task of nuclear-suffix prediction.
Word-only baseline Some words tend to appear more frequently in certain grammatical positions, regardless of their context. We therefore compared the model performance with a baseline of a 1-layer MLP that predicts the case suffix of each word based only on its embedding vector. As expected, this baseline achieved lower F1 scores of 56.0%, 49.5%, 55.2%, 56.5%, 24.2% and 69.8% for -ak, -ek, -a, -ari, -ei, and the prediction of the presence of any suffix, respectively (Table 4, Words only).
Focusing on the harder cases An essential step in the process of suffix prediction is identifying the arguments of each verb. To what extent does the model rely on local cues as a proxy for this task? A simple heuristic is relating each word to its closest verb. We compared the model's performance on "easier" instances where the closest verb is grammatically connected to the word, versus "harder" instances in which the closest verb is not grammatically connected to the word. This evaluation requires automatically judging the grammatical connection between words and verbs in the input sentence. Due to the ambiguous case suffixes, this is generally not possible in unparsed corpora. However, we focus on several special cases of sentences containing exactly 2 verbs of specific types, in which it is possible to unambiguously link certain words in the sentences to certain verbs. Since these instances constitute only a fraction of the dataset, for this evaluation we have used a larger test set containing 50% of the data. Table 5 depicts the results for sentences that contain the verb da 'is'. The general trend, for da and for several other verbs (not presented here), is higher F1 scores in the "easier" instances. We note, however, that in these instances there is also larger absolute distance between the verb and its argument, which prevents us from drawing causal conclusions.
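The closest-verb heuristic and the resulting easier/harder split can be sketched as follows. The sketch assumes word positions are integer indices, that the verb each word truly attaches to is known, and that ties in distance are broken in favor of the leftmost verb; the tie-breaking rule and all names are assumptions made for illustration.

```python
def closest_verb(word_index, verb_positions):
    """Position of the nearest verb; ties go to the leftmost (an assumption)."""
    return min(verb_positions, key=lambda v: (abs(v - word_index), v))


def split_easy_hard(gold_verb_of_word, verb_positions):
    """Partition words by whether their closest verb is their actual verb."""
    easy, hard = [], []
    for word_index, gold_verb in gold_verb_of_word.items():
        if closest_verb(word_index, verb_positions) == gold_verb:
            easy.append(word_index)
        else:
            hard.append(word_index)
    return easy, hard


# Toy sentence with verbs at positions 2 and 5; the mapping gives, for each
# argument position, the position of the verb it actually attaches to.
verbs = [2, 5]
gold = {0: 2, 1: 5, 3: 2, 4: 5}
easy, hard = split_easy_hard(gold, verbs)
```

In this toy sentence only the word at position 1 is a harder instance, since its nearest verb (position 2) is not the verb it attaches to (position 5).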
Diagnostic classifiers To overcome this difficulty and understand whether the model encodes the grammatical connection between a word and its closest verb in the BiLSTM hidden state over that word, we have trained a diagnostic classifier (Adi et al., 2016; Hupkes and Zuidema, 2018) that receives as input the hidden state of the BiLSTM over a word, and predicts whether or not the closest verb (which is unseen by the diagnostic classifier) is grammatically connected to the word.
We have compared two diagnostic classifiers: a linear model and a 1-layer MLP. A training set was created by collecting hidden states of the BiLSTM over words, and labeling each training example according to the existence of a verb-argument connection between the word over which the state was collected and its closest verb (a binary classification task). We then compared the success rate of the diagnostic classifier on instances in which the BiLSTM correctly predicted a case suffix (Table 6, BiLSTM correct), versus the instances on which the BiLSTM predicted incorrectly (Table 6, BiLSTM wrong). The results, depicted in Table 6, demonstrate that in instances in which the model predicts a wrong case suffix, the diagnostic classifier tends to inaccurately predict the connection between the closest verb and the word. For example, for sentences that contain the verb form da, the success rate of the linear model increases from 56.2% to 70.2% in the instances in which the BiLSTM predicted correctly. This differential success may imply a causal relation between the inference of the closest-verb grammatical connection to the word and the success in suffix prediction.
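The comparison between the two groups of instances amounts to computing the diagnostic classifier's accuracy separately on words where the BiLSTM succeeded and on words where it failed. The sketch below shows only this bookkeeping step; the record fields are hypothetical names, and the toy records are unrelated to the reported numbers.

```python
def conditional_accuracy(records):
    """Diagnostic-classifier accuracy, split by BiLSTM suffix correctness."""
    counts = {True: [0, 0], False: [0, 0]}  # correctness -> [hits, total]
    for record in records:
        bucket = counts[record["bilstm_correct"]]
        bucket[0] += int(record["probe_pred"] == record["probe_gold"])
        bucket[1] += 1
    names = {True: "BiLSTM correct", False: "BiLSTM wrong"}
    return {names[key]: (hits / total if total else None)
            for key, (hits, total) in counts.items()}


# Toy records: probe_gold says whether the closest verb is truly connected
# to the word, probe_pred is the diagnostic classifier's guess.
records = [
    {"bilstm_correct": True, "probe_pred": True, "probe_gold": True},
    {"bilstm_correct": True, "probe_pred": False, "probe_gold": False},
    {"bilstm_correct": True, "probe_pred": True, "probe_gold": False},
    {"bilstm_correct": False, "probe_pred": True, "probe_gold": False},
    {"bilstm_correct": False, "probe_pred": False, "probe_gold": False},
]
accuracy = conditional_accuracy(records)
```

A gap between the two accuracies is a correlational signal only, which is consistent with the tentative wording used for the interpretation above.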
Grammatical generalization Does training on suffix recovery induce learning of grammatical generalizations such as morphosyntactic alignment (ergative, absolutive or dative), number agreement (sg / pl) and POS? To test this, we have collected the states of our trained model over the words in sentences from the Basque Universal Dependencies dataset. Different diagnostic classifiers were then trained to predict case, number, POS and the type of the dependency edge to the head of the word. All diagnostic classifiers are MLPs with two hidden layers of sizes 100 and 50. For each task, we trained 5 models with different initializations and report those that achieved the highest development set accuracy.
For nuclear case and number prediction, we limit the dataset to words suffixed with a nuclear case. In this setting, for words on which the BiLSTM predicted correctly, the MLPs perform well, predicting the correct number with an accuracy of 95.0% (majority classifier: 67.3%) and the correct case with an accuracy of 93.5% (majority: 61.7%). Even when the dataset is limited to words suffixed with the ambiguous suffix -ak, the MLP correctly distinguishes ergative and absolutive with 91.2% accuracy (majority: 65.4%). Interestingly, in a complementary setting in which the dataset is limited to words on which the BiLSTM failed in nuclear case suffix recovery, a diagnostic classifier can still be trained to achieve 74.7% accuracy in number prediction and 69.7% accuracy in case prediction. This indicates that, to a large degree, the information required for correct prediction is encoded by the state of the model even when it predicts a wrong suffix.
Table 6: Diagnostic classifier accuracy in predicting whether or not the closest verb is grammatically connected to a word, according to BiLSTM suffix prediction success on that word. "BiLSTM correct": success rate on instances in which the BiLSTM correctly predicted the case suffix. "BiLSTM wrong": success rate on instances in which the BiLSTM failed. "Majority" signifies the success rate of a majority classifier.
For the prediction of POS, dependency edge to the head and any case (not just nuclear cases; 16 cases in total, including the option of an absence of case), the dataset was not limited to words suffixed with nuclear cases or to words on which the BiLSTM predicted correctly. The classifier achieves accuracies of 87.5% in POS prediction (majority: 23.2%), 85.7% in the prediction of any case (majority: 64.7%), and 69.0% in the prediction of the dependency edge to the head (majority: 19.0%).
These results indicate that during training on suffix recovery, the model indeed learns, to some degree, the generalizations of number, alignment and POS, as well as some structural information (connection to the head in the dependency tree). These findings support our hypothesis that success in case recovery entails the acquisition of some grammatical information.

Conclusion
In this work, we have performed a series of controlled experiments to evaluate the performance of LSTMs in agreement prediction, a task that requires implicit understanding of syntactic structure. We have focused on Basque, a language that is characterized by a very different grammar compared with the languages studied for this task so far. We have proposed two tasks for the evaluation of agreement prediction: verb number prediction and suffix recovery.
Both tasks were found to be more challenging than agreement prediction in other languages studied so far. We have evaluated different contributing factors to that difficulty, such as the presence of ambiguous case suffixes. We have used diagnostic classifiers to test hypotheses on the inner representation the model had acquired, and found tentative evidence for the use of shallow heuristics as a proxy for hierarchical structure, as well as for the acquisition of grammatical information during case recovery training.
These results suggest that agreement prediction in Basque could be a challenging benchmark for the evaluation of the syntactic capabilities of neural sequence models. The task of case recovery can be utilized in other languages with a case system, and provides a readily available benchmark for the evaluation of implicit learning of syntactic structure that does not require the creation of expert-annotated corpora. A future line of work we suggest is investigating what syntactic representations are shared between case recovery and full parsing, i.e., to what extent a model trained on case recovery learns the parse tree of the sentence, and whether transfer learning from case recovery would improve parsing performance.