A Novel Challenge Set for Hebrew Morphological Disambiguation and Diacritics Restoration

One of the primary tasks of morphological parsers is the disambiguation of homographs. Particularly difficult are cases of unbalanced ambiguity, where one of the possible analyses is far more frequent than the others. In such cases, there may not be sufficient examples of the minority analyses to properly evaluate performance, nor to train effective classifiers. In this paper we address the issue of unbalanced morphological ambiguities in Hebrew. We offer a challenge set for Hebrew homographs — the first of its kind — containing substantial attestation of each analysis of 21 Hebrew homographs. We show that the current SOTA in Hebrew disambiguation performs poorly on cases of unbalanced ambiguity. Leveraging our new dataset, we achieve a new state-of-the-art for all 21 words, improving the overall average F1 score from 0.67 to 0.95. Our resulting annotated datasets are made publicly available for further research.


Introduction
It is a known phenomenon that the distribution of linguistic units, or words, in a language follows a Zipfian distribution (Zipf, 1949), wherein a relatively small number of words appear frequently, and a much larger number of items appear in a long tail of words, as rare events (Czarnowska et al., 2019). Significantly, this also applies to the distribution of the analyses of a given homograph. Take for instance simple POS-tag ambiguities in English (Elkahky et al., 2018). The word "fair" can be used as an adjective ("a fair price") or as a noun ("she went to the fair"). Yet, the distribution of these two analyses is certainly not fair; the adjectival usage is far more frequent than the nominal usage (e.g., in Bird et al. (2008) the former is six times more frequent than the latter). We will call such cases "unbalanced homographs".
Cases of unbalanced homographs pose a formidable challenge for automated morphological parsers and segmenters. In tagged training corpora, the frequent option will naturally dominate the overwhelming majority of the occurrences. If the training corpus is not sufficiently large, then the sparsity of the minority analysis will prevent generalization by machine-learning models. By the same token, it can be difficult to evaluate the performance of tagging systems regarding unbalanced homographs, because the sparsity of the minority analysis prevents computation of adequate scoring.
The empirical consequences of unbalanced homographs are magnified in morphologically rich languages (MRLs), including many Semitic languages, where distinct morphemes are often affixed to the word itself, resulting in additional ambiguity (Fabri et al., 2014; Habash et al., 2009). Furthermore, in many Semitic MRLs, the letters are almost entirely consonantal, omitting vowels. This results in a particularly high number of homographs, each with a different pronunciation and meaning.
In this paper, we focus upon unbalanced homographs in Hebrew, a highly ambiguous MRL in which vowels are generally omitted (Itai and Wintner, 2008; Adler and Elhadad, 2006). Take for example the Hebrew word מדינה. This frequent word is generally read as a single nominal morpheme, מְדִינָה, meaning "country". However, it can also be read as מִדִּינָהּ, "from the law/judgment of her", wherein the initial and final letters both serve as distinct morphemes. This last usage is far less common, and, in an overall distribution, it would be relegated to the long tail, with very few attestations in any given corpus.
Hebrew is a low-resource language, and as such, the problem of unbalanced homographs is particularly acute. Existing tagged corpora of Hebrew are of limited size, and in most cases of unbalanced homographs, the corpora do not provide sufficient examples to evaluate performance regarding minority analyses, nor to train an effective classifier.
Here, we propose to overcome this difficulty by means of a challenge set: a group of specialized training sets which each focus upon one particular homograph, offering substantial attestations of the competing analyses. Designing such contrast sets that expose particularly hard unbalanced cases was recently proposed as a complementary evaluation effort for a range of NLP tasks by Gardner et al. (2020). Notably, all tasks therein focus exclusively on English, and do not make any reference to morphology. Another, particularly successful, instance of this approach is the Noun/Verb challenge set for English built by Elkahky et al. (2018). Yet, heretofore, no challenge sets have been built to address cases of unbalanced homographs in Hebrew.
In order to fill this lacuna, we built a challenge set for 12 frequent cases of unbalanced Hebrew homographs. Each of these words admits of two possible analyses, each with its own diacritization and interpretation. For each of the possible analyses, we gather 400-2,500 sentences exemplifying such usage, from a varied corpus consisting of news, books, and Wikipedia. Furthermore, in order to highlight the particular problem regarding unbalanced homographs, we add an additional 9 cases of balanced homographs, for contrast and comparison. All in all, the corpus contains over 56K sentences.

Description of the Corpus
In Table 1 we list the 21 homographs addressed in our challenge set. For each case, we specify the frequency of each analysis in naturally-occurring Hebrew text, and the ratio between them. The 21 homographs include a wide range of homograph types. Some are cases of different POS types: Adj vs. Prep (13), Noun vs. Verb (15, 18), Pronoun vs. Prep (2, 4), Noun vs. Prep (9), etc. Other cases differ in terms of whether the final letter should be segmented as a suffix (10, 13, 20). In some instances, the morphology is the same, but the difference lies in the stem/lexeme (5, 7, 8, 11).
In choosing our 21 homographs, we first assembled a list of the most frequent homographs in the Hebrew language. For the simplicity of this initial proof of concept, we constrained our list to homographs with only two primary analyses. We also constrained our list to cases where the two analyses represent different lexemes, skipping over cases in which the difference is only one of inflection. Further, some cases were filtered out due to data sparsity. Finally, we also included a number of less frequent homographs, to allow for a comparison between frequent and infrequent homographs.
In order to gather sentences for the contrast sets, we first sampled 5000 sentences for each target word, and sent them to student taggers. For balanced homographs, with ratios of 1:3 or less, this process handily provided a sufficiently large number of sentences for each of the two analyses. However, regarding cases of unbalanced homographs, wherein the naturally occurring ratio of the minority analysis can be 30:1 or even 129:1, this initial corpus was far from adequate. We used two methods to identify additional candidate sentences: (1) We ran texts through an automated Hebrew diacritizer (Shmidman et al., 2020) and took the cases where the word was diacritized as the minority analysis.
(2) Where relevant, we leveraged optional Hebrew orthographic variations which indicate that a given word is intended in one specific way. These candidate sentences were then sent to student taggers to confirm that the minority analysis was in fact intended. Our student taggers tagged approximately 300 sentences per hour. Evaluation of their work revealed that they averaged an accuracy of 98 percent. In order to overcome this margin of error, we employed a Hebrew-language expert who proofread the resulting contrast sets. In our final corpus, each analysis of each homograph is attested in at least 400 sentences, and usually in 800-2.5K sentences (full details in Appendix Table 1).
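Method (1) above can be sketched as a simple filtering pass. This is an illustrative sketch only: `diacritize` is a hypothetical stand-in for the interface of the automated diacritizer (Shmidman et al., 2020), assumed here to map a sentence to a list of diacritized tokens; the actual tool's API may differ.

```python
def harvest_minority_candidates(sentences, target, minority_form, diacritize):
    """Keep sentences in which the automatic diacritizer renders the
    (undiacritized) target word with the minority vocalization.
    `diacritize` is a hypothetical callable: sentence -> list of
    diacritized tokens, aligned with the whitespace tokens."""
    out = []
    for sent in sentences:
        toks = sent.split()
        dia = diacritize(sent)
        for tok, d in zip(toks, dia):
            if tok == target and d == minority_form:
                out.append(sent)
                break
    return out
```

The resulting candidates would then go to the human taggers for confirmation, as described above.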
Table 1: The homographs covered in our challenge set. Words 1-12 are unbalanced homographs, in which the ratio between the two analyses is particularly skewed. These cases pose a particularly difficult disambiguation challenge because they are severely underrepresented in existing tagged Hebrew corpora.

One issue we encountered when collecting naturally-occurring Hebrew sentences is that a small number of specific word-neighbors and collocations tend to dominate the examples. As an example: the word אפשר can be vocalized as אֶפְשָׁר ("possible", the majority case), or אִפְשֵׁר ("he allowed"). However, over one third of the naturally occurring instances of the majority case boil down to some 90 frequently-occurring collocations, such as אי אפשר ("impossible") or הא אפשר ("is it possible?"). As such, a machine-learning model would overfit to those specific collocations, rather than learning more generic overarching patterns of the word usage. Therefore, we constrained our data collection such that there may be no more than 20 cases of any given word-neighbor combination.
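The neighbor-capping constraint can be sketched as follows. This is a minimal illustration in pure Python; the helper name and the choice to key on the immediate left and right neighbors are our own assumptions, since the paper does not specify the exact windowing.

```python
from collections import Counter

def cap_collocations(sentences, target, cap=20):
    """Keep at most `cap` sentences for any given (neighbor, target)
    or (target, neighbor) combination, so that frequent collocations
    do not dominate the contrast set."""
    counts = Counter()
    kept = []
    for sent in sentences:
        toks = sent.split()
        if target not in toks:
            continue
        i = toks.index(target)
        left = (toks[i - 1], target) if i > 0 else ("<s>", target)
        right = (target, toks[i + 1]) if i + 1 < len(toks) else (target, "</s>")
        if counts[left] < cap and counts[right] < cap:
            counts[left] += 1
            counts[right] += 1
            kept.append(sent)
    return kept
```

In this sketch, once a specific neighbor combination reaches the cap, further sentences exhibiting it are discarded, while sentences with rarer contexts continue to be collected.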

Experiments
We first use our challenge set to evaluate current state-of-the-art performance on the morphological disambiguation of Hebrew homographs. The best existing tool for Hebrew morphological disambiguation is YAP: Yet Another Parser (Tsarfaty et al., 2019). We run all 56,000+ sentences from our challenge set through YAP. Due to the unbalanced natural distribution of the possible analyses in many of the cases, we compute recall and precision results separately for each analysis, and we then compute a macro-averaged F1 score. Next, we use our challenge set to train classifiers for each of the homographs in our corpus. We implement 2-layer MLPs using the DyNet framework (Neubig et al., 2017). As input, we feed the MLP an encoding h(w_i), a representation of the context of the target word within the sentence. The target word itself is masked and not included in the input. The output of the MLP is a probabilistic choice of either Class 1 or Class 2, where each class represents one of the two possible diacritization options.
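The macro-averaged scoring described above can be sketched as follows (an illustrative pure-Python version; class labels 1 and 2 stand for the two analyses, so that the minority analysis counts equally toward the final score):

```python
def macro_f1(gold, pred, classes=(1, 2)):
    """Compute precision/recall/F1 separately per class, then
    macro-average the per-class F1 scores."""
    f1s = []
    for c in classes:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Note how a classifier that always predicts the majority class scores well on accuracy but is heavily penalized here, since the minority class contributes an F1 of zero to the average.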
Table 3: Accuracy of our specialized classifiers for the 21 homographs in our challenge set. We evaluate three methods for encoding the context words, and we run each method in two ways: (1) "Concat": concatenate encodings of the 3 neighboring words on each side; (2) "LSTM": run the complete sentence context through a biLSTM. We show F1 scores for each, macro-averaged across the two classes. See Appendix Tables 4-5 for a breakdown of recall/precision scores for each analysis.

We applied two methods to represent the surrounding context in the MLP input. The first is encoding the three neighboring words on each side of the target word, concatenating their vectors (Equation 1). (The effectiveness of short contexts was demonstrated by Fraenkel et al. (1979) and Choueka and Lusignan (1985); regarding short-context disambiguation methods in general, see Hearst (1991) and Yarowsky (1994).)

(1) h(w_i) = w_{i-3} · w_{i-2} · w_{i-1} · w_{i+1} · w_{i+2} · w_{i+3}

The second is encoding the whole sentence around the word using a 2-layer biLSTM (Hochreiter and Schmidhuber, 1997), Equation 2:

(2) h(w_i) = LSTM(w_{0:i}) · LSTM(w_{n:i})

We explore three alternate methods of encoding the vector w_i. Our initial approach uses pre-trained word2vec embeddings for the neighboring words. (We use word2vecf (Levy and Goldberg, 2014) to build syntax-sensitive word embeddings, based on a corpus of 400M words of Hebrew text. BERT might seem the more obvious choice; however, BERT has been shown to be somewhat ineffective for morphologically rich languages such as Hebrew (Tsarfaty et al., 2020): BERT-based models underperform YAP, perform at the same level as biLSTM-based models, and fail to capture internal morphological complexity (Klein and Tsarfaty, 2020).)

Our second approach uses morphological information about the context words. Of course, we do not have any a priori knowledge of the morphological tagging of the neighboring words; indeed, in a large percentage of cases, the morphology of the neighboring words can be resolved in multiple ways. Thus, we construct a lattice of all possible analyses of the context words. For every context word w_i, we construct a vector for each possible part-of-speech pos_j containing a trainable embedding for each possible morphological feature. The vector thus encodes: part-of-speech, gender, number, person, status, binyan, suffix, suf gender, suf person, suf number, prefix. If a feature is not applicable to w_i, we simply assign an NA embedding. We concatenate each vector w_i^{pos_j} into a single vector representing w_i. Finally, we explore a third, composite method in which we concatenate the encodings from the two previous methods into the encoding for w_i. (Our challenge set is available for use in future research.)
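The per-analysis morphological feature vector can be illustrated as follows. This is a minimal sketch: the toy random vectors stand in for trainable embeddings, and the embedding dimension is invented; only the feature inventory follows the list given above.

```python
import random

# Morphological features encoded per analysis, as listed above.
FEATURES = ["pos", "gender", "number", "person", "status", "binyan",
            "suffix", "suf_gender", "suf_person", "suf_number", "prefix"]
DIM = 4  # toy embedding size (hypothetical; the paper does not specify it)

_rng = random.Random(0)
_emb = {}  # (feature, value) -> vector; lazily created stand-ins for trainable embeddings

def feat_embedding(feature, value):
    """Look up (or lazily create) the embedding for one feature value."""
    key = (feature, value)
    if key not in _emb:
        _emb[key] = [_rng.uniform(-0.1, 0.1) for _ in range(DIM)]
    return _emb[key]

def encode_analysis(analysis):
    """Concatenate one embedding per morphological feature; features that
    do not apply to this analysis share a per-feature 'NA' embedding."""
    vec = []
    for f in FEATURES:
        vec.extend(feat_embedding(f, analysis.get(f, "NA")))
    return vec
```

In the full model, the vectors for all possible analyses of a context word would then be concatenated into a single lattice representation of that word.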
We run each contrast set using each of our three methods for encoding the neighboring words. We evaluate the results using 10-fold cross validation.
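The 10-fold evaluation can be sketched as a simple index split (illustrative only; the helper name and the interleaved split strategy are our own choices):

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Shuffle indices once, split into k interleaved folds, and yield
    (train, test) index lists; each fold is held out exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

Each contrast set would then be trained k times, once per split, with the reported score averaged over the k held-out folds.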

Results and Analysis
In Table 2, we display the results of our baseline experiment, where we evaluate current SOTA (YAP) performance on our challenge set. These results empirically demonstrate how much more difficult it is for YAP to resolve the cases of unbalanced homographs. The unbalanced cases are shown in the top half of the table (1-12). YAP's F1 score is below .8 for all but one of the cases, and it is below .6 for 9 out of the 12 cases. In the two cases of Pronoun vs. Suffixed Preposition (2, 4), YAP performs particularly poorly, scoring .4 and .1. In contrast, the bottom half of the table (13-21) details nine cases of balanced homographs. As expected, YAP does considerably better here: all F1 scores are above .5, and four of the cases are above .8. The weakest cases are those in which YAP has to differentiate between an unsegmented noun and a case of a noun plus possessive suffix (cases 14, 20). In both of these cases, YAP scores an F1 of approximately .56 (which, interestingly, is precisely on par with the analogous unbalanced case [10]).

(For verbs only, we add a morphosyntactic valence feature indicating the transitivity of the general usage of the verb. This is reminiscent of supertagging (Bangalore and Joshi, 1999) and shows a non-negligible empirical contribution on our data. See Appendix Table 2 for a comparison of results with and without the valence feature.)
In Table 3, we display results regarding our specialized classifiers. In most cases, using a biLSTM over the entire sentence context performs better than a concatenation of the three neighbor words on each side. In terms of the encoding method for the context words, word2vec performs better than the morphological lattice. This may be because word2vec can better represent the regularly expected usage of the neighboring words, while the morphology lattice represents all possible analyses with equal likelihood. A second possibility is that the contrast sets were not sufficiently large to optimally train the embeddings of the morphological characteristics, whereas word2vec embeddings have the benefit of pretraining on hundreds of millions of words. The composite combination of these two methods overall outperforms each of them individually; thus, although word2vec succeeds in encoding most of what is needed to differentiate between the options, the information provided by the morphological lattice sometimes helps to make the correct call.
In Table 4, we compare the results of our composite method with those of YAP. Our specialized classifiers set a new SOTA for all the cases. While global, whole-sentence disambiguation methods have obvious advantages, they have limited applicability to Hebrew. As noted, in Hebrew the majority of the words are ambiguous, including the core building blocks of the language; without these anchors, global approaches tend to result in poor performance regarding unbalanced homographs.

Related Work
The problem of Hebrew diacritization is analogous to that of Arabic diacritization; Arabic, like Hebrew, is a morphologically-rich language written without diacritics, resulting in high ambiguity. Many recent studies have proposed machine-learning approaches for the prediction of Arabic diacritics across a given text (e.g., Bebah et al., 2020). However, these studies all perform evaluations on standard Arabic textual datasets, and do not evaluate accuracy regarding minority options of unbalanced homographs. We believe that these models would likely benefit from specialized challenge sets of the sort presented here to overcome the specific hurdle of unbalanced homographs.

Conclusion
Due to high morphological ambiguity, as well as the lack of diacritics, Semitic languages pose a particularly difficult disambiguation task, especially when it comes to unbalanced homographs. For such cases, specialized contrast sets are needed, both in order to evaluate the performance of existing tools and in order to train effective classifiers. In this paper, we construct a new challenge set for Hebrew disambiguation, offering comprehensive contrast sets for 21 frequent Hebrew homographs. These contrast sets empirically demonstrate the limitations of reported SOTA results when it comes to unbalanced homographs; a model may report a SOTA for a benchmark, yet fail miserably on real-world rare-but-important cases. Our new corpus will allow Hebrew NLP researchers to test their models in an entirely new fashion, evaluating the ability of the models to predict minority-homograph analyses, as opposed to existing Hebrew benchmarks which tend to represent the language in terms of its majority usage. Furthermore, our corpus will allow researchers to train their own classifiers and leverage them within a pipeline architecture. We envision the classifiers positioned at the beginning of the pipeline, disambiguating frequent forms from the get-go, and yielding improvement down the line, ultimately improving results for downstream tasks (e.g. NMT). Indeed, as we have demonstrated, neural classifiers trained on our contrast sets handily achieve a new SOTA for all of the homographs in the corpus.

Table 2: Quantification of the contribution of the valence "supertag". We examine results of our "Concat Composite" method, wherein we use the three neighboring words before and after the homograph, with each neighboring word represented by a concatenation of its word2vec embedding and a lattice of the morphological features of the possible analyses of the word. We indicate the change in results when adding the valence supertag to the lattice.
Table 4: Full breakdown of the performance of our specialized classifiers when trained with short contexts (concatenation of encodings of the three word-neighbors before and after the homograph). We display results for each of our three methods of encoding context words.

Table 5: Full breakdown of the performance of our specialized classifiers when trained with a bi-LSTM over the full sentence context. We display results for each of our three methods of encoding context words.