What’s Wrong with Hebrew NLP? And How to Make it Right

For languages with simple morphology, such as English, automatic annotation pipelines such as spaCy or Stanford's CoreNLP successfully serve projects in academia and industry. For many morphologically-rich languages (MRLs), similar pipelines show sub-optimal performance that limits their applicability for text analysis in research and industry. The sub-optimal performance is mainly due to errors in early morphological disambiguation decisions that cannot be recovered from later in the pipeline, yielding annotations that are incoherent as a whole. This paper describes the design and use of the ONLP suite, a joint morpho-syntactic infrastructure for processing Modern Hebrew texts. The joint inference over morphology and syntax substantially limits error propagation, and leads to high accuracy. ONLP provides rich and expressive annotations which already serve diverse academic and commercial needs. Its accompanying demo further serves educational activities, introducing Hebrew NLP intricacies to researchers and non-researchers alike.


Introduction
NLP pipelines for the automatic annotation of unstructured texts are at the core of language technology applications for Data Science, Text Analytics and Artificial Intelligence. For English, annotation pipelines such as spaCy (Honnibal and Montani, 2017) or Stanford's CoreNLP (Manning et al., 2014) successfully deliver the ability to automatically annotate unstructured texts with their underlying linguistic structures, including Part-of-Speech (POS) tags, morphological features, dependency relations, named entities, and so on. These annotations serve research labs, non-profit organizations and commercial endeavors in their quest to make sense of the vast amount of unstructured data available to them.
Universal processing pipelines such as UDPipe (Straka et al., 2016) aim to serve a range of other languages, but unfortunately, their performance on many morphologically rich languages (MRLs) (Tsarfaty et al., 2010), and in particular Semitic languages, is not on a par with their performance on English. This, in turn, greatly limits their applicability for further research and commercial use. The main reason for this sub-optimal performance on Semitic languages is that the pipeline design inherent in these frameworks is inappropriate for languages that exhibit extreme morphological ambiguity in their input stream: errors made early on, in morphological segmentation and disambiguation, jeopardize the system's accuracy further down the pipeline. For Hebrew, this performance gap has long been a show-stopper for advancing Language Technology and Artificial Intelligence for the Hebrew-speaking community. With this contribution, we aim to remedy this situation.
In this paper we describe the design and use of the ONLP system, a joint morphological-syntactic parsing framework for processing the Semitic language Modern Hebrew (henceforth, Hebrew). The system is accurate, efficient, and provides rich and expressive output including segmentation, POS tags, lemmas, features and labeled dependencies. The joint training and inference over the different layers substantially limits error propagation, and leads in turn to speed and high accuracy. Among the technical advantages of the ONLP suite are its open license, an easy 3-step installation, and a single package with all elements included, so there is no need to train or maintain individual components separately. The ONLP suite already serves academic and commercial projects in diverse domains. Its accompanying online demo has further proved valuable for educational purposes, exposing CS/NLP and non-CS researchers and engineers to the intricacies of Semitic NLP.
In morphologically-rich languages (MRLs), each input token may consist of multiple lexical and functional units (henceforth, morphemes), each of which serves a particular role in the overall syntactic or semantic representation. In Hebrew, for example, the token 'וכשמהמעבדה' corresponds to five word tokens in English, each of which carries its own distinct role: 'ו' (and, CC), 'כש' (when, REL), 'מ' (from, IN), 'ה' (the, DT), 'מעבדה' (lab, NN). This means that in order to process Hebrew texts, one first needs to segment the Hebrew tokens into their constituent morphemes. At the same time, Hebrew raw tokens are highly ambiguous. A token such as 'הקפה' may be interpreted as 'הקפה' (orbit, NN), 'ה' + 'קפה' (the+coffee, DT+NN), or 'הק' + 'של' + 'היא' (perimeter of her, NN+POSS+PRP), etc. This is further complicated by the lack of diacritics in standardized texts, meaning that most vowels are not present, and that no reading is a priori more likely than the others out of context. Only in context do the correct interpretation and segmentation become apparent.
These facts create an apparent loop in the design of NLP pipelines for Hebrew: syntactic parsing requires morphological disambiguation, but morphological disambiguation requires syntactic context. This apparent loop has called for the development of joint systems, rather than pipelines, for processing Semitic languages (Tsarfaty, 2006; Green and Manning, 2010). This joint hypothesis has proven useful for Hebrew and Arabic phrase-structure parsing (Goldberg and Tsarfaty, 2008; Green and Manning, 2010; Goldberg and Elhadad, 2011). The ONLP suite is a dependency-based parsing framework implementing this joint hypothesis over the entire morpho-syntactic search space, as depicted in Figure 1 (More et al., 2019).

The Architectural Design
The core of ONLP is YAP (Yet Another Parser), a morpho-syntactic parser for morphological and syntactic analysis of Hebrew texts. YAP reimplements and extends the structure-prediction framework of Zhang and Clark (2011). We describe YAP in detail in More and Tsarfaty (2016) and More et al. (2019). Here we only provide a bird's eye view of the architecture. In YAP we embrace the extreme morphological ambiguity in Hebrew. That is, we do not aim to resolve morphological ambiguity via preprocessing. The input to YAP is the complete Morphological Analysis (MA) of an input sentence x, termed here MA(x). MA(x) is a lattice structure consisting of all possible morphological analyses of the input sentence, as seen in the middle of Figure 1. Each arc is a tuple specifying the start index, the end index, the form of the segment, its part-of-speech, lemma and features, and the index of the raw token the arc originated from. An arc in the lattice can serve as a node in a syntactic dependency tree. Each contiguous path in the lattice presents one valid morphological segmentation of the sentence, to which a dependency tree can be assigned, as in Figure 1. For each path in the lattice, there is an exponential number of dependency trees that are potentially applicable.
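To make the lattice structure concrete, the following is a minimal sketch (not YAP's actual data structures) that represents arcs with the fields listed above and enumerates every contiguous path. The class name `Arc`, the function `lattice_paths`, the Latin transliteration and the toy tags are assumptions made for illustration only, and the lattice is assumed to be acyclic.

```python
from collections import defaultdict
from typing import NamedTuple


class Arc(NamedTuple):
    start: int   # start node index in the lattice
    end: int     # end node index in the lattice
    form: str    # surface form of the segment
    pos: str     # part-of-speech tag
    lemma: str
    feats: str   # morphological features, "_" when there are none
    token: int   # index of the raw token the arc originated from


def lattice_paths(arcs, source, sink):
    """Enumerate every contiguous path of arcs leading from source to sink."""
    outgoing = defaultdict(list)
    for arc in arcs:
        outgoing[arc.start].append(arc)

    def walk(node, prefix):
        if node == sink:
            yield list(prefix)
            return
        for arc in outgoing[node]:
            prefix.append(arc)
            yield from walk(arc.end, prefix)
            prefix.pop()

    return list(walk(source, []))


# Hypothetical toy lattice for one ambiguous token, in Latin transliteration.
TOY_LATTICE = [
    Arc(0, 2, "hqph", "NN", "hqph", "_", 1),  # a single unsegmented noun
    Arc(0, 1, "h", "DT", "h", "_", 1),        # a definite article ...
    Arc(1, 2, "qph", "NN", "qph", "_", 1),    # ... followed by a noun
]

for path in lattice_paths(TOY_LATTICE, 0, 2):
    print(" + ".join(f"{arc.form}/{arc.pos}" for arc in path))
```

Each printed line corresponds to one candidate segmentation; in the joint setting, every such path would additionally be paired with candidate dependency trees.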
We refer to the task of selecting the most likely lattice-path as Morphological Disambiguation (MD), and to the task of selecting the most likely dependency tree for a given path as Dependency Parsing (DEP).For an input sentence x, our goal is to jointly predict a single pair of MD(x) and DEP(x) that are consistent with one another, and form the most-likely analysis of the sentence.
The MD component is the transition-based morphological parser of More and Tsarfaty (2016), which is formally based on the structure-prediction framework of Zhang and Clark (2011). MD accepts a sentence lattice MA(x) as input and delivers a selected sequence of arcs (morphemes) MD(x) as output. The transition-based system for MD selects arcs for MD one at a time. It decodes the lattice using beam search, and keeps the K-best paths at each step, scored according to morpheme-level and token-level features, weighted via structured-perceptron learning.
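The decoding idea can be sketched as follows: hypotheses are extended one arc at a time, scored with a linear model, and only the K best are kept per step. This is an illustration under stated assumptions, not YAP's implementation: the feature templates in `arc_score`, the hand-set weights (in YAP they are learned with the structured perceptron), the transliterated forms and all names are hypothetical, and the lattice is assumed to be acyclic with at least one complete path.

```python
from typing import NamedTuple


class Arc(NamedTuple):
    start: int
    end: int
    form: str
    pos: str


def arc_score(prev_pos, arc, weights):
    """Linear score: the sum of the weights of the features that fire."""
    features = [
        f"pos={arc.pos}",
        f"form={arc.form}",
        f"bigram={prev_pos}>{arc.pos}",
    ]
    return sum(weights.get(feature, 0.0) for feature in features)


def beam_search_md(arcs, source, sink, weights, k=2):
    """Pick the best-scoring lattice path, keeping k open hypotheses per step."""
    beam = [(0.0, source, "<s>", [])]  # (score, node, previous POS, path)
    finished = []
    while beam:
        candidates = []
        for score, node, prev_pos, path in beam:
            for arc in arcs:
                if arc.start != node:
                    continue
                hyp = (score + arc_score(prev_pos, arc, weights),
                       arc.end, arc.pos, path + [arc])
                (finished if arc.end == sink else candidates).append(hyp)
        candidates.sort(key=lambda h: h[0], reverse=True)
        beam = candidates[:k]  # prune to the k best open hypotheses
    best = max(finished, key=lambda h: h[0])
    return best[0], best[3]


# Hypothetical toy lattice and hand-set weights.
ARCS = [Arc(0, 2, "hqph", "NN"), Arc(0, 1, "h", "DT"), Arc(1, 2, "qph", "NN")]
WEIGHTS = {"bigram=DT>NN": 1.0, "pos=DT": 0.5, "form=hqph": 0.2}

score, path = beam_search_md(ARCS, 0, 2, WEIGHTS)
print(score, [arc.form for arc in path])
```

With these toy weights the segmented reading outscores the unsegmented one; different weights would flip the decision, which is exactly what learning adjusts.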
The DEP component is a re-implementation of the Zhang and Nivre (2011) dependency parser for English, adapted for Hebrew. We assume an Arc-Eager transition system and beam-search decoding. Feature weights are learned via the structured perceptron. We employ a carefully-designed feature set that reflects linguistic properties of Hebrew such as its rich morphological paradigms, flexible word order, agreement, etc. This provides state-of-the-art results on Hebrew dependency parsing, albeit in an oracle (i.e., gold morphology) scenario.
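For readers unfamiliar with the Arc-Eager system, the sketch below applies a given sequence of its four standard transitions (SHIFT, LEFT-ARC, RIGHT-ARC, REDUCE) and collects the resulting arcs. It only illustrates the transition system itself; the scoring, beam search and Hebrew feature set are left out, and the English example and the labels `subj`, `obj` and `root` are illustrative assumptions rather than the SPMRL label set.

```python
def arc_eager(words, transitions):
    """Apply arc-eager transitions; return a set of (head, label, dependent).

    Index 0 is the artificial root; words are numbered from 1.
    """
    stack = [0]
    buffer = list(range(1, len(words) + 1))
    arcs, has_head = set(), set()
    for action, label in transitions:
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":    # front of buffer heads the top of the stack
            dep = stack.pop()
            assert dep != 0 and dep not in has_head
            arcs.add((buffer[0], label, dep))
            has_head.add(dep)
        elif action == "RIGHT-ARC":   # top of the stack heads the front of buffer
            dep = buffer.pop(0)
            arcs.add((stack[-1], label, dep))
            has_head.add(dep)
            stack.append(dep)
        elif action == "REDUCE":      # pop a word that already has a head
            assert stack[-1] in has_head
            stack.pop()
        else:
            raise ValueError(f"unknown transition: {action}")
    return arcs


WORDS = ["She", "reads", "books"]
TRANSITIONS = [
    ("SHIFT", None),
    ("LEFT-ARC", "subj"),
    ("RIGHT-ARC", "root"),
    ("RIGHT-ARC", "obj"),
]

for head, label, dep in sorted(arc_eager(WORDS, TRANSITIONS)):
    print(head, label, dep)
```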
Since both MD and DEP realize the same formal framework and inherit from the same computational machinery, we can easily unify them and treat the morpho-syntactic task as a single objective. The transition systems are combined and the beam-search decoder interleaves morphological and syntactic decisions. Now morphological decisions may be affected by syntactic context, and vice versa.
The architecture is depicted in Figure 2. In More et al. (2019) we compared the performance of the joint system to our own pipeline system and to other systems available for Hebrew morphological and syntactic parsing, and showed significant improvements of YAP's joint model over all competing systems.
Morphological Segmentation The most basic form of analysis of Hebrew texts is the segmentation of raw tokens into multiple meaning-bearing units that we call morphemes. Due to orthographic and phonological processes, some morphemes do not appear explicitly in the surface form. Our segmentation recovers all morphemes, both overt and covert.
For example, the token 'בבית' (in the house) is segmented as 'ב' + 'ה' + 'בית'.
Part-of-Speech (POS) Tags Each morphological segment is assigned a single Part-of-Speech tag category that indicates its syntactic role. The set of tags used by the system is based on the SPMRL scheme, which in turn adopts the POS labels from Sima'an et al. (2001) (detailed in our appendix).
Morphological Features Along with the POS category, we specify for each segment the properties that are signalled by inflectional morphology. The properties encoded by the scheme include Tense [Past, Present, Future, Imperative, Infinitive].
Lemmas Each segment is also assigned a lemma, i.e., the canonical representation of its core (uninflected) meaning. For Hebrew nouns and adjectives, the lemma is chosen to be the Masculine-Singular form. For verbs, the lemma is the third-person Masculine-Singular form in Past tense.

Dependency Tree
The dependency tree is defined over all morphological segments and an artificial root node. It consists of a set of labeled binary relations that indicate the bi-lexical dependencies between segments.
Note that the SPMRL dependency scheme, as opposed to UD, always selects functional heads rather than lexical heads. The dependency labeling is based on the scheme from Tsarfaty (2013), repeated in the appendix.
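As a small illustration of the structure described here, the sketch below checks that a list of head indices forms a well-formed tree over the segments plus an artificial root (index 0): every head index must be in range, no segment may head itself, and following heads from any segment must reach the root without a cycle. The function name and the head lists are hypothetical, and the check does not restrict how many segments attach directly to the root.

```python
def is_valid_tree(heads):
    """heads[i - 1] is the head of segment i; 0 denotes the artificial root."""
    n = len(heads)
    for i, head in enumerate(heads, start=1):
        if head < 0 or head > n or head == i:
            return False
    for i in range(1, n + 1):
        seen, node = set(), i
        while node != 0:          # climb towards the root
            if node in seen:      # revisiting a node means there is a cycle
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


print(is_valid_tree([0, 3, 1]))   # segment 1 attaches to the root
print(is_valid_tree([2, 1, 1]))   # segments 1 and 2 head each other
```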
Lattices As explained in section 3 above, a word can be segmented into morphemes in multiple ways, which are constrained by a broad-coverage lexicon. In addition to the parsed output, we make available for each input sentence its sentence lattice, i.e., the set of all possible segmentations of the sentence, along with all possible morpho-syntactic analyses for each arc.

Technical Details and Forms of Use
YAP is implemented in the Go language. It requires 6GB of RAM to run, and employs a simple 3-step installation, given in the supplementary material in the appendix. The input to the system is a tokenized sentence, with tokens appearing one per line, and a line break after every sentence. The output is a dependency tree (where each node in the tree is a lattice arc) provided in the CoNLL-X format (Buchholz and Marsi, 2006). YAP is trained on the Hebrew section of the SPMRL shared task. It also makes use of the broad-coverage lexicon of Itai and Wintner (2008) for finding all potential lattice paths. In case of out-of-vocabulary (OOV) items, we employ a simple heuristic in which we suggest the 10 most-likely analyses of rare tokens observed during training.
Simple Use | Command line From the command line, one can process one input file at a time, containing one or more sentences. The input file must be formatted with a single token per line, and an empty line denoting the end of every sentence.
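The input layout just described (one token per line, an empty line after every sentence) can be produced and read back with a few lines of code. This is a minimal sketch; the helper names `to_yap_input` and `from_yap_input` and the placeholder tokens are assumptions for illustration and are not part of YAP.

```python
def to_yap_input(sentences):
    """Serialize tokenized sentences: one token per line, blank line after each."""
    return "".join("\n".join(tokens) + "\n\n" for tokens in sentences)


def from_yap_input(text):
    """Inverse of to_yap_input: split the text back into lists of tokens."""
    sentences, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:             # a blank line closes the current sentence
            sentences.append(current)
            current = []
    if current:                   # tolerate a missing final blank line
        sentences.append(current)
    return sentences


# Placeholder tokens; real input would contain Hebrew tokens.
SENTENCES = [["tok1", "tok2", "tok3"], ["tok4", "."]]
serialized = to_yap_input(SENTENCES)
print(serialized, end="")
```

Writing `serialized` to a UTF-8 file would give a file in the layout described above.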
Processing a file is done in 2 steps. First, run Morphological Analysis (./yap hebma) to generate a sentence lattice containing all possible morphological breakdowns of each token. YAP will save the lattice to the file specified via the -out flag. Next, run joint Morphological Disambiguation and Dependency Parsing (./yap joint) to jointly predict the best lattice path and the corresponding dependency tree. The input to this command is the output file generated in the previous step, and there are 3 output files: one containing word segments, one containing the disambiguated lattice path, and one containing the complete dependency tree in CoNLL-X format.
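The CoNLL-X file produced in the last step can be loaded with a short reader such as the sketch below. It assumes the standard ten tab-separated CoNLL-X columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) and numeric ID and HEAD values. The sample rows are invented for illustration, with transliterated forms and hypothetical tags, features and labels; they are not actual YAP output.

```python
CONLLX_FIELDS = ("id", "form", "lemma", "cpostag", "postag",
                 "feats", "head", "deprel", "phead", "pdeprel")


def parse_conllx(text):
    """Parse CoNLL-X text into a list of sentences, each a list of row dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():      # a blank line separates sentences
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        if len(cols) != len(CONLLX_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        row = dict(zip(CONLLX_FIELDS, cols))
        row["id"], row["head"] = int(row["id"]), int(row["head"])
        current.append(row)
    if current:
        sentences.append(current)
    return sentences


# Invented sample: three segments of one token, in Latin transliteration.
SAMPLE = "\n".join("\t".join(cols) for cols in [
    ("1", "b", "b", "IN", "IN", "_", "0", "ROOT", "_", "_"),
    ("2", "h", "h", "DT", "DT", "_", "3", "def", "_", "_"),
    ("3", "byt", "byt", "NN", "NN", "gen=M|num=S", "1", "pobj", "_", "_"),
]) + "\n"

for row in parse_conllx(SAMPLE)[0]:
    print(row["id"], row["form"], row["postag"], row["head"], row["deprel"])
```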
Advanced Use | RESTful API YAP can run as a RESTful server that accepts parse requests. To do this, simply start the server, listening on localhost port 8000. Now you can call the joint endpoint with a JSON object containing the list of tokens to process in the HTTP data payload. The response is a JSON object containing the three output levels (MA, MD and Dep). You can use jq and sed (or any other JSON and line-processing tools) to format the (tab-separated value) responses and reassemble the output. Check our appendix for an illustration.
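A client for this mode might be sketched as below. Only the host and port come from the description above; the endpoint path `/yap/heb/joint`, the request field `text`, the use of GET with a JSON body, and the response keys `ma_lattice`, `md_lattice` and `dep_tree` are assumptions that should be checked against the YAP documentation before use. The code builds the request and parses a response body without contacting a server.

```python
import json
import urllib.request

# Host and port follow the text; the path is an assumption.
YAP_URL = "http://localhost:8000/yap/heb/joint"
LEVELS = ("ma_lattice", "md_lattice", "dep_tree")  # assumed response keys


def build_joint_request(tokens, url=YAP_URL, method="GET"):
    """Build (but do not send) a joint-parsing request for a token list."""
    payload = json.dumps({"text": " ".join(tokens)}).encode("utf-8")  # assumed field
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method=method,
    )


def split_tsv(block):
    """Split a tab-separated block into rows of columns, skipping blank lines."""
    return [line.split("\t") for line in block.splitlines() if line.strip()]


def parse_joint_response(body):
    """Turn the JSON response body into {level: rows} for the three levels."""
    response = json.loads(body)
    return {level: split_tsv(response.get(level, "")) for level in LEVELS}


request = build_joint_request(["tok1", "tok2"])
print(request.get_method(), request.full_url)
# With a running server, the call would look like:
#   body = urllib.request.urlopen(request).read().decode("utf-8")
#   levels = parse_joint_response(body)
```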
Educational Use | The Online Demo In 2018 we decided to create an online demo of the system, for educational purposes: (i) To expose NLP/AI researchers to NLP capabilities available for Hebrew. (ii) To educate non-CS scientists and engineers who work with Hebrew data (e.g., digital humanities) on text annotations that can potentially be useful for their applications. (iii) To launch outreach activities where we teach what NLP is to the local community (e.g., school kids). To use the demo, simply go to onlp.openu.ac.il and type a Hebrew sentence in the textbox. The demo is built with the Django and Bootstrap web frameworks. It sends the user's Hebrew text input to the ONLP server, which returns a CoNLL-X formatted parse along with the complete sentence lattice. Pre-processing includes pre-morphological tokenization of the input, where punctuation is separated from the tokens. Double quotation marks are separated from the word unless they appear before the last character of the word, to avoid over-segmentation of acronyms. The tokenized sequence is then passed to the ONLP server. The CoNLL-X output is then processed into separate layers for presentation. Furthermore, the demo presents the sentence lattice which is the input to the joint parser. This is useful for debugging, and for analyzing lexical coverage in out-of-domain scenarios.
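The pre-morphological tokenization rule described above can be sketched as follows. This is not the demo's actual code: the punctuation set, the function names and the Latin stand-ins for Hebrew acronyms are assumptions, and punctuation other than double quotes is only separated at the edges of a word in this sketch.

```python
PUNCT = set('.,;:!?()[]{}"')  # assumed punctuation set


def split_word(word):
    """Separate punctuation from a word, keeping an acronym-style quote."""
    lead, trail = [], []
    while word and word[0] in PUNCT:       # peel leading punctuation
        lead.append(word[0])
        word = word[1:]
    while word and word[-1] in PUNCT:      # peel trailing punctuation
        trail.insert(0, word[-1])
        word = word[:-1]
    middle, current = [], ""
    for i, ch in enumerate(word):
        # A double quote right before the last character marks an acronym
        # and stays inside the word; any other inner quote is split off.
        if ch == '"' and i != len(word) - 2:
            if current:
                middle.append(current)
            middle.append(ch)
            current = ""
        else:
            current += ch
    if current:
        middle.append(current)
    return lead + middle + trail


def pre_tokenize(text):
    """Whitespace-split the text, then separate punctuation word by word."""
    tokens = []
    for word in text.split():
        tokens.extend(split_word(word))
    return tokens


# 'AB"C' stands in for a Hebrew acronym written with an inner double quote.
print(pre_tokenize('he said "AB"C", ok.'))
```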
Expert Use | Out of Domain Scenarios A bottleneck for the system in out-of-domain parsing scenarios is the coverage of the lexicon. We rely on a general-purpose lexicon containing over 500K entries. OOV words are treated via heuristics we designed, which are suitable for the general case only. However, accurately identifying vocabulary items may be critical when applying the parser to new domains with domain-specific information (medical, financial, political, etc.). Fortunately, we can extend the system with a domain-specific lexicon, thus extending the MA coverage. Due to joint inference, the availability of a better-suited lexical analysis triggers better lexico-syntactic decisions on the whole. (We discuss how exactly this is executed in the appendix.)

Related and Future Work
Hebrew NLP in general, and Hebrew parsing in particular, are known to be challenging due to interesting linguistic properties, the scarcity of annotated data, and the small research community around it. As a result, Hebrew has been seriously understudied in NLP. During the early 2000s, the MILA knowledge center was established, where two of the main Hebrew resources for NLP were developed: the Hebrew treebank (Sima'an et al., 2001) and the Hebrew Lexicon (Itai and Wintner, 2008).
Morphological taggers for Hebrew using local linear context have been trained on these data and were made available for free use (Adler and Elhadad, 2006; Bar-haim et al., 2008). However, their performance was not on a par with parallel tools for English and was thus insufficient for commercial use. Hebrew dependency parsing was initially provided by Goldberg and Elhadad (2009), but the parser provides unlabeled dependencies, and the pipeline relied on Adler's morphological tagger. This left the automatic dependency trees inaccurate and unsatisfying. Joint morpho-syntactic models for constituency-based parsing (Tsarfaty, 2010) showed good performance on benchmark data, but their code was never released for open use.
With the development of the UD treebanks collection, general frameworks such as UDPipe (Straka et al., 2016) and CoreNLP (Manning et al., 2014) have been trained on the Hebrew UD treebank, and their models were made available. However, these models provide performance that is still far from satisfactory. As we also demonstrate in our screen-cast, these systems make very basic mistakes, even on the simplest sentences. We conjecture that this is due to their inherent pipeline assumption: the initial layers of processing make many mistakes due to the extreme morphological ambiguity, and later layers cannot recover. Notably, neural network models utilizing word embeddings (e.g., UDPipe) also still lag behind.
Table 1 shows the task coverage of existing tools and toolkits for NLP in Hebrew, academic as well as private initiatives (NITE, Hebrew-NLP). The task coverage of the ONLP suite we present is on a par with international standards (UDPipe, CoreNLP), and its level of performance was shown to exceed all existing models (More et al., 2019). We are currently working towards Named-Entity Recognition as well as Open Information Extraction, to be added to ONLP in the near future.

Conclusion
This paper presents ONLP, a complete language-processing framework for automatic annotation of Modern Hebrew texts. The framework covers morphological segmentation, POS tags, lemmas and features, and dependency parsing, predicted jointly. The system is easy to install and to use, and we support multiple forms of usage fitting user personas with different needs. We hope the availability of an open-source, accurate, and easy-to-use system for NLP in Hebrew will benefit the local NLP open-source community and greatly advance Hebrew language technology research and development, in academia and in industry.

B Screen-Cast
Check out our screen-cast online demo at: https://www.youtube.com/watch?v=H6pvh1x20FQ

C Morphological Ambiguity: Lattices
Table 2 shows a sentence lattice capturing the high ambiguity of Hebrew morphological analysis. For a simple 3-token input sentence, 22 possible arcs present valid analyses of the various tokens. A single contiguous path through the lattice needs to be selected for the sentence to be further processed by syntactic parsers or downstream applications.

D Annotation Layers
The annotation scheme provided by ONLP corresponds to the Hebrew section of the SPMRL shared task 2013-2014. The Part-of-Speech tags we employ are provided, along with illustrative examples, in Table 3. The dependency labels are defined and illustrated in Table 5.

E The Online Demo
In Figure 3 we present a screen capture of the Morphological Segmentation, POS tags and Dependency Relations for two raw input sentences, as executed on our demo page:
• 'הב שכב בצל' 'the-boy was-lying in-the-shade'
• 'הב שנ בצל' 'the-boy that-was-napping in-the-shade'
Note that the two raw sentences have very similar forms (in fact, they only differ in two characters), but they end up forming very different syntactic structures, which the ONLP system annotates correctly.
Figures 4-6 present the usage patterns with the YAP parser, the core algorithm of the framework. In Figure 4 we present the 3-step installation, in Figure 5 we show a simple command-line use, and in Figure 6 we show how to use YAP as a service. As noted before, the input file must be formatted with a single token per line and an empty line denoting the end of a sentence. YAP has been written in Go in order to enable multi-threading. This means that it can be called from multiple threads in parallel. As of June 2019 there is also a Python wrapper, created by members of the Israeli open-source community.

G Out-of-Domain Scenarios
When observing errors in a new domain, the first thing to check is whether or not they are due to lexical gaps, i.e., whether they stem from a lack of coverage in the lexicon.
The availability of the sentence lattice output is of great value in this respect. By reviewing the lattice, it is possible to see whether the lexicon contains the correct morphological analysis for the input token at all. If the correct analysis is not in the lattice, it is easy to add the missing analyses by editing the lexicon. Each line in the lexicon file contains a token followed by a list of one or more possible morphological analyses of that token. An analysis is a tuple made of 3 parts, prefix:host:suffix, followed by the host lemma. Each tuple member contains the part-of-speech tag and morphological features of the corresponding element. The prefix and suffix can be empty. E.g.:
> אאבד :VB-MF-S-1-FUTURE-NIFAL: נאבד :VB-MF-S-1-FUTURE-PIEL: איבד
An example use case could arise when processing medical-domain texts related to cancer, in which the word 'לימפה' (lymph) appears in the text but is missing from the lexicon. In this case, the parser errs in identifying the first 'ל' as the preposition "to", followed by a proper noun.
To remedy this, we can update the lexicon by adding the following line:
> לימפה :NN-F-S: לימפה
This means that the token לימפה is a common noun with feminine gender and singular number, followed by the lemma, and that it is unambiguous (i.e., only one analysis is available). Note that after updating the lexicon you need to restart YAP (if running as a RESTful server) for the lexical changes to apply. Now that לימפה is no longer OOV, sentences containing this token will be given a more accurate lattice, and as a result will be analyzed with a global syntactic structure that accords with the correct analysis. We suggested these lexicon edits to our users working in specific domains in the industry (medical, social, political), and they attested to significant improvements when running on particular domains.
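The lexical-gap check described in this appendix can be automated once the lattice has been loaded into rows. The sketch below assumes each lattice arc is a dictionary with `from`, `to`, `form`, `lemma`, `pos` and `token` keys, mirroring the columns of Table 2; the function name, the transliterated forms and the tag `PREPOSITION` are hypothetical and used for illustration only.

```python
def missing_analysis(lattice, token_index, pos, lemma):
    """True if no arc of the given raw token carries the expected POS and lemma."""
    return not any(
        arc["token"] == token_index and arc["pos"] == pos and arc["lemma"] == lemma
        for arc in lattice
    )


# Toy lattice for a one-token input before the lexicon edit: the only
# available reading is a preposition followed by a proper noun.
BEFORE = [
    {"from": 0, "to": 1, "form": "l", "lemma": "l", "pos": "PREPOSITION", "token": 1},
    {"from": 1, "to": 2, "form": "ympa", "lemma": "ympa", "pos": "NNP", "token": 1},
]
# After adding the lexicon entry, the common-noun reading becomes available.
AFTER = BEFORE + [
    {"from": 0, "to": 2, "form": "lympa", "lemma": "lympa", "pos": "NN", "token": 1},
]

print(missing_analysis(BEFORE, 1, "NN", "lympa"))
print(missing_analysis(AFTER, 1, "NN", "lympa"))
```

A gap flagged by such a check points at the lexicon, rather than the parser, as the place to intervene.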

Figure 1: The Joint Morpho-Syntactic Search-Space. Lattice paths are of different lengths. Each lattice path can be assigned an exponential number of trees.

Figure 2: A bird's eye view of the Architecture

Figure 3: On the right, we present a screen capture of the Morphological Segmentation, POS tags and Dependency Relations for the raw input sentence 'הב שכב בצל' ('the boy was lying in the shade'), as seen on our demo page. On the left, we likewise present the Morphological Segmentation, POS tags and Dependency Relations for the nominal phrase 'הב שנ בצל' ('the boy that was napping in the shade'). Note that the two raw sentences have very similar forms (in fact, they only differ in two characters), but they end up forming very different syntactic structures, which our system identifies and annotates correctly.

Figure 4: A 3-Step Installation. To install YAP, make sure you have Go, Git and BZip2 installed and available on your system's PATH. The instructions are for Linux, but the installation can be done similarly on Windows/MacOS.

Table 1: Existing Coverage for Hebrew NLP Tasks
In the demo, the CoNLL-X output is processed into the following layers: the FORM column is concatenated and presented as "Segmented Text", and the POS, LEMMA, FEATS and DEPS are presented in separate accordion tabs.

Table 2: The Lattice representation for 'הב שנ בצל' ('The boy who slept in the shade'). Col 1-2: From/To, the start and end nodes of the morpheme. The numbers are with respect to the maximal-length route. Col 3: Form, the surface form of the morphological segment. Col 4-5-6: Form/Lemma/Part of Speech. The same segment may belong to different entries in the lexicon. Each entry is given in a separate row, where the differences between the different meanings are surfaced in one (or more) of the Form/Lemma/Part of Speech columns. Col 7: Token Number, the index of the raw (space-delimited) token in the input before segmentation.