A neural parser as a direct classifier for head-final languages

This paper demonstrates a neural parser implementation suitable for consistently head-final languages such as Japanese. Unlike the transition- and graph-based algorithms in most state-of-the-art parsers, our parser directly selects the head word of a dependent from a limited number of candidates. This method drastically simplifies the model so that we can easily interpret the output of the neural model. Moreover, by exploiting grammatical knowledge to restrict possible modification types, we can control the output of the parser to reduce specific errors without adding annotated corpora. The neural parser performed well both on conventional Japanese corpora and on the Japanese version of the Universal Dependencies corpus, and the advantages of distributed representations were observed in the comparison with the conventional non-neural model.


Introduction
Dependency parsing provides intuitive relationships between words, such as noun-verb and adjective-noun combinations. Its outputs are consumed in text mining systems (Nasukawa and Nagano, 2001) and in rule-based approaches such as fine-grained sentiment extractors (Kanayama et al., 2004), though some recent end-to-end systems do not require intermediate parsing structures.
Many recent dependency parsers have been implemented with neural net (NN) methods using (typically bidirectional) LSTMs and distributed word vectors (Dozat et al., 2017; Shi et al., 2017), as we can see in the 2017 shared task on dependency parsing from raw text for 49 languages (Zeman et al., 2017) based on the multilingual corpora of Universal Dependencies (UD) (Nivre et al., 2015).
Most of these dependency parsers exploit a transition-based algorithm (Nivre et al., 2007), a graph-based algorithm (McDonald et al., 2005) or a combination of both (Nivre and McDonald, 2008). Those algorithms addressed several problems in multilingual dependency analysis, such as bidirectional dependency relationships and non-projective sentences. However, it is hard to intuitively interpret the actions that a transition-based parser is trained on. Though it can handle the history of past parsing actions, the output may violate syntactic constraints because only a limited part of the history is visible. The graph-based approach captures global information in a sentence, but the difficulty of reflecting the interaction of attachment decisions causes contradictory labels in a tree.
The parsing results from the participants in the 2017 shared task show particularly low scores on Japanese (67 to 82% in UAS, excluding the team that provided the data), which suggests that language-universal approaches do not work effectively for Japanese. The syntactic structures in the Japanese version of Universal Dependencies (Tanaka et al., 2016; Asahara et al., 2018) have dependencies in both directions, as in other languages, since they are based on word-level annotations and the content-head dependency schema. However, when the syntactic structures are expressed as dependencies between phrasal units (so-called bunsetsus in Japanese; 'PU' hereafter in this paper), the head element always comes to the right, since Japanese is a consistently head-final language. This allows us to apply a method specific to such languages and simplify the model. We constructed a neural parsing model that directly selects the head word among the limited candidates. The model works as a classifier that outputs intuitive and consistent results while exploiting grammatical knowledge.
Figure 1: Japanese word-level dependencies in the UD-style content-head schema for an example sentence " " ('A boy bought a red ball and is playing at the school').
Section 2 reviews the head-final property of Japanese and the Triplet/Quadruplet Model (Kanayama et al., 2000) to exploit syntactic knowledge in a machine-learning parser. Section 3 designs our neural model relying on the grammatical knowledge, and its experimental results are reported in Section 4. Other head-final languages are discussed in Section 5 and some related approaches are discussed in Section 6. Section 7 concludes this paper.

Background
First, Section 2.1 shows the head-final property of the Japanese language. Sections 2.2 and 2.3 then explain the main ideas of the Triplet/Quadruplet Model: methods that simplify the dependency parsing task using linguistic knowledge.

Head-final structure in Japanese
Figure 1 shows an example of a word-level dependency structure of a Japanese sentence in the representation of Universal Dependencies. Traditionally, Japanese dependency parsers have been evaluated on the unit of bunsetsu, a phrasal unit (PU), as in the Kyoto University Text Corpus (Kawahara et al., 2002). A PU consists of a content word and optional functional words, prefixes, and suffixes. Figure 2 depicts the dependency structure represented in PUs, where all the dependencies are in a single direction. The head PU is always on the right and the rightmost PU is always the root, as long as exceptional inversion cases are disregarded. In this paper we exploit this property to apply a simplified parsing algorithm: parsing can be regarded as the selection of the head PU from the limited number of candidate PUs located to the right of the dependent in question. For example, the second PU " " ('red') in Figure 2 must modify one of the four PUs from the third " " ('ball'-ACC) to the sixth PU " " ('play'-PROG). The correct head is " " ('ball'-ACC).

Restriction of modification candidates
The head-final feature can further simplify dependency parsing when syntactic constraints are added. The Triplet/Quadruplet Model (Kanayama et al., 2000) was proposed to achieve statistical dependency parsing that makes the most of linguistic knowledge and heuristics. In their work, a small number (about 50) of hand-crafted grammar rules determine whether a PU can modify each PU to its right in a sentence, as shown in Table 1. In the rules, the modifiable PUs are determined by conditions on the rightmost morpheme of the modifier PU. In addition to PoS-level relationships, detailed conditions with specific functional words and exceptional content words are covered in the rules. Even with these simplified rules, 98.5% of the modifications between PUs are covered. The role of the grammar rules is to maximize coverage; they simply describe high-level syntactic dependencies, so they can be created easily without worrying about precision or contradictory rules. The statistical information is later used to select the rules necessary for a given sentence to produce an accurate parsing result.
Table 1: An excerpt of the Japanese grammar rules. The left side is the condition on the modifier PU, specified with the rightmost morpheme (except for punctuation) and optional preceding morphemes; the right side is the list of conditions for the modifiable PUs, specified with the head word and optional functional words.
Rightmost morpheme of the modifier PU | Conditions for the modified PUs
postpositional " " wo (accusative) | verb, adjective
postpositional " " kara ('from') | verb, adjective
postpositional " " kara ('from') | nominal followed by postpositional " " made ('to')
proper noun + postpositional " " | proper noun followed by postpositional " " (-GEN)
postpositional " " no (genitive, nominative) | noun, verb, adjective
postpositional " " to (conjunctive) | noun, verb, adjective
postpositional " " to (conjunctive) | adverb " " isshoni ('together')
adverb | verb, adjective, adverb, nominal with copula
Table 2: Percentages of the position of the correct modified PU among the candidate PUs selected by the initial grammar rules. The column 'Sum' shows the coverage of the 1st, 2nd and last (the farthest) PUs in the distance from the modifier PUs. The EDR Japanese corpus was used in this analysis.
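The way the grammar rules act as a filter can be sketched as follows. This is a minimal illustration, not the original rule engine: the `PU` type, the `RULES` mapping, the function `modifiable`, the romanized keys and the PoS labels are all hypothetical simplifications, and only a few PoS-level rows of Table 1 are encoded (conditions on functional words and specific lexical items are left out).

```python
from dataclasses import dataclass

@dataclass
class PU:
    head_pos: str  # PoS of the head (content) word, e.g. "noun"
    last: str      # rightmost morpheme of the PU, romanized (hypothetical encoding)

# A few PoS-level rows from Table 1, simplified for illustration.
RULES = {
    "wo": {"verb", "adjective"},
    "kara": {"verb", "adjective"},
    "no": {"noun", "verb", "adjective"},
    "to": {"noun", "verb", "adjective"},
}

def modifiable(pus, i):
    """Return indices j > i of the PUs that PU i may modify under RULES."""
    allowed = RULES.get(pus[i].last, set())
    return [j for j in range(i + 1, len(pus)) if pus[j].head_pos in allowed]
```

Because the rules only need to maximize coverage, a lookup of this kind can stay permissive; choosing among the surviving candidates is left to the statistical model.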
Furthermore, an analysis of the EDR corpus shows that, among the modifiable PUs enumerated by the grammar rules, 98.6% of the correct heads are either the nearest PU, the second nearest PU, or the farthest PU from the modifier (more details in Table 2). Therefore, the model can be simplified by restricting the candidates to these two or three PUs and ignoring the others, at a small sacrifice (1.4%) of parsing accuracy. Retaining the farthest modifiable PU allows the model to capture long-distance dependencies.
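The restriction to the nearest, second nearest and farthest modifiable PUs can be sketched in a few lines. The function name `restrict_candidates` and the list-of-indices input are assumptions made for this illustration.

```python
def restrict_candidates(modifiable):
    """Keep the nearest, second nearest, and farthest modifiable PUs.

    `modifiable` is a list of PU indices ordered by distance from the
    modifier (nearest first). At most three indices are returned.
    """
    if len(modifiable) <= 3:
        return list(modifiable)
    return [modifiable[0], modifiable[1], modifiable[-1]]
```

For instance, `restrict_candidates([3, 4, 6, 8, 9])` keeps `[3, 4, 9]`. A modifier left with two candidates is handled by the triplet model and one with three candidates by the quadruplet model.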

Calculation of modification probabilities
Let $u$ be a modifier PU in question, $c_{un}$ the $n$-th modification candidate PU of $u$, and $\Phi_u$ and $\Psi_{c_{un}}$ the respective attributes of $u$ and $c_{un}$. Then the probability of $u$ modifying its $n$-th candidate is calculated by the triplet equation (1) when $u$ has two candidates or the quadruplet equation (2) when $u$ has three candidates. These two equations are known as the Triplet and Quadruplet Model.
Assuming the independence of those modifications, the probability of the dependency tree for an entire sentence, $P(T)$, is calculated as the product of the probabilities of all of the dependencies in the sentence. A beam search from the rightmost PU to the left is used to maximize $P(T)$ under the projectivity constraint.
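With the symbols defined above, the two equations can be written as follows. This is a sketch reconstructed from the description in this section, assuming that the candidate position $n$ is predicted from the attributes of the modifier and of all its candidates at once; the arrow notation for "$u$ modifies its $n$-th candidate" is likewise an assumption.

```latex
\begin{align}
P(u \rightarrow c_{un}) &= P(n \mid \Phi_u, \Psi_{c_{u1}}, \Psi_{c_{u2}}) \tag{1} \\
P(u \rightarrow c_{un}) &= P(n \mid \Phi_u, \Psi_{c_{u1}}, \Psi_{c_{u2}}, \Psi_{c_{u3}}) \tag{2}
\end{align}
```

In (1), $n \in \{1, 2\}$; in (2), $n \in \{1, 2, 3\}$.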
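Under this independence assumption, the tree probability referred to later as Equation (3) can be sketched as a product over the modifier PUs, where $n$ denotes the candidate position selected for each modifier $u$ in the tree $T$. The exact form is an assumption based on the description above.

```latex
\begin{equation}
P(T) \approx \prod_{u} P(u \rightarrow c_{un}) \tag{3}
\end{equation}
```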
Equations (1) and (2) have two major advantages. First, all the attributes of the modifier and its candidates can be handled simultaneously, so the model expresses the context through the combination of those attributes. Second, the probability of each modification is calculated based on the relative positions of the candidates, instead of the distance from the modifier PU in the surface sentence, which makes the model more robust.
Figure 3: The neural net for the quadruplet model to select the head of the PU " " ('president'-ACC) from three modification candidates in an example sentence " " ('... introduced_2 the previous president_0 to the acquired_1 company, and launched_3 a ... business'). The attributes of the modifier PU and three modification candidates, and the features between the modifier and each candidate, are input as distributed vectors.

NN parsing model
In the previous implementation of the parser based on the Triplet/Quadruplet Model (Kanayama et al., 2000), equations (1) and (2) were calculated with logistic regression (the maximum entropy method), in which many binary features represent the attributes of each PU. We designed the NN model using distributed representations of words and parts-of-speech, as Chen and Manning (2014) did. Figure 3 shows the neural net model that directly selects the head PU, together with an example sentence. Here, the head of " " ('president'-ACC) is predicted among three modification candidates selected by the method described in Section 2.2. The second candidate PU " " ('introduced'-ADV) is the correct head.
To make the prediction, the attributes of each PU are extracted. We focus on two words in a PU: the head word (the rightmost content word in the PU) and the form word (the rightmost functional word in the PU, excluding punctuation). First, the surface form and the PoS of the head word and the form word are converted into vector representations. That is, two vectors are used for each of 6 words in the triplet model and 8 words in the quadruplet model. Furthermore, the following attributes between two PUs are added.
• the number of occurrences of the postpositional " ha" between two PUs (0, 1, 2, 3, 4, or 5+)
• the number of commas between two PUs (0, 1, 2, 3, 4, or 5+)
• the distance between two PUs (1, 2, 3, ..., 9, or 10+)
These features, expressed as vectors, are concatenated to form a single layer, and the final output is given as the softmax over two or three values (1, 2, or 3).
The above calculation computes the probabilities of modification to the candidate PUs, but it does so independently for each modifier PU; therefore, modifications may cross each other within a sentence. Since Japanese dependency structures are fully projective, the optimal tree for the sentence is constructed using a beam search that maximizes Equation (3) in Section 2.3. More specifically, combinations of dependencies that violate the projective constraint (e.g. 5 ← 7 and 6 ← 8) are excluded from the beam at each step, so that a projective tree structure is guaranteed.
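The beam search with the projective constraint can be sketched as below. This is an illustrative implementation under stated assumptions rather than the original code: PUs are indexed from 0, the last PU is the root, `probs[i]` is a hypothetical mapping from each candidate head of PU `i` to its predicted probability, and the names `crosses` and `beam_search` are invented for this example.

```python
import math

def crosses(h, arcs):
    """True if a new arc ending at head h crosses an existing arc j -> k.

    The new modifier lies to the left of every existing modifier j,
    so the arcs cross exactly when j < h < k.
    """
    return any(j < h < k for j, k in arcs.items())

def beam_search(n, probs, beam_size=5):
    """Choose a head for every PU except the last one (the root).

    PUs are processed from right to left. Hypotheses containing
    crossing arcs are discarded before pruning to `beam_size`.
    Returns {modifier: head} for the best hypothesis, or None.
    """
    beam = [(0.0, {})]  # (log-probability, {modifier: head})
    for i in range(n - 2, -1, -1):
        expanded = []
        for score, arcs in beam:
            for h, p in probs[i].items():
                if p > 0 and not crosses(h, arcs):
                    expanded.append((score + math.log(p), {**arcs, i: h}))
        beam = sorted(expanded, key=lambda x: x[0], reverse=True)[:beam_size]
    return beam[0][1] if beam else None
```

In a four-PU sentence where PU 1 prefers head 3 and PU 0 prefers head 2, choosing both would produce crossing arcs, so the search falls back to the best-scoring projective combination.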

Experimental settings
We used the EDR Japanese Corpus (EDR, 1996) for the initial training and evaluation. After removing inconsistent PUs due to tokenization mismatch between the corpus and the runtime process, the evaluation was conducted on 2,941 test sentences; 160,080 sentences were used for training and 8,829 sentences were kept for validation. The models were implemented with TensorFlow (Abadi et al., 2015). The loss function was cross entropy, to which an L2 regularization term multiplied by $10^{-8}$ was added, and the model was optimized with AdamOptimizer (Kingma and Ba, 2014).
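The training objective described above can be illustrated with a small pure-Python sketch. It assumes that the L2 term is the plain sum of squared weights (the text does not state whether a scaled variant was used), and the names `softmax`, `loss` and `L2_WEIGHT` are chosen for this example only.

```python
import math

L2_WEIGHT = 1e-8  # coefficient reported in the text

def softmax(logits):
    """Turn the two or three output values into probabilities."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def loss(logits, gold, weights):
    """Cross entropy of the gold candidate plus an L2 penalty.

    gold is the 0-based index of the correct candidate; weights is a
    flat list of model parameters (assumed form of the penalty).
    """
    cross_entropy = -math.log(softmax(logits)[gold])
    l2 = sum(w * w for w in weights)
    return cross_entropy + L2_WEIGHT * l2
```

With such a small coefficient the penalty only matters when weights grow very large, so the objective is dominated by the cross-entropy term.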
Each word was expressed by two vectors. One was a 100-dimensional embedding of the surface form; the other was a 50-dimensional embedding of one of 148 values, the combination of 74 fine-grained PoS types and a binary feature indicating whether the PU contains a comma. The three features between PUs were each converted into 10-dimensional vectors. All of these vectors were randomly initialized and updated during the training. The input layer had 990 dimensions in the triplet model and 1,320 dimensions in the quadruplet model. The dimension of the hidden layer was set to 200, and the beam search was conducted with a beam size of 5.

Experimental results
Table 3 shows the accuracies of dependency parsing by the conventional model trained with logistic regression and by our proposed neural net model. Both models used very similar grammar rules and features. While the logistic regression method required manual selection of feature combinations to get optimal accuracy, the neural net model outperformed it when the training corpus was larger than 120k sentences, by 0.4 points when the training corpus was 160k sentences. The maximum amount of training data for the neural net model was 160k sentences because the validation data was kept for the neural model and some sentences were dropped as described in Section 4.1.
Table 4: Ablation studies to remove content word vocabularies and commas.
Only words in the modifier PU and the candidate PUs were used in these methods; other surrounding context and other dependencies were not considered. By capturing appropriate contexts of candidate PUs selected by the grammar rules and heuristics, our method successfully predicted the dependencies with relatively small pieces of information compared to the initial transition-based neural parser (Chen and Manning, 2014), which used a maximum of 18 words in the buffer, stack and modifiers.
There was a huge difference in the training speed. The logistic regression method took 4 to 20 hours on a CPU, but the neural net model converged in 5 to 15 seconds on a GPU.
We conducted an ablation study to see the contribution of the attributes. We focused on the vocabulary of content words, which can be better captured using distributed representations than with the conventional method, and on commas, which play an important role in suggesting long-distance attachments. According to the results in Table 4, the contribution of the content words (the vocabulary size was 11,362) was not very large; even when the content words were ignored, the loss of accuracy was only 0.36 points. On the other hand, the model ignoring commas (where all of the features regarding commas were removed) lowered the word-based (UD) accuracy by nearly 2 points, which suggests that commas are important in parsing.
The example dependency in Figure 3 ('president' ← 'introduced') was correctly solved by our neural model. Though many PUs followed the modifier PU in question, the model selected the head word from only three candidates restricted by the grammar rules, and the known dependency relationship between two PUs is guaranteed to be associated with the parsing result. The conventional model without the neural net wrongly selected the first candidate ('acquired') as the head. Ablating the content words also made the prediction wrong, which indicates that the conventional model failed because it did not capture the content words and the distance attributes dominated. On the other hand, the neural model with the content word embeddings appropriately captured the relationship between the functional word in the modifier PU (accusative case) and the content word of the correct candidate PU ('introduced').

Comparison on Japanese UD
To compare the performance of our parser with the results in the 2017 Shared Task (Zeman et al., 2017), we applied the trained model to the UD Japanese-GSD test data. As shown in Figure 4, the word-based dependencies in UD Japanese and the PU-based dependencies are interchangeable given a strict rule to detect PU boundaries and the head word in a PU. In the UD Japanese-GSD data, the attachment direction between head words in PUs is always the same after converting to the PU-based structure. Also, the non-head words in a PU always depend on the head word of the PU.
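The two properties stated above are enough to map PU-level heads back to word-level heads, as the following sketch shows. It is a simplified illustration, not the conversion rule actually used: the input layout and the name `pu_to_word_heads` are assumptions, word ids are 1-based, and head id 0 marks the root as in the CoNLL-U format.

```python
def pu_to_word_heads(pus, pu_heads):
    """Convert PU-level heads to word-level heads.

    pus: list of (word_ids, head_word_id) pairs, one per PU.
    pu_heads: pu_heads[i] is the index of the head PU of PU i,
              or None for the root PU.
    Returns {word_id: head_word_id}, with 0 for the sentence root.
    """
    word_heads = {}
    for i, (word_ids, head_word) in enumerate(pus):
        for w in word_ids:
            if w != head_word:
                # Non-head words attach to the head word of their own PU.
                word_heads[w] = head_word
        if pu_heads[i] is None:
            word_heads[head_word] = 0
        else:
            # The head word attaches to the head word of the head PU.
            word_heads[head_word] = pus[pu_heads[i]][1]
    return word_heads
```

For three PUs with words `[1, 2]`, `[3]` and `[4, 5]`, head words 1, 3 and 4, and both of the first two PUs modifying the last one, the result attaches words 1, 3 and 5 to word 4 and word 2 to word 1.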
The first section of Table 5 shows comparisons with the shared task results. (UD Japanese-GSD was known as UD Japanese until 2017.) Here we evaluate them by UAS (unlabeled attachment score) rather than by LAS (labeled attachment score): most of the Japanese labels can be deterministically assigned from the combination of the head and the dependent, and rule-based assignment can reproduce the labels in the UD Japanese-GSD corpus, so it would not be fair to compare LAS with other machine learning methods. The scores are shown together with tokenization accuracies because tokenization strongly affects Japanese parsing accuracy in the shared task, which takes raw text as input. Our model performed better than any other result in the shared task, though the comparison is not completely fair since we rely on segmentation and functional word attachment based on rules consistent with the UD data creation.
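For reference, UAS over identically tokenized input can be computed as in the sketch below. The function name `uas` is chosen for this example, and the sketch assumes that gold and predicted tokens are already aligned one-to-one; the alignment needed when the tokenization differs, as in the raw-text setting of the shared task, is not handled.

```python
def uas(gold_heads, pred_heads):
    """Percentage of words whose predicted head equals the gold head."""
    if len(gold_heads) != len(pred_heads):
        raise ValueError("gold and predicted tokens must be aligned")
    correct = sum(g == p for g, p in zip(gold_heads, pred_heads))
    return 100.0 * correct / len(gold_heads)
```

For example, a four-word sentence with one wrong head yields a UAS of 75.0.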
In the 2017 Shared Task, raw text was used as the input, so the performance of sentence splitting and tokenization strongly affected the parsing results. To make more direct comparisons in parsing, our results were mapped onto the baseline tokenization by UDPipe (Straka et al., 2016), which many task participants used. That is, the parsing score was intentionally downgraded, but it still outperformed all other results that used UDPipe tokenization as is, as shown in the second section of Table 5.
We also compared our parser, given gold tokens, with the UDPipe UDv2.0 model (Straka and Straková, 2017) and with RBG Parser (Lei et al., 2014) trained on the UD Japanese-GSD training set. Our model had 9% and 20% fewer errors than UDPipe and RBG Parser, respectively.

Application to other languages
Our approach relies on the head-final property of the language. In addition to Japanese, Korean and Tamil are categorized as rigid head-final languages (Polinsky, 2012). Table 6 shows the ratio of head-final dependencies by language in the Universal Dependencies version 2.0 development data. The word-level dependencies in UD do not directly reflect head-finality: only 45% of Japanese dependencies have the head on the right side. However, when we restrict the count to content words (the list of functional PoSs is shown in the caption of Table 6) and exclude functional labels and exceptional labels such as conjunction (see the caption again), Japanese has completely head-final structures, and Korean and Tamil have high ratios, supporting the linguistic theory.
However, the UD Korean corpus contains many coordination structures, for which UD's general constraint requires the left conjunct to be the head. Many subordinating structures are represented as coordination in Korean while the corresponding Japanese ones are not, so it is difficult to convert the corpus to a strictly head-final structure. For this reason we could not evaluate our method on Korean. Nevertheless, the Triplet/Quadruplet Model has been applied to Korean with similar grammatical rules, and transfer learning from Japanese has been shown to work (Kanayama et al., 2014), so our neural classifier approach is expected to work well for Korean parsing.
The UD Tamil data also has exceptional cases in proper nouns and other phenomena, and the relatively small size of the corpus made further investigation difficult. We leave it to future work.
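The head-final ratio discussed in this section can be computed along the following lines. This is a minimal sketch with assumed inputs: the name `head_final_ratio`, the `(dependent_id, head_id, label)` triples and the `skip_labels` parameter are hypothetical, and filtering by functional PoS, as done for Table 6, is not implemented.

```python
def head_final_ratio(arcs, skip_labels=frozenset()):
    """Share of dependencies whose head is to the right of the dependent.

    arcs: iterable of (dependent_id, head_id, label); head_id 0 is the root.
    skip_labels: dependency labels to exclude from the count.
    """
    kept = [(d, h) for d, h, label in arcs
            if h != 0 and label not in skip_labels]
    if not kept:
        return 0.0
    return sum(h > d for d, h in kept) / len(kept)
```

Excluding functional relations such as `case` raises the ratio whenever function words follow their content-word heads, which is the effect described above for Japanese.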

Related work
A parsing approach that selects the heads of dependents has been proposed by Zhang et al. (2017). They applied a bidirectional RNN to estimate the probability that each word chooses another word or the ROOT node as its head, and reported comparable results for four languages. Their method required a maximum spanning tree algorithm to generate valid trees. In contrast, our approach straightforwardly outputs a projective tree by exploiting the head-final feature of Japanese.
Martínez-Alonso et al. (2017) share a similar motivation with ours. They tackled multilingual parsing by using a small set of attachment rules determined with Universal POS, and achieved a UAS of 55 with predicted PoS as input. Our method applies a neural model on top of the grammatical restriction to achieve higher accuracy for a specific language.
García et al. (2017) tackled the multilingual shared task with a rule-based approach. The rules are simplified with almost delexicalized PoS-level constraints and were created with a small effort by an expert. Though the performance was limited compared to other supervised approaches, it is meaningful for the comparison of linguistic features, and the combination with machine learning methods, which we are aiming at, can be useful.

Conclusion
In this paper we implemented a neural net parsing model as a direct classifier that predicts the attachment of phrasal units in an intuitive manner by exploiting grammatical knowledge and heuristics. We confirmed that the neural net model outperformed the conventional machine learning method and that our method worked better than the shared task results. Unlike most neural parsing methods, in which interpretation of the model output and control of the model without data supervision are difficult, our method is simple enough for us to understand the behavior of the model, and grammatical knowledge can be reflected in the restriction of modification candidates. Moreover, the neural net with distributed vector representations enabled us to handle a larger vocabulary than logistic regression with distinct word features, in which only a limited number of content words and their combinations with other features could be distinguished in the parsing model.
Our experiments showed that attachments can be predicted from a limited number of words. For further improvement, we can integrate the LSTM model, which handles more contextual information with simplification (Cross and Huang, 2016). We did not handle coordination relationships explicitly in this work, but we intend to address coordination with more lexical knowledge and a broader context.