Predicting Japanese Word Order in Double Object Constructions

This paper presents a statistical model to predict Japanese word order in double object constructions. We employed a Bayesian linear mixed model with manually annotated predicate-argument structure data. The findings from the refined corpus analysis confirmed the effect of the information status of an NP, in the form of ‘given-new ordering’, in addition to the effect of ‘long-before-short’ as a general tendency of Japanese word order.


Introduction
Because Japanese exhibits a flexible word order, the factors that predict the word order of a given construction in Japanese have recently been investigated, particularly in the field of computational linguistics (Yamashita and Kondo, 2011; Orita, 2017). One of the major findings relevant to the current study is 'long-before-short', whereby a long noun phrase (NP) tends to be scrambled ahead of a short NP (Yamashita and Chang, 2001). This paper sheds light on those factors in double object constructions (DOC), where either (1) an indirect object (IOBJ) or (2) a direct object (DOBJ) can precede the other object:
(1) Taro
Since both word orders are available, studies in theoretical syntax have disputed which of them is the canonical word order, under the hypothesis that one word order (i.e., either IOBJ-DOBJ or DOBJ-IOBJ) is derived from the other in the context of derivational syntax (Hoji, 1985; Miyagawa, 1997; Matsuoka, 2003). In this paper, we do not attempt to adjudicate the dispute solely based on the frequency of the two word orders in a corpus. Instead, we aim to detect the principal factors that predict the word order in the DOC, which may eventually help resolve the issue in theoretical syntax. To that end, as a preliminary study, we employed a Bayesian linear mixed model with potential factors affecting the word order.
Other than the factor 'long-before-short' proposed in previous studies, the key factor in the current study is the information status of an NP in a given context, under the theoretical framework of information structure (Lambrecht, 1994; Vallduví and Engdahl, 1996). The framework provides key categories, such as (informationally) given/old, new, topic, and focus, to classify an NP according to how it functions in a particular context. We assume the information status to be one of the principal predictors for the following two reasons: (i) a discourse-given element tends to precede a discourse-new one in a sentence in Japanese (Kuno, 1978, 2004; Nakagawa, 2016), and (ii) focused or new elements in Japanese tend to appear in the position immediately preceding the predicate (Kuno, 1978; Kim, 1988; Ishihara, 2001; Vermeulen, 2012). These two claims regarding the general word order of Japanese are combined into the following hypothesis regarding the word orders in the DOC.
(3) Our hypothesis: In the DOC, a discourse-given object tends to appear on the left of the other object, and a discourse-new object tends to be on the right side.
Incorporating the information status of an NP together with the factor 'long-before-short' proposed in previous studies, we built a statistical model to predict the word orders in the DOC. One important advantage of our study is that, with the latest version of the corpus we used (see Section 3), the information status of an NP can be analyzed not simply as a binary split between pronouns (given) and other NPs (new), but by the number of coindexed items in the preceding text.

Preceding Work
Table 1 shows a comparison with the latest corpus studies on Japanese word ordering. Sasano and Okumura (2016) explored the canonical word order of Japanese double object constructions (either SUBJ-IOBJ-DOBJ-PRED or SUBJ-DOBJ-IOBJ-PRED) using a large-scale web corpus. The web corpus contains 10 billion sentences parsed by the Japanese morphological analyzer JUMAN and the syntactic analyzer KNP. In their analysis, the parse trees without syntactic ambiguity were extracted from the web corpus, and the word order was estimated by verb type with a linear regression and normalized pointwise mutual information. Their model did not include any inter-sentential factors such as coreference.
Orita (2017) built a statistical model to predict the scrambled (direct) object-subject word order. She used the NAIST Text Corpus, which has manual annotations of predicate-argument structure and coreference information. She explored the effects of syntactic priming, NP length, animacy, and a binary given-new information status (given was defined as having a lexically identical item in the preceding text). Her frequentist statistical analysis (simple logistic regression) did not detect a significant effect of the given-new factor on the order of a subject and an object.
As a preliminary study that features coreferential information as a potential factor, we used manual annotations of syntactic dependencies, predicate-argument structures and coreference information, and applied a Bayesian statistical analysis to a small but well-maintained data set.

Corpora: BCCWJ-PAS
We used the 'Balanced Corpus of Contemporary Written Japanese' (BCCWJ) (Maekawa et al., 2014), which includes morphological information and sentence boundaries, as the target corpus. The corpus was extended with annotations of predicate-argument structures as BCCWJ-PAS (BCCWJ Predicate Argument Structures), following a standard compatible with the NAIST Text Corpus (Iida et al., 2007). We revised all annotations of the BCCWJ-PAS data, including subjects (with case marker -ga), direct objects (with case marker -o), and indirect objects (with case marker -ni), as well as the coreferential information of NPs. After the revision process, the syntactic dependencies of BCCWJ-DepPara (Asahara and Matsumoto, 2016) were overlaid on the predicate-argument structures.
We extracted 4-tuples of subject (subj), direct object (dobj), indirect object (iobj) and predicate (pred) from the overlaid data. After excluding 4-tuples with zero pronouns, case alternation, or inter-clause dependencies from the target data, we obtained 584 samples. Figure 1 shows an example sentence from the BCCWJ Yahoo! Answer sample (OC09 04653). The surface string is segmented into base phrases, which serve as the unit for measuring the distance between two constituents in the following pairs from the 4-tuples: subj-pred ($dist^{subj}_{pred}$), dobj-pred ($dist^{dobj}_{pred}$), iobj-pred ($dist^{iobj}_{pred}$), subj-iobj ($dist^{subj}_{iobj}$), subj-dobj ($dist^{subj}_{dobj}$), and iobj-dobj ($dist^{iobj}_{dobj}$). The distance was calculated from the rightmost word in each pair. For example, in Figure 1, $dist^{subj}_{pred}$, the distance between the subject and the predicate, is 4.
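The distance computation described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each constituent is represented by the base-phrase index of its rightmost word, and that a distance is the index of the right-hand element minus that of the left-hand element, so that a negative value signals the opposite order. The function name and the example positions are hypothetical, chosen only so that the subject-predicate distance equals 4 as in Figure 1.

```python
# Pairs (left, right) for which a distance is computed, as listed in the text.
PAIRS = [
    ("subj", "pred"), ("dobj", "pred"), ("iobj", "pred"),
    ("subj", "iobj"), ("subj", "dobj"), ("iobj", "dobj"),
]


def signed_distances(positions):
    """Return signed base-phrase distances for one 4-tuple.

    positions maps each role to the index of the base phrase that contains
    the rightmost word of that constituent (an assumed representation).
    A negative value means the "right" element actually precedes the "left".
    """
    return {
        f"dist_{left}_{right}": positions[right] - positions[left]
        for left, right in PAIRS
    }


# Hypothetical base-phrase indices: subject, indirect object, direct object, predicate.
example = {"subj": 0, "iobj": 2, "dobj": 3, "pred": 4}
distances = signed_distances(example)

# Hypothetical scrambled order in which the direct object precedes the indirect object.
scrambled = signed_distances({"subj": 0, "dobj": 1, "iobj": 3, "pred": 4})
```

Under this representation, `scrambled["dist_iobj_dobj"]` is negative, which corresponds to the DOBJ-IOBJ order.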
To verify the effect of 'long-before-short' as a general Japanese word-order tendency, the lengths of the constituents were modeled as fixed effects in the statistical analysis. The lengths of the subject, direct object and indirect object were calculated as mora counts (based on the pronunciation available in BCCWJ), denoted $N^{subj}_{mora}$, $N^{dobj}_{mora}$, and $N^{iobj}_{mora}$, respectively. For example, in Figure 1, $N^{subj}_{mora}$ is the number of morae of the subject (sono kanojoga), which is 6. Note that an NP may contain more than one base phrase, including an embedded clause. We took the maximum span of the dependency subtree in BCCWJ-DepPara as the length of the NP.
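As an illustration of the mora count, the following sketch counts morae from a kana pronunciation string. It is an assumption-laden example rather than the procedure used for BCCWJ: it presumes the input contains only kana, counts every kana (including the moraic nasal, the geminate marker and the long-vowel mark) as one mora, and treats small glide or vowel kana as part of the preceding mora. The katakana rendering of sono kanojoga is our own transcription of the romanized example.

```python
import unicodedata

# Small kana that combine with the preceding kana and do not add a mora.
SMALL_KANA = set("ャュョァィゥェォヮゃゅょぁぃぅぇぉゎ")


def count_morae(kana):
    """Count morae in a kana-only pronunciation string (hypothetical helper)."""
    # Normalize so that voiced kana are single precomposed code points.
    normalized = unicodedata.normalize("NFC", kana)
    return sum(1 for ch in normalized if ch not in SMALL_KANA)


# "sono kanojoga" written in katakana: so-no-ka-no-jo-ga.
subject_morae = count_morae("ソノカノジョガ")
```

Here `subject_morae` is 6, matching the value given for the subject in Figure 1.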
In addition, the numbers of coreferent items in the preceding text were modeled as fixed effects. The numbers of coreferent items for the subject, direct object and indirect object were obtained from the BCCWJ-PAS annotations as $N^{subj}_{coref}$, $N^{dobj}_{coref}$, and $N^{iobj}_{coref}$, respectively. Table 2 shows the basic statistics of the distance, mora, and number of coreferent items.
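The count of preceding coreferent items can be sketched as below. The data structure is an assumption made for illustration only: a coreference chain is represented as a list of text offsets of its mentions, and the count for an argument is the number of mentions in its chain that occur before the argument itself. The actual BCCWJ-PAS annotation format is not reproduced here.

```python
def count_preceding_coreferents(chain_offsets, argument_offset):
    """Count mentions in a coreference chain that precede the argument.

    chain_offsets: offsets of all mentions in the chain (hypothetical format).
    argument_offset: offset of the argument NP under analysis.
    """
    return sum(1 for offset in chain_offsets if offset < argument_offset)


# Hypothetical chain with four mentions; the argument is the third one.
n_coref = count_preceding_coreferents([3, 17, 42, 58], 42)

# An NP mentioned for the first time has no preceding coreferent items.
n_coref_new = count_preceding_coreferents([42], 42)
```

In this toy example the argument has two preceding coreferent items, while a first mention has none, which is how a graded notion of givenness replaces a binary given/new label.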

Statistical Analysis
We used Bayesian linear mixed models (BLMM) (Sorensen et al., 2016) for the statistical analysis of the distances between arguments as well as between an argument and its predicate. We modeled the following formula:

$$dist^{left}_{right} \sim \mathrm{Normal}(\mu, \sigma)$$
$$\mu = \alpha + \beta^{subj}_{mora} N^{subj}_{mora} + \beta^{dobj}_{mora} N^{dobj}_{mora} + \beta^{iobj}_{mora} N^{iobj}_{mora} + \beta^{subj}_{coref} N^{subj}_{coref} + \beta^{dobj}_{coref} N^{dobj}_{coref} + \beta^{iobj}_{coref} N^{iobj}_{coref}$$

Here, $dist^{left}_{right}$ (e.g. $dist^{subj}_{iobj}$: the distance between the subject (left) and the indirect object (right)) stands for the distance between the left and right elements, which is modeled by a normal distribution with mean $\mu$ and standard deviation $\sigma$. $\mu$ is defined by a linear formula with an intercept $\alpha$ and two types of predictors of interest. $N^{subj}_{mora}$, $N^{dobj}_{mora}$, and $N^{iobj}_{mora}$ are the numbers of morae of the subject, the direct object, and the indirect object, respectively. The subject and objects can be composed of more than one phrase, and when they contain a clause, the number of morae includes the clause length. $N^{subj}_{coref}$, $N^{dobj}_{coref}$, and $N^{iobj}_{coref}$ stand for the numbers of preceding coreferent NPs of the subject, the direct object, and the indirect object, respectively. $\beta^{a}_{b}$ are the slope parameters for the predictors $N^{a}_{b}$. Note that the distance was measured in base phrase units, and a negative value indicates a distance in the opposite direction.
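The likelihood part of the model described above can be sketched in plain Python. This is only an illustration under stated assumptions: the dictionary keys and function names are hypothetical, the priors are not given in the text and are therefore omitted, and no group-level terms are included. It shows how the mean is built from the intercept and the six slopes, and how a normal log-density scores an observed distance.

```python
import math

# The six predictors named in the text: mora counts and coreferent counts.
PREDICTORS = [
    "subj_mora", "dobj_mora", "iobj_mora",
    "subj_coref", "dobj_coref", "iobj_coref",
]


def linear_predictor(alpha, beta, row):
    """Mean distance: intercept plus the sum of slope times predictor."""
    return alpha + sum(beta[name] * row[name] for name in PREDICTORS)


def normal_logpdf(y, mu, sigma):
    """Log-density of a normal distribution with mean mu and sd sigma."""
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * ((y - mu) / sigma) ** 2


def log_likelihood(alpha, beta, sigma, rows):
    """Sum of log-densities over all samples (priors deliberately omitted)."""
    return sum(
        normal_logpdf(row["dist"], linear_predictor(alpha, beta, row), sigma)
        for row in rows
    )


# Hypothetical parameter values and one hypothetical sample.
beta = {name: 0.0 for name in PREDICTORS}
beta["subj_mora"] = 0.5
sample = {"dist": 4.0, "subj_mora": 6, "dobj_mora": 3, "iobj_mora": 4,
          "subj_coref": 0, "dobj_coref": 1, "iobj_coref": 2}
mu = linear_predictor(1.0, beta, sample)
ll = log_likelihood(1.0, beta, 1.0, [sample])
```

With these made-up values the mean is 1.0 + 0.5 × 6 = 4.0, so the observed distance sits exactly at the mean. In a full Bayesian analysis this likelihood would be combined with priors and sampled with a tool such as Stan.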
We ran 4 chains × 2000 post-warmup iterations, and all models converged. Table 3 shows the parameters estimated by the BLMM; the values are means with standard deviations (in brackets). The findings are summarized as follows.
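The way a posterior summary of the form "mean (standard deviation)" is obtained from the chains can be sketched as follows. The draws here are simulated with an arbitrary seed purely for illustration and do not correspond to any value in Table 3; the function name is hypothetical.

```python
import random
import statistics


def summarize_posterior(chains):
    """Pool post-warmup draws from all chains and return (count, mean, sd)."""
    draws = [draw for chain in chains for draw in chain]
    return len(draws), statistics.mean(draws), statistics.stdev(draws)


# Simulated stand-in for 4 chains x 2000 post-warmup draws of one slope.
rng = random.Random(0)
toy_chains = [[rng.gauss(0.3, 0.1) for _ in range(2000)] for _ in range(4)]
n_draws, post_mean, post_sd = summarize_posterior(toy_chains)
print(f"{n_draws} draws: {post_mean:.2f} ({post_sd:.2f})")
```

Pooling 4 chains of 2000 iterations gives 8000 draws per parameter, and each reported estimate is the mean and standard deviation of those pooled draws.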

Results
First, the distance between a subject and its predicate ($dist^{subj}_{pred}$) is affected only by the number of morae of the subject, which indicates that a longer subject NP is farther from its predicate.
Second, the distance between a direct object and its predicate ($dist^{dobj}_{pred}$) is affected by the number of morae of the direct object, the number of its preceding coreferent items, and the number of morae of the indirect object. This indicates that i) a longer direct object is farther from its predicate, ii) a direct object with more coreferent items in the preceding text is farther from its predicate, and iii) a longer indirect object shortens the distance between the direct object and its predicate.
Third, the distance between an indirect object and its predicate ($dist^{iobj}_{pred}$) is affected by the number of morae of the indirect object, the number of its preceding coreferent items, the number of morae of the direct object, and the number of preceding coreferent items of the subject. This indicates that i) a longer indirect object is farther from its predicate, ii) an indirect object with more coreferent items in the preceding text is farther from its predicate, iii) a longer direct object shortens the distance between the indirect object and its predicate, and iv) a subject with more coreferent items shortens the distance between the indirect object and its predicate.
The distances between arguments ($dist^{subj}_{iobj}$, $dist^{subj}_{dobj}$, and $dist^{iobj}_{dobj}$) show nearly the same tendencies as the combination of the predicate-argument distances. However, the number of morae of an argument is correlated with the length of the argument (i.e., the number of base phrases), and thus the distance between the leftmost and rightmost arguments (e.g. the subject and the direct object) is affected by the number of morae of the middle argument (e.g. $N^{iobj}_{mora}$).

Discussions
The results revealed that the subject tends to precede the direct and indirect objects in double object constructions. Although the indirect object tends to precede the direct object, this tendency is not significant (p=0.09). The estimated coefficients for the number of coreferent items ($N^{dobj}_{coref}$ for $dist^{dobj}_{pred}$ and $N^{iobj}_{coref}$ for $dist^{iobj}_{pred}$) support our hypothesis in (3), namely 'given-new ordering' for the direct and indirect objects: an object with many preceding coreferent items tends to be farther from its predicate.
The estimated coefficients for the number of morae ($N^{subj}_{mora}$ for $dist^{subj}_{pred}$, $N^{dobj}_{mora}$ for $dist^{dobj}_{pred}$, and $N^{iobj}_{mora}$ for $dist^{iobj}_{pred}$) indicate that the order of all arguments in the DOC follows 'long-before-short'. This is also confirmed by the negative estimated coefficients for the number of morae of one object with respect to the distance between the other object and its predicate ($N^{dobj}_{mora}$ for $dist^{iobj}_{pred}$ and $N^{iobj}_{mora}$ for $dist^{dobj}_{pred}$), suggesting that a longer object tends to precede the other object in the DOC.

Conclusions
This article presents a Bayesian statistical analysis of Japanese word ordering in double object constructions. It revealed 'given-new ordering' for the indirect and direct objects and also confirmed the 'long-before-short' tendency for all of the arguments in the constructions.
Building on this preliminary study, our future work is to investigate the effects of verb type and the animacy of an NP. We are currently annotating the data with the labels of a Japanese thesaurus, 'Word List by Semantic Principles' (WLSP) (Kokuritsukokugokenkyusho, 1964), which will enable us to explore those effects.