What Determines the Order of Verbal Dependents in Hindi? Effects of Efficiency in Comprehension and Production

Word order flexibility is one of the distinctive features of SOV languages. In this work, we investigate whether the order and relative distance of preverbal dependents in Hindi, an SOV language, are affected by factors motivated by efficiency considerations during comprehension and production. We investigate the influence of Head–Dependent Mutual Information (HDMI), similarity-based interference, accessibility and case-marking. Results show that preverbal dependents remain close to the verbal head when the HDMI between the verb and its dependent is high. This demonstrates the influence of locality constraints on dependency distance and word order in an SOV language. Additionally, dependency distances were found to be longer when the dependent was animate, when it was case-marked, and when it was semantically similar to other preverbal dependents. Together, the results highlight the crosslinguistic generalizability of these factors and provide evidence for a functionally motivated account of word order in SOV languages such as Hindi.


Introduction
Natural languages are known to be influenced by a pressure for efficiency, that is, languages should enable speakers to communicate as well as possible, subject to constraints on the complexity of production and comprehension (Zipf, 1949; Jaeger and Tily, 2011; Hawkins, 2014). One area where efficiency pressures appear to have a large effect is word order: for example, across languages, the distribution of word orders can be well predicted using the principle of dependency length minimization, the idea that words in syntactic dependencies are under a pressure to be close to each other (Liu et al., 2017; Temperley and Gildea, 2018). Dependency length minimization is motivated by efficiency because it results in lower working memory requirements for language production and comprehension.

*Equal contribution by RF and SH.
Here, we take up the question of whether word orders can be predicted using efficiency in an area that goes beyond dependency length minimization. In particular, we examine how efficiency pressures may influence the order of pre-verbal dependencies in a verb-final language, Hindi. We formalize a number of measures of the complexity of comprehension and production, drawing from the psycholinguistic literature, and we test what effect these measures have on word order as observed in a large dependency treebank.
The paper is arranged as follows: in Section 2, we describe the various psycholinguistic factors used to investigate preverbal ordering and review related work. Section 3 describes the data and methods used to undertake the various analyses. In Section 4 we present the results. Section 5 discusses the implications of our results and concludes.

Psycholinguistic Factors influencing Word Order
We consider four psycholinguistically-motivated factors as predictors of the order of verbal dependents in Hindi.

Head-dependent mutual information (HDMI)
While the theory of dependency length minimization holds that all words in dependencies should be close to each other, there is no theoretical consensus on which words in dependencies should be especially close to each other. In contrast, the more general theory of information locality holds that any two words w1 and w2 should be close to each other in proportion to their pointwise mutual information (pmi):

pmi(w1; w2) = log [ p(w1, w2) / (p(w1) p(w2)) ].

In support of this idea, Futrell (2019) found that dependencies in which the head and dependent have high pmi are under a stronger pressure to be close than dependencies with less head–dependent mutual information.
The idea of information locality is motivated by efficiency in language comprehension, based on information-theoretic models of incremental sentence processing. For example, the model of Futrell et al. (2020b) holds that the difficulty of processing a word in context is given by the surprisal (negative log probability) of the word given a lossy memory representation of the context. Information locality can be derived as a consequence of this model.
Following previous work, we operationalize head–dependent mutual information as the pointwise mutual information between part-of-speech tags. Our main reason for using the pmi between part-of-speech tags, rather than between wordforms, is to avoid data sparsity issues in the estimation of mutual information (Paninski, 2003). However, this choice also changes the interpretation of the HDMI measure: instead of reflecting the full predictive information contained in one word about another word, the measure now reflects only something like the syntactic predictive information contained in one word about another.
For representing words for this measure, we use an augmented part-of-speech tagset described in Section 3.2.1. This tagset captures not only part-of-speech information, but also verb argument structure and nominal case marking.
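As a concrete illustration of the HDMI estimation, the following sketch computes pmi between head and dependent tags from relative-frequency counts. The augmented tag names (`VM_cls3`, `NN_ne`, etc.) are hypothetical stand-ins for illustration, not the treebank's actual inventory.

```python
import math
from collections import Counter

def pmi_table(dependencies):
    """Estimate the pointwise mutual information between head and
    dependent tags from a list of (head_tag, dep_tag) pairs:
    pmi(h; d) = log p(h, d) - log p(h) - log p(d),
    with probabilities estimated by relative frequency."""
    joint = Counter(dependencies)
    heads = Counter(h for h, _ in dependencies)
    deps = Counter(d for _, d in dependencies)
    n = len(dependencies)
    return {(h, d): math.log(c / n)
                    - math.log(heads[h] / n)
                    - math.log(deps[d] / n)
            for (h, d), c in joint.items()}

# Toy head-dependent tag pairs; the augmented tag names are
# hypothetical, for illustration only.
pairs = [("VM_cls3", "NN_ne"), ("VM_cls3", "NN_ne"),
         ("VM_cls3", "NN_ko"), ("VM_cls7", "NN_ko")]
hdmi = pmi_table(pairs)
```

With real corpus counts the same table would be computed over the full set of augmented tags; note that with sparse counts the estimates are biased upward, the concern raised above.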
Accessibility A persistent generalization in the typological literature is that there is a bias for words which are more accessible to go earlier in a sentence. Words are more accessible when they are animate, definite, imageable, and/or salient in discourse (Ariel, 1990; Jaeger and Tily, 2011). The preference for more accessible words to go earlier in sentences appears to be motivated by ease of language production (Ferreira and Dell, 2000; Kurumada and Jaeger, 2015), under the theory that language producers will tend to produce the most accessible words as quickly as possible. Evidence for production ease comes from, amongst other findings, the observation that lemma selection during grammatical encoding is influenced by the accessibility of the lemma (e.g., Bock and Warren, 1985; Bock et al., 1992; Prat-Sala and Branigan, 2000; Branigan et al., 2008).
Here we operationalize accessibility as the animacy of a referent. We leave investigations of other factors affecting accessibility (definiteness, imageability, etc.) to future work.
Semantic similarity A great deal of work in psycholinguistics has focused on the effect of similarity-based interference on processing: the idea that difficulty results when a comprehender must retrieve a target item from working memory, but a distractor item in working memory interferes with the retrieval of the target (Jäger et al., 2017). From a production perspective, the presence of similar phrases in an utterance can lead to competition between the phrases and thereby to difficulty in planning; successful articulation then requires inhibiting one of the phrases, delaying its production. The magnitude of this interference increases as the target and the distractor become more similar (Gordon et al., 2006; Gennari et al., 2012). For comprehension, this effect has been modeled within frameworks based on cue-based retrieval, such as ACT-R (Lewis and Vasishth, 2005).
In terms of word order, it has been shown that when syntactically/semantically similar nominals appear adjacent (or close) to each other, processing suffers (e.g., Lewis and Nakayama, 2002; Vasishth, 2003; Gordon et al., 2006; Apurva and Husain, 2020). Relatedly, there has been some work on the influence of semantic interference during production (e.g., Ferreira and Firato, 2002; Humphreys et al., 2016; Gennari et al., 2012; MacDonald, 2013). For example, Gennari et al. (2012) found that an animate head noun leads to a higher chance of producing a passive relative clause construction in English, compared to an active relative clause, thus keeping the animate head noun distant from the animate passive subject. On this account, a preverbal dependent that is similar to other phrases should appear earlier in the sentence, because increased dependency distance between similar nouns should ease production.
Therefore, the prediction based on psycholinguistics is that a word should be pushed toward the beginning of the sentence when it is semantically similar to other words. This creates distance between the two similar words and facilitates language production and comprehension.
Morphological marking A feature of a majority of head-final languages is the presence of case marking on nouns, which signifies the syntactic relation between a noun and its verbal head. Crosslinguistically, the presence of case-markers has been shown to increase the dependency distance between a nominal and its verbal head (e.g., Yadav et al., 2020), perhaps because nominal case-markers help in making robust predictions about upcoming verbs, as shown in prior work. Therefore, we predict that nominals with case marking should be farther from the verb than those without.

Related Work
There have been some previous corpus-based investigations of word order variation in Hindi (e.g., Husain et al., 2013; Ranjan et al., 2019; see also Vasishth, 2004). For example, Ranjan et al. (2019) investigated the role of case-marking on word order choices in Hindi and found evidence for the Easy First and Reduce Interference principles of the Production Distribution and Comprehension (PDC) hypothesis (MacDonald, 2013). Production ease was operationalized in Ranjan et al. (2019) as low n-gram/dependency surprisal (Hale, 2001), and interference was operationalized as case-marker similarity between preverbal nominals. Jain et al. (2018) investigated the Uniform Information Density (UID) hypothesis (Jaeger, 2010) with regard to predicting Hindi word order in corpus sentences vs. random sentences and found no support for UID in capturing such a distinction. They also did not find UID to predict non-canonical word order in Hindi.
With regard to the ordering of co-siblings in dependency trees, Dyer (2018) has argued that the relative predictability of the head given the dependent (which can be operationalized as the entropy of heads given dependents) determines which sibling is closer to the head. In particular, he shows that co-siblings that induce lower entropy tend to be closer to the head than co-siblings that induce higher entropy. The current work complements the above investigations by exploring the role of some novel factors, namely HDMI, semantic interference, and animacy; the effect of these factors on preverbal ordering and dependency distance in SOV languages is largely unknown.

Data and Tools
We use the monolingual Hindi corpus developed by Kunchukuttan et al. (2017). This corpus includes raw Hindi sentences collected from various sources (HindMonoCorp (Bojar et al., 2014), BBC, Wikipedia, etc.). We restricted our analysis to the first 5 million sentences of this data, resulting in a dataset of 14 million verbal dependencies. Since we needed to extract noun-verb dependency relations, we parsed this data using the ISC dependency parser for Hindi. The parser is trained on the Hyderabad Dependency Treebank (Bhatt et al., 2009), which is based on the Computational Paninian Grammar (CPG, henceforth) (Bharati et al., 1995).

Model
In this section, we provide details regarding the computation of the factors discussed in Section 2.

HDMI
As mentioned in Section 2, we calculate the pointwise mutual information between words using augmented part-of-speech tags. Here, we describe the augmentation in detail. A verb is identified with its verb class. Verb classes are defined based on the different argument structures a verb can have: a class is characterized by the set of core argument relations the verb takes (Subject, Noun complement of a copula, Direct object, Indirect object). Two verbs belong to the same class if their sets of core argument relations are identical. Since there are four core relations, this yields 2^4 = 16 possible classes; the 16-way classification is based on Sharma et al. (2019) and constitutes an exhaustive set of the possible argument structures a verb can have, considering only core arguments.
A verb POS tag (VM) is augmented with the verb class it belongs to. Nominal POS tags (NN, common noun; NNP, proper noun) are augmented with case-marker information (if any).
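A minimal sketch of this augmentation scheme follows. The relation names, case-marker strings, and the bit-vector encoding of the 16 classes are our own illustrative simplifications, not necessarily the exact scheme of Sharma et al. (2019).

```python
# The relation names and 16-class encoding below are illustrative
# simplifications, not the exact scheme of Sharma et al. (2019).
CORE_RELATIONS = ("subject", "noun_complement",
                  "direct_object", "indirect_object")

def verb_class(arg_relations):
    """Map a verb's set of core argument relations to one of the
    2**4 = 16 classes (one bit per core relation)."""
    return sum(1 << i for i, rel in enumerate(CORE_RELATIONS)
               if rel in arg_relations)

def augment_tag(pos, arg_relations=(), case_marker=None):
    """Augment a POS tag with verb-class or case-marker information."""
    if pos == "VM":                             # verb: attach class id
        return f"VM_{verb_class(arg_relations)}"
    if pos in ("NN", "NNP") and case_marker:    # noun: attach case marker
        return f"{pos}_{case_marker}"
    return pos
```

Two verbs with the same set of core relations receive the same augmented tag, which is what the HDMI estimation over tags requires.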

Semantic Similarity
Semantic similarity of a noun in its context is modeled as the maximum cosine similarity (Salton, 1972) of the noun with the other preverbal dependents of the corresponding verb. The cosine similarity is calculated from word vectors taken from a pre-trained model for Hindi (Grave et al., 2018). Semantic similarity is

sim(d) = max { cos(wv(d), wv(d')) : (h, d') ∈ Dep, d' ≠ d, id(d') < id(h) }

where wv(x) denotes the word vector of a word x, id(x) denotes the index of x in the sentence, and (h, d) ∈ Dep. Note that Dep is the set of all dependencies in the sentence. We choose to use word vectors to generate similarity scores because collecting human judgments for all words in our large dataset is impractical. There is mixed evidence about whether semantic similarity as defined using word vectors truly captures psycholinguistically relevant aspects of similarity. Despite arguments that word vectors do not capture certain properties of human similarity judgments (Griffiths et al., 2007; De Deyne et al., 2016) and certain semantic interference effects (Merlo and Ackermann, 2018), they have been used successfully in psycholinguistic models of interference (Smith and Vasishth, 2020).
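The maximum-cosine-similarity measure can be sketched as follows, using plain Python lists of floats in place of the pre-trained fastText vectors of Grave et al. (2018); the toy 2-d vectors are arbitrary illustrations.

```python
import math

def cos_sim(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def max_similarity(noun_vec, other_dep_vecs):
    """Maximum cosine similarity of a noun's vector with the vectors
    of the verb's other preverbal dependents."""
    return max(cos_sim(noun_vec, v) for v in other_dep_vecs)

# Toy 2-d vectors standing in for pre-trained Hindi embeddings
sim = max_similarity([1.0, 0.0], [[0.0, 1.0], [1.0, 1.0]])
```

Taking the maximum (rather than, say, the mean) reflects the interference logic above: one highly similar distractor is enough to cause competition.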

Accessibility
In the current study, the accessibility of a noun is determined based on the notion of humanness. Nouns that are +Human are assumed to have conceptual prominence, while nouns with a -Human feature are assumed to have low conceptual prominence; thus, this is a categorical variable in our study. Animacy information was obtained from hand annotations in a version of the Hyderabad Dependency Treebank augmented with nominal semantic features (Jena et al., 2013). In this text, animate will be used interchangeably with +Human, unless otherwise specified.

Case Marking
The information regarding nominal case-marking is extracted from the parsed data using the DEPREL tag. A noun is case-marked if it has a dependent with DEPREL equal to lwg psp in CPG. Thus, we mark a noun-verb dependency as case-marked if the corresponding noun is case-marked; otherwise it is deemed unmarked.
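A sketch of this extraction step, assuming a simplified parse representation of (index, head index, DEPREL) triples; this representation, and the toy parse, are assumptions for illustration, not the parser's actual output format.

```python
def is_case_marked(noun_index, parse):
    """A noun counts as case-marked if some token attaches to it with
    DEPREL 'lwg psp' (the postposition group in the CPG scheme).
    `parse` is a list of (index, head_index, deprel) triples, a
    simplified representation assumed for illustration."""
    return any(head == noun_index and rel == "lwg psp"
               for _, head, rel in parse)

# Toy parse: token 2 is a postposition attached to noun 1
parse = [(1, 3, "k1"), (2, 1, "lwg psp"), (3, 0, "main")]
```

Here noun 1 is case-marked (token 2 attaches to it as lwg psp), while the verb at index 3 is not.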

Granularity of Analysis
In order to get a fine-grained view of the data, we will separately analyze the four different relation types: subjects, direct objects, indirect objects, and adjuncts. We label a relation as a subject if its DEPREL tag is either k1 or k1s, as a direct object if it is k2, as an indirect object if it is k4, and as a verbal adjunct otherwise.
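This labeling scheme can be sketched directly; the mapping follows the DEPREL conventions described above.

```python
def relation_type(deprel):
    """Collapse CPG dependency labels into the four analysis
    categories used in this study."""
    if deprel in ("k1", "k1s"):
        return "subject"
    if deprel == "k2":
        return "direct object"
    if deprel == "k4":
        return "indirect object"
    return "adjunct"       # any other verbal dependency label
```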
For all analyses, we consider only dependencies with length greater than 1. This is necessary because in Hindi there cannot be any noun-verb dependency at distance 1 when the noun is case-marked: a case-marker appears post-nominally and is adjacent to the noun. Since case marking is a factor in our analysis, any real effect could be confounded by this constraint of the grammar.

Results
In this section we discuss the length and ordering of preverbal nominal dependencies. We first discuss how dependency length is affected by the factors discussed above, and then how those factors affect word order.

Figure 1 shows the distribution of dependency distance (the number of words from the head to the dependent) for the various arguments in a log-log plot. We find that verbal adjuncts are more frequent than all the arguments at all distances. Among the arguments, subjects are the most frequent at almost all distances, followed by direct objects and then indirect objects. With increasing distance, the frequencies of all dependencies decrease, but the decay is strikingly slower for subjects and adjuncts than for the objects. Direct objects decrease steadily following a power law, while the others show a nonlinear trend until large distances (> 12).

Length of Preverbal Dependencies
We now investigate the effect of our psycholinguistically-motivated factors on dependency distance. Figure 2a shows a plot for HDMI at various dependency distances. The trend for preverbal subjects is quite clear: as dependency distance increases, HDMI decreases.
The figure unexpectedly shows that the HDMI for adjuncts and indirect objects tends to increase starting around dependency distance 7. One possible reason for this increase is estimation error in the pmi values: given small amounts of data from a relatively large number of classes, pmi is likely to be overestimated, and the frequency of these dependents is low at longer dependency distances. Therefore, the relation between HDMI and dependency distance for these dependents seems to be more stable at shorter distances. Another explanation for the apparent HDMI increase at large dependency distances could have to do with certain discourse-related dependencies between verbs and adjuncts, where the adjunct appears preferentially at the beginning of a clause or sentence (Butt and King, 1996). Such discourse dependencies might have high HDMI while also having large dependency distance due to the constraint of appearing early.

Figure 2b shows mean semantic similarity at various dependency distances. We find that for all the arguments, an increase in semantic similarity goes with an increase in dependency distance.

Figure 2c shows the proportion of animate arguments at various dependency distances. The clearest trend is for subjects: the proportion of +Human subjects tends to increase with dependency distance. This positive trend is also seen for the other dependents, but only at short distances.
Finally, Figure 2d shows the proportion of case-marked arguments at various dependency distances. Here again, the expected positive trend, i.e., an increase in case-marked arguments with increasing distance, is seen only for subjects. For direct objects and adjuncts the trend is reversed; in fact, the negative trend of fewer case-marked nominals at longer dependency distances is quite strong for direct objects. Note that indirect objects are always case-marked.
In order to quantify the effects of our predictors, we fit a linear regression predicting dependency distance as a function of our predictors:

distance = β0 + β1 · HDMI + β2 · similarity + β3 · animacy + β4 · case + ε.

The above model was fit separately for all four relations (subject, direct object, indirect object, and adjunct). The variables HDMI and similarity are continuous; the variables animacy and case are categorical. Based on the discussion in Section 2, we expect β1 to be negative, i.e., dependency distance is expected to decrease as HDMI increases. The coefficients β2, β3 and β4 are expected to be positive: dependency distance is expected to increase when there is greater semantic similarity between nouns, when the argument is +Human, and when it is case-marked. The values of the fitted coefficients β are shown in Table 1. All coefficients are significant at p < .05. For subjects, adjuncts and indirect objects, we find that all the coefficients are in the expected direction. On the other hand, mirroring the unexpected effects visible in Figure 2, the coefficients of HDMI and case-marking for direct objects are not in the expected direction.
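As an illustration of this regression setup, the following sketch fits the model by ordinary least squares on synthetic data generated with the theoretically expected coefficient signs. It is not the paper's actual fit; the coefficient values and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-ins for the four predictors
hdmi = rng.normal(size=n)
similarity = rng.normal(size=n)
animacy = rng.integers(0, 2, size=n).astype(float)  # -Human / +Human
case = rng.integers(0, 2, size=n).astype(float)     # unmarked / marked
# Distances generated with the theoretically expected signs:
# negative for HDMI, positive for similarity, animacy, and case
distance = (5.0 - 1.0 * hdmi + 0.8 * similarity + 0.5 * animacy
            + 0.7 * case + rng.normal(scale=0.1, size=n))

# Ordinary least squares: distance ~ HDMI + similarity + animacy + case
X = np.column_stack([np.ones(n), hdmi, similarity, animacy, case])
beta, *_ = np.linalg.lstsq(X, distance, rcond=None)
```

In the actual analysis the same design matrix would be built from the corpus measures, fit once per relation type.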

Order of Preverbal Dependencies
We now consider how different dependents of a verb are ordered relative to each other. We consider the order between three pairs of constituents: (i) arguments and adjuncts, (ii) subjects and objects, and (iii) direct objects and indirect objects. Note that objects comprise both direct and indirect objects, and arguments comprise both subjects and objects. All nominal dependencies attached to a common verb were collected and grouped into the above-mentioned pairs according to their dependency relations. The order between the constituents is then determined from the sign of the difference between their dependency distances. Henceforth, we denote the order between constituents X and Y as X-Y if X precedes Y, and Y-X otherwise.
We fit a logistic regression predicting the order of two dependents given the difference in the values of the factors for the two dependents (as in Morgan and Levy, 2016; Futrell et al., 2020a). The factors are the same as in the previous section. In particular, we find coefficients β to fit

P(order = X-Y) = logistic(β0 + β1 · ΔHDMI + β2 · ΔSimilarity + β3 · Δanimacy + β4 · Δcase)

where ΔHDMI indicates the HDMI of X minus the HDMI of Y, and so on. This model is evaluated separately for argument (X) and adjunct (Y), subject (X) and object (Y), and direct object (X) and indirect object (Y), along with random intercepts by sentence.
Based on the discussion in Section 2, we expect β1 to be negative, since a higher HDMI for X than for Y should imply that X is closer to the verb than Y, i.e., the order should be Y-X. The similarity coefficient β2 is expected to be positive based on the effect of interference on distance seen in Section 4.1, and β3 and β4 are also expected to be positive. In other words, a dependent is expected to be farther from the verb than another dependent when it is more +Human (and case-marked) than the other dependent.
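The delta-based logistic regression can be sketched on synthetic data as follows. For simplicity the sketch omits the by-sentence random intercepts and fits a plain logistic model by gradient ascent; the coefficient values are arbitrary stand-ins chosen with the expected signs.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
# Difference predictors for constituent pairs (X, Y):
# columns = dHDMI, dSimilarity, danimacy, dcase
deltas = rng.normal(size=(n, 4))
true_beta = np.array([-1.0, 1.0, 0.8, 0.6])        # expected signs
p_xy = 1.0 / (1.0 + np.exp(-(deltas @ true_beta)))
order_xy = (rng.random(n) < p_xy).astype(float)    # 1 iff order is X-Y

# Plain logistic regression fit by gradient ascent on the
# (concave) log-likelihood, starting from zero coefficients
beta = np.zeros(4)
for _ in range(2000):
    pred = 1.0 / (1.0 + np.exp(-(deltas @ beta)))
    beta += 0.1 * deltas.T @ (order_xy - pred) / n
```

The recovered coefficients match the generating signs: a negative ΔHDMI coefficient pulls the high-HDMI constituent toward the verb, while positive coefficients on the other deltas push it away.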
The values of the fitted regression coefficients β are shown in Table 2; significance is assessed at p < .05. All effects except ΔSimilarity are in the theoretically expected directions. Figure 3 shows the effect of each factor on word order; the trends match the signs of the regression coefficients.


Discussion
The results show that factors based on comprehension/production efficiency affect the dependency distance and ordering of preverbal dependents in Hindi. Previous work investigating dependency distance has demonstrated that SOV languages allow for longer dependency distances between the verb and its prior dependents (Futrell et al., 2020c; Yadav et al., 2020; Konieczny, 2000). The current work makes an important contribution by highlighting that, compared to preverbal adjuncts, the core arguments (subject, indirect object and direct object) tend to be closer to the verb in an SOV language like Hindi. Additionally, when the HDMI between a preverbal dependent and the verb is high, the two tend to be close to each other, and the HDMI difference between a pair of dependencies correctly predicts their order. This provides compelling evidence that, all else being equal, the preverbal dependents in an SOV language such as Hindi are under information-theoretic locality constraints. Indeed, recent behavioral studies (e.g., Apurva and Husain, 2020) suggest that clause-final verb prediction in Hindi suffers with increased distance between prior arguments and the upcoming verb.
The current findings also highlight certain conditions under which locality constraints can be overridden, leading to increased distance between preverbal arguments and the verb. These factors relate to well-established processing constraints such as animacy-driven accessibility and similarity-based interference. Language-specific characteristics also influence syntactic configurations: in the current study, the presence of nominal case-markers tends to increase the dependency distance between the verb and its prior dependents. This could reflect the predictive strength of case-markers vis-à-vis a verb type; under such a setting, the locality constraint can be violated, leading to longer dependency distances.
The results also reveal some inconsistencies. One is that the dependency distance predictions for HDMI and case-marking were not borne out for direct objects. It is possible that, compared to other arguments, the relationship between a direct object and the verb is distinct (Momma and Ferreira, 2019). If so, the direct-object pattern found in the current study should replicate in other SOV languages. The other inconsistency was that the word order prediction with respect to similarity-based interference was not borne out in the order-based analyses.
The current work does not probe the effect of discourse/information structure on word order. Butt and King (1996) have proposed a discourse-centric mapping between word position vis-à-vis the verb and discourse function: Topic maps to the sentence-initial position, Focus to the immediate preverbal position, Background Information to the postverbal position, and Completive Information to the non-immediate preverbal position. Such information structure roles can be difficult to ascertain automatically, but future work can attempt to investigate the role of these factors in the context of the current work.

Conclusion
Our work provides evidence in support of the information locality hypothesis, which states that heads and dependents are under pressure to be close to each other when they are strongly associated. It also supports other efficiency-driven factors, such as similarity-based interference, accessibility, and the presence of case-markers, which act as countervailing forces that lengthen the distance between a dependent and its verbal head.
Together these factors influence the dependency distance and word order patterns in an SOV language like Hindi.