Ordering of Adverbials of Time and Place in Grammars and in an Annotated English-Czech Parallel Corpus

The data from a parallel annotated English–Czech corpus serve for testing the general issue of the variability of the mutual position of LOC and TWHEN in Czech and English (Sect. 4.1) and for the analysis of the relation between information structure and the given order in the two languages (Sect. 4.2). The most relevant and innovative results in the investigation, namely the cases where the position of TWHEN and LOC differs in Czech and English in that the same modification is placed in Topic in the sentence in one language and in Focus in the corresponding sentence in the other are presented in Sect. 4.3.


Motivation and Research Question
In the early days of a massive entrance of corpus linguistics on the linguistic scene, C. J. Fillmore, in an attempt to characterize his own research position, compares two kinds of linguists: an armchair linguist and a corpus linguist. Fillmore (1992, 35) says: "Armchair linguist sits in his armchair, with his eyes closed and his hands clasped behind his back, once in a while, opens his eyes and shouts: Wow, what a neat fact", while "Corpus linguist: has all of the primary facts he needs in the form of a corpus of approximately one zillion running words and he sees his job as that of deriving secondary facts from his primary facts." And he concludes: "… the two kinds of linguists need each other. Or better, that the two kinds of linguists, wherever possible, should exist in the same body". As for himself, he claims to be "an armchair linguist who refuses to give up his old ways but who finds profit in being a consumer of some of the resources that corpus linguists have created".
In the era (and in the context) of treebanking, one can consider an armchair linguist to be a theoretically minded linguist and a corpus to be an annotated corpus in the form of treebanks, and it is in this sense that we have formulated our particular research question. The phenomenon under investigation is the relation of word order and information structure; the particular cases are temporal and local modifications of predicates, and the data come from a parallel English-Czech annotated corpus (treebank).
The task we have faced is complicated by two facts. First, information structure is a very complex phenomenon, and different approaches to its treatment have been proposed in the theoretical literature since the pioneering studies by Czech scholars in the first half of the last century, followed by such prominent linguists and semanticists as M.A.K. Halliday, B.H. Partee, M. Rooth, M. Krifka, E.F. Prince, K. Lambrecht, M. Steedman, E. Vallduví and E. Engdahl, to name just a few. Second, this phenomenon is hard to assess, so that annotation of information structure is very tricky (cf. Cook and Bildhauer, 2011) and therefore has to be carefully checked.
Though representative English grammars do not provide systematic and comprehensive information on the possible variability of word order in English (which is quite understandable given the predominance of grammatical factors determining the English SVO word order), it is somehow taken for granted, esp. in teaching English as a second language, that the unmarked order is SVOMPT, that is to say that with adverbials placed after the Object, Manner precedes Place and Place precedes Time. This more or less practical instruction is also reflected in Quirk et al. (1985, esp. parts 8.22-8.23): "Concerning adjuncts of the same grammatical class, subject to the stylistic and realizational factors already mentioned, will have their sequence determined by semantics and will normally appear in the order: process - space - time" (p. 650), giving examples such as He worked at home that day. or The plane arrived uneventfully at Honolulu by midnight. The authors continue: "Thus within the same class of adjuncts, those concerned with time are seen to be rather peripheral and this explains the ease with which they can be moved to I (= initial position, EH): By midnight, the plane arrived uneventfully at Honolulu." In the part on the relative positions of adjuncts (Chapter 8.87, pp. 565 ff.) the authors specify the order as respect - process - space - time - contingency, with two restrictions influenced by the information focus and the form of realization. Leech and Svartvik (1994) mention the issues relevant for our investigation only briefly in the part on the position of adverbials (pp. 226-231), saying (p. 226) that "the place of an adverbial depends partly on its structure (whether it is an adverb, a prepositional phrase or clause, etc.), partly on its meaning (whether it denotes time, place, manner, degree, etc.). End-focus and end-weight also play a part."
The abovementioned SVOMPT rule takes the following form here: "When more than one of the main classes of adverbials occur in end-position, the normal order is manner/means/instrument + place + time." The authors also take into account the influence of the form and the overall structure of the sentence, e.g. the fact that some adverbials which normally have an end-position can be in the front-position to avoid too many adverbials at the end of the sentence: The whole morning he was working on his speech in the office.
As for Czech, the relative freedom of surface word order makes it necessary to look for other than grammatical factors as determinants of the linear ordering of words in the sentence, the information structure being one of the main ones. In Vol. 3 of the representative Czech grammar Mluvnice češtiny (1987, p. 602) a "basic word order" is postulated, which is considered to be semantically based, reflecting the degrees of the so-called communicative dynamism (CD) as defined by the Czech anglicist Jan Firbas. 1 This basic word order may be influenced by the grammatical structure of the sentence, by its rhythmical structure and, marginally, by the size of the sentence elements in question. In the theory of information structure we subscribe to (the so-called topic-focus articulation, TFA, see e.g. Sgall et al., 1973; 1980; 1986) two orderings are postulated: one reflected in the surface shape of the sentence (surface word order) and the so-called underlying (deep) word order in the underlying (tectogrammatical) structure of the sentence. The underlying word order is semantically determined (and relevant) and reflects the TFA of the sentence; its counterpart on the surface is influenced, in addition to the TFA factors, by prosody, the overall structure of the sentence (e.g. the complexity of the structure), etc. One of the important notions introduced is the so-called systemic ordering (SO) as the order of verb modifications in the Focus part (see e.g. Sgall et al., 1980). The hypothesized order of main verb modifications is as follows: Actor - Temp - Cause - Regard - Aim - Manner - Accompaniment - Locative - Means - Addressee - Patient - Effect. The notion of SO in Focus is supposed to be universal, but the concrete order of modifications may differ from language to language and has already been tested for some of them, see e.g. for German Sgall et al. (1995), for English Preinhaelterová (1997), for Czech Rysová (2014).
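The hypothesized systemic ordering can be read as a ranked list against which the surface order of any two modifications in the Focus is compared. The following is only a minimal sketch of that reading; the names `SYSTEMIC_ORDERING`, `RANK` and `conforms_to_so` are illustrative assumptions and do not come from any annotation tool or from the cited studies.

```python
# Hypothesized systemic ordering (SO) of verb modifications in the Focus,
# copied from the list given in the text. The variable and function names
# are hypothetical, chosen for this illustration only.
SYSTEMIC_ORDERING = [
    "Actor", "Temp", "Cause", "Regard", "Aim", "Manner",
    "Accompaniment", "Locative", "Means", "Addressee", "Patient", "Effect",
]

# Rank of each modification type within the SO (lower rank = earlier).
RANK = {label: i for i, label in enumerate(SYSTEMIC_ORDERING)}


def conforms_to_so(first: str, second: str) -> bool:
    """Return True if `first` preceding `second` agrees with the SO."""
    return RANK[first] < RANK[second]


# Temp before Locative agrees with the SO ...
print(conforms_to_so("Temp", "Locative"))
# ... whereas Locative before Temp (the SVOMPT-like order) does not.
print(conforms_to_so("Locative", "Temp"))
```

On this reading, the SVOMPT order (Place before Time) and the SO (Temp before Locative) make opposite predictions for the pair under study, which is what the corpus queries are meant to test.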

Methodology and Data
Our research question concerns the position of temporal and local modifications of predicates in Czech and English and the relation of this position to the information structure. The data come from a parallel English-Czech annotated corpus PCEDT (Hajič et al., 2012), which is a mostly manually annotated parallel corpus of English and Czech texts with almost 50 thousand sentences for each part. The English part contains the Wall Street Journal section of the Penn Treebank (Marcus et al., 1993), along with the original phrase-structure analysis and a newly added dependency-based deep structure syntactic analysis (tectogrammatics). The Czech part consists of manual translations of the original texts, along with their surface and deep syntactic analyses, automatically parsed and manually checked. We have analyzed the corpus findings, compared the results with claims made by existing representative grammars and other relevant studies, and tried to draw attention to contextual and other factors that play a decisive role in the surface ordering of temporal (TWHEN) and locative (LOC) modifications. In doing so, we had in mind two limitations: the corpus data belong to the journalistic genre, in which the TFA is not as clear as in other genres, and the translated sentences may be inclined to follow the original order automatically.

Queries and Corpus Findings
We have carried out a series of queries in which we were concerned with the general issue of variability of the mutual position of LOC and TWHEN in Czech and English (Sect. 4.1) and with the relation between TFA and the given order in the given languages (Sect. 4.2). The most relevant and innovative results in our investigation, namely the cases where the position of TWHEN and LOC differs in Czech and English in that the same modification is placed in Topic in the sentence in one language and in Focus in the corresponding sentence in the other, are presented in Sect. 4.3.

Variability of the position of TWHEN and LOC
We have searched in the parallel corpus for cases with the Predicate as the root of the tree (excluding thus coordinated sentences) in which both TWHEN and LOC (occurring in the same tree) depend on the same Predicate. This search was carried out in the whole PCEDT, i.e. in the total of 39507 sentences with the Predicate as the root of the tree. The cases relevant for this step amount to 0.96% of the corpus. The results of our search are summarized in Table 1, where the E || Cz column refers to the number of cases in which the positions in Czech and English are the same.
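The search condition just described (Predicate as the root, TWHEN and LOC both depending on it) can be sketched on a toy tree representation. This is an illustration under stated assumptions, not the query actually run on the PCEDT: the `Node` class, its field names and the two toy trees are hypothetical, and only the functor labels are taken from the text.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Node:
    functor: str   # e.g. "PRED", "ACT", "TWHEN", "LOC" (toy labels)
    position: int  # surface word-order index of the node (assumed field)
    parent: int    # list index of the governing node, -1 for the root


def twhen_loc_order(tree):
    """Classify one tree; return None if it is not relevant for the query."""
    roots = [i for i, n in enumerate(tree) if n.parent == -1]
    if len(roots) != 1 or tree[roots[0]].functor != "PRED":
        return None  # the root is not a Predicate (e.g. a coordination)
    deps = [n for n in tree if n.parent == roots[0]]
    twhen = [n.position for n in deps if n.functor == "TWHEN"]
    loc = [n.position for n in deps if n.functor == "LOC"]
    if not twhen or not loc:
        return None  # both modifications must depend on the same Predicate
    # Compare the first surface occurrence of each modification.
    return "TWHEN<LOC" if min(twhen) < min(loc) else "LOC<TWHEN"


# Toy tree for "He worked at home that day." (positions count from 0).
tree_a = [
    Node("PRED", 1, -1),   # worked
    Node("ACT", 0, 0),     # He
    Node("LOC", 3, 0),     # (at) home
    Node("TWHEN", 5, 0),   # (that) day
]
# Toy tree whose root is a coordination node: excluded by the query.
tree_b = [Node("CONJ", 2, -1), Node("PRED", 1, 0), Node("PRED", 3, 0)]

results = [twhen_loc_order(t) for t in (tree_a, tree_b)]
counts = Counter(r for r in results if r is not None)
print(counts)
```

Counting the two labels over all relevant trees in each language would give the kind of figures reported in Table 1; the positions in Topic or Focus play no role at this step.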
It should be emphasized that the figures in Table 1 do not take into account the position of the modifications, be it in the Topic or in the Focus; they just reflect the mutual positions of these modifications in the sentences in which both of them occur. The figures indicate that both orders are possible both in English and in Czech, and that in English the orders are relatively balanced (190 to 191), while in Czech the more frequent order is that of TWHEN before LOC (278 times
In the next step, we have taken into account the assumed division of the sentence into Topic and Focus and looked for cases in which both TWHEN and LOC were in the Focus part. The reasons why we have concentrated on the Focus part of the sentence are twofold: first, and most importantly, we wanted to check whether and under which conditions the hypothesis of the above-mentioned SO in the Focus is valid, both for English and for Czech, and, second, in this way we could also check the aforementioned general English word order "rule" SVOMPT, which indicates the order of Time after Place in the post-verbal position; with certain simplifications the post-verbal position may be considered to function as the Focus of the sentence. We have tried first to search in that part of the PCEDT in which the sentences were also annotated for their Topic-Focus articulation (3857 sentences), but the number of cases in which both TWHEN and LOC occurred in the same sentence in the relevant positions both in English and in Czech was very low (34 instances). Therefore we have decided to approximate the division into Topic and Focus as the position before (Topic) and after (Focus) the Predicate 2 and to carry out the search in the whole of PCEDT (on sentences with Predicate as the root of the tree), separately for English and for Czech.
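The positional approximation of Topic and Focus can be made explicit in a few lines. The sketch below assumes that each relevant sentence is reduced to three surface word-order indices (for the Predicate, TWHEN and LOC); the function name `focus_order` and this reduction are illustrative assumptions, and the two test sentences are the examples quoted earlier from Quirk et al. (1985).

```python
from typing import Optional


def focus_order(pred_pos: int, twhen_pos: int, loc_pos: int) -> Optional[str]:
    """Order of TWHEN and LOC if both follow the Predicate, else None.

    Treating everything after the Predicate as Focus is only the
    approximation described in the text, not a genuine TFA annotation.
    """
    if twhen_pos < pred_pos or loc_pos < pred_pos:
        return None  # at least one modification is pre-verbal (Topic)
    return "TWHEN<LOC" if twhen_pos < loc_pos else "LOC<TWHEN"


# "He worked at home that day."
# worked = 1, (at) home = 3, (that) day = 5: both are post-verbal.
print(focus_order(pred_pos=1, twhen_pos=5, loc_pos=3))

# "By midnight, the plane arrived uneventfully at Honolulu."
# midnight = 1, arrived = 4, Honolulu = 7: TWHEN is pre-verbal.
print(focus_order(pred_pos=4, twhen_pos=1, loc_pos=7))
```

The approximation over-generates in one direction: a post-verbal modification that is contextually bound is still counted as Focus here, which is why a manual inspection of the hits remains necessary.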
The total number of sentences checked was 42717 for English and 39507 for Czech; the difference follows from the fact that there exist cases where one of the modifications is not realized by a separate sentence element. 3 The results are given in Table 2. The data obtained have made it possible to check the validity of the assumed so-called systemic ordering. The first attempt at such a verification for Czech was carried out by Rysová (2014) analyzing the data from the Prague Dependency Treebank 2.0 (PDT). 4 The relevant figures in her Tables 6.1 (p. 77) and 6.10 (p. 96) are summarized below in Table 3. Rysová's results demonstrate that the data of PDT 2.0 support the SO as TWHEN < LOC; she also gives an explanation of the cases that do not correspond to this hypothesized order. Her observation is supported by the PCEDT data (see Table 2), though not so convincingly, which may be explained by the fact that the PCEDT data are translations and as such may mimic the E. order to a considerable extent. For English, our "raw" data indicate a different situation: TWHEN < LOC = 129, which is less than LOC < TWHEN = 202. However, after a manual inspection resulting in filtering out cases where the given modification, though placed after the verb, has to be characterized as contextually bound, 5 i.e. belonging to the Topic part of the sentence, the figures attested were 103 for the TWHEN < LOC order and 130 for the LOC < TWHEN order, which means that the preference for LOC < TWHEN is not so striking.
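The effect of the manual filtering on the English figures can be checked with a short calculation. The counts below are the ones reported in the paragraph above; the dictionary and function names are illustrative only.

```python
# English counts for both modifications in post-verbal position,
# as reported in the text (before and after manual filtering).
raw = {"TWHEN<LOC": 129, "LOC<TWHEN": 202}
filtered = {"TWHEN<LOC": 103, "LOC<TWHEN": 130}


def share(counts, key):
    """Proportion of `key` among all counted cases."""
    return counts[key] / sum(counts.values())


for name, counts in (("raw", raw), ("filtered", filtered)):
    total = sum(counts.values())
    print(f"{name}: LOC<TWHEN = {share(counts, 'LOC<TWHEN'):.1%} of {total}")
```

The share of the LOC < TWHEN order drops from roughly 61% of 331 cases to roughly 56% of 233 cases, which is the sense in which the preference becomes less striking.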

4.2.2
Let us first examine the examples of the TWHEN < LOC order, i.e. the order hypothesized by SO but counter to the assumed SVOMPT order. In 3 cases a decisive role was played by the form of the LOC modification as a clause (1).
(1) Researchers began using the drug in February.TWHEN on patients.LOC who had received kidney, liver, heart and pancreas transplants.
In the remaining 100 cases the LOC modification can be supposed to exemplify the order as predicted by SO. In most of them, both modifications are short (or of a comparable length) so that the "weight" criterion cannot be applied, see (2).
(2) A volcano will erupt next month.TWHEN on the fabled Strip.LOC: a 60-foot mountain spewing smoke and flame every five minutes.
With some examples, the TWHEN modification is closely related to the extralinguistic context (e.g. today) so that it can be understood as contextually bound and belonging to the Topic (3), though a different interpretation is also possible because in the preceding co-text District Court in Philadelphia is mentioned.
(3) The trial begins today.TWHEN in federal court.LOC in Philadelphia.LOC.

4.2.3
As for the LOC < TWHEN order, i.e. the order counter to the SO but in concord with the assumed SVOMPT order, we have again put aside examples in which TWHEN was expressed by a clause, which certainly had an impact on its end-position. This group was much larger than in the previous case, namely there were 48 examples in which the TWHEN modification was expressed by a clause, see (4):
(4) Judy and I were in our back yard.LOC when the lawn started rolling like ocean waves.TWHEN
The rest of the examples (82 sentences) mostly include the two modifications expressed by noun groups of a similar length (5), with an exception of some cases where the weight was a decisive factor (6). 6
(5) Mr. Guber got his start in the movie business at Columbia.LOC two decades.TWHEN ago.
(6) WASHINGTON lies low.LOC after the stock market's roller-coaster ride.TWHEN.
5 In the TFA theory, on which the TFA annotation is based (see e.g. Sgall et al., 1986), contextual boundness is a primary notion interpreted as follows: A contextually bound node represents an item presented by the speaker as referring to an entity assumed to be easily accessible by the hearer(s), i.e. more or less predictable, readily available to the hearers in their memory. Each element of the underlying dependency tree of a given sentence is assigned one of the values of the TFA attribute, namely cb (contextually bound non-contrastive), c (contextually bound contrastive) or nb (contextually non-bound).
6 As remarked by one of the reviewers, "lie low" may be understood rather as an idiomatic expression.

4.2.4
We have also made a random inspection of particular cases where the parallel English and Czech sentences differed in the ordering of the two modifications. Interestingly enough, there are cases for which we have not found any reason why this was so, except for the "different ordering principles" (7). However, bearing in mind that our parallel corpus was composed of translations from English to Czech, it is no surprise that the "principle ordering" in the target language was sometimes not obeyed and the Czech translation copied the order in E., see (8).
To sum up, while the SO for Cz. has been supported by both the PDT and the PCEDT data, the data for E. provide a slight support for the SVOMPT order.

Differences between Czech and English in the placement of TWHEN or LOC in the Topic and in the Focus
Most interesting for our study are the cases where the two languages studied differ in that the modification TWHEN or LOC is placed in the Topic in one language and in the Focus part of the same sentence in the other. In order to get a richer sample of examples, we have searched in the whole of PCEDT and we have again approximated the division into Topic and Focus by the position of these modifications before (Topic) and after (Focus) the main verb (PRED). We have at our disposal the samples in Table 4.
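The selection of such cross-language mismatches can be sketched as a comparison of the pre- or post-verbal side of the same modification in the two aligned sentences. The function names, the tuple layout and the toy positions below are hypothetical; they only illustrate the positional approximation and say nothing about how the corpus was actually queried.

```python
from typing import Optional, Tuple


def side(mod_pos: int, pred_pos: int) -> str:
    """'pre' (approximated Topic) or 'post' (approximated Focus)."""
    return "pre" if mod_pos < pred_pos else "post"


def mismatch(en: Tuple[int, int], cz: Tuple[int, int]) -> Optional[str]:
    """Compare one modification in an aligned English-Czech sentence pair.

    Each argument is (modification position, Predicate position) in
    surface word order. Returns a label such as 'E:post/Cz:pre' when the
    two languages differ, and None when they agree.
    """
    e, c = side(*en), side(*cz)
    return None if e == c else f"E:{e}/Cz:{c}"


# Toy positions: the modification follows the Predicate in English
# but precedes it in the Czech counterpart.
print(mismatch(en=(6, 4), cz=(3, 12)))
# Toy positions: the modification is post-verbal in both languages.
print(mismatch(en=(6, 4), cz=(7, 2)))
```

Pairs labelled `E:post/Cz:pre` and `E:pre/Cz:post` correspond to the two sets analyzed for each modification; whether a given hit is a genuine Topic/Focus difference still has to be decided by looking at the context.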

The position of TWHEN
We have randomly chosen a sample of 100 E. sentences and their Cz. counterparts from each of the sets (out of 233 and 765 examples, respectively) and analyzed them, also with regard to the previous context. The following observations seem to hold: 7

A. TWHEN after the Predicate in English and before the Predicate in Czech
(i) Typically, TWHEN is expressed in E. by a short adverb (-ly adverb, yesterday, …) and is placed next to the Predicate. In such a case, this post-verbal element may be considered to be a part of Topic also in E.
(ii) TWHEN is expressed in E. by a short adverb and placed at the end of the sentence, but (presumably) this adverb does not carry the intonation centre; these examples, if analyzed properly with regard to Topic and Focus rather than with regard to its pre- or post-verbal position, would not represent instances of differences we are looking for (10).
(iii) In E., the position of TWHEN at the end of the sentence (i.e. in the prototypical position of Focus) is due to the weight of the element, being a prepositional phrase or a whole dependent clause (11). For some of these cases, as (14), the initial position of TWHEN in Cz. may be interpreted as a contrastive Topic: it is still (a part of) Topic, the sentence being "about" it, but the contrastive character of this element makes it comparable with Focus (which, as a choice of alternatives, always has a contrastive character).

B. TWHEN before the Predicate in English and after the Predicate in Czech
(i) A tendency observed by Czech grammars was attested in our data, namely to place the Predicate in the second position of the Cz. sentence, which has led to the placement of the TWHEN modification after the verb even in cases in which it was an indisputable element of the Topic of the sentence (15).
(iii) However, even in this group, quite clear examples are found testifying to the difference in Topic and Focus in E. and in Cz.; in some cases, the initial position should be understood as a contrastive Topic (17), see Quirk et al. (1985), where fronting is mentioned as a regular means for emphasizing a contrastive Topic.

The position of LOC
We have again randomly chosen 100 sentences from the set of LOC after PRED in E. and we have analyzed all the sentences in the set of LOC before PRED in E. (i.e. the total of 67 sentences) taking into consideration also the previous context. The following observations seem to hold:

A. LOC after the Predicate in English and before the Predicate in Czech
(i) As has been mentioned above in our discussion on the sentences with TWHEN, the position of a modification close to the Predicate may be considered as a part of Topic or alternatively as a part of Focus, as the example below demonstrates:
(iii) A modification is placed at the end of the sentence in E. because of its weight, which does not necessarily mean that this modification is in Focus (20):
(20) E.: The topic never comes up in ozone depletion "establishment" meetings, of which I have attended many.
Cz.: Toto téma se na "schvalovacích" schůzích o ozónové díře, kterých jsem navštívil hodně, nikdy neujme.
(iv) The placement of the modification is given by grammatical restrictions of word order in E., namely that subject should precede the verb (21); there belong also examples with there-construction (22):

Summary
Our main concern has been the relation of word order and information structure in English and in Czech, in particular the mutual order of temporal and local modifications of predicates. We have put under scrutiny the data from the annotated parallel English-Czech treebank (PCEDT) and tested the variability of the order of the given types of modifications in general and two hypotheses on their preferential order in particular, namely the SVOMPT hypothesis for English and the so-called systemic ordering hypothesis for both languages. Our probe has demonstrated that corpus data offer much richer material to work with than an "armchair" linguist has ever had at her/his disposal, but also that a careful manual check is necessary to obtain a reliable source for a detailed linguistic analysis that may eventually lead to some well-founded theoretical conclusions.