Annotating formulaic sequences in spoken Slovenian: structure, function and relevance

This paper presents the identification of formulaic sequences in the reference corpus of spoken Slovenian and their annotation in terms of syntactic structure, pragmatic function and lexicographic relevance. The annotation campaign, specific in terms of setting, subjectivity and the multifunctionality of items under investigation, resulted in a preliminary lexicon of formulaic sequences in spoken Slovenian with immediate potential for future explorations in formulaic language research. This is especially relevant for the notable number of identified multi-word expressions with discourse-structuring and stance-marking functions, which have often been overlooked by traditional phraseology research.


Introduction
An extensive body of research on the formulaic nature of language over the last three decades (Wray, 2013) has exposed the large number of multi-word combinations that speakers seem to process as single vocabulary units (Sinclair, 1991; Wray). In addition to the most commonly studied groups of multi-word expressions, such as idioms (e.g. break a leg) and collocations (e.g. heavy rain), corpus-driven research (Biber, 2009; Conklin and Schmitt, 2012) has shown that formulaic status can also be attributed to frequently recurring sequences of words (variously termed formulaic sequences or lexical bundles), which are not necessarily structurally or semantically complete (e.g. this means that).
Although there is a general consensus on the need to systematically identify and formalize formulaic sequences, both for native and non-native speakers of a language (Simpson-Vlach and Ellis, 2010; Brooke et al., 2015), there has been less discussion on the optimal approach to their linguistic description and (sub)categorization. In addition, the few studies that do quantify formulaic sequences by syntactic, semantic or other properties rarely report on the methodological issues related to the categorization itself.
To provide insight into the nature of formulaic language in (spoken) Slovenian, and into the methodological aspects of its linguistic categorization in general, this paper presents the annotation of formulaic sequences in the reference corpus of spoken Slovenian in terms of syntactic structure, pragmatic function and lexicographic relevance. After a short presentation of the corpus (Section 2) and the formulaic sequence extraction (Section 3), we present the annotation workflow and the guidelines in Section 4. Because this annotation campaign is distinctive in several respects, a detailed analysis of inter-annotator disagreements is given in Section 5, followed by the presentation and discussion of the resulting list of annotated sequences in Section 6.

GOS corpus
GOS is the reference corpus of spoken Slovenian including approximately 120 hours (1 million tokens) of spontaneous speech in different everyday situations in public (radio and TV shows, school lessons and lectures) and non-public settings (meetings, consultations, services, private conversations).
The recordings, balanced for communication channels, situations and speaker demographics, have been manually transcribed in both pronunciation-based and standardized spelling (Verdonik et al., 2013). In this research, version 1.0 of the GOS corpus was used, freely available for download from the CLARIN.SI repository (Zwitter Vitez et al., 2013).

Identification of formulaic sequences

N-gram extraction
To generate the list of formulaic sequences in the GOS corpus, the LIST extraction tool (Krsnik et al., 2019) was used to extract all n-grams of 2-5 tokens (words in normalized spelling) occurring above the frequency threshold of 20 occurrences per million. In addition to frequency counts, the tool calculates the strength of association between the words in a given n-gram, using three effect-size measures (Dice coefficient, pointwise mutual information and cubic mutual information) and two significance measures (t-score and simple log-likelihood), extended to multi-word combinations (Ramisch et al., 2010).
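The extraction step can be illustrated with a minimal Python sketch. It is not the LIST tool and does not reproduce its interface; the function names are hypothetical. The sketch assumes that the expected frequency of an n-gram is estimated from the product of its unigram frequencies and that the Dice coefficient is generalized as n times the n-gram frequency divided by the sum of the unigram frequencies, which is one common way of extending these measures beyond word pairs. Treating the threshold as inclusive is likewise an assumption.

```python
import math
from collections import Counter


def extract_ngrams(utterances, n_min=2, n_max=5):
    """Count all n-grams of n_min..n_max tokens without crossing utterance boundaries."""
    counts = Counter()
    for tokens in utterances:
        for n in range(n_min, n_max + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def frequent_ngrams(counts, total_tokens, min_per_million=20.0):
    """Keep n-grams whose relative frequency reaches the per-million threshold.

    Whether the boundary value itself is included is an assumption of this sketch.
    """
    return {
        ngram: freq
        for ngram, freq in counts.items()
        if freq * 1_000_000 / total_tokens >= min_per_million
    }


def association_scores(ngram, freq, unigram_freq, total_tokens):
    """Association measures for one n-gram (assumed multi-word generalizations)."""
    n = len(ngram)
    # Expected frequency if the words co-occurred independently.
    expected = math.prod(unigram_freq[w] for w in ngram) / total_tokens ** (n - 1)
    return {
        "dice": n * freq / sum(unigram_freq[w] for w in ngram),
        "mi": math.log2(freq / expected),
        "mi3": math.log2(freq ** 3 / expected),
        "t_score": (freq - expected) / math.sqrt(freq),
        "simple_ll": 2 * (freq * math.log(freq / expected) - (freq - expected)),
    }
```

Each utterance is passed as a list of normalized tokens, so candidate sequences never span two utterances; whether the original extraction applied the same restriction is not stated in the text.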

N-gram ranking
There is no consensus on the optimal method for measuring formulaicity in a language, with methods ranging from raw frequency counts to specific association measures (Biber, 2009; Gries, 2012), producing only partially overlapping recommendations of the most salient multi-word units in a language (Evert, 2009), including Slovenian (Dobrovoljc, 2017). Instead of opting for a single method, we narrowed the initial list of frequently recurring n-grams to the union of the top 1,000 candidates ranked by each of the six methods (frequency, Dice, t-score, LL, MI, MI3). This amounted to a final list of 2,374 formulaic sequences for subsequent annotation (Table 1).
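The selection step amounts to taking a set union over per-measure rankings. A minimal sketch, assuming the scores are held in a hypothetical nested dictionary (n-gram to measure to value) and that higher values rank first for every measure:

```python
def top_k_union(scores, k=1000):
    """Return the union of the k best-ranked n-grams under each measure.

    scores: dict mapping n-gram -> dict mapping measure name -> value.
    """
    measures = {m for per_ngram in scores.values() for m in per_ngram}
    selected = set()
    for measure in measures:
        ranked = sorted(scores, key=lambda g: scores[g][measure], reverse=True)
        selected.update(ranked[:k])
    return selected
```

With six measures and k = 1,000, the union can contain at most 6,000 items, so the reported 2,374 sequences indicate substantial, but far from complete, overlap between the rankings.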

Annotation of formulaic sequences
The list of formulaic sequences has been split into multicolumn spreadsheets containing the sequences, slots for predefined labels and the hyperlinks to the corresponding concordances in GOS.
Each spreadsheet was manually annotated by two independent annotators (trained native speakers) based on the guidelines summarized below, with disagreements adjudicated by an expert third annotator.
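One possible way to represent a single annotation decision in such a workflow is sketched below. The class, its field names and the category value are hypothetical and do not describe the actual spreadsheet layout; the sketch only shows how both independent labels can be kept alongside the adjudicated one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    """One categorization decision for one sequence (hypothetical record layout)."""
    sequence: str
    category: str                      # e.g. "structure"
    label_a: str                       # first annotator
    label_b: str                       # second annotator
    adjudicated: Optional[str] = None  # third annotator, only needed on disagreement

    @property
    def agreed(self) -> bool:
        return self.label_a == self.label_b

    def final_label(self) -> str:
        """Return the shared label, or the adjudicated one if the annotators disagreed."""
        if self.agreed:
            return self.label_a
        if self.adjudicated is None:
            raise ValueError(f"'{self.sequence}' still needs adjudication")
        return self.adjudicated
```

Storing the two original labels rather than only the final one keeps the agreement status of each decision recoverable later.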

Syntactic structure
In terms of syntactic structure, the sequences have been categorized into structurally complete and incomplete sequences. Structurally complete sequences are those that can be attributed a specific syntactic role in an utterance. This includes complete utterances or phrases (e.g. to je res "that's true", no no "well well"), sentence elements, such as predicates (boš videl "you-will see"), predicate arguments (glava družine "head of the family") and adjuncts (pol ure "half an hour"), as well as modifiers (bolj ali manj "more or less"), multi-word conjunctions (zaradi tega ker "given the fact that"), and connectives (tako da "so that"). Incomplete sequences, on the other hand, include fragments of the above constructions (da bi se "that they", minute čez "minutes past"), including speech-specific sequences involving fillers (eee in eee "uhm and uhm"), discourse markers (ja tako da "yes so") and repetitions (kaj kaj "what what").

Pragmatic function
In terms of pragmatic function, the guidelines followed previous influential functional taxonomies (Simpson-Vlach and Ellis, 2010; Biber et al., 2004), in which formulaic sequences are divided into referential expressions that reference physical or abstract entities and their properties (e.g. to je bilo "that was", v skladu z "in line with", uradni list št. "official gazette no."), stance expressions that express attitudes or assessments of certainty (e.g. na nek način "in a way", se mi zdi "I think", naj bi bil "is supposed to", ja ne vem "well I don't know"), and discourse organizers that contribute to textual and interactional coherence (e.g. kar pomeni da "which means that", to se pravi "that is to say", tako da je "so that is", ja ja ja "yes yes yes").

Lexicographic relevance
In order to determine which formulaic sequences are potentially relevant for inclusion in future dictionaries and similar lexical resources for Slovenian, the annotators were asked to label the sequence in terms of its semantic relevance, i.e. whether the sequence is a multi-word expression they would expect to find in a general dictionary intended for both native and non-native speakers of Slovenian. Specifically, they were instructed to identify multi-word expressions as opposed to free word combinations, ranging from collocations (na internetu "on the Internet") to fixed multi-word units with denominative (javni sektor "public sector"), syntactic (kljub temu da "despite the fact that"), or pragmatic functions (tako rekoč "so to speak", dame in gospodje "ladies and gentlemen"), regardless of semantic transparency.

Disambiguation
Only one label was allowed per category. In case of ambiguity, the annotators were advised to inspect a random sample of the concordances provided and decide for the most frequently occurring structural or functional interpretation, i.e. a primary interpretation for the given string. For semantic relevance, on the other hand, the annotators were instructed to label a sequence as relevant regardless of the frequency of this particular usage.

Inter-annotator agreement
On average, the two annotators agreed on 81.6% of categorization decisions, with disagreements distributed similarly across different n-gram lengths. This confirms the relatively high level of subjectivity involved in this annotation task, which is specific not just in terms of categories (intuitive interpretations of abstract classes), but also in terms of the items under investigation (highly ambiguous and multifunctional) and the annotation setting itself (lack of immediate context, simple guidelines).
As expected, the best inter-annotator agreement was observed for syntactic structure (86% absolute agreement, Cohen's Kappa 0.66), where annotators mostly disagreed on the structure of sequences that occur as syntactically complete and incomplete units with similar frequency (e.g. veš kaj "you know what"). Other frequent groups with structure disagreement include predicates with transitive verbs (bom rekel "I-will say"), numerals (deset tisoč "ten thousand"), repetitions (dobro dobro "good good"), fragments of prepositional phrases (današnji dan "(on) this day"), as well as strings of discourse connectives (in s tem "and thus"), and clause stems (kar pomeni "which means").
For all three categories, the competing annotations were resolved by an expert third annotator. However, given the high level of ambiguity and subjectivity inherent to the annotation task, the information on the degree of inter-annotator agreement for each decision has been preserved in the final data release.
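For readers less familiar with the two agreement figures reported in this section, the following sketch shows how absolute agreement and Cohen's Kappa are computed from two parallel lists of labels. The function names are illustrative; this is the textbook definition of the statistic, not the script used in this study.

```python
from collections import Counter


def observed_agreement(labels_a, labels_b):
    """Share of items on which both annotators chose the same label."""
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    p_o = observed_agreement(labels_a, labels_b)
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's own label distribution.
    p_e = sum(freq_a[label] * freq_b[label] for label in set(freq_a) | set(freq_b)) / n ** 2
    if p_e == 1:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)
```

Because Kappa discounts the agreement expected by chance, it is lower than the absolute figure whenever one label dominates, as is the case for structurally incomplete sequences here.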

List of annotated sequences
In general, the distribution of specific annotation labels in the resulting list of formulaic sequences (summarized in Table 2) confirms previous empirical observations that formulaic sequences mostly consist of structurally incomplete n-grams (72.2%) with referential function (72.0%) that do not correspond to traditional dictionary-relevant multi-word expressions (74.6%). Specifically, 50.6% of sequences (1,201) have been labelled with this exact combination of characteristics, among which sentence fragments (da je "that is", je to "is this", ki je v "which is in") prevail.

Table 2: Distribution of annotation labels in the list of formulaic sequences (N = number of sequences).

Category    Label         N
structure   complete      661
            incomplete    1,713
function    referential   1,709
            stance        306
            discourse     359
relevance   yes           604
            no            1,770
Total                     2,374

Nevertheless, the annotated list reveals several other groups of formulaic language in spoken Slovenian with potential relevance for further linguistic inquiries and applications. From the point of view of syntactic structure, the structurally complete sequences (27.8%) include a diverse set of constructions, ranging from sentence elements, such as predicates (smo rekli "we-have said"), and adjuncts (v Sloveniji "in Slovenia", dve leti "two years"), to various types of modifiers (še en "another") and sentence-peripheral multi-word expressions. This last group also corresponds to the function-related findings that show a notable share of formulaic sequences with discourse-organizing (15.1%, e.g. tako da "so that", na primer "for example", a ne "right", dobro jutro "good morning") and stance-marking functions (12.9%, e.g. se mi zdi "it seems", mislim da "I think", po svoje "in a way"), confirming the importance of discourse structuring, interaction management and speaker mitigation in speech.
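The percentages quoted in this section follow directly from the counts in Table 2. The short sketch below recomputes them; the counts are taken from the table, while the variable and function names are illustrative.

```python
TOTAL = 2374

# Label counts as reported in Table 2.
COUNTS = {
    "structure": {"complete": 661, "incomplete": 1713},
    "function": {"referential": 1709, "stance": 306, "discourse": 359},
    "relevance": {"yes": 604, "no": 1770},
}


def share(n, total=TOTAL):
    """Percentage of the full list, rounded to one decimal place."""
    return round(100 * n / total, 1)


SHARES = {
    category: {label: share(n) for label, n in labels.items()}
    for category, labels in COUNTS.items()
}
```

Each category partitions the full list, so the counts within a category sum to 2,374; the same function applied to the 1,201 sequences with the most common label combination gives the 50.6% reported above.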

Conclusion
This paper presented the identification of the most frequent and statistically prominent word n-grams in the reference spoken corpus of Slovenian and their annotation in terms of syntactic structure, pragmatic function and lexicographic relevance. The annotation campaign resulted in a preliminary lexicon of formulaic sequences in (spoken) Slovenian with a high potential for future explorations in both theoretical and applied formulaic language research.
With regard to the latter in particular, our research represents an important addition to existing corpus-based collections of multi-word units in Slovenian (Gantar et al., 2016; Kosem et al., 2018; Ljubešić et al., 2015), which predominantly focus on units with propositional meaning. The large number of formulaic expressions with discourse-organizing and stance-marking functions identified in this research, however, confirms the need for future investigations of non-propositional multi-word expressions as well.
To this end, we plan to extend our work to the identification and annotation of formulaic sequences in written texts, drawing on the findings and observations presented above. In addition to the immediate benefits to lexicography, language teaching and natural language processing, an exhaustive inventory of formulaic sequences in Slovenian will also enable further research on methods for their identification and categorization. This includes a comparison with manual identification of formulaic sequences in corpora, which would bring insight into issues related to instance-level annotation.