Boundary-based MWE segmentation with text partitioning

This submission describes the development of a fine-grained, text-chunking algorithm for the task of comprehensive MWE segmentation, which notably focuses on the identification of colloquial and idiomatic language. The submission also includes a thorough model evaluation in the context of two recent shared tasks, spanning 19 languages and many text domains, including noisy, user-generated text. The evaluations show the presented model to be the best overall for MWE segmentation, and open-source software is released with the submission (links are withheld for purposes of anonymity). Additionally, the authors acknowledge the existence of a pre-print document on arxiv.org, which should be avoided to maintain anonymity in review.


Introduction
Multiword expressions (MWEs) constitute a mixed class of complex lexical objects that often behave in syntactically unruly ways. A unifying property that ties this class together is the lexicalization of multiple words into a single unit. MWEs are generally difficult to understand through grammatical decomposition, casting them as types of minimal semantic units. There is variation in this non-compositionality property (Bannard et al., 2003), which in part may be attributed to differences in MWE types. These range from multiword named entities, such as Long Beach, California, to proverbs, such as it takes one to know one, to idiomatic verbal expressions, like cut it out (which often contain flexible gaps). For all of their strangeness, they appear across natural languages (Jackendoff, 1997; Sag et al., 2002), though generally not for common meanings, and frequently with opaque etymologies that confound non-native speakers.

Motivation
There are numerous applications in NLP for which a preliminary identification of MWEs holds great promise. These notably include idiom-level machine translation (Carpuat and Diab, 2010); reduced polysemy in sense disambiguation; keyphrase-refined information retrieval (Newman et al., 2012); and the integration of idiomatic and formulaic language in learning environments (Ellis et al., 2008). Parallel to these linguistically-focused applications is the possibility that MWE identification can positively affect machine-learning applications in text analysis. Regardless of algorithm complexity, a common preliminary step in this area is tokenization. Having the "correct" segmentation of a text into words and MWEs results in a meaning-appropriate tokenization of minimal semantic units. Partial steps in this direction have been taken through recent work on making the bag-of-phrases framework available as a simple improvement to the bag of words. However, that work (Handler et al., 2016) utilized only noun phrases, leaving the connection between MWEs and a comprehensive bag-of-phrases framework yet to be acknowledged. With MWEs' specific focus on idiomaticity, a comprehensive bag-of-words-and-phrases framework would be possible, provided the MWE identification task is resolved.

Task description
Despite the variety of MWEs that exist, studies often focus on only a few MWE classes, or on only specific lengths (Tsvetkov and Wintner, 2011). In fact, named entity extraction may be thought of as satisfying the MWE identification task for just this one MWE class. The problem has a broader framing when all classes of MWEs are considered. Furthermore, since a mixed tokenization of words and phrases as minimal semantic units is a desired outcome, it is helpful to consider this task as a kind of fine-grained segmentation. Thus, this work refers to its task as MWE segmentation, and not identification or extraction. In other words, the specific goal here is to delimit texts into the smallest possible, independent units of meaning. Schneider et al. (2014b) were the first to treat the problem as such, when they created the first data set comprehensively annotated for MWEs. From this data set, an exemplar annotated record is:

  My wife had taken_1 her 07'_2 Ford_2 Fusion_2 in_1 for a routine oil_3 change_3 .

whose segmentation is an example of the present focus of this work. Note that the present study focuses only on MWE tokens; it does not aim to approach the task of MWE class identification, and does not attempt to disambiguate MWE meanings.
For detailed descriptions of these other MWE-related tasks, Baldwin and Kim (2010) provide an extensive discussion.
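To make the target output concrete, the annotated record above can be rendered as a mixed tokenization of words and MWEs. The following sketch (a hypothetical representation for illustration, not the authors' code or the DIMSUM file format) groups tokens sharing an MWE index into minimal semantic units:

```python
# Each token is paired with its MWE index (None for single-word units).
record = [
    ("My", None), ("wife", None), ("had", None), ("taken", 1),
    ("her", None), ("07'", 2), ("Ford", 2), ("Fusion", 2),
    ("in", 1), ("for", None), ("a", None), ("routine", None),
    ("oil", 3), ("change", 3), (".", None),
]

def units(record):
    """Group tokens into minimal semantic units: words and (gappy) MWEs."""
    mwes = {}   # MWE index -> token list (shared with `out` by reference)
    out = []
    for tok, idx in record:
        if idx is None:
            out.append([tok])
        elif idx in mwes:
            mwes[idx].append(tok)  # later token of an already-seen MWE
        else:
            mwes[idx] = [tok]
            out.append(mwes[idx])
    return out

print(units(record))  # note the gappy unit ['taken', 'in']
```

The gappy verbal expression taken ... in surfaces as a single unit despite the intervening words, which is exactly the segmentation behavior targeted here.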

Existing work
The identification of MWEs and collocations is an area of study that has seen notable focus in recent years (Seretan, 2008; Pecina, 2010; Newman et al., 2012; Ramisch, 2015; Schneider et al., 2014a), and has a strong history of attention (both directly and through related work) in the literature (Becker, 1975; Church and Hanks, 1990; Sag et al., 2002). It has become commonplace for approaches to leverage well-studied machine-learning algorithms such as structured perceptrons (Schneider et al., 2014a) and conditional random fields (Constant and Sigogne, 2011; Hosseini et al., 2016). The flexibility of these algorithms allows researchers to mix a variety of feature types, ranging from tokens to parts of speech to syntax trees. Juxtaposed to these relatively complex models are simpler, more heuristic approaches (Cordeiro et al., 2015). Some rely singularly on MWE dictionaries, while others incorporate multiple measures or are rule-based, like those present in the suite available through mwetoolkit (Ramisch, 2015) or jMWE. MWEs have been the focus of considerable attention for languages other than English, too. Hungarian MWE corpora focusing on light verb constructions have been under development for some time (T. et al., 2011). In application to the French language, part-of-speech tagging has seen benefit (Constant and Sigogne, 2011) through awareness of MWEs. Recently, Savary et al. (2017) conducted a shared task for the identification of verbal MWEs with a data set spanning 18 languages (excluding English). While extending this area of work to a large variety of languages, this task saw notable multilingual algorithmic developments (Saied and Candito, 2017), but did not approach the identification of all MWE classes comprehensively. On the other hand, a SemEval 2016 shared task (Schneider et al., 2016) covered English domains and all MWE classes, bearing the greatest similarity to the present work.
In general, these shared tasks have all highlighted a need for the improvement of algorithms.

Text partitioning
Text partitioning is a physical model developed recently (Williams et al., 2015) for fine-grained text segmentation. It treats a text as a dichotomous sequence, alternating between word (w_i) and non-word (b_i) tokens:

  w_1 b_1 w_2 b_2 · · · b_{N-1} w_N

The key feature of text partitioning is its treatment of non-word, i.e., "boundary", tokens. Acting like glue, these may take one of two distinct states, s ∈ {0, 1}, identifying if a non-word token is bound (b_i^1) or broken (b_i^0). A non-word token in the bound state binds words together. Thus, a text partitioning algorithm is a function that determines the states of non-word tokens.
In its original development, text partitioning was studied simplistically, with space as the only non-word token. In that work, a threshold probability, q, was set. For each space, b_i, in a text, a uniform random binding probability, q_i, would be drawn. If q_i > q, b_i would be bound, and otherwise it would be broken. As a parameter, q thus allowed for the tuning of a text into its collection of words (q = 1), clauses (q = 0), or, for any value q ∈ (0, 1), a randomly-determined collection of N-grams. While non-deterministic, this method was found to preserve word frequencies (unlike the sliding-window method), and made possible the study of Zipf's law for mixed distributions of words and N-grams.
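This original, non-deterministic procedure can be sketched in a few lines (an illustration assuming simple whitespace tokenization; the function name is chosen here for clarity, not taken from the released software):

```python
import random

def random_partition(words, q, rng=None):
    """Randomly partition a word sequence: each inter-word space b_i is
    bound when a uniform draw q_i exceeds the threshold q, else broken."""
    rng = rng or random.Random(0)
    if not words:
        return []
    phrases, current = [], [words[0]]
    for word in words[1:]:
        if rng.random() > q:   # q_i > q: the space is bound, gluing words
            current.append(word)
        else:                  # broken: start a new segment
            phrases.append(" ".join(current))
            current = [word]
    phrases.append(" ".join(current))
    return phrases

text = "it takes one to know one".split()
print(random_partition(text, q=1.0))  # q = 1: pure words
print(random_partition(text, q=0.0))  # q = 0: a single clause
```

At the extremes the behavior is deterministic: q = 1 can never be exceeded by a uniform draw (so every space breaks), while q = 0 is always exceeded (so every space binds).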
The present work utilizes the parameter q to develop a supervised machine-learning algorithm for MWE segmentation. A threshold probability, q, is still set, and the supervised component is the determination of the binding probabilities (q_i) for a text's non-word tokens. Provided a gold-standard, MWE-segmented text, let

  f(w_i, b_i^{s_i}, w_{i+1})

denote the frequency at which a boundary b_i is observed between w_i and w_{i+1} in the state s_i. Provided this, a binding probability is defined as:

  q_i = f(w_i, b_i^1, w_{i+1}) / [f(w_i, b_i^0, w_{i+1}) + f(w_i, b_i^1, w_{i+1})]
This basic, 2-gram text partitioning model makes the binding probabilities a function of boundaries and their immediately-surrounding words.
In principle, this might be extended to a more nuanced model, with binding probabilities refined by larger-gram information.
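The frequency counting and binding-probability estimate above can be sketched as follows (the training-data format and the tiny example corpus are assumptions for illustration, not the gold-standard format):

```python
from collections import Counter

# Hypothetical gold data: each sentence is a list of segments,
# each segment a list of words (multiword segments are MWEs).
gold = [
    [["I"], ["went"], ["to"], ["Long", "Beach"], ["."]],
    [["Long", "Beach"], ["is"], ["in"], ["California"], ["."]],
]

f = Counter()  # f[(w_i, s, w_{i+1})]: boundary-state frequencies
for sentence in gold:
    words = [w for seg in sentence for w in seg]
    states = []
    for seg in sentence:
        # boundaries inside a segment are bound (1); the one after is broken (0)
        states += [1] * (len(seg) - 1) + [0]
    states.pop()  # no boundary after the final word
    for i, s in enumerate(states):
        f[(words[i], s, words[i + 1])] += 1

def q_i(w1, w2):
    """Binding probability: bound count over total boundary observations."""
    bound, broken = f[(w1, 1, w2)], f[(w1, 0, w2)]
    total = bound + broken
    return bound / total if total else 0.0

print(q_i("Long", "Beach"))  # bound in both training sentences -> 1.0
```

Unseen boundaries default here to 0.0 (broken), which matches the intuition that unobserved word pairs should not be glued.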

Extensions
Some MWEs consist of non-contiguous spans of words. These varieties are often referred to as "gappy" expressions, an example of which is shown in Sec. 1.2. Text partitioning may easily be extended to handle gappy MWEs by instituting a unique boundary token, e.g., b = GAP, that indicates the presence of a gap. For example, consider handling the gappy MWE out of control in the statement:

  The situation was out_1 of_1 their control_1 .
A binding probability for b (as above) between the words w_5 = of and w_7 = control would be computed from the state frequencies f(w_5, b^{0,1}, w_7).
Since gappy MWEs are relatively sparse as compared to other MWEs, a single gap-boundary token is used for all gap sizes. This is designed for a flexible handling of variable gap sizes, given the relatively small amount of gold-standard data presently available. However, this may in principle be refined to particular gap-size specifications, possibly ideal for higher precision in the presence of larger quantities of gold-standard data.

A number of MWE types, such as named entities, are entirely open classes. Often occurring only once, or as entirely emergent objects, these pose a significant challenge for MWE segmentation, along with the general sparsity and size of the current gold standards. Given their inclusion in the gold-standard data sets and the general quality of automated taggers, part-of-speech (POS) information may be leveraged to increase recall. These data are utilized in a parallel text partitioning algorithm, swapping tokens for tags, so that binding probabilities, q_i,tok and q_i,POS, are computed for both data types. Two thresholds, q_tok and q_POS, are then used to determine states via a logical disjunction.

[Algorithm 1: Pseudocode for the longest first defined (LFD) algorithm. Here, a candidate MWE's tokens are pruned from left to right for the longest form referenced in a training lexicon, lex. When no form is found in lex, the first token is automatically pruned (accepting it as an expression), leaving the algorithm to start from the next. A "∘" symbol indicates a concatenation operation, placing the current form onto the end of the lexemes array.]

The longest first defined
In its presented form, text partitioning only focuses on information immediately local to boundaries (surrounding word pairs). This has positive effects for recall, but can result in lower precision, since there is no guarantee that a sequence of bound tokens is an MWE. For example, if presented with the text: "I go for take out there, frequently.", the segment take out there might be bound, since take out and out there are both known MWE forms, potentially observed in training. To balance this, a directional, lookup-based algorithm is proposed: the longest first defined (LFD) algorithm (see Alg. 1), which prunes candidates by clipping off the longest known (MWE) references along the reading direction of a language. This requires knowledge of MWE lexica, which may be derived from both gold-standard data and external sources (see Sec. 3). Continuing with the example, if the text partitioning algorithm outputs the candidate take out there, it would next be passed to the LFD. The LFD would find take out there unreferenced, and check the next-shortest (2-word) segments, from left to right. It would immediately find take out referenced, output it, and continue on the remainder, there. With only one term remaining, the word there would then be trivially output and the algorithm terminated. While this algorithm will likely fail when confronted with pathological expressions, like those in "garden path" sentences, e.g., "The prime number few.", directionality is a powerful heuristic in many languages that may be leveraged for increased precision.
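The LFD's left-to-right, longest-prefix pruning can be rendered compactly in Python (an illustrative sketch consistent with Alg. 1's description, not the released implementation):

```python
def lfd(tokens, lex):
    """Longest-first-defined pruning: scan a candidate MWE left to right,
    clipping off the longest prefix found in the training lexicon; when
    no prefix is known, accept the first token alone and continue."""
    lexemes = []
    while tokens:
        # try prefixes from longest to shortest
        for i in range(len(tokens), 0, -1):
            form = " ".join(tokens[:i])
            if form in lex or i == 1:
                lexemes.append(form)
                tokens = tokens[i:]
                break
    return lexemes

lex = {"take out", "out there"}
print(lfd(["take", "out", "there"], lex))  # -> ['take out', 'there']
```

As in the worked example, take out there is unreferenced, so the longest known prefix take out is clipped first, leaving there to be output trivially.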

Gold standard data
Treating MWE segmentation as a supervised machine-learning task, this work relies on several recently-constructed MWE-annotated data sets. This includes the business reviews contained in the Supersense-Tagged Repository of English with a Unified Semantics for Lexical Expressions, annotated by Schneider et al. (2014b; 2015). These data were harmonized and merged with the Ritter and Lowlands data sets of supersense-annotated tweets (Johannsen et al., 2014) for the SemEval 2016 shared task (#10) on Detecting Minimal Semantic Units and their Meanings (DIMSUM), conducted by Schneider et al. (2016). The DIMSUM data set additionally possesses token lemmas and gold-standard part-of-speech (POS) tags for the 17 universal POS categories. In addition to the shared task's training data of business reviews and tweets, the DIMSUM shared task resulted in the creation of three domains of testing data, spanning business reviews, tweets, and TED talk transcripts. All DIMSUM data are comprehensive in being annotated for all MWE classes.
To evaluate against a diversity of languages, this work also utilizes data produced by the multinational European Cooperation in Science and Technology action group PARSing and Multiword Expressions within a European multilingual network (PARSEME) (Savary et al., 2015). In 2017, the PARSEME group conducted a shared task with data spanning 18 languages (Savary et al., 2017), focusing on several classes of verbal MWEs. So, while the PARSEME data are not annotated for all MWE classes, they do provide an assessment against multiple languages. However, the resources gathered for the 18 languages exhibit a large degree of variation in overall size and number of MWEs annotated, leading to observable differences in identifiability.
The gold-standard data sets were produced with variations in annotation formats. The DIMSUM data set utilizes a variant of the beginning-inside-outside (BIO) scheme (Ramshaw and Marcus, 1995) used for named entity extraction. Additionally, its annotations indicate which tokens are linked to which, as opposed to the PARSEME data sets, which simply assign tokens to indexed MWEs. Note that this has implications for task evaluation: the PARSEME evaluations can only assess tokens' presence inside of specific MWEs, while the DIMSUM evaluations can focus on specific token-token attachments/separations. Evaluations against the DIMSUM data sets are therefore more informative of segmentation than of identification. Additionally, the DIMSUM data sets use lowercase BIO tags to indicate the presence of tokens inside the gaps of others. However, the DIMSUM data sets provide no information on the locations of spaces in sentences, unlike the PARSEME data sets, which do. Since the present work relies on knowledge of spaces to identify token-token boundaries for segmentation, the DIMSUM data sets had to first be preprocessed to infer the locations of spaces. This is done in such a way as to preserve comparability with the work of others (discussed in Sec. 4.1).

Support data
The gold-standard data sets (DIMSUM and PARSEME) exhibit variations in size, domain, language, and in the classes of annotated MWEs. Ideally, each of these data sets would cover all MWE classes. Yet even though the English data sets do, many MWE classes are open (e.g., the named entity class readily accepts new members), so gold standards cannot be expected to cover all MWE forms. Thus, to produce segmentations that identify rare MWEs, like those that occur only once in the gold-standard data, this work relies on support data. Note that because the PARSEME data sets cover a restricted set of MWE types (verbal MWEs, only), type-unrestricted lexical resources, such as Wiktionary and Wikipedia, can be expected to substantially hurt precision while helping recall. Thus, the support data described below are only used for the English-language experiments, i.e., the DIMSUM data sets. Enhancement by support data for the PARSEME task and extension to the identification of MWE types are thus left together for future development.
Since this work approaches the problem as a segmentation task, information is needed on MWE edge boundaries. Thus, support data must present MWEs in their written contexts, and not just as entries in a lexicon. Example usages of dictionary entries provide this detail, and are leveraged from Wiktionary (data accessed 1/11/16) and WordNet (Miller, 1995). These exemplified dictionary entries help to fill gold-standard data gaps, but still lack many noun compounds and named entities. Outside of dictionaries, MWEs such as these may be found in encyclopedias. Thus, the hyperlinks present in all Wikipedia (data accessed 5/1/16) articles are utilized. Specifically, the exact hyperlink targets are used (not the displayed text), and without any term-extraction measures for filtering, as opposed to the data produced by Hartmann et al. (2012). This results in data that are noisy, with many entities that may not actually be classifiable as MWEs. However, their availability and broad coverage offset these negative properties, as exhibited by this work's evaluations.

Pre-processing
None of the gold-standard data sets explicitly identify the locations of spaces in their annotations. This is a challenge for the present work, since it focuses on word-word boundaries (of which space is the most common) to identify the separations between segments. This turns out not to be an issue with the PARSEME data sets, which indicate when a given token is not followed by a space. However, for the DIMSUM data sets, the locations of spaces had to be inferred. To resolve this issue, a set of heuristic rules is adopted, with a default assumption of space on both sides of a token. Exceptions to this default include group openings (e.g., brackets and parentheses) and odd-indexed quotes (double, single, etc.), for which space is only assumed at left; and punctuation tokens (e.g., commas and periods), group closures (e.g., brackets and parentheses), and even-indexed quotes (double, single, etc.), for which space is only assumed at right. While these heuristics will certainly not correctly identify all instances of space, they make the data sets more faithful to their original texts. Furthermore, since the annotations and evaluation procedures only focus on links between non-space tokens, the data may be re-indexed during pre-processing so as to allow any resulting evaluation to be comparable to those of the data set authors and shared-task participants. Thus, the omission of space characters and their inference in this work only negatively impacts text partitioning's evaluation. In other words, if this work were applied to annotated data that properly represents space, higher performance might be exhibited.
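These heuristics can be sketched as follows (an illustration under the stated rules only; the released preprocessing code may differ in its details). A space is emitted between two tokens exactly when both sides assume one:

```python
OPENERS = set("([{")
CLOSERS = set(")]}") | {",", ".", ";", ":", "!", "?"}

def infer_spaces(tokens):
    """Rejoin tokens under the space-inference heuristics: space assumed
    on both sides by default; openings and odd-indexed quotes take space
    only at left; punctuation, closures, and even-indexed quotes only at
    right. A space is emitted when both adjacent sides assume one."""
    quote_count = {}
    out = ""
    prev_right = False  # does the previous token assume space at its right?
    for i, tok in enumerate(tokens):
        left, right = True, True  # default: space on both sides
        if tok in ('"', "'"):
            quote_count[tok] = quote_count.get(tok, 0) + 1
            if quote_count[tok] % 2 == 1:
                right = False   # odd-indexed quote: opening, space at left only
            else:
                left = False    # even-indexed quote: closing, space at right only
        elif tok in OPENERS:
            right = False       # group opening: space at left only
        elif tok in CLOSERS:
            left = False        # punctuation/closure: space at right only
        if i and prev_right and left:
            out += " "
        out += tok
        prev_right = right
    return out

print(infer_spaces(["I", "go", "for", "take", "out", "there",
                    ",", "frequently", "."]))
# -> I go for take out there, frequently.
```

The quote counter is what makes quote handling positional: the first, third, etc. occurrence of a quote character opens (gluing rightward), while the second, fourth, etc. closes (gluing leftward).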

Evaluation
It is reasonably straightforward to measure precision, recall, and F_1 for exact matches of MWEs. However, this strategy is unreasonably coarse, failing to award partial credit when algorithms get only portions of MWEs correct. Thus, the developers of the different gold-standard data sets have established other, more flexible evaluation metrics. Utilizing these partial-credit MWE evaluation metrics provides refined detail into the performance of algorithms. However, they are not the same across the gold-standard data sets. So, to maintain comparability of the present results, this work uses the specific strategies associated with each shared task.
In application to the PARSEME data sets, precision, recall, and F_1 describe tokens' presence in MWEs. Alternatively, DIMSUM-style metrics measure link/boundary-based evaluations. Specifically, this strategy checks whether the links between tokens are correct. Note that this latter (DIMSUM) evaluation is better aligned to the formulation of text partitioning, but leaves the number of evaluation points at one fewer per MWE than the PARSEME scheme. Thus, PARSEME evaluations favor longer MWEs more heavily.
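To illustrate the link-based (DIMSUM-style) scheme, the following sketch scores predicted against gold token-token links, with MWEs given as lists of token indices (a simplification for illustration, not the official evaluation script):

```python
def links(segments):
    """Token-index links implied by a segmentation: one link per adjacent
    token pair inside each MWE (an n-token MWE yields n-1 links)."""
    out = set()
    for seg in segments:
        for a, b in zip(seg, seg[1:]):
            out.add((a, b))
    return out

def link_f1(gold_segs, pred_segs):
    """Precision, recall, and F1 over predicted vs. gold links."""
    g, p = links(gold_segs), links(pred_segs)
    if not g or not p:
        return 0.0
    prec = len(g & p) / len(p)
    rec = len(g & p) / len(g)
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# gold MWE "oil change" over token indices (12, 13); prediction agrees
print(link_f1([[12, 13]], [[12, 13]]))  # -> 1.0
```

Because a 2-token MWE contributes one link but two tokens, this scheme yields one fewer evaluation point per MWE than token-presence scoring, as noted above.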

Experimental design
The basic text partitioning model relies on a single threshold parameter, and the integration of POS tags relies on a second. So, optimization ultimately entails the determination of thresholds for both tokens, q_tok, and POS tags, q_POS. To balance both precision and recall, these parameters are determined through optimization of the F_1 measure. In the absence of the LFD, F_1-optimal pairs, (q_tok, q_POS), are first determined via a full parameter scan over (q_tok, q_POS) ∈ {0, 0.01, ..., 0.99, 1}^2.
For a given threshold pair, LFD enhancement can then only increase precision while decreasing recall. So, subsequent optimization with the LFD is accomplished by scanning values of q_tok and q_POS in the parameter space no less than those previously determined for the basic, non-LFD model.
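The full scan, together with the logical disjunction over the two binding probabilities, can be sketched as follows (helper names are hypothetical; per-boundary probabilities are assumed precomputed from training data):

```python
from itertools import product

def f1(gold, pred):
    """F1 over per-boundary binary bound/broken decisions."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    prec = tp / max(sum(pred), 1)
    rec = tp / max(sum(gold), 1)
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def grid_search(gold_states, tok_probs, pos_probs):
    """Scan (q_tok, q_POS) over {0, 0.01, ..., 1}^2, returning the
    F1-optimal pair. A boundary is predicted bound when its token-based
    OR POS-based binding probability exceeds the matching threshold."""
    grid = [i / 100 for i in range(101)]
    best = (-1.0, None, None)
    for q_tok, q_pos in product(grid, grid):
        pred = [pt > q_tok or pp > q_pos
                for pt, pp in zip(tok_probs, pos_probs)]
        score = f1(gold_states, pred)
        if score > best[0]:
            best = (score, q_tok, q_pos)
    return best  # (best F1, q_tok, q_POS)

# toy example: one boundary recoverable only via tokens, one only via POS
print(grid_search([True, False, True], [0.9, 0.1, 0.2], [0.0, 0.0, 0.8]))
```

The toy example shows why the disjunction helps recall: neither evidence source alone covers both gold-bound boundaries, but some threshold pair lets each source contribute the boundary it knows about.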
The different experiments were conducted in accordance with the protocols established by the designers of the data sets and shared tasks, and in all cases, an eight-fold cross-validation was conducted for optimization. Exact comparability was achieved for the DIMSUM and PARSEME experiments as a result of the precise configurations of training and testing data from the shared tasks. Moreover, since an evaluation script was provided for each, metrics reported for the DIMSUM and PARSEME experiments are in complete accord with the results of the shared tasks. For the DIMSUM experiments, results should be compared to the open track (external data were utilized), and for the PARSEME experiments, results should be compared to the closed track (no external data were utilized).

Results
Evaluations spanning the variety of languages (19, in total) showed high levels of performance, especially in application to English, where there was a diversity of domains (business reviews, tweets, and TED talk transcripts), along with comprehensive MWE annotations. Moreover, these results were generally observed for text partitioning both with and without the LFD. As expected, application of the LFD generally led to increased precision. While the integration of POS tags was found to improve MWE segmentation in all English experiments, this was frequently not the case in applications to other languages. However, this observation should be considered in light of the restriction to fewer MWE classes (verbal MWEs, only) annotated in the PARSEME (non-English) shared-task languages, and additionally the fact that no external data were used. Detailed results for all DIMSUM and PARSEME experiments are recorded in Tab. 1.
For the DIMSUM experiments, final parameterizations were determined as (q_tok, q_POS) = (0.5, 0.71) for text partitioning alone, and (q_tok, q_POS) = (0.74, 0.71) for the LFD-enhanced model. Comparing the base and LFD-enhanced models, higher overall performance was always achieved with the LFD (increasing F_1 by as many as 12 points). Including text partitioning in the shared-task rankings (for a total of 5 models) placed the LFD-enhanced model first in all domains but Twitter, for which third place was reached (though within 3 F_1 points of first). However, combining all three domains into a single experiment placed the LFD-enhanced text partitioning algorithm first, making it the best-performing algorithm, overall. In application to the user-reviews domain, text partitioning maintained first-place status even without the LFD enhancement. For all other domains, the base model ranked third.
For the PARSEME experiments, final parameterizations varied widely. This is not surprising, considering the significant variation in data-set annotations and domains across the 18 languages. Additionally, POS tags were found to be of less-consistent value to the text partitioning algorithm, particularly when the LFD was not applied. Indeed, cross-validation of the base model resulted in q_POS = 0 as optimal for 11 of the 15 languages for which POS tags were made available. However, cross-validation of the LFD-enhanced algorithm resulted in only 6 parameterizations having q_POS = 0 as optimal. First-place status was achieved for three of the 18 languages (LT, PL, and SL), and for all languages aside from SV and TR, mid-to-high-ranking F_1 values were achieved. In contrast to the DIMSUM data sets, application of the LFD improved F_1 scores in only roughly half of the experiments.

Note that anomalous MWEs were observed in the DE and HU data sets, where large portions of the annotated MWEs consisted of only a single token. While the PARSEME annotation scheme includes multiword components that span a single token, e.g., "don't" in don't talk the talk, those observed in DE and HU were found outside of the annotation format. These included 27.2% of all MWEs annotated in the DE test records and 64.8% of all in the HU test records. Since text partitioning identifies segment boundaries, it cannot handle these anomalous MWEs, unlike the models entered into the PARSEME shared task. So, to accommodate them and maintain comparability, a separate algorithm was employed, which simply placed lone-MWE tags on tokens that were observed as anomalous 50% or more of the time in training.

[Table 1: ... thresholds; precision (P), recall (R), and F-measure (F1); shared-task rank (Rank); and shared-task F1 ranges (F1-Range). DIMSUM experiments spanned three domains: Twitter (Tweets), business reviews (Reviews), and TED talk transcripts (TED), with combined evaluation under EN. PARSEME language experiments are identified by ISO 639-1 two-letter codes.]


Discussion
Evaluation against the comprehensively-annotated English data sets has shown text partitioning to be the current highest-ranking overall MWE segmentation algorithm. This result is upheld for two of the three available test domains (business reviews and TED talk transcripts), with a close third place achieved against data from Twitter. This exhibits the algorithm's general applicability across domains, especially in the context of noisy text. Combined with the algorithm's fast-running and non-combinatorial nature, this makes text partitioning ideal for large-scale applications to the identification of colloquial language, often found on social media. For these purposes, the presented algorithms have been made available as open-source tools in the Python "Partitioner" module, which may be accessed through GitHub (https://github.com/jakerylandwilliams/partitioner) and the Python Package Index (https://pypi.python.org/pypi/partitioner) for general use.

Unfortunately, the PARSEME experiments did not provide an evaluation against all types of MWEs. However, they did exhibit the general applicability of text partitioning across languages. So, while the PARSEME data are not sufficient for comprehensive MWE segmentation, trained models have also been made available for the 18 non-English languages through the Python Partitioner module. Across the 18 PARSEME shared-task languages, text partitioning's F_1 values were found to rank as mid to high, with the notable exception of SV. The SV data are peculiar in being quite small (with a training set smaller than the testing set), and models entered into the PARSEME shared task achieved roughly twice the F_1 score for SV, indicating the possibility that text partitioning requires some critical mass of training data in order to achieve high levels of performance. Thus, for general increases in performance and for extension to comprehensive MWE segmentation, future directions of this work will likely do well to seek the collection of larger and more comprehensive data sets. As defined, text partitioning is subtly different from a 2-gram model: it focuses on non-word boundary tokens, as opposed to just word-word pairs. Because this algorithm relies on knowledge of boundary-token states, it cannot be trained well on MWE lexica alone.
For this model to achieve high precision, boundaries commonly occurring as broken must be observed as such, even if they are necessary components of known MWEs. Thus, the use of boundary-adjacent words for prediction is a limitation of the present model. This may possibly be overcome through the use of more-distant words and boundaries. However, since gold-standard data are still relatively small, they will likely require significant expansion before such models may be effectively implemented. Thus, future directions with more nuanced text partitioning models highlight the importance of generating more gold-standard data, too.