UW-CSE at SemEval-2016 Task 10: Detecting Multiword Expressions and Supersenses using Double-Chained Conditional Random Fields

We describe our entry to SemEval 2016 Task 10: Detecting Minimal Semantic Units and their Meanings. Our approach uses a discriminative first-order sequence model similar to that of Schneider and Smith (2015). The chief novelty in our approach is a factorization of the labels into multiword expression and supersense labels, restricting first-order dependencies to within these two parts. Our submitted models achieved first place in the closed competition (CRF) and second place in the open competition (2-CRF).

1 Introduction

Schneider and Smith (2015) argued that the problems of segmenting a piece of text into minimal semantic units, and of labeling those units with semantic classes (e.g., supersenses), are intimately connected.
We propose to use a double-chained conditional random field (which we refer to as "2-CRF," an example of a factorial CRF; §3.4) for joint multiword expression identification and supersense tagging. Like other CRFs, the 2-CRF is a feature-rich probabilistic model that can represent dependencies between features and labels and between the labels of consecutive words. The 2-CRF models local dependencies between the MWE and supersense sequences with two parallel chains of labels, restricting direct interaction between the two to local, single-word positions. Label constraints on tag bigrams ensure a globally consistent tagging.
Our experiments show that the 2-CRF outperforms a zero-order baseline, the structured perceptron used by Schneider and Smith (2015), and a conventional CRF (§4). For SemEval 2016 Task 10: Detecting Minimal Semantic Units and their Meanings, we submitted a CRF for the closed condition and a 2-CRF (incompletely trained) for the open condition, achieving first and second place, respectively.

2 Task Description
For completeness, we briefly review the shared task. The shared task training dataset, called "Detecting Minimal Semantic Units and their Meanings" (DiMSUM; Schneider et al., 2016), consists of sentences with multiword expression (MWE) and supersense annotations. The data combine and harmonize the STREUSLE 2.1 corpus of web reviews (Schneider and Smith, 2015) and the Ritter and Lowlands Twitter datasets (Johannsen et al., 2014). As in prior work (Schneider and Smith, 2015), the annotation for MWEs extends the conventional BIO scheme (Ramshaw and Marcus, 1995) to include gappy MWEs with one level of nesting. Segmentations are represented using six tags; the lower-case variants indicate that an expression is within another MWE's gap.
• O and o: a single-word expression
• B and b: the first word of a MWE
• I and i: a word continuing a MWE

We call a tag sequence valid if it matches the regular expression (O|B(o|bi+|I)*I+)+. Validity can be ensured using label constraints on tag bigrams (Schneider, 2014).
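As an illustration, the validity condition can be checked directly with a standard regular-expression engine. This is a minimal sketch; the string encoding (one character per token) is our own convention, not part of the task definition:

```python
import re

# One character per token; lowercase tags mark material inside an MWE's gap.
VALID = re.compile(r"(O|B(o|bi+|I)*I+)+")

def is_valid(tags):
    """True iff the tag sequence matches (O|B(o|bi+|I)*I+)+."""
    return VALID.fullmatch("".join(tags)) is not None

assert is_valid(["O", "B", "I"])        # a two-word MWE after a single word
assert is_valid(["B", "o", "I"])        # gappy MWE with one word in the gap
assert is_valid(["B", "b", "i", "I"])   # nested MWE inside the gap
assert not is_valid(["I", "O"])         # I cannot begin a sequence
assert not is_valid(["B", "O", "I"])    # uppercase O cannot appear in a gap
```

In practice the same constraints are enforced locally, as tag-bigram restrictions, rather than by matching whole sequences.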
Each noun or verb expression is also annotated with a supersense; there are 26 supersenses for nouns and 15 for verbs. Only the first word of a MWE receives a supersense tag.
One approach to encoding the MWE and supersense tags is to define an extended label set containing both (Schneider and Smith, 2015). This results in 170 potential labels: I, i, and each of B, b, O, and o paired with one of the 41 supersenses or with no supersense (2 + 4 × 42 = 170). Only 110 of these are attested in the training data, and these are the combinations our approach considers.
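The count can be verified with a short enumeration (the supersense names below are placeholders, not the actual inventory):

```python
# Placeholder inventory: 26 noun + 15 verb supersenses = 41.
supersenses = [f"n.{i}" for i in range(26)] + [f"v.{i}" for i in range(15)]

# I and i never carry a supersense; B, b, O, and o each pair with one of
# the 41 supersenses or with no supersense at all.
extended = [("I", None), ("i", None)]
for mwe_tag in ("B", "b", "O", "o"):
    for ss in supersenses + [None]:
        extended.append((mwe_tag, ss))

assert len(extended) == 2 + 4 * 42 == 170
```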
There are 4,799 sentences in the training data. For each token, the dataset provides its offset in the sentence, lemma, POS tag, MWE tag, offset of parent, and supersense label (if applicable).
The blind test set consists of 1,000 sentences from three sources: online reviews from the TrustPilot corpus (Hovy et al., 2015), tweets from the Tweebank corpus (Kong et al., 2014), and TED talk transcripts from the IWSLT MT evaluation campaigns, obtained from the WIT³ archive (Cettolo et al., 2012).
The shared task has three data conditions: supervised closed, semi-supervised closed, and open. In the supervised closed condition, only the labeled data, the English WordNet lexicon, a provided Brown clustering (Brown et al., 1992) of the 21-million-word Yelp Academic Dataset (Schneider et al., 2014), and any of the ARK Tweet NLP clusters are allowed. The semi-supervised closed condition adds the Yelp Academic Dataset itself to these resources. The open condition allows the use of any available resources. We participated in the supervised closed and open conditions. The evaluation is based on F1 scores for MWE identification, supersense labeling, and their combination.

3 Models

3.1 Input Features
For the open condition, we use all features introduced in Schneider and Smith (2015): a.) basic MWE features, including lemmas, POS tags, word shapes, and features indicating whether the token matches entries in any of several multiword lexicons (WordNet, SemCor, SAID, WikiMwe, English Wiktionary, and Multiword Entries on the Phrases.net website); b.) the provided Brown clusters; and c.) capitalization features, an auxiliary-verb vs. main-verb feature, and unlexicalized WordNet supersense features. Depending on the model, these features are conjoined with the MWE, supersense, or extended-set labels to form zero-order features. For the closed condition, we exclude the features based on multiword lexicons.

3.2 Baseline: Multinomial Logistic Regression
As a baseline, we predict the label of each word from features of that word within its sentence. The multinomial logistic regression model defines the conditional probability of the label of the ith word, denoted Y_i, in a sentence x as:

p(Y_i = y | x) = exp(λ⊤h(x, i, y)) / Σ_{y′} exp(λ⊤h(x, i, y′))   (1)

where h denotes a feature vector describing token i and its relationships with some of its adjacent words in x, conjoined with the label y, and λ denotes a vector of feature weights learned from data. Constraints on labels are not taken into account during training. We incorporate these constraints at test time in a greedy manner: for the ith word, we consider only the labels that satisfy the bigram label constraints given the predicted label for the (i − 1)th word.
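The greedy constrained decoding just described can be sketched as follows; the score table and constraint predicate are hypothetical stand-ins for the trained model and the tag-bigram constraint table:

```python
def greedy_decode(scores, labels, allowed_bigram):
    """Left-to-right greedy decoding for a zero-order model.

    scores[i][y]         -- the model's score for label y at word i
    allowed_bigram(a, b) -- True iff label b may follow label a
    At each word we keep only the labels compatible with the label
    already predicted for the previous word.
    """
    prev, out = None, []
    for position_scores in scores:
        candidates = [y for y in labels
                      if prev is None or allowed_bigram(prev, y)]
        best = max(candidates, key=lambda y: position_scores[y])
        out.append(best)
        prev = best
    return out

# Toy example: I may not directly follow O.
labels = ["O", "B", "I"]
ok = lambda a, b: not (a == "O" and b == "I")
scores = [{"O": 0.1, "B": 0.9, "I": 0.0},
          {"O": 0.2, "B": 0.1, "I": 0.7}]
assert greedy_decode(scores, labels, ok) == ["B", "I"]
```

Because decoding is greedy, an early choice can paint the decoder into a corner; the sequence models below avoid this by searching over whole label sequences.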

3.3 Conditional Random Field
In the linear-chain CRF (Lafferty et al., 2001), the conditional probability of a valid label sequence y for the words of a sentence x is modeled as:

p(y | x) = (1 / Z(x)) exp( Σ_{i=1}^{n} λ⊤h(x, i, y_{i−1}, y_i) )   (2)

where Z(x) normalizes over all valid label sequences and λ is a vector of feature weights shared across all positions (i.e., words) and sentences. The feature vector h contains the zero-order features described above and first-order features, which model the dependencies between the labels of consecutive words. We assume a dummy label y_0 for notational convenience.
In both training and testing, we ensure that the constraints on consecutive labels are satisfied. The label of the ith word depends only on the token sequence, its offset in the sentence, and the labels of the (i−1)th and (i+1)th words. Dynamic programming is used for exact inference; runtime is quadratic in the size of the label set and linear in the sequence length. We maximize the ℓ2-regularized log-likelihood using L-BFGS to learn the feature weights λ:

max_λ Σ_{(x,y)∈D} log p(y | x) − α_1 ‖λ_1‖² − α_2 ‖λ_2‖²   (3)

where D contains all training instances, λ_1 (λ_2) corresponds to the parameters for zero-order (first-order) features, and α_1 (α_2) is the regularization strength for the zero-order (first-order) feature weights. In preliminary experiments, we found that using different regularization strengths for zero-order and first-order features benefits accuracy.
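The exact inference step can be sketched with the standard Viterbi recurrence; node_score and edge_score below are hypothetical stand-ins for the model's zero-order and first-order feature scores, with forbidden bigrams scored as negative infinity:

```python
def viterbi(node_score, edge_score, labels, n):
    """MAP inference for a linear-chain model over n positions.

    node_score(i, y) -- zero-order score of label y at position i
    edge_score(a, b) -- first-order score of the bigram (a, b);
                        float("-inf") encodes a forbidden bigram
    Runtime is O(n * |labels|**2): quadratic in the label-set size
    and linear in the sequence length, as noted above.
    """
    best = {y: node_score(0, y) for y in labels}
    backptrs = []
    for i in range(1, n):
        new_best, ptr = {}, {}
        for y in labels:
            a, s = max(((a, best[a] + edge_score(a, y)) for a in labels),
                       key=lambda t: t[1])
            new_best[y], ptr[y] = s + node_score(i, y), a
        best = new_best
        backptrs.append(ptr)
    # Trace back the best path from the best final label.
    y = max(best, key=best.get)
    path = [y]
    for ptr in reversed(backptrs):
        y = ptr[y]
        path.append(y)
    return path[::-1]

# Toy example: O -> I is forbidden; the best path routes through B.
node = lambda i, y: {0: {"O": 0.0, "B": 1.0, "I": -5.0},
                     1: {"O": 0.0, "B": 0.0, "I": 2.0}}[i][y]
edge = lambda a, b: float("-inf") if (a == "O" and b == "I") else 0.0
assert viterbi(node, edge, ["O", "B", "I"], 2) == ["B", "I"]
```

The same recurrence, with per-position score tables replaced by log-potentials, underlies both max (Viterbi) and sum (forward) computations for the CRF.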

3.4 Double-Chained CRF
We propose a double-chained CRF (2-CRF) that factors the labels into separate MWE and supersense annotations. Such a model was used for joint POS tagging and noun-phrase chunking by Sutton et al. (2007). The model is illustrated in Fig. 1; the heart of the difference is that first-order dependencies are restricted to within the MWE labels or within the supersense labels, not over the combination of the two.
Concretely, the 2-CRF first separates the zero-order features for MWE tags from those for supersense tags. Second, while the traditional chain-structured CRF has a feature for each pair of labels in the extended label set, the 2-CRF introduces first-order features capturing each consecutive pair of MWE labels and, separately, each consecutive pair of supersense labels. This removes some repetitive parameters: for example, instead of separate parameters capturing the relation between consecutive B and I tags paired with every supersense, the 2-CRF has only one such parameter. Moreover, if the weights λ for all features between the m_i and s_i pairs are zero, the 2-CRF is equivalent to two separate CRFs for the two tasks. It therefore has the flexibility to learn the parameters for the two tasks jointly or separately, and we expect this flexibility to yield better generalization.

For a sentence x with a valid label sequence y = (m, s), where m denotes the MWE tag sequence and s the supersense tag sequence, the conditional probability of (m, s) given x is defined as:

p(m, s | x) = (1 / Z(x)) exp( Σ_{i=1}^{n} λ⊤h(x, i, m_{i−1}, m_i, s_{i−1}, s_i) )   (4)

where the feature vector function h decomposes as:

h(x, i, m_{i−1}, m_i, s_{i−1}, s_i) = [ h_M(x, i, m_i); h_S(x, i, s_i); h_{MS}(m_i, s_i); h_{MM}(m_{i−1}, m_i); h_{SS}(s_{i−1}, s_i) ]   (5)

i.e., zero-order MWE features, zero-order supersense features, single-position coupling features, and the two chains' first-order features. As with the CRF, we enforce label constraints on the MWE label sequence at both training and prediction time.
Inference can be carried out exactly using dynamic programming algorithms similar to those used for the CRF. Training is carried out as for the CRF (i.e., ℓ2-regularized log-likelihood; see Eq. 3).
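The factorization can be made concrete with a small scoring sketch; the feature templates and weight map below are simplified placeholders for the paper's much richer feature set:

```python
def score_2crf(tokens, m, s, w):
    """Unnormalized log-score of a joint (MWE, supersense) labeling.

    Per the 2-CRF factorization: zero-order features on each chain, a
    single-position (m_i, s_i) coupling, and separate first-order
    features over consecutive MWE labels and consecutive supersense
    labels. `w` maps feature tuples to weights (absent features -> 0).
    """
    total = 0.0
    for i, tok in enumerate(tokens):
        total += w.get(("mwe-obs", tok, m[i]), 0.0)      # zero-order, MWE chain
        total += w.get(("ss-obs", tok, s[i]), 0.0)       # zero-order, supersense chain
        total += w.get(("mwe-ss", m[i], s[i]), 0.0)      # within-position coupling
        if i > 0:
            total += w.get(("mwe-trans", m[i-1], m[i]), 0.0)  # MWE bigram
            total += w.get(("ss-trans", s[i-1], s[i]), 0.0)   # supersense bigram
    return total

# With all ("mwe-ss", ...) weights at zero, the score decomposes into two
# independent chains -- the flexibility noted in the text.
w = {("mwe-trans", "B", "I"): 2.0, ("mwe-ss", "B", "n.food"): 1.0}
assert score_2crf(["hot", "dog"], ["B", "I"], ["n.food", None], w) == 3.0
```

Note the single ("mwe-trans", "B", "I") weight: it fires regardless of which supersense B is paired with, which is exactly the parameter sharing the factorization buys over the extended-label-set CRF.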

4 Experimental Setup
We compare the performance of the following four models, all using exactly the same input features (§3.1):
• multinomial logistic regression (MLR), as described in §3.2 (a zero-order model)
• the structured perceptron used by Schneider and Smith (2015), with the same set of features (first-order, similar to our CRF)
• the CRF described in §3.3
• the double-chained CRF described in §3.4

We used the AMALGrAM code base for feature extraction (Schneider and Smith, 2015). For hyperparameter tuning, we held out 30% of the DiMSUM training sentences, selected at random, as validation data. Based on preliminary experiments on the validation data, we set the number of L-BFGS iterations for multinomial logistic regression, the CRF, and the 2-CRF to 100, 120, and 120, respectively, and the number of iterations of the averaged perceptron algorithm for the structured perceptron to 10. We also impose a cutoff of 3 on the minimum number of occurrences for a zero-order percept to be considered in the models. We use the validation data to tune the α_1 and α_2 (where applicable) hyperparameters. After tuning, we train the models on the whole DiMSUM training dataset.

5 Results and Discussion
Tables 1 and 2 show the results for the closed and open conditions; the selected hyperparameters α_1 and α_2 are shown for each model. In each table, we show results on our held-out validation data and on the DiMSUM test dataset. The official submitted systems are marked with * in the tables. For the closed condition, the 2-CRF had not completed training by the deadline, so our entry was the CRF; it achieved first place.

An open-source, efficient Cython implementation of our method will be made publicly available at https://github.com/mjhosseini/2-CRF-MWE. AMALGrAM is available at http://www.cs.cmu.edu/~ark/LexSem.
For the open condition, training of the 2-CRF had completed only 80 iterations by the submission deadline, so that model is what was entered (it achieved second place). We report those scores, as well as the slightly improved scores obtained after 120 iterations.
Across the board, there is roughly a 9% decrease in F1 score when we move from the validation data to the DiMSUM test dataset. This is not surprising, because the validation and test data represent different text genres and styles.
We measured the statistical significance of the differences between the structured perceptron (SP) and the other methods, using a randomization test (Yeh, 2000) at the sentence level to estimate the confidence level of each difference (p-value < 0.05). In Tables 1 and 2, we indicate in italics the cases where the improvement over the structured perceptron is significant.
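A sentence-level paired approximate randomization test can be sketched as follows. This sketch compares sums of per-sentence scores; an F1-based test statistic would instead aggregate match counts before computing the metric in each shuffled trial:

```python
import random

def randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Paired approximate randomization test (Yeh, 2000), sketched.

    scores_a[i] and scores_b[i] are per-sentence scores for two systems.
    Each trial swaps each paired score with probability 1/2; the p-value
    estimates how often a difference at least as large as the observed
    one arises by chance under the null hypothesis of no difference.
    """
    rng = random.Random(seed)
    observed = abs(sum(scores_a) - sum(scores_b))
    at_least_as_large = 0
    for _ in range(trials):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                a, b = b, a          # swap this sentence's pair of scores
            diff += a - b
        if abs(diff) >= observed:
            at_least_as_large += 1
    return (at_least_as_large + 1) / (trials + 1)

# Identical systems: every shuffled difference matches the observed zero.
assert randomization_test([1.0, 0.5, 0.0], [1.0, 0.5, 0.0]) == 1.0
# Clearly separated systems: the observed gap is almost never matched.
assert randomization_test([1.0] * 20, [0.0] * 20) < 0.05
```

The +1 in numerator and denominator is a standard smoothing that keeps the estimated p-value strictly positive under finite sampling.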
In the closed condition, the 2-CRF model achieves the highest F1 scores on all evaluation metrics. Interestingly, the structured perceptron improves on MWE identification but suffers on supersenses, relative to the zero-order MLR model. The CRF and 2-CRF improve over MLR on both tasks, with the latter winning overall on validation data and (slightly) on test data.
In the open condition, we see similar patterns except in a few cases: the structured perceptron has the highest F1 score for MWE identification on validation data, and MLR slightly outperforms the 2-CRF in supersense tagging on test data. However, neither difference over the 2-CRF is statistically significant, and the 2-CRF has the highest combined score.
(Table caption: The best result in each column is bolded. Results significant over SP are italicized. The system denoted by * is our official submission for the open condition.)

Finally, we observe that adding the features based on multiword lexicons (moving from the supervised closed to the open condition) improves MWE identification without harming supersense tagging performance. The increase in MWE identification performance is statistically significant across all methods and test datasets.

6 Conclusions
We presented the results of four models for the joint prediction of MWE annotations and supersense annotations: multinomial logistic regression, the structured perceptron, the CRF, and the double-chained CRF. We found that the double-chained CRF performs well on both tasks. We also showed that, consistent with past work, adding features based on multiword lexicons improves the performance of all models.