Lisbon: Evaluating TurboSemanticParser on Multiple Languages and Out-of-Domain Data

As part of the SemEval-2015 shared task on Broad-Coverage Semantic Dependency Parsing, we evaluate the performance of last year's system (TurboSemanticParser) on multiple languages and out-of-domain data. Our system is characterized by a feature-rich linear model, which includes scores for first- and second-order dependencies (arcs, siblings, grandparents, and co-parents). To decode this second-order model, we solve a linear relaxation of the problem using alternating directions dual decomposition (AD³). The experiments show that, even though the parser's performance on Chinese and Czech attains around 80% (not too far from the English performance), domain shift is a serious issue, suggesting domain adaptation as an interesting avenue for future research.


Introduction
Recent years have witnessed continuous progress in statistical multilingual models for syntax, thanks to shared tasks such as CoNLL 2006-7 (Buchholz and Marsi, 2006; Nivre et al., 2007) and, more recently, SPMRL 2013-14 (Seddah et al., 2013; Seddah et al., 2014). As a global trend, we observe that models that incorporate rich global features are typically more accurate, even if pruning is necessary or decoding needs to be approximate (Koo and Collins, 2010; Bohnet and Nivre, 2012; Martins et al., 2009, 2013). The same rationale applies to semantic dependency parsing, also a structured prediction problem, but one where the output variable is a semantic graph rather than a syntactic tree. Indeed, the best performing systems in last year's shared task on broad-coverage semantic dependency parsing follow this principle (Oepen et al., 2014). This year, a new challenge was put forth: how to handle multiple languages and out-of-domain data?
Our proposed parser (§2) is essentially the same one we submitted to this SemEval task in the previous year (Martins and Almeida, 2014), where we achieved the top score in the open challenge and the second best in the closed track. This year, we report results on new out-of-domain and multilingual data (namely Czech and Chinese, in addition to English). For English, we participated in the closed and open tracks, using as additional resources the syntactic dependency annotations provided by the organizers. For Czech and Chinese, we addressed only the closed track, since no companion data were provided for these languages. We did not participate in the gold track, which uses gold-standard syntactic annotations, and we did not address the prediction of predicate senses.

Semantic Parser
For this year's shared task, we re-ran the semantic parser that we developed last year, fully described in Martins and Almeida (2014), on the new datasets. Since this parser was designed to be multilingual, it was straightforward to apply it to the languages introduced this year (Chinese and Czech), as well as to the out-of-domain data.
We briefly describe our semantic parser (which we dub TurboSemanticParser and release as open-source software¹), and refer the interested reader to Martins and Almeida (2014) for further details.

[Figure 1: Parts considered by our semantic parser. The top row illustrates the basic parts, representing the event that a word is a predicate, or the existence of an arc between a predicate and an argument, possibly labeled with a semantic role. Our second-order model looks at pairs of arcs: arcs in a grandparent relationship, arguments of the same predicate, predicates sharing the same argument, and consecutive versions of the latter two.]
The parser was built as an extension of a recent dependency parser, TurboParser (Martins et al., 2010, 2013), with the goal of performing semantic parsing in any of the three formalisms considered in the shared task (DM, PAS, and PSD). We followed prior work in semantic role labeling (Toutanova et al., 2005; Johansson and Nugues, 2008; Das et al., 2012; Flanigan et al., 2014) by adding constraints and modeling interactions among arguments within the same frame; however, we went beyond such sibling interactions to consider more complex grandparent and co-parent structures, effectively correlating different predicates. The overall set of parts used by our parser is illustrated in Figure 1. Note that, by using only a subset of the parts (predicate, arc, labeled arc, and sibling parts), the semantic parser decodes each predicate frame independently of the other predicates; it is the co-parent and grandparent parts that create inter-dependence among predicates. We analyze the effect of these dependencies in the experimental section (§3).
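To make this factorization concrete, the following is a minimal sketch of the part types as plain Python data structures. The class and field names are ours, purely for illustration; the released TurboSemanticParser is a separate implementation.

```python
from dataclasses import dataclass

# Part types scored by the model. Predicate, arc, labeled-arc, and sibling
# parts factor over individual predicate frames; grandparent and co-parent
# parts couple different predicates together.

@dataclass(frozen=True)
class Predicate:      # word p acts as a predicate
    p: int

@dataclass(frozen=True)
class Arc:            # predicate p takes word a as an argument
    p: int
    a: int

@dataclass(frozen=True)
class LabeledArc:     # same arc, decorated with a semantic role
    p: int
    a: int
    role: str

@dataclass(frozen=True)
class Sibling:        # two arguments a1, a2 of the same predicate p
    p: int
    a1: int
    a2: int

@dataclass(frozen=True)
class Grandparent:    # g -> p -> a: an argument that is itself a predicate
    g: int
    p: int
    a: int

@dataclass(frozen=True)
class CoParent:       # two predicates p1, p2 sharing the same argument a
    p1: int
    p2: int
    a: int
```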
For each part in our model (shown in Figure 1), we computed binary features based on various combinations of lexical forms, lemmas, POS tags, and syntactic dependency relations of words related to the corresponding predicates and arguments. Most of these features were taken from TurboParser (Martins et al., 2013), and others were inspired by the semantic parser of Johansson and Nugues (2008).
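As an illustration, here is a hypothetical sketch of a few arc feature templates. The attribute and template names are ours; the full template set described in Martins et al. (2013) is considerably larger.

```python
from typing import NamedTuple, List

class Sentence(NamedTuple):
    words: List[str]
    lemmas: List[str]
    pos: List[str]
    syn_rel: List[str]   # incoming syntactic dependency label per token

def arc_features(sent: Sentence, p: int, a: int) -> List[str]:
    """Illustrative binary feature templates for an arc part between
    predicate p and argument a (a small, hypothetical subset)."""
    return [
        f"P_WORD={sent.words[p]}",
        f"A_WORD={sent.words[a]}",
        f"P_LEMMA={sent.lemmas[p]}",
        f"A_LEMMA={sent.lemmas[a]}",
        f"P_POS={sent.pos[p]}",
        f"A_POS={sent.pos[a]}",
        f"P_POS+A_POS={sent.pos[p]}+{sent.pos[a]}",
        f"A_SYNREL={sent.syn_rel[a]}",       # syntactic relation of the argument
        f"DIST={min(abs(p - a), 10)}",       # bucketed token distance
        f"DIR={'L' if a < p else 'R'}",      # direction of attachment
    ]
```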
To handle all these parts, we formulate parsing as a global optimization problem and solve a relaxation of it with AD³ (Martins et al., 2011), a fast dual decomposition algorithm in which several simple local subproblems are solved iteratively. With this strategy, we achieve top accuracies at parsing speeds of around 1,000 tokens per second. See Martins and Almeida (2014) for details on the model, features, and decoding procedure.
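To convey the structure of the decoder, the sketch below shows dual decomposition in its simpler projected-subgradient form; AD³ itself replaces the local maximizations with quadratic subproblems and converges faster, but the overall loop is the same. All names are illustrative.

```python
import numpy as np

def dual_decomposition(local_solvers, n_vars, n_iters=500, eta0=1.0):
    """Projected-subgradient dual decomposition over shared 0/1 variables.

    Each solver maps a multiplier vector lambda_m to its local argmax over
    the shared variables, i.e. argmax_z f_m(z) + lambda_m . z. The loop
    repeatedly solves the local subproblems, then nudges the multipliers
    so that the local solutions agree.
    """
    lambdas = [np.zeros(n_vars) for _ in local_solvers]
    avg = np.zeros(n_vars)
    for t in range(n_iters):
        sols = [solve(lam) for solve, lam in zip(local_solvers, lambdas)]
        avg = np.mean(sols, axis=0)
        if all(np.array_equal(s, sols[0]) for s in sols):
            return sols[0]              # consensus reached: exact solution
        eta = eta0 / (1.0 + t)          # decaying step size
        for lam, s in zip(lambdas, sols):
            lam -= eta * (s - avg)      # penalize disagreement with the average
    return avg                          # fractional solution of the relaxation
```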

Experimental Results
All models were trained by running 10 epochs of max-loss MIRA with C = 0.01 (Crammer et al., 2006). The cost function takes into account mismatches between predicted and gold dependencies, with a cost c_P on incorrectly predicted labeled arcs (false positives) and a cost c_R on missed gold labeled arcs (false negatives). These values were set through cross-validation on the dev set, yielding c_P = 0.4 and c_R = 0.6 in all runs, except for the English PSD dataset in the closed track, for which c_P = 0.3 and c_R = 0.7.
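Concretely, this cost is a weighted Hamming loss over labeled arcs; a minimal sketch, with arcs represented as sets of (predicate, argument, role) triples:

```python
def weighted_hamming_cost(predicted, gold, c_p=0.4, c_r=0.6):
    """Cost between predicted and gold sets of labeled arcs, where each
    element is a (predicate, argument, role) triple. False positives are
    charged c_p; missed gold arcs (false negatives) are charged c_r."""
    return c_p * len(predicted - gold) + c_r * len(gold - predicted)
```

Since c_R > c_P in most runs, missed gold arcs are penalized more heavily than spurious ones, which biases cost-augmented training toward recall.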
As in our previous work, we speed up decoding by training a probabilistic unlabeled first-order pruner and discarding the arcs whose posterior probability is below 10⁻⁴ (see the sketch after this paragraph). This significantly reduces the search space with a very small drop in recall. Table 1 shows our final results on the test set, for a model trained on the train and development partitions. Note that we do not report scores for complete predications, since we did not predict predicate senses. Our system achieved the best final score in 3 of the 4 tracks for English, and in the in-domain closed track for Czech. In the remaining 3 tracks we scored relatively close to the best system (Peking), which consists of an ensemble of various methods. For all languages, the runtimes are on par with last year's submission (around 1,000 tokens per second).
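A minimal sketch of the pruning step mentioned above (the interface is hypothetical):

```python
def prune_arcs(arc_posteriors, threshold=1e-4):
    """Given posterior probabilities from the first-order unlabeled pruner
    (a dict mapping (predicate, argument) pairs to probabilities), keep
    only the arcs above the threshold; these define the reduced search
    space for the full second-order model."""
    return {arc for arc, p in arc_posteriors.items() if p >= threshold}
```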
As expected, the scores obtained on out-of-domain data are significantly below those obtained on in-domain data. This degradation is particularly striking for Czech, with F₁ scores dropping by more than 15%. This suggests that domain adaptation (Blitzer et al., 2006; Daumé III, 2007) is an interesting research avenue for future work. In addition, as found last year for English, the gap between labeled and unlabeled scores is much higher in the PSD formalism (for English and Czech) than in the DM and PAS formalisms (for English and Chinese).
Finally, to assess the importance of the second-order features, Table 2 reports experiments on the dev set that progressively add several groups of features. We can see that second-order features provide valuable information that improves the final scores. In particular, the higher-order features are extremely useful for Chinese and Czech, where we observe gains of 1.5-2.0% over a sibling model that factors over predicates.

Conclusions
Our system, which is inspired by prior work in syntactic parsing, implements a linear model with second-order features and is able to model interactions between siblings, grandparents, and co-parents. We have shown empirically that, for all three languages, second-order features that correlate multiple predicates have a strong impact on the final scores. However, there is a large drop in accuracy when moving to out-of-domain data.