SRL4ORL: Improving Opinion Role Labeling Using Multi-Task Learning with Semantic Role Labeling

For over a decade, machine learning has been used to extract opinion-holder-target structures from text to answer the question “Who expressed what kind of sentiment towards what?”. Recent neural approaches do not outperform the state-of-the-art feature-based models for Opinion Role Labeling (ORL). We suspect this is due to the scarcity of labeled training data and address this issue using different multi-task learning (MTL) techniques with a related task which has substantially more data, i.e. Semantic Role Labeling (SRL). We show that two MTL models improve significantly over the single-task model for labeling of both holders and targets, on the development and the test sets. We found that the vanilla MTL model, which makes predictions using only shared ORL and SRL features, performs the best. With deeper analysis we determine what works and what might be done to make further improvements for ORL.


(1) Australia said [it] H [feared] Oneg [violence] T if voters thought the election had been stolen.
Examples are drawn from MPQA (Wiebe et al., 2005).
As the commonly accepted benchmark corpus MPQA (Wiebe et al., 2005) uses span-based annotations to represent opinion entities (opinions, holders and targets), the task is usually approached with sequence labeling techniques and the BIO encoding scheme (Choi et al., 2006; Yang and Cardie, 2013; Katiyar and Cardie, 2016). Initially, pipeline models were proposed which first predict opinion expressions and then, given an opinion, label its opinion roles, i.e. holders and targets (Kim and Hovy, 2006; Johansson and Moschitti, 2013). Pipeline models have since been replaced by so-called joint models that simultaneously identify all opinion entities and predict which opinion role is related to which opinion (Choi et al., 2006; Yang and Cardie, 2013; Katiyar and Cardie, 2016). Recently, an LSTM-based joint model was proposed (Katiyar and Cardie, 2016) that, unlike prior work (Choi et al., 2006; Yang and Cardie, 2013), does not depend on external resources (such as syntactic parsers or named entity recognizers). The neural variant does not outperform the feature-based CRF model (Yang and Cardie, 2013) in Opinion Role Labeling (ORL).
Both the neural and the CRF joint models achieve about 55% F1 score for predicting which targets relate to which opinions in MPQA. Thus, these models are not yet ready to answer the question this line of research is usually motivated with: Who expressed what kind of sentiment towards what? Our goal is to investigate the limitations of neural models in solving different subtasks of fine-grained opinion analysis (FGOA) on MPQA and to gain a better understanding of what is solved and what is next.
We suspect that one of the fundamental obstacles for neural models trained on MPQA is its small size. One way to address the scarcity of labeled data is to use multi-task learning (MTL) with appropriate auxiliary tasks. A promising auxiliary task candidate for ORL is Semantic Role Labeling (SRL), the task of predicting the predicate-argument structure of a sentence, which answers the question Who did what to whom, where and when? Table 1 illustrates the output of the SRL demo for example (1), following the PropBank SRL scheme (Palmer et al., 2005).
[Table 1: Output of the SRL demo. Rows give per-token PropBank labels (A0, A1, AM-ADV) for the predicates of example (1), including fear.01, think.01 and steal.01.]
SRL4ORL. The semantic roles of the predicate fear (marked blue bold) correspond to the opinion roles H and T, according to MPQA. For this reason, the output of SRL systems has been commonly used for feature-based FGOA models (Kim and Hovy, 2006; Johansson and Moschitti, 2013; Choi et al., 2006; Yang and Cardie, 2013). Additionally, a considerable amount of training data is available for training SRL models (Table 2 in Sec. 3), which made neural SRL models successful (Zhou and Xu, 2015; Yang and Mitchell, 2017).
Obstacles. Although SRL is similar in nature to ORL, it cannot solve ORL in all cases (Ruppenhofer et al., 2008). In example (2), the holder and target of the predicate please correspond to the A1 and A0 semantic roles respectively, whereas for the predicate fear in (1) the holder and target correspond to A0 and A1 respectively. We took this observation into account when deciding on an appropriate MTL model by splitting its parameters into shared and task-specific ones (i.e. hard-parameter sharing).
(2) [I] A1 H am very [pleased] Opos that [the Council has now approved the Kyoto Protocol thus enabling the EU to proceed with its ratification] A0 T .
A further obstacle for properly exploiting SRL training data with MTL could be specificities, inconsistency and incompleteness of the MPQA annotations. In example (3), Rice expressed his negative sentiment towards the three countries in question by setting the criteria which states something negative about those countries: they are repressive and grave human rights violators [...]. In this case, the model should not pick any local semantic role for the target.
(3) [...] repressive and grave human rights violators, and aggressively seeking weapons [...].
In examples (4-5), the same opinion expression concerned realizes different scopes for the target. A model which exploits SRL knowledge could be biased to always label targets as complete SRL role constituents, as in example (5).
The examples above show that incorporating SRL knowledge via multi-task learning is a reasonable way to improve ORL, but at the same time they alert us that given the specificities of MPQA and ORL annotations in general, it is not obvious whether MTL can overcome divergences in the annotation schemes of opinion and semantic role labeling. We investigate this research question by adopting one of the recent successful architectures for SRL (Zhou and Xu, 2015) and experiment with different multi-task learning frameworks.
Our contributions are: (i) we adapt a recently proposed neural SRL model for ORL, (ii) we enhance the model using different MTL techniques with SRL to tackle the problem of scarcity of labeled data for ORL, (iii) we show that most of the MTL models improve over the single-task model for labeling of both holders and targets on development and test sets, and two of them yield significant improvements, (iv) by deeper analysis we provide a better understanding of what is solved and where to head next for neural ORL.

Neural MTL for SRL and ORL
Neural multi-task learning (MTL) receives a lot of attention and new MTL architectures emerge regularly. Yet there is no clear consensus on which MTL architecture to use under which conditions. We experiment with well-received architectures that could adapt to different cases of ORL from Section 1. As a general neural architecture for single- and multi-task learning we use the recently proposed SRL model (Zhou and Xu, 2015) (Z&X-STL), which successfully labels semantic roles without any syntactic guidance. This model consists of a stack of bi-directional LSTMs and a CRF which makes the final prediction. The inputs to the first LSTM are not only token embeddings but also three additional features: an embedding of the predicate, an embedding of the context of the predicate and an indicator feature (1 if the current token is in the predicate context, 0 otherwise). Thus, every sentence is processed as many times as there are predicates in it. Adapting this model to the labeling of opinion roles is straightforward; the only differences are that opinion expressions can be multiword and that only two opinion roles are assigned.
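As a minimal sketch of the input described above, the function below builds the four per-token inputs for one predicate or multiword opinion expression. The function name, the window of one token on each side, and the representation of the context as a joined string are illustrative assumptions, not details taken from Zhou and Xu (2015).

```python
from typing import List, Tuple

def build_inputs(tokens: List[str], pred_start: int, pred_end: int,
                 window: int = 1) -> List[Tuple[str, str, str, int]]:
    """Return (token, predicate, predicate context, indicator) per token.

    pred_start/pred_end are inclusive token offsets, so a multiword
    opinion expression is handled like a one-word predicate.
    The window size is a hypothetical choice for illustration.
    """
    lo = max(0, pred_start - window)
    hi = min(len(tokens) - 1, pred_end + window)
    predicate = " ".join(tokens[pred_start:pred_end + 1])
    context = " ".join(tokens[lo:hi + 1])
    # Indicator is 1 for tokens inside the predicate context, else 0.
    return [(tok, predicate, context, 1 if lo <= i <= hi else 0)
            for i, tok in enumerate(tokens)]

sentence = "Australia said it feared violence".split()
# One pass over the sentence per predicate: "said" (1) and "feared" (3).
passes = [build_inputs(sentence, i, i) for i in (1, 3)]
```

In a real model each of the first three fields would be mapped to an embedding before being fed to the first LSTM; the sketch only shows how the sentence is re-encoded once per predicate.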
MTL techniques aim to learn several tasks jointly by leveraging knowledge from all tasks. In the context of neural networks, MTL is commonly used in such a way that it is predefined which layers have tied parameters and which are task-specific (i.e. hard-parameter sharing). There are various ways of defining which parameters should be shared and how to train them.
Fully-shared (FS) MTL model. A fully-shared model (Fig. 1) shares all parameters of the general model except the output layer. Each task has a task-specific output layer which makes the prediction based on the representation produced by the final LSTM. When training on a mini-batch of a certain task, parameters of the output layer of the other tasks are not updated.
Hierarchical MTL (H-MTL) model. For NLP applications, a given (high-level) task is often supposed to benefit from another (low-level) task more than the other way around, e.g. parsing from POS tagging. This intuition led to the design of hierarchical MTL models (Søgaard and Goldberg, 2016; Hashimoto et al., 2017) in which predictions for low-level tasks are not made on the basis of the representation produced at the final LSTM, but on the representation produced by a lower-layer LSTM (Fig. 2). Task-specific layers atop shared layers could potentially give the model more power to distinguish or ignore certain semantic roles. If so, this MTL model is more suitable for examples like (2) and (3) (Sec. 1).
Shared-private (SP) MTL model. In the shared-private model, in addition to the stack of shared LSTMs, each task has a stack of task-specific LSTMs (Fig. 3). Representations at the outermost shared LSTM and the task-specific LSTM are concatenated and passed to the task-specific output layer. The ORL representation produced independently from SRL gives the model the ability to utilize both the shared and the entirely task-specific information. For labeling of targets, it is expected that for examples (1) and (5) the model relies mostly on the shared representation, for examples (2) and (4) on both shared and ORL-specific representations, and for example (3) solely on the ORL-specific representation.
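The parameter layouts of the FS, H-MTL and SP models can be summarized schematically. The sketch below lists, for each architecture, which parameter groups would receive updates for a mini-batch of a given task. The group names are hypothetical labels, and treating the upper H-MTL layers as ORL-specific is one reading of the description above, not a specification of the implementation.

```python
TASKS = ("srl", "orl")

def updated_groups(model: str, task: str) -> set:
    """Parameter groups trained on a mini-batch of `task` (schematic)."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task}")
    # Every variant updates the shared LSTM stack and the output layer
    # of the task the current mini-batch belongs to.
    groups = {"shared_lstms", f"{task}_output"}
    if model == "FS":
        return groups
    if model == "H-MTL":
        # Assumption: SRL is predicted from the lower shared layers, so
        # only ORL batches pass through the layers stacked on top.
        return groups | ({"orl_upper_lstms"} if task == "orl" else set())
    if model == "SP":
        # Each task additionally has its own private LSTM stack.
        return groups | {f"{task}_private_lstms"}
    raise ValueError(f"unknown model: {model}")
```

For example, `updated_groups("FS", "orl")` contains the shared LSTMs and the ORL output layer but not the SRL output layer, which is the essence of hard-parameter sharing.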
Adversarial shared-private (ASP) model. The limitation of the SP model is that it does not prevent the shared layers from capturing task-specific features. To prevent this, ASP extends the SP model with a task discriminator. The task discriminator (Fig. 3, marked red) predicts to which task the current batch of data belongs, based on the representation produced by the shared LSTMs. If the shared LSTMs are task-invariant, the discriminator should perform badly. Thus, we update the shared parameters to maximize the discriminator's cross-entropy loss. At the same time we want the discriminator to challenge the shared LSTMs, so we update the discriminator's parameters to minimize its cross-entropy loss. This min-max optimization is known as adversarial training and has recently gained a lot of attention for NLP applications (Chen et al., 2017; Kim et al., 2017; Qin et al., 2017; Wu et al., 2017; Gui et al., 2017; Li et al., 2017; Joty et al., 2017).
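The two opposing objectives can be illustrated with a small numeric sketch, assuming a binary discriminator that outputs the probability that a batch is an ORL batch. The function names and the weight `lam` are illustrative assumptions; in practice the sign flip is often realized with a gradient reversal layer, which the description above does not specify.

```python
import math

def discriminator_loss(p_orl: float, is_orl: bool) -> float:
    """Cross-entropy of the task discriminator for one batch."""
    p_true = p_orl if is_orl else 1.0 - p_orl
    return -math.log(p_true)

def shared_layer_loss(task_loss: float, p_orl: float, is_orl: bool,
                      lam: float = 1.0) -> float:
    """Objective minimized by the shared LSTMs: the task loss minus the
    discriminator loss, so lowering it raises the discriminator's loss.
    `lam` is a hypothetical trade-off weight."""
    return task_loss - lam * discriminator_loss(p_orl, is_orl)

# A confident, correct discriminator (p = 0.9 on an ORL batch) has a low
# loss; an uninformative one (p = 0.5) has a higher loss, which is the
# state the shared layers are pushed towards.
confident = discriminator_loss(0.9, True)
uninformed = discriminator_loss(0.5, True)
```

The discriminator's own parameters would be trained to minimize `discriminator_loss`, while the shared layers minimize `shared_layer_loss`.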

Experimental setup

Datasets
For SRL we use the newswire CoNLL-2005 shared task dataset (Carreras and Màrquez, 2005), annotated with PropBank predicate-argument structures. Sections 2-21 of the WSJ corpus (Charniak et al., 2000) are used for training and section 24 as the dev set. The test set consists of section 23 of WSJ and three sections of the Brown corpus. For ORL we use the manually annotated MPQA 2.0 corpus (Wiebe et al., 2005; Wilson, 2008). It mostly contains news documents, but also travel guides, transcripts of spoken conversations, emails, fundraising letters, textbook chapters and translations of Arabic source texts.
We report detailed pre-processing of MPQA and data statistics in the Supplementary Material.

Evaluation metrics
For both tasks we adopt evaluation metrics from prior work. For SRL, precision is defined as the proportion of semantic roles predicted by a system which are correct, recall is the proportion of gold roles which are predicted by a system, and F1 score is the harmonic mean of precision and recall.
In the case of ORL, we report 10-fold CV and repeated 4-fold CV with binary F1 score and proportional F1 score, for holders and targets separately. Binary precision is defined as the proportion of predicted holders (targets) that overlap with the gold holder (target); binary recall is the proportion of gold holders (targets) for which the model predicts an overlapping holder (target). Proportional recall measures the proportion of the overlap between a gold holder (target) and an overlapping predicted holder (target); proportional precision measures the proportion of the overlap between a predicted holder (target) and an overlapping gold holder (target). F1 scores are the harmonic means of the corresponding precision and recall.
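The binary and proportional measures can be sketched as follows for one role type (holders or targets). This is a minimal illustration, assuming that spans are inclusive token offsets, that overlap is measured in tokens, and that counts are pooled over all opinion expressions before computing precision and recall; these aggregation choices and all names are assumptions rather than the exact evaluation script.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # inclusive token offsets (start, end)

def tokens_of(spans: List[Span]) -> Set[int]:
    return {i for s, e in spans for i in range(s, e + 1)}

def overlap_fraction(span: Span, others: List[Span]) -> float:
    """Share of `span`'s tokens covered by any span in `others`."""
    covered = tokens_of([span]) & tokens_of(others)
    return len(covered) / (span[1] - span[0] + 1)

def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r) if p + r else 0.0

def score(instances: List[Tuple[List[Span], List[Span]]]) -> dict:
    """instances: (gold spans, predicted spans) per opinion expression."""
    bin_p = bin_r = prop_p = prop_r = 0.0
    n_pred = n_gold = 0
    for gold, pred in instances:
        for p in pred:  # precision side: how good is each prediction?
            frac = overlap_fraction(p, gold)
            bin_p += frac > 0
            prop_p += frac
        for g in gold:  # recall side: how well is each gold role found?
            frac = overlap_fraction(g, pred)
            bin_r += frac > 0
            prop_r += frac
        n_pred += len(pred)
        n_gold += len(gold)
    bp, br = bin_p / max(n_pred, 1), bin_r / max(n_gold, 1)
    pp, pr = prop_p / max(n_pred, 1), prop_r / max(n_gold, 1)
    return {"binary_f1": f1(bp, br), "proportional_f1": f1(pp, pr)}
```

For a gold span covering tokens 2-5 and a prediction covering tokens 4-5, the binary F1 is 1 because the spans overlap, whereas the proportional recall is only 0.5, which is why the proportional score is more sensitive to the predicted scope of a role.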

Training details
We evaluate our models using two evaluation settings. First, we follow Katiyar and Cardie (2016), who set aside 132 documents for development and used the remaining 350 documents for 10-fold CV. However, in the 10-fold CV setting, the size of the test sets is 3 times smaller than the dev set size (Table 2, row 3), which results in high-variance estimates on the test sets. Therefore we additionally evaluate our models with 4-fold CV. We set aside 100 documents for development and use 25% of the remaining documents for testing. The resulting test sets are comparable in size to the dev set (Table 2, row 2). We run 4-fold CV twice with two different random seeds. We do not tune hyperparameters (HPs), but follow suggestions proposed in the thorough HP study for sequence labeling tasks (Reimers and Gurevych, 2017). HPs can be found in the Supplementary Material.
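The repeated 4-fold setting can be sketched at the document level as below. The sketch assumes that the random seed controls the shuffling of documents (and therefore also which 100 documents form the dev set); the text does not state whether the seed affects the data split, the model initialization, or both. The function name, the seed values and the placeholder document ids are hypothetical.

```python
import random
from typing import List, Tuple

def repeated_cv(doc_ids: List[str], n_dev: int = 100, k: int = 4,
                seeds: Tuple[int, ...] = (0, 1)):
    """Yield (seed, fold, train, dev, test) document splits."""
    for seed in seeds:
        docs = sorted(doc_ids)
        random.Random(seed).shuffle(docs)  # seeded, reproducible shuffle
        dev, rest = docs[:n_dev], docs[n_dev:]
        # Each of the k folds holds roughly 25% of the remaining documents.
        folds = [rest[i::k] for i in range(k)]
        for i, test in enumerate(folds):
            train = [d for j, f in enumerate(folds) if j != i for d in f]
            yield seed, i, train, dev, test

# 482 placeholder ids: 132 + 350, the document counts mentioned above.
splits = list(repeated_cv([f"doc{i}" for i in range(482)]))
```

With these numbers each test fold contains 95 or 96 documents, which is close to the dev set size of 100 and gives 8 evaluation trials in total.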

Results
We evaluate all models after every (train size / batch size) iterations on the ORL dev set and save them if they achieve a higher arithmetic mean of proportional F1 scores of holders and targets on the ORL dev set. The saved models are used for testing.
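The model selection rule can be written down in a few lines. The following sketch only illustrates the criterion (a checkpoint is saved whenever the mean of the two proportional F1 scores on the dev set improves); the function name and the scores in `history` are made up for illustration.

```python
def select_checkpoints(dev_scores):
    """dev_scores: iterable of (step, holder_prop_f1, target_prop_f1).

    Returns the steps at which a checkpoint would be saved, i.e. whenever
    the arithmetic mean of the two proportional F1 scores improves.
    """
    best, saved = float("-inf"), []
    for step, holder_f1, target_f1 in dev_scores:
        mean_f1 = (holder_f1 + target_f1) / 2
        if mean_f1 > best:
            best = mean_f1
            saved.append(step)
    return saved

# Made-up dev scores for illustration only.
history = [(1, 60.0, 50.0), (2, 62.0, 49.0), (3, 61.0, 49.5), (4, 63.0, 52.0)]
```

Here the checkpoint at step 2 is saved although the target score drops, because the mean of the two scores still increases; the last saved checkpoint is the one used for testing.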
We report the mean of F1 scores over 10 folds and the standard deviation (appears as a subscript) of all models in Table 3. We report the mean of F1 scores over 4 folds and 2 different seeds (8 evaluations) and the standard deviation of all models in Table 4. Evaluation metrics follow Section 3.2. We mark significant differences between MTL models and the single-task (Z&X-STL) model, observed using a Kolmogorov-Smirnov significance test (p < 0.05) (Massey Jr, 1951), with • in superscript and between the FS-MTL model and other MTL models with 3.
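For readers unfamiliar with the test, the two-sample Kolmogorov-Smirnov statistic compares the empirical distributions of two sets of scores. The sketch below computes the statistic and applies the standard large-sample critical value for a significance level of 0.05. This is only an approximation for samples as small as 8 or 10 scores, and it is not necessarily the procedure used for the reported results; a library routine such as `scipy.stats.ks_2samp` would normally be used instead. The scores are made up.

```python
import math
from typing import Sequence

def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest gap between the two empirical CDFs."""
    points = sorted(set(a) | set(b))
    return max(abs(sum(x <= t for x in a) / len(a)
                   - sum(x <= t for x in b) / len(b)) for t in points)

def ks_significant(a: Sequence[float], b: Sequence[float],
                   c_alpha: float = 1.358) -> bool:
    """Large-sample critical-value check; c_alpha = 1.358 corresponds
    to alpha = 0.05. Only an approximation for small samples."""
    n, m = len(a), len(b)
    return ks_statistic(a, b) > c_alpha * math.sqrt((n + m) / (n * m))

# Made-up F1 scores of two models over 8 evaluation trials.
stl = [50.0, 51.0, 52.0, 53.0, 54.0, 55.0, 56.0, 57.0]
mtl = [58.0, 59.0, 60.0, 61.0, 62.0, 63.0, 64.0, 65.0]
```

With fully separated score sets as in this example the statistic reaches its maximum of 1, so the difference is flagged as significant.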
STL vs. MTL. In the 10-fold CV evaluation setting (Table 3), the FS-MTL and the H-MTL models improve over the Z&X-STL model in all evaluation measures, for both holders and targets. When evaluated in the repeated 4-fold CV setting (Table 4), all MTL models improve over the Z&X-STL model in all evaluation measures, for both holders and targets.
The FS-MTL and the H-MTL models improve significantly in all evaluation measures, for both holders and targets, on both dev and test sets, when evaluated with repeated 4-fold CV. With 10-fold CV the improvements are also significant, except for targets on the test set. This is probably due to the small size of the test sets (Table 2, row 3), which results in a high-variance estimate. Indeed, standard deviations on the 10-fold CV test sets are always much higher compared to the dev set or to the test sets of 4-fold CV.
It is not surprising that larger improvements are visible in the labeling of holders. They are usually short, less ambiguous and often presented with the A0 semantic role, whereas annotating targets is a challenging task even for humans. Larger improvements are visible for proportional F1 score than for binary F1 score. That is, more data and SRL knowledge help the model to better annotate the scope of opinion roles.
Comparing MTL models. In Section 2 we introduced MTL models with task-specific LSTM layers, hypothesizing that these layers should give MTL models more power to adapt to a variety of potentially problematic cases that we illustrated in the Introduction. However, our results show that the FS-MTL model performs significantly better than or comparably to MTL models that include task-specific layers. Reimers and Gurevych (2017) show that MTL is especially sensitive to the selection of HPs. Thus, a firm comparison of the different MTL models requires thorough HP optimization, to properly control the number of parameters and regularization of the models. We leave HP optimization for future work.

Analysis
Our aim in this section is to analyze what the proposed models are good at, in which ways MTL improves over the single-task ORL model and what could be done to achieve further progress.
1 Malinga FS,ZX said according to the guidelines in the booklet, the election had been legitimate .
2 movie um-hum that 's interesting so that was a good movie too well do you FS,ZX think we've covered baseball i think so okay well have a good night
3 The nation FS,ZX should certainly be concerned about the plans to build a rocket launch pad , work on the infrastructure for which is due to start in 2002 , with launches beginning from 2004 .
4 Bam on Sunday said she FS,ZX believed Zimbabwe's election was not free and fair , adding they were not in line with international standards as well as those of her organisation .
5 The majority report , endorsed only by the ANC , said the observer mission FS,ZX had noted that over three million Zimbabweans had cast their votes and this substantially represented the will of the people .

He said those who thought the election process would be rigged were supporters of the MDC party , adding that they were prejudging and wanted to direct the process FS,ZX .
5 People in the rural areas support the ruling party because our party has been genuine on its policy on land reform FS,ZX .

We evaluate the FS-MTL and the Z&X-STL models on the ORL dev set using 4-fold CV repeated twice with different seeds (8 evaluation trials). We say that a model predicts a role of a given opinion expression correctly if the model predicts a role that overlaps with the correct role in at least 6 out of 8 evaluation trials. If a model predicts a role that overlaps with the correct role in at most 2 out of 8 trials, we say that the model predicts the role incorrectly. The requirement of 6-8 (in)correct predictions reduces the risk of analyzing inconsistent predictions and enables us to draw firmer conclusions. We analyze the following scenarios: (i) both the FS-MTL model and the Z&X-STL model make correct predictions (Tables 5-6), (ii) the FS-MTL model makes a correct prediction, while the Z&X-STL model makes an incorrect prediction (Tables 11-12), (iii) both models make wrong predictions (Tables 9-10). In the following, we categorize predictions in case (i) as easy cases, and predictions in case (iii) as hard cases.
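The 6-out-of-8 criterion and the three scenarios can be made explicit with a short sketch. The function names and the category labels are hypothetical; the thresholds are the ones stated in the text for 8 evaluation trials.

```python
def consistency(n_overlap: int) -> str:
    """Label one model's predictions for one role over 8 trials, given
    the number of trials in which the prediction overlaps the gold role."""
    if not 0 <= n_overlap <= 8:
        raise ValueError("n_overlap must be between 0 and 8")
    if n_overlap >= 6:
        return "correct"
    if n_overlap <= 2:
        return "incorrect"
    return "inconsistent"  # 3-5 overlaps: excluded from the analysis

def scenario(fs_overlap: int, zx_overlap: int) -> str:
    fs, zx = consistency(fs_overlap), consistency(zx_overlap)
    if fs == zx == "correct":
        return "easy"          # case (i)
    if fs == "correct" and zx == "incorrect":
        return "mtl_helps"     # case (ii)
    if fs == zx == "incorrect":
        return "hard"          # case (iii)
    return "not_analyzed"
```

For instance, a role that FS-MTL recovers in 6 trials and Z&X-STL in 2 trials falls under case (ii), while a role recovered in 4 trials by either model is left out of the analysis.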
In Tables 5-6 and 9-12, the opinion expression is bolded, the correct role is italicized, predictions of the FS-MTL model are colored blue (subscript FS), predictions of the Z&X-STL model are colored yellow (subscript ZX) and green marks predictions where both models agree. For simplicity, we show only holders or targets, although the models predict both roles jointly.
What works well? There are 668/1055 instances in the dev set for which both models predict holders correctly, and 663/1055 for targets.
Examples 1-5 in Table 5 suggest that holders that can be properly labeled by both models (easy cases) are subjects of their governing heads or A0 roles. The statistics in Table 7 (col. 1, rows 2-3) support this observation (the statistics are calculated using the output of mate-tools (Björkelund et al., 2010)). In contrast, holders that both models predict incorrectly (hard cases) are less frequently subjects or A0 roles (col. 2, rows 2-3). Also, easy holders are close to the corresponding opinion expression: the average distance is 1.54 tokens (Table 7, row 4), in contrast to the hard holders, with an average distance of 7.56. Examples 1-5 in Table 6 suggest that targets that can be properly labeled by both models are objects of their governing heads or A1 roles. Table 8, row 3, shows that the majority of the easy targets are indeed A1 roles, in contrast to the hard targets. Similar to holders, the easy targets are on average 7 tokens closer to the opinion expression.

1 It would be entirely improper if , in its defense of Israel FS , the United States continues to exert pressure on [...] .
2 Indonesia FS,ZX has come under pressure from several quarters to take tougher action against alleged terrorist leaders but has played down the threat .
3 Australia should adhere to the Cardinal Principle of International Law , which states that all nations in the world must first respect and promote the humanitarian interests and progress of all humankind .
4 The department said that it will cost $ 600 for an HIV/AIDS patient per year at this time , and the following years this cost is expected to stand at just $ 400/year for one patient as the production of such drugs becomes stable .
5 The Organisation of African Unity OAU ZX also backed Zimbabwean President Robert Mugabe 's re-election , with its observer team FS,ZX describing the poll as " transparent , credible , free and fair " .
6 Regarding the American proposed Anti-Missile Defense System too , neither Russia , China , Japan , nor even the European Union , had shown any enthusiasm ; rather they FS had all FS,ZX expressed their reserves on the project .
7 The president renewed his pledge to thwart terrorist groups FS,ZX who want to " mate up " with regimes hoping to acquire weapons of mass destruction and said " nations will come with us " if the US-led war on terrorism is extended .

1 State-sanctioned land invasions , several times declared illegal by Zimbabwe 's courts , as well as a drought have disrupted Zimbabwe 's food production and famine is already looming in much of the country .
2 But he told the nation FS,ZX that in spite of stiff opposition to the agrarian reforms from powerful Western countries , especially the country 's former colonial power of Britain , he would press ahead to seize farms from whites and [...] .
3 If the Europeans wish to influence Israel in the political arena -in a direction that many in Israel would support wholeheartedly -they will not be able to promote their positions in such a manner .
4 They FS,ZX are fully aware that these are dangerous individuals , he said during a press conference [...] .
5 And her little girl just complained , " I don't want to wash the dishes " .
6 During President Bush's speech , I thought of heckling ZX ; ' What are you going to do with the Kyoto Protocol ? FS '
7 At first I didn't want to apply for it FS,ZX , but the principal called me during the summer months and said , " Sandra the time is running out , you need to apply ".

What to do for further improvement? There are 165/1055 instances in the dev set for which both models predict holders incorrectly, and 176 for targets.
As we have seen so far, many holders that are subjects or A0 roles, and targets that are A1 roles, are properly labeled by both models. However, a considerable number of such holders and targets are not correctly predicted (Tables 7-8, col. 2, rows 2-3). Thus our models do not work flawlessly for all such cases. A distinguishing property of the hard cases is the distance of the role from the opinion. Thus, future work should advance the model's ability to capture long-range dependencies.
Table 9 demonstrates that holders that are harder to label with our models occur with the corresponding opinions in more complicated syntactic constructions. In the first example, the FS-MTL model does not recognize the possessive and is possibly biased towards picking the country (Israel), which occurs immediately after the opinion. In the second example, the opinion expression is a nominal predicate and the holder is its object. The sentence is in passive voice but the models probably interpret it in the active voice and thus make the wrong prediction. In the third example, the opinion expression is the head of the relative clause that modifies the holder. These examples raise the following questions: would improved consistency with syntax lead to improvements for ORL, and could we train a dependency parsing model with SRL and ORL to help the models handle syntactically harder cases? Example 4 shows that holders specific to the MPQA annotation schema are hard to label as they require inference skills: from the department said, we can defeasibly infer that it is the department who expects [this cost] to stand at just $400/year [...]. To handle such cases, it would be worth training our models jointly with models for recognizing textual entailment.

1 Yoshihisa Murasawa , a management consultant for Booz-Allen & Hamilton Japan Inc. , said his firm FS,ZX will likely be recommending acquisitions of Japanese companies more ZX often to foreign clients in the future .
2 The source FS , interviewed by Interfax in Grozny , expressed confidence that that the command of the Russian forces in Chechnya would soon " be able to obtain documentary confirmation " that Khattab was dead .
3 The Commonwealth team earlier this week FS said that " the conditions in Zimbabwe did not adequately allow the free and fair expression of will by the electorate ".
4 Publishing such biased reports will only create mistrust among nations FS regarding the objectives and independence of the UN Commission on Human Rights .

1 In most cases he described the legal punishments FS like floggings and executions of murderers and major drug traffickers that are applied based on the Shria , or Islamic law as human rights violations .
2 In another verbal attack Kharazi accused the United States FS of wanting to exercise " world dictatorship " since the " horrible attacks " of September 11 .
3 He said those who thought the election process would be rigged were supporters of the MDC party , adding that they were prejudging and wanted to direct the process ZX .
4 However , the fact that certain countries have a more balanced view of the conflict ZX is not the only reason to doubt that anti-Israeli decisions FS will , in fact , be adopted .
5 But his tough stand on P'yongyang FS has provoked concern in Seoul ZX , where President Kim Tae-chung , who is in the last year of his five-year term , has been trying to prise the hermit state out of isolation .
Examples 6-7 illustrate that some of the gap in performance stems from difficulties in processing MPQA. Example 5 has no gold holder, but the models make plausible predictions. For example 6, FS-MTL predicts the discontinuous holder they ... all, while MPQA allows only contiguous entities. Therefore our evaluation scripts interpret they and all as two separate holders and deem all incorrect, resulting in lower precision. Finally, for example 7 our models make plausible predictions. However, the gold holder is always the entity from the coreference cluster that is the closest to the opinion. The evaluation scripts need to be extended such that predicting any entity from the coreference cluster is considered correct.
[...] Table 11). From examples 1-5 we notice that SRL data helps to handle more complex syntactic constructions. From examples 5-7 we observe that using MTL with SRL helps to handle cases where more than one person or organization is present in the close neighborhood of the opinion. For targets, in 11 out of 18 cases the Z&X-STL model does not predict anything, as in examples 1-2 in Table 11. We conclude that the greatest improvements from the FS-MTL model come from having far fewer missing roles.

Related work
FGOA. Closest to our work are Yang and Cardie (2013) (Y&C) and Katiyar and Cardie (2016) (K&C). They also label both holders and targets in MPQA. By contrast, our focus is on the task of ORL. We thus refrain from predicting opinion expressions first, to ensure a reproducible evaluation setup on a fixed set of gold opinion expressions. The MTL models we develop in this work will, however, be the basis for the full task in a later stage. Because of these differences, direct comparison to Y&C and K&C is not possible. However, if we compare our results, we notice a big gap that demonstrates that opinion expression extraction is an important step in FGOA. Similar to K&C, Liu et al. (2015) jointly label opinion expressions and their targets in reviews.
Some work focuses entirely on labeling of opinion expressions (Yang and Cardie, 2014; Irsoy and Cardie, 2014). Other work looks into specific subcategories of ORL: opinion role induction for verbal predicates (Wiegand and Ruppenhofer, 2015), categorization of opinion words into actor and speaker view (Wiegand et al., 2016b), and opinion role extraction on opinion compounds (Wiegand et al., 2016a). Wiegand and Ruppenhofer (2015) report 72.54 binary F1 score for labeling of holders in MPQA (results for targets are not reported).
Neural SRL. New neural SRL models have emerged (He et al., 2017;Yang and Mitchell, 2017;Marcheggiani and Titov, 2017) since we started this work. In future work we can improve our models with such new proposals.
Auxiliary tasks for MTL. Other work investigates under which conditions MTL is effective. Martínez Alonso and Plank (2017) show that the best auxiliary tasks have low kurtosis of labels (usually a small label set) and high entropy (labels occur uniformly). We show that the best MTL model for ORL is the model which uses shared layers only. Thus it seems reasonable to consider only a small and uniform SRL label set {A0, A1}. Bingel and Søgaard (2017) show that MTL works when the main task has a flattening learning curve, but the auxiliary task curve is still steep. We notice such behavior in our learning curves.

Conclusions
We address the problem of scarcity of annotated training data for labeling of opinion holders and targets (ORL) using multi-task learning (MTL) with Semantic Role Labeling (SRL). We adapted a recently proposed neural SRL model for ORL and enhanced it with different MTL techniques. Two MTL models achieve significant improvements on all evaluation measures, for both holders and targets, on both dev and test sets, when evaluated with repeated 4-fold CV. We recommend evaluation with comparable dev and test set sizes for future work, as this enables more reliable evaluation.
With deeper analysis we show that future developments should improve the ability of the models to capture long-range dependencies, investigate if consistency with syntax can improve ORL, and consider other auxiliary tasks such as dependency parsing or recognizing textual entailment. We emphasize that future improvements can be measured more reliably if opinion expressions with missing roles are curated and if the evaluation considers all mentions in opinion role coreference chains as well as discontinuous roles.