A Unified Span-Based Approach for Opinion Mining with Syntactic Constituents

Fine-grained opinion mining (OM), which aims to find the opinion structures of “Who expressed what opinions towards what” in a sentence, has attracted increasing attention in the natural language processing (NLP) community. In this work, motivated by the span-based representations of opinion expressions and roles, we propose a unified span-based approach for the end-to-end OM setting. Furthermore, inspired by the shared span-based formalism of OM and constituent parsing, we explore two different methods (multi-task learning and graph convolutional networks) to integrate syntactic constituents into the proposed model. We conduct experiments on the commonly used MPQA 2.0 dataset. The experimental results show that our proposed unified span-based approach achieves significant improvements over previous works in the exact F1 score and reduces the number of wrongly predicted opinion expressions and roles, showing the effectiveness of our method. In addition, incorporating syntactic constituents achieves promising improvements over a strong baseline enhanced by contextualized word representations.


Introduction
Opinion mining (OM), which aims to find the opinion structures of "Who expressed what opinions towards what" in a sentence, has received much attention in recent years (Katiyar and Cardie, 2016; Marasović and Frank, 2018; Zhang et al., 2019b, 2020). Opinion analysis has many NLP applications, such as social media monitoring (Bollen et al., 2011) and e-commerce applications (Cui et al., 2017). The commonly used benchmark MPQA (Wiebe et al., 2005) uses span-based annotations to represent opinion expressions and roles. Figure 1 gives an example of its opinion structures with two opinion expressions and related roles.
Previous OM works (Yang and Cardie, 2013; Katiyar and Cardie, 2016; Quan et al., 2019; Zhang et al., 2020) mainly treat it as a BMESO-style tagging problem, which converts opinion expressions and opinion roles (holder/target) into BMESO-based labels and uses a linking module to connect the predicted expressions and roles. B, M, and E represent the beginning, middle, and ending word of a role, S denotes a single-word role, and O denotes other words. However, this kind of method is not well suited to the end-to-end OM setting, because one word can belong to only one opinion role (one word has only one label), while overlapping opinion structures between different expressions can exist in one sentence. Figure 1 gives an example, in which some overlapped opinion relations have been discarded by previous works (Katiyar and Cardie, 2016), such as [happy, he loves being Enderly Park, Target] and [loves, he, Holder]. There are also other works that focus only on predicting opinion roles based on the gold-standard expressions, which also follow the BMESO-based method (Marasović and Frank, 2018; Zhang et al., 2020). However, they also suffer from some weaknesses: 1) the expressions are usually fed into the model input as indicator embeddings (1 if the current word belongs to an expression, 0 otherwise), so one sample is expanded n times if a sentence has n expressions, which is inefficient (Marasović and Frank, 2018; Zhang et al., 2020); 2) the BMESO-based method is weak at capturing long-range dependencies and tends to predict shorter opinion role spans (Zhang et al., 2020).
Motivated by the span-based representations of opinion expressions and roles, we propose a unified span-based opinion mining model (SPANOM) that can solve or alleviate the aforementioned weaknesses. First, we treat the identification of opinion expressions and roles as two unified binary span classification problems, i.e., judging whether a word span is an expression (or role) or not. Then, we allocate the opinion relations over the predicted expression-role pairs. This strategy converts the overlapped opinion role identification of different expressions into classifying different expression-role pairs. For example, predicting [happy, he loves being Enderly Park, Target] and [loves, he, Holder] is infeasible in the BMESO-based method, but feasible in our span-based method. Benefiting from the model architecture, the proposed model only needs to process each sample once per epoch, which makes training very efficient. Besides, the unified model can be easily adapted to the given-expression setting by using gold-standard expressions. Furthermore, inspired by the shared span-based formalism of syntactic constituents and opinion roles, we explore two types of methods to encode syntactic knowledge to improve opinion role span recognition: multi-task learning (MTL) for enhancing the model's representational ability, and graph convolutional networks (GCN) (Kipf and Welling, 2016; Guo et al., 2019) for encoding the constituent structures.
We conduct extensive experiments on the commonly used MPQA 2.0 dataset and demonstrate that our proposed unified model achieves superior performance compared with previously proposed BMESO-based works. Our contributions are: (i) we propose a unified span-based model for opinion mining in an end-to-end fashion that also supports the given-expression setting; (ii) we successfully integrate syntactic constituent knowledge into our model with MTL and GCN, achieving promising improvements; (iii) detailed analyses demonstrate the effectiveness of our unified model and the usefulness of integrating constituent syntactic knowledge for long-distance opinion roles.

Related Work
There are several task settings for opinion mining in the community: 1) Breck et al. (2007) and Yang and Cardie (2014) focus on labeling the expressions; 2) Katiyar and Cardie (2016), Zhang et al. (2019b), and Quan et al. (2019) discover the opinion structures in the end-to-end setting, i.e., based on system-predicted expressions; 3) Marasović and Frank (2018) and Zhang et al. (2019a, 2020) identify the opinion roles based on the given expressions. Our work follows the end-to-end setting and also supports the given-expression setting. Most previous opinion mining works treat it as a BMESO-tagging problem, which can be handled by a typical sequence labeling model such as the bi-directional long short-term memory network with a conditional random field (BiLSTM-CRF). Yang and Cardie (2013) propose a traditional feature-based CRF model to predict BMESO-based opinion role labels. Katiyar and Cardie (2016) propose a BiLSTM-CRF model that first predicts word-wise opinion role labels and then determines the relationship with an expression by the role label and the distance to the expression. Zhang et al. (2019b) propose a transition-based model for opinion mining, which identifies opinion expressions and roles by human-designed transition actions. Quan et al. (2019) integrate BERT representations into a BiLSTM-CRF model, but they do not distinguish different expressions in one sentence. As mentioned above, it is non-trivial for sequence-labeling-style models to handle overlapped opinion roles belonging to different expressions in one sentence.
Due to the issue of data scarcity, several kinds of external knowledge have been investigated to improve OM performance. Marasović and Frank (2018) propose several MTL frameworks with semantic role labeling (SRL) to utilize semantic knowledge. Zhang et al. (2019a) extract the semantic representations from a pre-trained SRL model and feed them into the opinion mining model, achieving substantial improvements. Zhang et al. (2020) incorporate the powerful contextual representations of bi-directional encoder representations from Transformers (BERT) (Devlin et al., 2019) and external dependency syntactic knowledge.
To solve or alleviate the weaknesses of previously proposed BMESO-based models, we propose a new method to model opinion expressions and roles in a unified way, treating expression identification, role identification, and opinion relation classification as an MTL problem. Besides, to boost opinion mining performance, and motivated by the span-based task formalism, we explore incorporating syntactic constituents into our model. Utilizing span-based representations has been investigated for many other NLP tasks, such as named entity recognition (NER) (Tan et al., 2020), constituency parsing (Kitaev and Klein, 2018), and semantic role labeling (SRL) (He et al., 2018). Generally, NER is a single-span classification problem, constituency parsing is a span-based structure prediction problem, and SRL is a word-span classification problem. Different from them, in our methodology, OM is a span-span classification problem.

Task Definition.
Given an input sentence s = w_1, w_2, ..., w_n, our model aims to predict the gold-standard opinion structures, consisting of the set of opinion expressions, the set of opinion roles, and the set R of opinion relations (holder and target) with a dummy relation ψ that represents no relation.
Accordingly, we treat opinion expression and role recognition as a unified span classification problem and determine the opinion relation based on the predicted expressions and roles. We jointly model the three sub-tasks in an MTL fashion to enhance the interplay of the modules. The left part of Figure 2 shows the architecture of our model, and we describe its components in detail in the following sections.

Input Layer.
For each word w_i in sentence s, we compose the model input from its word embedding, character representation, and contextual word representation:

x_i = e_i^word ⊕ e_i^char ⊕ e_i^ctx

where ⊕ denotes the concatenation operation. We use convolutional neural networks (CNN) (Kalchbrenner et al., 2014) over the characters of each word to generate the character representations.
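As a concrete illustration, the character-level CNN and the concatenation above can be sketched as follows. This is a simplified sketch: the dimensions, kernel width, and single-filter-size setup are illustrative and do not reproduce the paper's exact configuration.

```python
import numpy as np

def char_cnn(char_embs, kernel, width=3):
    """1-D convolution over a word's character embeddings followed by
    max-over-time pooling, yielding a fixed-size character vector."""
    n, d = char_embs.shape
    if n < width:                                   # pad very short words
        char_embs = np.vstack([char_embs, np.zeros((width - n, d))])
        n = width
    windows = np.stack([char_embs[i:i + width].reshape(-1)
                        for i in range(n - width + 1)])
    conv = np.maximum(windows @ kernel, 0.0)        # ReLU over each window
    return conv.max(axis=0)                         # max pooling over time

def compose_input(word_emb, char_emb, ctx_emb):
    """The concatenation (the ⊕ operation) of the three vectors."""
    return np.concatenate([word_emb, char_emb, ctx_emb])

# toy dimensions: 300-d word embedding, 50 char filters, 768-d contextual vector
rng = np.random.default_rng(0)
chars = rng.normal(size=(6, 30))                    # a 6-character word
kernel = rng.normal(size=(3 * 30, 50))              # width-3 filters
x = compose_input(rng.normal(size=300), char_cnn(chars, kernel),
                  rng.normal(size=768))             # 1168-d model input
```
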

Encoder Layer.
Over the input layer, we employ a BiLSTM to encode the model input, treating the concatenation of the left-to-right and right-to-left LSTM outputs as the encoder output:

h_i = →h_i ⊕ ←h_i

Span Representation and Identification
Layer.
To better distinguish opinion expression and role representations, we first employ two multi-layer perceptrons (MLP) to re-encode the output of the BiLSTM encoder:

r_i^exp = MLP^exp(h_i),  r_i^rol = MLP^rol(h_i)

For a word span that begins at the b-th word and ends at the e-th word, we write span_{b,e}; the expression and role representations of span_{b,e} are composed from the corresponding re-encoded word representations. Given the representations of expressions and roles, we employ another two MLPs to classify whether a span is a gold expression/role or not. Furthermore, we also incorporate span boundary information to help the determination of spans. Specifically, we employ another four MLPs on the span boundary positions to determine whether a word is a boundary position or not. The final score of a span combines its span classification score with the boundary scores. For a sentence with n words, the number of candidate spans for expressions and roles is n(n+1)/2, which yields an extreme imbalance between positive and negative samples. To alleviate this imbalance, we adapt the focal loss widely used in computer vision (Lin et al., 2017) into our model. Formally, for every span i in a sentence, the focal loss is defined as:

L = - Σ_c (1 - p_{i,c})^γ y_{i,c} log p_{i,c}

where p_{i,c} is the softmax probability of s^exp_c (or s^opi_c) for class c of span i, γ is a pre-defined hyper-parameter, and y_{i,c} is an indicator that equals 1 if c is the ground-truth class and 0 otherwise. Compared with the typical cross-entropy loss, the difference lies in the modulating factor (1 - p_{i,c})^γ, which intuitively makes the model focus more on hard-to-classify samples. We denote the losses for opinion expressions and roles as L_exp and L_rol, respectively.
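The focal loss above can be sketched in a few lines of plain Python; γ = 3.0 follows the paper's hyper-parameter choice, and the two-class case (span vs. non-span) is assumed purely for illustration.

```python
import math

def focal_loss(probs, gold_class, gamma=3.0):
    """Focal loss (Lin et al., 2017) for one candidate span.
    probs: softmax distribution over classes.  The (1 - p)^gamma factor
    down-weights easy, confidently classified spans."""
    p = probs[gold_class]
    return -((1.0 - p) ** gamma) * math.log(p)

# with gamma = 0 the focal loss reduces to the usual cross-entropy
easy = focal_loss([0.95, 0.05], 0)     # well-classified span: tiny loss
hard = focal_loss([0.30, 0.70], 0)     # hard span: much larger loss
```

Because candidate spans are overwhelmingly negative (easy) samples, the modulating factor keeps them from dominating the gradient, as the `easy`/`hard` comparison shows.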

Relation Classification Layer.
Given the predicted opinion expressions and roles, the next step is to determine the opinion relation (holder, target, or no relation) for each expression-role pair. We employ another MLP classifier to compute the score for each relation of the focused expression span^exp and role span^rol:

s^rel = MLP^rel(span^exp ⊕ span^rol)

Focal loss is also employed to train this module, and we denote its loss as L_rel.
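A minimal sketch of the relation scorer, assuming a one-hidden-layer MLP over the concatenated expression and role span representations (the exact MLP depth and input composition are not spelled out in the text):

```python
import numpy as np

def relation_scores(span_exp, span_rol, W1, b1, W2, b2):
    """MLP over the concatenated expression and role span representations;
    returns one score per relation class (holder, target, no-relation)."""
    h = np.maximum(np.concatenate([span_exp, span_rol]) @ W1 + b1, 0.0)
    return h @ W2 + b2

rng = np.random.default_rng(1)
d = 8                                    # toy span-representation size
W1, b1 = rng.normal(size=(2 * d, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 3)), np.zeros(3)
scores = relation_scores(rng.normal(size=d), rng.normal(size=d),
                         W1, b1, W2, b2)
pred = int(np.argmax(scores))            # index of the predicted relation
```
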

Training and Inference.
We sum the three losses from the three modules as the final model loss:

L = L_exp + L_rol + L_rel

For the end-to-end OM setting, the model predicts the relations between the predicted expressions and roles. In the given-expression mode, we directly feed the gold expressions into the model, with all other parts the same as in the end-to-end mode. During inference, we employ dynamic programming to decode the opinion expressions and roles.

Syntactic Constituents
Since the data scale is relatively small, previous works usually integrate external knowledge to enhance the basic OM model and improve its performance (Marasović and Frank, 2018; Zhang et al., 2019a). Previous sequence tagging models usually incorporate word-wise external information, such as dependency parsing (Zhang et al., 2020). We instead investigate the integration of constituent knowledge, motivated by the shared span-based formalism of constituents and opinion roles. Two different methods are explored in our work, i.e., MTL and GCN.

The MTL Method.
MTL is an effective way to utilize external knowledge, usually implemented by sharing model parameters between the main task and an auxiliary task (Ruder, 2017). Considering the efficiency of full constituent parsing, we use partial constituent parsing in our model, i.e., training on partial constituent trees (constituent spans) rather than entire constituent trees. In detail, we first extract all the constituent spans from the OntoNotes corpus (see Section 5.1 for the detailed settings). Then, we add a span classification module over the BiLSTM encoder, similar to the unified opinion classifier, to predict which constituent label a span belongs to. Third, with this constituent span classification module, we can easily allocate automatic constituent labels to enhance the predicted opinion expressions and roles. We thus create randomly initialized constituent label embeddings e^label to represent the syntactic labels, which are concatenated with the expression and role representations:

span'^exp = span^exp ⊕ e^label,  span'^rol = span^rol ⊕ e^label

The syntax-enhanced span representations then participate in the subsequent computation. Finally, the focal loss is used to train the partial constituent tree prediction module, and the partial constituent loss L_cons is used to update the shared input layer, encoder layer, and the partial constituent parsing classification layer. The loss of our constituent-enhanced OM model thus becomes:

L = L_exp + L_rol + L_rel + α · L_cons

Since the data sizes of the OM and constituent corpora differ, we employ a corpus-weighting parameter α to balance them. In general, the MTL method brings two benefits: 1) enhancing the model encoder and 2) adding constituent label information to the expression and role representations.
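The two ingredients of the MTL method, concatenating a constituent-label embedding onto the span representation and weighting the auxiliary loss by α, can be sketched as follows; the label inventory, dimensions, and α value are illustrative assumptions, not the paper's settings.

```python
import numpy as np

# hypothetical constituent label inventory and embedding table
LABELS = {"NP": 0, "VP": 1, "SBAR": 2, "PP": 3, "NONE": 4}
rng = np.random.default_rng(2)
label_emb = rng.normal(size=(len(LABELS), 20))   # randomly initialised

def syntax_enhanced(span_rep, label):
    """Concatenate the predicted constituent-label embedding with an
    expression/role span representation."""
    return np.concatenate([span_rep, label_emb[LABELS[label]]])

def joint_loss(l_exp, l_rol, l_rel, l_cons, alpha=0.1):
    """Final loss: the three OM losses plus the corpus-weighted
    partial-constituency loss; alpha balances the differing data sizes."""
    return l_exp + l_rol + l_rel + alpha * l_cons

enhanced = syntax_enhanced(rng.normal(size=300), "NP")   # 320-d vector
```
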

The GCN Method.
The MTL method enhances our OM model from the aspect of representational ability by jointly modeling opinion mining and partial constituency parsing. We argue that modeling the syntactic constituent structure itself is also beneficial for OM, because it provides valuable structural information about a sentence. Therefore, we employ the recently popular GCN (Kipf and Welling, 2016) to encode the constituent structure. However, the conventional GCN is not directly suitable for constituency trees: it usually works on dependency trees (Zhang et al., 2018, 2020), where the nodes are the surface words of a sentence, whereas constituent trees contain a certain number of non-terminal nodes, such as "NP", "VP", and "SBAR" (the terminal nodes are the surface words in the sentence). In the following, we first introduce the definition and workflow of a typical GCN and then describe our modification. Formally, we denote an undirected graph as G = (V, E), where V and E are the sets of nodes and edges, respectively. The GCN computation for node v ∈ V at the l-th layer is defined as:

h_v^l = ρ( Σ_{u ∈ N(v)} W^l h_u^{l-1} + b^l )
where W^l ∈ R^{m×m} is the weight matrix, b^l ∈ R^m is the bias term, N(v) is the set of one-hop neighbour nodes of v, and ρ is an activation function (ReLU in our work). In particular, h_u^0 ∈ R^m is the initial input representation, and m is the representation dimension.
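Under this formulation, one GCN layer can be sketched with an adjacency matrix over terminal and non-terminal nodes. Two details here are assumptions following common GCN practice rather than the paper's specification: self-loops are added and the neighbour sum is degree-normalised.

```python
import numpy as np

def gcn_layer(H, adj, W, b):
    """One GCN layer: each node aggregates its one-hop neighbours
    (self-loops added), then applies a linear map and ReLU (rho)."""
    A = adj + np.eye(adj.shape[0])            # add self-loops
    deg = A.sum(axis=1, keepdims=True)
    agg = (A @ H) / deg                       # degree-normalised sum
    return np.maximum(agg @ W + b, 0.0)

# toy constituent fragment:  S -> NP VP  over terminals w1 w2 w3
#   nodes: 0=S, 1=NP, 2=VP, 3=w1, 4=w2, 5=w3
adj = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 3), (2, 4), (2, 5)]:
    adj[u, v] = adj[v, u] = 1.0               # undirected edges

rng = np.random.default_rng(3)
m = 16
H0 = rng.normal(size=(6, m))                  # non-terminal rows would come
                                              # from a learned embedding table
H1 = gcn_layer(H0, adj, rng.normal(size=(m, m)), np.zeros(m))
terminal_out = H1[3:]                         # keep only the surface words
```

Keeping only the terminal rows of the output mirrors the dynamic-mask extraction described in the next paragraph: the non-terminal nodes contribute structure during aggregation but only word-level vectors feed the OM model.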
Since there are non-terminal nodes in the constituent tree, the GCN input cannot be obtained directly from the surface words. We create a randomly initialized non-terminal embedding matrix E ∈ R^{N×D} and a dynamic mask for composing the GCN input and extracting the GCN output, where N is the number of non-terminal nodes and D is the dimension of the terminal node inputs. There are two main ways to add GCN modules to a neural network model, i.e., concatenating with the input layer or stacking over the encoder layer. According to our preliminary experiments, we choose the former. In detail, we treat the composition of non-terminal and terminal node representations as the GCN input, and then concatenate the terminal node GCN outputs x_i^GCN with the basic model input as the final model input. The top right part of Figure 2 shows the overall workflow.
The final constituent-enhanced unified span-based opinion mining model combines the two methods, which we denote as "MTL+GCN" in the later sections. The workflow is shown in the bottom right part of Figure 2.

Settings.
We conduct experiments on the commonly used English MPQA 2.0 dataset (Wiebe et al., 2005). Following the data split of previous works (Zhang et al., 2019a, 2020), the development data contains 132 documents and the test data contains 350 documents, and we use five-fold cross-validation to evaluate on the test data. For constituent data, we use the OntoNotes 5.0 dataset (Pradhan et al., 2013) in our MTL method, and the constituent parser of Kitaev and Klein (2018) to produce automatic constituent trees.

Hyper-parameters.
We employ 300-dimensional GloVe vectors (Pennington et al., 2014) as our pre-trained word embeddings. The character embeddings are randomly initialized, and a CNN with kernel sizes of 3, 4, and 5 captures the character representations. For the contextual representations, we extract representations from BERT-base by taking a weighted summation over the outputs of its last four layers. The hidden size of the BiLSTM layer is 300 and we employ 2-layer BiLSTMs to encode the input representations. The dimension of the opinion expression and role representations is 300, and the hidden size of the expression, role, and relation classifiers is 150. We use 3-layer GCNs with hidden size 300. The dropout rates of the input layer, encoder layer, and other components are 0.5, 0.4, and 0.3, respectively. The hyper-parameter γ is 3.0.

Training Criterion.
We employ the Adam optimizer with an L2 weight decay of 1e-6 to optimize our model. The batch size is 32. The initial learning rate is set to 0.001 and decays by a factor of 0.99 every 50 steps. Our model trains for at most 320k steps and stops early if no performance gain is observed on the development data for 100 epochs. We pick the model that performs best on the development data for evaluation. One epoch of training takes about 4 minutes, and evaluation takes about 1 minute.

Evaluation Metrics.
Following previous works (Marasović and Frank, 2018; Zhang et al., 2020), we use precision, recall, and F1 score to measure the experimental results under the exact match setting, along with two auxiliary metrics: binary and proportional match. We report the average over the five-fold cross-validation results. The binary and proportional metrics are also called overlap metrics, which credit opinion roles that exactly match gold roles as well as roles that only partially overlap them. In detail, a binary match credits a predicted opinion role that overlaps a gold-standard role at all, while the proportional match computes the maximum overlap ratio of a predicted role with a gold role.
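The three matching criteria for a single predicted role can be sketched as follows, with spans as inclusive (start, end) token indices. One assumption: the proportional ratio here is computed against the gold span length, which is the usual recall-side definition and matches the description above.

```python
def overlap(pred, gold):
    """Number of tokens shared by two inclusive (start, end) spans."""
    return max(0, min(pred[1], gold[1]) - max(pred[0], gold[0]) + 1)

def exact_match(pred, gold):
    """Credited only when the spans are identical."""
    return 1.0 if pred == gold else 0.0

def binary_match(pred, gold):
    """Credited as soon as the spans overlap at all."""
    return 1.0 if overlap(pred, gold) > 0 else 0.0

def proportional_match(pred, gold):
    """Fraction of the gold span covered by the prediction."""
    return overlap(pred, gold) / (gold[1] - gold[0] + 1)
```

For example, a prediction (3, 6) against a gold role (2, 5) scores 0.0 under exact match, 1.0 under binary match, and 0.75 under proportional match.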

Results of SPANOM.
Results in the end-to-end setting. Table 1 lists the results of previous works and our model (SPANOM) in the end-to-end setting. First, our model achieves superior performance over previous works in terms of exact F1 score, reaching 52.90 and 32.42 exact F1 on the holder and target roles, respectively; the overall exact F1 score over the two roles is 43.12. Second, integrating BERT representations into the model input brings substantial improvements, achieving a 49.89 exact F1 score. Under the auxiliary binary and proportional metrics, previous works perform better than ours, which we attribute to our model focusing more on entire word spans; we discuss this in detail in the analysis section. Finally, the results of expression prediction are shown in Table 2, where our model outperforms Zhang et al. (2019b) by +5.02 exact F1.

Results in the given-expression setting. Table 3 lists the results in the given-expression setting. First, our proposed span-based model outperforms previously proposed BMESO-based models in the exact F1 metric, achieving a 59.62 exact F1 score. Second, when using the contextual word representations of BERT, our model consistently outperforms the previous best result, establishing a new state-of-the-art of 65.95 exact F1 and showing superior performance compared with the BMESO-based methods.

Results of Integrating Syntactic Constituents.
Table 4 shows the results of our model integrating syntactic constituents and compares them with previous works using SRL or dependency syntax knowledge.
In the end-to-end setting, incorporating constituent knowledge brings an improvement of +0.57 exact F1. In the given-expression setting, integrating constituent syntactic knowledge brings a +2.07 exact F1 improvement, achieving results comparable to the previous best of Zhang et al. (2020). Even though our basic OM model outperforms Zhang et al. (2020), the improvements from syntactic constituents lag behind those from dependency syntax. We think this is partly due to the relatively lower performance of constituent parsing (93.55 F1) compared with dependency parsing (95.7 F1). Apart from syntactic knowledge, Marasović and Frank (2018) and Zhang et al. (2019a) both try to encode semantic knowledge, but their models do not use BERT representations.

Analysis
In this section, we conduct detailed analyses to gain more insights into our unified OM model and the effectiveness of integrating syntactic constituents.

Span-based Model vs. BMESO-based Model
As the experimental results show, our span-based model performs better under the exact match metric than the BMESO-based models, while the BMESO-based models obtain better results under the auxiliary overlap metrics. To understand this performance difference, Figure 3 lists detailed statistics over the system outputs of our span-based model and the BMESO-based model of Zhang et al. (2019a), both using BERT representations. "Matched", "Overlapped", and "Error" mean the predicted opinion role exactly matches the gold role, does not match but overlaps part of the gold role, and totally mismatches the gold role, respectively. We can see that: 1) our model achieves better exact-match performance across all span lengths, especially on spans containing more than 10 words; 2) the BMESO-based model outputs more overlapped opinion roles than our span-based model, which is why the BMESO models score better under the auxiliary binary and proportional metrics. This demonstrates that our SPANOM focuses more on full opinion role spans, while the BMESO-based method is weaker at making exact predictions. Case study. The upper part of Figure 4 shows an example output of our span-based model and the previous BMESO-based model of Zhang et al. (2019a). The span-based model successfully predicts the full agent, while the BMESO-based model predicts only part of the agent span. This confirms the intuition that our span-based model is better at predicting long-range arguments, while the BMESO-based model is weak on long-range spans, consistent with the findings of Zhang et al. (2020).

Effect of Syntactic Constituents
Which source of constituent knowledge is better? There are two main constituent syntax corpora in the community, i.e., the Penn Treebank (PTB) (Marcus et al., 1993) and OntoNotes 5.0 (Weischedel et al., 2013). The PTB corpus contains about 39k training sentences and mainly covers news data, while the OntoNotes 5.0 corpus contains about 75k training sentences and covers multiple domains (news, web, telephone conversation, etc.).
It is worth exploring which corpus is better for our span-based OM model, and which combination works best. We compare various combinations on the BERT-based model, with results shown in Table 5. First, the second major row shows the results of our model with the MTL method, where MTL with PTB achieves the best exact F1 score of 68.02. Second, the results of our model with the GCN method are listed in the third major row, where "OntoNotes" and "PTB" mean the automatic constituent trees are generated by parsers trained on OntoNotes and PTB, respectively; using the automatic constituent trees from the PTB-trained parser achieves the best exact F1 score of 67.66. Finally, we combine the two methods, with results in the last major row. Combining the MTL method on OntoNotes with the GCN method on PTB-parsed trees achieves better results than the reverse, so our constituent-enhanced opinion mining model follows this combination. Besides, we observe relatively lower results for "OntoNotes+PTB" in both the "+MTL" and "+GCN" settings, which is counter-intuitive: combining more information leads to lower performance. We think this is mainly caused by the multi-domain nature of OntoNotes, as learning uniform knowledge from data of different domains is a well-known challenge. In the MTL method, adding OntoNotes to PTB aggravates this domain mismatch, and vice versa. In the GCN method, the two GCN outputs are concatenated, so potential conflicts between the different trees are alleviated and the performance does not drop as much.
We also try to utilize dependency syntax. However, it brings smaller improvements than constituent syntax, which is understandable given that word-based information is not well suited to a span-based model. This is also consistent with our intuition that span-based syntactic constituents are more suitable for a span-based model. Why and where do syntactic constituents help? OM aims to discover the structure of "Who expressed what" in a sentence, and constituent syntax provides valuable information such as the "NP" and "VP" phrases of a sentence. Intuitively, the agent/target and expression may be covered by NP and VP phrases. We compute statistics on the overlap between constituent spans and opinions, and find that about 88% of opinion roles are covered by the constituent spans predicted by the MTL module, with the four most frequent labels being "NP", "VP", "SBAR", and "PP". Since constituent knowledge can intuitively help the determination of roles, we list results for different span lengths in Figure 5a: constituent knowledge helps most on longer opinion roles. We also report results with respect to the distance between expressions and roles in Figure 5b, which shows a similar trend.
Case study. The bottom part of Figure 4 gives a case study showing the difference between the syntax-enhanced and syntax-agnostic models. The target argument "All composite things" is hard for our baseline model to identify. When integrating constituent knowledge, the model correctly discovers this opinion role and assigns the "target" relation. We think this is because the constituent tree gives an "NP" label to this word span, which helps our model identify it. We also observe some peculiarities of the MPQA annotation scheme. For example, in the sentence "The criteria set by Rice are the following: the three countries in question are repressive ...", "set by" is the expression, "Rice" is the holder, and "the three countries in question" is the target. However, "set by" is not a constituent phrase at all; in fact, "by" and "Rice" compose a prepositional phrase in the constituent tree, so it is hard for our model to recognize "set by" as an opinion expression. Besides, "the three countries in question" is not directly attached to the opinion expression "set by", so the constituent tree cannot provide valuable structural information relating the two phrases. Such phenomena are hard for our model to handle and raise challenges for future work.

Conclusion
In this paper, we propose a unified span-based opinion mining model that can handle overlapped opinion roles, providing a new methodology for the task. Our model outperforms previously proposed BMESO-based models in terms of the exact match metric in both the end-to-end and given-expression settings. Furthermore, integrating syntactic constituent knowledge with MTL and GCN brings substantial improvements over our BERT-enhanced baseline model. Detailed analyses show the differences between the span-based and BMESO-based models and the effectiveness of incorporating syntactic constituents for determining opinion role spans.