A Multi-task Learning Framework for Opinion Triplet Extraction

The state-of-the-art Aspect-based Sentiment Analysis (ABSA) approaches are mainly based on either detecting aspect terms and their corresponding sentiment polarities, or co-extracting aspect and opinion terms. However, the extraction of aspect-sentiment pairs lacks opinion terms as a reference, while co-extraction of aspect and opinion terms would not lead to meaningful pairs without determining their sentiment dependencies. To address this issue, we present a novel view of ABSA as an opinion triplet extraction task, and propose a multi-task learning framework to jointly extract aspect terms and opinion terms, and simultaneously parse sentiment dependencies between them with a biaffine scorer. At the inference phase, the extraction of triplets is facilitated by a triplet decoding method based on the above outputs. We evaluate the proposed framework on four SemEval benchmarks for ABSA. The results demonstrate that our approach significantly outperforms a range of strong baselines and state-of-the-art approaches.


Introduction
Aspect-based sentiment analysis (ABSA), also termed Target-based Sentiment Analysis in some literature (Liu, 2012), is a fine-grained sentiment analysis task. It is usually formulated as detecting the aspect terms and the sentiments expressed in a sentence towards those aspects (He et al., 2019; Hu et al., 2019). This type of formulation is referred to as aspect-sentiment pair extraction. Meanwhile, there exists another type of approach to ABSA, referred to as aspect-opinion co-extraction, which focuses on jointly deriving aspect terms (a.k.a. opinion targets) and opinion terms (a.k.a. opinion expressions) from sentences, yet without figuring out their sentiment dependencies (Li et al., 2018b). The compelling performances of both directions illustrate a strong dependency between aspect terms, opinion terms, and the expressed sentiments.

Example sentence: The atmosphere is attractive, but a little uncomfortable.
Aspect-sentiment pair extraction: [(atmosphere, positive), (atmosphere, negative)]
Aspect-opinion co-extraction: [atmosphere, attractive, uncomfortable]
Opinion triplet extraction: [(atmosphere, attractive, positive), (atmosphere, uncomfortable, negative)]
Figure 1: Differences among aspect-sentiment pair extraction, aspect-opinion co-extraction, and opinion triplet extraction. Words in blue are aspect terms. Words in red are opinion terms. [ ] denotes a set of extracted patterns, and ( ) denotes an extracted pattern.
This motivates us to put forward a new perspective on ABSA as the joint extraction of aspect terms, opinion terms and sentiment polarities, in short opinion triplet extraction. An illustrative example of the differences among aspect-sentiment pair extraction, aspect-opinion co-extraction, and opinion triplet extraction is given in Figure 1. Opinion triplet extraction can be viewed as an integration of aspect-sentiment pair extraction and aspect-opinion co-extraction, taking into consideration their complementary nature. It brings two-fold advantages: (1) the opinions can boost the expressive power of models and help better determine aspect-oriented sentiments; (2) the sentiment dependencies between aspects and opinions can bridge the gap of how sentiment decisions are made and further promote the interpretability of models.
There is some prior research with a similar viewpoint.  proposes to extract opinion tuples, i.e., (aspect-sentiment pair, opinion) tuples, by first jointly extracting aspect-sentiment pairs and opinions with two sequence taggers, in which sentiments are attached to aspects via unified tags, and then pairing the extracted aspect-sentiment pairs and opinions with an additional classifier. Despite the remarkable performance this approach has achieved, two issues need to be addressed.
The first issue arises from the prediction of aspects and sentiments with a set of unified tags, which degrades the sentiment dependency parsing process to a binary classification. As discussed in prior studies on aspect-sentiment pair extraction (He et al., 2019; Hu et al., 2019), although the unified tagging scheme is theoretically elegant and mitigates computational cost, it is insufficient to model the interaction between aspects and sentiments. Secondly, the coupled aspect-sentiment formalization disregards the importance of their interaction with opinions. Such interaction has been shown important for handling the overlapping circumstances where different triplet patterns share certain elements, in other triplet extraction-based tasks such as relation extraction (Fu et al., 2019). To show why triplet interaction modelling is crucial, we divide triplets into three categories, i.e., aspect overlapped, opinion overlapped, and normal ones. Examples of these three kinds of triplets are shown in Figure 2. We observe that two triplets tend to have the same sentiment if they share the same aspect or opinion. Hence, modelling triplet interaction shall benefit the ABSA task, yet it cannot be explored with unified aspect-sentiment tags, in which sentiments are attached to aspects without considering the overlapping cases.
To circumvent the above issues, we propose a multi-task learning framework for opinion triplet extraction, namely OTE-MTL, to jointly detect aspects, opinions, and sentiment dependencies. On one hand, the aspects and opinions can be extracted with two independent heads in the multi-head architecture we propose. On the other hand, we decouple sentiment prediction from aspect extraction. Instead, we employ a sentiment dependency parser as the third head, to predict word-level sentiment dependencies, which are then utilized to decode span-level dependencies when incorporated with the detected aspects and opinions. In doing so, we expect to alleviate the issues brought by the unified tagging scheme. Specifically, we exploit sequence tagging strategies (Lample et al., 2016) for the extraction of aspects and opinions, whilst taking advantage of a biaffine scorer (Dozat and Manning, 2017) to obtain word-level sentiment dependencies. Additionally, since these task heads are jointly trained, the learning objectives of aspect and opinion extraction can be considered as regularization applied on the sentiment dependency parser. In this way, the parser is learned with aspect- and opinion-aware constraints, thereby fulfilling the demand of triplet interaction modelling. Intuitively, if we are provided with a sentence containing two aspects but only one opinion (e.g., the third example in Figure 2), we can thereby identify triplets with an overlapped opinion.
Extensive experiments are carried out on four SemEval benchmark data collections for ABSA. Our framework is compared with a range of state-of-the-art approaches. The results demonstrate the effectiveness of our overall framework and the individual components within it. A further case study shows how our model better handles overlapping cases.

Each opinion triplet consists of three elements, which separately stand for the aspect span, the opinion span, and the sentiment. While aspects and opinions are usually spans over several words in the sentence, we simplify the notation of a span m with its start position (denoted as sp) and end position (denoted as ep), so that m can be represented as (sp, ep). Thus, the problem is formulated as finding a function F that accurately maps the input sentence to a set of opinion triplets.

The OTE-MTL Framework
Our proposed OTE-MTL framework folds the triplet extraction process into two stages, i.e., prediction stage and decoding stage. An overview of our framework is presented in Figure 3. The prediction stage is parameterized by neural models and thus is trainable. It builds upon a sentence encoding module based on word embedding and a bidirectional LSTM structure, to learn an abstract representation of aspects and opinions. Underpinned by the abstract representation, there are three core components, accounting for three subgoals, i.e., aspect tagging, opinion tagging, and word-level sentiment dependency parsing. After the aspects, opinions and word-level dependencies have been detected, a decoding stage is then carried out to produce triplets based on heuristic rules.

Sentence Encoding
Context awareness is crucial for sentence encoding, i.e., encoding a sentence into a sequence of vectors. Hence, we adopt a bidirectional Long Short-Term Memory network (LSTM) (Hochreiter and Schmidhuber, 1997) as our sentence encoder, owing to the context modelling capability of LSTMs. In order to encode the input sentence, we first embed each word in a sentence into a low-dimensional vector space (Bengio et al., 2003) with pre-trained word embeddings. With the embedded word representations e_i, the contextualized hidden states h_i are obtained by the following operation:

h_i = →LSTM(e_i) ⊕ ←LSTM(e_i), h_i ∈ R^(2*d_h)

where d_e and d_h denote the dimensionality of a word embedding and a hidden state from a unidirectional LSTM, while →LSTM(·) and ←LSTM(·) stand for the forward and backward LSTM, respectively. ⊕ means vector concatenation.
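The encoder described above can be sketched in PyTorch as follows. This is a minimal illustration rather than the authors' released code, and the embedding matrix is randomly initialized as a stand-in for pre-trained GloVe vectors; the class and variable names are ours.

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    def __init__(self, vocab_size, d_e=300, d_h=300):
        super().__init__()
        # Stand-in for pre-trained GloVe embeddings.
        self.embed = nn.Embedding(vocab_size, d_e)
        # bidirectional=True concatenates forward and backward states,
        # so each word ends up with a 2 * d_h dimensional vector.
        self.lstm = nn.LSTM(d_e, d_h, batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        e = self.embed(token_ids)   # (batch, seq_len, d_e)
        h, _ = self.lstm(e)         # (batch, seq_len, 2 * d_h)
        return h

encoder = SentenceEncoder(vocab_size=100)
h = encoder(torch.randint(0, 100, (2, 7)))  # a batch of 2 sentences, 7 tokens
print(h.shape)  # torch.Size([2, 7, 600])
```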

Aspect and Opinion Representation
We then extract the aspect- and opinion-specific features from the encoded hidden states by applying dimension-reducing linear layers and nonlinear functions, rather than directly feeding the hidden states into the next components, for two reasons. First, the hidden states might contain superfluous information for follow-on computations, potentially causing a risk of overfitting. Second, such operations are expected to strip away features irrelevant to aspect tagging and opinion tagging. The computation process is formulated as below:

r_i^ap = g(W^ap h_i + b^ap),  r_i^op = g(W^op h_i + b^op)

where W^ap, W^op ∈ R^(d_r×2d_h) and b^ap, b^op ∈ R^(d_r) are learnable weights and biases. Here, g(·) is a nonlinear function, which is ReLU, i.e., max(·, 0), in our case.
Note that the above representations are prepared for tagging. Likewise, we obtain another set of representations r^s ∈ R^(d_r) for sentiment parsing, following the same procedure but with different parameters.
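These dimension-reducing heads can be sketched in PyTorch as below; the names `aspect_head` and `opinion_head` are ours, and the encoder output is a random stand-in.

```python
import torch
import torch.nn as nn

d_h, d_r = 300, 100
h = torch.randn(2, 7, 2 * d_h)  # encoder output: (batch, seq_len, 2 * d_h)

# One dimension-reducing head per role; the sentiment parser would use
# further heads with separate parameters, as noted in the text.
aspect_head = nn.Sequential(nn.Linear(2 * d_h, d_r), nn.ReLU())
opinion_head = nn.Sequential(nn.Linear(2 * d_h, d_r), nn.ReLU())

r_ap, r_op = aspect_head(h), opinion_head(h)
print(r_ap.shape, r_op.shape)  # both (2, 7, 100)
```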

Multi-task Architecture
The multi-task architecture includes two parts: aspect and opinion tagging, and word-level sentiment dependency parsing.

Aspect and Opinion Tagging. Following the {B, I, O} tagging scheme, we tag each word in the sentence with two taggers, i.e., one tagger for aspects and the other for opinions. In particular, we obtain two series of distributions over the {B, I, O} tags, p_i^ap and p_i^op, by applying softmax-normalized linear layers to r_i^ap and r_i^op. Accordingly, we can deduce the loss function for tagging, typically cross entropy with categorical distributions:

L_ap = -Σ_i Σ_k p̂_(i,k)^ap log p_(i,k)^ap,  L_op = -Σ_i Σ_k p̂_(i,k)^op log p_(i,k)^op

where p̂_i^ap and p̂_i^op respectively denote the ground-truth aspect and opinion tag distributions of each word, and k is an enumerator over each item in a categorical distribution.

Word-level Sentiment Dependency Parsing. There are |S|^2 possible word pairs (including self-pairing cases) in each sentence, and we intend to determine the dependency type of every word pair. The set of dependency types is defined as {NEU, NEG, POS, NO-DEP}, so as to address all kinds of dependencies. Here, NO-DEP denotes no sentiment dependency. In addition, inspired by the table filling methods (Miwa and Sasaki, 2014; Bekoulis et al., 2018), sentiment dependencies are considered only for the pair of words that are exactly the last word of an aspect and the last word of an opinion in a triplet. Consider the example sentence "Great battery, start up speed.". For the triplet (start up speed, great, POS), the sentiment dependency is simplified to (speed, great, POS). As such, the learning redundancy for the parser is much reduced, while the span-level sentiment dependency is still recoverable when combined with the extracted aspect and opinion spans.
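To make the tagging loss concrete, here is a minimal PyTorch sketch of one tagger head; the opinion tagger is identical with separate parameters. The tensors and gold tags are toy stand-ins, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

TAGS = ["B", "I", "O"]        # per-word tag inventory for each tagger
d_r, seq_len = 100, 7

r_ap = torch.randn(1, seq_len, d_r)    # aspect-specific representations
tagger_ap = nn.Linear(d_r, len(TAGS))  # softmax is folded into the loss

logits = tagger_ap(r_ap)               # (1, seq_len, 3)
gold = torch.tensor([[2, 0, 2, 0, 1, 1, 2]])  # toy gold tags: O B O B I I O
# cross_entropy expects logits as (N, C, L); with one-hot gold distributions
# this is exactly the categorical cross entropy of the tagging loss.
loss_ap = F.cross_entropy(logits.transpose(1, 2), gold)
```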
We utilize a biaffine scorer to capture the interaction of the two words in each word pair, owing to its proven expressive power in syntactic dependency parsing (Dozat and Manning, 2017). The score assigned to each word pair is computed as below:

s_(i,j,k) = (r_i^s)ᵀ W^(k) r_j^s + (b^(k))ᵀ r_j^s

where s_(i,j,k) stands for the score of the k-th dependency type for the word pair (w_i, w_j). W^(k) and b^(k) are the trainable weight and bias for producing the k-th score, respectively. Moreover, we use s_(i,j) to denote the softmax-normalized vector of scores, which contains the probabilities of all dependency types for the word pair (w_i, w_j):

s_(i,j) = softmax([s_(i,j,1); ...; s_(i,j,K)])

As observed from this factorization, conceptually the biaffine scorer can not only model the likelihood of w_i receiving w_j as a dependent of a specific type (the first term), but also include the prior probability of w_j being a dependent of such a type (the second term). When implemented, the scorer is essentially an affine transform followed by a matrix multiplication.
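Under the factorization described above (a bilinear term plus a prior term on the dependent word), a biaffine scorer can be sketched as follows. The tensor names and the exact parameterization are our assumptions for illustration; only the factorization itself is given in the text.

```python
import torch

torch.manual_seed(0)
d_r, K, seq_len = 100, 4, 7  # K dependency types: NEU, NEG, POS, NO-DEP

W = torch.randn(K, d_r, d_r) * 0.01  # bilinear weight, one per type
b = torch.randn(K, d_r) * 0.01       # prior weight, one per type

r_head = torch.randn(seq_len, d_r)   # parser-side representation of w_i
r_dep = torch.randn(seq_len, d_r)    # parser-side representation of w_j

# First term: likelihood of w_i receiving w_j as a dependent of type k.
bilinear = torch.einsum('id,kde,je->ijk', r_head, W, r_dep)
# Second term: prior probability of w_j being a dependent of type k.
prior = torch.einsum('kd,jd->jk', b, r_dep).unsqueeze(0)

scores = bilinear + prior            # (seq_len, seq_len, K)
probs = scores.softmax(dim=-1)       # one distribution per word pair
```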
Thereafter, the loss function for word-level sentiment dependency parsing is a cross entropy given below:

L_s = -Σ_i Σ_j Σ_k ŝ_(i,j,k) log s_(i,j,k)

where ŝ_(i,j) is the ground-truth dependency distribution for each word pair (w_i, w_j).
Overall Learning Objective. Ultimately, we can conduct joint training of the multi-task learning framework with the following objective:

L = L_ap + L_op + α L_s + γ ||θ||_2

where α is a trade-off term that balances the learning between tagging and sentiment dependency parsing, and θ stands for the trainable parameters. ||θ||_2 and γ are the L_2 regularization of θ and its controlling coefficient, respectively.

Triplet Decoding
Upon obtaining the extracted aspects, opinions, and word-level sentiment dependencies, we conduct a triplet decoding process using heuristic rules. Basically, we view the sentiment dependencies resulting from the biaffine scorer as pivots, and carry out a reverse-order traverse on the tags generated by the aspect and opinion taggers. Take the sentence "Great battery, start up speed." as an example: suppose we obtain its aspect tags, opinion tags, and a word-level sentiment dependency represented in index form, (6, 1, POS). This sentiment dependency means that the last word of the aspect is the 6-th word (speed), the last word of the opinion is the 1-st word (Great), and together they form a positive sentiment. The traverse is conducted backwards from the aspect and opinion indices (the pivots) over the word sequence, following the stop-on-non-I criterion. The final output should be [(4, 6), (1, 1), POS]. Details of the algorithm are shown in Algorithm 1.

Algorithm 1 Decoding w/ stop-on-non-I criterion.
Input: aspect tags, opinion tags, and word-level sentiment dependencies. Output: decoded opinion triplets. Starting from each pivot index, the traverse moves backwards while I tags are met, and stops on a non-I tag or when exceeding the sentence boundary (j ≤ 0).

Datasets and Evaluation Metrics
We conduct experiments on three datasets in the "restaurant" domain from SemEval 2014, 2015, and 2016 (Pontiki et al., 2014, 2015, 2016), and one dataset in the "laptop" domain from SemEval 2014. Hereafter, we refer to them as REST14, REST15, REST16, and LAPTOP14, respectively. Since they are originally annotated with aspects and sentiments only, we additionally adopt annotations of opinion terms from  and . Each dataset is split into three subsets, namely, a training set, a validation set, and a test set. The statistics of these datasets are shown in Table 1. It is worth noting that, in , the opinion overlapped triplets (in short, OOTs) are removed from all four datasets in the preprocessing step. However, these cases are preserved in our setting. A key observation from the statistics is that there are large numbers of overlapping cases in the datasets, on average accounting for 24.2% of the total number of triplets across all four datasets. This phenomenon suggests the necessity and significance of triplet interaction modelling.
Moreover, we adopt precision, recall, and micro F1-measure as our evaluation metrics for triplet extraction. Only exactly matched triplets, i.e., with all of the aspect, opinion and sentiment matched against gold standards, are viewed as true positives during evaluation. All results are reported by averaging 10 runs with random initialization. Paired t-test is used to examine statistical significance of the results.
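The exact-match evaluation can be sketched as below: a predicted triplet counts as a true positive only if its aspect span, opinion span, and sentiment all match a gold triplet. Spans are written as (start, end) index pairs; the toy data is ours.

```python
def triplet_prf(pred, gold):
    """Precision, recall, and F1 over exactly matched triplets."""
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = [((0, 0), (2, 2), 'POS'), ((4, 5), (7, 7), 'NEG')]
pred = [((0, 0), (2, 2), 'POS'), ((4, 5), (7, 7), 'POS')]  # wrong sentiment
print(triplet_prf(pred, gold))  # (0.5, 0.5, 0.5)
```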

Implementation Details
In our experiments, the word embeddings are initialized with pretrained GloVe word vectors (Pennington et al., 2014). The dimensionalities of the embeddings d_e, hidden states d_h, and aspect and opinion representations d_r are set to 300, 300, and 100, respectively. The trade-off term in the learning objective, i.e., α, is set to 1. The coefficient for L_2 regularization, i.e., γ, is 10^-5. Dropout is applied on the embeddings to avoid overfitting, with a drop rate of 0.5. The learning rate during training is 10^-3, and the batch size is 32. All parameters are initialized from a uniform distribution and optimized with the Adam optimizer. Besides, we set the patience to 5, so that the learning process stops early if there is no further performance improvement on the validation set.

Baselines and Variants
To perform a systematic comparison, we introduce a variety of baselines, which can be classified into two groups, i.e., pipeline methods proposed in , and joint methods we adapted from previous aspect-opinion co-extraction systems based on our framework OTE-MTL. First, we list the baselines with a pipeline structure. (1) Pipeline  decomposes triplet extraction into two stages: stage one predicts unified aspect-sentiment tags and opinion tags, while stage two pairs the results from stage one. We further include three models adjusted in accordance with Pipeline: (2) Unified+ is a typical aspect-sentiment pair extraction system, in which the unified tagging scheme is used. (3) RINANTE+ (Dai and Song, 2019) is originally a weakly supervised aspect-opinion co-extraction system. (4) CMLA+ is an aspect-opinion co-extraction system modelling the interaction between aspects and opinions. Additionally, we adapt two extra baseline models to multi-task learning, resulting in: (5) CMLA-MTL and (6) HAST-MTL (Li et al., 2018b), which are extended from existing state-of-the-art aspect-opinion co-extraction systems.
We also propose a list of variants of OTE-MTL to examine the efficacy of its different components. (a) OTE-MTL-Inter feeds the predictions of aspects and opinions to the biaffine scorer, by imposing tag embeddings and concatenating them to the input of the scorer. (b) OTE-MTL-Concat replaces the biaffine scorer with an activated linear layer applied on the concatenated aspect and opinion representations. (c) OTE-MTL-Unified uses the unified aspect-sentiment tagging scheme and degrades the biaffine scorer to a binary pair classifier, which is similar to Pipeline but jointly trained. (d) OTE-MTL-Collapsed combines the aspect and opinion tagging components into one single module via a collapsed tag set {B-AP, I-AP, B-OP, I-OP, O}, and is thus forced to account for the constraint that aspects and opinions never overlap.

Quantitative Evaluation
Comparison with Baselines. The results in comparison with the baselines are shown in Table 2, on datasets both with and without OOTs for a fair comparison. Our proposed model OTE-MTL consistently outperforms all state-of-the-art baselines on all datasets, with and without OOTs. Thus, we conclude that OTE-MTL is effective for the opinion triplet extraction task.
We observe that the results of OTE-MTL on the datasets without OOTs are generally better than those with OOTs, except for LAPTOP14, implying that the datasets without OOTs are comparatively simpler and a good performance is easier to achieve. Hence, we believe that overlapping cases bring challenges that can be partly addressed via triplet interaction modelling. Nevertheless, CMLA+ performs worse, in contrast to the superior performance of CMLA-MTL. This suggests that decoupling aspect and sentiment predictions and putting them under the multi-task learning framework enhances the model and yields better results. Comparison with Variants. The comparison with the variants of OTE-MTL shown in Table 2 aims to verify the effectiveness of its different components. As a whole, OTE-MTL surpasses all its variants. Specifically, OTE-MTL is only slightly better than OTE-MTL-Inter, but exceeds the other variants by large margins.
Rather than implicitly modelling the interaction between tagging and sentiment dependency parsing, OTE-MTL-Inter explicitly feeds embeddings of the predicted tags to the biaffine scorer, yet achieves inferior performance. We conjecture the reason lies in latent error propagation when the tags are partially wrong, hinting that implicit modelling is the more promising choice. The failure of OTE-MTL-Concat, which cannot model priors, supports the idea of leveraging the biaffine scorer as a word-level sentiment dependency parser. The result of OTE-MTL-Unified indicates that coupling aspect and sentiment extraction is suboptimal. Furthermore, we use OTE-MTL-Collapsed to account for the non-overlap constraint between aspects and opinions; however, it obtains unexpectedly poor results. A possible explanation is that collapsing aspect and opinion representations into one space limits the expressive capacity.

Table 2: Quantitative evaluation results (%). Results of models with marker * are reported on datasets without OOTs. Results of models with marker † are directly cited from . F1 measures in bold are the best performing numbers on each dataset. F1 measures with marker ‡ are significantly better than other numbers on each dataset with paired t-test (p < 0.01).

Qualitative Evaluation
Case Study. To understand how our framework outperforms unified tagging-based approaches, we perform a case study on three representative examples from the test sets, as displayed in Table 3.
We notice that both OTE-MTL-Unified and OTE-MTL work well on the first case, which involves no overlapping. Nonetheless, OTE-MTL-Unified performs less well on the second sample, which contains aspect overlapped triplets and requires triplet interaction modelling. This case also shows conflicting opinions towards one aspect (Tan et al., 2019), which is not covered by the training set but exists in real-world applications. It cannot be handled by coupled aspect-sentiment tags, since a single tag cannot carry diverse sentiments; decoupling sentiments from aspect tags is therefore necessary. In the third example, which involves long-range dependency, both aspect overlap and opinion overlap exist. For this case, OTE-MTL is not strong enough to make all the correct predictions, but still works better than OTE-MTL-Unified. Error Analysis. To further find out the strengths and limitations of OTE-MTL, we conduct a detailed analysis of false positives (extracted by the system but not existing in the ground truth) and false negatives (not extracted by the system but existing in the ground truth) on REST14. For false positives, we categorize them into four classes: false aspect, false opinion, false sentiment, and other (mixed) cases. For false negatives, we divide them according to the categories of overlap (i.e., aspect overlapped, opinion overlapped, normal). Figure 4 shows the analysis results. False positives are largely triggered by only one false element of an extracted triplet, especially the aspect or opinion, motivating us to develop more robust span detection algorithms. In addition, this circumstance might also reflect that exact match is not an ideal evaluation metric, since a minor discrepancy in a span may be harmless for opinion interpretation in practice, as we can observe in Table 3. Likewise, from Figure 4, we posit that overlapping cases are still non-trivial to solve, given that they account for almost half of the false negatives.

Aspect-based Sentiment Analysis
Our work falls in the broad scope of ABSA. As previously discussed, there are two types of approaches in ABSA: aspect-sentiment pair extraction, which concentrates on collaboratively detecting aspects and the attached sentiment orientations (He et al., 2019; Hu et al., 2019), and aspect-opinion co-extraction, which tends to co-extract aspects and opinions (Li et al., 2018b). Alternatively, ABSA is also formulated as determining the sentiment polarity of a given aspect in a sentence (Jiang et al., 2011; Dong et al., 2014; Tang et al., 2016a,b; Li et al., 2018a), which is inflexible for practical use since aspects are not naturally accessible.
In this paper, we unify the aspect-sentiment pair extraction and aspect-opinion co-extraction, and formulate them as a triplet extraction problem. Our work is also aimed at addressing several issues in , as discussed in the Introduction Section.

Triplet Extraction-based Task
Beyond ABSA, many triplet extraction-based tasks lie in the area of natural language processing. For example, Joint Entity and Relation Extraction (JERE) aims at detecting a pair of entity mentions in a sentence and predicting the relation between them. Approaches to JERE can be sorted into four streams: pipeline-based, table filling-based (Miwa and Sasaki, 2014; Bekoulis et al., 2018; Fu et al., 2019), tagging-based (Zheng et al., 2017), and encoder-decoder-based (Zeng et al., 2018). Our work is motivated by the table filling methods in Miwa and Sasaki (2014) and Bekoulis et al. (2018). We decompose triplet extraction into three subtasks, in which word-level sentiment dependency parsing can actually be viewed as a table filling problem, and solve them jointly in a multi-task learning framework.

Conclusions and Future Work
Our work puts forward an opinion triplet extraction perspective for aspect-based sentiment analysis. Existing works applicable to opinion triplet extraction have been shown to be insufficient, owing to the use of a unified aspect-sentiment tagging scheme and the neglect of the interaction between elements of a triplet. Thus, we propose a multi-task learning framework that addresses these limitations by highlighting joint training, decoupled aspect and sentiment prediction, and regularization among correlated tasks during learning. Experimental results verify the effectiveness of our framework in comparison with a wide range of strong baselines. Comparison with different variants of the proposed framework signifies the necessity of its core components.
Based on the observations from the case study and error analysis, we plan to carry out further research in the following directions: (1) more robust taggers for aspect and opinion extraction, (2) more flexible evaluation metrics for triplet extraction, and (3) more powerful triplet interaction mechanisms (e.g., encoder-decoder structures).