Deep Weighted MaxSAT for Aspect-based Opinion Extraction

Though deep learning has achieved significant success in various NLP tasks, most deep learning models lack the capability to encode explicit domain knowledge for modeling complex causal relationships among different types of variables. Logic rules, on the other hand, offer a compact way to express such causal relationships and to guide the training process. Logic programs can be cast as a satisfiability problem that seeks truth assignments to logic variables maximizing the number of satisfied clauses (MaxSAT). We adopt the MaxSAT semantics to model the logic inference process and smoothly incorporate a weighted version of MaxSAT that connects deep neural networks and a graphical model in a joint framework. The joint model feeds deep learning outputs into a weighted MaxSAT layer to rectify erroneous predictions and can be trained via end-to-end gradient descent. Our proposed model combines the benefits of high-level feature learning, knowledge reasoning, and structured learning, with observable performance gains on the task of aspect-based opinion extraction.


Introduction
Aspect-based opinion extraction aims to identify opinion targets (or aspects) of a review corpus that indicate specific product features, as well as the opinion terms expressed towards those aspects. For example, in the sentence "The wine list is excellent", the aspect term is wine list, whereas the opinion term is excellent. Many deep learning models have been proposed for this task via learning high-level features (Liu et al., 2015; Xu et al., 2018a; Wang et al., 2017; Li and Lam, 2017; Yin et al., 2016). However, these methods fail to explicitly encode prior knowledge on the relationships among aspect terms and opinion terms, which is crucial for the task at hand, as shown in earlier rule-based models (Hu and Liu, 2004; Qiu et al., 2011). In the previous example, if wine list is extracted as an aspect term and it has the dependency relation "nsubj" with excellent, which is an adjective, then we can deduce that excellent is an opinion term. Though in (Yu et al., 2019) rules are incorporated as constraints into a deep neural network, the constraints cannot be backpropagated to the feature learning process. Recently, Wang and Pan (2020) proposed a joint model to combine deep learning with logic rules via minimizing the discrepancy between them. Their approach, however, only indirectly guides deep learning during training, without the ability to rectify predictions according to logic rules at inference time.

* This work was done when the first author was an undergraduate student with Nanyang Technological University.
To address the aforementioned limitations for aspect-based opinion extraction, we propose a novel joint model, DeepWMaxSAT, to integrate logic knowledge via a weighted MaxSAT layer into a deep learning architecture. Specifically, DeepWMaxSAT consists of 1) a DNN layer that transforms an input embedding to a high-level feature representation; 2) a weighted MaxSAT layer that takes DNN outputs as the initial probabilistic evaluations of the logic variables and produces the values of the output logic variables corresponding to the head atoms of the selected logic rules; 3) a conditional random field (CRF) (Lafferty et al., 2001) layer that generates structured outputs (label sequences) considering linear context interactions among the tokens in a sequence. Moreover, to fully inherit the advantages of both DNNs and logic programs, we adopt a form of residual connection that combines the DNN predictions and the outputs from the weighted MaxSAT layer with a learnable weight, which is then fed into the CRF layer.
It is worth noting that the weighted MaxSAT layer contains all the prior knowledge about the correlations among aspect and opinion terms, encoded in conjunctive normal form (CNF) for all the logic rules. For example, the association between the aspect term wine list and the opinion term excellent in the previous example can be expressed using CNF as ¬aspect(list) ∨ ¬nsubj(list, excellent) ∨ ¬obj(excellent) ∨ opinion(excellent), which is converted from the first-order-logic (FOL) rule obj(excellent) ∧ nsubj(list, excellent) ∧ aspect(list) ⇒ opinion(excellent). A learnable weight is associated with each disjunctive clause in the CNF formula to indicate its confidence. The weighted MaxSAT layer is able to rectify DNN predictions according to preset rules; at the same time, the loss signal for the final predictions can be backpropagated smoothly through the weighted MaxSAT layer to the DNN parameters to guide the training of the deep learning model. Though Wang et al. (2019) proposed a differentiable satisfiability solver that integrates MaxSAT into deep learning, they assumed a fixed set of rules that always hold, making their approach less flexible for general NLP problems where data can be noisy. With this consideration, we adopt an attention mechanism to adaptively select useful rules in the weighted MaxSAT layer for each data instance and treat the learnable attention scores as rule weights. The intuition is that different data instances may fit different rules with varying probabilities.
To summarize, our contributions include: • We propose a novel attention-based weighted MaxSAT solver that can selectively rectify and update deep learning predictions according to the relevance of specific rules.
• An end-to-end joint model associating DNNs, logic reasoning and structured learning is introduced to enhance the model performance.
• We focus on evaluating the effectiveness of encoding manually-designed prior knowledge as logic rules into a deep architecture. To achieve this, a real NLP application, namely aspect-based opinion extraction, is chosen, which is noisy but contains certain syntactic regularities that are difficult to capture with pure deep learning models.
• We demonstrate the generality of the proposed joint framework over different DNN systems and word embeddings on the task of aspect-based opinion extraction.

Related Work
Aspect-based Opinion Extraction Various deep learning approaches have been introduced for aspect-based opinion extraction, including context-based recurrent neural networks (Liu et al., 2015) and convolutional neural networks (Xu et al., 2018a), dependency-tree-based models (Yin et al., 2016), and attention-based models (Wang et al., 2017; Li and Lam, 2017). Despite the promising performances, it is hard to interpret deep learning models or to explicitly encode prior knowledge into them. Prior knowledge was commonly used in earlier works by designing specific features and rules relating aspect terms and opinion terms (Hu and Liu, 2004; Popescu and Etzioni, 2005; Wu et al., 2009; Qiu et al., 2011). Yu et al. (2019) used integer linear programming with explicit constraints for joint inference as a post-processing step. However, these rule-based methods fail to propagate the training signal to the feature learning process, making them suboptimal. In another dimension, graphical models were also proposed to model the contextual or syntactic interactions among tokens (Jin and Ho, 2009; Li et al., 2010). However, the optimization process is usually non-trivial, especially for complex graphical structures. Recently, Wang and Pan (2020) introduced a logic-informative deep learning model that converts the relations among aspect and opinion terms to logic rules. Nevertheless, the logic rules only implicitly guide the training process of the DNN and fail to rectify DNN predictions directly.
Deep Learning with Logic Reasoning Recent years have witnessed an increasing focus on neural-symbolic learning that combines deep learning systems with discrete symbolic rules (Garcez et al., 2012; Manhaeve et al., 2018; Dong et al., 2019; Sourek et al., 2018), by constructing a logic network or connecting distributed systems with logic rules for reasoning and inference in the logic domain. Xu et al. (2018b) treated logic knowledge as a semantic regularization in the loss function. For NLP applications, neural-symbolic systems were recently proposed in (Rocktäschel et al., 2015; Guo et al., 2016) for relation and knowledge graph learning, embedding logic into the same space as distributed features in a single system. Logical knowledge has also been incorporated as a form of posterior regularization in (Hu et al., 2016) to enhance deep learning predictions. Moreover, logic rules can be used as evidence to construct adversarial sets (Minervini et al., 2017; Minervini and Riedel, 2018), or as a form of indirect supervision (Wang and Poon, 2018) to improve model training. Li and Srikumar (2019) further augmented deep learning models with logic neurons that can be trained together with the neural networks.

Problem Definition
We treat the extraction problem as a sequence labeling task. Given a sequence of tokens {w_1, w_2, ..., w_n}, sequence labeling produces a segmentation label y_i for each token w_i, where y_i ∈ Y = {B-ASP, I-ASP, B-OPN, I-OPN, O}. We use the BIO encoding scheme to differentiate whether a token is the beginning of an aspect/opinion term (B-ASP/B-OPN), inside an aspect/opinion term (I-ASP/I-OPN), or outside any target (O). A first-order-logic (FOL) rule, or a clause, has the form a_1 ∧ a_2 ∧ ... ∧ a_K ⇒ h, where a_1 ∧ a_2 ∧ ... ∧ a_K is the rule body containing a conjunction of atoms a_k, and h is the head atom. Here, an atom is an n-ary predicate a_k = pred_k(x_1, ..., x_n), with x_1, ..., x_n representing n variables. A ground atom assigns a constant to each variable in its argument. A set of FOL rules can be transformed into a conjunctive normal form (CNF), which is a conjunction of one or more disjunctive clauses; e.g., the clause ¬a_1 ∨ ¬a_2 ∨ ... ∨ ¬a_K ∨ h is converted from a_1 ∧ a_2 ∧ ... ∧ a_K ⇒ h. Each disjunctive clause corresponds to an FOL rule, and when the CNF formula is satisfied, all its corresponding FOL rules are true. In our setting, we treat the linguistic features, e.g., dependency relations and POS tags, as well as the segmentation labels, as different predicates. For example, B-ASP(w_i) is a ground atom indicating that w_i is the beginning of an aspect term. We utilise these atoms to form the CNF formula in the MaxSAT formulation.
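The BIO scheme above can be illustrated with a short sketch. The sentence, labels, and the helper function name are hypothetical, chosen only to show how spans are recovered from tag sequences:

```python
# A minimal illustration of the BIO labeling scheme described above.
def extract_terms(tokens, labels, prefix):
    """Collect the spans tagged B-<prefix>/I-<prefix> as (multi-word) terms."""
    terms, current = [], []
    for tok, lab in zip(tokens, labels):
        if lab == f"B-{prefix}":
            if current:
                terms.append(" ".join(current))
            current = [tok]
        elif lab == f"I-{prefix}" and current:
            current.append(tok)
        else:
            if current:
                terms.append(" ".join(current))
            current = []
    if current:
        terms.append(" ".join(current))
    return terms

tokens = ["The", "wine", "list", "is", "excellent"]
labels = ["O", "B-ASP", "I-ASP", "O", "B-OPN"]
print(extract_terms(tokens, labels, "ASP"))  # ['wine list']
print(extract_terms(tokens, labels, "OPN"))  # ['excellent']
```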

Differentiable MaxSAT Solver
The maximum satisfiability problem (MaxSAT) is the problem of determining a truth assignment that maximizes the number of satisfied clauses. Given a CNF formula c_1 ∧ ... ∧ c_m with m disjunctive clauses c_1, ..., c_m over a total of n different atoms a_1, ..., a_n, each atom takes one of 2 assignments v_i ∈ {−1, +1} indicating its truth value. For each clause c_j, we denote its signs with respect to all the atoms by s_j ∈ {−1, 0, +1}^n, where s_{ji} ∈ s_j takes −1, 0 or +1 indicating the sign of atom a_i in clause c_j, with 0 representing the absence of a_i. The MaxSAT problem can then be cast as the following optimization problem:

max_{v ∈ {−1,+1}^n} Σ_{j=1}^m 1{∃i : s_{ji} v_i > 0}.   (1)

To solve this problem, Wang et al. (2019) transformed (1) into the following relaxed objective by replacing each discrete v_i with a continuous unit vector ṽ_i ∈ R^k defined with respect to some "truth direction" ṽ_⊤:

min_V ⟨S^T S, V^T V⟩, s.t. ||ṽ_i|| = 1, i = ⊤, 1, ..., n,   (2)

where V = [ṽ_⊤, ṽ_1, ..., ṽ_n] and S collects the normalized clause sign vectors. Problem (2) can be solved via coordinate descent with the following update:

ṽ_i = −g_i / ||g_i||, where g_i = V S^T s_i − ||s_i||^2 ṽ_i.   (3)

This update is guaranteed to converge to the global optimum as long as k > √(2n). To obtain the final probabilistic evaluation for atom a_i, we convert the updated ṽ_i to p(v_i = 1) = cos^{-1}(−ṽ_i^T ṽ_⊤)/π.
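A toy numeric sketch of this relaxation and its coordinate descent update follows. The clause matrix, dimensions, and initialization are illustrative stand-ins, not the paper's configuration; only the update rule and the probability read-out follow the formulation above:

```python
import math
import random

random.seed(0)
K = 4        # relaxation dimension; the theory requires k > sqrt(2n)
N = 3        # number of atoms; column 0 of V is the truth direction

def unit(v):
    nrm = math.sqrt(sum(x * x for x in v))
    return [x / nrm for x in v]

# Illustrative sign matrix S (m clauses x (n+1) columns), normalized by
# 1/sqrt(4|s_j|); a real layer builds it from the CNF clauses.
S = [[-1, 1, 1, 0],
     [-1, -1, 0, 1]]
S = [[x / math.sqrt(4 * sum(abs(y) for y in row)) for x in row] for row in S]

# Columns of V: random unit vectors in R^K.
V = [unit([random.gauss(0, 1) for _ in range(K)]) for _ in range(N + 1)]

def objective():
    # <S^T S, V^T V> = sum_j || sum_i S[j][i] * v_i ||^2
    total = 0.0
    for row in S:
        acc = [sum(row[i] * V[i][d] for i in range(N + 1)) for d in range(K)]
        total += sum(a * a for a in acc)
    return total

before = objective()
for _ in range(50):                 # coordinate descent sweeps
    for i in range(1, N + 1):       # the truth direction (column 0) stays fixed
        g = [0.0] * K
        for row in S:
            w = [sum(row[l] * V[l][d] for l in range(N + 1)) for d in range(K)]
            for d in range(K):
                g[d] += row[i] * w[d]
        sq = sum(row[i] * row[i] for row in S)
        g = [g[d] - sq * V[i][d] for d in range(K)]
        if sum(x * x for x in g) > 1e-12:
            V[i] = unit([-x for x in g])   # v_i = -g_i / ||g_i||
after = objective()

# Read out probabilistic evaluations p(v_i = 1) = acos(-v_i . v_top) / pi.
probs = [math.acos(max(-1.0, min(1.0,
         -sum(a * b for a, b in zip(V[i], V[0]))))) / math.pi
         for i in range(1, N + 1)]
print(before, after, probs)
```

Each coordinate step minimizes the objective exactly in ṽ_i, so the objective is non-increasing across sweeps, and the read-out maps each relaxed vector back to a probability in [0, 1].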

Methodology
In this section, we present our proposed model in detail. To make the logic knowledge more effective, so that it can directly rectify erroneous predictions made by deep learning models while adapting its rules selectively to different data instances, we propose a neural-symbolic integration that incorporates an attention-based weighted MaxSAT layer. The attention mechanism is used to automatically select relevant logic rules for each specific data instance and to weigh the importance of each rule in the final objective. Furthermore, we also integrate a CRF layer to generate structured predictions. As a result, the joint framework inherits the advantages of high-level feature learning, knowledge reasoning and structured learning. Figure 1 provides an overview of the proposed model. It consists of 3 layers: 1) a deep learning module that takes input embeddings x_1, ..., x_N and generates a prediction q_1, ..., q_N for each word via feature learning; 2) a weighted MaxSAT layer that takes the deep learning predictions as the initial probabilistic evaluations p(v_i = 1) of the input atoms a_i and generates probabilistic values p(v_o = 1) of the output atoms; 3) a CRF layer that combines the outputs from the previous 2 layers with a residual connection to produce the final structured predictions ȳ_1, ..., ȳ_N. The joint model can be trained in an end-to-end manner via gradient descent, which is indicated by the dotted arrows in Figure 1. We illustrate each component in more detail in the sequel.

Deep Learning Layer
The deep learning layer aims to capture a high-level feature representation for each word, considering the complex interactions among different words within a sentence.¹ We use a transformer model that takes a combination of word embedding x^e_i and POS tag embedding x^p_i as input and generates a hidden representation h_i for each word via a multi-layer self-attention mechanism. Specifically, at the l-th layer of the transformer, each attention head computes an interaction factor between each token and the other tokens within the sentence to produce the layer output, where each h_{i,l−1} is a column vector of the hidden-state matrix from layer l−1. A Bi-GRU (gated recurrent unit) f_θ is then applied on top of the last transformer layer output h_{i,L} to produce context-sensitive hidden representations. The final prediction q_i for each word is obtained via a fully-connected layer with a softmax activation function, taking as input the hidden representation together with the label embedding x^l_{i−1} of the preceding token.

¹ It is flexible to adopt different deep learning models with various word embeddings. To demonstrate such flexibility, we use different DNNs and word embeddings in experiments. Here, we only describe a transformer-style DNN for illustration.
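The prediction head at the end of this stack can be sketched as follows. The hidden vector, weights, and dimensions are made-up toy values; the transformer/Bi-GRU stack producing h_i is omitted, and only the fully-connected-plus-softmax step is shown:

```python
import math

TAGS = ["B-ASP", "I-ASP", "B-OPN", "I-OPN", "O"]

def softmax(logits):
    m = max(logits)                       # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def predict(hidden, weights, bias):
    """One fully-connected layer: logits = W h + b, then softmax over tags."""
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)

hidden = [0.2, -0.1, 0.4]                 # toy hidden representation h_i
weights = [[0.1, 0.0, 0.3], [0.0, 0.2, 0.0],
           [0.3, 0.1, -0.2], [0.0, 0.0, 0.1], [0.2, 0.4, 0.0]]
bias = [0.0] * 5
q = predict(hidden, weights, bias)        # q_i: distribution over the 5 tags
print(max(zip(q, TAGS)))                  # most probable tag
```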

Weighted MaxSAT Layer
As discussed in Section 3.1, we convert FOL rules to CNF formulas, which consist of multiple disjunctive clauses, so that they can be fed into the MaxSAT solver. In our problem setting, each atom in a clause is a 1-ary or 2-ary predicate. For example, a clause of the form ¬ASP(Y) ∨ ¬POS_NOUN(Y) ∨ ¬dep_nsubj(X, Y) ∨ ¬POS_ADJ(X) ∨ OPN(X) indicates that if Y is an aspect word with POS tag "NOUN", Y has the dependency relation "nsubj" with X, and X has POS tag "ADJ", then we can deduce that X is an opinion word. This clause fits the sentence "The wine list is excellent" for extracting excellent as an opinion word once wine list is correctly predicted as an aspect term. The clauses we adopt are shown and explained in Figure 2.
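The grounded clause from this example can be checked mechanically. The sketch below is a hypothetical illustration with X = "excellent" and Y = "list"; the predicate strings mirror the text but are not the model's internal representation:

```python
# Evaluating one grounded disjunctive CNF clause against a truth assignment.
def clause_satisfied(literals, assignment):
    """A disjunctive clause holds if at least one literal is true.
    Each literal is (atom, polarity); polarity False means the atom is negated."""
    return any(assignment[atom] == polarity for atom, polarity in literals)

clause = [("ASP(list)", False),                 # ¬ASP(Y)
          ("POS_NOUN(list)", False),            # ¬POS_NOUN(Y)
          ("dep_nsubj(excellent,list)", False), # ¬dep_nsubj(X, Y)
          ("POS_ADJ(excellent)", False),        # ¬POS_ADJ(X)
          ("OPN(excellent)", True)]             # head atom OPN(X)

assignment = {"ASP(list)": True, "POS_NOUN(list)": True,
              "dep_nsubj(excellent,list)": True,
              "POS_ADJ(excellent)": True, "OPN(excellent)": True}
print(clause_satisfied(clause, assignment))  # True: head atom holds

assignment["OPN(excellent)"] = False
print(clause_satisfied(clause, assignment))  # False: body true, head false
```

When the whole body is true but the head is false, the clause is violated, which is exactly the situation the weighted MaxSAT layer is meant to repair.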
In the weighted MaxSAT layer, we define the set of all atoms {a_1, ..., a_n} as the atoms appearing in Figure 2, including label atoms (e.g., ASP(Y), I-ASP(X), OPN(X)), POS atoms (e.g., POS_NOUN(Y), POS_ADJ(X)) and dependency relation atoms (e.g., dep_compound(X, Y)). As shown in (2), the MaxSAT problem can be relaxed with the converted sign matrix S and the atom value matrix V. Here, S is computed from the given clauses as our prior knowledge and kept fixed during training. To obtain V = [ṽ_⊤, ṽ_1, ..., ṽ_n], we take the softmax predictions from the deep learning layer as the initialized probabilistic value of each atom. Specifically, denote by p(v_1 = 1), ..., p(v_n = 1) ∈ [0, 1] the probabilistic evaluations of all the atoms a_1, ..., a_n. If a_i is one of the label atoms, i.e., a_i ∈ {ASP(X), I-ASP(X), OPN(X)}, we take the DNN predictions as the initial evaluations for the corresponding atoms, e.g., p(v_i = 1) = q^{B-OPN}_i when a_i = B-OPN(X), where q^{B-OPN}_i is the DNN prediction for the class B-OPN. When a_i corresponds to an atom of POS tags or dependency relations, e.g., a_i = dep_nsubj(X, Y), we use a 0/1 assignment for p(v_i = 1) obtained through the Stanford Parser, where 0 indicates non-existence of the corresponding POS tag or dependency relation, and 1 indicates its existence. Different from existing works using a differentiable MaxSAT solver, we assign a probabilistic weight w_j ∈ [0, 1] to each clause indicating its confidence of being true, which is updated during training. To adapt the logic knowledge to the noisy dataset, where each clause is not guaranteed to always hold for different data instances, we adopt an attention mechanism to compute an adaptive clause weight for each data instance, which measures the similarity between the DNN predictions and each specific clause grounding.
Since in real cases each data instance may satisfy at most 2 clauses, we use the sparsemax operator to transform the attention scores such that only 1 or 2 clauses are chosen at a time:

w^z_j = sparsemax_j(α^z), with α^z_j ∝ ŝ_j^T v^z,   (7)

where sparsemax(α) = argmin_{x ∈ Δ^{N−1}} ||x − α||^2, and w^z_j represents the weight for clause c_j corresponding to data instance z. Here, v^z ∈ R^{n−1} is the initial probabilistic evaluation vector for the atoms A = {a_i}_{i ≠ n_h} excluding the head atom of the rule corresponding to data instance z, and ŝ_j = |s_j|, where s_j ∈ R^{n−1} corresponds to the signs of each atom except the head atom of the rule. In our context, a data instance z corresponds to a pair of words (w_1, w_2) which are the instantiations of X and Y, respectively, in Figure 2. Intuitively, by using (7), the model tends to select the most relevant rules/clauses according to the similarity between the rule body and the values of the associated groundings (e.g., POS tags, dependency relations and DNN predictions for each token).
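The sparsemax operator itself can be sketched directly from its definition as a Euclidean projection onto the probability simplex (Martins and Astudillo, 2016). The input scores below are made up for illustration; the point is the sparsity behavior:

```python
# Sparsemax: Euclidean projection of a score vector onto the probability
# simplex. Unlike softmax, it can assign exactly zero weight to weak clauses.
def sparsemax(scores):
    z = sorted(scores, reverse=True)
    cum, k, cum_k = 0.0, 0, 0.0
    for j, zj in enumerate(z, start=1):
        cum += zj
        if 1 + j * zj > cum:       # support condition of the projection
            k, cum_k = j, cum
    tau = (cum_k - 1.0) / k        # threshold computed from the support
    return [max(s - tau, 0.0) for s in scores]

# One clause clearly dominates: all weight concentrates on it.
print(sparsemax([2.0, 1.0, 0.1]))   # [1.0, 0.0, 0.0]
# Close scores keep several clauses active (weights stay near the scores).
print(sparsemax([0.5, 0.4, 0.1]))
```

This matches the intended behavior in the layer: a strongly matching clause receives weight 1 and all others are zeroed out, while a few comparably relevant clauses share the weight.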
With the incorporation of the attention-based rule weights, the original MaxSAT objective is transformed into the following form:

min_V ⟨U^T U, V^T V⟩, s.t. ||ṽ_i|| = 1,   (8)

where V = [ṽ_⊤, ṽ_1, ..., ṽ_n] and U = WS, with S = [s_⊤, s_1, ..., s_n] diag(1/√(4|s_j|)) ∈ R^{m×(n+1)} and W = diag(w_j), j = 1, ..., m. By using coordinate descent, the update for ṽ_i becomes

ṽ_i = −g_i / ||g_i||, where g_i = V U^T u_i − ||u_i||^2 ṽ_i.   (9)

Note that we apply (9) to compute ṽ_o until convergence, with o being the index of the head atom of the rules selected by the attention mechanism. We then convert the real vector to a probabilistic evaluation via p(v_o = 1) = cos^{-1}(−ṽ_o^T ṽ_⊤)/π. For ease of illustration, for each data instance z we denote by p^z = f_MaxSAT(q^z) the output probability from the weighted MaxSAT layer. Intuitively, f_MaxSAT aims to produce a rule-satisfying evaluation of the corresponding head atom, given the DNN predictions of the input body atoms. When the DNN prediction for the head atom is not accurate, the MaxSAT layer is able to revise its value. In the meantime, the partial gradient of the final loss with respect to the MaxSAT output is backpropagated through the weighted MaxSAT layer to the DNN parameters, making logic rules a form of indirect supervision for training the DNN.

CRF Layer
To further mitigate the degradation problem caused by inaccurate MaxSAT updates or uncertain DNN predictions, we use a residual connection with a trainable gate r to combine the outputs from both the DNN layer and the weighted MaxSAT layer as

ȳ_i = r · p_i + (1 − r) · q_i,   (10)

where q_i and p_i represent the outputs from the DNN and the MaxSAT layers, respectively. On top of this combination, a CRF layer is applied to generate the structured prediction outputs, which takes into consideration the sequential dependencies among entities. Denote by x and y = (y_1, ..., y_N) the input and the output of the CRF layer, respectively. The CRF layer computes the conditional distribution as follows:

p(y|x) = exp(f(x, y)) / Σ_{y'} exp(f(x, y')),

where f(x, y) = Σ_i log ψ_i(x, y) + Σ_i log φ_i(y).
Here, ψ_i(x, y) and φ_i(y) indicate the unary and pairwise potentials, respectively. To integrate the information from the preceding layers, we substitute ψ_i(x, y) with ȳ_i obtained via (10). The pairwise potential is determined by a trainable transition matrix specifying the score of transitioning from each label tag to the other labels.
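Decoding under these potentials can be sketched with standard Viterbi dynamic programming. The tag set follows the paper; all unary and transition scores below are illustrative, not learned values:

```python
# A minimal sketch of CRF (Viterbi) decoding with unary potentials, which
# in the model come from the gated DNN/WMaxSAT combination, and a
# transition matrix over tags.
TAGS = ["B-ASP", "I-ASP", "B-OPN", "I-OPN", "O"]

def viterbi(unary, transition):
    """unary[t][j]: log-potential of tag j at position t; transition[i][j]:
    score of moving from tag i to tag j. Returns the best tag sequence."""
    n = len(TAGS)
    score = list(unary[0])
    backptrs = []
    for t in range(1, len(unary)):
        new_score, ptrs = [], []
        for j in range(n):
            best = max(range(n), key=lambda i: score[i] + transition[i][j])
            new_score.append(score[best] + transition[best][j] + unary[t][j])
            ptrs.append(best)
        score = new_score
        backptrs.append(ptrs)
    path = [max(range(n), key=lambda j: score[j])]
    for ptrs in reversed(backptrs):
        path.append(ptrs[path[-1]])
    return [TAGS[j] for j in reversed(path)]

# The transition matrix penalizes O -> I-ASP heavily, so even though the
# unary scores slightly prefer I-ASP for the second token, B-ASP is decoded.
transition = [[0.0] * 5 for _ in range(5)]
transition[TAGS.index("O")][TAGS.index("I-ASP")] = -10.0
unary = [[0.0, 0.0, 0.0, 0.0, 5.0],    # token 1: strongly "O"
         [3.0, 3.5, 0.0, 0.0, 0.0]]    # token 2: I-ASP vs B-ASP
print(viterbi(unary, transition))      # ['O', 'B-ASP']
```

This illustrates why the CRF layer helps: sequentially inconsistent tag pairs such as O followed by I-ASP are suppressed even when the per-token scores favor them.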

Training
The entire model can be trained in an end-to-end manner via gradient descent, with the final loss function L = −Σ_d log p(ŷ_d | x_d), where ŷ_d is the ground-truth label sequence for data x_d. During training, the objective updates the weighted MaxSAT layer according to (10) and (9) via the chain rule:

∂L/∂ṽ_o = (∂L/∂ȳ)(∂ȳ/∂p)(∂p/∂ṽ_o),   (14)
∂L/∂ṽ_i = (∂L/∂ṽ_o)(∂ṽ_o/∂ṽ_i),   (15)

where ṽ_o and ṽ_i correspond to the output index (head atom) and the input index (body atoms), respectively. Following Wang et al. (2019), we take the analytical form of the resulting gradients to compute (14) and (15), respectively. Note that the gradients of the DNN parameters (denoted by Θ) are obtained by backpropagating information from both the final loss function L and the MaxSAT gradient ∂L/∂ṽ_i via

∂L/∂Θ = (∂L/∂q)(∂q/∂Θ) + (∂L/∂ṽ_i)(∂ṽ_i/∂q)(∂q/∂Θ).

Experiment
We conduct experiments on the benchmark datasets from SemEval Challenge 2014 task 4 (subtask 1), which consist of a restaurant domain and a laptop domain (Pontiki et al., 2014), and on a restaurant corpus from SemEval 2016 task 5 (Pontiki et al., 2016). The details of each dataset are listed in Table 1. For preprocessing, we use the NLTK toolkit for tokenization, POS tagging and generating the dependency parse tree of each sentence. We run our experiments on 1 Tesla P100-PCIE-16GB GPU. For the joint model, an epoch with 3000 data instances takes around 20 minutes, and it takes 10 epochs to achieve the optimal performance.

Experimental Setting
Following the setting in prior work, the pre-training of word embeddings is first conducted using word2vec on the Yelp Challenge dataset 3 and the electronics dataset in Amazon reviews 4 for the restaurant and laptop domains, respectively. Following (Vaswani et al., 2017), we add positional encoding on top of the input representations in the transformer network. We assign 10 heads to the multi-head self-attention model, which generates attention weight parameters with dimension 10. We set the word embedding dimension to 300, the POS-tag embedding to 50, the hidden layer to 200, and the label embedding to 25. For training, we adopt the Adadelta optimizer with a learning rate of 2e-3 and a weight decay of 5e-4. All parameters are chosen based on cross-validation. To evaluate model performance, F1 scores on non-negative classes are adopted, where a prediction is counted as correct if and only if the predicted tag exactly matches the true label for each aspect/opinion term.
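The exact-match F1 metric described above can be sketched as follows. The span representation (start, end) and the function name are assumptions for illustration; the data points are made up:

```python
# Exact-match F1: a predicted aspect/opinion term counts as a true positive
# only if its span matches a gold term exactly.
def f1_exact(pred_spans, gold_spans):
    pred, gold = set(pred_spans), set(gold_spans)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = [(1, 3), (7, 8)]     # e.g. gold aspect spans
pred = [(1, 3), (5, 6)]     # one exact match, one spurious span
print(f1_exact(pred, gold))  # 0.5
```

Note that a partially overlapping span (e.g. predicting only "wine" out of "wine list") earns no credit under this metric.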

Overall Results
We evaluate our model performance by comparing with the following well-known baseline methods: • RNCRF: A joint model combining a dependency-based recursive neural network with CRF to model syntactic interactions among aspect and opinion terms.
• CMLA (Wang et al., 2017): Coupled attention network with tensor-based interaction for coextraction of aspect and opinion terms.
• GMTCMLA (Yu et al., 2019): Global inference with multi-task neural networks that regularize DNN predictions with integer linear programming.
The comparison results are shown in Table 2, where the last column corresponds to our proposed model. Clearly, our model achieves the best performance on almost all tasks across the 3 datasets. The first 3 models represent pure deep learning methods adopting either dependency trees (RNCRF), attention-based interactions (CMLA), or contextual interactions using convolutional neural networks (Demb). These methods, however, assume that the complex interactions among aspect terms and opinion terms can be captured via implicit feature learning alone. When feeding prior knowledge as constraints into integer linear programming, GMTCMLA is able to regularize deep learning predictions, but without the ability to backpropagate error information; hence, its performance does not show a clear improvement. DeepLogic is able to update the deep learning model by treating logic rules as indirect supervision, but without the capability to directly revise DNN outputs, it shows suboptimal performance compared to our proposed model.
We further conduct a qualitative analysis to demonstrate how the weighted MaxSAT (WMaxSAT) layer rectifies the erroneous predictions made by deep neural networks. Some representative cases where WMaxSAT corrects DNN predictions are shown in Table 3. The left column shows predictions made by the deep learning model, with incorrectly predicted words marked in red. The right column shows the corresponding predictions made by applying a WMaxSAT layer on top of the DNN outputs. All the mislabeled words are corrected in these cases, demonstrating the effect of our proposed model.

Ablation Analysis
To further demonstrate the effect of each component of our proposed model, we conduct ablation experiments with 6 different model settings as shown in Table 4. The advantage of DNN+WMaxSAT over DNN alone in most cases reveals the power of using WMaxSAT to incorporate domain knowledge. Using CRF further improves the model performance through effective capturing of sequential correlations among terms. To show the advantage of using the proposed attention mechanism for rule weight computation, we compare with 2 other variations of the MaxSAT layer. DNN+MaxSAT+CRF assumes each logic rule is correct at all times (weights fixed to 1.0), whereas DNN+MaxSAT*+CRF assigns each rule a single weight shared across all data instances; the rule weights in this variant are randomly initialized and trained through the learning process. As can be seen, in most cases the attention-based WMaxSAT is the most effective for aspect/opinion extraction.
Our proposed model is flexible enough to integrate any deep learning module or pre-trained word embedding. To show the generality and advantage of combining DNNs with logic reasoning and structured learning, we replace the transformer model in the deep learning layer with 2 other commonly used word embeddings, namely BERT (Devlin et al., 2019) and ELMo (Peters et al., 2018), each followed by a BiGRU layer. The results for using different word embeddings with different model settings are shown in Table 5. Clearly, BERT achieves better performance than ELMo in general. It is worth noting that the weighted MaxSAT layer always brings a performance gain when combined with the DNN model. The joint model over all three components produces the best results when using BERT as the word embedding, whereas joining ELMo with WMaxSAT produces comparable performance with or without CRF.
To provide a clear idea of the effect of each logic rule described in Figure 2, we conduct experiments feeding each single clause into the WMaxSAT layer, as shown in Table 6. We observe the best performance on aspect extraction when using only c_1 for the restaurant domain and c_3 for the laptop domain. For opinion extraction, c_2 is the most effective for both domains. However, using separate rules is inferior to using all 5 rules for opinion extraction. We also analyze the percentage of each rule being selected during training, as shown in the second row of Table 6. On average, most rules have about a 20% chance of being selected, which shows that the attention model is able to select diverse rules according to different data characteristics.
In the previous experiments, we initialize the residual connection gate r as 1.0 and update it through the training process. To demonstrate the effect of different initializations of this hyperparameter, we conduct another experiment varying the value of r from 0.1 to 1.0. As shown in Figure 3, the F1 scores do not fluctuate substantially when 0.1 ≤ r ≤ 0.9. When r = 1.0, there is a clear change in F1 scores. The reason might be that some logic rules are not always feasible for the actual noisy dataset, especially when some general objects, which should be regarded as aspect terms according to the rules, are not labeled as aspect terms. For example, in the sentence "This place is amazing", amazing is labeled as an opinion term whereas place is not labeled as an aspect term, which contradicts rule c_4. When training with r < 1.0, the combination of label supervision and rule c_4 may result in missing the opinion term amazing given that place is not an aspect term. In other words, the joint model tries to find a tradeoff between the labels and the rules that keeps the results of aspect extraction and opinion extraction more balanced, avoiding the evident performance gap observed when r = 1.0.

Conclusion
We propose a novel joint model that inherits the advantages of high-level feature learning, logic reasoning and structured learning, and can be trained smoothly in an end-to-end manner. To adapt logic knowledge to noisy real applications, we introduce an attention mechanism that generates an adaptive weight for each logic rule corresponding to each data instance. The attention weights control the information flow between the deep neural networks and the MaxSAT layer, automatically weighing the relevance of each rule to the given data. Extensive experiments verify both quantitatively and qualitatively the effectiveness of the proposed model.