PoD: Positional Dependency-Based Word Embedding for Aspect Term Extraction

Dependency context-based word embedding jointly learns the representations of word and dependency context, and has been proved effective in aspect term extraction. In this paper, we design the positional dependency-based word embedding (PoD) which considers both dependency context and positional context for aspect term extraction. Specifically, the positional context is modeled via relative position encoding. Besides, we enhance the dependency context by integrating more lexical information (e.g., POS tags) along dependency paths. Experiments on SemEval 2014/2015/2016 datasets show that our approach outperforms other embedding methods in aspect term extraction.


Introduction
Aspect term extraction aims to extract expressions that represent properties of products or services from online reviews (Hu and Liu, 2004a; Hu and Liu, 2004b; Popescu and Etzioni, 2007; Liu, 2010). Understanding the context between words in reviews, such as through conditional random fields (Pontiki et al., 2014; Pontiki et al., 2015; Pontiki et al., 2016), is the key to superior results in aspect term extraction. Word embeddings are effective at capturing contextual information across a wide range of NLP tasks (Tai et al., 2015; Lei et al., 2015; Bojanowski et al., 2017; Devlin et al., 2019). However, they only produce moderate results in aspect term extraction. Recent studies (e.g., Yin et al. (2016)) indicate that this is due to the distributed nature of the word embedding (Mikolov et al., 2013b), which ignores the rich context between words, such as syntactic information.
In this paper, we propose positional dependency-based word embedding (POD) to enhance the context modeling capability for aspect term extraction. POD explicitly captures two types of contexts, dependency context and positional context. Inspired by the simple-yet-effective position encoding in Transformer (Vaswani et al., 2017), POD models the positional context via relative position encoding (Shaw et al., 2018) between words within a fixed window. Besides, the dependency context is defined as the dependency path as well as the attached lexical information (e.g., POS tags and words) along the path. Moreover, POD is able to incorporate more lexical information into the semantic compositional model via the dependency context, making representations of dependency paths more informative than the ones that only consider grammatical information (Yin et al., 2016). We then linearly combine the dependency and positional context to produce the positional dependencies among words. We also define a margin-based ranking loss to efficiently optimize POD.
Our contributions are two-fold: (i) we propose the positional dependency-based word embedding POD, which incorporates both positional context and dependency context; (ii) we compare POD with existing aspect term extraction methods and demonstrate that POD yields improved results on aspect term extraction datasets.
POD aims to maximize the likelihood of triples (w_t, c, w_c), where w_t and w_c represent the target word and the context word respectively, and c refers to the positional dependency-based context (an example is in Table 1). This context consists of two parts: the dependency context (dependency paths between the target and context words) and the positional context (the relative position encoding between the target and context words). Figure 1 illustrates the example sentence corresponding to the triples in Table 1. We introduce two score functions for triples (w_t, c, w_c), which are as follows.
where S_add uses element-wise addition to combine the context word with its context c, while S_puct uses the element-wise product. We use two embedding matrices M_t ∈ R^{|V|×d} and M_c ∈ R^{|V|×d} to represent target words and context words respectively, where |V| is the size of the vocabulary and d is the dimension of the embeddings. The vectors w_c ∈ R^{1×d} and w_t ∈ R^{1×d} are obtained through lookup operations. We describe how to derive c in Section 2.2.
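The two score functions can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the text only states that S_add combines w_c and c by element-wise addition and S_puct by element-wise product, so taking the inner product of that combination with the target vector w_t is an assumption made here. The function names and the toy sizes are hypothetical.

```python
import numpy as np


def score_add(w_t, w_c, c):
    """Additive score: combine w_c and c by element-wise addition.

    Assumption: the combined vector is scored against w_t by an inner product.
    """
    return float(np.dot(w_t, w_c + c))


def score_puct(w_t, w_c, c):
    """Product score: combine w_c and c by element-wise product.

    Assumption: the combined vector is scored against w_t by an inner product.
    """
    return float(np.dot(w_t, w_c * c))


# Toy setup: |V| = 6 words, d = 4 dimensions (hypothetical sizes).
rng = np.random.default_rng(0)
vocab_size, d = 6, 4
M_t = rng.normal(size=(vocab_size, d))  # target-word embeddings
M_c = rng.normal(size=(vocab_size, d))  # context-word embeddings

w_t = M_t[2]             # lookup of a target word
w_c = M_c[5]             # lookup of a context word
c = rng.normal(size=d)   # stand-in for the context vector c

print(score_add(w_t, w_c, c), score_puct(w_t, w_c, c))
```

Both functions return a scalar, so either can be plugged into a ranking objective without further changes.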

Positional Dependency
We construct the positional dependency-based context c by linearly combining the dependency context vector c_dep, derived from the semantic composition of lexical dependency paths, and the positional context vector c_pos, computed from the relative position encoding (Shaw et al., 2018). The representation of the positional dependency-based context is defined in Eq. (2).
where α trades off the effects of the dependency and positional contexts in the model. Relative position encoding rests on the assumption that context words at different relative positions have different impacts on learning the representations of target words. It has proved useful in supervised relation classification (Zeng et al., 2014) and machine translation (Vaswani et al., 2017; Shaw et al., 2018). Similar to using embeddings to represent words, we introduce M_l ∈ R^{(s−1)×d} to represent the relative position encoding and derive c_pos from it, where s is the window size.
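The linear combination and the position lookup can be sketched as follows. Several details are assumptions rather than facts from the text: the window is taken to be symmetric with an odd size s, so that the s − 1 rows of M_l correspond to the non-zero offsets −(s−1)/2, ..., −1, 1, ..., (s−1)/2; and α is taken to weight c_dep while (1 − α) weights c_pos. The helper names are hypothetical.

```python
import numpy as np


def position_index(offset, window_size):
    """Map a non-zero relative offset to a row of M_l (which has s - 1 rows).

    Assumption: symmetric window of odd size s, e.g. s = 5 gives
    offsets -2, -1, 1, 2 mapped to rows 0, 1, 2, 3.
    """
    half = (window_size - 1) // 2
    if offset == 0 or abs(offset) > half:
        raise ValueError("offset must be a non-zero position inside the window")
    return offset + half if offset < 0 else offset + half - 1


def positional_dependency_context(c_dep, M_l, offset, alpha, window_size):
    """Linearly combine the dependency and positional context vectors.

    Assumption: alpha weights c_dep and (1 - alpha) weights c_pos.
    """
    c_pos = M_l[position_index(offset, window_size)]
    return alpha * c_dep + (1.0 - alpha) * c_pos


# Toy usage with s = 5 and d = 4 (hypothetical sizes).
rng = np.random.default_rng(0)
s, d = 5, 4
M_l = rng.normal(size=(s - 1, d))
c_dep = rng.normal(size=d)
c = positional_dependency_context(c_dep, M_l, offset=-1, alpha=0.7, window_size=s)
print(c.shape)
```

With α = 1 the context reduces to the dependency context alone, and with α = 0 to the positional context alone, which makes the role of the trade-off easy to inspect.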
We also consider the lexical information along dependency paths when learning the representations of the dependency context. For example, for the pair (food, wonderful) in Figure 1, the corresponding dependency path is * ←nsubj− smells/VBZ −xcomp→ *. We treat the words and POS tags as the lexical information, and use dep = {g_1, g_2, ..., g_|c|} to denote the composite lexical dependency path. The embedding matrix M_dep ∈ R^{n×d} is used to derive the distributed representations of the lexical dependency path {g_1, g_2, ..., g_|c|}, where n is the size of the dictionary, which includes words, POS tags and dependency paths.
To obtain c_dep, we use an RNN model, which learns the dependency path representation along the sequence dep in a recurrent manner.
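The recurrent composition can be illustrated with a plain Elman-style RNN. The text does not specify the cell type, the non-linearity, or how the path is tokenized, so the tanh cell, the use of the final hidden state as c_dep, and the toy dictionary below are all assumptions made for illustration.

```python
import numpy as np


def rnn_compose(path_ids, M_dep, W_x, W_h, b):
    """Compose a lexical dependency path into one vector.

    Assumption: a simple tanh RNN whose final hidden state is used as c_dep.
    """
    h = np.zeros(W_h.shape[0])
    for g in path_ids:
        # Look up the embedding of the path element and update the state.
        h = np.tanh(M_dep[g] @ W_x + h @ W_h + b)
    return h


# Hypothetical dictionary over relations, words and POS tags for the
# example path between "food" and "wonderful".
dictionary = {"nsubj": 0, "smells": 1, "VBZ": 2, "xcomp": 3}
path = ["nsubj", "smells", "VBZ", "xcomp"]

rng = np.random.default_rng(0)
d = 8
M_dep = rng.normal(scale=0.1, size=(len(dictionary), d))
W_x = rng.normal(scale=0.1, size=(d, d))
W_h = rng.normal(scale=0.1, size=(d, d))
b = np.zeros(d)

c_dep = rnn_compose([dictionary[g] for g in path], M_dep, W_x, W_h, b)
print(c_dep.shape)
```

Because the path is consumed one element at a time, paths of any length map to a vector of the same dimension d.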

Model Optimization
We use a margin-based ranking objective to learn the model parameters in Eq. (1), which encourages the scores of positive triples (w_t, c, w_c) ∈ T to be higher than the scores of sampled triples (w′_t, c, w_c) ∈ T′. The ranking loss is as follows.
where δ is the margin value and S(·) is the score function defined in Eq. (1), in which c is introduced in Eq. (2). Eq. (3) conducts negative sampling on target words rather than on dependency paths, which offers two advantages: (i) it can exploit dependency paths with an arbitrary number of hops, and the words and POS tags along the path can also be utilized; (ii) it avoids memorizing dependency path frequencies, which grow exponentially with the number of hops.
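The ranking objective described above can be sketched as a hinge loss over one positive triple and its sampled negatives. The exact form of the loss is not reproduced in the text, so the standard max(0, δ − S_pos + S_neg) formulation used here is an assumption, and the function name is hypothetical.

```python
import numpy as np


def margin_ranking_loss(pos_score, neg_scores, delta=1.0):
    """Hinge loss that pushes the positive score above each negative score
    by at least the margin delta.

    Assumption: the loss sums max(0, delta - S_pos + S_neg) over negatives.
    """
    neg_scores = np.asarray(neg_scores, dtype=float)
    return float(np.sum(np.maximum(0.0, delta - pos_score + neg_scores)))


# One positive triple scored 2.0 against three corrupted target words.
loss = margin_ranking_loss(2.0, [0.5, 1.5, 3.0], delta=1.0)
print(loss)
```

Negatives that already score at least δ below the positive triple contribute nothing, so only violating samples produce a gradient.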
Negative sampling is employed to train the embedding model (Eq. (1)). The randomly chosen words are sampled from the marginal distribution p(w), which is estimated from the word frequency in the corpus raised to the 3/4 power (Mikolov et al., 2013a). We set the number of negative samples to 15, which is a trade-off between training time and performance. The margin δ is empirically set to 1 following (Collobert and Weston, 2008; Bollegala et al., 2015). To avoid overfitting in the RNN, we apply dropout to the input vectors with a dropout rate of 0.5. Asynchronous gradient descent is used for parallel training. Moreover, Adagrad (Duchi et al., 2011) is used to adapt the learning rate, and the initial learning rate is set to 0.1.
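The sampling step can be sketched as follows: word counts are raised to the 3/4 power, normalized into a distribution, and 15 corrupted target words are drawn from it. Excluding the true target word from the draw is an assumption added here for illustration, and the function names are hypothetical.

```python
import numpy as np


def noise_distribution(word_counts, power=0.75):
    """Unigram distribution with counts raised to the 3/4 power."""
    weights = np.asarray(word_counts, dtype=float) ** power
    return weights / weights.sum()


def sample_negative_targets(word_counts, true_target, num_negatives=15, rng=None):
    """Draw corrupted target-word ids from the smoothed unigram distribution.

    Assumption: the true target word is excluded from the candidates.
    """
    if rng is None:
        rng = np.random.default_rng()
    p = noise_distribution(word_counts)
    p[true_target] = 0.0
    p = p / p.sum()
    return rng.choice(len(p), size=num_negatives, p=p)


# Toy corpus counts for a four-word vocabulary (hypothetical numbers).
counts = [5, 3, 2, 1]
negatives = sample_negative_targets(counts, true_target=0,
                                    rng=np.random.default_rng(0))
print(negatives)
```

Raising counts to the 3/4 power flattens the distribution, so rare words are drawn more often than their raw frequency would suggest.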

Dataset
We evaluate POD on the aspect term extraction benchmark datasets SemEval 2014/2015/2016. The SemEval 2014 datasets cover two domains, laptop and restaurant, and we use D1 and D2 to denote these two datasets respectively. The SemEval 2015/2016 datasets only cover the restaurant domain, and we use D3 and D4 to denote them. We use the corpora introduced in (Yin et al., 2016) to learn the distributed representations of words and lexical dependency paths.

Baseline and Setting
We compare POD with the top systems in SemEval, listed under the method class Top system in Table 1. We also compare our method with notable embedding-based methods, listed under the method class Embedding method in Table 1.
To choose l, d (Section 2.1) and α (Eq. (2)), 80% of the sentences in the training data are used as the training set, and the remaining 20% are used as the development set. The dimensions of the word and dependency path embeddings are set to 100. Larger dimensions give similar results on the development set but cost more time. l is set to 10, which performs best on the development set. Similarly, α is set to 0.7, 0.5, 0.5 and 0.5 for datasets D1, D2, D3 and D4 respectively.
To make fair comparisons, we also choose the parameters l and d for the embedding baselines on the development set. The dimensions of all embedding methods are set to 100. The values of l in the Skip-gram, CBOW and WDEmb models are set to 15, and those in Glove and DepEmb are set to 10. The windows of Skip-gram, CBOW and Glove are set to 5, the same as in our model. Because the derived embeddings are not necessarily in a bounded range (Turian et al., 2010), using them directly might lead to moderate results. We therefore apply a simple discretization function following (Yin et al., 2016) to make the embedding features more effective.

Result and Analysis
The results are described in Table 2, and a t-test is also conducted over random initializations. From the table, we find that POD with both S_puct and S_add consistently outperforms WDEmb, which is one of the best embedding methods. The reasons are that (i) our model incorporates positional context as relative position encoding to help enhance word embeddings; (ii) the dependency context leverages the lexical dependency path, capturing more specific lexical information such as words and POS tags (extracted using Stanford CoreNLP) than WDEmb. POD also achieves results comparable to the top systems, which are based on hand-crafted features, on all datasets, which shows that our learned embeddings are effective for aspect term extraction. S_puct performs better than S_add, which indicates that the product-based composition method is better at capturing useful features for aspect term extraction. Among the embedding-based baselines, DepEmb and WDEmb perform better than the others, which indicates that encoding syntactic knowledge into word embeddings is desirable for aspect term extraction.
Table 3: Effects of information in dependency context.
We also analyze the effects of POS tags and words along dependency paths in the dependency context on final results. The results are presented in Table 3. From the table, we observe that both POS tags and words along dependency paths boost aspect term extraction, which indicates that lexical information can encode discriminative information for representations of dependency paths. Meanwhile, POD obtains better results by adding both POS tags and words.

Related Work
Association rule mining is used in (Hu and Liu, 2004b) to mine aspect terms, and opinion words are used to extract infrequent aspect terms. The relationship between opinion words and aspect words is crucial for extracting aspect terms and is exploited in many follow-up studies. In (Qiu et al., 2011), predefined dependency paths are utilized to iteratively extract aspect terms and opinion words. POD instead learns the representation of the dependency context.
Dependency-based word embedding (Levy and Goldberg, 2014; Komninos and Manandhar, 2016) encodes dependencies into word embeddings, and has been shown to be effective in aspect term extraction as well (Yin et al., 2016). However, only grammatical information along the dependency paths is considered. We instead introduce a positional dependency-based embedding method which considers both dependency context and positional context. End-to-end aspect term extraction methods (Wang et al., 2016; Wang et al., 2017; Li et al., 2018; Xu et al., 2018) based on neural networks and attention mechanisms have been developed recently. Compared to these methods, POD is an embedding method and can thus be applied to more applications. Compared to deep word representations (Peters et al., 2018; Devlin et al., 2019), POD is more efficient, which is crucial to aspect term extraction.

Conclusion
In this paper, we develop a specific word embedding method for aspect term extraction. Our method considers both positional and dependency context when learning the word embedding. Meanwhile, the lexical information along dependency path is encoded into representations of dependency context. Compared with other embedding methods, our method achieves better results in aspect term extraction.

Acknowledgement
This paper is supported by National Key Research and Development Program of China with Grant No. 2018AAA0101900 / 2018AAA0101902 as well as the National Natural Science Foundation of China (NSFC Grant No. 61772039 and No. 91646202). Chenguang Wang is supported by Berkeley DeepDrive and Berkeley Artificial Intelligence Research.