Leveraging Just a Few Keywords for Fine-Grained Aspect Detection Through Weakly Supervised Co-Training

User-generated reviews can be decomposed into fine-grained segments (e.g., sentences, clauses), each evaluating a different aspect of the principal entity (e.g., price, quality, appearance). Automatically detecting these aspects can be useful for both users and downstream opinion mining applications. Current supervised approaches for learning aspect classifiers require many fine-grained aspect labels, which are labor-intensive to obtain. And, unfortunately, unsupervised topic models often fail to capture the aspects of interest. In this work, we consider weakly supervised approaches for training aspect classifiers that only require the user to provide a small set of seed words (i.e., weakly positive indicators) for the aspects of interest. First, we show that current weakly supervised approaches fail to leverage the predictive power of seed words for aspect detection. Next, we propose a student-teacher approach that effectively leverages seed words in a bag-of-words classifier (teacher); in turn, we use the teacher to train a second model (student) that is potentially more powerful (e.g., a neural network that uses pre-trained word embeddings). Finally, we show that iterative co-training can be used to cope with noisy seed words, leading to both improved teacher and student models. Our proposed approach consistently outperforms previous weakly supervised approaches (by 14.1 absolute F1 points on average) in six different domains of product reviews and six multilingual datasets of restaurant reviews.


Introduction
A typical review of an entity on platforms such as Yelp and Amazon discusses multiple aspects of the entity (e.g., price, quality) in individual review segments (e.g., sentences, clauses). Consider, for example, an Amazon product review of a TV, which may discuss aspects such as price, ease of use, and sound quality. Given the vast number of online reviews, both sellers and customers would benefit from automatic methods for detecting fine-grained segments that discuss particular aspects of interest. Fine-grained aspect detection is also a key task in downstream applications such as aspect-based sentiment analysis and multi-document summarization (Hu and Liu, 2004; Liu, 2012; Pontiki et al., 2016; Angelidis and Lapata, 2018).
In this work, we consider the problem of classifying individual segments of reviews to predefined aspect classes when ground truth aspect labels are not available. Indeed, reviews are often entered as unstructured, free-form text and do not come with aspect labels. Also, it is infeasible to manually obtain segment annotations for retail stores like Amazon with millions of different products. Unfortunately, fully supervised neural networks cannot be applied without aspect labels. Moreover, the topics learned by unsupervised neural topic models are not perfectly aligned with the users' aspects of interest, so substantial human effort is required for interpreting and mapping the learned topics to meaningful aspects.
Here, we investigate whether neural networks can be effectively trained under this challenging setting when only a small number of descriptive keywords, or seed words, are available for each aspect class. Table 1 shows examples of aspects and five of their corresponding seed words from our experimental datasets (described later in more detail). In contrast to a classification label, which is only relevant for a single segment, a seed word can implicitly provide aspect supervision to potentially many segments. We assume that the seed words have already been collected either manually or automatically. Indeed, collecting a small set of seed words per aspect is typically easier than manually annotating thousands of segments for training neural networks. As we will see, even noisy seed words that are only weakly predictive of the aspect will be useful for aspect detection.

Table 1: Examples of aspects and five of their corresponding seed words in various domains (electronic products, restaurants) and languages ("EN" for English, "FR" for French, "SP" for Spanish).
Aspect | Seed Words
Price (EN) | price, value, money, worth, paid
Image (EN) | picture, color, quality, black, bright
Food (EN) | food, delicious, pizza, cheese, sushi
Drinks (FR) | vin, bière, verre, bouteille, cocktail
Ambience (SP) | ambiente, mesas, terraza, acogedor, ruido
Training neural networks for segment-level aspect detection using just a few seed words is a challenging task. Indeed, as a contribution of this paper, we observe that current weakly supervised networks do not effectively leverage the predictive power of the available seed words. To address the shortcomings of previous seed word-based approaches, we propose a novel weakly supervised approach, which uses the available seed words in a more effective way. In particular, we consider a student-teacher framework, according to which a bag-of-seed-words classifier (teacher) is applied on unlabeled segments to supervise a second model (student), which can be any supervised model, including neural networks.
Our approach introduces several important contributions. First, our teacher model considers each individual seed word as a (noisy) aspect indicator, which as we will show, is more effective than previously proposed weakly supervised approaches. Second, by using only the teacher's aspect probabilities, our student generalizes better than the teacher and, as a result, the student outperforms both the teacher and previously proposed weakly supervised models. Finally, we show how iterative co-training can be used to cope with noisy seed words: the teacher effectively estimates the predictive quality of the noisy seed words in an unsupervised manner using the associated predictions by the student. Iterative co-training then leads to both improved teacher and student models. Overall, our approach consistently outperforms existing weakly supervised approaches, as we show with an experimental evaluation over six domains of product reviews and six multilingual datasets of restaurant reviews.
The rest of this paper is organized as follows. In Section 2 we review relevant work. In Section 3 we describe our proposed weakly supervised approach. In Section 4 we present our experimental setup and findings. Finally, in Section 5 we conclude and suggest future work. A preliminary version of this work was presented at the Second Learning from Limited Labeled Data Workshop (Karamanolakis et al., 2019).

Related Work and Problem Definition
We now review relevant work on aspect detection (Section 2.1), co-training (Section 2.2), and knowledge distillation (Section 2.3). We also define our problem of focus (Section 2.4).

Segment-Level Aspect Detection
The goal of segment-level aspect detection is to classify a segment s into K aspects of interest.
Supervised Approaches. Rule-based or traditional learning models for aspect detection have been outperformed by supervised neural networks (Liu et al., 2015; Poria et al., 2016; Zhang et al., 2018). Supervised neural networks first use an embedding function (EMB) to compute a low-dimensional segment representation $h = \mathrm{EMB}(s) \in \mathbb{R}^d$ and then feed h to a classification layer (CLF) to predict probabilities for the K aspect classes of interest: $p = \langle p_1, \dots, p_K \rangle = \mathrm{CLF}(h)$. For simplicity, we write $p = f(s)$. The parameters of the embedding function and the classification layer are learned using ground truth, segment-level aspect labels. However, aspect labels are not available in our setting, which hinders the application of supervised learning approaches.
Unsupervised Approaches. Topic models have been used to train aspect detection with unannotated documents. Recently, neural topic models (Iyyer et al., 2016; Srivastava and Sutton, 2017; He et al., 2017) have been shown to produce more coherent topics than earlier models such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003). In their Aspect Based Autoencoder (ABAE), He et al. (2017) first use segment s to predict aspect probabilities $p = f(s)$ and then use p to reconstruct an embedding $h'$ for s as a convex combination of K aspect embeddings: $h' = \sum_{k=1}^{K} p_k A_k$, where $A_k \in \mathbb{R}^d$ is the embedding of the k-th aspect. The aspect embeddings $A_k$ are initialized by clustering the vocabulary embeddings using k-means with K clusters. ABAE is trained by minimizing the segment reconstruction error. Unfortunately, unsupervised topic models are not effective when used directly for aspect detection. In particular, in ABAE, the K topics learned to reconstruct the segments are not necessarily aligned with the K aspects of interest. A possible fix is to first learn $K' \gg K$ topics and do a $K'$-to-$K$ mapping as a post-hoc step. However, this mapping requires either aspect labels or substantial human effort for interpreting topics and associating them with aspects. Even then, the mapping is not possible if the learned topics are not aligned with the aspects.
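The reconstruction step described for ABAE can be illustrated with a minimal sketch. The function name `reconstruct_segment` and the toy values below are hypothetical and not taken from He et al. (2017); the sketch only shows the convex combination of aspect embeddings, not the full autoencoder or its training loss.

```python
def reconstruct_segment(p, aspect_embeddings):
    """Return the convex combination sum_k p[k] * A[k] of the aspect embeddings."""
    if len(p) != len(aspect_embeddings):
        raise ValueError("need one probability per aspect embedding")
    if any(pk < 0 for pk in p) or abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("p must be a probability distribution")
    d = len(aspect_embeddings[0])
    return [
        sum(p[k] * aspect_embeddings[k][j] for k in range(len(p)))
        for j in range(d)
    ]


# Toy values (hypothetical): K = 2 aspects, d = 3 embedding dimensions.
A = [[1.0, 0.0, 2.0], [0.0, 1.0, 4.0]]
h_rec = reconstruct_segment([0.25, 0.75], A)
print(h_rec)
```

Because the weights are probabilities, the reconstructed vector always lies inside the convex hull of the aspect embeddings, which is why the learned topics, rather than the aspects of interest, determine what can be reconstructed well.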
Weakly Supervised Approaches. Weakly supervised approaches use minimal domain knowledge (instead of ground truth labels) to model meaningful aspects. In our setting, domain knowledge is given as a set of seed words for each aspect of interest (Lu et al., 2011;Lund et al., 2017;Angelidis and Lapata, 2018). Lu et al. (2011) use seed words as asymmetric priors in probabilistic topic models (including LDA). Lund et al. (2017) use LDA with fixed topic-word distributions, which are learned using seed words as "anchors" for topic inference (Arora et al., 2013). Neither of these two approaches can be directly applied into more recent neural networks for aspect detection. Angelidis and Lapata (2018) recently proposed a weakly supervised extension of the unsupervised ABAE. Their model, named Multi-seed Aspect Extractor, or MATE, initializes the aspect embedding A k using the weighted average of the corresponding seed word embeddings (instead of the k-means centroids). To guarantee that the aspect embeddings will still be aligned with the K aspects of interest after training, Angelidis and Lapata (2018) keep the aspect and word embeddings fixed throughout training. In this work, we will show that the predictive power of seed words can be leveraged more effectively by considering each individual seed word as a more direct source of supervision during training.

Co-training
Co-training (Blum and Mitchell, 1998) is a classic multi-view learning method for semi-supervised learning. In co-training, classifiers over different feature spaces are encouraged to agree in their predictions on a large pool of unlabeled examples. Blum and Mitchell (1998) justify co-training in a setting where the different views are conditionally independent given the label. Several subsequent works have relaxed this assumption and shown co-training to be effective in much more general settings (Balcan et al., 2005; Chen et al., 2011; Collins and Singer, 1999; Clark et al., 2018). Co-training is also related to self-training (or bootstrapping) (Yarowsky, 1995), which trains a classifier using its own predictions and has been successfully applied for various NLP tasks (Collins and Singer, 1999; McClosky et al., 2006). Recent research has successfully revisited these general ideas to solve NLP problems with modern deep learning methods. Clark et al. (2018) propose "cross-view training" for sequence modeling tasks by modifying Bi-LSTMs for semi-supervised learning. Ruder and Plank (2018) show that classic bootstrapping approaches such as tri-training (Zhou and Li, 2005) can be effectively integrated in neural networks for semi-supervised learning under domain shift. Our work provides further evidence that co-training can be effectively integrated into neural networks and combined with recent transfer learning approaches for NLP (Dai and Le, 2015; Howard and Ruder, 2018; Devlin et al., 2019; Radford et al., 2018), in a substantially different, weakly supervised setting where no ground-truth labels but only a few seed words are available for training.

Variable | Description
$s$ | Segment (e.g., sentence) of a text review
$K$ | Number of aspects of interest
$D$ | Total number of seed words
$G_i$ ($i = 1, \dots, K$) | Set of seed words for the i-th aspect
$h \in \mathbb{R}^d$ | Segment embedding (student)
$c \in \mathbb{N}^D$ | Bag-of-seed-words representation of s
$p = \langle p_1, \dots, p_K \rangle$ | Student's aspect predictions
$q = \langle q_1, \dots, q_K \rangle$ | Teacher's aspect predictions

Knowledge Distillation
Our approach is also related to the "knowledge distillation" framework (Buciluǎ et al., 2006; Ba and Caruana, 2014; Hinton et al., 2015), which has received considerable attention recently (Lopez-Paz et al., 2016; Kim and Rush, 2016; Furlanello et al., 2018; Wang, 2019). Traditional knowledge distillation aims at compressing a cumbersome model (teacher) to a simpler model (student) by training the student using both ground truth labels and the soft predictions of the teacher in a distillation objective. Our work also considers a student-teacher architecture and the distillation objective but under a considerably different, weakly supervised setting: (1) we do not use any labels for training and (2) we create conditions that allow the student to outperform the teacher; in turn, (3) we can use the student's predictions to learn a better teacher under co-training.

Problem Definition
Consider a corpus of text reviews from an entity domain (e.g., televisions, restaurants). Each review is split into segments (e.g., sentences, clauses). We also consider K pre-defined aspects of interest $(1, \dots, K)$, including the "General" aspect, which we assume is the K-th aspect for simplicity. Different segments of the same review may be associated with different aspects but ground-truth aspect labels are not available for training. Instead, a small number of seed words $G_k$ are provided for each aspect $k \in [K]$. Our goal is to use the corpus of training reviews and the available seed words $G = (G_1, \dots, G_K)$ to train a classifier, which, given an unseen test segment s, predicts K aspect probabilities $p = \langle p_1, \dots, p_K \rangle$.

Our Student-Teacher Approach
We now describe our weakly supervised framework for aspect detection. We consider a student-teacher architecture (Figure 2), where the teacher is a bag-of-words classifier based solely on the provided seed words (i.e., a "bag-of-seed-words" classifier), and the student is an embedding-based neural network trained on data "softly" labeled by the teacher (as in the distillation objective). In the rest of this section, we describe the individual components of our student-teacher architecture and our proposed algorithm for performing updates.

Teacher: A Bag-of-Seed-Words Classifier
Our teacher model leverages the available seed words G that are predictive of the K aspects. Let D denote the total number of seed words in G. We can represent a segment $s_i$ using a bag-of-seed-words representation $c_i \in \mathbb{N}^D$, where $c_i^j$ encodes the number of times the j-th seed word occurs in $s_i$. (Note that $c_i$ ignores the non-seed words.) The classifier receives $c_i$ as input and predicts $q_i = \langle q_i^1, \dots, q_i^K \rangle$. The teacher's prediction $q_i^k$ for the k-th aspect is given by Equation (1). If no seed word appears in $s_i$, then the teacher predicts the "General" aspect by setting $q_i^K = 1$. Under this configuration the teacher uses seed words in a direct and intuitive way: it predicts aspect probabilities for the k-th aspect, which are proportional to the counts of the seed words under $G_k$, while if no seed word occurs in $s_i$, it predicts the "General" aspect. Although the teacher only uses seed words to predict the aspect of a segment, we also expect non-seed words to carry predictive power. Next, we describe the student network that learns to associate non-seed words with aspects.
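A minimal sketch of such a bag-of-seed-words teacher is given below. It assumes the simplest reading of "proportional to the counts", namely normalized per-aspect counts; a softmax over the counts would be another plausible reading, and the exact form of Equation (1) is not asserted here. The function name `teacher_predict`, the toy seed sets, and the convention that the last set is the "General" aspect are illustrative assumptions.

```python
def teacher_predict(tokens, seed_sets):
    """Predict aspect probabilities from seed word counts only.

    seed_sets is a list of K sets of seed words; by assumption the last
    entry corresponds to the "General" aspect.
    """
    num_aspects = len(seed_sets)
    # Per-aspect seed word counts; non-seed words are ignored.
    counts = [sum(1 for t in tokens if t in seeds) for seeds in seed_sets]
    total = sum(counts)
    if total == 0:
        # No seed word in the segment: fall back to the "General" aspect.
        q = [0.0] * num_aspects
        q[num_aspects - 1] = 1.0
        return q
    # Assumption: normalize counts so that probabilities are proportional to them.
    return [c / total for c in counts]


# Toy seed sets (hypothetical): Price, Image, General.
seed_sets = [{"price", "value", "money"}, {"picture", "color", "bright"}, set()]
q_seeded = teacher_predict("great picture and bright color for the price".split(), seed_sets)
q_general = teacher_predict("i bought it last week".split(), seed_sets)
print(q_seeded, q_general)
```

The second call shows the fallback: a segment without any seed word receives all probability mass on the last ("General") aspect.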

Student: An Embedding-Based Network
Our student model is an embedding-based neural network: a segment is first embedded ($h_i = \mathrm{EMB}(s_i) \in \mathbb{R}^d$) and then classified to the K aspects ($p_i = \mathrm{CLF}(h_i)$) (see Section 2.1). The student does not use ground-truth aspect labels for training. Instead, it is trained by optimizing the distillation objective, i.e., the cross entropy between the teacher's (soft) predictions and the student's predictions (Equation (2)). While the teacher only uses the seed words in $s_i$ to form its prediction $q_i$, the student uses all the words in $s_i$. Thus, using the distillation loss for training, the student learns to use both seed words and non-seed words to predict aspects. As a result, the student is able to generalize better than the teacher and predict aspects even in segments that do not contain any seed words. To regularize the student model, we apply L2 regularization to the classifier's weights and dropout regularization to the word embeddings (Srivastava et al., 2014). As we will show in Section 4, our student with this configuration outperforms the teacher in aspect prediction.
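The cross entropy between teacher and student predictions can be sketched as follows. This is only the data term of the objective: averaging over segments, the `eps` clipping, and the name `distillation_loss` are assumptions made for illustration, and the L2 and dropout regularizers mentioned above are omitted.

```python
import math


def distillation_loss(teacher_probs, student_probs, eps=1e-12):
    """Mean cross entropy H(q_i, p_i) = -sum_k q_i^k * log(p_i^k) over segments."""
    total = 0.0
    for q, p in zip(teacher_probs, student_probs):
        # eps avoids log(0); it is a numerical convenience, not part of the objective.
        total += -sum(qk * math.log(max(pk, eps)) for qk, pk in zip(q, p))
    return total / len(teacher_probs)


# One segment, K = 2: the teacher is certain, the student is undecided.
loss_undecided = distillation_loss([[1.0, 0.0]], [[0.5, 0.5]])
# A student that agrees more with the teacher gets a lower loss.
loss_agreeing = distillation_loss([[1.0, 0.0]], [[0.9, 0.1]])
print(loss_undecided, loss_agreeing)
```

Because the targets are soft probabilities rather than hard labels, segments on which the teacher is unsure contribute a weaker training signal than segments with a clear seed word majority.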

Iterative Co-Training
In this section, we describe our iterative co-training algorithm to cope with noisy seed words. The teacher in Section 3.1 considers each seed word equally, which can be problematic because not all seed words are equally good for predicting an aspect. In this work, we propose to estimate the predictive quality of each seed word in an unsupervised way. Our approach is inspired by the Model Bootstrapped Expectation Maximization (MBEM) algorithm of Khetan et al. (2018). MBEM is guaranteed to converge (under mild conditions) when the amount of training data is sufficiently large and the worker quality is sufficiently high. Here, we treat seed words as "noisy annotators" and adopt an iterative estimation procedure similar to MBEM, as we describe next. We model the predictive quality of the j-th seed word as a weight vector $z_j = \langle z_j^1, \dots, z_j^K \rangle$, where $z_j^k$ measures the strength of the association with the k-th aspect. We thus change the teacher to consider seed word quality. In particular, we replace Equation (1) by Equation (3), which uses $\hat{z}_j$, the current estimate of $z_j$. As no ground-truth labels are available, we follow Khetan et al. (2018) and estimate $z_j$ via Maximum Likelihood Estimation using the student's predictions as the current estimate of the ground truth labels. In particular, we assume that the prediction of the student for a training segment $s_i$ is $t_i = \arg\max_k p_i^k$. Then, for each seed word we compute the quality estimate for the k-th aspect using the student's predictions for N segments (Equation (4)). According to Equation (4), the quality of the j-th seed word is estimated according to the student-teacher agreement on segments where the seed word appears. Building upon the previous ideas, we present our Iterative Seed Word Distillation (ISWD) algorithm for effectively leveraging the seed words for fine-grained aspect detection. Each round of ISWD consists of the following steps (Algorithm 1): (1) we apply the teacher on unlabeled training segments to get predictions $q_i$ (without considering seed word qualities); (2) we train the student using the teacher's predictions in the distillation objective of Equation (2) (see footnote 5); (3) we apply the student on the training data to get predictions $p_i$; and (4) we update the seed word quality parameters using the student's predictions in Equation (4).

Algorithm 1 Iterative Seed Word Distillation
Input: $\{s_i\}_{i \in [N]}$, D seed words grouped into K disjoint sets $G = (G_1, \dots, G_K)$
Output: $\hat{f}$: predictor function for segment-level aspect detection
  Predict $\{q_i\}_{i \in [N]}$ (Eq. (1))    (Apply teacher)
  Repeat until convergence criterion:
    Learn $\hat{f}$ (Eq. (2))    (Train student)
    (Apply student)
    (Apply teacher)
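One reading of the seed word quality estimate is sketched below: for each seed word, take the segments in which it occurs and compute the fraction that the student assigns to each aspect. This follows the verbal description of the estimate (agreement on segments where the seed word appears) and is not claimed to reproduce Equation (4) exactly. The function name, the argument layout, and the zero-weight fallback for seed words that never occur are assumptions.

```python
def estimate_seed_quality(seed_counts, student_probs, num_aspects):
    """Estimate z[j][k] for every seed word j and aspect k.

    seed_counts[i][j] is the count of seed word j in segment i;
    student_probs[i] is the student's probability vector p_i.
    """
    num_seeds = len(seed_counts[0])
    # Hard student predictions t_i = argmax_k p_i^k (ties go to the lowest index).
    hard = [max(range(num_aspects), key=lambda k: p[k]) for p in student_probs]
    z = []
    for j in range(num_seeds):
        # Student labels of the segments that contain seed word j.
        support = [t for c, t in zip(seed_counts, hard) if c[j] > 0]
        if not support:
            # Assumption: a seed word that never occurs gets zero weights.
            z.append([0.0] * num_aspects)
            continue
        z.append([support.count(k) / len(support) for k in range(num_aspects)])
    return z


# Toy data (hypothetical): 4 segments, 2 seed words, K = 2 aspects.
seed_counts = [[1, 0], [2, 0], [1, 1], [0, 1]]
student_probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.3, 0.7]]
z_hat = estimate_seed_quality(seed_counts, student_probs, num_aspects=2)
print(z_hat)
```

In the toy data, the first seed word occurs in three segments, two of which the student assigns to the first aspect, so it is treated as a moderately reliable indicator of that aspect, while the second seed word is fully associated with the second aspect.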
In contrast to MATE, which uses the validation set (with aspect labels) to estimate seed weights in an initialization step, our proposed method is an unsupervised approach to modeling and adapting the seed word quality during training. We stop this iterative procedure after the disagreement between the student's and teacher's hard predictions in the training data stops decreasing. We empirically observe that 2-3 rounds are sufficient to satisfy this criterion. This observation also agrees with Khetan et al. (2018), who only run their algorithm for two rounds.
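The stopping rule can be sketched as a disagreement rate between hard predictions together with a simple check on its history. The helper names `disagreement` and `should_stop`, and the choice to stop as soon as the rate fails to decrease from one round to the next, are assumptions for illustration.

```python
def disagreement(teacher_probs, student_probs):
    """Fraction of segments on which teacher and student hard predictions differ."""
    def hard(probs):
        return max(range(len(probs)), key=lambda k: probs[k])

    pairs = list(zip(teacher_probs, student_probs))
    return sum(hard(q) != hard(p) for q, p in pairs) / len(pairs)


def should_stop(history):
    """Stop once the latest disagreement is no lower than the previous one."""
    return len(history) >= 2 and history[-1] >= history[-2]


# Toy predictions (hypothetical): the models agree on the first segment only.
rate = disagreement([[0.9, 0.1], [0.2, 0.8]], [[0.6, 0.4], [0.7, 0.3]])
print(rate, should_stop([0.30, 0.22]), should_stop([0.30, 0.22, 0.22]))
```

A criterion of this kind needs no labeled data, since it compares the two models with each other rather than with ground truth.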

Experiments
We evaluate our approach to aspect detection on several datasets of product and restaurant reviews.

Experimental Settings
Datasets. We train and evaluate our models on Amazon product reviews for six domains (Laptop Bags, Keyboards, Boots, Bluetooth Headsets, Televisions, and Vacuums) from the OPOSUM dataset (Angelidis and Lapata, 2018), and on restaurant reviews in six languages (English, Spanish, French, Russian, Dutch, Turkish) from the SemEval-2016 Aspect-based Sentiment Analysis task (Pontiki et al., 2016). Aspect labels (9-class for product reviews and 12-class for restaurant reviews) are available for each segment (see footnote 6) of the validation and test sets. The restaurant reviews also come with training aspect labels, which we only use for training the fully supervised models. For a fair comparison, we use exactly the same 30 seed words (per aspect and domain) used in Angelidis and Lapata (2018) for the product reviews and use the same extraction method described in Angelidis and Lapata (2018) to extract 30 seed words for the restaurant reviews. See the supplementary material for more dataset details.
Experimental Procedure. For a fair comparison, we use exactly the same pre-processing (tokenization, stemming, and word embedding) and evaluation procedure as in Angelidis and Lapata (2018). For each domain, we train our model on the training set without using any aspect labels, and only use the seed words G via the teacher. For each model, we report the average test performance over 5 different runs with the parameter configuration that achieves the best validation performance. As the evaluation metric, we use the micro-averaged F1.
Footnote 5: The loss function in Khetan et al. (2018), which is an alternative form of noise-aware loss functions (Natarajan et al., 2013), is equivalent to our distillation loss: using the log loss as l(.) in Equation (4) of Khetan et al. (2018) yields the cross entropy loss.
Footnote 6: In product reviews, elementary discourse units (EDUs) are used as segments. In restaurant reviews, sentences are used as segments.
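For reference, micro-averaged F1 pools true positives, false positives, and false negatives over all classes before computing precision and recall. The sketch below assumes exactly one gold and one predicted label per segment, which is a simplification and may not match the evaluation script actually used; the function name `micro_f1` and the toy labels are hypothetical.

```python
def micro_f1(gold, pred, num_classes):
    """Micro-averaged F1 for single-label predictions."""
    tp = fp = fn = 0
    for k in range(num_classes):
        for g, p in zip(gold, pred):
            if p == k and g == k:
                tp += 1
            elif p == k and g != k:
                fp += 1
            elif p != k and g == k:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (hypothetical): 4 segments, 3 classes, one error.
score = micro_f1(gold=[0, 1, 2, 1], pred=[0, 2, 2, 1], num_classes=3)
print(score)
```

Under the single-label assumption every error counts once as a false positive and once as a false negative, so micro-averaged F1 coincides with accuracy.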
Model Configuration. For the student network, we experiment with various modeling choices for segment representations: bag-of-words (BOW) classifiers, the unweighted average of word2vec embeddings (W2V), the weighted average of word2vec embeddings using bilinear attention (Luong et al., 2015) (same setting as He et al. (2017); Angelidis and Lapata (2018)), and the average of contextualized word representations obtained from the second-to-last layer of the pre-trained BERT model (Devlin et al., 2019), which uses multiple self-attention layers (Vaswani et al., 2017) and has been shown to achieve state-of-the-art performance in many downstream NLP applications. For the English product reviews, we use the base uncased BERT model. For the multilingual restaurant reviews, we use the multilingual cased BERT model. In iterative co-training, we train the student network to convergence in each iteration (which may require more than one epoch over the training data). Moreover, we observed that the iterative process is more stable when we interpolate between weights of the previous iteration and the estimated updates instead of directly applying the estimated seed weight updates (according to Equation (3)).
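The interpolation of seed weights can be sketched as a convex combination of the previous weights and the newly estimated ones. The interpolation coefficient is not reported in the text, so `alpha` below is a hypothetical parameter, as is the function name `interpolate_weights`.

```python
def interpolate_weights(previous, estimated, alpha=0.5):
    """Move each seed weight a fraction alpha from its previous value to the new estimate."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [
        [(1 - alpha) * z_old + alpha * z_new for z_old, z_new in zip(row_old, row_new)]
        for row_old, row_new in zip(previous, estimated)
    ]


# Toy weights (hypothetical): one seed word, K = 2 aspects.
z_smoothed = interpolate_weights([[1.0, 0.0]], [[0.0, 1.0]], alpha=0.25)
print(z_smoothed)
```

With `alpha = 1` this reduces to applying the estimated update directly, while smaller values damp round-to-round changes in the teacher.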
Model Comparison. For a robust evaluation of our approach, we compare the following models and baselines:
• LDA-Anchors: The topic model of Lund et al. (2017) using seed words as "anchors."
• ABAE: The unsupervised autoencoder of He et al. (2017), where the learned topics were manually mapped to aspects.
• MATE-*: The MATE model of Angelidis and Lapata (2018) with various configurations: initialization of the aspect embeddings $A_k$ using the unweighted/weighted average of seed word embeddings and an extra multitask training objective (MT).
• Teacher: Our bag-of-seed-words teacher.
• Student-*: Our student network trained with various configurations for the EMB function.
• *-Gold: Supervised models trained using ground truth aspect labels, which are only available for restaurant reviews. These models are not directly comparable with the other models and baselines.

Table 4: Micro-averaged F1 reported for 12-class sentence-level aspect detection in restaurant reviews. The fully supervised *-Gold models are not directly comparable with the weakly supervised models.

Experimental Results
Tables 3 and 4 show the results for aspect detection on product and restaurant reviews, respectively. The rightmost column of each table reports the average performance across the 6 domains/languages.

MATE-* models outperform ABAE. Using the seed words to initialize aspect embeddings leads to more accurate aspect predictions than mapping the learned (unsupervised) topics to aspects.
LDA-Anchors performs worse than MATE-* models. Although averages of seed words were used as "anchors" in the "Tandem Anchoring" algorithm, we observed that the learned topics did not correspond to our aspects of interest.
The teacher effectively leverages seed words. By leveraging the seed words in a more direct way, Teacher is able to outperform the MATE-* models. Thus, we can use Teacher's predictions as supervision for the student, as we describe next.
The student outperforms the teacher. Student-BoW outperforms Teacher: the two models have the same architecture but Teacher only considers seed words; regularizing Student's weights encourages Student to mimic the noisy aspect predictions of Teacher by also considering non-seed words for aspect detection. The benefits of our distillation approach are highlighted using neural networks with word embeddings. Student-W2V outperforms both Teacher and Student-BoW, showing that obtaining segment representations as the average of word embeddings is more effective than using bag-of-words representations for this task.

Figure 3: Our weakly supervised co-training approach when seed words are removed from the student's input (RSW baseline). Segment $s_{\text{non-seed}}$ is an edited version of s, where we replace each seed word in s by an "UNK" special token (like out-of-vocabulary words).
The student outperforms previous weakly supervised models even in one co-training round. Student-ATT outperforms MATE-unweighted (by 36.3% in product reviews and by 52.2% in restaurant reviews) even in a single co-training round: although the two models use exactly the same seed words (without weights), pre-trained word embeddings, EMB function, and CLF function, our student-teacher approach leverages the available seed words more effectively as noisy supervision than just for initialization. Also, using our approach, we can explore more powerful methods for segment embedding without the constraint of a fixed word embedding space. Indeed, using contextualized word representations in Student-BERT leads to the best performance over all models. As expected, our weakly supervised approach does not outperform the fully supervised (*-Gold) models. However, our approach substantially reduces the performance gap between weakly supervised approaches and fully supervised approaches by 62%. The benefits of our student-teacher approach are consistent across all datasets, highlighting the predictive power of seed words across different domains and languages.
The student leverages non-seed words. To better understand the extent to which non-seed words can predict the aspects of interest, we experiment with completely removing the seed words from Student-W2V's input during training (Student-W2V-RSW method; see Figure 3). Thus, in this setting, Student-W2V-RSW is forced to only use non-seed words to detect aspects. Note that the co-training assumption of conditionally independent views (Blum and Mitchell, 1998) is satisfied in this setting, where Teacher is only using seed words and Student-W2V is only using non-seed words. Student-W2V-RSW effectively learns to use non-seed words to predict aspects and performs better than Teacher (but worse than Student-W2V, which considers both seed and non-seed words). For additional ablation experiments, see the supplementary material.

Table 5: Micro-averaged F1 scores during the first round (middle column) and after iterative co-training (right column) in product reviews (top) and restaurant reviews (bottom).
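The RSW input editing can be sketched as a token-level replacement of seed words by an "UNK" token, so that the student sees only non-seed words. The function name `remove_seed_words` and the toy seed sets are hypothetical.

```python
def remove_seed_words(tokens, seed_sets, unk="UNK"):
    """Replace every seed word (from any aspect) by the unk token."""
    all_seeds = set().union(*seed_sets)
    return [unk if token in all_seeds else token for token in tokens]


# Toy seed sets (hypothetical): Price and Image.
masked = remove_seed_words(
    ["great", "picture", "for", "the", "price"],
    [{"price", "value"}, {"picture", "color"}],
)
print(masked)
```

The teacher still labels the original segment from its seed words, while the student is trained on the masked version, which separates the two views of the data.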
Iterative co-training copes with noisy words. Further performance improvement in Teacher and Student-* can be observed with the iterative co-training procedure of Section 3.3. Table 5 reports the performance of Teacher and Student-* after co-training for both product reviews (top) and English restaurant reviews (bottom). (For more detailed, per-domain results, see the supplementary material.) Compared to the initial version of Teacher that does not model the quality of the seed words, iterative co-training leads to estimates of seed word quality that improve Teacher's performance by up to 12.3% (in product reviews using Student-BERT).
A better teacher leads to a better student. Co-training leads to improved student performance in both datasets (Table 5). Compared to MATE, which uses the validation set to estimate the seed weights as a pre-processing step, we estimate and iteratively adapt the seed weights using the student-teacher disagreement, which substantially improves performance. Across the 12 datasets, Student-BERT leads to an average absolute increase of 14.1 F1 points. Figure 4 plots Teacher's and Student-BERT's performance after each round of co-training. Most of the improvement for both Teacher and Student-BERT is gained in the first two rounds of co-training: "T0" (in Figure 4) is the initial teacher, while "T1" is the teacher with estimates of seed word qualities, which leads to more accurate predictions, e.g., in segments with multiple seed words from different aspects.

Figure 4: Co-training performance for each round reported for product reviews (left) and restaurant reviews (right). T<i> and S<i> correspond to the teacher's and student's performance, respectively, at the i-th round.

Conclusions and Future Work
We presented a weakly supervised approach for leveraging a small number of seed words (instead of ground truth aspect labels) for segment-level aspect detection. Our student-teacher approach leverages seed words more directly and effectively than previous weakly supervised approaches. The teacher model provides weak supervision to a student model, which generalizes better than the teacher by also considering non-seed words and by using pre-trained word embeddings. We further show that iterative co-training lets us estimate the quality of the (possibly noisy) seed words. This leads to a better teacher and, in turn, a better student. Our proposed method consistently outperforms previous weakly supervised methods in 12 datasets, allowing for seed words from various domains and languages to be leveraged for aspect detection. Our student-teacher approach could be applied to any classification task for which a small set of seed words describes each class. In future work, we plan to extend our framework to multi-task settings, and to incorporate interaction to learn better seed words.