Improving Relation Extraction with Knowledge-attention

While attention mechanisms have proven effective in many NLP tasks, most of them are purely data-driven. We propose a novel knowledge-attention encoder which incorporates prior knowledge from external lexical resources into deep neural networks for the relation extraction task. Furthermore, we present three effective ways of integrating knowledge-attention with self-attention to maximize the utilization of both knowledge and data. The proposed relation extraction system is end-to-end and fully attention-based. Experimental results show that the proposed knowledge-attention mechanism has complementary strengths with self-attention, and that our integrated models outperform existing CNN-, RNN-, and self-attention-based models. State-of-the-art performance is achieved on TACRED, a complex and large-scale relation extraction dataset.


Introduction
Relation extraction aims to detect the semantic relationship between two entities in a sentence. For example, given the sentence "James Dobson has resigned as chairman of Focus On The Family, which he founded thirty years ago.", the goal is to recognize the organization-founder relation held between "Focus On The Family" and "James Dobson". The various relations between entities extracted from large-scale unstructured texts can be used for ontology and knowledge base population (Chen et al., 2018a; Fossati et al., 2018), as well as for facilitating downstream tasks that require relational understanding of texts, such as question answering (Yu et al., 2017) and dialogue systems (Young et al., 2018).
Traditional feature-based and kernel-based approaches require extensive feature engineering (Suchanek et al., 2006; Qian et al., 2008; Rink and Harabagiu, 2010). Deep neural networks such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) can explore more complex semantics and extract features automatically from raw texts for relation extraction (Xu et al., 2016; Vu et al., 2016; Lee et al., 2017). Recently, attention mechanisms have been introduced to deep neural networks to improve their performance (Zhou et al., 2016; Wang et al., 2016; Zhang et al., 2017). In particular, the Transformer proposed by Vaswani et al. (2017) is based solely on self-attention and has demonstrated better performance than traditional RNNs (Bilan and Roth, 2018; Verga et al., 2018). However, deep neural networks normally require sufficient labeled data to train their numerous model parameters. The scarcity or low quality of training data limits a model's ability to recognize complex relations and also causes overfitting.
A recent study (Li and Mao, 2019) shows that incorporating prior knowledge from external lexical resources into a deep neural network can reduce the reliance on training data and improve relation extraction performance. Motivated by this, we propose a novel knowledge-attention mechanism, which transforms texts from word semantic space into relational semantic space by attending to relation indicators that are useful in recognizing different relations. The relation indicators are automatically generated from lexical knowledge bases and represent keywords and cue phrases of different relation expressions. While the existing self-attention encoder learns internal semantic features by attending to the input texts themselves, the proposed knowledge-attention encoder captures the linguistic clues of different relations based on external knowledge. Since the two attention mechanisms complement each other, we integrate them into a single model to maximize the utilization of both knowledge and data, and achieve optimal performance for relation extraction.
In summary, the main contributions of the paper are: (1) We propose the knowledge-attention encoder, a novel attention mechanism which incorporates prior knowledge from external lexical resources to effectively capture the informative linguistic clues for relation extraction. (2) To take advantage of both knowledge-attention and self-attention, we propose three integration strategies: multi-channel attention, softmax interpolation, and knowledge-informed self-attention. Our final models are fully attention-based and can be easily set up for end-to-end training. (3) We present a detailed analysis of the knowledge-attention encoder. Results show that it has complementary strengths with the self-attention encoder, and the integrated models achieve state-of-the-art results for relation extraction.

Related Works
We focus here on deep neural networks for relation extraction since they have demonstrated better performance than traditional feature-based and kernel-based approaches.
Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are the earliest and most commonly used approaches for relation extraction. Zeng et al. (2014) showed that a CNN with position embeddings is effective for relation extraction. Similarly, CNNs with multiple filter sizes (Nguyen and Grishman, 2015), a pairwise ranking loss function (dos Santos et al., 2015) and auxiliary embeddings (Lee et al., 2017) were proposed to improve performance. Zhang and Wang (2015) proposed a bi-directional RNN with max pooling to model the sequential relations. Instead of modeling the whole sentence, performing RNN on sub-dependency trees (e.g. the shortest dependency path between two entities) has been demonstrated to be effective in capturing long-distance relation patterns (Xu et al., 2016; Miwa and Bansal, 2016). Zhang et al. (2018) proposed graph convolution over dependency trees and achieved state-of-the-art results on the TACRED dataset.
Recently, attention mechanisms have been widely applied to CNNs (Wang et al., 2016; Han et al., 2018) and RNNs (Zhou et al., 2016; Zhang et al., 2017; Du et al., 2018). The improved performance demonstrated the effectiveness of attention mechanisms in deep neural networks. Particularly, Vaswani et al. (2017) proposed a solely self-attention-based model called the Transformer, which is more effective than RNNs in capturing long-distance features since it is able to draw global dependencies regardless of their distances in the sequences. Bilan and Roth (2018) first applied the self-attention encoder to the relation extraction task and achieved competitive results on the TACRED dataset. Verga et al. (2018) used self-attention to encode long contexts spanning multiple sentences for biological relation extraction. However, more attention heads and layers are required for a self-attention encoder to capture complex semantic and syntactic information, since learning is based solely on training data. Hence, more high-quality training data and computational power are needed. Our work utilizes the knowledge from external lexical resources to improve a deep neural network's ability to capture informative linguistic clues.
External knowledge has been shown to be effective in neural networks for many NLP tasks. Existing works focus on utilizing external knowledge to improve embedding representations (Chen et al., 2015; Liu et al., 2015; Sinoara et al., 2019), CNNs (Toutanova et al., 2015; Wang et al., 2017; Li and Mao, 2019), and RNNs (Ahn et al., 2016; Chen et al., 2016, 2018b; Shen et al., 2018). Our work is the first to incorporate knowledge into the Transformer through a novel knowledge-attention mechanism to improve its performance on the relation extraction task.

Knowledge-attention Encoder
We present the proposed knowledge-attention encoder in this section. Relation indicators are first generated from external lexical resources (Section 3.1); then the input texts are transformed from word semantic space into relational semantic space by attending to the relation indicators using the knowledge-attention mechanism (Section 3.2); finally, position-aware attention is used to summarize the input sequence by taking both relation semantics and relative positions into consideration (Section 3.3).

Relation Indicators Generation
Relation indicators represent the keywords or cue phrases of various relation types, which are essential for the knowledge-attention encoder to capture the linguistic clues of a certain relation from texts. We utilize two publicly available lexical resources, FrameNet and Thesaurus.com, to find such lexical units.
FrameNet is a large lexical knowledge base which categorizes English words and sentences into higher-level semantic frames (Ruppenhofer et al., 2006). Each frame is a conceptual structure describing a type of event, object or relation. FrameNet contains over 1200 semantic frames, many of which represent various semantic relations. For each relation type in our relation extraction task, we first find all the relevant semantic frames by searching FrameNet (refer to the Appendix for the detailed semantic frames used). Then we extract all the lexical units involved in these frames, which are exactly the keywords or phrases that are often used to express such a relation. Thesaurus.com is the largest online thesaurus, with over 3 million synonyms and antonyms. It also has the flexibility to filter search results by relevance, POS tag, word length, and complexity. To broaden the coverage of relation indicators, we utilize the synonyms in Thesaurus.com to extend the lexical units extracted from FrameNet. To reduce noise, only the most relevant synonyms with the same POS tag are selected.
Relation indicators are generated based on the word embeddings and POS tags of lexical units. Formally, given a word in a lexical unit, we find its word embedding w_i ∈ R^{d_w} and POS embedding

Knowledge-attention process
In a typical attention mechanism, a query (q) is compared with the keys (K) in a set of key-value pairs and the corresponding attention weights are calculated. The attention output is the weighted sum of the values (V) using the attention weights. In our proposed knowledge-attention encoder, the queries are the input texts and the key-value pairs are both relation indicators. The detailed process of knowledge-attention is shown in Figure 1 (left).
Formally, given text input x = {x_1, x_2, ..., x_n}, the input embeddings Q = {q_1, q_2, ..., q_n} are generated by concatenating each word's word embedding and POS embedding, in the same way as relation indicator generation in Section 3.1. The hidden representations H = {h_1, h_2, ..., h_n} are obtained by attending to the relation indicators K, as shown in Equation 1. The final knowledge-attention outputs are obtained by subtracting the mean of the relation indicators from the hidden representations, as shown in Equation 2:

H = Attn_knwl(Q, K) = softmax(QK^T / √d_k) K    (1)

O = H − (1/m) Σ_{i=1}^{m} k_i    (2)

where knwl indicates the knowledge-attention process, m is the number of relation indicators, and d_k is the dimension of the key/query vectors, used as a scaling factor as in Vaswani et al. (2017).
Subtracting the mean of the relation indicators results in small outputs for irrelevant words. More importantly, the resulting output will be close to the related relation indicators and further from the other relation indicators in relational semantic space. Therefore, the proposed knowledge-attention mechanism is effective in capturing the linguistic clues of relations represented by the relation indicators in the relational semantic space.
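As a concrete sketch of this process (with numpy standing in for the actual framework, and with keys and values both set to the relation indicators, as described above):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def knowledge_attention(Q, K):
    """Q: (n, d_k) token embeddings; K: (m, d_k) relation indicators.
    Keys and values are both the relation indicators (V = K)."""
    d_k = Q.shape[-1]
    H = softmax(Q @ K.T / np.sqrt(d_k))       # (n, m) attention weights
    H = H @ K                                  # hidden representations (Eq. 1)
    return H - K.mean(axis=0, keepdims=True)   # subtract indicator mean (Eq. 2)

# Toy example: 4 tokens, 5 relation indicators, dimension 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(5, 8))
O = knowledge_attention(Q, K)
```

Note that each output row lives in the span of the relation indicators, shifted by their mean, which is what places relevant words near their matching indicators in the relational semantic space.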

Multi-head knowledge-attention
Inspired by the multi-head attention in the Transformer (Vaswani et al., 2017), we also use multi-head knowledge-attention, which first linearly transforms Q, K and V h times, and then performs h knowledge-attentions simultaneously, as shown in Figure 1 (right).
Different from the Transformer encoder, we use the same linear transformation for Q and K in each head to keep the correspondence between queries and keys. Besides, only one residual connection, from the input embeddings to the outputs of the position-wise feed-forward network, is used. We also mask the outputs of padding tokens using zero vectors.
The multi-head structure in knowledge-attention allows the model to jointly attend to inputs in different relational semantic subspaces with different contributions of relation indicators. This is beneficial in recognizing complex relations where various compositions of relation indicators are needed.

Position-aware Attention
It has been proven that the relative position information of each token with respect to the two target entities is beneficial for the relation extraction task (Zeng et al., 2014). We modify the position-aware attention originally proposed by Zhang et al. (2017) to incorporate such relative position information and determine the importance of each token to the final sentence representation.
Assume the relative position of token x_i to a target entity is p_i. We apply a position binning function (Equation 4) to make it easier for the model to distinguish long and short relative distances.
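Equation 4 itself is not reproduced in this text, so the following is only an illustrative binning under an assumed form: exact positions up to a threshold, then logarithmically widening buckets for longer distances (the paper's actual function may differ).

```python
import math

def bin_position(p, threshold=4):
    """Hypothetical log-scale position binning. Positions within
    [-threshold, threshold] are kept exact; longer distances are
    collapsed into logarithmic buckets, preserving the sign."""
    if abs(p) <= threshold:
        return p
    sign = 1 if p > 0 else -1
    # bucket index grows with log2 of the distance beyond the threshold
    return sign * (threshold + int(math.log2(abs(p) / threshold)) + 1)
```

The design intent this illustrates: nearby tokens keep fine-grained positions while faraway tokens share coarse buckets, so the model distinguishes short distances precisely without wasting capacity on exact long distances.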
After getting the relative positions p_i^s and p_i^o to the two entities of interest (subject and object respectively), we map them to position embeddings based on a shared position embedding matrix W_p. The two embeddings are concatenated to form the final position embedding for token x_i. Position-aware attention is then performed on the outputs of knowledge-attention O ∈ R^{n×d_k}, taking the corresponding relative position embeddings P ∈ R^{n×d_p} into consideration, where d_a is the attention dimension and c ∈ R^{d_a} is a context vector learned by the neural network.
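Since the attention equations are not reproduced here, the following is a minimal sketch of one standard formulation consistent with the description (score each token from its knowledge-attention output and its position embedding via the context vector c, then take the attention-weighted sum):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def position_aware_attention(O, P, W_o, W_p, c):
    """O: (n, d_k) knowledge-attention outputs; P: (n, d_p) relative-position
    embeddings; W_o, W_p project into the attention space; c: (d_a,) learned
    context vector. Returns a (d_k,) sentence representation."""
    scores = np.tanh(O @ W_o + P @ W_p) @ c   # (n,) importance of each token
    a = softmax(scores)                        # attention weights sum to 1
    return a @ O                               # weighted sum of token outputs

rng = np.random.default_rng(1)
n, d_k, d_p, d_a = 6, 8, 4, 5
O = rng.normal(size=(n, d_k)); P = rng.normal(size=(n, d_p))
W_o = rng.normal(size=(d_k, d_a)); W_p = rng.normal(size=(d_p, d_a))
c = rng.normal(size=(d_a,))
s = position_aware_attention(O, P, W_o, W_p, c)
```

The projection matrices W_o and W_p are assumptions for the sketch; the paper's exact parameterization may differ.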

Integrate Knowledge-attention with Self-attention
The self-attention encoder proposed by Vaswani et al. (2017) learns internal semantic features by modeling pair-wise interactions within the texts themselves, which is effective in capturing long-distance dependencies. Our proposed knowledge-attention encoder has the complementary strength of capturing the linguistic clues of relations precisely based on external knowledge. Therefore, it is beneficial to integrate the two models to maximize the utilization of both external knowledge and training data. In this section, we propose three integration approaches, as shown in Figure 2, and each approach has its own advantages.

Multi-channel Attention
In this approach, self-attention and knowledge-attention are treated as two separate channels that model the sentence from different perspectives. After applying position-aware attention, two feature vectors f_1 and f_2 are obtained from self-attention and knowledge-attention respectively. We apply another attention mechanism, called multi-channel attention, to integrate the feature vectors.
In multi-channel attention, the feature vectors are first fed into a fully connected neural network to get their hidden representations h_i. Then attention weights are calculated using a learnable context vector c, which reflects the importance of each feature vector to the final relation classification. Finally, the feature vectors are integrated based on the attention weights, as shown in Equation 6.
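Equation 6 is not reproduced in this text; the steps above can be sketched as follows, assuming a tanh-activated fully connected layer (the activation is an assumption of this sketch):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multi_channel_attention(F, W, b, c):
    """F: (num_channels, d_f) channel feature vectors (here f1 from
    self-attention and f2 from knowledge-attention). Returns the
    integrated feature vector r passed to the softmax classifier."""
    H = np.tanh(F @ W + b)   # hidden representation of each channel
    a = softmax(H @ c)       # per-channel attention weights
    return a @ F             # weighted combination of channel features

rng = np.random.default_rng(2)
F = rng.normal(size=(2, 8))                      # two channels
W = rng.normal(size=(8, 5)); b = rng.normal(size=(5,))
c = rng.normal(size=(5,))
r = multi_channel_attention(F, W, b, c)
```

Because the channels enter only through rows of F, extra feature vectors from other sources (e.g. entity categories, as discussed below) can be appended as additional rows without changing the mechanism.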
After obtaining the integrated feature vector r, we pass it to a softmax classifier to determine the relation class.The model is trained using stochastic gradient descent with momentum and learning rate decay to minimize the cross-entropy loss.
The main advantage of this approach is flexibility. Since the two channels process information independently, their input components need not be the same. Besides, we can add more features from other sources (e.g. subject and object categories) to multi-channel attention so that the final decision is based on all the information sources.

Softmax Interpolation
Similar to multi-channel attention, softmax interpolation also uses two independent channels for self-attention and knowledge-attention. Instead of integrating the feature vectors, we make two independent predictions using two softmax classifiers based on the feature vectors from the two channels. The loss function is defined as the total cross-entropy loss of the two classifiers. The final prediction is obtained using an interpolation function of the two softmax distributions, where p_1 and p_2 are the softmax distributions obtained from self-attention and knowledge-attention respectively, and β is the priority weight assigned to self-attention.
Since knowledge-attention focuses on capturing the keywords and cue phrases of relations, its precision is higher than that of self-attention while its recall is lower. The proposed softmax interpolation approach is able to take advantage of both attention mechanisms and balance precision and recall by adjusting the priority weight β.
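The interpolation function itself is not reproduced in this text; a linear (convex) interpolation is the natural reading of the description and is sketched below:

```python
import numpy as np

def interpolate(p_self, p_knwl, beta=0.8):
    """Convex combination of the two channel distributions; beta is the
    priority weight given to the self-attention channel (beta = 0.8 is
    the value used in the experiments)."""
    return beta * p_self + (1.0 - beta) * p_knwl

p1 = np.array([0.7, 0.2, 0.1])   # self-attention softmax output
p2 = np.array([0.1, 0.8, 0.1])   # knowledge-attention softmax output
p = interpolate(p1, p2, beta=0.8)
```

A small β shifts the prediction toward the high-precision knowledge channel; a large β favors the higher-recall self-attention channel, matching the precision/recall trade-off discussed above.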

Knowledge-informed Self-attention
Since knowledge-attention and self-attention share similar structures, it is also possible to integrate them into a single channel.We propose knowledge-informed self-attention encoder which incorporates knowledge-attention into every self-attention head to jointly model the semantic relations based on both knowledge and data.
The structure of knowledge-informed self-attention is shown in Figure 3. Formally, given the text input matrix Q ∈ R^{n×d_k} and knowledge indicators K ∈ R^{m×d_k}, the output of each attention head is calculated with both a knowledge-attention term and a self-attention term, where knwl and self indicate knowledge-attention and self-attention respectively, and all the linear transformation weight matrices have matching dimensionality. Since each self-attention head is aided with prior knowledge through knowledge-attention, the knowledge-informed self-attention encoder is able to capture more lexical and semantic information than a single attention encoder.
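The per-head equations are not reproduced in this text; one plausible reading, sketched below, is that each head sums a self-attention term over the input with a knowledge-attention term against the indicators (the paper's exact combination and per-head projections may differ):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kisa_head(Q, K):
    """Hypothetical knowledge-informed self-attention head.
    Q: (n, d) token representations; K: (m, d) relation indicators."""
    d = Q.shape[-1]
    self_out = softmax(Q @ Q.T / np.sqrt(d)) @ Q   # data-driven, pair-wise
    knwl_out = softmax(Q @ K.T / np.sqrt(d)) @ K   # knowledge-driven
    return self_out + knwl_out                      # each head sees both

rng = np.random.default_rng(3)
Q = rng.normal(size=(4, 8)); K = rng.normal(size=(5, 8))
H = kisa_head(Q, K)
```

The point of the sketch: both terms are computed per head, so prior knowledge informs every subspace of the multi-head structure rather than a separate channel.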

Baseline Models
To study the performance of our proposed models, the following baselines are used for comparison. CNN-based models: (1) CNN: the classical convolutional neural network for sentence classification (Kim, 2014). (2) CNN-PE: CNN with position embeddings dedicated to relation classification (Nguyen and Grishman, 2015). (3) GCN: a graph convolutional network over the pruned dependency trees of the sentence (Zhang et al., 2018). RNN-based models: (1) LSTM: a long short-term memory network that models the texts sequentially; classification is based on the last hidden output. (2) PA-LSTM: a position-aware attention mechanism similar to ours is used to summarize the LSTM outputs (Zhang et al., 2017). CNN-RNN hybrid model: contextualized GCN (C-GCN), where the input vectors are obtained using a bi-directional LSTM network (Zhang et al., 2018). Self-attention-based model (Self-attn): uses a self-attention encoder to model the input sentence. Our implementation is based on Bilan and Roth (2018), where several modifications are made to the original Transformer encoder, including the use of relative positional encodings instead of absolute sinusoidal encodings, as well as other configurations such as residual connections, activation function and normalization.
For our model, we evaluate both the proposed knowledge-attention encoder (Knwl-attn) and the integrated models with self-attention, including multi-channel attention (MCA), softmax interpolation (SI) and knowledge-informed self-attention (KISA).

Experiment Settings
We conduct our main experiments on TACRED, a large-scale relation extraction dataset introduced by Zhang et al. (2017). TACRED contains over 106k sentences with hand-annotated subject and object entities as well as the relations between them. It is a very complex relation extraction dataset with 41 relation types and a no_relation class for when no relation holds between the entities. The dataset is suited for real-world relation extraction since it is unbalanced, with 79.5% no_relation samples, and multiple relations between different entity pairs can exist in one sentence. Besides, the samples are normally long sentences, with an average of 36.2 words.
Since the dataset is already partitioned into train (68124 samples), dev (22631 samples) and test (15509 samples) sets, we tune model hyperparameters on the dev set and evaluate on the test set. The evaluation metrics are micro-averaged precision, recall and F_1 score. For fair comparison, we select the model with the median F_1 score on the dev set from 5 independent runs, same as Zhang et al. (2017). The same "entity mask" strategy is used, which replaces the subject (or object) entity with special NER-SUBJ (or NER-OBJ) tokens to avoid overfitting on specific entities and to provide entity type information.
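The "entity mask" strategy can be sketched directly (span indices and type names here are illustrative, not from the dataset):

```python
def mask_entities(tokens, subj_span, obj_span, subj_type, obj_type):
    """Replace subject/object entity tokens with NER-SUBJ / NER-OBJ style
    placeholders. Spans are half-open [start, end) token indices."""
    out = list(tokens)
    for i in range(*subj_span):
        out[i] = subj_type + "-SUBJ"
    for i in range(*obj_span):
        out[i] = obj_type + "-OBJ"
    return out

tokens = "James Dobson founded Focus On The Family .".split()
masked = mask_entities(tokens, (0, 2), (3, 7), "PERSON", "ORGANIZATION")
# masked: ['PERSON-SUBJ', 'PERSON-SUBJ', 'founded',
#          'ORGANIZATION-OBJ', 'ORGANIZATION-OBJ',
#          'ORGANIZATION-OBJ', 'ORGANIZATION-OBJ', '.']
```

Masking removes the lexical identity of the entities (preventing memorization) while keeping their NER types visible to the model.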
Besides TACRED, another dataset, SemEval2010-Task8 (Hendrickx et al., 2009), is used to evaluate the generalization ability of our proposed model. The dataset is significantly smaller and simpler than TACRED, with 8000 training samples and 2717 testing samples. It contains 9 directed relations and 1 other relation (19 relation classes in total). We use the official macro-averaged F_1 score as the evaluation metric.
We use a one-layer encoder with 6 attention heads for both knowledge-attention and self-attention, since further increasing the number of layers and attention heads degrades performance. For softmax interpolation, we choose β = 0.8 to balance precision and recall. Word embeddings are fine-tuned from pre-trained GloVe (Pennington et al., 2014) with dimensionality 300. Dropout (Srivastava et al., 2014) is used during training to alleviate overfitting. Other model hyperparameters and training details are described in the Appendix due to space limitations.

Results on TACRED dataset
Table 1 shows the results of the baselines as well as our proposed models on the TACRED dataset. It is observed that our proposed knowledge-attention encoder outperforms all CNN-based and RNN-based models by at least 1.3 F_1. Meanwhile, it achieves comparable results with C-GCN and the self-attention encoder, which are the current state-of-the-art single-model systems.
Compared with the self-attention encoder, the knowledge-attention encoder yields higher precision but lower recall. This is reasonable: since the knowledge-attention encoder focuses on capturing the significant linguistic clues of relations based on external knowledge, it results in high precision for the predicted relations, similar to rule-based systems. The self-attention encoder is able to capture more long-distance dependency features by learning from data, resulting in better recall. By integrating self-attention and knowledge-attention using the proposed approaches, a more balanced precision and recall can be obtained, suggesting the complementary effects of the self-attention and knowledge-attention mechanisms. The integrated models improve performance by at least 0.9 F_1 score and achieve new state-of-the-art results among all single end-to-end models.
Comparing the three integrated models, softmax interpolation (SI) achieves the best performance. More interestingly, we found that precision and recall can be controlled by adjusting the priority weight β. Figure 4 shows the impact of β on precision, recall and F_1 score. As β increases, precision decreases and recall increases. Therefore, we can choose a small β for a relation extraction system which requires high precision, and a large β for a system requiring better recall. F_1 score reaches its highest value when precision and recall are balanced (β = 0.8).

Table 1: Micro-averaged precision (P), recall (R) and F_1 score on the TACRED dataset. †, ‡ and †† mark the results reported in (Zhang et al., 2017), (Zhang et al., 2018) and (Bilan and Roth, 2018) respectively. * marks statistically significant improvements over Self-attn with p < 0.01 under a one-tailed t-test.
Knowledge-informed self-attention (KISA) has comparable performance with softmax interpolation, without the need for hyperparameter tuning, since knowledge-attention and self-attention are integrated into a single channel. The performance gain over the self-attention encoder is 1.2 F_1 with much improved precision, demonstrating the effectiveness of incorporating knowledge-attention into self-attention to jointly model the sentence based on both knowledge and data.
The performance gain is the lowest for multi-channel attention (MCA). However, the model is more flexible in that features from other information sources can be easily added to further improve its performance. Table 2 shows the results of adding NER embeddings of each token to the self-attention channel, and entity (subject and object) categorical embeddings to multi-channel attention as additional feature vectors. We use dimensionalities of 30 and 60 for the NER and entity categorical embeddings respectively, and the two embedding matrices are learned by the neural network. Results show that adding NER and entity categorical information to the MCA integrated model improves the F_1 score by 0.2 and 0.5 respectively, and adding both improves precision significantly, resulting in a new best F_1 score.

Results on SemEval2010-Task8 dataset
We use the SemEval2010-Task8 dataset to evaluate the generalization ability of our proposed model. Experiments are conducted in two manners: masking or keeping the entities of interest. Results in Table 3 show that the "entity mask" strategy degrades performance, indicating that strong correlations exist between the entities of interest and the relation classes in the SemEval2010-Task8 dataset. Although the results of keeping the entities are better, the model tends to remember these entities instead of focusing on learning the linguistic clues of relations. This results in poor generalization for sentences with unseen entities. Regardless of whether the entity mask is used, by incorporating the knowledge-attention mechanism, our model improves the performance of self-attention by a statistically significant margin, especially the softmax interpolation integrated model. The results on SemEval2010-Task8 are consistent with those on TACRED, demonstrating the effectiveness and robustness of our proposed method.

Ablation study
To study the contributions of specific components of the knowledge-attention encoder, we perform ablation experiments on the dev set of TACRED. The results without certain components are shown in Table 4.
It is observed that: (1) The proposed multi-head knowledge-attention structure outperforms the single-head structure significantly. This demonstrates the effectiveness of jointly attending texts to different relational semantic subspaces in the multi-head structure. (2) The synonyms improve the performance of knowledge-attention since they broaden the coverage of relation indicators and form a robust relational semantic space. (3) Subtracting the mean vector of the relation indicators from the attention hidden representations helps to suppress the activation of irrelevant words and results in a better representation for each word, capturing the linguistic clues of relations. (4-5) The two masking strategies are helpful for our model: the output masking eliminates the effects of the padding tokens, and the entity masking avoids entity overfitting while providing entity type information. (6) The relative position embedding term in position-aware attention contributes a significant amount of F_1 score. This shows that positional information is particularly important for the relation extraction task.

Attention visualization
To verify the complementary effects of the knowledge-attention encoder and the self-attention encoder, we compare the attention weights assigned to words by the two encoders. Table 5 presents the attention visualization results on sample sentences. For each sample sentence, the attention weights from the knowledge-attention encoder are visualized first, followed by the self-attention encoder. It is observed that the knowledge-attention encoder focuses more on the specific keywords or cue phrases of certain relations, such as "graduated", "executive director" and "founded", while the self-attention encoder attends to a wide range of words in the sentence and pays more attention to the words surrounding the target entities, especially words indicating the syntactic structure, such as "is", "in" and "of". Therefore, the knowledge-attention encoder and self-attention encoder have complementary strengths that focus on different perspectives for relation extraction.

Error analysis
To investigate the limitations of our proposed model and provide insights for future research, we analyze the errors produced by the system on the test set of TACRED. For the knowledge-attention encoder, 58% of errors are false negatives (FN) due to its limited ability to capture long-distance dependencies and linguistic clues unseen during training. For our integrated model, which takes the benefits of both self-attention and knowledge-attention, FN is reduced by 10%. However, false positives (FP) are not improved, due to overfitting that leads to wrong predictions. Many errors are caused by multiple entities with different relations co-occurring in one sentence; our model may mistake irrelevant entities for a relation pair.
We also observed that many FP errors are due to confusion between related relations such as "city of death" and "city of residence". More data or knowledge is needed to distinguish "death" and "residence". Besides, some errors are caused by imperfect annotations.

Conclusion and Future Work
We introduce the knowledge-attention encoder, which effectively incorporates prior knowledge from external lexical resources for relation extraction. The proposed knowledge-attention mechanism transforms texts from word space into relational semantic space and effectively captures the informative linguistic clues of relations. Furthermore, we show the complementary strengths of knowledge-attention and self-attention, and propose three different ways of integrating them to maximize the utilization of both knowledge and data. The proposed models are fully attention-based end-to-end systems and achieve state-of-the-art results on the TACRED dataset, outperforming existing CNN, RNN, and self-attention based models.
In future work, besides lexical knowledge, we will incorporate conceptual knowledge from encyclopedic knowledge bases into the knowledge-attention encoder to capture the high-level semantics of texts. We will also apply knowledge-attention to other tasks such as text classification, sentiment analysis and question answering.

Figure 1 :
Figure 1: Knowledge-attention process (left) and multi-head structure (right) of knowledge-attention encoder.
t_i ∈ R^{d_t} by looking up the word embedding matrix W_wrd ∈ R^{d_w × V_wrd} and POS embedding matrix W_pos ∈ R^{d_t × V_pos} respectively, where d_w and d_t are the dimensions of the word and POS embeddings, V_wrd is the vocabulary size and V_pos is the total number of POS tags. The corresponding relation indicator is formed by concatenating the word embedding and POS embedding, k_i = [w_i, t_i]. If a lexical unit contains multiple words (i.e. a phrase), the corresponding relation indicator is formed by averaging the embeddings of all its words. Eventually, around 3000 relation indicators (including 2000 synonyms) are generated: K = {k_1, k_2, ..., k_m}.
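The indicator construction above can be sketched as follows (the toy vocabulary, POS tags and dimensions are illustrative, not from the paper):

```python
import numpy as np

def relation_indicator(words, pos_tags, W_word, W_pos, vocab, pos_vocab):
    """Build one relation indicator: concatenate word and POS embeddings
    per word, then average over the words of a multi-word lexical unit."""
    vecs = [np.concatenate([W_word[vocab[w]], W_pos[pos_vocab[t]]])
            for w, t in zip(words, pos_tags)]
    return np.mean(vecs, axis=0)

rng = np.random.default_rng(4)
d_w, d_t = 6, 3                       # word / POS embedding dimensions
vocab = {"step": 0, "down": 1}        # toy vocabulary
pos_vocab = {"VB": 0, "RP": 1}        # toy POS inventory
W_word = rng.normal(size=(len(vocab), d_w))
W_pos = rng.normal(size=(len(pos_vocab), d_t))
# a two-word lexical unit (phrase) becomes a single averaged indicator
k = relation_indicator(["step", "down"], ["VB", "RP"],
                       W_word, W_pos, vocab, pos_vocab)
```

A single-word lexical unit reduces to the plain concatenation k_i = [w_i, t_i], since averaging one vector returns it unchanged.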

Figure 2 :
Figure 2: Three ways of integrating knowledge-attention with self-attention: multi-channel attention and softmax interpolation (top), as well as knowledge-informed self-attention (bottom).

Figure 3 :
Figure 3: Knowledge-informed self-attention structure. Q and K represent the input matrix and knowledge indicators respectively; h is the number of attention heads.

Figure 4 :
Figure 4: Change of precision, recall and F 1 score on dev set as the priority weight β in softmax interpolation changes.

Table 4 :
Ablation study on knowledge-attention encoder.Results are the median F 1 scores of 5 independent runs on dev set of TACRED.

Table 5 :
SUBJ-PERSON graduated in 1992 from the OBJ-ORGANIZATION OBJ-ORGANIZATION OBJ-ORGANIZATION with a degree in computer science and had worked as a systems analyst at a Pittsburgh law firm since 1999.
Attention visualization for the knowledge-attention encoder (first) and the self-attention encoder (second). Words are highlighted based on the attention weights assigned to them. Best viewed in color.