Revisiting Document Representations for Large-Scale Zero-Shot Learning

Zero-shot learning aims to recognize unseen objects using their semantic representations. Most existing works use visual attributes labeled by humans, which are not suitable for large-scale applications. In this paper, we revisit the use of documents as semantic representations. We argue that documents like Wikipedia pages contain rich visual information, which, however, can easily be buried by the vast amount of non-visual sentences. To address this issue, we propose a semi-automatic mechanism for visual sentence extraction that leverages the document section headers and the clustering structure of visual sentences. The extracted visual sentences, after a novel weighting scheme to distinguish similar classes, essentially form semantic representations like visual attributes but require much less human effort. On the ImageNet dataset with over 10,000 unseen classes, our representations lead to a 64% relative improvement over the commonly used ones.


Introduction
Algorithms for visual recognition usually require hundreds of labeled images to learn how to classify an object (He et al., 2016). In reality, however, the frequency of observing an object follows a long-tailed distribution (Zhu et al., 2014): many objects do not appear frequently enough for us to collect sufficient images. Zero-shot learning (ZSL) (Lampert et al., 2009), which aims to build classifiers for unseen object classes using their semantic representations, has thus emerged as a promising paradigm for recognizing a large number of classes.
Being the only information of unseen objects, how well the semantic representations describe the visual appearances plays a crucial role in ZSL. One popular choice is visual attributes (Lampert et al., 2009; Patterson and Hays, 2012; Wah et al., 2011) carefully annotated by humans. For example, the bird "Red-bellied Woodpecker" has the "capped head pattern" and "pointed wing shape". While strictly tied to visual appearances, visual attributes are laborious to collect, limiting their applicability to small-scale problems with hundreds of classes.

Figure 1: An illustration of our ZSL approach, which recognizes the input image by comparing it to the visual sentences of documents. Here we show two documents, one for "Tiger" and one for "Lion". The gray area highlights the extracted visual sentences (red: by section headers; blue: by clustering).
For large-scale problems like ImageNet (Deng et al., 2009), which has more than 20,000 classes, existing ZSL algorithms (Frome et al., 2013; Norouzi et al., 2013) mostly resort to word vectors of class names (Mikolov et al., 2013; Pennington et al., 2014) that are automatically extracted from large corpora like Common Crawl. While almost labor-free, word vectors are purely text-driven and barely aligned with visual information. As a result, the state-of-the-art ZSL accuracy on ImageNet falls far behind being practical (Changpinyo et al., 2020).
Is it possible to develop semantic representations that are as powerful as visual attributes without significant human effort? A feasibility study representing a class by its Wikipedia page shows some positive signs: Wikipedia pages do capture rich attribute information. For example, the page "Red-bellied Woodpecker" contains the phrases "red cap going from the bill to the nape" and "black and white barred patterns on their back, wings and tail" that exactly match the visual attributes mentioned above. In other words, if we can identify visual sentences from a document to represent a class, we are likely to attain much higher ZSL accuracy.
To this end, we present a simple yet effective semi-automatic approach for visual sentence extraction, which leverages two informative semantic cues. First, we leverage the section structures of Wikipedia pages: the section header indicates what kind of sentences (visual or not) appear in the section. Concretely, we search Wikipedia pages of common objects following the synsets in ImageNet (e.g., fish, room), and manually identify sections that contain visual information (e.g., characteristics, appearance). We then apply these visual headers to the Wikipedia pages of the remaining ImageNet classes. Second, we observe that visual sentences share some common contextual patterns: for example, they contain commonly used words or phrases of visual attributes (e.g., red color, furry surface). To leverage these patterns, we perform K-means sentence clustering using the BERT features (Devlin et al., 2018) and manually select clusters that contain visual information. We keep sentences in these clusters and combine them with those selected by section headers to represent a document. See Figure 1 for an illustration.
To further increase the discriminative ability of the visual sentences between similar object classes (e.g., breeds of dogs), we introduce a novel scheme to assign weights to sentences, emphasizing those that are more representative for each class.
We validate our approach on three datasets: the ImageNet Fall 2011 dataset (Deng et al., 2009), which contains 14,840 unseen classes with Wikipedia pages; Animals with Attributes 2 (AwA2) (Xian et al., 2018a), which has 50 animal classes; and Attribute Pascal and Yahoo (aPY) (Farhadi et al., 2009), which has 32 classes. Our results are promising: compared to word vectors on ImageNet, we improve by 64% using visual sentences. On AwA2 and aPY, compared to visual attributes annotated by humans, we improve by 8% and 5%, respectively. Moreover, our new semantic representations can be easily incorporated into any ZSL algorithm. Our code and data will be available at https://github.com/heendung/vs-zsl.

Related Work
Semantic representations. Visual attributes are the most popular semantic representations (Lampert et al., 2009; Patterson and Hays, 2012; Wah et al., 2011; Zhao et al., 2019). However, due to the need for human annotation, the largest dataset has only 717 classes. Reed et al. (2016b,a) (2017) extract single-word attributes, which are not discriminative enough (e.g., "red cap" becomes "red", "cap"). None of them works on ZSL with over 1,000 classes. Hessel et al. (2018); Le Cacheux et al. (2020) collect images and tags of a class and derive its semantic representation from the tags, which is not feasible for unseen classes in ZSL.

Zero-shot learning algorithms. The most popular approach is to learn an embedding space in which visual features and semantic representations are aligned and nearest-neighbor classifiers can be applied (Changpinyo et al., 2017; Romera-Paredes and Torr, 2015; Akata et al., 2015a; Kodirov et al., 2017; Schonfeld et al., 2019; Zhu et al., 2019; Xie et al., 2019; Socher et al., 2013). These algorithms consistently improve accuracy on datasets with attributes. Their accuracy on ImageNet, however, is saturated, mainly due to the poor quality of semantic representations (Changpinyo et al., 2020).
Visual Sentence Extraction

Background and notation

ZSL algorithms learn to align visual features and semantic representations using a set of seen classes S. The alignment is then applied to the test images of unseen classes U. We denote by D = {(x_n, y_n ∈ S)}_{n=1}^N the training data (i.e., image feature and label pairs), with the labels coming from S.
Suppose that we have access to a semantic representation a_c (e.g., a word vector) for each class c ∈ S ∪ U. One popular algorithm, DeViSE (Frome et al., 2013), proposes the learning objective

min_{θ, φ, M} Σ_{(x_n, y_n) ∈ D} Σ_{c ∈ S, c ≠ y_n} max{0, ∆ − f_θ(x_n)ᵀ M g_φ(a_{y_n}) + f_θ(x_n)ᵀ M g_φ(a_c)}, (1)

where ∆ ≥ 0 is a margin. That is, DeViSE learns transformations f_θ and g_φ and a matrix M to maximize the visual-semantic alignment of the same class while minimizing that between classes. We can then classify a test image x by

arg max_{c ∈ U} f_θ(x)ᵀ M g_φ(a_c). (2)

Here, we consider that every class c ∈ S ∪ U is provided with a document H_c = {h_j^(c)}_{j=1}^{|H_c|} rather than a_c, where |H_c| is the number of sentences in document H_c and h_j^(c) is the j-th sentence, encoded by BERT (Devlin et al., 2018). We mainly study DeViSE, but our approach can easily be applied to other ZSL algorithms.
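To make the objective and the prediction rule concrete, here is a minimal NumPy sketch of the DeViSE scoring, hinge loss, and classification; for simplicity, f_θ and g_φ are taken as identity maps (as in the original DeViSE), and the function names are illustrative, not from the paper's code.

```python
import numpy as np

def devise_scores(x, A, M):
    """Alignment scores f(x)^T M g(a_c) for every class c.
    x: (d_v,) visual feature; A: (C, d_s) semantic vectors; M: (d_v, d_s)."""
    return A @ (x @ M)  # shape (C,)

def devise_hinge_loss(x, y, A, M, margin=0.1):
    """Eq. (1) for a single pair (x, y): hinge over every wrong class c != y."""
    s = devise_scores(x, A, M)
    viol = np.maximum(0.0, margin - s[y] + s)
    viol[y] = 0.0  # exclude the true class from the sum
    return viol.sum()

def devise_predict(x, A_unseen, M):
    """Eq. (2): pick the unseen class best aligned with x in the joint space."""
    return int(np.argmax(devise_scores(x, A_unseen, M)))
```

In the full model, `x @ M` would be preceded by the learned transformations, and M, θ, φ are optimized by gradient descent on the hinge loss.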

Visual section selection
We aim to filter out sentences in H c that are not describing visual information. We first leverage the section headers in Wikipedia pages, which indicate what types of sentences (visual or not) are in the sections. For example, the page "Lion" has sections "Description" and "Colour variation" that are likely for visual information, and "Health" and "Cultural significance" that are for non-visual information.
To efficiently identify these section headers, we use ImageNet synsets (Deng et al., 2009), which group objects into 16 broad categories. We randomly sample 30 ∼ 35 classes per group, resulting in a set of 500 classes. We then retrieve the corresponding Wikipedia pages by their names and manually identify section headers related to visual sentences. By sub-sampling classes in this way, we can quickly find section headers that are applicable to other classes within the same groups. Table 1 shows some visual/non-visual sections gathered from the 500 classes. For example, "Characteristics" frequently appears in pages of animals to describe their appearances. In contrast, sections like "History" or "Mythology" do not contain visual information. Investigating all the 500 Wikipedia pages carefully, we find 40 distinct visual sections. We also include the first paragraph of a Wikipedia page, which often contains visual information.
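As a sketch, applying the curated headers to a new page amounts to a simple lookup; the header names below are examples mentioned in the text, not the full 40-entry list, and the page representation is a hypothetical one.

```python
# Illustrative subset of the manually identified visual section headers.
VISUAL_HEADERS = {"description", "appearance", "characteristics", "colour variation"}

def extract_by_sections(page):
    """page: list of (section_header, [sentences]) for one Wikipedia article.
    Keep the first paragraph (index 0) plus every visually-headed section."""
    kept = []
    for i, (header, sentences) in enumerate(page):
        if i == 0 or header.lower() in VISUAL_HEADERS:
            kept.extend(sentences)
    return kept
```

Because header vocabularies are shared within a synset group (e.g., animal pages reuse "Characteristics"), the same lookup generalizes from the 500 inspected classes to the rest.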

Visual cluster selection
Our second approach uses K-means for sentence clustering: visual sentences often share common words and phrases of visual attributes, naturally forming clusters. We represent each sentence using the BERT features (Devlin et al., 2018), and perform K-means (with K = 100) over all the sentences from the Wikipedia pages of the ImageNet classes. We then manually check the 100 clusters and identify 40 visual clusters. Table 2 shows a visual (top) and a non-visual (bottom) cluster. We highlight sentences related to two classes: "kit-fox" (red) and "tiger" (blue). The visual cluster describes the animals' general appearances, especially the visual attributes "dark", "black", "tail", "large", etc. In contrast, the non-visual cluster describes mating and lifespan, which are not related to visual aspects.

Table 2: Sentence clusters. The top cluster is visual and the bottom one is non-visual. The sentences from the class kit-fox are in red and those from the class tiger are in blue.
Visual cluster: "It has large ears that help the fox lower its body temperature." · "It usually has a gray coat, with rusty tones, and a black tip to its tail." · "It has distinct dark patches around the nose." · "It is most recognisable for its dark vertical stripes on orangish-brown fur." · "· · · muscular body with powerful forelimbs, a large head and a tail." · "They have a mane-like heavy growth of fur around the neck and jaws · · ·"
Non-visual cluster: "The kit fox is a socially monogamous species." · "Male and female kit foxes usually establish monogamous mating · · ·" · "The average lifespan of a wild kit fox is 5.5 years." · "Tiger mates all year round, but most cubs are born between March · · ·" · "The father generally takes no part in rearing." · "The mortality rate of tiger cubs is about 50% in the first two years."
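The clustering step itself is standard K-means over BERT sentence embeddings (the paper uses K = 100, and any off-the-shelf implementation such as scikit-learn's would do); a minimal NumPy version for illustration:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal K-means over sentence embeddings X (n, d).
    Returns cluster labels (n,) and cluster centers (k, d)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each embedding to its nearest center (squared Euclidean)
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # recompute centers; keep the old center if a cluster empties
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return labels, centers
```

After clustering, each of the K clusters is inspected manually, and the sentences falling into the clusters judged visual (40 of 100 in the paper) are kept.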

Semantic representations of documents
After we obtain a filtered document Ĥ_c, which contains the sentences of the visual sections and clusters, the next step is to represent Ĥ_c by a vector a_c so that nearly all ZSL algorithms can leverage it.
A simple way is the average, ā_c = (1/|Ĥ_c|) Σ_{h ∈ Ĥ_c} h, where h is the BERT feature of a sentence. This, however, may not be discriminative enough to differentiate similar classes that share many common descriptions (e.g., dog classes share common phrases like "a breed of dogs" and "having a coat or a tail").
We therefore propose to identify informative sentences that can enlarge the difference of a_c between classes. Concretely, we learn to assign each sentence a weight λ, such that the resulting weighted average

a_c = (1/|Ĥ_c|) Σ_{h ∈ Ĥ_c} λ(h) × h (3)

can be more distinctive. We model λ(·) ∈ R by a multi-layer perceptron (MLP) b_ψ. We learn b_ψ to meet two criteria. On the one hand, for very similar classes c and c′ whose similarity cos(ā_c, ā_c′) is larger than a threshold τ, we want cos(a_c, a_c′) to be smaller than τ so they can be discriminable. On the other hand, for other pairs of less similar classes, we want their similarity to follow that of the average semantic representations ā_c. To this end, we initialize b_ψ such that the initial a_c is close to ā_c. We do so by first learning b_ψ to minimize the objective

Σ_{c ∈ S∪U} max{0, ε − cos(a_c, ā_c)}. (4)

We set ε = 0.9, forcing a_c and ā_c of the same class to have cos(a_c, ā_c) > 0.9. We then fine-tune b_ψ by minimizing

Σ_{c ≠ c′ ∈ S∪U} max{0, cos(a_c, a_c′) − τ}. (5)

We assign τ a high value (e.g., 0.95) to only penalize overly similar semantic representations. Please see the appendix for details.

Comparison. Our approach is different from DAN (Iyyer et al., 2015). First, we learn an MLP to assign weights to sentences so that their embeddings can be combined appropriately to differentiate classes. In contrast, DAN computes the averaged embedding and learns an MLP to map it to another (more discriminative) embedding space. Second, DAN learns its MLP with a classification loss. In contrast, we learn the MLP to reduce the embedding similarity between similar classes while maintaining the similarity for other pairs of classes.

Experiments

DeViSE (Frome et al., 2013) has f_θ and g_φ as identity functions. Here, we consider a stronger version, DeViSE*, in which we model f_θ and g_φ each by a two-hidden-layer multi-layer perceptron (MLP). We also experiment with two state-of-the-art ZSL algorithms, EXEM (Changpinyo et al., 2020) and HVE (Liu et al., 2020).
We use the average per-class Top-1 classification accuracy as the metric (Xian et al., 2018a). GZSL denotes the generalized ZSL setting (Xian et al., 2018a); in GZSL, U, S, and H denote the unseen class accuracy, the seen class accuracy, and their harmonic mean, respectively, all in per-class Top-1 accuracy (%). Overall, using visual sections and visual clusters for sentence extraction outperforms w2v-v2. More discussions are as follows.

BERT vs. w2v-v2. For both DeViSE and DeViSE*, BERT_p, which averages all the sentences in a Wikipedia page, outperforms w2v-v2, suggesting that representing a class by its document is more powerful than by its word vector.

DeViSE vs. DeViSE*. Adding MLPs to DeViSE largely improves its accuracy: from 0.78% (DeViSE + w2v-v2) to 1.48% (DeViSE* + w2v-v2) on ALL. In the following, we thus focus on DeViSE*.

Visual sentence extraction. Comparing different strategies for BERT_p, we see that both Vis_clu and Vis_sec largely improve over NO (no selection), demonstrating the effectiveness of sentence selection. Combining the two sets of sentences (Vis_sec-clu) leads to a further boost.

Fine-tuning BERT. BERT can be fine-tuned together with DeViSE*. The resulting BERT_f has a notable gain over BERT_p (e.g., 2.39% vs. 2.05%).

Weighted average. With the weighted average (BERT_p-w, BERT_f-w), we obtain the best accuracy.

ZSL algorithms. EXEM + w2v-v2 outperforms DeViSE* + w2v-v2, but falls behind DeViSE* + BERT_p-w (or BERT_f, BERT_f-w). This suggests that algorithm design and semantic representations are both crucial. Importantly, EXEM and HVE can be improved using our proposed semantic representations, demonstrating the applicability and generalizability of our approach. The results on ImageNet, AwA2, and aPY demonstrate our proposed method's applicability to multiple datasets.

Table 5: BERT_p-w-direct directly learns visual sentences without our sentence selection; Par_1st and Cls_name use the first paragraph and the sentences containing the class name, respectively.
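The average per-class Top-1 accuracy used throughout these experiments weights every class equally regardless of how many test images it has; a small sketch:

```python
from collections import defaultdict

def per_class_top1(y_true, y_pred):
    """Average per-class Top-1 accuracy: compute Top-1 accuracy for each
    class separately, then average over classes, so that frequent classes
    do not dominate the metric."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)
```

For example, with three images of class 0 (two correct) and one of class 1 (correct), the metric is (2/3 + 1)/2 ≈ 0.83, whereas per-sample accuracy would be 3/4.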

Analysis on ImageNet
To further justify the effectiveness of our approach, we compare to additional baselines in Table 5.
• BERT_p-w-direct: directly learns b_ψ (Equation 3) as part of the DeViSE objective. Namely, we directly learn b_ψ to identify visual sentences, without our proposed selection mechanisms, such that the resulting a_c optimizes Equation 1.
• Par_1st: uses the first paragraph of a document.
• Cls_name: uses the sentences of a Wikipedia page that contain the class name.

As shown in Table 5, our proposed sentence selection mechanisms (i.e., Vis_sec, Vis_clu, and Vis_sec-clu) outperform all three baselines.

Conclusion
ZSL relies heavily on the quality of semantic representations. Most recent work, however, focuses solely on algorithm design, trying to squeeze out the last bit of information from the pre-defined, likely poor semantic representations. Changpinyo et al. (2020) have shown that existing algorithms are trapped in the plateau of inferior semantic representations. Improving the representations is thus more crucial for ZSL. We investigate this direction and show promising results by extracting distinctive visual sentences from documents as representations, which can easily be used by any ZSL algorithm.

Appendix
In this appendix, we provide details omitted in the main text.
• Appendix A: contribution
• Appendix B: more related work (cf. Section 2 in the main text)
• Appendix C: detailed statistics of Wikipedia pages (cf. Section 4.1 in the main text)
• Appendix D: weighted average representations (cf. Section 3.4 in the main text)
• Appendix E: dataset, metrics, and ZSL algorithms (cf. Section 4.2 in the main text)
• Appendix F: implementation details (cf. Section 4.3 in the main text)
• Appendix G: ablation study (cf. Section 4.3 in the main text)
• Appendix H: qualitative results (cf. Section 3 in the main text)

A Contribution
Our contribution lies not merely in the method we developed, but also in the direction we explored. As discussed in Section 5 of the main paper, most of the efforts in ZSL have focused on algorithm design to associate visual features with pre-defined semantic representations. Yet, it is also important to improve the semantic representations themselves. Indeed, one reason that ZSL performs poorly on large-scale datasets is the poor semantic representations (Changpinyo et al., 2020). We therefore chose to investigate this direction by revisiting document representations, with the goal of making our contributions widely applicable. To this end, we deliberately kept our method simple and intuitive, while also providing insights for future work to build upon. Our manual inspection identified important properties of visual sentences, such as their clustering structure, enabling us to extract them efficiently. We chose not to design new ZSL algorithms but to make our semantic representations compatible with existing ones, to clearly demonstrate the effectiveness of improving semantic representations.

B More Related Work
Zero-shot learning (ZSL) algorithms construct visual classifiers based on semantic representations. Some recent work applies generative models to generate images or visual features of unseen classes (Xian et al., 2018b, 2019; Zhu et al., 2018), so that conventional supervised learning algorithms can be applied.
Knowledge bases usually contain triplets of entities and relationships. The entities are usually objects, locations, etc. For ZSL, we need entities to be fine-grained (e.g., "beaks") and capture more visual appearances. YAGO (Suchanek et al., 2008) and DBpedia (Zaveri et al., 2013) leverage Wikipedia infoboxes to construct triplets, which is elegant but not suitable for ZSL since Wikipedia infoboxes contain insufficient visual information. Thus, these datasets and construction methods may not be directly applicable to ZSL. Nevertheless, the underlying methodologies are inspiring and could serve as the basis for future work. The datasets also offer inter-class relationships that are complementary to visual descriptions, and may be useful to establish class relationships in ZSL algorithms like SynC (Changpinyo et al., 2016).

C Statistics of Wikipedia Pages
We use a Wikipedia API to extract pages from Wikipedia for the 21,842 ImageNet classes. Among these, we find that some classes have multiple Wikipedia pages because of their ambiguous class names. For example, the class "black widow" in ImageNet refers to a spider that is dark brown or shiny black in colour, but it also refers to the name of a "Marvel Comics" character in Wikipedia.
We therefore exclude such classes, as well as classes that do not have word vectors, resulting in 15,833 classes. The Wikipedia pages of these 15K classes contain 1,260,889 sentences, where each class has 80 sentences on average. We also investigate the number of sentences selected by our filters (i.e., Vis_sec, Vis_clu, Vis_sec-clu). We correspondingly find 213,585, 534,852, and 542,645 sentences, which are 16%, 42%, and 43% of all sentences in the 15K classes, respectively (see Figure 2).
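As a quick sanity check on these counts (the reported percentages correspond to the truncated integer shares of the total):

```python
total_sentences = 1_260_889  # across the 15,833 classes with pages
filters = {"Vis_sec": 213_585, "Vis_clu": 534_852, "Vis_sec-clu": 542_645}

avg_per_class = total_sentences / 15_833  # ~80 sentences per class
shares = {k: int(100 * v / total_sentences) for k, v in filters.items()}
# shares -> {'Vis_sec': 16, 'Vis_clu': 42, 'Vis_sec-clu': 43}
```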

D Weighted Average Representations

D.1 Observation
Two similar classes may have similar averaged visual sentence embeddings since they share many common descriptions. For example, Figure 3 shows that the averaged embeddings (i.e., BERT_p and BERT_f) of "Kerry Blue Terrier" and "Soft-coated Terrier" are overly similar, since the two classes share a number of sentences containing common dog features such as "a breed of dog" or "having a coat or a tail". Thus, if we represent their semantic representations a_c as the averaged embeddings, ZSL models may not differentiate them.

D.2 Algorithm
In Section 3.4 of the main text, we introduce λ(·) to give each sentence h of a document a weight. We note that, while learning λ(·) can enlarge the distance of a_c between similar classes, we should not overly maximize the distance, lest semantically similar classes (e.g., different breeds of dogs) end up being less similar than dissimilar classes (e.g., dogs and cats). To this end, we introduce a margin loss with τ in Equation 5, which only penalizes overly similar semantic representations. We also note that the purpose of λ(·) is to improve a_c over the simple average embedding ā_c. We therefore initialize λ(·) such that the initial a_c is similar to ā_c. We do so by first learning b_ψ with the following objective: Σ_{c ∈ S∪U} max{0, ε − cos(a_c, ā_c)}.
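A minimal NumPy sketch of the two objectives (the initialization loss and the τ-margin separation loss); `weight_fn` stands in for the MLP b_ψ, and all names here are illustrative, not the paper's implementation.

```python
import numpy as np

def cos(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def weighted_repr(H, weight_fn):
    """Weighted average a_c = (1/|H_c|) * sum_h lambda(h) * h over a
    document's sentence embeddings H (n, d)."""
    lam = np.array([weight_fn(h) for h in H])
    return (lam[:, None] * H).mean(axis=0)

def init_loss(a, a_bar, eps=0.9):
    """Initialization objective: keep the weighted a_c close to the plain
    average a_bar_c, i.e., max{0, eps - cos(a_c, a_bar_c)}."""
    return max(0.0, eps - cos(a, a_bar))

def separation_loss(A, tau=0.95):
    """Margin loss: penalize class pairs whose representations are more
    similar than tau; pairs below tau incur no loss."""
    loss = 0.0
    for i in range(len(A)):
        for j in range(i + 1, len(A)):
            loss += max(0.0, cos(A[i], A[j]) - tau)
    return loss
```

With uniform weights λ(h) = 1, `weighted_repr` reduces to the plain average ā_c, which is exactly the initialization target.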
We set ε = 0.9, forcing a_c and ā_c to have a similarity larger than 0.9.

Figure 3 demonstrates the effectiveness of the weighted average embedding BERT_f-w. While the other semantic representations predict "Kerry Blue Terrier" as another similar dog, "Soft-coated Terrier", BERT_f-w is able to classify the image correctly. In addition, based on the attention weights, we report the Top 3 sentences and the Bottom 3 sentences. The Top 1st sentence contains features inherent to "Kerry Blue Terrier", such as a long head or a soft-to-curly coat, while the Top 2nd and 3rd sentences describe general features of dogs. On the other hand, the Bottom 3 sentences do not describe the visual appearance of the object. This suggests that our weighted representation BERT_f-w is more representative of "Kerry Blue Terrier" than the other semantic representations.

E Dataset, Metrics, and ZSL Algorithms

Compared to the average per-sample accuracy, the per-class accuracy is a more suitable metric for ImageNet since the dataset is highly imbalanced (Changpinyo et al., 2020). The state-of-the-art algorithms in ZSL are EXEM and HVE, proposed by Changpinyo et al. (2020) and Liu et al. (2020), respectively. To make a fair comparison with our models, we evaluate their algorithms on the same number of test classes using their official code.

E.1 ImageNet
We follow Xian et al. (2018a) and Changpinyo et al. (2016) to consider three tasks, 2-Hop, 3-Hop, and ALL, corresponding to 1,509, 7,678, and 20,345 unseen classes that have word vectors and are within two, three, and arbitrary tree-hop distances of the 1,000 seen classes. We search Wikipedia and successfully retrieve pages for 15,833 classes, of which 1,290, 5,984, and 14,840 are for 2-Hop, 3-Hop, and ALL, respectively.

E.2 AwA2
Animals with Attributes 2 (AwA2) provides 37,322 images of 50 animal classes. On average, each class includes 746 images. It also provides 85 visual attributes that are manually annotated by humans. In AwA2, the classes are split into 40 seen classes and 10 unseen classes. For GZSL, a total of 50 classes is used for testing.

Figure 3: The averaged embeddings are overly similar between Kerry Blue Terrier and Soft-coated Terrier, since the two classes share the common features of dogs such as "a breed of dog" or "having a coat or a tail". On the other hand, our weighted average BERT_f-w is able to differentiate them by weighting the sentences. We report the Top 3 and Bottom 3 sentences based on the attention weights.

E.3 aPY
Attribute Pascal and Yahoo (aPY) contains 15,339 images of 32 classes with 64 attributes. The classes are split into 20 seen classes and 12 unseen classes. A total of 32 classes is used for testing in GZSL.

All algorithms learn feature transformations to associate visual features x with semantic representations a_c. The key differences are what and how to learn. DeViSE learns two MLPs f_θ and g_φ to embed x and a_c into a common space, while HVE embeds them into a hyperbolic space. EXEM learns kernel regressors to embed a_c into the visual space. On how to learn, DeViSE and HVE force each image x to be similar to the true class a_c via a margin loss and a ranking loss, respectively, while EXEM learns to regress the averaged visual features of a class from a_c.
F Implementation Details

F.2 Hyperparameters
DeViSE (Frome et al., 2013) has a tunable margin ∆ ≥ 0 (cf. Section 3.1 in the main text) whose default value is 0.1. We try the values 0.1, 0.2, 0.5, and 0.7 to find the best setting. DeViSE uses the Adam optimizer, whose learning rate is 1e-3 by default. We try the values 1e-3, 5e-4, 2e-4, and 1e-4. Among all 16 combinations of margin and learning rate, we find that a margin of 0.2 and a learning rate of 2e-4 achieve the best results in all our cases.
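The sweep over the 16 margin/learning-rate combinations can be written as a simple grid search; `validate` below is a placeholder for training a model with the given setting and scoring it on validation data.

```python
from itertools import product

margins = [0.1, 0.2, 0.5, 0.7]
learning_rates = [1e-3, 5e-4, 2e-4, 1e-4]

def grid_search(validate):
    """Return the (margin, learning_rate) pair with the highest
    validation score among all 4 x 4 = 16 combinations."""
    return max(product(margins, learning_rates),
               key=lambda mlr: validate(*mlr))
```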

F.3 Fine-tuned models
For fine-tuning, DeViSE is first attached to a BERT model.

G Ablation Study

Table 6 shows the results on 2-Hop with different thresholds τ introduced in Equation 5. We obtain the weighted average BERT_p-w by taking an input h from BERT_p and learning the MLP b_ψ with different τ (similarly for BERT_f-w). Then, we measure 2-Hop accuracy based on BERT_p-w (or BERT_f-w). Note that BERT_p and BERT_f have different ranges of τ, since BERT_f already has lower similarity between classes. This is because BERT_f is trained with images (from seen classes) during fine-tuning, which makes BERT_f more aligned with visual features and thus more representative. We choose τ based on the ImageNet validation set of the seen classes. Table 7 shows that the weighted average embedding BERT_p-w makes similar classes less similar. Originally, the class "Sea boat" has overly similar semantic representations to other types of boats (i.e., BERT_p). After applying our weighting approach, the classes become less similar (e.g., 0.94 to 0.91 between "Sea boat" and "Scow").

H.1 Visual sections and clusters
We provide additional illustrations of the visual sections and clusters of Section 3 in the main text. Figure 4 shows visual and non-visual sections in the Wikipedia page "Siberian Husky". We note that the summary paragraph and sections such as Description contain visual sentences, while sections such as Health or History do not. Similarly, Table 8 shows two clusters: the top cluster is visual, consisting of information about the hunting and prey of animals, while the bottom cluster includes mythology sentences that are not visually related.

Table 8: K-means sentence clusters. The top cluster has visual information about hunting and prey while the bottom one contains non-visual descriptions such as mythology.
Top cluster: "· · · hunt shortly after sunset, eating small animals · · ·" · "· · · if food is scarce, it has been known to eat tomatoes · · ·" · "Tigers are capable of taking down larger prey like adult gaur · · ·" · "Tigers will also prey on such domestic livestock as cattle, horses, · · ·"
Bottom cluster: "Panda is a Roman goddess of peace and travellers · · ·" · "The Ibex is also a national emblem of the great ancient Axum empire." · "In Aztec mythology, the jaguar was considered to be the totem animal of · · ·" · "It is the national animal of Guyana, and is featured in its coat of arms · · ·"

H.2 On ImageNet

Figure 5 shows the qualitative results of our BERT_f-w and w2v-v2 on ImageNet. For each image, we provide its label and the Top 5 predictions by BERT_f-w and w2v-v2. While w2v-v2 is not able to differentiate similar classes (e.g., predicting "Scooter" as "Tandem bicycle"), our BERT_f-w can distinguish them. We also note that the Top 5 classes predicted by BERT_f-w are similar to one another (e.g., "Grey whale" and "Killer whale"). This suggests that our approach maintains the order of similarity among classes while making their semantic representations more distinctive.

Figure 5: Qualitative results of BERT_f-w and w2v-v2 on ImageNet. For each image, we report the Top 5 predictions. While w2v-v2 is not able to distinguish similar classes (e.g., predicting "Scooter" as "Tandem bicycle"), our BERT_f-w differentiates them.