An Empirical Study on Large-Scale Multi-Label Text Classification Including Few and Zero-Shot Labels

Large-scale Multi-label Text Classification (LMTC) has a wide range of Natural Language Processing (NLP) applications and presents interesting challenges. First, not all labels are well represented in the training set, due to the very large label set and the skewed label distributions of LMTC datasets. Also, label hierarchies and differences in human labelling guidelines may affect graph-aware annotation proximity. Finally, the label hierarchies are periodically updated, requiring LMTC models capable of zero-shot generalization. Current state-of-the-art LMTC models employ Label-Wise Attention Networks (LWANs), which (1) typically treat LMTC as flat multi-label classification; (2) may use the label hierarchy to improve zero-shot learning, although this practice is vastly understudied; and (3) have not been combined with pre-trained Transformers (e.g., BERT), which have led to state-of-the-art results in several NLP benchmarks. Here, for the first time, we empirically evaluate a battery of LMTC methods, from vanilla LWANs to hierarchical classification approaches and transfer learning, in frequent, few, and zero-shot learning, on three datasets from different domains. We show that hierarchical methods based on Probabilistic Label Trees (PLTs) outperform LWANs. Furthermore, we show that Transformer-based approaches outperform the state-of-the-art in two of the datasets, and we propose a new state-of-the-art method which combines BERT with LWANs. Finally, we propose new models that leverage the label hierarchy to improve few and zero-shot learning, guided by a graph-aware annotation proximity measure that we introduce and compute for each dataset.


Introduction
Large-scale Multi-label Text Classification (LMTC) is the task of assigning a subset of labels from a large predefined set (typically thousands) to a given document. LMTC has a wide range of applications in Natural Language Processing (NLP).

Figure 1: Examples from LMTC label hierarchies. ∅ is the root label. L_l is the number of labels per level. Yellow nodes denote gold label assignments. In EURLEX57K, documents have been tagged with both leaves and inner nodes (GAP: 0.45). In MIMIC-III, only leaf nodes can be used, causing the label assignments to be much sparser (GAP: 0.27). In AMAZON13K, documents are tagged with leaf nodes, but it is assumed that all the parent nodes are also assigned, leading to dense label assignments (GAP: 0.86).
Apart from the large label space, LMTC datasets often have skewed label distributions (e.g., some labels have few or no training examples) and a label hierarchy with different labelling guidelines (e.g., they may require documents to be tagged only with leaf nodes, or they may allow both leaf and other nodes to be used). The latter affects graph-aware annotation proximity (GAP), i.e., the proximity of the gold labels in the label hierarchy (see Section 4.1). Moreover, the label set and the hierarchies are periodically updated, thus requiring zero- and few-shot learning to cope with newly introduced labels. Figure 1 shows a sample of label hierarchies, with different label assignment guidelines, from three standard LMTC benchmark datasets: EUR-LEX (Chalkidis et al., 2019b), MIMIC-III (Johnson et al., 2017), and AMAZON (McAuley and Leskovec, 2013).
Current state-of-the-art LMTC models are based on Label-Wise Attention Networks (LWANs) (Mullenbach et al., 2018), which use a different attention head for each label. LWANs (1) typically do not leverage structural information from the label hierarchy, treating LMTC as flat multi-label classification; (2) may use the label hierarchy to improve performance in few/zero-shot scenario, but this practice is vastly understudied; and (3) have not been combined with pre-trained Transformers.
We empirically evaluate, for the first time, a battery of LMTC methods, from vanilla LWANs to hierarchical classification approaches and transfer learning, in frequent, few, and zero-shot learning scenarios. We experiment with three standard LMTC datasets (EURLEX57K; MIMIC-III; AMAZON13K). Our contributions are the following:
• We show that hierarchical LMTC approaches based on Probabilistic Label Trees (PLTs) (Prabhu et al., 2018; Khandagale et al., 2019; You et al., 2019) outperform flat neural state-of-the-art methods, i.e., LWANs (Mullenbach et al., 2018), in two out of three datasets (EURLEX57K, AMAZON13K).
• We demonstrate that pre-trained Transformer-based approaches (e.g., BERT) further improve the results in two of the three datasets (EURLEX57K, AMAZON13K), and we propose a new method that combines BERT with LWAN, achieving the best results overall.
• Finally, following the work of Rios and Kavuluru (2018) for few and zero-shot learning on MIMIC-III, we investigate the use of structural information from the label hierarchy in LWAN. We propose new LWAN-based models with improved performance in these settings, taking into account the labelling guidelines of each dataset and a graph-aware annotation proximity (GAP) measure that we introduce.
Related Work

Advances and limitations in LMTC
In LMTC, deep learning achieves state-of-the-art results with LWANs (You et al., 2018; Mullenbach et al., 2018; Chalkidis et al., 2019b), in most cases comparing only against naive baselines (e.g., vanilla CNNs or vanilla LSTMs). The computational complexity of LWANs, however, makes it difficult to scale them up to extremely large label sets. Thus, Probabilistic Label Trees (PLTs) (Jasinska et al., 2016; Prabhu et al., 2018; Khandagale et al., 2019) are preferred in Extreme Multi-label Text Classification (XMTC), mainly because the linear classifiers they use at each node of the partition trees can be trained independently, considering few labels at each node. This allows PLT-based methods to efficiently handle extremely large label sets (often millions), while also achieving top results in XMTC. Nonetheless, previous work has not thoroughly compared PLT-based methods to neural models in LMTC. In particular, only You et al. (2018) have compared PLT methods to neural models in LMTC, but without adequately tuning their parameters, nor considering few and zero-shot labels. More recently, You et al. (2019) introduced ATTENTION-XML, a new method primarily intended for XMTC, which combines PLTs with LWAN classifiers. Similarly to the rest of the PLT-based methods, it has not been evaluated in LMTC.

The new paradigm of transfer learning
Transfer learning (Ruder et al., 2019; Rogers et al., 2020), which has recently achieved state-of-the-art results in several NLP tasks, has only been considered in legal LMTC by Chalkidis et al. (2019b), who experimented with BERT (Devlin et al., 2019) and ELMO (Peters et al., 2018). Other BERT variants, e.g., ROBERTA, or BERT-based models have not been explored in LMTC so far.

Few and zero-shot learning in LMTC
Finally, few and zero-shot learning in LMTC is mostly understudied. Rios and Kavuluru (2018) investigated the effect of encoding the hierarchy in these settings, with promising results. However, they did not consider other confounding factors, such as using deeper neural networks at the same time, or alternative encodings of the hierarchy. Chalkidis et al. (2019b) also considered few and zero-shot learning, but ignoring the label hierarchy.
Our work is the first attempt to systematically compare flat, PLT-based, and hierarchy-aware LMTC methods in frequent, few-, and zero-shot learning, and the first exploration of the effect of transfer learning in LMTC on multiple datasets.

Notation for neural methods
We experiment with neural methods consisting of: (i) a token encoder (E_w), which makes token embeddings (w_t) context-aware (h_t); (ii) a document encoder (E_d), which turns a document into a single embedding; (iii) an optional label encoder (E_l), which turns each label into a label embedding; (iv) a document decoder (D_d), which maps the document to label probabilities. Unless otherwise stated, tokens are words, and E_w is a stacked BIGRU.

Flat neural methods
BIGRU-LWAN: The document encoder E_d employs one attention head per label:

a_{l,t} = softmax_t(h_t^T u_l),    d_l = Σ_{t=1}^{T} a_{l,t} h_t,

where T is the document length in tokens, h_t is the context-aware representation of the t-th token, and u_l is a trainable vector used to compute the attention scores of the l-th attention head; u_l can also be viewed as a label representation. Intuitively, each head focuses on possibly different tokens of the document to decide if the corresponding label should be assigned. In this model, D_d employs L linear layers with sigmoid activations, each operating on a different label-wise document representation d_l, to produce the probability of the corresponding label. (The original model was proposed by Mullenbach et al. (2018), with a CNN token encoder E_w; Chalkidis et al. (2019b) show that a BIGRU is a better encoder than CNNs. See also the supplementary material for a detailed comparison.)
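To make the label-wise attention concrete, here is a minimal NumPy sketch of a single forward pass of the flat LWAN decoder described above; the array shapes and the function name are ours and purely illustrative, not taken from any released implementation.

```python
import numpy as np

def label_wise_attention(H, U, W_out, b_out):
    """Flat LWAN decoder sketch.
    H: (T, d) context-aware token representations h_t from the encoder E_w.
    U: (L, d) one trainable attention vector u_l per label.
    W_out: (L, d), b_out: (L,) label-specific output layers of D_d.
    Returns a vector of L label probabilities.
    """
    scores = U @ H.T                                    # (L, T) raw scores h_t^T u_l
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)                   # softmax over tokens, per label
    D = A @ H                                            # (L, d) label-wise document reps d_l
    logits = (W_out * D).sum(axis=1) + b_out             # one linear layer per label
    return 1.0 / (1.0 + np.exp(-logits))                 # sigmoid -> p_l

# toy usage: 4 tokens, 8-dim encoder, 5 labels
rng = np.random.default_rng(0)
p = label_wise_attention(rng.normal(size=(4, 8)), rng.normal(size=(5, 8)),
                         rng.normal(size=(5, 8)), np.zeros(5))
print(p.shape)  # (5,)
```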

Hierarchical PLT-based methods
In PLT-based methods, each label is represented as the average of the feature vectors of the training documents that are annotated with this label. The root of the PLT corresponds to the full label set. The label set is partitioned into k subsets using k-means clustering, and each subset is represented by a child node of the root in the PLT. The labels of each new node are then recursively partitioned into k subsets, which become children of that node in the PLT. If the label set of a node has fewer than m labels, the node becomes a leaf and the recursion terminates. During inference, the PLT is traversed top down. At each non-leaf node, a multi-label classifier decides which children nodes
(if any) should be visited by considering the feature vector of the document. When a leaf node is visited, the multi-label classifier of that node decides which labels of the node will be assigned to the document.

PARABEL, BONSAI: We experiment with PARABEL (Prabhu et al., 2018) and BONSAI (Khandagale et al., 2019), two state-of-the-art PLT-based methods. PARABEL employs binary PLTs (k = 2), while BONSAI uses non-binary PLTs (k > 2), which are shallower and wider. In both methods, a linear classifier is used at each node, and documents are represented by TF-IDF feature vectors.
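As an illustration of the recursive construction described above, the following sketch builds a PLT by repeatedly partitioning label representations with k-means, assuming scikit-learn is available; the values of k and the leaf size are illustrative, and the per-node classifiers are omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_plt(label_ids, label_vecs, k=2, max_leaf=4):
    """Recursively partition a label set into a Probabilistic Label Tree.
    label_vecs[l] is the label representation (e.g., the mean feature vector of
    the training documents annotated with label l).
    Returns a nested dict: inner nodes hold children, leaves hold label ids.
    """
    if len(label_ids) <= max_leaf:
        return {"labels": list(label_ids)}               # leaf: terminate recursion
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    assign = km.fit_predict(label_vecs[label_ids])        # split labels into k subsets
    if len(set(assign)) < 2:                              # degenerate split: fall back to an even split
        mid = len(label_ids) // 2
        assign = np.array([0] * mid + [1] * (len(label_ids) - mid))
    children = []
    for c in set(assign):
        subset = label_ids[assign == c]
        children.append(build_plt(subset, label_vecs, k, max_leaf))
    return {"children": children}

# toy usage: 20 labels with 10-dim label vectors, binary tree as in PARABEL
vecs = np.random.default_rng(0).normal(size=(20, 10))
tree = build_plt(np.arange(20), vecs, k=2, max_leaf=4)
```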
ATTENTION-XML: Recently, You et al. (2019) proposed a hybrid method that aims to leverage the advantages of both PLTs and neural models. Similarly to BONSAI, ATTENTION-XML uses non-binary trees. However, the classifier at each node of the PLT is now an LWAN with a BILSTM token encoder (E_w), instead of a linear classifier operating on TF-IDF document representations.

Transfer learning based LMTC
BIGRU-LWAN-ELMO: In this model, we use ELMO (Peters et al., 2018) to obtain context-sensitive token embeddings, which we concatenate with the pre-trained word embeddings to obtain the initial token embeddings (w_t) of BIGRU-LWAN. Otherwise, the model is the same as BIGRU-LWAN.
BERT, ROBERTA: Following Devlin et al. (2019), we feed each document to BERT and obtain the top-level representation h_CLS of BERT's special [CLS] token as the (single) document representation. D_d is now a linear layer with L outputs and sigmoid activations, which operates directly on h_CLS, producing a probability for each label. The same arrangement applies to ROBERTA.

BERT-LWAN: Given the large size of the label set in LMTC datasets, we propose a combination of BERT and LWAN. Instead of using h_CLS as the document representation and passing it through a linear layer with L outputs (as with BERT and ROBERTA), we pass all the top-level output representations of BERT into a label-wise attention mechanism. The entire model (BERT-LWAN) is jointly trained, also fine-tuning the underlying BERT encoder.
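A sketch of the BERT-LWAN idea in TensorFlow with the HuggingFace Transformers library is shown below; the checkpoint name, the shared output layer, and all hyper-parameters are illustrative simplifications, not the exact configuration used in the experiments.

```python
import tensorflow as tf
from transformers import TFBertModel

class BertLWAN(tf.keras.Model):
    """BERT-LWAN sketch: label-wise attention over all top-level BERT outputs,
    instead of classifying from the [CLS] vector alone."""
    def __init__(self, num_labels, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = TFBertModel.from_pretrained(model_name)
        hidden = self.bert.config.hidden_size
        self.label_queries = self.add_weight(              # one attention vector u_l per label
            "label_queries", shape=(num_labels, hidden))
        self.out = tf.keras.layers.Dense(1)                # shared scoring layer (a simplification)

    def call(self, input_ids, attention_mask):
        H = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state  # (B, T, d)
        scores = tf.einsum("btd,ld->blt", H, self.label_queries)                   # (B, L, T)
        mask = (1.0 - tf.cast(attention_mask, tf.float32))[:, tf.newaxis, :] * -1e9
        A = tf.nn.softmax(scores + mask, axis=-1)          # attention over tokens, per label
        D = tf.einsum("blt,btd->bld", A, H)                # label-wise document reps d_l
        return tf.squeeze(tf.sigmoid(self.out(D)), -1)     # (B, L) label probabilities
```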

Zero-shot LMTC
C-BIGRU-LWAN is a zero-shot capable extension of BIGRU-LWAN. It was proposed by Rios and Kavuluru (2018), but with a CNN encoder; instead, we use a BIGRU. In this method, E_l creates u_l as the centroid of the token embeddings of the corresponding label descriptor. The label representations u_l are then used by the attention heads.
Here h_t are the context-aware embeddings produced by E_w; a_{lt} is the attention score of the l-th attention head for the t-th document token, whose representation also serves as the value vector v_t (Eq. 2); and d_l is the label-wise document representation of the l-th label. D_d also relies on the label representations u_l to produce each label probability p_l.
The centroid label representations u_l of both encountered (during training) and unseen (zero-shot) labels remain unchanged, because the token embeddings in the centroids are not updated. This keeps the representations of unseen labels close to those of similar labels encountered during training. In turn, this helps the attention mechanism (Eq. 3) and the decoder (Eq. 4) cope with unseen labels whose descriptors are similar to those of labels encountered during training.
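The following minimal sketch shows how such centroid label representations can be built from label descriptors and a fixed (frozen) word-embedding table; the dictionary-based interface and names are ours, purely for illustration.

```python
import numpy as np

def label_centroids(label_descriptors, embeddings, dim=200):
    """E_l sketch: represent each label by the centroid of the frozen word
    embeddings of its descriptor, so an unseen label gets a representation
    close to those of seen labels with similar descriptors.
    label_descriptors: dict label_id -> list of tokens, e.g. {"D12": ["lung", "disease"]}
    embeddings: dict token -> np.ndarray of shape (dim,)
    """
    U = {}
    for label, tokens in label_descriptors.items():
        vecs = [embeddings[t] for t in tokens if t in embeddings]
        U[label] = np.mean(vecs, axis=0) if vecs else np.zeros(dim)  # fallback for OOV-only descriptors
    return U
```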
GC-BIGRU-LWAN: This model, originally proposed by Rios and Kavuluru (2018), applies graph convolutions (GCNs) to the label hierarchy. The intuition is that the GCNs will help the representations of rare labels benefit from the (better) representations of more frequent labels that are nearby in the label hierarchy. E_l now creates graph-aware label representations u_l^(3) from the corresponding label descriptors (we omit the bias terms for brevity) as follows:

u_l^(2) = f(W_s^1 u_l + Σ_{p∈N_{p,l}} W_p^1 u_p + Σ_{c∈N_{c,l}} W_c^1 u_c)    (5)
u_l^(3) = f(W_s^2 u_l^(2) + Σ_{p∈N_{p,l}} W_p^2 u_p^(2) + Σ_{c∈N_{c,l}} W_c^2 u_c^(2))    (6)

where u_l is again the centroid of the token embeddings of the descriptor of the l-th label; W_s^i, W_p^i, W_c^i are matrices for the self, parent, and children nodes of each label; N_{p,l}, N_{c,l} are the sets of parents and children of the l-th label; and f is the tanh activation. The label-wise document representations d_l are again produced by E_d, as in C-BIGRU-LWAN (Eq. 2-3), but they go through an additional dense layer with tanh activation (Eq. 8). The resulting document representations d_{l,o} and the graph-aware label representations u_l^(3) are then used by D_d to produce a probability p_l for each label (Eq. 9).
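For clarity, here is a NumPy sketch of one such graph convolution step over the label hierarchy, using the self/parent/children weight matrices defined above; the loop-based form is purely illustrative (practical implementations use sparse matrix operations).

```python
import numpy as np

def gcn_layer(U, parents, children, W_s, W_p, W_c):
    """One GCN layer over the label hierarchy (bias terms omitted, as in the text).
    U: (L, d) current label representations (initially the descriptor centroids u_l).
    parents[l], children[l]: lists of parent / child label indices of label l.
    W_s, W_p, W_c: (d, d) weight matrices for self, parent, and children nodes.
    Returns the next-layer label representations, with a tanh activation.
    """
    out = np.zeros_like(U)
    for l in range(U.shape[0]):
        z = U[l] @ W_s
        for p in parents[l]:
            z = z + U[p] @ W_p
        for c in children[l]:
            z = z + U[c] @ W_c
        out[l] = np.tanh(z)
    return out

# two stacked layers of this form yield the graph-aware representations u_l^(3)
```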
DC-BIGRU-LWAN: The stack of GCN layers in GC-BIGRU-LWAN (Eq. 5-6) can be turned into a plain two-layer Multi-Layer Perceptron (MLP), unaware of the label hierarchy, by setting N_{p,l} = N_{c,l} = ∅. We call DC-BIGRU-LWAN the resulting (deeper than C-BIGRU-LWAN) variant of GC-BIGRU-LWAN. We use it as an ablation method to evaluate the impact of the GCN layers on performance.
DN-BIGRU-LWAN: As an alternative approach to exploit the label hierarchy, we used a recent improvement of NODE2VEC (Grover and Leskovec, 2016) by Kotitsas et al. (2019) to obtain alternative hierarchy-aware label representations. NODE2VEC is similar to WORD2VEC (Mikolov et al., 2013), but pre-trains node embeddings instead of word embeddings, replacing WORD2VEC's text windows by random walks on a graph (here the label hierarchy). In a variant of DC-BIGRU-LWAN, dubbed DN-BIGRU-LWAN, we simply replace the initial centroid label representations u_l of DC-BIGRU-LWAN in Eq. 5 and 7 by the label representations g_l generated by the NODE2VEC extension.

GNC-BIGRU-LWAN: Similarly, we expand GC-BIGRU-LWAN with the hierarchy-aware label representations of the NODE2VEC extension. Again, we replace the centroid label representations u_l of GC-BIGRU-LWAN in Eq. 5 and 7 by the label representations g_l of the NODE2VEC extension. The resulting model, GNC-BIGRU-LWAN, uses both NODE2VEC and the GCN layers to encode the label hierarchy, thus obtaining knowledge from the label hierarchy both in a self-supervised and a supervised fashion.

Graph-aware Annotation Proximity
In this work, we introduce graph-aware annotation proximity (GAP), a measure of the topological proximity (on the label hierarchy) of the gold labels assigned to documents. GAP turns out to be a key factor in the performance of hierarchy-aware zero-shot capable extensions of BIGRU-LWAN. Let G(L, E) be the graph of the label hierarchy, where L is the set of nodes (label set) and E the set of edges. Let L_d ⊆ L be the set of gold labels a particular document d is annotated with. Finally, let L_d^+ ⊇ L_d be the smallest superset of L_d such that for any two nodes (gold labels) l_1, l_2 ∈ L_d, the shortest path between l_1, l_2 in the full graph G(L, E) is also a path in the subgraph induced by L_d^+. Intuitively, we extend L_d to L_d^+ by including the additional labels that lie between any two assigned labels l_1, l_2 on the shortest path that connects l_1, l_2 in the full graph. We then define:

GAP_d = |L_d| / |L_d^+|

By averaging GAP_d over all the documents d of a dataset, we obtain a single GAP score per dataset (Fig. 1). When the assigned (gold) labels of the documents are frequently neighbours in the full graph (label hierarchy), we need to add fewer labels when expanding the L_d of each document to L_d^+; hence, GAP → 1. When the assigned (gold) labels are frequently remote from each other, we need to add more labels (|L_d^+| ≫ |L_d|) and GAP → 0.

GAP should not be confused with label density (Tsoumakas and Katakis, 2009):

LD = (1/N) Σ_d |L_d| / |L|

where N is the total number of documents. Although label density is often used in the multi-label classification literature, it is graph-unaware, i.e., it does not consider the positions (and distances) of the assigned labels in the graph.
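A small sketch of how GAP_d can be computed with networkx is shown below; it treats the hierarchy as an undirected graph and assumes that adding the nodes of one shortest path per pair of gold labels is sufficient, which matches the intuition above. The function and variable names are ours.

```python
import networkx as nx

def gap(hierarchy_edges, gold_labels):
    """Graph-aware annotation proximity of one document's gold label set.
    Expands the gold labels with every label lying on a shortest path between
    any two gold labels, then returns |L_d| / |L_d^+|.
    """
    G = nx.Graph(hierarchy_edges)                        # label hierarchy as an undirected graph
    gold = list(gold_labels)
    expanded = set(gold)
    for i, l1 in enumerate(gold):
        for l2 in gold[i + 1:]:
            expanded.update(nx.shortest_path(G, l1, l2))  # add intermediate labels
    return len(gold) / len(expanded)

# toy hierarchy: root -> a, b; a -> a1, a2
edges = [("root", "a"), ("root", "b"), ("a", "a1"), ("a", "a2")]
print(gap(edges, {"a1", "a2"}))   # nearby gold labels -> high GAP (2/3)
print(gap(edges, {"a1", "b"}))    # remote gold labels -> lower GAP (2/4)
```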

Data
EURLEX57K (Chalkidis et al., 2019b) contains 57k English legislative documents from EUR-LEX (http://eur-lex.europa.eu/). Each document is annotated with one or more concepts (labels) from the 4,271 concepts of EUROVOC (http://eurovoc.europa.eu/). The average document length is approx. 727 words. The labels are divided into frequent (746 labels), few-shot (3,362), and zero-shot (163), depending on whether they were assigned to n > 50, 1 ≤ n ≤ 50, or no training documents. They are organized in a 6-level hierarchy, which was not considered in the experiments of Chalkidis et al. (2019b). The documents are labeled with concepts from all levels (Fig. 1), but in practice if a label is assigned, none of its ancestor or descendant labels are assigned. The resulting GAP is 0.45.

MIMIC-III (Johnson et al., 2017) contains approx. 52k English discharge summaries from US hospitals. The average document length is approx. 1.6k words. Each summary has one or more codes (labels) from the 8,771 leaves of the ICD-9 hierarchy, which has 8 levels (Fig. 1). Labels are divided into frequent (4,112 labels), few-shot (4,216 labels), and zero-shot (443 labels), depending on whether they were assigned to n > 5, 1 ≤ n ≤ 5, or no training documents. All discharge summaries are annotated with leaf nodes (5-digit codes) only, i.e., the most fine-grained categories (Fig. 1), causing the label assignments to be much sparser compared to EURLEX57K (GAP 0.27).
AMAZON13K (McAuley and Leskovec, 2013) contains approx. 1.5M English product descriptions from Amazon. Each product is represented by a title and a description, which are on average 250 words when concatenated. Products are classified into one or more categories (labels) from a set of approx. 14k. Labels are divided into frequent (3,108 labels), few-shot (10,581 labels), and zero-shot (579 labels), depending on whether they were assigned to n > 100, 1 ≤ n ≤ 100, or no training documents. The labels are organized in a hierarchy of 8 levels. If a product is annotated with a label, all of its ancestor labels are also assigned to the product (Fig. 1), leading to dense label assignments (GAP 0.86).
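For concreteness, the sketch below shows how labels can be bucketed into the frequent, few-shot, and zero-shot groups used throughout the paper, given the per-dataset frequency thresholds above; the function and variable names are ours.

```python
from collections import Counter

def bucket_labels(train_label_sets, all_labels, few_max=50):
    """Split labels into frequent / few-shot / zero-shot by training frequency,
    e.g. few_max=50 for EURLEX57K, 5 for MIMIC-III, 100 for AMAZON13K."""
    counts = Counter(l for labels in train_label_sets for l in labels)
    frequent = {l for l in all_labels if counts[l] > few_max}
    few_shot = {l for l in all_labels if 0 < counts[l] <= few_max}
    zero_shot = {l for l in all_labels if counts[l] == 0}
    return frequent, few_shot, zero_shot
```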

Evaluation Measures
The most common evaluation measures in LMTC are label precision and recall at the top K predicted labels (P@K, R@K) of each document, and nDCG@K (Manning et al., 2009), all averaged over test documents. However, P@K and R@K unfairly penalize methods when the gold labels of a document are fewer or more than K, respectively. R-Precision@K (RP@K) (Chalkidis et al., 2019b) is better in this respect: it is the same as P@K if there are at least K gold labels; otherwise, K is reduced to the number of gold labels. When the order of the top-K labels is unimportant (e.g., for small K), RP@K is more appropriate than nDCG@K.
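A short sketch of RP@K as described above (the helper name is ours):

```python
def rp_at_k(ranked_labels, gold_labels, k):
    """R-Precision@K: like P@K, but K is reduced to the number of gold labels
    when a document has fewer than K gold labels."""
    k_eff = min(k, len(gold_labels))
    if k_eff == 0:
        return 0.0
    hits = sum(1 for l in ranked_labels[:k_eff] if l in gold_labels)
    return hits / k_eff

# a document with 2 gold labels, evaluated at K=5: the denominator becomes 2, not 5
print(rp_at_k(["a", "b", "c", "d", "e"], {"a", "c"}, k=5))  # 0.5
```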

Implementation Details
We implemented the neural methods in TENSORFLOW 2 (https://www.tensorflow.org/), also relying on the HuggingFace Transformers library (https://github.com/huggingface/transformers/) for the BERT-based models. We use the BASE versions of all models, and the Adam optimizer (Kingma and Ba, 2015). All hyper-parameters were tuned by selecting the values with the best loss on the development data. For all PLT-based methods, we used the code provided by their authors.

Overall predictive performance
PLTs vs. LWANs: Interestingly, the TF-IDF-based PARABEL and BONSAI outperform the best previously published neural LWAN-based models on EURLEX57K and AMAZON13K, while being comparable to ATTENTION-XML, when all or frequent labels are considered (Table 1). This is not the case with MIMIC-III, where BIGRU-LWAN and ATTENTION-XML have far better results for all and frequent labels. The poor performance of the two TF-IDF-based PLT methods on MIMIC-III seems to be due to the fact that their TF-IDF features ignore word order and are not contextualized, which is particularly important in this dataset. To confirm this, we repeated the experiments of BIGRU-LWAN on MIMIC-III after shuffling the words of the documents, and performance dropped by approx. 7.7% across all measures, matching the performance of PLT-based methods. The dominance of ATTENTION-XML in MIMIC-III further supports our intuition that word order is particularly important in this dataset, as the core difference of ATTENTION-XML from the rest of the PLT-based methods is the use of RNN-based classifiers that use word embeddings and are sensitive to word order, instead of linear classifiers with TF-IDF features, which do not capture word order. Meanwhile, in both EURLEX57K and AMAZON13K, the performance of ATTENTION-XML is competitive with both the TF-IDF-based PLT methods and BIGRU-LWAN, suggesting that the bag-of-words assumption holds in these cases. Thus, we can fairly assume that word order and global context (long-term dependencies) do not play a drastic role when predicting labels (concepts) on these datasets.
Effects of transfer learning: Adding context-aware ELMO embeddings to BIGRU-LWAN (BIGRU-LWAN-ELMO) improves performance across all datasets by a small margin, when considering all or frequent labels. For EURLEX57K and AMAZON13K, larger performance gains are obtained by fine-tuning BERT-BASE and ROBERTA-BASE. Our proposed new method (BERT-BASE-LWAN), which employs LWAN on top of BERT-BASE, has the best results among all methods on EURLEX57K and AMAZON13K, when all and frequent labels are considered. However, in both datasets, the results are comparable to BERT-BASE, indicating that the multi-head attention mechanism of BERT can effectively handle the large number of labels.
Poor performance of BERT on MIMIC-III: Quite surprisingly, all three BERT-based models perform poorly on MIMIC-III (Table 1), so we examined two possible reasons. First, we hypothesized that this poor performance is due to the distinctive writing style and terminology of biomedical documents, which are not well represented in the generic corpora these models are pre-trained on. To check this hypothesis, we employed CLINICAL-BERT (Alsentzer et al., 2019), a version of BERT-BASE that has been further fine-tuned on biomedical documents, including discharge summaries. Table 2 shows that CLINICAL-BERT performs slightly better than BERT-BASE on the biomedical dataset, partly confirming our hypothesis. The improvement, however, is small and CLINICAL-BERT still performs worse than ROBERTA-BASE, which is pre-trained on larger generic corpora with a larger vocabulary. Examining the token vocabularies (Gage, 1994) of the BERT-based models reveals that biomedical terms are frequently over-fragmented; e.g., 'pneumonothorax' becomes ['p', '##ne', '##um', '##ono', '##th', '##orax'], and 'schizophreniform' becomes ['s', '##chi', '##zo', '##ph', '##ren', '##iform']. This is also the case with CLINICAL-BERT, where the original vocabulary of BERT-BASE was retained. We suspect that such long sequences of meaningless sub-words are difficult to re-assemble into meaningful units, even when using deep pre-trained Transformer-based models. Thus we also report the performance of SCI-BERT (Beltagy et al., 2019), which was pre-trained from scratch (including building the vocabulary) on scientific articles, mostly from the biomedical domain. Indeed SCI-BERT performs better, but still much worse than ATTENTION-XML.

A second possible reason for the poor performance of BERT-based models on MIMIC-III is that they can process texts only up to 512 tokens long, truncating longer documents. This is not a problem in EURLEX57K, because the first 512 tokens contain enough information to classify EURLEX57K documents (727 words on average), as shown by Chalkidis et al. (2019b). It is also not a problem in AMAZON13K, where texts are short (250 words on average). In MIMIC-III, however, the average document length is approx. 1.6k words, so a large part of each document is discarded. To examine this second hypothesis, we also report results for HIER-SCI-BERT. This model encodes consecutive segments of text (each up to 512 tokens) using a shared SCI-BERT encoder, then applies max-pooling over the segment encodings to produce a final document representation. HIER-SCI-BERT outperforms SCI-BERT, confirming that truncation is an important issue, but it still performs worse than ATTENTION-XML. We believe that a hierarchical BERT model pre-trained from scratch on biomedical corpora, especially discharge summaries, with a new BPE vocabulary, may perform even better in future experiments.
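The hierarchical encoding just described can be sketched as follows, assuming each segment is summarized by its top-level [CLS] vector before max-pooling; the checkpoint name, class name, and layer sizes are placeholders rather than the exact configuration used in the experiments.

```python
import tensorflow as tf
from transformers import TFAutoModel

class HierSegmentEncoder(tf.keras.Model):
    """Hierarchical encoder sketch: a shared Transformer encodes each 512-token
    segment, and the segment [CLS] vectors are max-pooled into one document vector."""
    def __init__(self, num_labels, model_name="bert-base-uncased"):
        super().__init__()
        # in HIER-SCI-BERT the shared encoder would be SCI-BERT; a generic
        # checkpoint name is used here as a placeholder
        self.encoder = TFAutoModel.from_pretrained(model_name)
        self.classifier = tf.keras.layers.Dense(num_labels, activation="sigmoid")

    def call(self, input_ids, attention_mask):
        # input_ids, attention_mask: (batch, num_segments, seq_len), seq_len <= 512
        b, s, t = tf.unstack(tf.shape(input_ids))
        flat_ids = tf.reshape(input_ids, (-1, t))
        flat_mask = tf.reshape(attention_mask, (-1, t))
        cls = self.encoder(flat_ids, attention_mask=flat_mask).last_hidden_state[:, 0]
        cls = tf.reshape(cls, (b, s, -1))              # (batch, num_segments, hidden)
        doc = tf.reduce_max(cls, axis=1)               # max-pool over segments
        return self.classifier(doc)                    # (batch, num_labels) label probabilities
```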

Zero-shot Learning
In Table 1 we intentionally omitted zero-shot labels, as the methods discussed so far, except GC-BIGRU-LWAN, are incapable of zero-shot learning.
In general, any model that relies solely on trainable vectors to represent labels cannot cope with unseen labels, as it eventually learns to ignore unseen labels, i.e., it assigns them near-zero probabilities. In this section, we discuss the results of the zero-shot capable extensions of BIGRU-LWAN (Section 3.5).
In line with the experiments of Rios and Kavuluru (2018), Table 3 shows that GC-BIGRU-LWAN (with GCNs) performs better than C-BIGRU-LWAN on zero-shot labels on all three datasets. These two zero-shot capable extensions of BIGRU-LWAN also obtain better few-shot results on MIMIC-III compared to BIGRU-LWAN; GC-BIGRU-LWAN is also comparable to BIGRU-LWAN in few-shot learning on EURLEX57K, but BIGRU-LWAN is much better than its two zero-shot extensions on AMAZON13K. The superior performance of BIGRU-LWAN on EURLEX57K and AMAZON13K, compared to MIMIC-III, is due to the fact that in the first two datasets few-shot labels are more frequent (n ≤ 50 and n ≤ 100, respectively) than in MIMIC-III (n ≤ 5).
Are graph convolutions a key factor? It is unclear if the gains of GC-BIGRU-LWAN are due to the GCN encoder of the label hierarchy, or the increased depth of GC-BIGRU-LWAN compared to C-BIGRU-LWAN. Table 3 shows that DC-BIGRU-LWAN is competitive to GC-BIGRU-LWAN, indicating that the latter benefits mostly from its increased depth, and to a smaller extent from its awareness of the label hierarchy. This motivated us to search for alternative ways to exploit the label hierarchy.
Alternatives in exploiting label hierarchy: Table 3 shows that DN-BIGRU-LWAN, which replaces the centroids of token embeddings of the label descriptors of DC-BIGRU-LWAN with label embeddings produced by the NODE2VEC extension, is actually inferior to DC-BIGRU-LWAN. In turn, this suggests that although the NODE2VEC extension we employed aims to encode both topological information from the hierarchy and information from the label descriptors, the centroids of word embeddings still capture information from the label descriptors that the NODE2VEC extension misses. This also indicates that exploiting the information from the label descriptors is probably more important than the topological information of the label hierarchy for few and zero-shot learning generalization.
DNC-BIGRU-LWAN, which combines the centroids with the label embeddings of the NODE2VEC extension, is comparable to DC-BIGRU-LWAN, while being better overall in few-shot labels. Combining the GCN encoder and the NODE2VEC extension (GNC-BIGRU-LWAN) leads to a large improvement in zero-shot labels (46.1% to 51.9% nDCG@K) on AMAZON13K. On EURLEX57K, however, the original GC-BIGRU-LWAN still has the best zero-shot results; and on MIMIC-III, the best zero-shot results are obtained by the hierarchy-unaware DC-BIGRU-LWAN. These mixed findings seem related to the GAP of each dataset (Fig. 1).
The role of graph-aware annotation proximity: When gold label assignments are dense, neighbouring labels co-occur more frequently, thus models can leverage topological information and learn how to better cope with neighbouring labels, which is what both GCNs and NODE2VEC do. The denser the gold label assignments, the more we can rely on more distant neighbours, and the better it becomes to include graph embedding methods that conflate larger neighbourhoods, like NODE2VEC (included in GNC-BIGRU-LWAN) on AMAZON13K (GAP 0.86), when predicting unseen labels.
For medium proximity gold label assignments, as in EURLEX57K (GAP 0.45), it seems preferable to rely on closer neighbours only; hence, it is better to use only graph encoders that conflate smaller neighbourhoods, like the GCNs which apply convolution filters to neighbours up to two hops away, as in GC-BIGRU-LWAN (excl. NODE2VEC extension).
When label assignments are sparse, as in MIMIC-III (GAP 0.27), where only non-neighbouring leaf labels are assigned in the same document, leveraging the topological information (e.g., knowing that a rare label shares an ancestor with a frequent one) is not always helpful, which is why encoding the label hierarchy shows no advantage in zero-shot learning in MIMIC-III; however, it can still be useful when we at least have few training instances, as the few-shot results of MIMIC-III indicate.
Overall, we conclude that the GCN label hierarchy encoder does not always improve LWANs in zero-shot learning, compared to equally deep LWANs, and that depending on the proximity of the label assignments (based on the label annotation guidelines) it may be preferable to use additional or no hierarchy-aware encodings for zero-shot learning.

Conclusions
We presented an extensive study of LMTC methods in three domains, to answer three understudied questions on (1) the competitiveness of PLT-based methods against neural models, (2) the use of the label hierarchy, and (3) the benefits from transfer learning. A condensed summary of our findings is that (1) TF-IDF PLT-based methods are definitely worth considering, but are not always competitive, while ATTENTION-XML, a neural PLT-based method that captures word order, is robust across datasets; (2) transfer learning leads to state-of-the-art results in general, but BERT-based models can fail spectacularly when documents are long and technical terms get over-fragmented; (3) the best way to use the label hierarchy in neural methods depends on the proximity of the label assignments in each dataset. An even shorter summary is that no single method is best across all domains and label groups (all, few, zero), as the language, the size of the documents, and the label assignment guidelines vary strongly, with direct implications for the performance of each method.
In future work, we would like to further investigate few and zero-shot learning in LMTC, especially in BERT models, which are currently unable to cope with zero-shot labels. It is also important to shed more light on the poor performance of BERT models in MIMIC-III and propose alternatives that can cope both with long documents (Kitaev et al., 2020; Beltagy et al., 2020) and domain-specific terminology, reducing word over-fragmentation. Pre-training BERT from scratch on discharge summaries with a new BPE vocabulary is a possible solution. Finally, we would like to combine PLTs with BERT, similarly to ATTENTION-XML, but the computational cost of fine-tuning multiple BERT encoders, one for each PLT node, would be massive, surpassing the training cost of very large Transformer-based models, like T5-3B (Raffel et al., 2019) and MEGATRON-LM (Shoeybi et al., 2019), with billions of parameters (30-100x the size of BERT-BASE).

Table 7 shows RP@K results of the zero-shot capable methods. As with nDCG@K, we conclude that the GCN label hierarchy encoder of Rios and Kavuluru (2018) does not always improve LWANs in zero-shot learning, compared to equally deep LWANs, and that depending on the proximity of label assignments, it may be preferable to use additional or no encodings of the hierarchy for zero-shot learning. Also, the zero-shot capable methods outperform BIGRU-LWAN in all, frequent, and few labels, but no method is consistently the best.