META: Metadata-Empowered Weak Supervision for Text Classification

Recent advances in weakly supervised learning enable training high-quality text classifiers with only a few user-provided seed words. Existing methods mainly use text data alone to generate pseudo-labels, despite the fact that metadata information (e.g., author and timestamp) is widely available across various domains. Strong label indicators exist in the metadata, but they have long been overlooked, mainly due to the following challenges: (1) metadata is multi-typed, requiring systematic modeling of different types and their combinations; (2) metadata is noisy, as some metadata entities (e.g., authors, venues) are more compelling label indicators than others. In this paper, we propose a novel framework, META, which goes beyond the existing paradigm and leverages metadata as an additional source of weak supervision. Specifically, we organize the text data and metadata together into a text-rich network and adopt network motifs to capture appropriate combinations of metadata. Based on seed words, we rank and filter motif instances to distill highly label-indicative ones as "seed motifs", which provide additional weak supervision. Following a bootstrapping manner, we train the classifier and expand the seed words and seed motifs iteratively. Extensive experiments and case studies on real-world datasets demonstrate superior performance and significant advantages of leveraging metadata as weak supervision.


Introduction
Weakly supervised text classification has recently gained much attention from researchers because it reduces the burden of annotating data. So far, the major source of weak supervision has been the text data itself (Agichtein and Gravano, 2000; Kuipers et al., 2006; Riloff et al., 2003; Tao et al., 2015; Mekala and Shang, 2020). These methods typically require a few user-provided seed words for each class as weak supervision. They expand the seed words with generated pseudo labels and improve their text classifier in an iterative fashion.
Metadata information (e.g., author, published year), in addition to textual information, is widely available across various domains (e.g., news articles, social media posts, and scientific papers) and can serve as a strong, complementary source of weak supervision. Take the research papers in Figure 1(a) as an example. One can learn in a data-driven manner that G. Hinton is a highly reputed machine learning researcher; thus his presence is a strong indicator of a paper belonging to the Machine Learning category.
Distilling effective metadata for weak supervision faces several major challenges. Metadata is often multi-typed, and each type and each combination of types can carry very different semantics, so they may not be equally important. Moreover, even entities within a single metadata type can be noisy. Continuing our example in Figure 1(a), we notice that year is less helpful than author for classification. Among the authors, J. Dean might be an important figure, but he has research interests spanning different domains. However, if we join the author with year, the combination carries more accurate semantics, and we may discover that J. Dean has had more interest in machine learning in recent years, thus becoming highly label-indicative.

Figure 2: Our META framework. In each iteration, we generate pseudo labels for documents, train the text classifier, and rank all words and motif instances in a unified ranking framework. We then expand seed sets until an automatic cutoff is reached. The quality of the classifier and the seed sets improves through iterations.
Bearing these challenges in mind, we propose META, a principled framework for metadata-empowered weakly supervised text classification. As illustrated in Figure 1 and Figure 2, we first organize the text data and metadata together into a text-rich network. The network structure gives us a holistic view of the corpus and enables us to rank and select useful metadata entities. We leverage motif patterns (Benson et al., 2016; Milo et al., 2002) to model typed metadata as well as their combinations. A motif pattern is a subgraph pattern at the meta-level that captures higher-order connections and the semantics represented by these connections. It serves as a useful tool to model typed edges, typed paths (a.k.a. meta-paths) (Sun et al., 2011), and higher-order structures in the network. With little effort, users can specify a few possibly useful motif patterns as input to our model. We develop a unified, principled ranking mechanism to select label-indicative motif instances and words, forming expanded weak supervision. Note that such instance-level selection also implicitly refines the motif patterns, ensuring the robust performance of META even when irrelevant motif patterns exist in the input. It is worth mentioning that META is compatible with any text classifier.
Our contributions are summarized as follows:
• We explore incorporating metadata information as an additional source of weak supervision for text classification, along with seed words.
• We propose a novel framework, META, which introduces motif patterns to capture the higher-order combinations among different types of metadata and conducts a unified ranking and selection of label-indicative motif instances and words.
• We conduct experiments on two real-world datasets. The results and case studies demonstrate the superiority of incorporating metadata as part of the weak supervision and verify the effectiveness of META.
Reproducibility. Our code is made publicly available on GitHub.

Documents as Text-rich Network
Given a collection of n text documents D = {D_1, D_2, ..., D_n} and their corresponding metadata, we propose to organize them into a text-rich network, as illustrated in Figure 1(b). A text-rich network is a heterogeneous network with documents, words, and different types of metadata as nodes, and their associations as edges. For example, our text-rich network for research papers has papers, words, authors, and publication years as nodes. Each paper is connected to its associated word and metadata nodes. Such a network provides a holistic and structured representation of the input.
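To make this construction concrete, here is a minimal sketch (not from the paper; the document fields and node types are illustrative) of organizing documents and their metadata into a heterogeneous network stored as typed adjacency sets:

```python
from collections import defaultdict

def build_text_rich_network(docs):
    """docs: list of {"id", "words", "authors", "year"} records (fields are
    illustrative). Returns an undirected heterogeneous network as adjacency
    sets keyed by typed nodes, e.g. ("author", "G. Hinton")."""
    edges = defaultdict(set)

    def link(u, v):
        edges[u].add(v)
        edges[v].add(u)

    for d in docs:
        doc = ("doc", d["id"])
        for w in d["words"]:
            link(doc, ("word", w))
        for a in d["authors"]:
            link(doc, ("author", a))
        link(doc, ("year", d["year"]))
    return edges

docs = [
    {"id": 1, "words": ["neural", "network"], "authors": ["G. Hinton"], "year": 2015},
    {"id": 2, "words": ["parsing"], "authors": ["G. Hinton", "J. Dean"], "year": 2015},
]
net = build_text_rich_network(docs)
```

Typed node keys keep the network heterogeneous without a graph library: an author shared by two papers, for instance, ends up adjacent to both document nodes.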

Seed Words and Motif Patterns
Users are asked to provide a few seed words S = {S^w_1, S^w_2, ..., S^w_l}, one set for each of the l classes (i.e., C_1, C_2, ..., C_l) in our classification problem, as well as k motif patterns {M_1, M_2, ..., M_k}. Motif patterns are subgraph patterns at the meta-level (i.e., every node is abstracted by its type). They are able to capture semantics and higher-order interconnections among nodes. A motif instance is a subgraph instance in the graph that follows a motif pattern. Figure 1 presents an example of a motif pattern that captures co-authorship and a motif instance following this motif pattern. In this paper, we discover seed motif instances for each class label, denoted as {S^m_1, S^m_2, ..., S^m_l}.
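As a hypothetical illustration of the pattern/instance distinction, instances of a simple co-authorship motif (Author–Document–Author, with the document node abstracted away) can be enumerated directly from per-document metadata; the record fields below are our own:

```python
from itertools import combinations

def coauthor_motif_instances(docs):
    """Enumerate instances of an Author-Document-Author motif pattern as
    unordered author pairs that co-occur on at least one document."""
    instances = set()
    for d in docs:
        for pair in combinations(sorted(d["authors"]), 2):
            instances.add(pair)
    return instances

papers = [
    {"id": 1, "authors": ["G. Hinton", "J. Dean"]},
    {"id": 2, "authors": ["A. Ng", "G. Hinton", "J. Dean"]},
]
instances = coauthor_motif_instances(papers)
```

Each tuple is one motif instance; the motif pattern itself is the abstract shape (two author-typed nodes linked through a document-typed node).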

Problem Formulation
Given the text-rich network and user-provided seed words and motif patterns as input, we aim to build a high-quality document classifier, assigning one class label C j to each document D i .

Our META Framework
As shown in Figure 2, META is an iterative framework that alternates between generating pseudo labels and training the text classifier, similar to many other weakly supervised text classification methods (Kuipers et al., 2006; Tao et al., 2015). One iteration of META consists of the following steps:
• Generate pseudo labels based on the seeds;
• Train a text classifier based on the pseudo labels;
• Rank and select words and motif instances to expand the seeds.
We repeat these steps iteratively. We denote the number of iterations as T, which is the only hyperparameter in our framework.
The novelty of META mainly lies in integrating two sources of weak supervision: seed motif instances and seed words. For each motif instance m or word w and each label l, we estimate a ranking score R_{m,l} or R_{w,l} ranging between 0 and 1, measuring how indicative it is of that particular label l. These ranking scores are utilized to select new seed motif instances and seed words. Note that, while this selection is conducted at the instance level, it also selects motif patterns implicitly and therefore ensures robust performance when users provide some irrelevant motif patterns.

Pseudo Labels and Text Classifier
Based on the seed words, seed motif instances, and their respective ranking scores for each class, we generate pseudo labels for the unlabeled text documents and train a classifier on these pseudo labels. In the first iteration, we have no seed motif instances and the ranking score is 1 for all seed words.
Pseudo-Label Generation. Suppose we have seed word sets S^w_1, ..., S^w_l and seed motif instance sets S^m_1, ..., S^m_l for all l labels. We generate pseudo labels using a simple yet effective count-based technique. Specifically, given a document D_i, the probability that it belongs to class l is proportional to the aggregated ranking scores of its matched seed words and seed motif instances:

P(l | D_i) ∝ Σ_{w ∈ S^w_l} f_{D_i,w} · R_{w,l} + Σ_{m ∈ S^m_l : m linked to D_i} R_{m,l},
where f_{D_i,w} is the term frequency of word w in document D_i. The pseudo label of document D_i is then assigned as the class with the highest probability, i.e., L_i = argmax_l P(l | D_i).
Text Classifier. Our framework is compatible with any text classification model as the classifier. We use Hierarchical Attention Networks (HAN) (Yang et al., 2016). HAN is designed to capture the hierarchical document structure, i.e., words → sentences → documents. As illustrated in Figure 3, HAN performs attention first on the sentences in a document to find the important sentences, and then on the words in each sentence to identify the important words. We train a HAN model on the unlabeled documents with the generated pseudo labels. For each document D_i, it estimates the probability Ŷ_{i,l} for each class l. These predicted distributions are used in the expansion of seed words and motifs.
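The count-based pseudo-labeling step can be sketched as follows (a simplified illustration with hypothetical inputs; the exact aggregation in the paper, e.g. how motif matches are weighted against word matches, is our assumption):

```python
from collections import Counter

def pseudo_label(doc_words, doc_motifs, seed_words, seed_motifs):
    """doc_words: token list; doc_motifs: set of motif instances linked to the
    document. seed_words / seed_motifs: {label: {seed: ranking_score}}.
    Scores each class by term-frequency-weighted seed-word scores plus the
    scores of matched seed motif instances; returns the argmax label, or
    None when no seed matches."""
    tf = Counter(doc_words)
    scores = {}
    for label in seed_words:
        s = sum(tf[w] * r for w, r in seed_words[label].items())
        s += sum(r for m, r in seed_motifs.get(label, {}).items() if m in doc_motifs)
        scores[label] = s
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

seed_words = {"ML": {"neural": 1.0}, "NLP": {"parsing": 1.0}}
seed_motifs = {"ML": {("G. Hinton", "J. Dean"): 0.9}}
label = pseudo_label(["neural", "neural", "model"], set(), seed_words, seed_motifs)
```

Documents with no matching seed return None and can simply be left unlabeled for that iteration.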

Unified Seed Ranking and Expansion
Once the text classifier is trained, we rank words and motif instances together for each class. Then, we expand the seed sets by adding top-ranked words and motif instances. This improves the quality of the weak supervision over iterations, thereby improving the text classifier. We present our design of the unified ranking and expansion as follows.
Ranking Score Design. An ideal seed word or motif instance for a particular class should be highly relevant and highly exclusive to that class, so an effective ranking score must quantify both relevance and exclusiveness. Such ranking scores for words alone have been explored by previous studies (Tao et al., 2015; Mekala and Shang, 2020), typically based on similarity- and frequency-based metrics.

Figure 4: Using motif patterns, we construct bipartite graphs from the text-rich network linking documents to their respective motif instances.
In this paper, we have motif instances in addition to words; therefore, we build upon the text-rich network to unify the ranking process. Given the k user-provided motif patterns M_1, ..., M_k and the text-rich network G, we construct k bipartite graphs G^B_1, ..., G^B_k, one for each motif pattern (see Figure 4). In the i-th bipartite graph G^B_i, the node set contains two parts: (1) all documents and (2) all motif instances following the motif pattern M_i in the text-rich network G. The edges in G^B_i connect each document to the motif instances that are subsets of the metadata associated with that document.
For the sake of simplicity, we introduce one more motif pattern, document-word. It makes words a special case of motif instances, and one can easily construct a similar bipartite graph for words. Therefore, in the rest of this section, we use motif instances to explain our ranking score design.
For each motif pattern M, we conduct one personalized random walk on its corresponding bipartite graph G^B for each label l. Specifically, we normalize each column of the adjacency matrix of G^B by the degree of its respective node, resulting in the transition matrix W. Let p_{l,u} denote the personalized PageRank (PPR) score of node u for label l. We initialize the PPR score of each document node to Ŷ_{i,l} and the PPR score of each motif instance node to 0. This initialization ensures that a random walk starts from a document node, and since G^B is bipartite, it ends at a motif instance node. We iteratively update the PPR scores as

p_l ← (1 − α) · W · p_l + α · p^{(0)}_l,

where p^{(0)}_l is the initialization vector and α is the restart probability. Since each document node is initialized with the probabilities corresponding to l, and the random walk starts from a document node and ends at a motif instance node, this can be viewed as a label propagation problem. Based on previous work in label propagation (Hensley et al., 2015), similar nodes are more likely to form edges, and the PPR score can be used to measure similarity. Therefore, we believe that p_{l,m} reflects the relevance of a motif instance m to the particular class label l.
Though the absolute values of the PPR scores are quite small, their relative magnitudes convey their affinity towards a label. Therefore, we normalize these PPR scores into a distribution over labels, resulting in the ranking scores. Mathematically, for a label l, the ranking score of a motif instance m is

R_{m,l} = p_{l,m} / Σ_{l'} p_{l',m}.

If a motif instance has similar relevance to multiple labels, this ranking score distribution becomes flat irrespective of the magnitudes of its PPR scores. Hence, our ranking score also quantifies exclusiveness, which is an essential characteristic of a highly label-indicative term.
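The random walk and normalization above can be sketched as follows. This is a simplified implementation under our own assumptions: the restart probability alpha and the fixed number of power-iteration steps are not specified in the text.

```python
import numpy as np

def motif_ranking_scores(A, Y, alpha=0.15, iters=50):
    """A: (n_docs, n_motifs) biadjacency matrix of one bipartite graph G^B.
    Y: (n_docs, n_labels) predicted class probabilities used to personalize.
    Returns R: (n_motifs, n_labels) ranking scores, one distribution over
    labels per motif instance."""
    n_docs, n_motifs = A.shape
    # Full adjacency of the bipartite graph, then column-normalize by degree.
    B = np.zeros((n_docs + n_motifs, n_docs + n_motifs))
    B[:n_docs, n_docs:] = A
    B[n_docs:, :n_docs] = A.T
    W = B / np.maximum(B.sum(axis=0, keepdims=True), 1e-12)
    ppr = []
    for l in range(Y.shape[1]):
        # Personalization: documents start with Y[:, l], motifs with 0.
        p0 = np.concatenate([Y[:, l], np.zeros(n_motifs)])
        p = p0.copy()
        for _ in range(iters):
            p = (1 - alpha) * (W @ p) + alpha * p0
        ppr.append(p[n_docs:])  # keep only motif-instance scores
    P = np.array(ppr).T  # (n_motifs, n_labels) PPR scores
    return P / P.sum(axis=1, keepdims=True)  # normalize across labels

A = np.array([[1.0, 0.0],
              [1.0, 1.0]])  # doc0 -> motif0; doc1 -> motif0, motif1
Y = np.array([[0.9, 0.1],
              [0.2, 0.8]])
R = motif_ranking_scores(A, Y)
```

In this toy example, motif0 is pulled toward label 0 (whose probability mass sits on doc0, which links only to motif0), while motif1 leans toward label 1.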
Based on this ranking score, we rank words and motif instances in a unified manner and expand the seed word set and seed motifs set.
Expansion. Given the ranking scores of all words and motif instances for every label, we expand the seed words and seed motifs simultaneously for all labels. Intuitively, a highly label-indicative motif instance should not belong to the seed sets of multiple labels. Therefore, when any motif instance would be expanded into the seed sets of multiple classes, we stop the expansion of motif instances of the corresponding motif pattern. Additionally, we set a hard threshold of 1/|C| on the ranking scores of added motif instances, where |C| is the number of classes. In this way, the number of new seed words and seed motif instances is decided automatically. It is worth mentioning that this expansion is adaptive, and every label may end up with a different number of seeds. Note that, in the first iteration, pseudo labels are generated using only seed words, but ranking scores are obtained for all words and motif instances; the highly ranked motif instances and words are used as seeds in subsequent iterations.
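The adaptive cutoff can be sketched as below, for the candidates of a single motif pattern (words behave the same way via the document-word pattern). The exact order of the checks is our own reading of the rule.

```python
def expand_seeds(scores, current_seeds, n_labels):
    """scores: {candidate: {label: ranking_score}} for candidates following one
    motif pattern. current_seeds: {label: set of seeds}. Adds each candidate,
    in decreasing order of its best score, to the single label whose score
    exceeds the hard 1/|C| threshold; stops expanding this pattern as soon as
    a candidate qualifies for multiple labels."""
    threshold = 1.0 / n_labels
    ranked = sorted(scores, key=lambda c: max(scores[c].values()), reverse=True)
    for cand in ranked:
        qualified = [l for l, s in scores[cand].items() if s > threshold]
        if len(qualified) > 1:
            break  # candidate is ambiguous across labels: stop this pattern
        if len(qualified) == 1:
            current_seeds[qualified[0]].add(cand)
    return current_seeds

seeds = {"A": set(), "B": set(), "C": set()}
scores = {"m1": {"A": 0.8, "B": 0.1, "C": 0.1},
          "m2": {"A": 0.5, "B": 0.45, "C": 0.05},
          "m3": {"A": 0.9, "B": 0.05, "C": 0.05},
          "m4": {"A": 0.4, "B": 0.3, "C": 0.3}}
seeds = expand_seeds(scores, seeds, n_labels=3)
```

Here m3 and m1 are added to class A, m2 qualifies for both A and B and halts the pattern, so m4 is never reached; no explicit count of new seeds has to be chosen.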
After expanding the seed sets for every label, we generate pseudo labels and train the classifier. This process is repeated iteratively for T iterations.

Experiments
In this section, we evaluate META and compare it with existing techniques on two real-world datasets in a weakly supervised classification setting.

Experimental Settings
Datasets. We conduct experiments on the DBLP dataset (Tang et al., 2008) and the Book Graph dataset (Wan and McAuley, 2018;Wan et al., 2019). The dataset statistics are shown in Table 1. The details of the datasets are mentioned below.
• DBLP dataset: The DBLP dataset contains a comprehensive set of research papers in computer science. We select 38,128 papers published in flagship venues. In addition to text data, it has information about the authors, published year, and venue of each paper. There are 9,300 distinct authors and 42 distinct years. For each paper, we annotate its research area, largely based on its venue, as the classification objective (classes: (1) computer vision, (2) computational linguistics, (3) biomedical engineering, (4) software engineering, (5) graphics, (6) data mining, (7) security and cryptography, (8) signal processing, (9) robotics, and (10) ...). Therefore, in our experiments, we drop the venue information to ensure a fair comparison.
• Book Graph dataset: The Book Graph dataset is a collection of book descriptions, user-book interactions, and users' book reviews collected from Goodreads, a popular online book review website. We select books belonging to eight popular genres; the genre of a book is the label to be predicted. The total number of books selected is 33,594. We use the title and description of a book as text data, and author, publisher, and year as metadata. In total, there are 22,145 distinct authors, 5,186 distinct publishers, and 136 distinct years.
Motif Patterns. The motif patterns we use as metadata information for the DBLP and Book Graph datasets are shown in Figure 5.
Seed Words. The seed words are obtained as follows: we asked 5 human experts to recommend seed words for each class.
Evaluation Metrics. Both datasets are imbalanced with respect to the label distribution. Being aware of this fact, we adopt micro- and macro-F1 scores as evaluation metrics.
Implementation Details. To make the model robust to multi-word phrases as supervision, we extract phrases using AutoPhrase (Shang et al., 2018). We set the word vector dimension to 100 for all methods that use word embeddings. We set the number of iterations T for META to 9.

Compared Methods
We compare our proposed method with a wide range of methods described below:
• IR-TF-IDF treats the seed words as a query. It computes the relevance of a document to a class by aggregating the TF-IDF values of the seed words. Each document is assigned the label most relevant to it.
• Word2Vec learns word vector representations (Mikolov et al., 2013) for all words in the corpus. It computes label representations by aggregating the word vectors of all seed words of a label. Each document is assigned the label whose cosine similarity with the document is maximum.
• Doc2Cube (Tao et al., 2015) considers label surface names as the seed set and performs multidimensional document classification by learning dimension-aware embeddings.
• WeSTClass leverages seed words to generate bag-of-words pseudo documents for neural model pre-training and then bootstraps the model on unlabeled data. Specifically, we compare with WeSTClass-CNN, which is the best configuration under our setting. We use the public implementation of WeSTClass with the hyperparameters mentioned in the paper.
• Metapath2Vec (Dong et al., 2017) learns node representations in the text-rich network using meta-path-guided random walks, capturing the structural and semantic correlations of differently typed nodes. We use the first two motif patterns in Figure 5(a) and the first three motif patterns in Figure 5(b) as meta-paths, because the rest cannot be represented as meta-paths. We generate pseudo labels using the seed words and train a logistic regression classifier, with the document node representations as input, to predict the labels.
We denote our framework with the HAN classifier as META, with a CNN classifier as META-CNN, and with a BERT (bert-base-uncased) classifier as META-BERT. We also compare with their respective ablated versions, META-NoMeta, META-CNN-NoMeta, and META-BERT-NoMeta, where metadata information is neither expanded nor considered while generating pseudo labels.
For a fair comparison, we also present results of all the baselines on the metadata-augmented datasets, where a token for every relevant motif instance is appended to the text data of a document. This is denoted by ++ in Table 2, e.g., WeSTClass++ represents the performance of WeSTClass on metadata-augmented datasets.
We also present the performance of HAN in a fully supervised setting, denoted as HAN-Sup. The HAN-Sup results are reported on the test set of an 80-10-10 train-dev-test split.

Performance Comparison
The evaluation results of all methods are summarized in Table 2. We can observe that our proposed framework outperforms all the compared weakly supervised methods. We discuss the effectiveness of META as follows:
• META achieves the best performance among all the compared weakly supervised methods, with significant margins. By extracting highly label-indicative motif instances along with words and using them together in pseudo-label generation, META successfully leverages metadata information and achieves superior performance.
• The performance of META is better than all the compared weakly supervised models even on the metadata-augmented datasets. By comparing the ++ methods with their text-only counterparts, one can easily observe that adding metadata to text classification is indeed helpful. However, META is not restricted to single metadata types; it goes beyond them by employing motif patterns to capture the metadata information. It successfully identifies the appropriate label-indicative metadata combinations and therefore achieves even better performance.
• The comparison between META and Metapath2Vec demonstrates the advantages of motif patterns over meta-paths. For example, on the Book Graph dataset, the last three motif patterns in Figure 5(b) cannot be represented through meta-paths, and this significantly affects performance. It is also worth mentioning that Metapath2Vec cannot handle new documents without re-training the embeddings, whereas our framework can predict directly without any additional effort.
• The comparison between META and the ablation method META-NoMeta demonstrates the effectiveness of our motif instance expansion. For example, on the Book Graph dataset, the motif instance expansion improves the micro-F1 score from 0.58 to 0.62 and the macro-F1 score from 0.58 to 0.63, which is quite significant.
• The comparison between META and HAN-Sup demonstrates that META is effective in decreasing the gap between the performance of the weakly supervised and supervised settings.

Parameter Study
The only hyper-parameter in our framework META is T, the number of iterations. We experiment on both datasets to study the effect of the number of iterations on performance. The micro-F1 and macro-F1 scores with respect to the number of iterations are plotted in Figure 6. We observe that performance increases initially and gradually converges by 6 or 7 iterations, at which point the expanded seed words and seed motifs remain almost unchanged. While there is some fluctuation, a reasonably large T, such as T = 9 or T = 10, is recommended.
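The loop above can be sketched as follows. This is a hedged outline only: `score_fn`, the 0.5 cutoff, and the toy data are illustrative stand-ins for the paper's unified word/motif ranking, not its exact procedure.

```python
def bootstrap(candidates, seeds, score_fn, T=10):
    """Sketch of META's iterative loop: expand each label's seed set with
    high-scoring candidates and stop early once the sets stabilize.
    score_fn and the 0.5 cutoff are illustrative assumptions."""
    for _ in range(T):
        expanded = {
            label: seed_set | {c for c in candidates if score_fn(c, label) > 0.5}
            for label, seed_set in seeds.items()
        }
        if expanded == seeds:  # seed sets unchanged -> converged
            break
        seeds = expanded
    return seeds

# Toy run: a fixed scorer converges after a single expansion round.
score = lambda c, label: 1.0 if c[0] == label[0] else 0.0
out = bootstrap({"comic", "panel", "poem"}, {"comics": set(), "poetry": set()}, score)
```

With a fixed scorer the expansion stabilizes after one round, mirroring the observation that the seed sets stop changing once the iterations converge.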

Number of Seed Words
We vary the number of seed words per class and plot the performance in Figure 7. We observe that the performance increases as the number of seed words increases, which is intuitive. We find that three seed words per class are sufficient for reasonable performance.

Case Study
We present case studies to showcase the effectiveness of our framework in addressing the challenges of leveraging metadata.

Leveraging Metadata Combinations. Table 3 shows a few samples of expanded motif instances. First, consider the motif instances related to authors and publishers. We can observe that strongly label-indicative authors and publishers are mined accurately. For example, Marvel, a widely known comics publisher, is present in the expanded publishers for the comics genre, and the classic American poet E. Dickinson is successfully identified as label-indicative for the poetry genre. Note that the author N. Gaiman (in blue), who has written books in multiple genres including comic books and graphic novels, is not a label-indicative author for any single category, because he is not exclusive to any one of them; this is accurately captured by our framework. However, his works in various genres, together with their respective publisher information, form unique label-indicative patterns, as reflected by the "Author-Publisher" motif pattern. Bringing the year metadata into the loop, although "Year-Document" is a user-provided motif pattern, META identifies that year information alone is not very helpful for classification. This demonstrates the robustness of our framework when users provide irrelevant motif patterns. However, combining author information with year yields more accurate semantics; for instance, we discover that N. Gaiman authored more children's books in the early 2000s, making that combination highly label-indicative.

Eliminating Noise in Metadata. Table 4 presents, for every label, the percentage of motif instances expanded out of the total motif instances following a motif pattern. One can observe that META prunes out many motif instances, as the final selection ratio is far below 100%. For the "Year-Document" motif pattern, we observe that its motif instances are expanded only for a few genres, which is generally intuitive.
For example, one can see that a significant percentage of "Year-Document" motif instances are expanded for history and poetry. After a closer inspection, we find that the expanded years are concentrated between the late 1800s and early 1900s, developing an affinity for this time period.
One can also observe that the percentage of expanded motif instances following the "Publisher-Document" motif pattern varies across labels, ranging from 1% to 13.5%. This illustrates that our expansion is adaptive. Seed Word Expansion. Figure 7 shows the number of seed words expanded after each iteration for the comics, history, and mystery classes in the Books dataset. We observe that the number varies across labels because our data-driven thresholds adapt to each label.
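The adaptive, per-label thresholding can be illustrated as follows. The mean-plus-one-standard-deviation rule and the toy scores below are assumptions chosen for illustration; the paper's actual data-driven threshold may differ.

```python
import statistics

def adaptive_expand(scores_by_label):
    """Illustrative per-label thresholding (an assumption, not the paper's
    exact rule): for each label, keep candidates whose indicativeness score
    exceeds that label's mean by one standard deviation, so every label
    gets its own data-driven cutoff."""
    expanded = {}
    for label, scores in scores_by_label.items():
        mu = statistics.mean(scores.values())
        sigma = statistics.pstdev(scores.values())
        expanded[label] = {c for c, s in scores.items() if s > mu + sigma}
    return expanded

# Hypothetical candidate scores for two labels.
scores = {
    "comics":  {"marvel": 0.9, "panel": 0.4, "ink": 0.3, "page": 0.2},
    "history": {"war": 0.6, "era": 0.55, "year": 0.5, "king": 0.45},
}
result = adaptive_expand(scores)
```

Because the cutoff is computed per label from that label's own score distribution, the fraction of candidates kept naturally varies from one label to another, matching the varying expansion ratios in Table 4.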
One can also observe that the number increases over iterations and nearly stagnates at the end, indicating that the seed sets are being refined and converging. A few examples of expanded seed words are shown in Table 5.

Related Work
We review the literature about (1) weakly supervised text classification methods, (2) text classification with metadata, and (3) document classifiers.

Weakly Supervised Text Classification
Due to the training data bottleneck in supervised classification, weakly supervised classification has recently attracted much attention from researchers. The majority of weakly supervised classification techniques require seeds in various forms, including label surface names (Li et al., 2018; Song and Roth, 2014; Tao et al., 2015), label-indicative words (Chang et al., 2008; Tao et al., 2015; Mekala and Shang, 2020), and labeled documents (Tang et al., 2015b; Xu et al., 2017; Miyato et al., 2016). Dataless (Song and Roth, 2014) considers label surface names as seeds and classifies documents by embedding both labels and documents in a semantic space and computing the semantic similarity between a document and a potential label. Along similar lines, Doc2Cube (Tao et al., 2015) expands label-indicative words using label surface names and performs multi-dimensional document classification by learning dimension-aware embeddings. WeSTClass considers both word-level and document-level supervision sources: it first generates bag-of-words pseudo documents for neural model pre-training, then bootstraps the model on unlabeled data. This method is later extended to a hierarchical setting with a pre-defined hierarchy (Meng et al., 2019). ConWea (Mekala and Shang, 2020) leverages contextualized representation techniques to provide contextualized weak supervision for text classification.
However, all these techniques consider only the text data and do not leverage metadata information for classification. In this paper, we start from user-provided seed words and mine label-indicative words and metadata in an iterative manner.

Text Classification with Metadata
Previous studies have tried to incorporate metadata information to improve classifier performance. Tang et al. (2015a) and related work consider user and product information as metadata for document-level sentiment classification; Rosen-Zvi et al. (2012) use author information for paper classification; other work employs user biography data for tweet localization. However, all these frameworks operate in a supervised setting and use fixed metadata types for each task, whereas our method generalizes to different metadata types and multiple metadata combinations.
Another way to leverage metadata for text understanding is to organize the corpus into a heterogeneous information network. A straightforward approach is to obtain document representations from their respective meta-path guided node embeddings (Dong et al., 2017; Shang et al., 2016) and train a classifier. However, meta-paths cannot capture higher-order connectivity, and this approach cannot handle new documents directly without re-training the embeddings. Recently, a minimally supervised framework was proposed to categorize text with metadata; however, it requires labeled documents as supervision and considers only typed edges. Network motifs (Milo et al., 2002) can capture higher-order connectivity and have proved fundamental in complex real-world networks across various domains (Benson et al., 2016); motifs have also been leveraged for topic taxonomy construction in an unsupervised setting. Our proposed method mines highly label-indicative metadata information with a unified motif and word ranking framework, and effectively expands weak supervision to improve document classification.

Document classifier
Document classification is a long-studied problem in Natural Language Processing. CNN-based classifiers (Kim, 2014; Johnson and Zhang, 2014; Lai et al., 2015) and RNN-based classifiers (Socher et al., 2013) achieve competitive performance. Yang et al. (2016) proposed the Hierarchical Attention Network (HAN) for document classification, which applies attention first over the words in each sentence and then over the sentences in the document to find the most important words and sentences. Though our framework uses HAN as the document classifier, it is also compatible with all the above-mentioned text classifiers; we choose HAN for demonstration purposes.
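The two-level attention pooling at the heart of HAN can be sketched in a few lines. This is a simplified illustration under stated assumptions: the real model's GRU encoders and MLP projections before each attention layer are omitted, and the random vectors stand in for learned embeddings and context vectors.

```python
import numpy as np

def attention_pool(H, ctx):
    """Softmax-attention-weighted average of the row vectors in H,
    scored against a (normally learned) context vector ctx."""
    scores = H @ ctx
    alpha = np.exp(scores - scores.max())  # numerically stable softmax
    alpha /= alpha.sum()
    return alpha @ H, alpha

# Two sentences of word vectors (dimensions are arbitrary toy values).
rng = np.random.default_rng(0)
doc = [rng.normal(size=(5, 8)), rng.normal(size=(3, 8))]
w_word, w_sent = rng.normal(size=8), rng.normal(size=8)

# Level 1: pool word vectors into one vector per sentence.
sent_vecs = np.stack([attention_pool(sent, w_word)[0] for sent in doc])
# Level 2: pool sentence vectors into a single document vector.
doc_vec, sent_alpha = attention_pool(sent_vecs, w_sent)
```

The attention weights (`sent_alpha` here) are what make the model interpretable: they indicate which sentences, and within them which words, contributed most to the document representation fed to the classifier.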

Conclusion and Future Work
In this paper, we propose META, a novel framework that leverages metadata information as an additional source of weak supervision and incorporates it into the classification framework. Our method organizes the text data and metadata together into a text-rich network and employs motif patterns to capture appropriate metadata combinations. Starting from user-provided seed words and motif patterns, our method generates pseudo labels, trains the classifier, ranks and filters highly label-indicative words and motifs in a unified manner, and adds them to their respective seed sets. Experimental results and case studies demonstrate that our model significantly outperforms previous methods, underscoring the advantages of leveraging metadata as weak supervision.
In the future, we are interested in effectively integrating different forms of supervision, including annotated documents. In addition, we currently consider only positively label-indicative metadata combinations; negatively label-indicative combinations, which could eliminate some classes from the set of potential labels, likely exist as well and are another promising direction for extending our method.