A Deep Relevance Model for Zero-Shot Document Filtering

In the era of big data, focused analysis of diverse topics within a short response time is an urgent demand, and information filtering therefore becomes a critical necessity. In this paper, we propose a novel deep relevance model for zero-shot document filtering, named DAZER. DAZER estimates the relevance between a document and a category by taking as input a small set of seed words relevant to the category. With pre-trained word embeddings from a large external corpus, DAZER is devised to extract relevance signals by modeling the hidden feature interactions in the word embedding space. The relevance signals are extracted through a gated convolutional process, where the gate mechanism controls which convolution filters output the relevance signals in a category-dependent manner. Experiments on two document collections covering two different tasks (i.e., topic categorization and sentiment analysis) demonstrate that DAZER significantly outperforms the existing alternative solutions, including the state-of-the-art deep relevance ranking models.


Introduction
Filtering irrelevant information and organizing relevant information into meaningful topical categories is indispensable and ubiquitous. For example, a data analyst tracking an emerging event would like to retrieve the documents relevant to a specific topic (category) from a large document collection within a short response time. In the era of big data, the possible categories covered by documents are limitless, and it is unrealistic to manually identify many positive examples for each possible category. However, new information needs do emerge in many real-world scenarios. Recent studies on dataless text classification show promising results on reducing labeling effort (Liu et al., 2004; Druck et al., 2008; Chang et al., 2008; Song and Roth, 2014; Hingmire et al., 2013; Hingmire and Chakraborti, 2014; Chen et al., 2015; Li et al., 2016). Without any labeled document, a dataless classifier performs text classification by using a small set of relevant words for each category (called "seed words"). However, existing dataless classifiers do not consider document filtering: they require seed words for every category covered by the document collection, which is often infeasible in the real world.
To this end, we are particularly interested in the task of zero-shot document filtering. Here, zero-shot means that instances of the targeted categories are unseen during the training phase. To facilitate zero-shot filtering, we take a small set of seed words to represent a category of interest. This is extremely useful when the information need (i.e., the categories of interest) is dynamic and the text collection is large and continuously updated (e.g., when the possible categories are hard to know in advance). Specifically, we propose a novel deep relevance model for zero-shot document filtering, named DAZER. In DAZER, we use word embeddings learnt from a large external text corpus to represent each word. A category can then also be well represented in the embedding space (called the category embedding) through a composition of the word embeddings of the provided seed words. Given a small number of seed words for a category as input, DAZER is devised to produce a score indicating the relevance between a document and the category. It is intuitive to connect zero-shot document filtering with the task of ad-hoc retrieval. Indeed, by treating the seed words of each category as a query, zero-shot document filtering is equivalent to ranking documents based on their relevance to the query. Relevance ranking is a core task in information retrieval and has been studied for many years. Although they share the same formulation, these two tasks diverge fundamentally. For ad-hoc retrieval, a user constructs a query with a specific information need, and the relevant documents are assumed to contain the query words. This is confirmed by existing work showing that exact keyword match is still the most important relevance signal in ad-hoc retrieval (Fang and Zhai, 2006; Wu et al., 2007; Eickhoff et al., 2015; Guo et al., 2016a,b).
For document filtering, the seed words for a category are expected to convey the conceptual meaning of the latter; it is impossible to list all the words needed to fully cover the relevant documents of a category. Therefore, it is essential to capture conceptual relevance for zero-shot document filtering. Classical retrieval models simply estimate relevance based on query keyword matching, which is far from capturing conceptual relevance. Existing deep relevance models for ad-hoc retrieval utilize statistics of hard/soft-match signals in terms of the cosine similarity between two word embeddings (Guo et al., 2016a; Xiong et al., 2017). However, scalar information such as the cosine similarity between two embedding vectors is too coarse to reflect conceptual relevance. In contrast, we believe that the embedding features themselves can provide rich knowledge towards conceptual relevance.
A key challenge is to endow DAZER with a strong generalization ability so that it can also successfully extract relevance signals for unseen categories. To achieve this, we extract relevance signals based on the hidden feature interactions between the category and each word in the embedding space. Specifically, two element-wise operations are utilized in DAZER: element-wise subtraction and element-wise product. Since these two kinds of interactions represent relative information encoded in the hidden embedding space, we expect the relevance signal extraction process to generalize well to unseen categories. First, DAZER utilizes a gated convolutional operation with k-max pooling to extract relevance signals. Then, DAZER abstracts higher-level relevance features through a multi-layer perceptron, which can be considered a relevance aggregation procedure. Finally, DAZER calculates an overall score indicating the relevance between a document and the category. Without further constraints, DAZER could encode a bias towards the category-specific features seen during training (i.e., the model could overfit). Therefore, we further introduce adversarial learning over the output of the relevance aggregation procedure. The purpose is to ensure that the higher-level relevance features contain no category-dependent information, leading to better zero-shot filtering performance.
To the best of our knowledge, DAZER is the first deep model for zero-shot document filtering. We conduct extensive experiments on two real-world document collections from two different domains (i.e., 20-Newsgroup for topic categorization and Movie Review for sentiment analysis). Our experimental results suggest that DAZER achieves promising filtering performance and performs significantly better than the existing alternative solutions, including state-of-the-art deep relevance ranking methods.
Deep Zero-Shot Document Filtering

Figure 1 illustrates the network structure of the proposed DAZER model. It consists of two main components: relevance signal extraction and relevance aggregation. In the following, we present each component in detail.

Relevance Signal Extraction
Given a document d = (w_1, w_2, ..., w_|d|) and a set of seed words S_c = {s_{c,i}} for category c, we first map each word w into its dense word embedding representation e_w ∈ R^{l_e}, where l_e denotes the embedding dimension. The embeddings are pre-trained with a representation learning method on a large external text corpus. Since our aim is to capture the conceptual relevance, we simply take the averaged embedding of the seed words to represent a category in the embedding space: c_c = (1 / |S_c|) Σ_{s ∈ S_c} e_s.
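As a minimal sketch of the averaging step (the seed words and toy 4-dimensional vectors below are illustrative, not from the paper; the paper uses 300-d GloVe embeddings):

```python
import numpy as np

# Hypothetical pre-trained embeddings with l_e = 4, for illustration only.
embeddings = {
    "orbit":  np.array([0.9, 0.1, 0.0, 0.2]),
    "launch": np.array([0.7, 0.3, 0.1, 0.1]),
    "nasa":   np.array([0.8, 0.2, 0.2, 0.3]),
}

def category_embedding(seed_words, emb):
    """c_c = (1 / |S_c|) * sum_{s in S_c} e_s."""
    return np.mean([emb[w] for w in seed_words], axis=0)

c_space = category_embedding(["orbit", "launch", "nasa"], embeddings)
```

Because the category is a simple mean, adding or removing a seed word shifts c_c smoothly in the embedding space, which keeps the representation cheap to recompute for a new category.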
Interaction-based Representation. It is widely recognized that word embeddings are useful because both syntactic and semantic information of words is well encoded (Mikolov et al., 2013; Pennington et al., 2014). The element-wise hidden feature difference is a kind of relative information that captures the offset between a word and a category in the embedding space. These embedding offsets encode intricate relationships between word pairs. A well-known example is: e_king − e_queen ≈ e_man − e_woman (Mikolov et al., 2013). Similar observations are made when we calculate the embedding offsets between words and categories. Table 1 lists several interesting patterns observed for the embedding offsets between a category and a word in the 20-Newsgroup dataset (see Section 3.2 for more details). We can see that the embedding offsets are consistent with a particular relation between the two category-word pairs.
An effective way to measure the relatedness of two words is the inner product or cosine similarity between their word embeddings. This can be viewed as a particular linear combination of the corresponding feature products of the two embeddings:

rel(e_1, e_2) = Σ_i g(e_1, e_2, i) · e_{1,i} · e_{2,i} = g(e_1, e_2)^T (e_1 ⊙ e_2),

where g(e_1, e_2, i) refers to the weight calculated for the i-th dimension, g(e_1, e_2) = [g(e_1, e_2, 1); ...; g(e_1, e_2, l_e)], and ⊙ is the element-wise product operation. The element-wise product of two embeddings is also a kind of relative information: the sign of the product in a specific dimension indicates whether the two embeddings share the same polarity in that dimension, and the resulting magnitude manifests to what extent this agreement or disagreement holds. It is intuitive that the element-wise product offers some kinds of semantic relations. We compute the element-wise product for each category-word pair in the 20-Newsgroup dataset, and Table 2 lists some interesting patterns we observe. The sign(x) function returns 1 when x ≥ 0 and −1 otherwise. As shown in the table, the sign pattern of the element-wise product encodes the relevance information between a category and its related words.

Table 2: Examples using the element-wise product.
sign(c_mideast ⊙ e_muslim) ≈ sign(c_med ⊙ e_doctor)
sign(c_space ⊙ e_orbit) ≈ sign(c_hockey ⊙ e_espn)
sign(c_electronics ⊙ e_circuit) ≈ sign(c_pc ⊙ e_controller)
sign(c_crypt ⊙ e_algorithm) ≈ sign(c_space ⊙ e_burning)
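These sign patterns can be computed directly; a small sketch (the vectors are made up for illustration, not taken from real embeddings):

```python
import numpy as np

def sign_pattern(c, e):
    """Return sign(c ⊙ e): +1 where the category and word embeddings agree
    in polarity on a dimension (treating 0 as positive), else -1."""
    c, e = np.asarray(c), np.asarray(e)
    return np.where(c * e >= 0, 1, -1)

# Two hypothetical category/word pairs that happen to share a sign pattern.
p1 = sign_pattern([0.5, -0.2, 0.1], [0.3, 0.4, -0.2])
p2 = sign_pattern([0.7, 0.1, -0.3], [0.2, -0.5, 0.6])
```

Comparing `p1` and `p2` element-wise mirrors the "≈" relation in Table 2: two pairs are related when their sign vectors largely coincide.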
Inspired by these observations, we use these two kinds of element-wise interactions to complement the representation of a word in a document. Specifically, for each word w in document d, we derive its interaction-based representation e^c_w towards category c as follows:

e^c_w = e_w ⊕ e^{diff}_{c,w} ⊕ e^{prod}_{c,w}, with e^{diff}_{c,w} = e_w − c_c and e^{prod}_{c,w} = e_w ⊙ c_c,

where ⊕ is the vector concatenation operation. Note that these two kinds of feature interactions are largely overlooked in the existing literature. The embedding offsets are used to derive word semantic hierarchies in (Fu et al., 2014); however, no existing work incorporates these two kinds of feature interactions for relevance estimation. Here, we expect these feature interactions to magnify the relevance information regarding the category.
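A sketch of the interaction-based representation, assuming the concatenation order e_w ⊕ (e_w − c_c) ⊕ (e_w ⊙ c_c); this triples the dimensionality to 3·l_e, which matches the input size of the convolution weights below:

```python
import numpy as np

def interaction_repr(e_w, c):
    """e^c_w = e_w ⊕ e^diff ⊕ e^prod, with
    e^diff = e_w - c  (embedding offset) and
    e^prod = e_w ⊙ c  (element-wise product)."""
    e_w, c = np.asarray(e_w, float), np.asarray(c, float)
    return np.concatenate([e_w, e_w - c, e_w * c])

# Toy 2-d word and category vectors, for illustration only.
e_cw = interaction_repr([1.0, 2.0], [0.5, -1.0])
```

Both interaction terms are relative to c_c, so the same word produces different representations under different categories, which is exactly what the category-dependent filtering needs.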
Convolution with k-max Pooling. We utilize m convolution filters to extract the relevance signals for each word based on its local window of size l in the document. Specifically, after calculating the interaction-based representation d = (e^c_1, e^c_2, ..., e^c_{|d|}) for document d and category c, we apply the convolution operation as follows:

r_i = W_1 e^c_{i−l:i+l} + b_1,

where r_i ∈ R^m holds the hidden features regarding the relevance signal extracted for the i-th word, W_1 ∈ R^{m×3l_e(2l+1)} and b_1 ∈ R^m are the weight matrix and the corresponding bias vector respectively, and e^c_{i−l:i+l} refers to the concatenation of e^c_{i−l} through e^c_{i+l}. l zero vectors are padded to both the beginning and the end of the document. With a local window of size l, the convolution operation can extract more accurate relevance information by taking consecutive words (e.g., phrases) into account. We then apply a k-max pooling strategy to obtain the k most active features for each filter. Let r^j_{k-max} denote the k largest values for filter j; we form the overall relevance signals r_d extracted by all m filters through the concatenation:

r_d = r^1_{k-max} ⊕ r^2_{k-max} ⊕ ... ⊕ r^m_{k-max}.

Category-specific Gating Mechanism. Given a specific word w, the interaction-based representation e^c_w can be very different for each category c. Therefore, for a specific local context, the relevance signal extracted by a particular convolution filter can also differ across categories. It is then reasonable to assume that the relevance signals for a specific category are captured by a subset of filters. We propose to identify which filters are relevant to a category through a category-specific gating mechanism. Given category c, the category-specific gates a_c ∈ R^m are calculated as follows:

a_c = σ(W_2 c̃_c + b_2),

where c̃_c ∈ R^{3l_e} is the category representation fed to the gates, W_2 ∈ R^{m×3l_e} and b_2 ∈ R^m are the weight matrix and bias vector respectively, and σ(·) is the sigmoid function. With the category-specific gating mechanism, the convolution operation above can be rewritten as:

r_i = a_c ⊙ (W_1 e^c_{i−l:i+l} + b_1).

Here, a_c works as on-off switches for the m filters.
While a_{c,j} → 1 indicates that the j-th filter should be turned on to fully capture the relevance signals under category c, a_{c,j} → 0 indicates that the filter is turned off due to its irrelevance. This collaboration of the convolution operation and gating mechanism is similar to the Gated Linear Units (GLU) recently proposed in (Dauphin et al., 2017). Given an input X, GLU calculates the output as follows:

h(X) = (XW + b) ⊙ σ(XV + c),

where the first term on the right-hand side is the convolution operation and the second is the gating mechanism. In GLU, both the convolution operation and the gates share the same input X. In contrast, in this work, we aim to identify which filters capture the relevance signals in a category-dependent manner. The experimental results validate that this category-dependent setting brings a significant benefit to zero-shot filtering performance (see Section 3).
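The gated convolution with k-max pooling described above can be sketched in NumPy as follows. The weights are random toy values, and `c_tilde` stands in for the 3·l_e-dimensional category input to the gates, whose exact composition we treat as an assumption here:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_conv_kmax(E, W1, b1, W2, b2, c_tilde, l=1, k=2):
    """E: (|d|, 3*l_e) interaction-based representations of document words.
    Returns the concatenated k-max relevance signals, shape (m*k,)."""
    n, dim = E.shape
    m = b1.shape[0]
    a_c = sigmoid(W2 @ c_tilde + b2)                  # category-specific gates
    Ep = np.vstack([np.zeros((l, dim)), E, np.zeros((l, dim))])  # zero padding
    R = np.empty((n, m))
    for i in range(n):
        window = Ep[i:i + 2 * l + 1].reshape(-1)      # e^c_{i-l:i+l}
        R[i] = a_c * (W1 @ window + b1)               # r_i = a_c ⊙ (W1 e + b1)
    kmax = np.sort(R, axis=0)[-k:]                    # k largest values per filter
    return kmax.T.reshape(-1)                         # r_d: filter-wise concat

rng = np.random.default_rng(0)
le3, m, l, k = 6, 4, 1, 2                             # toy sizes (3*l_e = 6)
E = rng.standard_normal((5, le3))                     # a 5-word "document"
W1 = rng.standard_normal((m, le3 * (2 * l + 1)))
W2 = rng.standard_normal((m, le3))
b1, b2 = np.zeros(m), np.zeros(m)
c_tilde = rng.standard_normal(le3)
r_d = gated_conv_kmax(E, W1, b1, W2, b2, c_tilde, l=l, k=k)
```

Because `a_c` depends only on the category, the same gate vector is applied at every position, softly selecting which of the m filters participate for that category.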

Relevance Aggregation
The raw relevance signals r_{c,d} are somewhat category-dependent, since the relevant filters are category-dependent. The hidden features regarding the relevance are distilled through a fully-connected hidden layer with nonlinearity:

h_{c,d} = g_a(W_3 r_{c,d} + b_3),

where W_3 ∈ R^{l_a×3km} and b_3 ∈ R^{l_a} are the weight matrix and bias vector respectively, and g_a(·) is the tanh function. This procedure can be considered a relevance aggregation process. The overall relevance score is then estimated as follows:

rel(c, d) = w^T h_{c,d} + b,

where w ∈ R^{l_a} and b are the weight vector and bias respectively.
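The aggregation step is a tanh layer followed by a linear scoring head; a sketch with toy shapes (the paper uses l_a = 75, and the input size follows whatever the pooled signals produce):

```python
import numpy as np

def relevance_score(r, W3, b3, w, b):
    """h_{c,d} = tanh(W3 r + b3); score = w·h + b."""
    h = np.tanh(W3 @ r + b3)
    return float(w @ h + b), h

rng = np.random.default_rng(1)
in_dim, la = 12, 5                      # toy sizes for illustration only
r = rng.standard_normal(in_dim)         # pooled relevance signals
W3, b3 = rng.standard_normal((la, in_dim)), np.zeros(la)
w, b = rng.standard_normal(la), 0.0
score, h = relevance_score(r, W3, b3, w, b)
```

The tanh squashes the aggregated features into [-1, 1] before the single linear unit produces the scalar relevance score used for ranking.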

Model Training
Adversarial Learning. The hidden features h_{c,d} are expected to be category-independent. However, there is no guarantee that category-specific information is not mixed into the relevance information extracted in h_{c,d}. Here, we introduce an adversarial learning mechanism to ensure that no category-specific information is memorized during training; otherwise, the proposed DAZER may not generalize well to unseen categories. Specifically, we introduce a category classifier over h_{c,d} to calculate the probability that h_{c,d} belongs to each category seen during training:

p_cat(·|h_{c,d}) = softmax(W_4 h_{c,d} + b_4),

where W_4 ∈ R^{C×l_a} and b_4 ∈ R^C are the weight matrix and bias vector of the classifier, and C is the number of categories covered by the training set. We aim to optimize the parameters φ = {W_4, b_4} to successfully classify h_{c,d} into its true category, i.e., to minimize the negative log-likelihood:

L_cat = − Σ_{(d,y)∈T} log p_cat(y|h_{y,d}),

where T denotes the training set {(d, y)} such that document d is relevant to category y, and θ denotes the parameters involved in the calculation of h_{c,d}. On the other hand, we expect h_{c,d} to carry no category-specific information, such that the classifier cannot perform the category classification precisely. Hence, we add a Gradient Reversal Layer (GRL) (Ganin and Lempitsky, 2015; Ganin et al., 2016) between h_{c,d} and the category classifier. We can consider the GRL as a pseudo-function R_λ(x) with R_λ(x) = x in the forward pass and dR_λ/dx = −λI in the backward pass. This means that θ is optimized to make h_{c,d} indistinguishable by the classifier, where the parameter λ controls the importance of the adversarial learning.

Since DAZER is devised to return a relevance score, we utilize the pairwise margin loss for model training:

L_rank = Σ_{(d,y)∈T} max(0, Δ − rel(c_y, d) + rel(c_y, d⁻_y)),

where document d⁻_y is a negative sample for category y, Δ is the margin (set to 1 in this work), and δ = {w, b}. Overall, the proposed DAZER is an end-to-end neural network model. The parameters Θ = {θ, φ, δ} are optimized via back propagation and stochastic gradient descent.
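Two pieces of the training objective can be sketched directly. The gradient reversal layer is an identity in the forward pass and multiplies incoming gradients by −λ in the backward pass; the ranking objective is a standard pairwise hinge loss with margin Δ = 1 (this is a generic sketch, not the authors' implementation):

```python
import numpy as np

def grl_forward(x):
    """GRL forward pass: R_lambda(x) = x (identity)."""
    return x

def grl_backward(grad_output, lam):
    """GRL backward pass: dR_lambda/dx = -lam * I, so the feature
    extractor is pushed to *confuse* the category classifier."""
    return -lam * np.asarray(grad_output)

def pairwise_margin_loss(s_pos, s_neg, delta=1.0):
    """max(0, delta - s(d, y) + s(d^-_y, y)) for one positive/negative pair."""
    return max(0.0, delta - s_pos + s_neg)

loss_easy = pairwise_margin_loss(2.0, 0.5)   # margin already satisfied -> 0
loss_hard = pairwise_margin_loss(0.5, 0.2)   # violates the margin -> 0.7
```

In a full framework (e.g., with autograd) the GRL would be a custom op; the two functions above only make the forward/backward contract explicit.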
Specifically, we utilize the Adam (Kingma and Ba, 2014) algorithm for parameter updates over mini-batches. The final objective used in training is:

L(Θ) = L_rank + L_cat + λ_Θ ‖Θ‖²,

where λ_Θ controls the importance of the regularization term.

Experiment
In this section, we conduct experiments on two real-world document collections to evaluate the effectiveness of the proposed DAZER. (The implementation is available at https://github.com/WHUIR/DAZER.)

Existing Alternative Methods
Here, we compare the proposed DAZER against the following alternative solutions.
BM25 Model: BM25 is a widely known retrieval model based on keyword matching (Robertson and Walker, 1994). The default parameter setting is used in the experiments.
DSSM: DSSM utilizes a multi-layer perceptron to extract hidden representations for both the document and the query (Huang et al., 2013). The cosine similarity between the two representation vectors is then calculated as the relevance score. Since we use pre-trained word embeddings from a large text corpus, we replace the letter-trigram representation with the word embedding representation. We use the network setting recommended by its authors.
DRMM: DRMM calculates the relevance based on the histogram information of the semantic relatedness scores between each word in the document and each query word (Guo et al., 2016a). The recommended network setting (i.e., LCH×IDF) and parameter setting are used.
K-NRM: K-NRM is a kernel based neural model for relevance ranking based on word-level hard/soft matching signals (Xiong et al., 2017). We use the recommended setting as in their paper.
DeepRank: DeepRank is a neural relevance ranking model based on the query-centric context (Pang et al., 2017). The recommended setting is used for evaluation.

Seed-based Support Vector Machines (SSVM):
We build a seed-driven training set by labeling a training document with a category if the document contains any seed word of that category. Then, we adopt a one-class SVM implemented by scikit-learn for document filtering. The optimal performance is reported after tuning the hyper-parameters.

Datasets and Experimental Setup
20-Newsgroup (20NG) is a widely used benchmark for document classification research (Li et al., 2016). It consists of approximately 20K newsgroup articles from 20 different categories. The bydate version with 18,846 documents is used here. As provided, the training set and test set contain 60% and 40% of the documents respectively.
Movie Review is a collection of movie reviews in English (Pang and Lee, 2005). The scale dataset v1.0 is used in the experiments. Based on the numerical ratings, we split the reviews into five sentiment labels: very negative, negative, neutral, positive, and very positive, containing 167, 1030, 1786, 1682, and 341 reviews respectively. For each sentiment label, we randomly split the reviews into a training set (80%) and a test set (20%).

Since our work targets zero-shot document filtering for unseen categories, word embeddings pre-trained by GloVe over a large text corpus of 840 billion tokens are used across all methods and both datasets. The dimension of the word embeddings is l_e = 300, and no further fine-tuning of the embeddings is applied. For both datasets, stop words are removed first, then all words are converted to lowercase. We further remove words that are not covered by the GloVe vocabulary.
Evaluation Protocol. With the specified unseen categories, we take all the training documents of the other categories to train a model. Then, all documents in the test set are used for evaluation. For each unseen category, the task is to rank the documents of that category higher than the others. We report mean average precision (MAP), a widely used metric of ranking quality: the higher the relevant documents are ranked, the larger the MAP value, indicating better filtering performance. For all neural-network-based models, the training documents from one randomly sampled training category serve as the validation set for early stopping. We report averaged results over 5 runs for all methods (excluding SSVM and BM25). Statistical significance is assessed with Student's t-test.
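For reference, MAP over binary relevance judgments can be computed as below (the ranked lists are toy examples, not experimental data):

```python
def average_precision(ranked_rel):
    """AP for one ranked list of 0/1 relevance labels:
    mean of precision@i over the positions i of relevant documents."""
    hits, total = 0, 0.0
    for i, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            total += hits / i          # precision at this relevant hit
    return total / hits if hits else 0.0

def mean_average_precision(runs):
    """MAP: mean of AP over several ranked lists (one per task/category)."""
    return sum(average_precision(r) for r in runs) / len(runs)

ap = average_precision([1, 0, 1, 0])   # (1/1 + 2/3) / 2 = 5/6
```

Because every position of a relevant document contributes its precision, pushing relevant documents toward the top of the ranking directly increases AP, and hence MAP.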
Seed Word Selection. For the 20NG dataset, we directly use the seed words manually compiled in (Song and Roth, 2014), which are available at https://github.com/WHUIR/STM. These seed words are selected from the category descriptions and widely used in work on dataless text classification (Song and Roth, 2014; Chen et al., 2015; Li et al., 2016). For Movie Review, following the seed word selection process (i.e., assisted by standard LDA) proposed in (Chen et al., 2015), we manually select the seed words for each sentiment label; Table 3 lists the seed words selected for each sentiment label. There are on average 5.2 and 4.6 seed words per category for 20NG and Movie Review respectively. It is worthwhile to highlight that no category information is exploited within the seed word selection process.
Parameter Setting. For DAZER, the number of convolution filters is m = 50 and k = 3 is used for k-max pooling. The dimension size for relevance aggregation is l a = 75. The local window size l is set to be 2. The learning rate is 0.00001. The models are trained with a batch size of 16 and λ Θ = 0.0001, λ = 0.1.

Performance Comparison
For the 20NG dataset, we randomly create 9 document filtering tasks which cover 10 out of the 20 categories. For Movie Review, we take each sentiment label as an unseen category for evaluation. Table 4 lists the performance of the 7 methods in terms of MAP for these filtering tasks. Here, we make the following observations.

First, the proposed DAZER achieves significantly better filtering performance on all 14 tasks across the two datasets. The averaged MAP of DAZER over these 14 filtering tasks is 0.671. Note that only 5.2 and 4.6 seed words are used on average for each task. The second best performer is K-NRM, which achieves the second best result on 7 tasks. Overall, the averaged performance gain of DAZER over K-NRM is about 30.8%.

Second, we observe that DSSM performs significantly better for sentiment analysis than for topic categorization. As discussed in Section 4, DSSM is designed to perform semantic matching, and compared with topic categorization, sentiment analysis is more of a semantic matching task. SSVM delivers the worst performance on both datasets, which illustrates that the quality of the labeled documents is essential for supervised learning techniques: recruiting training documents with the provided seed words in a simple fashion is error-prone. We also note that BM25 achieves inconsistent performance over the two kinds of tasks; it performs especially poorly for sentiment analysis. This is reasonable because there are many diverse ways to express a specific sentiment, and it is hard to cover a reasonable proportion of documents with a limited number of sentimental seed words. In comparison, the proposed DAZER obtains consistent performance for both topic categorization and sentiment analysis.

Table 4: Performance of the 7 methods for zero-shot document filtering in terms of MAP. The best and second best results are highlighted in boldface and underlined respectively, on each task. † indicates that the difference to the best result is statistically significant at the 0.05 level. Avg: averaged MAP over all tasks.

Analysis of DAZER
Component Setting. Here, we further discuss the impact of different component settings of DAZER on both the 20NG and Movie Review datasets. Tables 5 and 6 report the impact of each component setting via an ablation test on the two datasets respectively. We can see that each component brings a significantly positive benefit for document filtering. First, either element-wise subtraction or element-wise product contributes significantly to the performance improvement. Specifically, from Table 6, we can see that the element-wise subtraction and element-wise product contribute roughly equally on the Movie Review dataset. On the other hand, DAZER experiences a much larger performance degradation on the 20NG dataset. For example, a MAP of only 0.154 is achieved when e^{prod}_{c,w} is excluded from DAZER for the filtering task space. An even more severe case is the filtering task baseball-hockey: by excluding e^{prod}_{c,w}, the MAP of DAZER drops from 0.782 to 0.045. That is, the element-wise product is more critical for extracting relevance signals for topic categorization. We also observe that the two hidden feature interactions together play an even more important role: without both e^{diff}_{c,w} and e^{prod}_{c,w}, DAZER only achieves a MAP of 0.126 for the filtering task space. Large performance deterioration is also observed for the other filtering tasks on the 20NG dataset.
Either adversarial learning or the category-specific gate mechanism enhances the filtering performance of DAZER, which validates the effectiveness of the two components for enhancing conceptual relevance extraction. (In Tables 5 and 6, -e^{diff}_{c,w}: no element-wise subtraction; -e^{prod}_{c,w}: no element-wise product; -Gate: no category-specific gate mechanism; -Adv: no adversarial learning.) Also, without adversarial learning, DAZER still achieves much better filtering performance than the existing baseline methods compared in Section 3.3; this observation also holds on the 20NG dataset. This further validates that the two kinds of hidden feature interactions indeed encode rich knowledge towards the conceptual relevance.
Impact of Seed Words. It has been recognized that fewer seed words incur worse document classification performance in existing dataless document classification techniques (Song and Roth, 2014; Chen et al., 2015; Li et al., 2016). Following these works, we also use the words appearing in the category names of the 20NG dataset as the corresponding seed words (available at https://github.com/WHUIR/STM), yielding on average 2.75 seed words per category. Table 7 reports the MAP performance of each method on the 20NG dataset. The experimental results show that all methods investigated in Section 3.3 experience significant performance degradation on most filtering tasks. We plan to incorporate pseudo-relevance feedback into DAZER to tackle the scarcity of seed words. One possible solution is to enrich the architecture of DAZER to allow few-shot document filtering; that is, high-confidence filtering decisions are utilized to derive more seed words for better filtering performance.

Related Work

Document filtering is the task of separating relevant documents from irrelevant ones for a specific topic (Robertson and Soboroff, 2002; Nanas et al., 2010; Gao et al., 2015; Proskurnia et al., 2017). Both ranking- and classification-based solutions have been developed (Harman, 1994; Robertson and Soboroff, 2002; Soboroff and Robertson, 2003). In earlier days, a filtering system was mainly devised to facilitate document retrieval for long-term information needs (Mostafa et al., 1997). Term-based pattern mining techniques have been widely developed to perform document filtering. A network-based topic profile is built to exploit term correlation patterns for document filtering (Nanas et al., 2010). Frequent term patterns in terms of fine-grained hidden topics are proposed in (Gao et al., 2015) for document filtering.
Very recently, frequent term patterns have also been utilized to perform event-based microblog filtering (Proskurnia et al., 2017). However, these approaches are all based on supervised learning, which requires a significant number of positive documents for each topic. In the era of big data, the information space and new information needs are continuously growing, and retrieving relevant information within a short response time becomes a fundamental need. Recently, many works conduct document filtering in an entity-centric manner (Frank et al., 2012; Balog and Ramampiaro, 2013; Zhou and Chang, 2013; Reinanda et al., 2016). The task is to identify the documents relevant to a specific entity that is well defined in an external knowledge base. Specifically, Balog and Ramampiaro (2013) examine the choice of classification against ranking approaches and find that the ranking approach is more suitable for the filtering task. Following this conclusion, we formulate zero-shot document filtering as a relevance ranking task. Moreover, many information needs may not be well represented by a specific entity; hence, these entity-centric solutions are restricted to knowledge-base-related tasks.

Table 7: Performance of the 7 methods for zero-shot document filtering in terms of MAP, where the words appearing in the category name are used as the seed words. The best and second best results are highlighted in boldface and underlined respectively, on each task.
Many ad-hoc retrieval models can be used to perform zero-shot document filtering. Indeed, traditional term-based document filtering approaches utilize many term-weighting schemes developed for ad-hoc retrieval. Traditional ad-hoc retrieval models mainly estimate relevance based on keyword matching; BM25 (Robertson and Walker, 1994) can be considered the best practice in this line of literature. Recent advances in word embedding offer effective learning of word semantic relations from a large external corpus, and several neural relevance ranking models have been proposed to perform ad-hoc retrieval based on word embeddings. Both K-NRM (Xiong et al., 2017) and DRMM (Guo et al., 2016a) estimate relevance based on the macro-statistics of hard/soft-match signals in terms of the cosine similarity between two word embeddings. DeepRank (Pang et al., 2017) first measures the relevance signals from the query-centric context of each query keyword matching point through convolution operations; RNN-based networks are then adopted to aggregate these relevance signals. These works achieve significantly better retrieval performance than keyword-matching-based solutions and represent the new state-of-the-art. The relevance between a query and a document can also be considered a matching task between two pieces of text, for which many deep matching models exist, e.g., DSSM (Huang et al., 2013), ARC-II (Hu et al., 2014), MatchPyramid, and Match-SRNN (Wan et al., 2016). These models are mainly developed for specific semantic matching tasks, e.g., paraphrase identification. Therefore, information such as grammatical structure or word order is often taken into consideration, which is not applicable to seed-word-based zero-shot document filtering.

Conclusion
In this paper, we propose a novel deep relevance model for zero-shot document filtering, named DAZER. To enable DAZER to capture conceptual relevance and generalize well to unseen categories, we devise two kinds of feature interactions, a gated convolutional network, and a category-independent adversarial learning mechanism. The experimental results over two different tasks validate the superiority of the proposed model. In the future, we plan to enrich the architecture of DAZER to allow few-shot document filtering by incorporating several labeled examples.