Multimodal Topic-Enriched Auxiliary Learning for Depression Detection

From the perspective of health psychology, people who experience long-term, sustained negativity are highly likely to be diagnosed with depression. Inspired by this, we argue that the global topic information derived from user-generated content (e.g., texts and images) is crucial to boosting the performance of the depression detection task, even though this information has been neglected by almost all previous studies on depression detection. To this end, we propose a new Multimodal Topic-enriched Auxiliary Learning (MTAL) approach, aiming at capturing the topic information inside different modalities (i.e., texts and images) for depression detection. In particular, our approach proposes a modality-agnostic topic model capable of mining topical clues from either discrete textual signals or continuous visual signals. On this basis, topic modeling w.r.t. the two modalities is cast as two auxiliary tasks for improving the performance of the primary task (i.e., depression detection). Finally, a detailed evaluation demonstrates the great advantage of our MTAL approach over state-of-the-art depression detection baselines. This justifies the importance of multimodal topic information to depression detection and the effectiveness of our approach in capturing such information.


Introduction
Depression detection is the task of determining whether a person is depressed or non-depressed by automatically analyzing user-generated content (UGC). Due to its crucial role in assessing mental health, it has recently received considerable attention from several research communities, such as NLP (Shen et al., 2017) and CV (Valstar et al., 2016). These studies mainly utilize UGC (e.g., texts and images) on social media to perform depression detection and achieve promising results, since UGC instantly reflects not only the daily lives but also the mental states of users. Despite their progress, prior studies mostly focus on leveraging RNN variants (e.g., GRU) to model texts or images along the timeline posted by users (see Figure 1), and thus ignore the global topic information inside these texts and images, even though this global topic information can obviously mitigate the notorious difficulty RNN variants have with modeling long-range dependencies (Dieng et al., 2017).
More importantly, from the viewpoint of health psychology, humans plagued by negative emotion for a prolonged period of time (generally longer than two weeks), to the point of being unable to carry out daily activities, are highly likely to have depression symptoms (American Psychiatric Association and others, 2013). For instance, Figure 1 illustrates a month-long timeline of a depressed user. From this figure, we can see that this person is very likely to be depressed, since he/she discloses negative emotions lasting for almost a month. This conforms to his/her depression symptoms and indicates the importance of considering global semantics for depression detection.
Figure 1: An example containing the timeline, texts and corresponding images, posted by a depressed user on Twitter.

Inspired by the above observations, this paper hypothesizes that it is desirable to consider the global topic information inside multiple modalities (i.e., texts and images) for depression detection. Taking Figure 1 as an example again, leveraging a proper topic model to mine global clues inside the texts, e.g., the words "so scared" and "parents ... yell at me" indicating a topic w.r.t. domestic violence, can potentially assist the depression prediction. Besides, as reported in Reece and Danforth (2017), the images posted by depressed persons can be easily distinguished from those posted by healthy persons. Obviously, also leveraging a proper topic model to mine global clues inside images may more powerfully capture overall differences between depressed and non-depressed persons, thereby contributing to depression detection. However, conventional latent topic models (Blei et al., 2003; Dieng et al., 2017) usually focus on processing the text modality composed of discrete textual signals (i.e., words), under the assumption that each topic is a multinomial distribution over the vocabulary. Apparently, these topic models cannot be directly adopted to mine the topic information inside the image modality, since an image is composed of continuous visual signals, which makes the above assumption inapplicable. Therefore, an appropriate topic model should be capable of capturing not only the textual topic information inside the texts but also the visual topic information inside the images for depression detection.
To tackle the above challenges, we propose a Multimodal Topic-enriched Auxiliary Learning (MTAL) approach, which can mine both the textual and the visual topic information for depression detection. In particular, a modality-agnostic topic model is first proposed to mine topical clues from either discrete textual signals or continuous visual signals. Second, topic modeling w.r.t. the two modalities is cast as two auxiliary tasks for boosting the performance of the primary depression detection task. Third and finally, the primary task is trained alongside the two auxiliary tasks under a multi-task learning architecture. Experimentation demonstrates that the proposed MTAL approach significantly outperforms several state-of-the-art baselines, including representative textual depression detection approaches and state-of-the-art multimodal approaches.

Related Work
Depression detection is an interdisciplinary research task and has been drawing ever more attention in NLP, with a focus on extracting various types of features from the text modality (Choudhury et al., 2013; Nambisan et al., 2015). Compared with studies on the text modality, studies on multimodality (e.g., both the text and image modalities), such as Gui et al. (2019b), are much scarcer and are limited in that they neglect the topic information inside multiple modalities. In the following, we first review the depression detection task and then introduce related studies on neural topic models.
Depression Detection. The ubiquity of social media poses a great opportunity to perform depression detection. Prior studies mainly focus on identifying depressed persons by analyzing the textual information they generate on social media. Specifically, Choudhury et al. (2013) focus on differences in word usage for depression detection, while Gkotsis et al. (2016) focus on the depth of syntax-parsing trees. In recent years, researchers have begun to use multimodal information (e.g., text, speech and images) for depression detection. Specifically, Yin et al. (2019) propose a hierarchical RNN network to extract features from vision, speech and text for depression detection. Gui et al. (2019b) propose a reinforced GRU network to capture both textual and visual information for depression detection. In addition, it is worth mentioning that Resnik et al. (2015) and Shen et al. (2017) leverage the topic information for depression detection, but are limited to capturing this information inside the single text modality. Different from them, this paper aims to integrate the topic information inside multiple modalities (i.e., both texts and images).
Neural Topic Models. Traditional topic models, e.g., probabilistic latent semantic analysis (pLSA) (Hofmann, 1999) and latent Dirichlet allocation (LDA) (Blei et al., 2003), have been widely leveraged to infer a low-dimensional latent representation that captures the global semantic information of a text. Recently, based on the variational auto-encoder (VAE) architecture (Kingma and Welling, 2014), Miao et al. (2017) and Srivastava and Sutton (2017) propose neural topic models (NTM) to mine the topic information inside texts. Gui et al. (2019a) propose a reinforcement-learning-based neural topic model to alleviate the limitations of traditional topic coherence measures. Wang et al. (2020) also propose a topic-aware multi-task learning model to learn topic-enriched utterance representations in customer service, which is inspirational to our topic-enriched auxiliary learning framework. Unlike the above studies, which model topics under the assumption that the topic-word distribution is multinomial, Das et al. (2015) model topics with a multivariate Gaussian distribution over the word embedding space to deal with the new-word issue, which is inspirational to our proposed modality-agnostic topic model. However, all the prior topic models rely on a word vocabulary and are thus specially designed for the text modality; they cannot be directly adopted to capture the topic information inside images. Different from all the above studies, this paper proposes a new modality-agnostic topic model to mine global topics from either discrete textual signals or continuous visual signals. On this basis, an MTAL approach is proposed to integrate the multimodal topic information for depression detection. To the best of our knowledge, this is the first attempt to consider the topic information inside multiple modalities (i.e., both texts and images) for depression detection.

Methodology
Figure 2 shows the framework of our proposed Multimodal Topic-enriched Auxiliary Learning (MTAL) approach for depression detection, which consists of one primary task and two auxiliary tasks. The primary task is exactly the depression detection task (introduced in Section 3.1). The two auxiliary tasks are the textual and the visual topic modeling, respectively, which are introduced together with the proposed modality-agnostic topic model in Section 3.2. Finally, a topic-enriched auxiliary learning strategy is proposed to combine the primary task with the auxiliary tasks (introduced in Section 3.3).

Primary Task: Depression Detection
Given all pairs of texts and images along the timeline (see Figure 1) posted by a user, the primary task aims at modeling both the text sequence and the corresponding image sequence to perform depression prediction for this user. Figure 2 shows the illustration of the primary task. First of all, the n text-image pairs are encoded as follows.

Text Encoder. As a pre-trained text encoder, BERT can be fine-tuned to create state-of-the-art models for a range of NLP tasks, e.g., text classification and natural language inference. In our approach, we use the BERT-Base (uncased) model as the shared text encoder. Given the t-th text x_t^text of each user, we adopt BERT to encode this text and use the representation of the special classification mark as the text vector h_t^text ∈ R^d. After all texts in a text sequence are encoded by BERT, we obtain a user-generated contents (UGC) matrix, i.e., the text matrix U^text = {h_t^text}_{t=1}^n.

Image Encoder. As a pre-trained image encoding model, VGG has shown state-of-the-art performance on various computer vision tasks, e.g., image captioning and image classification (Simonyan and Zisserman, 2015). In this paper, we use VGG as the shared image encoder. Given the t-th image x_t^image of each user, following Gui et al. (2019b), we use the output vector ĥ_t ∈ R^4096 of the first fully connected layer in VGG to compute the image vector h_t^image = W_h ĥ_t + b_h. Here, W_h ∈ R^{d×4096} and b_h ∈ R^d are trainable parameters. After all images in an image sequence are encoded by VGG, we obtain another UGC matrix, i.e., the image matrix U^image = {h_t^image}_{t=1}^n.

Modality Fusion. To incorporate both the text sequence and the image sequence information, we concatenate the text vector and the image vector at the t-th time-step to obtain the new representation v_t ∈ R^{2d} of the t-th text-image pair.
Then, this representation v_t is fed to an LSTM network to obtain the hidden state h_t ∈ R^{2d} of the text-image pair as h_t = LSTM(v_t, h_{t−1}, m_{t−1}), where m_{t−1} denotes the memory cell state at time-step t−1. Finally, we regard the vector h_n at the final time-step n as the output representation of the primary task.
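The primary-task pipeline above (projecting VGG features, concatenating per time-step, and running an LSTM over the fused sequence) can be sketched in numpy. Note this is a minimal sketch, not the trained model: the text and image features are random stand-ins for BERT and VGG outputs, and the single-layer LSTM cell stands in for the full network.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 5, 8                              # number of text-image pairs, encoder dimension
h_text = rng.normal(size=(n, d))         # stand-in for BERT text vectors h_t^text
feat_img = rng.normal(size=(n, 4096))    # stand-in for VGG fc1 features ĥ_t

# Image vectors: h_t^image = W_h ĥ_t + b_h (linear projection to R^d).
W_h = rng.normal(scale=0.01, size=(d, 4096))
b_h = np.zeros(d)
h_image = feat_img @ W_h.T + b_h

# Modality fusion: concatenate per time-step -> v_t in R^{2d}.
v = np.concatenate([h_text, h_image], axis=1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Minimal LSTM cell over the fused sequence; h_n is the primary-task output.
dim = 2 * d
W = rng.normal(scale=0.1, size=(4 * dim, 2 * dim))  # stacked [input, forget, cell, output] gates
b = np.zeros(4 * dim)
h, m = np.zeros(dim), np.zeros(dim)
for t in range(n):
    z = W @ np.concatenate([v[t], h]) + b
    i, f, g, o = np.split(z, 4)
    m = sigmoid(f) * m + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(m)

print(h.shape)  # h_n lives in R^{2d}
```

The final `h` plays the role of h_n, the representation that is later gated with the topic representation.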

Auxiliary Tasks: Multimodal Topic Modeling
In this section, we first introduce the proposed modality-agnostic topic model, and then present two types of auxiliary tasks, i.e., the textual topic modeling and the visual topic modeling.
Modality-Agnostic Topic Model. Unlike traditional neural topic models (Miao et al., 2017), which focus on generating an input text represented by a discrete bag-of-words vector, our modality-agnostic topic model aims to generate the intermediate UGC matrix of each modality (i.e., U^text or U^image). For clarity, we omit the superscripts text and image of the UGC matrix below and take one modality as an example. Since both the text and image sequences are encoded into the same type of UGC matrix, the proposed topic model can be seen as modality-agnostic. Similar to Miao et al. (2017), our topic model adopts the variational auto-encoder architecture, aiming at generating the UGC matrix in an unsupervised setting. Figure 3 shows the workflow of our topic model, which consists of two main components, i.e., the inference network and the generative network.
• Inference Network is leveraged to infer the topic distribution θ from the UGC matrix U = [u_1, u_2, ..., u_n], where u_t is exactly the text vector h_t^text or the image vector h_t^image. Specifically, we first aim at estimating μ(U) and σ(U) to parameterize a diagonal Gaussian distribution q(z|U) = N(μ(U), σ²(U)), where z ∈ R^K (K is the number of topics) is a latent variable in the topic model, and μ(U) and σ(U) are functions of U implemented by neural networks.
More specifically, a convolutional layer is first employed to extract features from the UGC matrix. Formally, suppose the width of the kernel is j and the dimension of each row in the UGC matrix is d. A convolutional filter W_c ∈ R^{d×j} then maps j rows of the matrix U in the receptive field to a single feature map c. A sequence of new features c = [c_1, c_2, ..., c_n] is computed as c_i = tanh(u_{i:i+j} * W_c + b_c), where b_c ∈ R is the bias, tanh is a non-linear activation function, and * denotes the convolution operation. If there are m_j filters of the same width j, the output features form a feature-map matrix C ∈ R^{m_j×n_j}. We then apply a max-pooling operation over the matrix C, resulting in a fixed-size vector h_c ∈ R^{m_j}. The output h_c is fed into two different fully connected layers to estimate μ(U) and log(σ(U)), i.e., μ(U) = f_μ(h_c) and log(σ(U)) = f_σ(h_c), where f_μ and f_σ are two different fully connected layers. After obtaining μ(U) and log(σ(U)), the diagonal Gaussian distribution q(z|U) can be parameterized. We then sample ẑ from q(z|U) using the reparameterization trick described in Kingma and Welling (2014), i.e., ẑ = μ(U) + ε · σ(U), where ε is sampled from N(0, I). Finally, inspired by the Gaussian softmax proposed by Miao et al. (2017), we compute the topic distribution θ ∈ R^K as θ = softmax(W_θ ẑ + b_θ), where W_θ ∈ R^{K×K} and b_θ ∈ R^K are trainable parameters.
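The inference-network steps above (convolution over the UGC matrix, max-pooling, two projections for μ and log σ, reparameterized sampling, and the Gaussian softmax) can be sketched in numpy. This is an illustrative sketch with tiny dimensions and random weights; the naive loop-based convolution stands in for an optimized conv layer.

```python
import numpy as np

rng = np.random.default_rng(1)

n, d, K, m_j, j = 6, 8, 4, 10, 3   # sequence length, row dimension, topics, filters, kernel width
U = rng.normal(size=(n, d))        # UGC matrix (text or image vectors)

def conv_features(U, W_c, b_c):
    # Each filter maps j consecutive rows of U to one tanh feature per position.
    n, _ = U.shape
    num_filters, _, width = W_c.shape
    out = np.empty((num_filters, n - width + 1))
    for f in range(num_filters):
        for i in range(n - width + 1):
            out[f, i] = np.tanh(np.sum(U[i:i + width].T * W_c[f]) + b_c[f])
    return out

W_c = rng.normal(scale=0.1, size=(m_j, d, j))
b_c = np.zeros(m_j)
C = conv_features(U, W_c, b_c)
h_c = C.max(axis=1)                       # max-pooling over positions -> R^{m_j}

# Two fully connected layers estimate mu(U) and log sigma(U).
W_mu, W_sig = rng.normal(scale=0.1, size=(2, K, m_j))
mu, log_sigma = W_mu @ h_c, W_sig @ h_c

# Reparameterization trick: z_hat = mu + eps * sigma, with eps ~ N(0, I).
eps = rng.standard_normal(K)
z_hat = mu + eps * np.exp(log_sigma)

# Gaussian softmax: theta = softmax(W_theta z_hat + b_theta).
W_theta, b_theta = np.eye(K), np.zeros(K)
logits = W_theta @ z_hat + b_theta
theta = np.exp(logits - logits.max())
theta /= theta.sum()

print(theta.shape)  # theta is a distribution over the K topics
```

The resulting `theta` is non-negative and sums to one, as a topic distribution must.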
• Generative Network is leveraged to parameterize p(U|φ_{1:K}, ζ_{1:K}), the conditional probability distribution of the UGC matrix U given the trainable parameters (φ_k and ζ_k) for the k-th topic. Different from the neural topic model proposed by Miao et al. (2017), which defines a multinomial distribution for each topic over the word vocabulary, our model defines an independent diagonal Gaussian distribution N(φ_k, ζ_k²) for each topic k over the embedding space of each modality. In this way, given a UGC matrix U with topic distribution θ, each embedding u_t (i.e., h_t^text or h_t^image) is generated in two steps:

-Choose a topic γ_t ∼ Multinomial(θ)
-Choose the embedding u_t ∼ N(φ_{γ_t}, ζ_{γ_t}²)

Then, the probability distribution of the UGC matrix U can be computed as p(U|φ_{1:K}, ζ_{1:K}) = ∏_{t=1}^{n} ∑_{k=1}^{K} θ_k N(u_t|φ_k, ζ_k²). Finally, the loss function for the proposed modality-agnostic topic model is computed as L_topic = D_KL(q(z|U) ‖ p(z)) − E_{q(z|U)}[log p(U|φ_{1:K}, ζ_{1:K})] (Eq. 4), where p(z) is a standard Normal prior N(0, I). The first part of Eq.(4) uses the KL divergence to measure the discrepancy between the learned distribution q(z|U) and the true prior distribution p(z). The second part of Eq.(4) represents the likelihood of reconstructing the original input via the generative network.
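The generative likelihood and the two terms of Eq.(4) can be sketched numerically: each embedding's probability is a θ-weighted mixture of per-topic diagonal Gaussians (computed with log-sum-exp for stability), and the KL term for a diagonal Gaussian posterior against N(0, I) has a closed form. The values below use random toy parameters, so they illustrate the computation only, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(2)

n, d, K = 6, 8, 4
U = rng.normal(size=(n, d))      # UGC matrix
theta = np.full(K, 1.0 / K)      # topic distribution from the inference network
phi = rng.normal(size=(K, d))    # per-topic Gaussian means phi_k
log_zeta = np.zeros((K, d))      # per-topic log standard deviations

def diag_gaussian_logpdf(x, mean, log_std):
    # log N(x | mean, diag(exp(log_std)^2))
    var = np.exp(2 * log_std)
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

# log p(U | phi, zeta) = sum_t log sum_k theta_k N(u_t | phi_k, zeta_k^2).
log_p = 0.0
for u_t in U:
    comp = np.array([np.log(theta[k]) + diag_gaussian_logpdf(u_t, phi[k], log_zeta[k])
                     for k in range(K)])
    c_max = comp.max()
    log_p += c_max + np.log(np.exp(comp - c_max).sum())   # log-sum-exp

# Closed-form KL(q(z|U) || N(0, I)) for a diagonal Gaussian posterior.
mu, log_sigma = rng.normal(size=K), rng.normal(scale=0.1, size=K)
kl = 0.5 * np.sum(np.exp(2 * log_sigma) + mu ** 2 - 1.0 - 2 * log_sigma)

loss = kl - log_p   # Eq.(4): KL term minus reconstruction log-likelihood
```

Note that `kl` is always non-negative, so minimizing `loss` trades off matching the prior against reconstructing the UGC matrix.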
On the basis of the proposed modality-agnostic topic model, we further construct two auxiliary tasks, 1) textual topic modeling and 2) visual topic modeling, aiming at capturing the topic information inside texts and images respectively. Concretely, since each topic has an independent Gaussian distribution as mentioned above, we first take advantage of the mean vector φ_k ∈ R^M of each Gaussian distribution to construct the topic embedding matrix Φ = [φ_1, φ_2, ..., φ_K] over all topics. Then, we leverage the topic embedding matrix to compute the topic representation of each modality (texts or images) according to its corresponding topic distribution θ. More concretely, the two auxiliary tasks are formulated as follows.
Auxiliary Task 1: Textual Topic Modeling. The text sequence posted by a user is first encoded into a text matrix U^text, which is then fed into a modality-agnostic topic model to obtain the textual topic distribution θ^text ∈ R^K. Finally, the textual topic representation q^text ∈ R^M of all texts is computed as q^text = Φ^text θ^text, where Φ^text ∈ R^{K×M} denotes the textual topic embedding matrix.

Auxiliary Task 2: Visual Topic Modeling. The image sequence posted by the user is first encoded into an image matrix U^image, which is then fed into another modality-agnostic topic model to obtain the visual topic distribution θ^image ∈ R^K. Finally, the visual topic representation q^image ∈ R^M of all images is computed as q^image = Φ^image θ^image, where Φ^image ∈ R^{K×M} denotes the visual topic embedding matrix.
After obtaining both the textual and the visual topic representation, we compute the output representation h_topic ∈ R^{2M} of the auxiliary tasks as h_topic = q^text ⊕ q^image, where ⊕ denotes concatenation.
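The topic representations above are simply topic-distribution-weighted mixtures of the topic embeddings. A minimal numpy sketch (with random toy values; note that with Φ ∈ R^{K×M} and θ ∈ R^K, the product is read as Φᵀθ so that the dimensions match):

```python
import numpy as np

rng = np.random.default_rng(3)
K, M = 4, 6

# Topic embedding matrices; row k is the Gaussian mean phi_k of topic k.
Phi_text, Phi_image = rng.normal(size=(2, K, M))

# Toy topic distributions from the two modality-agnostic topic models.
theta_text = np.full(K, 1.0 / K)
theta_image = np.array([0.7, 0.1, 0.1, 0.1])

# q = Phi^T theta: mix the K topic embeddings by their topic weights.
q_text = Phi_text.T @ theta_text
q_image = Phi_image.T @ theta_image

# Output of the auxiliary tasks: h_topic = q_text (+) q_image in R^{2M}.
h_topic = np.concatenate([q_text, q_image])
print(h_topic.shape)
```

A peaked `theta_image` (as above) pulls `q_image` toward a single topic's embedding, while a uniform `theta_text` averages all topic embeddings.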

Topic-Enriched Auxiliary Learning
Different from multi-task learning, whose goal is to achieve better performance across all tasks, auxiliary learning requires better performance only on a single primary task, and the role of the auxiliary tasks is to assist the performance improvement of this primary task. To this end, we take advantage of two strategies (i.e., adaptive learning and auxiliary training) to combine the primary task with the auxiliary tasks, as illustrated below.
Adaptive Learning. To weigh the output representation of the primary task against that of the auxiliary topic-modeling tasks, we utilize an adaptive gate e ∈ R^{2d} to combine the representations from both the primary and the auxiliary tasks. The final user representation r ∈ R^{2d} is computed as e = σ(W_e (h_n ⊕ h_topic) + b_e) and r = e ⊙ h_n + (1 − e) ⊙ (W_r h_topic), where h_n ∈ R^{2d} is the primary task representation, ⊙ is element-wise multiplication, and W_e ∈ R^{2d×(2d+2M)}, b_e ∈ R^{2d} and W_r ∈ R^{2d×2M} are trainable parameters. Further, the vector r is fed to a softmax layer for depression prediction, i.e., p_Θ = softmax(W_p r + b_p). Here, W_p ∈ R^{c×2d} and b_p ∈ R^c are trainable parameters, c is the number of categories, and p_Θ is the probability distribution over the two categories.
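The adaptive gating step can be sketched in numpy. The gated convex combination used here is one plausible reading of the elided gate equations (the exact combination in the original may differ); the weights and inputs are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(4)
d, M = 8, 6

h_n = rng.normal(size=2 * d)       # primary-task representation
h_topic = rng.normal(size=2 * M)   # auxiliary-task topic representation

W_e = rng.normal(scale=0.1, size=(2 * d, 2 * d + 2 * M))
b_e = np.zeros(2 * d)
W_r = rng.normal(scale=0.1, size=(2 * d, 2 * M))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Gate from the concatenated representations, then a gated mix of the
# primary representation and the projected topic representation.
e = sigmoid(W_e @ np.concatenate([h_n, h_topic]) + b_e)
r = e * h_n + (1.0 - e) * (W_r @ h_topic)

# Softmax classifier over the two categories (depressed / non-depressed).
W_p, b_p = rng.normal(scale=0.1, size=(2, 2 * d)), np.zeros(2)
logits = W_p @ r + b_p
p = np.exp(logits - logits.max())
p /= p.sum()
print(r.shape, p.shape)
```

Each gate entry lies in (0, 1), so the model can decide per dimension how much to trust the LSTM output versus the topic signal.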
Auxiliary Training. We employ a joint loss function to optimize the primary and auxiliary tasks simultaneously. The joint loss consists of two parts: the supervised loss for depression detection and the unsupervised losses for the two auxiliary modality-agnostic topic modeling tasks. Specifically, the loss L_primary for the primary depression detection task is computed as L_primary = −(1/N) ∑_{i=1}^{N} log p_Θ(y_i|x_i) + δ‖Θ‖²₂, where N is the number of all Twitter users, y_i is the ground-truth label for the i-th user x_i, δ is an L2 regularization weight, and Θ denotes all training parameters in the model. In addition, the loss function for our topic model has been shown in Eq.(4). For clarity, the losses for the two auxiliary tasks, i.e., the textual topic modeling and the visual topic modeling, are denoted as L_text and L_image respectively. Finally, the joint loss L is defined as L = L_primary + λ(L_text + L_image), where λ is a weight, fine-tuned to 0.25, balancing the losses of the primary and auxiliary tasks.

Table 2: Performance comparison of different approaches to depression detection.

Experimentation
To validate the effectiveness of our approach, we evaluate the performance of the proposed approach and the baseline approaches, with details shown in Table 2.

Experimental Settings
Data Setting. We conduct experiments on the multimodal depression dataset 1 released by Gui et al. (2019b). Different from Gui et al. (2019b), we adopt a new data setting and split the original dataset into standard train/development/test sets with a ratio of 7:1:2. The reason we adopt this different setting is that Gui et al. (2019b) follow the same experimental setting proposed by Shen et al. (2017) for a fair comparison, but that old setting contains no real development set. Instead, Gui et al. (2019b) regard the test set as the development set for training and use the five-fold cross-validation results on this development set as the final results. We believe this is not well suited to training iterative neural-network-based approaches, because involving the label information of the test set in the training phase cannot convincingly evaluate the generalization ability of such an approach. Nevertheless, for a fair comparison, we re-implement their approach on our new data setting. Statistics of the newly split dataset are shown in Table 1. This dataset retains balanced categories (1,402 depressed users and 1,402 non-depressed users) like the original dataset (note that we also evaluate our approach in the imbalanced scenario presented in the section on analysis and discussion).
To facilitate the corresponding research, the dataset with the new data setting is released as the new benchmark dataset for multimodal depression detection via GitHub 2.

Implementation Details. In our experiments, all hyper-parameters are tuned on the development set. Specifically, BERT is optimized by the Adam optimizer (Devlin et al., 2019), where β_1 = 0.9 and the initial learning rate is 10^−4; the other parameters of BERT follow Devlin et al. (2019). For our MTAL approach, we set the dimension of the LSTM hidden states to 256 and adopt another Adam optimizer (Kingma and Ba, 2015), with an initial learning rate of 10^−2 and β_1 = 0.9, for cross-entropy training. The regularization weight is 10^−5 and the dropout rate is 0.5. For the CNN, we set the widths of the filters to 3, 4 and 5, with 100 feature maps each. The dimension M of the topic embeddings in our topic model is 128 and the number of topics K is 20. Besides, if a tweet includes no image, the image vector is initialized as a zero vector.
Evaluation Metrics. The performance is evaluated using standard Accuracy (Acc.) and Macro-F1 (F1), following prior studies. Moreover, the t-test is used to evaluate the significance of the performance difference between two approaches, following Yang and Liu (1999).
Baselines. For comparison, we re-implement several approaches as baselines for depression detection. 1) H-LSTM. This is a hierarchical LSTM approach to aspect sentiment classification; in our implementation, we use it to model the word sequence and the text sequence for depression classification. 2) BERT+LSTM. This is a BERT model for encoding each text, followed by an LSTM to encode the text sequence for depression classification. 3) BERT+LSTM+Textual Topic Modeling. This is an extension of BERT+LSTM with textual topic information. 4) VGG+LSTM. This is a VGG model for encoding each image, followed by an LSTM to encode the image sequence. 5) VGG+LSTM+Visual Topic Modeling. This is an extension of VGG+LSTM with visual topic information. 6) EF-LSTM (Zadeh et al., 2018). This is a state-of-the-art multimodal approach to the human communication comprehension task. 7) CoMemory (Xu et al., 2018). This is a state-of-the-art multimodal approach to the multimodal sentiment analysis task. 8) CoATT. This is a state-of-the-art multimodal approach to the named entity recognition task. 9) Hybrid Attention (Gu et al., 2018). This is a state-of-the-art multimodal approach using modality attention to learn modality-specific and modality-fusion features for spoken language classification. Note that the above four multimodal approaches can only encode a single text+image pair; in our implementation, we also use an LSTM to encode the final vector sequence of all text+image pairs for depression classification. 10) CoMMA (Gui et al., 2019b). This is a state-of-the-art multimodal approach to depression detection, which we re-implement on our new data setting. 11) Primary Task. Our approach without the two auxiliary tasks.

Table 3: Performances of the primary task (depression detection) with different combinations of auxiliary tasks, i.e., the textual topic modeling and the visual topic modeling.
Table 2 compares different approaches to the depression detection task. From this table, we can see that:

Experimental Results
Single-Modality Performance. When only using the text modality: 1) the BERT-based approach BERT+LSTM performs better than H-LSTM, which encourages us to use BERT as the text encoder for depression detection; 2) our approach BERT+LSTM+Textual Topic Modeling performs consistently better than BERT+LSTM, which encourages us to incorporate the textual topic information for depression detection. When only using the image modality, the VGG-based approach performs much better than a random baseline. This encourages us to consider the image information for depression detection and also indicates the appropriateness of using VGG as the image encoder.
Multimodality Performance. When using both the text and image modalities, CoMemory, CoATT and Hybrid Attention perform better than the single-modal BERT+LSTM. This confirms the helpfulness of considering image information in depression detection. In comparison, our Primary Task approach performs consistently better than all the above multimodal approaches in terms of all metrics, mainly due to the helpfulness of using BERT as the text encoder. Among all these approaches, our MTAL approach performs best and even significantly outperforms (p-value < 0.01) the strong baseline BERT+LSTM+Textual Topic Modeling in terms of all metrics. These results encourage us to incorporate both the textual and the visual topic information for depression detection.

Multimodal Examples: Tweets Posted by a Depressed Patient in a Month
Figure 5: A depressed user example from the test data, with the output categories and the probabilities of the true label depressed predicted by different approaches. ✓ (or ✗) denotes that the predicted category is correct (or wrong).

Analysis and Discussion
Contribution of Multimodal Topic Information. Table 3 summarizes the results of the primary task integrated with different auxiliary tasks. From this table, we can see that: 1) + Auxiliary Task 1 is superior to Primary Task, improving the F1 score by 1.2% (p-value < 0.05), which demonstrates that incorporating the textual topic information helps detect depression. 2) + Auxiliary Task 2 performs slightly better than Primary Task, with an improvement of 0.4% in F1, which demonstrates that incorporating the visual topic information is also useful. 3) + Auxiliary Task 1,2 performs best and significantly outperforms Primary Task (p-value < 0.05), by 3.3% in Acc. and 3.1% in F1. This indicates that jointly training the primary task with both auxiliary tasks significantly improves the performance and demonstrates that incorporating both the textual and the visual topic information helps detect depression.

Analysis of the Imbalanced Scenario. In realistic scenarios, only a small proportion of users are depressed. Inspired by this, we further evaluate our MTAL approach on different percentages of depressed users to verify its robustness. Specifically, we construct 9 different imbalanced training sets in which the percentage of depressed users ranges from 10% to 90%, with the total number of users set to 1,500 and the same train/dev/test ratio of 7:1:2. The detailed experimental results are shown in Figure 4. From the figure, we can see that our MTAL approach still achieves stable performance even with a very low percentage (10%) of depressed users, and consistently performs better than the state-of-the-art baselines. This further justifies the robustness and effectiveness of our approach.
Qualitative Analysis. Figure 5 shows the multimodal tweets posted by a depressed user in a month, together with the categories and the probabilities of the ground-truth label predicted by different approaches. From this figure, we can see that: 1) Though this user expresses positive emotions (e.g., "grateful" and "beautiful") at the time points [2015.3.27] and [2015.4.11], he/she still expresses negative emotions (e.g., the words "worthless", "hurt" and "hopeless", which indicate a world-weary topic) most of the time during the month. Moreover, all images in Figure 5 have the characteristics of low brightness and dark tones, which also indicates the negative emotion of this user. These observations highlight the importance of capturing the global topic information for depression detection. 2) Although our Primary Task approach gives a wrong prediction, it still assigns a higher probability to the true label depressed than Hybrid Attention does. This indicates the appropriateness of using the Primary Task approach to fuse the textual and visual information. Furthermore, when incorporating either the textual or the visual topic information, all our topic-enriched approaches, including MTAL, give the correct prediction, i.e., depressed, for this user. This again encourages us to incorporate the multimodal topic information for depression detection.
User Visualization with Multimodal Topics. We randomly pick 500 users to perform the t-SNE projection according to the textual topic representation q^text or the visual topic representation q^image of each user. Specifically, we randomly pick six topics in each modality (if a topic is not picked, its topic embedding inside the topic embedding matrix Φ is set to the zero vector) to compute the textual (or visual) topic representation of each user. Figure 6 (a) and 6 (b) show the user visualization with the textual topics and the visual topics respectively. From the two figures, we can see that: 1) depressed and non-depressed users are clearly separated by both the textual and the visual topics, indicating the helpfulness of both types of topic information for depression detection; 2) the textual and visual topics themselves are also clearly separated, indicating the effectiveness of our modality-agnostic topic model in capturing the textual and visual topic information. In summary, these observations suggest considering the topic information inside both texts and images for depression detection.

Figure 6: t-SNE visualization of depressed users and non-depressed users according to six randomly picked topics from each modality respectively. The squares and triangles represent non-depressed and depressed users respectively. The six topics in Left (a) are generated by the textual topic modeling, while those in Right (b) are generated by the visual topic modeling. Besides, the sample texts and images are randomly selected from tweets posted by the linked users.

Conclusion
In this paper, we propose a novel multimodal topic-enriched auxiliary learning approach to depression detection. The main idea of the proposed approach is to incorporate not only the textual topic information but also the visual topic information for depression detection. Experimental results demonstrate that the proposed approach significantly outperforms a number of competitive baselines, including the representative textual depression detection approaches and the state-of-the-art multimodal-based approaches.
In our future work, we would like to explore more information, such as the user's personal attributes (e.g., profession, location and age) and social attributes (e.g., timeline, social behavior and social relationships), to assist depression detection. In addition, we would like to apply our approach to other psychological analysis tasks, such as multimodality-based emotion analysis, anxiety detection and personality inference.