Divide, Conquer and Combine: Hierarchical Feature Fusion Network with Local and Global Perspectives for Multimodal Affective Computing

We propose a general strategy named ‘divide, conquer and combine’ for multimodal fusion. Instead of directly fusing features at the holistic level, we conduct fusion hierarchically so that both local and global interactions are considered for a comprehensive interpretation of multimodal embeddings. In the ‘divide’ and ‘conquer’ stages, we conduct local fusion by exploring the interactions among the portions of the aligned feature vectors that lie within a sliding window across modalities, which ensures that each part of the multimodal embeddings is explored sufficiently. On this basis, global fusion is conducted in the ‘combine’ stage to explore the interconnections across local interactions, via an Attentive Bi-directional Skip-connected LSTM that directly connects distant local interactions and integrates two levels of attention mechanism. In this way, local interactions can exchange information sufficiently and thus obtain an overall view of the multimodal information. Our method achieves state-of-the-art performance on multimodal affective computing with higher efficiency.


Introduction
Multimodal machine learning, as prior research shows (Baltrušaitis et al., 2019), generally yields higher performance on multimodal tasks than settings where only one modality is involved. In this paper, we address the multimodal machine learning problem, with an emphasis on multimodal affective computing, where the task is to infer a human's opinion from given language, visual and acoustic modalities (Poria et al., 2017a).
Finding a feasible and effective solution to learning inter-modality dynamics has been an intriguing and important problem in multimodal learning (Baltrušaitis et al., 2019), where inter-modality dynamics represent the complementary information contained in more than one modality that must be detected and analyzed for a more accurate comprehension. For this purpose, a large body of prior work treats the feature vectors of the modalities as the smallest units and fuses them at the holistic level (Barezi et al., 2018; Poria et al., 2016a, 2017b). A typical example is the tensor-based fusion method that fuses the feature vectors of three modalities using the Cartesian product. Despite the effectiveness this type of method has achieved, it gives little consideration to the variations across different portions of a feature vector, which may contain disparate aspects of information, and thus fails to render the fusion procedure more specialized. Additionally, such methods conduct fusion within one step, which can be intractable in scenarios where the fusion method is susceptible to high computational complexity. Recently, Convolutional Neural Networks (CNNs) have achieved compelling successes in computer vision (Krizhevsky et al., 2012; Mehta et al., 2019). One of the core ideas of CNNs lies in the use of the convolution operation to process feature maps: a series of local operations with kernels sliding over the object. Inspired by this, we propose local fusion to explore local interactions in multimodal embeddings, which is similar in spirit to convolution but is essentially a general strategy for multimodal fusion with multiple concrete fusion methods to choose from. Specifically, as shown in Fig. 1, we align the feature vectors of the three modalities to obtain multimodal embeddings and apply a sliding window to slide through them. The parallel portions of the feature vectors within each window are then fused by a specific fusion method.
By considering local interactions we gain three advantages: 1) the fusion procedure becomes more specialized, since each portion of the modality embeddings intuitively contains a specific aspect of information; 2) proper weights can be assigned to different portions; 3) computational complexity and parameters are reduced substantially by dividing holistic fusion into multiple local ones. Many approaches can be adapted into our strategy for local fusion; we empirically apply the outer product. While the outer product (bilinear pooling) typically incurs heavy time and space complexity (Lin et al., 2015; Wu et al., 2017), we show that our method can achieve much higher efficiency.
Nonetheless, local fusion alone is not adequate for a comprehensive analysis of opinion. In fact, local interactions may contain information complementary to one another, which should be drawn upon for an overall comprehension. Moreover, a small sliding window may not cover a complete interaction. Thus, we propose global fusion, which explores the interconnections of local interactions to mitigate these problems. In practice, RNN variants (Goudreau et al., 1994), especially the LSTM (Hochreiter and Schmidhuber, 1997), are suitable for global fusion owing to their power in modeling interrelations. However, in the vanilla RNN architecture, only consecutive time steps are linked through hidden states, which may not be adequate for conveying information between local interactions that are far apart. Recently, some works have focused on introducing residual learning into RNNs (Tao and Liu, 2018; Wang and Wang, 2018; Wang and Tian, 2016; He et al., 2016). Motivated by these efforts, we propose an Attentive Bi-directional Skip-connected LSTM (ABS-LSTM) that introduces bidirectional skip connections of memory cells and hidden states into the LSTM, which ensures sufficient multi-directional information flow and mitigates the long-term dependency problem (Bengio et al., 1994). In the transmission process of ABS-LSTM, the previous interactions are not equally correlated with the current local interaction, i.e., they vary in the amount of complementary information to be delivered. In addition, given that the local interactions, which do not contain equally valuable information, are fed into ABS-LSTM across time steps, the produced states do not contribute equally to recognizing emotion. Thus, we incorporate two levels of attention mechanism into ABS-LSTM, i.e., Regional Interdependence Attention and Global Interaction Attention.
The former takes effect in the process of delivering complementary information between local interactions, identifying the varying correlations of the previous t local interactions with the current one. The latter serves to allocate more attention to states that are more informative, so as to aid a more accurate prediction.
To sum up, we propose a Hierarchical Feature Fusion Network (HFFN) for multimodal affective analysis. The main contributions are as follows:
• We propose a generic hierarchical fusion strategy, termed 'divide, conquer and combine', to explore both local and global interactions in multiple stages, each focusing on different dynamics.
• Instead of conducting fusion at a holistic level, we leverage a sliding window to explore inter-modality dynamics locally. In this way, our model can take into account the variations across portions of a feature vector. This setting also brings an impressive bonus: a significant drop in computational complexity compared to other tensor-based methods, which is proven empirically.
• We propose global fusion to obtain an overall view of multimodal embeddings via a specifically designed ABS-LSTM, in which we integrate two levels of attention mechanism: Regional Interdependence Attention and Global Interaction Attention.

Related Work
Previous research on affective analysis focuses on the text modality (Liu and Zhang, 2012; Cambria and Hussain, 2015), a hot research topic in the NLP community. However, recent research suggests that information from text alone is not sufficient for mining humans' opinions (Poria et al., 2017a; D'Mello and Kory, 2015; Cambria, 2016), especially when sarcasm or ambiguity occurs. Nevertheless, if accompanying information such as the speaker's facial expressions and tone is presented, it becomes much easier to figure out the real sentiment (Pham et al., 2018, 2019). Therefore, multimodal affective analysis has attracted increasing attention, and its major challenge is how to fuse features from various modalities. Earlier feature fusion strategies can be roughly categorized into feature-level and decision-level fusion. The former extracts features of the various modalities and conducts fusion at the input level by mapping them into the same embedding space, often simply using concatenation (Wollmer et al., 2013; Rozgic et al., 2012; Morency et al., 2011; Poria et al., 2016a, 2017b; Gu et al., 2017). The latter, by contrast, draws tentative decisions based on the involved modalities separately and then weighted-averages the decisions, realizing cross-modal fusion (Wu and Liang, 2010; Nojavanasghari et al., 2016; Zadeh et al., 2016a). These two lines of work do not effectively model cross-modal or modality-specific dynamics.
Recently, word-level fusion methods have received substantial research attention and been widely acknowledged for effectively exploring time-dependent interactions (Wang et al., 2019; Zadeh et al., 2018a,b,c; Gu et al., 2018a; Rajagopalan et al., 2016). For example, Gu et al. (2018b) leverage word-level alignment between modalities and explore time-restricted cross-modal dynamics. Liang et al. (2018a) propose the Recurrent Multistage Fusion Network (RMFN), which decomposes multimodal fusion into three stages and uses an LSTM to perform local fusion. RMFN adopts the strategy of 'divide and conquer', while our method extends it by adding a 'combine' part to learn the relations between local interactions. Liang et al. (2018b) conduct emotion recognition using local-global emotion intensity rankings and Bayesian ranking algorithms. However, 'local' and 'global' there are totally different from ours: their 'local' refers to an utterance of a video, while our 'local' represents a feature chunk of an utterance.
Tensor fusion has also become increasingly popular. The Tensor Fusion Network (TFN) adopts the outer product to conduct fusion at a holistic level, and is later extended by approaches such as Barezi et al. (2018) that try to improve efficiency and reduce redundant information by decomposing the weights of the high-dimensional fused tensors. HFFN mainly applies the outer product as its local fusion method, and it improves efficiency by dividing the modality embeddings into multiple local chunks before fusion, which prevents a high-dimensional fused tensor from being created. In fact, HFFN can adopt fusion strategies other than the outer product in the local fusion stage, showing high flexibility and applicability.

Algorithm
As shown in Fig. 2, HFFN consists of: 1) a Local Fusion Module (LFM) for fusing features of different modalities at every local chunk; 2) a Global Fusion Module (GFM) for exploring global inter-modality dynamics; 3) an Emotion Inference Module (EIM) for obtaining the predicted emotion.

Divide and Conquer: Local Fusion
At the local fusion stage, we apply a sliding window that slides through the aligned feature vectors synchronously. At each step, local fusion is conducted on the portions of the feature vectors within the window. In this way, the features of all modalities within the same window can fully interact with one another to obtain locally confined interactions in a more specialized way.
Assume that we have three modalities' feature vectors as input, namely language $l \in \mathbb{R}^k$, visual $v \in \mathbb{R}^k$ and acoustic $a \in \mathbb{R}^k$ (we only consider the situation where all modalities share the same feature length $k$, since they can easily be mapped into the same embedding space via some transformations). In the 'divide' stage, we align these feature vectors to form the multimodal embedding $M \in \mathbb{R}^{3 \times k}$ and leverage a sliding window of size $3 \times d$ to explore inter-modality dynamics. Through the sliding window, each feature vector is segmented into multiple portions, each termed a local portion. The segmentation procedure for the feature vector of one modality is equivalent to:

$m_i = m_{[(i-1)s+1 \,:\, (i-1)s+d]}$

where $m \in \{l, v, a\}$ denotes modality $m$, $d$ is the window size, $s$ is the stride and $m_i$ denotes the $i$-th local portion of modality $m$ ($i \in [1, n]$, where $n$ is the number of local portions for each modality). Obviously, for each modality we have $n = \frac{k-d}{s} + 1$ local portions in total, provided that $k-d$ is divisible by $s$; otherwise the feature vectors are padded with 0s to guarantee divisibility. For descriptive convenience, we also term the parallel local portions of all modalities within the sliding window a local chunk.
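The segmentation above can be sketched in NumPy as follows (a non-authoritative illustration; the `segment` helper and its exact zero-padding placement are our own naming, not from the released code):

```python
import numpy as np

def segment(m, d, s):
    """Split one modality's feature vector m (length k) into local
    portions of length d with stride s, zero-padding the tail so that
    (k - d) becomes divisible by s, as the paper describes."""
    rem = (len(m) - d) % s
    if rem != 0:
        m = np.concatenate([m, np.zeros(s - rem)])
    n = (len(m) - d) // s + 1  # number of local portions
    return np.stack([m[i * s : i * s + d] for i in range(n)])

# k = 6, d = 2, s = 2  ->  n = (6 - 2) / 2 + 1 = 3 local portions
portions = segment(np.arange(6, dtype=float), d=2, s=2)
print(portions.shape)  # (3, 2)
```

Applying the same call to each of $l$, $v$ and $a$ and stacking the results portion-wise yields the local chunks.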
Many fusion methods can be chosen for fusing the features within each local chunk to explore inter-modality dynamics in the 'conquer' stage. In practice, we apply the outer product, as it provides the best results in our experiments. Firstly, each local portion is padded with a 1 so that interactions of any subset of modalities are retained:

$\hat{m}_i = [m_i; 1], \quad m \in \{l, v, a\}$

Then we perform the outer product over the padded feature vectors:

$X^f_i = \hat{l}_i \otimes \hat{v}_i \otimes \hat{a}_i$

where $\otimes$ denotes the tensor outer product of a set of vectors. The final locally fused tensor for the $i$-th local chunk is $X^f_i \in \mathbb{R}^{(d+1)^3}$, which represents the $i$-th local interaction. We group all $n$ locally fused tensors to obtain the overall fused tensor sequence $X^f = [X^f_1; X^f_2; \ldots; X^f_n]$, as shown in Fig. 2. Compared with other models adopting the outer product, our model achieves a marked improvement in efficiency by dividing holistic tensor fusion into multiple local ones, as shown in Section 4.3.3. In fact, we can apply other fusion methods that are suitable for local information extraction, which demonstrates the broad applicability of our strategy and is left for future work.
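A minimal sketch of this 'conquer' step (helper names are ours; the actual implementation may differ): each portion is padded with a 1 and the three padded vectors are fused by a three-way outer product.

```python
import numpy as np

def local_fuse(l_i, v_i, a_i):
    """Fuse one local chunk: pad each local portion with a constant 1
    (so unimodal and bimodal interactions survive in the product), then
    take the three-way tensor outer product."""
    l1, v1, a1 = (np.concatenate([x, [1.0]]) for x in (l_i, v_i, a_i))
    return np.einsum('i,j,k->ijk', l1, v1, a1)  # shape (d+1, d+1, d+1)

X_f_i = local_fuse(np.ones(2), 2 * np.ones(2), 3 * np.ones(2))
print(X_f_i.shape)  # (3, 3, 3) for d = 2
```

Note how the trailing 1s preserve lower-order terms: the entry at index (d, d, d) is always 1·1·1, and slices along the padded index recover unimodal and bimodal products.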

Combine: Global Fusion
In the 'combine' stage, we model global interactions by exploring interconnections (complementary information) and context-dependency across local fused tensors to obtain an overall interpretation of interactions comprehensively. In addition, the limited and fixed size of sliding window may lead to division of the complete process of expressing emotion into different local portions, in which case sufficient flow of information between local chunks is warranted to compensate for this problem. Therefore, we design ABS-LSTM, an RNN variant, to make sense of the cross-modality dynamics from an integral perspective. In ABS-LSTM, we introduce bidirectional residual connection of memory cells and hidden states as well as integrate attention mechanisms to transmit information and learn overall representations more effectively, as shown in Fig. 2. Now that we obtain the local fused tensor sequence X f in LFM, global interaction learning can be expressed as: where ABS-LSTM is activated by tanh nonlinear function, X g = [X g 1 ; X g 2 ; ...; X g n ] ∈ R n×2o is the global fused tensor sequence, and 2o is the dimensionality of ABS-LSTM's output. A detailed illustration of ABS-LSTM is shown below.

ABS-LSTM
ABS-LSTM is specifically designed for modeling the interconnections of the locally fused tensors to distill complementary information. Since local interactions within a certain distance range are mutually correlated, it is necessary for ABS-LSTM to operate bidirectionally. As opposed to conventional bidirectional RNNs, ABS-LSTM has a single set of parameters shared by the forward and backward passes, which ensures a smaller number of parameters. Further, ABS-LSTM directly connects the current interaction with several of its neighbors so that information can be sufficiently exchanged. Given its ability to bidirectionally transmit information through multiple connections, it is powerful in modeling long-term dependencies, which is crucial for long sequences.
Firstly, we illustrate the pipeline of ABS-LSTM in the forward pass. Assume that the previous t local interactions are directly connected to the current one (t is set to 3 in our experiments); it is then beneficial to identify the varying correlations between the previous t interactions and the current one. To this end, we integrate Regional Interdependence Attention (RIA) into ABS-LSTM, so that previous local interactions containing more information complementary to the current one are given more importance during transmission. The equations for fusing the previous cells and states for the $l$-th interaction in the forward pass are:

$\omega^h_i = W_h\,[\overrightarrow{h}_{l-i}; X^f_l], \quad \omega^c_i = W_c\,[\overrightarrow{c}_{l-i}; X^f_l], \quad i \in [1, t] \quad (5)$

$s^h_i = \|\omega^h_i\|_2, \quad s^c_i = \|\omega^c_i\|_2 \quad (6)$

$\alpha = \mathrm{softmax}(s^h), \quad \beta = \mathrm{softmax}(s^c) \quad (7)$

$\widetilde{h}_l = a\Big(\sum_{i=1}^{t} \alpha_i\, \overrightarrow{h}_{l-i}\Big), \quad \widetilde{c}_l = a\Big(\sum_{i=1}^{t} \beta_i\, \overrightarrow{c}_{l-i}\Big) \quad (8)$

where $[\cdot\,;\cdot]$ denotes vector concatenation and $W_h, W_c \in \mathbb{R}^{o \times (o+(d+1)^3)}$ are parameter matrices that determine the importance of the previous states $\overrightarrow{h}_{l-i}$ and cells $\overrightarrow{c}_{l-i}$, respectively. Eq. 5 maps the cell and state at the $(l-i)$-th time step into two $o$-dimensional vectors. Instead of merely using $\overrightarrow{c}_{l-i}$ or $\overrightarrow{h}_{l-i}$ to obtain their importance for the current local interaction, we also utilize the current time step's input $X^f_l$, which reflects the correlation between the cell and state of the $(l-i)$-th interaction and the current $l$-th time step's input and thus provides a better measurement of the attention scores by learning the inter-dependency between interactions. In Eq. 6, we take the 2-norm of each vector as the importance score of each previous cell and state, forming a $t$-dimensional importance score vector for the states and cells, respectively. In Eq. 7 we use a softmax layer to normalize both vectors and obtain the final attention scores, which, according to Eq. 8, are used as weights for the combination of the previous t local interactions. The function $a$ in Eq. 8 is a nonlinear activation that improves the expressive power of ABS-LSTM; we empirically choose ReLU. Overall, Eqs. 5 to 8 realize the transmission of information from multiple previous local interactions to the current one, using the first level of attention, RIA, which properly distributes attention across the previous t local interactions to focus on those most relevant to the current one.
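The RIA computation can be sketched as follows (a simplified NumPy illustration of the hidden-state path only; the cell path is analogous, and all function and variable names here are our own):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ria(prev_h, x_l, W_h):
    """Regional Interdependence Attention over the previous t hidden
    states: score each state by the 2-norm of W_h [h_{l-i}; x_l],
    normalize with softmax, combine, and apply ReLU (the activation a)."""
    scores = np.array([np.linalg.norm(W_h @ np.concatenate([h, x_l]))
                       for h in prev_h])
    alpha = softmax(scores)
    fused = sum(a_i * h for a_i, h in zip(alpha, prev_h))
    return np.maximum(fused, 0.0)  # ReLU

rng = np.random.default_rng(0)
prev_h = [rng.standard_normal(4) for _ in range(3)]  # t = 3, o = 4
x_l = rng.standard_normal(5)                         # toy input size
W_h = rng.standard_normal((4, 9))                    # o x (o + input)
h_tilde = ria(prev_h, x_l, W_h)
print(h_tilde.shape)  # (4,)
```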
After combining the previous information, we further define:

$i_l = \sigma(W_i\,[\widetilde{h}_l; X^f_l] + b_i), \quad f_l = \sigma(W_f\,[\widetilde{h}_l; X^f_l] + b_f), \quad o_l = \sigma(W_o\,[\widetilde{h}_l; X^f_l] + b_o)$

$\overrightarrow{c}_l = f_l \odot \widetilde{c}_l + i_l \odot \tanh(W_u\,[\widetilde{h}_l; X^f_l] + b_u), \quad \overrightarrow{h}_l = o_l \odot \tanh(\overrightarrow{c}_l) \quad (9\text{-}12)$

where $\sigma$ denotes the sigmoid function. Eqs. 9-12 follow the routine procedure of the LSTM except that $\overrightarrow{h}_{l-1}$ and $\overrightarrow{c}_{l-1}$ are replaced with $\widetilde{h}_l$ and $\widetilde{c}_l$, respectively. The output of the $l$-th time step in the forward pass is $\overrightarrow{h}_l$ ($1 \le l \le n$). To make ABS-LSTM bidirectional, in the backward pass we reverse the input $X^f$ so that the last interaction arrives first and feed it again through Eqs. 5-12, whose output is $\overleftarrow{h}_l$. The output of ABS-LSTM at the $l$-th time step is:

$h_l = [\overrightarrow{h}_l; \overleftarrow{h}_l] \in \mathbb{R}^{2o}$

Global Interaction Attention (GIA): Inherently, the LSTM has the capability to 'memorize' and uses this memory to model long-term dependencies sequentially. The hidden states output by ABS-LSTM thus synthesize information from the current time step's input interaction and from previous inputs. In this sense, at each time step new information is processed while previous information still exists but is 'diluted' in the hidden state (due to the forget gate). Therefore, when local interactions that are more informative, e.g., revealing a sharp tone or a sheer alteration of facial expressions, are input to ABS-LSTM, the produced states should be given more importance than others, since they have just synthesized an informative interaction and have not yet been 'diluted'. Hence, it is justifiable to employ a specifically designed attention mechanism, termed Global Interaction Attention (GIA), to properly assign importance across states. GIA is formulated as:

$\omega_h = \tanh(W_h\, h_l + b_h) \quad (13)$

$\omega_x = \tanh(W_x\, X^f_l + b_x) \quad (14)$

$h^a_l = (W_{h2}\,\omega_h)\, h_l + (W_{x2}\,\omega_x)\,\mathbf{1} \quad (15)$

where $W_h \in \mathbb{R}^{o \times 2o}$ and $W_x \in \mathbb{R}^{o \times (d+1)^3}$ are parameter matrices and $b_h, b_x \in \mathbb{R}^o$ are bias vectors to be learned, and $W_{h2}, W_{x2} \in \mathbb{R}^{1 \times o}$ are parameter matrices that determine the final importance scores. Through the affine transforms and nonlinearities in Eqs. 13 and 14, the $l$-th state $h_l$ and the corresponding input $X^f_l$ are embedded into two $o$-dimensional vectors $\omega_h$ and $\omega_x$ that encode the importance of the $l$-th state and local interaction, respectively. In Eq. 15, $W_{h2}$ and $\omega_h$ first form a scalar via matrix multiplication, which reflects the importance of the $l$-th hidden state and is used as its weight. Meanwhile, we pre-multiply $\omega_x$ by $W_{x2}$ and obtain a scalar that is added to each entry of the weighted state, functioning as a bias containing the input's information. By this means, the attended state at the current time step focuses more on the information from the current interaction than from previous ones.
Considering that $X^f_l$ and $h_l$ are two intrinsically disparate sources of information, we formulate the impact of $X^f_l$ only as a scalar that biases the state, rather than as a vector, which would exert a much more complex influence on the state and empirically degrades performance. In this way, if $X^f_l$ is more important, the $l$-th attended state $h^a_l$ receives a more significant shift towards a higher position in all dimensions, and thus $h^a_l$ is more attended. In a sense, every element of the original state undergoes a transformation with a specifically determined weight and a bias fixed across all entries. GIA enables ABS-LSTM to enhance the states of greater importance, aiding a more accurate classification. The final output of ABS-LSTM is the concatenation of the attended states: $X^g = [h^a_1; h^a_2; \ldots; h^a_n] \in \mathbb{R}^{n \times 2o}$.
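GIA for a single time step can be sketched as below (a hedged NumPy illustration of Eqs. 13-15; the choice of tanh as the nonlinearity and all variable names are our assumptions):

```python
import numpy as np

def gia(h_l, x_l, W_h, b_h, W_x, b_x, w_h2, w_x2):
    """Global Interaction Attention: the state h_l is scaled by a learned
    scalar importance (derived from the state itself) and shifted by a
    scalar bias derived from the corresponding local interaction x_l."""
    omega_h = np.tanh(W_h @ h_l + b_h)  # embeds the state's importance
    omega_x = np.tanh(W_x @ x_l + b_x)  # embeds the input's importance
    weight = w_h2 @ omega_h             # scalar weight for the state
    bias = w_x2 @ omega_x               # scalar bias for every entry
    return weight * h_l + bias

rng = np.random.default_rng(1)
o, two_o, in_dim = 2, 4, 3
h_l, x_l = rng.standard_normal(two_o), rng.standard_normal(in_dim)
h_a = gia(h_l, x_l,
          rng.standard_normal((o, two_o)), rng.standard_normal(o),
          rng.standard_normal((o, in_dim)), rng.standard_normal(o),
          rng.standard_normal(o), rng.standard_normal(o))
print(h_a.shape)  # (4,)
```

Because `bias` is a scalar broadcast over every entry, the input can only shift the state uniformly, matching the design choice discussed above.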

Emotion Inference Module
After obtaining the global interactions, the final emotion is inferred by:

$e = f(W_{e1}\,\mathrm{vec}(X^g) + b_{e1}) \quad (16)$

$I = W_{e2}\, e \quad (17)$

where $f$ contains a tanh activation function and a dropout layer with dropout rate 0.5, $W_{e1} \in \mathbb{R}^{50 \times n \cdot 2o}$, $b_{e1} \in \mathbb{R}^{50}$ and $W_{e2} \in \mathbb{R}^{N \times 50}$ are the learnable parameters, and $I \in \mathbb{R}^N$ is the final emotion inference ($N$ is the number of categories).
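At inference time (dropout disabled), the EIM reduces to two affine maps with a tanh in between; a toy sketch with our own shapes and names:

```python
import numpy as np

def eim(X_g, W_e1, b_e1, W_e2):
    """Emotion Inference Module sketch: flatten the global fused tensor
    sequence (n x 2o), apply the tanh FC layer, then project to the N
    emotion scores. Dropout (rate 0.5) is active only during training."""
    e = np.tanh(W_e1 @ X_g.ravel() + b_e1)
    return W_e2 @ e

rng = np.random.default_rng(2)
n, two_o, N = 3, 4, 2
X_g = rng.standard_normal((n, two_o))
I = eim(X_g, rng.standard_normal((50, n * two_o)),
        rng.standard_normal(50), rng.standard_normal((N, 50)))
print(I.shape)  # (2,)
```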

Experiments
Datasets
CMU-MOSI (Zadeh et al., 2016b) includes 93 videos, with each video padded to 62 utterances. We consider positive and negative sentiments in this paper. We use 49 videos for training, 13 for validation and 31 for testing. CMU-MOSEI (Zadeh et al., 2018c) has 2928 videos, and each video is padded to 98 utterances. Each utterance has been scored from two perspectives: sentiment intensity (ranging over [-3, 3]) and emotion (six classes). We consider positive, negative and neutral sentiments. We utilize 1800, 450 and 678 videos for training, validation and testing, respectively. IEMOCAP (Busso et al., 2008) contains 151 videos, and each video has at most 110 utterances. IEMOCAP contains the following labels: anger, happiness, sadness, neutral, excitement, frustration, fear, surprise and other. We take the first four emotions so as to compare with previous models. The training, validation and testing sets contain 96, 24 and 31 videos, respectively.

Experimental details
HFFN is implemented in Keras with TensorFlow as the backend. The input dimensionality k is 50 for the CMU-MOSI and CMU-MOSEI datasets, while for IEMOCAP, k is set to 100. We use RMSprop to optimize the network, with cosine proximity as the objective function. The output dimension 2o of ABS-LSTM is set to 6 for CMU-MOSI and CMU-MOSEI but 2 for IEMOCAP. Note that ABS-LSTM is activated by tanh and followed by a dropout layer.
For feature pre-extraction, our settings on the CMU-MOSI and IEMOCAP datasets are identical to those in (Poria et al., 2017b). Features are extracted from each utterance separately. For language features, a text-CNN is applied: each word is first embedded into a vector using the word2vec tool (Mikolov et al., 2013), then the vectorized representations of all words in an utterance are concatenated and processed by CNNs (Karpathy et al., 2014). For acoustic features, the open-source tool openSMILE (Eyben, 2010) is utilized to generate high-dimensional vectors comprised of low-level descriptors (LLDs). A 3D-CNN (Ji et al., 2013) is applied for visual feature pre-extraction; it learns relevant features from each frame and the alterations across consecutive frames. On the CMU-MOSEI dataset, by contrast, GloVe (Pennington et al., 2014), Facet (iMotions, 2017) and COVAREP (Degottex et al., 2014) are applied for extracting language, visual and acoustic features, respectively. Word-level alignment across modalities is performed using P2FA (Yuan and Liberman, 2008). Eventually, the unimodal features are generated as the average of their feature values over each word's time interval.
Subsequent to pre-extraction, similar to BC-LSTM (Poria et al., 2017b), we devise a Unimodal Feature Extraction Network (UFEN): $\mathbb{R}^{u \times d_j} \rightarrow \mathbb{R}^{u \times k}$, consisting of a bidirectional LSTM layer followed by a fully connected (FC) layer, for each modality. Here, u denotes the number of utterances that constitute a video and $d_j$ is the dimensionality of the raw feature vector of the $j$-th modality. Through UFEN, the feature vectors of all modalities are mapped into the same embedding space (i.e., have the same dimensionality k). The UFEN for each modality is trained individually, followed by an FC layer $\mathbb{R}^{k} \rightarrow \mathbb{R}^{N}$, using Adadelta (Zeiler, 2012) as the optimizer and categorical cross-entropy as the loss function. The processed feature vectors of each utterance are then fed into HFFN.
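As a shape-level illustration only (the real UFEN uses a bidirectional LSTM plus an FC layer; a single linear map stands in here so that the mapping from raw dimensionality to the shared space is explicit):

```python
import numpy as np

def ufen_stub(features, W_proj):
    """Toy stand-in for UFEN: maps a video's u utterance features from
    raw dimensionality d_j into the shared embedding size k. The actual
    network is a BiLSTM followed by an FC layer; this linear projection
    only demonstrates the input/output shapes."""
    return features @ W_proj  # (u, d_j) @ (d_j, k) -> (u, k)

rng = np.random.default_rng(3)
u, d_j, k = 5, 300, 50
mapped = ufen_stub(rng.standard_normal((u, d_j)),
                   rng.standard_normal((d_j, k)))
print(mapped.shape)  # (5, 50)
```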

Comparison with Baselines
We compare HFFN with the following multimodal algorithms: RMFN (Liang et al., 2018a), CMN, C-MKL (Poria et al., 2016b), CAT-LSTM (Poria et al., 2017c), and the tensor-fusion and word-level fusion baselines discussed below. As presented in Table 1, HFFN improves over typical approaches, setting a new state-of-the-art record. Compared with the tensor fusion approaches TFN, M-RRF (Barezi et al., 2018) and LMF, HFFN achieves an improvement of about 4%, which demonstrates its superiority. This is reasonable because these methods conduct tensor fusion at a holistic level and ignore modeling local interactions, while ours has a well-designed LFM module. Compared to the word-level fusion approaches such as RAVEN (Wang et al., 2019), RMFN (Liang et al., 2018a) and FAF (Gu et al., 2018b), HFFN achieves an improvement of about 2%. We argue that this is because they neglect explicitly connecting locally constrained interactions to obtain a general view of the multimodal signals, while we explore global interactions via ABS-LSTM.
The results on the IEMOCAP and CMU-MOSEI datasets are shown in Table 2 and Table 3, respectively. We can conclude from Table 2 that HFFN achieves consistent improvements in accuracy and F1 score on the IEMOCAP 4-way and individual emotion recognition tasks compared with other methods. Specifically, HFFN outperforms other methods by a significant margin on the recognition of the Angry and Neutral emotions. For the CMU-MOSEI dataset, as shown in Table 3, the accuracy of HFFN is lower than that of BC-LSTM and CAT-LSTM, but it achieves the highest F1 score by a slight margin. HFFN thus still achieves state-of-the-art performance on these two datasets.

Discussion on Modality Importance
To explore the underlying information of each modality, we carry out an experiment comparing the performance of unimodal, bimodal and trimodal models. For the unimodal models, we can infer from Table 4 that the language modality is the most predictive for emotion, outperforming the acoustic and visual modalities by a significant margin. When coupled with the acoustic and visual modalities, the trimodal HFFN performs best, with results 1%∼2% better than the language-only HFFN, indicating that the acoustic and visual modalities play auxiliary roles while language is dominant. However, in our model, when conducting the outer product, all three modalities are treated equally, which is probably not the optimal choice. In the future, we aim to develop a fusion technique that pays more attention to the language modality, with the other two modalities serving only as accessory sources of information.
Interestingly, the bimodal HFFNs do not necessarily outperform the language-only HFFN. On the contrary, combining language with the acoustic or visual modality sometimes even lowers performance. Nevertheless, when all three modalities are available, the performance is clearly the best. This indicates that a great deal of information hidden in a single modality can be interpreted only by combining all three modalities.

Comparative Analysis on Efficiency
Contrast experiments are conducted to analyze the efficiency of TFN, BC-LSTM (Poria et al., 2017b) and HFFN. We compare the number of parameters and the FLOPs after fusion (FLOPs are used to measure time complexity), and the inputs for all methods are identical to make a fair comparison. The trainable layers in TFN include two FC layers of 32 ReLU units and a decision layer $\mathbb{R}^{32} \rightarrow \mathbb{R}^{2}$; we adopt this setting to match the code released by the authors. BC-LSTM's trainable layers contain a bidirectional LSTM with input and output dimensions of 3·50 and 600 respectively, and two FC layers of 500 and 2 units respectively. Table 5 shows that in terms of parameter count, TFN is around 511 times larger than HFFN, even though we adopt a more complex module after tensor fusion, demonstrating HFFN's high efficiency. Note that if TFN adopted its original setting, where the FC layers have 128 units, it would have even more parameters than our version of TFN. Compared to BC-LSTM, HFFN has about 166 times fewer parameters, and the FLOPs of HFFN are over 79 times fewer than those of BC-LSTM. Moreover, BC-LSTM is over 6 times faster than TFN in time complexity measured by FLOPs and has over 3 times fewer parameters. These results demonstrate that the outer product applied in TFN results in heavy computational complexity and a substantial number of parameters compared with methods such as BC-LSTM, while HFFN avoids both problems and is even more efficient than approaches adopting low-complexity fusion methods.
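The efficiency gap can be sanity-checked from the fused-tensor sizes alone: holistic outer-product fusion builds one $(k+1)^3$ tensor, while HFFN builds $n$ tensors of size $(d+1)^3$. The sketch below is a back-of-the-envelope comparison, not the exact FLOP accounting behind Table 5:

```python
def fused_tensor_sizes(k, d, s):
    """Return (holistic, local) fused-tensor element counts, assuming
    (k - d) is divisible by s so that n = (k - d) / s + 1."""
    n = (k - d) // s + 1
    return (k + 1) ** 3, n * (d + 1) ** 3

# k = 50 (CMU-MOSI input size), with a 3 x 2 window and stride 2
holistic, local = fused_tensor_sizes(k=50, d=2, s=2)
print(holistic, local)  # 132651 675
```

Here the local scheme produces roughly 196 times fewer fused-tensor elements, which is consistent with the order-of-magnitude savings reported above.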

Discussion on Global Fusion
To demonstrate the superiority of ABS-LSTM in learning global interactions and the impact of the proposed attention mechanisms, we compare the performance of the model under different global fusion settings. We can infer from Table 6 that ABS-LSTM achieves the best results among all tested LSTM variants. Moreover, the vanilla LSTM achieves the lowest performance, showing the necessity of delivering information bidirectionally. The bidirectional LSTM slightly outperforms the no-attention variant of ABS-LSTM, possibly because it uses two sets of independent learnable parameters for the forward and backward passes, which allows more flexibility. However, since ABS-LSTM with attention outperforms the bidirectional LSTM, the efficacy of ABS-LSTM is demonstrated. Regarding the effectiveness of the attention mechanisms, interestingly, both RIA and GIA, when used alone, bring only slight improvements (0.2%∼0.3%) over the no-attention version of ABS-LSTM. However, performance is further boosted when RIA and GIA are used concurrently, achieving more improvement than the sum of their individual gains. This suggests a positive link between the two levels of attention: RIA can provide more refined information during transmission between local interactions, so that the output states processed by GIA are more focused on useful information and freer of noise, maximizing the effect of GIA.

Discussion on Sliding Window
To investigate the influence of the size d and stride s of the sliding window on learning local interactions, we conduct experiments on IEMOCAP where s increases from 1 to 10 and d takes four values, namely 1, 2, 5 and 10. The results are shown in Fig. 3. It can be observed that for all values of d, the accuracy fluctuates within a limited range as the stride s increases, showing robustness with respect to the stride. Overall, the model fares best when d is set to 2, demonstrating that a moderate sliding window size is important for high performance. We conjecture that the decline in performance when d is too large (greater than 2) is due to a lessened effect of local fusion, leading to a less specialized exploration of feature portions; this in turn verifies the central importance of local fusion in our strategy. In addition, an unreasonably small d may disintegrate feature correlations that could otherwise be capitalized on and scatter complete information, thus hurting overall performance. Furthermore, it is surprising that when the stride s is greater than d (so some dimensions of the feature vectors are left out of local fusion), accuracy does not suffer significantly. This suggests that there may be a good deal of redundant information in the feature vectors, implying that more advanced extraction techniques are needed for more refined representations, which we will explore in future work.

Conclusion
We propose an efficient and effective framework HFFN that adopts a novel fusion strategy called 'divide, conquer and combine'. HFFN learns local interactions at each local chunk and explores global interactions by conveying information across local interactions using ABS-LSTM that integrates two levels of attention mechanism. Our fusion strategy is generic for other concrete fusion methods. In future work, we intend to explore multiple local fusion methods within our framework.