Improving Multimodal Named Entity Recognition via Entity Span Detection with Unified Multimodal Transformer

In this paper, we study Multimodal Named Entity Recognition (MNER) for social media posts. Existing approaches for MNER mainly suffer from two drawbacks: (1) despite generating word-aware visual representations, their word representations are insensitive to the visual context; (2) most of them ignore the bias brought by the visual context. To tackle the first issue, we propose a multimodal interaction module to obtain both image-aware word representations and word-aware visual representations. To alleviate the visual bias, we further propose to leverage purely text-based entity span detection as an auxiliary module, and design a Unified Multimodal Transformer to guide the final predictions with the entity span predictions. Experiments show that our unified approach achieves the new state-of-the-art performance on two benchmark datasets.


Introduction
Recent years have witnessed the explosive growth of user-generated content on social media platforms such as Twitter. While empowering users with rich information, the flourishing of social media also creates a pressing need to automatically extract important information from this massive amount of unstructured content. As a crucial component of many information extraction tasks, named entity recognition (NER) aims to discover named entities in free text and classify them into pre-defined types, such as person (PER), location (LOC), and organization (ORG). Given its importance, NER has attracted much attention in the research community (Yadav and Bethard, 2018).

Although neural NER models (e.g., Lample et al., 2016; Ma and Hovy, 2016) have shown success in identifying entities in formal newswire text, most of them perform poorly on informal social media text (e.g., tweets) due to its short length and noisiness. To adapt existing NER models to social media, various methods have been proposed to incorporate a variety of tweet-specific features (Ritter et al., 2011; Li et al., 2012, 2014; Limsopatham and Collier, 2016). More recently, as social media posts become increasingly multimodal, several studies have proposed to exploit useful visual information to improve the performance of NER (Moon et al., 2018; Lu et al., 2018).
In this work, following this recent trend, we focus on multimodal named entity recognition (MNER) for social media posts, where the goal is to detect named entities and identify their entity types given a {sentence, image} pair. For example, in Fig. 1.a, a model is expected to recognize that Kevin Durant, Oracle Arena, and Jordan belong to the categories of person names (i.e., PER), place names (i.e., LOC), and other names (i.e., MISC), respectively.
While previous work has shown the benefit of fusing visual information into NER (Moon et al., 2018; Lu et al., 2018), it still suffers from several limitations: (1) The first obstacle lies in the non-contextualized word representations, where each word is represented by the same vector regardless of the context it occurs in. However, the meanings of many polysemous entities in social media posts often rely on their context words. Take Fig. 1.a as an example: without the context words wearing off, it is hard to figure out whether Jordan refers to a shoe brand or a person. (2) Although most existing methods focus on modeling inter-modal interactions to obtain word-aware visual representations, the word representations in their final hidden layer are still based solely on the textual context and are thus insensitive to the visual context. Intuitively, the associated image often provides more context to resolve polysemous entities, and should contribute to the final word representations (e.g., in Fig. 1.b, the image can guide the final word representations of Kian and David to be closer to persons than to animals). (3) Most previous approaches largely ignore the bias of incorporating visual information. In most social media posts, the associated image tends to highlight only one or two entities in the sentence, without mentioning the others. In these cases, directly integrating visual information will inevitably lead the model to better recognize the entities highlighted by images, but fail to identify the other entities (e.g., Oracle Arena and King of the Jungle in Fig. 1).
To address these limitations, we resort to existing pre-trained contextualized word representations, and propose a unified multimodal architecture based on the Transformer (Vaswani et al., 2017), which can effectively capture inter-modality interactions and alleviate the visual bias. Specifically, we first adopt a recently pre-trained contextualized representation model (Devlin et al., 2018) as our sentence encoder, whose multi-head self-attention mechanism can guide each word to capture the semantic and syntactic dependencies upon its context. Second, to better capture the implicit alignments between words and images, we propose a multimodal interaction (MMI) module, which essentially couples the standard Transformer layer with a cross-modal attention mechanism to produce an image-aware word representation and a word-aware visual representation for each input word, respectively. Finally, to largely eliminate the bias of the visual context, we propose to leverage text-based entity span detection as an auxiliary task, and design a unified neural architecture based on Transformer. In particular, a conversion matrix is designed to construct the correspondence between the auxiliary and the main tasks, so that the entity span information can be fully utilized to guide the final MNER predictions.
Experimental results show that our Unified Multimodal Transformer (UMT) brings consistent performance gains over several highly competitive unimodal and multimodal methods, and outperforms the state-of-the-art by a relative improvement of 3.7% and 3.8% on two benchmarks, respectively.
The main contributions of this paper can be summarized as follows:
• We propose a Multimodal Transformer model for the task of MNER, which empowers Transformer with a multimodal interaction module to capture the inter-modality dynamics between words and images. To the best of our knowledge, this is the first work to apply Transformer to MNER.
• Based on the above Multimodal Transformer, we further design a unified architecture to incorporate a text-based entity span detection module, aiming to alleviate the bias of the visual context in MNER with the guidance of entity span predictions from this auxiliary module.

Methodology
In this section, we first formulate the MNER task, and give an overview of our method. We then delve into the details of each component in our model. Task Formulation: Given a sentence S and its associated image V as input, the goal of MNER is to extract a set of entities from S, and classify each extracted entity into one of the pre-defined types.
As with most existing work on MNER, we formulate the task as a sequence labeling problem. Let S = (s_1, s_2, ..., s_n) denote the sequence of input words, and y = (y_1, y_2, ..., y_n) be the corresponding label sequence, where y_i ∈ Y and Y is the pre-defined label set under the BIO2 tagging schema (Sang and Veenstra, 1999). As shown at the bottom of Fig. 2.a, we first extract contextualized word representations and visual block representations from the input sentence and the input image, respectively.
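To make the tagging scheme concrete, here is a small hypothetical example in the spirit of Fig. 1.a (the sentence and its tokenization are ours, for illustration only):

```python
# BIO2 tagging: B-TYPE marks the first word of an entity, I-TYPE the following
# words of the same entity, and O marks words outside any entity.
sentence = ["Kevin", "Durant", "arrives", "at", "Oracle", "Arena"]
labels   = ["B-PER", "I-PER",  "O",       "O",  "B-LOC",  "I-LOC"]
```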

Overall Architecture
The right part of Fig. 2.a illustrates our Multimodal Transformer model for MNER. Specifically, a Transformer layer is first employed to derive each word's textual hidden representation. Next, a multimodal interaction (MMI) module is devised to fully capture the inter-modality dynamics between the textual hidden representations and the visual block representations. The hidden representations from MMI are then fed to a conditional random field (CRF) layer to produce the label for each word.
To alleviate the visual bias in MNER, we further stack a purely text-based ESD module in the left part of Fig. 2.a, where we feed its hidden representations to another CRF layer to predict each word's entity span label. More importantly, to utilize the ESD output for our main MNER task, we design a conversion matrix to encode the correspondence between ESD and MNER labels, so that the entity span predictions from ESD can be integrated to obtain the final MNER label for each word.

Unimodal Input Representations
Word Representations: Due to its capability of giving different representations for the same word in different contexts, we employ the recent contextualized representations from BERT (Devlin et al., 2018) as our sentence encoder. Following Devlin et al. (2018), each input sentence is preprocessed by inserting two special tokens, i.e., [CLS] at the beginning and [SEP] at the end. Formally, let S' = (s_0, s_1, ..., s_{n+1}) be the modified input sentence, where s_0 and s_{n+1} denote the two inserted tokens. Let X = (x_0, x_1, ..., x_{n+1}) be the word representations of S', where x_i is the sum of the word, segment, and position embeddings for each token s_i. As shown in the bottom left of Fig. 2.a, X is then fed to the BERT encoder to obtain C = (c_0, c_1, ..., c_{n+1}), where c_i ∈ R^d is the generated contextualized representation for x_i.
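As an illustrative sketch (not the authors' released code), the contextualized representations C could be obtained with the Hugging Face transformers implementation of the cased BERT-base model used in the experiments; all variable names are ours:

```python
import torch
from transformers import BertTokenizer, BertModel

# Load the cased BERT-base encoder (the checkpoint reported in Section 3.1).
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
bert = BertModel.from_pretrained("bert-base-cased")

sentence = "Kevin Durant arrives at Oracle Arena"
# The tokenizer inserts the special [CLS] and [SEP] tokens (s_0 and s_{n+1}).
# Note: WordPiece may split a word into sub-tokens; aligning sub-tokens back to
# words is omitted here for brevity.
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = bert(**inputs)

# C = (c_0, c_1, ..., c_{n+1}): one d-dimensional contextualized vector per token.
C = outputs.last_hidden_state   # shape: (1, n + 2, d), with d = 768
```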
Visual Representations: As one of the state-of-the-art CNN models for image recognition, the Residual Network (ResNet) (He et al., 2016) has shown its capability of extracting meaningful feature representations of the input image in its deep layers. We therefore keep the output of the last convolutional layer of a pre-trained 152-layer ResNet to represent each image, which essentially splits each input image into 7×7=49 visual blocks of the same size and represents each block with a 2048-dimensional vector. Specifically, given an input image V, we first resize it to 224×224 pixels, and obtain its visual representations from ResNet, denoted as U = (u_1, u_2, ..., u_49), where u_i is the 2048-dimensional vector representation of the i-th visual block. To project the visual representations into the same space as the word representations, we further convert U with a linear transformation: V = W_u U, where W_u ∈ R^{2048×d} is the weight matrix (bias terms are omitted throughout this paper to avoid confusion). As shown in the bottom right of Fig. 2.a, V = (v_1, v_2, ..., v_49) denotes the resulting visual representations.
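The visual feature extraction described above can be sketched as follows, assuming torchvision's pre-trained ResNet-152; the file path, variable names, and the ImageNet normalization are our assumptions:

```python
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

d = 768  # dimensionality of the word representations

# Keep everything up to the last convolutional block, i.e., drop the
# average-pooling and classification layers of ResNet-152.
resnet = models.resnet152(pretrained=True)
backbone = nn.Sequential(*list(resnet.children())[:-2]).eval()

# Linear projection into the word-representation space: V = W_u U.
W_u = nn.Linear(2048, d, bias=False)

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),          # resize the input image to 224x224 pixels
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = preprocess(Image.open("tweet_image.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    feat = backbone(image)                  # shape: (1, 2048, 7, 7)
U = feat.flatten(2).transpose(1, 2)         # shape: (1, 49, 2048), one vector per block
V = W_u(U)                                  # shape: (1, 49, d)
```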

Multimodal Transformer for MNER
In this subsection, we present our proposed Multimodal Transformer for MNER.
As illustrated on the right of Fig. 2.a, we first add a standard Transformer layer over C to obtain each word's textual hidden representation: R = (r_0, r_1, ..., r_{n+1}), where r_i ∈ R^d denotes the generated hidden representation for x_i.
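A minimal sketch of this text-only Transformer layer, using PyTorch's built-in encoder layer; the exact layer configuration (e.g., 12 attention heads) is our assumption, since it is not spelled out here:

```python
import torch.nn as nn

d, num_heads = 768, 12
# A standard Transformer encoder layer: multi-head self-attention + FFN with
# residual connections and layer normalization.
text_layer = nn.TransformerEncoderLayer(d_model=d, nhead=num_heads, batch_first=True)

R = text_layer(C)   # C: (1, n + 2, d)  ->  R = (r_0, ..., r_{n+1}): (1, n + 2, d)
```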
Motivation: While the above Transformer layer can capture which context words are more relevant to the prediction of an input word x_i, it fails to consider the associated visual context. On the one hand, due to the short length of textual contents on social media, the additional visual context may guide each word to learn a better word representation. On the other hand, since each visual block is often closely related to several input words, incorporating the visual block representations can potentially make the prediction of the related words more accurate. Inspired by these observations, we propose a multimodal interaction (MMI) module to learn an image-aware word representation and a word-aware visual representation for each word.

Image-Aware Word Representation
Cross-Modal Transformer (CMT) Layer: As shown on the left of Fig. 2.b, to learn better word representations with the guidance of associated images, we first employ an m-head cross-modal attention mechanism (Tsai et al., 2019), by treating V ∈ R^{d×49} as queries, and R ∈ R^{d×(n+1)} as keys and values:

\[
\mathrm{CA}_i(V, R) = \mathrm{softmax}\!\left(\frac{(W_{q_i} V)^{\top}(W_{k_i} R)}{\sqrt{d/m}}\right)(W_{v_i} R)^{\top},
\]
\[
\mathrm{MH\text{-}CA}(V, R) = W'\,[\mathrm{CA}_1(V, R); \ldots; \mathrm{CA}_m(V, R)]^{\top},
\]

where CA_i refers to the i-th head of cross-modal attention, and {W_{q_i}, W_{k_i}, W_{v_i}} ∈ R^{(d/m)×d} and W' ∈ R^{d×d} denote the weight matrices for the query, key, value, and multi-head attention, respectively. Next, we stack another three sub-layers on top:

\[
\widetilde{P} = \mathrm{LN}\big(V + \mathrm{MH\text{-}CA}(V, R)\big),
\]
\[
P = \mathrm{LN}\big(\widetilde{P} + \mathrm{FFN}(\widetilde{P})\big),
\]

where FFN is the feed-forward network (Vaswani et al., 2017), LN is layer normalization (Ba et al., 2016), and P = (p_1, p_2, ..., p_49) denotes the output representations of the CMT layer.

Coupled CMT Layer: However, since the visual representations are treated as queries in the above CMT layer, each generated vector p_i corresponds to the i-th visual block instead of the i-th input word. Ideally, the image-aware word representation should correspond to each word.
To address this, we propose to couple P with another CMT layer, which treats the textual representations R as queries, and P as keys and values. As shown in the top left of Fig. 2.a, this coupled CMT layer generates the final image-aware word representations, denoted by A = (a_0, a_1, ..., a_{n+1}).
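A compact sketch of one CMT layer and the coupled CMT layer, substituting PyTorch's built-in multi-head attention for the m-head cross-modal attention above; the FFN width, the activation, and all module and variable names are our assumptions:

```python
import torch.nn as nn

class CrossModalTransformerLayer(nn.Module):
    """One CMT layer: cross-modal multi-head attention + LayerNorm/FFN sub-layers."""
    def __init__(self, d=768, num_heads=12):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
        self.ln1 = nn.LayerNorm(d)
        self.ln2 = nn.LayerNorm(d)

    def forward(self, query, key_value):
        attn_out, _ = self.cross_attn(query, key_value, key_value)
        h = self.ln1(query + attn_out)       # residual connection + layer norm
        return self.ln2(h + self.ffn(h))     # feed-forward sub-layer + layer norm

cmt_v2t = CrossModalTransformerLayer()       # visual blocks attend to words
cmt_t2v = CrossModalTransformerLayer()       # words attend to the CMT output above

P = cmt_v2t(V, R)   # queries: V (1, 49, d); keys/values: R (1, n + 2, d)
A = cmt_t2v(R, P)   # image-aware word representations: (1, n + 2, d)
```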

Word-Aware Visual Representation
To obtain a visual representation for each word, it is necessary to align each word with its closely related visual blocks, i.e., assigning high/low attention weights to its related/unrelated visual blocks. Hence, as shown in the right part of Fig. 2.b, we use a CMT layer that treats R as queries and V as keys and values, which can be considered a symmetric version of the left CMT layer. It generates the word-aware visual representations, denoted by Q = (q_0, q_1, ..., q_{n+1}).
Visual Gate: As pointed out in previous studies (e.g., Lu et al., 2018), it is unreasonable to align function words such as the, of, and well with any visual block. It is therefore important to incorporate a visual gate to dynamically control the contribution of visual features. Following the practice in previous work, we design a visual gate that combines the information from the above word representations A and visual representations Q as follows:

\[
g = \sigma(W_a A + W_q Q),
\]

where {W_a, W_q} ∈ R^{d×d} are weight matrices, and σ is the element-wise sigmoid function. Based on the gate output, we obtain the final word-aware visual representations as B = g · Q, where · denotes element-wise multiplication.
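A minimal sketch of the gate, following the formula above (A and Q are the representations defined earlier; names are ours):

```python
import torch
import torch.nn as nn

d = 768
W_a = nn.Linear(d, d, bias=False)
W_q = nn.Linear(d, d, bias=False)

# g dynamically scales each word's visual features; for function words the gate
# is expected to be close to zero.
g = torch.sigmoid(W_a(A) + W_q(Q))   # (1, n + 2, d)
B = g * Q                            # final word-aware visual representations
```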

CRF Layer
To integrate the word and the visual representations, we concatenate A and B to obtain the final hidden representations H = (h_0, h_1, ..., h_{n+1}), where h_i ∈ R^{2d}. Following Lample et al. (2016), we then feed H to a standard CRF layer, which defines the probability of the label sequence y given the input sentence S and its associated image V:

\[
P(y \mid S, V) = \frac{\exp(\mathrm{score}(H, y))}{\sum_{y'} \exp(\mathrm{score}(H, y'))}, \quad (4)
\]
\[
\mathrm{score}(H, y) = \sum_{i} T_{y_i, y_{i+1}} + \sum_{i} E_{h_i, y_i}, \quad (5)
\]
\[
E_{h_i, y_i} = w_{y_i}^{\mathrm{MNER}} \cdot h_i, \quad (6)
\]

where T_{y_i, y_{i+1}} is the transition score from the label y_i to the label y_{i+1}, E_{h_i, y_i} is the emission score of the label y_i for the i-th word, and w_{y_i}^{MNER} ∈ R^{2d} is the weight parameter specific to y_i.
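For concreteness, the concatenation and CRF step might be implemented as follows; this sketch assumes the third-party pytorch-crf package rather than the authors' own CRF implementation, and all names are ours:

```python
import torch
import torch.nn as nn
from torchcrf import CRF    # pip install pytorch-crf

d = 768
num_labels = 9              # |Y| under BIO2 with 4 entity types: 2 * 4 + 1

emission_mner = nn.Linear(2 * d, num_labels)   # E_{h_i, y_i} = w_{y_i}^{MNER} . h_i
crf_mner = CRF(num_labels, batch_first=True)   # learns the transition scores T

H = torch.cat([A, B], dim=-1)                  # (1, n + 2, 2d)
emissions_mner = emission_mner(H)              # (1, n + 2, |Y|)

# Training uses the negative log-likelihood of the gold label sequence y_gold,
# and prediction uses Viterbi decoding (mask marks the real, non-special tokens):
# loss_mner = -crf_mner(emissions_mner, y_gold, mask=mask, reduction="mean")
# y_pred = crf_mner.decode(emissions_mner, mask=mask)
```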

Unified Multimodal Transformer
Motivation: Since the Multimodal Transformer presented above mainly focuses on modeling the interactions between text and images, it may lead the learnt model to overemphasize the entities highlighted by the image but ignore the remaining entities. To alleviate this bias, we propose to leverage text-based entity span detection (ESD) as an auxiliary task, based on the following observation. Since ResNet is pre-trained on ImageNet (Deng et al., 2009) for the image recognition task, its high-level representations are closely related to its final predictions, i.e., the types of the contained objects. This indicates that the visual representations from ResNet should be quite useful for identifying the types of detected entities, but are not necessarily relevant to detecting entity spans in the sentence. Therefore, we use purely text-based ESD to guide the final predictions of our main MNER task.
Auxiliary Entity Span Detection Module: Formally, we model ESD as another sequence labeling task, and use z = (z_1, ..., z_n) to denote the sequence of labels, where z_i ∈ Z and Z = {B, I, O}.
As shown in the left part of Fig. 2.a, we employ another Transformer layer to obtain its task-specific hidden representations T = (t_0, t_1, ..., t_{n+1}), which are fed to a CRF layer to predict the probability of the label sequence z given S:

\[
P(z \mid S) = \frac{\exp(\mathrm{score}(T, z))}{\sum_{z'} \exp(\mathrm{score}(T, z'))},
\]

where the transition and emission scores are defined analogously to Eqns. (4)-(6), the emission score of each word is computed as w_{z_i}^{ESD} · t_i, and w_{z_i}^{ESD} ∈ R^d is the weight parameter specific to z_i.

Conversion Matrix: Although ESD is modeled as an auxiliary task separate from MNER, the two tasks are highly correlated, since each ESD label corresponds to only a subset of MNER labels. For example, given the sentence in Fig. 2.a, if the first token is predicted to be the beginning of an entity in ESD (i.e., to have the label B), it should also be the beginning of a typed entity in MNER (e.g., have the label B-PER).

Table 1: Statistics of the two datasets: the number of entities of each type and the number of multimodal tweets in the train/dev/test splits.

                    TWITTER-2015            TWITTER-2017
Entity Type         Train   Dev    Test     Train   Dev    Test
Person              2217    552    1816     2943    626    621
Location            2091    522    1697     731     173    178
Organization        928     247    839      1674    375    395
Miscellaneous       940     225    726      701     150    157
Total               6176    1546   5078     6049    1324   1351
Num of Tweets       4000    1000   3257     3373    723    723
To encode such inter-task correspondence, we propose to use a conversion matrix W^c ∈ R^{|Z|×|Y|}, where each element W^c_{j,k} defines the conversion probability from Z_j to Y_k. Since we have some prior knowledge (e.g., the label B can only convert to a label in the subset {B-PER, B-LOC, B-ORG, B-MISC}), we initialize W^c as follows: if Z_j does not correspond to Y_k, W^c_{j,k} is set to 0; otherwise, W^c_{j,k} is set to 1/|C_j|, where C_j denotes the subset of Y that corresponds to Z_j.
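A small sketch of this initialization (the label orderings are ours):

```python
import torch

esd_labels = ["B", "I", "O"]
mner_labels = ["B-PER", "B-LOC", "B-ORG", "B-MISC",
               "I-PER", "I-LOC", "I-ORG", "I-MISC", "O"]

# W_c[j, k] = 1 / |C_j| if the MNER label Y_k is compatible with the ESD label Z_j,
# and 0 otherwise.
W_c = torch.zeros(len(esd_labels), len(mner_labels))
for j, z in enumerate(esd_labels):
    compatible = [k for k, y in enumerate(mner_labels)
                  if (z == "O" and y == "O") or (z != "O" and y.startswith(z + "-"))]
    W_c[j, compatible] = 1.0 / len(compatible)
# Result: B and I rows each spread 0.25 over their four typed labels; O maps to O with 1.0.
```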
Modified CRF Layer for MNER: After obtaining the conversion matrix, we further propose to fully leverage the text-based entity span predictions to guide the final predictions of MNER. Specifically, we modify the CRF layer for MNER by incorporating the entity span information from ESD into the emission score defined in Eqn. (6).
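The exact form of the modified emission score is not reproduced in this excerpt; one plausible instantiation, which is our assumption rather than the paper's exact formula, weights the MNER emission by the converted ESD span probabilities:

\[
\hat{E}_{h_i, Y_k} \;=\; E_{h_i, Y_k} \cdot \sum_{j=1}^{|\mathcal{Z}|} P(z_i = Z_j \mid S)\, W^{c}_{j,k},
\]

so that an MNER label receives a high emission score only when the auxiliary ESD module also assigns high probability to a compatible span label.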

Model Training
Given a set of manually labeled training samples D = {S_j, V_j, y_j, z_j}_{j=1}^{N}, our overall training objective is a weighted sum of the sentence-level negative log-likelihood losses of our main MNER task and the auxiliary ESD task:

\[
\mathcal{L} = -\sum_{j=1}^{N} \Big( \log P(y_j \mid S_j, V_j) + \lambda \log P(z_j \mid S_j) \Big),
\]

where λ is a hyperparameter that controls the contribution of the auxiliary ESD module.
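Continuing the earlier CRF sketch, the joint objective could be computed roughly as follows (crf_esd, emissions_esd, z_gold, and mask are hypothetical variables analogous to their MNER counterparts; λ = 0.5 as reported in Section 3.1):

```python
lam = 0.5   # trade-off hyperparameter for the auxiliary ESD loss

# Sentence-level negative log-likelihoods from the two CRF layers.
loss_mner = -crf_mner(emissions_mner, y_gold, mask=mask, reduction="mean")
loss_esd = -crf_esd(emissions_esd, z_gold, mask=mask, reduction="mean")

loss = loss_mner + lam * loss_esd
loss.backward()
```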

Experiments
We conduct experiments on two multimodal NER datasets, comparing our Unified Multimodal Transformer (UMT) with a number of unimodal and multimodal approaches.

Experiment Settings
Datasets: We take two publicly available Twitter datasets respectively constructed by Zhang et al. (2018) and Lu et al. (2018) for MNER. Since the two datasets mainly include multimodal user posts published on Twitter during 2014-2015 and 2016-2017, we denote them as TWITTER-2015 and TWITTER-2017, respectively. Table 1 shows the number of entities of each type and the number of multimodal tweets in the training, development, and test sets of the two datasets. We have released the two preprocessed datasets for research purposes via this link: https://github.com/jefferyYu/UMT.
Hyperparameters: For each unimodal and multimodal approach compared in the experiments, the maximum length of the sentence input and the batch size are set to 128 and 16, respectively. For our UMT approach, most hyperparameter settings follow Devlin et al. (2018), with the following exceptions: (1) the word representations C are initialized with the cased BERT-base model pre-trained by Devlin et al. (2018) and fine-tuned during training; (2) we employ a pre-trained 152-layer ResNet to initialize the visual representations U and keep them fixed during training; (3) the number of cross-modal attention heads is set to m = 12; (4) the learning rate, the dropout rate, and the trade-off parameter λ are set to 5e-5, 0.1, and 0.5, respectively, which achieve the best performance on the development sets of both datasets via a small grid search over [1e-5, 1e-4], [0.1, 0.5], and [0.1, 0.9].
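For reference, these settings can be collected into a single configuration (the key names are ours; the values are taken from this subsection):

```python
config = {
    "max_seq_length": 128,
    "batch_size": 16,
    "text_encoder": "bert-base-cased",   # fine-tuned during training
    "image_encoder": "resnet-152",       # frozen during training
    "num_cross_modal_heads": 12,         # m
    "learning_rate": 5e-5,
    "dropout": 0.1,
    "lambda_esd": 0.5,                   # trade-off for the auxiliary ESD loss
}
```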

Compared Systems
To demonstrate the effectiveness of our Unified Multimodal Transformer (UMT) model, we first consider a number of representative text-based approaches for NER: (1) BiLSTM-CRF, a pioneering study which eliminates the heavy reliance on hand-crafted features and simply employs a bidirectional LSTM followed by a CRF layer for each word's final prediction; (2) CNN-BiLSTM-CRF (Ma and Hovy, 2016), a widely adopted neural network model for NER, which improves BiLSTM-CRF by replacing each word's word embedding with the concatenation of its word embedding and CNN-based character-level word representations; (3) HBiLSTM-CRF (Lample et al., 2016), an end-to-end hierarchical LSTM architecture, which replaces the bottom CNN layer in CNN-BiLSTM-CRF with an LSTM layer to obtain the character-level word representations; (4) BERT (Devlin et al., 2018), a multi-layer bidirectional Transformer encoder, which gives contextualized representations for each word, followed by a softmax layer for the final predictions; (5) BERT-CRF, a variant of BERT that replaces the softmax layer with a CRF layer.
Besides, we also consider several competitive multimodal approaches for MNER: (1) GVATT-HBiLSTM-CRF (Lu et al., 2018), a state-of-the-art approach for MNER, which integrates HBiLSTM-CRF with the visual context by proposing a visual attention mechanism followed by a visual gate to obtain word-aware visual representations; (2) AdaCAN-CNN-BiLSTM-CRF (Zhang et al., 2018), another state-of-the-art approach based on CNN-BiLSTM-CRF, which designs an adaptive co-attention network to induce word-aware visual representations for each word; (3) GVATT-BERT-CRF and AdaCAN-BERT-CRF, our two variants of the above two multimodal approaches, which replace the sentence encoder with BERT; (4) MT-BERT-CRF, our Multimodal Transformer model introduced in Section 2.3; (5) UMT-BERT-CRF, our unified architecture that incorporates the auxiliary entity span detection module into the Multimodal Transformer, as introduced in Section 2.4. All the neural models are implemented with PyTorch, and all the experiments are conducted on NVIDIA RTX 2080 Ti GPUs.

Main Results
In Table 2, we report the precision (P), recall (R), and F1 score (F1) achieved by each compared method on our two Twitter datasets.
First, comparing all the text-based approaches, we can clearly observe that BERT outperforms the other compared methods by a significant margin on both datasets. Moreover, empowering BERT with a CRF layer further boosts the performance. These observations indicate that contextualized word representations are indeed quite helpful for the NER task on social media text, owing to their context-aware characteristics. This agrees with our first motivation.
Second, comparing the state-of-the-art multimodal approaches with their corresponding unimodal baselines, we can find that the multimodal approaches can generally achieve better performance, which demonstrates that incorporating the visual context is generally useful for NER. Besides, we can see that although GVATT-HBiLSTM-CRF and AdaCAN-CNN-BiLSTM-CRF can significantly outperform their unimodal baselines, the performance gains become relatively limited when replacing their sentence encoder with BERT. This suggests the challenge and the necessity of proposing a more effective multimodal approach.
Third, in comparison with the two existing multimodal methods, our Multimodal Transformer MT-BERT-CRF outperforms the state-of-the-art by 2.5% and 2.8%, respectively, and also achieves better performance than their BERT variants. We conjecture that the performance gains mainly come from the following reason: the two multimodal methods only focus on obtaining word-aware visual representations, whereas our MT-BERT-CRF approach aims at generating both image-aware word representations and word-aware visual representations for each word. These observations are in line with our second motivation.

Figure 3: The number of entities (shown on the y-axis) that are incorrectly predicted by BERT-CRF but corrected by each multimodal method.

Figure 4: The number of entities (shown on the y-axis) that are correctly predicted by BERT-CRF but wrongly predicted by each multimodal method.
Finally, comparing all the unimodal and multimodal approaches, we can clearly observe that our Unified Multimodal Transformer (i.e., UMT-BERT-CRF) achieves the best performance on both datasets, outperforming the second-best methods by 1.14% and 1.05%, respectively. This demonstrates the usefulness of the auxiliary entity span detection module, and indicates that it can help our Multimodal Transformer alleviate the bias brought by the associated images, which agrees with our third motivation.

Ablation Study
To investigate the effectiveness of each component in our Unified Multimodal Transformer (UMT) architecture, we compare the full UMT model with its ablations with respect to the auxiliary entity span detection (ESD) module and the multimodal interaction (MMI) module.
As shown in Table 3, all the components in UMT make important contributions to the final results. On the one hand, removing the whole ESD module significantly degrades the performance, which shows the importance of alleviating the visual bias. In particular, discarding the conversion matrix in the ESD module also leads to a performance drop, which indicates the usefulness of capturing the label correspondence between the auxiliary module and our main MNER task.
On the other hand, as the main contribution of

Further Analysis
Importance of the MMI and ESD Modules: To better appreciate the importance of the two main contributions (i.e., the MMI and ESD modules) in our proposed approaches, we conduct additional analysis on our two test sets. In Fig. 3 and Fig. 4, we show the number of entities that are wrongly/correctly predicted by BERT-CRF but correctly/wrongly predicted by each multimodal method. First, we can see from Fig. 3 that, with the MMI module, our MT-BERT-CRF and UMT-BERT-CRF approaches correctly identify more entities than the two multimodal baselines. Table 4.A shows a specific example: our two methods correctly classify the type of Wolf Hall as MISC, whereas the compared systems wrongly predict its type as LOC, probably because our MMI module pushes the image-aware word representations of Wolf Hall closer to drama names.
Second, in Fig. 4, it is clear that, compared with the other three methods, UMT-BERT-CRF significantly reduces the bias brought by the visual context, owing to the incorporation of our auxiliary ESD module. In Table 4.B, we show a concrete example: since Game of Thrones is ignored by the image, the two multimodal baselines fail to identify it; in contrast, with the help of the auxiliary ESD module, UMT-BERT-CRF successfully eliminates the bias.
Effect of Incorporating Images: To obtain a better understanding of the general effect of incorporating associated images into our MNER task, we carefully examine our test sets and choose two representative test samples to compare the prediction results of different approaches.
First, we observe that most improvements gained by multimodal methods come from those samples where the textual contents are informal or incomplete but the visual context provides useful clues. For example, in Table 4.C, we can see that without the visual context, BERT-CRF fails to identify that the two entities refer to two singers in the concert, but all the multimodal approaches can correctly classify their types after incorporating the image.
Second, by manually checking the test sets of our two datasets, we find that in around 5% of the social media posts, the associated images might be irrelevant to the textual contents for two kinds of reasons: (1) these posts contain image memes, cartoons, or metaphorical photos; (2) their images and textual contents reflect different aspects of the same event. In such cases, we observe that multimodal approaches generally perform worse than BERT-CRF. A specific example is given in Table 4.D, where all the multimodal methods wrongly classify Siri as PER because of the unrelated face in the image.

Related Work
As a crucial component of many information extraction tasks including entity linking (Derczynski et al., 2015), opinion mining (Maynard et al., 2012), and event detection (Ritter et al., 2012), named entity recognition (NER) has attracted much attention in the research community in the past two decades (Li et al., 2018).
Methods for NER: In the literature, various supervised learning approaches have been proposed for NER. Traditional approaches typically focus on designing various effective NER features, followed by feeding them to different linear classifiers such as maximum entropy, conditional random fields (CRFs), and support vector machines (Chieu and Ng, 2002; Florian et al., 2003; Finkel et al., 2005; Ratinov and Roth, 2009; Lin and Wu, 2009; Passos et al., 2014; Luo et al., 2015). To reduce the feature engineering efforts, a number of recent studies proposed to couple different neural network architectures with a CRF layer (Lafferty et al., 2001) for word-level predictions, including convolutional neural networks (Collobert et al., 2011), recurrent neural networks (Chiu and Nichols, 2016; Lample et al., 2016), and their hierarchical combinations (Ma and Hovy, 2016). These neural approaches have been shown to achieve state-of-the-art performance on different benchmark datasets based on formal text (Yang et al., 2018).
However, when these approaches are applied to social media text, most of them fail to achieve satisfactory results. To address this issue, many studies proposed to exploit external resources (e.g., shallow parsers, Freebase dictionaries, and orthographic characteristics) to incorporate a set of tweet-specific features into both traditional approaches (Ritter et al., 2011; Li et al., 2014; Baldwin et al., 2015) and recent neural approaches (Limsopatham and Collier, 2016; Lin et al., 2017), which obtain much better performance on social media text.
Methods for Multimodal NER (MNER): As multimodal data becomes increasingly popular on social media platforms, several recent studies have focused on the MNER task, where the goal is to leverage the associated images to better identify the named entities contained in the text. Specifically, Moon et al. (2018) proposed a multimodal NER network with modality attention to fuse the textual and visual information. To model the inter-modal interactions and filter out the noise in the visual context, Zhang et al. (2018) and Lu et al. (2018) respectively proposed an adaptive co-attention network and a gated visual attention mechanism for MNER. In this work, we follow this line of research, but different from these studies, we aim to propose an effective multimodal method based on the recent Transformer architecture (Vaswani et al., 2017). To the best of our knowledge, this is the first work to apply Transformer to the task of MNER.

Conclusion
In this paper, we first presented a Multimodal Transformer architecture for the task of MNER, which captures the inter-modal interactions with a multimodal interaction module. Moreover, to alleviate the bias of the visual context, we further proposed a Unified Multimodal Transformer (UMT), which incorporates an entity span detection module to guide the final predictions for MNER. Experimental results show that our UMT approach can consistently achieve the best performance on two benchmark datasets.
There are several future directions for this work. On the one hand, despite bringing performance improvements over existing MNER methods, our UMT approach still fails to perform well on social media posts with unmatched text and images, as analyzed in Section 3.5. Therefore, our next step is to enhance UMT so as to dynamically filter out the potential noise from images. On the other hand, since the size of existing MNER datasets is relatively small, we plan to leverage the large amount of unlabeled social media posts in different platforms, and propose an effective framework to combine them with the small amount of annotated data to obtain a more robust MNER model.