Multimodal Named Entity Recognition for Short Social Media Posts

We introduce a new task called Multimodal Named Entity Recognition (MNER) for noisy user-generated data such as tweets or Snapchat captions, which comprise short text with accompanying images. These social media posts often come in inconsistent or incomplete syntax and lexical notations with very limited surrounding textual contexts, bringing significant challenges for NER. To this end, we create a new dataset for MNER called SnapCaptions (Snapchat image-caption pairs submitted to public and crowd-sourced stories with fully annotated named entities). We then build upon the state-of-the-art Bi-LSTM word/character based NER models with 1) a deep image network which incorporates relevant visual context to augment textual information, and 2) a generic modality-attention module which learns to attenuate irrelevant modalities while amplifying the most informative ones to extract contexts from, adaptive to each sample and token. The proposed MNER model with modality attention significantly outperforms the state-of-the-art text-only NER models by successfully leveraging provided visual contexts, opening up potential applications of MNER on myriads of social media platforms.


Introduction
Social media with abundant user-generated posts provide a rich platform for understanding events, opinions and preferences of groups and individuals. These insights are primarily hidden in unstructured forms of social media posts, such as in free-form text or images without tags. Named entity recognition (NER), the task of recognizing named entities from free-form text, is thus a critical step for building structural information, allowing for its use in personalized assistance, recommendations, advertisement, etc.
Figure 1: (a) Visual contexts help recognize polysemous entity names ('Monopoly' as in a board game versus an economics term). (b) Modality attention successfully suppresses word embeddings of an unknown token ('Marshmelloooo' with erroneous trailing 'o's), and focuses on character-based context (e.g. capitalized first letter, and lexical similarity to a known named entity ('Marshmello', a music producer)) for correct prediction.
While previous work (Chiu and Nichols, 2015; Lafferty et al., 2001) on NER has shown success for well-formed text in recognizing named entities via word context resolution (e.g. LSTM with word embeddings) combined with character-level features (e.g. CharLSTM/CNN), several additional challenges remain for recognizing named entities from the extremely short and coarse text found in social media posts. For instance, short social media posts often do not provide enough textual context to resolve polysemous entities (e.g. "monopoly is da best", where 'monopoly' may refer to a board game (named entity) or a term in economics). In addition, noisy text includes a huge number of unknown tokens due to inconsistent lexical notations and frequent mentions of various newly trending entities (e.g. "xoxo Marshmelloooo", where 'Marshmelloooo' is a misspelling of a known entity 'Marshmello', a music producer), making neural NER models that rely on word embeddings vulnerable.
To address the challenges above for social media posts, we build upon the state-of-the-art neural architecture for NER with the following two novel approaches (Figure 1). First, we propose to leverage auxiliary modalities for additional context resolution of entities. For example, many popular social media platforms now provide ways to compose a post in multiple modalities, specifically image and text (e.g. Snapchat captions, Twitter posts with image URLs), from which we can obtain additional context for understanding posts. While "monopoly" in the previous example is ambiguous in its textual form, an accompanying snap image of a board game can help disambiguate among polysemous entities, thereby correctly recognizing it as a named entity.
Second, we also propose a general modality attention module which chooses, per decoding step, the most informative modality among the available ones (in our case, word embeddings, character embeddings, or visual features) to extract context from. For example, the modality attention module lets the decoder attenuate the word-level signals for unknown word tokens (e.g. 'Marshmelloooo' with trailing 'o's) and amplify character-level features instead (e.g. capitalized first letter, lexical similarity to another known named entity token 'Marshmello', etc.), thereby suppressing noisy information (the "UNK" token embedding) in decoding steps. Note that most of the previous literature in NER or other NLP tasks combines word and character-level information with naive concatenation, which is vulnerable to noisy social media posts. When an auxiliary image is available, the modality attention module learns to amplify this visual context (e.g. in disambiguating polysemous entities) or to attenuate visual contexts when they are irrelevant to the target named entities (e.g. selfies). Note that the proposed modality attention module is distinct from how attention is used in other sequence-to-sequence literature (e.g. attending to a specific token within an input sequence). Section 2 provides the detailed literature review.
Our contributions are three-fold: we propose (1) an LSTM-CNN hybrid multimodal NER network that takes as input both image and text for recognition of a named entity in text input. To the best of our knowledge, our approach is the first work to incorporate visual contexts for named entity recognition tasks. (2) We propose a general modality attention module that selectively chooses modalities to extract primary context from, maximizing information gain and suppressing irrelevant contexts from each modality (we treat words, characters, and images as separate modalities). (3) We show that the proposed approaches outperform the state-of-the-art NER models (both with and without using additional visual contexts) on our new MNER dataset SnapCaptions, a large collection of informal and extremely short social media posts paired with unique images.

Related Work
Neural models for NER have been recently proposed, producing state-of-the-art performance on standard NER tasks. For example, some of the end-to-end NER systems (Passos et al., 2014; Chiu and Nichols, 2015; Lample et al., 2016; Ma and Hovy, 2016) use a recurrent neural network, usually with a CRF (Lafferty et al., 2001; McCallum and Li, 2003) for sequence labeling, accompanied by feature extractors for words and characters (CNN, LSTMs, etc.), and achieve state-of-the-art performance mostly without any use of gazetteer information. Note that most of these works aggregate textual contexts via concatenation of word embeddings and character embeddings. Recently, several works have addressed the NER task specifically on noisy short text segments such as Tweets (Baldwin et al., 2015; Aguilar et al., 2017). They report performance gains from leveraging external sources of information such as lexical information (e.g. POS tags) and/or from several preprocessing steps (e.g. token substitution). Our model builds upon these state-of-the-art neural models for NER tasks, and improves the model in two critical ways: (1) incorporation of visual contexts to provide auxiliary information for short media posts, and (2) addition of the modality attention module, which better incorporates word embeddings and character embeddings, especially when there are many missing tokens in the given word embedding matrix. Note that we do not explore the use of gazetteer information or other auxiliary information (POS tags, etc.) (Ratinov and Roth, 2009) as it is not the focus of our study.
Attention modules are widely applied in several deep learning tasks (Chan et al., 2015; Sukhbaatar et al., 2015; Yao et al., 2015). For example, these works use an attention module to attend to a subset within a single input (a part/region of an image, a specific token in an input sequence of tokens, etc.) at each decoding step in an encoder-decoder framework for tasks such as image captioning. Rei et al. (2016) explore various attention mechanisms in NLP tasks, but do not incorporate visual components or investigate the impact of such models on noisy social media data. Moon and Carbonell (2017) propose to use attention for a subset of discrete source samples in transfer learning settings. Our modality attention differs from the previous approaches in that we attenuate or amplify each modality input as a whole among multiple available modalities, and that we use the attention mechanism essentially to map heterogeneous modalities into a single joint embedding space. Our approach also allows for reuse of the same model for predicting labels even when some of the modalities are missing in the input, as the other modalities would still preserve the same semantics in the embedding space.
Multimodal learning is studied in various domains and applications, aimed at building a joint model that extracts contextual information from multiple modalities (views) of parallel datasets.
The most relevant task to our multimodal NER system is the task of multimodal machine translation (Elliott et al., 2015;Specia et al., 2016), which aims at building a better machine translation system by taking as input a sentence in a source language as well as a corresponding image. Several standard sequence-to-sequence architectures are explored (e.g. a target-language LSTM decoder that takes as input an image first).
Other previous literature includes the study of Canonical Correlation Analysis (CCA) (Dhillon et al., 2011) to learn feature correlations among multiple modalities, which is widely used in many applications. Other applications include image captioning, audio-visual recognition (Moon et al., 2015), visual question answering systems (Antol et al., 2015), etc.
To the best of our knowledge, our approach is the first work to incorporate visual contexts for named entity recognition tasks.

Proposed Methods
We first obtain word embeddings, character embeddings, and visual features (Section 3.1). A Bi-LSTM-CRF model then takes as input a sequence of tokens, each of which comprises a word token, a character sequence, and an image, in their respective representations (Section 3.2). At each decoding step, representations from each modality are combined via the modality attention module to produce an entity label for each token (Section 3.3). We formulate each component of the model in the following subsections.
Notations: Let $x = \{x_t\}_{t=1}^{T}$ be a sequence of input tokens of length $T$, with a corresponding label sequence $y = \{y_t\}_{t=1}^{T}$ indicating named entities (e.g. in the standard BIO format). Each input token is composed of three modalities: $x_t = \{x_t^{(w)}, x_t^{(c)}, x_t^{(v)}\}$ for the word embedding, character embedding, and visual embedding representations, respectively.
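To make the label notation concrete, the following Python sketch encodes entity spans as a BIO label sequence aligned with the tokens of a caption. The helper name `spans_to_bio`, the span format, and the example annotation (including the LOC type and the span boundaries) are illustrative assumptions, not part of the SnapCaptions annotation guidelines.

```python
from typing import List, Tuple

# Hypothetical span format: (start_index, end_index_exclusive, entity_type).
Span = Tuple[int, int, str]


def spans_to_bio(num_tokens: int, spans: List[Span]) -> List[str]:
    """Convert entity spans into a BIO label sequence y_1..y_T."""
    labels = ["O"] * num_tokens
    for start, end, entity_type in spans:
        if not 0 <= start < end <= num_tokens:
            raise ValueError(f"span ({start}, {end}) is out of range")
        labels[start] = f"B-{entity_type}"      # first token of the entity
        for t in range(start + 1, end):
            labels[t] = f"I-{entity_type}"      # remaining tokens of the entity
    return labels


tokens = ["Beautiful", "night", "atop", "The", "Space", "Needle"]
labels = spans_to_bio(len(tokens), [(3, 6, "LOC")])  # illustrative annotation
print(list(zip(tokens, labels)))
```

Each token $x_t$ then carries one label $y_t$, so the sequence model only needs to emit one tag per decoding step.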

Features
Similar to the state-of-the-art NER approaches (Lample et al., 2016; Ma and Hovy, 2016; Aguilar et al., 2017; Passos et al., 2014; Chiu and Nichols, 2015), we use both word embeddings and character embeddings. Word embeddings are obtained from an unsupervised learning model that learns co-occurrence statistics of words from a large external corpus, yielding word embeddings as distributional semantics (Mikolov et al., 2013). Specifically, we use pre-trained embeddings from GloVe (Pennington et al., 2014).
Character embeddings are obtained from a Bi-LSTM which takes as input a sequence of characters of each token, similarly to (Lample et al., 2016). An alternative approach for obtaining character embeddings is using a convolutional neural network as in (Ma and Hovy, 2016), but we find that Bi-LSTM representation of characters yields empirically better results in our experiments.
Visual embeddings: To extract features from an image, we take the final hidden layer representation of a modified version of the convolutional network model called Inception (GoogLeNet) (Szegedy et al., 2014, 2015) trained on the ImageNet dataset (Russakovsky et al., 2015) to classify multiple objects in the scene. Our implementation of the Inception model is 22 layers deep, training of which is made possible via "network in network" principles and several dimension reduction techniques to improve computing resource utilization. The final layer representation encodes discriminative information describing what objects are shown in an image, which provides auxiliary contexts for understanding textual tokens and entities in accompanying captions.
Incorporating this visual information into the traditional NER system is an open challenge, and multiple approaches can be considered. For instance, one may provide visual contexts only as an initial input to the decoder, as in some encoder-decoder image captioning systems. However, we empirically observe that an NER decoder which takes as input the visual embeddings at every decoding step (Section 3.2), combined with the modality attention module (Section 3.3), yields better results.
Lastly, we add a transform layer for each feature e.g. x

Bi-LSTM + CRF for Multimodal NER
Our MNER model is built on a Bi-LSTM and CRF hybrid model. At each step, the entity Bi-LSTM takes as input $\bar{x}_t$, a weighted average of the three modalities $\{x_t^{(w)}, x_t^{(c)}, x_t^{(v)}\}$ computed by the modality attention module, which is defined in Section 3.3. Bias terms for gates are omitted for simplicity of notation.
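For reference, a standard LSTM cell with gate biases omitted can be written as below, where $\bar{x}_t$ denotes the attended input at step $t$, $\sigma$ is the logistic sigmoid, and $\odot$ is element-wise multiplication. This is the generic textbook form, offered only as a sketch under assumed notation; the exact variant used for the entity Bi-LSTM (for example, coupled gates or peephole connections) may differ.

```latex
\begin{align*}
i_t &= \sigma\left(W_{xi}\,\bar{x}_t + W_{hi}\,h_{t-1}\right) \\
f_t &= \sigma\left(W_{xf}\,\bar{x}_t + W_{hf}\,h_{t-1}\right) \\
o_t &= \sigma\left(W_{xo}\,\bar{x}_t + W_{ho}\,h_{t-1}\right) \\
\tilde{c}_t &= \tanh\left(W_{xc}\,\bar{x}_t + W_{hc}\,h_{t-1}\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{align*}
```

The backward direction applies the same recurrence to the reversed token sequence with its own parameters.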
We then obtain bi-directional entity token representations $\overleftrightarrow{h}_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$ by concatenating the left and right context representations of each token. To enforce structural correlations between labels in sequence decoding, $\overleftrightarrow{h}_t$ is then passed to a conditional random field (CRF), which produces a label for each token by maximizing an objective defined over potential functions. Here $\psi_t(y', y; \overleftrightarrow{h})$ is a potential function, and $W_{\text{CRF}}$ is the set of parameters that defines the potential functions and weight vectors for label pairs $(y', y)$. Bias terms are omitted for brevity of formulation.
The model can be trained via log-likelihood maximization over the training set $\{(x_i, y_i)\}$.
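As a sketch of the CRF objective and its training criterion, the standard linear-chain CRF defines the conditional probability of a label sequence as a normalized product of potentials, training maximizes the summed log-likelihood, and decoding picks the most probable sequence. The notation below ($y_{t-1}$ and $y_t$ for adjacent labels, $\mathcal{Y}$ for the set of all label sequences, $\overleftrightarrow{h}_i$ for the Bi-LSTM outputs of the $i$-th example) is assumed for illustration and follows the generic formulation of Lafferty et al. (2001); it is not a confirmed reproduction of the exact equations used here.

```latex
\begin{align*}
p(y \mid \overleftrightarrow{h}; W_{\text{CRF}}) &=
  \frac{\prod_{t} \psi_t(y_{t-1}, y_t; \overleftrightarrow{h})}
       {\sum_{y' \in \mathcal{Y}} \prod_{t} \psi_t(y'_{t-1}, y'_t; \overleftrightarrow{h})} \\
L(W_{\text{CRF}}) &= \sum_{i} \log p(y_i \mid \overleftrightarrow{h}_i; W_{\text{CRF}}) \\
y^{*} &= \operatorname*{argmax}_{y \in \mathcal{Y}} \; p(y \mid \overleftrightarrow{h}; W_{\text{CRF}})
\end{align*}
```

In practice the normalizer is computed with the forward algorithm and the argmax with Viterbi decoding, both of which are linear in the sequence length.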

Modality Attention
The modality attention module learns a unified representation space for multiple available modalities (e.g. words, characters, images, etc.), and produces a single vector representation with aggregated knowledge among multiple modalities, based on their weighted importance. We motivate this module from the following observations. A majority of the previous literature combines the word and character-level contexts by simply concatenating the word and character embeddings at each decoding step, e.g. $x_t = [x_t^{(w)}; x_t^{(c)}]$ in Eq. 1. However, this naive concatenation of two modalities (word and characters) results in inaccurate decoding, specifically for unknown word token embeddings (e.g. when an all-zero vector is assigned as $x_t^{(w)}$, yielding $[0; x_t^{(c)}]$). While this concatenation approach does not cause significant errors for well-formatted text, we observe that it induces performance degradation for our social media post datasets, which contain a significant number of missing tokens.
Similarly, naive merging of textual and visual information (e.g. $x_t = [x_t^{(w)}; x_t^{(c)}; x_t^{(v)}]$) yields suboptimal results as each modality is treated as equally informative, whereas in our datasets some of the images may contain contexts irrelevant to the textual modalities. Hence, ideally there should be a mechanism by which the model can effectively switch modalities on and off, adaptively for each sample.
To this end, we propose a general modality attention module, which adaptively attenuates or emphasizes each modality as a whole at each decoding step $t$, and produces a soft-attended context vector $\bar{x}_t$ as an input token for the entity LSTM. Here $[a_t^{(w)}; a_t^{(c)}; a_t^{(v)}] \in \mathbb{R}^3$ is an attention vector at each decoding step $t$, and $\bar{x}_t$ is the final context vector at $t$ that maximizes information gain for $x_t$. Note that the optimization of the objective function (Eq. 1) with modality attention (Eq. 4) requires each modality to have the same dimension (i.e. $x_t^{(w)}$, $x_t^{(c)}$, and $x_t^{(v)}$ must be vectors of equal length), and that the transformation via $W_m$ essentially enforces each modality to be mapped into the same unified subspace, where the weighted average of the modalities encodes discriminative features for recognition of named entities.
When visual context is not provided with each token (as in the traditional NER task), we can define the modality attention for word and character embeddings only, in a similar way.
Note that while we apply this modality attention module to the Bi-LSTM+CRF architecture (Section 3.2) for its empirical superiority, the module itself is flexible and thus can work with other NER architectures or for other multimodal applications.
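The mechanism described above can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the shapes of `W_m` and `b_m`, and the choice of a sigmoid followed by a softmax for scoring are all assumptions made for the example. It scores each modality from the concatenated input, normalizes the scores into weights, and returns the weighted average of the equally sized modality vectors.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores: Sequence[float]) -> List[float]:
    peak = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def modality_attention(
    modalities: Sequence[Sequence[float]],    # e.g. [x_w, x_c, x_v], each of length d
    W_m: Sequence[Sequence[float]],           # assumed shape: M x (M * d)
    b_m: Sequence[float],                     # assumed shape: M
) -> Tuple[List[float], List[float]]:
    """Return (attention weights, attended context vector) for one token."""
    dim = len(modalities[0])
    if any(len(vec) != dim for vec in modalities):
        raise ValueError("all modalities must share the same dimension")
    concat = [value for vec in modalities for value in vec]
    # One score per modality, computed from the concatenated input (assumed form).
    scores = [
        sigmoid(sum(w * v for w, v in zip(row, concat)) + bias)
        for row, bias in zip(W_m, b_m)
    ]
    weights = softmax(scores)
    # Weighted average of the modality vectors, element by element.
    context = [
        sum(a * vec[i] for a, vec in zip(weights, modalities)) for i in range(dim)
    ]
    return weights, context


x_w = [0.0, 0.0]                      # all-zero word embedding of an unknown token
x_c = [1.0, 2.0]                      # toy character-level representation
x_v = [3.0, 4.0]                      # toy visual representation
W_m = [[0.0] * 6 for _ in range(3)]   # untrained (all-zero) parameters
b_m = [0.0, 0.0, 0.0]
weights, context = modality_attention([x_w, x_c, x_v], W_m, b_m)
print(weights, context)
```

With untrained all-zero parameters the weights are uniform and the context vector is a plain average; training `W_m` and `b_m` is what lets the module shift weight away from, for instance, an uninformative word embedding. Dropping `x_v` from the input list (and resizing `W_m` and `b_m` accordingly) gives the text-only variant.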

SnapCaptions Dataset
The SnapCaptions dataset is composed of 10K user-generated image (snap) and textual caption pairs where named entities in captions are manually labeled by expert human annotators (entity types: PER, LOC, ORG, MISC). These captions are collected exclusively from snaps submitted to public and crowd-sourced stories (aka Snapchat Live Stories or Our Stories). Examples of such public crowd-sourced stories are "New York Story" or "Thanksgiving Story", which comprise snaps that are aggregated for various public events, venues, etc. All snaps were posted between 2016 and 2017, and do not contain raw images or other associated information (only textual captions and obfuscated visual descriptor features extracted from the pre-trained InceptionNet are available). We split the dataset into train (70%), validation (15%), and test (15%) sets. The captions have an average length of 30.7 characters (5.81 words) with a vocabulary size of 15,733, of which 6,612 are considered unknown tokens with respect to the Stanford GloVe embeddings (Pennington et al., 2014). Named entities annotated in the SnapCaptions dataset include many new and emerging entities, and they are found in various surface forms (various nicknames, typos, etc.). To the best of our knowledge, SnapCaptions is the only dataset that contains natural image-caption pairs with expert-annotated named entities.

Baselines
Task: given a caption and a paired image (if used), the goal is to label every token in a caption in BIO scheme (B: beginning, I: inside, O: outside) (Sang and Veenstra, 1999). We report the performance of the following state-of-the-art NER models as baselines, as well as several configurations of our proposed approach to examine contributions of each component (W: word, C: char, V: visual).
The rest of the architecture is kept the same.
• (proposed) Bi-LSTM/CRF + Bi-CharLSTM + Inception with modality attention (W+C+V): uses the modality attention to merge word, character, and visual embeddings as input to the entity LSTM.
We report precision, recall, and F1 score for both the entity type recognition (PER, LOC, ORG, MISC) and entity segmentation (untyped recognition: named entity or not) tasks.

Results: SnapCaptions Dataset
Table 1 shows the NER performance on the SnapCaptions dataset for both tasks.
Main Results: When visual context is available (W+C+V), we see that the model performance greatly improves over the textual models (W+C), showing that visual contexts are complementary to textual information in named entity recognition tasks. In addition, it can be seen that the modality attention module further improves the entity type recognition performance for (W+C+V). This result indicates that the modality attention is able to focus on the most effective modality (visual, words, or characters), adaptively for each sample, to maximize information gain. Note that our text-only model (W+C) with the modality attention module also significantly outperforms the state-of-the-art baselines (Aguilar et al., 2017; Ma and Hovy, 2016; Lample et al., 2016) that use the same textual modalities (W+C), showing the effectiveness of the modality attention module for textual models as well.
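The precision, recall, and F1 scores for typed recognition and untyped segmentation can be computed at the entity-span level, as in the Python sketch below. It assumes exact span matching and a lenient reading of malformed BIO sequences; the scoring procedure actually used for Table 1 is not specified here, and the function names and toy label sequences are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def bio_to_spans(labels: List[str]) -> Set[Span]:
    """Extract entity spans from one BIO label sequence."""
    spans: Set[Span] = set()
    start, current = None, None
    for t, label in enumerate(labels + ["O"]):       # sentinel closes a trailing span
        continues = label.startswith("I-") and current == label[2:]
        if continues:
            continue
        if current is not None:
            spans.add((start, t, current))           # close the open span
        if label.startswith(("B-", "I-")):
            start, current = t, label[2:]            # lenient: a stray I- opens a span
        else:
            start, current = None, None
    return spans


def collect(sequences: List[List[str]], typed: bool) -> Set[Tuple[int, int, int, str]]:
    """Gather spans over all captions, optionally collapsing the entity types."""
    items = set()
    for idx, labels in enumerate(sequences):
        for start, end, entity_type in bio_to_spans(labels):
            items.add((idx, start, end, entity_type if typed else "ENT"))
    return items


def precision_recall_f1(gold, pred, typed=True):
    gold_items, pred_items = collect(gold, typed), collect(pred, typed)
    correct = len(gold_items & pred_items)
    precision = correct / len(pred_items) if pred_items else 0.0
    recall = correct / len(gold_items) if gold_items else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [["B-PER", "I-PER", "O"], ["O", "B-MISC"]]
pred = [["B-PER", "I-PER", "O"], ["O", "B-ORG"]]   # second entity has the wrong type
typed_scores = precision_recall_f1(gold, pred, typed=True)
untyped_scores = precision_recall_f1(gold, pred, typed=False)
print(typed_scores, untyped_scores)
```

In the toy example the mistyped entity counts as an error for typed recognition but as a hit for segmentation, which is why segmentation scores are never lower than typed scores under this scheme.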
Error Analysis: Table 2 shows example cases where incorporation of visual contexts affects prediction of named entities. For example, the token 'curry' in the caption "The curry's " is polysemous and may refer to either a type of food or a famous basketball player 'Stephen Curry', and the surrounding textual contexts do not provide enough information to disambiguate it. On the other hand, visual contexts (visual tags: 'parade', 'urban area', ...) provide similarities to the token's distributional semantics from other training examples (e.g. snaps from "NBA Championship Parade Story"), and thus the model successfully predicts the token as a named entity. Similarly, while the text-only model erroneously predicts 'Apple' in the caption "Grandma w dat lit Apple Crisp" as an organization (e.g. Apple Inc.), the visual contexts (describing objects related to food) help disambiguate the token, making the model predict it correctly as a non-named entity (a fruit). Trending entities (musicians or DJs such as 'CID', 'Duke Dumont', 'Marshmello', etc.) are also recognized correctly with strengthened contexts from visual information (describing concert scenes) despite lack of surrounding textual contexts. A few cases where visual contexts harmed the performance mostly include visual tags that are unrelated to a token or its surrounding textual contexts.
Visualization of Modality Attention: Figure 3 visualizes the modality attention module at each decoding step (each column), where amplified modality is represented with darker color, and attenuated modality is represented with lighter color.
For the image-aided model (W+C+V; upper row in Figure 3), we confirm that the modality attention successfully attenuates irrelevant signals (e.g. selfies) and amplifies relevant modality-based contexts in the prediction of a given token. In the example of "disney word essential = coffee" with visual tags selfie, phone, person, the modality attention successfully attenuates distracting visual signals and focuses on textual modalities, consequently making correct predictions. The named entities in the examples of "Beautiful night atop The Space Needle" and "Splash Mountain" are challenging to predict because they are composed of common nouns (space, needle, splash, mountain), and thus they often need additional contexts to be predicted correctly. In the training data, visual contexts are stronger indicators for these named entities (space needle, splash mountain), and the modality attention module successfully attends more to the stronger signals.
For the text-only model (W+C), we observe that performance gains mostly come from the modality attention module better handling tokens unseen during training or unknown tokens from the pre-trained word embeddings matrix. For example, while WaRriOoOrs and Kooler Matic are missing tokens in the word embeddings matrix, the modality attention successfully amplifies character-based contexts (e.g. capitalized first letters, similarity to known entities such as 'Golden State Warriors') and suppresses word-based contexts (word embeddings for unknown tokens, e.g. 'WaRriOoOrs'), leading to correct predictions. This result is significant because it shows that the performance of the model, with an almost identical architecture, can still improve without having to scale the word embeddings matrix indefinitely.
Figure 3 (b) shows the cases where the modality attention led to incorrect predictions. For example, the model predicts missing tokens HUUUGE and Shampooer incorrectly as named entities by amplifying misleading character-based contexts (e.g. capitalized first letters) or visual contexts (e.g. concert scenes, associated contexts of which often include named entities in the training dataset).
Sensitivity to Word Embeddings Vocabulary Size: In order to isolate the effectiveness of the modality attention module on textual models in handling missing tokens, we report the performance with varying word embeddings vocabulary sizes in Table 3. By artificially increasing the number of missing tokens through randomly removing words from the word embeddings matrix (original vocab size: 400K), we observe that while the overall performance degrades, the modality attention module is able to suppress the performance degradation. Note also that the performance gap generally gets bigger as we decrease the vocabulary size of the word embeddings matrix.
This result is significant in that the modality attention is able to make the model more robust to missing tokens without having to train an indefinitely large word embeddings matrix for arbitrarily noisy social media text datasets.
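The vocabulary-size ablation can be simulated with a few lines of Python. The sketch below randomly keeps a fraction of an embedding table and measures how many tokens become unknown; the helper names, the zero-vector convention for unknown tokens, and the toy embedding table are assumptions for illustration, not details of the reported experiment.

```python
import random
from typing import Dict, List, Sequence


def shrink_vocabulary(
    embeddings: Dict[str, List[float]], keep_fraction: float, seed: int = 0
) -> Dict[str, List[float]]:
    """Randomly keep a fraction of the words in an embedding table."""
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in [0, 1]")
    rng = random.Random(seed)                  # local generator for reproducibility
    words = sorted(embeddings)                 # fixed order so the seed is meaningful
    kept = rng.sample(words, round(len(words) * keep_fraction))
    return {word: embeddings[word] for word in kept}


def lookup(word: str, embeddings: Dict[str, List[float]], dim: int) -> List[float]:
    """Return the embedding, or an all-zero vector for an unknown token (assumed convention)."""
    return embeddings.get(word, [0.0] * dim)


def unknown_rate(tokens: Sequence[str], embeddings: Dict[str, List[float]]) -> float:
    """Fraction of tokens that have no entry in the embedding table."""
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if token not in embeddings) / len(tokens)


full = {f"word{i}": [float(i), float(i) + 0.5] for i in range(10)}  # toy table
half = shrink_vocabulary(full, keep_fraction=0.5)
tokens = list(full)
print(unknown_rate(tokens, full), unknown_rate(tokens, half))
```

Evaluating the same trained architecture against progressively smaller tables is one way to compare how concatenation and modality attention cope as the unknown-token rate rises.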

Conclusions
We proposed a new multimodal NER (MNER: image + text) task on short social media posts. We demonstrated for the first time an effective MNER system, where visual information is combined with textual information to outperform traditional text-based NER baselines. Our work can be applied to myriads of social media posts or other articles across multiple platforms which often include both text and accompanying images. In addition, we proposed the modality attention module, a new neural mechanism which learns optimal integration of different modes of correlated information. In essence, the modality attention learns to attenuate irrelevant or uninformative modal information while amplifying the primary modality to extract better overall representations. We showed that the modality attention based model outperforms other state-of-the-art baselines when text was the only modality available, by better combining word and character level information.