An Element-wise Visual-enhanced BiLSTM-CRF Model for Location Name Recognition

In recent years, studies have used visual information in named entity recognition (NER) for social media posts with attached images. However, these methods can only be applied to documents with attached images. In this paper, we propose an NER method that can use element-wise visual information for any document by using image data corresponding to each word in the document. The proposed method obtains element-wise image data through an image retrieval engine and uses the data as extra features in a neural NER model. Experimental results on a standard Japanese NER dataset show that the proposed method achieves a higher F1 value (89.67%) than a baseline method, demonstrating the effectiveness of element-wise visual information.


Introduction
Since the 1990s, information extraction, in which computers are used to extract structured data from unstructured documents, has been extensively studied (Cowie and Lehnert, 1996; Grishman and Sundheim, 1996). Among the entities to be extracted, location information (where) is one of the essential components (5W1H) of event information, and the task has evolved to include various subtasks, such as location name disambiguation and the mapping of location names to real-world geographic locations (Weissenbacher et al., 2019).
Location name recognition has typically been conducted as a named entity recognition (NER) task (Li et al., 2018). In this field, deep learning models using visual information have been actively studied in recent years, especially for extracting named entities (NEs) from posts in social networking services (SNSs) such as Twitter and Snapchat (Lu et al., 2018; Moon et al., 2019). These methods use images attached to a post as multimodal features to disambiguate word meanings in the post. For example, the word Washington can refer to Washington, D.C. (LOCATION) or to the presidency of George Washington (PERSON); looking at the attached image can help disambiguate which sense is meant.
As mentioned above, visual information is considered capable of explaining word meanings and providing useful information for location name recognition. For example, Figure 1 shows images of two modernized cities, Shenzhen in China and Dubai in the UAE, and Figure 2 shows images of rural villages, Manali in India and Hakone in Japan. One can easily recognize common objects in these images: skyscrapers in Figure 1, and townscapes surrounded by mountains and rivers in Figure 2. Similarities like these would provide sufficient evidence to infer that words such as Shenzhen and Dubai in documents share the same NE aspect.
In this paper, we propose a method for location name recognition that utilizes images more effectively. Specifically, image data are obtained for each word in a document through an image retrieval engine, using the words in the document as search queries, and are used as extra multimodal features in a neural NER model. The proposed model has two advantages. First, it is robust to unseen words that do not appear in the training data, to which standard NER models tend to be vulnerable. Image data corresponding to each word in the document provide additional information that clarifies word meanings, as shown in the examples of Figure 1 and Figure 2. Second, our method can be applied to any document because element-wise image data are obtained for each word in the document, whereas the methods of previous studies can only be applied to documents with attached images.
In addition, the proposed method introduces a Gate mechanism to control the extent to which visual features from images are input to the neural NER model. Polysemous words, abbreviations, and misspellings in a document can result in inappropriate instances in the image data obtained by the image retrieval engine. The gate removes the harmful effects of such instances by increasing the influence of a visual feature when the corresponding image is appropriate for the document's context, and decreasing it when the image is inappropriate. We evaluate the model's performance for location name recognition using a standard BiLSTM-CRF model as our baseline, and show the effectiveness of element-wise visual information and the Gate mechanism through our experimental results.

Neural NER Model
In NER, machine learning models using conditional random fields (CRFs) have been widely used (Marcińczuk, 2015). Since the emergence of deep learning, it has become common to use various neural network-based NER models. Among them, bidirectional long short-term memory (LSTM) models with a CRF layer, BiLSTM-CRF, are among the most common (Huang et al., 2015; Lample et al., 2016). Furthermore, variations of BiLSTM-CRF with language models pre-trained on large unsupervised corpora, such as Flair (Akbik et al., 2018), have succeeded in achieving high performance.

Use of Visual Information
Visual information obtained from images (or pictures) has been used in neural NER models, especially when applied to SNS posts that include related images. Moon et al. (2019) proposed a neural NER model that uses images attached to a post as multimodal features. In the model, the image is transformed into a vector representation through a pre-trained CNN-based image recognition model and then combined with the input to the LSTM network for NER. Asgari-Chenaghlu et al. (2020) proposed a model similar to that of Moon et al. (2019) that can directly use object-name class labels obtained by the image recognition model. Lu et al. (2018), among others, proposed models that obtain one-to-one correspondences between a word in a document and an object in an attached picture to extract fine-grained visual features. These studies only use image data attached to the document, not element-wise image data corresponding to words in the document.
In Chinese NER, each part of a Chinese character in a document can be regarded as a visual feature and incorporated into the NER model (Jia and Ma, 2019). Although this model handles element-wise visual information in the same manner as ours, that is, image data corresponding to each element (character or word) in the document are used in the NER model, it only focuses on the characters' visual patterns. Our model, described in detail in Section 4, focuses on images that express word meanings.

BiLSTM-CRF Model
This section describes the details of the BiLSTM-CRF model as the basis for our baseline model. As mentioned in the previous section, the BiLSTM-CRF model is one of the most common models for NER. The input is a word or character sequence in a document, and the output is a sequence of labels representing NE information. In this study, we use a character-based model because the dataset used in the experiments is Japanese, and a character-based model avoids errors caused by word segmentation. Character-based models have been confirmed to outperform word-based models on Japanese documents (Misawa et al., 2017).
Let $C = \{c_t\}_{t=1}^{M}$ be the input character sequence, $x = \{x_t\}_{t=1}^{M}$ be the input vector sequence corresponding to $C$, and $y = \{y_t\}_{t=1}^{M}$ be the output label sequence. Here, $x$ is created by concatenating three types of vector (embedding) sequences $x_c$, $x_w$, and $x_F$. The $t$-th element $x_t$ of $x$ is given by Equation (1):

$x_t = [x_{c,t}; x_{w,t}; x_{F,t}]$ (1)
The sequence $x_c = \{x_{c,t}\}_{t=1}^{M}$ is a sequence of character embeddings corresponding to $C$. Each element $x_{c,t}$ of $x_c$ corresponds to a GloVe embedding (Pennington et al., 2014) for the corresponding character in $C$.
In addition to $x_c$, we also use $x_w$ and $x_F$, which are sequences of word embeddings that integrate word meanings into the input. $x_w$ is built from the character-based word sequence, a word sequence whose length is the same as that of the character sequence $C$. Let $S = \{s_i\}$ be the word sequence of the input document, and let $W = \{w_t\}_{t=1}^{M}$ be the character-based word sequence, where $M$ denotes the total number of characters. $W$ is a variation of $S$ created by repeating each word $s_i$ $|s_i|$ times, where $|s_i|$ denotes the number of characters in the word $s_i$. Note that $w_t$ and $w_{t+1}$ in $W$ have the same value if they come from the same word $s_i$. The sequence $x_w = \{x_{w,t}\}_{t=1}^{M}$ is the sequence of word embeddings corresponding to $W$; each element $x_{w,t}$ is also trained by GloVe (Pennington et al., 2014). The sequence $x_F$ is an alternative version of $x_w$, trained with the Flair scheme (Akbik et al., 2018) instead of GloVe.
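The expansion of a word sequence $S$ into the character-based sequence $W$ can be sketched as follows (a minimal sketch; the function name is ours, not from the paper):

```python
def to_character_based(words):
    """Expand a word sequence S into the character sequence C and the
    character-based word sequence W, repeating each word once per
    character it contains (|s_i| times)."""
    chars, char_words = [], []
    for w in words:
        for c in w:
            chars.append(c)       # C = {c_t}
            char_words.append(w)  # W = {w_t}, w repeated |s_i| times
    return chars, char_words

# Example: a two-word sequence with lengths 2 and 3 gives M = 5;
# consecutive elements of W coming from the same word are equal.
C, W = to_character_based(["ab", "cde"])
```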
The input $x$ is given to the LSTM network layer. In this layer, each LSTM unit updates the state for the $t$-th element $x_t$ on the basis of the previous cell state ($c_{t-1}$) and hidden state ($h_{t-1}$), and outputs the updated states $h_t$ and $c_t$.
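For reference, this update follows the standard LSTM formulation, which can be written as (standard notation; the gate symbols $i_t$, $f_t$, $o_t$ are not defined in this paper):

```latex
\begin{aligned}
i_t &= \sigma(W_i [h_{t-1}; x_t] + b_i), &
f_t &= \sigma(W_f [h_{t-1}; x_t] + b_f),\\
o_t &= \sigma(W_o [h_{t-1}; x_t] + b_o), &
\tilde{c}_t &= \tanh(W_c [h_{t-1}; x_t] + b_c),\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, &
h_t &= o_t \odot \tanh(c_t).
\end{aligned}
```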
In a BiLSTM network, the output $\overrightarrow{h}$ of the forward LSTM and the output $\overleftarrow{h}$ of the backward LSTM are combined to compute the total output $\overleftrightarrow{h}$.
Next, the output of the LSTM network layer, $\overleftrightarrow{h}$, is sent to the CRF layer. In this layer, labeling that takes into account the transition probabilities between labels is carried out, and the output sequence $y$ is calculated for $x$. The optimal output sequence is selected on the basis of Equation (4), where $\phi$ is the feature function and $W_{CRF}$ is a weight coefficient learned in this layer:

$y = \operatorname{argmax}_{y'} W_{CRF}\,\phi(\overleftrightarrow{h}, y')$ (4)
The element $y_t$ of the output label sequence $y$ represents the entity label for each character $c_t$. In general, a named entity may be composed of multiple characters, so we use the BIO scheme to represent named entity chunks.
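Under the BIO scheme, a multi-character entity is marked with a B- label on its first character and I- labels on the rest. The following sketch illustrates the idea and how chunks are recovered from a label sequence (the labels and helper function are illustrative, not taken from the corpus):

```python
# Character sequence with a two-character location entity;
# "O" marks non-entity characters.
chars  = ["T", "o", "k", "y", "o"]                  # toy example
labels = ["B-City", "I-City", "O", "O", "O"]        # first chunk = City

def entity_spans(labels):
    """Recover (start, end, class) chunks from a BIO label sequence."""
    spans, start, cls = [], None, None
    for i, lab in enumerate(labels + ["O"]):        # sentinel closes last span
        if lab.startswith("B-") or lab == "O":
            if start is not None:                   # close the open chunk
                spans.append((start, i, cls))
                start, cls = None, None
        if lab.startswith("B-"):                    # open a new chunk
            start, cls = i, lab[2:]
    return spans
```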

Proposed Method
The proposed method is a variation of visual-enhanced BiLSTM-CRF models that integrates visual features into the basic BiLSTM-CRF model described in the previous section. The proposed method utilizes element-wise visual features by obtaining image data for each word in the input document through an image retrieval engine, where each word in the document is used as a search query. By retrieving images associated with words, the proposed method can be applied to any document, while the previous visual-enhanced models mentioned in Section 2 can only be applied to documents with attached images. Figure 3 shows an overview of the proposed method. The left-hand side shows the basic BiLSTM-CRF model. The right-hand side shows the proposed module that creates element-wise visual features. In this section, we describe the proposed method step by step. First, we explain the procedure for constructing queries from the input document (Section 4.1). Next, we explain how to obtain visual embeddings (Section 4.2) and how to integrate the visual features with the original text features (Section 4.3). We then update the input vector sequence shown in Equation (1) to carry the visual features to the BiLSTM-CRF (Section 4.4).

Retrieving Image Data
The given input document is transformed into a character-based word sequence $W = \{w_t\}_{t=1}^{M}$ using the same procedure described in the previous section. Then, we construct a query sequence $Q = \{q_t\}_{t=1}^{M}$, where $q_t = w_t$ if $w_t$ is a noun, and otherwise $q_t$ is empty. As other word types would be irrelevant for image retrieval, we focus only on nouns; nouns include not only proper nouns but also common nouns. The part-of-speech information is provided by the Japanese POS tagger MeCab, which is described below.
Each $q_t$ in $Q$ is used as a query for image retrieval independently of the others; namely, we run the image retrieval $M$ times. The top $K$ retrieved images, referred to as $p_t$, are saved for each run. If a query $q_t$ is empty, no retrieval is performed and $p_t$ is also set to empty. The sequence $P = \{p_t\}_{t=1}^{M}$ is sent to the next step as element-wise visual information.
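The query construction and retrieval loop can be sketched as follows (a minimal sketch; the `"NOUN"` tag check and the `search_images` function are stand-ins for MeCab's POS output and the external image retrieval engine, which are outside this sketch):

```python
K = 15  # top-K retrieved images kept per query (value used in the experiments)

def build_queries(char_words, pos_tags):
    """Q = {q_t}: q_t is w_t when w_t is a noun, otherwise empty."""
    return [w if tag == "NOUN" else "" for w, tag in zip(char_words, pos_tags)]

def retrieve_images(queries, search_images):
    """P = {p_t}: top-K images for each non-empty query, empty list otherwise.

    `search_images(query, k)` stands in for the retrieval engine
    (Google Images in the experiments) and must return a list of images."""
    return [search_images(q, K) if q else [] for q in queries]
```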

Obtaining Visual Embeddings
DenseNets (Huang et al., 2016) are one of the most powerful CNN-based deep neural network architectures, especially for image recognition. A pre-trained DenseNet model is applied to the retrieved images $p_t$ to obtain visual embeddings. First, each image in $p_t$ is sent to the DenseNet, and the hidden representation of its final hidden layer is saved. After the $K$ runs, the average of the $K$ hidden representations is taken as the visual embedding $v_t$.
If $p_t$ is empty, we define $v_t$ as a zero vector.
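The averaging step can be sketched with NumPy as follows (`densenet_features` stands in for the pre-trained DenseNet's final hidden layer; the 1024-dimensional size follows the experimental setup described later):

```python
import numpy as np

DIM = 1024  # dimensionality of the DenseNet final hidden layer

def visual_embedding(images, densenet_features):
    """v_t: mean of the K DenseNet hidden representations of the
    retrieved images p_t, or a zero vector when p_t is empty."""
    if not images:
        return np.zeros(DIM)
    feats = [densenet_features(img) for img in images]
    return np.mean(feats, axis=0)
```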

Combining Visual Embeddings
The obtained visual embeddings $v_t$ are modified to adjust the balance between the original text features and our visual features.
Here, we introduce the Gate mechanism to control how much of the visual features is input to the BiLSTM-CRF model. It works to decrease the influence of the visual features when images retrieved for polysemous words, abbreviations, or misspellings are inappropriate. We also present another, simpler procedure, which we compare against the Gate mechanism.
Gate mechanism This procedure is formulated as follows:

$x_{v,t} = g_t v_t$
$g_t = \sigma(W_g [v_t; x_{F,t}] + b)$

The modified visual embedding $x_{v,t}$ is obtained by multiplying $v_t$ by $g_t$. The modification weight $g_t$ is calculated on the basis of the visual feature $v_t$ and the text feature $x_{F,t}$. We use $x_{F,t}$ because the relevance between $v_t$ and the context information around $w_t$ needs to be verified. Here, $\sigma()$ denotes the sigmoid function, and $W_g$ and $b$ are the weight coefficients to be trained. If a visual feature provides useful context, $g_t$ is close to 1; otherwise it is close to 0. Note that no visual features are considered when $g_t = 0$.
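A minimal NumPy sketch of the gate, assuming a scalar gate value per position (the paper does not state whether $g_t$ is a scalar or a vector, so this is one plausible reading):

```python
import numpy as np

def gate_visual(v_t, x_F_t, W_g, b):
    """x_{v,t} = g_t * v_t with g_t = sigmoid(W_g [v_t; x_{F,t}] + b)."""
    z = W_g @ np.concatenate([v_t, x_F_t]) + b  # scalar pre-activation
    g_t = 1.0 / (1.0 + np.exp(-z))              # gate value in (0, 1)
    return g_t * v_t                            # g_t near 0 suppresses the image
```

With $W_g = 0$ and $b = 0$, the gate is exactly 0.5, halving the visual embedding; the Simple variant described next corresponds to fixing the gate at 1.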
Simple This procedure is used for comparison with the Gate mechanism, where $x_{v,t}$ is defined as $x_{v,t} = v_t$. Note that this procedure is equivalent to the gate function with $g_t$ fixed at 1.

Use of Visual Features
Finally, the input vector sequence shown in Equation (1) is updated to Equation (7) so that the visual features are fed to the input layer of the BiLSTM-CRF:

$x_t = [x_{c,t}; x_{w,t}; x_{F,t}; x_{v,t}]$ (7)

Dataset
We used the Extended Named Entity corpus (ENE corpus) (Hashimoto et al., 2008), which follows the definition of Sekine's Extended Named Entity Hierarchy (Sekine et al., 2002), version 7.1.0, comprising more than 200 types of named entities, including a number of location name types. This corpus is one of the commonly used datasets for evaluating Japanese NER methods. No document in the corpus has attached images. The statistics of the ENE corpus are shown in Table 1. We focused on six classes in the experiments: Country, Province, County, City, GPE Other, and MIX. The first five classes are original classes in Sekine's definition. We included MIX to indicate cases that belong to multiple NE classes. Hereafter, we ignore MIX for convenience because such cases are rare. The statistics of each class are shown in Table 2 and Table 3.

Settings
We constructed three models for location name recognition. The first is the baseline model and the others are the proposed models described in Section 4.
• Baseline is the BiLSTM-CRF model described in Section 3. No use of visual features.
• Visual (Gate) is the proposed visual-enhanced BiLSTM-CRF model that utilizes element-wise visual features with the Gate mechanism.
• Visual (Simple) is another proposed model. This model uses the Simple text/visual combination instead of the Gate mechanism.
For the word embeddings $x_w$ and character embeddings $x_c$, we conducted GloVe training with 300 dimensions on the BCCWJ corpus (Maekawa et al., 2014). We used MeCab (Kudo et al., 2004) with the UniDic dictionary (Den et al., 2007) for word segmentation. The Flair embeddings (Akbik et al., 2018) were trained with 1024 dimensions using BCCWJ and ten years of Mainichi newspaper data from 1991 to 2000.
We used Google Images with the photo option for the image retrieval in the proposed models. The top 15 retrieved images for each query were saved. In the dataset, about 43% of the words were nouns, enabling non-empty queries to be constructed. The visual embeddings were created from the final hidden layer representation of DenseNet, whose dimensionality is 1024. We used the pre-trained DenseNet from PyTorch. We performed an approximate randomization test (Chinchor, 1992) on the F1 values. The marks "*" and "**" in the tables show significant differences compared with the baseline at the 0.05 and 0.01 levels, respectively. In training the models, we used Adam (Kingma and Ba, 2014) for optimization. The batch size was 20. We applied dropout regularization (Srivastava et al., 2014) at p = 0.5 to each node of the input layer and each output node of the LSTM layer. We also used gradient clipping (Pascanu et al., 2013) at 1.0 to reduce the effects of exploding gradients.
We used the standard BIO schema (Tjong Kim Sang and Veenstra, 1999) for the chunk representation. The performance was measured by precision (Prec.), recall, and F1 values. Only exact matches were counted as correct; lenient matches were counted as incorrect.
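Exact-match precision, recall, and F1 over entity spans can be computed as follows (a standard formulation; extracting spans from the BIO labels is assumed to be done beforehand):

```python
def prf1(gold_spans, pred_spans):
    """Exact-match precision, recall, and F1 over entity spans.
    A span counts as correct only if its boundaries and class both match."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)                      # exact matches only
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(gold) if gold else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```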

Results and Discussion
Experimental results are shown in Table 4. Both models using element-wise visual features outperformed the baseline model. This result suggests that element-wise visual features are powerful features for location name recognition. Furthermore, the Visual (Gate) model achieved the best F1 value of 89.67%. These results indicate that the Gate mechanism is an essential part of integrating element-wise visual features into the baseline model.
The following example sentences are cases in which Visual (Gate) produced a correct output while the Baseline produced an incorrect output.

• (ex.1-E) ... in December 1982.

• (ex.2-J) ... [City]
• (ex.2-E) Finally, tomorrow is the last day of our stay in France, except for the day we leave. We're going to Avignon [City].
In the first example, (Jamaica) is a country name to be recognized, and in the second example, (Avignon) is a city name in France. Examples of the retrieved image data corresponding to these location names are shown in Figure 4 and Figure 5, respectively. One can see a typical scene or object in each image: a beach in Figure 4 and a palace in Figure 5. This suggests that image data showing scenes or objects strongly relevant to locations provide helpful visual features.

Table 5 shows the fine-grained performance of the experimental results. It shows that the City class had the most significant improvement. In fact, we confirmed that the retrieved image data corresponding to city names showed many typical characteristics of the locations, such as buildings, landscapes, and skies. In contrast, as an example of the other classes, image data corresponding to country names showed various objects only weakly related to the countries. For example, we found some image data showing the president of a country; such images suggest person names rather than location names.
We call words that appear in the test data but not in the training data unseen words. In general, it is difficult to achieve accurate NER performance on unseen words because they do not appear in the training data and thus carry poor textual information. Here, we investigated whether our visual features provide supplemental information for unseen words. To do so, we conducted an analysis focusing on the City class. As shown in Table 2, the City class differs from the other classes in that it has many types of mentions, which implies that many unseen mentions must be recognized for the City class. We therefore compared the extraction performance between seen words and unseen words in the City class. Table 6 shows the details of the results. One can see that the unseen words achieved larger performance improvements than the seen words. Furthermore, the precision values improved most significantly (Seen (+5.08) → Unseen (+7.07)). This means that visual features improve the performance on not only true-positive samples but also true-negative samples. The example sentences are shown below; each underline indicates an unseen true-negative word, and the corresponding retrieved images are shown in Figure 6 and Figure 7. These samples were correctly classified by the proposed method but wrongly classified by the baseline.

Although, as discussed above, the element-wise visual features contributed to improving the performance of location name recognition, some types of errors remained. We observed that the proposed method tends to cause false-positive errors on compound words that include location names. For example, (the Kyoto Protocol) in (5-J) and (5-E) was wrongly recognized as a location name. This type of error is caused by inadequate query construction. Because every single noun in the document is used independently as a query word in the proposed method, both (Kyoto) and (Protocol) were used for image retrieval, which led to the mistaken recognition of Kyoto.
• (ex.5-J)

We also observed that the proposed method tends to cause false-negative errors when inappropriate images are mixed into the retrieved images. For example, the proposed method missed recognizing (Angola) in (6-J) and (6-E) because the image data retrieved by the query (Angola) included some inappropriate images of "Angora rabbit", as in Figure 8.
(Angola) is not a polysemous word, but we found that the word can mean "Angora rabbit" in a specific domain.

Conclusion
In this study, we proposed an NER model that uses images corresponding to all nouns in a document as features, together with a Gate mechanism that controls the extent to which the visual features are provided as input to the neural NER model. We conducted experiments to confirm its performance on location name recognition. The experimental results show that the proposed method achieved a higher F1 value than the baseline model on the ENE corpus dataset, with a significant difference at p < 0.01.
In future research, we will investigate whether the proposed model is effective for cases other than location names. We also aim to improve the model by conducting elaborate query investigations motivated by the error analysis. The hyper-parameter K, the number of images per word, would be critical for obtaining valuable visual embeddings; we will therefore also investigate whether a larger K improves location name recognition performance. The experimental results showed that the proposed method contributes little when query words are polysemous. We would like to attempt word-sequence queries combining nouns with adjectives or verbs instead of single-noun queries.