Semantic Relatedness Based Re-ranker for Text Spotting

Applications such as textual entailment, plagiarism detection or document clustering rely on the notion of semantic similarity, and are usually approached with dimension reduction techniques like LDA or with embedding-based neural approaches. We present a scenario where semantic similarity is not enough, and we devise a neural approach to learn semantic relatedness. The scenario is text spotting in the wild, where a text in an image (e.g. a street sign, advertisement or bus destination) must be identified and recognized. Our goal is to improve the performance of vision systems by leveraging semantic information. Our rationale is that the text to be spotted is often related to the image context in which it appears (word pairs such as Delta-airplane or quarters-parking are not similar, but are clearly related). We show how learning a word-to-word or word-to-sentence relatedness score can improve the performance of text spotting systems by up to 2.9 points, outperforming other measures on a benchmark dataset.


Introduction
Deep learning has been successful in tasks related to deciding whether two short pieces of text refer to the same topic, e.g. semantic textual similarity (Cer et al., 2018), textual entailment (Parikh et al., 2016) or answer ranking for Q&A (Severyn and Moschitti, 2015).
However, other tasks require a broader perspective: deciding whether two text fragments are related, rather than whether they are similar. In this work, we describe one such task, and we retrain some existing sentence similarity approaches to learn this semantic relatedness. We also present a new Deep Neural Network (DNN) that outperforms existing approaches when applied to this particular scenario.
The task we tackle is Text Spotting: recognizing text that appears in unrestricted images (a.k.a. text in the wild), such as traffic signs, commercial ads, or shop names. Current state-of-the-art results on this task remain far below those of OCR systems working on documents with simple backgrounds.
Existing approaches to Text Spotting usually divide the problem into two fundamental tasks: 1) text detection, which selects the image regions likely to contain text, and 2) text recognition, which converts the images within these bounding boxes into readable strings. In this work, we focus on the recognition stage, aiming to prove that semantic relatedness between the image context and the recognized text can be used to boost system performance. We use existing pretrained architectures for text recognition, and add a shallow network that post-processes their output to re-rank the proposed candidate texts. In particular, we re-rank the candidates using their semantic relatedness score with other visual information extracted from the image (e.g. objects, scenario, image caption). Extensive evaluation shows that our approach consistently outperforms other semantic similarity methods.

Text Hypothesis Extraction
We use two pre-trained Text Spotting baselines to extract k text hypotheses. The first baseline is a CNN (Jaderberg et al., 2016) with fixed-lexicon recognition, able to recognize words from a predefined 90K-word dictionary. The second is an LSTM architecture with a visual attention model (Ghosh et al., 2017) that generates the final output words as probable character sequences, without relying on any lexicon. Both models are trained on a synthetic dataset (Jaderberg et al., 2014), and both output a softmax score for each of the k candidate words.
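As an illustration of this step, the following minimal sketch turns raw recognition scores into k-best softmax-scored hypotheses. The logits and the tiny vocabulary are hypothetical stand-ins: the real baselines score a 90K-word lexicon or decode character sequences.

```python
import numpy as np

def topk_hypotheses(logits, vocab, k=5):
    """Turn raw recognition scores into the k-best softmax-scored words.

    `logits` and `vocab` are toy stand-ins for the raw scores and word
    list produced by a text-recognition baseline.
    """
    z = np.asarray(logits, dtype=float)
    probs = np.exp(z - z.max())          # numerically stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1][:k]  # indices of the k highest scores
    return [(vocab[i], float(probs[i])) for i in order]

# Toy example: four candidate readings of one cropped word image.
vocab = ["delta", "dolta", "belta", "data"]
hyps = topk_hypotheses([4.0, 1.5, 1.0, 2.5], vocab, k=2)
print(hyps[0][0])  # delta
```

The resulting (word, score) list is exactly what the re-ranker described later consumes.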
Figure 1: Overview of the system pipeline: an end-to-end post-process scores the semantic relatedness between a candidate word and the context in the image (objects, scenarios, natural language descriptions, ...).

Learning Semantic Relatedness for Text Spotting
To learn the semantic relatedness between the visual context information and the candidate word, we introduce a multi-channel convolutional LSTM with an attention mechanism. The network is fed with the candidate word plus several words describing the image's visual context (object and place labels, and descriptive captions)1, and is trained to produce a relatedness score between the candidate word and the context. Our architecture is inspired by Severyn and Moschitti (2015), who proposed CNN-based re-rankers for Q&A. Our network consists of two subnetworks, each with 4 channels with kernel sizes k = (3, 3, 5, 8), and an overlap layer, as shown in Figure 1. We next describe its main components. Multi-Channel Convolution: The first subnetwork consists only of convolution kernels, and aims to extract n-gram or keyword features from the caption sequence.
The convolution is applied over a sequence to extract n-gram features from different positions. Let x ∈ R^(s×d) be the sentence matrix, where s is the sentence length and d the dimension of the word embeddings, and let c ∈ R^(k×d) denote the kernel for the convolution operation. For each position i in the sentence, w_i is the concatenation of k consecutive word vectors: w_i = [x_i ⊕ x_(i+1) ⊕ ... ⊕ x_(i+k-1)].

1: All this visual context information is automatically generated using off-the-shelf existing modules (see Section 4).
Our architecture uses multiple such kernels to generate feature maps. The feature map entry for each window vector w_i can be written as m_i = f(w_i • c + b), where • is element-wise multiplication, f is a non-linear function (in our case ReLU (Nair and Hinton, 2010)), and b is a bias term. For j kernels, the j generated feature maps can be arranged as a feature representation for each window: W = [m_1 ⊕ m_2 ⊕ ... ⊕ m_j]. Each row W_i of W ∈ R^((s−k+1)×j) contains the features generated by the j kernels for the window vector at position i. These window representations are then fed into the joint layer and the LSTM, as shown in Figure 1.
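The n-gram feature extraction above can be sketched in a few lines of NumPy. The random sentence matrix and the single kernel are illustrative stand-ins for learned embeddings and learned convolution weights:

```python
import numpy as np

def conv_ngram_features(x, kernel, b=0.0):
    """One convolution channel: slide a kernel over k-word windows of the
    sentence matrix x (s x d), producing m_i = ReLU(w_i * c + b)."""
    s, d = x.shape
    k = kernel.shape[0]
    m = np.empty(s - k + 1)
    for i in range(s - k + 1):
        w = x[i:i + k]                                  # k consecutive word vectors
        m[i] = max(0.0, float(np.sum(w * kernel) + b))  # element-wise product, sum, ReLU
    return m

rng = np.random.default_rng(0)
x = rng.normal(size=(7, 4))       # sentence of 7 words, 4-dim embeddings
kernel = rng.normal(size=(3, 4))  # one 3-gram kernel
m = conv_ngram_features(x, kernel)
print(m.shape)  # (5,)
```

A real channel applies j such kernels in parallel, stacking the j feature maps into the matrix W described above.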
Multi-Channel Convolution-LSTM: Following C-LSTM (Zhou et al., 2015), we forward the output of the CNN layers into an LSTM, which captures long-term dependencies over the features. We further introduce an attention mechanism to capture the most important features of the sequence; its advantage is that the model learns the sequence without relying on the temporal order (we describe the attention mechanism in more detail below). Also following (Zhou et al., 2015), we do not use a pooling operation after the convolution feature map. A pooling layer is usually applied after convolution to extract the most important features in the sequence; however, the output of our convolutional model is fed into an LSTM (Hochreiter and Schmidhuber, 1997) to learn the extracted sequence, and a pooling layer would break that sequence by downsampling it to a selected feature. In short, the LSTM is specialized in learning sequential data, and pooling would destroy the sequence order. Likewise, in the Multi-Channel Convolution model we learn the extracted n-gram word sequences directly, without pooling-based feature selection. Attention Mechanism: Attention-based models have shown promising results on various NLP tasks (Bahdanau et al., 2014). Such a mechanism learns to focus on a specific part of the input (e.g. a relevant word in a sentence). We apply an attention mechanism (Raffel and Ellis, 2015) on top of the LSTM that captures the temporal dependencies in the sequence. This attention uses a Feed-Forward Network (FFN) attention function: e_t = a(h_t) = v_a^T tanh(W_a h_t), where W_a is the attention hidden weight matrix and v_a is the output vector. As shown in Figure 1, the vector c is computed as a weighted average of the states h_t, with weights α (defined below).
The attention mechanism produces a single vector c for the complete sequence as follows: α_t = exp(e_t) / Σ_(k=1..T) exp(e_k) and c = Σ_(t=1..T) α_t h_t, where T is the total number of time steps and α_t is the weight computed for state h_t at step t; a is a learnable function that depends only on h_t. Since this attention averages over time, it discards the temporal order, which is well suited to learning semantic relations between words: the attention assigns higher weights to the more important words in the sentence without relying on sequence order. Overlap Layer: The overlap layer is simply a frequency-count dictionary that computes overlap information of the inputs. The idea is to give more weight to the most frequent visual elements, especially when they are observed by more than one visual classifier. The dictionary output is fed to a fully connected layer.
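The feed-forward attention described above (a score per state, a softmax over time, then a weighted average of the states) can be sketched as follows; the dimensions and random weights are purely illustrative:

```python
import numpy as np

def ffn_attention(H, W_a, v_a):
    """Feed-forward attention over LSTM states H (T x d):
    e_t = v_a . tanh(W_a h_t); alpha = softmax(e); c = sum_t alpha_t h_t."""
    e = np.tanh(H @ W_a.T) @ v_a  # one unnormalized score per time step
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()          # softmax over time
    c = alpha @ H                 # weighted average of the states
    return c, alpha

rng = np.random.default_rng(1)
H = rng.normal(size=(6, 4))    # 6 hidden states of dimension 4
W_a = rng.normal(size=(4, 4))  # illustrative attention weights
v_a = rng.normal(size=(4,))
c, alpha = ffn_attention(H, W_a, v_a)
print(c.shape, alpha.shape)  # (4,) (6,)
```

Because c is a convex combination of the states, permuting the time steps permutes α but leaves the set of attended features intact, which is the order-independence the text appeals to.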
Finally, we merge all subnetworks into a joint layer, whose output is used to compute the semantic relatedness between both inputs. We call the combined model Fusion Dual Convolution-LSTM-Attention (FDCLSTM_AT).
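The overlap layer described in the previous section reduces to a frequency count over the visual inputs; below is a minimal sketch with made-up classifier outputs:

```python
from collections import Counter

def overlap_features(*label_sets):
    """Frequency-count dictionary over the visual inputs: an element seen
    by more than one classifier (object, place, caption) gets a higher
    count, and hence more weight downstream."""
    counts = Counter()
    for labels in label_sets:
        counts.update(w.lower() for w in labels)
    return counts

# Made-up outputs of an object classifier, a place classifier and a captioner.
ov = overlap_features(["airplane", "runway"], ["airport"],
                      "a delta airplane at the airport".split())
print(ov["airplane"], ov["airport"], ov["runway"])  # 2 2 1
```

In the full model these counts are fed through a fully connected layer rather than used directly.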
Since we have only one candidate word at a time, we apply convolution with masking on the candidate-word side (first channel); simply zero-padding the sequence has a negative impact on the learning stability of the network. We concatenate the CNN outputs with the additional features into MLP layers, followed by a final sigmoid layer performing binary classification. We train the model with a binary cross-entropy loss, where the target value (in [0, 1]) is the semantic relatedness between the word and the visual context. Instead of restricting ourselves to a simple similarity function, we let the network learn the margin between the two classes, i.e. the degree of similarity; for this, we increase the depth of the network after the MLP merge layer with more fully connected layers. The network is trained using Nesterov-accelerated Adam (Nadam) (Dozat, 2016), as it yields better results than optimizers using only classical momentum (Adam), especially in cases such as word vectors and neural language modelling. We apply batch normalization (BN) (Ioffe and Szegedy, 2015) after each convolution and between the MLP layers. We omit the BN after the convolution for the model without attention (FDCLSTM), as BN deteriorated its performance. Additionally, we use 70% dropout (Srivastava et al., 2014) between the MLP layers for regularization.

Dataset and Visual Context Extraction
We evaluate the performance of the proposed approach on the COCO-text dataset (Veit et al., 2016). This dataset is based on Microsoft COCO (Common Objects in Context) (Lin et al., 2014), and consists of 63,686 images with 173,589 text instances (annotations of the images). The dataset does not include any visual context information, so we used out-of-the-box object (He et al., 2016) and place (Zhou et al., 2014) classifiers, and tuned a caption generator (Vinyals et al., 2015) on the same dataset, to extract contextual information from each image, as seen in Figure 2.

Related Work and Contribution
Understanding the visual environment around the text is very important for scene understanding, yet it has been explored by relatively few works. Zhu et al. (2016) show that visual context can be beneficial for text detection; their work uses a 14-class pixel classifier to extract context features from the image (such as tree, river, or wall) to assist scene text detection. Kang et al. (2017) employ topic modeling to learn the correlation between visual context and the spotted text in social media; the metadata associated with each image (e.g. tags, comments and titles) is used as context to enhance recognition accuracy. Karaoglu et al. (2017) take advantage of text and visual context for the logo retrieval problem. Most recently, Prasad and Wai Kin Kong (2018) use object information (limited to 42 predefined object classes) surrounding the spotted text to guide text detection, proposing two sub-networks that learn the relation between text and object class (e.g. relations such as car-plate or sign board-digit). Unlike these methods, our approach uses direct visual context from the image where the text appears, and does not rely on any extra resource such as human-labeled metadata (Kang et al., 2017), nor does it limit the context object classes (Prasad and Wai Kin Kong, 2018). In addition, our approach is easy to train and can be used as a drop-in complement for any text-spotting algorithm that outputs a ranking of word hypotheses.

Table 1: Best results after re-ranking using different re-rankers, and different values of k for the k-best hypotheses extracted from the baseline outputs (%). In addition, to evaluate our re-ranker with MRR we fixed k (CNN: k=8; LSTM: k=4).

Experiments and Results
In the following, we use different similarity or relatedness scorers to reorder the k-best hypotheses produced by an off-the-shelf state-of-the-art text spotting system. We experimented with extracting the k-best hypotheses for k = 1 ... 10.
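Conceptually, the re-ranking step combines the baseline score of each hypothesis with its relatedness to the visual context. The sketch below uses a simple product combination, which is an illustrative choice rather than the exact fusion described later; the toy scores are made up:

```python
def rerank(hypotheses, relatedness):
    """Reorder (word, baseline_score) pairs by a combined score.

    The product combination is an illustrative fusion rule, not the
    paper's exact formula.
    """
    return sorted(hypotheses,
                  key=lambda ws: ws[1] * relatedness(ws[0]),
                  reverse=True)

# Toy scores: the baseline slightly prefers "quartets", but the visual
# context (a parking meter) makes "quarters" far more related.
hyps = [("quartets", 0.6), ("quarters", 0.4)]
rel = {"quarters": 0.9, "quartets": 0.1}
print(rerank(hyps, lambda w: rel.get(w, 0.0))[0][0])  # quarters
```

Any scorer discussed below (cosine over sentence embeddings, our FDCLSTM models, word embeddings) can be dropped in as the `relatedness` function.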
We use two pre-trained deep models, a CNN (Jaderberg et al., 2016) and an LSTM (Ghosh et al., 2017), as baselines (BL) to extract the initial list of word hypotheses.
The CNN baseline uses a closed lexicon; therefore, it cannot recognize any word outside its 90K-word dictionary. Table 1 presents different accuracy metrics for this case: 1) full columns correspond to the accuracy on the whole dataset; 2) dict columns correspond to the accuracy over the cases where the target word is among the 90K words of the CNN dictionary (43.3% of the whole dataset); 3) list columns report the accuracy over the cases where the right word was among the k-best produced by the baseline. Comparing with sentence-level models: We compare the results of our encoder with several state-of-the-art sentence encoders, tuned or trained on the same dataset. We use cosine similarity between the caption and the candidate word. Word-to-sentence representations are computed with the Universal Sentence Encoder with the Transformer, USE-T (Cer et al., 2018), and InferSent (Conneau et al., 2017) with GloVe (Pennington et al., 2014). The rest of the systems in Table 1 are trained in the same conditions as our model: GloVe initialization with dual-channel, overlapping, non-static pre-trained embeddings on the same dataset. Our model FDCLSTM without attention achieves better results in the case of the second baseline (LSTM), whose output is full of false positives and short words. The advantage of the attention mechanism is its ability to integrate information over time, allowing the model to refer to specific points in the sequence when computing its output. However, in this case the attention attends to the wrong context, as many words have no correlation or do not correspond to actual words. On the other hand, USE-T seems to require a shorter hypothesis list to reach top performance when the right word is in the hypothesis list.

Figure 2: Examples of candidate re-ranking using object (c1), place (c2), and caption (c3) information. The three left examples are re-ranked based on the semantic relatedness score. The delta-airliner relation, which frequently co-occurs in the training data, is captured by the overlap layers. The 12-football pair shows a relatedness between sports and numbers.
Comparing with word-level models: We also compare our results with current state-of-the-art word embeddings trained on large general text corpora, using GloVe and fastText. The word-level models use only the object and place information, ignoring the caption. Our proposed models achieve better performance than our previous TWE model (Sabir et al., 2018), which trained a word embedding (Mikolov et al., 2013) from scratch on the same task. Similarity to probabilities: After computing the cosine similarity, we need to convert that score into a probability. As proposed in our previous work (Sabir et al., 2018), we obtain the final probability by combining (Blok et al., 2003) the similarity score, the probability of the detected context (provided by the object/place classifier), and the probability of the candidate word (estimated from a 5M-token corpus (Lison and Tiedemann, 2016)). Effect of Unigram Probabilities: Ghosh et al. (2017) showed the utility of a language model (LM) when the data is too small for a DNN, obtaining significant improvements. Thus, we introduce a basic model of unigram probabilities with out-of-vocabulary (OOV) smoothing. This model is applied at the end to filter out false-positive short words; its main goal is to demote less probable words over-ranked by the deep model. As seen in Table 1, the introduction of this unigram lexicon produces the best results.
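A minimal version of such a unigram model with add-α smoothing for OOV words is sketched below; the actual corpus (5M tokens) and smoothing scheme used in the experiments may differ:

```python
from collections import Counter

def unigram_lm(corpus_tokens, alpha=1.0):
    """Unigram probabilities with add-alpha smoothing, so that OOV words
    get a small nonzero probability instead of zero."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1  # reserve one slot of mass for unseen words
    def p(word):
        return (counts[word] + alpha) / (total + alpha * vocab)
    return p

p = unigram_lm("the cat sat on the mat".split())
print(p("the") > p("zzz") > 0.0)  # True
```

Multiplying this probability into the final score demotes implausible short words without discarding OOV candidates outright.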
Human performance: To estimate an upper bound for the results, we picked 33 random pictures from the test dataset and had 16 human subjects try to select the right word among the top k = 5 candidates produced by the baseline text spotting system. Average human performance was 63% (highest 87%, lowest 39%), while our proposed model achieved 57% on the same images.

Conclusion
In this work, we propose a simple deep learning architecture to learn word-to-word and word-to-sentence semantic relatedness, and show how it outperforms other semantic similarity scorers when used to re-rank candidate answers in the Text Spotting problem.
In the future, we plan to use the same approach to tackle similar problems, including lexical selection in Machine Translation and word sense disambiguation. We believe our approach could also be useful in multimodal machine translation, where an image caption must be translated using not only the text but also the image content (Barrault et al., 2018). Tasks that lie at the intersection of computer vision and NLP, such as the challenges posed by the new BreakingNews dataset (popularity prediction, automatic text illustration), could also benefit from our results (Ramisa et al., 2018).