Incorporating Multimodal Information in Open-Domain Web Keyphrase Extraction

Open-domain keyphrase extraction (KPE) on the Web is a fundamental yet complex NLP task with a wide range of practical applications within the field of Information Retrieval. In contrast to other document types, web page designs are intended for easy navigation and information finding. Effective designs encode within the layout and formatting signals that point to where the important information can be found. In this work, we propose a modeling approach that leverages these multi-modal signals to aid in the KPE task. In particular, we leverage both lexical and visual features (e.g., size, font, position) at the micro-level to enable effective strategy induction, and meta-level features that describe pages at a macro-level to aid in strategy selection. Our evaluation demonstrates that a combination of effective strategy induction and strategy selection within this approach for the KPE task outperforms state-of-the-art models. A qualitative post-hoc analysis illustrates how these features function within the model.


Introduction
We present a novel multi-modal approach to KeyPhrase Extraction (KPE), the task of automatically extracting salient phrases from a given document. KPE is a foundational task that plays a facilitating role in many Information Retrieval (IR) tasks, including classification, summarization, and document indexing (Hasan and Ng, 2014). Specifically, the KPE task requires accurate selection of the phrases that best capture a web document's topic. Well-performing approaches take advantage of document structure and entity co-occurrences.
Over the history of work in this area, there have been a variety of benchmarks (Medelyan and Witten, 2002; Nguyen and Kan, 2007; Wan and Xiao, 2008; Meng et al., 2017) and an equally wide variety of both non-neural (Grineva et al., 2009; Liu et al., 2009, 2010) and neural modeling approaches (Meng et al., 2017; Zhang et al., 2017; Chen et al., 2018). The earliest KPE approaches were mainly limited to domain-specific keyphrase extraction. The recent release of OpenKP (Xiong et al., 2019), a large-scale, feature-rich dataset specifically developed for open-domain web-page keyphrase extraction, has encouraged further research on the KPE task. A novel characteristic of this dataset is the inclusion of features related to visual properties.
Visual properties of words and web page layout offer a KPE model utility in at least two respects. First, micro-level features operating at the word level, including lexical features as well as features related to the size, font, color, and position of words, signal the relative importance of words within extended texts. Intuitively, texts that are highlighted with colored, bold, or bigger fonts, or placed in more prominent places within a web page, are more likely to be important, and should correspondingly be given a higher probability of being a keyphrase. Second, macro-level features describing the layout and type of a page (e.g., News, Storefront, etc.) are related to the distribution of keyphrases. For example, for a news or blog website, the title or the first paragraph of its main content are likely locations for finding keyphrases, while for an Indexing page, the important phrases are more likely to be found in lists. Important information may also be listed underneath or beside pictures.
Based on these insights, we propose a multimodal framework, Strategy-based Multimodal ARchiTecture for KeyPhrase Extraction (SMART-KPE), addressing the web KPE task in two steps: Multimodal Strategy Induction, which applies specific extraction tactics with a refined use of micro-level features, and Strategy Selection, which chooses among the tactics' results using macro-level features. In our evaluation, we compare SMART-KPE with several state-of-the-art baselines, and SMART-KPE shows a better ability to locate and extract keyphrases. We offer post-hoc case studies and ablation studies to illustrate model strengths and weaknesses. In addition to the improvement over SOTA baselines for the KPE task, to the best of our knowledge, ours is also the first approach to exploit a rendered snapshot of the whole web page as a macro-level signal for this task.

Related Work

Originally, the concept of a keyphrase was first used by authors of scientific papers, who indicated by hand a few phrases they decided best summarized their paper (Çano and Bojar, 2019). The first corpora for automated keyphrase extraction were likewise assembled out of publications from scientific fields, including technical reports (Witten et al., 1999), paper abstracts (Hulth, 2003), and scientific papers (Nguyen and Kan, 2007; Medelyan et al., 2009; Kim et al., 2010). To this day, scientific publications still serve as a fundamental fixed-domain benchmark for neural KPE methods (Meng et al., 2017; Alzaidy et al., 2019; Sahrawat et al., 2019) due to the availability of ample data of this kind. However, experiments have revealed that KPE methods trained directly on such corpora do not generalize well to web-related genres or other types of documents (Chen et al., 2018; Xiong et al., 2019), where there may be far more heterogeneity in topics, content, and structure, and more variation in where a keyphrase may appear. Past researchers have collected corpora for KPE in Internet and social media environments, including web pages (Yih et al., 2006; Hammouda et al., 2005), blogs (Grineva et al., 2009), email (Dredze et al., 2008), news articles (Wan and Xiao, 2008; Hulth and Megyesi, 2006) and live chats (Kim and Baldwin, 2012), but most of these corpora fall prey to similar problems with respect to robust training of neural models due to data sparsity and a lack of representativeness in topic distribution. The recently released OpenKP (Xiong et al., 2019) is the first large-scale KPE dataset with a broad distribution of topic domains. It facilitates work on model generalization and the opportunity to develop nuanced models that adapt their behavior to the type of document they are applied to. This property of the dataset inspired our proposed method, in which strategies are selected based on the detected type of document using macro-level features.

Neural Keyphrase Extraction Approaches
The earliest neural KPE models treat the KPE task as a standard encoder-decoder task, which first creates an encoding of the input using an RNN (Meng et al., 2017) or CNN (Zhang et al., 2017), and then decodes the predicted keyphrases. These early approaches were strictly limited to textual data representations. The release of OpenKP (Xiong et al., 2019) has introduced opportunities for research in multimodal KPE. OpenKP has recently been added as a branch of MS-MARCO (Nguyen et al., 2016), with a public leaderboard for the KPE task maintained by Microsoft. Built from Web data, it serves as the first large-scale benchmark for open-domain neural keyphrase extraction. In addition to providing the raw text of each document, OpenKP also includes various visual features associated with each text term, such as position, size, and font. Along with OpenKP, Xiong et al. (2019) also proposed BLING-KPE, the first neural model baseline for open-domain keyphrase extraction using visual features along with text. BLING-KPE first generates a hybrid embedding for each term by concatenating: (1) the ELMo (Peters et al., 2018) representation of the term, (2) a standard sinusoidal position embedding (Vaswani et al., 2017), and (3) 18 of the 20 visual features of the term available in the OpenKP dataset. It models n-grams using multiple CNNs, and utilizes a Transformer (Vaswani et al., 2017) layer and feed-forward layer for scoring. This approach represents the first attempt at multimodal KPE.
Recently, Sun et al. (2020) achieved greater success on the OpenKP task by modeling keyphrase extraction as multiple traditional text tasks, including sequence labeling, chunking, and salience ranking. From this work we adopt the idea of modeling KPE as a sequence labeling task, where one of five tags is assigned to each document term: nOn-keyphrase, Begin word of the keyphrase, mIddle word of the keyphrase, End word of the keyphrase, and Uni-word keyphrase. However, despite multiple recent efforts on this newly proposed dataset, published work so far in multimodal KPE has either omitted available features (Sun et al., 2020) or adopted a brute-force approach to feature encoding (direct concatenation of raw features) (Xiong et al., 2019). In this work we therefore strive for a more nuanced approach to leveraging the available features for multimodal KPE and offer a uniquely comprehensive approach.

Task Definition
Here we formalize the keyphrase extraction (KPE) task in the web page setting: Given a document $D = \{W, V, M\}$, where $W = \{w_1, w_2, \dots, w_n\}$ are the text terms of the web page with length $n$, $V = \{v_1, v_2, \dots, v_n\}$ are the respective visual features of each term, and $M$ is the set of macro-level meta-features describing the document, we aim to find the set of word sub-sequences $S = \{S_1, S_2, \dots, S_K\}$, where each $S_i = \{w_{j_i}, \dots, w_{j_i + k_i - 1}\}$ is an extracted keyphrase from the document text that is maximally salient and best represents the key points of the document.
We adopt the method used by Sun et al. (2020) (BERT2Tag), where KPE is modeled as a sequence labeling problem on the text terms $W$. For a given document, each term is assigned one of five labels, namely $\{O, B, I, E, U\}$, which represent nOn-keyphrase, Begin word of the keyphrase, mIddle word of the keyphrase, End word of the keyphrase, and Uni-word keyphrase, respectively. In the following sections, we elaborate on the details of training and testing for this sequence labeling task.
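To make the tagging scheme concrete, here is a minimal sketch (a hypothetical helper, not the released implementation) that converts a document's gold keyphrases into the five-tag sequence described above:

```python
# Hypothetical helper illustrating the O/B/I/E/U scheme; not the authors' code.
def tag_sequence(terms, keyphrases):
    """terms: list of tokens; keyphrases: list of token lists."""
    tags = ["O"] * len(terms)          # default: nOn-keyphrase
    for phrase in keyphrases:
        k = len(phrase)
        for i in range(len(terms) - k + 1):
            if terms[i:i + k] == phrase:
                if k == 1:
                    tags[i] = "U"              # Uni-word keyphrase
                else:
                    tags[i] = "B"              # Begin word
                    tags[i + k - 1] = "E"      # End word
                    for j in range(i + 1, i + k - 1):
                        tags[j] = "I"          # mIddle word(s)
    return tags

print(tag_sequence(
    ["open", "domain", "keyphrase", "extraction", "matters"],
    [["open", "domain"], ["keyphrase", "extraction"]],
))
# ['B', 'E', 'B', 'E', 'O']
```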

Model Structure
We divide the web KPE task into two steps: multimodal strategy induction, where specific tactics are applied to micro-level multimodal features, and strategy selection, where macro-level features are used to choose the best available strategy matching the form of the current page. We devise Strategy-based Multimodal ARchiTecture for KeyPhrase Extraction (SMART-KPE), a multimodal framework that extends the sequence labeling foundation. Figure 1 illustrates the architecture of SMART-KPE. Specifically, it incorporates three components: Unimodal Encoder, Multimodal Predictor and Meta-Feature Constructor.

Multimodal Strategy Induction
In this step, strategies are learned from selected subsets of the available micro-level features for the KPE task. Specifically, our model first generates contextualized embeddings for the textual and visual features separately in the Unimodal Encoder. These embeddings are fused and subsequently fed into several distinct strategy-specific sequence labeling networks in the Multimodal Predictor, each generating a separate probability distribution over each token's possible tags.
The micro-level features used in this step are provided by the dataset and are specified in Section 4.1.

Unimodal Encoder
The Unimodal Encoder is designed to build each term's multimodal representation based on both the textual and visual modalities. We use a pretrained uncased BERT model (Devlin et al., 2018) to generate the contextualized term embeddings. Similarly, we apply a separate transformer (Vaswani et al., 2017) to the visual features. This can be formulated as:

$w_{\mathrm{CLS}}, w_1, \dots, w_n = \mathrm{BERT}(W)$ (1)

$v_1, \dots, v_n = \mathrm{VisTransformer}(V)$ (2)

where $w_i$ and $v_i$ are the contextualized text and visual embeddings of word $w_i$, respectively, and $w_{\mathrm{CLS}}$ is the BERT representation of the [CLS] token, which we use later as a representation of the whole document in the Meta-Feature Constructor. We motivate our handling of visual features with two intuitions. First, just as with text, the same raw visual features can achieve completely different visual effects depending on the page context: e.g., a word with font size 20 might be commonly used in one web page while marking the largest word in another. Second, the behavior and characteristics of the visual and textual modalities differ from one another. Therefore, the first-step self-attention should be modeled in separate networks for text and visual features.
At the end of the Unimodal Encoder, the textual and visual embeddings are concatenated as the final representation of each term and fused globally with another transformer:

$e_i = [w_i; v_i]$ (3)

$t_1, \dots, t_n = \mathrm{Transformer}(e_1, \dots, e_n)$ (4)
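The following PyTorch sketch shows one plausible realization of the Unimodal Encoder (Eqs. 1-4). It is a sketch under stated assumptions rather than the released implementation: the visual feature dimensionality, projection size, and layer and head counts are illustrative, and we assume the visual feature vectors arrive already aligned one-to-one with the BERT input tokens.

```python
# A minimal sketch of the Unimodal Encoder; sizes are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import BertModel

class UnimodalEncoder(nn.Module):
    def __init__(self, vis_dim=20, vis_hidden=128, n_heads=4):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        d_text = self.bert.config.hidden_size              # 768
        self.vis_proj = nn.Linear(vis_dim, vis_hidden)     # project raw visual features
        vis_layer = nn.TransformerEncoderLayer(vis_hidden, n_heads, batch_first=True)
        self.vis_transformer = nn.TransformerEncoder(vis_layer, num_layers=2)    # Eq. 2
        fuse_layer = nn.TransformerEncoderLayer(d_text + vis_hidden, n_heads,
                                                batch_first=True)
        self.fuse_transformer = nn.TransformerEncoder(fuse_layer, num_layers=2)  # Eq. 4

    def forward(self, input_ids, attention_mask, visual_feats):
        # Eq. 1: contextualized text embeddings; w[:, 0] is w_CLS.
        w = self.bert(input_ids=input_ids,
                      attention_mask=attention_mask).last_hidden_state
        # Eq. 2: contextualized visual embeddings (per-token alignment assumed).
        v = self.vis_transformer(self.vis_proj(visual_feats))
        e = torch.cat([w, v], dim=-1)    # Eq. 3: per-term concatenation
        t = self.fuse_transformer(e)     # Eq. 4: global multimodal fusion
        return t, w[:, 0]                # fused term embeddings and w_CLS
```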

Multimodal Predictor
The Multimodal Predictor consists of $N$ label predictors representing $N$ different extraction strategies. Each predictor independently takes the fused term embeddings given by Eq. 4 and generates a tag score distribution $P_{i,k}$ for term $i$ using a 2-layer feed-forward network:

$P_{i,k} = \mathrm{softmax}(\mathrm{FFN}_k(t_i))$ (5)

Note that all the strategies here are defined implicitly: we do not introduce any human-assigned prior related to the identification of page type. Rather, we allow latent information on page types to emerge and differentiate the strategies through the network's optimization process. This avoids a time-consuming labeling process and also mitigates the risk that page types that seem intuitive from a cognitive standpoint might nevertheless offer no utility within the optimization.
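A matching sketch of the Multimodal Predictor, under the same caveats (the hidden size and the number of strategies $N$ are illustrative assumptions):

```python
# A minimal sketch of the Multimodal Predictor; N and sizes are assumptions.
import torch
import torch.nn as nn

class MultimodalPredictor(nn.Module):
    def __init__(self, d_model=896, n_strategies=6, n_tags=5, hidden=256):
        super().__init__()
        # One 2-layer feed-forward tagger per extraction strategy (Eq. 5).
        self.predictors = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, hidden), nn.ReLU(),
                          nn.Linear(hidden, n_tags))
            for _ in range(n_strategies))

    def forward(self, t):                # t: (batch, n_terms, d_model)
        # Per-strategy tag distributions: (batch, N, n_terms, n_tags).
        return torch.stack(
            [torch.softmax(p(t), dim=-1) for p in self.predictors], dim=1)
```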

Strategy Selection
In order to obtain the overall sequence labeling result, the Meta-Feature Constructor encodes the macro-level meta-features to perform a weighted selection over the predictors in the Multimodal Predictor. In this work, we use the following kinds of macro-level meta-features:

• Whole-Text Representation: $w_{\mathrm{CLS}}$, the representation of the [CLS] token of our BERT model in the Unimodal Encoder, serves as an estimated overall textual representation of both the web document and the title.

• Snapshot: the visual embedding extracted from a snapshot of the original web page using ResNet-152 (He et al., 2016). This is a novel aspect of our work that offers the model the opportunity to identify characteristics of the overall layout and appearance of the web page.
This architecture is designed to flexibly incorporate more meta-features given extra data or resources. All the meta-features are concatenated into a meta-feature embedding $m$, after which a feed-forward network is applied to generate a normalized $N$-dimensional selector vector:

$(\tilde{s}_1, \dots, \tilde{s}_N) = \mathrm{FFN}(m)$ (6)

$s_k = \mathrm{softmax}_k(\tilde{s}_1, \dots, \tilde{s}_N)$ (7)

We use the selector weights $s_k$ to combine the $N$ sequence label predictors in the Multimodal Predictor and generate the overall probability that each tag is assigned to a term:
$p_{i,T} = \sum_{k=1}^{N} s_k \, P_{i,k}(T)$ (8)

where $p_{i,T}$ is the probability of term $i$ being labeled as tag $T \in \{O, B, I, E, U\}$, as described in Section 3.1.
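A sketch of this selection step, combining Eqs. 6-8 (the meta-feature dimensionality and hidden size are illustrative assumptions):

```python
# A minimal sketch of strategy selection (Eqs. 6-8); sizes are assumptions.
import torch
import torch.nn as nn

class StrategySelector(nn.Module):
    # meta_dim assumes BERT [CLS] (768) + ResNet-152 pooled features (2048).
    def __init__(self, meta_dim=768 + 2048, n_strategies=6, hidden=256):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(meta_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_strategies))

    def forward(self, m, strategy_probs):
        # m: (batch, meta_dim) concatenated meta-features.
        # strategy_probs: (batch, N, n_terms, n_tags) from the Multimodal Predictor.
        s = torch.softmax(self.ffn(m), dim=-1)       # Eqs. 6-7: selector weights
        # Eq. 8: selector-weighted mixture over the N strategies.
        return torch.einsum("bk,bknt->bnt", s, strategy_probs)
```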

Training and Keyphrase Prediction
The SMART-KPE model is trained in an end-to-end fashion using term-wise cross-entropy loss:

$\mathcal{L} = -\sum_{i=1}^{n} \log p_{i, y_i}$ (9)

where $y_i$ is the correct tagging label for word $w_i$ according to its relation to the gold keyphrases. For keyphrase prediction, after SMART-KPE predicts the tag probability distribution for each term, the score of each candidate phrase is calculated as:

$\mathrm{score}(S_{i:i+k}) = \min_{i \le j \le i+k} p_{j, T_j}$ (10)

where $T_j$ is the tag that term $w_j$ must take for $S_{i:i+k}$ to be a keyphrase, and $p_{j, T_j}$ is the predicted probability of $w_j$ being labeled as $T_j$. Min-pooling is used here so that keyphrases are scored in a comparable fashion regardless of length.
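The min-pooling rule of Eq. 10 can be sketched as follows (the tag-index order and the helper are illustrative, not the released code):

```python
# A minimal sketch of keyphrase scoring with min-pooling (Eq. 10).
TAGS = {"O": 0, "B": 1, "I": 2, "E": 3, "U": 4}   # assumed tag order

def span_score(probs, i, k):
    """Score the candidate keyphrase covering terms i .. i+k-1.

    probs: per-term tag probabilities, indexable as probs[term][tag].
    """
    if k == 1:
        return probs[i][TAGS["U"]]                 # uni-word keyphrase
    needed = [probs[i][TAGS["B"]]]                 # begin word
    needed += [probs[j][TAGS["I"]] for j in range(i + 1, i + k - 1)]  # middle
    needed.append(probs[i + k - 1][TAGS["E"]])     # end word
    # Min-pooling keeps scores comparable across phrase lengths.
    return min(needed)
```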

Dataset
We use OpenKP as the main dataset for our task. OpenKP consists of ∼150K documents sampled from the Bing search engine, with no restriction on the domain or type of the original web pages.
For each document, the following information is given:

• URL: The link to the respective web page.
• Text: Cleaned body text of a document.
• Visual DOM features: A set of vectors representing the visual characteristics of text terms, listed in Table 1. For each term, features describing itself and its parent block in the DOM Tree are included.
The keyphrases for each document in the dataset were labeled by expert annotators, with each document assigned 1-3 keyphrases. As a requirement, all keyphrases appear in the original document. Detailed statistics of OpenKP are displayed in Table 2. The original dataset provides no meta-features other than the website URL; we therefore downloaded the title of each website and concatenated the titles with the cleaned body text as the text input of our model.

Baselines and Evaluation Metrics
We compare SMART-KPE with the following baselines on the OpenKP dataset: BLING-KPE (Xiong et al., 2019), as well as BERT2Tag and BERT2Joint (Sun et al., 2020), which are directly comparable with the basic version of SMART-KPE, which also uses BERT. For further comparison, we also report the strongest baseline result (RoBERTa2Joint) presented by Sun et al. (2020) and compare it to a RoBERTa-based variant of SMART-KPE in Section 5.1.
For evaluation of our generated keyphrases, we follow the official MS-MARCO guide and evaluation code. Retrieval metrics include Precision, Recall, and F1 at positions 1, 3, and 5, of which F1@3 is considered the main metric.
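As a reference for how these metrics behave, here is a small sketch of P/R/F1@k under exact, case-insensitive matching; the official evaluation code is authoritative and may differ in details such as stemming:

```python
# A sketch of Precision/Recall/F1 at rank k; the official script is authoritative.
def prf_at_k(predicted, gold, k):
    """predicted: ranked list of phrases; gold: collection of gold phrases."""
    topk = [p.lower() for p in predicted[:k]]
    gold = {g.lower() for g in gold}
    correct = len(set(topk) & gold)
    p = correct / k
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1

print(prf_at_k(["lyrics", "adele", "hello"], {"adele", "hello"}, 3))
# (0.666..., 1.0, 0.8)
```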

Implementation and Training Details
The model was implemented in PyTorch (Paszke et al., 2019) using the huggingface reimplementation of BERT (Wolf et al., 2019). Table 4 lists the parameters of our model. The maximum document length is 512 tokens due to BERT limitations; documents are zero-padded or truncated to this length. The model contains approximately 120M trainable parameters and was trained on a single GeForce GTX 1080 Ti GPU for around 12 hours to achieve its best performance.

Evaluation Results
Experimental results on the OpenKP dataset are listed in Table 3. We perform experiments on the full SMART-KPE model and its three variants, in which only micro-level visual features (SMART-KPE-Micro), only macro-level meta-features (SMART-KPE-Macro), or neither set of features (SMART-KPE-Skeleton) are applied, respectively. We see that all variants of BERT-based SMART-KPE outperform BERT2Tag and BERT2Joint on all metrics, suggesting the effectiveness of our feature construction and strategy selection.
We further explore how the different kinds of features function when extracting keyphrases. SMART-KPE-Skeleton outperforms the baselines even without micro-level visual features or macro-level meta-features, owing to the addition of title information as well as a more effective model structure. We observe further improvements as each multimodal feature component is added on top of SMART-KPE-Skeleton, with the full model being the best combination. This analysis reveals the separate and joint effects of constructing meta-features and performing strategy selection, which together form a complete approach to keyphrase extraction.
We also replace the BERT base in the Unimodal Encoder with RoBERTa and switch the predictors in the Multimodal Predictor from the tagging method to the joint extraction method (Sun et al., 2020) to enable a fair comparison with the best-performing baseline, RoBERTa2Joint. Experimental results are listed in the last two rows of Table 3. RoBERTa-based SMART-KPE outperforms both the baseline model and SMART-KPE-Full in all metrics, further demonstrating our model's flexibility to benefit from a more advanced text-based extraction backbone.

Case Study of Visual Feature Usage
We demonstrate the effects of introducing micro-level visual features with three cases from the validation set of OpenKP, shown in Table 5. We present prediction results from SMART-KPE-Skeleton (SMART-KPE using only text features) and the complete SMART-KPE model. Snapshots of the original web pages are presented in Figure 2. Note that in the original data, punctuation is absent and keyphrases are case-insensitive; the text snippets shown here are restored from the original websites.
Cases #1 and #2 show how micro-level visual features help locate the correct keyphrases. In Case #1, all the keyphrases sit in the middle part of the web page and are made more salient than the surrounding words by their distinct colors. The full model, successfully utilizing micro-level visual features, focuses on all of the keyphrases, while the skeleton model chooses a topic word in the first paragraph, "Lyrics", resulting in a mistake. Case #2 shows a similar situation, where the model with visual features finds the proper keyphrases, which are much larger in font size, while the text-only model selects nouns elsewhere. On the other hand, Case #3 demonstrates a typical kind of web page where visual features can be misleading: an indexing page. On this kind of page, words that are larger in size and in bold are mostly entries pointing to detailed contents, while the summarizing words before them, which are sometimes hidden in the small print and easy to overlook, are more likely to be keyphrases.

(Figure 2: Snapshots of the original web pages for the cases in Table 5. Texts with green bounding boxes are keyphrases predicted by SMART-KPE-Full; texts with red bounding boxes are keyphrases predicted by SMART-KPE-Skeleton. The former model prefers larger, bold, and colorful texts.)

Analysis of Prediction Strategy Selection
In order to confirm that our predictors apply different tactics rather than simply duplicating each other, we calculate the overlap rate among the extracted keyphrases of different predictors.
We define the overlap rate for $n$ sets as

$\mathrm{OverlapRate}(S_1, \dots, S_n) = \dfrac{\sum_{k=2}^{n} n_k}{\sum_{k=1}^{n} n_k}$ (11)

where $n_k$ is the number of elements shared by exactly $k$ different sets. An overlap rate of 0 indicates that no element occurs in more than one set, while 1 indicates totally identical sets. As shown in Table 6, the overlap rates are low at all positions, which means the predictors generate diverse keyphrases, leaving room for the selector to pick the final results using macro-level features. We also looked back into the data to inspect the extracted keyphrases in context. Figure 3 presents the locations of the keyphrases extracted by the different predictors. We can broadly attribute the differences to: (1) Focus bias across modalities: some multimodal predictors behave similarly to a purely textual predictor while others focus more on distinctly visual features. (2) Chunking bias of words: different predictors chunk words in different ways. (3) Ranking bias of keyphrases: even when the same keyphrases are selected in different instances, the influence of their scores depends on the other scores being compared; the more keyphrases being compared, the less utility an individual score has. We see an increase in overlap rate when more keyphrases are considered.
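Under the definition above, the overlap rate is straightforward to compute; `overlap_rate` is an illustrative helper, not the authors' code:

```python
# A sketch of the overlap rate (Eq. 11): the fraction of distinct extracted
# keyphrases that are shared by at least two of the n predictors' sets.
from collections import Counter

def overlap_rate(sets):
    counts = Counter(x for s in sets for x in set(s))   # how many sets contain each element
    if not counts:
        return 0.0
    shared = sum(1 for c in counts.values() if c >= 2)  # sum of n_k for k >= 2
    return shared / len(counts)                         # over sum of n_k for k >= 1

print(overlap_rate([{"a", "b"}, {"b", "c"}, {"c", "d"}]))  # 2 shared of 4 distinct -> 0.5
```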

Paragraph Length
We investigate whether the paragraph or segment length affects performance. In our experiments, we truncate each document and only use the first 512 words. Beyond the statistics indicating that the majority of keyphrases appear near the front of most web pages, we calculated P@3 and F1@3 for different maximum document lengths to see how the extent of truncation affects the model's ability to generate reasonable keyphrases. The results are presented in Figure 4. From the figure, we conclude that in our experiments, P@3 and F1@3 are not heavily influenced by document length or the amount of truncation. This is consistent with the intuition that keyphrases often appear at the top of a document (title, heading, first paragraph). Although increasing the maximum text length has advantages, it is resource-intensive and increases the difficulty of training, because the amount of noise relative to the amount of signal grows with the maximum length.

Conclusions and Future Work
In this work we propose the Strategy-based Multimodal ARchiTecture for KeyPhrase Extraction (SMART-KPE) as a new state-of-the-art method for multimodal web-page keyphrase extraction. Unlike traditional keyphrase extraction models that focus mainly on text, SMART-KPE illustrates the advantage of incorporating other modalities to help with keyphrase localization and salience prediction. Our proposed model outperforms several state-of-the-art baselines with the introduction of multimodal information. Through several case studies, we further illustrate how micro- and macro-level features lead to the model's correct or incorrect selections.
As a first attempt to introduce macro-level meta-features for strategy selection, we believe there is much potential to refine and improve our approach. One high-level idea is to add further supervision to the current selector model based on empirical web page clustering, to better train the model to develop a set of more distinct keyphrase prediction strategies and more effectively adjust the respective selector weights. We also plan to add more types of meta-features to generate richer multimodal representations. Furthermore, the SMART-KPE framework can be easily adapted to other NLP tasks, and we believe there is much potential in combining SMART-KPE with different models to further boost performance on open-domain KPE and other web-related tasks.