PubSE: A Hierarchical Model for Publication Extraction from Academic Homepages

Publication information in a researcher’s academic homepage provides insights about the researcher’s expertise, research interests, and collaboration networks. We aim to extract all the publication strings from a given academic homepage. This is a challenging task because the publication strings in different academic homepages may be located at different positions with different structures. To capture the positional and structural diversity, we propose an end-to-end hierarchical model named PubSE based on Bi-LSTM-CRF. We further propose an alternating training method for training the model. Experiments on real data show that PubSE outperforms the state-of-the-art models by up to 11.8% in F1-score.


Introduction
Researchers often list their publications in their academic homepages. These publications provide insights about the researchers' expertise, research interests, and collaboration networks. Extracting publications from a researcher's homepage is an essential step in extracting the researcher's profile (Tang et al., 2010). In this study, we aim to extract every publication from a researcher's homepage. For ease of discussion, we call such a publication item a publication string. Figure 1 illustrates the studied problem. There are two publications on the homepage shown in the figure, a journal article and a conference paper. We aim to extract them as two separate publication strings.
Extracting publication strings from academic homepages helps bypass the problem of name ambiguity (i.e., different authors with the same name) (Zhu et al., 2018) that arises when extracting publication strings from indexing sites such as DBLP and PubMed. However, extracting publication strings from academic homepages directly has its own challenges: (i) The list of publications may be located anywhere in a homepage with varying contexts. The structure of the list and the formatting styles of a publication string can vary vastly across different homepages. For example, some researchers list some of their publications with more details, such as full venue names, volume and page information, while listing the other publications in a concise way. Also, some researchers group their publications by year or by topic. (ii) A publication string may contain multiple lines of text (cf. Figure 1), and there may not be a clear boundary between two publication strings. (iii) There may be strings in an academic homepage that share very similar structures and styles with publication strings, such as records of conference presentations (cf. Figure 1). Previous work (Hong et al., 2009; Chung et al., 2012) focuses on feature and rule engineering and cannot accommodate the above challenges.
To address these challenges, we propose a model named PubSE to extract every publication string from an academic homepage. PubSE has two characteristics: (i) The model structure reflects the structure of a list of publications, through its loss functions at both the text line level and the webpage level. (ii) The training process of the model utilizes both text line-level and webpage-level information via an alternating training procedure to reduce overfitting. Our PubSE model can extract publication strings in non-trivial cases, such as multi-line publication strings from a non-continuous publication list. We make the following contributions: (i) We create a dataset of 2,500 homepages, in which each publication string is labeled. (ii) We address the problem of publication string extraction by end-to-end learning, without feature engineering. (iii) We propose a model that can learn the structures of publication lists and the styles of publication strings. We also propose an alternating training method that can reduce overfitting and further improve the prediction accuracy.

Related Work
Earlier studies extract publication strings from research papers (Peng and McCallum, 2006; Councill et al., 2008; Tkaczyk et al., 2015). Such a problem is simpler for two reasons: (i) The reference list of a research paper usually appears at a fixed position (e.g., end of paper) and is continuous. (ii) The references are usually well-formatted and have few format variations since they may be generated by software such as LaTeX.
For extracting publication strings from academic homepages, previous studies use either rule-based methods (Hong et al., 2009; Yang and Ho, 2010) or a hybrid of machine learning and rule-based methods (Chung et al., 2012).
For example, Chung et al. (2012) develop a system named PRM that first segments an HTML DOM tree and its contents based on HTML tags. Then, PRM uses a linear chain CRF model to label the different parts of the tree. Based on the labels, it refines the publication string boundaries by heuristic rules to produce final predictions.
Relying on the HTML DOM tree structure makes it difficult to train a machine learning based model for publication extraction because: (i) Text in a publication string may be spread across many different DOM tree nodes. (ii) The DOM tree structure, which previous web data record extraction systems (Liu et al., 2003; Furche et al., 2014; Omari et al., 2016) rely on, may vary even for the same webpage content.
As a result, we do not use HTML tags in our model. Instead, we work on the visible text contents. To the best of our knowledge, no existing studies for publication string extraction can effectively model the structure of publication lists.

Dataset
To the best of our knowledge, there is no public dataset of academic homepages that has labeled publication strings.
We downloaded 2,500 academic homepages from 100 universities around the world. We use Selenium, an open-source automated rendering software, to render the webpages. We collect visible texts from the webpages and then manually tag all the publication strings in them. During tagging, we mark the beginning and ending byte offsets of each publication string. Among the 2,500 academic homepages, 723 homepages (28.9%) contain publication lists, which consist of a total of 13,237 publication strings. Among the 723 homepages that contain publication strings, there are 117 homepages (16.2%) that contain multi-line publication strings. On average, there are 732.1 (std=1583.3) tokens, 89.9 (std=141.6) lines, and 18.3 (std=35.4) publication strings per homepage. We call this dataset HomePub.
Each publication string in the HomePub dataset is annotated by two annotators. Disagreement is resolved by a third annotator. At the publication string level, the annotators agree on 83.76% of the publications, and the Cohen's kappa is 0.2084.
We have developed a program named PageTagger to assist the annotation. On average, it takes about 2.5 minutes to annotate one academic homepage when using the PageTagger tool.
Note that our annotation does not consider the following as publication strings: (i) Master or PhD theses; (ii) working papers; (iii) seminars, invited talks, or presentations; and (iv) patents. Our annotation also excludes the numbers (e.g., [1] or [i]) if the publication strings are in a numbered list.

Methods
We summarize the baseline models in Section 4.1 and present our PubSE model in Section 4.2.
CNN-Sentence: In the HomePub dataset, over 80% of the publication strings are in a single line. The problem of extracting publication strings can therefore be viewed as a single-line text classification problem (Kim, 2014; Yang et al., 2016; Joulin et al., 2017). Following Kim (2014), we implement a line-level classification model. We use pre-trained 300-dimensional GloVe embeddings in this model (the same embeddings are used across all the models that require word embeddings as the input). This model predicts whether each line in the webpage is a publication string.
Bi-LSTM-CRF: The problem of extracting publication strings can also be viewed as a sequence labeling task (Lample et al., 2016; Gui et al., 2017; Liu et al., 2017), where there are two possible labels for each token, publication (I) or non-publication (O). A consecutive sequence of I tokens forms a publication string. Sequence labeling approaches can capture correlations between labels and words, as well as the words themselves. Following Ma and Hovy (2016), we implement a Bi-LSTM-CRF model. We concatenate token-level and character-level embeddings as the model input, to better deal with out-of-vocabulary tokens.
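The step from token labels to publication strings can be illustrated with a minimal sketch. It assumes the tagger has already produced one I/O label per token; the function name `extract_spans` and the toy tokens are hypothetical and not taken from the paper.

```python
from typing import List, Tuple


def extract_spans(labels: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) token indices, end exclusive, of each maximal run of 'I'."""
    spans = []
    start = None
    for i, label in enumerate(labels):
        if label == "I" and start is None:
            start = i                      # a new publication string begins
        elif label != "I" and start is not None:
            spans.append((start, i))       # the current run ends before token i
            start = None
    if start is not None:                  # a run that reaches the end of the page
        spans.append((start, len(labels)))
    return spans


# Toy page: a heading, one publication, an unrelated token, another publication.
tokens = "Publications A. Author . Title one . 2018 Contact B. Author . Title two . 2017".split()
labels = ["O"] + ["I"] * 7 + ["O"] + ["I"] * 7

spans = extract_spans(labels)
strings = [" ".join(tokens[s:e]) for s, e in spans]
```

In this sketch, two publication strings with no O token between them would come out as a single span, so the boundary between adjacent publications has to be reflected in the predicted labels themselves.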

PubSE Model
To capture the hierarchical structure of a homepage and address the issue of multi-line publication strings, we propose a hierarchical model named PubSE with an alternating training method, which alternates between training a line-level and a webpage-level model to reduce overfitting. Figure 2 illustrates the model structure.
We incorporate both line-level and webpage-level information in the model since the predictions depend on both local (line-level) and global (webpage-level) context. On one hand, local information such as word embeddings and morphology information is crucial for the predictions over individual lines. On the other hand, global information is also necessary, e.g., to help identify the starting and ending positions of a publication list and the boundaries between multi-line publication strings.
Line-level model: As shown in Figure 2, the left Bi-LSTM network $\pi_\theta(\ell \mid s)$ specializes in line-level inputs, where each mini-batch of input consists of lines in a webpage. On top of this sub-module, we add another layer $\sigma_{\phi_s}(b_s \mid s)$ to model whether each line contains a publication string. The sub-module is trained with a weighted sum of a token-level loss and a line-level loss:
$$\lambda_{\theta_s} L\big(\pi_\theta(\ell \mid s),\, y_t\big) + \lambda_{\phi_s} L\big(\sigma_{\phi_s}(b_s \mid s),\, y_l\big)$$
where $\ell$ denotes the predicted labels for tokens $t_{si}$ in a line $s$, and $b_s$ denotes the predicted label for a line, i.e., whether the line contains a publication string or not; $y_t$ and $y_l$ denote the ground truth labels for each token and line. Hyperparameters $\lambda_{\theta_s}$ and $\lambda_{\phi_s}$ are the coefficients for the token-level and line-level terms, while $\theta$ and $\phi_s$ are the parameters of the two networks; $L$ is the cross-entropy loss:
$$L(\hat{y}, y) = -\sum_i y_i \log \hat{y}_i$$
where $\hat{y}$ denotes the predicted label, and $y$ denotes the ground truth label.
The input to the line-level model is each line in the form of $(e_1, c_1), (e_2, c_2), \ldots, (e_n, c_n)$, where $e_k$ and $c_k$ are the word embedding and character embedding of the token $t_{sk}$. Network $\pi_\theta(\ell \mid s)$ outputs a label for each token, while network $\sigma_{\phi_s}(b_s \mid s)$ gives a binary output for each line, indicating whether it contains a publication string. The extraction result is based on the output of $\pi_\theta(\ell \mid s)$.
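The combination of the token-level and line-level cross-entropy terms can be sketched numerically as follows. This is only an illustration under stated assumptions: the predicted distributions are given as plain probability lists, the token-level term is averaged over the tokens of the line (the text does not say how it is aggregated), and the function names are hypothetical. The default coefficients 1 and 0.05 are the values reported in the experimental settings.

```python
import math
from typing import List


def cross_entropy(pred: List[float], gold: int) -> float:
    """Cross-entropy for one example with a one-hot target: -log p(gold class)."""
    return -math.log(pred[gold])


def line_level_loss(
    token_probs: List[List[float]],  # predicted label distribution for each token
    token_gold: List[int],           # ground truth token labels (0 = O, 1 = I)
    line_probs: List[float],         # predicted distribution for the whole line
    line_gold: int,                  # 1 if the line contains a publication string
    lam_token: float = 1.0,          # token-level coefficient
    lam_line: float = 0.05,          # line-level coefficient
) -> float:
    # ASSUMPTION: the token-level term is the mean over the tokens in the line.
    token_loss = sum(
        cross_entropy(p, y) for p, y in zip(token_probs, token_gold)
    ) / len(token_gold)
    line_loss = cross_entropy(line_probs, line_gold)
    return lam_token * token_loss + lam_line * line_loss


# A two-token line where every prediction is maximally uncertain.
loss = line_level_loss(
    token_probs=[[0.5, 0.5], [0.5, 0.5]],
    token_gold=[0, 1],
    line_probs=[0.5, 0.5],
    line_gold=1,
)
```

With a small line-level coefficient, the auxiliary line label acts as a light regularizer while the token-level term dominates the objective.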
Webpage-level model: Similarly, the right Bi-LSTM sub-module $\pi_\theta(\ell \mid d)$ in Figure 2 specializes in webpage-level inputs, where the whole homepage $d$ is supplied to the model as a long sequence of token embeddings. We add another layer $\sigma_{\phi_d}(b_d \mid d)$ to reflect whether the homepage contains publication strings. The sub-module is trained with a weighted sum of a token-level loss and a webpage-level loss:
$$\lambda_{\theta_d} L\big(\pi_\theta(\ell \mid d),\, y_t\big) + \lambda_{\phi_d} L\big(\sigma_{\phi_d}(b_d \mid d),\, y_w\big)$$
where $b_d$ denotes whether the document contains publications or not; $\lambda_{\theta_d}$ and $\lambda_{\phi_d}$ are the token-level and webpage-level coefficients; $y_w$ denotes the ground truth label for a webpage; $\theta$ and $\phi_d$ are network parameters. Note that the left and right sub-modules collaborate by sharing network weights $\theta$.
Alternating training method: Inspired by the training procedure of curriculum learning (Bengio et al., 2009) and soft-landing, we adopt an alternating training procedure controlled by a periodic switching function of the epoch number $k$, built from the Heaviside step function $H$, where $T$ controls the period of the function. In the $k$-th epoch, we train only one of the sub-modules, as selected by this function. Our intuition is that the training of the line-level and webpage-level networks can reinforce each other and reduce overfitting. If we only train the model on the line-level input, the model will lose all the long-term dependency information. For example, a string that describes a thesis resembles one that describes a conference paper. To filter out such strings, we need to rely on indicators that may reside in a different line, such as a heading "Dissertations supervised". On the other hand, if we only train the model on webpage-level input, the model may be dominated by the longest line on the homepage, such as biography information. Our alternating training procedure balances the two factors and can better model the hierarchical structure of a publication list.
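To make the idea of a Heaviside-controlled schedule concrete, the sketch below shows one hypothetical switching rule. It assumes a cosine as the periodic signal and assumes that a value of 1 selects the line-level sub-module; neither choice is confirmed by the text, and `submodule_for_epoch` is an illustrative name. With $T = 1/\pi$, the value used in the experiments, this particular rule switches sub-modules every epoch.

```python
import math


def heaviside(x: float) -> int:
    """Heaviside step function H(x): 1 for positive input, 0 otherwise."""
    return 1 if x > 0 else 0


def submodule_for_epoch(k: int, T: float = 1 / math.pi) -> str:
    """Pick which sub-module to train in epoch k.

    ASSUMPTION: the periodic signal is cos(k / T) and H = 1 maps to the
    line-level sub-module. This is a stand-in, not the paper's exact function.
    """
    return "line" if heaviside(math.cos(k / T)) == 1 else "webpage"


# Epochs 0..5 under the assumed rule.
schedule = [submodule_for_epoch(k) for k in range(6)]
for k, name in enumerate(schedule):
    print(f"epoch {k}: train the {name}-level sub-module")
```

Because both sub-modules share the weights $\theta$, each epoch updates the shared encoder from a different view of the same page, which is the mechanism the text credits for reducing overfitting.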
The PubSE model can capture and exploit information on the webpage from four different perspectives: (i) character-level information such as word morphology; (ii) token-level information such as word context; (iii) line-level information such as whether a line is a publication string; and (iv) webpage-level information such as whether a webpage contains publication strings.

Experiments
Settings and Evaluation: We divide the HomePub dataset by a 60-40 split and train our model on 60% of the total data. We use 20% of the training set as a validation set for early stopping and hyperparameter tuning. The optimal hyperparameters are obtained with a standard grid search procedure on the validation set. We use 40 as the batch size for the line-level model and 1 for the webpage-level model. We set $\lambda_{\theta_s}$, $\lambda_{\phi_s}$, $\lambda_{\theta_d}$, $\lambda_{\phi_d}$, and $T$ to 1, 0.05, 1, 0.3, and $1/\pi$, respectively.
We use precision, recall, and F1-score to measure the performance, and we report both exact and relaxed matching performance. In exact matching, a publication string is considered to be correctly extracted only if it exactly matches a publication string in the ground truth. In relaxed matching, we tolerate a mismatch of up to 15% of the tokens of a publication string (i.e., a publication string is considered correct if it contains at least 85% of the tokens of a publication string in the ground truth). We also list the model performance on webpages in the test set that contain multi-line publication strings.
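The two matching criteria can be sketched as follows. This is an illustrative scorer, not the evaluation script used in the experiments: it assumes that publication strings are represented as token offset pairs, that each ground truth string can be matched at most once, and that matching is done greedily in prediction order. The names `coverage` and `prf` are hypothetical.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def coverage(pred: Span, gold: Span) -> float:
    """Fraction of the gold span's tokens that the predicted span contains."""
    overlap = max(0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    return overlap / (gold[1] - gold[0])


def prf(pred: List[Span], gold: List[Span],
        min_coverage: Optional[float] = None) -> Tuple[float, float, float]:
    """Precision, recall, F1. Exact matching if min_coverage is None,
    otherwise relaxed matching with the given coverage threshold."""
    unused = list(gold)  # ASSUMPTION: each gold string is matched at most once
    tp = 0
    for p in pred:
        for g in unused:
            ok = (p == g) if min_coverage is None else coverage(p, g) >= min_coverage
            if ok:
                unused.remove(g)
                tp += 1
                break
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [(0, 20), (25, 45)]
pred = [(0, 20), (26, 45), (50, 60)]  # one exact hit, one near miss, one false positive

exact = prf(pred, gold)                       # only (0, 20) counts
relaxed = prf(pred, gold, min_coverage=0.85)  # (26, 45) covers 19 of 20 gold tokens
```

In this toy case the near miss that drops a single leading token is rejected under exact matching but accepted under relaxed matching, which is the kind of boundary error the relaxed metric is meant to tolerate.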
Results: We report the experiment results in Table 1. The result shows that the proposed PubSE model consistently outperforms all the baselines with a statistically significant margin, and the advantage is up to 11.8%. In particular, the use of the webpage-level sub-module helps PubSE to handle multi-line publication strings, which yields a significant performance gain.
In comparison, PRM struggles to determine which part of a page contains publications. ParsCit requires well-formatted inputs. For example, if publication strings do not contain page numbers, ParsCit tends not to separate the list of publication strings into individual records. CNN-Sentence and Bi-LSTM-CRF give poor results on pages that contain multi-line publication strings.
Notation: L means only the line-level sub-module $\pi_\theta(\ell \mid s)$; LP means the extra layer $\sigma_{\phi_s}(b_s \mid s)$; W means only the webpage-level sub-module $\pi_\theta(\ell \mid d)$; WP means the extra layer $\sigma_{\phi_d}(b_d \mid d)$.
Ablation Study: We also test different variations of our proposed model PubSE, and the results are shown in Table 2.
About 50% improvement over the best baseline is made by training the model with webpage-level input (W), since it is difficult to extract multi-line publication strings without a global view of the whole webpage.
The effect of the alternating training method (L+W) is also significant. The webpage-level model (W) may not handle short lines well, e.g., a line with the text "Conference paper (peer-reviewed)" as shown in Figure 1. This problem is solved by combining the line-level model with the webpage-level model (L+W).
Error analysis: Figure 3 shows typical errors made by various models. Example 1 shows errors that occur in the prediction results of the line-level model. The line-level model does not handle multi-line publication strings well since the predictions for different lines are independent, so the model fails to capture dependencies across lines.
Example 2 shows prediction results given by the webpage-level model. We see that the webpage-level model can make more accurate predictions for multi-line publications. However, it may make false positive predictions for short lines (e.g., "Chapter (peer-reviewed)"), while the line-level model seldom makes such mistakes. This is our motivation for integrating both the line-level and the webpage-level models.
PubSE can avoid most of the errors shown in Examples 1 and 2. Nevertheless, PubSE still makes mistakes in some challenging cases. Example 3 shows such a case, where PubSE does not recognize that "Rhetoric" is a publication title. A possible explanation is that such a short publication title is less common.

Conclusions and Future work
We studied publication string extraction and proposed a model named PubSE for the problem. PubSE models the publication list structure with its hierarchical structure and loss functions. We proposed an alternating training scheme that combines both line-level and webpage-level information, which are crucial for predicting multi-line publication strings. Experiments show that the proposed PubSE model outperforms the state-of-the-art models by up to 11.8% in F1-score.
For future work, we aim to expand our experimental study to a larger scale. We will further consider extracting publication strings from academic homepages of the same organization. Such homepages may share similar templates, which may help improve the extraction accuracy. We also plan to investigate adaptive alternating model training schemes as well as external memory mechanisms such as memory networks.