Bootstrapping Multilingual Metadata Extraction: A Showcase in Cyrillic

Applications based on scholarly data are of ever-increasing importance. This results in disadvantages for areas where high-quality data and compatible systems are not available, such as non-English publications. To help mitigate this imbalance, we use Cyrillic script publications from the CORE collection to create a high-quality data set for metadata extraction. We use our data to train and evaluate sequence labeling models that extract title and author information. Retraining GROBID on our data, we observe significant improvements in precision and recall, and we achieve even better results with a self-developed model. We make our data set, covering over 15,000 publications, as well as our source code freely available.


Introduction
The use of scholarly data is becoming increasingly important as the rate of academic publication keeps rising and automated processing, such as scientometric analysis and scholarly recommendation, gains relevance (Sigurdsson, 2020; Zhang et al., 2020). Consequently, limitations of scholarly data and of approaches based on it translate directly into disadvantages for the affected publications, for example in terms of discoverability and impact. One particular limitation of scholarly data today is the underrepresentation of non-English content (Vera-Baceta et al., 2019; Moskaleva and Akoev, 2019). While supporting multiple languages poses challenges, such as language-specific preprocessing requirements (Grave et al., 2018; McCann, 2020), disregarding non-English work is problematic (Amano et al., 2016; Lynch et al., 2021). To further the availability of high-quality scholarly data beyond the anglophone publication record, we showcase the creation and application of a data set for training and evaluating sequence labeling tasks on Cyrillic publications.
Recent years have seen an increased focus on multilinguality in natural language processing, both in language models (Devlin et al., 2019) and in data sets (Caswell et al., 2021). Furthermore, there are efforts to specifically support languages that use non-Latin scripts (Roark et al., 2020; Pfeiffer et al., 2021). With regard to Cyrillic script languages, there are approaches concerned with named entity linking in Web documents (Piskorski et al., 2021) as well as approaches to extracting keywords from scientific texts (Bolshakova et al., 2019). Model training for these types of information extraction tasks is increasingly done using automatically generated high-quality training data. This has, for example, been done for text extraction from scholarly PDF files (Bast and Korzen, 2017), identification of publication components such as figures and tables in scanned documents (Ling and Chen, 2020), and the parsing of bibliographic references (Grennan and Beel, 2020; Thai et al., 2020).
We extend this approach to non-English scholarly data. To this end, we use Cyrillic script documents from the CORE data set (Knoth and Zdrahal, 2012) to train and evaluate sequence labeling models for identifying publications' metadata (title and authors) in unlabeled text, as illustrated in Figure 1.
Overall, the contributions we make with this paper are as follows.
1. We showcase an effective method for creating high-quality data for training and evaluating metadata extraction sequence labeling models on multilingual scholarly data.
2. We provide a data set for Cyrillic, comprising 15,553 publications spanning three languages and 27 years.
3. We create sequence labeling models that outperform available methods on Cyrillic data.

Data Selection
Although many large scholarly data sets exist nowadays, most are restricted in terms of language coverage, language-related metadata, or availability of full-text documents. The PubMed Central Open Access Subset, for example, only contains Latin script publications, the Semantic Scholar Open Research Corpus (Lo et al., 2020) is restricted to English, and the Microsoft Academic Graph (Sinha et al., 2015; Wang et al., 2019) contains no full texts. Furthermore, none of the aforementioned offers metadata on publications' language. We chose the CORE data set (Knoth and Zdrahal, 2012), a large scholarly data set consisting of PDF documents and metadata aggregated from institutional and subject repositories, because it is not restricted by language, offers full papers, and partly provides language metadata.
To obtain Cyrillic script publications, we first filter the whole collection for the language labels of four Cyrillic script languages, namely Russian, Ukrainian, Bulgarian, and Macedonian, resulting in 23,850 documents. Noticing that many of the items we identified are clustered in certain ID ranges of CORE, we extend our data to roughly 48,000 papers by applying language detection to the PDF files of documents adjacent in the set of CORE IDs. After removal of duplicates (papers with different CORE IDs but identical PDFs) we end up with 27,755 documents.
Examination of our data at this point reveals that it contains documents other than scientific papers, such as lecture notes and lecture schedules, as well as atypically long documents such as whole conference proceedings. To remove these, we perform two filtering steps. First, we remove documents whose title contains any of the words студентiв (UKR: "student"), Конспект лекцiй (UKR: "lecture schedule"), Програма (RUS: "program", as in study program) and Диплом (RUS: "diploma"), leaving around 22,000 documents and changing the distribution of document title lengths as shown in Figure 2.
Second, we drop documents whose length exceeds the 95% quantile (68 pages). Finally, we remove papers for which CORE does not provide basic metadata, and papers for which the plain text was not extractable from the PDF. This leaves us with 15,553 papers, which form the basis for our work and the provided Cyrillic data set.
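The two filtering steps can be illustrated with a minimal sketch. It assumes each document is available as a dictionary with hypothetical `title` and `pages` fields, that title matching is a case-insensitive substring test, and that the quantile is computed with a nearest-rank rule on the documents remaining after the title filter; none of these details are specified above.

```python
import math

# Title keywords named in the text, copied as given there.
EXCLUDED_TITLE_WORDS = ["студентiв", "Конспект лекцiй", "Програма", "Диплом"]


def has_excluded_word(title, words=EXCLUDED_TITLE_WORDS):
    """Return True if the title contains any excluded keyword (assumed case-insensitive)."""
    lowered = title.lower()
    return any(word.lower() in lowered for word in words)


def quantile(values, q):
    """Nearest-rank quantile; an assumed definition, other conventions exist."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def filter_documents(docs, q=0.95):
    """Apply the title keyword filter, then drop documents longer than the q-quantile."""
    kept = [d for d in docs if not has_excluded_word(d["title"])]
    if not kept:
        return []
    cutoff = quantile([d["pages"] for d in kept], q)
    return [d for d in kept if d["pages"] <= cutoff]
```

With the data described above, the page cutoff produced by the second step would correspond to the reported 68 pages.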

Data Preparation
To avoid having to remove large portions of the identified Cyrillic papers due to missing metadata (see previous section), we decide to focus on publications' title and list of authors. To create training data for sequence labeling tasks, we obtain the JSON metadata and PDF of each of the selected publications from CORE. From the PDF, we extract the plain text contained in the first page using PDFMiner, identify the title and authors from the JSON metadata, and insert labels accordingly (see Section 3.2.1 for details).

Data Set
The resulting data set comprises 15,553 papers spanning 27 years and three languages. For each paper, we provide ground truth sequence labeling output in TEI format and as annotated plain text. A detailed breakdown of languages, obtained using fastText (Joulin et al., 2016, 2017) language detection, is shown in Table 1. Languages with fewer than five occurrences throughout the data set are not included. The distribution of papers by publication year is shown in Figure 3. A breakdown of the topics covered by the data set is shown in Table 2. Analysing the origin of papers, we note that 90% originate from either the "A.N.Beketov KNUME Digital Repository" or the "Zhytomyr State University Library."

Table 1: Languages in the data set.
Language     #Documents
Ukrainian    11,708
Russian      3,786
Bulgarian    54

Application
To assess the utility of our data set, we use it to retrain GROBID (Lopez, 2008-2021), a widely used metadata extraction tool (Nasar et al., 2018), as well as a standalone sequence labeling model, and evaluate their performance against an off-the-shelf version of GROBID.

GROBID Training
GROBID utilizes several models for different tasks, each of which can be retrained. Our use case, the extraction of title and author information, concerns the header model, which is based on conditional random fields (CRF). Retraining the header model from scratch using our data set, we note that for a significant portion of PDFs, GROBID is not able to produce plain text on which the CRF would then be applied. Because of this, we are only able to use 9,620 papers (62% of the data set) for retraining.

Data Preprocessing
For our standalone model we decide to label the textual content of the first page of each paper using four tags, namely Author, B-title (beginning of the title, i.e. the first title token), I-title (tokens inside the title) and Misc (everything else).
To this end, we extract the plain text from the PDF using PDFMiner, tokenize the text according to whitespace, and replace newlines with a NEWLINE token. The publication's title is then identified using the JSON metadata and each token labeled accordingly. NEWLINE tokens within a sequence of title tokens are preserved.
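The tokenization and title labeling described above can be sketched as follows. This is a minimal illustration under stated assumptions: the title is matched by exact token equality (the actual matching may be more tolerant), and NEWLINE tokens that fall inside the matched title span receive the I-title label. The function names are hypothetical.

```python
def tokenize(text):
    """Whitespace tokenization that represents each line break as a NEWLINE token."""
    tokens = []
    for index, line in enumerate(text.split("\n")):
        if index > 0:
            tokens.append("NEWLINE")
        tokens.extend(line.split())
    return tokens


def label_title(tokens, title):
    """Label the first exact occurrence of the title; everything else stays Misc."""
    labels = ["Misc"] * len(tokens)
    title_tokens = title.split()
    if not title_tokens:
        return labels
    # Match on word tokens only, so a title broken across lines is still found.
    content = [i for i, tok in enumerate(tokens) if tok != "NEWLINE"]
    words = [tokens[i] for i in content]
    n = len(title_tokens)
    for start in range(len(words) - n + 1):
        if words[start:start + n] == title_tokens:
            first, last = content[start], content[start + n - 1]
            for i in range(first, last + 1):
                labels[i] = "I-title"  # assumption: inner NEWLINE tokens too
            labels[first] = "B-title"
            break
    return labels
```

Matching on the word tokens while keeping the original positions lets the NEWLINE tokens stay in the sequence, as required for titles that wrap over several lines.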
For the matching of authors, we split the author strings from the metadata into surname and given names. We first locate the surnames in the token sequence and label the occurrence closest to the title as Author. Because given names can appear written out as well as abbreviated in the form of initials, we heuristically identify the latter as follows. Given an identified surname, we search within a window of eight tokens before and after the surname for uppercase characters followed by a period. (A window of eight covers the edge case where a surname is followed by a separating comma, two initials, and a newline somewhere in between, e.g., "<surname>, <initial>. <newline><initial>.".) Matching initials are then labeled accordingly. Written-out given names are normally ...
From the tokens we derive vectorized embeddings using fastText. Following Chiu and Nichols (2016), we use representations with 100 dimensions. In addition to the embeddings, we append five feature dimensions to the word vectors, as done by Huang et al. (2015). These indicate whether a token is uppercase, is capitalized, contains punctuation, contains a line break, or is styled like an author initial (uppercase and ending in a period character).
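The surname and initial matching heuristic can be sketched as below. The sketch makes several assumptions that the description leaves open: surnames are compared after stripping trailing punctuation, "closest to the title" is measured as the token distance to the nearest title token, and an initial is any token made up of uppercase letter and period pairs. The helper names are hypothetical, and written-out given names are not handled.

```python
def is_initial(token):
    """True for tokens such as 'I.' or 'I.I.' (optionally followed by a comma)."""
    core = token.rstrip(",;")
    if len(core) < 2 or len(core) % 2:
        return False
    return all(core[i].isupper() and core[i + 1] == "."
               for i in range(0, len(core), 2))


def label_authors(tokens, labels, surnames, window=8):
    """Label the surname occurrence closest to the title, plus initials near it."""
    labels = list(labels)
    title_pos = [i for i, lab in enumerate(labels) if lab in ("B-title", "I-title")]

    def distance_to_title(i):
        return min(abs(i - p) for p in title_pos) if title_pos else i

    for surname in surnames:
        hits = [i for i, tok in enumerate(tokens)
                if labels[i] == "Misc" and tok.strip(",.;") == surname]
        if not hits:
            continue
        pos = min(hits, key=distance_to_title)
        labels[pos] = "Author"
        # Search eight tokens before and after the surname for initials.
        lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
        for i in range(lo, hi):
            if labels[i] == "Misc" and is_initial(tokens[i]):
                labels[i] = "Author"
    return labels
```

Because `str.isupper` is Unicode-aware, the same check applies to Cyrillic initials without further changes.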

Model Training
For our standalone model we choose to use a BiLSTM network, as is commonly done for sequence labeling tasks (Huang et al., 2015).
We trim input sequences to the first 1,000 tokens, resulting in an input space of 1,000 × 105 dimensions per document, as each token is represented by a 100-dimensional vector with a set of five added features per token. The output space is of equal length and contains a one-hot-encoded representation of one of the four labels Author, B-title, I-title and Misc.
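To make the 1,000 × 105 input shape concrete, the following sketch builds the per-document matrix from a token list. The embedding lookup is passed in as a hypothetical `embed` callable standing in for a fastText model; the exact definitions of the five feature flags and the zero padding of documents shorter than 1,000 tokens are assumptions, since only the trimming is described above.

```python
import string

MAX_TOKENS = 1000
EMBEDDING_DIM = 100
NUM_FEATURES = 5


def handcrafted_features(token):
    """Five binary flags: uppercase, capitalized, punctuation, line break, initial."""
    if token == "NEWLINE":
        return [0.0, 0.0, 0.0, 1.0, 0.0]
    return [
        float(token.isupper()),
        float(token[:1].isupper() and token[1:].islower()),
        float(any(ch in string.punctuation for ch in token)),
        0.0,
        float(token.isupper() and token.endswith(".")),
    ]


def encode_document(tokens, embed, max_tokens=MAX_TOKENS):
    """Return a max_tokens x 105 matrix as a list of rows (trimmed or zero-padded)."""
    rows = [list(embed(tok)) + handcrafted_features(tok)
            for tok in tokens[:max_tokens]]
    width = EMBEDDING_DIM + NUM_FEATURES
    rows.extend([0.0] * width for _ in range(max_tokens - len(rows)))
    return rows
```

For example, `encode_document(tokens, embed=lambda tok: [0.0] * 100)` yields 1,000 rows of 105 values each, with a placeholder embedding in place of real fastText vectors.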
Because title and authors only make up a small fraction of the words at the beginning of a publication, tokens with the Misc label make up the majority of our data. To prevent the trivial prediction of the Misc label from dominating training, each input token is given an individual, heuristically determined weight of either 1 for Misc or 5 for the Author and *-title labels.
The final network, as shown in Figure 4, consists of a BiLSTM layer followed by a ReLU activated dense layer, a dropout layer and a final dense layer with softmax activation. For training, categorical cross entropy serves as the model's loss function and recall is employed as the target metric. Furthermore, the Adam optimizer (Kingma and Ba, 2017) with a learning rate of 0.0001 is used.
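As a worked illustration of how the token weights interact with the categorical cross-entropy loss, the sketch below computes a weighted loss for a short label sequence in plain Python. Normalizing by the sum of the weights is an assumption made here for readability; deep learning frameworks may reduce weighted losses differently, and the function names are hypothetical.

```python
import math

LABELS = ["Author", "B-title", "I-title", "Misc"]
LABEL_WEIGHTS = {"Author": 5.0, "B-title": 5.0, "I-title": 5.0, "Misc": 1.0}


def token_weights(gold_labels):
    """Weight of 1 for Misc tokens and 5 for Author and *-title tokens."""
    return [LABEL_WEIGHTS[label] for label in gold_labels]


def weighted_cross_entropy(gold_labels, predicted_probs):
    """Weighted mean of -log p(gold label) over all tokens.

    predicted_probs holds one probability distribution over LABELS per token,
    e.g. the softmax output of the final dense layer.
    """
    weights = token_weights(gold_labels)
    losses = [-math.log(probs[LABELS.index(gold)])
              for gold, probs in zip(gold_labels, predicted_probs)]
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)
```

With these weights, a misclassified Author or title token contributes five times as much to the loss as a misclassified Misc token, which counteracts the label imbalance.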

Evaluation
To assess the performance of both the off-the-shelf and retrained GROBID as well as the standalone BiLSTM model, we perform five-fold cross-validation and measure the overall precision, recall, and F1 score. Because GROBID retraining is only possible on roughly two thirds of our data (see Section 3.1), we evaluate the off-the-shelf ("vanilla") GROBID model on the same subset in order to maximize comparability of the evaluation results.
Regarding the comparability to our standalone BiLSTM model, a key difference lies in the fact that we use four labels (Author, B-title, I-title and Misc) instead of GROBID's two (Author and Title). To adjust for this difference, we decide to disregard the Misc label and combine the two types of *-title label by a weighted average.
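The label alignment described above can be sketched as follows: per-label precision, recall, and F1 are computed at token level, Misc is ignored, and the two title labels are merged by a weighted average. The weighting scheme is not specified above; this sketch assumes weights proportional to the number of gold tokens per label (support). Function names are hypothetical.

```python
def precision_recall_f1(gold, pred, label):
    """Token-level precision, recall, F1, and support for a single label."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1, tp + fn


def combined_title_scores(gold, pred):
    """Support-weighted average of the B-title and I-title scores (assumed weighting)."""
    parts = [precision_recall_f1(gold, pred, label)
             for label in ("B-title", "I-title")]
    total = sum(part[3] for part in parts)
    if total == 0:
        return 0.0, 0.0, 0.0
    return tuple(sum(part[i] * part[3] for part in parts) / total
                 for i in range(3))
```

The combined title scores and the Author scores obtained this way can then be compared with GROBID's Title and Author scores, while the Misc label never enters the calculation.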
The overall evaluation scores resulting from this are shown in Table 3. We note that off-the-shelf GROBID is only able to determine a small fraction of title and author tokens correctly. Retraining GROBID using our training data, however, significantly improves the performance from an F1 score of 0.06 to 0.83, on par with GROBID's performance on English documents (Nasar et al., 2018). Our standalone BiLSTM model outperforms the retrained GROBID with an F1 score of 0.90, owing to significantly higher recall. Looking at the evaluation results per label for the retrained GROBID and the standalone BiLSTM model, a clear difference lies in the recall of the author label (0.74 and 0.95, respectively).
For further assessment of the BiLSTM model's performance, we evaluate its predictions per language as shown in Table 5. We observe that the model achieves higher scores for Russian documents than for Ukrainian ones. This is especially notable since the number of Ukrainian documents in the data set is significantly higher than that of Russian papers. One possible explanation for this performance gap is a more consistent structure among the Russian documents. Performance on the 50 Bulgarian documents within the data set is comparatively low. While this is likely due to the vast majority of the training data being in a different language, the informativeness of the score itself is limited, given that merely 50 documents are available for testing.

Conclusion
Inspired by recent approaches creating high-quality data for training and evaluating information extraction tasks involving scholarly publications, we utilize this approach to tackle the problem of underrepresented non-English scholarly (training) data. To this end, we use Cyrillic script documents found in the CORE data set to train sequence labeling models for identifying publications' metadata.
We create a data set of 15,553 papers spanning 27 years and three languages. Using this data set, we retrain GROBID and thereby greatly improve its performance. Furthermore, we train and evaluate a separate sequence labeling model that is less constrained by PDF parsing restrictions (see Section 3.1), showing even better overall performance results than the retrained GROBID model.
By showcasing the use of freely available non-English publications to improve the availability of high-quality data and models covering areas beyond the anglophone publication record, we hope to inspire similar efforts for other languages. For our own approach, we plan to extend it to the extraction of bibliographic references in the future.