Improving Chemical Named Entity Recognition in Patents with Contextualized Word Embeddings

Chemical patents are an important resource for chemical information. However, few chemical Named Entity Recognition (NER) systems have been evaluated on patent documents, due in part to their structural and linguistic complexity. In this paper, we explore the NER performance of a BiLSTM-CRF model utilising pre-trained word embeddings, character-level word representations and contextualized ELMo word representations for chemical patents. We compare word embeddings pre-trained on biomedical and chemical patent corpora. The effect of tokenizers optimized for the chemical domain on NER performance in chemical patents is also explored. The results on two patent corpora show that contextualized word representations generated from ELMo substantially improve chemical NER performance w.r.t. the current state-of-the-art. We also show that domain-specific resources, such as word embeddings trained on chemical patents and chemical-specific tokenizers, have a positive impact on NER performance.


Introduction
Chemical patents are an important starting point for understanding the purpose, properties, and novelty of chemical compounds. New chemical compounds are often initially disclosed in patent documents; however, it may take 1-3 years for these chemicals to be mentioned in the chemical literature (Senger et al., 2015), suggesting that patents are a valuable but underutilized resource. As the number of new chemical patent applications is drastically increasing every year (Muresan et al., 2011), it is becoming increasingly important to develop automatic natural language processing (NLP) approaches enabling information extraction from these patents (Akhondi et al., 2019). Chemical Named-Entity Recognition (NER) is a fundamental step for information extraction from chemical-related texts, supporting relation extraction (Wei et al., 2016), reaction prediction (Schwaller et al., 2018) and retro-synthesis (Segler et al., 2018).
However, performing NER in chemical patents can be challenging (Akhondi et al., 2014). As legal documents, patents are written very differently from scientific literature. When writing scientific papers, authors strive to make their words as clear and straightforward as possible, whereas patent authors often seek to protect their knowledge from being fully disclosed (Valentinuzzi, 2017).
In tension with this is the need to claim broad scope for intellectual property reasons, and hence patents typically contain more details and are more exhaustive than scientific papers (Lupu et al., 2011).
There are a number of characteristics of patent texts that create challenges for NLP in this context. Long sentences listing names of compounds are frequently used in chemical patents. The structure of sentences in patent claims is usually complex, and syntactic parsing of patents can be difficult (Hu et al., 2016). A quantitative analysis by Verberne et al. (2010) showed that the average sentence length in a patent corpus is much longer than in general language use. That work also showed that the lexicon used in patents usually includes domain-specific and novel terms that are difficult to understand. Some patent authorities use Optical Character Recognition (OCR) for digitizing patents, which can be problematic when applying automatic NLP approaches, as the OCR errors introduce extra noise into the data (Akhondi et al., 2019).
Most NER systems for the chemical domain were developed, trained and tested on either chemical literature or only the title and abstract of chemical patents (Akhondi et al., 2019). There are substantial linguistic differences between abstracts and the corresponding full text publications (Cohen et al., 2010). The performance of NER approaches on full patent documents has still not been fully explored (Krallinger et al., 2015).
Hence, this paper focuses on presenting the best NER performance achieved to date on full chemical patent corpora.
We use a combination of pre-trained word embeddings, a CNN-based character-level word representation and contextualized word representations generated from ELMo, trained on a patent corpus, as input to a BiLSTM-CRF model. The results show that contextualized word representations help improve chemical NER performance substantially. In addition, the impact of the choice of pre-trained word embeddings and tokenizers is assessed.
The results show that word embeddings that are pre-trained on chemical patents outperform embeddings pre-trained on biomedical datasets, and using tokenizers optimized for the chemical domain can improve NER performance in chemical patent corpora.

Related work
In this section, we summarize previous methods and empirical studies on NER in chemical patents.
Two existing Conditional Random Field (CRF)-based systems for chemical named entity recognition are tmChem (Leaman et al., 2015) and ChemSpot (Rocktäschel et al., 2012); each makes use of numerous hand-crafted features, including word shape, prefix, suffix, part-of-speech and character N-grams, in an algorithm based on modelling of tag sequences. A previous detailed empirical study explored the generalization performance of these systems and their ensembles (Habibi et al., 2016). The application of the tmChem model trained on chemical literature corpora of the BioCreative IV CHEMDNER task (Krallinger et al., 2015) and the ChemSpot model trained on a subset of the SCAI corpus (Klinger et al., 2008) resulted in a significant performance drop over chemical patent corpora. Zhang et al. (2016) compared the performance of CRF- and Support Vector Machine (SVM)-based models on the CHEMDNER-patents corpus (Krallinger et al., 2015). The features constructed in that work included binarized embeddings (Guo et al., 2014), Brown clustering (Brown et al., 1992) and domain-specific features extracted by detecting common prefixes/suffixes in chemical words. The obtained results show that the performance of CRF and SVM models can be significantly improved by incorporating unsupervised features (e.g. word embeddings, word clustering). The study also showed that the SVM model slightly outperformed the CRF model in the chemical NER task.
To perform chemical NER on the CHEMDNER patents corpus, Akhondi et al. (2016) proposed an ensemble approach combining a gazetteer-based method and a modified version of tmChem. Here, the gazetteer-based method utilized a wide range of chemical dictionaries, while additional features such as stems, prefixes/suffixes and chemical elements were added to the original feature set of tmChem. In the ensemble approach, tokens were predicted as chemical mentions if recognized as positive by either tmChem or the gazetteer-based method. The results showed that both the gazetteer-based and ensemble approaches were outperformed by the modified tmChem version in terms of overall F1 score, although these two approaches can obtain higher recall. Huang et al. (2015) proposed a BiLSTM-CRF model based on the use of a bidirectional long short-term memory network (BiLSTM; Schuster and Paliwal, 1997) to extract (latent) features for a CRF classifier. The BiLSTM encodes the input in both forward and backward directions and passes the concatenation of outputs from both directions as input to a linear-chain CRF sequence tagging layer. In this approach, the BiLSTM selectively encodes information and long-distance dependencies observed while processing input sentences in both directions, while the CRF layer globally optimizes the model by using information from neighboring labels.
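To make the role of the CRF layer concrete, the following is a minimal numpy sketch of Viterbi decoding for a linear-chain CRF: given per-token emission scores (as a BiLSTM would produce) and a tag-transition matrix, it finds the globally best tag sequence. The function name and toy scores are ours, not from the paper.

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Find the highest-scoring tag sequence for one sentence.

    emissions:   (seq_len, num_tags) per-token scores (e.g. from a BiLSTM)
    transitions: (num_tags, num_tags) score of moving from tag i to tag j
    """
    seq_len, num_tags = emissions.shape
    # score[t, j] = best score of any tag path ending in tag j at token t
    score = np.zeros((seq_len, num_tags))
    backptr = np.zeros((seq_len, num_tags), dtype=int)
    score[0] = emissions[0]
    for t in range(1, seq_len):
        # candidate[i, j] = score of ending at tag i, then moving to tag j
        candidate = score[t - 1][:, None] + transitions + emissions[t][None, :]
        backptr[t] = candidate.argmax(axis=0)
        score[t] = candidate.max(axis=0)
    # follow back-pointers from the best final tag
    best = [int(score[-1].argmax())]
    for t in range(seq_len - 1, 0, -1):
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]
```

This is how the transition scores let neighboring labels influence each other: even with flat emissions, a strong transition score pulls the decoder toward a compatible tag sequence.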
The morphological structures within words are also important clues for identifying named entities in the biological domain. Such morphological structures are widely used in systematic chemical name formats (e.g. IUPAC names) and are hence particularly informative for chemical NER (Klinger et al., 2008). Character-level word representations have been developed to leverage information from these structures by encoding the character sequences within tokens. Ma and Hovy (2016) used Convolutional Neural Networks (CNNs) to encode character sequences, while Lample et al. (2016) developed an LSTM-based approach for encoding character-level information. Habibi et al. (2017) presented an empirical study comparing three NER models on a large collection of biomedical corpora including the BioSemantics patent corpus: (1) tmChem, the CRF-based model with hand-crafted features, used as the baseline; (2) a second CRF model based on CRFSuite (Okazaki, 2007) using pre-trained word embeddings; and (3) a BiLSTM-CRF model with additional LSTM-based character-level word embeddings (Lample et al., 2016). The performance of CRFSuite- and BiLSTM-CRF-based models with different sets of pre-trained biomedical word embeddings (Pyysalo et al., 2013) was also explored. The results showed that the BiLSTM-CRF model with the combination of domain-specific pre-trained word embeddings and LSTM-based character-level word embeddings outperformed the two CRF-based models on chemical NER tasks in both chemical literature and chemical patent corpora. However, this work used only a general tokenizer (i.e. OpenNLP) and word embeddings pre-trained on biomedical corpora. Corbett and Boyle (2018) presented word-level and character-level BiLSTM networks for chemical NER in the literature domain. The word-level model employed word embeddings learned by GloVe (Pennington et al., 2014) on a corpus of patent titles and abstracts.
The character-level model used two different transfer learning approaches to pre-train its character-level encoder. The first approach attempts to predict neighboring characters at each time step, while the other tries to predict whether a given character sequence is an entry in the chemical database ChEBI (Degtyarenko et al., 2007). Experimental results show that the character-level model can produce better NER performance than the word-level model by leveraging transfer learning. In addition, for the word-level model, using pre-trained word embeddings learned from a patent corpus produces better performance than using embeddings learned from a general corpus.

Our empirical methodology
This section presents our empirical study of chemical NER on patent datasets. We first outline the experimental datasets (Section 3.1) and the tokenizers (Section 3.2) used to pre-process these datasets, and then we introduce the BiLSTM-CRF-based models (Section 3.3) with pre-trained word embeddings (Section 3.4), character-level word embeddings (Section 3.5), contextualized word embeddings (Section 3.6) and implementation details (Section 3.7).
Corpora

The BioSemantics patent corpus (Akhondi et al., 2014) consists of 200 full chemical patent documents annotated with 9 different entity classes. In particular, this corpus has 170K sentences and 360K entity annotations, which is much larger than previously used datasets, e.g. the CHEMDNER patent abstract corpus (Krallinger et al., 2015). Therefore, this corpus can be considered a more suitable resource for evaluating deep learning methods, in which a large amount of training data is required (LeCun et al., 2015). A subset of 47 patents was annotated by multiple groups (at least 3) of annotators and evaluated through inter-annotator agreement. By harmonizing the annotations from different annotator groups, these 47 patents formed the "harmonized" set in the BioSemantics patent corpus. We use the harmonized set for both hyper-parameter tuning and error analysis as it has known high-quality annotations.
The Reaxys gold set (Akhondi et al., 2019) contains 131 patent snippets (parts of full chemical patent documents) from several different patent offices. The tagging scheme of this corpus includes 2 coarse-grained labels, chemical class and chemical compound, and 7 fine-grained labels of chemical compound (e.g. mixture-part, prophetic) and chemical class (e.g. bio-molecule, Markush, mixture, mixture-part). This corpus is relatively small in size, approximately 20,000 sentences in total, but very richly annotated. The relevancy score of each chemical entity and the relations between entities were also annotated, which allows this corpus to be used in other tasks beyond named entity recognition.
In our experiments, each corpus is used separately. Following Habibi et al. (2017), we use a 60%/10%/30% split for training/development/test. Note that on the BioSemantics patent corpus, our sampling of datasets may not be exactly the same as in Habibi et al. (2017).
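The split above can be sketched in a few lines of stdlib Python. The shuffling scheme and seed below are our illustration; as noted, the paper's own sampling may differ from that of Habibi et al. (2017).

```python
import random

def split_corpus(sentences, seed=13):
    """Shuffle and split a corpus into 60% train / 10% dev / 30% test.

    The shuffle and the seed value are illustrative, not the paper's
    exact sampling procedure.
    """
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_dev = int(0.6 * n), int(0.1 * n)
    return (shuffled[:n_train],                       # training set
            shuffled[n_train:n_train + n_dev],        # development set
            shuffled[n_train + n_dev:])               # test set
```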

Tokenizers
The morphological information captured by character-level word representations can be highly affected by tokenization quality. General-purpose tokenizers usually split tokens by spaces and punctuation. However, strict adherence to such boundaries may not be suitable for chemical texts, as spaces and punctuation are commonly used in the IUPAC format for chemical names (e.g. 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyl tetrazolium bromide) (Jessop et al., 2011). Hence, the impact of different tokenizers on NER also needs to be explored. A pre-processing step is applied to the patent corpora, including sentence detection and tokenization. Following Habibi et al. (2017), we use the OpenNLP (Morton et al., 2005) English sentence detection model. To explore the relationship between tokenization quality and final NER performance, we apply different tokenizers and train/test models with each tokenizer individually. To investigate the effect of a general-domain tokenizer, following Habibi et al. (2017), we also use the OpenNLP tokenizer. To investigate whether NER performance is affected by tokenization quality, we employ three tokenizers optimized for chemical texts: ChemTok (Akkasi et al., 2016), OSCAR4 (Jessop et al., 2011) and the NBIC UMLSGeneChemTokenizer.
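The contrast between a general-purpose and a chemistry-aware tokenizer can be illustrated with a toy sketch. The digit heuristic below is only a stand-in of our own; real chemical tokenizers such as OSCAR4 and ChemTok use far richer rules and gazetteers.

```python
import re

def general_tokenize(text):
    """General-purpose style: whitespace split plus isolating punctuation."""
    return re.findall(r"\w+|[^\w\s]", text)

def chem_tokenize(text):
    """Toy chemistry-aware heuristic: keep whitespace-delimited chunks that
    contain digits (typical of systematic names) as single tokens.
    This is an illustration only; OSCAR4/ChemTok are far more sophisticated."""
    tokens = []
    for chunk in text.split():
        if any(ch.isdigit() for ch in chunk):
            tokens.append(chunk)  # likely a systematic chemical name
        else:
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens
```

On the IUPAC example from the text, the general tokenizer shreds the systematic name into 19 fragments, whereas the heuristic keeps it as one token, preserving the morphology the character-level representation relies on.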

Models
We use the BiLSTM-CNN-CRF model (Ma and Hovy, 2016) as our baseline. We extend the baseline by adding the contextualized word representations generated from ELMo (Peters et al., 2018).
For convenience, we call the extended version EBC-CRF, as illustrated in Figure 1. In particular, for EBC-CRF, we use a concatenation of pre-trained word embeddings, CNN-based character-level word embeddings and ELMo-based contextualized word embeddings as the input to a BiLSTM encoder. The BiLSTM encoder learns a latent feature vector for each word in the input. Each latent feature vector is then linearly transformed before being fed into a linear-chain CRF layer (Lafferty et al., 2001) for NER tag prediction. We assume binary potentials between adjacent tags and unary potentials between tags and words. Dai et al. (2019) showed that NER performance is significantly affected by the overlap between the pre-trained word embedding vocabulary and the target NER data. Therefore, we explore the effects of different sets of pre-trained word embeddings on NER performance.
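The input construction for EBC-CRF amounts to a per-token concatenation of the three representations. The sketch below uses random vectors as stand-ins; the 200-d word embeddings match the paper's setup, while the 30-d character vector and 1024-d ELMo vector are assumed dimensions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len = 7  # tokens in one sentence

# Stand-ins for the three per-token views of the sentence.
word_emb  = rng.normal(size=(seq_len, 200))    # pre-trained word embeddings (200-d, as in the paper)
char_repr = rng.normal(size=(seq_len, 30))     # CNN-based character-level vectors (dim assumed)
elmo_repr = rng.normal(size=(seq_len, 1024))   # ELMo contextualized vectors (dim assumed)

# EBC-CRF input: concatenate the three views for each token,
# then feed the result to the BiLSTM encoder.
bilstm_input = np.concatenate([word_emb, char_repr, elmo_repr], axis=-1)
# bilstm_input.shape == (7, 1254)
```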

Pre-trained word embeddings
We use 200-dimensional pre-trained PubMed-PMC and Wiki-PubMed-PMC word embeddings (Pyysalo et al., 2013), which are widely used for NLP tasks in the biomedical domain. Both the PubMed-PMC and Wiki-PubMed-PMC word embeddings were generated by training the Word2Vec skip-gram model (Mikolov et al., 2013) on a collection of PubMed abstracts and PubMed Central articles. Here, an additional Wikipedia dump was also used to learn the Wiki-PubMed-PMC word embeddings.
To explore whether word embeddings trained on the same domain can produce better performance in NER tasks, we learn another set of word embeddings, which we call ChemPatent embeddings, by applying the same model and hyper-parameters as Pyysalo et al. (2013) on a collection of 84,076 full patent documents (1B tokens) across 7 patent offices (see Table 1 for details).
The pre-trained PubMed-PMC, Wiki-PubMed-PMC and ChemPatent word embeddings are fixed during training of the NER models. For a more concrete comparison, a set of 200-dimensional trainable word embeddings is used as a baseline: these baseline embeddings cover all words in the vocabulary of the dataset, are initialized from a normal distribution, and are updated during training. The vocabulary of models using pre-trained word embeddings is built by taking the union of words in the pre-trained word embedding file and words with frequency greater than 3 in the training and development sets. We do not update the word embedding weights when pre-trained word embeddings are used.
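The vocabulary construction described above can be sketched as follows; the function name and arguments are ours.

```python
from collections import Counter

def build_vocab(pretrained_vocab, train_tokens, dev_tokens, min_count=4):
    """Model vocabulary = words in the pre-trained embedding file, plus
    words occurring more than 3 times (i.e. at least min_count=4 times)
    in the training and development sets combined."""
    counts = Counter(train_tokens) + Counter(dev_tokens)
    frequent = {w for w, c in counts.items() if c >= min_count}
    return set(pretrained_vocab) | frequent
```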

Character-level representation
The BiLSTM-CRF model with character-level word representations (Lample et al., 2016; Ma and Hovy, 2016) has been shown to have state-of-the-art performance in NER tasks on chemical patent datasets (Habibi et al., 2017). It has been shown that the choice of LSTM-based or CNN-based character-level word representations has little effect on final NER performance in both the general and biomedical domains, while the CNN-based approach has the advantage of reduced training time (Reimers and Gurevych, 2017b; Zhai et al., 2018). Hence, we use the CNN-based approach with the same hyper-parameter settings as Reimers and Gurevych (2017b) for capturing character-level information (see Table 2 for details).
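The CNN-based character-level representation can be sketched in plain numpy: embed each character, slide a fixed-width convolution window over the character sequence, and max-pool over positions to obtain one fixed-size vector per token. All sizes and the hash-based "embeddings" below are illustrative stand-ins, not the paper's hyper-parameters.

```python
import numpy as np

def char_cnn(token, emb_dim=16, n_filters=4, window=3, seed=0):
    """Toy CNN-based character-level word representation (sizes illustrative)."""
    rng = np.random.default_rng(seed)

    def embed(c):
        # deterministic stand-in for a learned character embedding table
        return np.random.default_rng(ord(c)).normal(size=emb_dim)

    chars = [embed(c) for c in token]
    while len(chars) < window:          # pad very short tokens
        chars.append(np.zeros(emb_dim))
    filters = rng.normal(size=(n_filters, window * emb_dim))
    windows = np.stack([np.concatenate(chars[i:i + window])
                        for i in range(len(chars) - window + 1)])
    conv = windows @ filters.T          # (n_positions, n_filters)
    return conv.max(axis=0)             # max-pool over positions
```

Because the representation is built from the character sequence, it can pick up morphological cues (prefixes such as "di-", suffixes such as "-yl") that systematic chemical names share.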

ELMo
ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) can be used to generate contextualized word representations by combining the internal states of different layers in neural language models. Contextualized word representations can help to improve performance in various NLP tasks by incorporating contextual information, essentially allowing the same word to have distinct context-dependent meanings. This could be particularly powerful for chemical NER, since generic chemical names (e.g. salts, acid) may have different meanings in other domains. We therefore explore the impact of using contextualized word representations for chemical patents. We train ELMo on the same corpus of 84K patents (detailed in Table 1) which we use for training the ChemPatent embeddings (described in Section 3.4). We use the ELMo implementation provided by Peters et al. (2018) with default hyper-parameters. Such neural language models require a large amount of computational resources to train. In ELMo, a maximum character length per token is set to make training feasible. However, systematic chemical names in chemical patents are often longer than the typical maximum sequence length of these neural language models. As very long tokens tend to be systematic chemical names, we reduced the maximum token length from 50 to 25 characters and replaced tokens longer than 25 characters with a special token "Long Token".
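The long-token replacement is a one-line preprocessing step; the function name below is ours, and the placeholder string follows the paper's "Long Token".

```python
def cap_token_length(tokens, max_len=25, placeholder="Long Token"):
    """Replace tokens longer than max_len characters (almost always long
    systematic chemical names) with a placeholder before training ELMo."""
    return [t if len(t) <= max_len else placeholder for t in tokens]
```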

Implementation details
Our NER model implementation is based on the AllenNLP system (Gardner et al., 2017). We learn model parameters using the training set, and use the overall F1 score on the development set as the indicator of performance improvement. All models in this paper are trained for a maximum of 50 epochs, and early stopping is applied if no overall F1 score improvement is observed for 10 epochs. Reimers and Gurevych (2017a) and Zhai et al. (2018) explored optimal hyper-parameters of BiLSTM-CRF models in NER tasks. Hence, we fix the hyper-parameters shown in Table 2 to the suggested values in our experiments, which means that only models with a 2-stacked LSTM of size 250 are evaluated.
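The stopping rule (50 epochs maximum, patience of 10 on dev F1) can be sketched as a small pure function over a sequence of per-epoch development F1 scores; the function name is ours.

```python
def epochs_trained(dev_f1, max_epochs=50, patience=10):
    """Return the epoch at which training stops: after max_epochs, or once
    `patience` epochs pass with no improvement in overall dev F1."""
    best_f1, best_epoch = float("-inf"), 0
    for epoch, f1 in enumerate(dev_f1[:max_epochs], start=1):
        if f1 > best_f1:
            best_f1, best_epoch = f1, epoch
        elif epoch - best_epoch >= patience:
            return epoch  # early stop: no improvement for `patience` epochs
    return min(len(dev_f1), max_epochs)
```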
In this study, we also consider the choice of tokenizer and word embedding source as hyper-parameters. To compare the performance of different tokenizers, we tokenize the same split of datasets with different tokenizers and evaluate the overall F1 score on the development set. After the best tokenizer for pre-processing the patent corpus is determined, we use datasets tokenized by that tokenizer to train models with different pre-trained word embeddings. The best set of pre-trained word embeddings for the patent corpus is determined based on the overall F1 score on the development set. We then select the best-performing tokenizer and pre-trained word embeddings by comparing the macro-averaged F1 score improvement on both experimental datasets.
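The evaluation used throughout is exact-match entity-level F1, macro-averaged over the two corpora when choosing between tokenizers or embeddings. A minimal sketch (function names ours):

```python
def entity_f1(gold, pred):
    """Exact-match entity-level F1; entities are (start, end, label) spans."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f1(per_corpus_f1):
    """Macro-average over corpora: each dataset weighs equally,
    regardless of its size."""
    return sum(per_corpus_f1) / len(per_corpus_f1)
```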

Main Results
Effects of different tokenizers: Table 3 shows that all 3 tokenizers optimized for the chemical domain outperform the baseline general-purpose tokenizer (i.e. OpenNLP). The best performance on BioSemantics and Reaxys Gold is achieved by using the NBIC tokenizer (+1.86 F1 score) and the OSCAR4 tokenizer (+0.86 F1 score), respectively. The best overall tokenizer is OSCAR4, which obtains about 1.0 absolute macro-averaged F1 improvement in comparison to the baseline.
Effects of different sets of word embeddings: Table 4 shows results obtained by training EBC-CRF with different sets of pre-trained word embeddings. On both BioSemantics and Reaxys Gold, it is not surprising that our ChemPatent word embeddings help produce the best performance on the development set, obtaining (on average) an F1 score 1.5 points higher than the other pre-trained embeddings.

Final results: Table 5 compares the results reported in Habibi et al. (2017) with our approach on the full BioSemantics test set. It is clear that all neural models outperform the conventional CRF-based models tmChem and CRFSuite. Our EBC-CRF model outperforms the BiLSTM-CRF + LSTM-char model with a 3.7 F1 score improvement.
Compared to the baseline BiLSTM-CNN-CRF model, the ELMo-based contextualized word embeddings produce an F1 improvement of 1.3 points. Table 6 details our F1 scores for BiLSTM-CNN-CRF and EBC-CRF with respect to each entity label on both the BioSemantics patent corpus and the Reaxys Gold set. The overall results show that ELMo-based contextualized word embeddings help improve the baseline by 1.3 and 4.8 absolute F1 score on BioSemantics and Reaxys, respectively.

On the BioSemantics patent corpus, we obtain F1 score improvements of 1+ points on frequent entity labels (i.e. > 3,000 instances), except for the entity label Formula, which has a 0.4 absolute improvement. Higher improvements can be observed on rare entity labels (e.g. 4 points on Mode of Actions, 6 points on Registry numbers and Trademarks). The highest improvement, at 9 points, is found for the rarest entity label, CAS Number.

(Table 6: F1 score with respect to each entity label. "Count" denotes gold-entity counts in the test sets; "+ELMo" denotes scores obtained by EBC-CRF.)
In the Reaxys Gold set, with ELMo we obtain F1 score improvements of 2+ points on the entity labels chemCompound, chemCompound-mixture-part and chemClass-mixture. Higher improvements (> 6 points) can be seen on some rare entity labels such as chemClass, chemClass-biomolecule, chemClass-mixture-part and chemClass-polymer. The improvements on the entity labels chemClass-Markush and chemCompound-prophetics are outliers compared to the others. In particular, an absolute F1 improvement of 74+ points is achieved on the entity label chemCompound-prophetics, while we do not find any improvement on chemClass-Markush.

Error Analysis
To perform error analysis on BioSemantics, we use its harmonized subset. Figure 2 (a) shows that most of the errors are confusions between non-chemical words and generic chemical names (e.g. water, salt, acid). For example, as illustrated in Figure 3 (a), the word "salt", which appears at the end of a systematic name, should be identified as a part of the systematic name. However, the same word is also widely used to describe a class of chemicals, e.g. "pharmaceutically acceptable salt" in Figure 3 (b). Disambiguation between chemical class and chemical compound is a challenging task even for human annotators, and is thus particularly difficult for a statistical model to learn. The confusion matrix of the Reaxys Gold set in Figure 2 (b) also supports this point, since most confusions are between non-chemical words, chemical classes and chemical compounds.
The Reaxys Gold set has a more complex tag set than the BioSemantics patent corpus, as it assigns separate fine-grained tags for sub-categories of chemical classes (chemClass) and chemical compounds (chemCompound). As illustrated in Table 6, there is not sufficient training data for the fine-grained sub-category labels. It is difficult for a high-complexity neural model to learn the characteristics of these sub-category labels and the key differences between the main categories and their sub-categories. Figure 2 (b) shows that 50% of the errors for "chemical compound prophetics" and 80% of the errors for "chemical compound mixture part" are due to confusion with their parent category "chemical compound".

(Figures 3 and 4 show annotated example sentences from the patent corpora, including purification procedures with NMR data and claim language mentioning a "pharmaceutically acceptable salt".)

Another typical error observed frequently in BioSemantics and Reaxys is caused by participles. The most common example is the word "substituted". In "substituted or un-substituted alkyl", the token "substituted" refers to a specific chemical compound, "substituted alkyl", whereas in "2-pyridinyl is optionally substituted with 1-3 substituents", the token "substituted" refers to the substitution reaction.
We also observe that in both patent corpora, there are long sequences of systematic chemical names connected only by commas. Since there are no narrative words between the chemical names in such sequences, it is unlikely that the model can capture any contextual information when tagging them. This can potentially cause a "chain reaction", as shown in Figure 4, in which all chemical names fail to be recognized when the first chemical name is not tagged correctly.

Discussion
The results in Table 3 show that all chemical tokenizers outperform the OpenNLP general-domain tokenizer. This is not surprising, because tokenizers optimized for the chemical domain usually use either rule-based or gazetteer-based methods to ensure that long systematic chemical names are treated as single tokens instead of being split into several tokens at symbols. This matters because the character-level word representation cannot capture the morphological structures in a long chemical name if it is split into several tokens.
In the BioSemantics patent corpus, 80% of all entities are annotated as Generic or IUPAC. When adding ELMo-based word representations, we obtain smaller improvements in F1 score for Generic and IUPAC than for the remaining entity labels/types. This makes sense, as there are already enough training instances for these two labels in the dataset. By contrast, for rare entity labels with frequencies of less than 2% (e.g. CAS Numbers, Trademarks, Mode of Actions, Registry numbers), we obtain improvements of 4+ points when exploiting the external information conveyed via ELMo.
The global F1 score improvements on both experimental datasets further confirm this observation, viz., that score improvements due to ELMo are inversely related to label frequency and training set size. Since the BioSemantics patent corpus contains 10 times more training instances than the Reaxys Gold set, we obtain an absolute improvement of 4.8 points on the Reaxys Gold set but only 1.3 points on the BioSemantics patent corpus.
Adding ELMo substantially improves the F1 score on chemCompound-prophetics. This is because chemCompound-prophetics named entities are all long systematic chemical names which are arranged in lists. Since we replace all tokens longer than 25 characters with "Long Token" when training ELMo, almost all sentences containing chemCompound-prophetics entities appear in the "Long Token" style. This makes the ELMo-based representations of such long entities almost identical, and particularly easy to predict, resulting in an F1 score improvement of 74 points for chemCompound-prophetics. We also observe no improvement for the chemClass-Markush label. Markush structures are figures describing the structure of chemical compounds in which only a few parts/functional groups are labeled. When transformed to text, only the textual labels in the Markush structure are preserved. Thus, it is difficult for ELMo to learn any useful information from the broken Markush structures.

Conclusions
In this paper, we have made the following contributions towards improved chemical named entity recognition in chemical patents:

1. We improve on the current state-of-the-art for chemical NER in patents by +2.67 F1 score.

2. We confirm that tokenizers optimized for the chemical domain have a positive effect on NER performance by preserving informative morphological structures in systematic chemical names.

3. We demonstrate that word embeddings pre-trained on an in-domain chemical patent corpus produce better performance than word embeddings pre-trained on biomedical literature corpora.

4. We show that chemical NER performance can be improved by using contextualized word representations.

5. We release our ChemPatent word embeddings and an ELMo model trained from scratch on a newly collected corpus of 84K unannotated chemical patents, which can be utilized for downstream NLP tasks on chemical patents.

Inspired by the patterns uncovered by our error analysis, our future work on chemical NER will focus on developing models which can support disambiguation of general chemical words. In addition, it would be interesting to explore contextualized word embeddings learned by other neural models such as BERT (Devlin et al., 2019) or OpenAI GPT models (Radford et al., 2019).