MaintNet: A Collaborative Open-Source Library for Predictive Maintenance Language Resources

Maintenance record logbooks are an emerging text type in NLP. A large part of each record typically consists of free text with many domain-specific technical terms, abbreviations, and non-standard spelling and grammar. This poses difficulties for NLP pipelines trained on standard corpora. Analyzing and annotating such documents is of particular importance in the development of predictive maintenance systems, which aim to improve operational efficiency, reduce costs, prevent accidents, and save lives. To facilitate and encourage research in this area, we have developed MaintNet, a collaborative open-source library of technical and domain-specific language resources. MaintNet provides novel logbook data from the aviation, automotive, and facility maintenance domains along with tools to aid in their (pre-)processing and clustering. Furthermore, it provides a platform for discussing and sharing new datasets and tools for logbook data analysis.


Introduction
With the rapid development of information technologies, engineering systems are generating increasing amounts of data that are used by various industries to improve their products. Maintenance records are one such type of data. They typically consist of event logbooks, which are collected in many domains such as aviation, transportation, and healthcare (Tanguy et al., 2016; Altuncu et al., 2018). The analysis of maintenance records is particularly important in the development of predictive maintenance systems, which can be used to prevent accidents and reduce maintenance costs (Jarry et al., 2018).
Maintenance record datasets generally contain free text fields describing issues (or problems) written in non-standard language with many abbreviations and domain-specific terms, as in the instances presented in Table 1. To facilitate research in this area, we present MaintNet, a collaborative, open-source library for technical language resources with a special focus on predictive maintenance data.
The main contributions of this paper are the following:
1. The development of MaintNet, a user-friendly web-based platform that serves as a repository hosting a variety of resources and tools developed to process predictive maintenance and technical logbook data.
2. The creation of several important language resources for technical language and predictive maintenance such as abbreviation lists, morphosyntactic information lists, and termbanks for the aviation, automotive, and facility maintenance domains. All these resources as well as raw data from these domains are made freely available to the research community via MaintNet.
3. The development of several novel Python packages for (pre-)processing technical language which we make available to the research community. This includes stop word removal, stemmers, lemmatizers, POS tagging, document clustering, and more.
4. A collaborative environment in which the community can contribute with data and resources and interact with developers and other members of the community via forums.

Language Resources
To the best of our knowledge, there are no freely available tools and libraries developed to process maintenance logbook data, which makes MaintNet a unique resource. MaintNet currently features datasets from the aviation, automotive, and facilities domains (see Table 2), and it will be expanded in collaboration with interested members of the NLP community working on similar topics. Predictive maintenance datasets are hard to obtain due to the sensitive information they contain. Therefore, we work closely with the data providers to ensure that any confidential and sensitive information in the datasets is anonymized. In addition to the datasets, MaintNet provides the user with domain-specific abbreviation dictionaries, morphosyntactic annotation, and term banks. The abbreviation dictionaries contain abbreviations validated by domain experts. The morphosyntactic annotation contains the part-of-speech (POS) tag, compound, lemma, and word stem. Finally, the domain term banks contain the collected list of terms that are used in each domain along with a sample of usage in the corpus.

Pre-processing and Tools
Grouping maintenance issues by type is an important step in the analysis of logbook data. Most of the predictive maintenance datasets available, however, do not feature the reason for maintenance or the category of the issues, making it impossible to train classification systems on such data. To address this problem, we implemented several (pre-)processing steps to clean and extract information from logbooks aiming at document clustering and classification. The complete processing pipeline is shown in Figure 1. The pre-processing steps start with text normalization, lowercasing, and stop word and punctuation removal. We then handle special characters with the regular expression utilities in NLTK (Bird et al., 2009), followed by stemming (Snowball Stemmer), lemmatization (WordNet (Miller, 1992)), and tokenization (NLTK tokenizer). POS annotation is carried out using the NLTK POS tagger. Finally, term frequency-inverse document frequency (TF-IDF) is obtained using the gensim TF-IDF model (Rehurek and Sojka, 2010).
To address misspellings and abbreviations, which are abundant in predictive maintenance datasets, we explored various state-of-the-art spellcheckers including Enchant, Pyspellchecker, Symspellpy, and Autocorrect. We also developed our own spell checker using Levenshtein distance (Aggarwal and Zhai, 2012), where a dictionary of domain-specific words is used to map the misspelling candidates to words in the dictionary. The Levenshtein algorithm was chosen over other distance metrics (e.g., Euclidean, cosine) as it allows us to control the minimum number of string edits. The performance of our method compared to other spellcheckers on a subset of the aviation dataset is presented in Table 3.
Table 3: Results of the spelling correction and abbreviation expansion methods in terms of success rate.
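The dictionary-based correction described above can be illustrated with a minimal sketch. This is not the MaintNet implementation: the function names, the sample dictionary, and the `max_edits` threshold are assumptions made for illustration, and only the general idea (map a misspelling candidate to the closest dictionary word by Levenshtein distance) comes from the text.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def correct(token: str, dictionary: set, max_edits: int = 2) -> str:
    """Map token to the closest dictionary word, or return it unchanged.

    max_edits is a hypothetical threshold: candidates farther away than
    this many edits are rejected and the token is left as written.
    """
    if not dictionary or token in dictionary:
        return token
    # Ties on distance are broken alphabetically so the result is deterministic.
    best = min(dictionary, key=lambda word: (levenshtein(token, word), word))
    return best if levenshtein(token, best) <= max_edits else token


# Hypothetical domain dictionary, for illustration only.
DOMAIN_WORDS = {"engine", "cylinder", "baffle", "leaking", "gasket"}

if __name__ == "__main__":
    for raw in ["engin", "gaskit", "xyzzyq"]:
        print(raw, "->", correct(raw, DOMAIN_WORDS))
```

Because the distance is an integer count of edits, the acceptance threshold can be stated directly in terms of string edits, which is not possible with vector-space measures such as cosine.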
In MaintNet we also developed document clustering systems customized to logbook data, and we make the scripts available to the community. As previously stated, logbook datasets are often not annotated with issue categories, requiring a domain expert to group instances into categories. Here we use clustering methods to help group documents together.
Finally, we use three different similarity algorithms, Levenshtein, Jaro, and cosine (Fraley and Raftery, 1998), to calculate intra- and inter-cluster similarity. Cosine similarity is commonly used and is independent of document length, while Jaro is more flexible in that it provides a graded rating of how closely two strings match. We collected instances annotated by a domain expert to serve as our gold standard, and these are provided on MaintNet to encourage research into improving unsupervised clustering of maintenance logbooks.
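As a rough illustration of the intra- and inter-cluster comparison, the sketch below computes cosine similarity over simple token counts and averages it within and across clusters. The function names, the whitespace tokenization, the pairwise averaging, and the example records are all assumptions for illustration; the paper does not specify how the scores are aggregated.

```python
import math
from collections import Counter
from itertools import combinations, product


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two texts."""
    vec_a, vec_b = Counter(doc_a.split()), Counter(doc_b.split())
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def intra_cluster_similarity(cluster: list) -> float:
    """Mean pairwise similarity between documents of the same cluster."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 1.0  # a single document is trivially coherent
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


def inter_cluster_similarity(cluster_a: list, cluster_b: list) -> float:
    """Mean similarity between documents drawn from two different clusters."""
    pairs = list(product(cluster_a, cluster_b))
    if not pairs:
        return 0.0
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Hypothetical, already pre-processed logbook entries.
leaks = ["engine oil leak", "oil leak left engine"]
baffles = ["baffle cracked", "cracked baffle seal"]

if __name__ == "__main__":
    print("intra (leaks):", intra_cluster_similarity(leaks))
    print("inter (leaks vs baffles):", inter_cluster_similarity(leaks, baffles))
```

A good clustering should yield high intra-cluster and low inter-cluster values. The length independence mentioned above also shows up here: repeating a document's text scales its count vector without changing its direction, so its cosine similarity to the original stays at 1.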

Community Participation
MaintNet provides various webpages where users can communicate with each other and with the project developers, as well as upload data to share with the community (see Figure 2). We hope this will further facilitate discussion and research in this important and underexplored area.

Conclusions and Future Work
In this paper we presented MaintNet, a collaborative open-source library for predictive maintenance language resources. MaintNet provides raw technical logbook data as well as several language resources such as abbreviation lists, morphosyntactic information lists, and termbanks from the aviation, automotive, and facilities domains. Tools developed in Python are also made available for pre-processing, such as spell checking, POS tagging, and document clustering. In addition to these tools, the collaborative aspects of MaintNet should be emphasized. We welcome the community to contribute new datasets that can be processed using the tools available at MaintNet, or to share new and improved tools developed with MaintNet's open-source data.
MaintNet is also expanding: current work involves processing data from additional domains such as healthcare and power systems (e.g., wind turbines). These datasets will be made available on MaintNet in the upcoming months. We also aim to collect and release datasets and tools for languages other than English in the near future.