MalwareTextDB: A Database for Annotated Malware Articles

Cybersecurity risks and malware threats are becoming increasingly dangerous and common. Despite the severity of the problem, there have been few NLP efforts focused on tackling cybersecurity. In this paper, we discuss the construction of a new database for annotated malware texts. An annotation framework is introduced based on the MAEC vocabulary for defining malware characteristics, along with a database consisting of 39 annotated APT reports with a total of 6,819 sentences. We also use the database to construct models that can potentially help cybersecurity researchers in their data collection and analytics efforts.


Introduction
In 2010, the malware known as Stuxnet physically damaged centrifuges in Iranian nuclear facilities (Langner, 2011). More recently in 2016, a botnet known as Mirai used infected Internet of Things (IoT) devices to conduct large-scale Distributed Denial of Service (DDoS) attacks and disabled Internet access for millions of users on the US West Coast (US-CERT, 2016). These are only two cases in a long list ranging from ransomware on personal laptops (Andronio et al., 2015) to taking over control of moving cars (Checkoway et al., 2011). Attacks such as these are likely to become increasingly frequent and dangerous as more devices and facilities become connected and digitized.
Recently, cybersecurity defense has also been recognized as one of the "problem areas likely to be important both for advancing AI and for its long-run impact on society" (Sutskever et al., 2016). In particular, we feel that natural language processing (NLP) has the potential for substantial contribution in cybersecurity and that this is a critical research area given the urgency and risks involved.
There exists a large repository of malware-related texts online, including detailed malware reports by various cybersecurity agencies such as Symantec (DiMaggio, 2015) and Cylance (Gross, 2016), as well as various blog posts. Cybersecurity researchers often consume such texts in the process of data collection. However, the sheer volume and diversity of these texts make it difficult for researchers to quickly obtain useful information. A potential application of NLP is to quickly highlight critical information from these texts, such as the specific actions taken by a certain malware. This can help researchers quickly understand the capabilities of a specific malware and search in other texts for malware with similar capabilities.
An immediate problem preventing application of NLP techniques to malware texts is that such texts are mostly unannotated. This severely limits their use in supervised learning techniques.
In light of that, we introduce a database of annotated malware reports for facilitating future NLP work in cybersecurity. To the best of our knowledge, this is the first database consisting of annotated malware reports. It is intended for public release, where we hope to inspire contributions from other research groups and individuals.
The main contributions of this paper are:
• We initiate a framework for annotating malware reports and annotate 39 Advanced Persistent Threat (APT) reports (containing 6,819 sentences) with attribute labels from the Malware Attribute Enumeration and Characterization (MAEC) vocabulary (Kirillov et al., 2010).
• We propose the following tasks, construct models for tackling them, and discuss the challenges:
  • Classify if a sentence is useful for inferring malware actions and capabilities,
  • Predict token, relation and attribute labels for a given malware-related text, as defined by the earlier framework, and
  • Predict a malware's signatures based only on text describing the malware.

APTnotes
The 39 APT reports in this database are sourced from APTnotes, a GitHub repository of publicly released reports related to APT groups (Blanda, 2016). The repository is constantly updated, which means it is a constant source of reports for annotations. While the repository consists of 384 reports (as of writing), we have chosen 39 reports from the year 2014 to initialize the database.

MAEC
The MAEC vocabulary was devised by The MITRE Corporation as a standardized language for describing malware (Kirillov et al., 2010). The MAEC vocabulary is used as a source of labels for our annotations. This will facilitate cross-applications in other projects and ensure relevance in the cybersecurity community.

Related Work
There are datasets available for more general tasks such as content extraction (Walker et al., 2006) or keyword extraction (Kim et al., 2010). These may appear similar to our dataset. However, a big difference is that we are not performing general-purpose annotation and not all tokens are annotated. Instead, we only annotate tokens relevant to malware capabilities and provide more valuable information by annotating the type of malware capability or action implied. These are important differentiating factors, specifically catered to the cybersecurity domain. While we are not aware of any database catering specifically to malware reports, there are various databases in the cybersecurity domain that provide malware hashes, such as the National Software Reference Library (NSRL) (NIST, 2017; Mead, 2006) and the File Hash Repository (FHR) by the Open Web Application Security Project (OWASP, 2015).
Most work on classifying and detecting malware has focused on detecting system calls (Alazab et al., 2010; Briones and Gomez, 2008; Willems et al., 2007; Qiao et al., 2013). More recently, Rieck et al. (2011) have incorporated machine learning techniques for detecting malware, again through system calls. To the best of our knowledge, there is no work on classifying malware based on analysis of malware reports. By building a model that learns to highlight critical information on malware capabilities, we feel that malware-related texts can become a more accessible source of information and provide a richer form of malware characterization beyond detecting file hashes and system calls.

Data Collection
We worked together with cybersecurity researchers while choosing the preliminary dataset, to ensure that it is relevant for the cybersecurity community. The factors considered when selecting the dataset include the mention of most current malware threats, the range of author sources, with blog posts and technical security reports, and the range of actor attributions, from several suspected state actors to smaller APT groups.

Preprocessing
After the APT reports have been downloaded in PDF format, the PDFMiner tool (Shinyama, 2004) is used to convert the PDF files into plaintext format. The reports often contain non-sentences, such as the cover page or document header and footer. We went through these non-sentences manually and subsequently removed them before the annotation. Hence only complete sentences are considered for subsequent steps.

Annotation
The Brat Rapid Annotation Tool (Stenetorp et al., 2012) is used to annotate the reports. The main aim of the annotation is to map important word phrases that describe malware actions and behaviors to the relevant MAEC vocabulary, such as the ones shown in Figure 1. We first extract and enumerate the labels from the MAEC vocabulary, which we call attribute labels. This gives us a total of 444 attribute labels, consisting of 211 ActionName labels, 20 Capability labels, 65 StrategicObjectives labels and 148 TacticalObjectives labels. These labels are elaborated in Section 3.5.
There are three main stages to the annotation process. These are cumulative and eventually build up to the annotation of the attribute labels.

Stage 1 -Token Labels
The first stage involves annotating the text with the following token labels, illustrated in Figure 2:
• Action: This refers to an event, such as "registers", "provides" and "is written".
• Subject: This refers to the initiator of the Action, such as "The dropper" and "This module".
• Object: This refers to the recipient of the Action, such as "itself", "remote persistent access" and "The ransom note"; it also refers to word phrases that provide elaboration on the Action, such as "a service", "the attacker" and "disk".
• Modifier: This refers to tokens that link to other word phrases that provide elaboration on the Action, such as "as" and "to".
This stage helps to identify word phrases that are relevant to the MAEC vocabulary. Notice that for the last sentence in Figure 2, "The ransom note" is tagged as an Object instead of a Subject. This is because the Action "is written" is not being initiated by "The ransom note". Instead, the Subject is absent in this sentence.

Stage 2 -Relation Labels
The second stage involves annotating the text with relation labels (SubjAction, ActionObj, ActionMod and ModObj) that link the token labels from Stage 1. This stage helps to make the links between the labelled tokens explicit, which is important in cases where a single Action has multiple Subjects, Objects or Modifiers. Figure 2 demonstrates how the relation labels are used to link the token labels.

Stage 3 -Attribute Labels
The third stage involves annotating the text with the attribute labels extracted from the MAEC vocabulary. Since the Action is the main indicator of a malware's action or capability, the attribute labels are annotated onto the Actions tagged in Stage 1. Each Action should have one or more attribute labels.
There are four classes of attribute labels: ActionName, Capability, StrategicObjectives and TacticalObjectives. These labels describe different actions and capabilities of the malware. Refer to Appendix A for examples and elaboration.

Summary
The above stages complete the annotation process and are carried out for each document. There are also sentences that are not annotated at all since they do not provide any indication of malware actions or capabilities, such as the sentences in Figure 3. We call these sentences irrelevant sentences.
At the time of writing, the database consists of 39 annotated APT reports with a combined total of 6,819 sentences. Out of the 6,819 sentences, 2,080 sentences are annotated. Table 1 shows the breakdown of the annotation statistics.

Figure 4: Two different ways for annotating a sentence, where both seem to be equally satisfactory to a human annotator. In this case, both serve to highlight the malware's ability to hide its DLL's functionality.

Annotators' Challenges
We calculate Cohen's Kappa (Cohen, 1960) to quantify the agreement between annotators and to estimate the difficulty of this task for human annotators. Using annotations from pairs of annotators, the Cohen's Kappa was calculated to be 0.36 for annotation of the Token labels. This relatively low agreement between annotators suggests that this is a rather difficult task. In the following subsections, we discuss some possible reasons that make this annotation task difficult.
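As an illustration of the agreement measure, the following is a minimal sketch of Cohen's Kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The two label sequences are hypothetical and are not taken from the database; how tokens are aligned between annotators is also an assumption.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical token labels from two annotators for the same six tokens.
annotator_1 = ["Subject", "Action", "Object", "O", "O", "Action"]
annotator_2 = ["Subject", "Action", "O", "O", "Object", "Action"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on four of six tokens, yet the chance-corrected agreement is noticeably lower than the raw agreement, which is why Kappa is preferred over simple accuracy for this purpose.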

Complex Sentence Structures
In many cases, there may be no definite way to label the tokens. Figure 4 shows two ways to annotate the same sentence. Both annotations essentially serve to highlight the Gen 2 sub-family's capability of hiding the DLL's functionality. The first annotation highlights the method used by the malware to hide the library, i.e., employing the Driver. The second annotation focuses on the malware hiding the library and does not include the method. Also notice that the Modifiers highlighted are different in the two cases, since they depend on the Action highlighted; the two annotations are hence mutually exclusive. Such cases occur more commonly when the sentences contain complex noun- and verb-phrases that can be decomposed in several ways. Repercussions surface later in the experiments described in Section 5.2, specifically in the second point under Discussion.

Large Quantity of Labels
Due to the large number (444) of attribute labels, it is challenging for annotators to remember all of the attribute labels. Moreover, some of the attribute labels are subject to interpretation. For instance, should Capability: 005: MalwareCapability-command_and_control be tagged for sentences that mention the location or IP addresses of command and control servers, even though such sentences may not be relevant to the capabilities of the malware?

Specialized Domain Knowledge Required
Finally, this task requires specialized cybersecurity domain knowledge from the annotator and the ability to apply such knowledge in a natural language context. For example, given the phrase "load the DLL into memory", the annotator has to realize that this phrase matches the attribute label ActionName: 119: ProcessMemory-map_library_into_process. The abundance of labels with the many ways that each label can be expressed in natural language makes this task extremely challenging.

Proposed Tasks
The main goal of creating this database is to aid cybersecurity researchers in parsing malware-related texts for important information. To this end, we propose several tasks that build up to this main goal.
Task 1: Classify if a sentence is relevant for inferring malware actions and capabilities
Task 2: Predict token labels for a given malware-related text
Task 3: Predict relation labels for a given malware-related text
Task 4: Predict attribute labels for a given malware-related text
Task 5: Predict a malware's signatures based on the text describing the malware and the text's annotations

Task 1 arose from discussions with domain experts where we found that a main challenge for cybersecurity researchers is having to sift out critical sentences from lengthy malware reports and articles. Figure 3 shows sentences describing the political and military background of North Korea in the APT report HPSR SecurityBriefing_Episode16_NorthKorea. Such information is essentially useless for cybersecurity researchers focused on malware actions and capabilities. It will be helpful to build a model that can filter relevant sentences that pertain to malware.
Tasks 2 to 4 serve to automate the laborious annotation procedure as described earlier. With sufficient data, we hope that it becomes possible to build an effective model for annotating malware-related texts, using the framework and labels we defined earlier. Such a model will help to quickly increase the size of the database, which in turn facilitates other supervised learning tasks.
Task 5 explores the possibility of using malware texts and annotations to predict a malware's signatures. While conventional malware analyzers generate a list of malware signatures based on the malware's activities in a sandbox, such analysis is often difficult due to restricted distribution of malware samples. In contrast, numerous malware reports are freely available and it will be helpful for cybersecurity researchers if such texts can be used to predict malware signatures instead of having to rely on a limited supply of malware samples.
In the following experiments, we construct models for tackling each of these tasks and discuss the performance of our models.

Experiments and Results
Since the focus of this paper is on the introduction of a new framework and database for annotating malware-related texts, we only use simple algorithms for building the models and leave more complex models for future work.
For the following experiments, we use linear support vector machine (SVM) and multinomial Naive Bayes (NB) implementations in the scikit-learn library (Pedregosa et al., 2011). The regularization parameter in SVM and the smoothing parameter in NB were tuned (with values from 10^-3 to 10^3 in logarithmic increments) by taking the value that gave the best performance on the development set.
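The tuning procedure can be sketched as a simple grid search. The sketch below assumes the grid steps by one power of ten, and uses a placeholder scoring function in place of actually training an SVM or NB model; both `tune_parameter` and `toy_dev_score` are hypothetical names introduced for illustration.

```python
def tune_parameter(candidates, dev_score):
    """Return the candidate with the highest development-set score."""
    return max(candidates, key=dev_score)

# 10^-3 to 10^3 in logarithmic increments (assumed: one power of ten per step).
grid = [10.0 ** k for k in range(-3, 4)]

# Placeholder scorer (an assumption): stands in for "train on the training
# set with this parameter value, then return F1 on the development set".
def toy_dev_score(value):
    return -abs(value - 1.0)

best = tune_parameter(grid, toy_dev_score)
```

In a real run, `dev_score` would fit the model on the training documents and evaluate it on the development documents, so that the test set stays untouched during tuning.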
For scoring the predictions, unless otherwise stated, we use the metrics module in scikit-learn for SVM and NB, as well as the CoNLL2000 conlleval Perl script for the conditional random field (CRF) models.
Also, unless otherwise mentioned, we make use of all 39 annotated documents in the database. The experiments are conducted with a 60%/20%/20% training/development/test split, resulting in 23, 8 and 8 documents in the respective datasets. Each experiment is conducted 5 times with a different random allocation of the dataset splits and we report averaged scores.
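A minimal sketch of the document-level split described above follows. The document identifiers and the use of seeds 0 to 4 for the five runs are assumptions made for illustration; the rounding rule is chosen so that 39 documents yield 23, 8 and 8.

```python
import random

def split_documents(doc_ids, seed):
    """Shuffle documents and split them 60%/20%/20% at document level."""
    docs = list(doc_ids)
    random.Random(seed).shuffle(docs)
    n_train = int(len(docs) * 0.6)        # 39 documents -> 23
    n_dev = (len(docs) - n_train) // 2    # remaining 16 -> 8
    train = docs[:n_train]
    dev = docs[n_train:n_train + n_dev]
    test = docs[n_train + n_dev:]
    return train, dev, test

# Hypothetical identifiers standing in for the 39 annotated reports.
doc_ids = [f"report_{i:02d}" for i in range(39)]

# Five runs, each with a different random allocation of documents.
runs = [split_documents(doc_ids, seed) for seed in range(5)]
```

Splitting by document rather than by sentence keeps all sentences of a report in the same partition, so the test set contains only unseen reports.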
Since we focus on building a database, we weigh recall and precision as equally important in the following experiments and hence focus on the F1 score metric. The relative importance of recall against precision will ultimately depend on the downstream tasks.
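For reference, the F1 score is the harmonic mean of precision P and recall R, that is, F1 = 2PR / (P + R). The sketch below computes it for a single positive class on hypothetical gold and predicted labels; it is a plain re-implementation for illustration, not the scoring code used in the experiments.

```python
def precision_recall_f1(gold, predicted, positive):
    """Precision, recall and F1 for one positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical sentence-level labels.
gold = ["relevant", "relevant", "irrelevant", "irrelevant", "relevant"]
pred = ["relevant", "irrelevant", "irrelevant", "relevant", "relevant"]
p, r, f1 = precision_recall_f1(gold, pred, positive="relevant")
```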

Task 1 -Classify sentences relevant to malware
We make use of the annotations in our database for this supervised learning task and consider a sentence to be relevant as long as it has an annotated token label. For example, the sentences in Figure 2 will be labeled relevant whereas the sentences in Figure 3 will be labeled irrelevant.
A simple bag-of-words model is used to represent each sentence. We then build two models, SVM and NB, for tackling this task.
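The labelling rule and the bag-of-words representation can be sketched as follows. The tokenisation, the lowercasing, the use of "O" for unannotated tokens and the first toy sentence are assumptions; the second sentence is the irrelevant example quoted later in this section.

```python
from collections import Counter

def is_relevant(token_labels):
    """A sentence is relevant if any token carries an annotated label."""
    return any(label != "O" for label in token_labels)

def bag_of_words(sentences):
    """Map tokenised sentences to count vectors over a shared vocabulary."""
    vocab = sorted({tok.lower() for sent in sentences for tok in sent})
    vectors = []
    for sent in sentences:
        counts = Counter(tok.lower() for tok in sent)
        vectors.append([counts.get(tok, 0) for tok in vocab])
    return vocab, vectors

# Toy data: one annotated sentence and one unannotated sentence.
sentences = [
    ["The", "dropper", "registers", "itself", "as", "a", "service"],
    ["This", "file", "is", "the", "main", "payload", "of", "the", "malware"],
]
labels = [
    ["Subject", "Subject", "Action", "Object", "Modifier", "Object", "Object"],
    ["O"] * 9,
]

relevance = [is_relevant(sentence_labels) for sentence_labels in labels]
vocab, vectors = bag_of_words(sentences)
```

The resulting count vectors and relevance flags are the kind of input that a linear SVM or multinomial NB classifier would be trained on.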
Results: Table 2 shows that while the NB model outperforms the SVM model in terms of F1 score, the performance of both models is still rather low, with F1 scores below 70 points. We proceed to discuss possible sources of errors for the models.

Figure 5: An example of a token ("a lure document") labelled as both Subject and Object. In the first case, it is the recipient of the Action "used", while in the latter case, it is the initiator of the Action "installed".

Discussion: We find that there are two main types of misclassified sentences.

Sentences describing malware without implying specific actions
These sentences often contain malware-specific terms, such as "payload" and "malware" in the following sentence.
This file is the main payload of the malware.
These sentences are often classified as relevant, probably due to the presence of malware-specific terms. However, such sentences are actually irrelevant because they merely describe the malware but do not indicate specific malware actions or capabilities.

Sentences describing attacker actions
Such sentences mostly contain the term "attacker" or names of attackers. For instance, the following sentence is incorrectly classified as irrelevant.
This is another remote administration tool often used by the Pitty Tiger crew.
Such sentences involving the attacker are often irrelevant since the annotations focus on the malware and not the attacker. However, the above sentence implies that the malware is a remote administration tool and hence is a relevant sentence that implies malware capability.

Task 2 -Predict token labels
Task 2 concerns automating Stage 1 of the annotation process described in Section 3.3. Within the annotated database, we find several cases where a single word-phrase may be annotated with both Subject and Object labels (see Figure 5). In order to simplify the model for prediction, we redefine Task 2 as predicting Entity, Action and Modifier labels for word-phrases. The single Entity label is used to replace both Subject and Object labels. Since the labels may extend beyond a single word token, we use the BIO format for indicating the span of the labels (Sang and Veenstra, 1999). We use two approaches for tackling this task: in Approach 1, a CRF is trained to directly predict token labels; in Approach 2, a pipeline is used where the NB model from Task 1 first filters relevant sentences and a CRF model is then trained to predict token labels for the relevant sentences.
The CRF model in Approach 1 is trained on the entire training set, whereas the CRF model in Approach 2 is trained only on the gold relevant sentences in the training set.
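The BIO encoding used for Task 2, together with the merging of Subject and Object into Entity, can be illustrated with the sketch below. The span representation (start, exclusive end, label) and the toy sentence are assumptions for illustration and do not reflect the database's storage format.

```python
def to_bio(tokens, spans):
    """Convert labelled spans to one BIO tag per token.

    spans holds (start, end, label) triples with an exclusive end index.
    Subject and Object are merged into the single Entity label.
    """
    merged = {"Subject": "Entity", "Object": "Entity"}
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        label = merged.get(label, label)
        tags[start] = "B-" + label           # first token of the span
        for i in range(start + 1, end):
            tags[i] = "I-" + label           # remaining tokens of the span
    return tags

# Toy sentence with hypothetical Stage 1 annotations.
tokens = ["The", "dropper", "registers", "itself", "as", "a", "service"]
spans = [
    (0, 2, "Subject"),
    (2, 3, "Action"),
    (3, 4, "Object"),
    (4, 5, "Modifier"),
    (5, 7, "Object"),
]
tags = to_bio(tokens, spans)
```

Each token then receives exactly one tag, which is the per-token output format that a sequence labeller such as a CRF predicts.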
For features in both approaches, we use unigrams and bigrams, part-of-speech labels from the Stanford POS tagger (Toutanova et al., 2003), and Brown clustering features after optimizing the cluster size (Brown et al., 1992). A C++ implementation of the Brown clustering algorithm is used (Liang, 2005). The Brown clusters were trained on a larger corpus of APT reports, consisting of 103 APT reports not in the annotated database and the 23 APT reports from the training set. We group together low-frequency words that appear four or fewer times in the set of 126 APT reports into one cluster, and during testing we assign new words to this cluster.

Table 3: Task 2 scores: predicting token labels.

Results: Table 3 demonstrates that Approach 2 outperforms Approach 1 on most scores. Nevertheless, both approaches still give low performance for tackling Task 2, with F1-scores below 50 points.
Discussion: There seem to be three main categories of wrong predictions.

Sentences describing attacker actions
Such sentences are also a main source of prediction errors in Task 1. Again, most sentences describing attackers are deemed irrelevant and left unannotated because we focus on malware actions rather than human attacker actions. However, these sentences may be annotated in cases where the attacker's actions imply a malware action or capability.
For example, Figure 6a describes the attackers stealing credentials. This implies that the malware used is capable of stealing and exfiltrating credentials. It may be challenging for the model to distinguish whether such sentences describing attackers should be annotated since a level of inference is required.

Sentences containing noun-phrases made up of participial phrases and/or prepositional phrases
These sentences contain complex noun-phrases with multiple verbs and prepositions, such as in Figures 6b and 6c. In Figure 6b, "the RCS sample sent to Ahmed" is a noun-phrase annotated as a single Subject/Entity. However, the model decomposes the noun-phrase into the subsidiary noun "the RCS sample" and participial phrase "sent to Ahmed" and further decomposes the participial phrase into the constituent words, predicting Action, Modifier and Entity labels for "sent", "to" and "Ahmed" respectively. There are cases where such decomposition of noun-phrases is correct, such as in Figure 6c.
As mentioned in Section 3.7, this is also a challenge for human annotators because there may be several ways to decompose the sentence, many of which serve equally well to highlight certain malware aspects (see Figure 4).
Whether such decomposition is correct depends on the information that can be extracted from the decomposition. For instance, the decomposition in Figure 6c implies that the malware can receive remote commands from attackers. In contrast, the decomposition predicted by the model in Figure 6b does not offer any insight into the malware. This is a difficult task that requires recognition of the phrase spans and the ability to decide which level of decomposition is appropriate.

Sentences containing noun-phrases made up of determiners and adjectives
These sentences contain noun-phrases with determiners and adjectives such as "All the requests" in Figure 6d. In such cases, the model may only predict the Entity label for part of the noun-phrase. This is shown in Figure 6d, where the model predicts the Entity label for "the requests" instead of "All the requests".
Thus, we also consider a relaxed scoring scheme where predictions are scored at the token level instead of the phrase level (see Table 4). The aim of the relaxed score is to give credit to the model when the span for a predicted label intersects with the span for the actual label, as in Figure 6d.

Figure 7: An example of an entity with multiple parents. In this case, "stage two payloads" has two parents by ActionObject relations: "downloading" and "executing".
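The difference between strict phrase-level scoring and the relaxed token-level scoring for Task 2 can be sketched as follows. This is a simplified illustration, not the conlleval script: the span format is hypothetical and the token-level score here ignores the B/I distinction, which is an assumption. The example mirrors the "All the requests" versus "the requests" case from Figure 6d using token offsets only.

```python
def f1_from_counts(tp, n_pred, n_gold):
    """F1 from true positives and the predicted/gold totals."""
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)

def phrase_f1(gold_spans, pred_spans):
    """Strict score: a predicted span counts only on an exact match."""
    gold, pred = set(gold_spans), set(pred_spans)
    return f1_from_counts(len(gold & pred), len(pred), len(gold))

def token_f1(gold_spans, pred_spans):
    """Relaxed score: credit every token whose label matches the gold label."""
    def expand(spans):
        return {(i, label) for start, end, label in spans
                for i in range(start, end)}
    gold, pred = expand(gold_spans), expand(pred_spans)
    return f1_from_counts(len(gold & pred), len(pred), len(gold))

# Gold: "All the requests" (tokens 0-2) is one Entity.
# Predicted: only "the requests" (tokens 1-2) is tagged as Entity.
gold_spans = [(0, 3, "Entity")]
pred_spans = [(1, 3, "Entity")]

strict = phrase_f1(gold_spans, pred_spans)
relaxed = token_f1(gold_spans, pred_spans)
```

Under the strict scheme the partial match earns nothing, whereas the relaxed scheme rewards the two correctly labelled tokens.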

Task 3 -Predict relation labels
Following the prediction of token labels in Task 2, we move on to Task 3 for building a model for predicting relation labels. Due to the low performance of the earlier models for predicting token labels, for this experiment we decided to use the gold token labels as input into the model for predicting relation labels. Nevertheless, the models can still be chained in a pipeline context.
The task initially appeared to be similar to a dependency parsing task where the model predicts dependencies between the entities demarcated by the token labels. However, on further inspection, we realized that there are several entities which have more than one parent entity (see Figure 7). As such, we treat the task as a binary classification task, by enumerating all possible pairs of entities and predicting whether there is a relation between each pair.
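The reformulation as pairwise binary classification can be sketched as below, using the entities named in Figure 7. Treating pairs as unordered and identifying entities by their surface strings are simplifying assumptions made for illustration.

```python
from itertools import combinations

def candidate_pairs(entities):
    """Enumerate every unordered pair of labelled entities in a sentence."""
    return list(combinations(entities, 2))

# Entities from the Figure 7 example, identified by their text.
entities = ["downloading", "executing", "stage two payloads"]

# Gold relations: "stage two payloads" is linked to both Actions.
gold_relations = {
    frozenset(["downloading", "stage two payloads"]),
    frozenset(["executing", "stage two payloads"]),
}

pairs = candidate_pairs(entities)
# Binary target for each pair: is there a relation between the two entities?
targets = [frozenset(pair) in gold_relations for pair in pairs]

# An entity may take part in several relations, unlike in a dependency tree.
n_parents = sum("stage two payloads" in rel for rel in gold_relations)
```

Because each pair is classified independently, nothing prevents one entity from being linked to two parents, which a single-head dependency parse could not express.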
Predicting the relation labels from the token labels seems to be a relatively straightforward task and hence we design a simple rule-based model for the predictions. We tuned the rule-based model on one of the documents (AdversaryIntelligenceReport_DeepPanda_0 (1)) and tested it on the remaining 38 documents. The rules are documented in Appendix B.
Results: Table 5 shows the scores from testing the model on the remaining 38 documents.
The results from the rule-based model are better than expected, with the average F1-scores exceeding 84 points for all the labels. This shows that the relation labels can be reliably predicted given good predictions of the preceding token labels.
Discussion: The excellent performance from the rule-based model suggests that there is a well-defined structure in the relations between the entities. It may be possible to make use of this inherent structure to help improve the results for predicting the token labels.
Also, notice that by predicting the SubjAction, ActionObj and ActionMod relations, we are simultaneously classifying the ambiguous Entity labels into specific Subject and Object labels. For instance, Rule 1 predicts a ModObj relation between a Modifier and an Entity, implying that the Entity is an Object, whereas Rule 3 predicts a SubjAction relation between an Entity and an Action, implying that the Entity is a Subject.

Task 4 -Predict attribute labels
A significant obstacle in the prediction of attribute labels is the large number of attribute labels available. More precisely, we discover that many of these attribute labels occur rarely, if not never, in the annotated reports. This results in a severely sparse dataset for training a model. Due to the lack of substantial data, we decide to use token groups instead of entire sentences for predicting attribute labels. Token groups are the set of tokens that are linked to each other via relation labels. We extract the token groups from the gold annotations and then build a model for predicting the attribute labels for each token group. Again, we use a bag-of-words model to represent the token groups while SVM and NB are each used to build a model for predicting attribute labels.
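One way to extract token groups is to treat annotated tokens as nodes and relation labels as edges, and then collect connected components. The sketch below takes that interpretation, which is an assumption; the token identifiers (T1, T2, ...) and relations are hypothetical.

```python
from collections import defaultdict

def token_groups(token_ids, relations):
    """Group tokens that are linked, directly or indirectly, by relations."""
    neighbours = defaultdict(set)
    for a, b in relations:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, groups = set(), []
    for start in token_ids:
        if start in seen or start not in neighbours:
            continue  # skip visited tokens and tokens without any relation
        stack, group = [start], []
        seen.add(start)
        while stack:  # depth-first traversal of one connected component
            node = stack.pop()
            group.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        groups.append(sorted(group))
    return groups

# Hypothetical annotated tokens and the relations linking them.
token_ids = ["T1", "T2", "T3", "T4", "T5", "T6"]
relations = [("T1", "T2"), ("T2", "T3"), ("T4", "T5")]
groups = token_groups(token_ids, relations)
```

Each resulting group would then be turned into a bag-of-words vector and paired with the attribute labels of the Action it contains.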
Results: Table 6 shows the average scores over 5 runs for the four separate attribute categories. For this task, SVM appears to perform generally better than NB, although much more data seems to be required in order to train a reliable model for predicting attribute labels. The Capability category shows the best performance, which is to be expected, since the Capability attributes occur the most frequently.
Discussion: The main challenge for this task is the sparse data and the abundant attribute labels available. In fact, out of the 444 attribute labels, 190 labels do not appear in the database. For the remaining 254 attribute labels that do occur in the database, 92 labels occur fewer than five times and 50 labels occur only once. With the sparse data available, particularly for rare attribute labels, effective one-shot learning models might have to be designed to tackle this difficult task.

Task 5 -Predict malware signatures using text and annotations
Conventional malware analyzers, such as malwr.com, generate a list of signatures based on the malware's activities in a sandbox. Examples of such signatures include antisandbox_sleep, which indicates anti-sandbox capabilities, and persistence_autorun, which indicates persistence capabilities.
If it is possible to build an effective model to predict malware signatures based on natural language texts about the malware, this can help cybersecurity researchers predict signatures of malware samples that are difficult to obtain, using the malware reports freely available online.
By analyzing the hashes listed in each APT report, we obtain a list of signatures for the malware discussed in the report. However, we are unable to obtain the signatures for several hashes due to restricted distribution of malware samples. There are 8 APT reports without any obtained signatures, which are subsequently discarded for the following experiments. This leaves us with 31 out of 39 APT reports.
The current list of malware signatures from Cuckoo Sandbox (https://cuckoosandbox.org/) consists of 378 signature types. However, only 68 signature types have been identified for the malware discussed in the 31 documents. Furthermore, out of these 68 signature types, 57 signature types appear fewer than 10 times, which we exclude from the experiments. The experiments that follow will focus on predicting the remaining 11 signature types using the 31 documents.
The OneVsRestClassifier implementation in scikit-learn is used in the following experiments, since this is a multilabel classification problem. We also use SVM and NB to build two types of models for comparison.
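The frequency filter and the multilabel targets can be sketched as follows. How occurrences are counted (here, once per listed hash that triggered the signature) is an assumption, and the toy data uses a threshold of 2 instead of 10 to stay small. The signature names are those mentioned in this section; the document names are hypothetical.

```python
from collections import Counter

def frequent_signatures(doc_signatures, min_count=10):
    """Keep signature types that appear at least min_count times."""
    counts = Counter(sig for sigs in doc_signatures.values() for sig in sigs)
    return sorted(sig for sig, count in counts.items() if count >= min_count)

def indicator_matrix(doc_signatures, kept):
    """One binary target per kept signature type for every document."""
    return {doc: [int(sig in sigs) for sig in kept]
            for doc, sigs in doc_signatures.items()}

# Hypothetical signatures obtained for the hashes listed in three reports.
doc_signatures = {
    "doc_a": ["persistence_autorun", "antisandbox_sleep", "persistence_autorun"],
    "doc_b": ["persistence_autorun", "has_pdb"],
    "doc_c": ["antisandbox_sleep"],
}

kept = frequent_signatures(doc_signatures, min_count=2)
targets = indicator_matrix(doc_signatures, kept)
```

A one-vs-rest setup then trains one binary classifier per column of this indicator matrix, which is why rare signature types with too few positive documents are excluded.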
Three separate methods are used to generate features for the task: a) the whole text in each APT report is used as features via a bag-of-words representation, without annotations, b) the gold labels from the annotations are used as features, without the text, and c) both the text and the gold annotations are used, via a concatenation of the two feature vectors.
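The three feature settings can be sketched with plain count vectors. Representing the gold annotations as counts of attribute labels is an assumption, as are the toy vocabularies; the two attribute label names are ones mentioned elsewhere in this paper.

```python
from collections import Counter

def count_vector(items, vocabulary):
    """Count how often each vocabulary entry occurs in items."""
    counts = Counter(items)
    return [counts.get(entry, 0) for entry in vocabulary]

def build_features(text_tokens, annotation_labels, text_vocab, label_vocab, method):
    """Build features from the text, the annotations, or both concatenated."""
    text_vec = count_vector(text_tokens, text_vocab)
    label_vec = count_vector(annotation_labels, label_vocab)
    if method == "text":
        return text_vec
    if method == "annotations":
        return label_vec
    if method == "both":
        return text_vec + label_vec   # concatenation of the two vectors
    raise ValueError(f"unknown method: {method}")

# Toy vocabularies and one toy report.
text_vocab = ["autorun", "malware", "registry"]
label_vocab = ["MalwareCapability-persistence",
               "MalwareCapability-command_and_control"]
tokens = ["malware", "registry", "malware"]
labels = ["MalwareCapability-persistence"]

text_only = build_features(tokens, labels, text_vocab, label_vocab, "text")
labels_only = build_features(tokens, labels, text_vocab, label_vocab, "annotations")
combined = build_features(tokens, labels, text_vocab, label_vocab, "both")
```

The annotation vector is much shorter and denser than the text vector, which is consistent with the annotations acting as a condensed source of features.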
Results: Comparing the first two rows in Table 7, we can see that using the annotations as features significantly improves the results, especially the precision. The SVM model also seems to benefit more from the annotations, even outperforming NB in one case.
Discussion: The significant increase in precision suggests that the annotations provide a condensed source of features for predicting malware signatures, improving the models' confidence. We also observe that some signatures seem to benefit more from the annotations, such as persistence_autorun and has_pdb. In particular, persistence_autorun has a direct parallel in attribute labels, which is MalwareCapability-persistence, showing that using MAEC vocabulary as attribute labels is appropriate.

Conclusion
In this paper, we presented a framework for annotating malware reports.
We also introduced a database with 39 annotated APT reports, proposed several new tasks and built models for extracting information from the reports. Finally, we discussed several factors that make these tasks extremely challenging given currently available models. We hope that this paper and the accompanying database serve as a first step towards NLP being applied in cybersecurity and that other researchers will be inspired to contribute to the database and to construct their own datasets and implementations. More details about this database can be found at http://statnlp.org/research/re/.