HCCL at SemEval-2018 Task 8: An End-to-End System for Sequence Labeling from Cybersecurity Reports

This paper describes the HCCL team's systems for SemEval-2018 Task 8: SecureNLP (Semantic Extraction from cybersecurity reports using NLP). To solve the problem, we applied a neural network architecture that automatically benefits from both word-level and character-level representations by combining a bidirectional LSTM, a CNN and a CRF (Ma and Hovy, 2016). Our system is truly end-to-end, requiring no feature engineering or data preprocessing, and we ranked 4th in SubTask1, 7th in SubTask2 and 3rd in SubTask2-relaxed.


Introduction
Recently, cybersecurity defense has been recognized as one of the problem areas likely to be important both for advancing AI and for its long-run impact on society. In particular, natural language processing (NLP) has the potential to contribute substantially to cybersecurity, and this is a critical research area given the urgency and risks involved (Lim et al., 2017).
In SemEval 2018 Task 8 (Phandi et al., 2018), there are four subtasks:
1. SubTask1: classify whether a sentence is useful for inferring malware actions and capabilities.
2. SubTask2: predict the token labels in the sentences. The output needs to be in BIO format. There are 3 types of token labels: "Action", "Entity", and "Modifier".
3. SubTask3: predict the relations between the token labels.
4. SubTask4: predict the attributes for each entity token.
In this evaluation, our team submitted results for SubTask1 and SubTask2. To tackle this problem, we treat SubTask2 as a sequence labeling problem. Most traditional high-performance sequence labeling models are linear statistical models, including Hidden Markov Models (HMM) and Conditional Random Fields (CRF) (Luo et al., 2015), which rely heavily on hand-crafted features and task-specific resources.
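To make the BIO output format required by SubTask2 concrete, the following minimal sketch converts labeled token spans into BIO tags. The sentence, the span annotations and the helper name `spans_to_bio` are invented for illustration and are not taken from the task data.

```python
from typing import List, Tuple


def spans_to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert labeled token spans into BIO tags.

    Each span is (start, end, label) with `end` exclusive. Spans are assumed
    not to overlap. Tokens outside every span receive the tag "O".
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the span
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the span
    return tags


# Hypothetical sentence and annotation, for illustration only.
tokens = ["The", "malware", "deletes", "shadow", "copies", "silently"]
spans = [(1, 2, "Entity"), (2, 3, "Action"), (3, 5, "Entity")]
bio_tags = spans_to_bio(tokens, spans)
print(list(zip(tokens, bio_tags)))
```

With three label types, the tag set of the sequence labeler contains seven tags: "O" plus a "B-" and an "I-" variant of "Action", "Entity" and "Modifier".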
Recently, many neural network based methods have been successfully applied to sequence labeling tasks such as Named Entity Recognition (Lample et al., 2016). In this paper, we present an end-to-end system (combining CNN, LSTM and CRF) for sequence labeling that uses no complicated hand-crafted features or domain knowledge. LSTMs are capable of learning long-term dependencies, which is beneficial for sequence modeling tasks, and a character-level CNN provides a character-level representation of each word. For sequence labeling (or general structured prediction) tasks, it is beneficial to consider the correlations between neighboring labels and jointly decode the best chain of labels for a given input sentence. We therefore model the label sequence jointly using a conditional random field (CRF) instead of decoding each label independently. The system we propose is thus based on a CNN, a bidirectional LSTM and a CRF, and it ranked 3rd in SubTask2-relaxed. For SubTask1, we propose a rule-based method: if any token in the sentence is labeled "Action", "Entity", or "Modifier", the sentence is considered relevant. Our team ranked 4th in SubTask1.

System Description
In this section, we describe the components (layers) of our end-to-end system. Our model is a CNN-BiLSTM-CRF that combines word-level, character-level and POS representations as feature input, and it outperforms the baseline in SubTask2-relaxed.

Feature Embedding
Feature representations, as the input to neural networks, have received a great deal of attention, and there have been many outstanding achievements. In our system, the word-level embeddings are trained with Google's Word2Vec tool (Mikolov et al., 2013). Previous studies (Santos and Guimaraes, 2015; Chiu and Nichols, 2015) have shown that a CNN is an effective approach to extract morphological information (such as the prefix or suffix of a word) from the characters of words and encode it into neural representations. To obtain more diverse information, we additionally use part-of-speech (POS) tags as an extra feature input.
Word-level embeddings: Taking into account the particularity of the cybersecurity domain, we use the evaluation data to train our own word embeddings. Word-level embeddings are trained with Word2Vec, and we set the embedding dimension to 300.
Character-level embeddings: Character-level embeddings are randomly initialized and trainable, and we set the embedding dimension to 30.

Model
We provide a brief description of the CNN, LSTM and CRF, and present a hybrid sequence labeling architecture. This architecture is similar to the one presented by Ma and Hovy (2016). Figure 1 shows the CNN we use to extract the character-level representation of a given word. The CNN is similar to the one in Chiu and Nichols (2015), except that we use only character embeddings as the inputs to the CNN, without character type features. A dropout layer (Srivastava et al., 2014) is applied before the character embeddings are input to the CNN.
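As a rough illustration of how a character-level CNN turns a word of any length into a fixed-size vector, the sketch below applies a 1-D convolution over character embeddings followed by max-over-time pooling. The character embedding dimension of 30 comes from the text above; the number of filters, the window size, the wide padding, the random weights and all function names are assumptions made for this example, and dropout and bias terms are omitted.

```python
import random
from typing import List

Vector = List[float]


def char_cnn(char_vecs: List[Vector], filters: List[List[Vector]], window: int) -> Vector:
    """Convolve each filter over the padded characters, then max-pool over positions.

    filters[k] is a (window x dim) weight matrix. The word must be non-empty.
    """
    dim = len(char_vecs[0])
    pad = [[0.0] * dim for _ in range(window - 1)]
    padded = pad + char_vecs + pad          # wide padding so short words still work
    output = []
    for filt in filters:
        best = float("-inf")
        for start in range(len(padded) - window + 1):
            score = sum(
                filt[j][d] * padded[start + j][d]
                for j in range(window)
                for d in range(dim)
            )
            best = max(best, score)         # max-over-time pooling
        output.append(best)
    return output


random.seed(0)
CHAR_DIM = 30       # character embedding dimension stated in the paper
NUM_FILTERS = 30    # assumed, not stated in the paper
WINDOW = 3          # assumed, not stated in the paper

alphabet = "abcdefghijklmnopqrstuvwxyz"
char_table = {c: [random.uniform(-0.1, 0.1) for _ in range(CHAR_DIM)] for c in alphabet}
filters = [
    [[random.uniform(-0.1, 0.1) for _ in range(CHAR_DIM)] for _ in range(WINDOW)]
    for _ in range(NUM_FILTERS)
]


def word_char_repr(word: str) -> Vector:
    """Look up the character embeddings of a word and run the CNN over them."""
    return char_cnn([char_table[c] for c in word.lower()], filters, WINDOW)


rep_short = word_char_repr("ab")
rep_long = word_char_repr("ransomware")
print(len(rep_short), len(rep_long))
```

Both words map to vectors of the same length (one value per filter), which is what allows the result to be concatenated with the word embedding.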

LSTM
Recurrent neural networks (RNNs) are a family of neural networks that operate on sequential data. Although RNNs can, in theory, learn long dependencies, in practice they fail to do so and tend to be biased towards the most recent inputs in the sequence (Bengio et al., 1994). Long Short-Term Memory networks (LSTMs) have been designed to combat this issue by incorporating a memory cell and have been shown to capture long-range dependencies. They do so using several gates that control the proportion of the input to give to the memory cell, and the proportion of the previous state to forget (Hochreiter and Schmidhuber, 1997). In our model, one LSTM reads each sentence from left to right and a second LSTM reads it from right to left. We refer to the former as the forward LSTM and the latter as the backward LSTM. This forward and backward LSTM pair is referred to as a bidirectional LSTM (Graves and Schmidhuber, 2005; Dyer et al., 2015). The basic idea is to present each sequence forwards and backwards to two separate hidden states to capture past and future information, respectively.
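The gating mechanism and the forward/backward pairing can be sketched as follows. This uses the widely known LSTM formulation with separate input, forget and output gates; it is an assumption for illustration and may differ from the exact variant used in our system. The class and function names, the toy dimensions and the random weights are all hypothetical.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def affine(W: Matrix, x: Vector, b: Vector) -> Vector:
    """Compute W x + b for a matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


class LSTMCell:
    def __init__(self, input_dim: int, hidden_dim: int, rng: random.Random) -> None:
        def mat() -> Matrix:
            return [
                [rng.uniform(-0.1, 0.1) for _ in range(input_dim + hidden_dim)]
                for _ in range(hidden_dim)
            ]

        self.hidden_dim = hidden_dim
        # One weight matrix and bias per gate: input, forget, output, candidate.
        self.W = {gate: mat() for gate in "ifoc"}
        self.b = {gate: [0.0] * hidden_dim for gate in "ifoc"}

    def step(self, x: Vector, h: Vector, c: Vector) -> Tuple[Vector, Vector]:
        z = x + h  # concatenate the current input with the previous hidden state
        i = [sigmoid(v) for v in affine(self.W["i"], z, self.b["i"])]
        f = [sigmoid(v) for v in affine(self.W["f"], z, self.b["f"])]
        o = [sigmoid(v) for v in affine(self.W["o"], z, self.b["o"])]
        g = [math.tanh(v) for v in affine(self.W["c"], z, self.b["c"])]
        c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
        h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
        return h_new, c_new

    def run(self, inputs: List[Vector]) -> List[Vector]:
        """Return the hidden state at every position, starting from zero states."""
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        outputs = []
        for x in inputs:
            h, c = self.step(x, h, c)
            outputs.append(h)
        return outputs


def bilstm(forward: LSTMCell, backward: LSTMCell, inputs: List[Vector]) -> List[Vector]:
    """Concatenate forward states with backward states realigned to token order."""
    fwd = forward.run(inputs)
    bwd = backward.run(inputs[::-1])[::-1]
    return [f + b for f, b in zip(fwd, bwd)]


rng = random.Random(0)
INPUT_DIM, HIDDEN_DIM = 4, 3  # toy sizes; the paper uses a hidden size of 300
forward_lstm = LSTMCell(INPUT_DIM, HIDDEN_DIM, rng)
backward_lstm = LSTMCell(INPUT_DIM, HIDDEN_DIM, rng)
sentence = [[rng.uniform(-1, 1) for _ in range(INPUT_DIM)] for _ in range(5)]
states = bilstm(forward_lstm, backward_lstm, sentence)
print(len(states), len(states[0]))
```

Each token ends up with a vector of twice the hidden size: the first half summarizes the left context and the second half the right context.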

CRF
For sequence labeling (or general structured prediction) tasks, it is beneficial to consider the correlations between neighboring labels and jointly decode the best chain of labels for a given input sentence. Therefore, we model the label sequence jointly using a conditional random field (CRF) (Lafferty et al., 2001), instead of decoding each label independently.
For a sequence CRF model (in which only interactions between two successive labels are considered), training and decoding can be solved efficiently by adopting the Viterbi algorithm.
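The decoding step can be illustrated with a minimal Viterbi sketch for a linear-chain model. The label set, the emission scores and the transition scores below are made-up numbers chosen to show how joint decoding differs from picking the best label at each position independently; they are not values learned by our system.

```python
from typing import List


def viterbi_decode(emissions: List[List[float]], transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring label sequence.

    emissions[t][y] is the score of label y at position t, and
    transitions[a][b] is the score of moving from label a to label b.
    """
    num_labels = len(transitions)
    scores = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_scores = []
        pointers = []
        for cur in range(num_labels):
            best_prev = max(range(num_labels), key=lambda p: scores[p] + transitions[p][cur])
            new_scores.append(scores[best_prev] + transitions[best_prev][cur] + emit[cur])
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)
    # Follow the backpointers from the best final label.
    best = max(range(num_labels), key=lambda y: scores[y])
    path = [best]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


labels = ["O", "B-Action", "I-Action"]
# Hypothetical scores: the transition O -> I-Action is heavily penalized.
transitions = [
    [0.0, 0.0, -10.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
emissions = [
    [2.0, 0.5, 0.0],
    [0.1, 1.0, 1.2],
    [0.2, 0.0, 1.5],
]

decoded = [labels[y] for y in viterbi_decode(emissions, transitions)]
greedy = [labels[max(range(len(labels)), key=lambda y: row[y])] for row in emissions]
print("viterbi:", decoded)
print("greedy: ", greedy)
```

In this toy case, independent decoding produces an "I-Action" tag directly after "O", whereas joint decoding trades a slightly lower emission score for a valid "B-Action", "I-Action" chain.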

CNN-BiLSTM-CRF
Finally, we construct our neural network model by feeding the output vectors of the BiLSTM into a CRF layer. Figure 2 illustrates the architecture of our network in detail. For each word, the character-level representation is computed by the CNN in Figure 1 with character embeddings as inputs, and we use NLTK (http://www.nltk.org/) to obtain POS information. The character-level representation vector and the POS representation vector are then concatenated with the word embedding vector and fed into the BiLSTM network. Finally, the output vectors of the BiLSTM are fed to the CRF layer to jointly decode the best label sequence. As shown in Figure 2, dropout layers are applied to both the input and output vectors of the BiLSTM. Experimental results show that using dropout significantly improves the performance of our model.
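The construction of the per-token BiLSTM input can be sketched as a simple concatenation followed by dropout. The word embedding dimension of 300 and the dropout rate of 0.5 are taken from this paper; the size of the character-level vector, the one-hot encoding of POS tags, the reduced tag set and the helper names are assumptions made only for this example.

```python
import random
from typing import Dict, List


def one_hot(index: int, size: int) -> List[float]:
    return [1.0 if i == index else 0.0 for i in range(size)]


def dropout(vec: List[float], rate: float, rng: random.Random, training: bool) -> List[float]:
    """Inverted dropout: zero entries with probability `rate`, rescale the rest."""
    if not training:
        return list(vec)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in vec]


def build_token_input(
    word_vec: List[float],
    char_vec: List[float],
    pos_tag: str,
    pos_index: Dict[str, int],
) -> List[float]:
    """Concatenate word, character-level and POS vectors for one token."""
    pos_vec = one_hot(pos_index[pos_tag], len(pos_index))
    return word_vec + char_vec + pos_vec


rng = random.Random(0)
WORD_DIM = 300        # word embedding dimension stated in the paper
CHAR_REPR_DIM = 30    # assumed size of the character-level CNN output
pos_index = {"NN": 0, "VBZ": 1, "DT": 2, "OTHER": 3}  # hypothetical reduced tag set

word_vec = [rng.uniform(-1, 1) for _ in range(WORD_DIM)]
char_vec = [rng.uniform(-1, 1) for _ in range(CHAR_REPR_DIM)]
token_input = build_token_input(word_vec, char_vec, "VBZ", pos_index)

train_input = dropout(token_input, 0.5, rng, training=True)
eval_input = dropout(token_input, 0.5, rng, training=False)
print(len(token_input))
```

Dropout is active only during training; at prediction time the concatenated vector is passed through unchanged.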

Training
We train our networks using the back-propagation algorithm, updating the parameters on every training batch, using Adam with a learning rate of 0.001 and gradient clipping of 5.0. Our CNN-BiLSTM-CRF model uses a single layer for each of the forward and backward LSTMs, whose dimensions are set to 300. Tuning this dimension did not significantly impact model performance. We set the dropout rate to 0.5. Using higher rates negatively impacted our results, while smaller rates led to longer training time. The models were implemented in TensorFlow and experiments were run on a K80 GPU.
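One parameter update under these settings can be sketched as follows. The learning rate of 0.001 and the clipping threshold of 5.0 come from the text above; clipping by global norm (rather than by value) and the Adam hyperparameters beta1 = 0.9, beta2 = 0.999 and eps = 1e-8 are assumptions, since the paper does not specify them. The class and function names are illustrative and do not correspond to the TensorFlow API.

```python
import math
from typing import List


def clip_by_global_norm(grads: List[float], max_norm: float) -> List[float]:
    """Rescale the gradient vector so that its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


class Adam:
    def __init__(self, size: int, lr: float = 0.001, beta1: float = 0.9,
                 beta2: float = 0.999, eps: float = 1e-8) -> None:
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = [0.0] * size   # running mean of gradients
        self.v = [0.0] * size   # running mean of squared gradients
        self.t = 0

    def update(self, params: List[float], grads: List[float]) -> List[float]:
        self.t += 1
        new_params = []
        for k, (p, g) in enumerate(zip(params, grads)):
            self.m[k] = self.beta1 * self.m[k] + (1 - self.beta1) * g
            self.v[k] = self.beta2 * self.v[k] + (1 - self.beta2) * g * g
            m_hat = self.m[k] / (1 - self.beta1 ** self.t)   # bias correction
            v_hat = self.v[k] / (1 - self.beta2 ** self.t)
            new_params.append(p - self.lr * m_hat / (math.sqrt(v_hat) + self.eps))
        return new_params


params = [1.0, -2.0]
raw_grads = [30.0, 40.0]                      # toy gradient with L2 norm 50
clipped = clip_by_global_norm(raw_grads, 5.0)
optimizer = Adam(size=len(params), lr=0.001)
updated = optimizer.update(params, clipped)
print(clipped, updated)
```

Clipping preserves the direction of the gradient and only limits its length, which keeps occasional very large gradients from destabilizing the recurrent layers.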

Result
In this work, our team submitted results for SubTask1 and SubTask2. The results of all the teams are shown in Table 2.
The goal of SubTask1 is to classify whether a sentence is relevant for inferring malware actions and capabilities. We make use of the SubTask2 output for this subtask and consider a sentence to be relevant as long as it contains at least one annotated token label. Table 2 shows that our system is ranked 4th and performs better than the baseline on SubTask1.
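The SubTask1 rule described above amounts to a one-line check over the predicted BIO tags of a sentence. The function name and the example tag sequences below are illustrative.

```python
from typing import List


def sentence_is_relevant(predicted_tags: List[str]) -> bool:
    """A sentence is relevant if any token is predicted as Action, Entity or Modifier."""
    return any(tag != "O" for tag in predicted_tags)


# Hypothetical SubTask2 predictions for two sentences.
tagged = ["O", "B-Entity", "B-Action", "B-Entity", "I-Entity"]
untagged = ["O", "O", "O", "O"]
print(sentence_is_relevant(tagged), sentence_is_relevant(untagged))
```

Because the decision depends entirely on the token-level predictions, a single spurious non-"O" tag is enough to mark a sentence as relevant.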
For SubTask2, our CNN-BiLSTM-CRF model is trained to predict token labels from cybersecurity reports. Table 2 shows that our system is slightly worse than the baseline on SubTask2. However, our system achieves a 22.5% improvement over the baseline on SubTask2-relaxed.

Error Analysis
For SubTask1, many non-malware sentences are classified as malware sentences. This may be because we use the SubTask2 output to decide whether a sentence is a malware or non-malware sentence, so SubTask2 errors propagate to SubTask1. In addition, both non-malware and malware sentences contain annotated tokens.
For SubTask2, we find that many unannotated tokens are labeled as annotated tokens and many annotated tokens are left unlabeled. By analyzing the data, we found that the same words occur as both unannotated and annotated tokens in the sentences, which might explain the low F-score of our system.
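The kind of label ambiguity described above can be surfaced with a simple count over the training data. The sketch below is one possible way to do this and is not the analysis script used for this paper; the toy sentences and the function name are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def find_ambiguous_words(sentences: List[List[Tuple[str, str]]]) -> List[str]:
    """Return words that occur both with the tag "O" and with a non-"O" tag."""
    seen: Dict[str, Set[bool]] = defaultdict(set)
    for sentence in sentences:
        for word, tag in sentence:
            seen[word.lower()].add(tag != "O")
    return sorted(word for word, flags in seen.items() if len(flags) == 2)


# Hypothetical annotated sentences as (token, BIO tag) pairs.
corpus = [
    [("The", "O"), ("dropper", "B-Entity"), ("deletes", "B-Action"), ("files", "B-Entity")],
    [("Users", "O"), ("store", "O"), ("files", "O"), ("locally", "O")],
]
ambiguous = find_ambiguous_words(corpus)
print(ambiguous)
```

Words returned by such a check cannot be labeled correctly from their identity alone, so the model has to rely on sentence context to disambiguate them.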

Conclusion
In this paper, we presented the system we used to compete in the SemEval-2018 Semantic Extraction from cybersecurity reports using NLP competition. Our goal is to implement a deep learning based end-to-end system that can solve cross-domain sequence labeling problems without complicated feature engineering.
For future work, it would be interesting to explore systems that can solve the problem of self-adaptation between different domains. Transfer learning might also be a way to handle the lack of labeled data.