NHK_STRL at WNUT-2020 Task 2: GATs with Syntactic Dependencies as Edges and CTC-based Loss for Text Classification

The outbreak of COVID-19 has greatly impacted our daily lives. In these circumstances, it is important to grasp the latest information to avoid causing too much fear and panic. Extracting information from social networking sites is one effective way to do so. In this paper, we describe a method to identify whether a tweet related to COVID-19 is informative or not. The key features of our method are its use of graph attention networks to encode syntactic dependencies and word positions in the sentence, and a loss function based on connectionist temporal classification (CTC) that can learn a label for each token without token-level reference data. Experimental results show that the proposed method achieved an F1 score of 0.9175, outperforming baseline methods.


Introduction
The outbreak of COVID-19, which began at the end of 2019, has greatly impacted our daily lives. In these circumstances, it is important for everyone to understand the situation and grasp the latest information to avoid causing too much fear and panic. Nowadays, social networking sites (SNSs) such as Twitter and Facebook are important information sources because users post information about their personal events, including those related to COVID-19, in real time. For this reason, many monitoring systems for COVID-19 have been developed, such as the Johns Hopkins Coronavirus Dashboard and the COVID-19 Health System Response Monitor. Many systems use SNSs as resources but depend largely on manual work, such as crowdsourcing, to extract informative posts from massive numbers of uninformative ones. In general, SNSs contain too much information on miscellaneous topics, so extracting important information is difficult. Therefore, we attempted to develop a method to extract important information.
Our method first embeds each token in the input sentence using BERT (Devlin et al., 2019). Then, the vectors are fed into graph attention networks (GATs) (Veličković et al., 2018) to encode token-to-token relations. Finally, our method classifies each vector into labels using feed-forward neural networks (FFNNs). In the training process, we use a loss function based on connectionist temporal classification (CTC) (Graves et al., 2006). Experimental results show that our method using GATs and the CTC-based loss function achieved an F1 score of 0.9175, outperforming baseline methods.
Our contributions are as follows: (1) We propose a GAT-based network to embed syntactic dependencies and positional features of tokens in an input sentence. (2) We also propose a loss function that enables learning a label for each token. (3) We confirmed the effectiveness of our proposed methods using the dataset of the shared task on identification of informative COVID-19 English Tweets.

Identifying Informative COVID-19 Tweets Shared Task
The identification of informative COVID-19 English Tweets is a shared task held at W-NUT (Workshop on Noisy User-generated Text) 2020 (Nguyen et al., 2020b). The purpose of the task is to identify whether English tweets related to COVID-19 are informative or not. The dataset for the task contains 7,000 tweets for training, 1,000 for validating, and 2,000 for testing. Each tweet in the data, excluding those in the testing data, is labelled informative or uninformative. The target metric of the task is the F1 score for informative tweets.
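The target metric can be made concrete with a small sketch. The function name and the label strings below are illustrative assumptions, not part of the shared task's official scorer; the computation is the standard F1 restricted to one positive class.

```python
def f1_for_class(gold, pred, positive="INFORMATIVE"):
    """F1 score treating `positive` as the only positive class."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        return 0.0  # precision or recall is zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels, for illustration only.
gold = ["INFORMATIVE", "INFORMATIVE", "UNINFORMATIVE", "UNINFORMATIVE"]
pred = ["INFORMATIVE", "UNINFORMATIVE", "INFORMATIVE", "UNINFORMATIVE"]
score = f1_for_class(gold, pred)
```

Because only the informative class counts as positive, correctly rejected uninformative tweets do not raise the score.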
Figure 1: Overview of our method. Our method first embeds each token in an input sentence using BERT. Syntactic dependencies are also obtained using a dependency parser. Then, our method embeds syntactic features using GATs, by using a graph that has token-embedding vectors as nodes and syntactic dependencies and self-loops as edges. Positional features are also added to the graph. The output vectors of the GATs are concatenated with the BERT output vectors and then fed into 2-layer FFNNs, which classify each vector into labels. If one or more vectors are labelled as informative, the output class is informative. Note that the arrows in the dependency parser example connect the head word to the dependent word, following convention. On the other hand, the arrows in the GAT example connect the dependent word to the head word, as used in our proposed method.
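The final decision rule stated in the caption (one or more informative tokens makes the tweet informative) can be sketched as follows. The function name and the label strings are assumptions made for illustration.

```python
def tweet_label(token_labels, target="informative", default="uninformative"):
    """Return `target` if at least one token carries it, else `default`."""
    return target if any(label == target for label in token_labels) else default


# Hypothetical per-token outputs for one tweet.
example = tweet_label(["blank", "blank", "informative", "blank"])
```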

Methods
The overview of our method is illustrated in Figure 1. The key features of our method are embedding syntactic dependencies and positional features using GATs (Veličković et al., 2018), and calculating the loss in the training process using a loss function based on CTC (Graves et al., 2006). We also use masked-token estimation as an auxiliary task in multi-task learning to help improve generalization. We apply word dropout (Sennrich et al., 2016) before BERT, and the "dropped" tokens are used as the masked words to be estimated during training.
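A minimal sketch of how word dropout can double as masked-token estimation is given below. It assumes that a dropped token is replaced by a `[MASK]` symbol and that the original token becomes the prediction target; the source does not state the replacement symbol, and the function name and `seed` parameter are hypothetical.

```python
import random


def word_dropout(tokens, rate=0.2, mask_token="[MASK]", seed=None):
    """Randomly mask tokens; return the masked sequence and the targets.

    targets[i] is the original token where position i was dropped,
    and None where no masked-token loss should be computed.
    """
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < rate:
            masked.append(mask_token)
            targets.append(tok)   # token to be estimated by the model
        else:
            masked.append(tok)
            targets.append(None)  # position ignored by the auxiliary loss
    return masked, targets


tokens = ["new", "covid", "cases", "reported", "today"]
masked, targets = word_dropout(tokens, rate=0.2, seed=0)
```

The default rate of 0.2 matches the word dropout rate reported in the experimental settings.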

GATs for encoding token-to-token relations
The BERT model, which we use for token embedding, uses position encoding to take token positions into account, but its ability to capture global information, including syntactic features, is limited (Lu et al., 2020). Therefore, we use GATs with syntactic dependencies as edges of the graph, which enables our method to handle syntactic dependencies explicitly. This is inspired by the work of Huang and Carley (2019). We use all universal dependencies (McDonald et al., 2013) as directed edges regardless of dependency type. The tokenizer used in BERT often splits a single word into multiple tokens. We connect edges from all tokens of a word to all tokens of its head word. For example, if there is a relation between the two words COVID-19 and tweet, and the former is divided into the two tokens COVID and ##-19, our method adds two edges: COVID to tweet and ##-19 to tweet.
The GAT is based on multi-head attention (Vaswani et al., 2017) among neighbor nodes, with all the connected nodes used as the keys and values of the attention calculation. When syntactic dependencies are used as edges, the number of incoming edges for a node is in many cases only zero or one. Nodes that have no incoming edges cannot update their vectors in the GATs. Also, for nodes that have only one incoming edge, the attention weight is always 1.0 because it is normalized over a single neighbor, which leads to poor results. To overcome this problem, a self-loop for each node was proposed (Huang and Carley, 2019; Xu and Yang, 2019). Following that work, we use a self-loop for each node in the GATs.
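The edge construction described above can be sketched in plain Python. This is an illustration under assumptions: the function name, the word-to-token index mapping, and the convention that a root word is marked by `None` or by pointing to itself are all hypothetical, and no parser or graph library is called.

```python
def build_edges(word_heads, word_to_tokens, num_tokens):
    """Build directed token-level edges (dependent -> head) plus self-loops.

    word_heads[w]     : index of the head word of word w
                        (None, or w itself, marks the root; an assumption)
    word_to_tokens[w] : indices of the subword tokens that make up word w
    """
    edges = set()
    for dep, head in enumerate(word_heads):
        if head is None or head == dep:
            continue  # the root has no outgoing dependency edge
        # Connect every token of the dependent to every token of the head.
        for src in word_to_tokens[dep]:
            for dst in word_to_tokens[head]:
                edges.add((src, dst))
    # Self-loops, so every node has at least one incoming edge.
    for t in range(num_tokens):
        edges.add((t, t))
    return sorted(edges)


# Words: ["COVID-19", "tweet"]; tokens: ["COVID", "##-19", "tweet"].
# "COVID-19" depends on "tweet", and "tweet" is the root.
edges = build_edges(word_heads=[1, None],
                    word_to_tokens=[[0, 1], [2]],
                    num_tokens=3)
```

The resulting list of (source, destination) pairs could then be passed to a graph library; that step is omitted here.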

Positional features
Many edges are concentrated on the root word of a sentence, but the GATs treat all nodes equally. On the other hand, nearby and distant words are generally more and less related to the root word, respectively. To reflect this, we add positional encoding to our GATs. We use the relative distance between tokens as a parameter and embed it along with the attention coefficient between nodes. We consider two variants, fixed and learned. For fixed, we use a positional embedding $PE^{\mathrm{fixed}}_{ij}$ between the i-th and j-th tokens of the sentence, computed from their relative distance and L, the number of tokens in the sentence. For learned, we use a 1-layer FFNN that takes $PE^{\mathrm{fixed}}_{ij}$ as input, where $W_{PE} \in \mathbb{R}^{1 \times 1}$ and $b_{PE} \in \mathbb{R}^{d}$ are a learnable weight and bias, respectively. The positional features are then broadcast into $PE_{ij} \in \mathbb{R}^{1 \times d}$, where d is the dimension of a GAT layer, and added after calculating the multi-head attention along the edges in the graph.

CTC for Text Classification (CTCTC)
Most tweets that were labelled as informative contain not only informative phrases but also uninformative parts. To take this into account, we propose a new loss function, CTC for Text Classification (CTCTC).

The basis of CTC
Let us consider the input sequence of probabilities $x \in \mathbb{R}^{|T| \times |L|}$, where |T| denotes the length of the sequence and |L| denotes the number of labels to classify. Note that L includes blank, a special CTC symbol assigned to positions to which no label is aligned. The probability $p_{ctc}(y|x)$ for input x and reference data $y \in L^{\le |T|}$ is calculated as follows:
$$p_{ctc}(y|x) = \sum_{\pi \in B^{-1}(y)} \prod_{t=1}^{|T|} x_{t,\pi_t} \qquad (3)$$
where $\pi$ is a path that assigns one symbol of L to each position of the input, and $B^{-1}$ is the inverse of the many-to-one map B of all possible labellings from the input to the reference data. In generating B, blanks are inserted between each label in y, i.e., for $y = \{y_1, y_2, \cdots, y_{|y|}\}$, a modified reference $y' = \{\mathrm{blank}, y_1, \mathrm{blank}, y_2, \cdots, y_{|y|}, \mathrm{blank}\}$ is used to generate B. In Figure 2, B is equal to the set of the paths of black arrows that finally reach one of the two dots in the blue box. Then, $p_{ctc}$ is the sum of the probabilities of all paths that pass through all labels in the order given by the reference data, which is shown as the sum of the two probabilities in the blue box in Figure 2.
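To make the sum over paths concrete, the following sketch enumerates every path by brute force and keeps those that the many-to-one map sends to the reference. It follows the standard CTC definition of Graves et al. (2006) and is meant only for tiny inputs; practical implementations use a forward-backward dynamic program instead. The names `collapse` and `ctc_probability`, and the choice of index 0 for blank, are illustrative assumptions.

```python
from itertools import product

BLANK = 0  # index of the blank symbol (an assumption of this sketch)


def collapse(path):
    """Many-to-one map B: merge repeated symbols, then drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return tuple(out)


def ctc_probability(x, y):
    """Sum the probabilities of all paths pi with collapse(pi) == y.

    x[t][k] is the probability of symbol k at position t.
    """
    num_symbols = len(x[0])
    total = 0.0
    for path in product(range(num_symbols), repeat=len(x)):
        if collapse(path) == tuple(y):
            p = 1.0
            for t, s in enumerate(path):
                p *= x[t][s]
            total += p
    return total


# Two positions, symbols {blank=0, label=1}; toy probabilities.
x = [[0.6, 0.4],
     [0.3, 0.7]]
p_label = ctc_probability(x, [1])  # paths (1,1), (1,0), (0,1)
p_empty = ctc_probability(x, [])   # path (0,0) only
```

In this toy case every path collapses to either the single label or the empty sequence, so the two probabilities sum to one.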

CTCTC loss
We use a CTC-based loss function adapted to text classification. Our loss function accepts the reference data $\bar{y}$, which in this task is a single label ("informative" or "uninformative") for the sentence, and assigns a label or blank to every token in the sentence. It works by automatically handling the uninformative parts of informative tweets as blank. Calculating CTCTC is almost the same as calculating CTC, differing only in the construction of the many-to-one map. First, CTCTC arranges a sufficient number of copies of the given reference label $\bar{y}$ interleaved with blank, i.e., $\bar{y}' = \{\mathrm{blank}, \bar{y}, \mathrm{blank}, \bar{y}, \cdots, \bar{y}, \mathrm{blank}\}$. Then, $\bar{B}$ is generated, which is the set of all possible labellings from the input x to the modified reference data $\bar{y}'$, regardless of the number of labels in $\bar{y}'$ that are passed. In Figure 2, $\bar{B}$ is equal to the set of the paths of black arrows that finally reach one of the dots in the red box. To calculate the CTCTC loss, $\bar{B}$ is used instead of B in Equation (3). As a result, the probability $p_{ctctc}$ represents the probability of at least one token in the input sequence being aligned to the label $\bar{y}$, which is illustrated as the sum of all dots in the red box in Figure 2.
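One way to read this construction is that a valid path emits either blank or the reference label at every position and emits the label at least once. Under that reading, which is our interpretation rather than the authors' stated algorithm, the probability can be computed with a two-state forward recursion and checked against brute-force enumeration. All function names below are hypothetical.

```python
import math
from itertools import product


def ctctc_probability(p_blank, p_label):
    """Probability that at least one position is aligned to the label.

    p_blank[t], p_label[t]: probabilities of blank and of the reference
    label at position t. Two-state forward recursion.
    """
    none_seen, seen = 1.0, 0.0
    for b, l in zip(p_blank, p_label):
        # Update `seen` first so it uses the previous `none_seen`.
        seen = seen * (b + l) + none_seen * l
        none_seen = none_seen * b
    return seen


def ctctc_bruteforce(p_blank, p_label):
    """Same quantity by enumerating all blank/label paths (tiny inputs only)."""
    total = 0.0
    for path in product((0, 1), repeat=len(p_blank)):  # 1 = reference label
        if any(path):
            total += math.prod(l if s else b
                               for s, b, l in zip(path, p_blank, p_label))
    return total


def ctctc_loss(p_blank, p_label, floor=1e-12):
    """Negative log-probability, clamped to avoid log(0)."""
    return -math.log(max(ctctc_probability(p_blank, p_label), floor))


# Toy probabilities for a three-token input.
p_blank = [0.5, 0.6, 0.7]
p_label = [0.3, 0.2, 0.1]
p = ctctc_probability(p_blank, p_label)
```

Under this reading the recursion has the closed form `prod(b + l) - prod(b)`, which here gives 0.512 - 0.21 = 0.302.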

Smoothing for CTCTC
CTCTC tends to align most tokens to blank, and only one token to the reference label. This is because the probability for blank is learned for every sentence in the training data regardless of its label, so the probability tends to be high for all data. To avoid the probabilities of all data being learned as blank, we prepare three types of smoothing.
Label smoothing. We use label smoothing (Szegedy et al., 2016), a regularization technique to avoid overfitting and overconfidence. This replaces the one-hot reference label l with the smoothed label $l'(k)$ as follows:
$$l'(k) = (1 - \epsilon)\,\delta_{k,l} + \frac{\epsilon}{|K|}$$
where $\delta_{k,l}$ is the Kronecker delta, which equals 1 when k = l and 0 otherwise, K is the set of labels to classify, and $\epsilon$ is the smoothing rate. The label-wise smoothing is illustrated as a green box in Figure 2.
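As a small worked example, the sketch below applies label smoothing in the uniform form of Szegedy et al. (2016), where the smoothing mass is spread evenly over all labels. The function name and the three-label set are assumptions for illustration.

```python
def smooth_label(label, labels, eps=0.2):
    """Return the smoothed target distribution over `labels`.

    Each label k gets (1 - eps) * [k == label] + eps / len(labels).
    """
    k = len(labels)
    return {c: (1.0 - eps) * (1.0 if c == label else 0.0) + eps / k
            for c in labels}


# Hypothetical label set including the CTC blank.
labels = ["blank", "informative", "uninformative"]
smoothed = smooth_label("informative", labels, eps=0.2)
```

With a rate of 0.2 and three labels, the reference label keeps 0.8 + 0.2/3 of the mass and each other label receives 0.2/3.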
Token smoothing. This is almost the same as label smoothing but is applied in the token-wise direction. It works on the basis that words close together often have similar meanings. We set the maximum width of this smoothing to 5 in the experiments. The token-wise smoothing is illustrated as a yellow box in Figure 2.
Leaking. To enable learning the probability for labels instead of blank, we use a one-directional smoothing named "leaking." It is computed from the smoothing rate $\epsilon$ and from $p_{i,\mathrm{blank}}$ and $p_{i,\bar{y}}$, the probabilities for blank and for the reference label $\bar{y}$ at the i-th position of the input sequence, respectively. It is applied only to the probability of blank and is illustrated as an orange arrow in Figure 2.

Experimental settings
Our experiments were based on the dataset of the shared task on identification of informative COVID-19 English Tweets described in Section 2. We conducted two experiments, based on the validation data and the testing data, respectively. For the validation data-based experiment, we used the training data (7,000 tweets) for training and the validating data (1,000 tweets) for testing. For the testing data-based experiment, we used the 8,000 tweets from the combined training and validating data for 4-fold cross-validation. Then, an ensemble of the best model from each fold was applied to the testing data. For the ensemble, we summed the output scores of the models.
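The ensemble step can be sketched as follows. The source says only that the output scores of the models were added, so the score format (one score per class for a tweet), the label order, and the function name are assumptions of this sketch.

```python
def ensemble_predict(model_scores, labels=("uninformative", "informative")):
    """Sum per-class scores over models and pick the highest-scoring class.

    model_scores: one list of per-class scores per model, for a single tweet.
    """
    summed = [sum(scores[c] for scores in model_scores)
              for c in range(len(labels))]
    best = max(range(len(labels)), key=lambda c: summed[c])
    return labels[best], summed


# Hypothetical scores from the best model of each of the four folds.
fold_scores = [[0.60, 0.40],
               [0.20, 0.80],
               [0.45, 0.55],
               [0.70, 0.30]]
label, summed = ensemble_predict(fold_scores)
```

Summing scores lets a confident model outweigh several weakly confident ones, unlike majority voting, which in this toy case would produce a tie.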
The models were implemented in PyTorch (Paszke et al., 2019), with Transformers (Wolf et al., 2019) and Deep Graph Library (Wang et al., 2019), and learned with the RAdam optimizer (Liu et al., 2020) with a learning rate of 0.0001. We used BERT-base, uncased (Devlin et al., 2019) as a pretrained model, with fine-tuning and a learning rate of 0.00002. We used spaCy (Honnibal and Montani, 2017) for dependency parsing.
The following hyperparameters were used: 2 GAT layers; a mini-batch size of 16; an L2 regularization coefficient of 0.1; a dropout rate of 0.1; a word dropout rate of 0.2; 50 training iterations, with early stopping on the validating data based on the F1 score for the informative class; and a smoothing ratio of 0.2 for each of the three CTCTC smoothing methods.

Baseline methods
We prepared the baseline methods shown in Figure 3. To confirm the effectiveness of the GATs, we use a "no GATs" baseline in which the output vectors of BERT are fed directly into the FFNNs. To confirm the effectiveness of CTCTC, we use a "no CTCTC" baseline that is trained with cross-entropy loss instead of CTCTC.

Results
Table 1 shows the results for the validation data-based experiment. The rows in which Use GATs and Use CTCTC are not checked correspond to the baselines described in Section 4.2. The F1 score column shows the F1 score for the informative class as the mean and standard deviation over five trials with the same settings. Our methods using both GATs and CTCTC (#9 and 10) achieved the top-2 results in the table. Table 2 shows the results on the test data, which are the official results of the shared task; we ranked 21st out of 55 participants. The table also shows the results of the top-3 teams in the shared task.

Discussion
The results for the methods using GATs with CTCTC (#9 and 10) are better than the others. This is because our CTCTC uses vectors of each token so the performance depends on the quality of the vector of each token. Our GATs work to improve the quality of the vector of each token by using token-to-token relations. Therefore, we believe our GATs and CTCTC work well in combination.
On the other hand, GATs without CTCTC cannot make the best use of the improved vectors because the token vectors are merged into a single vector by max-pooling, so some of the details of the vectors are lost. Also, when using CTCTC without GATs, we observed that the output vectors of the tokens in a sentence are almost the same. This means that token-level information is lost, so accuracy may be lower for methods using CTCTC in these cases. By using GATs with CTCTC, we can avoid losing this information, which leads to good results.

Related Work
There are a number of methods that use GATs with a pre-trained language model. Lu et al. (2020) use a network on a vocabulary graph, which is based on word co-occurrence information, and Huang and Carley (2019) and Xu and Yang (2019) use syntactic features as a graph. There are also several methods that incorporate positional encoding into GATs (Ingraham et al., 2019; Ishiwatari et al., 2020). Our method uses GATs to consider syntactic features and positional features in combination, which distinguishes it from conventional methods. The CTC loss function is widely used for long data sequences whose reference data are not aligned one-to-one, as in speech recognition (Graves et al., 2013; Kim et al., 2017), but to the best of our knowledge, no existing method uses CTC for text classification tasks.

Conclusion and Future Work
In this paper, we proposed a GAT-based model that embeds token-to-token relations, and a loss function that can learn a class for each token. We conducted evaluations using the dataset of the shared task on identification of informative COVID-19 English Tweets and confirmed that our proposed methods are effective.
Our future work includes determining whether CTCTC can work for other tasks, especially classification into a large number of classes, and exploiting pre-trained models other than BERT, especially tweet-specific models such as BERTweet (Nguyen et al., 2020a) and CT-BERT (Müller et al., 2020).