KNU CI System at SemEval-2018 Task4: Character Identification by Solving Sequence-Labeling Problem

Character identification is an entity-linking task that finds words referring to the same person among the nouns mentioned in a conversation and links them to a single entity. In this paper, we define character identification as a sequence-labeling problem and propose an attention-based recurrent neural network (RNN) encoder–decoder model to solve it. The input document for character identification on multiparty dialogues consists of several conversations, which increases the length of the input sequence. The RNN encoder–decoder model performs poorly when the input sequence is long. To solve this problem, we propose applying position encoding and the self-matching network to the RNN encoder–decoder model. Our experimental results show that, of the four proposed models, Model 2 achieved an F1 score of 86.00% and a label accuracy of 85.10% at the scene level.


Introduction
In this paper, we define character identification (CI) (Chen et al., 2017) as a sequence-labeling problem and solve it with a recurrent neural network (RNN) encoder-decoder (Enc-Dec) model (Cho et al., 2014) based on the attention mechanism (Bahdanau et al., 2015). An Enc-Dec is an extension of the RNN model: given an input sequence, the encoder generates an encoding vector using an RNN, and the decoder performs decoding using that encoding vector. The attention mechanism calculates alignment scores between the two sequences and takes a weighted sum over the input sequence so that the model can focus on the positions that most affect the output. The self-matching network (Wang et al., 2017) calculates attention weights between a sequence and itself and builds a context vector as a weighted sum, so that the weights of similar words in the RNN sequence can aid coreference resolution. Position encoding (PE) (Sukhbaatar et al., 2015; Park and Lee, 2017; Vaswani et al., 2017) is a method of applying different weights according to the positions of the words in a sequence. Training and prediction are performed by multiplying a weight vector by a vector of the positions to be identified in a given input sequence.
In an Enc-Dec model, a long input sequence degrades performance because information from the front portion of the input sequence is lost during encoding. In this paper, we propose four models that apply PE, the attention mechanism, and the self-matching network to Enc-Dec models to address this degradation.
To summarize, the main contributions of this paper are as follows: (1) we define the CI task as a sequence-labeling problem and perform training and prediction with an end-to-end model; (2) we propose four models using an Enc-Dec based on the attention mechanism and achieve high performance.

System Description
An Enc-Dec model maximizes $P(y \mid x)$ using an RNN. The encoder generates an encoder hidden state by encoding the input sequence $x$, and the decoder generates an output sequence $y$ that maximizes $P(y \mid x)$ using the decoder hidden states generated up to the current time step together with the encoder hidden state. The attention mechanism determines which part of the encoded input should be focused on when predicting the target class, using the hidden state of the decoder and the hidden states of the encoder during decoding.

Model 1: Attention-based Enc-Dec model
The first model proposed in this paper is a general attention mechanism-based Enc-Dec model, as shown in Figure 1.
The input of the encoder is one document that contains sentences (multiparty dialogue). Each sentence consists of words, and the input sequence is $x = \{x_1, x_2, \dots, x_n\}$. The input to the decoder is $m = \{m_0, m_1, \dots, m_k\}$, consisting of the positions of the words given in the gold mentions, and the output sequence accordingly becomes $y = \{y_0, y_1, \dots, y_k\}$, consisting of the character numbers that correspond to the decoder's input mentions.
We use word embedding and adopt the K-dimensional word embedding $e_i$, $i \in [1, n]$, for all input words, where $i$ is the word index in the input sequence. We perform feature embedding for three features (speaker, named entity recognition (NER) tags, and capitalization) and concatenate them to make $\tilde{x}_i$.
• The uppercase feature is a binary feature (1 or 0) that indicates whether the word contains an uppercase letter.
• A 10-dimensional speaker embedding for a total of 205 different speakers, including "unknown".
• A 19-dimensional NER embedding for a total of 19 different types of NER tags.
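As a rough illustration of how the per-word encoder input could be assembled, the sketch below concatenates a word embedding with the three features listed above. The lookup tables, their random initialization, the toy vocabulary size of 100, and the helper names `make_table`, `has_uppercase`, and `build_input_vector` are all hypothetical; only the dimensions (50 for words, as stated in the experimental section, 10 for speakers, 19 for NER tags, and one binary capitalization value) come from the text.

```python
import random

WORD_DIM, SPEAKER_DIM, NER_DIM = 50, 10, 19  # dimensions stated in the paper


def make_table(num_rows, dim, seed):
    """Build a toy embedding table with small random values (illustration only)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]


# Assumption: a toy vocabulary of 100 words; 205 speakers and 19 NER tags as in the text.
word_table = make_table(100, WORD_DIM, seed=0)
speaker_table = make_table(205, SPEAKER_DIM, seed=1)
ner_table = make_table(19, NER_DIM, seed=2)


def has_uppercase(token):
    """Binary capitalization feature: 1.0 if the token contains an uppercase letter."""
    return 1.0 if any(ch.isupper() for ch in token) else 0.0


def build_input_vector(token, word_id, speaker_id, ner_id):
    """Concatenate word, speaker, NER, and capitalization features into one vector."""
    return (
        word_table[word_id]
        + speaker_table[speaker_id]
        + ner_table[ner_id]
        + [has_uppercase(token)]
    )


vec = build_input_vector("Monica", word_id=7, speaker_id=3, ner_id=1)
print(len(vec))  # 50 + 10 + 19 + 1 = 80
```

In a trained system these tables would be learned parameters rather than fixed random values.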
We use a bidirectional gated recurrent unit (BiGRU) (Cho et al., 2014) for the encoder. The hidden state of the encoder for the $i$-th word of the input sequence is defined as $h^{e}_i$.
The decoder of our model uses the GRU as follows.
The input of the decoder at step $t$ is the encoder hidden state $h^{e}_{m_t}$, where $m_t$ is the position of the $t$-th mention in the gold mention sequence. The current decoder hidden state $h^{d}_t$ receives the encoder hidden state corresponding to the output position of the previous decoder step and the previous decoder hidden state $h^{d}_{t-1}$.
At the attention layer of the decoder, we use the attention weight $\alpha_{t,i}$ to compute the alignment score between the gold mention input to the decoder and the encoder hidden state $h^{e}_i$. The attention layer acts as coreference resolution between each gold mention and the input sequence. After calculating the attention weights, we create the context vector $c_t$. We use soft attention and hard attention in Eq. (8). Soft attention, $c_t = \sum_i \alpha_{t,i} h^{e}_i$, is an attention-pooling vector over the whole input of the encoder ($h^{e}$). The other attention-pooling vector is hard attention, $c_t = h^{e}_{\hat{i}}$ with $\hat{i} = \arg\max_i \alpha_{t,i}$, which uses the argmax function in Eq. (7) over the attention weights to choose the highest-scoring position for the gold mention given as the decoder input.
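The difference between the two pooling strategies can be made concrete with a small sketch. It assumes a plain dot product as the alignment score, which is a simplification chosen for illustration and not necessarily the scoring function used in the paper; the function names and the toy vectors are likewise hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attention_contexts(decoder_state, encoder_states):
    """Return attention weights plus the soft and hard context vectors."""
    # Assumption: dot-product alignment score between decoder and encoder states.
    weights = softmax([dot(decoder_state, h) for h in encoder_states])
    dim = len(encoder_states[0])
    # Soft attention: weighted sum over all encoder hidden states.
    soft = [sum(w * h[k] for w, h in zip(weights, encoder_states)) for k in range(dim)]
    # Hard attention: copy the single encoder state with the largest weight.
    best = max(range(len(weights)), key=weights.__getitem__)
    hard = list(encoder_states[best])
    return weights, soft, hard


encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
decoder_state = [2.0, 0.0]
weights, soft, hard = attention_contexts(decoder_state, encoder_states)
print(weights, soft, hard)
```

With these toy values the first encoder position receives the largest weight, so hard attention returns that state unchanged, while soft attention blends all three states.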
After calculating the context vector between the input of the encoder and the input of the decoder, we calculate $s_t$ in the decoder hidden layer by concatenating the context vector $c_t$, the decoder hidden state $h^{d}_t$, and the encoder hidden state $h^{e}_{m_t}$ of the current mention. Next, the softmax function is applied to compute the scores from $s_t$, and the character index ($y_t$) for the CI task corresponding to the input of the decoder is obtained using the argmax function.
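The output step described above can be sketched as follows. The single linear layer, the weight values, and the function name `predict_character` are assumptions made for illustration; the paper's decoder hidden layer may include additional transformations that are not reproduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_character(context, dec_state, enc_state, weight_rows, bias):
    """Concatenate the three vectors, score each character, and take the argmax."""
    features = context + dec_state + enc_state
    # Assumption: one linear layer maps the concatenation to per-character scores.
    logits = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weight_rows, bias)
    ]
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs


# Toy example with three candidate characters and 2-dimensional states.
weight_rows = [
    [0.1, 0.1, 0.1, 0.1, 0.1, 0.1],
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.5, 0.5],
]
bias = [0.0, 0.0, 0.0]
label, probs = predict_character([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], weight_rows, bias)
print(label)  # 2
```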

Model 2: Attention-based Enc-Dec model with PE
The second model is based on the first model but uses PE, which applies a weight to the input sequence of an RNN according to word order. In coreference resolution, the antecedent mainly appears in the preceding context. In this paper, we apply PE as a feature of the encoder input sequence, using the weight determined by word order as the feature. As shown in Eq. (11), PE information is concatenated to Eq. (2) to produce $\tilde{x}_i$, and PE is calculated as shown in Eq. (12). In PE, $i$ is the index of the word, $n$ is the total length of the input sequence, $j$ is the position of the sentence, and $d$ is the number of dimensions of the word representation. The PE weight is a real value that gradually decreases from 1 to 0 and is applied to the input of the encoder to exploit the fact that the antecedent precedes the current mention. In Eq. (12), $(1 - 2i/n)$ reflects the word order: an earlier word has a higher value than a later word. $(j/d)$ is a weight based on the sentence order; across a sentence boundary, the weight decreases by more than it does between words within the same sentence. The expressions for the encoder and decoder are the same as for Model 1.
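The following sketch shows one way such a position weight could be computed and appended to each input vector. Eq. (12) itself is not reproduced in this text, so the code assumes that the word-order and sentence-order terms combine in the same way as in the position encoding of Sukhbaatar et al. (2015); the authors' exact formula may differ, and the function names and toy inputs are hypothetical.

```python
def position_weight(i, n, j, d):
    """Assumed PE weight for word index i (0-based) in a sequence of length n.

    j is the index of the sentence containing the word and d is the word
    representation size. The combination below follows the form used by
    Sukhbaatar et al. (2015) and is an assumption, not the paper's Eq. (12).
    """
    word_term = 1.0 - i / n
    return word_term - (j / d) * (1.0 - 2.0 * i / n)


def add_position_feature(vectors, sentence_ids, d):
    """Append the position weight to every input vector as one extra feature."""
    n = len(vectors)
    return [
        vec + [position_weight(i, n, sentence_ids[i], d)]
        for i, vec in enumerate(vectors)
    ]


# Four words split over two sentences, each with a 2-dimensional toy vector.
vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
sentence_ids = [0, 0, 1, 1]
augmented = add_position_feature(vectors, sentence_ids, d=50)
for row in augmented:
    print(row[-1])
```

Under this assumed form the first word receives a weight of 1 and later words receive smaller weights, which matches the qualitative behaviour described above.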

Model 3: Self-matching Network-based Enc-Dec model
The third model is also based on the first model, but it performs encoding with the self-matching network in the encoder and does not use PE, as shown in Figure 2. The self-matching network calculates alignment scores between a given RNN sequence and itself and then takes a weighted sum over the same sequence to create a context vector. When the self-matching network is used for encoding, similar words receive high alignment scores and therefore large attention weights. For example, if "Rachel's childhood best friend" and "Monica" appear in a sentence, the self-matching network calculates a high alignment score between them.
The input sequence of the encoder becomes $\tilde{x}_i$ in Eq. (2), and a feed-forward neural network is applied to it, as in Eq. (13). Next, we use the self-matching network to compute the attention weights for the resulting sequence and create a context vector that reflects the self-attention (Eqs. 14-16). The resulting concatenated vector is passed as input to the encoder hidden layer. The decoder of Model 3 performs training and prediction in the same way as the decoder of Model 1 (Eqs. 4-8), based on the hidden states encoded as above.
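A minimal sketch of the self-matching step is given below. It assumes dot-product alignment scores and concatenates each state with its own context vector; both choices, along with the function name `self_matching` and the toy states, are illustrative assumptions and do not reproduce Eqs. (14)-(16).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def self_matching(states):
    """Attend from every position to the whole sequence, including itself."""
    outputs, all_weights = [], []
    for h_t in states:
        # Assumption: dot-product alignment between h_t and every state h_j.
        weights = softmax([dot(h_t, h_j) for h_j in states])
        context = [
            sum(w * h_j[k] for w, h_j in zip(weights, states))
            for k in range(len(h_t))
        ]
        # Concatenate the state with its self-attention context vector.
        outputs.append(h_t + context)
        all_weights.append(weights)
    return outputs, all_weights


# The first two states are similar, the third is different.
states = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, all_weights = self_matching(states)
print(all_weights[0])
```

In this toy input the first position assigns more weight to the similar second state than to the dissimilar third one, mirroring the behaviour described for coreferent mentions.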

Model 4: Self-matching Network-based RNN Enc-Dec model with PE
Model 4 is based on Model 3, which uses the self-matching network; it additionally uses PE, as in Model 2, as a feature that captures word order.

Experimental Results
We evaluate the entity-linking performance of the models using label accuracy and macro-F1 (Chen et al., 2017), and the coreference resolution performance using CoNLL F1 (Rahman and Ng, 2009). The word representations used in this paper are trained with a neural network language model (Bengio et al., 2003; Lee et al., 2014) on a dataset provided by LDC and are set to 50 dimensions. The experiments were performed with cross-validation. The hyperparameters used in the experiments are as follows. We used tanh for the encoder and decoder and ReLU for the attention layer. The hidden layers had 150 dimensions, and the dropout of all layers was set to 0.3. Training used RMSprop (Hinton et al., 2012) with a learning rate that started at 0.1 and was reduced by 50% after every 5 epochs without performance improvement.
The decoder attention functions of the models used in the experiments are all based on hard attention and are compared with soft attention in Table 1. Table 1 compares the CI performance of the models on the trial set. M2' is a model in which the PE proposed by Vaswani et al. (2017) is applied, and M4' is a model in which soft attention is applied to M4. At the episode level, M3 showed the best Main F1 performance (86.30%) and M1 showed the best All F1 performance (23.33%). At the scene level, M4 showed the highest Main F1 performance (87.41%), and M4 showed the highest All F1 performance (23.92%). Comparing M2 and M2', we can see that the proposed PE method resulted in a better overall performance.
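The learning-rate schedule mentioned in the experimental setup can be read as a patience-based rule, sketched below. This reading (halve the rate once five consecutive epochs bring no improvement, then restart the count) is an interpretation of the description, and the function name `halve_on_plateau` and the toy development scores are hypothetical.

```python
def halve_on_plateau(dev_scores, initial_lr=0.1, patience=5, factor=0.5):
    """Return the learning rate in effect after each epoch.

    Assumption: the rate is multiplied by `factor` whenever `patience`
    consecutive epochs fail to improve on the best development score.
    """
    lr = initial_lr
    best = float("-inf")
    stale = 0
    history = []
    for score in dev_scores:
        if score > best:
            best = score
            stale = 0
        else:
            stale += 1
            if stale == patience:
                lr *= factor
                stale = 0
        history.append(lr)
    return history


# Two improving epochs, five epochs without improvement, then one more gain.
dev_scores = [0.5, 0.6, 0.6, 0.6, 0.6, 0.6, 0.6, 0.7]
history = halve_on_plateau(dev_scores)
print(history)
```

In this toy run the rate stays at 0.1 until the fifth consecutive epoch without improvement, after which it drops to 0.05.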
At the episode level, M4' showed a better Main F1 performance than M4 (by 1.83%), whereas M4 showed a better All F1 performance (by 2.67%). At the scene level, M4 showed a better Main F1 performance than M4' (by 1.18%), whereas M4' showed a better All F1 performance (by 1.86%). Thus, it can be seen that the use of hard attention results in a better performance. Table 2 presents the experimental results on the test set (episode- and scene-level) for the method proposed in this paper, together with a performance comparison against the competing systems AMORE UPF (Amore), Kampfpudding (Kamp.), and zuma. In the Main + Other character evaluation at the episode level, M2 showed the best performance among all models (F1 of 85.01%, Acc of 84.36%), whereas in the All character evaluation, M3 showed the best F1 performance (17.02%) and M2 showed the best Acc performance (68.42%). At the scene level, M2 showed the best performance in both the Main + Other and the All character evaluations. The proposed method showed a lower overall performance than the competing models in the All character evaluation but a higher performance in the Main + Other character evaluation. The reason for the lower performance in the All character evaluation is that the non-main characters have fewer data points than the main characters.
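For readers less familiar with the two entity-linking metrics, the sketch below computes label accuracy and a macro-averaged F1 over characters. It assumes that the macro average is taken over the characters that occur in the gold labels; the official scorer may handle edge cases differently, and the function names and toy labels are hypothetical.

```python
def label_accuracy(gold, pred):
    """Fraction of mentions whose predicted character matches the gold one."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Average of per-character F1 scores (assumption: over gold characters)."""
    scores = []
    for label in sorted(set(gold)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


gold = ["Ross", "Ross", "Monica", "Rachel"]
pred = ["Ross", "Monica", "Monica", "Ross"]
acc = label_accuracy(gold, pred)
f1 = macro_f1(gold, pred)
print(acc, f1)
```

Because every character contributes equally to the macro average, a character with few mentions that is never predicted correctly (here "Rachel") lowers macro-F1 much more than it lowers label accuracy.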

Conclusion
In this paper, we defined the entity-linking problem of SemEval-2018 Task 4 as a sequence-labeling problem and proposed four models to solve it. Experimental results showed that M2 achieved the best performance on the test set at the scene level (Main + Other characters), with an F1 of 86.00% and Acc of 85.10%. In the Main entities + Others evaluation of SemEval-2018 Task 4, it ranked 1st with an F1 of 83.37% and Acc of 82.13%. In All Entities + Others, it ranked 2nd with an F1 of 13.53% and Acc of 68.55%.
In future work, we will apply a character CNN to address the unknown-word problem, and we will add word representations such as GloVe (Pennington et al., 2014) and ELMo (Peters et al., 2018). We will also try to improve performance when less data is available by adding the features used in the Task 4-based model.