Semi-supervised Learning for Vietnamese Named Entity Recognition using Online Conditional Random Fields

We present preliminary results for the named entity recognition problem in the Vietnamese language. For this task, we build a system based on conditional random fields and address one of its challenges: how to combine labeled and unlabeled data to create a stronger system. We propose a set of features that is useful for the task and conduct experiments with different settings to show that using bootstrapping with an online learning algorithm called the Margin Infused Relaxed Algorithm increases the performance of the models.


Introduction
Named Entity Recognition (NER) is an important problem in natural language processing and has been investigated for many years (Tjong Kim Sang and De Meulder, 2003). There has been a lot of work on this task, especially for major languages such as English and Chinese (McCallum and Li, 2003; Gao et al., 2005; Ritter et al., 2011). For the Vietnamese language, several authors have attempted to tackle the NER problem using both supervised and semi-supervised methods (Tran et al., 2007; Nguyen et al., 2010; Pham et al., 2012; Le Trung et al., 2014). However, previous work on NER for the Vietnamese language mainly used offline supervised learning methods, where all the training data are gathered before a model is trained.
In this paper, we report preliminary results for a Vietnamese NER system trained using conditional random fields (CRFs) (Lafferty et al., 2001). Unlike previous work on NER for the Vietnamese language, we use an online learning algorithm, the Margin Infused Relaxed Algorithm (MIRA) (Crammer and Singer, 2003), to train the CRFs. Furthermore, because the amount of labeled data is small while the amount of unlabeled data is very large, we treat this problem under the semi-supervised learning framework. In particular, we use the bootstrapping method on top of the CRF models to gradually increase the amount of labeled data. With bootstrapping, a small number of newly labeled training examples become available after each round and can then be used to update the CRF model.
We demonstrate that using MIRA to learn CRFs instead of the traditional offline method increases the performance of our system. We also propose a set of features that is useful for this task and gives competitive performance. In contrast to previous works such as Tran et al. (2007), we do not use features from outside sources, e.g., gazetteer features; our feature set therefore does not require human effort to create such resources and is easy to build.
The rest of this paper is organized as follows. In Section 2, we review some previous works for the NER task, especially for the Vietnamese language. A brief introduction to CRF and MIRA is given in Section 3. This will be followed by a description of our feature set in Section 4. In Section 5, we describe our semi-supervised learning approach for the Vietnamese NER problem. We show our experimental setup and results in Section 6. In Section 7, we give some discussions about the problem. Finally, we conclude our paper and discuss some future works in Section 8.

Related Works
NER is an important problem that was first introduced at the Sixth Message Understanding Conference (MUC-6) (Grishman and Sundheim, 1996) and has since attracted many researchers to investigate it with new methods as well as in different languages. Over the years, researchers have tried to solve the problem under supervised learning (McCallum and Li, 2003), semi-supervised learning (Ji and Grishman, 2006), and unsupervised learning (Etzioni et al., 2005) frameworks. One dominant approach for NER is supervised learning with conditional random fields (McCallum and Li, 2003). However, semi-supervised learning approaches are also attractive for this task because it is expensive to obtain a large amount of labeled data. Notably, Riloff et al. (1999) introduced the mutual bootstrapping method, which proved to be highly influential. Also using bootstrapping, Ji and Grishman (2006) were able to improve the performance of existing NER systems.
For the Vietnamese language, using supervised learning, an earlier work built an NER system with CRFs and reported an F1 score of 87.90% as its highest performance. Using SVMs, Tran et al. (2007) achieved an F1 score of 87.75% for the task. For semi-supervised learning, Pham et al. (2012) achieved an F1 score of 90.14% using CRFs with the generalized expectation criteria (Mann and McCallum, 2010), while Le Trung et al. (2014) reported an accuracy of 95% for their system, which uses bootstrapping and rule-based models.

Conditional Random Fields
Linear-chain conditional random field (CRF) is a sequence labeling model first introduced by Lafferty et al. (2001). This model allows us to define a rich set of features to capture complex dependencies between a structured observation x and its corresponding structured label y. Throughout this paper, we will use the term CRF to refer to linear-chain CRF, a widely used type of CRFs in which x and y have linear-chain structures. Formally, let x = (x_1, x_2, ..., x_T) be the input sequence, y = (y_1, y_2, ..., y_T) be the label sequence, F = {f_k(y_t, y_{t-1}, x)}_{k=1}^{K} be a set of real-valued functions (features) over two consecutive labels y_t, y_{t-1} and the input sequence x, and Λ = {λ_k}_{k=1}^{K} be the set of parameters associated with the features that we want to learn. A linear-chain CRF defines the conditional distribution p(y|x) as:

p(y|x) = (1/Z(x)) exp( Σ_{t=1}^{T} Σ_{k=1}^{K} λ_k f_k(y_t, y_{t-1}, x) ),

where

Z(x) = Σ_{y'} exp( Σ_{t=1}^{T} Σ_{k=1}^{K} λ_k f_k(y'_t, y'_{t-1}, x) )

is the normalization constant, also called the partition function.
Normally, training a CRF is an iterative process where all the parameters are updated after each iteration to maximize the conditional log-likelihood of the training data. During testing, the label sequence for a new test instance is determined by a Viterbi-like algorithm, which returns the label sequence with the highest probability according to the trained model (Sutton and McCallum, 2006).
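The Viterbi-style decoding step mentioned above can be sketched as follows. This is a generic Viterbi recursion, not the exact implementation inside any particular CRF toolkit; the score tables are hypothetical inputs, assumed to be precomputed per position from the features and the learned weights λ_k:

```python
def viterbi(emission, transition):
    """Find the highest-scoring label sequence under a linear-chain model.

    emission[t][y]   : score of label y at position t, i.e. the sum of
                       lambda_k * f_k over features firing at (t, y)
    transition[p][y] : score of moving from label p to label y
    """
    T, Y = len(emission), len(emission[0])
    delta = [list(emission[0])]   # delta[t][y]: best score of a path ending in y
    back = []                     # back-pointers for path recovery
    for t in range(1, T):
        row, ptrs = [], []
        for y in range(Y):
            prev = max(range(Y), key=lambda p: delta[t - 1][p] + transition[p][y])
            row.append(delta[t - 1][prev] + transition[prev][y] + emission[t][y])
            ptrs.append(prev)
        delta.append(row)
        back.append(ptrs)
    # trace the best path backwards from the last position
    best = max(range(Y), key=lambda y: delta[-1][y])
    path = [best]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    path.reverse()
    return path
```

Because scores are sums of log-potentials, the returned sequence is exactly the argmax of p(y|x) without computing the partition function.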

Margin Infused Relaxed Algorithm
MIRA is an online learning algorithm developed by Crammer and Singer (2003). At each round, the model receives a training example, makes a prediction on it, and suffers a loss. The algorithm then updates the weight vector so that the norm of the change to the weight vector is as small as possible, while keeping a margin between the correct and the incorrect prediction at least as large as the loss of that incorrect prediction.
Details of the single-best MIRA (Crammer, 2004; McDonald et al., 2005) for the sequence labeling task are given in Algorithm 1. In the update step at line 4 of the algorithm, s(x, y) is a scoring function and L(y, y') is a loss function. The difference between MIRA and offline training for CRFs is that MIRA processes one example at a time, while the offline algorithm processes all the data at each iteration. However, the features and the prediction algorithm are identical regardless of the learning algorithm.
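The single-best update at line 4 has a closed-form solution, sketched below under the usual assumptions that s(x, y) = w · φ(x, y) with sparse feature vectors stored as dicts, and that L(y, y') is, for example, the Hamming distance between the two label sequences. The dict-based representation and the optional clip constant C are our illustrative choices, not details fixed by the paper:

```python
def mira_update(w, phi_gold, phi_pred, loss, C=float("inf")):
    """Single-best MIRA step: the smallest change to w (in Euclidean norm)
    that makes the gold sequence outscore the predicted one by at least `loss`.

    w, phi_gold, phi_pred : dicts mapping feature name -> value
    loss                  : L(y, y'), e.g. Hamming distance of the labelings
    """
    # difference of feature vectors: phi(x, y) - phi(x, y')
    diff = {}
    for f in set(phi_gold) | set(phi_pred):
        d = phi_gold.get(f, 0.0) - phi_pred.get(f, 0.0)
        if d:
            diff[f] = d
    norm_sq = sum(d * d for d in diff.values())
    if norm_sq == 0.0:
        return w                      # prediction equals gold: nothing to do
    # current margin s(x, y) - s(x, y')
    margin = sum(w.get(f, 0.0) * d for f, d in diff.items())
    # step size from the constrained quadratic program, clipped at C
    tau = min(C, max(0.0, (loss - margin) / norm_sq))
    for f, d in diff.items():
        w[f] = w.get(f, 0.0) + tau * d
    return w
```

If the gold sequence already beats the prediction by the required margin, tau is zero and the weights are left untouched, which is what makes the algorithm "relaxed".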

Features for CRFs
We model NER as a sequence labeling task where each word in a sentence is associated with a tag to indicate which type of named entities it belongs to. There are 5 possible tags that we are interested in: person, organization, location, miscellaneous (proper names), and none. The none tag indicates that the corresponding word is not a part of any named entity. For instance, it may be a verb or an adjective.
Algorithm 1: MIRA for Sequence Labeling in CRFs.

We build a set of features that is useful for the Vietnamese NER task. Recall that a feature is a function over the observation x and two consecutive labels y_t and y_{t-1}. In this paper, we use as features the binary functions that can be fully defined based on the observation sequence.
In particular, the first type of features we use is the identity of the words in a window of size 5 and their combinations. Information about capitalization also plays a notably important role for this task. For example, a person's name always has its first letter capitalized, and an abbreviation of a company's name or a place is fully capitalized (e.g., Ho Chi Minh City is abbreviated as HCM). Thus, we add orthographic features to feed this information into the CRFs. These features describe whether a word is in lower case, whether its first letter is capitalized, whether all of its letters are capitalized, and whether it contains digits. For this type of features, we also take the orthographic information of the words in a window of size 5. Finally, we include as features the part-of-speech of the word and the combination of the word's identity and its part-of-speech to better describe the context of the sentence.
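The feature templates above can be sketched as a small extractor. The feature-name strings and the exact window offsets are illustrative, not the authors' exact templates:

```python
def orthographic(word):
    """Orthographic shape of a single word: case and digit information."""
    return {
        "lower": word.islower(),
        "init_cap": word[:1].isupper(),
        "all_caps": word.isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
    }

def features(words, tags, t):
    """Binary features for position t: word identities and orthographic
    shapes in a window of size 5, plus POS and word/POS combinations.
    `tags` holds the part-of-speech tags of `words`."""
    feats = {}
    for off in range(-2, 3):                       # window of size 5
        i = t + off
        if 0 <= i < len(words):
            feats["W[%d]=%s" % (off, words[i])] = 1
            for name, on in orthographic(words[i]).items():
                if on:
                    feats["O[%d]=%s" % (off, name)] = 1
    feats["P=%s" % tags[t]] = 1                    # POS of the current word
    feats["W+P=%s|%s" % (words[t], tags[t])] = 1   # word/POS combination
    return feats
```

Each firing feature name would be bound to one weight λ_k in the CRF; toolkits such as CRF++ generate these expansions from template files rather than code.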
We note that not all of the features described above are used, since some features are redundant and do not increase the performance. Therefore, we conduct a feature selection step to choose which features to use in later experiments. We start with the current word's identity and orthographic features. Then, we add several features, build an appropriate model, and measure its performance on a validation set, which contains 150 sentences extracted from the total training data. If the performance increases, we keep those features; otherwise, they are discarded. The process of adding and discarding features is repeated until no further improvement is observed on the validation set.
In Table 1, we give the final set of our features. This set includes two groups: single and complex features. The first group contains features about word identity (W), part-of-speech (P), and orthographic information (O). Complex features are formed by combining the single features. In Table 1, possible word identity features such as W_{-1,1} and W_{-2,2} are not listed because they were eliminated during the feature selection step.

Bootstrapping with CRFs
One main difficulty of the Vietnamese NER task is the lack of labeled data. Since texts from news, books, etc. naturally do not come with named entity labels, we have to label the data set manually. This is tedious and time-consuming when the data size becomes very large. One way to address this problem is to gradually create more labeled data from just a small amount of labeled data via semi-supervised learning.
More specifically, we use the bootstrapping method in this paper. First, we build a model on a labeled corpus and use it to label data from an unlabeled data set. After that, we select some newly labeled instances (sentences in our case), remove them from the unlabeled data set, and add them to the labeled data set. The criteria for choosing instances may vary and depend on the task. Then, the next model is trained on the new labeled set and in turn receives a further amount of newly labeled data from the unlabeled data set. This process is repeated until we are satisfied with the amount of labeled data that has been obtained.
We provide our CRF training procedure with bootstrapping in Algorithm 2. The criterion for choosing sentences from the unlabeled data set is to pick those whose best label sequence receives the highest probability under the current model.

Algorithm 2: Bootstrapping with CRFs
INPUT: labeled data set L, unlabeled data set U, number of iterations n, number of sentences per round k.
1: for i = 0 to n do
2:     Train CRF model M_i on data set L.
3:     Use M_i to label U.
4:     Choose the k labeled sentences X = {x_j}_{j=1}^{k} with the highest confidence from U.
5:     L ← L ∪ X;  U ← U \ X.
6: end for
7: return M_n
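The loop in Algorithm 2 can be sketched as follows. The `train` and `decode` callables are placeholders for a real CRF toolkit, not functions the paper defines: `train(labeled)` builds a model M_i, and `decode(model, sent)` returns the model's best label sequence for a sentence together with its probability p(y|x):

```python
def bootstrap(labeled, unlabeled, n, k, train, decode):
    """Sketch of bootstrapping with CRFs (Algorithm 2)."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    model = train(labeled)                                  # M_0
    for _ in range(n):
        pairs = [decode(model, s) for s in unlabeled]       # label U with M_i
        order = sorted(range(len(pairs)),
                       key=lambda i: pairs[i][1], reverse=True)
        top = set(order[:k])                                # k most confident
        labeled.extend(pairs[i][0] for i in top)            # L <- L ∪ X
        unlabeled = [s for i, s in enumerate(unlabeled) if i not in top]
        model = train(labeled)                              # M_{i+1} on larger L
    return model
```

Note that the k selected sentences carry the labels the previous model assigned, so an early mistake is inherited by every later model, which matches the error propagation discussed in the results.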

Setup
We build a corpus of 1,911 sentences from law news articles and manually tag their named entity labels. To build the unlabeled data set, we collect another 17,500 sentences, which also come from law articles. Both data sets are collected from online newspaper articles. The labeled data set is annotated using the IOB label format (Tjong Kim Sang and De Meulder, 2003) with the 5 labels mentioned in Section 4.
For the bootstrapping experiments, we split our corpus into two parts: the first part contains a fixed set of 411 sentences for testing, and the second part contains 1,500 sentences for training. We train 3 initial models using 500, 1,000, and 1,500 sentences respectively from the second part and apply the bootstrapping algorithm to each trained model, with the maximum number of iterations n being 15. In each iteration, the model selects the top k = 10 highest-confidence (i.e., highest p(y|x, Λ)) sentences to add to its training set. Finally, we compare the results of these models after 5, 10, and 15 rounds of bootstrapping with the initial models. To evaluate the performance of the models, we use the micro-averaged precision (P), recall (R), and F1 score (F).
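The micro-averaged measures can be computed as below; this is a minimal sketch assuming the entity mentions have already been decoded from the IOB tags into (start, end, type) spans:

```python
def micro_prf(gold, pred):
    """Micro-averaged precision, recall, and F1 over entity mentions.

    `gold` and `pred` are lists (one per sentence) of sets of entity
    spans, e.g. {(start, end, "person"), ...} decoded from IOB tags.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)          # correctly predicted entities
        fp += len(p - g)          # spurious predictions
        fn += len(g - p)          # missed entities
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```

Micro-averaging pools the counts over all entity types before computing the ratios, so frequent types weigh more than rare ones.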
In our experiments, we use the CRF++ toolkit, which comes with a MIRA training option, to build our models. For Vietnamese word segmentation and part-of-speech tagging, we use a standard tool for Vietnamese language processing provided by Nguyen et al. (2005).

Results
In Table 2, we report the highest F1 score (in %) of the models for every 5 rounds of bootstrapping. For all the initial training sizes, the best CRF trained using MIRA outperforms the best conventionally trained CRF in the semi-supervised learning scenario. With 1,000 initial training sentences, we achieve the highest increase in F1 score (2.43%) after 5 rounds of bootstrapping with MIRA compared to not using bootstrapping. Our highest performance is 89.16%, obtained by training with 1,500 initial sentences after 15 rounds of bootstrapping with MIRA.
It is interesting to note that the performance does not always increase after every round. Our error analysis shows that whenever a model makes a mistake in one round, that mistake propagates to all the following models and makes them less accurate. This leads to a decrease in F1 score for the later models on the fixed test set.

Discussions
When inspecting the best model in Table 2 (the CRF model using MIRA with 1,500 training sentences and 10 rounds of bootstrapping), we find several cases that may be difficult for the model to predict. In the examples below, every two consecutive words are separated by a white space, the syllables in each word are connected by underscores, and the bold phrases consist of one word and its wrongly predicted label. All words having the none label or having been correctly classified are neither in bold nor followed by any label.
For the Vietnamese language, we find that the model may easily confuse a person name with a location name and vice versa. For instance, the model may mistake a person name for a location name as in the following sentence: Họ nói rằng lượng hàng hoá họ nhận được có nguồn từ Trần_Thế_Luân/location.
(They said that all the goods they received originated from Tran_The_Luan.) Here, the word "Trần_Thế_Luân" refers to a person name rather than a location name as predicted above. In this case, the confusion may be caused by the similar sentence structures used for person names and location names. For example, we can replace the word "Trần_Thế_Luân" in the sentence above with a location name and the sentence is still correct. Furthermore, in Vietnamese, many locations are named after people, which makes it even more difficult to distinguish these two labels.
Another source of mistakes is the confusion between an organization name and a person name. For example, the following sentence was added during bootstrapping: Trong_khi_đó, ACB/none đang dư tiền nên đã chuyển cho Vietbank/person và Kienlongbank/person.
(In the meantime, ACB has a lot of extra money, so they transfer some to both Vietbank and Kienlongbank.) In this example, the model could not recognize "ACB" as an organization name, and it also misclassified "Vietbank" and "Kienlongbank" as person names (ACB, Vietbank, and Kienlongbank are in fact three major banks in Vietnam). This is a difficult case, since the English word "bank" is concatenated with the words "Viet" and "Kienlong", which makes these words harder to classify without an external dictionary. Moreover, the sentence structure also cannot help to distinguish the two labels in this case, because we can replace the three words "ACB", "Vietbank", and "Kienlongbank" with three person names and the sentence is still correct.

Conclusions and Future Works
We have presented preliminary results for a Vietnamese NER system trained using CRFs with MIRA and bootstrapping. We also proposed a set of useful features, which are easy to compute and do not require human effort to process unlabeled data. Our experiments showed that combining CRFs trained by MIRA with bootstrapping increases our system's performance.
For future work, we will focus on how to choose more meaningful sentences from the unlabeled data set and how to enhance the bootstrapping algorithm for the NER task. Since there are many algorithms with which to build our model, investigating how to combine these models in the semi-supervised learning framework to achieve better results is also a promising direction.