Temporal Information Extraction from Korean Texts

As documents tend to contain temporal information, the extraction of such information has recently attracted much research interest. In this paper, we propose a hybrid method that combines machine-learning models and hand-crafted rules for the task of extracting temporal information from unstructured Korean texts. We address Korean-specific research issues and propose a new probabilistic model to generate complementary features. The performance of our approach is demonstrated by experiments on the TempEval-2 dataset and on the Korean TimeBank dataset, which we built for this study.


Introduction
Due to the increasing number of unstructured documents available on the Web and from other sources, developing techniques that automatically extract knowledge from documents has become of paramount importance. Among the many aspects of extracting knowledge from documents, the extraction of temporal information has recently drawn much attention, since documents usually incorporate temporal information that is useful for further applications such as Information Retrieval (IR) and Question Answering (QA) systems. Given a simple question such as "who was the president of the U.S. 8 years ago?", a QA system may have difficulty finding the right answer without correct temporal information about when the question is posed and what '8 years ago' refers to.
There have been many studies for temporal information extraction, but most of them are applicable only to their target languages. The main reason for this limitation is that some parts of temporal information are difficult to predict without the use of language-specific processing. For example, the normalized value '1983-03-08' can be represented by 'March 8, 1983' in English, while it can be represented by '1983년삼월8일' in Korean. The order of date representation in Korean is usually different from that of English, and the digit expression in Korean is more complex than that of English. This implies that it is necessary to investigate language-specific difficulties for developing a method to extract temporal information.
In this paper, we propose a method for temporal information extraction from Korean texts. The contributions of this paper are as follows: we (1) show how the Korean-specific issues (e.g., morpheme-level tagging, various ways of digit expression, uses of the lunar calendar, and so on) are addressed, (2) propose a hybrid method that combines a set of hand-crafted rules and machine-learning models, (3) propose a data-driven probabilistic model to generate complementary features, and (4) create a new dataset, the Korean TimeBank, that consists of more than 3,700 manually annotated sentences.
The rest of the paper is organized as follows. Section 2 describes the background of the research. Section 3 presents the details of the proposed method, the Korean TimeBank dataset, and how we apply the probabilistic model for generating features. Section 4 shows experimental results, and Section 5 concludes the paper.
The shared tasks can be summarized into three extraction tasks: (1) extraction of timex3 tags, (2) extraction of event and makeinstance tags, and (3) extraction of tlink tags. The timex3 tag is associated with expressions of temporal information such as 'May 1973' and 'today'. The event tag and makeinstance tag represent some eventual expressions which can be related to temporal information. The makeinstance tag is an instance of the event tag. For example, the sentence, "I go to school on Mondays and Tuesdays", contains one event tag on 'go' and two makeinstance tags as the action 'go' occurs twice on Mondays and Tuesdays. The tlink tag represents a linkage between two tags. The tlink can be a linkage between two timex3 tags (TT tlink), two makeinstance tags (MM tlink), a timex3 tag and a makeinstance tag (TM tlink), or Document Creation Time and a makeinstance tag (DM tlink). Note that the tlink takes makeinstance tags as arguments, but not the event tags, as the event tags are merely templates for them. For the above sentence, there will be two TM tlinks: go-Tuesdays and go-Mondays. The TT tlink is assumed to be easy to extract, so TempEval does not incorporate the TT tlink into the task of extracting tlink tags.
Among the many related studies, several stand out. HeidelTime was proposed for the extraction of timex3 tags (Strötgen and Gertz, 2010). It depends strongly on hand-crafted rules, and showed the best performance in TempEval-2. Llorens et al. (2010) proposed TIPSem for all three extraction tasks. It employs Conditional Random Fields (CRF) (Lafferty et al., 2001) for capturing patterns of texts, and defines a set of hand-crafted rules for determining several attributes of the tags. ClearTK is another work proposed for all three extraction tasks (Bethard, 2013a); it utilizes machine-learning models such as Support Vector Machines (SVM) (Boser et al., 1992; Cortes and Vapnik, 1995) and Logistic Regression (LR), and showed the best performance in TempEval-3.
Although the existing approaches show good results, most of them are applicable only to their target languages. The first reason is that there are several attributes which are difficult to predict without language-specific processing. For instance, the attribute value of the timex3 tag holds a normalized form of time (e.g., 1999-04-12) following ISO-8601. This is not accurately predictable by relying solely on data-driven approaches. The second reason is that they depend on language-specific resources (e.g., WordNet) or tools. Unless resources or tools of the same quality exist for other languages, the existing works cannot be applied to those languages. To alleviate this limitation, a language-independent parser for extracting timex3 tags has been proposed (Angeli and Uszkoreit, 2013). Its portability is demonstrated by experiments with the TempEval-2 datasets of six languages: English, Spanish, Italian, Chinese, Korean, and French. However, its performance on the English and Spanish datasets is about twice as high as on the other languages, since the method depends heavily on the feature definition and language-specific preprocessing (e.g., morphological analysis). This implies that it is necessary to address language-specific difficulties in order to achieve high performance.
The Korean language has many subtle rules and exceptions on word spacing. Korean is an agglutinative language, where verbs are formed by attaching various endings to the stem. There are usually multiple morphemes in each token, and empty elements appear very often because subjects and objects are dropped to avoid duplication. Temporal expressions often follow the lunar calendar as a matter of tradition. Moreover, the same temporal information can take various forms due to a complex system of digit expression. For instance, the number 30 can be represented as '30', '삼십[sam-sib]', or '서른[seo-run]'. Most of these issues stem from the Chinese language, as a large number of Chinese words and letters have become an integral part of Korean vocabulary due to historical contact. All these issues hinder the performance of the existing approaches when applied to Korean documents.
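The digit-expression issue above can be made concrete with a small sketch. It is not the rule set used in this paper; it only illustrates, under simplifying assumptions, how the three surface forms of 30 could be mapped to the same integer. The lookup tables and the function names (parse_sino, parse_native, normalize_number) are hypothetical, and the sketch only covers values below 10,000.

```python
# Hypothetical minimal tables; a real system would need many more forms.
SINO_DIGITS = {"일": 1, "이": 2, "삼": 3, "사": 4, "오": 5,
               "육": 6, "칠": 7, "팔": 8, "구": 9}
SINO_UNITS = {"십": 10, "백": 100, "천": 1000}
NATIVE_TENS = {"열": 10, "스물": 20, "서른": 30, "마흔": 40, "쉰": 50,
               "예순": 60, "일흔": 70, "여든": 80, "아흔": 90}
NATIVE_ONES = {"하나": 1, "둘": 2, "셋": 3, "넷": 4, "다섯": 5,
               "여섯": 6, "일곱": 7, "여덟": 8, "아홉": 9}


def parse_sino(text):
    """Parse a Sino-Korean numeral such as '삼십' (30); None if not parseable."""
    total, digit = 0, 0
    for ch in text:
        if ch in SINO_DIGITS:
            digit = SINO_DIGITS[ch]
        elif ch in SINO_UNITS:
            # A unit without a preceding digit means 1 x unit (e.g., '십' = 10).
            total += (digit or 1) * SINO_UNITS[ch]
            digit = 0
        else:
            return None
    return total + digit


def parse_native(text):
    """Parse a native Korean numeral such as '서른' (30); None if not parseable."""
    for tens_word, tens in NATIVE_TENS.items():
        if text.startswith(tens_word):
            rest = text[len(tens_word):]
            if rest == "":
                return tens
            if rest in NATIVE_ONES:
                return tens + NATIVE_ONES[rest]
    return NATIVE_ONES.get(text)


def normalize_number(text):
    """Map Arabic, native Korean, or Sino-Korean numerals to an int."""
    text = text.strip()
    if not text:
        return None
    if text.isascii() and text.isdigit():
        return int(text)
    value = parse_native(text)
    if value is None:
        value = parse_sino(text)
    return value


for form in ["30", "삼십", "서른"]:
    print(form, normalize_number(form))
```

All three forms normalize to the same integer, which is the property a value rule needs before it can compose an ISO-8601 string.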
In this paper, we show how these issues are addressed in our Korean-specific hybrid method. To the best of our knowledge, this is the first Korean-specific study which addresses all three extraction tasks.

Korean TimeBank
Although there is a Korean dataset provided by TempEval-2, we chose not to use it because it is small in size and has many annotation errors. In TempEval-2 Korean dataset, there are missing values of timex3 tags, and multiple tags that must be
The new dataset is based on TimeML but with several differences. The tags of the new dataset are represented using a stand-off scheme, keeping the original sentences unchanged. As there are often multiple morphemes within each token in Korean, the tags are annotated at the letter level. The letter-level annotation allows multiple annotations to appear within a single token. This also makes the dataset independent of morphological analysis, so the dataset does not need to be updated when the morphological analyzer is updated. To enable the letter-level annotation, we introduce several attributes for the timex3 tag and the event tag: text, begin, end, e_begin, and e_end. The attributes e_begin and e_end indicate token indices, while begin and end indicate letter indices of the extent. The attribute text contains the string of the extent. For example, the sentence "I work today" in Fig. 1 contains one timex3 tag whose text is 'today', where e_begin=2, e_end=2, begin=0, and end=4.
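The following sketch shows how the extent string could be recovered from the four index attributes. It assumes, based only on the "I work today" example, that token indices are 0-based, that begin is a letter index inside the token e_begin, and that end is an inclusive letter index inside the token e_end; the function name extent_text and whitespace tokenization are illustrative assumptions rather than part of the dataset specification.

```python
def extent_text(sentence, e_begin, e_end, begin, end):
    """Recover the annotated string from token and letter indices.

    Assumes 0-based token indices, with `begin` relative to token
    `e_begin` and an inclusive `end` relative to token `e_end`.
    """
    tokens = sentence.split(" ")
    if e_begin == e_end:
        return tokens[e_begin][begin:end + 1]
    parts = [tokens[e_begin][begin:]]          # tail of the first token
    parts.extend(tokens[e_begin + 1:e_end])    # whole tokens in between
    parts.append(tokens[e_end][:end + 1])      # head of the last token
    return " ".join(parts)


# The example from the text.
print(extent_text("I work today", 2, 2, 0, 4))

# A Korean token with two morphemes: '오늘' (today) + '부터' (from).
# Only the first two letters of the token are annotated.
print(extent_text("나는 오늘부터 일한다", 1, 1, 0, 1))
```

The second call illustrates why letter-level indices are useful: the extent covers only part of a token, so no morphological analysis is needed to locate it.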
Since temporal expressions following the lunar calendar appear often in Korean, we add an attribute calendar. The value of calendar can be LUNAR or another type of calendar, and its default value is GREGORIAN when it is not explicitly specified. We also add two values for the attribute mod of the timex3 tag: START_OR_MID and MID_OR_END, as these expressions appear often in Korean. For example, '초중반[cho-joong-ban]' represents the beginning or middle phase of a period, and '중후반[joong-hoo-ban]' represents the middle or ending phase of a period.
The source of the Korean TimeBank includes Wikipedia documents and hundreds of manually generated question-answer pairs. The domains of the Wikipedia documents are personage, music, university, and history. The documents are annotated by two trained annotators and a supervisor, all majoring in computer science. The annotated tags of each document are saved in an XML file.

Temporal Information Extraction from Korean Texts
Our proposed method addresses all three tasks: (1) extraction of timex3 tags, (2) extraction of event and makeinstance tags, and (3) extraction of tlink tags. The proposed method also extracts additional attributes of the timex3 tag, such as freq, beginPoint, endPoint, mod, and calendar. The overall process is depicted in Fig. 2, where the solid line represents the training process and the dotted line represents the testing process. The Korean analyzer at the center of the figure takes Korean texts as input and generates several raw features as output, such as results of morphological analysis, Part-Of-Speech (POS) tags, Named-Entity (NE) tags, and results of dependency parsing (Lim et al., 2006). The number of possible POS tags is 45, following the definition of the Sejong Treebank. The number of possible NE tags is 178, where each of them belongs to one of 15 super NE tags.
The generated raw features are used to define a set of features for machine-learning models and a set of hand-crafted rules. The rules are designed by examining the training dataset and the errors that the proposed method makes on the validation dataset. We employ several machine-learning methods that have shown the best performance in the TempEval shared tasks, such as the Maximum Entropy Model (MEM), Support Vector Machine (SVM), Conditional Random Fields (CRF), and Logistic Regression (LR). Fig. 2 introduces the Temporal Information Extractor (TIE), which consists of four sub-extractors: the timex3 extractor, the event extractor, the makeinstance extractor, and the tlink extractor. The timex3 extractor and the event extractor work independently, and the makeinstance extractor uses the predicted event tags. The tlink extractor makes use of the predicted makeinstance tags and the predicted timex3 tags. Thus, the performance of the timex3 extractor and the event extractor strongly influences the performance of the makeinstance extractor and the tlink extractor. These four sub-extractors as a whole output predicted tags, which are represented at the morpheme level. The extent converter at the center of the figure converts morpheme-level tags into letter-level tags and vice versa by checking the ASCII values of each letter and each morpheme. In the training process, the annotated tags of the Korean TimeBank are converted to the morpheme level through the extent converter, and used to train the TIE.

Timex3 extractor
The goal of the timex3 extractor is to predict whether each morpheme belongs to the extent of a timex3 tag, and to find the appropriate attributes of the tag. There are five types of timex3 tag: DATE, TIME, DURATION, SET, and NONE. NONE indicates that the corresponding morpheme does not belong to the extent of a timex3 tag, and the other four types follow the definitions of TimeML. This is essentially a morpheme-level classification over 5 classes.
We take two approaches: a set of 100 hand-crafted rules and machine-learning models. Examples of the rules for extent and type are listed in Table 1. In the second rule of the table, the first condition is satisfied when the sequence of two morphemes is a digit followed by a morpheme '월 The various ways of digit expressions are also considered. The second condition is satisfied when the morpheme next to the extent is '에[eh]'(at) or '마다[ma-da]'(every) followed by '번[beon]'(times) or '회[hoi]'(times), and there must be no other tags between the two morphemes. If these two conditions are satisfied, then the sequence of morphemes becomes the extent of a timex3 tag whose type is SET. If one of the rules is satisfied, then the remaining rules are skipped for the target morpheme.
We compare two machine-learning models, CRF and MEM, for timex3 tag by experiments. We defined a set of features based on the raw features   (2000).
To predict the other attributes of each predicted timex3 tag, we also define sets of rules: 112 rules for value, 7 rules for beginPoint/endPoint, 9 rules for freq, 10 rules for mod, and 1 rule for calendar. In particular, the rules for value and freq take a temporal context into account. For instance, in the sentence "We go there tomorrow", it is hard to predict the value of 'tomorrow' without considering the temporal context. We assume that the temporal context of each sentence depends on the previous sentence. For each document, the temporal context is initialized with the Document Creation Time (DCT), and the context is updated when a normalized value appears under certain conditions.
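A minimal sketch of the temporal-context idea is given below. It is an illustration under assumptions, not the rule set of this paper: the class name TemporalContext, the table RELATIVE_OFFSETS, and the restriction to day-level relative expressions are all hypothetical, and the condition under which the context is updated is left to the caller.

```python
from datetime import date, timedelta

# Hypothetical table of day-level relative expressions.
RELATIVE_OFFSETS = {
    "어제": -1,  # yesterday
    "오늘": 0,   # today
    "내일": 1,   # tomorrow
}


class TemporalContext:
    """Tracks the reference date used to resolve relative expressions."""

    def __init__(self, dct):
        # The context starts at the Document Creation Time (DCT).
        self.current = dct

    def resolve(self, expression):
        """Return an ISO-8601 value for a relative expression, or None."""
        offset = RELATIVE_OFFSETS.get(expression)
        if offset is None:
            return None
        return (self.current + timedelta(days=offset)).isoformat()

    def update(self, normalized_value):
        """Move the context to a newly normalized full date (YYYY-MM-DD)."""
        self.current = date.fromisoformat(normalized_value)


context = TemporalContext(date(1999, 4, 12))
print(context.resolve("내일"))   # resolved against the DCT
context.update("1983-03-08")     # an explicit date appeared in the text
print(context.resolve("내일"))   # resolved against the updated context
```

The same expression yields different values before and after the update, which is why value cannot be predicted from the extent alone.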
Examples of the rules are described in Table 2. If the first rule in the table is satisfied, then the month of value is changed to the corresponding digit and the temporal context is updated.
As there can be multiple clues for determining value within an extent, all of the rules are checked for each timex3 tag. To prevent value from being overwritten incorrectly when multiple rules are satisfied, the rules are listed in ascending order of temporal unit. That is, the rules for seconds or minutes are listed before the rules for hours or days. This allows value to be filled in from smaller temporal units to bigger units, so that a later rule does not overwrite a correct value with a wrong one. The rules for different attributes are listed in separate files, and are written in a systematic way similar to regular expressions. Such a format enables the rules to be easily manipulated.

Event extractor
The goal of the event extractor is to predict whether each morpheme belongs to the extent of an event tag, and to find the appropriate class of the tag. There are 7 classes of event tag: OCCURRENCE, PERCEPTION, REPORTING, STATE, I_STATE, I_ACTION, and NONE. NONE indicates that the corresponding morpheme does not belong to the extent, and the other classes follow the definitions of TimeML. Similar to the timex3 extractor, we take two approaches: a set of 26 rules and machine-learning models (e.g., CRF and MEM), based on the set of features used in the timex3 extractor.
There are several verbs that often appear within the extents of event tags although they do not carry any meaning. For example, in the sentence "나는 공부를 하다"(I study), the verb '하[ha]'(do) has no meaning while the noun '공부[gong-bu]'(study) carries the event meaning. We define a set of such verbal morphemes (e.g., '위하[wi-ha]'(for) and '통하[tong-ha]'(through)) to avoid generating meaningless event tags.

Makeinstance extractor
The goal of makeinstance extractor is to generate at least one makeinstance tag for each event tag, and find appropriate attributes. As we observed that there is only one makeinstance tag for each event tag in most cases, the makeinstance extractor simply generates one makeinstance tag for each event tag. For the attribute POS, we simply take the POS tags obtained from the Korean analyzer. We define a set of 5 rules for the attribute tense, and 2 rules for the attribute polarity.

Tlink extractor
The goal of the tlink extractor is to make a linkage between two tags, and to find the appropriate type of the link. For each pair of tags, it determines whether there must be a linkage between them, and finds the most appropriate relType. There are 11 relTypes: BEFORE, AFTER, INCLUDES, DUR- The NONE represents that there is no linkage between the two argument tags, and OVERLAP means that the temporal intervals of the two tags overlap. The other relTypes follow the definitions of TimeML. Thus, it is essentially a classification over 11 classes for each pair of argument tags.
The tlink extractor generates intra-sentence tlinks and inter-sentence tlinks. For the intra-sentence tlinks, we take two approaches: a set of 19 rules and machine-learning models (e.g., SVM and LR). Among the four kinds of tlinks (i.e., TT tlink, TM tlink, MM tlink, and DM tlink), our tlink extractor generates the first three kinds. The reason for excluding the DM tlink is that we maintain the temporal context initialized with the DCT in the timex3 extractor, so it is not necessary to generate DM tlinks. The TT tlinks are extracted by comparing the normalized values of two timex3 tags. Two models are independently trained for predicting TM tlinks and MM tlinks, respectively. We tried many possible combinations of features to reach better performance, and obtained the sets of features described in Table 3.
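The comparison of normalized values for TT tlinks can be sketched as follows. This is only an illustration: it handles full dates of the form YYYY-MM-DD, the function name tt_reltype is hypothetical, and mapping equal values to OVERLAP is an assumption made here rather than a rule stated in this paper.

```python
from datetime import date


def tt_reltype(value_a, value_b):
    """Derive a relType for two timex3 tags from their normalized values.

    Only full ISO-8601 dates (YYYY-MM-DD) are handled in this sketch.
    """
    a = date.fromisoformat(value_a)
    b = date.fromisoformat(value_b)
    if a < b:
        return "BEFORE"
    if a > b:
        return "AFTER"
    # Assumption: identical dates are treated as overlapping intervals.
    return "OVERLAP"


print(tt_reltype("1983-03-08", "1999-04-12"))
```

Because the values are already normalized, no learned model is needed for this kind of link, which is consistent with TT tlinks being treated as the easy case.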
Given a pair of makeinstance tags, it is straightforward to derive the relType when the two event tags are linked with timex3 tags. Thus, we first predict TT tlinks, and thereafter predict TM tlinks and MM tlinks. For the inter-sentence tlinks, we generate MM tlinks between adjacent sentences when there is a particular expression at the beginning of a sentence, such as '그 후[geu-hoo]'(afterward), '그 전[geu-jeon]'(beforehand), or '그 다음[geu-da-eum]'(thereafter).
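The cue-based generation of inter-sentence MM tlinks can be sketched as below. The cue list is taken from the examples in the text, but the mapping from each cue to a relType, the direction of the link (the event of the current sentence relative to the event of the previous sentence), and the function name inter_sentence_reltype are assumptions made for illustration.

```python
# Sentence-initial cues and the relType assumed for each of them.
# The relType describes the current sentence's event relative to the
# previous sentence's event (an assumption of this sketch).
CUE_RELTYPES = [
    ("그 후", "AFTER"),     # afterward
    ("그 전", "BEFORE"),    # beforehand
    ("그 다음", "AFTER"),   # thereafter
]


def inter_sentence_reltype(sentence):
    """Return a relType if the sentence starts with a known cue, else None."""
    stripped = sentence.lstrip()
    for cue, reltype in CUE_RELTYPES:
        if stripped.startswith(cue):
            return reltype
    return None


print(inter_sentence_reltype("그 후 그는 서울로 갔다"))  # starts with 'afterward'
print(inter_sentence_reltype("그는 서울로 갔다"))        # no cue
```

Sentences without a cue produce no inter-sentence link, so this step adds links only where the text signals the temporal order explicitly.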

Online LIFE
As the performance of the Korean analyzer is not stable, we need complementary features to build better classifiers. Jeong and Choi (2015) proposed the Language Independent Feature Extractor (LIFE), which generates a pair of a class label and a topic label for each Letter-Sequence (LS), where an LS is a frequently appearing letter sequence. The class labels can be used as syntactic features, while the topic labels can be used as semantic features. The concept of LS makes it language independent, so it is in principle applicable to any language. This is especially helpful for languages that have no stable feature extractors. Korean is one such language, so we employ LIFE to generate complementary features.
The temporal information extractor must work online because it usually takes a stream of documents as input. However, as LIFE was originally designed to work offline, we propose an extended version of LIFE, namely Online LIFE (O-LIFE), whose parameters are estimated incrementally. When designing O-LIFE, the LS concept of LIFE becomes a problem because the LS dictionary changes. For example, if the LS dictionary has only one LS 'goes' and a new token 'go' comes in, then the LS dictionary may contain 'go' and 'es'. Note that the LS 'goes' does not exist in the new dictionary. This issue is addressed by our proposed algorithm, which distributes the values of previously estimated parameters to the new prior parameters of overlapping LSs. For the above example, φ_{k,'goes'} will be distributed to β_{k,'go'} and β_{k,'es'}, where k is a topic index.
The formal algorithm of O-LIFE is shown in Algorithm 1, where C is the number of classes and T is the number of topics. S_stream is the number of streams, and S_s is the s-th stream, consisting of D_s documents. The four parameters a, b_t, g, and b_c are the default values of the priors α, β, γ, and δ. The three threshold parameters t_1, t_2, and t_3 are used to generate the LS dictionary.

Algorithm 1 (excerpt): O-LIFE
      ...
      δ^s_{c,w} = b_c, w ∈ Dic^s
  10: for each item w_i ∈ Dic^{s−1} do
  11:   if w_i ∈ Dic^s then
  12:   for each w_i ∈ Dic^s do
      ...
  20: initialize φ^s, η^s, π^s, and θ^s to zeros
      ...
  29: initialize class/topic assignments
  30: [φ^s, η^s, π^s, θ^s] =
  31:   ParameterEstimation(S_s, α^s, β^s, γ^s, δ^s)

The LS dictionary of O-LIFE is updated as it reads data. That is, |Dic_prev| ≤ |Dic_cur| and Dic_prev ⊆ Dic_cur, where Dic_prev is the previous dictionary and Dic_cur is the current dictionary. To handle this change in the dictionary, the two parameters β and δ are updated in four steps, based on the assumption that every unique LS w_i of the previous dictionary should contribute to the new dictionary as much as possible. Firstly, as described in the 8th line of the algorithm, the two parameters are initialized with default values. Secondly, if a particular LS w_i of the previous dictionary exists in the new dictionary, then the two parameters increase by B^{s−1}_{t,w_i} w_p and D^{s−1}_{c,w_i} w_p, respectively. B^{s−1}_t denotes an evolutionary matrix whose columns are the LS-topic distributions φ^{s−1}_t, and D^{s−1}_c denotes an evolutionary matrix whose columns are the LS-class distributions η^{s−1}_c. By multiplying them with the weight vector w_p, the contribution to initializing the priors is determined for each time slice. We call this value the weighted contribution.
Thirdly, for every LS w'_i of the new dictionary which contains w_i, the two parameters increase by the weighted contribution. Lastly, for every w'_i which overlaps with w_i, the two parameters increase by r times the weighted contribution, where w_overlap is the overlapping part.

Table 4: Statistics of the Korean TimeBank.

                Training  Validation  Test
  documents          536         131   173
  sentences         2357         466   879
  timex3            1245         253   494
  event             6594        1145  2609
  makeinstance      6615        1155  2613
  tlink             1295         374   674
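The four-step prior update described above can be sketched for a single topic and a single time slice (w_p = (1)), so that the weighted contribution of an old LS is simply its previous φ value. Several details are assumptions of this sketch and are not given in the text: "overlap" is taken to be the longest common substring, r is taken to be the length of that substring divided by the length of the old LS, and the names longest_common_substring and redistribute_prior are hypothetical.

```python
def longest_common_substring(a, b):
    """Return the longest substring of `a` that also occurs in `b`."""
    best = ""
    for i in range(len(a)):
        for j in range(i + 1, len(a) + 1):
            piece = a[i:j]
            if len(piece) > len(best) and piece in b:
                best = piece
    return best


def redistribute_prior(phi_prev, new_dictionary, default_prior):
    """Build the new prior for one topic from the previous estimates.

    phi_prev maps each old LS to its weighted contribution.
    """
    # Step 1: every LS of the new dictionary starts at the default value.
    beta = {ls: default_prior for ls in new_dictionary}
    for old_ls, contribution in phi_prev.items():
        for new_ls in new_dictionary:
            if new_ls == old_ls or old_ls in new_ls:
                # Steps 2 and 3: same LS, or a new LS containing the old one.
                beta[new_ls] += contribution
            else:
                # Step 4: partial overlap, scaled by the assumed ratio r.
                overlap = longest_common_substring(old_ls, new_ls)
                if overlap:
                    r = len(overlap) / len(old_ls)
                    beta[new_ls] += r * contribution
    return beta


# The 'goes' -> 'go' + 'es' example from the text.
print(redistribute_prior({"goes": 0.8}, ["go", "es"], default_prior=0.1))
```

Under these assumptions, each of 'go' and 'es' receives half of the contribution of 'goes' on top of the default prior. Note that this definition of overlap also matches a single shared letter; a real implementation would probably need a stricter criterion.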
The class labels and topic labels are converted to morpheme-level features by concatenating the labels of the LSs that overlap with a given morpheme. We call these features LIFE features, and they are used to train the machine-learning models together with the features based on the raw features.
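The conversion to morpheme-level features can be sketched as follows, assuming that both morphemes and LSs are given as half-open letter spans (start, end) over the same sentence. The span representation, the underscore separator, and the function name life_features are assumptions of this illustration.

```python
def life_features(morpheme_span, ls_labels):
    """Concatenate class and topic labels of LSs overlapping a morpheme.

    morpheme_span: (start, end) half-open letter span of the morpheme.
    ls_labels: list of (start, end, class_label, topic_label) per LS.
    """
    m_start, m_end = morpheme_span
    classes, topics = [], []
    for start, end, class_label, topic_label in ls_labels:
        # Two half-open spans overlap when each starts before the other ends.
        if start < m_end and m_start < end:
            classes.append(str(class_label))
            topics.append(str(topic_label))
    return "_".join(classes), "_".join(topics)


# Three LSs over a sentence; the morpheme covers letters 2..3.
labels = [(0, 2, 3, 7), (1, 3, 5, 1), (3, 6, 2, 2)]
print(life_features((2, 4), labels))
```

The two resulting strings can then be used as categorical features alongside the POS and NE features of the morpheme.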

Experiments
The Korean TimeBank is divided into a training dataset, a validation dataset, and a test dataset. The statistics of the dataset are described in Table 4. As shown in the table, only one makeinstance tag exists for each event tag in most cases, which follows the assumption of makeinstance extractor.

Timex3 prediction
For the extent and type prediction, only the exactly predicted extents and types are regarded as correct, and the results are summarized in Table 5, where MEM is trained only with the features generated from the raw features, and MEM_L is trained with both these features and the LIFE features. We employ the CRF++ library (http://crfpp.googlecode.com/svn/trunk/doc/index.html) and the MEM toolkit (http://homepages.inf.ed.ac.uk/lzhang10/maxent_toolkit). The optimal parameter settings are found by a grid search with the validation set. The optimal setting for CRF is as follows: L1-regularization, c=0.6, and f=1. The MEM shows its best performance without Gaussian prior smoothing. Both models give generally better performance when the window size is 2. The parameter setting for O-LIFE is as follows: a=0.1, b_t=0.1, g=0.1, b_c=0.1, w_p=(1), C=10, T=10, t_1=0.4, t_2=0.4, t_3=0.7, and the number of iterations for estimation is 1000.
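The exact-match criterion used for extent and type can be sketched as below. Each tag is represented as a (begin, end, type) triple, which is an assumption of this illustration, as is the function name exact_match_scores; a prediction counts as correct only if all three components equal those of a gold tag.

```python
def exact_match_scores(gold, predicted):
    """Precision, recall, and F1 where only exact matches count."""
    gold_set, predicted_set = set(gold), set(predicted)
    correct = len(gold_set & predicted_set)
    precision = correct / len(predicted_set) if predicted_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [(0, 4, "DATE"), (10, 12, "TIME")]
# The second prediction has the right type but a wrong right boundary,
# so it is counted as an error.
predicted = [(0, 4, "DATE"), (10, 13, "TIME")]
print(exact_match_scores(gold, predicted))
```

A partially correct extent receives no credit under this criterion, which makes the reported scores stricter than overlap-based scoring.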
As shown in the table, CRF gives generally better performance than MEM and the rules. We tried combining the rules with the machine-learning models, and the combination led to an increase in performance. For example, the combination of CRF and rules gives better performance than using only CRF. A likely explanation is that there are some patterns that the machine-learning models could not capture, and the rules handle these patterns. Note that using the LIFE features dramatically increases the performance. We believe that this is due to the raw features of Korean (e.g., those from the POS tagger) being unstable. The LIFE features complement these unstable features by capturing syntactic/semantic patterns that are inherent in the given documents. Furthermore, we observed that combining the rules with machine-learning models trained with LIFE features does not further improve the performance. This implies that the LIFE features allow the machine-learning models to capture the patterns that they could not capture without LIFE features. CRF_L is found to be the best, and the remaining attributes are predicted using the rules. The performance is measured in a sequential manner, so the performance generally decreases from the top to the bottom of the table.
Further experiments on timex3 prediction using the TempEval-2 Korean dataset are conducted to compare with the existing method of Angeli and Uszkoreit (2013). The existing method makes use of a latent parse conveying a language-flexible representation of time, and extracts features over both the parse and the associated temporal semantics. The results of the comparison are shown in Table 6, where our method uses CRF_L for type and rules for value. For a fair comparison, both methods are trained and tested using only the TempEval-2 Korean dataset. As mentioned before, TempEval-2 Korean dataset contained errors, so we corrected them and

Event prediction
The results of event prediction are summarized in Table 7, and are similar to the results of timex3 prediction. By a grid search, the optimal parameter settings are found to be the same as those of the timex3 extractor. Employing the LIFE features again increases the performance, and CRF_L is found to be the best.

Makeinstance prediction
All the attributes of makeinstance tag are predicted through hand-crafted rules. The performance is summarized in Table 8. The measurement of the attribute POS is excluded because we simply take the results of the Korean analyzer.

Tlink prediction
By performing a grid search with the validation set, we found that a Support Vector Machine (SVM) of C-SVC type with a Radial Basis Function (RBF) kernel gives the best performance when the gamma parameter of the kernel function is 1/(number of features). We also found that the L1-regularized Logistic Regression (LR) with C=1 gives the best performance. We observed that both models give better performance when the window size is 1. There are two cases of tlink prediction: (1) tlink prediction given correct other tags, and (2) tlink prediction given predicted other tags. The results of the first case are summarized in Table 9, where the performance is measured in a sequential manner. As shown in the table, we tried combining the rules with the machine-learning models, and the combination of SVM and rules performed the best. Note that we do not use the LIFE features for tlink prediction because we observed that they do not contribute to the performance of tlink prediction. We believe that this is because the LIFE features represent only the syntactic/semantic patterns of the given terms, but not arbitrary relations between the terms.
The results of the second case are obtained using the best combination, and are shown in Table 10. Note that the tlink tags are predicted without the LIFE features, while the other tags (timex3, event, and makeinstance) are obtained using the LIFE features. To measure the impact of the LIFE features, we also conduct the experiment of the second case given the tags predicted without using the LIFE features, and the results are shown in Table 11. As shown in Table 10 and Table 11, using the LIFE features increases the F1 score by about 6 percent.
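For reference, the RBF kernel setting mentioned in this section can be written out directly. The sketch below is plain Python rather than the SVM library used in the experiments; the function name rbf_kernel is hypothetical, and the default of gamma = 1/(number of features) mirrors the setting reported above.

```python
import math


def rbf_kernel(x, y, gamma=None):
    """RBF kernel exp(-gamma * ||x - y||^2) for two feature vectors."""
    if len(x) != len(y):
        raise ValueError("feature vectors must have the same length")
    if gamma is None:
        # Setting reported in the text: gamma = 1 / number of features.
        gamma = 1.0 / len(x)
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Identical vectors have similarity 1; the value decays with distance.
print(rbf_kernel([1.0, 0.0], [1.0, 0.0]))
print(rbf_kernel([1.0, 0.0], [0.0, 1.0]))
```

Scaling gamma by the number of features keeps the kernel values in a comparable range when the feature set grows, which is one reason this value is a common default.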

Conclusion
We introduced a new method for extracting temporal information from unstructured Korean texts. The Korean language has a complex grammar, so there were many research issues to address before achieving our goal. We presented these issues and our solutions to them. Experimental results illustrated the effectiveness of our method, especially when we adopted the extended probabilistic model, Online LIFE (O-LIFE), to generate complementary features for training machine-learning models. In addition, as there was not sufficient data for this study, we manually constructed the Korean TimeBank, consisting of more than 3,700 annotated sentences. We will extend our study to interact with a knowledge base to achieve better prediction of temporal information.