Towards Never Ending Language Learning for Morphologically Rich Languages

This work deals with ontology learning from unstructured Russian text. We implement one of the components of the Never Ending Language Learner and introduce algorithm extensions aimed at capturing the specifics of a morphologically rich, free-word-order language. We demonstrate that this method can be successfully applied to Russian data. In addition, we perform several experiments comparing different settings of the training process. We demonstrate that using morphological features significantly improves the precision of the system, while using seed patterns helps to improve the coverage.

The main challenge is to design systems that do not require any human involvement and may efficiently store lots of information limited only by the amount of the knowledge uploaded to the Internet. One of the ways of representing information for such systems is ontologies.
According to the famous definition by Gruber (1995), an ontology is "an explicit specification of a conceptualization", i.e. a formalization of the knowledge that underlies language utterances. In the simplest case, an ontology is a structure containing concepts and relations among them. In addition, it may contain a set of axioms that define the relations and constraints on their interpretation (Guarino, 1998). One of the advantages of such structures is data formalization, which simplifies automatic processing. Ontologies are widely used in information retrieval, text analysis and semantic applications (Albertsen and Blomqvist, 2007; Staab and Studer, 2013).
In many practical applications, ontological concepts should be associated with a lexicon (Hirst, 2009), i.e. with language expressions and structures. Even though ontologies themselves contain knowledge about the world, not about a language, their primary goal is to ensure semantic interpretation of texts. Thus, ontology learning from text is an emerging research direction (Maedche, 2012; Staab and Studer, 2013).
One of the approaches used to learn facts from unstructured text is called Never Ending Language Learning (NELL) (Carlson et al., 2010a). One of the advantages of NELL is its low demand for preprocessed data required for the learning process. Given an initial ontology that contains 10-20 seeds for each category as input, NELL can achieve a high performance level on extracting facts and relations from a large corpus (Carlson et al., 2010a). The first implementation of NELL (Carlson et al., 2010a) worked with English. An attempt was made to extend the NELL approach to Portuguese (Duarte and Hruschka, 2014). The main result of these experiments was that the initial NELL parameters and ontology did not work well when applied to non-English (Portuguese) web pages. The authors concluded that in order to extend the NELL approach to a new language, it is necessary to prepare a new seed ontology and contextual patterns that depend on the rules of that language.
In this paper, we introduce a NELL extension to the Russian language. Being a Slavic language, Russian has rich morphology and free word order. Thus, common expressions for semantic relations in text have a specific form: the word order is less reliable than in Germanic or Romance languages, and the morphological properties of words are more crucial. However, many pattern learning techniques are based on the word order of pattern components and usually do not include morphology. Thus, the adaptation of the NELL approach to a Slavic language requires changes in the pattern structure. We introduce an adaptation of NELL to Russian, test it on a small dataset of 2.5 million words for 9 ontology categories, and demonstrate that using morphology is crucial for ontology learning for Russian. This is the main contribution of this paper.
The rest of the paper is organized as follows. Section 2 overviews the original NELL approach. Our improvements of the algorithm are presented in Section 3. Section 4 describes our data source, its preprocessing, and the experiments we run. Results of these experiments are presented and discussed in Section 5. In Section 6, we give a brief overview of related papers. We summarize the results and outline future work in Section 7.

Never Ending Language Learner
The NELL architecture, which is presented in Figure 1, consists of two major parts: a knowledge base (KB) and a set of iterative learners (shown in the lowest part of the figure). The system works iteratively: first, the learners try to extract as many candidate facts as possible given the current state of the KB; after that, the KB is updated using the learners' output. This process runs indefinitely, with the current state of the KB being freely available at the project webpage (http://rtw.ml.cmu.edu/rtw/).
In this work, we focus on one of the NELL components, namely the Coupled Pattern Learner (CPL). CPL is the free-text extractor that learns contextual patterns to extract instances of ontology categories. The key idea of CPL is that simultaneous ("coupled") learning of instances and patterns yields a higher performance than learning them independently (Carlson et al., 2010b).
An expression that matches text in CPL consists of three parts, which must be found within the same sentence:
1. Category word. The list of category words is fixed and defined in the initial ontology.
2. Instance extracting pattern. A pattern consists of at most three words, including punctuation such as commas or parentheses, but excluding category and instance words.
3. Instance word. At the beginning, 3-5 seed instances are defined for each category.
CPL uses two sets: the set of trusted patterns and the set of trusted instances, which are considered to be actual patterns and instances for the corresponding category. Different implementations may or may not exclude patterns/instances from the corresponding sets during further iterations.
The process starts with a text corpus and a small seed ontology that contains sets of trusted patterns and trusted instances. Then every learning iteration consists of the following two steps:
• Instance extraction. To extract new instances, the system finds a co-occurrence of the category word with a pattern from the trusted list and then identifies the instance word. If both the category and instance words satisfy the conditions of the pattern, the found word is added to the pool of candidate instances for the current iteration. When all sentences are processed, the candidate instances are evaluated, after which the most reliable instances are added to the set of trusted instances.
• Pattern extraction. To extract new patterns, the system finds a co-occurrence of the category word with one of its trusted instances. The sequence of words between the category and the instance is identified as a candidate pattern. When all candidate patterns are collected, the most reliable patterns are added to the trusted set.
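The two steps of a learning iteration can be sketched in Python as follows. This is a minimal illustration under simplifying assumptions, not the implementation used in this work: sentences are pre-tokenized lists of strings, the category word always precedes the pattern and the instance, candidate ranking is omitted, and the function names and the toy English corpus are hypothetical.

```python
from collections import Counter

MAX_PATTERN_LEN = 3  # a pattern consists of at most three tokens


def extract_instances(sentences, category_word, trusted_patterns):
    """Count tokens that directly follow `category_word + trusted pattern`."""
    candidates = Counter()
    for tokens in sentences:
        for pos, tok in enumerate(tokens):
            if tok != category_word:
                continue
            for pattern in trusted_patterns:
                end = pos + 1 + len(pattern)
                if end < len(tokens) and tuple(tokens[pos + 1:end]) == pattern:
                    candidates[tokens[end]] += 1
    return candidates


def extract_patterns(sentences, category_word, trusted_instances):
    """Count token sequences found between the category word and a trusted instance."""
    candidates = Counter()
    for tokens in sentences:
        for pos, tok in enumerate(tokens):
            if tok != category_word:
                continue
            for length in range(1, MAX_PATTERN_LEN + 1):
                end = pos + 1 + length
                if end < len(tokens) and tokens[end] in trusted_instances:
                    candidates[tuple(tokens[pos + 1:end])] += 1
    return candidates


# Toy corpus (hypothetical data, for illustration only).
sentences = [
    "fruits such as apples are sweet".split(),
    "fruits , including pears , are healthy".split(),
    "fruits such as pears grow on trees".split(),
]

# Step 1: the seed pattern "such as" yields candidate instances.
instance_candidates = extract_instances(sentences, "fruits", {("such", "as")})

# Step 2: the enlarged instance set yields candidate patterns.
pattern_candidates = extract_patterns(sentences, "fruits", set(instance_candidates))
```

In this toy run, the seed pattern finds "pears" in addition to the seed instance "apples", and "pears" in turn exposes the new candidate pattern ", including". In a full system, both candidate pools would be ranked and filtered before anything is added to the trusted sets.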

Adaptation to the Russian Language
Russian patterns should have a specific structure that includes morphological components. Thus, we expand the form of the search expression so that case and number are taken into account for both category and instance words. Let us consider an example that illustrates the importance of including morphology in patterns:
Тренеры знают множество приемов для дрессировки собак, такие как поощрение едой и многие другие.
"Coaches know many techniques for training dogs, such as stimulation with food and many others."
This sentence matches the "such as" (такие как) pattern, and without morphological constraints this may lead to extracting the wrong relation "stimulation is a dog". If the pattern specified only part-of-speech rules, our algorithm would produce many errors. Specifying the morphological form of the arguments (nominative in this example) helps to avoid such false pattern triggering. Another way to avoid such errors would be to annotate all data syntactically and run CPL on top of this annotation; we leave this approach for further research.
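The effect of the case constraint on this example can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the tokens are annotated by hand, the tag strings ("NOUN", "nomn", "gent", "plur") merely imitate the style of common Russian morphological tag sets, and the `Token` class, the `satisfies` helper, and both constraint dictionaries are hypothetical names rather than part of CPL-RUS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    pos: str     # part of speech, e.g. "NOUN"
    case: str    # grammatical case, e.g. "nomn" (nominative), "gent" (genitive)
    number: str  # "sing" or "plur"


def satisfies(token, constraint):
    """Return True if every attribute named in the constraint has the required value."""
    return all(getattr(token, attr) == value for attr, value in constraint.items())


# Assumed constraints for the category slot of the "such as" pattern.
POS_ONLY = {"pos": "NOUN"}                       # no morphology
WITH_MORPHOLOGY = {"pos": "NOUN", "case": "nomn"}  # category word must be nominative

# "собак" in the example sentence is genitive plural ("... training of dogs").
sobak = Token("собак", "NOUN", "gent", "plur")
# A nominative form, as in a hypothetical sentence "Собаки, такие как овчарки, ...".
sobaki = Token("собаки", "NOUN", "nomn", "plur")

pos_only_match = satisfies(sobak, POS_ONLY)       # pattern fires: false positive
morph_match = satisfies(sobak, WITH_MORPHOLOGY)   # pattern is blocked by the case check
nominative_match = satisfies(sobaki, WITH_MORPHOLOGY)
```

With part-of-speech rules alone, the genitive form "собак" is accepted as the category argument; adding the nominative requirement rejects it while still accepting a genuine nominative category word.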

Strategies for Expanding the Trusted Sets
To add new patterns and instances to the corresponding trusted sets, we use the Support metric. For each category, instances and patterns are ranked separately using the following formulas:

$$\mathrm{Support}_c^{(t)}(i) = \frac{\sum_{p \in \mathrm{TruPat}_c^{(t)}} \mathrm{Count}_c(i, p)}{\mathrm{Count}_c(i)}$$

for instances and

$$\mathrm{Support}_c^{(t)}(p) = \frac{\sum_{i \in \mathrm{TruInst}_c^{(t)}} \mathrm{Count}_c(i, p)}{\mathrm{Count}_c(p)}$$

for patterns, where i is an instance word, p is a pattern, Count_c(i, p) is the number of cases when i and c match as arguments of p in the corpus related to category c, Count_c(x) is the total number of matches of x in the corpus related to category c, TruInst is the set of trusted instances, TruPat is the set of trusted patterns, and (t) is the iteration.
Note on the dog-training example above: this particular example would probably produce the same error in the English translation, though we believe that such cases should be rarer. Since English has almost no morphology, some other mechanism should be used to restrict over-production of patterns; in particular, distinguishing between verb subject and object is easier for a free-word-order language.
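A minimal Python sketch of the Support computation is given below. It assumes one reading of the metric that is consistent with the definitions of Count_c(i, p) and Count_c(x): the share of all matches of an element that co-occur with currently trusted elements of the other kind. The function names and the toy counts are hypothetical and are not taken from our experiments.

```python
from collections import Counter


def support_instance(instance, trusted_patterns, pair_counts, instance_counts):
    """Share of the instance's matches that co-occur with trusted patterns."""
    total = instance_counts[instance]
    matched = sum(pair_counts[(instance, p)] for p in trusted_patterns)
    return matched / total if total else 0.0


def support_pattern(pattern, trusted_instances, pair_counts, pattern_counts):
    """Share of the pattern's matches that co-occur with trusted instances."""
    total = pattern_counts[pattern]
    matched = sum(pair_counts[(i, pattern)] for i in trusted_instances)
    return matched / total if total else 0.0


# Toy counts for one category (hypothetical numbers).
pair_counts = Counter({
    ("apple", "such as"): 3,
    ("apple", "including"): 1,
    ("car", "such as"): 1,
})
instance_counts = Counter({"apple": 5, "car": 10})
pattern_counts = Counter({"such as": 8, "including": 2})

apple_support = support_instance(
    "apple", {"such as", "including"}, pair_counts, instance_counts)
car_support = support_instance(
    "car", {"such as", "including"}, pair_counts, instance_counts)
such_as_support = support_pattern(
    "such as", {"apple"}, pair_counts, pattern_counts)
```

Here "apple" is matched by trusted patterns in 4 of its 5 occurrences, whereas "car" is matched in only 1 of 10, so ranking by Support places "apple" well above "car".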
Instances and patterns with higher support are considered to be trusted. To define trusted patterns and instances, we use the FILTERBYTHRESHOLD procedure, which is implemented in two versions using two different strategies.
The first strategy uses a threshold on the Support value that is computed after the first iteration, separately for patterns and for instances. On the first iteration, the threshold equals zero, which means that we allow pattern and instance extraction without any limitations. Then the threshold is set to the minimum Support value among all extracted patterns and among all extracted instances, respectively. On subsequent iterations, only the instances and patterns whose Support is greater than or equal to the corresponding threshold are added to the trusted sets. Note that within this strategy, the Support value of any pattern or instance does not decrease. We will refer to this strategy as THRESHOLD-SUPPORT. This is the main strategy for CPL-RUS.
THRESHOLD-SUPPORT does not limit the number of trusted elements during the algorithm run. It is greedy in the sense that it collects all instances and patterns that are trusted enough and uses them to extract new patterns and instances. Thus, a final filtering should be applied after the algorithm stops: only the final instances whose Support is not less than a certain minimal support are selected.
The second strategy uses a threshold on the number of elements in the trusted sets. After new instances and patterns are extracted, they are sorted with respect to their Support, and only the 50 most reliable instances and patterns are kept in the trusted sets. We assume that this procedure can correct errors made on earlier iterations once the algorithm has more evidence. This strategy was used in (Duarte and Hruschka, 2014). We will refer to it as THRESHOLD-50.
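The difference between the two strategies can be sketched in Python as follows. This is a simplified illustration, not the FILTERBYTHRESHOLD procedure itself: Support values are given as a plain dictionary, ties are broken alphabetically (an assumption), the cut-off of 50 is replaced by 2 to keep the example small, and all names and numbers are hypothetical.

```python
def first_iteration_threshold(supports):
    """THRESHOLD-SUPPORT: after the unrestricted first iteration, use the minimum Support."""
    return min(supports.values())


def filter_by_support(supports, threshold):
    """THRESHOLD-SUPPORT: keep every candidate whose Support reaches the threshold."""
    return {item for item, value in supports.items() if value >= threshold}


def filter_top_k(supports, k=50):
    """THRESHOLD-50 style: keep only the k best-supported elements overall."""
    ranked = sorted(supports, key=lambda item: (-supports[item], item))
    return set(ranked[:k])


# Toy Support values (hypothetical).
first_iteration = {"apple": 0.8, "pear": 0.4}
later_candidates = {"plum": 0.5, "car": 0.1, "kiwi": 0.4}

# Strategy 1: the trusted set only grows; candidates below the threshold are skipped.
threshold = first_iteration_threshold(first_iteration)
trusted_by_support = set(first_iteration) | filter_by_support(later_candidates, threshold)

# Strategy 2: old and new elements are re-ranked together, so earlier ones can drop out.
trusted_top_k = filter_top_k({**first_iteration, **later_candidates}, k=2)
```

In this toy run the first strategy ends with four trusted instances (everything except "car"), while the second keeps only "apple" and "plum" and discards the previously trusted "pear", which illustrates how re-ranking can revise earlier decisions at the cost of coverage.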

Implementation
Our implementation of the CPL component is summarized in Algorithm 1 (COUPLED PATTERN LEARNER, CPL-RUS). The algorithm processes each category c separately. It starts with the seed trusted sets for c (the set of trusted patterns TruPat_c and, as described in Section 2, the set of trusted instances) and a preprocessed corpus for each c: we use only sentences that contain the lexeme(s) of c to speed up iterations.
Though this algorithm should run indefinitely with more and more data (that is how the original NELL process is organized), only small corpora are used in our experiments, and the process stops if no new patterns or instances were found during the previous iteration.
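The stopping rule can be sketched in Python as a simple fixed-point loop. This is an illustration only: `run_until_converged` and `toy_step` are hypothetical names, the extraction and filtering performed within an iteration are hidden behind the `step` callback, and the lookup tables stand in for a real corpus.

```python
def run_until_converged(step, seed_instances, seed_patterns, max_iterations=100):
    """Repeat `step` until an iteration adds no new instance and no new pattern."""
    instances, patterns = set(seed_instances), set(seed_patterns)
    for iteration in range(1, max_iterations + 1):
        found_instances, found_patterns = step(instances, patterns)
        added_instances = found_instances - instances
        added_patterns = found_patterns - patterns
        if not added_instances and not added_patterns:
            return instances, patterns, iteration
        instances |= added_instances
        patterns |= added_patterns
    return instances, patterns, max_iterations


# Toy stand-in for the corpus: which instances a pattern finds and vice versa.
PATTERN_TO_INSTANCES = {"such as": {"apple", "pear"}, "including": {"plum"}}
INSTANCE_TO_PATTERNS = {"apple": {"such as"}, "pear": {"including"}, "plum": set()}


def toy_step(instances, patterns):
    """One iteration: look up what the current trusted sets can reach."""
    found_instances = set().union(*(PATTERN_TO_INSTANCES.get(p, set()) for p in patterns))
    found_patterns = set().union(*(INSTANCE_TO_PATTERNS.get(i, set()) for i in instances))
    return found_instances, found_patterns


final_instances, final_patterns, iterations = run_until_converged(
    toy_step, {"apple"}, {"such as"})
```

The loop ends on the first iteration that contributes nothing new, which mirrors the criterion described above; `max_iterations` is an extra safeguard introduced for this sketch.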

Data
We use Russian Wikipedia as the data source because Wikipedia categories make it convenient to download a relatively small corpus devoted to a particular topic (e.g. animals). Wikipedia categories are different from those in the ontology, though they can be easily matched. However, we do not use the specific Wikipedia structure for anything but corpus collection, thus our method can work with any other source type. Note that even though the Wikipedia format for articles has its own standards, the articles are written by different people, and the author style changes across documents. That makes Wikipedia a good resource for obtaining data with some variety in style.
We use the Petscan service to download Wikipedia pages that belong to a certain category. For initial experiments, we collect several corpora, trying to select wide but not too general categories. For example, we consider animals to be too general and split it into several subcategories, such as birds, fish, etc. The rationale is that too broad categories might be too computationally heavy for initial experiments, while too narrow categories might not contain enough data. In total, we use a corpus of 2.5 million sentences extracted from 7 various categories (see Table 4.1). Then we annotate the text with morphological attributes, such as part of speech, case, number, and lexeme, using the Pymorphy tool (Korobov, 2015). The results of the processing are lists of extracted patterns and instances for each category.

Initial Ontology
The initial ontology consists of 9 categories and 41 instances; it is presented in Table 4.2.
Note that FRUIT and VEGETABLE are subcategories of FOOD; we run all three independently, which allows us to compare the algorithm performance on more general vs. narrower categories.
The seed CPL patterns and their morphological constraints are listed in Table 4.2.

Experiment Design
We run experiments for all categories independently. Then we collect all extracted instances and manually annotate them as correct or incorrect. Then, for each category c, we evaluate precision using the following formula:

$$\mathrm{Precision}(c) = \frac{\mathrm{CorrInst}(c)}{\mathrm{AllInst}(c)}$$

where CorrInst(c) is the number of correct instances extracted for category c, and AllInst(c) is the total number of instances that were extracted by CPL for category c.
When we use the THRESHOLD-SUPPORT strategy, we perform a final filtering using different minimal support values. For algorithm comparison, we use the values 0.1, 0.5 and 1.0.
The main experiment is devoted to CPL-RUS with the THRESHOLD-SUPPORT strategy. The algorithm converges after 6-10 iterations, depending on the category. We run it on all the categories and investigate how precision depends on the support value used to cut off trusted instances after the algorithm converges.
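The evaluation procedure (final filtering by minimal support followed by precision over manually annotated instances) can be sketched in Python. The data below are invented purely for illustration and are not results of our experiments; `precision_at_min_support` and the variable names are hypothetical.

```python
def precision_at_min_support(supports, correct, min_support):
    """Precision over the instances that survive the final Support filter.

    Returns None when no instance passes the filter (precision is undefined).
    """
    kept = {inst for inst, value in supports.items() if value >= min_support}
    if not kept:
        return None
    return len(kept & correct) / len(kept)


# Invented example: final Support of each extracted instance and the manual annotation.
final_supports = {"apple": 1.0, "pear": 0.6, "car": 1.0, "tree": 0.2}
annotated_correct = {"apple", "pear"}

precision_by_threshold = {
    threshold: precision_at_min_support(final_supports, annotated_correct, threshold)
    for threshold in (0.1, 0.5, 1.0)
}
```

In this invented example, precision is 2/4 at minimal support 0.1, 2/3 at 0.5, and 1/2 at 1.0: a stricter threshold does not necessarily give higher precision when an incorrect instance happens to have a very high Support value.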
In addition, we perform a set of smaller experiments to study CPL properties and impact of different parameters. We test: 1) usefulness of morphological features; 2) usefulness of pattern seeds; 3) differences between threshold selection strategies.
In the first experiment, we compare CPL-RUS with a version of this algorithm that does not use morphology (and is thus similar to the English CPL). We will refer to the latter as CPL-NOMORPH. We run both on three ontology categories: VEGETABLE, FRUIT, and FOOD. The first run uses morphological constraints, and the second allows words in all morphological forms.
In the second experiment, we investigate whether the use of seed patterns can improve the quality of the algorithm; the same experiment was conducted by Duarte and Hruschka (2014). As can be seen from the description in Section 2, CPL can learn without seed patterns, relying only on the set of initial categories and instances. However, since the initial ontology is small, this might not be the optimal strategy. We will refer to the version of the algorithm that does not use seed patterns as CPL-NOPAT. We run the algorithms on the same three categories: VEGETABLE, FRUIT, and FOOD.
In the third experiment, we compare the two threshold selection strategies described in Section 3.3: THRESHOLD-SUPPORT, which is based on the minimal Support after the first iteration, and THRESHOLD-50, which keeps a fixed number of patterns and instances and revises the trusted lists after each iteration.

Results and Discussion

On CPL-RUS
Table 5.1 shows the main results of running CPL-RUS on the whole ontology using seeds.
The results vary widely among categories, with COUNTRY and SPORT being the most problematic ones regardless of the minimum support.
FOOD, as the more general category, performs much worse than the narrower VEGETABLE and FRUIT categories, though for these categories the number of extracted instances is very low (see Table 5.2).
Interestingly, CPL-RUS with minimal support 0.5 shows better precision than with minimal support 1.0. This means that some false positives have a very high Support value.

On Morphological Constraints
The results of evaluating the importance of including morphological constraints in the Russian CPL are shown in Table 5.2. Without the constraints, the precision for all categories is much lower, which makes CPL-NOMORPH completely useless. While CPL-RUS can achieve precision 1.0 for the VEGETABLE and FRUIT categories, the maximum result for the same categories in unconstrained mode is 0.43. Table 5.2 presents a comparison of the learning progress for the three categories with and without morphological constraints. As can be seen, morphological constraints decrease the number of extracted instances and patterns and slow down the training process.
Table 5.3 shows the results of running CPL-NOPAT, which does not use any seed patterns. In comparison with CPL-RUS (Table 5.1), this algorithm yields worse precision, especially for the more general FOOD category. Table 5.3 shows the total number of extracted instances in both cases. As can be seen, running the algorithm without seed patterns increases its coverage but decreases the resulting precision.

On Threshold Selection Strategies
Precision for different Support thresholds in CPL-RUS is shown in Figure 2. The numerical values of precision for three minimal support values are shown in Table 5.1.
In our final experiment, we test the THRESHOLD-50 strategy, which re-ranks patterns and instances on every step and allows only 50 of them to be trusted. The results for four ontology categories are shown in Table 5.4. Precision is better for this strategy, but the number of extracted instances is very small. This means that the strategy yields lower recall (which is hard to evaluate in exact numbers). This leaves an opportunity for future work: finding a way to determine a minimal support value that satisfies both conditions, namely that the number of extracted instances is not small and that the precision is high and does not vary among categories.

Comparison with Other Approaches
The results of our experiments can be compared with the two previous works on this approach, for English and Portuguese. Because we extend the basic CPL algorithm only with morphological features of the Russian language, the accuracy of the CPL implementations is easy to compare. The average accuracy of the English version of CPL is reported as 0.78, with a minimum of 0.2 for the SPORTS EQUIPMENT category and a maximum of 1.0 for the ACTOR, CELEBRITY, FURNITURE and SPORTS LEAGUE categories (Carlson et al., 2010a). The maximum average accuracy for Russian is 0.612. The results for Russian also vary between categories, from 0.16 to 1.0, but the average accuracy is higher for English. The results for the Portuguese version of CPL are presented separately for 5, 10, 15, and 20 iterations of the algorithm (Duarte and Hruschka, 2014). Since we did not run more than 10 iterations of CPL for any category, the most meaningful comparison is with the accuracy of the Portuguese CPL after 10 iterations. The average accuracy of the Portuguese CPL varies from 0.04 to 0.95 (Duarte and Hruschka, 2014).

Related Work
In this paper, we focus on coupled pattern and instance learning from text for ontology learning; the papers related to this topic are briefly overviewed in this section. A more general introduction to NELL and its predecessors can be found in (Carlson et al., 2010a).
Bootstrapping is well known as a method for semi-supervised pattern learning. It was initially proposed for Information Extraction, that is, for the traditional setting in which the event templates are given beforehand (Riloff et al., 1999; Agichtein and Gravano, 2000; Yangarber, 2003). Bootstrapping for ontology learning from text has been applied, for example, by (Liu et al., 2005; Paliouras, 2005; Brewster et al., 2002).
Later, the same principle was adapted for Open-Domain Information Extraction, which aims at discovering entity relations without any restrictions on their type (Shinyama and Sekine, 2006; Banko et al., 2007; Wang et al., 2011).
The idea of automatically extracting domain templates from a large corpus has been extensively studied, for example, by (Filatova et al., 2006; Chambers and Jurafsky, 2011; Fader et al., 2011). [...] research field becomes closer to ontology learning and knowledge-base population, though the latter task might be more difficult since it requires cross-document inference (Ji and Grishman, 2011). The idea of simultaneous (coupled, joint) learning of both instances and relations has been justified. Li and Ji (2014) argued that though these two tasks are traditionally broken down into separate components, this is a rather artificial division leading to over-simplification and error propagation from the earlier tasks to the later steps.
Using a knowledge base to extract relations has been previously proposed as a distant supervision approach by, among others, (Mintz et al., 2009; Surdeanu et al. [...]); these works assumed that the KB is rather big (such as Freebase).
Table 9: Results for running CPL-RUS with THRESHOLD-50.
As far as we are aware, this is the first work on the application of pattern learning techniques to the Russian language, despite general interest in Information Extraction (Starostin et al., 2016) and the building of linguistic resources (Loukachevitch and Dobrov, 2014). Bocharov et al. (2010) and Sabirova and Lukanin (2014) used a rule-based approach to extract taxonomic relations from text. Kuznetsov et al. (2016) applied a number of machine learning techniques to automatic relation extraction from the Russian Wikipedia, but their method depends on the specific structure of Wikipedia.

Conclusion
In this work, we made the first attempt to adapt the NELL approach to the Russian language. We changed the CPL component so that it can work with morphology. We conducted several experiments with the extended version, the CPL-RUS algorithm, on a corpus containing over 2.5 million sentences. Our main findings are the following:
• it is possible to adapt CPL for Russian with relatively little effort;
• the morphological constraints are crucial for Russian pattern learning;
• a small set of manually compiled seed patterns increases CPL accuracy;
• the obtained results vary for different categories, which probably means that the algorithm settings should be optimized independently for each category.
This work leaves room for further experiments. We plan to run CPL on much bigger datasets, including the whole Wikipedia corpus and other web pages. This would require an expansion of the seed ontology and, probably, the construction of seed patterns individually for each category or group of categories.
We will also continue working on threshold selection strategies. Another line of research is to run CPL on top of syntactic annotation; in principle, this should increase precision, though some errors might be introduced by the syntactic parser itself.