How to Handle Split Antecedents in Tamil?

Resolving anaphoric entities in natural language text is essential for extracting complete information from the text. In this paper, we present a methodology to resolve one of the difficult classes of pronouns, plural pronouns with split antecedents, in Tamil. We use a salience-measure-based approach with salience factors obtained from the sub-categorization information of nouns and the selectional restriction rules of verbs. We have evaluated our approach on a Tamil novel corpus and the results are encouraging.


Introduction
Anaphoric expressions help bring cohesion to natural language text. The resolution of these anaphoric expressions is vital in developing information extraction and understanding systems. Theoretically, various anaphoric expressions such as pronominals, reflexives, reciprocals, distributives, one-pronouns, definite descriptions, VP anaphora, and zero anaphora are well studied. Automatic resolution engines for various types of anaphors have been presented from the early 1980s, starting with Hobbs' (1978) naïve approach, followed by knowledge-rich approaches by Carter et al. (1987), Carbonell and Brown (1988), and Rich and LuperFoy (1988). These were followed by knowledge-poor approaches by Lappin and Leass (1994), Kennedy and Boguraev, Mitkov (1998), etc. A centering-theory-based approach was introduced by Grosz, Kuhn (1979, 1981). Anaphora resolution was further boosted by various Machine Learning (ML) techniques. The first ML approach was presented by Dagan and Itai (1980), and various ML techniques were used later. Byron (2001) lists difficult anaphors that are excluded by most systems: i) constructions that require interpreting pronouns with split antecedents or cataphora; ii) pronouns whose antecedents are not NPs, such as clauses; iii) pronouns with no antecedent in the discourse, such as deictic or generic pronouns. There are very few automatic resolution engines for these difficult anaphors. In this paper, we present an algorithm for the automatic resolution of one class of difficult anaphors: plural pronouns with split antecedents. We have studied split antecedents in Tamil, a morphologically rich, verb-final South Dravidian language, and developed an algorithm to resolve them. Consider example 1 given below:

b) avarkal ceythiyaalarkalai Ithirabath-aucil canththinar.
They press-people(N) Hyderabad-House(N)+loc meet(V)+pst+3p
(They met the press people at Hyderabad House.)
In the above example 1, the plural pronoun 'avarkal' (they) in sentence 1.b refers to nepal pirathamar ke. pi. ooli (Nepal Prime Minister K P Oli) and inthiya pirathamar moodiyai (Indian Prime Minister Modi), where these two entities occur in the subject and object positions of sentence 1.a.
Split antecedents are well studied within computational models for various languages, as detailed below. Kosuga (2014) has presented a study of the Japanese reciprocal anaphor 'otagai' with split antecedents. Han et al. (2011) have presented a behavioral study of the grammatical status of 'caki' in Korean, which takes split antecedents as its referent. Split antecedents have been considered in coreference annotation for various languages such as Spanish, Catalan, Italian, English and Polish. The MATE, AnCora, and Polish Coreference annotation schemas support the annotation of split antecedents. There are no published works on split antecedents in Indian languages, and particularly in Tamil, and our work is the first of its kind.
Split antecedents are well studied under different constructions. The following constructions have been explored in English. Split antecedents occur with a relative clause construction, as in example 2.
Ex 2: Mary met a man and John met a woman who know each other well. (McKinney-Bock, 2013)
There are different theoretical solutions for split antecedents in relative constructions. McKinney-Bock et al. (2013) have presented a head-external approach and Ning Zhang (2007) has proposed a syntactic derivation approach. Split antecedents also arise with VP-ellipsis constructions, as in example 3.

Ex 3:
'Sally wants to sail around the world and Barbara wants to fly to South Africa and they will, if money is available' (Webber 1978)
'Sally will sail around the world and Barbara will fly to South Africa'
Gatt and van Deemter (2009) have studied the characteristics of plural pronouns with split antecedents in the GNOME corpus. They have studied the similarity and the distance between plural pronouns and their antecedents. Cristea et al. (2002) have investigated the difficult problems that can arise in anaphora resolution and proposed some solutions within the framework of a general anaphora resolver. They have discussed a methodology for resolving plural pronouns with split antecedents. Consider example 4.
Ex 4: a) John waited for Maria. b) They went for pizza.
During the interpretation of the above sentences, a new discourse entity (DE) must be proposed for the group [John, Maria] as soon as the referential expression 'Maria' is parsed. Cristea et al. (2002) came up with the following set of ideas.
a) Groups should have the property that their elements are similar, and group formation is triggered by the first reference to the group.
b) A group is considered only if it is verbalized as such in the text; it does not exist until it is referred to.
c) World knowledge is needed for group identification. Similarity measures should be used to identify the members of the group.
d) A new DE should be proposed when no match between the current entity and the preceding DEs rises above a threshold.
With this introduction to split antecedents, the rest of the paper is organized as follows. The following section describes Tamil and the anaphora resolution works in this language. In the third section, we present our approach to resolving split antecedents in Tamil using selectional restriction rules, sub-categorization information and a salience measure (Lappin and Leass, 1994). The fourth section describes the experiments and the evaluation. The final section concludes the paper.

Pronoun Resolution in Tamil
Tamil is a morphologically rich and highly agglutinative language. It belongs to the Dravidian family of languages. It is a verb-final, nominative-accusative language with relatively free word order. The subject and the finite verb agree in person, number and gender (PNG). Similarly, a 3rd person pronoun agrees in PNG with its antecedent. 1st person and 2nd person pronouns agree in number with their antecedents. Among Indian languages, there are a few automatic anaphora resolution works for languages such as Tamil, Hindi, Bengali, Punjabi and Malayalam. As Byron (2001) observed for other systems, these resolution engines do not attempt the difficult anaphors. One of the earliest anaphora resolution works in Indian languages was 'Vasisth', presented by Sobha (2000, 2002) for Hindi and Malayalam. For Tamil, there are a few works on the resolution of third person pronouns. The details are as follows. Sobha (2007)

Our Approach for Resolution of Plural Pronouns with Split Antecedents
We attempt to resolve plural pronouns with split antecedents by using the selectional restriction rules of the verb, categorizing the nouns based on their sub-categorization information, and ranking the possible antecedents using salience factor weights. In the following sub-sections we explain the selectional restriction rules and the sub-categorization of nouns.

Selectional Restriction Rules
Verbs describe actions or processes, and this allows a verb to take only nouns with specific sub-categorization features as its syntactic arguments. These constraints are defined as the selectional restriction (SR) rules of a verb. Consider the sentence in example 5.
Ex 5: raam aappil caappittaan.
Ram(N) apple(N) eat(V)+past+3sn
'Ram ate an apple'.
Here 'raam' (Ram) has the sub-categorization features [+animate, +human] and 'aappil' (apple) has [+edible]. The SR features required by the verb 'caappitu' (eat) for selecting its subject and object are [+animate] and [+edible] respectively. If the SR rules are violated, the sentence can be syntactically correct but it will not be semantically correct (Arulmozhi 2006). The verb selects its arguments. We have grouped the verbs according to the sub-categorization information of their subject and object nouns. We analysed 1500 commonly used verb senses in-house and derived 500 SR rules from them. The SR rules do not cover figurative usage of language. The sub-categorization features of a noun are explained in the next section. A sample rule is shown in Figure 1.
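The SR check for example 5 can be sketched as a subset test between the features a verb requires and the features a noun carries. This is only an illustration under assumed encodings: the dictionaries, the function name `satisfies_sr` and the contrast noun 'kal' (stone) with its features are hypothetical and do not reproduce the rule format of Figure 1.

```python
# Hypothetical encoding (not the paper's rule format): each noun maps to a
# set of sub-categorization features, and each verb maps to the features it
# requires of its subject and object.
NOUN_FEATURES = {
    "raam": {"+animate", "+human"},   # from example 5
    "aappil": {"+edible"},            # from example 5
    "kal": {"-living", "-edible"},    # 'stone': invented entry for contrast
}

SR_RULES = {
    # 'caappitu' (eat): subject must be [+animate], object must be [+edible]
    "caappitu": {"subject": {"+animate"}, "object": {"+edible"}},
}


def satisfies_sr(verb, subject, obj):
    """Return True if both arguments carry every feature the verb requires."""
    rule = SR_RULES[verb]
    return (rule["subject"] <= NOUN_FEATURES[subject]
            and rule["object"] <= NOUN_FEATURES[obj])
```

With these assumed entries, `satisfies_sr("caappitu", "raam", "aappil")` holds, while replacing the object with 'kal' violates the rule: the sentence would remain syntactically well formed but semantically odd.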

Sub-Categorization
Sub-categorization features explain the nature of a noun. Essentially, the arguments of the verb, the subject and the object, are analyzed using these features. These features may include the type of the noun, its characteristics, its state, etc. There are 104 sub-categorization features in total. Using the sub-categorization features of the nouns, the SR features of the verb select the nouns as its syntactic arguments. We have categorised 4500 frequently occurring nouns in Tamil. The sub-categorization features for the noun 'aappil' (apple) are presented in Figure 2. These sub-categorization features are used as nodes in building a language ontology. This language ontology is built with respect to the usage of language. Due to this, it deviates substantially from the taxonomy of nature. The sub-categorization features of a noun can be obtained easily by traversing through the nodes (Arulmozhi, 2006). The nouns are grouped under each node, so we get coarse- to fine-grained information about each noun. The ontology starts with [+entity] as the head node, which divides into [+living] and [-living].
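The traversal that yields a noun's coarse- to fine-grained features can be illustrated with a small sketch. The child-to-parent table below is a hypothetical fragment that only uses node names appearing in this paper; the real ontology has 104 features and its exact shape is not reproduced here.

```python
# Hypothetical fragment of the ontology stored as child -> parent links.
# [+entity] is the root; node names are taken from the paper's examples.
PARENT = {
    "+living": "+entity",
    "-living": "+entity",
    "+animate": "+living",
    "+vertebrate": "+animate",
    "+mammal": "+vertebrate",
    "+human": "+mammal",
}


def feature_path(node):
    """Walk from a node up to the root and return the features root-first."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return list(reversed(path))
```

Under these assumptions, a noun attached at [+human] receives the path [+entity; +living; +animate; +vertebrate; +mammal; +human], which is the coarse-to-fine order used in the rest of the paper.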

Resolution of Plural Pronouns
Using the SR rules and the sub-categorization information of nouns, we resolve the plural pronouns in a two-step process. In the first step, we group the noun phrases to form groups which can be possible split antecedents. The nouns are grouped based on their sub-categorization information and in accordance with the verb's SR rules. Consider example 6. In example 6, there are three nouns before the plural pronoun 'avarkal' (they). The sub-categorization of these nouns is as follows: a) raam (Ram)
We describe the methodology for resolving plural pronouns which do not refer to a plural noun phrase, on text preprocessed with syntactic information such as morphological analysis (Ram et al, 2010), POS tags, chunk information, clause boundaries (Ram et al, 2012) and named entities (Malarkodi et al, 2012). The morphological analyser gives an in-depth analysis of each word, such as the root word, the suffixes and their labels, and person, number and gender (PNG) information. The clause boundary identifier marks the matrix clause and subordinate clause boundaries, which helps in adding positional constraint features.
The steps involved in resolving a plural pronoun are as follows. In the first step, we enrich the nouns and the verbs with their sub-categorization information and SR rules respectively. The named entities (NEs) are mapped to sub-categorization features, so we obtain the sub-categorization information from the NE information as shown in example 8.
Ex 8: a) Person: [+living; +animate; +vertebrate; +mammal; +human;] b) Location: [-living; -moveable; +landscape]
In the second step, when a plural pronoun is encountered in a sentence, the preceding portion of that sentence and the two preceding sentences are considered for analysis, as Gatt and van Deemter (2009) have shown that the antecedents of plural pronouns are only a few sentences away.
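The first two steps can be sketched as follows. The NE-to-feature table copies example 8; the function `candidate_window`, its arguments and the representation of a sentence as a list of tokens are assumptions made for illustration only.

```python
# Mapping from named-entity type to sub-categorization features (example 8).
NE_FEATURES = {
    "Person": ["+living", "+animate", "+vertebrate", "+mammal", "+human"],
    "Location": ["-living", "-moveable", "+landscape"],
}


def candidate_window(sentences, n, pronoun_index):
    """Return the text searched for antecedents of a plural pronoun.

    sentences: list of sentences, each a list of tokens (assumed format).
    n: index of the sentence containing the plural pronoun.
    pronoun_index: position of the pronoun inside sentence n.

    The window holds sentences n-2 and n-1 (when they exist) followed by
    the portion of sentence n that precedes the pronoun.
    """
    start = max(0, n - 2)
    window = [list(s) for s in sentences[start:n]]
    window.append(list(sentences[n][:pronoun_index]))
    return window
```

The `max(0, n - 2)` guard is a design choice for pronouns near the start of a text, where fewer than two preceding sentences are available.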
The noun phrases in the preceding sentences are analysed and grouped to form the possible antecedents. To be grouped, the NPs need to satisfy the following matching conditions.
a) NPs can be grouped together if they have the same sub-categorization information, or if their sub-categorization is the same up to the last but one node in the ontology. For example, [+living; +animate; +vertebrate; +mammal; +human; +female] and [+living; +animate; +vertebrate; +mammal; +human; -female] are considered to be the same, since they are identical up to the last but one node.
b) The exception is as follows: for NPs whose sub-categorization has [+living] but does not have [+human], we look for a sub-categorization match between the NPs only up to [+living; +animate], and such NPs are grouped together.
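One possible reading of the two matching conditions is sketched below. Feature paths are assumed to be lists ordered from coarse to fine; the function names are hypothetical, and the treatment of paths of different lengths (they do not match unless the exception applies) is our interpretation, since the conditions above do not cover that case explicitly.

```python
def _living_non_human(path):
    """True for paths that contain [+living] but not [+human]."""
    return "+living" in path and "+human" not in path


def nps_match(a, b):
    """Decide whether two NPs may be grouped, given their feature paths."""
    # Condition b): living, non-human NPs are compared only on their first
    # two nodes, i.e. up to [+living; +animate].
    if _living_non_human(a) and _living_non_human(b):
        return a[:2] == b[:2]
    # Condition a): identical paths, or identical up to the last but one node.
    if a == b:
        return True
    return len(a) == len(b) and len(a) > 1 and a[:-1] == b[:-1]
```

For instance, the [+female] and [-female] paths given in condition a) match, whereas a Person path and a Location path do not.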
The following steps form the possible candidates by grouping the NPs.
a) Identify the plural pronoun in the nth sentence.
b) Take the (n-2)th sentence, the (n-1)th sentence and the portion of the nth sentence preceding the plural pronoun to form the candidate sentence set.
c) For each sentence in the candidate sentence set, noun phrases with the conjunct suffix 'um' or the conjunct word 'maRRum' (and) are united to form conjunct NPs. From here onwards, the term NPs refers to both NPs and conjunct NPs.
d) For each sentence in the candidate sentence set, if there exist NPs satisfying the matching condition, then these NPs are grouped together.
e) Group the NPs that occur in the same syntactic argument position and satisfy the matching condition across the nth, (n-1)th and (n-2)th sentences.

In the third step, once the possible antecedents are formed by grouping the NPs, they are ranked based on salience factors derived from features of the NPs, such as the sub-categorization information of the NPs, the SR rules of the verbs that follow the NPs, and the syntactic argument position of the NPs in the sentences. The salience factor weights (Lappin and Leass, 1994) are described in Table 1. The weights for the salience factors are initially assigned manually based on linguistic considerations and then fine-tuned through experiments.

Table 1: Salience factors and their weights (only the following rows are recoverable)
S.No   Salience Factors          Weight
       NPs in n-1th sentence     20
9      NPs in n-2th sentence     10
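The ranking step can be sketched as a weighted sum over the salience factors that hold for each candidate group. Only the two sentence-distance weights (20 and 10) come from Table 1; the remaining factor names and values are placeholders invented for this illustration, as are the function names and the candidate format.

```python
WEIGHTS = {
    "in_prev_sentence": 20,       # Table 1: NPs in n-1th sentence
    "in_prev_prev_sentence": 10,  # Table 1: NPs in n-2th sentence
    "in_current_sentence": 30,    # placeholder value, NOT from the paper
    "satisfies_sr": 15,           # placeholder value, NOT from the paper
}


def salience(candidate):
    """Sum the weights of the salience factors that hold for a candidate.

    candidate: dict with 'nps' (the grouped NPs) and 'factors' (names of
    the factors that apply). This format is an assumption.
    """
    return sum(WEIGHTS[factor] for factor in candidate["factors"])


def rank(candidates):
    """Order candidate groups from most to least salient."""
    return sorted(candidates, key=salience, reverse=True)
```

The top-ranked group is then taken as the split antecedent of the plural pronoun. Because the weights were tuned experimentally in the paper, any reimplementation would need to re-estimate the placeholder values.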

Experiment, Results and Discussion
To analyse the plural pronouns, we chose a Tamil novel, 'Ponniyin Selvan', authored by Kalki, a well-known writer. As mentioned in Section 3.3, we processed the corpus with a morphological analyser, POS tagger, chunker, pruner, clause boundary identifier and named entity recognizer. The corpus is converted into a column format, where the information from each preprocessing module is added as a column. In the corpus, we considered the first 1000 occurrences of the plural pronouns 'avarkal' and 'avai'. These pronouns had four different types of antecedents: plural noun phrases, conjunct NPs, split antecedents and, in the case of 'avarkal', honorific NPs. The distribution of the pronouns with respect to their antecedents is presented in Table 2. In this experiment, we focus on plural pronouns with split antecedents. We considered the sentence containing the plural pronoun and its two preceding sentences. In this set of sentences, as mentioned in Section 3.3, we first tag the sub-categorization information of the nouns and the SR rules of the verbs. After forming the possible antecedents by grouping NPs, we rank them with the salience factor weights mentioned in Section 3.3 to find the antecedent. Performance is evaluated using accuracy as the measure. The results are presented in Table 3.

Table 3: Results
S.No   Total number of pronouns with split antecedents   Correctly tagged   Accuracy %
1      51                                                30                 58.82

On analyzing the output, we found errors when the preceding two sentences have similar NPs in the subject position. Consider the following example 9.

Conclusion
We have presented a methodology to resolve plural pronouns which refer to split antecedents in Tamil. Automatic resolution of split antecedents has rarely been attempted, and this work is the first of its kind for Tamil. Our algorithm works on salience measures; the salience factors for scoring are obtained from the sub-categorization information of the noun phrases and the SR rules of the verbs. We have tested the algorithm on plural pronouns occurring in a Tamil novel. The results are encouraging. We need to test this methodology on corpora from other domains.