Object-oriented Neural Programming (OONP) for Document Understanding

We propose Object-oriented Neural Programming (OONP), a framework for semantically parsing documents in specific domains. OONP reads a document and parses it into a predesigned object-oriented data structure that reflects the domain-specific semantics of the document. An OONP parser models semantic parsing as a decision process: a neural net-based Reader sequentially goes through the document, building and updating an intermediate ontology along the way to summarize its partial understanding of the text. OONP supports a wide variety of forms (both symbolic and differentiable) for representing the state and the document, and a rich family of operations for composing the representation. An OONP parser can be trained with supervision of different forms and strengths, including supervised learning (SL), reinforcement learning (RL), and a hybrid of the two. Our experiments on both synthetic and real-world document parsing tasks show that OONP can learn to handle fairly complicated ontologies with training data of modest size.


Introduction
Mapping a document into a structured "machine readable" form is a canonical and probably the most effective way of understanding documents. There have been a number of recent efforts to design neural net-based learning machines for this purpose, which can be roughly categorized into two groups: 1) sequence-to-sequence models with the neural net as the black box [Dong and Lapata, 2016, Liang et al., 2017], and 2) neural nets as components in a predesigned statistical model [Zeng et al., 2014]. We argue, however, that both approaches have serious problems of their own and cannot be used on documents with relatively complicated structures. To address this problem, we propose Object-oriented Neural Programming (OONP), a framework for semantically parsing in-domain documents. OONP is neural net-based, but it also has a sophisticated architecture and mechanisms designed for taking and outputting discrete structures, hence nicely combining symbolism (for interpretability and formal reasoning) and connectionism (for flexibility and learnability). This ability, as we argue in this paper, is critical to document understanding.
OONP seeks to map a document to a graph structure with each node being an object, as illustrated in Figure 1. We borrow the name from Object-oriented Programming [Mitchell, 2003] to emphasize the central position of "objects" in our parsing model: indeed, the representation of objects in OONP allows neural and symbolic reasoning over complex structures and hence makes it possible to represent much richer semantics. Similar to Object-oriented Programming, OONP has the concepts of "class" and "object", with the following analogies: 1) each class defines the types and organization of the information it contains, and we can define inheritance for classes at different levels of abstraction as needed; 2) each object is an instance of a certain class, encapsulating a number of properties and operations; 3) objects can be connected through relations (called links) of pre-determined types. Based on objects, we can define the ontology and operations that reflect the intrinsic structure of the parsing task.
For parsing, OONP reads a document and parses it into this object-oriented data structure through a series of discrete actions taken while reading the document sequentially. OONP supports a rich family of operations for composing the ontology, and flexible hybrid forms for knowledge representation. An OONP parser can be trained with supervised learning (SL), reinforcement learning (RL), and a hybrid of the two. Our experiments on one synthetic dataset and two real-world datasets show the efficacy of OONP on document understanding tasks with a variety of characteristics.
In addition to the work on semantic parsing mentioned above, OONP is also related to multiple threads of work in natural language processing and machine learning. It is inspired by [Daumé III et al., 2009] on modeling parsing as a decision process, and also by state-tracking models in dialogue systems [Henderson et al., 2014] for the mixture of symbolic and probabilistic representations of dialogue state. OONP is also related to [Johnson, 2017] for modeling the transition of symbolic states and to [Henaff et al., 2016] for explicit (although not thorough) modeling of entities. OONP is also clearly related to the recent work on neural-symbolism [Mou et al., 2017, Liang et al., 2017].

Overview of OONP
An OONP parser (illustrated in Figure 2) consists of a Reader equipped with read/write heads, Inline Memory that represents the document, and Carry-on Memory that summarizes the current understanding of the document at each time step. For each document to parse, OONP first preprocesses it and puts it into the Inline Memory; Reader then controls the read-heads to go through the Inline Memory sequentially (possibly multiple times, see Section 6.3 for an example) while updating the Carry-on Memory. The major components of OONP are described in the following:
• Memory: we have two types of memory, Carry-on Memory and Inline Memory. Carry-on Memory is designed to save the state of the decision process and summarize the current understanding of the document based on the text that has been "read". Carry-on Memory has three compartments:
- Object Memory: denoted as $M_{\text{obj}}$, the object-based ontology constructed during the parsing process, see Section 2.1 for details;
- Matrix Memory: denoted as $M_{\text{mat}}$, a matrix-type memory of fixed size, for differentiable read/write by the controlling neural net [Graves et al., 2014]. In the simplest case, it could be just a vector serving as the hidden state of a conventional Recurrent Neural Network (RNN);
- Action History: denoted as $M_{\text{act}}$, saving the entire history of actions made during the parsing process.
Intuitively, $M_{\text{obj}}$ stores the extracted knowledge with defined structure and strong evidence, while $M_{\text{mat}}$ keeps the knowledge that is fuzzy, uncertain or incomplete, waiting for future information to confirm, complete and clarify it. Inline Memory, denoted $M_{\text{inl}}$, is designed to save location-specific information about the document. In a sense, the information in Inline Memory is low-level and unstructured, waiting for Reader to fuse and integrate it into a more structured representation.
• Reader: Reader is the control center of OONP, coordinating and managing all the operations of OONP. More specifically, it takes input of different forms (reading), processes it (thinking), and updates the memory (writing). As shown in Figure 3, Reader contains a Neural Net Controller (NNC) and multiple symbolic processors, and Neural Net Controller has Policy-net as its sub-component. Similar to the controller in the Neural Turing Machine [Graves et al., 2014], Neural Net Controller is equipped with multiple read-heads and write-heads for differentiable read/write over Matrix Memory and (the distributed part of) Inline Memory, possibly with a variety of addressing strategies [Graves et al., 2014]. Policy-net, in contrast, issues discrete outputs (i.e., actions), which gradually build and update the Object Memory over time (see Section 2.1 for more details). The actions can also update the symbolic part of Inline Memory if needed. The symbolic processors are designed to handle information in symbolic form from Object Memory, Inline Memory, Action History, and Policy-net, where the symbolic information in Inline Memory and Action History is itself eventually generated by Policy-net.

Figure 3: The overall diagram of OONP
We illustrate how the major components of OONP work together through the following simplified example. In reading the following text OONP has reached the underlined word "BMW" in Inline Memory. At this moment, OONP has two objects (I01 and I02) in Object Memory, for Audi-06 and BMW respectively. Reader determines that the information it is currently holding is about I02 (after comparing it with both objects) and updates its Status property to sold, along with other updates to both Matrix Memory and Action History.
OONP in a nutshell: The key properties of OONP can be summarized as follows:
1. OONP models parsing as a decision process: as the "reading and comprehension" agent goes through the text, it gradually forms the ontology as the representation of the text through its actions;
2. OONP uses a symbolic memory with graph structure as part of the state of the parsing process. This memory is created and updated through the sequential actions of the decision process, and is used as the semantic representation of the text at the end;
3. OONP can blend supervised learning (SL) and reinforcement learning (RL) in tuning its parameters to suit supervision signals of different forms and strengths;
4. OONP allows different ways to add symbolic knowledge into the raw representation of the text (Inline Memory) and its policy net in forming the final structured representation of the text.
Roadmap of the paper: The rest of the paper is organized as follows. We elaborate on the components of OONP in Section 2 and the actions of OONP in Section 3. After that we give a detailed analysis of the neural-symbolism in OONP in Section 4. In Section 5 we discuss learning for OONP, which is followed by experiments on three datasets in Section 6. Finally we conclude the paper in Section 7.

OONP: Components
In this section we will discuss the major components in OONP, namely Object Memory, Inline Memory and Reader. We omit the discussion on Matrix Memory and Action History since they are straightforward given the description in Section 1.1.

Object Memory
Object Memory stores an object-oriented representation of the document, as illustrated in Figure 4. Each object is an instance of a particular class †, which specifies the internal structure of the object, including its internal properties, its operations, and how it can be connected with other objects. The internal properties can be of different types, for example string or category, which usually correspond to different actions for composing them: a string-type property is usually "copied" from the original text in Inline Memory, while a category-type property usually needs to be rendered by a classifier. The links are by nature bi-directional, meaning that a link can be added from either end (e.g., in the experiment in Section 6.1), but for modeling convenience we may choose to make it one-directional (e.g., in the experiments in Sections 6.2 and 6.3). In Figure 4, there are six "linked" objects of three classes (namely, Person, Event, and Item). Taking Item-object I02 for example, it has five internal properties (Type, Model, Color, Value, Status), and is linked with two Event-objects through a stolen link and a disposed link respectively. In addition to the symbolic part, each object also has its own distributed representation (named object-embedding), which serves as its interface with other distributed representations in Reader (e.g., those from the Matrix Memory or the distributed part of Inline Memory). For simplicity of description, we will refer to the symbolic part of this hybrid representation of objects as the ontology, with some slight abuse of the word. The object-embedding serves as a dual representation to the symbolic part of an object, recording all the relevant information associated with it but not represented in the ontology, e.g., the context of the text when the object is created.
The representations in Object Memory, including the ontology and the object-embeddings, are updated over time by the operations defined for the corresponding classes. Usually, the actions are the driving force in those operations: they not only initiate and grow the ontology, but also coordinate other differentiable operations. For example, the object-embedding associated with a certain object changes with any non-trivial action concerning this object, e.g., any update on the internal properties or the external links, or even a mention (corresponding to an Assign action described in Section 3) without any update.
† In this paper, we limit ourselves to a flat structure of classes, but it is possible and even beneficial to have a hierarchy of classes. In other words, we can have classes at different levels of abstraction, and allow an object to go from an abstract class to its child class during the parsing process as more and more information is obtained.
According to the way the ontology evolves with time, the parsing task can be roughly classified into two categories:
• Stationary: there is a final ground truth that does not change with time. So with any partial history of the text, the corresponding ontology is always part of the final one, while the missing part is due to the lack of information. See the tasks in Sections 6.2 and 6.3 for examples.
• Dynamical: the truth changes with time, so the ontology corresponding to a partial history of the text may be different from that of the final state. See the task in Section 6.1 for an example.
It is important to notice that this categorization depends not only on the text but also heavily on the definition of the ontology. Taking the text in Figure 1 for example: if we define an ownership relation between a Person-object and an Item-object, the ontology becomes dynamical, since the ownership of the BMW changes from Tom to John.
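The symbolic part of Object Memory can be pictured with a small data-structure sketch. This is only an illustration under stated assumptions: the names `OONPObject`, `ObjectMemory`, `new`, and `add_link`, the identifier `E01`, the property values, and the four-dimensional embedding are all hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class OONPObject:
    """One object: symbolic properties plus a distributed object-embedding."""
    cls: str                                   # class name, e.g. "Person", "Item", "Event"
    oid: str                                   # identifier, e.g. "I02"
    properties: dict = field(default_factory=dict)
    embedding: list = field(default_factory=lambda: [0.0] * 4)  # toy size (assumption)


@dataclass
class ObjectMemory:
    """Symbolic ontology: objects plus typed links between them."""
    objects: dict = field(default_factory=dict)
    links: set = field(default_factory=set)    # (source id, link type, target id)

    def new(self, cls, oid, **properties):
        self.objects[oid] = OONPObject(cls, oid, dict(properties))
        return self.objects[oid]

    def add_link(self, src, link_type, dst):
        # Both ends must already exist in the ontology.
        if src not in self.objects or dst not in self.objects:
            raise KeyError("both ends of a link must exist")
        self.links.add((src, link_type, dst))


# Illustrative values only; "E01" and the property values are made up.
memory = ObjectMemory()
memory.new("Item", "I02", Type="car", Model="BMW")
memory.new("Event", "E01")
memory.add_link("I02", "stolen", "E01")
memory.objects["I02"].properties["Status"] = "sold"   # a property update
```

In this sketch a dynamical ontology would simply overwrite or remove entries in `links` over time, whereas a stationary one would only ever add to them.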

Inline Memory
Inline Memory stores the relatively raw representation of the document that follows the temporal structure of the text, as illustrated in Figure 2. Basically, Inline Memory is an array of memory cells, each corresponding to a pre-defined language unit (e.g., a word), in the same order as in the original text. Each cell can have a distributed part and a symbolic part, designed to save 1) the results of preprocessing the text with different models, and 2) certain outputs from Reader, for example from previous reading rounds. The following are a few examples of preprocessing:
• Word embedding: context-independent vectorial representation of words;
• Hidden states of NNs: we can put the context into the local representation of words through a gated RNN like LSTM [Greff et al., 2015] or GRU [Cho et al., 2014], or a particular design of convolutional neural nets (CNN) [Yu and Koltun, 2015];
• Symbolic preprocessing: this refers to a big family of methods that yield symbolic results, including various sequential labeling models and rule-based methods. As a result we may have tags on words, extracted sub-sequences, or even relations between two pieces.
During the parsing process, Reader can write to Inline Memory with its discrete or continuous outputs, a process we name "notes-taking". When the output is continuous, the notes-taking process is similar to the interactive attention in machine translation [Meng et al., 2016], and comes from an NTM-style write-head [Graves et al., 2014] on Neural Net Controller. When the output is discrete, the notes-taking is essentially an action issued by Policy-net. Inline Memory provides a way to represent locally encoded "low level" knowledge of the text, which will be read, evaluated and combined with the global semantic representation in Carry-on Memory by Reader. One particular advantage of this setting is that it allows us to incorporate the local decisions of other models, including "higher order" ones like local relations across two language units, as illustrated in the left panel of Figure 5. We can also have a rather "nonlinear" representation of the document in Inline Memory. As a particular example [Yan et al., 2017], at each location we can have the representation of the current word, the representation of the rest of the sentence, and the representation of the rest of the current paragraph, which enables Reader to see information about the history and the future at different scales, as illustrated in the right panel of Figure 5.
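The cell structure of Inline Memory described in this section can be sketched as follows. The names `InlineCell`, `build_inline_memory`, `toy_embed`, and `toy_tagger`, the token sequence, and the note string are hypothetical stand-ins; a real system would plug in a trained embedding model and sequence labeler.

```python
from dataclasses import dataclass, field


@dataclass
class InlineCell:
    """One cell per language unit (here: a word)."""
    token: str
    distributed: list                              # e.g. word embedding or RNN hidden state
    symbolic: dict = field(default_factory=dict)   # e.g. tags from preprocessing
    notes: list = field(default_factory=list)      # discrete notes written by Reader


def build_inline_memory(tokens, embed, tagger):
    """Preprocess a token sequence into an ordered list of cells."""
    cells = []
    for token in tokens:
        cell = InlineCell(token=token, distributed=embed(token))
        tag = tagger(token)
        if tag is not None:
            cell.symbolic["tag"] = tag
        cells.append(cell)
    return cells


# Toy stand-ins (assumptions) for a real embedding model and sequence labeler.
def toy_embed(token):
    return [float(len(token)), float(token[0].isupper())]


def toy_tagger(token):
    return "ENTITY" if token[0].isupper() else None


inline = build_inline_memory(["John", "sold", "the", "BMW"], toy_embed, toy_tagger)
inline[3].notes.append("assigned-to:I02")   # discrete "notes-taking" by Reader
```

Keeping the symbolic and distributed parts in the same cell preserves the word order of the text while letting later reading rounds see what earlier rounds wrote.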

Reader
Reader is the control center of OONP, which manages all the (continuous and discrete) operations in the OONP parsing process. Reader has three symbolic processors (namely, Symbolic Matching, Symbolic Reasoner, Symbolic Analyzer) and a Neural Net Controller (with Policy-net as the sub-component). All the components in Reader are coupled through intensive exchange of information, as shown in Figure 6. Below is a snapshot of the information processing at time t in Reader:
• STEP-1: let the processor Symbolic Analyzer check the Action History ($M^t_{\text{act}}$) to construct some symbolic features for the trajectory of actions;
• STEP-2: access Matrix Memory ($M^t_{\text{mat}}$) to get a vectorial representation for time t, denoted as $s_t$;
• STEP-3: access Inline Memory ($M^t_{\text{inl}}$) to get the symbolic representation $x^{(s)}_t$ (through location-based addressing) and the distributed representation $x^{(d)}_t$ (through location-based addressing and/or content-based addressing);
• STEP-4: feed $x^{(d)}_t$ and the embedding of $x^{(s)}_t$ to Neural Net Controller to fuse with $s_t$;
• STEP-5: get the candidate objects (some may have been eliminated by $x^{(s)}_t$) and let them meet $x^{(d)}_t$ through the processor Symbolic Matching, for matching on the symbolic aspect;
• STEP-6: get the candidate objects (some may have been eliminated by $x^{(s)}_t$) and let them meet the result of STEP-4 in Neural Net Controller;
• STEP-7: Policy-net combines the results of STEP-6 and STEP-5 to issue actions;
• STEP-8: update $M^t_{\text{obj}}$, $M^t_{\text{mat}}$ and $M^t_{\text{inl}}$ with actions on both symbolic and distributed representations;
• STEP-9: put $M^t_{\text{obj}}$ through the processor Symbolic Reasoner for some high-level reasoning and logic consistency.
Note that we consider only a single action per time step for simplicity, while in practice it is common to have multiple actions at one time step, which requires a slightly more complicated design of the policy as well as the processing pipeline. The actions issued by Policy-net can be generally categorized as follows:
• New-Assign: determine whether to create a new object (a "New" operation) for the information at hand or assign it to a certain existing object;
• Update.X: determine which internal property or external link of the selected object to update;
• Update2what: determine the content of the update, which could be a string, a category, or a link.
The typical order of actions is New-Assign → Update.X → Update2what, but it is very common to have a New-Assign action followed by nothing, for example when an object is mentioned but no substantial information is provided.

New-Assign
With any information at hand (denoted as $S_t$) at time t, the choices of New-Assign typically fall into the following three categories of actions: 1) creating (New) an object of a certain type, 2) assigning $S_t$ to an existing object, and 3) doing nothing for $S_t$ and moving on. For Policy-net, the stochastic policy is to determine the following probabilities:
$\text{prob}(c, \text{new} \mid S_t)$, $c = 1, 2, \cdots, |C|$
$\text{prob}(c, k \mid S_t)$, for $O^{c,k}_t \in M^t_{\text{obj}}$
$\text{prob}(\text{none} \mid S_t)$
where $|C|$ stands for the number of classes and $O^{c,k}_t$ stands for the k-th object of class c at time t. Determining whether to create a new object always relies on the following two signals:
1. The information at hand cannot be contained by any existing object;
2. Linguistic hints that suggest whether a new object is introduced.
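Based on those intuitions, we take a score-based approach to determine the above-mentioned probabilities. More specifically, for a given $S_t$, Reader forms a "temporary" object with its own structure (denoted $\hat{O}_t$), including symbolic and distributed sections. In addition, we also have a virtual object for the New action for each class c, denoted $O^{c,\text{new}}_t$, which is typically a time-dependent vector formed by Reader based on information in $M^t_{\text{mat}}$. For a given $\hat{O}_t$, we can then define the following $2|C| + 1$ types of score functions, namely
New an object of class c: $\text{score}^{(c)}_{\text{new}}(O^{c,\text{new}}_t, \hat{O}_t; \theta^{(c)}_{\text{new}})$, $c = 1, 2, \cdots, |C|$
Assign to existing objects: $\text{score}^{(c)}_{\text{assign}}(O^{c,k}_t, \hat{O}_t; \theta^{(c)}_{\text{assign}})$, $O^{c,k}_t \in M^t_{\text{obj}}$
Do nothing: $\text{score}_{\text{none}}(\hat{O}_t; \theta_{\text{none}})$
to measure the level of matching between the information at hand and existing objects, as well as the likelihood of creating an object or doing nothing. This process is illustrated in Figure 7. We can therefore define the following probabilities for the stochastic policy:
$\text{prob}(c, \text{new} \mid S_t) = e^{\text{score}^{(c)}_{\text{new}}(O^{c,\text{new}}_t, \hat{O}_t; \theta^{(c)}_{\text{new}})} / Z(t)$
$\text{prob}(c, k \mid S_t) = e^{\text{score}^{(c)}_{\text{assign}}(O^{c,k}_t, \hat{O}_t; \theta^{(c)}_{\text{assign}})} / Z(t)$
$\text{prob}(\text{none} \mid S_t) = e^{\text{score}_{\text{none}}(\hat{O}_t; \theta_{\text{none}})} / Z(t)$
where $Z(t) = \sum_{c'} e^{\text{score}^{(c')}_{\text{new}}(O^{c',\text{new}}_t, \hat{O}_t; \theta^{(c')}_{\text{new}})} + \sum_{(c'', k')} e^{\text{score}^{(c'')}_{\text{assign}}(O^{c'',k'}_t, \hat{O}_t; \theta^{(c'')}_{\text{assign}})} + e^{\text{score}_{\text{none}}(\hat{O}_t; \theta_{\text{none}})}$ is the normalizing factor, with the second sum running over all objects $O^{c'',k'}_t$ in $M^t_{\text{obj}}$.
Many actions are essentially trivial on the symbolic part, for example when Policy-net chooses none in New-Assign, or assigns the information at hand to an existing object but chooses to update nothing in Update.X. Such actions nevertheless affect the distributed operations in Reader, which in turn change the representation in Matrix Memory or the object-embedding in Object Memory.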
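The score-based policy amounts to a softmax over all New, Assign, and none options. The sketch below assumes the scores have already been computed by some scoring network; the function name `new_assign_policy`, the dictionary layout, and the score values are hypothetical.

```python
import math


def new_assign_policy(new_scores, assign_scores, none_score):
    """Softmax over New (per class), Assign (per existing object), and none.

    new_scores:    {class_name: score}
    assign_scores: {(class_name, k): score}
    none_score:    float
    """
    scores = {("new", c): s for c, s in new_scores.items()}
    scores.update({("assign", c, k): s for (c, k), s in assign_scores.items()})
    scores[("none",)] = none_score

    # Subtract the maximum before exponentiating for numerical stability.
    top = max(scores.values())
    exp = {a: math.exp(s - top) for a, s in scores.items()}
    z = sum(exp.values())                      # normalizing factor
    return {a: v / z for a, v in exp.items()}


# Made-up scores for two classes and two existing Item-objects.
probs = new_assign_policy(
    new_scores={"Person": 0.2, "Item": -1.0},
    assign_scores={("Item", 1): 2.5, ("Item", 2): 0.1},
    none_score=0.0,
)
best = max(probs, key=probs.get)
```

With these scores the most probable action is assigning the information at hand to the first Item-object; sampling from `probs` instead of taking the maximum gives the stochastic policy used in reinforcement learning.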

Updating objects: Update.X and Update2what
In the Update.X step, Policy-net needs to choose the property or external link (or none) to update for the object selected in the New-Assign step. If Update.X chooses to update an external link, Policy-net needs to further determine which object it links to. After that, Update2what updates the chosen property or link. In tasks with a static ontology, most internal properties and links are "locked" after they are updated for the first time, with exceptions for a few semi-structured properties (e.g., the Description property in the experiment in Section 6.2). For a dynamical ontology, on the contrary, many important properties and links are always subject to change. A link can often be determined from both ends, e.g., the link stating that "Tina (a Person-object) carries apple (an Item-object)" can be specified either from Tina (by adding the link "carry" to apple) or from apple (by adding the link "iscarriedby" to Tina), as in the experiment in Section 6.1. In practice, it is often more convenient to make links asymmetrical to reduce the size of the action space.
In practice, for a particular type of ontology, both Update.X and Update2what can often be greatly simplified, for example:
• when the selected object (in the New-Assign step) has only one property "unlocked", the Update.X step will be trivial;
• in $S_t$, there is often information from Inline Memory that tells us the basic type of the current information, which can often automatically decide the property or link.

An example
In Figure 8, we give an example of an entire episode of OONP parsing on the short text given in Figure 1. Note that, unlike in our treatment of actions elsewhere in the paper, we let some selection actions (e.g., Assign) be absorbed into the updating actions to simplify the illustration.

OONP: Neural-Symbolism
OONP offers a way to parse a document that imitates the cognitive process of a human reading and comprehending a document: OONP maintains a partial understanding of the document as a mixture of symbolic representations (for clearly inferred structural knowledge) and distributed representations (for knowledge without complete structure or with great uncertainty). As shown in Figure 2, Reader takes and issues both symbolic signals and continuous signals, and they are entangled through Neural Net Controller. OONP has plenty of room for symbolic processing: in the implementation in Figure 6, it is carried out by the three symbolic processors. For each of the symbolic processors, the input symbolic representation could be rendered partially by neural models, therefore providing an intriguing way to entangle neural and symbolic components. Here are three examples we implemented for two different tasks:
1. Symbolic analysis in Action History: There are many symbolic summaries of history that we can extract or construct from the sequence of actions, e.g., "The system just created (New) an object of Person-class five words ago" or "The system just put a paragraph starting with '(2)' into event-3". In the implementation of Reader shown in Figure 6, this analysis is carried out by the component called Symbolic Analyzer. Based on this more structured representation of history, Reader might be able to make an informed guess like "If the coming paragraph starts with '(3)', we might want to put it into event-2" based on symbolic reasoning. This kind of guess can be directly translated into features to assist Reader's decisions, resembling what we do with high-order features in CRF [Lafferty et al., 2001], but the sequential decision makes it possible to construct a much richer class of features from symbolic reasoning, including those with recursive structure. One example of this can be found in [Yan et al., 2017], as a special case of OONP on event identification.

2. Symbolic reasoning on Object Memory: we can use an extra Symbolic Reasoner to take care of the high-order logic reasoning after each update of the Object Memory caused by the actions. This can be illustrated through the following example. Tina (a Person-object) carries an apple (an Item-object), and Tina moves from kitchen (a Location-object) to garden (a Location-object) at time t. Supposing we have both the Tina-carry-apple and Tina-islocatedat-kitchen relations kept in Object Memory at time t, and OONP updates Tina-islocatedat-kitchen to Tina-islocatedat-garden at time t+1, the Symbolic Reasoner can help to update the relation apple-islocatedat-kitchen to apple-islocatedat-garden. This is feasible since the Object Memory is supposed to be logically consistent. This external logic-based update is often necessary, since it is hard to let the Neural Net Controller see the entire Object Memory, due to the difficulty of finding a distributed representation of the dynamic structure there. Please see Section 6.1 for experiments.
3. Symbolic prior in New-Assign: When Reader determines a New-Assign action, it needs to match the information at hand ($S_t$) against the existing objects. There is a rich set of symbolic priors that can be added to this matching process in the Symbolic Matching component. For example, if $S_t$ contains a string labeled as an entity name (in preprocessing), we can use some simple rules (part of the Symbolic Matching component) to determine whether it is compatible with an object with the internal property Name.
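The Tina-and-apple example of symbolic reasoning on Object Memory can be written as a single rule over link triples. This is a sketch of one possible rule, not the Symbolic Reasoner used in the experiments; the function name `propagate_location` and the triple format are assumptions, and the link names follow the spelling used in the example.

```python
def propagate_location(links, person, new_location):
    """Move a person and, by domain logic, every item the person carries.

    links is a set of (source, link_type, target) triples.
    Returns a new set; the input is not modified.
    """
    carried = {dst for (src, rel, dst) in links if src == person and rel == "carry"}
    movers = carried | {person}
    # Drop the stale location links of the person and the carried items.
    updated = {
        (src, rel, dst)
        for (src, rel, dst) in links
        if not (rel == "islocatedat" and src in movers)
    }
    updated |= {(m, "islocatedat", new_location) for m in movers}
    return updated


links_t = {
    ("Tina", "carry", "apple"),
    ("Tina", "islocatedat", "kitchen"),
    ("apple", "islocatedat", "kitchen"),
}
links_t1 = propagate_location(links_t, "Tina", "garden")
```

Because the rule is applied outside the neural controller, the controller only has to decide that Tina moved; the consistency of the apple's location follows from the logic.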

Learning
The parameters of OONP models (denoted Θ) include those for all operations and those for composing the distributed sections in Inline Memory. They can be trained with different learning paradigms: OONP takes both supervised learning (SL) and reinforcement learning (RL), while allowing different ways to mix the two. Basically, with supervised learning, the oracle gives the ground truth about the "right action" at each time step during the entire decision process, with which the parameters can be tuned to maximize the likelihood of the truth. In a sense, SL represents rather strong supervision, which is related to imitation learning [Stefan, 1999] and often requires the labeler (expert) to give not only the final truth but also when and where a decision is made. For supervised learning, the objective function is the (log-)likelihood of the ground-truth actions accumulated over all instances and time steps, where N stands for the number of instances, $T_i$ stands for the number of steps in the decision process for the i-th instance, $\pi^{(i)}_t[\cdot]$ stands for the probabilities of the feasible actions at t from the stochastic policy, and $a_t$ stands for the ground truth action in step t.
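One form of the supervised objective that is consistent with the quantities defined above is the average log-likelihood of the ground-truth actions. The per-episode normalization by $T_i$ and the sign convention (written here so that larger is better) are assumptions, not details confirmed by the text.

```latex
\mathcal{J}_{\text{SL}}(\Theta)
  = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{T_i} \sum_{t=1}^{T_i}
    \log \pi^{(i)}_t\left[a_t\right]
```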
With reinforcement learning, the supervision is given as rewards during the decision process, for which an extreme case is to give only a final reward at the end of the decision process by comparing the generated ontology and the ground truth, e.g.,
$r^{(i)}_t = \text{match}(M^{T_i}_{\text{obj}}, G_i)$ if $t = T_i$, and $r^{(i)}_t = 0$ otherwise,
where $\text{match}(M^{T_i}_{\text{obj}}, G_i)$ measures the consistency between the ontology in $M^{T_i}_{\text{obj}}$ and the ground truth $G_i$. We can use any policy search algorithm to maximize the expected total reward. With the commonly used REINFORCE [Williams, 1992] for training, the gradient is estimated from sampled episodes by weighting the gradient of the log-probability of each sampled action with the reward that follows it.
When OONP is applied to real-world tasks, there are often quite natural roles for both SL and RL. More specifically, for a "static ontology" one can often infer some of the right actions at certain time steps by observing the final ontology, based on some basic assumptions, e.g.,
• the system should New an object the first time it is mentioned,
• the system should put an extracted string (say, that for Name) into the right property of the right object at the end of the string.
For those that cannot be fully reverse-engineered, say the categorical properties of an object (e.g., Type for event objects), we have to resort to RL to determine the time of decision, while we also need SL to train Policy-net on the content of the decision. Fortunately, it is quite straightforward to combine the two learning paradigms in optimization. More specifically, we maximize this combined objective
$\mathcal{J}(\Theta) = \mathcal{J}_{\text{SL}}(\Theta) + \lambda \mathcal{J}_{\text{RL}}(\Theta)$, (4)
where $\mathcal{J}_{\text{SL}}$ and $\mathcal{J}_{\text{RL}}$ are over the parameters within their own supervision modes and λ coordinates the weight of the two learning modes on the parameters they share. Equation 4 actually indicates a deep coupling of supervised learning and reinforcement learning, since for any episode the samples of actions related to RL might affect the inputs to the models under supervised learning. For a dynamical ontology (see Section 6.1 for an example), it is impossible to derive most of the decisions from the final ontology, since they may change over time. For those, we have to rely mostly on the supervision at each time step to train the actions (supervised mode), or count on the model to learn the dynamics of the ontology evolution by fitting the final ground truth. Both scenarios are discussed in Section 6.1 on a synthetic task.
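How the two supervision modes combine into a single scalar can be sketched on plain floats. Everything here is an assumption made for illustration: the function names, the per-episode averaging in the SL term, the use of a single final reward per episode, and the placement of the weight `lam` on the RL term. In a real implementation the log-probabilities would come from Policy-net and be differentiated automatically.

```python
def sl_objective(gold_log_probs):
    """Average log-probability of the ground-truth actions, per episode then overall."""
    per_episode = [sum(lp) / len(lp) for lp in gold_log_probs]
    return sum(per_episode) / len(per_episode)


def reinforce_surrogate(sampled_log_probs, final_rewards):
    """REINFORCE surrogate with a single reward at the end of each episode.

    Treating the rewards as constants, the gradient of this value with respect
    to the policy parameters is the usual score-function gradient estimate.
    """
    terms = [r * sum(lp) for lp, r in zip(sampled_log_probs, final_rewards)]
    return sum(terms) / len(terms)


def combined_objective(gold_log_probs, sampled_log_probs, final_rewards, lam):
    """Weighted sum of the supervised and reinforcement terms."""
    return sl_objective(gold_log_probs) + lam * reinforce_surrogate(
        sampled_log_probs, final_rewards
    )


# Two toy episodes with made-up log-probabilities and final rewards.
gold = [[-0.1, -0.3], [-0.2]]
sampled = [[-1.0, -0.5], [-0.25]]
rewards = [1.0, 0.0]
value = combined_objective(gold, sampled, rewards, lam=0.5)
```

An episode with zero final reward contributes nothing to the RL term, which is why weak end-of-episode supervision benefits from being mixed with step-level supervision where it is available.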

Experiments
We applied OONP to three document parsing tasks, to verify its efficacy on parsing documents with different characteristics and to investigate different components of OONP.

Data and task
We implemented OONP on an enriched version of the bAbI tasks [Johnson, 2017] with intermediate representation for history of arbitrary length. In this experiment, we considered only the original bAbI task-2 [Weston et al., 2015], with an instance shown in the left panel of Figure 9. The ontology has three types of objects: Person-object, Item-object, and Location-object, and three types of links:
1. is-located-at$_A$: between a Person-object and a Location-object;
2. is-located-at$_B$: between an Item-object and a Location-object;
3. carry: between a Person-object and an Item-object;
which could be rendered by descriptions in different ways. All three types of objects have Name as the only internal property.
Figure 9: One instance of bAbI (6-sentence episode) and the ontology of two snapshots.
The task for OONP is to read an episode of a story and recover the trajectory of the evolving ontology. We choose this synthetic dataset because it has a dynamical ontology that evolves with time and a ground truth given for each snapshot, as illustrated in Figure 9. Compared with the real-world tasks we will present later, bAbI has almost trivial internal properties but relatively rich opportunities for links, considering that any two objects of different types could potentially have a link.

Implementation details
For preprocessing, we have a trivial NER to find the names of people, items and locations (saved in the symbolic part of Inline Memory) and a word-level bi-directional GRU for the distributed representations in Inline Memory. In the parsing process, Reader goes through the Inline Memory word by word in the temporal order of the original text, and makes a New-Assign action at every word, leaving Update.X and Update2what actions to the time steps when the read-head on Inline Memory reaches a punctuation mark (see more details of the actions in Table 1). For this simple task, we use an almost fully neural Reader (with MLPs for Policy-net) and a vector for Matrix Memory, but with a Symbolic Reasoner for some logic reasoning after each update of the links, as illustrated through the following example. Suppose at time t, the ontology in $M^t_{\text{obj}}$ contains the following three facts (among others):
• fact-1: John (a Person-object) is in kitchen (a Location-object);
• fact-2: John carries apple (an Item-object);
• fact-3: John drops apple;
where fact-3 has just been established by Policy-net at t. Symbolic Reasoner will add a new is-located-at$_B$ link between apple and kitchen based on domain logic.

NewObject(c)
Create a new object of class-c.

AssignObject(c, k)
Assign the current information to the existing object (c, k).

Update(c, k).AddLink(c', k', ℓ)
Add a link of type-ℓ from object-(c, k) to object-(c', k').

Update(c, k).DelLink(c', k', ℓ)
Delete the link of type-ℓ from object-(c, k) to object-(c', k').

Table 1: Actions for bAbI.

Results and Analysis
For training, we use 1,000 episodes with lengths evenly distributed from one to six. We use just REINFORCE with only the final reward, defined as the overlap between the generated ontology and the ground truth, while step-by-step supervision on actions yields almost perfect results (results omitted). For evaluation, we use the following two metrics:
• the Rand index [Rand, 1971] between the generated set of objects and the ground truth, which counts both the duplicate objects and missing ones, averaged over all snapshots of all test instances;
• the F1 [Rijsbergen, 1979] between the generated links and the ground truth, averaged over all snapshots of all test instances, since the links are typically sparse compared with all the possible pairwise relations between objects.
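The two metrics can be sketched for a single snapshot as below. This is a minimal sketch under assumptions: links are represented as hashable triples, and the object sets are compared as two clusterings of the same mentions (each mention labelled with the object it is assigned to), which is one plausible way of applying the Rand index here rather than the protocol confirmed by the text. The function names are hypothetical.

```python
from itertools import combinations


def link_f1(predicted, gold):
    """F1 between two collections of links (hashable triples)."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0
    hits = len(predicted & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


def rand_index(labels_a, labels_b):
    """Rand index between two clusterings of the same elements.

    labels_a[i] and labels_b[i] are the cluster ids of element i.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0
    # A pair agrees if both clusterings put it together or both split it.
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


gold_links = {("carry", "John", "apple"), ("is-located-at_A", "John", "kitchen")}
pred_links = {("carry", "John", "apple"), ("is-located-at_A", "John", "garden")}
f1 = link_f1(pred_links, gold_links)
ri = rand_index([0, 0, 1], [0, 1, 1])
```

Averaging these per-snapshot values over all snapshots of all test instances gives the reported quantities.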
The results are summarized in Table 2. OONP can learn fairly well to recover the evolving ontology with such a small training set and weak supervision (RL with the final reward), which clearly shows that the credit assignment to earlier snapshots does not cause much difficulty in the learning of OONP, even with a generic policy search algorithm. It is not so surprising to observe that Symbolic Reasoner helps to improve the results on discovering the links, while it does not improve the performance on identifying the objects, although it is taken within the learning. It is quite interesting to observe that OONP achieves rather high accuracy on discovering the links while it performs relatively poorly on specifying the objects. This is probably due to the fact that the reward does not penalize the objects.

6.2 Task-II: Parsing Police Report

Data & task
We implement OONP for parsing Chinese police reports (brief descriptions of criminal cases written by policemen), as illustrated in the left panel of Figure 10. We consider a corpus of 5,500 cases with a variety of crime categories, including theft, robbery, drug dealing and others. The ontology we designed for this task mainly consists of a number of Person-objects and Item-objects connected through an Event-object with several types of relations, as illustrated in the right panel of Figure 10. A Person-object has three internal properties: Name (string), Gender (categorical) and Age (number), and two types of external links (suspect and victim) to an Event-object. An Item-object has three internal properties: Name (string), Quantity (string) and Value (string), and six types of external links (stolen, drug, robbed, swindled, damaged, and other) to an Event-object. Compared with bAbI in Section 6.1, the police report ontology has fewer pairwise links but much richer internal properties for objects of all three types. Although the language in this dataset is reasonably formal, the corpus covers a big variety of topics and language styles, and has a high proportion of typos. The average length of a document is 95 Chinese characters, with a digit string (say, an ID number) counted as one character.

Figure 10: An example of police report and its ontology.
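The ontology described above maps naturally onto a small set of classes. The sketch below is only illustrative: the class and field names follow the properties and link types listed in the text, but the choice of Python dataclasses, the storage of links on the Event-object, and the example values are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Link types from the text.
PERSON_ROLES = ("suspect", "victim")
ITEM_ROLES = ("stolen", "drug", "robbed", "swindled", "damaged", "other")


@dataclass
class Person:
    name: Optional[str] = None
    gender: Optional[str] = None   # categorical
    age: Optional[int] = None


@dataclass
class Item:
    name: Optional[str] = None
    quantity: Optional[str] = None
    value: Optional[str] = None


@dataclass
class Event:
    # External links, stored here as role -> list of linked objects.
    persons: Dict[str, List[Person]] = field(
        default_factory=lambda: {role: [] for role in PERSON_ROLES})
    items: Dict[str, List[Item]] = field(
        default_factory=lambda: {role: [] for role in ITEM_ROLES})


# Made-up example: one suspect and one stolen item.
event = Event()
event.persons["suspect"].append(Person(name="Zhang", gender="male"))
event.items["stolen"].append(Item(name="phone", quantity="1", value="2000"))
```

Because every link goes through the single Event-object, each link type behaves like a category label on the linked Person-object or Item-object.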

Implementation details
The OONP model is designed to generate the ontology illustrated in Figure 10 through a decision process with the actions in Table 3. As pre-processing, we performed regular NER with a third-party algorithm (therefore not part of the learning) and simple rule-based extraction to yield the symbolic part of Inline Memory, as shown in Figure 11. For the distributed part of Inline Memory, we used dilated CNNs with different choices of depth and kernel size [Yu and Koltun, 2015], all of which are jointly learned during training. In making the New-Assign decision, Reader considers the matching between two structured objects, as well as the hints from the symbolic part of Inline Memory as features, as pictorially illustrated in Figure 7. In updating objects with their string-type properties (e.g., Name for a Person-object), we use a Copy-Paste strategy for the extracted string (whose NER tag already specifies which property in an object it goes to) as Reader sees it. For undetermined category properties in existing objects, Policy-net will determine the object to update (a New-Assign action without the New option), its property to update (an Update.X action), and the updating operation (an Update2what action) at milestones of the decision process, e.g., when reaching a punctuation mark. For this task, since all the relations are between the single by-default Event-object and other objects, the relations can be reduced to category-type properties of the corresponding objects in practice. For category-type properties, we cannot recover New-Assign and Update.X actions from the label (the final ontology), so we resort to RL for learning to determine that part, which is mixed with the supervised learning for Update2what and other actions for string-type properties.
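The Copy-Paste strategy for string-type properties can be sketched as a lookup from NER tag to property slot. This is a rough illustration only: the tag names in `TAG_TO_SLOT`, the dictionary representation of an object and the function `copy_paste` are hypothetical, since the text does not specify the tag set of the third-party NER.

```python
# Hypothetical NER tags mapped to (object class, property) pairs.
TAG_TO_SLOT = {
    "PER_NAME": ("Person", "Name"),
    "ITEM_NAME": ("Item", "Name"),
    "ITEM_QUANTITY": ("Item", "Quantity"),
    "ITEM_VALUE": ("Item", "Value"),
}


def copy_paste(obj, obj_class, tag, text):
    """Copy an extracted string into the slot named by its NER tag.

    `obj` is the object chosen by the New-Assign action, represented
    here as a plain dict. Returns True if the string was pasted.
    """
    slot = TAG_TO_SLOT.get(tag)
    if slot is None or slot[0] != obj_class:
        return False
    obj[slot[1]] = text
    return True


person = {}
pasted = copy_paste(person, "Person", "PER_NAME", "Zhang")
rejected = copy_paste(person, "Person", "ITEM_VALUE", "2000")
```

Under this reading, the only learned decision for a string-type property is which object receives the string; the destination slot follows from the tag.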

NewObject(c)
Create a new object of class-c.

AssignObject(c, k)
Assign the current information to the existing object (c, k).

UpdateObject(c, k).Name
Set the name of object-(c, k) with the extracted string.

UpdateObject(Person, k).Gender
Set the gender of the Person-object indexed k with the extracted string.

UpdateObject(Item, k).Quantity
Set the quantity of the Item-object indexed k with the extracted string.

UpdateObject(Item, k).Value
Set the value of the Item-object indexed k with the extracted string.

UpdateObject(Event, 1).Items.x
Set the link between the Event-object and an Item-object, where x ∈ {stolen, drug, robbed, swindled, damaged, other}.

UpdateObject(Event, 1).Persons.x
Set the link between the Event-object and a Person-object, where x ∈ {victim, suspect}.

Table 3: Actions for parsing police report.

Results & discussion
We use 4,250 cases for training, 750 for validation and a held-out 750 for test. We consider the following four metrics in comparing the performance of different models:
• Assignment Accuracy: the accuracy on New-Assign actions made by the model;
• Category Accuracy: the accuracy of predicting the category properties of all the objects;
• Ontology Accuracy: the proportion of instances for which the generated ontology is exactly the same as the ground truth;
• Ontology Accuracy-95: the proportion of instances for which the generated ontology achieves 95% consistency with the ground truth;
which measure the accuracy of the model in making discrete decisions as well as in generating the final ontology. We empirically examined several OONP implementations and compared them with a Bi-LSTM baseline, with results given in Table 4.

Table 4: OONP on parsing police reports.
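The two ontology-level metrics differ only in the threshold applied to a per-instance consistency score. The sketch below makes that explicit, with one loud assumption: the text does not define how "consistency" is computed, so `consistency` here is a stand-in that flattens each ontology into a dict of slots and takes the fraction of slots that agree. Both function names are hypothetical.

```python
def consistency(predicted, gold):
    """Assumed measure: fraction of slots (over the union of both
    ontologies) that are present in both with the same value."""
    keys = set(predicted) | set(gold)
    if not keys:
        return 1.0
    matched = sum(
        1 for k in keys
        if k in predicted and k in gold and predicted[k] == gold[k]
    )
    return matched / len(keys)


def ontology_accuracy(pairs, threshold=1.0):
    """Proportion of (predicted, gold) pairs whose consistency reaches
    `threshold`: 1.0 for Ontology Accuracy, 0.95 for Accuracy-95."""
    scores = [consistency(p, g) for p, g in pairs]
    return sum(score >= threshold for score in scores) / len(scores)


# Made-up example with slots keyed by (class, index, property).
pairs = [
    ({("Person", 1, "Name"): "A"}, {("Person", 1, "Name"): "A"}),
    ({("Person", 1, "Name"): "A", ("Item", 1, "Name"): "phone"},
     {("Person", 1, "Name"): "A", ("Item", 1, "Name"): "bike"}),
]
exact = ontology_accuracy(pairs)
relaxed = ontology_accuracy(pairs, threshold=0.95)
```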
The Bi-LSTM is essentially a simple version of OONP without a structured Carry-on Memory and designed operations (e.g., the sophisticated matching function in New-Assign). Basically, it consists of a Bi-LSTM Inline Memory encoder and a two-layer MLP on top of that, acting as a simple Policy-net for predicting actions. Since this baseline does not have an explicit object representation, it does not support category-type prediction. We hence only train this baseline model to perform New-Assign actions, and evaluate it with the Assignment Accuracy (first metric) and a modified version of Ontology Accuracy (third and fourth metrics) that counts only the properties that can be predicted by Bi-LSTM, hence in favor of Bi-LSTM. We consider three OONP variants:
• OONP (neural): a simple version of OONP with only distributed representation in Reader in determining all actions;
• OONP (structured): OONP that considers the matching between two structured objects in New-Assign actions, with symbolic prior encoded in Symbolic Matching and other features for Policy-net;
• OONP (RL): another version of OONP (structured) that uses RL to determine the time for predicting the category properties, while OONP (neural) and OONP (structured) use a rule-based approach to determine the time.
As shown in Table 4, the Bi-LSTM baseline struggles to achieve around 73% Assignment Accuracy on the test set, while OONP (neural) can boost the performance to 88.5%. Arguably, this difference in performance is due to the fact that Bi-LSTM lacks Object Memory, so all relevant information has to be stored in the Bi-LSTM hidden states along the reading process. When we start putting symbolic representation and operation into Reader, as shown in the result of OONP (structured), the performance is again significantly improved on all four metrics. More specifically, we have the following two observations (not shown in the table):
• Adding inline symbolic features as in Figure 11 improves New-Assign action prediction by around 0.5% and category property prediction by 2%. The features we use include the type of the candidate strings and the relative distance to the marker character we chose.
• Using a matching function that can take advantage of the structures in objects helps better generalization. Since the objects in this task have multiple property slots such as Name, Gender, Quantity and Value, we tried adding features derived from both the original text string of a property slot and its embedding, e.g., the length of the longest common string between the candidate string and a relevant property of the object.

Figure 11: Information in distributed and symbolic forms in Inline Memory.
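As an illustration of the string-overlap feature mentioned in the second observation, the sketch below computes the length of the longest common contiguous substring with Python's standard `difflib`. Reading "longest common string" as a contiguous substring is an assumption, and the function name and example strings are made up.

```python
from difflib import SequenceMatcher


def longest_common_length(candidate, prop):
    """Length of the longest contiguous block shared by the candidate
    string and the value already stored in a property slot."""
    if not candidate or not prop:
        return 0
    matcher = SequenceMatcher(None, candidate, prop, autojunk=False)
    match = matcher.find_longest_match(0, len(candidate), 0, len(prop))
    return match.size


overlap = longest_common_length("iPhone 6", "white iPhone")
no_overlap = longest_common_length("abc", "xyz")
```

A feature of this kind lets the matching function recognize that a new mention probably refers to an existing object even when the two strings are not identical.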
When using REINFORCE to determine when to make predictions for category properties, as shown in the result of OONP (RL), the prediction accuracy for category properties and the overall ontology accuracy are improved. It is quite interesting that this has some positive impact on the supervised learning task (i.e., learning the New-Assign actions) through shared parameters. The entanglement of the two learning paradigms in OONP is one topic for future research, e.g., the effect of predicting the right category property on the New-Assign actions if the predicted category property is among the features of the matching function for New-Assign actions.
6.3 Task-III: Parsing court judgment documents

Data and task
We also implement OONP for parsing court judgements on theft. Unlike the documents in the two previous tasks, court judgements are typically much longer, containing multiple events of different types as well as bulks of irrelevant text, as illustrated in the left panel of Figure 12. The dataset contains 1961 Chinese judgement documents, divided into training/dev/testing sets with 1561/200/200 texts respectively. The ontology we designed for this task mainly consists of a number of Person-objects and Item-objects connected through a number of Event-objects with several types of links. An Event-object has three internal properties: Time (string), Location (string), and Type (category, ∈ {theft, restitution, disposal}), four types of external links to Person-objects (namely, principal, companion, buyer, victim) and four types of external links to Item-objects (stolen, damaged, restituted, disposed). In addition to the external links to Event-objects, a Person-object has only the Name (string) as its internal property. An Item-object has three internal properties: Description (array of strings), Value (string) and Returned (binary), in addition to its external links to Event-objects, where Description consists of the words describing the corresponding item, which could come from multiple segments across the document. A Person-object or an Item-object could be linked to more than one Event-object; for example, a person could be the principal suspect in event A and also a companion in event B. An illustration of the judgement document and the corresponding ontology can be found in Figure 12.
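The main structural difference from the police report ontology is that several Event-objects can share the same Person-object or Item-object. A minimal sketch, assuming Python dataclasses and storing links on the Event-objects (both are illustrative choices, and the example values are made up):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

EVENT_TYPES = ("theft", "restitution", "disposal")
PERSON_LINKS = ("principal", "companion", "buyer", "victim")
ITEM_LINKS = ("stolen", "damaged", "restituted", "disposed")


@dataclass
class Person:
    name: Optional[str] = None


@dataclass
class Item:
    # Words describing the item, possibly gathered from several segments.
    description: List[str] = field(default_factory=list)
    value: Optional[str] = None
    returned: bool = False


@dataclass
class Event:
    time: Optional[str] = None
    location: Optional[str] = None
    type: Optional[str] = None   # one of EVENT_TYPES
    persons: Dict[str, List[Person]] = field(
        default_factory=lambda: {link: [] for link in PERSON_LINKS})
    items: Dict[str, List[Item]] = field(
        default_factory=lambda: {link: [] for link in ITEM_LINKS})


# The same person is the principal in event A and a companion in event B.
person = Person(name="Li")
event_a = Event(type="theft")
event_b = Event(type="theft")
event_a.persons["principal"].append(person)
event_b.persons["companion"].append(person)
```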

Implementation details
We use a model configuration similar to that in Section 6.2, with, however, the following important difference. In this experiment, OONP performs a two-round reading of the text. In the first round, OONP identifies the relevant events, creates empty Event-objects, and does Notes-Taking on Inline Memory to save the information about event segmentation (see [Yan et al., 2017] for more details). In the second round, OONP reads the updated Inline Memory, fills the Event-objects, creates and fills Person-objects and Item-objects, and specifies the links between them. When an object is created during a certain event, it is given an extra feature (not an internal property) indicating this connection, which is used in deciding the links between this object and the Event-object, as well as in determining future New-Assign actions. The actions of the two-round reading are summarized in Table 5.
Actions for the 1st round

NewObject(c)
Create a new Event-object, with c = Event.

NotesTaking(Event, k).word
Put the indicator of event-k on the current word.

NotesTaking(Event, k).sentence
Put the indicator of event-k on the rest of the sentence, and move the read-head to the first word of the next sentence.

NotesTaking(Event, k).paragraph
Put the indicator of event-k on the rest of the paragraph, and move the read-head to the first word of the next paragraph.

Skip.word
Move the read-head to the next word.

Skip.sentence
Move the read-head to the first word of the next sentence.

Skip.paragraph
Move the read-head to the first word of the next paragraph.

Actions for the 2nd round

NewObject(c)
Create a new object of class-c.

AssignObject(c, k)
Assign the current information to the existing object (c, k).

UpdateObject(Person, k).Name
Set the name of the k-th Person-object with the extracted string.

UpdateObject(Item, k).Description
Add the extracted string to the description of the k-th Item-object.

UpdateObject(Item, k).Value
Set the value of the k-th Item-object with the extracted string.

UpdateObject(Event, k).Time
Set the time of the k-th Event-object with the extracted string.

UpdateObject(Event, k).Location
Set the location of the k-th Event-object with the extracted string.

UpdateObject(Event, k).Type
Set the type of the k-th Event-object among {theft, disposal, restitution}.

UpdateObject(Event, k).Items.x
Set the link between the k-th Event-object and an Item-object, where x ∈ {stolen, damaged, restituted, disposed}.

UpdateObject(Event, k).Persons.x
Set the link between the k-th Event-object and a Person-object, where x ∈ {principal, companion, buyer, victim}.

Table 5: Actions for parsing court judgements.
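The effect of the first-round NotesTaking and Skip actions on the read-head can be sketched with a small interpreter. This is a toy illustration under assumptions: the document is a nested list of paragraphs, sentences and words, the actions are given as a fixed list rather than chosen by a policy, NotesTaking(Event, k).word is assumed to advance the read-head by one word, and the function name `first_round` is hypothetical.

```python
def first_round(doc, actions):
    """Apply first-round actions and return (word, event id) pairs.

    doc: list of paragraphs, each a list of sentences (lists of words).
    actions: (name, scope, event_id) tuples, with name in
        {"NotesTaking", "Skip"}, scope in {"word", "sentence",
        "paragraph"}, and event_id None for Skip.
    """
    # Flatten the document, remembering paragraph and sentence indices.
    tokens = [(word, p, s)
              for p, paragraph in enumerate(doc)
              for s, sentence in enumerate(paragraph)
              for word in sentence]
    notes = [None] * len(tokens)
    head = 0
    for name, scope, event_id in actions:
        if head >= len(tokens):
            break
        _, p, s = tokens[head]
        end = head + 1
        if scope == "sentence":
            while end < len(tokens) and tokens[end][1:] == (p, s):
                end += 1
        elif scope == "paragraph":
            while end < len(tokens) and tokens[end][1] == p:
                end += 1
        if name == "NotesTaking":
            # Mark every word from the read-head to the end of the scope.
            for i in range(head, end):
                notes[i] = event_id
        head = end
    return [(token[0], note) for token, note in zip(tokens, notes)]


doc = [[["court", "opens"], ["he", "stole", "phone"]],
       [["he", "sold", "it"]]]
actions = [("Skip", "sentence", None),
           ("NotesTaking", "sentence", 1),
           ("NotesTaking", "paragraph", 2)]
marked = first_round(doc, actions)
```

The second round can then read the event indicator of each word as an extra symbolic feature in Inline Memory.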

Results and Analysis
We use the same metrics as in Section 6.2, and compare two OONP variants, OONP (neural) and OONP (structured), with Bi-LSTM. The Bi-LSTM is tested only on the second-round reading, while both OONP variants are tested on the two-round reading. The results are shown in Table 6. OONP parsers attain significantly higher accuracy than the Bi-LSTM models. Among them, OONP (structured) achieves over 64% accuracy on getting the entire ontology right and over 78% accuracy on getting 95% consistency with the ground truth.

Conclusion
We proposed Object-oriented Neural Programming (OONP), a framework for semantically parsing in-domain documents. OONP is neural net-based, but equipped with sophisticated architecture and mechanism for document understanding, therefore nicely combining interpretability and learnability. Our experiments on both synthetic and real-world document parsing tasks have shown that OONP can learn to handle fairly complicated ontology with training data of modest sizes.