Structured Prediction for Joint Class Cardinality and Entity Property Inference in Model-Complete Text Comprehension

Model-complete text comprehension aims at interpreting a natural language text with respect to a semantic domain model describing the classes and their properties relevant for the domain in question. Solving this task can be approached as a structured prediction problem, consisting in inferring the most probable instance of the semantic model given the text. In this work, we focus on the challenging sub-problem of cardinality prediction, which consists in predicting the number of distinct individuals of each class in the semantic model. We show that cardinality prediction can successfully be approached by modeling the overall task as a joint inference problem, predicting the number of individuals of certain classes while at the same time extracting their properties. We approach this task with probabilistic graphical models computing the maximum-a-posteriori instance of the semantic model. Our main contribution lies in the empirical investigation and analysis of different approximative inference strategies based on Gibbs sampling. We present and evaluate our models on the task of extracting key parameters from scientific full text articles describing pre-clinical studies in the domain of spinal cord injury.


Introduction
While there has been significant progress on information extraction tasks with a comparably low level of structural complexity, such as entity recognition (Goulart et al., 2011; Nadeau and Sekine, 2007), relation extraction (Zhou et al., 2014; Kumar, 2017), and co-reference resolution (Soon et al., 2001; Ferracane et al., 2016), there has been little progress on capturing the comprehensive meaning of a text with respect to a given semantic model in terms of a given vocabulary of classes and properties. We refer to this task as model-complete text comprehension (MCTC), which requires putting all of the above-mentioned classical NLP tasks into a larger context. The goal of MCTC is to capture all the information in the text that is expressible with respect to the semantic model, while ignoring those meaning aspects which are not. This can be framed as a structured prediction problem consisting in inferring the most plausible instance of the semantic model.
One challenging problem in MCTC lies in the prediction of the correct number of individuals for each class, hereinafter referred to as cardinality prediction, that is, answering the question(s): "How many (and which) individuals of a class are mentioned in the text?". In essence, this can be approached by grouping mentions of known real-world entities into equivalence classes, which has widely been addressed under the heading of co-reference resolution (He, 2007; Singh et al., 2013). However, in many problem domains, we need to identify equivalence classes of entities that are a priori unknown (in the sense of not referring to a specific real-world entity). Thus, explicit mentions in text, such as naming variations, cannot be directly mapped to a set of existing entities. On the contrary, such entities are only distinguishable on the basis of their describing properties. Take the case of scientific publications concerning pre-clinical studies, each containing a variable number of experimental groups, each of which is described by an injury model, an animal species, treatments, etc. Here, mentions of experimental groups do not refer to existing real-world entities; they need to be inferred/grouped on the basis of their identifying properties that are mentioned in the text. We refer to the prediction of how many distinct individuals 1 of a particular class are (indirectly) mentioned in a text as cardinality prediction and solve it jointly with the prediction of the properties of each individual. We model this joint task as the task of predicting a (logical) model of the text, which involves making choices as to which individuals exist for each class.
Towards capturing the dependence of class cardinalities and properties, we propose a joint inference approach that infers equivalence classes of entities in a text while at the same time predicting the properties of each equivalence class. We model this task as a statistical inference problem, relying on a factorized posterior conditional distribution p(y | x) as implemented in CRFs to approximate the true distribution over possible instantiations y ∈ Y of the semantic model given a text x. Applying maximum-a-posteriori inference, we infer the most likely instance of the model that captures the whole meaning of the text as expressible by the semantic model. This includes the determination of the number of distinct equivalence classes (thus solving cardinality prediction) as well as the prediction of the properties of each equivalence class. Our approach is evaluated on text comprehension of research articles describing pre-clinical studies in the domain of spinal cord injury. Capturing the correct key parameters of the study protocol can be modeled as an MCTC problem, as it requires a comprehensive understanding of the text rather than extracting single binary relations only. In this domain, we focus in particular on the extraction of experimental groups and their properties as described in Section 4.1. The data set 2 and the source code 3 are publicly available.
In this work, we answer the following research questions: 1) What is the advantage of jointly predicting the cardinality of classes and their properties over an isolated approach and how much does the prediction of the cardinality profit from the joint modelling?
2) What approximative inference strategies work best on this complex inference problem? We examine i) a vanilla Gibbs-based inference strategy, ii) an inference strategy that is seeded with cardinality values based on a preceding clustering step, and iii) a parallel multi-chain inference strategy in which one chain is constructed for each potential cardinality value.

2 http://psink.techfak.uni-bielefeld.de/spnlp-2020/mctc-spnlp2020.zip
3 https://github.com/ag-sc/SCIOExtraction

Related Work
There are a number of traditional natural language processing tasks related to model-complete text comprehension. In this section, we briefly discuss each task and provide some pointers to systems addressing the corresponding task, focusing on the bio-medical domain.
Entity Recognition and Linking (NER+L) describes the task of finding entity mentions in a text and linking them to unique concepts in some knowledge base. The task originated in the context of information extraction, consisting of identifying persons, company names etc. (Nadeau and Sekine, 2007) but has also received prominent attention in the biomedical field focusing on entities such as genes, diseases, treatments, etc. (Goulart et al., 2011). NER+L is an important preliminary step in many downstream applications as it identifies core informational units that are needed for more complex analysis levels including relation extraction, slot filling, and MCTC.
Relation Extraction (RE) describes the task of detecting relations between entities mentioned in a text (Giuliano et al., 2007). While many models rely on a pipeline architecture, predicting entities first and then predicting relations, more recent works model both tasks jointly (Luo et al., 2015). Although there has been notable progress on RE in recent years, the task has typically been restricted to extracting binary relations within single sentence boundaries (Zhou et al., 2014). With our work, we strive to go beyond such simplifications towards document-level text interpretation with respect to a more complex model.
Co-reference resolution (CRR) originally describes the task of finding nouns and pronouns that refer to the same underlying entity (Soon et al., 2001). When applying CRR to the medical field, the task shifts towards the resolution of mentions of diseases, tests, compounds, groups, treatments, etc. (He, 2007). Cardinality prediction in isolation can be modeled as a CRR problem, where the number of distinct non-co-referring entities needs to be found. With regard to the goal of comprehensive text understanding, classical co-reference resolution is clearly not enough, as the properties of each entity need to be extracted as well. While Singh et al. (Singh et al., 2013) have attempted to model the tasks of entity recognition, relation extraction, and co-reference resolution jointly, in their approach the interaction between relation extraction and co-reference resolution is not modelled directly, only via entity tags. In our approach, we model the joint interaction between inducing equivalence classes (resolving co-references) and extracting the properties of entities/individuals, using the latter as a basis to inform the decision about whether two individuals are the same (thus co-refer) given their properties. Durrett et al. (Durrett et al., 2013) propose global-inference, entity-level modeling for classical co-reference resolution based on a rich factor graph. In their unrolled factor graph, each factor refers to one entity property defined on a semantic or syntactic linguistic basis. In contrast to this work, in our approach the properties of an individual/entity are pre-defined by the semantic model; our focus thus lies in their joint exploration, learning their interplay during inference in order to decide whether the properties belong to the same individual or not. Haghighi et al. (Haghighi and Klein, 2010) propose an unsupervised generative model incorporating several linguistic properties of the entity and its mention.
In contrast, our work does not rely on entities that are explicitly mentioned in text. Instead, our model follows the schema of a semantic model to reason about the existence of individuals that can be inferred from the text and groups these individuals into groups by way of inferring the properties of these individuals.
The task of slot-filling (SF) was first introduced at the Message Understanding Conference (Grishman and Sundheim, 1996). It is concerned with predicting an entity-centric structure having a set of relations to other entities, as found, e.g., in ontology-based information extraction (Sanchez-Cisneros and Aparicio Gali, 2013; Buitelaar et al., 2006) or in extracting info-boxes from Wikipedia articles (Lange et al., 2010). Contrary to MCTC, classical slot-filling requires the prediction of only a single structure per document, which heavily reduces relational complexity and does not include nested individuals. There are many approaches to SF, ranging from relying on distant supervision as described by Surdeanu et al. (Surdeanu et al., 2010) to, more recently, neural approaches as described by Zhang et al. (Zhang et al., 2017). Finally, SF can be seen as an upstream process for (cold-start) knowledge base population as described by ter Horst et al.
Our work is highly related to information extraction systems in the (bio-)medical field. When it comes, e.g., to the prediction of key parameters of clinical studies, most work focuses on the extraction of PICO concepts: Patient/Problem (P), Intervention (I), Comparison (C), and Outcome (O). Summerscales et al. (Summerscales et al., 2009) have applied conditional random fields to extract key parameters from abstracts of clinical studies, including treatments, experimental groups, and outcomes. Contrary to our approach, the task is defined as an NER+L problem, not aiming at capturing the semantic relations and concepts. Trenta et al. (Trenta et al., 2015) have proposed to rely on a maximum entropy classifier jointly extracting fine-grained PICO elements from abstracts. De Bruijn et al. (De Bruijn et al., 2008) combined an SVM-based text classifier with regular expressions to extract PICO elements. Further, Ferracane et al. (Ferracane et al., 2016) aim to leverage co-reference resolution to identify experimental groups (patients) in medical abstracts. However, none of these works aims at a deeper extraction of arms/experimental groups and their properties. In general, most approaches in the literature focus on sentence extraction and classification only (Mayer et al., 2018; Zhao et al., 2012) rather than on predicting a semantic structure.

Method
Structured prediction describes a variety of tasks with the goal of predicting a pre-defined target structure from an unstructured input text (Smith, 2011). We formulate the MCTC problem as a structured prediction task, where the structure to be predicted is an instance of the semantic model capturing the meaning of a text. This involves predicting the number of individuals of each class (cardinality prediction) as well as predicting the values of the key properties of each individual. Our proposed method relies on probabilistic graphical models, i.e., conditional random fields (CRFs; Lafferty et al., 2001; Sutton et al., 2012), as their application is well established in many structured prediction tasks in the context of NLP.
Encoding Semantic Models: An instance of the semantic model is encoded as a nested vector y containing as many elements as there are classes and properties in the model. Thus, given a set of classes {C_1, ..., C_n} and a set of properties P, an instance is represented as y = [(|C_1|, I_1^1, ..., I_1^{|C_1|}), ..., (|C_n|, I_n^1, ..., I_n^{|C_n|})], where |C_i| represents the cardinality of class C_i, i.e. the number of individuals of class C_i mentioned in the text, and I_i^j ⊆ P is a vector describing an individual of class C_i in terms of its properties.

Figure 1: Schematized factor graph unrolled over the previously shown example. We introduce unary property factors connected to a single property of a single individual and pairwise property factors connected to two properties of one or two individuals. Both factors are additionally connected to the cardinality variables, jointly modelling the properties and cardinalities. For clarity, we omit the observed variables in this example.
Example: Consider a semantic model consisting of two classes C_1 and C_2, where individuals of class C_1 have properties hasA, hasB, hasC, and individuals of class C_2 have properties hasD, hasE. One specific instance of the semantic model would be represented as y = [(2, (a_1, b_1, c_1), (a_1, b_1, c_2)), (1, (d_1, e_2))]. The first component of the first tuple shows that there are two individuals of class C_1. The first individual has the property values a_1, b_1, c_1 for properties hasA, hasB, hasC, respectively. The second individual of class C_1 has property values a_1, b_1, c_2 for the above-mentioned properties. The second tuple shows that there is one individual of class C_2, which has property values d_1, e_2 for properties hasD, hasE, respectively.
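For illustration, this nested encoding can be held in a small data structure. The following is a minimal sketch only; the class and property names are the hypothetical ones from the example above, not part of any released code.

```python
# Sketch of the nested-vector encoding y from the example above.
# Each class is a tuple (cardinality, [individuals]); each individual
# maps property names to values.
instance = [
    (2, [  # class C1: two individuals
        {"hasA": "a1", "hasB": "b1", "hasC": "c1"},
        {"hasA": "a1", "hasB": "b1", "hasC": "c2"},
    ]),
    (1, [  # class C2: one individual
        {"hasD": "d1", "hasE": "e2"},
    ]),
]

def cardinality(y, i):
    """Return |C_i|, the number of individuals of the i-th class."""
    count, individuals = y[i]
    assert count == len(individuals), "cardinality must match individuals"
    return count

print(cardinality(instance, 0))  # 2
print(cardinality(instance, 1))  # 1
```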

CRF-based Modelling
Let Y be the set of all possible (nested) vectors over a given vocabulary of classes and properties as exemplified above. Intuitively, this is the set of all possible instantiations of the semantic model. With x being the set of observed input variables corresponding to the list of tokens of the input text, the conditional probability of a specific instance of the semantic model y ∈ Y is p(y | x; θ), with θ being a learned model parameter vector. The best value assignment to the set of target variables, denoted as ŷ, is found by maximum a-posteriori (MAP) inference as shown in Equation (1):

ŷ = argmax_{y ∈ Y} p(y | x; θ)   (1)

As inference in high-dimensional vector spaces is often intractable, conditional random fields decompose the joint probability into individual factors. The set of factors and their operating scope is defined by a factor graph (Kschischang et al., 2001; Koller and Friedman, 2009). A factor graph is a bipartite undirected graph G = (V, F) consisting of a set of factors F and a set of variables V defined as the union of the observed input and the target output variables, V = y ∪ x. A factor Ψ ∈ F is a non-negative real-valued exponential function Ψ : V → R≥0 that computes a scalar score based on a subset ω ⊆ V of random variables defining its operating scope, Ψ(ω) = exp(⟨f_Ψ(ω), θ_Ψ⟩), with f_Ψ(ω) representing a feature vector based on a set of indicator functions and θ_Ψ referring to the set of model weights that are shared between factors of the same type. In our approach, it is crucial to capture dependencies between multiple target variables, in particular between the variables representing the cardinalities of classes and the variables representing the individuals' properties. For this reason, we introduce factors that model the interaction between all pairs of property variables while having access to the cardinalities. We schematize our factor graph in Figure 1, unrolled over the previously given example. Let C denote the vector of the cardinalities of all classes and |C_i| ∈ C the cardinality of class C_i.
The decomposition of the conditional probability p(y | x; θ) can be written as shown in Equation (2):

p(y | x; θ) = (1/Z(x)) ∏_{y_i ∈ y} Ψ(y_i, C, x) ∏_{y_i, y_j ∈ y} Ψ′(y_i, y_j, C, x)   (2)

where Z denotes the partition function and Ψ(·), Ψ′(·) denote factors defined for single and pairs of output variables, respectively, while having access to the cardinalities C.
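The factorized, log-linear form of Equation (2) can be sketched in a few lines. This is an illustrative toy, not the authors' implementation; the feature names and weights are invented, and the partition function Z is omitted since MAP inference only compares the unnormalized scores of competing states.

```python
import math

def factor_score(features, weights):
    """Log-linear factor: Psi(omega) = exp(<f_Psi(omega), theta_Psi>)."""
    return math.exp(sum(weights.get(k, 0.0) * v for k, v in features.items()))

def unnormalized_prob(unary_feats, pairwise_feats, weights):
    """Product over unary factors Psi(y_i, C, x) and pairwise factors
    Psi'(y_i, y_j, C, x); Z is dropped for comparing MAP candidates."""
    score = 1.0
    for feats in unary_feats + pairwise_feats:
        score *= factor_score(feats, weights)
    return score

# toy example: one unary and one pairwise factor
weights = {"prop=a1": 0.5, "pair=(a1,b1)": 1.2}
unary = [{"prop=a1": 1.0}]
pairwise = [{"pair=(a1,b1)": 1.0}]
print(unnormalized_prob(unary, pairwise, weights))  # exp(0.5 + 1.2)
```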
The unrolling of factors over the input is performed using imperatively defined factor graphs as proposed by McCallum et al. (McCallum et al., 2009). For approximative inference of the posterior distribution, we rely on the state-based Markov chain Monte Carlo sampling paradigm. Proposal states are computed and sampled via Gibbs sampling (Casella and George, 1992). During training, the model parameters θ are updated with SampleRank (Wick et al., 2009), which computes parameter update gradients based on an objective-driven comparison of two states, usually the current state and the selected successor state (cf. the next sections for proposed variations). In our approach, the objective is to maximize the F_1 score with respect to the ground truth.

State-based Inference Strategies
In the following, we propose our inference strategies for MCTC with a focus on cardinality prediction. In state-based inference, a state s_t is defined as one specific variable assignment to the target structure y at a specific time point t. As inference proceeds, in each step a set of proposal states S_{t+1} is computed based on a list of pre-defined atomic change rules that are applied to the current state, e.g. changing cardinalities of classes or the properties of individuals. The successor state s_{t+1} ∈ S_{t+1} is sampled from the generated set of proposal states.
Vanilla Inference: The vanilla inference is based on traditional Gibbs sampling. The inference procedure is initialized with one empty state s_0, that is, y = ∅ (no values are assigned), and iteratively updated with atomic change rules. Modifying the cardinality of a class C_i is defined as either deleting an existing individual with index j (y ← y \ I_i^j; |C_i| ← |C_i| − 1) or adding a new individual with leading index |C_i| (y ← y ∪ I_i^{|C_i|}; |C_i| ← |C_i| + 1). On the level of individuals, an atomic change is defined as deleting, adding, or changing a property value. The inference procedure terminates when the model parameter update converges. The final state represents the most likely instance of the semantic model.
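The vanilla strategy can be sketched as a generic propose-and-sample loop. In this hedged sketch, the state, the atomic change rules, and the score function are toy stand-ins (an integer cardinality and a score peaked at 3); in the actual system, states are full semantic-model instances, scores come from the trained factor model, and termination is convergence-based rather than a fixed step budget.

```python
import math
import random

def propose_states(state, change_rules):
    """Apply each atomic change rule to generate the proposal set S_{t+1}."""
    return [rule(state) for rule in change_rules]

def sample_successor(proposals, score):
    """Sample the successor state proportionally to its score."""
    scores = [score(s) for s in proposals]
    total = sum(scores)
    r, acc = random.uniform(0.0, total), 0.0
    for s, sc in zip(proposals, scores):
        acc += sc
        if acc >= r:
            return s
    return proposals[-1]

def vanilla_inference(initial_state, change_rules, score, steps=200):
    """Iteratively propose and sample; a fixed step budget stands in
    for the convergence criterion used in the paper."""
    state = initial_state
    for _ in range(steps):
        state = sample_successor(propose_states(state, change_rules), score)
    return state

# toy usage: add or delete one individual; score prefers cardinality 3
rules = [lambda s: s + 1, lambda s: max(0, s - 1)]
score = lambda s: math.exp(-abs(s - 3))
final = vanilla_inference(0, rules, score)
```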
Cardinality Seeded+ Inference: In the seeded+ inference (cf. Figure 2), the first state s_0 is initialized with an a priori predicted cardinality value λ_{C_i} for each class C_i, which is re-sampled as inference proceeds. For this, the system relies on the same atomic change and termination rules as defined for the vanilla inference.
Parallel Multi Chain Inference: The parallel multi chain inference procedure (cf. Figure 3) is initialized with n independent Markov chains S_0 = [s_0^1, s_0^2, ..., s_0^n] that are explored in parallel but independently from each other. Each state s_0^i ∈ S_0 is initialized with a fixed number of individuals for each class type, ranging between a pre-defined minimum α_{C_i} and maximum β_{C_i}. Contrary to the previous inference strategies, the cardinalities in each chain are not sampled over but remain fixed. Only the property values of the individuals are sampled. The parallel sampling is independent in the sense that, for each chain, the computation of the set of proposal states and the selection of the successor states is independent of the other chains. The model parameters θ, however, are shared across all chains and are thus updated n times in every time step, once for each pair of current and successor state. This inference procedure terminates when all chains converge. The final output is selected based on the highest model probability among the final states of all chains.

Figure 2: Schematized seeded+ inference. The input is the seed parameter λ which is used to initialize the cardinality of the first state. Within the proposal states, λ can be altered. In each time step the model is updated based on the current state and successor state.
Parallel Multi Chain Inference+: The parallel multi chain inference with cross-chain model updates builds on the previous inference strategy in that it retains parallel inference chains with fixed cardinalities but integrates cross-chain model update operations after each time step (bold triangle in Figure 3). That is, in addition to the n within-chain model updates, a set of state pairs is computed by pairwise combining the selected successor states of the chains for cross-over model updates. This generates (n² + n)/2 model parameter updates in each time step. The motivation for these cross-chain model updates is to force the model to learn to prefer the correct cardinality values.
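The update count per time step follows from combining the n within-chain updates with one update per unordered pair of chain successors, which can be checked in a few lines (illustrative only):

```python
from itertools import combinations

def updates_per_step(n_chains):
    """n within-chain updates plus n*(n-1)/2 cross-chain pair updates,
    which equals (n^2 + n) / 2 in total."""
    cross = sum(1 for _ in combinations(range(n_chains), 2))
    return n_chains + cross

print(updates_per_step(4))  # 4 + 6 = 10 = (4**2 + 4) // 2
```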

Features
Factors are defined in terms of indicator functions that measure the compatibility of variable assignments to the output structure y given the input document x. In the following, we explain four types of feature groups that we consider in our model. The proposed features are intuitively designed to capture document-level semantics and were finally selected empirically based on an evaluation on a subset of the training data.

Document-level: Document-level features measure the compatibility of property assignments of individuals based on the textual content of the document represented as 3-grams. For this, triples are considered for plausibility, containing the property type, the entity type of the property value, and its textual representation as 3-grams. We further measure the compatibility of pairwise assignments of property values considering their sentential distance, assuming that values within the same property (in case of multi-value properties) or individual (throughout multiple properties) are more likely to appear close together rather than being spread across the document.

Document-structure: Document structure features rely on a heuristic segmentation of the document into the standard sections of a scientific article: abstract, introduction, method, results, discussion, references, and unknown. We compute features that capture 3-grams mentioned in specific sections of the article as indicators for the assignment of certain values to properties. By this, we can model that certain content is expected in certain sections and should override inconsistent information appearing in other sections.
Cardinality: Aiming at cardinality prediction, we measure the compatibility of cardinality values in dependence of other random variables in y. For this, we make the choice of a cardinality dependent on n-grams appearing in the surface forms of property values.
In addition, we also consider features implementing a prior for the cardinalities of classes as well as for the number of different values of multi-value properties. By this, the model is able to learn a class/property-specific distribution of cardinality values. For example, assuming that the cardinality of a class Ĉ has a very high a priori likelihood for a specific value λ_Ĉ throughout the training data, this puts pressure on the model during inference to prefer model instances with λ_Ĉ individuals of the respective class, unless other features provide strong evidence to the contrary.
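Such a prior can be estimated from the class cardinalities observed in the training data, for instance as a smoothed relative-frequency distribution. The add-one smoothing below is our assumption for illustration; the paper does not specify the estimator.

```python
from collections import Counter

def cardinality_prior(train_cardinalities, smoothing=1.0):
    """Class-specific prior over cardinality values, estimated as
    smoothed relative frequencies over the observed value range."""
    counts = Counter(train_cardinalities)
    values = range(min(counts), max(counts) + 1)
    total = sum(counts[v] + smoothing for v in values)
    return {v: (counts[v] + smoothing) / total for v in values}

# toy training data: most documents describe three experimental groups
prior = cardinality_prior([2, 3, 3, 3, 4, 2, 5, 3])
print(max(prior, key=prior.get))  # 3
```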
Within-and Across-Individual Coherence: Sometimes values of properties are shared across individuals within the same class. Thus, we measure the compatibility of value assignments across properties within one individual, but also how plausible it is that a certain value is shared across individuals.

Experiments
Model-complete text comprehension aims at the automatic instantiation of a semantic model based on information extracted from a natural language text. Such an instance contains information about individuals of equivalence classes, their cardinality and their properties. Thus the overall task of MCTC can be evaluated towards i) the correct prediction of the number of individuals, and ii) the prediction of properties for each individual. In the following, we describe our use case application, the experimental procedure and results.

Semantic Model and Data Set
We apply our approach to full text articles describing pre-clinical studies in the domain of spinal cord injury. Our semantic model is an excerpt of the Spinal Cord Injury Ontology (SCIO) (Brazda et al., 2017) centered on the key concept of an experimental group. An experimental group represents an animal model to which a certain injury and treatment is applied and is described by four key properties: i) hasSpecies, specifying the species that the animal model belongs to, ii) hasInjury, specifying the experimentally inflicted injury, iii) hasTreatment, the list of treatments that were applied, and iv) hasName, a list of naming variations for that animal group that are used throughout the document. Note that, in accordance with domain experts, only the first three properties are considered relevant for describing the experimental group semantically and thus are evaluated. The property hasName can be seen as an auxiliary property that is not necessary to understand the study but provides useful information, e.g. to detect co-references.
The data set contains full text articles that have been annotated by three domain experts using the SANTO framework. Annotations are available at the full level of relevant concepts of SCIO. Each document can be seen as a data point that is annotated with an instance of the semantic model containing a list of experimental groups and their properties. While annotations for the hasName property are linked to specific textual phrases in the document, all other properties are annotated in a distantly supervised fashion. In a preliminary step, we apply a named entity recognition heuristic based on automatically generated regular expressions to compute a set of document-based annotations for all classes and property values that exist in the semantic model. The names of groups are additionally extracted with a standard CRF using standard token-level features. The final data set contains 96 data points with an average length of approx. 273 sentences per document and a total number of 345 experimental groups (µ = 3.3, σ = 1.3, min = 2, max = 7).

Inference Parameter Estimation
Our proposed inference strategies rely on a prior estimation of the number of individuals for initialization. As described in Section 3.2, the seeded inference strategy requires the seed variable λ, and the parallel multi chain inference requires a range of cardinality values 0 ≤ α ≤ β. Details of their estimation are briefly described below.
Seed Prior Cardinality Estimation λ: The cardinality seeded(+) inference procedure requires the estimation of the seed parameter λ for each class, determining the number of individuals (experimental groups) the initial state begins with. λ is computed by clustering group names with the k-Means algorithm based on textual features. The cluster quality of k-Means depends on two main parameters. First, the number of clusters, for which we rely on the residual sum of squares (RSS) criterion with an empirically determined penalization factor for large numbers of clusters. Second, a function measuring the distance between two data points, i.e. between two group names. We compare three distance functions: i) Levenshtein distance with a k-Medoid implementation of k-Means, ii) cosine distance of the averaged sum of pre-trained Pubmed-based word embeddings induced with Word2vec (Mikolov et al., 2013), and iii) a random forest classifier (Liaw et al., 2002) with correlation-based feature selection (resulting in Smith-Waterman and 3-gram-based Jaccard similarity as features). We evaluate the performance in terms of the F_1 score, using a reference clustering as ground truth obtained from our annotated data set. We define a true positive as a group name that is in the correct cluster, a false positive as a group name in a wrong cluster, and a false negative as a group name missing from its respective cluster. The Levenshtein distance performs at F_1 = 0.41, the Word2Vec-based cosine distance performs slightly better at F_1 = 0.45, while the random forest classifier reaches F_1 = 0.56. With the random forest outperforming both other models, we rely on this distance function in a k-Means clustering for estimating λ.
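The penalized RSS model selection can be sketched as a simple grid over k. The additive penalty form and its value are assumptions for illustration, since the paper only states that larger cluster counts are penalized by an empirically determined factor; the toy `cluster_fn` returns a pre-computed RSS curve rather than running an actual k-Means.

```python
def choose_k(points, cluster_fn, max_k, penalty=1.0):
    """Pick the number of clusters k minimizing RSS(k) + penalty * k,
    where cluster_fn(points, k) returns (assignments, rss)."""
    best_k, best_score = 1, float("inf")
    for k in range(1, max_k + 1):
        _, rss = cluster_fn(points, k)
        if rss + penalty * k < best_score:
            best_k, best_score = k, rss + penalty * k
    return best_k

# toy RSS curve: sharp drop until k = 3, diminishing returns afterwards
rss_by_k = {1: 100.0, 2: 40.0, 3: 10.0, 4: 8.0, 5: 7.0}
cluster_fn = lambda points, k: (None, rss_by_k[k])
print(choose_k(None, cluster_fn, max_k=5, penalty=3.0))  # 3
```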
Parallel Multi Chain (+) Parameters α and β: The parallel multi chain (+) inference strategies require the estimation of a minimum (α) and a maximum (β) number of individuals, assuming that the correct cardinality lies between α and β. We estimate both parameters from the average cardinality of individuals in the training set. With µ being the average cardinality and σ its standard deviation, we set α = µ − σ and β = µ + σ.
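A minimal sketch of this estimation follows; rounding the bounds to integer counts is our assumption, since the paper does not say how fractional values of µ ± σ are handled.

```python
import statistics

def cardinality_range(train_cardinalities):
    """Estimate alpha = mu - sigma and beta = mu + sigma from the
    training cardinalities, rounded to valid (non-negative) counts."""
    mu = statistics.mean(train_cardinalities)
    sigma = statistics.pstdev(train_cardinalities)
    alpha = max(0, round(mu - sigma))
    beta = round(mu + sigma)
    return alpha, beta

print(cardinality_range([2, 3, 3, 4]))  # (2, 4)
```

With the data set's reported statistics (µ = 3.3, σ = 1.3), this would give roughly α = 2 and β = 5.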

Evaluation Setting
Our experiments follow a randomized cross-validation regime, as usual for experiments on relatively small data sets. We ran each experiment 10 times with a random split into 80% training data (76 documents) and 20% test data (20 documents). We provide evaluation results in terms of precision, recall, and F_1, macro-averaged over all documents, in three configurations:

Cardinality Prediction (CP): We compare the predicted cardinality p_c to the ground truth cardinality g_c, where tp = min(p_c, g_c), fp = max(0, p_c − g_c), and fn = max(0, g_c − p_c).

Property Prediction (PP): We compare the predicted property values to the ground truth property values, where a true positive is a correctly assigned property value, a false positive is a wrongly assigned property value, and a false negative is a missing property value of an individual.

Combined (Comb):
We compute the harmonic mean between the cardinality and property prediction scores.
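The three scores can be sketched directly from these definitions (an illustrative implementation; the function and variable names are ours):

```python
def cardinality_counts(p_c, g_c):
    """CP counts: tp = min(p_c, g_c), fp = max(0, p_c - g_c),
    fn = max(0, g_c - p_c)."""
    return min(p_c, g_c), max(0, p_c - g_c), max(0, g_c - p_c)

def f1(tp, fp, fn):
    """Standard F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def combined(cp_f1, pp_f1):
    """Comb: harmonic mean of the cardinality and property F1 scores."""
    return 2 * cp_f1 * pp_f1 / (cp_f1 + pp_f1) if cp_f1 + pp_f1 else 0.0

tp, fp, fn = cardinality_counts(4, 3)  # predicted 4 groups, gold has 3
print(round(f1(tp, fp, fn), 3))  # 0.857
```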

Experimental Results
Our experiments comprise the evaluation of four models, each of which is based on one of the described joint inference strategies predicting cardinality and properties at the same time, as well as a pure cardinality prediction baseline ignoring property prediction. The joint inference models are: RSS, the seeded inference with a fixed cardinality; RSS+, the seeded inference that allows further sampling of the cardinality as described in Section 3.2; PAR, the parallel multi chain inference; and PAR+, the parallel multi chain inference with cross-chain model updates. As cardinality baseline(s), we provide Co-ref CRF, a CRF-based method for clustering group names without a joint prediction of properties, relying on linguistic features only, and RSS, the cardinality as predicted by the RSS-based k-Means as reported in the RSS model. Note that the cardinality in RSS is fixed and does not change during inference, so that it can be seen as a baseline for predicting the cardinalities. The experimental results of these models are reported in Table 1. In Table 2, we compare the run time and the number of generated states for the four inference methods.

Discussion
We analyze the results with respect to three different aspects: i) performance of the cardinality prediction, ii) overall performance as measured by the combined harmonic mean, and iii) performance with respect to the run time and complexity.

Cardinality Prediction
The performance of the cardinality prediction can be seen in the first row of Table 1. The CRF-based baseline already yields a very strong F_1 score of 0.79, which shows that cardinality prediction with linguistic features alone, ignoring property prediction, already provides decent results. The k-Means approach with unsupervised RSS cluster estimation yields a cardinality F_1 score of 0.64, performing worse than the CRF baseline. When seeding our approximate inference approach with prior cardinality values (RSS+), the F_1 score considerably improves by 19 %-points up to 0.83, even outperforming the CRF baseline. The cardinality prediction in PAR performs comparably strongly with an F_1 score of 0.81. This score is further outperformed when integrating the cross-chain model update operation: PAR+ performs best in predicting the cardinalities with an F_1 score of 0.84.
Overall Score
The performance of the overall prediction can be seen in the second to last rows of Table 1. With respect to property prediction, RSS performs best with a score of 0.57, mainly due to the correct detection of TREATMENTS (0.67) and SPECIES (0.62). Given its low cardinality performance, its overall score amounts to an F1-score of 0.63. The strong increase in cardinality prediction performance of RSS+ compared to the RSS model comes at the cost of inferior property prediction quality; the combined score for RSS+ nevertheless shows slightly better results with an F1-score of 0.65. The property prediction in PAR shows results similar to RSS+ for INJURY, a slight decrease for TREATMENTS, and a large increase (10 percentage points) for SPECIES, yielding an overall score of 0.66. Activating cross-chain model updates (PAR+) improves property prediction by 8 percentage points for hasTreatment, while for the other two properties the values are similar to PAR. The PAR+ model outperforms all other models in the overall score, but falls 4 percentage points short of RSS in property prediction. The results show that property prediction works best when the number of individuals is fixed. Since PAR+ works best for cardinality prediction, an interesting model combination would be to use the cardinality output of PAR+ as initialization for RSS. This, however, is left for future work.
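The overall score combines cardinality and property prediction via a harmonic mean; a minimal sketch of that combination follows. The assumption that the combined score is the plain (unweighted) harmonic mean of the two F1-scores is ours, and the input values are illustrative rather than taken from Table 1.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two scores; defined as 0 if both scores are 0."""
    return 2.0 * a * b / (a + b) if (a + b) > 0 else 0.0

# Illustrative combination of a cardinality F1 with a property F1.
combined = harmonic_mean(0.8, 0.6)
```

As with the F1-score itself, the harmonic mean penalizes imbalance: a model that is strong on cardinality but weak on properties cannot compensate one with the other.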

Run Time Performance
The run time as well as the number of generated states for each inference method is shown in Table 2. We report the average time in seconds (s) needed to process a document and characterize the search-space complexity by the average number of generated and evaluated states in thousands (k). All experiments ran on an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz with 16 cores and 120 GB of available RAM; no GPU or other hardware acceleration was used. The table shows that RSS has the lowest complexity in terms of state generation, due to its fixed cardinality and the consequently much smaller search space. In RSS+, we notice a large increase in the number of generated states by a factor of around 7, while the run time rises only by a factor of 2.2 in training and 2.8 at test time. Notably, the number of generated states and the run time at test time decrease from PAR to PAR+, which is probably due to faster model convergence, although the training run time increases.

Table 1: Results of the cardinality baseline(s) and of the inference strategies for joint cardinality and property prediction. We provide macro-F1, precision, and recall averaged over 10 runs with random 80/20 splits.

Conclusion
We have proposed an approach to the task of model-complete text comprehension (MCTC) that relies on a learned model of the posterior distribution over instances of a semantic model given a text, inferring the instance that best captures the meaning of the text. We have relied on CRFs to model the conditional distribution in a factorized way and have empirically investigated the impact of different approximate inference strategies on our problem. Our experiments on the task of predicting the structure of experimental groups from scientific full-text articles describing pre-clinical studies in the field of spinal cord injury show that modeling the MCTC task as a joint inference problem, extracting the cardinality in combination with predicting the properties of the individuals, outperforms a number of reasonable baselines that predict the cardinality alone. In future work, we intend to investigate combinations of our inference strategies, relying on the result state produced by the PAR+ inference strategy to seed the RSS inference method for re-sampling the property values, expecting an overall gain in both cardinality and entity property prediction over either strategy alone.