End-to-End Extraction of Structured Information from Business Documents with Pointer-Generator Networks

The predominant approaches for extracting key information from documents resort to classifiers predicting the information type of each word. However, the word level ground truth used for learning is expensive to obtain since it is not naturally produced by the extraction task. In this paper, we discuss a new method for training extraction models directly from the textual value of information. The extracted information of a document is represented as a sequence of tokens in the XML language. We learn to output this representation with a pointer-generator network that alternately copies the document words carrying information and generates the XML tags delimiting the types of information. The ability of our end-to-end method to retrieve structured information is assessed on a large set of business documents. We show that it performs competitively with a standard word classifier without requiring costly word level supervision.


Introduction
Companies and public administrations are daily confronted with an amount of incoming documents from which they want to extract key information as efficiently as possible. They often face known types of documents such as invoices or purchase orders, thus knowing what information types to extract. However, layouts are highly variable across document issuers as there are no widely adopted specifications constraining the positioning and textual representation of the information within documents. This makes information extraction a challenging task to automate.
In addition to the incremental approaches based on layout identification (d'Andecy et al., 2018;Dhakal et al., 2019), a number of recent works have proposed deep neural models to extract information in documents with yet unseen layouts. Following Palm et al. (2017), most of these layoutfree approaches resort to classifiers that predict the information type of each document word. Yet, the information extraction task does not offer word level ground truth but rather the normalized textual values of each information type (Graliński et al., 2020). The word labels can thus be obtained by matching these textual values with the document words but this process is either time-consuming if manually performed or prone to errors if algorithmically performed. Indeed, extracted information may not appear verbatim in the document as its textual values are normalized. For example, the value "2020-03-30" for the document date field may be derived from the group of words "Mar 30, 2020". This forces the development of domain specific parsers to retrieve the matching words. Also, multiple document words can share the textual value of a extracted field while being semantically distinct, hence imposing additional heuristics for disambiguation. Otherwise, a street number may be wrongly interpreted as a product quantity, inducing noise in the word labels.
To the best of our knowledge, Palm et al. (2019) is the only related model that directly learns from naturally produced extraction results. However, the authors only tackle the recognition of independent and non-recurring fields such as the document date and omit the extraction of structured entities. Such entities are structures composed of multiple field values. Within documents, structured information is often contained in tables. For example, a product entity is usually described in a table row with its field values, such as price and quantity, being in different columns. Our work is intended to remedy this lack by proposing end-to-end methods for processing structured information. As a first step towards full end-to-end extraction, we focus in this paper on the recognition of fields whose values always appears verbatim in the document, thus eliminating the need for normalization operations.
As illustrated in Figure 1, extracted structured information can be represented in a markup lan-(a) Document (b) Extracted information Figure 1: A purchase order (a) and the XML representation of its extracted information (b). In this example, we retrieve the ordered products which are contained in the main table of the document. Two fields are recognized for each product entity: the ID number and the quantity.
guage that describes both its content and its structure. Among many others, we choose the XML language 1 for its simplicity. We define as many XML tag pairs as the number of entity and field types to extract. A pair of opening and closing field tags delimits a list of words constituting a field instance of the corresponding type. Following successful applications of sequenceto-sequence models in many NLP tasks (Otter et al., 2020), we employ a recurrent encoder-decoder architecture for outputting such XML representations. Conditioned on the sequence of words from the document, the decoder emits one token at each time step: either a XML tag or a word belonging to a field value. Since field values are often specific to a document or a issuer, extracted information cannot be generated from a fixed size vocabulary of words. Rather, we make use of pointing abilities of neural models (Vinyals et al., 2015) to copy words of the document that carry relevant information. Specifically, we adapt the Pointer-Generator Network (PGN) developed by See et al. (2017) for text summarization to our extraction needs. We evaluate the resulting model for extracting ordered products from purchase orders. We demonstrate that this end-to-end model performs competitively with a word classifier based model while avoiding 1 https://en.wikipedia.org/wiki/XML to create supervision at the word level.
2 Related Work 2.1 Information extraction As mentioned before, most methods for information extraction in documents take the word labels for granted and rather focus on improving the encoding of the document. Holt and Chisholm (2018) combine heuristic filtering for identifying word candidates and a gradient boosting decision tree for independently scoring them. The strength of their model mainly lies on the wide range of engineered features describing syntactic, semantic, positional and visual content of each word as well as its local context.
When extracting the main fields of invoices and purchase orders, Palm et al. (2017) and Sage et al. (2019) both employ recurrent connections across the document to reinforce correlations between the class predictions of words. They show empirically that Recurrent Neural Networks (RNN) surpass classifiers whose prediction dependence is only due to local context knowledge introduced in the word representations. For this purpose, they arrange the words within a document as a unidimensional sequence and pass the word representations into a bidirectional LTSM (BLSTM) network for field classification. Similar to the state-of-the-art in Named Entity Recognition (Yadav and Bethard, 2018), Jiang et al. (2019) also add a Conditional Random Field (CRF) on top of the BLSTM to refine predictions while extracting information from Chinese contracts.
Yet, unlike plain text, word spacing and alignments in both horizontal and vertical directions convey substantial clues for extracting information of documents. By imposing a spurious unidimensional word order, these architectures significantly favor transmission of context in one direction at the expense of the other. Lately, methods that explicitly consider the two dimensional structure of documents have emerged with two different approaches. Lohani et al. (2018), Liu et al. (2019) and Holeček et al. (2019) represent documents by graphs, with each node corresponding to a word or a group of words and edges either connecting all the nodes or only spatially near neighbors. Convolutional or recurrent mechanisms are then applied to the graph for predicting the field type of each node.
Some authors rather represent a document page as a regular two dimensional grid by downscaling the document image. Each pixel of the grid contains at most one token -either a character or a word -and its associated representation. Then, they employ fully convolutional neural networks to model the document, either with dilated convolutions Palm et al., 2019) or encoder-decoder architectures performing alternately usual and transposed convolutions (Katti et al., 2018;Denk and Reisswig, 2019;Dang and Thanh, 2019). Finally, all these works except Palm et al. (2019) output a segmentation mask representing the probabilities that each token contained in a pixel of the grid belong to the field types to extract. Katti et al. (2018) and Denk and Reisswig (2019) additionally tackle tabular data extraction by predicting the coordinates of the table rows bounding boxes to identify the invoiced products.
Instead of directly classifying each word of the document, Palm et al. (2019) output attention scores to measure the relevance of each word given the field type to extract. The relevant words are then copied and fed to learned neural parsers to generate a normalized string corresponding to the expected value of the field. The predicted string is measured by exact match with the ground truth. Evaluated on 7 fields types of invoices, their end-to-end method outperforms a logistic regression based model whose word labels are derived from end-to-end ground truth using heuristics. However, their approach cannot extract structured information such as the invoiced products.
Although there are publicly released datasets for the task of information extraction in documents (Jiang et al., 2019;Graliński et al., 2020), as far as we know, none of them are annotated to recognize structured data.

Structured language generation
A number of works prove that neural encoderdecoder models can produce well-formed and welltyped sequences in a structured language without supplying an explicit grammar of the language.
Extending traditional text recognition, some authors transform images of tables (Zhong et al., 2019;Deng et al., 2019) and mathematical formulas (Deng et al., 2017;Wu et al., 2018) into their LaTeX or HTML representations. After applying a convolutional encoder to the input image, they use a forward RNN based decoder to generate tokens in the target language. The decoder is enhanced with an attention mechanism over the final feature maps to help focusing on the image part that is recognized at the current time step.
Neural encoder-decoder architectures have also been used for semantic parsing which aims at converting natural language utterances to formal meaning representations (Dong and Lapata, 2016;Rabinovich et al., 2017). The representations may be an executable language such as SQL and Prolog or more abstract representations like abstract syntax trees. Text being the modality of both input and output sequences, Jia and Liang (2016), Zhong et al. (2017) and McCann et al. (2018) include attentionbased copying abilities in their neural model to efficiently produce the rare or out-of-vocabulary words.

Approach
We assume that the text of a document is already transcribed before extracting its information. For scanned documents, we employ a commercial Optical Character Recognition (OCR) engine for retrieving the text.
The method we propose for extracting structured information from a document is depicted in  articles. The attention-based pointing mechanism allows to accurately reproduce factual information of articles by copying words that are not in the generator's vocabulary, e.g. rare proper nouns. Similarly, we take advantage of its pointing ability to copy the words from the document which carry relevant information while allowing the generator to produce the XML tags which structure the extracted information. In the following subsections, we describe in details our model and highlight key differences with the original PGN.

Word representation
Each word w i of the document is represented by a vector denoted r i . In complement to the word level embeddings used by See et al. (2017), we enrich representations with additional textual features to cope with the open vocabulary observed within the corpus of documents. First, we follow the C2W model of Ling et al. (2015) to form a textual representation q c i at the character level. To that end, we apply a BLSTM layer over the dense embed-dings associated to the characters of the word and concatenate the last hidden state in both directions. We also add the number n i of characters in the word and case features, i.e. the percentage α i of its characters in upper case and a binary factor β i indicating if it has a title form. We concatenate all these features to form the textual component r t i of the word representation: where q w i is its word level embedding. To take into account the document layout, we also compute spatial features r s i of the word. These encompass the coordinates of the top-left and bottom-right edges of the word bounding box, normalized by the height and width of the page. We concatenate the spatial r s i and textual r t i components to build the word representation r i .

Encoder
The words of the document are organized as a unidimensional sequence of length N by reading them in a top-left to bottom-right order. The word representations {r i } i=1..N are then fed to a two-layer BLSTM to obtain contextualized representations through the encoder hidden states {h i } i=1..N .

Decoder
Decoding is performed by a two-layer forward LSTM, producing a hidden state s t at each time step t. An attention mechanism is added on top of the decoder to compute the attention distribution a t over the document words and the context vector See et al. (2017) use the alignment function of Bahdanau et al. (2015), we employ the general form of Luong et al. (2015) as this is computationally less expensive while showing similar performances: where W a is a matrix of learnable parameters. We simplify the computing of the vocabulary distribution P vocab as the generator is only in charge of producing the XML tags and thus has a vocabulary of limited size. We apply a unique dense layer instead of two and do not involve the context vector h * t in the expression of P vocab : where V and b are learnable parameters The generation probability p gen ∈ [0, 1] for choosing between generating XML tags versus copying words from the document is computed as follows: where x t is the decoder input, vectors w h , w s , w x and scalar b ptr are learnable parameters and σ is the sigmoid function. Then, p gen weights the sum of the attention and vocabulary distributions to obtain the final distribution P (w) over the extended vocabulary, i.e. the union of all XML tags and unique textual values from the document words: Note that if a textual value appears multiple times in the document, the attention weights of all the corresponding words are summed for calculating its probability of being copied.
During training, the decoder input x t is the previous token of the ground truth sequence, while in inference mode, the previous token emitted by the decoder is used. An input token is either represented by a dense embedding if the token is a XML tag or by the textual feature set r t i of the corresponding words {w i } if the token is copied from the document.
To help the model keeping track of words already copied, we concatenate the previous context vector h * t−1 with the input representation x t before applying the first decoder LSTM layer (Luong et al., 2015). We also employ the coverage mechanism proposed in See et al. (2017) in order to reduce repetitions in the generated sequences. The idea is to combine the attention distributions of the previous time steps in the coverage vector c t = t−1 t =1 a t to compute the current attention distribution. We adapt their mechanism to our alignment function, thus changing the equation 2 to: where w c is a vector of adjustable parameters. The training loss is the combination of the negative log-likelihood of the target tokens {w * t } t=1..T and the coverage loss which penalizes the model for repeatedly paying attention to the same words: where λ is a scalar hyperparameter. When the decoding stage is performed, the resulting string is parsed according to the XML syntax to retrieve all the predicted entities and fields of the document.

Dataset
We train and evaluate our extraction model on a dataset of real world business documents which unfortunately cannot be publicly released. It consists of 219,476 purchase orders emanated by 17,664 issuers between April 2017 and May 2018. The dataset is multilingual and multicultural even if the documents mainly originate from the U.S. The number of purchase orders per issuer is at least 3 and at most 31, ensuring diversity of document layouts. Training, validation and test sets have distinct issuers to assess the ability of the model to generalize to unseen layouts. They have been constructed by randomly picking 70 %, 10 % and 20 % of the issuers, respectively. More detailed statistics of the dataset are given in the Table 1. This dataset comes from a larger corpus of documents with their information extraction results that have been validated by end users of a commercial document automation software. Among all the types of information, we focus on the extraction of the ordered product entities which have two mandatory fields: ID number and quantity. From this corpus, we select the purchase orders whose location in the document is supplied for all its field instances. The knowledge of location comes from a layout-based incremental extraction system and ensures that we perfectly construct the labels for training a word classifier. Since a field instance can be composed of multiple words, we adopt the IOB (Inside, Outside, Beginning) tagging scheme of Ramshaw and Marcus (1999) for defining the field type of each document word.

Experiments
Our end-to-end model is compared on this dataset with a baseline extraction method based on a word classifier. This baseline encodes the document as the end-to-end model does, i.e. with the same operations for constructing the word representations r i and the encoder outputs h i . On top of the encoder, a dense layer with softmax activation is added with 5 output units. 4 of these units refer to the beginning and continuation of an instance for ID number and quantity fields. The remaining unit is dedicated to the Outside class, i.e. for the document words carrying information that we do not want to extract. The words with a predicted probability above 0.5 for one of the 4 field units are associ-ated with the corresponding class, otherwise we attribute the Outside class. Field instances are then constructed by merging words with beginning and continuing classes of the same field type. Finally, each quantity instance is paired with an ID number instance to form the product entities. To do so, the Hungarian algorithm (Kuhn, 1955) solves a linear sum assignment problem with the vertical distance on the document as the matching cost between two field instances. For our task, this pairing strategy is flawless if the field instances are perfectly extracted by the word classifier.
The model hyperparameters are chosen according to the micro averaged gain on the validation set. The end-to-end model and baseline share the same hyperparameter values, except the number of BLSTM cells in each encoder layer that is fixed to 128 and 256 respectively, to ensure similar numbers of trainable parameters. The input character and word vocabularies are derived from the training set. We consider all observed characters while we follow the word vocabulary construction of Sage et al. (2019) designed for business documents. This results in vocabularies of respectively 5,592 and 25,677 elements. Their embedding has a size of 16 and 32 and are trained from scratch. The BLSTM layer iterating over characters of document words has 32 cells. For all BLSTM layers, each direction has n/2 LSTM cells and their output are concatenated to form n-dimensional vectors. The decoder layers have a size of 128 and are initialized by the last states of the encoding BLSTM layers. At inference time, we decode with a beam search of width 3 and we set the maximum length of the output sequence to the number of words in the document. This results in 1,400,908 and 1,515,733 trainable parameters for the PGN and the word classifier.
To deal with exploding gradients, we apply gradient norm clipping (Pascanu et al., 2013) with a clipping threshold of 5. The loss is minimized with the Adam optimizer, its learning rate is fixed to 0.001 the first 2 epochs and then exponentially decreases by a factor of 0.8. We stop the training when the micro gain on the validation set has not improved in the last 3 epochs. As suggested in See et al. (2017), the coverage loss is added to the minimized loss only at the end of training, for one additional epoch. We weight its contribution by setting λ = 0.1 as the original value of 1 makes the negative log-likelihood loss increase. The batch size is 8 if the model fits on GPU RAM, 4 other-wise.
The experiments are carried out on a single NVIDIA TITAN X GPU. Model training takes from 3 to 10 days for 10 to 15 epochs. Due to the computational burden, the hyperparameters values have not been optimized thoroughly. Besides, we are not able to train the models on documents with more than 1800 words, which amounts to about 4 % of the training set being put aside. Yet, we evaluate the models on all documents of the validation and test sets. The implementation is based on the seq2seq subpackage of TensorFlow Addons (Luong et al., 2017).

Manual post-processing cost
We evaluate the models by measuring how much work is saved by using them rather than manually doing the extraction. For this purpose, we first assign the predicted products of a document to the ground truth entities, then we count the number of deletions, insertions and modifications to match the ground truth field instances from the predicted instances that have been assigned. The modification counter is incremented by one when a predicted field value and its target do not exactly match. For a given field, we estimate the manual post-processing gain with the following edit distance: 1 − # deletions + # insertions + # modifications N (10) where N is the number of ground truth instances in the document for this field. Micro averaged gain is calculated by summing the error counters of ID number and quantity fields and applying equation 10. We select the assignment between predicted and target entities that maximizes the micro gain of the document. To assess the post-processing gains across a set of documents, we sum the counters of each document before using equation 10.
Our evaluation methodology is closely related to Katti et al. (2018). However, they compute metrics independently for each field while we take into account the structure of entities in our evaluation.
We report in Table 2 the results of both extraction models on the test set. We retain the best epoch of each model according to the validation micro gain. All post-processing gains have positive values, meaning that it is more efficient to correct potential errors of models than manually perform the extraction from scratch (in this case, # insertions = N and # deletions = # modifications = 0). We note that the performances of the word classifier and PGN are quite similar. Even if its field level gains are a little behind, the PGN slightly surpasses the word classifier for recognizing whole documents. Both models significantly reduce human efforts as the end users do not have any corrections to make for more than 2 out of 3 documents. Besides, the PGN produces sequences that are well-formed according to the XML syntax for more than 99.5 % of the test documents.

Visual inspection of the attention mechanism
The comparison with the baseline confirms that the PGN has learned to produce relevant attention distributions in order to copy words carrying useful information. In particular, when the expected field value appears multiple times in the document, the PGN is able to localize the occurrence that is semantically correct, as illustrated in the document displayed in Figure 3. As shown, the PGN focuses its attention on the word 1 in the table row of the product that is currently recognized. On the contrary, the model ignores the occurrences of 1 which are contained in the rest of the product table and in the address blocks. This behaviour is noteworthy since the model is not explicitly taught to perform this disambiguation.

Discussion
The main difficulty faced by both models is ambiguity in the ground truth as our dataset has been annotated by users from many distinct companies. Some documents contain multiple valid values for a field of a unique product. For example, there may be the references from both recipient and issuer for the ID number. The field value which is retained as ground truth then depends on further processing of the extracted information, e.g. integration into a Enterprise Resource Planning (ERP) system. This seriously prevents any extraction model from reaching the upper bound of post-processing gain metrics which is 1.
Besides, the ID number field does not always have a dedicated physical column and rather appears within the description column, without keywords clearly introducing the field instances such as in Figure 1. Also, its instances are constituted on average of more words than the quantity, making less likely the exact match between predicted and target instances. These additional complications explain the gap of model performances between the two fields.
Unlike the word classifier based approach, the PGN tends to repeat itself by duplicating some field instances and skipping others. This is especially observed for documents having a large number of products, therefore large output sequences. To mea-sure the impact of these repetitions on metrics, we split the test set into 3 subsets according to the number of products contained in the document: no more than 3, between 4 and 14 and at least 15 entities. The last subset gathers documents with output sequences of at least 122 tokens. We recompute the metrics for each subset and report the micro averaged gains in Table 3. The performances are stable for the word classifier whatever the number of entities in the document. The PGN is on par with the word classifier for the documents with a small number of products which constitute the vast majority of the dataset. However, its extraction performance greatly declines for large output sequences, indicating that the PGN is more affected by repetitions than the baseline. It is unclear why the coverage mechanism is not as successful on our task as it is for abstractive summarization (See et al., 2017). We also tried to use the temporal attention from Paulus et al. (2018) to avoid copying the same words multiple times but this was unsuccessful too.

Conclusion
We discussed a novel method based on pointergenerator networks for extracting structured information from documents. We showed that learning directly from the textual value of information is a viable alternative to the costly word level supervision commonly used in information extraction. In this work, we focused on purchase orders but the approach could be used to extract any structured entity as long as its information type is known at training time.
Future work should aim to: i) reduce repetitions in the output sequences, ii) add parsing abilities into our encoder-decoder in order to transform the values of copied words. This will allow to process fields that need to be normalized when being extracted.