Form2Seq : A Framework for Higher-Order Form Structure Extraction

Document structure extraction has been a widely researched area for decades, with recent works performing it as a semantic segmentation task over document images using fully-convolutional networks. Such methods are limited by image resolution, due to which they fail to disambiguate structures in dense regions, which appear commonly in forms. To mitigate this, we propose Form2Seq, a novel sequence-to-sequence (Seq2Seq) inspired framework for structure extraction using text, with a specific focus on forms, which leverages the relative spatial arrangement of structures. We discuss two tasks: 1) classification of low-level constituent elements (TextBlock and empty fillable Widget) into ten types such as field captions, list items, and others; 2) grouping lower-level elements into higher-order constructs, such as Text Fields, ChoiceFields and ChoiceGroups, used as information collection mechanisms in forms. To achieve this, we arrange the constituent elements linearly in natural reading order and feed their spatial and textual representations to a Seq2Seq framework, which sequentially outputs a prediction for each element depending on the final task. We modify Seq2Seq for the grouping task and discuss improvements obtained through cascaded end-to-end training of the two tasks versus training in isolation. Experimental results show the effectiveness of our text-based approach, which achieves an accuracy of 90% on the classification task and F1 scores of 75.82, 86.01 and 61.63 respectively on the groups discussed above, outperforming segmentation baselines. Further, we show our framework achieves state-of-the-art results for table structure recognition on the ICDAR 2013 dataset.

Structure extraction is necessary for digitizing documents to make them re-flowable and index-able, which is useful in web-based services (Alam and Rahman, 2003; Gupta et al., 2007; Khemakhem et al., 2018; Rahman and Alam, 2003). In this work, we look at a complex class of documents, forms, which are used to capture user data by organizations across various domains such as government services, finance, administration, and healthcare. Industries that have been using paper or PDF forms would want to convert them into an appropriate digitized version such as HTML (Rahman and Alam, 2003). Once these forms are made re-flowable, they can be made available across devices with different form factors (Alam and Rahman, 2003; Gupta et al., 2007). This facilitates better form filling experiences and increases the ease of doing business, since users can interact with forms more conveniently; it also enables capabilities such as improved handling of filled data, validation checks on data filled in fields, and consistent form design control.
To enable dynamic rendering of a form while re-flowing it, we need to extract its structure at multiple levels of hierarchy. We define a TextBlock to be a logical block of self-contained text. Widgets are spaces provided to fill in information. Some low level elementary structures such as text and widgets can be extracted from the form PDF using the auto-tagging capabilities of tools like Acrobat. However, such PDFs do not contain data about higher-order structures such as Text Fields, ChoiceGroups etc. Document structure extraction has been studied extensively, with recent works employing deep learning based fully convolutional networks (He et al., 2017; Wick and Puppe, 2018; Yang et al., 2017) that perform semantic segmentation (Long et al., 2015; Chen et al., 2014; Noh et al., 2015) on the document image. Such methods perform well at extracting coarser structures but fail to extract closely spaced structures in form images (as discussed in the Experiments section). As image resolution increases, the number of activations (forward pass) and gradients (backward pass) at each network layer grows, requiring more GPU memory during training. Since GPU memory is limited, these methods downscale the original image at the input layer, which makes it difficult to disambiguate closely spaced structures, especially in dense regions (occurring commonly in forms), leading to merged predictions.
Figure 1 shows different types of TextBlocks, Widgets and higher order groups. Given text blocks and widgets as input, our Form2Seq framework classifies them into different type categories. We hypothesize that type classification of lower level elements can provide useful cues for extracting the higher order constructs they comprise. We establish our hypothesis on the task of extracting ChoiceGroups, Text Fields and Choice Fields. A Text Field is composed of a textblock (textual caption) and associated widgets, as shown in figure 1. A choice group is a collection of boolean fields called choice fields, with an optional title text (choice group title) that gives instructions regarding filling it. We study fillable constructs as they are intrinsic and unique to forms and contain diverse elementary structures.
The spatial arrangement of lower level elements with respect to other elements in a form is correlated with the type of construct. For instance, a list item usually follows a bullet in the reading order; field widgets are located near the field caption. Similarly, elements that are part of the same higher-order group tend to be arranged in a spatially co-located manner. To leverage this in our Form2Seq framework, we follow a bottom-up approach where we first classify lower level elements into different types. We arrange these elements in natural reading order to obtain a linear sequence. This sequence is fed to a Seq2Seq model (Sutskever et al., 2014) where each element's textual and spatial representation is passed through a BiLSTM. The output of the BiLSTM for each element is sequentially given as input to an LSTM (Hochreiter and Schmidhuber, 1997) based decoder which is trained to predict the category type. For the grouping task, we modify the framework to predict the id of the group each lower level element is part of; here the model is trained to predict the same group id for elements that are part of the same group. Our contributions can be listed as:
• We propose the Form2Seq framework for form structure extraction, specifically for the tasks of element type classification and higher order group extraction.
• We show the effectiveness of end-to-end training of both tasks through our proposed framework over performing group extraction alone.
• We perform ablations to establish the role of text in improving performance on both tasks. Our approach outperforms image segmentation baselines.
• Further, we perform table structure recognition by grouping table text into rows and columns, achieving state-of-the-art results on the ICDAR 2013 dataset.

Related Work
Earlier works on document layout analysis were mostly rule based, relying on hand-crafted features for extracting coarser structures such as graphics and text paragraphs (Lebourgeois et al., 1992). Approaches like connected components were also used for extracting text areas (Ha et al., 1995a) and physical layouts (Simon et al., 1997). These approaches can be classified as top-down (Ha et al., 1995b) or bottom-up (Drivas and Amin, 1995). Bottom-up methods focus on extracting text lines and aggregating them into paragraphs; top-down approaches detect layout by subdividing the page into blocks and columns.
With the advancement of deep learning, recent approaches have mostly been based on fully convolutional networks (FCNs), which eliminate the need for designing complex heuristics (Yang et al., 2017; He et al., 2017; Wick and Puppe, 2018). FCNs were successfully trained for semantic segmentation (Long et al., 2015), which has now become a common technique for page segmentation. Their high level feature representations make FCNs effective for pixel-wise prediction. FCNs have been used to locate and recognize handwritten annotations, particularly in historical documents (Kölsch et al., 2018). Wigington et al. proposed a model that jointly learns handwritten text detection and recognition using a region proposal network that detects text start positions and a line-follow module which incrementally predicts the text line to be subsequently used for reading.
Several methods have addressed regions in documents other than text, such as tables and figures. Initial deep learning work that achieved success in table detection relied on selecting table-like regions on the basis of loose rules, which are subsequently filtered by a CNN (Hao et al., 2016). He et al. proposed a multi-scale, multi-task FCN comprising two branches to detect contours in addition to a page segmentation output that includes tables; they additionally use a CRF (Conditional Random Field) to smooth the segmented output. However, segmentation based methods fail to disambiguate closely spaced structures in form images due to resolution limitations, as discussed in the experiments section. Graliński et al. introduced the new task of recognising only useful entities in long documents on two new datasets. FUNSD (Jaume et al., 2019) is a small-scale dataset for form understanding comprising 200 annotated forms. In comparison, our Forms Dataset is much larger, with a richer set of annotations. For the task of figure extraction from scientific documents, Siegel et al. (2018) introduced a large scale dataset comprising 5.5 million document labels. They find bounding boxes for figures in PDFs by training Overfeat (Sermanet et al., 2013) on image embeddings generated using ResNet-101.
A few works have explored alternate input modalities, such as text, for other document related tasks. Extracting pre-defined and commonly occurring named entities from invoice-like documents (using text and box coordinates) has been the main focus of some prior works (Katti et al., 2018; Liu et al., 2019; Denk and Reisswig, 2019; Majumder et al., 2020). Text and document layouts have been used for learning BERT-like representations (Devlin et al., 2019) through pre-training, which are then combined with image features for information extraction from documents (Xu et al., 2020; Garncarek et al., 2020). However, our work focuses on extracting a much more generic, diverse, complex, dense, and hierarchical document structure from forms. Document classification is a partly related problem that has been studied using CNN-only approaches for document verification (Sicre et al., 2017). Yang et al. designed HAN, which hierarchically builds sentence embeddings and then a document representation using a multi-level attention mechanism. Other works explored multi-modal approaches, using MobileNet (Howard et al., 2017) and FastText (Bojanowski et al., 2017) to extract visual and text features respectively, which are combined in different ways (such as concatenation) for document classification (Audebert et al., 2020). In contrast, we tackle a different task, form layout extraction, which requires recognising different structures.
Yang et al. also proposed a multimodal FCN (MFCN) to segment figures, tables, lists etc. in addition to paragraphs from documents; they concatenate a text embedding map to the feature volume. We consider image based semantic segmentation approaches as baselines for the proposed tasks, comparing the performance of our approach with 1) their FCN based method and 2) DeepLabV3+ (Chen et al., 2018), a state-of-the-art deep learning model for semantic segmentation.

Methodology
The spatial arrangement of a lower level element among its neighbouring elements depends on the class of the element. For instance, a list item usually follows a bullet in the reading order. Similarly, elements that are part of the same higher-order group tend to be arranged in a spatially co-located pattern. To leverage the relative spatial arrangement of all elements in a form together, we arrange them according to a natural reading order (left to right and top to bottom), encode their context aware representations sequentially using text and spatial coordinates, and use these representations for prediction. For each task, the decoder predicts the output for each element sequentially, conditioning on the outputs of the elements before it in the sequence in an auto-regressive manner (as in sentence generation in NLP). For the group extraction task, our model assigns a group id to each element conditioned on the ids predicted for previous elements. This is essential to predict the correct group id for the current element (for instance, to assign the same group id to elements that are part of the same group).
Let a form comprise a list of TextBlocks (f_t) and a list of widgets (f_w). We arrange f_e = f_t ∪ f_w according to natural reading order to obtain the arranged sequence a_e, which is used as input for both tasks ('A' in figure 2).
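The natural reading-order arrangement can be sketched as below. This is a minimal illustration, not the paper's implementation: elements are assumed to carry top-left pixel coordinates, and the `row_tol` banding threshold is a hypothetical parameter for deciding when two elements share a row.

```python
def reading_order(elements, row_tol=10):
    """Arrange elements left-to-right within rows, rows top-to-bottom.

    elements: list of dicts with pixel coords 'x', 'y' (top-left origin).
    row_tol: hypothetical tolerance (px) for grouping elements into one row.
    """
    # Sort by vertical position first, then band consecutive elements into rows.
    by_y = sorted(elements, key=lambda e: e["y"])
    rows, current = [], [by_y[0]]
    for e in by_y[1:]:
        if abs(e["y"] - current[-1]["y"]) <= row_tol:
            current.append(e)
        else:
            rows.append(current)
            current = [e]
    rows.append(current)
    # Within each row, order elements left to right.
    ordered = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda e: e["x"]))
    return ordered
```

The resulting linear sequence is what the encoder consumes; any comparable left-to-right, top-to-bottom ordering would serve the same purpose.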

Element Type Classification
Let t_a and s_a be the lists of text content and spatial coordinates (x, y, w, h) corresponding to a_e, where x and y are pixel coordinates from the top left corner of the image, and w & h denote the width and height of an element respectively. Our type classification model comprises three sub-modules: a Text Encoder (TE) which encodes the text representation of each element, a Context Encoder (CE) which produces a context aware embedding for each element in the sequence, and a Type Decoder (TD) which sequentially predicts the type output. We discuss each of these modules in detail.
Text Encoder: Consider an element {a_e}_i having text {t_a}_i comprising words {w_i1, w_i2, ..., w_in}.
Since the text information is obtained from PDF content, the words often contain noise, making the use of standard word vectors difficult. To mitigate this, we obtain word embeddings using the python library chars2vec. This gives a sequence of embeddings {we_i1, we_i2, ..., we_in} which is given as input to an LSTM, TE_θ1, that processes the word embeddings such that the cell state {c_t}_i after processing the last word is used as the text representation for {a_e}_i ('B' in figure 2). A widget's textual representation is taken as a vector of zeros.
Context Encoder: Consider a sequence element {a_e}_i with corresponding textual representation {c_t}_i and spatial coordinates {s_a}_i. These are concatenated ('C' in figure 2) to obtain {e}_i representing the element. The sequence e is given as input to a BiLSTM, CE_θ2, which produces a context aware embedding {b}_i for each element in the sequence ('D' in figure 2).
Type Decoder: The output from the previous stage is given as input to a final LSTM based decoder, TD_θ3, that sequentially outputs the category type for each element ('F' in figure 2). Specifically, the decoder at time step i is given input {b}_i to predict the type class of the i-th element. Additionally, we use the Bahdanau attention mechanism (Bahdanau et al., 2014) to make TD_θ3 attend on a context memory M ('E' in figure 2) at each decoding time step, where M is obtained by stacking {b_1; b_2; ...} column-wise. This makes it easier for the decoder to focus on specific elements in the sequence while predicting the type of the current element, since element sequences in forms tend to be very long. A linear layer with softmax activation over the decoder outputs performs the type classification.
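The additive (Bahdanau) attention used by the type decoder can be sketched as follows. This is a NumPy illustration of the scoring and weighting only; the dimensions and random projection matrices are illustrative, not the paper's trained parameters (the paper's attention layer is 500-dimensional).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bahdanau_attention(query, memory, W_q, W_m, v):
    """Additive attention of a decoder state over context memory M.

    query:  decoder state at the current step, shape (d_q,)
    memory: stacked context-encoder outputs M, shape (T, d_m)
    W_q, W_m, v: learned projections (random here, for illustration)
    Returns attention weights (T,) and the context vector (d_m,).
    """
    # score_t = v . tanh(W_q q + W_m m_t) for each memory slot t
    scores = np.tanh(memory @ W_m.T + query @ W_q.T) @ v
    weights = softmax(scores)
    # Context vector is the attention-weighted sum of memory slots.
    context = weights @ memory
    return weights, context
```

In the model, the context vector is combined with the decoder state before the final projection; the exact combination follows Bahdanau et al. (2014).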
We train all three modules, TE_θ1, CE_θ2 and TD_θ3, together using the teacher forcing technique (Williams and Zipser, 1989) and standard cross entropy loss.

Higher Order Group Identification
Our second task is to identify larger groups. Consider one such group, the ChoiceGroup, comprising a collection of TextBlocks and Widgets having different semantics (illustrated in figure 1). A ChoiceGroup contains 1) an optional choice group title which contains details and instructions regarding filling it, and 2) a collection of choice fields, which are boolean fields such that each field comprises a textual caption (choice field caption) and one or more choice field widgets. We formulate target label prediction for this task as predicting a cluster/group id for each element. Consider the element sequence a_e such that elements {{a_e}_i1, {a_e}_i2, ...} are part of a group. We assign this group a unique number and train the model to predict the same group number for each of these elements. Elements that are not part of any group are assigned a reserved group id, 0.
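The target encoding just described can be sketched as below, assuming each ground-truth group is given as a set of element indices in the reading-order sequence. Numbering groups by their first element's position is an assumed convention, not stated in the paper; the key property is only that members of one group share one id and ungrouped elements get the reserved id 0.

```python
def group_id_targets(num_elements, groups):
    """Build per-element group-id labels for the grouping decoder.

    num_elements: length of the reading-order element sequence
    groups: list of collections of element indices, one per group
    Elements not covered by any group get the reserved id 0.
    """
    labels = [0] * num_elements
    # Assign ids 1..k in order of each group's first element (assumed convention).
    ordered = sorted(groups, key=min)
    for gid, members in enumerate(ordered, start=1):
        for idx in members:
            labels[idx] = gid
    return labels
```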
We adopt a model similar to the one used for type classification, except that instead of the type decoder we have a Group Decoder (GD_θ4) whose projection layer classifies each element into one of the groups. We hypothesize that the category type of elements can be a useful clue for the group decoder. To leverage the type information, we study a variant of our model, the Cascaded Model, with a common text encoder but separate context encoders, CE_T & CE_G, and decoders, TD & GD, for the two tasks. Specifically, given a sequence of elements a_e with combined textual and spatial representations e ('C' in figure 3), we first feed them into the type context encoder (CE_T, 'D' in figure 3) and type decoder (TD, 'F' in figure 3) as before to obtain a decoder output sequence d_t for each element. We modify the output types to categories relevant to the grouping task: ChoiceGroup Title, TextField Caption, ChoiceField Caption, ChoiceWidget, Text Widget, and other TextBlocks. Since an element can be part of a field which is contained in a choice group, we use two separate FC layers on the decoder output to predict separate group ids for the element when determining choice groups and fields.
The TD outputs are concatenated with e for each element ('G' in figure 3) and given as input to the group context encoder CE_G to obtain the contextual output sequence b_t ('H' in figure 3). The group decoder GD ('J' in figure 3) uses the sequence b_t as input and as attention memory ('I' in figure 3) during decoding. For d_t, we purposely use the outputs of the type decoder LSTM and not the final type projection layer outputs, as determined empirically in the experiments section. All five modules, TE, CE_T, TD, CE_G and GD, are trained end-to-end for both tasks simultaneously. For table structure recognition, we follow the same experimental setup as (Siddiqui et al., 2019) and compare the performance of our approach with them.

Implementation Details
For the text encoder TE, we cap the text in a TextBlock at a maximum of 200 words. We use a chars2vec model which outputs a 100-dimensional embedding for each word, and fix the LSTM size to 100. For type classification, we use a hidden size of 500 for both the forward and backward LSTMs in CE_T and a hidden size of 1000 for the decoder TD, with the size of the attention layer kept at 500. We tune all hyper-parameters manually based on validation set performance. The final type projection layer classifies each element into one of 10 categories. For the grouping task, both the isolated and cascaded models have exactly the same configuration for CE_G and GD as for the type modules. For the cascaded model, the type projection layer classifies each element into the relevant type categories, as discussed in the Methodology section. We train all models using the Adam optimizer (Kingma and Ba, 2014) at a learning rate of 1 × 10^-3 on a single Nvidia 1080Ti GPU, using the largest batch size (8) that fits in memory.

Results and Discussion
Type Classification: Results for type classification are summarized in Table 1, with our best model achieving an overall accuracy of 90.06%. The accuracy for SectionTitle improves substantially from 57.02% to 67.48%, with improvements of 0.99%, 2.12%, 0.98%, 2.6%, 1.97% and 3.13% for ChoiceWidget, ChoiceGroupTitle, ChoiceCaption, HeaderTitle, Bullet and StaticText respectively.
Group Identification: We report precision and recall numbers for the task of group extraction. Segmentation methods commonly use area-overlap thresholds such as Intersection over Union (IoU) when matching expected and predicted structures (we evaluate baselines with an IoU threshold of 0.4). For our method, given a set of ground truth groups {g_1, g_2, g_3, ..., g_m} and a set of predicted groups {p_1, p_2, p_3, ..., p_k}, we say a group p_i matches g_j iff the former contains exactly the same TextBlocks and Widgets as the latter. This takes into account all the lower elements which constitute the group (necessary to measure structure extraction performance). Thus, this metric is stricter than IoU based measures at any threshold, since a group predicted by our method and evaluated as correct implies that the bounding box of the prediction (obtained by taking the union of its elements) exactly overlaps the expected group. We first analyse the performance of our method on extracting choice groups. We consider different variants of our approach: 1) model A_G: grouping in isolation; 2) model B_G: both type and grouping tasks trained simultaneously with a shared context encoder, where the type decoder attends on the context encoder outputs while the group decoder attends on the context encoder outputs and type decoder outputs separately; 3) model C_G: type identification trained separately, with its classification outputs given as input to the group context encoder non-differentiably; 4) model D_G: same as B_G except with separate context encoders for the two tasks, and softmax outputs concatenated with the textual and spatial vectors as input to the group context encoder; 5) model E_G: same as D_G except that instead of softmax outputs, the type decoder LSTM outputs are used; and 6) model F_G (noText): same as E_G except that only spatial coordinates with an isText signal are used as input.
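The exact-match criterion above amounts to a simple set comparison. A minimal sketch of the metric as described, with each group represented as the set of ids of its constituent TextBlocks and Widgets:

```python
def group_prf(gt_groups, pred_groups):
    """Precision, recall and F1 under the exact-match criterion: a
    predicted group is correct iff it contains exactly the same
    elements as some ground-truth group."""
    gt = {frozenset(g) for g in gt_groups}
    pred = {frozenset(p) for p in pred_groups}
    matched = len(gt & pred)  # groups that match element-for-element
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(gt) if gt else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Note there is no partial credit: a predicted group missing or adding even one element counts as a miss, which is what makes this stricter than any IoU threshold.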

Model
Baseline Comparison: We compare with the semantic segmentation baselines, MFCN (Yang et al., 2017) and DeepLabV3+ (Chen et al., 2018). For a fair comparison, we implement two variants of each baseline: 1) only the form image is given as input; 2) textblock and widget masks are given as prior inputs along with the image. We train both variants with an aspect-ratio-preserving resize of the form image to 792x792. For MFCN, the losses for different classes are scaled according to pixel area, as described in their work. To classify the type of an element, we post-process the baselines' prediction masks by performing a majority vote among the pixels contained inside that element. For MFCN, the without-prior variant performed better, unlike DLV3+; we report metrics corresponding to the better variant. As can be seen in Table 2, our method outperforms both baselines.
Extracting Higher Order Constructs Simultaneously: We train our model to detect choice groups, text fields and choice fields together. To enable the baseline methods to segment these hierarchical and overlapping structures simultaneously in separate masks, we use separate prediction heads on the penultimate layer's output. Table 3 shows the results obtained. Our method works consistently well for all the structures, outperforming the baselines.
Table Structure Recognition: We train our model to predict the same group id for texts present in the same row, and simultaneously detect columns in a similar manner using a separate prediction head. As a post-processing step, we consider different sets of texts which are aligned vertically (sharing a common horizontal span along the x-axis). We then consider the column group ids predicted by the model and assign the majority column id (determined for a set using the texts present in it) to all the texts in the set. The re-assigned ids are then used to determine different groups of texts to recognise columns. We perform similar processing while determining the final rows. Siddiqui et al. proposed to perform this task through constrained semantic segmentation, achieving state-of-the-art results. Table 4 summarises the results obtained and compares our approach with (Siddiqui et al., 2019), showing our method obtains a better F1 score for rows, columns and the average metric (as used and reported in their paper).
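The majority-id reassignment in the post-processing step above can be sketched as follows. The representation is illustrative: vertically aligned sets are given as lists of text indices, and predicted column ids as a dict.

```python
from collections import Counter

def reassign_majority_column_ids(aligned_sets, col_ids):
    """For each vertically aligned set of texts, reassign every member
    the majority column id predicted among the set.

    aligned_sets: list of lists of text indices (vertically aligned sets)
    col_ids: dict mapping text index -> predicted column id
    Returns a new dict of reassigned column ids.
    """
    out = dict(col_ids)
    for members in aligned_sets:
        # Majority vote over the model's predictions within the set.
        majority = Counter(col_ids[i] for i in members).most_common(1)[0][0]
        for i in members:
            out[i] = majority
    return out
```

The same routine applies symmetrically when determining the final rows, per the description above.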

Conclusion
We present an NLP-based Form2Seq framework for form document structure extraction. Our proposed model uses only the lower level elements, textblocks & widgets, without using the visual modality. We discuss two tasks: element type classification and grouping into larger constructs. We establish the improvement in performance obtained through text information and joint training of the two tasks. We show that our model performs better than current semantic segmentation approaches. Further, we also perform table structure recognition (grouping texts present in a table into rows and columns), achieving state-of-the-art results. We are also releasing a part of our forms dataset to aid further research in this direction.

Figure 1: Different types of TextBlocks (blue), Widgets (red) & higher order groups (orange): ChoiceGroups, Choice Fields and Text Fields in a form. A text field comprises 1) a textblock (referred to as the text field caption) that describes what to fill & 2) a collection of widgets (text widgets). A choice group comprises a title & a collection of choice fields. Textblocks & widgets are classified into different types based on the higher order group they are part of.

Figure 2: Model architecture for element type classification. Different stages are annotated with letters.

Figure 3: Architecture of our best performing model for group extraction, leveraging the type model shown in figure 2.

Figure 4: Predictions for a form snippet: adding text input helps Form2Seq identify the title, which improves grouping.

Figure 5: Examples of type classification (left) and choice group extraction (right). The top row shows the form (A) and our outputs (B). For type predictions, we visualise our classification outputs as a mask for ease of understanding, and show post-processed baseline outputs (through majority voting based on predicted masks). Our Form2Seq framework makes better classifications for the elements (2, 3, 5) marked in the top left image (1=Header Title, 2=Choice Group Title, 3=Section Title, 4=Static Text and 5=Bullet). For the grouping task, elements highlighted with the same number by our model are predicted as part of the same group (zoom in for viewing). The bottom row shows baseline segmentation outputs (C and D).

Table 1: Element type classification accuracy of different ablation methods and baselines. Here A_T, B_T and C_T are different Form2Seq variants. A_T gets only an element's spatial coordinates as input; B_T gets an additional single bit depicting whether an element is a TextBlock or a Widget, in addition to the spatial coordinates; and C_T gets both textual and spatial information as inputs but does not receive the additional bit provided to B_T.

Table 1 shows that adding the TextBlock/Widget indicator bit improves accuracy (A_T to B_T). Adding textual information (model C_T) improves the overall accuracy by 1.19% to 90.06%.

Table 2: Comparison between F-scores of different models and baselines for ChoiceGroup identification only. A_G to F_G are different variants of Form2Seq.

Table 2 shows that joint training of both tasks improves the F-score from 53.24 to 54.65 (A_G to B_G), with a further improvement of 1.86 if type information is incorporated non-differentiably (B_G to C_G). Our best performing model (E_G) achieves an F-score of 59.72. We observe that using the type projection layer's softmax outputs instead results in poorer performance (E_G vs D_G). We also observe that using text in Form2Seq (E_G) performs 4.07 F-score points better than the ablation F_G (w/o text). As seen in figure 4, F_G misses the choice group title (red), while Form2Seq with text (E_G) extracts the complete choice group.

Table 3: Recall (R), Precision (P) and F-score (F) of different methods on extracting the different group structures (text field, choice field and choice group) simultaneously.

Table 4: Comparison with the baseline on the table structure recognition task (identifying rows and columns) on the ICDAR 2013 dataset.