Layout-Aware Text Representations Harm Clustering Documents by Type

Clustering documents by type (grouping invoices with invoices and articles with articles) is a desirable first step for organizing large collections of document scans. Humans approaching this task use both the semantics of the text and the document layout to group like documents. LayoutLM (Xu et al., 2019), a layout-aware transformer built on top of BERT with state-of-the-art performance on document-type classification, could reasonably be expected to outperform regular BERT (Devlin et al., 2018) for document-type clustering. However, we find experimentally that BERT significantly outperforms LayoutLM on this task (p < 0.001). We analyze clusters to show where layout awareness is an asset and where it is a liability.


Introduction
Organizations are inundated by paperwork, often in the form of PDFs. Automated processing can help to organize and extract information from these documents, but the right process for a given document depends on its type: invoices are handled differently than contracts, for example. Document classification by type enables such a system; however, it requires training data for all of the desired classes, and finding such data to fit a given business's needs is difficult. There is no one-size-fits-all ontology of document types. While some types, such as invoices, may be common across industries, others, such as loan applications or home-inspection reports, are domain-specific. Users wishing to define their own classes will benefit from a system that enables them to group their own documents. To help with this, the present work addresses the task of clustering documents by type.
Humans grouping documents by type can use both the text and the appearance of documents. For example, we can distinguish a gas bill from an article at a glance, but we need to read at least a few words to determine whether a dense, two-column document is an article or a warranty. We therefore expect that a hybrid document representation that combines layout and text information should outperform a text-only representation when clustering documents by type. LayoutLM (Xu et al., 2019) is such a hybrid system and achieves state-of-the-art performance for document-type classification, outperforming text-only baselines. We therefore hypothesized that LayoutLM would also outperform these baselines for document-type clustering.
Sections 3 and 4 describe the systems we compared and the experiments we used to try to confirm this hypothesis. However, the main contribution of this work is experimental evidence of the opposite: LayoutLM performed significantly worse than a simple BERT baseline on this task (Section 5). Analysis of output clusters (Section 5.1) helps to explain this unexpected result.

Related Work
Hybrid layout/text representations Recent work combines layout with text for information extraction. Chargrid (Katti et al., 2018) assigns each pixel on a page a vector. For pixels inside the bounding box of a character, the vector is a one-hot encoding for that character; otherwise, it is a vector of zeros. This generates a vocabsize × height × width tensor representation of the page for input to a CNN encoder-decoder model. BERTgrid (Denk and Reisswig, 2019) is nearly identical, but it replaces the one-hot character encoding with the word's BERT encoding. Liu et al. (2019) represent a document as a fully-connected graph where text boxes are nodes. The edge embedding between two nodes incorporates the distance between them, the text boxes' aspect ratios, and their relative sizes. Similarly, ZeroShotCeres (Lockard et al., 2020) represents semi-structured web pages as graphs, with text-field nodes connected by edges for vertically or horizontally adjacent text fields and siblings or cousins in the DOM tree. Both systems then use graph neural networks over the document graphs.
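As an illustration of the Chargrid encoding described above, the following minimal sketch builds a vocab-size × height × width tensor from character bounding boxes. The function name, the toy two-character vocabulary, and the convention that box coordinates are pixel indices with exclusive upper bounds are assumptions made for this example, not details taken from Katti et al. (2018).

```python
import numpy as np


def chargrid(chars, vocab, height, width):
    """Build a (vocab_size, height, width) one-hot page tensor.

    chars: list of (character, (x0, y0, x1, y1)) with pixel boxes;
           x1 and y1 are treated as exclusive (an assumption).
    Pixels outside every box keep an all-zero vector.
    Overlapping boxes are not handled specially in this sketch.
    """
    index = {c: i for i, c in enumerate(vocab)}
    grid = np.zeros((len(vocab), height, width), dtype=np.uint8)
    for ch, (x0, y0, x1, y1) in chars:
        if ch in index:  # characters outside the vocabulary are skipped
            grid[index[ch], y0:y1, x0:x1] = 1
    return grid


# Toy page: an "a" in the top-left corner and a "b" further right.
page = chargrid(
    [("a", (0, 0, 2, 3)), ("b", (3, 1, 5, 4))],
    vocab="ab",
    height=4,
    width=6,
)
```

BERTgrid, as described above, would differ only in what is written into each box: a word's BERT vector rather than a one-hot character vector.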
Document-type classification Classification of documents by type has frequently been treated as an image classification problem. Many works have used various CNN architectures (Kang et al., 2014; Afzal et al., 2015; Harley et al., 2015; Afzal et al., 2017; Tensmeyer and Martinez, 2017; Das et al., 2018) or other vision-based techniques (Sarkhel and Nandi, 2019).
Some works have combined vision and NLP for document-type classification, using OCR for text extraction. Noce et al. (2016) assigned the most relevant words unique colors, then filled the bounding boxes of those words with the corresponding color, enabling the CNN processing the image to "see" the word. Asim et al. (2019) provided the most important words as features to a CNN, later combining the output with an image stream that used an InceptionV3 CNN architecture. Dauphinee et al. (2019) concatenated the output of a CNN image classifier with a multilayer perceptron bag-of-words classifier, then fed the concatenation to a meta-classifier. Ferrando et al. (2020) used an ensemble of a BERT classifier and EfficientNet CNNs. Audebert et al. (2020) concatenated image features (from a MobileNet v2 CNN) with text features (generated by passing FastText embeddings for the text through a 1D CNN) to form the input to a multilayer perceptron. Cosma et al. (2020) used text to help pretrain part of their classifier: they performed LDA to determine documents' topics, then trained their CNN to try to predict those topics using only the document image. They ultimately used the CNN as part of a model to predict document type using the image only. All of these systems are supervised, whereas this work addresses unsupervised clustering.

Systems
We compare LayoutLM and BERT, as well as a TF-IDF baseline (sklearn's implementation (Pedregosa et al., 2011) with default hyperparameters). In each case, we use the specified system to generate one vector representation for each document image, then cluster using sklearn's k-means, with k set to the number of gold classes plus one.
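The TF-IDF variant of this pipeline can be sketched as follows. This is a minimal illustration, not the experimental code: the toy documents, the two-class setup, and the k-means settings (`n_init`, `random_state`) are assumptions introduced here; only the use of sklearn's TF-IDF with default hyperparameters and k-means with k equal to the number of gold classes plus one comes from the description above.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical OCR output, one string per document page.
docs = [
    "invoice number 1042 total amount due",
    "invoice total due upon receipt",
    "dear sir thank you for your letter",
    "dear madam in reply to your letter",
    "abstract we present results of a study",
    "abstract this study reports results",
]
n_gold_classes = 2  # toy value; the gold label set defines this in practice

# One vector per document, using default TF-IDF hyperparameters.
vectors = TfidfVectorizer().fit_transform(docs)

# k is the number of gold classes plus one, as described above.
# n_init and random_state are illustrative choices, not from the paper.
kmeans = KMeans(n_clusters=n_gold_classes + 1, n_init=10, random_state=0)
labels = kmeans.fit_predict(vectors)
```

For the BERT and LayoutLM systems, `vectors` would instead hold one pooled embedding per document, with the clustering step unchanged.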
BERT (Devlin et al., 2018) is a transformer-based bidirectional model that generates contextualized word embeddings for a sequence of words. The input to a BERT model for the i-th token in the sequence is a sum of (a) its token embedding; (b) a position embedding for position i; and (c) a segment embedding indicating whether the token is in the first or second segment of the input sequence.
LayoutLM (Xu et al., 2019) is a BERT-like transformer model modified to generate layout-aware contextualized word embeddings. In place of BERT's single positional embedding, LayoutLM adds positional embeddings for the x- and y-coordinates of a bounding box around the token. The token's embedding thus incorporates its two-dimensional location on the page and its size. This architecture achieves state-of-the-art performance for supervised classification by document type.
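To make the layout-aware input concrete, the sketch below sums a token embedding with embeddings looked up for the corners of the token's bounding box, following the description above. It is an illustration under stated assumptions: the embedding tables are random rather than learned, the sizes and the 0 to 1000 coordinate range are toy choices, segment embeddings are omitted, and the function name is hypothetical. It does not reproduce the released LayoutLM code.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, VOCAB, MAX_COORD = 8, 100, 1001  # toy sizes (assumptions)

# Random stand-ins for learned embedding tables.
token_table = rng.normal(size=(VOCAB, HIDDEN))
x_table = rng.normal(size=(MAX_COORD, HIDDEN))
y_table = rng.normal(size=(MAX_COORD, HIDDEN))


def layout_aware_input(token_id, box):
    """Sum the token embedding with embeddings for the box coordinates.

    box: (x0, y0, x1, y1) integer page coordinates of the token's bounding box.
    Using both corners lets the sum reflect position and size.
    """
    x0, y0, x1, y1 = box
    return (
        token_table[token_id]
        + x_table[x0]
        + y_table[y0]
        + x_table[x1]
        + y_table[y1]
    )


# The same word placed near the top and near the bottom of a page
# receives different input vectors.
word_at_top = layout_aware_input(7, (100, 50, 180, 70))
word_at_bottom = layout_aware_input(7, (100, 900, 180, 920))
```

A text-only model would give both occurrences the same token embedding, differing only through the sequence position.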
Both BERT and LayoutLM output a vector for each token in the input sequence plus the special [CLS] token. However, k-means, like most clustering algorithms, requires a single vector representation of each example. Classifiers use the [CLS] embedding as a single-vector representation for the entire sequence. However, prior work (Reimers and Gurevych, 2019; Wang and Kuo, 2020) has shown that, for BERT without fine-tuning, this is not a good representation of the semantics of the entire sequence. Other options include combining all of the vectors in the output sequence by either averaging or max pooling (setting the i-th value in the output vector equal to the maximum i-th value over all of the sequence vectors). For BERT, we use the average as our representation, since Reimers and Gurevych (2019) showed it captured semantic similarity better than the [CLS] token. For LayoutLM, we try all three methods.

Table 1: Mean F1 and ARI over five runs, with standard error of the mean in parentheses. Items marked with * are significantly different from BERT average, p < 0.001 based on a two-tailed t-test.
System                    F1               ARI
[row label missing]       0.20* (0.003)    0.14* (0.003)
LayoutLM (max pooled)     0.19* (0.001)    0.13* (0.000)
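The three single-vector options discussed above ([CLS], averaging, and max pooling) can be sketched as follows. This is a minimal illustration: the function name, the toy input, and the convention that row 0 holds the [CLS] vector are assumptions, and the sketch pools over every row because the text does not say whether special or padding tokens were excluded.

```python
import numpy as np


def pool(token_vectors, method):
    """Reduce a (seq_len, hidden) array of token vectors to one vector.

    Row 0 is assumed to be the [CLS] vector.
    """
    if method == "cls":
        return token_vectors[0]
    if method == "mean":
        return token_vectors.mean(axis=0)
    if method == "max":
        # i-th output value = maximum i-th value over all token vectors
        return token_vectors.max(axis=0)
    raise ValueError(f"unknown pooling method: {method}")


# Toy output sequence: three token vectors with hidden size 2.
tokens = np.array([[1.0, -2.0], [3.0, 0.0], [-1.0, 5.0]])

cls_vec = pool(tokens, "cls")    # [1., -2.]
mean_vec = pool(tokens, "mean")  # [1., 1.]
max_vec = pool(tokens, "max")    # [3., 5.]
```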

Experiments
We evaluate on RVL-CDIP 2 (Harley et al., 2015), scanned tobacco-litigation documents from the Illinois Institute of Technology Complex Document Information Processing (IIT-CDIP) collection, labeled with type, such as letter or invoice. The complete class list appears in Table 3. We clustered the validation set (40K pages). Like LayoutLM, we used Tesseract 3 for OCR.
We use LayoutLM's publicly-released code and base model for experiments. 4 This model was pretrained on IIT-CDIP, excluding documents in RVL-CDIP. For BERT, we use the Transformers package 5 with the bert-base-uncased model, pretrained on books and Wikipedia. Because LayoutLM's masked language model was pretrained on documents from the same domain, while BERT's was not, the dataset could favor LayoutLM.
We calculate F1 and adjusted Rand index (ARI) for each system, using the definitions of true and false positives and negatives from Manning et al. (2008). We use sklearn's (Pedregosa et al., 2011) implementation of ARI. We report the mean over 5 runs and use a two-tailed t-test to determine whether systems differ significantly from the BERT baseline.
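A pair-counting F1 in the style of Manning et al. (2008) can be sketched as below, under the usual reading that a true positive is a pair of same-class documents placed in the same cluster, a false positive is a pair of different-class documents placed in the same cluster, and a false negative is a pair of same-class documents placed in different clusters. The function name and toy labels are assumptions for illustration; this is not the evaluation code used in the experiments.

```python
from collections import Counter
from math import comb


def pairwise_f1(gold, pred):
    """Pair-counting F1 between gold class labels and predicted cluster ids."""
    # Pairs sharing a cluster: TP + FP
    same_cluster = sum(comb(n, 2) for n in Counter(pred).values())
    # Pairs sharing a gold class: TP + FN
    same_class = sum(comb(n, 2) for n in Counter(gold).values())
    # Pairs sharing both class and cluster: TP
    tp = sum(comb(n, 2) for n in Counter(zip(gold, pred)).values())
    if tp == 0:
        return 0.0
    precision = tp / same_cluster
    recall = tp / same_class
    return 2 * precision * recall / (precision + recall)


# Toy example: one invoice is wrongly grouped with the letters.
gold = ["invoice", "invoice", "invoice", "letter", "letter", "letter"]
pred = [0, 0, 1, 1, 1, 1]
score = pairwise_f1(gold, pred)  # TP=4, TP+FP=7, TP+FN=6, so F1 = 8/13
```

ARI on the same two label lists is available from sklearn as `sklearn.metrics.adjusted_rand_score`.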

Results
Results are shown in Table 1 and Figure 1. Our experiments show that a system using LayoutLM vectors is significantly worse (p < 0.001) at clustering RVL-CDIP documents by type than a simple BERT baseline. There was no significant difference between the TF-IDF and BERT systems.
2 https://www.cs.cmu.edu/~aharley/rvl-cdip/
3 https://github.com/tesseract-ocr/tesseract; we used version 4.1.1.
4 https://github.com/microsoft/unilm/tree/master/layoutlm. The version as of this writing does not include the optional image embeddings.
5 https://github.com/huggingface/transformers
In contrast to prior work on BERT, where the [CLS] token was a worse representation than averaging (Reimers and Gurevych, 2019; Wang and Kuo, 2020), the best-performing LayoutLM system used the [CLS] token embedding. We suspect this is because averaging or max-pooling LayoutLM vectors blends together bounding box information for all tokens, erasing the benefits of a layout-sensitive transformer. In light of these results, we also tested the [CLS] token and max pooling for BERT on this task. Consistent with prior work, averaging outperformed both; see Table 2.
All of these scores are low, especially in comparison to classification results. The comparison is misleading, of course, since classification requires training data, and clustering addresses the case where such data is not available. Nevertheless, much improvement will be required before document-type clustering is useful for practical applications.

Analysis
To understand this unexpected result, we reviewed example clusters from one run of the BERT system and one of LayoutLM([CLS]).
Documents in LayoutLM's best clusters had consistent layouts, illustrated in Figure 2. Specifications in the highest-purity cluster seem to have been generated from a few templates. For such documents, the layouts are so consistent that no learning is required to identify which aspects of layout to emphasize in grouping the documents. Not all specifications conform to these templates, though. Figure 3 shows some with different formats, which LayoutLM placed in a different cluster. Document layouts that are common across multiple document types also caused problems for LayoutLM. Figure 4 shows an invoice and resume with similar formats from the cluster with the lowest purity.
Table 3 lists class precision for the sample clustering runs. From this, we see that LayoutLM performed well on scientific publications. A substantial fraction of this class contains two-column documents, like those in Figure 6, which LayoutLM can recognize. In contrast, BERT far outperformed LayoutLM for resumes, where page layout may be misleading. BERT correctly clustered the two resume images in Figure 5 together, despite their obvious layout differences. LayoutLM understandably placed them in different clusters.

Conclusion
LayoutLM captures textual and layout information about documents. When training data is available, a model can learn when to leverage each. Thus, LayoutLM performed quite well at classifying documents by type. But when clustering, there is no model to indicate how to weight features in determining document similarities. In this context, layout information significantly harms performance. Future work should explore ways to incorporate benefits of layout information into a representation while limiting its harm, as well as how layout information affects tasks that fall between classification and clustering, such as semi-supervised learning. Such questions must be answered for document-type clustering to become practical.
Figure 5: BERT correctly clustered these two resume pages together despite their very different layouts; LayoutLM put them in different clusters.