LayoutLM: Pre-training of Text and Layout for Document Image Understanding

Abstract

Pre-training techniques have been verified successfully in a variety of NLP tasks in recent years.
Despite the widespread use of pre-training models for NLP applications, they almost exclusively focus on
text-level manipulation, while neglecting layout and style information that is vital for document image understanding.

LayoutLM

  • Jointly model interactions between text and layout information
  • Leverage image features to incorporate words' visual information into LayoutLM
  • The first time that text and layout are jointly learned in a single framework for document-level pre-training
  • SOTA on: form understanding, receipt understanding, document image classification

Introduction

Document AI, or Document Intelligence, is a relatively new research topic that refers to techniques for
automatically reading, understanding, and analyzing business documents.

Business documents

  • digital-born, occurring as electronic files
  • scanned, digitized from written or printed paper
  • purchase order, financial report, business email, sales agreement, vendor contract, letter, invoice, receipt, ...

Understanding business documents is a very challenging task due to
the diversity of layouts and formats, the poor quality of scanned images, and the complexity of template structures.

Contemporary approaches to document AI are usually built upon deep neural networks
from a computer vision (CV) or natural language processing (NLP) perspective, or a combination of the two.

The first work to propose a table detection method for PDF documents based on CNNs:

  • A Table Detection Method for PDF Documents Based on Convolutional Neural Networks. [2016 IAPR Workshop on DAS]

Works leveraging the more advanced Faster R-CNN and Mask R-CNN models:

  • DeepDeSRT: Deep Learning for Detection and Structure Recognition of Tables in Document Images. [2017 ICDAR]
  • Visual Detection with Context for Document Layout Analysis. [2019 EMNLP-IJCNLP]
  • PubLayNet: largest dataset ever for document layout analysis. [2019 ICDAR]

Work taking advantage of text embeddings from pre-trained NLP models:

  • Learning to Extract Semantic Structure from Documents Using Multimodal Fully Convolutional Neural Networks. [2017 CVPR]

Work combining textual and visual information for information extraction:

  • Graph Convolution for Multimodal Information Extraction from Visually Rich Documents. [2019]

Most of these methods face two limitations:

  1. They rely on a few human-labeled training samples without fully exploring the possibility of using large-scale unlabeled training samples.
  2. They usually leverage either pre-trained CV models or NLP models, but do not consider a joint training of textual and layout information.

Therefore, it is important to investigate how self-supervised pre-training of text and layout may help in the document AI area.

Inspired by the BERT model, where input textual information is mainly represented by
text embeddings and position embeddings, LayoutLM further adds two types of input embeddings:

  1. a 2-D position embedding that denotes the relative position of a token within a document;
  2. an image embedding for scanned token images within a document.
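
Conceptually, the 2-D position embedding adds coordinate look-up tables on top of BERT's word and 1-D position embeddings. The following PyTorch sketch illustrates the summation; the class name, dimensions, and coordinate range are illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

class LayoutLMInputEmbeddings(nn.Module):
    """Sketch: BERT-style word/position embeddings plus 2-D position
    embeddings looked up from each token's bounding box (x0, y0, x1, y1)."""

    def __init__(self, vocab_size=30522, hidden=768, max_pos=512, max_coord=1024):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden)
        self.pos_emb = nn.Embedding(max_pos, hidden)    # 1-D sequence position
        self.x_emb = nn.Embedding(max_coord, hidden)    # shared table for x0 and x1
        self.y_emb = nn.Embedding(max_coord, hidden)    # shared table for y0 and y1

    def forward(self, input_ids, bbox):
        # input_ids: (batch, seq_len); bbox: (batch, seq_len, 4) with
        # coordinates normalized to integers in [0, max_coord)
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        return (self.word_emb(input_ids)
                + self.pos_emb(positions)
                + self.x_emb(bbox[..., 0]) + self.y_emb(bbox[..., 1])
                + self.x_emb(bbox[..., 2]) + self.y_emb(bbox[..., 3]))
```

The image embedding would be added analogously, e.g. from visual features extracted for each token's image region.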

Multi-task Learning Objective

  • Masked Visual-Language Model (MVLM) loss
  • Multi-label Document Classification (MDC) loss
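
A minimal sketch of how the two objectives could be combined (function and tensor names are hypothetical, and the paper does not prescribe a particular weighting between the losses):

```python
import torch.nn.functional as F

def pretraining_loss(token_logits, mvlm_labels, doc_logits, doc_labels):
    """Sketch of the multi-task objective: MVLM cross-entropy over masked
    tokens plus multi-label BCE for document classification."""
    # MVLM: predict the original token at masked positions only;
    # unmasked positions carry the label -100 and are ignored.
    mvlm = F.cross_entropy(token_logits.view(-1, token_logits.size(-1)),
                           mvlm_labels.view(-1), ignore_index=-100)
    # MDC: a document may carry several tags, hence a multi-label
    # binary cross-entropy rather than a softmax.
    mdc = F.binary_cross_entropy_with_logits(doc_logits, doc_labels.float())
    return mvlm + mdc  # equal weighting assumed for illustration
```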

LayoutLM is pre-trained on the IIT-CDIP Test Collection 1.0,
which contains more than 6 million scanned documents with 11 million scanned document images.

Three benchmark datasets are selected as downstream tasks to evaluate the performance of the pre-trained LayoutLM.

  • FUNSD: Spatial Layout Analysis and Form Understanding [2019 ICDARW]
  • SROIE: Scanned Receipts Information Extraction
  • RVL-CDIP: Document Image Classification [2015 ICDAR]

Contributions

  • For the first time, textual and layout information from scanned document images is pre-trained in a single framework
  • LayoutLM uses the masked visual-language model and the multi-label document classification as the training objectives
  • The code and pre-trained models are publicly available at https://aka.ms/layoutlm for more downstream tasks

LayoutLM

BERT

The BERT model is an attention-based bidirectional language modeling approach.
It has been verified that the BERT model shows effective knowledge transfer from the self-supervised task with large-scale training data.
The architecture of BERT is basically a multi-layer bidirectional Transformer encoder.

Given a set of tokens processed using WordPiece, the input embeddings are computed by summing the corresponding word embeddings, position embeddings, and segment embeddings. Then, these input embeddings are passed through a multi-layer bidirectional Transformer that can generate contextualized representations with an adaptive attention mechanism.
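
As a concrete illustration of this pipeline, here is a minimal usage sketch with the HuggingFace transformers library (an external tool, not part of the paper):

```python
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")  # WordPiece
model = BertModel.from_pretrained("bert-base-uncased")

# Tokenize, then obtain one contextualized vector per token from the
# multi-layer bidirectional Transformer encoder.
inputs = tokenizer("Scanned invoices are hard to parse.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, 768)
```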

Two steps in the BERT framework

  • pre-training
  • fine-tuning

Objectives to learn the language representation during pre-training

  • Masked Language Modeling (MLM)
  • Next Sentence Prediction (NSP)
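
For MLM, roughly 15% of the input tokens are selected and the model is trained to recover them. A simplified sketch of the corruption step (BERT's full scheme also sometimes keeps the selected token or replaces it with a random one instead of always masking):

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", prob=0.15):
    """Simplified MLM corruption: mask ~15% of tokens; the loss is
    computed only at the masked positions."""
    corrupted, labels = [], []
    for tok in tokens:
        if random.random() < prob:
            corrupted.append(mask_token)
            labels.append(tok)    # supervise this position
        else:
            corrupted.append(tok)
            labels.append(None)   # ignored by the loss
    return corrupted, labels
```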

During fine-tuning, task-specific datasets are used to update all parameters in an end-to-end way.
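
A minimal end-to-end fine-tuning sketch with the HuggingFace transformers library; the task, labels, and hyperparameters are illustrative. The key point is that the optimizer covers every model parameter:

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# All parameters are trainable: end-to-end fine-tuning.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

batch = tokenizer(["total due: $42.00", "dear sir or madam"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])  # hypothetical task: receipt vs. letter

loss = model(**batch, labels=labels).loss  # one training step
loss.backward()
optimizer.step()
```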

LayoutLM