MFCNN Paper Documentation - Noba1anc3/MFCNN GitHub Wiki

Learning to Extract Semantic Structure from Documents Using Multimodal Fully Convolutional Neural Networks

The Pennsylvania State University & Adobe Research

Abstract

  • End-to-end
  • Pixel-to-pixel
  • Multimodal
  • Visual appearance + content of underlying text
  • Synthetic document generation process
  • SOTA

Introduction

  • Document semantic structure extraction (DSSE)
    • Page segmentation - appearance based
      • distinguish text from figure, table and line ...
    • Logical structure analysis - semantic based
      • paragraph / caption
    • Pixel-wise segmentation problem

To incorporate textual information in a CNN-based architecture, we build a text embedding map and feed it to our MFCN.
More specifically, we embed each sentence and map the embedding to the pixels covered by that sentence.

  • MFCNN

    • Encoder
      • learn a hierarchy of feature representation
    • Decoder
      • output segmentation mask
    • Auxiliary decoder
      • reconstruction during training
    • Bridge
      • merge visual representation and textual representation
  • pixel-wise ground truth data

    • previous document understanding datasets
      • small size
      • lack of fine-grained semantic labels
    • generate large-scale pretraining data
      • two unsupervised tasks
        • reconstruction : better representation learning
        • consistency : encourage pixels belonging to the same region to have similar representations (a sketch of one possible form follows this list)

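The consistency task above is only named in these notes. A minimal PyTorch sketch of one plausible formulation (an assumption on my part, not the authors' exact loss) penalizes, within each ground-truth region, how far every pixel's output deviates from that region's mean:

```python
import torch

def consistency_loss(pred, region_ids):
    """pred: B x C x H x W network outputs; region_ids: B x H x W integer region labels.
    Encourages pixels that belong to the same region to carry similar representations."""
    loss, count = pred.new_zeros(()), 0
    for b in range(pred.shape[0]):
        for r in region_ids[b].unique():
            mask = region_ids[b] == r                          # pixels of one region
            vals = pred[b][:, mask]                            # C x N_pixels
            loss = loss + ((vals - vals.mean(dim=1, keepdim=True)) ** 2).mean()
            count += 1
    return loss / max(count, 1)

# toy usage
l = consistency_loss(torch.randn(1, 6, 32, 32), torch.randint(0, 4, (1, 32, 32)))
```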
Background

  • Logical Structure Analysis : using a set of heuristic rules based on the location, font, and text of each sentence

  • Semantic Segmentation : FCN has several limitations, such as ignoring small objects and mislabeling large objects due to its fixed receptive field size

    • Noh : unpooling, reuse the pooled location at the up-sampling stage
    • Pinheiro : skip connection to refine boundaries
    • MFCN (this paper) : dilated convolution block
  • Language and Vision : several joint learning tasks such as image captioning, visual question answering, and one-shot learning
    have demonstrated the significant impact of using textual and visual representations in a joint framework.
    Our work is unique in that we use textual embedding directly for a segmentation task for the first time,
    and we show that our approach improves the results of traditional segmentation approaches that only use visual cues.

Method

MFCNN

Unpooling

First, we observe that several semantic-based classes such as section heading and caption usually occupy relatively
small areas. Moreover, correctly identifying certain regions often relies on small visual cues, like lists being
identified by small bullets or numbers in front of each item. This suggests that low-level features need to be used.

However, because max-pooling naturally loses information during downsampling, FCN often performs poorly for small objects.

We propose an alternative skip connection implementation, illustrated by the blue arrows in Fig. 2,
which uses unpooling to preserve more spatial information.
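A minimal PyTorch sketch of this unpooling-based skip connection (layer sizes are illustrative, not taken from the repo): the encoder's max-pooling indices are saved and reused by `MaxUnpool2d` in the decoder, so fine spatial detail is placed back at its original locations.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Conv + max-pool that also returns the pooling indices for later unpooling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool2d(2, return_indices=True)

    def forward(self, x):
        feat = self.conv(x)
        pooled, idx = self.pool(feat)
        return pooled, idx

class DecoderBlock(nn.Module):
    """Unpooling that reuses the encoder's pooled locations (the 'blue arrows')."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.unpool = nn.MaxUnpool2d(2)
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, idx):
        return self.conv(self.unpool(x, idx))

# toy usage
enc, dec = EncoderBlock(3, 16), DecoderBlock(16, 16)
pooled, idx = enc(torch.randn(1, 3, 64, 64))
out = dec(pooled, idx)   # back to 64 x 64, with detail restored at the pooled locations
```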

Dilated Conv Block

We also notice that broader context information is needed to identify certain objects. For instance, it is
often difficult to tell the difference between a list and several paragraphs by only looking at parts of them.

Inspired by the Inception architecture and dilated convolution, we propose a dilated convolution block.
Each dilated convolution block consists of 5 dilated convolutions with a 3 × 3 kernel size and dilations d = 1, 2, 4, 8, 16.
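A hedged PyTorch sketch of such a block (channel counts and the concat-then-1×1 fusion are my assumptions; the paper only specifies the five dilation rates):

```python
import torch
import torch.nn as nn

class DilatedConvBlock(nn.Module):
    """Five parallel 3x3 convolutions with dilations 1, 2, 4, 8, 16.

    How the branches are merged (concat vs. sum) is an implementation choice;
    here they are concatenated and fused with a 1x1 convolution.
    """
    def __init__(self, in_ch, branch_ch):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=d, dilation=d)
            for d in (1, 2, 4, 8, 16)
        ])
        self.fuse = nn.Conv2d(5 * branch_ch, branch_ch, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        out = torch.cat([b(x) for b in self.branches], dim=1)
        return self.act(self.fuse(out))

# spatial size is preserved because padding == dilation for a 3x3 kernel
block = DilatedConvBlock(64, 32)
y = block(torch.randn(1, 64, 48, 48))   # -> (1, 32, 48, 48)
```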

Text Embedding Map

A sentence is treated as the minimum unit that conveys a certain semantic meaning.
A sentence embedding is built by averaging the embeddings of its individual words.

For each pixel inside the area of a sentence, we use the corresponding sentence embedding as the input.
Pixels that belong to the same sentence thus share the same embedding. Pixels that do not belong
to any sentences will be filled with zero vectors.

The word embeddings used in MFCNN are trained with the skip-gram model.
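A minimal NumPy sketch of how such a text embedding map can be assembled (the sentence bounding boxes and the word-vector lookup are placeholders):

```python
import numpy as np

def build_text_embedding_map(height, width, sentences, word_vectors, dim=128):
    """sentences: list of (bbox, words) with bbox = (x0, y0, x1, y1) in pixels.
    word_vectors: dict mapping a word to its `dim`-dimensional embedding."""
    emb_map = np.zeros((height, width, dim), dtype=np.float32)
    for (x0, y0, x1, y1), words in sentences:
        vecs = [word_vectors[w] for w in words if w in word_vectors]
        if not vecs:
            continue                                   # no known words: leave zeros
        sentence_emb = np.mean(vecs, axis=0)           # average the word embeddings
        emb_map[y0:y1, x0:x1, :] = sentence_emb        # all pixels of the sentence share it
    return emb_map

# toy usage with random word vectors
vocab = {w: np.random.randn(128).astype(np.float32) for w in ["deep", "learning"]}
m = build_text_embedding_map(384, 384, [((10, 10, 200, 30), ["deep", "learning"])], vocab)
```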

Synthetic Document Data

While there are several publicly available datasets for page segmentation,
there are only a few hundred to a few thousand pages in each.
Furthermore, the types of labels are limited, for example to text, figure and table,
whereas our goal is to perform a much more granular segmentation.

Documents are produced by a completely automated, randomized layout of content scraped from the web.
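These notes do not reproduce the generator's details; a toy sketch of the idea (layout scheme, class list, and sizes are all assumptions) is to drop scraped blocks into random positions of a blank page and record each pixel's class, which yields pixel-wise ground truth for free:

```python
import numpy as np

CLASSES = {"background": 0, "paragraph": 1, "figure": 2, "caption": 3}   # hypothetical label set

def synth_label_mask(blocks, height=384, width=384, seed=0):
    """blocks: list of (class_name, block_h, block_w) scraped from the web.
    Returns a pixel-wise label mask; rendering the page image itself is omitted."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((height, width), dtype=np.uint8)
    for cls, bh, bw in blocks:
        y = rng.integers(0, max(1, height - bh))
        x = rng.integers(0, max(1, width - bw))
        mask[y:y + bh, x:x + bw] = CLASSES[cls]
    return mask

mask = synth_label_mask([("paragraph", 120, 300), ("figure", 100, 150), ("caption", 20, 150)])
```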

Implementation Detail

Preprocess

  • Per-channel mean subtraction
  • Resize to 384
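A small sketch of this preprocessing (the square 384 × 384 target and the use of OpenCV are assumptions; the paper may subtract a dataset-wide mean rather than a per-image one):

```python
import cv2
import numpy as np

def preprocess(image, size=384):
    """image: H x W x 3 uint8 array. Mean-subtract each channel, then resize."""
    img = image.astype(np.float32)
    img -= img.mean(axis=(0, 1), keepdims=True)        # per-channel mean subtraction
    return cv2.resize(img, (size, size), interpolation=cv2.INTER_LINEAR)

# toy usage
out = preprocess(np.random.randint(0, 256, (500, 400, 3), dtype=np.uint8))   # -> 384 x 384 x 3
```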

Training

Conv : 3x3 kernel size, stride 1
Pooling/Unpooling : 2x2 kernel size
Batch normalization after conv before non-linear function
Adadelta with a mini-batch of 2
Since the class labels are unbalanced, class weights for the classification loss are set according to the number of pixels in each class.
Text Embedding : each word is represented as a 128-dim vector from a skip-gram model trained on the 2016 English Wikipedia dump;
embeddings for out-of-dictionary words are obtained following Bojanowski et al.
In my own implementation, based on a pretrained BERT model and PCA,
a 768-dim sentence vector is first computed by BERT and then reduced to a 128-dim vector with PCA.
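A hedged sketch of that BERT + PCA variant (mean pooling over token states and the `transformers`/`scikit-learn` APIs are my assumptions about how it was wired up):

```python
import numpy as np
import torch
from transformers import BertModel, BertTokenizer
from sklearn.decomposition import PCA

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def sentence_vectors(sentences):
    """768-dim sentence vectors from BERT (mean-pooled last hidden states)."""
    vecs = []
    for s in sentences:
        inputs = tokenizer(s, return_tensors="pt", truncation=True, max_length=128)
        hidden = bert(**inputs).last_hidden_state            # 1 x T x 768
        vecs.append(hidden.mean(dim=1).squeeze(0).numpy())
    return np.stack(vecs)

sents = ["Figure 1 shows the architecture.", "This paper proposes a multimodal FCN."]
v768 = sentence_vectors(sents)
# with a realistic corpus, fit PCA on many sentences and use n_components=128
pca = PCA(n_components=2).fit(v768)
v128 = pca.transform(v768)
```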

Post processing

Cleanup strategy to refine the segmentation masks: for documents in PDF format, we obtain a set of bounding boxes by analyzing the PDF (e.g., with PDFMiner),
then refine the segmentation masks by first calculating the average class probability over the pixels belonging to each bounding box,
followed by assigning the most likely label to those pixels.
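A minimal NumPy sketch of this refinement step (bounding boxes are assumed to come from a PDF parser such as PDFMiner; variable names are mine):

```python
import numpy as np

def refine_with_bboxes(prob_map, bboxes):
    """prob_map: H x W x C per-pixel class probabilities from the network.
    bboxes: list of (x0, y0, x1, y1) boxes extracted from the PDF."""
    labels = prob_map.argmax(axis=-1)
    for x0, y0, x1, y1 in bboxes:
        region = prob_map[y0:y1, x0:x1]                 # pixels belonging to this bbox
        avg_prob = region.mean(axis=(0, 1))             # average class probability
        labels[y0:y1, x0:x1] = int(avg_prob.argmax())   # assign the most likely label
    return labels

# toy usage: 6 classes, probabilities sum to 1 per pixel
probs = np.random.dirichlet(np.ones(6), size=(384, 384))
refined = refine_with_bboxes(probs, [(10, 10, 200, 40)])
```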

Experiments

Datasets:

  • ICDAR 2015
  • SectLabel
  • DSSE-200

ICDAR 2015

  • used in the biennial ICDAR page segmentation competitions
  • focusing more on appearance-based regions
  • 70 sampled pages from contemporary magazines and technical articles

SectLabel

  • 40 academic papers with 347 pages in the field of computer science

DSSE-200

  • provides both appearance-based and semantic-based labels
  • 200 pages from magazines and academic papers
  • regions are assigned labels from : figure, table, section, caption, list and paragraph

Ablation Experiment on Model Architecture

The goal is to find the best "base" architecture to be used in the following experiments.
All models are trained from scratch and evaluated on the DSSE-200 dataset.

The baseline consists of a feed-forward convolutional network as an encoder,
and a decoder implemented by a fully convolutional network.
Upsampling is done by bilinear interpolation.

Subsequent variants add skip connections, replace bilinear upsampling with unpooling, and use dilated convolutions.

Adding Textual Information

We take Model5 as our vision-only model and incorporate a text embedding map via a bridge module.
This combined model is fine-tuned on our synthetic documents.
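A hedged PyTorch sketch of what such a bridge could look like (channel sizes and the concatenate-then-convolve fusion are my assumptions): the text embedding map is resized to the visual feature map's resolution, concatenated along the channel axis, and fused with a convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Bridge(nn.Module):
    """Merges the encoder's visual features with the text embedding map."""
    def __init__(self, vis_ch=512, txt_ch=128, out_ch=512):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(vis_ch + txt_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, visual_feat, text_map):
        # text_map: B x txt_ch x H x W at input resolution -> shrink to the feature-map size
        text_small = F.interpolate(text_map, size=visual_feat.shape[2:], mode="nearest")
        return self.fuse(torch.cat([visual_feat, text_small], dim=1))

bridge = Bridge()
out = bridge(torch.randn(1, 512, 24, 24), torch.randn(1, 128, 384, 384))  # -> 1 x 512 x 24 x 24
```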