Paper review: 3D Semantic Segmentation with Submanifold Sparse Convolutional Networks

Paper

  • Title: 3D Semantic Segmentation with Submanifold Sparse Convolutional Networks
  • Authors: Benjamin Graham, Martin Engelcke, Laurens van der Maaten
  • Link: https://arxiv.org/abs/1711.10275
  • Tags: 3D, Convolutional network, Optimization
  • Year: 2017

Summary

  • What

    • The paper describes an approach to optimize the processing of sparse data with convolutional neural networks.
    • Dramatically reduces computational load -> speeds up processing on sparse data.
    • 'Sparse': data whose 'effective' dimensionality is lower than that of its representation: a line in a 2D image, a point cloud in 3D, etc.
    • Proposed submanifold sparse convolutional networks (SSCNs) for the efficient processing of high-dimensional, sparse input data.
    • The name "submanifold" (SSCNs) reflects that the network processes low-dimensional data living in a space of higher dimensionality.
  • Aimed problems

    • Conv nets are computationally expensive for data in higher dimensions
      • If most of the space in an input signal is empty, processing can be optimized so that resources are spent only on the informative part
    • Dilation of the sparse activation via convolutions:
      • Conv networks are, by their very construction, not well suited to processing sparse data.

      • The receptive field widens with each consecutive ordinary convolution, which is harmful for segmentation tasks.

      • As the receptive field widens, the sparsity of the data decreases and activations become dilated in deeper layers.

        • Illustration: in ordinary conv nets, a sparse input has its activations smeared after a 3x3 convolution is applied (figure: ConvNet activation dilation problem).
        • To limit the dilation of sparsity, they restrict the output of each conv layer to the points that were active at the input ('active regions'); see the sketch after this list.
      • To propagate spatial information from neighboring pixels forward through the network, pooling or strided convolution layers are therefore essential

      • The following illustration and description are taken from the authors' git repo:

      • (animated figure from the authors' repo showing active sites layer by layer; panels described below)

      • (i) an active point is highlighted; a convolution with stride 2 sees the green active sites (ii) and produces output (iii); 'children' of the highlighted active point from (i) are highlighted;
      • a submanifold sparse convolution sees the green active sites (iv) and produces output (v);
      • a deconvolution operation sees the green active sites (vi) and produces output (vii).
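To make the dilation effect concrete, here is a minimal numpy sketch (my own illustration, not the authors' code). It counts how the set of active sites grows under repeated dense 3x3 convolutions, and how it stays fixed under the submanifold rule that keeps only the input's active sites.

```python
import numpy as np

def conv3x3_active(mask):
    """Active sites after a dense 3x3 convolution: any site whose 3x3
    receptive field touches an active input becomes active."""
    padded = np.pad(mask, 1)
    out = np.zeros_like(mask, dtype=bool)
    h, w = mask.shape
    for dy in range(3):
        for dx in range(3):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def submanifold_active(mask):
    """Submanifold rule: an output site is active iff the corresponding
    (central) input site is active -> the sparsity pattern is preserved."""
    return mask.copy()

mask = np.zeros((21, 21), dtype=bool)
mask[10, :] = True                                # a 1-pixel-wide line: 21 active sites

dense = mask
for layer in range(3):
    dense = conv3x3_active(dense)
    print(f"ordinary 3x3 conv, layer {layer + 1}: {dense.sum()} active sites")
print(f"submanifold conv, any depth:   {submanifold_active(mask).sum()} active sites")
```
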
  • How

    • They use the additional (d + 1)st dimension for d-dimensional data to track the "active" path throughout hidden layers to the output.
    • The sparse "active" route is precomputed on each iteration; the resulting "mask" of active sites is then applied to all required computations.
    • The active "route" is tracked by a hash table, where the position of a pixel (or of a site in a hidden layer) is the key and the row index in the feature matrix is the value (see Implementation below).
    • The "activity" notion can also be used to sparsify originally non-sparse data, by thresholding away weak data points (see the sketch after this list).
    • To mark the "inactive" elements of the network they use the ground state of a feature vector: all zeros, or some fixed value at the ground-state position.
      • The value of the ground state (ground value) acts like a bias coming from the dataset itself.
      • Therefore the ground value is calculated once per forward pass at training time, and only once for all forward passes at test time.
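A minimal sketch of the thresholding and ground-state idea described above (function and variable names are hypothetical, chosen only for illustration): dense data is turned into a feature matrix for the active sites plus a location-to-row hash table, while all inactive sites implicitly share one ground-state vector.

```python
import numpy as np

def sparsify(dense, threshold=0.0):
    """dense: (H, W, m) array of m-dimensional features on a 2D grid."""
    ground_state = np.zeros(dense.shape[-1])      # shared by all inactive sites
    hash_table, rows = {}, []                     # location -> row in `features`
    active = np.linalg.norm(dense, axis=-1) > threshold
    for loc in zip(*np.nonzero(active)):
        loc = tuple(int(c) for c in loc)
        hash_table[loc] = len(rows)
        rows.append(dense[loc])
    features = np.array(rows) if rows else np.empty((0, dense.shape[-1]))
    return features, hash_table, ground_state

dense = np.zeros((4, 4, 2))
dense[1, 2] = [0.5, -0.3]                         # a strong (kept) data point
dense[3, 0] = [0.01, 0.0]                         # a weak (dropped) data point
features, table, ground = sparsify(dense, threshold=0.1)
print(table)      # {(1, 2): 0}
print(features)   # [[ 0.5 -0.3]]
```
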

Operators

  • Sparse convolution operator (old):

    • Defined as SC(m, n, f, s) with m input feature planes, n output feature planes, a filter size of f, and stride s
    • An output site is defined as active if any site in its f^d receptive field is active.
    • Size rule: for an input of size l, the output has size (l − f + s)/s (quick check below).
    • Reduces computations by roughly 50%.
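A quick numeric check of the size rule above (my own sketch; it assumes (l − f + s) is evenly divisible by s):

```python
def sc_output_size(l: int, f: int, s: int) -> int:
    # Size rule quoted above; assumes (l - f + s) is divisible by s.
    return (l - f + s) // s

print(sc_output_size(l=7, f=3, s=2))   # 3
print(sc_output_size(l=6, f=3, s=1))   # 4
```
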
  • Submanifold Sparse convolution operator (new):

    • SSC(m, n, f) is a modification of SC(m, n, f, s=1)
    • Input is padded with (f − 1)/2 zeros on each side so that the output has the same size as the input.
    • An output site is active iff the central site of its receptive field is active (no dilation of the active set; see the sketch below).
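A small sketch (not the authors' implementation) contrasting which output sites the two operators mark as active for a stride-1 filter with f = 3 in 2D; grid boundaries are ignored for simplicity.

```python
import itertools

def sc_active(active_in, f=3):
    """SC rule: an output site is active if ANY site in its f x f receptive
    field is active -> the active set grows (dilates)."""
    offsets = list(itertools.product(range(-(f // 2), f // 2 + 1), repeat=2))
    return {(y + dy, x + dx) for (y, x) in active_in for (dy, dx) in offsets}

def ssc_active(active_in, f=3):
    """SSC rule: an output site is active iff the CENTRAL site of its
    receptive field is active -> the active set is unchanged."""
    return set(active_in)

active_in = {(0, 0), (1, 1), (2, 2)}          # a diagonal 'submanifold'
print(len(sc_active(active_in)))              # 19 -- sparsity dilates after one SC
print(len(ssc_active(active_in)))             # 3  -- unchanged after one SSC
```
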
  • Activation function (old/same)

    • Ordinary but applied only to active sites
  • Pooling

    • Max-pooling MP(f, s) and average-pooling AP(f, s) are defined as variants of SC(·, ·, f, s).
    • MP takes the element-wise max over the zero vector and the input feature vectors in the receptive field.
    • AP calculates f^−d times the sum of the active input vectors (see the sketch below).
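A tiny illustrative sketch of the two pooling rules at a single output site (my own, with made-up helper names); `active_vectors` are the feature vectors of the active inputs inside one f^d receptive field, while inactive sites contribute nothing.

```python
import numpy as np

def mp_site(active_vectors, m):
    """MP: element-wise max over the zero vector and the active input vectors."""
    return np.max(np.vstack([np.zeros(m)] + active_vectors), axis=0)

def ap_site(active_vectors, m, f, d):
    """AP: f^-d times the sum of the active input vectors."""
    return sum(active_vectors, np.zeros(m)) / f ** d

vecs = [np.array([1.0, -2.0]), np.array([3.0, 0.5])]
print(mp_site(vecs, m=2))            # [3.  0.5]
print(ap_site(vecs, m=2, f=2, d=3))  # [ 0.5    -0.1875]
```
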
  • Deconvolution

    • DC(·, ·, f, s) is an inverse of the SC(·, ·, f, s)
    • Its active output sites are exactly the active input sites of the corresponding SC.

Implementation

  • State of (S)SC layer stored as matrix and hash-table

    • Matrix has size a × m and contains one row for each of the a active sites
    • Hash-table: contains (location, row) pairs for all active sites
      • location is a tuple of integer coordinates
      • row: indicates the row in the feature matrix
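A minimal sketch of this layer state (hypothetical names): an a × m feature matrix plus a {location: row} hash table, where any location not present in the table is implicitly at the ground state.

```python
import numpy as np

a, m = 3, 4                                      # 3 active sites, 4 feature planes
features = np.random.randn(a, m)                 # one row per active site
hash_table = {(0, 2): 0, (5, 5): 1, (7, 1): 2}   # location -> row
ground_state = np.zeros(m)                       # shared by every inactive site

def feature_at(location):
    row = hash_table.get(location)
    return ground_state if row is None else features[row]

print(feature_at((5, 5)))   # row 1 of the feature matrix
print(feature_at((3, 3)))   # inactive site -> ground state (all zeros)
```
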
  • Prerequisites:

    • Given a convolution filter size f
    • Let F = {0, 1, . . . , f−1}^d denote the spatial size of the convolutional filter
    • Define rule_book: a collection R = (R_i : i ∈ F) of f^d integer matrices, each with two columns (used in the SC sketch below)
  • SC(m, n, f, s):

      • In a single pass, the input hash table, the active points in the input layer, and all the points in the output layer that can see them are combined into the output hash table and the rule book.
      • When an output site is first visited, a new entry is created in the output hash table.
      • For each active input x located at offset i in the receptive field of an output y, add a row (input-hash(x), output-hash(y)) to rule-book element R_i.
      • Initialize the output matrix to all zeros.
      • For each i ∈ F, there is a parameter matrix W_i ∈ R^(m×n).
      • For each row (j, k) in R_i, multiply the j-th row of the input feature matrix by W_i and add it to the k-th row of the output feature matrix (this can be done on the GPU); see the sketch below.
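The following compact sketch is my own reading of the algorithm above (toy Python, not the authors' CUDA/C++ code): it builds the output hash table and rule book in one pass over the active inputs, then applies one weight matrix W_i per offset i ∈ F. Padding and grid boundaries are ignored for simplicity.

```python
import itertools
import numpy as np

def sc_forward(in_hash, in_feats, weights, f, s, n):
    """in_hash: {location: row}, in_feats: (a, m) feature matrix,
    weights: {offset i in F: (m, n) matrix W_i}."""
    out_hash, rule_book = {}, {i: [] for i in weights}
    # One pass over active inputs: register every output site that can see them.
    for loc, j in in_hash.items():
        for i in itertools.product(range(f), repeat=2):      # i ∈ F
            y = tuple(c - o for c, o in zip(loc, i))          # candidate output * s
            if any(c % s for c in y):
                continue                                      # not on the stride grid
            y = tuple(c // s for c in y)
            if y not in out_hash:                             # first visit -> new row
                out_hash[y] = len(out_hash)
            rule_book[i].append((j, out_hash[y]))             # (input row, output row)
    # Initialise the output to zeros, then one GEMM-style update per offset i.
    out_feats = np.zeros((len(out_hash), n))
    for i, pairs in rule_book.items():
        W_i = weights[i]
        for j, k in pairs:
            out_feats[k] += in_feats[j] @ W_i
    return out_hash, out_feats

# Toy usage: two active sites, one input plane (m=1), one output plane (n=1).
in_hash = {(4, 4): 0, (4, 5): 1}
in_feats = np.array([[1.0], [2.0]])
weights = {i: np.ones((1, 1)) for i in itertools.product(range(3), repeat=2)}
out_hash, out_feats = sc_forward(in_hash, in_feats, weights, f=3, s=1, n=1)
print(len(out_hash), out_feats.sum())   # 12 active outputs; total mass 9 * (1 + 2) = 27
```
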
  • SSC:

      • The computation steps are the same as for SC.
    • Notes:
      • the input hash table is re-used for the output; only an appropriate rule_book needs to be constructed (see the sketch below)
      • as the sparsity pattern does not change, the same rule_book can be re-used in the network until a pooling or subsampling layer is encountered
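A matching sketch of the SSC variant (again hypothetical, a companion to the sc_forward sketch above): the output hash table is simply the input hash table, and a rule-book row is added only when an active input sits inside the receptive field of an active centre site.

```python
import itertools

def ssc_rule_book(in_hash, f):
    """Build the SSC rule book; the output hash table is the input hash table."""
    centre = (f - 1) // 2
    rule_book = {i: [] for i in itertools.product(range(f), repeat=2)}
    for y, k in in_hash.items():                  # outputs == active inputs
        for i in rule_book:
            x = tuple(c + o - centre for c, o in zip(y, i))   # input at offset i
            if x in in_hash:                      # only active inputs contribute
                rule_book[i].append((in_hash[x], k))
    return in_hash, rule_book                     # same hash table, new rule book
```
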
  • Cost estimate notes:

    • For a active points in the input layer, the cost of building the input hash-table is O(a).
    • For FCNs and U-Nets, building the hash tables and rule books is also O(a) regardless of the depth of the network, as each downsampling reduces the number of activations multiplicatively.
    • SC costs approximately as much as the voting algorithm, while SSC has a lower cost, as there is no interaction between the active points.

Experiments

  • Architecture examples: (figure: sample architectures)
  • Performance results: (table: performance results)
      • Comparison with baseline methods:

      • Comparison between architectures:

        • C3: operates at a single spatial resolution by stacking SSC(·, ·, 3) convolutions; 8, 16, 32, or 64 filters per layer and 2, 4, or 6 layers (a minimal sketch appears under 'Usable code' below).
        • FCNs & U-Nets: 8, 16, 32, or 64 filters in the first layer, doubling the number of filters with each downsampling; convolutional blocks are stacks of 1, 2, or 3 SSC layers, or stacks of 1, 2, or 3 residual blocks.

Usable code:

Their software toolbox for constructing SSCNs is publicly available at https://github.com/facebookresearch/SparseConvNet.
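A rough sketch of how a small C3-style stack might be assembled with the toolbox. The class names and signatures below are recalled from the repository's README and are not verified here, so treat them as assumptions rather than a reference.

```python
# Assumed API (from memory of the SparseConvNet README; verify against the repo):
# scn.SubmanifoldConvolution(dimension, nIn, nOut, filter_size, bias)
import sparseconvnet as scn

# A 3D "C3"-style model: a stack of stride-1 submanifold convolutions.
model = scn.Sequential().add(
    scn.SubmanifoldConvolution(3, 1, 16, 3, False)).add(
    scn.SubmanifoldConvolution(3, 16, 16, 3, False)).add(
    scn.SubmanifoldConvolution(3, 16, 16, 3, False))
# Input is fed in sparse form (coordinates + features), e.g. via an input layer
# provided by the library, before passing it through `model`.
```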