Paper review. 3D Semantic Segmentation with Submanifold Sparse Convolutional Networks
Paper
- Title: 3D Semantic Segmentation with Submanifold Sparse Convolutional Networks
- Authors: Benjamin Graham, Martin Engelcke, Laurens van der Maaten
- Link: https://arxiv.org/abs/1711.10275
- Tags: 3D, Convolutional network, Optimization
- Year: 2017
Summary
What
- The paper describes an approach to optimize the processing of sparse data with convolutional neural networks.
- Dramatically reduces computational load -> speeds up processing on sparse data.
- 'Sparse': data whose 'effective' dimensionality is lower than that of its representation, e.g. a line in a 2D image, a point cloud in 3D, etc.
- Proposed submanifold sparse convolutional networks (SSCNs) for the efficient processing of high-dimensional, sparse input data.
- They explain the name "SSCNs" by the fact that the networks process low-dimensional data living in a space of higher dimensionality.
Problems addressed
- Conv nets are computationally expensive for data in higher dimensionalities.
- If most of the space in an input signal is empty, processing can be optimized to spend resources only on the useful part of the information.
- Dilation of the sparse activation via convolutions:
  - Conv networks are, by construction, poorly suited to processing sparse data.
  - The widening of the receptive field under consecutive ordinary convolutions is a harmful property for segmentation tasks.
  - Together with the widening receptive field, the sparsity of the data decreases, and activations become dilated in deeper layers.
  - Illustration: in an ordinary conv net, a 3x3 convolution applied to a sparse input smears its activations across neighbouring sites.
  - To limit this dilation of sparsity, they restrict the output of each conv layer to the sites that were already active in the input (a small numeric illustration of both cases follows this section).
- To pass spatial information from neighbouring pixels forward through the network, pooling or strided convolution layers are essential.
- The following illustration and description are taken from the authors' git repo:
  - (i) an active point is highlighted; a convolution with stride 2 sees the green active sites (ii) and produces output (iii), in which the 'children' of the highlighted active point from (i) are highlighted;
  - a submanifold sparse convolution sees the green active sites (iv) and produces output (v);
  - a deconvolution operation sees the green active sites (vi) and produces output (vii).
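As a small numeric illustration of this dilation effect (my own sketch, not from the paper; the 9x9 grid and the helper name `dilate_active` are made up), the snippet below counts active sites after repeated 3x3 convolutions, with and without restricting outputs to the originally active sites:

```python
import numpy as np

def dilate_active(mask, f=3):
    """Active sites after a regular f x f convolution: every output site whose
    receptive field contains at least one active input site becomes active."""
    pad = f // 2
    padded = np.pad(mask, pad)
    out = np.zeros_like(mask)
    h, w = mask.shape
    for dy in range(f):
        for dx in range(f):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

# One active pixel on a 9x9 grid: a tiny stand-in for sparse, low-dimensional input.
mask = np.zeros((9, 9), dtype=bool)
mask[4, 4] = True

m = mask
for layer in range(1, 4):
    m = dilate_active(m)  # ordinary convolution: the active set grows every layer
    print(f"layer {layer}: {m.sum():2d} active sites (regular conv) vs "
          f"{mask.sum()} (restricted / submanifold conv)")
# layer 1:  9 | layer 2: 25 | layer 3: 49 active sites, while the submanifold
# restriction keeps exactly the single original active site.
```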
How
- For d-dimensional data they work with (d + 1)-dimensional tensors: d spatial dimensions plus one feature dimension; the set of "active" sites is tracked through the hidden layers to the output.
- The sparse "active" set is precomputed at the start of each forward pass; this active "mask" is then used for all required computations.
- The active set is tracked with a hash table: the key is a site's position (a pixel, or a site in a hidden layer) and the value is the row of that site's feature vector in the feature matrix (see Implementation below).
- The "activity" criterion can also be used to sparsify originally non-sparse data, using a threshold to drop weak data points (a minimal sketch of such thresholding follows this section).
- "Inactive" sites are marked by all sharing the same feature vector, the ground state (all zeros for SSC networks).
- The value of the ground state (the ground value) acts like a bias coming from the dataset itself.
- Therefore the ground value is calculated once per forward pass at training time, and only once for all forward passes at test time.
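A minimal sketch (mine, not the authors'; the function name `sparsify` and the threshold are illustrative) of converting originally dense data into the sparse form used throughout, i.e. integer coordinates of active sites plus an a × m feature matrix; sub-threshold sites are treated as inactive, keep the all-zero ground state, and are simply not stored:

```python
import numpy as np

def sparsify(dense, threshold=1e-3):
    """Convert a dense (H, W, m) feature array into sparse form:
    integer coordinates of the active sites plus an (a, m) feature matrix.
    Sites whose feature norm is <= threshold are treated as inactive; they keep
    the ground state (all zeros for SSC networks) and are not stored."""
    norms = np.linalg.norm(dense, axis=-1)
    active = norms > threshold
    coords = np.argwhere(active)      # (a, 2) integer site locations
    features = dense[active]          # (a, m) feature matrix, one row per active site
    return coords, features

dense = np.zeros((5, 5, 4), dtype=np.float32)
dense[1, 2] = [0.5, -1.0, 0.0, 2.0]
dense[3, 3] = [0.1, 0.0, 0.7, 0.0]
coords, features = sparsify(dense)
print(coords.tolist())   # [[1, 2], [3, 3]]
print(features.shape)    # (2, 4)
```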
Operators
- Sparse convolution operator (old):
  - Defined as SC(m, n, f, s) with m input feature planes, n output feature planes, a filter size of f, and stride s.
  - An output site is active if any site in its receptive field of size f^d is active (see the sketch after the Operators list).
  - Example: for an input of size l, the output has size (l − f + s)/s.
  - Reduces computation by approximately 50%.
- Submanifold sparse convolution operator (new):
  - SSC(m, n, f) is a modification of SC(m, n, f, s=1).
  - The input is padded with (f − 1)/2 zeros on each side so that the output has the same size as the input.
  - An output site is active iff the central site of its receptive field is active, so there is no dilation of the active set.
- Activation function (old/same):
  - Ordinary, but applied only to active sites.
- Pooling:
  - Max-pooling MP(f, s) and average-pooling AP(f, s) are variants of SC(·, ·, f, s).
  - MP takes the maximum of the zero vector and the input feature vectors in the receptive field.
  - AP calculates f^(−d) times the sum of the active input feature vectors.
- Deconvolution:
  - DC(·, ·, f, s) is an inverse of SC(·, ·, f, s).
  - The set of active output sites of DC is exactly the set of active input sites of the corresponding SC.
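To make the difference between the SC and SSC active-site rules concrete, here is a small 2D sketch (my own; `sc_active_outputs` and `ssc_active_outputs` are invented helper names, and the SC case is written for an unpadded 'valid' convolution):

```python
import numpy as np

def sc_active_outputs(active_in, f, s):
    """SC(m, n, f, s) rule: an output site is active if its receptive field of
    size f^d contains at least one active input site (2D case, no padding)."""
    out = set()
    for p in active_in:
        # output y sees input p iff y*s <= p <= y*s + f - 1 in every dimension
        lo = [int(np.ceil((c - f + 1) / s)) for c in p]
        hi = [c // s for c in p]
        for y0 in range(max(lo[0], 0), hi[0] + 1):
            for y1 in range(max(lo[1], 0), hi[1] + 1):
                out.add((y0, y1))
    return out

def ssc_active_outputs(active_in, f):
    """SSC(m, n, f) rule: with (f-1)/2 zero padding, an output site is active iff
    the central site of its receptive field is active, so the active set is preserved."""
    return set(map(tuple, active_in))

active = [(4, 4), (4, 5), (5, 6)]   # a short "curve" of active sites in a 2D grid
print(len(sc_active_outputs(active, f=3, s=1)))   # 17 sites: the active set dilates
print(len(ssc_active_outputs(active, f=3)))       # 3 sites: the submanifold is preserved
```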
Implementation
- The state of an (S)SC layer is stored as a matrix and a hash table:
  - The matrix has size a × m and contains one row for each of the a active sites.
  - The hash table contains (location, row) pairs for all active sites:
    - location: a tuple of integer coordinates
    - row: the index of the corresponding row in the feature matrix
- Prerequisites:
  - Given a convolution filter size f
  - Let F = {0, 1, ..., f−1}^d denote the spatial size of the convolutional filter
  - Define a rule_book: a collection R = (R_i : i ∈ F) of f^d integer matrices, each with two columns
- SC(m, n, f, s):
  - Building the output hash table and rule book, in a single pass over the active sites: iterate over the input hash table, looking at each active point in the input layer and all the points in the output layer that can see it.
    - When an output site is first visited, a new entry is created in the output hash table.
    - For each active input x located at point i in the receptive field of an output y, add a row (input-hash(x), output-hash(y)) to rule book element R_i.
  - Computing the output features:
    - Initialize the output feature matrix to all zeros.
    - For each i ∈ F there is a parameter matrix W_i ∈ R^(m×n).
    - For each row (j, k) in R_i, multiply the j-th row of the input feature matrix by W_i and add it to the k-th row of the output feature matrix (this can be done on the GPU).
  - A runnable sketch of this procedure follows the Implementation notes.
- SSC(m, n, f):
  - The computations are the same as for SC.
  - Notes:
    - The input hash table is re-used for the output (only an appropriate rule_book is constructed).
    - As the sparsity pattern does not change, the same rule_book can be re-used in the network until a pooling or subsampling layer is encountered.
- Cost estimate notes:
  - For a active points in the input layer, the cost of building the input hash table is O(a).
  - For FCN and U-Net networks, building all hash tables and rule books is also O(a) regardless of the depth of the network, as each downsampling layer reduces the number of active sites multiplicatively.
  - SC costs approximately as much as the voting algorithm, while SSC has a lower cost, as there is no interaction between the active points.
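Below is a runnable pure-NumPy sketch of the recipe above (my own reconstruction, not the authors' code; function names and test data are invented, and it implements an unpadded SC). For SSC one would instead re-use the input hash table as the output hash table and build the rule book only around the already-active sites:

```python
import numpy as np
from collections import defaultdict
from itertools import product

def build_hash_table(coords):
    """Hash table of an (S)SC layer: maps an integer location to the row of that
    site's feature vector in the feature matrix."""
    return {tuple(p): row for row, p in enumerate(coords)}

def build_sc_structures(in_hash, f, s):
    """One pass over the active input sites builds the output hash table and the
    rule book R = {offset i in F -> list of (input_row, output_row) pairs}."""
    out_hash, rule_book = {}, defaultdict(list)
    for x, in_row in in_hash.items():
        # every output y whose receptive field [y*s, y*s + f - 1]^d contains x
        ranges = [range(int(np.ceil((c - f + 1) / s)), c // s + 1) for c in x]
        for y in product(*ranges):
            if any(c < 0 for c in y):
                continue                             # outside the (unpadded) output grid
            if y not in out_hash:                    # first visit: create output entry
                out_hash[y] = len(out_hash)
            i = tuple(xc - yc * s for xc, yc in zip(x, y))  # offset of x inside the filter
            rule_book[i].append((in_row, out_hash[y]))
    return out_hash, rule_book

def sc_forward(in_features, rule_book, weights, n_out_sites, n_out_planes):
    """For each offset i and each row (j, k) in R_i: output[k] += in_features[j] @ W_i."""
    out = np.zeros((n_out_sites, n_out_planes))
    for i, pairs in rule_book.items():
        W_i = weights[i]                             # parameter matrix W_i of shape (m, n)
        for j, k in pairs:
            out[k] += in_features[j] @ W_i
    return out

# Tiny 2D example: a = 3 active sites, m = 2 input planes, n = 4 output planes, f = 3, s = 1
coords = [(4, 4), (4, 5), (5, 6)]
in_features = np.random.randn(len(coords), 2)
in_hash = build_hash_table(coords)
out_hash, rule_book = build_sc_structures(in_hash, f=3, s=1)
weights = {i: np.random.randn(2, 4) for i in product(range(3), repeat=2)}
out = sc_forward(in_features, rule_book, weights, len(out_hash), 4)
print(len(out_hash), out.shape)   # 17 active output sites, output features of shape (17, 4)
```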
Experiments
- Architecture examples:
- Performance results:
- Comparison with baseline methods:
  - ShapeContext
  - Dense 3D ConvNet: a dense version of the SSCN
  - Multi-view 2D convolutional network
- Comparison between architectures:
  - C3: operates on a single spatial resolution by stacking SSC(·, ·, 3) convolutions; used 8, 16, 32, or 64 filters per layer, and 2, 4, or 6 layers.
  - FCNs & U-Nets: 8, 16, 32, or 64 filters in the first layer, doubling the number of filters with each downsampling; conv blocks are stacks of 1, 2, or 3 SSC convolutions, or stacks of 1, 2, or 3 residual blocks.
- Usable code: their software toolbox for constructing SSCNs (SparseConvNet) is publicly available at https://github.com/facebookresearch/SparseConvNet.