Paper review: 3D Semantic Segmentation with Submanifold Sparse Convolutional Networks

Paper

  • Title: 3D Semantic Segmentation with Submanifold Sparse Convolutional Networks
  • Authors: Benjamin Graham, Martin Engelcke, Laurens van der Maaten
  • Link: https://arxiv.org/abs/1711.10275
  • Tags: 3D, Convolutional network, Optimization
  • Year: 2017

Summary

  • What

    • The paper describes an approach to optimize the processing of sparse data with convolutional neural networks.
    • Dramatically reduces computational load -> speeds up processing on sparse data.
    • 'Sparse': data whose 'effective' dimensionality is lower than that of its representation: a line in a 2D image, a point cloud in 3D, etc.
    • Proposed submanifold sparse convolutional networks (SSCNs) for the efficient processing of high-dimensional, sparse input data.
    • The name "submanifold" (SSCNs) reflects that the network processes low-dimensional data living in a space of higher dimensionality.
  • Aimed problems

    • Conv nets are computationally expensive for data in higher dimensions
      • If most of the space in an input signal is empty, processing can be optimized so that resources are spent only on the informative part
    • Dilation of the sparse activation via convolutions:
      • Conv networks are, by their very construction, not well suited to processing sparse data.

      • The receptive field widens with each consecutive ordinary convolution, which is harmful for segmentation tasks.

      • As the receptive field widens, the sparsity of the data decreases and activations become dilated in deeper layers.

        • Illustration: in ordinary conv nets, a sparse input has its activations smeared after a 3x3 convolution is applied (figure: ConvNet activation dilation problem).
        • To limit the dilation of sparsity, they restrict the output of each conv layer to the points that were active at the input ('active regions'); see the sketch after this list.
      • To propagate spatial information from neighboring pixels forward through the network, pooling or strided convolution layers are therefore essential

      • The following illustration and description are taken from the authors' git repo:

      • (animated figure from the authors' repo showing active sites layer by layer; panels described below)

      • (i) an active point is highlighted; a convolution with stride 2 sees the green active sites (ii) and produces output (iii); 'children' of the highlighted active point from (i) are highlighted;
      • a submanifold sparse convolution sees the green active sites (iv) and produces output (v);
      • a deconvolution operation sees the green active sites (vi) and produces output (vii).
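To make the dilation effect concrete, here is a minimal numpy sketch (my own illustration, not the authors' code). It counts how the set of active sites grows under repeated dense 3x3 convolutions, and how it stays fixed under the submanifold rule that keeps only the input's active sites.

```python
import numpy as np

def conv3x3_active(mask):
    """Active sites after a dense 3x3 convolution: any site whose 3x3
    receptive field touches an active input becomes active."""
    padded = np.pad(mask, 1)
    out = np.zeros_like(mask, dtype=bool)
    h, w = mask.shape
    for dy in range(3):
        for dx in range(3):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def submanifold_active(mask):
    """Submanifold rule: an output site is active iff the corresponding
    (central) input site is active -> the sparsity pattern is preserved."""
    return mask.copy()

mask = np.zeros((21, 21), dtype=bool)
mask[10, :] = True                                # a 1-pixel-wide line: 21 active sites

dense = mask
for layer in range(3):
    dense = conv3x3_active(dense)
    print(f"ordinary 3x3 conv, layer {layer + 1}: {dense.sum()} active sites")
print(f"submanifold conv, any depth:   {submanifold_active(mask).sum()} active sites")
```
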
  • How

    • They use the additional (d + 1)st dimension for d-dimensional data to track the "active" path throughout hidden layers to the output.
    • The sparse "active" route is precomputed on each iteration; the resulting "mask" of active sites is then applied to all required computations.
    • The active "route" is tracked by a hash table, where the position of a pixel (or of a site in a hidden layer) is the key and the row index in the feature matrix is the value (see Implementation below).
    • The "activity" notion can also be used to sparsify originally non-sparse data, by thresholding away weak data points (see the sketch after this list).
    • To mark the "inactive" elements of the network they use the ground state of a feature vector: all zeros, or some fixed value at the ground-state position.
      • The value of the ground state (ground value) acts like a bias coming from the dataset itself.
      • Therefore the ground value is calculated once per forward pass at training time, and only once for all forward passes at test time.
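A minimal sketch of the thresholding and ground-state idea described above (function and variable names are hypothetical, chosen only for illustration): dense data is turned into a feature matrix for the active sites plus a location-to-row hash table, while all inactive sites implicitly share one ground-state vector.

```python
import numpy as np

def sparsify(dense, threshold=0.0):
    """dense: (H, W, m) array of m-dimensional features on a 2D grid."""
    ground_state = np.zeros(dense.shape[-1])      # shared by all inactive sites
    hash_table, rows = {}, []                     # location -> row in `features`
    active = np.linalg.norm(dense, axis=-1) > threshold
    for loc in zip(*np.nonzero(active)):
        loc = tuple(int(c) for c in loc)
        hash_table[loc] = len(rows)
        rows.append(dense[loc])
    features = np.array(rows) if rows else np.empty((0, dense.shape[-1]))
    return features, hash_table, ground_state

dense = np.zeros((4, 4, 2))
dense[1, 2] = [0.5, -0.3]                         # a strong (kept) data point
dense[3, 0] = [0.01, 0.0]                         # a weak (dropped) data point
features, table, ground = sparsify(dense, threshold=0.1)
print(table)      # {(1, 2): 0}
print(features)   # [[ 0.5 -0.3]]
```
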

Operators

  • Sparse convolution operator (old):

    • Defined as SC(m, n, f, s) with m input feature planes, n output feature planes, a filter size of f, and stride s
    • An output site is defined as active if any site in its f^d receptive field is active.
    • Size rule: for an input of size l, the output has size (l − f + s)/s (quick check below).
    • Reduces computations by roughly 50%.
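A quick numeric check of the size rule above (my own sketch; it assumes (l − f + s) is evenly divisible by s):

```python
def sc_output_size(l: int, f: int, s: int) -> int:
    # Size rule quoted above; assumes (l - f + s) is divisible by s.
    return (l - f + s) // s

print(sc_output_size(l=7, f=3, s=2))   # 3
print(sc_output_size(l=6, f=3, s=1))   # 4
```
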
  • Submanifold Sparse convolution operator (new):

    • SSC(m, n, f) is a modification of SC(m, n, f, s=1)
    • Input is padded with (f − 1)/2 zeros on each side so that the output has the same size as the input.
    • An output site is active iff the central site of its receptive field is active (no dilation of the active set; see the sketch below).
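A small sketch (not the authors' implementation) contrasting which output sites the two operators mark as active for a stride-1 filter with f = 3 in 2D; grid boundaries are ignored for simplicity.

```python
import itertools

def sc_active(active_in, f=3):
    """SC rule: an output site is active if ANY site in its f x f receptive
    field is active -> the active set grows (dilates)."""
    offsets = list(itertools.product(range(-(f // 2), f // 2 + 1), repeat=2))
    return {(y + dy, x + dx) for (y, x) in active_in for (dy, dx) in offsets}

def ssc_active(active_in, f=3):
    """SSC rule: an output site is active iff the CENTRAL site of its
    receptive field is active -> the active set is unchanged."""
    return set(active_in)

active_in = {(0, 0), (1, 1), (2, 2)}          # a diagonal 'submanifold'
print(len(sc_active(active_in)))              # 19 -- sparsity dilates after one SC
print(len(ssc_active(active_in)))             # 3  -- unchanged after one SSC
```
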
  • Activation function (old/same)

    • Ordinary but applied only to active sites
  • Pooling

    • Max-pooling MP(f, s) and average-pooling AP(f, s) are defined as variants of SC(·, ·, f, s).
    • MP takes the element-wise max over the zero vector and the input feature vectors in the receptive field.
    • AP calculates f^−d times the sum of the active input vectors (see the sketch below).
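A tiny illustrative sketch of the two pooling rules at a single output site (my own, with made-up helper names); `active_vectors` are the feature vectors of the active inputs inside one f^d receptive field, while inactive sites contribute nothing.

```python
import numpy as np

def mp_site(active_vectors, m):
    """MP: element-wise max over the zero vector and the active input vectors."""
    return np.max(np.vstack([np.zeros(m)] + active_vectors), axis=0)

def ap_site(active_vectors, m, f, d):
    """AP: f^-d times the sum of the active input vectors."""
    return sum(active_vectors, np.zeros(m)) / f ** d

vecs = [np.array([1.0, -2.0]), np.array([3.0, 0.5])]
print(mp_site(vecs, m=2))            # [3.  0.5]
print(ap_site(vecs, m=2, f=2, d=3))  # [ 0.5    -0.1875]
```
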
  • Deconvolution

    • DC(·, ·, f, s) is an inverse of the SC(·, ·, f, s)
    • Its active output sites are exactly the active input sites of the corresponding SC.

Implementation

  • State of (S)SC layer stored as matrix and hash-table

    • Matrix has size a × m and contains one row for each of the a active sites
    • Hash-table: contains (location, row) pairs for all active sites
      • location is a tuple of integer coordinates
      • row: indicates the row in the feature matrix
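A minimal sketch of this layer state (hypothetical names): an a × m feature matrix plus a {location: row} hash table, where any location not present in the table is implicitly at the ground state.

```python
import numpy as np

a, m = 3, 4                                      # 3 active sites, 4 feature planes
features = np.random.randn(a, m)                 # one row per active site
hash_table = {(0, 2): 0, (5, 5): 1, (7, 1): 2}   # location -> row
ground_state = np.zeros(m)                       # shared by every inactive site

def feature_at(location):
    row = hash_table.get(location)
    return ground_state if row is None else features[row]

print(feature_at((5, 5)))   # row 1 of the feature matrix
print(feature_at((3, 3)))   # inactive site -> ground state (all zeros)
```
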
  • Prerequisites:

    • Given a convolution filter size f
    • Let F = {0, 1, . . . , f−1}^d denote the spatial size of the convolutional filter
    • Define rule_book: a collection R = (R_i : i ∈ F) of f^d integer matrices, each with two columns (used in the SC sketch below)
  • SC(m, n, f, s):

      • In a single pass, the input hash table, the active points in the input layer, and all the points in the output layer that can see them are combined into the output hash table and the rule book.
      • When an output site is first visited, a new entry is created in the output hash table.
      • For each active input x located at offset i in the receptive field of an output y, add a row (input-hash(x), output-hash(y)) to rule-book element R_i.
      • Initialize the output matrix to all zeros.
      • For each i ∈ F, there is a parameter matrix W_i ∈ R^(m×n).
      • For each row (j, k) in R_i, multiply the j-th row of the input feature matrix by W_i and add it to the k-th row of the output feature matrix (this can be done on the GPU); see the sketch below.
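The following compact sketch is my own reading of the algorithm above (toy Python, not the authors' CUDA/C++ code): it builds the output hash table and rule book in one pass over the active inputs, then applies one weight matrix W_i per offset i ∈ F. Padding and grid boundaries are ignored for simplicity.

```python
import itertools
import numpy as np

def sc_forward(in_hash, in_feats, weights, f, s, n):
    """in_hash: {location: row}, in_feats: (a, m) feature matrix,
    weights: {offset i in F: (m, n) matrix W_i}."""
    out_hash, rule_book = {}, {i: [] for i in weights}
    # One pass over active inputs: register every output site that can see them.
    for loc, j in in_hash.items():
        for i in itertools.product(range(f), repeat=2):      # i ∈ F
            y = tuple(c - o for c, o in zip(loc, i))          # candidate output * s
            if any(c % s for c in y):
                continue                                      # not on the stride grid
            y = tuple(c // s for c in y)
            if y not in out_hash:                             # first visit -> new row
                out_hash[y] = len(out_hash)
            rule_book[i].append((j, out_hash[y]))             # (input row, output row)
    # Initialise the output to zeros, then one GEMM-style update per offset i.
    out_feats = np.zeros((len(out_hash), n))
    for i, pairs in rule_book.items():
        W_i = weights[i]
        for j, k in pairs:
            out_feats[k] += in_feats[j] @ W_i
    return out_hash, out_feats

# Toy usage: two active sites, one input plane (m=1), one output plane (n=1).
in_hash = {(4, 4): 0, (4, 5): 1}
in_feats = np.array([[1.0], [2.0]])
weights = {i: np.ones((1, 1)) for i in itertools.product(range(3), repeat=2)}
out_hash, out_feats = sc_forward(in_hash, in_feats, weights, f=3, s=1, n=1)
print(len(out_hash), out_feats.sum())   # 12 active outputs; total mass 9 * (1 + 2) = 27
```
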
  • SSC:

      • The computation steps are the same as for SC.
    • Notes:
      • the input hash table is re-used for the output; only an appropriate rule_book needs to be constructed (see the sketch below)
      • as the sparsity pattern does not change, the same rule_book can be re-used in the network until a pooling or subsampling layer is encountered
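A matching sketch of the SSC variant (again hypothetical, a companion to the sc_forward sketch above): the output hash table is simply the input hash table, and a rule-book row is added only when an active input sits inside the receptive field of an active centre site.

```python
import itertools

def ssc_rule_book(in_hash, f):
    """Build the SSC rule book; the output hash table is the input hash table."""
    centre = (f - 1) // 2
    rule_book = {i: [] for i in itertools.product(range(f), repeat=2)}
    for y, k in in_hash.items():                  # outputs == active inputs
        for i in rule_book:
            x = tuple(c + o - centre for c, o in zip(y, i))   # input at offset i
            if x in in_hash:                      # only active inputs contribute
                rule_book[i].append((in_hash[x], k))
    return in_hash, rule_book                     # same hash table, new rule book
```
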
  • Cost estimate notes:

    • For a active points in the input layer, the cost of building the input hash-table is O(a).
    • For FCNs and U-Nets, building the hash tables and rule books is also O(a) regardless of the depth of the network, as each downsampling reduces the number of activations multiplicatively.
    • SC costs approximately as much as the voting algorithm, while SSC has a lower cost, as there is no interaction between the active points.

Experiments

  • Architecture examples: (figure: sample architectures)
  • Performance results: (table: performance results)
      • Comparison with baseline methods:

      • Comparison between architectures:

        • C3: operates at a single spatial resolution by stacking SSC(·, ·, 3) convolutions; 8, 16, 32, or 64 filters per layer and 2, 4, or 6 layers (a minimal sketch appears under 'Usable code' below).
        • FCNs & U-Nets: 8, 16, 32, or 64 filters in the first layer, doubling the number of filters with each downsampling; convolutional blocks are stacks of 1, 2, or 3 SSC layers, or stacks of 1, 2, or 3 residual blocks.

Usable code:

Their software toolbox for constructing SSCNs is publicly available at https://github.com/facebookresearch/SparseConvNet.
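A rough sketch of how a small C3-style stack might be assembled with the toolbox. The class names and signatures below are recalled from the repository's README and are not verified here, so treat them as assumptions rather than a reference.

```python
# Assumed API (from memory of the SparseConvNet README; verify against the repo):
# scn.SubmanifoldConvolution(dimension, nIn, nOut, filter_size, bias)
import sparseconvnet as scn

# A 3D "C3"-style model: a stack of stride-1 submanifold convolutions.
model = scn.Sequential().add(
    scn.SubmanifoldConvolution(3, 1, 16, 3, False)).add(
    scn.SubmanifoldConvolution(3, 16, 16, 3, False)).add(
    scn.SubmanifoldConvolution(3, 16, 16, 3, False))
# Input is fed in sparse form (coordinates + features), e.g. via an input layer
# provided by the library, before passing it through `model`.
```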