Model Detail - Matting-Team/MattingProject Wiki

1. Model Implementation

(Model architecture diagram)

Model 1

Main Idea: Image matting is performed by combining low-level features, which carry structural information, with high-level features, which carry semantic information. The large task called matting is divided into three sub-tasks.

* High-level branch

The purpose of the high-level branch is simple: extract high-level features by passing the input through the encoder. The matting task fundamentally needs semantic information, and this is all the more true for trimap-free matting, as in this project.

The high-level branch uses an encoder built from a pretrained DenseNet201, loaded up to depth 3. Pretrained weights are used because a classification network trained on a huge dataset is judged to be well suited to capturing image features.

* Low level branch

Unlike the high-level branch, the low-level branch does not downsample the image, so the structural information is preserved as-is. Note that this branch should not be too deep: as it deepens, the receptive field broadens, which in turn causes loss of structural information. The number of channels is also necessarily limited, because the feature maps keep the full height and width and would otherwise put a heavy load on memory.
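A hypothetical low-level branch along these lines (the layer count and channel width of 32 are assumptions, not the project's exact values): shallow, stride-1 convolutions with few channels, so the feature keeps the input's full height and width.

```python
import torch
import torch.nn as nn

# Hypothetical low-level branch: shallow, stride 1, few channels,
# so the output preserves the input's spatial size.
low_branch = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(inplace=True),
    nn.Conv2d(32, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(inplace=True),
)

with torch.no_grad():
    low_feat = low_branch(torch.randn(1, 3, 256, 256))
print(low_feat.shape)  # torch.Size([1, 32, 256, 256]): spatial size preserved
```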

* Fusion block

The role of the fusion block is simple: it combines the low-level and high-level features. No special module is used; it can be seen as an ordinary ConvBlock.
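One way such a block could look, assuming PyTorch; the channel counts and the bilinear upsampling are assumptions for illustration, not the project's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionBlock(nn.Module):
    """Upsample the high-level feature to the low-level resolution,
    concatenate along channels, and mix with an ordinary ConvBlock."""
    def __init__(self, low_ch, high_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(low_ch + high_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, low, high):
        high = F.interpolate(high, size=low.shape[2:], mode="bilinear",
                             align_corners=False)
        return self.conv(torch.cat([low, high], dim=1))

fuse = FusionBlock(low_ch=32, high_ch=1792, out_ch=64)  # hypothetical sizes
with torch.no_grad():
    out = fuse(torch.randn(1, 32, 256, 256), torch.randn(1, 1792, 16, 16))
print(out.shape)  # torch.Size([1, 64, 256, 256])
```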

* Training Loss

Loss - L1 Loss, SSIM Loss

  • L1 loss is a pixel-wise loss used across many tasks and papers.
  • SSIM loss compares similarity over a kernel window; used alongside other losses, it has shown excellent performance.
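A compact sketch of the combined loss in PyTorch. The uniform averaging window (a Gaussian window is also common) and the equal weighting of the two terms are assumptions:

```python
import torch
import torch.nn.functional as F

def ssim_loss(pred, target, window=11, c1=0.01 ** 2, c2=0.03 ** 2):
    """1 - mean SSIM, computed with a uniform window via average pooling."""
    pad = window // 2
    mu_p = F.avg_pool2d(pred, window, 1, pad)
    mu_t = F.avg_pool2d(target, window, 1, pad)
    var_p = F.avg_pool2d(pred * pred, window, 1, pad) - mu_p ** 2
    var_t = F.avg_pool2d(target * target, window, 1, pad) - mu_t ** 2
    cov = F.avg_pool2d(pred * target, window, 1, pad) - mu_p * mu_t
    ssim = ((2 * mu_p * mu_t + c1) * (2 * cov + c2)) / (
        (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2))
    return 1 - ssim.mean()

def matting_loss(pred, target):
    # Equal weighting of L1 and SSIM is an assumption.
    return F.l1_loss(pred, target) + ssim_loss(pred, target)

alpha = torch.rand(1, 1, 64, 64)
print(float(matting_loss(alpha, alpha)))  # ~0.0 for identical inputs
```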

* Training Options

  • Optimizer: Adam, learning rate 0.0001, betas (0.5, 0.999)
  • Epochs: 50
  • Augmentation: rotation up to 30 degrees left or right, horizontal and vertical flips

Model 2

(Model 2 architecture diagram)

Because its basic structure is taken from Model 1 above, it has a similar interpretation.

* Training Loss

Loss - L1 Loss, SSIM Loss

  • L1 loss is a pixel-wise loss used across many tasks and papers.
  • SSIM loss compares similarity over a kernel window; used alongside other losses, it has shown excellent performance.

* Training Options

  • Optimizer: Adam, learning rate 0.0001, betas (0.5, 0.999)
  • Epochs: 80
  • Augmentation: rotation up to 30 degrees left or right, horizontal and vertical flips

The settings are identical except for the number of epochs.

Modules

* CWise(Channel wise Attention)

CWise provides channel-based attention. It erases all spatial information, makes the operation depend only on the channels, and multiplies the input by the features that have passed through an activation (typically a sigmoid). The resulting attention values lie between 0 and 1: intuitively, important channels receive values close to 1 and unimportant ones values close to 0. Thanks to this, the network focuses its learning on the important channels.
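A sketch of such a module in PyTorch, in the style of squeeze-and-excitation attention; the bottleneck reduction ratio is an assumption:

```python
import torch
import torch.nn as nn

class CWise(nn.Module):
    """Channel-wise attention sketch: global pooling erases the spatial
    dims, a small bottleneck scores each channel, and the sigmoid output
    in (0, 1) rescales the input channels."""
    def __init__(self, channels, reduction=8):  # reduction ratio is an assumption
        super().__init__()
        self.score = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                 # (N, C, H, W) -> (N, C, 1, 1)
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.score(x)                     # broadcast over H, W

with torch.no_grad():
    x = torch.randn(2, 64, 32, 32)
    out = CWise(64)(x)
print(out.shape)  # torch.Size([2, 64, 32, 32])
```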

* SWise(Spatial Attention)

Spatial attention, as opposed to channel-wise attention, erases channel information by pooling and makes the operation depend only on spatial positions. As with channel-wise attention, the activation result is multiplied by the input feature. Again, the interpretation is that attention is paid only to important spatial locations.
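A sketch of such a module in PyTorch, in the style of CBAM's spatial attention; pooling with both mean and max across channels, and the 7x7 kernel, are assumptions:

```python
import torch
import torch.nn as nn

class SWise(nn.Module):
    """Spatial attention sketch: mean- and max-pool across the channel dim
    to erase channel information, then a conv + sigmoid produces a
    per-position mask in (0, 1) that rescales the input."""
    def __init__(self, kernel=7):  # kernel size is an assumption
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel, padding=kernel // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)      # (N, 1, H, W)
        mx, _ = x.max(dim=1, keepdim=True)     # (N, 1, H, W)
        mask = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * mask                        # broadcast over channels

with torch.no_grad():
    x = torch.randn(2, 64, 32, 32)
    out = SWise()(x)
print(out.shape)  # torch.Size([2, 64, 32, 32])
```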