This article was written by Koustuch on CV-Tricks.
In this series of posts, we shall learn an algorithm for image segmentation and implement it using TensorFlow. This is the first part of the series, where we shall focus on understanding and implementing a deconvolutional (fractionally strided convolutional) layer in TensorFlow.
Why is the deconvolutional layer so important?
Image segmentation is just one of the many use cases of this layer. In any computer vision application where the resolution of the final output must be larger than that of the input, this layer is the de facto standard. It is used in popular applications such as Generative Adversarial Networks (GANs), image super-resolution, surface depth estimation from an image, and optical flow estimation. These are direct applications of the deconvolutional layer. It has also been adopted in other tasks such as fine-grained recognition and object detection. In these use cases, existing systems can use a deconvolutional layer to merge responses from different convolutional layers and significantly boost their accuracy.
There are four main parts of this post:
What is image segmentation?
What is a deconvolutional layer?
Initialization strategy for the deconvolutional layer.
Writing a deconvolutional layer for TensorFlow.
Image segmentation is the process of dividing an image into multiple segments (each segment is called a super-pixel). Each super-pixel may represent one common entity, like the super-pixel for the dog’s head in the figure. Segmentation creates a representation of the image that is easier to understand and analyze, as shown in the example. It is a computationally expensive process because we need to classify every pixel of the image.
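To make "classify each pixel" concrete, here is a minimal sketch in plain Python. It assumes we already have a score for every class at every pixel; the scores, the class meanings and the `segment` helper are made up for illustration and are not taken from the article.

```python
# Hypothetical per-pixel class scores for a 2x3 image and 3 classes
# (0 = background, 1 = dog, 2 = cat). The numbers are invented for illustration.
scores = [
    [[0.9, 0.05, 0.05], [0.2, 0.7, 0.1], [0.1, 0.8, 0.1]],
    [[0.8, 0.1, 0.1], [0.3, 0.6, 0.1], [0.1, 0.2, 0.7]],
]

def segment(scores):
    """Return a label map holding the highest-scoring class at each pixel."""
    return [
        [max(range(len(pixel)), key=lambda c: pixel[c]) for pixel in row]
        for row in scores
    ]

label_map = segment(scores)
print(label_map)  # [[0, 1, 1], [0, 1, 2]]
```

The label map has the same height and width as the image, which is why the cost of segmentation grows with the number of pixels.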
Convolutional neural networks (CNNs) are among the most effective tools for understanding images. But there is a problem with using them for image segmentation.
So how can convolutional neural networks be used for image segmentation?
In general, CNNs perform down-sampling, i.e. they produce an output of lower resolution than the input, due to the presence of max-pooling layers (strided convolutions have the same effect). Look at the figure below, which shows AlexNet and the size at each layer. It is fed an image of size 224*224*3 (150528 values) and, after 7 layers, we get a vector of size 4096. This representation of the input image is great for image classification and detection problems.
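The down-sampling effect of a max-pool layer can be sketched in a few lines of plain Python. This is only an illustration: the `max_pool_2x2` helper and the 4x4 feature map are assumptions made for the example, not code from AlexNet or TensorFlow.

```python
def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a 2-D list (height and width must be even)."""
    height, width = len(feature_map), len(feature_map[0])
    return [
        [
            max(
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            for c in range(0, width, 2)
        ]
        for r in range(0, height, 2)
    ]

# Hypothetical 4x4 feature map, invented for illustration.
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [3, 2, 4, 6],
]
pooled = max_pool_2x2(feature_map)
print(pooled)         # [[4, 5], [3, 7]]
print(224 * 224 * 3)  # 150528 values in a 224x224 RGB input
```

Each pooling step halves the height and the width. A classifier only needs one label per image, so losing spatial resolution this way is acceptable there.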
However, since segmentation is about finding the class of each and every pixel of the image, down-sampled maps cannot be used directly. For this, we use an upsampling convolutional layer, which is called a deconvolutional layer or a fractionally strided convolutional layer.
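As a rough sketch of what such an upsampling layer computes, the plain-Python function below implements a naive transposed convolution without padding. The name `transposed_conv2d`, the all-ones kernel and the 2x2 input are assumptions made for illustration. In TensorFlow the corresponding operation is `tf.nn.conv2d_transpose`, which this sketch does not try to reproduce in full.

```python
def transposed_conv2d(inp, kernel, stride=2):
    """Naive transposed convolution (no padding) on 2-D lists.

    Each input value scatters a scaled copy of the kernel into the output,
    so the output side length is (n - 1) * stride + k.
    """
    n_rows, n_cols = len(inp), len(inp[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = (n_rows - 1) * stride + k_rows
    out_cols = (n_cols - 1) * stride + k_cols
    out = [[0.0] * out_cols for _ in range(out_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            for a in range(k_rows):
                for b in range(k_cols):
                    # Overlapping contributions (when k > stride) are summed.
                    out[i * stride + a][j * stride + b] += inp[i][j] * kernel[a][b]
    return out

# Hypothetical 2x2 feature map and an all-ones 2x2 kernel.
inp = [[1.0, 2.0],
       [3.0, 4.0]]
kernel = [[1.0, 1.0],
          [1.0, 1.0]]
upsampled = transposed_conv2d(inp, kernel, stride=2)
for row in upsampled:
    print(row)
```

With this particular kernel and stride the 2x2 input becomes a 4x4 output in which every value is simply repeated in a 2x2 block. In a trained network the kernel weights are learned, so the layer can produce a smoother upsampling than plain repetition.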
To read the 3 other main parts of this post, click here. To read more articles about Tensorflow, click here.