Author(s): Sanket Rajaram

Originally published on Towards AI.

Understanding Convolutional Neural Network (CNN) — A Guide to Visual Recognition in the AI Era

This article will help you understand how artificial neural networks are applied to visual recognition problems. We'll cover the basics of convolutional neural networks, different image processing strategies for feature generation, and dimensionality reduction techniques for taming computational complexity. The goal of this article is to provide a deeper understanding of the structural design changes made to a traditional artificial neural network so that it can solve real-world problems in the field of image processing.

We'll start from the basics of the design components, such as convolution over single-channel and multi-channel images with kernels of different sizes for feature map generation, and move on to the pooling operation for subsampling. Further, we will cover the building blocks of convolutional neural networks, such as the convolution layer and pooling layer, and hyperparameters such as convolution strides and padding. With this article, you will be able to understand how to practically implement a convolutional neural architecture with its different components (convolution, pooling, and a fully-connected neural network) in TensorFlow 2.0.

This article discusses the classical convolutional neural network architecture LeNet-5, which consists of stacked convolutional layers for feature map generation and pooling layers for feature subsampling, followed by a multi-layered, fully-connected feed-forward neural network for multi-class classification. You will also get an insight into how to design an artificial neural network for the most common use cases, such as image classification and object detection, without any rule-based, hand-crafted feature detection algorithms, using a more generalized form of the visual recognition architecture.
In this article, we're going to cover the following main topics:

- Introduction to Conventional Visual Recognition
- Building Blocks of Convolutional Neural Network (CNN)
- Designing a Convolutional Neural Network Architecture
- LeNet-5 — A Classical Neural Network Architecture
- Implementing Convolutional Neural Network with TensorFlow 2.0

Technical requirements

The program code is written and run in Google Colab, a notebook service offered by Google. Installation instructions are given in the respective sections as required.

Introduction to Conventional Visual Recognition

In this section, we will take a brief overview of digital image processing: how an image is processed algorithmically, what happens if we feed a matrix of pixels directly to an artificial neural network, how we can reduce the spatial dimensions of image data, and how conventional neural network architectures such as the multi-layered perceptron cope with image transformations. This section also highlights the inefficiency of traditional hand-crafted, rule-based feature detection algorithms, which fail to remain invariant to image translations, and of the computationally expensive architectures they lead to.

What is Digital Image Processing?

Digital image processing is a way of applying a set of sophisticated techniques and algorithms to digital images to enhance, optimize, and extract useful information from them. The structure of an image is mainly a two-dimensional array of picture elements called pixels. An image is rendered as a two-dimensional matrix of pixels, as shown in the following figure, where the number of rows represents the height of the image and the number of columns represents its width. The dimensions of a digital image are its height and width measured in pixels. A picture element (pixel) represents a brightness or colour intensity value. In a grayscale image, it ranges from 0 (black) to 255 (white), covering the shades of grey in between.
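As a concrete illustration (a minimal NumPy sketch; the values here are made up for demonstration), a grayscale image is simply a two-dimensional array of intensity values:

```python
import numpy as np

# A tiny 4x4 grayscale "image": each entry is one pixel's intensity
# in the range [0, 255]. 0 is black, 255 is white, and values in
# between are shades of grey.
image = np.array([
    [  0,  64, 128, 255],
    [ 32,  96, 160, 224],
    [ 64, 128, 192, 255],
    [  0,  50, 100, 150],
], dtype=np.uint8)

print(image.shape)               # (4, 4): 4 rows (height) x 4 columns (width)
print(image.min(), image.max())  # intensities stay within the 0-255 range
```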
In a colour image, each of the red, green, and blue channels ranges from 0 to 255.

Fig. 1 Digital Image Processing

Computer vision is one of the key areas of artificial intelligence that tries to mimic the human visual system, enabling computing devices to perceive, understand, and process the information in images and videos. Until recently, this field had only limited capabilities for extracting and processing features in images and videos, but deep neural networks are now helping computers surpass human ability in some tasks of detecting and tagging objects in images. One of the key factors in the rapid development of efficient computer vision prototypes is the advancement of smartphone technology: we now have far more images and videos with which to train artificial neural networks, improving the ability of computers to recognize objects in an image.

In traditional rule-based algorithmic systems, objects in an image are typically identified with the help of feature detection algorithms. For example, to detect the edges of an object in an image, we use edge detection operators such as Sobel and Prewitt, whereas to detect the corners of an object, we use corner detection techniques such as Harris. For more robust feature detection, invariant features such as the scale-invariant feature transform (SIFT) and speeded-up robust features (SURF) are used, but these features are designed to work in a limited context and fail to generalize: they are restricted to object identification tasks, while neural networks have outperformed them in image classification and retrieval tasks.

Artificial Neural Network for Digital Image Processing

One of the key problems in computer vision is the size of the input data, which can be very large. Assume we have a colour image of size 16 by 16 pixels, as shown in the following figure.
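The hand-crafted edge detectors mentioned above, such as Sobel, are essentially fixed convolution kernels slid over the image. A minimal NumPy sketch is shown below (a real pipeline would use a library such as OpenCV or scikit-image; the synthetic image here is purely illustrative):

```python
import numpy as np

# Sobel kernel that responds to horizontal intensity changes,
# i.e. it highlights vertical edges.
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float32)

def convolve2d(image, kernel):
    """Valid-mode 2-D convolution of a grayscale image with a small kernel."""
    kh, kw = kernel.shape
    h, w = image.shape
    # Flip the kernel for true convolution (vs. cross-correlation);
    # for the symmetric Sobel rows this only changes the sign convention.
    k = np.flipud(np.fliplr(kernel))
    out = np.zeros((h - kh + 1, w - kw + 1), dtype=np.float32)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * k)
    return out

# A synthetic image: dark left half, bright right half -> one vertical edge.
img = np.zeros((6, 6), dtype=np.float32)
img[:, 3:] = 255.0

edges = convolve2d(img, SOBEL_X)
print(edges)  # large magnitudes only near the column where intensity jumps
```

Note how the filter is fixed by hand: a convolutional neural network instead learns the kernel values from data.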
As it is a colour image, it has red, green, and blue channels; thus, the total number of input values in this image is 16 x 16 x 3 = 768. This is still a manageable input size for a neural network: in a multi-layered network, an input of size 768 typically leads to several thousand parameters (weights) to process. But for large images, say a colour image with a resolution of 1200 by 800 pixels, the input grows to 2,880,000 values, and the network will need millions of parameters to process it, which is highly time-consuming and an inefficient way of dealing with image data. […]
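The arithmetic above can be checked directly. As a minimal sketch, the following counts the flattened input size and the weights of a first fully-connected layer (the hidden width of 1,000 units is an assumption chosen purely for illustration, not a value from the article):

```python
# Weights in the first fully-connected layer of an MLP grow linearly
# with the flattened input size (biases are ignored here for simplicity).

def flattened_pixels(height, width, channels=3):
    """Number of input values when an image is flattened into a vector."""
    return height * width * channels

def first_layer_weights(input_size, hidden_units):
    """Each hidden unit connects to every input value."""
    return input_size * hidden_units

small = flattened_pixels(16, 16)      # 16 x 16 x 3 = 768 input values
large = flattened_pixels(1200, 800)   # 1200 x 800 x 3 = 2,880,000 input values

print(small, large)

# With a hypothetical hidden layer of 1,000 units:
print(first_layer_weights(small, 1000))   # hundreds of thousands of weights
print(first_layer_weights(large, 1000))   # billions of weights
```

This is exactly the blow-up that motivates convolution: sharing a small kernel across the image keeps the parameter count independent of the image size.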