Convolutional Neural Network

Convolutional Neural Networks (CNNs) stand at the forefront of a revolution in machine learning, particularly in the realm of computer vision. Their groundbreaking impact lies in their ability to automatically decipher and understand images and videos, mimicking the way humans perceive visual information.

Unlike traditional methods that rely on manual feature engineering, CNNs autonomously learn hierarchical representations from raw pixel data. This transformative capability allows them to discern intricate patterns and features within images and videos, enabling tasks like image recognition, object detection, and even facial identification.

The advent of CNNs has thus ushered in a new era in artificial intelligence, significantly advancing fields such as healthcare, autonomous systems, and multimedia analysis. By unlocking the potential to interpret and comprehend visual information, CNNs have become indispensable tools in shaping the future of intelligent technology.

How a Computer Interprets Images and Videos

Images are essentially a grid of pixels, and each pixel represents a tiny dot of color. The color of each pixel is typically defined by a combination of three values: red, green, and blue (RGB). In grayscale images, each pixel is represented by a single intensity value. The arrangement of these pixels creates the visual content of an image.
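As a rough illustration of the pixel grid described above, the sketch below builds a tiny grayscale image and a tiny RGB image as NumPy arrays. The array names and pixel values are made up for the example.

```python
import numpy as np

# A 3x3 grayscale image: one intensity value (0-255) per pixel.
gray = np.array([
    [  0, 128, 255],
    [ 64, 128, 192],
    [255, 128,   0],
], dtype=np.uint8)

# A 2x2 RGB image: three values (red, green, blue) per pixel.
rgb = np.array([
    [[255, 0, 0], [0, 255, 0]],      # red pixel, green pixel
    [[0, 0, 255], [255, 255, 255]],  # blue pixel, white pixel
], dtype=np.uint8)

print(gray.shape)  # (3, 3): height, width
print(rgb.shape)   # (2, 2, 3): height, width, channels
print(rgb[0, 1])   # the RGB triple of the pixel in row 0, column 1 (green)
```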

A Convolutional Neural Network (CNN) is a specialized neural network adept at retaining spatial information, crucial for tasks like image analysis. It operates on image patches, analyzing groups of pixels simultaneously.

The key to preserving spatial details lies in the convolution layer, which applies various image filters, known as convolution kernels, to the input. These filters extract features such as object edges or distinguishing colors, defining how the convolution layer transforms an input image.

Spatial patterns of color and shape are vital. Shape can be thought of as a pattern of intensity, where intensity is a measure of brightness. Object edges show up as abrupt changes in intensity, so image filters that compare groups of neighboring pixels can detect these changes and produce outputs that highlight edges and shapes.


In terms of frequency, the high-frequency parts of an image are those where intensity changes rapidly from one pixel to the next, which often corresponds to object edges. High-pass filters are used to filter out irrelevant information and enhance distinguishing traits such as object boundaries.

These filters make images appear sharper by amplifying high-frequency components, where intensity rapidly changes between neighboring pixels. This process is crucial for emphasizing edges, indicative of object boundaries in images.
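The idea of a rapid intensity change between neighboring pixels can be seen in one dimension. The sketch below uses a made-up row of pixel values and takes the difference between each pixel and its neighbor; it is only meant to show where a high-pass response would be large.

```python
import numpy as np

# One row of a grayscale image: a dark region followed by a bright region.
row = np.array([10, 10, 10, 200, 200, 200], dtype=np.int32)

# Difference between each pixel and its left neighbour.
diff = np.diff(row)
print(diff)  # differences: 0, 0, 190, 0, 0

# The edge sits where the absolute change is largest.
edge_index = int(np.argmax(np.abs(diff)))
print(edge_index)  # 2, i.e. the jump between pixel 2 and pixel 3
```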

Convolution kernels:

A kernel, essentially a small matrix of numbers, serves as a powerful tool to modify images in computer vision applications. For instance, a 3 × 3 kernel designed for edge detection typically has one notable property: the sum of its elements equals zero.

Such a kernel computes weighted differences between a pixel and its neighbors. Because the weights sum to zero, a region of uniform brightness produces an output of zero, so the filter responds only to changes in intensity and does not brighten or darken the filtered image overall.
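A small check of the zero-sum property, as a sketch: the kernel below is one illustrative zero-sum choice (not necessarily the one the article had in mind), applied once to a flat patch and once to a patch containing an edge.

```python
import numpy as np

# A 3x3 zero-sum kernel (illustrative choice for this example).
kernel = np.array([
    [ 0, -1,  0],
    [-1,  4, -1],
    [ 0, -1,  0],
])
print(kernel.sum())  # 0

flat_patch = np.full((3, 3), 120)        # uniform brightness, no edge
edge_patch = np.array([[10, 10, 200],
                       [10, 10, 200],
                       [10, 10, 200]])   # dark-to-bright transition

# Response = sum of element-wise products of kernel and patch.
flat_response = int((kernel * flat_patch).sum())
edge_response = int((kernel * edge_patch).sum())
print(flat_response)  # 0: a flat region produces no response
print(edge_response)  # -190: a non-zero response at the edge
```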

Kernel convolution, a pivotal operation in computer vision and the foundation of Convolutional Neural Networks (CNNs), involves systematically passing a small grid of numbers (the kernel) over an image pixel by pixel.

This process transforms the image, offering diverse effects such as edge detection or image blurring. In mathematical terms, the convolution of an input image F(x,y) with the kernel K is denoted as K * F(x,y), resulting in an output image. This fundamental concept underscores the versatility and significance of kernel convolution in shaping visual effects in image processing and deep learning.
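The sliding-window operation can be sketched in a few lines of NumPy. The function name `convolve2d` and the test image are assumptions made for this example; the code favors readability over speed. Note that a true convolution flips the kernel before sliding it, whereas deep learning libraries typically skip the flip (cross-correlation), which makes no difference for symmetric kernels such as the box blur used here.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the output."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    flipped = kernel[::-1, ::-1]  # true convolution flips the kernel
    out = np.zeros((out_h, out_w), dtype=np.float64)
    for y in range(out_h):
        for x in range(out_w):
            patch = image[y:y + kh, x:x + kw]
            out[y, x] = np.sum(patch * flipped)
    return out

image = np.arange(25, dtype=np.float64).reshape(5, 5)
blur = np.full((3, 3), 1.0 / 9.0)  # box blur: average of the 3x3 neighbourhood
result = convolve2d(image, blur)
print(result.shape)  # (3, 3)
print(result[0, 0])  # approximately 6.0, the mean of the top-left 3x3 patch
```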


1. Edge detection filter:

A common filter for edge detection is the Sobel filter. It emphasizes changes in intensity, highlighting edges within an image. Here are examples of horizontal and vertical Sobel filters:

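Horizontal filter (one common form; sign conventions vary):

    -1  -2  -1
     0   0   0
     1   2   1

Vertical filter (one common form; sign conventions vary):

    -1   0   1
    -2   0   2
    -1   0   1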

These matrices, when convolved with an image, highlight horizontal and vertical edges, respectively.

2. Center Detection Filter:

This matrix enhances the intensity at the center while suppressing surrounding areas, making it suitable for detecting central regions of objects.

These matrix representations convey the weights assigned to each pixel during the convolution operation, and their application helps emphasize specific features in the input image.
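To make the edge detection filter concrete, the sketch below applies one common form of the 3 × 3 Sobel kernels to a synthetic image that contains a single vertical edge. The helper `filter2d` and the test image are assumptions for this example, and the sign convention of the kernels varies between references.

```python
import numpy as np

# One common form of the 3x3 Sobel kernels (sign conventions vary).
sobel_horizontal = np.array([[-1, -2, -1],
                             [ 0,  0,  0],
                             [ 1,  2,  1]], dtype=np.float64)  # responds to horizontal edges
sobel_vertical = sobel_horizontal.T                            # responds to vertical edges

def filter2d(image, kernel):
    """Apply `kernel` at every position where it fits entirely inside `image`."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

# Synthetic 5x5 image: dark on the left, bright on the right (one vertical edge).
image = np.zeros((5, 5))
image[:, 3:] = 1.0

vertical_response = filter2d(image, sobel_vertical)
horizontal_response = filter2d(image, sobel_horizontal)
print(np.abs(vertical_response).max())    # 4.0: strong response at the vertical edge
print(np.abs(horizontal_response).max())  # 0.0: no horizontal edge in this image
```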

Reference: visualization of Convolution

Edge Handling:

Edge handling in convolution refers to how the convolution operation is applied to pixels at the boundaries of an image. When a convolutional filter is centered on a pixel near the border, part of the filter window extends beyond the image. Proper handling of these edges is crucial to avoid artifacts and ensure accurate feature extraction.

There are different approaches to edge handling:

1. Valid Padding (No Padding):

In this approach, the convolution is only applied to positions where the filter entirely overlaps with the image. The output size is smaller than the input image. Pixels near the edges are not included in the convolution if the filter extends beyond the image boundary.

2. Same Padding:

Same padding is used to ensure that the output size is the same as the input size. Padding involves adding extra pixels (usually zeros) around the input image so that the filter can fully overlap with all pixels, even at the edges.

3. Zero Padding:

Zero padding describes what the added pixels contain rather than how large the output is: the border around the input image is filled with zeros. It is the most common way to implement same padding, and it reduces information loss at the edges because the filter can be centered on pixels at the image boundary.
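The effect of these padding choices on output size can be checked with a short sketch. It assumes a stride of one and an odd kernel size; the helper name `output_size` is made up for the example.

```python
import numpy as np

def output_size(input_size, kernel_size, padding):
    """Output length along one axis for a stride-1 convolution."""
    return input_size + 2 * padding - kernel_size + 1

image = np.ones((6, 6))
kernel_size = 3

# Valid padding: no extra pixels, so the output shrinks.
valid_out = output_size(image.shape[0], kernel_size, padding=0)

# Same padding: for an odd kernel, add (kernel_size - 1) // 2 zeros per side.
pad = (kernel_size - 1) // 2
padded = np.pad(image, pad_width=pad, mode="constant", constant_values=0)
same_out = output_size(image.shape[0], kernel_size, padding=pad)

print(valid_out)     # 4
print(padded.shape)  # (8, 8)
print(same_out)      # 6
```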

The computer interprets grayscale images as 2D arrays with height and width, while color images are treated as 3D arrays incorporating height, width, and depth. When applying a filter to a color image, the process involves moving the filter both horizontally and vertically across the image.

In this context, the filter itself is three-dimensional, accounting for values in each color channel (red, green, and blue) at every horizontal and vertical position within the image array. This multidimensional approach allows for a comprehensive analysis of color variations and relationships across different channels.
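A minimal sketch of a three-dimensional filter applied to a color image follows. The image and filter values are random stand-ins, and no padding is used; the point is only that the filter spans all channels and produces a single 2D feature map.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((5, 5, 3))   # height, width, RGB channels (random stand-in)
kernel = rng.random((3, 3, 3))  # 3x3 spatial window, one slice per channel

out_h = image.shape[0] - kernel.shape[0] + 1
out_w = image.shape[1] - kernel.shape[1] + 1
feature_map = np.zeros((out_h, out_w))

for y in range(out_h):
    for x in range(out_w):
        # The patch spans all three channels; the sum collapses them to one value.
        patch = image[y:y + 3, x:x + 3, :]
        feature_map[y, x] = np.sum(patch * kernel)

print(feature_map.shape)  # (3, 3): one 2D feature map per filter
```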


The stride of a convolutional layer is the step size, in pixels, by which the filter slides over the input image. With a stride of one the filter moves one pixel at a time, both horizontally and vertically, so the output has roughly the same width and height as the input image. With a stride of two the filter moves two pixels at a time, so the output of the convolutional layer is approximately half the width and height of the input image.

Adjusting the stride influences the spatial dimensions of the output volume and, consequently, impacts the network’s ability to capture features at different scales. Stride is a crucial parameter that modulates the balance between spatial resolution and computational efficiency in the convolutional layers of neural networks.
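The relationship between stride and output size follows the standard formula floor((n + 2p - k) / s) + 1 for input size n, kernel size k, padding p, and stride s. The sketch below evaluates it for a hypothetical 32-pixel-wide input; the function name is an assumption for this example.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length along one axis of a convolution."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

for stride in (1, 2):
    size = conv_output_size(32, kernel_size=3, stride=stride, padding=1)
    print(stride, size)  # stride 1 gives 32, stride 2 gives 16
```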

Reference: Padding and stride

Pooling layers

Pooling layers operate on the output of convolutional layers in a Convolutional Neural Network (CNN). In the case of a complex dataset with diverse object categories, using a large number of filters in convolutional layers may lead to high dimensionality, potentially causing overfitting. Pooling layers play a vital role in addressing this issue by reducing the dimensionality of the data.

There are two main types of pooling layers:

1. Max Pooling Layer:

Max pooling takes a set of feature maps as input, along with specified window size and stride. The value of each node in the max pooling layer is determined by taking the maximum pixel value within the defined window. This process is repeated for all feature maps, resulting in a stack of feature maps with reduced width and height.

2. Average Pooling Layer:

In addition to max pooling, there is an alternative technique called average pooling. Similar to max pooling, it operates on a set of feature maps, utilizing the window size and stride. However, instead of taking the maximum pixel value, average pooling calculates the average value within the window. Like max pooling, this process results in a stack of feature maps with reduced dimensions.

Pooling layers contribute to the efficiency of CNNs by down-sampling the spatial dimensions, retaining important features while reducing computational complexity. This dimensionality reduction aids in preventing overfitting and improving the network’s ability to generalize to new data.
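Both pooling types can be sketched with one small function. The helper `pool2d` and the 4 × 4 feature map are made up for this example; a window of 2 and a stride of 2 are common defaults but are assumptions here.

```python
import numpy as np

def pool2d(feature_map, window=2, stride=2, mode="max"):
    """Down-sample a 2D feature map with max or average pooling."""
    reducer = np.max if mode == "max" else np.mean
    out_h = (feature_map.shape[0] - window) // stride + 1
    out_w = (feature_map.shape[1] - window) // stride + 1
    out = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            patch = feature_map[y * stride:y * stride + window,
                                x * stride:x * stride + window]
            out[y, x] = reducer(patch)
    return out

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 9, 0],
                 [1, 4, 3, 8]], dtype=np.float64)

max_pooled = pool2d(fmap, mode="max")
avg_pooled = pool2d(fmap, mode="average")
print(max_pooled)  # values: [[6, 4], [7, 9]]
print(avg_pooled)  # values: [[3.75, 2.25], [3.5, 5.0]]
```

The 4 × 4 input shrinks to 2 × 2 in both cases; only the rule for summarizing each window differs.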


A typical CNN chains the following layers in sequence:

1. Convolutional Layer:

The input image passes through a convolutional layer, where filters are applied to capture spatial information. These filters highlight specific features in the image, making the array deeper as it progresses through the network.

2. Activation Function (ReLU):

The convolutional layer output is passed through an activation function, commonly ReLU (Rectified Linear Unit). ReLU introduces non-linearity by setting negative values to zero and passing positive values through unchanged. This non-linearity is what lets stacked layers learn complex patterns rather than a single linear transformation, enhancing the network’s capability to capture diverse features.

3. Max-Pooling Layer:

Following activation, the output undergoes max-pooling. Max-pooling reduces the spatial dimensions (width and height) of the feature maps, emphasizing important information while discarding less relevant details. This aids computational efficiency and helps reduce overfitting.

4. Fully Connected Neural Network (FCNN):

After the data has passed through the convolutional and max-pooling layers, it is flattened into a vector and passed into a Fully Connected Neural Network (FCNN). The FCNN is responsible for making final decisions or predictions based on the hierarchical features learned by the earlier layers.

In summary, this sequence of operations (convolution, activation, max-pooling, and FCNN) allows the CNN to progressively extract and understand features in a hierarchical manner, making it effective for tasks such as image classification. Each layer contributes to the network’s ability to recognize and generalize complex patterns from the input data.
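The four steps can be strung together in a minimal NumPy forward pass. This is a sketch under several assumptions: the input is a random 8 × 8 grayscale stand-in, the filters and dense weights are random rather than trained, and the helper names are invented for the example. It shows only how the shapes flow through the network, not how a real model is trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(image, kernel):
    """Stride-1, unpadded sliding-window filter (no kernel flip, as in CNNs)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

def relu(x):
    """Set negative values to zero, keep positive values unchanged."""
    return np.maximum(x, 0.0)

def max_pool(x, window=2):
    """Non-overlapping max pooling with stride equal to the window size."""
    h, w = x.shape[0] // window, x.shape[1] // window
    trimmed = x[:h * window, :w * window]
    return trimmed.reshape(h, window, w, window).max(axis=(1, 3))

image = rng.random((8, 8))                      # stand-in grayscale input
kernels = rng.standard_normal((4, 3, 3))        # 4 untrained 3x3 filters
weights = rng.standard_normal((4 * 3 * 3, 10))  # dense layer: 36 features -> 10 classes

# Steps 1-3: convolution, ReLU, and max-pooling for each filter.
feature_maps = np.stack([max_pool(relu(conv2d(image, k))) for k in kernels])

# Step 4: flatten and apply the fully connected layer.
flat = feature_maps.reshape(-1)
logits = flat @ weights

print(feature_maps.shape)  # (4, 3, 3): 8x8 -> 6x6 after conv -> 3x3 after pooling
print(logits.shape)        # (10,): one score per hypothetical class
```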



See you all in the next article. Happy learning!