Let’s discuss how these layers work to summarize and extract

meaning from pixel data.

Convolutional layers use filters to scan over

local regions in the input space, which are just

collections of pixels near one another,

and capture information about features in those regions.

Each filter is a small square matrix,

usually much smaller than the actual image, such as 3x3 or 5x5,

with learnable parameters.

The filter is tiled across the input space,

calculating the cross-correlation

between the filter and a small region of the input space

at each tiling.

These values are then used to create a small representation

of the input space.

Each time the filter is applied, a single number is generated,

so the filter shrinks the input space.

When we have three color channels in the input data,

the filter needs a matching depth of three,

one slice of weights per channel.

So instead of using a 3x3 or 5x5 filter,

we would end up with a 3x3x3 or 5x5x3 filter.
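
As a sketch with made-up values (plain Python lists, none of these numbers are learned weights), a 3x3x3 filter applied at one position still produces a single number, because the per-channel products are all summed together:

```python
# One 3x3 region of the input at a single position,
# with one 3x3 slice per color channel (R, G, B).
region = [[[1, 0, 2], [0, 1, 1], [2, 2, 0]],   # R
          [[0, 1, 0], [1, 0, 2], [0, 1, 1]],   # G
          [[2, 0, 1], [0, 2, 0], [1, 0, 2]]]   # B

# A 3x3x3 filter: one 3x3 slice of weights per channel.
filt = [[[1, 0, 0], [0, 1, 0], [0, 0, 1]],
        [[0, 0, 1], [0, 1, 0], [1, 0, 0]],
        [[1, 0, 1], [0, 1, 0], [1, 0, 1]]]

# Multiply corresponding cells across all three channels
# and sum everything into a single number.
value = sum(
    region[ch][i][j] * filt[ch][i][j]
    for ch in range(3)
    for i in range(3)
    for j in range(3)
)
print(value)  # → 10
```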

In the following discussion, we focus

on explaining how the convolutional layer processes

data from one channel, but this process

occurs for each channel.

The tiling of the filter across the input space

creates the feature map, computed

by calculating the cross-correlation

between the filter and a same-sized region in the input

space.

This cross-correlation is simply the dot product

between the filter and that region of the input space.

We just take products between corresponding cells

and then sum the results.

The numbers in the filter are the learned parameters

that we train, and a bias is included

as another learned parameter.
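
With made-up numbers, the dot-product-plus-bias calculation at a single filter position looks like this (a sketch; the weights and bias here are arbitrary, not trained values):

```python
# Cross-correlation of a 3x3 filter with one 3x3 region:
# multiply corresponding cells, sum, then add the bias.
region = [[1, 2, 0],
          [0, 1, 3],
          [2, 0, 1]]
filt = [[1, 0, -1],
        [0, 1, 0],
        [-1, 0, 1]]
bias = 0.5

value = bias + sum(
    region[i][j] * filt[i][j]
    for i in range(3)
    for j in range(3)
)
print(value)  # → 1.5
```

This single number is what gets written into the feature map at the position corresponding to where the filter sits on the image.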

The number computed from this calculation

is then saved in the feature map in the location corresponding

to where the filter is on the original image.

The stride is the number of steps

we take when we move the filter to a new location.

In this case, a stride of 1 means

we move the filter one pixel to the right on the input space,

generating a value for the feature map at each position.

This process is repeated, moving the filter all over the input

space to generate the feature map.
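
A minimal sketch of this sliding process in plain Python (a hypothetical helper; square inputs and square filters are assumed):

```python
def feature_map(image, filt, bias=0.0, stride=1):
    """Slide `filt` across `image`, storing one
    cross-correlation value per position; the stride
    controls how far the filter moves each step."""
    f = len(filt)
    n = len(image)
    out_dim = (n - f) // stride + 1
    out = []
    for r in range(out_dim):
        row = []
        for c in range(out_dim):
            i, j = r * stride, c * stride
            acc = bias
            for a in range(f):
                for b in range(f):
                    acc += image[i + a][j + b] * filt[a][b]
            row.append(acc)
        out.append(row)
    return out
```

Calling it on a 4x4 image of ones with a 3x3 filter of ones returns a 2x2 feature map whose entries are all 9, since each position sums nine products of 1.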

The stride is used to determine both the horizontal

and vertical movement of the filter.

A stride of 2 would mean moving the filter two pixels

and would create a smaller feature map than a stride of 1.

In this case, a stride of 2 would be invalid

because at the end, there would be an extra column in the input

space ignored by the filter.

The size of the input space can be slightly modified

by using padding dimensions, which

are rows and columns of zeros added to the edge of the input

space.

The filter can then be tiled on the edges,

multiplying some of its weights by the zeros in the padding dimensions.

The idea is to include more information in the feature

map about the boundary than would be preserved

without the padding dimensions.

Adding zeros means that the values in the feature map won’t

be affected by the values in the padding dimension,

but calculations can be performed with the filter

on the edge of the input space a greater number of times.
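
One way to sketch the padding step (a hypothetical helper, assuming a square input with p rows and columns of zeros added on every side):

```python
def pad(image, p):
    """Surround `image` with p rows and columns of zeros
    so the filter can be tiled over the boundary pixels."""
    n = len(image)
    width = n + 2 * p
    padded = [[0] * width for _ in range(width)]
    for i in range(n):
        for j in range(n):
            padded[i + p][j + p] = image[i][j]
    return padded
```

Because the added cells are zeros, they contribute nothing to any dot product; they only give the filter more valid positions along the edges.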

The choices of filter dimension, padding dimension, and stride

depend on both the size of the input dimension

and the desired size of the feature map, which

must have an integer dimension.

Thus, the formula relating the feature map dimension

to the input dimension, padding dimension, filter dimension,

and stride is really a constraint

that says the function of those hyperparameters

must be an integer.
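
The standard form of that constraint is (n + 2p - f) / s + 1, where n is the input dimension, f the filter dimension, p the padding, and s the stride; the division must come out even. A sketch that checks it:

```python
def output_dim(n, f, p=0, s=1):
    """Feature-map dimension for input n, filter f, padding p,
    stride s; (n + 2p - f) must divide evenly by s."""
    size, rem = divmod(n + 2 * p - f, s)
    if rem != 0:
        raise ValueError("stride leaves part of the input uncovered")
    return size + 1
```

For example, output_dim(6, 3) returns 4, matching the 6x6 example, while output_dim(6, 3, s=2) raises an error because the stride-2 tiling would skip a column of the input.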

The selection of the stride and the filter dimension

thus depends on the size we want for the feature map.

A larger feature map encodes more information

from the original image.

In the example we went through, we take a 6x6 input space

and, using a 3x3 filter with a stride of 1,

we create a 4x4 feature map, which

is more than a 50% decrease in overall data size.
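
Checking that figure: going from 36 input cells to 16 feature-map cells is a reduction of just over half.

```python
input_cells = 6 * 6   # 6x6 input space
map_cells = 4 * 4     # 4x4 feature map from a 3x3 filter, stride 1
reduction = 1 - map_cells / input_cells
print(f"{reduction:.1%}")  # → 55.6%
```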