Introduction

This semester's course has come to an end. For the course project, I chose the cats vs. dogs task: classifying images of cats and dogs using the CNN architecture learned in the course.

More specifically, given an image $X$, we use a trained CNN $f$ to compute a prediction $y = f(X)$. This is a binary classification task, so the label is 0 for cat and 1 for dog.

The images provided by Kaggle can be quite difficult: cluttered backgrounds and other distracting visual elements can confuse a discriminative classifier, and some cat and dog faces look quite similar. A CNN is therefore used to learn features directly from the images and to classify based on those features (typically by adding an MLP on top of the last convolution-pooling layer). CNNs are well suited to image classification because local connectivity dramatically reduces the number of parameters, making the network easier to train, while weight sharing applies the same filters across the whole image to extract many feature maps that help discriminate between the classes.
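As a concrete illustration of how such a stack of convolution and pooling layers feeds an MLP, the short Python sketch below traces a 128*128 input through the 3-layer configuration used in the first experiments and computes the size of the feature vector handed to the MLP. It assumes 'valid' convolutions (no padding) and non-overlapping 2*2 max pooling; the actual Blocks bricks may be configured differently.

# Minimal sketch: trace feature-map sizes through conv/pool layers,
# assuming 'valid' convolutions and non-overlapping pooling.
def output_size(image_shape, filter_sizes, pooling_sizes):
    h, w = image_shape
    for (fh, fw), (ph, pw) in zip(filter_sizes, pooling_sizes):
        h, w = h - fh + 1, w - fw + 1   # valid convolution
        h, w = h // ph, w // pw         # max pooling
    return h, w

h, w = output_size((128, 128), [(5, 5)] * 3, [(2, 2)] * 3)
print(h, w)         # 12 x 12 spatial maps after the third conv-pool layer
print(80 * h * w)   # 11520 features fed into the MLP (80 feature maps)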

A typical CNN architecture is shown in Fig. 1.

Fig. 1 A Typical CNN Architecture

Experiments

Experiment 1.0

As a first try, I used a 3-layer CNN; the configuration is as follows:

num_epochs= 100
image_shape = (128,128)
filter_sizes = [(5,5),(5,5),(5,5)]
feature_maps = [20,50,80]
pooling_sizes = [(2,2),(2,2),(2,2)]
mlp_hiddens = [1000]
output_size = 2
weights_init=Uniform(width=0.2)
step_rule=Scale(learning_rate=0.1)
max_image_dim_limit_method = RandomFixedSizeCrop
dataset_processing = rescale to 128*128

Here I used RandomFixedSizeCrop to crop each image to a fixed 128*128 pixels, and the Scale parameter update rule (learning rule) from the Blocks package, which simply scales the gradient by the learning rate. Used with GradientDescent alone, this amounts to plain steepest descent, which might not be a good choice.
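To make the update rule explicit, here is a minimal sketch of what Scale combined with GradientDescent amounts to; the parameter and gradient lists stand in for whatever Theano computes in the actual training loop.

def steepest_descent_step(params, grads, learning_rate=0.1):
    # Scale multiplies each gradient by the learning rate, so every
    # parameter takes a step proportional to its current gradient.
    return [p - learning_rate * g for p, g in zip(params, grads)]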
After 100 epochs, the training error = 26.31%, validation error = 25.77%

Experiment 1.1

A second try, still with a 3-convolutional-layer network. A random fixed-size crop does not seem to be a good way to bring images down to a fixed size (e.g. 128*128), so, inspired by Florian's blog, I used a modified version of MinImageDimension, turned into a MaxImageDimension transformer, to cap the image size (a rough sketch of the idea follows the configuration below). I also decreased the learning rate to 0.05. The hyperparameter settings are as follows:

num_epochs= 100 early stopped
image_shape = (128,128)
filter_sizes = [(5,5),(5,5),(5,5)]
feature_maps = [20,50,80]
pooling_sizes = [(2,2),(2,2),(2,2)]
mlp_hiddens = [1000]
output_size = 2
weights_init=Uniform(width=0.2)
step_rule=Scale(learning_rate=0.05)
max_image_dim_limit_method= MaximumImageDimension
dataset_processing = rescale to 128*128
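As a rough sketch of what the MaximumImageDimension preprocessing is meant to do (an assumption about the intended behaviour, not the actual Fuel transformer code): downscale any image whose largest side exceeds a limit, preserving the aspect ratio, before the final rescale to 128*128.

from PIL import Image

def cap_maximum_dimension(img, max_dim=128):
    # Hypothetical stand-in for the modified MinImageDimension transformer:
    # shrink the image so its larger side is at most max_dim, keeping the
    # aspect ratio intact.
    w, h = img.size
    scale = max_dim / float(max(w, h))
    if scale < 1.0:
        img = img.resize((int(w * scale), int(h * scale)), Image.BILINEAR)
    return img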

cats and dogs 1.1 result

From the training curves we can observe that overfitting did occur, so I stopped training at epoch 27 because the validation error no longer decreased. The next step is to add some form of regularization.

Experiment 1.2

In this experiment, before adding regularization, I changed the learning rule from the simple Scale method to Adam(), which is implemented in the Blocks package and follows Kingma et al. Adam adapts the effective step size of each parameter using running estimates of the first and second moments of its gradient, which usually behaves better than a fixed learning rate. The other hyperparameters are unchanged from 1.1. Overfitting still occurred, as shown in the training curve below, so some regularization is needed.
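For reference, a minimal numpy sketch of the Adam update as described by Kingma et al.; the constants shown are the paper's defaults, which are not necessarily those used by the Blocks implementation.

import numpy as np

def adam_step(p, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # Running estimates of the first and second moments of the gradient.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    # Bias-correct the estimates, then take a per-parameter scaled step.
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    p = p - lr * m_hat / (np.sqrt(v_hat) + eps)
    return p, m, v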

cats and dogs 1.2 result

Experiment 1.3

As we can observe from the learning curve of experiment 1.2, after about 700 epochs overfitting had already set in: the validation error stagnates at around 27% while the training error keeps decreasing to about 5%, which is of no practical use.

There are many ways to regularize against this overfitting. In experiment 1.3 I used the Random2DRotation method to rotate each training image by a random angle while keeping its label untouched, so the model sees cat and dog images at many different orientations. As the learning curve of experiment 1.3 shows, this does help reduce overfitting: the training and validation error curves now stay close together, and the validation error after 100 epochs is 19.4%.
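A minimal sketch of this kind of augmentation, assuming PIL-style images; the real Random2DRotation transformer operates on the Fuel data stream, so this only illustrates the idea.

import random
from PIL import Image

def random_rotation(img, label, max_degrees=180):
    # Rotate the image by a random angle; the label stays untouched,
    # since a rotated cat is still a cat.
    angle = random.uniform(-max_degrees, max_degrees)
    return img.rotate(angle, resample=Image.BILINEAR), label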

The configurations are as follows:

num_epochs= 100 
image_shape = (128,128)
filter_sizes = [(5,5),(5,5),(5,5)]
feature_maps = [20,50,80]
pooling_sizes = [(2,2),(2,2),(2,2)]
mlp_hiddens = [1000]
output_size = 2
weights_init=Uniform(width=0.2)
step_rule=Adam()
max_image_dim_limit_method= MaximumImageDimension
dataset_processing = rescale to 128*128

cats and dogs 1 result

Experiment 2.01

In order to extract more abstract features to help classification, I tried some deeper architectures. In experiment 2.01 I used a 5-layer CNN; the configuration is shown below.

num_epochs= 100 
image_shape = (256,256)
filter_sizes = [(5,5),(5,5),(5,5),(5,5),(5,5)]
feature_maps = [20,40,60,80,100]
pooling_sizes = [(2,2),(2,2),(2,2),(2,2),(2,2)]
mlp_hiddens = [1000]
output_size = 2
weights_init=Uniform(width=0.2)
step_rule=Adam()
max_image_dim_limit_method= MaximumImageDimension
dataset_processing = rescale to 256*256

To counter overfitting I again used the Random2DRotation method to rotate each training image by a random angle while keeping its label untouched. As the learning curve of experiment 2.01 shows, this again helps reduce overfitting: the training and validation error curves stay close together, and the validation error after 100 epochs is 19.6%.

cats and dogs 2 result

Experiment 2.1

In experiment 2.1 I tried a 6-layer CNN, but I ran into the same strange behaviour Florian reported: using Adam() as the update rule, the training and validation errors suddenly diverged during training and completely ruined the run.

The configurations are as follows:

num_epochs= 120 
image_shape = (256,256)
filter_sizes = [(5,5),(5,5),(5,5),(5,5),(5,5),(5,5)]
feature_maps = [20,40,60,80,100, 120]
pooling_sizes = [(2,2),(2,2),(2,2)]
mlp_hiddens = [1000]
output_size = 2
weights_init=Uniform(width=0.2)
step_rule=Adam()
max_image_dim_limit_method= MaximumImageDimension
dataset_processing = rescale to 256*256

cats and dogs 2 result

After some analysis, I believe this was caused by the initial setting of the learning rate. As we learned in the course, if the learning rate is too small, learning is very slow, but if it is set too large, learning diverges.
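A tiny illustration of this on a one-dimensional quadratic cost, where the stability threshold is known exactly: gradient descent on f(w) = a*w^2/2 converges only when the learning rate is below 2/a, with a playing the role of the largest Hessian eigenvalue. This is a toy example, not the project's cost function.

def gd_on_quadratic(lr, a=10.0, w=1.0, steps=50):
    # Gradient descent on f(w) = a * w**2 / 2, whose gradient is a * w.
    for _ in range(steps):
        w = w - lr * a * w
    return abs(w)

print(gd_on_quadratic(0.05))   # 0.05 < 2/10: |w| shrinks towards 0 (converges)
print(gd_on_quadratic(0.25))   # 0.25 > 2/10: |w| blows up (diverges)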

Experiment 2.2

In experiment 2.2 I tried reducing the filter sizes of the last three layers to (4,4), keeping the other configuration unchanged.

num_epochs= 61 
batch_size=64
image_shape = (256,256)
filter_sizes = [(5,5),(5,5),(5,5),(5,5),(5,5),(5,5)]
feature_maps = [20,40,60,80,100,120]
pooling_sizes = [(2,2),(2,2),(2,2),(2,2),(2,2),(2,2)]
mlp_hiddens = [1000]
output_size = 2
weights_init=Uniform(width=0.2)
step_rule=Adam(learning_rate=0.0005)
max_image_dim_limit_method= MaximumImageDimension
dataset_processing = rescale to 256*256

The initial learning rate is very important. If it is set somewhat large, the optimization oscillates; if it is set too large, the process diverges; the stability threshold is linked to the inverse of the largest eigenvalue of the Hessian of the cost function. If the learning rate is chosen too small, on the other hand, learning is very slow; and since Adam() tends to shrink the effective step size as training progresses, the steps near the end of training can become very small and learning no longer makes effective progress.

The initial learning rate also matters because at the beginning we want the cost to drop as fast as possible, so it should not be set too conservatively small. Through trial and error I concluded that 0.004 was too large and the default 0.0002 too small, so this time I tried 0.0005, which seemed to work better.
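To recap the trial and error in code form, here is a small summary of the learning rates tried for this 6-layer CNN with Adam() and the outcomes reported in these experiments (the labels are just the observations above, not new results):

# Learning rates tried with Adam() for the 6-layer CNN, and the outcome
# observed on the validation curve in the experiments above.
trials = {
    0.004:  'too large',
    0.0002: 'the default; too small, learning very slow',
    0.0005: 'worked better: validation error 8.56% after 61 epochs',
}
for lr, outcome in sorted(trials.items()):
    print(lr, '->', outcome)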

Validation error=8.56% after 61 epochs

cats and dogs 2 result

Experiment 2.3

As noted in experiment 2.2, a learning rate of 0.0005 seems to work well for this 6-layer architecture. In this experiment I added more feature maps, giving the last convolutional layer 512 feature maps, hoping this would improve the discriminative power of the model.
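Going to 512 maps is not free in terms of parameters: a quick back-of-the-envelope count for that last convolutional layer (assuming 5*5 filters over the 256 maps of the previous layer, as listed in the configuration below) gives roughly 3.3 million weights in that layer alone.

in_maps, out_maps, fh, fw = 256, 512, 5, 5
weights = out_maps * in_maps * fh * fw   # one 5x5 kernel per (input map, output map) pair
biases = out_maps                        # one bias per output feature map
print(weights + biases)                  # 3277312 parameters in the last conv layer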

num_epochs= 63 
batch_size=64
image_shape = (256,256)
filter_sizes = [(5,5),(5,5),(5,5),(5,5),(5,5),(5,5)]
feature_maps = [20,40,70,140,256, 512]
pooling_sizes = [(2,2),(2,2),(2,2),(2,2),(2,2),(2,2)]
mlp_hiddens = [1000]
output_size = 2
weights_init=Uniform(width=0.2)
step_rule=Adam(learning_rate=0.0005)
max_image_dim_limit_method= MaximumImageDimension
dataset_processing = rescale to 256*256

Training error = 3.33%, Validation error = 8.32% after 63 epochs.

cats and dogs 2 result

Result, Discussion And Conclusion
Based on the validation error, the best model is model 2.3, which has 6 convolutional layers with feature maps [20,40,70,140,256,512]. Its training error is 3.33% and its validation error 8.32%; after uploading its predicted test labels to Kaggle, its test error is 9.896%.

During this project I learned a lot. First, the Theano and Blocks packages provide an efficient way to quickly build deep learning architectures, and I picked up the basic usage of these tools, although I am still learning.

In this project I tried several different hyperparameter settings; since training each network takes a lot of time, the experiments were limited to those described above. From this trial and error I learned several things:

1. A deeper architecture does help performance. Comparing the 3-conv-layer CNN with the 6-conv-layer CNN, the validation error is around 20% for the former and around 10% for the latter when the hyperparameters are properly configured. This makes sense: a deeper CNN typically adds more feature maps at each layer, extracting higher-level, more abstract features that improve the model's performance.

2. Regularization is necessary. Without random rotation of the training images, the training error keeps decreasing while the validation error stagnates at a higher value: the model overfits, memorizing the training images and their labels and fitting its parameters specifically to them, which leads to poor generalization on unseen data. Adding random rotations of the training images acts as a regularizer.

3. The choice of (initial) learning rate is very important. A too-small learning rate makes learning very slow, while a too-large one makes the gradient steps oscillate over the cost surface and eventually diverge; from the course we learned that the stability threshold is linked to the inverse of the largest eigenvalue of the Hessian of the cost function. For the 6-layer CNN in my experiments, after some trial and error, 0.0005 seemed to work well as the initial learning rate for the Adam() optimization algorithm.

I have learned a lot during this course project, but I am still a beginner with neural network models, and due to time and knowledge limitations there are still many things to improve. In the future it would be worthwhile to try other regularization techniques learned in the course, such as batch normalization, and perhaps other data augmentation schemes.

Appendix: Blog Posts

Hi,
As mentioned in blog post 2.2, this time I again used a 6-layer CNN, but with more feature maps at each layer; the last convolutional layer now has 512 feature maps. More feature maps might help the final MLP better discriminate the images.

In experiment 2.3 I still use the Random2DRotation method to rotate each training image by a random angle, keeping the labels untouched, as regularization, so the model sees cat and dog images at many different orientations. Again, I set the initial learning rate for Adam() to 0.0005 rather than using its default value.

As we learned in the course, the initial learning rate is very important: set somewhat large, the optimization oscillates; set too large, it diverges (the threshold is linked to the inverse of the largest eigenvalue of the Hessian of the cost function); set too small, learning is very slow, especially since Adam() tends to shrink the effective step size as training progresses. As in experiment 2.2, I use a learning rate of 0.0005, which seems to work well for this 6-layer architecture.

In this experiment I am using a 6-convolutional-layer CNN architecture.

The configurations are as follows:

num_epochs= 63 (at the time of posting this blog)
batch_size=64
image_shape = (256,256)
filter_sizes = [(5,5),(5,5),(5,5),(5,5),(5,5),(5,5)]
feature_maps = [20,40,70,140,256, 512]
pooling_sizes = [(2,2),(2,2),(2,2),(2,2),(2,2),(2,2)]
mlp_hiddens = [1000]
output_size = 2
weights_init=Uniform(width=0.2)
step_rule=Adam(learning_rate=0.0005)
max_image_dim_limit_method= MaximumImageDimension
dataset_processing = rescale to 256*256

Validation error=8.32% after 63 epochs.

However, at 63 epochs the validation error is only slightly lower than the 8.56% of the previous experiment, which had 120 feature maps at the last layer, so the gain from adding feature maps seems modest. It would be worthwhile to try other regularization techniques or a better optimization method to further improve performance in the next experiment.

cats and dogs 2 result

Hi,
As mentioned in blog post 2.1, this time I again used a 6-layer CNN, trying to reduce the filter sizes of the last 3 layers to (4,4) while keeping the other configuration unchanged.

In experiment 2.2 I still use the Random2DRotation method to rotate each training image by a random angle, keeping the labels untouched, as regularization, so the model sees cat and dog images at many different orientations. This time, however, I set the initial learning rate for Adam() to 0.0005 rather than using its default value.

As we learned in the course, the initial learning rate is very important: set somewhat large, the optimization oscillates; set too large, it diverges (the threshold is linked to the inverse of the largest eigenvalue of the Hessian of the cost function). If the learning rate is chosen too small, learning is very slow; and since Adam() tends to shrink the effective step size as training progresses, the steps near the end of training can become very small and learning no longer makes effective progress.

The initial learning rate also matters because at the beginning we want the cost to drop as fast as possible, so it should not be set too conservatively small. Through trial and error I concluded that 0.004 was too large and the default 0.0002 too small, so this time I tried 0.0005, which seemed to work better.

In this experiment I am using a 6-convolutional-layer CNN architecture.
The configurations are as follows:

num_epochs= 61 (at the time of posting this blog)
batch_size=64
image_shape = (256,256)
filter_sizes = [(5,5),(5,5),(5,5),(5,5),(5,5),(5,5)]
feature_maps = [20,40,60,80,100,120]
pooling_sizes = [(2,2),(2,2),(2,2),(2,2),(2,2),(2,2)]
mlp_hiddens = [1000]
output_size = 2
weights_init=Uniform(width=0.2)
step_rule=Adam(learning_rate=0.0005)
max_image_dim_limit_method= MaximumImageDimension
dataset_processing = rescale to 256*256

Validation error=8.56% after 61 epochs

cats and dogs 2 result

Hi,
As mentioned in blog post 2.01, in experiment 2.1 I tried a 6-layer CNN, but I ran into the same strange behaviour Florian reported: using Adam() as the update rule, the training and validation errors suddenly diverged during training and completely ruined the run.

In experiment 2.1 I still use the Random2DRotation method to rotate each training image by a random angle, keeping the labels untouched, as regularization, so the model sees cat and dog images at many different orientations. Next I will try setting the initial parameters for Adam() myself rather than using its default values.

In this experiment I am using a 6-convolutional-layer CNN architecture.
The configurations are as follows:

num_epochs= 120 
image_shape = (256,256)
filter_sizes = [(5,5),(5,5),(5,5),(5,5),(5,5),(5,5)]
feature_maps = [20,40,60,80,100, 120]
pooling_sizes = [(2,2),(2,2),(2,2)]
mlp_hiddens = [1000]
output_size = 2
weights_init=Uniform(width=0.2)
step_rule=Adam()
max_image_dim_limit_method= MaximumImageDimension
dataset_processing = rescale to 256*256

cats and dogs 2 result

Hi,
As mentioned in blog post 1.3, I wanted to try some deeper architectures, so in experiment 2.01 I used a 5-layer CNN; the configuration is shown below.

To counter overfitting I still use the Random2DRotation method to rotate each training image by a random angle while keeping its label untouched, so the model sees cat and dog images at many different orientations. As the learning curve of experiment 2.01 shows, this does help reduce overfitting: the training and validation error curves stay close together, and the validation error after 100 epochs is 19.6%.

In this experiment I am using a 5-convolutional-layer CNN architecture.
The configurations are as follows:

num_epochs= 100 
image_shape = (256,256)
filter_sizes = [(5,5),(5,5),(5,5),(5,5),(5,5)]
feature_maps = [20,40,60,80,100]
pooling_sizes = [(2,2),(2,2),(2,2),(2,2),(2,2)]
mlp_hiddens = [1000]
output_size = 2
weights_init=Uniform(width=0.2)
step_rule=Adam()
max_image_dim_limit_method= MaximumImageDimension
dataset_processing = rescale to 256*256

cats and dogs 2 result

It seems a little strange that the validation error still stagnates around 19.6%. Should I train for more epochs, or should I change the architecture or the image preprocessing?

As a next step I will try adding a 6th conv layer and more feature maps at the last conv layer, to see whether that provides more information to the final MLP-softmax classification layer. It might also be worthwhile to try other data augmentation schemes as regularization.

Hi,
As mentioned in blog post 1.2, simply limiting the images to 128*128 and training on them as-is does not help much: experiment 1.2 still overfits. From its learning curve, after about 700 epochs overfitting had already set in; the valid_error stagnates around 27% while the training error keeps decreasing to about 5%, which is of no practical use.

There are many ways to regularize against this overfitting. In experiment 1.3 I used the Random2DRotation method to rotate each training image by a random angle while keeping its label untouched, so the model sees cat and dog images at many different orientations. As the learning curve of experiment 1.3 shows, this does help reduce overfitting: the training and validation error curves stay close together, and the validation error after 100 epochs is 19.4%.

In this experiment I am using the same 3-convolutional-layer CNN architecture as in experiments 1, 1.01, 1.1 and 1.2.
The configurations are as follows:

num_epochs= 100 
image_shape = (128,128)
filter_sizes = [(5,5),(5,5),(5,5)]
feature_maps = [20,50,80]
pooling_sizes = [(2,2),(2,2),(2,2)]
mlp_hiddens = [1000]
output_size = 2
weights_init=Uniform(width=0.2)
step_rule=Adam()
max_image_dim_limit_method= MaximumImageDimension
dataset_processing = rescale to 128*128

cats and dogs 1 result

As a next step I will try more complicated architectures (e.g. more layers) with more feature maps at the last conv layer, to see whether that provides more information to the final MLP-softmax classification layer. It might also be worthwhile to try other data augmentation schemes as regularization.

Hi,
As mentioned in blog post 1.1, a fixed learning rate might not be a good choice: if it is chosen too small, learning is slow and may not pull the cost down to a good minimum; if it is set too large, the loss may oscillate and even bounce back up to a higher value, as seen in the figure of experiment 1.01. So Adam() might be a better choice.

In this experiment I am using the same 3-convolutional-layer CNN architecture as in experiments 1 and 1.01.
The configurations are as follows:

num_epochs= 100 early stopped
image_shape = (128,128)
filter_sizes = [(5,5),(5,5),(5,5)]
feature_maps = [20,50,80]
pooling_sizes = [(2,2),(2,2),(2,2)]
mlp_hiddens = [1000]
output_size = 2
weights_init=Uniform(width=0.2)
step_rule=Adam()
max_image_dim_limit_method= MaximumImageDimension
dataset_processing = rescale to 128*128

cats and dogs 1 result

This time we can still observe overfitting, so I stopped training at epoch 21: the validation error no longer decreases and stagnates around 25%, while the training error keeps going down. This architecture setting seems to have its limitations, and data augmentation looks important to avoid overfitting. I will try rotation transformations as regularization.

Hi,
As mentioned in blog posts 1 and 1.01, a random fixed-size crop does not seem to be a good way to bring images down to a fixed size (e.g. 128*128), so, inspired by Florian's blog, I used a modified version of MinImageDimension, turned into a MaxImageDimension transformer, to cap the image size.
In this experiment I am using the same 3-convolutional-layer CNN architecture as in experiments 1 and 1.01.
The configurations are as follows:

num_epochs= 100 early stopped
image_shape = (128,128)
filter_sizes = [(5,5),(5,5),(5,5)]
feature_maps = [20,50,80]
pooling_sizes = [(2,2),(2,2),(2,2)]
mlp_hiddens = [1000]
output_size = 2
weights_init=Uniform(width=0.2)
step_rule=Scale(learning_rate=0.05)
max_image_dim_limit_method= MaximumImageDimension
dataset_processing = rescale to 128*128

cats and dogs 1 result

This time we can observe overfitting, so I stopped training at epoch 27 because the validation error no longer decreases. Next I may try the Adam() update rule rather than a fixed learning rate, and rotate the images in order to reduce overfitting.

I also ran an experiment on the cluster with a different architecture, which I call experiment 2; the method for limiting the maximum image size is still RandomFixedSizeCrop. The configuration is as follows:

num_epochs= 100 early stopped
image_shape = (128,128)
filter_sizes = [(5,5),(5,5),(5,5),(4,4),(4,4),(4,4)]
feature_maps = [20,40,60,80,100,120]
pooling_sizes = [(2,2),(2,2),(2,2),(2,2),(2,2),(2,2)]
mlp_hiddens = [1000]
output_size = 2
weights_init=Uniform(width=0.2)
step_rule=Adam()
max_image_dim_limit_method= RandomFixedSizeCrop
dataset_processing = rescale to 128*128

The result after 100 epochs: training error = 19.57%, validation error = 19.91%.

TRAINING HAS BEEN FINISHED:

Training status:
batch_interrupt_received: False
epoch_interrupt_received: False
epoch_started: False
epochs_done: 100
iterations_done: 31300
received_first_batch: True
resumed_from: None
training_started: True
Log records from the iteration 31300:
saved_to: ('catsVsDogs128.pkl',)
time_read_data_this_epoch: 1.25491786003
time_read_data_total: 453.110922098
time_train_this_epoch: 163.532570601
time_train_total: 16410.0936284
train_cost: 0.408693790436
train_error_rate: 0.195686891675
train_total_gradient_norm: 4.36319255829
training_finish_requested: True
training_finished: True
valid_cost: 0.452142506838
valid_error_rate: 0.199169307947
valid_error_rate2: 0.199169307947

I think I understand why learning is slow and why there is no overfitting in 1.0 and 1.01: I used RandomFixedSizeCrop to limit the maximum image size to 128*128, and the crop can land on an unimportant region of the photo while still being labelled cat/dog, which introduces label noise.
In the next experiments, inspired by Florian's approach, I will try to modify the MinimumImageDimension function so as to limit the maximum image dimension instead.
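A minimal sketch of what a random fixed-size crop does (an illustration of the idea, not the actual Fuel transformer); it is easy to see how a 128*128 window drawn from a large photo can miss the animal entirely.

import random

def random_fixed_size_crop(image, crop=(128, 128)):
    # image: H x W x C numpy array, assumed at least as large as the crop.
    h, w = image.shape[:2]
    ch, cw = crop
    top = random.randint(0, h - ch)
    left = random.randint(0, w - cw)
    return image[top:top + ch, left:left + cw]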

With the same dataset preprocessing, adding more layers does help boost performance, and the last convolutional layer has 120 feature maps, feeding more features to the MLP classifier.

In this experiment I also used Adam() as the learning rule instead of a fixed learning rate, which might also have helped improve performance.

Hi,
I tried the same configuration as in experiment 1 once more, still with random initialization of the weights, just to add the Bokeh plotting, but the results seem worse than in experiment 1.
In this experiment I am using a 3-convolutional-layer CNN.
The configurations are as follows:

num_epochs= 100 
image_shape = (128,128)
filter_sizes = [(5,5),(5,5),(5,5)]
feature_maps = [20,50,80]
pooling_sizes = [(2,2),(2,2),(2,2)]
mlp_hiddens = [1000]
output_size = 2
weights_init=Uniform(width=0.2)
step_rule=Scale(learning_rate=0.1)
max_image_dim_limit_method= random crop
dataset_processing = rescale to 128*128

The result after 100 epochs: training error = 35%, validation error = 35%.

cats and dogs 1 result

I think I understand why learning is slow and why there is no overfitting in 1.0 and 1.01: I used RandomFixedSizeCrop to limit the maximum image size to 128*128, and the crop can land on an unimportant region of the photo while still being labelled cat/dog, which introduces label noise.
In the next experiments, inspired by Florian's approach, I will try to modify the MinimumImageDimension function so as to limit the maximum image dimension instead.