A Layman’s Guide to Building Your First Image Classification Model in R Using Keras

Hands-on Vanilla Modelling Part I

Abhibhav Sharma
Towards Data Science


Image by Author

Applications of machine learning (ML) are now an integral part of everyday life. From speech-recognition-based virtual assistants in our smartphones to intelligent autonomous drones, ML and artificial intelligence (AI) are revolutionizing the dynamics of human-machine interaction. AI algorithms, especially convolutional neural networks (CNNs), have made computer vision more powerful than ever. While the applications are breathtaking, building one's own CNN model can be intimidating, especially for a non-programmer or a beginner in data science, and as an R lover I can say it feels even more enigmatic for a novice R programmer. A plausible reason for this imbalance is that the standard neural network and ML libraries (like Keras and TensorFlow) primarily target Python, which naturally nudges most people toward Python and leaves a shortage of beginner guides and documentation for implementing these frameworks in R. Nevertheless, APIs for Keras and TensorFlow are now available on CRAN.

Herein, we are going to build a CNN-based vanilla image-classification model using the Keras and TensorFlow framework in R. My goal with this article is to help you conceptualize and build your own CNN models in R using Keras and, through hands-on coding, to boost your confidence to build even more complex models with this powerful API in the future. Apart from scripting the model, I will also try to concisely explain the necessary components without plunging into the hardcore underlying mathematics. Now let's start.

A Quick Introduction to Convolutional Neural Networks (CNNs)

Figure 1. The Convolution Neural Network Architecture (Image by author)

As a prelude to scripting the CNN, let's briefly go over its formalism. Just as the human eye registers information only within a confined receptive frame and learns specific patterns and spatial features, which subsequently trigger specific neural responses, a CNN model operates in much the same fashion. We will probe the CNN architecture and understand its key components (Figure 1).

Figure 2. Visualizing Convolution (Image by author)

Instead of ingesting all of the image's pixels at once, a subset of pixels is convolved into a single datum (Figure 2). Convolution is carried out by sliding a smaller frame across the entire image; this frame acts as a receptive field that captures local spatial features and is called a kernel or filter. Imagine scanning a building wall with a flashlight: the filter works in the same fashion. With a variety of filters, you can extract more discriminative spatial features and frequently recurring local patterns. This scheme empowers learning by preserving the spatial characteristics with minimal information loss while also drastically reducing the number of weights required, making image classification practically feasible and scalable. For example, consider a 2D black-and-white (b&w) image as the input to a CNN. The model convolves the image by sliding a filter across it with a given step size, also called the stride. The filter values are analogous to the weights in a neural network. At each position, the linear combination of the filter values and the underlying pixel values generates a single output value, and the set of these values forms a new layer called a feature map. Intuitively, a feature map is a condensed form of the input image that preserves the dominant features and patterns in a smaller dimension for efficient learning.

Figure 3. The Convolution Scheme (Image by author)

Elucidating this in Figure 3, a filter of size 4x4 is applied to a b&w image of dimension 10x10. The filter slides across the image with a stride of 1, generating a convolved layer of dimension (10 - 4 + 1) x (10 - 4 + 1), i.e. 7x7. Note that the bias/intercept for this example is assumed to be 0. Similarly, n filters would generate n feature maps.
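
To make this arithmetic concrete, here is a tiny base-R sketch (purely illustrative, independent of the model code later on) that slides a 4x4 filter over a 10x10 matrix with stride 1 and bias 0, confirming the 7x7 output size:

img  <- matrix(runif(10 * 10), nrow = 10)    # toy 10x10 b&w "image"
filt <- matrix(runif(4 * 4), nrow = 4)       # toy 4x4 filter (the weights)
out  <- nrow(img) - nrow(filt) + 1           # 10 - 4 + 1 = 7
fmap <- matrix(0, out, out)                  # the feature map to fill
for (i in 1:out) {
  for (j in 1:out) {
    patch      <- img[i:(i + 3), j:(j + 3)]  # current receptive field
    fmap[i, j] <- sum(patch * filt)          # linear combination of weights and pixels
  }
}
dim(fmap)                                    # 7 7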

Figure 4. Convolution of a colored image (Image by author)

For a b&w image, the depth is 1, and so is the depth of the filters. A colored image, however, is an ensemble of Red, Green, and Blue (RGB) channels: precisely, a stack of three 2D layers, where each layer represents the intensity of one color channel. So for a colored image of depth 3, you will need a filter of dimension 4x4x3 (Figure 4).

Once the feature maps are generated, an activation function is applied to each of them. For this purpose, the Rectified Linear Unit (ReLU) activation function comes in handy: if the input to ReLU is negative, it is simply transformed to zero; otherwise the input value is returned unchanged (Figure 5). Mathematically, f(x) = max(0, x).
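
In R, ReLU is nothing more than an element-wise maximum with zero; a one-line illustration:

relu <- function(x) pmax(0, x)   # f(x) = max(0, x), applied element-wise
relu(c(-2.5, -0.1, 0, 1.3, 4))   # returns 0.0 0.0 0.0 1.3 4.0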

Figure 5. ReLU and Max Pooling (Image by author)

After the activation layer, another scheme is employed to reduce the dimensions of the feature map without losing vital information. This technique is called max-pooling. Quite similar to the convolution scheme, only the highest value within the pooling window is kept. The window slides in the same fashion as the filter, but with a step size equal to the dimension of the pooling window (Figure 5). Max-pooling significantly reduces the dimension of the feature map while preserving the dominant features. The final max-pooled layer is then flattened into a vector of neurons that connects to a fully connected neural network, which performs the classification task.
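
For intuition, here is a toy base-R sketch of max-pooling a 4x4 feature map with a 2x2 window and a matching stride of 2 (illustration only; the model below uses a 4x4 pooling window):

fmap <- matrix(1:16, nrow = 4, byrow = TRUE)  # toy 4x4 feature map
pool <- matrix(0, 2, 2)                       # pooled output (4 / 2 = 2 per side)
for (i in 1:2) {
  for (j in 1:2) {
    rows       <- ((i - 1) * 2 + 1):(i * 2)   # non-overlapping 2x2 window
    cols       <- ((j - 1) * 2 + 1):(j * 2)
    pool[i, j] <- max(fmap[rows, cols])       # keep only the largest value
  }
}
pool                                          # each entry is a window maximum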

Mind you, a CNN is not limited to a single convolution or a single set of the aforementioned schemes. Backed by a sufficient number of filters, a deeper network can generally achieve higher performance. Intuitively, the initial convolutions capture low-level recurring features such as edges and colors, while the deeper layers tend to capture high-level features such as recurring clusters of pixels, say eyes or noses in a portrait. In conclusion, based on your image complexity and computational budget, you should choose a sufficient and efficient number of layers and filters.

Now let’s jump into the making of our toy model 🤓!!

The Dataset

I believe that for understanding any statistical concept, nothing comes in handier than a deck of playing cards. Here I will exploit a deck of playing cards, but in a slightly unorthodox form. Yes, you guessed it! I will build a prediction model that can accurately classify the suit of an image of any arbitrary non-face playing card.

This dataset contains a set of 43 card images for each suit, viz. Clubs ♣, Hearts ♥️, Diamonds ♦️, and Spades ♠️. Note that the symbols of each suit follow a standard shape, but the design and arrangement of these symbols vary from card to card. We will use this dataset to train and test our model. (To follow the script conveniently, I recommend downloading the images and keeping all the files in a parent folder, arranged as they appear in the GitHub repository.)
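
Once downloaded, a quick way to confirm that the folder layout matches what the later scripts expect (here "C:/parent" is just a placeholder for wherever you saved the folders) is:

list.files("C:/parent")                # should list: club, diamond, heart, spade
length(list.files("C:/parent/spade"))  # should be 43 images per suit folder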

Required Packages and Installation

For this task, you will need the keras and EBImage packages. The former is available on CRAN. The latter is used to handle images efficiently and is available from Bioconductor, an open-source bioinformatics repository. These can be installed on Windows as:

install.packages("keras")  # Install the package from CRAN
library(keras)
install_keras()            # Set up the Keras library and the TensorFlow backend
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("EBImage")
library(EBImage)

This installs the CPU version of the Keras API, which is recommended for beginners.

Note: The Keras API requires Python support and the Rtools toolchain, so make sure you have Anaconda and Rtools installed on your machine and properly added to your system PATH.
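
An optional sanity check once the installation steps above have finished:

library(keras)
is_keras_available()  # should return TRUE once the TensorFlow backend is reachable
library(EBImage)      # should load without errors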

Exploring the Dataset

Next, we will convert the images of each suit into tensors (numeric arrays). The readImage() function from the EBImage package does this robustly. Let's try reading an image from our dataset.

setwd("C:/parent/spade")  # To access the images of the Spades suit.
                          # The path should be modified as per your machine.
card <- readImage("ace_of_spades (2).png")  # Reading an image from the dataset
print(card)                                 # Print the details of the image

This gives the output as:

Image 
colorMode : Color
storage.mode : double
dim : 500 726 4
frames.total : 4
frames.render: 1
imageData(object)[1:5,1:6,1]
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 1 1 1 1 1
[2,] 1 1 1 1 1 1
[3,] 1 1 1 1 1 1
[4,] 1 1 1 1 1 1
[5,] 1 1 1 1 1 1

This indicates that the image is colored, with dimension 500 x 726 x 4. Notice that, as discussed earlier, the image depth here is 4, so we will need a filter of depth 4. To unmask these four matrices we use:

getFrames(card, type = "total")

This will give the details of the four channels separately.

[[1]]
Image
colorMode : Grayscale
storage.mode : double
dim : 500 726
frames.total : 1
frames.render: 1
imageData(object)[1:5,1:6]
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 1 1 1 1 1
[2,] 1 1 1 1 1 1
[3,] 1 1 1 1 1 1
[4,] 1 1 1 1 1 1
[5,] 1 1 1 1 1 1
[[2]]
Image
colorMode : Grayscale
storage.mode : double
dim : 500 726
frames.total : 1
frames.render: 1
imageData(object)[1:5,1:6]
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 1 1 1 1 1
[2,] 1 1 1 1 1 1
[3,] 1 1 1 1 1 1
[4,] 1 1 1 1 1 1
[5,] 1 1 1 1 1 1
[[3]]
Image
colorMode : Grayscale
storage.mode : double
dim : 500 726
frames.total : 1
frames.render: 1
imageData(object)[1:5,1:6]
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 1 1 1 1 1
[2,] 1 1 1 1 1 1
[3,] 1 1 1 1 1 1
[4,] 1 1 1 1 1 1
[5,] 1 1 1 1 1 1
[[4]]
Image
colorMode : Grayscale
storage.mode : double
dim : 500 726
frames.total : 1
frames.render: 1
imageData(object)[1:5,1:6]
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0 0 0 0 0 0
[2,] 0 0 0 0 0 0
[3,] 0 0 0 0 0 0
[4,] 0 0 0 0 0 0
[5,] 0 0 0 0 0 0

And oh yeah! To display our selected card we do:

display(card)
Figure 6. The selected card displayed in the R viewer (Image by author)

Although the depth is the same for every card in the dataset, the pixel dimensions vary, so we have to resize each card to 100x100x4 and combine them all into a stack that is compatible with Keras. This stack then goes directly into the architecture as input.

setwd("C:/parent/club")     # To access the images of the Clubs suit.
                            # Modify the path as per your machine.
img.card <- sample(dir())   #-------shuffle the order
cards <- list(NULL)
for (i in 1:length(img.card)) {
  cards[[i]] <- readImage(img.card[i])
  cards[[i]] <- resize(cards[[i]], 100, 100)  # resizing to 100x100
}
club <- cards               # Stack of the Clubs cards, as matrices in a list
#-----------------------------------------------------------
setwd("C:/parent/heart")    # To access the images of the Hearts suit.
img.card <- sample(dir())
cards <- list(NULL)
for (i in 1:length(img.card)) {
  cards[[i]] <- readImage(img.card[i])
  cards[[i]] <- resize(cards[[i]], 100, 100)  # resizing to 100x100
}
heart <- cards              # Stack of the Hearts cards, as matrices in a list
#-----------------------------------------------------------
setwd("C:/parent/spade")    # To access the images of the Spades suit.
img.card <- sample(dir())
cards <- list(NULL)
for (i in 1:length(img.card)) {
  cards[[i]] <- readImage(img.card[i])
  cards[[i]] <- resize(cards[[i]], 100, 100)  # resizing to 100x100
}
spade <- cards              # Stack of the Spades cards, as matrices in a list
#-----------------------------------------------------------
setwd("C:/parent/diamond")  # To access the images of the Diamonds suit.
img.card <- sample(dir())
cards <- list(NULL)
for (i in 1:length(img.card)) {
  cards[[i]] <- readImage(img.card[i])
  cards[[i]] <- resize(cards[[i]], 100, 100)  # resizing to 100x100
}
diamond <- cards            # Stack of the Diamonds cards, as matrices in a list
#-----------------------------------------------------------
train_pool <- c(club[1:40],
                heart[1:40],
                spade[1:40],
                diamond[1:40])                     # The first 40 images of each suit
train <- aperm(combine(train_pool), c(4, 1, 2, 3)) # Combined and stacked

test_pool <- c(club[41:43],
               heart[41:43],
               spade[41:43],
               diamond[41:43])                     # The last 3 images of each suit
test <- aperm(combine(test_pool), c(4, 1, 2, 3))   # Combined and stacked
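
As an optional sanity check, the stacked arrays should now hold 40 x 4 = 160 training images and 3 x 4 = 12 test images, each 100x100 with 4 channels:

dim(train)  # expected: 160 100 100 4
dim(test)   # expected:  12 100 100 4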

To see what images are included in the test set, we do this:

par(mfrow=c(3,4)) # To contain all images in single frame
for(i in 1:12){
plot(test_pool[[i]])
}
par(mfrow=c(1,1)) # Reset the default
Figure 8. Test Set (Image by author)

I got the cards shown in Figure 8; since the file order was shuffled with sample(), you may get a different set of cards.

One-hot encoding is needed to create the categorical label vectors corresponding to the input data.

# One-hot encoding
train_y <- c(rep(0, 40), rep(1, 40), rep(2, 40), rep(3, 40))
test_y  <- c(rep(0, 3), rep(1, 3), rep(2, 3), rep(3, 3))
train_lab <- to_categorical(train_y)  # Categorical vector for the training classes
test_lab  <- to_categorical(test_y)   # Categorical vector for the test classes
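
If you want to see what the one-hot labels look like, an optional peek:

dim(train_lab)      # 160 rows (images) x 4 columns (suits)
head(train_lab, 3)  # each row has a single 1 marking the true class, e.g. 1 0 0 0 for Clubs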

Let's Build the Architecture

Below is the R script to build the CNN model. I am also including an animation that illustrates what each part does as the script proceeds (Figure 9).

# Model Building
model.card <- keras_model_sequential()  # Keras model composed of a linear stack of layers
model.card %>%
#----------------------------(A)-----------------------------------#
  layer_conv_2d(filters = 40,             #----------First convolution layer
                kernel_size = c(4,4),     #-----40 filters of dimension 4x4
                activation = 'relu',      #-----with a ReLU activation function
                input_shape = c(100,100,4)) %>%
#----------------------------(B)-----------------------------------#
  layer_conv_2d(filters = 40,             #---------Second convolution layer
                kernel_size = c(4,4),     #-----40 filters of dimension 4x4
                activation = 'relu') %>%  #-----with a ReLU activation function
#----------------------------(C)-----------------------------------#
  layer_max_pooling_2d(pool_size = c(4,4)) %>%  #--------------Max pooling
#-------------------------------------------------------------------#
  layer_dropout(rate = 0.25) %>%          #-------------------Dropout layer
#----------------------------(D)-----------------------------------#
  layer_conv_2d(filters = 80,             #----------Third convolution layer
                kernel_size = c(4,4),     #-----80 filters of dimension 4x4
                activation = 'relu') %>%  #-----with a ReLU activation function
#----------------------------(E)-----------------------------------#
  layer_conv_2d(filters = 80,             #---------Fourth convolution layer
                kernel_size = c(4,4),     #-----80 filters of dimension 4x4
                activation = 'relu') %>%  #-----with a ReLU activation function
#----------------------------(F)-----------------------------------#
  layer_max_pooling_2d(pool_size = c(4,4)) %>%  #--------------Max pooling
#-------------------------------------------------------------------#
  layer_dropout(rate = 0.35) %>%          #-------------------Dropout layer
#----------------------------(G)-----------------------------------#
  layer_flatten() %>%            #---Flatten the final stack of feature maps
#----------------------------(H)-----------------------------------#
  layer_dense(units = 256, activation = 'relu') %>%   #-------Hidden layer
#----------------------------(I)-----------------------------------#
  layer_dropout(rate = 0.25) %>%          #-------------------Dropout layer
#-------------------------------------------------------------------#
  layer_dense(units = 4, activation = "softmax") %>%  #--------Final layer
#----------------------------(J)-----------------------------------#
  compile(loss = 'categorical_crossentropy',
          optimizer = optimizer_adam(),
          metrics = c("accuracy"))        # Compiling the architecture
Figure 9. Illustrating the model creation (Image by author)

We can get a summary of this model using summary(model.card), which prints a neat and concise overview of the architecture.

Model: “sequential”
____________________________________________________________________
Layer (type) Output Shape Param #
====================================================================
conv2d (Conv2D) (None, 97, 97, 40) 2600
____________________________________________________________________
conv2d_1 (Conv2D) (None, 94, 94, 40) 25640
____________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 23, 23, 40) 0
____________________________________________________________________
dropout (Dropout) (None, 23, 23, 40) 0
____________________________________________________________________
conv2d_2 (Conv2D) (None, 20, 20, 80) 51280
____________________________________________________________________
conv2d_3 (Conv2D) (None, 17, 17, 80) 102480
____________________________________________________________________
max_pooling2d_1 (MaxPooling2D) (None, 4, 4, 80) 0
____________________________________________________________________
dropout_1 (Dropout) (None, 4, 4, 80) 0
____________________________________________________________________
flatten (Flatten) (None, 1280) 0
____________________________________________________________________
dense (Dense) (None, 256) 327936
____________________________________________________________________
dropout_2 (Dropout) (None, 256) 0
____________________________________________________________________
dense_1 (Dense) (None, 4) 1028
====================================================================
Total params: 510,964
Trainable params: 510,964
Non-trainable params: 0
____________________________________________________________________

Model Fitting

Once the architecture is built, it's time to fit our dataset to train the model. The fitting is done as follows:

#fit model
history<- model.card %>%
fit(train,
train_lab,
epochs = 100,
batch_size = 40,
validation_split = 0.2
)
Figure 10 (Image by author)

On fitting, each epoch (one pass of feed-forward and backpropagation) will appear in the console. The processing time may vary from machine to machine. While the epochs are running, you should see a graphic in the RStudio viewer (Figure 10): the juxtaposed loss and accuracy curves for the training and validation sets.

The running epochs appearing in the console looks something like this:

Train on 128 samples, validate on 32 samples
Epoch 1/100
128/128 [==============================] - 10s 78ms/sample - loss: 1.3648 - accuracy: 0.3281 - val_loss: 2.0009 - val_accuracy: 0.0000e+00
Epoch 2/100
128/128 [==============================] - 8s 59ms/sample - loss: 1.3098 - accuracy: 0.3359 - val_loss: 1.9864 - val_accuracy: 0.0000e+00
Epoch 3/100
128/128 [==============================] - 8s 61ms/sample - loss: 1.2686 - accuracy: 0.3516 - val_loss: 2.5289 - val_accuracy: 0.0000e+00

A summary of the whole training process can be plotted using plot(history).
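
Besides the plot, the per-epoch numbers behind the curves can also be inspected directly, e.g.:

str(history$metrics)  # per-epoch loss/accuracy for the training and validation sets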

Model Evaluation

Once the training is completed, it's time to evaluate our freshly trained model. First, we will look at the performance of the model over the training set, and then we will test and evaluate the trained model on the test set.

# Model Evaluation
model.card %>% evaluate(train, train_lab)      # Evaluation of the training set
pred <- model.card %>% predict_classes(train)  # Classification
Train_Result <- table(Predicted = pred, Actual = train_y)  # Results
model.card %>% evaluate(test, test_lab)        # Evaluation of the test set
pred1 <- model.card %>% predict_classes(test)  # Classification
Test_Result <- table(Predicted = pred1, Actual = test_y)   # Results
rownames(Train_Result) <- rownames(Test_Result) <-
  colnames(Train_Result) <- colnames(Test_Result) <-
  c("Clubs", "Hearts", "Spades", "Diamonds")
print(Train_Result)
print(Test_Result)

This will output:

(Image by author)

The 100% accuracy on the training set could be a sign of overfitting, but notice that our model achieved 100% accuracy on the test set as well. That means we have successfully built a convolutional neural network model that correctly classifies a given card image into its true suit.
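
As a closing illustration (not part of the steps above), here is a minimal sketch of how one might score a brand-new card image with the trained model; the file name is hypothetical, and the image is assumed to share the 4-channel PNG format of the training cards:

new_card <- readImage("C:/parent/new_card.png")          # hypothetical new image
new_card <- resize(new_card, 100, 100)                   # match the training dimensions
new_card <- array(new_card, dim = c(1, 100, 100, 4))     # add the batch dimension
suit_id  <- model.card %>% predict_classes(new_card)     # 0, 1, 2 or 3
c("Clubs", "Hearts", "Spades", "Diamonds")[suit_id + 1]  # map back to the suit name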

If you are here, then congratulations!! You have successfully built your convolutional neural network model. I hope you enjoyed the ride. Please feel free to reach out to me if you find anything incorrect, and I am open to all suggestions that can improve the quality of this document.

Thank you for reading and Happy R-ing 😀

Other Recommended Readings

Where Is Artificial Intelligence Used: Areas Where AI Can Be Used

Illustrated: 10 CNN Architectures

Gentle Dive into Math Behind Convolutional Neural Networks

A Comprehensive Guide to Convolutional Neural Networks — the ELI5 way

Keras for R

Introduction to EBImage
