Updated: Apr 22, 2020
Hi! This is Anjani from Tech_Turtle and this is my first time writing a blog. I love reading blogs but I never got the time to write them, despite having plenty to say. Well, thanks to the quarantine, I finally got all the time I needed, so let's give it a shot.
This project deals with detecting malaria-infected cells, distinguishing parasitized cells from uninfected ones. Here, I've used VGG-style models, which have shown great results in the past when it comes to image data. We've created a few models, tested them on the dataset, and achieved a maximum accuracy of 96%.
You can find all of my code related to this project here. You can use it as a reference for other projects as well.
The data is easily available at Kaggle. You need to have a Kaggle account in order to download the dataset. The dataset contains a total of 27,558 images (335 MB in size) equally divided into 2 categories: Parasitized (or infected) and Uninfected.
The data we extracted isn’t ready to be implemented. So, we need to clean it. First, we define the location where we’ve kept all our data in the DATADIR variable.
Next, we define CATEGORIES, which holds the classes we need to predict. The images in the dataset aren't all the same size, and the model I'm going to use needs inputs of a fixed dimension, so we resize all the images. I set IMG_SIZE to 100 (for 100 × 100 images). Feel free to lower this if you're short on memory (try 50 or 64 if you have 8 GB of RAM or less). We also define the training_data variable where we are going to save all our data.
Next, we go through our categories and we provide the path to our data in the ‘path’ variable. We create a class_num variable to save the categories for our dataset, i.e., 1 for parasitized and 0 for uninfected.
Now, we read all the images using Python's OpenCV library and then resize them as discussed before. If your system is a bit slow, pass the grayscale flag when reading:

img_array = cv2.imread(os.path.join(path, img), cv2.IMREAD_GRAYSCALE)

This converts your images to grayscale, so each one has shape (100 × 100) instead of (100 × 100 × 3). That saves a lot of memory and time.
Next, we add that in our training data along with its class. We do this for all the images in the dataset and then we shuffle our training data to mix both categories.
Saving the Data
We cleaned our data. What's next? Since data cleaning is a long and tedious process, we save our training_data to pickle files so we never have to repeat it. But first, we need to separate our features from our labels.
Here, we define two variables X, for the features and y, for our labels.
Next, we use pickle files to store our data so that we can load them easily later.
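A sketch of the split-and-save step. The `save_dataset`/`load_dataset` helpers and the file names are my own; the idea is simply to pickle the feature array and label array separately.

```python
import pickle

import numpy as np

def save_dataset(training_data, img_size=100, channels=3, prefix=""):
    """Split features from labels, then pickle both for quick reloading."""
    X, y = [], []
    for features, label in training_data:
        X.append(features)
        y.append(label)
    # use channels=1 if you loaded the images in grayscale
    X = np.array(X).reshape(-1, img_size, img_size, channels)
    y = np.array(y)
    with open(prefix + "X.pickle", "wb") as f:
        pickle.dump(X, f)
    with open(prefix + "y.pickle", "wb") as f:
        pickle.dump(y, f)
    return X, y

def load_dataset(prefix=""):
    """Reload the pickled feature and label arrays."""
    with open(prefix + "X.pickle", "rb") as f:
        X = pickle.load(f)
    with open(prefix + "y.pickle", "rb") as f:
        y = pickle.load(f)
    return X, y
```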
To know more about pickle files, you can check out the below links:
Training Our Model
Now, it’s time to create our model. Here, we’re going to use Convolutional Neural Networks to build our model.
So, first we’ll create a baseline model on which we’re going to compare our other models. This model will have only one Convolutional layer.
This is a basic CNN model which gives us an accuracy of over 80% on our test set.
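A minimal sketch of such a baseline in Keras, assuming a single convolutional layer followed by a small dense head. The filter count and dense size are my own choices, not necessarily the ones in the original notebook.

```python
from tensorflow.keras import layers, models

def build_baseline(img_size=100, channels=3):
    """One Conv2D block plus a dense head for binary classification."""
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu",
                      input_shape=(img_size, img_size, channels)),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # parasitized vs. uninfected
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

Something like `model.fit(X / 255.0, y, validation_split=0.2, epochs=10)` then trains it; dividing by 255 scales the pixel values into [0, 1].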
Now, we need to increase our accuracy because 80% on this dataset is laughable. So, what to do next? Well, for starters, we can increase the number of layers in our model. So, let's add one more layer and see what happens.
Wow, that was a huge change in the accuracy and that is only by adding one more layer. We went up from 80% to 93% and that's incredible for a VGG-2 model.
But again, 93% is good, yet not good enough when it comes to medicine. Imagine sending 7% of the people who actually have malaria home, telling them they're healthy; that would be disastrous. We can't afford that, so the question arises again: what can we do next? Well, let's add another layer and see what happens.
Well, the improvement wasn't as big as it was for VGG-2, but we surely got better accuracy. We almost reached 96%, which is again pretty good for a VGG-3 model.
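The two-block and three-block variants follow the same pattern, so here is a parametric sketch of both. This is my own generalization: each VGG-style block is two 3×3 convolutions followed by max-pooling, with the filter count doubling per block.

```python
from tensorflow.keras import layers, models

def build_vgg(num_blocks, img_size=100, channels=3):
    """Stack VGG-style Conv-Conv-Pool blocks, doubling filters each time."""
    model = models.Sequential()
    model.add(layers.Input(shape=(img_size, img_size, channels)))
    filters = 32
    for _ in range(num_blocks):
        model.add(layers.Conv2D(filters, (3, 3), activation="relu", padding="same"))
        model.add(layers.Conv2D(filters, (3, 3), activation="relu", padding="same"))
        model.add(layers.MaxPooling2D((2, 2)))
        filters *= 2
    model.add(layers.Flatten())
    model.add(layers.Dense(128, activation="relu"))
    model.add(layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

vgg2 = build_vgg(2)  # the ~93% model
vgg3 = build_vgg(3)  # the ~96% model
```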
Lastly, we apply one of the best models there is for this job: VGG-16. Let's see how we perform with this architecture:
Well, this also showed an accuracy of 96%, which is quite believable because this model works best when you have a huge amount of data.
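One way to sketch the VGG-16 step is with Keras's built-in architecture. Here I'm using `keras.applications.VGG16` with `weights=None`, i.e. training from scratch; swapping in `weights="imagenet"` and freezing the base would give you transfer learning instead. The dense head on top is my own addition.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

def build_vgg16(img_size=100, channels=3):
    """VGG-16 convolutional base with a small binary-classification head."""
    base = VGG16(weights=None,          # or "imagenet" for transfer learning
                 include_top=False,     # drop the 1000-class ImageNet head
                 input_shape=(img_size, img_size, channels))
    model = models.Sequential([
        base,
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```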
There are many ways to increase the accuracy of your models, and they play a huge role when you're competing, because 1% less accuracy can land you behind hundreds (and sometimes even thousands) of people. Some of them are generating more data through image augmentation, or using dropout in the model. But that's for another day. You can check these methods out in the meantime if you wish. The TensorFlow documentation is nice and simple; search for a particular topic and try to implement it in your code to get a better understanding of the matter.
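As a teaser, both techniques are only a few lines in Keras. This is a sketch: the augmentation ranges and the dropout rate are arbitrary example values, not tuned for this dataset.

```python
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation: create extra training samples by randomly transforming the originals.
datagen = ImageDataGenerator(rotation_range=20,
                             width_shift_range=0.1,
                             height_shift_range=0.1,
                             horizontal_flip=True)

# Dropout: randomly silence a fraction of units during training to curb overfitting;
# just slot the layer into the model definition.
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(100, 100, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dropout(0.5),               # drops 50% of activations at train time
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Typical usage: model.fit(datagen.flow(X, y, batch_size=32), epochs=10)
```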
We used Convolutional Neural Networks in this project. The thing about CNNs is that they perform far better with more data, say a few hundred thousand images. With that much data, we could push the accuracy past 98%.
I guess that's the end of this blog. I hope it helped you understand the concepts of CNNs a little better. I didn't go deep into the theory behind the models because it's a simple project and you can get by with the built-in functions.
Have any doubts/new ideas?
I’d love to hear from you if I can help you or your team with machine learning.