H2O, Diabetes and Data Science

Machine Learning is all about creating an artificial brain to perform a task by itself. In most cases that task is Prediction.

How do you create a brain that can make predictions?

1. Create a dumb brain (technically, it is called a ‘Model’).
2. Tell your brain a story (Datasets).
3. Nurture the brain with essential/contextual data, so that it understands the story better.
4. Then tell a similar story to the brain and ask what the climax is going to be (Prediction).
5. Depending on how accurately it predicted the climax, repeat step 3 (Training the Model: the more useful information you add to a story, the better the brain functions on its own).

There are many technical options available for making these predictions. One popular question is whether to use Python or R.

But before heading there, it is more important to understand some of the fundamental concepts to get started.

For example, what types of algorithms are available? Which algorithm best suits a particular problem? What configurable parameters does an algorithm offer?

I recently came across an open source platform called H2O. It is easy to install, and it has an interesting UI called ‘Flow’ that helps you get started quickly.

Let’s take this Diabetes data set from Kaggle: https://www.kaggle.com/uciml/pima-indians-diabetes-database. It has 768 records, with columns describing the health details of patients.

The ‘Outcome’ column represents whether the patient has diabetes or not.

Note: Installing H2O on your PC is quite easy. Download the package, extract it, then run the command “java -jar h2o.jar” in the directory where you extracted it.

After the server has started successfully, the H2O Flow web console is available at this address: http://localhost:54321. (You can also grab the console URL from the startup logs.)

Also, make sure Java is installed on your computer, since H2O is built on top of Java.

Let’s create a flow now to predict whether a patient has diabetes or not.

Import the diabetes dataset into H2O Flow:

Parse the file. H2O now goes through the diabetes dataset and tries to work out the type of each attribute. This dataset is full of numbers, so the columns are recognised as numeric data types.

Note: Numbers are easy for the machine to understand, and they make the learning process more efficient. That doesn’t mean all data must be numeric, or that the machine can deal only with numbers.

The parser has automatically figured out the format of the file and the data types. Let’s leave the rest of the configuration at its defaults and proceed to the next step.

Create frame(s) – What you get straight after parsing a file is a frame. A frame is the parsed, typed form of the dataset that H2O works with.

Note: When we create a model, we need some data to validate the model and some data to test it, so the original dataset is split into multiple frames. Here we are going to split the original frame into 3 portions (60%, 20%, 20%). We will use the largest frame to train the model and the other two frames for validation and testing.
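To make the 60/20/20 idea concrete, here is a minimal sketch of such a split in plain Python, outside of H2O. The helper name `split_rows` and the fixed seed are assumptions made for illustration; H2O Flow performs its own random split, so the rows (and the exact frame sizes) it produces may differ.

```python
import random


def split_rows(rows, ratios=(0.6, 0.2, 0.2), seed=42):
    """Shuffle rows and cut them into consecutive parts sized by `ratios`."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    parts, start = [], 0
    for ratio in ratios[:-1]:
        end = start + round(len(shuffled) * ratio)
        parts.append(shuffled[start:end])
        start = end
    parts.append(shuffled[start:])         # the last part takes whatever is left
    return parts


# 768 stand-in row ids, matching the size of the diabetes dataset
train, valid, test = split_rows(range(768))
print(len(train), len(valid), len(test))   # prints: 461 154 153
```

Every row lands in exactly one of the three parts, which is the property that matters: the model is never validated or tested on a row it was trained on.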

Modelling – Let’s start the training. Choose the ‘DIABETES_60’ data frame and click on ‘Build model’.

Select an algorithm. Every algorithm has its own purpose, efficiency and accuracy levels, depending on the type of dataset. We are going to use ‘Gradient Boosting Machine’ in our exercise. (It is important to understand how an algorithm works, so that you know how to configure the parameters it offers.)
Choose the training frame and the validation frame.
response_column is simply the column we will be predicting in this exercise. We are going to predict diabetes, so choose the ‘Outcome’ column. If you were going to predict blood pressure, you would choose the ‘BloodPressure’ column instead.
Let’s leave the rest of the parameters at their default values and build the model straight away. A GBM-based model is now created with a unique key (gbm-178e7350-e0d9-46fb-94f0-477425207a04).
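For readers who prefer code to clicks, roughly the same steps can be sketched with H2O's Python client. This is a hedged sketch, not what Flow runs internally. It assumes the `h2o` Python package and Java are installed and that the Kaggle file is saved locally as `diabetes.csv` (a hypothetical path). The function name and seed are arbitrary. Converting ‘Outcome’ to a categorical column is this sketch's own choice; the walkthrough above leaves every column numeric.

```python
def build_diabetes_gbm(csv_path="diabetes.csv", seed=42):
    """Import the file, split it 60/20/20, and train a default GBM on 'Outcome'."""
    # Imported inside the function so this file loads even where h2o is missing.
    import h2o
    from h2o.estimators import H2OGradientBoostingEstimator

    h2o.init()                         # start or connect to a local H2O server
    frame = h2o.import_file(csv_path)  # import and parse in one call

    # Assumption of this sketch: treat 0/1 as classes so GBM does classification.
    frame["Outcome"] = frame["Outcome"].asfactor()

    # Two ratios yield three frames: ~60% train, ~20% validation, the rest test.
    train, valid, test = frame.split_frame(ratios=[0.6, 0.2], seed=seed)

    predictors = [c for c in frame.columns if c != "Outcome"]
    model = H2OGradientBoostingEstimator(seed=seed)  # other parameters at defaults
    model.train(x=predictors, y="Outcome",
                training_frame=train, validation_frame=valid)
    return model, test
```

With the returned pair, `model.predict(test)` produces a frame of predictions for the held-out rows, and `model.varimp()` lists the variable importances.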

Prediction – After the model is successfully created, click on predict. Let’s use the ‘DIABETES_20_TEST’ frame to predict diabetes.

So far, we have trained a model using the largest part of the dataset (DIABETES_60) and validated it using the DIABETES_20_VALIDATION frame. Now we are going to predict diabetes for the patients in the DIABETES_20_TEST frame.

Note: The frame split happens randomly, so you can always export a frame as a file to see its content.

After the prediction is complete, take a look at the prediction outcome by clicking on the prediction id, and download it.
You can compare the predicted values with the original ‘Outcome’ column.
The prediction accuracy is pretty bad, as we chose the default configuration values for the GBM algorithm. But you should now have the whole idea of how to do a prediction using H2O.
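Comparing predictions with the original column can also be done in a few lines of code rather than by eye. The sketch below assumes both columns have already been reduced to 0/1 labels. The function names are hypothetical, and the two lists are made-up values for illustration, not results from this exercise.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for 0/1 labels."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1      # diabetic, predicted diabetic
        elif a == 0 and p == 0:
            counts["tn"] += 1      # healthy, predicted healthy
        elif a == 0 and p == 1:
            counts["fp"] += 1      # healthy, wrongly flagged
        else:
            counts["fn"] += 1      # diabetic, missed by the model
    return counts


def accuracy(actual, predicted):
    """Share of rows where the prediction equals the actual label."""
    matches = sum(1 for a, p in zip(actual, predicted) if a == p)
    return matches / len(actual)


# Made-up labels purely for illustration
actual    = [1, 0, 0, 1, 1, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 0]

print(confusion_counts(actual, predicted))  # {'tp': 2, 'tn': 3, 'fp': 1, 'fn': 2}
print(accuracy(actual, predicted))          # 0.625
```

The four counts say more than the single accuracy figure: in a medical setting, a missed diabetic (`fn`) and a false alarm (`fp`) are very different kinds of mistake.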
Some key points to note:
1. Out of 768 rows, we used only 60% to train the model. The more data you have for training a model, the better it generally performs.
2. We didn’t talk about an important process called ‘Data cleansing’, which includes:
  1. Removing unwanted data from a dataset.
  2. Replacing an empty field with a default value.
  3. Removing a column in which more than 80% of the values are empty.
  4. And so on. A clean data set improves the accuracy of a model. The data set also needs to be contextual to what we are trying to predict.
3. After creating the model, you must have had a look at the ‘VARIABLE IMPORTANCE’ graph, which tells you which columns are more important for predicting diabetes. Of course, the graph changes if you change the ‘response_column’ to something else.
4. Some other scenarios I tried using the same dataset are:
  1. Created a model with BloodPressure as the response_column: cloned the data set and removed all columns except BMI & AGE. So I tried predicting the blood pressure with just BMI & AGE.
  2. Repeated the same scenario, but this time I removed all columns in the data set except Insulin & Outcome.
  3. Repeated the same scenario, but this time without removing any columns, so every column other than BloodPressure was available as a predictor.
  4. After exporting all three results to an Excel sheet, I got the following graph. All 3 coloured lines are visible in it, which means varying accuracy but the same prediction pattern. (Each colour represents a scenario.)
  5. Same way I tried predicting the AGE –
    1. Predicted AGE only with Pregnancies & BMI.
    2. Predicted AGE with all Columns in the dataset.
    3. Comparison Graph below (drawn using Microsoft Excel):
      1. Blue line – Scenario 1 – ONLY PREGNANCIES & BMI
      2. Red line – Scenario 2 – All columns
      3. Green line – Original AGE
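The data-cleansing steps listed under point 2 above can be sketched in plain Python. This is an illustration under assumptions: the function name `clean` is hypothetical, `None` stands for an empty field, the column median is used as the "default value", and the sample rows (including the `Unused` column) are made up.

```python
import statistics


def clean(rows, max_empty_share=0.8):
    """Drop mostly-empty columns, then fill remaining gaps with the column median.

    `rows` is a list of dicts that share the same keys; None marks an empty field.
    """
    kept = []
    for col in rows[0].keys():
        empty = sum(1 for r in rows if r[col] is None)
        if empty / len(rows) <= max_empty_share:  # drop only if MORE than 80% empty
            kept.append(col)

    # Default value per kept column: the median of its non-empty entries.
    medians = {
        col: statistics.median(r[col] for r in rows if r[col] is not None)
        for col in kept
    }
    return [
        {col: medians[col] if r[col] is None else r[col] for col in kept}
        for r in rows
    ]


# Made-up rows purely for illustration
rows = [
    {"Glucose": 100,  "Insulin": None, "Unused": None},
    {"Glucose": None, "Insulin": 80,   "Unused": None},
    {"Glucose": 120,  "Insulin": 90,   "Unused": None},
    {"Glucose": 140,  "Insulin": None, "Unused": None},
    {"Glucose": 160,  "Insulin": 100,  "Unused": None},
]

cleaned = clean(rows)
print(cleaned[1])   # {'Glucose': 130.0, 'Insulin': 80}
```

The `Unused` column disappears because it is entirely empty, and the missing `Glucose` value is replaced by 130.0, the median of the four values that are present.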

This is one of the simpler exercises in machine learning using H2O. You can try the same exercise using Python / R, or try the same approach with different datasets. (There are many interesting data sets on Kaggle.)

Note: H2O has a set of examples using various algorithms. Going through them can give you a lot of perspective on ML.

We already know how disruptive Big Data, Data Science, Machine Learning and Predictive analytics are. They are also huge to explore, complex and complicated. But I think there are much better and simpler tools available nowadays to get started!


Original link – http://gnanaguru.com/h2o-diabetes-and-data-science/
