Machine learning is about teaching a machine to perform a task on its own. In most cases, that task is prediction.
How do we build something that can make predictions?
There are many technical options available, and one popular question is whether to use Python or R.
But before heading there, it is more important to understand some fundamental concepts first.
For example, what types of algorithms are available? Which algorithm best suits a particular problem? What configurable parameters does each algorithm expose?
I recently came across an open-source platform called H2O. It is easy to install, and it has an interesting UI called 'Flow' that helps you get started quickly.
Let's take the Diabetes dataset from Kaggle (https://www.kaggle.com/uciml/pima-indians-diabetes-database). It has 768 records, with various columns representing patients' health details.
( The 'Outcome' column indicates whether the patient has diabetes or not. )
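If you want to peek at the data outside H2O first, here is a minimal Python sketch. The column names match the Kaggle CSV header; the sample rows are inlined here purely for illustration (in practice you would read the downloaded diabetes.csv file instead):

```python
import csv
import io

# A few rows in the Kaggle CSV's format, inlined for illustration.
# Replace io.StringIO(sample) with open("diabetes.csv") for the real file.
sample = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
"""

rows = list(csv.DictReader(io.StringIO(sample)))
diabetic = sum(1 for r in rows if r["Outcome"] == "1")
print(f"{len(rows)} rows, {diabetic} with diabetes")
```

This is just a quick way to inspect the columns and the 'Outcome' label before handing the file over to H2O.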
Note: Installing H2O on your PC is quite easy. Just download the package from here, then run "java -jar h2o.jar" in the directory where you extracted the package.
After the server has started successfully, the H2O Flow web console will be available at http://localhost:54321. ( You can also grab the console URL from the startup logs. )
Also, make sure you have Java installed on your computer; H2O itself runs on top of Java.
Let's create a flow now to predict whether a patient has diabetes or not.
Import the diabetes dataset into H2O Flow:
Parse the file. H2O now goes through the diabetes dataset and tries to work out which attribute is what. This dataset is full of numbers, so the columns are recognised as numeric data types.
Note: Numbers are easy for the machine to understand, and the learning process is more efficient with them. That doesn't mean all data must be numeric, or that the machine can only deal with numbers.
The parser has automatically figured out the format of the file and the data types. Let's leave the rest of the configuration at its defaults and proceed to the next step.
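To make the idea of automatic type detection concrete, here is a toy sketch of how a parser might guess a column's type. This is a simplification for illustration, not H2O's actual parsing logic:

```python
def infer_type(values):
    """Guess a column type the way a parser might: 'numeric' if every
    value parses as a number, 'enum' (categorical) otherwise."""
    try:
        for v in values:
            float(v)
        return "numeric"
    except ValueError:
        return "enum"

print(infer_type(["148", "85", "183"]))  # prints "numeric"
print(infer_type(["yes", "no", "yes"]))  # prints "enum"
```

A real parser scans a sample of each column like this, which is why the all-numeric diabetes columns come out typed as numeric.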
Create frame(s) - What you get straight after parsing a file is a frame. A frame is a parsed, better-understood version of the dataset.
Note: When we build a model, we need data to validate it and data to test it, so the original dataset is split into multiple frames. Here we split the original frame into three portions ( 60%, 20%, 20% ). We will use the largest frame to train the model and the other two for validation and testing.
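The 60/20/20 split described above can be sketched in plain Python. This mirrors the idea of what Flow does when it splits a frame, not H2O's exact implementation:

```python
import random

def split_frame(rows, ratios=(0.6, 0.2, 0.2), seed=42):
    """Shuffle the rows and cut them into train/validation/test pieces,
    mirroring the 60/20/20 frame split done in H2O Flow."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    train_end = int(n * ratios[0])
    valid_end = train_end + int(n * ratios[1])
    return shuffled[:train_end], shuffled[train_end:valid_end], shuffled[valid_end:]

# 768 records, like the diabetes dataset
train, valid, test = split_frame(list(range(768)))
print(len(train), len(valid), len(test))  # prints 460 153 155
```

Because of integer truncation the pieces are not exactly 60/20/20 of 768, which is also why H2O's split counts can look slightly uneven.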
Modelling - Let's start the training. Choose the 'DIABETES_60' frame and click on 'Build model'.
Prediction - After the model has been built successfully, click on 'Predict'. Let's use the 'DIABETES_20_TEST' frame to predict diabetes.
So far, we have trained a model on the largest part of the dataset (DIABETES_60) and validated it with the DIABETES_20_VALIDATION frame; now we will predict diabetes for the patients in the DIABETES_20_TEST frame.
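To make the train-then-predict cycle concrete in plain Python, here is a deliberately simple baseline: pick the glucose threshold that best separates outcomes on the training rows, then apply it to unseen rows. This is not the kind of model H2O builds, and the (glucose, outcome) pairs below are synthetic; it only illustrates the train/test workflow:

```python
def train_threshold(rows):
    """Pick the glucose cut-off that best separates outcomes on the
    training rows. rows are (glucose, outcome) pairs."""
    best_t, best_acc = 0, 0.0
    for t in range(0, 201, 5):
        acc = sum((g >= t) == bool(o) for g, o in rows) / len(rows)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def predict(threshold, rows):
    """Label each row 1 (diabetic) if glucose is at or above the threshold."""
    return [int(g >= threshold) for g, _ in rows]

# Synthetic (glucose, outcome) pairs, for illustration only.
train_rows = [(90, 0), (100, 0), (110, 0), (150, 1), (160, 1), (170, 1)]
test_rows = [(95, 0), (155, 1)]

t = train_threshold(train_rows)
print(t, predict(t, test_rows))  # prints 115 [0, 1]
```

A real model uses all the columns and a proper algorithm, but the shape of the workflow is the same: fit on the training frame, then predict on rows the model has never seen.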
Note: The frame split happens randomly, so if you want to see what ended up in each frame, you can always export it as a file and inspect the contents.
This is one of the simplest exercises in machine learning using H2O. You can try the same exercise using Python or R, or apply the same approach to different datasets. ( There are many interesting datasets on Kaggle. )
Note: H2O ships with a set of examples using various algorithms. Just going through them can give you a lot of perspective on ML.
Big Data, Data Science, Machine Learning and predictive analytics: we already know how disruptive they are. They are also huge fields to explore, complex and complicated. But I think there are much better and simpler tools available nowadays to get started!
Original link - http://gnanaguru.com/h2o-diabetes-and-data-science/