Experimenting with AWS Machine Learning for Classification

In this post, I'll explore the new AWS Machine Learning services.

The problem we are trying to solve is classifying auto accident severity from a set of features. I won't go into further detail on the data set, the classification algorithms, etc. here, since the goal of this post is to explore the new AWS Machine Learning service step by step.

In the next blog post, I'll explore another service: Microsoft Azure Machine Learning.

Let's get started by logging into the AWS Console.

Now select Machine Learning service:

Too bad my original region (US West) is not supported by AWS Machine Learning. So, select the suggested region.

The opening screen comes up. Select "Get Started."

Select Standard setup and launch it, which then asks for data from either S3 or Redshift. 
So, let's load the accident data into S3. Remember to change the permissions on your file so it is publicly accessible. I'll explain in a minute why I have two files.
Once we enter the name of the data file in S3, it will verify the data source.
Now we have to choose the Target variable we are trying to classify; in this case, it's accident severity. BUT wait a minute, what happened to all the variable names?!
WHY did it default to numerical regression as the preferred machine learning algorithm for this problem? We want machine learning classification! Hmm..
Let's finish this process off and backtrack later. Once we have selected the Target variable/feature, it's time to review everything. Row ID is optional.
Now, going back to fix the variable names, we should return to Step 2 (Schema) and select YES to the question "Does the first line in your CSV contain the column names?" Voila! All the column headings came through.
Then reselect the Target variable. Our original data set did NOT have a variable called Target. I had to do an OFFLINE data transformation, changing the MAX_SEV_IR variable (column 24) into a CATEGORICAL variable. Following the data dictionary, I mapped the MAX_SEV_IR numeric values 0, 1, 2 to the corresponding categories: 0 = no injury, 1 = non-fatal injury, 2 = fatal injury.
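That offline recoding is easy to script. Here's a minimal Python sketch of the idea (the MAX_SEV_IR codes and labels come from the data dictionary; the function name and CSV plumbing are just illustrative):

```python
import csv
import io

# Mapping from the data dictionary: MAX_SEV_IR numeric codes -> categories.
SEVERITY_LABELS = {"0": "no injury", "1": "non-fatal injury", "2": "fatal injury"}

def recode_severity(csv_text, column="MAX_SEV_IR"):
    """Replace numeric MAX_SEV_IR codes with categorical labels in CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # Leave any unexpected code untouched rather than guessing.
        row[column] = SEVERITY_LABELS.get(row[column], row[column])
        writer.writerow(row)
    return out.getvalue()
```

Run your CSV through something like this before uploading to S3, and AWS will see a categorical target instead of a numeric one.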
Once we did that, it automatically changed the ML module to "Multiclass Classification," which is what we wanted to do originally.
After everything is selected, we need to review our configurations.
The following screen shows the data set being created in the AWS Machine Learning environment.
After the data set has been created, we need to use this data source to create (train) an ML model.
Let's go to the ML Dashboard and check on the progress of our work. We see that AWS is still cranking through splitting the data set into training and test sets. After that, it will feed the training set into the classifier (the ML Model stage). When all is said and done, it will evaluate the model's performance.
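AWS handles the train/test split for you, but conceptually that stage is nothing more than a shuffled partition of the rows. A rough pure-Python sketch (the 70/30 ratio here is illustrative, not necessarily the exact ratio AWS uses):

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle rows deterministically, then carve off a held-out test set."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)
```

The model is fit on the training partition only; the held-out test partition is what the evaluation stage scores against.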

Once everything is done, let's see how it performed.

Let's click on Explore model performance to see the details. It looks too good to be true.

Oh, wow! Wait a minute... Something is amiss!

The model has 100% classification accuracy across all three accident severity types?! Something is wrong. For more details on how to read and interpret the matrix above, check out the AWS documentation.
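As a sanity check, overall accuracy can be recomputed straight from the confusion matrix: correct predictions sit on the diagonal. Here's a quick sketch with made-up 3-class matrices (perfect accuracy like this on real accident data usually means the target leaked into one of the features, which is worth checking):

```python
def accuracy_from_confusion(matrix):
    """Overall accuracy = sum of the diagonal / sum of all cells."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Hypothetical 3-class matrices: rows = actual class, columns = predicted class.
suspiciously_perfect = [[10, 0, 0],
                        [0, 20, 0],
                        [0, 0, 5]]   # everything on the diagonal: accuracy 1.0

more_realistic = [[8, 2, 0],
                  [1, 18, 1],
                  [0, 1, 4]]
```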

It was fun to experiment with the new wave of Machine Learning services. As a data scientist, I still prefer the powerful language R, so I know exactly what I put into my models, how to tune them, and how to interpret their outputs. Yes, these GUI-based machine learning services can be easier for novices, but it's not obvious whether they do exactly what one wants or whether they're flexible enough for fine-tuning. Perhaps I need to spend more time with the documentation. These are just first impressions. I'm sure these things will improve over time.

Additionally, it takes what seems like a VERY LONG time to process a relatively small data file. We are talking about 43K rows of data. R can rip through that very quickly, but I waited 15-20 minutes for the entire sequence to process on AWS Machine Learning.

So, AWS Machine Learning ONLY makes sense if one has REALLY large-scale data that needs cloud computing infrastructure. Otherwise, it's really slow. It's like using Hadoop to process 1MB of data. Not a good use case. :)

As a professional data scientist, I find this canned service rather limiting; it does not offer the full flexibility of a true data science computing environment. To be fair, I'm sure it will improve.

Well, that's all for now folks.

Next time, I'll explore another Machine Learning service: the new Microsoft Azure Machine Learning.




Comment by Sam Mason on May 31, 2015 at 7:00pm

I felt the same - it's a good start, but I'd like to see ensemble method support as well as a number of other algorithms (decision trees, NB, SVMs, etc.).

However, if you delve into the developer guide you'll find quite a bit of support for feature transformations, including n-gram generation etc. You can do quite a bit with your own recipes.

I'm sure a lot of updates are coming though - talking to the AWS guys I know, this is only the start of a big play in the ML space.

Comment by Ron Segal on May 28, 2015 at 12:58pm

Most enlightening, thanks Peter.

I agree that for me support for reproducible research is pretty critical. Difficult with point and click tools.

Best wishes, Ron

Comment by John Tilly on May 28, 2015 at 10:28am

We welcome you to take a look at the ForecastThis machine learning platform. We're not one of the 'big boys'; our platform is fully agnostic and independent, and it accesses a library of hundreds of algorithms, performing thousands of model tests in minutes.
