Regression Prediction using AWS Machine Learning

We want to predict the median rent of a place from three inputs: the median home price, the median household income, and the percentage of vacant homes in that place. The data can be downloaded from here

The steps are:

  1. Create data source

  2. Train the model

  3. Evaluate the model

  4. Generate predictions


To get started, log in to the AWS console and click “Machine Learning”. The dashboard shows all your entities by default. An entity can be an ML model, a datasource, an evaluation, and so on.




Creating a Datasource

To create a datasource, click the “Create new” drop-down button and then select “Data Source”.



To create a datasource, your data file must be stored in either Amazon S3 or Amazon Redshift.

If you are reading the data from S3, provide the S3 location of the data file. Once you have entered the information, click “Verify”.



Amazon ML validates the datasource during this step.


A schema is composed of all attributes in the input data and their corresponding data types. Amazon ML uses the information in the schema to correctly read and interpret the input data, compute statistics, apply the correct attribute transformations, and fine-tune its learning algorithms.


You can provide a separate schema file when you upload your data to Amazon S3. Here we let Amazon ML infer the attribute types and create the schema.

On the schema page, set the “Does the first line in your CSV contain the column names?” option to “Yes”.

Make sure that each attribute in the file is assigned the correct data type.
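
If you would rather supply the schema yourself than rely on inference, the file is a small JSON document. The sketch below builds one in Python, assuming the Amazon ML schema fields (version, rowId, targetAttributeName, dataFormat, dataFileContainsHeader, attributes). Only Geo_ID is a column name taken from this walkthrough. The other column names are hypothetical placeholders, and the target and identifier match the choices made in the later steps.

```python
import json


def build_schema(attribute_types, target, row_id):
    """Return an Amazon ML style schema as a plain dict."""
    return {
        "version": "1.0",
        "rowId": row_id,
        "targetAttributeName": target,
        "dataFormat": "CSV",
        "dataFileContainsHeader": True,
        "attributes": [
            {"attributeName": name, "attributeType": kind}
            for name, kind in attribute_types.items()
        ],
    }


# Hypothetical column names; only Geo_ID appears in the article.
columns = {
    "Geo_ID": "CATEGORICAL",
    "median_home_price": "NUMERIC",
    "median_household_income": "NUMERIC",
    "percent_vacant": "NUMERIC",
    "median_rent": "NUMERIC",
}

schema = build_schema(columns, target="median_rent", row_id="Geo_ID")
schema_json = json.dumps(schema, indent=2)  # text to save next to the CSV
```

Writing the types out explicitly makes it easy to spot a numeric column that would otherwise be inferred as text or categorical.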




Review the types carefully and click “Continue”.


On the next page, for “Do you want to use this dataset to create and/or evaluate a ML model?”, choose “Yes”.

This lets us select the target attribute.




The “Target” is the attribute that the model must learn to predict. Here we want to predict the median rent of a place, so we select it as the target.


On the next page, for “Do you want to select an identifier?”, choose “Yes”. On the following page, check “Geo_ID” and click “Review”.


On the review page, check the attributes and click “Finish”.


Once you click “Finish”, the datasource appears with the status “Initialized”. It takes some time to reach the “Completed” status.


Training ML model

Amazon ML supports three types of ML models:

  1. Binary classification

  2. Multiclass classification

  3. Regression

The type of model depends on the type of value you want to predict.

For binary classification, Amazon ML uses logistic regression. For multiclass classification it uses multinomial logistic regression, and for regression it uses linear regression.


Regression model

Since we want to predict the rent at a particular place, which is a number, we use a regression model. Based on the training data, the model computes one weight for each feature so that it can predict or estimate the target value.


The learning algorithm consists of a loss function and an optimization technique. For regression, Amazon ML uses the squared loss function, and the optimization technique is stochastic gradient descent (SGD).
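
To make the pairing of squared loss and SGD concrete, here is a minimal sketch in plain Python. It is not Amazon ML's implementation: the learning rate, the number of epochs, the absence of regularization, and the toy data are all assumptions chosen for illustration.

```python
import random


def train_linear_sgd(rows, targets, learning_rate=0.1, epochs=500, seed=0):
    """Fit y ~ bias + sum(w_j * x_j) by SGD on the squared loss."""
    rng = random.Random(seed)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a fresh random order
        for i in order:
            prediction = bias + sum(w * x for w, x in zip(weights, rows[i]))
            error = prediction - targets[i]
            # Gradient of 0.5 * error**2: error * x_j for w_j, error for bias.
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, rows[i])]
            bias -= learning_rate * error
    return weights, bias


# Toy data generated from y = 1 + 2x, so the fit should approach those values.
rows = [[0.0], [0.25], [0.5], [0.75], [1.0]]
targets = [1.0 + 2.0 * r[0] for r in rows]
weights, bias = train_linear_sgd(rows, targets)
```

With real features such as home prices and incomes, the inputs would need to be scaled to comparable ranges first, otherwise a fixed learning rate like this one can diverge.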


Create an ML model

You can create an ML model either from the datasource or from the “Create new” drop-down button in the dashboard, the same way you created the datasource.


If you create it from the “Create new” drop-down button, you have to provide the name of the datasource on which the model will be trained.




Click “Continue”. On the next page, give the model a name.
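
The console is not the only way to create a model. As a rough sketch, assuming the boto3 “machinelearning” client and its create_ml_model call, the same request could look like the code below. The IDs and names are hypothetical, and the RecordingClient class is only a stand-in so that the example runs without AWS credentials.

```python
def create_regression_model(client, model_id, model_name, training_datasource_id):
    """Ask Amazon ML to train a REGRESSION model on an existing datasource."""
    return client.create_ml_model(
        MLModelId=model_id,
        MLModelName=model_name,
        MLModelType="REGRESSION",
        TrainingDataSourceId=training_datasource_id,
    )


class RecordingClient:
    """Stand-in for boto3.client("machinelearning"); it only records the call."""

    def __init__(self):
        self.calls = []

    def create_ml_model(self, **kwargs):
        self.calls.append(kwargs)
        return {"MLModelId": kwargs["MLModelId"]}


# Hypothetical IDs and names, used only for illustration.
client = RecordingClient()
response = create_regression_model(
    client, "ml-rent-model", "Median rent regression", "ds-rent-training"
)
```

In real use you would pass boto3.client("machinelearning") instead of the stand-in, and the datasource ID would be the one assigned when the datasource was created.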


On the next page, for “Training and evaluation settings”, choose “Default”, because it is best to start with the simple, default options.


With this option, an evaluation is generated automatically: 70% of the data is used for training and the remaining 30% is used for evaluation.
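
The 70/30 division can be pictured with a few lines of Python. This sketch assumes a sequential split, where the first 70% of the rows go to training and the rest to evaluation. The article does not say whether the rows are shuffled first, so treat the ordering as an assumption.

```python
def split_sequential(records, train_percent=70):
    """Return (training, evaluation) slices of records, in their original order."""
    cut = len(records) * train_percent // 100  # integer math avoids float rounding
    return records[:cut], records[cut:]


records = list(range(10))  # stand-in for ten rows of the rent data
train, evaluation = split_sequential(records)
```

If the file is sorted, for example by region, a sequential split can leave the evaluation set unrepresentative, which is a reason to shuffle the rows before uploading.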



Evaluating the model

Once the model is built, it can be run on data it has not seen, and the predicted values can be compared with the actual values to evaluate the performance of the model.


Since we selected the “Default” option, an evaluation is generated automatically.


For regression tasks, the root mean square error (RMSE) is used to evaluate accuracy.

The RMSE for our model is 278.
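
For reference, RMSE is the square root of the average squared difference between the actual and predicted values, so it is expressed in the same unit as the target. The sketch below computes it on made-up rents (not the article's data) and compares it with a simple baseline that always predicts the mean.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values for illustration only.
actual_rents = [900.0, 1200.0, 1500.0]
predicted_rents = [1000.0, 1100.0, 1500.0]
model_rmse = rmse(actual_rents, predicted_rents)

# Baseline: always predict the mean of the actual values.
mean_rent = sum(actual_rents) / len(actual_rents)
baseline_rmse = rmse(actual_rents, [mean_rent] * len(actual_rents))
```

A model is only useful if its RMSE is clearly lower than that of such a baseline.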




We can also see the distribution of the errors in the estimates. To view it, select

Evaluations -> Explore Performance
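
The same kind of view can be approximated offline by bucketing the residuals. This is a small sketch that assumes a residual is the actual value minus the predicted value, so a positive residual means the model underestimated. The bin width and the sample values are arbitrary choices for illustration.

```python
import math
from collections import Counter


def residual_histogram(actual, predicted, bin_width):
    """Count residuals (actual - predicted) per bin, keyed by the bin's lower edge."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        residual = a - p
        bin_start = math.floor(residual / bin_width) * bin_width
        counts[bin_start] += 1
    return dict(sorted(counts.items()))


# Made-up values for illustration only.
actual = [1000, 1200, 900, 1500]
predicted = [950, 1300, 900, 1250]
histogram = residual_histogram(actual, predicted, bin_width=100)
```

A histogram centered on zero suggests unbiased estimates, while a shift to one side points to systematic over- or underestimation.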


Generating Batch Predictions

You can start generating batch predictions and real-time predictions immediately.



You will need to create a datasource to generate batch predictions.



AWS Machine Learning charges an hourly rate for the compute time used to build predictive models, and then you pay for the number of predictions generated for your application. For real-time predictions you also pay an hourly reserved capacity charge based on the amount of memory required for your model.


For data analysis and model building, Amazon charges $0.42 per hour.

For batch predictions, it charges $0.10 per 1,000 predictions.

For real-time predictions, it charges $0.0001 per prediction, rounded up to the nearest penny.
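
As a quick worked example, the rates above can be combined into a rough cost estimate. The sketch uses only the three prices quoted here and leaves out the hourly reserved capacity charge for real-time predictions, because its rate is not given. The usage figures are invented.

```python
from decimal import Decimal, ROUND_CEILING

# Rates quoted in the article (US dollars).
COMPUTE_RATE_PER_HOUR = Decimal("0.42")
BATCH_RATE_PER_1000 = Decimal("0.10")
REALTIME_RATE_PER_PREDICTION = Decimal("0.0001")


def estimate_cost(compute_hours, batch_predictions, realtime_predictions):
    """Estimate the charge, excluding the real-time reserved capacity fee."""
    compute = COMPUTE_RATE_PER_HOUR * Decimal(compute_hours)
    batch = BATCH_RATE_PER_1000 * Decimal(batch_predictions) / Decimal(1000)
    realtime = (REALTIME_RATE_PER_PREDICTION * Decimal(realtime_predictions)).quantize(
        Decimal("0.01"), rounding=ROUND_CEILING  # round up to the nearest penny
    )
    return compute + batch + realtime


# Invented usage: 2 compute hours, 50,000 batch and 1,234 real-time predictions.
total = estimate_cost(compute_hours=2, batch_predictions=50000,
                      realtime_predictions=1234)
# 0.84 + 5.00 + 0.13 = 5.97
```

Decimal is used instead of float so that the penny rounding is exact.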


Comment by Dr. Dimitrios Geromichalos on November 19, 2015 at 12:34am

Very interesting article. I'd like to add that, from a risk management perspective, the median is not the only relevant quantile. Best and, especially, worst cases such as the 5% and 95% quantiles are also important. I have not seen anything like that in ML yet, but I suppose the information could easily be obtained from the residuals. This quantile information is (still) the most important one for banks and insurers when they calculate figures like Value at Risk and Economic Capital, and it should also be useful for anyone who wants to minimize risks.
