13 Machine Learning Data Set Collections

Here are 13 resources on Machine Learning data sets.

Landsat on AWS

Landsat 8 data is available for anyone to use via Amazon S3. All Landsat 8 scenes from 2015 are available, along with a selection of cloud-free scenes from 2013 and 2014. New Landsat 8 scenes are made available each day, often within hours of production. MathWorks has created a freely downloadable tool for accessing, processing, and visualizing Landsat on AWS data in MATLAB. With this tool, you can create a map display of scene locations with markers that show each scene’s metadata.

Category: GIS, Sensor Data, Satellite Imagery, Natural Resource



NASA NEX

NASA NEX is a collaboration and analytical platform that combines state-of-the-art supercomputing, Earth system modeling, workflow management and NASA remote-sensing data. Through NEX, users can explore and analyze large Earth science data sets, run and share modeling algorithms, collaborate on new or existing projects and exchange workflows and results within and among other science communities.


Common Crawl Corpus

A corpus of web crawl data composed of over 5 billion web pages. This data set is freely available on Amazon S3 and is released under the Common Crawl Terms of Use.


1000 Genomes Project and AWS

The 1000 Genomes Project is an international collaboration which has established the most detailed catalogue of human genetic variation, including SNPs, structural variants, and their haplotype context. The final phase of the project sequenced more than 2500 individuals from 26 different populations around the world and produced an integrated set of phased haplotypes with more than 80 million variants for these individuals. The Amazon mirror contains the complete data set from the project and the data can be found at: s3.amazonaws.com/1000genomes.
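As a rough illustration of browsing the Amazon mirror, the sketch below requests one page of object keys from the bucket address given above and parses the XML reply. It assumes the bucket permits anonymous listing through the standard S3 ListObjectsV2 query parameters, which may not hold; the helper names and the sample key `example/file.txt` are made up for the demo.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by S3 listing responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}


def parse_listing(xml_text):
    """Extract (key, size) pairs from an S3 ListObjectsV2 XML response."""
    root = ET.fromstring(xml_text)
    return [
        (item.findtext("s3:Key", namespaces=S3_NS),
         int(item.findtext("s3:Size", namespaces=S3_NS)))
        for item in root.findall("s3:Contents", S3_NS)
    ]


def list_bucket(prefix="", max_keys=5):
    """Fetch one page of keys from the mirror (needs network access).

    Assumption: anonymous listing is allowed on this bucket.
    """
    query = urllib.parse.urlencode(
        {"list-type": "2", "max-keys": str(max_keys), "prefix": prefix}
    )
    url = "https://s3.amazonaws.com/1000genomes?" + query
    with urllib.request.urlopen(url) as response:
        return parse_listing(response.read().decode("utf-8"))


# Offline demo on a hand-written response; the key below is hypothetical.
sample = (
    '<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">'
    "<Name>1000genomes</Name>"
    "<Contents><Key>example/file.txt</Key><Size>42</Size></Contents>"
    "</ListBucketResult>"
)
listing = parse_listing(sample)
print(listing)
```

Calling `list_bucket()` with network access would return the first few real keys, if the assumption about anonymous listing holds.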


MNIST database of handwritten digits

The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting.
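The MNIST files are distributed in the simple IDX binary format: a 4-byte header (two zero bytes, a type code, and the number of dimensions), then one 4-byte big-endian size per dimension, then the raw values. A minimal sketch of a reader for the unsigned-byte case follows; `parse_idx` is an illustrative name, and the sample bytes are synthetic rather than taken from the real files.

```python
import struct


def parse_idx(data):
    """Parse unsigned-byte IDX data into (shape, flat list of values)."""
    # Header: two zero bytes, type code (0x08 = unsigned byte), dimension count.
    zero, dtype_code, ndim = struct.unpack(">HBB", data[:4])
    if zero != 0 or dtype_code != 0x08:
        raise ValueError("not an unsigned-byte IDX file")
    # Each dimension size is a 4-byte big-endian unsigned integer.
    offset = 4 + 4 * ndim
    shape = struct.unpack(">" + "I" * ndim, data[4:offset])
    values = list(data[offset:])
    expected = 1
    for size in shape:
        expected *= size
    if len(values) != expected:
        raise ValueError("payload length does not match header")
    return shape, values


# Synthetic stand-in for a label file holding three labels: 5, 0, 4.
sample = struct.pack(">HBBI", 0, 0x08, 1, 3) + bytes([5, 0, 4])
shape, labels = parse_idx(sample)
print(shape, labels)
```

The downloadable files are typically gzip-compressed, so in practice the bytes would first be read with something like `gzip.open(path, "rb").read()`.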


UCI Machine Learning Repository

The UC Irvine Machine Learning Repository currently maintains 333 datasets as a service to the machine learning community.


Delve Datasets

The Delve datasets and families are available from this page. Every dataset (or family) has a brief overview page and many also have detailed documentation. You can download gzipped-tar files of the datasets, but you will require the delve software environment to get maximum benefit from them. Datasets are categorized as primarily assessment, development or historical according to their recommended use. Within each category we have distinguished datasets as regression or classification according to how their prototasks have been created.
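Even without the delve environment, a downloaded gzipped-tar file can be inspected with Python's standard `tarfile` module. The sketch below lists the regular files in an archive; it builds a tiny in-memory archive so it runs without a download, and the member name `demo/Dataset.data` is invented for the demo rather than taken from an actual Delve archive.

```python
import io
import tarfile


def list_members(archive_bytes):
    """Return (name, size) for every regular file in a gzipped tar archive."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        return [(m.name, m.size) for m in tar.getmembers() if m.isfile()]


def make_demo_archive():
    """Build a small in-memory .tar.gz (hypothetical contents)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        payload = b"1.0 2.0 3.0\n"
        info = tarfile.TarInfo(name="demo/Dataset.data")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


members = list_members(make_demo_archive())
print(members)
```

For a file on disk, `tarfile.open(path, mode="r:gz")` can be used directly in place of the in-memory buffer.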


Data sets for nonlinear dimensionality reduction

Data sets for nonlinear dimensionality reduction provides datasets for Swiss roll and Faces.
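For readers unfamiliar with the Swiss roll, it is a 2-D sheet rolled into a spiral in 3-D, which makes it a standard test case for nonlinear dimensionality reduction. The sketch below generates a synthetic one using a commonly used parametrization; it is not the data hosted on the page above, and the function name, constants, and default seed are illustrative choices.

```python
import math
import random


def swiss_roll(n_points, seed=0):
    """Sample 3-D points lying on a Swiss roll surface."""
    rng = random.Random(seed)
    points = []
    for _ in range(n_points):
        # t is both the angle and the radius, so the sheet spirals outward.
        t = 1.5 * math.pi * (1 + 2 * rng.random())
        height = 21 * rng.random()
        points.append((t * math.cos(t), height, t * math.sin(t)))
    return points


roll = swiss_roll(1000)
print(len(roll))
```

A dimensionality reduction method is expected to "unroll" these points back to the two underlying coordinates, `t` and `height`.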



mldata

mldata is a machine learning dataset repository. It contains more than 800 publicly archived data sets, each with ratings, view counts, download counts, and comments.


Mammographic Image Analysis

When benchmarking an algorithm, it is advisable to use a standard test database (data set) so that researchers can directly compare results. Most mammographic databases are not publicly available. The most easily accessed, and therefore most commonly used, databases are the Mammographic Image Analysis Society (MIAS) database and the Digital Database for Screening Mammography (DDSM).



Mulan

Mulan: A Java Library for Multi-Label Learning provides multi-label classification datasets and multi-target regression datasets.


Auton Lab Datasets

The Auton Lab encourages researchers to examine and replicate their findings. To facilitate this goal, they provide datasets identical to those used in their published works.


Datasets for "The Elements of Statistical Learning"

Datasets for "The Elements of Statistical Learning" provides datasets in categories such as Bone Mineral Density, Countries, Galaxy, and many more.

This list was compiled by Ogmer.


Comment by Sudhir Kumar on January 22, 2016 at 11:10am

Great List. Thanks. And then there is tons of data available for Free now. We are in Beta. Register and Download DataSets for FREE now:

Comment by Mohinder Dick on December 18, 2015 at 2:51am

I am looking for a dataset with observations on US or European car prices. Can anyone suggest a good one? Preferably I would like observations in the thousands, if not tens of thousands. I found a few with four to five hundred.

Comment by Brendan Martin on October 27, 2015 at 6:11pm

Have you worked with any of these yourself? I'm interested in working with NASA NEX. Seems really interesting.
