Ten Favorite Open Data Libraries by Justin Tenuto

There are precious few things that everybody adores. Once you get past breakfast in bed and two dollar bills, the list starts to look a little barren. But if there’s one thing we can agree on as a society it’s this: free stuff is good and cool and you want some of it right now.

In the spirit of this immutable law, we’ve compiled a list of our ten favorite places to find open data. Here they are, in no particular order.

Data is Plural: A tremendous weekly newsletter that will make opening your inbox on Wednesdays a joy. Just this week, Jeremy Singer-Vine sent out datasets ranging from cancer statistics to a survey of Scottish witchcraft. Try not subscribing after hearing that. We dare you.

UC Irvine Machine Learning Repository: One of the oldest open data portals online, the UC Irvine ML Repository has some of the oldest and most famous open datasets in the field. Relevant papers are linked from each dataset, and the datasets themselves are really well organized. Plus, it sort of feels like a website from 1995, which is quaint and lovely.

Data.gov: With nearly 200,000 datasets of all stripes, data.gov was one of those rare political promises that actually came true. It’s a well-organized collection that’s consistently updated. They reply to requests and emails, which we can speak to from personal experience.

Open Data Inception: This one’s a gem. ODI contains over 1,600 open data portals from every corner of the globe. Except you, Greenland. Get your shit together.

Kaggle Datasets: The fine folks at Kaggle just launched their open data page a week or two ago, but it’s already in a great place. The best part is it’s integrated directly with Kaggle, so you can show off the scripts you write, get feedback from other data scientists, and publish fancypants graphs.

Data for Everyone: Since a couple of our publicly available datasets found their way to Kaggle, here’s a (not at all) subtle reminder that CrowdFlower posts the best open datasets that find their way through our platform. In fact, we just put one up today. That’s called synergy, y’all.

Yahoo! Webscope Datasets: Scroll about halfway down the page and you’ll find a cluster of datasets ranging from search marketing advertiser bidding to tagged Flickr images.

Stanford Large Network Dataset Collection: The “large” in the title is no joke. The Stanford datasets range from thousands to billions of edges and it’s likely the only place you’ll see the word “Friendster” today. Besides this blog. You get the idea.

rs.io 100: Robb Seaton spent himself a lot of time compiling this diverse and curated collection of one hundred some-odd datasets from various nooks and crannies of the web. Anyplace you can analyze 10,000 annotated cat images, 12 million bibliographic records, and 260 terabytes of genome data is a good place to be indeed.

Amazon Web Services (AWS) Public Data Sets: Yet another searchable repository with oodles of open datasets. Some really fun stuff in here, like Wikipedia traffic statistics, NASA climate projections, and the Sloan Digital Sky Survey.

If we missed any open data hubs you particularly enjoy, do let us know. We’d be happy to update.
