Two big datasets to challenge your data science expertise

Stack Exchange data dump

This is an anonymized dump of all user-contributed content on the Stack Exchange network. Each site is formatted as a separate archive consisting of XML files zipped via 7-zip using bzip2 compression. Each site archive includes Posts, Users, Votes, Comments, PostHistory and PostLinks. For complete schema information, see the included readme.txt.
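As a rough illustration of how one of these XML files might be read once the archive has been extracted, here is a minimal Python sketch that streams through a Posts file and counts posts by type. It assumes each record is a `<row>` element carrying its fields as attributes, and that `PostTypeId` is one of those attributes; check the included readme.txt for the actual schema. The function name `count_post_types` and the in-memory sample are made up for this example.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_post_types(source):
    """Stream <row> elements and count them by their PostTypeId attribute.

    `source` is a file path or a binary file object. The element and
    attribute names are assumptions; verify them against readme.txt.
    """
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            counts[elem.get("PostTypeId", "unknown")] += 1
            elem.clear()  # drop the row's attributes once it has been counted
    return counts


# Tiny in-memory sample standing in for an extracted Posts.xml (made-up rows).
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" Score="5" />
  <row Id="2" PostTypeId="2" Score="3" />
  <row Id="3" PostTypeId="2" Score="1" />
</posts>"""

sample_counts = count_post_types(io.BytesIO(SAMPLE))
print(dict(sample_counts))
```

For a real file, pass the path of the extracted XML instead of the in-memory sample. Reading element by element keeps the whole document from being loaded at once, which matters for the larger sites.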

Click here to see context.

Medicare Claims data set

Available when you participate in the new Cloudera challenge.

The new Data Science Challenge: Detecting Anomalies in Medicare Claims will be available starting March 31, 2014. It costs $600 to participate. I guess they are worried that the data might be resold, or about other potential data leaks. They also want real practitioners (an issue in Kaggle competitions), as students are unlikely to fork out $600. But if you participate, you get a copy of Hadoop to install on your laptop; this copy emulates multi-node Hadoop.

Links to other data sets


