Stack Exchange data dump
This is an anonymized dump of all user-contributed content on the Stack Exchange network. Each site is formatted as a separate archive consisting of XML files zipped via 7-zip using bzip2 compression. Each site archive includes Posts, Users, Votes, Comments, PostHistory and PostLinks. For complete schema information, see the included readme.txt.
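Once an archive has been extracted, the XML files can be read with a streaming parser so that large files such as Posts do not have to fit in memory. The following is a minimal sketch, not an official reader: it assumes each record is a `row` element whose fields are XML attributes, and the attribute name `PostTypeId` and the sample data are made up for illustration. Check the included readme.txt for the actual schema.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def count_rows_by_attribute(source, attribute):
    """Stream an XML file and tally the values of one attribute on <row> elements.

    Assumption: records are <row .../> elements with fields stored as attributes.
    """
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            counts[elem.get(attribute)] += 1
            elem.clear()  # drop the element's contents once it has been counted
    return counts


# Hypothetical sample standing in for an extracted Posts.xml file.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" Score="5" />
  <row Id="2" PostTypeId="2" Score="3" />
  <row Id="3" PostTypeId="2" Score="0" />
</posts>"""

counts = count_rows_by_attribute(io.BytesIO(SAMPLE), "PostTypeId")
print(dict(counts))
```

To run it against a real file, pass the path of the extracted XML file instead of the in-memory sample; `ET.iterparse` accepts either a filename or a file object.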
Medicare Claims data set
Available when you participate in the new Cloudera challenge.
The new Data Science Challenge: Detecting Anomalies in Medicare Claims will be available starting March 31, 2014. It costs $600 to participate. I guess they are worried about the data being resold or about other potential data leaks. They also want real practitioners (an issue in Kaggle competitions), as students are unlikely to fork out $600. But if you participate, you get a copy of Hadoop to install on your laptop; this copy emulates multi-node Hadoop.
Links to other data sets