The Yahoo News Feed dataset is a collection based on a sample of anonymized user interactions on the news feeds of several Yahoo properties, including the Yahoo homepage, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Movies, and Yahoo Real Estate. The dataset contains about 110 billion lines (1.5 TB bzipped) of user-news item interaction data, collected by recording the interactions of about 20M users with news items from February 2015 to May 2015. In addition to the interaction data, we are providing demographic information (age segment and gender) and the city in which the user is based for a subset of the anonymized users. On the item side, we are releasing the title, summary, and key-phrases of the pertinent news article. The interaction data is timestamped with the user’s local time and also contains partial information about the device on which the user accessed the news feeds, which allows for interesting work in contextual recommendation and temporal data mining.
The dataset may be used by researchers to validate recommender systems, collaborative filtering methods, context-aware learning, large-scale learning algorithms, transfer learning, user behavior modeling, content enrichment and unsupervised learning methods.
The readme file for this dataset is located in part 1 of the download. Please refer to the readme file for a detailed overview of the dataset.
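Because the data ships as bzip2-compressed text, a part can be inspected without first decompressing it to disk. The following is a minimal Python sketch of that idea, not an official loader: the function names are made up for illustration, UTF-8 encoding is an assumption, and the record layout is deliberately not parsed here because it is documented only in the readme.

```python
import bz2
from itertools import islice
from typing import Iterator, List


def iter_lines(path: str) -> Iterator[str]:
    """Yield lines from a bzip2-compressed text file one at a time.

    The file is decompressed as a stream, so memory use stays small
    regardless of the file size. UTF-8 is an assumption; undecodable
    bytes are replaced rather than raising an error.
    """
    with bz2.open(path, mode="rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            yield line.rstrip("\n")


def peek(path: str, n: int = 5) -> List[str]:
    """Return the first n lines, e.g. to compare against the readme."""
    return list(islice(iter_lines(path), n))


def count_lines(path: str) -> int:
    """Count lines by streaming through the whole file once."""
    return sum(1 for _ in iter_lines(path))
```

Calling `peek("part-01.bz2")` (a hypothetical file name) would return the first five raw lines of that part, which can then be checked against the field descriptions in the readme before writing a real parser.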
Click here to access the Yahoo dataset.
For large collections of data set repositories (the ones featured in the above picture), click here.
This dataset is no longer available.
Actually it was used in a Stanford class a couple of years ago.
Assignment link: https://sing.stanford.edu/cs303-sp10/assignments/assignment2.html
I used this data set for a dplyr example at our East Bay R Meetup. See the R Markdown
Other related talks in my archives: https://ds4ci.org/archives/