The purpose of this project is to create a powerful mechanism to generate high-quality traffic for any website, via automated or semi-automated feed syndication.
How would you proceed to identify all the main digital publishers that accept RSS feeds? This question actually corresponds to one of the projects currently accepted in our data science apprenticeship.
In its simplest form, we would provide you with hundreds of websites (click here to download a sample list). You would check on Alexa (or other sources) where their traffic comes from, eliminate all general referrers such as Google, LinkedIn, Yahoo, Facebook, and StackOverflow, and look only at specialized ones such as BusinessIntelligence.com to identify the sources (domains and web pages) that accept external RSS feeds. Indeed, that's how I discovered BusinessIntelligence.com: it was listed as a referrer on Alexa for another data science website (say xyz.com, though this is not the actual domain name). BusinessIntelligence.com accepts feeds, which is how xyz.com gets some of its traffic and why Alexa lists BusinessIntelligence.com as a referrer for xyz.com.
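The filtering step could be sketched roughly as follows. This is only an illustration: the blocklist is limited to the general referrers named in the text, the function names are hypothetical, and it assumes you already have the referrer list for a site as plain domains or URLs (how you obtain it from Alexa or another source is not covered here).

```python
from urllib.parse import urlparse

# Assumed blocklist: the general referrers named in the text. Extend as needed.
GENERAL_REFERRERS = {
    "google.com", "linkedin.com", "yahoo.com", "facebook.com", "stackoverflow.com",
}

def normalize_domain(url_or_domain: str) -> str:
    """Return a lowercase host without scheme, path, or leading 'www.'."""
    text = url_or_domain.strip().lower()
    if "://" not in text:
        text = "//" + text  # lets urlparse treat a bare domain as a network location
    host = urlparse(text).hostname or ""
    return host[4:] if host.startswith("www.") else host

def is_general(domain: str) -> bool:
    """True if the domain is, or is a subdomain of, a general referrer."""
    return any(domain == g or domain.endswith("." + g) for g in GENERAL_REFERRERS)

def specialized_referrers(referrers):
    """Keep unique specialized referrer domains, preserving input order."""
    seen, kept = set(), []
    for referrer in referrers:
        domain = normalize_domain(referrer)
        if domain and not is_general(domain) and domain not in seen:
            seen.add(domain)
            kept.append(domain)
    return kept
```

The domains that survive this filter are only candidates; each one still has to be checked, by hand or by a crawler, to see whether it really accepts external RSS feeds.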
A more elaborate version of this project is to create an RSS feed exchange where every publisher can swap RSS feeds with other publishers, usually based on category similarity. Feeds would be categorized (hundreds of categories), and you would pre-populate the feed database with hundreds of publisher URLs (not feed submission websites) where you can submit your feed. The taxonomy creation, the detection of seed URLs to get started (e.g. using 1 million URLs from Quantcast; check the "download Top US Sites" box at the bottom of this page to get the list in question, or download it from here, it's 8 MB compressed), and the crawling, data extraction, and data processing are data science sub-projects. One crude approach is to run 1 million Google searches, one per domain in the 1-million-domain list, for keywords such as domain+submit+RSS+feed, where domain is replaced by the actual domain name. This is just an idea to get you started, but I'm sure you can do better and refine my approach.
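As a starting point, generating the search queries from the domain list might look like the sketch below. The function names are hypothetical, and the code only builds the query strings; actually running the searches (through whatever search interface you have access to, and within its terms of use) is left out.

```python
from urllib.parse import quote_plus

# Keywords appended to every domain, following the domain+submit+RSS+feed pattern.
QUERY_TERMS = ("submit", "RSS", "feed")

def build_query(domain: str) -> str:
    """Return the plain-text search query for one domain."""
    return " ".join((domain.strip().lower(), *QUERY_TERMS))

def build_queries(domains):
    """Yield (domain, query) pairs, skipping blanks and duplicate domains."""
    seen = set()
    for raw in domains:
        domain = raw.strip().lower()
        if domain and domain not in seen:
            seen.add(domain)
            yield domain, build_query(domain)

def encode_query(query: str) -> str:
    """URL-encode a query so spaces become '+', as in domain+submit+RSS+feed."""
    return quote_plus(query)
```

Because `build_queries` is a generator, the 1-million-domain list can be streamed from a file line by line instead of being loaded into memory at once.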