The following comprehensive listings were produced by analyzing our large member database, extracting the websites that our members mentioned or liked, and, for each website, identifying which of our seed keywords appear on its front page.
The member database (the non-mandatory sign-up questions and the choices offered to new members) was designed long ago by our in-house data scientist (me), precisely with analyses like this one in mind. Other analyses produced in the past include: 6,000 companies hiring data scientists, best cities for data scientists, demographics of data scientists, and 400 job titles for data scientists; see related links at the bottom of this article.
[Figure: The Whole Internet (source: Wikipedia)]
Seed keywords were used to identify, for each website, whether one or more of the keywords in our list was found on the front page, using a web crawler. This helps categorize websites; the final goal is the creation of a data science website taxonomy. The seed keywords that we used (hand-picked) are very popular data-science-related keywords:
We used a web crawler to browse all the URLs, after identifying and cleaning the website fields (URLs listed by members) in our member database. Click here to get the script used to summarize the data, as well as a sample of raw data. Note that improving this study is now a new project added to our list of projects for DSA candidates: in short, it consists of creating a niche search engine for data science, better than Google, along with a taxonomy of these websites. Candidates interested in this project will have access to the full data. Because this is based on data submitted by users, the raw data is quite messy and requires both cleaning and filtering. Details are found in my script; it's a good example of code used to clean relatively unstructured data. A rough sketch of these two steps appears below.
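The article links to the actual script; purely as an illustration, here is a minimal Python sketch of the two steps just described: normalizing a member-submitted website field into a crawlable URL, then fetching the front page and recording which seed keywords appear on it. The seed list shown, the helper names, and the cleaning rules are assumptions for this sketch, not the author's actual code.

```python
from urllib.parse import urlparse

import requests

# Hypothetical seed list -- the article's hand-picked list is longer.
SEED_KEYWORDS = ["data science", "machine learning", "big data", "analytics"]

def clean_url(raw):
    """Normalize a messy, member-submitted website field into a crawlable URL."""
    raw = raw.strip().lower()
    if not raw or " " in raw:                     # free-text answer, not a URL
        return None
    if not raw.startswith(("http://", "https://")):
        raw = "http://" + raw
    netloc = urlparse(raw).netloc
    if "." not in netloc:                         # no plausible domain name
        return None
    return "http://" + netloc + "/"

def front_page_keywords(url):
    """Fetch the front page and return the seed keywords found on it."""
    try:
        html = requests.get(url, timeout=10).text.lower()
    except requests.RequestException:
        return []
    return [kw for kw in SEED_KEYWORDS if kw in html]

url = clean_url("  www.Example-Data-Science-Blog.com ")
if url:
    print(url, front_page_keywords(url))
```

Real member data is far messier than this (stray punctuation, multiple URLs in one field, plain site names with no domain), which is why the actual cleaning script linked above is worth reading.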
Here we categorize the websites into four major clusters:
We provide direct clickable links only for domains in category 1 (above and below). These parameter choices were made to guarantee robustness in our results, to filter out noise, and for internal security reasons: listing hundreds of little-known websites (with clickable links) can get you penalized by Google, can result in many requests for link removal, and many of these links might die in the next few months, creating a bad user experience (and additional Google penalties).
The 2,500 Website Listing
Here are the links to the four major categories of data science websites:
The field between parentheses is the year when the website in question was first mentioned; it does not represent when the website was created, though it is a good proxy for how old the website is. The member database goes as far back as 2007. The list of keywords attached to each website shows which seed keywords were found on the front page when crawling the website. The number of stars (1, 2 or 3) indicates how popular the website is, based on how many members mentioned it. Of course, brand-new websites might not have 3 stars yet.
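To make the fields concrete, here is a small sketch of how one listing entry might be represented; the star thresholds below are invented for illustration, since the article does not publish the actual cut-offs.

```python
from dataclasses import dataclass

@dataclass
class SiteEntry:
    domain: str
    first_mentioned: int      # year the site first appeared in the member database
    keywords: list            # seed keywords found on the front page
    mentions: int             # how many members listed the site

    @property
    def stars(self):
        # Illustrative thresholds only -- the real cut-offs are not published.
        if self.mentions >= 50:
            return 3
        if self.mentions >= 10:
            return 2
        return 1

entry = SiteEntry("example.com", 2012, ["machine learning"], 27)
print(entry.domain, "(%d)" % entry.first_mentioned, "*" * entry.stars)
```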
Data and Source Code
Source code (two scripts: a web crawler / parser / summarizer, and code to produce the final HTML pages), as well as raw, intermediate and final data (samples, screenshots), and details about the 3-step procedure used to publish these listings, can be found here.
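As a rough idea of the final publishing step (the real scripts are linked above), here is a minimal sketch that turns summarized records into an HTML listing. The record format and page layout are assumptions, not the actual pipeline.

```python
from html import escape

# Assumed record format: (domain, year first mentioned, stars, keywords found).
records = [
    ("example.com", 2012, 3, ["machine learning", "analytics"]),
    ("another-blog.net", 2014, 1, ["big data"]),
]

rows = []
for domain, year, stars, keywords in sorted(records):
    # Per the article, only category-1 domains get clickable links;
    # this sketch links everything for brevity.
    rows.append(
        f'<li><a href="http://{escape(domain)}">{escape(domain)}</a> '
        f'({year}) {"*" * stars} - {escape(", ".join(keywords))}</li>'
    )

page = "<html><body><ul>\n" + "\n".join(rows) + "\n</ul></body></html>"
with open("listing.html", "w") as f:
    f.write(page)
```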
Our methodology for building this semi-categorized website listing has the following additional features:
Uncrawlable websites, bad domains
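The exact criteria used to flag these sites are not reproduced here, so the following is only a guess at one simple triage: treat a domain that does not resolve as a bad domain, and a domain that resolves but cannot serve its front page as uncrawlable.

```python
import socket

import requests

def classify(domain):
    """Rough triage: 'ok', 'bad_domain' (no DNS record), or 'uncrawlable'."""
    try:
        socket.gethostbyname(domain)   # does the domain resolve at all?
    except socket.gaierror:
        return "bad_domain"
    try:
        r = requests.get("http://" + domain + "/", timeout=10)
        return "ok" if r.status_code < 400 else "uncrawlable"
    except requests.RequestException:
        return "uncrawlable"

print(classify("example.com"))
```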
Possible Improvements, Next Steps
There are various ways to improve my methodology and the quality of the results. Here I mention a few:
Also, if you want your website to be listed, create a DSC profile and publish your website on your profile (look for the question about "favorite website" on sign-up).
Finally, if interested, join our Data Science Apprenticeship to work on and improve this project, turning it into a search engine and full taxonomy that updates automatically every day based on data gathered by the web crawlers. Check project #7 in our list of business/applied projects. Time permitting, I will publish more advanced web crawling based on my infringement detection app (currently paused).