For explanations about the methodology, including source code and possible improvements, read our main article on this subject. It also provides links to our other three listings.

The year in parentheses represents when the website in question was first mentioned - it does not represent when the website was created, though it is a good proxy for how old the website is. The member database goes as far back as 2007. The list of keywords attached to each website represents which seed keywords were found on its front page when crawling the website. The number of stars (1, 2 or 3) indicates how popular the website is, based on how many members mentioned it. Of course, brand new websites might not have 3 stars yet.

Notes

  • As many as 800 out of the 2,800 original websites could not be crawled. I re-ran the crawler on these websites a few hours later, increasing the value of the time-out parameter and using a different user agent string in the code (the $ua->agent argument, for those familiar with the LWP::UserAgent web crawling library); a sketch of this retry step is shown after these notes. I then re-ran the crawler a few more times the same day, and eventually reduced the number of uncrawlable websites to about 300. Trying again another day, with a different IP address and following the robots.txt protocol (fetching robots.txt on each failed website), might further reduce the number of failed crawls. However, about 250 of the uncrawlable websites were simply non-existent, mostly because of typos in member fields (user-entered information) in our database.
  • Some of the uncrawlable websites result from various redirect mechanisms that cause my script to fail, or from redirects to an https address (rather than http).
  • I extracted the error information for all uncrawlable websites. Typically, the "500 Bad Domain" error means that the domain does not exist (rarely, it is a redirect issue). Sometimes adding www will help (changing mydomain.com to www.mydomain.com).
  • Some of the "bad domains" with 1 or 2 mentions, were actually irrelevant and dead websites posted by spammers. So this analysis allowed us to find a few spammers, and eliminate them!

Here's the listing

  • www-01.ibm.com (2010) ***
  • spss.com (2008) ***
  • greenplum.com (2011) ***
  • ats.ucla.edu (2008) ***
  • dmg.org (2008) ***
  • gigaom.com (2011) ***
  • datasciencetoolkit.org (2012) **
  • web.analytics.yahoo.com (2011) **
  • informationmanagement.com (2011) **
  • coremetrics.com (2008) **
  • bytemining.com (2012) **
  • dynamicdataanalytics.org (2013) **
  • mongodb.org (2013) **
  • autonlab.com (2008) **
  • dataminingtools.net (2010) **
  • rapidinsightinc.com (2011) **
  • jstatsoft.org (2008) **
  • visual-literacy.org (2012) *
  • neuralmarkettrends.com (2008) *
  • z-consulting-llc.com (2010) *
  • revolution-computing.com (2008) *
  • datamining.com (2010) *
  • portraitsoftware.com (2010) *
  • biguidance.com (2012) *
  • blogs.forrester.com (2012) *
  • or-exchange.com (2008) *
  • fitzgeraldanalytics.com (2011) *
  • unica.com (2010) *
  • me.utexas.edu (2011) *
  • stat.ucla.edu (2012) *
  • metamarkets.com (2012) *
  • deloitteanalytics.com (2011) *
