Scraping with 300 req/sec in R? Yes, you can!

Try the new non-blocking HTTP API in curl 2.1:

R sitemap example, Jeroen Ooms, 2016

This code demonstrates the new multi-request features in curl 2.0. It creates an index of all files on a web server with a given prefix by recursively following hyperlinks that appear in HTML pages.
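As a rough sketch of what the multi-request interface looks like (the URLs and callbacks below are placeholders for illustration, not the author's actual crawler code), requests are scheduled first and then performed concurrently in one event loop:

```r
# Minimal sketch of curl's non-blocking multi API (requires curl >= 2.0).
library(curl)

urls <- c("https://httpbin.org/get", "https://httpbin.org/html")

# Callback invoked when a request completes successfully
done_cb <- function(res) {
  cat(res$url, "->", res$status_code, "\n")
}

# Callback invoked when a request fails (DNS error, timeout, etc.)
fail_cb <- function(msg) {
  cat("failed:", msg, "\n")
}

# Schedule all requests; nothing is performed yet
for (u in urls) {
  curl_fetch_multi(u, done = done_cb, fail = fail_cb)
}

# Run the event loop: all scheduled requests execute concurrently
multi_run()
```

Because `multi_run()` drives all pending handles from a single pool, adding more URLs inside the `done` callback (as the recursive crawler does when it finds new links) keeps the loop fed without blocking on any single response.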

For each URL, we first perform an HTTP HEAD request (via the `nobody` option, i.e. CURLOPT_NOBODY) to retrieve the Content-Type header. If the server returns 'text/html', we perform a follow-up request that downloads the page and looks for hyperlinks.
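A minimal sketch of that HEAD-then-GET step, assuming a placeholder URL and using the xml2 package (a common choice, not necessarily the author's) to pull out the hyperlinks:

```r
# Sketch: HTTP HEAD via the 'nobody' option (CURLOPT_NOBODY) to inspect
# the Content-Type without downloading the response body.
library(curl)

url <- "https://www.r-project.org/"  # placeholder target
h <- new_handle(nobody = TRUE)
res <- curl_fetch_memory(url, handle = h)

# curl_fetch_memory exposes the Content-Type as res$type;
# it may carry a charset suffix such as "text/html; charset=utf-8"
if (grepl("^text/html", res$type)) {
  # Follow-up GET that actually downloads the page body
  page_res <- curl_fetch_memory(url)
  doc <- xml2::read_html(rawToChar(page_res$content))
  # Extract the href attribute of every anchor tag
  links <- xml2::xml_attr(xml2::xml_find_all(doc, "//a[@href]"), "href")
}
```

Skipping the body for non-HTML resources is what keeps the crawl cheap: images, archives, and binaries cost only a header round-trip.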

The network is stored in an environment like this: env[url] = (vector of links)
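An R environment behaves as a mutable hash map keyed by strings, which is why it suits this URL index; a minimal sketch (URLs are placeholders):

```r
# Environments in R act as hash maps with O(1) insert/lookup by key
env <- new.env(hash = TRUE)

# Store the links found on a page, keyed by that page's URL
env[["https://example.com/"]] <- c("https://example.com/about",
                                   "https://example.com/docs")

# Look up the outgoing links for a given page
env[["https://example.com/"]]

# List all URLs indexed so far
ls(env)
```

Unlike a named list, assigning into an environment modifies it in place, so callbacks can update the shared index without copying it on every insertion.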

WARNING: Don't target small servers; you might accidentally take them down and get banned for DoS. This hits up to 300 requests/sec on my home wifi.

