Scraping with 300 req/sec in R? Yes, you can!

Try the new non-blocking HTTP API in curl 2.1:

R sitemap example, Jeroen Ooms, 2016

This code demonstrates the new multi-request features in curl 2.0. It creates an index of all files on a web server with a given prefix by recursively following hyperlinks that appear in HTML pages.
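As a rough sketch of what the multi-request interface looks like (the URLs and callbacks below are placeholders for illustration, not the author's actual crawler code), requests are scheduled first and then performed concurrently in one event loop:

```r
# Minimal sketch of curl's non-blocking multi API (requires curl >= 2.0).
library(curl)

urls <- c("https://httpbin.org/get", "https://httpbin.org/html")

# Callback invoked when a request completes successfully
done_cb <- function(res) {
  cat(res$url, "->", res$status_code, "\n")
}

# Callback invoked when a request fails (DNS error, timeout, etc.)
fail_cb <- function(msg) {
  cat("failed:", msg, "\n")
}

# Schedule all requests; nothing is performed yet
for (u in urls) {
  curl_fetch_multi(u, done = done_cb, fail = fail_cb)
}

# Run the event loop: all scheduled requests execute concurrently
multi_run()
```

Because `multi_run()` drives all pending handles from a single pool, adding more URLs inside the `done` callback (as the recursive crawler does when it finds new links) keeps the loop fed without blocking on any single response.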

For each URL, we first perform an HTTP HEAD request (via the `nobody` option, i.e. CURLOPT_NOBODY) to retrieve the Content-Type header. If the server returns 'text/html', we perform a follow-up request that downloads the page and looks for hyperlinks.
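A minimal sketch of that HEAD-then-GET step, assuming a placeholder URL and using the xml2 package (a common choice, not necessarily the author's) to pull out the hyperlinks:

```r
# Sketch: HTTP HEAD via the 'nobody' option (CURLOPT_NOBODY) to inspect
# the Content-Type without downloading the response body.
library(curl)

url <- "https://www.r-project.org/"  # placeholder target
h <- new_handle(nobody = TRUE)
res <- curl_fetch_memory(url, handle = h)

# curl_fetch_memory exposes the Content-Type as res$type;
# it may carry a charset suffix such as "text/html; charset=utf-8"
if (grepl("^text/html", res$type)) {
  # Follow-up GET that actually downloads the page body
  page_res <- curl_fetch_memory(url)
  doc <- xml2::read_html(rawToChar(page_res$content))
  # Extract the href attribute of every anchor tag
  links <- xml2::xml_attr(xml2::xml_find_all(doc, "//a[@href]"), "href")
}
```

Skipping the body for non-HTML resources is what keeps the crawl cheap: images, archives, and binaries cost only a header round-trip.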

The network is stored in an environment like this: env[url] = (vector of links)
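An R environment behaves as a mutable hash map keyed by strings, which is why it suits this URL index; a minimal sketch (URLs are placeholders):

```r
# Environments in R act as hash maps with O(1) insert/lookup by key
env <- new.env(hash = TRUE)

# Store the links found on a page, keyed by that page's URL
env[["https://example.com/"]] <- c("https://example.com/about",
                                   "https://example.com/docs")

# Look up the outgoing links for a given page
env[["https://example.com/"]]

# List all URLs indexed so far
ls(env)
```

Unlike a named list, assigning into an environment modifies it in place, so callbacks can update the shared index without copying it on every insertion.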

WARNING: Don't target small servers; you might accidentally take them down and get banned for DoS. This hits up to 300 requests/sec on my home wifi.

