Try the new non-blocking HTTP API in curl 2.1
R sitemap example, Jeroen Ooms, 2016
This code demonstrates the new multi-request features in curl 2.0. It creates an index of all files on a web server with a given prefix by recursively following hyperlinks that appear in HTML pages.
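For context, here is a minimal sketch of the non-blocking multi interface these snippets build on: requests queued with curl_fetch_multi() are executed concurrently by multi_run(), which fires a callback as each response arrives (the URLs below are placeholders):

    library(curl)

    urls <- c("https://example.com/", "https://example.com/about")  # placeholders

    # Queue the requests; nothing is sent over the wire yet.
    for (u in urls) {
      curl_fetch_multi(u,
        done = function(res) cat(res$status_code, res$url, "\n"),
        fail = function(msg) message("request failed: ", msg))
    }

    # Drive all queued requests concurrently until the pool is drained.
    multi_run()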
For each URL, we first perform an HTTP HEAD request (via CURLOPT_NOBODY, exposed as the nobody option in the R package) to retrieve its Content-Type header. If the server reports 'text/html', we perform a follow-up request that downloads the page and scans it for hyperlinks, as sketched below.
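A hedged sketch of that two-step flow using the curl and xml2 packages; head_then_get() and the recursion hook are hypothetical names of mine, while new_handle(), curl_fetch_multi(), multi_run(), and the nobody option are the packages' actual API:

    library(curl)
    library(xml2)

    # Hypothetical helper: HEAD the URL first; only download pages the
    # server reports as text/html, then harvest their hyperlinks.
    head_then_get <- function(url) {
      h <- new_handle(nobody = TRUE)  # CURLOPT_NOBODY: headers only, no body
      curl_fetch_multi(url, handle = h,
        done = function(res) {
          if (isTRUE(grepl("^text/html", res$type))) {
            curl_fetch_multi(url, done = function(res) {
              doc <- read_html(rawToChar(res$content))
              href <- xml_attr(xml_find_all(doc, "//a[@href]"), "href")
              links <- url_absolute(href, res$url)  # resolve relative URLs
              # ... schedule head_then_get() for each unseen link here ...
            })
          }
        },
        fail = function(msg) message("HEAD failed for ", url, ": ", msg))
    }

    head_then_get("https://example.com/")  # placeholder root
    multi_run()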
The link graph is stored in an R environment used as a hash map, like this: env[[url]] = (character vector of links found on that page); see the sketch below.
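Concretely, that store gives constant-time inserts and visited-checks keyed by URL (the URLs here are placeholders):

    pages <- new.env()

    # Record the links found on a crawled page:
    assign("https://example.com/",
           c("https://example.com/a", "https://example.com/b"), envir = pages)

    # Visited-check before scheduling a URL again:
    exists("https://example.com/a", envir = pages)

    # Dump the whole index as a named list once the crawl finishes:
    str(as.list(pages))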
WARNING: Don't target small servers; you might accidentally take them down and get banned for DoS. This hits up to 300 requests/sec on my home wifi.
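One way to be gentler on a host is to cap the connection pool before crawling; a sketch using curl::multi_set() (the limits are illustrative, not recommendations):

    library(curl)

    # Cap concurrency: at most 20 connections in total, 4 per host.
    multi_set(total_con = 20, host_con = 4)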