A Visual Studio 2013 demo project including the WebpageDownloader and LinkCrawler can be downloaded here.
The US digital universe currently doubles in size approximately every three years. In fact, Hewlett-Packard estimates that by the end of this decade, the digital universe will be measured in 'brontobytes', each of which represents one billion exabytes, or around two quadrillion years of music. Each minute, internet users send over 204 million emails, Google receives over 4 million search requests, YouTube users upload 72 hours of video, and 277,000 tweets are posted on Twitter. It is estimated that in 2012, only half of one percent of the data in the US digital universe was analyzed. Data mining web content efficiently is becoming an all too common task. When crawling web pages or downloading website content, a single-threaded approach can often leave one waiting an unreasonable amount of time for critical information. In previous articles, I have written on many topics related to extracting valuable information from the data contained in webpages (references are included below). In the machine learning world, we typically refer to this as extracting "features" from webpages.
One of the most basic features which can be extracted from web content is a webpage's HTML, which can simply be downloaded using C#. Webpage HTML can also be processed further to produce more elaborate and complex features. For example, the HTML downloaded from a webpage contains very different information than the text and images which are rendered in the browser from that same URL. The list below provides just a few examples of valuable features which can be extracted when mining webpages, including references for further reading on some topics:
However, almost all successful webpage feature extraction begins with simply downloading webpages. The download process can also be the largest bottleneck in a feature extraction system, especially when you are downloading content that you did not create and may have no idea where the actual content is located. HTML can be full of errors, missing scripts, or malicious content, or it may be hosted on very old or unresponsive servers; any number of other problems can cause much grief during the download process. In fact, when you are mining thousands, tens of thousands, or millions of webpages, you are guaranteed to run into webpages that cause the download process to either fail or hang. The remainder of this article demonstrates how to successfully mine large volumes of webpages by downloading them in parallel, keeping a detailed log of your progress, and even retrying downloads that fail or time out.
I have created a tool in C# called the "Webpage Downloader". This class can be used within any C# program to download large volumes of webpage content in parallel. Creating the class in a C# program is very simple as shown in Figure 1.
Figure 1 - Creating the Webpage Downloader.
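Since the figure's code ships with the demo project, a minimal sketch of the configuration it captures is shown below. The variable and parameter names here are assumptions; the values mirror the settings used later in this article (100 concurrent downloads, 3 attempts per URL, a 60-second timeout, and a log file).

```csharp
using System;

// Hypothetical settings for a WebpageDownloader instance; the real class
// ships with the demo project, so these names are illustrative only.
int maxConcurrentDownloads = 100;  // how many downloads run at once
int maxDownloadAttempts = 3;       // attempts before a URL is abandoned
TimeSpan downloadTimeout = TimeSpan.FromSeconds(60);
string logFilePath = "downloadLog.txt";

// The downloader would then be created with these settings, e.g.:
// var downloader = new WebpageDownloader(
//     maxConcurrentDownloads, maxDownloadAttempts, downloadTimeout, logFilePath);
Console.WriteLine($"Configured for {maxConcurrentDownloads} concurrent downloads");
```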
Each webpage downloaded by the Webpage Downloader is returned in a WebPageItem class. The WebPageItem class includes all of the relevant details for a particular download and could easily be extended to meet the business requirements of a particular project. Figure 2 shows each of the current properties supported by the WebPageItem class, which are visible in the class constructor.
Figure 2 - The Webpage Downloader's Class Constructor.
The WebPageItem in Figure 2 contains each downloaded webpage's URL, its HTML, the server's response URL (important for detecting a redirect), an error flag, an error message, a flag indicating whether the server provided a successful response, and the URL's byte array data if the requested URL is downloaded as a binary file. Any number of new properties or methods could be added to the WebPageItem class to extend the webpage-related features collected during the download process.
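The properties just described can be sketched with a tuple so the example is self-contained; the names come from the article's description rather than the demo project's exact source, and the sample values are purely illustrative.

```csharp
using System;

// The WebPageItem from Figure 2, sketched here as a named tuple for
// brevity. The real version is a class in the demo project.
(string Url, string Html, string ResponseUrl,
 bool Error, string ErrorMessage, bool Success, byte[] BinaryData) page =
    (Url: "http://www.jakemdrew.com/",
     Html: "<html>...</html>",
     ResponseUrl: "http://www.jakemdrew.com/",  // differs from Url on a redirect
     Error: false,
     ErrorMessage: null,
     Success: true,
     BinaryData: null);                         // populated only for binary downloads

// A redirect can be detected by comparing the requested and response URLs.
bool wasRedirected = page.Url != page.ResponseUrl;
Console.WriteLine($"Redirected: {wasRedirected}");
```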
Figure 3 shows the asynchronous DownloadURL() method which is used to download a single webpage when provided a valid URL. It also writes any download errors to a log file (when provided). The URL's content can be downloaded as either binary or text data. This allows the user to download HTML as text or download other file types such as images in a binary format.
Figure 3 - The DownloadURL() Method Asynchronously Downloads a Single Webpage in Either a Binary or String Format.
Portions of this method are asynchronous, and the method itself is marked with the "async" keyword in its signature. Using the C# HttpClient, a timeout is also set to ensure that when a download "hangs" for any reason, it will eventually be timed out after a duration specified by the user. The method also includes a boolean parameter named "binary", which allows the URL to be downloaded asynchronously as either a byte array or a string. The C# "await" keyword allows us to use the HttpContent's asynchronous read methods, ReadAsByteArrayAsync() and ReadAsStringAsync(), to download multiple webpages' content simultaneously in the background. The DownloadURL() method could also be extended to determine the download type dynamically based upon a URL's file extension instead of using the "binary" parameter. For instance, image file extensions could be downloaded as binary data, while HTML file extensions are downloaded as text. Regardless of the download's format or success, the DownloadURL() method returns a WebPageItem which contains any downloaded data, the server's response, any redirect URL which may have occurred, and any relevant error messages. This allows each download to be tracked. When downloading large volumes of URLs in parallel, this becomes very important, since the user will want detailed control over things such as how long to wait before timing out, the download format to use, and how many times to retry downloads which fail.
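A simplified sketch of this pattern is shown below, assuming the behavior described above rather than the demo project's exact source: an HttpClient with a user-specified timeout, and a "binary" flag selecting between ReadAsByteArrayAsync() and ReadAsStringAsync(). Error handling and logging are reduced to a tuple result; the real method returns a WebPageItem and writes failures to the log file.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// A single shared client; downloads that hang are timed out after the
// user-specified duration.
var client = new HttpClient { Timeout = TimeSpan.FromSeconds(60) };

async Task<(string Url, string Html, byte[] Data, bool Success, string Error)>
    DownloadUrlAsync(string url, bool binary)
{
    try
    {
        HttpResponseMessage response = await client.GetAsync(url);
        if (binary)  // e.g. images and other non-text file types
            return (url, null, await response.Content.ReadAsByteArrayAsync(),
                    response.IsSuccessStatusCode, null);
        return (url, await response.Content.ReadAsStringAsync(), null,
                response.IsSuccessStatusCode, null);
    }
    catch (Exception ex)  // timeouts surface as TaskCanceledException
    {
        return (url, null, null, false, ex.Message);
    }
}
```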
Since the DownloadURL() method executes asynchronously, we can now download multiple webpages at the same time. The Webpage Downloader accomplishes this task using two thread safe collections and two lists as shown in Figure 4.
Figure 4 - Data Structures used by the Webpage Downloader.
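With the figure's source living in the demo project, the four data structures can be sketched as follows, with assumed element types (strings stand in for WebPageItems). The two concurrent collections are safe to touch from multiple download workers at once; the two lists are used only by the coordinating method itself.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

var downloads = new BlockingCollection<string>();              // completed downloads
var _RequeueCounts = new ConcurrentDictionary<string, int>();  // URL -> attempts so far
var _ReTry = new List<string>();                               // URLs queued for another attempt
var _DownloadTasks = new List<Task>();                         // in-flight downloads (the throttle)

// Recording a failed attempt and deciding whether to retry, assuming a
// maximum of 3 attempts per URL:
string url = "http://www.jakemdrew.com/";
int attempts = _RequeueCounts.AddOrUpdate(url, 1, (_, n) => n + 1);
if (attempts < 3) _ReTry.Add(url);  // otherwise: log the final failure
Console.WriteLine($"Attempt {attempts} recorded for {url}");
```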
The thread safe BlockingCollection "downloads" contains each successful webpage download as it occurs. Since the collection is thread safe, multiple worker threads can safely place downloaded WebPageItem content into the collection at the same time. This also allows the user to successfully create a parallel producer / consumer style processing pipeline where downloaded webpages can immediately be further processed in a manner specified by the user. This is described in greater detail later on.
The _RequeueCounts data structure keeps track of how many times a download has been attempted for a given URL. When the number of attempts is less than the maximum specified by the user, the URL is added back into the _ReTry list for another download attempt. Otherwise, the URL is added to the log with a message that the download failed after exceeding the maximum attempts, along with the last error message received.
The _DownloadTasks list is used within the DownloadUrls() method to act as a "concurrency" throttle. When the DownloadUrls() method first executes, one DownloadUrl() task is created for each URL to be downloaded up to the maximum number of downloads specified by the user. Figure 5 shows the creation of these download tasks using the DownloadUrl() method.
Figure 5 - Starting a Maximum Number of Concurrent Webpage Downloads.
Once the _DownloadTasks list is full, a second while loop executes which "awaits" any executing download task to complete. When the first task completes, it is removed from the _DownloadTasks list and a new download is started. The completed download is also added to the "downloads" BlockingCollection for further downstream processing or possibly re-queued in the event of an error. This process is illustrated in Figure 6.
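The fill-and-refill throttle described above can be sketched as below, using Task.Delay as a stand-in for the real DownloadURL() calls; the concurrency limit and the fake work items are illustrative values, not the demo project's.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

int maxConcurrent = 3;
var pending = new Queue<int>(Enumerable.Range(1, 10));  // 10 fake "URLs"
var downloadTasks = new List<Task<int>>();
var completed = new List<int>();

// Placeholder for DownloadURL(): finishes after a short delay.
Task<int> StartDownload(int id) => Task.Delay(10).ContinueWith(_ => id);

// Fill the task list up to the concurrency limit.
while (pending.Count > 0 && downloadTasks.Count < maxConcurrent)
    downloadTasks.Add(StartDownload(pending.Dequeue()));

// Await whichever task finishes first, remove it, and start another
// download in the freed slot.
while (downloadTasks.Count > 0)
{
    Task<int> finished = await Task.WhenAny(downloadTasks);
    downloadTasks.Remove(finished);
    completed.Add(finished.Result);  // hand off for downstream processing
    if (pending.Count > 0)
        downloadTasks.Add(StartDownload(pending.Dequeue()));
}
Console.WriteLine($"Downloaded {completed.Count} items");
```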
Figure 6 - Throttle Parallel Downloads Based upon a User Specified Concurrency Level.
In Figure 7, the WebpageDownloader created in Figure 1 is used to download a list of unique links. First, the DownloadUrls() method is called to begin downloading webpages in parallel. According to the settings in Figure 1, the WebpageDownloader will download up to 100 webpages at a time, retrying each download up to 3 times. Downloads will also time out after 60 seconds, and each download will be documented in the downloadLog.txt file.
Figure 7 - Using Downloads from the Webpage Downloader in a Parallel Pipeline.
In the Figure 7 example, as soon as the first download is completed, a second Parallel.ForEach loop immediately processes the downloaded content. Keep in mind that other URL's could also be in the process of downloading at the exact same time since the DownloadUrls() method is asynchronous and running in a separate thread. In this simple example, the second Parallel.ForEach loop saves each of the downloaded items to an output directory specified by the user.
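The producer/consumer shape of this pipeline can be sketched as below. Strings stand in for WebPageItems, and "processing" just counts items rather than saving files to an output directory, so the moving parts are assumptions layered on the pattern the article describes.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

var downloads = new BlockingCollection<string>();
int processed = 0;

// Producer: in the real code this is DownloadUrls() running its parallel
// downloads; CompleteAdding() signals the consumer that no more items follow.
var producer = Task.Run(() =>
{
    for (int i = 0; i < 5; i++) downloads.Add($"<html>page {i}</html>");
    downloads.CompleteAdding();
});

// Consumer: GetConsumingEnumerable() blocks until items arrive and ends
// once the producer calls CompleteAdding(), so processing can begin while
// other downloads are still in flight.
Parallel.ForEach(downloads.GetConsumingEnumerable(), html =>
{
    System.Threading.Interlocked.Increment(ref processed);  // e.g. save to disk
});

await producer;
Console.WriteLine($"Processed {processed} pages");
```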
The DownloadUrls() method recursively calls itself in the event that not all URLs were successfully downloaded and additional retries were specified by the user. During processing, any URLs placed into the _ReTry collection are simply provided as input to another recursive call to the same DownloadUrls() method, as shown in Figure 8. In this manner, failed downloads will continue to be re-attempted until they are either successfully downloaded or the maximum attempts are exceeded.
Figure 8 - Using Recursion to Retry Failed Downloads.
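The retry recursion can be sketched as follows. Downloads are simulated here (a URL "succeeds" on its second attempt), and the URLs and helper names are hypothetical; the point is the shape of the recursion, where failed URLs become the input to the next call.

```csharp
using System;
using System.Collections.Generic;

var attempts = new Dictionary<string, int>();  // URL -> attempts so far
var succeeded = new List<string>();

void DownloadUrls(List<string> urls, int maxAttempts)
{
    var retry = new List<string>();
    foreach (string url in urls)
    {
        int n = attempts[url] = attempts.GetValueOrDefault(url) + 1;
        if (n >= 2) succeeded.Add(url);            // simulated success
        else if (n < maxAttempts) retry.Add(url);  // queue another attempt
        // otherwise: log the failure and give up on this URL
    }
    if (retry.Count > 0) DownloadUrls(retry, maxAttempts);  // recursive retry
}

DownloadUrls(new List<string> { "http://a.example/", "http://b.example/" }, 3);
Console.WriteLine($"{succeeded.Count} URLs downloaded");
```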
Have you ever wondered how a search engine might crawl the web looking for links which are connected to a single webpage? The following section demonstrates using the WebpageDownloader to start with a single webpage URL and crawl the links within every connected webpage until the collection reaches a certain maximum size. The purpose of this demo application is merely to illustrate using the Webpage Downloader class to process downloaded webpages. Multiple improvements could be made to this demo project to crawl links more efficiently. However, they would also add many more lines of code, which would make the illustration harder to follow.
Figure 9 - Crawling Webpage Links using the Webpage Downloader.
Figure 9 shows part of the CrawlLinks() method. We begin by downloading a single starting URL. Next, a foreach loop is used to process any downloaded WebPageItems which the WebpageDownloader provides back to the loop. For each downloaded WebPageItem, we call the FindAllLinks() method to extract the URLs from any <a> link tags in the downloaded HTML. For the demo, this method also filters the results down to common HTML file types.
However, the FindAllLinks() method is only called when the program has not yet exceeded the maximum number of downloads specified by the user. Each new link found is added to a list of new URLs to download. Once all provided URLs have been processed in the current run of the CrawlLinks() method, it checks the "newLinks" list to determine whether there are any additional links left to crawl. When "newLinks" are available and additional downloads are required, the CrawlLinks() method recursively starts over using the "newLinks" as the starting URLs and continues crawling. Once the method has collected enough links, the foreach loop is simply exited, ignoring any remaining files.
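A sketch of what FindAllLinks() might do is shown below: pull the href values out of <a> tags with a regular expression and keep only common HTML-style extensions. The demo's actual filtering rules may differ, and a production crawler would use a proper HTML parser rather than a regex.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

List<string> FindAllLinks(string html)
{
    var links = new List<string>();
    // Capture the href value of each <a> tag.
    foreach (Match m in Regex.Matches(html,
        "<a[^>]+href\\s*=\\s*[\"']([^\"']+)[\"']", RegexOptions.IgnoreCase))
    {
        string url = m.Groups[1].Value;
        // Keep extensionless pages and common HTML file types only,
        // filtering out images, scripts, and other binary content.
        if (!url.Contains(".") || url.EndsWith(".html") || url.EndsWith(".htm")
            || url.EndsWith(".aspx") || url.EndsWith(".php") || url.EndsWith("/"))
            links.Add(url);
    }
    return links;
}

var found = FindAllLinks(
    "<a href=\"http://www.jakemdrew.com/blog.html\">Blog</a>" +
    "<a href='images/logo.png'>Logo</a>");
Console.WriteLine(string.Join(", ", found));
```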
Using my own website www.jakemdrew.com as the starting URL, I used the Webpage Downloader to perform a 100-link crawl. Figure 10 below shows the CrawlLinks() method being used in combination with a WebpageDownloader to download 100 crawled links in parallel using 100 concurrent download workers.
Figure 10 - Crawling 100 Links using 100 Concurrent WebpageDownloader Workers.
I performed benchmarks to identify the first 100 links associated with www.jakemdrew.com using 2, 50, and 100 concurrent download workers. The elapsed times were approximately 44 seconds with 2 workers, 16 seconds with 50 workers, and 17 seconds with 100 workers.
Looking at the Webpage Downloader's log in Figure 11, we can see that the LinkCrawler located 39 new links to download from the first URL provided. Since the crawler had not yet located 100 links, it continued to download HTML and search for new links within the 39 new URLs. The crawler then found an additional 2,541 links in the HTML of those 39 pages during its third recursive execution. However, only a fraction of those links were actually downloaded, since we set a download threshold of 100 total webpages.
Downloads Requested: 1 3/18/2015 7:54:10 PM ################################################################### http://www.jakemdrew.com/|Success
Downloads Requested: 39 3/18/2015 7:54:10 PM ###################################################################
Downloads Requested: 2541 3/18/2015 7:54:21 PM ################################################################### http://www.google.com/chrome|Success
Figure 11 - The Webpage Downloader Log.
It is clear that sizable performance gains can be achieved by using a parallel webpage download approach when mining data from the web. However, there is a diminishing return on adding additional concurrent workers to the download process. The optimal threshold depends on numerous factors, including both processing power and connection speed. It is also worth mentioning that in certain situations, utilizing an anonymity network such as Tor can be very beneficial for avoiding download performance penalties or blocks which some webpage providers place at the IP level. While moving from 2 to 50 concurrent download workers reduced execution time from 44 seconds down to 16, increasing from 50 to 100 workers actually increased the total download time to 17 seconds. Carefully choosing the appropriate parameters can dramatically impact execution times when dealing with larger volumes of data.