Step 0: Should I Try to Scrape This?
So you’re excited about a great idea and you’ve found a site that looks easy to scrape. Time to jump right in and start writing a scraper, right?
First things first: it’s crucial to examine the data you wish to scrape. Is it well organized? Will it be difficult to clean later? Sentences are notoriously hard to clean, because they have to be parsed somehow to extract meaningful data. You and I can read a sentence and easily pick out the relevant information, but a computer needs precise rules to do the same. The more variation in your sentences, the harder they are to scrape and clean. And if you need to convert whatever you scraped into numeric data or dates, that is perhaps double or triple the work. Examples follow in a bit.
Secondly, some websites provide an API and indicate that they prefer not to be scraped. Though web scraping has certain advantages, such as getting data in real time, most of the time you are better off using the API. The data retrieved from an API is most likely well organized and validated by some sort of schema.
If you have decided that the data will be relatively easy to clean, and that scraping is a better choice than using an API (if one is available), then:
Step 1: Examining the Website’s Structure and Elements
Scraping first involves collecting data from the right locations. We start by visually locating the data we want and any patterns they follow. Is the information in a table? Is it in the same location on every page?
For The Flight Deal, the data we want are inside the title links located in each box. There are ten per page. As we go through each page, the URL stays consistent, only updating the page number as the page changes. Overall, the structure of the website is quite consistent.
Clicking on a few of the links, the overall structure again seems well organized. Also, the data I want is there, such as posting date, routing, airline in the title, price, and so on.
Pain Point #1: Not recognizing that scraping lines of text or sentences means a lot of cleaning later
Pain Point #2: Failing to recognize that the data is spread across entries, one per page, such that it is difficult to see if the data really is well organized or not.
Once we have an idea of how we want our scraper to access the data, we can use inspect element to find the HTML that contains it. Remember, inspect element is your friend. Look for any tags or patterns in the classes that might uniquely identify where the data is.
Step 2: Grabbing the Proper Path
Pain Point #3: Trying to test XPaths with a Scrapy spider instead of using the Scrapy shell to test first.
The first thing to try is to see whether the data can be accessed using a unique attribute of the tag it’s enclosed in, or by using the tag itself if the data is spread across multiple instances of that tag, as in a list or a table. If that doesn’t work, try accessing the data’s parent container. Often, sifting from the parent container downwards gives good insight into how to access the elements nested within. Again, inspect element may not always offer the right path, but the Scrapy shell can reveal the proper path to use. I won’t go into code here, but if you are using Scrapy, remember, response.css is your friend.
Step 3: Develop the Framework and Utilize your Web Scraper
This step is the fun part. After you’ve made sure you have the right paths via Scrapy shell, it’s time to write the code for the Web Scraper! For Scrapy, there’s a template that is somewhat easy to follow. Declare your item fields, think of the patterns from Step 1 for the spider to follow, the target paths from Step 2 for your spider to extract information from, create the pipeline, and make sure you set your settings! Download delay is important to ease the load of crawling on a server and to prevent getting kicked off.
Pain Point #4: Forgetting to set the pipeline in the settings for Scrapy, and then not being sure why information isn’t being gathered.
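For reference, here is a minimal sketch of the two settings mentioned above. DOWNLOAD_DELAY and ITEM_PIPELINES are real Scrapy settings, but the project name, the pipeline class name, and the values are made up for illustration, so adjust them to your own project.

```python
# settings.py (fragment): module and class names below are hypothetical

# Seconds to wait between requests to the same site, to ease server load.
DOWNLOAD_DELAY = 2

# A pipeline only runs if it is registered here. The integer (customarily
# 0-1000) sets the order: pipelines with lower values run first.
ITEM_PIPELINES = {
    "flightdeal.pipelines.FlightDealPipeline": 300,
}
```

If the spider runs without errors but nothing shows up in your output, an empty or missing ITEM_PIPELINES entry is a good first thing to check.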
For The Flight Deal, I implemented my spider to access each of the ten title links, extract the data from within those links via the proper XPaths or CSS paths, and then move on to the next page with the next set of links. Hint: you can do this by creating a list of URLs, or by using a regular expression to indicate which links the spider should follow. In my spider, I used a regular expression to limit the crawl to page/1/ through page/999/. I also used BeautifulSoup to extract airport code and city from Orbitz.
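To make the page-limit idea concrete, here is a small sketch of a regular expression that accepts page/1/ through page/999/ and nothing else. The example.com URLs are placeholders, and the pattern is my guess at one way to express the limit, not necessarily the one used in the original spider. In Scrapy, a pattern like this could be passed to the allow argument of a LinkExtractor inside a CrawlSpider rule.

```python
import re

# Hypothetical URL shape: <site>/page/<n>/ with n from 1 to 999.
# [1-9] forbids a leading zero; \d{0,2} allows up to two more digits.
PAGE_PATTERN = re.compile(r"/page/[1-9]\d{0,2}/$")

urls = [
    "https://example.com/page/1/",
    "https://example.com/page/42/",
    "https://example.com/page/999/",
    "https://example.com/page/1000/",  # out of range
    "https://example.com/page/0/",     # out of range
]

# Keep only the URLs the spider would be allowed to follow.
allowed = [url for url in urls if PAGE_PATTERN.search(url)]

for url in allowed:
    print(url)
```

Testing the pattern on a handful of URLs like this, before wiring it into the spider, is much faster than finding out mid-crawl that it matches too much or too little.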
It would also be wise to do a few test runs before doing a full run of your web scraper. If you have to access multiple pages, testing different ranges of pages and checking the output is a good idea, too.
Step 4: Cleaning Data
Did you make sure to follow Step 0, and that your data will be relatively easy to clean? Did you also know that even without scraping, cleaning data takes up the majority of the time for a data scientist?
Pain Point #5: Spending more than 48 hours learning the pain of cleaning and validating data from text. Yes, 48 hours of just cleaning! This involved intimately learning how to use regular expressions, thinking of clever ways to convert date strings into consistent, orderable dates, and constant testing and re-testing. It was like running chemical reactions in a laboratory thousands of times until I got the right formula, or designing a product and going through rounds and rounds of user experience testing.
Before I delve into an optional section on my struggle with cleaning my data from The Flight Deal, I’d like to mention that there are two ways to approach cleaning data from web scraping. One is to parse the data in the web scraper before outputting it; the other is to output the raw data and clean it afterwards. You can also do both. For my project, I thought it would be best to extract the relevant parts of the text first in my web scraper, then format the data itself later.
Aside: The Pains of Working with Inconsistent, Text-Based Data Scraped from the Web
As I mentioned before, there can be inconsistencies in the data you want to scrape. Structurally, a website is usually well-organized, but without a schema or some sort of validation, the data itself can be wildly inconsistent. Here are a couple of examples from Fare Availability on The Flight Deal:
Valid for travel on the outbound until December.
Valid for travel early November – mid December or January 2017.
Valid for travel in December.
Valid for travel early January 2017 – late March 2017 or April 2017 – May 2017, July 2017 – August 2017.
Valid for travel January 10, 2017 – August 20, 2017.
Valid for travel November 10th – mid December.
It took many rounds of testing, rewriting my regular expressions, and scraping and re-scraping to capture these cases.
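To give a feel for what capturing these cases involves, here is a rough sketch that pulls every date-like token out of a sentence: an optional “early”, “mid”, or “late”, a month name, an optional day, and an optional year. The pattern and the function name are my own illustration, not the regular expressions from the original scraper, and the sketch does not try to pair tokens into ranges.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

DATE_TOKEN = re.compile(
    r"\b(?:(?P<part>early|mid|late)\s+)?"          # optional early/mid/late
    rf"(?P<month>{MONTHS})\b"                      # month name
    r"(?:\s+(?P<day>\d{1,2})(?:st|nd|rd|th)?\b)?"  # optional day: 10 or 10th
    r"(?:,?\s+(?P<year>\d{4}))?"                   # optional year: 2017 or , 2017
)


def extract_dates(text):
    """Return (part, month, day, year) tuples in order of appearance.

    Fields that are missing from the text come back as None.
    """
    return [
        (m["part"], m["month"], m["day"], m["year"])
        for m in DATE_TOKEN.finditer(text)
    ]


examples = [
    "Valid for travel early November – mid December or January 2017.",
    "Valid for travel January 10, 2017 – August 20, 2017.",
    "Valid for travel November 10th – mid December.",
]
for sentence in examples:
    print(extract_dates(sentence))
```

Even this small pattern needs a word boundary after the day so that the “20” in “2017” is not mistaken for a day of the month, which is exactly the kind of detail that only shows up through testing.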
Further, there are cases which I decided not to handle:
Using the flexible data search on Avianca.com, we found the following valid travel dates.
- Outbound: August 29th, August 31st, September 3rd, September 10th
- Return: September 8th – October 2nd
How would I even parse that into a range that the majority of the data was following?
Furthermore, notice that some of the text has a day or a year, and some of the text has either “early”, “mid”, or “late”. These date strings would have to be cleaned into a consistent format later on, which again would require handling different cases.
Valid for travel November 10th – 25th.
Why did my regular expression miss this one? After careful inspection, I found that I had left out a single ? next to a space. In most cases, there is a month in front of the day, with a space in between them. Without the month, that space isn’t there either!
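Here is a small reconstruction of that kind of bug. The two patterns are hypothetical (I am not reproducing the original regular expression), and the sentence with a month after the dash is a made-up comparison case. The only difference between the patterns is one ? after the space that follows the optional month.

```python
import re

MONTH = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Hypothetical pattern for the end of a range: dash, optional month, day.
# Buggy: the month is optional, but the space after it is still required,
# so "– 25th" (no month, hence no second space) does not match.
buggy = re.compile(rf"– (?P<month>{MONTH})? (?P<day>\d{{1,2}})")

# Fixed: one extra "?" makes that space optional as well.
fixed = re.compile(rf"– (?P<month>{MONTH})? ?(?P<day>\d{{1,2}})")

with_month = "Valid for travel November 10th – December 25th."  # made-up case
without_month = "Valid for travel November 10th – 25th."

print(buggy.search(with_month) is not None)     # True
print(buggy.search(without_month) is not None)  # False
print(fixed.search(without_month)["day"])       # 25
print(fixed.search(with_month)["month"])        # December
```

The buggy version passes on most of the data, which is what makes this sort of mistake so hard to spot.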
Unluckily for me, a majority of the data I wanted did not have a unique identifying attribute or tag. Rather, it was all in the same large container under bullet points. So I had to write multiple regular expressions to detect the proper location of the data I desired. While the remaining data had less variability than the travel dates, there were still issues like these:
Valid for travel on the outbound until early Decembe. Must purchase at least 3 day in advance of departure.
LAX – NRT (Tokyo) – TPE (Taipei) – NRT – – LAX
PHL – JFK (New York) – FCO (Rome) – CTA (Catania) – FCO – JFK – JFK
Typos and inconsistent data! Again, typed sentences are generally harder to parse. From the above, it does seem that these texts are entered manually.
Detecting and debugging these can be quite time-consuming.
Even seemingly clean websites can have errors, such as the Orbitz site where I extracted airport information from:
“Dalat, Viet Nam – Lienkhang DLI)”
“Skopie, Macedonia (FYROM) (SKP)”
The two above threw off my code for the longest time as I tried splitting by parentheses to separate city and airport code.
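One way around this, sketched below, is to stop splitting on parentheses and instead anchor on the three capital letters at the very end of the string, treating the surrounding parentheses as optional. The function name is hypothetical and this is not the code from the original project, just an illustration using the two strings above.

```python
import re

# Lazily take everything up to the final three-letter code.
# Both parentheses around the code are optional, so a missing "(" is fine,
# and an extra "(FYROM)" earlier in the string simply stays in the place name.
CODE_AT_END = re.compile(r"^(?P<place>.*?)\s*\(?(?P<code>[A-Z]{3})\)?\s*$")


def split_place_and_code(text):
    """Return (place, airport_code), or None if no trailing code is found."""
    match = CODE_AT_END.match(text.strip())
    if match is None:
        return None
    return match["place"], match["code"]


for raw in [
    "Dalat, Viet Nam – Lienkhang DLI)",
    "Skopie, Macedonia (FYROM) (SKP)",
]:
    print(split_place_and_code(raw))
```

Returning None for strings that do not fit, rather than raising, makes it easy to collect the oddballs and inspect them by hand.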
Lastly, I had to parse the date text into a usable format. Detecting and adding the year, swapping the day and month so that they are in a consistent order, and handling “early”, “mid”, and “late” all involve regular expressions and are not trivial. Furthermore, because I kept “early”, “mid”, and “late” instead of assuming a finer granularity, I assigned each of these a value so that I could compare dates and check whether a date string fell within a certain date range.
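As an illustration of that last idea, here is one possible way to make such dates comparable: turn each one into a (year, month, rank) tuple, where “early”, “mid”, and “late” get the ranks 0, 1, and 2. The ranks, the rule that days 1-10 count as early, 11-20 as mid, and 21 onward as late, and the function names are all assumptions of mine; the values actually used in the project may have been different.

```python
MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]

# Assumed ordering values for the coarse parts of a month.
PART_RANK = {"early": 0, "mid": 1, "late": 2}


def sort_key(year, month, part=None, day=None):
    """Build a comparable (year, month_number, rank) tuple.

    Assumption: days 1-10 count as early, 11-20 as mid, 21+ as late.
    """
    if part is None:
        if day is None:
            raise ValueError("need either a part or a day")
        part = "early" if day <= 10 else "mid" if day <= 20 else "late"
    return (year, MONTHS.index(month) + 1, PART_RANK[part])


def in_range(key, start, end):
    """Check whether key falls within [start, end], inclusive."""
    return start <= key <= end


# "early November – mid December", with the year 2016 filled in
start = sort_key(2016, "November", part="early")
end = sort_key(2016, "December", part="mid")

print(in_range(sort_key(2016, "November", day=25), start, end))  # True
print(in_range(sort_key(2016, "December", day=28), start, end))  # False
```

Tuples compare element by element in Python, so no date arithmetic is needed; the trade-off is that anything finer than a third of a month is lost.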
Step 5: Load Data into a Data Frame
Load into Pandas, R Data Frames, whatever you like. Maybe clean the data a bit more. Now you’re ready to play with the data you just scraped. That’s it! Get going!