Contributed by Daniel Donohue. Daniel took the NYC Data Science Academy 12-week full-time Data Science Bootcamp pr... from Sept 23 to Dec 18, 2015. This post is based on his third class project (due in the 6th week of the program).
For our third project here at NYC Data Science, we were tasked with writing a web scraping script in Python. Since I spend (probably too much) time on Reddit, I decided that it would be the basis for my project. For the uninitiated, Reddit is a content aggregator where users submit text posts or links to thematic subforums (called "subreddits"), and other users vote them up or down and comment on them. With over 36 million registered users and nearly a million subreddits, there is a lot of content to scrape.
I selected ten subreddits (five of the top subreddits by number of subscribers and five of my personal favorites) and scraped, for each of the top posts, the title, the link, the date and time of the post, the number of votes, and the top-rated comment on the comment page for that post. The ten subreddits were:
There are many Python packages that would be adequate for this project, but I ended up using Scrapy. It seemed to be the most versatile among the different options, and it provided easy support for exporting scraped data to a database. Once I had the data stored in a database, I wrote the post title and top comment to txt files, and used the wordcloud module to generate word clouds for each of the subreddits.
When you start a new project, Scrapy creates a directory with a number of files, and these files rely on one another. The first file, items.py, defines containers that will store the scraped data:
Once filled, the item essentially acts as a Python dictionary, with the keys being the names of the fields, and the values being the scraped data corresponding to those fields.
The next file is the one that does all the heavy lifting: the file defining a Spider class. A Spider is a Python class that Scrapy uses to define what pages to start at, how to navigate them, and how to parse their contents to extract items. First, we have to import the modules we use in the definition of the Spider class:
The first two imports are specific to this project; we'll use regex to get the name of the subreddit and BeautifulSoup to extract the text of the top comment. Next, we import scrapy.Spider, from which our Spider will inherit, and scrapy.Request, which represents the HTTP requests the Spider sends. Finally, we import the Item class we defined in items.py.
The next task is to give our Spider a place to start crawling.
The attribute allowed_domains limits the domains the Spider is allowed to crawl; start_urls is where the Spider will start crawling. Next, we define a parse method, which will tell the Spider what to do on each of the start_urls. Here is the first part of this method's definition:
This uses XPath to select certain parts of the HTML document. In the end, links is a list of the links for each post, titles is a list of all the post titles, etc. Corresponding elements of the first four of these lists will fill some of the fields in each instance of a RedditItem, but the top_comment field needs to be filled on the comment page for that post. One way to approach this is to partially fill an instance of RedditItem, store this partially filled item in the metadata of a Request to a comment page, and then use a second method to fill the top_comment field on the comment page. This part of the parse method's definition achieves this:
For the ith link in the list of comment urls, we create an instance of RedditItem, fill the subreddit field with the name of the subreddit (extracted from the comment url with the use of regular expressions), the link field with the ith link, the title field with the ith title, etc. Then, we create a request to the comment page with the instruction to send it to the method parse_comment_page, and store the partially filled item temporarily in this request's metadata. The method parse_comment_page tells the Spider what to do with this:
Again, XPath specifies the HTML to extract from the comment page, and in this case, BeautifulSoup removes HTML tags from the top comment. Then, finally, we fill the last part of the item with this text and yield the filled item to the next step in the scraping process.
The next step is to tell Scrapy what to do with the extracted data; this is done in the item pipeline. The item pipeline is responsible for processing the scraped data, and storing items in a database is a typical use. I chose to store the items in MongoDB, a document-oriented database (in contrast with the more traditional table-based relational database structure). Strictly speaking, a relational database would have sufficed, but MongoDB has a more flexible data model, which could come in handy if I decide to expand on this project in the future. First, we have to specify the database settings in settings.py (another file initially created by Scrapy):
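A sketch of the relevant part of settings.py. ITEM_PIPELINES and DOWNLOAD_DELAY are standard Scrapy settings; the MONGODB_* names are custom settings invented here, and the module path, the DuplicatesPipeline class name, the delay value, and the database and collection names are all assumptions.

```python
# Pipelines run in ascending order of their number (0-1000).
ITEM_PIPELINES = {
    "reddit.pipelines.DuplicatesPipeline": 300,   # hypothetical class name
    "reddit.pipelines.MongoDBPipeline": 800,
}

# Seconds to wait between consecutive requests to the same site (assumed value).
DOWNLOAD_DELAY = 2

# Custom settings read by the MongoDB pipeline (names are assumptions).
MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "reddit"
MONGODB_COLLECTION = "posts"
```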
The delay is there to avoid violating Reddit's terms of service. We have now set up a Spider to crawl and parse the HTML and configured our database settings; the remaining step is to connect the two in pipelines.py:
The first class is used to check if a link has already been added, and skips processing that item if it has. The second class defines the data persistence. The first method in MongoDBPipeline actually connects to the database (using the settings we've defined in settings.py), and the second method processes the data and adds it to the collection. In the end, our collection is filled with documents like this:
The real work was done in actually scraping the data. Now, we want to use it to create visualizations of frequently used words across the ten subreddits. The Python module wordcloud does just this: it takes a plain text file and generates word clouds like the one you see above, and with very little effort. The first step is to write the post titles and top comments to text files.
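A sketch of this step in Python 3, where the encoding is passed directly to open(). The helper name write_text_files is hypothetical, and the documents are shown as plain dictionaries; in the project they would come from a query on the MongoDB collection.

```python
import os
import tempfile
from collections import defaultdict


def write_text_files(documents, out_dir):
    """Write one UTF-8 text file per subreddit containing titles and top comments."""
    texts = defaultdict(list)
    for doc in documents:
        texts[doc["subreddit"]].append(doc["title"])
        texts[doc["subreddit"]].append(doc.get("top_comment", ""))

    paths = {}
    for subreddit, lines in texts.items():
        path = os.path.join(out_dir, subreddit + ".txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write("\n".join(lines))
        paths[subreddit] = path
    return paths


# Demonstration with two invented documents, one containing an emoji.
docs = [
    {"subreddit": "Python", "title": "First post", "top_comment": "Nice post! \U0001F600"},
    {"subreddit": "Python", "title": "Second post", "top_comment": "Agreed."},
]
with tempfile.TemporaryDirectory() as tmp:
    paths = write_text_files(docs, tmp)
    with open(paths["Python"], encoding="utf-8") as f:
        contents = f.read()

print(contents)
```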
The first two lines after the imports are there to change the default encoding from ASCII to UTF-8, in order to properly decode emojis (of which there were many in the comments). Finally, we use these text files to generate the word clouds:
The WordCloud object uses reddit_mask.jpg as a canvas: it only fills in words in the black area. Here's an example of what we get (generated from posts on /r/totallynotrobots):
After all of this, I am now a big fan of Scrapy and everything it can do, but this project has certainly only scratched the surface of its capabilities.