Contributed by Gordon. Gordon took the NYC Data Science Academy 12-week full-time Data Science Bootcamp pr... from Sept 23 to Dec 18, 2015. This post is based on his third class project (due in the 6th week of the program).
"Kickstarter is the world's largest funding platform for creative projects," says the first line of the description on the company's website. Creators post projects on Kickstarter hoping for their work to be crowdfunded by interested parties. If the project's goal is met before the project's expiry date, the money promised becomes money to spend. If not, the pledges go unfulfilled.
I first looked at a library called grab, but its poor English-language support proved too large a barrier to overcome. Next to catch my eye was Scrapy, but I discarded it because of its perceived over-complexity. I finally settled on Selenium as my tool.
Selenium's tagline is terse: it automates browsers. Out of the box, Selenium allows one to open a web browser, go to a page, and perform any action a human could (clicking buttons, filling in forms, etc.), in addition to the base task of parsing HTML for information. One can pair Selenium with PhantomJS to do this surfing without opening a browser window, but, for some reason, that proved to be slower on my machine.
The code below activates Selenium, navigates to Kickstarter's website, and then stores all the project categories and their urls.
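from selenium import webdriver

browser = webdriver.Firefox()
# Assumed entry URL; the post only says it navigates to Kickstarter's website.
browser.get('https://www.kickstarter.com')
# Selenium 2/3 API; Selenium 4 uses find_elements(By.CLASS_NAME, ...).
categories = browser.find_elements_by_class_name('category-container')
category_links = []
for category_link in categories:
    # Each item in the list is a tuple of the category's name and its link.
    # Assumes each element exposes its name as text and its URL as href.
    category_links.append((category_link.text, category_link.get_attribute('href')))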
For each category I use its url to navigate to its page. The default is to show the first 20 active projects in a given category. Using Selenium I expand the results to show the first 20 projects of all the projects ever submitted for that category.
for category in category_links:
    browser.get(category[1])  # category is a (name, url) tuple
I then went to each project's page and scraped the data I wanted. This included the project's name, funding goal, current money garnered, and description. There was some branching in the code to account for different project states: funded and finished, funded and not finished, etc.
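The branching on project state can be sketched as a small pure function. This is only an illustration, not the code used for the scrape: the function name `project_state`, its parameters, and the state labels are hypothetical, and it assumes the goal, the pledged amount, and whether the deadline has passed were already parsed from the page.

```python
def project_state(goal, pledged, finished):
    """Classify a project from values already scraped off its page.

    All names and labels here are illustrative assumptions.
    """
    funded = pledged >= goal
    if funded and finished:
        return "funded and finished"
    if funded:
        return "funded and not finished"
    if finished:
        return "not funded and finished"
    return "not funded and not finished"


print(project_state(goal=5000, pledged=7250, finished=True))
print(project_state(goal=5000, pledged=1200, finished=False))
```

Keeping the branching in one place like this means the scraping code for each state only has to decide which page elements to read.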
The eagle-eyed reader will notice a key omission in the process described so far: due to time restrictions, I scraped only the first 20 projects of the 15 categories. My conservative estimate put the time to scrape data on all of Kickstarter's 200,000-plus projects at four days. The difference between scraping data on 600 instead of 200,000+ projects was five lines of code.
Those five lines click the "Load More" button at the bottom of the category's page until every project is loaded, and then scrape the data for each.
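A loop of that shape might look like the following. It is a minimal sketch rather than the original five lines, which are not reproduced here: `click_until_loaded`, `find_button`, and `FakePage` are hypothetical names, and the stand-in page lets the loop run without a browser. With Selenium, `find_button` would presumably wrap a lookup of the "Load More" element and return `None` once it is gone.

```python
def click_until_loaded(find_button, max_clicks=100000):
    """Click the button returned by find_button() until it returns None.

    Returns the number of clicks made. max_clicks guards against a page
    that never stops offering the button.
    """
    clicks = 0
    while clicks < max_clicks:
        button = find_button()
        if button is None:
            break
        button.click()
        clicks += 1
    return clicks


class FakePage:
    """Stand-in for a category page: 20 projects per load (illustrative)."""

    def __init__(self, total_projects, page_size=20):
        self.total = total_projects
        self.page_size = page_size
        self.loaded = min(page_size, total_projects)

    def click(self):
        # Each click reveals one more page of projects.
        self.loaded = min(self.loaded + self.page_size, self.total)

    def find_load_more(self):
        # The button disappears once everything is loaded.
        return self if self.loaded < self.total else None


page = FakePage(total_projects=95)
clicks = click_until_loaded(page.find_load_more)
print(clicks, page.loaded)
```

In a real scrape, a short wait after each click is usually needed so the newly loaded projects have time to appear before the next lookup.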
Once I have the full data I intend to do some extensive Machine Learning on it to try to build a predictive model that tells whether or not a Kickstarter project will be funded. Finally, I will build a web app that allows users to input the description of their Kickstarter project and receive a prediction of whether or not it will be funded.
That's still a long way off, though. In the meantime, I decided to do some basic Machine Learning of the type that I want to do later.
My process involved separating the data into two sets: one with numeric data and the other with textual data. With the numeric data I used vanilla Logistic Regression on the entire data set to achieve an 83% accuracy rate, a 23% increase over the baseline accuracy. Next I used Natural Language Processing to build a model on training data, and then tried to predict the test data set. Surprisingly, the accuracy was 97%.
I'm excited to see what the more robust process will produce.
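To make the baseline comparison concrete, here is a minimal sketch of how accuracy and a majority-class baseline are typically computed. The labels and predictions are made-up toy values, not the Kickstarter data, and the function names are hypothetical. It also assumes the baseline means always predicting the most common outcome, which the post does not state.

```python
from collections import Counter


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Toy data (illustrative only): 1 = funded, 0 = not funded.
labels = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
predictions = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]

model_acc = accuracy(predictions, labels)
base_acc = baseline_accuracy(labels)
print(f"model: {model_acc:.0%}, baseline: {base_acc:.0%}")
```

A model is only interesting to the extent that it beats this kind of baseline, which is why the gap is reported alongside the raw accuracy.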