Web Crawling & Analytics Case Study - Database Vs Self Hosted Message Queuing Vs Cloud Message Queuing

The Business Problem:


To build a repository of used car prices and identify trends from data made available by used car dealers. The solution necessarily involved building large-scale crawlers to crawl & parse thousands of used car dealer websites every day.


1st Solution: Database Driven Solution to Crawling & Parsing


Our initial infrastructure consisted of crawling, parsing and database-insertion web services, all written in Python. When the crawling service finished crawling a website, it pushed the output data to the database; the parsing service picked it up from there and, after parsing the data, pushed the structured records back into the database.
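The database-mediated handoff can be sketched roughly as follows. This is a minimal illustration only: it uses an in-memory SQLite database as a stand-in for the production database, and `crawl_site` / `parse_page` are hypothetical placeholders for the real crawler and parser.

```python
import json
import sqlite3

# In-memory SQLite stands in for the production database server.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE raw_pages (url TEXT, html TEXT, parsed INTEGER DEFAULT 0)")
db.execute("CREATE TABLE listings (url TEXT, data TEXT)")

def crawl_site(url):
    # Placeholder for the real crawler, which fetched dealer pages over HTTP.
    return "<html><li>2009 Honda Civic - $8,500</li></html>"

def parse_page(html):
    # Placeholder for the real parser, which extracted structured listings.
    return {"model": "Honda Civic", "year": 2009, "price": 8500}

def crawl_service(url):
    # Step 1: the crawling service writes raw HTML into the database.
    db.execute("INSERT INTO raw_pages (url, html) VALUES (?, ?)",
               (url, crawl_site(url)))

def parse_service():
    # Step 2: the parsing service reads the raw pages back out of the
    # database -- this write-then-read round trip is the bottleneck
    # described below.
    rows = db.execute(
        "SELECT rowid, url, html FROM raw_pages WHERE parsed = 0").fetchall()
    for rowid, url, html in rows:
        db.execute("INSERT INTO listings (url, data) VALUES (?, ?)",
                   (url, json.dumps(parse_page(html))))
        db.execute("UPDATE raw_pages SET parsed = 1 WHERE rowid = ?", (rowid,))

crawl_service("http://example-dealer.com/inventory")
parse_service()
```

Every page makes two trips through the database (raw in, raw out, structured in), which is what couples the two services so tightly.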



Problems with Database driven approach:


  • Bottlenecks: Writing the data into the database and reading it back proved to be a huge bottleneck. It slowed down the entire process and led to capacity mismatches between the crawling & parsing functions.

  • High Processing Cost: Because of the slow response times of many websites, the parsing service would sit mostly idle, which led to a very high cost of servers & processing.


We tried to speed up the process by posting data directly from the crawling service to the parsing service, but this resulted in data loss whenever the parsing service was busy. The approach also presented a massive scaling challenge because of the read & write bottlenecks at the database.


2nd Solution: Self Hosted / Custom Deployment Using RabbitMQ



To overcome the problems mentioned above and gain the ability to scale, we moved to a new architecture built on RabbitMQ. In the new architecture the crawlers and parsers ran on Amazon EC2 micro instances, and we used Fabric to push commands to the scripts running on those instances. A crawling instance would pull a used car dealer website from the website queue, crawl the relevant pages and push the output data to a crawled-pages queue. A parsing instance would pull data from the crawled-pages queue, parse it and push the structured data into a parsed-data queue, from which a database-insertion script transferred it into Postgres.
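The three-queue topology can be condensed into a sketch using the `pika` RabbitMQ client. The queue names and JSON message format here are illustrative assumptions, not the original implementation; the durable-queue and persistent-message flags show one way to trade speed for safety, which is the tradeoff discussed below.

```python
import json

WEBSITE_Q = "websites"         # dealer sites waiting to be crawled
CRAWLED_Q = "crawled_pages"    # raw HTML waiting to be parsed
PARSED_Q = "parsed_data"       # structured records awaiting DB insertion

def encode_task(payload):
    # Messages are JSON-encoded bytes (an assumed format).
    return json.dumps(payload).encode("utf-8")

def decode_task(body):
    return json.loads(body.decode("utf-8"))

def open_channel(host="localhost"):
    import pika  # requires a running RabbitMQ broker to actually connect
    conn = pika.BlockingConnection(pika.ConnectionParameters(host))
    ch = conn.channel()
    for q in (WEBSITE_Q, CRAWLED_Q, PARSED_Q):
        # durable=True lets the queue survive a broker restart, at some
        # cost in throughput -- the speed-vs-persistence tradeoff.
        ch.queue_declare(queue=q, durable=True)
    return ch

def publish(ch, queue, payload):
    import pika
    ch.basic_publish(
        exchange="",
        routing_key=queue,
        body=encode_task(payload),
        # delivery_mode=2 marks the message itself as persistent.
        properties=pika.BasicProperties(delivery_mode=2),
    )
```

A crawling instance would call something like `publish(open_channel(), CRAWLED_Q, {"url": url, "html": html})`, while a parsing instance would consume from `CRAWLED_Q` and publish its output to `PARSED_Q`.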


This approach sped up the crawling and parsing cycle, and scaling was just a matter of adding more instances created from specialized AMIs.


Problems with RabbitMQ Approach:

  • Setting up, deploying & maintaining this infrastructure across hundreds of servers was a nightmare for a small team.

  • We suffered data losses whenever there was a deployment or maintenance issue. Because of the tradeoff we were forced to make between speed and message persistence in RabbitMQ, there was a chance of losing valuable data if the server hosting RabbitMQ crashed.


3rd Solution: Cloud Messaging Deployment Using IronMQ & IronWorker


The concept of multiple queues, with multiple crawlers and parsers pushing and pulling data from them, gave us a way to scale the infrastructure massively. We were looking for a solution that would let us keep a similar architecture while overcoming the problems above, without the headache of deployment & maintenance management.


The architecture, business logic & processing methods with IronMQ & IronWorker were similar to the RabbitMQ setup, but without the deployment & maintenance effort. All our code is written in Python, and since Iron.io supports Python we could set up the crawling & parsing workers and queues within 24 hours with minimal deployment & maintenance work. Reading and writing data in IronMQ is fast, all messages in IronMQ are persistent, and the chance of losing data is very low.
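The worker-side change is mostly a swap of client libraries. The sketch below follows the `iron_mq` Python package; the exact client calls and the queue name are assumptions to be checked against the Iron.io documentation, and the credentials are placeholders. The key difference from the self-hosted setup is that message persistence is handled by the service, so a crashed worker or host does not lose the message.

```python
import json

CRAWLED_Q = "crawled_pages"  # illustrative queue name

def encode_listing(record):
    # IronMQ message bodies are strings; JSON is an assumed serialization.
    return json.dumps(record)

def decode_listing(body):
    return json.loads(body)

def push_crawled_page(record, project_id, token):
    # Client calls follow the iron_mq Python package (an assumption --
    # check the Iron.io docs for exact signatures). Posted messages are
    # persisted by the hosted service.
    from iron_mq import IronMQ
    queue = IronMQ(project_id=project_id, token=token).queue(CRAWLED_Q)
    queue.post(encode_listing(record))

def pop_crawled_page(project_id, token):
    from iron_mq import IronMQ
    queue = IronMQ(project_id=project_id, token=token).queue(CRAWLED_Q)
    resp = queue.get()                  # reserve a message from the queue
    for msg in resp.get("messages", []):
        record = decode_listing(msg["body"])
        queue.delete(msg["id"])         # acknowledge only after processing
        return record
    return None
```

Deleting the message only after it has been processed means a worker that dies mid-parse leaves the message on the queue for another worker to pick up.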






| Key Variables | Database Driven Batch Processing | Self Hosted - RabbitMQ | Cloud Based - IronMQ |
| --- | --- | --- | --- |
| Speed of processing a batch | | | |
| Data Loss from Server Crashes & Production Issues | Low Risk | Medium Risk | Low Risk |
| Custom Programming for Queue Management | High Effort | Low Effort | Low Effort |
| Set Up for Queue Management | | Medium Effort | Low Effort |
| Deployment & Maintenance of Queues | | High Effort | Low Effort |

