Need a solution to crawl 200 million web pages in 10-1 month using Perl and Hadoop. Is this possible?

Hi all, greetings from Prajan.

I am Prajan (Pandiyarajan), working as a Perl developer at Sciera Solution. I have 2+ years of experience in Perl and Big Data.

My manager gave me a difficult task: I need to crawl 200 million URLs in the same domain within 20-1 month. I tried this in Perl, but I get a maximum of only 250 hits in 60 seconds. I have heard this can be done using Hadoop, but I don't know Hadoop. Can anyone give me a detailed solution and explain how I can accomplish this task?




Comment by John on November 25, 2013 at 7:38am

You should run multiple crawlers in parallel, whether you build them in Perl or on Hadoop. But each crawler must run in a uniform, unbiased, and random way.
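One way to read "uniform and unbiased" is that every URL goes to exactly one crawler and the shares are roughly equal. The sketch below is in Python for illustration only (the same idea carries over to Perl). It assigns URLs to crawlers with a stable hash and runs one worker per share. The names shard_for, partition, crawl_shard, and crawl_all are hypothetical, and fetch is a placeholder for whatever download routine you already have. Hash-based sharding is an assumption about what the comment means, not something it states.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def shard_for(url, num_crawlers):
    """Map a URL to a crawler index with a stable hash (roughly uniform)."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_crawlers


def partition(urls, num_crawlers):
    """Split URLs into one list per crawler; each URL lands in exactly one list."""
    shards = [[] for _ in range(num_crawlers)]
    for url in urls:
        shards[shard_for(url, num_crawlers)].append(url)
    return shards


def crawl_shard(shard, fetch):
    """Hypothetical worker: fetch every URL in its shard, return (url, result) pairs."""
    return [(url, fetch(url)) for url in shard]


def crawl_all(urls, num_crawlers, fetch):
    """Run one worker per shard in parallel and collect all results."""
    shards = partition(urls, num_crawlers)
    with ThreadPoolExecutor(max_workers=num_crawlers) as pool:
        results = list(pool.map(lambda s: crawl_shard(s, fetch), shards))
    return [pair for shard_result in results for pair in shard_result]
```

Because the hash depends only on the URL, the same URL always goes to the same crawler, so no page is fetched twice even if the URL list is rebuilt. On Hadoop the same split would be done by the framework's partitioner rather than by hand.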

This approach is highly scalable: you can keep adding crawlers to increase throughput.
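To get a feel for how many crawlers that means, here is a rough back-of-the-envelope calculation in Python. It uses the figures from the question (200 million pages, 250 hits per 60 seconds for one crawler) and makes two assumptions that are not in the thread: a 30-day deadline, and throughput that grows linearly with the number of crawlers. The function names are hypothetical.

```python
import math


def days_with_one_crawler(total_pages, pages_per_minute):
    """Days a single crawler needs at a constant rate."""
    pages_per_day = pages_per_minute * 60 * 24
    return total_pages / pages_per_day


def required_crawlers(total_pages, days, pages_per_minute_per_crawler):
    """Crawlers needed to finish in `days`, assuming linear scaling."""
    seconds_available = days * 24 * 60 * 60
    required_rate = total_pages / seconds_available      # pages per second overall
    per_crawler_rate = pages_per_minute_per_crawler / 60  # pages per second each
    return math.ceil(required_rate / per_crawler_rate)


print(round(days_with_one_crawler(200_000_000, 250)))  # about 556 days
print(required_crawlers(200_000_000, 30, 250))         # 19 crawlers
```

So at the measured rate a single crawler would need roughly a year and a half, while about 19 crawlers at the same rate would fit a 30-day window, provided the target site can serve that load.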


