
Need a solution to crawl 200 million web pages within a month using Perl and Hadoop. Is this possible?

Hi all, greetings from Prajan.

I am Prajan (Pandiyarajan), working as a Perl developer at Sciera Solution. I have 2+ years of experience in Perl and Big Data.

I got a difficult task from my manager: I need to crawl 200 million URLs, all on the same domain, within about a month. I have tried this in Perl, but I got a maximum of 250 hits in 60 seconds. I have heard this can be done with Hadoop, but I don't know Hadoop. Can anyone give me a detailed solution for how I can accomplish this task?
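To put numbers on it: 250 hits per 60 seconds is about 4 requests per second, so a single crawler at that rate would need roughly 550 days for 200 million URLs. Finishing in 30 days means sustaining about 77 requests per second, i.e., on the order of 20 crawlers running in parallel at my current per-process rate, before allowing for failures and retries.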


Comment by John on November 25, 2013 at 7:38am

You should run multiple crawlers in parallel (whether in Perl or on Hadoop), but each crawler needs to work through its share of the URLs in a uniform, unbiased, random way.

This approach scales out almost without limit.

John
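A minimal sketch of John's suggestion in Perl, assuming the URL list already sits in a plain-text file, one URL per line (urls.txt is a hypothetical name), and that the file has been shuffled once up front (e.g., with GNU shuf) so that every stride through it is an unbiased random sample. Parallel::ForkManager and LWP::UserAgent are standard CPAN modules; the worker count is a placeholder to tune:

    #!/usr/bin/perl
    # Minimal sketch: N forked workers, each taking every N-th line of a
    # pre-shuffled URL list so the load is spread uniformly at random.
    # urls.txt is a hypothetical file name, one URL per line.
    use strict;
    use warnings;
    use Parallel::ForkManager;
    use LWP::UserAgent;

    my $workers = 40;   # tune to bandwidth and the site's politeness limits
    my $pm = Parallel::ForkManager->new($workers);

    for my $w (0 .. $workers - 1) {
        $pm->start and next;    # parent loops on; child continues below

        my $ua = LWP::UserAgent->new(timeout => 10, agent => 'my-crawler/0.1');
        open my $fh, '<', 'urls.txt' or die "urls.txt: $!";
        my $line = 0;
        while (my $url = <$fh>) {
            next unless $line++ % $workers == $w;   # this worker's slice
            chomp $url;
            my $res = $ua->get($url);
            # A real crawler would persist $res->decoded_content somewhere
            # durable; here we only log the status code.
            printf "%d %s %s\n", $w, $res->code, $url;
        }
        close $fh;
        $pm->finish;
    }
    $pm->wait_all_children;

In practice you would not have 40 workers each re-reading one 200-million-line file; you would pre-split the list into one file per worker and let each process stream its own shard, but the fork-and-stride structure stays the same. On Hadoop the same idea carries over: roughly, each map task would run this fetch loop over its own input split, with the framework handling distribution and restarts.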
