I have a data science background, so my goal in using Hadoop would be to store large amounts of data in HDFS and use a cluster to perform some (parallelized) analytics (e.g. some machine learning algorithms) on parts of these datasets. To be a bit more specific, consider the following case: for some large dataset stored in HDFS, I want to run a simple algorithm on, say, 100 random samples of this dataset and combine the results.
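To make the idea concrete, here is a minimal local sketch of the sample-and-combine workflow I have in mind (no Hadoop involved; the dataset, the `analyze` statistic, and all parameters are placeholders I made up for illustration):

```python
import random
import statistics

def analyze(sample):
    # Placeholder analytic: here just the sample mean. In practice this
    # could be any per-sample model fit or statistic.
    return statistics.mean(sample)

def sample_and_combine(dataset, n_samples=100, sample_size=50, seed=0):
    rng = random.Random(seed)
    # "Map"-like step: run the analytic on each random sample independently.
    partials = [analyze(rng.sample(dataset, sample_size))
                for _ in range(n_samples)]
    # "Combine"-like step: aggregate the per-sample results.
    return statistics.mean(partials)

data = list(range(1000))   # stand-in for the large dataset in HDFS
result = sample_and_combine(data)
```

On a cluster, the per-sample analytics would run in parallel on different nodes instead of in a Python list comprehension, but the shape of the computation is the same.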

As I understand the concept, to achieve this I could write a Map function that tells the TaskTrackers on my cluster nodes to perform the analytic on part of the data, and a Reduce function to combine the results.
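Hadoop itself is written in Java, but the Map/Reduce contract can be sketched in plain Python, roughly in the style of Hadoop Streaming. This uses the canonical word-count example as a stand-in for the actual analytic, and the sort-then-group step mimics the shuffle phase Hadoop performs between map and reduce:

```python
from itertools import groupby

def mapper(lines):
    # Emits (key, value) pairs, like a Streaming mapper reading one
    # input split line by line. Here: (word, 1) for a word count.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Hadoop sorts and groups mapper output by key before the reduce
    # phase; sorted() + groupby() mimics that shuffle step here, so all
    # values for one key arrive at the reducer together.
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, sum(v for _, v in group)

counts = dict(reducer(mapper(["a b a", "b c"])))
# counts == {"a": 2, "b": 2, "c": 1}
```

The point of the cluster is that many mapper instances run concurrently, each on its own split of the input, and the framework routes all pairs with the same key to the same reducer.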

Now for the technical side: as I understand it, each of the machines in my cluster contains a DataNode and a TaskTracker. I imagine that a TaskTracker on a certain machine might require data for its calculations that is not present on the DataNode of that particular machine. So the main question is: how does a TaskTracker obtain the data it needs? Does it combine the data present on its local DataNode with data fetched from other DataNodes, or does it treat its local DataNode the same as every other DataNode in the cluster? And is all of the required data transferred to the TaskTracker before it starts?

Please shed some light on these questions, as it would really help me understand the fundamental principles of Hadoop. Should I have completely misunderstood the Hadoop workflow in the first place, please let me know, as that too would help me a lot.

Thank You


© 2019 Data Science Central ®