I have a data science background, so my goal in using Hadoop would be to store large amounts of data in HDFS and use a cluster to perform some (parallelized) analytics (e.g. some machine learning algorithms) on parts of these datasets. To be a bit more specific, consider the following case: for some large dataset stored in HDFS, I want to run a simple algorithm on, say, 100 random samples of this dataset and combine the results.
As I understand the concept, to achieve this I could write a Map function that tells the TaskTrackers on my cluster nodes to perform the analytic on part of the data, and a Reduce function to combine the results.
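To make the workflow I have in mind concrete, here is a minimal local sketch in plain Python. This is not actual Hadoop code; the function names (`map_task`, `reduce_task`) and the choice of analytic (a sample mean) are my own illustration of the map/reduce pattern I described, with the cluster replaced by a simple loop:

```python
# Sketch of the intended workflow: each "map" task computes a statistic
# on one random sample of the dataset, and a "reduce" step combines the
# per-sample results. All names here are illustrative, not Hadoop APIs.
import random

def map_task(sample):
    # Per-sample analytic; here, just the mean of the sample.
    return sum(sample) / len(sample)

def reduce_task(partials):
    # Combine the per-sample results, e.g. by averaging them.
    return sum(partials) / len(partials)

random.seed(0)
dataset = list(range(1000))                                  # stand-in for data in HDFS
samples = [random.sample(dataset, 50) for _ in range(100)]   # 100 random samples
partial_results = [map_task(s) for s in samples]             # "map" phase
combined = reduce_task(partial_results)                      # "reduce" phase
print(combined)
```

On a real cluster the map tasks would run on the nodes holding the relevant data blocks, which is exactly what my question below is about.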
Now for the technical side: as I understand it, each machine in my cluster contains a DataNode and a TaskTracker. I imagine that a TaskTracker on a certain machine might require data for its calculations that is not present on the DataNode of that particular machine. So the main question that arises is: how does a TaskTracker obtain its required data? Does it combine the data present on its local DataNode with data fetched from other DataNodes, or does it treat its local DataNode the same as all the other DataNodes in the cluster? Is all required data transferred to the TaskTracker in the first place?
Please shed some light on these questions, as it would really help me understand the fundamental principles of Hadoop. Should I have completely misunderstood the Hadoop workflow in the first place, please let me know, as that, too, would help me a lot.