Practical Apache Spark in 10 minutes. Part 2 - RDD
Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). An RDD is a fault-tolerant collection of elements that can be operated on in parallel. RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs.
Spark provides two ways to create RDDs: loading an external dataset and parallelizing an existing collection in the driver program.
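As a quick illustration, here is a minimal PySpark sketch of both approaches. It assumes pyspark is installed and running locally; the "data.txt" path is a hypothetical placeholder, so that line is left commented out.

```python
from pyspark.sql import SparkSession

# Start a local Spark session (the unified entry point since Spark 2.x).
spark = SparkSession.builder.appName("rdd-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Way 1: parallelize an existing collection from the driver program.
numbers = sc.parallelize([1, 2, 3, 4, 5])

# Way 2: load an external dataset. "data.txt" is a placeholder path;
# any Hadoop-supported source (e.g. an hdfs:// URI) would work here too.
# lines = sc.textFile("data.txt")

# Transformations build new RDDs lazily; actions trigger the computation.
squares = numbers.map(lambda x: x * x)  # transformation (lazy)
print(squares.collect())                # action -> [1, 4, 9, 16, 25]

spark.stop()
```

Note that nothing is computed until the `collect()` action runs; `map` only records how the new RDD derives from its parent, which is also what makes RDDs fault-tolerant.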