Parallelize R Code Using Apache® Spark™

R is the most recently added language in Apache Spark, and the SparkR API differs slightly from PySpark's. SparkR's evolving interface to Apache Spark offers Data Scientists and Statisticians a wide range of APIs and capabilities. As of Spark 2.0 and subsequent releases, the R API officially supports executing user code on distributed data, primarily through a family of apply() functions: dapply(), gapply(), and spark.lapply().
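For readers who want a feel for the API before the webinar, here is a minimal sketch of dapply(), which applies an R function to each partition of a Spark DataFrame. This is illustrative only, not material from the webinar itself; it assumes a running Spark 2.0+ session, and the kpl column and unit-conversion logic are hypothetical examples.

    library(SparkR)
    sparkR.session()  # assumes Spark 2.0+ is available locally or on a cluster

    df <- createDataFrame(mtcars)

    # dapply() runs the function once per partition; each partition arrives
    # as a plain R data.frame, and the output schema must be declared up front.
    schema <- structType(structField("mpg", "double"),
                         structField("kpl", "double"))

    converted <- dapply(df, function(part) {
      # part is an ordinary R data.frame; hypothetical miles-per-gallon
      # to kilometres-per-litre conversion
      data.frame(mpg = part$mpg, kpl = part$mpg * 0.425144)
    }, schema)

    head(collect(converted))

The schema-free variant, dapplyCollect(), skips the schema declaration and returns the result directly to the driver as an R data.frame, so it is only suitable when the output fits in driver memory.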

In this Data Science Central webinar, we will:

● Provide an overview of this new functionality in SparkR.

● Show how to use this API, with only minor changes to regular R code, via dapply().

● Focus on how to correctly use this API to parallelize existing R packages (see the spark.lapply() sketch after this list).

● Consider performance and examine correctness when using the apply family of functions in SparkR.
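As a taste of the third point, a common pattern for parallelizing an existing R package is spark.lapply(), which works like lapply() but runs the function on each list element on a Spark worker. Below is a minimal sketch, not the webinar's own example; it assumes the e1071 package is installed on every worker node, and the hyperparameter grid is hypothetical.

    library(SparkR)
    sparkR.session()

    # Hypothetical hyperparameter grid: one SVM cost value per Spark task
    costs <- c(0.1, 1, 10, 100)

    # spark.lapply() ships each list element to a worker and runs the function
    # there, like a distributed lapply(); any packages the function uses
    # (here e1071) must already be installed on every worker.
    summaries <- spark.lapply(costs, function(cost) {
      model <- e1071::svm(Species ~ ., data = iris, cost = cost)
      summary(model)
    })

    length(summaries)  # one fitted-model summary per cost value

One correctness caveat this pattern carries: all results are collected back to the driver as ordinary R objects, so spark.lapply() is appropriate when the combined output fits on a single machine.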

Speaker: Hossein Falaki, Software Engineer -- Databricks Inc.

Hosted by: Bill Vorhies, Editorial Director -- Data Science Central
