All Videos Tagged Spark (Data Science Central) - Data Science Central 2019-08-24T02:21:17Z https://www.datasciencecentral.com/video/video/listTagged?tag=Spark&rss=yes&xn_auth=no DSC Webinar Series: From Pandas to Apache Spark™ tag:www.datasciencecentral.com,2019-07-03:6448529:Video:851584 2019-07-03T19:18:10.864Z Tim Matteson https://www.datasciencecentral.com/profile/2edcolrgc4o4b <a href="https://www.datasciencecentral.com/video/dsc-webinar-series-from-pandas-to-apache-spark"><br /> <img src="https://storage.ning.com/topology/rest/1.0/file/get/3189210470?profile=original&amp;width=240&amp;height=135" width="240" height="135" alt="Thumbnail" /><br /> </a><br />***Please be aware there is a slight audio issue from approximately 10:45-13:00 in the recording***<br /> <br /> Presenting Koalas, a new open-source project unveiled by Databricks that brings the simplicity of pandas to the scalability of Apache Spark™.<br /> <br /> Data science with Python has exploded in popularity over the past few years, and pandas has emerged as the linchpin of the ecosystem. When data scientists get their hands on a data set, pandas is usually the first exploration tool they reach for. It is the ultimate tool for data wrangling and analysis.
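As a minimal sketch of the pandas-style workflow Koalas is designed to scale (the data set and column names below are hypothetical, and the Koalas import is shown in a comment only as the intended drop-in swap, not as code from the webinar):

```python
import pandas as pd

# Hypothetical small data set. With Koalas, the idea is that the same
# pandas-style code runs on Spark by swapping the import, e.g.:
#   import databricks.koalas as ks   # then use ks.DataFrame instead of pd.DataFrame
df = pd.DataFrame({"city": ["NYC", "SF", "NYC"], "sales": [10, 20, 30]})

# Typical pandas-style wrangling: group and aggregate
totals = df.groupby("city")["sales"].sum()
print(totals.to_dict())  # {'NYC': 40, 'SF': 20}
```

The appeal is that the wrangling code itself is unchanged; only the import (and the execution engine underneath) differs.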
In fact, pandas’ read_csv is often the very first command students run in their data science journey.<br /> <br /> The problem? pandas does not scale well to big data. It was designed for small data sets that a single machine could handle. On the other hand, Apache Spark has emerged as the de facto standard for big data workloads. Today, many data scientists use pandas for coursework and small-data tasks. When they work with very large data sets, they either have to migrate their code to PySpark’s close but distinct API or downsample their data until it is small enough for pandas.<br /> <br /> Now, with Koalas, data scientists get the best of both worlds and can make the transition from a single machine to a distributed environment without needing to learn a new framework.<br /> <br /> In this latest Data Science Central webinar, the developers of Koalas will show you how:<br /> <br /> Koalas removes the need to decide whether to use pandas or PySpark for a given data set<br /> For work that was initially written in pandas for a single machine, Koalas allows data scientists to scale up their code on Spark by simply switching out pandas for Koalas<br /> Koalas unlocks big data for more data scientists in an organization, since they no longer need to learn PySpark to leverage Spark<br /> <br /> Speakers:<br /> Tony Liu, Product Manager, Machine Learning - Databricks<br /> Tim Hunter, Sr.
Software Engineer and Technical Lead, Co-Creator of Koalas - Databricks<br /> <br /> Hosted by:<br /> Stephanie Glen, Editorial Director - Data Science Central Parallelize R Code Using Apache® Spark™ tag:www.datasciencecentral.com,2017-08-15:6448529:Video:607234 2017-08-15T23:37:42.031Z Tim Matteson https://www.datasciencecentral.com/profile/2edcolrgc4o4b <a href="https://www.datasciencecentral.com/video/parallelize-r-code-using-apache-spark"><br /> <img src="https://storage.ning.com/topology/rest/1.0/file/get/2781530416?profile=original&amp;width=240&amp;height=135" width="240" height="135" alt="Thumbnail" /><br /> </a><br />R is the latest language added to Apache Spark, and the SparkR API is slightly different from PySpark. SparkR’s evolving interface to Apache Spark offers a wide range of APIs and capabilities to Data Scientists and Statisticians. With the release of Spark 2.0, and subsequent releases, the R API officially supports executing user code on distributed data.
This is done primarily through a family of apply() functions.<br /> <br /> In this Data Science Central webinar, we will:<br /> <br /> ● Provide an overview of this new functionality in SparkR.<br /> <br /> ● Show how to use this API, with some changes to regular R code, via dapply().<br /> <br /> ● Focus on how to correctly use this API to parallelize existing R packages.<br /> <br /> ● Consider performance and examine correctness when using the apply family of functions in SparkR.<br /> <br /> Speaker: Hossein Falaki, Software Engineer -- Databricks Inc.<br /> <br /> Hosted by: Bill Vorhies, Editorial Director -- Data Science Central How to Keep Your R Code Simple While Tackling Big Datasets tag:www.datasciencecentral.com,2017-02-14:6448529:Video:525978 2017-02-14T23:16:25.169Z Tim Matteson https://www.datasciencecentral.com/profile/2edcolrgc4o4b <a href="https://www.datasciencecentral.com/video/how-to-keep-your-r-code-simple-while-tackling-big-datasets"><br /> <img src="https://storage.ning.com/topology/rest/1.0/file/get/2781529731?profile=original&amp;width=240&amp;height=135" width="240" height="135" alt="Thumbnail" /><br /> </a><br />R, TERR, Spark and Python are tools that benefit from larger systems.
Software-Defined Servers enable data scientists to match their processing system to the size of a particular data problem. In this Data Science Central webinar, you will learn how Software-Defined Servers work in practice for several common data science tools, and will explore how removing core and memory constraints has multiple, profound, and positive implications for application developers tackling big data problems of all kinds.<br /> <br /> Speaker: Michael Berman, Vice President of Engineering -- TidalScale<br /> <br /> Hosted by: Bill Vorhies, Editorial Director -- Data Science Central Jump over the Data Preparation Hurdle with Spark tag:www.datasciencecentral.com,2015-12-01:6448529:Video:356030 2015-12-01T22:41:15.301Z Tim Matteson https://www.datasciencecentral.com/profile/2edcolrgc4o4b <a href="https://www.datasciencecentral.com/video/jump-over-the-data-preparation-hurdle-with-spark"><br /> <img src="https://storage.ning.com/topology/rest/1.0/file/get/2781531103?profile=original&amp;width=240&amp;height=180" width="240" height="180" alt="Thumbnail" /><br /> </a><br />Data scientists don’t scale. By using them for manual data preparation, you’re missing a huge opportunity to extract the most value from your intellectual assets.<br /> <br /> The good news?
By automating and accelerating much of this raw data crunching and ETL work, you enable non-data scientists to do data preparation rapidly and simply, so they can ask their own questions and find their own answers. What’s more, in this new Big Data Discovery environment, answers come in minutes, not months. Data scientists are then free to focus on Spark-driven advanced analytics that yield game-changing answers.<br /> <br /> In this next DSC webinar, you will learn:<br /> <br /> How to automate your data integration process to set up your organization to be truly data-driven<br /> How to manage your data as a self-service feature at the speed of thought<br /> How to unearth big insights that impact the bottom line in the most efficient cycles.<br /> Speaker: Josh Och -- Platfora<br /> <br /> Hosted by: Bill Vorhies, Editorial Director -- Data Science Central Let Spark Fly: Advantages and Use Cases for Spark on Hadoop tag:www.datasciencecentral.com,2014-04-29:6448529:Video:165232 2014-04-29T21:41:20.603Z Tim Matteson https://www.datasciencecentral.com/profile/2edcolrgc4o4b <a href="https://www.datasciencecentral.com/video/let-spark-fly-advantages-and-use-cases-for-spark-on-hadoop"><br /> <img src="https://storage.ning.com/topology/rest/1.0/file/get/2781551520?profile=original&amp;width=240&amp;height=180" width="240" height="180" alt="Thumbnail" /><br /> </a><br />Apache Spark is currently one of the most active projects in the Hadoop ecosystem, and as such, there has been plenty of hype about it in recent months. But how much of the discussion is marketing spin, and what are the facts? MapR and Databricks, the company that created and led the development of the Spark stack, will cut through the noise to uncover practical advantages of having the full set of Spark technologies at your disposal, and reveal the benefits of running Spark on Hadoop.<br /> <br /> In today’s webinar, you will get a quick introduction to the Spark ecosystem and learn:<br /> • How you can leverage the enhanced functionality Spark provides to Hadoop to solve for your specific use cases<br /> • How easy it is to develop applications and models using Spark APIs<br /> • Why MapR is the only distribution that supports the complete Spark stack<br /> • What unique advantages you gain from having a complete Spark stack on Hadoop