
Migrating an Excel Spreadsheet to MySQL and to Spark 2.0.1 (Part 1)

Moving legacy data to a modern big data platform can be daunting at times. It doesn't have to be. In this short tutorial, we'll briefly review an approach and demonstrate it on my preferred data set: this isn't an ML repository or a Kaggle competition data set, simply the data I've accumulated over decades to keep track of my plastic model collection, and as such it definitely meets the legacy standard!

We'll describe the steps as performed in a VirtualBox virtual machine running Ubuntu 16.04.1 LTS (GNOME) on a laptop. Two steps are required:

  1. Import the .csv file into MySQL, and optionally back up the database as a compressed MySQL dump file (see the first sketch after this list).
  2. Connect to the MySQL database from Spark 2.0.1 and access the data: we'll demonstrate an interactive Python approach using Jupyter and PySpark in this post (see the second sketch below) and leave RStudio sparklyr access based on existing methods for another post.
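
As a rough sketch of step 1, here is one way to script the import using the mysql-connector-python package; the database, table, column names, and file path are illustrative placeholders, not the actual schema of the collection:

```python
import mysql.connector

# Requires local_infile enabled on the MySQL server.
conn = mysql.connector.connect(
    host="localhost", user="demo", password="secret",
    database="models", allow_local_infile=True)
cur = conn.cursor()

# One column per spreadsheet field; every type must be declared up front.
cur.execute("""
    CREATE TABLE IF NOT EXISTS kits (
        brand       VARCHAR(64),
        kit_scale   VARCHAR(16),
        subject     VARCHAR(128),
        year_bought INT
    )""")

# Bulk-load the CSV export of the spreadsheet, skipping the header row.
cur.execute("""
    LOAD DATA LOCAL INFILE '/home/demo/kits.csv'
    INTO TABLE kits
    FIELDS TERMINATED BY ',' ENCLOSED BY '"'
    LINES TERMINATED BY '\\n'
    IGNORE 1 LINES""")
conn.commit()
conn.close()
```

For the optional backup, piping mysqldump through gzip produces the compressed database file.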
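
And a minimal sketch of step 2, reading the table back into a Spark DataFrame over JDBC from a Jupyter PySpark session; connection details and names are again placeholders, and the MySQL Connector/J jar must be on Spark's classpath:

```python
from pyspark.sql import SparkSession

# Started e.g. with: pyspark --packages mysql:mysql-connector-java:5.1.40
spark = SparkSession.builder.appName("kits-demo").getOrCreate()

# Read the whole MySQL table into a Spark DataFrame over JDBC.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/models")
      .option("dbtable", "kits")
      .option("user", "demo")
      .option("password", "secret")
      .load())

df.printSchema()
df.groupBy("brand").count().show()  # e.g. kit count per manufacturer
```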

There's really no need to abandon legacy data: migrating it to a new platform lets businesses extract and analyze data on a broader time scale, and opens new ways to leverage ML techniques, analyze results, and act on findings.

Additional methods for importing CSV data will be discussed in a forthcoming post.


Replies to This Discussion

Thanks Marc,

Interesting post. Are there any limitations in MySQL in terms of number of columns, or should you denormalize your data?

Thanks,

Mouloud

Mouloud,

Thanks for the feedback and inquiry.

The only drawback I see is that you need to declare every field type... 

Look for an upcoming post where I propose a direct CSV read method into a pyspark.sql.DataFrame.
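
In the meantime, a minimal sketch of that direct route, assuming Spark 2.0's built-in CSV reader (the file path is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# inferSchema sidesteps declaring every field type by hand.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/home/demo/kits.csv"))
df.printSchema()
```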

Glad it is helping you...

Marc
