Moving legacy data to a modern big data platform can be daunting at times. It doesn’t have to be. In this short tutorial, we’ll briefly review an approach and demonstrate it on my preferred data set. This isn’t an ML repository or a Kaggle competition data set, simply the data I accumulated over decades to keep track of my plastic model collection, and as such it definitely meets the legacy standard!
We’ll describe the steps as followed on a laptop VirtualBox machine running Ubuntu 16.04.1 LTS GNOME. The following steps are then required:
There’s really no need to abandon legacy data: migrating data to a new platform will enable businesses to extract and analyze data on a broader time scale, and open new ways to leverage ML techniques, analyze results and act on findings.
Additional methods to import CSV data will be discussed in a forthcoming post.
Interesting post. Are there any limitations in MySQL in terms of the number of columns, or should you denormalize your data?
Thanks for the feedback and inquiry.
The only drawback I see is that you need to declare every field type...
Look for an upcoming post where I propose a direct CSV read method into a pyspark.sql.dataframe.
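On the drawback of declaring every field type: part of that work can be scripted. What follows is only a minimal sketch, not the method used in this post. It assumes a rectangular CSV with a header row, guesses a MySQL type for each column from the sample values, and prints a CREATE TABLE statement to review by hand. The table name `models`, the column names and the sample rows are all hypothetical.

```python
import csv
import io


def infer_mysql_type(values):
    """Pick a MySQL column type that fits every non-empty sample value."""
    values = [v for v in values if v != ""]
    if not values:
        return "VARCHAR(255)"  # nothing to go on, fall back to text
    # Try the narrowest type first, then widen.
    for caster, sql_type in ((int, "INT"), (float, "DOUBLE")):
        try:
            for v in values:
                caster(v)
            return sql_type
        except ValueError:
            continue
    # Anything else is stored as text sized to the longest sample.
    return "VARCHAR({})".format(max(len(v) for v in values))


def create_table_sql(csv_text, table_name):
    """Build a CREATE TABLE statement from CSV text with a header row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = []
    for i, name in enumerate(header):
        sample = [row[i] for row in data]  # assumes every row has all columns
        columns.append("  `{}` {}".format(name, infer_mysql_type(sample)))
    return "CREATE TABLE `{}` (\n{}\n);".format(table_name, ",\n".join(columns))


# Hypothetical sample data, for illustration only.
sample_csv = (
    "kit_id,name,scale,price\n"
    "1,Spitfire Mk.I,1/72,12.50\n"
    "2,Tiger I,1/35,39\n"
)
ddl = create_table_sql(sample_csv, "models")
print(ddl)
```

The guessed types are only a starting point: a column that looks numeric in a small sample may still need to be text (or a DECIMAL, or a DATE) once the full file is loaded, so the generated statement should be checked before running it.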
Glad it is helping you...
Thank you, Marc Borowczak, for this amazing post. Good job!