Moving legacy data to a modern big data platform can seem daunting at times. It doesn't have to be. In this short tutorial, we'll briefly review an approach and demonstrate it on my preferred data set. This isn't an ML repository or a Kaggle competition data set, simply the data I've accumulated over decades to keep track of my plastic model collection, and as such it definitely meets the legacy standard!
We'll describe the steps as performed on a VirtualBox virtual machine running Ubuntu 16.04.1 LTS (GNOME) on a laptop.
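To make the approach concrete, here is a minimal sketch of the kind of pipeline involved, assuming the CSV is first staged in a MySQL table with explicitly declared column types (as discussed in the comments below) and then read into Spark over JDBC. All names here (the table, database, credentials, file and jar paths) are hypothetical, not taken from the original post:

```python
# Minimal sketch, not the post's exact steps. Assumes the legacy CSV has
# already been staged in MySQL with explicitly typed columns (every field
# type must be declared, as noted in the comments), e.g.:
#
#   CREATE TABLE kits (kit_id INT, brand VARCHAR(64), scale VARCHAR(16),
#                      subject VARCHAR(128), year_bought INT);
#   LOAD DATA LOCAL INFILE 'models.csv' INTO TABLE kits
#     FIELDS TERMINATED BY ',' IGNORE 1 LINES;
#
# The staged table is then read into Spark over JDBC.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("legacy-csv-migration")
         # The MySQL Connector/J jar must be visible to Spark
         .config("spark.jars", "/path/to/mysql-connector-java.jar")
         .getOrCreate())

df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/modelsdb")
      .option("dbtable", "kits")
      .option("user", "spark_user")
      .option("password", "secret")
      .option("driver", "com.mysql.jdbc.Driver")
      .load())

df.printSchema()   # column types come from the MySQL declarations
df.show(5)         # first few rows of the legacy collection
```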
There's really no need to abandon legacy data: migrating it to a new platform enables businesses to extract and analyze data on a broader time scale, and opens new ways to leverage ML techniques, analyze results, and act on findings.
Additional methods to import CSV data will be discussed in a forthcoming post.
Thanks Marc,
Interesting post. Are there any limitations in MySQL in terms of the number of columns, or should you denormalize your data?
Thanks,
Mouloud
Mouloud,
Thanks for feedback and inquiry.
The only drawback I see is that you need to declare every field type...
Look for an upcoming post where I propose a direct CSV read method into a pyspark.sql.dataframe.
Glad it is helping you...
Marc
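For readers curious what such a direct read might look like, here is a rough sketch using Spark's built-in CSV reader; the file path and options are hypothetical, and the approach in the forthcoming post may differ:

```python
# Rough sketch of a direct CSV read into a pyspark.sql.dataframe.DataFrame,
# skipping the MySQL staging step entirely. The file path is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("direct-csv-read").getOrCreate()

df = spark.read.csv("models.csv",
                    header=True,       # first line holds column names
                    inferSchema=True)  # sample the data to guess types

print(type(df))    # <class 'pyspark.sql.dataframe.DataFrame'>
df.show(5)
```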