Subscribe to DSC Newsletter

Open Source automated ETL of web-based data sources

Hi,

I've been involved in bringing together a number of healthcare-related datasets into a bespoke online analytics service for several years.  Whilst the high volume, patient level data is handled in a pretty much fully automated process, an increasing number of datasets published by public bodies are also of interest to users of the service.

The datasets are typically updated on a monthly, quarterly or annual basis and I currently download them when my Outlook calendar the data might be due, and put each one through a transformation process with might contain steps in Excel (even if it is just save as csv!) then a Linux shell script to prep the data for loading in csv format to the data warehouse which uses a Big Data technology.

In terms of human effort, these datasets take far more effort that the millions of rows of patient-level data I have managed to automate.  The main issues are:

  • the data is often not published to schedule (so someone needs to remember to keep on checking!)
  • there are sometimes changes to:
    • the location the data is published
    • the structure of the data, and therefore the ETL scripts need to be updated
  • the semi-automated process is prone to human error and requires more QA checking
  • the creation of ETL scripts for each new dataset is quite time-consuming

So, I was really looking for a fresh perspective on how this type of activity could be done more efficiently and preferably using an Open Source solution.

Any bright ideas?

Thanks,

Phil

Views: 831

Reply to This

Replies to This Discussion

As part of a Big Data Analytics system we developed a few years ago (called LM Ensemble/LM Wisdom), we took part of the UI-ETL component and Open Sourced it.  It is still available as Open Source on GitHub.  The UI-ETL component was rebranded LM Streamflow (the name was NOT my idea).  Here is a link to it:  LM StreamFlow ( https://github.com/lmco/streamflow/wiki ) . Again, all the code is available in GitHub and I am uncertain if anyone is maintaining it.

Hello,

EplSite ETL might help you: https://eplsite-etl.blogspot.com

- Open Source.
- Easy to use.
- Low resource consuming.
- Just the necessary tools to do the job.
- Web interface.
- Very easy to customize it because it is developed in Perl, all the source code is included.
- It is possible to run transformations using cron jobs on Linux or task manager on Windows.

Airflow is probably the go-to open source tool in this space.

Thank you for those three responses - I will check them out.

If I were to relax my Open Source requirement and ask for recommendations including commercial tools, what are the best to check out?

Just starting to investigate this company.  See what you think

https://www.trifacta.com/products/wrangler-editions/

Have signed up for a trial of Wrangler - will let you know how it goes Phil. Thanks.

Hi Phil,

Have taken a brief look at Trifacta Wrangler free version including some of the training videos.  Seems that it only works on local files when importing - so not able to grab directly from the publisher's website.  Not necessarily a deal-breaker but if I could create and end-to-end flow it would be perfect.

Phil

Just a quick update - had a whistle-stop demo of Wrangler Pro and it does look pretty impressive.  The downside is that they want about 5000 USD per user per year plus 10000 USD server costs per annum.  Blows it out of the water for us pretty much.

Have been looking at Tableau Prep this week which, given that we have Tableau Desktop licenses, comes free as part of the deal.  The initial training videos show a very straightforward product to learn and use, but I'm concerned looking on their forum that it is not production ready yet.  May be too soon in the product cycle unfortunately.

Phil

Reply to Discussion

RSS

Follow Us

Resources

© 2018   Data Science Central ®   Powered by

Badges  |  Report an Issue  |  Privacy Policy  |  Terms of Service