
What kind of tools do data scientists and engineers use for integrating with external data sources?

Taking a typical problem where:

  • External data is available through an API;
  • A data scientist/engineer wants to poll it periodically and add the data to their own database/data warehouse.

What kind of tools are people using for this task (Airflow, bespoke scripts)?


Replies to This Discussion

I'd say we tend to see a few different trends.

Early on (junior level, small company, simple tasks, etc.) we see a lot of locally maintained scripts. Python running in Jupyter is an especially common one. Others include scripts on a server triggered by a cron job, Google Apps Script, and even Excel web URLs and macros. This is usually when one individual has been responsible for a lot of manual download/upload work and wants to automate some of it. It's still that one individual managing these simple automated jobs, but it at least saves them some time. As the task becomes more important or grows in scope/size, those small scripts become hard to maintain, and the data scientist/engineer will pursue specialized software or managed services, from the simple/less expensive (e.g. Zapier) to the robust/more expensive (Tableau). These solutions are usually still limited to the engineer's own account/desktop etc. and serve mostly to automate the tasks they're responsible for.
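To make the "small script plus cron" stage concrete, here is a minimal sketch of what such a job often looks like. Everything specific in it is an assumption for illustration: the endpoint URL, the JSON shape (a list of objects with `id` and `value` keys), the table name, and the use of SQLite as a stand-in for the database/data warehouse.

```python
import json
import sqlite3
import urllib.request

# Hypothetical endpoint; replace with the real API being polled.
API_URL = "https://api.example.com/records"


def fetch_records(url=API_URL):
    """Download a JSON array of objects with 'id' and 'value' keys (assumed shape)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def load_records(conn, records):
    """Upsert records so that re-running the job does not create duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records (id INTEGER PRIMARY KEY, value TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO records (id, value) VALUES (:id, :value)",
        records,
    )
    conn.commit()
    # Return the total row count as a simple sanity check for logs.
    return conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]


def run_once(db_path="warehouse.db", fetch=fetch_records):
    """One polling cycle: fetch from the API, then load into the local database."""
    conn = sqlite3.connect(db_path)
    try:
        return load_records(conn, fetch())
    finally:
        conn.close()
```

A script like this would typically end with a call to `run_once()` and be scheduled from cron (for example, hourly). Keying the insert on `id` keeps repeated polls idempotent, which is usually the first thing these ad hoc jobs need once they run unattended.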

As the importance of the jobs grows to enterprise level, it no longer makes sense to be so locally tied to that engineer, and the department will pursue larger-scale solutions (often cloud hosted) that can be deployed/accessed/managed at the department or enterprise level. Full disclosure: we (TMMData) are one of them.




© 2020   Data Science Central ®   Powered by
