
What kind of tools do data scientists and engineers use for integrating with external data sources?

Take a typical problem where:

  • External data exists on an API;
  • Data scientist/engineer wants to poll this periodically and add the data to their own database/data warehouse.

What kind of tools are people using for this task (Airflow, bespoke scripts)?

Replies to This Discussion

I'd say we tend to see a few different trends.

Early on (junior level, small company, simple tasks, etc.) we see a lot of locally maintained scripts: Python run from Jupyter is an especially common one, along with scripts on a server triggered by a cron job, Google Apps Script, and even Excel web URLs and macros. This is usually when one individual has been responsible for a lot of manual download/upload work and wants to automate some of it. It's still that one individual managing these simple automated jobs, but it at least saves them some time. As the task becomes more important or grows in scope/size, those small scripts become hard to maintain, and the data scientist/engineer will pursue specialized software or managed services, from the simpler/less expensive (e.g. Zapier) to the more robust/more expensive (e.g. Tableau). These solutions are still usually limited to the engineer's own account/desktop and serve mostly to automate the tasks they're responsible for.
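
For example, the cron-plus-script stage often looks something like the sketch below: pull JSON from an API and append it to a local SQLite table, with cron doing the scheduling. The endpoint URL and record shape here are just placeholders, not any specific API.

    # Minimal sketch of a locally maintained polling script.
    # Typically scheduled with cron, e.g.:  0 * * * *  /usr/bin/python3 poll_api.py
    import sqlite3

    import requests

    API_URL = "https://api.example.com/v1/measurements"  # placeholder endpoint


    def poll_once(db_path: str = "local_warehouse.db") -> None:
        resp = requests.get(API_URL, timeout=30)
        resp.raise_for_status()
        records = resp.json()  # assumed shape: a list of {"id", "value", "ts"} dicts

        conn = sqlite3.connect(db_path)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS measurements (id TEXT PRIMARY KEY, value REAL, ts TEXT)"
        )
        # INSERT OR IGNORE skips rows already loaded, so repeated polls don't duplicate data.
        conn.executemany(
            "INSERT OR IGNORE INTO measurements (id, value, ts) VALUES (:id, :value, :ts)",
            records,
        )
        conn.commit()
        conn.close()


    if __name__ == "__main__":
        poll_once()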

As the importance of the jobs grows to the enterprise level, it no longer makes sense to be so tied to that one engineer, and the department will pursue larger-scale solutions (often cloud hosted) that can be deployed/accessed/managed at the department or enterprise level. Full disclosure: we (TMMData) are one of them.

There are no standards associated with this, but this may help in solving the problem.

Scripts/programs written in programming languages (like Python, Java, etc.) can get the data by hitting the API endpoints and dumping it into relational databases, blob storage, files, data lakes, NoSQL databases, big-data clusters, etc. These scripts are then automated with cron jobs, bespoke schedulers, or by building pipelines through Apache Airflow, Zapier, and other such pipeline-building tools (a minimal Airflow-style sketch follows the list below). Various data integration tools are also used to create workflows/pipelines to load the data, such as:

  1. SQL Server Integration Services
  2. Pentaho
  3. Talend Studio
  4. AWS Glue
  5. Azure Data Factory
  6. IBM InfoSphere
  7. ArcESB
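
To make the Airflow route concrete, here is a minimal sketch of the same poll-and-load job as a DAG. It assumes Airflow 2.x-style imports; the endpoint URL, schedule, and loading step are placeholders, and a real job would hand the rows to a warehouse hook (PostgresHook, BigQueryHook, etc.) instead of just logging a count.

    # Minimal sketch: the poll-and-load job expressed as an Airflow DAG.
    from datetime import datetime

    import requests
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    API_URL = "https://api.example.com/v1/measurements"  # placeholder endpoint


    def extract_and_load(**context):
        resp = requests.get(API_URL, timeout=30)
        resp.raise_for_status()
        rows = resp.json()
        # Placeholder load step: a real pipeline would write `rows` to the
        # warehouse via a hook rather than printing.
        print(f"Fetched {len(rows)} rows")


    with DAG(
        dag_id="poll_external_api",
        start_date=datetime(2020, 1, 1),
        schedule_interval="@hourly",  # poll the API once an hour
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="extract_and_load",
            python_callable=extract_and_load,
        )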

 
