What kind of tools do data scientists and engineers use for integrating with external data sources?

Taking a typical problem where:

  • External data is available via an API;
  • The data scientist/engineer wants to poll it periodically and add the data to their own database/data warehouse.

What kind of tools are people using for this task (Airflow, bespoke scripts)?


Replies to This Discussion

I'd say we tend to see a few different trends.

Early on (junior level, small company, simple tasks, etc.) we see a lot of locally maintained scripts: Python running in Jupyter is an especially common one, scripts on a server triggered by a cron job, Google Apps Script, even Excel web queries and macros. This usually happens when one individual has been responsible for a lot of manual download/upload work and wants to automate some of it. It's still that one individual managing these simple automated jobs, but it at least saves them some time.

As the task becomes more important or grows in scope/size, those small scripts get hard to maintain, and the data scientist/engineer will pursue specialized software or managed services, from the simple/less expensive (e.g. Zapier) to the robust/more expensive (e.g. Tableau). These solutions are still usually limited to the engineer's own account/desktop and serve mostly to automate the tasks they're responsible for.
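The cron-triggered script pattern described above can be sketched roughly like this. The API URL and table schema are made up for illustration; the load is written as an upsert so a re-run of the job doesn't duplicate rows:

```python
import json
import sqlite3
import urllib.request

API_URL = "https://api.example.com/v1/records"  # hypothetical endpoint


def fetch_records(url=API_URL):
    """Pull the latest batch of records from the external API.

    Assumes the endpoint returns a JSON list of objects like
    {"id": ..., "value": ...}.
    """
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def upsert_records(conn, records):
    """Idempotent load: inserting the same ids twice updates in place
    instead of duplicating rows, so the cron job can safely re-run."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records (id INTEGER PRIMARY KEY, value TEXT)"
    )
    conn.executemany(
        "INSERT INTO records (id, value) VALUES (:id, :value) "
        "ON CONFLICT(id) DO UPDATE SET value = excluded.value",
        records,
    )
    conn.commit()


# In production the two pieces would be wired together in a small script
# and scheduled via cron, e.g. hourly:
#   0 * * * *  /usr/bin/python3 /opt/jobs/poll_api.py
# where poll_api.py ends with:
#   upsert_records(sqlite3.connect("warehouse.db"), fetch_records())
```

SQLite stands in here for whatever database/warehouse the team actually uses; the `ON CONFLICT ... DO UPDATE` upsert idea carries over to Postgres and most warehouses.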

As the importance of the jobs grows to enterprise level, it no longer makes sense for them to be tied to a single engineer, and the department will pursue larger-scale solutions (often cloud hosted) that can be deployed/accessed/managed at the department or enterprise level. Full disclosure: we (TMMData) are one of those vendors.

There is no single standard for this, but the following may help in solving the problem.

Scripts/programs written in languages like Python or Java can fetch the data by hitting the API endpoints and dumping it into relational databases, blob storage, files, data lakes, NoSQL databases, big-data clusters, etc. These scripts are then automated with cron jobs, bespoke schedulers, or by building pipelines in Apache Airflow, Zapier, and other pipeline-building tools. Various data integration tools are also used to create workflows/pipelines to load the data, such as:

  1. SQL Server Integration Services
  2. Pentaho
  3. Talend Studio
  4. AWS Glue
  5. Azure Data Factory
  6. IBM InfoSphere
  7. ArcESB
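What Airflow and the tools listed above add on top of a bare script is mostly orchestration: ordering the steps, retrying failures, and scheduling runs. A toy version of that extract → transform → load chaining with retries might look like this (plain Python, not any particular tool's API — in Airflow each step would become a task in a DAG):

```python
import time


def run_pipeline(tasks, retries=2, delay=0):
    """Run named extract/transform/load callables in order, feeding
    each step's result into the next; retry a failing step up to
    `retries` times before giving up.

    This is a ~20-line sketch of the scheduling/retry behavior that
    pipeline tools like Airflow provide out of the box.
    """
    data = None
    for name, task in tasks:
        for attempt in range(retries + 1):
            try:
                data = task(data)
                break  # step succeeded, move to the next one
            except Exception:
                if attempt == retries:
                    raise  # out of retries: surface the failure
                time.sleep(delay)
    return data


# Usage: a trivial three-step pipeline (dummy data for illustration).
steps = [
    ("extract", lambda _: [{"id": 1, "v": 10}, {"id": 2, "v": 5}]),
    ("transform", lambda rows: [r["v"] * 2 for r in rows]),
    ("load", lambda vals: sum(vals)),  # stand-in for a DB write
]
result = run_pipeline(steps)
```

The real tools add the parts deliberately left out here: cron-like schedules, parallel branches, logging, alerting, and a UI.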


A lot of the cloud-based vendors are including custom data connectors for third-party apps in their products. These let a user enter their login info for the cloud app and connect to its API through the user interface, pulling and integrating data at regular intervals.
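Under the hood, such a connector mostly just attaches the stored credential to each scheduled request. A bare-bones sketch (the endpoint and token are placeholders; real connectors often use OAuth flows or API keys instead of a plain bearer token):

```python
import json
import urllib.request


def build_request(url, token):
    """Attach the user's stored credential as a bearer token header,
    roughly what a vendor's connector does after login info is entered."""
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )


def pull(url, token):
    """Fetch and decode one JSON payload; called on a schedule to poll."""
    with urllib.request.urlopen(build_request(url, token), timeout=30) as resp:
        return json.load(resp)
```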


© 2021 TechTarget, Inc.