The following advice is built from my experience working as a data scientist on a variety of projects across different data & engineering teams. Many data scientists (myself included) do not come from a computer science or software development background, so may not have formal training or good habits in code writing. These tips should help data scientists work collaboratively to write good code and build models in a way that will be easier to productionize.
This is important for both collaboration and backups. It allows you to track the changes to a project as it undergoes development, useful for coordinating tasks and encouraging due diligence. Git is a powerful version control software, with the ability to branch parts of the development, track and commit changes, push and fetch from remote respositories, and merge code pieces together overcoming conflicts as necessary.
A key component of collaborative coding is the ability to hand it over to other developers for review and use, meaning it has to be readable. This includes using appropriate variable and function names with explanatory comments where necessary, and regular inclusion of docstrings that introduce the piece of code and its details. It is also important to follow the relevant style guide for the language you’re using, e.g., PEP-8 in Python.
When writing code it’s important to keep it modular. That is, to break it up into smaller pieces that execute separate tasks as part of the overall algorithm. This level of functionality makes it easy to:
Try to consider what tests can be written alongside your code in order to check the validity of your assumptions and logic. These tests can be anything from a simulation of the expected inputs and outputs, to a series of unit tests to check the code functionality. A unit test generally exercises the functionality of the smallest possible unit of code (which could be a method, class, or component) in a repeatable way. For example, if you are unit testing a class, your test might check that the class is in the right state. Typically, the unit of code is tested in isolation: your test affects and monitors changes to that unit only. Ideally this forms part of a “Test Driven Development” framework for encouraging that all pieces of software are fully reviewed and tested before being integrated or deployed, minimising time spent refactoring and debugging later on.
Try to write your code as if you’re putting it into production. This will form good habits as well as make it easy to scale-up when it inevitably (hopefully) does go into production.
Consider “algorithm efficiency” and try to optimise to reduce runtime and memory use. “Big-O notation” is important here.
Also consider your code environment or ecosystem and avoid dependencies. Maybe use virtualisation either at the code level (e.g. Python virtualenv) or at the operating system level (e.g. Docker containers).
Production level code should also employ “logging” to make it easy to review, inspect and diagnose issues when executing the code.