Common Mistakes When Outsourcing Private Data and How to Avoid Them

Many companies choose to outsource their ML solutions. It makes sense, since AI development demands unique and hard-to-obtain expertise and experience. That is why it’s better to work with a team that specializes in this kind of development.

However, if you want to make a custom machine learning model, you need to provide the outsourcing team with your data. And this is where problems start. How do you pass your sensitive data to external parties without putting the security and privacy of your clients at risk?

At Serokell, we often meet clients concerned about how their data is going to be used. I talked to Ivan Markov, Head of the Data Science department, to prepare this guide to answer the most common questions and help you to feel protected when working with external teams.

Three common data-related scenarios when working with clients

First, let’s talk about how machine learning works. In ML, we use algorithms that run on data and learn from it. Data is therefore essential: without it, you will not get the result you want.

When working with clients, ML teams often face one of three unpleasant scenarios:

Data doesn’t exist.
Data is open-source.
Data is confidential.

The first scenario is the most common and also the most complicated. As people say, data is the new gold. You can’t just find what you need on the internet for a custom model tailored to one specific business. Unfortunately, when we face a situation like this, we have to decline the project.

The open-source scenario is slightly better. The data is already there, and anyone can access it. But let’s say you decide to google photos of random people. If you’re training a model just for fun and don’t tell anyone, any AI ethicist will tell you that it’s morally wrong, although it’s hard for the authorities to find out that you’re doing it. But what if you want to create a commercial face recognition system? These people didn’t consent to having a face recognition model trained on their photos, so you and your company can get into serious trouble. Even Facebook had to face legal consequences and delete its database of scraped Instagram photos.

So when you take open-source data, it’s always important to know which license covers it. Depending on the license, using the data for commercial purposes may be prohibited. Of course, if your code isn’t open-source, somebody would have to prove that you used the data illegally, and it’s not that easy to catch you. Still, getting caught will stain your reputation for good. We don’t recommend it.

Finally, there is a third scenario. The client comes to you, and they have data, but they ask you to build a model without handing this data over. As you can guess, that’s extremely hard, and not many data scientists can or are willing to do it. There can be various reasons for this approach: the client has sensitive data, wants to protect their customers’ privacy, or has something to hide. We don’t know. The problem is that it’s hard to build a model that gives reproducible results without seeing the data. You have to be sure that the data you train the model on is similar or identical to the data the client actually has. Otherwise, it won’t work.

What is the alternative to these terrible situations?

There are several things that can help you deal with each of these scenarios successfully.

Know what personal data is

General education of both sides usually helps. The developer should be transparent about what they are going to do with the data, and the client needs to know how to protect themselves in case something goes wrong. Usually, a well-made contract and an NDA are what you need. With these in place, both sides understand that if the client’s private information leaks onto the internet, there will be lawsuits. The contract should specify precisely where the personal data is. Quite often, at this stage, both sides discover that private data is not needed at all! The ML team doesn’t need your customers’ names, gender, or age: all the necessary information can be extracted from transactions in anonymized form!

Learn how to do anonymization well

How does anonymization work? Let’s take retail, for example. It is necessary to anonymize loyalty or credit card numbers. An excellent solution is a keyed cryptographic hash function, which turns each card number into a string of digits and letters. A hash is one-way, so nobody can simply translate it back; only the client, who holds the secret key and the original card numbers, can recompute the mapping and link a string to a card. Without the key, the outsourcing team cannot associate these strings with a real person.
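As a rough illustration of the idea, here is a minimal Python sketch that replaces card numbers with keyed hashes. It assumes HMAC-SHA256 as the hash function; the function name `pseudonymize_card`, the field names, and the sample transactions are hypothetical, and in a real project the key would be generated and stored by the client only.

```python
import hashlib
import hmac
import secrets


def pseudonymize_card(card_number: str, secret_key: bytes) -> str:
    """Map a card number to a stable, opaque token using HMAC-SHA256."""
    # Keep digits only, so "4111 1111" and "4111-1111" produce the same token.
    digits = "".join(ch for ch in card_number if ch.isdigit())
    return hmac.new(secret_key, digits.encode("utf-8"), hashlib.sha256).hexdigest()


# Assumption: the client generates this key and never shares it.
client_key = secrets.token_bytes(32)

# Made-up sample data (test card numbers, not real customers).
transactions = [
    {"card": "4111 1111 1111 1111", "amount": 25.90},
    {"card": "4111-1111-1111-1111", "amount": 7.50},
    {"card": "5500 0000 0000 0004", "amount": 120.00},
]

# What the outsourcing team receives: tokens instead of card numbers.
anonymized = [
    {"customer_token": pseudonymize_card(t["card"], client_key), "amount": t["amount"]}
    for t in transactions
]
```

The same card always maps to the same token, so a customer’s purchases stay linked for the model, while the team that receives `anonymized` never sees a card number. A key is used rather than a plain hash because card numbers have a limited range of values, and an unkeyed hash of them could be reversed by trying every possible number.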

There are cases when models need actual personal data, for example, in medicine. It is possible to infer a patient’s sex from an MRI, but inferring age is much harder, and you usually need age for a diagnosis. There is a way out: divide people into age groups, such as 18-24 and 25-36, so that every patient’s age falls into one of the classes. You don’t even need to label these groups openly; call them a, b, and c. This is enough for the model to take age information into account. You still need the patient’s formal consent, though (usually, patients sign this form at check-in).
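The age grouping described above can be sketched in a few lines of Python. Only the 18-24 and 25-36 groups come from the text; the remaining boundaries, the extra labels, and the function name `age_to_group` are assumptions made for the sake of the example.

```python
from bisect import bisect_right

# Lower bound of each age group. 18 and 25 follow the groups in the text;
# 37, 50, and 65 are hypothetical boundaries chosen for illustration.
AGE_LOWER_BOUNDS = [18, 25, 37, 50, 65]
# Opaque labels: the model sees "a", "b", ... and never the actual age.
AGE_LABELS = ["a", "b", "c", "d", "e"]


def age_to_group(age: int) -> str:
    """Replace an exact age with the label of the group it falls into."""
    if age < AGE_LOWER_BOUNDS[0]:
        raise ValueError(f"age {age} is below the youngest group")
    # bisect_right counts how many lower bounds are <= age.
    index = bisect_right(AGE_LOWER_BOUNDS, age) - 1
    return AGE_LABELS[index]


groups = [age_to_group(age) for age in [19, 30, 36, 64]]
# groups == ["a", "b", "b", "d"]
```

Wider groups hide more about each patient but give the model less to work with, so the boundaries are worth agreeing on with the client before the data is exported.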

Learn to use remote server access well

Many companies rely on remote server access. In this case, they give access via SSH, and the developer can only execute commands on the server, which has no Internet access. For the team, this is highly inconvenient: you don’t see a screen, and for an ML engineer, seeing and visualizing the data is essential for development speed. You will probably still find people who agree to work this way. The problem is that setting up a remote desktop protocol correctly is quite tricky. You need to make sure that the communication only goes one way, and you need to know what you are doing to fine-tune everything. Meanwhile, none of this is usually required if you did the anonymization right.

Conclusion

So, summing up, what are the major mistakes when outsourcing private data?

Badly made anonymization. Double-check that every field matches its expected type and format.
Messed-up remote access. Traffic control is either expensive or complicated, so don’t do it if you’re unsure you can do it right.
Overdone anonymization. In this case, you can’t learn anything from the data.
Poorly drafted contract. Write down what counts as disclosure, what kind of data is not allowed, and how much the turnover is. Consult a specialist who will advise you on how to do it right.
Illegally used data. If you use data illegally, you cannot hand it over to anyone. If someone reports you, that’s it for you and your business. In medicine, the law does not even let you give data to trustees, even if it’s not open-source.
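
The first mistake in the list, fields that don’t match what they are supposed to contain, is the easiest one to check automatically before any data leaves the client. Below is a minimal Python sketch of such a check. The schema, the field names (`customer_token`, `age_group`, `amount`), and the function `find_problems` are all hypothetical; a real dataset would need its own rules.

```python
import re

# Assumption: tokens are 64-character hex digests, e.g. from SHA-256.
TOKEN_PATTERN = re.compile(r"[0-9a-f]{64}")
# Rough heuristic for a raw card number: 13 to 19 digits,
# optionally separated by spaces or dashes.
CARD_LIKE_PATTERN = re.compile(r"\b(?:\d[ -]?){13,19}\b")

# Hypothetical schema of an anonymized record.
EXPECTED_SCHEMA = {"customer_token": str, "age_group": str, "amount": float}


def find_problems(record: dict) -> list[str]:
    """Return a list of human-readable problems found in one record."""
    problems = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}")
    # Extra fields are suspicious: they may carry forgotten personal data.
    for field in record:
        if field not in EXPECTED_SCHEMA:
            problems.append(f"unexpected field: {field}")
    token = record.get("customer_token")
    if isinstance(token, str) and not TOKEN_PATTERN.fullmatch(token):
        problems.append("customer_token is not a 64-character hex digest")
    # Scan every text value for something that looks like a card number.
    for field, value in record.items():
        if isinstance(value, str) and CARD_LIKE_PATTERN.search(value):
            problems.append(f"{field} looks like a raw card number")
    return problems
```

Running `find_problems` over every record and refusing to export the dataset if any list is non-empty is a cheap safeguard, though it only catches the patterns it has been told about.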