Big data promises new opportunities for official statistics, but it also raises confidentiality concerns. As massive data sets are made available, re-identification becomes a real threat to confidentiality.

What if we allowed people to link different datasets for their studies, but limited the number of datasets they can link? It would then never be possible to assemble all the records on an individual, even though any combination of a fixed number of those records could still be obtained.

Could the approach used by cryptocurrencies make it possible, but hard, to match records from different surveys? Records from different datasets could then be linked, but each link would be so costly that only a limited number of datasets could be linked in any one study. Confidentiality would be protected, while the results of different surveys could still be linked when a given study needs it.
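To make "possible, but hard" concrete, here is a minimal sketch of the proof-of-work idea that cryptocurrencies rely on: search for a nonce whose SHA-256 hash starts with a given number of zero bits. The function names and the `difficulty_bits` parameter are illustrative assumptions, not part of any proposal above. Each extra bit roughly doubles the expected work, while checking a result takes a single hash.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Number of zero bits at the start of a digest."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()


def mine(payload: bytes, difficulty_bits: int):
    """Try nonces 0, 1, 2, ... until the hash has enough leading zero bits.

    Returns (nonce, attempts). Expected attempts are about 2 ** difficulty_bits.
    """
    for nonce in count():
        digest = hashlib.sha256(payload + nonce.to_bytes(8, "big")).digest()
        if leading_zero_bits(digest) >= difficulty_bits:
            return nonce, nonce + 1


def verify(payload: bytes, nonce: int, difficulty_bits: int) -> bool:
    """Checking a claimed nonce costs one hash, however hard it was to find."""
    digest = hashlib.sha256(payload + nonce.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= difficulty_bits


if __name__ == "__main__":
    nonce, attempts = mine(b"link s1 with s2", difficulty_bits=12)
    print(nonce, attempts, verify(b"link s1 with s2", nonce, 12))
```

The asymmetry between `mine` and `verify` is what would let a data provider set a price on each link without having to do the work itself.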

Let's assume that a unique identifier can be built for an individual by combining their birthdate, birthplace, etc. Anyone can compute this number if the required information on the individual is available. It is a unique and universal identifier for the individual. Let's call it the UUI.
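As a rough sketch of how such an identifier might be computed, the snippet below hashes a few normalized attributes with SHA-256. The choice of fields, the normalization rules, and the name `compute_uui` are assumptions made for illustration. Whether these attributes are actually unique for every individual is taken as given here, as in the text.

```python
import hashlib
import unicodedata


def normalize(value: str) -> str:
    """Strip accents, case and extra whitespace so equivalent spellings match."""
    text = unicodedata.normalize("NFKD", value)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return " ".join(text.lower().split())


def compute_uui(birthdate: str, birthplace: str, *other: str) -> str:
    """Hypothetical UUI: SHA-256 over the normalized attributes."""
    fields = [normalize(f) for f in (birthdate, birthplace, *other)]
    # A separator between fields keeps ("ab", "c") distinct from ("a", "bc").
    payload = "\x1f".join(fields).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


if __name__ == "__main__":
    print(compute_uui("1980-02-29", "Yaoundé"))
```

Because no secret is involved, anyone holding the same attributes obtains the same value, which is the property the UUI is assumed to have.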

When a survey s1 is conducted, an algorithm is used to generate an identifier ID_s1 from the UUI. The data of survey s1 are published with the identifier ID_s1.

In the same way, when a survey s2 is conducted, the same algorithm is used to generate an identifier ID_s2 from the UUI. The data of survey s2 are published with the identifier ID_s2.
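One possible reading of "the same algorithm" is a keyed hash that takes the survey name as an extra input, so that one UUI yields unrelated-looking identifiers in s1 and s2. The sketch below uses HMAC-SHA-256 with a secret held by the statistical agency. The `agency_key` and the function name `survey_id` are assumptions of this example. It only illustrates deriving the identifiers; on its own it offers no way to link them without the key.

```python
import hashlib
import hmac


def survey_id(agency_key: bytes, survey: str, uui: str) -> str:
    """Hypothetical per-survey identifier derived from the UUI.

    A subkey is derived for each survey, so the same UUI maps to
    different identifiers in different surveys.
    """
    subkey = hmac.new(agency_key, b"survey:" + survey.encode("utf-8"),
                      hashlib.sha256).digest()
    return hmac.new(subkey, uui.encode("utf-8"), hashlib.sha256).hexdigest()


if __name__ == "__main__":
    key = b"demo key, not a real secret"
    print(survey_id(key, "s1", "example-uui"))
    print(survey_id(key, "s2", "example-uui"))
```

Without the key, someone who knows a person's UUI cannot recompute ID_s1 or ID_s2, and the two identifiers give no visible hint that they belong to the same person.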

The algorithms are designed in such a way that:

- It is practically impossible to know that ID_s1 was derived from UUI.

- It is practically impossible to know that ID_s2 was derived from UUI.

- It is possible to know that ID_s1 and ID_s2 are for the same individual, but it is computationally expensive (like mining a bitcoin). This way, for given computational resources, there is a limit to the number of identifiers from different datasets that can be matched for the same individual. The computational cost limits the number of datasets that can be linked, protecting the privacy of individuals.
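The three properties above can be sketched with a toy puzzle construction. This is one illustrative possibility, not a vetted cryptographic scheme and not necessarily what the proposal has in mind. It assumes the agency holds a secret `agency_key`, derives a per-person token from the UUI, and publishes that token locked under a small random key that is then thrown away. Unlocking a record means brute-forcing the small key; two records belong to the same person when they unlock to the same token. All names and the `bits` parameter are hypothetical.

```python
import hashlib
import hmac
import secrets
from typing import Tuple

Record = Tuple[bytes, bytes, bytes]  # (salt, locked token, check value)


def link_token(agency_key: bytes, uui: str) -> bytes:
    """Per-person token; it cannot be recomputed from the UUI without the key."""
    return hmac.new(agency_key, uui.encode("utf-8"), hashlib.sha256).digest()


def _mask(salt: bytes, weak_key: int) -> bytes:
    return hashlib.sha256(salt + weak_key.to_bytes(4, "big")).digest()


def _check(salt: bytes, token: bytes) -> bytes:
    return hashlib.sha256(b"check" + salt + token).digest()


def publish_id(token: bytes, bits: int = 16) -> Record:
    """Lock the token under a random `bits`-bit key, then discard that key."""
    salt = secrets.token_bytes(16)
    weak_key = secrets.randbelow(1 << bits)
    locked = bytes(a ^ b for a, b in zip(token, _mask(salt, weak_key)))
    return salt, locked, _check(salt, token)


def unlock(record: Record, bits: int = 16) -> Tuple[bytes, int]:
    """Brute-force the discarded key. Returns (token, number of guesses)."""
    salt, locked, check = record
    for guess in range(1 << bits):
        candidate = bytes(a ^ b for a, b in zip(locked, _mask(salt, guess)))
        if hmac.compare_digest(_check(salt, candidate), check):
            return candidate, guess + 1
    raise ValueError("no key found; was the record made with more bits?")


if __name__ == "__main__":
    token = link_token(b"demo key, not a real secret", "example-uui")
    id_s1, id_s2 = publish_id(token), publish_id(token)
    t1, work1 = unlock(id_s1)
    t2, work2 = unlock(id_s2)
    print("same individual:", t1 == t2, "guesses spent:", work1 + work2)
```

In this toy version the cost grows with every record that has to be unlocked, so a fixed computing budget caps how many datasets can be linked. The cap is economic rather than absolute: the search parallelizes, and anyone with more resources can link more.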

Can this approach solve the problem of tracking people across different data collection activities without invading their privacy? Could it be used with new technologies to make civil registration easier, for example? Or for panel surveys or informal sector surveys, among others?

Is such cryptography-based protection of confidentiality feasible and practical?

Possible extension: Imagine that, by simply reading an RFID chip, it is possible to know whether a person has registered for something, whether the person was vaccinated against a given disease, whether the person has voted, or whether the person has participated in a given survey. But one cannot know all these things together; one can only know a small subset of the information. The amount of information you can get by reading the RFID chip is limited by design to protect the confidentiality of the person. Would it facilitate the production of official statistics?
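The "limited by design" behaviour can be pictured as a disclosure budget. The toy model below is purely illustrative: it assumes each reader has an identity the chip can recognize, and the class name, fact names, and limit are all made up for the example. It says nothing about how a real RFID chip would enforce such a rule.

```python
class LimitedDisclosureTag:
    """Toy model of a chip that reveals at most a few facts to each reader."""

    def __init__(self, facts: dict, max_facts_per_reader: int):
        self._facts = dict(facts)
        self._limit = max_facts_per_reader
        self._seen = {}  # reader_id -> set of fact names already disclosed

    def read(self, reader_id: str, fact: str) -> bool:
        if fact not in self._facts:
            raise KeyError(fact)
        seen = self._seen.setdefault(reader_id, set())
        # Re-reading a fact is free; a new fact counts against the budget.
        if fact not in seen and len(seen) >= self._limit:
            raise PermissionError("disclosure budget exhausted for this reader")
        seen.add(fact)
        return self._facts[fact]


if __name__ == "__main__":
    tag = LimitedDisclosureTag(
        {"registered": True, "vaccinated": True, "voted": False},
        max_facts_per_reader=2,
    )
    print(tag.read("clinic-reader", "vaccinated"))
    print(tag.read("clinic-reader", "registered"))
```

A third distinct question from the same reader would be refused, while a different reader starts with its own budget.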

© 2019 Data Science Central
