Big data promises new opportunities for official statistics, but it also raises confidentiality issues. With massive data sets being made available, re-identification becomes a real threat.
What if we could allow people to link different datasets for their studies, but put a limit on the number of datasets they can link? It would then never be possible to get all the records on an individual, even though any combination of a fixed number of these records could still be obtained.
Could the approach used by crypto-currencies make it possible, but hard, to match records from different surveys? Records from different datasets could then be linked, but the cost of linking each additional dataset would be so high that only a limited number of datasets could be linked in any one study. Confidentiality would be protected, while results of different surveys could still be linked for the needs of a given study.
Let's assume that it is possible to build a unique identifier for an individual by combining their birthdate, birthplace, etc. This is a number that anyone can compute if the required information on the individual is available. It is a unique and universal identifier for the individual. Let's call it UUI.
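To make the idea of a UUI concrete, here is a minimal sketch in Python. It assumes the UUI is a SHA-256 hash of a few normalized attributes; the function names `normalize` and `compute_uui`, the choice of SHA-256, and the `full_name` field are illustrative assumptions, not part of the proposal. In practice, such attributes may not be unique, and they are easy to guess, which is one reason the survey identifiers should not be a plain hash of them.

```python
import hashlib
import unicodedata


def normalize(value: str) -> str:
    """Normalize a field so that trivial spelling variants give the same UUI."""
    text = unicodedata.normalize("NFKC", value).casefold()
    return " ".join(text.split())  # trim and collapse whitespace


def compute_uui(birthdate: str, birthplace: str, full_name: str) -> str:
    """Hypothetical UUI: SHA-256 over the normalized attributes.

    The attribute list is an assumption; the text only says
    "birthdate, birthplace, etc.".
    """
    fields = [normalize(birthdate), normalize(birthplace), normalize(full_name)]
    # Length-prefix each field so that ("ab", "c") and ("a", "bc") differ.
    payload = "".join(f"{len(field)}:{field}" for field in fields)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Anyone holding the same attributes gets the same value, for example `compute_uui("1980-01-31", "Dakar", "Awa Diop")`, regardless of letter case or extra spaces.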
When a survey s1 is conducted, an algorithm is used to generate an identifier ID_s1 from the UUI. The data of survey s1 are published with the identifier ID_s1.
In the same way, when a survey s2 is conducted, the same algorithm is used to generate an identifier ID_s2 from the UUI. The data of survey s2 are published with the identifier ID_s2.
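One possible way to derive a per-survey identifier is sketched below. It assumes a keyed hash (HMAC-SHA-256) with a secret key held by the data publisher; the function name `derive_survey_id` and the use of a secret key are assumptions made for illustration. A plain, unkeyed hash would not be enough here, because anyone able to compute the UUI could recompute the identifier.

```python
import hashlib
import hmac


def derive_survey_id(secret_key: bytes, survey: str, uui: str) -> str:
    """Hypothetical per-survey identifier: HMAC-SHA-256 of survey name and UUI.

    The same UUI gives unrelated-looking identifiers in different surveys,
    and without secret_key the identifier cannot be recomputed from the UUI.
    """
    message = f"{survey}|{uui}".encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()
```

With a key such as `secrets.token_bytes(32)`, `derive_survey_id(key, "s1", uui)` and `derive_survey_id(key, "s2", uui)` play the roles of ID_s1 and ID_s2. Note that this sketch alone gives no way to link the two; how linking could be made possible but costly is a separate design question.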
The algorithms are designed in such a way that:
- It is practically impossible to know that ID_s1 was derived from UUI.
- It is practically impossible to know that ID_s2 was derived from UUI.
- It is possible to know that ID_s1 and ID_s2 are for the same individual, but it is computationally expensive (like mining a bitcoin). This way, for given computational resources, there is a limit to the number of identifiers from different datasets that can be matched for the same individual. The computational cost limits the number of datasets that can be linked, protecting the privacy of individuals.
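The three properties above could be approximated in several ways. The sketch below shows one of them, under strong assumptions that are not in the proposal: the data publisher runs a hypothetical `LinkageService` that keeps a secret key, and it answers the question "are these two identifiers for the same person?" only when the request carries a hashcash-style proof of work. All names and the toy difficulty of 12 bits are illustrative; a real difficulty would need to be far higher, and this design relies on a trusted party rather than on cryptography alone.

```python
import hashlib
import hmac
import secrets
from itertools import count

DIFFICULTY_BITS = 12  # toy value: about 2**12 hash attempts per query on average


def _leading_zero_bits(digest: bytes) -> int:
    return len(digest) * 8 - int.from_bytes(digest, "big").bit_length()


def verify_pow(challenge: str, nonce: int, bits: int = DIFFICULTY_BITS) -> bool:
    digest = hashlib.sha256(f"{challenge}|{nonce}".encode()).digest()
    return _leading_zero_bits(digest) >= bits


def solve_pow(challenge: str, bits: int = DIFFICULTY_BITS) -> int:
    """Brute-force search for a nonce; this is the cost paid by the researcher."""
    for nonce in count():
        if verify_pow(challenge, nonce, bits):
            return nonce


class LinkageService:
    """Hypothetical service run by the data publisher (an assumption)."""

    def __init__(self) -> None:
        self._key = secrets.token_bytes(32)
        self._token_by_id: dict[str, bytes] = {}  # never published

    def _mac(self, text: str) -> bytes:
        return hmac.new(self._key, text.encode(), hashlib.sha256).digest()

    def publish(self, survey: str, uui: str) -> str:
        """Return the public identifier for this person in this survey."""
        survey_id = self._mac(f"id|{survey}|{uui}").hex()
        self._token_by_id[survey_id] = self._mac(f"link|{uui}")
        return survey_id

    def same_person(self, id_a: str, id_b: str, nonce: int) -> bool:
        """Answer one linkage query, but only if the work has been done."""
        if not verify_pow(f"{id_a}|{id_b}", nonce):
            raise PermissionError("invalid proof of work")
        return hmac.compare_digest(self._token_by_id[id_a], self._token_by_id[id_b])


if __name__ == "__main__":
    service = LinkageService()
    id_s1 = service.publish("s1", "example-uui")
    id_s2 = service.publish("s2", "example-uui")
    nonce = solve_pow(f"{id_s1}|{id_s2}")
    print(service.same_person(id_s1, id_s2, nonce))
```

Each pair of identifiers needs its own proof of work, so the total cost grows with the number of records and datasets being linked. For a fixed computing budget, that puts a ceiling on how much linking a single study can do.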
Can this approach solve the problem of tracking people across different information collection activities without invading their privacy? Could it be used, for example, to make civil registration easier using new technologies? Or for panel surveys or informal sector surveys, among others?
Is such protection of confidentiality based on cryptography feasible and practical?
Possible extension: Imagine that, by simply reading an RFID chip, it is possible to know if a person has registered for something, if the person was vaccinated for a given disease, if the person has voted, or if the person has participated in a given survey. But one can't know all these things together; one can only know a small subset of the information. The amount of information you can get by reading the RFID chip is limited by design to protect the confidentiality of the person. Would it facilitate the production of official statistics?