Long title: Tag and sense methods in statistical sample surveys: a generalisation of capture-recapture methods in order to combine the methodological soundness of statistical sampling with the power of big data
The statistical community is very excited about the promise of big data: massive amounts of data available for analysis at our fingertips. However, some statisticians remain skeptical about the capacity of big data to replace sample surveys, the main limitation being the lack of controlled randomization in data collected via sensors. Tag and sense methods, a generalization of capture-recapture methods, may be a way to reconcile the methodological soundness of statistical sampling with the power of big data. The idea behind tag and sense methods is this: if we can identify, in the massive amount of data collected by sensors, the data of a subset of individuals who have been sampled using rigorous statistical methodology, then the joint availability of data from the sample and data from the whole population will allow us to make powerful inferences about the statistical properties of the population.
The steps of tag and sense are the following:
- Draw a rigorous random sample from the population, for example a sample of women in a village or a sample of cars.
- Tag the sampled individuals using any technology that allows them to be identified as members of the sample when big data are collected via sensors. For example, equip the women with devices that identify them as women from the sample at water points, or equip the cars with devices that identify them as cars from the sample when they cross a bridge.
- Collect joint data via sensors, for the whole population and for the sample. For example, we can count the number of people at the water point and, among these people, the number who belong to our sample. Or we can count the number of cars crossing a bridge and, among these cars, the number of cars from our sample. Joint data can, of course, be more complex than that, including for example spatial information and information about categories in the sample, depending on the amount of information that the tag contains.
- Use the joint data to make inferences. For example, big data alone will only tell us how many people use a water point, but the sample data may tell us how far they come from and, more generally, may allow us to link them to households and make even more powerful inferences about the whole population. For the cars, we may know via sensing that 1000 cars use a bridge per day, but the sample information may tell us, for example, that these cars include 5% of the cars of one area and 12% of the cars of another area. The joint information allows powerful inference.
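The bridge example in the last step can be made concrete with a minimal sketch. It assumes a simple random sample of cars is tagged in each area, that the number of cars registered in each area is known, and that the sensor reports distinct tagged and untagged cars. All names and figures are hypothetical. The last function is the classical Lincoln-Petersen capture-recapture estimator, shown only as the special case that tag and sense is meant to generalise; it is not a method specified in this text.

```python
from dataclasses import dataclass


@dataclass
class AreaSample:
    """Hypothetical summary of the tagged cars of one area."""
    population: int     # cars registered in the area (assumed known)
    tagged: int         # cars tagged by simple random sampling
    tagged_sensed: int  # distinct tagged cars detected by the bridge sensor


def usage_rate(area: AreaSample) -> float:
    """Estimated share of the area's cars that use the bridge."""
    if area.tagged <= 0:
        raise ValueError("at least one tagged car is needed")
    return area.tagged_sensed / area.tagged


def estimated_users(area: AreaSample) -> float:
    """Expand the tagged count to the whole area (inverse sampling fraction)."""
    if area.tagged <= 0:
        raise ValueError("at least one tagged car is needed")
    return area.population * area.tagged_sensed / area.tagged


def lincoln_petersen(tagged: int, sensed_total: int, sensed_tagged: int) -> float:
    """Classical capture-recapture estimate of an unknown population size."""
    if sensed_tagged <= 0:
        raise ValueError("no tagged individual was sensed")
    return tagged * sensed_total / sensed_tagged


# Illustrative figures only: two areas with different sampling fractions.
area_a = AreaSample(population=4000, tagged=200, tagged_sensed=10)
area_b = AreaSample(population=2500, tagged=100, tagged_sensed=12)

for name, area in (("A", area_a), ("B", area_b)):
    print(name, usage_rate(area), estimated_users(area))
```

With these made-up figures, the sensor total (say 1000 cars per day) can be split into the estimated contributions of areas A and B plus a remainder from elsewhere, which the sensor count alone could not provide.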
The key point about tag and sense is that it introduces the missing randomization into big data, hence allowing rigorous statistical inference.
This approach has many benefits that make it a cost-effective method for sample surveys in developing countries, one that taps into the benefits of big data:
- It builds on mobile technologies that are very promising for statistical information collection.
- It increases the amount and reliability of collected data without compromising the randomization.
- It reduces the costs of surveys by reducing the movements of enumerators and by using technologies that facilitate data processing. Furthermore, the integration is done at the lowest possible level: the individual level.
- It allows easy integration with data collected for other purposes, reinforcing the possibility of integrating statistical surveys across domains.
- The spatial dimension is taken into account by design, and it enriches the analysis of the collected data.
- International cooperation in data collection can be built on it: people crossing borders can be sensed in other countries, and the data shared with the countries where they were sampled, using a commonly agreed mechanism. This will allow surveys on cross-border issues that have been very difficult so far.
- It allows progressive improvement of the inference as new data arrive: the data collected after, for example, three months can be used to produce first inferences, but new data will continue to be collected until the tags expire, and these data will be used to improve the estimates.
- Cumulative tagging (tagging a new sample every time without withdrawing the old sample) makes it possible to tag the whole population progressively, if necessary.
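The progressive improvement of inference mentioned above can be sketched as a running estimate that is refined each time a new batch of sensor data arrives. This is a minimal illustration, assuming the quantity of interest is a proportion among tagged individuals and that observations are independent (a simple binomial approximation). The class name and the figures are hypothetical.

```python
import math


class RunningProportion:
    """Pools successive batches of tagged observations into one estimate."""

    def __init__(self) -> None:
        self.trials = 0  # tagged individuals observed so far
        self.hits = 0    # of which sensed at the point of interest

    def update(self, trials: int, hits: int) -> None:
        """Add a new batch, e.g. one more month of sensor data."""
        if not 0 <= hits <= trials:
            raise ValueError("hits must lie between 0 and trials")
        self.trials += trials
        self.hits += hits

    @property
    def estimate(self) -> float:
        if self.trials == 0:
            raise ValueError("no data collected yet")
        return self.hits / self.trials

    @property
    def std_error(self) -> float:
        """Binomial standard error; it shrinks as batches accumulate."""
        p = self.estimate
        return math.sqrt(p * (1 - p) / self.trials)


# Illustrative figures only: a first release, then a refined one.
tracker = RunningProportion()
tracker.update(trials=100, hits=10)
first_se = tracker.std_error
tracker.update(trials=300, hits=30)
refined_se = tracker.std_error
print(tracker.estimate, first_se, refined_se)
```

A real design would also have to account for the sampling scheme and for dependence between repeated observations of the same tag, which this sketch ignores.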
There are also challenges to address:
- Confidentiality issues: what amount of information should be in the tag? How can we set the tag to expire after some time? How can we prevent unauthorized people from reading the information in the tag? (see note 1 for proposed solutions)
- Choice of technology to use and standardization: international cooperation requires using the same standards across borders. The tags have to be more complex than those currently used in capture-recapture methods as more information is needed in the tag and better protection of that information is required.
- Statistical inference theories and tools: new models will require new statistical inference theories and tools.
Note 1: Giving identifiers to individuals while preserving privacy. Possibilities:
1- Every individual (in the sample) has an identifier, and either:
(a) If the identifier is unique to the person, it must expire after some time, and a new identifier is given with no logical possibility of linking the two.
(b) Alternatively, identifiers are not unique to the individual; they represent membership of various predefined categories: sex, age group, etc. Category memberships are never all exposed together: the identifier gives only a few of them. The identifier should therefore change according to predefined rules and expose membership of only a small number of categories at any one time.
2- The identifier can deactivate itself under certain conditions.
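The possibilities in this note can be illustrated with a minimal sketch. For illustration only, it combines them in one hypothetical tag: a random token that is replaced at each rotation and is not derived from the previous one (1a), a rotating subset of category memberships (1b), and self-deactivation after a fixed number of rotations (2). The class, its fields and the rotation rule are assumptions, not a proposed standard.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class Tag:
    """Hypothetical privacy-preserving tag."""
    categories: dict            # full memberships, kept private on the device
    lifetime: int               # rotations allowed before self-deactivation
    exposed_per_epoch: int = 2  # how many categories are exposed at once
    epoch: int = 0
    token: str = field(default_factory=lambda: secrets.token_hex(8))

    def rotate(self) -> None:
        """Start a new epoch with a fresh token unrelated to the old one."""
        self.epoch += 1
        self.token = secrets.token_hex(8)

    @property
    def active(self) -> bool:
        return self.epoch < self.lifetime

    def broadcast(self):
        """What a sensor may read: None once the tag has deactivated."""
        if not self.active:
            return None
        keys = sorted(self.categories)
        count = min(self.exposed_per_epoch, len(keys))
        start = (self.epoch * count) % len(keys)
        chosen = [keys[(start + i) % len(keys)] for i in range(count)]
        return {
            "token": self.token,
            "categories": {k: self.categories[k] for k in chosen},
        }


# Illustrative use with made-up categories.
tag = Tag({"sex": "F", "age_group": "25-34", "area": "north"}, lifetime=3)
first = tag.broadcast()
tag.rotate()
second = tag.broadcast()
print(first["categories"].keys(), second["categories"].keys())
```

Because each token is drawn at random, two successive tokens cannot be linked from the tokens alone. Whether the rotating category subsets could still allow re-identification in a small population is a separate question that this sketch does not address.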