I have a question for all you experts out there. In my line of work (epidemiology), we have significant issues with both privacy (it is health information), and jurisdiction (federated political system). Assuming the security issues are worked out, is it possible to design a system that would allow an analyst to access personal data for analysis from systems throughout the country and undertake the analysis without ever taking receipt of or actually looking at the data (and therefore never actually receiving the personal info)? The way I would see this working myself, as a non-expert, is the creation of some kind of temp file like you can have for analysis in SAS, drawing from the different datasets but not keeping the data, and having the output of the analysis (which would be at a high enough level to get rid of the privacy concerns) be at the federal level. But I don't know whether this is possible at all. Can anyone enlighten me?
If it helps at all for the details, I would foresee it being something like having access to full postal codes, but with a layer of analysis that would aggregate them for mapping and for calculating a rate of disease in the population, so that you use the very detailed data but the output is not as fine-grained (mapped to a county, say, or something at that level). I hope I'm being clear.
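To make that aggregation layer concrete, here is a minimal Python sketch. The postal-code-to-county lookup, the county populations, and the function name `county_rates` are all made up for illustration; a real system would use an official geographic lookup file and census denominators.

```python
from collections import Counter

# Hypothetical lookup: full postal code -> county (made-up values).
POSTAL_TO_COUNTY = {
    "A1A 1A1": "County X",
    "A1A 2B2": "County X",
    "B2B 3C3": "County Y",
}

# Hypothetical county populations (made-up values).
COUNTY_POPULATION = {"County X": 50_000, "County Y": 20_000}


def county_rates(case_postal_codes, per=100_000):
    """Aggregate case postal codes to counties and return a rate per `per` people."""
    counts = Counter(POSTAL_TO_COUNTY[pc] for pc in case_postal_codes)
    return {
        county: counts.get(county, 0) * per / population
        for county, population in COUNTY_POPULATION.items()
    }


# One postal code per reported case; only the county-level rates are output.
cases = ["A1A 1A1", "A1A 2B2", "A1A 1A1", "B2B 3C3"]
rates = county_rates(cases)
print(rates)
```

The detailed postal codes are used only inside `county_rates`; what comes out is one number per county.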
You can have personal fields encrypted. When I worked at Visa, credit card numbers (in our data sets) were encrypted, although in that case the issue was more protection against ID theft than privacy. But it came at a price, as credit card numbers carry some value in themselves (especially the first digits) from a data mining point of view. Encrypting only the last 6 digits would have been a better solution, though it's possible some regulations require the full number to be encrypted.
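As a rough sketch of the "hide only the last 6 digits" idea, the Python below keeps the leading digits readable and replaces the rest with a keyed token. Note that this swaps in keyed hashing (HMAC-SHA256 from the standard library) rather than reversible encryption, and the key, the function name `pseudonymize_card`, and the token length are assumptions for illustration, not anything Visa actually used.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would be held by the data custodian only.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymize_card(card_number, hidden=6):
    """Keep the leading digits in the clear and replace the last `hidden` digits
    with a keyed token, so the same card always maps to the same value."""
    digits = card_number.replace(" ", "")
    visible = digits[:-hidden]
    # The token is derived from the full number, so two cards that share the
    # same visible prefix still get different tokens.
    token = hmac.new(SECRET_KEY, digits.encode(), hashlib.sha256).hexdigest()[:12]
    return f"{visible}-{token}"


# 4111 1111 1111 1111 is a widely used test number, not a real card.
masked = pseudonymize_card("4111 1111 1111 1111")
print(masked)
```

The visible prefix stays available for analysis, while records for the same card can still be linked through the token without exposing the full number.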
Maybe there is a way to encrypt zip codes to preserve spatial proximities, in your case.
Would encryption mean that my analysts would take possession of the data elements? What I'd ideally have is a situation where we could include an element in an analysis that is sourced from multiple sites but never actually downloaded to a central dataset. We'd in a sense 'touch' it but never take it, then apply a layer of analysis that would aggregate it to a spatial level that no longer presents a privacy concern. This of course would mean that we'd have to trust our data providers implicitly, which is always an issue, but it could get around some of our jurisdictional issues.
Can one analyse data from disparate sources (assuming the elements were the same) without ever having to put them together into a single combined dataset?
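One way to picture this is for each site to run the aggregation locally and send back only the county-level totals, which the central analyst then adds up. The Python below is a toy sketch of that pattern under stated assumptions: the `Site` class, the `county_summary` and `combine` names, and the records are all hypothetical, and a real deployment would also need the security and trust arrangements mentioned above.

```python
from collections import Counter


class Site:
    """Stands in for one jurisdiction's system; individual records stay inside it."""

    def __init__(self, records):
        # Each record is a (county, has_disease) pair; never returned to callers.
        self._records = records

    def county_summary(self):
        """Return only aggregate (cases, people) totals per county."""
        cases, people = Counter(), Counter()
        for county, has_disease in self._records:
            people[county] += 1
            cases[county] += int(has_disease)
        return {county: (cases[county], people[county]) for county in people}


def combine(summaries):
    """Pool the per-site aggregates into one rate per county."""
    total_cases, total_people = Counter(), Counter()
    for summary in summaries:
        for county, (cases, people) in summary.items():
            total_cases[county] += cases
            total_people[county] += people
    return {county: total_cases[county] / total_people[county] for county in total_people}


# Made-up records held at two separate sites.
site_a = Site([("County X", True), ("County X", False), ("County Y", False)])
site_b = Site([("County X", True), ("County X", False), ("County Y", True)])

# The analyst only ever sees the summaries, never the row-level data.
pooled = combine(site.county_summary() for site in (site_a, site_b))
print(pooled)
```

This works for counts, sums, and rates because they can be added across sites; analyses that need row-level linkage between sites would need a different approach.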