The European Union is a few short months away from finalizing a sweeping regulation that will dramatically change how data can be handled and how data science can be used. The regulation will affect all corporations using data from EU citizens, not just those with offices in the EU. Those collecting data from more than 5,000 EU citizens per year will be considered accountable, regardless of company location. The EU parliament is so serious about compliance with these new privacy and data protection laws that it has proposed a fine for violations of up to 5% of global annual turnover (1 million Euros for smaller companies). Needless to say, this massive fine has attracted serious attention to the regulation. Companies have already started preparations to comply.
Personal privacy and data protection are currently legislated and enforced in the EU through a patchwork of individual member state laws and independent supervisors. The lack of a single privacy framework complicates compliance and data transfer for multinational corporations, while also preventing EU supervisors from addressing privacy violations in a unified manner. More to the point, overly aggressive data-driven business models, ineffective lobbying strategies and underinvestment in data protection have fueled a market-failure argument that is now driving a stepwise regulatory change. This change will be provided by the General Data Protection Regulation (GDPR).
The GDPR will become the law of the land across the EU, replacing for the most part the current member state regulations. After three years in development, the regulation is due for final ratification this year, during the current Luxembourg presidency of the EU or, in the worst case, during the Dutch presidency (January-June 2016). Enforcement will begin within a two-year window following ratification and will be implemented via a One Stop Shop approach to supervision (the member state where the corporation is headquartered will supervise).
The Police and Judicial Cooperation Data Protection Directive (PJCD) will be released simultaneously and will address use of data by law enforcement agencies.
Potential Conflict of Goals: The upcoming privacy regulations will be especially challenging for data scientists, as they will push data use in precisely the opposite direction from the one in which many data scientists tend to push.
Ideally, both data scientists and privacy advocates are pursuing the best interests of the individual. Their goals and methodologies, however, differ. Data science aims to acquire new data and to find new uses for existing data. While privacy advocates strive to minimize data collection, data scientists strive to maximize it. While privacy advocates strive to decrease unexpected uses of data, data scientists strive to increase them. Compliance with the GDPR will require very careful alignment and coordination of these goals, so that the individual benefits from both a privacy and data protection perspective and an economic perspective.
Generating Private Data: We are becoming increasingly aware of the ways in which the analytic techniques of data scientists are able to draw unanticipated insights from what was thought to be innocuous data. Projects have been carried out that, for example, link sensitive but anonymized data to specific individuals, reveal the gender and/or ethnicity of individuals based on Facebook likes, retrieve personal records of individuals based on a snapshot taken on the street, and fingerprint cell phones based on cell tower check-ins.
In a previous post, I wrote about how Netflix had legal problems when they didn’t realize how data science techniques could de-anonymize legally protected data released during the Netflix Prize. The state of Massachusetts had a similar problem in 2002 when health care records of public employees were released as anonymous and later partially de-anonymized.
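To make the de-anonymization risk concrete, here is a minimal sketch of a linkage attack: "anonymized" records are joined to a public roster on shared quasi-identifiers. Everything in it is an assumption made for illustration. The records are made-up toy data, and the field names and the function `link_records` are hypothetical; this is not a reconstruction of how the Netflix or Massachusetts data was actually re-identified.

```python
from collections import defaultdict

# Fields that are not names, but that together can single out one person.
QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")

# Toy, made-up data: an "anonymized" release with names removed.
anonymized_health = [
    {"zip": "10001", "birth_date": "1950-02-03", "sex": "M", "diagnosis": "condition A"},
    {"zip": "10002", "birth_date": "1962-03-14", "sex": "F", "diagnosis": "condition B"},
    {"zip": "10002", "birth_date": "1962-03-14", "sex": "F", "diagnosis": "condition C"},
]

# Toy, made-up data: a public list that does include names.
public_roster = [
    {"name": "Person One", "zip": "10001", "birth_date": "1950-02-03", "sex": "M"},
    {"name": "Person Two", "zip": "10002", "birth_date": "1962-03-14", "sex": "F"},
    {"name": "Person Three", "zip": "10003", "birth_date": "1980-01-01", "sex": "F"},
]


def quasi_key(record):
    """Build the join key from the quasi-identifier fields."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link_records(anonymized, roster):
    """Return (name, diagnosis) pairs where the quasi-identifiers match
    exactly one person and exactly one anonymized record."""
    anonymized_by_key = defaultdict(list)
    for record in anonymized:
        anonymized_by_key[quasi_key(record)].append(record)

    roster_by_key = defaultdict(list)
    for person in roster:
        roster_by_key[quasi_key(person)].append(person)

    matches = []
    for key, people in roster_by_key.items():
        records = anonymized_by_key.get(key, [])
        # Only an unambiguous one-to-one match re-identifies someone.
        if len(people) == 1 and len(records) == 1:
            matches.append((people[0]["name"], records[0]["diagnosis"]))
    return matches


print(link_records(anonymized_health, public_roster))
```

In this toy data only one person is re-identified, because the second key matches two health records and stays ambiguous. That is the intuition behind treating combinations of seemingly harmless fields as personal data.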
So we see how personal data may be volunteered, observed or inferred. Although the majority of press in the last few years has focused on concerns over data observation (e.g. cookie legislation, audio/video surveillance, RFID etc.), regulators are shifting their attention to the realms of Big Data, Smart Sensors, and advanced analytics.
Thus, advancements in Data Science have expanded, and will continue to expand, the definition of Personally Identifiable Information (PII). These advancements will undoubtedly influence privacy legislation in the future.
Our increased usage of cutting-edge data storage and analytic technologies puts us even more at risk of violating privacy requirements. Modern data technologies, including an abundance of NoSQL technologies, on-demand cloud storage, and in-memory processing, are encouraging data scientists and corporations in general to produce massive stores of raw data (data lakes). This storage raises the following challenges from a privacy compliance perspective:
Data awareness: Companies lose oversight of what data is stored, where it is replicated, and what the risks and privacy implications of that data may be.
Governance: Raw data may be flowing into the systems of pilot programs without mature governance models. In addition, there is concern over the security features of many cloud storage systems.
Control: As raw data with unknown potential is retrieved, stored, copied and distributed, companies may find that they have lost oversight of where data has flowed and can no longer implement the right to be forgotten (right to erasure).
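One way to picture the awareness and control challenges above is a simple inventory that records every location where a person's data has been copied, so that an erasure request can be traced to all of them. The sketch below is an assumption-laden illustration only: the class `DataInventory`, its methods, and the system and dataset names are all hypothetical, and the GDPR does not prescribe any such mechanism.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class DataInventory:
    """Hypothetical registry of where each data subject's data is stored."""

    # subject_id -> set of (system, dataset) locations holding that subject's data
    locations: dict = field(default_factory=lambda: defaultdict(set))

    def record_copy(self, subject_id, system, dataset):
        """Register that data about subject_id now exists in system/dataset."""
        self.locations[subject_id].add((system, dataset))

    def erasure_plan(self, subject_id):
        """List every known location that must be purged for this subject."""
        return sorted(self.locations.get(subject_id, set()))

    def confirm_erased(self, subject_id, system, dataset):
        """Mark one location as purged; drop the subject once none remain."""
        self.locations[subject_id].discard((system, dataset))
        if not self.locations[subject_id]:
            del self.locations[subject_id]


# Usage with made-up identifiers and system names.
inventory = DataInventory()
inventory.record_copy("user-42", "data_lake", "raw_clickstream")
inventory.record_copy("user-42", "analytics_cluster", "churn_features")
inventory.record_copy("user-7", "data_lake", "raw_clickstream")

for system, dataset in inventory.erasure_plan("user-42"):
    print(f"purge user-42 from {system}/{dataset}")
```

An inventory like this only helps if every copy is registered when it is made, which is precisely the discipline that an ungoverned data lake tends to lack.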
In Part 2, we will discuss:
How pending EU privacy regulation will have a direct impact on general data collection and use and specifically on data analysis and data science
Steps that should be taken across the organization today