Security concerns over easy access to personally identifiable data
Andy Green, a technical content specialist at Varonis, specialists in data governance software, says stricter de-identification rules will almost certainly become law in the EU amid e-crime concerns over the hacking of easy-to-access personally identifiable information.
Personally identifiable information, or PII, is pretty intuitive. If you know someone's phone, social security or credit card number, you have a direct link to their identity. Hackers use these identifiers, along with a few more personal details, as keys to unlock data, steal identities and ultimately commit crimes.
The lines between PII and non-PII data are blurring. It's been known for at least ten years that there are specific pieces of data that may appear anonymous but, when taken together, are just as effective at identifying a person as traditional PII.
The easiest to understand of these so-called quasi-PIIs is the trio of full birth date, postcode and gender. If a company published a dataset that had been de-identified by removing all the standard PIIs but left those three data items alone, a smart hacker could, with very high likelihood, find the name and address of the person behind that data.
Why would this work? At a very basic level, the identity thief is doing the work of a detective, essentially going through lists looking for matches. The lists in this case are voting records, which in the US are available for most towns and counties at a nominal fee, typically around $40. In the UK, this information is free. Voting records contain name, address and, most importantly, full birth date; postcodes can be easily determined from the address. By looking for matching birth dates and postcodes, identity thieves narrow down the search to a few names. Add gender information and, for most postcodes, hackers can arrive at a unique name. Of course, the more additional information or clues gathered, especially from social media and other websites, the easier it is to filter and narrow down names when there's more than one candidate.
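The matching step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not a description of any real tool: every record below is invented, and the names `VOTER_ROLL`, `PUBLISHED` and `link_records` are hypothetical.

```python
from collections import defaultdict

# Hypothetical voter roll: (name, birth_date, postcode, gender). All invented.
VOTER_ROLL = [
    ("Alice Example", "1970-03-14", "02138", "F"),
    ("Bob Sample", "1970-03-14", "02138", "M"),
    ("Carol Placeholder", "1982-11-02", "02139", "F"),
    ("Dana Placeholder", "1982-11-02", "02139", "F"),
]

# Hypothetical "de-identified" dataset: names removed, quasi-PIIs left in.
PUBLISHED = [
    {"record_id": 1, "birth_date": "1970-03-14", "postcode": "02138", "gender": "F"},
    {"record_id": 2, "birth_date": "1982-11-02", "postcode": "02139", "gender": "F"},
]

def link_records(published, voter_roll):
    """Pair each published record with the voter-roll names sharing its
    birth date, postcode and gender."""
    index = defaultdict(list)
    for name, birth_date, postcode, gender in voter_roll:
        index[(birth_date, postcode, gender)].append(name)

    results = []
    for record in published:
        key = (record["birth_date"], record["postcode"], record["gender"])
        results.append((record, index.get(key, [])))
    return results

for record, candidates in link_records(PUBLISHED, VOTER_ROLL):
    if len(candidates) == 1:
        status = "unique match"
    else:
        status = f"{len(candidates)} candidates"
    print(record["record_id"], candidates, status)
```

In this toy data the first record resolves to a single name, while the second still has two candidates and would need an extra clue to separate them.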
Take the US, for example. A quick back-of-the-envelope calculation tells you why one might do very well with this approach. Taking 365 days (ignoring leap years) and multiplying by a lifespan of 80 years, it works out that a complete birth date gives 29,200 bins in which to place a zip code's worth of US citizens. If you have gender information, you double the number of slots to a little over 58,000.
I can hear nitpickers saying that voting rolls contain only the names of those over the age of 18, so you would have to remove 6,570 slots (18 years of 365 days). True enough, but researchers have shown it's possible to exploit Facebook's leaky handling of data on school-age minors to partially address this gap.
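The slot counts in this back-of-the-envelope calculation can be checked directly. The sketch below uses only the article's round figures; the constant and function names are made up for the example, and it assumes the 6,570 figure is counted before doubling for gender.

```python
DAYS_PER_YEAR = 365   # leap years ignored, as in the text
LIFESPAN_YEARS = 80   # the article's round figure
VOTING_AGE = 18
GENDERS = 2

def birth_date_bins(years):
    """Number of distinct full birth dates spanning the given number of years."""
    return DAYS_PER_YEAR * years

all_dates = birth_date_bins(LIFESPAN_YEARS)    # every possible birth date
with_gender = all_dates * GENDERS              # birth date plus gender
minor_dates = birth_date_bins(VOTING_AGE)      # dates missing from voting rolls
adult_with_gender = (all_dates - minor_dates) * GENDERS

print(f"birth-date bins:         {all_dates:,}")
print(f"bins with gender:        {with_gender:,}")
print(f"birth dates of minors:   {minor_dates:,}")
print(f"adult bins with gender:  {adult_with_gender:,}")
```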
In any case, based on the last US census, there are more than 40,000 zip codes, with an average of only 7,000 people per zip code. It seems there's a good chance most of those 7,000 people will find themselves alone in one of those 58,000 slots. In other words, the odds are that most of them won't share the same date of birth, zip code and gender.
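To put a rough number on that "good chance", here is a minimal sketch under a strong simplifying assumption: that the 7,000 residents of an average zip code are spread uniformly and independently across the 58,400 birth-date-and-gender slots. Real populations are not uniform, so this is an idealisation rather than a prediction, and the function name `prob_alone` is invented for the example.

```python
def prob_alone(people, slots):
    """Chance that a given person shares a slot with none of the others,
    assuming each person lands in a slot uniformly at random."""
    return (1 - 1 / slots) ** (people - 1)

PEOPLE_PER_ZIP = 7_000   # average from the text
SLOTS = 365 * 80 * 2     # birth date x gender, as computed in the text

p = prob_alone(PEOPLE_PER_ZIP, SLOTS)
print(f"share of residents expected to be unique: {p:.3f}")
```

Under these assumptions a little under nine in ten residents sit alone in their slot, which is consistent with the intuition that most people in an average zip code do not share all three attributes with a neighbour.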
Data privacy expert Latanya Sweeney, Professor of Government and Technology in Residence at Harvard University, ran the numbers back in 2000: using then-current census data (broken down by zip codes and age groups), she was able to identify 87 per cent of the people in the US using just those three non-PIIs.
Piecing the information together is even easier in the UK, as a postcode will often cover little more than a single street.
Fortunately, Ms Sweeney's research and results from other experts have made their way to policy-makers. For example, when medical research on US patients is published, the HIPAA (Health Insurance Portability and Accountability Act of 1996) Safe Harbor de-identification rules say that, as a general rule, no geographic unit smaller than a state can be included in the public data. Full dates (eg, admission, birth) must also be reduced to the year alone.
With US regulations on PII varying by the particular legislation, this is by no means a universal rule. However, the Federal Trade Commission, an influential regulatory agency on privacy matters, has recently issued new best practices on data de-identification. It has called for all companies to achieve a reasonable level of confidence that their public dat