Confidentiality for the Big Data World

In part one of this series, we looked at how many organisations have failed to learn lessons from the privacy mistakes of the past.

When the Australian Bureau of Statistics wanted to release their census data online to the public, they were faced with a problem: how could they safely allow people to query their data without compromising the privacy of the Australian public?

They turned to Space-Time Research for a solution.

Even before any confidentiality measures are applied to the data, our software tools automatically deliver some degree of protection by providing aggregated data. You can build a table showing how many people fit into a particular category, but you cannot see the individual records that contribute to that result. You can make the underlying unit records available to your users if you wish (through “Record View”), but when dealing with confidential data this feature can be completely disabled.

But even with aggregated data, if you know some facts about an individual in the data set, then you might be able to construct a query that is specific enough for you to find that individual. Once you’ve done that, you can use this query to reveal other private information about this individual.

Let’s say, for example, that a medical records database has been released. I know that my neighbour had twins this year, that she is in her late 30s, and that she moved to the area two years ago. When I build a query based on this knowledge, I can see that only one person fits the bill…

Confidentiality Breach

Once I’ve found my neighbour I can start to build other queries to discover confidential information about her medical history:

Confidentiality Breach 2
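
To make the two steps above concrete, here is a minimal sketch in Python. It is not our software: the records, the field names and the `count` helper are all invented for illustration. The only thing the "user" can do is ask for counts, which is the aggregated view described earlier.

```python
# Hypothetical unit records; every field name and value is invented for illustration.
RECORDS = [
    {"age_band": "35-39", "years_in_area": 2, "twins_this_year": True, "condition": "asthma"},
    {"age_band": "35-39", "years_in_area": 2, "twins_this_year": False, "condition": "none"},
    {"age_band": "35-39", "years_in_area": 7, "twins_this_year": True, "condition": "diabetes"},
    {"age_band": "25-29", "years_in_area": 2, "twins_this_year": False, "condition": "none"},
    {"age_band": "40-44", "years_in_area": 2, "twins_this_year": False, "condition": "asthma"},
]


def count(records, **criteria):
    """Return only an aggregate: how many records match every criterion."""
    return sum(
        all(record[field] == value for field, value in criteria.items())
        for record in records
    )


# Step 1: facts the attacker already knows narrow the table down to one person.
known = {"age_band": "35-39", "years_in_area": 2, "twins_this_year": True}
matches = count(RECORDS, **known)

# Step 2: adding one more field to the same query reveals that person's value.
conditions = sorted({record["condition"] for record in RECORDS})
leak = {c: count(RECORDS, condition=c, **known) for c in conditions}

print(matches)  # 1
print(leak)     # {'asthma': 1, 'diabetes': 0, 'none': 0}
```

No individual record was ever displayed, yet the single non-zero cell in the second table tells the attacker exactly what they wanted to know.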

To deal with this problem, we developed a suite of confidentiality measures that can be applied on-the-fly when building tables. Our most advanced solution is a proprietary algorithm called Perturbation.

Perturbation automatically adjusts smaller cell values, showing a slightly different value in those cells to prevent anyone from identifying an individual. Importantly, it does not simply round the values at random: it is a complex calculation that is specifically designed to preserve the usefulness of your data, without disrupting overall trends or introducing bias.

Cell values are perturbed in a repeatable and consistent way, so that totals still add up correctly and the totals themselves remain within the perturbation limits.
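
The Perturbation algorithm itself is proprietary, so the Python sketch below is not that algorithm. It is a toy illustration of a single property, repeatability: the adjustment applied to a cell is derived from a keyed hash of the cell's identity, so the same cell always shows the same adjusted value. `SECRET_KEY`, `THRESHOLD`, `MAX_SHIFT` and the cell names are all assumptions made for the example, and the sketch makes no attempt to avoid bias or to keep totals within limits.

```python
import hashlib
import hmac

SECRET_KEY = b"hypothetical-secret"  # assumption: known only to the data custodian
THRESHOLD = 5  # assumption: counts at or below this are treated as "small"
MAX_SHIFT = 2  # assumption: a small cell moves by at most this much


def perturb(cell_key: str, true_count: int) -> int:
    """Return a repeatable, slightly adjusted count for a small cell."""
    if true_count == 0 or true_count > THRESHOLD:
        return true_count  # empty and large cells are left alone in this sketch
    digest = hmac.new(SECRET_KEY, cell_key.encode("utf-8"), hashlib.sha256).digest()
    # Map the first byte of the digest onto an offset in [-MAX_SHIFT, +MAX_SHIFT].
    shift = digest[0] % (2 * MAX_SHIFT + 1) - MAX_SHIFT
    return max(0, true_count + shift)


# Hypothetical table cells, keyed by the categories that define each cell.
cells = {"35-39|twins": 1, "35-39|no twins": 48, "40-44|twins": 3}
published = {key: perturb(key, value) for key, value in cells.items()}

# In this sketch the total is derived from the published cells, so it adds up.
total = sum(published.values())
print(published, total)
```

Because the offset depends only on the key and the cell, asking for the same cell repeatedly cannot be used to average the adjustment away, which is the weakness of adding fresh random noise to every query.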

Want to Know More?

Our confidentiality solutions are trusted by the ABS to protect the Australian public. Interested in learning more about what they could do for you? Get in touch today to discuss your confidentiality needs.

Image: “System Lock” by Yuri Samoilov used under Creative Commons