July 26th, 2021 · 11 minute read

Is your Data Anonymization effective?

Updated: July 26th, 2021


More than ever, data anonymization is critical to protecting individuals and organizations against privacy risk. Interacting with technology every day, we all become Hansel and Gretel, leaving “breadcrumbs” as we go online, pay for our latte at Starbucks, visit the doctor, sign in our children at kindergarten, and attend meetings online. Those of us who live in smart homes with devices that record our every move and desire, leave a veritable roadmap of crumbs – clear footprints of our daily affairs.

Cybercriminals can use these traces to identify users and expose their identities. To prevent this, companies are obliged, by law, to protect the data they keep from cyber-attacks.

The most prominent law is the General Data Protection Regulation (GDPR), which came into effect in the European Union on May 25, 2018. It protects the privacy rights of EU citizens and applies to companies outside the EU that handle the data of EU residents or businesses. Canada has the Personal Information Protection and Electronic Documents Act (PIPEDA), and Australia has the Privacy Act. In the US, there is no equivalent federal law; the California Consumer Privacy Act (CCPA) is a state-level regulation.

Data anonymization is the response to these strict data privacy regulations and consumer concerns about data safety. Successful data anonymization protects the privacy rights of individuals and also allows organizations to derive benefits from that data. Companies can use and keep anonymized data because it doesn’t fall under the scope of the GDPR.

What is data anonymization?

Data anonymization is a process that removes personally identifiable information about an individual so that it can’t be traced back to the individual. That is the ideal; whether it actually happens in practice, or is even possible, is open to debate. Indeed, scientists warn that anonymized data can often be reverse-engineered using machine learning to re-identify individuals. At the very least, data anonymization should make it impractical or very difficult to link information to a specific individual.

The ideal of data anonymization is to irreversibly prevent the identification of an individual from specific data. Only then can the data be regarded as truly anonymized.

A side note on de-identification. This term is often used interchangeably with anonymization, but according to co-founder and CEO of Private AI, Patricia Thaine, they are not the same. Whereas anonymization leads to individuals becoming permanently unidentifiable, de-identification is the removal of personal and quasi-identifiers through a process that enables the original data to be reconnected to an individual.

For the purposes of the GDPR, all identifiers, direct and quasi, must be removed from data if an organization wants to store or use data indefinitely.

Choosing a data anonymization technique

There are a number of techniques to safeguard personal information so it’s unrecognizable. There isn’t a single solution that works for all scenarios. The final choice will depend on a range of factors, including the quality of the data and the need to preserve it in some form, industry privacy guidelines, and other risk factors.

It’s important to be aware that although data anonymization techniques can be leveraged to transform a data field to a great degree, these techniques often still carry re-identification risks. 

Why anonymization of data is important

Data anonymization reduces the risk of accidental, negligent or purposeful disclosure of sensitive data in the process of it being shared between different branches of a company, different departments, other companies in the industry, etc.

Data anonymization protects the identity and privacy of individuals as well as that of organizations. The healthcare sector uses data anonymization techniques to safeguard the sensitive health records of patients. Hospitals can share lists of their patients and their health conditions with researchers by withholding sensitive data (names, ID numbers, Social Security Numbers, address) while including data useful for research purposes (age, weight, height, gender, health conditions, medications).
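The hospital example above can be sketched in a few lines of Python. This is a minimal illustration, not a production de-identification pipeline, and the field names are hypothetical, chosen to mirror the example in the text.

```python
# Minimal sketch: withholding direct identifiers from a patient record
# while keeping fields that are useful for research.

DIRECT_IDENTIFIERS = {"name", "id_number", "ssn", "address"}

def strip_identifiers(record: dict) -> dict:
    """Return a copy of the record with direct identifiers withheld."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

patient = {
    "name": "Jane Doe",
    "ssn": "123-45-6789",
    "address": "42 Elm St",
    "age": 52,
    "gender": "F",
    "condition": "hypertension",
}

research_record = strip_identifiers(patient)
# research_record keeps only age, gender, and condition
```

Note that, as the rest of this article explains, dropping direct identifiers alone is usually not enough: the remaining quasi-identifiers may still allow re-identification.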

By protecting private or sensitive data through techniques like encryption, anonymization keeps personally identifiable information from databases and protects it against identity theft. It can also help to prevent credit card fraud. 

When data is properly anonymized, it is no longer personal data and is not subject to the provisions of the GDPR. This means companies can freely leverage and share it and derive what business insights they can from it.

The GDPR restricts the length of time an organization can retain data. However, data that’s anonymized can be held indefinitely, which is important for research institutions that need to conduct longitudinal studies.

When organizations practice proper anonymization, they can share the fact with their customers and shareholders. This demonstrates their commitment to data privacy, which is highly regarded by customers and other stakeholders in today’s climate. It is one way to earn the trust of customers and build a reputation for the organization.

Data anonymization techniques

Encryption

This method uses algorithms to convert data into a coded form that is unintelligible to anyone who accesses it. Because encryption is reversible, it doesn’t represent complete protection on its own: anyone with the encryption key can restore the data to its original form.
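That reversibility is easy to demonstrate with, for example, the Fernet recipe from the third-party `cryptography` package (one of several common choices, used here purely for illustration): anyone holding the key recovers the plaintext.

```python
# Symmetric encryption with the `cryptography` package (pip install cryptography).
# The data is unintelligible without the key, but fully reversible with it.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # must be stored and protected separately
cipher = Fernet(key)

token = cipher.encrypt(b"Clive Martens, DOB 1986-03-14")
# `token` is unintelligible to anyone without the key...

plaintext = cipher.decrypt(token)
# ...but the key holder gets the original data back unchanged.
```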

Currently, organizations use encryption to protect data in transit and data at rest, keeping it safe from malicious parties. Encryption is also widely used when moving data to the cloud.

Data-in-use can be protected through a technique called homomorphic encryption, which allows companies to work with data without decrypting it first. This means data can be processed in commercial cloud environments while being encrypted.

Pseudonymization

In pseudonymization, personally identifying data is replaced with fake identifiers, or pseudonyms: for instance, replacing Clive Martens with Anton Owen. The GDPR specifically promotes this data de-identification method.

Pseudonymization renders data identifiable later on, but only to authorized users or someone with access to the original data set. Since pseudonymized data can be re-identified, it remains personal data and is therefore still subject to the provisions of the GDPR.

In Article 4(5), the GDPR defines the process of pseudonymization as “the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information.” This additional data should be stored and protected separately.
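A minimal sketch of the idea, assuming a simple in-memory lookup table (in practice the "additional information" would live in its own access-controlled store, as Article 4(5) requires):

```python
import secrets

# Pseudonymization sketch: identifiers are swapped for random pseudonyms,
# and the table linking pseudonyms back to real identities is the
# "additional information" that must be stored and protected separately.

lookup_table = {}  # in production: a separate, access-controlled store

def pseudonymize(name: str) -> str:
    pseudonym = "subject-" + secrets.token_hex(4)
    lookup_table[pseudonym] = name  # enables later re-identification
    return pseudonym

alias = pseudonymize("Clive Martens")
# Anyone with the lookup table can reverse the process -- which is exactly
# why pseudonymized data is still personal data under the GDPR.
original = lookup_table[alias]
```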

Perturbation

Data perturbation alters the original dataset by adding random "noise" to it. For instance, numerical values like the age or height of a cohort can be changed by applying a mathematical operation (addition, subtraction, multiplication, division, or a more complicated function) to the values: say, adding 5 to every value or multiplying every value by 5. The perturbed values must still make sense, though. It would look strange and not credible to multiply "years married" by 9, whereas you could multiply street numbers by 9.
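A simple sketch of perturbing ages with bounded random noise (the shift size here is arbitrary, chosen so the results still make sense as ages):

```python
import random

# Perturbation sketch: each age is shifted by small random noise.
# Exact values are hidden while aggregate statistics stay roughly intact.

def perturb_ages(ages, max_shift=3, seed=None):
    rng = random.Random(seed)
    return [age + rng.randint(-max_shift, max_shift) for age in ages]

ages = [34, 27, 61, 45]
noisy = perturb_ages(ages, seed=42)
# Each noisy value is within +/-3 of the original and still plausible as an age.
```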

Tokenization

Tokenization transforms a real piece of information, such as a credit card or social security number, into a meaningless string of numbers and characters called a token. Tokenization secures the original data by replacing it with an unrelated value in the same format. The organization can then work with these unique but worthless tokens while the original data sits safely in a secure token vault.

Tokenized data cannot be deciphered or reversed since no mathematical method was used to create them – there is no mathematical relationship between the token and the original number it’s meant to replace. Tokenization can prevent attackers from getting their hands on any useful data.
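A toy sketch of the vault idea, assuming a simple in-memory dictionary stands in for the secure token vault:

```python
import secrets

# Tokenization sketch: real values are replaced by random tokens of the
# same shape, and the mapping lives only in the "token vault".

token_vault = {}  # in production: a hardened, access-controlled vault

def tokenize(card_number: str) -> str:
    # Random digits of the same length: there is no mathematical
    # relationship to the original, so the token cannot be reversed
    # without access to the vault.
    token = "".join(secrets.choice("0123456789") for _ in card_number)
    token_vault[token] = card_number
    return token

token = tokenize("4111111111111111")
# The token looks like a card number but is worthless to an attacker.
```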

Masking

Masking is often used to protect sensitive data like credit card numbers. This technique replaces original data with a default character or symbol. For example, a credit card number is rendered safe when all but the first six and last four digits are replaced by "X". Masking is typically irreversible: once masked, the original values cannot be recovered from the masked data.
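The first-six/last-four convention can be sketched in one line of Python:

```python
# Masking sketch: keep the first six and last four digits of a card
# number and replace everything in between with a fixed fill character.

def mask_card(card_number: str, fill: str = "X") -> str:
    return card_number[:6] + fill * (len(card_number) - 10) + card_number[-4:]

masked = mask_card("4111111111111111")
# -> "411111XXXXXX1111"; the middle digits are unrecoverable from the mask
```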

Generalization

With this technique, the data is generalized so it loses its precise value, but keeps something of the original value. For instance, instead of noting the precise age as 35 years, it can be denoted as age 30 – 40, a gross income of $530,000 can be denoted as > $500,000, or a political science lecturer can become just a lecturer.

Generalization is effective if enough direct identifiers and quasi identifiers are removed so that the individuals cannot be identified. 
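The banding examples above can be sketched as simple mapping functions (the band width and income threshold here are arbitrary illustrations):

```python
# Generalization sketch: precise values are mapped to coarser bands.

def generalize_age(age: int, band: int = 10) -> str:
    low = (age // band) * band
    return f"{low}-{low + band}"

def generalize_income(income: float, threshold: float = 500_000) -> str:
    return f"> ${threshold:,.0f}" if income > threshold else f"<= ${threshold:,.0f}"

# A precise age of 35 becomes the band "30-40", and a gross income
# of $530,000 becomes "> $500,000", as in the examples above.
age_band = generalize_age(35)
income_band = generalize_income(530_000)
```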

Redaction

This technique protects sensitive personally identifying data by removing all or part of a field; masking, described above, is one form of redaction. Redaction is not suitable in all instances, but it does render data securely unidentifiable, and in most cases it is irreversible.

Differential privacy

A relatively new technology, differential privacy has become the gold standard for ensuring the privacy of data. Differential privacy allows for data analysis that preserves privacy. Computer scientists developed the technology to allow researchers to find patterns within data sets and draw conclusions from a group as a whole without compromising individual privacy. The method is used to protect massive data sets.

Differential privacy, also known as “epsilon indistinguishability”, creates privacy by adding randomness, or “noise”, to a calculation performed on a data set. Apple uses differential privacy to protect user privacy in iOS, as does Google. The US Census Bureau is planning to use differential privacy to protect the privacy of citizens.
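The standard way to add that randomness is the Laplace mechanism: noise drawn from a Laplace distribution with scale sensitivity/epsilon is added to the query result, so a smaller epsilon means more noise and stronger privacy. A stdlib-only sketch (the query and parameters are hypothetical):

```python
import math
import random

# Laplace mechanism sketch: add Laplace(scale = sensitivity / epsilon)
# noise to a query result. Smaller epsilon -> more noise -> more privacy.

def laplace_noise(scale, rng):
    # Inverse-CDF sampling of the Laplace distribution.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_count(true_count, epsilon, sensitivity=1.0, seed=None):
    rng = random.Random(seed)
    return true_count + laplace_noise(sensitivity / epsilon, rng)

# Releasing a count over a dataset: the published value is close to,
# but deliberately not exactly, the true count of 1,000 patients.
released = private_count(1000, epsilon=0.5, seed=7)
```

Because a count query changes by at most 1 when one person is added or removed, its sensitivity is 1, which is why `sensitivity=1.0` is the default here.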

Examples of poor anonymization practices

·       Removing names from a data set. 

Simply removing names from a data set does not safeguard the individuals on the list. As long as other accessible data exists, attackers can link the list to that data. That is how researchers at the University of Texas at Austin de-anonymized some of the Netflix data in 2007: they compared the rankings and timestamps of Netflix users with information that was publicly available on the movie-rating website IMDb.

·       Thinking pseudonymization is the same as anonymization

Pseudonymization obscures data, but it doesn’t render it anonymous. Pseudonymized data can be linked to an individual if the original data set is accessible. So, pseudonymizing your data doesn’t provide the same protection as anonymizing it.   

·       Anonymizing data while keeping the original data 

This follows from the previous point. If the original data is kept alongside the anonymized data, the result is effectively only pseudonymized, because the records can ultimately be re-linked to individuals.

·       Thinking data is worthless

Thinking that the data you’re looking at is worthless to others is an easy trap to fall into. What appears worthless now may in fact be valuable in a way you can’t yet conceive, or become valuable later on. The smallest breadcrumb may lead back to the one who dropped it.

The bad news for us all

Research shows that anonymization seldom meets the stringent standards set by the GDPR and that there remains a high risk of re-identification even when only a few attributes are available.

Researchers at Imperial College London used a new statistical model to calculate how likely it is that a piece of nameless data can be traced back to a specific person. They found that it is disconcertingly easy, even when the data set is incomplete.

The research found that, in the US, there is a 99.98 percent chance of identifying individuals from 15 attributes in any anonymized data set. You may think that 15 is a high number, but consider that many data sets hold many more of our attributes. 

In one test, using only three attributes resulted in a re-identification probability of 95 percent. That is really unsettling.

The researchers are calling for privacy standards to be higher and data anonymization practices to be reviewed. Datahunter is one such platform that can help assess the effectiveness of your data anonymization programs.  Learn more at www.datahunter.ai or www.apption.com
