When creating tables in a relational database system, primary and foreign keys are used to uniquely identify each row of a table, and relational databases provide primary key constraints that enforce this rule. Append-only BigQuery tables have no such enforced constraints, because BigQuery is designed to store large denormalized tables; it is essentially a data lake solution. So if we do want anonymized primary keys, we need to generate them ourselves. This step-by-step guide shows how to create anonymized primary keys for BigQuery tables. Consider the following table with salary information.
Together, relation_id and relation_version uniquely identify a table row. You want to report on the performance per customer, but you may not report on relation_ids and relation_versions. We have another table that contains customer information:
| relation_id | relation_version | customer_name |
|---|---|---|
| 11 | 2 | Carpet and Sons |
The proper way to join these tables would be to load both into BigQuery and join them on both relation_id and relation_version. However, we may not do this for confidentiality reasons.
We initially tried concatenating relation_id and relation_version and feeding the result into a SHA-256 hash function to create a unique anonymized row identifier to join on. The problem with this approach is that the concatenation is not unique across rows: for example, '1' + '12' = '112' and '11' + '2' = '112'.
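The collision is easy to reproduce. The sketch below (the helper name sha256_hex is ours, not from the original pipeline) shows that two distinct key pairs hash to the same identifier once their concatenations coincide:

```python
import hashlib

def sha256_hex(s: str) -> str:
    """Return the SHA-256 hex digest of a string."""
    return hashlib.sha256(s.encode()).hexdigest()

# Two distinct (relation_id, relation_version) pairs whose
# concatenations are both "112":
a = sha256_hex("1" + "12")
b = sha256_hex("11" + "2")
print(a == b)  # True: the hashes collide, so this key is not unique
```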
How to uniquely anonymize data using Python
So how did we solve the duplication issue? We used a Cantor pairing function, which maps an ordered pair of non-negative integers (the order of the integers matters) to a single value in a one-to-one relation. For our customer we used the Python SDK of Apache Beam to process the input data.
```python
import hashlib

def SHA_256(hash_input):
    return hashlib.sha256(str(hash_input).encode()).hexdigest()

# Example key pair; in practice these values come from the input row.
relation_id = "11"
relation_version = "2"

k1 = int(relation_id)
k2 = int(relation_version)

# Cantor pairing: maps the ordered pair (k1, k2) to a single unique integer.
# Integer division (//) keeps the result an int instead of a float.
unique_cantor_pairing_value = (k1 + k2) * (k1 + k2 + 1) // 2 + k2

hash_input = str(unique_cantor_pairing_value)
unique_anonymized_string = SHA_256(hash_input)
```
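A quick sanity check shows the pairing resolves exactly the collision described above: the pairs (1, 12) and (11, 2), whose concatenations both give '112', now map to different values. The helper name cantor_pair below is ours, for illustration:

```python
def cantor_pair(k1: int, k2: int) -> int:
    # Order matters: cantor_pair(a, b) != cantor_pair(b, a) in general.
    return (k1 + k2) * (k1 + k2 + 1) // 2 + k2

print(cantor_pair(1, 12))  # 103
print(cantor_pair(11, 2))  # 93
```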
If you need to map more than two values to one unique value, you can use the generalized Cantor tuple function, which simply applies the Cantor pairing function recursively. If you only need to anonymize one column of the table instead of two, you don't need the Cantor pairing function at all; hashing the single value suffices.
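The recursive application can be sketched as a fold over the key columns. This is an illustrative implementation, not the original pipeline code:

```python
def cantor_pair(k1: int, k2: int) -> int:
    return (k1 + k2) * (k1 + k2 + 1) // 2 + k2

def cantor_tuple(*ks: int) -> int:
    """Generalized Cantor tuple function: fold cantor_pair over n >= 2 integers."""
    result = ks[0]
    for k in ks[1:]:
        result = cantor_pair(result, k)
    return result

print(cantor_tuple(11, 2))    # 93, same as the plain pairwise case
print(cantor_tuple(1, 2, 3))  # 69, i.e. cantor_pair(cantor_pair(1, 2), 3)
```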
Applying the Cantor pairing function to the first table yields:
For the second table we get:
| customer_name |
|---|
| Carpet and Sons |
As you can see, we cannot derive the relation_id and relation_version from this value, but we can use the relation_dwh_id for SQL relational joins instead.
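End to end, the join then works on the anonymized key alone. The sketch below uses small in-memory lists with a hypothetical salary value standing in for the two tables (the real pipeline runs on Apache Beam):

```python
import hashlib

def relation_dwh_id(relation_id: int, relation_version: int) -> str:
    """Cantor-pair the key columns, then SHA-256 hash the result."""
    s = relation_id + relation_version
    paired = s * (s + 1) // 2 + relation_version
    return hashlib.sha256(str(paired).encode()).hexdigest()

# Hypothetical rows standing in for the salary and customer tables.
salaries = [{"relation_dwh_id": relation_dwh_id(11, 2), "salary": 50000}]
customers = [{"relation_dwh_id": relation_dwh_id(11, 2),
              "customer_name": "Carpet and Sons"}]

# Join on the anonymized key instead of the original key columns.
by_key = {row["relation_dwh_id"]: row for row in customers}
joined = [{**row, **by_key[row["relation_dwh_id"]]}
          for row in salaries if row["relation_dwh_id"] in by_key]
print(joined[0]["customer_name"])  # Carpet and Sons
```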
This article showed how to uniquely anonymize primary keys, using Python as an example. Central to the solution is the Cantor pairing function. In the coming weeks my second blog, "Implementing a left join in Google Dataflow (Apache Beam)", will be published as well.
Other Google Cloud technical blog posts:
- Creating a master orchestrator to handle complex Google Cloud Composer jobs
- Creating complex orchestrations with Google Cloud Composer using sub_controllers
- Google Cloud: Infrastructure as Code with Terraform