
Creating anonymized Primary keys for Google BigQuery

When creating tables in a relational database system, primary and foreign keys are needed to uniquely identify each row of a table, and relational databases have primary key constraints that enforce this rule. For 'append only' BigQuery tables, such constraints don't exist: BigQuery is designed to store one large denormalized table and is essentially a data lake solution. If we do want anonymized primary keys, we need to generate them ourselves. Read this step-by-step manual if you want to create anonymized primary keys for BigQuery tables. Consider the following table with salary information.


relation_id  relation_version  date        performance
1            12                2018-12-10  2700
11           2                 2018-12-11  2750
3            1                 2018-12-10  1600

The challenge

relation_id and relation_version together uniquely identify a table row. You want to report on the performance per customer, but you may not report on relation_ids and relation_versions. We have another table that contains customer information:

relation_id  relation_version  customer
1            12                Zony
11           2                 Carpet and Sons
3            1                 Oil NV


The proper way to join these tables would be to load both into BigQuery and join them on both relation_id and relation_version. However, we may not do this for secrecy reasons.

We initially tried to take the concatenation of relation_id and relation_version as the input of a SHA-256 hashing function, to create a unique anonymized row identifier that can be used to join on. The problem with this approach is that the concatenation is not unique across rows: e.g. '1' + '12' = '112' and '11' + '2' = '112'.
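The collision is easy to demonstrate. The sketch below, using a minimal `sha_256` helper written for this illustration, shows that the two distinct key pairs from the example hash to the same value:

```python
import hashlib

def sha_256(value):
    """Return the SHA-256 hex digest of the given value."""
    return hashlib.sha256(str(value).encode()).hexdigest()

# Two different (relation_id, relation_version) pairs ...
a = sha_256("1" + "12")   # concatenation is "112"
b = sha_256("11" + "2")   # concatenation is also "112"

print(a == b)  # True: both pairs collapse to the same anonymized key
```

Since the hash input is identical, no choice of hash function can fix this; the collision has to be resolved before hashing.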

How to uniquely anonymize data using Python

So what did we do to solve the duplication issue? We used a Cantor pairing function, which maps a pair of integers, where the order of the integers matters, to one unique value in a one-to-one relation. For our customer we used the Python SDK of Apache Beam to process the input data.

import hashlib

def SHA_256(hash_input):
    """Return the SHA-256 hex digest of the given value."""
    return hashlib.sha256(str(hash_input).encode()).hexdigest()

k1 = int(relation_id)
k2 = int(relation_version)
# Cantor pairing: maps the ordered pair (k1, k2) to a single unique integer.
# Use integer division (//) so the result stays an exact int in Python 3.
unique_cantor_pairing_value = (k1 + k2) * (k1 + k2 + 1) // 2 + k2
unique_anonymized_string = SHA_256(unique_cantor_pairing_value)

If you need to map more than 2 values to 1 unique value, you can use the generalized Cantor tuple function, which simply applies the Cantor pairing function recursively. If you only need to anonymize 1 column of the table instead of 2, you don't need the Cantor pairing function at all.
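The recursive generalization can be sketched as follows. The function names `cantor_pair` and `cantor_tuple` are my own; the pairing formula is the standard one used above:

```python
def cantor_pair(k1, k2):
    # Standard Cantor pairing: a bijection between ordered pairs of
    # non-negative integers and single non-negative integers.
    return (k1 + k2) * (k1 + k2 + 1) // 2 + k2

def cantor_tuple(*ks):
    # Generalized Cantor tuple function: fold the pairing function over
    # the values, pairing the accumulated result with the next element.
    result = ks[0]
    for k in ks[1:]:
        result = cantor_pair(result, k)
    return result

# Order matters, so the pairs that collided under concatenation no longer do:
print(cantor_pair(1, 12))  # 103
print(cantor_pair(11, 2))  # 93
```

Because each step of the fold is itself a bijection, distinct tuples always map to distinct integers, which is exactly the property the concatenation approach lacked.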

Applying the Cantor pairing function and hashing the result yields, for the first table:

relation_dwh_id   date        performance
(sha-256 hash)    2018-12-10  2700
(sha-256 hash)    2018-12-11  2750
(sha-256 hash)    2018-12-10  1600

For the second table we get:

relation_dwh_id   customer
(sha-256 hash)    Zony
(sha-256 hash)    Carpet and Sons
(sha-256 hash)    Oil NV


As you can see, we can't derive relation_id and relation_version from this value, but we can use relation_dwh_id to do SQL relational joins instead.
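To see the whole pipeline end to end, here is a small in-memory sketch. The helper `anonymized_key` is hypothetical and simply combines the Cantor pairing and SHA-256 steps from above; the dictionaries stand in for the two BigQuery tables:

```python
import hashlib

def anonymized_key(relation_id, relation_version):
    # Hypothetical helper: Cantor pairing followed by SHA-256, as above.
    k1, k2 = int(relation_id), int(relation_version)
    cantor = (k1 + k2) * (k1 + k2 + 1) // 2 + k2
    return hashlib.sha256(str(cantor).encode()).hexdigest()

# The two example tables, keyed by their anonymized relation_dwh_id.
salaries = {anonymized_key(1, 12): 2700,
            anonymized_key(11, 2): 2750,
            anonymized_key(3, 1): 1600}
customers = {anonymized_key(1, 12): "Zony",
             anonymized_key(11, 2): "Carpet and Sons",
             anonymized_key(3, 1): "Oil NV"}

# Join the tables on relation_dwh_id without exposing the raw key pairs.
joined = {customers[key]: performance for key, performance in salaries.items()}
print(joined)  # {'Zony': 2700, 'Carpet and Sons': 2750, 'Oil NV': 1600}
```

In BigQuery itself this corresponds to an ordinary equi-join on the relation_dwh_id column.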

Closing words

This article showed how to uniquely anonymize primary keys, using Python as an example. Central to the solution is the Cantor pairing function. My second blog post, "Implementing a left join in Google Dataflow (Apache Beam)", will be published in the coming weeks.

