When creating tables in a relational database system, primary and foreign keys are used to uniquely identify each row of a table, and relational databases provide primary key constraints that enforce this rule. Append-only BigQuery tables have no such enforced constraints, because BigQuery is designed to store large denormalized tables; it is essentially a data lake solution. So if we do want anonymized primary keys, we need to generate them ourselves. This step-by-step guide shows how to create anonymized primary keys for BigQuery tables. Consider the following table with salary information.
Together, relation_id and relation_version uniquely identify a table row. You want to report on the performance per customer, but you may not report on relation_ids and relation_versions. We have another table that contains customer information:
| relation_id | relation_version | customer_name |
|---|---|---|
| 11 | 2 | Carpet and Sons |
The proper way to join these tables would be to load both into BigQuery and join them on both relation_id and relation_version. However, we may not do this for confidentiality reasons.
We initially tried concatenating relation_id and relation_version and feeding the result into a SHA-256 hash function to create a unique anonymized row identifier to join on. The problem with this approach is that the concatenation is not unique across rows: for example, '1' + '12' = '112' and '11' + '2' = '112'.
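The collision is easy to reproduce. The sketch below (the helper name sha256_hex is ours, not from the original pipeline) shows that two distinct key pairs hash to the same identifier once their concatenations coincide:

```python
import hashlib

def sha256_hex(s: str) -> str:
    """Return the SHA-256 hex digest of a string."""
    return hashlib.sha256(s.encode()).hexdigest()

# Two distinct (relation_id, relation_version) pairs whose
# concatenations are both "112":
a = sha256_hex("1" + "12")
b = sha256_hex("11" + "2")
print(a == b)  # True: the hashes collide, so this key is not unique
```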
How to uniquely anonymize data using Python
So how did we solve the duplication issue? We used a Cantor pairing function, which maps an ordered pair of non-negative integers (the order of the integers matters) to a single value in a one-to-one relation. For our customer we used the Python SDK of Apache Beam to process the input data.
```python
import hashlib

def SHA_256(hash_input):
    return hashlib.sha256(str(hash_input).encode()).hexdigest()

# Example key pair; in practice these values come from the input row.
relation_id = "11"
relation_version = "2"

k1 = int(relation_id)
k2 = int(relation_version)

# Cantor pairing: maps the ordered pair (k1, k2) to a single unique integer.
# Integer division (//) keeps the result an int instead of a float.
unique_cantor_pairing_value = (k1 + k2) * (k1 + k2 + 1) // 2 + k2

hash_input = str(unique_cantor_pairing_value)
unique_anonymized_string = SHA_256(hash_input)
```
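A quick sanity check shows the pairing resolves exactly the collision described above: the pairs (1, 12) and (11, 2), whose concatenations both give '112', now map to different values. The helper name cantor_pair below is ours, for illustration:

```python
def cantor_pair(k1: int, k2: int) -> int:
    # Order matters: cantor_pair(a, b) != cantor_pair(b, a) in general.
    return (k1 + k2) * (k1 + k2 + 1) // 2 + k2

print(cantor_pair(1, 12))  # 103
print(cantor_pair(11, 2))  # 93
```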
If you need to map more than two values to one unique value, you can use the generalized Cantor tuple function, which simply applies the Cantor pairing function recursively. If you only need to anonymize one column of the table instead of two, you don't need the Cantor pairing function at all; hashing the single value suffices.
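The recursive application can be sketched as a fold over the key columns. This is an illustrative implementation, not the original pipeline code:

```python
def cantor_pair(k1: int, k2: int) -> int:
    return (k1 + k2) * (k1 + k2 + 1) // 2 + k2

def cantor_tuple(*ks: int) -> int:
    """Generalized Cantor tuple function: fold cantor_pair over n >= 2 integers."""
    result = ks[0]
    for k in ks[1:]:
        result = cantor_pair(result, k)
    return result

print(cantor_tuple(11, 2))    # 93, same as the plain pairwise case
print(cantor_tuple(1, 2, 3))  # 69, i.e. cantor_pair(cantor_pair(1, 2), 3)
```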
Applying the Cantor pairing function to the first table yields:
For the second table we get:
| customer_name |
|---|
| Carpet and Sons |
As you can see, we cannot derive the relation_id and relation_version from this value, but we can use the relation_dwh_id for SQL relational joins instead.
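End to end, the join then works on the anonymized key alone. The sketch below uses small in-memory lists with a hypothetical salary value standing in for the two tables (the real pipeline runs on Apache Beam):

```python
import hashlib

def relation_dwh_id(relation_id: int, relation_version: int) -> str:
    """Cantor-pair the key columns, then SHA-256 hash the result."""
    s = relation_id + relation_version
    paired = s * (s + 1) // 2 + relation_version
    return hashlib.sha256(str(paired).encode()).hexdigest()

# Hypothetical rows standing in for the salary and customer tables.
salaries = [{"relation_dwh_id": relation_dwh_id(11, 2), "salary": 50000}]
customers = [{"relation_dwh_id": relation_dwh_id(11, 2),
              "customer_name": "Carpet and Sons"}]

# Join on the anonymized key instead of the original key columns.
by_key = {row["relation_dwh_id"]: row for row in customers}
joined = [{**row, **by_key[row["relation_dwh_id"]]}
          for row in salaries if row["relation_dwh_id"] in by_key]
print(joined[0]["customer_name"])  # Carpet and Sons
```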
This article showed how to uniquely anonymize primary keys, using Python as an example. Central to the solution is the Cantor pairing function. In the coming weeks my second blog, "Implementing a left join in Google Dataflow (Apache Beam)", will be published as well.
Other Google Cloud technical blog posts:
- Creating a master orchestrator to handle complex Google Cloud Composer jobs
- Creating complex orchestrations with Google Cloud Composer using sub_controllers
- Google Cloud: Infrastructure as Code with Terraform