When creating tables in a relational database, a primary key uniquely identifies each row of a table, and relational databases provide primary key constraints to enforce this rule. BigQuery's append-only tables have no such constraints, because BigQuery is designed to store large denormalized tables; it is essentially a data lake solution. If we do want anonymized primary keys, we need to generate them ourselves. Read this step-by-step guide if you want to create anonymized primary keys for BigQuery tables. Consider the following table with salary information.
| relation_id | relation_version | date | performance |
|---|---|---|---|
| 1 | 12 | 2018-12-10 | 2700 |
| 11 | 2 | 2018-12-11 | 2750 |
| 3 | 1 | 2018-12-10 | 1600 |
| … | … | … | … |
The challenge
relation_id and relation_version together uniquely identify a table row. You want to report on the performance per customer, but you are not allowed to report the relation_id and relation_version values themselves. We have another table that contains customer information:
| relation_id | relation_version | customer |
|---|---|---|
| 1 | 12 | Zony |
| 11 | 2 | Carpet and Sons |
| 3 | 1 | Oil NV |
The proper way to join these tables would be to load both into BigQuery and join them on both relation_id and relation_version. However, we are not allowed to do this for confidentiality reasons.
We initially tried concatenating relation_id and relation_version and feeding the result into a SHA-256 hash function to create a unique anonymized row identifier to join on. The problem with this approach is that the concatenation is not unique across rows: for example, ‘1’ + ‘12’ = ‘112’ and ‘11’ + ‘2’ = ‘112’.
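The collision is easy to reproduce. A minimal sketch (the helper name `sha_256` is illustrative):

```python
import hashlib

def sha_256(value):
    return hashlib.sha256(str(value).encode()).hexdigest()

# Two different key pairs produce the same concatenation,
# so their hashes collide as well.
a = sha_256("1" + "12")   # relation_id=1,  relation_version=12
b = sha_256("11" + "2")   # relation_id=11, relation_version=2
print(a == b)             # True: both hash the string '112'
```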
How to uniquely anonymize data using Python
So what did we do to solve the duplication issue? We used the Cantor pairing function, which maps an ordered pair of non-negative integers (the order of the integers matters) to a single unique value in a one-to-one relation. For our customer we used the Python SDK of Apache Beam to process the input data.
```python
import hashlib

def SHA_256(hash_input):
    return hashlib.sha256(str(hash_input).encode()).hexdigest()

k1 = int(relation_id)
k2 = int(relation_version)

# Cantor pairing function: maps the ordered pair (k1, k2) to a single
# unique integer. Note the integer division // to keep the result an int.
unique_cantor_pairing_value = (k1 + k2) * (k1 + k2 + 1) // 2 + k2

unique_anonymized_string = SHA_256(unique_cantor_pairing_value)
```
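In a Beam pipeline this logic runs per element. A minimal per-row sketch, assuming each element is a dict and the field names match the tables above (the function name `anonymize_row` is illustrative):

```python
import hashlib

def anonymize_row(row):
    # Replace the two key columns with one anonymized identifier.
    k1 = int(row["relation_id"])
    k2 = int(row["relation_version"])
    cantor = (k1 + k2) * (k1 + k2 + 1) // 2 + k2
    out = dict(row)
    del out["relation_id"], out["relation_version"]
    out["relation_dwh_id"] = hashlib.sha256(str(cantor).encode()).hexdigest()
    return out

# In a Beam pipeline this would be applied as: rows | beam.Map(anonymize_row)
row = {"relation_id": "1", "relation_version": "12", "performance": 2700}
print(anonymize_row(row))
```

Because the same (relation_id, relation_version) pair always yields the same relation_dwh_id, applying this to both tables produces joinable identifiers.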
If you need to map more than two values to one unique value, you can use the generalized Cantor tuple function, which simply applies the Cantor pairing function recursively. If you only need to anonymize one column of the table instead of two, you don’t need the Cantor pairing function at all.
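The recursive generalization can be sketched as follows (function names are illustrative):

```python
def cantor_pair(k1, k2):
    # Cantor pairing function: a bijection from ordered pairs of
    # non-negative integers to a single non-negative integer.
    return (k1 + k2) * (k1 + k2 + 1) // 2 + k2

def cantor_tuple(*values):
    # Generalize to n values by folding the pairing function left to
    # right: cantor_tuple(a, b, c) == cantor_pair(cantor_pair(a, b), c)
    result = values[0]
    for k in values[1:]:
        result = cantor_pair(result, k)
    return result

print(cantor_pair(1, 12))      # 103
print(cantor_pair(11, 2))      # 93: different pair, different value
print(cantor_tuple(1, 12, 3))  # three columns folded into one integer
```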
Applying the Cantor pairing function to the first table yields:
| relation_dwh_id | date | performance |
|---|---|---|
| 454f63ac30c8322997ef025edff6abd23e0dbe7b8a3d5126a894e4a168c1b59b | 2018-12-10 | 2700 |
| 6e4001871c0cf27c7634ef1dc478408f642410fd3a444e2a88e301f5c4a35a4d | 2018-12-11 | 2750 |
| 4fc82b26aecb47d2868c4efbe3581732a3e7cbcc6c2efb32062c08170a05eeb8 | 2018-12-10 | 1600 |
For the second table we get:
| relation_dwh_id | customer |
|---|---|
| 454f63ac30c8322997ef025edff6abd23e0dbe7b8a3d5126a894e4a168c1b59b | Zony |
| 6e4001871c0cf27c7634ef1dc478408f642410fd3a444e2a88e301f5c4a35a4d | Carpet and Sons |
| 4fc82b26aecb47d2868c4efbe3581732a3e7cbcc6c2efb32062c08170a05eeb8 | Oil NV |
As you can see, the relation_id and relation_version cannot be derived from this value, but we can use relation_dwh_id to perform relational SQL joins instead.
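A toy in-memory illustration of that join (hash values shortened here for readability):

```python
# Anonymized salary rows and customer rows sharing relation_dwh_id.
salaries = [
    {"relation_dwh_id": "454f63ac", "date": "2018-12-10", "performance": 2700},
    {"relation_dwh_id": "6e400187", "date": "2018-12-11", "performance": 2750},
]
customers = [
    {"relation_dwh_id": "454f63ac", "customer": "Zony"},
    {"relation_dwh_id": "6e400187", "customer": "Carpet and Sons"},
]

# Equivalent of: SELECT ... FROM salaries JOIN customers USING (relation_dwh_id)
customer_by_id = {c["relation_dwh_id"]: c["customer"] for c in customers}
joined = [{**s, "customer": customer_by_id[s["relation_dwh_id"]]} for s in salaries]
print(joined)
```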
Closing words
This article showed how to uniquely anonymize primary keys, using Python as an example. Central to the solution is the Cantor pairing function. My second blog, “Implementing a left join in Google Dataflow (Apache Beam)”, will be published in the coming weeks.
Other Google Cloud Technical blog-posts:
- Creating a master orchestrator to handle complex Google Cloud Composer jobs
- Creating complex orchestrations with Google Cloud Composer using sub_controllers
- Google Cloud: Infrastructure as Code with Terraform