With the rise of cloud platforms and the lakehouse paradigm, data governance has become increasingly complex. Organizations face security and governance models that can be difficult to navigate because data is spread across multiple cloud platforms, regions, organizations, and workspaces.
When you have multiple places to store your data, managing and securing it becomes complicated. Storage solutions typically tie security to the physical location of the data, meaning that changing security requires changing the physical data layout and vice versa. Unity Catalog offers a unified solution to this problem by providing a centralized hub for managing permissions, fine-grained access control, a modern data catalog, and a suite of other useful functionality.
What is the Unity Catalog, and what does it offer?
Databricks Unity Catalog is a unified data governance solution; it serves as a centralized hub within Databricks that manages all permissions. You give it privileged access to your data assets, and in turn, it provides fine-grained access control to the entities within it. Unity Catalog does not live inside a single Databricks workspace; it is hosted outside of it, allowing you to set your permissions once and have them propagate to all workspaces in a region. Administrators can set permissions through a user interface or, for the more code-savvy, through a SQL API.
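As a minimal sketch of the SQL route, permissions are granted and revoked with standard GRANT and REVOKE statements; the catalog, schema, table, and group names below are hypothetical:

    -- Allow the analysts group to browse a catalog and read one table
    GRANT USE CATALOG ON CATALOG sales_catalog TO `analysts`;
    GRANT USE SCHEMA ON SCHEMA sales_catalog.reporting TO `analysts`;
    GRANT SELECT ON TABLE sales_catalog.reporting.orders TO `analysts`;

    -- Take the read access away again
    REVOKE SELECT ON TABLE sales_catalog.reporting.orders FROM `analysts`;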
Unity Catalog simplifies data governance so that your data engineers, analysts, and data scientists can focus on working with data rather than on managing access to it.
What does Databricks Unity Catalog have to offer?
Fine-Grained Access Control: Unity Catalog is secure by default. Any entity added to Databricks has no access initially; each data asset becomes an object to which entities are granted permissions. This allows you to create a fine-grained security layout across your files, tables, views, models, columns, and rows. Auditing is possible through audit logs, enabling you to gain insights into how your data is used and by whom. While Databricks previously provided access control for the legacy Hive metastore, that permission model is non-restrictive by default: you have to use its built-in access control functionality very carefully to ensure entities don’t get unwanted access.
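As a hedged sketch of what row- and column-level control can look like in Databricks SQL, a filter or mask function is created and then attached to the table; the table, column, function, and group names here are illustrative:

    -- Row filter: non-admins only see rows for their own region
    CREATE FUNCTION sales_catalog.reporting.region_filter(region STRING)
    RETURN IF(is_account_group_member('admins'), TRUE, region = 'EMEA');
    ALTER TABLE sales_catalog.reporting.orders
    SET ROW FILTER sales_catalog.reporting.region_filter ON (region);

    -- Column mask: hide email addresses from everyone outside the support group
    CREATE FUNCTION sales_catalog.reporting.email_mask(email STRING)
    RETURN CASE WHEN is_account_group_member('support') THEN email ELSE '***' END;
    ALTER TABLE sales_catalog.reporting.customers
    ALTER COLUMN email SET MASK sales_catalog.reporting.email_mask;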
Data Discovery: Unity Catalog provides a search engine that considers access control while scanning your data, allowing you to find and reference any data point you have permission to within your data inventory.
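If you prefer to search from SQL rather than the UI, the information schema exposes the same permission-aware metadata; a rough example, with the search pattern as a placeholder:

    -- Find tables whose name mentions "customer" among everything you are allowed to see
    SELECT table_catalog, table_schema, table_name
    FROM system.information_schema.tables
    WHERE table_name ILIKE '%customer%';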
Auto-tuning: Unity Catalog offers better performance for your queries by providing low-latency metadata serving and table auto-tuning. It performs automatic data compaction in the background, optimizing data files for better input/output performance.
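If you want to nudge this behaviour yourself rather than rely only on the background service, the usual Delta options still apply; a sketch, with the table and column names as examples:

    -- Opt a table into optimized writes and automatic compaction
    ALTER TABLE sales_catalog.reporting.orders SET TBLPROPERTIES (
      'delta.autoOptimize.optimizeWrite' = 'true',
      'delta.autoOptimize.autoCompact' = 'true'
    );

    -- Or compact and co-locate data on demand
    OPTIMIZE sales_catalog.reporting.orders ZORDER BY (order_date);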
Automated Data Lineage: One of the latest features of Unity Catalog is automated data lineage. Unity Catalog combs through the Spark logs generated by your queries and visualizes your data flow, adding upstream and downstream visibility for each of your tables, columns, notebooks, dashboards, and workflows. This makes it easier to identify the source of data to prove that it is trustworthy, or to assess the impact of proposed changes. Automated data lineage sits on top of the Unity Catalog security layout, ensuring that entities can view only the data they are allowed to see. The automated data lineage tool also offers a REST API for integrating with other catalogs.
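Besides the UI and the REST API, lineage can also be queried from SQL through the lineage system tables; a rough sketch, assuming access to the system catalog and using an example table name:

    -- Which downstream tables were written from this source table, and by what kind of entity?
    SELECT target_table_full_name, entity_type, event_time
    FROM system.access.table_lineage
    WHERE source_table_full_name = 'sales_catalog.reporting.orders'
    ORDER BY event_time DESC;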
Delta Sharing: Unity Catalog has native support for Delta Sharing, an open protocol for secure data sharing. By creating a Share and adding data to it, you can easily share Delta tables across organizations without having to add the recipients as entities in your Databricks workspace. Any computing platform that supports Spark, Python, Java, Power BI, Tableau, and more can get read access to the data you want to share, without using or sharing your Databricks compute resources. Unity Catalog gives you the same level of data governance over shared data as over the rest of your data assets, meaning you can grant permissions with precision and track access to the shared data through auditing.
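A minimal sketch of the SQL side of a share, with hypothetical share, table, and recipient names:

    -- Create a share and put a table in it
    CREATE SHARE IF NOT EXISTS partner_share;
    ALTER SHARE partner_share ADD TABLE sales_catalog.reporting.orders;

    -- Create a recipient for the external organization and grant read access to the share
    CREATE RECIPIENT IF NOT EXISTS acme_corp;
    GRANT SELECT ON SHARE partner_share TO RECIPIENT acme_corp;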
Things to take into account
Databricks Unity Catalog is helpful if you want a unified data handling experience. Before you take the leap, there are a few things to consider if you wish to use it properly.
Initial Setup: Before setting up Unity Catalog, you must prepare and understand a few things. First, Unity Catalog requires a cloud storage solution such as Azure ADLS Gen2 (Unity Catalog does not support Azure Blob Storage) as its “default” location for storing your data. You must create a managed identity with privileged access to any cloud storage you would like Unity Catalog to manage. Additionally, you need to understand the Databricks account console, the place where you manage all workspaces, metastores, groups, and users in a single subscription.
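Once the storage account and managed identity exist, the storage is typically registered with Unity Catalog along these lines; this sketch assumes the storage credential wrapping the managed identity has already been created (for example through the account or Catalog Explorer UI), and all names and URLs are placeholders:

    -- Register an ADLS Gen2 container as an external location, secured by an existing storage credential
    CREATE EXTERNAL LOCATION IF NOT EXISTS landing_zone
    URL 'abfss://landing@examplestorageaccount.dfs.core.windows.net/'
    WITH (STORAGE CREDENTIAL example_managed_identity_cred);

    -- Let data engineers read and write files under that location
    GRANT READ FILES, WRITE FILES ON EXTERNAL LOCATION landing_zone TO `data_engineers`;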
Governance from the ground up: Unity Catalog is secure by default, so working with it requires that your permission model be put at the forefront of your designs. Who is allowed to create catalogs, schemas, and tables, for example, needs to be considered. Unity Catalog defines the owner of an object as the user or service principal who created it. As an owner, you can grant permissions to anyone else in your Unity Catalog or delete the object completely. This permission model might be acceptable for a development environment; a more robust one should be designed for production environments.
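One common pattern, sketched here with made-up catalog, table, and group names, is to restrict who can create objects and to move ownership from the individual creator to a group:

    -- Only the platform team may create new catalogs in the metastore
    GRANT CREATE CATALOG ON METASTORE TO `platform_team`;

    -- Inside a catalog, let data engineers create schemas
    GRANT USE CATALOG, CREATE SCHEMA ON CATALOG prod_catalog TO `data_engineers`;

    -- Move ownership of a table off the individual who created it and onto a group
    ALTER TABLE prod_catalog.core.orders OWNER TO `data_engineers`;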
Understand Objects: Unity Catalog treats every data asset and entity as an object, and each of these objects requires permissions or links to other objects. If you want to create a new catalog that points to a new storage location, the catalog will consist of three objects linked together: the catalog itself, the external location the catalog refers to, and the credential required to access that external location. If you want to use Unity Catalog, you must understand how to apply permissions to your objects properly.
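Continuing the hypothetical names used above, the three-object chain might be wired up like this, with the credential assumed to exist already:

    -- Credential (created beforehand) -> external location -> catalog whose managed data lives there
    CREATE EXTERNAL LOCATION IF NOT EXISTS finance_root
    URL 'abfss://finance@examplestorageaccount.dfs.core.windows.net/'
    WITH (STORAGE CREDENTIAL example_managed_identity_cred);

    CREATE CATALOG IF NOT EXISTS finance_catalog
    MANAGED LOCATION 'abfss://finance@examplestorageaccount.dfs.core.windows.net/';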
Who is the Unity Catalog for?
Databricks’ latest solution, the Unity Catalog, is designed to enable organizations to leverage the power of a lakehouse platform while retaining a centralized point of governance for their data. It is useful for organizations that want to delegate access to external cloud resources to Unity Catalog while maintaining complete control over the entities that access the data. It is a valuable tool for organizations that want visibility into how their data is generated, or that need an easy and secure way to grant access across organizations while still keeping access control and audit logs centralized.