In today’s data-driven world, organizations increasingly need ways to share data seamlessly and securely, both to promote collaboration and to share findings that support data-driven decisions. Databricks offers an open protocol called Delta Sharing, which enables organizations to share data with, and acquire data from, third parties regardless of the compute platform. In addition, the Databricks Marketplace is a centralized platform for sharing datasets, notebooks and machine learning models.
For instance, the Global Initiative on Sharing All Influenza Data (GISAID) was established in 2008 to promote the international sharing of influenza virus data, the publication of results, and intellectual collaboration between institutions. This effort paid off during the COVID-19 pandemic, helping health organizations track virus variants and share new findings. It is of global importance for entities of all kinds to be able to share their data seamlessly and securely. Data sharing can also open up data monetization opportunities, giving companies a potential additional stream of income.
What is Delta Sharing?
Announced by Databricks in May 2021 at the Data + AI Summit, Delta Sharing tackles issues found in existing data sharing platforms, such as high operational overhead and limited scalability. It is an open solution that lets data providers share data with interested parties without being tied to a particular platform, pinned to a single version of the data, or forced to copy the data. Delta Sharing works with Unity Catalog, which provides centralized governance and secure control over what can be shared, down to the individual partitions of a table that a recipient can access. The Unity Catalog audit log records detailed information about activities within the Unity Catalog ecosystem, such as modifications to data products, which is useful for monitoring changes and for data compliance.
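As a rough illustration of that granularity, the sketch below builds the kind of Databricks SQL statements a provider might run to create a share, add one partition of a table to it, and grant it to a recipient. The share, table, recipient and partition names are hypothetical, the helper `share_statements` is invented for this example, and the exact clauses available should be checked against the Databricks SQL reference for your workspace.

```python
from typing import List, Optional


def share_statements(
    share: str,
    table: str,
    recipient: str,
    partition: Optional[str] = None,
) -> List[str]:
    """Build SQL statements that share `table` with `recipient`.

    `partition` is an optional partition filter such as "region = 'EU'".
    This helper only builds strings; it does not talk to Databricks.
    """
    add_table = f"ALTER SHARE {share} ADD TABLE {table}"
    if partition:
        # Restrict the share to the matching partitions only.
        add_table += f" PARTITION ({partition})"
    return [
        f"CREATE SHARE IF NOT EXISTS {share}",
        add_table,
        f"CREATE RECIPIENT IF NOT EXISTS {recipient}",
        f"GRANT SELECT ON SHARE {share} TO RECIPIENT {recipient}",
    ]


# Hypothetical names, for illustration only.
statements = share_statements(
    share="sales_share",
    table="main.retail.orders",
    recipient="partner_co",
    partition="region = 'EU'",
)
for statement in statements:
    print(statement)
```

In a Databricks notebook, each statement could then be passed to `spark.sql(statement)`, assuming the caller has the privileges needed to create shares and recipients.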
Delta Sharing removes the step of copying data, which many other data sharing approaches require. This reduces the cost, time and resources spent on data ingestion and monitoring. The open REST protocol is not restricted to datasets: machine learning models and notebooks can also be shared with other parties by converting these assets into a shareable form such as a Delta table. The guiding idea of Delta Sharing is therefore the ability to share data anytime and anywhere, quickly and securely.
Figure 1: Delta Sharing with B2B on different platforms
How does Delta Sharing work?
Delta Sharing enables the distribution of data stored as Delta tables in Delta Lake, an open-source storage layer. The data is stored as versioned Parquet files, an open-source columnar format designed to handle high-volume workloads. Delta Lake keeps a transaction log that records every commit made to the table, which is what provides atomic, consistent, isolated, and durable (ACID) transactions. Time travel lets users access previous snapshots of the data. To handle evolving data schemas, Delta Lake’s schema evolution allows a table’s schema to change over time without impacting past data. Additionally, data integrity is protected by schema enforcement, which verifies that the data being loaded matches the defined table structure.
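To make the link between the transaction log and time travel more concrete, here is a toy model in Python. It is a deliberate simplification rather than Delta Lake's actual implementation: real commits are JSON files in the table's `_delta_log` directory and their actions carry far more metadata, whereas here each action is reduced to a file name. The `snapshot` function and the file names are made up for the example.

```python
from typing import Dict, List, Optional, Set

# A commit is a list of actions; each action either adds or removes a data file.
Commit = List[Dict[str, str]]


def snapshot(commits: List[Commit], version: Optional[int] = None) -> Set[str]:
    """Replay commits 0..version and return the data files live at that version."""
    if version is None:
        version = len(commits) - 1  # default to the latest version
    if not 0 <= version < len(commits):
        raise ValueError(f"version {version} does not exist")
    live: Set[str] = set()
    for commit in commits[: version + 1]:
        for action in commit:
            if "add" in action:
                live.add(action["add"])
            elif "remove" in action:
                live.discard(action["remove"])
    return live


# Three commits: two appends, then a rewrite that replaces the first file.
log: List[Commit] = [
    [{"add": "part-000.parquet"}],
    [{"add": "part-001.parquet"}],
    [{"remove": "part-000.parquet"}, {"add": "part-002.parquet"}],
]

print(sorted(snapshot(log, version=1)))  # files visible when reading version 1
print(sorted(snapshot(log)))             # files visible at the latest version
```

Because old Parquet files are only marked as removed in the log rather than deleted immediately, reading an earlier version amounts to replaying fewer commits.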
To share data, Delta Sharing uses an open REST protocol that gives data recipients access to the shared data. The protocol can be used from several languages that have the necessary connectors and libraries, such as Python, Scala, SQL, Java and R. There are two types of Delta Sharing. Open Delta Sharing is for data users who reside outside of the Databricks workspace and are given a token for authentication. Databricks-to-Databricks Delta Sharing enables data access for recipients with their own Databricks accounts, even on other cloud providers, without the need to generate and maintain access tokens. In both cases the protocol works with Unity Catalog to give the data provider secure permission control, granular data governance and tracking of data access.
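For the open sharing case, a recipient typically works with a small credential file (a "profile") that holds the sharing server endpoint and the bearer token, and addresses a table as `<profile path>#<share>.<schema>.<table>`. The sketch below shows the shape of both, assuming the open-source `delta-sharing` Python connector for the final load step. The endpoint, token, share, schema and table names are placeholders, and in practice the profile file is supplied by the data provider rather than written by hand.

```python
import json
import tempfile
from pathlib import Path


def write_profile(directory: str, endpoint: str, bearer_token: str) -> Path:
    """Write a Delta Sharing profile file and return its path."""
    profile = {
        "shareCredentialsVersion": 1,
        "endpoint": endpoint,
        "bearerToken": bearer_token,
    }
    path = Path(directory) / "config.share"
    path.write_text(json.dumps(profile, indent=2))
    return path


def table_url(profile_path: Path, share: str, schema: str, table: str) -> str:
    """Build the table address understood by the Delta Sharing connectors."""
    return f"{profile_path}#{share}.{schema}.{table}"


def load_shared_table(url: str):
    """Load a shared table into pandas (needs `pip install delta-sharing`)."""
    import delta_sharing  # imported lazily so the rest runs without it

    return delta_sharing.load_as_pandas(url)


# Placeholder values, for illustration only.
profile_path = write_profile(
    tempfile.mkdtemp(),
    endpoint="https://sharing.example.com/delta-sharing/",
    bearer_token="<token issued by the provider>",
)
url = table_url(profile_path, "sales_share", "retail", "orders")
print(url)
```

With a real profile file, `load_shared_table(url)` would fetch the table over the REST protocol without the recipient needing a Databricks workspace.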
Figure 2: Procedure of Delta Sharing between Data provider and Recipient
Databricks Marketplace
Figure 3: Overview of the Databricks Marketplace
The Databricks Marketplace entered Public Preview in April 2023. It builds on the capabilities of Delta Sharing to provide an open marketplace for sharing products such as datasets, notebooks, dashboards and machine learning models. Users can easily search for data products, free or paid, offered by a wide range of companies in different domains. Data providers can share a notebook and dashboard explaining the dataset being shared, so the data consumer can get a better understanding of the data before requesting it. Once data is requested from the Databricks Marketplace, users do not need to be on the Databricks platform to access the data products, and no data is replicated at any point in the sharing process. The costs associated with Delta Sharing primarily involve the setup and hosting of the storage infrastructure for data retention, as well as any data transfers performed by the data recipient.
Databricks has an ever-growing clientele of over 7,000 organizations. As adoption of the Databricks Marketplace grows, this expanding client base amplifies the network effect within the global data sharing ecosystem. A larger number of active users creates a more reliable environment in which companies and individual users can use the marketplace to share and collaborate on data-driven insights.
Benefits of Delta Sharing
Ever-increasing data volumes, the digitization of companies and the need for collaboration make data sharing a necessity for every data-centric organization. Delta Sharing offers an open, cross-platform approach with no vendor lock-in, and reduces the time and resources spent on sharing data. With data being the gold rush of the modern era, companies in domains such as telecoms or retail can leverage their large amounts of data by monetizing it, along with other data products, on the Databricks Marketplace. To ensure the security and integrity of these processes, Delta Sharing integrates with Unity Catalog and offers granular control over sharing as well as auditing capabilities. With these advantages, organizations can embrace data sharing to accelerate innovation and drive actionable insights for informed decision-making.
Part 2
In the second part, we will walk through a demo of Delta Sharing on Databricks and show these data sharing capabilities in practice. Stay tuned!