Delta-Sharing, the difficulties are Blaring!

Databricks >> Delta-Lake >> Delta-Sharing: Setup & Review

Richie Bachala
5 min read · Sep 12, 2021

What is Databricks?

Databricks is a data management and processing platform from the creators of Apache Spark; the product shares its name with the company. It is built on top of the Apache Spark engine and provides a unified processing interface across platforms.

Delta Sharing Conceptually

Delta Sharing as a service allows data to be shared across environments without requiring the sharer and the recipient to be on the same data platform or cloud.

Delta Sharing is an open protocol for the secure, real-time exchange of large datasets, designed to scale out because clients read the underlying data directly from cloud storage. Its goals are as follows:

Share live data directly, without replication. Providers can rely on Delta Lake's ACID transactions so that recipients always see a consistent view, while the protocol provides strong security for massive datasets.

While a Delta Lake table is ultimately a collection of Parquet files, a Delta-Sharing Provider decides what data to share and runs a sharing server that manages access for recipients. A Delta-Sharing Recipient runs a client that supports the protocol, with connectors for pandas, Apache Spark, Rust, and Python.
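
To make the recipient side concrete, here is a minimal sketch using the open-source delta-sharing Python connector; the profile path, share, schema, and table names below are placeholder assumptions:

import delta_sharing

# Profile file downloaded from the data provider (path is a placeholder)
profile = "/dbfs/FileStore/shares/my_provider.share"

# Table URL format: <profile-file>#<share>.<schema>.<table>
table_url = profile + "#my_share.default.my_table"

# Load the shared table into a pandas DataFrame over the open protocol
df = delta_sharing.load_as_pandas(table_url)
print(df.head())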

Delta-Share Prerequisites & the Setup

The limited documentation on GitHub suggests that, to create a working scenario, you must run a Delta-Sharing Server to become a Data Provider and a Delta Sharing Client to become a Data Recipient, which requires:

1. enabling the Web Terminal on the Spark driver node

2. installing the Delta Sharing server package

3. installing the Delta Sharing client library, preferably at the cluster level so that anyone attaching to the cluster can access Delta-Sharing.

1. Web Terminal

The Web Terminal is used for monitoring resource usage and installing Linux packages. To run the web terminal server, go to the cluster detail page, click Apps, and then click the “Launch Web Terminal” button. This opens a new browser tab with the terminal server command line.

2. Install Delta-Sharing Server

Delta Sharing requires an additional “sharing server” to be deployed, which manages sharing permissions; the goal of this step is to run the Delta-Share Spark v:12.0.1.0 server package. Support for debugging this is extremely limited in the Databricks documentation, so a rough sketch of the moving parts follows.
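
Here is a sketch of standing up the reference sharing server, based on the delta-io/delta-sharing GitHub docs; the share, schema, and table names and the S3 location are placeholders, and the pre-built delta-sharing-server package is assumed to be downloaded and unpacked already:

import subprocess

# Minimal server config per the delta-sharing GitHub docs;
# names and the S3 location are placeholders
config = """\
version: 1
shares:
- name: "demo_share"
  schemas:
  - name: "default"
    tables:
    - name: "demo_table"
      location: "s3a://<bucket>/<path-to-delta-table>"
host: "localhost"
port: 8080
endpoint: "/delta-sharing"
"""

with open("delta-sharing-server-config.yaml", "w") as f:
    f.write(config)

# Launch the pre-built server; the "--" separator before "--config"
# follows the invocation documented in the delta-sharing repo
subprocess.run(["./delta-sharing-server/bin/delta-sharing-server",
                "--", "--config", "delta-sharing-server-config.yaml"])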

3. Install Delta-Sharing Client

We will do this in 3 sub-steps:

3.1. New Library:

Using Cluster Libraries, add a new library via the “Install New” button:

3.2. Delta-Sharing-Server:

Under Search Packages, search for “delta-sharing-server”:

3.3. Install Python library

The Python library is now installed at the compute cluster level.
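
As a quick sanity check, a minimal sketch (assuming the PyPI package name delta-sharing) to confirm the library resolves from a notebook cell:

import importlib.metadata

# Confirm the delta-sharing client library is visible to the cluster
print(importlib.metadata.version("delta-sharing"))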

The cluster-size sweet spot for the benchmark can be determined by executing two or three selected queries across various cluster sizes.

Types of clusters we tested: D4, D8, E8, E16

Delta-Share Commands

Provider SQL

Create Delta Share: create share <DeltaShare>;

Describe Delta Share: describe share <DeltaShare>;

Assign Table to Delta Share: alter share <DeltaShare> add table <schema>.<table>;

Create Recipient: create recipient <Recipient>;

Grant Select to Recipient: grant select on share <DeltaShare> to recipient <Recipient>;
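
Put together, the provider-side flow can be scripted from a Python notebook cell; this is a sketch assuming a Databricks notebook where spark is predefined, and the share, recipient, and table names are placeholders:

# Create a share and expose one table through it (names are placeholders)
spark.sql("CREATE SHARE sales_share")
spark.sql("ALTER SHARE sales_share ADD TABLE retail.transactions")

# Create a recipient and grant it read access to the share
spark.sql("CREATE RECIPIENT analytics_partner")
spark.sql("GRANT SELECT ON SHARE sales_share TO RECIPIENT analytics_partner")

# Verify what the share exposes
spark.sql("DESCRIBE SHARE sales_share").show()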

Recipient Python: Authentication Setup for the Public Delta Share Example

Note the limited access control: a single bearer token is used for all shares in a profile.

Running the Delta Share Example

Import the library, set up authentication, and list the Delta Share tables:
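
A minimal sketch, assuming the delta-sharing pip package and the public example profile from the delta-io/delta-sharing GitHub repo (the profile URL and table coordinates come from that repo's examples and may change):

import urllib.request
import delta_sharing

# Fetch the public example profile (endpoint + bearer token) from the
# delta-io/delta-sharing repo examples
profile_url = ("https://raw.githubusercontent.com/delta-io/"
               "delta-sharing/main/examples/open-datasets.share")
urllib.request.urlretrieve(profile_url, "open-datasets.share")

# List every table visible through the shares in this profile
client = delta_sharing.SharingClient("open-datasets.share")
for table in client.list_all_tables():
    print(table)

# Load one shared table into pandas;
# table URL format: <profile-file>#<share>.<schema>.<table>
df = delta_sharing.load_as_pandas(
    "open-datasets.share#delta_sharing.default.owid-covid-data")
print(df.head())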

Another limitation: access to data can’t be revoked once it has been shared.

This was difficult to get working; if anyone reading has had success on your initial attempts, please leave a comment below.

Issues with Hierarchical SQL Queries

Queries against large tables are not performant without optimizations, and optimizing large tables takes considerable time; this calls for planning ahead based on the expected workload. Overall, I was really not impressed with the performance.
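
“Optimizations” here means Delta Lake table maintenance such as file compaction and Z-ordering; a sketch of the kind of commands involved, assuming a Databricks notebook where spark is predefined (table and column names are placeholders):

# Compact small files and co-locate rows on a common filter column
spark.sql("OPTIMIZE retail.transactions ZORDER BY (customer_id)")

# Refresh table statistics so the optimizer can plan large scans
spark.sql("ANALYZE TABLE retail.transactions COMPUTE STATISTICS")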

The difficulties —

Although the Delta-Share concepts sound great in theory, the documentation lacks a detailed step-by-step walkthrough for successfully setting up the Data Provider (server) side.

SQL syntax beyond standard DDL and DML statements has issues, especially when hierarchical queries and relationships don't get registered (see the sketch below). It takes time to ingest even small datasets via long-running Spark jobs.
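
For illustration, here is the kind of hierarchical (recursive) query that fails: Spark SQL, and therefore Databricks at the time of writing, does not support recursive common table expressions. Table and column names are placeholders:

# A recursive CTE walking an org hierarchy; standard SQL engines accept
# this, but Spark SQL rejects WITH RECURSIVE with a parse error
spark.sql("""
WITH RECURSIVE org AS (
    SELECT id, manager_id, name FROM employees WHERE manager_id IS NULL
    UNION ALL
    SELECT e.id, e.manager_id, e.name
    FROM employees e JOIN org o ON e.manager_id = o.id
)
SELECT * FROM org
""")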

It is difficult to find the cluster-size sweet spot.

Data Marketplace: it is unclear how customers could discover and source data directly from providers, as there is no marketplace for such access; maybe it is just hypothetical at the moment. Also, the Delta Sharing community is non-existent. Is anyone out there?

Delta Sharing gives its customers the ability to share data across regions, but cross-cloud sharing is not currently available (AWS only today).

In my opinion, Delta Sharing as a product is not yet ready for enterprise consumption; I look forward to new updates and will follow the stack as it matures.

Summary

It's great to see this space get competitive, with new players like Databricks trying to join the ranks of seasoned players like Snowflake, Redshift & Google, some more mature than others.

The sharing mechanism in data platforms is vastly complex: security, performance, and efficiency become important factors for liability and for enterprise-level adoption (governance and revocable access are extremely important).

As one of the leaders in the data engineering community, I feel it's important to stay invested and follow the rise of data marketplaces for sharing data, as someday we will be trading data futures and data as a commodity. I am looking forward to the day when that becomes the norm.

Thank you for reading!


Richie Bachala

Distributed SQL, Data Engineering Leader @ Yugabyte | past @ Sherwin-Williams, Hitachi, Oracle