Row and Column-Level Security With Databricks
Co-written with Himanshu Arora, Data Solution Architect at Databricks
As discussed in our post on the right to be forgotten (only available in French) European regulations impose strict rules on data governance. That said, categorizing your data by criticality is crucial for determining who has access to it. This is why data governance is a major, complex, and multifaceted challenge for companies. Data access is part of this governance.
Relational databases have been providing ways of controlling and limiting this access for years.
Using data stored in a data lake is more recent. However, the issue remains the same regardless of where the data is stored.
Using Databricks allows you to set up row-level or column-level security.
How to Restrict Column and Row Access with Azure Databricks
The principle is the same for row-level or column-level security.
You need to define user groups, create views, and provide access to those views based on what the users have permission to see.
You can achieve row-level security by placing a specific clause for a given view.
In this example, we’ll be working on a sales table containing sales made by two sales regions: East (Est) and West (Ouest).
The sales representatives should only have access to the sales in their region: so a sales rep from the “West” region should only see rows that include the “RegionCommerciale” column with the value “Ouest” and vice versa.
Sales managers should have access to the whole table.
This means we need to create three groups:
- The “Est” group for sales representatives from the East region
- The “Ouest” group for sales representatives from the West region
- The “Direction” group for the sales managers
To create the link between the user’s group and the values in the “RegionCommerciale” column, we use the IS_MEMBER function to determine which rows the user can access.
How to Create Groups
Go to the Admin Console to create groups:
Click “Create Group”:
Give the group a name. We’ll call this group “Est”:
Once you have created the group, you need to add its users:
My user has been added to the “Est” group.
The command below tests whether or not the current user belongs to the “Est” group.
Selecting the “Ventes” (Sales) table with a condition that uses the “is_member” function on the values in the “RegionCommerciale” column only returns the three rows for the “Est” sales region.
The next step is to create a view using that query:
A quick check shows that this view does show the rows for the user’s region:
A user in the “Ouest” group would see the following:
The “Direction” group has access to the whole table. To grant access to the table to that group only, we need to enable table access control.
There are three requirements to use this feature:
- You must have a Premium plan workspace
- The user(s) must not have cluster creation access
- You must use High Concurrency clusters
Once this is done, you can enable the feature in the workspace:
Then at cluster level:
Finally, grant access to view the table to the “Direction” group.
When you need to use groups, the easiest method is utilizing the SCIM functionality, which allows you to manage AD groups.
You can achieve column-level security by selecting and positioning a condition in the field(s) that use the IS_MEMBER function in “SELECT” and not in the “WHERE” clause.
See the sample table below:
Creating the view applies a condition to display the “Montant” (Amount) column. So, if the user belongs to the “RegionCommerciale” column, the value in the “Montant” column is displayed. Otherwise, the value “Not authorized” is shown instead.
Here is the result display for a user in the “Est” group:
We have looked at how you can place rules for viewing data.
However, if different workspaces use the same data, you need to replicate the same configuration across all of those workspaces.
Unity Catalog lets you define rules to be applied to all workspaces through a shared catalog.
⚠️Note: Unity Catalog is still in beta. The examples shown below are subject to change.
In addition to managing data access, Unity Catalog can also control access to notebooks, models, etc.
Since the catalog is managed within your Databricks account, you are not tied to a particular cloud provider.
Unity Catalog tries to create a link to the existing catalog ecosystem. That ecosystem is changing all the time.
You can manage the catalog via a dedicated Unity Catalog graphical user interface.
Any access restrictions are created using attributes. First, you need to create an attribute. In the example below, the attribute is named “pii” (“personal identifiable information”):
CREATE ATTRIBUTE pii
Mark the relevant columns with the attribute. The iot_events table with the “email” column and the users table with the “phone” column are relevant:
ALTER TABLE iot_events ADD ATTRIBUTE pii ON email
ALTER TABLE users ADD ATTRIBUTE pii ON phone
Finally, you can restrict access to the columns marked with this attribute to a user group:
GRANT SELECT ON DATABASE iot_data
HAVING ATTRIBUTE NOT IN (pii)
Unity Catalog also provides audit logs detailing access to the columns linked to an attribute.
Easy, Centralized Management
Relational databases have been providing features to restrict data access for a long time. Although newer than relational databases, tools that use data from a data lake are starting to have similar governance features.
Unity Catalog makes it easier to manage all that while centralizing management.