Home > Row and Column-Level Security With Databricks
Donatien Tessier
12 October 2021
Lire cet article en Français

Row and Column-Level Security With Databricks

Co-written with Himanshu Arora, Data Solution Architect at Databricks

As discussed in our post on the right to be forgotten (only available in French) European regulations impose strict rules on data governance. That said, categorizing your data by criticality is crucial for determining who has access to it. This is why data governance is a major, complex, and multifaceted challenge for companies. Data access is part of this governance.

Relational databases have been providing ways of controlling and limiting this access for years.

Using data stored in a data lake is more recent. However, the issue remains the same regardless of where the data is stored.

Using Databricks allows you to set up row-level or column-level security.

 

How to Restrict Column and Row Access with Azure Databricks

The principle is the same for row-level or column-level security.

You need to define user groups, create views, and provide access to those views based on what the users have permission to see.

 

Row-Level Security

You can achieve row-level security by placing a specific clause for a given view.

In this example, we’ll be working on a sales table containing sales made by two sales regions: East (Est) and West (Ouest).

 

 

 

The sales representatives should only have access to the sales in their region: so a sales rep from the “West” region should only see rows that include the “RegionCommerciale” column with the value “Ouest” and vice versa.

Sales managers should have access to the whole table.

This means we need to create three groups:

  1. The “Est” group for sales representatives from the East region
  2. The “Ouest” group for sales representatives from the West region
  3. The “Direction” group for the sales managers

To create the link between the user’s group and the values in the “RegionCommerciale” column, we use the IS_MEMBER function to determine which rows the user can access.

 

How to Create Groups

Go to the Admin Console to create groups:

 

Admin console Databricks

 

 

Click “Create Group”:

 

créer un groupe sur l'admin console databricks

 

Give the group a name. We’ll call this group “Est”:

 

création group Databricks console

Once you have created the group, you need to add its users:

 

Affectation utilisateurs au groupe Databricks console

 

My user has been added to the “Est” group.

The command below tests whether or not the current user belongs to the “Est” group.

 

Commande databricks

 

Selecting the “Ventes” (Sales) table with a condition that uses the “is_member” function on the values in the “RegionCommerciale” column only returns the three rows for the “Est” sales region.

 

Databricks affichage lignes

 

The next step is to create a view using that query:

 

Databricks création de vues

 

A quick check shows that this view does show the rows for the user’s region:

 

Databricks consultation lignes utilisateur

A user in the “Ouest” group would see the following:

 

Affichage Databricks colonnnes lignes

 

The “Direction” group has access to the whole table. To grant access to the table to that group only, we need to enable table access control.

There are three requirements to use this feature:

  1. You must have a Premium plan workspace
  2. The user(s) must not have cluster creation access
  3. You must use High Concurrency clusters

 

Databricks Cluster mode

 

Once this is done, you can enable the feature in the workspace:

 

Databricks activation fonctionnalité dans le workspace

 

 

Then at cluster level:

 

Databricks table access control

Finally, grant access to view the table to the “Direction” group.

Accès en consultation table Databricks

In this post, we created the groups and assigned users manually. You can, of course, automate these actions, for example, by using the Terraform Databricks provider or via the Databricks CLI.

When you need to use groups, the easiest method is utilizing the SCIM functionality, which allows you to manage AD groups.

 

Column-Level Security

You can achieve column-level security by selecting and positioning a condition in the field(s) that use the IS_MEMBER function in “SELECT” and not in the “WHERE” clause.

See the sample table below:

 

Exemple table à colonnes Databricks

 

Creating the view applies a condition to display the “Montant” (Amount) column. So, if the user belongs to the “RegionCommerciale” column, the value in the “Montant” column is displayed. Otherwise, the value “Not authorized” is shown instead.

 

Affichage colonne Databricks

 

Here is the result display for a user in the “Est” group:

 

Databricks affichage colonne table

 

Unity Catalog

We have looked at how you can place rules for viewing data.

However, if different workspaces use the same data, you need to replicate the same configuration across all of those workspaces.

Unity Catalog lets you define rules to be applied to all workspaces through a shared catalog.

 

⚠️Note: Unity Catalog is still in beta. The examples shown below are subject to change.

 

In addition to managing data access, Unity Catalog can also control access to notebooks, models, etc.

Since the catalog is managed within your Databricks account, you are not tied to a particular cloud provider.

Unity Catalog tries to create a link to the existing catalog ecosystem. That ecosystem is changing all the time.

You can manage the catalog via a dedicated Unity Catalog graphical user interface.

Any access restrictions are created using attributes. First, you need to create an attribute. In the example below, the attribute is named “pii” (“personal identifiable information”):

CREATE ATTRIBUTE pii
 

Mark the relevant columns with the attribute. The iot_events table with the “email” column and the users table with the “phone” column are relevant:

ALTER TABLE iot_events ADD ATTRIBUTE pii ON email
ALTER TABLE users ADD ATTRIBUTE pii ON phone
...

 

Finally, you can restrict access to the columns marked with this attribute to a user group:

GRANT SELECT ON DATABASE iot_data
HAVING ATTRIBUTE NOT IN (pii)
TO product_managers

Unity Catalog also provides audit logs detailing access to the columns linked to an attribute.

 

Unity catalog logs d'audit Databricks

 

Easy, Centralized Management

 

Relational databases have been providing features to restrict data access for a long time. Although newer than relational databases, tools that use data from a data lake are starting to have similar governance features.

Unity Catalog makes it easier to manage all that while centralizing management.

 

 

 

 

 

This posts should interest you
Data Observability
Post co-written by Donatien Tessier and Amine Kaabachi   To quote an old advertising slogan: “Power is nothing without control!”...
Comments
Leave a Reply

Your email address will not be published.

Receive the best of Cloud, DevOps and IT news.
Receive the best of Cloud, DevOps and IT news.