How to run dbt on Databricks SQL Warehouse

Jorge Machado
4 min read · Feb 15, 2024


How can you seamlessly execute your dbt project on Databricks, ensuring a cost-effective and efficient data processing solution with familiar SQL syntax? In this episode, we dive into our solution for Azure and Databricks.

Solution

Prerequisites

  • Azure Storage Account holding your data in one of the common file formats (CSV, JSON, Avro, Parquet, Delta, …)
  • Service principal (SP) with a client_id and client_secret, serving as a technical user for access control
  • Databricks Workspace with sufficient permissions (workspace admin)
  • Databricks CLI and the dbt-databricks adapter installed locally (install commands below)
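
If the CLI and the adapter are not installed yet, both are available from PyPI. A minimal sketch, assuming you use the pip-based (legacy) Databricks CLI, which matches the command syntax used later in this article:

pip install databricks-cli   # legacy Databricks CLI (matches the --scope/--key syntax used below)
pip install dbt-databricks   # pulls in dbt-core together with the Databricks adapter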

Steps

In the following steps, the SP is configured to execute the workload, as this eases a later automated deployment compared to running everything under a personal user account.

Grant Storage Access

Go to the storage account holding your data and assign the SP the role Storage Blob Data Contributor on this resource. This allows the service principal to perform read and write operations on your data.
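
If you prefer scripting over the Azure portal, the same role assignment can be done with the Azure CLI. A minimal sketch; <client_id>, <subscription-id>, <rg> and <storage_account> are placeholders you have to fill in:

az role assignment create \
  --assignee <client_id> \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<storage_account>"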

Add Service Principal to Databricks Workspace

Next, the SP has to be added to the Databricks workspace so that it is authorized to use compute resources there. Go to the admin console within your Databricks workspace -> “Identity and Access” -> “Manage Service Principals” and add your designated SP with the entitlements “Workspace access” and “Databricks SQL access”.
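
This step can also be scripted against the workspace SCIM API instead of clicking through the admin console. The sketch below assumes the standard ServicePrincipals endpoint and the entitlement values workspace-access and databricks-sql-access; <workspace-url>, <client_id> and the admin token are placeholders:

curl -X POST https://<workspace-url>/api/2.0/preview/scim/v2/ServicePrincipals \
  -H "Authorization: Bearer <admin-token>" \
  -H "Content-Type: application/json" \
  -d '{
        "schemas": ["urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal"],
        "applicationId": "<client_id>",
        "displayName": "dbt-sp",
        "entitlements": [
          {"value": "workspace-access"},
          {"value": "databricks-sql-access"}
        ]
      }'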

Creating SQL Warehouse

Go to SQL Warehouses under SQL in the left menu bar and create a new warehouse. As the warehouse type, we suggest serverless, as it automatically starts and stops your compute as needed and provides you with a cost-effective SQL engine.
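
If you want to create the warehouse programmatically, the SQL Warehouses REST API can be used. This is a rough sketch, not a definitive payload; the name, size and auto-stop value are placeholders, and serverless requires the PRO warehouse type:

curl -X POST https://<workspace-url>/api/2.0/sql/warehouses \
  -H "Authorization: Bearer <admin-token>" \
  -H "Content-Type: application/json" \
  -d '{
        "name": "dbt-serverless-wh",
        "cluster_size": "2X-Small",
        "max_num_clusters": 1,
        "auto_stop_mins": 1,
        "enable_serverless_compute": true,
        "warehouse_type": "PRO"
      }'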

Databricks Secret Scope

We need the SP’s credentials for the storage account access configuration. That is why we store the client_secret safely in a Databricks secret scope. This can be done via the CLI by running the following commands:

databricks configure --aad-token # login to workspace
databricks secrets create-scope --scope <scope-name> --initial-manage-principal users
databricks secrets put --scope <scope-name> --key <key-name>

In the editor window that opens, insert the client_secret. It is then safely stored within the secret scope and can later be referenced in the storage access configuration.

SQL Warehouse Storage Access Configuration

Access from SQL warehouses to the storage account is configured in the admin console. Go to Compute -> “Manage SQL Warehouses” and, under “Data Access Configuration”, click “Add Service Principal”. Add your configuration.
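
The configuration generated by “Add Service Principal” boils down to the standard ABFS OAuth properties, with the client_secret referenced from the secret scope. A sketch of what the entries typically look like; verify the exact keys against what the dialog generates for your workspace, and replace <tenant-id> with your Azure AD tenant:

spark.hadoop.fs.azure.account.auth.type.<storage_account>.dfs.core.windows.net OAuth
spark.hadoop.fs.azure.account.oauth.provider.type.<storage_account>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.id.<storage_account>.dfs.core.windows.net <client_id>
spark.hadoop.fs.azure.account.oauth2.client.secret.<storage_account>.dfs.core.windows.net {{secrets/<scope-name>/<key-name>}}
spark.hadoop.fs.azure.account.oauth2.client.endpoint.<storage_account>.dfs.core.windows.net https://login.microsoftonline.com/<tenant-id>/oauth2/token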

Source Table Creation

For each source table referenced in your dbt project, we need to create a source table and schema within the Databricks SQL warehouse. Each source table is linked to the data within the storage account.

Under SQL in the left menu bar, go to SQL Editor and select your warehouse as the compute engine. It should start as soon as you issue the first command. Create and select the catalog and schema as you wish to organize your data.

Your SQL script depends on your source table configuration as well as on the schema and catalog names. An example:

USE CATALOG hive_metastore;
CREATE DATABASE IF NOT EXISTS <db_name>;
USE DATABASE <db_name>;
CREATE TABLE IF NOT EXISTS <db_name>.<table_name>
USING JSON
OPTIONS (inferSchema "true")
LOCATION "abfss://<storage_container>@<storage_account>.dfs.core.windows.net/<dir>/";

This table creation also verifies that the SQL warehouse has permission to access the storage account.

Configure dbt Profile

As a final configuration step, we configure a dbt connection profile for the Databricks SQL warehouse. For token-based authentication, you can go to your SQL warehouse -> Connection Details -> dbt. There you can copy a snippet for your profiles.yml and generate a personal access token for authentication. The dbt commands are then issued on behalf of your logged-in user.

A better alternative is to run the dbt commands under the SP’s credentials as well. For that, the profiles.yml would look as follows:

datamesh_datalake_demo_profile:
  target: dev
  outputs:
    dev:
      type: databricks
      # catalog: # if using Unity Catalog, you have to enter the according catalog here
      schema: <db_name>
      host: <databricks-workspace-url> # without https://
      http_path: <your-warehouse-http-path>
      auth_type: oauth
      client_id: "{{ env_var('DATABRICKS_CLIENT_ID') }}"
      client_secret: "{{ env_var('DATABRICKS_CLIENT_SECRET') }}"

When issuing dbt commands with this profile, dbt expects the environment variables DATABRICKS_CLIENT_ID and DATABRICKS_CLIENT_SECRET to be set. These are the SP credentials. Store this profile within your project and run the following to test your connection:

dbt debug --profiles-dir=<path-to-profile>
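
For a quick local test, the two variables can simply be exported in the shell before calling dbt; the values are placeholders for your SP’s credentials, and in a real setup you would read them from a secret store rather than typing them in plain text:

export DATABRICKS_CLIENT_ID=<client_id>
export DATABRICKS_CLIENT_SECRET=<client_secret>
dbt debug --profiles-dir=<path-to-profile>   # verifies the connection
dbt run --profiles-dir=<path-to-profile>     # executes your models on the SQL warehouse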

Now everything is configured to use a Databricks SQL warehouse for your dbt project. When the warehouse is serverless, it starts automatically when not already running and shuts down after the configured idle period (which can be set as low as 1 min). This provides you with a cost-effective solution for your dbt data transformations.

Future Work

This setup can be used to trigger dbt runs from your local machine. In deployment, this task likely needs to be executed periodically or based on triggers; Databricks Workflows can be utilized for that. We will present this to you in a future episode.

Co-Author: strasserpatrick
