How to run dbt on Databricks SQL Warehouse
How can you seamlessly execute your dbt project on Databricks, getting a cost-effective and efficient data processing solution with familiar SQL syntax? In this episode, we dive into our solution for Azure and Databricks.
Solution
Prerequisites
- Azure Storage Account with your data in one of the most common file formats (csv, json, avro, parquet, delta, …)
- Service principal (SP) with a client_id and client_secret, providing a technical user for access control
- Databricks Workspace with sufficient permissions (workspace admin)
- Installed Databricks CLI and dbt-databricks adapter (see the install sketch below)
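If the CLI and the adapter are not installed yet, a minimal setup could look like this (a sketch assuming Python and pip are available; databricks-cli is the legacy CLI whose command syntax is used in the steps below):
pip install databricks-cli dbt-databricks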
Steps
In the following steps, the SP is configured to execute the workload, as this simplifies a later automated deployment.
Grant Storage Access
Go to the storage account holding your data and assign the SP the role Storage Blob Data Contributor on this resource. This allows the service principal to perform read and write operations on your data.
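Instead of the portal, the role assignment can also be scripted with the Azure CLI. The sketch below assumes the Azure CLI is installed and you are logged in; subscription, resource group, and account names are placeholders:
az role assignment create \
  --assignee <sp-client-id> \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"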
Add Service Principal to Databricks Workspace
Next, the SP is added to the Databricks workspace so that it is authorized to use compute resources there. For that, go to the admin console within your Databricks workspace -> "Identity and Access" -> Manage Service Principals and add your designated SP with the entitlements "Workspace Access" and "Databricks SQL Access".
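If you prefer to script this step, the same assignment can be done via the workspace SCIM API. The following curl call is a hedged sketch; the workspace URL, admin token, and display name are placeholders, and you should double-check the payload against the current API documentation:
curl -X POST https://<databricks-workspace-url>/api/2.0/preview/scim/v2/ServicePrincipals \
  -H "Authorization: Bearer <admin-access-token>" \
  -H "Content-Type: application/scim+json" \
  -d '{
        "schemas": ["urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal"],
        "applicationId": "<sp-client-id>",
        "displayName": "dbt-sp",
        "entitlements": [
          {"value": "workspace-access"},
          {"value": "databricks-sql-access"}
        ]
      }'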
Creating SQL Warehouse
Go to SQL Warehouses under SQL in the left menu bar and create a new warehouse. As type, we suggest serverless, as it automatically starts and stops your compute power as needed and provides you with a cost-effective SQL engine.
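The warehouse can also be created programmatically via the SQL Warehouses REST API. This is only a sketch with placeholder values; verify the field names against your workspace's API version before using it:
curl -X POST https://<databricks-workspace-url>/api/2.0/sql/warehouses \
  -H "Authorization: Bearer <admin-access-token>" \
  -d '{
        "name": "dbt-warehouse",
        "cluster_size": "2X-Small",
        "warehouse_type": "PRO",
        "enable_serverless_compute": true,
        "auto_stop_mins": 1
      }'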
Databricks Secret Scope
We need the SP's credentials for the storage access configuration. That is why we safely store its client_secret in a Databricks secret scope. This can be done via the CLI by running the following commands:
databricks configure --aad-token # login to workspace
databricks secrets create-scope --scope <scope-name> --initial-manage-principal users
databricks secrets put --scope <scope-name> --key <key-name>
In the editor window that opens, insert the client_secret. It is then safely stored within the secret scope and can be referenced in the storage access configuration.
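To verify that the secret was stored, you can list the keys in the scope (the secret value itself is never displayed):
databricks secrets list --scope <scope-name>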
SQL Warehouse Storage Access Configuration
Access from SQL Warehouses to the Blob Storage is configured in the admin console. Go to Compute -> Manage SQL Warehouses and, under Data Access Configuration, click "Add Service Principal". Then add your configuration.
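Behind the scenes, this boils down to a set of ABFS OAuth properties that reference the secret scope. The following is a sketch of what such a data access configuration typically looks like; storage account, tenant, scope, and key names are placeholders and should be checked against the current Databricks documentation:
fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net OAuth
fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net <sp-client-id>
fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net {{secrets/<scope-name>/<key-name>}}
fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net https://login.microsoftonline.com/<tenant-id>/oauth2/token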
Source Table Creation
For each source table referenced in your dbt project, we need to create a source table and schema within the Databricks SQL Warehouse. Each source table is linked to the data within the storage account.
Under SQL in the left menu, go to SQL Editor and select your SQL warehouse as the compute engine. It should start as soon as you issue the first command. Create and select the catalog and schema as you wish to organize your data.
Your SQL script depends on your source table configuration as well as your schema and catalog names. An example:
USE CATALOG hive_metastore;
CREATE DATABASE IF NOT EXISTS <db_name>;
USE DATABASE <db_name>;
CREATE TABLE IF NOT EXISTS <db_name>.<table_name>
USING JSON
OPTIONS (inferSchema "true")
LOCATION "abfss://<storage_container>@<storage_account>.dfs.core.windows.net/<dir>/";
This table creation also verifies the SQL Warehouse's permission to access the storage account.
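A quick additional check of both the table definition and the storage permissions is to query a few rows, for example:
SELECT * FROM <db_name>.<table_name> LIMIT 10;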
Configure Dbt profile
As a final configuration step, we configure a dbt connection profile for the Databricks SQL Warehouse. For token-based authentication, you can go to your SQL Warehouse -> Connection Details -> dbt. There you can copy the profiles.yml snippet as well as generate a personal access token for authentication. The dbt commands are then issued on behalf of your logged-in user.
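With token-based authentication, the copied profile would look roughly like this (a sketch; the profile name and placeholder values are assumptions):
datamesh_datalake_demo_profile:
  target: dev
  outputs:
    dev:
      type: databricks
      schema: <db_name>
      host: <databricks-workspace-url> # without https://
      http_path: <your-warehouse-http-path>
      token: <personal-access-token>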
A better alternative is to configure the dbt commands to also run under the SP's credentials. For that, the profiles.yml would look as follows:
datamesh_datalake_demo_profile:
  target: dev
  outputs:
    dev:
      type: databricks
      # catalog: # if using Unity Catalog, you have to enter the according catalog here
      schema: <db_name>
      host: <databricks-workspace-url> # without https://
      http_path: <your-warehouse-http-path>
      auth_type: oauth
      client_id: "{{ env_var('DATABRICKS_CLIENT_ID') }}"
      client_secret: "{{ env_var('DATABRICKS_CLIENT_SECRET') }}"
When issuing dbt commands with this profile, dbt expects the environment variables DATABRICKS_CLIENT_ID and DATABRICKS_CLIENT_SECRET to be set. These are the SP credentials. Store this profile within your project and run the following to test your connection:
dbt debug --profiles-dir=<path-to-profile>
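Since the profile reads the credentials from the environment, set them before running dbt, for example:
export DATABRICKS_CLIENT_ID=<sp-client-id>
export DATABRICKS_CLIENT_SECRET=<sp-client-secret>
dbt run --profiles-dir=<path-to-profile>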
Now everything is configured to use Databricks SQL Warehouses for your dbt project. When the warehouse is configured as serverless, it will automatically start when needed and auto-stop after the idle period (which can be configured as 1 min). This provides you with a cost-effective solution for your dbt data transformations.
Future Work
This setup can be used to trigger dbt runs from your local setup. In deployment, this task likely needs to be executed periodically or based on some triggers. Databricks Workflows can be utilized for that. We will present this to you in a future episode.
Co-Author: strasserpatrick