Echo Kanak

Databricks notes

Jul 31, 2025

These are some of my notes on concepts related to Databricks. Not everything is Databricks specific; some of it is general data engineering material I put together while working on Databricks projects. I might organize these better or expand them into more detailed posts on individual topics. For now this is a rough collection of notes I found useful.

Medallion Architecture

Bronze Layer – Raw Data

  • This layer stores raw, unprocessed data directly from the source. It acts as a backup and preserves the original format for traceability.
  • Example: A CSV dump of customer transactions ingested from an S3 bucket or API feed.

Silver Layer – Cleaned & Transformed Data

  • This layer contains cleaned, validated, and joined data, making it ready for analytical processing. It removes duplicates, applies consistent formats, and integrates data from multiple sources.
  • Example: A table with customer details joined with valid transaction data, cleaned for nulls and data types.

Gold Layer – Business Ready Data

  • This layer holds curated, aggregated, and business-specific datasets used in dashboards, reporting, or machine learning models. It’s optimized for end-user consumption and decision-making.
  • Example: A daily report showing total revenue per region or customer churn predictions by segment.
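Tying the three layers together, here is a minimal PySpark sketch of the flow, assuming it runs in a Databricks notebook (where spark is predefined) and using hypothetical catalog/schema names (main.bronze, main.silver, main.gold) and a transactions dataset with transaction_id, customer_id, region, transaction_date, and amount columns.

```python
from pyspark.sql import functions as F

# Bronze: land the raw CSV exactly as received (hypothetical volume path)
bronze = (spark.read.format("csv").option("header", "true")
          .load("/Volumes/main/bronze/raw_files/transactions/"))
bronze.write.mode("append").saveAsTable("main.bronze.transactions")

# Silver: fix types, drop duplicates, remove rows with null keys
silver = (spark.table("main.bronze.transactions")
          .withColumn("amount", F.col("amount").cast("double"))
          .dropDuplicates(["transaction_id"])
          .filter(F.col("customer_id").isNotNull()))
silver.write.mode("overwrite").saveAsTable("main.silver.transactions")

# Gold: business-ready aggregate, e.g. daily revenue per region
gold = (spark.table("main.silver.transactions")
        .groupBy("region", "transaction_date")
        .agg(F.sum("amount").alias("total_revenue")))
gold.write.mode("overwrite").saveAsTable("main.gold.daily_revenue_by_region")
```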

Unity Catalog

  • The Unity Catalog metastore is attached to the Databricks workspace.
  • We can't access the metastore without admin permission, which we can't get in a free workspace.
  • Start from the Unity Catalog: create a schema, create a volume, and create data folders (see the sketch below).
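A minimal sketch of creating those objects from a notebook (where spark and dbutils are already available); the catalog, schema, and volume names here are placeholders.

```python
# Hypothetical catalog/schema/volume names; adjust to your workspace
spark.sql("CREATE SCHEMA IF NOT EXISTS main.bronze")
spark.sql("CREATE VOLUME IF NOT EXISTS main.bronze.raw_files")

# Data folders live under the volume path
dbutils.fs.mkdirs("/Volumes/main/bronze/raw_files/orders/")
```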

Autoloader → incrementally read data

  • We receive files daily, so we only want to load the new data files.
  • We need idempotent behavior: process each file only once; if a file has already been processed we don't want to process it again.
  • Auto Loader creates a checkpoint that contains a RocksDB folder; it takes care of the idempotent behavior as well as the schema of the files.
  • The cached schema needs to be refreshed if there are changes in the schema.
  • If tomorrow the file schema changes, Auto Loader (in addNewColumns mode) adds the new column at the end of the table schema.
  • Create a checkpoint (we need to maintain a checkpoint location for Spark streaming). It is recommended to keep your checkpoint location inside the schema location because it is easier to manage and maintain in a production scenario.
  • For schema evolution mode, rescue is better than addNewColumns because it doesn't break the end table.
  • Specify trigger once=True so the stream stops automatically after processing the available files, which lets the job cluster terminate (see the sketch below).
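A minimal Auto Loader sketch reflecting the notes above; the source path, schema location, checkpoint location, and target table name are all placeholders.

```python
# Hypothetical paths and table name
source_path     = "/Volumes/main/bronze/raw_files/orders/"
schema_location = "/Volumes/main/bronze/raw_files/_schemas/orders/"
checkpoint_path = "/Volumes/main/bronze/raw_files/_checkpoints/orders/"  # kept next to the schema location

stream = (spark.readStream.format("cloudFiles")
          .option("cloudFiles.format", "csv")
          .option("cloudFiles.schemaLocation", schema_location)
          .option("cloudFiles.schemaEvolutionMode", "rescue")  # unexpected columns land in _rescued_data
          .load(source_path))

(stream.writeStream
       .option("checkpointLocation", checkpoint_path)  # RocksDB state here gives the idempotent behavior
       .trigger(once=True)                             # process what is available, then stop
       .toTable("main.bronze.orders"))
```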

DLT or Lakeflow Declarative Pipeline

We can do 3 things in DLT:

  1. Streaming tables
  2. Materialized views
  3. Streaming views

A benefit of Delta Live Tables is that we can reference a table even if it has not been created yet, because the pipeline resolves dependencies between tables at run time (see the sketch below).
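A small sketch of that behavior, assuming hypothetical table names: the first function reads from orders_bronze before it is defined, and the pipeline still works because DLT builds the dependency graph before executing anything.

```python
import dlt
from pyspark.sql import functions as F

# This table references "orders_bronze" even though it is defined further down;
# DLT resolves dependencies across the whole pipeline before running anything.
@dlt.table(name="orders_summary")
def orders_summary():
    return (dlt.read("orders_bronze")
            .groupBy("user_id")
            .agg(F.count("order_id").alias("order_count")))

@dlt.table(name="orders_bronze")
def orders_bronze():
    return spark.read.table("main.bronze.orders")  # hypothetical source table
```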

Expectations are rules we apply on top of the table

dlt.expect_all(rules), where rules is a dict of rules, e.g.:

rules = {
    "rule1" : "order_id IS NOT NULL",
    "rule2" : "user_id IS NOT NULL"
}

We can do 3 things when records violate the rules (see the sketch after this list):

  1. send warnings (default)
  2. fail the flow
  3. drop records: dlt.expect_all_or_drop(rules)
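A minimal sketch of a DLT table that applies the rules above and drops violating records; the table name and source path are placeholders.

```python
import dlt

rules = {
    "rule1": "order_id IS NOT NULL",
    "rule2": "user_id IS NOT NULL",
}

@dlt.table(name="orders_clean")        # hypothetical table name
@dlt.expect_all_or_drop(rules)         # drop rows that violate any of the rules
def orders_clean():
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/Volumes/main/bronze/raw_files/orders/"))  # hypothetical path
```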


In Databricks, AutoCDC refers to Automatic Change Data Capture, a feature that simplifies the process of tracking changes (inserts, updates, deletes) in source tables and applying them to target Delta tables without writing manual MERGE logic.

SCD1 and SCD2 refer to Slowly Changing Dimensions, a common data warehousing concept used to handle historical changes in dimension data (non-fact tables like customer, product, etc.).

SCD Type 1 - Overwrite (No History)

When a change happens, the old data is overwritten. No history is preserved.

Use case:

  • Fixing a typo in a name
  • Updating non-critical attributes

| CustomerID | Name  | City  |
| ---------- | ----- | ----- |
| 101        | Alice | Delhi |

→ Alice moves to Mumbai.

| CustomerID | Name  | City   |
| ---------- | ----- | ------ |
| 101        | Alice | Mumbai |

  • No history tracking
  • Simple

SCD Type 2 - Keep History

Every change creates a new row, and the old record is marked as inactive.

Use case:

  • Tracking history (e.g., address changes, job title changes)
  • Regulatory or audit needs

| CustomerID | Name  | City   | StartDate  | EndDate    | IsCurrent |
| ---------- | ----- | ------ | ---------- | ---------- | --------- |
| 101        | Alice | Delhi  | 2023-01-01 | 2024-05-15 | false     |
| 101        | Alice | Mumbai | 2024-05-16 | null       | true      |
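A minimal sketch of how the AutoCDC flow (dlt.apply_changes) can produce SCD1 or SCD2 tables in a DLT pipeline; the source table, keys, and sequencing column here are assumptions.

```python
import dlt

# Target streaming table that apply_changes will maintain
dlt.create_streaming_table("customers_scd2")

dlt.apply_changes(
    target="customers_scd2",
    source="customers_cdc_feed",    # hypothetical source with change events
    keys=["CustomerID"],            # business key identifying a customer
    sequence_by="event_timestamp",  # hypothetical column that orders the changes
    stored_as_scd_type=2,           # 1 = overwrite in place (no history), 2 = keep history rows
)
```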

Facts and Dimensions

Facts:

  • Tables that store measurable data—like quantities, sales, revenue, logins, etc.
  • Usually contain foreign keys to dimension tables and numeric measures.

Example: sales_fact table

| sale_id | customer_id | product_id | amount | date_id  |
| ------- | ----------- | ---------- | ------ | -------- |
| 1       | 101         | 501        | 100.50 | 20240101 |

Dimensions:

  • Tables that describe the who, what, where, when, how of the facts.
  • Often contain textual or categorical attributes.

Example: customer_dim table

| customer_id | name  | city   | join_date  |
| ----------- | ----- | ------ | ---------- |
| 101         | Alice | Mumbai | 2020-06-01 |

Facts vs Dimensions

| Feature     | Facts                     | Dimensions                       |
| ----------- | ------------------------- | -------------------------------- |
| Content     | Measures / metrics        | Descriptive attributes           |
| Granularity | Transaction-level         | Entity-level (customer, product) |
| Examples    | Sales, orders, payments   | Customer, product, region        |
| Key type    | Foreign keys (joins dims) | Primary keys                     |
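
To make the relationship concrete, a small sketch joining the fact table to the customer dimension to get revenue per city; the fully qualified table names are placeholders.

```python
from pyspark.sql import functions as F

sales = spark.table("main.gold.sales_fact")     # hypothetical table names
cust  = spark.table("main.gold.customer_dim")

# Fact rows carry the foreign key (customer_id) and the measure (amount);
# the dimension supplies the descriptive attribute (city) we group by.
revenue_by_city = (sales.join(cust, on="customer_id", how="left")
                   .groupBy("city")
                   .agg(F.sum("amount").alias("total_revenue")))

revenue_by_city.show()
```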