Echo Kanak

Databricks notes

Jul 31, 2025

These are some of my notes on concepts related to Databricks. Not everything is Databricks specific; some of it is general data engineering material I put together while working on Databricks projects. I might organize these better or expand them into more detailed posts on individual topics. For now this is a rough collection of notes I found useful.

Medallion Architecture

Bronze Layer – Raw Data

  • This layer stores raw, unprocessed data directly from the source. It acts as a backup and preserves the original format for traceability.
  • Example: A CSV dump of customer transactions ingested from an S3 bucket or API feed.

Silver Layer – Cleaned & Transformed Data

  • This layer contains cleaned, validated, and joined data, making it ready for analytical processing. It removes duplicates, applies consistent formats, and integrates data from multiple sources.
  • Example: A table with customer details joined with valid transaction data, cleaned for nulls and data types.

Gold Layer – Business Ready Data

  • This layer holds curated, aggregated, and business-specific datasets used in dashboards, reporting, or machine learning models. It’s optimized for end-user consumption and decision-making.
  • Example: A daily report showing total revenue per region or customer churn predictions by segment.
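Tying the three layers together, here is a minimal PySpark sketch of the flow, assuming it runs in a Databricks notebook (where spark is predefined) and using hypothetical catalog/schema names (main.bronze, main.silver, main.gold) and a transactions dataset with transaction_id, customer_id, region, transaction_date, and amount columns.

```python
from pyspark.sql import functions as F

# Bronze: land the raw CSV exactly as received (hypothetical volume path)
bronze = (spark.read.format("csv").option("header", "true")
          .load("/Volumes/main/bronze/raw_files/transactions/"))
bronze.write.mode("append").saveAsTable("main.bronze.transactions")

# Silver: fix types, drop duplicates, remove rows with null keys
silver = (spark.table("main.bronze.transactions")
          .withColumn("amount", F.col("amount").cast("double"))
          .dropDuplicates(["transaction_id"])
          .filter(F.col("customer_id").isNotNull()))
silver.write.mode("overwrite").saveAsTable("main.silver.transactions")

# Gold: business-ready aggregate, e.g. daily revenue per region
gold = (spark.table("main.silver.transactions")
        .groupBy("region", "transaction_date")
        .agg(F.sum("amount").alias("total_revenue")))
gold.write.mode("overwrite").saveAsTable("main.gold.daily_revenue_by_region")
```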

Unity Catalog

  • The Unity Catalog metastore is attached to the Databricks workspace.
  • We can't access the metastore without admin permission, which we can't get in a free workspace.
  • Start from the Unity Catalog: create a schema, create a volume, and create data folders (see the sketch below).
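A minimal sketch of creating those objects from a notebook (where spark and dbutils are already available); the catalog, schema, and volume names here are placeholders.

```python
# Hypothetical catalog/schema/volume names; adjust to your workspace
spark.sql("CREATE SCHEMA IF NOT EXISTS main.bronze")
spark.sql("CREATE VOLUME IF NOT EXISTS main.bronze.raw_files")

# Data folders live under the volume path
dbutils.fs.mkdirs("/Volumes/main/bronze/raw_files/orders/")
```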

Autoloader → incrementally read data

  • We receive files daily, so we only want to load the new data files.
  • We need idempotent behavior: process each file only once; if a file has already been processed we don't want to process it again.
  • Auto Loader creates a checkpoint that contains a RocksDB folder; it takes care of the idempotent behavior as well as the schema of the files.
  • The cached schema needs to be refreshed if there are changes in the schema.
  • If tomorrow the file schema changes, Auto Loader (in addNewColumns mode) adds the new column at the end of the table schema.
  • Create a checkpoint (we need to maintain a checkpoint location for Spark streaming). It is recommended to keep your checkpoint location inside the schema location because it is easier to manage and maintain in a production scenario.
  • For schema evolution mode, rescue is better than addNewColumns because it doesn't break the end table.
  • Specify trigger once=True so the stream stops automatically after processing the available files, which lets the job cluster terminate (see the sketch below).
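A minimal Auto Loader sketch reflecting the notes above; the source path, schema location, checkpoint location, and target table name are all placeholders.

```python
# Hypothetical paths and table name
source_path     = "/Volumes/main/bronze/raw_files/orders/"
schema_location = "/Volumes/main/bronze/raw_files/_schemas/orders/"
checkpoint_path = "/Volumes/main/bronze/raw_files/_checkpoints/orders/"  # kept next to the schema location

stream = (spark.readStream.format("cloudFiles")
          .option("cloudFiles.format", "csv")
          .option("cloudFiles.schemaLocation", schema_location)
          .option("cloudFiles.schemaEvolutionMode", "rescue")  # unexpected columns land in _rescued_data
          .load(source_path))

(stream.writeStream
       .option("checkpointLocation", checkpoint_path)  # RocksDB state here gives the idempotent behavior
       .trigger(once=True)                             # process what is available, then stop
       .toTable("main.bronze.orders"))
```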

DLT or Lakeflow Declarative Pipeline

We can do 3 things in DLT:

  1. Streaming tables
  2. Materialized views
  3. Streaming views

A benefit of Delta Live Tables is that we can reference a table even if it has not been created yet, because the pipeline resolves dependencies between tables at run time (see the sketch below).
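A small sketch of that behavior, assuming hypothetical table names: the first function reads from orders_bronze before it is defined, and the pipeline still works because DLT builds the dependency graph before executing anything.

```python
import dlt
from pyspark.sql import functions as F

# This table references "orders_bronze" even though it is defined further down;
# DLT resolves dependencies across the whole pipeline before running anything.
@dlt.table(name="orders_summary")
def orders_summary():
    return (dlt.read("orders_bronze")
            .groupBy("user_id")
            .agg(F.count("order_id").alias("order_count")))

@dlt.table(name="orders_bronze")
def orders_bronze():
    return spark.read.table("main.bronze.orders")  # hypothetical source table
```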

Expectations are rules we apply on top of the table

dlt.expect_all(rules), where rules is a dict of rules, e.g.:

rules = {
    "rule1" : "order_id IS NOT NULL",
    "rule2" : "user_id IS NOT NULL"
}

We can do 3 things when records violate the rules (see the sketch after this list):

  1. send warnings (default)
  2. fail the flow
  3. drop records: dlt.expect_all_or_drop(rules)
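A minimal sketch of a DLT table that applies the rules above and drops violating records; the table name and source path are placeholders.

```python
import dlt

rules = {
    "rule1": "order_id IS NOT NULL",
    "rule2": "user_id IS NOT NULL",
}

@dlt.table(name="orders_clean")        # hypothetical table name
@dlt.expect_all_or_drop(rules)         # drop rows that violate any of the rules
def orders_clean():
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/Volumes/main/bronze/raw_files/orders/"))  # hypothetical path
```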


In Databricks, AutoCDC refers to Automatic Change Data Capture, a feature that simplifies the process of tracking changes (inserts, updates, deletes) in source tables and applying them to target Delta tables without writing manual MERGE logic.

SCD1 and SCD2 refer to Slowly Changing Dimensions, a common data warehousing concept used to handle historical changes in dimension data (non-fact tables like customer, product, etc.).

SCD Type 1 - Overwrite (No History)

When a change happens, the old data is overwritten. No history is preserved.

Use case:

  • Fixing a typo in a name
  • Updating non-critical attributes

| CustomerID | Name  | City  |
| ---------- | ----- | ----- |
| 101        | Alice | Delhi |

→ Alice moves to Mumbai.

| CustomerID | Name  | City   |
| ---------- | ----- | ------ |
| 101        | Alice | Mumbai |

  • No history tracking
  • Simple

SCD Type 2 - Keep History

Every change creates a new row, and the old record is marked as inactive.

Use case:

  • Tracking history (e.g., address changes, job title changes)
  • Regulatory or audit needs

| CustomerID | Name  | City   | StartDate  | EndDate    | IsCurrent |
| ---------- | ----- | ------ | ---------- | ---------- | --------- |
| 101        | Alice | Delhi  | 2023-01-01 | 2024-05-15 | false     |
| 101        | Alice | Mumbai | 2024-05-16 | null       | true      |
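A minimal sketch of how the AutoCDC flow (dlt.apply_changes) can produce SCD1 or SCD2 tables in a DLT pipeline; the source table, keys, and sequencing column here are assumptions.

```python
import dlt

# Target streaming table that apply_changes will maintain
dlt.create_streaming_table("customers_scd2")

dlt.apply_changes(
    target="customers_scd2",
    source="customers_cdc_feed",    # hypothetical source with change events
    keys=["CustomerID"],            # business key identifying a customer
    sequence_by="event_timestamp",  # hypothetical column that orders the changes
    stored_as_scd_type=2,           # 1 = overwrite in place (no history), 2 = keep history rows
)
```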

Facts and Dimensions

Facts:

  • Tables that store measurable data—like quantities, sales, revenue, logins, etc.
  • Usually contain foreign keys to dimension tables and numeric measures.

Example: sales_fact table

| sale_id | customer_id | product_id | amount | date_id  |
| ------- | ----------- | ---------- | ------ | -------- |
| 1       | 101         | 501        | 100.50 | 20240101 |

Dimensions:

  • Tables that describe the who, what, where, when, how of the facts.
  • Often contain textual or categorical attributes.

Example: customer_dim table

| customer_id | name  | city   | join_date  |
| ----------- | ----- | ------ | ---------- |
| 101         | Alice | Mumbai | 2020-06-01 |

Facts vs Dimensions

| Feature     | Facts                     | Dimensions                       |
| ----------- | ------------------------- | -------------------------------- |
| Content     | Measures / metrics        | Descriptive attributes           |
| Granularity | Transaction-level         | Entity-level (customer, product) |
| Examples    | Sales, orders, payments   | Customer, product, region        |
| Key type    | Foreign keys (joins dims) | Primary keys                     |
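
To make the relationship concrete, a small sketch joining the fact table to the customer dimension to get revenue per city; the fully qualified table names are placeholders.

```python
from pyspark.sql import functions as F

sales = spark.table("main.gold.sales_fact")     # hypothetical table names
cust  = spark.table("main.gold.customer_dim")

# Fact rows carry the foreign key (customer_id) and the measure (amount);
# the dimension supplies the descriptive attribute (city) we group by.
revenue_by_city = (sales.join(cust, on="customer_id", how="left")
                   .groupBy("city")
                   .agg(F.sum("amount").alias("total_revenue")))

revenue_by_city.show()
```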