
Databricks notes

A rough collection of notes on Databricks and data engineering concepts

Jul 31, 2025

These are some of my notes on concepts related to Databricks. Not everything is Databricks-specific; some are general data engineering concepts that I put together while working on Databricks projects. I might organize these better or expand them into more detailed posts on individual topics. For now, this is a rough collection of notes I found useful.

Medallion Architecture

Bronze Layer – Raw Data

  • This layer stores raw, unprocessed data directly from the source. It acts as a backup and preserves the original format for traceability.
  • Example: A CSV dump of customer transactions ingested from an S3 bucket or API feed.

Silver Layer – Cleaned & Transformed Data

  • This layer contains cleaned, validated, and joined data, making it ready for analytical processing. It removes duplicates, applies consistent formats, and integrates data from multiple sources.
  • Example: A table with customer details joined with valid transaction data, cleaned for nulls and data types.

Gold Layer – Business Ready Data

  • This layer holds curated, aggregated, and business-specific datasets used in dashboards, reporting, or machine learning models. It’s optimized for end-user consumption and decision-making.
  • Example: A daily report showing total revenue per region or customer churn predictions by segment.
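
To make the layers concrete, here is a minimal PySpark sketch of data moving bronze → silver → gold, assuming the `spark` session a Databricks notebook provides. The paths, table names (including the bronze.customers join table), and column names are placeholders, not part of any real project.

```python
from pyspark.sql import functions as F

# Bronze: land the raw CSV dump as-is (path and table names are placeholders)
raw_txn_path = "/Volumes/demo_catalog/bronze/landing/transactions/"
bronze = spark.read.format("csv").option("header", "true").load(raw_txn_path)
bronze.write.mode("append").saveAsTable("bronze.transactions")

# Silver: fix types, drop duplicates and nulls, join with customer details
silver = (
    spark.table("bronze.transactions")
    .withColumn("amount", F.col("amount").cast("double"))
    .dropDuplicates(["transaction_id"])
    .dropna(subset=["customer_id", "amount"])
    .join(spark.table("bronze.customers"), "customer_id")
)
silver.write.mode("overwrite").saveAsTable("silver.transactions")

# Gold: business-ready aggregate, e.g. total revenue per region per day
gold = (
    spark.table("silver.transactions")
    .groupBy("region", "transaction_date")
    .agg(F.sum("amount").alias("total_revenue"))
)
gold.write.mode("overwrite").saveAsTable("gold.daily_revenue_by_region")
```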
 

Unity Catalog

  • A Unity Catalog metastore is attached to the Databricks workspace.
  • We can't manage the metastore without admin permissions, and we can't have that access in a free workspace.
  • Typical setup order: start from Unity Catalog, create a schema, create a volume, then create the data folders (see the sketch below).
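
A rough sketch of that setup order, run from a notebook with spark.sql; the catalog, schema, and volume names are made up, and you'd need the right Unity Catalog privileges for each statement.

```python
# Assumes a Unity Catalog metastore is already attached to the workspace
# and we have the privileges to create these objects. Names are placeholders.
spark.sql("CREATE CATALOG IF NOT EXISTS demo_catalog")
spark.sql("CREATE SCHEMA IF NOT EXISTS demo_catalog.bronze")
spark.sql("CREATE VOLUME IF NOT EXISTS demo_catalog.bronze.landing")

# Data folders live under the volume path
dbutils.fs.mkdirs("/Volumes/demo_catalog/bronze/landing/transactions/")
```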
 

Autoloader → incrementally read data

  • We receive files daily, so we need to load only the newly arrived data files.
  • We need idempotent behavior: process each file only once; if a file has already been processed, we don't want to process it again.
  • Autoloader creates a checkpoint (it has a RocksDB folder) that takes care of the idempotent behavior as well as the schema of the files.
  • The cached schema needs to be refreshed if the schema changes.
  • If tomorrow's file has a different schema, in addNewColumns mode Autoloader adds the new column at the end of the table.
  • Create a checkpoint (we need to maintain a checkpoint location for Spark streaming); it is recommended to keep the checkpoint location inside the schema location because it is easier to manage and maintain in a production scenario.
  • For schema evolution, rescue mode is better than addNewColumns because it doesn't break the end table; unexpected fields are captured in a rescued-data column instead.
  • Specify trigger once=True so the stream processes the available data and then stops, letting the job cluster terminate automatically (see the sketch after this list).
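
A minimal sketch of an Autoloader stream with these pieces wired together; the paths and table name are placeholders, and the options shown are the ones discussed above.

```python
# Checkpoint (with its RocksDB state) + schema location give idempotent,
# exactly-once file processing and schema tracking. Paths are placeholders.
source_path = "/Volumes/demo_catalog/bronze/landing/transactions/"
schema_path = "/Volumes/demo_catalog/bronze/landing/_schema/transactions/"
checkpoint_path = "/Volumes/demo_catalog/bronze/landing/_checkpoint/transactions/"

df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", schema_path)
    .option("cloudFiles.schemaEvolutionMode", "rescue")  # or "addNewColumns"
    .load(source_path)
)

(
    df.writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(once=True)  # process what's available, then stop so the cluster can shut down
    .toTable("demo_catalog.bronze.transactions")
)
```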
 
 

DLT or Lakeflow Declarative Pipeline

We can do 3 things in DLT:

  1. Streaming tables
  2. Materialized views
  3. Streaming views
 

A benefit of Delta Live Tables is that we can refer to a table even if it hasn't been created yet, because the pipeline resolves the dependencies between tables.
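
A small sketch of what a declarative pipeline can look like, with a streaming table fed by Autoloader and a materialized view built on top of it; the source path and table names are invented for illustration.

```python
import dlt
from pyspark.sql import functions as F

# Streaming table: incrementally ingests new files via Autoloader
@dlt.table(name="orders_bronze")
def orders_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/demo_catalog/bronze/landing/orders/")
    )

# Materialized view: the pipeline recomputes it from the table above.
# Note we can reference orders_bronze before it physically exists,
# because DLT resolves the dependency graph between tables for us.
@dlt.table(name="orders_daily")
def orders_daily():
    return (
        dlt.read("orders_bronze")
        .groupBy("order_date")
        .agg(F.sum("amount").alias("total_amount"))
    )
```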

 

Expectations are rules we apply on top of the table

dlt.expect_all(rules), where rules is a dict of rule names mapped to SQL constraint expressions; an example is in the sketch below.

We can do 3 things when an expectation is violated:

  1. Send warnings (default): dlt.expect_all(rules)
  2. Fail the flow: dlt.expect_all_or_fail(rules)
  3. Drop the offending records: dlt.expect_all_or_drop(rules)
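
A minimal sketch of expectations on a DLT table; the rule names, constraint expressions, and source table are placeholders.

```python
import dlt

# rules is a dict of {constraint_name: SQL boolean expression}
rules = {
    "valid_customer_id": "customer_id IS NOT NULL",
    "positive_amount": "amount > 0",
}

@dlt.table(name="transactions_silver")
@dlt.expect_all(rules)             # default: log violations as warnings, keep the rows
# @dlt.expect_all_or_fail(rules)   # alternative: fail the flow on any violation
# @dlt.expect_all_or_drop(rules)   # alternative: silently drop violating rows
def transactions_silver():
    return dlt.read("transactions_bronze")
```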
 


In Databricks, AutoCDC refers to Automatic Change Data Capture, a feature that simplifies the process of tracking changes (inserts, updates, deletes) in source tables and applying them to target Delta tables without writing manual MERGE logic.
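
In a DLT pipeline this shows up as the apply_changes API; a sketch with made-up source/target names and an assumed ordering column event_ts:

```python
import dlt
from pyspark.sql.functions import col

# Target table the change feed gets applied into
dlt.create_streaming_table("customers_silver")

dlt.apply_changes(
    target="customers_silver",
    source="customers_bronze_cdc",   # stream of inserts/updates/deletes
    keys=["CustomerID"],             # business key used to match records
    sequence_by=col("event_ts"),     # ordering column to resolve out-of-order changes
    stored_as_scd_type=1,            # 1 = overwrite in place, 2 = keep row history
)
```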

SCD1 and SCD2 refer to Slowly Changing Dimensions, a common data warehousing concept used to handle historical changes in dimension data (non-fact tables like customer, product, etc.).

SCD Type 1 - Overwrite (No History)

When a change happens, the old data is overwritten. No history is preserved.

Use case:

  • Fixing a typo in a name
  • Updating non-critical attributes

| CustomerID | Name  | City  |
| ---------- | ----- | ----- |
| 101        | Alice | Delhi |

→ Alice moves to Mumbai.

| CustomerID | Name  | City   |
| ---------- | ----- | ------ |
| 101        | Alice | Mumbai |
  • No history tracking
  • Simple
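
Without AutoCDC, SCD Type 1 is typically a Delta MERGE that overwrites matched rows; a sketch with placeholder table names:

```python
from delta.tables import DeltaTable

dim = DeltaTable.forName(spark, "dim_customer")       # placeholder target dimension
updates = spark.table("staging_customer_updates")     # placeholder staging data

(
    dim.alias("t")
    .merge(updates.alias("s"), "t.CustomerID = s.CustomerID")
    .whenMatchedUpdateAll()      # overwrite the existing row, so no history is kept
    .whenNotMatchedInsertAll()   # insert customers we haven't seen before
    .execute()
)
```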
 

SCD Type 2 - Keep History

Every change creates a new row, and the old record is marked as inactive.

Use case:

  • Tracking history (e.g., address changes, job title changes)
  • Regulatory or audit needs

| CustomerID | Name  | City   | StartDate  | EndDate    | IsCurrent |
| ---------- | ----- | ------ | ---------- | ---------- | --------- |
| 101        | Alice | Delhi  | 2023-01-01 | 2024-05-15 | false     |
| 101        | Alice | Mumbai | 2024-05-16 | null       | true      |
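
With this layout, the current and point-in-time views of the dimension are just filters; a sketch assuming the column names in the table above:

```python
# Latest version of each customer
current = spark.table("dim_customer").where("IsCurrent = true")

# Point-in-time view: the dimension as it looked on a given date
as_of_date = "2024-01-01"
as_of = spark.table("dim_customer").where(
    f"StartDate <= '{as_of_date}' AND (EndDate IS NULL OR EndDate >= '{as_of_date}')"
)
```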
 

Facts and Dimensions

Facts:

  • Tables that store measurable data—like quantities, sales, revenue, logins, etc.
  • Usually contain foreign keys to dimension tables and numeric measures.

Example: sales_fact table

| sale_id | customer_id | product_id | amount | date_id  |
| ------- | ----------- | ---------- | ------ | -------- |
| 1       | 101         | 501        | 100.50 | 20240101 |

Dimensions:

  • Tables that describe the who, what, where, when, how of the facts.
  • Often contain textual or categorical attributes.

Example: customer_dim table

| customer_id | name  | city   | join_date  |
| ----------- | ----- | ------ | ---------- |
| 101         | Alice | Mumbai | 2020-06-01 |

Facts vs Dimensions

| Feature     | Facts                     | Dimensions                       |
| ----------- | ------------------------- | -------------------------------- |
| Content     | Measures / metrics        | Descriptive attributes           |
| Granularity | Transaction-level         | Entity-level (customer, product) |
| Examples    | Sales, orders, payments   | Customer, product, region        |
| Key type    | Foreign keys (joins dims) | Primary keys                     |
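
Putting the two together, a typical gold-level query joins the fact table to a dimension on its foreign key and aggregates the measures; a sketch using the example tables above:

```python
from pyspark.sql import functions as F

sales = spark.table("sales_fact")
customers = spark.table("customer_dim")

# Foreign key (fact) -> primary key (dimension) join, then aggregate the measure
revenue_by_city = (
    sales.join(customers, "customer_id")
    .groupBy("city")
    .agg(F.sum("amount").alias("total_revenue"))
)
```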
 
