Airflow XComs 101
When building pipelines in Apache Airflow, tasks often need to share data with each other.
That's where XComs (cross-communications) come in.
What are XComs?
XComs enable tasks in your Airflow DAGs to exchange small amounts of data. The mechanism works like this:
- Push: A task stores data in XCom using a unique identifier
- Pull: Another task retrieves that data using the same identifier
- Storage: XCom data is stored in Airflow's metadata database by default
- Identification: Each XCom is uniquely identified by multiple fields: key, run ID, task ID, and DAG ID
Every task instance in Airflow gets its own context dictionary that contains metadata about the current execution.
context["ti"]β refers to the TaskInstance object of the currently running task
XCom Implementation Patterns
Method 1: Explicit Context Usage
The most verbose way is to use the full context dictionary:
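A minimal sketch of this pattern, assuming two PythonOperator tasks with the illustrative IDs push_task and pull_task; the key and value are also made up:

```python
def push_with_context(**context):
    # Airflow passes the full execution context as keyword arguments;
    # "ti" is the TaskInstance of the currently running task.
    context["ti"].xcom_push(key="row_count", value=42)


def pull_with_context(**context):
    # Pull the value back by key and by the ID of the task that pushed it.
    row_count = context["ti"].xcom_pull(task_ids="push_task", key="row_count")
    print(f"row_count = {row_count}")


# Both callables would be wired up inside a DAG, e.g.
# PythonOperator(task_id="push_task", python_callable=push_with_context)
```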
Method 2: Direct TaskInstance Access
Accessing the TaskInstance directly:
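A sketch of the same push/pull pair, assuming Airflow 2, where context variables such as ti can be declared directly as parameters of the callable:

```python
def push_direct(ti):
    # Airflow injects the TaskInstance as the "ti" argument,
    # so there is no need to unpack the whole context dictionary.
    ti.xcom_push(key="row_count", value=42)


def pull_direct(ti):
    row_count = ti.xcom_pull(task_ids="push_task", key="row_count")
    print(f"row_count = {row_count}")
```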
Method 3: Implicit XCom with return values (Recommended)
The Pythonic approach uses return values and function parameters (see the sketch after this list):
- Less boilerplate code
- More readable and intuitive
- Follows Python conventions
- Automatic XCom handling behind the scenes
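A sketch using the TaskFlow API (@task); the task names and the returned metadata are illustrative:

```python
from airflow.decorators import task


@task
def extract():
    # The return value is pushed to XCom automatically under the key "return_value".
    return {"path": "/tmp/orders.csv", "row_count": 42}


@task
def load(metadata):
    # Receiving the upstream result as a parameter pulls it from XCom implicitly
    # and declares the extract >> load dependency at the same time.
    print(f"Loading {metadata['row_count']} rows from {metadata['path']}")


# Inside a DAG definition (a `with DAG(...)` block or an @dag function): load(extract())
```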
Advanced XCom Patterns
Pulling from Multiple Tasks
When a task needs data from several upstream tasks, pull with a list of task IDs and make sure the upstream dependencies are declared:
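A sketch of a downstream callable pulling from two assumed upstream tasks, extract_a and extract_b, each of which returned a row count:

```python
def combine(ti):
    # Passing a list of task IDs returns one value per listed task
    # (the default key "return_value" is used here).
    counts = ti.xcom_pull(task_ids=["extract_a", "extract_b"])
    print(f"Total rows: {sum(counts)}")


# The dependencies still have to be declared, e.g.
# [extract_a_task, extract_b_task] >> combine_task
```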
Pushing Multiple Values
Use dictionaries to organize and share multiple related values:
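A sketch that bundles related values into a single dictionary push; the key, field names, and task ID are illustrative:

```python
def push_stats(ti):
    # One push with a dict keeps related values together instead of many tiny XComs.
    ti.xcom_push(
        key="run_stats",
        value={"rows_processed": 1200, "output_path": "/data/out.parquet", "status": "ok"},
    )


def read_stats(ti):
    stats = ti.xcom_pull(task_ids="push_stats_task", key="run_stats")
    print(stats["rows_processed"], stats["status"])
```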
XCom Limitations and Best Practices
- Keep XComs small - they're for metadata, not bulk data
- Use for: file paths, URLs, run IDs, execution metadata, row counts, processing stats, small config dicts, status flags, control signals, DB connection strings.
- Avoid for: raw CSVs, large JSON dumps, entire DataFrames, full datasets, binary files, images, large API responses.
- Size Constraints: XCom storage limits vary by database:
- SQLite: Up to 2GB
- Postgres: Up to 1GB
- MySQL: Up to 64KB
- Use external storage for large datasets and pass references (file paths, object keys) via XCom, as sketched after this list
- XComs are unsuitable for moving large amounts of data between tasks; for heavy processing, trigger an external job such as a Spark job and exchange only references or status flags
- JSON Serialization: Data must be JSON serializable (strings, numbers, lists, dictionaries, booleans, null)
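To illustrate the reference-passing pattern mentioned above, here is a sketch that writes the bulk data out and shares only a path. A local file stands in for external storage; in a real deployment you would push an S3/GCS URI instead, since tasks may run on different workers:

```python
import csv


def export_data(ti):
    # Write the bulk data to external storage and push only the reference.
    output_path = "/tmp/orders_export.csv"
    with open(output_path, "w", newline="") as f:
        csv.writer(f).writerows([["order_id", "amount"], [1, 9.99], [2, 24.50]])
    ti.xcom_push(key="export_path", value=output_path)


def process_data(ti):
    # Downstream tasks pull the small reference and fetch the data themselves.
    export_path = ti.xcom_pull(task_ids="export_task", key="export_path")
    with open(export_path) as f:
        rows = list(csv.reader(f))
    print(f"Read {len(rows) - 1} data rows from {export_path}")
```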
A complete DAG example:
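The following is a minimal sketch using the TaskFlow API; the task names, data, and dates are illustrative, and the schedule argument assumes Airflow 2.4 or later:

```python
import pendulum

from airflow.decorators import dag, task


@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def xcom_101_example():
    @task
    def extract():
        # Implicit push: the return value lands in XCom under "return_value".
        return {"path": "/tmp/orders.csv", "row_count": 3}

    @task
    def validate(metadata):
        # Implicit pull: the parameter is the upstream task's return value.
        assert metadata["row_count"] > 0
        return metadata["row_count"]

    @task
    def report(row_count, ti=None):
        # Explicit pull still works alongside the implicit style.
        extracted = ti.xcom_pull(task_ids="extract")
        print(f"Validated {row_count} rows extracted to {extracted['path']}")

    report(validate(extract()))


xcom_101_example()
```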