Airflow DAGs 101
Defining Airflow DAGs and Task Dependencies
Apache Airflow is one of the most popular tools for orchestrating data workflows. At the heart of Airflow lies the DAG (Directed Acyclic Graph).
DAG
In Airflow, a DAG (Directed Acyclic Graph) is a Python-based definition of a workflow. It represents a collection of tasks with clearly defined relationships.
- Directed: Each task has a defined order of execution (upstream/downstream).
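- Acyclic: Dependencies never loop back, so no task can end up upstream of itself, which rules out infinite cycles.
- Graph: Tasks are nodes and dependencies are the edges between them, so the Airflow UI can draw the pipeline and make it easy to monitor.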
A DAG can be as simple as one task or as complex as thousands of interconnected tasks.
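The directed and acyclic properties can be illustrated without Airflow at all. The sketch below uses Python's standard graphlib module (Python 3.9+) on hypothetical task names. It is not how Airflow validates DAGs internally, only a way to see why a cycle leaves no valid execution order.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical tasks: each key maps a task to the set of tasks it depends on (its upstream tasks).
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

# A valid DAG always has at least one execution order that respects every edge.
order = list(TopologicalSorter(dependencies).static_order())
print(order)

# Making the first task depend on the last one introduces a cycle.
cyclic = {**dependencies, "extract": {"load"}}
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```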
A simple DAG might contain just three tasks, each running after the previous one finishes.
Defining DAGs: Multiple Approaches
We can define a DAG in several ways.
1. The DAG Decorator (Recommended)
The most modern and clean approach uses the @dag decorator:
2. Context Manager Approach
Using the with statement provides clear scoping for your DAG definition:
3. Traditional Operators
For more complex tasks, we can use specific operators:
Managing Task Dependencies
Dependencies define the order in which tasks execute.
Simple Linear Dependencies
For straightforward workflows, use the bitshift operator (>>):
Parallel Task Execution
To run multiple tasks in parallel at the same level of the graph, group them in a list:
Complex Dependencies
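When one task feeds several downstream tasks, it is tempting to call the task function again in every dependency statement. That does not reuse the existing task: each time we explicitly call a task, it creates a new instance of that task, so the graph ends up with duplicates.
Avoid creating duplicate task instances by assigning each task call to a variable and reusing that variable in every dependency statement.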
Using Chain for Complex Dependencies
The chain function provides a cleaner syntax for the same dependency patterns:
Key Takeaways and Best Practices
- Unique Identifiers: Every DAG must have a unique identifier across your Airflow instance
- Start Date: While optional (defaults to None), setting a start date is crucial for scheduling
- Schedule Intervals: Define how frequently your DAG should run (@daily, @hourly, cron expressions, etc.)
- Documentation: Always include descriptions and tags to make your DAGs discoverable and maintainable
- Operator Selection: Before writing custom code, check the Astronomer Registry for existing operators
- Task Naming: Each task must have a unique identifier within its DAG
- Default Arguments: Use the default_args dictionary to set common parameters across all tasks
- Dependency Patterns: Use bitshift operators (>>, <<) and lists for simple dependencies, and chain for complex patterns
- Avoid Task Duplication: When a task has multiple downstream dependencies, store it in a variable to prevent creating duplicate instances