
“Cron is not an orchestrator. A script is not a pipeline.”

TL;DR

ML workflows are dependency graphs that cron cannot manage. DAG orchestrators like Airflow and Kubeflow handle task dependencies, retries, backfills, and distributed execution across heterogeneous infrastructure. The KubernetesExecutor provides infinite scale with zero idle cost and perfect isolation between tasks. Backfilling turns historical reprocessing into a one-click operation, and task caching via input hashing prevents redundant multi-hour recomputation. For how pipelines feed into model deployment, see Model Serialization and Feature Stores.

[Image: a complex model railway switching yard seen from above, with trains on parallel tracks being routed through switches]

1. Problem Statement

Standard software runs via Request -> Response. ML software runs as a Pipeline.

  • Step 1: Ingest Data (Wait for Hive partition).
  • Step 2: Clean Data (Requires Step 1).
  • Step 3: Distributed Training (Requires Step 2, Requires GPU cluster).
  • Step 4: Validation (Requires Step 3).
  • Step 5: Deploy.

The Problem: How do you manage this dependency graph reliably, handling retries, backfills, and distributed execution, without writing spaghetti bash scripts?


2. Understanding the Requirements

2.1 Why cron fails

0 2 * * * train_model.sh

  1. Dependency Hell: What if the Data Ingest job (scheduled at 1:00 AM) is delayed? train_model.sh runs on empty data.
  2. No State: If Step 3 fails, cron doesn’t know. It won’t retry just Step 3. You have to re-run the whole thing.
  3. No Scaling: cron runs on one machine. We need to run Step 3 on a Kubernetes GPU cluster.

2.2 The Solution: The DAG

We represent the workflow as a Directed Acyclic Graph (DAG).

  • Directed: Data flows A -> B.
  • Acyclic: B cannot feed back into A (a cycle would mean an infinite loop).
  • Graph: Nodes are tasks, Edges are data/control dependencies.
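To make this concrete, here is a minimal sketch (plain Python, illustrative task names) of our five-step pipeline as an adjacency list, using the standard library to produce a valid execution order and reject cycles:

from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Each task maps to the set of tasks it depends on.
pipeline = {
    "ingest": set(),
    "clean": {"ingest"},
    "train": {"clean"},
    "validate": {"train"},
    "deploy": {"validate"},
}

# static_order() raises CycleError if the graph is not acyclic.
print(list(TopologicalSorter(pipeline).static_order()))
# -> ['ingest', 'clean', 'train', 'validate', 'deploy']

Every orchestrator discussed below is, at its core, this topological sort plus state tracking, retries, and distributed execution.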

3. High-Level Architecture

An MLOps Orchestrator (like Airflow or Kubeflow Pipelines) consists of:

[Scheduler (The Brain)]
 | (Polls DAGs, Checks Time, Checks Dependencies)
 |
 +---> [Queue (Redis/RabbitMQ)]
 |
 v
 [Workers (The Brawn)]
 | Worker 1: [Task A: Data Prep]
 | Worker 2: [Task B: Training (GPU)]
 | Worker 3: [Task C: Evaluation]

Key Components

  1. Metastore (Postgres): “Task B failed at 2:03 PM”.
  2. Executor: “Launch this task on Kubernetes” vs “Launch this task in a local process”.
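To see how these parts cooperate, here is a toy, in-memory version of the scheduler/queue/worker split (the names and the dict-based metastore are illustrative simplifications, not any real orchestrator's API):

import queue

task_queue = queue.Queue()  # stand-in for Redis/RabbitMQ

def scheduler_tick(dag, state):
    # One poll: enqueue every task whose dependencies have all succeeded.
    for task, deps in dag.items():
        if state.get(task) is None and all(state.get(d) == "success" for d in deps):
            state[task] = "queued"
            task_queue.put(task)

def worker_loop(state):
    # Workers pull tasks, run them, and write results back to the metastore.
    while not task_queue.empty():
        task = task_queue.get()
        state[task] = "success"  # a real worker executes, and may fail and retry

state = {}  # stand-in for the Postgres metastore
dag = {"prep": [], "train": ["prep"], "evaluate": ["train"]}
for _ in dag:  # enough polling rounds to drain the whole graph
    scheduler_tick(dag, state)
    worker_loop(state)
print(state)  # every task ends "success", in dependency order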

4. Component Deep-Dives

4.1 Airflow (The Scheduler King)

  • Python as Config: You define DAGs in Python code.
  • Operators: PostgresOperator, PythonOperator, BashOperator.
  • Sensors: Special tasks that “wait” for something (e.g., S3KeySensor waits for a file to appear).
  • Best For: Data Engineering, Scheduled jobs.

4.2 Kubeflow Pipelines (The ML Specialist)

  • Container Native: Every task is a Docker container.
  • Artifact Tracking: Automatically logs “Task A produced dataset.csv (hash: xyz)”.
  • Best For: Deep Learning workflows where reproducibility (Docker) is critical.
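A minimal sketch with the KFP v2 SDK (component names and images are illustrative) shows both properties: each function becomes a container, and typed outputs become tracked artifacts:

from kfp import dsl

@dsl.component(base_image="python:3.11")
def clean_data(raw_csv: str, cleaned: dsl.Output[dsl.Dataset]):
    # Runs in its own container; `cleaned` is logged as a versioned artifact.
    with open(cleaned.path, "w") as f:
        f.write(f"rows cleaned from {raw_csv}")

@dsl.component(base_image="python:3.11")
def train(data: dsl.Input[dsl.Dataset]) -> str:
    return f"model trained on {data.path}"

@dsl.pipeline(name="ml-pipeline-v1")
def ml_pipeline(raw_csv: str):
    cleaned = clean_data(raw_csv=raw_csv)
    train(data=cleaned.outputs["cleaned"])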

5. Data Flow: The Backfill

A unique feature of Orchestrators is Backfilling. Imagine you change your Feature Engineering logic today. You want to re-train the model on data from the past year using the new logic.

With cron, this is a nightmare. With Airflow:

  1. Clear the state of “FeatureEng Task” for dates=[ ... ].
  2. The Scheduler sees these tasks as “Null state”.
  3. Since they have no dependencies blocking them (historical data exists), it schedules them all in parallel (up to max_active_runs).

This turns “Historical Reprocessing” into a one-click operation.
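With Airflow, the whole procedure is driven from the CLI (dates are illustrative; the DAG and task ids are the hypothetical ones used in this post):

# Wipe the state of the feature task for the affected dates...
airflow tasks clear -t feature_eng -s 2023-01-01 -e 2023-12-31 ml_pipeline_v1

# ...then let the scheduler refill them, or drive the re-run explicitly:
airflow dags backfill -s 2023-01-01 -e 2023-12-31 ml_pipeline_v1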


6. Scaling Strategies

6.1 The Kubernetes Executor

Instead of having a fixed pool of workers (which sit idle or get overwhelmed), use the KubernetesExecutor.

  • Scheduler: “Task A needs to run.”
  • K8s: Spin up a Pod just for Task A.
  • Task A: Runs.
  • K8s: Kill Pod.

Pros: Infinite scale. Zero cost when idle. Perfect isolation (Task A uses PyTorch 1.9, Task B uses PyTorch 2.0).
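With Airflow specifically, the KubernetesExecutor takes a per-task pod spec via executor_config; here is a sketch of giving only the training task a GPU (the resource name follows the standard NVIDIA device plugin):

from kubernetes.client import models as k8s

gpu_override = {
    "pod_override": k8s.V1Pod(
        spec=k8s.V1PodSpec(
            containers=[
                k8s.V1Container(
                    name="base",  # Airflow runs the task in a container named "base"
                    resources=k8s.V1ResourceRequirements(
                        limits={"nvidia.com/gpu": "1"},  # land on a GPU node
                    ),
                )
            ]
        )
    )
}

# Attach to a single task, e.g. PythonOperator(..., executor_config=gpu_override);
# every other task gets a plain CPU pod.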

6.2 Caching (Memoization)

If Task A takes 5 hours to generate clean_data.csv and Task B fails, we don't want to re-run Task A when we retry. Kubeflow natively checks inputs: if Hash(Inputs, Code) matches a previous run, it reuses the cached output.
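Stripped of Kubeflow specifics, the mechanism is a content-addressed cache. A minimal sketch, assuming JSON-serializable inputs and outputs (cache location illustrative):

import hashlib
import inspect
import json
from pathlib import Path

CACHE_DIR = Path("/tmp/task_cache")

def cached(task_fn):
    # Memoize a task on a hash of its inputs AND its source code.
    def wrapper(**inputs):
        key = hashlib.sha256(
            (json.dumps(inputs, sort_keys=True) + inspect.getsource(task_fn)).encode()
        ).hexdigest()
        hit = CACHE_DIR / f"{key}.json"
        if hit.exists():
            return json.loads(hit.read_text())  # cache hit: skip the 5-hour run
        result = task_fn(**inputs)
        CACHE_DIR.mkdir(parents=True, exist_ok=True)
        hit.write_text(json.dumps(result))
        return result
    return wrapper

Changing either the inputs or the task's code changes the hash, so stale caches are never reused.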


7. Implementation: Defining a DAG (Airflow)

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

def train_model(**context):
    # ML Logic here
    print("Training...")

with DAG("ml_pipeline_v1",
         start_date=datetime(2023, 1, 1),
         schedule_interval="@daily",
         catchup=False) as dag:

    # 1. Wait for today's data to arrive in S3
    wait_for_data = S3KeySensor(
        task_id="wait_for_input",
        bucket_key="s3://data/incoming/{{ ds }}/data.csv",  # {{ ds }} = run date
    )

    # 2. Train logic
    train = PythonOperator(
        task_id="train_model",
        python_callable=train_model,
    )

    # 3. Dependency Definition
    wait_for_data >> train

8. Monitoring & Metrics

  1. DAG Parse Time: If your Python DAG file is slow to parse (e.g., connects to DB at top level), the Scheduler hangs. Anti-pattern!
  2. Task Latency: Track Input_Size vs Duration. If data grew 2x but time grew 10x, you have a non-linear scaling bug.
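The parse-time anti-pattern from point 1, and its fix, side by side (connection string hypothetical):

# BAD: runs on every scheduler parse of this DAG file (roughly every 30s),
# not when the task runs:
#
#   import psycopg2
#   FEATURES = psycopg2.connect("dbname=features")  # hangs the scheduler

def load_features():
    # GOOD: heavy imports and I/O deferred until the task actually executes.
    import psycopg2  # hypothetical dependency
    with psycopg2.connect("dbname=features") as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT * FROM features")
            return cur.fetchall()

Keep the top level of a DAG file down to cheap, deterministic Python.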

9. Failure Modes

  1. Thundering Herd: You backfill 1 year of data. 365 jobs launch instantly. Your Database crashes.
    • Fix: Set concurrency=10 at the DAG level.
  2. Deadlock: Task A waits for Task B. Task B waits for a connection slot held by Task A.
  3. Zombies: The Scheduler thinks Task A is running. Task A’s pod died OOM. The state stays “Running” forever.
    • Fix: Frequent “Heartbeat” checks by the Scheduler.
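The thundering-herd fix, expressed in the DAG declaration (note that Airflow 2.2+ renamed the DAG-level concurrency argument to max_active_tasks):

from datetime import datetime
from airflow import DAG

with DAG(
    "ml_pipeline_v1",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    max_active_runs=10,   # at most 10 dag runs (e.g. backfill dates) in flight
    max_active_tasks=10,  # at most 10 task instances across those runs
) as dag:
    ...  # tasks as before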

10. Real-World Case Study: Lyft Flyte

Lyft built Flyte because Airflow wasn’t strict enough about data types. Flyte enforces Type Safety between tasks.

  • Task A output: dataframe_schema[age: int]
  • Task B input: dataframe_schema[age: String]
  • Flyte catches this mismatch at compile/registration time, preventing a runtime failure 5 hours into the pipeline.
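A sketch of the same mismatch with the flytekit SDK (toy types here; real Flyte schemas can type individual dataframe columns):

from flytekit import task, workflow

@task
def feature_eng() -> int:  # output declared as int
    return 29

@task
def train(age: str) -> str:  # input declared as str
    return f"trained with age={age}"

@workflow
def pipeline() -> str:
    # int -> str mismatch: flytekit rejects this when the workflow is
    # compiled at registration, not 5 hours into a production run.
    return train(age=feature_eng())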

11. Cost Analysis

  • Spot Instances: Since ML pipelines are fault-tolerant (auto-retry), use Spot Instances for the Heavy Workers. Typical savings: ~70%.
  • Idle Clusters: Using K8s Executor prevents paying for idle workers at night.

12. Key Takeaways

  1. Dependency Management: This is the core value prop. “Run B only after A succeeds.”
  2. Idempotency: Every task must be idempotent. If run twice, it should produce the same result (or overwrite safely).
  3. Infrastructure as Code: The Pipeline is code. Version control your DAGs.
  4. Static vs Dynamic: Airflow is Static (DAG defined upfront). Prefect/Metaflow allow Dynamic (DAG defined at runtime).
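Takeaway 2 in code: an idempotent task writes to a deterministic path derived from its run date and replaces atomically, so re-runs overwrite instead of duplicating (paths illustrative):

import json
from pathlib import Path

def write_features(run_date: str, features: dict) -> Path:
    # Same date in, same file out: safe to run twice.
    out = Path(f"/data/features/dt={run_date}/features.json")
    out.parent.mkdir(parents=True, exist_ok=True)
    tmp = out.parent / (out.name + ".tmp")
    tmp.write_text(json.dumps(features))
    tmp.replace(out)  # atomic rename: readers never see a half-written file
    return out

write_features("2023-01-01", {"age_mean": 34.2})
write_features("2023-01-01", {"age_mean": 34.2})  # no duplicates, same result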

FAQ

Why can’t cron handle ML pipelines?

Cron has no dependency awareness, no state tracking for retries, and no ability to scale across distributed infrastructure. If an upstream data job is delayed, cron will run training on empty data without knowing anything went wrong, and if a step fails, you must manually re-run the entire pipeline instead of just the failed task.

What is backfilling in ML pipeline orchestration?

Backfilling is the ability to re-run pipeline tasks for historical dates when logic changes. With Airflow, you clear the task states for a date range, the scheduler sees them as unresolved, and since historical data already exists, it schedules them all in parallel up to the concurrency limit, turning a nightmare into a one-click operation.

How does the KubernetesExecutor improve ML pipeline scalability?

Instead of maintaining a fixed worker pool that either sits idle or gets overwhelmed, the KubernetesExecutor spins up a dedicated pod for each task and kills it after completion. This provides infinite scale, zero cost when idle, and perfect dependency isolation so Task A can use PyTorch 1.9 while Task B uses PyTorch 2.0.

What are common failure modes in DAG pipeline orchestration?

The main failures are thundering herd from uncapped parallel backfills that crash your database, deadlocks from circular resource dependencies between tasks, and zombie tasks where the scheduler thinks a task is running but its pod has died from out-of-memory errors. Fixes include concurrency limits, heartbeat checks, and explicit resource management.


Originally published at: arunbaby.com/ml-system-design/0049-dag-pipeline-orchestration
