Modernizing Legacy ETL to PySpark with Airflow Orchestration

April 4, 2026 · 11 min read · MigryX Team

Many enterprise data platforms are built on legacy ETL tools. IBM DataStage, Microsoft SSIS, and Talend have powered data warehouse loading, data integration, and batch processing for decades. These platforms couple three concerns into a single system: data transformation logic, execution infrastructure, and workflow orchestration (scheduling, dependencies, error handling). This coupling made sense when the tools were designed, but it creates a rigid architecture that is expensive to scale, difficult to version control, and ill-suited to modern cloud-native environments.

The modern architecture separates these concerns. PySpark handles data transformation on elastic compute infrastructure. Apache Airflow (or Prefect) handles orchestration. Cloud services (EMR, GKE, EKS) provide the infrastructure. This article examines how to migrate legacy ETL control flows to Airflow DAGs with PySpark jobs, and why this separation of concerns produces a fundamentally better architecture.

The Problem with Coupled ETL Platforms

Legacy ETL tools bundle orchestration and compute into a single platform, creating several architectural problems that compound as organizations scale.

Vertical scaling only. DataStage, SSIS, and Talend run on dedicated servers. When data volumes grow, you add RAM and CPU to the existing server. There is no horizontal scaling — you cannot spread a job across 50 machines. PySpark distributes processing across a cluster that can scale from 2 to 200 nodes based on workload.

Proprietary orchestration. Each platform has its own scheduling and dependency management. DataStage uses Sequencer jobs. SSIS uses SQL Server Agent and package execution. Talend uses TAC (Talend Administration Center). These orchestration layers are tightly coupled to the execution engine and cannot manage non-ETL tasks. Airflow orchestrates any task type — PySpark jobs, dbt models, API calls, file transfers, database procedures — in a single DAG.

No version control. Legacy ETL jobs are stored in proprietary repositories — DataStage in its own metadata repository, SSIS in SSISDB or file system packages, Talend in its workspace. These artifacts are binary or XML, making meaningful code review, branching, and merging impractical. PySpark scripts and Airflow DAGs are Python files that live in Git with full version control, pull requests, and CI/CD.

Expensive licensing. DataStage, SSIS (via SQL Server Enterprise), and Talend (commercial editions) carry significant licensing costs. PySpark and Airflow are Apache-licensed open source. The infrastructure cost exists (compute nodes, cloud services), but the software itself is free.

SAS to Apache PySpark migration — automated end-to-end by MigryX

Legacy ETL to PySpark: Transformation Mapping

Each legacy ETL platform has its own transformation vocabulary, but they all perform the same fundamental operations. Here is how common patterns map to PySpark.

| Legacy ETL Concept | DataStage | SSIS | Talend | PySpark Equivalent |
|---|---|---|---|---|
| Source read | Sequential File / DB2 Stage | OLE DB Source | tInput components | spark.read.format() |
| Filter rows | Filter Stage | Conditional Split | tFilterRow | df.filter() |
| Join datasets | Join Stage | Merge Join | tJoin | df.join() |
| Aggregate | Aggregator Stage | Aggregate | tAggregateRow | df.groupBy().agg() |
| Lookup | Lookup Stage | Lookup | tMap (lookup) | broadcast join |
| Sort | Sort Stage | Sort | tSortRow | df.orderBy() |
| Derived column | Transformer Stage | Derived Column | tMap (expression) | df.withColumn() |
| Target write | DB2 / Sequential File Stage | OLE DB Destination | tOutput components | df.write.format() |
| SCD Type 2 | SCD Stage | SCD Transform | tSCDManager | Delta Lake MERGE |
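The SCD Type 2 mapping deserves a closer look, since it is the one pattern where behavior (history tracking) matters as much as syntax. In production this becomes a Delta Lake MERGE, but the underlying logic can be sketched engine-independently. The following is a minimal illustration in plain Python, not production code; the record shape (customer_id, tier, is_current, valid_from, valid_to) is an assumed example schema:

```python
from copy import deepcopy

def scd2_merge(existing, updates, key, tracked, as_of):
    """Apply SCD Type 2 semantics to a list of record dicts.

    existing: current dimension rows, each carrying is_current/valid_from/valid_to
    updates:  incoming source rows (business key plus tracked attributes)
    """
    result = [deepcopy(r) for r in existing]
    current = {r[key]: r for r in result if r["is_current"]}
    for upd in updates:
        cur = current.get(upd[key])
        if cur is not None and all(cur[c] == upd[c] for c in tracked):
            continue  # no change in tracked attributes: keep the current version
        if cur is not None:
            cur["is_current"] = False  # expire the old version
            cur["valid_to"] = as_of
        new_row = dict(upd)
        new_row.update(is_current=True, valid_from=as_of, valid_to=None)
        result.append(new_row)
    return result

existing = [{"customer_id": 1, "tier": "GOLD", "is_current": True,
             "valid_from": "2025-01-01", "valid_to": None}]
updates = [{"customer_id": 1, "tier": "PLATINUM"},
           {"customer_id": 2, "tier": "BRONZE"}]
out = scd2_merge(existing, updates, "customer_id", ["tier"], "2026-04-01")
# out holds three rows: the expired GOLD version, the new PLATINUM version,
# and a first version for the new customer
```

A Delta Lake MERGE expresses the same expire-and-insert logic declaratively, with the added benefit of running transactionally at scale.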

DataStage Parallel Job to PySpark

A typical DataStage parallel job reads from a database, applies transformations in a Transformer stage, joins with a reference dataset, and writes to a target. The PySpark equivalent is a single script:

import os

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("CustomerETL").getOrCreate()

# DataStage: DB2 Connector (source) → PySpark: JDBC read
# Credentials come from the environment (or a secrets manager) rather than
# being hardcoded in the job.
db_user = os.environ["DB_USER"]
db_password = os.environ["DB_PASSWORD"]

customers = (
    spark.read.format("jdbc")
    .option("url", "jdbc:db2://host:50000/warehouse")
    .option("dbtable", "raw.customers")
    .option("user", db_user)
    .option("password", db_password)
    .load()
)

# DataStage: Lookup Stage → PySpark: broadcast join
region_lookup = spark.read.parquet("s3a://ref-data/regions/")
enriched = customers.join(F.broadcast(region_lookup), "region_code", "left")

# DataStage: Transformer Stage → PySpark: withColumn + when
transformed = (
    enriched
    .withColumn("full_name", F.concat_ws(" ", "first_name", "last_name"))
    .withColumn(
        "customer_tier",
        F.when(F.col("lifetime_value") >= 100000, "PLATINUM")
         .when(F.col("lifetime_value") >= 50000, "GOLD")
         .when(F.col("lifetime_value") >= 10000, "SILVER")
         .otherwise("BRONZE")
    )
    .withColumn("processed_date", F.current_date())
    .filter(F.col("status") == "ACTIVE")
)

# DataStage: DB2 Connector (target) → PySpark: write
transformed.write.mode("overwrite").parquet("s3a://curated/customers/")

SSIS Package to PySpark

SSIS packages use a Control Flow (task orchestration) and Data Flow (transformation pipeline). The Data Flow maps to PySpark transformations; the Control Flow maps to Airflow. An SSIS Data Flow with Conditional Split, Derived Column, and multiple destinations translates cleanly:

from pyspark.sql import functions as F

# `orders` is the DataFrame produced by the package's source read, e.g.:
# orders = spark.read.parquet("s3a://landing/orders/")

# SSIS Conditional Split → PySpark filter
valid_orders = orders.filter(
    F.col("amount").isNotNull() & (F.col("amount") > 0)
)
invalid_orders = orders.filter(
    F.col("amount").isNull() | (F.col("amount") <= 0)
)

# SSIS Derived Column → PySpark withColumn
valid_orders = valid_orders.withColumn(
    "tax_amount", F.col("amount") * 0.08
).withColumn(
    "total_with_tax", F.col("amount") + F.col("tax_amount")
)

# SSIS OLE DB Destination (multiple) → PySpark write
valid_orders.write.mode("append").parquet("s3a://warehouse/valid_orders/")
invalid_orders.write.mode("append").parquet("s3a://quarantine/invalid_orders/")

MigryX: Idiomatic Code, Not Line-by-Line Translation

The difference between MigryX and manual migration is not just speed — it is code quality. MigryX generates idiomatic, platform-optimized code that leverages native features of your target platform. A SAS DATA step does not become a clunky row-by-row loop — it becomes a clean, vectorized DataFrame operation. A PROC SQL query does not become a literal translation — it becomes an optimized query that takes advantage of your platform’s pushdown capabilities.

Orchestration: From Legacy Schedulers to Airflow DAGs

The orchestration layer — scheduling, dependency management, error handling, and monitoring — is the second half of the migration. Legacy ETL tools bundle this into their platforms. The modern stack uses Apache Airflow, which provides a Python-based DAG (Directed Acyclic Graph) framework for defining workflows.

Airflow's architecture separates the scheduler (when to run), the executor (how to run), and the task definition (what to run). This separation means PySpark jobs run on a Spark cluster, database operations run via database connections, and file operations run on worker nodes — each using the appropriate infrastructure.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

default_args = {
    "owner": "data-engineering",
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,
    "email": ["data-alerts@company.com"],
}

dag = DAG(
    "daily_warehouse_load",
    default_args=default_args,
    schedule="0 4 * * *",  # `schedule` replaces the deprecated schedule_interval
    start_date=datetime(2026, 1, 1),  # explicit date; days_ago() is deprecated
    catchup=False,
    tags=["etl", "warehouse", "pyspark"],
)

# Step 1: Extract from source systems
extract_customers = SparkSubmitOperator(
    task_id="extract_customers",
    application="s3a://etl-code/extract/customers.py",
    conn_id="spark_emr",
    conf={
        "spark.executor.instances": "8",
        "spark.executor.memory": "8g",
        "spark.sql.adaptive.enabled": "true",
    },
    dag=dag,
)

extract_orders = SparkSubmitOperator(
    task_id="extract_orders",
    application="s3a://etl-code/extract/orders.py",
    conn_id="spark_emr",
    conf={
        "spark.executor.instances": "12",
        "spark.executor.memory": "16g",
    },
    dag=dag,
)

# Step 2: Transform and enrich (depends on both extracts)
transform = SparkSubmitOperator(
    task_id="transform_and_enrich",
    application="s3a://etl-code/transform/enrich_orders.py",
    conn_id="spark_emr",
    dag=dag,
)

# Step 3: Load to warehouse
load_warehouse = SparkSubmitOperator(
    task_id="load_warehouse",
    application="s3a://etl-code/load/warehouse_load.py",
    conn_id="spark_emr",
    dag=dag,
)

# Step 4: Run data quality checks
quality_check = SQLExecuteQueryOperator(
    task_id="data_quality_check",
    conn_id="warehouse_db",
    sql="sql/quality_checks.sql",
    dag=dag,
)

# Define dependencies (replaces legacy sequencer/precedence constraints)
[extract_customers, extract_orders] >> transform >> load_warehouse >> quality_check

The dependency syntax [extract_customers, extract_orders] >> transform means both extract tasks run in parallel, and the transform task starts only after both complete. This replaces DataStage's Sequencer, SSIS's precedence constraints, and Talend's trigger connections.

MigryX precision parser — Deep AST-level analysis ensures every construct is understood before conversion begins

Platform-Specific Optimization by MigryX

MigryX maintains deep knowledge of every target platform’s strengths and best practices. When converting to Snowflake, it leverages Snowpark and native SQL functions. When targeting Databricks, it uses PySpark DataFrame operations optimized for distributed execution. When generating dbt models, it follows dbt best practices for modularity and testability. This platform awareness is what makes MigryX output production-ready from day one.

Deployment: EMR, Kubernetes, and Cloud-Native Infrastructure

The third piece of the modernization is infrastructure. Legacy ETL tools run on dedicated on-premises servers. The modern stack runs on elastic cloud infrastructure that scales with workload.

Amazon EMR

EMR provides managed Spark clusters on AWS. For Airflow integration, use the EmrAddStepsOperator or EmrServerlessStartJobOperator to submit PySpark jobs to EMR clusters that scale automatically.

from airflow.providers.amazon.aws.operators.emr import EmrServerlessStartJobOperator

emr_job = EmrServerlessStartJobOperator(
    task_id="run_pyspark_etl",
    application_id="emr-serverless-app-id",
    execution_role_arn="arn:aws:iam::123456789:role/emr-role",
    job_driver={
        "sparkSubmit": {
            "entryPoint": "s3://code-bucket/etl_pipeline.py",
            "sparkSubmitParameters": (
                "--conf spark.executor.memory=8g "
                "--conf spark.executor.instances=20"
            ),
        }
    },
    dag=dag,
)

Kubernetes (EKS / GKE / AKS)

For organizations using Kubernetes, PySpark runs as native Kubernetes pods. The Spark driver creates executor pods dynamically, and they are destroyed when the job completes. This provides true elastic scaling with pay-per-use economics.

from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import SparkKubernetesOperator

k8s_spark_job = SparkKubernetesOperator(
    task_id="run_spark_on_k8s",
    namespace="data-engineering",
    application_file="spark-job-spec.yaml",
    dag=dag,
)

The separation of orchestration (Airflow) from compute (PySpark on EMR/K8s) means you can change your infrastructure without changing your transformation logic. Moving from EMR to Kubernetes requires changing the Airflow operator, not rewriting your PySpark code.

Migration Strategy: Phased Approach

Migrating hundreds of legacy ETL jobs is a program, not a project. A phased approach reduces risk and delivers value incrementally.

Phase 1: Inventory and prioritize. Catalog all legacy jobs with their source systems, target systems, schedules, dependencies, data volumes, and business criticality. Identify quick wins (simple transformations with high licensing cost) and defer high-risk jobs (complex real-time integrations) to later phases.
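One lightweight way to operationalize Phase 1 is a scoring function over the job catalog, balancing savings potential against conversion effort and risk. This is a sketch under assumed inputs: the catalog fields and weights below are illustrative, not a prescribed methodology:

```python
def migration_priority(job):
    """Score a legacy job for migration ordering: higher = migrate sooner.

    Favors jobs that are cheap to convert (few transformations) but expensive
    to keep (high licensing cost), while deferring high-risk real-time work.
    """
    score = 0.0
    score += job["annual_license_cost"] / 10_000  # savings potential
    score -= job["transformation_count"] * 2      # conversion effort
    score -= 50 if job["realtime"] else 0         # defer high-risk integrations
    score += 10 if job["business_critical"] else 0
    return score

catalog = [
    {"name": "cust_load", "annual_license_cost": 120_000,
     "transformation_count": 8, "realtime": False, "business_critical": True},
    {"name": "cdc_stream", "annual_license_cost": 60_000,
     "transformation_count": 30, "realtime": True, "business_critical": True},
]
waves = sorted(catalog, key=migration_priority, reverse=True)
# Simple high-license batch jobs surface first; complex real-time jobs sink
```

The exact weights matter less than having an explicit, reviewable ordering that the whole team can debate and adjust.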

Phase 2: Build the foundation. Deploy Airflow, configure Spark cluster connectivity (EMR or K8s), establish CI/CD pipelines for PySpark code, and set up monitoring and alerting. This infrastructure investment pays off across all subsequent migrations.

Phase 3: Migrate in waves. Convert legacy jobs in priority order, running the legacy and modern versions in parallel during validation. Compare outputs row-by-row to verify correctness. Decommission legacy jobs only after the modern version has been validated in production for a defined period.
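The parallel-run comparison in Phase 3 reduces to diffing the two outputs keyed by a business key. At warehouse scale you would do this in PySpark (e.g. with exceptAll over the two DataFrames), but the logic can be sketched in plain Python; the function and field names here are illustrative:

```python
def compare_outputs(legacy_rows, modern_rows, key):
    """Row-by-row reconciliation of legacy vs. migrated pipeline output.

    Returns keys missing from either side and keys whose rows differ.
    """
    legacy = {r[key]: r for r in legacy_rows}
    modern = {r[key]: r for r in modern_rows}
    return {
        "missing_in_modern": sorted(set(legacy) - set(modern)),
        "missing_in_legacy": sorted(set(modern) - set(legacy)),
        "mismatched": sorted(
            k for k in set(legacy) & set(modern) if legacy[k] != modern[k]
        ),
    }

legacy_out = [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}]
modern_out = [{"id": 1, "amt": 10}, {"id": 2, "amt": 25}, {"id": 3, "amt": 5}]
report = compare_outputs(legacy_out, modern_out, "id")
# report flags id 2 as mismatched and id 3 as present only in the modern output
```

A decommissioning gate can then be as strict as "all three lists empty for N consecutive runs".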

Phase 4: Decommission legacy. Once all jobs are migrated and validated, decommission the legacy ETL platform. This eliminates licensing costs and reduces operational overhead.

Key Takeaways

The migration from legacy ETL to PySpark with Airflow orchestration is not just a technology upgrade — it is an architectural transformation. It replaces proprietary, monolithic platforms with composable, open-source components. It replaces vertical scaling with horizontal elasticity. And it replaces point-and-click development with code that can be reviewed, tested, versioned, and deployed through modern CI/CD pipelines. For organizations spending millions on DataStage, SSIS, or Talend licensing, this migration delivers both immediate cost savings and a platform that scales with the demands of modern data engineering.

Why MigryX Delivers Superior Migration Results

The challenges described throughout this article are exactly what MigryX was built to solve.

MigryX combines precision AST parsing with Merlin AI to deliver 99% accurate, production-ready migration — turning what used to be a multi-year manual effort into a streamlined, validated process. See it in action.

Ready to modernize your legacy ETL?

See how MigryX converts DataStage, SSIS, and Talend pipelines to PySpark + Airflow with automated code generation.

Explore PySpark Migration   Schedule a Demo