Modernizing Legacy ETL to PySpark with Airflow Orchestration

April 4, 2026 · 11 min read · MigryX Team

Many enterprise data platforms are built on legacy ETL tools. IBM DataStage, Microsoft SSIS, and Talend have powered data warehouse loading, data integration, and batch processing for decades. These platforms couple three concerns into a single system: data transformation logic, execution infrastructure, and workflow orchestration (scheduling, dependencies, error handling). This coupling made sense when the tools were designed, but it creates a rigid architecture that is expensive to scale, difficult to version control, and ill-suited to modern cloud-native environments.

The modern architecture separates these concerns. PySpark handles data transformation on elastic compute infrastructure. Apache Airflow (or Prefect) handles orchestration. Cloud services (EMR, GKE, EKS) provide the infrastructure. This article examines how to migrate legacy ETL control flows to Airflow DAGs with PySpark jobs, and why this separation of concerns produces a fundamentally better architecture.

The Problem with Coupled ETL Platforms

Legacy ETL tools bundle orchestration and compute into a single platform, creating several architectural problems that compound as organizations scale.

Vertical scaling only. DataStage, SSIS, and Talend run on dedicated servers. When data volumes grow, you add RAM and CPU to the existing server. There is no horizontal scaling — you cannot spread a job across 50 machines. PySpark distributes processing across a cluster that can scale from 2 to 200 nodes based on workload.

Proprietary orchestration. Each platform has its own scheduling and dependency management. DataStage uses Sequencer jobs. SSIS uses SQL Server Agent and package execution. Talend uses TAC (Talend Administration Center). These orchestration layers are tightly coupled to the execution engine and cannot manage non-ETL tasks. Airflow orchestrates any task type — PySpark jobs, dbt models, API calls, file transfers, database procedures — in a single DAG.

No version control. Legacy ETL jobs are stored in proprietary repositories — DataStage in its own metadata repository, SSIS in SSISDB or file system packages, Talend in its workspace. These artifacts are binary or XML, making meaningful code review, branching, and merging impractical. PySpark scripts and Airflow DAGs are Python files that live in Git with full version control, pull requests, and CI/CD.

Expensive licensing. DataStage, SSIS (via SQL Server Enterprise), and Talend (commercial editions) carry significant licensing costs. PySpark and Airflow are Apache-licensed open source. The infrastructure cost exists (compute nodes, cloud services), but the software itself is free.

SAS to Apache PySpark migration — automated end-to-end by MigryX

Legacy ETL to PySpark: Transformation Mapping

Each legacy ETL platform has its own transformation vocabulary, but they all perform the same fundamental operations. Here is how common patterns map to PySpark.

| Legacy ETL Concept | DataStage | SSIS | Talend | PySpark Equivalent |
|---|---|---|---|---|
| Source read | Sequential File / DB2 Stage | OLE DB Source | tInput components | spark.read.format() |
| Filter rows | Filter Stage | Conditional Split | tFilterRow | df.filter() |
| Join datasets | Join Stage | Merge Join | tJoin | df.join() |
| Aggregate | Aggregator Stage | Aggregate | tAggregateRow | df.groupBy().agg() |
| Lookup | Lookup Stage | Lookup | tMap (lookup) | broadcast join |
| Sort | Sort Stage | Sort | tSortRow | df.orderBy() |
| Derived column | Transformer Stage | Derived Column | tMap (expression) | df.withColumn() |
| Target write | DB2 / Sequential File Stage | OLE DB Destination | tOutput components | df.write.format() |
| SCD Type 2 | SCD Stage | SCD Transform | tSCDManager | Delta Lake MERGE |
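The SCD Type 2 mapping deserves a closer look, since it is the one pattern where behavior (history tracking) matters as much as syntax. In production this becomes a Delta Lake MERGE, but the underlying logic can be sketched engine-independently. The following is a minimal illustration in plain Python, not production code; the record shape (customer_id, tier, is_current, valid_from, valid_to) is an assumed example schema:

```python
from copy import deepcopy

def scd2_merge(existing, updates, key, tracked, as_of):
    """Apply SCD Type 2 semantics to a list of record dicts.

    existing: current dimension rows, each carrying is_current/valid_from/valid_to
    updates:  incoming source rows (business key plus tracked attributes)
    """
    result = [deepcopy(r) for r in existing]
    current = {r[key]: r for r in result if r["is_current"]}
    for upd in updates:
        cur = current.get(upd[key])
        if cur is not None and all(cur[c] == upd[c] for c in tracked):
            continue  # no change in tracked attributes: keep the current version
        if cur is not None:
            cur["is_current"] = False  # expire the old version
            cur["valid_to"] = as_of
        new_row = dict(upd)
        new_row.update(is_current=True, valid_from=as_of, valid_to=None)
        result.append(new_row)
    return result

existing = [{"customer_id": 1, "tier": "GOLD", "is_current": True,
             "valid_from": "2025-01-01", "valid_to": None}]
updates = [{"customer_id": 1, "tier": "PLATINUM"},
           {"customer_id": 2, "tier": "BRONZE"}]
out = scd2_merge(existing, updates, "customer_id", ["tier"], "2026-04-01")
# out holds three rows: the expired GOLD version, the new PLATINUM version,
# and a first version for the new customer
```

A Delta Lake MERGE expresses the same expire-and-insert logic declaratively, with the added benefit of running transactionally at scale.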

DataStage Parallel Job to PySpark

A typical DataStage parallel job reads from a database, applies transformations in a Transformer stage, joins with a reference dataset, and writes to a target. The PySpark equivalent is a single script:

import os

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("CustomerETL").getOrCreate()

# DataStage: DB2 Connector (source) → PySpark: JDBC read
# Credentials come from the environment (or a secrets manager) rather than
# being hardcoded in the job.
db_user = os.environ["DB_USER"]
db_password = os.environ["DB_PASSWORD"]

customers = (
    spark.read.format("jdbc")
    .option("url", "jdbc:db2://host:50000/warehouse")
    .option("dbtable", "raw.customers")
    .option("user", db_user)
    .option("password", db_password)
    .load()
)

# DataStage: Lookup Stage → PySpark: broadcast join
region_lookup = spark.read.parquet("s3a://ref-data/regions/")
enriched = customers.join(F.broadcast(region_lookup), "region_code", "left")

# DataStage: Transformer Stage → PySpark: withColumn + when
transformed = (
    enriched
    .withColumn("full_name", F.concat_ws(" ", "first_name", "last_name"))
    .withColumn(
        "customer_tier",
        F.when(F.col("lifetime_value") >= 100000, "PLATINUM")
         .when(F.col("lifetime_value") >= 50000, "GOLD")
         .when(F.col("lifetime_value") >= 10000, "SILVER")
         .otherwise("BRONZE")
    )
    .withColumn("processed_date", F.current_date())
    .filter(F.col("status") == "ACTIVE")
)

# DataStage: DB2 Connector (target) → PySpark: write
transformed.write.mode("overwrite").parquet("s3a://curated/customers/")

SSIS Package to PySpark

SSIS packages use a Control Flow (task orchestration) and Data Flow (transformation pipeline). The Data Flow maps to PySpark transformations; the Control Flow maps to Airflow. An SSIS Data Flow with Conditional Split, Derived Column, and multiple destinations translates cleanly:

from pyspark.sql import functions as F

# `orders` is the DataFrame produced by the package's source read, e.g.:
# orders = spark.read.parquet("s3a://landing/orders/")

# SSIS Conditional Split → PySpark filter
valid_orders = orders.filter(
    F.col("amount").isNotNull() & (F.col("amount") > 0)
)
invalid_orders = orders.filter(
    F.col("amount").isNull() | (F.col("amount") <= 0)
)

# SSIS Derived Column → PySpark withColumn
valid_orders = valid_orders.withColumn(
    "tax_amount", F.col("amount") * 0.08
).withColumn(
    "total_with_tax", F.col("amount") + F.col("tax_amount")
)

# SSIS OLE DB Destination (multiple) → PySpark write
valid_orders.write.mode("append").parquet("s3a://warehouse/valid_orders/")
invalid_orders.write.mode("append").parquet("s3a://quarantine/invalid_orders/")

MigryX: Idiomatic Code, Not Line-by-Line Translation

The difference between MigryX and manual migration is not just speed — it is code quality. MigryX generates idiomatic, platform-optimized code that leverages native features of your target platform. A SAS DATA step does not become a clunky row-by-row loop — it becomes a clean, vectorized DataFrame operation. A PROC SQL query does not become a literal translation — it becomes an optimized query that takes advantage of your platform’s pushdown capabilities.

Orchestration: From Legacy Schedulers to Airflow DAGs

The orchestration layer — scheduling, dependency management, error handling, and monitoring — is the second half of the migration. Legacy ETL tools bundle this into their platforms. The modern stack uses Apache Airflow, which provides a Python-based DAG (Directed Acyclic Graph) framework for defining workflows.

Airflow's architecture separates the scheduler (when to run), the executor (how to run), and the task definition (what to run). This separation means PySpark jobs run on a Spark cluster, database operations run via database connections, and file operations run on worker nodes — each using the appropriate infrastructure.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

default_args = {
    "owner": "data-engineering",
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,
    "email": ["data-alerts@company.com"],
}

dag = DAG(
    "daily_warehouse_load",
    default_args=default_args,
    schedule="0 4 * * *",  # `schedule` replaces the deprecated schedule_interval
    start_date=datetime(2026, 1, 1),  # explicit date; days_ago() is deprecated
    catchup=False,
    tags=["etl", "warehouse", "pyspark"],
)

# Step 1: Extract from source systems
extract_customers = SparkSubmitOperator(
    task_id="extract_customers",
    application="s3a://etl-code/extract/customers.py",
    conn_id="spark_emr",
    conf={
        "spark.executor.instances": "8",
        "spark.executor.memory": "8g",
        "spark.sql.adaptive.enabled": "true",
    },
    dag=dag,
)

extract_orders = SparkSubmitOperator(
    task_id="extract_orders",
    application="s3a://etl-code/extract/orders.py",
    conn_id="spark_emr",
    conf={
        "spark.executor.instances": "12",
        "spark.executor.memory": "16g",
    },
    dag=dag,
)

# Step 2: Transform and enrich (depends on both extracts)
transform = SparkSubmitOperator(
    task_id="transform_and_enrich",
    application="s3a://etl-code/transform/enrich_orders.py",
    conn_id="spark_emr",
    dag=dag,
)

# Step 3: Load to warehouse
load_warehouse = SparkSubmitOperator(
    task_id="load_warehouse",
    application="s3a://etl-code/load/warehouse_load.py",
    conn_id="spark_emr",
    dag=dag,
)

# Step 4: Run data quality checks
quality_check = SQLExecuteQueryOperator(
    task_id="data_quality_check",
    conn_id="warehouse_db",
    sql="sql/quality_checks.sql",
    dag=dag,
)

# Define dependencies (replaces legacy sequencer/precedence constraints)
[extract_customers, extract_orders] >> transform >> load_warehouse >> quality_check

The dependency syntax [extract_customers, extract_orders] >> transform means both extract tasks run in parallel, and the transform task starts only after both complete. This replaces DataStage's Sequencer, SSIS's precedence constraints, and Talend's trigger connections.

MigryX precision parser — Deep AST-level analysis ensures every construct is understood before conversion begins

Platform-Specific Optimization by MigryX

MigryX maintains deep knowledge of every target platform’s strengths and best practices. When converting to Snowflake, it leverages Snowpark and native SQL functions. When targeting Databricks, it uses PySpark DataFrame operations optimized for distributed execution. When generating dbt models, it follows dbt best practices for modularity and testability. This platform awareness is what makes MigryX output production-ready from day one.

Deployment: EMR, Kubernetes, and Cloud-Native Infrastructure

The third piece of the modernization is infrastructure. Legacy ETL tools run on dedicated on-premises servers. The modern stack runs on elastic cloud infrastructure that scales with workload.

Amazon EMR

EMR provides managed Spark clusters on AWS. For Airflow integration, use the EmrAddStepsOperator or EmrServerlessStartJobOperator to submit PySpark jobs to EMR clusters that scale automatically.

from airflow.providers.amazon.aws.operators.emr import EmrServerlessStartJobOperator

emr_job = EmrServerlessStartJobOperator(
    task_id="run_pyspark_etl",
    application_id="emr-serverless-app-id",
    execution_role_arn="arn:aws:iam::123456789:role/emr-role",
    job_driver={
        "sparkSubmit": {
            "entryPoint": "s3://code-bucket/etl_pipeline.py",
            "sparkSubmitParameters": (
                "--conf spark.executor.memory=8g "
                "--conf spark.executor.instances=20"
            ),
        }
    },
    dag=dag,
)

Kubernetes (EKS / GKE / AKS)

For organizations using Kubernetes, PySpark runs as native Kubernetes pods. The Spark driver creates executor pods dynamically, and they are destroyed when the job completes. This provides true elastic scaling with pay-per-use economics.

from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import SparkKubernetesOperator

k8s_spark_job = SparkKubernetesOperator(
    task_id="run_spark_on_k8s",
    namespace="data-engineering",
    application_file="spark-job-spec.yaml",
    dag=dag,
)

The separation of orchestration (Airflow) from compute (PySpark on EMR/K8s) means you can change your infrastructure without changing your transformation logic. Moving from EMR to Kubernetes requires changing the Airflow operator, not rewriting your PySpark code.

Migration Strategy: Phased Approach

Migrating hundreds of legacy ETL jobs is a program, not a project. A phased approach reduces risk and delivers value incrementally.

Phase 1: Inventory and prioritize. Catalog all legacy jobs with their source systems, target systems, schedules, dependencies, data volumes, and business criticality. Identify quick wins (simple transformations with high licensing cost) and defer high-risk jobs (complex real-time integrations) to later phases.
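One lightweight way to operationalize Phase 1 is a scoring function over the job catalog, balancing savings potential against conversion effort and risk. This is a sketch under assumed inputs: the catalog fields and weights below are illustrative, not a prescribed methodology:

```python
def migration_priority(job):
    """Score a legacy job for migration ordering: higher = migrate sooner.

    Favors jobs that are cheap to convert (few transformations) but expensive
    to keep (high licensing cost), while deferring high-risk real-time work.
    """
    score = 0.0
    score += job["annual_license_cost"] / 10_000  # savings potential
    score -= job["transformation_count"] * 2      # conversion effort
    score -= 50 if job["realtime"] else 0         # defer high-risk integrations
    score += 10 if job["business_critical"] else 0
    return score

catalog = [
    {"name": "cust_load", "annual_license_cost": 120_000,
     "transformation_count": 8, "realtime": False, "business_critical": True},
    {"name": "cdc_stream", "annual_license_cost": 60_000,
     "transformation_count": 30, "realtime": True, "business_critical": True},
]
waves = sorted(catalog, key=migration_priority, reverse=True)
# Simple high-license batch jobs surface first; complex real-time jobs sink
```

The exact weights matter less than having an explicit, reviewable ordering that the whole team can debate and adjust.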

Phase 2: Build the foundation. Deploy Airflow, configure Spark cluster connectivity (EMR or K8s), establish CI/CD pipelines for PySpark code, and set up monitoring and alerting. This infrastructure investment pays off across all subsequent migrations.

Phase 3: Migrate in waves. Convert legacy jobs in priority order, running the legacy and modern versions in parallel during validation. Compare outputs row-by-row to verify correctness. Decommission legacy jobs only after the modern version has been validated in production for a defined period.
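The parallel-run comparison in Phase 3 reduces to diffing the two outputs keyed by a business key. At warehouse scale you would do this in PySpark (e.g. with exceptAll over the two DataFrames), but the logic can be sketched in plain Python; the function and field names here are illustrative:

```python
def compare_outputs(legacy_rows, modern_rows, key):
    """Row-by-row reconciliation of legacy vs. migrated pipeline output.

    Returns keys missing from either side and keys whose rows differ.
    """
    legacy = {r[key]: r for r in legacy_rows}
    modern = {r[key]: r for r in modern_rows}
    return {
        "missing_in_modern": sorted(set(legacy) - set(modern)),
        "missing_in_legacy": sorted(set(modern) - set(legacy)),
        "mismatched": sorted(
            k for k in set(legacy) & set(modern) if legacy[k] != modern[k]
        ),
    }

legacy_out = [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}]
modern_out = [{"id": 1, "amt": 10}, {"id": 2, "amt": 25}, {"id": 3, "amt": 5}]
report = compare_outputs(legacy_out, modern_out, "id")
# report flags id 2 as mismatched and id 3 as present only in the modern output
```

A decommissioning gate can then be as strict as "all three lists empty for N consecutive runs".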

Phase 4: Decommission legacy. Once all jobs are migrated and validated, decommission the legacy ETL platform. This eliminates licensing costs and reduces operational overhead.

Key Takeaways

The migration from legacy ETL to PySpark with Airflow orchestration is not just a technology upgrade — it is an architectural transformation. It replaces proprietary, monolithic platforms with composable, open-source components. It replaces vertical scaling with horizontal elasticity. And it replaces point-and-click development with code that can be reviewed, tested, versioned, and deployed through modern CI/CD pipelines. For organizations spending millions on DataStage, SSIS, or Talend licensing, this migration delivers both immediate cost savings and a platform that scales with the demands of modern data engineering.

Why MigryX Delivers Superior Migration Results

The challenges described throughout this article are exactly what MigryX was built to solve.

MigryX combines precision AST parsing with Merlin AI to deliver 99% accurate, production-ready migration — turning what used to be a multi-year manual effort into a streamlined, validated process. See it in action.

Ready to modernize your legacy ETL?

See how MigryX converts DataStage, SSIS, and Talend pipelines to PySpark + Airflow with automated code generation.

Explore PySpark Migration   Schedule a Demo