Modern Data Migration: From SQL and SSIS to Airflow and Parquet

What we traded, what we gained, and what we're pretending we didn't lose

Data migration has not fundamentally changed. It's still Extract, Transform, Load at its core. It's still moving data from one place to another place, hopefully without losing any of it or corrupting it along the way. The problem is the same problem it was fifteen years ago. What has changed, though, is the tooling, the complexity, the number of moving parts, and the sheer surface area available for things to go catastrophically wrong while everyone involved nods confidently and says the pipeline is working fine.

The old tools were boring. They did what they said they'd do, and when they didn't, you knew about it immediately. New tools are flexible, distributed, cloud-native, and extremely good at hiding mistakes until someone notices the numbers don't add up and starts asking uncomfortable questions.

This is not a story about progress. It's a story about what we traded, what we gained, and what we're pretending we didn't lose.

The Olden Days: When Boring Was Honest

Data migrations used to be straightforward - not because the people doing them were simpler or less ambitious, but because the tools were limited in ways that forced a sort of "honesty". You had SQL scripts. You had SSIS packages (well, maybe, but this is my story and this is how I am telling it). You had stored procedures. You had a source database and a target database, and both of them were things you could actually query, right now, without waiting for a Spark job to finish or an S3 bucket to sync.

The process was crude but visible. You'd write a SELECT statement that pulled data from the source. Then you'd write an INSERT statement that put it in the target. If you were feeling fancy, you'd add some transformation logic in between. Maybe you'd use SSIS to handle the orchestration, dragging boxes around in a visual designer that looked like it was built by someone who'd never heard of user experience but at least showed you exactly what was happening and in what order.

Validation was equally crude. Row counts. Checksums. Spot checks. You'd run a query that counted rows in the source table, run another query that counted rows in the target table, and if the numbers matched, you'd call it a success. If they didn't match, you'd investigate. Immediately. Because the failure was right there, impossible to ignore. The data was in a database. You could query it. You could see it. You could join the source and target tables and find the exact rows that didn't make it across.
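
That whole validation loop fits in a few lines. Here's a minimal sketch using two in-memory SQLite databases to stand in for the source and target servers; the table and row values are invented for illustration:

```python
import sqlite3

# Two in-memory databases stand in for the source and target servers.
src = sqlite3.connect(":memory:")
tgt = sqlite3.connect(":memory:")
for db in (src, tgt):
    db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
src.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "Ada"), (2, "Grace"), (3, "Edsger")])
# Simulate a migration that silently dropped a row.
tgt.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "Ada"), (2, "Grace")])

# Crude but visible: count rows on both sides.
src_count = src.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
tgt_count = tgt.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(src_count, tgt_count)  # 3 2 -- mismatch, investigate immediately

# And because both sides are queryable, find the exact missing rows.
tgt_ids = {row[0] for row in tgt.execute("SELECT id FROM customers")}
missing = [row for row in src.execute("SELECT id, name FROM customers")
           if row[0] not in tgt_ids]
print(missing)  # [(3, 'Edsger')]
```

The point isn't the code - it's that both sides are a query away, so the failure is impossible to hide.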

When something broke, you knew where it broke. The SSIS package would fail on a specific task. The stored procedure would throw an error on a specific line. The SQL script would violate a constraint and tell you exactly which constraint and which row. The feedback loop was tight. The failure modes were obvious. You didn't need to be a distributed systems expert to understand what went wrong. You just needed to read the error message and look at the code.

This is not nostalgia, because SSIS was not elegant. SQL Server Integration Services was a Microsoft product from an era when Microsoft's idea of good design was "make it work in Windows Server 2003 and call it enterprise-ready." The visual designer was clunky. The XML configuration files were a nightmare. The deployment process involved copying packages to servers and hoping the connection strings were right. It was not beautiful. It was not fashionable. It was not the kind of thing you'd put on your CV if you wanted to work at a startup that served artisanal data pipelines.

But it was inspectable. It was debuggable, deterministic. You could open the package, see exactly what it was doing, and understand the entire flow in about ten minutes. You could run it locally, step through it, watch the data move from source to target. You could test it properly because the entire thing ran on your machine or on a server you could actually access. The complexity was contained. The failure surface was small.

The problem was already solved. Not perfectly or elegantly but well enough. The tools weren't fashionable, but they worked, and more importantly, when they didn't work, you could figure out why without needing a PhD in distributed systems or a three-day course in Kubernetes troubleshooting.

When Scale Became the Excuse

Then things...changed. Not because the old approach stopped working, but because the context shifted in ways that made the old tools feel inadequate. Data volumes increased and systems fragmented. The monolithic database that used to hold everything got carved up into microservices (this is probably another post), each with its own data store, each speaking its own API, each maintained by a different team with different priorities and different ideas about what "data quality" means.

The cloud happened! Suddenly, everything was supposed to be distributed, scalable, cloud-native. On-premise servers became legacy. SQL Server became something you apologised for using. The idea of running a migration script directly against a production database became unthinkable, not because it was technically wrong, but because it wasn't modern. It wasn't what the cool kids were doing.

Vendors started selling pipelines instead of scripts. Orchestration became a first-class concern. You couldn't just run a stored procedure anymore - now you needed a workflow engine. You needed retry logic. You needed monitoring. You needed observability. You needed a dashboard that showed green checkmarks so management could feel confident that the data was flowing, even if nobody actually checked whether the data was correct.

Now I will admit that the narrative was compelling. The old tools don't scale. They're not cloud-native. They're not flexible enough for modern architectures. They're not designed for the kind of distributed, event-driven, microservices-based world we live in now. And all of that was true, to a point. SSIS doesn't scale horizontally. SQL scripts don't handle schema evolution gracefully. Stored procedures don't play nicely with REST APIs and message queues.

But what nobody actually said out loud (at least if they wanted to keep their job) was that the new tools don't solve the fundamental problem any better. They just move the complexity around and make it someone else's problem. They make it harder to see when things go wrong. They introduce new failure modes that the old tools never had to worry about because the old tools were too simple to fail in those particular ways.

This wasn't pointless, of course, and there were real reasons behind it. But it wasn't free. Every layer of abstraction, every new tool in the stack, every decision to decouple and distribute and make things more flexible came with a cost. And that cost was paid in observability, in debuggability, in the ability to actually understand what your data pipeline is doing and whether it's doing it correctly.

What the Cool Kids Use Now

Let's describe the modern approach without (much) judgement, just what people are actually doing when they migrate data in 2026.

You extract data via APIs or Change Data Capture. You don't query the source database directly anymore. That's not done. Instead, you hit an API endpoint, or you set up CDC to stream changes to a message queue, or you use a vendor tool that promises to handle the extraction for you with minimal configuration. The data comes out as JSON or Avro or some other format that's flexible and self-describing and absolutely terrible for actually working with.

You land the data in object storage. S3, usually, or Azure Blob Storage if you're in the Microsoft ecosystem. The data sits there as files. Lots of files. Partitioned by date, maybe, or by some other key that seemed like a good idea at the time. You can't query it directly. Well, you can, technically, with the right tools, but it's not like querying a database. It's slower. It's more complicated. It requires knowing which files to look at and how they're structured.

You use columnar formats. Parquet, usually. Or ORC if you're in the Hadoop world. These formats are efficient for analytical queries. They compress well. They're good for reading subsets of columns without scanning entire rows. They're a sensible choice for data warehousing. They're also opaque. You can't just open a Parquet file in a text editor and see what's in it. You need special tools. You need to know the schema. You need to understand how the data is partitioned and compressed.

You orchestrate with Airflow. Or Prefect. Or Dagster. Or whatever the new hotness is this year. You define your pipeline as a Directed Acyclic Graph. You write Python code that describes the dependencies between tasks. You set up retry logic and alerting and monitoring. You deploy it to a cluster somewhere. You watch the dashboard turn green and assume that means everything worked.
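
Strip away the scheduler, the web UI, and the cluster, and a DAG is just tasks and the dependencies between them. Here's the shape of the idea in plain Python using the standard library's graphlib, with hypothetical task names - not Airflow itself, but the same structure an Airflow DAG encodes:

```python
from graphlib import TopologicalSorter

# A pipeline as a DAG: each task maps to the set of tasks it depends on.
# Task names are hypothetical; Airflow's `>>` operator builds the same graph.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"transform"},
    "load": {"validate"},
}

# The orchestrator's core job: run tasks in dependency order.
order = list(TopologicalSorter(dag).static_order())
print(order)  # ['extract', 'transform', 'validate', 'load']
```

Everything else the orchestrator adds - retries, alerting, the green checkmarks - is wrapped around this ten-line idea.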

You transform with Python or Spark. You write code that reads from the object storage, applies transformations, and writes back to object storage. Maybe you use Pandas if the data is small enough. Maybe you use PySpark if it's not. Maybe you use Dask or Ray or some other distributed computing framework that promises to make everything faster and easier. The transformations are code now, not SQL. They're flexible. They're testable, in theory. They're also harder to understand if you're not the person who wrote them.
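
"Testable, in theory" is worth taking literally. A transformation written as a pure function - record in, record out - can at least be unit-tested in isolation. A minimal sketch, with invented field names:

```python
def transform(record: dict) -> dict:
    """Normalise one source record into the target shape.

    Field names are hypothetical; the point is that the transformation
    is a pure function you can test without a cluster.
    """
    return {
        "customer_id": int(record["id"]),
        "full_name": record["name"].strip().title(),
        "revenue": round(float(record.get("revenue", 0) or 0), 2),
    }

row = transform({"id": "42", "name": "  ada lovelace ", "revenue": "19.991"})
print(row)
# {'customer_id': 42, 'full_name': 'Ada Lovelace', 'revenue': 19.99}
```

Whether anyone actually writes those tests is, of course, a separate question.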

You load via bulk APIs. Salesforce Bulk API is one I've used a lot. GraphQL mutations also. Whatever the target system exposes. You don't INSERT directly into the target database anymore. You send data to an API and trust that it ends up in the right place. You handle rate limits. You handle retries. You handle partial failures. You handle all the things that weren't problems when you were just running INSERT statements.
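
All of that handling ends up looking roughly like this. A sketch of a bulk load with exponential backoff, where `send` is a hypothetical callable standing in for the real API client and the delays are shrunk for the demo - not any vendor's actual SDK:

```python
import time

def load_batch(records, send, max_attempts=3, base_delay=0.01):
    """Send a batch to a bulk API, backing off and retrying on rate limits.

    `send` is a stand-in for the real client; here a RuntimeError plays
    the role of a 429 response.
    """
    for attempt in range(max_attempts):
        try:
            return send(records)
        except RuntimeError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            time.sleep(base_delay * 2 ** attempt)

# A fake endpoint that rate-limits the first call, then accepts.
calls = {"n": 0}
def flaky_send(records):
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("429 Too Many Requests")
    return {"accepted": len(records)}

result = load_batch([{"id": 1}, {"id": 2}], flaky_send)
print(result)  # {'accepted': 2}
```

Note what this code can't tell you: that "accepted" means "processed correctly". The API said yes; whether the records actually landed is a different question entirely.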

Each layer is best practice. Each layer is what you're supposed to do if you want to build a modern, scalable, cloud-native data platform. Each layer is recommended by vendors, by consultants, by blog posts, by conference talks. Each layer makes perfect sense in isolation.

Each layer also introduces failure modes that didn't exist before. The API might return a 200 status code but silently drop records. The object storage might have eventual consistency issues. The Parquet files might have schema drift. The Airflow DAG might succeed even though half the tasks failed and retried and eventually gave up. The Spark job might run out of memory and spill to disk and take six hours instead of six minutes. The bulk API might accept the data but fail to process it and not tell you until three days later when someone notices the numbers are wrong.

This isn't new versus old. This is the same outcome with more steps. You're still moving data from one place to another place. You're still trying to ensure correctness and completeness. You're still trying to make it repeatable and reliable. You've just added a dozen layers of indirection, each with its own failure modes, each with its own monitoring requirements, each with its own expertise needed to debug when things go wrong.

The Hidden Costs

Okay, time to talk about what got worse.

You have dashboards now...lots and lots and lots of dashboards. Grafana dashboards showing metrics. Airflow dashboards showing DAG runs. Cloud provider dashboards showing resource utilisation. Every tool in the stack has its own dashboard, its own metrics, its own idea of what "healthy" means. And all of them are lying to you, or at least not telling you the whole truth.

The Airflow DAG is green. That's what the dashboard says. All tasks completed successfully. No errors. No retries. Everything worked. Except it didn't. The task that loads data into Salesforce succeeded, according to Airflow, because the API returned a 200 status code. But Salesforce has its own internal processing queue, and half the records failed validation and got dropped, and you won't know about it until you run a reconciliation report, if you even think to run one.

The metrics say the pipeline processed ten million records. That's true. It did process ten million records. What the metrics don't tell you is that three million of those records were duplicates because the retry logic kicked in and reprocessed the same batch twice. What they don't tell you is that five hundred thousand records had null values in critical fields because the API schema changed and nobody updated the transformation code. What they don't tell you is that the data is wrong, but in ways that are subtle enough that nobody notices until it's too late.

Debugging became archaeology. When something goes wrong, you don't get a stack trace pointing to line 47 of your stored procedure. You get a log file. Actually, you get dozens of log files, scattered across different systems, each with its own format, each with its own timestamp format, each requiring different tools to access.

The data itself became harder to query. It's in object storage now. It's in Parquet files. You can't just run a SELECT statement and see what's there. You need to spin up a Spark cluster, or use Athena, or use some other query engine that understands columnar formats. You need to know which partition to look at. You need to know the schema. You need to wait for the query to run, which might take minutes or hours depending on how much data you're scanning.

In the old world, validation was your problem. You wrote the migration script, you wrote the validation queries, you ran them, you checked the results. In the new world, validation is distributed across the pipeline. The extraction layer validates that it got data from the source. The transformation layer validates that the transformations applied correctly. The loading layer validates that the target system accepted the data. Each layer does its own validation, and each layer reports success, and somehow the end result is still wrong because nobody validated the entire flow end-to-end.
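
End-to-end validation doesn't have to be elaborate. One workable sketch: fingerprint each record's content on both sides of the whole pipeline and compare, which catches both missing records and silently drifted values. The record shapes here are invented:

```python
import hashlib

def fingerprint(record: dict) -> str:
    """Stable content hash of a record, for source-vs-target comparison."""
    canonical = "|".join(f"{k}={record[k]}" for k in sorted(record))
    return hashlib.sha256(canonical.encode()).hexdigest()

source = [{"id": 1, "amount": "10.00"}, {"id": 2, "amount": "25.50"}]
target = [{"id": 1, "amount": "10.00"}, {"id": 2, "amount": "25.5"}]  # drifted

src_map = {r["id"]: fingerprint(r) for r in source}
tgt_map = {r["id"]: fingerprint(r) for r in target}

missing = set(src_map) - set(tgt_map)
mismatched = [i for i in src_map if i in tgt_map and src_map[i] != tgt_map[i]]
print(missing, mismatched)  # set() [2]
```

Every layer in between reported success; only comparing the two ends catches record 2. That comparison is nobody's layer, which is exactly why it doesn't happen.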

Partial success became the norm. The old tools were all-or-nothing. The transaction either committed or it rolled back. The SSIS package either succeeded or it failed. There was no middle ground. The new tools embrace partial success. The Spark job processed ninety-eight percent of the records successfully. That's good enough, right? The bulk API accepted most of the data. Close enough. The pipeline completed with warnings. Ship it.

Retries became a source of corruption, not resilience. Retry logic is supposed to make pipelines more reliable. If a task fails, retry it. If the API is rate-limiting you, back off and try again. Sounds sensible. In practice, retries cause duplicates. They cause data to be processed multiple times. They cause race conditions. They cause the same record to be inserted twice, or updated with stale data, or deleted and then recreated. And because the retry happened automatically, in the background, orchestrated by Airflow or whatever tool you're using, you don't even know it happened unless you go digging through logs.
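
The duplication mechanism is mundane. If the load step appends, a retry reprocesses the batch; if it upserts on a key, replaying is harmless. A toy illustration, with invented records:

```python
def naive_load(batch, store):
    """Append-only load: replaying the batch duplicates every record."""
    store.extend(batch)

def idempotent_load(batch, store):
    """Keyed upsert: replaying the same batch is a no-op."""
    store.update({r["id"]: r for r in batch})

batch = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]

rows = []
naive_load(batch, rows)
naive_load(batch, rows)          # the automatic, invisible retry
print(len(rows))                 # 4 -- duplicates

table = {}
idempotent_load(batch, table)
idempotent_load(batch, table)    # same retry, no damage
print(len(table))                # 2
```

Retry logic is only resilience if every step it replays is idempotent. Otherwise it's a duplicate-record generator running in the background.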

Schema drift became invisible. Parquet files are flexible. They're self-describing. If the schema changes, the file just adapts. New columns appear. Old columns disappear. Data types change. And your pipeline keeps running, keeps processing data, keeps reporting success, because the tools are designed to handle schema evolution gracefully. Except "gracefully" means "silently," and "silently" means you don't notice until someone asks why the revenue numbers are off by thirty percent and you realise that the API started returning strings instead of decimals three weeks ago and your transformation code has been treating them as nulls ever since.
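
The antidote to "gracefully" is a schema contract that fails loudly. A minimal sketch - the expected field types here are a hypothetical contract, not anything Parquet enforces for you:

```python
EXPECTED = {"id": int, "revenue": float}  # hypothetical contract

def check_schema(record: dict, expected=EXPECTED):
    """Return a list of drift errors instead of silently coercing to null."""
    errors = []
    for field, typ in expected.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], typ):
            errors.append(
                f"{field}: expected {typ.__name__}, "
                f"got {type(record[field]).__name__}"
            )
    return errors

print(check_schema({"id": 7, "revenue": 19.99}))    # []
print(check_schema({"id": 7, "revenue": "19.99"}))  # the string-vs-decimal
                                                    # drift, caught on day one
```

Three weeks of wrong revenue numbers versus one failed pipeline run on the day the API changed. The second is embarrassing; the first is expensive.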

This is the cost. This is what got traded. We got scalability, flexibility, cloud-native architecture, and tools that look enterprise-grade and dashboards that make management happy. What we lost was the ability to understand what the pipeline is actually doing, to debug it when it breaks, to validate that it's working correctly, to trust that success means success and not just "no errors were logged."

Why People Still Trust It More

Despite everything I've just described, people trust modern pipelines more than they trusted the old tools. Not because modern pipelines are more reliable or easier to validate. People trust them because they look trustworthy.

Airflow looks enterprise-grade. It has a web UI. It has a scheduler. It has a database backend. It has monitoring and alerting built in. It looks like something a serious company would use. It looks like something that was designed by people who know what they're doing. It looks professional. SSIS looked like a Windows application from 2005, because, well, it was. Airflow looks like a modern web application, because it is. Appearance matters. Perception matters. Tools that look sophisticated are assumed to be sophisticated.

Cloud storage feels durable. It's in the cloud. It's replicated across multiple availability zones. It has eleven nines of durability, according to the marketing materials. It's managed by Amazon or Microsoft or Google, companies with entire teams dedicated to keeping your data safe. It feels more reliable than a SQL Server database running on a server in your office.

APIs feel official. If the target system exposes an API, that must be the right way to load data into it. That must be what you're supposed to use. Never mind that a direct database insert would be faster and more reliable. The API is the official interface. Using it feels like doing things properly.

Dashboards imply correctness. If there's a dashboard showing metrics, if there are graphs trending upward, if there are green checkmarks indicating success, then the pipeline must be working. The dashboard wouldn't lie. The metrics wouldn't be wrong. The fact that you can see the pipeline running, see the tasks completing, see the data flowing through the system, creates a sense of confidence. It creates the illusion of control. It makes you feel like you know what's happening, even if you don't.

Modern tooling increased confidence without increasing certainty. That's the sleight of hand. The tools got better at making you feel good about the pipeline without actually making the pipeline more reliable. They got better at hiding problems, at smoothing over failures, at presenting a polished interface that suggests everything is under control.

Now, nobody set out to build tools that lie to you, but the incentives are all wrong. Tool vendors want adoption. They want happy customers. They want you to deploy the tool, see the dashboard turn green, and move on to the next project.

And it works. They trust the tools because the tools are designed to be trusted. They trust the process because the process looks professional. They trust the architecture because the architecture is what everyone else is using. And when something goes wrong, when the data doesn't match, when the numbers are off, they assume it's their fault. They assume they configured something incorrectly, or wrote the transformation code wrong, or didn't understand the API properly. They don't question the tools. They don't question the approach. They just try to fix their specific problem and move on.

The old tools didn't inspire confidence. They were clunky as hell and looked like they might break at any moment (and often did). And that was honest. Because data migration is fragile. It's error-prone. It's the kind of thing that requires constant vigilance and careful validation. The old tools looked like they needed supervision, so people supervised them. The new tools look like they don't need supervision, so people don't supervise them. And that's when things go wrong.

What Really, Really Matters?

Now let's talk about what actually matters when you're migrating data, regardless of what tools you're using or what architecture you've chosen or what the latest blog post says you should be doing.

The problem didn't change. You're still moving data from one place to another place. You're still trying to ensure that the data arrives intact, that it's complete, that it's correct, that it matches what was in the source. You're still trying to make the process repeatable, so you can run it again if you need to, so you can test it properly, so you can trust that it will work in production the same way it worked in development.

Responsibility didn't disappear. Someone is still responsible for ensuring the migration works. Someone still needs to validate the results. Someone still needs to debug the failures. Someone still needs to answer the question "did all the data make it across?" And that someone is probably you if you are reading this.

Testing became harder, not optional. With the old tools, testing was straightforward. You'd run the migration against a copy of the production database. You'd check the results. You'd run it again to make sure it was repeatable. You'd test edge cases. You'd test failure scenarios. It was tedious, but it was doable. With modern pipelines, testing is complicated. You need to mock APIs. You need to set up test environments in the cloud. You need to generate test data that matches production patterns. You need to test each layer of the pipeline independently and then test the entire flow end-to-end. It's harder. It takes longer. It requires more expertise. But it's not optional. If anything, it's more important now than it was before, because the failure modes are more subtle and the tools are better at hiding problems.

Data migration success is still measured by three things. Correctness: is the data accurate? Are the values right? Are the transformations applied correctly? Completeness: did all the data make it across? Are there any missing records? Any gaps in the timeline? Repeatability: can you run the migration again and get the same result? Can you test it properly? Can you trust that it will work in production?

Tooling does not guarantee any of these things. The tools are just tools - mechanisms for moving data around. They don't validate the data. They don't check for errors. They don't ensure that what you intended to happen actually happened. That's your job. That's always been your job. The tools just make it easier or harder to do that job, depending on how well they're designed and how well you understand them.

Pretending modern pipelines are inherently safer is a mistake. They're not safer. They're different. They have different failure modes. They require different expertise. They make different trade-offs. They're better at some things and worse at others. They're appropriate for some use cases and inappropriate for others. Choosing them because they're modern, because they're what everyone else is using, because they look more sophisticated, is not a good reason. Choosing them because they actually solve your specific problem better than the alternatives is a good reason.

In reality, what we have done is replace SQL we understood with pipelines we trust. We replaced tools that were simple and visible with tools that are complex and opaque. We did this for real reasons. Scale. Flexibility. Cloud-native architecture. Decoupling. But we also did it because it felt like progress. Because it felt like we were moving forward, adopting modern practices, building systems that would last.

All we have actually done is delegate the complexity to the tools. We delegated the validation to the pipeline. We delegated the responsibility to the architecture. And then we trusted that it would all work out, that the tools would handle it, that the dashboards would tell us if something went wrong. And sometimes it does work out. Sometimes the tools do handle it. Sometimes the dashboards do tell you. But not always. Not in ways you can depend on when it really matters.

Data migration is still hard. It's still error-prone. It's still the kind of work that requires expertise and attention and careful validation. The tools changed. The complexity increased. The failure surface expanded. But the fundamental problem remained the same. And the fundamental solution remained the same: understand what you're doing, test it properly, validate the results, and don't trust the tools to do your job for you.

That's the boring reality. Everything else is just tooling.