
My Journey from SQL Server to Delta Lake: Solving Semi-Structured Data Challenges with Azure Synapse

Ever faced semi-structured data constantly breaking your ETL processes at 3 AM? My solution: replacing our rigid SQL architecture with a flexible Delta Lake approach using Azure Synapse. Sometimes the best solution isn't a perfect data structure - it's embracing adaptability.

Let me tell you a story about a data architecture that was slowly turning into a money pit. You know that feeling when you've built something that works, but deep down, you know it's not going to scale? That was me about a year ago, staring at our Azure SQL costs and wondering how things got so complicated.

Introduction

I've always been a SQL Server person at heart - I love the predictability, the performance, the tooling. But when our data started evolving faster than our schema could handle, I found myself in a classic dilemma. Our XML data was getting more complex by the week, with fields appearing and disappearing like whack-a-mole. Every time the upstream system added a new attribute, our ETL would break, and I'd be the one getting those lovely 3 AM alerts.

Sound familiar? It should, because it's probably one of the most common challenges in modern data engineering - how do you handle constantly evolving data with systems built around rigid schemas?

The Business Problem

Let me paint the picture: Our Azure SQL Server expenses were skyrocketing, with storage and backup fees accumulating at an alarming rate. We were storing these massive XML documents - sometimes reaching several MBs each - and trying to query them efficiently. Our business users needed to extract insights, but the conversation usually went something like this:

Users: "We need to query this data."
Me: "Great! What fields do you need?"
Users: "Well, we're not sure yet. We need to explore it first."
Me: internal screaming 😱

The traditional approach would be to build a schema, transform the XML, and load it into tables. But with constantly evolving data, we were spending more time updating schemas than actually delivering value. Every change meant another rebuild, test, and deploy cycle - repetitive and frustrating for everyone involved.

Why I Chose the Lakehouse Approach

I'll never forget that Azure Friday episode where I first heard about Delta Lake. It was one of those lightbulb moments.

Delta Lake promised the best of both worlds - the flexibility of data lakes with the reliability of data warehouses. So I sketched out a medallion architecture:

  • Bronze layer: Raw XML files dropped straight into Azure Blob Storage
  • Silver layer: Parsed XML stored in Delta format, but intentionally keeping arrays and nested structures intact
  • Gold layer: Initially, I planned this to be in SQL Server with properly normalized tables

In my mind, this was perfect - raw data preserved, processing simplified, and business users happy with their familiar SQL queries. What could possibly go wrong?
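
To make the silver layer a bit more concrete, here's a minimal Spark SQL sketch of the kind of Delta table I have in mind. The names (silver.xmldata, documentId, xmlData, metadata, loadDate, and the storage path) are illustrative placeholders, not our actual schema:

-- Illustrative silver table: parsed XML kept as JSON text, arrays left intact
CREATE TABLE IF NOT EXISTS silver.xmldata (
    documentId STRING,  -- illustrative business key
    xmlData    STRING,  -- parsed document kept as JSON text, nested structures preserved
    metadata   STRING,  -- source metadata (batch id, load timestamp, ...)
    loadDate   DATE     -- partition column
)
USING DELTA
PARTITIONED BY (loadDate)
LOCATION 'abfss://data@examplestorage.dfs.core.windows.net/silver/xmldata';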

The Schema Evolution Roadblock

Well, I'll tell you what went wrong - reality! (Isn't that always the case?)

After implementing the bronze and silver layers, I was feeling pretty confident. Spark jobs were parsing the XML and writing to Delta tables. Then came the inevitable source system change, and instead of our pipeline breaking... it just worked! The silver layer happily absorbed the new fields without complaint.
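
For anyone curious how that "it just worked" moment happens, it comes down to Delta Lake's schema evolution. Here's a hedged Spark SQL sketch of the idea rather than our actual job - staged_xml and documentId are hypothetical names:

-- Let the write add columns that newly appear in the parsed source
SET spark.databricks.delta.schema.autoMerge.enabled = true;

MERGE INTO silver.xmldata AS target
USING staged_xml AS source                   -- hypothetical view over the freshly parsed XML batch
ON target.documentId = source.documentId     -- hypothetical key
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;              -- new source columns get appended to the table schema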

But then I tried to update the gold layer in SQL Server, and that's where my dream hit a brick wall. Each schema change meant:

  1. Altering tables
  2. Updating views
  3. Modifying stored procedures
  4. Rewriting ETL logic
  5. Testing everything (again)
  6. Deploying changes (carefully)
  7. Praying nothing broke

It was clear the SQL Server gold layer was going to defeat the entire purpose of our flexible architecture. But here's the rub - our users still needed SQL access, and most of them break out in hives at the mere mention of Python or Spark. I needed a way to give them SQL without the schema lock-in.

SQL Serverless to the Rescue (Almost)

When I discovered Azure Synapse SQL Serverless, I thought I'd found the holy grail. On paper, it was everything we needed:

  • Pay-per-query pricing (goodbye fixed costs!)
  • SQL interface for our business users
  • Ability to query Delta Lake directly
  • No infrastructure to manage

We quickly pivoted our architecture to replace the gold layer with Synapse SQL Serverless views over our Delta tables.

And then... the technical gremlins emerged.

The Technical Challenges I Faced

Have you ever been so excited about a solution that you glossed over the limitations in the documentation? Yeah, me too.

Issue #1: The 8000 Character Limit Monster
Our first shock came when trying to query complex XML arrays. SQL Serverless has an 8000 character limit for string fields in external tables, and our nested JSON structures were way beyond that. Queries started failing with cryptic truncation errors.

Issue #2: The Case of the Disappearing Partitions
Delta Lake has amazing partition pruning capabilities that make queries lightning fast. But for some reason, when creating external tables in SQL Serverless, the partition columns would just appear as NULL and our queries were scanning the entire dataset.

I spent an embarrassing number of late nights trying different approaches, reading obscure forum posts, and generally questioning my career choices.

Engineering the Solutions

After much experimentation (and coffee), I finally cracked it. The solution wasn't in external tables at all - it was in the humble OPENROWSET function.

Instead of creating external tables like this:

CREATE EXTERNAL TABLE dbo.MyXmlData (
    xmlData  VARCHAR(8000),   -- string columns here were effectively capped at 8000 characters
    metadata VARCHAR(8000)
)
WITH (
    LOCATION = 'silver/xmldata',
    DATA_SOURCE = DeltaLakeStorage,
    FILE_FORMAT = DeltaLakeFormat
);

I switched to dynamic queries using OPENROWSET with explicit schema definitions:

SELECT
    result.metadata,
    col.[key]   AS itemIndex,
    col.[value] AS itemJson
FROM OPENROWSET(
    BULK 'silver/<Delta Lake folder>',
    DATA_SOURCE = 'DeltaLakeStorage',
    FORMAT = 'delta'
) WITH (
    xmlData  VARCHAR(MAX),   -- VARCHAR(MAX) sidesteps the 8000 character truncation
    metadata VARCHAR(MAX)
) AS [result]
CROSS APPLY OPENJSON(result.xmlData) AS col;   -- default OPENJSON exposes [key]/[value] per array element

The magic was in that VARCHAR(MAX) definition - it bypassed the 8000 character limit completely! And by specifying the Delta Lake folder path in the BULK parameter, OPENROWSET automatically identified and exposed the partitioning columns, so I got proper partition pruning back.
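
To show what that looks like in practice, here's a hedged sketch of a partition-pruned query - loadDate is an illustrative partition column name, and including it in the WITH clause exposes it to the WHERE filter:

SELECT COUNT(*) AS documentCount
FROM OPENROWSET(
    BULK 'silver/<Delta Lake folder>',
    DATA_SOURCE = 'DeltaLakeStorage',
    FORMAT = 'delta'
) WITH (
    loadDate DATE,           -- partition column surfaced from the Delta table
    xmlData  VARCHAR(MAX)
) AS result
WHERE result.loadDate >= '2024-01-01';   -- only matching partitions get scanned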

For the nested arrays, OPENJSON became my best friend - it let us explode arrays at query time rather than during the data load process.
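
Here's a hedged sketch of that pattern - the JSON path and property names ($.lineItems, sku, quantity) are made up for illustration, since I can't share the real payload:

SELECT
    result.metadata,
    item.sku,
    item.quantity
FROM OPENROWSET(
    BULK 'silver/<Delta Lake folder>',
    DATA_SOURCE = 'DeltaLakeStorage',
    FORMAT = 'delta'
) WITH (
    xmlData  VARCHAR(MAX),
    metadata VARCHAR(MAX)
) AS result
CROSS APPLY OPENJSON(result.xmlData, '$.lineItems')
    WITH (
        sku      VARCHAR(50) '$.sku',
        quantity INT         '$.quantity'
    ) AS item;   -- one row per array element, typed at query time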

The Final Architecture

So here's where we landed - an architecture that gives us the best of both worlds:

  • Bronze layer: Raw XML files in blob storage (just as planned)
  • Silver layer: Consolidated, unexploded XML in Delta Lake (just as planned)
  • "Virtual Gold" layer: On-demand schema application through SQL Serverless views

This "virtual gold" concept was the breakthrough - instead of physically transforming data into a rigid schema, we could define the schema at query time based on what users actually needed. When the source changed, we just updated the views, not the underlying data structure.

Business Outcomes

The results were better than I could have hoped for:

  1. Our Azure costs dropped by 62% compared to our previous approach
  2. New fields became available immediately, not after weeks of schema updates
  3. Business users could self-serve their data needs with familiar SQL

But the biggest win? I haven't been woken up by a 3 AM alert in months.

Lessons Learned

If you're facing similar challenges, here are my hard-won insights:

  1. Understand the differences between Spark Delta and SQL Serverless - they look similar on paper but have important technical distinctions
  2. Test with real data volumes early - what works with sample data might break at scale
  3. Don't underestimate the VARCHAR(MAX) parameter - it's a small change with huge implications
  4. Start with the smallest acceptable partitioning strategy - you can always repartition later
  5. Document everything - especially the workarounds!

The most important lesson was learning when to physically transform data versus when to just transform it at query time. In a data landscape where you have to navigate upstream system unknowns, query-time transformations give you incredible flexibility.

Conclusion

This journey from SQL Server to Delta Lake taught me that sometimes the best architecture isn't about forcing data into a perfect structure - it's about building systems that can adapt when the data inevitably changes.

Understanding the differences between Spark Delta processing and SQL Serverless querying was the key that made our data architecture more robust and cost-effective. While they can work together beautifully, they're not the same thing, and respecting their differences is important.

Are you wrestling with similar semi-structured data challenges? Have you found other approaches that work well? I'd love to hear about your experiences in the comments. After all, the best solutions often come from sharing our data war stories!
