Microsoft Fabric ETL: Escape the Notebook Monolith
We’ve all been there: it starts as a simple proof-of-concept. You open up a new Notebook in Microsoft Fabric, write a few cells to ingest data, add some transformations, and maybe append a couple of data quality checks. Fast forward three months, and that “quick script” has evolved into a monolithic monster sprawling across 100+ cells.
Suddenly, your notebook contains utility functions, control flow logic, documentation, environment parameters, and inline unit tests – all tangled together in a single execution flow.
When data engineering code lives entirely inside monolithic notebooks, it becomes notoriously difficult to maintain. You end up violating the DRY (Don’t Repeat Yourself) principle, unit testing becomes effectively impossible, and when a scheduled job fails on cell 87 out of 112, you’re faced with operational blindness while trying to dig through execution states.
In this post, we’re going to fix that. We’ll look at how to leverage Microsoft Fabric’s built-in workspace resources to escape the notebook monolith by adopting an Object-Oriented (OO) Python approach alongside Fabric Pipelines.
The Solution: Separation of Concerns
To solve the monolith problem, we need to enforce a rigid separation of concerns. Instead of letting a notebook do everything from defining functions to managing execution order, we split the responsibilities:
- Reusable Logic: We use Fabric’s built-in resources (Environment/Workspace files) to store pure Python modules. This allows us to write repeatable ETL code in an Object-Oriented style.
- Execution Context: The Notebook itself becomes incredibly thin, acting only as the entry point that retrieves parameters and instantiates our Python classes.
- Control Flow: We use Fabric Pipelines to orchestrate the dependency chains and looping, moving that responsibility out of the notebook entirely.
Deep Dive: The LakehouseLoader Base Class (Template Method + Strategy Pattern)
To eliminate the repetition of writing create table and upsert queries across our Medallion Architecture (Bronze, Silver, Gold), we combine two complementary design patterns: the Template Method Pattern and the Strategy Pattern.
The Template Method gives us the skeleton: a private _table_load method that enforces the strict, non-negotiable sequence of steps every table load must follow – drop (if recreating), create, truncate (if needed), then insert. That order never changes.
The Strategy Pattern gives us the flexibility: instead of locking the create and insert logic into abstract methods on the class, _table_load accepts them as plain callables (create_fn and insert_fn). This means every individual load_* method can supply its own completely independent implementation – one might use the Spark DataFrame API, another a raw Spark SQL MERGE statement – without any of those choices bleeding into each other.
We start by creating this base class (e.g., in a workspace file named lakehouse_utils.py):
```python
from abc import ABC
from typing import Callable

from pyspark.sql import SparkSession, DataFrame


class LakehouseLoader(ABC):
    def __init__(self, spark: SparkSession, schema: str = "dbo"):
        self.spark = spark
        self.schema = schema

    def _table_load(
        self,
        df: DataFrame,
        table_name: str,
        create_fn: Callable,
        insert_fn: Callable,
        recreate: bool = False,
        truncate: bool = False,
    ) -> bool:
        """
        The Template Method.

        Accepts create_fn and insert_fn callables so each load method
        can supply its own independent implementation.
        """
        print(f"Starting load process for {self.schema}.{table_name} "
              f"(recreate={recreate}, truncate={truncate})...")
        if recreate:
            print(f"Dropping table {self.schema}.{table_name} to recreate...")
            self.spark.sql(f"DROP TABLE IF EXISTS {self.schema}.{table_name}")
        create_fn(df, table_name)
        if truncate and not recreate:
            print(f"Truncating table {self.schema}.{table_name}...")
            self.spark.sql(f"TRUNCATE TABLE {self.schema}.{table_name}")
        insert_fn(df, table_name)
        print("Load complete.")
        return True
```
Deep Dive: The Medallion Subclasses
With the callable-injection approach, each load_* method in a subclass is entirely self-contained. It defines its own create and insert as inner functions (closures), which naturally capture self from the enclosing method scope — giving them full access to self.spark and self.schema without any extra wiring. These two functions are then passed directly into _table_load as the strategy for that specific table.
This means there is no shared create or insert logic on the class itself to accidentally inherit or override incorrectly. A Bronze load_dim_date can use explicit Spark SQL DDL to define a strict schema, while a Bronze load_dim_customer can infer the schema from the DataFrame. Meanwhile, a Silver load_dim_customer can run a full Delta Lake MERGE statement — all living independently in the same codebase without any coupling between them.
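Before the subclasses, it may help to see the closure capture in isolation. This Spark-free sketch (the names Demo, run, and strategy are invented purely for illustration) shows why an inner function defined in a method can use self with no extra plumbing:

```python
# Spark-free illustration of the closure mechanics described above:
# an inner function defined in a method automatically sees `self`.
class Demo:
    def __init__(self, name: str):
        self.name = name

    def run(self, template):
        def strategy():
            # `self` is captured from the enclosing run() scope -- no wiring
            return f"{self.name} via closure"
        return template(strategy)

result = Demo("loader").run(lambda fn: fn())
print(result)  # -> loader via closure
```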
```python
class BronzeLoader(LakehouseLoader):
    def load_dim_date(self, df: DataFrame, recreate: bool = False, truncate: bool = False):
        def create(df: DataFrame, table_name: str):
            # Use Spark SQL DDL for an explicit schema definition
            self.spark.sql(f"""
                CREATE TABLE IF NOT EXISTS {self.schema}.{table_name} (
                    date_key INT,
                    full_date DATE,
                    year INT,
                    month INT,
                    day INT
                ) USING DELTA
            """)

        def insert(df: DataFrame, table_name: str):
            # Bronze logic: straight append via the DataFrame API
            print("Appending records to Bronze dim_date...")
            df.write.format("delta").mode("append").saveAsTable(f"{self.schema}.{table_name}")

        self._table_load(df, "dim_date", create, insert, recreate=recreate, truncate=truncate)

    def load_dim_customer(self, df: DataFrame, recreate: bool = False, truncate: bool = False):
        def create(df: DataFrame, table_name: str):
            # Infer the schema from the DataFrame itself
            df.limit(0).write.format("delta").mode("ignore").saveAsTable(f"{self.schema}.{table_name}")

        def insert(df: DataFrame, table_name: str):
            # Bronze logic: straight append via the DataFrame API
            print("Appending records to Bronze dim_customer...")
            df.write.format("delta").mode("append").saveAsTable(f"{self.schema}.{table_name}")

        self._table_load(df, "dim_customer", create, insert, recreate=recreate, truncate=truncate)


class SilverLoader(LakehouseLoader):
    def load_dim_customer(self, df: DataFrame, recreate: bool = False, truncate: bool = False):
        def create(df: DataFrame, table_name: str):
            # Infer the schema from the DataFrame itself
            df.limit(0).write.format("delta").mode("ignore").saveAsTable(f"{self.schema}.{table_name}")

        def insert(df: DataFrame, table_name: str):
            # Silver logic: deduplicate on the merge key, then MERGE via Spark SQL
            # (MERGE fails if the source has multiple rows per key)
            print("Deduplicating and merging into Silver dim_customer...")
            df.dropDuplicates(["customer_key"]).createOrReplaceTempView("silver_dim_customer_staging")
            self.spark.sql(f"""
                MERGE INTO {self.schema}.{table_name} AS target
                USING silver_dim_customer_staging AS source
                ON target.customer_key = source.customer_key
                WHEN MATCHED THEN UPDATE SET *
                WHEN NOT MATCHED THEN INSERT *
            """)

        self._table_load(df, "dim_customer", create, insert, recreate=recreate, truncate=truncate)
```
(You could easily follow this same pattern for a GoldLoader — its load_* methods would define insert inner functions that perform aggregation and write to summary tables, while create could provision the Gold schema via DDL.)
The “Skinny Notebook” Paradigm
Because all the complex, repeatable logic is abstracted away into our Python module, the actual Fabric Notebook becomes incredibly clean. A 100-cell monolith can often be reduced to just 2 or 3 cells.
One of the conveniences that makes this even cleaner is Microsoft Fabric’s default Lakehouse feature. When you attach a default Lakehouse to a notebook, Spark resolves table references using the short schema.table two-part name — there is no need to qualify the full three-part lakehouse.schema.table path. Our LakehouseLoader is built around this: the schema argument (typically "dbo" for the Lakehouse default schema) is the only context needed to resolve any table, keeping the instantiation simple and portable across environments.
The notebook’s only job is to retrieve parameters and call the loader:
```python
# Cell 1: Parameter cell -- mark this cell as a "parameter cell" so the
# Fabric Pipeline's Notebook activity can override these values at runtime
# via its base parameters.
source_table = "dim_customer"  # illustrative default for interactive runs
# The framework knows the target based on the specific load method

# Cell 2: Execute the load via our OO classes.
# Requires a default Lakehouse to be attached to this notebook;
# tables are resolved as schema.table (e.g. dbo.dim_customer).
from lakehouse_utils import SilverLoader

df_bronze = spark.read.table(source_table)
loader = SilverLoader(spark, schema="dbo")
loader.load_dim_customer(df_bronze, recreate=False, truncate=False)
```
Orchestration with Microsoft Fabric Pipelines
The final piece of the puzzle is moving the control flow out of the notebook. If you have Python if/else statements determining which tables to load, or for loops iterating over dependencies, those should be ripped out and placed into Fabric Pipelines.
A typical pipeline structure looks much cleaner:
- Copy Data Activity: Ingests files into the Lakehouse.
- Notebook Activity (Bronze): Triggers the heavily structured Bronze load.
- Notebook Activity (Silver): Triggers only upon the success of the Bronze activity.
The Pipeline handles retries, concurrency, and dependencies. The Notebook handles Spark execution. The .py modules handle the repeatable business logic.
Benefits & Results
Making this architectural shift usually yields massive improvements:
- Code Reduction: By removing duplicated MERGE and INSERT statement templates, it’s common to see a 60%+ reduction in total codebase lines.
- Faster Code Reviews: Pull requests are much easier to read when orchestration lives in skinny notebooks while the core mechanics are locked inside standard Python modules.
- True Unit Testability: You can now run pytest against your BronzeLoader and SilverLoader classes directly, entirely independently of the Notebook UI.
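To make the unit-testing claim concrete, here is a pytest-style sketch. It re-declares a minimal copy of the LakehouseLoader Template Method (logging trimmed) so the test file runs with no Spark cluster at all — in your repo you would import the real class — and stubs SparkSession with unittest.mock to assert the drop/create/insert ordering:

```python
from unittest.mock import MagicMock

# Minimal stand-in for the LakehouseLoader in lakehouse_utils.py so this
# test runs without Spark; in your repo, import the real class instead.
class LakehouseLoader:
    def __init__(self, spark, schema: str = "dbo"):
        self.spark = spark
        self.schema = schema

    def _table_load(self, df, table_name, create_fn, insert_fn,
                    recreate: bool = False, truncate: bool = False):
        if recreate:
            self.spark.sql(f"DROP TABLE IF EXISTS {self.schema}.{table_name}")
        create_fn(df, table_name)
        if truncate and not recreate:
            self.spark.sql(f"TRUNCATE TABLE {self.schema}.{table_name}")
        insert_fn(df, table_name)
        return True


def test_recreate_drops_then_creates_and_skips_truncate():
    spark = MagicMock()
    loader = LakehouseLoader(spark, schema="dbo")
    calls = []
    loader._table_load(
        MagicMock(), "dim_date",
        create_fn=lambda df, t: calls.append("create"),
        insert_fn=lambda df, t: calls.append("insert"),
        recreate=True, truncate=True,
    )
    # The Template Method guarantees drop -> create -> insert, and skips
    # truncate whenever the table was just recreated.
    assert calls == ["create", "insert"]
    spark.sql.assert_called_once_with("DROP TABLE IF EXISTS dbo.dim_date")
```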
Lessons Learned & Next Steps
If you want to start moving away from monolithic notebooks, keep these things in mind:
- Don’t Over-Abstract: It can be tempting to put everything into the base class. Keep your classes focused.
- Call to Action: To start, find the single most repeated code block in your Fabric notebooks (usually your upsert statement). Extract it to a .py file attached to your workspace, import it into a single notebook, and prove out the pattern!
Bonus: Local-Style Development in VS Code
For developers comfortable with local IDEs, the Synapse VS Code extension for Microsoft Fabric is an absolute game-changer for this OO architecture.
Using the extension’s Workspace explorer, you can right-click your notebook folder and select “Open in Virtual Workspace”. This drops you into an authentic VS Code editing experience where you can seamlessly edit your notebook alongside your shared Python (.py) modules. This gives you local software engineering experiences – like proper syntax highlighting, native code editing, and integrated source control – right against your Microsoft Fabric artifacts (provided git is enabled in Fabric).
A Technical Gotcha:
If you start developing this way, be aware of a sneaky issue regarding Python module caching! When you import custom .py modules from Fabric’s built-in resources into your active notebook session, the PySpark session cache grabs that module code. If you update the .py file and re-run the notebook cell, the notebook will often use the stale, cached version of the module. You will need to account for this by forcefully reloading the modules during interactive development to ensure the notebook always fetches your newest changes!
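One way to force that refresh is importlib.reload from the standard library. The snippet below reproduces the stale-cache behavior and the fix with a throwaway module written to a temp directory (demo_utils is an invented stand-in for your lakehouse_utils.py; in the notebook you would reload your real module the same way):

```python
import importlib
import pathlib
import sys
import tempfile

sys.dont_write_bytecode = True  # keep the demo deterministic (no .pyc cache)

# Throwaway module standing in for lakehouse_utils.py in workspace resources.
tmp = pathlib.Path(tempfile.mkdtemp())
(tmp / "demo_utils.py").write_text("VERSION = 1\n")
sys.path.insert(0, str(tmp))

import demo_utils
assert demo_utils.VERSION == 1

# Edit the file (as you would in VS Code); the import cache is now stale.
(tmp / "demo_utils.py").write_text("VERSION = 2  # edited\n")
import demo_utils                # no-op: Python serves the cached module
assert demo_utils.VERSION == 1   # still the old code!

importlib.reload(demo_utils)     # force a re-read of the .py source
assert demo_utils.VERSION == 2   # the session now sees the newest changes
```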
