Web Scraping in Fabric: It’s a Rendering Problem, Not a Scraping Problem

Web scraping is easy.

Just point Power Query at a URL.

If the website loads in your browser, the data must be there.

If Power Query returns HTML, you’ve scraped the page successfully.

Most Fabric users have probably believed some variation of these assumptions.

I certainly did.

Then one day I found myself staring at the From Web dialog in Dataflow Gen2.

Two options.

Web Content

Web Browser Content

Same connector. Same URL. Completely different outcomes.

I needed data from a website that had no API, no database access, and no export button. Just a table visible in a browser.

The obvious choice seemed to be Web Content. After all, I only needed the HTML.

The refresh succeeded.

The pipeline ran.

The output contained data.

Except it wasn’t the data I wanted.

After several hours of debugging, I discovered the page was built with React. The HTML returned by the server contained nothing but a placeholder:

<div id="root"></div>

The table I could clearly see in my browser didn’t exist in the response at all. JavaScript was creating it after the page loaded.

Switching to Web Browser Content fixed everything instantly.

That was the moment I realised most web scraping problems aren’t really scraping problems. They’re rendering problems.

Once you understand the difference between static HTML and JavaScript-rendered pages, the behaviour of Web Content, Web Browser Content, and even Playwright becomes obvious.

In this post, we’ll go under the hood of all three approaches, compare how they work, and discuss when each one should be your weapon of choice.


The Rendering Gap: Why Your Browser Sees What Power Query Can’t

Before we compare tools, we need to talk about how websites work in 2026. This is where most of the confusion lives.

Static HTML

You send an HTTP GET. The server builds the complete page and sends it back. What you see in the response is what a browser would render. Tables, text — all there.

<html>
  <body>
    <table class="data">
      <tr><td>Actual data here</td></tr>
    </table>
  </body>
</html>

This is the world Web Content was built for.

Dynamic DOM

You send an HTTP GET. The server sends back a skeleton and a pile of JavaScript bundles. The browser then executes those scripts, which fetch data from APIs, and builds the DOM dynamically.

The HTML in your response:

<html>
  <head>...</head>
  <body>
    <div id="root"></div>
    <script src="/static/js/main.a1b2c3.js"></script>
  </body>
</html>

That’s the whole page. The data hasn’t arrived yet. It’s waiting for JavaScript to run.

This is the world Web Browser Content was designed for. It’s also where Web Content fails — quietly, and absolutely.
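
A quick way to tell which world a page lives in is to inspect the raw response before choosing a mode. The sketch below is a rough heuristic, not a definitive test: it assumes the data you want would show up as table rows or visible text, and the names `looks_client_rendered` and `min_text_chars` are made up for illustration.

```python
from html.parser import HTMLParser


class _TextAndRowCounter(HTMLParser):
    """Counts <tr> elements and visible text characters outside <script>/<style>."""

    def __init__(self) -> None:
        super().__init__()
        self.rows = 0
        self.text_chars = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "tr":
            self.rows += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.text_chars += len(data.strip())


def looks_client_rendered(html: str, min_text_chars: int = 50) -> bool:
    """Heuristic: no table rows and almost no visible text suggests a JavaScript shell."""
    parser = _TextAndRowCounter()
    parser.feed(html)
    parser.close()
    return parser.rows == 0 and parser.text_chars < min_text_chars
```

Fed the two snippets above, the static page comes back False and the skeleton comes back True. The threshold of 50 characters is arbitrary, so tune it against pages you know.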


Option 1: Web Content — The Simple HTTP GET

This is the basic mode. You give Power Query a URL, it fires off an HTTP GET request, and whatever comes back goes into the M engine for parsing.

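let
    // Web.Contents returns binary; decode it to text before parsing
    Source = Text.FromBinary(Web.Contents("https://example.com/data")),
    // Each pair is {column name, CSS selector}; selectors here are illustrative
    Table = Html.Table(Source,
        {{"Column1", "td:nth-child(1)"}, {"Column2", "td:nth-child(2)"}},
        [RowSelector = "table.data tr"])
in
    Table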

Behind the scenes, Web.Contents is built on .NET’s HTTP stack. It supports the widest range of authentication options — Anonymous, Windows, Basic, Web API, Organizational Account, and Service Principal — and works in cloud environments without needing a gateway (Microsoft Learn: Web.Contents).

What it’s good at:

  • Government data portals with static HTML tables
  • Wikipedia infoboxes and reference tables
  • REST APIs returning JSON or XML
  • RSS feeds, sitemaps, CSV files hosted on a URL

What it’s terrible at:

  • Single-page applications built with React, Angular, Vue
  • Any page that loads data via XHR or fetch() after initial page load
  • Pages behind JavaScript-based challenges

The silent killer: on a JavaScript-rendered page, Web Content still returns a successful response. It doesn’t fail. It returns the skeleton HTML. Your pipeline keeps running. Your downstream tables have values. Just… wrong values. No error. No warning. Just meaningless data flowing silently into your lakehouse.
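
One way to take the silence out of this failure is to validate what you extracted before it reaches the lakehouse. This is a minimal sketch, assuming the rows arrive as a list of dictionaries; `require_rows`, `EmptyScrapeError`, and the field names are hypothetical, not part of any Fabric or Power Query API.

```python
class EmptyScrapeError(ValueError):
    """Raised when a scrape 'succeeds' but yields no usable data."""


def _is_blank(value) -> bool:
    # None or whitespace-only text counts as blank; 0 and False do not
    return value is None or not str(value).strip()


def require_rows(rows: list[dict], required_fields: list[str], min_rows: int = 1) -> list[dict]:
    """Return rows unchanged, or raise if they look like the output of a skeleton page."""
    if len(rows) < min_rows:
        raise EmptyScrapeError(f"expected at least {min_rows} row(s), got {len(rows)}")
    for index, row in enumerate(rows):
        blank = [field for field in required_fields if _is_blank(row.get(field))]
        if blank:
            raise EmptyScrapeError(f"row {index} has blank fields: {blank}")
    return rows
```

Calling `require_rows(rows, ["name", "value"])` right after extraction turns an empty or hollow result into a failed run instead of a quiet one.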


Option 2: Web Browser Content — The Hidden Browser Engine

This is the mode most people don’t know exists — or dismiss because “it’s slow.” They’re right about the slowness. But they’re wrong to dismiss it.

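let
    // Web.BrowserContents returns the rendered HTML as text
    Source = Web.BrowserContents("https://example.com/dashboard"),
    // Each pair is {column name, CSS selector}; selectors here are illustrative
    Table = Html.Table(Source,
        {{"Column1", "td:nth-child(1)"}, {"Column2", "td:nth-child(2)"}},
        [RowSelector = "table.data tr"])
in
    Table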

Same function name prefix. Fundamentally different behaviour. Web.BrowserContents is built on Microsoft Edge’s WebView2 control — a full Chromium-based browser engine. It executes JavaScript, renders the DOM, and then returns the HTML (Microsoft Learn: Web.BrowserContents).

How it works under the hood:

  • Power Query spins up WebView2 (Edge’s Chromium engine)
  • The browser loads the URL, executes JavaScript, renders the DOM
  • Then Power Query extracts the rendered content
  • This takes dramatically longer than a simple HTTP GET

The gateway gotcha: Unlike Web Content, Web Browser Content requires an on-premises data gateway when running in the cloud (Power BI service, Dataflow Gen2, Power Apps). The gateway machine also needs the right browser component installed: the WebView2 runtime for Web.BrowserContents, or Internet Explorer 10 or later for the legacy Web.Page (Microsoft Learn: Web connector troubleshooting).

What it’s good at:

  • JavaScript-rendered dashboards and SPAs
  • Pages that load data after the initial paint
  • Any site where the real content depends on JavaScript execution

What it’s still bad at:

  • Performance. 5–30 seconds per page instead of 0.5–2 seconds
  • Anti-bot evasion. The embedded browser has a detectable fingerprint
  • Complex auth flows. SSO and multi-step login get messy
  • Bulk scraping. A hundred pages in browser mode takes forever
  • Administrative privileges — WebView2 can’t run in elevated/admin mode

The WaitFor escape hatch: Web.BrowserContents supports a WaitFor parameter that waits for a CSS selector to appear before grabbing the HTML (Microsoft Learn: handling dynamic web pages).

Web.BrowserContents("https://example.com/dashboard", [
    WaitFor = [Selector = "div.data-loaded", Timeout = #duration(0,0,0,10)]
])

This is your main tool for handling pages that need extra rendering time. Without it, dynamic content can cause sporadic errors — sometimes it loads, sometimes it doesn’t.


Option 3: Playwright in Fabric Notebooks

So what do you do when neither Dataflow Gen2 mode works? When the page needs complex interaction — pagination, infinite scroll, login flows — and Web.BrowserContents’ simple “load and wait” can’t cut it?

You reach for Playwright in a Fabric Notebook.

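%pip install playwright
# Assumption: the browser binary also has to be downloaded once per session
!playwright install chromium

from playwright.async_api import async_playwright

async def scrape_dashboard(url: str) -> list[dict]:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            page = await browser.new_page()
            await page.goto(url, wait_until="networkidle")

            data = []
            while True:
                # Wait until at least one row is rendered before reading the page
                await page.wait_for_selector(".data-row")
                rows = await page.evaluate("""() =>
                    Array.from(document.querySelectorAll('.data-row'))
                        .map(r => ({
                            name: r.querySelector('.name')?.innerText ?? null,
                            value: r.querySelector('.value')?.innerText ?? null
                        }))
                """)
                data.extend(rows)

                # Stop when the "next" button is missing or disabled
                next_btn = page.locator("button.next-page")
                if await next_btn.count() == 0 or await next_btn.is_disabled():
                    break
                await next_btn.click()
                # Fixed pause: a crude stand-in for waiting on the next page's data
                await page.wait_for_timeout(2000)
            return data
        finally:
            # Always close the browser, even if scraping fails part-way
            await browser.close()

# Notebooks already run an event loop, so await the coroutine directly:
# data = await scrape_dashboard("https://example.com/dashboard")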

This handles everything the Dataflow Gen2 modes can’t. A real browser, executing real JavaScript, clicking real buttons.

But Here’s Where It Gets Complicated

There’s a lot of advice in community threads saying “just %pip install playwright and you’re good.” The Fabric Community thread on this very topic says headless mode “usually works without issues” — and that “usually” is doing a lot of heavy lifting.

Gotcha #1: You’re not running the same browser. By default, Playwright does not run the full Chromium browser in headless mode. Since Playwright v1.49 (late 2024), headless=True launches a separate binary called chromium-headless-shell, a stripped-down build from Chromium’s //content module (Playwright v1.49 headless changes). It has different rendering behaviour, different GPU handling, and different font rendering. Code that passes locally in a headed browser window can fail in Fabric because the headless binary is a different build.

Gotcha #2: System dependencies aren’t guaranteed. The full Chromium binary requires 20+ system libraries (libX11, libXcomposite, libgbm1, libdbus-1-3, etc.). Fabric’s compute images may or may not include them. The only reliable fix is a workspace-level environment config with an init script — which most people don’t know exists.

Gotcha #3: Notebook sessions time out. The idle timeout is ~20 minutes. If your scraping logic is waiting on slow page loads, Fabric sees no activity and recycles the session. Your in-flight data? Gone.
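
One way to limit the damage from a recycled session is to write each page's rows to durable storage as soon as they are scraped, rather than holding everything in memory. The sketch below appends to a JSON Lines file and can report the last page saved; `append_rows`, `load_checkpoint`, and the `page` field are hypothetical names, and where the file should live in your workspace is left as an assumption.

```python
import json
from pathlib import Path


def append_rows(path: Path, page_number: int, rows: list[dict]) -> None:
    """Append one scraped page's rows to a JSON Lines checkpoint file."""
    with path.open("a", encoding="utf-8") as handle:
        for row in rows:
            record = {"page": page_number, **row}
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_checkpoint(path: Path) -> tuple[list[dict], int]:
    """Return (rows saved so far, last page number saved), or ([], 0) if no file exists."""
    if not path.exists():
        return [], 0
    lines = path.read_text(encoding="utf-8").splitlines()
    rows = [json.loads(line) for line in lines if line.strip()]
    last_page = max((row["page"] for row in rows), default=0)
    return rows, last_page
```

On restart, `load_checkpoint` tells the scraper which page to resume from, so a timeout costs one page of work instead of the whole run.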

Gotcha #4: Orphaned browser processes. When your notebook session gets recycled, Fabric kills the Python runtime. But orphaned Chromium processes stick around, consuming memory until the OS eventually cleans them up. They accumulate.

Gotcha #5: Memory accounting doesn’t include the browser. Each browser instance uses 150–600 MB that Fabric’s runtime never sees. Run 5 concurrent sessions and you’ve silently consumed somewhere between 0.75 and 3 GB. When the OS OOM killer takes out your Python process, you get a generic memory error with no mention of the browser.

Gotcha #6: Software rendering is painfully slow. Fabric compute nodes don’t have GPUs. A page that renders in 2 seconds on your laptop takes 10–15 seconds on Fabric. Scrape 50 pages? That’s roughly 8–13 minutes just for rendering.
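
For budgeting, the per-instance memory and per-page timing figures quoted in Gotchas #5 and #6 can be turned into a back-of-envelope estimate. The helper names below are made up, and the inputs are the ranges from the text, not measurements.

```python
def total_render_minutes(pages: int, seconds_per_page: float) -> float:
    """Sequential rendering time in minutes."""
    return pages * seconds_per_page / 60


def total_browser_gb(instances: int, mb_per_instance: float) -> float:
    """Browser memory across instances in GB, using 1 GB = 1000 MB for a round estimate."""
    return instances * mb_per_instance / 1000


# 50 pages at 10 and 15 seconds each
print(round(total_render_minutes(50, 10), 1), round(total_render_minutes(50, 15), 1))  # 8.3 12.5
# 5 browser instances at 150 and 600 MB each
print(total_browser_gb(5, 150), total_browser_gb(5, 600))  # 0.75 3.0
```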

Gotcha #7: Headless browsers get fingerprinted. Fabric’s egress IPs are Azure datacenter ranges. Services like Cloudflare Turnstile, DataDome, and Akamai flag them by default. Playwright might appear to work but return a CAPTCHA wall.
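
Because a challenge page still comes back as ordinary HTML, it is worth checking for one before parsing. The sketch below is a crude string heuristic: the marker list is an illustrative assumption, not a complete or vendor-documented set, and a generic marker like "captcha" can produce false positives on pages that merely mention the word.

```python
# Illustrative markers only; extend or replace them for the sites you actually scrape
CHALLENGE_MARKERS = ("cf-turnstile", "captcha", "just a moment", "access denied")


def looks_like_bot_challenge(html: str) -> bool:
    """Return True if the HTML contains any known challenge marker (case-insensitive)."""
    lowered = html.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)
```

Treat a True result as a failed scrape instead of passing the page on to the row extraction step.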


The Decision Matrix

Scenario | Your Weapon
Static HTML page | Web Content (done in seconds)
React/Angular/Vue SPA, no interaction | Web Browser Content (accept the speed penalty, need a gateway)
Needs interaction (pagination, login) | Playwright in Notebook (accept the gotchas)
JSON API behind an SPA | Web.Contents directly (skip the browser entirely)

Pro tip I discovered embarrassingly late: open DevTools → Network tab on the target page and look for XHR or fetch calls. Many SPAs load data via clean JSON APIs behind the scenes. Hit those directly with Web Content mode and bypass the browser entirely. This has saved me more hours than I can count.
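
The same trick works from a notebook. The sketch below fetches a JSON endpoint with the standard library and normalises the payload into rows; the endpoint URL, the `items` key, and the function names are all hypothetical, since every SPA shapes its API differently.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(url: str, timeout: float = 30.0):
    """GET a JSON endpoint found in the DevTools Network tab (assumes a UTF-8 body)."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def to_rows(payload, records_key: str = "items") -> list[dict]:
    """Accept either a bare list of records or an object wrapping them under records_key."""
    records = payload[records_key] if isinstance(payload, dict) else payload
    return [dict(record) for record in records]


# Hypothetical usage:
# rows = to_rows(fetch_json("https://example.com/api/dashboard-data"))
```

No browser, no rendering wait, and the response is already structured.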


Lessons Learned

  1. Check the page source first. Before writing any Power Query code, hit Ctrl+U and search for your data. If it’s not in the HTML, JavaScript is rendering it.
  2. Web Content and Web Browser Content are different execution engines in the same dialog. They share a UI but have nothing in common under the hood. Web.Contents is .NET HTTP. Web.BrowserContents is Microsoft Edge’s Chromium-based WebView2.
  3. Web Content fails silently. It returns skeleton HTML, not an error. If your scraped data looks wrong, check which mode you’re using.
  4. Web Browser Content requires a gateway in the cloud. This is a hard requirement, not a nice-to-have. Make sure your gateway machine has the WebView2 runtime installed.
  5. Browser rendering costs time and money. Fabric charges by CU usage. A Web Browser Content step uses significantly more resources. Use the simple version where it works.
  6. Playwright in Notebooks is not a drop-in replacement. It defaults to chromium-headless-shell, not full Chrome. Different binaries, different behaviour.
  7. Check DevTools Network tab before writing any code. The JSON API behind an SPA is often easier to scrape than the rendered page itself.
  8. Don’t fight a losing battle against anti-bot systems. If a site uses Cloudflare or DataDome, accept that Fabric’s Azure IPs won’t work and find an alternative data source.


Are you scraping websites through Fabric? I’d love to hear what you’ve landed on. Web Browser Content in Dataflow? Playwright in a Notebook? Or did you crack and reach for your credit card? 👇