Quickstart

Get from nothing to a query result in a browser tab. This walks the happy path; for the full local setup (reference generation, e2e, troubleshooting) see Running locally.

1. Set up a local Python environment (conda)

conda create -n pcw python=3.11
conda activate pcw
pip install pyspark-connect-web

See Installation for the browser-side micropip install.

2. Bring up the server side (Spark Connect + Envoy grpc-web proxy)

docker compose -f deploy/compose.yaml up

This starts a Spark 4.1.2 Connect server and an Envoy proxy that exposes:

| URL | What |
| --- | --- |
| sc://localhost:8081/;transport=grpcweb | grpc-web endpoint the client connects to |
| http://localhost:8000/ | JupyterLite site, served with the mandatory Cross-Origin-Opener-Policy: same-origin + Cross-Origin-Embedder-Policy: credentialless headers (required for SharedArrayBuffer) |

See deploy/README.md for ports, version pins, and CORS/header checks.

3. Use the client

import pyspark_connect_web as pcw
pcw.install()

from pyspark.sql import SparkSession
spark = SparkSession.builder.remote("sc://localhost:8081/;transport=grpcweb").getOrCreate()
spark.range(10).filter("id % 2 = 0").toPandas()
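
The connection string packs a host, a port, and `;key=value` parameters into one URL. A minimal sketch of how such a string breaks down, using only the standard library: `parse_connect_url` is a hypothetical helper for illustration, not pyspark-connect-web's or PySpark's parser, and falling back to port 15002 assumes Spark Connect's usual default.

```python
from urllib.parse import urlsplit

DEFAULT_PORT = 15002  # assumption: Spark Connect's usual default when no port is given


def parse_connect_url(url: str) -> dict:
    """Split an sc:// connection string into host, port, and ;key=value params."""
    parts = urlsplit(url)
    if parts.scheme != "sc":
        raise ValueError(f"expected an sc:// URL, got {url!r}")
    # Everything after the first ';' in the path is a list of key=value parameters.
    _, *raw_params = parts.path.split(";")
    params = {}
    for item in raw_params:
        if not item:
            continue  # tolerate a trailing ';'
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"parameter {item!r} is not key=value")
        params[key] = value
    return {
        "host": parts.hostname,
        "port": parts.port or DEFAULT_PORT,
        "params": params,
    }


print(parse_connect_url("sc://localhost:8081/;transport=grpcweb"))
```

Here `transport=grpcweb` is the parameter that selects the web transport; the host and port point at the Envoy proxy, not at the Spark server directly.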

In JupyterLite, open http://localhost:8000/ and run the demo notebook. First verify in the browser console that the page is cross-origin isolated; otherwise the blocking bridge cannot work:

crossOriginIsolated === true   // must be true; else SharedArrayBuffer is unavailable
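
If the console check fails, the usual cause is a missing or wrong response header. A minimal sketch of checking the two headers from outside the browser, assuming the site is served at http://localhost:8000/ as above: `isolation_problems` and `check_site` are hypothetical helpers, not part of the package, and they test only the header values this guide names.

```python
from urllib.request import urlopen

# Header values this guide says the site is served with.
REQUIRED_HEADERS = {
    "cross-origin-opener-policy": "same-origin",
    "cross-origin-embedder-policy": "credentialless",
}


def isolation_problems(headers) -> list:
    """Return readable problems; an empty list means both headers match."""
    seen = {name.lower(): value.strip().lower() for name, value in headers.items()}
    problems = []
    for name, expected in REQUIRED_HEADERS.items():
        actual = seen.get(name)
        if actual != expected:
            problems.append(f"{name}: expected {expected!r}, got {actual!r}")
    return problems


def check_site(url: str = "http://localhost:8000/") -> list:
    """Fetch the page and check its response headers (needs the stack running)."""
    with urlopen(url, timeout=5) as response:
        return isolation_problems(dict(response.headers.items()))
```

With the stack up, `check_site()` should return an empty list. Browsers also accept `require-corp` for the embedder policy, so a non-empty result on that header is a mismatch with this guide's configuration rather than proof that isolation is off.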

Ways to use it

Pick the path that fits - all of them run the real PySpark API in the browser.

1. In JupyterLite (a notebook, nothing to install)

Build the site and bring up the stack (Spark Connect + Envoy grpc-web + the JupyterLite site, served cross-origin isolated on :8000):

make site                                  # build the JupyterLite site into _output/
docker compose -f deploy/compose.yaml up   # serves :8000 (site) + :8081 (grpc-web) + :15002 (Spark)

Open http://localhost:8000/, then in a notebook cell run the pcw.install() + SparkSession.builder.remote(...) snippet from step 3 above. For GitHub Pages or other static hosts, see JupyterLite hosting.

2. Embed it in your own web page

The site ships a small, self-contained page that boots Pyodide in a Web Worker, micropip-installs the wheel, runs pcw.install(), binds a SparkSession, and exposes window.__pcwRunPython(src). Use pyspark_connect_web/jupyterlite/harness.html as the reference for wiring worker/worker_bootstrap.js + worker/bridge.js into your app. The page must be cross-origin isolated (COOP: same-origin, COEP: credentialless) for the SharedArrayBuffer bridge.

3. Run the end-to-end example

The browser e2e suite brings up the whole stack and drives the v0 matrix (range/collect, groupBy/agg Arrow parity, createDataFrame, spark.sql) in real Chromium:

make site
docker compose -f deploy/compose.yaml up -d
cd tests/e2e && npm install && npx playwright install --with-deps chromium
E2E_REQUIRE_STACK=1 npx playwright test

It also runs on every push (the e2e GitHub Actions workflow).
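
Because `docker compose ... up -d` returns before the services are ready, it can help to confirm the ports accept connections before starting Playwright. A minimal sketch, assuming the three ports named in the compose comment earlier (8000, 8081, 15002) are published on the host; `port_open` and `stack_status` are hypothetical helpers, not part of the repository.

```python
import socket

# Ports taken from the compose comment in this guide; adjust if your setup differs.
STACK_PORTS = {"site": 8000, "grpc-web": 8081, "spark": 15002}


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def stack_status(host: str = "localhost") -> dict:
    """Map each service name to whether its port currently accepts connections."""
    return {name: port_open(host, port) for name, port in STACK_PORTS.items()}
```

Calling `stack_status()` in a short retry loop until every value is `True` is one way to wait for the stack. An open port only shows that something is listening, not that Spark has finished starting.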

DataFrame API examples

Once connected it is ordinary PySpark. Runnable scripts live in examples/ (quickstart, transformations, aggregations, joins, window, sql, io); they double as plain native-PySpark scripts against any Spark Connect server.

What just happened

  • pcw.install() monkey-patched PySpark's Connect stub to use a grpc-web/fetch transport. Nothing above the stub changed.
  • SparkSession.builder.remote("sc://...;transport=grpcweb") parsed the web scheme and returned an ordinary SparkSession.
  • .toPandas() built a protobuf plan, shipped it through Envoy to the Spark Connect server, and decoded the Arrow IPC result back into a pandas DataFrame - all synchronously, via the Atomics/SharedArrayBuffer bridge.
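
To make the grpc-web transport in the first bullet concrete: the gRPC-Web protocol wraps each message in a 5-byte prefix (a 1-byte flag and a 4-byte big-endian length) and sends trailers as a final frame with the flag's high bit set. The sketch below shows that framing only. It is not pyspark-connect-web's code, the function names are hypothetical, and the payloads are placeholder bytes, not real protobuf or Arrow data.

```python
import struct

DATA_FRAME = 0x00     # uncompressed message payload
TRAILER_FRAME = 0x80  # high bit set: payload is the trailers block


def encode_frame(payload: bytes, flags: int = DATA_FRAME) -> bytes:
    """Prefix payload with a 1-byte flag and a 4-byte big-endian length."""
    return struct.pack(">BI", flags, len(payload)) + payload


def decode_frames(body: bytes):
    """Yield (flags, payload) for each complete frame in a response body."""
    offset = 0
    while offset < len(body):
        if len(body) - offset < 5:
            raise ValueError("truncated frame header")
        flags, length = struct.unpack_from(">BI", body, offset)
        offset += 5
        payload = body[offset:offset + length]
        if len(payload) != length:
            raise ValueError("truncated frame payload")
        offset += length
        yield flags, payload


# A stand-in response body: one message frame followed by the trailers frame.
body = encode_frame(b"arrow-batch-bytes") + encode_frame(b"grpc-status:0\r\n", TRAILER_FRAME)
for flags, payload in decode_frames(body):
    kind = "trailers" if flags & TRAILER_FRAME else "message"
    print(kind, payload)
```

Because the framing fits inside an ordinary HTTP body, a fetch-based client can speak it through Envoy, which translates to native gRPC for the Spark Connect server.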

Next steps