Quickstart¶
Get from nothing to a query result in a browser tab. This walks the happy path; for the full local setup (reference generation, e2e, troubleshooting) see Running locally.
1. Set up a local Python environment (conda)¶
See Installation for the browser-side micropip install.
2. Bring up the server side (Spark Connect + Envoy grpc-web proxy)¶
This starts a Spark 4.1.2 Connect server and an Envoy proxy that exposes:
| URL | What |
|---|---|
| sc://localhost:8081/;transport=grpcweb | grpc-web endpoint the client connects to |
| http://localhost:8000/ | JupyterLite site, served with the mandatory Cross-Origin-Opener-Policy: same-origin + Cross-Origin-Embedder-Policy: credentialless headers (required for SharedArrayBuffer) |
See deploy/README.md for ports, version pins, and CORS/header checks.
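If you want a quick header check from Python rather than the browser, something like the following may help. It is a minimal sketch, not part of the project: `isolation_problems` and `check_site` are hypothetical helper names, and the expected header values are simply the two listed in the table above.

```python
from urllib.request import urlopen

# Header values taken from the table above; names are compared case-insensitively.
REQUIRED_HEADERS = {
    "cross-origin-opener-policy": "same-origin",
    "cross-origin-embedder-policy": "credentialless",
}


def isolation_problems(headers):
    """Return a list of mismatches; an empty list means both headers look right."""
    normalized = {k.lower(): v.strip().lower() for k, v in headers.items()}
    problems = []
    for name, expected in REQUIRED_HEADERS.items():
        actual = normalized.get(name)
        if actual != expected:
            problems.append(f"{name}: expected {expected!r}, got {actual!r}")
    return problems


def check_site(url="http://localhost:8000/"):
    """Fetch the site root and report any missing or wrong isolation headers."""
    with urlopen(url, timeout=5) as resp:
        return isolation_problems(dict(resp.headers.items()))
```

With the stack running, `check_site()` should return an empty list. A non-empty list names the header that the server is not sending as expected.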
3. Use the client¶
```python
import pyspark_connect_web as pcw
pcw.install()

from pyspark.sql import SparkSession
spark = SparkSession.builder.remote("sc://localhost:8081/;transport=grpcweb").getOrCreate()
spark.range(10).filter("id % 2 = 0").toPandas()
```
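To make the shape of the connection string concrete, here is a rough sketch that splits it into host, port, and `;key=value` parameters. It only illustrates the string's layout; it is not how PySpark or `pcw` parses it, and `parse_connect_url` is a hypothetical name. The fallback port 15002 is Spark Connect's usual default.

```python
from urllib.parse import urlsplit


def parse_connect_url(url):
    """Split an sc:// connection string into host, port, and parameters."""
    parts = urlsplit(url)
    if parts.scheme != "sc":
        raise ValueError(f"expected an sc:// URL, got scheme {parts.scheme!r}")
    # Everything after the first ';' in the path is a ;key=value parameter list.
    params = {}
    for segment in parts.path.split(";")[1:]:
        if not segment:
            continue
        key, _, value = segment.partition("=")
        params[key] = value
    return {
        "host": parts.hostname,
        "port": parts.port or 15002,  # assumption: fall back to Spark Connect's default port
        "params": params,
    }
```

For the URL used above, this yields host `localhost`, port `8081`, and a single parameter `transport` set to `grpcweb`.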
In JupyterLite, open http://localhost:8000/ and run the demo notebook. Verify isolation first: crossOriginIsolated === true must hold in the browser console, or the blocking bridge cannot work.
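The same flag can be read from a notebook cell, since Pyodide exposes browser globals through its `js` module. A minimal sketch, assuming a Pyodide kernel; `is_cross_origin_isolated` is a hypothetical helper, not part of `pcw`:

```python
def is_cross_origin_isolated():
    """Return the page's crossOriginIsolated flag, or None outside a browser."""
    try:
        import js  # provided by Pyodide; absent in native CPython
    except ImportError:
        return None
    # crossOriginIsolated is a global on both window and worker scopes.
    return bool(getattr(js, "crossOriginIsolated", False))
```

A result of `False` suggests the COOP/COEP headers are not reaching the page. `None` just means the code is running outside the browser.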
Ways to use it¶
Pick the path that fits - all of them run the real PySpark API in the browser.
1. In JupyterLite (a notebook, nothing to install)¶
Build the site and bring up the stack (Spark Connect + Envoy grpc-web + the
JupyterLite site, served cross-origin isolated on :8000):
```bash
make site                                  # build the JupyterLite site into _output/
docker compose -f deploy/compose.yaml up   # serves :8000 (site) + :8081 (grpc-web) + :15002 (Spark)
```
Open http://localhost:8000/, then in a notebook cell run the pcw.install() +
SparkSession.builder.remote(...) snippet from step 3 above. GitHub Pages / other
static hosts: see JupyterLite hosting.
2. Embed it in your own web page¶
The site ships a small, self-contained page that boots Pyodide in a Web Worker, micropip-installs the wheel, runs pcw.install(), binds a SparkSession, and exposes window.__pcwRunPython(src). Use pyspark_connect_web/jupyterlite/harness.html as the reference for wiring worker/worker_bootstrap.js + worker/bridge.js into your app. The page must be cross-origin isolated (COOP: same-origin, COEP: credentialless) for the SharedArrayBuffer bridge.
3. Run the end-to-end example¶
The browser e2e brings up the whole stack and drives the v0 matrix
(range/collect, groupBy/agg Arrow parity, createDataFrame, spark.sql) in
real Chromium:
```bash
make site
docker compose -f deploy/compose.yaml up -d
cd tests/e2e && npm install && npx playwright install --with-deps chromium
E2E_REQUIRE_STACK=1 npx playwright test
```
It also runs on every push (the e2e GitHub Actions workflow).
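Because `docker compose ... up -d` returns before the services are necessarily listening, it can help to wait for the three ports before starting Playwright. The following is a small sketch under that assumption. `wait_for_port` and `wait_for_stack` are hypothetical helpers, and the port numbers are the ones mentioned above. An accepted TCP connection only shows that something is listening, not that Spark is fully ready.

```python
import socket
import time

# Ports named in this guide: JupyterLite site, Envoy grpc-web, Spark Connect.
STACK_PORTS = {"site": 8000, "grpc-web": 8081, "spark": 15002}


def wait_for_port(host, port, timeout=60.0, interval=0.5):
    """Poll until a TCP connection to host:port succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


def wait_for_stack(host="localhost", timeout=60.0):
    """Return a {service: reachable} mapping for every port in STACK_PORTS."""
    return {name: wait_for_port(host, port, timeout) for name, port in STACK_PORTS.items()}
```

A wrapper script could call `wait_for_stack()` and only run `npx playwright test` once every value is `True`.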
DataFrame API examples¶
Once connected it is ordinary PySpark. Runnable scripts live in examples/ (quickstart, transformations, aggregations, joins, window, sql, io); they double as plain native-PySpark scripts against any Spark Connect server.
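As a taste of what "ordinary PySpark" means here, the sketch below groups ten integers by parity and aggregates them. It is an illustration written for this page, not one of the scripts in examples/. `aggregation_example` is a hypothetical name, and it expects the `spark` session created in step 3.

```python
def aggregation_example(spark):
    """Group the integers 0..9 by parity and return counts and sums as pandas."""
    # Imported lazily so this module can be loaded without PySpark installed.
    from pyspark.sql import functions as F

    df = spark.range(10).withColumn("parity", F.col("id") % 2)
    return (
        df.groupBy("parity")
        .agg(F.count("*").alias("n"), F.sum("id").alias("total"))
        .orderBy("parity")
        .toPandas()
    )
```

Both parity groups contain five rows; the even ids sum to 20 and the odd ids to 25. The function uses only standard DataFrame calls, so it should behave the same against a native Spark Connect client.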
What just happened¶
- pcw.install() monkey-patched PySpark's Connect stub to use a grpc-web/fetch transport. Nothing above the stub changed.
- SparkSession.builder.remote("sc://...;transport=grpcweb") parsed the web scheme and returned an ordinary SparkSession.
- .toPandas() built a protobuf plan, shipped it through Envoy to the Spark Connect server, and decoded the Arrow IPC result back into a pandas DataFrame, all synchronously, via the Atomics/SharedArrayBuffer bridge.
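The "all synchronously" part is the unusual bit: a blocking caller waits while asynchronous I/O happens elsewhere. The sketch below is only an analogy in plain Python threads, not the project's bridge. In the browser the waiting is done with Atomics on a SharedArrayBuffer. `BlockingBridge` and `fake_fetch` are hypothetical names.

```python
import asyncio
import threading


class BlockingBridge:
    """Run coroutines on a background event loop and block the caller for the result."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def call(self, coro, timeout=None):
        # Schedule the coroutine on the background loop, then block until it finishes.
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout)

    def close(self):
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def fake_fetch(payload):
    """Stand-in for an async network round trip; returns the payload reversed."""
    await asyncio.sleep(0.01)
    return payload[::-1]
```

The caller writes `bridge.call(fake_fetch(b"abc"))` and gets a value back as if the call were synchronous, which is the property that lets PySpark's blocking client code run unchanged.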
Next steps¶
- Connection patterns - sc:// scheme, TLS, and auth.
- Running locally - reference generation and the e2e harness.
- JupyterLite hosting - host the site on GitHub Pages and friends.
- Security - what to harden before going past localhost.