
pyspark-connect-web - PySpark in JupyterLite

Run the real PySpark Connect Python client inside a browser (JupyterLite/Pyodide), talking to a Spark Connect server through a grpc-web transport. Your existing PySpark code runs unchanged - no reimplementation, no local JVM, no Python backend server.


The embedded BI query cell demo was recorded in CI against a real Spark Connect server: PySpark boots in the browser tab, then the demo picks a table, runs SQL, and renders the results. No JVM, no pip install pyspark, no client setup.

import pyspark_connect_web as pcw
pcw.install()

from pyspark.sql import SparkSession
spark = SparkSession.builder.remote("sc://localhost:8081/;transport=grpcweb").getOrCreate()
spark.range(10).filter("id % 2 = 0").toPandas()   # runs in your browser tab
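
To make the connection string above easier to read, here is a minimal sketch of how an `sc://` URL breaks down into host, port, and `;key=value` parameters. `parse_sc_url` is a hypothetical helper written for illustration, not part of pyspark or pyspark_connect_web; it assumes the usual Spark Connect layout (`sc://host:port/;param=value`), falls back to Spark Connect's default port 15002, and does not handle IPv6 literals.

```python
def parse_sc_url(url: str) -> dict:
    """Split an sc:// connection string into host, port, and parameters.

    Hypothetical illustration only; the real client does its own parsing.
    """
    prefix = "sc://"
    if not url.startswith(prefix):
        raise ValueError(f"expected an sc:// URL, got {url!r}")

    rest = url[len(prefix):]
    # Everything before the first "/" is host[:port]; the rest holds parameters.
    authority, _, param_str = rest.partition("/")
    host, _, port_str = authority.partition(":")
    port = int(port_str) if port_str else 15002  # Spark Connect default port

    params = {}
    for item in param_str.split(";"):
        if not item:
            continue  # skip the empty piece before the leading ";"
        key, _, value = item.partition("=")
        params[key] = value

    return {"host": host, "port": port, "params": params}


info = parse_sc_url("sc://localhost:8081/;transport=grpcweb")
print(info)
```

In the example URL, port 8081 would be the Envoy listener rather than Spark itself, and `transport=grpcweb` is the parameter that selects the browser transport.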

A thin client, not local compute

PySpark's Connect client is pure Python above a single gRPC stub: it builds protobuf plans and ships them to the server. We monkey-patch only that stub with a grpc-web/fetch transport, and make its calls blocking through a Web Worker + Atomics/SharedArrayBuffer bridge, so .collect() returns data synchronously. Everything above the stub (DataFrame, Column, functions) is untouched.
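
The blocking bridge is easier to picture with an analogy. The sketch below is not the project's implementation: it runs in ordinary CPython and uses a background thread with an asyncio event loop in place of the Web Worker, and a `concurrent.futures` future in place of the Atomics/SharedArrayBuffer wait. Pyodide cannot rely on threads like this, which is why the browser version needs the Worker-based bridge. `BlockingBridge` and `fake_fetch` are hypothetical names.

```python
import asyncio
import threading


class BlockingBridge:
    """Run coroutines on a background event loop and block the caller for the result."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        # The background thread plays the role of the Web Worker.
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def call(self, coro, timeout=None):
        # Hand the coroutine to the worker loop, then block until it finishes.
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout)

    def close(self):
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def fake_fetch(plan: bytes) -> bytes:
    """Stand-in for an async network round trip (hypothetical)."""
    await asyncio.sleep(0.01)
    return plan[::-1]


bridge = BlockingBridge()
result = bridge.call(fake_fetch(b"plan"))  # synchronous from the caller's view
bridge.close()
print(result)
```

The caller never awaits anything: `bridge.call(...)` returns plain bytes, which is the same property that lets `.collect()` behave as it does in regular PySpark.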

You still need a running Spark Connect server (Spark 4.x) behind an Envoy grpc-web proxy. The browser does not run Spark; it builds plans and renders results. The win is: no Python backend, the real PySpark API, anywhere a browser runs.

flowchart LR
    U["User PySpark code (unchanged)"]
    SCC["SparkConnectClient"]
    ENVOY["Envoy grpc_web proxy"]
    SPARK["Spark Connect server (Spark 4.x)"]
    PD["pandas"]

    U -->|builds protobuf plan| SCC
    SCC -->|patched stub: grpc-web over fetch| ENVOY
    ENVOY --> SPARK
    SPARK -->|Arrow IPC| ENVOY
    ENVOY -->|grpc-web response| SCC
    SCC -->|decode Arrow to pandas| PD
    PD --> U
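
For the "grpc-web over fetch" edge in the diagram, it may help to see what travels in the HTTP body. The sketch below follows the publicly documented gRPC-Web framing, where each frame is a 1-byte flag, a 4-byte big-endian length, and the payload, and the trailers frame has the high bit (0x80) of the flag set. It is an illustration under that assumption, not the project's transport code; `encode_frame` and `decode_frames` are hypothetical helpers, and the payloads are placeholder bytes rather than real protobuf or Arrow data.

```python
import struct


def encode_frame(payload: bytes, trailers: bool = False) -> bytes:
    """Wrap a payload in a 5-byte gRPC-Web frame header."""
    flags = 0x80 if trailers else 0x00  # bit 7 marks a trailers frame
    return struct.pack(">BI", flags, len(payload)) + payload


def decode_frames(body: bytes) -> list:
    """Split a response body into (is_trailers, payload) tuples."""
    frames = []
    offset = 0
    while offset < len(body):
        if len(body) - offset < 5:
            raise ValueError("truncated frame header")
        flags, length = struct.unpack_from(">BI", body, offset)
        offset += 5
        end = offset + length
        if end > len(body):
            raise ValueError("truncated frame payload")
        frames.append((bool(flags & 0x80), body[offset:end]))
        offset = end
    return frames


# A response with one data frame followed by the trailers frame.
body = encode_frame(b"arrow-batch") + encode_frame(b"grpc-status:0\r\n", trailers=True)
frames = decode_frames(body)
print(frames)
```

In a streaming response the data frames would each carry one serialized response message, and the final trailers frame carries the gRPC status that a native gRPC client would read from HTTP/2 trailers.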

Where to go next

| If you want to... | Read |
| --- | --- |
| Install the package into a browser/JupyterLite env | Installation |
| Run a query end-to-end as fast as possible | Quickstart |
| Bring up Spark Connect + Envoy on your laptop | Running locally |
| Understand sc://, TLS, and auth | Connection patterns |
| Host the JupyterLite site (GitHub Pages, Netlify, ...) | JupyterLite hosting |
| Understand the internals | Architecture |
| Deploy past localhost safely | Security |
| Look up the public API | API reference |

Status

Early development. The server side (deploy/) and the e2e scaffold (tests/e2e/) are in place; the browser client and the JupyterLite build are in progress. See CONTRIBUTING.md in the repository for the build plan and the design notes for the load-bearing invariants.

License

Apache-2.0. "Apache Spark", "Spark", and "PySpark" are trademarks of the Apache Software Foundation, used here only to describe interoperability.