Architecture - pyspark-connect-web¶
PySpark in JupyterLite. Run the real PySpark Connect Python client inside a browser (JupyterLite/Pyodide), talking to a Spark Connect server through a grpc-web transport. User PySpark code runs unchanged.
The one big idea¶
PySpark Connect's client is pure Python above a single gRPC stub: it builds
protobuf plans and calls self._stub.ExecutePlan(req). We monkey-patch only
that stub with a grpc-web/fetch transport, and make calls blocking via a
Web Worker + Atomics/SharedArrayBuffer bridge so .collect() returns data
synchronously. Plan building, DataFrame, Column, functions, retry policy, and the
reattachable-execute iterator all stay as upstream PySpark.
End-to-end picture¶
flowchart TD
U["User PySpark code (unchanged)"]
DF["pyspark.sql.connect: DataFrame / functions (untouched upstream)"]
SCC["SparkConnectClient"]
STUB["grpc-web stub (patched) - length-prefixed frames, 0x80 trailer"]
CH["SyncChannel - Atomics + SharedArrayBuffer (worker blocks; main thread does fetch)"]
ENVOY["Envoy proxy - :8081 grpc_web filter + CORS, :8000 static JupyterLite + COOP/COEP"]
SPARK["Spark Connect server (Spark 4.x, :15002)"]
PD["pandas DataFrame"]
U --> DF
DF -->|builds protobuf plan| SCC
SCC -->|patched stub| STUB
STUB --> CH
CH -->|grpc-web over fetch| ENVOY
ENVOY -->|gRPC over HTTP/2| SPARK
SPARK -.->|Arrow IPC result batches| CH
CH -.->|decode| PD
PD -.-> U
subgraph browser["Browser (cross-origin isolated)"]
U
DF
SCC
STUB
CH
end
Lane map¶
| Lane | Area | Files |
|---|---|---|
| 1 | grpc-web stub + framing | pyspark_connect_web/transport/* |
| 2 | monkey-patch + integration | pyspark_connect_web/__init__.py, patch.py, _contract.py |
| 3 | Pyodide sync bridge + JupyterLite | pyspark_connect_web/worker/*, jupyterlite/* |
| 4 | Arrow result decoding | pyspark_connect_web/arrow/* |
| 5 | Envoy proxy + e2e + docs | deploy/*, tests/e2e/*, docs/*, README.md |
The stub seam (frozen - the transport contract section 1)¶
The replacement stub exposes 10 callables with the gRPC-Python calling
convention fn(request, *, metadata=None, timeout=None). Streaming methods
(ExecutePlan, ReattachExecute) return an iterator of response protos; unary
methods return one proto. Protos come from pyspark.sql.connect.proto - we do
not vendor copies.
Below the stub is the byte boundary: the stub calls the SyncChannel
(unary / server_stream), which is synchronous - it blocks the worker via
Atomics.wait until the main-thread fetch writes the response back through a
SharedArrayBuffer. That blocking is what lets .collect() stay synchronous.
grpc-web wire framing (the components)¶
A grpc-web message is length-prefixed frames: [1 byte flags][4 byte big-endian
length][payload]. Flag 0x01 = compressed (rejected), 0x80 = trailer frame
whose payload is an HTTP-style header block carrying grpc-status /
grpc-message. A request can be HTTP 200 yet carry grpc-status: 13; the turns a non-OK trailer into SparkConnectGrpcException. Envoy's grpc_web
filter translates between this browser-friendly framing and upstream gRPC.
Why cross-origin isolation is mandatory¶
SharedArrayBuffer - the backbone of the blocking bridge - is only available
when the page is cross-origin isolated. That requires the page be served
with:
Envoy's static host (:8000) sets both (deploy/envoy.yaml). We use
credentialless (not require-corp): it keeps the page cross-origin isolated
while letting the cross-origin grpc-web fetch to Envoy through as a
no-credentials request, without requiring Cross-Origin-Resource-Policy on every
response. Pyodide and the wheels are vendored same-origin (the build copies
them into the site root) because under COEP the worker cannot load them from a
cross-origin CDN. The e2e harness asserts crossOriginIsolated === true before
doing anything else; if it is false, the entire bridge is dead, so this is the
first checklist gate.
Reattachable execute¶
The client issues ExecutePlan + ReattachExecute + ReleaseExecute. If a
stream breaks mid-flight, PySpark's own ExecutePlanResponseReattachableIterator
reattaches from the last seen response - but only if our stub implements all
three calls, not just the initial stream. The e2e harness kills a stream
mid-flight and asserts the result still completes.
Invariants¶
- No
grpcioinsidepyspark_connect_web/- ever (not in Pyodide). - We patch, we do not fork pyspark.
- Pinned pyspark range
>=4.0. - COOP/COEP mandatory.
.collect()stays blocking.- Reattachable execute in scope.
- Arrow results byte/row-exact vs a native reference.
Where things deviate from a normal Spark Connect setup¶
- The transport is grpc-web over
fetch, not native gRPC - hence Envoy. - Calls are made blocking from a Web Worker; the main thread does the network.
- The native reference client (
tests/e2e/reference.py) talks plain gRPC on:15002and is allowed to usegrpcio(it is outside the package).