Connection patterns¶
After pcw.install(), you connect with the ordinary
SparkSession.builder.remote(...) API. The patch teaches the connection parser a
grpc-web transport; everything else about building a session is stock PySpark.
```python
import pyspark_connect_web as pcw

pcw.install()  # patch PySpark before building the session

from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:8081/;transport=grpcweb").getOrCreate()
```
The sc:// scheme¶
The canonical form, sc://<envoy-host>:<port>/;transport=grpcweb, selects the
grpc-web transport with a transport=grpcweb parameter:

- <envoy-host>:<port> is the Envoy grpc-web endpoint (default :8081), not the Spark Connect gRPC port (:15002). The browser only ever talks to Envoy.
- Without ;transport=grpcweb, a stock sc:// URL is left to PySpark's own (native gRPC) handling, which cannot work in the browser because grpcio is absent. The web transport is engaged only when the parameter is present.
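The rule above (web transport only when the parameter is present) can be sketched with plain string handling. This is a minimal illustration, not the parser the patch installs; parse_sc_params and uses_web_transport are hypothetical names introduced here.

```python
def parse_sc_params(remote: str) -> dict[str, str]:
    """Split the ;key=value parameters off an sc:// URL (illustrative helper)."""
    if not remote.startswith("sc://"):
        raise ValueError(f"not an sc:// URL: {remote!r}")
    rest = remote[len("sc://"):]
    # Everything after the first "/" is the parameter tail, e.g. ";transport=grpcweb".
    _authority, _, tail = rest.partition("/")
    params: dict[str, str] = {}
    for pair in tail.split(";"):
        if not pair:
            continue  # skip the empty piece before the leading ";"
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"malformed parameter: {pair!r}")
        params[key] = value
    return params


def uses_web_transport(remote: str) -> bool:
    """True only when the URL is sc:// and carries transport=grpcweb."""
    if not remote.startswith("sc://"):
        return False
    return parse_sc_params(remote).get("transport") == "grpcweb"
```

With this sketch, uses_web_transport("sc://localhost:8081/;transport=grpcweb") is true, while a stock URL such as sc://localhost:15002 is not and would fall through to PySpark's native handling.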
http:// / https:// shorthand¶
You can also pass a plain URL; it is normalized to the canonical sc:// form:
| You pass | Normalized to |
|---|---|
| https://host | sc://host:443/;transport=grpcweb;use_ssl=true |
| http://host | sc://host:80/;transport=grpcweb |
| https://host:8443 | sc://host:8443/;transport=grpcweb;use_ssl=true |
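The mapping in the table can be reproduced in a few lines. The following is a sketch of the normalization rule only, assuming default ports 80 and 443 as the table shows; normalize_remote is a hypothetical name, not the function the patch exposes, and the sketch ignores URL paths and IPv6 literals.

```python
from urllib.parse import urlsplit

# Default ports implied by the table above.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_remote(url: str) -> str:
    """Rewrite an http(s):// shorthand to the canonical sc:// form (illustrative)."""
    parts = urlsplit(url)
    if parts.scheme == "sc":
        return url  # already an sc:// URL; leave it untouched
    if parts.scheme not in DEFAULT_PORTS:
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError(f"missing host in {url!r}")
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    # Only https:// implies TLS on the resulting connection string.
    tls = ";use_ssl=true" if parts.scheme == "https" else ""
    return f"sc://{parts.hostname}:{port}/;transport=grpcweb{tls}"
```

Passing each entry of the "You pass" column through normalize_remote yields the matching "Normalized to" entry.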
TLS¶
- Localhost / dev runs plaintext (http://, sc://... without use_ssl). The dev Envoy config terminates no TLS - fine for a laptop, never past localhost.
- Anything public must use TLS (https:// shorthand, or ;use_ssl=true). Two reasons:
    - Off localhost, browsers grant crossOriginIsolated (and therefore SharedArrayBuffer) only in a secure context. Without HTTPS the blocking bridge is dead.
    - The bearer token (below) rides the connection, so it must never travel in plaintext.

The prod Envoy listeners terminate TLS (TLSv1_2 minimum); cert mounting is
documented in deploy/README.md.
See Security, sections 3-4, for the full rationale.
Authentication¶
Spark Connect has no built-in authentication. Anyone who can reach the gRPC port can run arbitrary Spark plans. In this topology the browser reaches Spark through Envoy, so Envoy is the only enforcement point - if Envoy is open, Spark is open.
Auth is carried as channel-level metadata that the stub forwards as grpc-web
request headers - typically an Authorization: Bearer <token> header. PySpark's
ChannelBuilder.metadata() is the supported way to inject channel-level headers,
and the patch forwards exactly those pairs (see the WebChannel.params plumbing
in pyspark_connect_web/patch.py).
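As a rough picture of what "forwards exactly those pairs" means, the sketch below turns a list of (key, value) metadata pairs into a header mapping. It is not the code in pyspark_connect_web/patch.py; metadata_to_headers is a hypothetical helper, and lowercasing the keys is an assumption made here because gRPC metadata keys are conventionally lowercase.

```python
from typing import Iterable, Tuple


def metadata_to_headers(metadata: Iterable[Tuple[str, str]]) -> dict[str, str]:
    """Map channel metadata pairs to request headers, one header per pair (illustrative)."""
    headers: dict[str, str] = {}
    for key, value in metadata:
        # Assumption: keys are lowercased; values are forwarded unchanged.
        headers[key.lower()] = value
    return headers


# Example: a bearer credential carried as channel-level metadata.
headers = metadata_to_headers([("Authorization", "Bearer <token>")])
```

Nothing is added or dropped in this sketch: whatever pairs the channel carries are the headers Envoy sees.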
On the proxy side:
- The prod Envoy config ships a minimal Lua gate that rejects any request lacking Authorization: Bearer <token> with 401 (CORS preflights pass through). This stops anonymous access but does not validate the token.
- For production, replace the presence-only gate with jwt_authn (validate a JWT against your IdP's JWKS) or ext_authz (delegate to an authz service). Both are sketched at the bottom of deploy/envoy.prod.yaml.
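To make the limits of a presence-only gate concrete, here is its decision logic modeled in Python. The shipped gate is Lua inside Envoy; this model is only an approximation of the behavior described above, gate_status is a hypothetical name, and treating OPTIONS as the CORS preflight and matching the Bearer scheme exactly are assumptions of the sketch.

```python
from typing import Mapping, Optional


def gate_status(method: str, headers: Mapping[str, str]) -> Optional[int]:
    """Return 401 to reject a request, or None to let it continue (illustrative)."""
    if method.upper() == "OPTIONS":
        return None  # CORS preflight carries no credentials; pass it through
    # Header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    scheme, _, token = normalized.get("authorization", "").partition(" ")
    if scheme == "Bearer" and token.strip():
        return None  # a token is present; it is NOT validated here
    return 401
```

Note that gate_status("POST", {"Authorization": "Bearer anything"}) lets the request through: any non-empty token passes, which is why the presence-only gate should be replaced with jwt_authn or ext_authz in production.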
Token handling in the browser
The token lives in the page's JS context and is sent as a grpc-web header.
Treat it as a bearer credential: short-lived, scoped, refreshable. Do not log
it, and do not put it in URLs (it would land in proxy/access logs and the
Referer header).
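One way to honor "do not log it" is to scrub bearer credentials from any text before it reaches a log. The helper below is a minimal sketch under that assumption; redact_bearer is a hypothetical name and is not part of pyspark_connect_web.

```python
import re

# Matches "Bearer <token>" in any letter case; the token runs to the next
# whitespace, comma, or semicolon.
_BEARER = re.compile(r"(?i)\b(bearer)\s+[^\s,;]+")


def redact_bearer(text: str) -> str:
    """Replace any bearer token in text with a fixed placeholder (illustrative)."""
    return _BEARER.sub(r"\1 [REDACTED]", text)
```

For example, redact_bearer("Authorization: Bearer abc.def.ghi") returns "Authorization: Bearer [REDACTED]".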
Endpoint cheat-sheet¶
| Endpoint | Port | Who talks to it |
|---|---|---|
| Envoy grpc-web | :8081 | the browser client (sc://...;transport=grpcweb) |
| Envoy static host | :8000 | the browser loading the JupyterLite site (COOP/COEP) |
| Spark Connect (gRPC) | :15002 | the native reference generator only; never the browser, and not exposed in prod |