API coverage
spark-connect-scala3 implements the mainstream, untyped Spark Connect surface
completely, and is verified end to end against live Apache Spark 3.5, 4.0, and
4.1 servers. This page records exactly what is and is not implemented so there
are no surprises.
Supported
| Area | What is covered |
|---|---|
| SparkSession | builder + remote(...), sql, range, table, createDataFrame, conf, catalog, read, readStream, streams, version, addTag/interrupt*, implicits |
| Dataset / DataFrame | select/selectExpr, filter/where, withColumn(s)/withColumnRenamed(s), drop, join/crossJoin/lateralJoin (inner/left/right/outer/semi/anti/cross), groupBy/rollup/cube/agg/pivot, orderBy/sort/sortWithinPartitions, limit/offset, distinct/dropDuplicates, union/unionByName/intersect(All)/except(All), sample/randomSplit, repartition/repartitionByRange/coalesce, hint, unpivot/melt, transpose, toJSON, describe/summary, na, stat, observe, withWatermark, checkpoint/localCheckpoint, persist/cache/unpersist/storageLevel, to(Local)Iterator, collect/count/head/take/first/show, schema/columns/dtypes/printSchema/explain/inputFiles, sameSemantics/semanticHash, temp/global-temp view creation |
| Column | full expression algebra: arithmetic, comparison, boolean logic, like/rlike/ilike, contains/startsWith/endsWith, isin, between, isNull/isNotNull/isNaN, cast, substr, getItem/getField, when/otherwise, over, asc/desc, bitwise ops |
| functions | 400+ functions: aggregate, string, math, date/time, array/map/struct, JSON, conditional, and window functions |
| Window / WindowSpec | partitionBy, orderBy, rowsBetween, rangeBetween |
| Data sources | DataFrameReader/DataFrameWriter for CSV, JSON, Parquet, ORC, text, JDBC, and tables, with options, schema, partitionBy/bucketBy/sortBy, save modes, saveAsTable/insertInto; parse an in-memory Dataset[String] as CSV/JSON (read.csv(ds)/read.json(ds)); DataFrameWriterV2 (writeTo); MergeIntoWriter (df.mergeInto(...), MERGE INTO) |
| SQL | spark.sql(...) with named and positional parameters; temporary and global temporary views |
| Catalog | spark.catalog: list/inspect catalogs, databases, tables, columns, functions; existence checks; createTable/createExternalTable; temp-view drops; cache management |
| Structured Streaming | DataStreamReader/DataStreamWriter, output modes, triggers (ProcessingTime/Once/AvailableNow/Continuous), start/toTable, StreamingQuery, StreamingQueryManager, and StreamingQueryListener registration (addListener/removeListener/listListeners) with client-side event dispatch |
| Declarative Pipelines | build graphs of tables, materialized views, temporary views, sinks, and flows, then run them on the server |
| Typed Datasets | as[T] and spark.createDataset with compile-time Encoder derivation for case classes, tuples, primitives, Option, collections, and maps; typed actions (collect/head/first/take/collectAsList/toLocalIterator) |
| Observation | Observation for collecting named aggregate metrics while a query runs |
| Config | RuntimeConfig (spark.conf.get/set/unset/isModifiable) |
| Results | Apache Arrow IPC decoding into name-addressable Rows; connection-string parsing (sc://host:port/;k=v); bearer-token auth (implies TLS); server errors surfaced as typed Spark exceptions (AnalysisException/ParseException/SparkException) with the original message and error class |
| Resiliency | Automatic retry of transient gRPC failures (exponential backoff + jitter), and reattachable execution that resumes a broken ExecutePlan result stream via ReattachExecute (with ReleaseExecute to free server buffers). On by default; toggle with useReattachableExecute. Works against Spark 3.5+ |
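
The retry behavior named in the Resiliency row (exponential backoff plus jitter) can be pictured with a small, generic sketch. This is not the client's implementation: `RetryPolicy`, `Retry.withRetry`, the `isTransient` predicate, and every default value below are hypothetical names and numbers chosen for illustration. A real client would decide which gRPC failures count as transient.

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Random, Success, Try}

// Hypothetical policy: field names and defaults are illustrative only,
// not the settings used by spark-connect-scala3.
final case class RetryPolicy(
    maxRetries: Int = 5,
    initialBackoffMs: Long = 50L,
    maxBackoffMs: Long = 5000L,
    multiplier: Double = 2.0,
    jitterMs: Int = 100
)

object Retry {
  // Delay before retry number `attempt` (0-based), without jitter:
  // initialBackoffMs * multiplier^attempt, capped at maxBackoffMs.
  def baseDelayMs(policy: RetryPolicy, attempt: Int): Long = {
    val raw = policy.initialBackoffMs * math.pow(policy.multiplier, attempt.toDouble)
    math.min(raw, policy.maxBackoffMs.toDouble).toLong
  }

  // Evaluates `op`, retrying while `isTransient` accepts the failure and
  // retries remain. `sleep` and `rng` are parameters so they can be stubbed.
  def withRetry[A](
      policy: RetryPolicy,
      isTransient: Throwable => Boolean,
      sleep: Long => Unit = ms => Thread.sleep(ms),
      rng: Random = new Random()
  )(op: => A): A = {
    @tailrec
    def loop(attempt: Int): A =
      Try(op) match {
        case Success(value) => value
        case Failure(err) if attempt < policy.maxRetries && isTransient(err) =>
          // Uniform jitter in [0, jitterMs] spreads out simultaneous retries.
          val jitter = if (policy.jitterMs > 0) rng.nextInt(policy.jitterMs + 1) else 0
          sleep(baseDelayMs(policy, attempt) + jitter)
          loop(attempt + 1)
        case Failure(err) => throw err
      }
    loop(0)
  }
}
```

With `initialBackoffMs = 10`, `multiplier = 2.0`, and `maxBackoffMs = 25`, the base delays are 10 ms, 20 ms, and then 25 ms for every later retry, each with up to `jitterMs` of random extra delay added.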
Not supported
These are deliberately out of scope for now.
- User-defined functions (UDFs/UDAFs). Registering or applying a Scala closure means running user JVM code on the server, and this client does not transport that code over the Spark Connect protocol.
- Streaming foreach/foreachBatch sinks. Same reason: they execute a user function per row/batch on the server. All built-in sinks (parquet/console/memory/kafka/...) are supported.
- Typed Dataset operations that take a Scala closure. as[T] and createDataset are supported (encoder-only, no closure), but the operations below run a user function on the server and are not available:
  - typed map/flatMap/mapPartitions/reduce
  - groupByKey and KeyValueGroupedDataset (mapGroups, flatMapGroups, cogroup, typed agg)
  - typed Aggregators

  Use the relational API (select/agg/functions) for these.
- MLlib over Connect (spark.ml/pyspark.ml equivalent). Experimental upstream and not exposed here.
- Artifact upload (addArtifact). Shipping local JARs/files to the server is a niche client feature not exposed yet.
- Protocol messages that are not standard Scala API. A handful of Spark
  Connect relations/commands have no plain Scala DataFrame method because they
  exist for other languages or for closure execution: the closure relations
  (mapPartitions, groupMap/coGroupMap, applyInPandasWithState), Python
  user-defined table functions and data sources, the ML relation/commands, and
  internal optimization relations (server-cached/chunked local relations,
  withRelations). The standard DataFrame, SQL, streaming, and pipeline surface
  is fully covered.
Why these are excluded
UDFs, foreach/foreachBatch, and the typed closure operations all require
sending user-compiled JVM code to the server (artifact upload + class loading),
which is a separate, security-sensitive subsystem outside this client's scope.
The encoder-derivation half of the typed Dataset[T] API does not require that, so
as[T] and createDataset are fully supported; only the closure-driven typed
operations listed above remain out of scope.
Known limitations
- Server stack traces are not reconstructed. Errors carry the server's message
  and error class (and map to AnalysisException/ParseException/SparkException),
  but the remote JVM stack trace is not rebuilt on the client.