Spark Connect for Scala 3
A pure-Scala-3 client for Apache Spark Connect - a gRPC DataFrame API that mirrors Apache Spark's own Scala API.
If you have written Spark in Scala, you already know most of this library. The
client needs no Spark runtime, no spark-submit, and no local Spark
installation, only a reachable Spark Connect server.
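```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.*

// Connect to a running Spark Connect server; no local Spark is started.
val spark = SparkSession.builder
  .remote("sc://localhost:15002")
  .appName("quickstart")
  .getOrCreate()

// Build the query on the client; show() triggers execution on the server.
spark.range(10)
  .select(col("id"), (col("id") * 2).as("doubled"))
  .filter(col("id") % 2 === 0)
  .show()

spark.stop()
```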
What is Spark Connect?
Classic Spark applications run your driver code inside the cluster's JVM. Spark Connect splits that apart: your program is a thin client that builds an unresolved logical plan and ships it to a remote server over gRPC. The server plans, optimizes, and executes the query, then streams results back as Apache Arrow batches.
```text
Your Scala 3 program             Spark Connect server        Spark cluster
spark-connect-scala3  --gRPC-->  (plan + optimize)  ------>  (execute)
         ^                                                       |
         +----------------- Arrow result batches ----------------+
```
Because the protocol is language-agnostic, the client can live in any language. This project is that client for Scala 3.
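To make "builds an unresolved logical plan" concrete, here is a toy model of the idea in plain Scala 3. The names Expr, Plan, IdRange, and describe are hypothetical and exist only for illustration; they are not this library's internals or the Spark Connect protobuf messages. The sketch shows that building the quickstart query only allocates a tree of data, which a real client would then serialize and send over gRPC.

```scala
// Toy model only: hypothetical types, not the library's or Spark's own.
enum Expr:
  case Col(name: String)
  case Lit(value: Long)
  case Mul(left: Expr, right: Expr)
  case Mod(left: Expr, right: Expr)
  case EqualTo(left: Expr, right: Expr)
  case Alias(child: Expr, name: String)

enum Plan:
  case IdRange(end: Long)
  case Project(child: Plan, exprs: List[Expr])
  case Filter(child: Plan, condition: Expr)

import Expr.*, Plan.*

// The quickstart query as a tree. Nothing is executed or resolved here:
// "id" is just a name that a server would later check against a schema.
val plan: Plan =
  Filter(
    Project(
      IdRange(10),
      List(Col("id"), Alias(Mul(Col("id"), Lit(2)), "doubled"))
    ),
    EqualTo(Mod(Col("id"), Lit(2)), Lit(0))
  )

// Render an expression as text, to show that it is ordinary data.
def show(e: Expr): String = e match
  case Col(n)        => n
  case Lit(v)        => v.toString
  case Mul(l, r)     => s"(${show(l)} * ${show(r)})"
  case Mod(l, r)     => s"(${show(l)} % ${show(r)})"
  case EqualTo(l, r) => s"(${show(l)} = ${show(r)})"
  case Alias(c, n)   => s"${show(c)} AS $n"

// Render a plan from the outermost operator down to its source.
def describe(p: Plan): String = p match
  case IdRange(end)    => s"Range(0, $end)"
  case Project(c, es)  => s"Project[${es.map(show).mkString(", ")}] <- ${describe(c)}"
  case Filter(c, cond) => s"Filter[${show(cond)}] <- ${describe(c)}"

@main def printPlan(): Unit =
  println(describe(plan))
```

Because the tree contains only names, literals, and operators, it can be encoded in a language-neutral format, which is what lets a client exist in any language.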
What it supports
spark-connect-scala3 implements the Spark Connect DataFrame, SQL, Structured Streaming, and Declarative Pipelines APIs, modeled directly on Apache Spark's Scala API (SparkSession, DataFrame, Column, functions, Dataset[T], ...), so existing Spark Scala code ports almost verbatim. Results decode through Apache Arrow into ordered, name-addressable Rows.
It also supports typed Datasets: df.as[T] and spark.createDataset(values) for case classes, tuples, primitives, Option, collections, and maps. Encoders run entirely on the client, so no closure is sent to the server.
Not supported
The following all require running a user-provided JVM closure on the server (the same mechanism as a UDF), which Spark Connect for Scala 3 does not provide:
- User-defined functions: functions.udf, spark.udf.register, and typed Aggregator / UDAFs.
- Typed Dataset transformations whose argument is a Scala function: map, flatMap, mapPartitions, groupByKey (and its mapGroups / flatMapGroups / reduceGroups), and reduce. Note that as[T] and createDataset, which only attach an encoder and ship no closure, are supported.
- Structured Streaming foreach / foreachBatch sinks.
Also out of scope because they are not part of the Spark Connect protocol at all:
- The RDD API (Dataset.rdd, SparkContext, accumulators, broadcast variables).
- The MLlib-over-Connect surface.
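Many closure-based transformations, such as a simple map, can be rewritten as column expressions, which travel to the server as part of the plan rather than as compiled code. The following is a minimal sketch that reuses only calls from the quickstart above. It assumes a Spark Connect server is reachable at sc://localhost:15002; the app name and the commented-out typed version are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.*

@main def columnExpressions(): Unit =
  // Assumption: a Spark Connect server is listening on this address.
  val spark = SparkSession.builder
    .remote("sc://localhost:15002")
    .appName("no-closures")
    .getOrCreate()

  // Not supported, because the lambda would have to run on the server:
  //   ds.map(id => id * 2)

  // The same computation as a column expression; only the expression
  // tree is sent, so no user code is shipped.
  spark.range(10)
    .select((col("id") * 2).as("doubled"))
    .show()

  spark.stop()
```

Logic that cannot be expressed with the functions library has no equivalent of this kind and remains out of scope.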
Project facts
- Maven coordinates: com.github.hyukjinkwon::spark-connect-scala3-client, built for Scala 3.3.x.
- Spark compatibility: built and tested against Apache Spark 4.0.x and 4.1.x (latest 4.1.2). The protobufs are sourced from Spark 4.1.2.
- Source: HyukjinKwon/spark-connect-scala3.
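In sbt, the :: in the coordinates above corresponds to %%, which appends the Scala binary-version suffix to the artifact name. A sketch of the dependency declaration follows. The version is left as a placeholder because this page does not state one, and 3.3.4 is only an illustrative choice of 3.3.x release; the Installation guide is the place to confirm both.

```scala
// build.sbt (sketch)
scalaVersion := "3.3.4" // any 3.3.x release; 3.3.4 is an illustrative choice

// Replace <version> with a published release; none is named on this page.
libraryDependencies +=
  "com.github.hyukjinkwon" %% "spark-connect-scala3-client" % "<version>"
```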
Where to next
| Guide | What is inside |
|---|---|
| Installation | The dependency, JDK flags, and a local server |
| Quickstart | Connecting and your first DataFrames |
| DataFrames | The full transformation and action surface |
| Columns and Functions | Expressions and the functions library |
| Data Sources | Reading and writing files and tables |
| SQL | Running SQL and using views |
| Catalog | Inspecting and managing metadata |
| Structured Streaming | Streaming sources, sinks, queries, and query listeners |
| Declarative Pipelines | Dataflow graphs |
| Configuration and Connection | Connection strings and runtime config |
| Examples | Runnable programs |
| API (Scaladoc) | Generated method-level reference |