Spark Connect for Scala 3

A pure Scala 3 client for Apache Spark Connect: a DataFrame API over gRPC that mirrors Apache Spark's own Scala API.

If you have written Spark in Scala, you already know most of this library. There is no Spark engine on the client, no spark-submit, and no local Spark installation; the only requirement is a reachable Spark Connect server.

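import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.*

// Connect to a Spark Connect server; no local Spark is started.
val spark = SparkSession.builder
  .remote("sc://localhost:15002")
  .appName("quickstart")
  .getOrCreate()

// Build a plan over ids 0..9, keep the even ids, and print the result.
spark.range(10)
  .select(col("id"), (col("id") * 2).as("doubled"))
  .filter(col("id") % 2 === 0)
  .show()

// Close the session.
spark.stop()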

What is Spark Connect?

Classic Spark applications run your driver code inside the cluster's JVM. Spark Connect splits that apart: your program is a thin client that builds an unresolved logical plan and ships it to a remote server over gRPC. The server plans, optimizes, and executes the query, then streams results back as Apache Arrow batches.

  Your Scala 3 program            Spark Connect server            Spark cluster
  spark-connect-scala3   --gRPC-->   (plan + optimize)   ------>   (execute)
         ^                                                              |
         +--------------------- Arrow result batches ------------------+

Because the protocol is language-agnostic, the client can live in any language. This project is that client for Scala 3.
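As a rough illustration of the split described above, the sketch below assumes the quickstart's server address (sc://localhost:15002) and relies on the transformation methods behaving like Apache Spark's, as this page states. Under that assumption, the transformations only extend the plan held by the client, and the action is what sends the plan to the server. The names LazyPlanSketch, countEvens, and lazyPlanDemo are made up for this example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.*

object LazyPlanSketch:
  // Hypothetical helper: counts the even ids in [0, n).
  def countEvens(spark: SparkSession, n: Long): Long =
    // Transformations: these only extend the unresolved plan on the client.
    val evens = spark.range(n).filter(col("id") % 2 === 0)
    // Action: the plan is shipped over gRPC, executed remotely,
    // and the result is returned to the client.
    evens.count()

  @main def lazyPlanDemo(): Unit =
    // ASSUMPTION: a Spark Connect server is reachable at this address.
    val spark = SparkSession.builder
      .remote("sc://localhost:15002")
      .getOrCreate()
    try println(countEvens(spark, 1000L))
    finally spark.stop()
```

Keeping transformations and actions apart like this makes it easier to see where network round trips are likely to happen.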

What it supports

spark-connect-scala3 implements the Spark Connect DataFrame, SQL, Structured Streaming, and Declarative Pipelines APIs. They are modeled directly on Apache Spark's Scala API (SparkSession, DataFrame, Column, functions, Dataset[T], ...), so existing Spark Scala code ports almost verbatim. Results decode through Apache Arrow into ordered, name-addressable Rows.

It also supports typed Datasets: df.as[T] and spark.createDataset(values) for case classes, tuples, primitives, Option, collections, and maps. Encoders run entirely on the client, so no closure is sent to the server.
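A minimal sketch of the two typed entry points named above. It makes one notable assumption: that encoders come into scope through spark.implicits, as they do in Apache Spark. This library may instead derive them automatically or need a different import, so treat the import line as a placeholder. Person, TypedDatasetSketch, roundTrip, and fromSql are hypothetical names used only here.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type, used only for this illustration.
case class Person(name: String, age: Int)

object TypedDatasetSketch:
  // Builds a typed Dataset from local values and collects it back.
  def roundTrip(spark: SparkSession): Seq[Person] =
    // ASSUMPTION: encoders are provided via spark.implicits, as in Apache Spark.
    import spark.implicits.{*, given}
    val people = spark.createDataset(Seq(Person("Ada", 36), Person("Linus", 54)))
    people.collect().toSeq

  // Attaches an encoder to an untyped result with as[T]; no closure is involved.
  def fromSql(spark: SparkSession): Seq[Person] =
    // ASSUMPTION: same encoder import as above.
    import spark.implicits.{*, given}
    spark.sql("SELECT 'Grace' AS name, 45 AS age").as[Person].collect().toSeq
```

In both methods the conversion between rows and Person values happens on the client, which is why these calls work without server-side closures.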

Not supported

The following all require running a user-provided JVM closure on the server (the same mechanism as a UDF), which Spark Connect for Scala 3 does not provide:

- User-defined functions: functions.udf, spark.udf.register, and typed Aggregator / UDAFs.
- Typed Dataset transformations whose argument is a Scala function: map, flatMap, mapPartitions, groupByKey (and its mapGroups / flatMapGroups / reduceGroups), and reduce. Note that as[T] and createDataset, which only attach an encoder and ship no closure, are supported.
- Structured Streaming foreach / foreachBatch sinks.

Also out of scope because they are not part of the Spark Connect protocol at all:

- The RDD API (Dataset.rdd, SparkContext, accumulators, broadcast variables).
- The MLlib-over-Connect surface.
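Logic that would otherwise live in a UDF or a typed map can often be written with built-in Column expressions, which travel as part of the plan instead of as a closure. The sketch below assumes when, otherwise, and lit behave as they do in Apache Spark's functions API, which this page says the library mirrors. NoUdfSketch and withParity are hypothetical names.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.*

object NoUdfSketch:
  // A UDF version would wrap a Scala closure such as
  //   (id: Long) => if id % 2 == 0 then "even" else "odd"
  // and that closure would have to run on the server.
  // The Column expression below encodes the same rule inside the plan.
  def withParity(spark: SparkSession, n: Long): DataFrame =
    spark.range(n).select(
      col("id"),
      when(col("id") % 2 === 0, lit("even"))
        .otherwise(lit("odd"))
        .as("parity")
    )

  @main def noUdfDemo(): Unit =
    // ASSUMPTION: a Spark Connect server is reachable at this address.
    val spark = SparkSession.builder
      .remote("sc://localhost:15002")
      .getOrCreate()
    try withParity(spark, 4L).show()
    finally spark.stop()
```

Not every closure can be rewritten this way, but conditionals, arithmetic, and string handling usually have equivalents in the functions library.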

Project facts

- Maven coordinates: com.github.hyukjinkwon :: spark-connect-scala3-client, built for Scala 3.3.x.
- Spark compatibility: built and tested against Apache Spark 4.0.x and 4.1.x (latest 4.1.2). The protobufs are sourced from Spark 4.1.2.
- Source: HyukjinKwon/spark-connect-scala3.
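In sbt, the coordinates above would translate to roughly the following. This is a sketch: it assumes the :: separator denotes an artifact with a Scala version suffix (sbt's %%), and <version> is a placeholder because this page does not name a release. The Installation guide covers the actual dependency and the JDK flags.

```scala
// build.sbt (sketch; "<version>" is a placeholder, not a real release number)
// ASSUMPTION: "::" in the coordinates corresponds to sbt's %% operator.
libraryDependencies += "com.github.hyukjinkwon" %% "spark-connect-scala3-client" % "<version>"
```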

Where to next

Guide	What is inside
Installation	The dependency, JDK flags, and a local server
Quickstart	Connecting and your first DataFrames
DataFrames	The full transformation and action surface
Columns and Functions	Expressions and the functions library
Data Sources	Reading and writing files and tables
SQL	Running SQL and using views
Catalog	Inspecting and managing metadata
Structured Streaming	Streaming sources, sinks, queries, and query listeners
Declarative Pipelines	Dataflow graphs
Configuration and Connection	Connection strings and runtime config
Examples	Runnable programs
API (Scaladoc)	Generated method-level reference