Spark Connect for Scala 3

A pure Scala 3 client for Apache Spark Connect: a gRPC-based DataFrame API that mirrors Apache Spark's own Scala API.

If you have written Spark in Scala, you already know most of this library. The client needs no Spark runtime, no spark-submit, and no local Spark installation, only a reachable Spark Connect server.

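import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.*

// Connect to a running Spark Connect server; no local Spark is started.
val spark = SparkSession.builder
  .remote("sc://localhost:15002")
  .appName("quickstart")
  .getOrCreate()

// Build the query on the client; show() sends it to the server and prints the result.
spark.range(10)
  .select(col("id"), (col("id") * 2).as("doubled"))
  .filter(col("id") % 2 === 0)
  .show()

// Close the session when done.
spark.stop()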

What is Spark Connect?

Classic Spark applications run your driver code inside the cluster's JVM. Spark Connect splits that apart: your program is a thin client that builds an unresolved logical plan and ships it to a remote server over gRPC. The server plans, optimizes, and executes the query, then streams results back as Apache Arrow batches.

  Your Scala 3 program            Spark Connect server            Spark cluster
  spark-connect-scala3   --gRPC-->   (plan + optimize)   ------>   (execute)
         ^                                                              |
         +--------------------- Arrow result batches ------------------+

Because the protocol is language-agnostic, the client can live in any language. This project is that client for Scala 3.
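
The split described above can be illustrated with a toy model in plain Scala 3. Everything in it (ToyConnect, Plan, ToyServer, ToyFrame) is hypothetical and exists only for this sketch; it is not the library's implementation and not the Spark Connect wire format. The point it tries to show is that transformations only grow a plan value on the client, and that an action is the single moment the plan is handed to the server for evaluation.

```scala
// Toy model only: none of these names exist in spark-connect-scala3 or in Spark.
object ToyConnect:
  type Row = Vector[Long]

  // A plan is a description of work, not the work itself.
  enum Plan:
    case RangeIds(count: Long)                       // ids 0 until count
    case AppendScaled(child: Plan, factor: Long)     // add a column: id * factor
    case FilterDivisible(child: Plan, divisor: Long) // keep rows where id % divisor == 0

  // Stand-in for the server: the only place where a plan is evaluated.
  final class ToyServer:
    var requests: Int = 0 // how many plans have been "shipped" so far

    def execute(plan: Plan): Vector[Row] =
      requests += 1
      eval(plan)

    private def eval(plan: Plan): Vector[Row] = plan match
      case Plan.RangeIds(count) =>
        (0L until count).map(id => Vector(id)).toVector
      case Plan.AppendScaled(child, factor) =>
        eval(child).map(row => row :+ (row(0) * factor))
      case Plan.FilterDivisible(child, divisor) =>
        eval(child).filter(row => row(0) % divisor == 0)

  // Stand-in for the client-side DataFrame: transformations wrap the plan
  // in another node and never contact the server.
  final case class ToyFrame(plan: Plan, server: ToyServer):
    def appendScaled(factor: Long): ToyFrame =
      copy(plan = Plan.AppendScaled(plan, factor))
    def filterDivisible(divisor: Long): ToyFrame =
      copy(plan = Plan.FilterDivisible(plan, divisor))
    // The action: the one round trip to the server.
    def collect(): Vector[Row] = server.execute(plan)

@main def toyDemo(): Unit =
  import ToyConnect.*
  val server = ToyServer()
  // Same shape as the quickstart: range, add a doubled column, keep even ids.
  val frame = ToyFrame(Plan.RangeIds(10), server).appendScaled(2).filterDivisible(2)
  println(s"requests before the action: ${server.requests}")
  val rows = frame.collect()
  println(s"requests after the action: ${server.requests}, rows: ${rows.size}")
```

In the real protocol the plan is a protobuf message sent over gRPC and the results come back as Arrow batches; the toy replaces both with in-process values to keep the sketch self-contained.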

What it supports

spark-connect-scala3 implements the Spark Connect DataFrame, SQL, Structured Streaming, and Declarative Pipelines APIs. They are modeled directly on Apache Spark's Scala API (SparkSession, DataFrame, Column, functions, Dataset[T], ...), so existing Spark Scala code ports almost verbatim. Results are decoded from Apache Arrow into ordered, name-addressable Rows.

It also supports typed Datasets: df.as[T] and spark.createDataset(values) for case classes, tuples, primitives, Option, collections, and maps. Encoders run entirely on the client, so no closure is sent to the server.

Not supported

The following all require running a user-provided JVM closure on the server (the same mechanism as a UDF), which Spark Connect for Scala 3 does not provide:

  • User-defined functions: functions.udf, spark.udf.register, and typed Aggregator / UDAFs.
  • Typed Dataset transformations whose argument is a Scala function: map, flatMap, mapPartitions, groupByKey (and its mapGroups / flatMapGroups / reduceGroups), and reduce. Note that as[T] and createDataset, which only attach an encoder and ship no closure, are supported.
  • Structured Streaming foreach / foreachBatch sinks.
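
Many closure-based transformations can be restated with built-in column expressions, which travel to the server as part of the plan rather than as JVM code. The following is a minimal sketch of that rewrite, not a documented recipe. It assumes the library exposes when, lit, sum, withColumn, groupBy, and agg with the same signatures as Apache Spark's Scala API (this page says the API is mirrored but does not list these members), and it assumes a server at sc://localhost:15002 as in the quickstart. The helper name parityTotals is hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.*

// Hypothetical helper, not part of the library. Every step is a Column
// expression, so nothing here needs a closure on the server.
// Assumption: these members match Apache Spark's Scala API.
def parityTotals(spark: SparkSession) =
  spark.range(10)
    // Stands in for a typed map such as ds.map(id => (id, id * 2)).
    .select(col("id"), (col("id") * 2).as("doubled"))
    // Stands in for a UDF that labels each row.
    .withColumn("parity", when(col("id") % 2 === 0, lit("even")).otherwise(lit("odd")))
    // Stands in for groupByKey(...).mapGroups(...) with a built-in aggregate.
    .groupBy(col("parity"))
    .agg(sum(col("doubled")).as("total"))

@main def withoutClosures(): Unit =
  val spark = SparkSession.builder
    .remote("sc://localhost:15002") // assumed address, as in the quickstart
    .appName("without-closures")
    .getOrCreate()
  try parityTotals(spark).show()
  finally spark.stop()
```

Logic that has no column-expression equivalent still falls under the unsupported list above.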

Also out of scope because they are not part of the Spark Connect protocol at all:

  • The RDD API (Dataset.rdd, SparkContext, accumulators, broadcast variables).
  • The MLlib-over-Connect surface.

Project facts

  • Maven coordinates: com.github.hyukjinkwon :: spark-connect-scala3-client, built for Scala 3.3.x.
  • Spark compatibility: built and tested against Apache Spark 4.0.x and 4.1.x (latest 4.1.2). The protobufs are sourced from Spark 4.1.2.
  • Source: HyukjinKwon/spark-connect-scala3.

Where to next

  Guide                         What is inside
  Installation                  The dependency, JDK flags, and a local server
  Quickstart                    Connecting and your first DataFrames
  DataFrames                    The full transformation and action surface
  Columns and Functions         Expressions and the functions library
  Data Sources                  Reading and writing files and tables
  SQL                           Running SQL and using views
  Catalog                       Inspecting and managing metadata
  Structured Streaming          Streaming sources, sinks, queries, and query listeners
  Declarative Pipelines         Dataflow graphs
  Configuration and Connection  Connection strings and runtime config
  Examples                      Runnable programs
  API (Scaladoc)                Generated method-level reference