spark-connect for Ruby
A production-ready, pure-Ruby client for Apache Spark Connect – a PySpark-style DataFrame API over gRPC.
If you have written PySpark, you already know most of this gem. There is no JVM, no Py4J, and no Spark installation on the client machine – only a reachable Spark Connect server.
Gem: spark-connect | Source: HyukjinKwon/spark-connect-ruby | Targets the Spark Connect 4.1 protocol; supports Apache Spark 3.5 and above.
What is Spark Connect?
Classic Spark applications run your driver code inside the cluster’s JVM. Spark Connect splits that apart: your program is a thin client that builds an unresolved logical plan and ships it to a remote server over gRPC. The server plans, optimizes, and executes the query, then streams results back as Apache Arrow batches.
flowchart LR
A["Your Ruby program<br/><b>spark-connect</b>"] -- "gRPC: logical plan" --> B["Spark Connect server"]
B -- "Arrow result batches" --> A
B --> C["Spark cluster<br/>(plan, optimize, execute)"]
Because the protocol is language-agnostic, the client can live in any language. This gem is that client for Ruby.
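To make the "thin client" idea concrete, here is a minimal pure-Ruby sketch of a lazy plan builder. It is not the gem's implementation: `PlanFrame`, its node layout, and the JSON encoding are all hypothetical stand-ins (the real protocol uses protobuf messages over gRPC). The point is only that transformations record a description of the work instead of doing it.

```ruby
require "json"

# Hypothetical toy stand-in for a DataFrame: it records operations, never runs them.
class PlanFrame
  attr_reader :plan

  def initialize(plan)
    @plan = plan
  end

  # Leaf relation: a range of integers, described rather than materialized.
  def self.range(stop)
    new({ op: "range", start: 0, stop: stop })
  end

  # Each transformation wraps the previous plan in a new node.
  def filter(condition)
    PlanFrame.new({ op: "filter", condition: condition, input: plan })
  end

  def select(*columns)
    PlanFrame.new({ op: "project", columns: columns, input: plan })
  end

  # What a client would serialize and send; JSON stands in for protobuf here.
  def to_wire
    JSON.generate(plan)
  end
end

frame = PlanFrame.range(10).filter("id % 2 = 0").select("id")
puts frame.to_wire
```

Nothing is computed on the client in this sketch; a server receiving the nested plan would be responsible for resolving column names, optimizing, and executing it.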
Feature highlights
- DataFrame API modeled on PySpark: select, filter/where, join, group_by/agg, order_by, union, distinct, window functions, set operations, sampling, pivot, and more.
- Snake_case Ruby idiom with camelCase aliases for high-traffic names (groupBy, withColumn, orderBy, createDataFrame, …), so PySpark snippets translate almost verbatim.
- Spark SQL via spark.sql(...), including named and positional parameters.
- A rich function library under SparkConnect::Functions (aliased SparkConnect::F).
- Typed schemas under SparkConnect::Types::*, with DDL strings and print_schema.
- Arrow-based decoding of results into Row objects (or a columnar Arrow::Table).
- Catalog, reader/writer, NA & stat helpers, observations, and window specs.
- Structured Streaming: streaming sources/sinks, triggers, output modes, watermarks, and a query manager.
- Declarative Pipelines (Spark 4.1+): dataflow graphs of tables, materialized views, and flows.
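The snake_case-plus-camelCase naming mentioned above can be pictured with plain Ruby aliasing. The sketch below uses a hypothetical `AliasedFrame` class, not the gem's DataFrame, and only assumes that an alias is a second name for the same method; how the gem actually wires its aliases is not shown here.

```ruby
# Hypothetical toy class; it is not the gem's DataFrame.
class AliasedFrame
  attr_reader :steps

  def initialize(steps = [])
    @steps = steps
  end

  def group_by(*cols)
    AliasedFrame.new(steps + [[:group_by, cols]])
  end

  def with_column(name, expr)
    AliasedFrame.new(steps + [[:with_column, name, expr]])
  end

  # camelCase spellings point at the same method bodies.
  alias_method :groupBy, :group_by
  alias_method :withColumn, :with_column
end

snake = AliasedFrame.new.with_column("doubled", "id * 2").group_by("doubled")
camel = AliasedFrame.new.withColumn("doubled", "id * 2").groupBy("doubled")
puts snake.steps == camel.steps # prints true
```

Either spelling records the same steps, which is why a PySpark-style chain and its Ruby-style equivalent can be treated as interchangeable.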
Install
gem install spark-connect
spark-connect decodes results with red-arrow, which requires the Apache Arrow GLib system libraries. See Installation for the one-line setup on macOS and Linux.
Quickstart
require "spark-connect"
F = SparkConnect::F
spark = SparkConnect::SparkSession.builder
  .remote("sc://localhost:15002")
  .app_name("quickstart")
  .get_or_create
df = spark.range(10)
  .select(F.col("id"), (F.col("id") * 2).alias("doubled"))
  .filter((F.col("id") % 2) == 0)
df.show
puts "rows: #{df.count}"
spark.stop
No server handy? The Installation guide shows how to start one locally in two commands.
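Expressions such as (F.col("id") % 2) == 0 in the quickstart do not compare anything on the client; they describe a computation for the server. A rough way to see how that can work in Ruby is operator overloading that returns a new expression object. The `ToyColumn` class below is a hypothetical illustration that renders SQL-like text; the gem's own Column type and its internal representation may differ.

```ruby
# Hypothetical toy expression node; the gem's Column class may differ.
class ToyColumn
  attr_reader :sql

  def initialize(sql)
    @sql = sql
  end

  # Wrap plain Ruby values as literals so `col * 2` works.
  def self.wrap(value)
    value.is_a?(ToyColumn) ? value : ToyColumn.new(value.inspect)
  end

  def *(other)
    binary("*", other)
  end

  def %(other)
    binary("%", other)
  end

  # Overriding == returns a new expression instead of true/false.
  def ==(other)
    binary("=", other)
  end

  def alias(name)
    ToyColumn.new("#{sql} AS #{name}")
  end

  private

  def binary(operator, other)
    ToyColumn.new("(#{sql} #{operator} #{ToyColumn.wrap(other).sql})")
  end
end

id_col = ToyColumn.new("id")
doubled = (id_col * 2).alias("doubled")
is_even = (id_col % 2) == 0
puts doubled.sql # (id * 2) AS doubled
puts is_even.sql # ((id % 2) = 0)
```

Because == yields an expression rather than a boolean, the parentheses around the modulo in the quickstart matter for readability more than precedence, and the result should be passed to filter rather than used in a Ruby if.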
Where to next
| Guide | What’s inside |
|---|---|
| Installation | Prerequisites, the gem, and a local server |
| Getting started | Connecting, sessions, your first DataFrames |
| DataFrames | The full transformation and action surface |
| Columns & Functions | Expressions and the F library |
| Aggregations & Windows | group_by, pivot, and analytic windows |
| Reading & Writing | Sources, sinks, and tables |
| Structured Streaming | Streaming sources/sinks, triggers, query management |
| Declarative Pipelines | Dataflow graphs of tables, views, and flows |
| Types & Schemas | The type system and value mapping |
| Configuration & Errors | Runtime config, observations, error handling |
| API reference | Full YARD method-level documentation |