spark-connect for Ruby

A production-ready, pure-Ruby client for Apache Spark Connect – a PySpark-style DataFrame API over gRPC.

If you have written PySpark, you already know most of this gem. There is no JVM, no Py4J, and no Spark installation on the client machine – only a reachable Spark Connect server.

Gem: spark-connect  |  Source: HyukjinKwon/spark-connect-ruby  |  Targets the Spark Connect 4.1 protocol; supports Apache Spark 3.5 and above.

What is Spark Connect?

Classic Spark applications run your driver code inside the cluster’s JVM. Spark Connect splits that apart: your program is a thin client that builds an unresolved logical plan and ships it to a remote server over gRPC. The server plans, optimizes, and executes the query, then streams results back as Apache Arrow batches.

flowchart LR
    A["Your Ruby program<br/><b>spark-connect</b>"] -- "gRPC: logical plan" --> B["Spark Connect server"]
    B -- "Arrow result batches" --> A
    B --> C["Spark cluster<br/>(plan, optimize, execute)"]

Because the protocol is language-agnostic, the client can live in any language. This gem is that client for Ruby.
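To make the thin-client idea concrete, here is a toy sketch in plain Ruby. It is not the gem's code: `ToyPlan` and its methods are hypothetical names invented for illustration, and JSON stands in for the protobuf messages that the real protocol sends over gRPC. It only shows that transformations can build a description of a query without running anything locally.

```ruby
require "json"

# Toy model only: ToyPlan is hypothetical and not part of the spark-connect
# gem. It mimics how a thin client records operations as a plan tree instead
# of executing them locally.
class ToyPlan
  attr_reader :op, :args, :child

  def initialize(op, args = {}, child = nil)
    @op = op
    @args = args
    @child = child
  end

  # Each transformation returns a new node that wraps the previous one.
  def filter(condition)
    ToyPlan.new(:filter, { condition: condition }, self)
  end

  def select(*columns)
    ToyPlan.new(:project, { columns: columns }, self)
  end

  # Nested hash form of the plan, ready to be serialized.
  def to_h
    node = { op: op, args: args }
    node[:child] = child.to_h if child
    node
  end

  # Stand-in for the wire format. The real protocol uses protobuf over gRPC,
  # not JSON.
  def serialize
    JSON.generate(to_h)
  end
end

# Building the plan performs no computation; it only nests nodes.
plan = ToyPlan.new(:range, { start: 0, stop: 10 })
              .filter("id % 2 = 0")
              .select("id")

puts plan.serialize
```

In this toy there is no server, so the plan is only printed. With a real Spark Connect server, the equivalent message would be sent for planning and execution, and results would come back as Arrow batches.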

Feature highlights

  • DataFrame API modeled on PySpark: select, filter/where, join, group_by/agg, order_by, union, distinct, window functions, set operations, sampling, pivot, and more.
  • Snake_case Ruby idiom with camelCase aliases for high-traffic names (groupBy, withColumn, orderBy, createDataFrame, …), so PySpark snippets translate almost verbatim.
  • Spark SQL via spark.sql(...), including named and positional parameters.
  • A rich function library under SparkConnect::Functions (aliased SparkConnect::F).
  • Typed schemas under SparkConnect::Types::*, with DDL strings and print_schema.
  • Arrow-based decoding of results into Row objects (or a columnar Arrow::Table).
  • Catalog, reader/writer, NA & stat helpers, observations, and window specs.
  • Structured Streaming: streaming sources/sinks, triggers, output modes, watermarks, and a query manager.
  • Declarative Pipelines (Spark 4.1+): dataflow graphs of tables, materialized views, and flows.
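
The snake_case names with camelCase aliases mentioned in the list above can be pictured with Ruby's `alias_method`. This is a minimal sketch under assumptions: `ToyFrame` is a hypothetical class, and nothing here is taken from the gem's source, which may define its aliases differently.

```ruby
# Toy class for illustration only; not the spark-connect gem's implementation.
class ToyFrame
  attr_reader :steps

  def initialize(steps = [])
    @steps = steps
  end

  # Idiomatic snake_case methods; each returns a new frame with one more step.
  def group_by(*cols)
    ToyFrame.new(steps + [[:group_by, cols]])
  end

  def with_column(name, expr)
    ToyFrame.new(steps + [[:with_column, name, expr]])
  end

  # camelCase names point at the same method bodies.
  alias_method :groupBy, :group_by
  alias_method :withColumn, :with_column
end

snake = ToyFrame.new.with_column("x", "id * 2").group_by("x")
camel = ToyFrame.new.withColumn("x", "id * 2").groupBy("x")

# Both spellings record identical steps.
puts snake.steps == camel.steps
```

Because the alias shares the original method body, a snippet written with either spelling behaves the same way.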

Install

gem install spark-connect

spark-connect decodes results with red-arrow, which requires the Apache Arrow GLib system libraries. See Installation for the one-line setup on macOS and Linux.

Quickstart

require "spark-connect"

F = SparkConnect::F

spark = SparkConnect::SparkSession.builder
                                  .remote("sc://localhost:15002")
                                  .app_name("quickstart")
                                  .get_or_create

df = spark.range(10)
          .select(F.col("id"), (F.col("id") * 2).alias("doubled"))
          .filter((F.col("id") % 2) == 0)

df.show
puts "rows: #{df.count}"
spark.stop

No server handy? The Installation guide shows how to start one locally in two commands.
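
The `sc://localhost:15002` string passed to `remote` in the quickstart follows the Spark Connect connection-string format, which the Apache Spark documentation describes as `sc://host:port/;key=value;...` with 15002 as the default port. The sketch below shows how such a string breaks down. `parse_remote` is a hypothetical helper written for this page, not a method of the gem, and it does no validation beyond the basic shape.

```ruby
# Hypothetical helper, not part of the spark-connect gem: splits a Spark
# Connect connection string into host, port, and ";key=value" parameters.
def parse_remote(url, default_port: 15002)
  match = url.match(%r{\Asc://([^:/;]+)(?::(\d+))?(?:/(.*))?\z})
  raise ArgumentError, "not a Spark Connect URL: #{url}" unless match

  host = match[1]
  port = match[2] ? match[2].to_i : default_port

  # Parameters follow the path separator as ";key=value" pairs.
  params = (match[3] || "").split(";").reject(&:empty?).to_h do |pair|
    key, value = pair.split("=", 2)
    [key, value]
  end

  { host: host, port: port, params: params }
end

p parse_remote("sc://localhost:15002")
p parse_remote("sc://spark.example.com/;use_ssl=true;token=abc")
```

The second call uses an example hostname and a placeholder token; `use_ssl` and `token` are parameter names from the Spark documentation, but whether this gem honors each of them is not stated here.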

Where to next

| Guide | What’s inside |
| --- | --- |
| Installation | Prerequisites, the gem, and a local server |
| Getting started | Connecting, sessions, your first DataFrames |
| DataFrames | The full transformation and action surface |
| Columns & Functions | Expressions and the F library |
| Aggregations & Windows | group_by, pivot, and analytic windows |
| Reading & Writing | Sources, sinks, and tables |
| Structured Streaming | Streaming sources/sinks, triggers, query management |
| Declarative Pipelines | Dataflow graphs of tables, views, and flows |
| Types & Schemas | The type system and value mapping |
| Configuration & Errors | Runtime config, observations, error handling |
| API reference | Full YARD method-level documentation |