Changelog

All notable changes to this project are documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

Unreleased

0.2.0 - 2026-06-10

Added

  • Declarative Pipelines (Spark 4.1+): SparkSession#pipeline returns a Pipeline (dataflow graph). Define outputs (create_table, create_materialized_view, create_temporary_view, create_sink) and flows (define_flow, define_sql), then start_run (with full_refresh/refresh/ dry/storage) which streams PipelineEvents; plus read and drop.
  • Structured Streaming: SparkSession#read_stream (DataStreamReader), DataFrame#write_stream (DataStreamWriter), StreamingQuery (status, recent/last progress, await_termination, process_all_available, stop, exception, explain), and SparkSession#streams (StreamingQueryManager: active, get, await_any_termination, reset_terminated). Supports triggers (processing-time, once, available-now, continuous), output modes, and file/console/memory/Kafka sinks. (foreach/foreachBatch and UDFs remain unsupported pending finalized protobuf definitions.)
  • Temporary views: DataFrame#create_temp_view / create_or_replace_temp_view / create_global_temp_view / create_or_replace_global_temp_view.
  • DataFrame#with_watermark, repartition_by_range, checkpoint / local_checkpoint, transform, col_regex, to_json, and each / to_local_iterator.
  • Catalog#create_table and Catalog#create_external_table.
  • SparkSession#new_session, interrupt_all / interrupt_tag / interrupt_operation, and operation tags (add_tag / remove_tag / get_tags / clear_tags) propagated on every execution.
  • Regenerated the vendored protobuf/gRPC stubs against Spark Connect 4.1.0 (adds pipelines.proto).
  • bin/console (an IRB session with spark/F/T) and bin/setup.

0.1.0 - 2026-06-10

Initial release.

Added

  • SparkConnect::SparkSession and its fluent Builder (remote, app_name, config, get_or_create/create), plus range, sql, table, read, create_data_frame, conf, catalog, version, and stop.
  • SparkConnect::DataFrame with the core transformation and action surface: select/select_expr, filter/where, with_column(s), with_column(s)_renamed, drop, distinct/drop_duplicates, order_by/sort/sort_within_partitions, limit/offset, group_by/rollup/cube/agg, join/cross_join, union/union_by_name/intersect/except/subtract, repartition/coalesce, sample, alias, hint, unpivot, to/to_df, collect/take/head/first/count/show/to_arrow, and plan introspection (schema, columns, dtypes, print_schema, explain).
  • SparkConnect::Column with Ruby operator overloads, aliasing, casting, sort ordering, predicates, when/otherwise, complex-type access, and over.
  • SparkConnect::Functions (F): a broad PySpark-compatible function library including aggregate, math, string, date/time, collection, JSON, conditional, hashing, and higher-order (lambda) functions.
  • SparkConnect::Window/WindowSpec for analytic window definitions.
  • SparkConnect::GroupedData, DataFrameNaFunctions (na), DataFrameStatFunctions (stat), and Observation.
  • SparkConnect::DataFrameReader, DataFrameWriter, and DataFrameWriterV2.
  • SparkConnect::Catalog and SparkConnect::RuntimeConfig.
  • SparkConnect::Types: a full Spark SQL type system with proto conversion, simpleString/DDL/JSON rendering, and schema trees.
  • SparkConnect::Row with positional, by-name, and method-style access.
  • Apache Arrow IPC result decoding and local-relation encoding via red-arrow.
  • gRPC client (SparkConnectClient, ChannelBuilder) with sc:// connection string parsing, TLS/bearer-token auth, and retry-with-backoff.
  • A structured error hierarchy (Error, ConnectionError, SparkConnectError, AnalysisError, ParseError, ...).
  • Vendored Spark Connect 4.1 protobuf/gRPC definitions and a regeneration script (bin/generate-protos).