Changelog

All notable changes to this project are documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

Unreleased

0.3.0 - 2026-06-15

Fixed

Correct TimestampType instants in create_data_frame. Timestamp columns were shipped as zone-less Arrow timestamps, so the server interpreted the epoch micros as session-local wall-clock and shifted the value by the session time zone. They are now tagged UTC; TimestampNTZType remains zone-less.
RuntimeConfig#get with a non-String default no longer raises. A non-String default (e.g. conf.get(key, 8)) was passed straight into a protobuf string field, raising Google::Protobuf::TypeError. The default is now coerced to a String, matching #set.
No duplicate rows when an execute stream is retried. The result accumulator was created outside the retry loop, so a mid-stream gRPC failure replayed already-consumed Arrow batches and duplicated rows on retry. The accumulator is now reset per attempt.
DataFrame#drop_duplicates_within_watermark is now watermark-aware. It was a plain alias of #drop_duplicates and never set the within_watermark flag, so it silently performed an ordinary deduplication. Added a dropDuplicatesWithinWatermark alias.
SparkSession::Builder#app_name is now applied. create explicitly skipped spark.app.name, making #app_name a no-op; all builder options are now forwarded to the new session.
Corrected misleading doc comments for Functions#nanvl and DataFrame#except_all.
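
The execute-stream retry fix above comes down to where the accumulator is created. The following is a minimal, self-contained sketch of that pattern, not the gem's actual client code: TransientStreamError and collect_with_retry are hypothetical names introduced only for illustration, and the block stands in for consuming Arrow batches from a stream.

```ruby
# Hypothetical stand-in for a transient mid-stream gRPC failure.
class TransientStreamError < StandardError; end

# Hypothetical helper: runs the block up to max_attempts times and returns
# the batches it collected. The accumulator is created inside the retried
# section, so every attempt starts from an empty result.
def collect_with_retry(max_attempts: 3)
  attempts = 0
  begin
    attempts += 1
    batches = [] # reset per attempt; creating this above `begin` would keep stale batches
    yield(batches, attempts)
    batches
  rescue TransientStreamError
    retry if attempts < max_attempts
    raise
  end
end

# Simulate a stream that fails after the first batch on the first attempt.
rows = collect_with_retry do |batches, attempt|
  batches << "batch-1"
  raise TransientStreamError, "stream dropped" if attempt == 1
  batches << "batch-2"
end

p rows # => ["batch-1", "batch-2"]
```

If the accumulator were created once before the begin block, the first attempt's "batch-1" would survive the retry and appear twice in the result.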

0.2.0 - 2026-06-10

Added

Declarative Pipelines (Spark 4.1+): SparkSession#pipeline returns a Pipeline (dataflow graph). Define outputs (create_table, create_materialized_view, create_temporary_view, create_sink) and flows (define_flow, define_sql), then call start_run (with full_refresh/refresh/dry/storage), which streams PipelineEvents; plus read and drop.
Structured Streaming: SparkSession#read_stream (DataStreamReader), DataFrame#write_stream (DataStreamWriter), StreamingQuery (status, recent/last progress, await_termination, process_all_available, stop, exception, explain), and SparkSession#streams (StreamingQueryManager: active, get, await_any_termination, reset_terminated). Supports triggers (processing-time, once, available-now, continuous), output modes, and file/console/memory/Kafka sinks. (foreach/foreachBatch and UDFs remain unsupported pending finalized protobuf definitions.)
Temporary views: DataFrame#create_temp_view / create_or_replace_temp_view / create_global_temp_view / create_or_replace_global_temp_view.
DataFrame#with_watermark, repartition_by_range, checkpoint / local_checkpoint, transform, col_regex, to_json, and each / to_local_iterator.
Catalog#create_table and Catalog#create_external_table.
SparkSession#new_session, interrupt_all / interrupt_tag / interrupt_operation, and operation tags (add_tag / remove_tag / get_tags / clear_tags) propagated on every execution.
Regenerated the vendored protobuf/gRPC stubs against Spark Connect 4.1.0 (adds pipelines.proto).
bin/console (an IRB session with spark/F/T) and bin/setup.

0.1.0 - 2026-06-10

Initial release.

Added

SparkConnect::SparkSession and its fluent Builder (remote, app_name, config, get_or_create/create), plus range, sql, table, read, create_data_frame, conf, catalog, version, and stop.
SparkConnect::DataFrame with the core transformation and action surface: select/select_expr, filter/where, with_column(s), with_column(s)_renamed, drop, distinct/drop_duplicates, order_by/sort/sort_within_partitions, limit/offset, group_by/rollup/cube/agg, join/cross_join, union/union_by_name/intersect/except/subtract, repartition/coalesce, sample, alias, hint, unpivot, to/to_df, collect/take/head/first/count/show/to_arrow, and plan introspection (schema, columns, dtypes, print_schema, explain).
SparkConnect::Column with Ruby operator overloads, aliasing, casting, sort ordering, predicates, when/otherwise, complex-type access, and over.
SparkConnect::Functions (F): a broad PySpark-compatible function library including aggregate, math, string, date/time, collection, JSON, conditional, hashing, and higher-order (lambda) functions.
SparkConnect::Window/WindowSpec for analytic window definitions.
SparkConnect::GroupedData, DataFrameNaFunctions (na), DataFrameStatFunctions (stat), and Observation.
SparkConnect::DataFrameReader, DataFrameWriter, and DataFrameWriterV2.
SparkConnect::Catalog and SparkConnect::RuntimeConfig.
SparkConnect::Types: a full Spark SQL type system with proto conversion, simpleString/DDL/JSON rendering, and schema trees.
SparkConnect::Row with positional, by-name, and method-style access.
Apache Arrow IPC result decoding and local-relation encoding via red-arrow.
gRPC client (SparkConnectClient, ChannelBuilder) with sc:// connection string parsing, TLS/bearer-token auth, and retry-with-backoff.
A structured error hierarchy (Error, ConnectionError, SparkConnectError, AnalysisError, ParseError, ...).
Vendored Spark Connect 4.1 protobuf/gRPC definitions and a regeneration script (bin/generate-protos).
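
For readers unfamiliar with the sc:// connection strings mentioned above, the sketch below shows roughly how such a string breaks down. It assumes the general Spark Connect form sc://host:port/;key=value;key=value with 15002 as the default port. It is not the gem's ChannelBuilder: ConnectionInfo and parse_connection_string are hypothetical names, and IPv6 hosts and URL-escaped values are deliberately not handled.

```ruby
# Hypothetical value object for illustration; not part of the gem's API.
ConnectionInfo = Struct.new(:host, :port, :params)

# Assumed default, per the Spark Connect connection string convention.
DEFAULT_SPARK_CONNECT_PORT = 15_002

# Hypothetical parser: splits an sc:// string into host, port, and the
# ";key=value" parameters that follow the path separator.
def parse_connection_string(url)
  match = %r{\Asc://([^:/;]+)(?::(\d+))?(?:/(.*))?\z}.match(url)
  raise ArgumentError, "not a valid sc:// connection string: #{url.inspect}" unless match

  host = match[1]
  port = match[2] ? Integer(match[2], 10) : DEFAULT_SPARK_CONNECT_PORT
  params = match[3].to_s.split(";").reject(&:empty?).to_h do |pair|
    key, value = pair.split("=", 2)
    [key, value.to_s]
  end

  ConnectionInfo.new(host, port, params)
end

info = parse_connection_string("sc://localhost:15002/;token=abc;use_ssl=true")
p info.host   # => "localhost"
p info.port   # => 15002
p info.params # => {"token"=>"abc", "use_ssl"=>"true"}
```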