Changelog
All notable changes to this project are documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
Unreleased
0.2.0 - 2026-06-10
Added
- Declarative Pipelines (Spark 4.1+):
SparkSession#pipelinereturns aPipeline(dataflow graph). Define outputs (create_table,create_materialized_view,create_temporary_view,create_sink) and flows (define_flow,define_sql), thenstart_run(withfull_refresh/refresh/dry/storage) which streamsPipelineEvents; plusreadanddrop. - Structured Streaming:
SparkSession#read_stream(DataStreamReader),DataFrame#write_stream(DataStreamWriter),StreamingQuery(status, recent/last progress, await_termination, process_all_available, stop, exception, explain), andSparkSession#streams(StreamingQueryManager: active, get, await_any_termination, reset_terminated). Supports triggers (processing-time, once, available-now, continuous), output modes, and file/console/memory/Kafka sinks. (foreach/foreachBatchand UDFs remain unsupported pending finalized protobuf definitions.) - Temporary views:
DataFrame#create_temp_view/create_or_replace_temp_view/create_global_temp_view/create_or_replace_global_temp_view. DataFrame#with_watermark,repartition_by_range,checkpoint/local_checkpoint,transform,col_regex,to_json, andeach/to_local_iterator.Catalog#create_tableandCatalog#create_external_table.SparkSession#new_session,interrupt_all/interrupt_tag/interrupt_operation, and operation tags (add_tag/remove_tag/get_tags/clear_tags) propagated on every execution.- Regenerated the vendored protobuf/gRPC stubs against Spark Connect 4.1.0
(adds
pipelines.proto). bin/console(an IRB session withspark/F/T) andbin/setup.
0.1.0 - 2026-06-10
Initial release.
Added
SparkConnect::SparkSessionand its fluentBuilder(remote,app_name,config,get_or_create/create), plusrange,sql,table,read,create_data_frame,conf,catalog,version, andstop.SparkConnect::DataFramewith the core transformation and action surface:select/select_expr,filter/where,with_column(s),with_column(s)_renamed,drop,distinct/drop_duplicates,order_by/sort/sort_within_partitions,limit/offset,group_by/rollup/cube/agg,join/cross_join,union/union_by_name/intersect/except/subtract,repartition/coalesce,sample,alias,hint,unpivot,to/to_df,collect/take/head/first/count/show/to_arrow, and plan introspection (schema,columns,dtypes,print_schema,explain).SparkConnect::Columnwith Ruby operator overloads, aliasing, casting, sort ordering, predicates,when/otherwise, complex-type access, andover.SparkConnect::Functions(F): a broad PySpark-compatible function library including aggregate, math, string, date/time, collection, JSON, conditional, hashing, and higher-order (lambda) functions.SparkConnect::Window/WindowSpecfor analytic window definitions.SparkConnect::GroupedData,DataFrameNaFunctions(na),DataFrameStatFunctions(stat), andObservation.SparkConnect::DataFrameReader,DataFrameWriter, andDataFrameWriterV2.SparkConnect::CatalogandSparkConnect::RuntimeConfig.SparkConnect::Types: a full Spark SQL type system with proto conversion,simpleString/DDL/JSON rendering, and schema trees.SparkConnect::Rowwith positional, by-name, and method-style access.- Apache Arrow IPC result decoding and local-relation encoding via
red-arrow. - gRPC client (
SparkConnectClient,ChannelBuilder) withsc://connection string parsing, TLS/bearer-token auth, and retry-with-backoff. - A structured error hierarchy (
Error,ConnectionError,SparkConnectError,AnalysisError,ParseError, ...). - Vendored Spark Connect 4.1 protobuf/gRPC definitions and a regeneration script
(
bin/generate-protos).