writing
Technical writing, mostly on the Databricks blog: Apache Spark release announcements, PySpark deep-dives, and posts on Koalas / pandas-on-Spark. The SIGMOD 2025 paper is listed first; everything else is newest first.
-
Databricks Lakeguard: Supporting Fine-Grained Access Control and Multi-User Capabilities for Apache Spark Workloads SIGMOD 2025 (paper), 2025-06
SIGMOD 2025 industry paper. Describes the unified governance system that uses Spark Connect as a JDBC-like execution protocol to separate client applications from the Spark server, enforce fine-grained access policies, and isolate user code within the cluster manager.
-
Introducing Apache Spark 4.1 Databricks, 2025-12-22
Apache Spark 4.1 in Databricks Runtime 18.0 Beta: Spark Declarative Pipelines, Real-Time Mode for streaming, PySpark improvements.
-
Introducing Apache Spark 4.0 Databricks, 2025-05-28
Spark 4.0 in DBR 17.0: Spark Connect multi-language clients (Go, Swift, Rust), VARIANT type, Python improvements.
-
PySpark in 2023: A Year in Review Databricks, 2024-03-25
Recap of PySpark in 2023: Spark Connect, Arrow-optimized UDFs, English SDK, the PySpark test framework.
-
Parameterized queries with PySpark Databricks, 2024-01-03
Parameterized SQL query API in PySpark for safer, more reusable SQL templates that help prevent SQL injection.
-
Python Dependency Management in Spark Connect Databricks, 2023-11-14
Managing per-session Python dependencies in Spark Connect with virtualenv and conda archives.
-
Arrow-optimized Python UDFs in Apache Spark 3.5 Databricks, 2023-11-06
Apache Arrow-based serialization speeds up regular Python UDFs in Spark 3.5 and DBR 14.0.
-
Introducing Apache Spark 3.5 Databricks, 2023-09-15
Spark Connect GA in Scala, DeepSpeed distributor, RocksDB improvements, PySpark error class migration.
-
Spark Connect Available in Apache Spark 3.4 Databricks, 2023-04-18
Introduces the decoupled client/server Spark Connect architecture shipping in Spark 3.4.
-
Introducing Apache Spark 3.4 for Databricks Runtime 13.0 Databricks, 2023-04-14
Spark 3.4 in DBR 13.0: Spark Connect, PyTorch distributor, pandas 2.0 support.
-
Python Arbitrary Stateful Processing in Structured Streaming Databricks, 2022-10-18
applyInPandasWithState for arbitrary stateful streaming aggregations in PySpark.
-
Introducing Apache Spark 3.3 for Databricks Runtime 11.0 Databricks, 2022-06-15
Spark 3.3 in DBR 11.0: row-level Bloom filters and broader pandas API coverage.
-
How to Monitor Streaming Queries in PySpark Databricks, 2022-05-27
Using PySpark's Observable API to ship Structured Streaming metrics to external monitoring systems.
-
Introducing Apache Spark 3.2 Databricks, 2021-10-19
Spark 3.2 in DBR 10.0: pandas API on Spark, ANSI SQL improvements, RocksDB state store.
-
Pandas API on Apache Spark 3.2 Databricks, 2021-10-04
Koalas merged into PySpark as the official pandas API on Spark in 3.2.
-
Benchmark: Koalas (PySpark) and Dask Databricks, 2021-04-07
Koalas vs Dask benchmark: Koalas ran roughly 4 to 25 times faster than Dask, depending on the workload.
-
Introducing Apache Spark 3.1 Databricks, 2021-03-02
Spark 3.1 in DBR 8.0: Python usability, ANSI SQL, query optimizer improvements.
-
How to Manage Python Dependencies in PySpark Databricks, 2020-12-22
Shipping Python packages with PySpark jobs via PEX, conda-pack, and venv-pack archives.
-
An Update on Project Zen: Improving Apache Spark for Python Users Databricks, 2020-09-04
Project Zen update: PySpark docs redesign, type hints, classified error handling, install profiles.
-
Interoperability between Koalas and Apache Spark Databricks, 2020-08-11
Interchanging data and operations between Koalas DataFrames and PySpark DataFrames.
-
A Comprehensive Look at Dates and Timestamps in Apache Spark 3.0 Databricks, 2020-07-22
How to use dates and timestamps effectively in Spark 3.0: calendars, time zones, and the switch to the proleptic Gregorian calendar.
-
Koalas 1.0: Scale Pandas with Apache Spark Databricks, 2020-06-24
Koalas 1.0 with about 80 percent pandas API coverage, Spark 3.0 support, new Spark accessor.
-
Vectorized R I/O in Upcoming Apache Spark 3.0 Databricks, 2020-06-01
Arrow-based vectorization for SparkR gapply, dapply, and DataFrame I/O in Spark 3.0.
-
New Pandas UDFs and Python Type Hints in the Upcoming Release of Apache Spark 3.0 Databricks, 2020-05-20
Redesigned Pandas UDF API built on Python type hints for clearer, more Pythonic UDFs.
-
10 Minutes from pandas to Koalas on Apache Spark Databricks, 2020-03-31
Quick-start tutorial mapping common pandas operations to their Koalas equivalents.
-
Integrating Apache Hive with Apache Spark - Hive Warehouse Connector Cloudera Community (formerly Hortonworks), 2018-10-03
The Hive Warehouse Connector for reading and writing data between Spark and Hive.