writing | Hyukjin Kwon

Databricks Lakeguard: Supporting Fine-Grained Access Control and Multi-User Capabilities for Apache Spark Workloads SIGMOD 2025 (paper), 2025-06

SIGMOD 2025 industry paper. Describes the unified governance system that uses Spark Connect as a JDBC-like execution protocol to separate client applications from the Spark server, enforce fine-grained access policies, and isolate user code within the cluster manager.

Introducing Apache Spark 4.1 Databricks, 2025-12-22

Apache Spark 4.1 in Databricks Runtime 18.0 Beta: Spark Declarative Pipelines, Real-Time Mode for streaming, PySpark improvements.

Introducing Apache Spark 4.0 Databricks, 2025-05-28

Spark 4.0 in DBR 17.0: Spark Connect multi-language clients (Go, Swift, Rust), VARIANT type, Python improvements.

PySpark in 2023: A Year in Review Databricks, 2024-03-25

Recap of PySpark in 2023: Spark Connect, Arrow-optimized UDFs, English SDK, the PySpark test framework.

Parameterized queries with PySpark Databricks, 2024-01-03

Parameterized SQL query API in PySpark for safer, more reusable SQL templates that help prevent SQL injection.

Python Dependency Management in Spark Connect Databricks, 2023-11-14

Managing per-session Python dependencies in Spark Connect with virtualenv and conda archives.

Arrow-optimized Python UDFs in Apache Spark 3.5 Databricks, 2023-11-06

Apache Arrow-based serialization speeds up regular Python UDFs in Spark 3.5 and DBR 14.0.

Introducing Apache Spark 3.5 Databricks, 2023-09-15

Spark Connect GA in Scala, DeepSpeed distributor, RocksDB improvements, PySpark error class migration.

Spark Connect Available in Apache Spark 3.4 Databricks, 2023-04-18

Introduces the decoupled client/server Spark Connect architecture shipping in Spark 3.4.

Introducing Apache Spark 3.4 for Databricks Runtime 13.0 Databricks, 2023-04-14

Spark 3.4 in DBR 13.0: Spark Connect, PyTorch distributor, pandas 2.0 support.

Python Arbitrary Stateful Processing in Structured Streaming Databricks, 2022-10-18

applyInPandasWithState for arbitrary stateful streaming aggregations in PySpark.

Introducing Apache Spark 3.3 for Databricks Runtime 11.0 Databricks, 2022-06-15

Spark 3.3 in DBR 11.0: row-level runtime filtering with Bloom filters and broader pandas API coverage.

How to Monitor Streaming Queries in PySpark Databricks, 2022-05-27

Using PySpark's Observable API to ship Structured Streaming metrics to external monitoring systems.

Introducing Apache Spark 3.2 Databricks, 2021-10-19

Spark 3.2 in DBR 10.0: pandas API on Spark, ANSI SQL improvements, RocksDB state store.

Pandas API on Apache Spark 3.2 Databricks, 2021-10-04

Koalas merged into PySpark as the official pandas API on Spark in 3.2.

Benchmark: Koalas (PySpark) and Dask Databricks, 2021-04-07

Koalas vs. Dask benchmark: Koalas ran roughly 4 to 25 times faster than Dask, depending on workload.

Introducing Apache Spark 3.1 Databricks, 2021-03-02

Spark 3.1 in DBR 8.0: Python usability, ANSI SQL, query optimizer improvements.

How to Manage Python Dependencies in PySpark Databricks, 2020-12-22

Shipping Python packages with PySpark jobs via PEX, conda-pack, and venv-pack archives.

An Update on Project Zen: Improving Apache Spark for Python Users Databricks, 2020-09-04

Project Zen update: PySpark docs redesign, type hints, classified error handling, install profiles.

Interoperability between Koalas and Apache Spark Databricks, 2020-08-11

Interchanging data and operations between Koalas DataFrames and PySpark DataFrames.

A Comprehensive Look at Dates and Timestamps in Apache Spark 3.0 Databricks, 2020-07-22

How to use dates and timestamps effectively in Spark 3.0: calendars, time zones, and the switch to the proleptic Gregorian calendar.

Koalas 1.0: Scale Pandas with Apache Spark Databricks, 2020-06-24

Koalas 1.0 with about 80 percent pandas API coverage, Spark 3.0 support, new Spark accessor.

Vectorized R I/O in Upcoming Apache Spark 3.0 Databricks, 2020-06-01

Arrow-based vectorization for SparkR gapply, dapply, and DataFrame I/O in Spark 3.0.

New Pandas UDFs and Python Type Hints in the Upcoming Release of Apache Spark 3.0 Databricks, 2020-05-20

Redesigned Pandas UDF API built on Python type hints for clearer, more Pythonic UDFs.

10 Minutes from pandas to Koalas on Apache Spark Databricks, 2020-03-31

Quick-start tutorial mapping common pandas operations to their Koalas equivalents.

Integrating Apache Hive with Apache Spark - Hive Warehouse Connector Cloudera Community (formerly Hortonworks), 2018-10-03

The Hive Warehouse Connector for reading and writing data between Spark and Hive.