talks | Hyukjin Kwon

No-Code Change in Your Python UDF for Arrow Optimization Jun 10, 2025

Data + AI Summit 2025

How Arrow-optimized Python UDFs in Apache Spark deliver large speedups for existing Python UDFs without any user code change.

Make Your Python UDF 2x Faster Without Changing Anything (아무것도 안고치고 Python UDF 2배 빠르게 만들기) Apr 29, 2025

Data Intelligence Day 2025

Korean-language talk. How Arrow-optimized Python UDFs in Apache Spark make existing Python UDFs 2x faster with no code changes.

Profile, debug and monitor my PySpark workloads Oct 26, 2024

PyCon APAC 2024

How to profile, debug, and monitor PySpark workloads in distributed environments using cProfile, the Spark UI, and observable streaming metrics.

How do I debug my PySpark workloads? 2024

PyCon Hong Kong, with Allison Wang

Practical methods for debugging and profiling PySpark applications in distributed environments using cProfile and other standard tools.

Demystifying pandas with PySpark when scaling out Jul 29, 2024

PyData Vermont 2024

Walking through how to scale pandas workloads with the pandas-on-Spark API in PySpark, what changes for distributed execution, and the practical pitfalls when moving from local pandas to a Spark cluster.

Dependency Management in Spark Connect: Simple, Isolated, Powerful Jun 12, 2024

Data + AI Summit 2024, with Akhil Gudesa

How Spark Connect simplifies dependency management in distributed environments by packaging and updating custom Python and Scala environments per session.

From Open Source to Silicon Valley (오픈소스로 시작해서 실리콘밸리까지) Apr 23, 2024

Databricks

Korean-language talk. A career path through Apache Spark, and how open-source contributions led to Databricks.

Scaling pandas to any size with PySpark Aug 17, 2023

EuroSciPy 2023, Switzerland

Scaling pandas workloads to arbitrary data sizes using the pandas API on Spark in PySpark.

Scaling Data Workloads with pandas and PySpark (pandas와 PySpark로 데이터 워크로드 확장하기) Aug 12, 2023

PyCon Korea 2023, South Korea

Korean-language talk. How to scale data workloads by combining pandas and PySpark.

Python with Spark Connect Jun 29, 2023

Data + AI Summit 2023, San Francisco

Using Python with Spark Connect, the decoupled client/server architecture introduced in Spark 3.4, and the developer-experience improvements it enables.

Lakehouse / Spark AMA Jun 29, 2023

Data + AI Summit 2023, San Francisco

Live AMA covering Apache Spark, Spark Connect, and the lakehouse architecture with several Spark committers.

Scaling data workloads using the best of both worlds: pandas and Spark Jun 20, 2023

PyData Seattle 2023, with Chengyin Eng

How to combine pandas and PySpark idiomatically to scale data analysis workloads, with implementation details and best-practice guidance for analysts and scientists.

Easily Use PySpark Remotely from Anywhere with Spark Connect (Spark Connect로 어디서든 쉽게 원격으로 PySpark 사용하기) Apr 25, 2023

Databricks

Korean-language talk. How Spark Connect makes it easy to run PySpark remotely from anywhere.

PySpark in Apache Spark 3.3 and Beyond Jun 29, 2022

Data + AI Summit 2022, San Francisco, with Xinrong Meng

PySpark improvements in Spark 3.3 and the roadmap ahead: default index support for the pandas API, type hints in source, UDF profiler, and upcoming Structured Streaming and Arrow work.

Databricks Korea Lakehouse Day 2022 Apr 20, 2022

Databricks

Project Zen: Making Data Science Easier in PySpark May 26, 2021

Data + AI Summit 2021, San Francisco

Project Zen: making PySpark more Pythonic with better docs, type hints, error messages, and pandas interoperability.

Pandas UDF and Python Type Hint in Apache Spark 3.0 Jun 24, 2020

Spark + AI Summit 2020

Introducing the redesigned Pandas UDF API in Spark 3.0: type hints, the new Pandas Function API, and the rationale for the redesign.

Vectorized R Execution in Apache Spark Oct 16, 2019

Spark + AI Summit Europe 2019

Vectorization in Apache Spark: Arrow-based columnar exchange, Pandas UDFs, and the SparkR performance work that brought vectorized gapply / dapply and DataFrame I/O to SparkR.

What's New in Apache Spark 2.3 and Spark 2.4 Oct 11, 2018

Dataworks 2019, Singapore

Walkthrough of Spark 2.3 and 2.4 highlights: Data Source API V2, vectorized ORC reader, Pandas UDFs, continuous processing in Structured Streaming, Kubernetes support, and barrier execution mode.