talks | Hyukjin Kwon

No-Code Change in Your Python UDF for Arrow Optimization 2025

Data + AI Summit

How Arrow-optimized Python UDFs in Apache Spark deliver large speedups for existing Python UDFs without any user code change.

아무것도 안고치고 Python UDF 2배 빠르게 만들기 2025

AWS Summit Seoul (probable)

Korean-language talk. Make your Python UDF 2x faster without changing anything: Arrow-optimized Python UDFs in Apache Spark.

Profile, debug and monitor my PySpark workloads 2024

PyCon APAC (probable)

How to profile, debug, and monitor PySpark workloads in distributed environments using cProfile, the Spark UI, and observable streaming metrics.

How do I debug my PySpark workloads? 2024

PyCon Hong Kong, with Allison Wang

Practical methods for debugging and profiling PySpark applications in distributed environments using cProfile and other standard tools.

Demystifying pandas with PySpark when scaling out 2024

PyData Vermont

A walkthrough of scaling pandas workloads with the pandas-on-Spark API in PySpark: what changes under distributed execution, and the practical pitfalls of moving from local pandas to a Spark cluster.

Dependency Management in Spark Connect: Simple, Isolated, Powerful 2024

Data + AI Summit, with Akhil Gudesa

How Spark Connect simplifies dependency management in distributed environments by packaging and updating custom Python and Scala environments per session.

오픈소스로 시작해서 실리콘밸리까지 2024

Korean tech talk

Korean-language talk. From open source to Silicon Valley: a career path through Apache Spark, and how open-source contributions led to Databricks.

Scaling pandas to any size with PySpark 2023

PyData (US)

Scaling pandas workloads to arbitrary data sizes using the pandas API on Spark in PySpark.

pandas와 PySpark로 데이터 워크로드 확장하기 2023

PyCon Korea

Korean-language talk. Scaling data workloads with pandas and PySpark.

Python with Spark Connect 2023

Data + AI Summit

Using Python with Spark Connect, the decoupled client/server architecture introduced in Spark 3.4, and the developer-experience improvements it enables.

Lakehouse / Spark AMA 2023

Data + AI Summit (YouTube)

Live AMA covering Apache Spark, Spark Connect, and the lakehouse architecture with several Spark committers.

Scaling data workloads using the best of both worlds: pandas and Spark 2023

PyData Seattle, with Chengyin Eng

How to combine pandas and PySpark idiomatically to scale data analysis workloads, with implementation details and best-practice guidance for analysts and scientists.

Spark Connect로 어디서든 쉽게 원격으로 PySpark 사용하기 2023

Korean tech event

Korean-language talk. Easily use PySpark remotely from anywhere with Spark Connect.

Pandas UDF and Python Type Hint in Apache Spark 3.0 2020

Spark + AI Summit

Introducing the redesigned Pandas UDF API in Spark 3.0: type hints, the new Pandas Function API, and the rationale for the redesign.

Vectorized R Execution in Apache Spark 2019

Spark + AI Summit Europe

Vectorization in Apache Spark: Arrow-based columnar exchange, Pandas UDFs, and the SparkR performance work that brought vectorized gapply / dapply and DataFrame I/O to SparkR.

What's New in Apache Spark 2.3 and Spark 2.4 2019

DataWorks Summit Singapore

Walkthrough of Spark 2.3 and 2.4 highlights: Data Source API V2, vectorized ORC reader, Pandas UDFs, continuous Structured Streaming, Kubernetes support, and barrier execution mode.