talks
Selected conference talks and recorded sessions. Edit the
TALKS list in content.py.
-
Data + AI Summit 2025
How Arrow-optimized Python UDFs in Apache Spark deliver large speedups for existing Python UDFs without any user code change.
-
아무것도 안고치고 Python UDF 2배 빠르게 만들기 Apr 29, 2025
Data Intelligence Day 2025
Korean-language talk. Make your Python UDF 2x faster without changing anything: Arrow-optimized Python UDFs in Apache Spark.
-
Profile, debug and monitor my PySpark workloads Oct 26, 2024
PyCon APAC 2024
How to profile, debug, and monitor PySpark workloads in distributed environments using cProfile, the Spark UI, and observable streaming metrics.
-
PyCon Hong Kong, with Allison Wang
Practical methods for debugging and profiling PySpark applications in distributed environments using cProfile and other standard tools.
-
Demystifying pandas with PySpark when scaling out Jul 29, 2024
PyData Vermont 2024
A walkthrough of scaling pandas workloads with the pandas-on-Spark API in PySpark: what changes under distributed execution, and the practical pitfalls of moving from local pandas to a Spark cluster.
-
Data + AI Summit 2024, with Akhil Gudesa
How Spark Connect simplifies dependency management in distributed environments by packaging and updating custom Python and Scala environments per session.
-
오픈소스로 시작해서 실리콘밸리까지 Apr 23, 2024
Databricks
Korean-language talk. From open source to Silicon Valley: a career path through Apache Spark, and how OSS contributions led to Databricks.
-
Scaling pandas to any size with PySpark Aug 17, 2023
EuroSciPy 2023, Switzerland
Scaling pandas workloads to arbitrary data sizes using the pandas API on Spark in PySpark.
-
pandas와 PySpark로 데이터 워크로드 확장하기 Aug 12, 2023
PyCon Korea 2023, South Korea
Korean-language talk. Scaling data workloads with pandas and PySpark.
-
Python with Spark Connect Jun 29, 2023
Data + AI Summit 2023, San Francisco
Using Python with Spark Connect, the decoupled client/server architecture introduced in Spark 3.4, and the developer-experience improvements it enables.
-
Lakehouse / Spark AMA Jun 29, 2023
Data + AI Summit 2023, San Francisco
Live AMA covering Apache Spark, Spark Connect, and the lakehouse architecture with several Spark committers.
-
PyData Seattle 2023, with Chengyin Eng
How to combine pandas and PySpark idiomatically to scale data analysis workloads, with implementation details and best-practice guidance for analysts and scientists.
-
Spark Connect로 어디서든 쉽게 원격으로 PySpark 사용하기 Apr 25, 2023
Databricks
Korean-language talk. Easily use PySpark remotely from anywhere with Spark Connect.
-
PySpark in Apache Spark 3.3 and Beyond Jun 29, 2022
Data + AI Summit 2022, San Francisco, with Xinrong Meng
PySpark improvements in Spark 3.3 and the roadmap ahead: default index support for the pandas API, type hints in source, UDF profiler, and upcoming Structured Streaming and Arrow work.
-
Databricks Korea Lakehouse Day 2022 Apr 20, 2022
Databricks
-
Project Zen: Making Data Science Easier in PySpark May 26, 2021
Data + AI Summit 2021, San Francisco
Project Zen: making PySpark more Pythonic with better docs, type hints, error messages, and pandas interoperability.
-
Pandas UDF and Python Type Hint in Apache Spark 3.0 Jun 24, 2020
Spark + AI Summit 2020
Introducing the redesigned Pandas UDF API in Spark 3.0: type hints, the new Pandas Function API, and the rationale for the redesign.
-
Vectorized R Execution in Apache Spark Oct 16, 2019
Spark + AI Summit Europe 2019
Vectorization in Apache Spark: Arrow-based columnar exchange, Pandas UDFs, and the performance work that brought vectorized gapply / dapply and DataFrame I/O to SparkR.
-
What's New in Apache Spark 2.3 and Spark 2.4 Oct 11, 2018
Dataworks 2019, Singapore
Walkthrough of Spark 2.3 and 2.4 highlights: Data Source API V2, vectorized ORC reader, Pandas UDFs, continuous Structured Streaming, Kubernetes support, and barrier execution mode.