talks
Selected conference talks and recorded sessions. Edit the
TALKS list in content.py.
-
Data + AI Summit
How Arrow-optimized Python UDFs in Apache Spark deliver large speedups for existing Python UDFs without any changes to user code.
-
AWS Summit Seoul (probable)
Korean-language talk. Make your Python UDF 2x faster without changing anything: Arrow-optimized Python UDFs in Apache Spark.
-
PyCon APAC (probable)
How to profile, debug, and monitor PySpark workloads in distributed environments using cProfile, the Spark UI, and observable streaming metrics.
-
PyCon Hong Kong, with Allison Wang
Practical methods for debugging and profiling PySpark applications in distributed environments using cProfile and other standard tools.
-
PyData Vermont
A walkthrough of scaling pandas workloads with the pandas API on Spark in PySpark: what changes under distributed execution, and the practical pitfalls of moving from local pandas to a Spark cluster.
-
Data + AI Summit, with Akhil Gudesa
How Spark Connect simplifies dependency management in distributed environments by packaging and updating custom Python and Scala environments per session.
-
From Open Source to Silicon Valley (오픈소스로 시작해서 실리콘밸리까지) 2024
Korean tech talk
Korean-language talk. From open source to Silicon Valley: a career path through Apache Spark, and how open-source contributions led to Databricks.
-
PyData (US)
Scaling pandas workloads to arbitrary data sizes using the pandas API on Spark in PySpark.
-
PyCon Korea
Korean-language talk at PyCon Korea 2023. Scaling data workloads with pandas and PySpark.
-
Data + AI Summit
Using Python with Spark Connect, the decoupled client/server architecture introduced in Spark 3.4, and the developer-experience improvements it enables.
-
Data + AI Summit (YouTube)
Live AMA covering Apache Spark, Spark Connect, and the lakehouse architecture with several Spark committers.
-
PyData Seattle, with Chengyin Eng
How to combine pandas and PySpark idiomatically to scale data analysis workloads, with implementation details and best-practice guidance for analysts and scientists.
-
Korean tech event
Korean-language talk. Easily use PySpark remotely from anywhere with Spark Connect.
-
Spark + AI Summit
Introducing the redesigned Pandas UDF API in Spark 3.0: type hints, the new Pandas Function API, and the rationale for the redesign.
-
Spark + AI Summit Europe
Vectorization in Apache Spark: Arrow-based columnar exchange, Pandas UDFs, and the SparkR performance work that brought vectorized gapply / dapply and DataFrame I/O to SparkR.
-
DataWorks Summit Singapore
Walkthrough of Spark 2.3 and 2.4 highlights: Data Source API V2, the vectorized ORC reader, Pandas UDFs, continuous processing in Structured Streaming, Kubernetes support, and barrier execution mode.