about
I'm a Staff Software Engineer at Databricks and tech lead for the open-source PySpark team. I've spent most of the last decade working on Apache Spark, primarily PySpark, Spark SQL, SparkR, and the development infrastructure that keeps the project healthy. I'm an Apache Software Foundation member, and a PMC member and committer on Apache Spark.
I came to the Spark ecosystem by way of Hortonworks and Cloudera, where I worked on the Hive and Spark integration before joining Databricks. Before that, I was a senior computer engineer at MOBIGEN in Seoul, and earlier I interned at LG Electronics. I studied at UCL (MSc, Information Science).
Most of my work centers on making PySpark feel like a first-class Python library 🐍: Pandas UDFs and Python type hints, Arrow-optimized Python UDFs, the pandas API on Spark, and the Python side of Spark Connect. I led Project Zen, the broader push to make PySpark more Pythonic, and have co-authored most of the Apache Spark release announcements on the Databricks blog.
In 2022, Apache Spark received the ACM SIGMOD Systems Award 🏆, recognizing the project as "an innovative, widely-used, open-source, unified data processing system encompassing relational, streaming, and machine-learning workloads." I'm one of the contributors named in the award.
Outside of code, I'm a PADI Freediver Instructor 🤿 and teach students on a freelance basis around Seoul. I dive scuba too, with logged dives across Korea, Taiwan, Thailand, Vietnam, the Philippines, Indonesia, the Maldives, Saipan, and Guam. Apnea and software engineering have more in common than you'd think. Both reward calm under uncomfortable conditions, and both punish you for trying too hard.
elsewhere