Ecosystem integrations
Because the driver returns native Apache Arrow data through the standard ADBC interface, it plugs into the wider Arrow and ADBC ecosystem with no special glue. The same patterns you would use with any other ADBC driver (PostgreSQL, SQLite, Snowflake) work here against Apache Spark Connect.
Every example on this page is executed against a live Spark Connect server on
every CI run (Spark 3.5.x, 4.0.x, and 4.1.x), so the examples stay correct. The runnable sources
live under examples/python/.
Install the integration you need
pandas
pandas 2.2+ accepts an ADBC connection in read_sql and pulls Arrow batches
under the hood, which is faster and preserves types better than the row-based
DBAPI/SQLAlchemy path.
import pandas as pd
import adbc_driver_spark.dbapi as dbapi

with dbapi.connect("sc://localhost:15002") as conn:
    df = pd.read_sql("SELECT id, id * id AS square FROM range(10)", conn)
print(df)
The cursor also offers a driver-native shortcut:
import adbc_driver_spark.dbapi as dbapi

with dbapi.connect("sc://localhost:15002") as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT AVG(id) AS mean_id FROM range(100)")
        print(cur.fetch_df())  # pandas.DataFrame
Polars
Polars reads directly from the ADBC connection with pl.read_database.
import polars as pl
import adbc_driver_spark.dbapi as dbapi

with dbapi.connect("sc://localhost:15002") as conn:
    df = pl.read_database(
        "SELECT id, id * id AS square FROM range(10)",
        connection=conn,
    )
print(df.select(pl.col("square").sum()))
DuckDB
Push heavy aggregation to Spark, then do fast local analytics in DuckDB on the same Arrow buffers. DuckDB scans an Arrow table by referencing the Python variable name in SQL, with no copy.
import duckdb
import adbc_driver_spark.dbapi as dbapi

with dbapi.connect("sc://localhost:15002") as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT id, id % 3 AS bucket FROM range(1000)")
        spark_result = cur.fetch_arrow_table()

# DuckDB resolves "spark_result" to the Python variable of the same name.
rows = duckdb.sql(
    "SELECT bucket, COUNT(*) AS n FROM spark_result GROUP BY bucket ORDER BY bucket"
).fetchall()
print(rows)
PyArrow streaming
For large results, stream Arrow record batches with bounded memory using
fetch_record_batch, which returns a pyarrow.RecordBatchReader.
import pyarrow.compute as pc
import adbc_driver_spark.dbapi as dbapi

with dbapi.connect("sc://localhost:15002") as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT id, id * id AS square FROM range(100000)")
        reader = cur.fetch_record_batch()
        total_rows = 0
        running_sum = 0
        for batch in reader:  # one batch at a time
            total_rows += batch.num_rows
            running_sum += pc.sum(batch.column("square")).as_py()
print(total_rows, running_sum)
Other Arrow consumers
Any library that understands the Arrow C stream interface or a pyarrow.Table
works the same way. The results of cursor.fetch_arrow_table() and
cursor.fetch_record_batch() feed DataFusion, Polars, DuckDB, NumPy (via
to_pandas), and any consumer of the Arrow PyCapsule protocol without bespoke code.
Writing data
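The driver supports writing through SQL (CREATE TABLE, INSERT, and
CREATE TABLE ... AS SELECT) via cursor.execute. The example below issues DDL
the same way; it uses a session-scoped temporary view rather than a table: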
import adbc_driver_spark.dbapi as dbapi

with dbapi.connect("sc://localhost:15002") as conn:
    with conn.cursor() as cur:
        cur.execute("CREATE OR REPLACE TEMP VIEW demo AS SELECT id FROM range(5)")
        cur.execute("SELECT COUNT(*) FROM demo")
        print(cur.fetchone())
Bulk ADBC ingest is not yet implemented
The ADBC bulk-ingest path (cursor.adbc_ingest(...), used to push a pandas
or Arrow table into a new table in one call) is not implemented yet and
raises NotSupportedError. Use SQL INSERT or CREATE TABLE ... AS SELECT
to write data for now. Progress is tracked on the issue tracker.
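Until bulk ingest lands, small in-memory datasets can be pushed with plain multi-row INSERT statements. The following is a minimal sketch, not part of the CI-tested sources under examples/python/. The helper names sql_literal and insert_rows are hypothetical, and the target table is assumed to exist already. Because this page does not say whether the driver supports bound parameters, the sketch inlines literals and assumes Spark SQL's default backslash escaping inside string literals. It is meant for trusted data only.

```python
import math


def sql_literal(value):
    """Render a basic Python value as a SQL literal (hypothetical helper)."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):  # test bool before int: bool subclasses int
        return "TRUE" if value else "FALSE"
    if isinstance(value, int):
        return str(value)
    if isinstance(value, float) and math.isfinite(value):
        return repr(value)
    if isinstance(value, str):
        # Assumption: backslash escapes are active in string literals,
        # which is Spark SQL's default parser behaviour.
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return f"'{escaped}'"
    raise TypeError(f"unsupported literal type: {type(value).__name__}")


def insert_rows(cursor, table, rows, chunk_size=500):
    """Write rows with chunked multi-row INSERTs; return the statement count.

    `table` is interpolated verbatim, so pass only trusted identifiers.
    """
    rows = list(rows)
    statements = 0
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        values = ", ".join(
            "(" + ", ".join(sql_literal(v) for v in row) + ")" for row in chunk
        )
        cursor.execute(f"INSERT INTO {table} VALUES {values}")
        statements += 1
    return statements
```

With a cursor opened as in the examples above, a call would look like insert_rows(cur, "demo_table", [(1, "a"), (2, None)]), where demo_table is a placeholder name. For anything beyond a few thousand rows, CREATE TABLE ... AS SELECT over data Spark can already read is likely the better route.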