Skip to content

Examples

Runnable programs live under modules/examples/. Each one connects to the server named by the SPARK_REMOTE environment variable (default sc://localhost:15002), or to a connection string passed as the first program argument.

Run any example with sbt:

sbt "examples/runMain examples.QuickStart"
sbt "examples/runMain examples.Aggregations"
sbt "examples/runMain examples.JoinExample sc://localhost:15002"
Program What it shows
QuickStart Connect, build a range, project a derived column, run actions.
Aggregations Grouped aggregation with several aggregate functions.
JoinExample Joining two DataFrames and aggregating the result.
WordCount Split and explode text, then group and count.
SqlAndViews Mix the DataFrame API with raw SQL through temporary views, including parameterised SQL.
WindowFunctions Window functions: per-department salary ranking and running totals.
ReadWrite Round-trip a DataFrame through Parquet on the server's filesystem.
StructuredStreaming Read the built-in rate source, transform it, write to the in-memory sink, and query the sink.
DeclarativePipeline Build and run a Spark Declarative Pipelines dataflow graph (requires an Apache Spark 4.1+ server).
Smoke A minimal end-to-end check against a live Spark Connect server.

Run any of them by name, for example:

sbt "examples/runMain examples.WindowFunctions"
sbt "examples/runMain examples.StructuredStreaming"

QuickStart

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.*

object QuickStart:
  def main(args: Array[String]): Unit =
    val remote = if args.nonEmpty then args(0) else "sc://localhost:15002"
    val spark = SparkSession.builder.remote(remote).getOrCreate()
    try
      val df = spark.range(1, 6).select(col("id"), (col("id") * col("id")).as("square"))
      df.show()
      println(s"row count = ${df.count()}")
    finally
      spark.stop()

Aggregations

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.*

object Aggregations:
  def main(args: Array[String]): Unit =
    val remote = if args.nonEmpty then args(0) else "sc://localhost:15002"
    val spark = SparkSession.builder.remote(remote).getOrCreate()
    try
      val sales = spark.sql(
        "SELECT * FROM VALUES " +
          "('KR', 'book', 12.0), ('KR', 'pen', 3.5), ('KR', 'book', 9.0), " +
          "('US', 'book', 15.0), ('US', 'pen', 2.0) AS t(country, item, amount)")

      sales.groupBy("country")
        .agg(
          count("*").as("orders"),
          round(sum("amount"), 2).as("total"),
          round(avg("amount"), 2).as("avg"),
          max("amount").as("max"))
        .orderBy("country")
        .show()
    finally
      spark.stop()

JoinExample

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.*

object JoinExample:
  def main(args: Array[String]): Unit =
    val remote = if args.nonEmpty then args(0) else "sc://localhost:15002"
    val spark = SparkSession.builder.remote(remote).getOrCreate()
    try
      val orders = spark.sql(
        "SELECT * FROM VALUES (1, 'KR', 120.0), (2, 'US', 80.0), (3, 'KR', 50.0) " +
          "AS t(id, country, amount)")
      val regions = spark.sql(
        "SELECT * FROM VALUES ('KR', 'Asia'), ('US', 'Americas') AS t(country, region)")

      orders.join(regions, "country")
        .groupBy("region")
        .agg(round(sum("amount"), 2).as("total"))
        .orderBy("region")
        .show()
    finally
      spark.stop()

See Quickstart and the guides for the full API.