org.apache.spark.sql

Thrown when a query fails to analyze on the server: an unknown table or column, a type mismatch, an ambiguous reference, and similar. Mirrors org.apache.spark.sql.AnalysisException.

Attributes

Supertypes: class SparkException

class RuntimeException

class Exception

class Throwable

trait Serializable

class Object

trait Matchable

class Any
Show all
Known subtypes: class ParseException

A column in a Dataset: a lazily-evaluated reference to a column or a computation over columns. Columns are immutable; operators and methods return new Columns.

Build columns with functions.col, functions.lit, by indexing a DataFrame (df("id")), or by combining other columns with operators.

 import org.apache.spark.sql.functions._
 (col("age") + 1).as("next_age")
 col("name").like("a%") && (col("age") >= 18)

Attributes

Companion: object
Supertypes: class Object

trait Matchable

class Any

Attributes

Companion: class
Supertypes: class Object

trait Matchable

class Any
Self type: Column.type

Turns a local sequence into a DataFrame. Product elements (tuples / case classes) become multi-column rows; any other value becomes a single column. Column types are inferred from the first row.

Attributes

Supertypes: class Object

trait Matchable

class Any

Functionality for working with missing data in a Dataset, reached via df.na. Mirrors PySpark's DataFrame.na (DataFrameNaFunctions).

 df.na.drop()
 df.na.fill(0)
 df.na.fill(Map("name" -> "unknown", "age" -> 0))
 df.na.replace("name", Map("UNKNOWN" -> "unnamed"))

Attributes

Supertypes: class Object

trait Matchable

class Any

Loads data from external storage systems (e.g. file systems, key-value stores, JDBC) into a DataFrame. Use SparkSession.read to access this.

Mirrors the public surface of org.apache.spark.sql.DataFrameReader over the Spark Connect protocol.

Attributes

Example

 spark.read.format("csv").option("header", true).load("data.csv")
 spark.read.json("events.json")
 spark.read.table("my_table")

Supertypes

class Object

trait Matchable

class Any

Statistical functions for a Dataset, reached via df.stat. Mirrors PySpark's DataFrame.stat (DataFrameStatFunctions).

 df.stat.corr("x", "y")
 df.stat.approxQuantile("x", Array(0.25, 0.5, 0.75), 0.01)
 df.stat.crosstab("a", "b").show()

Attributes

Supertypes: class Object

trait Matchable

class Any

Saves the contents of a DataFrame to external storage systems (e.g. file systems, key-value stores, tables). Use Dataset.write to access this.

Mirrors the public surface of org.apache.spark.sql.DataFrameWriter over the Spark Connect protocol.

Attributes

Example

 df.write.format("parquet").mode("overwrite").save("out.parquet")
 df.write.mode("append").saveAsTable("my_table")

Supertypes

class Object

trait Matchable

class Any

Interface used to write a Dataset to external storage using the v2 (catalog) API. Use Dataset.writeTo to access this.

 df.writeTo("catalog.db.table").using("parquet").create()
 df.writeTo("catalog.db.table").append()

Attributes

Supertypes: class Object

trait Matchable

class Any

A distributed collection of rows. A Dataset is lazy: transformations build up a protobuf logical plan, and nothing executes until an action (e.g. collect, show, count) is called.

Mirrors the public surface of org.apache.spark.sql.Dataset over the Spark Connect protocol.

Attributes

Supertypes: class Object

trait Matchable

class Any

Converts between JVM values of type T and the Row representation produced and consumed by Spark Connect. Encoders are derived at compile time for primitives, Option, collections, maps, tuples, and case classes, so df.as[Person] and spark.createDataset(people) work with no server-side closures.

Encoders intentionally do NOT cover Dataset.map/flatMap/groupByKey/reduce/mapPartitions: those evaluate a JVM closure per row on the server (the same mechanism as a UDF), which Spark Connect for Scala 3 does not support.

Attributes

Companion: object
Supertypes: trait Serializable

class Object

trait Matchable

class Any

Companion holding the derived Encoder instances. Because they live here, df.as[T] and spark.createDataset resolve an encoder for any supported T with no explicit import.

Attributes

Companion: trait
Supertypes: class Object

trait Matchable

class Any
Self type: Encoder.type

Spark-conventional alias; re-exports the derived Encoder instances and factory.

Attributes

Supertypes: class Object

trait Matchable

class Any
Self type: Encoders.type

A Row backed by an Array[Any].

Attributes

Supertypes: trait Row

trait Serializable

class Object

trait Matchable

class Any
Known subtypes: class GenericRowWithSchema

A Row backed by an Array[Any] together with a schema.

Attributes

Supertypes: class GenericRow

trait Row

trait Serializable

class Object

trait Matchable

class Any
Show all

Builder for a MERGE INTO statement, obtained from Dataset.mergeInto. Specify one or more whenMatched / whenNotMatched / whenNotMatchedBySource clauses, then call merge to execute the statement against the target table.

 spark.table("source")
   .mergeInto("target", col("source.id") === col("target.id"))
   .whenMatched().updateAll()
   .whenNotMatched().insertAll()
   .whenNotMatchedBySource().delete()
   .merge()

Attributes

Supertypes: class Object

trait Matchable

class Any

Holder for named aggregate metrics computed while a Dataset is being materialised, without an extra pass over the data. Pair with Dataset.observe.

 val obs = new Observation("metrics")
 df.observe(obs, count(lit(1)).as("rows"), max("id").as("max_id")).collect()
 obs.get  // Map("rows" -> 100, "max_id" -> 99)

The result of get is a map from metric column name to its value. get blocks until the metrics have been observed (i.e. until an action on the observed Dataset has run and the result code has called setMetricsFromLiterals).

Value parameters

name: a unique name for this observation.

Attributes

Companion: object
Supertypes: class Object

trait Matchable

class Any

Companion for Observation. Maintains a registry mapping observation names to instances so the result-collection code can route observed metric values back to the originating Observation, and provides a helper for decoding proto.Expression.Literal values.

Attributes

Companion: class
Supertypes: class Object

trait Matchable

class Any
Self type: Observation.type

A set of methods for aggregations on a Dataset, created by Dataset.groupBy, Dataset.rollup, Dataset.cube, or Dataset.pivot.

Attributes

Supertypes: class Object

trait Matchable

class Any

Represents one row of output from a relational operator. Mirrors org.apache.spark.sql.Row.

Attributes

Companion: object
Supertypes: trait Serializable

class Object

trait Matchable

class Any
Known subtypes: class GenericRow

class GenericRowWithSchema

Attributes

Companion: trait
Supertypes: class Object

trait Matchable

class Any
Self type: Row.type

Runtime configuration interface for Spark, accessible via spark.conf. Reads and writes are delegated to the Spark Connect server's Config RPC.

Attributes

Supertypes: class Object

trait Matchable

class Any

Implicit conversions available via import spark.implicits.*: the $"columnName" column syntax and a .toDF method on local sequences of values, tuples, or case classes.

 import spark.implicits.*
 val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
 df.select($"id", $"name")

Attributes

Supertypes: class Object

trait Matchable

class Any
Known subtypes: object implicits

Specifies the expected behavior when saving a DataFrame to a data source that already has data. Mirrors org.apache.spark.sql.SaveMode.

Attributes

Companion: object
Supertypes: trait Enum

trait Serializable

trait Product

trait Equals

class Object

trait Matchable

class Any
Show all

Attributes

Companion: enum
Supertypes: trait Sum

trait Mirror

class Object

trait Matchable

class Any
Self type: SaveMode.type

The entry point to programming Spark with the DataFrame API over Spark Connect.

 val spark = SparkSession.builder
   .remote("sc://localhost:15002")
   .appName("my-app")
   .getOrCreate()

 spark.range(10).filter(col("id") % 2 === 0).show()
 spark.stop()

A session owns the SparkConnectClient, a monotonic plan-id allocator (so every relation is uniquely identifiable to the server), an Arrow allocator for decoding results, and the RuntimeConfig facade.

Attributes

Companion: object
Supertypes: trait AutoCloseable

class Object

trait Matchable

class Any

Attributes

Companion: class
Supertypes: class Object

trait Matchable

class Any
Self type: SparkSession.type

A WHEN MATCHED [AND condition] clause of a MergeIntoWriter.

Attributes

Supertypes: class Object

trait Matchable

class Any

A WHEN NOT MATCHED [AND condition] clause of a MergeIntoWriter.

Attributes

Supertypes: class Object

trait Matchable

class Any

A WHEN NOT MATCHED BY SOURCE [AND condition] clause of a MergeIntoWriter.

Attributes

Supertypes: class Object

trait Matchable

class Any

Built-in functions for working with Columns, mirroring org.apache.spark.sql.functions.

 import org.apache.spark.sql.functions._
 df.select(col("id"), upper(col("name")), (col("x") + 1).as("x1"))
 df.groupBy("dept").agg(avg("salary"), count(lit(1)))

This object exposes a comprehensive subset of Spark's function library. Any Spark function not listed here can still be invoked by name via callUDF / expr.

Following Spark's convention, a String argument denotes a column name for most functions (e.g. sum("salary") aggregates the salary column), while functions whose parameters are genuinely literal (regex patterns, date formats, JSON paths, ...) treat their String arguments as literal values.

Attributes

Supertypes: class Object

trait Matchable

class Any
Self type: functions.type

org.apache.spark.sql

Members list

Packages

Type members

Classlikes

Attributes

Attributes

Attributes

Attributes

Attributes

Attributes

Attributes

Attributes

Attributes

Attributes

Attributes

Attributes

Attributes

Attributes

Attributes

Attributes

Value parameters

Attributes

Attributes

Attributes

Attributes

Attributes

Attributes

Attributes

Attributes

Attributes

Attributes

Attributes

Attributes

Attributes

Attributes

Attributes

Types

Attributes