org.apache.spark.sql

Members list

Type members

Classlikes

class AnalysisException(message: String, cause: Throwable, errorClass: Option[String]) extends SparkException

Thrown when a query fails to analyze on the server: an unknown table or column, a type mismatch, an ambiguous reference, and similar. Mirrors org.apache.spark.sql.AnalysisException.

Thrown when a query fails to analyze on the server: an unknown table or column, a type mismatch, an ambiguous reference, and similar. Mirrors org.apache.spark.sql.AnalysisException.

Attributes

Supertypes
class RuntimeException
class Exception
class Throwable
trait Serializable
class Object
trait Matchable
class Any
Show all
Known subtypes
class Column(val expr: Expression)

A column in a Dataset: a lazily-evaluated reference to a column or a computation over columns. Columns are immutable; operators and methods return new Columns.

A column in a Dataset: a lazily-evaluated reference to a column or a computation over columns. Columns are immutable; operators and methods return new Columns.

Build columns with functions.col, functions.lit, by indexing a DataFrame (df("id")), or by combining other columns with operators.

 import org.apache.spark.sql.functions._
 (col("age") + 1).as("next_age")
 col("name").like("a%") && (col("age") >= 18)

Attributes

Companion
object
Supertypes
class Object
trait Matchable
class Any
object Column

Attributes

Companion
class
Supertypes
class Object
trait Matchable
class Any
Self type
Column.type
final class DataFrameHolder

Turns a local sequence into a DataFrame. Product elements (tuples / case classes) become multi-column rows; any other value becomes a single column. Column types are inferred from the first row.

Turns a local sequence into a DataFrame. Product elements (tuples / case classes) become multi-column rows; any other value becomes a single column. Column types are inferred from the first row.

Attributes

Supertypes
class Object
trait Matchable
class Any

Functionality for working with missing data in a Dataset, reached via df.na. Mirrors PySpark's DataFrame.na (DataFrameNaFunctions).

Functionality for working with missing data in a Dataset, reached via df.na. Mirrors PySpark's DataFrame.na (DataFrameNaFunctions).

 df.na.drop()
 df.na.fill(0)
 df.na.fill(Map("name" -> "unknown", "age" -> 0))
 df.na.replace("name", Map("UNKNOWN" -> "unnamed"))

Attributes

Supertypes
class Object
trait Matchable
class Any

Loads data from external storage systems (e.g. file systems, key-value stores, JDBC) into a DataFrame. Use SparkSession.read to access this.

Loads data from external storage systems (e.g. file systems, key-value stores, JDBC) into a DataFrame. Use SparkSession.read to access this.

Mirrors the public surface of org.apache.spark.sql.DataFrameReader over the Spark Connect protocol.

Attributes

Example
 spark.read.format("csv").option("header", true).load("data.csv")
 spark.read.json("events.json")
 spark.read.table("my_table")
Supertypes
class Object
trait Matchable
class Any

Statistical functions for a Dataset, reached via df.stat. Mirrors PySpark's DataFrame.stat (DataFrameStatFunctions).

Statistical functions for a Dataset, reached via df.stat. Mirrors PySpark's DataFrame.stat (DataFrameStatFunctions).

 df.stat.corr("x", "y")
 df.stat.approxQuantile("x", Array(0.25, 0.5, 0.75), 0.01)
 df.stat.crosstab("a", "b").show()

Attributes

Supertypes
class Object
trait Matchable
class Any

Saves the contents of a DataFrame to external storage systems (e.g. file systems, key-value stores, tables). Use Dataset.write to access this.

Saves the contents of a DataFrame to external storage systems (e.g. file systems, key-value stores, tables). Use Dataset.write to access this.

Mirrors the public surface of org.apache.spark.sql.DataFrameWriter over the Spark Connect protocol.

Attributes

Example
 df.write.format("parquet").mode("overwrite").save("out.parquet")
 df.write.mode("append").saveAsTable("my_table")
Supertypes
class Object
trait Matchable
class Any
final class DataFrameWriterV2

Interface used to write a Dataset to external storage using the v2 (catalog) API. Use Dataset.writeTo to access this.

Interface used to write a Dataset to external storage using the v2 (catalog) API. Use Dataset.writeTo to access this.

 df.writeTo("catalog.db.table").using("parquet").create()
 df.writeTo("catalog.db.table").append()

Attributes

Supertypes
class Object
trait Matchable
class Any
class Dataset[T]

A distributed collection of rows. A Dataset is lazy: transformations build up a protobuf logical plan, and nothing executes until an action (e.g. collect, show, count) is called.

A distributed collection of rows. A Dataset is lazy: transformations build up a protobuf logical plan, and nothing executes until an action (e.g. collect, show, count) is called.

Mirrors the public surface of org.apache.spark.sql.Dataset over the Spark Connect protocol.

Attributes

Supertypes
class Object
trait Matchable
class Any
trait Encoder[T] extends Serializable

Converts between JVM values of type T and the Row representation produced and consumed by Spark Connect. Encoders are derived at compile time for primitives, Option, collections, maps, tuples, and case classes, so df.as[Person] and spark.createDataset(people) work with no server-side closures.

Converts between JVM values of type T and the Row representation produced and consumed by Spark Connect. Encoders are derived at compile time for primitives, Option, collections, maps, tuples, and case classes, so df.as[Person] and spark.createDataset(people) work with no server-side closures.

Encoders intentionally do NOT cover Dataset.map/flatMap/groupByKey/reduce/mapPartitions: those evaluate a JVM closure per row on the server (the same mechanism as a UDF), which Spark Connect for Scala 3 does not support.

Attributes

Companion
object
Supertypes
trait Serializable
class Object
trait Matchable
class Any
object Encoder

Companion holding the derived Encoder instances. Because they live here, df.as[T] and spark.createDataset resolve an encoder for any supported T with no explicit import.

Companion holding the derived Encoder instances. Because they live here, df.as[T] and spark.createDataset resolve an encoder for any supported T with no explicit import.

Attributes

Companion
trait
Supertypes
class Object
trait Matchable
class Any
Self type
Encoder.type
object Encoders

Spark-conventional alias; re-exports the derived Encoder instances and factory.

Spark-conventional alias; re-exports the derived Encoder instances and factory.

Attributes

Supertypes
class Object
trait Matchable
class Any
Self type
Encoders.type
class GenericRow(val values: Array[Any]) extends Row

A Row backed by an Array[Any].

A Row backed by an Array[Any].

Attributes

Supertypes
trait Row
trait Serializable
class Object
trait Matchable
class Any
Known subtypes
class GenericRowWithSchema(values: Array[Any], val schema: StructType) extends GenericRow

A Row backed by an Array[Any] together with a schema.

A Row backed by an Array[Any] together with a schema.

Attributes

Supertypes
class GenericRow
trait Row
trait Serializable
class Object
trait Matchable
class Any
Show all
class MergeIntoWriter[T]

Builder for a MERGE INTO statement, obtained from Dataset.mergeInto. Specify one or more whenMatched / whenNotMatched / whenNotMatchedBySource clauses, then call merge to execute the statement against the target table.

Builder for a MERGE INTO statement, obtained from Dataset.mergeInto. Specify one or more whenMatched / whenNotMatched / whenNotMatchedBySource clauses, then call merge to execute the statement against the target table.

 spark.table("source")
   .mergeInto("target", col("source.id") === col("target.id"))
   .whenMatched().updateAll()
   .whenNotMatched().insertAll()
   .whenNotMatchedBySource().delete()
   .merge()

Attributes

Supertypes
class Object
trait Matchable
class Any
class Observation(val name: String)

Holder for named aggregate metrics computed while a Dataset is being materialised, without an extra pass over the data. Pair with Dataset.observe.

Holder for named aggregate metrics computed while a Dataset is being materialised, without an extra pass over the data. Pair with Dataset.observe.

 val obs = new Observation("metrics")
 df.observe(obs, count(lit(1)).as("rows"), max("id").as("max_id")).collect()
 obs.get  // Map("rows" -> 100, "max_id" -> 99)

The result of get is a map from metric column name to its value. get blocks until the metrics have been observed (i.e. until an action on the observed Dataset has run and the result code has called setMetricsFromLiterals).

Value parameters

name

a unique name for this observation.

Attributes

Companion
object
Supertypes
class Object
trait Matchable
class Any
object Observation

Companion for Observation. Maintains a registry mapping observation names to instances so the result-collection code can route observed metric values back to the originating Observation, and provides a helper for decoding proto.Expression.Literal values.

Companion for Observation. Maintains a registry mapping observation names to instances so the result-collection code can route observed metric values back to the originating Observation, and provides a helper for decoding proto.Expression.Literal values.

Attributes

Companion
class
Supertypes
class Object
trait Matchable
class Any
Self type

A set of methods for aggregations on a Dataset, created by Dataset.groupBy, Dataset.rollup, Dataset.cube, or Dataset.pivot.

A set of methods for aggregations on a Dataset, created by Dataset.groupBy, Dataset.rollup, Dataset.cube, or Dataset.pivot.

Attributes

Supertypes
class Object
trait Matchable
class Any
trait Row extends Serializable

Represents one row of output from a relational operator. Mirrors org.apache.spark.sql.Row.

Represents one row of output from a relational operator. Mirrors org.apache.spark.sql.Row.

Attributes

Companion
object
Supertypes
trait Serializable
class Object
trait Matchable
class Any
Known subtypes
object Row

Attributes

Companion
trait
Supertypes
class Object
trait Matchable
class Any
Self type
Row.type

Runtime configuration interface for Spark, accessible via spark.conf. Reads and writes are delegated to the Spark Connect server's Config RPC.

Runtime configuration interface for Spark, accessible via spark.conf. Reads and writes are delegated to the Spark Connect server's Config RPC.

Attributes

Supertypes
class Object
trait Matchable
class Any
class SQLImplicits

Implicit conversions available via import spark.implicits.*: the $"columnName" column syntax and a .toDF method on local sequences of values, tuples, or case classes.

Implicit conversions available via import spark.implicits.*: the $"columnName" column syntax and a .toDF method on local sequences of values, tuples, or case classes.

 import spark.implicits.*
 val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
 df.select($"id", $"name")

Attributes

Supertypes
class Object
trait Matchable
class Any
Known subtypes
object implicits
enum SaveMode

Specifies the expected behavior when saving a DataFrame to a data source that already has data. Mirrors org.apache.spark.sql.SaveMode.

Specifies the expected behavior when saving a DataFrame to a data source that already has data. Mirrors org.apache.spark.sql.SaveMode.

Attributes

Companion
object
Supertypes
trait Enum
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
Show all
object SaveMode

Attributes

Companion
enum
Supertypes
trait Sum
trait Mirror
class Object
trait Matchable
class Any
Self type
SaveMode.type
class SparkSession extends AutoCloseable

The entry point to programming Spark with the DataFrame API over Spark Connect.

The entry point to programming Spark with the DataFrame API over Spark Connect.

 val spark = SparkSession.builder
   .remote("sc://localhost:15002")
   .appName("my-app")
   .getOrCreate()

 spark.range(10).filter(col("id") % 2 === 0).show()
 spark.stop()

A session owns the SparkConnectClient, a monotonic plan-id allocator (so every relation is uniquely identifiable to the server), an Arrow allocator for decoding results, and the RuntimeConfig facade.

Attributes

Companion
object
Supertypes
trait AutoCloseable
class Object
trait Matchable
class Any
object SparkSession

Attributes

Companion
class
Supertypes
class Object
trait Matchable
class Any
Self type
class WhenMatched[T]

A WHEN MATCHED [AND condition] clause of a MergeIntoWriter.

A WHEN MATCHED [AND condition] clause of a MergeIntoWriter.

Attributes

Supertypes
class Object
trait Matchable
class Any
class WhenNotMatched[T]

A WHEN NOT MATCHED [AND condition] clause of a MergeIntoWriter.

A WHEN NOT MATCHED [AND condition] clause of a MergeIntoWriter.

Attributes

Supertypes
class Object
trait Matchable
class Any

A WHEN NOT MATCHED BY SOURCE [AND condition] clause of a MergeIntoWriter.

A WHEN NOT MATCHED BY SOURCE [AND condition] clause of a MergeIntoWriter.

Attributes

Supertypes
class Object
trait Matchable
class Any
object functions

Built-in functions for working with Columns, mirroring org.apache.spark.sql.functions.

Built-in functions for working with Columns, mirroring org.apache.spark.sql.functions.

 import org.apache.spark.sql.functions._
 df.select(col("id"), upper(col("name")), (col("x") + 1).as("x1"))
 df.groupBy("dept").agg(avg("salary"), count(lit(1)))

This object exposes a comprehensive subset of Spark's function library. Any Spark function not listed here can still be invoked by name via callUDF / expr.

Following Spark's convention, a String argument denotes a column name for most functions (e.g. sum("salary") aggregates the salary column), while functions whose parameters are genuinely literal (regex patterns, date formats, JSON paths, ...) treat their String arguments as literal values.

Attributes

Supertypes
class Object
trait Matchable
class Any
Self type
functions.type

Types

A Dataset of Rows. Spark Connect for Scala 3 currently models all DataFrames as untyped row datasets, mirroring DataFrame = Dataset[Row] in Apache Spark.

A Dataset of Rows. Spark Connect for Scala 3 currently models all DataFrames as untyped row datasets, mirroring DataFrame = Dataset[Row] in Apache Spark.

Attributes