Dataset

org.apache.spark.sql.Dataset
class Dataset[T]

A distributed collection of rows. A Dataset is lazy: transformations build up a protobuf logical plan, and nothing executes until an action (e.g. collect, show, count) is called.

Mirrors the public surface of org.apache.spark.sql.Dataset over the Spark Connect protocol.

Attributes

Graph
Supertypes
class Object
trait Matchable
class Any

Members list

Value members

Concrete methods

def agg(expr: Column, exprs: Column*): DataFrame
def alias(alias: String): DataFrame
def apply(colName: String): Column

Selects a column by name. Alias for col.

Selects a column by name. Alias for col.

Attributes

def as[U](using enc: Encoder[U]): Dataset[U]

Returns a new Dataset where each record is mapped to type U via its Encoder. This is a purely client-side reinterpretation (no server-side closure), so it works over Spark Connect.

Returns a new Dataset where each record is mapped to type U via its Encoder. This is a purely client-side reinterpretation (no server-side closure), so it works over Spark Connect.

Attributes

def as(alias: String): DataFrame
def cache(): this.type

Persists this Dataset with the default storage level.

Persists this Dataset with the default storage level.

Attributes

Eagerly checkpoints this Dataset to reliable storage and returns the checkpointed copy.

Eagerly checkpoints this Dataset to reliable storage and returns the checkpointed copy.

Attributes

def checkpoint(eager: Boolean): DataFrame

Checkpoints this Dataset to reliable storage.

Checkpoints this Dataset to reliable storage.

Attributes

def coalesce(numPartitions: Int): DataFrame
def col(colName: String): Column

Selects a column by name, qualified by this Dataset's plan id so that it resolves unambiguously even in self-joins.

Selects a column by name, qualified by this Dataset's plan id so that it resolves unambiguously even in self-joins.

Attributes

def colRegex(colName: String): Column

Selects columns based on a column name regular expression.

Selects columns based on a column name regular expression.

Attributes

def collect(): Array[T]
def collectAsList(): List[T]
def columns: Array[String]
def count(): Long
def createGlobalTempView(viewName: String): Unit
def createOrReplaceGlobalTempView(viewName: String): Unit
def createOrReplaceTempView(viewName: String): Unit
def createTempView(viewName: String): Unit
def crossJoin(right: Dataset[_]): DataFrame
def describe(cols: String*): DataFrame

Computes basic statistics (count, mean, stddev, min, max) for numeric and string columns.

Computes basic statistics (count, mean, stddev, min, max) for numeric and string columns.

Attributes

def drop(colNames: String*): DataFrame
def drop(col: Column): DataFrame
def dropDuplicates(colNames: Seq[String]): DataFrame
def dropDuplicates(col1: String, cols: String*): DataFrame

Drops duplicates within the event-time watermark, keeping state bounded for streaming.

Drops duplicates within the event-time watermark, keeping state bounded for streaming.

Attributes

def dropDuplicatesWithinWatermark(colNames: Seq[String]): DataFrame
def dropDuplicatesWithinWatermark(col1: String, cols: String*): DataFrame
def dtypes: Array[(String, String)]
def except(other: Dataset[_]): DataFrame
def exceptAll(other: Dataset[_]): DataFrame
def explain(): Unit
def explain(extended: Boolean): Unit
def explain(mode: String): Unit
def filter(condition: Column): DataFrame
def filter(conditionExpr: String): DataFrame
def first(): T
def groupBy(col1: String, cols: String*): RelationalGroupedDataset
def head(n: Int): Array[T]
def head(): T
def hint(name: String, parameters: Any*): DataFrame
def inputFiles: Array[String]
def intersect(other: Dataset[_]): DataFrame
def intersectAll(other: Dataset[_]): DataFrame
def isEmpty: Boolean
def isLocal: Boolean
def isStreaming: Boolean
def join(right: Dataset[_]): DataFrame
def join(right: Dataset[_], usingColumn: String): DataFrame
def join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
def join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
def join(right: Dataset[_], joinExprs: Column): DataFrame
def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
def lateralJoin(right: Dataset[_]): DataFrame

Lateral join with a correlated right relation.

Lateral join with a correlated right relation.

Attributes

def lateralJoin(right: Dataset[_], joinExprs: Column): DataFrame
def lateralJoin(right: Dataset[_], joinType: String): DataFrame
def lateralJoin(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
def limit(n: Int): DataFrame

Eagerly locally checkpoints this Dataset.

Eagerly locally checkpoints this Dataset.

Attributes

def localCheckpoint(eager: Boolean): DataFrame

Locally checkpoints this Dataset.

Locally checkpoints this Dataset.

Attributes

def melt(ids: Array[Column], values: Array[Column], variableColumnName: String, valueColumnName: String): DataFrame

Alias for unpivot.

Alias for unpivot.

Attributes

def melt(ids: Array[Column], variableColumnName: String, valueColumnName: String): DataFrame

Alias for unpivot.

Alias for unpivot.

Attributes

def mergeInto(table: String, condition: Column): MergeIntoWriter[T]

Merges this Dataset (the source) into the table (the target) using condition to match rows. Returns a MergeIntoWriter to configure the WHEN clauses; call merge() to run it.

Merges this Dataset (the source) into the table (the target) using condition to match rows. Returns a MergeIntoWriter to configure the WHEN clauses; call merge() to run it.

Attributes

Returns a DataFrameNaFunctions for working with missing data.

Returns a DataFrameNaFunctions for working with missing data.

Attributes

def observe(observation: Observation, expr: Column, exprs: Column*): DataFrame

Defines named observed metrics computed while this Dataset is processed.

Defines named observed metrics computed while this Dataset is processed.

Attributes

def offset(n: Int): DataFrame
def orderBy(sortCol: String, sortCols: String*): DataFrame
def orderBy(sortExprs: Column*): DataFrame
def persist(): this.type

Persists this Dataset with the default storage level (MEMORY_AND_DISK).

Persists this Dataset with the default storage level (MEMORY_AND_DISK).

Attributes

def persist(newLevel: StorageLevel): this.type

Persists this Dataset with the given storage level.

Persists this Dataset with the given storage level.

Attributes

def printSchema(): Unit
def printSchema(level: Int): Unit
def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame]

Randomly splits this Dataset with the given weights and a fixed seed.

Randomly splits this Dataset with the given weights and a fixed seed.

Attributes

def randomSplit(weights: Array[Double]): Array[DataFrame]

Randomly splits this Dataset with the given weights.

Randomly splits this Dataset with the given weights.

Attributes

def repartition(numPartitions: Int): DataFrame
def repartition(numPartitions: Int, partitionExprs: Column*): DataFrame
def repartition(partitionExprs: Column*): DataFrame
def repartitionByRange(numPartitions: Int, partitionExprs: Column*): DataFrame

Range-partitions by the given expressions into numPartitions.

Range-partitions by the given expressions into numPartitions.

Attributes

def repartitionByRange(partitionExprs: Column*): DataFrame

Range-partitions by the given expressions.

Range-partitions by the given expressions.

Attributes

def sameSemantics(other: Dataset[_]): Boolean
def sample(fraction: Double): DataFrame
def sample(fraction: Double, seed: Long): DataFrame
def sample(withReplacement: Boolean, fraction: Double): DataFrame
def sample(withReplacement: Boolean, fraction: Double, seed: Long): DataFrame

The schema of this Dataset.

The schema of this Dataset.

Attributes

def select(cols: Column*): DataFrame
def select(col: String, cols: String*): DataFrame
def selectExpr(exprs: String*): DataFrame
def semanticHash(): Int
def show(): Unit
def show(numRows: Int): Unit
def show(truncate: Boolean): Unit
def show(numRows: Int, truncate: Boolean): Unit
def show(numRows: Int, truncate: Int): Unit
def show(numRows: Int, truncate: Int, vertical: Boolean): Unit
def sort(sortCol: String, sortCols: String*): DataFrame
def sort(sortExprs: Column*): DataFrame
def sortWithinPartitions(sortCol: String, sortCols: String*): DataFrame

Returns a DataFrameStatFunctions for statistic functions.

Returns a DataFrameStatFunctions for statistic functions.

Attributes

Returns the current storage level of this Dataset.

Returns the current storage level of this Dataset.

Attributes

def summary(statistics: String*): DataFrame

Computes the requested summary statistics; defaults match Spark's summary().

Computes the requested summary statistics; defaults match Spark's summary().

Attributes

def take(n: Int): Array[T]
def takeAsList(n: Int): List[T]
def to(schema: StructType): DataFrame

Returns a new Dataset where each row is reconciled to match the specified schema (by column name, reordering and casting as needed).

Returns a new Dataset where each row is reconciled to match the specified schema (by column name, reordering and casting as needed).

Attributes

def toDF(): DataFrame
def toDF(colNames: String*): DataFrame

Returns the content as a DataFrame of JSON strings in a single value column.

Returns the content as a DataFrame of JSON strings in a single value column.

Attributes

def toLocalIterator(): Iterator[T]
def transform[U](t: (Dataset[T]) => Dataset[U]): Dataset[U]

Concisely applies a transformation to this Dataset.

Concisely applies a transformation to this Dataset.

Attributes

Transposes the DataFrame, turning the first column into the new column names.

Transposes the DataFrame, turning the first column into the new column names.

Attributes

def transpose(indexColumn: Column): DataFrame

Transposes the DataFrame using indexColumn for the new column names.

Transposes the DataFrame using indexColumn for the new column names.

Attributes

def union(other: Dataset[_]): DataFrame
def unionAll(other: Dataset[_]): DataFrame
def unionByName(other: Dataset[_]): DataFrame
def unionByName(other: Dataset[_], allowMissingColumns: Boolean): DataFrame
def unpersist(blocking: Boolean): this.type

Marks this Dataset as non-persistent.

Marks this Dataset as non-persistent.

Attributes

def unpersist(): this.type

Marks this Dataset as non-persistent.

Marks this Dataset as non-persistent.

Attributes

def unpivot(ids: Array[Column], values: Array[Column], variableColumnName: String, valueColumnName: String): DataFrame

Unpivots (melts) a DataFrame from wide to long format.

Unpivots (melts) a DataFrame from wide to long format.

Attributes

def unpivot(ids: Array[Column], variableColumnName: String, valueColumnName: String): DataFrame

Unpivots, inferring the value columns from those not in ids.

Unpivots, inferring the value columns from those not in ids.

Attributes

def where(condition: Column): DataFrame
def where(conditionExpr: String): DataFrame
def withColumn(colName: String, col: Column): DataFrame
def withColumnRenamed(existingName: String, newName: String): DataFrame
def withColumnsRenamed(renames: Map[String, String]): DataFrame
def withWatermark(eventTime: String, delayThreshold: String): DataFrame

Defines an event-time watermark for this streaming Dataset.

Defines an event-time watermark for this streaming Dataset.

Attributes

Interface for saving the content of this Dataset to external storage.

Interface for saving the content of this Dataset to external storage.

Attributes

Interface for saving the content of a streaming Dataset to external storage.

Interface for saving the content of a streaming Dataset to external storage.

Attributes

def writeTo(table: String): DataFrameWriterV2

Creates a v2 (catalog) write configuration builder.

Creates a v2 (catalog) write configuration builder.

Attributes

Concrete fields