Spark Connect for Scala 3/org.apache.spark/org.apache.spark.sql/org.apache.spark.sql.pipelines/Pipeline

Pipeline

org.apache.spark.sql.pipelines.Pipeline

See thePipeline companion object

A Spark Declarative Pipeline (SDP) dataflow graph.

A pipeline is built by registering outputs (tables, materialized views, temporary views, or sinks) and the flows that populate them, then started with startRun. Each flow is defined by a DataFrame (an unresolved relation), so flows are composed with the same API used for ordinary queries.

Create one with Pipeline.create.

Attributes

Note

foreach/foreachBatch flows and query-function evaluation are not supported (they require user-defined functions); define each flow with a relation instead.

Example

 val pipe = Pipeline.create(spark, storage = Some("/tmp/pipeline_storage"))
 pipe.createMaterializedView("bronze", Some(spark.read.json("/data/raw")))
 pipe.createTable("silver", Some(pipe.read("bronze").filter(col("ok"))))
 val events = pipe.startRun()

Companion

object

Graph

Supertypes

class Object

trait Matchable

class Any

Members list

Value members

Concrete methods

Define a materialized view and the flow that populates it.

Attributes

Returns: the resolved output identifier.

Define a streaming sink and the flow that feeds it.

Value parameters

df: the flow feeding the sink.
format: the streaming write format.
name: the sink name.
options: the streaming write options.

Attributes

Returns: the resolved output identifier.

Define a published table and the flow that populates it.

Value parameters

df: the query that populates the table (a flow), or None for a table with no flow.
name: the table name.

Attributes

Returns: the resolved output identifier.

Define a (non-published) temporary view and its flow.

Attributes

Returns: the resolved output identifier.

Define a flow that writes the contents of df into target.

Value parameters

df: the relation defining the flow.
name: the flow name.
once: define as a one-time (batch) flow.
sqlConf: SQL configurations set when running this flow.
target: the dataset the flow writes to (defaults to name).

Attributes

Returns: the resolved flow name.

Value parameters

sqlFilePath: the originating SQL file path, if any.
sqlText: the SQL source.

Attributes

Drop this dataflow graph and stop any attached flows.

Attributes

Reference a dataset defined in this pipeline as a DataFrame (so later flows can read from earlier outputs).

Value parameters

name: the dataset name.

Attributes

Returns: a DataFrame reading the named dataset.

Resolve the graph and run a pipeline update. Blocks until the update completes, returning the events emitted during the run.

Value parameters

dry: validate the graph without executing flows.
fullRefresh: datasets to reset and recompute.
fullRefreshAll: reset and recompute everything.
refresh: datasets to update.
storage: checkpoint/metadata storage location.

Attributes

Returns: the events emitted during the run.

Concrete fields

In this article

Generated with