Pipeline

org.apache.spark.sql.pipelines.Pipeline
See the Pipeline companion object
class Pipeline

A Spark Declarative Pipeline (SDP) dataflow graph.

A pipeline is built by registering outputs (tables, materialized views, temporary views, or sinks) and the flows that populate them, then started with startRun. Each flow is defined by a DataFrame (an unresolved relation), so flows are composed with the same API used for ordinary queries.

Create one with Pipeline.create.

Attributes

Note

foreach/foreachBatch flows and query-function evaluation are not supported (they require user-defined functions); define each flow with a relation instead.

Example
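 import org.apache.spark.sql.functions.col
 import org.apache.spark.sql.pipelines.Pipeline
 // Assumes an active SparkSession named `spark`.
 val pipe = Pipeline.create(spark, storage = Some("/tmp/pipeline_storage"))
 pipe.createMaterializedView("bronze", Some(spark.read.json("/data/raw")))
 pipe.createTable("silver", Some(pipe.read("bronze").filter(col("ok"))))
 val events = pipe.startRun() // blocks until the update completes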
Companion
object
Supertypes
class Object
trait Matchable
class Any

Members list

Value members

Concrete methods

def createMaterializedView(name: String, df: Option[DataFrame], comment: Option[String], format: Option[String], partitionCols: Seq[String], clusteringColumns: Seq[String], tableProperties: Map[String, String], schema: Option[Either[String, DataType]]): String

Define a materialized view and the flow that populates it.

Attributes

Returns

the resolved output identifier.
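
As an illustration of the full parameter list, the sketch below names every argument of createMaterializedView explicitly. The helper name defineDailyTotals, the input path, and the column names are hypothetical, and the comments on individual parameters are inferred from their names rather than taken from this page.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum
import org.apache.spark.sql.pipelines.Pipeline

// Hypothetical helper: registers a materialized view with every parameter named.
def defineDailyTotals(spark: SparkSession, pipe: Pipeline): String = {
  // The flow is an ordinary batch DataFrame; nothing runs until startRun.
  val totals = spark.read
    .parquet("/data/orders") // hypothetical input path
    .groupBy("order_date")
    .agg(sum("amount").as("total"))

  pipe.createMaterializedView(
    name = "daily_totals",
    df = Some(totals),
    comment = Some("Order amount per day"),
    format = None,                      // assumed to fall back to a default format
    partitionCols = Seq("order_date"),
    clusteringColumns = Seq.empty,
    tableProperties = Map.empty,
    schema = None                       // assumed to be inferred from the flow
  )
}
```

The returned String is the resolved output identifier described above.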

def createSink(name: String, df: DataFrame, format: Option[String], options: Map[String, String]): String

Define a streaming sink and the flow that feeds it.

Value parameters

df

the flow feeding the sink.

format

the streaming write format.

name

the sink name.

options

the streaming write options.

Attributes

Returns

the resolved output identifier.
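
A minimal sketch of a sink definition, assuming the Kafka connector is on the classpath and a broker is reachable at the hypothetical address shown. The built-in rate source stands in for a real stream, and its value column is cast to a string because the Kafka writer expects a string or binary value column. The helper name and topic are illustrative only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.pipelines.Pipeline

// Hypothetical helper: feeds a Kafka sink from a synthetic streaming source.
def defineAlertSink(spark: SparkSession, pipe: Pipeline): String = {
  val ticks = spark.readStream
    .format("rate") // built-in test source with columns (timestamp, value)
    .load()
    .selectExpr("CAST(value AS STRING) AS value")

  pipe.createSink(
    name = "alerts_out",
    df = ticks,
    format = Some("kafka"),
    options = Map(
      "kafka.bootstrap.servers" -> "localhost:9092", // hypothetical broker
      "topic" -> "alerts"                            // hypothetical topic
    )
  )
}
```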

def createTable(name: String, df: Option[DataFrame], comment: Option[String], format: Option[String], partitionCols: Seq[String], clusteringColumns: Seq[String], tableProperties: Map[String, String], schema: Option[Either[String, DataType]]): String

Define a published table and the flow that populates it.

Value parameters

df

the query that populates the table (a flow), or None for a table with no flow.

name

the table name.

Attributes

Returns

the resolved output identifier.

def createTemporaryView(name: String, df: Option[DataFrame], comment: Option[String]): String

Define a (non-published) temporary view and its flow.

Attributes

Returns

the resolved output identifier.
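
One plausible use of a temporary view is as an unpublished intermediate step that a later flow picks up through read. The sketch below assumes this pattern; the dataset names, input path, and columns are hypothetical, and the two-argument createMaterializedView call relies on the same defaults as the class-level example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.pipelines.Pipeline

// Hypothetical helper: a temporary view feeding a materialized view.
def defineCleanedOrders(spark: SparkSession, pipe: Pipeline): Unit = {
  // Intermediate, non-published step.
  pipe.createTemporaryView(
    name = "orders_clean",
    df = Some(spark.read.json("/data/orders").filter(col("amount") > 0)),
    comment = Some("Orders with a positive amount")
  )

  // Published output that reads the temporary view by name.
  pipe.createMaterializedView(
    "orders_by_customer",
    Some(pipe.read("orders_clean").groupBy("customer_id").count())
  )
}
```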

def defineFlow(name: String, df: DataFrame, target: Option[String], once: Boolean, sqlConf: Map[String, String]): String

Define a flow that writes the contents of df into target.

Value parameters

df

the relation defining the flow.

name

the flow name.

once

if true, define the flow as a one-time (batch) flow.

sqlConf

SQL configurations set when running this flow.

target

the dataset the flow writes to (defaults to name).

Attributes

Returns

the resolved flow name.
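
Because target is separate from name, several flows can be pointed at one dataset. The sketch below registers a table with no flow of its own and then attaches one one-time flow per region. It assumes that a table created with df = None accepts such flows; the paths, region codes, and the chosen SQL configuration value are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.pipelines.Pipeline

// Hypothetical helper: two one-time flows writing into a single table.
def defineRegionalFlows(spark: SparkSession, pipe: Pipeline): Seq[String] = {
  // Table with no flow; relies on the defaults used in the class-level example.
  pipe.createTable("events", None)

  Seq("us", "eu").map { region =>
    pipe.defineFlow(
      name = s"events_from_$region",
      df = spark.read.json(s"/data/events/$region"), // hypothetical path
      target = Some("events"),
      once = true, // one-time (batch) flow
      sqlConf = Map("spark.sql.shuffle.partitions" -> "8")
    )
  }
}
```

The returned sequence holds the resolved flow names, one per region.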

def defineSql(sqlText: String, sqlFilePath: Option[String]): Unit

Register datasets and flows from a SQL definition.

Value parameters

sqlFilePath

the originating SQL file path, if any.

sqlText

the SQL source.
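
A short sketch of registering a dataset from SQL text. It assumes that defineSql accepts the CREATE MATERIALIZED VIEW statement used by Spark Declarative Pipelines SQL files; the table orders and its columns are hypothetical.

```scala
import org.apache.spark.sql.pipelines.Pipeline

// Hypothetical helper: registers one materialized view from inline SQL.
def defineFromSql(pipe: Pipeline): Unit = {
  val sqlText =
    """CREATE MATERIALIZED VIEW daily_totals AS
      |SELECT order_date, SUM(amount) AS total
      |FROM orders
      |GROUP BY order_date""".stripMargin

  // No originating file, so sqlFilePath is None.
  pipe.defineSql(sqlText, sqlFilePath = None)
}
```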

def drop(): Unit

Drop this dataflow graph and stop any attached flows.
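
One way to make sure the graph is always released is to pair startRun with drop in a try/finally block. This is a sketch of that pattern under the assumption that drop is safe to call after a failed run; the helper name runOnce is hypothetical.

```scala
import org.apache.spark.sql.pipelines.Pipeline

// Hypothetical helper: run one update, then drop the graph even on failure.
def runOnce(pipe: Pipeline): Unit =
  try {
    // Blocks until the update completes, then prints each emitted event.
    pipe.startRun().foreach(println)
  } finally {
    pipe.drop()
  }
```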

def read(name: String): DataFrame

Reference a dataset defined in this pipeline as a DataFrame (so later flows can read from earlier outputs).

Value parameters

name

the dataset name.

Attributes

Returns

a DataFrame reading the named dataset.

def startRun(fullRefresh: Seq[String], fullRefreshAll: Boolean, refresh: Seq[String], dry: Boolean, storage: Option[String]): Seq[PipelineEvent]

Resolve the graph and run a pipeline update. Blocks until the update completes, returning the events emitted during the run.

Value parameters

dry

if true, validate the graph without executing any flows.

fullRefresh

datasets to reset and recompute.

fullRefreshAll

if true, reset and recompute every dataset.

refresh

datasets to update.

storage

checkpoint/metadata storage location.

Attributes

Returns

the events emitted during the run.
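
The two sketches below spell out every startRun argument: one validates the graph with dry = true, the other resets one dataset and updates another. The dataset names bronze and silver come from the class-level example; passing storage = None assumes the location given at creation time is reused, which this page does not confirm. The return types are left to inference so that no import path for PipelineEvent has to be assumed.

```scala
import org.apache.spark.sql.pipelines.Pipeline

// Hypothetical helper: validate the graph without executing any flow.
def dryRun(pipe: Pipeline) =
  pipe.startRun(
    fullRefresh = Seq.empty,
    fullRefreshAll = false,
    refresh = Seq.empty,
    dry = true,
    storage = None
  )

// Hypothetical helper: reset "silver" and update "bronze" in one run.
def refreshSelected(pipe: Pipeline) =
  pipe.startRun(
    fullRefresh = Seq("silver"),
    fullRefreshAll = false,
    refresh = Seq("bronze"),
    dry = false,
    storage = None
  )
```

Either helper returns the events emitted during the run, for example `dryRun(pipe).foreach(println)`.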

Concrete fields

val graphId: String