A Spark Declarative Pipeline (SDP) dataflow graph.
A pipeline is built by registering outputs (tables, materialized views, temporary views, or sinks) and the flows that populate them, then started with startRun. Each flow is defined by a DataFrame (an unresolved relation), so flows are composed with the same API used for ordinary queries.
Create one with Pipeline.create.
Attributes
- Note
-
foreach/foreachBatchflows and query-function evaluation are not supported (they require user-defined functions); define each flow with a relation instead. - Example
-
val pipe = Pipeline.create(spark, storage = Some("/tmp/pipeline_storage")) pipe.createMaterializedView("bronze", Some(spark.read.json("/data/raw"))) pipe.createTable("silver", Some(pipe.read("bronze").filter(col("ok")))) val events = pipe.startRun() - Companion
- object
- Graph
-
- Supertypes
-
class Objecttrait Matchableclass Any
Members list
Value members
Concrete methods
Define a materialized view and the flow that populates it.
Define a materialized view and the flow that populates it.
Attributes
- Returns
-
the resolved output identifier.
Define a streaming sink and the flow that feeds it.
Define a streaming sink and the flow that feeds it.
Value parameters
- df
-
the flow feeding the sink.
- format
-
the streaming write format.
- name
-
the sink name.
- options
-
the streaming write options.
Attributes
- Returns
-
the resolved output identifier.
Define a published table and the flow that populates it.
Define a published table and the flow that populates it.
Value parameters
- df
-
the query that populates the table (a flow), or
Nonefor a table with no flow. - name
-
the table name.
Attributes
- Returns
-
the resolved output identifier.
Define a (non-published) temporary view and its flow.
Define a (non-published) temporary view and its flow.
Attributes
- Returns
-
the resolved output identifier.
Define a flow that writes the contents of df into target.
Define a flow that writes the contents of df into target.
Value parameters
- df
-
the relation defining the flow.
- name
-
the flow name.
- once
-
define as a one-time (batch) flow.
- sqlConf
-
SQL configurations set when running this flow.
- target
-
the dataset the flow writes to (defaults to
name).
Attributes
- Returns
-
the resolved flow name.
Register datasets and flows from a SQL definition.
Register datasets and flows from a SQL definition.
Value parameters
- sqlFilePath
-
the originating SQL file path, if any.
- sqlText
-
the SQL source.
Attributes
Drop this dataflow graph and stop any attached flows.
Drop this dataflow graph and stop any attached flows.
Attributes
Reference a dataset defined in this pipeline as a DataFrame (so later flows can read from earlier outputs).
Resolve the graph and run a pipeline update. Blocks until the update completes, returning the events emitted during the run.
Resolve the graph and run a pipeline update. Blocks until the update completes, returning the events emitted during the run.
Value parameters
- dry
-
validate the graph without executing flows.
- fullRefresh
-
datasets to reset and recompute.
- fullRefreshAll
-
reset and recompute everything.
- refresh
-
datasets to update.
- storage
-
checkpoint/metadata storage location.
Attributes
- Returns
-
the events emitted during the run.