public class Pipeline extends Object
A Pipeline manages a directed acyclic graph of PTransforms and the
PCollections that those PTransforms consume and produce.
A Pipeline is initialized with a PipelineRunner that will later
execute the Pipeline.
Pipelines are independent, so they can be constructed and executed
concurrently.
Each Pipeline is self-contained and isolated from any other
Pipeline. The PValues consumed and produced by a Pipeline's
PTransforms are owned by that Pipeline, and a PValue owned by one
Pipeline can be read only by PTransforms owned by the same Pipeline.
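The ownership rule can be pictured with a small stand-alone sketch. It does not use the Beam API: `OwnershipSketch`, `ToyPipeline`, and `ToyValue` are hypothetical names, and throwing an exception on a cross-pipeline read is an assumption made for illustration, not a statement about how the SDK enforces the rule.

```java
/** Toy model of value ownership. Not Beam code; all names here are hypothetical. */
final class OwnershipSketch {

    /** Stand-in for a pipeline that creates values and checks ownership on read. */
    static final class ToyPipeline {
        ToyValue newValue(String name) {
            return new ToyValue(this, name);
        }

        /** Reads a value on behalf of a transform that belongs to this pipeline. */
        String read(ToyValue value) {
            if (value.owner != this) {
                // Assumed enforcement strategy, chosen only to make the rule visible.
                throw new IllegalArgumentException(
                    value.name + " is owned by a different pipeline");
            }
            return value.name;
        }
    }

    /** Stand-in for a value that remembers the pipeline that created it. */
    static final class ToyValue {
        final ToyPipeline owner;
        final String name;

        ToyValue(ToyPipeline owner, String name) {
            this.owner = owner;
            this.name = name;
        }
    }

    public static void main(String[] args) {
        ToyPipeline first = new ToyPipeline();
        ToyPipeline second = new ToyPipeline();
        ToyValue lines = first.newValue("lines");

        System.out.println(first.read(lines)); // the owning pipeline may read it
        try {
            second.read(lines);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```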
Here is a typical example of use:
// Start by defining the options for the pipeline.
PipelineOptions options = PipelineOptionsFactory.create();
// Then create the pipeline. The runner is determined by the options.
Pipeline p = Pipeline.create(options);
// A root PTransform, like TextIO.Read or Create, gets added
// to the Pipeline by being applied:
PCollection<String> lines =
    p.apply(TextIO.Read.from("gs://bucket/dir/file*.txt"));
// A Pipeline can have multiple root transforms:
PCollection<String> moreLines =
    p.apply(TextIO.Read.from("gs://bucket/other/dir/file*.txt"));
PCollection<String> yetMoreLines =
    p.apply(Create.of("yet", "more", "lines").withCoder(StringUtf8Coder.of()));
// Further PTransforms can be applied, in an arbitrary (acyclic) graph.
// Subsequent PTransforms (and intermediate PCollections etc.) are
// implicitly part of the same Pipeline.
PCollection<String> allLines =
    PCollectionList.of(lines).and(moreLines).and(yetMoreLines)
        .apply(Flatten.<String>pCollections());
// ExtractWords is a user-defined DoFn (not shown).
PCollection<KV<String, Long>> wordCounts =
    allLines
        .apply(ParDo.of(new ExtractWords()))
        .apply(Count.<String>perElement());
PCollection<String> formattedWordCounts =
    wordCounts.apply(ParDo.of(new FormatCounts()));
formattedWordCounts.apply(TextIO.Write.to("gs://bucket/dir/counts.txt"));
// PTransforms aren't executed when they're applied; they're just
// added to the Pipeline. Once the whole Pipeline of PTransforms
// is constructed, the Pipeline's PTransforms can be run using a
// PipelineRunner. The default PipelineRunner executes the Pipeline
// directly, sequentially, in this one process, which is useful for
// unit tests and simple experiments:
p.run();
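The split between building the graph and running it, described in the closing comment of the example, can be sketched without the Beam SDK. The class below is a deliberately simplified stand-in: `DeferredPipelineSketch` is a hypothetical name, it models a linear chain of steps rather than a general acyclic graph, and it says nothing about how a real PipelineRunner schedules work.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

/** Toy pipeline that records steps on apply() and executes them on run(). Not Beam code. */
final class DeferredPipelineSketch {
    private final List<String> names = new ArrayList<>();
    private final List<UnaryOperator<List<String>>> steps = new ArrayList<>();

    /** Records a named step. Nothing is executed here. */
    DeferredPipelineSketch apply(String name, UnaryOperator<List<String>> step) {
        names.add(name);
        steps.add(step);
        return this;
    }

    /** Names of the recorded steps, in application order. */
    List<String> stepNames() {
        return new ArrayList<>(names);
    }

    /** Executes the recorded steps sequentially, in this process. */
    List<String> run(List<String> input) {
        List<String> current = new ArrayList<>(input);
        for (UnaryOperator<List<String>> step : steps) {
            current = step.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        DeferredPipelineSketch p = new DeferredPipelineSketch()
            .apply("Upper", in -> {
                log.add("Upper ran");
                List<String> out = new ArrayList<>();
                for (String s : in) {
                    out.add(s.toUpperCase());
                }
                return out;
            });

        System.out.println(log.isEmpty());                  // true: apply() only recorded the step
        System.out.println(p.run(Arrays.asList("a", "b"))); // [A, B]
    }
}
```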

| Modifier and Type | Class and Description |
|---|---|
| static class | Pipeline.PipelineExecutionException |
| static interface | Pipeline.PipelineVisitor: A Pipeline.PipelineVisitor can be passed into traverseTopologically(org.apache.beam.sdk.Pipeline.PipelineVisitor) to be called for each of the transforms and values in the Pipeline. |

| Modifier | Constructor and Description |
|---|---|
| protected | Pipeline(PipelineRunner<?> runner): Deprecated. Replaced by Pipeline(PipelineRunner, PipelineOptions). |
| protected | Pipeline(PipelineRunner<?> runner, PipelineOptions options) |

| Modifier and Type | Method and Description |
|---|---|
| void | addValueInternal(PValue value) |
| <OutputT extends POutput> OutputT | apply(PTransform<? super PBegin,OutputT> root): Like apply(String, PTransform) but the transform node in the Pipeline graph will be named according to PTransform.getName(). |
| <OutputT extends POutput> OutputT | apply(String name, PTransform<? super PBegin,OutputT> root) |
| static <InputT extends PInput,OutputT extends POutput> OutputT | applyTransform(InputT input, PTransform<? super InputT,OutputT> transform): Like applyTransform(String, PInput, PTransform) but defaulting to the name provided by the PTransform. |
| static <InputT extends PInput,OutputT extends POutput> OutputT | applyTransform(String name, InputT input, PTransform<? super InputT,OutputT> transform): Applies the given PTransform to this input InputT and returns its OutputT. |
| PBegin | begin(): Returns a PBegin owned by this Pipeline. |
| static Pipeline | create(PipelineOptions options): Constructs a pipeline from the provided options. |
| CoderRegistry | getCoderRegistry(): Returns the CoderRegistry that this Pipeline uses. |
| String | getFullNameForTesting(PTransform<?,?> transform): Deprecated. This method is no longer compatible with the design of Pipeline, as PTransforms can be applied multiple times, with different names each time. |
| PipelineOptions | getOptions(): Returns the configured PipelineOptions. |
| PipelineRunner<?> | getRunner(): Returns the configured PipelineRunner. |
| PipelineResult | run(): Runs the Pipeline using its PipelineRunner. |
| void | setCoderRegistry(CoderRegistry coderRegistry): Sets the CoderRegistry that this Pipeline uses. |
| String | toString() |
| void | traverseTopologically(Pipeline.PipelineVisitor visitor): Invokes the PipelineVisitor's Pipeline.PipelineVisitor.visitPrimitiveTransform(org.apache.beam.sdk.runners.TransformTreeNode) and Pipeline.PipelineVisitor.visitValue(org.apache.beam.sdk.values.PValue, org.apache.beam.sdk.runners.TransformTreeNode) operations on each of this Pipeline's transform and value nodes, in forward topological order. |

@Deprecated protected Pipeline(PipelineRunner<?> runner)
Deprecated. Replaced by Pipeline(PipelineRunner, PipelineOptions).

protected Pipeline(PipelineRunner<?> runner, PipelineOptions options)

public static Pipeline create(PipelineOptions options)
Constructs a pipeline from the provided options.

public PBegin begin()
Returns a PBegin owned by this Pipeline.

public <OutputT extends POutput> OutputT apply(PTransform<? super PBegin,OutputT> root)
Like apply(String, PTransform) but the transform node in the Pipeline graph will be named according to PTransform.getName().
See also: apply(String, PTransform)

public <OutputT extends POutput> OutputT apply(String name, PTransform<? super PBegin,OutputT> root)
Adds a root PTransform, such as Read or Create, to this Pipeline.
The node in the Pipeline graph will use the provided name. This name is used in various places, including the monitoring UI, logging, and to stably identify this node in the Pipeline graph upon update.
Alias for begin().apply(name, root).

public PipelineResult run()
Runs the Pipeline using its PipelineRunner.

public CoderRegistry getCoderRegistry()
Returns the CoderRegistry that this Pipeline uses.

public void setCoderRegistry(CoderRegistry coderRegistry)
Sets the CoderRegistry that this Pipeline uses.

public void traverseTopologically(Pipeline.PipelineVisitor visitor)
Invokes the PipelineVisitor's Pipeline.PipelineVisitor.visitPrimitiveTransform(org.apache.beam.sdk.runners.TransformTreeNode) and Pipeline.PipelineVisitor.visitValue(org.apache.beam.sdk.values.PValue, org.apache.beam.sdk.runners.TransformTreeNode) operations on each of this Pipeline's transform and value nodes, in forward topological order.
Traversal of the Pipeline causes PTransforms and PValues owned by the Pipeline to be marked as finished, at which point they may no longer be modified.
Typically invoked by PipelineRunner subclasses.
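
To make "forward topological order" and "marked as finished" concrete, here is a self-contained sketch of a visitor-driven traversal over a toy graph. It is not the SDK's implementation: `TopologicalTraversalSketch`, `Node`, `Visitor`, and `setNote` are hypothetical, and the single-callback visitor is a simplification of the real PipelineVisitor, which declares separate callbacks for transforms and values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy graph illustrating forward topological traversal with a visitor. Not Beam code. */
final class TopologicalTraversalSketch {

    /** Hypothetical node (a transform or a value) together with the nodes it consumes. */
    static final class Node {
        final String name;
        final List<Node> inputs;
        boolean finished;
        private String note = "";

        Node(String name, Node... inputs) {
            this.name = name;
            this.inputs = Arrays.asList(inputs);
        }

        /** Example mutator: rejected once the node has been marked finished. */
        void setNote(String note) {
            if (finished) {
                throw new IllegalStateException(name + " is finished and can no longer be modified");
            }
            this.note = note;
        }
    }

    /** Hypothetical single-method visitor. */
    interface Visitor {
        void visit(Node node);
    }

    private final List<Node> nodes = new ArrayList<>();

    Node add(String name, Node... inputs) {
        Node node = new Node(name, inputs);
        nodes.add(node);
        return node;
    }

    /** Visits every node after all of its inputs, marking each one finished. */
    void traverseTopologically(Visitor visitor) {
        Set<Node> seen = new HashSet<>();
        for (Node node : nodes) {
            visit(node, seen, visitor);
        }
    }

    private static void visit(Node node, Set<Node> seen, Visitor visitor) {
        if (!seen.add(node)) {
            return; // already visited through another consumer
        }
        for (Node input : node.inputs) {
            visit(input, seen, visitor);
        }
        node.finished = true;
        visitor.visit(node);
    }

    public static void main(String[] args) {
        TopologicalTraversalSketch graph = new TopologicalTraversalSketch();
        Node readA = graph.add("ReadA");
        Node readB = graph.add("ReadB");
        Node flatten = graph.add("Flatten", readA, readB);
        graph.add("Count", flatten);

        List<String> order = new ArrayList<>();
        graph.traverseTopologically(node -> order.add(node.name));
        System.out.println(order); // [ReadA, ReadB, Flatten, Count]
    }
}
```

After the traversal, calling `setNote` on any visited node throws `IllegalStateException`, which mirrors the "may no longer be modified" rule in this toy model only.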

public static <InputT extends PInput,OutputT extends POutput> OutputT applyTransform(InputT input, PTransform<? super InputT,OutputT> transform)
Like applyTransform(String, PInput, PTransform) but defaulting to the name provided by the PTransform.

public static <InputT extends PInput,OutputT extends POutput> OutputT applyTransform(String name, InputT input, PTransform<? super InputT,OutputT> transform)
Applies the given PTransform to this input InputT and returns its OutputT. This uses name to identify this specific application of the transform. This name is used in various places, including the monitoring UI, logging, and to stably identify this application node in the Pipeline graph during update.
Each PInput subclass that provides an apply method should delegate to this method to ensure proper registration with the PipelineRunner.
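
The delegation advice can be illustrated with a stand-alone sketch in which an input type's `apply` method forwards to one static entry point that records every application. Nothing here is Beam API: `ApplyDelegationSketch`, `Words`, and the `REGISTERED` list are hypothetical, and `java.util.function.Function` stands in for PTransform.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Toy model of routing every apply() call through one static method. Not Beam code. */
final class ApplyDelegationSketch {

    /** Names of every transform application, recorded in one central place. */
    static final List<String> REGISTERED = new ArrayList<>();

    /** Single entry point: registers the application, then applies the transform. */
    static <I, O> O applyTransform(String name, I input, Function<? super I, O> transform) {
        REGISTERED.add(name);
        return transform.apply(input);
    }

    /** Hypothetical input type whose apply method delegates to applyTransform. */
    static final class Words {
        final List<String> values;

        Words(List<String> values) {
            this.values = values;
        }

        <O> O apply(String name, Function<? super Words, O> transform) {
            // Delegating keeps registration in one place instead of in each input type.
            return applyTransform(name, this, transform);
        }
    }

    public static void main(String[] args) {
        Words words = new Words(Arrays.asList("a", "bb", "ccc"));
        Integer count = words.apply("CountWords", in -> in.values.size());
        System.out.println(count);      // 3
        System.out.println(REGISTERED); // [CountWords]
    }
}
```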

public PipelineRunner<?> getRunner()
Returns the configured PipelineRunner.

public PipelineOptions getOptions()
Returns the configured PipelineOptions.

@Deprecated public String getFullNameForTesting(PTransform<?,?> transform)
Deprecated. This method is no longer compatible with the design of Pipeline, as PTransforms can be applied multiple times, with different names each time.

public void addValueInternal(PValue value)