Type Parameters: T - the type that will be written to the Sink.
@Experimental(value=SOURCE_SINK)
public abstract class Sink<T> extends Object implements Serializable, HasDisplayData
Sink represents a resource that can be written to using the Write transform.
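A parallel write to a Sink consists of three phases: a sequential initialization phase, a parallel write phase in which workers write bundles of records, and a sequential finalization phase.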
The Write transform can be used in a Dataflow pipeline to perform this write.
Specifically, a Write transform can be applied to a PCollection p by:
p.apply(Write.to(new MySink()));
Implementing a Sink and the corresponding write operations requires extending three
abstract classes:
Sink: an immutable logical description of the location/resource to write to.
Depending on the type of sink, it may contain fields such as the path to an output directory
on a filesystem, a database table name, etc. Implementors of Sink must
implement two methods: validate(org.apache.beam.sdk.options.PipelineOptions) and createWriteOperation(org.apache.beam.sdk.options.PipelineOptions).
The validate method is called by the Write transform at pipeline creation, and should
validate that the Sink can be written to. The createWriteOperation method is also called at
pipeline creation, and should return a WriteOperation object that defines how to write to the
Sink. Note that implementations of Sink must be serializable and Sinks must be immutable.
Sink.WriteOperation: The WriteOperation implements the initialization and
finalization phases of a write. Implementors of Sink.WriteOperation must implement the
corresponding Sink.WriteOperation.initialize(org.apache.beam.sdk.options.PipelineOptions) and Sink.WriteOperation.finalize(java.lang.Iterable<WriteT>, org.apache.beam.sdk.options.PipelineOptions) methods. A
WriteOperation must also implement Sink.WriteOperation.createWriter(org.apache.beam.sdk.options.PipelineOptions) that creates Writers,
Sink.WriteOperation.getWriterResultCoder() that returns a Coder for the result of a
parallel write, and a Sink.WriteOperation.getSink() that returns the Sink that the write
operation corresponds to. See below for more information about these methods and restrictions on
their implementation.
Sink.Writer: A Writer writes a bundle of records. Writer defines four methods:
Sink.Writer.open(java.lang.String), which is called once at the start of writing a bundle; Sink.Writer.write(T),
which writes a single record from the bundle; Sink.Writer.close(), which is called once at the
end of writing a bundle; and Sink.Writer.getWriteOperation(), which returns the write operation
that the writer belongs to.
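The division of work between the three classes can be sketched with simplified stand-ins. The code below is a minimal illustration, not the SDK classes: SketchWriteOperation, SketchWriter, InMemoryStore, InMemoryWriteOperation and SketchWriteDriver are hypothetical names, PipelineOptions, coders and serialization are left out, close() is assumed to return the writer result, and the driver runs the bundles one after another instead of in parallel.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-ins for Sink.WriteOperation and Sink.Writer.
// They omit PipelineOptions, coders and serialization, and are NOT the SDK classes.
abstract class SketchWriteOperation<T, WriteT> {
    abstract void initialize();
    abstract SketchWriter<T, WriteT> createWriter();
    // Named finalizeWrite here only to avoid confusion with Object.finalize().
    abstract void finalizeWrite(Iterable<WriteT> writerResults);
}

abstract class SketchWriter<T, WriteT> {
    abstract void open(String uId);
    abstract void write(T value);
    abstract WriteT close(); // assumption: close() hands back the writer result
}

// Stands in for the external resource that is being written to.
final class InMemoryStore {
    final Map<String, List<String>> staged = new HashMap<>();
    final List<String> committed = new ArrayList<>();
}

final class InMemoryWriteOperation extends SketchWriteOperation<String, String> {
    private final InMemoryStore store;

    InMemoryWriteOperation(InMemoryStore store) { this.store = store; }

    // Initialization phase: start from an empty staging area.
    @Override void initialize() {
        store.staged.clear();
        store.committed.clear();
    }

    @Override SketchWriter<String, String> createWriter() {
        return new SketchWriter<String, String>() {
            private String uId;
            @Override void open(String uId) {
                this.uId = uId;
                store.staged.put(uId, new ArrayList<>());
            }
            @Override void write(String value) { store.staged.get(uId).add(value); }
            // The writer result is simply the bundle id the records were staged under.
            @Override String close() { return uId; }
        };
    }

    // Finalization phase: commit only the bundles whose results were passed in.
    @Override void finalizeWrite(Iterable<String> writerResults) {
        store.committed.clear();
        for (String uId : writerResults) {
            store.committed.addAll(store.staged.get(uId));
        }
    }
}

final class SketchWriteDriver {
    // Runs the three phases in order; a real runner would distribute the bundles to workers.
    static <T, WriteT> void run(SketchWriteOperation<T, WriteT> op, List<List<T>> bundles) {
        op.initialize();
        List<WriteT> results = new ArrayList<>();
        int bundleNumber = 0;
        for (List<T> bundle : bundles) {
            SketchWriter<T, WriteT> writer = op.createWriter();
            writer.open("bundle-" + bundleNumber++);
            for (T record : bundle) {
                writer.write(record);
            }
            results.add(writer.close());
        }
        op.finalizeWrite(results);
    }
}
```

Keeping the staged data in a separate store object, and not in fields of the write operation, mirrors the restriction that writers should not rely on mutating the WriteOperation itself.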
Sink.WriteOperation.initialize(org.apache.beam.sdk.options.PipelineOptions) and Sink.WriteOperation.finalize(java.lang.Iterable<WriteT>, org.apache.beam.sdk.options.PipelineOptions) are conceptually called
once: at the beginning and end of a Write transform. However, implementors must ensure that these
methods are idempotent, as they may be called multiple times on different machines in the case of
failure/retry or for redundancy.
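One way to make file-based initialization and finalization steps tolerate repeated calls is sketched below. It is an illustration under assumptions, not SDK code: IdempotentFileSteps and its method names are hypothetical, the steps operate on local files through java.nio.file, and the sketch covers sequential repeats only, not two machines running the same step at the same moment.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper showing initialization and finalization steps that can be repeated.
final class IdempotentFileSteps {
    // createDirectories does nothing if the directory already exists, so repeats are harmless.
    static void initialize(Path tempDir) {
        try {
            Files.createDirectories(tempDir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Moves each temporary file into outputDir under the same file name.
    // On a repeated call the source is gone and the target exists, so nothing changes.
    static void finalizeWrite(List<Path> tempFiles, Path outputDir) {
        try {
            Files.createDirectories(outputDir);
            for (Path temp : tempFiles) {
                Path target = outputDir.resolve(temp.getFileName());
                if (Files.exists(temp)) {
                    Files.move(temp, target, StandardCopyOption.REPLACE_EXISTING);
                } else if (!Files.exists(target)) {
                    throw new IllegalStateException("Missing both " + temp + " and " + target);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Calls each step twice and reports whether the end state is the expected one.
    static boolean demo() {
        try {
            Path root = Files.createTempDirectory("sink-sketch");
            Path tempDir = root.resolve("tmp");
            Path outputDir = root.resolve("out");
            initialize(tempDir);
            initialize(tempDir);
            Path temp = tempDir.resolve("bundle-1.txt");
            Files.write(temp, Arrays.asList("record"), StandardCharsets.UTF_8);
            finalizeWrite(Arrays.asList(temp), outputDir);
            finalizeWrite(Arrays.asList(temp), outputDir);
            return Files.exists(outputDir.resolve("bundle-1.txt")) && !Files.exists(temp);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```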
The finalize method of WriteOperation is passed an Iterable of a writer result type. This writer result type should encode the result of a write and, in most cases, some encoding of the unique bundle id.
All implementations of Sink.WriteOperation must be serializable.
WriteOperation may have mutable state. For instance, Sink.WriteOperation.initialize(org.apache.beam.sdk.options.PipelineOptions) may
mutate the object state. These mutations will be visible in Sink.WriteOperation.createWriter(org.apache.beam.sdk.options.PipelineOptions)
and Sink.WriteOperation.finalize(java.lang.Iterable<WriteT>, org.apache.beam.sdk.options.PipelineOptions) because the object will be serialized after initialize and
deserialized before these calls. However, it is not serialized again after createWriter is
called, as createWriter will be called within workers to create Writers for the bundles that are
distributed to these workers. Therefore, createWriter should not mutate the WriteOperation state (as
these mutations will not be visible in finalize).
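The visibility rule follows from how Java serialization copies objects, which the following sketch demonstrates. SketchOperation and SerializationRoundTrip are hypothetical names, the temp location string is made up, and the round trip through a byte array only imitates shipping the object to a worker.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for a WriteOperation with mutable state.
final class SketchOperation implements Serializable {
    private static final long serialVersionUID = 1L;

    String tempLocation; // set in initialize(), before serialization
    int writersCreated;  // mutated in createWriter(): the mistake being illustrated

    void initialize() { tempLocation = "/tmp/sketch-output"; }

    void createWriter() { writersCreated++; }
}

final class SerializationRoundTrip {
    // Serializes and deserializes obj, imitating the copy a worker would receive.
    static <S extends Serializable> S copy(S obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                @SuppressWarnings("unchecked")
                S result = (S) in.readObject();
                return result;
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns {writersCreated seen by the original, writersCreated seen by the worker copy}.
    static int[] demo() {
        SketchOperation original = new SketchOperation();
        original.initialize();                       // mutation before serialization
        SketchOperation onWorker = copy(original);   // copy carries tempLocation
        onWorker.createWriter();                     // mutation stays on the worker copy
        return new int[] {original.writersCreated, onWorker.writersCreated};
    }
}
```

In the sketch the worker copy sees the value set by initialize, while the counter incremented on the worker never reaches the original object, which plays the role of the instance used for finalization.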
In order to ensure fault-tolerance, a bundle may be executed multiple times (e.g., in the
event of failure/retry or for redundancy). However, exactly one of these executions will have its
result passed to the WriteOperation's finalize method. Each call to Sink.Writer.open(java.lang.String) is passed
a unique bundle id when it is called by the Write transform, so even redundant or retried
bundles will have a unique way of identifying their output.
The bundle id should be used to guarantee that a bundle's output is unique. This uniqueness guarantee is important; if a bundle is to be output to a file, for example, the name of the file must be unique to avoid conflicts with other Writers. The bundle id should be encoded in the writer result returned by the Writer and subsequently used by the WriteOperation's finalize method to identify the results of successful writes.
For example, consider the scenario where a Writer writes files containing serialized records and the WriteOperation's finalization step is to merge or rename these output files. In this case, a Writer may use its unique id to name its output file (to avoid conflicts) and return the name of the file it wrote as its writer result. The WriteOperation will then receive an Iterable of output file names that it can then merge or rename using some bundle naming scheme.
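The file scenario above can be sketched as follows. This is a minimal illustration under assumptions: BundleFileWriter and BundleFileMerger are hypothetical classes, the bundle id strings are invented for the demo, the writer result is the path of the file that was written, and finalization merges the files in sorted name order into a single output file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical writer that names its output file after the unique bundle id.
final class BundleFileWriter {
    private final Path tempDir;
    private final List<String> records = new ArrayList<>();
    private Path file;

    BundleFileWriter(Path tempDir) { this.tempDir = tempDir; }

    void open(String uId) {
        file = tempDir.resolve("bundle-" + uId + ".txt");
        records.clear();
    }

    void write(String record) { records.add(record); }

    // Writes the bundle and returns the file name as the writer result.
    String close() {
        try {
            Files.write(file, records, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return file.toString();
    }
}

final class BundleFileMerger {
    // Merges only the files named in writerResults; files from other attempts are ignored.
    static void finalizeWrite(Iterable<String> writerResults, Path output) {
        List<String> names = new ArrayList<>();
        writerResults.forEach(names::add);
        Collections.sort(names);
        List<String> merged = new ArrayList<>();
        try {
            for (String name : names) {
                merged.addAll(Files.readAllLines(Paths.get(name), StandardCharsets.UTF_8));
            }
            Files.write(output, merged, StandardCharsets.UTF_8); // overwrites on a repeated call
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static String writeBundle(Path tempDir, String uId, List<String> records) {
        BundleFileWriter writer = new BundleFileWriter(tempDir);
        writer.open(uId);
        for (String record : records) {
            writer.write(record);
        }
        return writer.close();
    }

    // Bundle 0 is executed twice with different ids; only the second result is finalized.
    static List<String> demo() {
        try {
            Path tempDir = Files.createTempDirectory("sink-sketch");
            writeBundle(tempDir, "0-attempt-1", Arrays.asList("x", "y"));
            String retried = writeBundle(tempDir, "0-attempt-2", Arrays.asList("x", "y"));
            String other = writeBundle(tempDir, "1-attempt-1", Arrays.asList("z"));
            Path output = tempDir.resolve("merged.txt");
            finalizeWrite(Arrays.asList(retried, other), output);
            return Files.readAllLines(output, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because each attempt writes to its own file, the retried bundle never clobbers the first attempt, and the merged output contains each record once even though bundle 0 was written twice.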
Sink.WriteOperations and Sink.Writers must agree on a writer result type that will be
returned by a Writer after it writes a bundle. This type can be a client-defined object or an
existing type; Sink.WriteOperation.getWriterResultCoder() should return a Coder for the
type.
A note about thread safety: Any use of static members or methods in Writer should be thread safe, as different instances of Writer objects may be created in different threads on the same worker.
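As an illustration of the thread-safety note, the sketch below shares a static counter between writer instances running on several threads. CountingWriter is a hypothetical class, not part of the SDK, and the thread pool merely imitates a worker that runs several writers concurrently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical writer with a static member that every instance in the JVM shares.
final class CountingWriter {
    // AtomicLong makes concurrent increments from different writer instances safe;
    // a plain static long incremented with ++ could lose updates.
    private static final AtomicLong RECORDS_WRITTEN = new AtomicLong();

    void write(String record) {
        RECORDS_WRITTEN.incrementAndGet();
    }

    static long recordsWritten() {
        return RECORDS_WRITTEN.get();
    }

    // Runs one writer per thread and returns the shared counter afterwards.
    static long demo(int writers, int recordsPerWriter) {
        ExecutorService pool = Executors.newFixedThreadPool(writers);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (int i = 0; i < writers; i++) {
                futures.add(pool.submit(() -> {
                    CountingWriter writer = new CountingWriter();
                    for (int j = 0; j < recordsPerWriter; j++) {
                        writer.write("record");
                    }
                }));
            }
            for (Future<?> future : futures) {
                future.get(); // wait for every writer to finish
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return recordsWritten();
    }
}
```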
| Modifier and Type | Class and Description |
|---|---|
| static class | Sink.WriteOperation<T,WriteT>: A Sink.WriteOperation defines the process of a parallel write of objects to a Sink. |
| static class | Sink.Writer<T,WriteT>: A Writer writes a bundle of elements from a PCollection to a sink. |
| Constructor and Description |
|---|
| Sink() |
| Modifier and Type | Method and Description |
|---|---|
| abstract Sink.WriteOperation<T,?> | createWriteOperation(PipelineOptions options): Returns an instance of a Sink.WriteOperation that can write to this Sink. |
| void | populateDisplayData(DisplayData.Builder builder): Register display data for the given transform or component. |
| abstract void | validate(PipelineOptions options): Ensures that the sink is valid and can be written to before the write operation begins. |
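public abstract void validate(PipelineOptions options)
Ensures that the sink is valid and can be written to before the write operation begins. One should use Preconditions to implement this method.
public abstract Sink.WriteOperation<T,?> createWriteOperation(PipelineOptions options)
Returns an instance of a Sink.WriteOperation that can write to this Sink.
public void populateDisplayData(DisplayData.Builder builder)
Register display data for the given transform or component.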
populateDisplayData(DisplayData.Builder) is invoked by Pipeline runners to collect
display data via DisplayData.from(HasDisplayData). Implementations may call
super.populateDisplayData(builder) in order to register display data in the current
namespace, but should otherwise use subcomponent.populateDisplayData(builder) to use
the namespace of the subcomponent.
By default, does not register any display data. Implementors may override this method to provide their own display data.
Specified by: populateDisplayData in interface HasDisplayData
Parameters: builder - The builder to populate with display data.
See Also: HasDisplayData