Type Parameters:
InputT - the type of the (main) input elements
OutputT - the type of the (main) output elements

public abstract class DoFn<InputT,OutputT> extends Object implements Serializable, HasDisplayData
The argument to ParDo providing the code to use to process
elements of the input
PCollection.
See ParDo for more explanation, examples of use, and
discussion of constraints on DoFns, including their
serializability, lack of access to global shared mutable state,
requirements for failure tolerance, and benefits of optimization.
DoFns can be tested in the context of a particular
Pipeline by running that Pipeline on sample input
and then checking its output. Unit testing of a DoFn,
separately from any ParDo transform or Pipeline,
can be done via the DoFnTester harness.
DoFnWithContext (currently experimental) offers an alternative
mechanism for accessing DoFn.ProcessContext.window() without the need
to implement DoFn.RequiresWindowAccess.
See also processElement(DoFn.ProcessContext) for details on implementing the transformation
from InputT to OutputT.
| Modifier and Type | Class and Description |
|---|---|
| class | DoFn.Context: Information accessible to all methods in this DoFn. |
| class | DoFn.ProcessContext: Information accessible when running processElement(DoFn.ProcessContext). |
| static interface | DoFn.RequiresWindowAccess: Interface for signaling that a DoFn needs to access the window the element is being processed in, via DoFn.ProcessContext.window(). |
| Constructor and Description |
|---|
| DoFn() |
| Modifier and Type | Method and Description |
|---|---|
| protected <AggInputT,AggOutputT> Aggregator<AggInputT,AggOutputT> | createAggregator(String name, Combine.CombineFn<? super AggInputT,?,AggOutputT> combiner): Returns an Aggregator with aggregation logic specified by the Combine.CombineFn argument. |
| protected <AggInputT> Aggregator<AggInputT,AggInputT> | createAggregator(String name, SerializableFunction<Iterable<AggInputT>,AggInputT> combiner): Returns an Aggregator with the aggregation logic specified by the SerializableFunction argument. |
| void | finishBundle(DoFn.Context c): Finishes processing this batch of elements. |
| Duration | getAllowedTimestampSkew(): Deprecated. Does not interact well with the watermark. |
| protected TypeDescriptor<InputT> | getInputTypeDescriptor(): Returns a TypeDescriptor capturing what is known statically about the input type of this DoFn instance's most-derived class. |
| protected TypeDescriptor<OutputT> | getOutputTypeDescriptor(): Returns a TypeDescriptor capturing what is known statically about the output type of this DoFn instance's most-derived class. |
| void | populateDisplayData(DisplayData.Builder builder): Register display data for the given transform or component. |
| abstract void | processElement(DoFn.ProcessContext c): Processes one input element. |
| void | startBundle(DoFn.Context c): Prepares this DoFn instance for processing a batch of elements. |

@Deprecated public Duration getAllowedTimestampSkew()
Deprecated. Does not interact well with the watermark.
Returns the allowed timestamp skew duration, which is the maximum duration that timestamps can be shifted backward in DoFn.Context.outputWithTimestamp(OutputT, org.joda.time.Instant).
The default value is Duration.ZERO, in which case
timestamps can only be shifted forward, into the future. For infinite
skew, return Duration.millis(Long.MAX_VALUE).
Note that producing an element whose timestamp is earlier than the current timestamp may result in late data: returning a non-zero value here does not affect the watermark calculations used for firing windows.
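The skew rule can be illustrated with a small stand-alone check. This is only a sketch of the rule as described here, not the SDK's implementation: TimestampSkewCheck and isAllowed are hypothetical names, java.time types stand in for the Joda-Time types used by the SDK, and the lower bound is assumed to be inclusive.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper modelling the skew rule; not part of the SDK. */
final class TimestampSkewCheck {
    private TimestampSkewCheck() {}

    /**
     * Returns true if outputTimestamp is acceptable for an element whose own
     * timestamp is elementTimestamp, given the allowed backward skew.
     */
    static boolean isAllowed(Instant elementTimestamp,
                             Instant outputTimestamp,
                             Duration allowedSkew) {
        // Earliest permitted output timestamp: the element timestamp shifted
        // backward by the allowed skew (assumed inclusive in this sketch).
        Instant lowerBound = elementTimestamp.minus(allowedSkew);
        return !outputTimestamp.isBefore(lowerBound);
    }
}
```

With a zero skew only forward shifts pass the check; with a skew of two seconds, an output timestamp up to two seconds before the element timestamp passes as well.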

public void startBundle(DoFn.Context c) throws Exception
Prepares this DoFn instance for processing a batch of elements.
By default, does nothing.
Throws:
Exception

public abstract void processElement(DoFn.ProcessContext c) throws Exception
Processes one input element.
The current element of the input PCollection is returned by
c.element(). It should be considered immutable. The Dataflow
runtime will not mutate the element, so it is safe to cache, etc. The element should not be
mutated by any of the DoFn methods, because it may be cached elsewhere, retained by the
Dataflow runtime, or used in other unspecified ways.
A value is added to the main output PCollection by DoFn.Context.output(OutputT).
Once passed to output the element should be considered immutable and not be modified in
any way. It may be cached elsewhere, retained by the Dataflow runtime, or used in other
unspecified ways.
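One way to respect both immutability rules is to copy before modifying. The sketch below is plain Java with no SDK types: CopyBeforeModify and withSuffix are hypothetical names standing in for logic that would sit inside a processElement implementation, and the choice of a List<String> element is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for the body of a processElement method; not SDK code. */
final class CopyBeforeModify {
    private CopyBeforeModify() {}

    /** Builds the output value from a fresh copy so the input element is never mutated. */
    static List<String> withSuffix(List<String> element, String suffix) {
        // Copy first: the input element may be cached or shared elsewhere.
        List<String> copy = new ArrayList<>(element);
        copy.add(suffix);
        // Return a read-only view so the emitted value cannot be modified
        // by accident after it has been handed to output.
        return Collections.unmodifiableList(copy);
    }
}
```

The value returned by withSuffix is what would be passed to output; the original element is left exactly as it was received.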
Throws:
Exception
See Also:
DoFn.ProcessContext

public void finishBundle(DoFn.Context c) throws Exception
Finishes processing this batch of elements.
By default, does nothing.
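The three lifecycle methods are typically used together to manage per-bundle state such as a buffer. The following is a simplified model, assuming a runner calls startBundle once, then processElement for each element, then finishBundle once. BatchingFnSketch and runBundle are hypothetical and do not use the SDK's DoFn or Context types.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in mirroring the three lifecycle methods; not the SDK's DoFn. */
final class BatchingFnSketch {
    private final List<List<String>> flushed = new ArrayList<>();
    private List<String> buffer; // per-bundle state, created in startBundle

    void startBundle() {
        buffer = new ArrayList<>();
    }

    void processElement(String element) {
        buffer.add(element);
    }

    void finishBundle() {
        // Flush whatever was accumulated for this bundle, then drop the state.
        flushed.add(buffer);
        buffer = null;
    }

    List<List<String>> flushedBatches() {
        return flushed;
    }

    /** Drives one bundle in the order assumed above. */
    static void runBundle(BatchingFnSketch fn, List<String> bundle) {
        fn.startBundle();
        for (String element : bundle) {
            fn.processElement(element);
        }
        fn.finishBundle();
    }
}
```

Keeping the buffer scoped to a single bundle means nothing is carried over between batches, which fits the constraint that a DoFn should not rely on shared mutable state.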
Throws:
Exception

public void populateDisplayData(DisplayData.Builder builder)
Register display data for the given transform or component.
populateDisplayData(DisplayData.Builder) is invoked by Pipeline runners to collect
display data via DisplayData.from(HasDisplayData). Implementations may call
super.populateDisplayData(builder) in order to register display data in the current
namespace, but should otherwise use subcomponent.populateDisplayData(builder) to use
the namespace of the subcomponent.
By default, does not register any display data. Implementors may override this method to provide their own display data.
Specified by:
populateDisplayData in interface HasDisplayData
Parameters:
builder - The builder to populate with display data.
See Also:
HasDisplayData

protected TypeDescriptor<InputT> getInputTypeDescriptor()
Returns a TypeDescriptor capturing what is known statically
about the input type of this DoFn instance's most-derived
class.
See getOutputTypeDescriptor() for more discussion.

protected TypeDescriptor<OutputT> getOutputTypeDescriptor()
Returns a TypeDescriptor capturing what is known statically
about the output type of this DoFn instance's
most-derived class.
In the normal case of a concrete DoFn subclass with
no generic type parameters of its own (including anonymous inner
classes), this will be a complete non-generic type, which is good
for choosing a default output Coder<OutputT> for the output
PCollection<OutputT>.
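The reason a concrete subclass yields a complete type can be seen with plain Java reflection. The sketch below shows the general mechanism only and is not the SDK's TypeDescriptor: FnSketch, LengthFnSketch and typeArgument are hypothetical names, and the sketch inspects only the direct superclass.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;

/** Hypothetical two-parameter base class standing in for a generic function type. */
abstract class FnSketch<InT, OutT> {
    /** Returns the type argument at the given index, read from the subclass declaration. */
    Type typeArgument(int index) {
        // For a direct subclass such as "extends FnSketch<String, Integer>", the
        // generic superclass records the concrete type arguments.
        Type superType = getClass().getGenericSuperclass();
        if (!(superType instanceof ParameterizedType)) {
            throw new IllegalStateException("subclass does not fix its type arguments");
        }
        return ((ParameterizedType) superType).getActualTypeArguments()[index];
    }
}

/** Concrete subclass with no generic type parameters of its own. */
final class LengthFnSketch extends FnSketch<String, Integer> {}
```

Calling new LengthFnSketch().typeArgument(1) yields Integer.class, and an anonymous inner class such as new FnSketch<String, Long>() {} resolves in the same way. A subclass that is itself generic would yield a type variable instead, which is why the complete, non-generic case is the convenient one for choosing a default coder.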

protected final <AggInputT,AggOutputT> Aggregator<AggInputT,AggOutputT> createAggregator(String name, Combine.CombineFn<? super AggInputT,?,AggOutputT> combiner)
Returns an Aggregator with aggregation logic specified by the
Combine.CombineFn argument. The name provided must be unique across
Aggregators created within the DoFn. Aggregators can only be created
during pipeline construction.
Parameters:
name - the name of the aggregator
combiner - the Combine.CombineFn to use in the aggregator
Throws:
NullPointerException - if the name or combiner is null
IllegalArgumentException - if the given name collides with another aggregator in this scope
IllegalStateException - if called during pipeline processing.

protected final <AggInputT> Aggregator<AggInputT,AggInputT> createAggregator(String name, SerializableFunction<Iterable<AggInputT>,AggInputT> combiner)
Returns an Aggregator with the aggregation logic specified by the
SerializableFunction argument. The name provided must be unique
across Aggregators created within the DoFn. Aggregators can only be
created during pipeline construction.
Parameters:
name - the name of the aggregator
combiner - the SerializableFunction to use in the aggregator
Throws:
NullPointerException - if the name or combiner is null
IllegalArgumentException - if the given name collides with another aggregator in this scope
IllegalStateException - if called during pipeline processing.
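The documented contract (non-null arguments, unique names, creation only before processing starts) can be modelled in a few lines of plain Java. This is a sketch of the contract, not of the SDK: AggregatorRegistrySketch and all of its methods are hypothetical, and java.util.function.Function over Iterable<Long> stands in for a SerializableFunction<Iterable<AggInputT>,AggInputT> combiner.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.function.Function;

/** Hypothetical registry modelling the createAggregator contract; not SDK code. */
final class AggregatorRegistrySketch {
    private final Map<String, Function<Iterable<Long>, Long>> combiners = new HashMap<>();
    private boolean processing = false;

    /** Marks the end of construction; later create calls are rejected. */
    void markProcessingStarted() {
        processing = true;
    }

    void create(String name, Function<Iterable<Long>, Long> combiner) {
        Objects.requireNonNull(name, "name");         // NullPointerException
        Objects.requireNonNull(combiner, "combiner"); // NullPointerException
        if (processing) {
            throw new IllegalStateException("aggregators must be created before processing");
        }
        if (combiners.containsKey(name)) {
            throw new IllegalArgumentException("duplicate aggregator name: " + name);
        }
        combiners.put(name, combiner);
    }

    /** Applies the named combiner to a batch of values. */
    long combine(String name, Iterable<Long> values) {
        return combiners.get(name).apply(values);
    }

    /** Example combiner: folds an iterable of longs into their sum. */
    static Long sum(Iterable<Long> values) {
        long total = 0;
        for (long value : values) {
            total += value;
        }
        return total;
    }
}
```

A combiner of this shape takes any number of input values and returns a single value of the same type, which is why the second overload returns an Aggregator whose input and output types are both AggInputT.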