public class IntraBundleParallelization extends Object
DoFns, using threaded execution to
process multiple elements concurrently within a bundle.
Note, that each Dataflow worker will already process multiple bundles concurrently and usage of this class is meant only for cases where processing elements from within a bundle is limited by blocking calls.
CPU intensive or IO intensive tasks are in general a poor fit for parallelization. This is because a limited resource that is already maximally utilized does not benefit from sub-division of work. The parallelization will increase the amount of time to process each element yet the throughput for processing will remain relatively the same. For example, if the local disk (an IO resource) has a maximum write rate of 10 MiB/s, and processing each element requires to write 20 MiBs to disk, then processing one element to disk will take 2 seconds. Yet processing 3 elements concurrently (each getting an equal share of the maximum write rate) will take at least 6 seconds to complete (there is additional overhead in the extra parallelization).
To parallelize a DoFn to 10 threads:
PCollection<T> data = ...;
data.apply(
IntraBundleParallelization.of(new MyDoFn())
.withMaxParallelism(10)));
An uncaught exception from the wrapped DoFn will result in the exception
being rethrown in later calls to IntraBundleParallelization.MultiThreadedIntraBundleProcessingDoFn.processElement(org.apache.beam.sdk.transforms.DoFn<InputT, OutputT>.ProcessContext)
or a call to IntraBundleParallelization.MultiThreadedIntraBundleProcessingDoFn.finishBundle(org.apache.beam.sdk.transforms.DoFn<InputT, OutputT>.Context).
| Modifier and Type | Class and Description |
|---|---|
static class |
IntraBundleParallelization.Bound<InputT,OutputT>
A
PTransform that, when applied to a PCollection<InputT>,
invokes a user-specified DoFn<InputT, OutputT> on all its elements,
with all its outputs collected into an output
PCollection<OutputT>. |
static class |
IntraBundleParallelization.MultiThreadedIntraBundleProcessingDoFn<InputT,OutputT>
A multi-threaded
DoFn wrapper. |
static class |
IntraBundleParallelization.Unbound
An incomplete
IntraBundleParallelization transform, with unbound input/output types. |
| Constructor and Description |
|---|
IntraBundleParallelization() |
| Modifier and Type | Method and Description |
|---|---|
static <InputT,OutputT> |
of(DoFn<InputT,OutputT> doFn)
Creates a
IntraBundleParallelization PTransform for the given
DoFn that processes elements using multiple threads. |
static IntraBundleParallelization.Unbound |
withMaxParallelism(int maxParallelism)
Creates a
IntraBundleParallelization PTransform with the specified
maximum concurrency level. |
public static <InputT,OutputT> IntraBundleParallelization.Bound<InputT,OutputT> of(DoFn<InputT,OutputT> doFn)
IntraBundleParallelization PTransform for the given
DoFn that processes elements using multiple threads.
Note that the specified doFn needs to be thread safe.
public static IntraBundleParallelization.Unbound withMaxParallelism(int maxParallelism)
IntraBundleParallelization PTransform with the specified
maximum concurrency level.