Type Parameters:
K - type of keys
InputT - type of input values
AccumT - type of mutable accumulator values
OutputT - type of output values

public abstract static class Combine.KeyedCombineFn<K,InputT,AccumT,OutputT> extends Object
KeyedCombineFn<K, InputT, AccumT, OutputT> specifies how to combine
a collection of input values of type InputT, associated with
a key of type K, into a single output value of type
OutputT. It does this via one or more intermediate mutable
accumulator values of type AccumT.
The overall process to combine a collection of input
InputT values associated with an input K key into a
single output OutputT value is as follows:
1. The input InputT values are partitioned into one or more
   batches.
2. For each batch, the createAccumulator(K) operation is
   invoked to create a fresh mutable accumulator value of type
   AccumT, initialized to represent the combination of zero
   values.
3. For each input InputT value in a batch, the
   addInput(K, AccumT, InputT) operation is invoked to add the value to that
   batch's accumulator AccumT value. The accumulator may just
   record the new value (e.g., if AccumT == List<InputT>), or may do
   work to represent the combination more compactly.
4. The mergeAccumulators(K, java.lang.Iterable<AccumT>) operation is invoked to
   combine a collection of accumulator AccumT values into a
   single combined output accumulator AccumT value, once the
   merging accumulators have had all the input values in their
   batches added to them. This operation is invoked repeatedly,
   until there is only one accumulator value left.
5. The extractOutput(K, AccumT) operation is invoked on the final
   accumulator AccumT value to get the output OutputT value.
All of these operations are passed the K key that the
values being combined are associated with.
For example:
public class ConcatFn
    extends KeyedCombineFn<String, Integer, ConcatFn.Accum, String> {
  public static class Accum {
    String s = "";
  }
  public Accum createAccumulator(String key) {
    return new Accum();
  }
  public Accum addInput(String key, Accum accum, Integer input) {
    accum.s += "+" + input;
    return accum;
  }
  public Accum mergeAccumulators(String key, Iterable<Accum> accums) {
    Accum merged = new Accum();
    for (Accum accum : accums) {
      merged.s += accum.s;
    }
    return merged;
  }
  public String extractOutput(String key, Accum accum) {
    return key + accum.s;
  }
}

PCollection<KV<String, Integer>> pc = ...;
PCollection<KV<String, String>> pc2 = pc.apply(
    Combine.perKey(new ConcatFn()));
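To make the order of operations concrete, the following is a minimal standalone sketch that replays the ConcatFn logic by hand, without the SDK. The class CombineLifecycleSketch and its combine helper are hypothetical names introduced for illustration, and the batching shown is only one of many a runner might choose.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical standalone sketch: it mirrors the ConcatFn operations as plain
// static methods and does not depend on the SDK classes.
class CombineLifecycleSketch {

    // Same shape as ConcatFn.Accum in the example.
    static class Accum {
        String s = "";
    }

    static Accum createAccumulator(String key) {
        return new Accum();
    }

    static Accum addInput(String key, Accum accum, Integer input) {
        accum.s += "+" + input;
        return accum;
    }

    static Accum mergeAccumulators(String key, Iterable<Accum> accums) {
        Accum merged = new Accum();
        for (Accum accum : accums) {
            merged.s += accum.s;
        }
        return merged;
    }

    static String extractOutput(String key, Accum accum) {
        return key + accum.s;
    }

    // Simulates one possible execution: one fresh accumulator per batch,
    // a single merge of all batch accumulators, then output extraction.
    static String combine(String key, List<List<Integer>> batches) {
        List<Accum> accums = new ArrayList<>();
        for (List<Integer> batch : batches) {
            Accum accum = createAccumulator(key);
            for (Integer input : batch) {
                accum = addInput(key, accum, input);
            }
            accums.add(accum);
        }
        return extractOutput(key, mergeAccumulators(key, accums));
    }

    public static void main(String[] args) {
        List<List<Integer>> batches = Arrays.asList(
            Arrays.asList(1, 2),
            Arrays.asList(3));
        // With this particular batching the result is "a+1+2+3".
        System.out.println(combine("a", batches));
    }
}
```

A real runner may partition the inputs differently and may merge accumulators in several rounds, so the sketch shows only the sequence of calls, not the grouping.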
Keyed combining functions used by Combine.PerKey,
Combine.GroupedValues, and PTransforms derived
from them should be associative and commutative.
Associativity is required because input values are first broken
up into subgroups before being combined, and their intermediate
results further combined, in an arbitrary tree structure.
Commutativity is required because any order of the input values
is ignored when breaking up input values into groups.
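As a rough illustration of why both properties matter, the standalone sketch below (it does not use the SDK; BatchingOrderSketch and its methods are hypothetical names) combines the same values under two different batchings and orders. A sum gives the same answer either way, while an order-sensitive combination such as string concatenation does not.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical standalone sketch comparing a commutative combination (sum)
// with an order-sensitive one (concatenation) under different batchings.
class BatchingOrderSketch {

    // Sum is associative and commutative, so batching and order do not matter.
    static int sumBatches(List<List<Integer>> batches) {
        int merged = 0;                      // plays the role of mergeAccumulators
        for (List<Integer> batch : batches) {
            int accum = 0;                   // plays the role of createAccumulator
            for (int input : batch) {
                accum += input;              // plays the role of addInput
            }
            merged += accum;
        }
        return merged;
    }

    // Concatenation is associative but not commutative, so the result
    // depends on the order in which inputs and batches are visited.
    static String concatBatches(List<List<Integer>> batches) {
        StringBuilder merged = new StringBuilder();
        for (List<Integer> batch : batches) {
            StringBuilder accum = new StringBuilder();
            for (int input : batch) {
                accum.append("+").append(input);
            }
            merged.append(accum);
        }
        return merged.toString();
    }

    public static void main(String[] args) {
        List<List<Integer>> a = Arrays.asList(Arrays.asList(1, 2), Arrays.asList(3));
        List<List<Integer>> b = Arrays.asList(Arrays.asList(3), Arrays.asList(2, 1));

        System.out.println(sumBatches(a) == sumBatches(b));               // prints "true"
        System.out.println(concatBatches(a).equals(concatBatches(b)));    // prints "false"
    }
}
```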
| Constructor and Description |
|---|
| KeyedCombineFn() |
| Modifier and Type | Method | Description |
|---|---|---|
| abstract AccumT | addInput(K key, AccumT accumulator, InputT value) | Adds the given input value to the given accumulator, returning the new accumulator value. |
| OutputT | apply(K key, Iterable<? extends InputT> inputs) | Applies this KeyedCombineFn to a key and a collection of input values to produce a combined output value. |
| AccumT | compact(K key, AccumT accumulator) | Returns an accumulator that represents the same logical value as the input accumulator, but may have a more compact representation. |
| abstract AccumT | createAccumulator(K key) | Returns a new, mutable accumulator value representing the accumulation of zero input values. |
| abstract OutputT | extractOutput(K key, AccumT accumulator) | Returns the output value that is the result of combining all the input values represented by the given accumulator. |
| Combine.CombineFn<InputT,AccumT,OutputT> | forKey(K key, Coder<K> keyCoder) | Returns a regular CombineFnBase.GlobalCombineFn that operates on a specific key. |
| TypeVariable<?> | getAccumTVariable() | Returns the TypeVariable of AccumT. |
| Coder<AccumT> | getAccumulatorCoder(CoderRegistry registry, Coder<K> keyCoder, Coder<InputT> inputCoder) | Returns the Coder to use for accumulator AccumT values, or null if it is not able to be inferred. |
| Coder<OutputT> | getDefaultOutputCoder(CoderRegistry registry, Coder<K> keyCoder, Coder<InputT> inputCoder) | Returns the Coder to use by default for output OutputT values, or null if it is not able to be inferred. |
| TypeVariable<?> | getInputTVariable() | Returns the TypeVariable of InputT. |
| TypeVariable<?> | getKTypeVariable() | Returns the TypeVariable of K. |
| TypeVariable<?> | getOutputTVariable() | Returns the TypeVariable of OutputT. |
| abstract AccumT | mergeAccumulators(K key, Iterable<AccumT> accumulators) | Returns an accumulator representing the accumulation of all the input values accumulated in the merging accumulators. |
| void | populateDisplayData(DisplayData.Builder builder) | Register display data for the given transform or component. |
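public abstract AccumT createAccumulator(K key)
Returns a new, mutable accumulator value representing the accumulation of zero input values.
Parameters:
key - the key that all the accumulated values using the accumulator are associated with

public abstract AccumT addInput(K key, AccumT accumulator, InputT value)
Adds the given input value to the given accumulator, returning the new accumulator value.
For efficiency, the input accumulator may be modified and returned.
Parameters:
key - the key that all the accumulated values using the accumulator are associated with

public abstract AccumT mergeAccumulators(K key, Iterable<AccumT> accumulators)
Returns an accumulator representing the accumulation of all the input values accumulated in the merging accumulators.
May modify any of the argument accumulators. May return a fresh accumulator, or may return one of the (modified) argument accumulators.
Parameters:
key - the key that all the accumulators are associated with

public abstract OutputT extractOutput(K key, AccumT accumulator)
Returns the output value that is the result of combining all the input values represented by the given accumulator.
Parameters:
key - the key that all the accumulated values using the accumulator are associated with

public AccumT compact(K key, AccumT accumulator)
Returns an accumulator that represents the same logical value as the input accumulator, but may have a more compact representation.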
For most CombineFns this would be a no-op, but should be overridden by CombineFns that (for example) buffer up elements and combine them in batches.
For efficiency, the input accumulator may be modified and returned.
By default returns the original accumulator.
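As an illustration of the buffering case mentioned for compact, here is a minimal standalone sketch. It does not use the SDK, and CompactSketch, BufferingAccum, and the unkeyed method signatures are hypothetical simplifications. The accumulator buffers raw inputs, and compact folds the buffer into a running sum so that less state remains.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone sketch of an accumulator that buffers elements
// and combines them in a batch when compacted. Not an SDK class.
class CompactSketch {

    static class BufferingAccum {
        long sum = 0;                                   // already-combined part
        final List<Integer> buffer = new ArrayList<>(); // not-yet-combined inputs
    }

    // Cheap add: only records the value.
    static BufferingAccum addInput(BufferingAccum accum, int input) {
        accum.buffer.add(input);
        return accum;
    }

    // Same logical value afterwards, but the buffer is folded into the sum.
    // The input accumulator is modified and returned.
    static BufferingAccum compact(BufferingAccum accum) {
        for (int value : accum.buffer) {
            accum.sum += value;
        }
        accum.buffer.clear();
        return accum;
    }

    static long extractOutput(BufferingAccum accum) {
        return compact(accum).sum;
    }

    public static void main(String[] args) {
        BufferingAccum accum = new BufferingAccum();
        addInput(accum, 1);
        addInput(accum, 2);
        addInput(accum, 3);
        System.out.println(accum.buffer.size());   // prints "3"
        compact(accum);
        System.out.println(accum.buffer.size());   // prints "0"
        System.out.println(extractOutput(accum));  // prints "6"
    }
}
```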

public Combine.CombineFn<InputT,AccumT,OutputT> forKey(K key, Coder<K> keyCoder)
Description copied from interface: CombineFnBase.PerKeyCombineFn
Returns a regular CombineFnBase.GlobalCombineFn that operates on a specific key.

public OutputT apply(K key, Iterable<? extends InputT> inputs)
Applies this KeyedCombineFn to a key and a collection
of input values to produce a combined output value.
Useful when testing the behavior of a KeyedCombineFn
separately from a Combine transform.
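The sketch below shows the kind of direct, transform-free check this enables. It is a standalone approximation, not the SDK code: the KeyedFn interface, the SumFn class, and the apply helper are hypothetical, and the helper assumes that applying a function directly amounts to folding every input into a single accumulator and extracting the output.

```java
import java.util.Arrays;

// Hypothetical standalone sketch of testing a keyed combining function
// directly, without building a pipeline. Not an SDK class.
class ApplySketch {

    // Stand-in for the operations the helper needs. Merging is omitted
    // because this sketch uses a single accumulator (an assumption).
    interface KeyedFn<K, I, A, O> {
        A createAccumulator(K key);
        A addInput(K key, A accum, I input);
        O extractOutput(K key, A accum);
    }

    // Assumed behavior: create one accumulator, add every input, extract.
    static <K, I, A, O> O apply(KeyedFn<K, I, A, O> fn, K key, Iterable<? extends I> inputs) {
        A accum = fn.createAccumulator(key);
        for (I input : inputs) {
            accum = fn.addInput(key, accum, input);
        }
        return fn.extractOutput(key, accum);
    }

    // Example function: sums the inputs and prefixes the result with the key.
    static class SumFn implements KeyedFn<String, Integer, int[], String> {
        public int[] createAccumulator(String key) {
            return new int[] {0};
        }
        public int[] addInput(String key, int[] accum, Integer input) {
            accum[0] += input;
            return accum;
        }
        public String extractOutput(String key, int[] accum) {
            return key + "=" + accum[0];
        }
    }

    public static void main(String[] args) {
        System.out.println(apply(new SumFn(), "a", Arrays.asList(1, 2, 3))); // prints "a=6"
    }
}
```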

public Coder<AccumT> getAccumulatorCoder(CoderRegistry registry, Coder<K> keyCoder, Coder<InputT> inputCoder) throws CannotProvideCoderException
Description copied from interface: CombineFnBase.PerKeyCombineFn
Returns the Coder to use for accumulator AccumT
values, or null if it is not able to be inferred.
By default, uses the knowledge of the Coder being
used for K keys and input InputT values and the
enclosing Pipeline's CoderRegistry to try to
infer the Coder for AccumT values.
This is the Coder used to send data through a communication-intensive shuffle step, so a compact and efficient representation may have significant performance benefits.
Specified by: getAccumulatorCoder in interface CombineFnBase.PerKeyCombineFn<K,InputT,AccumT,OutputT>
Throws: CannotProvideCoderException

public Coder<OutputT> getDefaultOutputCoder(CoderRegistry registry, Coder<K> keyCoder, Coder<InputT> inputCoder) throws CannotProvideCoderException
Description copied from interface: CombineFnBase.PerKeyCombineFn
Returns the Coder to use by default for output
OutputT values, or null if it is not able to be inferred.
By default, uses the knowledge of the Coder being
used for K keys and input InputT values and the
enclosing Pipeline's CoderRegistry to try to
infer the Coder for OutputT values.
Specified by: getDefaultOutputCoder in interface CombineFnBase.PerKeyCombineFn<K,InputT,AccumT,OutputT>
Throws: CannotProvideCoderException

public TypeVariable<?> getKTypeVariable()
Returns the TypeVariable of K.

public TypeVariable<?> getInputTVariable()
Returns the TypeVariable of InputT.

public TypeVariable<?> getAccumTVariable()
Returns the TypeVariable of AccumT.

public TypeVariable<?> getOutputTVariable()
Returns the TypeVariable of OutputT.

public void populateDisplayData(DisplayData.Builder builder)
Register display data for the given transform or component.
populateDisplayData(DisplayData.Builder) is invoked by Pipeline runners to collect
display data via DisplayData.from(HasDisplayData). Implementations may call
super.populateDisplayData(builder) in order to register display data in the current
namespace, but should otherwise use subcomponent.populateDisplayData(builder) to use
the namespace of the subcomponent.
By default, does not register any display data. Implementors may override this method to provide their own display data.
Specified by: populateDisplayData in interface HasDisplayData
Parameters:
builder - The builder to populate with display data.
See Also: HasDisplayData