Class KinesisIO
- java.lang.Object
- org.apache.beam.sdk.io.aws2.kinesis.KinesisIO

@Experimental(SOURCE_SINK)
public final class KinesisIO
extends java.lang.Object

IO to read from and write to Kinesis streams.

Reading from Kinesis
Example usages:
    p.apply(KinesisIO.read()
        .withStreamName("streamName")
        .withInitialPositionInStream(InitialPositionInStream.LATEST))
     .apply( ... ) // other transformations

At a minimum you have to provide:
- the name of the stream to read
- the position in the stream where to start reading, e.g. InitialPositionInStream.LATEST, InitialPositionInStream.TRIM_HORIZON, or alternatively an arbitrary point in time using KinesisIO.Read.withInitialTimestampInStream(Instant).
Watermarks
Kinesis IO uses arrival time for watermarks by default. To use processing time instead, use KinesisIO.Read.withProcessingTimeWatermarkPolicy():

    p.apply(KinesisIO.read()
        .withStreamName("streamName")
        .withInitialPositionInStream(InitialPositionInStream.LATEST)
        .withProcessingTimeWatermarkPolicy())

It is also possible to specify a custom watermark policy to control watermark computation using KinesisIO.Read.withCustomWatermarkPolicy(WatermarkPolicyFactory). This requires implementing WatermarkPolicy with a corresponding WatermarkPolicyFactory.

Throttling
By default Kinesis IO will poll the Kinesis getRecords() API as fast as possible as long as records are returned. The RateLimitPolicyFactory.DefaultRateLimiter will start throttling once getRecords() returns an empty response or if API calls get throttled by AWS.

A RateLimitPolicy is always applied to each shard individually.

You may provide a custom rate limit policy using KinesisIO.Read.withCustomRateLimitPolicy(RateLimitPolicyFactory). This requires implementing RateLimitPolicy with a corresponding RateLimitPolicyFactory.

Writing to Kinesis
Example usages:
    PCollection<KV<String, byte[]>> data = ...;

    data.apply(KinesisIO.write()
        .withStreamName("streamName")
        .withPartitionKey(KV::getKey)
        .withSerializer(KV::getValue));

Note: Usage of KV is just for illustration purposes here.

At a minimum you have to provide:
- the name of the Kinesis stream to write to,
- a KinesisPartitioner to distribute records across shards of the stream
- and a function to serialize your data to bytes on the stream

ClientConfiguration, see below.

Partitioning of writes
Choosing the right partitioning strategy by means of a KinesisPartitioner is one of the key considerations when writing to Kinesis. Typically, you should aim to evenly distribute data across all shards of the stream.

Partition keys are used as input to a hash function that maps the partition key and associated data to a specific shard. If the cardinality of your partition keys is of the same order of magnitude as the number of shards in the stream, the hash function will likely not distribute your keys evenly among shards. This may result in heavily skewed shards, with some shards not utilized at all.
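The skew described above can be reproduced with a small stand-alone sketch. Kinesis maps partition keys to 128-bit integers using MD5; everything else here is an assumption made for illustration: the class PartitionKeySkew and its methods are hypothetical (they are not part of Beam or the AWS SDK), and the shards are assumed to cover equally sized hash key ranges.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of Beam or the AWS SDK. */
class PartitionKeySkew {
  /** Size of the hash key space: 2^128. */
  public static final BigInteger KEY_SPACE = BigInteger.ONE.shiftLeft(128);

  /** Hashes a partition key to an unsigned 128-bit integer using MD5. */
  public static BigInteger hashKey(String partitionKey) {
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      byte[] digest = md5.digest(partitionKey.getBytes(StandardCharsets.UTF_8));
      return new BigInteger(1, digest); // signum 1: interpret the bytes as unsigned
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 not available", e);
    }
  }

  /** Index of the shard owning the key, ASSUMING shardCount equally sized ranges. */
  public static int shardIndex(String partitionKey, int shardCount) {
    BigInteger rangeSize = KEY_SPACE.divide(BigInteger.valueOf(shardCount));
    int index = hashKey(partitionKey).divide(rangeSize).intValue();
    return Math.min(index, shardCount - 1); // last shard absorbs the rounding remainder
  }

  /** Counts how many of the given keys land on each shard. */
  public static int[] distribution(String[] keys, int shardCount) {
    int[] counts = new int[shardCount];
    for (String key : keys) {
      counts[shardIndex(key, shardCount)]++;
    }
    return counts;
  }

  public static void main(String[] args) {
    // As many distinct keys as shards: an even spread is not guaranteed.
    String[] keys = {"a", "b", "c", "d"};
    System.out.println(Arrays.toString(distribution(keys, 4)));
  }
}
```

Running the sketch with only a handful of distinct keys typically leaves some counters at zero, which is the situation the paragraph above warns about. A key space that is much larger than the number of shards evens this out.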
If you require finer control over the distribution of records, override KinesisPartitioner.getExplicitHashKey(Object) according to your needs. However, this might impact record aggregation.

Aggregation of records
To better leverage Kinesis API limits and to improve producer throughput, the writer aggregates multiple user records into an aggregated KPL record.

Records of the same effective hash key get aggregated. The effective hash key is:
1. the explicit hash key, if provided,
2. the lower bound of the hash key range of the target shard according to the given partition key, if available,
3. or otherwise the hashed partition key.
To provide shard aware aggregation in 2., hash key ranges of shards are loaded and refreshed periodically. This allows records to be aggregated into a number of aggregates that matches the number of shards in the stream, making the best possible use of Kinesis API limits.
Note: There's an important downside to consider when using shard aware aggregation: records get assigned to a shard (via an explicit hash key) on the client side, but the respective client side state can't be guaranteed to always be up-to-date. If a shard gets split, all aggregates are mapped to the lower child shard until state is refreshed. The timing of that refresh, however, will differ between workers.
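The selection of the effective hash key described above can be sketched as follows. This is an illustration under stated assumptions, not the writer's actual implementation: the class EffectiveHashKey and its methods are hypothetical, the partition key is assumed to be hashed with MD5 (as Kinesis does), and the loaded shard state is modeled as a sorted set of the lower bounds of the shards' hash key ranges.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.NavigableSet;
import java.util.TreeSet;

/** Hypothetical sketch; class and method names are not Beam APIs. */
class EffectiveHashKey {
  /** Hashes a partition key to an unsigned 128-bit integer using MD5. */
  public static BigInteger md5(String partitionKey) {
    try {
      MessageDigest md = MessageDigest.getInstance("MD5");
      return new BigInteger(1, md.digest(partitionKey.getBytes(StandardCharsets.UTF_8)));
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 not available", e);
    }
  }

  /**
   * Picks the key that decides which records are aggregated together.
   *
   * @param explicitHashKey explicit hash key, or null if none was provided
   * @param shardLowerBounds lower bounds of the known shard ranges; empty if not loaded
   */
  public static BigInteger effectiveHashKey(
      String partitionKey, BigInteger explicitHashKey, NavigableSet<BigInteger> shardLowerBounds) {
    if (explicitHashKey != null) {
      return explicitHashKey; // case 1: explicit hash key wins
    }
    BigInteger hashed = md5(partitionKey);
    // Greatest known lower bound <= hashed, i.e. the range the key falls into.
    BigInteger lowerBound = shardLowerBounds.floor(hashed);
    if (lowerBound != null) {
      return lowerBound; // case 2: shard aware aggregation
    }
    return hashed; // case 3: no shard details available
  }

  public static void main(String[] args) {
    // ASSUMED shard layout: two shards splitting the key space in half.
    NavigableSet<BigInteger> bounds = new TreeSet<>();
    bounds.add(BigInteger.ZERO);
    bounds.add(BigInteger.ONE.shiftLeft(127));
    for (String key : new String[] {"a", "b"}) {
      System.out.println(key + " -> " + effectiveHashKey(key, null, bounds).toString(16));
    }
  }
}
```

In this sketch all records that hash into the same shard range share one effective hash key, so the number of aggregates matches the number of known shards. It also shows why stale state maps to the lower child after a split: the lower bound of the parent's range falls into the lower child's range.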
If using a KinesisPartitioner.ExplicitPartitioner or disabling shard refresh via KinesisIO.RecordAggregation, no shard details will be loaded (and used).

Record aggregation can be entirely disabled using KinesisIO.Write.withRecordAggregationDisabled().

Configuration of AWS clients
AWS clients for all AWS IOs can be configured using AwsOptions, e.g. --awsRegion=us-west-1. AwsOptions contain reasonable defaults based on default providers for Region and AwsCredentialsProvider.

If you require more advanced configuration, you may change the ClientBuilderFactory using AwsOptions.setClientBuilderFactory(Class).

Configuration for a specific IO can be overwritten using withClientConfiguration(), which also allows configuring the retry behavior for the respective IO.

Retries
Retries for failed requests can be configured using ClientConfiguration.Builder.retry(Consumer) and are handled by the AWS SDK unless there's a partial success (batch requests). The SDK uses a backoff strategy with equal jitter for computing the delay before the next retry.

Note: Once retries are exhausted the error is surfaced to the runner, which may then opt to retry the current partition in its entirety or abort if the runner's maximum number of retries is reached.
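To give an idea of what a backoff strategy with equal jitter computes, here is a minimal stand-alone sketch. It follows the commonly described scheme (half of an exponentially growing, capped ceiling plus a random share of the other half). The class EqualJitterBackoff is hypothetical and is not the AWS SDK implementation; the base delay of 100 ms and the cap of 20 s used in main are assumed values chosen for illustration only.

```java
import java.util.Random;

/** Hypothetical sketch of equal-jitter backoff; not the AWS SDK implementation. */
class EqualJitterBackoff {
  private final long baseDelayMillis;
  private final long maxDelayMillis;
  private final Random random;

  public EqualJitterBackoff(long baseDelayMillis, long maxDelayMillis, Random random) {
    this.baseDelayMillis = baseDelayMillis;
    this.maxDelayMillis = maxDelayMillis;
    this.random = random;
  }

  /** Upper bound of the delay: base * 2^retriesAttempted, capped at the maximum. */
  public long ceiling(int retriesAttempted) {
    // Clamp the exponent so the shift stays small for typical base delays.
    int shift = Math.max(0, Math.min(retriesAttempted, 30));
    return Math.min(maxDelayMillis, baseDelayMillis * (1L << shift));
  }

  /** Delay before the next retry, always within [ceiling / 2, ceiling]. */
  public long delayMillis(int retriesAttempted) {
    long ceiling = ceiling(retriesAttempted);
    long half = ceiling / 2;
    // Equal jitter: keep one half fixed, randomize over the remaining part.
    long jitter = (long) (random.nextDouble() * (ceiling - half + 1));
    return half + jitter;
  }

  public static void main(String[] args) {
    // ASSUMED settings: 100 ms base delay, 20 s cap.
    EqualJitterBackoff backoff = new EqualJitterBackoff(100, 20_000, new Random());
    for (int attempt = 0; attempt < 5; attempt++) {
      System.out.println("retry " + attempt + ": wait " + backoff.delayMillis(attempt) + " ms");
    }
  }
}
```

Keeping half of the ceiling fixed guarantees a minimum wait between attempts, while the random half spreads out retries from workers that failed at the same time.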
Nested Class Summary
Nested Classes
Modifier and Type | Class | Description
static class | KinesisIO.Read | Implementation of read().
static class | KinesisIO.RecordAggregation | Configuration of Kinesis record aggregation.
static class | KinesisIO.Write<T> | Implementation of write().
Constructor Summary
Constructors
Constructor | Description
KinesisIO() |
Method Summary
Modifier and Type | Method | Description
static KinesisIO.Read | read() | Returns a new KinesisIO.Read transform for reading from Kinesis.
static <T> KinesisIO.Write<T> | write() | Returns a new KinesisIO.Write transform for writing to Kinesis.
Method Detail
read
public static KinesisIO.Read read()
Returns a new KinesisIO.Read transform for reading from Kinesis.

write
public static <T> KinesisIO.Write<T> write()
Returns a new KinesisIO.Write transform for writing to Kinesis.