public class KafkaIO extends Object
The KafkaIO source returns an unbounded collection of Kafka records as
PCollection<KafkaRecord<K, V>>. A KafkaRecord includes basic
metadata such as the topic-partition and offset, along with the key and value
associated with a Kafka record.
Although most applications consume a single topic, the source can be configured to consume
multiple topics or even a specific set of TopicPartitions.
To configure a Kafka source, you must specify at a minimum the Kafka bootstrapServers and one or
more topics to consume. The following example illustrates various options for configuring the source:
pipeline
  .apply(KafkaIO.read()
     .withBootstrapServers("broker_1:9092,broker_2:9092")
     .withTopics(ImmutableList.of("topic_a", "topic_b"))
     // The two settings above are required. Returns PCollection<KafkaRecord<byte[], byte[]>>.
     // The remaining settings are optional:

     // set a Coder for the key and the value (note the change in the return type)
     .withKeyCoder(BigEndianLongCoder.of())  // PCollection<KafkaRecord<Long, byte[]>>
     .withValueCoder(StringUtf8Coder.of())   // PCollection<KafkaRecord<Long, String>>

     // you can further customize the KafkaConsumer used to read the records by adding more
     // settings for ConsumerConfig, e.g.:
     .updateConsumerProperties(ImmutableMap.of("receive.buffer.bytes", 1024 * 1024))

     // custom function for calculating record timestamp (default is processing time)
     .withTimestampFn(new MyTimestampFunction())

     // custom function for watermark (default is record timestamp)
     .withWatermarkFn(new MyWatermarkFunction())

     // finally, if you don't need Kafka metadata, you can drop it
     .withoutMetadata() // PCollection<KV<Long, String>>
  )
  .apply(Values.<String>create()) // PCollection<String>
  ...
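The timestamp and watermark functions used above are placeholder names. As a minimal sketch of a timestamp function, assuming withTimestampFn accepts a Beam SerializableFunction from the decoded key-value pair to a Joda-Time Instant, and assuming (purely for illustration) that each record value begins with an epoch-millisecond timestamp:

```java
// Sketch only: assumes the value string starts with an epoch-millisecond
// timestamp followed by a comma, e.g. "1457130000000,payload".
class MyTimestampFunction
    implements SerializableFunction<KV<Long, String>, Instant> {
  @Override
  public Instant apply(KV<Long, String> kv) {
    String value = kv.getValue();
    long millis = Long.parseLong(value.substring(0, value.indexOf(',')));
    return new Instant(millis); // org.joda.time.Instant, used by Beam
  }
}
```

A watermark function can be defined the same way; by default the watermark simply tracks the record timestamps.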
See UnboundedKafkaSource#generateInitialSplits(int, PipelineOptions) for more details on
splits and checkpoint support.
When the pipeline starts for the first time without any checkpoint, the source starts
consuming from the latest offsets. You can override this behavior to consume from the
beginning by setting the appropriate properties in ConsumerConfig, through
KafkaIO.Read.updateConsumerProperties(Map).
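For example, Kafka's standard "auto.offset.reset" consumer property, set to "earliest", makes a consumer without committed offsets start from the oldest available records. A sketch (the broker addresses and topic are illustrative):

```java
pipeline
  .apply(KafkaIO.read()
     .withBootstrapServers("broker_1:9092,broker_2:9092")
     .withTopics(ImmutableList.of("topic_a"))
     // "auto.offset.reset" is a standard Kafka ConsumerConfig key; "earliest"
     // starts a consumer without committed offsets at the oldest records.
     .updateConsumerProperties(
         ImmutableMap.<String, Object>of("auto.offset.reset", "earliest")))
```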
To write to a Kafka topic, configure the sink with bootstrap servers and a topic, along with
Coders for the key and the value:

pipeline
  .apply(...) // returns PCollection<KV<Long, String>>
  .apply(KafkaIO.write()
     .withBootstrapServers("broker_1:9092,broker_2:9092")
     .withTopic("results")
     // set a Coder for the key and the value
     .withKeyCoder(BigEndianLongCoder.of())
     .withValueCoder(StringUtf8Coder.of())
     // you can further customize the KafkaProducer used to write the records by adding more
     // settings for ProducerConfig, e.g., to enable compression:
     .updateProducerProperties(ImmutableMap.of("compression.type", "gzip"))
  );
Often you might want to write just the values, without any keys, to Kafka. Use values() to
write records with a default empty (null) key:

PCollection<String> strings = ...;
strings.apply(KafkaIO.write()
   .withBootstrapServers("broker_1:9092,broker_2:9092")
   .withTopic("results")
   .withValueCoder(StringUtf8Coder.of()) // just need a coder for the value
   .values() // writes values to Kafka with a default key
);
Most of the Kafka consumer and producer settings can be provided in
ConsumerConfig for the source or in ProducerConfig for the sink. For example, if you would
like to enable offset auto-commit (for external monitoring or other purposes), you can set
"group.id", "enable.auto.commit", etc.

| Modifier and Type | Class and Description |
|---|---|
| static class | KafkaIO.CoderBasedKafkaSerializer<T>: Implements Kafka's Serializer with a Coder. |
| static class | KafkaIO.Read<K,V>: A PTransform to read from Kafka topics. |
| static class | KafkaIO.TypedRead<K,V>: A PTransform to read from Kafka topics. |
| static class | KafkaIO.TypedWithoutMetadata<K,V>: A PTransform to read from Kafka topics. |
| static class | KafkaIO.TypedWrite<K,V>: A PTransform to write to a Kafka topic. |
| static class | KafkaIO.Write<K,V>: A PTransform to write to a Kafka topic. |
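KafkaIO.CoderBasedKafkaSerializer adapts a Beam Coder to Kafka's Serializer interface, so that coded keys and values reach the KafkaProducer as byte arrays. As a standalone illustration (this is not the actual class), the following mimics the 8-byte big-endian encoding that a coder such as BigEndianLongCoder produces for a Long key:

```java
import java.nio.ByteBuffer;

public class BigEndianLongDemo {
    // Encode a long as 8 big-endian bytes, mirroring what a big-endian
    // long coder would hand to Kafka as the serialized key.
    public static byte[] encode(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    // Decode the 8 big-endian bytes back into a long.
    public static long decode(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getLong();
    }

    public static void main(String[] args) {
        byte[] encoded = encode(42L);
        System.out.println(encoded.length);   // 8
        System.out.println(decode(encoded));  // 42
    }
}
```

Because both sides agree on this fixed-width big-endian layout, a matching deserializer on the consumer side can recover the original key exactly.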
| Modifier and Type | Method and Description |
|---|---|
| static KafkaIO.Read<byte[],byte[]> | read(): Creates an uninitialized KafkaIO.Read PTransform. |
| static KafkaIO.Write<byte[],byte[]> | write(): Creates an uninitialized KafkaIO.Write PTransform. |
public static KafkaIO.Read<byte[],byte[]> read()

Creates an uninitialized KafkaIO.Read PTransform. Before use, basic Kafka
configuration should be set with KafkaIO.Read.withBootstrapServers(String) and
KafkaIO.Read.withTopics(List). Other optional settings include key and value coders,
and custom timestamp and watermark functions.

public static KafkaIO.Write<byte[],byte[]> write()

Creates an uninitialized KafkaIO.Write PTransform. Before use, Kafka configuration
should be set with KafkaIO.Write.withBootstrapServers(String) and KafkaIO.Write.withTopic(java.lang.String),
along with Coders for the (optional) key and the values.

Copyright © 2016 The Apache Software Foundation. All rights reserved.