Type Parameters:
T - Type of the elements emitted by this sink

public class BucketingSink<T>
extends org.apache.flink.streaming.api.functions.sink.RichSinkFunction<T>
implements org.apache.flink.api.java.typeutils.InputTypeConfigurable, org.apache.flink.streaming.api.checkpoint.CheckpointedFunction, org.apache.flink.runtime.state.CheckpointListener, org.apache.flink.streaming.runtime.tasks.ProcessingTimeCallback
Sink that emits its input elements to FileSystem files within
buckets. This is integrated with the checkpointing mechanism to provide exactly-once semantics.
When creating the sink a basePath must be specified. The base directory contains
one directory for every bucket. The bucket directories themselves contain several part files,
one for each parallel subtask of the sink. These part files contain the actual output data.
The sink uses a Bucketer to determine the bucket directory inside the base
directory to which each element is written. The Bucketer can, for example, use time or
a property of the element to determine the bucket directory. The default Bucketer is a
DateTimeBucketer which will create one new bucket every hour. You can specify
a custom Bucketer using setBucketer(Bucketer). For example, use the
BasePathBucketer if you don't want to have buckets but still want to write part-files
in a fault-tolerant way.
The filenames of the part files contain the part prefix, the parallel subtask index of the sink
and a rolling counter. For example the file "part-1-17" contains the data from
subtask 1 of the sink and is the 17th bucket created by that subtask. By default
the part prefix is "part", but this can be configured using setPartPrefix(String).
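The naming scheme above can be sketched as follows; `PartFileNaming` is a hypothetical helper used only for illustration, not part of the Flink API:

```java
// Illustrative sketch of the part-file naming scheme described above:
// "<partPrefix>-<subtaskIndex>-<rollingCounter>", e.g. "part-1-17".
// PartFileNaming is a hypothetical helper, not part of the Flink API.
class PartFileNaming {
    static String partFileName(String partPrefix, int subtaskIndex, long rollingCounter) {
        return partPrefix + "-" + subtaskIndex + "-" + rollingCounter;
    }
}
```

With the defaults, `partFileName("part", 1, 17)` yields `"part-1-17"`, the example from the text.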
When a part file becomes bigger than the user-specified batch size, or older
than the user-specified rollover interval, the current part file is closed, the part counter is increased,
and a new part file is created. The batch size defaults to 384 MB and can be configured
using setBatchSize(long). The rollover interval defaults to Long.MAX_VALUE and
can be configured using setBatchRolloverInterval(long).
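The two rolling conditions can be summarized in a minimal sketch; `RollPolicy` is an illustrative stand-in, not the sink's internal code:

```java
// Illustrative sketch of the two rolling conditions described above:
// size-based (setBatchSize) and age-based (setBatchRolloverInterval).
// RollPolicy is a hypothetical helper, not part of the Flink API.
class RollPolicy {
    static final long DEFAULT_BATCH_SIZE = 384L * 1024 * 1024; // 384 MB default
    static final long DEFAULT_ROLLOVER_INTERVAL = Long.MAX_VALUE; // in ms

    final long batchSize;        // bytes, as set via setBatchSize(long)
    final long rolloverInterval; // ms, as set via setBatchRolloverInterval(long)

    RollPolicy(long batchSize, long rolloverInterval) {
        this.batchSize = batchSize;
        this.rolloverInterval = rolloverInterval;
    }

    // The current part file is closed and a new one started
    // when either limit is exceeded.
    boolean shouldRoll(long currentFileSizeBytes, long fileAgeMs) {
        return currentFileSizeBytes > batchSize || fileAgeMs > rolloverInterval;
    }
}
```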
In some scenarios, the open buckets need to change based on time. In these cases, the sink needs to determine when a bucket has become inactive, in order to flush and close the part file. To support this there are two configurable settings:
the interval at which to check for inactive buckets, configured by setInactiveBucketCheckInterval(long),
and the minimum amount of time a bucket must go without writes before it is considered inactive,
configured by setInactiveBucketThreshold(long). Both default to 60,000 ms, or 1 min.
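The inactivity decision itself reduces to a simple time comparison. The sketch below is a hypothetical illustration of that check (the `>=` boundary is an assumption, not taken from the sink's source):

```java
// Illustrative sketch of the inactivity check described above. The sink wakes
// up at the configured check interval (setInactiveBucketCheckInterval) and
// flushes/closes buckets that have gone without writes for at least the
// threshold (setInactiveBucketThreshold). Hypothetical helper, not Flink API.
class InactivityCheck {
    static boolean isInactive(long lastWriteTimeMs, long nowMs, long thresholdMs) {
        return nowMs - lastWriteTimeMs >= thresholdMs;
    }
}
```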
Part files can be in one of three states: in-progress, pending or finished.
The reason for this is how the sink works together with the checkpointing mechanism to provide exactly-once
semantics and fault-tolerance. The part file that is currently being written to is in-progress. Once
a part file is closed for writing it becomes pending. When a checkpoint is successful the currently
pending files will be moved to finished.
In case of a failure, and in order to guarantee exactly-once semantics, the sink should roll back to the state it
had when that last successful checkpoint occurred. To this end, when restoring, the restored files in pending
state are transferred into the finished state while any in-progress files are rolled back, so that
they do not contain data that arrived after the checkpoint from which we restore. If the FileSystem supports
the truncate() method this will be used to reset the file back to its previous state. If not, a special
file with the same name as the part file and the suffix ".valid-length" will be created that contains the
length up to which the file contains valid data. When reading the file, it must be ensured that it is only read up
to that point. The prefixes and suffixes for the different file states and valid-length files can be configured
using the appropriate setter method, e.g. setPendingSuffix(String).
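A reader honoring valid-length files might look like the following sketch. The class name and the exact companion-file naming (the part file name plus the ".valid-length" suffix, ignoring any configured prefix) are simplifying assumptions for illustration:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical reader-side sketch: if a companion ".valid-length" file exists,
// only the first N bytes of the part file contain valid data, so reads must
// stop there. ValidLengthReader is not part of the Flink API.
class ValidLengthReader {
    static byte[] readValidBytes(Path partFile) throws IOException {
        byte[] all = Files.readAllBytes(partFile);
        Path validLengthFile =
            partFile.resolveSibling(partFile.getFileName() + ".valid-length");
        if (Files.exists(validLengthFile)) {
            // The companion file contains the valid length as a number.
            long validLength = Long.parseLong(
                new String(Files.readAllBytes(validLengthFile),
                           StandardCharsets.UTF_8).trim());
            return Arrays.copyOf(all, (int) Math.min(validLength, all.length));
        }
        return all; // no valid-length file: the whole part file is valid
    }
}
```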
NOTE: If checkpointing is not enabled, the pending files will never be moved to the finished state.
In that case, the pending suffix/prefix can be set to "" to make the sink work in a non-fault-tolerant way but
still provide output without prefixes and suffixes.
The part files are written using an instance of Writer. By default, a
StringWriter is used, which writes the result of toString() for
every element, separated by newlines. You can configure the writer using
setWriter(Writer). For example, SequenceFileWriter
can be used to write Hadoop SequenceFiles.
closePartFilesByTime(long) closes buckets that have not been written to for
inactiveBucketThreshold or if they are older than batchRolloverInterval.
Example:
new BucketingSink<Tuple2<IntWritable, Text>>(outPath)
    .setWriter(new SequenceFileWriter<IntWritable, Text>())
    .setBucketer(new DateTimeBucketer("yyyy-MM-dd--HHmm"));
This will create a sink that writes to SequenceFiles and rolls every minute.
| Constructor and Description |
|---|
| BucketingSink(String basePath) Creates a new BucketingSink that writes files to the given base directory. |
| Modifier and Type | Method and Description |
|---|---|
| void | close() |
| static org.apache.hadoop.fs.FileSystem | createHadoopFileSystem(org.apache.hadoop.fs.Path path, org.apache.flink.configuration.Configuration extraUserConf) |
| BucketingSink<T> | disableCleanupOnOpen() Deprecated. This option is deprecated and remains only for backwards compatibility. We do not clean up lingering files anymore. |
| org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.State<T> | getState() |
| void | initializeState(org.apache.flink.runtime.state.FunctionInitializationContext context) |
| void | invoke(T value) |
| void | notifyCheckpointComplete(long checkpointId) |
| void | onProcessingTime(long timestamp) |
| void | open(org.apache.flink.configuration.Configuration parameters) |
| BucketingSink<T> | setAsyncTimeout(long timeout) Sets the default timeout for asynchronous operations such as recoverLease and truncate. |
| BucketingSink<T> | setBatchRolloverInterval(long batchRolloverInterval) Sets the roll over interval in milliseconds. |
| BucketingSink<T> | setBatchSize(long batchSize) Sets the maximum bucket size in bytes. |
| BucketingSink<T> | setBucketer(Bucketer<T> bucketer) Sets the Bucketer to use for determining the bucket files to write to. |
| BucketingSink<T> | setFSConfig(org.apache.flink.configuration.Configuration config) Specify a custom Configuration that will be used when creating the FileSystem for writing. |
| BucketingSink<T> | setFSConfig(org.apache.hadoop.conf.Configuration config) Specify a custom Configuration that will be used when creating the FileSystem for writing. |
| BucketingSink<T> | setInactiveBucketCheckInterval(long interval) Sets the default time between checks for inactive buckets. |
| BucketingSink<T> | setInactiveBucketThreshold(long threshold) Sets the default threshold for marking a bucket as inactive and closing its part files. |
| BucketingSink<T> | setInProgressPrefix(String inProgressPrefix) Sets the prefix of in-progress part files. |
| BucketingSink<T> | setInProgressSuffix(String inProgressSuffix) Sets the suffix of in-progress part files. |
| void | setInputType(org.apache.flink.api.common.typeinfo.TypeInformation<?> type, org.apache.flink.api.common.ExecutionConfig executionConfig) |
| BucketingSink<T> | setPartPrefix(String partPrefix) Sets the prefix of part files. |
| BucketingSink<T> | setPartSuffix(String partSuffix) Sets the suffix of part files. |
| BucketingSink<T> | setPendingPrefix(String pendingPrefix) Sets the prefix of pending part files. |
| BucketingSink<T> | setPendingSuffix(String pendingSuffix) Sets the suffix of pending part files. |
| BucketingSink<T> | setUseTruncate(boolean useTruncate) Sets whether to use FileSystem.truncate() to truncate written bucket files back to a consistent state in case of a restore from checkpoint. |
| BucketingSink<T> | setValidLengthPrefix(String validLengthPrefix) Sets the prefix of valid-length files. |
| BucketingSink<T> | setValidLengthSuffix(String validLengthSuffix) Sets the suffix of valid-length files. |
| BucketingSink<T> | setWriter(Writer<T> writer) Sets the Writer to be used for writing the incoming elements to bucket files. |
| void | snapshotState(org.apache.flink.runtime.state.FunctionSnapshotContext context) |
Methods inherited from class org.apache.flink.api.common.functions.AbstractRichFunction:
getIterationRuntimeContext, getRuntimeContext, setRuntimeContext

public BucketingSink(String basePath)
Creates a new BucketingSink that writes files to the given base directory.
This uses a DateTimeBucketer as bucketer and a StringWriter as writer.
The maximum bucket size is set to 384 MB.
Parameters:
basePath - The directory to which to write the bucket files.

public BucketingSink<T> setFSConfig(org.apache.flink.configuration.Configuration config)
Specify a custom Configuration that will be used when creating
the FileSystem for writing.

public BucketingSink<T> setFSConfig(org.apache.hadoop.conf.Configuration config)
Specify a custom Configuration that will be used when creating
the FileSystem for writing.

public void setInputType(org.apache.flink.api.common.typeinfo.TypeInformation<?> type,
org.apache.flink.api.common.ExecutionConfig executionConfig)
Specified by: setInputType in interface org.apache.flink.api.java.typeutils.InputTypeConfigurable

public void initializeState(org.apache.flink.runtime.state.FunctionInitializationContext context)
throws Exception
Specified by: initializeState in interface org.apache.flink.streaming.api.checkpoint.CheckpointedFunction
Throws: Exception

public void open(org.apache.flink.configuration.Configuration parameters)
throws Exception
Specified by: open in interface org.apache.flink.api.common.functions.RichFunction
Overrides: open in class org.apache.flink.api.common.functions.AbstractRichFunction
Throws: Exception

public void close()
throws Exception
Specified by: close in interface org.apache.flink.api.common.functions.RichFunction
Overrides: close in class org.apache.flink.api.common.functions.AbstractRichFunction
Throws: Exception

public void onProcessingTime(long timestamp)
throws Exception
Specified by: onProcessingTime in interface org.apache.flink.streaming.runtime.tasks.ProcessingTimeCallback
Throws: Exception

public void notifyCheckpointComplete(long checkpointId)
throws Exception
Specified by: notifyCheckpointComplete in interface org.apache.flink.runtime.state.CheckpointListener
Throws: Exception

public void snapshotState(org.apache.flink.runtime.state.FunctionSnapshotContext context)
throws Exception
Specified by: snapshotState in interface org.apache.flink.streaming.api.checkpoint.CheckpointedFunction
Throws: Exception

public BucketingSink<T> setBatchSize(long batchSize)
When a bucket part file becomes larger than this size a new bucket part file is started and
the old one is closed. The name of the bucket files depends on the Bucketer.
Parameters:
batchSize - The bucket part file size in bytes.

public BucketingSink<T> setBatchRolloverInterval(long batchRolloverInterval)
When a bucket part file is older than the roll over interval, a new bucket part file is
started and the old one is closed. The name of the bucket file depends on the Bucketer.
Additionally, the old part file is also closed if the bucket is not written to for a minimum of
inactiveBucketThreshold ms.
Parameters:
batchRolloverInterval - The roll over interval in milliseconds.

public BucketingSink<T> setInactiveBucketCheckInterval(long interval)
Parameters:
interval - The timeout, in milliseconds.

public BucketingSink<T> setInactiveBucketThreshold(long threshold)
Part files are also closed if the bucket is older than batchRolloverInterval ms.
Parameters:
threshold - The timeout, in milliseconds.

public BucketingSink<T> setBucketer(Bucketer<T> bucketer)
Sets the Bucketer to use for determining the bucket files to write to.
Parameters:
bucketer - The bucketer to use.

public BucketingSink<T> setWriter(Writer<T> writer)
Sets the Writer to be used for writing the incoming elements to bucket files.
Parameters:
writer - The Writer to use.

public BucketingSink<T> setInProgressSuffix(String inProgressSuffix)
Sets the suffix of in-progress part files. The default is "in-progress".

public BucketingSink<T> setInProgressPrefix(String inProgressPrefix)
Sets the prefix of in-progress part files. The default is "_".

public BucketingSink<T> setPendingSuffix(String pendingSuffix)
Sets the suffix of pending part files. The default is ".pending".

public BucketingSink<T> setPendingPrefix(String pendingPrefix)
Sets the prefix of pending part files. The default is "_".

public BucketingSink<T> setValidLengthSuffix(String validLengthSuffix)
Sets the suffix of valid-length files. The default is ".valid-length".

public BucketingSink<T> setValidLengthPrefix(String validLengthPrefix)
Sets the prefix of valid-length files. The default is "_".

public BucketingSink<T> setPartSuffix(String partSuffix)
Sets the suffix of part files.

public BucketingSink<T> setPartPrefix(String partPrefix)
Sets the prefix of part files. The default is "part".

public BucketingSink<T> setUseTruncate(boolean useTruncate)
Sets whether to use FileSystem.truncate() to truncate written bucket files back to
a consistent state in case of a restore from checkpoint. If truncate() is not used
this sink will write valid-length files for corresponding bucket files that have to be used
when reading from bucket files to make sure to not read too far.

@Deprecated
public BucketingSink<T> disableCleanupOnOpen()
This should only be disabled if using the sink without checkpoints, to not remove the files already in the directory.
public BucketingSink<T> setAsyncTimeout(long timeout)
Parameters:
timeout - The timeout, in milliseconds.

@VisibleForTesting
public org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink.State<T> getState()
public static org.apache.hadoop.fs.FileSystem createHadoopFileSystem(org.apache.hadoop.fs.Path path,
@Nullable
org.apache.flink.configuration.Configuration extraUserConf)
throws IOException
Throws: IOException

Copyright © 2014–2019 The Apache Software Foundation. All rights reserved.