Type Parameters:
T - the type of values written to the sink.
public abstract static class FileBasedSink.FileBasedWriteOperation<T> extends Sink.WriteOperation<T,FileBasedSink.FileResult>
Sink.WriteOperation that manages the process of writing to a
FileBasedSink.
The primary responsibility of the FileBasedWriteOperation is the management of output
files. During a write, FileBasedSink.FileBasedWriters write bundles to temporary file
locations. After the bundles have been written,
finalize(java.lang.Iterable<org.apache.beam.sdk.io.FileBasedSink.FileResult>, org.apache.beam.sdk.options.PipelineOptions) is given a list of the temporary
files containing the output bundles.
Subclass implementations of FileBasedWriteOperation must implement
createWriter(org.apache.beam.sdk.options.PipelineOptions) to return a concrete
FileBasedSink.FileBasedWriter.
Temporary files are named {baseTemporaryFilename}-temp-{bundleId}, where bundleId is the unique id of the bundle.
For example, if baseTemporaryFilename is "gs://my-bucket/my_temp_output", the output for a
bundle with bundle id 15723 will be "gs://my-bucket/my_temp_output-temp-15723".
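The temporary naming rule above can be illustrated with a minimal, self-contained sketch. The class TempFilenameSketch is hypothetical and is not part of the Beam SDK; it only assumes that the separator is the literal "-temp-" shown in the pattern and that the name is formed by plain concatenation.

```java
// Hypothetical illustration class; not part of the Beam SDK.
final class TempFilenameSketch {
    // Assumed to match the "-temp-" separator shown in the naming pattern above.
    static final String TEMPORARY_FILENAME_SEPARATOR = "-temp-";

    // Joins a base temporary filename and a bundle id with the separator.
    static String buildTemporaryFilename(String prefix, String suffix) {
        return prefix + TEMPORARY_FILENAME_SEPARATOR + suffix;
    }

    public static void main(String[] args) {
        // Prints gs://my-bucket/my_temp_output-temp-15723
        System.out.println(buildTemporaryFilename("gs://my-bucket/my_temp_output", "15723"));
    }
}
```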
Final output files are written to baseOutputFilename with the format
{baseOutputFilename}-0000i-of-0000n.{extension} where n is the total number of bundles
written and extension is the file extension. Both baseOutputFilename and extension are required
constructor arguments.
Subclass implementations can change the file naming template by supplying a value for
FileBasedSink.fileNamingTemplate.
temporaryFileRetention controls the behavior
for managing temporary files. By default, temporary files will be removed. Subclasses can
provide a different value to the constructor.
Note that if a bundle's write fails permanently, no cleanup of temporary files will occur.
If there are no elements in the PCollection being written, no output will be generated.
| Modifier and Type | Class and Description |
|---|---|
| static class | FileBasedSink.FileBasedWriteOperation.TemporaryFileRetention: Options for handling of temporary output files. |
| Modifier and Type | Field and Description |
|---|---|
| protected String | baseTemporaryFilename: Base filename used for temporary output files. |
| protected FileBasedSink<T> | sink: The Sink that this WriteOperation will write to. |
| protected static String | TEMPORARY_FILENAME_SEPARATOR: Name separator for temporary files. |
| protected FileBasedSink.FileBasedWriteOperation.TemporaryFileRetention | temporaryFileRetention: Option to keep or remove temporary output files. |
| Constructor and Description |
|---|
| FileBasedWriteOperation(FileBasedSink<T> sink): Construct a FileBasedWriteOperation using the same base filename for both temporary and output files. |
| FileBasedWriteOperation(FileBasedSink<T> sink, String baseTemporaryFilename): Construct a FileBasedWriteOperation. |
| FileBasedWriteOperation(FileBasedSink<T> sink, String baseTemporaryFilename, FileBasedSink.FileBasedWriteOperation.TemporaryFileRetention temporaryFileRetention): Create a new FileBasedWriteOperation. |
| Modifier and Type | Method and Description |
|---|---|
| protected static String | buildTemporaryFilename(String prefix, String suffix): Build a temporary filename using the temporary filename separator with the given prefix and suffix. |
| protected List<String> | copyToOutputFiles(List<String> filenames, PipelineOptions options): Copy temporary files to final output filenames using the file naming template. |
| abstract FileBasedSink.FileBasedWriter<T> | createWriter(PipelineOptions options): Clients must implement to return a subclass of FileBasedSink.FileBasedWriter. |
| void | finalize(Iterable<FileBasedSink.FileResult> writerResults, PipelineOptions options): Finalizes writing by copying temporary output files to their final location and optionally removing temporary files. |
| protected List<String> | generateDestinationFilenames(int numFiles): Generate output bundle filenames. |
| FileBasedSink<T> | getSink(): Returns the FileBasedSink for this write operation. |
| Coder<FileBasedSink.FileResult> | getWriterResultCoder(): Provides a coder for FileBasedSink.FileResult. |
| void | initialize(PipelineOptions options): Initialization of the sink. |
| protected void | removeTemporaryFiles(PipelineOptions options): Removes temporary output files. |
protected final FileBasedSink<T> sink
protected final FileBasedSink.FileBasedWriteOperation.TemporaryFileRetention temporaryFileRetention
protected final String baseTemporaryFilename
protected static final String TEMPORARY_FILENAME_SEPARATOR
Temporary files are named {baseTemporaryFilename}-temp-{bundleId}.
public FileBasedWriteOperation(FileBasedSink<T> sink)
Parameters:
sink - the FileBasedSink that will be used to configure this write operation.
public FileBasedWriteOperation(FileBasedSink<T> sink, String baseTemporaryFilename)
Parameters:
sink - the FileBasedSink that will be used to configure this write operation.
baseTemporaryFilename - the base filename to be used for temporary output files.
public FileBasedWriteOperation(FileBasedSink<T> sink, String baseTemporaryFilename, FileBasedSink.FileBasedWriteOperation.TemporaryFileRetention temporaryFileRetention)
Parameters:
sink - the FileBasedSink that will be used to configure this write operation.
baseTemporaryFilename - the base filename to be used for temporary output files.
temporaryFileRetention - defines how temporary files are handled.
protected static final String buildTemporaryFilename(String prefix, String suffix)
public abstract FileBasedSink.FileBasedWriter<T> createWriter(PipelineOptions options) throws Exception
Clients must implement to return a subclass of FileBasedSink.FileBasedWriter. This
method must satisfy the restrictions placed on implementations of
Sink.WriteOperation.createWriter(org.apache.beam.sdk.options.PipelineOptions). Namely, it must not mutate the state of the object.
createWriter in class Sink.WriteOperation<T,FileBasedSink.FileResult>
Throws:
Exception
public void initialize(PipelineOptions options) throws Exception
Initialization of the sink. See Sink.WriteOperation.initialize(org.apache.beam.sdk.options.PipelineOptions).
initialize in class Sink.WriteOperation<T,FileBasedSink.FileResult>
Throws:
Exception
public void finalize(Iterable<FileBasedSink.FileResult> writerResults, PipelineOptions options) throws Exception
Finalization may be overridden by subclass implementations to perform customized
finalization (e.g., initiating some operation on output bundles, merging them, etc.).
writerResults contains the filenames of written bundles.
If subclasses override this method, they must guarantee that its implementation is idempotent, as it may be executed multiple times in the case of failure or for redundancy. It is a best practice to try to make this method atomic.
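One way to approach the idempotence requirement can be sketched on a local filesystem with java.nio.file. This is an illustration under stated assumptions, not the Beam implementation: IdempotentFinalizeSketch and finalizeFile are hypothetical names, and a real sink would use the I/O layer appropriate to its storage system rather than local paths.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical illustration class; not part of the Beam SDK.
final class IdempotentFinalizeSketch {
    // Copies one temporary file to its final location so that repeating the
    // call leaves the same end state as calling it once.
    static void finalizeFile(Path temp, Path dest, boolean removeTemp) throws IOException {
        if (Files.exists(temp)) {
            // Overwriting makes a repeated copy harmless.
            Files.copy(temp, dest, StandardCopyOption.REPLACE_EXISTING);
        } else if (!Files.exists(dest)) {
            // Nothing to copy and nothing already finalized: a genuine error.
            throw new IOException("Neither temporary nor final file exists: " + temp);
        }
        // Otherwise an earlier attempt already produced dest and removed temp.
        if (removeTemp) {
            // deleteIfExists does not fail when the file is already gone.
            Files.deleteIfExists(temp);
        }
    }
}
```

Calling finalizeFile twice with the same arguments leaves the destination with the temporary file's contents both times. The sketch does not make the copy atomic; a reader arriving mid-copy could still observe a partial file.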
finalize in class Sink.WriteOperation<T,FileBasedSink.FileResult>
Parameters:
writerResults - the results of writes (FileResult).
Throws:
Exception
protected final List<String> copyToOutputFiles(List<String> filenames, PipelineOptions options) throws IOException
Can be called from subclasses that override finalize(java.lang.Iterable<org.apache.beam.sdk.io.FileBasedSink.FileResult>, org.apache.beam.sdk.options.PipelineOptions).
Files will be named according to the file naming template. The order of the output files will be the same as the sorted order of the input filenames. In other words, if the input filenames are ["C", "A", "B"], baseOutputFilename is "file", the extension is ".txt", and the fileNamingTemplate is "-SSS-of-NNN", the contents of A will be copied to file-000-of-003.txt, the contents of B will be copied to file-001-of-003.txt, etc.
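The ordering and naming rule in this paragraph can be reproduced with a small standalone sketch. OutputNamingSketch, expandRun, and planCopies are hypothetical names, and the template expansion here (each run of S becomes the zero-padded shard index, each run of N the zero-padded shard count) is a simplified assumption about how the template is interpreted, not Beam's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of the Beam SDK.
final class OutputNamingSketch {
    // Replaces each run of `marker` characters with `value`,
    // zero-padded to the length of that run.
    static String expandRun(String template, char marker, int value) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < template.length()) {
            if (template.charAt(i) == marker) {
                int start = i;
                while (i < template.length() && template.charAt(i) == marker) {
                    i++;
                }
                out.append(String.format("%0" + (i - start) + "d", value));
            } else {
                out.append(template.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    // Maps each temporary filename to its final name, assigning shard
    // numbers in the sorted order of the temporary filenames.
    static Map<String, String> planCopies(
            List<String> tempFiles, String base, String template, String extension) {
        List<String> sorted = new ArrayList<>(tempFiles);
        Collections.sort(sorted);
        Map<String, String> plan = new LinkedHashMap<>();
        for (int shard = 0; shard < sorted.size(); shard++) {
            String shardPart = expandRun(expandRun(template, 'S', shard), 'N', sorted.size());
            plan.put(sorted.get(shard), base + shardPart + extension);
        }
        return plan;
    }

    public static void main(String[] args) {
        System.out.println(planCopies(Arrays.asList("C", "A", "B"), "file", "-SSS-of-NNN", ".txt"));
    }
}
```

With the inputs from the paragraph above, A maps to file-000-of-003.txt and B to file-001-of-003.txt, regardless of the order in which the temporary filenames were supplied.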
Parameters:
filenames - the filenames of temporary files.
Throws:
IOException
protected final List<String> generateDestinationFilenames(int numFiles)
protected final void removeTemporaryFiles(PipelineOptions options) throws IOException
Can be called from subclasses that override finalize(java.lang.Iterable<org.apache.beam.sdk.io.FileBasedSink.FileResult>, org.apache.beam.sdk.options.PipelineOptions).
Note: If finalize is overridden and does not rename or otherwise finalize
temporary files, this method will remove them.
Throws:
IOException
public Coder<FileBasedSink.FileResult> getWriterResultCoder()
Provides a coder for FileBasedSink.FileResult.
getWriterResultCoder in class Sink.WriteOperation<T,FileBasedSink.FileResult>
public FileBasedSink<T> getSink()
getSink in class Sink.WriteOperation<T,FileBasedSink.FileResult>