T - The type to read from the compressed file.@Experimental(value=SOURCE_SINK) public class CompressedSource<T> extends FileBasedSource<T>
CompressedSources wraps a delegate
FileBasedSource that is able to read the decompressed file format.
For example, use the following to read from a gzip-compressed XML file:
XmlSource mySource = XmlSource.from(...);
PCollection<T> collection = p.apply(Read.from(CompressedSource
.from(mySource)
.withDecompression(CompressedSource.CompressionMode.GZIP)));
Supported compression algorithms are CompressedSource.CompressionMode.GZIP and
CompressedSource.CompressionMode.BZIP2. User-defined compression types are supported by implementing
CompressedSource.DecompressingChannelFactory.
By default, the compression algorithm is selected from those supported in
CompressedSource.CompressionMode based on the file name provided to the source, namely
".bz2" indicates CompressedSource.CompressionMode.BZIP2 and ".gz" indicates
CompressedSource.CompressionMode.GZIP. If the file name does not match any of the supported
algorithms, it is assumed to be uncompressed data.
| Modifier and Type | Class and Description |
|---|---|
static class |
CompressedSource.CompressedReader<T>
Reader for a
CompressedSource. |
static class |
CompressedSource.CompressionMode
Default compression types supported by the
CompressedSource. |
static interface |
CompressedSource.DecompressingChannelFactory
Factory interface for creating channels that decompress the content of an underlying channel.
|
FileBasedSource.FileBasedReader<T>, FileBasedSource.ModeOffsetBasedSource.OffsetBasedReader<T>BoundedSource.BoundedReader<T>Source.Reader<T>| Modifier and Type | Method and Description |
|---|---|
protected FileBasedSource<T> |
createForSubrangeOfFile(String fileName,
long start,
long end)
Creates a
CompressedSource for a subrange of a file. |
protected FileBasedSource.FileBasedReader<T> |
createSingleFileReader(PipelineOptions options)
Creates a
FileBasedReader to read a single file. |
static <T> CompressedSource<T> |
from(FileBasedSource<T> sourceDelegate)
Creates a
CompressedSource from an underlying FileBasedSource. |
CompressedSource.DecompressingChannelFactory |
getChannelFactory() |
Coder<T> |
getDefaultOutputCoder()
Returns the delegate source's default output coder.
|
protected boolean |
isSplittable()
Determines whether a single file represented by this source is splittable.
|
void |
populateDisplayData(DisplayData.Builder builder)
Register display data for the given transform or component.
|
boolean |
producesSortedKeys(PipelineOptions options)
Returns whether the delegate source produces sorted keys.
|
static <T> Read.Bounded<T> |
readFromSource(FileBasedSource<T> sourceDelegate,
CompressedSource.DecompressingChannelFactory channelFactory)
Creates a
Read transform that reads from that reads from the underlying
FileBasedSource sourceDelegate after decompressing it with a CompressedSource.DecompressingChannelFactory. |
void |
validate()
Validates that the delegate source is a valid source and that the channel factory is not null.
|
CompressedSource<T> |
withDecompression(CompressedSource.DecompressingChannelFactory channelFactory)
Return a
CompressedSource that is like this one but will decompress its underlying file
with the given CompressedSource.DecompressingChannelFactory. |
createReader, createSourceForSubrange, expandFilePattern, getEstimatedSizeBytes, getFileOrPatternSpec, getMaxEndOffset, getMode, splitIntoBundles, toStringgetBytesPerOffset, getEndOffset, getMinBundleSize, getStartOffsetpublic static <T> Read.Bounded<T> readFromSource(FileBasedSource<T> sourceDelegate, CompressedSource.DecompressingChannelFactory channelFactory)
Read transform that reads from that reads from the underlying
FileBasedSource sourceDelegate after decompressing it with a CompressedSource.DecompressingChannelFactory.public static <T> CompressedSource<T> from(FileBasedSource<T> sourceDelegate)
CompressedSource from an underlying FileBasedSource. The type
of compression used will be based on the file name extension unless explicitly
configured via withDecompression(org.apache.beam.sdk.io.CompressedSource.DecompressingChannelFactory).public CompressedSource<T> withDecompression(CompressedSource.DecompressingChannelFactory channelFactory)
CompressedSource that is like this one but will decompress its underlying file
with the given CompressedSource.DecompressingChannelFactory.public void validate()
validate in class FileBasedSource<T>protected FileBasedSource<T> createForSubrangeOfFile(String fileName, long start, long end)
CompressedSource for a subrange of a file. Called by superclass to create a
source for a single file.createForSubrangeOfFile in class FileBasedSource<T>fileName - file backing the new FileBasedSource.start - starting byte offset of the new FileBasedSource.end - ending byte offset of the new FileBasedSource. May be Long.MAX_VALUE,
in which case it will be inferred using FileBasedSource.getMaxEndOffset(org.apache.beam.sdk.options.PipelineOptions).protected final boolean isSplittable()
throws Exception
isSplittable in class FileBasedSource<T>Exceptionprotected final FileBasedSource.FileBasedReader<T> createSingleFileReader(PipelineOptions options)
FileBasedReader to read a single file.
Uses the delegate source to create a single file reader for the delegate source. Utilizes the default decompression channel factory to not wrap the source reader if the file name does not represent a compressed file allowing for splitting of the source.
createSingleFileReader in class FileBasedSource<T>public final boolean producesSortedKeys(PipelineOptions options) throws Exception
producesSortedKeys in class BoundedSource<T>Exceptionpublic void populateDisplayData(DisplayData.Builder builder)
SourcepopulateDisplayData(DisplayData.Builder) is invoked by Pipeline runners to collect
display data via DisplayData.from(HasDisplayData). Implementations may call
super.populateDisplayData(builder) in order to register display data in the current
namespace, but should otherwise use subcomponent.populateDisplayData(builder) to use
the namespace of the subcomponent.
By default, does not register any display data. Implementors may override this method to provide their own display data.
populateDisplayData in interface HasDisplayDatapopulateDisplayData in class FileBasedSource<T>builder - The builder to populate with display data.HasDisplayDatapublic final Coder<T> getDefaultOutputCoder()
getDefaultOutputCoder in class Source<T>public final CompressedSource.DecompressingChannelFactory getChannelFactory()