T - Type of records represented by the source.
public abstract class FileBasedSource<T> extends OffsetBasedSource<T>
A common base class for all file-based Sources. Extend this class to implement your own
file-based custom source.
A file-based Source is a Source backed by a file pattern defined as a Java
glob, a single file, or an offset range for a single file. See OffsetBasedSource and
RangeTracker for semantics of offset ranges.
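As a rough illustration of what a Java glob matches, the sketch below uses the JDK's own `PathMatcher`. This is an assumption made for demonstration only: a FileBasedSource resolves its pattern through an IOChannelFactory, whose matching rules may differ from the JDK's. The class name `GlobSketch` and the sample paths are hypothetical.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Hypothetical helper for illustration; not part of the Beam API.
final class GlobSketch {
  /** Returns true if the given path matches the glob, using java.nio glob syntax. */
  static boolean matches(String glob, String path) {
    PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
    return matcher.matches(Paths.get(path));
  }

  public static void main(String[] args) {
    // A single '*' matches within one directory level only.
    System.out.println(matches("input/part-*.txt", "input/part-00001.txt"));
    System.out.println(matches("input/*.txt", "input/nested/part-00001.txt"));
  }
}
```

In this sketch a pattern such as `input/part-*.txt` selects every matching file in one directory, which corresponds to the "file pattern" case, while a literal path corresponds to the "single file" case.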
This source stores a String that is an IOChannelFactory specification for a
file or file pattern. There should be an IOChannelFactory defined for the file
specification provided. Please refer to IOChannelUtils and IOChannelFactory for
more information on this.
In addition to the methods left abstract from BoundedSource, subclasses must implement
methods to create a sub-source and a reader for a range of a single file -
createForSubrangeOfFile(java.lang.String, long, long) and createSingleFileReader(org.apache.beam.sdk.options.PipelineOptions). Please refer to
XmlSource for an example implementation of FileBasedSource.
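To make the idea of sub-sources over a range of a single file more concrete, here is a minimal, dependency-free sketch of how a byte range might be cut into half-open subranges. It is a simplified assumption, not Beam's actual splitting logic; `SubrangeSketch`, `Range`, and `split` are hypothetical names. In a real subclass, each resulting range would be passed to createForSubrangeOfFile to build a sub-source.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the Beam API.
final class SubrangeSketch {
  /** A half-open byte range [start, end) within a single file. */
  static final class Range {
    final long start;
    final long end;

    Range(long start, long end) {
      this.start = start;
      this.end = end;
    }

    @Override
    public String toString() {
      return "[" + start + ", " + end + ")";
    }
  }

  /**
   * Splits [start, end) into contiguous subranges of roughly desiredBundleSize bytes,
   * never smaller than minBundleSize. A tail shorter than minBundleSize is merged
   * into the preceding subrange.
   */
  static List<Range> split(long start, long end, long desiredBundleSize, long minBundleSize) {
    if (start < 0 || end <= start) {
      throw new IllegalArgumentException("expected 0 <= start < end");
    }
    long size = Math.max(1L, Math.max(desiredBundleSize, minBundleSize));
    List<Range> ranges = new ArrayList<>();
    long current = start;
    while (current < end) {
      // Compare against the remaining length first so current + size cannot overflow.
      long next = (end - current <= size) ? end : current + size;
      if (end - next < minBundleSize) {
        next = end; // absorb a too-small tail
      }
      ranges.add(new Range(current, next));
      current = next;
    }
    return ranges;
  }

  public static void main(String[] args) {
    System.out.println(split(0L, 100L, 30L, 20L));
  }
}
```

With a desired size of 30 and a minimum of 20, the range [0, 100) yields [0, 30), [30, 60), and [60, 100): the last 10 bytes are too small to stand alone, so they are folded into the final subrange. Every subrange satisfies start < end and stays within the parent range.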
| Modifier and Type | Class and Description |
|---|---|
| static class | FileBasedSource.FileBasedReader<T>: A reader that implements code common to readers of FileBasedSources. |
| static class | FileBasedSource.Mode: A given FileBasedSource represents a file resource of one of these types. |
Inherited nested classes: OffsetBasedSource.OffsetBasedReader<T>, BoundedSource.BoundedReader<T>, Source.Reader<T>

| Constructor and Description |
|---|
| FileBasedSource(String fileOrPatternSpec, long minBundleSize): Create a FileBasedSource based on a file or a file pattern specification. |
| FileBasedSource(String fileName, long minBundleSize, long startOffset, long endOffset): Create a FileBasedSource based on a single file. |
| Modifier and Type | Method and Description |
|---|---|
| protected abstract FileBasedSource<T> | createForSubrangeOfFile(String fileName, long start, long end): Creates and returns a new FileBasedSource of the same type as the current FileBasedSource backed by a given file and an offset range. |
| BoundedSource.BoundedReader<T> | createReader(PipelineOptions options): Returns a new BoundedSource.BoundedReader that reads from this source. |
| protected abstract FileBasedSource.FileBasedReader<T> | createSingleFileReader(PipelineOptions options): Creates and returns an instance of a FileBasedReader implementation for the current source assuming the source represents a single file. |
| FileBasedSource<T> | createSourceForSubrange(long start, long end): Returns an OffsetBasedSource for a subrange of the current source. |
| protected static Collection<String> | expandFilePattern(String fileOrPatternSpec) |
| long | getEstimatedSizeBytes(PipelineOptions options): An estimate of the total size (in bytes) of the data that would be read from this source. |
| String | getFileOrPatternSpec() |
| long | getMaxEndOffset(PipelineOptions options): Returns the actual ending offset of the current source. |
| FileBasedSource.Mode | getMode() |
| protected boolean | isSplittable(): Determines whether a file represented by this source can be split into bundles. |
| void | populateDisplayData(DisplayData.Builder builder): Register display data for the given transform or component. |
| List<? extends FileBasedSource<T>> | splitIntoBundles(long desiredBundleSizeBytes, PipelineOptions options): Splits the source into bundles of approximately desiredBundleSizeBytes. |
| String | toString() |
| void | validate(): Checks that this source is valid, before it can be used in a pipeline. |
Inherited methods: getBytesPerOffset, getEndOffset, getMinBundleSize, getStartOffset, producesSortedKeys, getDefaultOutputCoder
public FileBasedSource(String fileOrPatternSpec, long minBundleSize)
Create a FileBasedSource based on a file or a file pattern specification. This
constructor must be used when creating a new FileBasedSource for a file pattern.
See OffsetBasedSource for a detailed description of minBundleSize.
fileOrPatternSpec - IOChannelFactory specification of file or file pattern
represented by the FileBasedSource.
minBundleSize - minimum bundle size in bytes.
public FileBasedSource(String fileName, long minBundleSize, long startOffset, long endOffset)
Create a FileBasedSource based on a single file. This constructor must be used when
creating a new FileBasedSource for a subrange of a single file.
Additionally, this constructor must be used to create new FileBasedSources when
subclasses implement the method createForSubrangeOfFile(java.lang.String, long, long).
See OffsetBasedSource for detailed descriptions of minBundleSize,
startOffset, and endOffset.
fileName - IOChannelFactory specification of the file represented by the
FileBasedSource.
minBundleSize - minimum bundle size in bytes.
startOffset - starting byte offset.
endOffset - ending byte offset. If the specified value >= getMaxEndOffset(), it
implies getMaxEndOffset().
public final String getFileOrPatternSpec()
public final FileBasedSource.Mode getMode()
public final FileBasedSource<T> createSourceForSubrange(long start, long end)
Description copied from class: OffsetBasedSource
Returns an OffsetBasedSource for a subrange of the current source. The
subrange [start, end) must be within the range [startOffset, endOffset) of
the current source, i.e. startOffset <= start < end <= endOffset.
createSourceForSubrange in class OffsetBasedSource<T>
protected abstract FileBasedSource<T> createForSubrangeOfFile(String fileName, long start, long end)
Creates and returns a new FileBasedSource of the same type as the current
FileBasedSource backed by a given file and an offset range. When the current source is
being split, this method is used to generate new sub-sources. When creating the source,
subclasses must call the constructor FileBasedSource(String, long, long, long) of
FileBasedSource with the corresponding parameter values passed here.
fileName - file backing the new FileBasedSource.
start - starting byte offset of the new FileBasedSource.
end - ending byte offset of the new FileBasedSource. May be Long.MAX_VALUE,
in which case it will be inferred using getMaxEndOffset(org.apache.beam.sdk.options.PipelineOptions).
protected abstract FileBasedSource.FileBasedReader<T> createSingleFileReader(PipelineOptions options)
Creates and returns an instance of a FileBasedReader implementation for the current
source assuming the source represents a single file. File patterns will be handled by
the FileBasedSource implementation automatically.
public final long getEstimatedSizeBytes(PipelineOptions options) throws IOException
Description copied from class: BoundedSource
An estimate of the total size (in bytes) of the data that would be read from this source.
getEstimatedSizeBytes in class OffsetBasedSource<T>
Throws: IOException
public void populateDisplayData(DisplayData.Builder builder)
Description copied from class: Source
Register display data for the given transform or component.
populateDisplayData(DisplayData.Builder) is invoked by Pipeline runners to collect
display data via DisplayData.from(HasDisplayData). Implementations may call
super.populateDisplayData(builder) in order to register display data in the current
namespace, but should otherwise use subcomponent.populateDisplayData(builder) to use
the namespace of the subcomponent.
By default, does not register any display data. Implementors may override this method to provide their own display data.
populateDisplayData in interface HasDisplayData
populateDisplayData in class OffsetBasedSource<T>
builder - The builder to populate with display data.
See Also: HasDisplayData
public final List<? extends FileBasedSource<T>> splitIntoBundles(long desiredBundleSizeBytes, PipelineOptions options) throws Exception
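Description copied from class: BoundedSource
Splits the source into bundles of approximately desiredBundleSizeBytes.
splitIntoBundles in class OffsetBasedSource<T>
Throws: Exception
protected boolean isSplittable() throws Exception
Determines whether a file represented by this source can be split into bundles.
By default, a file is splittable if it is on a file system that supports efficient read seeking. Subclasses may override to provide different behavior.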
Throws: Exception
public final BoundedSource.BoundedReader<T> createReader(PipelineOptions options) throws IOException
Description copied from class: BoundedSource
Returns a new BoundedSource.BoundedReader that reads from this source.
createReader in class BoundedSource<T>
Throws: IOException
public String toString()
toString in class OffsetBasedSource<T>
public void validate()
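Description copied from class: Source
Checks that this source is valid, before it can be used in a pipeline.
It is recommended to use Preconditions for implementing
this method.
validate in class OffsetBasedSource<T>
public final long getMaxEndOffset(PipelineOptions options) throws IOException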
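Description copied from class: OffsetBasedSource
Returns the actual ending offset of the current source. The returned value is used to clip
the end of the range [startOffset, endOffset) such that the
range used is [startOffset, min(endOffset, maxEndOffset)).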
As an example in which OffsetBasedSource is used to implement a file source, suppose
that this source was constructed with an endOffset of Long.MAX_VALUE to
indicate that a file should be read to the end. Then this function should determine
the actual, exact size of the file in bytes and return it.
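The clipping rule above can be sketched in a few lines. This is an illustrative assumption, not the Beam implementation: `EndOffsetSketch` and both method names are hypothetical, and the size lookup uses `java.nio.file.Files.size` on a local file, whereas a FileBasedSource obtains the size through its IOChannelFactory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; not part of the Beam API.
final class EndOffsetSketch {
  /** Stand-in for getMaxEndOffset: the exact size in bytes of a local file. */
  static long maxEndOffsetOfLocalFile(Path file) throws IOException {
    return Files.size(file);
  }

  /** Clips the requested end offset so the range read is [startOffset, min(endOffset, maxEndOffset)). */
  static long effectiveEnd(long endOffset, long maxEndOffset) {
    return Math.min(endOffset, maxEndOffset);
  }
}
```

Under these assumptions, a source constructed with an endOffset of Long.MAX_VALUE over a 1024-byte file would read up to offset 1024, while an explicit endOffset of 100 is left unchanged.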
getMaxEndOffset in class OffsetBasedSource<T>
Throws: IOException
protected static final Collection<String> expandFilePattern(String fileOrPatternSpec) throws IOException
Throws: IOException