Class ContextualTextIO
java.lang.Object
    org.apache.beam.sdk.io.contextualtextio.ContextualTextIO
public class ContextualTextIO extends java.lang.Object

PTransforms that read text files and collect contextual information of the elements in the input.

Prefer TextIO when not reading files with multi-line records, or when additional record metadata is not required.

Reading from text files
To read a PCollection from one or more text files, use ContextualTextIO.read(). To instantiate a transform, use ContextualTextIO.Read.from(String) and specify the path of the file(s) to be read. Alternatively, if the filenames to be read are themselves in a PCollection, you can use FileIO to match them and readFiles() to read them.

read() returns a PCollection of Rows with schema RecordWithMetadata.getSchema(), each corresponding to one line of an input UTF-8 text file (split into lines delimited by '\n', '\r', '\r\n', or a delimiter specified via ContextualTextIO.Read.withDelimiter(byte[])).

Filepattern expansion and watching
By default, the filepatterns are expanded only once. The combination of FileIO.Match.continuously(Duration, TerminationCondition) and readFiles() allows streaming of new files matching the filepattern(s).

By default, read() prohibits filepatterns that match no files, and readFiles() allows them if the filepattern contains a glob wildcard character. Use ContextualTextIO.Read.withEmptyMatchTreatment(org.apache.beam.sdk.io.fs.EmptyMatchTreatment), or FileIO.Match.withEmptyMatchTreatment(EmptyMatchTreatment) plus readFiles(), to configure this behavior.

Example 1: reading a file or filepattern.
    Pipeline p = ...;

    // A simple Read of a file:
    PCollection<Row> records =
        p.apply(ContextualTextIO.read().from("/local/path/to/file.txt"));

Example 2: reading a PCollection of filenames.
    Pipeline p = ...;

    // E.g. the filenames might be computed from other data in the pipeline, or
    // read from a data source.
    PCollection<String> filenames = ...;

    // Read all files in the collection.
    PCollection<Row> records =
        filenames
            .apply(FileIO.matchAll())
            .apply(FileIO.readMatches())
            .apply(ContextualTextIO.readFiles());

Example 3: streaming new files matching a filepattern.
    Pipeline p = ...;

    PCollection<Row> records =
        p.apply(ContextualTextIO.read()
            .from("/local/path/to/files/*")
            .watchForNewFiles(
                // Check for new files every minute
                Duration.standardMinutes(1),
                // Stop watching the filepattern if no new files appear within an hour
                Watch.Growth.afterTimeSinceNewOutput(Duration.standardHours(1))));

Example 4: reading a file or file pattern of RFC4180-compliant CSV files with fields that may contain line breaks.
Example of such a file could be:

    "aaa","b CRLF
    bb","ccc" CRLF
    zzz,yyy,xxx

    Pipeline p = ...;

    PCollection<Row> records =
        p.apply(ContextualTextIO.read()
            .from("/local/path/to/files/*.csv")
            .withHasMultilineCSVRecords(true));

Example 5: reading while watching for new files
    Pipeline p = ...;

    PCollection<Row> records =
        p.apply(FileIO.match()
                .filepattern("filepattern")
                .continuously(
                    Duration.millis(100),
                    Watch.Growth.afterTimeSinceNewOutput(Duration.standardSeconds(3))))
            .apply(FileIO.readMatches())
            .apply(ContextualTextIO.readFiles());

Example 6: reading with recordNum metadata.
    Pipeline p = ...;

    PCollection<Row> records =
        p.apply(ContextualTextIO.read()
            .from("/local/path/to/files/*.csv")
            .withRecordNumMetadata());

NOTE: When using ContextualTextIO.Read.withHasMultilineCSVRecords(Boolean), a single reader will be used to process the file, rather than multiple readers which can read from different offsets. For a large file this can result in lower performance.

NOTE: Use ContextualTextIO.Read.withRecordNumMetadata() when recordNum metadata is required. Computing absolute record positions currently introduces a grouping step, which increases the resources used by the pipeline. By default, withRecordNumMetadata is set to false; in this case, record objects will not contain absolute record positions within the entire file, but will still contain relative positions in respective offsets.

Reading a very large number of files
If it is known that the filepattern will match a very large number of files (e.g. tens of thousands or more), use ContextualTextIO.Read.withHintMatchesManyFiles() for better performance and scalability. Note that it may decrease performance if the filepattern matches only a small number of files.
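
Example 4 above reads CSV files whose quoted fields may contain line breaks. As a rough illustration of what that implies for record boundaries, the following plain-Java sketch splits text into records while ignoring line breaks inside double quotes. It is independent of Beam: the class MultilineCsvSketch and the method splitRecords are hypothetical names introduced here, and the logic is a simplified reading of RFC 4180 quoting, not the actual implementation of ContextualTextIO.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; hypothetical names, not part of Beam.
class MultilineCsvSketch {

  /** Splits CSV text into records, ignoring line breaks inside double-quoted fields. */
  static List<String> splitRecords(String text) {
    List<String> records = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    boolean inQuotes = false;
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      if (c == '"') {
        // Each quote toggles the state; an escaped quote ("") toggles twice,
        // which leaves the state unchanged.
        inQuotes = !inQuotes;
        current.append(c);
      } else if (!inQuotes && (c == '\r' || c == '\n')) {
        // A line break outside quotes ends the record; "\r\n" counts once.
        if (c == '\r' && i + 1 < text.length() && text.charAt(i + 1) == '\n') {
          i++;
        }
        records.add(current.toString());
        current.setLength(0);
      } else {
        // Ordinary characters, including line breaks inside quotes, stay in the record.
        current.append(c);
      }
    }
    // Emit a final record that has no terminating line break.
    if (current.length() > 0) {
      records.add(current.toString());
    }
    return records;
  }

  public static void main(String[] args) {
    // The sample file from Example 4, with CRLF written as "\r\n".
    String csv = "\"aaa\",\"b\r\nbb\",\"ccc\"\r\nzzz,yyy,xxx";
    System.out.println(splitRecords(csv).size()); // prints 2
  }
}
```

For the sample file from Example 4 this yields two records. In this sketch the quote state at any position depends on every character before it, so splitting cannot start from an arbitrary offset, which is in line with the single-reader NOTE above.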
Nested Class Summary
Modifier and Type    Class                         Description
static class         ContextualTextIO.Read         Implementation of read().
static class         ContextualTextIO.ReadFiles    Implementation of readFiles().
Method Summary
Modifier and Type                    Method         Description
static ContextualTextIO.Read         read()         A PTransform that reads from one or more text files and returns a bounded PCollection containing one element for each line in the input files.
static ContextualTextIO.ReadFiles    readFiles()    Like read(), but reads each file in a PCollection of FileIO.ReadableFile, returned by FileIO.readMatches().
Method Detail
read
public static ContextualTextIO.Read read()
A PTransform that reads from one or more text files and returns a bounded PCollection containing one element for each line in the input files.
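
To illustrate what "one element for each line" means under the default delimiters named in the class description ('\n', '\r', '\r\n'), here is a minimal plain-Java sketch. It does not use Beam: LineSplitSketch and splitLines are hypothetical names, and the handling of empty lines and of a trailing unterminated line is an assumption of this sketch, not a statement about the reader's behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only; hypothetical names, not part of Beam.
class LineSplitSketch {

  /** Splits text on "\n", "\r", or "\r\n", treating "\r\n" as a single delimiter. */
  static List<String> splitLines(String text) {
    List<String> lines = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    int i = 0;
    while (i < text.length()) {
      char c = text.charAt(i);
      if (c == '\r' || c == '\n') {
        lines.add(current.toString());
        current.setLength(0);
        // Consume the '\n' of a "\r\n" pair so it does not produce an empty line.
        if (c == '\r' && i + 1 < text.length() && text.charAt(i + 1) == '\n') {
          i++;
        }
      } else {
        current.append(c);
      }
      i++;
    }
    // Assumption: a trailing line without a delimiter is still emitted.
    if (current.length() > 0) {
      lines.add(current.toString());
    }
    return lines;
  }

  public static void main(String[] args) {
    // Mixed delimiters in one input.
    System.out.println(splitLines("first\nsecond\r\nthird\rfourth"));
    // prints [first, second, third, fourth]
  }
}
```

Each returned string corresponds to what would be one output element in the description above; a custom delimiter, as configured through withDelimiter(byte[]), is outside the scope of this sketch.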
readFiles
public static ContextualTextIO.ReadFiles readFiles()
Like read(), but reads each file in a PCollection of FileIO.ReadableFile, returned by FileIO.readMatches().
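
One difference between read() and readFiles() noted in the class description is the default handling of filepatterns that match no files. The plain-Java sketch below models that rule under stated assumptions: the Treatment enum is a local stand-in whose constant names are modeled on Beam's EmptyMatchTreatment, the helper methods are hypothetical, and the set of characters treated as glob wildcards ('*', '?', '[', '{') is an assumption rather than Beam's exact check.

```java
// Plain-Java illustration only; local stand-ins, not Beam classes.
class EmptyMatchSketch {

  enum Treatment { ALLOW, DISALLOW, ALLOW_IF_WILDCARD }

  /** Assumption: a pattern counts as a glob if it contains '*', '?', '[' or '{'. */
  static boolean hasWildcard(String filepattern) {
    for (char c : filepattern.toCharArray()) {
      if ("*?[{".indexOf(c) >= 0) {
        return true;
      }
    }
    return false;
  }

  /** Returns whether a filepattern that matched no files is accepted. */
  static boolean emptyMatchAllowed(Treatment treatment, String filepattern) {
    switch (treatment) {
      case ALLOW:
        return true;
      case DISALLOW:
        return false;
      default:
        // ALLOW_IF_WILDCARD: accepted only when the pattern is a glob.
        return hasWildcard(filepattern);
    }
  }

  public static void main(String[] args) {
    // Default described for read(): empty matches are prohibited.
    System.out.println(emptyMatchAllowed(Treatment.DISALLOW, "/path/*.txt")); // prints false
    // Default described for readFiles(): allowed when the pattern has a wildcard.
    System.out.println(emptyMatchAllowed(Treatment.ALLOW_IF_WILDCARD, "/path/*.txt")); // prints true
    System.out.println(emptyMatchAllowed(Treatment.ALLOW_IF_WILDCARD, "/path/file.txt")); // prints false
  }
}
```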