public class Pipelines extends Object
| Modifier and Type | Class and Description |
|---|---|
| static class | `Pipelines.DummySink`: Dummy sink that does nothing. |
| Constructor and Description |
|---|
| `Pipelines()` |
| Modifier and Type | Method and Description |
|---|---|
| `static org.apache.flink.streaming.api.datastream.DataStream<Object>` | `append(org.apache.flink.configuration.Configuration conf, org.apache.flink.table.types.logical.RowType rowType, org.apache.flink.streaming.api.datastream.DataStream<org.apache.flink.table.data.RowData> dataStream)` Insert the dataset with append mode (no upsert or deduplication). |
| `static org.apache.flink.streaming.api.datastream.DataStream<HoodieRecord>` | `bootstrap(org.apache.flink.configuration.Configuration conf, org.apache.flink.table.types.logical.RowType rowType, org.apache.flink.streaming.api.datastream.DataStream<org.apache.flink.table.data.RowData> dataStream)` Constructs the bootstrap pipeline as streaming. |
| `static org.apache.flink.streaming.api.datastream.DataStream<HoodieRecord>` | `bootstrap(org.apache.flink.configuration.Configuration conf, org.apache.flink.table.types.logical.RowType rowType, org.apache.flink.streaming.api.datastream.DataStream<org.apache.flink.table.data.RowData> dataStream, boolean bounded, boolean overwrite)` Constructs the bootstrap pipeline. |
| `static org.apache.flink.streaming.api.datastream.DataStreamSink<Object>` | `bulkInsert(org.apache.flink.configuration.Configuration conf, org.apache.flink.table.types.logical.RowType rowType, org.apache.flink.streaming.api.datastream.DataStream<org.apache.flink.table.data.RowData> dataStream)` Bulk insert the input dataset at once. |
| `static org.apache.flink.streaming.api.datastream.DataStreamSink<Object>` | `clean(org.apache.flink.configuration.Configuration conf, org.apache.flink.streaming.api.datastream.DataStream<Object> dataStream)` |
| `static org.apache.flink.streaming.api.datastream.DataStreamSink<ClusteringCommitEvent>` | `cluster(org.apache.flink.configuration.Configuration conf, org.apache.flink.table.types.logical.RowType rowType, org.apache.flink.streaming.api.datastream.DataStream<Object> dataStream)` The clustering tasks pipeline. |
| `static org.apache.flink.streaming.api.datastream.DataStreamSink<CompactionCommitEvent>` | `compact(org.apache.flink.configuration.Configuration conf, org.apache.flink.streaming.api.datastream.DataStream<Object> dataStream)` The compaction tasks pipeline. |
| `static org.apache.flink.streaming.api.datastream.DataStreamSink<Object>` | `dummySink(org.apache.flink.streaming.api.datastream.DataStream<Object> dataStream)` |
| `static String` | `getTablePath(org.apache.flink.configuration.Configuration conf)` |
| `static org.apache.flink.streaming.api.datastream.DataStream<Object>` | `hoodieStreamWrite(org.apache.flink.configuration.Configuration conf, org.apache.flink.streaming.api.datastream.DataStream<HoodieRecord> dataStream)` The streaming write pipeline. |
| `static String` | `opName(String operatorN, org.apache.flink.configuration.Configuration conf)` |
| `static String` | `opUID(String operatorN, org.apache.flink.configuration.Configuration conf)` |
| `static org.apache.flink.streaming.api.datastream.DataStream<HoodieRecord>` | `rowDataToHoodieRecord(org.apache.flink.configuration.Configuration conf, org.apache.flink.table.types.logical.RowType rowType, org.apache.flink.streaming.api.datastream.DataStream<org.apache.flink.table.data.RowData> dataStream)` Transforms the row data to hoodie records. |
public static org.apache.flink.streaming.api.datastream.DataStreamSink<Object> bulkInsert(org.apache.flink.configuration.Configuration conf, org.apache.flink.table.types.logical.RowType rowType, org.apache.flink.streaming.api.datastream.DataStream<org.apache.flink.table.data.RowData> dataStream)
By default, the input dataset is shuffled by the partition path first, then sorted by the partition path, before being passed to the write function. The whole pipeline looks like the following:
```
| input1 | ===\           /=== |sorter| === | task1 | (p1, p2)
                shuffle
| input2 | ===/           \=== |sorter| === | task2 | (p3, p4)
```
Note: both input1's and input2's datasets come from partitions p1, p2, p3 and p4.
The write task switches to a new file handle each time it receives a record from a different partition path, so the shuffle and sort reduce the number of small files.
The bulk insert should be run in batch execution mode.
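The description above can be turned into a call-site sketch. This is illustrative only: the construction of the environment, configuration, row type and bounded source is assumed to happen elsewhere, and the `org.apache.hudi.sink.utils` package for `Pipelines` is the location in recent Hudi releases.

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.types.logical.RowType;
import org.apache.hudi.sink.utils.Pipelines;

public class BulkInsertSketch {
  // Sketch: wire a bounded RowData stream into the bulk-insert pipeline.
  // The caller supplies env, conf, rowType and input; their construction is omitted.
  static void bulkInsert(StreamExecutionEnvironment env, Configuration conf,
                         RowType rowType, DataStream<RowData> input) throws Exception {
    env.setRuntimeMode(RuntimeExecutionMode.BATCH); // bulk insert should run in batch mode
    Pipelines.bulkInsert(conf, rowType, input);     // shuffle by partition, sort, then write
    env.execute("hudi-bulk-insert");
  }
}
```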
Parameters:
- `conf` - The configuration
- `rowType` - The input row type
- `dataStream` - The input data stream

public static org.apache.flink.streaming.api.datastream.DataStream<Object> append(org.apache.flink.configuration.Configuration conf, org.apache.flink.table.types.logical.RowType rowType, org.apache.flink.streaming.api.datastream.DataStream<org.apache.flink.table.data.RowData> dataStream)
The input dataset would be rebalanced among the write tasks:
```
| input1 | ===\           /=== | task1 | (p1, p2, p3, p4)
                shuffle
| input2 | ===/           \=== | task2 | (p1, p2, p3, p4)
```
Note: both input1's and input2's datasets come from partitions p1, p2, p3 and p4.
The write task switches to a new file handle each time it receives a record from a different partition path, so many small files may be produced.
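An append write is wired similarly; the returned stream still needs a terminating stage. A minimal sketch (how the caller obtains the configuration, row type and input stream is assumed, and the `org.apache.hudi.sink.utils` package for `Pipelines` is an assumption based on recent Hudi releases):

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.types.logical.RowType;
import org.apache.hudi.sink.utils.Pipelines;

public class AppendSketch {
  // Sketch: append-mode write (no upsert or deduplication).
  static void appendWrite(Configuration conf, RowType rowType, DataStream<RowData> input) {
    DataStream<Object> result = Pipelines.append(conf, rowType, input);
    Pipelines.dummySink(result); // terminate the pipeline; clean(conf, result) is an alternative
  }
}
```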
Parameters:
- `conf` - The configuration
- `rowType` - The input row type
- `dataStream` - The input data stream

public static org.apache.flink.streaming.api.datastream.DataStream<HoodieRecord> bootstrap(org.apache.flink.configuration.Configuration conf, org.apache.flink.table.types.logical.RowType rowType, org.apache.flink.streaming.api.datastream.DataStream<org.apache.flink.table.data.RowData> dataStream)
public static org.apache.flink.streaming.api.datastream.DataStream<HoodieRecord> bootstrap(org.apache.flink.configuration.Configuration conf, org.apache.flink.table.types.logical.RowType rowType, org.apache.flink.streaming.api.datastream.DataStream<org.apache.flink.table.data.RowData> dataStream, boolean bounded, boolean overwrite)
Parameters:
- `conf` - The configuration
- `rowType` - The row type
- `dataStream` - The data stream
- `bounded` - Whether the source is bounded
- `overwrite` - Whether it is insert overwrite

public static org.apache.flink.streaming.api.datastream.DataStream<HoodieRecord> rowDataToHoodieRecord(org.apache.flink.configuration.Configuration conf, org.apache.flink.table.types.logical.RowType rowType, org.apache.flink.streaming.api.datastream.DataStream<org.apache.flink.table.data.RowData> dataStream)
public static org.apache.flink.streaming.api.datastream.DataStream<Object> hoodieStreamWrite(org.apache.flink.configuration.Configuration conf, org.apache.flink.streaming.api.datastream.DataStream<HoodieRecord> dataStream)
The input dataset is shuffled by the primary key first, then by the file group ID, before being passed to the write function. The whole pipeline looks like the following:
```
| input1 | ===\                  /=== | bucket assigner | ===\                         /=== | task1 |
                shuffle(by PK)                                 shuffle(by bucket ID)
| input2 | ===/                  \=== | bucket assigner | ===/                         \=== | task2 |
```
Note: a file group must be handled by one write task to avoid write conflict.
The bucket assigner assigns the inputs to suitable file groups, and the write task caches the data and flushes it to disk.
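A typical streaming upsert chains the methods of this class: bootstrap the index, run the stream write, then attach a terminating stage. A sketch (input construction is assumed, and the `org.apache.hudi` package locations reflect recent releases):

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.types.logical.RowType;
import org.apache.hudi.common.model.HoodieRecord;
import org.apache.hudi.sink.utils.Pipelines;

public class StreamWriteSketch {
  // Sketch: index bootstrap -> bucket assignment + write -> cleaning.
  static void streamWrite(Configuration conf, RowType rowType, DataStream<RowData> input) {
    DataStream<HoodieRecord> records = Pipelines.bootstrap(conf, rowType, input);
    DataStream<Object> writeResult = Pipelines.hoodieStreamWrite(conf, records);
    Pipelines.clean(conf, writeResult); // compaction/clustering could be attached instead
  }
}
```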
Parameters:
- `conf` - The configuration
- `dataStream` - The input data stream

public static org.apache.flink.streaming.api.datastream.DataStreamSink<CompactionCommitEvent> compact(org.apache.flink.configuration.Configuration conf, org.apache.flink.streaming.api.datastream.DataStream<Object> dataStream)
The compaction plan operator monitors new compaction plans on the timeline and distributes the sub-plans to the compaction tasks. Each compaction task then hands the metadata over to the commit task, which commits the compaction transaction. The whole pipeline looks like the following:
```
                      /=== | task1 | ===\
| plan generation | ===> hash             | commit |
                      \=== | task2 | ===/
```
Note: both the compaction plan generation task and the commit task are singletons.
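For a merge-on-read table the compaction stage is typically attached to the stream-write result; a sketch (the caller's write-result stream and the `org.apache.hudi.sink.utils` package for `Pipelines` are assumptions):

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.hudi.sink.utils.Pipelines;

public class CompactSketch {
  // Sketch: attach the compaction pipeline to the stream-write result.
  static void writeThenCompact(Configuration conf, DataStream<Object> writeResult) {
    Pipelines.compact(conf, writeResult); // plan generation -> compact tasks -> commit
  }
}
```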
Parameters:
- `conf` - The configuration
- `dataStream` - The input data stream

public static org.apache.flink.streaming.api.datastream.DataStreamSink<ClusteringCommitEvent> cluster(org.apache.flink.configuration.Configuration conf, org.apache.flink.table.types.logical.RowType rowType, org.apache.flink.streaming.api.datastream.DataStream<Object> dataStream)
The clustering plan operator monitors new clustering plans on the timeline and distributes the sub-plans to the clustering tasks. Each clustering task then hands the metadata over to the commit task, which commits the clustering transaction. The whole pipeline looks like the following:
```
                      /=== | task1 | ===\
| plan generation | ===> hash             | commit |
                      \=== | task2 | ===/
```
Note: both the clustering plan generation task and the commit task are singletons.
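As with compaction, the clustering stage is typically attached to the stream-write result; a sketch (the write-result stream and the `org.apache.hudi.sink.utils` package for `Pipelines` are assumptions):

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.table.types.logical.RowType;
import org.apache.hudi.sink.utils.Pipelines;

public class ClusterSketch {
  // Sketch: attach the clustering pipeline to the stream-write result.
  static void writeThenCluster(Configuration conf, RowType rowType,
                               DataStream<Object> writeResult) {
    Pipelines.cluster(conf, rowType, writeResult); // plan generation -> cluster tasks -> commit
  }
}
```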
Parameters:
- `conf` - The configuration
- `rowType` - The input row type
- `dataStream` - The input data stream

public static org.apache.flink.streaming.api.datastream.DataStreamSink<Object> clean(org.apache.flink.configuration.Configuration conf, org.apache.flink.streaming.api.datastream.DataStream<Object> dataStream)
public static org.apache.flink.streaming.api.datastream.DataStreamSink<Object> dummySink(org.apache.flink.streaming.api.datastream.DataStream<Object> dataStream)
public static String opName(String operatorN, org.apache.flink.configuration.Configuration conf)
public static String opUID(String operatorN, org.apache.flink.configuration.Configuration conf)
public static String getTablePath(org.apache.flink.configuration.Configuration conf)
Copyright © 2023 The Apache Software Foundation. All rights reserved.