@InterfaceAudience.Public public abstract class TableInputFormatBase extends org.apache.hadoop.mapreduce.InputFormat<ImmutableBytesWritable,Result>
A base for TableInputFormats. Receives a Connection, a TableName, and
a Scan instance that defines the input columns etc. Subclasses may use
other TableRecordReader implementations.
Subclasses MUST ensure initializeTable(Connection, TableName) is called for an instance to
function properly. Each of the entry points to this class used by the MapReduce framework,
createRecordReader(InputSplit, TaskAttemptContext) and getSplits(JobContext),
will call initialize(JobContext) as a convenient centralized location to handle
retrieving the necessary configuration information. If your subclass overrides either of these
methods, either call the parent version or call initialize yourself.
An example of a subclass:
class ExampleTIF extends TableInputFormatBase {

  @Override
  protected void initialize(JobContext context) throws IOException {
    // We are responsible for the lifecycle of this connection until we hand it over in
    // initializeTable.
    Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create(
        context.getConfiguration()));
    TableName tableName = TableName.valueOf("exampleTable");
    // mandatory. once passed here, TableInputFormatBase will handle closing the connection.
    initializeTable(connection, tableName);
    byte[][] inputColumns = new byte [][] { Bytes.toBytes("columnA"),
        Bytes.toBytes("columnB") };
    // optional, by default we'll get everything for the table.
    Scan scan = new Scan();
    for (byte[] family : inputColumns) {
      scan.addFamily(family);
    }
    Filter exampleFilter = new RowFilter(CompareOp.EQUAL, new RegexStringComparator("aa.*"));
    scan.setFilter(exampleFilter);
    setScan(scan);
  }
}
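The initialization contract described above (every entry point calls initialize, and initializeTable must run only once per instance) can be modelled without HBase on the classpath. The following is a minimal sketch under stated assumptions: FakeConnection, BaseFormat and GuardedFormat are hypothetical stand-ins, not HBase types, and the guard on a non-null connection is one possible way to make initialize safe to call repeatedly.

```java
/** Self-contained model of the initialization contract; every type here is a hypothetical stand-in. */
final class LifecycleSketch {

    /** Stand-in for an HBase Connection that counts how many instances were opened. */
    static final class FakeConnection implements AutoCloseable {
        static int opened = 0;
        boolean closed = false;

        FakeConnection() { opened++; }

        @Override
        public void close() { closed = true; }
    }

    /** Stand-in for the base class: both entry points delegate to initialize(). */
    static class BaseFormat {
        private FakeConnection connection;

        protected void initialize() { }

        protected void initializeTable(FakeConnection c) { this.connection = c; }

        protected FakeConnection getConnection() { return connection; }

        /** Releases what initializeTable handed over. */
        protected void closeTable() {
            if (connection != null) {
                connection.close();
                connection = null;
            }
        }

        void getSplits() { initialize(); }

        void createRecordReader() { initialize(); }
    }

    /** Subclass whose initialize() does nothing when a connection was already handed over. */
    static class GuardedFormat extends BaseFormat {
        @Override
        protected void initialize() {
            if (getConnection() != null) {
                return; // already initialized, so no second connection is created
            }
            initializeTable(new FakeConnection());
        }
    }

    public static void main(String[] args) {
        GuardedFormat format = new GuardedFormat();
        format.getSplits();
        format.createRecordReader();
        System.out.println("connections opened: " + FakeConnection.opened);
        format.closeTable();
    }
}
```

Without the guard, each entry point in this model would open a new connection and drop the reference to the previous one, which is the kind of leak the class description warns about.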
| Modifier and Type | Field and Description |
|---|---|
| static String | INPUT_AUTOBALANCE_MAXSKEWRATIO: Specifies the ratio for data skew in M/R jobs; it is used together with enabling the hbase.mapreduce.input.autobalance property. |
| static String | MAPREDUCE_INPUT_AUTOBALANCE: Specifies whether auto-balance is enabled for input in M/R jobs. |
| static String | TABLE_ROW_TEXTKEY: Specifies whether the row key in the table is text (ASCII between 32 and 126); the default is true. |
| Constructor and Description |
|---|
| TableInputFormatBase() |
| Modifier and Type | Method and Description |
|---|---|
| protected void | closeTable(): Close the Table and related objects that were initialized via initializeTable(Connection, TableName). |
| org.apache.hadoop.mapreduce.RecordReader<ImmutableBytesWritable,Result> | createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context): Builds a TableRecordReader. |
| protected Admin | getAdmin(): Allows subclasses to get the Admin. |
| protected RegionLocator | getRegionLocator(): Allows subclasses to get the RegionLocator. |
| Scan | getScan(): Gets the scan defining the actual details like columns etc. |
| static byte[] | getSplitKey(byte[] start, byte[] end, boolean isText): Select a split point in the region. |
| List<org.apache.hadoop.mapreduce.InputSplit> | getSplits(org.apache.hadoop.mapreduce.JobContext context): Calculates the splits that will serve as input for the map tasks. |
| protected Pair<byte[][],byte[][]> | getStartEndKeys() |
| protected Table | getTable(): Allows subclasses to get the Table. |
| protected boolean | includeRegionInSplit(byte[] startKey, byte[] endKey): Test if the given region is to be included in the InputSplit while splitting the regions of a table. |
| protected void | initialize(org.apache.hadoop.mapreduce.JobContext context): Handle subclass specific set up. |
| protected void | initializeTable(Connection connection, TableName tableName): Allows subclasses to initialize the table information. |
| void | setScan(Scan scan): Sets the scan defining the actual details like columns etc. |
| protected void | setTableRecordReader(TableRecordReader tableRecordReader): Allows subclasses to set the TableRecordReader. |
public static final String MAPREDUCE_INPUT_AUTOBALANCE
public static final String INPUT_AUTOBALANCE_MAXSKEWRATIO
public static final String TABLE_ROW_TEXTKEY
public org.apache.hadoop.mapreduce.RecordReader<ImmutableBytesWritable,Result> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext context) throws IOException
Builds a TableRecordReader. If no TableRecordReader was provided, uses the default.
Specified by: createRecordReader in class org.apache.hadoop.mapreduce.InputFormat<ImmutableBytesWritable,Result>
Parameters:
split - The split to work with.
context - The current context.
Throws:
IOException - When creating the reader fails.
See Also: InputFormat.createRecordReader(org.apache.hadoop.mapreduce.InputSplit, org.apache.hadoop.mapreduce.TaskAttemptContext)

protected Pair<byte[][],byte[][]> getStartEndKeys() throws IOException
Throws:
IOException

public List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext context) throws IOException
Calculates the splits that will serve as input for the map tasks.
Specified by: getSplits in class org.apache.hadoop.mapreduce.InputFormat<ImmutableBytesWritable,Result>
Parameters:
context - The current job context.
Throws:
IOException - When creating the list of splits fails.
See Also: InputFormat.getSplits(org.apache.hadoop.mapreduce.JobContext)

@InterfaceAudience.Private
public static byte[] getSplitKey(byte[] start,
byte[] end,
boolean isText)
Select a split point in the region.
| start key | end key | is text | split point |
|---|---|---|---|
| 'a', 'a', 'a', 'b', 'c', 'd', 'e', 'f', 'g' | 'a', 'a', 'a', 'f', 'f', 'f' | true | 'a', 'a', 'a', 'd', 'd', -78, 50, -77, 51 |
| '1', '1', '1', '0', '0', '0' | '1', '1', '2', '5', '7', '9', '0' | true | '1', '1', '1', -78, -77, -76, -104 |
| '1', '1', '1', '0' | '1', '1', '2', '0' | true | '1', '1', '1', -80 |
| 13, -19, 126, 127 | 13, -19, 127, 0 | false | 13, -19, 126, -65 |
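One way to read the table above is as a midpoint computation: treat both keys as unsigned big-endian integers, right-pad the shorter key with zero bytes, and take the value halfway between them. The sketch below does that with BigInteger. It is an assumption about what the split point represents, not HBase's implementation; it ignores the isText flag and empty keys, and the class and method names are hypothetical. It has been checked only against the last three rows of the table.

```java
import java.math.BigInteger;
import java.util.Arrays;

/** Hypothetical helper: midpoint of two row keys read as unsigned big-endian integers. */
final class SplitKeySketch {

    static byte[] midpoint(byte[] start, byte[] end) {
        // Right-pad the shorter key with zero bytes so both have the same width.
        int len = Math.max(start.length, end.length);
        BigInteger lo = new BigInteger(1, Arrays.copyOf(start, len));
        BigInteger hi = new BigInteger(1, Arrays.copyOf(end, len));

        // Assumes start <= end, so the difference is non-negative.
        BigInteger mid = lo.add(hi.subtract(lo).shiftRight(1));

        // toByteArray() may add a leading sign byte or drop leading zeros; normalize to len bytes.
        byte[] raw = mid.toByteArray();
        byte[] out = new byte[len];
        int n = Math.min(raw.length, len);
        System.arraycopy(raw, raw.length - n, out, len - n, n);
        return out;
    }

    public static void main(String[] args) {
        byte[] start = {'1', '1', '1', '0'};
        byte[] end = {'1', '1', '2', '0'};
        // prints [49, 49, 49, -80], i.e. '1', '1', '1', -80
        System.out.println(Arrays.toString(midpoint(start, end)));
    }
}
```

Bytes above 127 show up as negative numbers because Java's byte type is signed, which is why the split points in the table contain values such as -80 and -65.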
Parameters:
start - Start key of the region
end - End key of the region
isText - Determines whether to use text key mode or binary key mode

protected boolean includeRegionInSplit(byte[] startKey,
byte[] endKey)
Test if the given region is to be included in the InputSplit while splitting the regions of a table.
This optimization is effective when there is a specific reason to exclude an entire region from the M-R job (so that it contributes nothing to the InputSplit), given its start and end keys.
It is useful when we need to remember the last-processed top record and continuously revisit the [last, current) interval for M-R processing. In addition to reducing the number of InputSplits, it reduces the load on the region server, due to the ordering of the keys.
Note: It is possible that endKey.length == 0, for the last (recent) region.
Override this method if you want to bulk-exclude regions altogether from M-R. By default, no region is excluded (i.e. all regions are included).
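The [last, current) use case can be sketched as a plain key-range test. This is a minimal, self-contained illustration rather than code from HBase: RegionRangeFilter and its fields are hypothetical, and in a real subclass the method body would sit in an override of includeRegionInSplit. It assumes region ranges are [startKey, endKey) with an empty endKey meaning "unbounded", as the note above describes, and that keys compare as unsigned bytes in lexicographic order.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical filter: keep only regions that overlap the key interval [lo, hi). */
final class RegionRangeFilter {
    private final byte[] lo; // inclusive lower bound, e.g. the last processed row
    private final byte[] hi; // exclusive upper bound, e.g. the current row

    RegionRangeFilter(byte[] lo, byte[] hi) {
        this.lo = lo;
        this.hi = hi;
    }

    /** Same shape as the method documented above; returns true when the region should be read. */
    boolean includeRegionInSplit(byte[] startKey, byte[] endKey) {
        // An empty end key marks the last region, which extends to the end of the table.
        boolean endsAfterLo = endKey.length == 0 || compare(endKey, lo) > 0;
        boolean startsBeforeHi = compare(startKey, hi) < 0;
        return endsAfterLo && startsBeforeHi;
    }

    /** Lexicographic comparison of byte arrays, treating each byte as unsigned. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    /** Convenience for building keys from text in examples. */
    static byte[] key(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        RegionRangeFilter filter = new RegionRangeFilter(key("m"), key("t"));
        System.out.println(filter.includeRegionInSplit(key("g"), key("p"))); // overlaps [m, t)
        System.out.println(filter.includeRegionInSplit(key(""), key("g")));  // ends before m
    }
}
```

Regions rejected by such a test produce no InputSplit at all, so no map task is scheduled for them.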
Parameters:
startKey - Start key of the region
endKey - End key of the region

protected RegionLocator getRegionLocator()
Allows subclasses to get the RegionLocator.

protected void initializeTable(Connection connection, TableName tableName) throws IOException
Allows subclasses to initialize the table information.
Parameters:
connection - The Connection to the HBase cluster. MUST be unmanaged. We will close it.
tableName - The TableName of the table to process.
Throws:
IOException

public Scan getScan()
Gets the scan defining the actual details like columns etc.

public void setScan(Scan scan)
Sets the scan defining the actual details like columns etc.
Parameters:
scan - The scan to set.

protected void setTableRecordReader(TableRecordReader tableRecordReader)
Allows subclasses to set the TableRecordReader.
Parameters:
tableRecordReader - A different TableRecordReader implementation.

protected void initialize(org.apache.hadoop.mapreduce.JobContext context)
throws IOException
Handle subclass specific set up. Each of the entry points used by the MapReduce framework,
createRecordReader(InputSplit, TaskAttemptContext) and getSplits(JobContext),
will call initialize(JobContext) as a convenient centralized location to handle
retrieving the necessary configuration information and calling
initializeTable(Connection, TableName).
Subclasses should implement their initialize call such that it is safe to call multiple times.
The current TableInputFormatBase implementation relies on a non-null table reference to decide
if an initialize call is needed, but this behavior may change in the future. In particular,
it is critical that initializeTable not be called multiple times since this will leak
Connection instances.
Throws:
IOException

protected void closeTable() throws IOException
Close the Table and related objects that were initialized via initializeTable(Connection, TableName).
Throws:
IOException

Copyright © 2007–2017 The Apache Software Foundation. All rights reserved.