@InterfaceAudience.Public public abstract class TableInputFormatBase extends Object implements org.apache.hadoop.mapred.InputFormat<ImmutableBytesWritable,Result>
TableInputFormat
s. Receives a Table
, a
byte[] of input columns and optionally a Filter
.
Subclasses may use other TableRecordReader implementations.
Subclasses MUST ensure initializeTable(Connection, TableName) is called for an instance to
function properly. Each of the entry points to this class used by the MapReduce framework,
getRecordReader(InputSplit, JobConf, Reporter)
and getSplits(JobConf, int)
,
will call initialize(JobConf)
as a convenient centralized location to handle
retrieving the necessary configuration information. If your subclass overrides either of these
methods, either call the parent version or call initialize yourself.
An example of a subclass:
class ExampleTIF extends TableInputFormatBase { @Override protected void initialize(JobConf context) throws IOException { // We are responsible for the lifecycle of this connection until we hand it over in // initializeTable. Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create(job)); TableName tableName = TableName.valueOf("exampleTable"); // mandatory. once passed here, TableInputFormatBase will handle closing the connection. initializeTable(connection, tableName); byte[][] inputColumns = new byte [][] { Bytes.toBytes("columnA"), Bytes.toBytes("columnB") }; // mandatory setInputColumns(inputColumns); // optional, by default we'll get everything for the given columns. Filter exampleFilter = new RowFilter(CompareOp.EQUAL, new RegexStringComparator("aa.*")); setRowFilter(exampleFilter); } }
Constructor and Description |
---|
TableInputFormatBase() |
Modifier and Type | Method and Description |
---|---|
protected void |
closeTable()
Close the Table and related objects that were initialized via
initializeTable(Connection, TableName) . |
org.apache.hadoop.mapred.RecordReader<ImmutableBytesWritable,Result> |
getRecordReader(org.apache.hadoop.mapred.InputSplit split,
org.apache.hadoop.mapred.JobConf job,
org.apache.hadoop.mapred.Reporter reporter)
Builds a TableRecordReader.
|
org.apache.hadoop.mapred.InputSplit[] |
getSplits(org.apache.hadoop.mapred.JobConf job,
int numSplits)
Calculates the splits that will serve as input for the map tasks.
|
protected Table |
getTable()
Allows subclasses to get the
Table . |
protected void |
initialize(org.apache.hadoop.mapred.JobConf job)
Handle subclass specific set up.
|
protected void |
initializeTable(Connection connection,
TableName tableName)
Allows subclasses to initialize the table information.
|
protected void |
setInputColumns(byte[][] inputColumns) |
protected void |
setRowFilter(Filter rowFilter)
Allows subclasses to set the
Filter to be used. |
protected void |
setTableRecordReader(TableRecordReader tableRecordReader)
Allows subclasses to set the
TableRecordReader . |
public org.apache.hadoop.mapred.RecordReader<ImmutableBytesWritable,Result> getRecordReader(org.apache.hadoop.mapred.InputSplit split, org.apache.hadoop.mapred.JobConf job, org.apache.hadoop.mapred.Reporter reporter) throws IOException
getRecordReader
in interface org.apache.hadoop.mapred.InputFormat<ImmutableBytesWritable,Result>
IOException
InputFormat.getRecordReader(InputSplit,
JobConf, Reporter)
public org.apache.hadoop.mapred.InputSplit[] getSplits(org.apache.hadoop.mapred.JobConf job, int numSplits) throws IOException
HRegion
s in the table.
If the number of splits is smaller than the number of
HRegion
s then splits are spanned across
multiple HRegion
s
and are grouped the most evenly possible. In the
case splits are uneven the bigger splits are placed first in the
InputSplit
array.getSplits
in interface org.apache.hadoop.mapred.InputFormat<ImmutableBytesWritable,Result>
job
- the map task JobConf
numSplits
- a hint to calculate the number of splits (mapred.map.tasks).IOException
InputFormat.getSplits(org.apache.hadoop.mapred.JobConf, int)
protected void initializeTable(Connection connection, TableName tableName) throws IOException
connection
- The Connection to the HBase cluster. MUST be unmanaged. We will close.tableName
- The TableName
of the table to process.IOException
protected void setInputColumns(byte[][] inputColumns)
inputColumns
- to be passed in Result
to the map task.protected void setTableRecordReader(TableRecordReader tableRecordReader)
TableRecordReader
.tableRecordReader
- to provide other TableRecordReader
implementations.protected void setRowFilter(Filter rowFilter)
Filter
to be used.rowFilter
- protected void initialize(org.apache.hadoop.mapred.JobConf job) throws IOException
getRecordReader(InputSplit, JobConf, Reporter)
and getSplits(JobConf, int)
,
will call initialize(JobConf)
as a convenient centralized location to handle
retrieving the necessary configuration information and calling
initializeTable(Connection, TableName)
.
Subclasses should implement their initialize call such that it is safe to call multiple times.
The current TableInputFormatBase implementation relies on a non-null table reference to decide
if an initialize call is needed, but this behavior may change in the future. In particular,
it is critical that initializeTable not be called multiple times since this will leak
Connection instances.IOException
protected void closeTable() throws IOException
initializeTable(Connection, TableName)
.IOException
Copyright © 2007–2019 Cloudera. All rights reserved.