@InterfaceAudience.Public public class TableMapReduceUtil extends Object
TableMapper
and TableReducer
Constructor and Description |
---|
TableMapReduceUtil() |
Modifier and Type | Method and Description |
---|---|
static void |
addDependencyJars(org.apache.hadoop.conf.Configuration conf,
Class<?>... classes)
Deprecated.
rely on
addDependencyJars(Job) instead. |
static void |
addDependencyJars(org.apache.hadoop.mapreduce.Job job)
Add the HBase dependency jars as well as jars for any of the configured
job classes to the job configuration, so that JobClient will ship them
to the cluster and add them to the DistributedCache.
|
static void |
addDependencyJarsForClasses(org.apache.hadoop.conf.Configuration conf,
Class<?>... classes)
Add the jars containing the given classes to the job's configuration
such that JobClient will ship them to the cluster and add them to
the DistributedCache.
|
static void |
addHBaseDependencyJars(org.apache.hadoop.conf.Configuration conf)
Add HBase and its dependencies (only) to the job configuration.
|
static String |
buildDependencyClasspath(org.apache.hadoop.conf.Configuration conf)
Returns a classpath string built from the content of the "tmpjars" value in
conf . |
static String |
convertScanToString(Scan scan)
Writes the given scan into a Base64 encoded string.
|
static Scan |
convertStringToScan(String base64)
Converts the given Base64 string back into a Scan instance.
|
static void |
initCredentials(org.apache.hadoop.mapreduce.Job job) |
static void |
initCredentialsForCluster(org.apache.hadoop.mapreduce.Job job,
org.apache.hadoop.conf.Configuration conf)
Obtain an authentication token, for the specified cluster, on behalf of the current user
and add it to the credentials for the given map reduce job.
|
static void |
initCredentialsForCluster(org.apache.hadoop.mapreduce.Job job,
String quorumAddress)
Deprecated.
Since 1.2.0, use
initCredentialsForCluster(Job, Configuration) instead. |
static void |
initMultiTableSnapshotMapperJob(Map<String,Collection<Scan>> snapshotScans,
Class<? extends TableMapper> mapper,
Class<?> outputKeyClass,
Class<?> outputValueClass,
org.apache.hadoop.mapreduce.Job job,
boolean addDependencyJars,
org.apache.hadoop.fs.Path tmpRestoreDir)
Sets up the job for reading from one or more table snapshots, with one or more scans
per snapshot.
|
static void |
initTableMapperJob(byte[] table,
Scan scan,
Class<? extends TableMapper> mapper,
Class<?> outputKeyClass,
Class<?> outputValueClass,
org.apache.hadoop.mapreduce.Job job)
Use this before submitting a TableMap job.
|
static void |
initTableMapperJob(byte[] table,
Scan scan,
Class<? extends TableMapper> mapper,
Class<?> outputKeyClass,
Class<?> outputValueClass,
org.apache.hadoop.mapreduce.Job job,
boolean addDependencyJars)
Use this before submitting a TableMap job.
|
static void |
initTableMapperJob(byte[] table,
Scan scan,
Class<? extends TableMapper> mapper,
Class<?> outputKeyClass,
Class<?> outputValueClass,
org.apache.hadoop.mapreduce.Job job,
boolean addDependencyJars,
Class<? extends org.apache.hadoop.mapreduce.InputFormat> inputFormatClass)
Use this before submitting a TableMap job.
|
static void |
initTableMapperJob(List<Scan> scans,
Class<? extends TableMapper> mapper,
Class<?> outputKeyClass,
Class<?> outputValueClass,
org.apache.hadoop.mapreduce.Job job)
Use this before submitting a Multi TableMap job.
|
static void |
initTableMapperJob(List<Scan> scans,
Class<? extends TableMapper> mapper,
Class<?> outputKeyClass,
Class<?> outputValueClass,
org.apache.hadoop.mapreduce.Job job,
boolean addDependencyJars)
Use this before submitting a Multi TableMap job.
|
static void |
initTableMapperJob(List<Scan> scans,
Class<? extends TableMapper> mapper,
Class<?> outputKeyClass,
Class<?> outputValueClass,
org.apache.hadoop.mapreduce.Job job,
boolean addDependencyJars,
boolean initCredentials)
Use this before submitting a Multi TableMap job.
|
static void |
initTableMapperJob(String table,
Scan scan,
Class<? extends TableMapper> mapper,
Class<?> outputKeyClass,
Class<?> outputValueClass,
org.apache.hadoop.mapreduce.Job job)
Use this before submitting a TableMap job.
|
static void |
initTableMapperJob(String table,
Scan scan,
Class<? extends TableMapper> mapper,
Class<?> outputKeyClass,
Class<?> outputValueClass,
org.apache.hadoop.mapreduce.Job job,
boolean addDependencyJars)
Use this before submitting a TableMap job.
|
static void |
initTableMapperJob(String table,
Scan scan,
Class<? extends TableMapper> mapper,
Class<?> outputKeyClass,
Class<?> outputValueClass,
org.apache.hadoop.mapreduce.Job job,
boolean addDependencyJars,
boolean initCredentials,
Class<? extends org.apache.hadoop.mapreduce.InputFormat> inputFormatClass)
Use this before submitting a TableMap job.
|
static void |
initTableMapperJob(String table,
Scan scan,
Class<? extends TableMapper> mapper,
Class<?> outputKeyClass,
Class<?> outputValueClass,
org.apache.hadoop.mapreduce.Job job,
boolean addDependencyJars,
Class<? extends org.apache.hadoop.mapreduce.InputFormat> inputFormatClass)
Use this before submitting a TableMap job.
|
static void |
initTableMapperJob(TableName table,
Scan scan,
Class<? extends TableMapper> mapper,
Class<?> outputKeyClass,
Class<?> outputValueClass,
org.apache.hadoop.mapreduce.Job job)
Use this before submitting a TableMap job.
|
static void |
initTableReducerJob(String table,
Class<? extends TableReducer> reducer,
org.apache.hadoop.mapreduce.Job job)
Use this before submitting a TableReduce job.
|
static void |
initTableReducerJob(String table,
Class<? extends TableReducer> reducer,
org.apache.hadoop.mapreduce.Job job,
Class partitioner)
Use this before submitting a TableReduce job.
|
static void |
initTableReducerJob(String table,
Class<? extends TableReducer> reducer,
org.apache.hadoop.mapreduce.Job job,
Class partitioner,
String quorumAddress,
String serverClass,
String serverImpl)
Use this before submitting a TableReduce job.
|
static void |
initTableReducerJob(String table,
Class<? extends TableReducer> reducer,
org.apache.hadoop.mapreduce.Job job,
Class partitioner,
String quorumAddress,
String serverClass,
String serverImpl,
boolean addDependencyJars)
Use this before submitting a TableReduce job.
|
static void |
initTableSnapshotMapperJob(String snapshotName,
Scan scan,
Class<? extends TableMapper> mapper,
Class<?> outputKeyClass,
Class<?> outputValueClass,
org.apache.hadoop.mapreduce.Job job,
boolean addDependencyJars,
org.apache.hadoop.fs.Path tmpRestoreDir)
Sets up the job for reading from a table snapshot.
|
static void |
initTableSnapshotMapperJob(String snapshotName,
Scan scan,
Class<? extends TableMapper> mapper,
Class<?> outputKeyClass,
Class<?> outputValueClass,
org.apache.hadoop.mapreduce.Job job,
boolean addDependencyJars,
org.apache.hadoop.fs.Path tmpRestoreDir,
RegionSplitter.SplitAlgorithm splitAlgo,
int numSplitsPerRegion)
Sets up the job for reading from a table snapshot.
|
static void |
limitNumReduceTasks(String table,
org.apache.hadoop.mapreduce.Job job)
Ensures that the given number of reduce tasks for the given job
configuration does not exceed the number of regions for the given table.
|
static void |
resetCacheConfig(org.apache.hadoop.conf.Configuration conf)
Enable a basic on-heap cache for these jobs.
|
static void |
setNumReduceTasks(String table,
org.apache.hadoop.mapreduce.Job job)
Sets the number of reduce tasks for the given job configuration to the
number of regions the given table has.
|
static void |
setScannerCaching(org.apache.hadoop.mapreduce.Job job,
int batchSize)
Sets the number of rows to return and cache with each scanner iteration.
|
public static void initTableMapperJob(String table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job) throws IOException
table
- The table name to read from.scan
- The scan instance with the columns, time range etc.mapper
- The mapper class to use.outputKeyClass
- The class of the output key.outputValueClass
- The class of the output value.job
- The current job to adjust. Make sure the passed job is
carrying all necessary HBase configuration.IOException
- When setting up the details fails.public static void initTableMapperJob(TableName table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job) throws IOException
table
- The table name to read from.scan
- The scan instance with the columns, time range etc.mapper
- The mapper class to use.outputKeyClass
- The class of the output key.outputValueClass
- The class of the output value.job
- The current job to adjust. Make sure the passed job is
carrying all necessary HBase configuration.IOException
- When setting up the details fails.public static void initTableMapperJob(byte[] table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job) throws IOException
table
- Binary representation of the table name to read from.scan
- The scan instance with the columns, time range etc.mapper
- The mapper class to use.outputKeyClass
- The class of the output key.outputValueClass
- The class of the output value.job
- The current job to adjust. Make sure the passed job is
carrying all necessary HBase configuration.IOException
- When setting up the details fails.public static void initTableMapperJob(String table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars, Class<? extends org.apache.hadoop.mapreduce.InputFormat> inputFormatClass) throws IOException
table
- The table name to read from.scan
- The scan instance with the columns, time range etc.mapper
- The mapper class to use.outputKeyClass
- The class of the output key.outputValueClass
- The class of the output value.job
- The current job to adjust. Make sure the passed job is
carrying all necessary HBase configuration.addDependencyJars
- upload HBase jars and jars for any of the configured
job classes via the distributed cache (tmpjars).IOException
- When setting up the details fails.public static void initTableMapperJob(String table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars, boolean initCredentials, Class<? extends org.apache.hadoop.mapreduce.InputFormat> inputFormatClass) throws IOException
table
- The table name to read from.scan
- The scan instance with the columns, time range etc.mapper
- The mapper class to use.outputKeyClass
- The class of the output key.outputValueClass
- The class of the output value.job
- The current job to adjust. Make sure the passed job is
carrying all necessary HBase configuration.addDependencyJars
- upload HBase jars and jars for any of the configured
job classes via the distributed cache (tmpjars).initCredentials
- whether to initialize hbase auth credentials for the jobinputFormatClass
- the input formatIOException
- When setting up the details fails.public static void initTableMapperJob(byte[] table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars, Class<? extends org.apache.hadoop.mapreduce.InputFormat> inputFormatClass) throws IOException
table
- Binary representation of the table name to read from.scan
- The scan instance with the columns, time range etc.mapper
- The mapper class to use.outputKeyClass
- The class of the output key.outputValueClass
- The class of the output value.job
- The current job to adjust. Make sure the passed job is
carrying all necessary HBase configuration.addDependencyJars
- upload HBase jars and jars for any of the configured
job classes via the distributed cache (tmpjars).inputFormatClass
- The class of the input formatIOException
- When setting up the details fails.public static void initTableMapperJob(byte[] table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars) throws IOException
table
- Binary representation of the table name to read from.scan
- The scan instance with the columns, time range etc.mapper
- The mapper class to use.outputKeyClass
- The class of the output key.outputValueClass
- The class of the output value.job
- The current job to adjust. Make sure the passed job is
carrying all necessary HBase configuration.addDependencyJars
- upload HBase jars and jars for any of the configured
job classes via the distributed cache (tmpjars).IOException
- When setting up the details fails.public static void initTableMapperJob(String table, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars) throws IOException
table
- The table name to read from.scan
- The scan instance with the columns, time range etc.mapper
- The mapper class to use.outputKeyClass
- The class of the output key.outputValueClass
- The class of the output value.job
- The current job to adjust. Make sure the passed job is
carrying all necessary HBase configuration.addDependencyJars
- upload HBase jars and jars for any of the configured
job classes via the distributed cache (tmpjars).IOException
- When setting up the details fails.public static void resetCacheConfig(org.apache.hadoop.conf.Configuration conf)
public static void initMultiTableSnapshotMapperJob(Map<String,Collection<Scan>> snapshotScans, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars, org.apache.hadoop.fs.Path tmpRestoreDir) throws IOException
snapshotScans
- map of snapshot name to scans on that snapshot.mapper
- The mapper class to use.outputKeyClass
- The class of the output key.outputValueClass
- The class of the output value.job
- The current job to adjust. Make sure the passed job is
carrying all necessary HBase configuration.addDependencyJars
- upload HBase jars and jars for any of the configured
job classes via the distributed cache (tmpjars).IOException
public static void initTableSnapshotMapperJob(String snapshotName, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars, org.apache.hadoop.fs.Path tmpRestoreDir) throws IOException
snapshotName
- The name of the snapshot (of a table) to read from.scan
- The scan instance with the columns, time range etc.mapper
- The mapper class to use.outputKeyClass
- The class of the output key.outputValueClass
- The class of the output value.job
- The current job to adjust. Make sure the passed job is carrying all necessary HBase
configuration.addDependencyJars
- upload HBase jars and jars for any of the configured job classes via
the distributed cache (tmpjars).tmpRestoreDir
- a temporary directory to copy the snapshot files into. Current user should
have write permissions to this directory, and this should not be a subdirectory of
rootdir. After the job is finished, restore directory can be deleted.IOException
- When setting up the details fails.TableSnapshotInputFormat
public static void initTableSnapshotMapperJob(String snapshotName, Scan scan, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars, org.apache.hadoop.fs.Path tmpRestoreDir, RegionSplitter.SplitAlgorithm splitAlgo, int numSplitsPerRegion) throws IOException
snapshotName
- The name of the snapshot (of a table) to read from.scan
- The scan instance with the columns, time range etc.mapper
- The mapper class to use.outputKeyClass
- The class of the output key.outputValueClass
- The class of the output value.job
- The current job to adjust. Make sure the passed job is
carrying all necessary HBase configuration.addDependencyJars
- upload HBase jars and jars for any of the configured
job classes via the distributed cache (tmpjars).tmpRestoreDir
- a temporary directory to copy the snapshot files into. Current user should
have write permissions to this directory, and this should not be a subdirectory of rootdir.
After the job is finished, restore directory can be deleted.splitAlgo
- algorithm to splitnumSplitsPerRegion
- how many input splits to generate per one regionIOException
- When setting up the details fails.TableSnapshotInputFormat
public static void initTableMapperJob(List<Scan> scans, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job) throws IOException
scans
- The list of Scan
objects to read from.mapper
- The mapper class to use.outputKeyClass
- The class of the output key.outputValueClass
- The class of the output value.job
- The current job to adjust. Make sure the passed job is carrying
all necessary HBase configuration.IOException
- When setting up the details fails.public static void initTableMapperJob(List<Scan> scans, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars) throws IOException
scans
- The list of Scan
objects to read from.mapper
- The mapper class to use.outputKeyClass
- The class of the output key.outputValueClass
- The class of the output value.job
- The current job to adjust. Make sure the passed job is carrying
all necessary HBase configuration.addDependencyJars
- upload HBase jars and jars for any of the
configured job classes via the distributed cache (tmpjars).IOException
- When setting up the details fails.public static void initTableMapperJob(List<Scan> scans, Class<? extends TableMapper> mapper, Class<?> outputKeyClass, Class<?> outputValueClass, org.apache.hadoop.mapreduce.Job job, boolean addDependencyJars, boolean initCredentials) throws IOException
scans
- The list of Scan
objects to read from.mapper
- The mapper class to use.outputKeyClass
- The class of the output key.outputValueClass
- The class of the output value.job
- The current job to adjust. Make sure the passed job is carrying
all necessary HBase configuration.addDependencyJars
- upload HBase jars and jars for any of the
configured job classes via the distributed cache (tmpjars).initCredentials
- whether to initialize hbase auth credentials for the jobIOException
- When setting up the details fails.public static void initCredentials(org.apache.hadoop.mapreduce.Job job) throws IOException
IOException
@Deprecated public static void initCredentialsForCluster(org.apache.hadoop.mapreduce.Job job, String quorumAddress) throws IOException
initCredentialsForCluster(Job, Configuration)
instead.job
- The job that requires the permission.quorumAddress
- string that contains the 3 required configuratinsIOException
- When the authentication token cannot be obtained.public static void initCredentialsForCluster(org.apache.hadoop.mapreduce.Job job, org.apache.hadoop.conf.Configuration conf) throws IOException
job
- The job that requires the permission.conf
- The configuration to use in connecting to the peer clusterIOException
- When the authentication token cannot be obtained.public static String convertScanToString(Scan scan) throws IOException
scan
- The scan to write out.IOException
- When writing the scan fails.public static Scan convertStringToScan(String base64) throws IOException
base64
- The scan details.IOException
- When reading the scan instance fails.public static void initTableReducerJob(String table, Class<? extends TableReducer> reducer, org.apache.hadoop.mapreduce.Job job) throws IOException
table
- The output table.reducer
- The reducer class to use.job
- The current job to adjust.IOException
- When determining the region count fails.public static void initTableReducerJob(String table, Class<? extends TableReducer> reducer, org.apache.hadoop.mapreduce.Job job, Class partitioner) throws IOException
table
- The output table.reducer
- The reducer class to use.job
- The current job to adjust.partitioner
- Partitioner to use. Pass null
to use
default partitioner.IOException
- When determining the region count fails.public static void initTableReducerJob(String table, Class<? extends TableReducer> reducer, org.apache.hadoop.mapreduce.Job job, Class partitioner, String quorumAddress, String serverClass, String serverImpl) throws IOException
table
- The output table.reducer
- The reducer class to use.job
- The current job to adjust. Make sure the passed job is
carrying all necessary HBase configuration.partitioner
- Partitioner to use. Pass null
to use
default partitioner.quorumAddress
- Distant cluster to write to; default is null for
output to the cluster that is designated in hbase-site.xml
.
Set this String to the zookeeper ensemble of an alternate remote cluster
when you would have the reduce write a cluster that is other than the
default; e.g. copying tables between clusters, the source would be
designated by hbase-site.xml
and this param would have the
ensemble address of the remote cluster. The format to pass is particular.
Pass <hbase.zookeeper.quorum>:<
hbase.zookeeper.client.port>:<zookeeper.znode.parent>
such as server,server2,server3:2181:/hbase
.serverClass
- redefined hbase.regionserver.classserverImpl
- redefined hbase.regionserver.implIOException
- When determining the region count fails.public static void initTableReducerJob(String table, Class<? extends TableReducer> reducer, org.apache.hadoop.mapreduce.Job job, Class partitioner, String quorumAddress, String serverClass, String serverImpl, boolean addDependencyJars) throws IOException
table
- The output table.reducer
- The reducer class to use.job
- The current job to adjust. Make sure the passed job is
carrying all necessary HBase configuration.partitioner
- Partitioner to use. Pass null
to use
default partitioner.quorumAddress
- Distant cluster to write to; default is null for
output to the cluster that is designated in hbase-site.xml
.
Set this String to the zookeeper ensemble of an alternate remote cluster
when you would have the reduce write a cluster that is other than the
default; e.g. copying tables between clusters, the source would be
designated by hbase-site.xml
and this param would have the
ensemble address of the remote cluster. The format to pass is particular.
Pass <hbase.zookeeper.quorum>:<
hbase.zookeeper.client.port>:<zookeeper.znode.parent>
such as server,server2,server3:2181:/hbase
.serverClass
- redefined hbase.regionserver.classserverImpl
- redefined hbase.regionserver.impladdDependencyJars
- upload HBase jars and jars for any of the configured
job classes via the distributed cache (tmpjars).IOException
- When determining the region count fails.public static void limitNumReduceTasks(String table, org.apache.hadoop.mapreduce.Job job) throws IOException
table
- The table to get the region count for.job
- The current job to adjust.IOException
- When retrieving the table details fails.public static void setNumReduceTasks(String table, org.apache.hadoop.mapreduce.Job job) throws IOException
table
- The table to get the region count for.job
- The current job to adjust.IOException
- When retrieving the table details fails.public static void setScannerCaching(org.apache.hadoop.mapreduce.Job job, int batchSize)
job
- The current job to adjust.batchSize
- The number of rows to return in batch with each scanner
iteration.public static void addHBaseDependencyJars(org.apache.hadoop.conf.Configuration conf) throws IOException
This is intended as a low-level API, facilitating code reuse between this class and its mapred counterpart. It also of use to external tools that need to build a MapReduce job that interacts with HBase but want fine-grained control over the jars shipped to the cluster.
conf
- The Configuration object to extend with dependencies.IOException
TableMapReduceUtil
,
PIG-3285public static String buildDependencyClasspath(org.apache.hadoop.conf.Configuration conf)
conf
.
Also exposed to shell scripts via `bin/hbase mapredcp`.public static void addDependencyJars(org.apache.hadoop.mapreduce.Job job) throws IOException
IOException
@Deprecated public static void addDependencyJars(org.apache.hadoop.conf.Configuration conf, Class<?>... classes) throws IOException
addDependencyJars(Job)
instead.IOException
@InterfaceAudience.Private public static void addDependencyJarsForClasses(org.apache.hadoop.conf.Configuration conf, Class<?>... classes) throws IOException
conf
- The Hadoop Configuration to modifyclasses
- will add just those dependencies needed to find the given classesIOException
- if an underlying library call fails.Copyright © 2007–2019 Cloudera. All rights reserved.