public class ExtremalTupleByNthField extends EvalFunc<Tuple> implements Algebraic, Accumulator<Tuple>
define myMin ExtremalTupleByNthField( '4', 'min' );
T = group G ALL;
R = foreach T generate myMin(G);
is equivalent to:
T = order G by $3 asc;
R = limit T 1;
Note that the argument '4' above refers to the fourth field, which has index 3
in the tuple. That field can be any comparable type, so you can use float, int,
string, or even tuples. With the default constructor, this UDF behaves like
MaxTupleBy1stField: it chooses the max tuple by the comparable in the first
field.
define myMax ExtremalTupleByNthField( '3' );
T = group G ALL;
R = foreach T generate myMax(G);
is equivalent to:
T = order G by $2 desc;
R = limit T 1;
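The selection described above can be pictured as a single pass that keeps the best tuple seen so far. The following is a minimal sketch in plain Java, assuming tuples are modeled as `List` values; the class name `ExtremalByFieldSketch` and the `keepMax` flag are hypothetical and are not part of the Pig API.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative stand-in for the selection the UDF performs; not Pig's implementation.
final class ExtremalByFieldSketch {

    // fieldIndex is one-based, mirroring the UDF argument ('4' selects index 3).
    // keepMax chooses between 'max' (true) and 'min' (false).
    static <T extends Comparable<T>> List<T> extreme(
            List<List<T>> bag, int fieldIndex, boolean keepMax) {
        int idx = fieldIndex - 1;
        List<T> best = null;
        for (List<T> tuple : bag) {
            if (best == null) {
                best = tuple;
                continue;
            }
            int c = tuple.get(idx).compareTo(best.get(idx));
            if (keepMax ? c > 0 : c < 0) {
                best = tuple;
            }
        }
        return best; // null when the bag is empty
    }

    public static void main(String[] args) {
        List<List<Integer>> g = Arrays.asList(
                Arrays.asList(1, 9, 6, 7),
                Arrays.asList(2, 8, 5, 2),
                Arrays.asList(3, 7, 4, 5));
        System.out.println(extreme(g, 4, false)); // prints [2, 8, 5, 2]
        System.out.println(extreme(g, 3, true));  // prints [1, 9, 6, 7]
    }
}
```

Unlike the order-then-limit formulation, this approach never sorts the bag; it only needs one comparison per tuple.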
define biggestBag ExtremalTupleByNthField('1', 'max');
R = group TABLE by (key1, key2);
G = cogroup L by key1, R by group.key1;
V = foreach G generate L, biggestBag(R);
This associates each L(eft) bag with only the largest bag from the R(ight)
table. If all bags in R are of equal size, the comparator goes on to compare
them element by element. On a complete tie, which result is returned is
nondeterministic. However, because this class can compare any comparable, we
are able to specify a secondary key.
define biggestBag ExtremalTupleByNthField('1', 'max');
G = cogroup L by key1, M by key1, R by key1;
V = foreach G generate FLATTEN(L),
biggestBag(R.($0, $1, $2, $5)) as best_result_by_0,
biggestBag(R.($3, $1, $2, $5)) as best_result_by_3,
biggestBag(M.($0, $2)) as best_misc_data;
This generates two sets of results and misc data based on two separate
criteria. Since all tuples in each bag have the same size (4, 4, and 2
respectively), the tuple comparator continues on and compares the members of
the tuples until it finds a difference. best_result_by_0 and best_result_by_3
are ordered by the 1st and 4th members of the original tuples, respectively.
Within each group, ties are broken by the second and third fields.
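The tie-breaking behavior described above (size first, then member by member) can be sketched as a comparator. This is an illustration under the assumption that bags and tuples are modeled as Java `List` values; `SizeThenElementwiseSketch` is a hypothetical name, and Pig's own comparison of its data types may differ in details not covered here.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical comparator mirroring the description: compare sizes first,
// then compare members in order until one pair differs.
final class SizeThenElementwiseSketch {

    static <T extends Comparable<T>> int compare(List<T> a, List<T> b) {
        if (a.size() != b.size()) {
            return Integer.compare(a.size(), b.size());
        }
        for (int i = 0; i < a.size(); i++) {
            int c = a.get(i).compareTo(b.get(i));
            if (c != 0) {
                return c; // first differing member decides
            }
        }
        return 0; // complete tie
    }

    public static void main(String[] args) {
        // Different sizes: the longer list is greater.
        System.out.println(compare(Arrays.asList(1, 2), Arrays.asList(9)));
        // Same size: the third member breaks the tie.
        System.out.println(compare(Arrays.asList(1, 2, 3), Arrays.asList(1, 2, 4)));
        // Identical contents: a complete tie.
        System.out.println(compare(Arrays.asList(1, 2), Arrays.asList(1, 2)));
    }
}
```

Projecting fields in a chosen order, as in `R.($3, $1, $2, $5)`, puts the primary key first and the secondary keys after it, so a comparison of this kind falls through to them on ties.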
Finally, note that the UDF implements both Algebraic and Accumulator, so it
is relatively efficient because it's a one-pass algorithm.

Modifier and Type | Class and Description |
---|---|
static class | ExtremalTupleByNthField.HelperClass: Utility classes and methods |
Nested classes inherited from class EvalFunc: EvalFunc.SchemaType
Fields inherited from class EvalFunc: log, pigLogger, reporter, returnType
Constructor and Description |
---|
ExtremalTupleByNthField(): Constructors |
ExtremalTupleByNthField(java.lang.String fieldIndexString) |
ExtremalTupleByNthField(java.lang.String fieldIndexString, java.lang.String order) |
Modifier and Type | Method and Description |
---|---|
void | accumulate(Tuple b): Pass tuples to the UDF. |
void | cleanup(): Called after getValue() to prepare processing for next key. |
Tuple | exec(Tuple input): The EvalFunc interface |
protected static Tuple | extreme(int pind, int psign, Tuple input, PigProgressable reporter) |
java.lang.String | getFinal(): Get the final function. |
java.lang.String | getInitial(): Algebraic interface |
java.lang.String | getIntermed(): Get the intermediate function. |
java.lang.reflect.Type | getReturnType(): Get the Type that this EvalFunc returns. |
Tuple | getValue(): Called when all tuples from current key have been passed to accumulate. |
Schema | outputSchema(Schema input): Report the schema of the output of this UDF. |
protected static int | parseFieldIndex(java.lang.String inputFieldIndex) |
protected static int | parseOrdering(java.lang.String order) |
Methods inherited from class EvalFunc: allowCompileTimeCalculation, finish, getArgToFuncMapping, getCacheFiles, getInputSchema, getLoadCaster, getLogger, getPigLogger, getReporter, getSchemaName, getSchemaType, getShipFiles, isAsynchronous, needEndOfAllInputProcessing, progress, setEndOfAllInput, setInputSchema, setPigLogger, setReporter, setUDFContextSignature, warn
public ExtremalTupleByNthField() throws ExecException
Throws: ExecException
public ExtremalTupleByNthField(java.lang.String fieldIndexString) throws ExecException
Throws: ExecException
public ExtremalTupleByNthField(java.lang.String fieldIndexString, java.lang.String order) throws ExecException
Throws: ExecException
public java.lang.reflect.Type getReturnType()
Description copied from class: EvalFunc
Get the Type that this EvalFunc returns.
Overrides: getReturnType in class EvalFunc<Tuple>
public Schema outputSchema(Schema input)
Description copied from class: EvalFunc
Report the schema of the output of this UDF. The default implementation
interprets the OutputSchema annotation, if one is present. Otherwise, it
returns null (no known output schema).
Overrides: outputSchema in class EvalFunc<Tuple>
Parameters: input - Schema of the input
public java.lang.String getInitial()
Specified by: getInitial in interface Algebraic
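public java.lang.String getIntermed()
Description copied from interface: Algebraic
Get the intermediate function.
Specified by: getIntermed in interface Algebraic
public java.lang.String getFinal()
Description copied from interface: Algebraic
Get the final function.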
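The three Algebraic methods above name the initial, intermediate, and final stages of the computation. Choosing an extremal tuple fits this model because the winner over a whole bag equals the winner among the per-partition winners. The sketch below illustrates that property in plain Java; `AlgebraicStagesSketch`, `maxBy`, and `staged` are hypothetical names, and how Pig actually schedules the stages is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of why an extremal-tuple function is algebraic.
final class AlgebraicStagesSketch {

    // Max tuple by a zero-based field index; null for empty input.
    static List<Integer> maxBy(List<List<Integer>> tuples, int idx) {
        List<Integer> best = null;
        for (List<Integer> t : tuples) {
            if (best == null || t.get(idx).compareTo(best.get(idx)) > 0) {
                best = t;
            }
        }
        return best;
    }

    // Reduce each partition on its own, then reduce the partial winners.
    static List<Integer> staged(List<List<List<Integer>>> partitions, int idx) {
        List<List<Integer>> partials = new ArrayList<>();
        for (List<List<Integer>> partition : partitions) {
            List<Integer> winner = maxBy(partition, idx);
            if (winner != null) {
                partials.add(winner);
            }
        }
        return maxBy(partials, idx);
    }

    public static void main(String[] args) {
        List<List<Integer>> p1 = Arrays.asList(Arrays.asList(1, 5), Arrays.asList(2, 9));
        List<List<Integer>> p2 = Arrays.asList(Arrays.asList(3, 7));
        System.out.println(staged(Arrays.asList(p1, p2), 1)); // prints [2, 9]
    }
}
```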
public void accumulate(Tuple b) throws java.io.IOException
Description copied from interface: Accumulator
Pass tuples to the UDF.
Specified by: accumulate in interface Accumulator<Tuple>
Parameters: b - A tuple containing a single field, which is a bag. The bag will contain the set
of tuples being passed to the UDF in this iteration.
Throws: java.io.IOException
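public void cleanup()
Description copied from interface: Accumulator
Called after getValue() to prepare processing for next key.
Specified by: cleanup in interface Accumulator<Tuple>
public Tuple getValue()
Description copied from interface: Accumulator
Called when all tuples from current key have been passed to accumulate.
Specified by: getValue in interface Accumulator<Tuple>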
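The accumulate, getValue, and cleanup methods above form a per-key lifecycle: batches arrive through accumulate, the result is read with getValue, and cleanup resets state for the next key. A minimal sketch of that lifecycle follows, assuming tuples are modeled as `List<Integer>`; `RunningMaxSketch` is a hypothetical class and does not implement Pig's `Accumulator` interface.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical accumulator-style holder that keeps only the best tuple seen so far.
final class RunningMaxSketch {
    private final int idx;      // zero-based field index to compare on
    private List<Integer> best; // current winner for the key being processed

    RunningMaxSketch(int idx) {
        this.idx = idx;
    }

    // Called once per batch of tuples for the current key.
    void accumulate(List<List<Integer>> batch) {
        for (List<Integer> t : batch) {
            if (best == null || t.get(idx).compareTo(best.get(idx)) > 0) {
                best = t;
            }
        }
    }

    // Called after all batches for the current key have been passed in.
    List<Integer> getValue() {
        return best;
    }

    // Called after getValue() so the next key starts from a clean state.
    void cleanup() {
        best = null;
    }

    public static void main(String[] args) {
        RunningMaxSketch acc = new RunningMaxSketch(0);
        acc.accumulate(Arrays.asList(Arrays.asList(4, 1), Arrays.asList(7, 2)));
        acc.accumulate(Arrays.asList(Arrays.asList(5, 3)));
        System.out.println(acc.getValue()); // prints [7, 2]
        acc.cleanup();
    }
}
```

Because only the current winner is retained between batches, memory use does not grow with the size of the bag.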
protected static final Tuple extreme(int pind, int psign, Tuple input, PigProgressable reporter) throws ExecException
Throws: ExecException
protected static int parseFieldIndex(java.lang.String inputFieldIndex) throws ExecException
Throws: ExecException
protected static int parseOrdering(java.lang.String order)
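The two parsing helpers above turn the constructor strings into the `int` values that `extreme` takes. The page does not document their exact rules, so the following is only a sketch under stated assumptions: the field index string is one-based (as in the '4' means index 3 example), a missing argument falls back to the first field and to max (matching the default constructor), and the ordering is encoded as +1 for 'max' and -1 for 'min'. The sign convention, the accepted strings, and the error handling are all assumptions, and `ArgumentParsingSketch` is a hypothetical class.

```java
import java.util.Locale;

// Hypothetical re-creation of the argument parsing; the real methods may accept
// other strings and report errors differently (for example via ExecException).
final class ArgumentParsingSketch {

    // Converts a one-based index string such as "4" to a zero-based index (3).
    static int parseFieldIndex(String inputFieldIndex) {
        if (inputFieldIndex == null) {
            return 0; // assumed default: first field
        }
        int oneBased = Integer.parseInt(inputFieldIndex.trim());
        if (oneBased < 1) {
            throw new IllegalArgumentException("field index must be >= 1: " + inputFieldIndex);
        }
        return oneBased - 1;
    }

    // Maps "max" to +1 and "min" to -1 (assumed encoding).
    static int parseOrdering(String order) {
        if (order == null) {
            return 1; // assumed default: max
        }
        switch (order.trim().toLowerCase(Locale.ROOT)) {
            case "max":
                return 1;
            case "min":
                return -1;
            default:
                throw new IllegalArgumentException("unknown ordering: " + order);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseFieldIndex("4")); // prints 3
        System.out.println(parseOrdering("min")); // prints -1
    }
}
```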
Copyright © 2007-2017 The Apache Software Foundation