public class WordCorpusFrequencyJob extends Object
| Modifier and Type | Class and Description |
|---|---|
| static class | WordCorpusFrequencyJob.DocumentSumReducer: Sums up all the documents per token index by docID. |
| static class | WordCorpusFrequencyJob.TokenMapper: Write a token with its document id. |
| static class | WordCorpusFrequencyJob.WordCorpusCounter |
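
The summaries of TokenMapper and DocumentSumReducer above are brief, so the following in-memory sketch may help picture the kind of computation involved. It is not the job's actual implementation. It assumes that each input line is one document, that tokens are split on whitespace, and that the minimum word count is a threshold on a token's total occurrences; none of these details is confirmed by this page. The class `CorpusFrequencySketch` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory illustration; the real job runs as Hadoop MapReduce.
class CorpusFrequencySketch {

    /** Returns token -> (docId -> occurrences of the token in that document). */
    static Map<String, Map<Integer, Integer>> countPerDocument(List<String> documents) {
        Map<String, Map<Integer, Integer>> counts = new TreeMap<>();
        for (int docId = 0; docId < documents.size(); docId++) {
            // "Map" step (assumed): emit the pair (token, docId) for every token.
            for (String token : documents.get(docId).trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue; // an empty line yields no tokens
                }
                // "Reduce" step (assumed): sum the emitted pairs per token and docId.
                counts.computeIfAbsent(token, t -> new TreeMap<>())
                      .merge(docId, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Builds a sorted dictionary, keeping tokens seen at least minWordCount times. */
    static List<String> dictionary(Map<String, Map<Integer, Integer>> counts, int minWordCount) {
        List<String> dict = new ArrayList<>();
        for (Map.Entry<String, Map<Integer, Integer>> entry : counts.entrySet()) {
            int total = 0;
            for (int count : entry.getValue().values()) {
                total += count;
            }
            if (total >= minWordCount) {
                dict.add(entry.getKey()); // a token's index is its position in this list
            }
        }
        return dict;
    }

    public static void main(String[] args) {
        List<String> documents = Arrays.asList("a b a", "b c");
        Map<String, Map<Integer, Integer>> counts = countPerDocument(documents);
        System.out.println(counts);
        System.out.println(dictionary(counts, 2));
    }
}
```

In this sketch the number of documents is simply the number of input lines, which matches how getNumberOfDocuments is described further down the page.
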
| Modifier and Type | Field and Description |
|---|---|
| static String | DICT_OUT_PATH_KEY |
| static String | MIN_WORD_COUNT_KEY |
| static String | TOKENIZER_CLASS_KEY |
| Constructor and Description |
|---|
| WordCorpusFrequencyJob() |
| Modifier and Type | Method and Description |
|---|---|
| static org.apache.hadoop.mapreduce.Job | createJob(String in, String dictOut, String out, org.apache.hadoop.conf.Configuration conf): Creates a token frequency job. |
| static long | getNumberOfDocuments(org.apache.hadoop.mapreduce.Job finishedJob): Gets the counter of the input lines read; in this case it should be the number of documents. |
| static long | getNumberOfTokens(org.apache.hadoop.mapreduce.Job finishedJob): Gets the counter of the reduce output values. |
| static Tokenizer | getTokenizer(org.apache.hadoop.conf.Configuration conf): Gets a tokenizer, based on the configured class in "tokenizer.class". |
| static void | main(String[] args) |
public static final String DICT_OUT_PATH_KEY
public static final String MIN_WORD_COUNT_KEY
public static final String TOKENIZER_CLASS_KEY
public static Tokenizer getTokenizer(org.apache.hadoop.conf.Configuration conf)
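
getTokenizer is described as resolving a tokenizer from the class configured under "tokenizer.class". A minimal sketch of that general pattern (reflective instantiation from a configuration value) might look like the following. It uses `java.util.Properties` in place of the Hadoop `Configuration`, and the types `SketchTokenizer`, `WhitespaceSketchTokenizer`, and `TokenizerLookupSketch` are hypothetical stand-ins; the library's real `Tokenizer` interface and its default behaviour may differ. The value of `TOKENIZER_CLASS_KEY` is assumed to be "tokenizer.class".

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

// Hypothetical stand-in for the library's Tokenizer type.
interface SketchTokenizer {
    List<String> tokenize(String text);
}

// Hypothetical default implementation: splits on runs of whitespace.
class WhitespaceSketchTokenizer implements SketchTokenizer {
    @Override
    public List<String> tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }
}

class TokenizerLookupSketch {
    // Assumed value, taken from the method description on this page.
    static final String TOKENIZER_CLASS_KEY = "tokenizer.class";

    /** Instantiates the configured tokenizer class, or the default if none is set. */
    static SketchTokenizer getTokenizer(Properties conf) {
        String className = conf.getProperty(
                TOKENIZER_CLASS_KEY, WhitespaceSketchTokenizer.class.getName());
        try {
            // Load the class by name and require that it implements the interface.
            Class<? extends SketchTokenizer> clazz =
                    Class.forName(className).asSubclass(SketchTokenizer.class);
            // A no-argument constructor is assumed to exist.
            return clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Cannot instantiate tokenizer " + className, e);
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(TOKENIZER_CLASS_KEY, WhitespaceSketchTokenizer.class.getName());
        System.out.println(getTokenizer(conf).tokenize("to be or not to be"));
    }
}
```

Failing fast with an exception that names the misconfigured class is one possible design choice; how the real method reacts to a missing or invalid class name is not stated here.
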
public static long getNumberOfDocuments(org.apache.hadoop.mapreduce.Job finishedJob)
throws IOException
Parameters:
finishedJob - the job that has successfully finished.
Throws:
IOException
public static long getNumberOfTokens(org.apache.hadoop.mapreduce.Job finishedJob)
throws IOException
Parameters:
finishedJob - the job that has successfully finished.
Throws:
IOException
public static org.apache.hadoop.mapreduce.Job createJob(String in, String dictOut, String out, org.apache.hadoop.conf.Configuration conf)
throws IOException
Parameters:
in - the input path; multiple paths may be separated by commas.
dictOut - the output path of the dictionary.
out - the output directory.
conf - the configuration.
Throws:
IOException
Copyright © 2016. All rights reserved.