public class TfIdfCalculatorJob extends Object
WordCorpusFrequencyJob.

| Modifier and Type | Class and Description |
|---|---|
| static class | TfIdfCalculatorJob.DocumentVectorizerReducer: Calculates the sparse vector with TF-IDF. |
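
To make "sparse vector with TF-IDF" concrete, here is a minimal sketch in plain Java. It is not the code of DocumentVectorizerReducer: the class name TfIdfSketch, the map-based vector, and the weighting tf * ln(numberOfDocuments / documentFrequency) are assumptions chosen for illustration, and the real reducer may use a different formula or vector type.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; not part of the documented library.
final class TfIdfSketch {

    // Assumed weighting: tf * ln(numberOfDocuments / documentFrequency).
    static double tfIdf(long termFrequency, long documentFrequency, long numberOfDocuments) {
        if (termFrequency <= 0 || documentFrequency <= 0) {
            return 0.0;
        }
        return termFrequency * Math.log((double) numberOfDocuments / documentFrequency);
    }

    // Builds a sparse vector as (dimension index -> weight); zero weights are not stored.
    static Map<Integer, Double> sparseVector(Map<Integer, Long> termFrequencies,
                                             Map<Integer, Long> documentFrequencies,
                                             long numberOfDocuments) {
        Map<Integer, Double> vector = new TreeMap<>();
        for (Map.Entry<Integer, Long> entry : termFrequencies.entrySet()) {
            long documentFrequency = documentFrequencies.getOrDefault(entry.getKey(), 0L);
            double weight = tfIdf(entry.getValue(), documentFrequency, numberOfDocuments);
            if (weight != 0.0) {
                vector.put(entry.getKey(), weight);
            }
        }
        return vector;
    }
}
```

Under this assumed weighting, a token that occurs in every document gets weight zero and is left out of the vector, which is what keeps the representation sparse.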

| Modifier and Type | Field and Description |
|---|---|
| static String | NUMBER_OF_DOCUMENTS_KEY |
| static String | NUMBER_OF_TOKENS_KEY |
| static String | SPAM_DOCUMENT_PERCENTAGE_KEY |
| static String | WORD_COUNT_OUTPUT_KEY |

| Constructor and Description |
|---|
| TfIdfCalculatorJob() |

| Modifier and Type | Method and Description |
|---|---|
| static org.apache.hadoop.mapreduce.Job | createJob(String in, String out, org.apache.hadoop.conf.Configuration conf, long numberOfDocuments, long numberOfTokens): Creates a TF-IDF job. |
| static void | main(String[] args): Calculates TF-IDF vectors from text input in the format documentid \t corpus. |
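
The main method reads text input with one document per line, in the form documentid \t corpus. A minimal sketch of splitting such a line is shown below. The class InputLineSketch and its parse method are hypothetical helpers, not part of TfIdfCalculatorJob. Splitting on the first tab only and trimming surrounding whitespace are assumptions about how the format is meant to be read.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

// Hypothetical helper; the job's own parsing may differ.
final class InputLineSketch {

    // Splits "documentid \t corpus" at the first tab into (document id, corpus text).
    static Map.Entry<String, String> parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("expected documentid \\t corpus, got: " + line);
        }
        String documentId = line.substring(0, tab).trim();
        String corpus = line.substring(tab + 1).trim();
        return new SimpleImmutableEntry<>(documentId, corpus);
    }
}
```

Splitting at the first tab keeps any further tabs inside the corpus text intact. A line without a tab is rejected rather than silently treated as a document with an empty id.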
public static final String NUMBER_OF_DOCUMENTS_KEY
public static final String NUMBER_OF_TOKENS_KEY
public static final String SPAM_DOCUMENT_PERCENTAGE_KEY
public static final String WORD_COUNT_OUTPUT_KEY

public static void main(String[] args) throws Exception

Calculates TF-IDF vectors from text input in the following format:

documentid \t corpus

The output is a SequenceFile with Text as key and VectorWritable as value.

Throws: Exception

public static org.apache.hadoop.mapreduce.Job createJob(String in, String out, org.apache.hadoop.conf.Configuration conf, long numberOfDocuments, long numberOfTokens) throws IOException

Creates a TF-IDF job.

Parameters:
in - the input path, the output of the WordCorpusFrequencyJob.
out - the output directory.
conf - the configuration.
numberOfDocuments - the number of documents in the corpus per token (map input counter value of WordCorpusFrequencyJob).
numberOfTokens - the number of tokens in the corpus (reduce input group counter value of WordCorpusFrequencyJob).

Throws: IOException

Copyright © 2016. All rights reserved.
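
The two counts passed to createJob are described as counter values of WordCorpusFrequencyJob. The plain-Java sketch below shows one way to read them: with one document per input record, the map input count equals the number of documents, and if the job keys its map output by token, the reduce input group count equals the number of distinct tokens. Both readings are assumptions, as are the class CorpusCountsSketch and the whitespace tokenization. None of this is taken from the job's source.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration of the counts; not the MapReduce implementation.
final class CorpusCountsSketch {

    // Maps each token to the number of documents that contain it at least once.
    static Map<String, Long> documentFrequencies(List<String> documents) {
        Map<String, Long> frequencies = new TreeMap<>();
        for (String document : documents) {
            // Assumed tokenization: split on whitespace; count each token once per document.
            Set<String> distinct = new HashSet<>(Arrays.asList(document.trim().split("\\s+")));
            distinct.remove("");
            for (String token : distinct) {
                frequencies.merge(token, 1L, Long::sum);
            }
        }
        return frequencies;
    }
}
```

In this sketch, documents.size() plays the role of numberOfDocuments and the size of the returned map plays the role of numberOfTokens.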