public static class LatentDirichletAllocation.GibbsSample extends Object
The LatentDirichletAllocation.GibbsSample class
encapsulates all of the information related to a single Gibbs
sample for latent Dirichlet allocation (LDA). A sample
consists of the assignment of a topic identifier to each
token in the corpus. The other methods in this class return
values derived from the topic samples, the data being estimated,
and the LDA parameters such as priors.
Instances of
this class are created by the sampling method in the containing
class, LatentDirichletAllocation. For convenience, the
sample includes all of the data used to construct the sample,
as well as the hyperparameters used for sampling.
As described in the class documentation for the containing
class LatentDirichletAllocation, the primary content in
a Gibbs sample for LDA is the assignment of a single topic to
each token in the corpus. Cumulative counts for topics in
documents and words in topics as well as total counts are also
available; they do not entail any additional computation costs
as the sampler maintains them as part of the sample.
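The relationship between the topic assignments and the cumulative counts can be sketched as follows. This is a minimal illustration and not the library's implementation: the class name GibbsCountsSketch and the plain arrays standing in for the sample's internal state are assumptions made for the example. The sampler maintains these counts incrementally, whereas the sketch recomputes them from scratch.

```java
// Hypothetical stand-in for the counts a Gibbs sample exposes.
// words[doc][tok] is a word identifier; topics[doc][tok] is the topic
// assigned to that token (assumed layout, for illustration only).
class GibbsCountsSketch {

    // Number of tokens in each document assigned to each topic.
    static int[][] documentTopicCounts(short[][] topics, int numTopics) {
        int[][] counts = new int[topics.length][numTopics];
        for (int doc = 0; doc < topics.length; ++doc)
            for (int tok = 0; tok < topics[doc].length; ++tok)
                ++counts[doc][topics[doc][tok]];
        return counts;
    }

    // Number of tokens of each word assigned to each topic.
    static int[][] topicWordCounts(int[][] words, short[][] topics,
                                   int numTopics, int numWords) {
        int[][] counts = new int[numTopics][numWords];
        for (int doc = 0; doc < words.length; ++doc)
            for (int tok = 0; tok < words[doc].length; ++tok)
                ++counts[topics[doc][tok]][words[doc][tok]];
        return counts;
    }

    // Total number of tokens assigned to each topic (row sums).
    static int[] topicCounts(int[][] topicWordCounts) {
        int[] totals = new int[topicWordCounts.length];
        for (int topic = 0; topic < topicWordCounts.length; ++topic)
            for (int count : topicWordCounts[topic])
                totals[topic] += count;
        return totals;
    }

    public static void main(String[] args) {
        int[][] words = { {0, 1, 1}, {2, 0} };
        short[][] topics = { {0, 1, 1}, {1, 0} };
        int[][] docTopic = documentTopicCounts(topics, 2);
        int[][] topicWord = topicWordCounts(words, topics, 2, 3);
        System.out.println(java.util.Arrays.deepToString(docTopic));
        System.out.println(java.util.Arrays.deepToString(topicWord));
        System.out.println(java.util.Arrays.toString(topicCounts(topicWord)));
    }
}
```

In this toy corpus the arrays play the roles of documentTopicCount(int,int), topicWordCount(int,int), and topicCount(int) respectively.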
The sample also contains meta-information about the state of the sampling procedure: the epoch at which the sample was produced, and the number of topic assignments that changed between this sample and the previous sample. Note that "previous" means the previous sample in the chain, not necessarily the previous sample handled by the LDA handler; the handler only receives samples separated by the specified lag.
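As a rough illustration of the change count, the following sketch compares the topic assignments of two consecutive samples in a chain. The class name ChangedTopicsSketch and the array representation are assumptions for the example; the sampler itself may track this number differently.

```java
// Hypothetical helper: counts tokens whose topic differs between two
// consecutive samples. Both arrays are assumed to have the same shape,
// indexed as [doc][token].
class ChangedTopicsSketch {

    static int numChangedTopics(short[][] previous, short[][] current) {
        int changed = 0;
        for (int doc = 0; doc < current.length; ++doc)
            for (int tok = 0; tok < current[doc].length; ++tok)
                if (previous[doc][tok] != current[doc][tok])
                    ++changed;
        return changed;
    }

    public static void main(String[] args) {
        short[][] previous = { {0, 1, 1}, {1, 0} };
        short[][] current  = { {0, 0, 1}, {1, 1} };
        // Two token positions differ: [0][1] and [1][1].
        System.out.println(numChangedTopics(previous, current)); // prints 2
    }
}
```

A falling change count over successive epochs is often read as a loose sign that the chain is settling, although it is not a formal convergence test.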
The sample may be used to generate an LDA model. The resulting model may then be used for estimation on unseen documents. Typically, models derived from several samples are used for Bayesian computations, as described in the class documentation for LatentDirichletAllocation.
| Modifier and Type | Method and Description |
|---|---|
| double | corpusLog2Probability() Returns an estimate of the log (base 2) likelihood of the corpus given the point estimates of topic and document multinomials determined from this sample. |
| int | documentLength(int doc) Returns the length of the specified document in tokens. |
| int | documentTopicCount(int doc, int topic) Returns the number of times the specified topic was assigned to the specified document in this sample. |
| double | documentTopicPrior() Returns the uniform Dirichlet concentration hyperparameter α for document distributions over topics from which this sample was produced. |
| double | documentTopicProb(int doc, int topic) Returns the estimate of the probability of the topic being assigned to a word in the specified document given the topic assignments in this sample. |
| int | epoch() Returns the epoch in which this sample was generated. |
| LatentDirichletAllocation | lda() Returns a latent Dirichlet allocation model corresponding to this sample. |
| int | numChangedTopics() Returns the total number of topic assignments to tokens that changed between the last sample and this one. |
| int | numDocuments() Returns the number of documents on which the sample was based. |
| int | numTokens() Returns the number of tokens in documents on which the sample was based. |
| int | numTopics() Returns the number of topics for this sample. |
| int | numWords() Returns the number of distinct words in the documents on which the sample was based. |
| int | topicCount(int topic) Returns the total number of tokens assigned to the specified topic in this sample. |
| short | topicSample(int doc, int token) Returns the topic identifier sampled for the specified token position in the specified document. |
| int | topicWordCount(int topic, int word) Returns the number of times tokens for the specified word were assigned to the specified topic. |
| double | topicWordPrior() Returns the uniform Dirichlet concentration hyperparameter β for topic distributions over words from which this sample was produced. |
| double | topicWordProb(int topic, int word) Returns the probability estimate for the specified word in the specified topic in this sample. |
| int | word(int doc, int token) Returns the word identifier for the specified token position in the specified document. |
| int | wordCount(int word) Returns the number of times tokens of the specified word appeared in the corpus. |
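The two probability estimates in the summary combine counts from the sample with the Dirichlet priors. The sketch below assumes the usual additive smoothing form, (count + prior) / (total + dimensions × prior). That form is an assumption here, since the authoritative definition lives in the class documentation for LatentDirichletAllocation. The class name SmoothedEstimateSketch and its method parameters are hypothetical.

```java
// Hypothetical sketch of Dirichlet-smoothed point estimates built from
// sample counts. The formulas are an assumed standard form, not a
// confirmed copy of the library's computation.
class SmoothedEstimateSketch {

    // Estimate of p(topic | doc) from counts and the document-topic prior.
    static double documentTopicProb(int docTopicCount, int docLength,
                                    int numTopics, double alpha) {
        return (docTopicCount + alpha) / (docLength + numTopics * alpha);
    }

    // Estimate of p(word | topic) from counts and the topic-word prior.
    static double topicWordProb(int topicWordCount, int topicCount,
                                int numWords, double beta) {
        return (topicWordCount + beta) / (topicCount + numWords * beta);
    }

    public static void main(String[] args) {
        // A 3-token document with 2 tokens in topic 0 and 1 token in topic 1.
        double p0 = documentTopicProb(2, 3, 2, 0.5); // (2 + 0.5) / (3 + 1)
        double p1 = documentTopicProb(1, 3, 2, 0.5); // (1 + 0.5) / (3 + 1)
        System.out.println(p0 + " + " + p1 + " = " + (p0 + p1));
    }
}
```

Under this form the estimates for one document sum to one across topics, and a topic with a zero count still receives nonzero probability from the prior.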
public int epoch()
public int numDocuments()
public int numWords()
public int numTokens()
public int numTopics()
public short topicSample(int doc, int token)
Parameters:
doc - Identifier for a document.
token - Token position in the specified document.
Throws:
IndexOutOfBoundsException - If the document identifier is not between 0 (inclusive) and the number of documents (exclusive), or if the token is not between 0 (inclusive) and the number of tokens (exclusive) in the specified document.

public int word(int doc, int token)
Parameters:
doc - Identifier for a document.
token - Token position in the specified document.
Throws:
IndexOutOfBoundsException - If the document identifier is not between 0 (inclusive) and the number of documents (exclusive), or if the token is not between 0 (inclusive) and the number of tokens (exclusive) in the specified document.

public double documentTopicPrior()
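Returns the uniform Dirichlet concentration hyperparameter α for document distributions over topics from which this sample was produced.

public double topicWordPrior()
Returns the uniform Dirichlet concentration hyperparameter β for topic distributions over words from which this sample was produced.

public int documentTopicCount(int doc, int topic)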
Parameters:
doc - Identifier for a document.
topic - Identifier for a topic.
Throws:
IndexOutOfBoundsException - If the document identifier is not between 0 (inclusive) and the number of documents (exclusive), or if the topic identifier is not between 0 (inclusive) and the number of topics (exclusive).

public int documentLength(int doc)
Parameters:
doc - Identifier for a document.
Throws:
IndexOutOfBoundsException - If the document identifier is not between 0 (inclusive) and the number of documents (exclusive).

public int topicWordCount(int topic, int word)
Parameters:
topic - Identifier for a topic.
word - Identifier for a word.
Throws:
IndexOutOfBoundsException - If the specified topic is not between 0 (inclusive) and the number of topics (exclusive), or if the word is not between 0 (inclusive) and the number of words (exclusive).

public int topicCount(int topic)
Parameters:
topic - Identifier for a topic.
Throws:
IllegalArgumentException - If the specified topic is not between 0 (inclusive) and the number of topics (exclusive).

public int numChangedTopics()
public double topicWordProb(int topic, int word)
Returns the probability estimate for the specified word in the specified topic in this sample. The estimate is computed as described in the class documentation for LatentDirichletAllocation using the topic assignment counts in this sample and the topic-word prior.
Parameters:
topic - Identifier for a topic.
word - Identifier for a word.
Throws:
IndexOutOfBoundsException - If the specified topic is not between 0 (inclusive) and the number of topics (exclusive), or if the word is not between 0 (inclusive) and the number of words (exclusive).

public int wordCount(int word)
Parameters:
word - Identifier of a word.
Throws:
IndexOutOfBoundsException - If the word identifier is not between 0 (inclusive) and the number of words (exclusive).

public double documentTopicProb(int doc, int topic)
Returns the estimate of the probability of the topic being assigned to a word in the specified document given the topic assignments in this sample. The estimate is computed as described in the class documentation for LatentDirichletAllocation using the topic assignment counts in this sample and the document-topic prior.
Parameters:
doc - Identifier of a document.
topic - Identifier for a topic.
Throws:
IndexOutOfBoundsException - If the document identifier is not between 0 (inclusive) and the number of documents (exclusive), or if the topic identifier is not between 0 (inclusive) and the number of topics (exclusive).

public double corpusLog2Probability()
This likelihood calculation uses the methods documentTopicProb(int,int) and topicWordProb(int,int) to estimate likelihoods according to the following formula:
corpusLog2Probability() = Σ_{doc,i} log2 Σ_{topic} p(topic|doc) * p(word[doc][i]|topic)
Here the outer sum runs over every token position i in every document doc, and the inner sum runs over all topics. Note that this is not the complete corpus likelihood, which requires integrating over possible topic and document multinomials given the priors.
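The formula can be sketched directly in code. The class CorpusLog2Sketch and its array arguments are hypothetical stand-ins: docTopicProb[doc][topic] plays the role of documentTopicProb(int,int), and topicWordProb[topic][word] plays the role of topicWordProb(int,int). This is an illustration of the formula, not the library's implementation.

```java
// Hypothetical sketch of the point-estimate corpus log2 likelihood.
class CorpusLog2Sketch {

    static double corpusLog2Probability(int[][] words,
                                        double[][] docTopicProb,
                                        double[][] topicWordProb) {
        double total = 0.0;
        for (int doc = 0; doc < words.length; ++doc) {
            for (int i = 0; i < words[doc].length; ++i) {
                // Marginalize over topics for this token.
                double tokenProb = 0.0;
                for (int topic = 0; topic < topicWordProb.length; ++topic)
                    tokenProb += docTopicProb[doc][topic]
                        * topicWordProb[topic][words[doc][i]];
                // Convert the natural log to base 2.
                total += Math.log(tokenProb) / Math.log(2.0);
            }
        }
        return total;
    }

    public static void main(String[] args) {
        int[][] words = { {0, 1} };
        double[][] docTopicProb = { {0.5, 0.5} };
        double[][] topicWordProb = { {0.5, 0.5}, {0.5, 0.5} };
        // Each token has probability 0.5, so the result is close to -2.
        System.out.println(corpusLog2Probability(words, docTopicProb, topicWordProb));
    }
}
```

Summing logs per token rather than multiplying probabilities avoids floating-point underflow on corpora of realistic size.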
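public LatentDirichletAllocation lda()
Returns a latent Dirichlet allocation model corresponding to this sample. The topic-word probabilities are computed according to topicWordProb(int,int), and the document-topic prior is as specified in the call to LDA that produced this sample.

Copyright © 2016 Alias-i, Inc. All rights reserved.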