public class LatentDirichletAllocation extends Object implements Serializable
A LatentDirichletAllocation object represents a latent
Dirichlet allocation (LDA) model. LDA provides a Bayesian model of
document generation in which each document is generated by
a mixture of topical multinomials. An LDA model specifies the
number of topics, a Dirichlet prior over topic mixtures for a
document, and a discrete distribution over words for each topic.
A document is generated from an LDA model by first selecting a multinomial over topics given the Dirichlet prior. Then for each token in the document, a topic is generated from the document-specific topic distribution, and then a word is generated from the discrete distribution for that topic. Note that document length is not generated by this model; a fully generative model would need a way of generating lengths (e.g. a Poisson distribution) or terminating documents (e.g. a distinguished end-of-document symbol).
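The generative process just described can be sketched in plain Java. This is an illustrative sketch, not part of this class's API; the class and method names are hypothetical, and theta and phi stand for a fixed document-topic mixture and topic-word distributions. Note that the document length is supplied externally, matching the note above that LDA does not generate lengths.

```java
import java.util.Random;

// Illustrative sketch of LDA's generative process for a single document.
class LdaGenerativeSketch {

    // Sample an outcome from a discrete distribution given its probabilities.
    static int sampleDiscrete(double[] probs, Random random) {
        double u = random.nextDouble();
        double cumulative = 0.0;
        for (int k = 0; k < probs.length; ++k) {
            cumulative += probs[k];
            if (u < cumulative)
                return k;
        }
        return probs.length - 1; // guard against rounding error
    }

    // Generate a document of the given length: for each token, first draw a
    // topic from theta, then draw a word from that topic's distribution.
    static int[] generateDocument(double[] theta, double[][] phi,
                                  int length, Random random) {
        int[] words = new int[length];
        for (int tok = 0; tok < length; ++tok) {
            int topic = sampleDiscrete(theta, random);
            words[tok] = sampleDiscrete(phi[topic], random);
        }
        return words;
    }
}
```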
An LDA model may be estimated from an unlabeled training corpus (collection of documents) using a second Dirichlet prior, this time over word distributions in a topic. This class provides a static inference method that produces (collapsed Gibbs) samples from the posterior distribution of topic assignments to words, any one of which may be used to construct an LDA model instance.
An LDA model can be used to infer the topic mixture of unseen text documents, which can be used to compare documents by topical similarity. A fixed LDA model may also be used to estimate the likelihood of a word occurring in a document given the other words in the document. A collection of LDA models may be used for fully Bayesian reasoning at the corpus (collection of documents) level.
LDA may be applied to arbitrary multinomial data. To apply it
to text, a tokenizer factory converts a text document to a bag of
words and a symbol table converts these tokens to integer outcomes.
The static method tokenizeDocuments(CharSequence[],TokenizerFactory,SymbolTable,int)
is available to carry out the text to multinomial conversion with
pruning of low counts.
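The text-to-multinomial conversion can be sketched without LingPipe's TokenizerFactory and SymbolTable by using whitespace tokenization and a plain map as a stand-in symbol table. The class and method names below are hypothetical; real applications would use tokenizeDocuments with count pruning instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Minimal sketch of converting text documents to LDA's ragged int[][] form.
class BagOfWordsSketch {

    // Assign each distinct token a contiguous identifier starting from 0,
    // recording assignments in the supplied stand-in symbol table.
    static int[][] toWordIds(String[] texts, Map<String,Integer> symbolTable) {
        int[][] docWords = new int[texts.length][];
        for (int i = 0; i < texts.length; ++i) {
            List<Integer> ids = new ArrayList<>();
            for (String token : texts[i].toLowerCase().split("\\s+")) {
                if (token.isEmpty())
                    continue;
                Integer id = symbolTable.get(token);
                if (id == null) {
                    id = symbolTable.size();
                    symbolTable.put(token, id);
                }
                ids.add(id);
            }
            docWords[i] = new int[ids.size()];
            for (int j = 0; j < ids.size(); ++j)
                docWords[i][j] = ids.get(j);
        }
        return docWords;
    }
}
```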
LDA operates over a fixed vocabulary of discrete outcomes, which we call words for convenience, and represent as a set of integers:
words = { 0, 1, ..., numWords-1 }
A corpus is a ragged array of documents, each document consisting of an array of words:
int[][] words = { words[0], words[1], ..., words[numDocs-1] }
A given document words[i], i < numDocs,
is itself represented as a bag of words, each word being
represented as an integer:
int[] words[i] = { words[i][0], words[i][1], ..., words[i][words[i].length-1] }
The documents do not all need to be the same length, so the
two-dimensional array words is ragged.
A particular LDA model is defined over a fixed number of topics, also represented as integers:
topics = { 0, 1, ..., numTopics-1 }
For a given topic topic < numTopics, LDA specifies a
discrete distribution φ[topic] over words:
φ[topic][word] >= 0.0
Σword < numWords φ[topic][word] = 1.0
In an LDA model, each document in the corpus is generated from a
document-specific mixture of topics θ[doc]. The
distribution θ[doc] is a discrete distribution over
topics:
θ[doc][topic] >= 0.0
Σtopic < numTopics θ[doc][topic] = 1.0
Under LDA's generative model for documents, a document-specific
topic mixture θ[doc] is generated from a uniform
Dirichlet distribution with concentration parameter
α. The Dirichlet is the conjugate prior for the
multinomial in which α acts as a prior count
assigned to a topic in a document. Typically, LDA is run with a
fairly diffuse prior with concentration α <
1, leading to skewed posterior topic distributions.
Given a topic distribution θ[doc] for a
document, tokens are generated (conditionally) independently. For
each token in the document, a topic
topics[doc][token] is generated according to the
topic distribution θ[doc], then the word instance
words[doc][token] is
generated given the topic using the topic-specific distribution
over tokens φ[topics[doc][token]].
For estimation purposes, LDA places a uniform Dirichlet prior
with concentration parameter β on each of the
topic distributions φ[topic]. The first step in
modeling a corpus is to generate the topic distributions
φ from a Dirichlet parameterized by
β.
In sampling notation, the LDA generative model is expressed as follows:
φ[topic] ~ Dirichlet(β)
θ[doc] ~ Dirichlet(α)
topics[doc][token] ~ Discrete(θ[doc])
words[doc][token] ~ Discrete(φ[topics[doc][token]])
A generative Bayesian model is based on the joint probability
of all observed and unobserved variables, given the priors. Given a
text corpus, only the words are observed. The unobserved variables
include the assignment of a topic to each word, the assignment of a
topic distribution to each document, and the assignment of word
distributions to each topic. We let
topic[doc][token], doc < numDocs, token <
docLength[doc], be the topic assigned to the specified token
in the specified document. This leaves two Dirichlet priors, one
parameterized by α for topics in documents, and
one parameterized by β for words in topics.
These priors are treated as hyperparameters for purposes of
estimation; cross-validation may be used to provide so-called empirical
Bayes estimates for the priors.
The full LDA probability model for a corpus follows the generative process outlined above. First, topic distributions are chosen at the corpus level for each topic given their Dirichlet prior, and then the remaining variables are generated given these topic distributions:
p(words, topics, θ, φ | α, β)
= Πtopic < numTopics
p(φ[topic] | β)
* p(words, topics, θ | α, φ)
Note that because the multinomial parameters for
φ[topic] are continuous,
p(φ[topic] | β) represents a density, not a discrete
distribution. Thus it does not make sense to talk about the
probability of a given multinomial φ[topic]; non-zero
results only arise from integrals over measurable subsets of the
multinomial simplex. It is possible to sample from a density, so
the generative model is well founded.
A document is generated by first generating its topic distribution given the Dirichlet prior, and then generating its topics and words:
p(words, topics, θ | α, φ)
= Πdoc < numDocs
p(θ[doc] | α)
* p(words[doc], topics[doc] | θ[doc], φ)
The topic and word for each token are generated from the
document's topic distribution θ[doc] and the
topic-word distributions φ using the chain rule,
first generating the topic given the document's topic distribution
and then generating the word given the topic's word distribution.
p(words[doc], topics[doc] | θ[doc], φ)
= Πtoken < words[doc].length
p(topics[doc][token] | θ[doc])
* p(words[doc][token] | φ[topics[doc][token]])
Given the topic and document multinomials, this distribution is discrete, and thus may be evaluated. It may also be marginalized by summation:
p(words[doc] | θ[doc], φ)
= Π token < words[doc].length
Σtopic < numTopics p(topic | θ[doc]) * p(words[doc][token] | φ[topic])
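This marginalized document probability is directly computable from the model parameters. Below is a minimal sketch (the class name is hypothetical), using plain arrays for θ[doc] and φ:

```java
// Sketch of evaluating p(words[doc] | theta, phi) by marginalizing the
// topic assignment for each token, as in the product-of-sums formula above.
class DocLikelihoodSketch {

    static double docProbability(int[] doc, double[] theta, double[][] phi) {
        double prob = 1.0;
        for (int word : doc) {
            // Sum over topics: p(topic | theta) * p(word | phi[topic]).
            double tokenProb = 0.0;
            for (int topic = 0; topic < theta.length; ++topic)
                tokenProb += theta[topic] * phi[topic][word];
            prob *= tokenProb;
        }
        return prob;
    }
}
```

In practice, the product is usually accumulated in log space to avoid underflow on long documents.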
Conditional probabilities are computed in the usual way by marginalizing out other variables through integration. Unfortunately, this simple mathematical operation is often computationally intractable.
This class uses a collapsed form of Gibbs sampling over the posterior distribution of topic assignments given the documents and Dirichlet priors:
p(topics | words, α, β)
This distribution may be derived from the joint distribution by marginalizing (also known as "collapsing" or "integrating out") the contributions of the document-topic and topic-word distributions.
The Gibbs sampler used to estimate LDA models produces samples that consist of a topic assignment to every token in the corpus. The conjugacy of the Dirichlet prior for multinomials makes the sampling straightforward.
An initial sample is produced by randomly assigning topics to tokens. Then, the sampler works iteratively through the corpus, one token at a time. At each token, it samples a new topic assignment to that token given all the topic assignments to other tokens in the corpus:
p(topics[doc][token] | words, topics')
The notation topics' represents the set of topic
assignments other than the one to topics[doc][token]. This
collapsed posterior conditional is estimated directly:
p(topics[doc][token] = topic | words, topics')
    = (count'(doc,topic) + α) / (docLength[doc] - 1 + numTopics*α)
    * (count'(topic,word) + β) / (count'(topic) + numWords*β)
The counts are all defined relative to topics'; that
is, the current topic assignment for the token being sampled is not
considered in the counts. Note that the two factors are estimates
of θ[doc] and φ[topic] based on all
data other than the assignment to the current token. Note how the
prior concentrations arise as additive smoothing factors in these
estimates, a result of the Dirichlet's conjugacy to the
multinomial. For the purposes of sampling, the document-length
normalization in the denominator of the first factor is not necessary,
as it remains constant across topics.
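The unnormalized sampling weights for a single token follow directly from this formula. In this hypothetical sketch, the primed count arrays are assumed to already exclude the token being resampled, and the constant document-length denominator is dropped, as noted above:

```java
// Sketch of the unnormalized collapsed Gibbs weights for one token.
class GibbsWeightSketch {

    static double[] topicWeights(int word,
                                 int[] docTopicCount,     // count'(doc,topic)
                                 int[][] topicWordCount,  // count'(topic,word)
                                 int[] topicCount,        // count'(topic)
                                 int numWords,
                                 double alpha, double beta) {
        int numTopics = docTopicCount.length;
        double[] weights = new double[numTopics];
        for (int topic = 0; topic < numTopics; ++topic)
            weights[topic]
                = (docTopicCount[topic] + alpha)           // smoothed theta
                * (topicWordCount[topic][word] + beta)     // smoothed phi
                / (topicCount[topic] + numWords * beta);
        return weights;
    }
}
```

A sampler would then draw a topic proportionally to these weights, normalizing by their sum.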
The posterior Dirichlet distributions may be computed using just the counts. For instance, the posterior distribution for topics in documents is estimated as:
p(θ[doc] | α, words) = Dirichlet(count(doc,0)+α, count(doc,1)+α, ..., count(doc,numTopics-1)+α)
The sampling distribution is defined from the maximum a posteriori (MAP) estimates of the multinomial distribution over topics in a document:
θ*[doc] = ARGMAX θ[doc] p(θ[doc] | α, β, words)
which we know from the Dirichlet distribution is:
θ*[doc][topic] = (count(doc,topic) + α) / (docLength[doc] + numTopics*α)
By the same reasoning, the MAP word distribution in topics is:
φ*[topic][word] = (count(topic,word) + β) / (count(topic) + numWords*β)
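Both MAP estimates are simple smoothed ratios of counts. A sketch with hypothetical names, computing one entry of each estimate:

```java
// Sketch of the MAP estimates theta*[doc][topic] and phi*[topic][word],
// showing how the prior concentrations act as additive smoothing.
class MapEstimateSketch {

    // (count(doc,topic) + alpha) / (docLength[doc] + numTopics*alpha)
    static double thetaStar(int docTopicCount, int docLength,
                            int numTopics, double alpha) {
        return (docTopicCount + alpha) / (docLength + numTopics * alpha);
    }

    // (count(topic,word) + beta) / (count(topic) + numWords*beta)
    static double phiStar(int topicWordCount, int topicCount,
                          int numWords, double beta) {
        return (topicWordCount + beta) / (topicCount + numWords * beta);
    }
}
```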
A complete Gibbs sample is represented as an instance of LatentDirichletAllocation.GibbsSample, which provides access to
the topic assignment to every token, as well as methods to compute
θ* and φ*
as defined above. A sample also maintains the original priors and
word counts. Just the estimates of the topic-word distributions
φ[topic] and the prior topic concentration
α are sufficient to define an LDA model. Note
that the imputed values of θ*[doc]
used during estimation are part of a sample, but are not part of
the LDA model itself. The LDA model contains enough information to
estimate θ* for an arbitrary
document, as described in the next section.
The Gibbs sampling algorithm starts with a random assignment of topics to words, then simply iterates through the tokens in turn, sampling topics according to the distribution defined above. After each run through the entire corpus, a callback is made to a handler for the samples. This setup may be configured for an initial burnin period, essentially just discarding the first batch of samples. Then it may be configured to sample only periodically thereafter to avoid correlations between samples.
An LDA model consists of a topic distribution Dirichlet prior
α and a word distribution
φ[topic] for each topic. Given an LDA model and a
new document words = { words[0], ..., words[length-1] }
consisting of a sequence of words, the posterior distribution over
topic weights is given by:
p(θ | words, α, φ)
Although this distribution is not solvable analytically, it is easy to estimate using a simplified form of the LDA estimator's Gibbs sampler. The conditional distribution of a topic assignment topics[token] to a single token given an assignment topics' to all other tokens is given by:
p(topics[token] | topics', words, α, φ)
    ∝ p(topics[token], words[token] | topics', α, φ)
    = p(topics[token] | topics', α) * p(words[token] | φ[topics[token]])
    = (count(topics[token]) + α) / (words.length - 1 + numTopics*α) * p(words[token] | φ[topics[token]])
This leads to a straightforward sampler over posterior topic assignments, from which we may directly compute the Dirichlet posterior over topic distributions or a MAP topic distribution.
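This simplified sampler is easy to sketch for a single document given fixed topic-word distributions φ. The sketch below is illustrative, not this class's implementation; it resamples each token's topic from the conditional above and returns the MAP estimate θ* (note the constant denominator is dropped from the sampling weights, since it does not vary across topics):

```java
import java.util.Random;

// Illustrative single-document Gibbs sampler for a fixed LDA model.
class NewDocInferenceSketch {

    static double[] mapTopicEstimate(int[] words, double[][] phi,
                                     double alpha, int numIterations,
                                     Random random) {
        int numTopics = phi.length;
        int[] topics = new int[words.length];
        int[] topicCount = new int[numTopics];
        // Initialize with a random topic assignment per token.
        for (int tok = 0; tok < words.length; ++tok) {
            topics[tok] = random.nextInt(numTopics);
            ++topicCount[topics[tok]];
        }
        double[] weights = new double[numTopics];
        for (int it = 0; it < numIterations; ++it) {
            for (int tok = 0; tok < words.length; ++tok) {
                --topicCount[topics[tok]]; // exclude the current assignment
                double sum = 0.0;
                for (int topic = 0; topic < numTopics; ++topic) {
                    weights[topic] = (topicCount[topic] + alpha)
                        * phi[topic][words[tok]];
                    sum += weights[topic];
                }
                // Draw a topic proportionally to the weights.
                double u = random.nextDouble() * sum;
                int topic = 0;
                double cum = weights[0];
                while (cum < u && topic + 1 < numTopics) {
                    ++topic;
                    cum += weights[topic];
                }
                topics[tok] = topic;
                ++topicCount[topic];
            }
        }
        // MAP estimate theta* from the final topic counts.
        double[] thetaStar = new double[numTopics];
        for (int topic = 0; topic < numTopics; ++topic)
            thetaStar[topic] = (topicCount[topic] + alpha)
                / (words.length + numTopics * alpha);
        return thetaStar;
    }
}
```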
This class provides a method to sample these topic assignments,
which may then be used to form Dirichlet distributions or MAP point
estimates of θ* for the document
words.
An LDA model may be used to estimate the likelihood of a word given a previous bag of words:
p(word | words, α, φ) = ∫ p(word | θ, φ) p(θ | words, α, φ) dθ
This integral is easily evaluated by sampling over the topic distributions p(θ | words, α, φ) and averaging the word probability determined by each sample. The word probability for a sample θ is defined by:
p(word | θ, φ) = Σtopic < numTopics p(topic | θ) * p(word | φ[topic])
Although this approach could theoretically be applied to compute the probability of a document one word at a time, the cost would be prohibitive: quadratically many samples are required, because the samples for the n-th word consist of topic assignments to the previous n-1 words.
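The Monte Carlo average is straightforward given posterior samples of θ. A hypothetical sketch:

```java
// Sketch of estimating p(word | words, alpha, phi) by averaging the
// per-sample word probability over posterior samples of theta.
class WordLikelihoodSketch {

    // p(word | theta, phi) = sum over topics of p(topic|theta) * p(word|phi[topic])
    static double wordProbability(int word, double[] theta, double[][] phi) {
        double prob = 0.0;
        for (int topic = 0; topic < theta.length; ++topic)
            prob += theta[topic] * phi[topic][word];
        return prob;
    }

    // Monte Carlo average over theta samples drawn from the posterior.
    static double averageWordProbability(int word, double[][] thetaSamples,
                                         double[][] phi) {
        double sum = 0.0;
        for (double[] theta : thetaSamples)
            sum += wordProbability(word, theta, phi);
        return sum / thetaSamples.length;
    }
}
```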
An LDA model may be used for a variety of statistical calculations. For instance, it may be used to determine the distribution of topics to words, and using these distributions, may determine word similarity. Similarly, document similarity may be determined by the topic distributions in a document.
Point estimates are derived using a single LDA model. For Bayesian calculation, multiple samples are taken to produce multiple LDA models. The results of a calculation over these models are then averaged to produce a Bayesian estimate of the quantity of interest. This sampling methodology effectively computes the integral over the posterior numerically.
Bayesian calculations over multiple samples are complicated by the exchangeability of topics in the LDA model. In particular, there is no guarantee that topics are the same between samples, thus it is not acceptable to combine samples in topic-level reasoning. For instance, it does not make sense to estimate the probability of a topic in a document using multiple samples.
LDA has also been applied to collaborative filtering. Movies act as words, with users modeled as documents, the bag of words they've seen. Given an LDA model and a user's films, the user's topic distribution may be inferred and used to estimate the likelihood of seeing unseen films.
| Modifier and Type | Class and Description |
|---|---|
static class |
LatentDirichletAllocation.GibbsSample
The
LatentDirichletAllocation.GibbsSample class
encapsulates all of the information related to a single Gibbs
sample for latent Dirichlet allocation (LDA). |
| Constructor and Description |
|---|
LatentDirichletAllocation(double docTopicPrior,
double[][] topicWordProbs)
Construct a latent Dirichlet allocation (LDA) model using the
specified document-topic prior and topic-word distributions.
|
| Modifier and Type | Method and Description |
|---|---|
double[] |
bayesTopicEstimate(int[] tokens,
int numSamples,
int burnin,
int sampleLag,
Random random)
Return the Bayesian point estimate of the topic distribution
for a document consisting of the specified tokens, using Gibbs
sampling with the specified parameters.
|
double |
documentTopicPrior()
Returns the concentration value of the uniform Dirichlet prior over
topic distributions for documents.
|
static Iterator<LatentDirichletAllocation.GibbsSample> |
gibbsSample(int[][] docWords,
short numTopics,
double docTopicPrior,
double topicWordPrior,
Random random)
Return an iterator over Gibbs samples for the specified
document-word corpus, number of topics, priors and randomizer.
|
static LatentDirichletAllocation.GibbsSample |
gibbsSampler(int[][] docWords,
short numTopics,
double docTopicPrior,
double topicWordPrior,
int burninEpochs,
int sampleLag,
int numSamples,
Random random,
ObjectHandler<LatentDirichletAllocation.GibbsSample> handler)
Run Gibbs sampling for the specified multinomial data, number
of topics, priors, search parameters, randomization and
callback sample handler.
|
int |
numTopics()
Returns the number of topics in this LDA model.
|
int |
numWords()
Returns the number of words on which this LDA model
is based.
|
short[][] |
sampleTopics(int[] tokens,
int numSamples,
int burnin,
int sampleLag,
Random random)
Returns the specified number of Gibbs samples of topics for the
specified tokens using the specified number of burnin epochs,
the specified lag between samples, and the specified
randomizer.
|
static int[] |
tokenizeDocument(CharSequence text,
TokenizerFactory tokenizerFactory,
SymbolTable symbolTable)
Tokenizes the specified text document using the specified tokenizer
factory, returning only tokens that exist in the symbol table.
|
static int[][] |
tokenizeDocuments(CharSequence[] texts,
TokenizerFactory tokenizerFactory,
SymbolTable symbolTable,
int minCount)
Tokenize an array of text documents represented as character
sequences into a form usable by LDA, using the specified
tokenizer factory and symbol table.
|
double[] |
wordProbabilities(int topic)
Returns an array representing the probabilities of words in the
specified topic.
|
double |
wordProbability(int topic,
int word)
Returns the probability of the specified word in the specified
topic.
|
public LatentDirichletAllocation(double docTopicPrior,
double[][] topicWordProbs)
The topic-word probability array topicWordProbs
represents a collection of discrete distributions
topicWordProbs[topic] for topics, and thus must
satisfy:
topicWordProbs[topic][word] >= 0.0
Σword < numWords topicWordProbs[topic][word] = 1.0
Warning: These requirements are not checked by the constructor.
See the class documentation above for an explanation of the parameters and what can be done with a model.
Parameters:
docTopicPrior - The document-topic prior.
topicWordProbs - Array of discrete distributions over words, indexed by topic.
Throws:
IllegalArgumentException - If the document-topic prior is not finite and positive, or if the topic-word probability arrays are not all the same length with entries between 0.0 and 1.0 inclusive.
public int numTopics()
public int numWords()
public double documentTopicPrior()
public double wordProbability(int topic,
int word)
Parameters:
topic - Topic identifier.
word - Word identifier.
public double[] wordProbabilities(int topic)
The returned result is a copy of the underlying data in the model so that changing it will not change the model.
Parameters:
topic - Topic identifier.
public short[][] sampleTopics(int[] tokens,
int numSamples,
int burnin,
int sampleLag,
Random random)
See the class documentation for more information on how the samples are computed.
Parameters:
tokens - The tokens making up the document.
numSamples - Number of Gibbs samples to return.
burnin - The number of samples to take and throw away during the burnin period.
sampleLag - The interval between samples after burnin.
random - The random number generator to use for this sampling process.
Throws:
IndexOutOfBoundsException - If there are tokens whose value is less than zero, or whose value is greater than or equal to the number of words in this model.
IllegalArgumentException - If the number of samples is not positive, the sample lag is not positive, or the burnin period is negative.
public double[] bayesTopicEstimate(int[] tokens,
int numSamples,
int burnin,
int sampleLag,
Random random)
See the method sampleTopics(int[],int,int,int,Random)
and the class documentation for more information on the sampling
procedure.
Parameters:
tokens - The tokens making up the document.
numSamples - Number of Gibbs samples to return.
burnin - The number of samples to take and throw away during the burnin period.
sampleLag - The interval between samples after burnin.
random - The random number generator to use for this sampling process.
Throws:
IndexOutOfBoundsException - If there are tokens whose value is less than zero, or whose value is greater than or equal to the number of words in this model.
IllegalArgumentException - If the number of samples is not positive, the sample lag is not positive, or the burnin period is negative.
public static LatentDirichletAllocation.GibbsSample gibbsSampler(int[][] docWords, short numTopics, double docTopicPrior, double topicWordPrior, int burninEpochs, int sampleLag, int numSamples, Random random, ObjectHandler<LatentDirichletAllocation.GibbsSample> handler)
Runs Gibbs sampling for LDA, producing instances of
LatentDirichletAllocation.GibbsSample. This method will return
the final sample and also send intermediate samples to an
optional handler.
The class documentation above explains Gibbs sampling for LDA as used in this method.
The primary input is an array of documents, where each document is represented as an array of integers representing the tokens that appear in it. These tokens should be numbered contiguously from 0 for space efficiency. The topic assignments in the Gibbs sample are aligned as parallel arrays to the array of documents.
The next three parameters are the hyperparameters of the model, specifically the number of topics, the prior count assigned to topics in a document, and the prior count assigned to words in topics. A rule of thumb for the document-topic prior is to set it to 5 divided by the number of topics (or less if there are very few topics; 0.1 is typically the maximum value used). A good general value for the topic-word prior is 0.01. Both of these priors will be diffuse and tend to lead to skewed posterior distributions.
The following three parameters specify how the sampling is
to be done. First, the sampler is "burned in" for a
number of epochs specified by the burnin parameter. After burn
in, samples are taken after fixed numbers of documents to avoid
correlation in the samples; the sampling frequency is specified
by the sample lag. Finally, the number of samples to be taken
is specified. For instance, if the burnin is 1000, the sample
lag is 250, and the number of samples is 5, then samples are
taken after 1000, 1250, 1500, 1750 and 2000 epochs. If a
non-null handler object is specified in the method call, its
handle(GibbsSample) method is called with each of the
samples produced as above.
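The epoch schedule described above can be made concrete with a small helper (the class and method names are hypothetical, not part of this API):

```java
// Sketch of the Gibbs sampling epoch schedule: burn in for a fixed number
// of epochs, then take samples separated by the sample lag.
class SampleScheduleSketch {

    // Returns the epoch numbers at which samples are taken.
    static int[] sampleEpochs(int burnin, int sampleLag, int numSamples) {
        int[] epochs = new int[numSamples];
        for (int i = 0; i < numSamples; ++i)
            epochs[i] = burnin + i * sampleLag;
        return epochs;
    }
}
```

For burnin 1000, sample lag 250, and 5 samples, this yields epochs 1000, 1250, 1500, 1750 and 2000, matching the example above.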
The final sample in the chain of samples is returned as the result. Note that this sample will also have been passed to the specified handler as the last sample for the handler.
A random number generator must be supplied as an argument.
This may just be a new instance of Random or
a custom extension. It is used for all randomization in this
method.
Parameters:
docWords - Corpus of documents to be processed.
numTopics - Number of latent topics to generate.
docTopicPrior - Prior count of topics in a document.
topicWordPrior - Prior count of words in a topic.
burninEpochs - Number of epochs to run before taking a sample.
sampleLag - Frequency between samples.
numSamples - Number of samples to take before exiting.
random - Random number generator.
handler - Handler to which the samples are sent.
public static Iterator<LatentDirichletAllocation.GibbsSample> gibbsSample(int[][] docWords, short numTopics, double docTopicPrior, double topicWordPrior, Random random)
Returns an iterator over Gibbs samples produced in the same way as
gibbsSampler(int[][],short,double,double,int,int,int,Random,ObjectHandler).
See that method and the class documentation for more details.
Parameters:
docWords - Corpus of documents to be processed.
numTopics - Number of latent topics to generate.
docTopicPrior - Prior count of topics in a document.
topicWordPrior - Prior count of words in a topic.
random - Random number generator.
public static int[][] tokenizeDocuments(CharSequence[] texts, TokenizerFactory tokenizerFactory, SymbolTable symbolTable, int minCount)
Warning: With some tokenizer factories and/or minimum count thresholds, there may be documents with no tokens in them.
Parameters:
texts - The text corpus.
tokenizerFactory - A tokenizer factory for tokenizing the texts.
symbolTable - Symbol table used to convert tokens to identifiers.
minCount - Minimum count for a token to be included in a document's representation.
public static int[] tokenizeDocument(CharSequence text, TokenizerFactory tokenizerFactory, SymbolTable symbolTable)
Parameters:
text - Character sequence to tokenize.
tokenizerFactory - Tokenizer factory for tokenization.
symbolTable - Symbol table to use for converting tokens to symbols.
Copyright © 2016 Alias-i, Inc. All rights reserved.