public class LogisticRegression extends Object implements Compilable, Serializable
A LogisticRegression instance is a multi-class vector
classifier model generating conditional probability estimates of
categories. This class also provides static factory methods for
estimating multinomial regression models using stochastic gradient
descent (SGD) to find maximum likelihood or maximum a posteriori
(MAP) estimates with Laplace, Gaussian, or Cauchy priors on the
coefficients.
The classification package contains a class LogisticRegressionClassifier which adapts this
class's models and estimators to act as generic classifiers given
an instance of FeatureExtractor.
Also Known As (AKA)
Multinomial logistic regression is also known as polytomous, polychotomous, or multi-class logistic regression, or just multilogit regression.

Binary logistic regression is an instance of a generalized linear model (GLM) with the logit link function. The logit function is the inverse of the logistic function, and the logistic function is sometimes called the sigmoid function or the s-curve.

Logistic regression estimation obeys the maximum entropy principle, and thus logistic regression is sometimes called "maximum entropy modeling", and the resulting classifier the "maximum entropy classifier". The generalization of binomial logistic regression to multinomial logistic regression is sometimes called a softmax or exponential model.

Maximum a posteriori (MAP) estimation with Gaussian priors is often referred to as "ridge regression"; with Laplace priors, MAP estimation is known as the "lasso". MAP estimation with Gaussian, Laplace, or Cauchy priors is known as parameter shrinkage. Gaussian and Laplace priors yield forms of regularized regression, with the Gaussian version being regularized with the L2 norm (Euclidean distance, called the Frobenius norm for matrices of parameters) and the Laplace version being regularized with the L1 norm (taxicab distance or Manhattan metric); other Minkowski metrics may be used for shrinkage.

Binary logistic regression is equivalent to a one-layer, single-output neural network with a logistic activation function trained under log loss. This is sometimes called classification with a single neuron.
The method numInputDimensions() returns the number of dimensions
(features) in the model. Because the model is well-behaved under
sparse vectors, the dimensionality may be returned as
Integer.MAX_VALUE, a common choice for sparse vectors.
A logistic regression model also fixes the number of output
categories. The method numOutcomes() returns the number
of categories. These outcome categories will be represented as
integers from 0 to numOutcomes()-1
inclusive.
A model is parameterized by a real-valued vector for every
category other than the last, each of which must be of the same
dimensionality as the model's input feature dimensionality. The
constructor LogisticRegression(Vector[]) takes an array of
Vector objects, which may be dense or sparse, but must all
be of the same dimensionality.
The likelihood of a given output category c <
numOutcomes() given an input vector x of
dimensionality numInputDimensions() is given by:

p(c | x, β) = exp(βc * x) / Z(x)   if c < numOutcomes()-1
p(c | x, β) = 1 / Z(x)             if c = numOutcomes()-1

where βc * x is the vector dot (or inner) product:

βc * x = Σi < numInputDimensions() βc,i * xi

and where the normalizing denominator, called the partition function, is:

Z(x) = 1 + Σk < numOutcomes()-1 exp(βk * x)
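As a concrete illustration, the conditional probabilities defined above can be computed from the k-1 weight vectors as follows. This is a self-contained sketch using plain double arrays rather than this class's Vector type; all names are illustrative, not part of the API.

```java
public class SoftmaxSketch {
    // Conditional probabilities p(c | x, beta) for a model with
    // betas.length + 1 outcomes; the last outcome's weight vector
    // is implicitly zero, so its unnormalized score is exp(0) = 1.
    static double[] classify(double[][] betas, double[] x) {
        int numOutcomes = betas.length + 1;
        double[] ys = new double[numOutcomes];
        double z = 1.0; // partition function starts with exp(0) for last outcome
        for (int c = 0; c < betas.length; ++c) {
            double dot = 0.0;
            for (int i = 0; i < x.length; ++i)
                dot += betas[c][i] * x[i];
            ys[c] = Math.exp(dot);
            z += ys[c];
        }
        ys[numOutcomes - 1] = 1.0; // exp(0) for the reference outcome
        for (int c = 0; c < numOutcomes; ++c)
            ys[c] /= z;            // normalize so probabilities sum to 1
        return ys;
    }

    public static void main(String[] args) {
        double[][] betas = { { 1.0, -1.0 } };  // one weight vector: 2 outcomes
        double[] ys = classify(betas, new double[] { 2.0, 1.0 });
        System.out.println(ys[0] + " " + ys[1]); // p(0|x) = exp(1)/(1+exp(1))
    }
}
```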
The error function is defined over a data set D of training
pairs (x,c') and a prior, which
must be an instance of RegressionPrior. The error function
is just the negative log likelihood and log prior:

Err(D,β) = -( log2 p(β|σ²) + Σ{(x,c') in D} log2 p(c'|x,β) )

where p(β|σ²) is the likelihood of the parameters
β under the prior, and p(c'|x,β) is
the probability of category c' given input x
and parameters β.
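The log (base 2) likelihood term of the error can be computed directly from the model's conditional probability estimates. A minimal sketch with plain arrays, where the probs argument stands in for per-case conditional probability estimates; names are illustrative:

```java
public class Log2LikelihoodSketch {
    // Sum of log (base 2) conditional probabilities of the reference
    // categories; negating this gives the likelihood part of the error.
    static double log2Likelihood(double[][] probs, int[] cs) {
        double total = 0.0;
        for (int t = 0; t < cs.length; ++t)
            total += Math.log(probs[t][cs[t]]) / Math.log(2.0);
        return total;
    }

    public static void main(String[] args) {
        // One case whose reference category has probability 0.5:
        double ll = log2Likelihood(new double[][] { { 0.5, 0.5 } }, new int[] { 0 });
        System.out.println(ll); // log2(0.5) = -1
    }
}
```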
The maximum a posteriori estimate is such that the gradient (the vector of partial derivatives with respect to the parameters) is zero. If the data is not linearly separable, a maximum likelihood solution must exist. If the data is not linearly separable and no data dimensions are colinear, the solution will be unique. If there is an informative Cauchy, Gaussian, or Laplace prior, there will be a unique MAP solution even in the face of linear separability or colinear dimensions. Proofs that a solution exists require showing that the matrix of second partial derivatives of the error with respect to pairs of parameters is positive semi-definite; if it is positive definite, the error is strictly convex and the MAP solution is unique.
The gradient
for parameter vector βc for
outcome c < numOutcomes()-1 is:

grad(Err(D,βc))
= ∂Err(D,β) / ∂βc
= ∂(- log p(β|σ²)) / ∂βc
  + ∂(- Σ{(x,c') in D} log p(c' | x, β)) / ∂βc

where the gradients of the priors are described in the
class documentation for RegressionPrior, and the
gradient of the likelihood function is:

∂(- Σ{(x,c') in D} log p(c' | x, β)) / ∂βc
= - Σ{(x,c') in D} ∂ log p(c' | x, β) / ∂βc
= Σ{(x,c') in D} x * (p(c | x, β) - I(c = c'))

where the indicator function I(c=c') is equal to 1 if
c=c' and equal to 0 otherwise.
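For a single training case, the contribution x * (p(c | x, β) - I(c = c')) can be computed as follows (an illustrative sketch with plain arrays, not part of this class's API):

```java
public class GradientSketch {
    // Gradient contribution of one training case (x, cRef) to the error
    // gradient for parameter vector beta_c: x * (p(c | x, beta) - I(c == cRef)).
    // probs holds the model's conditional probabilities for this case.
    static double[] caseGradient(double[] x, int cRef, double[] probs, int c) {
        double residual = probs[c] - (c == cRef ? 1.0 : 0.0);
        double[] grad = new double[x.length];
        for (int i = 0; i < x.length; ++i)
            grad[i] = x[i] * residual;
        return grad;
    }

    public static void main(String[] args) {
        // Reference category 0 with p(0|x) = 0.75: residual is -0.25.
        double[] g = caseGradient(new double[] { 1.0, 2.0 }, 0,
                                  new double[] { 0.75, 0.25 }, 0);
        System.out.println(g[0] + " " + g[1]); // -0.25 -0.5
    }
}
```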
It is conventional to add an input dimension whose value is always
1, which makes the corresponding
parameters βc,0 intercepts. The priors
allow the intercept to be given an uninformative prior even if the
other dimensions have informative priors.
Variance normalization can be achieved by setting the variance prior parameter independently for each dimension.
Non-linear features may be derived from a raw feature
i: in addition to the raw value
xi, another feature j may be
introduced with value xi².
Similarly, interaction terms are often added for pairs of features
xi and xj, with a new
feature xk defined as the product
xi * xj.
The resulting model is linear in the derived features, but
will no longer be linear in the original features.
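For example, squared and pairwise interaction features might be derived from a raw feature vector as follows (a hypothetical helper, not part of this class):

```java
import java.util.Arrays;

public class FeatureExpansion {
    // Append x_i^2 for each raw feature and x_i * x_j for each pair i < j.
    static double[] expand(double[] x) {
        int n = x.length;
        double[] out = new double[n + n + n * (n - 1) / 2];
        int k = 0;
        for (int i = 0; i < n; ++i) out[k++] = x[i];        // raw features
        for (int i = 0; i < n; ++i) out[k++] = x[i] * x[i]; // squares
        for (int i = 0; i < n; ++i)                         // interactions
            for (int j = i + 1; j < n; ++j)
                out[k++] = x[i] * x[j];
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(expand(new double[] { 2.0, 3.0 })));
        // [2.0, 3.0, 4.0, 9.0, 6.0]
    }
}
```

A model trained on the expanded vectors remains linear in the derived features while capturing quadratic structure in the originals.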
Stochastic Gradient Descent

This class estimates logistic regression models using stochastic
gradient descent (SGD). The SGD method runs through the data one
or more times, considering one training case at a time, and adjusts
the parameters along some multiple of the contribution to the gradient
of the error for that case.

With informative priors, the error function
is strictly convex, and there will be a unique solution. In cases
of linear dependence between dimensions or of linearly separable data,
maximum likelihood estimation may diverge.

Updates are applied in blocks of training examples (see the
Blocked Updates section below). The last block may be smaller than
the others, but it is treated the same way: first its classifications
are computed, then the gradient updates are made, then the prior
updates. Larger block sizes tend to lead to more robust fitting, but
may be slower to converge in terms of the number of epochs. In fitting
models with priors, large block sizes will cause each epoch to run
faster because the dense operation of adjusting for priors is
performed less frequently. If the block size is set to the corpus
size, stochastic gradient descent reduces to standard (batch)
gradient descent, although step sizes will still be determined by
the learning rate, not by a line search along the gradient direction.

The basic algorithm is:

β = 0;
for (epoch = 0; epoch < maxEpochs; ++epoch)
    for training case (x,c') in D
        for category c < numOutcomes-1
            βc -= learningRate(epoch) * grad(Err(x,c,c',β,σ²))
    if (epoch > minEpochs && converged)
        return β

where the learning rate and convergence conditions are discussed
in the next section. The gradient of the error is described
above, and the gradient contributions of the prior and its
parameters σ are described in the class
documentation for RegressionPrior. Note that the error
gradient must be divided by the number of training cases to
get the incremental contribution of the prior gradient.

The actual algorithm uses a lazy form of updating the contribution
of the gradient of the prior. The result is an algorithm that
handles sparse input data, touching only the non-zero dimensions of
inputs during parameter updates.
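The algorithm above can be sketched for the binomial case as follows. This is maximum likelihood only (no prior term) with a simple inverse-epoch learning rate; all names are illustrative, and the real estimator additionally handles priors, blocking, and lazy updates:

```java
public class SgdSketch {
    // Binomial logistic regression by SGD: one weight vector beta for
    // outcome 0, with outcome 1 the implicit reference category.
    static double[] fit(double[][] xs, int[] cs, int epochs, double rate0) {
        double[] beta = new double[xs[0].length];
        for (int epoch = 0; epoch < epochs; ++epoch) {
            double rate = rate0 / (1.0 + epoch); // inverse-epoch annealing
            for (int t = 0; t < xs.length; ++t) {
                double dot = 0.0;
                for (int i = 0; i < beta.length; ++i)
                    dot += beta[i] * xs[t][i];
                double p0 = Math.exp(dot) / (1.0 + Math.exp(dot)); // p(0 | x)
                double residual = p0 - (cs[t] == 0 ? 1.0 : 0.0);
                for (int i = 0; i < beta.length; ++i)
                    beta[i] -= rate * residual * xs[t][i]; // gradient step
            }
        }
        return beta;
    }

    public static void main(String[] args) {
        // Two cases with an intercept dimension; after fitting, the model
        // scores case 0 positively and case 1 negatively.
        double[] beta = fit(new double[][] { { 1.0, 2.0 }, { 1.0, -2.0 } },
                            new int[] { 0, 1 }, 50, 0.5);
        System.out.println(beta[0] + " " + beta[1]);
    }
}
```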
Learning Parameters
In addition to the model parameters (including priors) and training
data (input vectors and reference categories), the regression
estimation method also requires four parameters that control
search. The simplest search parameters are the minimum and maximum
epoch parameters, which control the number of epochs used for
optimization.
The parameter minImprovement determines how much
relative improvement in the log likelihood of the training data
and model is necessary to go on to the next epoch. The algorithm
stops when the current epoch's error err is relatively
close to the previous epoch's error, errLast:

abs(err - errLast)/(abs(err) + abs(errLast)) < minImprovement

Setting this to a low value will lead to slow, but accurate,
coefficient estimates.
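This stopping criterion can be written directly (a minimal sketch; names are illustrative):

```java
public class ConvergenceSketch {
    // Relative convergence test comparing successive epochs' error values.
    static boolean converged(double err, double errLast, double minImprovement) {
        return Math.abs(err - errLast)
               / (Math.abs(err) + Math.abs(errLast)) < minImprovement;
    }

    public static void main(String[] args) {
        // |100 - 100.5| / 200.5 is about 0.0025, under a 0.01 threshold:
        System.out.println(converged(100.0, 100.5, 0.01)); // true
    }
}
```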
Finally, the search parameters include an instance of
AnnealingSchedule, which implements the learningRate(epoch)
method. See that class for concrete implementations, including
a standard inverse-epoch annealing and exponential decay annealing.
Blocked Updates
The implementation of stochastic gradient descent used in this
class for fitting a logistic regression model calculates the
likelihoods for an entire block of examples at once without
changing the model parameters. The parameters are then updated for
the entire block at once.
Serialization and Compilation
For convenience, this class implements both the Serializable
and Compilable interfaces. Serializing or compiling
a logistic regression model has the same effect. The model
read back in from its serialized state will be an instance of
this class, LogisticRegression.
References
Logistic regression is discussed in most machine learning and
statistics textbooks. These three machine learning textbooks all
introduce some form of stochastic gradient descent and logistic
regression (often not together, and often under different names as
listed in the AKA section above):
An introduction to traditional statistical modeling with logistic
regression may be found in:
A discussion of text classification using regression that evaluates
with respect to support vector machines (SVMs) and considers
informative Laplace and Gaussian priors varying by dimension (which
this class supports), see:
| Constructor and Description |
|---|
| LogisticRegression(Vector weightVector): Construct a binomial logistic regression model with the specified parameter vector. |
| LogisticRegression(Vector[] weightVectors): Construct a multinomial logistic regression model with the specified weight vectors. |
| Modifier and Type | Method and Description |
|---|---|
| double[] | classify(Vector x): Returns an array of conditional probabilities indexed by outcomes for the specified input vector. |
| void | classify(Vector x, double[] ysHat): Fills the specified array with the conditional probabilities indexed by outcomes for the specified input vector. |
| void | compileTo(ObjectOutput out): Compiles this model to the specified object output. |
| static LogisticRegression | estimate(Vector[] xs, int[] cs, RegressionPrior prior, AnnealingSchedule annealingSchedule, Reporter reporter, double minImprovement, int minEpochs, int maxEpochs): Estimate a logistic regression model from the specified input data using the specified prior, annealing schedule, minimum improvement per epoch, minimum and maximum number of estimation epochs, and reporter. |
| static LogisticRegression | estimate(Vector[] xs, int[] cs, RegressionPrior prior, int blockSize, LogisticRegression hotStart, AnnealingSchedule annealingSchedule, double minImprovement, int rollingAverageSize, int minEpochs, int maxEpochs, ObjectHandler<LogisticRegression> handler, Reporter reporter): Estimate a logistic regression model from the specified input data using the specified prior, block size, hot start, annealing schedule, minimum improvement per epoch, rolling average size, minimum and maximum number of estimation epochs, intermediate result handler, and reporter. |
| static double | log2Likelihood(Vector[] inputs, int[] cats, LogisticRegression regression): Returns the log (base 2) likelihood of the specified inputs with the specified categories using the specified regression model. |
| int | numInputDimensions(): Returns the dimensionality of inputs for this logistic regression model. |
| int | numOutcomes(): Returns the number of outcomes for this logistic regression model. |
| Vector[] | weightVectors(): Returns an array of views of the weight vectors used for this regression model. |
public LogisticRegression(Vector[] weightVectors)

Construct a multinomial logistic regression model with the specified
weight vectors. With k-1 weight vectors, the result is a multinomial
classifier with k outcomes.

The weight vectors are stored rather than copied, so changes to them will affect this class.

See the class definition above for more information on logistic regression.

Parameters:
weightVectors - Weight vectors defining this regression model.
Throws:
IllegalArgumentException - If the array of weight vectors does not have at least one element or if there are two weight vectors with different numbers of dimensions.

public LogisticRegression(Vector weightVector)

Construct a binomial logistic regression model with the specified parameter vector.

The weight vector is stored rather than copied, so changes to it will affect this class.

Parameters:
weightVector - The weights of features defining this model.

public int numInputDimensions()

Returns the dimensionality of inputs for this logistic regression model.

public int numOutcomes()

Returns the number of outcomes for this logistic regression model.

public Vector[] weightVectors()

Returns an array of views of the weight vectors used for this regression model.
public double[] classify(Vector x)

Returns an array of conditional probabilities indexed by
outcomes for the specified input vector. The resulting array has a
value at index i that is equal to the
probability of the outcome i for the specified
input. The sum of the returned values will be 1.0 (modulo
arithmetic precision).

See the class definition above for more information on how the conditional probabilities are computed.

Parameters:
x - The input vector.
Throws:
IllegalArgumentException - If the specified vector is not the same dimensionality as this logistic regression instance.

public void classify(Vector x, double[] ysHat)

Fills the specified array with the conditional probabilities
indexed by outcomes for the specified input vector. The resulting
array has a value at index i that is equal to the
probability of the outcome i for the specified input.
The sum of the values will be 1.0 (modulo arithmetic precision).

See the class definition above for more information on how the conditional probabilities are computed.

Parameters:
x - The input vector.
ysHat - Array into which conditional probabilities are written.
Throws:
IllegalArgumentException - If the specified vector is not the same dimensionality as this logistic regression instance.

public void compileTo(ObjectOutput out) throws IOException

Compiles this model to the specified object output. The model read
back in from its compiled state will be an instance of this class,
LogisticRegression.

Compilation does the same thing as serialization.

Specified by:
compileTo in interface Compilable
Parameters:
out - Object output to which this model is compiled.
Throws:
IOException - If there is an underlying I/O error during serialization.

public static LogisticRegression estimate(Vector[] xs, int[] cs, RegressionPrior prior, AnnealingSchedule annealingSchedule, Reporter reporter, double minImprovement, int minEpochs, int maxEpochs)

Estimate a logistic regression model from the specified input data using the specified prior, annealing schedule, minimum improvement per epoch, minimum and maximum number of estimation epochs, and reporter.
See the class documentation above for more information on logistic regression and the stochastic gradient descent algorithm used to implement this method.
Reporting: Reports at the debug level provide epoch-by-epoch feedback. Reports at the info level indicate inputs and major milestones in the algorithm. Reports at the fatal levels are for thrown exceptions.
Parameters:
xs - Input vectors indexed by training case.
cs - Output categories indexed by training case.
prior - The prior to be used for regression.
annealingSchedule - Class to compute learning rate for each epoch.
minImprovement - The minimum relative improvement in log likelihood for the corpus to continue to another epoch.
minEpochs - Minimum number of epochs.
maxEpochs - Maximum number of epochs.
reporter - Reporter to which progress reports are written, or null if no progress reports are needed.
Throws:
IllegalArgumentException - If the set of input vectors does not contain at least one instance, if the number of output categories isn't the same as the input categories, if two input vectors have different dimensions, or if the prior has a different number of dimensions than the instances.

public static LogisticRegression estimate(Vector[] xs, int[] cs, RegressionPrior prior, int blockSize, LogisticRegression hotStart, AnnealingSchedule annealingSchedule, double minImprovement, int rollingAverageSize, int minEpochs, int maxEpochs, ObjectHandler<LogisticRegression> handler, Reporter reporter)

Estimate a logistic regression model from the specified input data using the specified prior, block size, hot start, annealing schedule, minimum improvement per epoch, rolling average size, minimum and maximum number of estimation epochs, intermediate result handler, and reporter.
See the class documentation above for more information on logistic regression and the stochastic gradient descent algorithm used to implement this method.
Reporting: Reports at the debug level provide epoch-by-epoch feedback. Reports at the info level indicate inputs and major milestones in the algorithm. Reports at the fatal levels are for thrown exceptions.
Parameters:
xs - Input vectors indexed by training case.
cs - Output categories indexed by training case.
prior - The prior to be used for regression.
blockSize - Number of examples whose gradient is computed before updating coefficients.
hotStart - Logistic regression from which to retrieve initial weights, or null to use zero vectors.
annealingSchedule - Class to compute learning rate for each epoch.
minImprovement - The minimum relative improvement in log likelihood for the corpus to continue to another epoch.
minEpochs - Minimum number of epochs.
maxEpochs - Maximum number of epochs.
handler - Handler for intermediate regression results.
reporter - Reporter to which progress reports are written, or null if no progress reports are needed.
Throws:
IllegalArgumentException - If the set of input vectors does not contain at least one instance, if the number of output categories isn't the same as the input categories, if two input vectors have different dimensions, or if the prior has a different number of dimensions than the instances.

public static double log2Likelihood(Vector[] inputs, int[] cats, LogisticRegression regression)

Returns the log (base 2) likelihood of the specified inputs with the specified categories using the specified regression model.
Parameters:
inputs - Input vectors.
cats - Categories for input vectors.
regression - Model to use for computing likelihood.
Throws:
IllegalArgumentException - If the inputs and categories are not the same length.

Copyright © 2016 Alias-i, Inc. All rights reserved.