Stochastic Gradient Ascent
maxSGA.RdStochastic Gradient Ascent–based optimizers
Usage
maxSGA(fn = NULL, grad = NULL, hess = NULL, start,
nObs,
constraints = NULL, finalHessian = FALSE,
fixed = NULL, control=NULL, ... )
maxAdam(fn = NULL, grad = NULL, hess = NULL, start,
nObs,
constraints = NULL, finalHessian = FALSE,
fixed = NULL, control=NULL, ... )Arguments
- fn
the function to be maximized. As the objective function values are not directly used for optimization, this argument is optional, given
gradis provided. It must have the parameter vector as the first argument, and it must have an argumentindexto specify the integer index of the selected observations. It must return either a single number, or a numeric vector (this is is summed internally). If the parameters are out of range,fnshould returnNA. See details for constant parameters.fnmay also return attributes "gradient" and/or "hessian". If these attributes are set, the algorithm uses the corresponding values as gradient and Hessian.- grad
gradient of the objective function. It must have the parameter vector as the first argument, and it must have an argument
indexto specify the integer index of selected observations. It must return either a gradient vector of the objective function, or a matrix, where columns correspond to individual parameters. The column sums are treated as gradient components. IfNULL, finite-difference gradients are computed. Iffnreturns an object with attributegradient, this argument is ignored.If
gradis not supplied, it is computed by finite-difference method usingfn. However, this is only adviseable for small-scale tests, not for any production run. Obviously,fnmust be correctly defined in that case.- hess
Hessian matrix of the function. Mainly for compatibility reasons, only used for computing the final Hessian if asked to do so by setting
finalHessiantoTRUE. It must have the parameter vector as the first argument and it must return the Hessian matrix of the objective function. If missing, either finite-difference Hessian, based ongradientor BHHH approach is computed if asked to do so.- start
initial parameter values. If these have names, the names are also used for results.
- nObs
number of observations. This is used to partition the data into individual batches. The resulting batch indices are forwarded to the
gradfunction through the argumentindex.- constraints
either
NULLfor unconstrained optimization or a list with two components. The components may be eithereqAandeqBfor equality-constrained optimization \(A \theta + B = 0\); orineqAandineqBfor inequality constraints \(A \theta + B > 0\). More than one row inineqAandineqBcorresponds to more than one linear constraint, in that case all these must be zero (equality) or positive (inequality constraints). The equality-constrained problem is forwarded tosumt, the inequality-constrained case toconstrOptim2.- finalHessian
how (and if) to calculate the final Hessian. Either
FALSE(do not calculate),TRUE(use analytic/finite-difference Hessian) or"bhhh"/"BHHH"for the information equality approach. The latter approach is only suitable when working with a log-likelihood function, and it requires the gradient/log-likelihood to be supplied by individual observations.Hessian matrix is not often used for optimization problems where one applies SGA, but even if one is not interested in standard errors, it may provide useful information about the model performance. If computed by finite-difference method, the Hessian computation may be very slow.
- fixed
parameters to be treated as constants at their
startvalues. If present, it is treated as an index vector ofstartparameters.- control
list of control parameters. The ones used by these optimizers are
- SGA_momentum
0, numeric momentum parameter for SGA. Must lie in interval \([0,1]\). See details.
Adam-specific parameters
- Adam_momentum1
0.9, numeric in interval \((0,1)\), the first moment momentum
- Adam_momentum2
0.999, numeric in interval \((0,1)\), the second moment momentum
General stochastic gradient parameters:
- SG_learningRate
step size the SGA algorithm takes in the gradient direction. If 1, the step equals to the gradient value. A good value is often 0.01–0.3
- SG_batchSize
SGA batch size, an integer between 1 and
nObs. IfNULL(default), the full batch gradient is computed.- SG_clip
NULL, gradient clipping threshold. The algorithm ensures that \(||g(\theta)||_2^2 \le \kappa\) where \(\kappa\) is theSG_clipvalue. If the actual norm of the gradient exceeds (square root of) \(\kappa\), the gradient will be scaled back accordingly while preserving its direction.NULLmeans no clipping.
Stopping conditions:
- gradtol
stopping condition. Stop if norm of the gradient is less than
gradtol. Default 0, i.e. do not use this condition. This condition is useful if the objective is to drive full batch gradient to zero on training data. It is not a good objective in case of the stochastic gradient, and if the objective is to optimize the objective on validation data.- SG_patience
NULL, or integer. Stopping condition: the algorithm counts how many times the objective function has been worse than its best value so far, and if this exceedsSG_patience, the algorithm stops.- SG_patienceStep
1L, integer. After how many epochs to check the patience value.
1means to check at each epoch, and hence to compute the objective function. This may be undesirable if the objective function is costly to compute.- iterlim
stopping condition. Stop if more than
iterlimepochs, returncode=4. Epoch is a set of iterations that cycles through all observations. In case of full batch, iterations and epochs are equivalent. Ifiterlim = 0, does not do any learning and returns the initial values unchanged.- printLevel
this argument determines the level of printing which is done during the optimization process. The default value 0 means that no printing occurs, 1 prints the initial and final details, 2 prints all the main tracing information for every epoch. Higher values will result in even more output.
- storeParameters
logical, whether to store and return the parameter values at each epoch. If
TRUE, the stored values can be retrieved withstoredParameters-method. The parameters are stored as a matrix with rows corresponding to the epochs and columns to the parameter components. There areiterlim+ 1 rows, where the first one corresponds to the initial parameters.Default
FALSE.- storeValues
logical, whether to store and return the objective function values at each epoch. If
TRUE, the stored values can be retrieved withstoredValues-method. There areiterlim+ 1 values, where the first one corresponds to the value at the initial parameters.Default
FALSE.
See
maxControlfor more information.- ...
further arguments to
fn,gradandhess. To maintain compatibility with the earlier versions, ... also passes certain control options to the optimizers.
Details
Gradient Ascent (GA) is a optimization method where the algorithm repeatedly takes small steps in the gradient's direction, the parameter vector \(\theta\) is updated as \(\theta \leftarrow theta + \mathrm{learning rate}\cdot \nabla f(\theta)\). In case of Stochastic GA (SGA), the gradient is not computed on the full set of observations but on a small subset, batch, potentially a single observation only. In certain circumstances this converges much faster than when using all observation (see Bottou et al, 2018).
If SGA_momentum is positive, the SGA algorithm updates the parameters
\(\theta\) in two steps. First, the momentum is used to update
the “velocity” \(v\) as
\(v \leftarrow \mathrm{momentum}\cdot v + \mathrm{learning
rate}\cdot \nabla f(\theta)\), and thereafter the parameter
\(\theta\) is updates as
\(\theta \leftarrow \theta + v\). Initial
velocity is set to 0.
The Adam algorithm is more complex and uses first and second moments of stochastic gradients to automatically adjust the learning rate. See Goodfellow et al, 2016, page 301.
The function fn is not directly used for optimization, only
for printing or as a stopping condition. In this sense
it is up to the user to decide what the function
returns, if anything. For instance, it may be useful for fn to compute the
objective function on either full training data, or on validation data,
and just ignore the index argument. The latter is useful if
using patience-based stopping.
However, one may also
choose to select the observations determined by the index to
compute the objective function on the current data batch.
Value
object of class "maxim". Data can be extracted through the following methods:
maxValuefnvalue at maximum (the last calculated value if not converged.)coefestimated parameter value.
gradientvector, last calculated gradient value. Should be close to 0 in case of normal convergence.
- estfun
matrix of gradients at parameter value
estimateevaluated at each observation (only ifgradreturns a matrix orgradis not specified andfnreturns a vector).hessianHessian at the maximum (the last calculated value if not converged).
storedValuesreturn values stored at each epoch
storedParametersreturn parameters stored at each epoch
returnCodea numeric code that describes the convergence or error.
returnMessagea short message, describing the return code.
activeParlogical vector, which parameters are optimized over. Contains only
TRUE-s if no parameters are fixed.nIternumber of iterations.
maximTypecharacter string, type of maximization.
maxControlthe optimization control parameters in the form of a
MaxControlobject.
References
Bottou, L.; Curtis, F. & Nocedal, J.: Optimization Methods for Large-Scale Machine Learning SIAM Review, 2018, 60, 223–311.
Goodfellow, I.; Bengio, Y.; Courville, A. (2016): Deep Learning, MIT Press
Henningsen, A. and Toomet, O. (2011): maxLik: A package for maximum likelihood estimation in R Computational Statistics 26, 443–458
See also
A good starting point to learn about the usage of stochastic gradient ascent in maxLik package is the vignette “Stochastic Gradient Ascent in maxLik”.
The other related functions are
maxNR for Newton-Raphson, a popular Hessian-based maximization;
maxBFGS for maximization using the BFGS, Nelder-Mead (NM),
and Simulated Annealing (SANN) method (based on optim),
also supporting inequality constraints;
maxLik for a general framework for maximum likelihood
estimation (MLE);
optim for different gradient-based optimization
methods.
Examples
## estimate the exponential distribution parameter by ML
set.seed(1)
t <- rexp(100, 2)
loglik <- function(theta, index) sum(log(theta) - theta*t[index])
## Note the log-likelihood and gradient are summed over observations
gradlik <- function(theta, index) sum(1/theta - t[index])
## Estimate with full-batch
a <- maxSGA(loglik, gradlik, start=1, control=list(iterlim=1000,
SG_batchSize=10), nObs=100)
# note that loglik is not really needed, and is not used
# here, unless more print verbosity is asked
summary(a)
#> --------------------------------------------
#> Stochastic Gradient Ascent
#> Number of iterations: 1000
#> Return code: 4
#> Iteration limit exceeded (iterlim)
#> Function value:
#> Estimates:
#> estimate gradient
#> [1,] 2.099567 -0.1279617
#> --------------------------------------------
##
## demonstrate the usage of index, and using
## fn for computing the objective function on validation data.
## Create a linear model where variables are very unequally scaled
##
## OLS loglik function: compute the function value on validation data only
loglik <- function(beta, index) {
e <- yValid - XValid %*% beta
-crossprod(e)/length(y)
}
## OLS gradient: compute it on training data only
## Use 'index' to select the subset corresponding to the minibatch
gradlik <- function(beta, index) {
e <- yTrain[index] - XTrain[index,,drop=FALSE] %*% beta
g <- t(-2*t(XTrain[index,,drop=FALSE]) %*% e)
-g/length(index)
}
N <- 1000
## two random variables: one with scale 1, the other with 100
X <- cbind(rnorm(N), rnorm(N, sd=100))
beta <- c(1, 1) # true parameter values
y <- X %*% beta + rnorm(N, sd=0.2)
## training-validation split
iTrain <- sample(N, 0.8*N)
XTrain <- X[iTrain,,drop=FALSE]
XValid <- X[-iTrain,,drop=FALSE]
yTrain <- y[iTrain]
yValid <- y[-iTrain]
##
## do this without momentum: learning rate must stay small for the gradient not to explode
cat(" No momentum:\n")
#> No momentum:
a <- maxSGA(loglik, gradlik, start=c(10,10),
control=list(printLevel=1, iterlim=50,
SG_batchSize=30, SG_learningRate=0.0001, SGA_momentum=0
), nObs=length(yTrain))
#> Initial function value: -199008.4
#> --------------
#> Iteration limit exceeded (iterlim)
#> 50 iterations
#> estimate: 8.05622774286993, 1.56881067557959
#> Function value: -816.6898
print(summary(a)) # the first component is off, the second one is close to the true value
#> --------------------------------------------
#> Stochastic Gradient Ascent
#> Number of iterations: 50
#> Return code: 4
#> Iteration limit exceeded (iterlim)
#> Function value: -816.6898
#> Estimates:
#> estimate gradient
#> [1,] 8.056228 -22.18443
#> [2,] 1.568811 -10806.06327
#> --------------------------------------------
## do with momentum 0.99
cat(" Momentum 0.99:\n")
#> Momentum 0.99:
a <- maxSGA(loglik, gradlik, start=c(10,10),
control=list(printLevel=1, iterlim=50,
SG_batchSize=30, SG_learningRate=0.0001, SGA_momentum=0.99
# no momentum
), nObs=length(yTrain))
#> Initial function value: -199008.4
#> --------------
#> Iteration limit exceeded (iterlim)
#> 50 iterations
#> estimate: 8.02519802176179e+23, 8.64902046641093e+25
#> Function value: -1.835145e+55
print(summary(a)) # close to true value
#> --------------------------------------------
#> Stochastic Gradient Ascent
#> Number of iterations: 50
#> Return code: 4
#> Iteration limit exceeded (iterlim)
#> Function value: -1.835145e+55
#> Estimates:
#> estimate gradient
#> [1,] 8.025198e+23 2.210386e+27
#> [2,] 8.649020e+25 -1.822211e+30
#> --------------------------------------------