Newton- and Quasi-Newton Maximization
Unconstrained and equality-constrained maximization based on the quadratic approximation (Newton) method. The Newton-Raphson, BFGS (Broyden 1970; Fletcher 1970; Goldfarb 1970; Shanno 1970), and BHHH (Berndt, Hall, Hall, and Hausman 1974) methods are available.
Usage
maxNR(fn, grad = NULL, hess = NULL, start,
constraints = NULL, finalHessian = TRUE, bhhhHessian=FALSE,
fixed = NULL, activePar = NULL, control=NULL, ... )
maxBFGSR(fn, grad = NULL, hess = NULL, start,
constraints = NULL, finalHessian = TRUE,
fixed = NULL, activePar = NULL, control=NULL, ... )
maxBHHH(fn, grad = NULL, hess = NULL, start,
finalHessian = "BHHH", ... )
Arguments
- fn
the function to be maximized. It must have the parameter vector as the first argument and must return either a single number or a numeric vector (which is summed internally). If the BHHH method is used and the argument grad is not given, fn must return a numeric vector of observation-specific log-likelihood values. If the parameters are out of range, fn should return NA. See Details for constant parameters. fn may also return attributes "gradient" and/or "hessian". If these attributes are set, the algorithm uses the corresponding values as the gradient and Hessian (see the sketch after this list).
- grad
gradient of the objective function. It must have the parameter vector as the first argument and must return either the gradient vector of the objective function, or a matrix whose columns correspond to individual parameters; the column sums are then treated as the gradient components. If NULL, finite-difference gradients are computed. If the BHHH method is used, grad must return a matrix whose rows correspond to the gradient vectors of individual observations and whose columns correspond to the individual parameters. If fn returns an object with attribute gradient, this argument is ignored.
- hess
Hessian matrix of the function. It must have the parameter vector as the first argument and must return the Hessian matrix of the objective function. If missing, a finite-difference Hessian, based on grad, is computed. The Hessian is used by the Newton-Raphson method only, and by the other methods at the end if finalHessian is requested.
- start
initial parameter values. If start values are named, those names are also carried over to the results.
- constraints
either NULL for unconstrained optimization, or a list with two components. The components may be either eqA and eqB for equality-constrained optimization \(A \theta + B = 0\), or ineqA and ineqB for inequality constraints \(A \theta + B > 0\). More than one row in ineqA and ineqB corresponds to more than one linear constraint; in that case all of them must be zero (equality) or positive (inequality constraints). The equality-constrained problem is forwarded to sumt, the inequality-constrained case to constrOptim2.
- finalHessian
how (and whether) to calculate the final Hessian. Either FALSE (do not calculate), TRUE (use the analytic/finite-difference Hessian), or "bhhh"/"BHHH" for the information-equality approach. The latter approach is only suitable for maximizing log-likelihood functions and requires the gradient/log-likelihood to be supplied by individual observations. Note that computing the (actual, not BHHH) final Hessian carries no extra cost for the NR method, but does for the other methods.
- bhhhHessian
logical, indicating whether to use the information-equality approximation (Berndt, Hall, Hall, and Hausman 1974) for the Hessian. This effectively transforms maxNR into maxBHHH and is mainly designed for internal use.
- fixed
parameters to be treated as constants at their start values. If present, it is treated as an index vector of the start parameters.
- activePar
this argument is retained for backward compatibility only; please use the argument fixed instead.
- control
list of control parameters. The control parameters used by these optimizers are
- tol
\(10^{-8}\), stopping condition. Stop if the absolute difference between successive iterations is less than tol; return code = 2. If set to a negative value, the criterion is never fulfilled and hence disabled.
- reltol
sqrt(.Machine$double.eps), stopping condition. Relative convergence tolerance: the algorithm stops if the relative improvement between iterations is less than reltol. Return code 8. A negative value disables this condition.
- gradtol
stopping condition. Stop if norm of the gradient is less than
gradtol. Return code 1. Negative value disables condition.- steptol
1e-10, stopping/error condition. If qac == "stephalving" and the quadratic approximation leads to a worse (instead of a better) value, or to NA, the step length is halved and a new attempt is made. If necessary, this procedure is repeated until step < steptol, thereafter code 3 is returned.
- lambdatol
\(10^{-6}\), controls whether the Hessian is treated as negative definite. If the largest of the eigenvalues of the Hessian is larger than -lambdatol (i.e. the Hessian is not negative definite), a suitable diagonal matrix is subtracted from the Hessian (quadratic hill-climbing) in order to enforce negative definiteness.
- qrtol
\(10^{-10}\), QR-decomposition tolerance for the Hessian inversion.
- qac
"stephalving", Quadratic Approximation Correction. When the new guess is worse than the initial one, the algorithm attemts to correct it: "stephalving" decreases the step but keeps the direction, "marquardt" uses Marquardt (1963) method by decreasing the step length while also moving closer to the pure gradient direction. It may be faster and more robust choice in areas where quadratic approximation behaves poorly.
maxNRandmaxBHHHonly.- marquardt_lambda0
\(10^{-2}\), positive numeric, initial correction term for Marquardt (1963) correction.
- marquardt_lambdaStep
2, how much the Marquardt (1963) correction term is decreased/increased at each successful/unsuccesful step.
maxNRandmaxBHHHonly.- marquardt_maxLambda
\(10^{12}\), maximum allowed Marquardt (1963) correction term. If exceeded, the algorithm exits with return code 3. maxNR and maxBHHH only.
- iterlim
stopping condition. Stop after more than iterlim iterations; return code = 4.
- printLevel
this argument determines the level of printing which is done during the optimization process. The default value 0 means that no printing occurs, 1 prints the initial and final details, 2 prints all the main tracing information for every iteration. Higher values will result in even more output.
- ...
further arguments to fn, grad, and hess. Further arguments to maxBHHH are also passed to maxNR. To maintain compatibility with earlier versions, ... also passes a number of control options (tol, reltol, gradtol, steptol, lambdatol, qrtol, iterlim) to the optimizers.
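As noted under fn above, the objective function may itself supply analytic derivatives through the attributes "gradient" and "hessian". The following sketch illustrates this pattern together with passing control options; the data vector t, the exponential log-likelihood, and the name loglikAttr are illustrative only (they mirror the Examples below).
library(maxLik)
t <- rexp(100, 2)                      # illustrative data, true rate 2
## log-likelihood that also carries its analytic derivatives as attributes;
## when these attributes are present, maxNR uses them instead of
## finite-difference approximations
loglikAttr <- function(theta) {
  ll <- sum(log(theta) - theta*t)
  attr(ll, "gradient") <- sum(1/theta - t)
  attr(ll, "hessian")  <- -length(t)/theta^2
  ll
}
a <- maxNR(loglikAttr, start = 1,
           control = list(tol = 1e-10, iterlim = 50))
summary(a)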
Details
The idea of the Newton method is to approximate the function at a given location by a multidimensional quadratic function, and to use the estimated maximum as the start value for the next iteration. Such an approximation requires knowledge of both the gradient and the Hessian, the latter of which can be quite costly to compute. Several methods for approximating the Hessian exist, including BFGS and BHHH.
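For intuition, a single Newton step maximizes this quadratic approximation: given the gradient g and Hessian H at the current point, the update is theta - solve(H, g). A minimal base-R sketch (the names newtonStep, g, and H are purely illustrative):
## one Newton step: maximize f(theta + d) ~ f(theta) + g'd + d'H d/2,
## whose maximizer is d = -solve(H, g)
newtonStep <- function(theta, g, H) {
  drop(theta - solve(H, g))
}
## for f(x) = -(x - 3)^2 the gradient is -2*(x - 3) and the Hessian is -2,
## so a single step from x = 0 lands exactly on the maximum at 3
newtonStep(0, g = -2*(0 - 3), H = matrix(-2))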
The BHHH (information equality) approximation is only valid for
log-likelihood functions.
It requires the score (gradient) values for individual observations, and hence
grad or fn must return them by individual observations.
The Hessian is approximated as the negative of the sum of the outer products
of the gradients of individual observations, or, in the matrix form,
$$
\mathsf{H}^{BHHH}
=
-\frac{1}{N} \sum_{i=1}^N
\left[
\frac{\partial \ell_i(\boldsymbol{\vartheta})}{\partial \boldsymbol{\vartheta}}
\frac{\partial \ell_i(\boldsymbol{\vartheta})}{\partial \boldsymbol{\vartheta}'}
\right]
$$
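In R terms, if G is the N x k matrix of observation-level gradients (one row per observation), the approximation above is just a scaled negative cross-product. A minimal sketch (the name bhhhHessianApprox and the data are illustrative, not part of the package):
## BHHH approximation: -(1/N) * sum_i g_i g_i'  ==  -crossprod(G)/N
bhhhHessianApprox <- function(G) {
  -crossprod(G) / nrow(G)
}
## exponential example: per-observation score at theta is 1/theta - t_i
t <- rexp(100, 2)
G <- matrix(1/2 - t, ncol = 1)         # scores evaluated at theta = 2
bhhhHessianApprox(G)                   # roughly -1/theta^2 = -0.25 here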
The functions maxNR, maxBFGSR, and maxBHHH
can work with constant parameters, which is useful if a parameter value
converges to the boundary of the support, or for testing.
One way is to set
fixed to a non-NULL value, specifying which parameters should be treated as
constants. The
parameters can also be fixed at runtime (only for maxNR and maxBHHH) by
signaling this through the
fn return value. See Henningsen & Toomet (2011) for details.
Value
object of class "maxim". Data can be extracted through the following methods (see the sketch at the end of this section):
- maxValue
fn value at the maximum (the last calculated value if not converged).
- coef
estimated parameter value.
- gradient
vector, last calculated gradient value. Should be close to 0 in case of normal convergence.
- estfun
matrix of gradients at the parameter value estimate, evaluated at each observation (only if grad returns a matrix, or grad is not specified and fn returns a vector).
- hessian
Hessian at the maximum (the last calculated value if not converged).
- returnCode
return code:
- 1
gradient close to zero (normal convergence).
- 2
successive function values within tolerance limit (normal convergence).
- 3
last step could not find a higher value (probably not converged). This is related to the line-search step getting too small, usually because of hitting the boundary of the parameter space. It may also be related to attempts to move in a wrong direction because of numerical errors. In some cases it can be helped by changing steptol.
- 4
iteration limit exceeded.
- 5
infinite value.
- 6
infinite gradient.
- 7
infinite Hessian.
- 8
successive function values within relative tolerance limit (normal convergence).
- 9
(BFGS) the Hessian approximation cannot be improved because the gradient did not change. May be related to numerical approximation problems or a wrong analytic gradient.
- 100
Initial value out of range.
- returnMessage
a short message, describing the return code.
- activePar
logical vector: which parameters are optimized over. Contains only TRUE values if no parameters are fixed.
- nIter
number of iterations.
- maximType
character string, type of maximization.
- maxControl
the optimization control parameters in the form of a
MaxControl object.
The following components can only be extracted directly (with \$):
- last.step
a list describing the last unsuccessful step if code = 3, with the following components:
- theta0
previous parameter value
- f0
fn value at theta0
- climb
the movement vector to the maximum of the quadratic approximation
- constraints
a list describing the constrained optimization (NULL if unconstrained). Includes the following components:
- type
type of constrained optimization
- outer.iterations
number of iterations in the constraints step
- barrier.value
value of the barrier function
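A brief sketch of extracting these components from a fitted object, reusing the exponential example from the Examples section (assuming the maxLik package, which provides the accessor generics listed above, is attached):
library(maxLik)
t <- rexp(100, 2)
a <- maxNR(function(theta) sum(log(theta) - theta*t), start = 1)
maxValue(a)        # objective function value at the maximum
coef(a)            # estimated parameter
returnCode(a)      # numeric convergence code (see the list above)
returnMessage(a)   # short description of the return code
nIter(a)           # number of iterations
a$constraints      # NULL here, as this problem is unconstrained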
Warning
No attempt is made to ensure that the user-provided analytic
gradient/Hessian is correct. Users are
encouraged to use the compareDerivatives function,
designed for this purpose (see the sketch below). If the analytic gradient/Hessian are wrong,
the algorithm may not converge, or may converge to the wrong point.
As the BHHH method uses the likelihood-specific information equality, it is only suitable for maximizing log-likelihood functions!
Quasi-Newton methods, including those mentioned above, do not work
well in non-concave regions. This is especially the case with the
implementation in maxBFGSR. The user is advised to
experiment with various tolerance options to achieve convergence.
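Before optimizing with analytic derivatives, it can help to check them against numeric ones. A minimal sketch using compareDerivatives with the exponential example from the Examples section (the evaluation point 1 is arbitrary):
library(maxLik)
t <- rexp(100, 2)
loglik  <- function(theta) sum(log(theta) - theta*t)
gradlik <- function(theta) sum(1/theta - t)   # analytic gradient to check
## prints analytic and numeric gradients side by side; a large relative
## difference suggests an error in 'gradlik'
compareDerivatives(loglik, gradlik, t0 = 1)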
References
Berndt, E., Hall, B., Hall, R. and Hausman, J. (1974): Estimation and Inference in Nonlinear Structural Models, Annals of Social Measurement 3, 653–665.
Broyden, C.G. (1970): The Convergence of a Class of Double-rank Minimization Algorithms, Journal of the Institute of Mathematics and Its Applications 6, 76–90.
Fletcher, R. (1970): A New Approach to Variable Metric Algorithms, Computer Journal 13, 317–322.
Goldfarb, D. (1970): A Family of Variable Metric Updates Derived by Variational Means, Mathematics of Computation 24, 23–26.
Henningsen, A. and Toomet, O. (2011): maxLik: A Package for Maximum Likelihood Estimation in R, Computational Statistics 26, 443–458.
Marquardt, D.W. (1963): An Algorithm for Least-Squares Estimation of Nonlinear Parameters, Journal of the Society for Industrial & Applied Mathematics 11(2), 431–441.
Shanno, D.F. (1970): Conditioning of Quasi-Newton Methods for Function Minimization, Mathematics of Computation 24, 647–656.
Author
Ott Toomet, Arne Henningsen,
function maxBFGSR was originally developed by Yves Croissant
(and placed in the 'mlogit' package)
See also
maxLik for a general framework for maximum likelihood
estimation (MLE);
maxBHHH for maximizations using the Berndt, Hall, Hall,
Hausman (1974) algorithm (which is a wrapper function to maxNR);
maxBFGS for maximization using the BFGS, Nelder-Mead (NM),
and Simulated Annealing (SANN) methods (based on optim),
also supporting inequality constraints;
nlm for Newton-Raphson optimization; and
optim for different gradient-based optimization
methods.
Examples
## Fit exponential distribution by ML
t <- rexp(100, 2) # create data with parameter 2
loglik <- function(theta) sum(log(theta) - theta*t)
## Note the log-likelihood and gradient are summed over observations
gradlik <- function(theta) sum(1/theta - t)
hesslik <- function(theta) -100/theta^2
## Estimate with finite-difference gradient and Hessian
a <- maxNR(loglik, start=1, control=list(printLevel=2))
#> ----- Initial parameters: -----
#> fcn value: -45.95304
#> parameter initial gradient free
#> [1,] 1 54.04696 1
#> Condition number of the (active) hessian: 1
#> -----Iteration 1 -----
#> -----Iteration 2 -----
#> -----Iteration 3 -----
#> -----Iteration 4 -----
#> -----Iteration 5 -----
#> --------------
#> gradient close to zero (gradtol)
#> 5 iterations
#> estimate: 2.176135
#> Function value: -22.24499
summary(a)
#> --------------------------------------------
#> Newton-Raphson maximisation
#> Number of iterations: 5
#> Return code: 1
#> gradient close to zero (gradtol)
#> Function value: -22.24499
#> Estimates:
#> estimate gradient
#> [1,] 2.176135 -4.618528e-07
#> --------------------------------------------
## You would probably prefer 1/mean(t) instead ;-)
## The same example with analytic gradient and Hessian
a <- maxNR(loglik, gradlik, hesslik, start=1)
summary(a)
#> --------------------------------------------
#> Newton-Raphson maximisation
#> Number of iterations: 5
#> Return code: 1
#> gradient close to zero (gradtol)
#> Function value: -22.24499
#> Estimates:
#> estimate gradient
#> [1,] 2.176134 1.291203e-07
#> --------------------------------------------
## BFGS estimation with finite-difference gradient
a <- maxBFGSR( loglik, start=1 )
summary(a)
#> --------------------------------------------
#> BFGSR maximization
#> Number of iterations: 7
#> Return code: 2
#> successive function values within tolerance limit (tol)
#> Function value: -22.24504
#> Estimates:
#> estimate gradient
#> [1,] 2.178476 -0.04939792
#> --------------------------------------------
## For the BHHH method we need likelihood values and gradients
## of individual observations, not the sum of those
loglikInd <- function(theta) log(theta) - theta*t
gradlikInd <- function(theta) 1/theta - t
## Estimate with analytic gradient
a <- maxBHHH(loglikInd, gradlikInd, start=1)
summary(a)
#> --------------------------------------------
#> BHHH maximisation
#> Number of iterations: 5
#> Return code: 8
#> successive function values within relative tolerance limit (reltol)
#> Function value: -22.24499
#> Estimates:
#> estimate gradient
#> [1,] 2.176132 5.182317e-05
#> --------------------------------------------
## Example with a vector argument: Estimate the mean and
## variance of a random normal sample by maximum likelihood
## Note: you might want to use maxLik instead
loglik <- function(param) {
# param is a 2-vector of c(mean, sd)
mu <- param[1]
sigma <- param[2]
ll <- -0.5*N*log(2*pi) - N*log(sigma) - sum(0.5*(x - mu)^2/sigma^2)
ll
}
x <- rnorm(100, 1, 2) # use mean=1, sd=2
N <- length(x)
res <- maxNR(loglik, start=c(0,1)) # use 'wrong' start values
summary(res)
#> --------------------------------------------
#> Newton-Raphson maximisation
#> Number of iterations: 7
#> Return code: 2
#> successive function values within tolerance limit (tol)
#> Function value: -209.9562
#> Estimates:
#> estimate gradient
#> [1,] 0.8876451 4.831691e-07
#> [2,] 1.9751086 1.108447e-06
#> --------------------------------------------
## The previous example with named parameters and a fixed value
resFix <- maxNR(loglik, start=c(mu=0, sigma=1), fixed="sigma")
summary(resFix) # 'sigma' is exactly 1.000 now.
#> --------------------------------------------
#> Newton-Raphson maximisation
#> Number of iterations: 3
#> Return code: 1
#> gradient close to zero (gradtol)
#> Function value: -286.9465
#> Estimates:
#> estimate gradient
#> mu 0.8876451 -5.684342e-08
#> sigma 1.0000000 NA
#> --------------------------------------------
### Constrained optimization
###
## We maximize exp(-x^2 - y^2) where x+y = 1
hatf <- function(theta) {
x <- theta[1]
y <- theta[2]
exp(-(x^2 + y^2))
## Note: you may prefer exp(- theta %*% theta) instead
}
## use constraints: x + y = 1
A <- matrix(c(1, 1), 1, 2)
B <- -1
res <- maxNR(hatf, start=c(0,0), constraints=list(eqA=A, eqB=B),
control=list(printLevel=1))
print(summary(res))
#> --------------------------------------------
#> Newton-Raphson maximisation
#> Number of iterations: 1
#> Return code: 1
#> gradient close to zero (gradtol)
#> Function value: 0.6065399
#> Estimates:
#> estimate gradient
#> [1,] 0.4999848 -0.6065307
#> [2,] 0.4999848 -0.6065307
#>
#> Constrained optimization based on SUMT
#> Return code: 1
#> penalty close to zero
#> 5 outer iterations, barrier value 9.196992e-10
#> --------------------------------------------