Newton- and Quasi-Newton Maximization
Unconstrained and equality-constrained maximization based on the quadratic approximation (Newton) method. The Newton-Raphson, BFGS (Broyden 1970; Fletcher 1970; Goldfarb 1970; Shanno 1970), and BHHH (Berndt, Hall, Hall, and Hausman 1974) methods are available.
Usage
maxNR(fn, grad = NULL, hess = NULL, start,
constraints = NULL, finalHessian = TRUE, bhhhHessian=FALSE,
fixed = NULL, activePar = NULL, control=NULL, ... )
maxBFGSR(fn, grad = NULL, hess = NULL, start,
constraints = NULL, finalHessian = TRUE,
fixed = NULL, activePar = NULL, control=NULL, ... )
maxBHHH(fn, grad = NULL, hess = NULL, start,
finalHessian = "BHHH", ... )
Arguments
- fn
the function to be maximized. It must have the parameter vector as the first argument and must return either a single number or a numeric vector (which is summed internally). If the BHHH method is used and the argument grad is not given, fn must return a numeric vector of observation-specific log-likelihood values. If the parameters are out of range, fn should return NA. See Details for constant parameters. fn may also return attributes "gradient" and/or "hessian". If these attributes are set, the algorithm uses the corresponding values as the gradient and Hessian (see the sketch after this list).
- grad
gradient of the objective function. It must have the parameter vector as the first argument and must return either the gradient vector of the objective function, or a matrix whose columns correspond to individual parameters; the column sums are then treated as the gradient components. If NULL, finite-difference gradients are computed. If the BHHH method is used, grad must return a matrix whose rows correspond to the gradient vectors of individual observations and whose columns correspond to the individual parameters. If fn returns an object with attribute gradient, this argument is ignored.
- hess
Hessian matrix of the function. It must have the parameter vector as the first argument and must return the Hessian matrix of the objective function. If missing, a finite-difference Hessian, based on grad, is computed. The Hessian is used by the Newton-Raphson method only, and by the other methods at the end if finalHessian is requested.
- start
initial parameter values. If start values are named, those names are also carried over to the results.
- constraints
either NULL for unconstrained optimization, or a list with two components. The components may be either eqA and eqB for equality-constrained optimization \(A \theta + B = 0\), or ineqA and ineqB for inequality constraints \(A \theta + B > 0\). More than one row in ineqA and ineqB corresponds to more than one linear constraint; in that case all of them must be zero (equality) or positive (inequality constraints). The equality-constrained problem is forwarded to sumt, the inequality-constrained case to constrOptim2.
- finalHessian
how (and whether) to calculate the final Hessian. Either FALSE (do not calculate), TRUE (use the analytic/finite-difference Hessian), or "bhhh"/"BHHH" for the information-equality approach. The latter approach is only suitable for maximizing log-likelihood functions and requires the gradient/log-likelihood to be supplied by individual observations. Note that computing the (actual, not BHHH) final Hessian carries no extra cost for the NR method, but does for the other methods.
- bhhhHessian
logical, indicating whether to use the information-equality approximation (Berndt, Hall, Hall, and Hausman 1974) for the Hessian. This effectively transforms maxNR into maxBHHH and is mainly designed for internal use.
- fixed
parameters to be treated as constants at their start values. If present, it is treated as an index vector of the start parameters.
- activePar
this argument is retained for backward compatibility only; please use the argument fixed instead.
- control
list of control parameters. The control parameters used by these optimizers are
- tol
\(10^{-8}\), stopping condition. Stop if the absolute difference between successive iterations is less than tol; return code = 2. If set to a negative value, the criterion is never fulfilled and hence disabled.
- reltol
sqrt(.Machine$double.eps), stopping condition. Relative convergence tolerance: the algorithm stops if the relative improvement between iterations is less than reltol. Return code 8. A negative value disables this condition.
- gradtol
stopping condition. Stop if norm of the gradient is less than
gradtol. Return code 1. Negative value disables condition.- steptol
1e-10, stopping/error condition. If qac == "stephalving" and the quadratic approximation leads to a worse (instead of a better) value, or to NA, the step length is halved and a new attempt is made. If necessary, this procedure is repeated until step < steptol, thereafter code 3 is returned.
- lambdatol
\(10^{-6}\), controls whether the Hessian is treated as negative definite. If the largest of the eigenvalues of the Hessian is larger than -lambdatol (i.e. the Hessian is not negative definite), a suitable diagonal matrix is subtracted from the Hessian (quadratic hill-climbing) in order to enforce negative definiteness.
- qrtol
\(10^{-10}\), QR-decomposition tolerance for the Hessian inversion.
- qac
"stephalving", Quadratic Approximation Correction. When the new guess is worse than the initial one, the algorithm attemts to correct it: "stephalving" decreases the step but keeps the direction, "marquardt" uses Marquardt (1963) method by decreasing the step length while also moving closer to the pure gradient direction. It may be faster and more robust choice in areas where quadratic approximation behaves poorly.
maxNRandmaxBHHHonly.- marquardt_lambda0
\(10^{-2}\), positive numeric, initial correction term for Marquardt (1963) correction.
- marquardt_lambdaStep
2, how much the Marquardt (1963) correction term is decreased/increased at each successful/unsuccesful step.
maxNRandmaxBHHHonly.- marquardt_maxLambda
\(10^{12}\), maximum allowed Marquardt (1963) correction term. If exceeded, the algorithm exits with return code 3. maxNR and maxBHHH only.
- iterlim
stopping condition. Stop after more than iterlim iterations; return code = 4.
- printLevel
this argument determines the level of printing which is done during the optimization process. The default value 0 means that no printing occurs, 1 prints the initial and final details, 2 prints all the main tracing information for every iteration. Higher values will result in even more output.
- ...
further arguments to fn, grad, and hess. Further arguments to maxBHHH are also passed to maxNR. To maintain compatibility with earlier versions, ... also passes a number of control options (tol, reltol, gradtol, steptol, lambdatol, qrtol, iterlim) to the optimizers.
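As noted under fn above, the objective function may itself supply analytic derivatives through the attributes "gradient" and "hessian". The following sketch illustrates this pattern together with passing control options; the data vector t, the exponential log-likelihood, and the name loglikAttr are illustrative only (they mirror the Examples below).
library(maxLik)
t <- rexp(100, 2)                      # illustrative data, true rate 2
## log-likelihood that also carries its analytic derivatives as attributes;
## when these attributes are present, maxNR uses them instead of
## finite-difference approximations
loglikAttr <- function(theta) {
  ll <- sum(log(theta) - theta*t)
  attr(ll, "gradient") <- sum(1/theta - t)
  attr(ll, "hessian")  <- -length(t)/theta^2
  ll
}
a <- maxNR(loglikAttr, start = 1,
           control = list(tol = 1e-10, iterlim = 50))
summary(a)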
Details
The idea of the Newton method is to approximate the function at a given location by a multidimensional quadratic function, and to use the estimated maximum as the start value for the next iteration. Such an approximation requires knowledge of both the gradient and the Hessian, the latter of which can be quite costly to compute. Several methods for approximating the Hessian exist, including BFGS and BHHH.
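For intuition, a single Newton step maximizes this quadratic approximation: given the gradient g and Hessian H at the current point, the update is theta - solve(H, g). A minimal base-R sketch (the names newtonStep, g, and H are purely illustrative):
## one Newton step: maximize f(theta + d) ~ f(theta) + g'd + d'H d/2,
## whose maximizer is d = -solve(H, g)
newtonStep <- function(theta, g, H) {
  drop(theta - solve(H, g))
}
## for f(x) = -(x - 3)^2 the gradient is -2*(x - 3) and the Hessian is -2,
## so a single step from x = 0 lands exactly on the maximum at 3
newtonStep(0, g = -2*(0 - 3), H = matrix(-2))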
The BHHH (information equality) approximation is only valid for
log-likelihood functions.
It requires the score (gradient) values for individual observations, and hence
grad or fn must return them by individual observations.
The Hessian is approximated as the negative of the sum of the outer products
of the gradients of individual observations, or, in the matrix form,
$$
\mathsf{H}^{BHHH}
=
-\frac{1}{N} \sum_{i=1}^N
\left[
\frac{\partial \ell_i(\boldsymbol{\vartheta})}{\partial \boldsymbol{\vartheta}}
\frac{\partial \ell_i(\boldsymbol{\vartheta})}{\partial \boldsymbol{\vartheta}'}
\right]
$$
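In R terms, if G is the N x k matrix of observation-level gradients (one row per observation), the approximation above is just a scaled negative cross-product. A minimal sketch (the name bhhhHessianApprox and the data are illustrative, not part of the package):
## BHHH approximation: -(1/N) * sum_i g_i g_i'  ==  -crossprod(G)/N
bhhhHessianApprox <- function(G) {
  -crossprod(G) / nrow(G)
}
## exponential example: per-observation score at theta is 1/theta - t_i
t <- rexp(100, 2)
G <- matrix(1/2 - t, ncol = 1)         # scores evaluated at theta = 2
bhhhHessianApprox(G)                   # roughly -1/theta^2 = -0.25 here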
The functions maxNR, maxBFGSR, and maxBHHH
can work with constant parameters, which is useful if a parameter value
converges to the boundary of the support, or for testing.
One way is to set
fixed to a non-NULL value, specifying which parameters should be treated as
constants. The
parameters can also be fixed at runtime (only for maxNR and maxBHHH) by
signaling this through the
fn return value. See Henningsen & Toomet (2011) for details.
Value
object of class "maxim". Data can be extracted through the following methods (see the sketch at the end of this section):
- maxValue
fn value at the maximum (the last calculated value if not converged).
- coef
estimated parameter value.
- gradient
vector, last calculated gradient value. Should be close to 0 in case of normal convergence.
- estfun
matrix of gradients at the parameter value estimate, evaluated at each observation (only if grad returns a matrix, or grad is not specified and fn returns a vector).
- hessian
Hessian at the maximum (the last calculated value if not converged).
- returnCode
return code:
- 1
gradient close to zero (normal convergence).
- 2
successive function values within tolerance limit (normal convergence).
- 3
last step could not find a higher value (probably not converged). This is related to the line-search step getting too small, usually because of hitting the boundary of the parameter space. It may also be related to attempts to move in a wrong direction because of numerical errors. In some cases it can be helped by changing steptol.
- 4
iteration limit exceeded.
- 5
infinite value.
- 6
infinite gradient.
- 7
infinite Hessian.
- 8
successive function values within relative tolerance limit (normal convergence).
- 9
(BFGS) the Hessian approximation cannot be improved because the gradient did not change. May be related to numerical approximation problems or a wrong analytic gradient.
- 100
Initial value out of range.
- returnMessage
a short message, describing the return code.
- activePar
logical vector: which parameters are optimized over. Contains only TRUE values if no parameters are fixed.
- nIter
number of iterations.
- maximType
character string, type of maximization.
- maxControl
the optimization control parameters in the form of a
MaxControl object.
The following components can only be extracted directly (with \$):
- last.step
a list describing the last unsuccessful step if code = 3, with the following components:
- theta0
previous parameter value
- f0
fn value at theta0
- climb
the movement vector to the maximum of the quadratic approximation
- constraints
a list describing the constrained optimization (NULL if unconstrained). Includes the following components:
- type
type of constrained optimization
- outer.iterations
number of iterations in the constraints step
- barrier.value
value of the barrier function
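A brief sketch of extracting these components from a fitted object, reusing the exponential example from the Examples section (assuming the maxLik package, which provides the accessor generics listed above, is attached):
library(maxLik)
t <- rexp(100, 2)
a <- maxNR(function(theta) sum(log(theta) - theta*t), start = 1)
maxValue(a)        # objective function value at the maximum
coef(a)            # estimated parameter
returnCode(a)      # numeric convergence code (see the list above)
returnMessage(a)   # short description of the return code
nIter(a)           # number of iterations
a$constraints      # NULL here, as this problem is unconstrained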
Warning
No attempt is made to ensure that the user-provided analytic
gradient/Hessian is correct. Users are
encouraged to use the compareDerivatives function,
designed for this purpose (see the sketch below). If the analytic gradient/Hessian are wrong,
the algorithm may not converge, or may converge to the wrong point.
As the BHHH method uses the likelihood-specific information equality, it is only suitable for maximizing log-likelihood functions!
Quasi-Newton methods, including those mentioned above, do not work
well in non-concave regions. This is especially the case with the
implementation in maxBFGSR. The user is advised to
experiment with various tolerance options to achieve convergence.
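Before optimizing with analytic derivatives, it can help to check them against numeric ones. A minimal sketch using compareDerivatives with the exponential example from the Examples section (the evaluation point 1 is arbitrary):
library(maxLik)
t <- rexp(100, 2)
loglik  <- function(theta) sum(log(theta) - theta*t)
gradlik <- function(theta) sum(1/theta - t)   # analytic gradient to check
## prints analytic and numeric gradients side by side; a large relative
## difference suggests an error in 'gradlik'
compareDerivatives(loglik, gradlik, t0 = 1)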
References
Berndt, E., Hall, B., Hall, R. and Hausman, J. (1974): Estimation and Inference in Nonlinear Structural Models, Annals of Social Measurement 3, 653–665.
Broyden, C.G. (1970): The Convergence of a Class of Double-rank Minimization Algorithms, Journal of the Institute of Mathematics and Its Applications 6, 76–90.
Fletcher, R. (1970): A New Approach to Variable Metric Algorithms, Computer Journal 13, 317–322.
Goldfarb, D. (1970): A Family of Variable Metric Updates Derived by Variational Means, Mathematics of Computation 24, 23–26.
Henningsen, A. and Toomet, O. (2011): maxLik: A Package for Maximum Likelihood Estimation in R, Computational Statistics 26, 443–458.
Marquardt, D.W. (1963): An Algorithm for Least-Squares Estimation of Nonlinear Parameters, Journal of the Society for Industrial & Applied Mathematics 11(2), 431–441.
Shanno, D.F. (1970): Conditioning of Quasi-Newton Methods for Function Minimization, Mathematics of Computation 24, 647–656.
Author
Ott Toomet, Arne Henningsen,
function maxBFGSR was originally developed by Yves Croissant
(and placed in the 'mlogit' package)
See also
maxLik for a general framework for maximum likelihood
estimation (MLE);
maxBHHH for maximizations using the Berndt, Hall, Hall,
Hausman (1974) algorithm (which is a wrapper function to maxNR);
maxBFGS for maximization using the BFGS, Nelder-Mead (NM),
and Simulated Annealing (SANN) methods (based on optim),
also supporting inequality constraints;
nlm for Newton-Raphson optimization; and
optim for different gradient-based optimization
methods.
Examples
## Fit exponential distribution by ML
t <- rexp(100, 2) # create data with parameter 2
loglik <- function(theta) sum(log(theta) - theta*t)
## Note the log-likelihood and gradient are summed over observations
gradlik <- function(theta) sum(1/theta - t)
hesslik <- function(theta) -100/theta^2
## Estimate with finite-difference gradient and Hessian
a <- maxNR(loglik, start=1, control=list(printLevel=2))
#> ----- Initial parameters: -----
#> fcn value: -45.95304
#> parameter initial gradient free
#> [1,] 1 54.04696 1
#> Condition number of the (active) hessian: 1
#> -----Iteration 1 -----
#> -----Iteration 2 -----
#> -----Iteration 3 -----
#> -----Iteration 4 -----
#> -----Iteration 5 -----
#> --------------
#> gradient close to zero (gradtol)
#> 5 iterations
#> estimate: 2.176135
#> Function value: -22.24499
summary(a)
#> --------------------------------------------
#> Newton-Raphson maximisation
#> Number of iterations: 5
#> Return code: 1
#> gradient close to zero (gradtol)
#> Function value: -22.24499
#> Estimates:
#> estimate gradient
#> [1,] 2.176135 -4.618528e-07
#> --------------------------------------------
## You would probably prefer 1/mean(t) instead ;-)
## The same example with analytic gradient and Hessian
a <- maxNR(loglik, gradlik, hesslik, start=1)
summary(a)
#> --------------------------------------------
#> Newton-Raphson maximisation
#> Number of iterations: 5
#> Return code: 1
#> gradient close to zero (gradtol)
#> Function value: -22.24499
#> Estimates:
#> estimate gradient
#> [1,] 2.176134 1.291203e-07
#> --------------------------------------------
## BFGS estimation with finite-difference gradient
a <- maxBFGSR( loglik, start=1 )
summary(a)
#> --------------------------------------------
#> BFGSR maximization
#> Number of iterations: 7
#> Return code: 2
#> successive function values within tolerance limit (tol)
#> Function value: -22.24504
#> Estimates:
#> estimate gradient
#> [1,] 2.178476 -0.04939792
#> --------------------------------------------
## For the BHHH method we need likelihood values and gradients
## of individual observations, not the sum of those
loglikInd <- function(theta) log(theta) - theta*t
gradlikInd <- function(theta) 1/theta - t
## Estimate with analytic gradient
a <- maxBHHH(loglikInd, gradlikInd, start=1)
summary(a)
#> --------------------------------------------
#> BHHH maximisation
#> Number of iterations: 5
#> Return code: 8
#> successive function values within relative tolerance limit (reltol)
#> Function value: -22.24499
#> Estimates:
#> estimate gradient
#> [1,] 2.176132 5.182317e-05
#> --------------------------------------------
## Example with a vector argument: Estimate the mean and
## variance of a random normal sample by maximum likelihood
## Note: you might want to use maxLik instead
loglik <- function(param) {
# param is a 2-vector of c(mean, sd)
mu <- param[1]
sigma <- param[2]
ll <- -0.5*N*log(2*pi) - N*log(sigma) - sum(0.5*(x - mu)^2/sigma^2)
ll
}
x <- rnorm(100, 1, 2) # use mean=1, sd=2
N <- length(x)
res <- maxNR(loglik, start=c(0,1)) # use 'wrong' start values
summary(res)
#> --------------------------------------------
#> Newton-Raphson maximisation
#> Number of iterations: 7
#> Return code: 2
#> successive function values within tolerance limit (tol)
#> Function value: -209.9562
#> Estimates:
#> estimate gradient
#> [1,] 0.8876451 4.831691e-07
#> [2,] 1.9751086 1.108447e-06
#> --------------------------------------------
## The previous example with named parameters and a fixed value
resFix <- maxNR(loglik, start=c(mu=0, sigma=1), fixed="sigma")
summary(resFix) # 'sigma' is exactly 1.000 now.
#> --------------------------------------------
#> Newton-Raphson maximisation
#> Number of iterations: 3
#> Return code: 1
#> gradient close to zero (gradtol)
#> Function value: -286.9465
#> Estimates:
#> estimate gradient
#> mu 0.8876451 -5.684342e-08
#> sigma 1.0000000 NA
#> --------------------------------------------
### Constrained optimization
###
## We maximize exp(-x^2 - y^2) where x+y = 1
hatf <- function(theta) {
x <- theta[1]
y <- theta[2]
exp(-(x^2 + y^2))
## Note: you may prefer exp(- theta %*% theta) instead
}
## use constraints: x + y = 1
A <- matrix(c(1, 1), 1, 2)
B <- -1
res <- maxNR(hatf, start=c(0,0), constraints=list(eqA=A, eqB=B),
control=list(printLevel=1))
print(summary(res))
#> --------------------------------------------
#> Newton-Raphson maximisation
#> Number of iterations: 1
#> Return code: 1
#> gradient close to zero (gradtol)
#> Function value: 0.6065399
#> Estimates:
#> estimate gradient
#> [1,] 0.4999848 -0.6065307
#> [2,] 0.4999848 -0.6065307
#>
#> Constrained optimization based on SUMT
#> Return code: 1
#> penalty close to zero
#> 5 outer iterations, barrier value 9.196992e-10
#> --------------------------------------------