Bare-bones linear model fitting function

fastLm estimates the linear model using one of several methods implemented using the Eigen linear algebra library.

fastLmPure(X, y, method = 0L)
fastLm(X, ...)
# Default S3 method
fastLm(X, y, method = 0L, ...)
# S3 method for class 'formula'
fastLm(formula, data = list(), method = 0L, ...)

Arguments

y: the response vector
X: a model matrix
formula: an object of class "formula" (or one that can be coerced to that class): a symbolic description of the model to be fitted. The details of model specification are given in the ‘Details’ section of the documentation for lm.
data: an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which lm is called.
method: an integer scalar with value 0 for the column-pivoted QR decomposition, 1 for the unpivoted QR decomposition, 2 for the LLT Cholesky, 3 for the LDLT Cholesky, 4 for the Jacobi singular value decomposition (SVD) and 5 for a method based on the eigenvalue-eigenvector decomposition of \(\mathbf{X}^\prime\mathbf{X}\). Default is zero.
...: not used

Details

Linear models should be estimated using the lm function. In some cases, lm.fit may be appropriate.

The fastLmPure function provides a reference use case of the Eigen C++ template library via the wrapper functions in the RcppEigen package.

The fastLm function provides a more standard implementation of a linear model fit, offering both a default and a formula interface as well as print, summary and predict methods.

Internally the fastLm function, by default, uses a QR decomposition with column pivots, which is a rank-revealing decomposition, so that it can handle rank-deficient cases effectively. Other methods for determining least squares solutions are available according to the value of the method argument.

An example of the type of situation requiring extra care in checking for rank deficiency is a two-way layout with missing cells (see the examples section). These cases require a special pivoting scheme of “pivot only on (apparent) rank deficiency” which is not part of conventional linear algebra software.

Value

fastLmPure returns a list with several components:

coefficients: a vector of coefficients
se: a vector of the standard errors of the coefficient estimates
rank: a scalar denoting the computed rank of the model matrix
df.residual: a scalar denoting the degrees of freedom in the model
residuals: the vector of residuals
s: a numeric scalar - the root mean square for residuals
fitted.values: the vector of fitted value

fastLm returns a richer object which also includes the call argument similar to the lm or rlm functions..

References

Douglas Bates and Dirk Eddelbuettel (2013). Fast and Elegant Numerical Linear Algebra Using the RcppEigen Package. Journal of Statistical Software, 52(5), 1-24. URL http://www.jstatsoft.org/v52/i05/.

Author

Eigen is described at http://eigen.tuxfamily.org/index.php?title=Main_Page. RcppEigen is written by Douglas Bates, Dirk Eddelbuettel and Romain Francois.

Examples

  data(trees, package="datasets")
  mm <- cbind(1, log(trees$Girth))   # model matrix
  y  <- log(trees$Volume)            # response

  ## bare-bones direct interface
  flm <- fastLmPure(mm, y)
  print(flm)
#> $coefficients
#> [1] -2.353325  2.199970
#> 
#> $se
#> [1] 0.23066284 0.08983455
#> 
#> $rank
#> [1] 2
#> 
#> $df.residual
#> [1] 29
#> 
#> $residuals
#>  [1]  0.029770334 -0.048343313 -0.108675736 -0.022348590  0.072716847
#>  [6]  0.099013629 -0.174701646 -0.020550966  0.176067980  0.029107001
#> [11]  0.205184486  0.043971170  0.062839654  0.001010516 -0.163706638
#> [16] -0.172405965  0.247962548 -0.029135115 -0.158376200 -0.205999241
#> [21]  0.088437490 -0.027410923  0.062096059 -0.100836986 -0.035300119
#> [26]  0.096435946  0.076549210  0.072452205 -0.063824212 -0.073580387
#> [31]  0.041580960
#> 
#> $s
#> [1] 0.1149578
#> 
#> $fitted.values
#>  [1] 2.302374 2.380487 2.431063 2.819630 2.861140 2.881605 2.921973 2.921973
#>  [9] 2.941882 2.961613 2.981168 3.000551 3.000551 3.057697 3.113395 3.272498
#> [17] 3.272498 3.339678 3.404867 3.420867 3.452522 3.483728 3.529722 3.746287
#> [25] 3.787154 3.918144 3.943431 3.993150 4.005406 4.005406 4.302224
#> 

  ## standard R interface for formula or data returning object of class fastLm
  flmmod <- fastLm( log(Volume) ~ log(Girth), data=trees)
  summary(flmmod)
#> 
#> Call:
#> fastLm.formula(formula = log(Volume) ~ log(Girth), data = trees)
#> 
#> Residuals:
#>      Min.   1st Qu.    Median   3rd Qu.      Max. 
#> -0.206000 -0.068700  0.001011  0.072580  0.248000 
#> 
#>              Estimate Std. Error t value  Pr(>|t|)    
#> (Intercept) -2.353325   0.230663 -10.202  4.18e-11 ***
#> log(Girth)   2.199970   0.089835  24.489 < 2.2e-16 ***
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> 
#> Residual standard error: 0.115 on 29 degrees of freedom
#> Multiple R-squared: 0.9539,	Adjusted R-squared: 0.9523

  ## case where non-rank-revealing methods break down
  dd <- data.frame(f1 = gl(4, 6, labels = LETTERS[1:4]),
                   f2 = gl(3, 2, labels = letters[1:3]))[-(7:8), ]
  xtabs(~ f2 + f1, dd)     # one missing cell
#>    f1
#> f2  A B C D
#>   a 2 0 2 2
#>   b 2 2 2 2
#>   c 2 2 2 2
  mm <- model.matrix(~ f1 * f2, dd)
  kappa(mm)                # large, indicating rank deficiency
#> [1] 9.23756e+16
  set.seed(1)
  dd$y <- mm %*% seq_len(ncol(mm)) + rnorm(nrow(mm), sd = 0.1)
  summary(lm(y ~ f1 * f2, dd))     # detects rank deficiency
#> 
#> Call:
#> lm(formula = y ~ f1 * f2, data = dd)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -0.12155 -0.04702  0.00000  0.04702  0.12155 
#> 
#> Coefficients: (1 not defined because of singularities)
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  0.97786    0.05816   16.81 3.41e-09 ***
#> f1B         12.03807    0.08226  146.35  < 2e-16 ***
#> f1C          3.11722    0.08226   37.90 5.22e-13 ***
#> f1D          4.06852    0.08226   49.46 2.83e-14 ***
#> f2b          5.06012    0.08226   61.52 2.59e-15 ***
#> f2c          5.99759    0.08226   72.91 4.01e-16 ***
#> f1B:f2b     -3.01476    0.11633  -25.92 3.27e-11 ***
#> f1C:f2b      7.70300    0.11633   66.22 1.16e-15 ***
#> f1D:f2b      8.96425    0.11633   77.06  < 2e-16 ***
#> f1B:f2c           NA         NA      NA       NA    
#> f1C:f2c     10.96133    0.11633   94.23  < 2e-16 ***
#> f1D:f2c     12.04108    0.11633  103.51  < 2e-16 ***
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> 
#> Residual standard error: 0.08226 on 11 degrees of freedom
#> Multiple R-squared:  0.9999,	Adjusted R-squared:  0.9999 
#> F-statistic: 1.855e+04 on 10 and 11 DF,  p-value: < 2.2e-16
#> 
  try(summary(fastLm(y ~ f1 * f2, dd))) # also detects rank deficiency
#> 
#> Call:
#> fastLm.formula(formula = y ~ f1 * f2, data = dd)
#> 
#> Residuals:
#>     Min.  1st Qu.   Median  3rd Qu.     Max. 
#> -0.12150 -0.04702  0.00000  0.04702  0.12150 
#> 
#>              Estimate Std. Error t value  Pr(>|t|)    
#> (Intercept)  0.977859   0.058165  16.812 3.413e-09 ***
#> f1B         12.038068   0.082258 146.346 < 2.2e-16 ***
#> f1C          3.117222   0.082258  37.896 5.221e-13 ***
#> f1D          4.068523   0.082258  49.461 2.833e-14 ***
#> f2b          5.060123   0.082258  61.516 2.593e-15 ***
#> f2c          5.997592   0.082258  72.912 4.015e-16 ***
#> f1B:f2b     -3.014763   0.116330 -25.916 3.266e-11 ***
#> f1C:f2b      7.702999   0.116330  66.217 1.156e-15 ***
#> f1D:f2b      8.964251   0.116330  77.059 < 2.2e-16 ***
#> f1B:f2c            NA         NA      NA        NA    
#> f1C:f2c     10.961326   0.116330  94.226 < 2.2e-16 ***
#> f1D:f2c     12.041081   0.116330 103.508 < 2.2e-16 ***
#> ---
#> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#> 
#> Residual standard error: 0.08226 on 11 degrees of freedom
#> Multiple R-squared: 0.9999,	Adjusted R-squared: 0.9999