lm.fit.sparse: A basic computing engine for sparse linear least squares regression.
Note that the exact interface (arguments, return value) is currently experimental and bound to change. Use at your own risk!
Usage:

lm.fit.sparse(x, y, w = NULL, offset = NULL,
              method = c("qr", "cholesky"),
              tol = 1e-7, singular.ok = TRUE, order = NULL,
              transpose = FALSE)

Arguments:

x: sparse design matrix of dimension n * p, i.e., an R object of a class
   extending dsparseMatrix; typically the result of sparse.model.matrix.
y: vector of observations of length n, or a matrix with n rows.
w: vector of weights (length n) to be used in the fitting process.
   Weighted least squares is used with weights w, i.e., sum(w * e^2)
   is minimized.
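As an illustration (this snippet is not part of the original examples and uses only base R), minimizing sum(w * e^2) is the same as ordinary least squares after scaling each row of the design matrix and response by sqrt(w):

```r
## Illustration (base R only): weighted least squares with weights w is
## equivalent to ordinary least squares on sqrt(w)-scaled rows.
set.seed(1)
X <- cbind(1, rnorm(20))                      # small dense design matrix
y <- drop(X %*% c(2, -1)) + rnorm(20)
w <- runif(20, 0.5, 2)                        # positive weights

cw <- coef(lm.wfit(X, y, w))                  # weighted least squares
cs <- coef(lm.fit(sqrt(w) * X, sqrt(w) * y))  # scaled ordinary LS
all.equal(unname(cw), unname(cs))             # TRUE: identical fits
```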
offset: numeric of length n (or NULL). This can be used to specify an
   a priori known component to be included in the linear predictor
   during fitting. Not yet implemented!
method: a character string specifying the (factorization) method.
   Currently, "qr" or "cholesky".
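As a numerical aside (not in the original text): a Cholesky-based fit works with the cross-product X'X, whose condition number is essentially the square of X's, which is why a QR factorization is generally safer for nearly collinear designs. A base-R sketch:

```r
## Aside (base R): solving via the normal equations X'X squares the
## condition number, so QR is numerically preferable for ill-conditioned
## design matrices.
X  <- cbind(1, 1:10, (1:10)^2)            # moderately ill-conditioned
k1 <- kappa(X, exact = TRUE)              # 2-norm condition number of X
k2 <- kappa(crossprod(X), exact = TRUE)   # condition number of X'X
all.equal(k2, k1^2, tolerance = 1e-6)     # TRUE: it is the square
```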
tol: [for back-compatibility only; unused:] tolerance for the qr
   decomposition. Default is 1e-7.

singular.ok: [for back-compatibility only; unused:] logical. If FALSE,
   a singular model is an error.
order: integer or NULL; for method == "qr", determines how the
   fill-reducing ordering (aka permutation) for the "symbolic" part is
   computed (in cs_amd()), with the options 0: natural, 1: Chol,
   2: LU, and 3: QR, where 3 is the default.
transpose: logical; if true, use the transposed matrix t(x) instead of x.
Value:

Either a single numeric vector or a list of four numeric vectors: as the
examples below show, the default "qr" fit returns the bare coefficient
vector, whereas the "cholesky" fit returns a list with components
including coef and residuals.
See also:

glm4, an alternative and (much) more general fitting function;
sparse.model.matrix from the Matrix package; and the non-sparse function
lm.fit() in standard R's stats package.
Examples:

dd <- expand.grid(a = as.factor(1:3),
                  b = as.factor(1:4),
                  c = as.factor(1:2),
                  d = as.factor(1:8))
n <- nrow(dd <- dd[rep(seq_len(nrow(dd)), each = 10), ])
set.seed(17)
dM <- cbind(dd, x = round(rnorm(n), 1))
## randomly drop some
n <- nrow(dM <- dM[- sample(n, 50),])
dM <- within(dM, { A <- c(2,5,10)[a]
B <- c(-10,-1, 3:4)[b]
C <- c(-8,8)[c]
D <- c(10*(-5:-2), 20*c(0, 3:5))[d]
Y <- A + B + A*B + C + D + A*D + C*x + rnorm(n)/10
wts <- sample(1:10, n, replace=TRUE)
rm(A,B,C,D)
})
str(dM) # 1870 x 7
#> 'data.frame': 1870 obs. of 7 variables:
#> $ a : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
#> $ b : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
#> $ c : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
#> $ d : Factor w/ 8 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
#> $ x : num -1 -0.1 -0.2 -0.8 0.8 -0.2 1 1.7 0.3 0.4 ...
#> $ wts: int 6 1 7 7 5 9 5 4 5 8 ...
#> $ Y : num -178 -185 -184 -180 -192 ...
X <- Matrix::sparse.model.matrix( ~ (a+b+c+d)^2 + c*x, data = dM)
dim(X) # 1870 x 69
#> [1] 1870 69
X[1:10, 1:20]
#> 10 x 20 sparse Matrix of class "dgCMatrix"
#> [[ suppressing 20 column names ‘(Intercept)’, ‘a2’, ‘a3’ ... ]]
#>
#> 1 1 . . . . . . . . . . . . . -1.0 . . . . .
#> 1.1 1 . . . . . . . . . . . . . -0.1 . . . . .
#> 1.2 1 . . . . . . . . . . . . . -0.2 . . . . .
#> 1.3 1 . . . . . . . . . . . . . -0.8 . . . . .
#> 1.4 1 . . . . . . . . . . . . . 0.8 . . . . .
#> 1.5 1 . . . . . . . . . . . . . -0.2 . . . . .
#> 1.6 1 . . . . . . . . . . . . . 1.0 . . . . .
#> 1.7 1 . . . . . . . . . . . . . 1.7 . . . . .
#> 1.8 1 . . . . . . . . . . . . . 0.3 . . . . .
#> 1.9 1 . . . . . . . . . . . . . 0.4 . . . . .
## For now, use 'MatrixModels:::' --- TODO : export once interface is clear!
Xd <- as(X,"matrix")
system.time(fmDense <- lm.fit(Xd, y = dM[,"Y"])) # {base} functionality
#> user system elapsed
#> 0.003 0.000 0.003
system.time( r1 <- MatrixModels:::lm.fit.sparse(X, y = dM[,"Y"]) ) # *is* faster
#> user system elapsed
#> 0.003 0.000 0.003
stopifnot(all.equal(r1, unname(fmDense$coeff), tolerance = 1e-12))
system.time(
r2 <- MatrixModels:::lm.fit.sparse(X, y = dM[,"Y"], method = "chol") )
#> user system elapsed
#> 0.002 0.000 0.002
stopifnot(all.equal(r1, r2$coef, tolerance = 1e-12),
all.equal(fmDense$residuals, r2$residuals, tolerance=1e-9)
)
## with weights:
system.time(fmD.w <- with(dM, lm.wfit(Xd, Y, w = wts)))
#> user system elapsed
#> 0.004 0.000 0.004
system.time(fm.w1 <- with(dM, MatrixModels:::lm.fit.sparse(X, Y, w = wts)))
#> user system elapsed
#> 0.003 0.000 0.003
system.time(fm.w2 <- with(dM, MatrixModels:::lm.fit.sparse(X, Y, w = wts,
method = "chol") ))
#> user system elapsed
#> 0.002 0.000 0.002
stopifnot(all.equal(fm.w1, unname(fmD.w$coeff), tolerance = 1e-12),
all.equal(fm.w2$coef, fm.w1, tolerance = 1e-12),
all.equal(fmD.w$residuals, fm.w2$residuals, tolerance=1e-9)
)
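The sparse fits above can also be reproduced without MatrixModels. A minimal sketch assuming only the Matrix package (which ships with standard R), solving the normal equations X'X b = X'y as a "cholesky"-style fit does:

```r
## Sketch (Matrix package only): a Cholesky-style sparse fit solves the
## normal equations  X'X b = X'y  entirely with sparse matrices.
library(Matrix)
set.seed(2)
d <- data.frame(g = gl(3, 10), x = rnorm(30))
X <- sparse.model.matrix(~ g + x, data = d)       # 30 x 4 dgCMatrix
y <- rnorm(30)

b.sparse <- solve(crossprod(X), crossprod(X, y))  # sparse normal equations
b.dense  <- coef(lm.fit(as.matrix(X), y))         # dense base-R reference
all.equal(as.vector(b.sparse), unname(b.dense))   # TRUE: same solution
```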