Redundancy Analysis
redun uses flexible parametric additive models (see areg and its
use of regression splines), or alternatively runs a regular regression
after replacing continuous variables with ranks, to
determine how well each variable can be predicted from the remaining
variables. Variables are dropped in a stepwise fashion, removing the
most predictable variable at each step. The remaining variables are used
to predict. The process continues until no variable still in the list
of predictors can be predicted with an \(R^2\) or adjusted \(R^2\)
of at least r2, or until dropping the variable with the highest
\(R^2\) (adjusted or ordinary) would cause a variable that was dropped
earlier to no longer be predicted at least at the r2 level from
the now smaller list of predictors.
There is also an option qrank to expand each variable into two
columns containing the rank and square of the rank. Whenever ranks are
used, they are computed as fractional ranks for numerical reasons.
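A minimal sketch of this rank expansion in base R (an illustration only, not the exact internal Hmisc computation):

```r
# Fractional ranks as used when rank=TRUE; qrank=TRUE adds squared ranks.
x  <- c(10, 3, 7, 7, 1)
fr <- rank(x) / length(x)              # fractional ranks in (0, 1]; ties averaged
qr <- cbind(rank = fr, rank2 = fr^2)   # two-column expansion as with qrank=TRUE
fr
#> [1] 1.0 0.4 0.7 0.7 0.2
```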
Arguments
- formula
a formula. Enclose a variable in I() to force linearity. Alternately, can be a numeric matrix, in which case the data are not run through dataframeReduce. This is useful when running the data through transcan first for nonlinearly transforming the data.
- data
a data frame, which must be omitted if formula is a matrix
- subset
usual subsetting expression
- r2
ordinary or adjusted \(R^2\) cutoff for redundancy
- type
specify "adjusted" to use adjusted \(R^2\)
- nk
number of knots to use for continuous variables. Use nk=0 to force linearity for all variables.
- tlinear
set to FALSE to allow a variable to be automatically nonlinearly transformed (see areg) while being predicted. By default, only continuous variables on the right-hand side (i.e., while they are being predictors) are automatically transformed, using regression splines. Estimating transformations for target (dependent) variables causes more overfitting than doing so for predictors.
- rank
set to TRUE to replace non-categorical variables with ranks before running the analysis. This causes nk to be set to zero.
- qrank
set to TRUE to also include squares of ranks to allow for non-monotonic transformations
- allcat
set to TRUE to ensure that all categories of categorical variables having more than two categories are redundant (see details below)
- minfreq
for a binary or categorical variable, there must be at least two categories with at least minfreq observations or the variable will be dropped and not checked for redundancy against other variables. minfreq also specifies the minimum frequency of a category or its complement before that category is considered when allcat=TRUE.
- iterms
set to TRUE to consider derived terms (dummy variables and nonlinear spline components) as separate variables. This will perform a redundancy analysis on pieces of the variables.
- pc
if iterms=TRUE you can set pc to TRUE to replace the submatrix of terms corresponding to each variable with its orthogonal principal components before doing the redundancy analysis. The components are based on the correlation matrix.
- pr
set to TRUE to monitor progress of the stepwise algorithm
- ...
arguments to pass to dataframeReduce to remove "difficult" variables from data if formula is ~. to use all variables in data (data must be specified when these arguments are used). Ignored for print.
- x
an object created by redun
- digits
number of digits to which to round \(R^2\) values when printing
- long
set to FALSE to prevent the print method from printing the \(R^2\) history and the original \(R^2\) with which each variable can be predicted from ALL other variables.
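To illustrate a few of these arguments together, here is a hedged sketch of common call patterns; the data frame d and its variables are made up for illustration and are not from this help page:

```r
# Illustrative calls exercising nk and type; d is a hypothetical data frame.
set.seed(3)
d <- data.frame(a = runif(50), b = runif(50))
d$cc <- d$a + d$b + runif(50) / 20     # cc is nearly a linear combination of a, b
if (requireNamespace("Hmisc", quietly = TRUE)) {
  Hmisc::redun(~ ., data = d, r2 = 0.8, nk = 0)            # force linearity everywhere
  Hmisc::redun(~ ., data = d, r2 = 0.8, type = "adjusted") # adjusted R^2 cutoff
}
```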
Value
an object of class "redun". It includes an element "scores", a numeric matrix holding the transformed values of each variable from the step in which that variable was the dependent variable and its first canonical variate was computed.
Details
A categorical variable is deemed
redundant if a linear combination of dummy variables representing it can
be predicted from a linear combination of other variables. For example,
if there were 4 cities in the data and each city's rainfall was also
present as a variable, with virtually the same rainfall reported for all
observations for a city, city would be redundant given rainfall (or
vice-versa; the one declared redundant would be the first one in the
formula). If two cities had the same rainfall, city might be
declared redundant even though tied cities might be deemed non-redundant
in another setting. To ensure that all categories may be predicted well
from other variables, use the allcat option. To ignore
categories that are too infrequent or too frequent, set minfreq
to a nonzero integer. When the number of observations in the category
is below this number or the number of observations not in the category
is below this number, no attempt is made to predict observations being
in that category individually for the purpose of redundancy detection.
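The city/rainfall situation described above can be simulated as follows; the variable names and rainfall constants are illustrative, not from the package:

```r
# Four cities, each with an essentially constant rainfall value, so that
# city and rainfall carry nearly the same information.
set.seed(4)
city <- factor(sample(c('A', 'B', 'C', 'D'), 200, replace = TRUE))
rain <- c(A = 30, B = 45, C = 10, D = 60)[as.character(city)] +
        rnorm(200, sd = 0.01)
if (requireNamespace("Hmisc", quietly = TRUE))
  print(Hmisc::redun(~ city + rain, r2 = 0.9))  # one of the two should be flagged
```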
Author
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
See also
areg, dataframeReduce,
transcan, varclus, r2describe,
subselect::genetic
Examples
set.seed(1)
n <- 100
x1 <- runif(n)
x2 <- runif(n)
x3 <- x1 + x2 + runif(n)/10
x4 <- x1 + x2 + x3 + runif(n)/10
x5 <- factor(sample(c('a','b','c'),n,replace=TRUE))
x6 <- 1*(x5=='a' | x5=='c')
redun(~x1+x2+x3+x4+x5+x6, r2=.8)
#>
#> Redundancy Analysis
#>
#> ~x1 + x2 + x3 + x4 + x5 + x6
#> <environment: 0x5e2f7e1f6cb0>
#>
#> n: 100 p: 6 nk: 3
#>
#> Number of NAs: 0
#>
#> Transformation of target variables forced to be linear
#>
#> R-squared cutoff: 0.8 Type: ordinary
#>
#> R^2 with which each variable can be predicted from all other variables:
#>
#> x1 x2 x3 x4 x5 x6
#> 0.994 0.994 0.998 0.999 1.000 1.000
#>
#> Rendundant variables:
#>
#> x5 x4 x3
#>
#>
#> Predicted from variables:
#>
#> x1 x2 x6
#>
#> Variable Deleted R^2 R^2 after later deletions
#> 1 x5 1.000 1 1
#> 2 x4 0.999 0.997
#> 3 x3 0.995
redun(~x1+x2+x3+x4+x5+x6, r2=.8, minfreq=40)
#>
#> Redundancy Analysis
#>
#> ~x1 + x2 + x3 + x4 + x5 + x6
#> <environment: 0x5e2f7e1f6cb0>
#>
#> n: 100 p: 4 nk: 3
#>
#> Number of NAs: 0
#>
#> Transformation of target variables forced to be linear
#>
#> Minimum category frequency required for retention of a binary or
#> categorical variable: 40
#>
#> Binary or categorical variables removed because of inadequate frequencies:
#>
#> x5 x6
#>
#> R-squared cutoff: 0.8 Type: ordinary
#>
#> R^2 with which each variable can be predicted from all other variables:
#>
#> x1 x2 x3 x4
#> 0.994 0.994 0.998 0.999
#>
#> Rendundant variables:
#>
#> x4 x3
#>
#>
#> Predicted from variables:
#>
#> x1 x2
#>
#> Variable Deleted R^2 R^2 after later deletions
#> 1 x4 0.999 0.997
#> 2 x3 0.995
redun(~x1+x2+x3+x4+x5+x6, r2=.8, allcat=TRUE)
#>
#> Redundancy Analysis
#>
#> ~x1 + x2 + x3 + x4 + x5 + x6
#> <environment: 0x5e2f7e1f6cb0>
#>
#> n: 100 p: 6 nk: 3
#>
#> Number of NAs: 0
#>
#> Transformation of target variables forced to be linear
#>
#> All levels of a categorical variable had to be redundant before the
#> variable was declared redundant
#>
#> R-squared cutoff: 0.8 Type: ordinary
#>
#> R^2 with which each variable can be predicted from all other variables:
#>
#> x1 x2 x3 x4 x5 x6
#> 0.994 0.994 0.998 0.999 0.221 1.000
#>
#> (For categorical variables the minimum R^2 for any sufficiently
#> frequent dummy variable is displayed)
#>
#>
#> Rendundant variables:
#>
#> x6 x4 x3
#>
#>
#> Predicted from variables:
#>
#> x1 x2 x5
#>
#> Variable Deleted R^2 R^2 after later deletions
#> 1 x6 1.000 1 1
#> 2 x4 0.999 0.997
#> 3 x3 0.995
# x5 is no longer redundant but x6 is
redun(~x1+x2+x3+x4+x5+x6, r2=.8, rank=TRUE)
#>
#> Redundancy Analysis
#>
#> ~x1 + x2 + x3 + x4 + x5 + x6
#> <environment: 0x5e2f7e1f6cb0>
#>
#> n: 100 p: 6 nk: 0
#>
#> Number of NAs: 0
#>
#> Analysis used ranks
#>
#> Transformation of target variables forced to be linear in the ranks
#>
#> R-squared cutoff: 0.8 Type: ordinary
#>
#> R^2 with which each variable can be predicted from all other variables:
#>
#> x1 x2 x3 x4 x5 x6
#> 0.994 0.994 0.998 0.999 1.000 1.000
#>
#> Rendundant variables:
#>
#> x5 x4 x3
#>
#>
#> Predicted from variables:
#>
#> x1 x2 x6
#>
#> Variable Deleted R^2 R^2 after later deletions
#> 1 x5 1.000 1 1
#> 2 x4 0.999 0.997
#> 3 x3 0.995
redun(~x1+x2+x3+x4+x5+x6, r2=.8, qrank=TRUE)
#>
#> Redundancy Analysis
#>
#> ~x1 + x2 + x3 + x4 + x5 + x6
#> <environment: 0x5e2f7e1f6cb0>
#>
#> n: 100 p: 6 nk: 0
#>
#> Number of NAs: 0
#>
#> Analysis used ranks and square of ranks
#>
#> Transformation of target variables forced to be linear in the ranks
#>
#> R-squared cutoff: 0.8 Type: ordinary
#>
#> R^2 with which each variable can be predicted from all other variables:
#>
#> x1 x2 x3 x4 x5 x6
#> 0.907 0.914 0.994 0.995 1.000 1.000
#>
#> Rendundant variables:
#>
#> x5 x4 x3
#>
#>
#> Predicted from variables:
#>
#> x1 x2 x6
#>
#> Variable Deleted R^2 R^2 after later deletions
#> 1 x5 1.000 1 1
#> 2 x4 0.995 0.952
#> 3 x3 0.949
# To help decode which variables made a particular variable redundant:
# r <- redun(...)
# r2describe(r$scores)