Representative Curves

curveRep finds representative curves from a relatively large collection of curves. The curves usually represent time-response profiles as in serial (longitudinal or repeated) data with possibly unequal time points and greatly varying sample sizes per subject. After excluding records containing missing x or y, records are first stratified into kn groups having similar sample sizes per curve (subject). Within these strata, curves are next stratified according to the distribution of x points per curve (typically measurement times per subject). The clara clustering/partitioning function is used to do this, clustering on one, two, or three x characteristics depending on the minimum sample size in the current interval of sample size. If the interval has a minimum number of unique values of one, clustering is done on the single x values. If the minimum number of unique x values is two, clustering is done to create groups that are similar on both min(x) and max(x). For groups containing no fewer than three unique x values, clustering is done on the trio of values min(x), max(x), and the longest gap between any successive x. Then within sample size and x distribution strata, clustering of time-response profiles is based on p values of y all evaluated at the same p equally-spaced x's within the stratum. An option allows per-curve data to be smoothed with lowess before proceeding. Outer x values are taken as extremes of x across all curves within the stratum. Linear interpolation within curves is used to estimate y at the grid of x's. For curves within the stratum that do not extend to the most extreme x values in that stratum, extrapolation uses flat lines from the observed extremes in the curve unless extrap=TRUE. The p y values are clustered using clara.

print and plot methods show results. By specifying an auxiliary idcol variable to plot, other variables such as treatment may be depicted to allow the analyst to determine for example whether subjects on different treatments are assigned to different time-response profiles. To write the frequencies of a variable such as treatment in the upper left corner of each panel (instead of the grand total number of clusters in that panel), specify freq.

curveSmooth takes a set of curves and smooths them using lowess. If the number of unique x points in a curve is less than p, the smooth is evaluated at the unique x values. Otherwise it is evaluated at an equally spaced set of x points over the observed range. If fewer than 3 unique x values are in a curve, those points are used and smoothing is not done.

Usage

curveRep(x, y, id, kn = 5, kxdist = 5, k = 5, p = 5,
         force1 = TRUE, metric = c("euclidean", "manhattan"),
         smooth=FALSE, extrap=FALSE, pr=FALSE)

# S3 method for class 'curveRep'
print(x, ...)

# S3 method for class 'curveRep'
plot(x, which=1:length(res),
                        method=c('all','lattice','data'),
                        m=NULL, probs=c(.5, .25, .75), nx=NULL, fill=TRUE,
                        idcol=NULL, freq=NULL, plotfreq=FALSE,
                        xlim=range(x), ylim=range(y),
                        xlab='x', ylab='y', colorfreq=FALSE, ...)
curveSmooth(x, y, id, p=NULL, pr=TRUE)

Arguments

x: a numeric vector, typically measurement times. For plot.curveRep is an object created by curveRep.
y: a numeric vector of response values
id: a vector of curve (subject) identifiers, the same length as x and y
kn: number of curve sample size groups to construct. curveRep tries to divide the data into equal numbers of curves across sample size intervals.
kxdist: maximum number of x-distribution clusters to derive using clara
k: maximum number of x-y profile clusters to derive using clara
p: number of x points at which to interpolate y for profile clustering. For curveSmooth is the number of equally spaced points at which to evaluate the lowess smooth, and if p is omitted the smooth is evaluated at the original x values (which will allow curveRep to still know the x distribution
force1: By default if any curves have only one point, all curves consisting of one point will be placed in a separate stratum. To prevent this separation, set force1 = FALSE.
metric: see clara
smooth: By default, linear interpolation is used on raw data to obtain y values to cluster to determine x-y profiles. Specify smooth = TRUE to replace observed points with lowess before computing y points on the grid. Also, when smooth is used, it may be desirable to use extrap=TRUE.
extrap: set to TRUE to use linear extrapolation to evaluate y points for x-y clustering. Not recommended unless smoothing has been or is being done.
pr: set to TRUE to print progress notes
which: an integer vector specifying which sample size intervals to plot. Must be specified if method='lattice' and must be a single number in that case.
method: The default makes individual plots of possibly all x-distribution by sample size by cluster combinations. Fewer may be plotted by specifying which. Specify method='lattice' to show a lattice xyplot of a single sample size interval, with x distributions going across and clusters going down. To not plot but instead return a data frame for a single sample size interval, specify method='data'
m: the number of curves in a cluster to randomly sample if there are more than m in a cluster. Default is to draw all curves in a cluster. For method = "lattice" you can specify m = "quantiles" to use the xYplot function to show quantiles of y as a function of x, with the quantiles specified by the probs argument. This cannot be used to draw a group containing n = 1.
nx: applies if m = "quantiles". See xYplot.
probs: 3-vector of probabilities with the central quantile first. Default uses quartiles.
fill: for method = "all", by default if a sample size x-distribution stratum did not have enough curves to stratify into k x-y profiles, empty graphs are drawn so that a matrix of graphs will have the next row starting with a different sample size range or x-distribution. See the example below.
idcol: a named vector to be used as a table lookup for color assignments (does not apply when m = "quantile"). The names of this vector are curve ids and the values are color names or numbers.
freq: a named vector to be used as a table lookup for a grouping variable such as treatment. The names are curve ids and values are any values useful for grouping in a frequency tabulation.
plotfreq: set to TRUE to plot the frequencies from the freq variable as horizontal bars instead of printing them. Applies only to method = "lattice". By default the largest bar is 0.1 times the length of a panel's x-axis. Specify plotfreq = 0.5 for example to make the longest bar half this long.
colorfreq: set to TRUE to color the frequencies printed by plotfreq using the colors provided by idcol.
xlim, ylim, xlab, ylab: plotting parameters. Default ranges are the ranges in the entire set of raw data given to curveRep.
...: arguments passed to other functions.

Value

a list of class "curveRep" with the following elements

res: a hierarchical list first split by sample size intervals, then by x distribution clusters, then containing a vector of cluster numbers with id values as a names attribute
ns: a table of frequencies of sample sizes per curve after removing NAs
nomit: total number of records excluded due to NAs
missfreq: a table of frequencies of number of NAs excluded per curve
ncuts: cut points for sample size intervals
kn: number of sample size intervals
kxdist: number of clusters on x distribution
k: number of clusters of curves within sample size and distribution groups
p: number of points at which to evaluate each curve for clustering
x
y
id: input data after removing NAs

curveSmooth returns a list with elements x,y,id.

Details

In the graph titles for the default graphic output, n refers to the minimum sample size, x refers to the sequential x-distribution cluster, and c refers to the sequential x-y profile cluster. Graphs from method = "lattice" are produced by xyplot and in the panel titles distribution refers to the x-distribution stratum and cluster refers to the x-y profile cluster.

References

Segal M. (1994): Representative curves for longitudinal data via regression trees. J Comp Graph Stat 3:214-233.

Jones MC, Rice JA (1992): Displaying the important features of large collections of similar curves. Am Statistician 46:140-145.

Zheng X, Simpson JA, et al (2005): Data from a study of effectiveness suggested potential prognostic factors related to the patterns of shoulder pain. J Clin Epi 58:823-830.

Author

Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com

Note

The references describe other methods for deriving representative curves, but those methods were not used here. The last reference which used a cluster analysis on principal components motivated curveRep however. The kml package does k-means clustering of longitudinal data with imputation.

Examples

if (FALSE) { # \dontrun{
# Simulate 200 curves with per-curve sample sizes ranging from 1 to 10
# Make curves with odd-numbered IDs have an x-distribution that is random
# uniform [0,1] and those with even-numbered IDs have an x-dist. that is
# half as wide but still centered at 0.5.  Shift y values higher with
# increasing IDs
set.seed(1)
N <- 200
nc <- sample(1:10, N, TRUE)
id <- rep(1:N, nc)
x <- y <- id
for(i in 1:N) {
  x[id==i] <- if(i %% 2) runif(nc[i]) else runif(nc[i], c(.25, .75))
  y[id==i] <- i + 10*(x[id==i] - .5) + runif(nc[i], -10, 10)
}

w <- curveRep(x, y, id, kxdist=2, p=10)
w
par(ask=TRUE, mfrow=c(4,5))
plot(w)                # show everything, profiles going across
par(mfrow=c(2,5))
plot(w,1)              # show n=1 results
# Use a color assignment table, assigning low curves to green and
# high to red.  Unique curve (subject) IDs are the names of the vector.
cols <- c(rep('green', N/2), rep('red', N/2))
names(cols) <- as.character(1:N)
plot(w, 3, idcol=cols)
par(ask=FALSE, mfrow=c(1,1))

plot(w, 1, 'lattice')  # show n=1 results
plot(w, 3, 'lattice')  # show n=4-5 results
plot(w, 3, 'lattice', idcol=cols)  # same but different color mapping
plot(w, 3, 'lattice', m=1)  # show a single "representative" curve
# Show median, 10th, and 90th percentiles of supposedly representative curves
plot(w, 3, 'lattice', m='quantiles', probs=c(.5,.1,.9))
# Same plot but with much less grouping of x variable
plot(w, 3, 'lattice', m='quantiles', probs=c(.5,.1,.9), nx=2)

# Use ggplot2 for one sample size interval
z <- plot(w, 2, 'data')
require(ggplot2)
ggplot(z, aes(x, y, color=curve)) + geom_line() +
       facet_grid(distribution ~ cluster) +
       theme(legend.position='none') +
       labs(caption=z$ninterval[1])


# Smooth data before profiling.  This allows later plotting to plot
# smoothed representative curves rather than raw curves (which
# specifying smooth=TRUE to curveRep would do, if curveSmooth was not used)
d <- curveSmooth(x, y, id)
w <- with(d, curveRep(x, y, id))

# Example to show that curveRep can cluster profiles correctly when
# there is no noise.  In the data there are four profiles - flat, flat
# at a higher mean y, linearly increasing then flat, and flat at the
# first height except for a sharp triangular peak

set.seed(1)
x <- 0:100
m <- length(x)
profile <- matrix(NA, nrow=m, ncol=4)
profile[,1] <- rep(0, m)
profile[,2] <- rep(3, m)
profile[,3] <- c(0:3, rep(3, m-4))
profile[,4] <- c(0,1,3,1,rep(0,m-4))
col <- c('black','blue','green','red')
matplot(x, profile, type='l', col=col)
xeval <- seq(0, 100, length.out=5)
s <- x 
matplot(x[s], profile[s,], type='l', col=col)

id <- rep(1:100, each=m)
X <- Y <- id
cols <- character(100)
names(cols) <- as.character(1:100)
for(i in 1:100) {
  s <- id==i
  X[s] <- x
  j <- sample(1:4,1)
  Y[s] <- profile[,j]
  cols[i] <- col[j]
}
table(cols)
yl <- c(-1,4)
w <- curveRep(X, Y, id, kn=1, kxdist=1, k=4)
plot(w, 1, 'lattice', idcol=cols, ylim=yl)
# Found 4 clusters but two have same profile
w <- curveRep(X, Y, id, kn=1, kxdist=1, k=3)
plot(w, 1, 'lattice', idcol=cols, freq=cols, plotfreq=TRUE, ylim=yl)
# Incorrectly combined black and red because default value p=5 did
# not result in different profiles at x=xeval
w <- curveRep(X, Y, id, kn=1, kxdist=1, k=4, p=40)
plot(w, 1, 'lattice', idcol=cols, ylim=yl)
# Found correct clusters because evaluated curves at 40 equally
# spaced points and could find the sharp triangular peak in profile 4
} # }