Sorting: Sort R vector in-RAM and in-place

Function ramsort will sort the input vector in-place (without making a copy) and return the number of NAs found

# Default S3 method
ramsort(x, has.na = TRUE, na.last = TRUE, decreasing = FALSE
, optimize = c("time", "memory"), VERBOSE = FALSE, ...)
# Default S3 method
mergesort(x, has.na = TRUE, na.last = TRUE, decreasing = FALSE, ...)
# Default S3 method
radixsort(x, has.na = TRUE, na.last = TRUE, decreasing = FALSE, ...)
# Default S3 method
keysort(x, keyrange=range(x, na.rm=has.na), has.na = TRUE
, na.last = TRUE, decreasing = FALSE, ...)
# Default S3 method
shellsort(x, has.na = TRUE, na.last = TRUE, decreasing = FALSE, ...)

Arguments

x: an atomic R vector
keyrange: an integer vector with two values giving the smallest and largest possible value in x, note that you should give this explicitely for best performance, relying on the default needs one pass over the data to determine the range
has.na: boolean scalar telling ramsort whether the vector might contain NAs. Note that you risk a crash if there are unexpected NAs with has.na=FALSE
na.last: boolean scalar telling ramsort whether to sort NAs last or first. Note that 'boolean' means that there is no third option NA as in sort
decreasing: boolean scalar telling ramsort whether to sort increasing or decreasing
optimize: by default ramsort optimizes for 'time' which requires more RAM, set to 'memory' to minimize RAM requirements and sacrifice speed
VERBOSE: cat some info about chosen method
...: ignored

Details

Function ramsort is a front-end to a couple of single-threaded sorting algorithms that have been carefully implemented to be fast with and without NAs.
The default is a mergesort algorithm without copying (Sedgewick 8.4) for integer and double data which requires 2x the RAM of its input vector (character or complex data are not supported). Mergesort is fast, stable with a reliable runtime.
For integer data longer than a certain length we improve on mergesort by using a faster LSD radixsort algorithm (Sedgewick 10.5) that uses 2x the RAM of its input vector plus 65536+1 integers.
For booleans, logicals, integers at or below the resolution of smallint and for factors below a certain number of levels we use a key-index sort instead of mergesort or radix sort (note that R has a (slower) key-index sort in sort.list available with confusingly named method='radix' but the standard sort does not leverage it for factors (2-11.1). If you call keysort directly, you should provide a known 'keyrange' directly to obtain the full speed.
Finally the user can request a sort method that minimizes memory use at the price of longer computation time with optimize='memory' – currently a shellsort.

Value

integer scalar with the number of NAs. This is always 0 with has.na=FALSE

References

Robert Sedgewick (1997). Algorithms in C, Third edition. Addison-Wesley.

Author

Jens Oehlschlägel

Note

This function is called for its side-effects and breaks the functional programming paradigm. Use with care.

Examples

   n <- 50
   x <- sample(c(NA, NA, 1:26), n, TRUE)
   sort(x)
#>  [1]  1  1  2  2  3  5  6  6  7  8  8 10 10 11 12 12 12 13 14 14 15 15 15 16 16
#> [26] 17 17 18 18 18 19 19 19 20 20 20 21 22 23 25 25 25 26 26
   ramsort(x)
#> [1] 6
   x
#>  [1]  1  1  2  2  3  5  6  6  7  8  8 10 10 11 12 12 12 13 14 14 15 15 15 16 16
#> [26] 17 17 18 18 18 19 19 19 20 20 20 21 22 23 25 25 25 26 26 NA NA NA NA NA NA

   if (FALSE) { # \dontrun{
      message("Note how the datatype influences sorting speed")
      n <- 5e6
      x <- sample(1:26, n, TRUE)

      y <- as.double(x)
      system.time(ramsort(y))

      y <- as.integer(x)
      system.time(ramsort(y))

      y <- as.short(x)
      system.time(ramsort(y))

      y <- as.factor(letters)[x]
      system.time(ramsort(y))
   } # }