Abstract
R has always provided an application programming interface (API) for extensions. Based on the C language, it uses a number of macros and other low-level constructs to exchange data structures between the R process and any dynamically-loaded component modules authors added to it. With the introduction of the Rcpp package, and its later refinements, this process has become considerably easier yet also more robust. By now, Rcpp has become the most popular extension mechanism for R. This article introduces Rcpp, and illustrates with several examples how the Rcpp Attributes mechanism in particular eases the transition of objects between R and C++ code.
The R language and environment has established itself as both an increasingly dominant facility for data analysis, and the lingua franca for statistical computing in both research and application settings.
Since the beginning, and as we argue below, “by design”, the R system has always provided an application programming interface (API) suitable for extending R with code written in C or Fortran. Being implemented chiefly in R and C (with a generous sprinkling of Fortran for well-established numerical subroutines), R has always been extensible via a C interface. Both the actual implementation and the interface use a number of macros and other low-level constructs to exchange data structures between the R process and any dynamically-loaded component modules authors added to it.
A C interface will generally also be accessible to other languages. Particularly noteworthy here is the C++ language, developed originally as a ‘better C’, which is by its design very interoperable with C. And with the introduction of the Rcpp package, and its later refinements, this process of extending R has become considerably easier yet also more robust. To date, Rcpp has become the most popular extension system for R. This article introduces Rcpp, and illustrates with several examples how the Rcpp Attributes mechanism in particular eases the transition of objects between R and C++ code.
Chambers (2008) provides a very thorough discussion of desirable traits for a system designed to program with data, and the R system in particular. Two key themes motivate the introductory discussion. First, the Mission is to aid exploration in order to provide the best platform to analyse data: “to boldly go where no one has gone before.” Second, the Prime Directive is that the software systems we build must be trustworthy: “the many computational steps between original data source and displayed result must all be trustful.” The remainder of the book then discusses R, leading to two final chapters on interfaces.
Chambers (2016) builds and expands on this theme. Two core facets of what “makes” R are carried over from the previous book. The first states what R is composed of: Everything that exists in R is an object. The second states how these objects are created or altered: Everything that happens in R is a function call. A third statement is now added: Interfaces to other software are part of R.
This last addition is profound. If and when suitable and performant software for a task exists, it is in fact desirable to have a (preferably also performant) interface to this software from R. Chambers (2016) discusses several possible approaches for simpler interfaces and illustrates them with reference implementations for both Python and Julia. However, the most performant interface for R is provided at the C subroutine level, and rather than discussing the older C interface for R, the book prefers to discuss Rcpp. This article follows the same school of thought and aims to introduce Rcpp to analysts and data scientists, aiming to enable them to use, and create, further C++ interfaces for R which aid the mission while staying true to the prime directive. Adding interfaces in such a way is in fact a natural progression from the earliest designs of R's predecessor S, which was after all designed to provide a more usable ‘interface’ to underlying routines written in Fortran.
The rest of the paper is structured as follows. We start by discussing possible first steps, chiefly to validate correct installations. This is followed by an introduction to simple C++ functions, a comparison to the C API, a discussion of packaging with Rcpp and a linear algebra example. The appendix contains some empirical illustrations of the adoption of Rcpp.
Rcpp is a CRAN package and can be installed by using install.packages('Rcpp') just like any other R package. On some operating systems this will download pre-compiled binary packages; on others an installation from source will be attempted. But Rcpp is a little different from many standard R packages in one important aspect: it helps the user to write C(++) programs more easily. The key aspect to note here is programs: to operate, Rcpp needs not only R but also an additional toolchain of a compiler, linker and more in order to be able to create binary object code extending R.
We note that this requirement is no different from what is needed with base R when compilation of extensions is attempted. How to achieve this using only base R is described in some detail in the Writing R Extensions manual that is included with R. As for the toolchain requirements, on Linux and macOS, all required components are likely to be present. macOS can offer additional challenges as toolchain elements can be obtained in different ways. Some of these are addressed in the Rcpp FAQ in sections 2.10 and 2.16. On Windows, users will have to install the Rtools kit provided by R Core, available at https://cran.r-project.org/bin/windows/Rtools/. Details of these installation steps are beyond the scope of this paper. However, many external resources exist that provide detailed installation guides for toolchains on Windows and macOS.
As a first step, and chiefly to establish that the toolchain is set up correctly, consider a minimal use case such as the following:
library("Rcpp")
evalCpp("2 + 2")
## [1] 4
Here the Rcpp package is loaded first via the library() function. Next, we deploy one of its simplest functions, evalCpp(), which is described in the Rcpp Attributes vignette. It takes its first (and often only) argument, a character object, and evaluates it as a minimal C++ expression. The value assignment and return are implicit, as is the addition of a trailing semicolon and more. In fact, evalCpp() surrounds the expression with the required ‘glue’ to make it a minimal source file which can be compiled, linked and loaded. The exact details behind this process are shown in depth when the verbose option of the function is set. If everything is set up correctly, the value of the newly-created C++ expression will be returned.
While such a simple expression is not interesting in itself, it serves a useful purpose here to unequivocally establish whether Rcpp is correctly set up. Having accomplished that, we can proceed to the next step of creating simple functions.
As a first example, consider determining whether a number is odd or even. The default practice is to use modular arithmetic to check whether a remainder exists when dividing by two. Within R, this can be implemented as follows:

isOddR <- function(num = 10L) {
    result <- (num %% 2 == 1L)
    return(result)
}
isOddR()
## [1] FALSE
The operator %% implements the $\bmod$ operation in R. For the default (integer) argument of ten used in the example, ten modulo two results in zero, which is then mapped to FALSE in the context of a logical expression.
Translating this implementation into C++, several small details have to be considered. First and foremost, as C++ is a statically-typed language, additional (compile-time) information has to be provided for each of the variables. Specifically, a type, i.e. the kind of storage used by a variable, must be explicitly declared. Typed languages generally offer benefits in terms of both correctness (as it is harder to accidentally assign to an ill-matched type) and performance (as the compiler can optimize code based on the storage and CPU characteristics). Here we have an int argument, but return a logical, or bool for short. Two more smaller differences are that each statement within the body must be concluded with a semicolon, and that return does not require parentheses around its argument. A graphical breakdown of all aspects of a corresponding C++ function is given in the accompanying figure.
When using R, such C++ functions can be directly embedded and compiled from an R script file through the use of the cppFunction() function provided by Rcpp Attributes. The first parameter of the function accepts a string that represents the C++ code. Upon calling cppFunction(), and similarly to the earlier example involving evalCpp(), the C++ code is both compiled and linked, and then imported into R under the name of the C++ function supplied (here isOddCpp()).
library("Rcpp")
cppFunction("
bool isOddCpp(int num = 10) {
bool result = (num % 2 == 1);
return result;
}")
isOddCpp(42L)
## [1] FALSE
Let us first consider the case of ‘standard C’, i.e. the C API as defined in the core R documentation. Extending R with routines written in the C language requires the use of internal macros and functions documented in Chapter 5 of Writing R Extensions.
#include <R.h>
#include <Rinternals.h>
SEXP convolve2(SEXP a, SEXP b) {
int na, nb, nab;
double *xa, *xb, *xab;
SEXP ab;
a = PROTECT(coerceVector(a, REALSXP));
b = PROTECT(coerceVector(b, REALSXP));
na = length(a); nb = length(b);
nab = na + nb - 1;
ab = PROTECT(allocVector(REALSXP, nab));
xa = REAL(a); xb = REAL(b); xab = REAL(ab);
for (int i = 0; i < nab; i++)
xab[i] = 0.0;
for (int i = 0; i < na; i++)
for (int j = 0; j < nb; j++)
xab[i + j] += xa[i] * xb[j];
UNPROTECT(3);
return ab;
}

This function computes a convolution of two vectors supplied on input, $a = (a_0, \ldots, a_{n_a - 1})$ and $b = (b_0, \ldots, b_{n_b - 1})$, which is defined to be $(a \ast b)_k = \sum_{i + j = k} a_i\, b_j$ for $k = 0, \ldots, n_a + n_b - 2$.
Before computing the convolution (which is really just the three lines involving two nested for loops with indices $i$ and $j$), a total of ten lines of mere housekeeping are required. Vectors $a$ and $b$ are coerced to double, and a results vector ab is allocated. This involves three calls to the PROTECT macro for which a precisely matching UNPROTECT(3) is required as part of interfacing R's internal memory allocation. The vectors are accessed through the pointer equivalents xa, xb and xab; the latter has to be explicitly zeroed prior to the convolution calculation involving incremental summation at index $i + j$.
Using the idioms of Rcpp, the above example can be written in a much more compact fashion, leading to code that is simpler to read and maintain.
#include "Rcpp.h"
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector
convolve_cpp(const NumericVector& a,
const NumericVector& b) {
// Declare loop counters, and vector sizes
int i, j,
na = a.size(), nb = b.size(),
nab = na + nb - 1;
// Create vector filled with 0
NumericVector ab(nab);
// Crux of the algorithm
for(i = 0; i < na; i++) {
for(j = 0; j < nb; j++) {
ab[i + j] += a[i] * b[j];
}
}
// Return result
return ab;
}

To deploy such code from within an R script or session, first save it into a new file, which could be called convolve.cpp, in either the working directory, a temporary directory or a project directory. Then from within the R session, use Rcpp::sourceCpp("convolve.cpp") (possibly using a path as well as the filename). This not only compiles, links and loads the code within the external file but also adds the necessary “glue” to make the C++ function available in the R environment. Once the code is compiled and linked, call the newly-created convolve_cpp() function with the appropriate parameters as done in previous examples.
What is notable about the Rcpp version is that it has no PROTECT or UNPROTECT calls, which not only frees the programmer from a tedious (and error-prone) step but, more importantly, also shows that memory management can be handled automatically. The result vector is already initialized at zero as well, reducing the entire function to just the three lines for the two nested loops, plus some variable declarations and the return statement. The resulting code is shorter, and easier to read, comprehend and maintain. Furthermore, the C++ code is more similar to traditional R code, which lowers the barrier to entry.
When beginning to implement an idea, more so an algorithm, there are many ways one is able to correctly implement it. Prior to the routine being used in production, two questions must be asked:

1. Does the implementation produce the correct results?
2. Which of the correct implementations is the most efficient?

The first question is subject to a binary pass-fail unit test verification, while the latter question is where the details of an implementation are scrutinized to extract maximal efficiency from the routine. The quality of the best routine follows first and foremost from its correctness. To that end, R offers many different unit testing frameworks such as RUnit, which is used to construct Rcpp's 1385+ unit tests, and testthat. Only when correctness is achieved is it wise to begin the procedure of optimizing the efficiency of the routine and, in turn, selecting the best routine.
Optimization of an algorithm involves performing a quantitative analysis of the routine's properties. There are two main approaches to analyzing the behavior of a routine: theoretical analysis or an empirical examination using profiling tools. Typically, the latter option is more prominently used, as the routine's theoretical properties are derived prior to an implementation being started. Often the main concern regarding an implementation in R relates to the speed of the algorithm, as it impacts how quickly analyses can be done and reports can be provided to decision makers. Improving the speed of R code is one of the key use cases of Rcpp. Profiling R code will reveal shortcomings related to loops, e.g. for, while, and repeat; conditional statements, e.g. if-else if-else and switch; and recursive functions, i.e. a function written in terms of itself such that the problem is broken down on each call in a reduced state until an answer can be obtained. In contrast, the overhead for such operations is significantly lower in C++. Thus, speed-critical components of a given routine should be written in C++ to capture maximal efficiency.
Returning to the second question, to decide which implementation works the best, one needs to employ a benchmark to obtain quantifiable results. Benchmarks are an ideal way to quantify how well a method performs because they have the ability to show the amount of time the code has been running and where bottlenecks exist within functions. This does not imply that benchmarks are completely infallible as user error can influence the end results. For example, if a user decides to benchmark code in one session and in another session performs a heavy computation, then the benchmark will be biased (if “wall clock” is measured).
There are different levels of magnification that a benchmark can provide. For a more macro analysis, one can use benchmark(test = func(), test2 = func2()), a function from the rbenchmark package. This form of benchmarking is suited to more intensive computations. The motivating example of isOddR() (which accepts only a single integer) warrants a much more microscopic timing comparison. In cases such as this, the objective is to obtain precise results measured in nanoseconds. Using the microbenchmark() function from the microbenchmark package is more helpful to obtain such timing information. To perform the benchmark:
library("microbenchmark")
results <- microbenchmark(isOddR = isOddR(12L),
isOddCpp = isOddCpp(12L))
print(summary(results)[, c(1:7)], digits = 1)

By looking at the summary of 100 evaluations, we note that the C++ function performed better than the equivalent R function by achieving a lower run time on average. The lower run time here is not critical in itself, as the difference amounts to nanoseconds on a trivial computation. However, each section of code does contribute to a faster overall runtime.
Rcpp connects R with C++. Only the former is vectorized: C++ is not. Rcpp Sugar, however, provides a convenient way to work with high-performing C++ functions in a manner similar to how R offers vectorized operations. The Rcpp Sugar vignette details these, as well as many more functions directly accessible through Rcpp in a way that should feel familiar to R users. Some examples of Rcpp Sugar functions include special math functions such as gamma and beta, statistical distributions and random number generation.
We will illustrate a case of random number generation. Consider drawing one or more $N(0,1)$-distributed random variables. The very simplest case can just use evalCpp():

evalCpp("R::rnorm(0, 1)")
## [1] -1.4
By setting a seed, we can make this reproducible:

set.seed(123)
evalCpp("R::rnorm(0, 1)")
## [1] -0.56048
One important aspect of the behind-the-scenes code generation for the single expression (as well as all code created via Rcpp Attributes) is the automatic preservation of the state of the random number generators in R. This means that from a given seed, we will receive identical draws of random numbers whether we access them from R, or via C++ code accessing the same generators (via the R interfaces). To illustrate, the same number is drawn via R code after resetting the seed:
set.seed(123)
rnorm(1)
## [1] -0.56048
We can make the Rcpp Sugar function rnorm() accessible from R in the same way to return a vector of values:
## [1] -0.56048 -0.23018 1.55871
Note that we use the Rcpp:: namespace explicitly here to contrast the vectorised Rcpp::rnorm() with the scalar R::rnorm(), also provided as a convenience wrapper for the C API of R.
And as expected, this too replicates the draws from R, as the very same generators are used in both cases, along with consistent handling of generator state permitting R and C++ to alternate:
set.seed(123)
rnorm(3)
## [1] -0.56048 -0.23018 1.55871
Statistical inference relied primarily upon asymptotic theory until Efron (1979) proposed the bootstrap. Bootstrapping is known to be computationally intensive due to the need for loops. Thus, it is an ideal candidate for a C++ implementation. Before starting to write C++ code using Rcpp, prototype the code in R.
# Function declaration
bootstrap_r <- function(ds, B = 1000) {
# Preallocate storage for statistics
boot_stat <- matrix(NA, nrow = B, ncol = 2)
# Number of observations
n <- length(ds)
# Perform bootstrap
for(i in seq_len(B)) {
# Sample initial data
gen_data <- ds[ sample(n, n, replace=TRUE) ]
# Calculate sample data mean and SD
boot_stat[i,] <- c(mean(gen_data),
sd(gen_data))
}
# Return bootstrap result
return(boot_stat)
}

Before continuing, check that the initial prototype code works. To do so, write a short script. Note the use of set.seed() to ensure reproducible draws.
# Set seed to generate data
set.seed(512)
# Generate data
initdata <- rnorm(1000, mean = 21, sd = 10)
# Set a new _different_ seed for bootstrapping
set.seed(883)
# Perform bootstrap
result_r <- bootstrap_r(initdata)

The resulting figure shows that the bootstrap procedure worked well!
With reassurance that the method works appropriately in R, proceed to translating the R code into C++. As indicated previously, there are many convergences between C++ syntax via Rcpp Sugar and base R.
#include <Rcpp.h>
// Function declaration with export tag
// [[Rcpp::export]]
Rcpp::NumericMatrix
bootstrap_cpp(Rcpp::NumericVector ds,
int B = 1000) {
// Preallocate storage for statistics
Rcpp::NumericMatrix boot_stat(B, 2);
// Number of observations
int n = ds.size();
// Perform bootstrap
for(int i = 0; i < B; i++) {
// Sample initial data
Rcpp::NumericVector gen_data =
ds[ floor(Rcpp::runif(n, 0, n)) ];
// Calculate sample mean and std dev
boot_stat(i, 0) = mean(gen_data);
boot_stat(i, 1) = sd(gen_data);
}
// Return bootstrap results
return boot_stat;
}

In the C++ version of the bootstrap function, a few additional changes occurred during the translation. In particular, Rcpp::runif(n, 0, n) enclosed by floor(), which rounds down to the nearest integer, is used in place of sample(n, n, replace = TRUE) to sample row ids. This is an equivalent substitution since equal weight is placed upon all row ids and replacement is allowed. Note that the upper bound of the interval, n, will never be reached. While this may seem flawed, it is important to note that vectors and matrices in C++ use a zero-based indexing system, meaning that indices begin at 0 instead of 1 and go up to $n - 1$ instead of $n$, unlike R's system. Thus, an out-of-bounds error would be triggered if n were used, as that position does not exist within the data structure. The application of this logic can be seen in the span the for loop takes in C++ when compared to R.
Another syntactical change is the use of () in place of [] when accessing matrix elements. This change is due to C++'s comma operator, which makes it impossible to place multiple indices inside the square brackets.
To validate that the translation was successful, first run the C++ function with the same data and seed as was given to the R function.
# Use the same seed used in R and C++
set.seed(883)
# Perform bootstrap with C++ function
result_cpp <- bootstrap_cpp(initdata)

Next, check the output of the two functions using R's all.equal() function, which allows for an $\varepsilon$-neighborhood around a number.
# Compare output
all.equal(result_r, result_cpp)
## [1] "Mean relative difference: 0.019931"

The small difference arises because sample() and the floor(runif()) approach consume the random number stream differently, so the two functions do not draw identical bootstrap samples even under the same seed; the results nonetheless agree closely.
Lastly, make sure to benchmark the newly translated C++ function against the R implementation. As stated earlier, data is paramount to making a decision about which function to use in an analysis or package.
library(rbenchmark)
benchmark(r = bootstrap_r(initdata),
cpp = bootstrap_cpp(initdata))[, 1:4]

## test replications elapsed relative
## 2 cpp 100 2.168 1.000
## 1 r 100 3.835 1.769
Many of the previously illustrated examples were directed primarily at showing the gains in computational efficiency that are possible by implementing code directly in C++; however, this is only one potential application of Rcpp. Perhaps one of the most understated features of Rcpp is its ability to enable the third statement of Chambers (2016): Interfaces to other software are part of R. In particular, Rcpp is designed to facilitate interfacing libraries written in C or C++ with R. Hence, if there is a specific feature within a C or C++ library, one can create a bridge to it using Rcpp to enable it from within R.
An example is the use of matrix algebra libraries such as Armadillo or Eigen. By outsourcing complex linear algebra operations to matrix libraries, the need to directly implement such functions within R is negated. Moreover, the Rcpp design allows for seamless transfer between object types by using the automatic converters wrap(), to R, and as<T>(), to C++, with the T indicating the type of object being cast to. These two helper functions provide a non-invasive way to work with an external object. Thus, a further benefit of using external libraries is the ability to have a portable code base that can be implemented within a standalone program or within another computational language.
A common application in statistical computing is simulating from a multivariate normal distribution. The algorithm relies on a linear transformation of the standard normal distribution. Letting $Y = \mu + AZ$, where $A$ is a $p \times p$ matrix such that $AA^\top = \Sigma$, $Z \sim N(0_p, I_p)$, and $I_p$ is the $p \times p$ identity matrix, then $Y \sim N(\mu, \Sigma)$. To obtain the matrix $A$ from $\Sigma$, either a Cholesky or an Eigen decomposition is required. The Eigen decomposition is more stable, in addition to being more computationally demanding, compared to the Cholesky decomposition. For simplicity and speed, we have opted to implement the sampling procedure using a Cholesky decomposition. Regardless, there is a need to involve one of the above matrix libraries to make the sampling viable in C++.
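As a quick sketch of why the linear transformation works, consider the first two moments of $Y$ (with $A$ any factor satisfying $AA^\top = \Sigma$, such as a Cholesky factor):

```latex
\begin{aligned}
\mathbb{E}[Y] &= \mu + A\,\mathbb{E}[Z] = \mu, \\
\operatorname{Var}(Y) &= A \operatorname{Var}(Z) A^{\top}
                       = A I_p A^{\top} = \Sigma,
\end{aligned}
\qquad \text{so } Y \sim N(\mu, \Sigma).
```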
Here, we demonstrate how to take advantage of the Armadillo linear algebra template classes via the RcppArmadillo package. Prior to running this example, the package must be installed using install.packages('RcppArmadillo'). One important caveat when using additional packages within the Rcpp ecosystem is that the correct header file may not be Rcpp.h. In a majority of cases, the additional package ships a dedicated header (e.g. RcppArmadillo.h here) which not only declares data structures from both systems, but may also add complementary integration and conversion routines. It typically needs to be listed in an include statement along with a depends() attribute to tell R where to find the additional header files:
// Use the RcppArmadillo package
// Requires different header file from Rcpp.h
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

With this in mind, sampling from a multivariate normal distribution can be obtained in a straightforward manner, using only Armadillo data types and values:
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// Sample N x P observations from a Standard
// Multivariate Normal given N observations, a
// vector of P means, and a P x P cov matrix
// [[Rcpp::export]]
arma::mat rmvnorm(int n,
const arma::vec& mu,
const arma::mat& Sigma) {
unsigned int p = Sigma.n_cols;
// First draw N x P values from a N(0,1)
Rcpp::NumericVector draw = Rcpp::rnorm(n*p);
// Instantiate an Armadillo matrix with the
// drawn values using advanced constructor
// to reuse allocated memory
arma::mat Z = arma::mat(draw.begin(), n, p,
false, true);
// Simpler, less performant alternative
// arma::mat Z = Rcpp::as<arma::mat>(draw);
// Generate a sample from the Transformed
// Multivariate Normal
arma::mat Y = arma::repmat(mu, 1, n).t() +
Z * arma::chol(Sigma);
return Y;
}

As a result of using random number generation (RNG), there is an additional requirement to ensure reproducible results: the necessity to explicitly set a seed. Because of the (programmatic) interface provided by R to its own RNGs, this setting of the seed has to occur at the R level via the set.seed() function, as no (public) C(++) interface for it is provided by the R header files.
As a second example, consider the problem of estimating a common linear model repeatedly. One use case might be the simulation of size and power of standard tests. Many users of R would default to using lm(); however, the overhead associated with this function greatly impacts the speed with which an estimate can be obtained. Another approach would be to use the base R function lm.fit(), which is called by lm(), to compute the estimated coefficients $\hat{\beta}$ in just about the fastest time possible. However, this approach is also not viable as it does not report the estimated standard errors. As a result, we cannot use either default function in the context of simulating finite sample population effects on inference.
One alternative is provided by the fastLm() function in RcppArmadillo.
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// Compute coefficients and their standard error
// during multiple linear regression given a
// design matrix X containing N observations with
// P regressors and a vector y containing
// N responses
// [[Rcpp::export]]
Rcpp::List fastLm(const arma::mat& X,
const arma::colvec& y) {
// Dimension information
int n = X.n_rows, p = X.n_cols;
// Fit model y ~ X
arma::colvec coef = arma::solve(X, y);
// Compute the residuals
arma::colvec res = y - X*coef;
// Estimated variance of the random error
double s2 =
std::inner_product(res.begin(), res.end(),
res.begin(), 0.0)
/ (n - p);
// Standard error matrix of coefficients
arma::colvec std_err = arma::sqrt(s2 *
arma::diagvec(arma::pinv(X.t()*X)));
// Create named list with the above quantities
return Rcpp::List::create(
Rcpp::Named("coefficients") = coef,
Rcpp::Named("stderr") = std_err,
Rcpp::Named("df.residual") = n - p );
}

The interface is very simple: a matrix $X$ of regressors, and a dependent variable $y$ as a vector. We invoke the standard Armadillo function solve() to fit the model y ~ X. We then compute the residuals, and extract the (appropriately scaled) diagonal of the covariance matrix, taking its square root, in order to return both the estimates $\hat{\beta}$ and their standard errors.
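The quantities computed by fastLm() correspond to the textbook least-squares formulas (with pinv() serving as the inverse of $X^\top X$):

```latex
\hat{\beta} = (X^{\top} X)^{-1} X^{\top} y,
\qquad
\hat{\sigma}^2 = \frac{(y - X\hat{\beta})^{\top} (y - X\hat{\beta})}{n - p},
\qquad
\widehat{\operatorname{se}}(\hat{\beta}_j)
  = \sqrt{\hat{\sigma}^2 \, \big[(X^{\top} X)^{-1}\big]_{jj}} .
```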
Once a project containing compiled code has matured to the point of sharing it with collaborators or using it within a parallel computing environment, the ideal way forward is to embed the code within an R package. Not only does an R package provide a way to automatically compile source code, it also enables the use of the R help system to document how the written functions should be used. As a further benefit, the package format enables the use of unit tests to ensure that the functions produce the correct output. Lastly, having a package provides the option of uploading to a repository such as CRAN for wider dissemination.
To facilitate package building, Rcpp provides the function Rcpp.package.skeleton(), which is modeled after the base R function package.skeleton(). This function automates the creation of a skeleton package set up to use Rcpp:
library("Rcpp")
Rcpp.package.skeleton("samplePkg")

This shows how the distinct directories man, R and src are created for, respectively, the help pages, files with R code and files with C++ code. Generally speaking, all compiled code, be it from C, C++ or Fortran sources, should be placed within the src/ directory.
Alternatively, one can achieve results similar to Rcpp.package.skeleton() by using a feature of the RStudio IDE. Specifically, while creating a new package project there is an option to select the type of package via a dropdown menu; select “Package w/ Rcpp” in RStudio versions prior to v1.1.0. In RStudio versions later than v1.1.0, support for package templates has been added, allowing users to directly create Rcpp-based packages that use Eigen or Armadillo.
Lastly, one more option exists for users who are familiar with the devtools package. To create the package skeleton, use devtools::create("samplePkg"). From here, part of the structure required by Rcpp can be added by using devtools::use_rcpp(). The remaining aspects needed by Rcpp must be manually copied from the roxygen tags written to the console and pasted into one of the package's R files to successfully incorporate the dynamic library and link to Rcpp's headers.
All of these methods take care of a number of small settings one would otherwise have to enable manually. These include ‘Imports:’ and ‘LinkingTo:’ declarations in the file DESCRIPTION, as well as ‘useDynLib’ and ‘importFrom’ entries in NAMESPACE. For Rcpp Attributes use, the compileAttributes() function has to be called. Similarly, to take advantage of the documentation-creation feature, the roxygenize() function from the roxygen2 package has to be called. Additional details on using Rcpp within a package are provided in the Rcpp vignettes.
R has always provided mechanisms to extend it. The bare-bones C API is already used to great effect by a large number of packages. By taking advantage of a number of C++ language features, Rcpp has been able to make extending R easier, offering a combination of both speed and ease of use that has found increasingly widespread utilization by researchers and data scientists. We are thrilled about this adoption, and look forward to seeing more exciting extensions to R being built.