vignettes/rmd/Rcpp-sugar.Rmd
Rcpp-sugar.RmdAbstract
This note describes which has been introduced in version 0.8.3 of . brings a higher-level of abstraction to code written using the API. is based on expression templates and provides some ‘syntactic sugar’ facilities directly in . This is similar to how offers linear algebra classes based on .
facilitates development of internal compiled code in an package by abstracting low-level details of the API into a consistent set of classes.
Code written using classes is easier to read, write and maintain,
without loosing performance. Consider the following code example which
provides a function foo as a extension to by using the
API:
RcppExport SEXP foo(SEXP x, SEXP y) {
Rcpp::NumericVector xx(x), yy(y);
int n = xx.size();
Rcpp::NumericVector res(n);
double x_ = 0.0, y_ = 0.0;
for (int i=0; i<n; i++) {
x_ = xx[i];
y_ = yy[i];
if (x_ < y_) {
res[i] = x_ * x_;
} else {
res[i] = -(y_ * y_);
}
}
return res;
}The goal of the function foo code is simple. Given two
numeric vectors, we create a third one. This is typical
low-level code that that could be written much more concisely in thanks
to vectorisation as shown in the next example.
foo <- function(x, y) {
ifelse(x < y, x * x, -(y * y))
}Put succinctly, the motivation of is to bring a subset of the
high-level syntax in . Hence, with , the version of foo now
becomes:
Rcpp::NumericVector foo(Rcpp::NumericVector x,
Rcpp::NumericVector y) {
return ifelse(x < y, x * x, -(y * y));
}Apart from being strongly-typed and the need for explicit
return statement, the code is now identical between
highly-vectorised and .
is written using expression templates and lazy evaluation techniques . This not only allows a much nicer high-level syntax, but also makes it rather efficient (as we detail in section below).
takes advantage of operator overloading. The next few sections discuss several examples.
defines the usual binary arithmetic operators : +,
-, *, /.
// two numeric vectors of the same size
NumericVector x;
NumericVector y;
// expressions involving two vectors
NumericVector res = x + y;
NumericVector res = x - y;
NumericVector res = x * y;
NumericVector res = x / y;
// one vector, one single value
NumericVector res = x + 2.0;
NumericVector res = 2.0 - x;
NumericVector res = y * 2.0;
NumericVector res = 2.0 / y;
// two expressions
NumericVector res = x * y + y / 2.0;
NumericVector res = x * (y - 2.0);
NumericVector res = x / (y * y);The left hand side (lhs) and the right hand side (rhs) of each binary
arithmetic expression must be of the same type (for example they should
be both numeric expressions).
The lhs and the rhs can either have the same size or one of them
could be a primitive value of the appropriate type, for example adding a
NumericVector and a double.
Binary logical operators create a logical sugar
expression from either two sugar expressions of the same type or one
sugar expression and a primitive value of the associated type.
// two integer vectors of the same size
NumericVector x;
NumericVector y;
// expressions involving two vectors
LogicalVector res = x < y;
LogicalVector res = x > y;
LogicalVector res = x <= y;
LogicalVector res = x >= y;
LogicalVector res = x == y;
LogicalVector res = x != y;
// one vector, one single value
LogicalVector res = x < 2;
LogicalVector res = 2 > x;
LogicalVector res = y <= 2;
LogicalVector res = 2 != y;
// two expressions
LogicalVector res = (x + y) < (x*x);
LogicalVector res = (x + y) >= (x*x);
LogicalVector res = (x + y) == (x*x);defines functions that closely match the behavior of functions of the same name.
Given a logical sugar expression, the all function
identifies if all the elements are TRUE. Similarly, the
any function identifies if any the element is
TRUE when given a logical sugar expression.
Either call to all and any creates an
object of a class that has member functions is_true,
is_false, is_na and a conversion to
SEXP operator.
One important thing to highlight is that all is lazy.
Unlike , there is no need to fully evaluate the expression. In the
example above, the result of all is fully resolved after
evaluating only the two first indices of the expression .
any is lazy too, so it will only need to resolve the first
element of the example above.
Given a sugar expression of any type, (just like the other functions
in this section) produces a logical sugar expression of the same length.
Each element of the result expression evaluates to TRUE if
the corresponding input is a missing value, or FALSE
otherwise.
Given a sugar expression of any type, seq_along creates
an integer sugar expression whose values go from 1 to the size of the
input.
IntegerVector x =
IntegerVector::create( 0, 1, NA_INTEGER, 3 );
IntegerVector y = seq_along(x);
IntegerVector z = seq_along(x * x * x * x * x * x);This is the most lazy function, as it only needs to call the
size member function of the input expression. The input
expression need not to be resolved. The two examples above gives the
same result with the same efficiency at runtime. The compile time will
be affected by the complexity of the second expression, since the
abstract syntax tree is built at compile time.
seq_len creates an integer sugar expression whose
element expands to i. seq_len is particularly
useful in conjunction with sapply and
lapply.
Given two sugar expressions of the same type and size, or one
expression and one primitive value of the appropriate type,
pmin (pmax) generates a sugar expression of
the same type whose element expands to the lowest (highest) value
between the element of the first expression and the element of the
second expression.
ifelse expands to a sugar expression whose
element is the element of the first expression if the element of the
condition expands to TRUE or the of the second expression
if the element of the condition expands to FALSE, or the
appropriate missing value otherwise.
sapply applies a function to each element of the given
expression to create a new expression. The type of the resulting
expression is deduced by the compiler from the result type of the
function.
The function can be a free function such as the overload generated by the template function below:
Alternatively, the function can be a functor whose type has a nested
type called result_type
lapply is similar to sapply except that the
result is allways an list expression (an expression of type
VECSXP).
For the following set of functions, generally speaking, the element
of the result of the given function (say, abs) is the
result of applying that function to this element of the input
expression. Supported types are integer and numeric.
The framework provided by also permits easy and efficient access the density, distribution function, quantile and random number generation functions function by in the library.
Currently, most of these functions are vectorised for the first element which denote size. Consequently, these calls works in just as they would in :
x1 = dnorm(y1, 0, 1); // density of y1 at m=0, sd=1
x2 = qnorm(y2, 0, 1); // quantiles of y2
x3 = pnorm(y3, 0, 1); // distribution of y3
x4 = rnorm(n, 0, 1); // 'n' RNG draws of N(0, 1)Similar d/q/p/r functions are provided for the most common distributions: beta, binom, cauchy, chisq, exp, f, gamma, geom, hyper, lnorm, logis, nbeta, nbinom, nbinom_mu, nchisq, nf, norm, nt, pois, t, unif, and weibull.
Note that the parameterization used in these sugar functions may differ between the top-level functions exposed in an session. For example, the internal is parameterized by , whereas the R-level is parameterized by . Consult for more details on the parameterization used for these sugar functions.
One point to note is that the programmer using these functions needs to initialize the state of the random number generator as detailed in Section 6.3 of the `Writing R Extensions’ manual . A nice solution for this is to use a class that sets the random number generator on entry to a block and resets it on exit. We offer the class which allows code such as
As there is some computational overhead involved in using , we are not wrapping it around each inner function. Rather, the user of these functions ( you) should place an at the appropriate level of your code.
This section details some of the techniques used in the implementation of . Note that the user need not to be familiar with the implementation details in order to use , so this section can be skipped upon a first read of the paper.
Writing functions is fairly repetitive and follows a well-structured pattern. So once the basic concepts are mastered (which may take time given the inherent complexities in template programming), it should be possible to extend the set of function further following the established pattern.
Expression templates such as those used by use a technique called the Curiously Recurring Template Pattern (CRTP). The general form of CRTP is:
// The Curiously Recurring Template Pattern (CRTP)
template <typename T>
struct base {
// ...
};
struct derived : base<derived> {
// ...
};The base class is templated by the class that derives
from it : derived. This shifts the relationship between a
base class and a derived class as it allows the base class to access
methods of the derived class.
The CRTP is used as the basis for with the VectorBase
class template. All sugar expression derive from one class generated by
the VectorBase template. The current definition of
VectorBase is given here:
template <int RTYPE, bool na, typename VECTOR>
class VectorBase {
public:
struct r_type :
traits::integral_constant<int,RTYPE>{};
struct can_have_na :
traits::integral_constant<bool,na>{};
typedef typename
traits::storage_type<RTYPE>::type
stored_type;
VECTOR& get_ref(){
return static_cast<VECTOR&>(*this);
}
inline stored_type operator[](int i) const {
return static_cast<const VECTOR*>(
this)->operator[](i);
}
inline int size() const {
return static_cast<const VECTOR*>(
this)->size();
}
/* definition omitted here */
class iterator;
inline iterator begin() const {
return iterator(*this, 0);
}
inline iterator end() const {
return iterator(*this, size());
}
}The VectorBase template has three parameters:
RTYPE: This controls the type of expression (INTSXP,
REALSXP, …)na: This embeds in the derived type information about
whether instances may contain missing values. vector types
(IntegerVector, …) derive from VectorBase with
this parameter set to true because there is no way to know
at compile-time if the vector will contain missing values at run-time.
However, this parameter is set to false for types that are
generated by sugar expressions as these are guaranteed to produce
expressions that are without missing values. An example is the
is_na function. This parameter is used in several places as
part of the compile time dispatch to limit the occurrence of redundant
operations.VECTOR: This parameter is the key of . This is the
manifestation of CRTP. The indexing operator and the size
method of VectorBase use a static cast of this
to the VECTOR type to forward calls to the actual method of
the derived class.As an example, the current implementation of sapply,
supported by the template class Rcpp::sugar::Sapply is
given below:
template <int RTYPE, bool NA,
typename T, typename Function>
class Sapply : public VectorBase<
Rcpp::traits::r_sexptype_traits< typename
::Rcpp::traits::result_of<Function>::type
>::rtype,
true,
Sapply<RTYPE, NA, T, Function>
> {
public:
typedef typename
::Rcpp::traits::result_of<Function>::type;
const static int RESULT_R_TYPE =
Rcpp::traits::r_sexptype_traits<
result_type>::rtype;
typedef Rcpp::VectorBase<RTYPE,NA,T> VEC;
typedef typename
Rcpp::traits::r_vector_element_converter<
RESULT_R_TYPE>::type
converter_type;
typedef typename Rcpp::traits::storage_type<
RESULT_R_TYPE>::type STORAGE;
Sapply(const VEC& vec_, Function fun_) :
vec(vec_), fun(fun_){}
inline STORAGE operator[]( int i ) const {
return converter_type::get(fun(vec[i]));
}
inline int size() const {
return vec.size();
}
private:
const VEC& vec;
Function fun;
};
// sugar
template <int RTYPE, bool _NA_,
typename T, typename Function >
inline sugar::Sapply<RTYPE, _NA_, T, Function>
sapply(const Rcpp::VectorBase<RTYPE,_NA_,T>& t,
Function fun) {
return
sugar::Sapply<RTYPE,_NA_,T,Function>(t, fun);
}sapply is a template function that takes two arguments.
The first argument is a sugar expression, which we recognize because of
the relationship with the VectorBase class template. The
second argument is the function to apply.
The sapply function itself does not do anything, it is
just used to trigger compiler detection of the template parameters that
will be used in the sugar::Sapply template.
In order to decide which kind of expression is built, the
Sapply template class queries the template argument via the
Rcpp::traits::result_of template.
The result_of type trait is implemented as such:
template <typename T>
struct result_of{
typedef typename T::result_type type;
};
template <typename RESULT_TYPE,
typename INPUT_TYPE>
struct result_of<RESULT_TYPE (*)(INPUT_TYPE)> {
typedef RESULT_TYPE type;
};The generic definition of result_of targets functors
with a nested result_type type.
The second definition is a partial specialization targetting function pointers.
Based on the result type of the function, the
r_sexptype_traits trait is used to identify the expression
type.
The r_vector_element_converter class is used to convert
an object of the function’s result type to the actual storage type
suitable for the sugar expression.
The storage_type trait is used to get access to the
storage type associated with a sugar expression type. For example, the
storage type of a REALSXP expression is
double.
The input expression—the expression over which sapply
runs—is also typedef’ed for convenience:
In order to be part of the system, the type generated by the
Sapply class template must inherit from
VectorBase.
template <int RTYPE, bool NA,
typename T, typename Function>
class Sapply : public VectorBase<
Rcpp::traits::r_sexptype_traits<
typename
::Rcpp::traits::result_of<Function>::type
>::rtype,
true,
Sapply<RTYPE,NA,T,Function>
>The expression built by Sapply depends on the result
type of the function, may contain missing values, and the third argument
is the manifestation of the CRTP.
The constructor of the Sapply class template is
straightforward, it simply consists of holding the reference to the
input expression and the function.
The indexing operator and the size member function is
what the VectorBase expects. The size of the result
expression is the same as the size of the input expression and the
element of the result is simply retrieved by applying the function and
the converter. Both these methods are inline to maximize
performance: