One of the greatest benefits of using Rcpp
is it’s ability to interface with other libraries written in C++ or C. There is a long history of highly performant, well written routines written in C++ and C, such as Armadillo
and Eigen
for linear algebra, Intel MKL
for highly optimized math routines, and GSL
, the GNU Scientific Library, a large, extremely well tested and used set of mathematical functions, amongst many others. Let’s take a look into how we might integrate Armadillo
into a high performance workflow.
Note Mac Users: this may not work immediately on your computer and may involve installing fortran libraries.
#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]
// Compute coefficients and their standard error during multiple linear
// regression given a design matrix X containing N observations with P
// regressors and a vector y containing of N responses
// [[Rcpp::export]]
Rcpp::List fastLm(const arma::mat& X, const arma::colvec& y) {
// Dimension information
int n = X.n_rows;
int p = X.n_cols;
// Fit model y ~ X
arma::colvec coef = arma::solve(X, y);
// Compute the residuals
arma::colvec res = y - X*coef;
// Obtain the estimated variance of the random error
double s2 = std::inner_product(res.begin(), res.end(),
res.begin(), 0.0)/(n - p);
// Obtain the standard error matrix of coefficients
arma::colvec std_err = arma::sqrt(s2 *
arma::diagvec(arma::pinv(X.t()*X)));
// Create a named list with the above quantities
return Rcpp::List::create(Rcpp::Named("coefficients") = coef,
Rcpp::Named("stderr") = std_err,
Rcpp::Named("df.residual") = n - p );
}
Here we’ve defined a routine that calls out to Armadillo
and returns the parameters of interest, namely, the coefficients of the model, the std. error of the model, and the residuals of the fit, minimizing any extra work that would be done. Let’s benchmark it’s performance against lm.fit()
.
X = matrix(c(rep(1, 300), runif(300, 0, 50)), nrow = 300)
y = 5 + 3 * X[,2] + rnorm(300, 0, 5)
results <- microbenchmark(
lm_r = lm.fit(X, y),
lm_cpp = fastLm(X, y)
)
results
## Unit: microseconds
## expr min lq mean median uq max neval cld
## lm_r 57.253 67.4845 97.3162 76.7865 86.8895 867.441 100 a
## lm_cpp 14.362 21.0940 101.5664 28.2840 33.4220 3415.079 100 a
Here we have an example of when it might Rcpp
might not make a huge difference, the two perform comparably. This isn’t terribly surprising, lm.fit
is heavily used and therefore has been heavily optimized, so we would expect it to perform as such. However, third party libraries often contain more obscure functionality that hasn’t made it’s way into R, offering an avenue for highly performant code without having to write the core algorithms yourself.