One of the greatest benefits of using Rcpp is its ability to interface with other libraries written in C++ or C. There is a long history of high-performance, well-written routines in C++ and C, such as Armadillo and Eigen for linear algebra, Intel MKL for highly optimized math routines, and GSL, the GNU Scientific Library, a large, extremely well-tested and widely used set of mathematical functions, among many others. Let's take a look at how we might integrate Armadillo into a high-performance workflow.

Armadillo

Note for Mac users: this may not work immediately on your computer and may involve installing Fortran libraries.

#include <RcppArmadillo.h>
// [[Rcpp::depends(RcppArmadillo)]]

// Compute coefficients and their standard errors for a multiple linear
// regression, given a design matrix X containing N observations with P
// regressors and a vector y of N responses

// [[Rcpp::export]]
Rcpp::List fastLm(const arma::mat& X, const arma::colvec& y) {
  // Dimension information
  int n = X.n_rows;
  int p = X.n_cols;

  // Fit model y ~ X
  arma::colvec coef = arma::solve(X, y);

  // Compute the residuals
  arma::colvec res  = y - X*coef;

  // Obtain the estimated variance of the random error
  double s2 = std::inner_product(res.begin(), res.end(),
                                 res.begin(), 0.0)/(n - p);

  // Obtain the standard error matrix of coefficients
  arma::colvec std_err = arma::sqrt(s2 *
    arma::diagvec(arma::pinv(X.t()*X)));

  // Create a named list with the above quantities
  return Rcpp::List::create(Rcpp::Named("coefficients") = coef,
                            Rcpp::Named("stderr")       = std_err,
                            Rcpp::Named("df.residual")  = n - p    );
}

Here we’ve defined a routine that calls out to Armadillo and returns only the quantities of interest, namely the coefficients of the model, their standard errors, and the residual degrees of freedom, minimizing any extra work that would be done. Let’s benchmark its performance against lm.fit().

library(microbenchmark)

X = matrix(c(rep(1, 300), runif(300, 0, 50)), nrow = 300)
y = 5 + 3 * X[,2] + rnorm(300, 0, 5)

results <- microbenchmark(
  lm_r = lm.fit(X, y),
  lm_cpp = fastLm(X, y)
)

results
## Unit: microseconds
##    expr    min      lq     mean  median      uq      max neval cld
##    lm_r 57.253 67.4845  97.3162 76.7865 86.8895  867.441   100   a
##  lm_cpp 14.362 21.0940 101.5664 28.2840 33.4220 3415.079   100   a

Here we have an example of a case where Rcpp might not make a huge difference: the two perform comparably. This isn’t terribly surprising; lm.fit() is heavily used and therefore has been heavily optimized, so we would expect it to perform well. However, third-party libraries often contain more obscure functionality that hasn’t made its way into R, offering an avenue for highly performant code without having to write the core algorithms yourself.