r - sorting columns of Rcpp NumericMatrix for median calculations -
i've been testing rcpp , rcpparmadillo calculating summary stats on big matrices. lot faster (5 or 10 times faster) base r colmeans or the armadillo on ~4million rows, 45 columns.
colmeansrcpp <- cxxfunction(signature(x_="integer"), plugin='rcpp', body=' rcpp::integermatrix x = x_; int ncol = x.ncol(); int nrow = x.nrow(); rcpp::numericvector out(ncol); for(int col = 0; col < ncol; col++){ out[col]=rcpp::sum(x(_, col)); } return wrap(out/nrow); ')
i want calculate median , maybe other quantiles plotting - , because requires sort more needy of c++ outsourcing. armadillo seems bit slow wanted in place sort on code similar above cant syntax right... here i'm trying..
# ok i'm aware floor(nrow/2) not **absolutely** correct # i'm simplifying here colmedianrcpp <- cxxfunction(signature(x_="integer"), plugin='rcpp', body=' rcpp::integermatrix x = clone(x_); int ncol = x.ncol(); int nrow = x.nrow(); rcpp::numericvector out(ncol); for(int col = 0; col < ncol; col++){ x(_,col)= std::sort((x_,col).begin, (x_,col).end)); out[col]=x(floor(nrow/2), col)); } return wrap(out); ')
basically it's line
x(_,col)= std::sort((x_,col).begin, (x_,col).end));
i don't know how express "sort column in place" mixture of rcpp sugar , std c++. sorry can see i'm doing wrong hint on right syntax lovely.
ps right need clone() don't change r object?
edit add rcpparmadillo code , benchmark comparison address answer/comment below. benchmark on 50k rows quick reply recall similar many more. realise rcpp author.. many time!
the thought occurs perhaps i'm doing daft rcpparmadillo code make run slower base colmeans or rcpp version?
colmeansrcpparmadillo <- cxxfunction(signature(x_="integer"), plugin="rcpparmadillo", body=' arma::mat x = rcpp::as<arma::mat > (x_); arma::rowvec md= arma::mean(x, 0); return wrap(md); ')
and benchmark ...
(mb = microbenchmark( + colmeans(fqsmallmatrix), + colmeansrcpp(fqsmallmatrix), + colmeansrcpparmadillo(fqsmallmatrix), + times=50)) unit: milliseconds expr min lq median uq max neval colmeans(fqsmallmatrix) 10.620919 10.63289 10.640819 10.648882 10.907145 50 colmeansrcpp(fqsmallmatrix) 2.649038 2.66832 2.676709 2.700839 2.841012 50 colmeansrcpparmadillo(fqsmallmatrix) 25.687067 26.23488 33.168589 33.792489 113.832495 50
you not showing rcpparmadillo code -- have been quite happy performance of rcpparmadillo code needed row/col column subsetting.
you can instantiate armadillo matrices via rcpp efficiently (no copy, re-using r object memory) try that.
and you: want clone()
distinct copy, , think you'd free if use default rcpparmadillo ctor (rather more efficient two-step).
edit few hours later
you had left open question why armadillo slow. in meantime, vincent solved issue here revisited, cleaner solution using code vincent's.
now how instantiates armadillo matrix without copy -- faster. , avoids mixing integer , numeric matrices. code first:
#include <rcpparmadillo.h> using namespace rcpp; // [[rcpp::depends(rcpparmadillo)]] // [[rcpp::export]] numericvector colmedianrcpp(numericmatrix x) { int nrow = x.nrow(); int ncol = x.ncol(); int position = nrow / 2; // euclidian division numericvector out(ncol); (int j = 0; j < ncol; j++) { numericvector y = x(_,j); // copy column -- original not mod std::nth_element(y.begin(), y.begin() + position, y.end()); out[j] = y[position]; } return out; } // [[rcpp::export]] arma::rowvec colmeansrcpparmadillo(numericmatrix x){ arma::mat x = arma::mat(x.begin(), x.nrow(), x.ncol(), false); return arma::mean(x, 0); } // [[rcpp::export]] numericvector colmeansrcpp(numericmatrix x) { int ncol = x.ncol(); int nrow = x.nrow(); rcpp::numericvector out(ncol); (int col = 0; col < ncol; col++){ out[col]=rcpp::sum(x(_, col)); } return wrap(out/nrow); } /*** r set.seed(42) x <- matrix(rnorm(100*10), 100, 10) library(microbenchmark) mb <- microbenchmark(colmeans(x), colmeansrcpp(x), colmeansrcpparmadillo(x), colmedianrcpp(x), times=50) print(mb) */
and here result on machine, concise armadillo version fast yours, , median little slower has more work:
r> sourcecpp("/tmp/stephen.cpp") r> set.seed(42) r> x <- matrix(rnorm(100*10), 100, 10) r> library(microbenchmark) r> mb <- microbenchmark(colmeans(x), colmeansrcpp(x), colmeansrcpparmadillo(x), + colmedianrcpp(x), times=50) r> print(mb) unit: microseconds expr min lq median uq max neval colmeans(x) 9.469 10.422 11.5810 12.421 30.597 50 colmeansrcpp(x) 3.922 4.281 4.5245 5.306 18.020 50 colmeansrcpparmadillo(x) 4.196 4.549 4.9295 5.927 11.159 50 colmedianrcpp(x) 15.615 16.291 16.7290 17.971 27.026 50 r>
Comments
Post a Comment