c++ - How large should matrices be if I use BLAS/cuBLAS for it to perform better than plain C/CUDA?
I am implementing stochastic gradient descent on the GPU using CUDA, Thrust, and cuBLAS.
In my initial implementation I used plain CUDA to perform the matrix-vector operations, and now I'm trying to optimize it by using cuBLAS for those operations instead.
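To make the comparison concrete, here is a hedged sketch of the two approaches being compared: a hand-written one-thread-per-row kernel versus the equivalent `cublasSgemv` call. The kernel and function names are illustrative (not from the linked repository), and the matrix is assumed to be stored column-major, which is the layout cuBLAS expects.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Illustrative hand-written kernel: one thread per row, y = A * x.
// A is rows x cols, stored column-major to match cuBLAS conventions.
__global__ void gemv_naive(const float *A, const float *x, float *y,
                           int rows, int cols) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rows) return;
    float acc = 0.0f;
    for (int c = 0; c < cols; ++c)
        acc += A[c * rows + r] * x[c];  // column-major indexing
    y[r] = acc;
}

// The cuBLAS level-2 replacement for the kernel above: y = 1*A*x + 0*y.
// All pointers are device pointers; handle must already be created.
void gemv_cublas(cublasHandle_t handle, const float *dA, const float *dx,
                 float *dy, int rows, int cols) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemv(handle, CUBLAS_OP_N, rows, cols,
                &alpha, dA, /*lda=*/rows, dx, 1, &beta, dy, 1);
}
```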
What I'm observing is that for matrices of size rows x cols, with a small number of cols, plain CUDA consistently outperforms cuBLAS, apparently regardless of the number of rows. With a large number of cols, however, the cuBLAS implementation wins out.
So I was wondering: are there rules of thumb/guidelines for the minimal dimensions of matrices/vectors above which using BLAS or cuBLAS performs better than plain C/CUDA, or is that entirely dependent on the application/BLAS function?
I have run a few benchmarks, which I'll post here. The results are for a linear regression task running 10 iterations of SGD on datasets with 10,000 rows. The implementation and more results are available here: https://github.com/thvasilo/cuda-sgd-sese-project
Runtimes for 10-100 features/columns:
So for my implementation the change-point at which plain CUDA becomes slower is at around 50 columns. There is a jump in runtime at 100 features for cuBLAS, but that could be an artifact; these experiments were only run once, and the differences are not large anyway.
When running with more columns, BLAS level 2 consistently performs better: