c++ - How large should matrices be if I use BLAS/cuBLAS for it to perform better than plain C/CUDA?


I am implementing stochastic gradient descent on the GPU using CUDA, Thrust and cuBLAS.

In my initial implementation I used plain CUDA to perform the matrix-vector operations, and now I'm trying to optimize it by using cuBLAS for such operations instead.
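To make the comparison concrete, here is a minimal sketch (not the exact kernels from my repository; names and the column-major layout are assumptions) of the kind of matrix-vector product being swapped: a hand-written kernel versus the equivalent cuBLAS level-2 call.

```cpp
// Minimal sketch: computing out = X^T * w, e.g. for an SGD gradient step.
// X is assumed to be stored column-major with dimensions rows x cols.
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Plain-CUDA version: one thread per output element (one per column of X).
__global__ void matvec_transposed(const float* X, const float* w,
                                  float* out, int rows, int cols) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < cols) {
        float acc = 0.0f;
        for (int i = 0; i < rows; ++i)
            acc += X[j * rows + i] * w[i];   // column-major access
        out[j] = acc;
    }
}

// cuBLAS level-2 version of the same product: out = alpha * X^T * w + beta * out.
void matvec_transposed_cublas(cublasHandle_t handle, const float* X,
                              const float* w, float* out, int rows, int cols) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemv(handle, CUBLAS_OP_T,
                rows, cols,          // dimensions of X (column-major)
                &alpha, X, rows,     // lda = rows
                w, 1,
                &beta, out, 1);
}
```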

What I'm observing is that for matrices of size rows x cols, with a small number of cols, plain CUDA consistently outperforms cuBLAS, apparently regardless of the number of rows. With a large number of cols however, the cuBLAS implementation wins out.

So I'm wondering: are there rules of thumb/guidelines for the minimal dimensions of matrices/vectors above which using BLAS or cuBLAS performs better than plain C/CUDA, or is it entirely dependent on the application/BLAS function?

I have run a few benchmarks, which I post here: the results are for a linear regression task running 10 iterations of SGD, on datasets with 10,000 rows. The implementation and more results are available here: https://github.com/thvasilo/cuda-sgd-sese-project
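For reference, the timings were taken along the lines of the hypothetical harness below; `run_sgd_iteration` is a placeholder for either the plain-CUDA or the cuBLAS path, not a function from the repository.

```cpp
// Illustrative timing sketch using CUDA events; names are assumptions.
#include <cuda_runtime.h>

float time_iterations(void (*run_sgd_iteration)(), int iterations) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < iterations; ++i)
        run_sgd_iteration();          // e.g. the 10 SGD iterations in the benchmarks
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```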

Runtimes for 10-100 features/columns:

[Figure: runtime vs. number of columns, 10-100 features]

So for my implementation the change-point at which plain CUDA becomes slower is at around 50 columns. There is a jump in runtime at 100 features for cuBLAS, but that could be an artifact; these experiments were only run once and the differences are not large anyway.

When running with more columns, BLAS Lvl. 2 consistently performs better:

[Figure: runtime vs. number of columns, larger column counts]

