Adreno 530 workgroup size

I've added a test branch (adreno_tryout) in CLBlast to test the Qualcomm-provided kernel from the tutorial mentioned above. This is a very hacky integration of that kernel and is by no means meant to be actually used. However, it is there to find out whether that kernel fixes the performance issues with CLBlast. If so, I can work towards integrating such a kernel properly. If not, we'll have to continue investigating. One thing suggested by the tutorial is using an OpenCL image for matrix B, but I didn't implement that.

The branch adreno_tryout contains the Qualcomm-provided kernel and also a modified tuner to tune the local workgroup size. To try it out:

1. First test performance with the latest master branch for reference, e.g. ./clblast_client_xgemm -m 256 -n 256 -k 256 -num_steps 4 -step 256.
2. Run clblast_test_xgemm to see if everything works OK.
3. Run clblast_tuner_xgemm and share the output here.
4. Modify the numbers in src/database/kernels/xgemm/xgemm_32.hpp according to the output of the tuner.
5. Run clblast_client_xgemm and compare with what you had before.

Notes: this branch is currently single-precision FP32 only and assumes alpha=1 and beta=0.

| ID | total | param | compiles | time | GFLOPS | status |
|---|---|---|---|---|---|---|
| ref | - | - | OK | 35.66 ms | - | reference OK |

Please share your thoughts. This is just an experimental branch, but is this still expected behaviour?

I'm working on improving compilation. Thanks both for trying it out, very [...]

I wouldn't try other sizes right now for tuning. This database.cpp was always a tricky one. I improved things over time, but I will now try to split it up to avoid long compilation times and excessive memory [...]

Given that almost all results point out some error, don't trust them. I guess this is because the kernel assumes multiples of 16 or 32 or so, and you have e.g. [...]

So, about the results now: it seems the Qualcomm-provided kernel I plugged in is not ideal yet. I would want to focus first on the use-case of m=n=k=1024; afterwards we will look at other cases. The numbers reported by the tuner are the most important ones, since the client might also run other pre/post kernels which we are currently not investigating. At m=n=k=1024, roughly the same performance is reported for the master branch and the try-out branch (both around 8 GFLOPS).
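To put the figure of around 8 GFLOPS at m=n=k=1024 into perspective, here is a small sketch that converts between kernel time and GFLOPS. It assumes the usual 2*m*n*k operation count for GEMM, which matches the alpha=1, beta=0 case; the exact formula used by the CLBlast tuner and client is not stated in this thread, and the function names below are made up for illustration.

```python
def gemm_gflops(m, n, k, seconds):
    """GFLOPS for one GEMM call, assuming 2*m*n*k floating-point operations."""
    flops = 2.0 * m * n * k  # one multiply and one add per inner-product term
    return flops / seconds / 1e9


def gemm_seconds(m, n, k, gflops):
    """Time in seconds one GEMM call takes at a given GFLOPS rate."""
    return 2.0 * m * n * k / (gflops * 1e9)


if __name__ == "__main__":
    # Time corresponding to the ~8 GFLOPS mentioned for m=n=k=1024.
    t = gemm_seconds(1024, 1024, 1024, 8.0)
    print(f"{t * 1e3:.1f} ms")  # prints "268.4 ms"
```

Under that assumption, a single 1024x1024x1024 multiplication at 8 GFLOPS takes a bit under 270 ms, which is a handy sanity check when comparing times from the tuner against times from the client.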