The code can be found on github
Recently I came across Serban Giuroiu’s implementation of k-means in CUDA. However Serban used pointer of pointers to represent a 2D matrix, which might not be very convenient in some cases. Moreover in my application, I have the data matrix on the device memory already, and the matrix is stored in column major order (to be used in CUBLAS and other CUDA libraries). Therefore I made some changes to Serban’s implementation, concretely:
- The function now works with column major matrix stored in device memory, and the result is also stored in device memory. This reduces the overhead caused by transposing the matrix in Serban’s code, and makes it easier to integrate k-means in other applications.
- A simple CUDA kernel is added for updating the cluster centroids after each iteration. This reduces the overhead caused by multiple memory transfers at each iteration. However I was lazy and this kernel (called update_cluster) has not been well optimized.
- The membership array can be set to NULL when calling the function if you don’t want to have it in the results.
- Added a parameter for the maximum number of k-means iterations.
With the new kernel, the program seems to be faster. The source code is published on github. I already included a simple test case and benchmark in main.cpp, you can compile and run it yourself. Serban’s original version is also included.