As I mentioned in the last post, I like sudokus and use them as a vehicle to explore disciplines and techniques unknown to me. And as I mentioned in an earlier post, I have a new computer with a capable graphics card.
The synthesis of these two pieces of information is that I wrote a sudoku solver that runs on the graphics card. It uses a Monte Carlo method, because these run trivially in parallel and I wanted to start easy.
So I coded up a version for the GPU using CUDA and a roughly equivalent one for the CPU, using a highly optimized data structure to minimize the time spent counting symbols in blocks, rows and columns. However, when I benchmarked both versions, the GPU one was only half as fast as the CPU one, or twice as slow, depending on how you want to put it.
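To give an idea of what such a counting structure can look like, here is a simplified sketch; this is illustrative, not the code from the repository. The idea is one bitmask per row, column and block, so that queries about a unit become a handful of bit operations instead of a scan over nine cells.

```cuda
// Simplified sketch of a counting structure (illustrative, not the code from
// the repository): one 9-bit mask per row, column and 3x3 block, where bit s
// is set if symbol s+1 occurs in that unit.
struct Masks {
    unsigned short rows[9];
    unsigned short cols[9];
    unsigned short blocks[9];
};

// Number of symbols missing from a unit: with such masks this is just a
// population count over nine bits, independent of how the board is stored.
__host__ __device__ inline int missing_symbols(unsigned short mask)
{
    int present = 0;
    for (int s = 0; s < 9; ++s)
        present += (mask >> s) & 1;
    return 9 - present;
}
```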
It turns out that my oh-so-clever data structure was laid out in memory in the worst possible way, which caused the accesses to degrade to more or less serial behaviour whenever the code touched memory, which is basically all the time.
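To make that concrete, here is a schematic illustration of the issue with made-up names, not the actual solver code: if every thread owns one sudoku replica stored as a contiguous struct, neighbouring threads in a warp read addresses about 81 bytes apart and the hardware cannot coalesce those reads into wide transactions. Interleaving the replicas turns the same reads into consecutive addresses.

```cuda
// Illustration of the layout problem (made-up names, not the real code).
#define N_REPLICAS 4096

// Array of structs: each replica is contiguous, so thread t and thread t+1
// read addresses roughly 81 bytes apart -> one memory transaction per thread.
struct Board { unsigned char cell[81]; };
__device__ Board boards_aos[N_REPLICAS];

// Struct of arrays: cell i of all replicas is contiguous, so the threads of a
// warp read consecutive bytes -> the accesses coalesce into a few transactions.
__device__ unsigned char boards_soa[81][N_REPLICAS];

// out must hold N_REPLICAS bytes; launch with one thread per replica.
__global__ void read_first_cell(unsigned char *out)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    out[t] = boards_aos[t].cell[0]   // strided access, slow
           + boards_soa[0][t];       // coalesced access, fast
}
```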
So I rewrote everything to adapt the memory layout to CUDA. In the process of rewriting I made several discoveries. One can share a lot of code between the CPU and the GPU version by using functions declared as both __device__ and __host__ (a small example follows below). This reduced the number of source files by one, down to a single file. On the other hand, the comparison between GPU and CPU is now even more skewed, because the memory layout that is correct for the GPU is not very cache friendly on the CPU, at least when one organises the CPU calculations the way I did.
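The sharing works because a function carrying both qualifiers is compiled once for the host and once for the device, so it can be called from normal CPU code and from kernels alike. A minimal, hypothetical example:

```cuda
// A function with both qualifiers is compiled for CPU and GPU, so the same
// source serves both builds.  (Hypothetical helper, not taken from the solver.)
__host__ __device__ inline int block_index(int row, int col)
{
    // Index of the 3x3 block that contains cell (row, col).
    return (row / 3) * 3 + col / 3;
}

// Callable from a kernel ...
__global__ void fill_block_indices(int *out)
{
    int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell < 81)
        out[cell] = block_index(cell / 9, cell % 9);
}

// ... and from plain host code.
int block_of(int row, int col)
{
    return block_index(row, col);
}
```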
I also stopped using CURAND to generate the random numbers, because that was a bit awkward and complicated things even more. So now I use a super simple (and pretty bad) linear congruential PRNG. I know that it is not very well suited for MC simulations, so I guess I should go back to CURAND or implement a Mersenne Twister or RANROT or something.
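Such a generator fits in a few lines; the constants below are the common Numerical Recipes ones and not necessarily the exact ones in the repository. Each thread keeps its own 32-bit state, which is the main attraction here.

```cuda
// Minimal LCG of the kind I mean (constants from Numerical Recipes; not
// necessarily the exact ones used in the repository).  Each thread keeps its
// own 32-bit state, so nothing is shared between threads.
__host__ __device__ inline unsigned int lcg_next(unsigned int &state)
{
    state = 1664525u * state + 1013904223u;
    return state;
}

// Uniform float in [0, 1), built from the upper bits because the low bits of
// an LCG are the worst ones.  Fine for quick accept/reject decisions, but the
// short period and correlations are exactly why it is a poor choice for MC.
__host__ __device__ inline float lcg_uniform(unsigned int &state)
{
    return (lcg_next(state) >> 8) * (1.0f / 16777216.0f);
}
```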
Another thing that I noticed is that it is very easy to get confused between device and host memory, especially when one tries to build up nested arrays and uses structs. It definitely took me a while to get that straight. A few things still puzzle me, because I could not get them to work. For example, I had problems putting the lookup table for the exponential into __constant__ memory. But if I understand correctly, this should not change much performance-wise.
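For reference, what I was trying looked roughly like the sketch below. The names and the table size are made up; the relevant mechanics are a __constant__ array that is filled from the host with cudaMemcpyToSymbol and read from device code.

```cuda
#include <math.h>

// Sketch of a __constant__ lookup table (made-up names and size).
#define LUT_SIZE 32
__constant__ float d_exp_lut[LUT_SIZE];

void upload_exp_lut()
{
    float h_lut[LUT_SIZE];
    for (int i = 0; i < LUT_SIZE; ++i)
        h_lut[i] = expf(-(float)i);   // exp(-dE) for integer energy differences
    cudaMemcpyToSymbol(d_exp_lut, h_lut, sizeof(h_lut));
}

// Device-side lookup, clamped to the table range.
__device__ float acceptance(int delta_e)
{
    if (delta_e <= 0) return 1.0f;
    return d_exp_lut[delta_e < LUT_SIZE ? delta_e : LUT_SIZE - 1];
}
```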
So if I rewrite the code again, there are still lots of improvements left to make. I also suspect that it would be easier and cleaner (and faster) to use strided arrays instead of nested ones.
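By strided arrays I mean a single flat allocation that is indexed by hand, instead of nested arrays of device pointers that each need their own cudaMalloc. A rough sketch with hypothetical names:

```cuda
// One flat, strided allocation instead of nested device arrays (hypothetical
// names, just to show the indexing).  Cell i of replica r lives at
// cells[i * n_replicas + r].
unsigned char *alloc_boards(int n_replicas)
{
    unsigned char *cells = nullptr;
    cudaMalloc(&cells, 81 * (size_t)n_replicas);
    return cells;
}

__device__ inline unsigned char &cell_at(unsigned char *cells,
                                         int n_replicas, int i, int r)
{
    return cells[i * n_replicas + r];
}
```

One allocation and one copy replace a whole tree of small ones, there is no pointer chasing on the GPU, and the stride is exactly the layout that coalescing wants anyway.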
But for now I am pretty happy with the performance. The change of memory layout boosted performance by an order of magnitude. So if one wants to use CUDA efficiently, one needs to keep a few things about how the hardware behaves in mind and design the algorithms accordingly.
The code is not a particularly good sudoku solver; it is very inefficient and not very fast. It needs a few thousand MC steps to solve a completely empty sudoku, and the big ensemble one considers does not really help. It might be interesting to build a genetic optimisation algorithm on top of it; that should work pretty well. Or to use it to study the statistical physics of sudoku, as described for example in this article (Preprint).
If someone is interested, the code is on GitHub.