This weekend I was reading a paper on programming the Cerebras wafer scale engine: https://arxiv.org/html/2405.07898v1 . Data movement is the expensive part of computing, and some algorithms, like stencils, only require nearest-neighbor data movement per cycle. Transfers between neighboring processing elements on the same Cerebras wafer cost very little energy, so they came up with a language called Tungsten that focuses on this exchange primitive in the kernel programming model.
I thought the challenge of programming hundreds of thousands of cores arranged in a mesh would be interesting, so I wrote a simulator, a simple compiler, and a few simple kernels for the wafer scale engine using publicly available documents.
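To make the "nearest neighbor only" point concrete, here is a minimal sketch of a 5-point averaging stencil in plain Python. It is an illustration only, not code from the paper or from Tungsten; the function name `stencil_step` and the 5x5 grid are made up for the example. If each grid cell lived on its own processing element, every update would need data from just the four adjacent elements.

```python
def stencil_step(grid):
    """One 5-point averaging step; boundary cells are left unchanged."""
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Each cell reads only itself and its N/S/W/E neighbors,
            # so on a mesh this is one exchange with each adjacent PE.
            out[r][c] = (grid[r][c] + grid[r - 1][c] + grid[r + 1][c]
                         + grid[r][c - 1] + grid[r][c + 1]) / 5.0
    return out


# Hypothetical input: a single hot spot in the middle of a 5x5 grid.
grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 5.0
result = stencil_step(grid)
```

After one step the hot spot has spread to its four direct neighbors and nowhere else, which is exactly the communication pattern a mesh handles cheaply.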
I'm used to CUDA, so I asked: "How would you map something like CUDA onto a machine like this?" Well, I use something like malloc to allocate global memory, memcpy to move data between host and device memory, and a queue of thread block launches. But this time, communication within a thread block uses nearest-neighbor send/recv instructions instead of shared memory on a streaming multiprocessor. This is inspired by the stencils in Tungsten.
The whole program is made up of a bulk synchronous kernel of many thread blocks.
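Roughly, the host side of that model could look like the toy sketch below. This is not the actual simulator code: `Device`, `memcpy_h2d`, `memcpy_d2h`, `launch`, `synchronize`, and `neighbor_sum_kernel` are hypothetical names invented for illustration, and the send/recv exchange is modeled with a plain Python list acting as each thread's inbox.

```python
from collections import deque


class Device:
    """Toy stand-in for the wafer: global memory plus a launch queue."""

    def __init__(self):
        self.mem = {}         # handle -> list of floats ("global memory")
        self.next_handle = 0
        self.queue = deque()  # pending thread block launches

    def malloc(self, n):
        handle = self.next_handle
        self.next_handle += 1
        self.mem[handle] = [0.0] * n
        return handle

    def memcpy_h2d(self, handle, host):
        self.mem[handle][:len(host)] = host

    def memcpy_d2h(self, handle):
        return list(self.mem[handle])

    def launch(self, kernel, block_size, *args):
        self.queue.append((kernel, block_size, args))

    def synchronize(self):
        # Drain the queue; each entry is one thread block.
        while self.queue:
            kernel, block_size, args = self.queue.popleft()
            kernel(self, block_size, *args)


def neighbor_sum_kernel(dev, block_size, src, dst, base):
    """Each thread adds the value held by its west neighbor in the block."""
    vals = dev.mem[src][base:base + block_size]
    # Superstep 1: every thread sends its value east (thread i -> i + 1).
    inbox = [None] * block_size
    for tid in range(block_size - 1):
        inbox[tid + 1] = vals[tid]
    # Barrier between supersteps, then every thread receives from the west.
    for tid in range(block_size):
        west = inbox[tid] if inbox[tid] is not None else 0.0
        dev.mem[dst][base + tid] = vals[tid] + west


dev = Device()
src, dst = dev.malloc(8), dev.malloc(8)
dev.memcpy_h2d(src, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
for base in (0, 4):  # two thread blocks of four threads each
    dev.launch(neighbor_sum_kernel, 4, src, dst, base)
dev.synchronize()
out = dev.memcpy_d2h(dst)
```

Note that the first thread of each block has no west neighbor inside its block, so it keeps its own value: messages never cross a block boundary, just as shared memory never crosses one in CUDA.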
I think this is interesting because CUDA has hard limits on thread block size (at most 1024 threads per block on current GPUs), whereas this mesh perspective lets you grow or shrink the blocks significantly.
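As a small illustration of resizing blocks, the sketch below tiles a mesh of processing elements with rectangular blocks of whatever size you ask for. The helper `partition_mesh` and the 8x8 mesh are assumptions made up for this example, not part of the simulator.

```python
def partition_mesh(mesh_w, mesh_h, block_w, block_h):
    """Return (x, y, width, height) rectangles that tile the mesh."""
    blocks = []
    for y in range(0, mesh_h, block_h):
        for x in range(0, mesh_w, block_w):
            # Clip blocks that would run past the edge of the mesh.
            w = min(block_w, mesh_w - x)
            h = min(block_h, mesh_h - y)
            blocks.append((x, y, w, h))
    return blocks


# Same hypothetical 8x8 mesh, carved into many small or few large blocks.
small = partition_mesh(8, 8, 2, 2)
large = partition_mesh(8, 8, 8, 4)
```

Here `small` holds sixteen 2x2 blocks and `large` holds two 8x4 blocks; the same idea scales the block up toward the whole mesh.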
Note that some information about Cerebras wafer scale engines, like the ISA, is not public (as far as I know). In this code, I just guessed what it could be.
So this should not be taken as a faithful or accurate simulation of the wafer scale engine. It is more like a point in the design space that is similar in that it includes a wafer-sized mesh of processing elements.