We'll now be using our new knowledge of CUDA kernels to implement the parallel prefix algorithm, also known as the scan design pattern. We have already seen simple examples of this in the form of PyCUDA's InclusiveScanKernel and ReductionKernel functions in the previous chapter. We'll now look into this idea in a little more detail.
The central motivation of this design pattern is that we have a binary operator
(minimum)), and collection of elements,
.
We wish to retain the partial results, that is the n - 1 sub-computations...