Parallel execution of kernels

I'm trying to execute kernels in parallel. This is not for real use yet, just to understand what is possible. I'm running on a 460 GPU under 64-bit Windows 10.

 

My test setup uses a simple kernel that does nothing interesting but takes about 800 milliseconds to execute. I run the kernel with a small work size of 64, which means that it can run on a single CU of my GPU. Given that a 460 has 14 CUs, it should, in theory, be possible to run several kernels in parallel. I start by allocating a number of command queues on the host. I then start a number of host threads, all doing the same thing:

  • create their own kernel instance,
  • take one of the preallocated command queues (a different one for each thread),
  • enqueue the kernel,
  • call "finish" so the host can time the kernel execution,
  • enqueue a read operation to get the results back to the host,
  • call "finish" so the host can time the read-back.
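The per-thread sequence above can be sketched roughly as follows. This is only a host-side skeleton, not the actual test code: `time.sleep` stands in for the real enqueue and finish calls, and the function names, the `threading.Barrier` used to release the threads together, and the durations are all assumptions made for illustration.

```python
import threading
import time


def run_worker(index, barrier, results, kernel_seconds, read_seconds):
    """One host thread: run the kernel step, then the read-back step, timing both."""
    barrier.wait()  # assumption: release all threads together so they start at once

    t0 = time.perf_counter()
    time.sleep(kernel_seconds)  # stand-in for: enqueue kernel, then "finish"
    t1 = time.perf_counter()
    time.sleep(read_seconds)    # stand-in for: enqueue read, then "finish"
    t2 = time.perf_counter()

    # Each thread writes only to its own slot, so no lock is needed.
    results[index] = (t1 - t0, t2 - t1)


def run_test(num_threads, kernel_seconds=0.8, read_seconds=0.01):
    """Start num_threads workers and return one (kernel_time, read_time) pair per thread."""
    barrier = threading.Barrier(num_threads)
    results = [None] * num_threads
    threads = [
        threading.Thread(
            target=run_worker,
            args=(i, barrier, results, kernel_seconds, read_seconds),
        )
        for i in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

In the real test each worker would hold its own kernel instance and its own command queue; here the sleeps never contend, so the skeleton only shows the structure and where the host-side timestamps are taken.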

 

This way I start a number of kernels, typically 4 or 8, simultaneously, each on a different command queue. Looking at the timeline with CodeXL confirms this.

 

What I'm seeing is that usually one or two kernels execute in parallel. I've occasionally seen three or four, but that seems to be the exception. So my first conclusion is that parallel execution is possible to some extent.

 

Looking at the CodeXL timeline, I see that my kernels take either their normal 800 milliseconds to execute or twice as long: 1600 milliseconds. However, according to CodeXL a kernel sometimes executes in almost no time, so there might be some error in CodeXL's timing. See attached picture.
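One way to cross-check the timeline independently of CodeXL is to collect a start and end timestamp per kernel and count how many kernels overlap at any instant. The sketch below assumes such timestamps are available, for example from event profiling if this is OpenCL (`clGetEventProfilingInfo` with `CL_PROFILING_COMMAND_START` and `CL_PROFILING_COMMAND_END` on a queue created with profiling enabled). The function name and the sample numbers are made up for illustration and are not measurements from my runs.

```python
def peak_concurrency(intervals):
    """Return the largest number of (start, end) intervals overlapping at any instant."""
    events = []
    for start, end in intervals:
        events.append((start, 1))   # a kernel starts
        events.append((end, -1))    # a kernel ends

    # Tuples sort by time first; at equal times -1 sorts before +1, so a kernel
    # that starts exactly when another ends is not counted as overlapping it.
    events.sort()

    running = 0
    peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak


# Hypothetical timestamps in milliseconds, chosen only to illustrate the idea.
sample = [(0, 800), (0, 1600), (800, 1600), (1600, 2400)]
print(peak_concurrency(sample))
```

If the peak computed from the device timestamps never exceeds two while the host threads enqueue four or eight kernels at once, that would suggest the limit sits on the device or driver side rather than in the profiler's display.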

 

I do have a few questions:

  • I don't understand where the limitation of two simultaneous kernels comes from.
  • I don't understand why kernels sometimes take almost exactly twice their normal time to execute.
  • Looking at the attached picture, the enqueue of the read-back operation on the host for the kernel that finished first is delayed (the blue line at the top right). It apparently waits until something happens on other host threads or command queues, but I don't understand what it is waiting for. As far as I understand, the command queues should be independent of each other.

 

Any help would be appreciated.