Efficiency of events

My workload follows a pattern like "a large job -> several small jobs -> a large job -> several small jobs -> ...". Each small job uses no more than 4 CUs, and they don't depend on each other, so they can run concurrently. But when I do that, I find that a kernel does not start immediately when the event it waits on is signaled. The delay is quite large; putting everything in one queue is actually much faster.


What's more, I ran an experiment. I enqueued the same kernel to two queues many times, and it ran like this:


Then I made the kernels in Queue 2 "depend" on the kernels in Queue 1, and I got this:




This is... fascinating. Kernels that wait on events are delayed not only after those events are signaled but also after the preceding kernels in the same queue have finished. The gaps persist even after all jobs in Queue 1 are done. Why?
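
To put numbers on these gaps, one option is to compare per-kernel timestamps. Below is a minimal sketch in C, assuming this is OpenCL, that both queues were created with `CL_QUEUE_PROFILING_ENABLE`, and that the start/end times were read with `clGetEventProfilingInfo` (`CL_PROFILING_COMMAND_START` and `CL_PROFILING_COMMAND_END`). The `kernel_span` struct and the `launch_gap_ns` function are hypothetical names used only for illustration; they are not part of any API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Start/end time of one kernel in nanoseconds.
 * Assumption: values come from clGetEventProfilingInfo with
 * CL_PROFILING_COMMAND_START / CL_PROFILING_COMMAND_END. */
typedef struct {
    uint64_t start_ns;
    uint64_t end_ns;
} kernel_span; /* hypothetical helper type */

/* Returns how long kernel `k` sat idle after it became runnable.
 * `dep`  : kernel whose event `k` waits on (may be NULL).
 * `prev` : kernel before `k` in the same in-order queue (may be NULL).
 * `k` is considered runnable once both of those have ended. */
uint64_t launch_gap_ns(const kernel_span *dep,
                       const kernel_span *prev,
                       const kernel_span *k)
{
    assert(k != NULL);

    /* Without any predecessor there is no reference point for a gap. */
    if (dep == NULL && prev == NULL) {
        return 0;
    }

    uint64_t ready_ns = 0;
    if (dep != NULL && dep->end_ns > ready_ns) {
        ready_ns = dep->end_ns;
    }
    if (prev != NULL && prev->end_ns > ready_ns) {
        ready_ns = prev->end_ns;
    }

    /* Clamp to zero in case the kernel started at or before ready_ns. */
    return (k->start_ns > ready_ns) ? (k->start_ns - ready_ns) : 0;
}
```

Calling `launch_gap_ns(&q1_kernel, &q2_previous, &q2_kernel)` for every kernel in Queue 2 gives one gap per kernel, which makes it easier to compare the event-dependent run against the independent run and the single-queue run.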