work_group_scan_inclusive_add and register count

I thought I would try out some of the new OpenCL 2.0 workgroup functions.


Comparing the performance of work_group_scan_inclusive_add against my home-grown prefix scan, I found that work_group_scan_inclusive_add led to less work-item divergence but used 10 more VGPRs.

My own scan, using local memory, led to more divergence but no increase in VGPR usage.
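
For readers who want something concrete, here is a minimal sketch of the kind of local-memory scan being compared. It is an assumption, not the actual kernel: it emulates a Hillis-Steele inclusive scan on the host in plain C, with each loop over lid standing in for the work-items of one work-group and the gap between the two inner loops standing in for barrier(CLK_LOCAL_MEM_FENCE). WG_SIZE and both function names are hypothetical. The sequential reference computes the same values that work_group_scan_inclusive_add would return to each work-item.

```c
#define WG_SIZE 8  /* hypothetical work-group size for this sketch */

/* Host-side emulation of a Hillis-Steele inclusive add scan.
   In an OpenCL kernel, scratch would be a __local buffer, each value of
   lid would be one work-item, and the two inner loops would be separated
   by barrier(CLK_LOCAL_MEM_FENCE). */
static void local_scan_inclusive_add(const int *in, int *out)
{
    int scratch[WG_SIZE];  /* stands in for local memory */
    int tmp[WG_SIZE];      /* stands in for one private value per work-item */

    for (int lid = 0; lid < WG_SIZE; ++lid)
        scratch[lid] = in[lid];

    for (int offset = 1; offset < WG_SIZE; offset *= 2) {
        /* Read phase: only work-items with lid >= offset add a partner.
           This conditional is a likely source of work-item divergence. */
        for (int lid = 0; lid < WG_SIZE; ++lid)
            tmp[lid] = (lid >= offset) ? scratch[lid] + scratch[lid - offset]
                                       : scratch[lid];

        /* (barrier here in a real kernel) */

        /* Write phase: publish the partial sums for the next step. */
        for (int lid = 0; lid < WG_SIZE; ++lid)
            scratch[lid] = tmp[lid];

        /* (barrier here in a real kernel) */
    }

    for (int lid = 0; lid < WG_SIZE; ++lid)
        out[lid] = scratch[lid];
}

/* Sequential reference: the values the OpenCL 2.0 built-in would produce,
   i.e. what each work-item gets from
       int r = work_group_scan_inclusive_add(x);
   where x is that work-item's input. */
static void reference_scan_inclusive_add(const int *in, int *out)
{
    int sum = 0;
    for (int i = 0; i < WG_SIZE; ++i) {
        sum += in[i];
        out[i] = sum;
    }
}
```

Calling both functions on the same WG_SIZE-element input should give identical outputs. The built-in needs no local scratch buffer in the kernel source, but how the compiler implements it (and how many registers it spends doing so) is up to the implementation.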


Overall, work_group_scan_inclusive_add was faster. But is there a way for the built-in to reuse existing registers so that it does not increase register pressure?