peak GCN performance possible with 1 wave

Everything I've read so far about wave occupancy suggests (or even explicitly states) that a minimum of 4 waves in flight is required for full VALU utilization on GCN. After scrutinizing documentation and code, I've come to the conclusion that full VALU utilization can be obtained with just one wave. That holds only for kernels made up entirely of vector instructions, so for practical purposes the minimum is 2 waves.
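
To make the one-wave versus two-wave argument concrete, here is a toy issue model in Python. It is only a sketch under assumptions that are not spelled out in the text above: a SIMD gets one issue slot every 4 cycles, a wave can issue at most one instruction per slot, and a slot can take one vector and one scalar instruction as long as they come from different waves. The function name `valu_utilization` and the `"V"`/`"S"` pattern strings are made up for illustration, and the model ignores memory latency, dependencies, and every other instruction type.

```python
def valu_utilization(streams, slots):
    """Fraction of issue slots in which a vector (VALU) instruction is issued.

    streams: one looping instruction pattern per wave ('V' = vector, 'S' = scalar).
    slots:   number of issue slots to simulate (assumed: one slot per 4 cycles).
    """
    pcs = [0] * len(streams)  # per-wave position in its instruction pattern
    valu_slots = 0
    for slot in range(slots):
        busy = set()  # instruction types already issued in this slot
        for i in range(len(streams)):
            w = (slot + i) % len(streams)  # rotate priority between waves
            op = streams[w][pcs[w] % len(streams[w])]
            if op not in busy:  # assumed: one instruction of each type per slot
                busy.add(op)
                pcs[w] += 1  # the wave issued, so it advances
            # otherwise the wave stalls and retries in the next slot
        if "V" in busy:
            valu_slots += 1
    return valu_slots / slots


print(valu_utilization(["V"], 1200))             # 1.0: one wave, vector only
print(valu_utilization(["VVSV"], 1200))          # 0.75: one wave loses a slot per scalar op
print(valu_utilization(["VVSV", "VVSV"], 1200))  # 1.0: a second wave fills the gap
```

In this model a single wave keeps the VALU busy only when its stream is all vector instructions; as soon as scalar instructions appear, a second wave is what fills the slots the first one spends on them.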

Nerd Ralph: Inside AMD GCN code execution