FirePro W8100 GPU fault

Hi,

I have a brand-new FirePro W8100, running under Ubuntu, which I bought for double-precision OpenCL calculations.

Unfortunately it is very unstable. I have tried several combinations of drivers and kernels, and I consistently get errors like these in the log:

Sep 23 09:14:46 ryzen kernel: amdgpu 0000:08:00.0: GPU fault detected: 147 0x096ac802
Sep 23 09:14:46 ryzen kernel: amdgpu 0000:08:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0011B328
Sep 23 09:14:46 ryzen kernel: amdgpu 0000:08:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A104002
Sep 23 09:14:46 ryzen kernel: amdgpu 0000:08:00.0: VM fault (0x02, vmid 5) at page 1159976, read from 'TC3' (0x54433300) (260)
Sep 23 09:14:46 ryzen kernel: amdgpu 0000:08:00.0: GPU fault detected: 147 0x014a0402
Sep 23 09:14:46 ryzen kernel: amdgpu 0000:08:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0004E107
Sep 23 09:14:46 ryzen kernel: amdgpu 0000:08:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A184002
Sep 23 09:14:46 ryzen kernel: amdgpu 0000:08:00.0: VM fault (0x02, vmid 5) at page 319751, read from 'TC5' (0x54433500) (388)
Sep 23 09:14:46 ryzen kernel: amdgpu 0000:08:00.0: GPU fault detected: 147 0x062a0402
Sep 23 09:14:46 ryzen kernel: amdgpu 0000:08:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0004DD06
Sep 23 09:14:46 ryzen kernel: amdgpu 0000:08:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A1C4002
Sep 23 09:14:46 ryzen kernel: amdgpu 0000:08:00.0: VM fault (0x02, vmid 5) at page 318726, read from 'TC7' (0x54433700) (452)
Sep 23 09:14:46 ryzen kernel: amdgpu 0000:08:00.0: GPU fault detected: 147 0x01eac402
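
In case it helps anyone read these lines, here is a rough Python sketch that pulls the fields out of a "VM fault" line. It is only a convenience for summarising the log, not anything from the driver: the function name parse_vm_fault and the regular expression are my own, written against the exact line format shown above, so other kernel versions may well print something different. It relies on two things visible in the log itself: the page number is the VM_CONTEXT1_PROTECTION_FAULT_ADDR value printed in decimal, and the hex word after the client name is that name as ASCII bytes.

```python
import re

# Hypothetical helper: the pattern matches the "VM fault" summary line
# exactly as it appears in the log excerpt above (an assumption about format).
VM_FAULT_RE = re.compile(
    r"VM fault \((?P<status>0x[0-9a-fA-F]+), vmid (?P<vmid>\d+)\) "
    r"at page (?P<page>\d+), (?P<access>read|write) from "
    r"'(?P<client>[^']*)' \((?P<client_hex>0x[0-9a-fA-F]+)\) \((?P<client_id>\d+)\)"
)


def parse_vm_fault(line):
    """Return the fields of one 'VM fault' log line as a dict, or None."""
    m = VM_FAULT_RE.search(line)
    if m is None:
        return None
    page = int(m.group("page"))
    client_word = int(m.group("client_hex"), 16)
    # The 32-bit word holds the client name as ASCII, padded with NUL bytes.
    tag = client_word.to_bytes(4, "big").rstrip(b"\x00").decode("ascii", "replace")
    return {
        "vmid": int(m.group("vmid")),
        "page": page,
        # Same formatting as the FAULT_ADDR line, for easy cross-checking.
        "page_hex": f"0x{page:08X}",
        "access": m.group("access"),
        "client": m.group("client"),
        "client_from_hex": tag,
        "client_id": int(m.group("client_id")),
    }


if __name__ == "__main__":
    sample = (
        "Sep 23 09:14:46 ryzen kernel: amdgpu 0000:08:00.0: VM fault (0x02, vmid 5) "
        "at page 1159976, read from 'TC3' (0x54433300) (260)"
    )
    print(parse_vm_fault(sample))
```

Run over the excerpt, this shows all three faults are reads in vmid 5 from different 'TC' clients, at three different pages.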

I have found many similar reports via Google, but I haven't been able to find a definitive solution. The error is usually unrecoverable: the process running the calculation freezes, and trying to kill that process locks up the whole machine.

A calculation that seems to trigger the problem reliably is 'make alltuners' from the CLBlast project (CNugteren/CLBlast on GitHub, "Tuned OpenCL BLAS").
It dies at the double-precision complex ZGEMM (if it makes it that far).