Optimized half-precision GEMM assembly kernels on AMD Fiji for deep learning

This is a set of optimized half-precision GEMM (general matrix multiply) kernels for AMD Fiji. They are written in native GCN assembly and achieve much better performance than clBLAS.
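To make concrete what a half-precision GEMM computes, here is a minimal CPU reference sketch in Python of C = alpha * A * B + beta * C on binary16 values. A baseline like this could be used to check results from GPU kernels. It is not taken from GCNGEMM. The names `hgemm_reference` and `to_half` are hypothetical. Accumulating in Python floats (binary64) before rounding back to half precision is an assumption, not the repository's documented behavior.

```python
import struct
from typing import List

Matrix = List[List[float]]


def to_half(x: float) -> float:
    """Round x to the nearest IEEE 754 binary16 (half-precision) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def hgemm_reference(a: Matrix, b: Matrix, c: Matrix,
                    alpha: float = 1.0, beta: float = 0.0) -> Matrix:
    """Compute alpha * A @ B + beta * C with half-precision inputs and outputs.

    Assumption: inputs are rounded to binary16, dot products are accumulated
    in binary64 (Python float), and each result is rounded back to binary16.
    Real GPU kernels may accumulate in fp16 or fp32 instead.
    """
    m, k = len(a), len(b)
    n = len(b[0]) if k else 0
    # Validate shapes: A is m x k, B is k x n, C is m x n.
    if (any(len(row) != k for row in a)
            or any(len(row) != n for row in b)
            or len(c) != m
            or any(len(row) != n for row in c)):
        raise ValueError("shape mismatch")

    out: Matrix = []
    for i in range(m):
        row = []
        for j in range(n):
            acc = 0.0  # wide accumulator (assumed, see docstring)
            for p in range(k):
                acc += to_half(a[i][p]) * to_half(b[p][j])
            row.append(to_half(alpha * acc + beta * to_half(c[i][j])))
        out.append(row)
    return out
```

Half precision has only an 11-bit significand, so a GPU result would normally be compared against a reference like this with a tolerance rather than exact equality.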

Link: GitHub - hyln9/GCNGEMM: Optimized half precision gemm assembly kernels on AMD Fiji