Wmma 3 dawn of mma
2/13/2023

In the hope that someone here can help me understand what I am doing wrong, I will post a small repro case here. I am attempting to use the tensor cores efficiently in a custom DL inference kernel, but I get very poor performance.

The kernel signature and the WMMA calls in question:

__global__ void test_wmma( __half* d_A, __half* d_B, __half* d_C )
wmma::load_matrix_sync( a_frag, d_A, 16 )
wmma::mma_sync( acc_frag, a_frag, b_frag, acc_frag )
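For readers who want something they can compile, here is a minimal sketch of how those calls might fit together in a complete kernel. Only the kernel name, its three pointer parameters, the fragment names a_frag and acc_frag, b_frag, and the leading dimension of 16 come from the post. Everything else is an assumption made for illustration: the 16x16x16 tile shape, the row-major A and column-major B layouts, the __half accumulator, the fill_fragment and store_matrix_sync calls, the single-warp launch, and the hypothetical host helper run_wmma_demo. It checks correctness on one tile only and says nothing about performance.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <mma.h>

using namespace nvcuda;

// One warp computes a single 16x16 tile C = A * B on the tensor cores.
// Assumed layouts: A row-major, B column-major, C row-major, all with ld = 16.
__global__ void test_wmma(const __half* d_A, const __half* d_B, __half* d_C)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, __half> acc_frag;

    wmma::fill_fragment(acc_frag, __float2half(0.0f));   // start from C = 0
    wmma::load_matrix_sync(a_frag, d_A, 16);
    wmma::load_matrix_sync(b_frag, d_B, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);   // acc = A * B + acc
    wmma::store_matrix_sync(d_C, acc_frag, 16, wmma::mem_row_major);
}

// Hypothetical host helper (not from the post): runs the kernel on small
// integer-valued inputs and compares against a CPU reference.
// Returns true when every element matches.
bool run_wmma_demo()
{
    const int N = 16;
    float fA[N * N], fB[N * N], ref[N * N];
    __half h_A[N * N], h_B[N * N], h_C[N * N];

    for (int i = 0; i < N; ++i)
        for (int k = 0; k < N; ++k) {
            fA[i * N + k] = float((i + k) % 3);           // A, row-major
            h_A[i * N + k] = __float2half(fA[i * N + k]);
        }
    for (int j = 0; j < N; ++j)
        for (int k = 0; k < N; ++k) {
            fB[k + j * N] = float((k + j) % 2);           // B, column-major
            h_B[k + j * N] = __float2half(fB[k + j * N]);
        }
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < N; ++k)
                sum += fA[i * N + k] * fB[k + j * N];
            ref[i * N + j] = sum;                         // C, row-major
        }

    __half *d_A = nullptr, *d_B = nullptr, *d_C = nullptr;
    const size_t bytes = N * N * sizeof(__half);
    if (cudaMalloc(&d_A, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_B, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_C, bytes) != cudaSuccess) return false;
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    test_wmma<<<1, 32>>>(d_A, d_B, d_C);                  // exactly one warp
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    if (ok) cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    if (!ok) return false;

    // Inputs are small integers, so the half-precision result is exact.
    for (int i = 0; i < N * N; ++i)
        if (__half2float(h_C[i]) != ref[i]) {
            std::printf("mismatch at %d: got %f, expected %f\n",
                        i, __half2float(h_C[i]), ref[i]);
            return false;
        }
    return true;
}
```

To try it, call run_wmma_demo() from a main function and build with nvcc for a GPU that has tensor cores, for example with -arch=sm_70 or newer. The WMMA calls are warp-wide, so all 32 threads of the warp must reach them together.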