[Mosaic GPU] Make sure to do the async proxy fence before warpgroup sync

This is the ordering we want for a proper release of generic SMEM stores
into the async proxy. The old order (barrier, then fence) was problematic:
once the warpgroup barrier completed, some warps could get deselected
before they reached the fence. As long as the first warp kept making
progress, it could go through the fence alone and start issuing TMA
copies before the other warps had synchronized with the async proxy.
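The ordering argument can be illustrated with a toy, single-threaded Python model. This is a hypothetical simplification, not Mosaic GPU code: `warp`, `run`, and the `released` set (standing in for "stores visible to the async proxy") are assumptions made for illustration. Each warp is a generator, and the scheduler replays the schedule described above: once the barrier completes, warp 0 runs ahead while the other warps are deselected.

```python
# Toy model of releasing generic SMEM stores to the async proxy.
# Hypothetical illustration only, not Mosaic GPU code. Each warp is a
# generator, and every `yield` is a point where the scheduler may switch warps.

NUM_WARPS = 4  # a warpgroup is 4 warps (128 threads)


def warp(wid, smem, released, arrived, tma_log, fence_first):
    smem[wid] = wid + 1                # generic-proxy store into shared memory
    yield
    if fence_first:
        released.add(wid)              # new order: fence before the barrier
        yield
    arrived.add(wid)                   # arrive at the warpgroup barrier
    yield
    while len(arrived) < NUM_WARPS:    # wait until every warp has arrived
        yield
    if not fence_first:
        released.add(wid)              # old order: fence after the barrier
        yield
    if wid == 0:
        # The leader issues the "TMA copy"; it may only rely on stores that
        # were already released to the async proxy when the copy starts.
        tma_log.append(sorted(released))


def run(fence_first):
    smem, released, arrived, tma_log = {}, set(), set(), []
    warps = {
        w: warp(w, smem, released, arrived, tma_log, fence_first)
        for w in range(NUM_WARPS)
    }
    while warps:
        # Adversarial schedule: once the barrier has completed, only warp 0
        # runs (the others are "deselected") until it finishes.
        if len(arrived) == NUM_WARPS and 0 in warps:
            order = [0]
        else:
            order = list(warps)
        for w in order:
            try:
                next(warps[w])
            except StopIteration:
                del warps[w]
    return tma_log[0]


print(run(fence_first=False))  # old order, prints [0]
print(run(fence_first=True))   # new order, prints [0, 1, 2, 3]
```

With the old order, the leader's copy can only rely on its own stores; with the new order it sees all four warps' stores, because no warp can leave the barrier before every warp has executed its fence.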

I have not observed this problem in any of our kernels so far, but this
order seems safer to me.

PiperOrigin-RevId: 733333814
Adam Paszke 2025-03-04 08:10:34 -08:00 committed by jax authors
parent 155839bb4d
commit cdae5fcfc7

@@ -670,10 +670,10 @@ def parse_indices(


 def commit_shared():
-  warpgroup_barrier()
   nvvm.fence_proxy(
       nvvm.ProxyKind.async_shared, space=nvvm.SharedSpace.shared_cta
   )
+  warpgroup_barrier()


 def warpgroup_barrier():