
When the TMA descriptor is transferred from host memory to global memory using `cudaMemcpy`, each thread block must insert a fence before any thread accesses the updated tensor map in global memory. Once the tensor map has been accessed, no additional fences are needed by that block until the map is modified again. [Example from the CUDA programming guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/#using-tma-to-transfer-multi-dimensional-arrays). The `tma.fence.descriptor` op essentially implements `ptx::fence_proxy_tensormap_generic`.

```
#include <cuda.h>
#include <cuda/ptx>
namespace ptx = cuda::ptx;

__device__ CUtensorMap global_tensor_map;

__global__ void kernel(CUtensorMap *tensor_map) {
  // Fence acquire tensor map:
  ptx::n32_t<128> size_bytes;
  // Since the tensor map was modified from the host using cudaMemcpy,
  // the scope should be .sys.
  ptx::fence_proxy_tensormap_generic(
      ptx::sem_acquire, ptx::scope_sys, tensor_map, size_bytes);
  // Safe to use tensor_map after the fence in this thread.
}

int main() {
  CUtensorMap local_tensor_map;
  // [ ..Initialize map.. ]
  // Copy the host-side map into the __device__ symbol via its device address.
  CUtensorMap *device_tensor_map;
  cudaGetSymbolAddress(reinterpret_cast<void **>(&device_tensor_map),
                       global_tensor_map);
  cudaMemcpy(device_tensor_map, &local_tensor_map, sizeof(CUtensorMap),
             cudaMemcpyHostToDevice);
  kernel<<<1, 1>>>(device_tensor_map);
}
```
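For reference, a sketch of the PTX this call should lower to (based on the PTX ISA description of `fence.proxy.tensormap::generic`; the 128-byte size operand matches the `n32_t<128>` argument above):

```
// Acquire fence on the tensor map at system scope; 128 is the
// size in bytes of the tensormap being synchronized.
fence.proxy.tensormap::generic.acquire.sys [tensor_map], 128;
```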