rocm_jax

mirror of https://github.com/ROCm/jax.git synced 2025-04-16 03:46:06 +00:00


Dan Foreman-Mackey b6306e3953 Remove synchronization from GPU LU decomposition kernel by adding an async batch pointers builder.

In the batched LU decomposition in cuBLAS, the output buffer must be passed as an array of pointers, one to each matrix in the batch. Previously this pointer array was built on the host and then copied to the device, which required a synchronization, but it is straightforward to instead implement a tiny CUDA kernel that does this work on the device. This definitely isn't a bottleneck or a high-priority change, but this seemed like a reasonable time to fix a longstanding TODO.

PiperOrigin-RevId: 663686539

2024-08-16 04:37:09 -07:00
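The pointer array described above is simple arithmetic: for a contiguous batch, entry `i` is the base address plus `i` times the number of bytes in one matrix. The following is a minimal host-side sketch of that arithmetic using NumPy, assuming a C-contiguous `(batch, m, n)` array. The function name `batch_pointers` is hypothetical and this is not the jaxlib kernel. In a CUDA version, each thread would compute and write one entry directly into device memory, so the table never has to round-trip through the host.

```python
import numpy as np

def batch_pointers(a: np.ndarray) -> list[int]:
    """Return the address of each matrix a[i] in a C-contiguous (batch, m, n) array.

    Hypothetical helper: mirrors the per-entry arithmetic a pointer-builder
    kernel would perform, i.e. entry i = base + i * (bytes per matrix).
    """
    if a.ndim != 3 or not a.flags.c_contiguous:
        raise ValueError("expected a C-contiguous (batch, m, n) array")
    base = a.ctypes.data  # address of the first element of the batch
    matrix_bytes = a.shape[1] * a.shape[2] * a.itemsize
    # In a CUDA kernel, index i would come from the thread/block index.
    return [base + i * matrix_bytes for i in range(a.shape[0])]

a = np.zeros((4, 3, 3), dtype=np.float32)
ptrs = batch_pointers(a)
# Each computed address matches the address NumPy reports for that matrix.
assert ptrs == [a[i].ctypes.data for i in range(a.shape[0])]
```

Because the table depends only on the base address, the batch size, and the matrix size, it can be produced on the device by a kernel launched on the same stream as the factorization, with no host-side synchronization.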

BUILD.bazel

Remove synchronization from GPU LU decomposition kernel by adding an async batch pointers builder.

2024-08-16 04:37:09 -07:00