rocm_jax/gpu at afaa3bf43c10304e97e6fd041f22882a8f91ee3d

mirror of https://github.com/ROCm/jax.git synced 2025-04-16 11:56:07 +00:00

Dan Foreman-Mackey afaa3bf43c Port GPU kernels for SVD to the FFI.

Unlike the other GPU linear algebra kernels that I've ported so far, this one isn't straightforward to implement as a single kernel, and while it does support lowering without access to a GPU (no more descriptor!), it only supports dynamic shapes in the batch dimensions. There are two main technical challenges:

1. The main `gesvd` kernels in cuSolver/hipSolver only support matrices with shape `(m, n)` with `m >= n`. This means that we need to transpose the inputs and outputs as part of the lowering rule when `m < n`. (Note: we actually just use C layouts instead of Fortran layouts to implement this case.) While this could be handled in the kernel, this seemed like a lot of work for somewhat limited benefit, and it would probably have performance implications.

2. The `gesvd` and `gesvdj` kernels return `V^H` and `V` respectively, and the batched version of `gesvdj` doesn't support `full_matrices=False`. This means that we need logic in the lowering rule to handle transposition and slicing. This makes it hard to have the algorithm selection be a parameter to the kernel.
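The two post-processing steps described above can be illustrated with a minimal pure-Python sketch. This is not the JAX lowering rule or any cuSolver/hipSolver API: the helper names (`factors_from_transposed`, `thin_from_full`, `matmul_usvh`) and the hand-built factors are hypothetical and exist only for this example. It assumes a solver that returns full-size `U` and `V` (not `V^H`), and shows how slicing and a conjugate transpose recover the reduced form, and how the factors of a tall matrix map to those of its wide transpose using plain transposes only.

```python
def transpose(mat):
    """Plain transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*mat)]


def conj_transpose(mat):
    """Conjugate transpose (Hermitian adjoint) of a list-of-lists matrix."""
    return [[complex(x).conjugate() for x in col] for col in zip(*mat)]


def matmul_usvh(u, s, vh):
    """Return U @ diag(s) @ V^H for thin factors."""
    k = len(s)
    return [[sum(u[i][l] * s[l] * vh[l][j] for l in range(k))
             for j in range(len(vh[0]))]
            for i in range(len(u))]


def factors_from_transposed(u_t, s, vh_t):
    """Map thin factors of a matrix M to thin factors of M^T.

    If M = U_t diag(s) Vh_t, then M^T = Vh_t^T diag(s) U_t^T,
    so the new U is Vh_t^T and the new V^H is U_t^T (no conjugation).
    """
    return transpose(vh_t), s, transpose(u_t)


def thin_from_full(u_full, s, v_full):
    """Reduce full factors (U: m x m, V: n x n, with V not yet
    conjugate-transposed) to thin form (U: m x k, V^H: k x n)."""
    k = len(s)
    u_thin = [row[:k] for row in u_full]
    vh_thin = conj_transpose(v_full)[:k]
    return u_thin, s, vh_thin


# Hand-built example: a tall 3 x 2 complex matrix with known full factors.
U_FULL = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # 3 x 3 unitary
S = [3, 2]                                   # singular values
V_FULL = [[0, 1j], [1, 0]]                   # 2 x 2 unitary (V, not V^H)
A = [[0, 3], [-2j, 0], [0, 0]]               # equals U diag(S) V^H

# Slice and conjugate-transpose the full factors into the reduced form.
u, s, vh = thin_from_full(U_FULL, S, V_FULL)

# The wide 2 x 3 matrix A^T (the m < n case) gets its factors from those
# of the tall matrix A by plain transposes.
u_wide, s_wide, vh_wide = factors_from_transposed(u, s, vh)
```

Multiplying the factors back together with `matmul_usvh` reproduces `A` in the first case and `transpose(A)` in the second, which is a quick way to check that the slicing and transposition conventions are consistent.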

Another note: cuSolver has a 64-bit implementation of the SVD, and we always use that implementation on the CUDA backend. The 32-bit interface is included for ROCM support, and I have tested it manually. This was a feature request from https://github.com/jax-ml/jax/issues/23413.

PiperOrigin-RevId: 676839182

2024-09-20 07:34:50 -07:00

blas_handle_pool.cc

Move jaxlib GPU handlers to separate build target.

2024-08-01 12:30:04 -07:00

blas_handle_pool.h

Move jaxlib GPU handlers to separate build target.

2024-08-01 12:30:04 -07:00

blas_kernels.cc

Remove synchronization from GPU LU decomposition kernel by adding an async batch pointers builder.

2024-08-16 04:37:09 -07:00

blas_kernels.h

Switch JAX to use the OpenXLA repository.

2023-03-13 18:38:26 +00:00

blas.cc

Move logic about when to dispatch to batched LU decomposition algorithm on GPU into the kernel.

2024-08-14 09:20:40 -07:00

BUILD

Refactor gpusolver kernel definitions into separate build target.

2024-09-12 07:11:36 -07:00

gpu_kernel_helpers.cc

Remove synchronization from GPU LU decomposition kernel by adding an async batch pointers builder.

2024-08-16 04:37:09 -07:00

gpu_kernel_helpers.h

Remove synchronization from GPU LU decomposition kernel by adding an async batch pointers builder.

2024-08-16 04:37:09 -07:00

gpu_kernels.cc

Update FFI target name for syrk operation to be consistent with other kernels.

2024-09-06 13:21:38 -07:00

linalg_kernels.cc

Fix a number of minor problems in the ROCM build.

2024-08-26 17:04:01 -07:00

linalg_kernels.cu.cc

Fix a number of minor problems in the ROCM build.

2024-08-26 17:04:01 -07:00

linalg_kernels.h

Fix a number of minor problems in the ROCM build.

2024-08-26 17:04:01 -07:00

linalg.cc

Port the GPU Cholesky update custom call to the FFI.

2024-08-20 05:46:03 -07:00

make_batch_pointers.cu.cc

Remove synchronization from GPU LU decomposition kernel by adding an async batch pointers builder.

2024-08-16 04:37:09 -07:00

make_batch_pointers.h

Remove synchronization from GPU LU decomposition kernel by adding an async batch pointers builder.

2024-08-16 04:37:09 -07:00

prng_kernels.cc

Move FFI helper macros from jaxlib/cpu/lapack_kernels.cc to a jaxlib/ffi_helpers.h.

2024-07-10 15:09:45 -07:00

prng_kernels.cu.cc

Ported threefry2x32 for GPU to the typed XLA FFI

2024-06-28 06:24:44 -07:00

prng_kernels.h

Fix C++ registration of FFI handlers and consolidate gpu/linalg kernel implementation.

2024-07-10 12:09:12 -07:00

prng.cc

Remove forward compatibility mode for old PRNG custom call on GPU

2024-07-31 08:10:17 -07:00

rnn_kernels.cc

Run LSTM test using FP32 math (as opposed to TF32)

2023-09-19 14:45:14 -04:00

rnn_kernels.h

Run LSTM test using FP32 math (as opposed to TF32)

2023-09-19 14:45:14 -04:00

rnn.cc

Run LSTM test using FP32 math (as opposed to TF32)

2023-09-19 14:45:14 -04:00

solver_handle_pool.cc

Move jaxlib GPU handlers to separate build target.

2024-08-01 12:30:04 -07:00

solver_handle_pool.h

Move jaxlib GPU handlers to separate build target.

2024-08-01 12:30:04 -07:00

solver_interface.cc

Port GPU kernels for SVD to the FFI.

2024-09-20 07:34:50 -07:00

solver_interface.h

Port GPU kernels for SVD to the FFI.

2024-09-20 07:34:50 -07:00

solver_kernels_ffi.cc

Port GPU kernels for SVD to the FFI.

2024-09-20 07:34:50 -07:00

solver_kernels_ffi.h

Port GPU kernels for SVD to the FFI.

2024-09-20 07:34:50 -07:00

solver_kernels.cc

Fix a number of minor problems in the ROCM build.

2024-08-26 17:04:01 -07:00

solver_kernels.h

Move jaxlib GPU handlers to separate build target.

2024-08-01 12:30:04 -07:00

solver.cc

Port GPU kernels for SVD to the FFI.

2024-09-20 07:34:50 -07:00

sparse_kernels.cc

Fix a number of minor problems in the ROCM build.

2024-08-26 17:04:01 -07:00

sparse_kernels.h

Fix a number of minor problems in the ROCM build.

2024-08-26 17:04:01 -07:00

sparse.cc

Fix a number of minor problems in the ROCM build.

2024-08-26 17:04:01 -07:00

triton_kernels.cc

Fix a number of minor problems in the ROCM build.

2024-08-26 17:04:01 -07:00

triton_kernels.h

[jax_triton] Only use side stream to do autotuning when doing graph capture

2024-02-02 10:48:26 -08:00

triton_utils.cc

[jax_triton] Add user-specified name field to serialized format.

2023-08-16 02:53:51 -07:00

triton_utils.h

[jax_triton] Add user-specified name field to serialized format.

2023-08-16 02:53:51 -07:00

triton.cc

[ROCm]: Add get_arch_details for triton kernel call

2024-08-12 06:16:27 +00:00

triton.proto

[triton] Pass cluster_dims to TritonKernel and use cuLaunchKernel if size <= 1

2024-01-19 05:55:41 -08:00

vendor.h

Port GPU kernels for SVD to the FFI.

2024-09-20 07:34:50 -07:00