This kernel wasn't allowed in export, so no backwards compatibility period is required. Even so, the FFI kernels were added 6 months ago.
PiperOrigin-RevId: 724359996
Unlike the other GPU linear algebra kernels that I've ported so far, this one isn't straightforward to implement as a single kernel. While it does support lowering without access to a GPU (no more descriptor!), it only supports dynamic shapes in the batch dimensions. There are two main technical challenges:
1. The main `gesvd` kernels in cuSolver/hipSolver only support matrices with shape `(m, n)` where `m >= n`. This means that we need to transpose the inputs and outputs as part of the lowering rule when `m < n` (see the identity sketched after this list). (Note: we actually just use C layouts instead of Fortran layouts to implement this case.) While this could be handled in the kernel, it seemed like a lot of work for somewhat limited benefit, and it would probably have performance implications.
2. The `gesvd` and `gesvdj` kernels return `V^H` and `V` respectively, and the batched version of `gesvdj` doesn't support `full_matrices=False`. This means that we need logic in the lowering rule to handle transposition and slicing, which makes it hard to expose the algorithm selection as a parameter to the kernel.
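For reference, the standard identity behind the transposition trick in point 1 (a summary of the math, not the exact form used in the lowering rule): if $A = U \Sigma V^H$, then

$$A^T = \overline{V} \, \Sigma \, U^T,$$

so the solver can be run on the (tall) transposed matrix and the factors of the original problem read off with the roles of $U$ and $V$ swapped; for real inputs the conjugation is a no-op.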
Another note: cuSolver has a 64-bit implementation of the SVD, and we always use that implementation on the CUDA backend. The 32-bit interface is included for ROCm support, and I have tested it manually. This was a feature request from https://github.com/jax-ml/jax/issues/23413.
PiperOrigin-RevId: 676839182
There is a lot of boilerplate required for each new custom call to cuSolver/cuBLAS, and having both the FFI logic and the framework wrappers in the same library was getting unwieldy. This change adds a new "interface" target which just includes the shims wrapping the cuSolver/cuBLAS functions; these are then used from `solver_kernels_ffi`, where the FFI logic lives.
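A minimal sketch of the kind of shim such an "interface" target collects (the `solver::Getrf` name and the choice of `getrf` are illustrative, not the actual jaxlib declarations): a plain overload set hides the type-suffixed cuSolver entry points, so the FFI kernels only have to be written once per operation rather than once per dtype.

```cpp
#include <cusolverDn.h>

namespace solver {

// Overloads dispatch on the element type, hiding the S/D suffixes of the
// underlying cuSolver API from the FFI kernels.
inline cusolverStatus_t Getrf(cusolverDnHandle_t handle, int m, int n,
                              float* a, int lda, float* workspace, int* ipiv,
                              int* info) {
  return cusolverDnSgetrf(handle, m, n, a, lda, workspace, ipiv, info);
}

inline cusolverStatus_t Getrf(cusolverDnHandle_t handle, int m, int n,
                              double* a, int lda, double* workspace, int* ipiv,
                              int* info) {
  return cusolverDnDgetrf(handle, m, n, a, lda, workspace, ipiv, info);
}

}  // namespace solver
```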
PiperOrigin-RevId: 673832309
Of note, I moved the logic for choosing which algorithm to use, and when to use the batched algorithm, into the kernel in order to support shape polymorphism and export.
PiperOrigin-RevId: 671853879
The biggest change here is that we now ignore the `info` parameter returned by `getrf`. In the previous implementation, if `info != 0` we would return an error from the batched kernel, or set the relevant matrix entries to NaN in the non-batched version. But since `info` is only used for shape checking (see the LAPACK, cuBLAS, and cuSolver docs), I argue that we will never see `info != 0`, because we already include all the shape checks in the kernel.
PiperOrigin-RevId: 665307128
Previously we always had two steps when extracting the batch size: (1) check that the buffer has enough dimensions, and (2) get the shape. In a few cases, the first check was missing. Now these steps are combined into one function that returns a StatusOr.
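A hedged sketch of what the combined step can look like (the helper name, the `absl::StatusOr` return type, and the tuple layout are assumptions for illustration, not the actual jaxlib helpers): the rank check and the shape extraction happen in one call, so the check can't be forgotten at a call site.

```cpp
#include <cstdint>
#include <tuple>

#include "absl/status/status.h"
#include "absl/status/statusor.h"
#include "absl/strings/str_format.h"
#include "absl/types/span.h"

// Splits a buffer's dimensions into (batch, rows, cols), failing if the
// buffer doesn't have at least two dimensions.
absl::StatusOr<std::tuple<std::int64_t, std::int64_t, std::int64_t>>
SplitBatch2D(absl::Span<const std::int64_t> dims) {
  if (dims.size() < 2) {
    return absl::InvalidArgumentError(absl::StrFormat(
        "Expected a buffer with at least 2 dimensions, got %d", dims.size()));
  }
  std::int64_t batch = 1;
  for (std::size_t i = 0; i + 2 < dims.size(); ++i) {
    batch *= dims[i];
  }
  return std::make_tuple(batch, dims[dims.size() - 2], dims[dims.size() - 1]);
}
```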
As part of this, I needed to fix our implementation of the `ASSIGN_OR_RETURN` macro to properly handle parentheses.
PiperOrigin-RevId: 664803225
In the batched LU decomposition in cuBLAS, the output buffer is passed as an array of pointers to the individual batch matrices. Previously this array of pointers was built on the host and then copied to the device, requiring a synchronization, but it seems straightforward to instead implement a tiny CUDA kernel to do this work. This definitely isn't a bottleneck or a high-priority change, but it seemed like a reasonable time to fix a longstanding TODO.
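A minimal sketch of the idea (not the actual jaxlib kernel; the names and launch configuration are illustrative): since the batch matrices are contiguous in one buffer, each pointer is a fixed offset from the base address, so a trivial device kernel can fill the pointer array without any host staging buffer or synchronization.

```cpp
#include <cstddef>

// One thread per batch element: write the address of that element's matrix.
__global__ void MakeBatchPointers(char* base, void** out, int batch,
                                  std::size_t bytes_per_matrix) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < batch) {
    out[idx] = base + static_cast<std::size_t>(idx) * bytes_per_matrix;
  }
}

// Example launch, enqueued on the same stream as the batched cuBLAS call so
// no explicit synchronization is needed:
//   constexpr int kBlock = 256;
//   MakeBatchPointers<<<(batch + kBlock - 1) / kBlock, kBlock, 0, stream>>>(
//       static_cast<char*>(matrix_buffer), batch_ptrs, batch,
//       static_cast<std::size_t>(m) * n * sizeof(float));
```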
PiperOrigin-RevId: 663686539
This simplifies the lowering logic and means that we don't take a performance hit when exporting with shape polymorphism.
PiperOrigin-RevId: 662945116