48 Commits

Peter Hawkins
3f91b4b43a Move jaxlib/{cuda,rocm}_plugin_extension into jaxlib/{cuda,rocm}/
Move the common jaxlib/gpu_plugin_extension into jaxlib/gpu/

Cleanup only, no functional changes intended.

PiperOrigin-RevId: 738183402
2025-03-18 16:29:37 -07:00
Dan Foreman-Mackey
5bc17f7ec3 Remove the unused cu_cholesky_update kernel in favor of the FFI version.
This kernel was never allowed in export, so no backwards-compatibility period is required. Even so, the FFI kernels were added 6 months ago.

PiperOrigin-RevId: 724359996
2025-02-07 08:48:15 -08:00
Dan Foreman-Mackey
c6e83903de Update RNN kernels to use FFI.
PiperOrigin-RevId: 724151647
2025-02-06 18:27:58 -08:00
Dan Foreman-Mackey
5e915d3307 Update the sparse GPU kernels in jaxlib to use the FFI.
Unlike the other more detailed ports, this version doesn't take full advantage of the features provided by the FFI. For example, it would be possible to update the kernels to use the ScratchAllocator instead of querying the workspace size during lowering. However, since these kernels are really only meant to be experimental, it's not obvious to me that it's worth the extra work to do anything more sophisticated.

PiperOrigin-RevId: 724016331
2025-02-06 11:45:57 -08:00
jax authors
41993fdb24 Merge pull request #25755 from ROCm:ci_rnn_final-upstream
PiperOrigin-RevId: 715856939
2025-01-15 10:40:54 -08:00
Ruturaj4
fe68eb8b25 [ROCm] Implement RNN support 2025-01-14 19:04:49 -06:00
Peter Hawkins
91ffb640a8 Use thread-safe initialization of LAPACK kernels.
Use absl::call_once instead of a GIL-protected global initialization.

In passing, also remove an unused function.
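
A minimal sketch of the pattern described above (the function names are hypothetical, not the actual jaxlib symbols): absl::call_once guarantees the registration body runs exactly once, even under concurrent calls, without relying on the GIL.

    #include "absl/base/call_once.h"

    void RegisterLapackKernels();  // hypothetical registration routine defined elsewhere

    // Runs the registration exactly once, even when called from multiple threads.
    void EnsureLapackKernelsInitialized() {
      static absl::once_flag once;
      absl::call_once(once, [] { RegisterLapackKernels(); });
    }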

PiperOrigin-RevId: 714892175
2025-01-13 02:51:38 -08:00
Peter Hawkins
90d8f37863 Rename pybind_extension to nanobind_extension.
We have no remaining uses of pybind11 outside a GPU custom call example.

PiperOrigin-RevId: 712608834
2025-01-06 11:53:44 -08:00
Dan Foreman-Mackey
ccb331707e Add a GPU implementation of lax.linalg.eig.
This feature has been in the queue for a long time (see https://github.com/jax-ml/jax/issues/1259), and some folks have found that they can use `pure_callback` to call the CPU version as a workaround. It has recently come up that there can be issues when using `pure_callback` with JAX calls in the body (https://github.com/jax-ml/jax/issues/24255; this should be investigated separately).

This change adds a native solution for computing `lax.linalg.eig` on GPU. By default, this is implemented by calling LAPACK on host directly because this has good performance for small to moderately sized problems (less than about 2048^2). For larger matrices, a GPU-backed implementation based on [MAGMA](https://icl.utk.edu/magma/) can have significantly better performance. (I should note that I haven't done a huge amount of benchmarking yet, but this was the breakeven point used by PyTorch, and I find roughly similar behavior so far.)

We don't want to add MAGMA as a required dependency, but if a user has installed it, JAX can use it when the `jax_gpu_use_magma` configuration variable is set to `"on"`. By default, we try to dlopen `libmagma.so`, but the path to a non-standard installation location can be specified using the `JAX_GPU_MAGMA_PATH` environment variable.
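
A rough sketch of the lookup behavior described above (illustrative only, not the actual jaxlib loader): prefer the path given by JAX_GPU_MAGMA_PATH, and fall back to the default soname on the dynamic linker search path.

    #include <cstdlib>
    #include <dlfcn.h>

    // Returns a handle to the MAGMA shared library, or nullptr if it cannot be loaded.
    void* LoadMagma() {
      const char* path = std::getenv("JAX_GPU_MAGMA_PATH");
      if (path == nullptr || path[0] == '\0') path = "libmagma.so";
      return dlopen(path, RTLD_NOW | RTLD_LOCAL);
    }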

PiperOrigin-RevId: 697631402
2024-11-18 08:11:57 -08:00
jax authors
57ef7a4a59 Merge pull request #24274 from ROCm:ci_linalg_fix
PiperOrigin-RevId: 685717437
2024-10-14 08:33:33 -07:00
Ruturaj4
ee223d4004 [ROCm] jaxlib linalg fix 2024-10-13 20:25:18 -05:00
Dan Foreman-Mackey
a3bf75e442 Refactor gpusolver kernel definitions into separate build target.
There is a lot of boilerplate required for each new custom call to cuSolver / cuBLAS, and having both the FFI logic and the framework wrappers in the same library was getting unwieldy. This change adds a new "interface" target which just includes the shims to wrap cuSolver/BLAS functions, and then these are used from `solver_kernels_ffi` where the FFI logic lives.

PiperOrigin-RevId: 673832309
2024-09-12 07:11:36 -07:00
Peter Hawkins
45b871950e Fix a number of minor problems in the ROCM build.
Change in preparation for adding more presubmits for AMD ROCM.

PiperOrigin-RevId: 667766343
2024-08-26 17:04:01 -07:00
jax authors
6a2a96c3b8 Merge pull request #23166 from ROCm:ci_bazel_build
PiperOrigin-RevId: 665935164
2024-08-21 10:27:51 -07:00
Ruturaj4
c41d644886 [ROCm] Fix bazel build issue 2024-08-21 08:40:54 -05:00
Dan Foreman-Mackey
bd90968a25 Port the GPU Cholesky update custom call to the FFI.
PiperOrigin-RevId: 665319689
2024-08-20 05:46:03 -07:00
Dan Foreman-Mackey
71a93d0c87 Port QR factorization GPU kernel to FFI.
The biggest change here is that we now ignore the `info` parameter that is returned by `getrf`. In the previous implementation, we would return an error in the batched implementation, or set the relevant matrix entries to NaN in the non-batched version if `info != 0`. But, since info is only used for shape checking (see LAPACK, cuBLAS and cuSolver docs), I argue that we will never see `info != 0`, because we're including all the shape checks in the kernel already.

PiperOrigin-RevId: 665307128
2024-08-20 05:07:04 -07:00
Dan Foreman-Mackey
b6306e3953 Remove synchronization from GPU LU decomposition kernel by adding an async batch pointers builder.
In the batched LU decomposition in cuBLAS, the output buffer is required to be a pointer of pointers to the appropriate batch matrices. Previously this reshaping was done on the host and then copied to the device, requiring a synchronization, but it seems straightforward to instead implement a tiny CUDA kernel to do this work. This definitely isn't a bottleneck or a high priority change, but this seemed like a reasonable time to fix a longstanding TODO.
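
A minimal sketch of such a kernel (names and layout assumptions are illustrative, not the jaxlib implementation): each thread writes one pointer into the batch-pointer array, assuming the batch matrices are laid out contiguously with a fixed stride, and the launch happens on the same stream as the decomposition so no host synchronization is needed.

    #include <cstddef>
    #include <cuda_runtime.h>

    // Fills ptrs_out[i] with the address of the i-th batch matrix, assuming the
    // matrices start at `base` and are spaced `bytes_per_matrix` apart.
    __global__ void MakeBatchPointers(char* base, void** ptrs_out, int batch,
                                      size_t bytes_per_matrix) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < batch) ptrs_out[i] = base + static_cast<size_t>(i) * bytes_per_matrix;
    }

    // Enqueued on the caller's stream before the batched cuBLAS routine.
    void LaunchMakeBatchPointers(cudaStream_t stream, char* base, void** ptrs_out,
                                 int batch, size_t bytes_per_matrix) {
      const int block = 256;
      const int grid = (batch + block - 1) / block;
      MakeBatchPointers<<<grid, block, 0, stream>>>(base, ptrs_out, batch,
                                                    bytes_per_matrix);
    }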

PiperOrigin-RevId: 663686539
2024-08-16 04:37:09 -07:00
Dan Foreman-Mackey
ad1bd38790 Move logic about when to dispatch to batched LU decomposition algorithm on GPU into the kernel.
This simplifies the lowering logic, and means that we don't get hit with a performance penalty when exporting with shape polymorphism.

PiperOrigin-RevId: 662945116
2024-08-14 09:20:40 -07:00
Ruturaj4
a2d79936df [ROCM] Fix BUILD.bazel library source paths 2024-08-07 09:18:20 -05:00
Dan Foreman-Mackey
8df0c3a9cc Port Getrf GPU kernel from custom call to FFI.
PiperOrigin-RevId: 658550170
2024-08-01 15:02:25 -07:00
Dan Foreman-Mackey
f20efc630f Move jaxlib GPU handlers to separate build target.
In anticipation of refactoring the jaxlib GPU custom calls into FFI calls, this change moves the implementation of `BlasHandlePool`, `SolverHandlePool`, and `SpSolverHandlePool` into a new target.

PiperOrigin-RevId: 658497960
2024-08-01 12:30:04 -07:00
Dan Foreman-Mackey
4f394828e1 Fix C++ registration of FFI handlers and consolidate gpu/linalg kernel implementation.
This change does a few things (arguably too many):

1. The key change here is that it fixes the handler registration in `jaxlib/gpu/gpu_kernels.cc` for the two handlers that use the XLA FFI API. A previous attempt at this change caused downstream issues because of duplicate registrations, but we were able to fix that directly in XLA.

2. A second related change is to declare and define the XLA FFI handlers consistently using the `XLA_FFI_DECLARE_HANDLER_SYMBOL` and `XLA_FFI_DEFINE_HANDLER_SYMBOL` macros. We need to use these macros instead of the `XLA_FFI_DEFINE_HANDLER` version, which produces a lambda, so that when XLA checks the address of the handler during registration it is consistent (see the sketch after this list). Without this change, the downstream tests would continue to fail.

3. The final change is to consolidate the `cholesky_update_kernel` and `lu_pivot_kernels` implementations into a common `linalg_kernels` target. This makes the implementation of the `_linalg` nanobind module consistent with the other targets within `jaxlib/gpu`, and (I think!) makes the details easier to follow. This last change is less urgent, but it was what I set out to do, so I'm suggesting them all together; I can split this in two if that would be preferred.
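
A hedged sketch of the declare/define pattern from point 2 (the handler name, implementation, and argument bindings here are all made up, and the exact Bind() specification depends on the kernel):

    #include "xla/ffi/api/ffi.h"

    namespace ffi = xla::ffi;

    // Illustrative implementation function; a real kernel would launch work and
    // validate shapes here.
    static ffi::Error ExampleKernelImpl(ffi::AnyBuffer input,
                                        ffi::Result<ffi::AnyBuffer> output) {
      return ffi::Error::Success();
    }

    // In a header: declare the handler symbol so other translation units (e.g.
    // gpu_kernels.cc) see the same address that the module registers.
    XLA_FFI_DECLARE_HANDLER_SYMBOL(kExampleKernel);

    // In the .cc file: define the symbol. Unlike XLA_FFI_DEFINE_HANDLER, which
    // produces a lambda, the *_SYMBOL macro yields a single named handler whose
    // address is the same at every registration site.
    XLA_FFI_DEFINE_HANDLER_SYMBOL(kExampleKernel, ExampleKernelImpl,
                                  ffi::Ffi::Bind()
                                      .Arg<ffi::AnyBuffer>()
                                      .Ret<ffi::AnyBuffer>());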

PiperOrigin-RevId: 651107659
2024-07-10 12:09:12 -07:00
Ruturaj4
58b658cfb8 [ROCM] add typed XLA FFI support in rocm-specific code 2024-07-02 11:04:43 -05:00
Dan Foreman-Mackey
9ae1c56c44 Update lu_pivots_to_permutation to use FFI dimensions on GPU.
The XLA FFI interface provides metadata about buffer dimensions, so quantities like batch dimensions can be evaluated on the backend instead of being passed as attributes. This change has the added benefit of allowing this FFI call to support "vectorized" vmap and dynamic shapes.
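
As a sketch of what this enables (a hypothetical helper, assuming the FFI buffer type exposes a dimensions() accessor), the batch size can be recomputed from the operand's shape inside the handler rather than serialized as an attribute at lowering time:

    #include <cstddef>
    #include <cstdint>
    #include "xla/ffi/api/ffi.h"

    namespace ffi = xla::ffi;

    // All leading dimensions of the pivots operand are batch dimensions; only
    // the trailing dimension holds the pivot indices themselves.
    static std::int64_t BatchSize(const ffi::AnyBuffer& pivots) {
      auto dims = pivots.dimensions();
      std::int64_t batch = 1;
      for (std::size_t i = 0; i + 1 < dims.size(); ++i) batch *= dims[i];
      return batch;
    }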

PiperOrigin-RevId: 647343656
2024-06-27 09:27:15 -07:00
Ruturaj4
a00d030248 [ROCM] nits and fixes 2024-06-18 20:21:23 +00:00
Ruturaj4
99c2b7b4e9 [ROCm] Bring up PJRT support 2024-06-17 16:49:22 +00:00
Ruturaj4
79fccf6c82 add cholesky changes in bazel 2024-05-18 00:37:09 +00:00
Sergei Lebedev
51fc4f85ad Ported LuPivotsToPermutation to the typed XLA FFI
The typed FFI

* allows passing custom call attributes directly to backend_config= instead
  of serializing them into a C++ struct, and
* handles validation and deserialization of custom call operands.

PiperOrigin-RevId: 630067005
2024-05-02 08:12:05 -07:00
Ruturaj4
97bf2d2bb8 [ROCm]: fix tsl path 2024-04-08 19:58:41 -05:00
Rahul Batra
b4b97cd8e8 [ROCm]: Add jax-triton support for ROCm 2023-10-18 07:09:20 +00:00
Rahul Batra
4091ac646c [ROCm]: Fix duplicate deps include 2023-09-08 22:56:59 +00:00
Peter Hawkins
70b7d50181 Switch jaxlib to use nanobind instead of pybind11.
nanobind has a number of advantages (https://nanobind.readthedocs.io/en/latest/why.html), notably speed of compilation and dispatch, but the main reason to do this for these bindings is that nanobind can target the Python Stable ABI starting with Python 3.12. This means that we will not need to ship per-Python-version CUDA plugins starting with Python 3.12.
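
For context, a minimal nanobind extension looks like the sketch below (module and function names are made up); NB_MODULE is nanobind's analogue of pybind11's PYBIND11_MODULE entry point.

    #include <nanobind/nanobind.h>

    namespace nb = nanobind;

    // Hypothetical module exposing a single function to Python. With nanobind
    // and Python >= 3.12, an extension like this can be built against the
    // Stable ABI so one binary serves several Python versions.
    NB_MODULE(_example_ext, m) {
      m.def("add", [](int a, int b) { return a + b; }, "Add two integers.");
    }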

PiperOrigin-RevId: 559898790
2023-08-24 16:07:56 -07:00
Chris Jones
6b13d4eb86 Add branch prediction to JAX status macros.
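
The title refers to the common pattern of annotating the error path as unlikely; a hypothetical macro in this style (not necessarily the exact jaxlib macro) looks like:

    #include "absl/base/optimization.h"
    #include "absl/status/status.h"

    // Hypothetical status-check macro: the error branch is marked as unlikely so
    // the compiler lays out the happy path without a taken branch.
    #define RETURN_IF_ERROR_EXAMPLE(expr)                      \
      do {                                                     \
        absl::Status status_ = (expr);                         \
        if (ABSL_PREDICT_FALSE(!status_.ok())) return status_; \
      } while (0)
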
PiperOrigin-RevId: 535233546
2023-05-25 06:23:23 -07:00
Peter Hawkins
172a831219 Switch JAX to use the OpenXLA repository. 2023-03-13 18:38:26 +00:00
Peter Hawkins
a852710a09 Merge CUDA and ROCM kernel code in jaxlib.
The code for both CUDA and ROCM is almost identical, so with a small shim library to handle the differences we can share almost everything.
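
The shim boils down to mapping vendor-neutral aliases onto either the CUDA or HIP API at compile time. A heavily simplified sketch (the guard macro name and the selection of aliases are illustrative; the real header covers far more of the API):

    // Vendor-neutral aliases selected at build time; shared kernel code is
    // written against the gpu* names only.
    #ifdef JAX_GPU_CUDA
    #include <cuda_runtime_api.h>
    typedef cudaStream_t gpuStream_t;
    typedef cudaError_t gpuError_t;
    #define gpuMemcpyAsync cudaMemcpyAsync
    #define gpuGetErrorString cudaGetErrorString
    #else  // ROCm / HIP
    #include <hip/hip_runtime.h>
    typedef hipStream_t gpuStream_t;
    typedef hipError_t gpuError_t;
    #define gpuMemcpyAsync hipMemcpyAsync
    #define gpuGetErrorString hipGetErrorString
    #endif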

PiperOrigin-RevId: 483666051
2022-10-25 07:23:34 -07:00
Peter Hawkins
5617a02fa4 Remove JAX custom call implementation of batched triangular solve.
XLA has supported batched triangular solve on GPU since February 2022, which predates the minimum jaxlib version. We can therefore delete our implementation and just use XLA's.

PiperOrigin-RevId: 482031830
2022-10-18 15:04:14 -07:00
Peter Hawkins
2693afa263 Revert: Use input-output aliasing for jaxlib GPU custom calls.
Previously we had no way to tell XLA that inputs and outputs of GPU custom calls must alias. This now works in XLA:GPU so we can just ask XLA to enforce the aliasing we need.

This seems to be causing some test failures downstream, so reverting this for the moment until I can debug them.

PiperOrigin-RevId: 479670565
2022-10-07 14:36:24 -07:00
Peter Hawkins
93b839ace4 Use input-output aliasing for jaxlib GPU custom calls.
Previously we had no way to tell XLA that inputs and outputs of GPU custom calls must alias. This now works in XLA:GPU so we can just ask XLA to enforce the aliasing we need.

PiperOrigin-RevId: 479642543
2022-10-07 12:22:04 -07:00
Rohit Santhanam
b815ac9d8e [ROCm] Upgrade to ROCm 5.3 and associated enhancements 2022-10-01 04:45:26 -07:00
Peter Hawkins
ba557d5e1b Change JAX's copyright attribution from "Google LLC" to "The JAX Authors.".
See https://opensource.google/documentation/reference/releasing/contributions#copyright for more details.

PiperOrigin-RevId: 476167538
2022-09-22 12:27:19 -07:00
Rohit Santhanam
82adc6a1d0 [ROCm] Enhance hipsparse to reach parity with cusparse based on commit d37b711dd4. 2022-08-21 22:46:00 +00:00
Rohit Santhanam
080cf47002 [ROCm] Fixes for compilation failures caused by compiler changes in the ROCm TensorFlow fork. 2022-06-29 14:34:08 +00:00
Peter Hawkins
aa7d291767 Replace references to absl::string_view with std::string_view.
PiperOrigin-RevId: 450768333
2022-05-24 14:21:32 -07:00
Rohit Santhanam
f9321f2536 Fix hipblas kernels for ROCm. 2022-05-19 14:47:21 +00:00
Peter Hawkins
bb0816227d Add a batched QR decomposition implementation on GPU.
PiperOrigin-RevId: 449583027
2022-05-18 14:50:18 -07:00
Rohit Santhanam
bbdcec84f8 Fixes to enable JAX to build on ROCm. 2022-05-06 22:57:51 +00:00
Peter Hawkins
08c3c2ec24 Split CUDA and HIP C++ code in jaxlib into separate directories.
PiperOrigin-RevId: 447062506
2022-05-06 13:48:00 -07:00