85 Commits

Author SHA1 Message Date
Dan Foreman-Mackey
5bc17f7ec3 Remove the unused cu_cholesky_update kernel in favor of the FFI version.
This kernel wasn't allowed in export, so no backwards-compatibility period is required. Even so, the FFI kernels were added six months ago.

PiperOrigin-RevId: 724359996
2025-02-07 08:48:15 -08:00
Dan Foreman-Mackey
c6e83903de Update RNN kernels to use FFI.
PiperOrigin-RevId: 724151647
2025-02-06 18:27:58 -08:00
Dan Foreman-Mackey
5e915d3307 Update the sparse GPU kernels in jaxlib to use the FFI.
Unlike the other more detailed ports, this version doesn't take full advantage of the features provided by the FFI. For example, it would be possible to update the kernels to use the ScratchAllocator instead of querying the workspace size during lowering. However, since these kernels are really only meant to be experimental, it's not obvious to me that it's worth the extra work to do anything more sophisticated.

PiperOrigin-RevId: 724016331
2025-02-06 11:45:57 -08:00
Michael Hudgins
2e808f2836 Merge pull request #26279 from MichaelHudgins:tsan-resultstore
PiperOrigin-RevId: 723918760
2025-02-06 14:55:57 +00:00
Peter Hawkins
034e967e11 Remove CUDA rpaths from jaxlib build.
These are also set in the TSL build rules as part of the CUDA stub libraries, which these libraries depend on, so these copies of the rpath settings are redundant.

PiperOrigin-RevId: 716844265
2025-01-17 17:09:30 -08:00
Peter Hawkins
91ffb640a8 Use thread-safe initialization of LAPACK kernels.
Use absl::call_once instead of a GIL-protected global initialization.

In passing, also remove an unused function.

PiperOrigin-RevId: 714892175
2025-01-13 02:51:38 -08:00
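The commit above replaces GIL-protected global initialization with absl::call_once, which guarantees the initializer runs exactly once even under concurrent calls. A minimal Python analogue of that once-only pattern (illustrative only; the jaxlib code is C++ and uses absl directly):

```python
import threading


class Once:
    # Rough analogue of absl::call_once: the callable runs exactly once,
    # even when invoked concurrently from many threads, without relying
    # on any global interpreter lock for correctness.
    def __init__(self):
        self._lock = threading.Lock()
        self._done = False

    def call(self, fn):
        if self._done:  # fast path once initialization has completed
            return
        with self._lock:
            if not self._done:
                fn()
                self._done = True
```

The double-checked `_done` flag avoids taking the lock on every call after initialization, which is the same property that makes call_once cheap on the hot path.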
Peter Hawkins
90d8f37863 Rename pybind_extension to nanobind_extension.
We have no remaining uses of pybind11 outside a GPU custom call example.

PiperOrigin-RevId: 712608834
2025-01-06 11:53:44 -08:00
Dan Foreman-Mackey
ccb331707e Add a GPU implementation of lax.linalg.eig.
This feature has been in the queue for a long time (see https://github.com/jax-ml/jax/issues/1259), and some folks have found that they can use `pure_callback` to call the CPU version as a workaround. It has recently come up that there can be issues when using `pure_callback` with JAX calls in the body (https://github.com/jax-ml/jax/issues/24255; this should be investigated separately).

This change adds a native solution for computing `lax.linalg.eig` on GPU. By default, this is implemented by calling LAPACK on host directly because this has good performance for small to moderately sized problems (less than about 2048^2). For larger matrices, a GPU-backed implementation based on [MAGMA](https://icl.utk.edu/magma/) can have significantly better performance. (I should note that I haven't done a huge amount of benchmarking yet, but this was the breakeven point used by PyTorch, and I find roughly similar behavior so far.)

We don't want to add MAGMA as a required dependency, but if a user has installed it, JAX can use it when the `jax_gpu_use_magma` configuration variable is set to `"on"`. By default, we try to dlopen `libmagma.so`, but the path to a non-standard installation location can be specified using the `JAX_GPU_MAGMA_PATH` environment variable.

PiperOrigin-RevId: 697631402
2024-11-18 08:11:57 -08:00
Dan Foreman-Mackey
a3bf75e442 Refactor gpusolver kernel definitions into separate build target.
There is a lot of boilerplate required for each new custom call to cuSolver / cuBLAS, and having both the FFI logic and the framework wrappers in the same library was getting unwieldy. This change adds a new "interface" target which just includes the shims to wrap cuSolver/BLAS functions, and then these are used from `solver_kernels_ffi` where the FFI logic lives.

PiperOrigin-RevId: 673832309
2024-09-12 07:11:36 -07:00
jax authors
f97bfc85a3 Implement symmetric_product() to produce a symmetric matrix: C = alpha * X @ X.T + beta * C
PiperOrigin-RevId: 671845818
2024-09-06 11:58:20 -07:00
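The update the commit describes is a rank-k symmetric update. A minimal NumPy reference for the semantics (not the GPU kernel itself):

```python
import numpy as np


def symmetric_product(alpha, x, beta, c):
    # Reference semantics of the kernel: C = alpha * X @ X.T + beta * C.
    # X @ X.T is symmetric by construction, so the result is symmetric
    # whenever C is.
    return alpha * (x @ x.T) + beta * c
```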
Peter Hawkins
45b871950e Fix a number of minor problems in the ROCM build.
Change in preparation for adding more presubmits for AMD ROCM.

PiperOrigin-RevId: 667766343
2024-08-26 17:04:01 -07:00
Peter Hawkins
6d1f51e63d Clean up BUILD files.
PiperOrigin-RevId: 667604964
2024-08-26 09:11:17 -07:00
Dan Foreman-Mackey
bd90968a25 Port the GPU Cholesky update custom call to the FFI.
PiperOrigin-RevId: 665319689
2024-08-20 05:46:03 -07:00
Dan Foreman-Mackey
71a93d0c87 Port QR factorization GPU kernel to FFI.
The biggest change here is that we now ignore the `info` parameter that is returned by `getrf`. In the previous implementation, we would return an error in the batched implementation, or set the relevant matrix entries to NaN in the non-batched version if `info != 0`. But, since info is only used for shape checking (see LAPACK, cuBLAS and cuSolver docs), I argue that we will never see `info != 0`, because we're including all the shape checks in the kernel already.

PiperOrigin-RevId: 665307128
2024-08-20 05:07:04 -07:00
Dan Foreman-Mackey
b6306e3953 Remove synchronization from GPU LU decomposition kernel by adding an async batch pointers builder.
In the batched LU decomposition in cuBLAS, the output buffer is required to be a pointer of pointers to the appropriate batch matrices. Previously this reshaping was done on the host and then copied to the device, requiring a synchronization, but it seems straightforward to instead implement a tiny CUDA kernel to do this work. This definitely isn't a bottleneck or a high priority change, but this seemed like a reasonable time to fix a longstanding TODO.

PiperOrigin-RevId: 663686539
2024-08-16 04:37:09 -07:00
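The pointer-of-pointers array that the tiny CUDA kernel builds is plain address arithmetic over a contiguous batch of matrices. A host-side Python sketch of what each device thread computes (names and layout assumptions are illustrative):

```python
def batch_pointers(base_addr, batch, rows, cols, itemsize):
    # For a contiguous (batch, rows, cols) buffer starting at base_addr,
    # entry i is the address of the start of the i-th matrix. On device,
    # thread i would compute exactly one of these entries.
    stride = rows * cols * itemsize
    return [base_addr + i * stride for i in range(batch)]
```

Doing this on device removes the host-side reshape plus host-to-device copy, and with it the synchronization the commit message mentions.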
Dan Foreman-Mackey
ad1bd38790 Move logic about when to dispatch to batched LU decomposition algorithm on GPU into the kernel.
This simplifies the lowering logic, and means that we don't get hit with a performance penalty when exporting with shape polymorphism.

PiperOrigin-RevId: 662945116
2024-08-14 09:20:40 -07:00
Dan Foreman-Mackey
8df0c3a9cc Port Getrf GPU kernel from custom call to FFI.
PiperOrigin-RevId: 658550170
2024-08-01 15:02:25 -07:00
Dan Foreman-Mackey
f20efc630f Move jaxlib GPU handlers to separate build target.
In anticipation of refactoring the jaxlib GPU custom calls into FFI calls, this change moves the implementation of `BlasHandlePool`, `SolverHandlePool`, and `SpSolverHandlePool` into a new target.

PiperOrigin-RevId: 658497960
2024-08-01 12:30:04 -07:00
Dan Foreman-Mackey
33a9db3943 Move FFI helper macros from jaxlib/cpu/lapack_kernels.cc to a jaxlib/ffi_helpers.h.
Some of the macros that were used in jaxlib's FFI calls to LAPACK turned out to
be useful for other FFI calls. This change consolidates these macros in the
ffi_helpers header.

PiperOrigin-RevId: 651166306
2024-07-10 15:09:45 -07:00
Dan Foreman-Mackey
4f394828e1 Fix C++ registration of FFI handlers and consolidate gpu/linalg kernel implementation.
This change does a few things (arguably too many):

1. The key change here is that it fixes the handler registration in `jaxlib/gpu/gpu_kernels.cc` for the two handlers that use the XLA FFI API. A previous attempt at this change caused downstream issues because of duplicate registrations, but we were able to fix that directly in XLA.

2. A second related change is to declare and define the XLA FFI handlers consistently using the `XLA_FFI_DECLARE_HANDLER_SYMBOL` and `XLA_FFI_DEFINE_HANDLER_SYMBOL` macros. We need to use these macros instead of the `XLA_FFI_DEFINE_HANDLER` version which produces a lambda, so that when XLA checks the address of the handler during registration it is consistent. Without this change, the downstream tests would continue to fail.

3. The final change is to consolidate the `cholesky_update_kernel` and `lu_pivot_kernels` implementations into a common `linalg_kernels` target. This makes the implementation of the `_linalg` nanobind module consistent with the other targets within `jaxlib/gpu`, and (I think!) makes the details easier to follow. This last change is less urgent, but it is what I originally set out to do, which is why I'm proposing all three together; I can split this in two if that would be preferred.

PiperOrigin-RevId: 651107659
2024-07-10 12:09:12 -07:00
George Necula
cbe524298c Ported threefry2x32 for GPU to the typed XLA FFI
This allows lowering of threefry2x32 for GPU even on a machine without GPUs.

For the next 3 weeks, we use the new custom call implementation only if
we are not in "export" mode and a new jaxlib is in use.

PiperOrigin-RevId: 647657084
2024-06-28 06:24:44 -07:00
Dan Foreman-Mackey
9ae1c56c44 Update lu_pivots_to_permutation to use FFI dimensions on GPU.
The XLA FFI interface provides metadata about buffer dimensions, so quantities
like batch dimensions can be evaluated on the backend, instead of passed as
attributes. This change has the added benefit of allowing this FFI call to
support "vectorized" vmap and dynamic shapes.

PiperOrigin-RevId: 647343656
2024-06-27 09:27:15 -07:00
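For reference, the computation this FFI call performs converts LAPACK-style row-swap pivots into a full permutation. A pure-Python sketch of the semantics (assuming zero-based pivots; the real kernel runs batched on GPU):

```python
def lu_pivots_to_permutation(pivots, permutation_size):
    # Start from the identity permutation and apply the row swaps recorded
    # by the LU factorization in order: pivots[i] means row i was swapped
    # with row pivots[i].
    perm = list(range(permutation_size))
    for i, p in enumerate(pivots):
        perm[i], perm[p] = perm[p], perm[i]
    return perm
```

Because the FFI reports buffer dimensions itself, batch and permutation sizes no longer need to be baked in as lowering-time attributes, which is what enables vmap and dynamic shapes here.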
Thomas Köppe
cd93b46df4 Add initialization annotations (for the benefit of MSAN) to variables that are initialized by external functions.
PiperOrigin-RevId: 641879836
2024-06-10 06:21:16 -07:00
Adam Paszke
cfe64cd5ce [Mosaic GPU] Integrate the ExecutionEngine with the jaxlib GPU plugin
This lets us avoid bundling yet another copy of LLVM with JAX packages,
which means we can finally start building Mosaic GPU by default.

PiperOrigin-RevId: 638569750
2024-05-30 01:46:23 -07:00
jax authors
e8b06ccf56 Cholesky rank-1 update kernel for JAX.
PiperOrigin-RevId: 633722940
2024-05-14 15:21:38 -07:00
Sergei Lebedev
51fc4f85ad Ported LuPivotsToPermutation to the typed XLA FFI
The typed FFI

* allows passing custom call attributes directly to backend_config= instead
  of serializing them into a C++ struct, and
* handles validation and deserialization of custom call operands.

PiperOrigin-RevId: 630067005
2024-05-02 08:12:05 -07:00
Adam Paszke
9b0319512a [Mosaic GPU] Use a custom TMA descriptor initialization method
The one bundled with the default MLIR runtime was convenient, but it is also
impractical. It allocates memory (which can deadlock due to NCCL), does a
synchronous host-to-device copy and then leaks the descriptor after the kernel...

With this change, we use our own runtime function to create all the descriptors.
What's more, we pack them all into a single buffer so that a single asynchronous
copy is sufficient. Finally, we use a scratch output to allocate the scratch buffer,
letting us lean on XLA:GPU for memory management.

PiperOrigin-RevId: 628430358
2024-04-26 09:40:47 -07:00
Marvin Kim
90e9e47a55 [Jax/Triton] Skip benchmarking while autotuning for configs that cannot be launched.
Configs that cannot be launched should not be launched during benchmarking either.

PiperOrigin-RevId: 626153377
2024-04-18 14:35:51 -07:00
Jieying Luo
44e83d4e0a Add a few custom call registrations to gpu_kernel to keep in-sync with callers of xla_client.register_custom_call_target.
PiperOrigin-RevId: 624275186
2024-04-12 13:30:18 -07:00
Henning Becker
9809aa1929 Move CUDA specific functions from asm_compiler to cuda_asm_compiler target
This avoids:
- a forward declaration of `GpuContext`
- the `:asm_compiler_header` header only target

The moved code is unchanged; I just moved it from one
file to another and fixed up includes and dependencies.

Note that this is adding just another `#ifdef` to the redzone allocator code. I will clean this up in a subsequent change.

PiperOrigin-RevId: 623285804
2024-04-09 14:43:41 -07:00
Olli Lupton
c97d955771 cuInit before querying compute capability 2024-04-04 15:27:57 +00:00
David Dunleavy
aade591fdf Move tsl/python to xla/tsl/python
PiperOrigin-RevId: 620320903
2024-03-29 13:15:21 -07:00
Michael Hudgins
023930decf Fix some load orderings for buildifier
PiperOrigin-RevId: 619575196
2024-03-27 10:28:57 -07:00
Meekail Zain
9fff9aeb69 Update 2024-03-03 19:57:26 +00:00
David Dunleavy
be3e39ad3b Move tsl/cuda to xla/tsl/cuda
PiperOrigin-RevId: 610550833
2024-02-26 15:45:10 -08:00
Peter Hawkins
a999120514 Improve error message when cudnn is not found.
We infer a missing cudnn if cudnnGetVersion() returns 0, since the stub implementation in TSL will do that if the library isn't found (10a378f499/third_party/tsl/tsl/cuda/cudnn_stub.cc (L58)).

PiperOrigin-RevId: 587056454
2023-12-01 10:52:48 -08:00
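The heuristic described above can be sketched as follows, where `get_version` stands in for the real cudnnGetVersion binding (names are illustrative, not the jaxlib API):

```python
def check_cudnn(get_version):
    # The TSL stub returns 0 from cudnnGetVersion() when the library could
    # not be dlopen'ed, so a zero version means "cuDNN not found" rather
    # than "very old cuDNN", and we can report that distinctly.
    version = get_version()
    if version == 0:
        raise RuntimeError(
            "Unable to load cuDNN. Is it installed and on the loader path?")
    return version
```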
Peter Hawkins
41f0b336e3 Add minimum version checks for cublas and cusparse.
Split the code that determines CUDA library versions out of the py_extension() module and into a cc_library(), because this fixes a linking problem in Google's build. (Long story, not worth it.)

Fixes https://github.com/google/jax/issues/8289

PiperOrigin-RevId: 583544218
2023-11-17 19:30:41 -08:00
jax authors
88fe0da6d1 Merge pull request #18078 from ROCmSoftwarePlatform:rocm-jax-triton
PiperOrigin-RevId: 574546618
2023-10-18 11:56:01 -07:00
Jieying Luo
7478fbcfd5 [PJRT C API] Add "cuda_plugin_extension" to "gpu_only_test_deps" to support bazel test for GPU plugin.
PiperOrigin-RevId: 573251982
2023-10-13 10:12:16 -07:00
Peter Hawkins
2eca5b34b3 Add a compile-time version test that verifies CUDA is version 11.8 or newer.
Issue https://github.com/google/jax/issues/17829

PiperOrigin-RevId: 569302585
2023-09-28 15:14:04 -07:00
Peter Hawkins
53845615ff Disable nanobind leak checker in cuda/versions module.
The leak checker appears to be sensitive to the destruction order during Python shutdown.

PiperOrigin-RevId: 568962933
2023-09-27 14:43:20 -07:00
Peter Hawkins
9404518201 [CUDA] Add code to jax initialization that verifies that the CUDA libraries that are found are at least as new as the versions against which JAX was built.
This is intended to flag cases where the wrong CUDA libraries are used, either because:
* the user self-installed CUDA and that installation is too old, or
* the user used the pip package installation, but due to LD_LIBRARY_PATH overrides or similar we didn't end up using the pip-installed version.

PiperOrigin-RevId: 568910422
2023-09-27 11:28:40 -07:00
Peter Hawkins
8c70288b83 Refer to CUDA stubs directly from TSL, rather than using an alias defined in xla/stream_executor.
Remove the aliases in xla/stream_executor.

PiperOrigin-RevId: 567025507
2023-09-20 11:21:54 -07:00
Peter Hawkins
70b7d50181 Switch jaxlib to use nanobind instead of pybind11.
nanobind has a number of advantages (https://nanobind.readthedocs.io/en/latest/why.html), notably speed of compilation and dispatch, but the main reason to do this for these bindings is because nanobind can target the Python Stable ABI starting with Python 3.12. This means that we will not need to ship per-Python version CUDA plugins starting with Python 3.12.

PiperOrigin-RevId: 559898790
2023-08-24 16:07:56 -07:00
Richard Levasseur
f891cbf64b Load Python rules from rules_python
PiperOrigin-RevId: 559789250
2023-08-24 10:22:57 -07:00
Chris Jones
f70f1f8006 Internal change.
PiperOrigin-RevId: 559053761
2023-08-22 03:05:17 -07:00
Chris Jones
4ac2bdc2b1 [jax_triton] Add user-specified name field to serialized format.
PiperOrigin-RevId: 557415723
2023-08-16 02:53:51 -07:00
Chris Jones
9935445d57 [jax_triton] Simplify auto-tuning code.
PiperOrigin-RevId: 545733541
2023-07-05 11:18:18 -07:00
Chris Jones
31b862dd56 [jax_triton] Split C++ only parts of Triton custom callback from Python parts.
Register the callback with the default call target name from C++, enabling Triton calls with the default name to work in C++-only contexts (e.g. serving).

PiperOrigin-RevId: 545211452
2023-07-03 06:52:32 -07:00
Chris Jones
d4e2464340 [jax_triton] Expose Triton custom call callback in header file.
This allows users to register the callback from C++ when not using the default call target name.

PiperOrigin-RevId: 544029098
2023-06-28 05:32:02 -07:00