I mistakenly checked for `amount + 1` instead of `amount * 2`. It initially
seemed right because both expressions evaluate to 2 when `amount` is 1 :)
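For illustration, a tiny check that would have caught this (the function name is
hypothetical, sketch only):

```python
# `amount + 1` and `amount * 2` only coincide at amount == 1, so checking a second
# value exposes the bug.
def expected_growth(amount: int) -> int:
  return amount * 2  # the intended check, not `amount + 1`

assert expected_growth(1) == 2  # both expressions agree here, which hid the bug
assert expected_growth(2) == 4  # `amount + 1` would give 3 and fail this case
```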
PiperOrigin-RevId: 670527107
Regular conversion instructions have a ridiculously low throughput on Hopper,
so replacing them with some bit tricks yields a much faster implementation.
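As a hedged illustration of the kind of bit trick meant here (the magic-number
integer-to-float conversion below is an assumption about the technique, not the
exact trick used in this change):

```python
import numpy as np

# Classic "magic number" conversion: OR a small unsigned integer into the mantissa
# of 2**23 and subtract that constant, avoiding a hardware convert instruction.
def u16_to_f32_bit_trick(x: np.ndarray) -> np.ndarray:
  assert x.dtype == np.uint16
  bits = np.uint32(0x4B000000) | x.astype(np.uint32)  # 0x4B000000 encodes 2.0**23
  return bits.view(np.float32) - np.float32(2.0**23)

x = np.arange(1000, dtype=np.uint16)
np.testing.assert_array_equal(u16_to_f32_bit_trick(x), x.astype(np.float32))
```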
Co-authored-by: Benjamin Chetioui <bchetioui@google.com>
PiperOrigin-RevId: 665893696
While CUDA technically does not guarantee anything about the order in
which blocks will be executed, in practice they are generally scheduled
in column-major order within the grid. We can use this property to launch
the blocks in a tiled order, which can improve the L2 hit rate and yield a
significant performance boost.
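A hedged sketch of this kind of remapping (the grouping scheme and names are
illustrative, in the spirit of the common "grouped" matmul launch order, not the
exact Mosaic GPU logic):

```python
def tiled_block_coords(linear_idx: int, grid_m: int, grid_n: int, group_m: int = 8):
  """Maps a linear block index to (m, n) so consecutive blocks stay in a small row group."""
  blocks_per_group = group_m * grid_n
  group = linear_idx // blocks_per_group
  first_m = group * group_m
  group_size = min(grid_m - first_m, group_m)  # the last group may be smaller
  m = first_m + (linear_idx % blocks_per_group) % group_size
  n = (linear_idx % blocks_per_group) // group_size
  return m, n

# Consecutive indices now sweep a tile that is `group_m` rows tall before moving on,
# so blocks scheduled close together load overlapping operands and hit in L2 more often.
coords = sorted(tiled_block_coords(i, 16, 16) for i in range(16 * 16))
assert coords == [(m, n) for m in range(16) for n in range(16)]  # still a bijection
```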
PiperOrigin-RevId: 662834982
We've been generating thousands of test cases and that's just not
scalable. Hypothesis should let us efficiently explore a large
number of configurations.
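For example, a sketch of what that could look like (the strategies, parameter names,
and helper below are hypothetical, not the actual test suite):

```python
from hypothesis import given, strategies as st

@given(
    block_m=st.sampled_from([64, 128, 256]),
    block_n=st.sampled_from([64, 128, 256]),
    num_stages=st.integers(min_value=2, max_value=6),
)
def test_kernel_configs(block_m, block_n, num_stages):
  # Hypothesis searches this configuration space and shrinks failures to a
  # minimal example, instead of us enumerating thousands of cases by hand.
  run_kernel_and_compare_to_reference(block_m, block_n, num_stages)  # hypothetical helper
```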
PiperOrigin-RevId: 662447113
Each TMA only writes to a contiguous subset of SMEM, so skipping a major
dimension while splitting results in incorrect code. To work around the
loss of flexibility, we now allow splitting multiple leading dimensions
to handle larger clusters and tiled references.
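The contiguity constraint can be seen with plain numpy (an illustration only; real
Mosaic GPU references are not numpy arrays):

```python
import numpy as np

buf = np.zeros((4, 128), dtype=np.float32)  # a row-major stand-in for an SMEM buffer
assert buf[:2, :].flags.c_contiguous        # splitting the major dimension: contiguous
assert not buf[:, :64].flags.c_contiguous   # skipping it and splitting a minor one: strided
```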
PiperOrigin-RevId: 655700486
In particular, test trivial collectives (over singleton cluster axes), collectives
over more than 2 devices, and clusters larger than 8 devices. This uncovered a few
more bugs in the implementation.
PiperOrigin-RevId: 655686102
This is slightly less convenient than our previous approach but it has two main upsides:
1. It lets us automatically emit necessary fences and barriers for use with block clusters
2. It lets us share the same block/cluster barrier for all initializations of mbarriers
This change also moves away from the nvgpu dialect for barriers and allocates them in
dynamic SMEM instead of relying on static SMEM. This should give us more control over
SMEM layouts and alignments, and it simplifies the lowering process.
PiperOrigin-RevId: 655493451
Memory barriers are necessary to prevent excessive run-ahead in a collective
pipeline, but the implementation can be tricky (both in terms of calculating
the right arrival count and dividing the signalling responsibility between
threads). I largely tried to follow the practices that CUTLASS established,
although I still do not understand why it swizzles the cluster for signalling.
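As a hedged arithmetic sketch of the bookkeeping involved (the single-leader-thread
scheme below is an illustrative assumption, not the actual Mosaic GPU or CUTLASS
policy):

```python
def expected_arrival_count(signalling_threads_per_block: int, cluster_size: int) -> int:
  # The barrier gating a pipeline stage must expect one arrival per signalling agent,
  # summed over every block in the cluster that participates in the collective.
  return signalling_threads_per_block * cluster_size

# E.g. if only one leader thread per block signals, a cluster of 2 blocks needs an
# arrival count of 2 before the waiting threads may proceed.
assert expected_arrival_count(1, 2) == 2
```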
PiperOrigin-RevId: 655098234
Instead of asking the user to compute the transfer size, manually slice up the
transfer, and compute and specify the multicast mask, we fold all of that functionality
into the `async_copy` function. The copy should be called by all blocks in a given
cluster slice along the specified dimension, and will collectively load all the
requested data into all blocks in that slice.
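A hedged usage sketch of what this looks like from the caller's side (the keyword
names and the `ctx`, `gpu`, and ref objects are illustrative assumptions, not the
verified signature):

```python
# Every block along the chosen cluster dimension issues the same call; the slicing,
# transfer-size computation, and multicast mask are derived inside `async_copy`.
ctx.async_copy(
    src_ref=a_gmem,              # source reference in GMEM
    dst_ref=a_smem,              # destination reference in SMEM
    barrier=barrier,             # completion is signalled on this mbarrier
    collective=gpu.Dimension.x,  # the cluster slice that loads the data collectively
)
```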
PiperOrigin-RevId: 655077439
It seems like the nvgpu dialect bakes in a bunch of overly restrictive checks in its verifiers
and doesn't really buy us much in this case. nvvm works just fine.
PiperOrigin-RevId: 647653684
With this change we reach state-of-the-art performance (as far as I can tell)
of 50%+ TC utilization for head_dim 128 and 256.
I also added a little tuning harness to try out different block sizes.
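A sketch of the shape of such a harness (the `build_kernel` and `bench` helpers are
hypothetical placeholders, not the actual code):

```python
import itertools

def tune(block_q_options=(64, 128), block_kv_options=(64, 128, 256)):
  best = None
  for block_q, block_kv in itertools.product(block_q_options, block_kv_options):
    runtime = bench(build_kernel(block_q=block_q, block_kv=block_kv))  # hypothetical helpers
    if best is None or runtime < best[0]:
      best = (runtime, block_q, block_kv)
  return best  # (best runtime, best block sizes)
```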
PiperOrigin-RevId: 644927079
Apparently we were missing interface registration code for LLVM lowering,
which the gpu-to-llvm pass gracefully ignores unless compiled with debug
assertions enabled. But simply adding the missing registrations in fact makes the
pass _too powerful_ and makes it lower _all dialects to LLVM_, which is not
what we want. That's why I've replaced it with a minimal version that is
only responsible for handling the GPU dialect, making the lowering similar
to the one prior to extra registrations.
PiperOrigin-RevId: 641874183
We use the match statement in Mosaic GPU, making it incompatible with Python 3.9.
Since we're dropping support for 3.9 in about a month, it's better to skip the
OSS tests for now (especially since we can't test with H100s) than to rewrite the
code.
Also add a check for having at least an sm90 GPU.
PiperOrigin-RevId: 639733266
Without such a synchronization point, calls to `mbarrier_init` may end up
happening after uses of the `mbarrier` being initialized, which is undefined
behaviour and leads to deadlocks.
This allows us to reenable the previously broken test cases.
PiperOrigin-RevId: 638246527
When the user constructs the relevant shapes that live in `smem` throughout the
program, they can now use a `mosaic_gpu.Union` of PyTrees instead of a single
PyTree.
`mosaic_gpu.Union` allows declaring several sets of buffers: within a set, the
buffers are alive concurrently, while buffers from two distinct sets are alive
over non-intersecting time intervals.
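A hedged usage sketch (the member names and shapes are illustrative; only
`mosaic_gpu.Union` itself comes from the description above):

```python
# Two sets of buffers that are never live at the same time, so they may alias in SMEM.
smem_buffers = mosaic_gpu.Union([
    dict(lhs=lhs_smem_shape, rhs=rhs_smem_shape),  # live while the main loop runs
    dict(out=out_smem_shape),                      # live only during the epilogue
])
```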
PiperOrigin-RevId: 636533045