368 Commits

Author SHA1 Message Date
Tomás Longeri
8163e74e45 [Mosaic:TPU] Add relayout for adding minor implicit dim and relax some offset restrictions on similar shape cast
This factors out some logic from the apply-vector-layout shape cast rule where we insert a minor dimension, relaxes some offset restrictions on it, and uses it for the relayout.

PiperOrigin-RevId: 702993092
2024-12-04 23:13:26 -08:00
Tomás Longeri
101168740e [Mosaic:TPU] Lift offset restrictions on single-row (1, 128) -> (8, 128) 32-bit replicated retiling
PiperOrigin-RevId: 702966495
2024-12-04 21:08:17 -08:00
Naums Mogers
3990e05af7 [Mosaic] Add extra memref_slice verification and a memory space check helper
PiperOrigin-RevId: 702883469
2024-12-04 15:35:02 -08:00
jax authors
fa6585dea1 Merge pull request #25006 from andportnoy:aportnoy/mosaic-gpu-kernel-custom-name
PiperOrigin-RevId: 702772768
2024-12-04 10:26:21 -08:00
Jevin Jiang
9e5edb7015 [Mosaic TPU] Support packed type matmul with arbitrary shapes.
This cl removes all the shape constraints in matmul for all types.

We only need to mask out sub-elements on the contracting dim. Instead of unpacking data and applying masks, we create a VREG-sized i32 "mask" that contains the sub-element mask info and logical-AND it with the target vreg. This way, masking sub-elements costs each target vreg only 1 op (logical_and) instead of 3 ops (unpacking + select + packing).
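
The single-AND masking idea can be sketched on one 32-bit word in plain Python. This is only an illustration: the helper names `pack` and `subelement_mask` are hypothetical, and the assumption that the first sub-element sits in the low-order bits may not match the actual vreg layout.

```python
def subelement_mask(valid: int, bitwidth: int) -> int:
    """Return a 32-bit mask that keeps the first `valid` packed sub-elements.

    Assumption: sub-element 0 occupies the low-order bits of the word.
    """
    packing = 32 // bitwidth  # number of sub-elements per 32-bit word
    if not 0 <= valid <= packing:
        raise ValueError("valid must be between 0 and the packing factor")
    return (1 << (valid * bitwidth)) - 1


def pack(values, bitwidth: int) -> int:
    """Pack small unsigned ints into one 32-bit word, index 0 in the low bits."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & ((1 << bitwidth) - 1)) << (i * bitwidth)
    return word


# Four 8-bit values packed into one word; keep only the first three.
word = pack([0x11, 0x22, 0x33, 0x44], bitwidth=8)
masked = word & subelement_mask(valid=3, bitwidth=8)  # one AND, no unpacking
```

The point of the sketch is that the mask is built once and applied with a single bitwise AND, rather than unpacking, selecting, and repacking each word.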

PiperOrigin-RevId: 702480077
2024-12-03 14:58:42 -08:00
Andrey Portnoy
7bd81dbe0d [Mosaic GPU] Improve default kernel name and add option to customize
This allows users to distinguish Mosaic GPU kernels from other kernels
when using profiling programs such as Nsight Systems.

The new default behavior is to use `mosaic_gpu_<def_name>_kernel` as
the kernel name, where `<def_name>` is the name of the Mosaic GPU
Python kernel function passed to `as_gpu_kernel` or
`as_torch_gpu_kernel`.

We also add a new `kernel_name` optional argument to `as_gpu_kernel`
and `as_torch_gpu_kernel`. If `kernel_name` is not `None`, the
resulting kernel name is `mosaic_gpu_<kernel_name>_kernel`. This is
useful when the Mosaic GPU Python kernel function is constructed
through metaprogramming so that the final specialized kernel can have
different meaningful names depending on the metaparameters.

Previously the kernel name was always `main_kernel`.
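
A minimal sketch of the naming rule described above, assuming `<def_name>` is taken from the Python function's `__name__`. The helper `mosaic_gpu_kernel_symbol` is hypothetical and is not part of the Mosaic GPU API.

```python
def mosaic_gpu_kernel_symbol(fn, kernel_name=None):
    """Build the kernel name following the rule in the commit message.

    Hypothetical helper: uses fn.__name__ unless kernel_name is given.
    """
    name = fn.__name__ if kernel_name is None else kernel_name
    return f"mosaic_gpu_{name}_kernel"


def matmul():
    """Placeholder standing in for a Mosaic GPU Python kernel function."""


default_symbol = mosaic_gpu_kernel_symbol(matmul)
custom_symbol = mosaic_gpu_kernel_symbol(matmul, kernel_name="matmul_tile128")
```

With metaprogrammed kernels, the explicit `kernel_name` is what lets each specialization show up under its own name in a profiler.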
2024-12-02 22:22:11 -05:00
Benjamin Kramer
03b6945ee7 Integrate LLVM at llvm/llvm-project@b214ca82da
Updates LLVM usage to match
[b214ca82daee](https://github.com/llvm/llvm-project/commit/b214ca82daee)

PiperOrigin-RevId: 700689999
2024-11-27 07:08:09 -08:00
Tomás Longeri
7a2070e7da [Mosaic:TPU] Enable broadcast from 1-D vectors
PiperOrigin-RevId: 700592669
2024-11-27 00:27:59 -08:00
Jevin Jiang
f899d51535 [Mosaic TPU] Fold sublane offset to indices when storing to untiled ref.
This optimization avoids unnecessary retiling when storing to untiled ref but adds at most one extra store op for sublane offset (since sublane offset is limited to < VregSlice[0]).

PiperOrigin-RevId: 698896373
2024-11-21 13:29:06 -08:00
Naums Mogers
e72b449089 Reverts c04aec9d525dd2e767495e41b98e82dd79315f37
PiperOrigin-RevId: 698654038
2024-11-20 22:45:46 -08:00
Jevin Jiang
869a53345d [Mosaic TPU] Add bound check for general vector store op.
PiperOrigin-RevId: 698577015
2024-11-20 17:28:04 -08:00
Naums Mogers
6c291d67b7 [Mosaic] Add tpu.log verification on SC
Guards against using formatting and targeting vector subcores on SC.

PiperOrigin-RevId: 698222100
2024-11-19 19:04:29 -08:00
Naums Mogers
c04aec9d52 [Mosaic] Extend tpu.sem_signal with subcore_id
This change:
- Bumps up the version of Mosaic to 4 in `serde.cc`.

- Adds optional `subcore_id` parameter to `tpu.sem_signal` for signalling specific subcores.

- Extends deserialization to correctly parse the older versions of Mosaic without the new parameter `subcore_id` of `tpu.sem_signal`.
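
The backward-compatible parsing described in the last bullet can be illustrated with a small Python sketch. It models an op as a plain dict, which is an assumption made for illustration only; the real logic lives in `serde.cc` and operates on MLIR, and `upgrade_sem_signal` is a hypothetical name.

```python
CURRENT_VERSION = 4  # version that introduces `subcore_id`, per the message above


def upgrade_sem_signal(op: dict, version: int) -> dict:
    """Return a copy of `op` that always carries a `subcore_id` entry.

    Ops serialized before version 4 have no `subcore_id`, so it is filled
    in as None (meaning "not specified").
    """
    if version > CURRENT_VERSION:
        raise ValueError(f"unsupported serialization version: {version}")
    upgraded = dict(op)
    if version < 4:
        upgraded.setdefault("subcore_id", None)
    return upgraded


old_op = upgrade_sem_signal({"semaphore": "s", "amount": 1}, version=3)
new_op = upgrade_sem_signal({"semaphore": "s", "amount": 1, "subcore_id": 2}, version=4)
```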

PiperOrigin-RevId: 698163836
2024-11-19 15:22:59 -08:00
Naums Mogers
0d36b0b433 [Mosaic] Add target core type parameter to tpu.sem_signal
Adds the optional core type parameter to `tpu.sem_signal` for cross-core signalling.
If the target core type is not provided, the target core type is assumed to be that of the core issuing the signal.
The issuing core type is determined based on the core type annotation of the parent function; if the annotation is not provided, the issuing core type is assumed to be TensorCore.
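
The defaulting rules above can be sketched as follows. The enum members and the function name are illustrative assumptions and do not mirror the actual dialect definitions.

```python
from enum import Enum
from typing import Optional, Tuple


class CoreType(Enum):
    # Illustrative members only; the real enum may differ.
    TENSOR_CORE = "tensor_core"
    SPARSE_CORE = "sparse_core"


def resolve_core_types(
    target: Optional[CoreType],
    func_annotation: Optional[CoreType],
) -> Tuple[CoreType, CoreType]:
    """Return (issuing, target) core types after applying the defaults."""
    # No annotation on the parent function: assume TensorCore.
    issuing = func_annotation if func_annotation is not None else CoreType.TENSOR_CORE
    # No explicit target: signal a core of the same type as the issuer.
    resolved_target = target if target is not None else issuing
    return issuing, resolved_target


defaults = resolve_core_types(None, None)
cross_core = resolve_core_types(CoreType.TENSOR_CORE, CoreType.SPARSE_CORE)
```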

PiperOrigin-RevId: 698129842
2024-11-19 13:40:13 -08:00
Jevin Jiang
6c31efa3f3 [Mosaic TPU] Add general tpu.vector_store and support masked store.
This cl introduces a general store op called tpu.vector_store which aims to unify vector::store, tpu::strided_load, vector::masked_store. The tpu.vector_store op should also provide a general interface for lowering for both TensorCore and SparseCore.

This cl also adds the support for (dynamic) masked store.

PiperOrigin-RevId: 698067741
2024-11-19 10:33:09 -08:00
jax authors
d397dd9684 Implement lax.pad in Pallas.
PiperOrigin-RevId: 697897093
2024-11-18 23:59:20 -08:00
Jevin Jiang
0fe77bc9f0 [Mosaic TPU] Support relayout for mask vector
We cast i1 vector (mask) to i32 vector before relayout and then cast back to i1 vector (mask) after relayout is finished.
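
The cast, relayout, cast-back pattern can be shown in plain Python. Here `relayout_i32` is a stand-in (a simple rotation) for whatever integer-only relayout is available; it is not the actual Mosaic relayout.

```python
def relayout_i32(values, shift):
    """Stand-in for an integer-only relayout: rotate elements right by `shift`."""
    if not values:
        return []
    shift %= len(values)
    return values[-shift:] + values[:-shift]


def relayout_mask(mask, shift):
    """Relayout a boolean mask by going through integers and back."""
    as_i32 = [int(b) for b in mask]       # i1 -> i32
    moved = relayout_i32(as_i32, shift)   # reuse the integer relayout
    return [v != 0 for v in moved]        # i32 -> i1


result = relayout_mask([True, False, False, True], shift=1)
```

The benefit of this design is that no mask-specific relayout code is needed; only the two casts are added around the existing path.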

PiperOrigin-RevId: 697823543
2024-11-18 18:07:15 -08:00
jax authors
1471702adc [Mosaic TPU] Support 1D concat: set implicit_dim to kSecondMinor to treat 1D (N,) as (1, N) and then tile it as (1, 128)
PiperOrigin-RevId: 696870258
2024-11-15 06:41:57 -08:00
jax authors
a8464ce761 [Mosaic][TPU] Omit short circuiting of relayout (we should always relayout!) and implement product mismatch case for where we relayout from replicated to offset, and the number of vregs changes.
PiperOrigin-RevId: 696557463
2024-11-14 09:53:25 -08:00
Naums Mogers
c32db46e6c [Mosaic] Add parameter names to tpu.sem_signal and add tests
This CL adds parameter names to the optional parameters of `tpu.sem_signal` -- `device_id`, `core_id` -- to remove the ambiguity upon deserialization.
Adds LIT tests of signalling on TC with parameter names.

PiperOrigin-RevId: 695875037
2024-11-12 14:37:47 -08:00
Sergei Lebedev
d304025a41 [mosaic_gpu] The profiler now uses FFI calls for creating events and computing elapsed time
PiperOrigin-RevId: 695798787
2024-11-12 11:01:59 -08:00
jax authors
1221da8467 [Mosaic] Fix mask creation for packed sublanes
Unaligned concat used to be f32 only, but was implicitly protected via unimplemented support for multi-row-shift in sub32 types. When that support was added, we started invoking the unaligned concat flow with sub32 types, but the masking code that assumed full rows (unpacked types) was no longer sufficient: we need finer granularity for these cases. This only affects sublanes, since that is where we pack; we don't have partial lanes.

This CL, as a small benefit, also adds better error messages to the ops involved in lower_to_llo.cc.

PiperOrigin-RevId: 695796095
2024-11-12 10:55:19 -08:00
Jevin Jiang
38d062dbee [Mosaic TPU] Support dynamic DMA and ref slice on the 2nd minor when memref is untiled
* Generalize any untiled memref to have tiling (packing, 128)
* Support dynamic index on 2nd minor.
* Support dynamic shape on 2nd minor.

PiperOrigin-RevId: 695516124
2024-11-11 16:14:27 -08:00
Benjamin Chetioui
da89c9e38c [Mosaic GPU] Add base_pointer argument to InitializeBarrierOp.
This corresponds to what's implemented in `BarrierRef`, and ultimately makes it
easier to allocate barriers at a specific address in dynamic shared memory.

PiperOrigin-RevId: 695308297
2024-11-11 06:18:26 -08:00
Dimitar (Mitko) Asenov
d833066a1f [MOSAIC:GPU] Add async_load, async_store, and supporting attributes to the MLIR Mosaic GPU Dialect.
PiperOrigin-RevId: 694643777
2024-11-08 14:34:23 -08:00
Adam Paszke
ce3826d098 [Mosaic GPU] Make sure to free the cloned MLIR module when debugging
We only recently started using this in tests and it has caused ASAN
to report a bunch of leaks.

PiperOrigin-RevId: 694510867
2024-11-08 08:35:10 -08:00
Tomás Longeri
04a6652243 [Mosaic] Fix handling of i1 splat constants
PiperOrigin-RevId: 694248723
2024-11-07 14:28:59 -08:00
Tzu-Wei Sung
8b7bcadebe [Mosaic] Fix canonicalize_extract op name.
PiperOrigin-RevId: 694236671
2024-11-07 13:51:52 -08:00
Naums Mogers
3df204a457 [Mosaic] Verify that tpu.sem_wait semaphore rank is zero
Since we only wait on one semaphore, we should enforce this in the verifier.

PiperOrigin-RevId: 693770055
2024-11-06 10:10:15 -08:00
Sergei Lebedev
34b4787e2e [mosaic_gpu] Check the return code of gpuEventCreate and gpuEventDestroy
PiperOrigin-RevId: 693260326
2024-11-05 01:59:58 -08:00
Benjamin Chetioui
63e59c5fd7 [Mosaic GPU] Ensure that the dialect module can be loaded successfully.
This requires that the file providing the bindings has the same name as the
dialect it defines, since dialect search looks for a module path of the form
`<prefix>.<dialect namespace>`.

PiperOrigin-RevId: 693241875
2024-11-05 00:47:21 -08:00
Praveen Batra
8296f6e0ba [Mosaic] Add extension files for infer/apply vector layout.
PiperOrigin-RevId: 691868278
2024-10-31 11:08:37 -07:00
Praveen Batra
7d9f565647 [Mosaic] Fix some imports.
PiperOrigin-RevId: 691830491
2024-10-31 09:25:34 -07:00
Benjamin Chetioui
c708a04c6e [Mosaic GPU] Add Python bindings for the Mosaic GPU MLIR dialect.
Also start moving the existing C++ tests to Python.

PiperOrigin-RevId: 691729887
2024-10-31 02:47:30 -07:00
Dimitar (Mitko) Asenov
7d504cd95a [MOSAIC:GPU] Extend the mosaic mlir dialect with fragmented layouts.
PiperOrigin-RevId: 691712579
2024-10-31 01:29:22 -07:00
jax authors
5aeffde707 [Mosaic] Extend tpu matmulop to have dimension dims. Add support for batching and simple transposition.
PiperOrigin-RevId: 691706218
2024-10-31 00:59:13 -07:00
Naums Mogers
242e6634ff [Mosaic] Add the core type enum
The new attribute allows differentiating compilation by target core.

PiperOrigin-RevId: 691531726
2024-10-30 13:23:34 -07:00
jax authors
99ea4c1a4a [Fix] Put * packing into reshape no-op condition (Bug in my original CL)
PiperOrigin-RevId: 691476663
2024-10-30 10:47:23 -07:00
jax authors
5ad066eeaa [TPU][Mosaic] Replace tpu lowering (at canonicalization) for repeat with concat (which handles far more cases)
PiperOrigin-RevId: 691192121
2024-10-29 15:57:44 -07:00
jax authors
de68018473 [NFC][Mosaic TPU] Clarify layout comment block
PiperOrigin-RevId: 690977672
2024-10-29 05:20:08 -07:00
jax authors
12d26053e3 [TPU][Mosaic] Add support for a no-op reshape where sublane_tiling = 1 and the res_tiled and src_tiled shapes both fill a full vreg (1024)
PiperOrigin-RevId: 690796348
2024-10-28 16:57:51 -07:00
Adam Paszke
36c56fa19b [Pallas:MGPU] Fix flaky debug_print tests
Turns out that waiting for the kernel to finish is not enough, since the
prints also need to be processed by the CUDA runtime. Using a test-only
function that synchronizes all the devices seems to suffice.

PiperOrigin-RevId: 690624999
2024-10-28 08:42:02 -07:00
Sergei Lebedev
04bdd07f66 [mosaic_gpu] mgpu.FragmentedArray now supports //
This is needed to compute grid index from the iteration step counter in `emit_pipeline`.

PiperOrigin-RevId: 690608581
2024-10-28 07:52:22 -07:00
Jevin Jiang
2a671e25a7 [Mosaic TPU] Remove extra check
PiperOrigin-RevId: 689852989
2024-10-25 11:22:17 -07:00
Tzu-Wei Sung
4972f84c94 [Mosaic] Use max sublane offset per shuffled load to decide whether to avoid bank conflict.
PiperOrigin-RevId: 689809024
2024-10-25 09:09:14 -07:00
jax authors
63c1699ed0 Fix a use-after-free bug in third_party/py/jax/jaxlib/mosaic/dialect/gpu/mosaic_gpu_test.cc
The backing array of the initializer_list is destroyed at the end of the full expression.

PiperOrigin-RevId: 689783482
2024-10-25 07:40:12 -07:00
Adam Paszke
6634f5a348 [Mosaic GPU] Use absl::StrCat instead of std::string::operator+
Repeated string addition is apparently a bit of an anti-pattern. Not that it matters
much in this place, but why not do it properly.

PiperOrigin-RevId: 689416587
2024-10-24 09:49:51 -07:00
Andrey Portnoy
14e0f0e7fa [Mosaic GPU] Query SM and PTX ISA dynamically using driver and LLVM
Originally proposed in #24021. Slightly rewritten to make testing with internal LLVM toolchains better.

Use CUDA driver API to query major and minor compute capabilities, thus arriving at a "base" SM string (e.g. `sm_90`).
Then use LLVM to see if we can "upgrade" the base SM string to one that enables architecture-specific capabilities (e.g. `sm_90a`).
Then use LLVM to map the SM string to a PTX ISA version that supports the SM.
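
The three steps can be sketched in Python with the driver and LLVM queries replaced by plain arguments. The `supported_targets` set and the `sm_to_ptx` table are illustrative stand-ins, not values obtained from the CUDA driver or LLVM, and the function names are hypothetical.

```python
def base_sm(major: int, minor: int) -> str:
    """Build the base SM string from the compute capability, e.g. sm_90."""
    return f"sm_{major}{minor}"


def pick_sm(major: int, minor: int, supported_targets) -> str:
    """Prefer the architecture-specific variant (e.g. sm_90a) when available."""
    base = base_sm(major, minor)
    upgraded = base + "a"
    return upgraded if upgraded in supported_targets else base


def ptx_isa_for(sm: str, sm_to_ptx) -> str:
    """Look up a PTX ISA version that supports the chosen SM string."""
    return sm_to_ptx[sm]


# Illustrative inputs only; real values would come from the driver and LLVM.
supported = {"sm_80", "sm_90", "sm_90a"}
table = {"sm_80": "7.0", "sm_90": "7.8", "sm_90a": "8.0"}

hopper_sm = pick_sm(9, 0, supported)
ampere_sm = pick_sm(8, 0, supported)
hopper_ptx = ptx_isa_for(hopper_sm, table)
```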

Co-authored-by: Andrey Portnoy <aportnoy@nvidia.com>
PiperOrigin-RevId: 689286774
2024-10-24 01:46:29 -07:00
Jevin Jiang
b8bacda2d9 [Mosaic TPU] Use native vector tiling to load and store with untiled memref.
PiperOrigin-RevId: 689142734
2024-10-23 16:22:16 -07:00
jax authors
48bddc6f6c Adds arith.select to the op patterns in order to canonicalize non-32-bit selects.
PiperOrigin-RevId: 687635492
2024-10-19 09:09:06 -07:00