The fact that src generalizes dst does not mean that they have the same implicit
tile shape (if one has an implicit dim and the other one doesn't, then they will
differ by a singleton dimension).
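For illustration, here is a minimal Python sketch (the helper name and the implicit-dim spellings are hypothetical, not the actual Mosaic TPU API) of how an implicit dim inserts a singleton into the implicit shape, so two layouts over the same array can have implicit shapes that differ by a singleton dimension:

  # Hypothetical helper, loosely mirroring how implicit dims pad out a shape.
  def implicit_shape(shape, implicit_dim):
    if implicit_dim is None:            # no implicit dim
      return tuple(shape)
    if implicit_dim == "second_minor":  # singleton inserted above the minor dim
      return tuple(shape[:-1]) + (1, shape[-1])
    if implicit_dim == "minor":         # singleton appended as the minor dim
      return tuple(shape) + (1,)
    raise ValueError(implicit_dim)

  # Same array shape, but the implicit shapes differ by a singleton dimension:
  assert implicit_shape((2, 128), None) == (2, 128)
  assert implicit_shape((2, 128), "second_minor") == (2, 1, 128)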
PiperOrigin-RevId: 658775019
We're constantly hitting unimplemented relayouts, but it's hard to even know what's
in there given the way the code is written. This is the first of a few clean-up CLs
that aim to partition the process into steps with clear responsibilities. It should
help us better understand what's missing.
PiperOrigin-RevId: 658318811
This should help with understanding cuTensorMapEncodeTiled failures, since
CUDA doesn't provide any details beyond the error return code.
Note that this change also ensures that TMA descriptors are 64-byte aligned.
PiperOrigin-RevId: 656062820
In particular, test trivial collectives (over singleton cluster axes), collectives
over more than 2 devices, and clusters larger than 8 devices. This uncovered a few
more bugs in the implementation.
PiperOrigin-RevId: 655686102
We will implement a more efficient relayout according to the configs in the rewrite ctx, such as `hardware_generation`, `max_sublanes_in_scratch`, and so on, so it makes sense to change the relayout interface to take the ctx (including the Python bindings). Now we can define a rewrite ctx in `apply_vector_layout_test` as well, which makes it easier to test some advanced cases (e.g., changing `mxu_shape` or `max_sublanes_in_scratch` for rotate and relayout).
PiperOrigin-RevId: 655350013
This CL removes the funcOp from the RewriteContext of the apply-vector-layout pass (since only one function uses it) and uses the context to create the pass instead of a long list of arguments. We will need to add more args (the target's bank counts) to create apply-vector-layout.
PiperOrigin-RevId: 655329321
This affects the (packing, 128) -> (8 * packing, 128) and 32-bit (8, 128),-2 -> (8, 128) retilings:
- No longer always broadcast the first sublane of a vreg before blending, which is usually unnecessary. Rotate instead, unless dst requires replicated offsets in (1, 128) -> (8, 128); see the toy sketch below.
  For (8, 128),-2 -> (8, 128), with our current restrictions, the first vreg already has the sublane in the right position, so the broadcast is always wasteful.
- It's unclear whether rotate is always better than broadcast, but it doesn't make sense to broadcast the first vreg yet rotate the others.
This is some cleanup prior to removing some offset restrictions for (8, 128),-2 -> (8, 128).
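As a toy illustration (plain NumPy, not the actual Mosaic lowering; the shapes and the target sublane are made up), rotating the source sublane into position and then blending gives the same result as broadcasting it and blending, as long as only one destination sublane needs the data:

  import numpy as np

  SUBLANES, LANES = 8, 128
  rng = np.random.default_rng(0)
  src = rng.integers(0, 100, (SUBLANES, LANES))   # data of interest lives in sublane 0
  dst = rng.integers(0, 100, (SUBLANES, LANES))
  k = 3                                           # sublane where the data must land
  mask = (np.arange(SUBLANES) == k)[:, None]

  # Broadcast-then-blend: replicate sublane 0 to every sublane first.
  out_broadcast = np.where(mask, np.broadcast_to(src[:1], src.shape), dst)
  # Rotate-then-blend: shift sublane 0 into sublane k instead.
  out_rotate = np.where(mask, np.roll(src, k, axis=0), dst)
  assert (out_broadcast == out_rotate).all()

The broadcast only pays off when dst actually requires replicated offsets, i.e. when every destination sublane must hold the same data.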
PiperOrigin-RevId: 654935883
It's unused, buggy (it returns a reference to a local copy of the array), and `ArrayRef` already has a ctor that takes a `std::array`.
PiperOrigin-RevId: 654916697
- Sublane unfolding was not being checked for non-empty implicit dims, e.g. (2, 2, 128, 1) -> (2, 256) would not work.
- Noop squeeze/unsqueeze paths in infer-vector-layout, when the source has ImplicitDim::kNone, were forcing native tiling for some reason.
- 1D lane squeeze was always assigning a bitwidth of 32.
- Maybe others.
PiperOrigin-RevId: 653910942
* If the bitwidth does not change after the bitcast:
  - We can bitcast the input with any vector layout.
* If the bitwidth changes after the bitcast:
  - We can bitcast the input with a sublane offset that is a multiple of the ratio of the bitwidths (see the sketch below).
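A minimal sketch of that rule (the helper is hypothetical, and it assumes "ratio" means the larger bitwidth divided by the smaller one):

  def bitcast_sublane_offset_ok(sublane_offset, src_bitwidth, dst_bitwidth):
    if src_bitwidth == dst_bitwidth:
      return True  # any vector layout works when the bitwidth is unchanged
    ratio = max(src_bitwidth, dst_bitwidth) // min(src_bitwidth, dst_bitwidth)
    return sublane_offset % ratio == 0

  assert bitcast_sublane_offset_ok(3, 16, 16)      # same bitwidth: always fine
  assert bitcast_sublane_offset_ok(2, 32, 16)      # ratio 2: even offsets are fine
  assert not bitcast_sublane_offset_ok(1, 32, 16)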
PiperOrigin-RevId: 653375579
Also fix a bug in the (1, 128 * packing) -> (packing, 128) retiling where the part index could be incremented out of bounds.
Note: Many relayouts might be inefficient for implicit dims. If, for example, the implicit dim is kSecondMinor, retiling might blend tiles that are only padding. However, this also applies to the kNone implicit dim with small shapes, so any optimizations should be written based on the implicit shape.
PiperOrigin-RevId: 653209744
This CL supports memref shapecast (the rules are sketched below):
1. If the tile is (1, 128), we support a shapecast on any dim.
2. If the shapecast is on the sublane dim, we only support tile-aligned shapes.
3. If the shapecast is on a non-tiling dim, we support any shapecast.
4. All other cases are considered invalid memref shapecasts.
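A minimal sketch of these rules (the helper is hypothetical; `changed_dim` counts from the minor end, so 0 is the lane dim, 1 the sublane dim, and anything >= 2 is an untiled dim):

  def memref_shapecast_ok(tiling, changed_dim, new_size):
    if tiling == (1, 128):
      return True                       # rule 1: any dim may be reshaped
    if changed_dim == 1:
      return new_size % tiling[0] == 0  # rule 2: sublane dim must stay tile aligned
    if changed_dim >= 2:
      return True                       # rule 3: non-tiling dims are unconstrained
    return False                        # rule 4: everything else is invalid

  assert memref_shapecast_ok((1, 128), 0, 200)
  assert memref_shapecast_ok((8, 128), 1, 16) and not memref_shapecast_ok((8, 128), 1, 3)
  assert memref_shapecast_ok((8, 128), 2, 7) and not memref_shapecast_ok((8, 128), 0, 256)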
PiperOrigin-RevId: 651924552
VectorLayout offsets are now allowed to fall anywhere within the vreg slice. This way, tiling is still applied after offsets and offsets are still applied after implicit dimensions.
Note that offsets outside of the vreg slice would mean a vreg full of padding, which is why we disallow them.
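A minimal sketch of this constraint, assuming a hypothetical validity helper (`None` plays the role of a replicated offset, and (8, 128) is just an example vreg slice):

  def offsets_within_vreg_slice(offsets, vreg_slice):
    return all(o is None or 0 <= o < s for o, s in zip(offsets, vreg_slice))

  assert offsets_within_vreg_slice((5, 100), (8, 128))
  assert offsets_within_vreg_slice((None, 7), (8, 128))
  assert not offsets_within_vreg_slice((8, 0), (8, 128))  # would be a vreg of pure padding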
PiperOrigin-RevId: 650408597
We support any dynamic index on the 2nd minor dim in either of these cases (sketched below):
1. The minormost dim size of the unsliced memref matches the VREG lane count.
2. The load/store accesses one row on the second minormost dim, which triggers an implicit strided load/store.
Note: For the default cases, which cannot skip the alignment check, we still use the dynamic slice + static load/store solution to reduce scalar core work. We should figure out a way to optimize this in all cases.
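A minimal sketch of the two supported cases (the helper is hypothetical; 128 is the lane count assumed here):

  def dynamic_2nd_minor_index_ok(unsliced_minor_size, rows_accessed, lane_count=128):
    case1 = unsliced_minor_size == lane_count  # minormost dim matches the VREG lane count
    case2 = rows_accessed == 1                 # single row => implicit strided load/store
    return case1 or case2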
PiperOrigin-RevId: 648771794
As we've established (sigh), we can't pass TMA descriptors through global memory.
The current workaround was to use constant memory instead, but this raises a number of
potential concurrency issues. So, instead, we use the freshly added support for grid_constant
parameters in upstream LLVM to pass the descriptors as kernel arguments. This seems to work
fine and should in fact have lower overheads than both previous methods.
PiperOrigin-RevId: 648744363
In some situations, this also meant changing unrelated files to directly include tsl/platform/statusor.h to get the definitions of TF_ASSIGN_OR_RETURN, etc., where they had previously been included transitively for free.
PiperOrigin-RevId: 645169743
We will choose the best solution based on the size of the internal scratch memory (solution 1 is sketched below):
- Sol 1: Convert the dynamic roll to log(N) static ops.
- Sol 2: Static store + dynamic load with internal scratch.
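A minimal NumPy sketch of solution 1 (illustrative only, not the actual lowering; it assumes the axis size is a power of two): a roll by a dynamic amount is decomposed into log(N) static rolls, each kept or discarded based on one bit of the shift.

  import numpy as np

  def dynamic_roll(x, shift, axis=0):
    n = x.shape[axis]
    bit = 1
    while bit < n:                           # log2(n) iterations
      rolled = np.roll(x, bit, axis=axis)    # static roll by a known amount
      x = np.where(shift & bit, rolled, x)   # apply it only if this bit of `shift` is set
      bit <<= 1
    return x

  x = np.arange(8)
  assert (dynamic_roll(x, 3) == np.roll(x, 3)).all()

Solution 2 instead stores the data to internal scratch at a static offset and reads it back at a dynamic offset, trading scratch space for fewer ops.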
PiperOrigin-RevId: 644509328