It is possible that a null pointer is inserted into the cache and not updated with a valid kernel call
in case there is an error later during initialization. This change updates the cache to store either
an error or a valid kernel call.
PiperOrigin-RevId: 666161091
This avoids:
- a forward declaration of `GpuContext`
- the `:asm_compiler_header` header only target
The moved code is unchanged - I just move it from one
file to another and fix up includes and dependencies.
Note that this is adding just another `#ifdef` to the redzone allocator code. I will clean this up in a subsequent change.
PiperOrigin-RevId: 623285804
A bug in CUDA prevents us from calling gpuStreamGetCtx inside graph capture. We use cuCtxGetCurrent as workaround for now.
PiperOrigin-RevId: 605417225
Autotuning is not compatible with graph capture because it requires synchronizing.
We use cuThreadExchangeStreamCaptureMode to execute a sequence of commands that are not recorded to graphs, similar to what NCCL does here: b6d7438d31/src/include/alloc.h (L171)
PiperOrigin-RevId: 602436960
The autotuner runs a series of benchmarks to determine the best configuration
for a Triton kernel. However, if it encounters a config that does not fit in
shared memory it throws an error and stops. I this eventuality it should just
continue.
PiperOrigin-RevId: 600730507
Register callback with default call target name from C++, enabling Triton calls with the default name to work in C++ only contexts (e.g. serving).
PiperOrigin-RevId: 545211452