From a13b618c98109aa61b435f432d162bdbd6dc73fd Mon Sep 17 00:00:00 2001
From: Frederic Bastien
Date: Thu, 21 Nov 2024 13:25:07 -0500
Subject: [PATCH] Document cudaMallocAsync as an experimental feature.

---
 docs/gpu_memory_allocation.rst | 43 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/docs/gpu_memory_allocation.rst b/docs/gpu_memory_allocation.rst
index dac52c194..6667589e7 100644
--- a/docs/gpu_memory_allocation.rst
+++ b/docs/gpu_memory_allocation.rst
@@ -70,3 +70,46 @@ Common causes of OOM failures
    memory. Note however, that the algorithm is basic and you can often
    get better trade-off between compute and memory by disabling the
    automatic remat pass and doing it manually with `the jax.remat API `_
+
+
+Experimental features
+---------------------
+
+Features here are experimental and should be used with caution.
+
+``TF_GPU_ALLOCATOR=cuda_malloc_async``
+   This replaces XLA's own BFC memory allocator with `cudaMallocAsync
+   <https://developer.nvidia.com/blog/using-cuda-stream-ordered-memory-allocator-part-1/>`_.
+   This removes the big fixed pre-allocation and uses a memory pool that
+   grows. The expected benefit is that there is no need to set
+   ``XLA_PYTHON_CLIENT_MEM_FRACTION``.
+
+   The risks are:
+
+   - Memory fragmentation is different, so if you are close to the
+     limit, the exact OOM failures due to fragmentation will differ.
+   - The allocation time won't all be paid at the start, but will be
+     incurred whenever the memory pool needs to grow. So you could see
+     less stable speed at the start, and for benchmarks it is even more
+     important to ignore the first few iterations.
+
+   The risks can be mitigated by pre-allocating a significant chunk
+   while still keeping the benefit of a growing memory pool. This can
+   be done with ``TF_CUDA_MALLOC_ASYNC_SUPPORTED_PREALLOC=N``. If ``N``
+   is ``-1``, it preallocates the same amount of memory as is allocated
+   by default. Otherwise, ``N`` is the number of bytes to preallocate.
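+
+   For example, a minimal sketch of opting in from Python (the 2 GiB
+   pre-allocation below is just an illustrative value):
+
+   .. code-block:: python
+
+      import os
+
+      # Must be set before JAX initializes its GPU backend.
+      os.environ["TF_GPU_ALLOCATOR"] = "cuda_malloc_async"
+      # Optional: pre-allocate 2 GiB up front, then let the pool grow.
+      os.environ["TF_CUDA_MALLOC_ASYNC_SUPPORTED_PREALLOC"] = str(2 * 1024**3)
+
+      import jax
+      print(jax.devices())  # The GPU backend (and allocator) initialize here.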