"_This tutorial requires JAX v0.4.31 or newer._\n",
"\n",
"While a wide range of numerical operations can be easily and efficiently implemented using JAX's built-in `jax.numpy` and `jax.lax` interfaces, it can sometimes be useful to explicitly call out to external compiled libraries via a \"foreign function interface\" (FFI).\n",
"This can be particularly useful when the relevant operations have already been implemented in an optimized C or CUDA library, and it would be non-trivial to reimplement them directly using JAX, but it can also be useful for optimizing the runtime or memory performance of JAX programs.\n",
"That being said, the FFI should typically be considered a last resort, because the XLA compiler that sits in JAX's backend, or the Pallas kernel language, which provides lower-level control, can typically produce performant code with lower development and maintenance costs.\n",
"\n",
"One point that should be taken into account when considering use of the FFI is that _JAX doesn't automatically know how to differentiate through foreign functions_.\n",
"This means that if you want to use JAX's autodifferentiation capabilities alongside a foreign function, you'll also need to provide an implementation of the relevant differentiation rules.\n",
"We will discuss some possible approaches below, but it is important to call this limitation out right from the start!\n",
"\n",
"JAX's FFI support is provided in two parts:\n",
"\n",
"1. A header-only C++ library from XLA, which is packaged as part of JAX as of v0.4.29 or available from the [openxla/xla](https://github.com/openxla/xla) project, and\n",
"2. A Python front end, available in the `jax.ffi` submodule.\n",
"\n",
"In this tutorial we demonstrate the use of both of these components using a simple example, and then go on to discuss some lower-level extensions for more complicated use cases.\n",
"We start by presenting the FFI on CPU, and discuss generalizations to GPU or multi-device environments below.\n",
"The end-to-end code for this example and some other more advanced use cases can be found in the JAX FFI examples project on GitHub at [`examples/ffi` in the JAX repository](https://github.com/jax-ml/jax/tree/main/examples/ffi).\n",
"Because we demonstrate how FFI calls can be sharded at the end of this tutorial, let's first set up our environment to be treated by JAX as having multiple CPUs:"
"But, it's just non-trivial enough to be useful for demonstrating some key details of the FFI, while still being straightforward to understand.\n",
"We will use this reference implementation to test our FFI version below.\n",
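Concretely, a pure-JAX reference along these lines can be written as follows (a sketch; the exact `rms_norm_ref` used elsewhere in this tutorial may differ in details such as where `eps` enters):

```python
import jax.numpy as jnp

def rms_norm_ref(x, eps=1e-5):
  # Treat all leading axes as batch dimensions and normalize over the last axis.
  scale = jnp.sqrt(jnp.mean(jnp.square(x), axis=-1, keepdims=True) + eps)
  return x / scale
```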
"\n",
"## Backend code\n",
"\n",
"To begin with, we need an implementation of RMS normalization in C++ that we will expose using the FFI.\n",
"This implementation isn't meant to be particularly performant, but you could imagine that if you had a new, better implementation of RMS normalization in a C++ library, it might have an interface like the one below.\n",
"So, here's a simple implementation of RMS normalization in C++:\n",
"and, for our example, this is the function that we want to expose to JAX via the FFI."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### C++ interface\n",
"\n",
"To expose our library function to JAX and XLA, we need to write a thin wrapper using the APIs provided by the header-only library in the [`xla/ffi/api`](https://github.com/openxla/xla/tree/main/xla/ffi/api) directory of the [XLA project](https://github.com/openxla/xla).\n",
"For more information about this interface, take a look at [the XLA custom call documentation](https://openxla.org/xla/custom_call).\n",
"The full source listing can be downloaded [here](https://github.com/jax-ml/jax/blob/main/examples/ffi/src/jax_ffi_example/rms_norm.cc), but the key implementation details are reproduced here:\n",
"Starting at the bottom, we're using the XLA-provided macro `XLA_FFI_DEFINE_HANDLER_SYMBOL` to generate some boilerplate which will expand into a function called `RmsNorm` with the appropriate signature.\n",
"But, the important stuff here is all in the call to `ffi::Ffi::Bind()`, where we define the input and output types, and the types of any parameters.\n",
"\n",
"Then, in `RmsNormImpl`, we accept `ffi::Buffer` arguments which include information about the buffer shape, and pointers to the underlying data.\n",
"In this implementation, we treat all leading dimensions of the buffer as batch dimensions, and perform RMS normalization over the last axis.\n",
"`GetDims` is a helper function providing support for this batching behavior.\n",
"We discuss this batching behavior in more detail [below](ffi-call-vmap), but the general idea is that it can be useful to transparently handle batching in the left-most dimensions of the input arguments.\n",
"In this case, we treat all but the last axis as batch dimensions, but other foreign functions may require a different number of non-batch dimensions."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Building and registering an FFI handler\n",
"\n",
"Now that we have our minimal FFI wrapper implemented, we need to expose this function (`RmsNorm`) to Python.\n",
"In this tutorial, we compile `RmsNorm` into a shared library and load it using [ctypes](https://docs.python.org/3/library/ctypes.html), but another common pattern is to use [nanobind](https://nanobind.readthedocs.io/) or [pybind11](https://pybind11.readthedocs.io/) as discussed below.\n",
"The registration function, {func}`~jax.ffi.register_ffi_target`, expects our handler (a function pointer to the C++ function `RmsNorm`) to be wrapped in a [`PyCapsule`](https://docs.python.org/3/c-api/capsule.html).\n",
"If you're familiar with the legacy \"custom call\" API, it's worth noting that you can also use {func}`~jax.ffi.register_ffi_target` to register a custom call target by manually specifying the keyword argument `api_version=0`. The default `api_version` for {func}`~jax.ffi.register_ffi_target` is `1`, the new \"typed\" FFI API that we're using here.\n",
"A common alternative pattern for exposing handlers to Python is to use [nanobind](https://nanobind.readthedocs.io/) or [pybind11](https://pybind11.readthedocs.io/) to define a tiny Python extension which can be imported.\n",
"For our example here, the nanobind code would be:\n",
"\n",
"```c++\n",
"#include <type_traits>\n",
"\n",
"#include \"nanobind/nanobind.h\"\n",
"#include \"xla/ffi/api/c_api.h\"\n",
"\n",
"namespace nb = nanobind;\n",
"\n",
"template <typename T>\n",
"nb::capsule EncapsulateFfiCall(T *fn) {\n",
"  // This check is optional, but it can be helpful for avoiding invalid handlers.\n",
"  static_assert(std::is_invocable_r_v<XLA_FFI_Error *, T, XLA_FFI_CallFrame *>,\n",
"                \"Encapsulated function must be an XLA FFI handler\");\n",
"  return nb::capsule(reinterpret_cast<void *>(fn));\n",
"}\n",
"\n",
"NB_MODULE(rms_norm, m) {\n",
"  m.def(\"rms_norm\", []() { return EncapsulateFfiCall(RmsNorm); });\n",
"}\n",
"```\n",
"\n",
"This code cell includes a lot of inline comments which should explain most of what is happening here, but there are a few points that are worth explicitly highlighting.\n",
"Most of the heavy lifting here is done by the {func}`~jax.ffi.ffi_call` function, which tells JAX how to call the foreign function for a particular set of inputs.\n",
"It's important to note that the first argument to {func}`~jax.ffi.ffi_call` must be a string that matches the target name that we used when calling {func}`~jax.ffi.register_ffi_target` above.\n",
"Note that we explicitly cast `eps` to `np.float32` because our FFI library expects a C `float`, and we can't use `jax.numpy` here, because these parameters must be static arguments.\n",
"If you are familiar with the earlier \"custom call\" interface, you might be surprised that we're not passing the problem dimensions as parameters (batch size, etc.) to {func}`~jax.ffi.ffi_call`.\n",
"In this earlier API, the backend had no mechanism for receiving metadata about the input arrays, but since the FFI includes dimension information with the `Buffer` objects, we no longer need to compute this using Python when lowering.\n",
"One major perk of this change is {func}`~jax.ffi.ffi_call` can support some simple {func}`~jax.vmap` semantics out of the box, as discussed below.\n",
"{func}`~jax.ffi.ffi_call` supports some simple {func}`~jax.vmap` semantics out of the box using the `vmap_method` parameter.\n",
"The docs for {func}`~jax.pure_callback` provide more details about the `vmap_method` parameter, and the same behavior applies to {func}`~jax.ffi.ffi_call`.\n",
"Many FFI calls provide more efficient batching behavior and, in some simple cases, the `\"expand_dims\"` or `\"broadcast_all\"` methods can be used to expose a better implementation.\n",
"Another way of saying this is that the result of calling `ffi_call` on the batched inputs is assumed to be equal to stacking the repeated application of `ffi_call` to each element in the batched input, roughly:\n",
"\n",
"```python\n",
"ffi_call(xs) == jnp.stack([ffi_call(x) for x in xs])\n",
"```\n",
"\n",
"For simplicity, we will use the `\"broadcast_all\"` method throughout this tutorial, which guarantees that all inputs will be broadcast to have the same batch dimensions, but it would also be possible to implement a foreign function to handle the `\"expand_dims\"` method.\n",
"We can inspect the [jaxpr](jax-internals-jaxpr) of the {func}`~jax.vmap` of `rms_norm` to confirm that it isn't being rewritten using {func}`~jax.lax.scan`:"
"If your foreign function provides an efficient batching rule that isn't supported by this simple `vmap_method` parameter, it might also be possible to define more flexible custom `vmap` rules using the experimental `custom_vmap` interface, but it's worth also opening an issue describing your use case on [the JAX issue tracker](https://github.com/jax-ml/jax/issues)."
"As far as JAX is concerned, the foreign function is a black box that can't be inspected to determine the appropriate behavior when differentiated.\n",
"More details about custom derivative rules can be found in the [custom derivatives tutorial](https://jax.readthedocs.io/en/latest/notebooks/Custom_derivative_rules_for_Python_code.html), but the most common pattern used for implementing differentiation for foreign functions is to define a {func}`~jax.custom_vjp` which itself calls a foreign function.\n",
"In this case, we actually define two new FFI calls:\n",
"\n",
"1. `rms_norm_fwd` returns two outputs: (a) the \"primal\" result, and (b) the \"residuals\" which are used in the backwards pass.\n",
"2. `rms_norm_bwd` takes the residuals and the output co-tangents, and returns the input co-tangents.\n",
"We won't get into the details of the RMS normalization backwards pass, but take a look at the [C++ source code](https://github.com/jax-ml/jax/blob/main/examples/ffi/src/jax_ffi_example/rms_norm.cc) to see how these functions are implemented on the back end.\n",
"The main point to emphasize here is that the \"residual\" computed has a different shape than the primal output; therefore, in the {func}`~jax.ffi.ffi_call` to `rms_norm_fwd`, the output type has two elements with different shapes.\n",
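In pure JAX, the wiring of these two pieces into {func}`~jax.custom_vjp` can be sketched as follows (the `rms_norm_fwd` and `rms_norm_bwd` functions below are pure-JAX placeholders for the FFI calls, and the choice of residuals is illustrative rather than the exact one used in the C++ code):

```python
import jax
import jax.numpy as jnp

EPS = 1e-5

@jax.custom_vjp
def rms_norm(x):
  scale = jnp.sqrt(jnp.mean(jnp.square(x), axis=-1, keepdims=True) + EPS)
  return x / scale

def rms_norm_fwd(x):
  # Stand-in for the rms_norm_fwd FFI call: return the primal output and the
  # residuals (note that the residual's shape differs from the primal's).
  scale = jnp.sqrt(jnp.mean(jnp.square(x), axis=-1, keepdims=True) + EPS)
  return x / scale, (x, scale)

def rms_norm_bwd(res, ct):
  # Stand-in for the rms_norm_bwd FFI call: map the output cotangents to
  # input cotangents using the saved residuals.
  x, scale = res
  y = x / scale
  return (ct / scale - y * jnp.mean(ct * y, axis=-1, keepdims=True) / scale,)

rms_norm.defvjp(rms_norm_fwd, rms_norm_bwd)
```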
"At this point, we can use our new `rms_norm` function transparently for many JAX applications, and it will transform appropriately under the standard JAX function transformations like {func}`~jax.vmap` and {func}`~jax.grad`.\n",
"One thing that this example doesn't support is forward-mode AD ({func}`jax.jvp`, for example) since {func}`~jax.custom_vjp` is restricted to reverse-mode.\n",
"JAX doesn't currently expose a public API for simultaneously customizing both forward-mode and reverse-mode AD, but such an API is on the roadmap, so please [open an issue](https://github.com/jax-ml/jax/issues) describing your use case if you hit this limitation in practice.\n",
"One other JAX feature that this example doesn't support is higher-order AD.\n",
"It would be possible to work around this by wrapping the `rms_norm_bwd` function above in a {func}`jax.custom_jvp` or {func}`jax.custom_vjp` decorator, but we won't go into the details of that advanced use case here.\n",
"\n",
"## FFI calls on a GPU\n",
"\n",
"So far, we have been interfacing only with foreign functions running on the CPU, but JAX's FFI also supports calls to GPU code.\n",
"Since this documentation page is automatically generated on a machine without access to a GPU, we can't execute any GPU-specific examples here, but we will go over the key points.\n",
"\n",
"When defining our FFI wrapper for CPU, the function signature that we used was:\n",
"To support running our `rms_norm` function on both GPU and CPU, we can combine our implementation above with the {func}`jax.lax.platform_dependent` function:"
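A sketch of this pattern, with pure-JAX placeholders standing in for the registered CPU and CUDA FFI wrappers:

```python
import jax
import jax.numpy as jnp

def rms_norm_ref(x, eps=1e-5):
  return x / jnp.sqrt(jnp.mean(jnp.square(x), axis=-1, keepdims=True) + eps)

# Placeholders: in the real program these would be the ffi_call wrappers
# targeting the handlers registered for the "cpu" and "CUDA" platforms.
rms_norm_cpu = rms_norm_ref
rms_norm_cuda = rms_norm_ref

def rms_norm(x):
  # The branch is selected at lowering time based on the target platform.
  return jax.lax.platform_dependent(x, cpu=rms_norm_cpu, cuda=rms_norm_cuda)
```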
"With this approach there will be no runtime overhead to using {func}`jax.lax.platform_dependent`, and the compiled program won't include any references to unavailable FFI targets."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Sharding\n",
"\n",
"Most large scale users of JAX use its APIs for distributed computation across multiple devices.\n",
"As discussed in {ref}`sharded-computation`, parallelism in JAX is controlled by sharding data across devices, and most JAX operations can be used within any of the supported parallel programming paradigms (from automatic to fully manual).\n",
"But, the story is a little bit more complicated for FFI calls.\n",
"Since the internals of an FFI call are opaque to both JAX and XLA, FFI calls won't typically show optimal (or even good) performance when the data are sharded.\n",
"\n",
"Before getting into the FFI details, let's consider the behavior of our pure-JAX reference implementation of RMS normalization (the `rms_norm_ref` function defined at the top of this document) with a sharded input.\n",
"As discussed above, our implementation treats all leading axes of the input as _batch_ dimensions, and the normalization is performed along the last axis.\n",
"This means that if the data are sharded along any batch dimensions, but replicated on the last dimension, no communication is required.\n",
"This can be seen by sharding our 2-dimensional test data from above along its first dimension and checking the compiled HLO for operations like `all-gather`, `all-reduce`, etc.:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from jax.sharding import PartitionSpec as P\n",
"\n",
"assert len(jax.devices()) == 4 # Set using the XLA_FLAGS environment variable\n",
"However, if the data are sharded along the last axis, communication (in this case an `all-reduce`) is required to compute the sum in the normalization:"
"This clearly (to us!) isn't the optimal partitioning of this function, but it's the best that JAX/XLA can do with the information given.\n",
"\n",
"To generate better partitioning logic, we can use {func}`~jax.experimental.shard_map.shard_map` or {func}`~jax.experimental.custom_partitioning.custom_partitioning`, and we discuss both options here.\n",
"That being said, it's not straightforward to generate _optimal_ partitioning for all inputs, because sometimes this would require algorithmic changes.\n",
"Specifically, let's add support for \"batch partitioning\", which handles the case where the data are sharded on batch dimensions, while sharding on the last dimension will always require re-sharding.\n",
"\n",
"### Using `shard_map`\n",
"\n",
"If you are using manual sharding control via {func}`~jax.experimental.shard_map.shard_map`, any FFI calls in your program should already partition appropriately:"
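For example, here is a sketch of this pattern, using the pure-JAX reference implementation as a stand-in for the FFI call and a one-axis mesh over all available devices:

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, PartitionSpec as P
from jax.experimental.shard_map import shard_map

def rms_norm_ref(x, eps=1e-5):
  return x / jnp.sqrt(jnp.mean(jnp.square(x), axis=-1, keepdims=True) + eps)

# One mesh axis spanning all available devices (4 CPUs in this tutorial's setup).
mesh = Mesh(np.array(jax.devices()), ("x",))

# Each program instance receives only its shard of rows. Because only the
# batch dimension is sharded, each shard is normalized locally with no
# communication.
rms_norm_sharded = shard_map(
    rms_norm_ref, mesh=mesh, in_specs=P("x", None), out_specs=P("x", None))

x = jnp.linspace(-1, 1, 32, dtype=jnp.float32).reshape(8, 4)
y = rms_norm_sharded(x)
```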
"As you can see in this program, if the input and output shardings match the `shard_map` specs, no communication is required and the FFI call is executed on the appropriately sharded subset of the data.\n",
"\n",
"You can also use inputs and outputs with shardings that don't match the `shard_map` specs, but (unrelated to the FFI) this will require re-sharding, as seen by the `all-to-all` operations in the compiled HLO:"
"If you can't use {func}`~jax.experimental.shard_map.shard_map`, an alternative approach is to use {func}`~jax.experimental.custom_partitioning.custom_partitioning`, which supports automatic parallelization via {func}`jax.jit`.\n",
"{func}`~jax.experimental.custom_partitioning.custom_partitioning` works by adding Python callbacks into the XLA compiler's partitioning pass, which allows very flexible logic, but also comes with some rough edges.\n",
"We won't go into too much detail on the caveats here, but the main issues that you should be aware of are:\n",
"\n",
"1. `custom_partitioning` can cause unexpected cache misses when used with JAX's [persistent compilation cache](https://jax.readthedocs.io/en/latest/persistent_compilation_cache.html). This can be mitigated using the `jax_remove_custom_partitioning_ptr_from_cache_key` configuration flag, but that isn't always appropriate either.\n",
"2. Debugging `custom_partitioning` logic can be tedious because Python errors don't always get propagated, instead causing your Python process to exit. That being said, any exceptions will show up in the process logs, so you should be able to track them down there.\n",
"\n",
"All that being said, here's how we can wrap our FFI implementation of `rms_norm` using {func}`~jax.experimental.custom_partitioning.custom_partitioning`:"
"As you can see from the compiled program above, this `custom_partitioning` logic produces exactly the same program as the `shard_map` version above when the input is sharded on the batch dimension.\n",
"However, it's worth noting that the behavior is _different_ when the input is sharded along the data dimension.\n",
"When used under `shard_map`, the data are resharded on the batch dimension, whereas with `custom_partitioning` the data are gathered onto each device."
"To also support automatic parallelization of the backwards pass, we would also need to write (similar) {func}`~jax.experimental.custom_partitioning.custom_partitioning` rules for `rms_norm_fwd` and `rms_norm_bwd`, but we leave those as an exercise for the reader."
"This tutorial covers most of the basic steps that are required to get up and running with JAX's FFI, but advanced use cases may require more features.\n",
"We will leave these topics to future tutorials, but here are some possibly useful references:\n",
"* **Supporting multiple dtypes**: In this tutorial's example, we restricted our implementation to support only `float32` inputs and outputs, but many use cases require supporting multiple different input types. One option to handle this is to register different FFI targets for all supported input types and then use Python to select the appropriate target for {func}`jax.ffi.ffi_call` depending on the input types. But, this approach could quickly get unwieldy depending on the combinatorics of the supported cases. So it is also possible to define the C++ handler to accept `ffi::AnyBuffer` instead of `ffi::Buffer<Dtype>`. Then, the input buffer will include an `element_type()` method which can be used to define the appropriate dtype dispatching logic in the backend.\n",
"* **Stateful foreign functions**: It is also possible to use the FFI to wrap functions with associated state. There is a [low-level example included in the XLA test suite](https://github.com/openxla/xla/blob/737a7da3c5405583dc95773ac0bb11b1349fc9ea/xla/service/gpu/custom_call_test.cc#L794-L845), and a future tutorial will include more details."