rocm_jax/conftest.py
Skye Wanderman-Milne 120125f3dd Make pytest-xdist work on TPU and update Cloud TPU CI.
This change also marks multiaccelerator test files in a way pytest can
understand (if pytest is installed).

By running single-device tests on a single TPU chip, running the test
suite goes from 1hr 45m to 35m (both timings are running slow tests).

I tried using bazel at first, which already supported parallel
execution across TPU cores, but somehow it still takes 2h 20m! I'm not
sure why it's so slow. It appears that bazel creates many new test
processes over time, vs. pytest reuses the number of processes
initially specified, and starting and stopping the TPU runtime takes a
few seconds so that may be adding up. It also appears that
single-process bazel is slower than single-process pytest, which I
haven't looked into yet.
2022-11-18 22:05:13 +00:00

60 lines
2.5 KiB
Python

# Copyright 2021 The JAX Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""pytest configuration"""
import os
import pytest
@pytest.fixture(autouse=True)
def add_imports(doctest_namespace):
import jax
import numpy
doctest_namespace["jax"] = jax
doctest_namespace["lax"] = jax.lax
doctest_namespace["jnp"] = jax.numpy
doctest_namespace["np"] = numpy
# A pytest hook that runs immediately before test collection (i.e. when pytest
# loads all the test cases to run). When running parallel tests via xdist on
# Cloud TPU, we use this hook to set the env vars needed to run multiple test
# processes across different TPU chips.
#
# It's important that the hook runs before test collection, since jax tests end
# up initializing the TPU runtime on import (e.g. to query supported test
# types). It's also important that the hook gets called by each xdist worker
# process. Luckily each worker does its own test collection.
#
# The pytest_collection hook can be used to overwrite the collection logic, but
# we only use it to set the env vars and fall back to the default collection
# logic by always returning None. See
# https://docs.pytest.org/en/latest/how-to/writing_hook_functions.html#firstresult-stop-at-first-non-none-result
# for details.
#
# The env var JAX_ENABLE_TPU_XDIST must be set for this hook to have an
# effect. We do this to minimize any effect on non-TPU tests, and as a pointer
# in test code to this "magic" hook. TPU tests should not specify more xdist
# workers than the number of TPU chips.
def pytest_collection() -> None:
if not os.environ.get("JAX_ENABLE_TPU_XDIST", None):
return
# When running as an xdist worker, will be something like "gw0"
xdist_worker_name = os.environ.get("PYTEST_XDIST_WORKER", "")
if not xdist_worker_name.startswith("gw"):
return
xdist_worker_number = int(xdist_worker_name[len("gw"):])
os.environ.setdefault("TPU_VISIBLE_CHIPS", str(xdist_worker_number))
os.environ.setdefault("ALLOW_MULTIPLE_LIBTPU_LOAD", "true")