Also shortens the job names so the full name is visible in the
GitHub UI (this was driving me crazy), and marks a new test that can't
be run on the PJRT C API yet.
Example run: https://github.com/google/jax/actions/runs/4019968334
--
20080434922caf49181c456785ab78b90a4907e3 by Anselm Levskaya <levskaya@google.com>:
Revert to old test runners to investigate runner queue failure.
PiperOrigin-RevId: 496099919
We're seeing failures on v3-8 that don't appear in the current v4-8
testing. v3-8 also exposes 8 devices (v4-8 exposes only 4), and some
tests need 8 devices to run.
I just added a v3-8 runner VM.
Also adds a missing pip install command (I only caught this with a
fresh runner since it only needs to be installed once).
This prevents spamming the test output with hundreds of failures when something fundamental is broken.
Also updates some `python3` commands to use `python` for consistency.
This change also marks multiaccelerator test files in a way pytest can
understand (if pytest is installed).
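One way such a marker could look, as a minimal sketch: a module-level
`pytestmark` that is only set when pytest is importable, so the test
file still loads without pytest. The marker name `multiaccelerator` is
an assumption for illustration and may not match the name this change
uses.

```python
# Sketch: tag every test in this file as multi-accelerator, but only if
# pytest is available. Without pytest the file still imports cleanly.
try:
    import pytest
except ImportError:
    pytest = None

# `multiaccelerator` is an assumed marker name, used here for illustration.
# pytest applies a module-level `pytestmark` to every test in the module.
pytestmark = pytest.mark.multiaccelerator if pytest is not None else None
```

With a marker like this, `pytest -m multiaccelerator` would select the
marked files and `pytest -m "not multiaccelerator"` would exclude them.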
By running single-device tests on a single TPU chip, the test suite
runtime drops from 1h 45m to 35m (both timings include slow tests).
I tried using bazel at first, which already supported parallel
execution across TPU cores, but somehow it still takes 2h 20m! I'm not
sure why it's so slow. It appears that bazel creates many new test
processes over time, whereas pytest reuses the processes it was
initially started with. Starting and stopping the TPU runtime takes a
few seconds each time, so that may be adding up. It also appears that
single-process bazel is slower than single-process pytest, which I
haven't looked into yet.
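The one-chip-per-test-process idea described above can be sketched
roughly as follows. This is not necessarily how this change does it.
The sketch assumes the parallel workers come from pytest-xdist (which
sets `PYTEST_XDIST_WORKER` to names like `gw0`) and that the TPU
runtime honors a `TPU_VISIBLE_CHIPS` environment variable; both are
assumptions, and the helper names are hypothetical.

```python
import os
from typing import MutableMapping, Mapping, Optional


def chip_for_worker(env: Mapping[str, str]) -> Optional[int]:
    """Return the chip index for this pytest-xdist worker, or None."""
    worker = env.get("PYTEST_XDIST_WORKER", "")  # e.g. "gw0", "gw1", ...
    suffix = worker[2:]
    if not worker.startswith("gw") or not suffix.isdigit():
        return None  # not running under pytest-xdist
    return int(suffix)


def pin_worker_to_chip(env: MutableMapping[str, str] = os.environ) -> None:
    """Restrict this process to one chip; call before the TPU runtime starts."""
    chip = chip_for_worker(env)
    if chip is None:
        return
    # Assumption: the TPU runtime reads TPU_VISIBLE_CHIPS at startup.
    # setdefault keeps any value the user has already set.
    env.setdefault("TPU_VISIBLE_CHIPS", str(chip))
```

Because each worker process is reused across many tests, the runtime
startup cost in a setup like this is paid once per worker rather than
once per test.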
This also includes some utilities for setting up the self-hosted
runner. Googlers, see go/jax-self-hosted-runners for more setup info.
The workflow is pretty basic currently. We can and should add more
functionality later, such as email notifications. I kept it simple
here for easier reviewing.
Testing:
- Sample workflow run in my fork: https://github.com/skye/jax/actions/runs/3333614180
- Sample PR attempt: (will add soon but I did verify validate_job.sh blocks pull_request workflows)
commit 0b4c3f05a49037be93eb0612113e193f3a8d61c5
Author: Sudhakar Singh <sudhakars@nvidia.com>
Date: Thu Aug 4 09:53:04 2022 -0700
change the path
commit 2c629739c1cfa45d848a2cf7109d329c1262e6ac
Author: Sudhakar Singh <sudhakars@nvidia.com>
Date: Wed Aug 3 16:37:46 2022 -0700
rename file to reflect current objective
commit ef46bcae6cd66d6fe7b04bd6d8aeed42c4f3ddfa
Author: Sudhakar Singh <sudhakars@nvidia.com>
Date: Wed Aug 3 15:56:32 2022 -0700
correct formatting
commit e5da60ad855592d5f150612f65ad679872160132
Author: Sudhakar Singh <sudhakars@nvidia.com>
Date: Wed Aug 3 15:26:32 2022 -0700
Add multi-node multi-GPU JAX tests
This adds a multi-node multi-GPU test for `jax.distributed.initialize`.
Presently, this is expected to run on a nightly basis. Under the hood,
SLURM is used to launch the `pytest <test_name>` commands on multiple
nodes.
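For context, a minimal sketch of how a process launched by SLURM might
derive its arguments for `jax.distributed.initialize`. This is not
taken from the tests in this change. `SLURM_NTASKS` and `SLURM_PROCID`
are set by SLURM for each task; `COORDINATOR_ADDRESS` is a
hypothetical variable the launch script would have to export, and the
helper names are made up for illustration.

```python
import os
from typing import Mapping


def distributed_args_from_slurm(env: Mapping[str, str]) -> dict:
    """Build keyword arguments for jax.distributed.initialize from the environment."""
    return {
        # Hypothetical variable: "host:port" of the process with id 0.
        "coordinator_address": env["COORDINATOR_ADDRESS"],
        # Total number of tasks in the SLURM job step.
        "num_processes": int(env["SLURM_NTASKS"]),
        # Rank of this task within the job step.
        "process_id": int(env["SLURM_PROCID"]),
    }


def init_distributed() -> None:
    """Initialize JAX's distributed runtime; call once per process, before other JAX use."""
    import jax  # imported lazily so the helper above works without JAX installed

    jax.distributed.initialize(**distributed_args_from_slurm(os.environ))
```

Every task calls `init_distributed()` with the same coordinator address
and process count but its own process id.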
Resolves: #11648