1
0
mirror of https://github.com/ROCm/jax.git synced 2025-04-27 10:36:07 +00:00

5 Commits

Author SHA1 Message Date
Skye Wanderman-Milne
120125f3dd Make pytest-xdist work on TPU and update Cloud TPU CI.
This change also marks multiaccelerator test files in a way pytest can
understand (if pytest is installed).

By running single-device tests on a single TPU chip, running the test
suite goes from 1hr 45m to 35m (both timings are running slow tests).

I tried using bazel at first, which already supported parallel
execution across TPU cores, but somehow it still takes 2h 20m! I'm not
sure why it's so slow. It appears that bazel creates many new test
processes over time, vs. pytest reuses the number of processes
initially specified, and starting and stopping the TPU runtime takes a
few seconds so that may be adding up. It also appears that
single-process bazel is slower than single-process pytest, which I
haven't looked into yet.
2022-11-18 22:05:13 +00:00
Skye Wanderman-Milne
b4564a2a57 TPU CI: don't notify when testing the workflow from a branch 2022-11-16 21:27:24 +00:00
Skye Wanderman-Milne
8bed9bac81 Update Github Actions workflows using Ratchet
https://opensource.google/documentation/reference/github/services#actions
mandates using a specific commit for non-Google actions in workflow
files. I used https://github.com/sethvargo/ratchet to update all our
workflow files. Example command: `ratchet pin cloud-tpu-ci-nightly.yml`

Ratchet appears to also auto-format the YAML files. It makes the diff
confusing but I'm ok with the final result.
2022-11-16 18:45:59 +00:00
Skye Wanderman-Milne
5da7976093 Send message to internal chat room on Cloud TPU CI failure 2022-11-14 19:44:45 +00:00
Skye Wanderman-Milne
8c22e34e22 Add Github Actions workflow that runs on a self-hosted TPU VM runner.
This also includes some utilites for setting up the self-hosted
runner. Googlers, see go/jax-self-hosted-runners for more setup info.

The workflow is pretty basic currently. We can and should add more
functionality later, such as email notifications. I kept it simple
here for easier reviewing.

Testing:
- Sample workflow run in my fork: https://github.com/skye/jax/actions/runs/3333614180
- Sample PR attempt: (will add soon but I did verify validate_job.sh blocks pull_request workflows)
2022-11-03 21:15:57 +00:00