140 Commits

Author SHA1 Message Date
Skye Wanderman-Milne
93cd07efb8 Add PJRT C API to Cloud TPU test matrix
Also shortens the job names so the full name is visible from the
github UI (this was driving me crazy), and marks a new test that can't
be run on the PJRT C API yet.

Example run: https://github.com/google/jax/actions/runs/4019968334
2023-01-27 01:06:21 +00:00
Skye Wanderman-Milne
582578220d [TPU CI] Send chat notification on cancellation as well as failure.
In particular, this makes it notify on timeouts (which usually
indicates a test hang, but should be addressed in any case).
2023-01-25 22:12:30 +00:00
Jake VanderPlas
e24a0e5bf2 CI: adjust permissions for upstream-nightly build 2023-01-20 14:01:48 -08:00
Jake VanderPlas
25c9621295 CI: update deprecated uses of set-output 2023-01-20 09:51:05 -08:00
Leopold Cambier
056702c1cb Multinodes CICD on GPUs using on-demand cluster and e2e tests using T5X 2023-01-17 16:29:30 -08:00
Jake VanderPlas
aa34ea7b1c JAX github actions: Update Python versions in test matrix for better coverage
PiperOrigin-RevId: 496501739
2022-12-19 15:10:33 -08:00
Jake VanderPlas
ad40b0842d CI: use Python 3.11 for upstream-nightly action 2022-12-19 10:47:09 -08:00
Yash Katariya
e6e1836711 Copybara import of the project:
--
20080434922caf49181c456785ab78b90a4907e3 by Anselm Levskaya <levskaya@google.com>:

Revert to old test runners to investigate runner queue failure.

PiperOrigin-RevId: 496099919
2022-12-17 09:28:51 -08:00
Jake VanderPlas
9f811ba54d Address drastic slowdown in mypy runtime 2022-12-16 14:48:26 -08:00
Anselm Levskaya
2008043492 Revert to old test runners to investigate runner queue failure. 2022-12-15 18:51:34 -08:00
Skye Wanderman-Milne
8d4b50e397 [TPU CI] Run build matrix on v3-8 as well as v4-8
We're seeing failures on v3-8 that don't appear on the current v4-8
testing. v3-8 also exposes 8 devices (vs. v4-8 exposes 4), and some
tests need 8 devices to run.

I just added a v3-8 runner VM.

Also adds a missing pip install command (I only caught this with a
fresh runner since it only needs to be installed once).
2022-12-09 22:32:09 +00:00
jax authors
23b808f7d0 Merge pull request #13446 from google:maxfail
PiperOrigin-RevId: 493414635
2022-12-06 14:34:01 -08:00
Jake VanderPlas
cb62a31653 Drop support for Python 3.7 2022-11-29 15:01:47 -08:00
Jake VanderPlas
1647c5960e CI: bump timeout for pre-commit 2022-11-28 13:26:44 -08:00
Anselm Levskaya
074e4ec813 Enable faster test-runners for PR/push CI runs. 2022-11-23 14:07:08 -08:00
Skye Wanderman-Milne
246614ed5c Add --maxfail=20 to Cloud TPU CI.
This prevents spamming the test output with 100s of failures when something fundamental is broken.

Also updates some `python3` commands to use `python` for consistency.
2022-11-23 00:47:54 +00:00
jax authors
dd902fde21 Merge pull request #13317 from google:xdist_tpu
PiperOrigin-RevId: 490366370
2022-11-22 16:40:00 -08:00
Roy Frostig
35634fcc2a exercise config.jax_threefry_partitionable in one of the CI runs 2022-11-21 15:30:58 -08:00
Skye Wanderman-Milne
120125f3dd Make pytest-xdist work on TPU and update Cloud TPU CI.
This change also marks multiaccelerator test files in a way pytest can
understand (if pytest is installed).

By running single-device tests on a single TPU chip, running the test
suite goes from 1hr 45m to 35m (both timings are running slow tests).

I tried using bazel at first, which already supported parallel
execution across TPU cores, but somehow it still takes 2h 20m! I'm not
sure why it's so slow. It appears that bazel creates many new test
processes over time, vs. pytest reuses the number of processes
initially specified, and starting and stopping the TPU runtime takes a
few seconds so that may be adding up. It also appears that
single-process bazel is slower than single-process pytest, which I
haven't looked into yet.
2022-11-18 22:05:13 +00:00
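
A minimal sketch of how each pytest-xdist worker could be pinned to its own TPU chip, as the commit above describes for single-device tests. pytest-xdist exposes the worker id through the `PYTEST_XDIST_WORKER` environment variable (values like `gw0`, `gw1`). The `TPU_VISIBLE_CHIPS` variable and the helper names `tpu_env_for_worker` and `apply_worker_env` are assumptions for illustration, not necessarily what the commit itself does.

```python
import os


def tpu_env_for_worker(worker_name):
    """Map an xdist worker id such as "gw3" to per-process env settings.

    Assumption: the TPU runtime honours TPU_VISIBLE_CHIPS to restrict a
    process to one chip. Returns an empty dict when not under xdist.
    """
    if not worker_name.startswith("gw"):
        return {}  # not an xdist worker; leave the environment alone
    chip_index = int(worker_name[len("gw"):])
    return {"TPU_VISIBLE_CHIPS": str(chip_index)}


def apply_worker_env(environ=os.environ):
    """Apply the per-worker settings without overriding values already set."""
    worker = environ.get("PYTEST_XDIST_WORKER", "")
    for key, value in tpu_env_for_worker(worker).items():
        environ.setdefault(key, value)
```

In a real setup this would have to run before the TPU runtime is initialised in the worker process, for example from a `conftest.py` hook, with the number of workers (`pytest -n <N>`) matching the number of chips on the VM.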
Skye Wanderman-Milne
0a886c34fa Include which jaxlib/libtpu version failed (latest or nightly) in TPU CI chat notification 2022-11-16 21:38:36 +00:00
Skye Wanderman-Milne
b4564a2a57 TPU CI: don't notify when testing the workflow from a branch 2022-11-16 21:27:24 +00:00
Skye Wanderman-Milne
8bed9bac81 Update Github Actions workflows using Ratchet
https://opensource.google/documentation/reference/github/services#actions
mandates using a specific commit for non-Google actions in workflow
files. I used https://github.com/sethvargo/ratchet to update all our
workflow files. Example command: `ratchet pin cloud-tpu-ci-nightly.yml`

Ratchet appears to also auto-format the YAML files. It makes the diff
confusing but I'm ok with the final result.
2022-11-16 18:45:59 +00:00
Yash Katariya
a419e1917a Use jax.Array by default for doctests
PiperOrigin-RevId: 488719467
2022-11-15 11:52:22 -08:00
Skye Wanderman-Milne
5da7976093 Send message to internal chat room on Cloud TPU CI failure 2022-11-14 19:44:45 +00:00
Skye Wanderman-Milne
52775c42e4 Add .github/workflows/self_hosted_runner_utils/README.md
This was meant to be part of https://github.com/google/jax/pull/13000, oops
2022-11-04 17:12:54 +00:00
jax authors
3db2a59f76 Merge pull request #13097 from jakevdp:actions-permissions
PiperOrigin-RevId: 486160888
2022-11-04 09:36:32 -07:00
Skye Wanderman-Milne
8c22e34e22 Add Github Actions workflow that runs on a self-hosted TPU VM runner.
This also includes some utilities for setting up the self-hosted
runner. Googlers, see go/jax-self-hosted-runners for more setup info.

The workflow is pretty basic currently. We can and should add more
functionality later, such as email notifications. I kept it simple
here for easier reviewing.

Testing:
- Sample workflow run in my fork: https://github.com/skye/jax/actions/runs/3333614180
- Sample PR attempt: (will add soon but I did verify validate_job.sh blocks pull_request workflows)
2022-11-03 21:15:57 +00:00
Jake VanderPlas
8057e2805b CI: set explicit permissions for ci-build action 2022-11-03 13:21:58 -07:00
dependabot[bot]
cef5f20dbb Bump styfle/cancel-workflow-action from 0.10.1 to 0.11.0
Bumps [styfle/cancel-workflow-action](https://github.com/styfle/cancel-workflow-action) from 0.10.1 to 0.11.0.
- [Release notes](https://github.com/styfle/cancel-workflow-action/releases)
- [Commits](https://github.com/styfle/cancel-workflow-action/compare/0.10.1...0.11.0)

---
updated-dependencies:
- dependency-name: styfle/cancel-workflow-action
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-10-17 17:18:44 +00:00
jax authors
6c3c51e8f3 Merge pull request #12591 from sudhakarsingh27:add_pytest_run_for_jaxlib_release
PiperOrigin-RevId: 478608240
2022-10-03 14:34:32 -07:00
dependabot[bot]
8f71b03662 Bump styfle/cancel-workflow-action from 0.10.0 to 0.10.1
Bumps [styfle/cancel-workflow-action](https://github.com/styfle/cancel-workflow-action) from 0.10.0 to 0.10.1.
- [Release notes](https://github.com/styfle/cancel-workflow-action/releases)
- [Commits](https://github.com/styfle/cancel-workflow-action/compare/0.10.0...0.10.1)

---
updated-dependencies:
- dependency-name: styfle/cancel-workflow-action
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-10-03 17:11:55 +00:00
Sudhakar
4fbc9a10d1 Add multihost GPU CI run with last public jaxlib release 2022-09-29 17:06:56 -07:00
Roy Frostig
2a7b3197e0 add nvidia-smi question to bug template 2022-09-26 11:06:29 -07:00
Peter Hawkins
ba557d5e1b Change JAX's copyright attribution from "Google LLC" to "The JAX Authors.".
See https://opensource.google/documentation/reference/releasing/contributions#copyright for more details.

PiperOrigin-RevId: 476167538
2022-09-22 12:27:19 -07:00
Yash Katariya
160e14308c [Rollback] Add a github presubmit build which runs with jax.Array flag enabled for OSS coverage.
PiperOrigin-RevId: 473161716
2022-09-08 22:09:44 -07:00
Yash Katariya
49672cd2bc Add a github presubmit build which runs with jax.Array flag enabled for OSS coverage.
PiperOrigin-RevId: 473100614
2022-09-08 15:32:31 -07:00
Sudhakar
5f1858f533 Add pytest marker inside the test only if pytest is present in the env 2022-09-06 11:45:59 -07:00
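
One way to attach a pytest marker only when pytest is available, sketched under assumptions: the helper name `optional_mark` and the example function are hypothetical, and the marker name `multiaccelerator` is borrowed from a later commit in this log. The test file still imports and runs under a plain `unittest`-style runner when pytest is not installed.

```python
import importlib.util


def optional_mark(name):
    """Return pytest.mark.<name> if pytest is importable, else a no-op decorator."""
    if importlib.util.find_spec("pytest") is not None:
        import pytest
        return getattr(pytest.mark, name)
    # pytest is absent: leave the decorated object unchanged.
    return lambda obj: obj


@optional_mark("multiaccelerator")
def check_needs_many_devices():
    # Placeholder body; a real test would exercise multiple devices here.
    return "ran"
```

With pytest installed, such tests can then be selected or excluded with `-m multiaccelerator` or `-m "not multiaccelerator"`, provided the marker is registered in the pytest configuration.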
Roy Frostig
fe2de26b2c test the RNG key upgrade in one of our github CI runs
Choosing the "numpy-dispatch" configuration since it tends to be our
frequent pick for future-proofing.
2022-08-30 21:43:44 -07:00
Sudhakar
a571db18db Enable one gpu per process in multinode GPU CI 2022-08-29 09:00:19 -07:00
Sudhakar
4b1a2eaaec combine gpu tests 2022-08-25 15:27:07 -07:00
Roy Frostig
655ecb4aaf fix bug issue template yaml parsing error 2022-08-23 19:56:41 -07:00
Roy Frostig
825cbace65 remove 'BUG' title prefix from bug report template
We auto-label with "bug" to this end.
2022-08-22 15:09:38 -07:00
Sudhakar
c2e521807c Add support to test gpu jaxlib nightly in CI instead of prebuilt jax/jaxlib 2022-08-19 11:08:11 -07:00
Jake VanderPlas
f00dfb434e issue template: avoid checkboxes because they're interpreted as tasks 2022-08-11 14:05:23 -07:00
Jake VanderPlas
7ec6acd981 nightly multiprocess test: create issue on failure 2022-08-09 19:12:32 -07:00
Sudhakar Singh
efb37ff784 Bump EnricoMi/publish-unit-test-result-action from 1 to 2 2022-08-08 10:55:53 -07:00
jax authors
0a8ca1982c Merge pull request #11721 from sudhakarsingh27:main
PiperOrigin-RevId: 465381834
2022-08-04 12:52:16 -07:00
Jake VanderPlas
03e5f9a24e bug-report.yml: fix markdown formatting 2022-08-04 10:48:30 -07:00
Jake VanderPlas
f0c5747d9a bug-report.yaml: fix yaml syntax error 2022-08-04 10:30:25 -07:00
Sudhakar Singh
1565fd2525 Squashed commit of the following:
commit 0b4c3f05a49037be93eb0612113e193f3a8d61c5
Author: Sudhakar Singh <sudhakars@nvidia.com>
Date:   Thu Aug 4 09:53:04 2022 -0700

    change the path

commit 2c629739c1cfa45d848a2cf7109d329c1262e6ac
Author: Sudhakar Singh <sudhakars@nvidia.com>
Date:   Wed Aug 3 16:37:46 2022 -0700

    rename file to reflect current objective

commit ef46bcae6cd66d6fe7b04bd6d8aeed42c4f3ddfa
Author: Sudhakar Singh <sudhakars@nvidia.com>
Date:   Wed Aug 3 15:56:32 2022 -0700

    correct formatting

commit e5da60ad855592d5f150612f65ad679872160132
Author: Sudhakar Singh <sudhakars@nvidia.com>
Date:   Wed Aug 3 15:26:32 2022 -0700

    Add multi-node multi-GPU JAX tests

    This adds multi-node multi-GPU test for `jax.distributed.initialize`.
    Presently, this is expected to run on a nightly basis. Under the hood,
    SLURM is used to launch the `pytest <test_name>` commands on multiple
    nodes.

    Resolves: #11648
2022-08-04 10:13:50 -07:00