rocm_jax

mirror of https://github.com/ROCm/jax.git synced 2025-04-19 05:16:06 +00:00

Author	SHA1	Message	Date
Skye Wanderman-Milne	93cd07efb8	Add PJRT C API to Cloud TPU test matrix Also shortens the job names so the full name is visible from the github UI (this was driving me crazy), and marks a new test that can't be run on the PJRT C API yet. Example run: https://github.com/google/jax/actions/runs/4019968334	2023-01-27 01:06:21 +00:00
Skye Wanderman-Milne	582578220d	[TPU CI] Send chat notification on cancellation as well as failure. In particular, this makes it notify on timeouts (which usually indicates a test hang, but should be addressed in any case).	2023-01-25 22:12:30 +00:00
Jake VanderPlas	e24a0e5bf2	CI: adjust permissions for upstream-nightly build	2023-01-20 14:01:48 -08:00
Jake VanderPlas	25c9621295	CI: update deprecated uses of set-output	2023-01-20 09:51:05 -08:00
Leopold Cambier	056702c1cb	Multinodes CICD on GPUs using on-demand cluster and e2e tests using T5X	2023-01-17 16:29:30 -08:00
Jake VanderPlas	aa34ea7b1c	JAX github actions: Update Python versions in test matrix for better coverage PiperOrigin-RevId: 496501739	2022-12-19 15:10:33 -08:00
Jake VanderPlas	ad40b0842d	CI: use Python 3.11 for upstream-nightly action	2022-12-19 10:47:09 -08:00
Yash Katariya	e6e1836711	Copybara import of the project: -- 20080434922caf49181c456785ab78b90a4907e3 by Anselm Levskaya <levskaya@google.com>: Revert to old test runners to investigate runner queue failure. PiperOrigin-RevId: 496099919	2022-12-17 09:28:51 -08:00
Jake VanderPlas	9f811ba54d	Address drastic slowdown in mypy runtime	2022-12-16 14:48:26 -08:00
Anselm Levskaya	2008043492	Revert to old test runners to investigate runner queue failure.	2022-12-15 18:51:34 -08:00
Skye Wanderman-Milne	8d4b50e397	[TPU CI] Run build matrix on v3-8 as well as v4-8 We're seeing failures on v3-8 that don't appear on the current v4-8 testing. v3-8 also exposes 8 devices (vs. v4-8 exposes 4), and some tests need 8 devices to run. I just added a v3-8 runner VM. Also adds a missing pip install command (I only caught this with a fresh runner since it only needs to be installed once).	2022-12-09 22:32:09 +00:00
jax authors	23b808f7d0	Merge pull request #13446 from google:maxfail PiperOrigin-RevId: 493414635	2022-12-06 14:34:01 -08:00
Jake VanderPlas	cb62a31653	Drop support for Python 3.7	2022-11-29 15:01:47 -08:00
Jake VanderPlas	1647c5960e	CI: bump timeout for pre-commit	2022-11-28 13:26:44 -08:00
Anselm Levskaya	074e4ec813	Enable faster test-runners for PR/push CI runs.	2022-11-23 14:07:08 -08:00
Skye Wanderman-Milne	246614ed5c	Add --maxfail=20 to Cloud TPU CI. This prevents spamming the test output with 100s of failures when something fundamental is broken. Also updates some `python3` commands to use `python` for consistency.	2022-11-23 00:47:54 +00:00
jax authors	dd902fde21	Merge pull request #13317 from google:xdist_tpu PiperOrigin-RevId: 490366370	2022-11-22 16:40:00 -08:00
Roy Frostig	35634fcc2a	exercise `config.jax_threefry_partitionable` in one of the CI runs	2022-11-21 15:30:58 -08:00
Skye Wanderman-Milne	120125f3dd	Make pytest-xdist work on TPU and update Cloud TPU CI. This change also marks multiaccelerator test files in a way pytest can understand (if pytest is installed). By running single-device tests on a single TPU chip, running the test suite goes from 1hr 45m to 35m (both timings are running slow tests). I tried using bazel at first, which already supported parallel execution across TPU cores, but somehow it still takes 2h 20m! I'm not sure why it's so slow. It appears that bazel creates many new test processes over time, vs. pytest reuses the number of processes initially specified, and starting and stopping the TPU runtime takes a few seconds so that may be adding up. It also appears that single-process bazel is slower than single-process pytest, which I haven't looked into yet.	2022-11-18 22:05:13 +00:00
Skye Wanderman-Milne	0a886c34fa	Include which jaxlib/libtpu version failed (latest or nightly) in TPU CI chat notification	2022-11-16 21:38:36 +00:00
Skye Wanderman-Milne	b4564a2a57	TPU CI: don't notify when testing the workflow from a branch	2022-11-16 21:27:24 +00:00
Skye Wanderman-Milne	8bed9bac81	Update Github Actions workflows using Ratchet https://opensource.google/documentation/reference/github/services#actions mandates using a specific commit for non-Google actions in workflow files. I used https://github.com/sethvargo/ratchet to update all our workflow files. Example command: `ratchet pin cloud-tpu-ci-nightly.yml` Ratchet appears to also auto-format the YAML files. It makes the diff confusing but I'm ok with the final result.	2022-11-16 18:45:59 +00:00
Yash Katariya	a419e1917a	Use jax.Array by default for doctests PiperOrigin-RevId: 488719467	2022-11-15 11:52:22 -08:00
Skye Wanderman-Milne	5da7976093	Send message to internal chat room on Cloud TPU CI failure	2022-11-14 19:44:45 +00:00
Skye Wanderman-Milne	52775c42e4	Add .github/workflows/self_hosted_runner_utils/README.md This was meant to be part of https://github.com/google/jax/pull/13000, oops	2022-11-04 17:12:54 +00:00
jax authors	3db2a59f76	Merge pull request #13097 from jakevdp:actions-permissions PiperOrigin-RevId: 486160888	2022-11-04 09:36:32 -07:00
Skye Wanderman-Milne	8c22e34e22	Add Github Actions workflow that runs on a self-hosted TPU VM runner. This also includes some utilities for setting up the self-hosted runner. Googlers, see go/jax-self-hosted-runners for more setup info. The workflow is pretty basic currently. We can and should add more functionality later, such as email notifications. I kept it simple here for easier reviewing. Testing: - Sample workflow run in my fork: https://github.com/skye/jax/actions/runs/3333614180 - Sample PR attempt: (will add soon but I did verify validate_job.sh blocks pull_request workflows)	2022-11-03 21:15:57 +00:00
Jake VanderPlas	8057e2805b	CI: set explicit permissions for ci-build action	2022-11-03 13:21:58 -07:00
dependabot[bot]	cef5f20dbb	Bump styfle/cancel-workflow-action from 0.10.1 to 0.11.0 Bumps [styfle/cancel-workflow-action](https://github.com/styfle/cancel-workflow-action) from 0.10.1 to 0.11.0. - [Release notes](https://github.com/styfle/cancel-workflow-action/releases) - [Commits](https://github.com/styfle/cancel-workflow-action/compare/0.10.1...0.11.0) --- updated-dependencies: - dependency-name: styfle/cancel-workflow-action dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com>	2022-10-17 17:18:44 +00:00
jax authors	6c3c51e8f3	Merge pull request #12591 from sudhakarsingh27:add_pytest_run_for_jaxlib_release PiperOrigin-RevId: 478608240	2022-10-03 14:34:32 -07:00
dependabot[bot]	8f71b03662	Bump styfle/cancel-workflow-action from 0.10.0 to 0.10.1 Bumps [styfle/cancel-workflow-action](https://github.com/styfle/cancel-workflow-action) from 0.10.0 to 0.10.1. - [Release notes](https://github.com/styfle/cancel-workflow-action/releases) - [Commits](https://github.com/styfle/cancel-workflow-action/compare/0.10.0...0.10.1) --- updated-dependencies: - dependency-name: styfle/cancel-workflow-action dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com>	2022-10-03 17:11:55 +00:00
Sudhakar	4fbc9a10d1	Add multihost GPU CI run with last public jaxlib release	2022-09-29 17:06:56 -07:00
Roy Frostig	2a7b3197e0	add `nvidia-smi` question to bug template	2022-09-26 11:06:29 -07:00
Peter Hawkins	ba557d5e1b	Change JAX's copyright attribution from "Google LLC" to "The JAX Authors.". See https://opensource.google/documentation/reference/releasing/contributions#copyright for more details. PiperOrigin-RevId: 476167538	2022-09-22 12:27:19 -07:00
Yash Katariya	160e14308c	[Rollback] Add a github presubmit build which runs with jax.Array flag enabled for OSS coverage. PiperOrigin-RevId: 473161716	2022-09-08 22:09:44 -07:00
Yash Katariya	49672cd2bc	Add a github presubmit build which runs with jax.Array flag enabled for OSS coverage. PiperOrigin-RevId: 473100614	2022-09-08 15:32:31 -07:00
Sudhakar	5f1858f533	Add pytest marker inside the test only if pytest is present in the env	2022-09-06 11:45:59 -07:00
Roy Frostig	fe2de26b2c	test the RNG key upgrade in one of our github CI runs Choosing the "numpy-dispatch" configuration since it tends to be our frequent pick for future-proofing.	2022-08-30 21:43:44 -07:00
Sudhakar	a571db18db	Enable one gpu per process in multinode GPU CI	2022-08-29 09:00:19 -07:00
Sudhakar	4b1a2eaaec	combine gpu tests	2022-08-25 15:27:07 -07:00
Roy Frostig	655ecb4aaf	fix bug issue template yaml parsing error	2022-08-23 19:56:41 -07:00
Roy Frostig	825cbace65	remove 'BUG' title prefix from bug report template We auto-label with "bug" to this end.	2022-08-22 15:09:38 -07:00
Sudhakar	c2e521807c	Add support to test gpu jaxlib nightly in CI instead of prebuilt jax/jaxlib	2022-08-19 11:08:11 -07:00
Jake VanderPlas	f00dfb434e	issue template: avoid checkboxes because they're interpreted as tasks	2022-08-11 14:05:23 -07:00
Jake VanderPlas	7ec6acd981	nightly multiprocess test: create issue on failure	2022-08-09 19:12:32 -07:00
Sudhakar Singh	efb37ff784	Bump EnricoMi/publish-unit-test-result-action from 1 to 2	2022-08-08 10:55:53 -07:00
jax authors	0a8ca1982c	Merge pull request #11721 from sudhakarsingh27:main PiperOrigin-RevId: 465381834	2022-08-04 12:52:16 -07:00
Jake VanderPlas	03e5f9a24e	bug-report.yml: fix markdown formatting	2022-08-04 10:48:30 -07:00
Jake VanderPlas	f0c5747d9a	bug-report.yaml: fix yaml syntax error	2022-08-04 10:30:25 -07:00
Sudhakar Singh	1565fd2525	Squashed commit of the following: commit 0b4c3f05a49037be93eb0612113e193f3a8d61c5 Author: Sudhakar Singh <sudhakars@nvidia.com> Date: Thu Aug 4 09:53:04 2022 -0700 change the path commit 2c629739c1cfa45d848a2cf7109d329c1262e6ac Author: Sudhakar Singh <sudhakars@nvidia.com> Date: Wed Aug 3 16:37:46 2022 -0700 rename file to reflect current objective commit ef46bcae6cd66d6fe7b04bd6d8aeed42c4f3ddfa Author: Sudhakar Singh <sudhakars@nvidia.com> Date: Wed Aug 3 15:56:32 2022 -0700 correct formatting commit e5da60ad855592d5f150612f65ad679872160132 Author: Sudhakar Singh <sudhakars@nvidia.com> Date: Wed Aug 3 15:26:32 2022 -0700 Add multi-node multi-GPU JAX tests This adds multi-node multi-GPU test for `jax.distributed.initialize`. Presently, this is expected to run on a nightly basis. Under the hood, SLURM is used to launch the `pytest <test_name>` commands on multiple nodes. Resolves: #11648	2022-08-04 10:13:50 -07:00

140 Commits