`profiler_test.py:ProfilerTest.test_remote_profiler` fails with the
protobuf upgrade. However, I was seeing mysterious hangs without this,
and in general I think we should be testing with up-to-date deps given
that we don't pin. I'm going to continue working on getting the Cloud TPU
CI green.
* Add deps to test requirements, including in a new
  `collect-profile-requirements.txt` (to avoid adding tensorflow to
  `test-requirements.txt`).
* Use the correct Python executable in `ProfilerTest.test_remote_profiler`
  (`python` sometimes defaults to python2); see the sketch after this list.
* Run computations for longer in `ProfilerTest.test_remote_profiler`,
  otherwise `collect_profile` sometimes misses them.
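For the executable fix, the idea is to launch the helper with the interpreter running the test instead of a bare `python`; a minimal sketch (the module path and arguments here are illustrative, not the exact test code):

```python
import subprocess
import sys

# sys.executable is the interpreter running this test, so the subprocess
# can't silently fall back to a system python2.
proc = subprocess.run(
    [sys.executable, "-m", "jax.collect_profile", "9999", "2000"],
    capture_output=True,
    check=True,
)
```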
Instead, we skip tests that the PJRT C API doesn't support. We had
this tag for feature development so it was easy to broadly disable,
but now we don't expect to need to do that.
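The skip typically looks something like the sketch below; `running_on_pjrt_c_api()` is a hypothetical stand-in for however the runtime is actually detected:

```python
import os
import unittest

def running_on_pjrt_c_api() -> bool:
  # Hypothetical helper: the real check would query the JAX runtime or an
  # environment flag set by the CI job.
  return os.environ.get("JAX_USE_PJRT_C_API", "0") == "1"

class ExampleTest(unittest.TestCase):

  def test_unsupported_feature(self):
    # Skip individually rather than broadly disabling via a tag.
    if running_on_pjrt_c_api():
      self.skipTest("Not supported by the PJRT C API yet")
    # ... actual test body ...
```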
Also shortens the job names so the full name is visible from the
GitHub UI (this was driving me crazy), and marks a new test that can't
be run on the PJRT C API yet.
Example run: https://github.com/google/jax/actions/runs/4019968334
We're seeing failures on v3-8 that don't appear on the current v4-8
testing. v3-8 also exposes 8 devices (vs. 4 on v4-8), and some tests
need 8 devices to run.
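Such tests typically guard on the visible device count so they skip cleanly on smaller topologies; a minimal sketch:

```python
import unittest
import jax

class EightDeviceTest(unittest.TestCase):

  def test_uses_eight_devices(self):
    # Only meaningful on topologies exposing 8 devices (e.g. a v3-8).
    if jax.device_count() < 8:
      self.skipTest(f"Requires 8 devices, found {jax.device_count()}")
    # ... actual test body ...
```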
I just added a v3-8 runner VM.
Also adds a missing pip install command (I only caught this with a
fresh runner since it only needs to be installed once).
This prevents spamming the test output with hundreds of failures when something fundamental is broken.
Also updates some `python3` commands to use `python` for consistency.
This change also marks multiaccelerator test files in a way pytest can
understand (if pytest is installed).
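A sketch of the pattern, with pytest kept optional (the marker name is illustrative):

```python
# At the top of a multiaccelerator test file: attach a module-level pytest
# marker, while staying importable when pytest isn't installed.
try:
  import pytest
  pytestmark = pytest.mark.multiaccelerator
except ImportError:
  pass
```

Runs can then select or exclude these files with `pytest -m multiaccelerator` or `pytest -m 'not multiaccelerator'` (registering the marker in `pytest.ini` avoids unknown-marker warnings).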
By running single-device tests on a single TPU chip, the test suite
run time goes from 1hr 45m to 35m (both timings include slow tests).
I tried using bazel at first, which already supported parallel
execution across TPU cores, but somehow it still took 2h 20m! I'm not
sure why it's so slow. It appears that bazel creates many new test
processes over time, whereas pytest reuses the number of processes
initially specified, and starting and stopping the TPU runtime takes a
few seconds each time, so that may be adding up. It also appears that
single-process bazel is slower than single-process pytest, which I
haven't looked into yet.
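For reference, the per-worker chip pinning can be sketched in a `conftest.py` roughly like this; the `TPU_VISIBLE_DEVICES` variable and the worker-to-chip mapping are assumptions for illustration, not the exact CI setup:

```python
# conftest.py (sketch): pin each pytest-xdist worker process to one TPU chip,
# before JAX initializes its backend.
import os

def pytest_configure(config):
  # pytest-xdist names worker processes "gw0", "gw1", ...; derive a chip
  # index from the worker name. TPU_VISIBLE_DEVICES is an assumption here;
  # the exact variable(s) depend on the libtpu version in use.
  worker = os.environ.get("PYTEST_XDIST_WORKER", "gw0")
  chip = int(worker.lstrip("gw") or "0")
  os.environ.setdefault("TPU_VISIBLE_DEVICES", str(chip))
```

Each worker then sees only its own chip, so single-device tests can run in parallel, one per chip.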
This also includes some utilities for setting up the self-hosted
runner. Googlers, see go/jax-self-hosted-runners for more setup info.
The workflow is pretty basic currently. We can and should add more
functionality later, such as email notifications. I kept it simple
here for easier reviewing.
Testing:
- Sample workflow run in my fork: https://github.com/skye/jax/actions/runs/3333614180
- Sample PR attempt: (will add soon but I did verify validate_job.sh blocks pull_request workflows)