64 Commits

Author SHA1 Message Date
Nathan Gauër
5d50af3f03
Revert "[CI] Extend metrics container to log BuildKite metrics" (#130770)
Reverts llvm/llvm-project#129699
2025-03-11 14:15:44 +01:00
Nathan Gauër
3df8be3ee9
[CI] Extend metrics container to log BuildKite metrics (#129699)
The current container focuses on Github metrics. Before deprecating
BuildKite, we want to make sure the new infra quality is better, or at
least the same.

Being able to compare buildkite metrics with github metrics on grafana
will allow us to easily present the comparison.

This PR requires https://github.com/llvm/llvm-zorg/pull/400 to be merged
first.
2025-03-11 14:11:07 +01:00
Aiden Grossman
cef6dbbe54 [CI] Add Logging for Workflow Jobs
This patch adds some logging information for individual workflow jobs inside
the metrics container. This is mainly intended for debugging why we seem to be
missing metrics from some workflows within Grafana.
2025-03-01 03:06:57 +00:00
Aiden Grossman
3c518940b0 [CI] Make Metrics Container Use Python Logging
This patch makes the metrics container use the python logging library. This
is more of what we want given we're essentially just logging the status of
things. It also means we do not have to explicitly specify an output file
and lets us control verbosity a bit more cleanly.
2025-03-01 03:03:24 +00:00
Aiden Grossman
b24e14093d [CI] Keep Track of Workflow Name Instead of Job Name
The metrics script includes some logic to only look at workflows up
to the most recent workflow it has seen previously. This was broken in a
previous patch when workflow metrics began to be emitted per job. The
logic ending the metrics gathering would never trigger, so we would
continually fetch more and more workflows until OOM.
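
A minimal sketch of the stopping logic described above, assuming runs arrive newest first. The function name, the `workflow_name` and `run_id` keys, and the `last_seen_by_workflow` mapping are hypothetical and are not taken from the metrics script:

```python
def new_runs_since_last_seen(runs, last_seen_by_workflow):
    """Return the runs that are newer than the last one already processed.

    `runs` is assumed to be ordered newest first. `last_seen_by_workflow`
    maps a workflow name (not a job name) to the most recent run id seen.
    """
    new_runs = []
    for run in runs:
        # Look the run up by workflow name. If the key were a job name
        # instead, this comparison would never match and the loop would
        # keep walking through the entire history.
        if last_seen_by_workflow.get(run["workflow_name"]) == run["run_id"]:
            break
        new_runs.append(run)
    return new_runs
```

With no entry for a workflow the loop never stops early, which is the unbounded behaviour the commit message describes.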
2025-02-15 06:16:08 +00:00
Aiden Grossman
d7b89b0dca
[CI] Do Not Consider a Job Failed if Steps Were Skipped
This patch makes it so that skipped steps do not cause a job to be
considered failed. The windows premerge jobs currently skip the
build/test step if there are no projects to build/test. These show up as
failures in the dashboard even though everything executed perfectly
fine.
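
As a rough illustration of the rule, assuming each step is a dictionary with a `conclusion` string such as `success`, `skipped`, or `failure` (a hypothetical shape, not the script's actual data model):

```python
# Conclusions that should not mark a job as failed (assumed values).
NON_FAILING_CONCLUSIONS = ("success", "skipped")


def job_failed(steps):
    """Return True if any completed step ended in a failing conclusion."""
    for step in steps:
        if step["conclusion"] not in NON_FAILING_CONCLUSIONS:
            return True
    return False
```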

Reviewers: lnihlen, Keenuts

Reviewed By: lnihlen

Pull Request: https://github.com/llvm/llvm-project/pull/127279
2025-02-14 19:14:56 -08:00
Aiden Grossman
97d2cfeab3
[CI] Try Moving Github Object Into Loop
Currently the metrics container is crashing reasonably often with
incomplete read/connection broken errors. Try moving the creation of the
Github Object into the main loop to see if recreating the object that
maybe handles some connection state fixes the issue.

Reviewers: Keenuts, lnihlen

Reviewed By: lnihlen

Pull Request: https://github.com/llvm/llvm-project/pull/127276
2025-02-14 19:12:16 -08:00
Aiden Grossman
4aeb2f1c79
[CI] Remove Duplicate Heartbeat in Metrics Script
This patch removes an extra heartbeat metric in the metrics python file. Before
it was performed twice, once in the main function, and once in the
get_sampled_workflow_metrics function. We only need one to keep everything
happy, and I've chosen to keep the one in get_sampled_workflow_metrics as it
seems a more appropriate place to keep it.

Reviewers: Keenuts, lnihlen

Reviewed By: lnihlen

Pull Request: https://github.com/llvm/llvm-project/pull/127275
2025-02-14 19:10:51 -08:00
Aiden Grossman
2d878ccf54
[CI] Track Queue/In Progress Metrics By Job Rather Than Workflow
This patch makes it so that the metrics container counts the number of in
progress and queued jobs at the job level rather than at the workflow
level. This helps us distinguish windows versus linux load and also lets
us filter out the MacOS jobs that only run in the release branch.

Reviewers: Keenuts, lnihlen

Reviewed By: lnihlen

Pull Request: https://github.com/llvm/llvm-project/pull/127274
2025-02-14 19:08:45 -08:00
Aiden Grossman
7f24b9acd1
[CI] Support multiple jobs in metrics container (#124457)
This patch makes it so that the metrics script can support multiple jobs
in a single workflow. This is needed so that we do not crash on an
assertion now that the windows job has been enabled within the premerge
workflow.
2025-01-27 17:05:05 +01:00
Aiden Grossman
280c7d7198
[CI] Increase Configurability of Monolithic Windows Build (#124328)
This patch makes it so that the caller of monolithic-windows.sh can set
the maximum number of parallel compile/link jobs in an environment
variable rather than manually specifying it inside of the CMake.
Additionally, the env variable definitions for CC, CXX, and LD are sunk
into the shell script due to those config options being pretty inherent
to what the pipeline is testing.

This is intended to make things more flexible/useable for the new
premerge CI pipeline, particularly as we are looking at using larger
runners and want the increased flexibility to experiment.
2025-01-24 15:37:36 -08:00
Nathan Gauër
13b44283e9
[CI] Add queue size, running count metrics (#122714)
This commit allows the container to report 3 additional metrics at
every sampling event:
- a heartbeat
- the size of the workflow queue (filtered)
- the number of running workflows (filtered)

The heartbeat is a simple metric allowing us to monitor the metrics
health. Before this commit, a new metric was pushed only when a
workflow was completed. This meant we had to wait a few hours
before noticing if the metrics container was unable to push metrics.

In addition to this, this commit adds a sampling of the workflow
queue size and running count. This should allow us to better understand
the load, and improve the autoscale values we pick for the cluster.

---------

Signed-off-by: Nathan Gauër <brioche@google.com>
2025-01-16 11:41:49 +01:00
Nathan Gauër
05f9cdd58d
[CI] Remove Check Clang Format from watched workflows (#122740)
This was useful to test metrics before we had an actual workflow, now it
generates noise.

Signed-off-by: Nathan Gauër <brioche@google.com>
2025-01-14 11:09:48 +01:00
David Spickett
1b199d1990
[ci] Handle the case where all reported tests pass but the build is still a failure (#120264)
In this build:
https://buildkite.com/llvm-project/github-pull-requests/builds/126961

The builds actually failed, probably because a prerequisite of a test
suite failed to build.

However they still ran other tests and all those passed. This meant that
the test reports were green even though the build was red. On some level
this is technically correct, but it is very misleading in practice.

So I've also passed the build script's return code, as it was when we
entered the on exit handler, to the generator, so that when this happens
again, the report will draw the viewer's attention to the overall
failure. There will be a link in the report to the build's log file, so
the next step to investigate is clear.

It would be nice to say "tests failed and there was some other build
error", but we cannot tell what the non-zero return code was caused by.
Could be either.

The script handles the following situations now:
| Have Result Files? | Tests reported failed? | Return code | Report |
|--------------------|------------------------|-------------|--------|
| Yes | No | 0 | Success style report. |
| Yes | Yes | 0 | Shouldn't happen, but if it did, failure style report showing the failures. |
| Yes | No | 1 | Failure style report, showing no failures but noting that the build failed. |
| Yes | Yes | 1 | Failure style report, showing the test failures. |
| No | ? | 0 | No test report, success shown in the normal build display. |
| No | ? | 1 | No test report, failure shown in the normal build display. |
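
The table can be read as a small decision function. The sketch below is only an illustration: the function name and the returned labels are hypothetical and do not come from the report generator.

```python
def choose_report(have_result_files, tests_failed, return_code):
    """Map the three inputs from the table to a kind of report.

    Returns None when no test report is produced, in which case the
    normal build display shows success or failure on its own.
    """
    if not have_result_files:
        return None
    if tests_failed:
        # Covers return code 0 (should not happen) and non-zero alike.
        return "failure: show test failures"
    if return_code != 0:
        # All reported tests passed, but the build itself failed.
        return "failure: no test failures, build failed"
    return "success"
```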
2025-01-13 09:05:18 +00:00
Aiden Grossman
eabf9313d4
[CI] Detect step failures in metrics job (#122564)
This patch makes the metrics job also detect failures in individual
steps. This is necessary to detect which jobs actually fail now that we
are setting continue-on-error in the premerge jobs to prevent sending
out unnecessary email.
2025-01-11 14:04:03 -08:00
Nathan Gauër
3bcfa1a579
[Github] Add LLVM Premerge Checks to the watchlist (#120230)
LLVM Premerge Checks is running on the new GCP cluster. Tracking its
metrics will allow us to determine the stability of the presubmit and
make sure the new infra is working as intended.

---------

Signed-off-by: Nathan Gauër <brioche@google.com>
2024-12-18 09:58:56 +01:00
Aiden Grossman
a24645463b
[CI] Only upload test results if buildkite-agent is present (#119954)
This patch modifies the monolithic shell scripts to only upload test
results if the buildkite-agent application is present. This allows for
running the scripts to completion outside of buildkite (e.g. inside of a
GHA pipeline).
2024-12-16 01:01:05 -08:00
Aiden Grossman
d6cc140dfd
[CI] Refactor common functionality into separate script (#119530)
This patch refactors some common functionality present in the CI scripts
to a separate shell script. This is mainly intended to make it easier to
reuse this functionality inside of a Github Actions pipeline as we make
the switch.
2024-12-13 01:20:02 -08:00
David Spickett
71fd5288d2
[ci] Include a log download link when test report is truncated (#117985)
Now "Download" will be a link to the file so people don't have to know
to open the build tab and find the download button.

This is a URL from a real build:

https://buildkite.com/organizations/llvm-project/pipelines/github-pull-requests/builds/123979/jobs/01937132-0fc3-4c95-a884-2fc0048cb9a7/download.txt
And this is how we can build it: 

https://buildkite.com/organizations/{BUILDKITE_ORGANIZATION_SLUG}/pipelines/{BUILDKITE_PIPELINE_SLUG}/builds/{BUILDKITE_BUILD_NUMBER}/jobs/{BUILDKITE_JOB_ID}/download.txt

Given these env vars that were set in that job:
BUILDKITE_ORGANIZATION_SLUG="llvm-project"
BUILDKITE_PIPELINE_SLUG="github-pull-requests"
BUILDKITE_BUILD_NUMBER="123979"
BUILDKITE_JOB_ID="01937132-0fc3-4c95-a884-2fc0048cb9a7"

In theory these will always be available but:
1. Rather safe than sorry with this script, I don't want to make a
   passing build a failure because this script failed.
2. It would get very annoying if you had to set all these to test
   the script locally.
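
A minimal sketch of building the link while tolerating missing variables, as the two points above suggest. The function name is hypothetical; the environment variable names and URL pattern are the ones listed in this message:

```python
import os

URL_TEMPLATE = (
    "https://buildkite.com/organizations/{BUILDKITE_ORGANIZATION_SLUG}"
    "/pipelines/{BUILDKITE_PIPELINE_SLUG}"
    "/builds/{BUILDKITE_BUILD_NUMBER}"
    "/jobs/{BUILDKITE_JOB_ID}/download.txt"
)

REQUIRED_VARS = (
    "BUILDKITE_ORGANIZATION_SLUG",
    "BUILDKITE_PIPELINE_SLUG",
    "BUILDKITE_BUILD_NUMBER",
    "BUILDKITE_JOB_ID",
)


def log_download_url(env=os.environ):
    """Return the log download URL, or None if any variable is missing."""
    try:
        values = {name: env[name] for name in REQUIRED_VARS}
    except KeyError:
        # Outside Buildkite (for example a local test run) there is
        # nothing to link to, so fall back to having no link.
        return None
    return URL_TEMPLATE.format(**values)
```

Returning `None` rather than raising keeps a missing variable from turning an otherwise passing build into a failure.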
2024-12-11 09:46:34 +00:00
Aiden Grossman
77c2b00553
[CI] Upstream metrics script and container definition (#117461)
This patch includes the script that pulls information from Github and
pushes it to Grafana. This is currently running in the cluster and
pushes information to
https://llvm.grafana.net/public-dashboards/6a1c1969b6794e0a8ee5d494c72ce2cd.
This script is designed to accept other jobs relatively easily and can
be easily modified to look at other metrics.
2024-11-29 11:15:44 -08:00
David Spickett
3b8426d340 [ci] Fix unit tests for test report generator
Last time I fixed a bug here I forgot to update them.
2024-11-28 09:26:30 +00:00
David Spickett
6a12b43ac0 [ci] Fix error when no junit files are passed to report generator
This resulted in the style being None and despite the report being
empty as well, we tried to send it to the agent and Python can't
send None as an argument.

To fix this return "success" style and also check whether the
report has any content before calling the agent.
2024-11-18 09:08:41 +00:00
David Spickett
889b3c9487 Reland "[ci] New script to generate test reports as Buildkite Annotations (#113447)"
This reverts commit 8a1ca6cad9cd0e972c322910cdfbbe9552c6c7ca.

I have fixed 2 things:
* The report is now sent by stdin so we do not hit the limit on the size
  of command line arguments.
* The report is limited to 1MB in size and if we exceed that we fall back
  to listing only the totals with a note telling you to check the full log.
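
Both fixes can be sketched as follows. This is an illustration under stated assumptions, not the script itself: the helper names are hypothetical, and the limit is assumed here to be 1024 * 1024 bytes.

```python
import subprocess

# Assumed interpretation of the "1MB" limit mentioned above.
MAX_REPORT_BYTES = 1024 * 1024


def limit_report(full_report, totals_only_report, max_bytes=MAX_REPORT_BYTES):
    """Return the full report, or the totals-only one if it is too large."""
    if len(full_report.encode("utf-8")) > max_bytes:
        return totals_only_report
    return full_report


def send_annotation(report, context, style):
    """Pass the report to buildkite-agent on stdin, not as an argument.

    Sending the body on stdin avoids the operating system limit on the
    size of command line arguments.
    """
    subprocess.run(
        ["buildkite-agent", "annotate", "--context", context, "--style", style],
        input=report.encode("utf-8"),
        check=True,
    )
```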
2024-11-13 10:39:57 +00:00
David Spickett
8a1ca6cad9 Revert "[ci] New script to generate test reports as Buildkite Annotations (#113447)"
This reverts commit e74a002433b4cf7f891ceedb61bd862867218a8b.

As it is failing on Linux with "OSError: [Errno 7] Argument list too long: 'buildkite-agent'".
2024-11-12 16:29:55 +00:00
David Spickett
e74a002433
[ci] New script to generate test reports as Buildkite Annotations (#113447)
The CI builds now send the results of every lit run to a unique file.
This means we can read them all to make a combined report for all
tests.
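
As a rough sketch of combining several result files, assuming each file is JUnit-style XML whose `testsuite` elements carry `tests`, `failures`, and `skipped` attributes. The function names and the glob pattern are hypothetical:

```python
import glob
import xml.etree.ElementTree as ET


def summarize_junit(xml_documents):
    """Sum test, failure, and skip counts over several JUnit XML strings."""
    totals = {"tests": 0, "failures": 0, "skipped": 0}
    for document in xml_documents:
        root = ET.fromstring(document)
        # iter() also yields the root itself if it is a <testsuite>.
        for suite in root.iter("testsuite"):
            for key in totals:
                totals[key] += int(suite.get(key, 0))
    return totals


def summarize_junit_files(pattern):
    """Read every file matching `pattern` and summarize the results."""
    documents = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as results_file:
            documents.append(results_file.read())
    return summarize_junit(documents)
```

For example, `summarize_junit_files("build/test-results.*.xml")` would cover one file per lit run.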

This report will be shown as an "annotation" in the build results:
https://buildkite.com/docs/agent/v3/cli-annotate#creating-an-annotation

Here is an example:
https://buildkite.com/llvm-project/github-pull-requests/builds/112660
(make sure it is showing "All" instead of "Failures")

This is an alternative to using the existing Buildkite plugin:
https://github.com/buildkite-plugins/junit-annotate-buildkite-plugin

As the plugin is:
* Specific to Buildkite, and we may move away from Buildkite.
* Requires docker, unless we were to fork it ourselves.
* Does not let you customise the report format unless again,
  we make our own fork.

Annotations use GitHub's flavour of Markdown so the main code in the
script generates that text. There is an extra "style" argument generated
to make the formatting nicer in Buildkite.

"context" is the name of the annotation that will be created. By using
different context names for Linux and Windows results we get 2 separate
annotations.

The script also handles calling the buildkite-agent. This makes passing
extra arguments to the agent easier, rather than piping the output of
this script into the agent.

In the future we can remove the agent part of it and simply use
the report content. Either printed to stdout or as a comment on
the GitHub PR.
2024-11-12 13:34:47 +00:00
David Spickett
f539d92dca
[ci] Write test results to unique file names (#113160)
In this patch I'm using a new lit option so that the pipeline writes
many results files, one for each time lit is run:
```
--use-unique-output-file-name
  When enabled, lit will add a unique element to the output file name, before the extension. For example "results.xml" will become "results.<something>.xml". The
  "<something>" is not ordered in any way and is chosen so that existing files are not overwritten. [Default: Off]
```
(I added this to lit recently)

Alternatives were considered:
* mkfifo - does not work on bash for Windows.
* tail -f - does not print full content on file truncation
* lit wrapper script - more complication than using an option to lit
  itself
* ninja/mv file/ninja/mv file etc - lots of changes needed to make the
  scripts build each target separately

And after feedback I decided that using an option to lit itself is the
cleanest way to go. It can be removed when we no longer need it.

If I run the Linux build after this change:
```
$ bash ./.ci/monolithic-linux.sh "clang;lldb;lld" "check-lldb-shell check-lld" "libcxx;libcxxabi" "check-libcxx check-libcxxabi"
```
I get multiple test result files. In my case some tests fail so runtimes
aren't checked, but all projects are so there is 1 file for lldb and one
for lld:
```
$ ls build/*.xml
build/test-results.klc82utf.xml  build/test-results.majylh73.xml
```
This change just collects the XML files as artifacts. Once I know that's
working, I can set up test reporting to make a summary of them.
2024-11-12 13:24:44 +00:00
David Spickett
90149204bd
[ci] Don't add check-all target when pstl project is enabled (#111803)
Fixes #110265

Adding check-all causes us to run some tests twice if a project specific
target like check-clang is also added.

check-pstl is an alternative but as far as I can tell, check-all does
not include this so we have not been running the tests in CI anyway.

When I tried to run check-pstl locally I got a lot of compiler errors
but have not found any instructions on how to setup a correct build
environment. Even if such instructions exist, it's probably more than we
want to do in CI.

According to Louis Dionne, the project is probably not active. So if
it's ever revived it'll be up to the new contributors to enable testing.
2024-10-10 14:26:46 +01:00
David Spickett
10008f731d
[ci] Don't add a testing target for libclc (#111547)
According to
https://github.com/llvm/llvm-project/pull/111369#issuecomment-2400152471
there is no testing to be done here.

Adding "check-all" only risks duplicating tests if other project
specific "check-" targets are also added.
2024-10-09 09:16:37 +01:00
David Spickett
5be1024ea7
[ci] Use check-compiler-rt target for testing compiler-rt (#111515)
Instead of "check-all" which leads to us running some tests twice if
there are other "check-..." targets. For example on one of my PRs this
script produced:
```
commands:
  - './.ci/monolithic-linux.sh "clang;clang;lld;clang-tools-extra;compiler-rt;llvm" "check-all check-clang check-clang-tools" "libcxx;libcxxabi;libunwind" "check-cxx check-cxxabi check-unwind"'
  commands:
  - 'C:\BuildTools\Common7\Tools\VsDevCmd.bat -arch=amd64 -host_arch=amd64'
  - 'bash .ci/monolithic-windows.sh "clang;clang-tools-extra;llvm" "check-clang check-clang-tools"'
```
Which meant that Linux ran the clang and clang-tools tests twice. These
extra tests were about 24% of the test run and increased testing time
(on my local machine) by 45%.

This problem can also happen with other projects but there isn't a
simple fix like this one at the moment.
* pstl has a check-pstl target but it is not part of check-all and when
I tried it locally I couldn't build it.
* libclc has no check- target.

I will deal with those projects later.
2024-10-09 09:15:56 +01:00
Vlad Serebrennikov
a4f6b7dfa4
[lldb] Stop testing LLDB on Clang changes in pre-commit CI (#95537)
This is a temporary measure to alleviate Linux pre-commit CI waiting
times that started snowballing
[recently](https://discourse.llvm.org/t/long-wait-for-linux-presubmit-testing/79547/5).
My [initial
estimate](https://github.com/llvm/llvm-project/pull/94208#issuecomment-2155972973)
of 4 additional minutes spent per build seems to be in the right
ballpark, but looks like that was the last straw to break the camel's back.
It seems that CI load got past the tipping point, and now it's not able
to burn through the queue over the night on workdays.

I don't intend to overthrow the consensus we reached in #94208, but it
shouldn't come at the expense of the whole LLVM community. I'll enable
this back as soon as we have news that we got more capacity for Linux
pre-commit CI.
2024-06-14 20:33:38 +04:00
Vlad Serebrennikov
d4eed43bad
Enable LLDB tests in Linux pre-merge CI (#94208)
This patch removes LLDB from a list of projects that are excluded from
building and testing on pre-merge CI on Linux.

Windows environment needs to be prepared in order to test LLDB
(https://github.com/llvm/llvm-project/pull/94208#issuecomment-2146256857),
but we don't have enough maintenance resources to do that at the moment.

Because LLDB has been in the list of projects that need to be tested on
Clang changes, this PR makes this happen on Linux. This seems to be the
consensus in the discussion of this PR.
2024-06-08 16:23:17 +04:00
Mehdi Amini
49ef21d767 Remove debug print from CI generation script (NFC) 2024-05-29 22:02:30 -07:00
Mehdi Amini
e4b424afc4
[CI] Disable Flang from pre-commit tests when Flang files are not touched on Windows Only (#93729)
Flang triggers some OOM on Windows CI right now. This is disruptive to
MLIR and LLVM changes that don't touch Flang, as such we disable
building Flang on Windows only for PRs that don't touch Flang. The
testing on Linux is unchanged, and the post-merge Windows testing is
still fully covering here.
2024-05-29 16:27:06 -06:00
Lucile Rose Nihlen
d9dec10937
[ci] limit parallel windows compile jobs to 24 (#93329)
This is an experiment to see if we can prevent some of the compiler OOMs
happening without unduly impacting the Windows build latency.
2024-05-28 19:53:21 +00:00
Vlad Serebrennikov
1de1ee9cba
[clang][ci] Move libc++ testing into the main PR pipeline (#93318)
Following the discussion in
https://github.com/llvm/llvm-project/pull/93233#issuecomment-2127920882,
this patch merges `clang-ci` pipeline into main `GitHub Pull Requests`
pipeline. `clang-ci` enables additional test coverage for Clang by
compiling it, and then using it to compile and test libc++, libc++abi,
and libunwind in C++03, C++26, and Clang Modules modes.

Additional work we skip and total time savings we should see:
1. Checking out the repo to generate the clang-ci pipeline (2 minutes)
2. Building Clang (3.5 minutes)
3. Uploading the artifacts once, then downloading them 3 times and
unpacking 3 times (0.5 minutes)

Note that because previously-split jobs for each mode are now under a
single Linux job, it now takes around 8 minutes more to see the Linux CI
results despite total time savings.

The primary goal of this patch is to reduce the load of CI by removing
duplicated work. I consider this goal achieved. I could keep the job
parallelism we had (3 libc++ jobs depending on a main Linux job), but I
don't consider it worth the effort and opportunity cost, because
parallelism is not helping once the pool of builders is fully
subscribed.
2024-05-28 02:25:15 +04:00
Vlad Serebrennikov
243611ed4c
Disable compiling and testing Flang on Clang changes (#92740)
This patch aims to rectify the Windows CI situation by decoupling Clang
changes from Flang test suite, which is causing Windows CI to "pause"
for 20 minutes (details can be found
[here](https://discourse.llvm.org/t/flang-tests-are-extremely-slow-on-windows/78591/11)).
This even seems desirable in the long run, because it was highlighted
that the only part of Clang that Flang depends on is Driver ([Discourse
post](https://discourse.llvm.org/t/flang-tests-are-extremely-slow-on-windows/78591/14)).

Importantly, this patch leaves the question of _entirely_ disabling
Flang tests on Windows CI out of scope.
2024-05-22 00:14:45 +04:00
Amir Ayupov
ced8497970
[ci] Add clang project dependency for bolt testing (#90262) 2024-04-26 22:06:24 +02:00
Amir Ayupov
59bfc31068 [CI] Use trunk Clang in BOLT testing 2024-04-25 20:10:37 -07:00
Fraser Cormack
d0af554464 [CI] Fix libclc dependencies
We need clang and llvm to build in-tree.
2024-04-18 07:01:13 +01:00
Marc Auberer
64f0410193
[CI] Hotfix: CI runs failing due to target escaping (#86897)
My patch #86877 contains a mistake.
Should have read the comment.
Recent buildkite runs fail because of this, so it is a bit urgent.
2024-03-28 02:03:24 +01:00
Marc Auberer
0a17eedf7b
[CI][NFC] Fix shellcheck warnings in CI scripts (#86877)
This fixes all shellcheck warnings we have in `monolithic-linux.sh` and
`monolithic-windows.sh`.
All of them have to do with
[SC2086](https://www.shellcheck.net/wiki/SC2086) - Double quote to
prevent globbing and word splitting.
2024-03-27 23:53:25 +01:00
Mehdi Amini
d35f944dde
Add missing clang to the monolithic pre-merge build (#85354)
Clang has a custom separate pipeline integrated with libc++ that only
runs in release mode. It means that changes which touch only clang
won't run the clang tests in the configuration used by LLVM premerge and
will break it unknowingly.
2024-03-14 22:06:45 -07:00
Connor Sughrue
a950c06d98
[CI] Run pre-merge build with -k 0 placed after "${BUILD_DIR}" (#84846)
#84828 added `-k 0` to pre-merge CI so that if one job fails the others
would continue building. This pull request fixes the location of `-k 0`
in the ninja command line.
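
Why the position matters: ninja's `-C` option takes the next argument as the directory to change into, so `-k 0` has to come after the build directory. A small sketch of assembling the command in that order (the function name and arguments are hypothetical):

```python
def ninja_command(build_dir, targets):
    """Build a ninja command line that keeps going after failures.

    `-C` consumes the argument that follows it, so the build directory
    must come directly after `-C` and `-k 0` only after that.
    """
    return ["ninja", "-C", build_dir, "-k", "0", *targets]
```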

Resolves #84842 and #83371
2024-03-11 18:41:50 -04:00
Mehdi Amini
65fd664daf
Run pre-merge build with -k 0 to ensure all tests runs (#84828)
The -k option allows the build to continue after failures as much as
possible. This is useful here because when we run

> ninja check-llvm check-clang

we would like the clang tests to run even if there is a failure in the
llvm tests.

The downside is that a build failure in one file, which would prevent
any test from running, no longer stops other targets from being built,
potentially wasting build resources.

Fixes #83371
2024-03-11 14:00:03 -07:00
Lucile Rose Nihlen
cd4e246616
repair and re-enable Windows buildkite presubmit (#82393) 2024-02-20 15:30:38 -05:00
Tom Stellard
4ad9f5be83
ci: Temporarily disable the buildkite job on Windows (#81538)
The failure rate is too high.
See
https://discourse.llvm.org/t/rfc-future-of-windows-pre-commit-ci/76840
2024-02-13 07:45:55 -08:00
Louis Dionne
5aad789481 [ci] Diff against origin/BASE-BRANCH
Otherwise, when the base branch is not something that the CI runner
has checked out, that reference to e.g. release/18.x is ambiguous.
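
A minimal sketch of the idea, assuming the changed files are found with `git diff --name-only`. The function name is hypothetical, and the base branch is taken as a parameter rather than read from any particular environment variable:

```python
def changed_files_command(base_branch):
    """Return a git command that lists files changed relative to the base.

    Prefixing the branch with `origin/` refers to the remote-tracking
    ref, which stays unambiguous even when the CI runner has no local
    branch of that name checked out.
    """
    return ["git", "diff", "--name-only", f"origin/{base_branch}...HEAD"]
```

For a pull request that targets `release/18.x`, this diffs against `origin/release/18.x` instead of `main`.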
2024-01-25 16:48:08 -05:00
Louis Dionne
3b76289182
[ci] Fix the base branch we use to determine changes (#79503)
We should diff against the base branch, not always against `main`. This
allows the BuildKite pre-commit CI to work properly when we target other
branches, such as `release/18.x`.
2024-01-25 16:38:53 -05:00
Louis Dionne
5e894771d9
[ci] Remove unused generate-buildkite-pipeline-scheduled script (#79320)
The "scheduled build" pipeline on BuildKite had been disabled for months
and doesn't exist anymore, so this script is effectively dead code. When
we set up a cron-activated build again, we should do it using Github
actions (which could trigger a BK pipeline if needed).

Keeping this script around just creates additional confusion about
what's used and what's not used for doing CI.
2024-01-24 17:58:03 +01:00
Louis Dionne
ca8605a78b [ci] Remove bits that are unused since we stopped using Phabricator 2024-01-24 10:46:34 -05:00