Before this commit, we only pushed a queue/running count when the value
was not zero. This made building Grafana alerting a bit harder.
This changes it to always upload a value for watched workflows.
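A minimal sketch of the idea, with illustrative metric and variable names (the container's actual identifiers may differ):

```python
import time

def sample_counts(queued_count: int, running_count: int):
    now_ns = time.time_ns()
    # Always emit a sample, even when the value is 0, so the Grafana
    # series stays continuous and alert rules can evaluate it.
    return [
        ("workflow_queue_size", queued_count, now_ns),
        ("running_workflow_count", running_count, now_ns),
    ]
```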
This patch uses an env variable instead of the --break-system-packages
flag. This keeps the configuration uniform across the old and new
premerge systems, as the old premerge container does not recognize the
--break-system-packages flag. An env variable works on the new premerge
system and has no impact on the old one.
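A sketch of the mechanism, with an illustrative requirements-file path: pip maps its long options to `PIP_*` environment variables, so the variable below is equivalent to passing --break-system-packages on new pip, while old pip ignores the unknown variable.

```python
import os
import subprocess

# Equivalent to `pip install --break-system-packages ...` on new pip;
# old pip simply ignores the unrecognized environment variable.
env = dict(os.environ, PIP_BREAK_SYSTEM_PACKAGES="1")
subprocess.run(["pip", "install", "-r", "requirements.txt"], env=env, check=True)
```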
This reverts commit 23fb048ce35f672d8db3f466a2522354bbce66e5.
This broke the new premerge system as it appears the pip installations within
the CI image do not support this option. Buildkite was unaffected.
These changes are mostly pushed by the gnsyncbot directly to main and
thus don't go through a PR, but we still test on main to see if main is
broken. Given these touch llvm/, they end up burning a decent amount of
testing time for no real benefit, so I think it makes sense to exclude
them from premerge testing explicitly.
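A minimal sketch of such an exclusion, assuming a hypothetical prefix list; the real premerge configuration may express this differently:

```python
EXCLUDED_PREFIXES = ("llvm/utils/gn/",)

def affects_premerge_testing(modified_file: str) -> bool:
    # GN build files are synced by gnsyncbot and cannot break LLVM tests.
    return not modified_file.startswith(EXCLUDED_PREFIXES)
```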
This patch fixes the monolithic linux build on Ubuntu 24.04. Newer
versions of Debian/Ubuntu emit a warning when installing packages at
the system level using pip, as doing so can interfere with Python
packages installed through the system package manager. We do not use
any Python packages installed through the system package manager, so we
simply suppress the warning (which is otherwise promoted to an error)
by passing the --break-system-packages flag.
This patch adds rich test failure information to the Github output,
using the same library that is used for the buildkite pipeline.
Eventually I think we want to add more information like reproduction
information using the containers, but that is very divergent between
Github and Buildkite, so we probably want to wait until we've switched
over before doing that.
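A hedged sketch of how this can look on the GitHub side: GITHUB_STEP_SUMMARY is the standard file GitHub Actions provides for rich Markdown output, while the library function name and the junit glob here are assumptions.

```python
import glob
import os

import generate_test_report_lib  # the shared library mentioned above

junit_files = glob.glob("build/test-results/*.xml")  # illustrative path
report = generate_test_report_lib.generate_report("Premerge", junit_files)
with open(os.environ["GITHUB_STEP_SUMMARY"], "a") as summary:
    summary.write(report)
```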
Reviewers: Keenuts, tstellar, lnihlen, DavidSpickett
Reviewed By: DavidSpickett, Keenuts
Pull Request: https://github.com/llvm/llvm-project/pull/133197
Currently, when someone touches a docs directory in a subproject, it is
treated as if the source code of that project got touched, so the
project is built and tested, along with all of its enumerated
dependents. This is wasteful, particularly for patches that only touch
docs in places like LLVM, where we might spend an hour of node time
doing nothing useful: changes to the docs shouldn't cause test
failures, and there is already another workflow that checks the
documentation build completes successfully.
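A minimal sketch of the exclusion, assuming hypothetical helper names inside compute_projects.py:

```python
def _is_docs_only(modified_file: str) -> bool:
    # e.g. "llvm/docs/LangRef.rst" -> the second path component is "docs".
    parts = modified_file.split("/")
    return len(parts) > 1 and parts[1] == "docs"

def projects_touched(modified_files: list[str]) -> set[str]:
    projects = set()
    for modified_file in modified_files:
        if _is_docs_only(modified_file):
            continue  # Docs changes cannot break tests; skip them entirely.
        projects.add(modified_file.split("/")[0])
    return projects
```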
Reviewers: Keenuts, tstellar, lnihlen
Reviewed By: tstellar
Pull Request: https://github.com/llvm/llvm-project/pull/133185
This patch migrates the CI over to the new compute_projects.py script
for calculating what projects need to be tested based on a change to
LLVM.
Reviewers: lnihlen, ldionne, tstellar, Endilll, joker-eph, Keenuts
Reviewed By: Keenuts, tstellar
Pull Request: https://github.com/llvm/llvm-project/pull/132642
This patch refactors the generate_test_report script, turning it into a
proper library and pulling the script/unittests out into separate
files, as is standard for most Python projects. The main purpose of
this is to enable reusing the library for the new Github premerge.
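A sketch of the resulting layout, with assumed file and function names: the report construction lives in a library module that both Buildkite and GitHub premerge can import, and the Buildkite entry point shrinks to a thin CLI wrapper.

```python
# generate_test_report_buildkite.py (illustrative)
import argparse

import generate_test_report_lib

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("title")
    parser.add_argument("return_code", type=int)
    parser.add_argument("junit_files", nargs="*")
    args = parser.parse_args()
    print(
        generate_test_report_lib.generate_report(
            args.title, args.return_code, args.junit_files
        )
    )
```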
Reviewers: tstellar, DavidSpickett, Keenuts, lnihlen
Reviewed By: DavidSpickett
Pull Request: https://github.com/llvm/llvm-project/pull/133196
Before this patch, making a change to a runtime directory (like libcxx)
would cause the project to be added to the LLVM_ENABLE_PROJECTS CMake
flag, which is invalid, as runtimes can only be built as part of
LLVM_ENABLE_RUNTIMES. This patch fixes that behavior. Test added.
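A minimal sketch of the corrected routing, with an assumed runtimes list:

```python
RUNTIMES = {"libcxx", "libcxxabi", "libunwind", "compiler-rt"}

def add_to_build(name: str, projects: set, runtimes: set) -> None:
    # libcxx and friends may only appear in LLVM_ENABLE_RUNTIMES, never
    # in LLVM_ENABLE_PROJECTS, so route them to the runtimes list.
    (runtimes if name in RUNTIMES else projects).add(name)
```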
This patch adds a Python script, compute_projects, and associated unit
tests for computing the projects and runtimes that need to be tested in
premerge. Rewriting in Python opens up a couple of new
improvements/opportunities:
1. I personally find Python much easier to work with than shell
scripts for tasks like this. In particular, proper array support makes
working with paths much easier.
2. Unit testing becomes easier, which makes it a lot easier to reason
about behavior changes, especially in review.
3. Most of the configuration is now set up in dictionaries, which
makes the most common changes much easier to apply (sketched below).
This preserves the behavior of the existing premerge scripts as much as
possible.
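A hedged sketch of the dictionary-driven configuration; the real mapping in compute_projects.py is larger and the entries here are illustrative:

```python
PROJECT_DEPENDENTS = {
    "llvm": {"clang", "lld", "mlir"},
    "clang": {"clang-tools-extra", "lldb"},
}

def dependents_of(project: str) -> set[str]:
    # Adding or removing a dependency edge is a one-line dictionary change.
    return PROJECT_DEPENDENTS.get(project, set())
```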
Reviewers: ldionne, lnihlen, Endilll, tstellar, Keenuts
Reviewed By: Keenuts
Pull Request: https://github.com/llvm/llvm-project/pull/132634
Between the C++26 and Clang modules builds of the runtimes, which are
used as additional testing for Clang, the only differences are
`LIBCXX_TEST_PARAMS` and `LIBCXXABI_TEST_PARAMS`. Both of them are
transformed into actual lit configuration lines and put into
`SERIALIZED_LIT_PARAMS`, which ends up in `libcxx/test/cmake-bridge.cfg`
via the `configure_file` command. Notably, it seems that they are not
used in any other way.
I checked that if those variables are changed, subsequent runs of CMake
configuration step regenerate `cmake-bridge.cfg` with the new values.
This means that we don't need to do clean builds for every runtimes
configuration we want to test.
I hope that together with #131913, this will be enough to alleviate
Linux CI pains we're having, and we wouldn't have to make a tough choice
between C++26 and Clang modules builds for pre-merge CI.
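A sketch of what this enables, with illustrative paths and parameter values: re-running only the configure step swaps the test params, and `configure_file` regenerates `cmake-bridge.cfg` in the same build tree.

```python
import subprocess

def configure(build_dir: str, test_params: str) -> None:
    subprocess.run(
        [
            "cmake",
            "-B", build_dir,
            f"-DLIBCXX_TEST_PARAMS={test_params}",
            f"-DLIBCXXABI_TEST_PARAMS={test_params}",
        ],
        check=True,
    )

# The same build directory can test C++26 first and Clang modules after,
# with no clean build in between.
configure("build/runtimes", "std=c++26")
configure("build/runtimes", "enable_modules=clang")
```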
This patch cleans up the runtimes build in premerge due to queuing
delays, dropping the C++03 testing, but keeping the C++20 and Modules
configurations as they are deemed important by clang contributors.
This patch also makes it easier to rework the runtimes build in the
future, anticipating the deprecation of building most of the runtimes
with LLVM_ENABLE_PROJECTS.
This reverts commit 95d28fe503cc3d2bc0bb980442d3defaf199ea5a.
I did not fully realize the implications of this change when reviewing.
With how it is set up currently, it causes clang and all of the runtimes
to be built and tested every time a change to MLIR is made. This is a
large regression in build/test time, which seems to have been causing
large queueing delays.
Reverting for now. Once we rework the runtimes build for premerge (which
I hope to do soon, ideally in the next week), I will make sure flang-rt
gets added in.
This patch bumps the maximum number of workflows to look through when
collecting metrics data. We are currently running into issues where we
are losing data due to the most recent 1000 workflows not containing the
workflows that we actually need to query. Just double it for now.
I plan on monitoring this reasonably closely to ensure we do not run
into issues, mainly API rate limits.
The current container focuses on Github metrics. Before deprecating
BuildKite, we want to make sure the new infra quality is better, or at
least the same.
Being able to compare buildkite metrics with github metrics on grafana
will allow us to easily present the comparison.
The BuildKite API allows filtering, but doesn't allow changing the
result ordering, meaning we are left with builds ordered by ID. As a
result, a completed job can appear before a running job in the list.
Two solutions from there:
- keep the cursor on the oldest running workflow
- keep a list of running workflows to compare.
Because there are no guarantees on workflow ordering, waiting for the
oldest build to complete before reporting any newer build could mean
delaying the more recent build's completion report by a few hours. And
because Grafana cannot ingest metrics older than 2 hours, this is not
an option.
Thus we are left with the second solution: remember which jobs were running
during the last iteration, and record them as soon as they are
completed. Buildkite has at most ~100 pending jobs, so keeping all those
IDs should be OK.
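A minimal sketch of that bookkeeping, with illustrative names and states:

```python
running_builds: set[str] = set()

def report_build(build: dict) -> None:
    print("completed:", build["id"])  # placeholder for the Grafana push

def process_builds(builds: list[dict]) -> None:
    global running_builds
    still_running = set()
    for build in builds:
        if build["state"] in ("running", "scheduled"):
            still_running.add(build["id"])
        elif build["id"] in running_builds:
            # Completed since the last iteration: safe to report now.
            report_build(build)
    running_builds = still_running  # stays small: ~100 pending at most
```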
Flang's runtime can now be built using LLVM's LLVM_ENABLE_RUNTIMES
mechanism, with the intent to remove the old mechanism in #124126.
Update the pre-merge builders to use the new mechanism.
In the current form, #124126 actually will add
LLVM_ENABLE_RUNTIMES=flang-rt implicitly, so no change is strictly
needed. I still think it is a good idea to do it explicitly and in
advance.
On Windows, flang-rt also requires compiler-rt, but flang-rt is not
building on Windows anyway.
Yesterday, the monitoring reported a job queued for 23h59. After some
checks, it appears no such job existed: the age of the workflows on
completion was at most 5 hours during the last 48 hours.
After some digging, I found out GitHub could return a job with a start
date slightly before the creation date, or completion date before start
date.
This would cause python to compute a negative timedelta, which would
then be reported in grafana as a full 24h delta due to the conversions.
This adds code to ignore negative deltas, but log them.
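A hedged sketch of the guard, with illustrative names:

```python
import logging

def job_timings(created_at, started_at, completed_at):
    """Return (queue_seconds, run_seconds), or None for inconsistent dates."""
    queue_time = (started_at - created_at).total_seconds()
    run_time = (completed_at - started_at).total_seconds()
    if queue_time < 0 or run_time < 0:
        # A negative timedelta would wrap around to ~24h after the
        # conversions, so drop the sample but keep a trace of it.
        logging.warning("Negative timing (%s, %s); ignoring.", queue_time, run_time)
        return None
    return queue_time, run_time
```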
Before this patch, the job/workflow name impacted the metric name,
meaning a change in the workflow definition could break monitoring. This
patch adds a map to derive a stable metric name from a workflow name.
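A minimal sketch of the map; the actual entries in the metrics container may differ:

```python
GITHUB_WORKFLOW_TO_TRACK = {
    "LLVM Premerge Checks": "github_llvm_premerge_checks",
}

def metric_name_for(workflow_name: str):
    # Renaming the workflow in its YAML no longer renames the Grafana series.
    return GITHUB_WORKFLOW_TO_TRACK.get(workflow_name)
```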
In addition, it reworks a bit how we track the last processed workflow:
the github queries are broken if filtering is applied, meaning we have a
list of workflows, ordered by 'created_at', which mixes completed &
running workflows.
We have no guarantees over the order of completion, meaning we cannot
stop at the first completed job we find (even per-workflow).
This PR processes the last 1000 workflows, but allows an early stop if
the created_at time is older than 8 hours. This means we could miss
long-running workflows (>8 hours), and if the number of workflows
started before another one completes becomes high (>1000), we'll miss
it.
To detect this kind of behavior, a new metric is added "oldest workflow
processed", which should at least indicate if the depth is too small.
An alternative without an arbitrary cutoff would be to initially parse
all workflows, record the oldest non-completed one we find, and always
start from it (moving the lower bound as workflows complete). But LLVM
has forever-queued workflow runs (>1 year old), hence this would cause
us to iterate over a very large number of jobs.
---------
Signed-off-by: Nathan Gauër <brioche@google.com>
The current container focuses on Github metrics. Before deprecating
BuildKite, we want to make sure the new infra quality is better, or at
least the same.
Being able to compare buildkite metrics with github metrics on grafana
will allow us to easily present the comparison.
This PR requires https://github.com/llvm/llvm-zorg/pull/400 to be merged
first.
This patch adds some logging information for individual workflow jobs inside
the metrics container. This is mainly intended for debugging why we seem to be
missing metrics from some workflows within Grafana.
This patch makes the metrics container use the python logging library. This
is more of what we want given we're essentially just logging the status of
things. It also means we do not have to explicitly specify an output file
and lets us control verbosity a bit more cleanly.
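A minimal sketch of the switch; the format and level chosen by the container are not specified here:

```python
import logging

logging.basicConfig(
    level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s"
)
logging.info("Uploaded %d metrics to Grafana.", 42)
```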
The metrics script includes some logic to only look at workflows up
to the most recent workflow it has seen previously. This was broken in a
previous patch when workflow metrics began to be emitted per job. The
logic ending the metrics gathering would never trigger, so we would
continually fetch more and more workflows until OOM.
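A hedged sketch of a fixed termination condition, comparing workflow run IDs rather than per-job entries; names are illustrative:

```python
def new_runs_since(runs, last_seen_id):
    new_runs = []
    for run in runs:
        if run.id == last_seen_id:
            break  # Everything past this point was processed last pass.
        new_runs.append(run)
    return new_runs
```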
This patch makes it so that skipped steps do not cause a job to be
considered failed. The windows premerge jobs currently skip the
build/test step if there are no projects to build/test. These show up as
failures in the dashboard even though everything executed perfectly
fine.
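A minimal sketch of the check, assuming step objects with a `conclusion` field:

```python
def job_failed(steps) -> bool:
    # "skipped" (e.g. nothing to build/test on Windows) no longer counts;
    # only a step that actually failed marks the job as failed.
    return any(step.conclusion == "failure" for step in steps)
```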
Reviewers: lnihlen, Keenuts
Reviewed By: lnihlen
Pull Request: https://github.com/llvm/llvm-project/pull/127279
Currently the metrics container is crashing reasonably often with
incomplete read/connection broken errors. Try moving the creation of the
Github object into the main loop to see if recreating the object, which
may hold some connection state, fixes the issue.
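A hedged sketch of the change using PyGithub; the sampling interval and the per-iteration work are illustrative:

```python
import os
import time

import github  # PyGithub

SAMPLING_INTERVAL_SECONDS = 300

while True:
    # Recreate the client each iteration so any stale connection state
    # from a previous incomplete read is discarded.
    gh = github.Github(auth=github.Auth.Token(os.environ["GITHUB_TOKEN"]))
    repo = gh.get_repo("llvm/llvm-project")
    # ... collect and upload metrics using `repo` ...
    time.sleep(SAMPLING_INTERVAL_SECONDS)
```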
Reviewers: Keenuts, lnihlen
Reviewed By: lnihlen
Pull Request: https://github.com/llvm/llvm-project/pull/127276
This patch removes an extra heartbeat metric in the metrics python file. Before
it was performed twice, once in the main function, and once in the
get_sampled_workflow_metrics function. We only need one to keep
everything happy, and I've chosen to keep the one in
get_sampled_workflow_metrics as it seems the more appropriate place.
Reviewers: Keenuts, lnihlen
Reviewed By: lnihlen
Pull Request: https://github.com/llvm/llvm-project/pull/127275
This patch makes it so that the metrics container counts the number of in
progress and queued jobs at the job level rather than at the workflow
level. This helps us distinguish windows versus linux load and also lets
us filter out the MacOS jobs that only run in the release branch.
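A hedged sketch of the job-level counting, given a PyGithub Repository; the MacOS filtering and the bound on how many runs to scan are elided:

```python
def count_active_jobs(repo):
    """Count queued/running jobs across active runs of a PyGithub Repository."""
    queued_count = running_count = 0
    for run in repo.get_workflow_runs():
        if run.status not in ("queued", "in_progress"):
            continue
        for job in run.jobs():
            # Counting at the job level separates windows and linux load.
            if job.status == "queued":
                queued_count += 1
            elif job.status == "in_progress":
                running_count += 1
    return queued_count, running_count
```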
Reviewers: Keenuts, lnihlen
Reviewed By: lnihlen
Pull Request: https://github.com/llvm/llvm-project/pull/127274
This patch makes it so that the metrics script can support multiple jobs
in a single workflow. This is needed so that we do not crash on an
assertion now that the windows job has been enabled within the premerge
workflow.
This patch makes it so that the caller of monolithic-windows.sh can set
the maximum number of parallel compile/link jobs in an environment
variable rather than manually specifying it inside the CMake invocation.
Additionally, the env variable definitions for CC, CXX, and LD are sunk
into the shell script due to those config options being pretty inherent
to what the pipeline is testing.
This is intended to make things more flexible/useable for the new
premerge CI pipeline, particularly as we are looking at using larger
runners and want the increased flexibility to experiment.
This commit allows the container to report 3 additional metrics at
every sampling event:
- a heartbeat
- the size of the workflow queue (filtered)
- the number of running workflows (filtered)
The heartbeat is a simple metric allowing us to monitor the metrics
health. Before this commit, a new metric was pushed only when a
workflow was completed. This meant we had to wait a few hours
before noticing if the metrics container was unable to push metrics.
In addition to this, this commit adds a sampling of the workflow
queue size and running count. This should allow us to better understand
the load, and improve the autoscale values we pick for the cluster.
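A minimal sketch of the heartbeat sample; the metric name and tuple shape are assumptions following the pattern above:

```python
import time

def heartbeat_sample():
    # A constant 1 pushed on every sampling event: if this series goes
    # stale, the metrics container itself is the problem.
    return ("metrics_container_heartbeat", 1, time.time_ns())
```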
---------
Signed-off-by: Nathan Gauër <brioche@google.com>
In this build:
https://buildkite.com/llvm-project/github-pull-requests/builds/126961
The builds actually failed, probably because a prerequisite of a test
suite failed to build.
However they still ran other tests and all those passed. This meant that
the test reports were green even though the build was red. On some level
this is technically correct, but it is very misleading in practice.
So I've also passed the build script's return code, as it was when we
entered the on exit handler, to the generator, so that when this happens
again, the report will draw the viewer's attention to the overall
failure. There will be a link in the report to the build's log file, so
the next step to investigate is clear.
It would be nice to say "tests failed and there was some other build
error", but we cannot tell what the non-zero return code was caused by.
Could be either.
The script handles the following situations now:
| Have Result Files? | Tests reported failed? | Return code | Report |
|--------------------|------------------------|-------------|--------|
| Yes                | No                     | 0           | Success style report. |
| Yes                | Yes                    | 0           | Shouldn't happen, but if it did, failure style report showing the failures. |
| Yes                | No                     | 1           | Failure style report, showing no failures but noting that the build failed. |
| Yes                | Yes                    | 1           | Failure style report, showing the test failures. |
| No                 | ?                      | 0           | No test report, success shown in the normal build display. |
| No                 | ?                      | 1           | No test report, failure shown in the normal build display. |
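A minimal sketch implementing the table, where the style strings follow Buildkite's annotation styles:

```python
def choose_report(have_result_files: bool, failures: int, return_code: int):
    if not have_result_files:
        # No annotation at all; the normal build display shows pass/fail.
        return None
    if return_code == 0 and failures == 0:
        return "success"
    # Test failures, a failed build step, or both: draw attention to it.
    return "error"
```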
This patch makes the metrics job also detect failures in individual
steps. This is necessary now that we are setting continue-on-error in
the premerge jobs to prevent sending out unnecessary emails; we now
have to look at individual steps to detect which jobs actually fail.
LLVM Premerge Checks is running on the new GCP cluster. Tracking its
metrics will allow us to determine the stability of the presubmit and
make sure the new infra is working as intended.
---------
Signed-off-by: Nathan Gauër <brioche@google.com>
This patch modifies the monolithic shell scripts to only invoke the
buildkite-agent application if it is present. This allows for running
the scripts to completion outside of buildkite (e.g. inside of a GHA
pipeline).
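A Python rendering of the guard for illustration; the shell scripts themselves can do the equivalent with `command -v buildkite-agent`:

```python
import shutil

def under_buildkite() -> bool:
    # Skip agent-specific steps when the binary is absent, so the same
    # script runs to completion inside a GHA pipeline.
    return shutil.which("buildkite-agent") is not None
```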
This patch refactors some common functionality present in the CI scripts
to a separate shell script. This is mainly intended to make it easier to
reuse this functionality inside of a Github Actions pipeline as we make
the switch.
This patch includes the script that pulls information from Github and
pushes it to Grafana. This is currently running in the cluster and
pushes information to
https://llvm.grafana.net/public-dashboards/6a1c1969b6794e0a8ee5d494c72ce2cd.
This script is designed to accept other jobs relatively easily and can
be modified to track other metrics with little effort.
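A hedged sketch of the upload path: Grafana Cloud accepts InfluxDB line protocol over HTTP, but the endpoint and credential handling below are illustrative, not the container's actual configuration.

```python
import requests

def upload_metric(name: str, value: int, timestamp_ns: int, user: str, key: str):
    response = requests.post(
        "https://influx.example.grafana.net/api/v1/push/influx/write",
        headers={"Content-Type": "text/plain"},
        data=f"{name} value={value} {timestamp_ns}",
        auth=(user, key),
    )
    response.raise_for_status()
```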
This resulted in the style being None, and although the report was
empty as well, we still tried to send it to the agent, and Python
can't pass None as an argument.
To fix this, return "success" style and also check whether the report
has any content before calling the agent.
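A minimal sketch of the fix; the failure marker checked here is an assumption:

```python
def get_style(report: str) -> str:
    # An empty report previously fell through with a None style, which
    # could not be passed to the agent as an argument.
    if not report:
        return "success"
    return "error" if ":x:" in report else "success"

def maybe_send(report: str, send_to_agent) -> None:
    if report:  # Only invoke the agent when there is content to show.
        send_to_agent(report, get_style(report))
```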
This reverts commit 8a1ca6cad9cd0e972c322910cdfbbe9552c6c7ca.
I have fixed 2 things:
* The report is now sent via stdin, so we do not hit the limit on the
size of command-line arguments (see the sketch below).
* The report is limited to 1MB in size; if we exceed that, we fall back
to listing only the totals, with a note telling you to check the full log.
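A hedged sketch of both fixes together; the truncation message wording is illustrative:

```python
import subprocess

MAX_ANNOTATION_BYTES = 1024 * 1024

def annotate(report: str, style: str) -> None:
    body = report.encode("utf-8")
    if len(body) > MAX_ANNOTATION_BYTES:
        # Fall back to totals only, pointing the reader at the full log.
        body = b"Report too large; see the full build log for details."
    # Stdin avoids the OS limit on command-line argument size.
    subprocess.run(
        ["buildkite-agent", "annotate", "--style", style],
        input=body,
        check=True,
    )
```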
This reverts commit e74a002433b4cf7f891ceedb61bd862867218a8b.
As it is failing on Linux with "OSError: [Errno 7] Argument list too long: 'buildkite-agent'".
The CI builds now send the results of every lit run to a unique file.
This means we can read them all to make a combined report for all
tests.
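A hedged sketch of the aggregation, assuming the junitparser package and an illustrative glob:

```python
import glob

from junitparser import JUnitXml

combined = JUnitXml()
for path in glob.glob("build/test-results/*.xml"):
    combined += JUnitXml.fromfile(path)
# `combined` now covers every lit run and feeds the report described below.
```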
This report will be shown as an "annotation" in the build results:
https://buildkite.com/docs/agent/v3/cli-annotate#creating-an-annotation
Here is an example:
https://buildkite.com/llvm-project/github-pull-requests/builds/112660
(make sure it is showing "All" instead of "Failures")
This is an alternative to using the existing Buildkite plugin:
https://github.com/buildkite-plugins/junit-annotate-buildkite-plugin
As the plugin is:
* Specific to Buildkite, and we may move away from Buildkite.
* Requires docker, unless we were to fork it ourselves.
* Does not let you customise the report format unless, again,
we make our own fork.
Annotations use GitHub's flavour of Markdown so the main code in the
script generates that text. There is an extra "style" argument generated
to make the formatting nicer in Buildkite.
"context" is the name of the annotation that will be created. By using
different context names for Linux and Windows results we get 2 separate
annotations.
The script also handles calling the buildkite-agent. This makes passing
extra arguments to the agent easier, rather than piping the output of
this script into the agent.
In the future we can remove the agent part of it and simply use
the report content. Either printed to stdout or as a comment on
the GitHub PR.