22 Commits

Nathan Gauër
fe4f666363
[CI] Always upload queue/running count (#134814)
Before this commit, we only pushed a queue/running count when the value
was not zero, which made building Grafana alerting harder.
Change this to always upload a value for every watched workflow.
2025-04-08 11:16:24 +02:00
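
A minimal sketch in Python of the behaviour described above; the GaugeMetric
record, metric names, and dictionary shapes are illustrative rather than the
container's actual types:

    import time
    from dataclasses import dataclass

    # Hypothetical metric record; the real container's types may differ.
    @dataclass
    class GaugeMetric:
        name: str
        value: int
        time_ns: int

    def sample_queue_metrics(queued_counts, running_counts, watched_workflows):
        """Emit queue/running gauges for every watched workflow, even when zero."""
        now_ns = time.time_ns()
        metrics = []
        for workflow in watched_workflows:
            # Previously a gauge was only pushed when the count was non-zero;
            # defaulting to 0 keeps the series continuous for Grafana alerting.
            metrics.append(GaugeMetric(f"workflow_queue_size_{workflow}",
                                       queued_counts.get(workflow, 0), now_ns))
            metrics.append(GaugeMetric(f"running_workflow_count_{workflow}",
                                       running_counts.get(workflow, 0), now_ns))
        return metrics
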
Nathan Gauër
77edfbb96c
[CI] Don't count canceled buildkite builds (#132015)
We don't count canceled jobs on GCP, so we shouldn't count canceled jobs
on Buildkite either.

Signed-off-by: Nathan Gauër <brioche@google.com>
2025-03-21 10:14:44 +01:00
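
A sketch of the filter described above, assuming each Buildkite build is the
plain dict returned by the REST API with state and finished_at fields:

    def completed_reportable_builds(builds):
        """Completed builds worth reporting; canceled builds are skipped so
        the Buildkite numbers stay comparable with the GCP-side metrics."""
        return [
            b for b in builds
            if b.get("finished_at") is not None
            and b.get("state") not in ("canceled", "canceling")
        ]
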
Aiden Grossman
0619892cab
[CI] Bump max workflow-to-process count in metrics
This patch bumps the maximum number of workflows to look through when
collecting metrics data. We are currently running into issues where we
are losing data because the most recent 1000 workflows do not contain the
workflows that we actually need to query. Just double it for now.

I plan on monitoring this reasonably closely to ensure we do not run
into issues, mainly with API rate limits.
2025-03-18 19:34:57 +00:00
Nathan Gauër
05df923b0e
[CI] Add dateutil dependency to the metrics container (#131333)
2025-03-14 14:45:44 +01:00
Nathan Gauër
44f4e43b4f
[CI] Extend metrics container to log BuildKite metrics (#130996)
The current container focuses on GitHub metrics. Before deprecating
BuildKite, we want to make sure the quality of the new infra is at least
as good.

Logging BuildKite metrics alongside GitHub metrics lets us present the
comparison directly in Grafana.

The BuildKite API allows filtering, but does not allow changing the result
ordering, so we are left with builds ordered by ID. This means a
completed build can appear before a running build in the list. Two
solutions from there:
 - keep the cursor on the oldest running workflow
 - keep a list of running workflows to compare.

Because there are no guarantees on workflow ordering, waiting for the
oldest build to complete before reporting any newer build could delay
the reporting of more recent completions by a few hours. And because
Grafana cannot ingest metrics older than 2 hours, this is not an option.

Thus we go with the second solution: remember which jobs were running
during the last iteration, and record them as soon as they are
completed. Buildkite has at most ~100 pending jobs, so keeping all those
IDs should be OK.
2025-03-14 11:44:39 +01:00
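
A sketch of that second option, assuming builds are dicts with id and state
fields, and with report_build standing in for the code that pushes the
completed-build metrics:

    def process_buildkite_builds(builds, previously_running, report_build):
        """Report builds that were running last iteration and have since completed.

        `builds` is the latest list of builds (ordered by ID, not by completion);
        `previously_running` is the set of build IDs seen as running last time.
        Returns the set of IDs to carry over to the next iteration.
        """
        still_running = set()
        for build in builds:
            if build["state"] in ("scheduled", "running"):
                # Not finished yet: remember it for the next pass.
                still_running.add(build["id"])
            elif build["id"] in previously_running:
                # Was running last time and has completed since: report it once.
                report_build(build)
        # Buildkite has at most ~100 pending builds, so this set stays small.
        return still_running
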
Nathan Gauër
1282878c52
[CI] Fix bad timestamps being reported (#130941)
Yesterday, the monitoring reported a job queued for 23h59. After some
checks, it appeared that no such job existed: the age of the workflows on
completion was at most 5 hours during the last 48 hours.

After some digging, I found out GitHub could return a job with a start
date slightly before the creation date, or a completion date before the
start date.
This would cause Python to compute a negative timedelta, which would
then be reported in Grafana as a full 24h delta due to the conversions.

Add code to ignore negative deltas, but log them.
2025-03-13 10:18:02 +01:00
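
A sketch of the guard described above, assuming the relevant datetimes are
already parsed; the ~24h artifact is plausibly explained by a negative
timedelta's .seconds field, which sits just below 86400 for a small negative
delta:

    import logging
    from datetime import datetime, timedelta

    def job_timings(created_at: datetime, started_at: datetime,
                    completed_at: datetime):
        """Return (queue_time, run_time), dropping impossible negative deltas."""
        queue_time = started_at - created_at
        run_time = completed_at - started_at
        if queue_time < timedelta(0) or run_time < timedelta(0):
            # GitHub occasionally returns started_at slightly before created_at,
            # or completed_at before started_at. A negative timedelta's .seconds
            # is just under 86400, which is how a bogus ~24h value could reach
            # Grafana, so skip the sample instead of reporting it.
            logging.warning("Negative timing detected: queue=%s run=%s",
                            queue_time, run_time)
            return None
        return queue_time, run_time
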
Nathan Gauër
389a705b8e
[CI] Rework github workflow processing (#130317)
Before this patch, the job/workflow name impacted the metric name,
meaning a change in the workflow definition could break monitoring. This
patch adds a map to derive a stable metric name from a workflow name.

In addition, it reworks a bit how we track the last processed workflow:
the GitHub queries are broken if filtering is applied, meaning we have a
list of workflows, ordered by 'created_at', which mixes completed and
running workflows.
We have no guarantees over the order of completion, meaning we cannot
stop at the first completed job we find (even per-workflow).

This PR processes the last 1000 workflows, but allows an early stop if
the created_at time is older than 8 hours. This means we could miss
long-running workflows (>8 hours), and if the number of workflows
started before another one completes becomes high (>1000), we will miss
it.
To detect this kind of behavior, a new metric is added, "oldest workflow
processed", which should at least indicate if the depth is too small.

An alternative without an arbitrary cutoff would be to initially parse
all workflows, record the last non-completed one we find, and always
start from it (moving the lower bound as workflows complete). But LLVM
has forever-queued workflow runs (>1 year), so this would cause us to
iterate over a very large number of jobs.

---------

Signed-off-by: Nathan Gauër <brioche@google.com>
2025-03-11 14:16:18 +01:00
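
A sketch of the traversal described above; the map contents, constants, and
dict keys are illustrative, and emit_workflow_metrics stands in for the real
reporting code:

    from datetime import datetime, timedelta, timezone

    # Map the (renameable) workflow display name to a stable metric prefix, so
    # a change in the workflow definition does not silently break dashboards.
    WORKFLOW_TO_METRIC_NAME = {
        "LLVM Premerge Checks": "github_llvm_premerge_checks",
    }

    MAX_WORKFLOWS_TO_PROCESS = 1000
    MAX_WORKFLOW_AGE = timedelta(hours=8)

    def process_recent_workflows(workflow_runs, emit_workflow_metrics):
        """Walk runs ordered by created_at (newest first) with depth/age cutoffs."""
        now = datetime.now(timezone.utc)
        oldest_processed = now
        for i, run in enumerate(workflow_runs):
            if i >= MAX_WORKFLOWS_TO_PROCESS:
                break
            if now - run["created_at"] > MAX_WORKFLOW_AGE:
                break  # everything after this run is older still
            if run["name"] not in WORKFLOW_TO_METRIC_NAME:
                continue
            oldest_processed = min(oldest_processed, run["created_at"])
            if run["status"] == "completed":
                emit_workflow_metrics(WORKFLOW_TO_METRIC_NAME[run["name"]], run)
        # Reported as "oldest workflow processed": if this stays too recent, the
        # depth/age cutoffs are probably too small.
        return oldest_processed
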
Nathan Gauër
5d50af3f03
Revert "[CI] Extend metrics container to log BuildKite metrics" (#130770)
Reverts llvm/llvm-project#129699
2025-03-11 14:15:44 +01:00
Nathan Gauër
3df8be3ee9
[CI] Extend metrics container to log BuildKite metrics (#129699)
The current container focuses on GitHub metrics. Before deprecating
BuildKite, we want to make sure the quality of the new infra is at least
as good.

Logging BuildKite metrics alongside GitHub metrics lets us present the
comparison directly in Grafana.

This PR requires https://github.com/llvm/llvm-zorg/pull/400 to be merged
first.
2025-03-11 14:11:07 +01:00
Aiden Grossman
cef6dbbe54
[CI] Add Logging for Workflow Jobs
This patch adds some logging information for individual workflow jobs inside
the metrics container. This is mainly intended for debugging why we seem to be
missing metrics from some workflows within Grafana.
2025-03-01 03:06:57 +00:00
Aiden Grossman
3c518940b0
[CI] Make Metrics Container Use Python Logging
This patch makes the metrics container use the Python logging library. This
is a better fit given we are essentially just logging the status of things.
It also means we do not have to explicitly specify an output file, and it
lets us control verbosity a bit more cleanly.
2025-03-01 03:03:24 +00:00
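
A minimal sketch of the switch; the environment variable controlling
verbosity is an invented name:

    import logging
    import os

    # One logging configuration instead of an explicit output file argument;
    # verbosity is controlled in a single place.
    logging.basicConfig(
        level=os.environ.get("METRICS_LOG_LEVEL", "INFO"),
        format="%(asctime)s %(levelname)s %(message)s",
    )

    logging.info("Fetched %d workflow runs", 42)
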
Aiden Grossman
b24e14093d
[CI] Keep Track of Workflow Name Instead of Job Name
The metrics script includes some logic to only look at workflows up
to the most recent workflow it has seen previously. This was broken in a
previous patch when workflow metrics began to be emitted per job. The
logic ending the metrics gathering would never trigger, so we would
continually fetch more and more workflows until OOM.
2025-02-15 06:16:08 +00:00
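
A sketch of the fixed termination logic, assuming runs are dicts ordered
newest first and the cursor is a dict mapping workflow name to the last
processed run ID (names illustrative):

    def collect_new_runs(workflow_runs, last_seen_run_id_by_workflow):
        """Gather runs newer than the last one recorded for each workflow.

        The cursor dict must be keyed by the *workflow* name: once metrics were
        emitted per job, keying it by job name meant the lookup below never
        matched, the loop never stopped, and the container fetched runs until OOM.
        """
        new_runs = []
        for run in workflow_runs:  # newest first
            if last_seen_run_id_by_workflow.get(run["name"]) == run["id"]:
                break
            new_runs.append(run)
        return new_runs
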
Aiden Grossman
d7b89b0dca
[CI] Do Not Consider a Job Failed if Steps Were Skipped
This patch makes it so that skipped steps do not cause a job to be
considered failed. The Windows premerge jobs currently skip the
build/test step if there are no projects to build/test. These runs show
up as failures in the dashboard even though everything executed
perfectly fine.

Reviewers: lnihlen, Keenuts

Reviewed By: lnihlen

Pull Request: https://github.com/llvm/llvm-project/pull/127279
2025-02-14 19:14:56 -08:00
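
A sketch of the per-job check, assuming each step is a dict with a
conclusion field as in the GitHub REST payload:

    def job_failed(steps: list[dict]) -> bool:
        """A job only counts as failed if a step actually failed.

        Skipped steps (e.g. the Windows build/test step when there are no
        projects to build) must not mark the whole job as failed.
        """
        return any(
            step.get("conclusion") not in ("success", "skipped")
            for step in steps
        )
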
Aiden Grossman
97d2cfeab3
[CI] Try Moving Github Object Into Loop
Currently the metrics container is crashing reasonably often with
incomplete read/connection broken errors. Try moving the creation of the
Github object into the main loop to see if recreating the object, which
may hold some connection state, fixes the issue.

Reviewers: Keenuts, lnihlen

Reviewed By: lnihlen

Pull Request: https://github.com/llvm/llvm-project/pull/127276
2025-02-14 19:12:16 -08:00
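
A sketch of the change, assuming the PyGithub client; recreating it on every
pass discards whatever connection state might be behind the incomplete-read
errors. The token variable and sampling interval are illustrative:

    import os
    import time

    from github import Github  # PyGithub

    def main():
        while True:
            # Recreate the client inside the loop rather than reusing one
            # instance for the lifetime of the container.
            github_client = Github(os.environ["GITHUB_TOKEN"])
            repo = github_client.get_repo("llvm/llvm-project")
            # ... collect and upload metrics for this iteration using `repo` ...
            time.sleep(5 * 60)
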
Aiden Grossman
4aeb2f1c79
[CI] Remove Duplicate Heartbeat in Metrics Script
This patch removes an extra heartbeat metric in the metrics Python file.
Previously the heartbeat was emitted twice, once in the main function and
once in the get_sampled_workflow_metrics function. We only need one to
keep everything happy, and I've chosen to keep the one in
get_sampled_workflow_metrics as that seems the more appropriate place for
it.

Reviewers: Keenuts, lnihlen

Reviewed By: lnihlen

Pull Request: https://github.com/llvm/llvm-project/pull/127275
2025-02-14 19:10:51 -08:00
Aiden Grossman
2d878ccf54
[CI] Track Queue/In Progress Metrics By Job Rather Than Workflow
This patch makes it so that the metrics container counts the number of
in-progress and queued jobs at the job level rather than at the workflow
level. This helps us distinguish Windows versus Linux load and also lets
us filter out the macOS jobs that only run on the release branch.

Reviewers: Keenuts, lnihlen

Reviewed By: lnihlen

Pull Request: https://github.com/llvm/llvm-project/pull/127274
2025-02-14 19:08:45 -08:00
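
A sketch of the job-level counting, assuming each active run carries its
jobs as dicts with name and status fields; the tracked job names are
illustrative:

    from collections import defaultdict

    # Illustrative: only these job names are tracked, which separates Windows
    # from Linux load and drops the macOS jobs that only run on release branches.
    TRACKED_JOBS = ("Build and Test Linux", "Build and Test Windows")

    def count_job_states(active_runs):
        """Count queued and in-progress jobs, per job name, across active runs."""
        queued = defaultdict(int)
        in_progress = defaultdict(int)
        for run in active_runs:
            for job in run["jobs"]:
                if job["name"] not in TRACKED_JOBS:
                    continue
                if job["status"] == "queued":
                    queued[job["name"]] += 1
                elif job["status"] == "in_progress":
                    in_progress[job["name"]] += 1
        return queued, in_progress
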
Aiden Grossman
7f24b9acd1
[CI] Support multiple jobs in metrics container (#124457)
This patch makes it so that the metrics script can support multiple jobs
in a single workflow. This is needed so that we do not crash on an
assertion now that the Windows job has been enabled within the premerge
workflow.
2025-01-27 17:05:05 +01:00
Nathan Gauër
13b44283e9
[CI] Add queue size, running count metrics (#122714)
This commit allows the container to report 3 additional metrics at
every sampling event:
- a heartbeat
- the size of the workflow queue (filtered)
- the number of running workflows (filtered)

The heartbeat is a simple metric allowing us to monitor the health of
the metrics pipeline. Before this commit, a new metric was pushed only
when a workflow was completed. This meant we had to wait a few hours
before noticing if the metrics container was unable to push metrics.

In addition to this, this commit adds a sampling of the workflow
queue size and running count. This should allow us to better understand
the load and improve the autoscale values we pick for the cluster.

---------

Signed-off-by: Nathan Gauër <brioche@google.com>
2025-01-16 11:41:49 +01:00
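
A sketch of the heartbeat, reusing the same hypothetical GaugeMetric shape as
in the earlier sketch; pushing it at every sampling event makes a stalled
container visible within one interval instead of hours later:

    import time
    from dataclasses import dataclass

    @dataclass
    class GaugeMetric:  # hypothetical record shape, repeated for completeness
        name: str
        value: int
        time_ns: int

    def heartbeat_metric() -> GaugeMetric:
        """Constant gauge pushed at every sampling event; if the series stops
        updating, the container itself is unhealthy."""
        return GaugeMetric("metrics_container_heartbeat", 1, time.time_ns())
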
Nathan Gauër
05f9cdd58d
[CI] Remove Check Clang Format from watched workflows (#122740)
This was useful to test metrics before we had an actual workflow; now it
just generates noise.

Signed-off-by: Nathan Gauër <brioche@google.com>
2025-01-14 11:09:48 +01:00
Aiden Grossman
eabf9313d4
[CI] Detect step failures in metrics job (#122564)
This patch makes the metrics job also detect failures in individual
steps. This is necessary now that we are setting continue-on-error in
the premerge jobs to prevent sending out unnecessary email; with that
set, we have to look at individual steps to detect which jobs actually
fail.
2025-01-11 14:04:03 -08:00
Nathan Gauër
3bcfa1a579
[Github] Add LLVM Premerge Checks to the watchlist (#120230)
LLVM Premerge Checks is running on the new GCP cluster. Tracking its
metrics will allow us to determine the stability of the presubmit and
make sure the new infra is working as intended.

---------

Signed-off-by: Nathan Gauër <brioche@google.com>
2024-12-18 09:58:56 +01:00
Aiden Grossman
77c2b00553
[CI] Upstream metrics script and container definition (#117461)
This patch includes the script that pulls information from GitHub and
pushes it to Grafana. This is currently running in the cluster and
pushes information to
https://llvm.grafana.net/public-dashboards/6a1c1969b6794e0a8ee5d494c72ce2cd.
The script is designed to accept other jobs relatively easily and can be
modified to look at other metrics.
2024-11-29 11:15:44 -08:00