llvm-project

mirror of https://github.com/llvm/llvm-project.git synced 2025-04-15 23:56:30 +00:00

Author	SHA1	Message	Date
Nathan Gauër	fe4f666363	[CI] Always upload queue/running count (#134814) Before this commit, we only pushed a queue/running count when the value was not zero. This makes building Grafana alerting a bit harder. Change this to always upload a value for watched workflows.	2025-04-08 11:16:24 +02:00
Nathan Gauër	77edfbb96c	[CI] Don't count canceled buildkite builds (#132015) We don't count canceled jobs on GCP, so we shouldn't count canceled jobs on Buildkite either. Signed-off-by: Nathan Gauër <brioche@google.com>	2025-03-21 10:14:44 +01:00
Aiden Grossman	0619892cab	[CI] Bump max workflow to process count in metrics This patch bumps the maximum number of workflows to look through when collecting metrics data. We are currently running into issues where we are losing data due to the most recent 1000 workflows not containing the workflows that we actually need to query. Just double it for now. I plan on monitoring this reasonably closely to ensure we do not run into issues, mainly API rate limits.	2025-03-18 19:34:57 +00:00
Nathan Gauër	05df923b0e	[CI] Add dateutil dependency to the metrics container (#131333)	2025-03-14 14:45:44 +01:00
Nathan Gauër	44f4e43b4f	[CI] Extend metrics container to log BuildKite metrics (#130996) The current container focuses on Github metrics. Before deprecating BuildKite, we want to make sure the new infra quality is better, or at least the same. Being able to compare buildkite metrics with github metrics on grafana will allow us to easily present the comparison. The BuildKite API allows filtering, but doesn't allow changing the result ordering, meaning we are left with builds ordered by IDs. This means a completed job can appear before a running job in the list. Two solutions from there: (1) keep the cursor on the oldest running workflow, or (2) keep a list of running workflows to compare. Because there are no guarantees in workflow ordering, waiting for the oldest build to complete before reporting any newer build could mean delaying the more recent build completion reporting by a few hours. And because grafana cannot ingest metrics older than 2 hours, this is not an option. Thus we are left with the second solution: remember which jobs were running during the last iteration, and record them as soon as they are completed. Buildkite has at most ~100 pending jobs, so keeping all those IDs should be OK.	2025-03-14 11:44:39 +01:00
Nathan Gauër	1282878c52	[CI] Fix bad timestamps being reported (#130941) Yesterday, the monitoring reported a job queued for 23h59. After some checks, it appears no such job existed: the age of the workflows on completion was at most 5 hours during the last 48 hours. After some digging, I found out GitHub could return a job with a start date slightly before the creation date, or a completion date before the start date. This would cause python to compute a negative timedelta, which would then be reported in grafana as a full 24h delta due to the conversions. Add code to ignore negative deltas, but log them.	2025-03-13 10:18:02 +01:00
Nathan Gauër	389a705b8e	[CI] Rework github workflow processing (#130317) Before this patch, the job/workflow name impacted the metric name, meaning a change in the workflow definition could break monitoring. This patch adds a map to get a stable metric name from a workflow name. In addition, it reworks a bit how we track the last processed workflow: the github queries are broken if filtering is applied, meaning we have a list of workflows, ordered by 'created_at', which mixes completed & running workflows. We have no guarantees over the order of completion, meaning we cannot stop at the first completed job we find (even per-workflow). This PR processes the last 1000 workflows, but allows an early stop if the created_at time is older than 8 hours. This means we could miss long-running workflows (>8 hours), and if the number of workflows started before another one completes becomes high (>1000), we'll miss it. To detect this kind of behavior, a new metric is added, "oldest workflow processed", which should at least indicate if the depth is too small. An alternative without an arbitrary cutoff would be to initially parse all workflows, and then record the last non-completed one we find and always start from the last (moving the lower bound as they complete). But LLVM has forever-queued workflow runs (>1 year), hence this would cause us to iterate over a very large number of jobs. --------- Signed-off-by: Nathan Gauër <brioche@google.com>	2025-03-11 14:16:18 +01:00
Nathan Gauër	5d50af3f03	Revert "[CI] Extend metrics container to log BuildKite metrics" (#130770 ) Reverts llvm/llvm-project#129699	2025-03-11 14:15:44 +01:00
Nathan Gauër	3df8be3ee9	[CI] Extend metrics container to log BuildKite metrics (#129699 ) The current container focuses on Github metrics. Before deprecating BuildKite, we want to make sure the new infra quality is better, or at least the same. Being able to compare buildkite metrics with github metrics on grafana will allow us to easily present the comparison. This PR requires https://github.com/llvm/llvm-zorg/pull/400 to be merged first.	2025-03-11 14:11:07 +01:00
Aiden Grossman	cef6dbbe54	[CI] Add Logging for Workflow Jobs This patch adds some logging information for individual workflow jobs inside the metrics container. This is mainly intended for debugging why we seem to be missing metrics from some workflows within Grafana.	2025-03-01 03:06:57 +00:00
Aiden Grossman	3c518940b0	[CI] Make Metrics Container Use Python Logging This patch makes the metrics container use the python logging library. This is more of what we want given we're essentially just logging the status of things. It also means we do not have to explicitly specify an output file and lets us control verbosity a bit more cleanly.	2025-03-01 03:03:24 +00:00
Aiden Grossman	b24e14093d	[CI] Keep Track of Workflow Name Instead of Job Name The metrics script includes some logic to only look at workflows up to the most recent workflow it has seen previously. This was broken in a previous patch when workflow metrics began to be emitted per job. The logic ending the metrics gathering would never trigger, so we would continually fetch more and more workflows until OOM.	2025-02-15 06:16:08 +00:00
Aiden Grossman	d7b89b0dca	[CI] Do Not Consider a Job Failed if Steps Were Skipped This patch makes it so that skipped steps do not cause a job to be considered failed. The windows premerge jobs currently skip the build/test step if there are no projects to build/test. These show up as failures in the dashboard even though everything executed perfectly fine. Reviewers: lnihlen, Keenuts Reviewed By: lnihlen Pull Request: https://github.com/llvm/llvm-project/pull/127279	2025-02-14 19:14:56 -08:00
Aiden Grossman	97d2cfeab3	[CI] Try Moving Github Object Into Loop Currently the metrics container is crashing reasonably often with incomplete read/connection broken errors. Try moving the creation of the Github Object into the main loop to see if recreating the object that maybe handles some connection state fixes the issue. Reviewers: Keenuts, lnihlen Reviewed By: lnihlen Pull Request: https://github.com/llvm/llvm-project/pull/127276	2025-02-14 19:12:16 -08:00
Aiden Grossman	4aeb2f1c79	[CI] Remove Duplicate Heartbeat in Metrics Script This patch removes an extra heartbeat metric in the metrics python file. Before it was performed twice, once in the main function, and once in the get_sampled_workflow_metrics function. We only need one to keep everything happy, and I've chosen to keep the one in get_sampled_workflow_metrics as it seems a more appropriate place to keep it. Reviewers: Keenuts, lnihlen Reviewed By: lnihlen Pull Request: https://github.com/llvm/llvm-project/pull/127275	2025-02-14 19:10:51 -08:00
Aiden Grossman	2d878ccf54	[CI] Track Queue/In Progress Metrics By Job Rather Than Workflow This patch makes it so that the metrics container counts the number of in progress and queued jobs at the job level rather than at the workflow level. This helps us distinguish windows versus linux load and also lets us filter out the MacOS jobs that only run in the release branch. Reviewers: Keenuts, lnihlen Reviewed By: lnihlen Pull Request: https://github.com/llvm/llvm-project/pull/127274	2025-02-14 19:08:45 -08:00
Aiden Grossman	7f24b9acd1	[CI] Support multiple jobs in metrics container (#124457 ) This patch makes it so that the metrics script can support multiple jobs in a single workflow. This is needed so that we do not crash on an assertion now that the windows job has been enabled within the premerge workflow.	2025-01-27 17:05:05 +01:00
Nathan Gauër	13b44283e9	[CI] Add queue size, running count metrics (#122714) This commit allows the container to report 3 additional metrics at every sampling event: a heartbeat, the size of the workflow queue (filtered), and the number of running workflows (filtered). The heartbeat is a simple metric allowing us to monitor the metrics health. Before this commit, a new metric was pushed only when a workflow was completed. This meant we had to wait a few hours before noticing if the metrics container was unable to push metrics. In addition to this, this commit adds a sampling of the workflow queue size and running count. This should allow us to better understand the load, and improve the autoscale values we pick for the cluster. --------- Signed-off-by: Nathan Gauër <brioche@google.com>	2025-01-16 11:41:49 +01:00
Nathan Gauër	05f9cdd58d	[CI] Remove Check Clang Format from watched workflows (#122740 ) This was useful to test metrics before we had an actual workflow, now it generates noise. Signed-off-by: Nathan Gauër <brioche@google.com>	2025-01-14 11:09:48 +01:00
Aiden Grossman	eabf9313d4	[CI] Detect step failures in metrics job (#122564) This patch makes the metrics job also detect failures in individual steps. This is necessary to detect which jobs actually fail, now that we are setting continue-on-error in the premerge jobs to avoid sending out unnecessary email.	2025-01-11 14:04:03 -08:00
Nathan Gauër	3bcfa1a579	[Github] Add LLVM Premerge Checks to the watchlist (#120230 ) LLVM Premerge Checks is running on the new GCP cluster. Tracking its metrics will allow us to determine the stability of the presubmit and make sure the new infra is working as intended. --------- Signed-off-by: Nathan Gauër <brioche@google.com>	2024-12-18 09:58:56 +01:00
Aiden Grossman	77c2b00553	[CI] Upstream metrics script and container definition (#117461 ) This patch includes the script that pulls information from Github and pushes it to Grafana. This is currently running in the cluster and pushes information to https://llvm.grafana.net/public-dashboards/6a1c1969b6794e0a8ee5d494c72ce2cd. This script is designed to accept other jobs relatively easily and can be easily modified to look at other metrics.	2024-11-29 11:15:44 -08:00
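
The timestamp fix in 1282878c52 can be illustrated with a short sketch. It shows one plausible way a negative `timedelta` turns into a roughly 24-hour value (reading its `.seconds` field), and a guard that logs and skips such deltas. The function name `queue_seconds` and the choice of `.seconds` as the faulty conversion are assumptions for illustration; the actual conversion used by the metrics script is not shown in this log.

```python
import logging
from datetime import datetime, timedelta, timezone


def queue_seconds(created_at, started_at):
    """Return the queue time in seconds, or None if the delta is negative.

    Hypothetical helper: a negative delta is logged and ignored rather
    than reported.
    """
    delta = started_at - created_at
    if delta < timedelta(0):
        logging.warning(
            "Ignoring negative delta: created_at=%s started_at=%s",
            created_at,
            started_at,
        )
        return None
    return delta.total_seconds()


# A job whose start date is one second before its creation date.
created = datetime(2025, 3, 12, 10, 0, 0, tzinfo=timezone.utc)
started = created - timedelta(seconds=1)

# Python normalizes timedelta(seconds=-1) to days=-1, seconds=86399,
# so reading only `.seconds` yields 23h59m59s instead of -1s.
naive = (started - created).seconds

# The guarded helper drops the bogus sample instead.
guarded = queue_seconds(created, started)
```

Here `naive` is 86399 seconds, which matches the "queued for 23h59" symptom described in the commit message, while `guarded` is `None`.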

22 Commits
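
The second solution described in 44f4e43b4f (remember which builds were running during the last iteration, and record them once they complete) could be sketched as follows. This is a minimal illustration under stated assumptions: the build dictionaries, the `id` and `state` keys, the state names, and the function `collect_completed` are all hypothetical and are not taken from the BuildKite API or the metrics script. The sketch also only covers builds that were seen running at least once.

```python
# Hypothetical state names for builds that have not finished yet.
UNFINISHED_STATES = {"scheduled", "running"}


def collect_completed(builds, previously_running):
    """Split builds into newly completed ones and the new running set.

    builds: iterable of dicts with hypothetical 'id' and 'state' keys,
            in whatever order the API returned them.
    previously_running: set of build IDs seen unfinished last iteration.
    """
    completed = []
    still_running = set()
    for build in builds:
        if build["state"] in UNFINISHED_STATES:
            # Remember it so its completion is reported on a later pass.
            still_running.add(build["id"])
        elif build["id"] in previously_running:
            # Was running last time and is finished now: report it.
            completed.append(build)
    return completed, still_running


# One sampling iteration: build 1 finished, builds 2 and 3 are running.
previous = {1, 2}
sample = [
    {"id": 1, "state": "passed"},
    {"id": 2, "state": "running"},
    {"id": 3, "state": "running"},
]
completed, running = collect_completed(sample, previous)
```

Because the result ordering does not matter to this approach, a completed build listed before a running one is still handled. With roughly 100 pending builds at most, as the commit message notes, the set of remembered IDs stays small.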