Before this commit, we only pushed a queue/running count when the value
was not zero. This made building Grafana alerting a bit harder.
Change this to always upload a value for watched workflows.
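
A minimal sketch of the idea, not the script's actual code: build the
counts from the list of watched workflows rather than from what was
observed, so that a zero is emitted explicitly. The workflow names and
the `count_by_workflow` helper are hypothetical.

```python
from collections import Counter

# Hypothetical watched workflow names; the real script keeps its own list.
WATCHED_WORKFLOWS = ["premerge-linux", "premerge-windows"]


def count_by_workflow(observed_names):
    """Return one count per watched workflow, including explicit zeros."""
    seen = Counter(n for n in observed_names if n in WATCHED_WORKFLOWS)
    # Iterate over the watched list, not over `seen`, so that workflows
    # with nothing queued/running still get a data point.
    return {name: seen[name] for name in WATCHED_WORKFLOWS}
```

With this shape an alert can distinguish "count is 0" from "no data".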
This patch bumps the maximum number of workflows to look through when
collecting metrics data. We are currently running into issues where we
are losing data due to the most recent 1000 workflows not containing the
workflows that we actually need to query. Just double it for now.
I plan on monitoring this reasonably closely to ensure we do not run
into issues, mainly API rate limits.
The current container focuses on Github metrics. Before deprecating
BuildKite, we want to make sure the new infra quality is better, or at
least the same.
Being able to compare buildkite metrics with github metrics on grafana
will allow us to easily present the comparison.
The BuildKite API allows filtering, but doesn't allow changing the
result ordering, so we are left with builds ordered by ID. This means a
completed job can appear before a running job in the list. There are two
solutions from there:
- keep the cursor on the oldest running workflow
- keep a list of running workflows to compare.
Because there are no guarantees on workflow ordering, waiting for the
oldest build to complete before reporting any newer build could mean
delaying the more recent build completion reporting by a few hours. And
because Grafana cannot ingest metrics older than 2 hours, this is not an
option.
Thus we are left with the second solution: remember what jobs were running
during the last iteration, and record them as soon as they are
completed. Buildkite has at most ~100 pending jobs, so keeping all those
IDs should be OK.
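
The second solution can be sketched as below. This is an illustration
under assumptions, not the container's code: `update_running` is a
hypothetical helper, and the "running"/"finished" labels stand in for
whatever states the API actually returns.

```python
def update_running(previously_running, builds):
    """Process one polling pass.

    previously_running: set of build IDs that were running last pass.
    builds: iterable of (build_id, state) pairs, in any order.
    Returns (newly_completed_ids, now_running_ids).
    """
    states = dict(builds)
    now_running = {bid for bid, state in states.items() if state == "running"}
    # A build is reported once: when it was running before and is
    # finished now, wherever it sits in the list.
    completed = {
        bid for bid in previously_running if states.get(bid) == "finished"
    }
    return completed, now_running
```

The caller keeps `now_running` for the next iteration and records a
metric for each ID in `completed`.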
Yesterday, the monitoring reported a job queued for 23h59. After some
checks, it appears no such job existed: the age of the workflows on
completion was at most 5 hours during the last 48 hours.
After some digging, I found out GitHub could return a job with a start
date slightly before the creation date, or completion date before start
date.
This would cause Python to compute a negative timedelta, which would
then be reported in Grafana as a full 24h delta due to the conversions.
This patch adds code to ignore negative deltas, while still logging them.
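
For illustration, here is one way such a wrap-around can arise in Python
and how a guard could look. Whether the script read the `seconds` field
exactly like this is an assumption, and `elapsed_seconds` is a
hypothetical helper.

```python
import datetime
import logging


def elapsed_seconds(earlier, later):
    """Return later - earlier in seconds, or None for a negative delta."""
    delta = later - earlier
    if delta < datetime.timedelta(0):
        # Keep a trace of the inconsistent timestamps, but do not report them.
        logging.warning("Ignoring negative delta %s (%s -> %s)",
                        delta, earlier, later)
        return None
    return delta.total_seconds()


# A negative timedelta is normalized to days=-1 plus a positive remainder,
# so reading only the `seconds` field yields almost a full day.
wrapped = datetime.timedelta(seconds=-1)
print(wrapped.days, wrapped.seconds)  # prints: -1 86399
```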
Before this patch, the job/workflow name impacted the metric name,
meaning a change in the workflow definition could break monitoring. This
patch adds a map to get a stable name on metrics from a workflow name.
In addition, it reworks a bit how we track the last processed workflow:
the GitHub queries are broken if filtering is applied, meaning we have a
list of workflows, ordered by 'created_at', which mixes completed &
running workflows.
We have no guarantees over the order of completion, meaning we cannot
stop at the first completed job we found (even per-workflow).
This PR processes the last 1000 workflows, but allows an early stop if
the created_at time is older than 8 hours. This means we could miss
long-running workflows (>8 hours), and if the number of workflows
started before another one completes becomes high (>1000), we'll miss
it.
To detect this kind of behavior, a new metric, "oldest workflow
processed", is added, which should at least indicate if the depth is
too small.
An alternative without arbitrary cut would be to initially parse all
workflows, and then record the last non-completed one we find and always
start from the last (moving the lower bound as they complete). But LLVM
has forever-queued workflow runs (>1 year), hence this would cause us
to iterate over a very large number of jobs.
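
A rough sketch of the bounded scan described above, assuming the input
is a newest-first list of (created_at, status) pairs. The `scan` helper
and the constant names are hypothetical; only the 1000 and 8 hours
values come from this PR.

```python
import datetime

MAX_WORKFLOWS = 1000                     # depth limit
MAX_AGE = datetime.timedelta(hours=8)    # early-stop threshold


def scan(workflows, now):
    """Return (created_at of completed workflows, oldest created_at seen)."""
    completed = []
    oldest_seen = None
    for index, (created_at, status) in enumerate(workflows):
        # Stop on either bound; anything past this point is not examined.
        if index >= MAX_WORKFLOWS or now - created_at > MAX_AGE:
            break
        oldest_seen = created_at
        # Completed and running workflows are interleaved, so keep going
        # instead of stopping at the first completed one.
        if status == "completed":
            completed.append(created_at)
    return completed, oldest_seen
```

`oldest_seen` is the value the "oldest workflow processed" metric would
carry: if it stays much younger than 8 hours, the depth limit is what
ended the scan.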
---------
Signed-off-by: Nathan Gauër <brioche@google.com>
The current container focuses on Github metrics. Before deprecating
BuildKite, we want to make sure the new infra quality is better, or at
least the same.
Being able to compare buildkite metrics with github metrics on grafana
will allow us to easily present the comparison.
This PR requires https://github.com/llvm/llvm-zorg/pull/400 to be merged
first.
This patch adds some logging information for individual workflow jobs inside
the metrics container. This is mainly intended for debugging why we seem to be
missing metrics from some workflows within Grafana.
This patch makes the metrics container use the python logging library. This
is more of what we want given we're essentially just logging the status of
things. It also means we do not have to explicitly specify an output file
and lets us control verbosity a bit more cleanly.
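
As a sketch of what this looks like with the standard library (the
`configure_logging` helper, the logger name, and the format string are
illustrative choices, not necessarily what the container uses):

```python
import logging


def configure_logging(verbose=False):
    """Set up root logging; without a filename, output goes to stderr."""
    logging.basicConfig(
        level=logging.DEBUG if verbose else logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",
        force=True,  # replace any handlers installed earlier (Python 3.8+)
    )
    return logging.getLogger("metrics")


log = configure_logging(verbose=False)
log.info("Pushing %d metrics", 3)      # emitted at INFO
log.debug("Raw payload: %s", "...")    # suppressed unless verbose=True
```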
The metrics script includes some logic to only look at workflows up
to the most recent workflow it has seen previously. This was broken in a
previous patch when workflow metrics began to be emitted per job. The
logic ending the metrics gathering would never trigger, so we would
continually fetch more and more workflows until OOM.
This patch makes it so that skipped steps do not cause a job to be
considered failed. The windows premerge jobs currently skip the
build/test step if there are no projects to build/test. These show up as
failures in the dashboard even though everything executed perfectly
fine.
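
A minimal sketch of the check, assuming each step is a dict with a
`conclusion` field holding GitHub's step conclusion string. The
`job_succeeded` helper is hypothetical.

```python
def job_succeeded(steps):
    """A job passes if every step either succeeded or was skipped."""
    return all(step["conclusion"] in ("success", "skipped") for step in steps)


steps = [
    {"name": "checkout", "conclusion": "success"},
    {"name": "build/test", "conclusion": "skipped"},  # no projects to build
]
print(job_succeeded(steps))  # prints: True
```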
Reviewers: lnihlen, Keenuts
Reviewed By: lnihlen
Pull Request: https://github.com/llvm/llvm-project/pull/127279
Currently the metrics container is crashing reasonably often with
incomplete read/connection broken errors. Try moving the creation of the
Github object into the main loop, in case that object holds some
connection state and recreating it fixes the issue.
Reviewers: Keenuts, lnihlen
Reviewed By: lnihlen
Pull Request: https://github.com/llvm/llvm-project/pull/127276
This patch removes an extra heartbeat metric in the metrics python file.
Previously it was emitted twice, once in the main function, and once in the
get_sampled_workflow_metrics function. We only need one to keep everything
happy, and I've chosen to keep the one in get_sampled_workflow_metrics as it
seems a more appropriate place to keep it.
Reviewers: Keenuts, lnihlen
Reviewed By: lnihlen
Pull Request: https://github.com/llvm/llvm-project/pull/127275
This patch makes it so that the metrics container counts the number of in
progress and queued jobs at the job level rather than at the workflow
level. This helps us distinguish windows versus linux load and also lets
us filter out the MacOS jobs that only run in the release branch.
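
Roughly, job-level counting amounts to keying the counts by job name and
status instead of by workflow. The sketch below assumes (job_name,
status) pairs as input; `count_jobs` and the job names are hypothetical.

```python
from collections import Counter


def count_jobs(jobs):
    """Count unfinished jobs, keyed by (job_name, status)."""
    return Counter(
        (name, status)
        for name, status in jobs
        if status in ("queued", "in_progress")
    )


counts = count_jobs([
    ("linux", "queued"),
    ("linux", "in_progress"),
    ("windows", "queued"),
    ("linux", "queued"),
    ("macos", "completed"),   # finished jobs are not counted
])
print(counts[("linux", "queued")])  # prints: 2
```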
Reviewers: Keenuts, lnihlen
Reviewed By: lnihlen
Pull Request: https://github.com/llvm/llvm-project/pull/127274
This patch makes it so that the metrics script can support multiple jobs
in a single workflow. This is needed so that we do not crash on an
assertion now that the windows job has been enabled within the premerge
workflow.
This commit allows the container to report 3 additional metrics at
every sampling event:
- a heartbeat
- the size of the workflow queue (filtered)
- the number of running workflows (filtered)
The heartbeat is a simple metric allowing us to monitor the health of
the metrics container. Before this commit, a new metric was pushed only
when a workflow was completed. This meant we had to wait a few hours
before noticing if the metrics container was unable to push metrics.
In addition to this, this commit adds a sampling of the workflow
queue size and running count. This should allow us to better understand
the load, and improve the autoscale values we pick for the cluster.
---------
Signed-off-by: Nathan Gauër <brioche@google.com>
This patch makes the metrics job also detect failures in individual
steps. This is necessary to detect which jobs actually fail now that we
are setting continue-on-error in the premerge jobs to prevent sending
out unnecessary email.
LLVM Premerge Checks is running on the new GCP cluster. Tracking its
metrics will allow us to determine the stability of the presubmit and
make sure the new infra is working as intended.
---------
Signed-off-by: Nathan Gauër <brioche@google.com>
This patch includes the script that pulls information from GitHub and
pushes it to Grafana. This is currently running in the cluster and
pushes information to
https://llvm.grafana.net/public-dashboards/6a1c1969b6794e0a8ee5d494c72ce2cd.
This script is designed to accept other jobs relatively easily and can
be modified to look at other metrics.