Beyond Green Checks: Measuring Real Pipeline Health with CI Vitals

We rely on CI/CD pipelines more than ever to deliver value quickly and reliably. But are they truly healthy? We've all felt the pain: slow builds delaying critical feedback, flaky tests eroding trust, unexpected failures blocking deployments at the worst possible moment, and runner costs creeping up... How much time are you really losing?

While modern CI/CD tools offer incredible power and flexibility, simply seeing that green checkmark doesn't tell the whole story about your pipeline's performance or the friction it might be causing your team.

Even within a single repository or project, many CI/CD platforms lack built-in tools for creating meaningful statistical dashboards or analyzing workflow trends over time. Getting a high-level view of pipeline health often requires digging through logs, manual checks, or investing significant time in building custom tooling.

Remember when Google introduced Web Vitals? It brought much-needed clarity to web performance by focusing on a handful of core metrics that directly impact user experience. We believe the developer experience and the overall efficiency of our CI/CD processes deserve the same focused approach. Without clear, actionable metrics, it's difficult to understand performance, identify bottlenecks, or measure the impact of improvements.

Introducing CI Vitals

That’s why we're proposing CI Vitals: a curated set of three core metrics designed specifically to provide a clear, consistent, and actionable view of your CI/CD pipeline health.

CI Vitals focus on three fundamental areas: Speed, Reliability, and Efficiency. Crucially, they are designed for straightforward interpretation. Like Web Vitals, lower values across the CI Vitals typically signify improvements in pipeline health and performance.

Meet the CI Vitals: WET, FR, POT

Let's dive into the three core CI Vitals:

1. WET (Workflow Execution Time)

WET measures the end-to-end duration (typically in seconds or minutes) that your key workflows (e.g., the main CI run on pushes to main) take to complete successfully. These are typically the workflows that run most frequently, protect the main branch, or are critical for deployment, so they have the largest impact on developer experience and delivery speed. Because a few outliers can skew a simple average, we usually describe typical performance with percentiles such as the 75th (p75) or 90th (p90). This metric directly shapes the developer feedback loop: faster workflows mean developers get results sooner, can merge code faster, and avoid costly context switching. A high WET leads to frustration and delays in shipping value, so improvements here are crucial.

WET - Workflow Execution Time - Diagram showing a timed pipeline from commit to deploy, emphasizing the developer feedback loop.
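To make the percentile idea concrete, here is a minimal sketch of how p75 and p90 could be computed from a list of run durations. It assumes you have already collected the durations (in seconds) of successful runs for one workflow; the function name `wet_percentiles` and the choice of the "inclusive" interpolation method are our own illustration, not a fixed definition.

```python
from statistics import quantiles


def wet_percentiles(durations_s):
    """Return (p75, p90) for the durations of successful runs, in seconds.

    Assumption: `durations_s` holds only runs that completed successfully
    for a single workflow; filtering happens before this call.
    """
    if len(durations_s) < 2:
        raise ValueError("need at least two successful runs")
    # quantiles(n=100) returns the 99 cut points p1..p99 (data is sorted internally).
    cuts = quantiles(durations_s, n=100, method="inclusive")
    return cuts[74], cuts[89]


p75, p90 = wet_percentiles([300, 100, 500, 200, 400])
print(f"WET p75={p75:.0f}s p90={p90:.0f}s")
```

With only a handful of runs the percentiles are rough; in practice you would likely compute them over a rolling window of recent runs.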

2. FR (Failure Rate)

FR represents the percentage of completed workflow runs (for a specific critical workflow) that end in a 'failure' status. It serves as the baseline measure of pipeline reliability and stability. A high FR indicates frequent breakages, build issues, or flaky tests, which erode trust and block deployments. However, the ideal FR isn't necessarily zero! A healthy CI pipeline must fail when real bugs or integration issues are introduced; that's its primary job. What FR helps you spot is the rate of false positives: failures caused by infrastructure glitches, flaky tests, or unexpected regressions even though the underlying code changes are sound. Conversely, a pipeline should never pass if there's a genuine bug that needs catching. A low, stable FR where failures accurately reflect real code problems is therefore the sign of a healthy, reliable pipeline, while an unexpected increase in FR is a critical signal demanding investigation, as it often points to underlying infrastructure problems or newly introduced flaky tests. Context is always key to interpreting FR changes and understanding why things are failing.

FR - Failure Rate - Diagram showing workflow runs with pass/fail status, highlighting the difference between valuable failures that catch bugs and problematic failures from flaky tests or infrastructure issues.
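As a rough illustration of the definition above, FR can be sketched as failures divided by completed runs. The input here is a plain list of run conclusions for one workflow; treating `None` as "not yet completed" and counting only the exact string `"failure"` are assumptions made for this example, and the name `failure_rate` is hypothetical.

```python
def failure_rate(conclusions):
    """Return FR as a percentage of completed runs that ended in 'failure'.

    Assumption: each item is a run's final conclusion string, or None
    if the run has not completed yet (those runs are ignored).
    """
    completed = [c for c in conclusions if c is not None]
    if not completed:
        return 0.0  # no completed runs, nothing to report
    failures = sum(1 for c in completed if c == "failure")
    return 100.0 * failures / len(completed)


runs = ["success", "failure", "success", "success", None]
print(f"FR = {failure_rate(runs):.1f}%")
```

Note that this number alone cannot tell a valuable failure from a flaky one; separating the two needs extra context, such as whether a re-run of the same commit passed.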

3. POT (Pipeline Overhead Time)

POT quantifies the unproductive time spent during a typical workflow run. This includes time spent waiting in queues and time spent on retries after a failure.

POT highlights inefficiency and friction, directly measuring wasted compute time and potential cost. It captures that frustrating feeling of waiting: waiting in queues, waiting for retries after a seemingly random failure. It's the CI equivalent of XKCD's famous "Compiling" comic: dead time in which developers could be productive but are simply blocked, and which often leads to costly context switching. High POT often means hitting that "Re-run failed jobs" button and waiting another 5, 10, or 15 minutes, unsure whether it will finally pass this time. That hurts morale and predictability, and it often correlates with harder-to-diagnose underlying issues.

POT - Pipeline Overhead Time - Diagram showing a workflow run timeline with queue time and retry time segments highlighted as unproductive overhead.
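One way to picture POT is as queue time plus retry time for a single workflow run. The sketch below is only one possible reading, under two assumptions of ours: queue time is the gap between an attempt being queued and starting, and retry time is the full duration of every attempt except the final one. The `Attempt` type, its field names, and `pipeline_overhead_time` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One attempt of a workflow run; timestamps are seconds (hypothetical fields)."""
    queued_at: float
    started_at: float
    finished_at: float


def pipeline_overhead_time(attempts):
    """Return POT in seconds for one run, given its attempts in order.

    Assumption: overhead = queue wait of every attempt
    + execution time of all attempts that had to be re-run.
    """
    if not attempts:
        return 0.0
    queue_time = sum(a.started_at - a.queued_at for a in attempts)
    retry_time = sum(a.finished_at - a.started_at for a in attempts[:-1])
    return queue_time + retry_time


# First attempt waits 30s, runs 300s and fails; the re-run waits 20s and passes.
attempts = [Attempt(0, 30, 330), Attempt(400, 420, 700)]
print(f"POT = {pipeline_overhead_time(attempts):.0f}s")
```

Under this reading, a run that passes on its first attempt with no queueing has a POT of zero, which matches the "lower is better" rule of thumb.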

Why These Three?

Together, WET (Speed), FR (Reliability), and POT (Efficiency/Waste) provide a holistic yet focused view of your pipeline's health. They cover the critical aspects of how fast your pipeline runs, how reliably it produces valid results, and how much time is wasted along the way. Monitoring these three Vitals gives you a powerful signal about the overall health and developer experience of your CI/CD process, regardless of the specific tools you use.

The Challenge of Tracking CI Vitals Manually

While the concepts behind WET, FR, and POT are clear, calculating them accurately and consistently across different CI/CD platforms requires significant engineering effort. If you were to build a monitoring solution yourself, you'd need to tackle:

In short, it's a substantial project in itself. Building and maintaining such a system often requires dedicated engineering resources, potentially even a team of its own, diverting valuable time and focus away from your core product development.

Cimatic: your CI/CD pipeline companion

This is where Cimatic comes in. We built Cimatic specifically to handle all this complexity for you, automatically tracking and visualizing CI Vitals. Cimatic is launching first with support for GitHub Actions and works right out of the box.

With Cimatic for GitHub Actions, you get:

(Stay tuned for integrations with other CI/CD providers in the future!)

Get Started with CI Vitals Today

Ready to move beyond simple green checks and gain real visibility into your pipeline health?

Join the waitlist for early access to Cimatic for GitHub Actions. Start tracking your CI Vitals and unlock a faster, more reliable development cycle.