Beyond Green Checks: Measuring Real Pipeline Health with CI Vitals
We rely on CI/CD pipelines more than ever to deliver value quickly and reliably. But are they truly healthy? We've all felt the pain: slow builds delaying critical feedback, flaky tests eroding trust, unexpected failures blocking deployments at the worst possible moment, and runner costs creeping up... How much time are you really losing?
While modern CI/CD tools offer incredible power and flexibility, simply seeing that green checkmark doesn't tell the whole story about your pipeline's performance or the friction it might be causing your team.
Even within a single repository or project, many CI/CD platforms lack built-in tools for creating meaningful statistical dashboards or analyzing workflow trends over time. Getting a high-level view of pipeline health often requires digging through logs, manual checks, or investing significant time in building custom tooling.
Remember when Google introduced Web Vitals? It brought much-needed clarity to web performance by focusing on a handful of core metrics that directly impact user experience. We believe the developer experience and the overall efficiency of our CI/CD processes deserve the same focused approach. Without clear, actionable metrics, it's difficult to understand performance, identify bottlenecks, or measure the impact of improvements.
Introducing CI Vitals
That’s why we're proposing CI Vitals: a curated set of three core metrics designed specifically to provide a clear, consistent, and actionable view of your CI/CD pipeline health.
CI Vitals focus on three fundamental areas: Speed, Reliability, and Efficiency. Crucially, they are designed for straightforward interpretation. Like Web Vitals, lower values across the CI Vitals typically signify improvements in pipeline health and performance.
Meet the CI Vitals: WET, FR, POT
Let's dive into the three core CI Vitals:
1. WET (Workflow Execution Time)
WET measures the end-to-end duration (typically in seconds or minutes) for your key workflows (e.g., the main CI run on pushes to main) to complete successfully. These are typically the workflows that run most frequently, protect the main branch, or are critical for deployment, and so have the largest impact on developer experience and delivery speed. To capture typical performance without letting outliers skew a simple average, we often focus on percentiles such as the 75th (p75) or 90th (p90). This metric directly shapes the developer feedback loop: faster workflows mean developers get results sooner, can merge code faster, and avoid costly context switching. High WET leads to frustration and delays in shipping value, so improvements here are crucial.
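As a rough illustration of the percentile idea, the sketch below computes p75 and p90 WET from a list of run records. The record shape (`conclusion`, `duration_s`), the function names, and the sample data are assumptions made for this example, not any CI provider's API. The nearest-rank method used here is only one of several common percentile definitions.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def wet(runs, pct=75):
    """WET over successful runs only, since WET is defined on runs that complete successfully."""
    durations = [r["duration_s"] for r in runs if r["conclusion"] == "success"]
    return percentile(durations, pct)


# Hypothetical sample: nine ordinary runs, one slow outlier, one failed run (ignored).
durations = [240, 250, 260, 270, 280, 290, 300, 310, 320, 1800]
runs = [{"conclusion": "success", "duration_s": d} for d in durations]
runs.append({"conclusion": "failure", "duration_s": 50})

mean = sum(durations) / len(durations)
print(f"mean={mean:.0f}s p75={wet(runs, 75)}s p90={wet(runs, 90)}s")
# prints: mean=432s p75=310s p90=320s
```

In this made-up sample the single 1800-second run pulls the mean up to 432 seconds, while p75 and p90 stay close to what most runs actually took.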
2. FR (Failure Rate)
FR is the percentage of completed runs of a specific critical workflow that end in a 'failure' status. It serves as the baseline measure of pipeline reliability and stability. A high FR indicates frequent breakages, build issues, or flaky tests, which erode trust and block deployments. The ideal FR isn't necessarily zero, though! A healthy CI pipeline must fail when real bugs or integration issues are introduced; that's its primary job, and a pipeline should never pass when there's a genuine bug to catch. What FR is really meant to expose is the rate of false positives: failures caused by infrastructure glitches, flaky tests, or unexpected regressions when the underlying code changes themselves are sound. A low, stable FR in which failures accurately reflect real code problems is therefore the sign of a healthy, reliable pipeline, while an unexpected increase in FR is a critical signal demanding investigation, as it often points to underlying infrastructure problems or newly introduced flaky tests. Context is always key to interpreting FR changes and understanding why things are failing.
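A minimal sketch of the basic FR calculation might look like the following. The `status` and `conclusion` field names and their values are assumptions for illustration, and the denominator deliberately includes every completed run (including cancelled ones), which is one possible reading of "completed workflow runs". Separating genuine failures from false positives needs extra context that this sketch does not attempt.

```python
def failure_rate(runs):
    """Percentage of completed runs whose conclusion is 'failure'."""
    completed = [r for r in runs if r["status"] == "completed"]
    if not completed:
        return 0.0  # no completed runs yet, so nothing has failed
    failed = sum(1 for r in completed if r["conclusion"] == "failure")
    return 100.0 * failed / len(completed)


# Hypothetical sample: 8 completed runs (2 failures) and 1 run still in progress.
runs = (
    [{"status": "completed", "conclusion": "success"}] * 5
    + [{"status": "completed", "conclusion": "failure"}] * 2
    + [{"status": "completed", "conclusion": "cancelled"}]
    + [{"status": "in_progress", "conclusion": None}]
)

print(f"FR = {failure_rate(runs):.1f}%")
# prints: FR = 25.0%
```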
3. POT (Pipeline Overhead Time)
POT quantifies the unproductive time spent during a typical workflow run. This includes:
- Time spent waiting in the queue for a runner or agent to become available.
- Time lost to flakiness – such as the execution time of failed jobs that required a retry, or runs that failed intermittently and needed to be manually re-run.
- Time wasted due to misconfigured dependency caches, where dependencies are repeatedly downloaded instead of being fetched once from a cache.
POT highlights inefficiency and friction, directly measuring wasted compute time and potential cost. It captures that frustrating feeling of waiting: waiting in queues, waiting for retries after a seemingly random failure. It's the CI equivalent of XKCD's famous "Compiling" comic: dead time in which developers who could be productive are simply blocked, often at the price of a costly context switch. High POT often means hitting that "Re-run failed jobs" button and waiting another 5, 10, or 15 minutes with no idea whether it will pass this time. That hurts morale and predictability, and it often correlates with harder-to-diagnose underlying issues.
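To make the three components concrete, here is one way the per-run overhead could be added up. This is a sketch under stated assumptions: the record fields (`queue_s`, `attempts`, `cache_miss_s`) are hypothetical, a failed attempt counts as wasted only if another attempt followed it, and the median stands in for "a typical workflow run". A real implementation would need to decide how to measure each component on its CI platform.

```python
from statistics import median


def run_overhead(run):
    """Unproductive seconds in one run: queue wait + retried failed attempts + cache-miss downloads."""
    attempts = run["attempts"]
    # Every attempt except the last was followed by a retry; failed ones are counted as waste.
    retried_failures = sum(
        a["duration_s"] for a in attempts[:-1] if a["conclusion"] == "failure"
    )
    return run["queue_s"] + retried_failures + run.get("cache_miss_s", 0)


def pot(runs):
    """POT for a workflow: the median per-run overhead (an assumed choice of 'typical')."""
    return median([run_overhead(r) for r in runs])


# Hypothetical sample data, all durations in seconds.
runs = [
    {"queue_s": 30, "cache_miss_s": 45, "attempts": [
        {"conclusion": "failure", "duration_s": 200},
        {"conclusion": "success", "duration_s": 210},
    ]},
    {"queue_s": 10, "attempts": [
        {"conclusion": "success", "duration_s": 200},
    ]},
    {"queue_s": 60, "attempts": [
        {"conclusion": "failure", "duration_s": 100},
        {"conclusion": "failure", "duration_s": 120},
    ]},
]

print([run_overhead(r) for r in runs], pot(runs))
# prints: [275, 10, 160] 160
```

The last failed attempt of the third run is not counted because nothing retried it. Whether such a final failure is flakiness or a real bug is a question for FR rather than POT.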
Why These Three?
Together, WET (Speed), FR (Reliability), and POT (Efficiency/Waste) provide a holistic yet focused view of your pipeline's health. They cover the critical aspects of how fast your pipeline runs, how reliably it produces valid results, and how much time is wasted along the way. Monitoring these three Vitals gives you a powerful signal about the overall health and developer experience of your CI/CD process, regardless of the specific tools you use.
The Challenge of Tracking CI Vitals Manually
While the concepts behind WET, FR, and POT are clear, calculating them accurately and consistently across different CI/CD platforms requires significant engineering effort. If you were to build a monitoring solution yourself, you'd need to tackle:
- Data Collection: Setting up robust mechanisms (using APIs or webhooks specific to each CI provider) to reliably gather detailed timing data, job statuses, queue times, and retry attempts for every relevant workflow run.
- Data Storage & Processing: Designing, deploying, and maintaining a database or system to store potentially large volumes of historical workflow data efficiently.
- Complex Calculations: Implementing the logic to calculate p75/p90 percentiles for WET, track failure rates accurately over rolling time windows, meticulously distinguish queue time from execution time for POT, and implement heuristics to identify flaky patterns (like retries or intermittent pass/fail cycles) to quantify their contribution to POT – potentially needing different logic for different CI systems.
- Log Parsing: Potentially needing to parse diverse log formats to detect specific failure reasons that might indicate infrastructure issues versus code issues, contributing to FR or POT.
- Ongoing Maintenance: Keeping this entire custom system running smoothly, adapting it to inevitable API changes from multiple providers, managing data retention, and scaling it as your team and projects grow.
- Visualization: Building effective dashboards to actually visualize these metrics and their trends in a way that provides actionable insights.
In short, it's a substantial project in itself. Building and maintaining such a system often requires dedicated engineering resources, potentially even a team of its own, diverting valuable time and focus away from your core product development.
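To give a feel for the kind of heuristic mentioned above, the sketch below flags intermittent pass/fail cycles: a workflow that both failed and succeeded on the same commit is treated as suspected flaky. The field names (`workflow`, `head_sha`, `conclusion`) and the rule itself are assumptions for illustration. A real detector would also need to handle per-job retries, cancelled runs, and provider-specific quirks.

```python
from collections import defaultdict


def suspected_flaky(runs):
    """Return (workflow, head_sha) pairs that saw both a failure and a success."""
    seen = defaultdict(set)
    for r in runs:
        seen[(r["workflow"], r["head_sha"])].add(r["conclusion"])
    # Same code, different outcomes: a hint (not proof) of flakiness.
    return {key for key, outcomes in seen.items() if {"success", "failure"} <= outcomes}


# Hypothetical sample data.
runs = [
    {"workflow": "ci", "head_sha": "abc123", "conclusion": "failure"},
    {"workflow": "ci", "head_sha": "abc123", "conclusion": "success"},  # passed on re-run
    {"workflow": "ci", "head_sha": "def456", "conclusion": "failure"},  # consistently failing
    {"workflow": "ci", "head_sha": "0a1b2c", "conclusion": "success"},
]

print(suspected_flaky(runs))
# prints: {('ci', 'abc123')}
```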
Cimatic: Your CI/CD Pipeline Companion
This is where Cimatic comes in. We built Cimatic specifically to handle all this complexity for you, automatically tracking and visualizing CI Vitals. Cimatic is launching first with support for GitHub Actions, where it works right out of the box.
With Cimatic for GitHub Actions, you get:
- Clear CI Vitals Dashboard: Instantly see the current WET, FR, and POT for your key workflows, addressing the lack of built-in overview dashboards.
- Historical Trends: Understand how your pipeline health is changing over time, identifying improvements or regressions easily – something very difficult to do manually.
- Actionable Insights: Drill down into what's driving poor Vitals. See which jobs contribute most to WET, identify common failures impacting FR, and understand the breakdown of POT between queue time and flakiness.
- Zero Configuration Setup: Cimatic works out-of-the-box with your standard GitHub Actions configuration. There are no complex agents to install or workflow files to modify. Connect your repository, and Cimatic immediately starts providing insights, even showing all metrics for your historical workflow runs.
- Optimize with Data: Stop guessing and start making data-driven decisions to improve your CI/CD pipelines, ultimately saving significant engineering time, reducing developer frustration, and shipping faster.
(Stay tuned for integrations with other CI/CD providers in the future!)
Get Started with CI Vitals Today
Ready to move beyond simple green checks and gain real visibility into your pipeline health?
Join the waitlist for early access to Cimatic for GitHub Actions. Start tracking your CI Vitals and unlock a faster, more reliable development cycle.