Why Your CI/CD Pipeline Slows Down After 50 Engineers, And How to Fix It
Your CI/CD pipeline worked fine at 15 engineers. At 50, deploys take 40 minutes. At 75, engineers run deploys in parallel to avoid queue bottlenecks, release confidence has dropped, and people have started working around the pipeline rather than through it. Three compounding problems turn a manageable irritant into a full-blown productivity tax: a growing test suite, shared runner contention, and fragmented pipeline ownership. This isn't an anomaly. It's the predictable result of pipeline architecture that wasn't designed for scale. Here's what's happening and how to fix it.
Key Takeaways
- In 2024, the DORA State of DevOps Report found that elite-performing teams deploy 182x more frequently than low performers. The gap is structural, not accidental. (Source: dora.dev)
- A 10-engineer team waiting 30 minutes daily for CI/CD loses five hours of productive engineering time every day. At 50 engineers, the same wait costs 25 hours a day, roughly three full-time engineers' worth of capacity. (Source: JetBrains TeamCity Blog, 2025)
- The most effective fixes are parallelisation, affected-build logic, and pipeline ownership, in that order.
Why Does CI/CD Slow Down After 50 Engineers?
In 2024, the DORA Report found that low performers grew from 17% to 25% of all surveyed teams, partly because growth-stage orgs lose operational discipline as they scale. Three root causes compound: test suite growth with no parallelisation, runner contention with queuing baked in, and organisational fragmentation where each team owns their own pipeline steps and nobody owns the whole thing.
From practice: When I audit a CI/CD pipeline at a 60–100 person engineering org, the first thing I look for is the gap between queue wait time and execution time. At Deliveroo scale, I saw pipelines where engineers were waiting 15 minutes for a runner before a single line of build code executed. The execution itself took 12 minutes. So about 56% of total pipeline time (15 of 27 minutes) was pure queue. That's not a build problem; it's an infrastructure provisioning problem wearing a build problem's clothing.
How Slow Is "Too Slow"?
The 2024 DORA benchmarks define elite performers as teams deploying multiple times per day with lead time under one hour; low performers deploy monthly with lead times over a month. For a 50–100 engineer team, the practical target is under 10 minutes for standard builds and under 20 for complex ones. Regularly seeing 30+ minutes means the problem is urgent: not Q3-roadmap urgent, but this-sprint urgent.
CI/CD Performance by DORA Tier (2024)
- Elite performers: deploy multiple times per day; lead time under 1 hour; change failure rate under 5%; time to restore under 1 hour.
- Low performers: deploy fewer than once per month; lead time over 1 month; change failure rate 46–60%; time to restore days to weeks.
The gap between the two isn't primarily a skill gap; it's a systems gap. Source: DORA State of DevOps Report 2024, survey of 3,000+ professionals globally.
The Four Most Common Bottlenecks
1. The Test Suite Bottleneck
The most common cause. A test suite that ran in five minutes at 10,000 tests takes 45+ minutes at 60,000, because most teams run tests sequentially and nobody ever removed the ones added three years ago. The fix isn't to run fewer tests. It's parallelisation and selective execution.
2. The Runner Contention Bottleneck
Shared runners with a queue. Engineers waiting 15 minutes for a runner to become available before the build even starts. The execution itself is fast; the wait is not. Right-sizing runner pools and adding autoscaling is the fix, not buying more expensive runners.
3. The Monorepo Bottleneck
Running the full build on every commit when only one service changed. At a 50-engineer org with a sizeable monorepo, this means a change to a single TypeScript package triggers a 25-minute build that touches Java services the author has never worked on. Affected-build tooling (nx affected, Turborepo, Bazel) solves this directly.
4. The Ownership Vacuum Bottleneck
No one owns the pipeline as a product. Every team added their own CI steps over two years. Nobody removed anything. The pipeline accumulated layers like geological sediment: useful at the time, invisible now, collectively fatal. The fix here is organisational before it's technical: assign ownership and an SLA.
From practice: Of the four, the ownership vacuum is the one I find most often and the hardest to fix. At Scout24, I saw pipelines that had grown from 8 minutes to 38 minutes over 18 months with no single engineer able to explain more than 60% of what was in them. The technical fixes are available and well understood. What's missing is the accountability structure that makes it someone's job to reduce build time, track it weekly, and have a number to defend. That doesn't come from a Jira ticket; it comes from organisational design.
How Do You Diagnose Your Specific Bottleneck?
Industry flow-efficiency research shows 75–85% of time in a software delivery process is waiting, not executing, which means the bottleneck is almost always in the queue, not the code. GitHub Actions, CircleCI, and GitLab CI all expose per-step timing data. Pull the last 30 days for your five most-trafficked services, separate queue wait from execution time, and look for outliers. Where the provider's data is too coarse, wrap each pipeline stage in a timer. Most teams that do this find their bottleneck in under an hour; they simply hadn't looked before.
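To make the queue-versus-execution split concrete, here is a minimal sketch in Python. It assumes you have already exported job records with three timestamps; the JobRun class and its field names are hypothetical, so map them from whatever your CI provider's API returns.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class JobRun:
    """One CI job. Field names are hypothetical; map them from your provider's export."""
    service: str
    created_at: datetime   # job entered the queue
    started_at: datetime   # a runner picked the job up
    finished_at: datetime  # job completed


def queue_share(runs: list[JobRun]) -> dict[str, float]:
    """Return, per service, the fraction of total pipeline time spent waiting in the queue."""
    queued: dict[str, float] = {}
    total: dict[str, float] = {}
    for run in runs:
        wait = (run.started_at - run.created_at).total_seconds()
        execute = (run.finished_at - run.started_at).total_seconds()
        queued[run.service] = queued.get(run.service, 0.0) + wait
        total[run.service] = total.get(run.service, 0.0) + wait + execute
    # Skip services with zero recorded time to avoid dividing by zero.
    return {svc: queued[svc] / total[svc] for svc in total if total[svc] > 0}


if __name__ == "__main__":
    # Illustrative data: 15 minutes queued, then 12 minutes executing.
    sample = [
        JobRun(
            "checkout",
            datetime(2025, 1, 6, 9, 0),
            datetime(2025, 1, 6, 9, 15),
            datetime(2025, 1, 6, 9, 27),
        )
    ]
    for service, share in queue_share(sample).items():
        print(f"{service}: {share:.0%} of pipeline time is queue wait")
```

A service where this share is above one half points at runner provisioning rather than at the build itself.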
Fix 1: Parallelise and Shard Your Tests
In 2025, JetBrains analysis found that halving CI pipeline time recovers roughly 90 hours of engineering time per month per 10 engineers. Parallelisation is how you get there. A 40-minute test suite across four parallel workers becomes approximately 12 minutes (10 minutes of tests per worker plus setup overhead), with no test changes required, just orchestration. GitHub Actions matrix strategy, CircleCI's parallelism key, and Jest's --shard flag handle the split. Four to six workers is the sweet spot for most teams before diminishing returns kick in.
Tools to consider: Jest --shard, Vitest's --shard, Playwright's --shard for end-to-end tests, and RSpec's parallel_tests gem for Ruby shops. GitHub Actions' matrix strategy handles the orchestration without additional tooling.
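The shard flags listed above do the splitting for you. If you orchestrate shards yourself, the thing to get right is balance, because the slowest shard sets the wall-clock time. The sketch below shows one common approach, greedy longest-first assignment. It assumes you have historical per-file durations, and the function name is illustrative rather than part of any of those tools.

```python
import heapq


def shard_by_duration(durations: dict[str, float], workers: int) -> list[list[str]]:
    """Assign test files to workers, longest first, always to the least-loaded worker."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    # Heap entries are (seconds assigned so far, worker index).
    heap = [(0.0, i) for i in range(workers)]
    heapq.heapify(heap)
    shards: list[list[str]] = [[] for _ in range(workers)]
    longest_first = sorted(durations.items(), key=lambda item: item[1], reverse=True)
    for name, seconds in longest_first:
        load, idx = heapq.heappop(heap)  # least-loaded worker
        shards[idx].append(name)
        heapq.heappush(heap, (load + seconds, idx))
    return shards


if __name__ == "__main__":
    # Hypothetical timings in seconds from a previous run.
    timings = {"a": 600, "b": 500, "c": 400, "d": 300, "e": 200, "f": 100}
    for i, shard in enumerate(shard_by_duration(timings, workers=2)):
        print(i, shard, sum(timings[name] for name in shard))
```

With even shards, adding workers helps until fixed per-job overhead (checkout, dependency install) dominates, which is one reason returns diminish past a handful of workers.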
Fix 2: Affected-Build Logic for Monorepos
For a typical monorepo, affected-build tools reduce pipeline execution by 60–80% on average commits, because most commits touch one or two services, not all of them. Tools like Nx, Turborepo, and Bazel compute a dependency graph and build only what changed.
Which tool to use depends on your stack:
- Turborepo is the pragmatic choice for JavaScript/TypeScript monorepos. Low setup cost, maintained by Vercel (where I saw it adopted widely across European scale-up customers), works well with pnpm workspaces. The --filter and --affected flags do the heavy lifting.
- Nx is the better choice if you need a full dependency graph, custom executors, and caching across machines. More investment upfront, more capability at scale.
- Bazel is the right answer for polyglot monorepos (Go, Java, Python, TypeScript in the same repo) at large scale. Steep learning curve; justified above roughly 200 engineers.
The key implementation step: ensure your CI configuration passes the base commit reference correctly so the affected calculation is accurate. A common mistake is computing "affected since main" when you mean "affected since branch point."
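To show what the affected calculation does under the hood, here is a simplified sketch. It is not how Nx, Turborepo, or Bazel implement it; the package names, paths, and function are hypothetical. The changed-file list is assumed to come from a diff against the branch point, for example `git diff --name-only $(git merge-base origin/main HEAD) HEAD`, rather than against the current tip of main.

```python
from collections import deque


def affected_packages(
    changed_files: list[str],
    package_roots: dict[str, str],
    depends_on: dict[str, set[str]],
) -> set[str]:
    """Return packages that changed plus everything that transitively depends on them."""
    # Map each changed file to the package whose root directory contains it.
    changed = {
        pkg
        for path in changed_files
        for pkg, root in package_roots.items()
        if path.startswith(root.rstrip("/") + "/")
    }
    # Invert the graph: for each package, which packages depend on it?
    dependents: dict[str, set[str]] = {}
    for pkg, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(pkg)
    # Breadth-first walk from the changed packages up through their dependents.
    affected = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected


if __name__ == "__main__":
    # Hypothetical monorepo layout and dependency graph.
    roots = {"ui": "packages/ui", "web": "apps/web", "billing": "services/billing"}
    graph = {"web": {"ui"}, "ui": set(), "billing": set()}
    print(sorted(affected_packages(["packages/ui/src/button.tsx"], roots, graph)))
```

If the diff base is wrong, the changed-file list is wrong, and everything downstream of it is either over-built or, worse, skipped.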
Fix 3: Right-Size and Autoscale Your Runners
Self-hosted runners on appropriately sized EC2 instances typically cut CI infrastructure cost by 40–60% compared to GitHub's default hosted runners, while also being faster, because you control instance size and can autoscale to eliminate queue wait. For GitHub Actions, Philips Labs' terraform-aws-github-runner is the most widely adopted open-source option. Actuated is the managed alternative. For GitLab, Kubernetes-based runner autoscaling via the GitLab Runner Helm chart handles this natively.
The cost calculation matters here. At 50 engineers each running 20 pipelines per day at 12 minutes average, you're consuming approximately 200 runner-hours daily. At GitHub's standard hosted runner pricing, that's meaningful spend. The setup investment is one to two days for a competent infrastructure engineer, a one-time cost that pays back in weeks.
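The runner-hours figure is simple arithmetic, sketched below so you can plug in your own numbers. It assumes each engineer triggers 20 pipelines a day, which is the reading that yields 200 runner-hours. The peak_factor parameter and the pool-sizing helper are illustrative assumptions, not a sizing rule from any vendor.

```python
import math


def daily_runner_hours(engineers: int, pipelines_per_engineer: int, avg_minutes: float) -> float:
    """Total runner time consumed per day, in hours."""
    return engineers * pipelines_per_engineer * avg_minutes / 60


def peak_runners(runner_hours: float, working_hours: float = 8.0, peak_factor: float = 2.0) -> int:
    """Rough pool ceiling for autoscaling.

    Average concurrency is runner_hours / working_hours; peak_factor (an assumed
    value) scales it up for busy periods such as the pre-lunch merge rush.
    """
    return math.ceil(runner_hours / working_hours * peak_factor)


if __name__ == "__main__":
    hours = daily_runner_hours(engineers=50, pipelines_per_engineer=20, avg_minutes=12)
    print(f"runner-hours per day: {hours}")
    print(f"autoscaling ceiling:  {peak_runners(hours)} runners")
```

The ceiling matters more than the floor: an autoscaled pool can idle near zero overnight, but queue wait returns as soon as peak demand exceeds the maximum.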
Fix 4: Assign Pipeline Ownership
In 2024, DORA identified that teams with clear platform ownership consistently outperform those relying on shared, unowned infrastructure. Without someone whose job it is to measure and maintain the pipeline, a parallelisation win in March becomes a 45-minute pipeline again by October, because teams added steps and nobody noticed the regression.
The practical implementation: a member of your platform team owns CI/CD as a product. They publish a target SLA (say, P50 under 10 minutes and P95 under 20 for standard services) and track current P50 and P95 weekly. They maintain a public improvement roadmap. They run a lightweight monthly pipeline review where slow outliers are investigated. They own regression alerting: if the P50 for any service increases by more than five minutes week-on-week, they get paged.
This alone, before any technical fix, often produces improvement, because measurement creates accountability and accountability creates behaviour change.
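The weekly tracking and the five-minute regression rule are straightforward to automate. The sketch below assumes you can export pipeline durations in minutes per service per week; the function names are illustrative, and it uses the nearest-rank percentile definition, which is one of several in common use.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def p50_regressions(
    last_week: dict[str, list[float]],
    this_week: dict[str, list[float]],
    threshold_minutes: float = 5.0,
) -> dict[str, float]:
    """Return services whose P50 rose by more than threshold_minutes, with the increase."""
    regressions: dict[str, float] = {}
    for service, durations in this_week.items():
        previous = last_week.get(service)
        if not previous or not durations:
            continue  # no baseline or no runs, so nothing to compare
        delta = percentile(durations, 50) - percentile(previous, 50)
        if delta > threshold_minutes:
            regressions[service] = delta
    return regressions


if __name__ == "__main__":
    # Hypothetical durations in minutes.
    before = {"api": [8, 9, 10], "web": [6, 7, 8]}
    after = {"api": [14, 16, 18], "web": [7, 8, 9]}
    for service, delta in p50_regressions(before, after).items():
        print(f"{service}: P50 up {delta} minutes week-on-week")
```

Wire the output into whatever pages the pipeline owner; the point is that the check runs without anyone remembering to look.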
What Does "Done" Look Like?
The 2024 DORA data places the high-performer threshold at daily deploys, lead time under one day, and change failure rate under 15%. For a 50–100 engineer team: standard builds under 10 minutes, complex builds under 20, and lead time commit-to-production under one hour. Those are the targets. Anything consistently above them is worth fixing.
Maintenance is where most improvements unravel. Monthly pipeline reviews (checking P50 and P95 for each service and looking for regressions) prevent the accumulation that caused the problem in the first place. Set regression alerts. If a service's P50 increases by five minutes week-on-week, someone gets notified before it becomes 15. Annual architecture reviews are worth scheduling even when things feel fine. The test suite that was appropriate at 40,000 tests isn't appropriate at 100,000.
Frequently Asked Questions
Should we migrate from GitHub-hosted to self-hosted runners?
For most teams between 50 and 150 engineers, yes, but to autoscaling self-hosted runners, not a fixed-size fleet. The break-even point is roughly 150 runner-hours per day, which teams at this size usually exceed. The setup cost is real (one to two engineer-days for AWS with terraform-aws-github-runner, a few hours for Actuated), but the ongoing cost savings and speed improvement justify it within two to three months. Don't migrate if your pipeline problems are primarily test-suite-related: faster runners won't fix a 45-minute sequential test run.
How do we handle flaky tests that slow down CI?
Flaky tests are a pipeline reliability problem before they're a speed problem. Track your flake rate per test file and quarantine tests with a flake rate above 3% into a separate non-blocking suite. Fix or delete them within two sprints. Tools like BuildPulse and Trunk Flaky Tests automate detection. The teams that tolerate flaky tests in blocking CI suites end up with engineers adding manual retries, which adds 5–10 minutes to every affected pipeline run and erodes trust in the pipeline signal entirely.
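If you want to compute flake rates before adopting a tool, a rough version fits in a few lines. The sketch below uses one possible definition: a test file "flaked" on a commit if it both passed and failed on that same commit. Both the definition and the record shape are assumptions, and dedicated tools may measure flakiness differently.

```python
from collections import defaultdict


def flake_rates(results: list[tuple[str, str, bool]]) -> dict[str, float]:
    """Compute the flake rate per test file.

    results holds (test_file, commit_sha, passed) records. A file counts as
    flaky on a commit when it has both a pass and a fail recorded there.
    """
    outcomes: dict[str, dict[str, set[bool]]] = defaultdict(lambda: defaultdict(set))
    for test_file, sha, passed in results:
        outcomes[test_file][sha].add(passed)
    return {
        test_file: sum(1 for seen in by_sha.values() if seen == {True, False}) / len(by_sha)
        for test_file, by_sha in outcomes.items()
    }


def to_quarantine(rates: dict[str, float], threshold: float = 0.03) -> list[str]:
    """Files whose flake rate exceeds the threshold (3% by default)."""
    return sorted(name for name, rate in rates.items() if rate > threshold)


if __name__ == "__main__":
    # Hypothetical history: a_test.py failed and then passed on commit c1.
    history = [
        ("a_test.py", "c1", False),
        ("a_test.py", "c1", True),
        ("a_test.py", "c2", True),
        ("b_test.py", "c1", True),
        ("b_test.py", "c2", True),
    ]
    print(to_quarantine(flake_rates(history)))
```

Run it over a few weeks of history rather than a few days, since a 3% threshold means little on a handful of commits.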
What's the ROI calculation for faster CI/CD?
Start with the straightforward version. Take your pipeline median time, multiply by the number of daily runs, multiply by the fraction of that time engineers actively wait (roughly 40–60% for most teams), and price it at your average fully-loaded engineering hour. A 50-engineer team running 150 pipelines per day at a 20-minute median, with engineers waiting on 50% of runs, loses approximately 25 engineer-hours daily. On an eight-hour day, that's just over three engineers' worth of capacity; at €120,000 fully-loaded per engineer, roughly €375,000 in lost capacity annually. Reducing median pipeline time from 20 to 8 minutes recovers 60% of that, roughly €225,000. The platform work to achieve it costs a fraction of that.
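The calculation is easy to parameterise. The sketch below follows the steps described above and makes one assumption explicit: it converts daily hours lost into full-time engineers using an eight-hour workday, then prices them at the annual fully-loaded cost. The annual figure is sensitive to that conversion, so treat the output as an order-of-magnitude estimate.

```python
def daily_hours_lost(runs_per_day: int, median_minutes: float, wait_fraction: float) -> float:
    """Engineer-hours spent actively waiting on pipelines each day."""
    return runs_per_day * median_minutes * wait_fraction / 60


def annual_cost(
    hours_per_day: float,
    loaded_cost_per_engineer: float,
    hours_per_workday: float = 8.0,  # assumption: one engineer-day is eight hours
) -> float:
    """Express the daily loss as full-time engineers, then price them at annual cost."""
    return hours_per_day / hours_per_workday * loaded_cost_per_engineer


if __name__ == "__main__":
    before = daily_hours_lost(runs_per_day=150, median_minutes=20, wait_fraction=0.5)
    after = daily_hours_lost(runs_per_day=150, median_minutes=8, wait_fraction=0.5)
    print(f"hours lost per day: {before} before, {after} after")
    saved = annual_cost(before, 120_000) - annual_cost(after, 120_000)
    print(f"estimated annual capacity recovered: EUR {saved:,.0f}")
```

Changing wait_fraction from 0.4 to 0.6 moves the result by half again, so it is worth measuring rather than guessing.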
When does a monorepo become a CI/CD liability?
When you're running full builds on every commit and haven't implemented affected-build logic. A monorepo with proper affected builds and caching is often faster than an equivalent set of polyrepos, because you get build deduplication and shared cache. The liability is organisational: monorepos require someone to own the build tooling as a product. Without that ownership, affected-build configurations drift, caches expire without anyone noticing, and teams start opting out of the shared pipeline.
How do elite teams achieve multiple deploys per day safely?
Feature flags and trunk-based development are the foundation. Features ship behind flags, so "deploy" and "release" are decoupled. You can deploy 10 times per day without customers seeing incomplete features. Combined with a fast pipeline (under 10 minutes), high test coverage, and automated rollback, the risk per deploy is low because each deploy is small and reversible. The DORA data is unambiguous: high deployment frequency correlates with lower change failure rates, not higher, because small deploys are easier to reason about and easier to roll back.
Conclusion
Slow pipelines are a platform ownership problem as much as a technical one. The fixes (parallelisation, affected builds, runner autoscaling) are mechanical to implement once you've diagnosed the bottleneck. What makes them last is assigning someone to own the pipeline as a product: to measure it weekly, catch regressions early, and run a roadmap. Fix the architecture first, then give it an owner.