Cutting E2E Flake Rate from 100% to 25% in Five Weeks

How we killed flaky end-to-end tests so coding agents could actually iterate, and a checklist to replicate the workflow.

Problem

When coding agents write nearly all your code, a flaky end-to-end (E2E) test suite lies to agents and breaks their feedback loop.

Constraints

  • Data available: Playwright reports, CI pipeline history, per-test pass/fail data, application performance monitoring (APM) traces via W3C Trace Context response headers.
  • Tools: Two to three independent coding agents (e.g. Claude Code, Codex, Google Gemini), a custom Playwright reporter, GitHub Actions, and owning-team code reviewers.

Approach

  • Multi-agent consensus for test triage. Three agents independently categorize each test as delete, combine, convert, or leave with a confidence score. A second pair cross-checks and negotiates until they agree on every test.
  • Human-in-the-loop review. Domain-owning teams make the final call on each change.
  • Guardrails against regression. Update AGENTS.md with a pre-E2E checklist and the rule, “When in doubt, write an integration test.”
  • Agent-readable failure reports. A custom Playwright reporter emits structured JSON: unified timeline, network and console events, base64 screenshots, and a first-class traceId parsed from the traceparent header so APM traces can provide backend context.
  • Right agent for the job. Benchmark three agents on the same real failure before standardizing. For us, Codex consistently searched deeper and resorted to retries and timeout bumps least often.
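The trace-ID extraction the reporter relies on can be sketched as below. The function name `parseTraceId` is our own invention, but the header format is fixed by the W3C Trace Context spec: `version-traceId-parentId-flags`, all lowercase hex.

```typescript
// Sketch: pull the trace ID out of a W3C `traceparent` response header so a
// debugging agent can look up the matching APM trace.
// Spec format: "00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>".
export function parseTraceId(traceparent: string): string | null {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(
    traceparent.trim().toLowerCase(),
  );
  if (!match) return null;
  const traceId = match[2];
  // An all-zero trace ID is invalid per the spec.
  if (traceId === "0".repeat(32)) return null;
  return traceId;
}
```

The reporter records this value as a first-class `traceId` field in its JSON output rather than leaving the raw header for each agent to re-parse.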

Before → After

Metric | Week 0 | Week 5
--- | --- | ---
PRs hitting a flaky E2E | 100% | 25%
E2E test count | 174 | 87
Typical agent fix for a flake | Add retry or bump timeout | Trace-backed root cause spanning 5-9 files
GitHub Actions flakiness costs | $102,600/yr | $25,650/yr
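The cost line tracks the flake rate exactly, under the assumption (ours, not stated in the source data) that rerun spend scales linearly with the share of PRs hitting a flake:

```typescript
// Assumption (ours): flakiness cost scales linearly with the fraction of
// PRs that hit a flaky E2E run.
const week0Cost = 102_600; // $/yr when 100% of PRs hit a flake
const week5FlakeRate = 0.25; // 25% of PRs
const week5Cost = week0Cost * week5FlakeRate; // 25650
const annualSavings = week0Cost - week5Cost; // 76950
console.log(week5Cost, annualSavings);
```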

How to replicate

  1. Write a prompt that asks an agent to audit every E2E test and categorize each as delete, combine, convert, or leave with a confidence score and reason. Output structured JSON keyed by file:testName.
  2. Run that prompt against three agents that differ in both model and harness. Save each result.
  3. Normalize test names across the three files, randomize their order to obfuscate which agent came to which conclusion, and merge into one research-combined.json.
  4. Give two fresh agents the combined file. Each produces a final-*.json decision per test.
  5. Diff the two files and loop: feed each agent the other’s reasoning and have them negotiate each disagreement. Stop when the diff is empty.
  6. Group final decisions by test file to limit merge conflicts. Ship one PR per file, code-reviewed by the owning team.
  7. Add a pre-E2E checklist to AGENTS.md to prevent regression.
  8. Use the custom Playwright reporter to emit structured JSON, including a parsed traceparent header.
  9. Use the /flaky-test-debugger skill to point agents at the reporter output and perform APM trace lookups.
  10. Benchmark agents on debugging tasks to choose the best one.
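Steps 3 and 5 can be sketched as two small functions. The record shapes and helper names here are assumptions; the real schema is whatever your audit prompt emits, keyed by `file:testName` as in step 1.

```typescript
// Assumed verdict shape from the audit prompt (step 1).
type Decision = "delete" | "combine" | "convert" | "leave";
interface Verdict { decision: Decision; confidence: number; reason: string }
type Audit = Record<string, Verdict>; // keyed by "file:testName"

// Step 3: merge per-agent audits into one research-combined structure,
// shuffling each test's verdicts so a reviewing agent cannot tell which
// source agent reached which conclusion.
export function mergeAudits(audits: Audit[]): Record<string, Verdict[]> {
  const merged: Record<string, Verdict[]> = {};
  for (const audit of audits) {
    for (const [key, verdict] of Object.entries(audit)) {
      (merged[key] ??= []).push(verdict);
    }
  }
  for (const verdicts of Object.values(merged)) {
    verdicts.sort(() => Math.random() - 0.5); // order obfuscation only
  }
  return merged;
}

// Step 5: diff two final decision files; loop the negotiation until this
// returns an empty array.
export function diffDecisions(
  a: Record<string, Decision>,
  b: Record<string, Decision>,
): string[] {
  const keys = new Set([...Object.keys(a), ...Object.keys(b)]);
  return [...keys].filter((k) => a[k] !== b[k]);
}
```

In practice each function reads and writes the JSON files named in the steps above; they are shown here as pure functions to keep the sketch self-contained.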

Pitfalls

  • Handing agents only the stack trace and the repo. Nearly every agent adds a retry or bumps a timeout. Without a unified timeline and APM traces, there is nothing to root-cause against.
  • Scraping Playwright’s HTML report. The report renders for humans, not machines. Parsing logic grows brittle and information-sparse, and any upstream format change breaks it. Build a reporter instead of parsing one.
  • Letting agents set the final cut list. “These tests rarely catch logic regressions, but they prevent API-breaking changes” is a constraint agents cannot see from the test code alone.

Starter kit