- Blue/green deploy, roll forward
- Timeouts/deadlines for all remote calls
- Capped exponential backoff retries with jitter at single point in stack (only for idempotent calls based on response code, monitor retries)
- Well exercised fallbacks
- What happens when downstream services down? Gracefully degrade, timeouts (Postgres), circuit breakers (for all sync downstream calls via Resilience4j, which also does bulkheads and load shedding)
- For migrations, etc., two phase deploy with bake time, readers go before writers while rolling forward whereas writers go before readers while rolling backward
Mean Time to Recovery (MTTR)
Rocky Warren
December 31, 2020 • 1 min read