Designing for Availability: Why Most Teams Discover Single Points of Failure Too Late
The problem with discovering failure modes in production
Most availability problems are not surprises in hindsight. The dependency that caused a three-hour outage was visible in the architecture diagram. The untested recovery path that failed under pressure had never been exercised. The alert that did not fire had a threshold set too conservatively to catch a gradual degradation.
The challenge is not that these risks are hard to identify. It is that identifying them requires time and deliberate attention that most teams redirect toward feature delivery until an incident forces the conversation.
This post outlines a structured approach for finding the single points of failure in your architecture before they find you.
Start with dependency mapping
A single point of failure is any component whose unavailability causes a user-visible service degradation with no automatic mitigation. The first step is making these components explicit.
For each service in your architecture, map:
- What it calls directly (synchronous dependencies)
- What it reads from at request time (databases, caches, configuration stores)
- What it depends on for deployment and scaling (registries, secret stores, DNS)
- What external services it requires (payment processors, identity providers, email delivery)
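One lightweight way to make this map auditable is to keep it as structured data and scan it for synchronous dependencies with no declared mitigation. A minimal sketch, where the service names, fields, and mitigations are illustrative rather than prescriptive:

```python
# Hypothetical dependency map: each entry lists what a service calls,
# whether the call is synchronous, and what mitigation (if any) exists.
DEPENDENCY_MAP = {
    "checkout": [
        {"name": "payments-api", "sync": True, "mitigation": None},
        {"name": "inventory-db", "sync": True, "mitigation": "read-replica"},
    ],
    "notifications": [
        {"name": "email-provider", "sync": False, "mitigation": "retry-queue"},
    ],
}

def single_points_of_failure(dep_map):
    """Return (service, dependency) pairs that are synchronous and unmitigated."""
    return [
        (service, dep["name"])
        for service, deps in dep_map.items()
        for dep in deps
        if dep["sync"] and dep["mitigation"] is None
    ]

print(single_points_of_failure(DEPENDENCY_MAP))
# [('checkout', 'payments-api')]
```

Keeping the map in a repository alongside the code makes it reviewable when services change, which matters for the continuous-review practice discussed later.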
Pay particular attention to synchronous dependencies that do not have a fallback or timeout policy. A service that makes a blocking call to a dependency with no circuit breaker will degrade or fail whenever that dependency slows down, not just when it is fully unavailable.
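The circuit-breaker pattern can be sketched in a few lines. This is an illustrative skeleton under simplified assumptions (single-threaded, consecutive-failure counting); production systems would typically use an established resilience library rather than hand-rolling this:

```python
import time

class CircuitBreaker:
    """Open the circuit after `max_failures` consecutive errors; after
    `reset_after` seconds, allow one trial call through (half-open)."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()      # fail fast: dependency presumed down
            self.opened_at = None      # half-open: try the dependency again
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback()
        self.failures = 0
        return result
```

In practice `fn` would be an HTTP call with an explicit timeout, and the fallback might return cached data or a degraded response instead of blocking the caller.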
The questions that surface real risk
Once you have a dependency map, ask three questions for each dependency:
- What happens to the calling service if this dependency returns a 500?
- What happens if it takes 30 seconds to respond instead of 30 milliseconds?
- What happens if it is unreachable for five minutes?
The answers reveal whether you have error handling, timeout policies, and circuit breakers in place — or whether you are assuming your dependencies are more reliable than they are.
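These three questions can be turned into automated fault-injection checks against a stubbed dependency. A hedged sketch (the handler and the stubs are hypothetical; in production the timeout would be passed to the HTTP client so the call is actually cancelled, not measured after the fact):

```python
import time

def call_with_guardrails(dependency, timeout_s=0.5, fallback="degraded"):
    """Call a dependency and degrade gracefully on error, slowness,
    or unreachability instead of propagating the failure."""
    start = time.monotonic()
    try:
        status, body = dependency()
    except OSError:
        return fallback                      # unreachable: serve the fallback
    if status >= 500:
        return fallback                      # dependency returned a 500
    if time.monotonic() - start > timeout_s:
        return fallback                      # too slow to be useful
    return body

# Fault injection: simulate each of the three failure modes.
assert call_with_guardrails(lambda: (500, None)) == "degraded"

def slow():
    time.sleep(0.05)
    return (200, "late")
assert call_with_guardrails(slow, timeout_s=0.01) == "degraded"

def unreachable():
    raise ConnectionError("dependency down")
assert call_with_guardrails(unreachable) == "degraded"

assert call_with_guardrails(lambda: (200, "ok")) == "ok"
```

If any of these assertions fails for a real service, the answer to the corresponding question above is "the caller fails too".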
Recovery paths that have never been tested
A recovery procedure that exists only in a runbook and has never been exercised is not a recovery procedure. It is a hypothesis.
The most common untested recovery paths include:
- Database failover to a standby replica
- Restoring a service from a snapshot or backup
- Re-deploying a service after a container registry outage
- Recovering from an accidentally deleted configuration or secret
Each of these should be exercised in a staging environment on a regular cadence. The first time you run a database failover should not be during an incident at 2am. The first time you restore from backup should not be when the backup is the only copy of production data.
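Tracking when each recovery path was last exercised makes the cadence auditable rather than aspirational. A minimal sketch, assuming a quarterly cadence; the paths and dates are illustrative:

```python
from datetime import date, timedelta

# Hypothetical record of when each recovery path was last exercised.
LAST_EXERCISED = {
    "database-failover": date(2024, 1, 10),
    "restore-from-backup": date(2023, 6, 2),
    "registry-outage-redeploy": None,   # never exercised: still a hypothesis
}

def stale_recovery_paths(records, today, max_age_days=90):
    """Return paths not exercised within the cadence (quarterly by default)."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(
        path for path, last in records.items()
        if last is None or last < cutoff
    )

print(stale_recovery_paths(LAST_EXERCISED, date(2024, 3, 1)))
# ['registry-outage-redeploy', 'restore-from-backup']
```

A check like this can run in CI or a weekly report, turning "we should test failover sometime" into a visible, ageing item.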
Measuring recovery readiness
Two metrics define your recovery posture:
Recovery Time Objective (RTO) — how long can your service be unavailable before the business impact becomes unacceptable?
Recovery Point Objective (RPO) — how much data loss is acceptable in a recovery scenario?
Most teams have informal answers to these questions. Formalizing them creates a test target. If your RTO is four hours and your last database failover test took six, you have a gap to close.
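Once drill durations are recorded, the gap is simple arithmetic. A sketch, assuming recovery times are measured in hours:

```python
def recovery_gaps(rto_hours, drill_results):
    """Flag drills whose measured recovery time exceeded the RTO,
    with the size of the gap in hours."""
    return {
        name: measured - rto_hours
        for name, measured in drill_results.items()
        if measured > rto_hours
    }

# The example from the text: a four-hour RTO against a six-hour failover test.
print(recovery_gaps(4.0, {"db-failover": 6.0, "backup-restore": 3.5}))
# {'db-failover': 2.0}
```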
Alerting that does not reflect real failure modes
Alerting configurations drift away from actual failure modes over time. Teams add alerts reactively after incidents, tune thresholds to reduce noise, and rarely revisit whether the alert set as a whole would have caught recent availability degradations.
A useful exercise: take the last three significant incidents and ask whether your current alert configuration would have detected them earlier. If the answer is no, the alert set has drifted.
Common gaps include:
- Alerts on error rates but not on latency degradation
- Alerts scoped to individual services but not to end-to-end user journeys
- Alert thresholds calibrated against peak traffic that never fire during off-peak degradations
- Missing alerts on dependency health (database connection pool exhaustion, cache hit rate drops)
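One way to run the drift audit is to replay metric series recorded during past incidents against the current thresholds and check whether anything would have fired. A simplified sketch with static thresholds only (real alert rules usually add durations and aggregation windows); the rules and incident data are hypothetical:

```python
def would_have_fired(alert_rules, incident_metrics):
    """For each past incident, list the alerts whose threshold the
    recorded metric series crossed. An empty list is an alerting gap."""
    results = {}
    for incident, series in incident_metrics.items():
        fired = [
            name for name, (metric, threshold) in alert_rules.items()
            if any(v > threshold for v in series.get(metric, []))
        ]
        results[incident] = fired
    return results

RULES = {
    "high-error-rate": ("error_rate", 0.05),
    "high-p99-latency": ("p99_latency_ms", 1000),
}

# Hypothetical incident: latency degraded badly while error rate stayed low.
INCIDENTS = {
    "2024-02-slow-db": {
        "error_rate": [0.01, 0.02],
        "p99_latency_ms": [800, 2500, 4000],
    },
}
print(would_have_fired(RULES, INCIDENTS))
# {'2024-02-slow-db': ['high-p99-latency']}
```

Without the latency rule, this incident would map to an empty list, which is exactly the "error rates but not latency" gap from the list above.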
Synthetic monitoring as a complement to reactive alerts
Reactive alerting fires when something is already broken. Synthetic monitoring — scheduled requests that verify expected behavior from outside your system — can detect availability issues before real users do.
A synthetic monitor that hits your critical user journeys every minute from an external location will surface availability problems independently of your internal metrics. For public-facing services, this is a valuable complement to infrastructure-level alerting.
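A synthetic check can be as small as a scheduled script that asserts status and latency from outside the system. A minimal sketch using only the Python standard library; the URL and thresholds are placeholders, and a real deployment would probe from multiple external locations:

```python
import time
import urllib.request

def probe(url, timeout_s=5.0, max_latency_s=2.0):
    """Hit an endpoint and report whether it looks healthy from outside:
    reachable, returning 200, and responding within the latency budget."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
    except Exception as exc:
        return {"ok": False, "error": repr(exc)}
    latency = time.monotonic() - start
    return {
        "ok": status == 200 and latency <= max_latency_s,
        "status": status,
        "latency_s": round(latency, 3),
    }

# Run from cron or a scheduler every minute, e.g.:
# probe("https://example.com/checkout/health")
```

Wiring the result into the same alerting pipeline as internal metrics means a failed probe pages someone even when every internal dashboard looks green.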
Multi-region and redundancy trade-offs
Not every service needs multi-region deployment. Running across multiple regions carries significant cost and operational complexity, and for many workloads the reliability improvement does not justify the investment.
The right question is not "should we be multi-region?" but "what is the blast radius of a single-region failure, and is it acceptable?"
For most growing companies, the answer involves a tiered approach:
- Critical user-facing services: active-active or active-passive across two availability zones within a single region, with a clear plan for regional failover if needed
- Internal tooling and batch workloads: single-region with backup and recovery procedures
- Data layer: synchronous replication within a region, asynchronous to a second region for disaster recovery
This is not a universal recommendation — the right answer depends on your RTO, RPO, and the cost of the workload. But it is a starting framework for the conversation.
Making availability a continuous concern
Availability work is not a project that ends. Architectures change, dependencies change, and traffic patterns change. A single point of failure that did not exist six months ago may exist today because of a new integration or a schema migration that removed a fallback path.
The most reliable teams treat availability review as a regular operational activity: reviewing dependency maps when services change, exercising recovery paths on a quarterly schedule, and auditing alert configurations after every significant incident.
The goal is not perfect availability — it is known availability, with documented failure modes, tested recovery paths, and alerts that reflect how the system actually fails.