Reliability
Reliability in system design refers to the ability of a system to consistently perform its intended function correctly, even in the presence of faults or failures. It means the system should be fault-tolerant, resilient, and able to recover gracefully from errors.
- Availability: The system is up and accessible when needed.
- Fault tolerance: The system continues operating even if some components fail.
- Durability: Once data is written, it remains available without loss.
- Consistency: The system behaves in an expected way under all circumstances.
Goals of a Reliable System
- Minimize Downtime – Keep services running with minimal interruptions.
- Fail Gracefully – If a part fails, it doesn’t crash the whole system.
- Quick Recovery – Recover fast from outages or failures.
- Prevent Data Loss – Data must not be lost during or after failures.
- Predictable Behavior – The system works the same way every time.
Techniques to Improve Reliability
| Technique | Description |
|---|---|
| Redundancy | Duplicate components (servers, DBs) to handle failures |
| Load Balancing | Distribute traffic across multiple instances |
| Failover Mechanisms | Automatically switch to backup systems if the primary one fails |
| Replication | Copy data across multiple nodes for high availability and durability |
| Health Checks | Monitor services and restart failed ones |
| Retries with Backoff | Retry failed requests after increasing intervals |
| Graceful Degradation | The system reduces functionality instead of failing completely |
| Monitoring & Alerting | Detect and respond to issues quickly |
| Backups & Snapshots | Recover from disasters or data corruption |
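To make graceful degradation from the table concrete, here is a minimal Python sketch: if a (hypothetical) recommendation service fails, the page falls back to a static best-seller list instead of returning an error. The service names and fallback list are invented for the example.

```python
# Graceful degradation sketch: serve a safe fallback instead of failing.

def fetch_recommendations(user_id, recommender):
    """Return personalized recommendations, or a static fallback."""
    try:
        return recommender(user_id)
    except Exception:
        # Degrade gracefully: a generic list is better than an error page.
        return ["best-seller-1", "best-seller-2", "best-seller-3"]

def broken_recommender(user_id):
    # Simulates the recommendation service being down.
    raise ConnectionError("recommendation service unavailable")

print(fetch_recommendations(42, broken_recommender))  # falls back
```

The key point is that the failure of a non-critical dependency is contained: the caller still gets a usable response.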
Example of Reliable E-commerce System
Imagine you're designing an e-commerce platform like Amazon. Reliability is crucial—downtime means lost sales and customers.
Components
- Frontend Web Servers
- Backend Services (like Product Service, Payment Service)
- Database (e.g., MySQL/PostgreSQL)
- Cache Layer (e.g., Redis)
- Message Queue (e.g., RabbitMQ/Kafka)
How to Design for Reliability
- Web Servers:
  - Use a load balancer (like Nginx or AWS ELB) to distribute traffic.
  - Auto-scale using Kubernetes or AWS Auto Scaling.
- Backend Services:
  - Deploy multiple instances (e.g., 3 replicas of Payment Service).
  - Implement circuit breakers to avoid cascading failures.
- Database:
  - Use master-slave replication or multi-master.
  - Enable automated backups and point-in-time recovery.
- Caching Layer:
  - Use Redis in cluster mode with failover support.
- Message Queue:
  - Use Kafka with replication and acknowledgments to prevent message loss.
  - Retry failed messages with dead-letter queues.
- Monitoring & Alerts:
  - Tools like Prometheus + Grafana, Datadog, or ELK Stack.
  - Set alerts on latency, error rates, and service health.
Failure
In system design, failure refers to any event where a system, component, or service stops working as expected. This could mean downtime, data loss, performance degradation, or incorrect results — essentially, anything that breaks the system’s reliability, availability, or correctness.
Types of Failure
| Type | Description |
|---|---|
| Hardware failure | Disk crash, memory error, power outage |
| Software failure | Bugs, crashes, memory leaks, unhandled exceptions |
| Network failure | Packet loss, timeout, DNS issues, latency spikes |
| Database failure | Corruption, connection timeouts, replication lag |
| Dependency failure | External services (APIs, payment gateways) go down |
| Human error | Misconfigurations, accidental deletions, bad deployments |
| Security failure | Breach, DDoS attacks, unauthorized access |
Failure vs Fault
| Term | Definition |
|---|---|
| Fault | A bug or issue in the system (e.g., bad code, misconfiguration) |
| Failure | When a fault manifests and the system behaves incorrectly or crashes |
Examples of Failures
1. Database Failure
Scenario: Your e-commerce app uses a single MySQL instance. One day, the database crashes due to disk failure. Impact: Entire website becomes read-only or completely non-functional.
2. Service Dependency Failure
Your service uses an external payment gateway. That API becomes unavailable. Impact: Users can’t complete checkouts, causing loss in revenue.
3. Network Partitioning (Split-Brain)
Two services hosted in separate regions lose connection due to a network partition. Impact: They operate independently, potentially causing conflicting data writes.
4. Deployment Failure
A new release has a bug that crashes the user authentication service. Impact: Users cannot log in.
How to Handle Failures (Design Principles)
| Technique | Description |
|---|---|
| Redundancy | Use backup servers/services in case of failure |
| Failover | Automatically switch to standby systems |
| Circuit Breaker Pattern | Temporarily stop requests to a failing service |
| Retries with Backoff | Retry failed requests, but not too aggressively |
| Timeouts | Prevent hanging requests due to slow/failing services |
| Bulkheads | Isolate failures to small parts of the system |
| Graceful Degradation | Reduce functionality instead of total failure |
| Monitoring and Alerts | Detect and respond quickly to failures |
| Disaster Recovery | Plan and test for catastrophic failures |
Example of Netflix
Netflix must be resilient to failures since it serves millions globally.
Failure Scenario:
A single data center goes down due to a power issue.
Design Solutions Netflix uses:
- Multi-region deployments: Traffic rerouted to healthy regions.
- Chaos Monkey (part of the Simian Army): Intentionally kills services in production to test resilience.
- Retry and backoff: If a microservice fails, the caller retries with increasing intervals.
- Circuit breakers: If a service fails continuously, calls are stopped temporarily.
Result: End users likely don’t notice the issue.
Failure Detection & Recovery
| Step | Description |
|---|---|
| Detection | Use health checks, metrics, heartbeat signals |
| Isolation | Use containerization or process boundaries to localize failure |
| Recovery | Auto-restart, failover, database restore, or fallback services |
Availability
Availability in system design refers to how accessible and operational a system is at any given time. It answers the question: “Can users access the system when they need to?”
A highly available system is designed to ensure minimal downtime and maximum uptime, even in the face of failures or heavy load.
Availability = (Uptime) / (Uptime + Downtime)
For example, a service that is available 99.9% of the time may be down for about 44 minutes per month.
| Availability % | Downtime per Year | Downtime per Month |
|---|---|---|
| 99% | ~3.65 days | ~7.2 hours |
| 99.9% | ~8.8 hours | ~43.8 minutes |
| 99.99% | ~52.6 minutes | ~4.4 minutes |
| 99.999% | ~5.26 minutes | ~26 seconds |
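The availability formula above can be turned into a quick downtime calculator. The numbers reproduce the table, assuming an average month of 43,800 minutes (365 days / 12):

```python
# Convert an availability percentage into allowed downtime.

def downtime_minutes(availability_pct, period_minutes):
    """Minutes of permitted downtime in a period at a given availability."""
    return period_minutes * (1 - availability_pct / 100)

MONTH = 365 * 24 * 60 / 12   # 43,800 minutes (average month)
YEAR = 365 * 24 * 60         # 525,600 minutes

print(round(downtime_minutes(99.9, MONTH), 1))   # 43.8 minutes/month
print(round(downtime_minutes(99.99, YEAR), 1))   # 52.6 minutes/year
```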
Key Concepts for High Availability
| Concept | Explanation |
|---|---|
| Redundancy | Duplicate components to take over if one fails |
| Failover | Switch automatically to a backup system or server |
| Load Balancing | Distribute incoming traffic to multiple servers |
| Health Checks | Constant monitoring to ensure services are running properly |
| Stateless Architecture | Servers don't store session data; allows any instance to handle requests |
| Scalability | Handle traffic surges without going down |
| Monitoring & Alerts | Get notified instantly of any service degradation |
| Disaster Recovery | Plans to recover from major failures (like region-wide outages) |
Techniques to Improve Availability
| Technique | How It Helps |
|---|---|
| Multi-Zone Deployment | Deploy services across availability zones or regions |
| Auto Scaling | Add/remove servers automatically to maintain service health |
| Caching | Reduce load on databases, enabling faster and more resilient performance |
| Database Replication | Keep secondary DBs ready for failover |
| Circuit Breaker Pattern | Avoid cascading failures |
Example of Highly Available Web App
Let’s say you are designing a food delivery app backend like Uber Eats.
Goal: 99.99% Availability
Components:
- Frontend Web Servers
- Backend Microservices (e.g., Order, Payment, Notification)
- Database (e.g., PostgreSQL)
- Cache (Redis)
- Message Queue (e.g., Kafka)
- Load Balancer (e.g., NGINX, AWS ALB)
High Availability Design:
- Load Balancers:
  - Distribute incoming traffic to multiple web servers.
  - If one server goes down, traffic is rerouted.
- Web and Backend Services:
  - Deployed in multiple availability zones (AZs).
  - Use auto-scaling groups in AWS or Kubernetes deployments with multiple replicas.
- Database:
  - Master-slave replication.
  - Automatic failover to replicas using tools like Amazon RDS Multi-AZ or Patroni.
- Caching Layer:
  - Redis Sentinel or Redis Cluster for high availability and failover.
- Message Queue:
  - Kafka or RabbitMQ with replication to prevent message loss if a broker fails.
- Monitoring:
  - Use Prometheus + Grafana or Datadog.
  - Trigger alerts if response times spike or instances fail.
- Disaster Recovery Plan:
  - Nightly backups.
  - Infrastructure templates (e.g., Terraform) to re-deploy in a new region quickly.
Availability vs Reliability
| Term | Focus | Example |
|---|---|---|
| Availability | System is accessible and working | Website loads when the user visits it |
| Reliability | System is functionally correct over time | The correct product is shown when searched |
Example Availability in AWS
Imagine hosting your app on AWS:
- EC2 instances in 3 AZs behind an Elastic Load Balancer
- RDS PostgreSQL in Multi-AZ mode
- S3 for asset storage (which is designed for 99.999999999% durability and high availability)
- CloudFront CDN to distribute content globally
This setup ensures that even if one data center (AZ) goes down, your app remains available.
Fault Tolerance
Fault tolerance means the system can withstand failures without affecting the overall functionality.
Failures are inevitable in any large-scale system—due to hardware crashes, software bugs, network issues, or human errors. A fault-tolerant system anticipates these failures and is designed to isolate, absorb, or recover from them.
Fault vs Failure vs Fault Tolerance
| Term | Meaning |
|---|---|
| Fault | A defect or abnormal condition (e.g., a bug, hardware failure) |
| Failure | When a fault causes a system or component to behave incorrectly |
| Fault Tolerance | The system’s ability to continue functioning despite the fault |
Characteristics of Fault-Tolerant Systems
| Feature | Description |
|---|---|
| Redundancy | Having backup components that can take over in case of failure |
| Failover | Automatic switching to a backup system/component |
| Replication | Duplication of data/services across multiple nodes |
| Isolation | Contain failures to prevent cascading breakdowns |
| Graceful Degradation | If part of the system fails, reduce functionality instead of full crash |
| Monitoring & Alerts | Detect faults early and respond proactively |
Netflix's Fault Tolerance
Netflix is a great real-world example of extreme fault tolerance:
- Runs across multiple AWS regions.
- Uses Chaos Monkey, a tool that intentionally kills random servers to test fault tolerance.
- Relies on microservices with:
- Redundancy
- Circuit breakers
- Service discovery
- Retry and timeout logic
- Eventual consistency in distributed systems

Even if one part of Netflix breaks (e.g., recommendations), the core video streaming still works—thanks to graceful degradation.
Fault Tolerance in Distributed Systems
Distributed systems (e.g., cloud apps, microservices) face unique challenges:
| Fault Type | Strategy to Handle |
|---|---|
| Node failure | Use replication and automatic failover |
| Network partition | Apply CAP theorem trade-offs (choose between consistency and availability) |
| Service crash | Restart with health checks (e.g., Kubernetes liveness probes) |
| Disk failure | Use RAID, cloud storage, and backups |
| Corrupted data | Use checksums and versioned backups |
Redundancy
Redundancy in system design refers to having multiple components or resources that serve the same purpose, so that if one fails, the others can take over. It is a core technique for achieving fault tolerance, high availability, and reliability in modern systems.
Why is Redundancy Important?
- Prevents single points of failure (SPOF)
- Improves system availability and reliability
- Enables load balancing and failover
- Supports disaster recovery plans
Types of Redundancy
| Type | Description |
|---|---|
| Hardware Redundancy | Multiple physical devices (e.g., servers, disks, power supplies) |
| Software Redundancy | Multiple software instances performing the same function |
| Network Redundancy | Multiple network paths, routers, ISPs |
| Data Redundancy | Storing copies of data in different places (e.g., replication, backups) |
| Geographic Redundancy | Deploying across multiple data centers or regions |
| Service Redundancy | Redundant APIs or microservices doing the same job |
Redundancy in a Web Application
Imagine you're building an online ticket booking system. Here’s how redundancy is applied:
Goal: Ensure 99.99% availability and zero data loss
Components:
- Web servers
- Backend services (Booking, Payment, Notification)
- Databases
- File storage
- Message queues
Where Redundancy is Applied:
- Web Servers (Software Redundancy)
  - Deploy multiple web server instances behind a load balancer.
  - If one crashes, others continue to serve traffic.
- Database (Data Redundancy)
  - Use master-slave replication (e.g., MySQL, PostgreSQL).
  - If the master goes down, the slave can take over (automatic failover).
- File Storage (Geographic Redundancy)
  - Store user-uploaded files in cloud storage (e.g., AWS S3), replicated across multiple availability zones (AZs).
- Message Queue (Service Redundancy)
  - Kafka cluster with multiple brokers and topic replication.
  - Messages remain safe even if one broker fails.
- Network (Network Redundancy)
  - Use multiple NICs, routers, and ISPs to ensure internet connectivity even if one fails.
Failure Scenario
Without Redundancy
- You have a single database server.
- It crashes due to disk failure.
- Entire system goes down. Orders can't be placed.
With Redundancy
- You have a replicated database setup.
- Master fails → slave takes over in seconds.
- System stays online, users continue booking tickets.
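The failover scenario above can be sketched in a few lines of Python. `FailoverDB` and the node functions are illustrative stand-ins for real database clients:

```python
# Failover sketch: try the primary first, switch to the replica on failure.

class FailoverDB:
    def __init__(self, primary, replica):
        self.nodes = [primary, replica]   # ordered: primary first

    def query(self, sql):
        last_error = None
        for node in self.nodes:
            try:
                return node(sql)
            except ConnectionError as e:
                last_error = e            # node is down; try the next one
        raise last_error                  # every node failed

def dead_primary(sql):
    raise ConnectionError("primary is down")

def healthy_replica(sql):
    return f"result of {sql!r} from replica"

db = FailoverDB(dead_primary, healthy_replica)
print(db.query("SELECT 1"))  # served by the replica
```

Real failover systems add health checks and promotion logic, but the core idea is the same: the caller never sees the primary's failure.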
Redundancy vs Replication vs Backup
| Concept | Purpose | Example |
|---|---|---|
| Redundancy | Availability during failure | Multiple web servers |
| Replication | Real-time duplication of data | MySQL master-slave |
| Backup | Recovery from long-term data loss | Nightly database snapshots |
Design Considerations
| Consideration | Why It Matters |
|---|---|
| Cost | More components = more expense |
| Consistency | Need to handle data consistency across replicas |
| Failover logic | Must be well-tested and fast |
| Monitoring | Alert if primary fails and system switches to backup |
| Testing | Simulate failures regularly (e.g., Chaos Engineering) |
Common Redundancy Patterns
| Pattern | Description |
|---|---|
| Active-Active | All redundant nodes serve traffic simultaneously; load is shared |
| Active-Passive | A standby node takes over only when the primary fails |
| N+1 | One spare component is provisioned for every N active components |
Fault Detection
Fault detection is the process of identifying when a component or service in a system is behaving abnormally or failing, so corrective action can be taken.
It is a key first step in building fault-tolerant, highly available, and resilient systems.
Why Is Fault Detection Important?
- Prevents cascading failures by catching issues early
- Improves reliability and uptime
- Enables automated recovery (e.g., auto-restart, failover)
- Helps alert human operators quickly
- Ensures service-level objectives (SLOs) are met
Fault Detection vs Fault Tolerance
| Concept | Purpose |
|---|---|
| Fault Detection | Identifies faults when they happen |
| Fault Tolerance | Continues operation despite the fault |
How Fault Detection Works
Fault detection usually involves observing system behavior and checking if it deviates from the expected state.
Common Techniques:
| Technique | Description |
|---|---|
| Health Checks | Regularly ping services to check if they’re responsive |
| Heartbeats | Components send periodic signals to confirm they’re alive |
| Timeouts | If a service doesn't respond in time, it may be considered faulty |
| Monitoring & Metrics | Use tools to watch CPU, memory, error rates, request latency, etc. |
| Log Analysis | Scan logs for known error patterns |
| Synthetic Tests | Simulate user activity to detect failures in end-to-end flows |
| Alerting Systems | Send notifications (email, Slack, PagerDuty) when faults are detected |
| Anomaly Detection (AI/ML) | Detect subtle failures by modeling normal behavior and finding deviations |
Fault Detection in a Web Application
Imagine a ride-sharing app backend with services like:
- User Service
- Ride Matching Service
- Payment Service
All running on Kubernetes.
Fault Detection Methods:
- Liveness and Readiness Probes (Kubernetes)

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
```

  - Kubernetes kills and restarts a container if it fails health checks.
  - Detects the fault automatically and triggers self-healing.
- Heartbeats Between Services
  - Each service sends heartbeats to a central monitoring service.
  - If no heartbeat is received within X seconds → mark as unhealthy.
- Monitoring with Prometheus + Grafana
  - Track:
    - Error rate > 5% → raise alert
    - CPU usage > 90% → potential overload
    - Request latency > 3s → backend lag
- Alerting via PagerDuty
  - When a fault is detected (e.g., Payment Service crashes), a notification is sent to an engineer:
    `PaymentService error rate > 10% for last 5 mins on prod cluster`
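The heartbeat technique above can be sketched as a small monitor that marks a service unhealthy once no heartbeat has arrived within the timeout window. Timestamps are injected here so the example runs instantly:

```python
import time

# Heartbeat-based fault detection sketch.

class HeartbeatMonitor:
    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}               # service -> last heartbeat time

    def heartbeat(self, service, now=None):
        self.last_seen[service] = now if now is not None else time.time()

    def is_healthy(self, service, now=None):
        now = now if now is not None else time.time()
        seen = self.last_seen.get(service)
        # Unknown services, or services silent past the timeout, are unhealthy.
        return seen is not None and (now - seen) <= self.timeout

monitor = HeartbeatMonitor(timeout_seconds=10)
monitor.heartbeat("payment-service", now=100)
print(monitor.is_healthy("payment-service", now=105))  # True: 5s ago
print(monitor.is_healthy("payment-service", now=120))  # False: 20s silence
```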
Health Check
A health check is a mechanism that tests whether a component or service in a system is working correctly. It’s a foundational tool used in modern system design for:
- Monitoring service health
- Triggering auto-healing
- Enabling load balancing
- Facilitating graceful startup and shutdown
Types of Health Checks
| Type | Purpose | Typical Action |
|---|---|---|
| Liveness Check | Checks if the app is running (not dead or stuck) | Restart if it fails (e.g., Kubernetes) |
| Readiness Check | Checks if the app is ready to serve requests | Add/remove instance from load balancer |
| Startup Check | Checks if the app has finished initializing | Delay liveness and readiness until app is ready |
Where Health Checks Are Used
- Load Balancers
  - Detect unhealthy nodes and stop sending traffic.
  - Example: AWS ALB/ELB, NGINX, HAProxy
- Container Orchestrators (like Kubernetes)
  - Kill and restart failed containers automatically.
  - Only send traffic to "ready" pods.
- Service Meshes
  - Decide routing based on service health (e.g., Istio, Linkerd).
- Monitoring Systems
  - Collect health status to alert or visualize system state.
Health Check in Kubernetes
Application: Food Delivery Backend API. Let's say you have a Node.js app running in Kubernetes. You want to:
- Restart it if it crashes (liveness)
- Wait for DB connection before allowing traffic (readiness)
Sample Express API
```javascript
app.get("/healthz", (req, res) => {
  if (db.isConnected()) {
    res.status(200).send("OK");
  } else {
    res.status(500).send("DB not connected");
  }
});
```
Kubernetes YAML
```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /healthz
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 5
```
Kubernetes will:
- Check `/healthz` every 5 seconds.
- If 3 consecutive failures → restart the container (liveness).
- If it fails readiness → stop sending traffic to it.
Characteristics of a Good Health Check Endpoint
| Characteristic | Why It Matters |
|---|---|
| Fast | Should respond quickly (e.g., <100ms) |
| Non-blocking | Should not interfere with actual processing logic |
| Minimal logic | Avoid full computation — just check essentials (DB, cache, etc.) |
| Returns correct status codes | 200 OK for healthy, 500/503 for unhealthy |
| Customizable | Should allow adding logic like DB/caching/service dependency checks |
What Happens Without Health Checks?
- A crashed service keeps receiving traffic, causing errors.
- An initializing service gets hit before it’s ready.
- Failures go unnoticed until users complain or traffic drops.
Recovery
Recovery is the process of bringing a system or component back to a working state after a failure has occurred.
Why It Matters: No matter how robust a system is, failures are inevitable — recovery ensures minimal downtime and data loss.
Recovery Techniques
| Type | Description |
|---|---|
| Automatic Recovery | System detects failure and recovers without human intervention |
| Manual Recovery | Requires human to restore from backups or fix configuration |
| Checkpointing | Periodic snapshots to resume from last good state |
| Failover | Switch to standby system/service (e.g., secondary DB) |
Example
- A PostgreSQL DB crashes. AWS RDS triggers automatic failover to a read replica.
- Users see minimal downtime (a few seconds) and the system continues operating.
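Checkpointing, listed in the table above, can be sketched as follows: progress is persisted after each record, so a restart resumes from the last good position rather than from scratch. The file format and the failure injection are purely illustrative:

```python
import json
import os
import tempfile

# Checkpointing sketch: resume from the last saved position after a crash.

def process(records, checkpoint_path, fail_at=None):
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]    # resume point
    done = []
    for i in range(start, len(records)):
        if fail_at == i:
            raise RuntimeError("simulated crash")
        done.append(records[i])
        with open(checkpoint_path, "w") as f:     # persist progress
            json.dump({"next_index": i + 1}, f)
    return done

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
try:
    process(["a", "b", "c", "d"], path, fail_at=2)   # crashes after "b"
except RuntimeError:
    pass
print(process(["a", "b", "c", "d"], path))  # resumes with ['c', 'd']
```

Real systems checkpoint less often (per batch, per time interval) to balance recovery speed against write overhead.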
System Stability
Stability means a system continues operating within expected performance and behavior limits, even under stress or partial failure.
Why It Matters: Unstable systems may:
- Crash under load
- Return incorrect data
- Exhibit erratic behavior
Stability Strategies
| Strategy | Description |
|---|---|
| Load Shedding | Drop some requests when system is overwhelmed (return 429 Too Many Requests) |
| Rate Limiting | Control how much traffic a system accepts |
| Backpressure | Tell clients to slow down sending data (common in message queues) |
| Resource Isolation | Separate critical components to avoid full system crash |
Example:
- If your payment service is getting overloaded, you:
- Temporarily stop accepting new requests (load shedding)
- Let current requests finish without crashing the entire system
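Load shedding and rate limiting are often implemented with a token bucket. Below is a minimal sketch; the capacity and refill rate are arbitrary demo values, and when `allow` returns `False` the service would respond with 429 Too Many Requests:

```python
# Token-bucket sketch for load shedding / rate limiting.

class TokenBucket:
    def __init__(self, capacity, refill_per_second):
        self.capacity = capacity
        self.tokens = capacity
        self.refill = refill_per_second
        self.last = 0.0

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True          # request accepted
        return False             # shed load: respond 429 Too Many Requests

bucket = TokenBucket(capacity=2, refill_per_second=1)
print([bucket.allow(now=0.0) for _ in range(3)])  # [True, True, False]
```

Rejecting the third burst request keeps the service within its capacity instead of letting queued work degrade every request.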
Timeout
A timeout defines how long a system or client should wait for a response before giving up.
Why It Matters:
Without timeouts:
- A system might wait forever for a response
- Threads or resources may be blocked
- Cascading failures can occur
Usage:
- Set timeouts for API calls, DB queries, external services
```python
import requests

# Wait at most 3 seconds for the payment API to respond
requests.get("https://api.payment.com/pay", timeout=3)
```
- If the payment API doesn’t respond within 3 seconds, retry or fail gracefully.
Retries
Retries automatically reattempt failed requests in case of transient errors (e.g., network blips, timeouts).
Why It Matters:
Many failures are temporary — retrying can resolve the issue without user impact.
Retry Best Practices
| Practice | Why It Helps |
|---|---|
| Exponential Backoff | Avoid overwhelming services with aggressive retries |
| Jitter | Add randomness to prevent retry storms |
| Retry Budget | Limit number of retries to avoid infinite loops |
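Putting the three practices together, a minimal retry helper with exponential backoff and jitter might look like this. The `sleep` function is injected so the example runs instantly; real code would pass `time.sleep`:

```python
import random

# Retry sketch with exponential backoff, jitter, and a retry budget.

def retry(operation, max_attempts=4, base_delay=0.5, sleep=lambda s: None):
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise                                   # budget exhausted
            delay = base_delay * (2 ** attempt)         # 0.5s, 1s, 2s, ...
            delay += random.uniform(0, base_delay)      # jitter: spread retries
            sleep(delay)

calls = {"n": 0}
def flaky():
    # Fails twice, then succeeds: a typical transient error.
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient error")
    return "ok"

print(retry(flaky))  # succeeds on the third attempt
```

Only retry idempotent operations this way; retrying a non-idempotent call (e.g., a payment) needs deduplication on the server side.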
Circuit Breaker Pattern
A circuit breaker prevents a system from repeatedly trying to use a failing component, giving it time to recover.
Inspired by electrical circuit breakers, it works in three states:
| State | Description |
|---|---|
| Closed | Requests pass through normally |
| Open | Requests fail immediately (service is down) |
| Half-Open | Send a few trial requests to check if service has recovered |
Why It Matters:
- Prevents overwhelming a failing service
- Helps fail fast, rather than wasting time
- Essential for graceful degradation
Example of Circuit Breaker
Suppose the Payment Service is failing:
- After 5 failures in a row → Circuit opens.
- For the next 30 seconds → All payment calls fail instantly.
- After 30 seconds → Half-open: 1 test request sent.
- If it succeeds → Circuit closes.
- If it fails → Circuit remains open.
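The state machine above can be sketched as a small class. The thresholds mirror the example (a failure threshold and a recovery timeout), and the clock is injectable so the open-to-half-open transition can be demonstrated without real waiting:

```python
import time

# Three-state circuit breaker sketch: closed -> open -> half-open.

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30, clock=time.time):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0
        self.state = "closed"
        self.opened_at = None

    def call(self, operation):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.recovery_timeout:
                self.state = "half-open"        # let one trial request through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"             # trip the breaker
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.state = "closed"                   # success: reset
        return result
```

Production implementations (e.g., resilience4j, Polly) add sliding windows and per-endpoint state, but the state transitions are the same as in this sketch.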