
Reliability

Reliability in system design refers to the ability of a system to consistently perform its intended function correctly, even in the presence of faults or failures. It means the system should be fault-tolerant, resilient, and able to recover gracefully from errors.

  • Availability: The system is up and accessible when needed.
  • Fault tolerance: The system continues operating even if some components fail.
  • Durability: Once data is written, it remains available without loss.
  • Consistency: The system behaves in an expected way under all circumstances.

Goals of a Reliable System

  • Minimize Downtime – Keep services running with minimal interruptions.
  • Fail Gracefully – If a part fails, it doesn’t crash the whole system.
  • Quick Recovery – Recover fast from outages or failures.
  • Prevent Data Loss – Data must not be lost during or after failures.
  • Predictable Behavior – The system works the same way every time.

Improve Reliability

  • Redundancy: Duplicate components (servers, DBs) to handle failures
  • Load Balancing: Distribute traffic across multiple instances
  • Failover Mechanisms: Automatically switch to backup systems if the primary one fails
  • Replication: Copy data across multiple nodes for high availability and durability
  • Health Checks: Monitor services and restart failed ones
  • Retries with Backoff: Retry failed requests after increasing intervals
  • Graceful Degradation: The system reduces functionality instead of failing completely
  • Monitoring & Alerting: Detect and respond to issues quickly
  • Backups & Snapshots: Recover from disasters or data corruption
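
Several of these techniques need only a few lines of code. As a minimal sketch of graceful degradation (the function names here are hypothetical), a wrapper falls back to a cheap, always-available default when the primary call fails:

```python
def with_fallback(primary, fallback):
    """Return primary()'s result, degrading to fallback() on any failure."""
    def wrapped(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            # Reduced functionality instead of a crash
            return fallback(*args, **kwargs)
    return wrapped

def personalized_recommendations(user_id):
    raise TimeoutError("recommendation service down")  # simulated outage

def popular_items(user_id):
    return ["bestseller-1", "bestseller-2"]  # static, always-available default

get_recommendations = with_fallback(personalized_recommendations, popular_items)
print(get_recommendations(42))  # → ['bestseller-1', 'bestseller-2']
```

Users still see product suggestions during the outage — just less personalized ones.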

Example of Reliable E-commerce System

Imagine you're designing an e-commerce platform like Amazon. Reliability is crucial—downtime means lost sales and customers.

Components

  • Frontend Web Servers
  • Backend Services (like Product Service, Payment Service)
  • Database (e.g., MySQL/PostgreSQL)
  • Cache Layer (e.g., Redis)
  • Message Queue (e.g., RabbitMQ/Kafka)

How to Design for Reliability

  1. Web Servers:

    • Use load balancer (like Nginx or AWS ELB) to distribute traffic.
    • Auto-scale using Kubernetes or AWS Auto Scaling.
  2. Backend Services:

    • Deploy multiple instances (e.g., 3 replicas of Payment Service).
    • Implement circuit breakers to avoid cascading failures.
  3. Database:

    • Use master-slave replication or multi-master.
    • Enable automated backups and point-in-time recovery.
  4. Caching Layer:

    • Use Redis in cluster mode with failover support.
  5. Message Queue:

    • Use Kafka with replication and acknowledgments to prevent message loss.
    • Retry failed messages with dead-letter queues.
  6. Monitoring & Alerts:

    • Tools like Prometheus + Grafana, Datadog, or ELK Stack.
    • Set alerts on latency, error rates, and service health.

Failure

In system design, failure refers to any event where a system, component, or service stops working as expected. This could mean downtime, data loss, performance degradation, or incorrect results — essentially, anything that breaks the system’s reliability, availability, or correctness.

Types of Failure

  • Hardware failure: Disk crash, memory error, power outage
  • Software failure: Bugs, crashes, memory leaks, unhandled exceptions
  • Network failure: Packet loss, timeouts, DNS issues, latency spikes
  • Database failure: Corruption, connection timeouts, replication lag
  • Dependency failure: External services (APIs, payment gateways) go down
  • Human error: Misconfigurations, accidental deletions, bad deployments
  • Security failure: Breaches, DDoS attacks, unauthorized access

Failure vs Fault

  • Fault: A bug or issue in the system (e.g., bad code, misconfiguration)
  • Failure: When a fault manifests and the system behaves incorrectly or crashes
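
The distinction shows up clearly in code: a latent bug is a fault, and it only becomes a failure when an input triggers it. A toy Python illustration:

```python
def average(values):
    # Fault: no guard for an empty list (latent bug, harmless so far)
    return sum(values) / len(values)

print(average([2, 4, 6]))   # works fine: 4.0 — the fault is dormant
try:
    average([])             # this input triggers the fault...
except ZeroDivisionError as e:
    print("failure:", e)    # ...and it manifests as a failure
```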

Examples of Failures

1. Database Failure

Scenario: Your e-commerce app uses a single MySQL instance. One day, the database crashes due to disk failure. Impact: Entire website becomes read-only or completely non-functional.

2. Service Dependency Failure

Your service uses an external payment gateway. That API becomes unavailable. Impact: Users can’t complete checkouts, causing loss in revenue.

3. Network Partitioning (Split-Brain)

Two services hosted in separate regions lose connection due to a network partition. Impact: They operate independently, potentially causing conflicting data writes.

4. Deployment Failure

A new release has a bug that crashes the user authentication service. Impact: Users cannot log in.

How to Handle Failures (Design Principles)

  • Redundancy: Use backup servers/services in case of failure
  • Failover: Automatically switch to standby systems
  • Circuit Breaker Pattern: Temporarily stop requests to a failing service
  • Retries with Backoff: Retry failed requests, but not too aggressively
  • Timeouts: Prevent hanging requests due to slow/failing services
  • Bulkheads: Isolate failures to small parts of the system
  • Graceful Degradation: Reduce functionality instead of total failure
  • Monitoring and Alerts: Detect and respond quickly to failures
  • Disaster Recovery: Plan and test for catastrophic failures

Example of Netflix

Netflix must be resilient to failures since it serves millions globally.

Failure Scenario:

A single data center goes down due to a power issue.

Design Solutions Netflix uses:
  • Multi-region deployments: Traffic rerouted to healthy regions.
  • Chaos Monkey (part of the Simian Army): Intentionally kills services in production to test resilience.
  • Retry and backoff: If a microservice fails, the caller retries with increasing intervals.
  • Circuit breakers: If a service fails continuously, calls are stopped temporarily.

Result: End users likely don’t notice the issue.

Failure Detection & Recovery

  • Detection: Use health checks, metrics, heartbeat signals
  • Isolation: Use containerization or process boundaries to localize failure
  • Recovery: Auto-restart, failover, database restore, or fallback services

Availability

Availability in system design refers to how accessible and operational a system is at any given time. It answers the question: “Can users access the system when they need to?”

A highly available system is designed to ensure minimal downtime and maximum uptime, even in the face of failures or heavy load.

Availability = (Uptime) / (Uptime + Downtime)

For example, a service that is available 99.9% of the time means it may be down for about 43 minutes per month.

  • 99%: ~3.65 days of downtime per year (~7.2 hours per month)
  • 99.9%: ~8.8 hours per year (~43.8 minutes per month)
  • 99.99%: ~52.6 minutes per year (~4.4 minutes per month)
  • 99.999%: ~5.26 minutes per year (~26 seconds per month)
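
These downtime figures follow directly from the availability formula above; a few lines of Python reproduce the table:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes(availability_pct, period_minutes=MINUTES_PER_YEAR):
    """Allowed downtime for a given availability over a period."""
    return (1 - availability_pct / 100) * period_minutes

for pct in (99.0, 99.9, 99.99, 99.999):
    yearly = downtime_minutes(pct)
    print(f"{pct}% -> {yearly / 60:.1f} h/year, {yearly / 12:.1f} min/month")
```

For 99.9%, this gives 525.6 minutes per year, i.e., ~8.8 hours per year or ~43.8 minutes per month, matching the table.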

Key Concepts for High Availability

  • Redundancy: Duplicate components to take over if one fails
  • Failover: Switch automatically to a backup system or server
  • Load Balancing: Distribute incoming traffic to multiple servers
  • Health Checks: Constant monitoring to ensure services are running properly
  • Stateless Architecture: Servers don't store session data, so any instance can handle a request
  • Scalability: Handle traffic surges without going down
  • Monitoring & Alerts: Get notified instantly of any service degradation
  • Disaster Recovery: Plans to recover from major failures (like region-wide outages)

Techniques to Improve Availability

  • Multi-Zone Deployment: Deploy services across availability zones or regions
  • Auto Scaling: Add/remove servers automatically to maintain service health
  • Caching: Reduce load on databases, enabling faster and more resilient performance
  • Database Replication: Keep secondary DBs ready for failover
  • Circuit Breaker Pattern: Avoid cascading failures

Example of Highly Available Web App

Let’s say you are designing a food delivery app backend like Uber Eats.

Goal: 99.99% Availability

Components:

  • Frontend Web Servers
  • Backend Microservices (e.g., Order, Payment, Notification)
  • Database (e.g., PostgreSQL)
  • Cache (Redis)
  • Message Queue (e.g., Kafka)
  • Load Balancer (e.g., NGINX, AWS ALB)

High Availability Design:

  1. Load Balancers:

    • Distribute incoming traffic to multiple web servers.
    • If one server goes down, traffic is rerouted.
  2. Web and Backend Services:

    • Deployed in multiple availability zones (AZs).
    • Use auto-scaling groups in AWS or Kubernetes deployments with multiple replicas.
  3. Database:

    • Master-Slave replication.
    • Automatic failover to replicas using tools like Amazon RDS Multi-AZ or Patroni.
  4. Caching Layer:

    • Redis Sentinel or Redis Cluster for high availability and failover.
  5. Message Queue:

    • Kafka or RabbitMQ with replication to prevent message loss if a broker fails.
  6. Monitoring:

    • Use Prometheus + Grafana or Datadog.
    • Trigger alerts if response times spike or instances fail.
  7. Disaster Recovery Plan:

    • Nightly backups.
    • Infrastructure templates (e.g., Terraform) to re-deploy in a new region quickly.

Availability vs Reliability

  • Availability: The system is accessible and working (e.g., the website loads when the user visits it)
  • Reliability: The system is functionally correct over time (e.g., the correct product is shown when searched)

Example Availability in AWS

Imagine hosting your app on AWS:

  • EC2 instances in 3 AZs behind an Elastic Load Balancer
  • RDS PostgreSQL in Multi-AZ mode
  • S3 for asset storage (which is designed for 99.999999999% durability and high availability)
  • CloudFront CDN to distribute content globally

This setup ensures that even if one data center (AZ) goes down, your app remains available.

Fault Tolerance

Fault tolerance means the system can withstand failures without affecting the overall functionality.

Failures are inevitable in any large-scale system—due to hardware crashes, software bugs, network issues, or human errors. A fault-tolerant system anticipates these failures and is designed to isolate, absorb, or recover from them.

Fault vs Failure vs Fault Tolerance

  • Fault: A defect or abnormal condition (e.g., a bug, hardware failure)
  • Failure: When a fault causes a system or component to behave incorrectly
  • Fault Tolerance: The system’s ability to continue functioning despite the fault

Characteristics of Fault-Tolerant Systems

  • Redundancy: Backup components that can take over in case of failure
  • Failover: Automatic switching to a backup system/component
  • Replication: Duplication of data/services across multiple nodes
  • Isolation: Contain failures to prevent cascading breakdowns
  • Graceful Degradation: If part of the system fails, reduce functionality instead of crashing fully
  • Monitoring & Alerts: Detect faults early and respond proactively

Netflix's Fault Tolerance

Netflix is a great real-world example of extreme fault tolerance:

  • Runs across multiple AWS regions.
  • Uses Chaos Monkey, a tool that intentionally kills random servers to test fault tolerance.
  • Relies on microservices with:
    • Redundancy
    • Circuit breakers
    • Service discovery
    • Retry and timeout logic
    • Eventual consistency in distributed systems

Even if one part of Netflix breaks (e.g., recommendations), the core video streaming still works—thanks to graceful degradation.

Fault Tolerance in Distributed Systems

Distributed systems (e.g., cloud apps, microservices) face unique challenges:

  • Node failure: Use replication and automatic failover
  • Network partition: Apply CAP theorem trade-offs (choose between consistency and availability)
  • Service crash: Restart with health checks (e.g., Kubernetes liveness probes)
  • Disk failure: Use RAID, cloud storage, and backups
  • Corrupted data: Use checksums and versioned backups
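
The last strategy — detecting corrupted data with checksums — can be sketched with Python's standard library: store a hash alongside the data and recompute it before trusting any copy.

```python
import hashlib

def checksum(data: bytes) -> str:
    """SHA-256 digest of the data, stored alongside it for later verification."""
    return hashlib.sha256(data).hexdigest()

original = b"order:1234,total:59.99"
stored_sum = checksum(original)          # saved when the data was written

# Later, verify a copy before trusting it
corrupted = b"order:1234,total:599.9"    # e.g., bit rot or a bad disk sector
print("intact copy ok:", checksum(original) == stored_sum)
print("corruption detected:", checksum(corrupted) != stored_sum)
```

On a mismatch, the system would fall back to a replica or a versioned backup rather than serve the corrupted copy.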

Redundancy

Redundancy in system design refers to having multiple components or resources that serve the same purpose, so that if one fails, the others can take over. It is a core technique for achieving fault tolerance, high availability, and reliability in modern systems.

Why is Redundancy Important?

  • Prevents single points of failure (SPOF)
  • Improves system availability and reliability
  • Enables load balancing and failover
  • Supports disaster recovery plans

Types of Redundancy

  • Hardware Redundancy: Multiple physical devices (e.g., servers, disks, power supplies)
  • Software Redundancy: Multiple software instances performing the same function
  • Network Redundancy: Multiple network paths, routers, ISPs
  • Data Redundancy: Storing copies of data in different places (e.g., replication, backups)
  • Geographic Redundancy: Deploying across multiple data centers or regions
  • Service Redundancy: Redundant APIs or microservices doing the same job

Redundancy in a Web Application

Imagine you're building an online ticket booking system. Here’s how redundancy is applied:

Goal: Ensure 99.99% availability and zero data loss

Components:

  • Web servers
  • Backend services (Booking, Payment, Notification)
  • Databases
  • File storage
  • Message queues

Where Redundancy is Applied:

  1. Web Servers (Software Redundancy)

    • Deploy multiple web server instances behind a load balancer.
    • If one crashes, others continue to serve traffic.
  2. Database (Data Redundancy)

    • Use master-slave replication (e.g., MySQL, PostgreSQL).
    • If the master goes down, the slave can take over (automatic failover).
  3. File Storage (Geographic Redundancy)

    • Store user-uploaded files in cloud storage (e.g., AWS S3), replicated across multiple availability zones (AZs).
  4. Message Queue (Service Redundancy)

    • Kafka cluster with multiple brokers and topic replication.
    • Messages remain safe even if one broker fails.
  5. Network (Network Redundancy)

    • Use multiple NICs, routers, and ISPs to ensure internet connectivity even if one fails.

Failure Scenario

Without Redundancy

  • You have a single database server.
  • It crashes due to disk failure.
  • Entire system goes down. Orders can't be placed.

With Redundancy

  • You have a replicated database setup.
  • Master fails → slave takes over in seconds.
  • System stays online, users continue booking tickets.
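
The failover step can be sketched as a client that promotes the replica once the primary stops responding (the classes below are hypothetical stand-ins for real database connections):

```python
class Database:
    """Toy stand-in for a database connection."""
    def __init__(self, name, healthy=True):
        self.name, self.healthy = name, healthy

    def query(self, sql):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name}: ok"

class FailoverClient:
    """Route queries to the primary; promote the replica if the primary fails."""
    def __init__(self, primary, replica):
        self.primary, self.replica = primary, replica

    def query(self, sql):
        try:
            return self.primary.query(sql)
        except ConnectionError:
            self.primary = self.replica      # promote the replica
            return self.primary.query(sql)   # retry against the new primary

client = FailoverClient(Database("master", healthy=False), Database("replica"))
print(client.query("SELECT 1"))  # → replica: ok — the caller never sees the crash
```

Real failover (e.g., RDS Multi-AZ, Patroni) also handles replication lag and split-brain protection, but the caller-visible behavior is the same: queries keep succeeding.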

Redundancy vs Replication vs Backup

  • Redundancy: Availability during failure (e.g., multiple web servers)
  • Replication: Real-time duplication of data (e.g., MySQL master-slave)
  • Backup: Recovery from long-term data loss (e.g., nightly database snapshots)

Design Considerations

  • Cost: More components mean more expense
  • Consistency: Data consistency must be handled across replicas
  • Failover logic: Must be well-tested and fast
  • Monitoring: Alert if the primary fails and the system switches to backup
  • Testing: Simulate failures regularly (e.g., Chaos Engineering)

Common Redundancy Patterns

  • Active-Active: All instances serve traffic at the same time; any one can absorb another's load
  • Active-Passive: A standby instance stays idle until the primary fails, then takes over
  • N+1: Provision one more instance than the expected load requires, so any single failure leaves enough capacity

Fault Detection

Fault detection is the process of identifying when a component or service in a system is behaving abnormally or failing, so corrective action can be taken.

It is a key first step in building fault-tolerant, highly available, and resilient systems.

Why Is Fault Detection Important?

  • Prevents cascading failures by catching issues early
  • Improves reliability and uptime
  • Enables automated recovery (e.g., auto-restart, failover)
  • Helps alert human operators quickly
  • Ensures service-level objectives (SLOs) are met

Fault Detection vs Fault Tolerance

  • Fault Detection: Identifies faults when they happen
  • Fault Tolerance: Continues operation despite the fault

How Fault Detection Works

Fault detection usually involves observing system behavior and checking if it deviates from the expected state.

Common Techniques:

  • Health Checks: Regularly ping services to check if they’re responsive
  • Heartbeats: Components send periodic signals to confirm they’re alive
  • Timeouts: If a service doesn't respond in time, it may be considered faulty
  • Monitoring & Metrics: Watch CPU, memory, error rates, request latency, etc.
  • Log Analysis: Scan logs for known error patterns
  • Synthetic Tests: Simulate user activity to detect failures in end-to-end flows
  • Alerting Systems: Send notifications (email, Slack, PagerDuty) when faults are detected
  • Anomaly Detection (AI/ML): Model normal behavior and flag deviations to catch subtle failures
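
Heartbeat-based detection, for example, needs very little machinery: each service records when it last reported in, and a monitor flags anything silent longer than a timeout. A minimal sketch (service names hypothetical):

```python
import time

class HeartbeatMonitor:
    """Flag services that have not sent a heartbeat within the timeout."""
    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}

    def beat(self, service, now=None):
        # Each service calls this periodically to prove it is alive
        self.last_seen[service] = now if now is not None else time.time()

    def unhealthy(self, now=None):
        now = now if now is not None else time.time()
        return [s for s, t in self.last_seen.items() if now - t > self.timeout]

monitor = HeartbeatMonitor(timeout_seconds=10)
monitor.beat("payment-service", now=100)
monitor.beat("user-service", now=108)
# At t=115, payment-service has been silent for 15s (> 10s timeout)
print(monitor.unhealthy(now=115))  # → ['payment-service']
```

The `now` parameter makes the sketch testable without real clocks; a production monitor would use wall time and trigger an alert or restart for each unhealthy service.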

Fault Detection in a Web Application

Imagine a ride-sharing app backend with services like:

  • User Service
  • Ride Matching Service
  • Payment Service

All running on Kubernetes.

Fault Detection Methods:

  1. Liveness and Readiness Probes (Kubernetes)

    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 5
    • Kubernetes kills and restarts a container if it fails health checks.
    • Detects fault automatically and triggers self-healing.
  2. Heartbeats Between Services

    • Each service sends heartbeats to a central monitoring service.
    • If no heartbeat is received within X seconds → mark as unhealthy.
  3. Monitoring with Prometheus + Grafana

    • Track:
      • Error rate > 5% → raise alert
      • CPU usage > 90% → potential overload
      • Request latency > 3s → backend lag
  4. Alerting via PagerDuty

    • When a fault is detected (e.g., Payment Service crashes), a notification is sent to an engineer:
PaymentService error rate > 10% for last 5 mins on prod cluster

Health Check

A health check is a mechanism that tests whether a component or service in a system is working correctly. It’s a foundational tool used in modern system design for:

  • Monitoring service health
  • Triggering auto-healing
  • Enabling load balancing
  • Facilitating graceful startup and shutdown

Types of Health Checks

  • Liveness Check: Is the app running (not dead or stuck)? Used to restart it on failure (e.g., Kubernetes)
  • Readiness Check: Is the app ready to serve requests? Used to add/remove the instance from the load balancer
  • Startup Check: Has the app finished initializing? Used to delay liveness and readiness checks until it has

Where Health Checks Are Used

  1. Load Balancers

    • Detect unhealthy nodes and stop sending traffic.
    • Example: AWS ALB/ELB, NGINX, HAProxy
  2. Container Orchestrators (like Kubernetes)

    • Kill and restart failed containers automatically.
    • Only send traffic to "ready" pods.
  3. Service Meshes

    • Decide routing based on service health (e.g., Istio, Linkerd).
  4. Monitoring Systems

    • Collect health status to alert or visualize system state.

Health Check in Kubernetes

Application: Food Delivery Backend API. Let's say you have a Node.js app running in Kubernetes. You want to:

  • Restart it if it crashes (liveness)
  • Wait for DB connection before allowing traffic (readiness)

Sample Express API

app.get("/healthz", (req, res) => {
  if (db.isConnected()) {
    res.status(200).send("OK");
  } else {
    res.status(500).send("DB not connected");
  }
});

Kubernetes YAML

livenessProbe:
  httpGet:
    path: /healthz
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 5

readinessProbe:
  httpGet:
    path: /healthz
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 5

Kubernetes will:

  • Check /healthz every 5 seconds.
  • If 3 consecutive failures → restart the container (liveness).
  • If it fails readiness → stop sending traffic to it.

Characteristics of a Good Health Check Endpoint

  • Fast: Should respond quickly (e.g., <100ms)
  • Non-blocking: Should not interfere with actual processing logic
  • Minimal logic: Avoid full computation — just check essentials (DB, cache, etc.)
  • Correct status codes: 200 OK for healthy, 500/503 for unhealthy
  • Customizable: Should allow adding checks for DB, cache, or service dependencies

What Happens Without Health Checks?

  • A crashed service keeps receiving traffic, causing errors.
  • An initializing service gets hit before it’s ready.
  • Failures go unnoticed until users complain or traffic drops.

Recovery

Recovery is the process of bringing a system or component back to a working state after a failure has occurred.

Why It Matters: No matter how robust a system is, failures are inevitable — recovery ensures minimal downtime and data loss.

Recovery Techniques

  • Automatic Recovery: The system detects failure and recovers without human intervention
  • Manual Recovery: A human restores from backups or fixes configuration
  • Checkpointing: Periodic snapshots allow resuming from the last good state
  • Failover: Switch to a standby system/service (e.g., a secondary DB)
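
Checkpointing can be sketched as periodically persisting progress, so that a restarted job resumes from the last good state instead of starting over (the file path and job structure here are illustrative):

```python
import json
import os
import tempfile

CHECKPOINT = os.path.join(tempfile.gettempdir(), "job_checkpoint.json")
if os.path.exists(CHECKPOINT):
    os.remove(CHECKPOINT)  # start this demo from a clean state

def load_checkpoint():
    try:
        with open(CHECKPOINT) as f:
            return json.load(f)["next_item"]
    except (FileNotFoundError, KeyError, ValueError):
        return 0  # no checkpoint yet: start from the beginning

def save_checkpoint(next_item):
    with open(CHECKPOINT, "w") as f:
        json.dump({"next_item": next_item}, f)

def process(items):
    start = load_checkpoint()
    for i in range(start, len(items)):
        # ... do the real work on items[i] here ...
        save_checkpoint(i + 1)  # persist progress after each item
    return start  # index this run resumed from

process(list(range(5)))               # first run: processes items 0–4
resumed_at = process(list(range(5)))  # simulated restart: nothing left to redo
print("resumed at item:", resumed_at)  # → 5
```

Real systems checkpoint less often (the snapshot itself has a cost) and write atomically, but the recovery idea is the same: resume from the last persisted state, not from zero.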

Example

  • A PostgreSQL DB crashes. AWS RDS triggers automatic failover to a read replica.
  • Users see minimal downtime (a few seconds) and the system continues operating.

System Stability

Stability means a system continues operating within expected performance and behavior limits, even under stress or partial failure.

Why It Matters: Unstable systems may:

  • Crash under load
  • Return incorrect data
  • Exhibit erratic behavior

Stability Strategies

  • Load Shedding: Drop some requests when the system is overwhelmed (return 429 Too Many Requests)
  • Rate Limiting: Control how much traffic the system accepts
  • Backpressure: Tell clients to slow down sending data (common in message queues)
  • Resource Isolation: Separate critical components to avoid a full system crash
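
Rate limiting and load shedding are commonly implemented with a token bucket: each request consumes a token, tokens refill at a fixed rate, and requests that find the bucket empty are shed (e.g., with 429 Too Many Requests). A minimal sketch:

```python
class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity        # max burst size
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = float(capacity)
        self.last = 0.0

    def allow(self, now):
        # Refill in proportion to elapsed time, capped at capacity
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # serve the request
        return False      # shed the request (429 Too Many Requests)

bucket = TokenBucket(capacity=2, refill_rate=1)  # 2-request burst, 1 req/s sustained
results = [bucket.allow(now=0.0) for _ in range(3)]
print(results)  # → [True, True, False]: the third simultaneous request is shed
```

Passing `now` explicitly keeps the sketch deterministic; a real limiter would use a monotonic clock and sit in front of the overloaded service.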

Example:

If your payment service is getting overloaded, you can:

  • Temporarily stop accepting new requests (load shedding)
  • Let current requests finish without crashing the entire system

Timeout

A timeout defines how long a system or client should wait for a response before giving up.

Why It Matters:

Without timeouts:

  • A system might wait forever for a response
  • Threads or resources may be blocked
  • Cascading failures can occur

Usage:

  • Set timeouts for API calls, DB queries, and external services.

import requests
requests.get("https://api.payment.com/pay", timeout=3)  # wait at most 3 seconds

  • If the payment API doesn’t respond within 3 seconds, retry or fail gracefully.

Retries

Retries automatically reattempt failed requests in case of transient errors (e.g., network blips, timeouts).

Why It Matters:

Many failures are temporary — retrying can resolve the issue without user impact.

Retry Best Practices

  • Exponential Backoff: Avoid overwhelming services with aggressive retries
  • Jitter: Add randomness to prevent retry storms
  • Retry Budget: Limit the number of retries to avoid infinite loops
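
These three practices combine into a single delay schedule: exponential growth gives the backoff, randomization gives the jitter, and a fixed attempt count gives the retry budget. A sketch that computes the delays without actually sleeping:

```python
import random

def backoff_delays(base=0.5, factor=2, max_retries=5, max_delay=30, seed=None):
    """Delay (seconds) before each retry: exponential growth, full jitter, capped."""
    rng = random.Random(seed)
    delays = []
    for attempt in range(max_retries):            # retry budget: at most max_retries
        ceiling = min(max_delay, base * factor ** attempt)  # 0.5, 1, 2, 4, 8, ...
        delays.append(rng.uniform(0, ceiling))    # "full jitter" spreads clients out
    return delays

delays = backoff_delays(seed=7)
print([round(d, 2) for d in delays])
# Each delay is random but bounded by its ceiling, so clients that failed
# together don't all retry at the same instant (no retry storm).
```

A real retry loop would `time.sleep(d)` before each attempt and stop early on success.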

Circuit Breaker Pattern

A circuit breaker prevents a system from repeatedly trying to use a failing component, giving it time to recover.

Inspired by electrical circuit breakers, it works in three states:

  • Closed: Requests pass through normally
  • Open: Requests fail immediately (the service is down)
  • Half-Open: Send a few trial requests to check if the service has recovered

Why It Matters:

  • Prevents overwhelming a failing service
  • Helps fail fast, rather than wasting time
  • Essential for graceful degradation

Example of Circuit Breaker

Suppose the Payment Service is failing:

  • After 5 failures in a row → Circuit opens.
  • For the next 30 seconds → All payment calls fail instantly.
  • After 30 seconds → Half-open: 1 test request sent.
    • If it succeeds → Circuit closes.
    • If it fails → Circuit remains open.
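
The states and thresholds above map naturally onto a small class. This is a single-threaded sketch (production libraries such as resilience4j or Hystrix add thread safety, metrics, and configuration):

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, now=None):
        now = now if now is not None else time.time()
        if self.state == "open":
            if now - self.opened_at >= self.recovery_timeout:
                self.state = "half-open"        # allow one trial request
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state, self.opened_at = "open", now
            raise
        self.failures = 0
        self.state = "closed"                   # success closes the circuit
        return result

breaker = CircuitBreaker()

def failing_payment():
    raise ConnectionError("payment service down")

for _ in range(5):                   # 5 failures in a row → circuit opens
    try:
        breaker.call(failing_payment, now=0)
    except ConnectionError:
        pass
print(breaker.state)                 # → open

try:
    breaker.call(failing_payment, now=10)  # within the 30s window: fails fast
except RuntimeError as e:
    print(e)

ok = breaker.call(lambda: "paid", now=40)  # after 30s: half-open trial succeeds
print(breaker.state, ok)                   # → closed paid
```

The explicit `now` argument keeps the walkthrough deterministic; in production the breaker would read the clock itself.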