Reliability
Reliability in system design refers to the ability of a system to consistently perform its intended function correctly, even in the presence of faults or failures. It means the system should be fault-tolerant, resilient, and able to recover gracefully from errors.
- Availability: The system is up and accessible when needed.
- Fault tolerance: The system continues operating even if some components fail.
- Durability: Once data is written, it remains available without loss.
- Consistency: The system behaves in an expected way under all circumstances.
Goals of a Reliable System
- Minimize Downtime – Keep services running with minimal interruptions.
- Fail Gracefully – If a part fails, it doesn’t crash the whole system.
- Quick Recovery – Recover fast from outages or failures.
- Prevent Data Loss – Data must not be lost during or after failures.
- Predictable Behavior – The system works the same way every time.
Techniques to Improve Reliability
| Technique | Description |
|---|---|
| Redundancy | Duplicate components (servers, DBs) to handle failures |
| Load Balancing | Distribute traffic across multiple instances |
| Failover Mechanisms | Automatically switch to backup systems if the primary one fails |
| Replication | Copy data across multiple nodes for high availability and durability |
| Health Checks | Monitor services and restart failed ones |
| Retries with Backoff | Retry failed requests after increasing intervals |
| Graceful Degradation | The system reduces functionality instead of failing completely |
| Monitoring & Alerting | Detect and respond to issues quickly |
| Backups & Snapshots | Recover from disasters or data corruption |
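To make graceful degradation from the table concrete, here is a minimal Python sketch: if a (hypothetical) recommendation service fails, the page falls back to a static best-seller list instead of returning an error. The service names and fallback list are invented for the example.

```python
# Graceful degradation sketch: serve a safe fallback instead of failing.

def fetch_recommendations(user_id, recommender):
    """Return personalized recommendations, or a static fallback."""
    try:
        return recommender(user_id)
    except Exception:
        # Degrade gracefully: a generic list is better than an error page.
        return ["best-seller-1", "best-seller-2", "best-seller-3"]

def broken_recommender(user_id):
    # Simulates the recommendation service being down.
    raise ConnectionError("recommendation service unavailable")

print(fetch_recommendations(42, broken_recommender))  # falls back
```

The key point is that the failure of a non-critical dependency is contained: the caller still gets a usable response.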
Example of Reliable E-commerce System
Imagine you're designing an e-commerce platform like Amazon. Reliability is crucial—downtime means lost sales and customers.
Components
- Frontend Web Servers
- Backend Services (like Product Service, Payment Service)
- Database (e.g., MySQL/PostgreSQL)
- Cache Layer (e.g., Redis)
- Message Queue (e.g., RabbitMQ/Kafka)
How to Design for Reliability
- Web Servers:
  - Use a load balancer (like Nginx or AWS ELB) to distribute traffic.
  - Auto-scale using Kubernetes or AWS Auto Scaling.
- Backend Services:
  - Deploy multiple instances (e.g., 3 replicas of Payment Service).
  - Implement circuit breakers to avoid cascading failures.
- Database:
  - Use master-slave replication or multi-master.
  - Enable automated backups and point-in-time recovery.
- Caching Layer:
  - Use Redis in cluster mode with failover support.
- Message Queue:
  - Use Kafka with replication and acknowledgments to prevent message loss.
  - Retry failed messages with dead-letter queues.
- Monitoring & Alerts:
  - Tools like Prometheus + Grafana, Datadog, or ELK Stack.
  - Set alerts on latency, error rates, and service health.
Failure
In system design, failure refers to any event where a system, component, or service stops working as expected. This could mean downtime, data loss, performance degradation, or incorrect results — essentially, anything that breaks the system’s reliability, availability, or correctness.
Types of Failure
| Type | Description |
|---|---|
| Hardware failure | Disk crash, memory error, power outage |
| Software failure | Bugs, crashes, memory leaks, unhandled exceptions |
| Network failure | Packet loss, timeout, DNS issues, latency spikes |
| Database failure | Corruption, connection timeouts, replication lag |
| Dependency failure | External services (APIs, payment gateways) go down |
| Human error | Misconfigurations, accidental deletions, bad deployments |
| Security failure | Breach, DDoS attacks, unauthorized access |
Failure vs Fault
| Term | Definition |
|---|---|
| Fault | A bug or issue in the system (e.g., bad code, misconfiguration) |
| Failure | When a fault manifests and the system behaves incorrectly or crashes |
Examples of Failures
1. Database Failure
Scenario: Your e-commerce app uses a single MySQL instance. One day, the database crashes due to disk failure. Impact: Entire website becomes read-only or completely non-functional.
2. Service Dependency Failure
Your service uses an external payment gateway. That API becomes unavailable. Impact: Users can’t complete checkouts, causing loss in revenue.
3. Network Partitioning (Split-Brain)
Two services hosted in separate regions lose connection due to a network partition. Impact: They operate independently, potentially causing conflicting data writes.
4. Deployment Failure
A new release has a bug that crashes the user authentication service. Impact: Users cannot log in.
How to Handle Failures (Design Principles)
| Technique | Description |
|---|---|
| Redundancy | Use backup servers/services in case of failure |
| Failover | Automatically switch to standby systems |
| Circuit Breaker Pattern | Temporarily stop requests to a failing service |
| Retries with Backoff | Retry failed requests, but not too aggressively |
| Timeouts | Prevent hanging requests due to slow/failing services |
| Bulkheads | Isolate failures to small parts of the system |
| Graceful Degradation | Reduce functionality instead of total failure |
| Monitoring and Alerts | Detect and respond quickly to failures |
| Disaster Recovery | Plan and test for catastrophic failures |
Example of Netflix
Netflix must be resilient to failures since it serves millions globally.
Failure Scenario:
A single data center goes down due to a power issue.
Design Solutions Netflix uses:
- Multi-region deployments: Traffic rerouted to healthy regions.
- Chaos Monkey (part of the Simian Army): Intentionally kills services in production to test resilience.
- Retry and backoff: If a microservice fails, the caller retries with increasing intervals.
- Circuit breakers: If a service fails continuously, calls are stopped temporarily.
Result: End users likely don’t notice the issue.
Failure Detection & Recovery
| Step | Description |
|---|---|
| Detection | Use health checks, metrics, heartbeat signals |
| Isolation | Use containerization or process boundaries to localize failure |
| Recovery | Auto-restart, failover, database restore, or fallback services |
Availability
Availability in system design refers to how accessible and operational a system is at any given time. It answers the question: “Can users access the system when they need to?”
A highly available system is designed to ensure minimal downtime and maximum uptime, even in the face of failures or heavy load.
Availability = (Uptime) / (Uptime + Downtime)
For example, a service that is available 99.9% of the time may be down for about 44 minutes per month.
| Availability % | Downtime per Year | Downtime per Month |
|---|---|---|
| 99% | ~3.65 days | ~7.2 hours |
| 99.9% | ~8.8 hours | ~43.8 minutes |
| 99.99% | ~52.6 minutes | ~4.4 minutes |
| 99.999% | ~5.26 minutes | ~26 seconds |
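The availability formula above can be turned into a quick downtime calculator. The numbers reproduce the table, assuming an average month of 43,800 minutes (365 days / 12):

```python
# Convert an availability percentage into allowed downtime.

def downtime_minutes(availability_pct, period_minutes):
    """Minutes of permitted downtime in a period at a given availability."""
    return period_minutes * (1 - availability_pct / 100)

MONTH = 365 * 24 * 60 / 12   # 43,800 minutes (average month)
YEAR = 365 * 24 * 60         # 525,600 minutes

print(round(downtime_minutes(99.9, MONTH), 1))   # 43.8 minutes/month
print(round(downtime_minutes(99.99, YEAR), 1))   # 52.6 minutes/year
```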
Key Concepts for High Availability
| Concept | Explanation |
|---|---|
| Redundancy | Duplicate components to take over if one fails |
| Failover | Switch automatically to a backup system or server |
| Load Balancing | Distribute incoming traffic to multiple servers |
| Health Checks | Constant monitoring to ensure services are running properly |
| Stateless Architecture | Servers don't store session data; allows any instance to handle requests |
| Scalability | Handle traffic surges without going down |
| Monitoring & Alerts | Get notified instantly of any service degradation |
| Disaster Recovery | Plans to recover from major failures (like region-wide outages) |
Techniques to Improve Availability
| Technique | How It Helps |
|---|---|
| Multi-Zone Deployment | Deploy services across availability zones or regions |
| Auto Scaling | Add/remove servers automatically to maintain service health |
| Caching | Reduce load on databases, enabling faster and more resilient performance |
| Database Replication | Keep secondary DBs ready for failover |
| Circuit Breaker Pattern | Avoid cascading failures |
Example of Highly Available Web App
Let’s say you are designing a food delivery app backend like Uber Eats.
Goal: 99.99% Availability
Components:
- Frontend Web Servers
- Backend Microservices (e.g., Order, Payment, Notification)
- Database (e.g., PostgreSQL)
- Cache (Redis)
- Message Queue (e.g., Kafka)
- Load Balancer (e.g., NGINX, AWS ALB)
High Availability Design:
- Load Balancers:
  - Distribute incoming traffic to multiple web servers.
  - If one server goes down, traffic is rerouted.
- Web and Backend Services:
  - Deployed in multiple availability zones (AZs).
  - Use auto-scaling groups in AWS or Kubernetes deployments with multiple replicas.
- Database:
  - Master-slave replication.
  - Automatic failover to replicas using tools like Amazon RDS Multi-AZ or Patroni.
- Caching Layer:
  - Redis Sentinel or Redis Cluster for high availability and failover.
- Message Queue:
  - Kafka or RabbitMQ with replication to prevent message loss if a broker fails.
- Monitoring:
  - Use Prometheus + Grafana or Datadog.
  - Trigger alerts if response times spike or instances fail.
- Disaster Recovery Plan:
  - Nightly backups.
  - Infrastructure templates (e.g., Terraform) to re-deploy in a new region quickly.
Availability vs Reliability
| Term | Focus | Example |
|---|---|---|
| Availability | System is accessible and working | Website loads when the user visits it |
| Reliability | System is functionally correct over time | The correct product is shown when searched |
Example Availability in AWS
Imagine hosting your app on AWS:
- EC2 instances in 3 AZs behind an Elastic Load Balancer
- RDS PostgreSQL in Multi-AZ mode
- S3 for asset storage (which is designed for 99.999999999% durability and high availability)
- CloudFront CDN to distribute content globally
This setup ensures that even if one data center (AZ) goes down, your app remains available.
Fault Tolerance
Fault tolerance means the system can withstand failures without affecting the overall functionality.
Failures are inevitable in any large-scale system—due to hardware crashes, software bugs, network issues, or human errors. A fault-tolerant system anticipates these failures and is designed to isolate, absorb, or recover from them.
Fault vs Failure vs Fault Tolerance
| Term | Meaning |
|---|---|
| Fault | A defect or abnormal condition (e.g., a bug, hardware failure) |
| Failure | When a fault causes a system or component to behave incorrectly |
| Fault Tolerance | The system’s ability to continue functioning despite the fault |
Characteristics of Fault-Tolerant Systems
| Feature | Description |
|---|---|
| Redundancy | Having backup components that can take over in case of failure |
| Failover | Automatic switching to a backup system/component |
| Replication | Duplication of data/services across multiple nodes |
| Isolation | Contain failures to prevent cascading breakdowns |
| Graceful Degradation | If part of the system fails, reduce functionality instead of full crash |
| Monitoring & Alerts | Detect faults early and respond proactively |
Netflix's Fault Tolerance
Netflix is a great real-world example of extreme fault tolerance:
- Runs across multiple AWS regions.
- Uses Chaos Monkey, a tool that intentionally kills random servers to test fault tolerance.
- Relies on microservices with:
- Redundancy
- Circuit breakers
- Service discovery
- Retry and timeout logic
- Eventual consistency in distributed systems

Even if one part of Netflix breaks (e.g., recommendations), the core video streaming still works—thanks to graceful degradation.
Fault Tolerance in Distributed Systems
Distributed systems (e.g., cloud apps, microservices) face unique challenges:
| Fault Type | Strategy to Handle |
|---|---|
| Node failure | Use replication and automatic failover |
| Network partition | Apply CAP theorem trade-offs (choose between consistency and availability) |
| Service crash | Restart with health checks (e.g., Kubernetes liveness probes) |
| Disk failure | Use RAID, cloud storage, and backups |
| Corrupted data | Use checksums and versioned backups |
Redundancy
Redundancy in system design refers to having multiple components or resources that serve the same purpose, so that if one fails, the others can take over. It is a core technique for achieving fault tolerance, high availability, and reliability in modern systems.
Why is Redundancy Important?
- Prevents single points of failure (SPOF)
- Improves system availability and reliability
- Enables load balancing and failover
- Supports disaster recovery plans
Types of Redundancy
| Type | Description |
|---|---|
| Hardware Redundancy | Multiple physical devices (e.g., servers, disks, power supplies) |
| Software Redundancy | Multiple software instances performing the same function |
| Network Redundancy | Multiple network paths, routers, ISPs |
| Data Redundancy | Storing copies of data in different places (e.g., replication, backups) |
| Geographic Redundancy | Deploying across multiple data centers or regions |
| Service Redundancy | Redundant APIs or microservices doing the same job |
Redundancy in a Web Application
Imagine you're building an online ticket booking system. Here’s how redundancy is applied:
Goal: Ensure 99.99% availability and zero data loss
Components:
- Web servers
- Backend services (Booking, Payment, Notification)
- Databases
- File storage
- Message queues
Where Redundancy is Applied:
- Web Servers (Software Redundancy)
  - Deploy multiple web server instances behind a load balancer.
  - If one crashes, others continue to serve traffic.
- Database (Data Redundancy)
  - Use master-slave replication (e.g., MySQL, PostgreSQL).
  - If the master goes down, the slave can take over (automatic failover).
- File Storage (Geographic Redundancy)
  - Store user-uploaded files in cloud storage (e.g., AWS S3), replicated across multiple availability zones (AZs).
- Message Queue (Service Redundancy)
  - Kafka cluster with multiple brokers and topic replication.
  - Messages remain safe even if one broker fails.
- Network (Network Redundancy)
  - Use multiple NICs, routers, and ISPs to ensure internet connectivity even if one fails.
Failure Scenario
Without Redundancy
- You have a single database server.
- It crashes due to disk failure.
- Entire system goes down. Orders can't be placed.
With Redundancy
- You have a replicated database setup.
- Master fails → slave takes over in seconds.
- System stays online, users continue booking tickets.
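The failover scenario above can be sketched in a few lines of Python. `FailoverDB` and the node functions are illustrative stand-ins for real database clients:

```python
# Failover sketch: try the primary first, switch to the replica on failure.

class FailoverDB:
    def __init__(self, primary, replica):
        self.nodes = [primary, replica]   # ordered: primary first

    def query(self, sql):
        last_error = None
        for node in self.nodes:
            try:
                return node(sql)
            except ConnectionError as e:
                last_error = e            # node is down; try the next one
        raise last_error                  # every node failed

def dead_primary(sql):
    raise ConnectionError("primary is down")

def healthy_replica(sql):
    return f"result of {sql!r} from replica"

db = FailoverDB(dead_primary, healthy_replica)
print(db.query("SELECT 1"))  # served by the replica
```

Real failover systems add health checks and promotion logic, but the core idea is the same: the caller never sees the primary's failure.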
Redundancy vs Replication vs Backup
| Concept | Purpose | Example |
|---|---|---|
| Redundancy | Availability during failure | Multiple web servers |
| Replication | Real-time duplication of data | MySQL master-slave |
| Backup | Recovery from long-term data loss | Nightly database snapshots |
Design Considerations
| Consideration | Why It Matters |
|---|---|
| Cost | More components = more expense |
| Consistency | Need to handle data consistency across replicas |
| Failover logic | Must be well-tested and fast |
| Monitoring | Alert if primary fails and system switches to backup |
| Testing | Simulate failures regularly (e.g., Chaos Engineering) |
Common Redundancy Patterns
| Pattern | Description |
|---|---|
| Active-Active | All redundant nodes serve traffic simultaneously; load is shared |
| Active-Passive | A standby node takes over only when the primary fails |
| N+1 | One spare component is provisioned for every N active components |
Fault Detection
Fault detection is the process of identifying when a component or service in a system is behaving abnormally or failing, so corrective action can be taken.
It is a key first step in building fault-tolerant, highly available, and resilient systems.
Why Is Fault Detection Important?
- Prevents cascading failures by catching issues early
- Improves reliability and uptime
- Enables automated recovery (e.g., auto-restart, failover)
- Helps alert human operators quickly
- Ensures service-level objectives (SLOs) are met
Fault Detection vs Fault Tolerance
| Concept | Purpose |
|---|---|
| Fault Detection | Identifies faults when they happen |
| Fault Tolerance | Continues operation despite the fault |
How Fault Detection Works
Fault detection usually involves observing system behavior and checking if it deviates from the expected state.
Common Techniques:
| Technique | Description |
|---|---|
| Health Checks | Regularly ping services to check if they’re responsive |
| Heartbeats | Components send periodic signals to confirm they’re alive |
| Timeouts | If a service doesn't respond in time, it may be considered faulty |
| Monitoring & Metrics | Use tools to watch CPU, memory, error rates, request latency, etc. |
| Log Analysis | Scan logs for known error patterns |
| Synthetic Tests | Simulate user activity to detect failures in end-to-end flows |
| Alerting Systems | Send notifications (email, Slack, PagerDuty) when faults are detected |
| Anomaly Detection (AI/ML) | Detect subtle failures by modeling normal behavior and finding deviations |
Fault Detection in a Web Application
Imagine a ride-sharing app backend with services like:
- User Service
- Ride Matching Service
- Payment Service
All running on Kubernetes.
Fault Detection Methods:
- Liveness and Readiness Probes (Kubernetes)

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
```

  - Kubernetes kills and restarts a container if it fails health checks.
  - Detects the fault automatically and triggers self-healing.
- Heartbeats Between Services
  - Each service sends heartbeats to a central monitoring service.
  - If no heartbeat is received within X seconds → mark as unhealthy.
- Monitoring with Prometheus + Grafana
  - Track:
    - Error rate > 5% → raise alert
    - CPU usage > 90% → potential overload
    - Request latency > 3s → backend lag
- Alerting via PagerDuty
  - When a fault is detected (e.g., Payment Service crashes), a notification is sent to an engineer:
    `PaymentService error rate > 10% for last 5 mins on prod cluster`
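The heartbeat technique above can be sketched as a small monitor that marks a service unhealthy once no heartbeat has arrived within the timeout window. Timestamps are injected here so the example runs instantly:

```python
import time

# Heartbeat-based fault detection sketch.

class HeartbeatMonitor:
    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}               # service -> last heartbeat time

    def heartbeat(self, service, now=None):
        self.last_seen[service] = now if now is not None else time.time()

    def is_healthy(self, service, now=None):
        now = now if now is not None else time.time()
        seen = self.last_seen.get(service)
        # Unknown services, or services silent past the timeout, are unhealthy.
        return seen is not None and (now - seen) <= self.timeout

monitor = HeartbeatMonitor(timeout_seconds=10)
monitor.heartbeat("payment-service", now=100)
print(monitor.is_healthy("payment-service", now=105))  # True: 5s ago
print(monitor.is_healthy("payment-service", now=120))  # False: 20s silence
```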
Health Check
A health check is a mechanism that tests whether a component or service in a system is working correctly. It’s a foundational tool used in modern system design for:
- Monitoring service health
- Triggering auto-healing
- Enabling load balancing
- Facilitating graceful startup and shutdown
Types of Health Checks
| Type | Purpose | Typical Action |
|---|---|---|
| Liveness Check | Checks if the app is running (not dead or stuck) | Restart if it fails (e.g., Kubernetes) |
| Readiness Check | Checks if the app is ready to serve requests | Add/remove instance from load balancer |
| Startup Check | Checks if the app has finished initializing | Delay liveness and readiness until app is ready |
Where Health Checks Are Used
- Load Balancers
  - Detect unhealthy nodes and stop sending traffic.
  - Example: AWS ALB/ELB, NGINX, HAProxy
- Container Orchestrators (like Kubernetes)
  - Kill and restart failed containers automatically.
  - Only send traffic to "ready" pods.
- Service Meshes
  - Decide routing based on service health (e.g., Istio, Linkerd).
- Monitoring Systems
  - Collect health status to alert or visualize system state.
Health Check in Kubernetes
Application: Food Delivery Backend API. Let's say you have a Node.js app running in Kubernetes. You want to:
- Restart it if it crashes (liveness)
- Wait for DB connection before allowing traffic (readiness)
Sample Express API
```javascript
app.get("/healthz", (req, res) => {
  if (db.isConnected()) {
    res.status(200).send("OK");
  } else {
    res.status(500).send("DB not connected");
  }
});
```
Kubernetes YAML
```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /healthz
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 5
```
Kubernetes will:
- Check `/healthz` every 5 seconds.
- If 3 consecutive failures → restart the container (liveness).
- If it fails readiness → stop sending traffic to it.
Characteristics of a Good Health Check Endpoint
| Characteristic | Why It Matters |
|---|---|
| Fast | Should respond quickly (e.g., <100ms) |
| Non-blocking | Should not interfere with actual processing logic |
| Minimal logic | Avoid full computation — just check essentials (DB, cache, etc.) |
| Returns correct status codes | 200 OK for healthy, 500/503 for unhealthy |
| Customizable | Should allow adding logic like DB/caching/service dependency checks |
What Happens Without Health Checks?
- A crashed service keeps receiving traffic, causing errors.
- An initializing service gets hit before it’s ready.
- Failures go unnoticed until users complain or traffic drops.
Recovery
Recovery is the process of bringing a system or component back to a working state after a failure has occurred.
Why It Matters: No matter how robust a system is, failures are inevitable — recovery ensures minimal downtime and data loss.
Recovery Techniques
| Type | Description |
|---|---|
| Automatic Recovery | System detects failure and recovers without human intervention |
| Manual Recovery | Requires human to restore from backups or fix configuration |
| Checkpointing | Periodic snapshots to resume from last good state |
| Failover | Switch to standby system/service (e.g., secondary DB) |
Example
- A PostgreSQL DB crashes. AWS RDS triggers automatic failover to a read replica.
- Users see minimal downtime (a few seconds) and the system continues operating.
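Checkpointing, listed in the table above, can be sketched as follows: progress is persisted after each record, so a restart resumes from the last good position rather than from scratch. The file format and the failure injection are purely illustrative:

```python
import json
import os
import tempfile

# Checkpointing sketch: resume from the last saved position after a crash.

def process(records, checkpoint_path, fail_at=None):
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]    # resume point
    done = []
    for i in range(start, len(records)):
        if fail_at == i:
            raise RuntimeError("simulated crash")
        done.append(records[i])
        with open(checkpoint_path, "w") as f:     # persist progress
            json.dump({"next_index": i + 1}, f)
    return done

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
try:
    process(["a", "b", "c", "d"], path, fail_at=2)   # crashes after "b"
except RuntimeError:
    pass
print(process(["a", "b", "c", "d"], path))  # resumes with ['c', 'd']
```

Real systems checkpoint less often (per batch, per time interval) to balance recovery speed against write overhead.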
System Stability
Stability means a system continues operating within expected performance and behavior limits, even under stress or partial failure.
Why It Matters: Unstable systems may:
- Crash under load
- Return incorrect data
- Exhibit erratic behavior
Stability Strategies
| Strategy | Description |
|---|---|
| Load Shedding | Drop some requests when system is overwhelmed (return 429 Too Many Requests) |
| Rate Limiting | Control how much traffic a system accepts |
| Backpressure | Tell clients to slow down sending data (common in message queues) |
| Resource Isolation | Separate critical components to avoid full system crash |
Example:
- If your payment service is getting overloaded, you:
- Temporarily stop accepting new requests (load shedding)
- Let current requests finish without crashing the entire system
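Load shedding and rate limiting are often implemented with a token bucket. Below is a minimal sketch; the capacity and refill rate are arbitrary demo values, and when `allow` returns `False` the service would respond with 429 Too Many Requests:

```python
# Token-bucket sketch for load shedding / rate limiting.

class TokenBucket:
    def __init__(self, capacity, refill_per_second):
        self.capacity = capacity
        self.tokens = capacity
        self.refill = refill_per_second
        self.last = 0.0

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True          # request accepted
        return False             # shed load: respond 429 Too Many Requests

bucket = TokenBucket(capacity=2, refill_per_second=1)
print([bucket.allow(now=0.0) for _ in range(3)])  # [True, True, False]
```

Rejecting the third burst request keeps the service within its capacity instead of letting queued work degrade every request.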
Timeout
A timeout defines how long a system or client should wait for a response before giving up.
Why It Matters:
Without timeouts:
- A system might wait forever for a response
- Threads or resources may be blocked
- Cascading failures can occur
Usage:
- Set timeouts for API calls, DB queries, external services
```python
import requests

# Wait at most 3 seconds for the payment API to respond
requests.get("https://api.payment.com/pay", timeout=3)
```
- If the payment API doesn’t respond within 3 seconds, retry or fail gracefully.
Retries
Retries automatically reattempt failed requests in case of transient errors (e.g., network blips, timeouts).
Why It Matters:
Many failures are temporary — retrying can resolve the issue without user impact.
Retry Best Practices
| Practice | Why It Helps |
|---|---|
| Exponential Backoff | Avoid overwhelming services with aggressive retries |
| Jitter | Add randomness to prevent retry storms |
| Retry Budget | Limit number of retries to avoid infinite loops |
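Putting the three practices together, a minimal retry helper with exponential backoff and jitter might look like this. The `sleep` function is injected so the example runs instantly; real code would pass `time.sleep`:

```python
import random

# Retry sketch with exponential backoff, jitter, and a retry budget.

def retry(operation, max_attempts=4, base_delay=0.5, sleep=lambda s: None):
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise                                   # budget exhausted
            delay = base_delay * (2 ** attempt)         # 0.5s, 1s, 2s, ...
            delay += random.uniform(0, base_delay)      # jitter: spread retries
            sleep(delay)

calls = {"n": 0}
def flaky():
    # Fails twice, then succeeds: a typical transient error.
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient error")
    return "ok"

print(retry(flaky))  # succeeds on the third attempt
```

Only retry idempotent operations this way; retrying a non-idempotent call (e.g., a payment) needs deduplication on the server side.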
Circuit Breaker Pattern
A circuit breaker prevents a system from repeatedly trying to use a failing component, giving it time to recover.
Inspired by electrical circuit breakers, it works in three states:
| State | Description |
|---|---|
| Closed | Requests pass through normally |
| Open | Requests fail immediately (service is down) |
| Half-Open | Send a few trial requests to check if service has recovered |
Why It Matters:
- Prevents overwhelming a failing service
- Helps fail fast, rather than wasting time
- Essential for graceful degradation
Example of Circuit Breaker
Suppose the Payment Service is failing:
- After 5 failures in a row → Circuit opens.
- For the next 30 seconds → All payment calls fail instantly.
- After 30 seconds → Half-open: 1 test request sent.
- If it succeeds → Circuit closes.
- If it fails → Circuit remains open.
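The state machine above can be sketched as a small class. The thresholds mirror the example (a failure threshold and a recovery timeout), and the clock is injectable so the open-to-half-open transition can be demonstrated without real waiting:

```python
import time

# Three-state circuit breaker sketch: closed -> open -> half-open.

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30, clock=time.time):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.failures = 0
        self.state = "closed"
        self.opened_at = None

    def call(self, operation):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.recovery_timeout:
                self.state = "half-open"        # let one trial request through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"             # trip the breaker
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.state = "closed"                   # success: reset
        return result
```

Production implementations (e.g., resilience4j, Polly) add sliding windows and per-endpoint state, but the state transitions are the same as in this sketch.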