DDIA Chapter 1: Reliable, Scalable, and Maintainable Applications

Key Themes

Designing data-intensive applications involves three primary concerns:

Reliability: The system should continue to work correctly even in the face of adversity
Scalability: The system should handle increased load
Maintainability: The system should be easy to operate and modify

Reliability means a system continues to perform its required function correctly even when things go wrong
Faults are not the same as failures
- A fault is a component of the system that deviates from its spec
- A failure is when the entire system stops providing the required service

Hardware Faults
- Hard disk crashes
- RAM failures
- Power outages
- Solutions include redundancy and failover mechanisms
Software Errors
- Systematic errors that are often more correlated
- Can cause cascading failures
- Examples: software bugs, runaway processes, external dependencies
Human Errors
- Most common source of reliability issues
- Strategies to mitigate:
  - Design systems that minimize opportunities for human error
  - Provide clear and easy-to-use interfaces
  - Implement thorough testing
  - Allow quick and easy recovery from human errors
  - Set up detailed monitoring and alerting
  - Implement good management practices and training

Scalability is the ability of a system to handle increased load
Load can be measured by various parameters depending on the system (e.g., requests per second, read/write ratio, active users)

Use load parameters to quantify the current load
Example: Twitter’s scalability challenge of delivering tweets to followers
- Approach: Fanout writes (pre-compute and cache tweet distributions)
- Hybrid approach: Cache for most users, compute in real-time for users with many followers

Vertical Scaling (Scaling Up)
- Adding more power to a single node
- Limited by hardware capabilities
- Simple but has significant limitations
Horizontal Scaling (Scaling Out)
- Adding more machines to distribute the load
- More complex but often more flexible
- Requires distributed system design

Response Time: Time between a client sending a request and receiving a response
Latency: Wait time before processing begins
Use percentiles (median, 95th, 99th) to understand performance distribution
Average (mean) can be misleading due to outliers

Operability
- Make it easy for operations teams to keep the system running smoothly
- Provide good monitoring, support for automation, clear documentation
Simplicity
- Reduce complexity through abstraction
- Use tools and techniques that make systems easier to understand
- Avoid unnecessary complexity
Evolvability
- Make it easy to modify the system in the future
- Support for extensibility
- Accommodate changing requirements