Key Themes
Designing data-intensive applications involves three primary concerns:
- Reliability: The system should continue to work correctly even in the face of adversity
- Scalability: The system should handle increased load
- Maintainability: The system should be easy to operate and modify
Reliability
Defining Reliability
- Reliability means a system continues to perform its required function correctly even when things go wrong
- Faults are not the same as failures
- A fault is a component of the system that deviates from its spec
- A failure is when the entire system stops providing the required service
Types of Faults
- Hardware Faults
- Hard disk crashes
- RAM failures
- Power outages
- Solutions include redundancy and failover mechanisms
- Software Errors
- Systematic errors that are often more correlated
- Can cause cascading failures
- Examples: software bugs, runaway processes, external dependencies
- Human Errors
- Most common source of reliability issues
- Strategies to mitigate:
- Design systems that minimize opportunities for human error
- Provide clear and easy-to-use interfaces
- Implement thorough testing
- Allow quick and easy recovery from human errors
- Set up detailed monitoring and alerting
- Implement good management practices and training
Scalability
Defining Scalability
- Scalability is the ability of a system to handle increased load
- Load can be measured by various parameters depending on the system (e.g., requests per second, read/write ratio, active users)
Describing Load
- Use load parameters to quantify the current load
- Example: Twitter’s scalability challenge of delivering tweets to followers
- Approach: Fanout writes (pre-compute and cache tweet distributions)
- Hybrid approach: Cache for most users, compute in real-time for users with many followers
Handling Increased Load
- Vertical Scaling (Scaling Up)
- Adding more power to a single node
- Limited by hardware capabilities
- Simple but has significant limitations
- Horizontal Scaling (Scaling Out)
- Adding more machines to distribute the load
- More complex but often more flexible
- Requires distributed system design
Performance Metrics
- Response Time: Time between a client sending a request and receiving a response
- Latency: Wait time before processing begins
- Use percentiles (median, 95th, 99th) to understand performance distribution
- Average (mean) can be misleading due to outliers
Maintainability
Key Principles
- Operability
- Make it easy for operations teams to keep the system running smoothly
- Provide good monitoring, support for automation, clear documentation
- Simplicity
- Reduce complexity through abstraction
- Use tools and techniques that make systems easier to understand
- Avoid unnecessary complexity
- Evolvability
- Make it easy to modify the system in the future
- Support for extensibility
- Accommodate changing requirements
Conclusion
- Data-intensive applications require careful design
- Focus on reliability, scalability, and maintainability
- No one-size-fits-all solution; each system needs tailored approaches
- Continuous learning and adaptation are crucial
Key Takeaways
- Anticipate and design for potential failures
- Understand your system’s load characteristics
- Prioritize simplicity and ease of maintenance
- Use appropriate tools and techniques for your specific requirements
Reference
https://www.geeksforgeeks.org/transaction-isolation-levels-dbms/