Key Themes

Designing data-intensive applications involves three primary concerns:

  1. Reliability: The system should continue to work correctly even in the face of adversity
  2. Scalability: The system should handle increased load
  3. Maintainability: The system should be easy to operate and modify

Reliability

Defining Reliability

  • Reliability means a system continues to perform its required function correctly even when things go wrong
  • Faults are not the same as failures
    • A fault is a component of the system that deviates from its spec
    • A failure is when the entire system stops providing the required service

Types of Faults

  1. Hardware Faults
    • Hard disk crashes
    • RAM failures
    • Power outages
    • Solutions include redundancy and failover mechanisms
  2. Software Errors
    • Systematic errors that are often more correlated
    • Can cause cascading failures
    • Examples: software bugs, runaway processes, external dependencies
  3. Human Errors
    • Most common source of reliability issues
    • Strategies to mitigate:
      • Design systems that minimize opportunities for human error
      • Provide clear and easy-to-use interfaces
      • Implement thorough testing
      • Allow quick and easy recovery from human errors
      • Set up detailed monitoring and alerting
      • Implement good management practices and training

Scalability

Defining Scalability

  • Scalability is the ability of a system to handle increased load
  • Load can be measured by various parameters depending on the system (e.g., requests per second, read/write ratio, active users)

Describing Load

  • Use load parameters to quantify the current load
  • Example: Twitter’s scalability challenge of delivering tweets to followers
    • Approach: Fanout writes (pre-compute and cache tweet distributions)
    • Hybrid approach: Cache for most users, compute in real-time for users with many followers

Handling Increased Load

  1. Vertical Scaling (Scaling Up)
    • Adding more power to a single node
    • Limited by hardware capabilities
    • Simple but has significant limitations
  2. Horizontal Scaling (Scaling Out)
    • Adding more machines to distribute the load
    • More complex but often more flexible
    • Requires distributed system design

Performance Metrics

  • Response Time: Time between a client sending a request and receiving a response
  • Latency: Wait time before processing begins
  • Use percentiles (median, 95th, 99th) to understand performance distribution
  • Average (mean) can be misleading due to outliers

Maintainability

Key Principles

  1. Operability
    • Make it easy for operations teams to keep the system running smoothly
    • Provide good monitoring, support for automation, clear documentation
  2. Simplicity
    • Reduce complexity through abstraction
    • Use tools and techniques that make systems easier to understand
    • Avoid unnecessary complexity
  3. Evolvability
    • Make it easy to modify the system in the future
    • Support for extensibility
    • Accommodate changing requirements

Conclusion

  • Data-intensive applications require careful design
  • Focus on reliability, scalability, and maintainability
  • No one-size-fits-all solution; each system needs tailored approaches
  • Continuous learning and adaptation are crucial

Key Takeaways

  • Anticipate and design for potential failures
  • Understand your system’s load characteristics
  • Prioritize simplicity and ease of maintenance
  • Use appropriate tools and techniques for your specific requirements

Reference

https://www.geeksforgeeks.org/transaction-isolation-levels-dbms/