I am going to keep all the notes for the books that I am reading in a blog format. I didn’t like the fact that I kept highlighting books and taking notes but rarely ever went through the concepts again, except for the rare moments when it randomly hit me that I had notes on them. So I decided to put the notes out there. This way the chances of me going through them again are much higher.
So, starting with Designing Data-Intensive Applications by Martin Kleppmann. It’s one of those books that is so rich in information that you need to go through it multiple times to grasp the material. One of my favorites, with lots to learn. Here we go.
These are my chapter 1 notes for the book.
Overview
- Applications today are data-intensive as opposed to compute-intensive.
- Raw CPU power is rarely a limiting factor; the bigger problems are usually the amount of data, the complexity of data, and the speed at which it is changing.
- A data-intensive application is typically built from standard building blocks that provide commonly needed functionality, such as:
- databases
- caches
- search indexes
- stream processing
- batch processing
- There are many database systems with different characteristics because different applications have different requirements
3 concerns in software systems
- Reliability : The system should continue to work correctly even in the face of adversity (hardware or software faults, human error).
- Scalability : As the system grows (in data volume, traffic volume, or complexity) there should be reasonable ways of dealing with the growth.
- Maintainability : The system should be designed so that people can work on it productively and adapt it easily as requirements change.
RELIABILITY
A reliable system continues to work correctly even when things go wrong.
Things that can go wrong are called faults, and systems that anticipate faults and can cope with them are called fault-tolerant or resilient.
Fault vs Failure
→ A fault is defined as one component of the system deviating from its spec.
→ A failure is when the system as a whole stops providing the required service to the user.
It's impossible to reduce the probability of a fault to zero, so it is best to design fault-tolerance mechanisms that prevent faults from causing failures.
- Counter-intuitively, in such fault-tolerant systems it can make sense to increase the rate of faults by triggering them deliberately, e.g. by randomly killing individual processes without warning. Many critical bugs are due to poor error handling.
- By deliberately inducing faults we ensure the fault-tolerance machinery is continually exercised and tested, e.g. Netflix's Chaos Monkey (a toy sketch of the idea is below).
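To make the fault-injection idea concrete, here is a minimal sketch of my own (not Netflix's actual Chaos Monkey): it spawns a few throwaway worker processes, kills one at random, and restarts it, so the recovery path gets exercised on every run. The `spawn_worker` helper is hypothetical.

```python
import random
import signal
import subprocess
import sys
import time

# Toy chaos-monkey sketch: spawn disposable workers, kill one at random,
# then restart it. (Illustration only, not Netflix's actual tool.)

def spawn_worker() -> subprocess.Popen:
    # Each "worker" just sleeps, standing in for a real service process.
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(3600)"])

workers = {i: spawn_worker() for i in range(3)}

for _ in range(3):
    time.sleep(1)
    victim = random.choice(list(workers))
    workers[victim].send_signal(signal.SIGTERM)   # deliberately inject a fault
    workers[victim].wait()
    print(f"worker {victim} died, restarting it")
    workers[victim] = spawn_worker()              # the "fault tolerance": restart it

for p in workers.values():                        # clean up
    p.terminate()
```

In a real system the restart would be handled by a supervisor; the point is simply that killing processes on purpose forces that recovery path to be tested regularly.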
SCALABILITY
Scalability is the term used to describe a system's ability to cope with increased load.
There should be reasonable ways of dealing with growth in data volume, traffic volume, or complexity.
- Even if a system works reliably today, it doesn't mean it will necessarily work reliably in the future. One common reason for degradation is additional load.
- When there is a sudden increase in the load on one component of the application, there have to be proper mechanisms in place to handle the additional load.
Describing Load
- Load can be described with a few numbers which we call load parameters, such as:
    - requests per second to a web server
    - number of reads in a database
    - number of simultaneously active users
- Taking Twitter as an example, two of Twitter's main operations are:
i) Post tweet - a user can publish a new message to their followers (4.6k requests/second on average, 12k requests/second at peak).
ii) Home timeline - User can view tweets posted by people they follow (300k requests/second).
Handling 12k tweets/second would be fairly easy; however, Twitter's main scalability challenge is not tweet volume but fan-out, i.e. each user follows many people and each user is followed by many people.
→ Approach 1: posting a tweet simply inserts it into a global collection of tweets; a user's home timeline is assembled at read time by looking up everyone they follow and merging those users' tweets.
→ Approach 2: maintain a cache of each user's home timeline; posting a tweet looks up all the people who follow that user and inserts the new tweet into each of their home timeline caches.
→ Approach 1 failed to keep up with the load of home timeline queries, so the company switched to approach 2. It works better because the average rate of published tweets is two orders of magnitude lower than the rate of home timeline reads, so it's preferable to do more work at write time and less at read time.
→ The downside of approach 2 is that posting a tweet now requires a lot of extra work. On average a tweet is delivered to about 75 followers, so 4.6k tweets/second become 345k writes/second to the home timeline caches. The catch is the average: not all users have a mere 75 followers; some have over 30 million, so a single one of their tweets results in 30 million writes to different users' home timelines. (A toy sketch of both approaches follows below.)
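Here is a rough sketch of the two approaches using toy in-memory structures (my own illustration, not Twitter's actual implementation; the names `tweets`, `followers`, and `home_timelines` are made up):

```python
from collections import defaultdict

tweets = []                          # global collection of (author, text)
followers = defaultdict(set)         # author -> set of follower ids
home_timelines = defaultdict(list)   # follower -> cached home timeline (approach 2)

def post_tweet_v1(author, text):
    # Approach 1: cheap write; all the work happens at read time.
    tweets.append((author, text))

def home_timeline_v1(user, following):
    # Read-time merge: scan for tweets by everyone the user follows.
    return [t for t in tweets if t[0] in following]

def post_tweet_v2(author, text):
    # Approach 2 (fan-out on write): push the tweet into every
    # follower's home timeline cache at write time.
    tweets.append((author, text))
    for follower in followers[author]:
        home_timelines[follower].append((author, text))

def home_timeline_v2(user):
    # Read becomes a simple cache lookup.
    return home_timelines[user]

followers["alice"] = {"bob", "carol"}
post_tweet_v2("alice", "hello")
print(home_timeline_v2("bob"))       # [('alice', 'hello')]
```

The celebrity problem is visible right in `post_tweet_v2`: for a user with 30 million followers, that single loop becomes 30 million cache inserts for one tweet.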
Describing Performance
Now we investigate what happens when load increases
- How is the performance of our system affected when we increase the load parameter and keep the system resources unchanged?
- When we increase the load parameter, how much do we need to increase the resources if we want to keep performance unchanged?
In a batch processing system such as Hadoop, we usually care about throughput, i.e. the number of records we can process per second, or the total time it takes to run a job on a dataset of a certain size.
In online systems, what usually matters more is the service's response time, i.e. the time between a client sending a request and receiving a response.
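A small sketch of the distinction, assuming a stand-in `handle_request` function (hypothetical) in place of a real service call:

```python
import statistics
import time

def handle_request():
    time.sleep(0.01)   # stand-in for real work

durations = []
for _ in range(100):
    start = time.perf_counter()
    handle_request()
    durations.append(time.perf_counter() - start)

# Response time: how long each individual request took (summarized by the median).
print(f"median response time: {statistics.median(durations) * 1000:.1f} ms")
# Throughput: how many requests got through per second (valid here because
# the requests ran back to back in a single loop).
print(f"throughput: {len(durations) / sum(durations):.0f} requests/s")
```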
Approaches to coping with load
How do we maintain good performance even when our load parameters increase by some amount?
- Scaling up - vertical scaling, i.e. moving to a more powerful machine
- Scaling out - horizontal scaling, i.e. distributing the load across multiple smaller machines
- Elastic systems - automatically add computing resources when a load increase is detected. Quite useful if load is unpredictable (toy sketch below).
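A toy sketch of the elastic idea, assuming hypothetical `current_load()` and capacity numbers (a real setup would read live metrics and call a cloud provider's autoscaling API instead):

```python
import math

TARGET_LOAD_PER_MACHINE = 100   # assumed capacity: requests/second per machine
machines = 2

def current_load() -> int:
    return 350                  # stand-in for a real metric (requests/second)

def rebalance() -> None:
    # Scale out or in so that capacity roughly matches the observed load.
    global machines
    needed = max(1, math.ceil(current_load() / TARGET_LOAD_PER_MACHINE))
    if needed != machines:
        print(f"scaling from {machines} to {needed} machines")
        machines = needed       # in reality: provision or release actual machines here

rebalance()                     # -> scaling from 2 to 4 machines
```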
MAINTAINABILITY
The majority of the cost of software is not in its initial development but in its ongoing maintenance.
We should design software in such a way that it will hopefully minimize pain during maintenance.
3 design principles for software systems regarding maintenance:
- Operability : Make it easy for operations teams to keep the system running smoothly
- Simplicity : Make it easy for new engineers to understand the system
- Evolvability : Make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change