Chapter 1 notes for Designing Data Intensive Applications By Martin Kleppmann
Aug 27, 2022
I am going to keep all the notes for the books I am reading in a blog format. I didn't like the fact that I kept highlighting books and taking notes but rarely ever went through the concepts again, except for the rare moments when it randomly hit me that I had notes on them. So I decided to put them out there. This way, the chances of me going through the notes again are much higher.
So, starting with Designing Data Intensive Applications by Martin Kleppmann. It's one of those books that is so rich in information that you need to go through it multiple times to grasp the material. One of my favorites; lots to learn. Here we go.
These are my chapter 1 notes for the book.
Overview
Applications today are data-intensive as opposed to compute-intensive.
Raw CPU power is rarely a limiting factor. The bigger problems are the amount of data, the complexity of data, and the speed at which it is changing.
A data-intensive application is built from standard building blocks that provide common functionality, such as:
databases
caches
search indexes
stream processing
batch processing
There are many database systems with different characteristics because different applications have different requirements.
Thinking about Data Systems
Typically, databases, queues, caches, etc. are thought of as different categories of tools, even though they are superficially similar, e.g. a database and a message queue both store data for some time.
The tools no longer fit neatly into the traditional categories, e.g. data stores that are also used as message queues → Redis, message queues with database-like durability guarantees → Apache Kafka.
Increasingly, many applications now have such demanding and wide-ranging requirements that a single tool can no longer meet all of their data processing and storage needs.
Work is broken down into tasks that can be performed efficiently on a single tool and those tools are tied together using application code.
If the application has a caching layer or a full-text search server (like Elasticsearch) separate from the main database, it's the responsibility of the application code to keep the caches and indexes in sync with the main database.
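A minimal sketch of what that glue code often looks like, assuming hypothetical `db` and `cache` clients (neither name comes from the book; only the pattern matters):

```python
# Sketch: read-through caching plus invalidation on write.
# `db` and `cache` stand in for real clients; the pattern is the point.

def get_user(db, cache, user_id):
    # Read path: try the cache first, fall back to the main database.
    user = cache.get(user_id)
    if user is None:
        user = db.query("SELECT * FROM users WHERE id = ?", user_id)
        cache.set(user_id, user)  # repopulate so the next read is cheap
    return user

def update_user(db, cache, user_id, new_name):
    # Write path: update the database, then invalidate the stale cache entry.
    db.execute("UPDATE users SET name = ? WHERE id = ?", new_name, user_id)
    cache.delete(user_id)
```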
When we combine several tools to provide a service, the service's interface or API hides those details from clients. We've created a new special-purpose data system from smaller general-purpose components.
When designing a data system, many questions arise:
How to ensure data remains correct and consistent even when things go wrong internally?
How to provide consistently good performance to clients even when parts of the system are degraded?
How to scale to handle an increase in load?
What does a good API for the service look like?
3 concerns in software systems
Reliability : The system should continue to work correctly even in the face of adversity (hardware or software faults, human error).
Scalability : As the system grows (in data volume, traffic, or complexity), there should be reasonable ways of dealing with the growth.
Maintainability : The system should be designed so that the people working on it can adapt it to changes easily and work on it productively.
RELIABILITY
Software is said to be reliable if:
The application performs the functions that the user expected.
It can tolerate the user making mistakes or using the software in unexpected ways.
Its performance is good enough for the required use case, under the expected load and data volume.
The system prevents unauthorized access and abuse.
It continues to work correctly even when things go wrong.
Things that can go wrong are called faults, and systems that anticipate faults and can cope with them are called fault-tolerant or resilient.
It only makes sense to talk about tolerating certain types of faults.
The term fault-tolerant is slightly misleading: it suggests we could make a system tolerant of every kind of fault, which in reality is not feasible. If the entire planet were swallowed by a black hole, tolerance of that fault would require web hosting in space. Good luck getting that budget item approved lmao.
Fault vs Failure
→ A fault is defined as one component of the system deviating from its spec.
→ A failure is when the system as a whole stops providing the required service to the user.
It's impossible to reduce the probability of a fault to zero; thus it is best to design fault-tolerance mechanisms that prevent faults from causing failures.
Counterintuitively, in such fault-tolerant systems it can make sense to increase the rate of faults by triggering them deliberately, e.g. randomly killing individual processes without warning. Many critical bugs are due to poor error handling.
By deliberately inducing faults we ensure the fault tolerance machinery is continually exercised and tested. e.g Netflix Chaos Monkey
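A toy version of the idea, for flavor; the real Chaos Monkey terminates whole VM instances in production, while this sketch just kills a random local worker process:

```python
# Toy chaos-monkey: kill a random worker without warning and check that
# the rest of the system keeps working.
import multiprocessing
import random
import time

def worker():
    while True:
        time.sleep(1)  # stand-in for real request handling

if __name__ == "__main__":
    procs = [multiprocessing.Process(target=worker) for _ in range(5)]
    for p in procs:
        p.start()
    time.sleep(2)
    victim = random.choice(procs)
    victim.terminate()  # the deliberately injected fault
    print(f"killed pid {victim.pid}; the service should survive this")
```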
Hardware Faults
→ HDD crashes, faulty RAM, power grid blackouts, etc. Hard disks have an MTTF of 10-50 years; on a storage cluster with 10,000 disks, on average one disk should die per day.
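A quick sanity check on that number: taking an MTTF of roughly 27 years ≈ 10,000 days per disk, a cluster of 10,000 disks expects about 10,000 ÷ 10,000 = 1 disk failure per day.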
Add redundancy to the individual hardware components in order to reduce the failure rate of the system:
Disks may be set up in RAID configuration.
Servers may have dual power supplies and hot-swappable CPUs.
Data centers may have batteries and diesel generators for backup power.
Software Faults
A bug that causes every instance of an application server to crash.
A service that the system depends on slows down, becomes unresponsive, or starts returning corrupted responses.
Cascading failures where a small fault in one component triggers a fault in another component.
Human Errors
Design systems in a way that minimizes opportunities for error.
Unit tests, integration tests, manual tests.
SCALABILITY
Scalability is the term used to describe a system's ability to cope with increased load.
The system should have reasonable ways of dealing with growth in data volume, traffic volume, or complexity.
Even if a system works reliably today, it doesn't mean it will necessarily work reliably in the future. One common reason for degradation is additional load.
In cases where there is a sudden increase in load on one component of the application, there have to be proper mechanisms in place to handle the additional load.
Describing Load
Load can be described with a few numbers called load parameters, e.g. requests per second to a web server, the ratio of reads to writes in a database, or the number of simultaneously active users.
Taking Twitter as an example, two of Twitter's main operations are :
i) Post tweet - a user can publish a new message to their followers (4.6k requests/second on average, 12k requests/second at peak).
ii) Home timeline - a user can view tweets posted by the people they follow (300k requests/second).
Handling 12,000 writes/second would be fairly easy; however, Twitter's main scalability challenge is not tweet volume but fan-out, i.e. each user follows many people and each user is followed by many people.
There are 2 ways of handling these operations:
1. Posting a tweet simply inserts the new tweet into a global collection of tweets.
When a user requests their home timeline, look up all the people they follow, find all the tweets for each of those users, and merge them.
2. Maintain a cache for each user's home timeline, like a mailbox of tweets for each recipient.
When a user posts a new tweet, look up all the people that follow the user and insert the new tweet into each of their home timeline caches. The request to read the home timeline is then cheap because the result has been computed ahead of time.
→ Version 1 failed to keep up with the load of home timeline queries, so the company switched to approach 2. It works better because the average rate of published tweets is two orders of magnitude lower than the rate of home timeline reads, so it's preferable to do more work at write time and less at read time.
→ The downside of approach 2 is that posting a tweet now requires a lot of extra work. On average, a tweet is delivered to about 75 followers, so 4.6k tweets/second become 345k writes/second to the home timeline caches. However, the average is the catch here: not all users have a mere 75 followers; some users have over 30 million. This means a single tweet can result in 30 million writes to the home timelines of different users.
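To make the two approaches concrete, here is a rough sketch with in-memory dicts standing in for Twitter's real storage (all names and structures are illustrative, not Twitter's actual code):

```python
from collections import defaultdict

tweets = []                     # approach 1: one global collection of tweets
followers = defaultdict(set)    # author_id -> ids of users following them
timelines = defaultdict(list)   # user_id -> precomputed home timeline cache

# Approach 1: cheap writes, expensive reads (merge at read time).
def post_tweet_v1(author_id, text):
    tweets.append((author_id, text))

def home_timeline_v1(following):
    return [t for t in tweets if t[0] in following]  # scan + merge per read

# Approach 2: fan-out on write, cheap reads.
def post_tweet_v2(author_id, text):
    for follower_id in followers[author_id]:  # 30M iterations for a celebrity
        timelines[follower_id].append((author_id, text))

def home_timeline_v2(user_id):
    return timelines[user_id]  # already computed, just fetch it
```

(The book notes Twitter eventually moved to a hybrid: fan-out on write for most users, but tweets from celebrities are merged in at read time.)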
Describing Performance
Now we investigate what happens when load increases:
How is the performance of our system affected when we increase the load parameter and keep the system resources unchanged?
When we increase the load parameters, how much do we need to increase the resources if we want to keep performance unchanged?
In a batch processing system such as Hadoop, we usually care about throughput, i.e. the number of records we can process per second, or the total time it takes to run a job on a dataset of a certain size.
In online systems, what matters more is the service's response time, i.e. the time between a client sending a request and receiving a response.
Latency & Response Time
Latency and response time are often used synonymously, but they aren't the same.
Latency - the duration that a request is waiting to be handled, i.e. during which it is latent, awaiting service.
Response time - what the client sees → time to process the request (service time) + network delays + queueing delays.
Response time is not a single number; it should be thought of as a distribution of values that you can measure.
Most requests are reasonably fast, but there are outliers. Even when you'd think all requests should take the same time, you get variation, e.g. a page fault, the loss of a network packet, etc.
The median is a good metric to see how long users typically have to wait.
To check how bad the outliers are, look at higher percentiles, e.g. a 95th percentile response time of 1.5 s means 95 out of 100 requests take less than 1.5 s and 5 out of 100 take 1.5 s or more.
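A quick sketch of how those numbers are read off a set of measurements, using the nearest-rank method (the sample values are made up):

```python
import math

def percentile(samples, p):
    # Nearest-rank method: the smallest value such that at least p%
    # of the samples are less than or equal to it.
    ranked = sorted(samples)
    rank = math.ceil(p / 100 * len(ranked))
    return ranked[max(rank - 1, 0)]

response_times = [0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.5, 0.9, 1.2, 5.0]  # seconds
print(percentile(response_times, 50))  # 0.3 -> the typical user's wait
print(percentile(response_times, 95))  # 5.0 -> how bad the outliers are
```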
Higher percentiles, also called tail latencies, are important as they directly affect users' experience of the service. Amazon is the classic example: the customers with the slowest requests are often the ones with the most data on their accounts because they have made many purchases, i.e. the most valuable customers.
Queueing delays often account for large part of response time at high percentiles as a server can only process a small number of things in parallel.
It only takes a small number of slow requests to hold up the processing of subsequent requests.
Even if you make the calls in parallel the end-user request still needs to wait for the slowest of the parallel calls to complete.
It takes just one slow call to make the entire end-user request slow.
Tail latency amplification - even if only a small percentage of backend calls are slow, an end-user request that requires multiple backend calls has a high chance of hitting at least one slow call. Thus, a higher proportion of end-user requests end up being slow.
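The effect is easy to quantify: if each backend call is slow with probability p, a request that fans out to n calls is slow whenever at least one of them is. A quick sketch (the 1% figure is an assumption for illustration):

```python
# P(request is slow) = 1 - P(all n backend calls are fast) = 1 - (1 - p)**n
p = 0.01  # assume 1% of individual backend calls are slow
for n in (1, 10, 100):
    print(f"{n:3d} backend calls -> {1 - (1 - p) ** n:.0%} of user requests slow")
# 1 -> 1%, 10 -> 10%, 100 -> 63%
```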
If you want to monitor response times for a service on a dashboard, you need to measure them on an ongoing basis. A good idea is to keep a rolling window of the response times of requests in the last 10 minutes, and graph the median and various percentiles over that window.
Averaging percentiles is meaningless; the right way of aggregating response time data from multiple machines is to add the histograms.
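A minimal sketch of the right aggregation, with made-up bucket boundaries (real systems typically use something like HdrHistogram or t-digest):

```python
from collections import Counter

# Each machine keeps counts of requests per response-time bucket (upper
# bound in ms). Adding the counters combines the raw data, so percentiles
# computed from the merged histogram are correct; averaging each machine's
# p95 would not be.
machine_a = Counter({10: 700, 50: 250, 200: 45, 1000: 5})
machine_b = Counter({10: 300, 50: 600, 200: 90, 1000: 10})

merged = machine_a + machine_b  # Counter addition sums per-bucket counts
print(merged)  # Counter({10: 1000, 50: 850, 200: 135, 1000: 15})
```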
Approaches to coping with load
How do we maintain good performance even when our load parameters increase by some amount?
Scaling up - Vertical scaling i.e moving to a more powerful machine
Scaling out - Horizontal scaling i.e distributing the load across multiple smaller machines
Elastic systems - automatically add computing resources when a load increase is detected. Quite useful if load is unpredictable.
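A toy sketch of the elastic idea; the capacity figure and function name are invented for illustration:

```python
def machines_needed(requests_per_sec, capacity_per_machine=1000):
    # An elastic system re-evaluates this as measured load changes and
    # adds or removes machines automatically; always keep at least one.
    needed = -(-requests_per_sec // capacity_per_machine)  # ceiling division
    return max(needed, 1)

print(machines_needed(4600))  # -> 5: scale out for a traffic spike
print(machines_needed(800))   # -> 1: scale back in when load drops
```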
MAINTAINABILITY
The majority of the cost of software is not in the initial development but in its ongoing maintenance.
We should design software in such a way that it will hopefully minimize pain during maintenance.
3 design principles for software systems regarding maintenance:
Operability : Make it easy for operations teams to keep the system running smoothly
Simplicity : Make it easy for new engineers to understand the system
Evolvability : Make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change
Operability : Making Life Easy for Operations
Good operability means making routine tasks easy, allowing the operations team to focus their efforts on high-value activities.
Data systems can do the following to make routine tasks easy, e.g.:
Providing visibility into the runtime behavior and internals of the system, with good monitoring.
Providing good support for automation and integration with standard tools.
Providing good documentation and an easy-to-understand operational model ("If I do X, Y will happen").
Self-healing where appropriate, but also giving administrators manual control over the system state when needed.
Simplicity : Managing Complexity
As projects get larger, they also become more complex and difficult to understand. We should make it easy for new engineers in the team to understand the system by removing complexity.
Reducing complexity improves software maintainability, which is why simplicity should be a key goal for the systems we build.
Abstraction is one of the best tools that we have for dealing with accidental complexity. A good abstraction can hide a great deal of implementation detail behind a clean, simple-to-understand facade.
Evolvability : Making Changes Easy
A system's requirements keep evolving: you learn new facts, business priorities change, users request new features, new platforms replace old platforms.
The ease with which you can modify a data system and adapt it to changing requirements is closely linked to its simplicity and its abstractions.
Simple and easy-to-understand systems are usually easier to modify than complex ones.