Chapter 2 notes for Designing Data Intensive Applications By Martin Kleppmann
Sep 18, 2023
Overview
Data models are perhaps the most important part of developing software: they have a profound effect not only on how the software is written, but also on how we think about the problem we are solving.
Most applications are built by layering one data model on top of another. The key question for each layer is: how is it represented in terms of the next-lower layer?
An app developer models data in terms of objects, data structures, and APIs that manipulate those data structures.
To store these data structures, we use a general purpose data model like JSON / XML, tables in a relational database etc.
The engineers who build the database software decide how to represent the JSON-like data model in terms of bytes in memory or on disk.
On a still lower level, hardware engineers figure out how to represent bytes in terms of electrical currents, magnetic fields, etc.
Each layer hides the complexity of the layers below it by providing a clean data model.
Relational Model vs Document Model
The best-known data model today is probably that of SQL, based on the relational model: data is organized into relations (called tables in SQL), where each relation is an unordered collection of tuples (rows in SQL).
The goal of the relational model was to hide the implementation details (how the data is laid out internally) behind a cleaner interface.
Relational databases generalize very well, beyond their original scope of business data processing, to a broad variety of use cases. Much of what you see on the web today is still powered by relational databases, be it online publishing, discussion, social networking, e-commerce, games etc.
Birth of NoSQL
Not Only SQL
There are several driving forces behind the adoption of NoSQL databases:
A need for greater scalability than relational databases can easily achieve
A widespread preference for free and open source software
Specialized query operations that are not well supported by the relational model
Desire for a more dynamic and expressive data model
Object-Relational Mismatch
Most application development today is done in object-oriented programming languages, which leads to a common criticism of the SQL data model.
Impedance mismatch: if data is stored in relational tables, an awkward translation layer is required between the objects in the application code and the database model of tables, rows, and columns. This disconnect between the models is called the impedance mismatch. ORM frameworks reduce the amount of boilerplate code required for the translation layer, but they can't completely hide the differences between the two models.
JSON representation has better locality than multi-table schema.
In the JSON representation, all the relevant information is in one place, and a single query is sufficient,
whereas fetching a profile from a relational schema requires performing multiple queries or joins.
One-to-many relationships imply a tree structure in data and the JSON representation makes this tree structure explicit
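For instance, a résumé-like profile can be stored as one self-contained document; a minimal sketch (field names are illustrative):

```javascript
// A self-contained profile document: the one-to-many relationships
// (positions, education) are nested directly inside the record as arrays,
// making the tree structure explicit.
const userProfile = {
  user_id: 251,
  first_name: "Bill",
  last_name: "Gates",
  positions: [
    { job_title: "Co-chair", organization: "Bill & Melinda Gates Foundation" },
    { job_title: "Co-founder, Chairman", organization: "Microsoft" }
  ],
  education: [
    { school_name: "Harvard University", start: 1973, end: 1975 }
  ]
};

// The whole tree is retrievable with a single lookup -- no joins needed.
console.log(userProfile.positions.length); // 2
```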
Many-to-One and Many-to-Many Relationships
In relational databases, it's normal to refer to rows in other tables by ID, because joins are easy.
In document databases, joins are not needed for one-to-many tree structures. However, many-to-one and many-to-many relationships don't fit nicely into documents, because they require normalized data that lives outside the document, and support for joins is often weak in document databases.
Also, if the database inherently doesn't support joins, then you have to emulate a join in application code by making multiple queries to the database.
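A sketch of what that emulation looks like, with a hypothetical in-memory Map standing in for the database (names and data are illustrative):

```javascript
// Hypothetical document store: organizations are referenced by ID,
// so "joining" them onto a user means issuing extra queries.
const organizations = new Map([
  [101, { name: "Microsoft" }],
  [102, { name: "Gates Foundation" }]
]);

const user = { name: "Bill", position_org_ids: [101, 102] };

// Emulated join in application code: one extra lookup per referenced record.
const positions = user.position_org_ids.map(id => organizations.get(id).name);
console.log(positions); // ["Microsoft", "Gates Foundation"]
```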
Data has a tendency of becoming more interconnected as features are added to applications.
Are document databases repeating history?
The most popular database for business data processing in the 1970s was IBM's Information Management System (IMS), first released in 1968.
It used a fairly simple hierarchical model with striking similarities to the JSON model used by document databases: it represented all data as a tree of records nested within records, much like a JSON structure.
Network model
a generalization of the hierarchical model
In the tree structure of the hierarchical model, every record had exactly one parent
whereas in the network model every record can have multiple parents.
complicated: the only way to reach a record was to follow an access path from a root record, which made querying and updating code messy and inflexible
Relational model
laid out all data in the open: a relation (table) is simply a collection of tuples (rows)
simpler: the query optimizer automatically decides how to execute a query, instead of the application developer choosing access paths
Comparison to document databases
A foreign key in the relational model plays the same role as a document reference in the document model: both identify a related record that is resolved at read time.
Relational Versus Document Databases of Today
The document model provides:
Schema flexibility
Better performance due to locality
For some applications it is closer to the data structures used by the application
The relational model provides:
Better support for joins
Better support for many-to-one and many-to-many relationships
If data in the application has a document-like structure, then it's a good idea to use a document model; the relational technique of shredding (splitting a document-like structure into multiple tables) leads to cumbersome schemas and complicated application code.
You can't refer directly to a nested item within a document. As long as documents are not too deeply nested, this is not a problem.
If the application uses many-to-many relationships, the document model becomes less appealing, as it leads to significantly more complex application code and worse performance.
It's not possible to say in general which data model leads to simpler application code; it depends on the kinds of relationships that exist between data items.
For highly interconnected data, the document model is awkward, the relational model is acceptable and the graph model is the most natural.
Schema flexibility in the document model
Document databases are sometimes called schemaless, which is misleading.
The code that reads the data assumes an implicit schema, but it is not enforced by the database. This is known as schema-on-read: the structure of the data is implicit, and only interpreted when the data is read.
In contrast, with schema-on-write (the traditional approach of relational databases), the schema is explicit, and the database ensures all written data conforms to it.
The difference between the approaches is particularly noticeable in situations where an application wants to change the format of its data.
For example, say you are currently storing each user’s full name in one field, and you instead want to store the first name and last name separately.
In a document database, you would just start writing new documents with the new fields and have code in the application that handles the case when old documents are read. For example:
if (user && user.name && !user.first_name) {
    // Documents written before Dec 8, 2013 don't have first_name
    user.first_name = user.name.split(" ")[0];
}
javascript
But for a relational schema, we need to perform a migration of the database structure before adding the new field:
ALTER TABLE users ADD COLUMN first_name text;
UPDATE users SET first_name = split_part(name, ' ', 1);      -- PostgreSQL
UPDATE users SET first_name = substring_index(name, ' ', 1); -- MySQL
sql
Schema changes have a reputation for being slow and requiring downtime, although most relational databases execute ALTER TABLE in a few milliseconds (MySQL is a notable exception: it copies the entire table). Running the UPDATE statement on a large table is likely to be slow on any database, since every row needs to be rewritten.
The schema-on-read approach is advantageous if the data is heterogeneous, i.e., the items in the collection don't all have the same structure.
In cases where all records are expected to have the same structure, schemas are a useful mechanism for documenting and enforcing that structure.
Data locality for queries
A document is usually stored as a single continuous string, encoded as JSON, XML, or a binary variant thereof (such as MongoDB's BSON). If the application frequently needs to access the entire document, there is a performance advantage to this storage locality.
If data is split across multiple tables, multiple index lookups are required to retrieve it all which may require more disk seeks and take more time.
Locality advantage only applies if you need large parts of the document at the same time.
The database typically needs to load the entire document even if you access only a small part of it, which is wasteful for large documents.
On an update, the entire document usually needs to be rewritten.
It is generally recommended to keep documents small and to avoid writes that increase the size of a document.
Grouping related data together for locality is not limited to the document model; for example, Google's Spanner database offers the same locality properties in a relational data model.
Query Languages for Data
An imperative language tells the computer to perform certain operations in a certain order. Imagine stepping through the code line by line, evaluating conditions, updating variables, and deciding whether to go around the loop one more time.
In a declarative query language, like SQL, you just specify the pattern of the data you want—what conditions the results must meet, and how you want the data to be transformed (e.g., sorted, grouped, and aggregated)—but not how to achieve that goal.
A declarative query language is typically more concise and easier to work with than an imperative API.
It also hides the implementation details of the database engine, which makes it possible for the database system to introduce performance improvements without requiring any changes to queries.
Declarative languages often lend themselves to parallel execution while imperative code is very hard to parallelize across multiple cores because it specifies instructions that must be performed in a particular order.
Declarative languages specify only the pattern of the results, not the algorithm that is used to determine results.
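A small JavaScript contrast between the two styles (the data and names are illustrative):

```javascript
const animals = [
  { name: "Great white", family: "Lamnidae" },
  { name: "Dolphin", family: "Delphinidae" },
  { name: "Mako", family: "Lamnidae" }
];

// Imperative: spell out *how* to walk the list and build the result,
// step by step, in a fixed order.
function getSharksImperative() {
  const sharks = [];
  for (let i = 0; i < animals.length; i++) {
    if (animals[i].family === "Lamnidae") sharks.push(animals[i].name);
  }
  return sharks;
}

// Declarative: state *what* you want (the pattern the results must match),
// much like SQL's SELECT ... WHERE; the iteration is left to the engine.
const sharksDeclarative = animals
  .filter(a => a.family === "Lamnidae")
  .map(a => a.name);

console.log(getSharksImperative()); // ["Great white", "Mako"]
console.log(sharksDeclarative);     // ["Great white", "Mako"]
```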
Declarative Queries on the Web
Advantages of declarative query languages are not limited to just databases.
In a web browser CSS and XSL are both declarative languages for specifying the styling of the document
In a web browser, using declarative CSS styling is much better than manipulating styles imperatively in JavaScript. Similarly, in databases, declarative query languages like SQL turned out to be much better than imperative query APIs.
MapReduce Querying
MapReduce is a programming model for processing large amounts of data in bulk across many machines, and a limited form of MapReduce is supported by MongoDB and CouchDB.
MapReduce is somewhere between declarative and imperative.
SQL can, but does not have to, be implemented by MapReduce operations.
A declarative query language offers more opportunities for a query optimizer to improve the performance of a query.
MongoDB supports a declarative query language called the aggregation pipeline, in which users don't need to coordinate map and reduce functions themselves.
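A toy illustration of the two-phase shape of the MapReduce model, in plain JavaScript; this is not MongoDB's actual API, just the idea (the data is illustrative):

```javascript
// Wildlife observations to be aggregated into animals seen per month.
const observations = [
  { date: "1995-12-25", family: "Sharks", numAnimals: 3 },
  { date: "1995-12-12", family: "Sharks", numAnimals: 4 },
  { date: "1996-01-10", family: "Sharks", numAnimals: 2 }
];

// Map phase: emit a (key, value) pair per document -- here, (month, count).
const emitted = observations.map(obs => [obs.date.slice(0, 7), obs.numAnimals]);

// Shuffle + reduce phase: group the pairs by key, then sum each group.
const totals = {};
for (const [month, n] of emitted) {
  totals[month] = (totals[month] || 0) + n;
}
console.log(totals); // { "1995-12": 7, "1996-01": 2 }
```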
Graph-like data models
Many-to-many relationships are an important distinguishing feature between different data models.
If your application has mostly one-to-many relationships (tree-structured data) or no relationships between records, the document model is appropriate.
The relational model can handle simple cases of many-to-many relationships, but as the connections within your data become more complex, it becomes more natural to start modeling your data as a graph.
A graph consists of two kinds of objects: vertices (also known as nodes or entities) and edges (also known as relationships or arcs).
Many kinds of data can be modeled as a graph
Social graph - vertices → people, edges → indicate people know each other
Web graph - vertices → web pages, edges → indicate HTML links to other pages
Road or Railway graph - vertices → junctions, edges → represent the roads between them
Property Graphs
Any vertex can have an edge connecting it with any other vertex.
There is no schema that restricts which kinds of things can or cannot be associated.
By using different labels for different kinds of relationships, you can store several different kinds of information in a single graph, while still maintaining a clean data model.
CREATE TABLE vertices (
    vertex_id   integer PRIMARY KEY,
    properties  json
);

CREATE TABLE edges (
    edge_id     integer PRIMARY KEY,
    tail_vertex integer REFERENCES vertices (vertex_id),
    head_vertex integer REFERENCES vertices (vertex_id),
    label       text,
    properties  json
);

CREATE INDEX edges_tails ON edges (tail_vertex);
CREATE INDEX edges_heads ON edges (head_vertex);
sql
Representing a property graph using a relational schema
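The same structure can be mirrored in application code; a sketch in JavaScript with illustrative data, showing the tail-vertex lookup that the edges_tails index supports in the schema above:

```javascript
// Minimal property-graph sketch: vertices and edges are plain records,
// and each edge points from a tail vertex to a head vertex.
const vertices = new Map([
  [1, { label: "Person", name: "Lucy" }],
  [2, { label: "Location", name: "Idaho" }],
  [3, { label: "Location", name: "United States" }]
]);

const edges = [
  { tail: 1, head: 2, label: "BORN_IN" },
  { tail: 2, head: 3, label: "WITHIN" }
];

// Outgoing edges of a vertex: scan edges by tail vertex.
function outgoing(vertexId) {
  return edges
    .filter(e => e.tail === vertexId)
    .map(e => ({ label: e.label, to: vertices.get(e.head).name }));
}

console.log(outgoing(1)); // [{ label: "BORN_IN", to: "Idaho" }]
```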
Cypher Query Language
Cypher is a declarative query language for property graphs, created for the Neo4j graph database
It plays a role for graph databases similar to the one SQL plays for relational databases, and some of its syntax was inspired by SQL.
Example 2-3. A subset of the data in Figure 2-5, represented as a Cypher query
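A sketch of the kind of CREATE statement such an example contains (vertex names and labels are illustrative): each parenthesized term declares a vertex with a label and properties, and the arrow syntax declares labeled edges between them.

```cypher
CREATE
  (NAmerica:Location {name: 'North America', type: 'continent'}),
  (USA:Location      {name: 'United States', type: 'country'}),
  (Idaho:Location    {name: 'Idaho', type: 'state'}),
  (Lucy:Person       {name: 'Lucy'}),
  (Idaho) -[:WITHIN]->  (USA) -[:WITHIN]-> (NAmerica),
  (Lucy)  -[:BORN_IN]-> (Idaho)
```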