Echo Kanak

Data Models and Query Languages

Chapter 2 notes for Designing Data Intensive Applications By Martin Kleppmann
Sep 18, 2023

Overview

Data models are perhaps the most important part of developing software as they have a profound effect on not only how the software is written but also how we think about the problems we are solving

Most applications are built by layering one data model on top of another. How is it represented in terms of next-lower layer?

  • An app developer models data in terms of objects, data structures and API's that manipulate those data structures.
  • To store these data structures, we use a general purpose data model like JSON / XML, tables in a relational database etc.
  • The engineers who built the database software decide a way to represent JSON like data model in form of bytes in memory or disk.
  • On more lower level, hardware engineers figure out how to represent bytes in form of electrical currents, magnetic fields, etc.

Each layer hides the complexity of the layers below it by providing a clean data model.

Relational Model vs Document Model

  • The best known data model of today is of SQL, based on relational model where data is organized into relations (tables in SQL) where each relation is an unordered collection of tuples (rows in SQL)
  • Goal of the relational model was to hide that implementation detail behind a cleaner interface.
  • Relational databases generalize very well, beyond their original scope of business data processing, to a broad variety of use cases. Much of what you see on the web today is still powered by relational databases, be it online publishing, discussion, social networking, e-commerce, games etc.

Birth of NoSQL
Object-Relational Mismatch
Many-to-One and Many-to-Many Relationships

Are document databases repeating history?

  • The most popular database for business processing systems was IBM's IMS developed in 1970.
  • It had a fairly simple model hierarchical model which had striking similarities with JSON model used by document databases. It represented all data as a tree of records nested within records like JSON structure

Network model

  • generalization of hierarchical model
  • In the tree structure of the hierarchical model, every record had exactly one parent
  • whereas in the network model every record can have multiple parents.
  • complicated

Relational model

  • laid out all data in the open: a relation(table) is simply a collection of rows(tuples)
  • simpler

Comparison to document databases

  • Foreign key in relational model and document reference in document model

Relational Versus Document Databases of Today

Document model provides :

  1. Schema flexibility
  1. Better performance due to locality
  1. For some applications it is closer to the data structures used by the application

Relational model provides :

  1. Better support for joins
  1. Better support for many-to-one and many-to-many relationships

If data in the application has a document-like structure, then it’s a good idea to use a document model. the relational technique of shredding(splitting document-like structure into multiple tables) leads to cumbersome schemas and complicated application code
  • You can’t refer directly to the nested item within a document. As long as document is not too deeply nested it’s not a problem
  • If application uses many-to-many relationships document model becomes less appealing as it leads to significantly more complex application code and worse performance
  • It’s not possible to say which data model leads to more simpler application code as it depends on the kinds of relationships that exist between data items
  • For highly interconnected data, the document model is awkward, the relational model is acceptable and the graph model is the most natural.
Schema flexibility in the document model
Data locality for queries

Query Languages of data

  • An imperative language tells the computer to perform certain operations in a certain order. Imagine stepping through the code line by line, evaluating conditions, updating variables, and deciding whether to go around the loop one more time.
  • In a declarative query language, like SQL, you just specify the pattern of the data you want—what conditions the results must meet, and how you want the data to be transformed (e.g., sorted, grouped, and aggregated)—but not how to achieve that goal.
  • Declarative query language is more concise and easier to work with than imperative API
  • It also hides the implementation details of the database engine, which makes it possible for the database system to introduce performance improvements without requiring any changes to queries.
  • Declarative languages often lend themselves to parallel execution while imperative code is very hard to parallelize across multiple cores because it specifies instructions that must be performed in a particular order.
  • Declarative languages specify only the pattern of the results, not the algorithm that is used to determine results.

Declarative Queries on the Web

MapReduce Querying

  • MapReduce is a programming model for processing large amounts of data in bulk across many machines, and a limited form of MapReduce is supported by MongoDB and CouchDB.
  • MapReduce is somewhere between declarative and imperative.
  • SQL can, but not necessarily have to, be implemented by MapReduce operations.
  • A declarative query language offers more opportunities for a query optimizer to improve the performance of a query.
  • MongoDB supports a declarative query language called aggregation pipeline, where users don’t need to coordinate a map and reduce function themselves.

Graph-like data models

  • Many-to-many relationships are an important distinguishing feature between different data models.
  • If your application has mostly one-to-many relationships (tree-structured data) or no relationships between records, the document model is appropriate.
  • The relational model can handle simple cases of many-to-many relationships
  • But as the connections within your data become more complex, it becomes more natural to start modeling your data as a graph.

A graph consists of two kinds of objects: vertices (also known as nodes or entities) and edges (also known as relationships or arcs).

Many kinds of data can be modeled as a graph

  1. Social graph - vertices → people, edges → indicate people know each other
  1. Web graph - vertices → web pages, edges → indicate HTML links to other pages
  1. Road or Railway graph - vertices → junctions, edges → represent the roads between them

Property Graphs
Cypher Query Language