Chapter 2 notes for Designing Data Intensive Applications By Martin Kleppmann
Sep 18, 2023
Overview
Data models are perhaps the most important part of developing software: they have a profound effect not only on how the software is written, but also on how we think about the problem we are solving.
Most applications are built by layering one data model on top of another. The key question for each layer is: how is it represented in terms of the next-lower layer?
An app developer models data in terms of objects, data structures, and APIs that manipulate those data structures.
To store these data structures, we use a general purpose data model like JSON / XML, tables in a relational database etc.
The engineers who build the database software decide how to represent the JSON-like data model in terms of bytes in memory or on disk.
On a still lower level, hardware engineers figure out how to represent bytes in terms of electrical currents, magnetic fields, etc.
Each layer hides the complexity of the layers below it by providing a clean data model.
Relational Model vs Document Model
The best-known data model today is probably that of SQL, based on the relational model: data is organized into relations (called tables in SQL), where each relation is an unordered collection of tuples (rows in SQL).
The goal of the relational model was to hide the implementation details (how the data is laid out internally) behind a cleaner interface.
Relational databases generalize very well, beyond their original scope of business data processing, to a broad variety of use cases. Much of what you see on the web today is still powered by relational databases, be it online publishing, discussion, social networking, e-commerce, games etc.
Birth of NoSQL
Not Only SQL
There are several driving forces behind the adoption of NoSQL databases:
A need for greater scalability than relational databases can easily achieve
A widespread preference for free and open source software
Specialized query operations that are not well supported by the relational model
Desire for a more dynamic and expressive data model
Object-Relational Mismatch
Most application development today is done in object-oriented programming languages, which leads to a common criticism of the SQL data model.
Impedance mismatch: if data is stored in relational tables, an awkward translation layer is required between the objects in the application code and the database model of tables, rows, and columns. This disconnect between the models is called the impedance mismatch. ORM frameworks reduce the amount of boilerplate code required for the translation layer, but they can't completely hide the differences between the two models.
JSON representation has better locality than multi-table schema.
In the JSON representation, all the relevant information is in one place, and a single query is sufficient,
whereas fetching a profile from a relational schema requires performing multiple queries or joins.
One-to-many relationships imply a tree structure in data and the JSON representation makes this tree structure explicit
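For instance, a résumé-like profile can be stored as one self-contained document; a minimal sketch (field names are illustrative):

```javascript
// A self-contained profile document: the one-to-many relationships
// (positions, education) are nested directly inside the record as arrays,
// making the tree structure explicit.
const userProfile = {
  user_id: 251,
  first_name: "Bill",
  last_name: "Gates",
  positions: [
    { job_title: "Co-chair", organization: "Bill & Melinda Gates Foundation" },
    { job_title: "Co-founder, Chairman", organization: "Microsoft" }
  ],
  education: [
    { school_name: "Harvard University", start: 1973, end: 1975 }
  ]
};

// The whole tree is retrievable with a single lookup -- no joins needed.
console.log(userProfile.positions.length); // 2
```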
Many-to-One and Many-to-Many Relationships
In relational databases, it's normal to refer to rows in other tables by ID, because joins are easy.
In document databases, joins are not needed for one-to-many tree structures. However, many-to-one and many-to-many relationships don't fit nicely into documents, because they require normalized data that lives outside the document, and support for joins is often weak in document databases.
Also, if the database inherently doesn't support joins, then you have to emulate a join in application code by making multiple queries to the database.
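A sketch of what that emulation looks like, with a hypothetical in-memory Map standing in for the database (names and data are illustrative):

```javascript
// Hypothetical document store: organizations are referenced by ID,
// so "joining" them onto a user means issuing extra queries.
const organizations = new Map([
  [101, { name: "Microsoft" }],
  [102, { name: "Gates Foundation" }]
]);

const user = { name: "Bill", position_org_ids: [101, 102] };

// Emulated join in application code: one extra lookup per referenced record.
const positions = user.position_org_ids.map(id => organizations.get(id).name);
console.log(positions); // ["Microsoft", "Gates Foundation"]
```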
Data has a tendency of becoming more interconnected as features are added to applications.
Are document databases repeating history?
The most popular database for business data processing in the 1970s was IBM's Information Management System (IMS), first released in 1968.
It used a fairly simple hierarchical model with striking similarities to the JSON model used by document databases: it represented all data as a tree of records nested within records, much like a JSON structure.
Network model
a generalization of the hierarchical model
In the tree structure of the hierarchical model, every record had exactly one parent
whereas in the network model every record can have multiple parents.
complicated: the only way to reach a record was to follow an access path from a root record, which made querying and updating code messy and inflexible
Relational model
laid out all data in the open: a relation (table) is simply a collection of tuples (rows)
simpler: the query optimizer automatically decides how to execute a query, instead of the application developer choosing access paths
Comparison to document databases
A foreign key in the relational model plays the same role as a document reference in the document model: both identify a related record that is resolved at read time.
Relational Versus Document Databases of Today
The document model provides:
Schema flexibility
Better performance due to locality
For some applications it is closer to the data structures used by the application
The relational model provides:
Better support for joins
Better support for many-to-one and many-to-many relationships
If data in the application has a document-like structure, then it's a good idea to use a document model; the relational technique of shredding (splitting a document-like structure into multiple tables) leads to cumbersome schemas and complicated application code.
You can't refer directly to a nested item within a document. As long as documents are not too deeply nested, this is not a problem.
If the application uses many-to-many relationships, the document model becomes less appealing, as it leads to significantly more complex application code and worse performance.
It's not possible to say in general which data model leads to simpler application code; it depends on the kinds of relationships that exist between data items.
For highly interconnected data, the document model is awkward, the relational model is acceptable and the graph model is the most natural.
Schema flexibility in the document model
Document databases are sometimes called schemaless, which is misleading.
The code that reads the data assumes an implicit schema, but it is not enforced by the database. This is known as schema-on-read: the structure of the data is implicit, and only interpreted when the data is read.
In contrast, with schema-on-write (the traditional approach of relational databases), the schema is explicit, and the database ensures all written data conforms to it.
The difference between the approaches is particularly noticeable in situations where an application wants to change the format of its data.
For example, say you are currently storing each user’s full name in one field, and you instead want to store the first name and last name separately.
In a document database, you would just start writing new documents with the new fields and have code in the application that handles the case when old documents are read. For example:
if (user && user.name && !user.first_name) {
    // Documents written before Dec 8, 2013 don't have first_name
    user.first_name = user.name.split(" ")[0];
}
javascript
But for a relational schema, we need to perform a migration of the database structure before adding the new field:
ALTER TABLE users ADD COLUMN first_name text;
UPDATE users SET first_name = split_part(name, ' ', 1);      -- PostgreSQL
UPDATE users SET first_name = substring_index(name, ' ', 1); -- MySQL
sql
Schema changes have a reputation for being slow and requiring downtime, although most relational databases execute ALTER TABLE in a few milliseconds (MySQL is a notable exception: it copies the entire table). Running the UPDATE statement on a large table is likely to be slow on any database, since every row needs to be rewritten.
The schema-on-read approach is advantageous if the data is heterogeneous, i.e., the items in the collection don't all have the same structure.
In cases where all records are expected to have the same structure, schemas are a useful mechanism for documenting and enforcing that structure.
Data locality for queries
A document is usually stored as a single continuous string, encoded as JSON, XML, or a binary variant thereof (such as MongoDB's BSON). If the application frequently needs to access the entire document, there is a performance advantage to this storage locality.
If data is split across multiple tables, multiple index lookups are required to retrieve it all which may require more disk seeks and take more time.
Locality advantage only applies if you need large parts of the document at the same time.
The database typically needs to load the entire document even if you access only a small part of it, which is wasteful for large documents.
On an update, the entire document usually needs to be rewritten.
It is generally recommended to keep documents small and to avoid writes that increase the size of a document.
Grouping related data together for locality is not limited to the document model; for example, Google's Spanner database offers the same locality properties in a relational data model.
Query Languages for Data
An imperative language tells the computer to perform certain operations in a certain order. Imagine stepping through the code line by line, evaluating conditions, updating variables, and deciding whether to go around the loop one more time.
In a declarative query language, like SQL, you just specify the pattern of the data you want—what conditions the results must meet, and how you want the data to be transformed (e.g., sorted, grouped, and aggregated)—but not how to achieve that goal.
A declarative query language is typically more concise and easier to work with than an imperative API.
It also hides the implementation details of the database engine, which makes it possible for the database system to introduce performance improvements without requiring any changes to queries.
Declarative languages often lend themselves to parallel execution while imperative code is very hard to parallelize across multiple cores because it specifies instructions that must be performed in a particular order.
Declarative languages specify only the pattern of the results, not the algorithm that is used to determine results.
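A small JavaScript contrast between the two styles (the data and names are illustrative):

```javascript
const animals = [
  { name: "Great white", family: "Lamnidae" },
  { name: "Dolphin", family: "Delphinidae" },
  { name: "Mako", family: "Lamnidae" }
];

// Imperative: spell out *how* to walk the list and build the result,
// step by step, in a fixed order.
function getSharksImperative() {
  const sharks = [];
  for (let i = 0; i < animals.length; i++) {
    if (animals[i].family === "Lamnidae") sharks.push(animals[i].name);
  }
  return sharks;
}

// Declarative: state *what* you want (the pattern the results must match),
// much like SQL's SELECT ... WHERE; the iteration is left to the engine.
const sharksDeclarative = animals
  .filter(a => a.family === "Lamnidae")
  .map(a => a.name);

console.log(getSharksImperative()); // ["Great white", "Mako"]
console.log(sharksDeclarative);     // ["Great white", "Mako"]
```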
Declarative Queries on the Web
Advantages of declarative query languages are not limited to just databases.
In a web browser CSS and XSL are both declarative languages for specifying the styling of the document
In a web browser, using declarative CSS styling is much better than manipulating styles imperatively in JavaScript. Similarly, in databases, declarative query languages like SQL turned out to be much better than imperative query APIs.
MapReduce Querying
MapReduce is a programming model for processing large amounts of data in bulk across many machines, and a limited form of MapReduce is supported by MongoDB and CouchDB.
MapReduce is somewhere between declarative and imperative.
SQL can, but does not have to, be implemented by MapReduce operations.
A declarative query language offers more opportunities for a query optimizer to improve the performance of a query.
MongoDB supports a declarative query language called the aggregation pipeline, in which users don't need to coordinate map and reduce functions themselves.
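A toy illustration of the two-phase shape of the MapReduce model, in plain JavaScript; this is not MongoDB's actual API, just the idea (the data is illustrative):

```javascript
// Wildlife observations to be aggregated into animals seen per month.
const observations = [
  { date: "1995-12-25", family: "Sharks", numAnimals: 3 },
  { date: "1995-12-12", family: "Sharks", numAnimals: 4 },
  { date: "1996-01-10", family: "Sharks", numAnimals: 2 }
];

// Map phase: emit a (key, value) pair per document -- here, (month, count).
const emitted = observations.map(obs => [obs.date.slice(0, 7), obs.numAnimals]);

// Shuffle + reduce phase: group the pairs by key, then sum each group.
const totals = {};
for (const [month, n] of emitted) {
  totals[month] = (totals[month] || 0) + n;
}
console.log(totals); // { "1995-12": 7, "1996-01": 2 }
```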
Graph-like data models
Many-to-many relationships are an important distinguishing feature between different data models.
If your application has mostly one-to-many relationships (tree-structured data) or no relationships between records, the document model is appropriate.
The relational model can handle simple cases of many-to-many relationships, but as the connections within your data become more complex, it becomes more natural to start modeling your data as a graph.
A graph consists of two kinds of objects: vertices (also known as nodes or entities) and edges (also known as relationships or arcs).
Many kinds of data can be modeled as a graph
Social graph - vertices → people, edges → indicate people know each other
Web graph - vertices → web pages, edges → indicate HTML links to other pages
Road or Railway graph - vertices → junctions, edges → represent the roads between them
Property Graphs
Any vertex can have an edge connecting it with any other vertex.
There is no schema that restricts which kinds of things can or cannot be associated.
By using different labels for different kinds of relationships, you can store several different kinds of information in a single graph, while still maintaining a clean data model.
CREATE TABLE vertices (
    vertex_id   integer PRIMARY KEY,
    properties  json
);

CREATE TABLE edges (
    edge_id     integer PRIMARY KEY,
    tail_vertex integer REFERENCES vertices (vertex_id),
    head_vertex integer REFERENCES vertices (vertex_id),
    label       text,
    properties  json
);

CREATE INDEX edges_tails ON edges (tail_vertex);
CREATE INDEX edges_heads ON edges (head_vertex);
sql
Representing a property graph using a relational schema
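The same structure can be mirrored in application code; a sketch in JavaScript with illustrative data, showing the tail-vertex lookup that the edges_tails index supports in the schema above:

```javascript
// Minimal property-graph sketch: vertices and edges are plain records,
// and each edge points from a tail vertex to a head vertex.
const vertices = new Map([
  [1, { label: "Person", name: "Lucy" }],
  [2, { label: "Location", name: "Idaho" }],
  [3, { label: "Location", name: "United States" }]
]);

const edges = [
  { tail: 1, head: 2, label: "BORN_IN" },
  { tail: 2, head: 3, label: "WITHIN" }
];

// Outgoing edges of a vertex: scan edges by tail vertex.
function outgoing(vertexId) {
  return edges
    .filter(e => e.tail === vertexId)
    .map(e => ({ label: e.label, to: vertices.get(e.head).name }));
}

console.log(outgoing(1)); // [{ label: "BORN_IN", to: "Idaho" }]
```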
Cypher Query Language
Cypher is a declarative query language for property graphs, created for the Neo4j graph database
It plays a role for graph databases similar to the one SQL plays for relational databases, and some of its syntax was inspired by SQL.
Example 2-3. A subset of the data in Figure 2-5, represented as a Cypher query
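A sketch of the kind of CREATE statement such an example contains (vertex names and labels are illustrative): each parenthesized term declares a vertex with a label and properties, and the arrow syntax declares labeled edges between them.

```cypher
CREATE
  (NAmerica:Location {name: 'North America', type: 'continent'}),
  (USA:Location      {name: 'United States', type: 'country'}),
  (Idaho:Location    {name: 'Idaho', type: 'state'}),
  (Lucy:Person       {name: 'Lucy'}),
  (Idaho) -[:WITHIN]->  (USA) -[:WITHIN]-> (NAmerica),
  (Lucy)  -[:BORN_IN]-> (Idaho)
```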