Low Level Design: NoSQL Database Types Compared

Relational databases have been the default choice for decades, and for good reason — SQL is powerful, ACID transactions provide strong correctness guarantees, and the relational model handles most business data well. But relational databases hit real limits at scale, and some data simply does not fit the tabular model. NoSQL databases trade some SQL capabilities for horizontal scalability, flexible schemas, and data models optimized for specific access patterns.

Why NoSQL Exists

The pressures that drove NoSQL adoption in the 2000s were concrete. First, horizontal scalability: relational databases scale vertically (bigger machine) up to a point, but sharding a relational database is complex and limits query capability. NoSQL databases are designed from the ground up to distribute across commodity nodes. Second, flexible schema: when your data structure evolves rapidly or varies per record (e.g., product attributes that differ by category), maintaining ALTER TABLE migrations across hundreds of millions of rows is painful. Third, specialized data models: graphs, time series, and document hierarchies are awkward to represent in normalized tables — a native data model means simpler queries and better performance.

Document Stores

Document stores persist semi-structured data as JSON or BSON documents. Each document is a self-contained record that can have nested objects and arrays. Unlike relational rows, documents in the same collection can have different fields. You can query by any field, including nested fields.
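The flexible-schema and nested-query behavior can be modeled in a few lines. This is a minimal in-memory sketch of the idea only — the collection, field names, and the `find`/`get_path` helpers are illustrative inventions, not any real driver's API; a real store like MongoDB indexes these paths rather than scanning.

```python
def get_path(doc, path):
    """Follow a dotted path like 'address.city' through nested dicts."""
    for key in path.split("."):
        if not isinstance(doc, dict) or key not in doc:
            return None
        doc = doc[key]
    return doc

def find(collection, path, value):
    """Return documents whose (possibly nested) field at `path` equals `value`."""
    return [d for d in collection if get_path(d, path) == value]

# Documents in the same collection may have entirely different fields.
users = [
    {"_id": 1, "name": "Ada", "address": {"city": "London"}},
    {"_id": 2, "name": "Grace", "premium": True},
    {"_id": 3, "name": "Linus", "address": {"city": "Helsinki"}},
]

assert [d["_id"] for d in find(users, "address.city", "London")] == [1]
```

Note that the query by `address.city` works even though two of the three documents lack that field — absent fields simply never match, which is exactly how sparse, per-record schemas stay queryable.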

Representatives: MongoDB (the dominant player), Couchbase (document store plus key-value), Google Firestore (serverless, real-time sync).

Good for: user profiles (each user has different optional fields), product catalogs (electronics have different attributes than clothing), CMS content (articles with variable metadata), any domain where the entity shape varies.

Bad for: multi-document transactions (MongoDB added these in 4.0 but they carry a performance cost), complex joins across collections (you embed related data in the document or denormalize, trading query flexibility for write complexity), and workloads that need arbitrary relational queries.

Key-Value Stores

The simplest NoSQL model: a hash map at scale. Every operation is get(key), put(key, value), or delete(key). The value is opaque to the database — a blob of bytes. No query by value, no secondary indexes (in the pure model), no schema.

Representatives: Redis (in-memory, rich data structures: strings, hashes, lists, sets, sorted sets), DynamoDB (AWS managed, persistent, single-digit-millisecond latency), Riak (distributed, AP-focused).

Good for: session storage (user session keyed by session ID), caching (cache aside: check Redis, fall back to DB on miss, write back to Redis), shopping carts (cart keyed by user ID), leaderboards (Redis sorted sets make this trivial), rate limiting (Redis atomic increment with TTL), pub/sub messaging.
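The cache-aside pattern mentioned above can be sketched as follows. A dict with expiry timestamps stands in for Redis, and `load_user` is a hypothetical slow database call — both are assumptions for illustration, not real library calls.

```python
import time

cache = {}  # key -> (value, expires_at); stands in for Redis
TTL_SECONDS = 300

def load_user(user_id):
    """Hypothetical stand-in for a real database query."""
    return {"id": user_id, "name": f"user-{user_id}"}

def get_user(user_id):
    key = f"user:{user_id}"
    entry = cache.get(key)
    if entry and entry[1] > time.time():
        return entry[0]                       # cache hit, not expired
    value = load_user(user_id)                # miss: fall back to the DB
    cache[key] = (value, time.time() + TTL_SECONDS)  # write back with TTL
    return value

first = get_user(42)   # miss -> loads from "DB", populates cache
second = get_user(42)  # hit  -> served from cache
assert first == second
```

The TTL is what keeps a cache-aside design honest: stale entries age out on their own, so the cache never needs to be told about every write to the underlying database.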

Bad for: any query not by primary key, complex relationships, reporting, or workloads that need to find records by attribute values. Accessing a key-value store without knowing the key is a full scan — a sign you need a different database type.

Wide-Column Stores

Wide-column stores organize data into rows identified by a row key, with columns grouped into column families. The key difference from relational tables: rows can have different sets of columns, and columns are sparse (absent columns take no storage). Rows are sorted by row key, which enables efficient range scans.

Representatives: Apache Cassandra (peer-to-peer, highly available, tunable consistency), HBase (Hadoop ecosystem, HDFS-backed), Google Bigtable (the original wide-column design — HBase is modeled directly on it, and Cassandra combines its data model with Dynamo’s distribution scheme).

Good for: time series data (partition by device ID, cluster by timestamp — a range scan gives all readings for a device in a time window), IoT sensor data, activity feeds (events sorted by time), messaging (conversations sorted by message timestamp). Cassandra is purpose-built for write-heavy workloads — writes are appended to a commit log and an in-memory memtable, then flushed to immutable SSTables, making writes very fast.
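The sorted-row-key property is what makes the time-window query cheap. The sketch below models it with a sorted list keyed by `(device_id, timestamp)` — a stand-in for the on-disk ordering Cassandra and HBase maintain, not any client API; the names are illustrative.

```python
import bisect

rows = []  # kept sorted by (device_id, timestamp), like an SSTable

def write(device_id, ts, reading):
    bisect.insort(rows, ((device_id, ts), reading))

def range_scan(device_id, ts_start, ts_end):
    """All readings for one device in [ts_start, ts_end) — a contiguous slice."""
    lo = bisect.bisect_left(rows, ((device_id, ts_start),))
    hi = bisect.bisect_left(rows, ((device_id, ts_end),))
    return [reading for (_, reading) in rows[lo:hi]]

for ts in (100, 200, 300):
    write("sensor-a", ts, ts + 1)
write("sensor-b", 150, 151)  # lands in a different partition

assert range_scan("sensor-a", 100, 300) == [101, 201]
```

Because all of one device's readings are adjacent in key order, the scan touches only the rows in the answer — the reason "model the table around your query" pays off.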

Bad for: ad-hoc queries (Cassandra requires queries to match the partition key; you model the table around your query, not around the entity), complex aggregations, joins. Cassandra’s data modeling discipline is significant — you denormalize heavily and create one table per query pattern.

Graph Databases

Graph databases store data as nodes (entities), edges (relationships), and properties on both. Relationships are first-class citizens stored as direct pointers (index-free adjacency) — traversal cost scales with the number of relationships visited, not with total data size, whereas each hop in a SQL JOIN typically costs an index lookup that grows with table size.
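The index-free adjacency idea can be sketched with plain adjacency sets and a breadth-first traversal — the graph and the `within_hops` helper are illustrative, not a real graph database API.

```python
from collections import deque

# Each node holds direct references to its neighbors, as a graph DB does.
friends = {
    "alice": {"bob", "carol"},
    "bob":   {"alice", "dave"},
    "carol": {"alice", "dave"},
    "dave":  {"bob", "carol", "erin"},
    "erin":  {"dave"},
}

def within_hops(graph, start, max_hops):
    """Nodes reachable from `start` in at most `max_hops` edges (BFS)."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # don't expand past the hop limit
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    seen.discard(start)
    return seen

# Friend-of-friend query: everyone within 2 hops of alice.
assert within_hops(friends, "alice", 2) == {"bob", "carol", "dave"}
```

The traversal only ever touches nodes near `alice`; adding a million unrelated users would not change its cost, which is the core performance argument for the graph model.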

Representatives: Neo4j (the market leader, Cypher query language), Amazon Neptune (managed, supports Gremlin and SPARQL), TigerGraph (distributed, high-performance analytics).

Good for: social networks (friend-of-friend queries, influence propagation), fraud detection (find accounts connected through suspicious transaction chains), knowledge graphs (entity relationships), recommendation engines (users who bought X also bought Y — traversal through purchase graphs), access control (role hierarchies, permission inheritance).

Bad for: simple CRUD with no meaningful relationships, bulk analytics over all records (graph databases optimize for traversal, not full-scan aggregation), workloads where data is fundamentally tabular. Graph databases tend to be more operationally complex and have smaller ecosystems than relational or document databases.

Time Series Databases

Time series databases are optimized for a specific pattern: append-heavy writes of timestamped measurements, with queries that aggregate over time windows. They apply columnar compression (e.g., delta encoding for timestamps, XOR encoding for floating-point values), downsampling (automatically aggregate old data into lower resolution), and retention policies (automatically expire data older than N days).
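Delta-style timestamp compression is simple enough to sketch directly. This shows the delta-of-delta idea behind the encodings such stores use (the function name is illustrative): regularly spaced timestamps collapse to a run of zeros, which compresses extremely well.

```python
def delta_of_delta(timestamps):
    """Encode timestamps as first delta + differences between successive deltas."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [deltas[0]] + [b - a for a, b in zip(deltas, deltas[1:])]

# Readings every 10 seconds, with one arriving a second late:
ts = [1000, 1010, 1020, 1031, 1041]
assert delta_of_delta(ts) == [10, 0, 1, -1]
```

A perfectly regular series encodes as one delta followed by all zeros — a near-ideal input for run-length or bit-packed storage, which is why time series databases achieve such high compression ratios.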

Representatives: InfluxDB (purpose-built time series, Flux query language), Prometheus (metrics scraping and alerting, pull model), TimescaleDB (time series extension on PostgreSQL — SQL plus time series optimizations).

Good for: application metrics and monitoring, infrastructure metrics (CPU, memory, disk), IoT sensor readings, financial tick data, event logging with time-window aggregation queries like "average CPU over last 5 minutes per host."

Bad for: relational queries, entity relationships, workloads where time is not the primary dimension of access. TimescaleDB is an interesting hybrid — you get SQL and time series optimizations together, useful when you need both.

CAP Theorem

The CAP theorem states that a distributed data store can guarantee at most two of three properties simultaneously: Consistency (every read sees the most recent write), Availability (every request receives a response, not an error), Partition tolerance (the system continues operating despite network partitions that split nodes).

In practice, network partitions happen, so P is mandatory. The real choice is between CP (sacrifice availability during partitions — return an error rather than stale data) and AP (sacrifice consistency — return potentially stale data rather than erroring). Cassandra with strong consistency settings is CP; DynamoDB’s eventually-consistent reads are AP. A single-node relational database is effectively CA — it doesn’t face partition scenarios, but it’s also not distributed.
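Cassandra's "tunable consistency" reduces to quorum arithmetic: with N replicas, a write acknowledged by W replicas and a read from R replicas are guaranteed to overlap in at least one replica whenever R + W > N, so the read observes the latest write. A sketch of that check (the function name is illustrative):

```python
def is_strongly_consistent(n_replicas, write_quorum, read_quorum):
    """Read and write quorums overlap in >= 1 replica iff R + W > N."""
    return read_quorum + write_quorum > n_replicas

# Cassandra-style settings for a replication factor of 3:
assert is_strongly_consistent(3, 2, 2)      # QUORUM writes + QUORUM reads
assert not is_strongly_consistent(3, 1, 1)  # ONE + ONE: stale reads possible
assert is_strongly_consistent(3, 3, 1)      # ALL writes allow reading at ONE
```

This is why the same database can sit on either side of the CP/AP line: QUORUM/QUORUM behaves CP, while ONE/ONE trades consistency for latency and availability.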

PACELC extends CAP: even without partitions (the normal case), there is a trade-off between latency and consistency. Systems with synchronous replication have stronger consistency but higher write latency; asynchronous replication lowers latency at the cost of potential data loss on failure.

When to Use a Relational Database

NoSQL is not categorically better — relational databases remain the right choice for many workloads. Use a relational database when: you need complex multi-entity transactions with ACID guarantees (financial systems, inventory management); your queries are not known in advance and you need arbitrary JOIN flexibility; you have well-understood, stable schema; you need complex aggregation and reporting with GROUP BY across arbitrary dimensions; or your data has genuine relational structure (foreign keys, normalization) that document embedding would fight against.

PostgreSQL in particular has blurred the line — with JSONB columns, table partitioning, logical replication, and TimescaleDB, it covers many use cases that previously required a dedicated NoSQL database. The operational simplicity of one database type is a legitimate factor in the decision.

Quick Selection Guide

User sessions and caching: Redis. Flexible entity storage with varying attributes: MongoDB. Write-heavy time series and IoT: Cassandra or InfluxDB. Social graph traversal and fraud detection: Neo4j or Neptune. Application metrics and monitoring: Prometheus plus Grafana or InfluxDB. Anything requiring ACID transactions and complex queries: PostgreSQL. When in doubt: PostgreSQL first — migrate to a specialized store only when you hit a concrete limit it cannot solve.

Frequently Asked Questions

When should you choose a document database over a relational database?

Choose a document database when: the data has a variable or evolving schema (different products have different attributes), the data is naturally hierarchical (nested objects fit naturally in documents), read patterns predominantly retrieve one entity at a time (user profile, product page), and the application needs to scale writes horizontally across shards. Choose a relational database when: complex queries with arbitrary joins are needed, strict ACID transactions across multiple entities are required (payments, inventory), or the data relationships are complex and well-defined.

What is the CAP theorem and what does it mean for NoSQL databases?

The CAP theorem states that a distributed system can guarantee at most two of three properties: Consistency (every read returns the most recent write or an error), Availability (every request receives a non-error response), and Partition tolerance (the system continues operating during network partitions). Since network partitions are inevitable in distributed systems, the real choice is between CP (sacrifice availability during partitions — return error if uncertain) and AP (sacrifice consistency — return possibly stale data). Cassandra is AP; HBase is CP; most RDBMS are CA in the single-node case.

When is a wide-column store like Cassandra a better choice than a document database?

Wide-column stores (Cassandra, HBase) excel at: time-series data where rows are keyed by entity_id and column names encode timestamps, high-write throughput (append-only LSM-tree writes, no read-before-write), range queries on the row key (scan all activity for a user from time A to time B), and known access patterns designed at schema time. Document stores (MongoDB) are better for: ad-hoc queries by any field, flexible schemas that change frequently, and applications that need secondary indexes across many fields. Wide-column stores require upfront data modeling — if access patterns change, schema redesign is needed.
