Data encoding (serialization) converts in-memory data structures to a byte format for storage or transmission. The choice of encoding format affects performance (serialization/deserialization speed, CPU usage), size (wire bandwidth, storage cost), schema evolution (backward/forward compatibility), and debugging (human-readability). JSON, Protocol Buffers, Apache Avro, MessagePack, Thrift, and FlatBuffers each make different tradeoffs. Understanding serialization design is essential for high-throughput systems where encoding overhead is measurable, and for distributed systems where schema evolution across service versions is critical.
Text Formats: JSON and CSV
JSON is the universal interchange format: human-readable, supported by every language, schema-less (flexible but inconsistent). JSON drawbacks: verbose (field names are repeated in every object and often account for more than half the payload), slow to parse (string-based, no direct binary access), no native binary type (binary data must be base64-encoded), and no schema, so typos go undetected. CSV: compact for tabular data, poor for nested structures, no schema (column order is implicit). When to use JSON: public REST APIs (developer experience matters), configuration files, logs (human-readable debugging). When to avoid JSON: high-throughput internal service-to-service calls (use gRPC/Protobuf), large datasets (use Parquet, Avro), real-time streaming (use Avro or Protobuf with schema registry).
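To make the verbosity point concrete, here is a small sketch (the order record and its field names are hypothetical) that measures how many bytes of a compact JSON encoding are spent on field names alone:

```python
import json

# Hypothetical order record; field names are for illustration only.
order = {
    "order_id": "A-1001",
    "user_id": 42,
    "items": [{"product_id": "P-7", "quantity": 2, "price": 9.99}],
    "status": "PENDING",
}
encoded = json.dumps(order, separators=(",", ":")).encode()

def key_bytes(obj):
    """Bytes spent on field names (keys plus their surrounding quotes)."""
    if isinstance(obj, dict):
        return sum(len(k) + 2 + key_bytes(v) for k, v in obj.items())
    if isinstance(obj, list):
        return sum(key_bytes(v) for v in obj)
    return 0

print(len(encoded), key_bytes(order))  # field names dominate the payload
```

For this toy record, more than half of the encoded bytes are key names — the overhead that field-number-based formats like Protobuf avoid.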
// Protocol Buffers (protobuf) example
// Define schema in .proto file
syntax = "proto3";

import "google/protobuf/timestamp.proto";

message Order {
  string order_id = 1;
  int64 user_id = 2;
  repeated OrderItem items = 3;
  OrderStatus status = 4;
  google.protobuf.Timestamp created_at = 5;
}

message OrderItem {
  string product_id = 1;
  int32 quantity = 2;
  double price = 3;
}

enum OrderStatus {
  PENDING = 0;
  PAID = 1;
  SHIPPED = 2;
  CANCELLED = 3;
}
// Field numbers (=1, =2) identify fields in binary encoding
// NEVER reuse or remove field numbers (breaks wire compatibility)
// Adding new fields with new numbers is safe (backward compatible)
// Removing fields: mark as reserved to prevent reuse
// Size comparison for a typical Order object:
// JSON: ~350 bytes (field names + values + syntax)
// Protobuf: ~80 bytes (field numbers + values, no names)
// Avro: ~70 bytes with schema registry (schema sent separately)
// MsgPack: ~120 bytes (compact JSON-like binary)
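The size difference comes from the wire encoding described in the comments above. As a rough sketch (the function names are mine, not the protobuf library's), a varint field on the wire is just a tag byte (field number plus wire type) followed by base-128 bytes of the value:

```python
def encode_varint(n: int) -> bytes:
    """Base-128 varint: 7 payload bits per byte, high bit = continuation."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_tag(field_number: int, wire_type: int) -> bytes:
    """Tag = (field_number << 3) | wire_type; wire type 0 means varint."""
    return encode_varint((field_number << 3) | wire_type)

# Field 2 (user_id) with value 42: one tag byte plus one value byte.
wire = encode_tag(2, 0) + encode_varint(42)
```

Field 1 with value 1 encodes in 2 bytes; field 1 with value 1 billion takes 6, which is why Protobuf stays compact for typical business data where most integers are small.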
Schema Evolution: Backward and Forward Compatibility
In distributed systems, services are deployed at different times — old readers encounter new data, new readers encounter old data. Backward compatibility: new code can read data written by old code. Forward compatibility: old code can read data written by new code. Protocol Buffers rules for safe evolution: add new fields with new field numbers (old code ignores unknown fields — forward compatible), never change field type or number, mark removed fields as reserved to prevent reuse. Avro schema evolution: reader and writer schemas can differ if the resolution rules are followed (added fields with defaults, removed fields ignored by reader). JSON: no enforcement, schema changes silently break readers (typo in new field name goes undetected). Thrift/Avro/Protobuf are all designed for schema evolution as a first-class concern — choose them for long-lived data storage where schemas will evolve.
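Avro's schema-resolution behavior can be sketched in plain Python (the schemas and field names here are hypothetical, and a real Avro reader resolves binary records against a full schema, not dicts):

```python
# Hypothetical reader schema (v2): field name -> default value.
# 'currency' was added in v2 with a default.
READER_SCHEMA = {"order_id": "", "user_id": 0, "currency": "USD"}

def resolve(record: dict) -> dict:
    """Keep fields the reader knows, fill defaults for missing ones,
    and silently drop fields the reader has never heard of."""
    return {name: record.get(name, default)
            for name, default in READER_SCHEMA.items()}

# v1 writer: no 'currency'; the reader's default fills in (backward compatible).
old = resolve({"order_id": "A-1", "user_id": 7})

# v3 writer: unknown 'discount' field is dropped by the reader (forward compatible).
new = resolve({"order_id": "A-2", "user_id": 8,
               "currency": "EUR", "discount": 5})
```

The two calls illustrate both directions of compatibility: the new reader handles old data via defaults, and it tolerates data from an even newer writer by ignoring what it does not recognize.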
Columnar Formats: Parquet and Arrow
Row formats (JSON, Protobuf, Avro) serialize one complete record at a time. Columnar formats (Apache Parquet, Apache ORC) serialize all values for one column together. Parquet is the standard columnar storage format for data lakes (S3 + Spark, BigQuery, Athena). Benefits: read only the queried columns (projection pushdown), skip row groups whose per-column min/max statistics rule out a filter (predicate pushdown), excellent compression (similar values stored together), vectorized SIMD processing. Apache Arrow is an in-memory columnar format: zero-copy data sharing between systems (Spark, Pandas, DuckDB all read Arrow format directly without deserialization). Arrow enables passing large datasets between a Python data science layer and a C++ processing layer without copying. For storage: use Parquet (compressed, good for archival). For in-memory inter-process exchange: use Arrow (zero-copy, fast random access).
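A toy sketch of the row-versus-column tradeoff (JSON stands in for the real encodings, and the dataset is made up): with a columnar layout, a query that touches one column reads only that column's bytes:

```python
import json

rows = [{"order_id": f"A-{i}", "status": "PAID", "price": float(i)}
        for i in range(1000)]

# Row layout: one blob per record; answering "sum of price" parses everything.
row_blobs = [json.dumps(r).encode() for r in rows]

# Column layout: one blob per column; the same query reads a single blob.
columns = {name: json.dumps([r[name] for r in rows]).encode()
           for name in rows[0]}

full_scan = sum(len(b) for b in row_blobs)  # bytes touched in the row layout
price_only = len(columns["price"])          # bytes touched in the column layout
print(full_scan, price_only)
```

Real columnar formats add compression and statistics on top, but the core win is the same: the bytes you never read cost nothing.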
Key Interview Discussion Points
- Varint encoding: Protocol Buffers use variable-length integers (small numbers take fewer bytes) — field 1 with value 1 encodes in 2 bytes; field 1 with value 1 billion encodes in 6 bytes; this makes Protobuf efficient for typical business data where most integers are small
- Schema registry: in Kafka with Avro encoding, the schema is not included in every message (for small messages that would multiply the payload size); instead, each message carries a small header (a magic byte plus a 4-byte schema ID), and the Confluent Schema Registry resolves the ID to the full schema on read; this enables schema evolution across producer/consumer versions
- Zero-copy deserialization: FlatBuffers and Cap’n Proto encode data in a format that can be accessed directly from the byte buffer without parsing — field access reads from the raw bytes via offset calculation; this eliminates deserialization overhead for read-heavy workloads
- Compression: apply compression (zstd, lz4, gzip) after serialization to further reduce size; zstd achieves 3-5x compression for JSON and 1.5-2x for Protobuf at low CPU cost; lz4 compresses faster but less aggressively; choose based on CPU budget vs bandwidth budget
- MessagePack: binary JSON (same type system as JSON but binary encoding) — ~2x smaller and faster than JSON with no schema required; good for caching serialized objects where JSON overhead matters but schema evolution is minimal
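The compress-after-serialize step from the compression bullet can be sketched with the standard library's zlib (standing in for zstd/lz4, which need third-party bindings; the event batch is hypothetical):

```python
import json
import zlib

# Hypothetical event batch; repeated field names make JSON compress well.
events = [{"event": "page_view", "user_id": i % 50, "path": "/home"}
          for i in range(500)]
raw = json.dumps(events).encode()

fast = zlib.compress(raw, level=1)   # cheaper CPU, larger output
tight = zlib.compress(raw, level=9)  # more CPU, smaller output
print(len(raw), len(fast), len(tight))
```

zlib is only the stdlib stand-in here; the tradeoff it exposes (compression level vs CPU) is the same one the zstd-vs-lz4 choice makes, at better ratios and speeds.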