Data encoding (serialization) converts in-memory data structures to a byte format for storage or transmission. The choice of encoding format affects performance (serialization/deserialization speed, CPU usage), size (wire bandwidth, storage cost), schema evolution (backward/forward compatibility), and debugging (human-readability). JSON, Protocol Buffers, Apache Avro, MessagePack, Thrift, and FlatBuffers each make different tradeoffs. Understanding serialization design is essential for high-throughput systems where encoding overhead is measurable, and for distributed systems where schema evolution across service versions is critical.
Text Formats: JSON and CSV
JSON is the universal interchange format: human-readable, supported by every language, schema-less (flexible but inconsistent). JSON drawbacks: verbose (field names are repeated in every object and often account for more than half the payload), slow to parse (string-based, no direct binary access), no native binary type (binary data must be base64-encoded), and no schema, so typos go undetected. CSV: compact for tabular data, poor for nested structures, no schema (column order is implicit). When to use JSON: public REST APIs (developer experience matters), configuration files, logs (human-readable debugging). When to avoid JSON: high-throughput internal service-to-service calls (use gRPC/Protobuf), large datasets (use Parquet, Avro), real-time streaming (use Avro or Protobuf with schema registry).
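To make the verbosity point concrete, here is a small sketch (the order record and its field names are hypothetical) that measures how many bytes of a compact JSON encoding are spent on field names alone:

```python
import json

# Hypothetical order record; field names are for illustration only.
order = {
    "order_id": "A-1001",
    "user_id": 42,
    "items": [{"product_id": "P-7", "quantity": 2, "price": 9.99}],
    "status": "PENDING",
}
encoded = json.dumps(order, separators=(",", ":")).encode()

def key_bytes(obj):
    """Bytes spent on field names (keys plus their surrounding quotes)."""
    if isinstance(obj, dict):
        return sum(len(k) + 2 + key_bytes(v) for k, v in obj.items())
    if isinstance(obj, list):
        return sum(key_bytes(v) for v in obj)
    return 0

print(len(encoded), key_bytes(order))  # field names dominate the payload
```

For this toy record, more than half of the encoded bytes are key names — the overhead that field-number-based formats like Protobuf avoid.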
// Protocol Buffers (protobuf) example
// Define schema in .proto file
syntax = "proto3";

import "google/protobuf/timestamp.proto";

message Order {
  string order_id = 1;
  int64 user_id = 2;
  repeated OrderItem items = 3;
  OrderStatus status = 4;
  google.protobuf.Timestamp created_at = 5;
}

message OrderItem {
  string product_id = 1;
  int32 quantity = 2;
  double price = 3;
}

enum OrderStatus {
  PENDING = 0;
  PAID = 1;
  SHIPPED = 2;
  CANCELLED = 3;
}
// Field numbers (=1, =2) identify fields in binary encoding
// NEVER reuse or remove field numbers (breaks wire compatibility)
// Adding new fields with new numbers is safe (backward compatible)
// Removing fields: mark as reserved to prevent reuse
// Size comparison for a typical Order object:
// JSON: ~350 bytes (field names + values + syntax)
// Protobuf: ~80 bytes (field numbers + values, no names)
// Avro: ~70 bytes with schema registry (schema sent separately)
// MsgPack: ~120 bytes (compact JSON-like binary)
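The size difference comes from the wire encoding described in the comments above. As a rough sketch (the function names are mine, not the protobuf library's), a varint field on the wire is just a tag byte (field number plus wire type) followed by base-128 bytes of the value:

```python
def encode_varint(n: int) -> bytes:
    """Base-128 varint: 7 payload bits per byte, high bit = continuation."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_tag(field_number: int, wire_type: int) -> bytes:
    """Tag = (field_number << 3) | wire_type; wire type 0 means varint."""
    return encode_varint((field_number << 3) | wire_type)

# Field 2 (user_id) with value 42: one tag byte plus one value byte.
wire = encode_tag(2, 0) + encode_varint(42)
```

Field 1 with value 1 encodes in 2 bytes; field 1 with value 1 billion takes 6, which is why Protobuf stays compact for typical business data where most integers are small.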
Schema Evolution: Backward and Forward Compatibility
In distributed systems, services are deployed at different times — old readers encounter new data, new readers encounter old data. Backward compatibility: new code can read data written by old code. Forward compatibility: old code can read data written by new code. Protocol Buffers rules for safe evolution: add new fields with new field numbers (old code ignores unknown fields — forward compatible), never change field type or number, mark removed fields as reserved to prevent reuse. Avro schema evolution: reader and writer schemas can differ if the resolution rules are followed (added fields with defaults, removed fields ignored by reader). JSON: no enforcement, schema changes silently break readers (typo in new field name goes undetected). Thrift/Avro/Protobuf are all designed for schema evolution as a first-class concern — choose them for long-lived data storage where schemas will evolve.
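Avro's schema-resolution behavior can be sketched in plain Python (the schemas and field names here are hypothetical, and a real Avro reader resolves binary records against a full schema, not dicts):

```python
# Hypothetical reader schema (v2): field name -> default value.
# 'currency' was added in v2 with a default.
READER_SCHEMA = {"order_id": "", "user_id": 0, "currency": "USD"}

def resolve(record: dict) -> dict:
    """Keep fields the reader knows, fill defaults for missing ones,
    and silently drop fields the reader has never heard of."""
    return {name: record.get(name, default)
            for name, default in READER_SCHEMA.items()}

# v1 writer: no 'currency'; the reader's default fills in (backward compatible).
old = resolve({"order_id": "A-1", "user_id": 7})

# v3 writer: unknown 'discount' field is dropped by the reader (forward compatible).
new = resolve({"order_id": "A-2", "user_id": 8,
               "currency": "EUR", "discount": 5})
```

The two calls illustrate both directions of compatibility: the new reader handles old data via defaults, and it tolerates data from an even newer writer by ignoring what it does not recognize.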
Columnar Formats: Parquet and Arrow
Row formats (JSON, Protobuf, Avro) serialize one complete record at a time. Columnar formats (Apache Parquet, Apache ORC) serialize all values for one column together. Parquet is the standard columnar storage format for data lakes (S3 + Spark, BigQuery, Athena). Benefits: read only the queried columns (projection pushdown), skip row groups whose per-column min/max statistics rule out a filter (predicate pushdown), excellent compression (similar values stored together), vectorized SIMD processing. Apache Arrow is an in-memory columnar format: zero-copy data sharing between systems (Spark, Pandas, DuckDB all read Arrow format directly without deserialization). Arrow enables passing large datasets between a Python data science layer and a C++ processing layer without copying. For storage: use Parquet (compressed, good for archival). For in-memory inter-process exchange: use Arrow (zero-copy, fast random access).
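A toy sketch of the row-versus-column tradeoff (JSON stands in for the real encodings, and the dataset is made up): with a columnar layout, a query that touches one column reads only that column's bytes:

```python
import json

rows = [{"order_id": f"A-{i}", "status": "PAID", "price": float(i)}
        for i in range(1000)]

# Row layout: one blob per record; answering "sum of price" parses everything.
row_blobs = [json.dumps(r).encode() for r in rows]

# Column layout: one blob per column; the same query reads a single blob.
columns = {name: json.dumps([r[name] for r in rows]).encode()
           for name in rows[0]}

full_scan = sum(len(b) for b in row_blobs)  # bytes touched in the row layout
price_only = len(columns["price"])          # bytes touched in the column layout
print(full_scan, price_only)
```

Real columnar formats add compression and statistics on top, but the core win is the same: the bytes you never read cost nothing.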
Key Interview Discussion Points
- Varint encoding: Protocol Buffers use variable-length integers (small numbers take fewer bytes) — field 1 with value 1 encodes in 2 bytes; field 1 with value 1 billion encodes in 6 bytes; this makes Protobuf efficient for typical business data where most integers are small
- Schema registry: in Kafka with Avro encoding, the schema is not included in every message (for small messages that would multiply the payload size); instead, each message carries a small header (a magic byte plus a 4-byte schema ID), and the Confluent Schema Registry resolves the ID to the full schema on read; this enables schema evolution across producer/consumer versions
- Zero-copy deserialization: FlatBuffers and Cap’n Proto encode data in a format that can be accessed directly from the byte buffer without parsing — field access reads from the raw bytes via offset calculation; this eliminates deserialization overhead for read-heavy workloads
- Compression: apply compression (zstd, lz4, gzip) after serialization to further reduce size; zstd achieves 3-5x compression for JSON and 1.5-2x for Protobuf at low CPU cost; lz4 compresses faster but less aggressively; choose based on CPU budget vs bandwidth budget
- MessagePack: binary JSON (same type system as JSON but binary encoding) — ~2x smaller and faster than JSON with no schema required; good for caching serialized objects where JSON overhead matters but schema evolution is minimal
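The compress-after-serialize step from the compression bullet can be sketched with the standard library's zlib (standing in for zstd/lz4, which need third-party bindings; the event batch is hypothetical):

```python
import json
import zlib

# Hypothetical event batch; repeated field names make JSON compress well.
events = [{"event": "page_view", "user_id": i % 50, "path": "/home"}
          for i in range(500)]
raw = json.dumps(events).encode()

fast = zlib.compress(raw, level=1)   # cheaper CPU, larger output
tight = zlib.compress(raw, level=9)  # more CPU, smaller output
print(len(raw), len(fast), len(tight))
```

zlib is only the stdlib stand-in here; the tradeoff it exposes (compression level vs CPU) is the same one the zstd-vs-lz4 choice makes, at better ratios and speeds.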