Compatibility Definitions
Schema evolution is the ability to change a data schema over time without breaking producers and consumers that are deployed at different versions. Three compatibility modes matter:
- Backward compatibility: A new schema can read data written by the old schema. New reader, old data.
- Forward compatibility: An old schema can read data written by the new schema. Old reader, new data.
- Full compatibility: Both backward and forward. Required for rolling deploys where producers and consumers update independently.
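The two directions can be illustrated with a minimal sketch. It assumes a toy model in which a reader "schema" is just a mapping from field name to default value; `READER_V1`, `READER_V2`, and `read` are hypothetical names, not part of any real serialization library.

```python
# Hypothetical reader "schemas": field name -> default value.
READER_V1 = {"user_id": 0, "email": ""}
READER_V2 = {"user_id": 0, "email": "", "phone": ""}  # v2 adds phone


def read(record, reader_fields):
    """Project a record onto the reader's fields: fill defaults, drop unknowns."""
    return {name: record.get(name, default) for name, default in reader_fields.items()}


old_data = {"user_id": 7, "email": "a@example.com"}
new_data = {"user_id": 7, "email": "a@example.com", "phone": "555-0100"}

backward = read(old_data, READER_V2)  # new reader, old data: phone falls back to ""
forward = read(new_data, READER_V1)   # old reader, new data: phone is ignored
```

Both reads succeed here only because the added field has a default and the old reader ignores fields it does not know; that combination is what makes this particular change fully compatible.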
Protobuf Schema Evolution Rules
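Protocol Buffers achieve compatibility through field numbers. The binary wire format carries field numbers, not names, so a field can be renamed without affecting encoded data. Renames still touch generated code, and the JSON and text formats do use names, so the freedom applies to the binary encoding only. The rules are strict: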
// v1
message User {
  int64 user_id = 1;
  string email = 2;
  string full_name = 3;
}
// v2 (safe): added optional fields, reserved the deleted field
message User {
  int64 user_id = 1;
  string email = 2;
  // full_name removed; number 3 reserved permanently
  reserved 3;
  reserved "full_name";
  string display_name = 4;  // new optional field
  string phone = 5;         // new optional field
}
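Never reuse a field number. If field 3 is reused with a different type, old decoders will misinterpret the bytes, and old data still carrying field 3 will be misread by new decoders. The reserved keyword makes the compiler reject any accidental reuse of the number or the name. All new fields must be optional: proto3 has no required fields, so an absent field simply reads as its default value. Required fields in proto2 cannot be safely added or removed.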
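To see why only the number matters, here is a minimal sketch of how a string field is laid out in the binary wire format: a varint tag built from the field number and wire type, then a length, then the bytes. The helper names `encode_varint` and `encode_string_field` are made up for this illustration; real code would use the generated Protobuf classes.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string_field(field_number, text):
    """Encode a length-delimited (wire type 2) string field."""
    data = text.encode("utf-8")
    tag = (field_number << 3) | 2
    return encode_varint(tag) + encode_varint(len(data)) + data


# The field name never appears in the output; only the number 2 does.
encoded = encode_string_field(2, "a@b.c")
```

Because the tag is derived from the number alone, a decoder that later sees the same number attached to a different type has no way to tell that the meaning changed.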
Avro and Schema Registry
Avro schemas are stored in a schema registry (Confluent Schema Registry is a widely used implementation). Each schema is identified by a subject (typically topic-value or topic-key) and a version number. Kafka messages carry only a schema ID, not the schema itself: the serialized value starts with a 1-byte magic byte (0) followed by a 4-byte big-endian schema ID, so consumers fetch the schema from the registry to deserialize.
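As a rough sketch of the consumer side, assuming Confluent's documented framing of one magic byte (0) followed by a 4-byte big-endian schema ID, the header can be split off like this. The function name `split_confluent_frame` is hypothetical; the official client libraries do this internally.

```python
import struct


def split_confluent_frame(message):
    """Return (schema_id, payload) from a Confluent-framed message value."""
    if len(message) < 5:
        raise ValueError("message too short for a 5-byte header")
    # ">bI": big-endian, 1 signed byte (magic) + 4-byte unsigned int (schema ID)
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != 0:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, message[5:]


# Build a sample frame by hand: magic byte, schema ID 42, then the Avro body.
frame = b"\x00" + (42).to_bytes(4, "big") + b"avro-bytes"
schema_id, payload = split_confluent_frame(frame)
```

The remaining payload is the Avro binary body, which can only be decoded once the schema for that ID has been fetched or found in a local cache.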
POST /subjects/orders-value/versions
{
  "schema": "{\"type\":\"record\",\"name\":\"Order\",\"fields\":[...]}"
}
Response: {"id": 42}
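Note that the schema travels as a JSON string inside the JSON body, so it is encoded twice. A minimal sketch of building that request follows; `build_register_request` and the `order_id` field are assumptions for illustration, and no HTTP call is made.

```python
import json


def build_register_request(subject, schema):
    """Return (method, path, body) for registering a schema under a subject."""
    # Inner dumps turns the schema into a string; outer dumps escapes its quotes.
    body = json.dumps({"schema": json.dumps(schema)})
    return "POST", f"/subjects/{subject}/versions", body


order_schema = {
    "type": "record",
    "name": "Order",
    "fields": [{"name": "order_id", "type": "long"}],  # hypothetical field
}
method, path, body = build_register_request("orders-value", order_schema)
```

Letting the JSON library do the escaping avoids the hand-written backslashes that are easy to get wrong in the raw request.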
Compatibility mode is set per subject:
- BACKWARD: Consumers using the new schema can read data written with the previous schema. Default; safe for most rolling deploys when you deploy consumers first.
- FORWARD: Previous schema consumers can read data written with new schema. Deploy producers first.
- FULL: Both. Most restrictive; required when deploy order cannot be controlled.
- NONE: No compatibility checking. Only for development environments.
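The modes can be made concrete with a deliberately simplified checker. It assumes flat records described as a mapping from field name to a spec with a "type" and an optional "default", and it ignores type promotion, aliases, and nested types, so it is a sketch of the idea rather than the registry's actual algorithm. `can_read` and `check` are hypothetical names.

```python
def can_read(reader, writer):
    """True if each reader field exists in the writer with the same type or has a default."""
    for name, spec in reader.items():
        if name in writer:
            if writer[name]["type"] != spec["type"]:
                return False
        elif "default" not in spec:
            return False
    return True


def check(mode, new, old):
    """Evaluate a proposed schema against the previous one under a given mode."""
    backward = can_read(reader=new, writer=old)  # new reader, old data
    forward = can_read(reader=old, writer=new)   # old reader, new data
    results = {"BACKWARD": backward, "FORWARD": forward,
               "FULL": backward and forward, "NONE": True}
    return results[mode]


V1 = {"id": {"type": "long"}, "note": {"type": "string", "default": ""}}
V2 = {"id": {"type": "long"}}                             # drops a defaulted field
V3 = {"id": {"type": "long"}, "sku": {"type": "string"}}  # adds a field with no default
```

Under this model, moving from V1 to V2 passes FULL, while moving from V1 to V3 passes FORWARD but fails BACKWARD because a new reader has no value to use for sku when reading old data.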
Schema fingerprints (for Avro, typically CRC-64-AVRO, MD5, or SHA-256 computed over the schema's canonical form) enable local caching: if the fingerprint matches a locally cached schema, skip the registry call.
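A minimal sketch of fingerprint-keyed caching is shown below. It assumes sorted-key compact JSON as a stand-in for a canonical form, which is not the same as Avro's Parsing Canonical Form; `fingerprint`, `get_schema`, and `fake_fetch` are hypothetical names.

```python
import hashlib
import json

_schema_cache = {}  # fingerprint -> parsed schema


def fingerprint(schema):
    """SHA-256 over a simplified canonical form (sorted keys, no whitespace)."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def get_schema(fp, fetch):
    """Return the cached schema for fp, calling fetch(fp) only on a miss."""
    if fp not in _schema_cache:
        _schema_cache[fp] = fetch(fp)
    return _schema_cache[fp]


registry_calls = []


def fake_fetch(fp):
    # Stand-in for a registry lookup; records each call so misses can be counted.
    registry_calls.append(fp)
    return {"type": "string"}


fp = fingerprint({"type": "string"})
first = get_schema(fp, fake_fetch)   # miss: goes to the "registry"
second = get_schema(fp, fake_fetch)  # hit: served from the local cache
```

Schemas registered under an ID or fingerprint do not change, so a cache like this needs no invalidation, only a size bound if the number of schemas is large.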
Safe and Unsafe Evolution Operations
Safe Operations
- Add an optional field with a default value
- Add a new enum symbol at the end of the symbol list (old readers still need a default or fallback for a symbol they do not recognize)
- Rename a field using an alias (Avro aliases, Protobuf field name change)
- Widen a numeric type (int32 → int64 in compatible systems)
- Add a new record type that is not referenced by existing records
Unsafe Operations
- Change a field's data type (string → integer)
- Change a field number in Protobuf
- Remove a required field
- Rename a field without an alias (Avro)
- Add a new required field (data from old writers does not contain it, so new readers fail)
- Reorder enum values (ordinal-based systems break)
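The enum rules are easy to demonstrate. The sketch below assumes a hypothetical system that persists an enum as its position in the symbol list; the symbol names are invented for the example.

```python
STATUS_V1 = ["PENDING", "SHIPPED", "DELIVERED"]
STATUS_APPENDED = ["PENDING", "SHIPPED", "DELIVERED", "RETURNED"]  # new symbol at the end
STATUS_REORDERED = ["DELIVERED", "PENDING", "SHIPPED"]             # same symbols, new order

# What an ordinal-based writer using STATUS_V1 persists for "SHIPPED".
stored_ordinal = STATUS_V1.index("SHIPPED")

decoded_appended = STATUS_APPENDED[stored_ordinal]    # still "SHIPPED"
decoded_reordered = STATUS_REORDERED[stored_ordinal]  # silently becomes "PENDING"
```

The reordered case does not raise an error; it returns a valid but wrong symbol, which is why this class of change is hard to catch after the fact.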
Rolling Deploy Strategy
The order of deployment determines which compatibility mode is sufficient:
Backward compatibility (BACKWARD mode):
1. Deploy new consumers (can read old and new schema)
2. Deploy new producers (start writing new schema)
Old consumers are already retired before new schema data appears.
Forward compatibility (FORWARD mode):
1. Deploy new producers (write new schema)
2. Deploy new consumers
Old consumers must tolerate new fields — they ignore unknown fields.
Full compatibility removes the constraint on deploy order entirely, which is essential in large organizations where service teams deploy independently.
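The relationship between mode and deploy order can be sketched as a small simulation. It assumes exactly two schema versions, one consumer group, and one producer group, and it ignores older data already retained in the topic; `allowed` and `rollout_is_safe` are hypothetical helpers.

```python
def allowed(mode, reader_version, writer_version):
    """Can a reader at reader_version handle data from writer_version under this mode?"""
    if mode == "BACKWARD":
        return reader_version >= writer_version
    if mode == "FORWARD":
        return reader_version <= writer_version
    if mode == "FULL":
        return True
    raise ValueError(f"unknown mode: {mode}")


def rollout_is_safe(mode, steps):
    """steps: ordered upgrades ('consumers' or 'producers') from v1 to v2."""
    versions = {"consumers": 1, "producers": 1}
    for step in steps:
        versions[step] = 2
        # After each upgrade, the live reader must cope with the live writer.
        if not allowed(mode, versions["consumers"], versions["producers"]):
            return False
    return True


consumers_first = ["consumers", "producers"]
producers_first = ["producers", "consumers"]
```

With these definitions, BACKWARD is safe only for consumers_first, FORWARD only for producers_first, and FULL for either order.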
JSON Schema Evolution Challenges
JSON Schema has no built-in compatibility enforcement. Teams rely on convention: never remove fields, always add fields as optional, use additionalProperties: true in validators so unknown fields don't cause validation failures. Tools like json-schema-diff-validator can be run in CI to detect breaking changes, but they only help if the check is a required gate that cannot be skipped.
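The conventions above can be checked mechanically. The following is a minimal sketch that compares only the top level of two object schemas; it is not how json-schema-diff-validator works internally, and `breaking_changes` is a hypothetical helper.

```python
def breaking_changes(old, new):
    """List convention violations between two object-type JSON Schemas (top level only)."""
    problems = []
    old_props = set(old.get("properties", {}))
    new_props = set(new.get("properties", {}))
    for name in sorted(old_props - new_props):
        problems.append(f"removed property: {name}")
    newly_required = set(new.get("required", [])) - set(old.get("required", []))
    for name in sorted(newly_required):
        problems.append(f"newly required property: {name}")
    # additionalProperties defaults to true when the keyword is absent.
    if new.get("additionalProperties", True) is False:
        problems.append("additionalProperties is false")
    return problems


old_schema = {"type": "object",
              "properties": {"id": {"type": "integer"}, "name": {"type": "string"}},
              "required": ["id"]}
new_schema = {"type": "object",
              "properties": {"id": {"type": "integer"}, "email": {"type": "string"}},
              "required": ["id", "email"],
              "additionalProperties": False}
problems = breaking_changes(old_schema, new_schema)
```

A CI job could fail the build whenever the returned list is non-empty, which turns the convention into an enforced rule.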
Event Sourcing and Upcasters
Event sourcing stores events permanently, so schema evolution must handle arbitrarily old event versions. An upcaster is a function that transforms an event from version N to version N+1 before it reaches application code:
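def upcast(event):
    """Upgrade a version-1 event (a dict) to version 2 in place."""
    if event["schema_version"] == 1:
        payload = event["payload"]
        # v2 renamed full_name to display_name
        payload["display_name"] = payload.pop("full_name")
        event["schema_version"] = 2
    return event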
Upcasters are chained so that event version 1 passes through upcast 1→2 then 2→3 automatically. This keeps application code clean while supporting the full event history without a bulk migration.
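Chaining can be sketched with a registry keyed by source version. This assumes events are dicts with schema_version and payload keys; the 2→3 step (adding a phone field) is invented for illustration, and the function names are hypothetical.

```python
def upcast_1_to_2(event):
    payload = dict(event["payload"])  # copy so the stored event is not mutated
    payload["display_name"] = payload.pop("full_name")
    return {"schema_version": 2, "payload": payload}


def upcast_2_to_3(event):
    # Hypothetical v3 change: add a phone field with an empty default.
    payload = dict(event["payload"])
    payload.setdefault("phone", "")
    return {"schema_version": 3, "payload": payload}


# Source version -> function that produces the next version.
UPCASTERS = {1: upcast_1_to_2, 2: upcast_2_to_3}


def upcast_to_latest(event):
    """Apply upcasters one version at a time until none is registered."""
    while event["schema_version"] in UPCASTERS:
        event = UPCASTERS[event["schema_version"]](event)
    return event


stored = {"schema_version": 1, "payload": {"user_id": 7, "full_name": "Ada L."}}
latest = upcast_to_latest(stored)
```

Each new schema version then requires only one new upcaster from the previous version, and events at any older version reach the latest shape through the chain.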