Schemas and Satisfaction: Signatures, models, and the conditions for validity

A3 · A3b · Prose proof

The formal structure of conditions under which a claim may be relied upon.

Of things said without any combination, each signifies either substance or quantity or qualification or a relative or where or when or being-in-a-position or having or doing or being-affected.

Aristotle, Categories 1b25-2a4

This chapter formalizes the schema as a mathematical signature (A3, A3b): a set of types, predicates with arities, and constraints. Adding a predicate is not inserting a row but extending the signature — a model-class change that alters what the system can express. The reader who wants the historical argument for why schemas freeze the world at migration time should read Vol I, Chapter 5 (The Empire of Tables).

The Query That Cannot Be Typed

The database is perfect. Every dress has a primary key. Every attribute is typed. Foreign keys enforce referential integrity. Constraints prevent impossible states: price must be positive, size must be in the allowed set, color must come from the approved palette. The schema is a contract, and the data honors it.

A user asks: "Show me something less puffy."

The system can store "puffy." It can add a row to a tags table, or a key to a JSON blob, or an entry in a full-text index. What it cannot do—without changing the contract the schema enforces—is treat puffiness as a typed predicate with declared semantics, domain constraints, referential integrity, and downstream guarantees. To ask "show me something less puffy" as a certified query—one the database can type-check, optimize, and guarantee—the vocabulary must already contain a puffiness predicate.

This is not a data gap. It is a vocabulary gap. The database can store facts it cannot yet speak.

Predicate novelty is not data; it is a language change. And language changes are governed, not queried.

Relational systems were built to make language stable. To understand why, we need to see what problem the schema empire was built to solve—and what it purchased by solving it.


What the Relational Empire Bought

Before relational databases, data lived in files. To query the data, you navigated the file—you knew the physical structure, the byte offsets, the access paths. A program that wanted to answer "which customers ordered product X" needed to know how customer records were stored, how orders were linked, where the indexes lived. Changing how you asked meant changing how you stored. Data and access were coupled.

Codd's insight was separation (Codd 1970). Give every record an address—a key—that is independent of physical location. Let the query specify what data is needed; let the system figure out how to retrieve it. The programmer asks; the optimizer answers. The abstraction boundary between logical and physical transformed what was possible.

The core invariant is checkable reference: every pointer must land. A foreign key in one table points to a primary key in another. The database enforces that the target exists. You cannot insert an order for a customer who does not exist. You cannot delete a product that has pending orders. You cannot reference a category that the category table does not contain. The data forms a coherent graph with no dangling pointers.

This separation enabled a cascade of guarantees:

Referential integrity. Foreign keys point to real records. The database enforces this at write time. An application cannot accidentally create orphaned references.

Constraints. Business rules are enforced at the data layer. Price must be positive. Email must be unique. Status must come from an enumerated set. The constraint is declared once and holds everywhere.

Transactions. Changes are atomic, consistent, isolated, and durable (Gray 1981). A transfer between accounts either completes entirely or not at all. Concurrent modifications do not corrupt state. Crashes do not lose committed work.

Query optimization. The programmer specifies the question; the database chooses the execution plan. Indexes, join orders, caching strategies—all determined by the optimizer, invisible to the query author.
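The write-time guarantees above can be exercised directly. A minimal sketch, using SQLite through Python's sqlite3 module (the table and column names are illustrative, not from the chapter's catalog):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount REAL CHECK (amount > 0)
    )
""")
conn.execute("INSERT INTO customers VALUES (1, 'Alice')")
conn.execute("INSERT INTO orders VALUES (10, 1, 99.50)")  # valid: the pointer lands

# An order for a customer who does not exist is refused at write time.
try:
    conn.execute("INSERT INTO orders VALUES (11, 999, 10.0)")
    orphan_refused = False
except sqlite3.IntegrityError:
    orphan_refused = True

# A negative amount violates the declared constraint, declared once, enforced everywhere.
try:
    conn.execute("INSERT INTO orders VALUES (12, 1, -5.0)")
    negative_refused = False
except sqlite3.IntegrityError:
    negative_refused = True
```

Both invalid writes raise `IntegrityError`: the system refuses the invalid state rather than accepting it and hoping for the best.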

These guarantees are not decorative. They are what makes relational databases trustworthy for financial transactions, medical records, airline reservations, government registries. When the stakes are high, you want a system that refuses invalid states rather than one that accepts anything and hopes for the best.

The price was vocabulary. The schema—the declaration of tables, columns, types, constraints—must be fixed before data arrives. You cannot insert a row that violates the schema. You cannot query a column that does not exist. The schema is a contract, negotiated at design time, enforced at runtime. Changing the contract is not a query; it is a migration—a governance event that touches data, applications, and downstream systems.

The empire's gift is certification: the system refuses to speak untyped nonsense.


The Escape Hatches

The empire has well-known escape hatches. Each buys flexibility by quietly surrendering the guarantees the empire exists to provide.

Entity-Attribute-Value (EAV). Instead of a fixed schema, store triples: (entity_id, attribute_name, attribute_value). Now any attribute can be added without DDL. The cost: constraints evaporate. You can enforce that "puffiness" comes from a valid domain only by reintroducing a schema by other means—triggers, application logic, and code paths that reassemble the contract you dissolved. Joins become semantic guesses—correlating on attribute names that may or may not mean the same thing across rows. Type safety disappears; "price" might be a number in one row and a string in another.
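The type-safety loss is easy to demonstrate. A sketch of the EAV pattern in SQLite (illustrative names; SQLite's dynamic typing makes the surrender explicit, since the value column can be declared with no type at all):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE eav (
        entity_id INTEGER,
        attribute_name TEXT,
        attribute_value  -- no declared type: anything goes
    )
""")
rows = [
    (1, "price", 120.0),        # a number in one row...
    (2, "price", "expensive"),  # ...a string in another; the schema cannot object
    (1, "puffiness", "high"),
    (2, "puffiness", 7),
]
conn.executemany("INSERT INTO eav VALUES (?, ?, ?)", rows)

# Both 'price' rows are accepted. A constraint like price > 0 is now
# unenforceable at the data layer: price is just a value in a generic column.
stored_types = [type(v).__name__ for (v,) in conn.execute(
    "SELECT attribute_value FROM eav WHERE attribute_name = 'price' ORDER BY rowid")]
```

The two stored "prices" come back as a float and a string; any comparison or constraint must now be rebuilt in application code.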

JSON columns. Modern databases let you store semi-structured data in JSON blobs. Schema-on-read: store anything, parse it later. The cost: invariants become application-level conventions. The database can enforce some properties, but only by rebuilding the contract piecemeal—check constraints, generated columns, triggers—until the "escape hatch" has quietly turned back into a schema. You have storage without certification unless you reassemble certification by other means.

Tag tables. A join table linking entities to freeform tags. Add a tag instantly; no DDL required. The cost: you have created an untyped predicate space. There is no domain constraint on what tags are valid. There is no referential integrity for what tags mean. "Puffy" and "poofy" and "voluminous" coexist without the system knowing whether they are synonyms, hyponyms, or unrelated.

Each escape is a controlled retreat from the relational contract. The system can now store "puffy," but it cannot type-check "puffy." It cannot enforce that puffiness values come from a valid domain. It cannot guarantee that queries against puffiness will be optimized, or even well-formed. The flexibility is real, and so is the loss.

Each hatch lets you store more, but certify less.


Schema as Signature

To understand what the escapes surrender, we need to see what the schema formally provides.

A schema is not a "table definition" in the colloquial sense. It is a signature: a formal declaration of the language in which data can be expressed. The signature specifies what distinctions can be made, what questions can be asked, what constraints must hold.

A3
Schema as Signature (A3)

A schema is a signature Σ = (T, P, I) where:

  • T is a set of types (sorts): the domains of discourse—strings, integers, dates, enumerations, user-defined types
  • P is a set of predicates (relation symbols—and, in practice, function-like attributes): the questions that can be asked—has_color, has_size, is_in_category, each with declared arities over T
  • I is a set of constraints (integrity rules): the invariants that must hold—price > 0, size ∈ {'XS','S','M','L','XL'}, FOREIGN KEY (category_id) REFERENCES categories(id)

Adding a predicate is not "adding a column." It is a signature morphism Σ → Σ′—a map from the old language to a new language that contains more distinctions.

The class of valid models changes: Mod(Σ′) ≠ Mod(Σ).

This is why schema changes are governance events. You are not changing data; you are changing the language in which data can be expressed. Old queries remain well-formed in the extended language, but their contract can break—through interface assumptions, constraint changes, or altered NULL/typing semantics. New queries may be meaningless for old data. Applications that assumed one vocabulary must be updated to speak another.

A schema is language about data. A migration is language about language. The confusion that "adding a column" is a trivial data operation arises from conflating these levels. When you change the schema, you change the space of expressible distinctions—and sometimes the very well-formedness or meaning of downstream questions—not just which statements happen to be true.

A schema is a language whose chief virtue is that it can be certified.
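The definition A3 can be rendered as a toy data structure. A sketch, purely illustrative (the names `Signature`, `has_puffiness`, and the string-encoded constraints are inventions for this example, not part of any real database API):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Signature:
    types: frozenset       # T: the sorts
    predicates: dict       # P: predicate name -> tuple of argument types (arity)
    constraints: frozenset # I: invariants, here just named as strings

sigma = Signature(
    types=frozenset({"int", "decimal", "varchar"}),
    predicates={"has_price": ("int", "decimal"), "has_color": ("int", "varchar")},
    constraints=frozenset({"price > 0"}),
)

# "Adding a predicate" builds a new signature; the old language is unchanged.
sigma_prime = Signature(
    types=sigma.types,
    predicates={**sigma.predicates, "has_puffiness": ("int", "varchar")},
    constraints=sigma.constraints | {"puffiness IN ('none','subtle','dramatic')"},
)

# The inclusion Σ -> Σ′ is a signature morphism: every old predicate survives
# with its declared arity, so queries written against Σ stay well-formed in Σ′.
is_extension = all(
    sigma_prime.predicates.get(p) == arity for p, arity in sigma.predicates.items()
)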

Remark(On Signature Morphisms)

In the formal literature, particularly in the theory of institutions (Goguen and Burstall 1992), a signature morphism σ: Σ → Σ′ is a structure-preserving map between signatures. The key property is that models translate contravariantly: a model of the larger signature Σ′ restricts to a model of the smaller signature Σ. This is the formal basis for what practitioners call "backward compatibility"—queries written against Σ should still work when evaluated against models of Σ′. You don't need institution theory to run Postgres; it just names the compatibility intuition precisely.

Remark(Categorical Schemas vs. SQL Rigidity)

The rigidity critique applies to SQL/relational implementations, not to the mathematical concept of schema. In functorial data migration (Spivak 2012), schemas are categories and migrations are functors—structure-preserving maps that make data portability a first-class operation. Such categorical databases have the flexibility The Proofs demands: migration is governed but not ceremonial. The "Schema Empire" of this chapter refers to the implementation reality of DDL, migration scripts, and downstream coordination, not to the mathematical framework. The Third Mode we construct is closer to categorical databases than to SQL: schemas evolve via witnessed signature morphisms, not governance tickets.


Model and Satisfaction

The signature tells us what predicates exist. It does not yet tell us what makes a model of that signature valid—what constraints the data must satisfy, and what "safe change" means.

A3b
Model and Satisfaction (A3b)

A model M is an interpretation of the signature Σ: actual data that assigns values to the types, extensions to the predicates, and truth values to ground instances.

The satisfaction relation M ⊨ (Σ, I) under logic L asserts: model M satisfies signature Σ and constraints I, as judged by logic L. SQL's logic is already non-classical (three-valued) (Date 2009, ch. 4), which is why absence becomes semantics, not syntax.

Conservativity (crisp test):

A schema extension Σ → Σ′ is conservative over (Σ, I) in logic L if and only if:

Every old query has the same truth-value on every old dataset, when interpreted inside the new schema.

Formally: for all sentences φ in the language of Σ, and for all models M ⊨ (Σ, I) that extend to M′ ⊨ (Σ′, I′), we have M ⊨ φ iff M′ ⊨ φ.

Safe schema evolution = conservative extension. Breaking changes fail conservativity and require migration semantics.

Conservativity is the formal test for "safe evolution." If you can add a column without changing the answers to any existing query on any existing data, the extension is conservative. If adding the column changes answers—or makes old queries ill-formed—the extension is breaking.

Example(Conservativity Failure)

Original schema Σ:

items(id INT PRIMARY KEY, price DECIMAL CHECK (price > 0))

Extended schema Σ′:

items(id INT PRIMARY KEY, price DECIMAL CHECK (price > 0), 
      puffiness VARCHAR(20) CHECK (puffiness IN ('none','subtle','dramatic')))

with puffiness allowing NULL for existing records.

Query 1: SELECT * FROM items WHERE price > 100

Under Σ: returns all expensive items. Under Σ′: returns the same rows (puffiness column present but not filtered). Verdict: Conservative for this query.

Query 2: Application code assumes SELECT * FROM items returns rows with exactly 2 columns.

Under Σ′: returns 3 columns. Verdict: Breaking change at the API layer. (Not a failure of logical conservativity; a failure of an external interface contract.)

Query 3: Business logic treats puffiness IS NULL as "not puffy" (closed-world assumption).

Under Σ′ before backfill: all existing items have puffiness = NULL. If CWA applies: all items are "not puffy." If OWA applies: all items have "unknown" puffiness. Verdict: Semantic ambiguity. The extension is conservative only relative to a chosen semantics for absence.

The third case is the trap. Conservativity is a logical property, but breaking changes often appear at the application or semantic layer. NULL in SQL notoriously conflates "unknown," "not applicable," and "false." A schema extension that introduces nullable columns is conservative in the narrow query sense but may break downstream systems that interpret NULL with different assumptions.
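The three verdicts can be checked by running the extension. A sketch in SQLite via Python (illustrative data; the CHECK constraints from the schemas above are omitted where SQLite's ALTER TABLE does not need them):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, price REAL CHECK (price > 0))")
conn.executemany("INSERT INTO items VALUES (?, ?)", [(1, 150.0), (2, 80.0)])

before = conn.execute("SELECT id FROM items WHERE price > 100").fetchall()

# The extension Σ -> Σ′: a nullable puffiness column; existing rows get NULL.
conn.execute("ALTER TABLE items ADD COLUMN puffiness TEXT")

after = conn.execute("SELECT id FROM items WHERE price > 100").fetchall()
conservative_for_query_1 = (before == after)  # Query 1: same rows, same answer

# Query 3's trap: under three-valued logic, NULL satisfies neither side
# of a predicate. Every unbackfilled item is neither puffy nor not-puffy.
puffy = conn.execute(
    "SELECT COUNT(*) FROM items WHERE puffiness = 'dramatic'").fetchone()[0]
not_puffy = conn.execute(
    "SELECT COUNT(*) FROM items WHERE puffiness <> 'dramatic'").fetchone()[0]
```

Both counts come back zero: the database has taken neither the closed-world nor the open-world reading, and any downstream system that assumes one of them is supplying semantics the schema never declared.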

This foreshadows the next chapter, where we formalize epistemic status and show that CWA/OWA are not just database conventions but logic choices that must be explicit in the provenance.


Four Failures of Vocabulary Rigidity

The schema as signature explains why vocabulary cannot grow on demand. Four touchstones demonstrate the consequences from orthogonal angles.

T6: Predicate Invention

"Show me puffy dresses."

The catalog schema has predicates for color, size, price, silhouette, fabric, brand. It does not have a predicate for puffiness. The user's concept is real—there are dresses that are puffy and dresses that are not—but the schema provides no column in which to express this distinction.

To add puffiness as a certified predicate, the system must traverse governance layers:

  • Semantic definition: What does "puffiness" mean? What are its valid values? Who decides?
  • Constraint specification: puffiness ENUM('none','subtle','moderate','dramatic')—or continuous scale? Nullable?
  • Instrumentation: How will puffiness be measured for new items? Manual labeling? ML classifier?
  • Backfill: 50,000 existing items need puffiness values. Who labels them? With what accuracy?
  • API contracts: 12 endpoints expose item data; all must handle the new field. Backward compatibility?
  • Downstream systems: Search, recommendation, analytics all need to understand puffiness.
  • Monitoring: Null-rate alerts, value distribution drift, quality assurance.

Users invent terms faster than contracts can evolve. By the time the governance layers have processed "puffiness," users have moved to "coastal grandmother aesthetic." The schema is always behind the vocabulary.

The relational empire cannot mint new predicates on demand. That is not a bug; it is the consequence of the discipline that makes the empire reliable.

T7: Contextual Equivalence

"NYC" and "New York City" refer to the same place—sometimes.

Two records in the catalog list supplier locations. Supplier A is based in "NYC." Supplier B is based in "New York City." Are these the same location? For shipping purposes, probably yes. For tax purposes, maybe not—"NYC" might mean Manhattan specifically, while "New York City" might include all five boroughs.

The relational empire can enforce equivalence. The standard pattern is a canonical-ID table with an alias lookup:

locations(canonical_id INT PRIMARY KEY, canonical_name VARCHAR)
location_aliases(alias VARCHAR PRIMARY KEY, canonical_id INT REFERENCES locations)

The limitation is not representational impossibility. It is scope and governance.

Declaring equivalence is itself a governed act. Who decides that "NYC" = "New York City"? What process blesses that mapping? What happens when someone disagrees?

The declaration is typically global. Once the alias table says "NYC" maps to canonical ID 42, that mapping applies everywhere—shipping, taxes, analytics, reporting. The native machinery does not force scoped equivalence; you must build it explicitly (add a context column to the alias table, partition the mappings), and then govern it.

There is no witnessed, scoped equivalence by default. The system knows that two strings are mapped to the same ID. It does not know why, or under what conditions, or who attested. The mapping is a fact without a witness.

Engineers build normalization tables; changes require human curation; scope is all-or-nothing. This is workable for stable equivalences. For contextual ones, the cost is not just more rows—it is ongoing governance of scope, time, and authority.
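What "building scoped equivalence explicitly" looks like can be sketched. The context column and the particular mappings below are assumptions for illustration, not a standard pattern blessed by any database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE locations "
             "(canonical_id INTEGER PRIMARY KEY, canonical_name TEXT)")
conn.execute("""
    CREATE TABLE location_aliases (
        alias TEXT,
        context TEXT,    -- the scope of the equivalence, added by hand
        canonical_id INTEGER REFERENCES locations(canonical_id),
        PRIMARY KEY (alias, context)
    )
""")
conn.execute("INSERT INTO locations VALUES (42, 'New York City'), (43, 'Manhattan')")
conn.executemany("INSERT INTO location_aliases VALUES (?, ?, ?)", [
    ("NYC", "shipping", 42),  # for shipping, treat NYC as all five boroughs
    ("NYC", "tax", 43),       # for tax, treat NYC as Manhattan specifically
])

def resolve(alias, context):
    """Resolve an alias within a declared scope; None if no mapping exists."""
    row = conn.execute(
        "SELECT canonical_id FROM location_aliases WHERE alias = ? AND context = ?",
        (alias, context)).fetchone()
    return row[0] if row else None

shipping_id = resolve("NYC", "shipping")
tax_id = resolve("NYC", "tax")
```

The same string now resolves differently per context, but notice what is still missing: nothing in the table records why either mapping holds, who attested it, or when it expires. The scope is built; the witness is not.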

T9: Schema Evolution

Employees and contractors were once separate concepts. The company had two tables: employees with salary, benefits, and tenure; contractors with hourly rate, agency, and contract end date. Reports were written against these tables. Dashboards displayed headcounts by type. Payroll systems knew which table to query.

Then the business model changed. The distinction between employees and contractors became less meaningful. The company wanted a unified "workers" concept that could represent either, with type-specific attributes stored conditionally.

This is schema evolution under concept drift. The vocabulary of the organization has changed; the schema must follow.

The migration is not trivial:

  • Data transformation. Records from both tables must be unified, with conflicts resolved (what if the same person appears in both?).
  • Constraint rewriting. Constraints that referenced employees or contractors specifically must be generalized or conditional.
  • Application updates. Every query, report, and dashboard that touched the old tables must be rewritten.
  • Downstream coordination. Payroll, benefits, reporting systems all have their own assumptions.

The extension from {employees, contractors} to {workers} is non-conservative in the obvious sense: old queries like SELECT COUNT(*) FROM employees cease to denote. The change is breaking.

Breaking changes require version compatibility certificates—explicit documentation of what changed, what broke, and how to migrate—or a migration plan that preserves backward compatibility through views, synonyms, or API versioning.
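One of the migration strategies named above, preserving backward compatibility through views, can be sketched concretely. The unified table and its columns are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE workers (
        id INTEGER PRIMARY KEY,
        name TEXT,
        worker_type TEXT CHECK (worker_type IN ('employee', 'contractor')),
        salary REAL,        -- employee-specific; NULL for contractors
        hourly_rate REAL    -- contractor-specific; NULL for employees
    )
""")
conn.executemany("INSERT INTO workers VALUES (?, ?, ?, ?, ?)", [
    (1, "Dana", "employee", 90000.0, None),
    (2, "Eli", "contractor", None, 75.0),
])

# Views keep the old vocabulary denoting: SELECT COUNT(*) FROM employees
# still parses and still answers, even though the table no longer exists.
conn.execute("CREATE VIEW employees AS "
             "SELECT id, name, salary FROM workers WHERE worker_type = 'employee'")
conn.execute("CREATE VIEW contractors AS "
             "SELECT id, name, hourly_rate FROM workers WHERE worker_type = 'contractor'")

employee_count = conn.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
contractor_count = conn.execute("SELECT COUNT(*) FROM contractors").fetchone()[0]
```

The views are exactly the "migration semantics" the breaking change demands: an explicit, maintained translation from the old language into the new one.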

T10: Higher-Arity Events

"Alice introduced Bob to Carol at the conference."

This is a single event with four participants: an introducer (Alice), two introducees (Bob and Carol), and a context (the conference). The event has structure: Alice is the agent; Bob and Carol are the patients; the conference is the setting. There are constraints: the introducer cannot be one of the introducees.

The relational model can represent this cleanly. An introductions table with four foreign keys—introducer_id, introducee_1_id, introducee_2_id, context_id—plus role constraints and timestamps. Perfectly valid SQL.

The problem is not representational impossibility. It is representational default.

Systems drift toward binary edges because they are easier to model, join, index, and reuse. Knowledge graphs default to (subject, predicate, object) triples. ORMs default to has_many / belongs_to. The n-ary event table requires more thought, more constraints, more documentation. The path of least resistance is to decompose:

  • (Alice, introduced, Bob)
  • (Alice, introduced, Carol)
  • (introduction_event_42, at, conference)

The decomposition loses structure. The integrity constraint "introducer ≠ introducee" can no longer be expressed—there is no single row that contains both. The event unity—all participants in a single introduction—can no longer be queried without a join key that may or may not have been added. The role structure—who introduced whom to whom—is implicit rather than typed.

The failure mode is not "relational can't do n-ary." It is "the default path loses n-ary structure, and the loss appears as constraints you can no longer enforce."
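The contrast can be made executable. A sketch of the n-ary table the default path avoids (illustrative IDs): the role constraint is expressible precisely because all participants share one row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE introductions (
        id INTEGER PRIMARY KEY,
        introducer_id INTEGER NOT NULL,
        introducee_1_id INTEGER NOT NULL,
        introducee_2_id INTEGER NOT NULL,
        context_id INTEGER NOT NULL,
        -- expressible only while the event is one row:
        CHECK (introducer_id NOT IN (introducee_1_id, introducee_2_id))
    )
""")
ALICE, BOB, CAROL, CONFERENCE = 1, 2, 3, 100

# Valid: Alice introduces Bob to Carol at the conference, all roles typed.
conn.execute("INSERT INTO introductions VALUES (42, ?, ?, ?, ?)",
             (ALICE, BOB, CAROL, CONFERENCE))

# Invalid: Alice cannot introduce Bob to herself; the single-row CHECK refuses.
try:
    conn.execute("INSERT INTO introductions VALUES (43, ?, ?, ?, ?)",
                 (ALICE, BOB, ALICE, CONFERENCE))
    self_intro_refused = False
except sqlite3.IntegrityError:
    self_intro_refused = True
```

Decompose the same event into (Alice, introduced, Bob) and (Alice, introduced, Carol) triples and the CHECK has nothing to attach to: no row contains both the introducer and an introducee, so the constraint silently disappears.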


The Fashion Catalog Revisited

Return to the running example with the full machinery in view.

The catalog schema:

items(id INT PRIMARY KEY,
      category VARCHAR(50),
      subcategory VARCHAR(50),
      price DECIMAL(10,2) CHECK (price > 0),
      color VARCHAR(30),
      silhouette VARCHAR(30) CHECK (silhouette IN ('fitted','relaxed','A-line','empire','shift')),
      fabric VARCHAR(50),
      brand VARCHAR(100))

This schema was designed in 2022 by engineers who consulted with merchandisers about "the attributes that matter." They anticipated color, size, price, silhouette, fabric, brand. They did not anticipate:

  • Puffiness — a style dimension that cuts across silhouette
  • Flowy — a movement quality that depends on fabric and cut and weight
  • Cottagecore — an aesthetic that combines multiple attributes in ways the schema cannot express
  • Coastal grandmother — a trend that emerged eighteen months after schema design
  • Quiet luxury — a positioning that depends on brand perception, not intrinsic attributes

The schema is correct for what it declares, but incomplete for what users need.

The gap between designed vocabulary and needed vocabulary grows over time. Users invent terms faster than schemas can evolve. Each new term that matters requires the governance layers: semantic definition, constraint specification, instrumentation, backfill, API updates, downstream coordination, monitoring.

The catalog can use escape hatches. It can add a tags table for freeform attributes. It can add a metadata JSON column for semi-structured data. It can build a search layer that does full-text matching against descriptions and reviews.

Each escape surrenders guarantees. Tags are untyped; "puffy" and "poofy" coexist without declared equivalence. JSON metadata has no schema enforcement; "puffiness: high" in one record and "puffiness: very" in another cannot be compared. Full-text search matches strings, not predicates; a review saying "not puffy" matches the query "puffy."

The catalog faces a choice that is structural, not tactical: maintain the relational contract and accept vocabulary rigidity, or escape to flexibility and lose the guarantees that make certified queries possible.


Consequence

The schema empire solved the reference problem. Every record has a key. Every pointer must land. Constraints are enforced at the data layer. These guarantees made relational databases the foundation of critical infrastructure for half a century.

But vocabulary is frozen. The schema declares what questions can be asked; questions outside the schema are either ill-formed (syntax errors) or escape to untyped territory (JSON, tags, free-text). To ask a new question as a certified predicate—type-checked, constrained, optimized, guaranteed—you must change the language. And changing the language is expensive.

Predicate novelty is not data; it is a language change.

To the string empire's commitment gap and the retrieval layer's provenance gap, the schema empire adds a vocabulary gap: constraints are enforced, but the vocabulary those constraints reference cannot grow on demand.

None of these gaps is a bug. Each architecture optimizes for a real need. The problem is that the needs are not exclusive. Real systems require commitment discipline and provenance and vocabulary flexibility. The question is whether these requirements can be satisfied together—and if so, by what kind of object.

The relational empire has one more secret: the lie of NULL (Date 2009, ch. 4–5). A cell that is empty could mean "unknown," "not applicable," "intentionally omitted," or "false." The schema does not distinguish. SQL's three-valued logic papers over the ambiguity, but the ambiguity remains. When we try to merge data from different sources—some operating under closed-world assumptions, some under open-world—this ambiguity becomes fatal.

We turn now to closed worlds, open worlds, and the epistemic status that gluing requires.


Litmus Cases

This chapter introduced four tests that demonstrate vocabulary rigidity from orthogonal angles.

  • T6, Predicate Invention: schema cannot mint certified predicates on demand. Resolved in Part VI, Ch 27.
  • T7, Contextual Equivalence: equivalence is global and governed; no scoped witness. Resolved in Part VI, Ch 28.
  • T9, Schema Evolution: concept drift requires breaking changes or migration. Resolved in Part V, Ch 24.
  • T10, Higher-Arity Events: representational default loses event unity. Resolved in Part VI, Ch 29.

T6 (Predicate Invention): "Show me puffy dresses." The user's concept is real, but the schema has no predicate for puffiness. Adding it requires governance layers that move slower than user vocabulary evolves.

T7 (Contextual Equivalence): "NYC" and "New York City"—same or different? The schema can enforce global equivalence through canonical-ID tables. It cannot express scoped equivalence that depends on context.

T9 (Schema Evolution): "Employees" and "contractors" merge into "workers." The vocabulary of the business has changed; the schema must follow. The change is breaking; migration semantics are required.

T10 (Higher-Arity Events): "Alice introduced Bob to Carol." The event has four participants and structural constraints. The relational model can represent this, but the default path decomposes into binary edges, losing event unity and role constraints.

Together with the cases from the preceding chapters, these form a growing test suite. Any proposed alternative must address all ten or explain why resolution is impossible.