Commitment Sets: Binding claims to accountable identities
Sell your cleverness and buy bewilderment. Cleverness is mere opinion, bewilderment is intuition.
This chapter formalizes the notion of a commitment set and establishes consistency as the minimal coherence property any assertion-producing system must honor. We define what it means for a system to track its commitments, show how string-based architectures violate commitment discipline in reproducible ways, and introduce the running example (a fashion catalog) that will carry through the rest of the book. The reader who wants the historical argument for why coherence is the scarce resource -- from Mesopotamian seals to LLM hallucination -- should read Vol I, Chapter 4 (The Empire of Strings).
Coherence Is the Scarce Resource
Two statements can each sound reasonable and still be unable to coexist.
That can sound like wordplay until you notice how often modern systems assume closure under addition: that if a sentence is well-formed and comes from a trusted channel, it can be added to what is already believed without charge. But belief does not work that way. Assertions interact. They collide. They entail. They constrain.
Truth is a local property: a claim can be correct or incorrect in isolation. Coherence is global: it is the property that your claims can all be true together (Quine 1951, p. 41). Coherence is what lets a court reconcile testimonies, a physicist reconcile measurements, a compiler reconcile types, a society reconcile laws that evolved under different pressures. When coherence fails, the problem is not a single falsehood. It is that the world you have described cannot exist.
In The Proofs, the minimal mathematical shadow of coherence is consistency: whether the commitments you have made can all be true together under a declared logic. The two words are not synonyms. Coherence carries connotations of integration and intelligibility that consistency does not. But consistency is what we can check, and checking is what machines must do. What follows is a search for those objects.
A new kind of machine now produces language with ease. It can place plausible sentences almost anywhere you point it. The question is no longer whether it can speak, but whether it can bind what it says: whether its words come attached to the obligations that make a body of statements inhabitable.
We live in an age of two empires.
The first empire treats the world as strings. Text, images, audio, code: all are sequences of tokens, processed by architectures that learn to predict what comes next. This empire has conquered territory that seemed, a decade ago, permanently beyond reach: fluent conversation, photorealistic generation, code that compiles and runs.
The second empire treats the world as tables. Data lives in schemas. Queries return exact answers. Constraints enforce business rules. This empire is older, less glamorous, and quietly essential: every financial transaction, every airline reservation, every medical record passes through relational infrastructure that guarantees properties the first empire cannot.
These empires coexist uneasily. Each has conquered its territory; each fails at the border. String-based systems can generate a plausible answer to any question but cannot, in their native interface, guarantee the answer is consistent with what they said before. Schema-based systems can enforce perfect consistency within their vocabulary but cannot acquire a new concept without a human rewriting the schema.
A caveat. Single-source systems with frozen schemas do not face the problem The Proofs addresses. A database that answers queries from one authoritative source, never extends its vocabulary, and never composes with external systems can achieve strong internal consistency without witnesses, scopes, or transport certificates. The machinery here is not for that case. It is for the seam: the boundary where one system's output becomes another system's input, where vocabularies must reconcile, where equivalences must be declared and honored. Even "single-source" systems acquire seams across time (schema versions), organizations (team boundaries), and meaning (predicate drift). The Third Mode is for when those seams matter.
Local truth is cheap; global coherence is expensive. The first empire makes local truth scalable. The question is whether it can guarantee coherence without importing a second kind of object: an obligation.
The String Empire
A string is a sequence of symbols. The transformer architecture (Vaswani et al. 2017), which dominates contemporary machine learning, processes strings by learning conditional distributions over tokens given context: which symbols tend to appear near which others, in what patterns, under what conditions.
The power of this representation is universality. Any discrete signal can be encoded as a sequence: text as words, images as patches, audio as frames, proteins as amino acids. By committing to sequences as the fundamental representation, the transformer can process anything that admits sequential encoding using a single computational primitive.
This commitment was not inevitable. Earlier architectures encoded domain knowledge directly: parse trees for language, convolutional filters for images, recurrence relations for time series. These approaches worked, but they fragmented the field. A vision specialist could not easily become a language specialist. Insights did not transfer. Hardware optimizations for one domain did not apply to another.
The transformer dissolved these boundaries. The attention mechanism, the operation at its core, provides content-addressable access over the full context. Unlike recurrence, which compresses context into a fixed-size hidden state, attention preserves the context as a set of vectors that any position can query.
Hardware amplified the advantage. Graphics processors, designed for parallel matrix operations, turned out to be ideal for the attention computation (Hooker 2021). Effective compute per dollar rose, making scale a rational strategy (Kaplan et al. 2020).
Rich Sutton's "Bitter Lesson" found dramatic confirmation (Sutton 2019): "The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin." A single architecture, trained on sufficient data, approached tasks that had resisted specialized engineering for decades.
The quality of these outputs is real. Prediction at scale captures genuine regularities in how humans use language, how images cohere, how proteins fold. A model that predicts well enough can complete sentences, answer questions, generate images that satisfy aesthetic criteria.
But a next-token model can be an excellent proposal engine without being a certification engine. Proposals are plausible continuations: outputs that fit the statistical patterns of the training distribution. Certification is different. It requires showing that a claim satisfies explicit constraints: that the referent exists, that the relation holds, that the logical structure is sound.
These properties can coincide. They often do, for well-documented topics and common queries. When they diverge, the string empire, in its native interface, offers no mechanism to detect the divergence. The model represents distributions over sequences. It does not, absent an external verifier, represent logical constraints over propositions. Consistency (the property that assertions do not jointly entail a contradiction) is not something you can read off a token stream without introducing a semantics that turns strings into propositions and a procedure that checks entailment. The architecture does not provide that semantics or that procedure as primitive operations.
This is not a bug in a particular model. It is a consequence of what the primitive outputs are: sequences proposed by a distribution, not claims accompanied by witnesses. And it is visible in specific, reproducible failures.
What Plausibility Cannot Guarantee
Some questions deserve the dignity of an empty answer.
List all integers between 1 and 100 that are both prime and divisible by 4.
The answer is the empty set. No such integers exist. For any integer n: if n is divisible by 4, then n is even, and therefore has a divisor other than 1 and itself (namely, 2). Any integer greater than 2 with a nontrivial divisor is composite, not prime. The constraints are mutually exclusive.
A system that reasons about constraints would recognize the mutual exclusion. It would return an empty list, perhaps with an explanation: "The constraints are unsatisfiable. No integer greater than 2 can be both prime and divisible by 4."
A system that proposes plausible continuations faces a different situation. The phrase "prime numbers between 1 and 100" activates certain patterns: 2, 3, 5, 7, 11, and so on. The phrase "divisible by 4" activates others: 4, 8, 12, 16. Both phrases are individually meaningful. Without a constraint checker, the system has no representation of their interaction. Asked to produce a list, it might produce an empty list, or it might produce candidates: numbers that match one pattern or the other, assembled from contexts where both phrases appeared without canceling.
This is the contradiction test. A system fails when it produces candidates for a logically empty query, when it generates answers to a question that admits none. The failure reveals the absence of constraint enforcement: the system has no mechanism to recognize that the query's conditions are mutually exclusive.
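When the constraints are explicit, the check the contradiction test demands is a few lines of code. A minimal sketch in Python (trial-division primality, chosen for transparency rather than speed):

```python
def is_prime(n: int) -> bool:
    """Trial-division primality check: n is prime iff n >= 2 and
    no integer in [2, sqrt(n)] divides it."""
    if n < 2:
        return False
    return all(n % d != 0 for d in range(2, int(n ** 0.5) + 1))

# Enforce both constraints jointly instead of pattern-matching each phrase.
candidates = [n for n in range(1, 101) if is_prime(n) and n % 4 == 0]
print(candidates)  # → [] : the constraints are mutually exclusive
```

The joint check returns the empty list by construction; no plausible-looking candidate can slip through, because each candidate must satisfy both predicates at once.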
The same structural gap appears in other forms.
How many times does the letter 'r' appear in the word 'strawberry'?
The answer is three. A system that executes an algorithm, iterating through characters and incrementing a counter, will return three. A system that proposes based on associations may return two, or four, or "several." The training data contains many contexts where "strawberry" appears, and many contexts discussing letter frequencies, but the intersection of these contexts may not consistently encode the specific count. Counting is discrete and exact. Next-token prediction is statistical: it represents uncertainty over strings, not an executing counting procedure. Without an explicit algorithm (or a checked derivation), exactness is brittle—especially off-distribution.
This is the exactness test. A system fails when it approximates a discrete quantity that admits no approximation. The failure reveals the absence of algorithmic execution: the system has correlations where it needs computations.
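The exactness test is equally trivial once counting is an executed algorithm rather than a recalled association. A sketch (Python's built-in `str.count` would do the same; the loop makes the procedure explicit):

```python
def count_char(word: str, ch: str) -> int:
    """Count occurrences of ch in word by explicit iteration."""
    count = 0
    for c in word:
        if c == ch:
            count += 1
    return count

print(count_char("strawberry", "r"))  # → 3
```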
If "jump" means JUMP and "walk" means WALK, what does "jump around right twice" mean?
A compositional system combines primitives according to rules. If "around right" means "turn right four times, executing the action after each turn" and "twice" means "do the whole thing two times," then the meaning of the combination follows deterministically:
TURN_RIGHT JUMP TURN_RIGHT JUMP TURN_RIGHT JUMP TURN_RIGHT JUMP
TURN_RIGHT JUMP TURN_RIGHT JUMP TURN_RIGHT JUMP TURN_RIGHT JUMP
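The two composition rules just quoted suffice to mechanize this example. A toy interpreter for this fragment (illustrative only; the actual SCAN grammar has more primitives and modifiers than the two handled here):

```python
# Simplified fragment of a SCAN-like grammar:
#   "<action> around right" -> (TURN_RIGHT <ACTION>) repeated 4 times
#   "<phrase> twice"        -> <phrase> repeated 2 times
PRIMITIVES = {"jump": ["JUMP"], "walk": ["WALK"]}

def interpret(command: str) -> list[str]:
    tokens = command.split()
    if tokens[-1] == "twice":
        # "twice" scopes over the whole remaining phrase.
        return interpret(" ".join(tokens[:-1])) * 2
    if tokens[-2:] == ["around", "right"]:
        inner = interpret(" ".join(tokens[:-2]))
        return (["TURN_RIGHT"] + inner) * 4
    return PRIMITIVES[tokens[0]]

print(" ".join(interpret("jump around right twice")))
# → TURN_RIGHT JUMP repeated eight times
```

Because the meaning of the whole is computed from the meanings of the parts, the interpreter handles any combination of the known primitives, including ones it has never seen; that is exactly the generalization the benchmark tests.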
The SCAN benchmark (Lake and Baroni 2018) tests exactly this capacity. Models are trained on simple commands and their meanings. They are tested on novel combinations. Humans generalize compositionally: the meaning of the whole derives from the meanings of the parts and the rules of combination. Standard sequence-to-sequence models fail catastrophically on this task. On the most adversarial splits (e.g., generalizing "jump" inside longer compositions), baseline recurrent models achieve near-zero accuracy (0.08–1.2%); vanilla Transformers typically land in the low single digits unless given structural guidance. They interpolate, producing outputs that resemble training examples, rather than extrapolate according to structural rules.
This is the compositionality test. A system fails when it cannot combine known primitives according to systematic rules. The failure reveals the absence of explicit grammar: the system has learned input-output pairs but not the generative structure that produces them. This gap does not close with scale (Press et al. 2023): the "compositionality gap"—the fraction of questions where a model answers sub-questions correctly but fails the composite—remains approximately 40% across models from 1 billion to 175 billion parameters.
These three tests expose a common structure. In each case, the query has a determinate answer governed by explicit constraints: logical rules, algorithmic procedures, compositional grammar. In each case, a proposal-based system can produce a different answer because the structure is not represented. Only patterns remain, and patterns may or may not respect the structure.
Commitments and Their Consequences
To diagnose what goes wrong, we need vocabulary for what should go right.
When a system answers a question, it does more than emit tokens. It makes a commitment. The user, reading the response, understands the system to have asserted something. If the system later asserts something incompatible, the user's trust degrades—even if each assertion, taken in isolation, seemed reasonable.
The commitments accumulate. Ask a system about a historical event; it commits to a date. Ask about a person; it commits to biographical facts. Ask about the relationship between the event and the person; the system must navigate a space constrained by its prior assertions. If it contradicts itself—if it asserts in one response what it denied in another—the conversation has crossed from plausibility into incoherence.
Let Γ denote the set of all propositions a system has asserted in a given context: its commitment set. At any moment, Γ is either consistent or inconsistent. Consistency means the propositions in Γ do not jointly entail a contradiction: formally, Γ ⊬ ⊥, where ⊬ reads "does not prove" and ⊥ denotes absurdity. Inconsistency means some subset of Γ, taken together, implies a contradiction.
Let L be a formal language and ⊢ a consequence relation on L (classical, intuitionistic, or paraconsistent; the choice is declared, not assumed — see Chapter 4). A commitment set Γ is a finite set of sentences in L. Define:
- Consistency: Γ is consistent relative to ⊢ iff Γ ⊬ ⊥. That is, no derivation from Γ under ⊢ yields absurdity.
- Answering: To answer a query is to extend Γ to Γ ∪ {p} for some proposition p.
- Commitment Discipline: A system satisfies commitment discipline relative to ⊢ if it never produces an extension Γ ∪ {p} such that Γ ∪ {p} ⊢ ⊥.
The parameterization by ⊢ is essential: consistency is logic-relative. A set that is inconsistent under classical logic (where Γ ⊢ ⊥ implies Γ ⊢ φ for every φ) may be consistent under a paraconsistent logic that tolerates contradictions without explosion. The commitment discipline obligation holds regardless of the logic chosen; what changes is which extensions trigger the obligation.
The consequence relation ⊢ is a parameter, not a fixed choice. In classical logic, inconsistency is absorbing: from Γ ⊢ ⊥, anything follows (explosion / ex falso quodlibet). In paraconsistent logics (Priest 2002), this need not hold: {p, ¬p} ⊬ q for arbitrary q. In intuitionistic logic, the law of excluded middle fails, so some classical inconsistencies do not arise. The choice of logic will become a first-class object in Chapter 4 (A4) and receive its full indexed treatment in Chapter 13 (A15). The obligation — track commitments and surface conflicts — remains regardless.
This definition is minimal. It does not specify how consistency should be checked, whether by theorem prover, constraint solver, or some other mechanism. It specifies only the obligation: a disciplined system tracks its commitments and refuses to make commitments that would render the set inconsistent. The AGM theory of belief revision (Alchourrón, Gärdenfors, and Makinson 1985) formalizes this as a rationality constraint: Postulate K*5 requires that "K∗p is consistent if p is consistent"—revision preserves consistency unless the new belief is itself contradictory.
The definition earns its place by what it diagnoses. When we observe a system asserting φ in one response and ¬φ in another, we can now name what happened: the system violated commitment discipline. It extended its commitment set without checking whether the extension preserved consistency.
Visualize the space of possible commitment sets as a lattice. Order commitment sets by inclusion; moving upward means adding propositions. At the bottom is the empty set, with no commitments and no contradictions. As propositions are added, sets move upward through the lattice. Most paths remain in the consistent region. But some paths cross a boundary into the inconsistent region, where sets entail contradictions. Once a set is inconsistent (in classical logic), every extension remains inconsistent. The inconsistent region is absorbing.
A system with commitment discipline navigates this lattice with care. Before adding a proposition p, it checks whether Γ ∪ {p} remains consistent. If not, it refuses the extension or flags the conflict for resolution.
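That check can be sketched in a few lines. The sketch below assumes the weakest useful consequence relation: it detects only a proposition asserted alongside its direct negation, with no entailment at all. A real system would substitute a theorem prover or SMT solver for the negate-and-lookup test.

```python
class CommitmentSet:
    """Minimal commitment discipline over propositional literals.

    Consistency here is the weakest useful check: the set is flagged
    inconsistent only if it contains both p and "not p" verbatim.
    Illustrative only; no entailment is computed.
    """

    def __init__(self) -> None:
        self.commitments: set[str] = set()

    @staticmethod
    def negate(p: str) -> str:
        return p[4:] if p.startswith("not ") else "not " + p

    def assert_(self, p: str) -> bool:
        """Extend the set only if consistency is preserved."""
        if self.negate(p) in self.commitments:
            return False  # refuse: the extension would be inconsistent
        self.commitments.add(p)
        return True

gamma = CommitmentSet()
assert gamma.assert_("Plex is a zoth")
assert not gamma.assert_("not Plex is a zoth")  # conflict surfaced, not absorbed
```

Even this degenerate checker satisfies the obligation's shape: every extension is tested before it is accepted, and a refused extension is a visible event rather than a silent contradiction.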
A string-based system, in its native interface, navigates the same lattice without a map. It generates propositions based on plausibility, not consistency. Each generation is a jump to a new position. Sometimes the jump lands in the consistent region. Sometimes it lands in the inconsistent region. The system does not know which, because it does not represent the lattice. It represents only a probability distribution over next tokens.
The Trap: A Demonstration
The commitment problem is easier to see than to describe. Here is a minimal system that exhibits it.
An oracle must answer yes-or-no questions about a stipulated domain, a tiny world with explicit rules. The domain contains four rules:
| ID | Rule |
|---|---|
| R1 | All glints are flerms |
| R2 | No flerm is a zoth |
| R3 | Plex is a glint |
| R4 | Plex is a zoth |
The set of rules is inconsistent. From R1 and R3, we derive that Plex is a flerm. From R2, no flerm is a zoth. But R4 says Plex is a zoth. The set {R1, R2, R3, R4} entails a contradiction.
The trap is this: each rule, encountered in isolation, sounds like a plausible axiom about some unfamiliar domain. "All glints are flerms" could be a category relation. "Plex is a glint" could be an instance. A proposal-based system, asked about each in turn, might affirm all four without noticing that the combination is impossible.
User: In the stipulated domain, do we accept R1: all glints are flerms?
Oracle: Yes.
Commits to R1. Commitment set: {R1}.
User: Do we accept R2: no flerm is a zoth?
Oracle: Yes.
Commits to R2. Commitment set: {R1, R2}.
User: Do we accept R3: Plex is a glint?
Oracle: Yes.
Commits to R3. Commitment set: {R1, R2, R3}.
At this point, the derivation is already determined:
- From R1 and R3: Plex is a flerm.
- From R2: no flerm is a zoth, hence Plex is not a zoth.
User: Do we accept R4: Plex is a zoth?
The oracle now faces a choice that will determine whether it maintains commitment discipline.
If it answers "yes," it commits to R4. The set {R1, R2, R3, R4} is inconsistent. The oracle will have contradicted itself, not on any single fact, but across the logical closure of its commitments.
A system with commitment discipline would detect this. At the moment of considering R4, it would recognize that R4 contradicts the derived proposition "Plex is not a zoth." It would refuse, or explain the conflict, or ask for clarification about which prior commitment to retract.
A proposal-based system has no such mechanism. Each question is processed by pattern-matching against training contexts. "Do we accept R4?" activates whatever associations exist for made-up words in categorical contexts. The system has no representation of the derivation chain: R1 and R3 yield "Plex is a flerm," which with R2 yields "Plex is not a zoth." It cannot see that "yes" would cross into the inconsistent region.
The failure mode is purely structural. The domain is stipulative; there is no "world knowledge" to get right or wrong. The only question is whether the system tracks its commitments and their logical consequences. A system without commitment discipline will walk into the trap.
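The derivation the trap depends on is small enough to mechanize. A sketch, with R1 and R2 hard-coded as forward-chaining rules (a toy for this one domain; a real system would use a general inference engine):

```python
# Facts are (individual, predicate, truth-value) triples.
facts = {("Plex", "glint", True),   # R3
         ("Plex", "zoth", True)}    # R4

def closure(facts: set) -> set:
    """Apply R1 and R2 exhaustively until no new facts appear."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for ind, pred, val in list(facts):
            derived = None
            if pred == "glint" and val:      # R1: all glints are flerms
                derived = (ind, "flerm", True)
            elif pred == "flerm" and val:    # R2: no flerm is a zoth
                derived = (ind, "zoth", False)
            if derived and derived not in facts:
                facts.add(derived)
                changed = True
    return facts

def inconsistent(facts: set) -> bool:
    """A fact and its denial both present: the set entails a contradiction."""
    return any((i, p, not v) in facts for i, p, v in facts)

print(inconsistent(closure(facts)))  # → True: Plex both is and is not a zoth
```

Run before accepting R4, the same closure would show that "Plex is not a zoth" is already entailed, so a disciplined oracle refuses R4 or asks which prior rule to retract.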
The Fashion Catalog
Theory must meet application. The running example throughout The Proofs is a fashion e-commerce catalog: 50,000 items, each with structured attributes and unstructured content.
The structured data lives in a database schema:
items(id, category, subcategory, price, color, silhouette, fabric, brand)
The silhouette attribute is categorical: fitted, relaxed, structured, A-line, empire, shift. These are terms with defined meanings in fashion design, chosen by merchandisers who examined each garment.
The unstructured data lives in text: product descriptions written by copywriters, user reviews submitted by customers. This text is searchable. A retrieval system can find items whose descriptions or reviews contain specified words.
Consider a user's query:
Show me dresses described as "flowy" in reviews.
The system searches the review corpus for mentions of "flowy" or synonyms. It finds 340 items where at least one review contains the word. It returns these items, ranked by relevance.
The user browses. Most results seem sensible: soft fabrics, relaxed silhouettes, the kind of dress that moves when you walk. But one result is incongruous. The product page shows a dress with architectural lines, boning visible at the seams. The structured data confirms: silhouette = 'structured'. This dress holds its shape; it does not flow.
The user clicks through to the reviews. The third review reads: "I was hoping this would be flowy, but it's actually quite structured. Beautiful dress, just not what I expected."
The retrieval system matched "flowy." It did not parse the sentence to recognize that the match was a negation, a user saying the dress is not flowy. The system has no representation of assertion polarity. It has only string matching. String retrieval is indifferent to stance; it hears the word and ignores the force.
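The gap is visible in code. Below, `naive_match` is the string-empire behavior, and `stance_aware_match` is a deliberately crude polarity filter. The cue list and the clause-splitting heuristic are illustrative assumptions, not a real solution; actual polarity detection requires parsing.

```python
import re

# Cues that flip or suspend assertion force within a clause (crude,
# illustrative list: "hoping"/"wish" mark unfulfilled expectations).
NON_ASSERTIVE = {"not", "never", "hardly", "hoping", "wish", "wished", "expected"}

def naive_match(review: str, term: str) -> bool:
    """String-empire retrieval: bare substring presence."""
    return term in review.lower()

def stance_aware_match(review: str, term: str) -> bool:
    """Accept a match only if some clause asserts the term with no
    non-assertive cue in the same clause."""
    for clause in re.split(r"[.;!?]", review.lower()):
        tokens = set(re.findall(r"[a-z']+", clause))
        if term in tokens and not tokens & NON_ASSERTIVE:
            return True
    return False

review = ("I was hoping this would be flowy, but it's actually quite "
          "structured. Beautiful dress, just not what I expected.")
print(naive_match(review, "flowy"))         # → True: dress ranked for "flowy"
print(stance_aware_match(review, "flowy"))  # → False: the only mention is hedged
```

The crude filter rejects this review because its only mention of "flowy" sits in a clause with "hoping"; a review that actually asserts the property still matches. The point is not that the heuristic suffices, but that stance is a property string matching does not even represent.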
By including the dress in results for "flowy," the system pragmatically committed to a claim: "this dress satisfies your query for flowy." The evidence (the review) contradicts that commitment. The system's implicit commitment set is inconsistent with its own sources.
The failure has a deeper layer. "Flowy" is not in the schema. The structured vocabulary tracks silhouette, but "flowy" is a user term that does not map cleanly onto the categories. A dress can be relaxed without being flowy (a boxy shift dress). A dress can be A-line and flowy (soft chiffon) or not flowy (heavy brocade). The user is asking for a predicate that does not exist in the system's vocabulary.
The string empire offers a workaround: search for the word in text. The workaround fails because string matching is not semantic commitment. The system found a string; it did not certify that the string expressed the property the user wanted.
This catalog will return throughout the book. The same items, the same schema, the same retrieval layer will surface new failures as we examine new requirements, and a system built on different foundations will eventually resolve them. The solution is not to eliminate retrieval but to embed it in a framework that tracks commitments, surfaces conflicts, and knows when a query asks for a predicate that does not yet exist.
Consequence
The string empire's power is coverage. Train on enough data, and the model produces plausible continuations for nearly any prompt. This power drove the transformer's success: one architecture, applied everywhere, achieving results that had resisted specialized effort for decades.
The string empire's limitation is the same power's shadow. Coverage without constraint is plausibility without truth. A system can assert φ, then assert ¬φ, without detecting the conflict. Empirical measurement confirms this (Elazar et al. 2021): when tested on semantically equivalent paraphrased questions, even the best pretrained models achieve only 61% consistency—meaning they contradict themselves approximately 39% of the time. Conflict detection requires tracking commitments, and commitments are not sequences. The model, in its native interface, has no commitment lattice—only a probability distribution over tokens.
Anchor A1 gives us vocabulary for this observation. A system satisfies commitment discipline if it tracks its assertions and refuses extensions that produce inconsistency. String-based systems do not satisfy commitment discipline natively. They can be augmented with consistency checkers such as SMT solvers (de Moura and Bjørner 2008)—external scaffolding that parses outputs into propositions and verifies constraints. But such augmentation imports objects the native architecture does not provide: typed witnesses, constraint specifications, verification procedures.
The industry's response to this limitation is grounding. If a system retrieved its claims from external sources (databases, documents, knowledge bases), surely consistency would be guaranteed by the sources themselves. Instead of proposing truth, the system would look it up.
Retrieval-augmented generation (Lewis et al. 2020), tool use, agent architectures: these represent attempts to ground the string empire in external sources of truth. The commitment problem does not disappear under these augmentations. It relocates. The system must now track not only what it has asserted but where each assertion came from, whether the sources are themselves consistent, and how to reconcile disagreements among them.
If plausibility cannot secure consistency, perhaps provenance can. We turn now to retrieval as an operating system, and discover that externalizing truth introduces obligations the string empire is not equipped to honor.
Litmus Cases
This chapter introduced three tests that will recur throughout the book. Each isolates a capability the string empire, in its native interface, cannot provide.
| Case | Name | Failure Mode | Resolved in |
|---|---|---|---|
| T1 | Contradiction | Produces candidates for logically empty queries | Part VI |
| T3 | Compositionality | Interpolates rather than extrapolates on novel combinations | Part IV |
| T4 | Exactness | Approximates discrete quantities | Part IV |
Two additional cases were foreshadowed:
- T2 (Reference): Two expressions refer to the same entity, but the system treats them as distinct—or vice versa. Resolution requires witnessed equivalence, developed in Part II.
- T5 (Negation/Absence): The system conflates "known to be false" with "not known to be true." Resolution requires explicit epistemic status, developed in Chapter 4.
Together with five cases introduced in subsequent chapters, these form a test suite: any proposed alternative must address all ten or explain why resolution is impossible.