Proposal and Certification: The Two Operations of the Third Mode
Sameness occurs in substance, while likeness occurs in quality.
This chapter formalizes the distinction between similarity (statistical proximity) and equivalence (witnessed, scoped, operation-specific identity), defining two anchors: A19 (Proposal Operator), which specifies the fast approximate function that produces ranked candidate hypotheses from embeddings, retrievers, or pattern matchers, and A19b (Certification Contract), which specifies the exact operator that converts a candidate claim into either a typed witness with scope and transport rights or a structured failure with remediation guidance. The chapter develops the Propose-Certify-Glue pipeline architecture, shows why embeddings are the correct tool for narrowing search but the wrong tool for binding commitments, and resolves touchstone T2 (Morning Star / Evening Star) by exhibiting how partial, scoped equivalence witnesses make explicit what embedding similarity hides. The reader seeking the philosophical framing of why sameness is always relative to purpose should consult Vol I, Chapter 7 (The Witness Protocol).
The Embedding Empire's Claim
The embedding empire has a powerful claim: similarity captures learned semantic proximity well enough to drive retrieval. Two items with high cosine similarity are "semantically close." This claim has driven a decade of progress in retrieval, recommendation, and search. It has made systems that feel intelligent.
But there is a gap between "semantically close" and "the same for these purposes." The gap is not about accuracy. It is about what operations the similarity licenses.
Similarity ranks. Equivalence binds.
When a system says two items are similar, it is making a suggestion. When a system says two items are equivalent, it is accepting a liability. The suggestion can be wrong at low cost; the system returns irrelevant results, the user scrolls past. The liability cannot be wrong without consequence; the system merged records that should have stayed separate, the inventory is corrupted, the customer received the wrong product.
The seductive error is treating similarity as equivalence: inferring that if embed(A) ≈ embed(B), then A and B are interchangeable. This works often enough to be dangerous.
T2: Morning Star, Evening Star
The classical problem comes from Frege (Gottlob Frege, "Über Sinn und Bedeutung," Zeitschrift für Philosophie und philosophische Kritik 100 (1892): 25–50). "Morning Star" and "Evening Star" both refer to Venus. Are they the same?
The embedding answer: probably yes. The embeddings are close. Both terms co-occur with astronomical contexts, with Venus, with observations at dawn and dusk. Cosine similarity is high. A retrieval system asked "what is similar to Morning Star?" will return Evening Star near the top of the list.
This is useful. If you are building a search engine for astronomical terms, high similarity between Morning Star and Evening Star is correct behavior. The system is doing its job.
But the similarity does not tell us:
- In what sense are they the same? Identity of reference? Identity of meaning? Identity of use?
- In what scope? Colloquial usage? Scientific taxonomy? Navigation system? Poetry anthology?
- What operations does this sameness license? Can I substitute one for the other in a star chart? In a database of astronomical objects? In a poem about longing?
A witness answers these questions.
EquivalenceWitness(morning_star, evening_star) = {
witness_class: ReferentialIdentity,
evidence: [IAU_designation(Venus), ephemeris_orbital_parameters],
scope: astronomical_objects,
transport_rights: {
orbital_position: Full,
apparent_magnitude: Full,
cultural_association: None,
poetic_meaning: None
}
}
The witness makes explicit what the embedding hides: the equivalence is partial, scoped, and operation-specific. The point is not that the names are different; the point is that equivalence is typed and partial: it can hold for orbital computations while failing for poetic substitution.
Per A16, an equivalence witness is not a binary flag; it is a license specifying which properties may be transported across the equivalence and which may not. Morning Star and Evening Star share orbital position; they do not share cultural resonance.
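The witness record above can be rendered as a typed value. A minimal Python sketch, where the field names and the Transport enum are illustrative conventions, not a fixed API from the text:

```python
from dataclasses import dataclass
from enum import Enum

class Transport(Enum):
    """Transport right for a single property across an equivalence."""
    FULL = "full"
    NONE = "none"

@dataclass(frozen=True)
class EquivalenceWitness:
    witness_class: str        # e.g. "ReferentialIdentity"
    evidence: tuple           # supporting artifacts
    scope: str                # where the equivalence is valid
    transport_rights: dict    # property name -> Transport

    def may_transport(self, prop: str) -> bool:
        # A property crosses the equivalence only under an explicit Full right;
        # anything unlisted defaults to None (denied).
        return self.transport_rights.get(prop, Transport.NONE) is Transport.FULL

w = EquivalenceWitness(
    witness_class="ReferentialIdentity",
    evidence=("IAU_designation(Venus)", "ephemeris_orbital_parameters"),
    scope="astronomical_objects",
    transport_rights={
        "orbital_position": Transport.FULL,
        "apparent_magnitude": Transport.FULL,
        "cultural_association": Transport.NONE,
    },
)
```

The default-deny lookup is the point: the witness is a license, and silence about a property is a refusal, not a permission.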
What Similarity Actually Measures
Similarity measures learned distributional proximity induced by training objectives and data. It is strongly correlated with co-occurrence, but not identical to identity or substitutability.
What high similarity means:
- A and B appear in similar contexts in the training corpus
- A and B are retrieved together for similar queries
- A and B have similar distributional properties in the learned representation space
What high similarity does not mean:
- A and B refer to the same thing
- A and B can be substituted in any context
- A and B have the same properties
- A and B are equivalent for any particular operation
The failure mode: Embeddings conflate or separate without justification. Two items might have high similarity because they share distributional properties in training data, not because they are equivalent. Two items might have low similarity because the training data did not include their connection.
This is not a bug in embeddings. It is what embeddings are. They measure learned distributional proximity, not semantic identity. They are optimized to retrieve relevant results, not to certify equivalence. Using them for certification is a category error.
Consider two product listings:
- Listing A: "Vintage silk evening gown, emerald green, size 6"
- Listing B: "Green silk formal dress, retro style, size S"
Embedding similarity: High. Both are green silk formal dresses with vintage/retro styling.
Are they equivalent?
- Same product? Unknown. Could be different items.
- Same size? Unknown. Size 6 and Size S might not match.
- Same condition? Unknown. "Vintage" might mean used; "retro style" might mean new.
- Can I merge their inventory counts? Absolutely not.
The similarity is real. The equivalence is not established. A system that merges these listings because their embeddings are close will corrupt the inventory.
The Proposal Operator
Embeddings are excellent at generating candidates. The Third Mode uses them for exactly that purpose.
A proposal operator P is a function:
P : PromptOrObject → List[(Hypothesis, Score)]
where:
- PromptOrObject is a query (structured or natural language) or an object to find neighbors of
- Hypothesis is a CandidateItem or CandidateClaim (including equivalence hypotheses)
- Score is a real number indicating confidence or relevance
Implementation modes:
- Embedding similarity: score(q, x) = cos(embed(q), embed(x))
- Retrieval: BM25, TF-IDF, learned retrievers
- Statistical pattern matching: n-gram overlap, fuzzy matching
- Learned rankers: neural rerankers, cross-encoders
Key property: P produces hypotheses, not certified truths.
The proposal operator is valuable precisely because it is fast and approximate. You do not want to run full certification on every item in a million-item catalog. You want to narrow down to a candidate set, then certify.
What P is good at:
- Finding plausible candidates quickly
- Ranking by relevance
- Handling fuzzy, underspecified queries
- Scaling to large collections
What P is not good at:
- Distinguishing "similar" from "equivalent"
- Providing transport rights
- Scoping the equivalence
- Producing auditable justification
P is a filter, not a judge. It narrows the search space. It does not make commitments.
Consider the scale. A catalog might have a million items. Running full certification on all million would take hours. Running P takes milliseconds. P reduces a million to a hundred; then certification runs on the hundred. The architecture is about separating the fast approximate pass from the slow exact pass, not about choosing one over the other.
The common mistake is to skip the certification pass because the proposal pass "worked." Users clicked on the results. Metrics looked good. But clicks do not constitute equivalence. A user clicking on a similar item is exploring; the system treating a similar item as equivalent is committing. These are different operations with different consequences.
The Certification Contract
Certification is the judge. It takes a candidate from P and asks: can I certify this claim in this context?
A certification operator C is a function:
C : (CandidateClaim, Context) → CertificationResult
where CandidateClaim = Claim about relation between objects (identity, equivalence, similarity, etc.)
where CertificationResult =
| Success {
witness: π,
witness_class: WitnessClass,
scope: Scope,
transport_rights: PropertyMap
}
| Failure {
reason: FailureReason,
evidence: FailureEvidence,
remediation: Option[Guidance]
}
where FailureReason =
| MissingEvidence { what_is_missing: EvidenceSpec }
| InvariantViolation { violated: Invariant, obstruction: ObstructionWitness }
| ScopeMismatch { claimed_scope: Scope, valid_scope: Scope }
| TransportDenied { property: Property, reason: String }
Key property: there is no uncertified middle state. Either the system returns a witness with standing in this context, or it returns a structured reason it cannot.
Critical clarification: Certification does not mean "equivalence"; it means "standing." The witness class determines what kind of standing is granted. A witness of class ReferentialIdentity grants different transport rights than a witness of class RecommendationAdjacency. Certification can certify many relation types, including relations weaker than equivalence.
Success returns:
- A witness typed per A2c (the claim has standing)
- A witness class (what kind of equivalence is this?)
- A scope (where is this valid?)
- Transport rights (what operations does this license?)
Failure returns:
- A structured reason (not "false" but "why false")
- Evidence sufficient to diagnose the problem
- Optionally, guidance on what would make certification succeed
The key architectural choice: both success and failure are first-class artifacts. The system does not just say "no"; it says "no, because X, and here is what you would need for yes."
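The contract can be sketched as a small sum type: certification returns either a Success carrying the witness artifacts or a Failure carrying a structured reason. A minimal Python version, where the claim shape and the scope check stand in for a real certifier:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Success:
    witness: tuple
    witness_class: str
    scope: str
    transport_rights: dict

@dataclass(frozen=True)
class Failure:
    reason: str                       # e.g. "ScopeMismatch", "MissingEvidence"
    evidence: str                     # enough to diagnose the problem
    remediation: Optional[str] = None # what would make certification succeed

def certify(claim: dict, context: str):
    """C : (claim, context) -> Success | Failure. No uncertified middle state."""
    valid = claim.get("valid_scopes", ())
    if context in valid:
        return Success(
            witness=tuple(claim.get("evidence", ())),
            witness_class=claim.get("witness_class", "Unknown"),
            scope=context,
            transport_rights=claim.get("transport_rights", {}),
        )
    return Failure(
        reason="ScopeMismatch",
        evidence=f"claim valid in {valid}, requested {context!r}",
        remediation=f"certify within one of {valid} instead",
    )

claim = {
    "valid_scopes": ("astronomical_objects",),
    "witness_class": "ReferentialIdentity",
    "evidence": ("IAU_designation",),
    "transport_rights": {"orbital_position": "Full"},
}
ok = certify(claim, "astronomical_objects")
no = certify(claim, "poetry_corpus")
```

Both branches are first-class values: the Failure carries its diagnosis and remediation, so "no" is as inspectable as "yes".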
Return to touchstone T2. Candidate from P: (morning_star, evening_star, similarity=0.87)
Certification attempt in astronomical_database context:
C(morning_star ≃ evening_star in astronomical_database) =
Success {
witness: [IAU_designation, ephemeris_data, orbital_parameters],
witness_class: ReferentialIdentity,
scope: astronomical_objects,
transport_rights: {
orbital_position: Full,
apparent_magnitude: Full,
observation_time: None,
cultural_name: None
}
}
Certification attempt in poetry_corpus context:
C(morning_star ≃ evening_star in poetry_corpus) =
Failure {
reason: ScopeMismatch {
claimed_scope: poetic_substitution,
valid_scope: astronomical_reference_only
},
evidence: "Different cultural and poetic associations;
substitution changes meaning of verse",
remediation: "Certify as astronomical_identity only,
or provide poetic equivalence witness"
}
Same candidate, different contexts, different certification results. The proposal operator found the connection; the certification operator determined what operations it licenses.
The Pipeline
The Third Mode's operational architecture separates concerns.
1. PROPOSE
P(query) → [(candidate₁, score₁), ..., (candidateₙ, scoreₙ)]
- Fast, approximate, statistical
- Uses embeddings, retrieval, pattern matching
- Produces ranked list of hypotheses
- Scales to millions of candidates
2. CERTIFY
For each candidate above threshold:
C(candidate_claim, context) → Success { witness } | Failure { reason }
- Slow, exact, witnessed
- Checks invariants (A18), scope, transport rights (A16)
- Produces typed relation with rights (not necessarily equivalence)
- Runs on small candidate set (tens to hundreds)
3. GLUE
For certified candidates:
Glue(witnesses, cover) → GlobalClaim | ObstructionWitness
- Sheaf condition (A13)
- Agreement on overlaps
- Produces global claim or structured failure
The pipeline uses embeddings for what they are good at (narrowing a million items to fifty) and certification for what it requires (determining which of the fifty have standing for the user's purpose, and what kind of standing).
Proposal outputs ranked hypotheses. Certification outputs typed relations with rights. Gluing composes those relations across views, or returns an obstruction witness.
Why this architecture?
The natural objection is: "Certification is slow. Why not just use better embeddings?"
The answer is that speed and certification are different concerns. Making embeddings faster does not make them certify. Making embeddings more accurate does not give them witness classes, scopes, or transport rights. Better similarity is still similarity; it is still the wrong kind of object for certification.
The pipeline separates the concerns:
- P handles scale and fuzziness (statistical methods)
- C handles commitment and operations (witnessed methods)
- Glue handles composition across views (sheaf methods)
Each component does what it is suited for. None is asked to do what it cannot.
This is not a compromise but the correct decomposition. Trying to make embeddings certify equivalence is like trying to make a ranking function act as a foreign key. The ranking function is not broken; it is being misused. Embeddings are not broken when they fail to certify; they are being asked to do something outside their domain.
A user searches for "dresses like item X."
1. Propose:
P(similar_to(X)) → [
(dress_A, 0.94), (dress_B, 0.91), (dress_C, 0.89),
(dress_D, 0.87), (dress_E, 0.85), ...
]
From 50,000 dresses, P returns top 50 candidates. Time: 100ms.
2. Certify: For each candidate above threshold 0.80:
C(dress_A ~style X in catalog_view) =
Success {
witness: shared_attributes(silhouette, fabric_type, occasion),
witness_class: RecommendationAdjacency, // not equivalence
scope: catalog_view,
transport_rights: { recommendation: Full, inventory_merge: None }
}
C(dress_B ~style X in catalog_view) =
Failure {
reason: InvariantViolation {
violated: occasion_compatibility,
obstruction: "X is formal, dress_B is casual"
}
}
Time: 50ms per candidate, 2.5s total for top 50.
3. Glue: Certified relations (dress_A, dress_C, dress_E, ...) are checked for overlap agreement across merchant views. Those that glue become recommendations. Those that do not produce obstruction witnesses explaining why.
Total pipeline time: ~3s. The user gets certified recommendations, not similarity-ranked guesses.
What Embeddings Cannot Provide
The limitation of embeddings is not that they are inaccurate. The limitation is that they are the wrong kind of object for certification.
An embedding similarity score tells you:
- These items are distributionally close in the learned space
- They co-occur in similar contexts in training data
- They are likely to be relevant to similar queries
An embedding similarity score does not tell you:
- Witness class: Is this identity? Isomorphism? Approximation? Refinement?
- Scope: Where is this equivalence valid? In this catalog? This jurisdiction? This moment?
- Transport rights: What properties can I move across this equivalence? What operations does it license?
- Provenance: Why do we believe this? What evidence supports it?
These are not things embeddings do badly. They are not things embeddings do at all. Asking an embedding for transport rights is like asking a photograph for permission to enter the building it depicts.
This distinction matters because it determines how systems fail. A system that uses embeddings for proposal fails gracefully: irrelevant results appear, users ignore them, the system learns from feedback. A system that uses embeddings for certification fails dangerously: records merge incorrectly, inventory corrupts, downstream systems inherit false equivalences.
The failure mode is not inaccuracy. The failure mode is missing operations. An embedding cannot tell you "this equivalence is valid here but not there." An embedding cannot tell you "you may transport price but not availability across this equivalence." An embedding cannot tell you "this certification would succeed if you provided evidence X." These are not accuracy problems; they are category problems.
The Third Mode does not reject embeddings. It uses them for proposal, where their strengths (speed, scale, fuzzy matching) are exactly what is needed. It refuses to use them for certification, where their limitations (no witness class, no scope, no transport rights) are disqualifying.
The Two Empires Redux
This chapter returns us to the Two Empires frame from Part I, but with resolution rather than diagnosis.
The String Empire uses embeddings for everything. Similarity becomes the universal operator. This works for retrieval and recommendation, where "close enough" is the goal. It fails for integration and certification, where "close enough" corrupts data.
The Schema Empire avoids embeddings entirely, or confines them to search. This preserves data integrity but sacrifices the flexibility that makes embeddings valuable. Users cannot find what they are looking for because the schema does not have the right predicates.
The Third Mode uses both. Embeddings propose; the schema machinery certifies. The proposal operator lives in embedding space. The certification operator lives in witnessed space. The pipeline connects them.
This is not a middle ground between the empires but a different architecture that uses each technology for what it does well. Embeddings excel at finding candidates. Schemas excel at maintaining invariants. Witnesses excel at binding operations to evidence. The pipeline composes them.
Consequence
Similarity is not equivalence. This is not a criticism of embeddings; it is a recognition of what they are.
Embeddings are excellent proposal engines. They scale, they are fast, they handle underspecified queries. The Third Mode uses them extensively. Every time P narrows a million candidates to fifty, embeddings are doing valuable work.
But embeddings cannot certify. They cannot provide the artifacts that downstream systems need for safe operations. They cannot explain why two items are equivalent or why they are not. They cannot produce witnesses with standing.
The propose → certify → glue pipeline is the Third Mode's answer:
- Use statistical methods to propose
- Use witnessed methods to certify
- Use sheaf methods to glue
Each method does what it can do. None is asked to do what it cannot.
Chapter 18 asks the next question: if P proposes candidate equivalences, what proposes candidate predicates? How do you search the space of possible predicates under invariant constraints? The proposal operator scales to equivalences. Can it scale to predicate invention?