Identity Maintenance

Tracking entities across representations


A23 (prose proof). How identity persists when representations change.

Therefore we order you that you shall cause no harm or injury to the same Mosse in coming to us, staying, and returning.
— Aragonese Royal Safe-Conduct (1280), Arxiu de la Corona d'Aragó

This chapter formalizes identity maintenance as Anchor A23: a discipline for building context-indexed equivalence over the substrate defined in A22. A23 specifies a scope algebra (meet, leq, compatible, overlap_on), a kind order over equivalence types, scope-conditional transitive closure, and conflict handling that distinguishes scoped disagreement from genuine conflict. The formalization draws on the witnessed equivalence framework of A10 and applies it to keyless joins, widening, and kind degradation. For the narrative treatment of identity as a declared, context-dependent relation rather than a global assumption, see Vol I, Chapter 7 (The Witness Protocol).

The Key Fallacy

Primary keys are useful identifiers. The fallacy is treating them as eternal metaphysics.

A key encodes three assumptions: that identity is decided at insertion time, that it holds globally across all contexts, and that it persists permanently without revision. Real systems violate all three. Identity is often decided later, after data arrives and patterns emerge. Identity varies by context: the shipping department and the tax department may have different definitions of "same customer." Identity changes: companies merge, records are corrected, entities split.

When these assumptions fail, key-based joins produce silent errors. Two records with different keys may refer to the same entity. Two records with the same key may refer to different entities in different contexts. The join either misses matches or produces false positives, and neither failure is visible in the result set.

Keyless joins are joins that do not assume a shared primary key. They rely on equivalence witnesses: explicit, auditable evidence that two tokens should be treated as the same for specific purposes in specific contexts.

Identity as Declaration

We do not assume a global referent. We maintain licenses to treat tokens as the same for specific purposes in specific contexts. Canonicalization (a "golden record" or master entity) is a downstream view built atop those licenses, not the definition of identity itself.

This distinction separates the Third Mode from traditional entity resolution (Sunter 1969) (Christen 2012). Entity resolution assumes a single ground truth and tries to recover it from noisy data. Identity maintenance makes no such assumption. It tracks what has been declared equivalent, by whom, with what evidence, in what scope. If two contexts disagree about identity, both assertions are recorded. The disagreement is data, not error.

An entity does not "exist" in the substrate until claims mention it. Identity between entities does not "exist" until an equivalence is declared with a witness. This is the Third Mode's epistemology: existence and identity are witnessed, not assumed.

Anchor A23: Identity Maintenance

Identity maintenance builds context-indexed equivalence over the substrate (A22).

Scope Algebra (computed from site structure via refines edges):

$\mathrm{meet}(U, V) \to W \mid \bot$ : greatest lower bound; $\bot$ if no common refinement
$\mathrm{leq}(U, V)$ : U refines V (U is finer)
$\mathrm{compatible}(U, V) := \mathrm{meet}(U, V) \neq \bot$
$\mathrm{overlap\_on}(U, V, x) \to \{\mathrm{true}, \mathrm{false}, \mathrm{unknown}\}$ : consults entity-incidence index to check if x has incident claims in meet(U,V); returns unknown if co-reference hasn't stabilized

Kind Order (poset):

$\mathrm{identity} \geq \mathrm{isomorphism} \geq \mathrm{approximation}$

Declaration: $x \sim_U y$ with witness $\pi$ creates Equivalence(e) with scope U and kind K.

Transitive Closure (scope-conditional): $x \sim_U y$ and $y \sim_V z \Rightarrow x \sim_W z$ where $W = \mathrm{meet}(U, V)$, only if $\mathrm{compatible}(U, V)$.

Closure kind = $\min(\mathrm{kind}(e_1), \mathrm{kind}(e_2))$
Closure scope = $\mathrm{meet}(U, V)$ unless explicit widening witness

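Conflict Handling: $x \sim_U y$ and $x \not\sim_V y$:

If $\mathrm{meet}(U, V) = \bot$: ScopedDisagreement (not a conflict)
If $\mathrm{meet}(U, V) \neq \bot$ and $\mathrm{overlap\_on}(U, V, x) = \mathrm{true}$: ConflictWitness
If $\mathrm{meet}(U, V) \neq \bot$ and $\mathrm{overlap\_on}(U, V, x)$ is false or unknown: no conflict is raised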

Obligation: Equivalence assertions must be witnessed. Unwitnessed equivalence is not asserted. Inequivalence is asserted the same way: as a witnessed claim (type mismatch, jurisdiction mismatch, non-overlap proof, or context constraint).

Scope Compatibility

If $x \sim_U y$ and $y \sim_V z$, does $x \sim z$ follow?

In classical equivalence relations: yes, always. Transitivity is definitional. In the Third Mode: only if scopes are compatible.

The scope algebra gives us the tools to decide. Two contexts U and V are compatible if $\mathrm{meet}(U, V) \neq \bot$, meaning they share a common refinement in the site structure. When they do, transitive closure produces a derived equivalence in the meet scope with the minimum kind.

Same context: $x \sim_U y$ and $y \sim_U z \Rightarrow x \sim_U z$. Safe. The meet of U with itself is U.

Compatible contexts: $x \sim_U y$ and $y \sim_V z$ where $\mathrm{meet}(U, V) = W \neq \bot \Rightarrow x \sim_W z$. The derived equivalence holds only in the common refinement W; it is not lifted to the whole of U or V.

Incompatible contexts: $x \sim_U y$ and $y \sim_V z$ where $\mathrm{meet}(U, V) = \bot$. No automatic closure. The system does not infer $x \sim z$ in any scope.

This prevents a common failure mode: chaining equivalences across incompatible scopes to conclude that everything is equivalent to everything. The scope-conditional closure is the discipline that keeps identity maintenance honest.
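The three cases above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the substrate's implementation: the `REFINES` table, the scope names `U_logistics` and `U_billing`, and the tuple representation of an equivalence are hypothetical, and `meet` returns `None` for $\bot$ (including the case where common refinements exist but none is greatest).

```python
from typing import Optional

# Hypothetical site structure: each scope maps to the scopes it directly refines.
REFINES = {
    "U_logistics": set(),
    "U_shipping": {"U_logistics"},
    "U_tax": set(),
    "U_billing": {"U_shipping", "U_tax"},  # assumed common refinement
    "U_poetry": set(),
}

def coarser_or_equal(u: str) -> set:
    """All scopes that u refines, including u itself."""
    seen, stack = {u}, [u]
    while stack:
        for parent in REFINES[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def leq(u: str, v: str) -> bool:
    """True if u refines v (u is finer than or equal to v)."""
    return v in coarser_or_equal(u)

def meet(u: str, v: str) -> Optional[str]:
    """Greatest common refinement of u and v, or None (bottom) if there is none."""
    lower = [w for w in REFINES if leq(w, u) and leq(w, v)]
    for w in lower:
        if all(leq(other, w) for other in lower):
            return w
    return None

def compatible(u: str, v: str) -> bool:
    return meet(u, v) is not None

def close(e1: tuple, e2: tuple) -> Optional[tuple]:
    """Compose (x, y, U) with (y, z, V) into (x, z, meet(U, V)), if allowed."""
    x, y1, u = e1
    y2, z, v = e2
    if y1 != y2:
        return None  # the equivalences do not chain
    w = meet(u, v)
    if w is None:
        return None  # incompatible scopes: no automatic closure
    return (x, z, w)
```

With this toy site, composing a shipping equivalence with a tax equivalence yields a result scoped to `U_billing`, while composing with anything in `U_poetry` yields nothing.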

Kind Degradation

Equivalence kinds form a poset: identity is strongest, then isomorphism, then approximation. When you compose equivalences, the result inherits the weakest kind.

If $e_1$ asserts $x \sim y$ with kind "identity" and $e_2$ asserts $y \sim z$ with kind "approximation," the derived equivalence $x \sim z$ has kind "approximation." You cannot chain approximations into identities. The information-theoretic loss is tracked.

This matters for transport. An identity-kind equivalence can license broad transport, but it still respects the transport certificate and footprint (A16) and context invariants. An approximation-kind equivalence licenses transport of some properties with acknowledged loss. The kind tells downstream consumers what standing the equivalence has.
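Kind degradation along a chain can be written as a one-line minimum. The sketch below assumes the three kinds form a total chain, as in the anchor's kind order; the `Kind` enum and `compose_kinds` are illustrative names, not part of the formalization.

```python
from enum import IntEnum

class Kind(IntEnum):
    """Equivalence kinds, ordered so that a larger value is a stronger kind."""
    APPROXIMATION = 1
    ISOMORPHISM = 2
    IDENTITY = 3

def compose_kinds(*kinds: Kind) -> Kind:
    """Kind of a derived equivalence: the weakest kind along the chain."""
    if not kinds:
        raise ValueError("at least one equivalence is required")
    return min(kinds)

# identity composed with approximation degrades to approximation
derived = compose_kinds(Kind.IDENTITY, Kind.APPROXIMATION)
```

Because the result is a minimum, no number of further compositions can raise the kind again.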

Conflict and Disagreement

What happens when $x \sim_U y$ but $x \not\sim_V y$?

Case 1: Disjoint scopes. If $\mathrm{meet}(U, V) = \bot$, this is not a conflict. Different contexts, different truths. The system records a ScopedDisagreement artifact:

ScopedDisagreement(
  x: token_a,
  y: token_b,
  equivalent_in: U,
  inequivalent_in: V,
  meet: ⊥
)

This is informative, not erroneous. It tells you that identity depends on context.

Case 2: Overlapping scopes. If $\mathrm{meet}(U, V) \neq \bot$, check whether the overlap is relevant to x and y. The predicate $\mathrm{overlap\_on}(U, V, x)$ returns true if the meet contains claims incident to x.

If overlap_on returns false, there is no conflict: the scopes overlap somewhere, but not on the entities in question. If it returns unknown, there is no conflict yet, pending identity stabilization.

If overlap_on returns true: conflict. The system produces a ConflictWitness:

ConflictWitness(
  entities: [x, y],
  equivalence_in_U: e1,
  inequivalence_in_V: e2,
  overlap: meet(U, V),
  incident_claims: [...],
  resolution_options: [
    "restrict e1 scope to exclude overlap",
    "restrict e2 scope to exclude overlap",
    "produce obstruction and escalate"
  ]
)

The conflict is explicit, computed, and stored. It does not disappear because someone ran a query.
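The case analysis can be sketched as a single classification function. This is an illustration only: `NoConflictYet` is a hypothetical name for the outcome the text describes without naming, the artifact fields are abbreviated, and `toy_meet` and `toy_overlap_on` are stand-ins for the scope algebra and the entity-incidence index.

```python
from dataclasses import dataclass
from enum import Enum

class Overlap(Enum):
    TRUE = "true"
    FALSE = "false"
    UNKNOWN = "unknown"

@dataclass(frozen=True)
class ScopedDisagreement:
    x: str
    y: str
    equivalent_in: str
    inequivalent_in: str

@dataclass(frozen=True)
class NoConflictYet:  # hypothetical artifact name
    x: str
    y: str
    meet: str
    overlap: Overlap

@dataclass(frozen=True)
class ConflictWitness:
    entities: tuple
    overlap: str

def classify(x, y, u, v, meet, overlap_on):
    """Classify x ~_U y together with x !~_V y."""
    w = meet(u, v)
    if w is None:
        return ScopedDisagreement(x, y, u, v)  # disjoint scopes
    verdict = overlap_on(u, v, x)
    if verdict is Overlap.TRUE:
        return ConflictWitness((x, y), w)      # genuine conflict
    return NoConflictYet(x, y, w, verdict)     # false or unknown

# Toy stand-ins (assumptions for the example only).
def toy_meet(u, v):
    if u == v:
        return u
    return {frozenset({"U_shipping", "U_tax"}): "U_billing"}.get(frozenset({u, v}))

def toy_overlap_on(u, v, x):
    return Overlap.TRUE if x == "token_a" else Overlap.UNKNOWN
```

Keeping the three outcomes as distinct types means a caller cannot mistake a pending check for an absence of conflict.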

T2: Morning Star and Evening Star

The canonical identity problem (Frege 1892). "Morning Star" and "Evening Star" are names for the same celestial body (Venus), but the discovery that they refer to the same object was informative. The names had different senses, different contexts of use, and different associated properties.

In the Third Mode, this is modeled as two entity tokens in two contexts, linked by a witnessed equivalence:

declare_equivalence(
  x: morning_star,
  y: evening_star,
  context: U_astronomy,
  witness: EphemerisAlignment(
    observation_set: "IAU_Venus_observations",
    method: "orbital_parameter_match"
  ),
  kind: identity
)

The result is an Equivalence(e_venus) with scope U_astronomy and kind identity.

Transport: The certificate on e_venus specifies which properties can be transported. Orbital period, mass, position: yes. These are astronomical properties, and the witness (orbital parameter match) establishes identity for astronomical purposes.

Poetic associations: no. The certificate does not cover cultural properties. A query asking "what poems mention evening_star?" cannot use e_venus to return poems about morning_star. Different scope, different properties, different standing.

Scope restriction: In U_poetry, no equivalence exists between morning_star and evening_star. A query asking "are these the same?" in U_poetry returns: Unknown. No witness exists in that scope. The system does not guess.
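A rough sketch of how a certificate might gate transport for e_venus follows. The actual certificate and footprint structure is defined in A16 and is not reproduced here, so the `Certificate` class, its fields, and `may_transport` are assumptions made for illustration. The sketch also simplifies scope matching to equality, where a fuller version would consult the scope algebra.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Certificate:
    """Hypothetical transport certificate: a scope plus the properties it covers."""
    scope: str
    properties: frozenset

# Assumed certificate attached to e_venus.
E_VENUS_CERT = Certificate(
    scope="U_astronomy",
    properties=frozenset({"orbital_period", "mass", "position"}),
)

def may_transport(cert: Certificate, prop: str, query_scope: str) -> bool:
    """True only if the query runs in the certified scope and asks for a covered property."""
    return query_scope == cert.scope and prop in cert.properties
```

Under these assumptions, `may_transport(E_VENUS_CERT, "mass", "U_astronomy")` holds, while a request for poetic associations, or any request made in U_poetry, is refused.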

T7: NYC Strings

"NYC," "New York City," "New York, NY," "Manhattan," "10001." In different contexts, these may or may not refer to the same entity.

Shipping context: "NYC" and "New York, NY" are equivalent. Same delivery region. Witness: USPS address normalization database.

declare_equivalence(NYC, New_York_NY, U_shipping,
  AddressNormalization(USPS_database), identity)

Tax context: "NYC" and "New York City" are equivalent. Same tax jurisdiction. No equivalence is declared between "NYC" and "10001", because they differ in granularity (city vs zip code).

declare_equivalence(NYC, New_York_City, U_tax,
  JurisdictionMapping(NY_tax_authority), identity)
// No equivalence declared for NYC ≃ 10001

Tourism context: "Manhattan" and "NYC" are equivalent as an approximation. Acceptable metonymy for tourism purposes, but not exact.

declare_equivalence(Manhattan, NYC, U_tourism,
  MetonymyAcceptance(tourism_convention), approximation)

Query behavior:

In U_shipping: "Is NYC = New York, NY?" → Yes (witnessed identity)
In U_tax: "Is NYC = 10001?" → Unknown (no equivalence or inequivalence witnessed). If the tax context carries an invariant forbidding city↔zip equivalence, that constraint yields a witnessed NotEquivalent.
In U_tourism: "Is Manhattan = NYC?" → Approximately (kind=approximation; transport is lossy)

Transitive closure: In U_shipping, if NYC $\sim$ New York, NY and New York, NY $\sim$ New York City (both in U_shipping), then NYC $\sim$ New York City in U_shipping. Same context, closure valid, scope preserved.

Cross-context: NYC $\sim_{\text{shipping}}$ New York, NY and New York, NY $\sim_{\text{tax}}$ New York City. Scope compatibility check: if shipping and tax have a common refinement, closure holds in that meet. Otherwise, no automatic closure.

Keyless Joins

A traditional join assumes shared keys:

SELECT * FROM orders JOIN customers ON orders.customer_id = customers.id

A keyless join uses equivalence witnesses:

SELECT * FROM orders JOIN customers 
  ON equivalent(orders.customer_ref, customers.entity, U_billing)

The equivalent predicate:

Looks up the equivalence index for the token pair in context U_billing
Checks scope compatibility with the query context
Returns true if a witnessed equivalence exists
Returns false if inequivalence is witnessed
Returns unknown if no witness exists either way

Unknown is not false. How a join treats unknown is a policy decision:

Strict policy: Join only on witnessed true. This is schema-land behavior: no match without evidence. Use when correctness matters more than recall.

Best-effort policy: Join on true; optionally include high-probability witnesses above a threshold. This bridges toward embedding-land but requires receipts. Use when recall matters but you need auditability.

Exploratory policy: Return candidate joins with confidence scores, tagged as "non-standing." This is embedding-land behavior made auditable. Use for discovery, not for downstream computation.

The policy is declared at query time. The substrate supports all three; the choice is governance.
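The three-valued predicate and the policy choice can be sketched as follows. Everything named here is an assumption for illustration: the `WITNESSED` index, the `CANDIDATE_SCORES` table of unwitnessed match scores, the row shape, and the policy strings. The scope-compatibility check against the query context is omitted, and only the strict and exploratory policies are shown.

```python
from enum import Enum

class Verdict(Enum):
    TRUE = "true"
    FALSE = "false"
    UNKNOWN = "unknown"

# Hypothetical witnessed index: (token_a, token_b, context) -> verdict.
WITNESSED = {
    ("cust-ref-17", "entity-A", "U_billing"): Verdict.TRUE,
    ("cust-ref-18", "entity-B", "U_billing"): Verdict.FALSE,
}

# Hypothetical unwitnessed candidate scores, e.g. from a similarity model.
CANDIDATE_SCORES = {("cust-ref-19", "entity-B"): 0.92}

def equivalent(a: str, b: str, context: str) -> Verdict:
    """Witnessed true, witnessed false, or unknown when no witness exists."""
    for key in ((a, b, context), (b, a, context)):
        if key in WITNESSED:
            return WITNESSED[key]
    return Verdict.UNKNOWN

def keyless_join(orders, customers, context, policy="strict", threshold=0.9):
    """Return (order_id, entity, standing, score) rows under the given policy."""
    rows = []
    for order in orders:
        for customer in customers:
            ref, entity = order["customer_ref"], customer["entity"]
            verdict = equivalent(ref, entity, context)
            if verdict is Verdict.TRUE:
                rows.append((order["id"], entity, "standing", None))
            elif verdict is Verdict.UNKNOWN and policy == "exploratory":
                score = CANDIDATE_SCORES.get((ref, entity))
                if score is not None and score >= threshold:
                    rows.append((order["id"], entity, "non-standing", score))
    return rows

orders = [
    {"id": "o1", "customer_ref": "cust-ref-17"},
    {"id": "o2", "customer_ref": "cust-ref-18"},
    {"id": "o3", "customer_ref": "cust-ref-19"},
]
customers = [{"entity": "entity-A"}, {"entity": "entity-B"}]

strict_rows = keyless_join(orders, customers, "U_billing")
exploratory_rows = keyless_join(orders, customers, "U_billing", policy="exploratory")
```

A witnessed false never produces a row under either policy, and exploratory rows carry the "non-standing" tag so downstream computation can exclude them.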

The Equivalence Index

The substrate maintains an equivalence index for efficient queries:

EquivalenceIndex:
  by_token: Map<EntityToken, Set<Equivalence>>
  by_context: Map<Context, Set<Equivalence>>
  by_pair: Map<(EntityToken, EntityToken), Set<Equivalence>>
  closure_cache: Map<(EntityToken, EntityToken, Context), ClosureResult>

Operations:

lookup(x, y, U): Are x and y equivalent in context U? Returns Equivalence | NotEquivalent | Unknown
closure(x, U): All entities equivalent to x in context U (transitive closure)
conflicts(x): All conflicts involving x across contexts

Cost:

Declaration: O(1) plus invariant check
Lookup: O(1) with index
Closure: O(n) where n is closure size; cached after first computation
Conflict detection: O(|contexts|) for cross-context checks

The cost model (A21) applies. Large closures are expensive. The coherence budget constrains how much closure you can afford.
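A minimal in-memory sketch of the index shows where the closure cost comes from. It is an illustration under simplifying assumptions: it handles same-context closure only, stores neighbours rather than full Equivalence records, does not store witnessed inequivalence (so `lookup` never returns NotEquivalent), and clears the whole cache on every declaration. Unlike the O(1) pair lookup described above, `lookup` here goes through the closure.

```python
from collections import defaultdict, deque

class EquivalenceIndex:
    """Toy index: per-context adjacency plus a closure cache."""

    def __init__(self):
        self.by_token = defaultdict(set)  # (token, context) -> declared neighbours
        self.closure_cache = {}           # (token, context) -> frozenset of tokens

    def declare(self, x, y, context):
        """Record x ~ y in the given context (witness handling omitted)."""
        self.by_token[(x, context)].add(y)
        self.by_token[(y, context)].add(x)
        self.closure_cache.clear()        # coarse invalidation, for simplicity

    def closure(self, x, context):
        """All tokens reachable from x within one context; O(n) in closure size."""
        key = (x, context)
        if key not in self.closure_cache:
            seen, queue = {x}, deque([x])
            while queue:
                current = queue.popleft()
                for neighbour in self.by_token.get((current, context), ()):
                    if neighbour not in seen:
                        seen.add(neighbour)
                        queue.append(neighbour)
            self.closure_cache[key] = frozenset(seen)
        return self.closure_cache[key]

    def lookup(self, x, y, context):
        """'Equivalence' if y is in x's closure for this context, else 'Unknown'."""
        return "Equivalence" if y in self.closure(x, context) else "Unknown"

index = EquivalenceIndex()
index.declare("NYC", "New_York_NY", "U_shipping")
index.declare("New_York_NY", "New_York_City", "U_shipping")
index.declare("Manhattan", "NYC", "U_tourism")
```

In this toy index, the closure of `NYC` in `U_shipping` contains all three shipping names but not `Manhattan`, which was declared only in `U_tourism`.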

Widening

Equivalence scope can be extended via explicit widening. If $e$ has scope U, widening to a coarser scope V (where $\mathrm{leq}(U, V)$, that is, U refines V) requires a widening witness $\pi_{\text{widen}}$.

widen(e, U → V, π_widen) → Equivalence(e') with scope V

Widening is never inferred from closure; it is an explicit governance act. The widening witness records who authorized the extension, under what authority, and what evidence supports it.

This prevents scope creep. An equivalence that was valid for shipping purposes does not automatically become valid for tax purposes. Each scope extension is a decision, not an inference.
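One way to make that discipline concrete is to have `widen` refuse to run without a witness or outside the refinement order. The sketch below is an assumption-laden illustration: the `WideningWitness` fields mirror the three things the text says a witness records, while the `REFINES` pairs, the `U_logistics` scope, and the `widened_from` field are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class WideningWitness:
    authorized_by: str
    authority: str
    evidence: str

@dataclass(frozen=True)
class Equivalence:
    x: str
    y: str
    scope: str
    kind: str
    widened_from: Optional[str] = None
    widening_witness: Optional[WideningWitness] = None

# Hypothetical refinement pairs (finer, coarser).
REFINES = {("U_shipping", "U_logistics")}

def leq(u: str, v: str) -> bool:
    """True if u refines v; only direct pairs are modelled in this toy version."""
    return u == v or (u, v) in REFINES

def widen(e: Equivalence, target: str, witness: Optional[WideningWitness]) -> Equivalence:
    """Return a new equivalence with the wider scope; never mutates e."""
    if witness is None:
        raise ValueError("widening requires an explicit witness")
    if not leq(e.scope, target):
        raise ValueError(f"{e.scope} does not refine {target}")
    return replace(e, scope=target, widened_from=e.scope, widening_witness=witness)

e = Equivalence("NYC", "New_York_NY", "U_shipping", "identity")
witness = WideningWitness("data_steward", "logistics_governance_board", "address audit")
e_wide = widen(e, "U_logistics", witness)
```

Calling `widen(e, "U_tax", witness)` raises in this sketch, because shipping does not refine tax: moving an equivalence sideways into another scope is a fresh declaration, not a widening.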

Consequence

Identity maintenance is specified. Equivalences are declared with witnesses, propagated with scope-conditional closure, and constrained by conflict handling. The substrate now has rules for populating and querying Equivalence nodes.

The discipline is: no identity without witness, no closure without compatibility, no widening without governance. Disagreements are recorded, not resolved by fiat. Conflicts are computed, not hidden.

Chapter 22 asks what a predicate must carry to be accepted into this substrate. Chapter 23 asks what a query promises when it runs against equivalence-aware data. The identity discipline is established. The predicate discipline follows.