Epistemic Status

What a system knows about what it knows

A4 (prose proof): Formalizing the distinction between asserted, verified, and unknown.

What we cannot speak about we must pass over in silence.
— Ludwig Wittgenstein, Tractatus Logico-Philosophicus §7 (1921)

This chapter formalizes epistemic status (A4): the truth value of a proposition relative to a view is true, false, or undetermined, and the distinction depends on which logic the view employs. NULL conflates three distinct epistemic states; merging views requires explicit logic annotation. The reader who wants the historical argument for why absence has always been contested territory should read Vol I, Chapter 5 (The Empire of Tables).

The Lie of NULL

The fashion catalog aggregates from multiple suppliers. Supplier A's feed includes a sustainable field: items are marked TRUE, FALSE, or NULL. Supplier C's feed has the same field, but with different conventions. For Supplier A, a NULL in sustainable means "this item is not sustainable": the supplier operates under a completeness declaration for this predicate, where absence of a positive claim implies the negative. For Supplier C, a NULL means "we don't know": the supplier acknowledges incomplete information.

Supplier B's feed has no sustainable field at all.

A user asks: "Show me sustainable dresses."

What should the system return?

The merged table cannot answer correctly. Supplier A's NULLs, Supplier C's NULLs, and the synthetic NULLs created for Supplier B's items all look identical. The completeness profiles that govern their interpretation—what absence licenses you to conclude—were never stored. The database sees the same symbol; the semantics are different.

This is the lie of NULL.

There are two failures intertwined here: sometimes the predicate is absent from the signature (a vocabulary problem — the previous chapter's territory); sometimes the predicate exists but its value is absent (an epistemic problem). This chapter is about the second—and why the first becomes lethal when we pretend it's the second.

The meaning of absence is not stored in the data; it is stored in the logic.

Four Interpretations of Missingness

NULL is not one thing. It is a symbol that collapses at least four distinct situations into a single representation.

Three storage states describe what the data actually records:

Storage State	Example	What It Means
Unknown	"We haven't tested whether this dress is machine-washable"	The predicate applies, but the value is not known; it might be TRUE or FALSE
Not applicable	"Machine-washable is not a meaningful attribute for jewelry"	The predicate does not apply to this entity; the question is ill-formed
Absent-from-source	"The supplier's feed doesn't include this field"	No information was transmitted; the absence is about the data source, not the entity

Each storage state implies different behavior. Unknown values might become TRUE or FALSE with more information. Not-applicable is a typing failure: the predicate's domain does not include this entity (or includes it only via a partial function)—asking whether a necklace is machine-washable is a category error, and such values should not satisfy either positive or negative queries about the predicate unless the query explicitly ranges over applicability. Absent-from-source values should be handled according to the source's conventions, which may differ from source to source.

One derived state is not stored but inferred:

Derived State	Inference	What It Means
Inferred false	"If sustainable isn't listed, it's not sustainable"	Under a completeness assumption, absence of a positive claim licenses the negative

This fourth interpretation is different in kind. It is not a fact about what the data contains; it is a conclusion drawn from what the data omits, given an assumption about completeness. The inference is valid only if the source actually claims completeness for the predicate in question.

SQL NULL conflates all four (Codd 1979). The database stores a single symbol and provides no mechanism to distinguish which interpretation applies. A query that filters on sustainable IS NULL cannot know whether it is selecting unknown items, not-applicable items, absent items, or items that should be treated as not sustainable.
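The collapse can be made concrete with a small sketch. The `Absence` enum and the `to_sql_null` helper are hypothetical names introduced here for illustration; they are not part of any library or of the chapter's formalism.

```python
from enum import Enum


class Absence(Enum):
    """Hypothetical tags for the three storage states of missingness."""
    UNKNOWN = "unknown"
    NOT_APPLICABLE = "not_applicable"
    ABSENT_FROM_SOURCE = "absent_from_source"


def to_sql_null(cell):
    """Collapse any tagged absence to None, as a nullable column would."""
    return None if isinstance(cell, Absence) else cell


cells = [True, False, Absence.UNKNOWN,
         Absence.NOT_APPLICABLE, Absence.ABSENT_FROM_SOURCE]
stored = [to_sql_null(c) for c in cells]

# Three distinct tags go in; one indistinguishable symbol comes out.
print(stored)
```

Once the cells are stored, no query over `stored` can recover which tag each `None` originally carried.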

The conflation propagates. A CASE statement that maps NULL to a default value may be correct for one interpretation and wrong for another. A COUNT that excludes NULLs is correct if NULLs mean "not applicable" and should be excluded, but it undercounts if NULLs mean "unknown" and should be counted as potential positives. The application must interpret, but the data provides no guidance.
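A toy table shows how much room the aggregate leaves for interpretation. This is a minimal sketch using Python's built-in `sqlite3` module; the table contents are invented for illustration, and SQLite stores booleans as the integers 1 and 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, sustainable INTEGER)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(1, 1), (2, 0), (3, None), (4, None)])

# COUNT(*) counts rows; COUNT(col) and SUM(col) silently skip NULLs.
total, non_null, positives = conn.execute(
    "SELECT COUNT(*), COUNT(sustainable), SUM(sustainable) FROM items"
).fetchone()

# Defaulting NULL to 0 commits to the 'inferred false' reading for every row.
defaulted_false = conn.execute(
    "SELECT COUNT(*) FROM items WHERE COALESCE(sustainable, 0) = 0"
).fetchone()[0]

print(total, non_null, positives, defaulted_false)
```

Whether the number of "not sustainable" items is 1 (only the explicit FALSE) or 3 (after defaulting) depends on an interpretation that the table itself does not record.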

NULL is a value that wants to be a judgement.

Completeness and Inference

The interpretation of absence depends on a completeness declaration—a claim about what the source takes itself to have covered.

The Closed-World Assumption (CWA) is the reasoning of completeness (Reiter 1978). If $P(x)$ is not derivable from the source, infer that $P(x)$ is false. The source claims to know everything relevant; what it cannot derive, it denies.

A phone directory operates under CWA. If a number isn't listed, the person has no phone—within the directory's scope. A product catalog operates under CWA for most attributes. If an item's sustainable field is not TRUE, the catalog implicitly claims the item is not sustainable—unless the catalog explicitly marks incomplete data.

The Open-World Assumption (OWA) is the reasoning of humility. If $P(x)$ is not derivable from the source, the truth of $P(x)$ is undetermined. The source acknowledges incompleteness; what it cannot derive, it does not claim to know.

Wikipedia operates under OWA. If an article doesn't mention a birthdate, we don't conclude the person has no birthdate—only that the article lacks that information. Scientific databases operate under OWA. If a study doesn't report a side effect, we don't conclude the side effect doesn't exist—only that this study didn't observe or record it.

The choice is load-bearing. Under a completeness declaration for sustainable, the query SELECT * FROM items WHERE NOT sustainable is meant to return items where sustainable IS NULL alongside those marked FALSE. Under OWA (no completeness declaration), it returns only the items marked FALSE, which is also what stock SQL does, since NOT NULL evaluates to UNKNOWN. The same query, the same data, different results, because the inference regime differs.
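The two readings of the negated query can be emulated in SQLite. This is a sketch under assumptions: the three-row table is invented, and the closed-world reading is approximated by defaulting NULL to FALSE with COALESCE, which is one possible encoding rather than a built-in completeness mechanism.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, sustainable INTEGER)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(1, 1), (2, 0), (3, None)])

# Open-world reading: NOT NULL is UNKNOWN, so the NULL row is filtered out.
open_world = [r[0] for r in conn.execute(
    "SELECT id FROM items WHERE NOT sustainable ORDER BY id")]

# Closed-world reading, emulated by treating NULL as FALSE before negating.
closed_world = [r[0] for r in conn.execute(
    "SELECT id FROM items WHERE NOT COALESCE(sustainable, 0) ORDER BY id")]

print(open_world, closed_world)
```

The data is identical in both queries; only the rewrite, which encodes the completeness assumption, changes the answer.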

Example (Birds and Flying)

Schema: birds(id, species, can_fly)

id	species	can_fly
1	sparrow	TRUE
2	penguin	FALSE
3	ostrich	NULL
4	kiwi	NULL

Query: "Which birds can fly?"

Both regimes agree: {sparrow}. The NULL values are not TRUE.

Query: "Which birds cannot fly?"

Under CWA (if can_fly is declared complete): {penguin, ostrich, kiwi}. NULL is inferred as FALSE; negating FALSE yields TRUE.

Under OWA (if can_fly is not declared complete): {penguin}. NULL is undetermined; undetermined values are not returned by either positive or negative queries.

The twist: What if the NULL for ostrich came from a source that simply omits can_fly for ratites (absent-from-source), while the NULL for kiwi came from a source that includes the field but hasn't assessed the value (unknown)?

SQL cannot express this. The epistemic status differs, but the storage is identical.

The bird table is the catalog's sustainable column writ small: same storage, same NULLs, different meaning depending on regime.
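The bird example can be replayed in a few lines. This is a minimal sketch: `None` stands in for NULL, and the `closed` flag (a name chosen here for illustration) says whether can_fly is declared complete.

```python
birds = [("sparrow", True), ("penguin", False),
         ("ostrich", None), ("kiwi", None)]


def can_fly(value, closed):
    """Return True, False, or None (undetermined) for one stored value."""
    if value is None:
        # Closed predicate: absence licenses the negative. Open: undetermined.
        return False if closed else None
    return value


def flies(rows, closed):
    return [species for species, v in rows if can_fly(v, closed) is True]


def cannot_fly(rows, closed):
    return [species for species, v in rows if can_fly(v, closed) is False]


print(flies(birds, closed=True), flies(birds, closed=False))
print(cannot_fly(birds, closed=True))
print(cannot_fly(birds, closed=False))
```

The sketch still cannot express the twist: both `None` values enter `can_fly` looking the same, whatever their origin.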

Most practitioners assume CWA without stating it. This works when all data comes from a single source with uniform conventions. It fails when sources are merged—and modern systems merge sources constantly.

The Merge Problem

When Supplier A (CWA: "NULL means not sustainable") is merged with Supplier C (OWA: "NULL means unknown"), the system faces an impossible situation.

Supplier A's NULL items should be treated as not sustainable—under A's completeness profile, the absence of a positive claim is evidence of the negative. Supplier C's NULL items should be treated as sustainability unknown—under C's open-world reasoning, the absence of a claim is just absence.

If the system applies CWA globally, it misrepresents Supplier C's data. Items that Supplier C genuinely doesn't know about get labeled "not sustainable"—a false negative that could mislead users.

If the system applies OWA globally, it misrepresents Supplier A's data. Items that Supplier A deliberately didn't mark sustainable get labeled "unknown"—obscuring a negative claim that the supplier intended to make.

Neither is correct. The correct answer is: it depends on the source.

But the merged table has lost the source. The NULL values are indistinguishable. The completeness profile was never stored. The merge performs semantic erasure: meaning is destroyed when provenance and completeness are dropped.

This is why Chapter 2's provenance typing matters. A witness should carry not just the source but the inference regime under which the claim was made. Without regime annotations, merging is a silent corruption of meaning.
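The erasure, and what a regime annotation preserves, can be sketched as follows. The `Claim` record and its field names are hypothetical, chosen only to illustrate a witness that carries its source and completeness declaration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Claim:
    item: str
    sustainable: Optional[bool]
    source: str
    closed: bool  # True if the source declares `sustainable` complete (CWA)


supplier_a = [Claim("X", None, "A", closed=True)]
supplier_c = [Claim("Y", None, "C", closed=False)]

# Lossy merge: only (item, value) survives, so both NULLs look alike.
lossy = [(c.item, c.sustainable) for c in supplier_a + supplier_c]

# Annotated merge: each claim keeps its source and regime.
annotated = supplier_a + supplier_c


def reading(claim):
    """Interpret a claim under the regime it was made in."""
    if claim.sustainable is None:
        return "not sustainable" if claim.closed else "unknown"
    return "sustainable" if claim.sustainable else "not sustainable"


print(lossy)
print([reading(c) for c in annotated])
```

From `lossy` alone there is no function that returns different readings for X and Y; from `annotated` the distinction is recoverable.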

Remark (On Vocabulary Absence vs Value Absence)

Supplier B presents a different problem: the sustainable field doesn't exist in B's feed at all. This is not a NULL; it's a schema mismatch—a vocabulary problem from Chapter 3.

When the system creates a merged table with a sustainable column, it must synthesize NULLs for Supplier B's items. But these synthetic NULLs mean something different again: "the source doesn't traffic in this concept." They are neither unknown (B didn't fail to learn the value) nor false (B didn't claim the items aren't sustainable). They are outside B's vocabulary.

Conflating predicate-absence with value-absence is the first compounding error. Treating all NULLs as equivalent, regardless of whether they came from an explicit NULL, a schema gap, or an inference, is the second.

Epistemic Status

The solution is to make the inference regime explicit. Truth is not a property of a proposition in isolation; it is a status relative to a view and that view's completeness profile.

Epistemic Status (A4)

The epistemic status of proposition $p$ relative to view $(U, R_U)$ is:

true: $U, R_U \vdash p$ — the view, under its inference regime, proves $p$
false: $U, R_U \vdash \neg p$ — the view, under its inference regime, proves $\neg p$
undetermined: neither — the view, under its inference regime, is silent on $p$

Assume $(U, R_U)$ is consistent on $p$: it does not derive both $p$ and $\neg p$. (Conflict-tolerant gluing is deferred to Part III.)

CWA and OWA are inference rules within the regime $R_U$, scoped to predicates:

CWA (predicate-scoped): If predicate $P$ is declared complete in view $U$, then failure to derive $P(x)$ licenses $\neg P(x)$
OWA: If predicate $P$ is not declared complete, failure to derive $P(x)$ yields undetermined status

The local theory $T(U) = (\Sigma, I, R_U)$ packages signature, constraints, and inference regime together. Two sources with the same signature but different completeness profiles are different theories.

The definition is predicate-scoped because completeness is rarely uniform. A supplier might claim completeness for price (every item has a price; absence would be an error) while acknowledging incompleteness for sustainable (some items haven't been assessed). The inference regime must specify which predicates are closed.
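A4 can be sketched as a small function over a view. This is an illustration under a strong simplifying assumption: derivability is approximated by membership in sets of explicitly asserted atoms, with no rules beyond predicate-scoped closure. The `View` class, the `Status` enum, and the `in_stock` predicate are hypothetical names introduced for this sketch.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    TRUE = "true"
    FALSE = "false"
    UNDETERMINED = "undetermined"


@dataclass
class View:
    positive: set = field(default_factory=set)  # atoms asserted true
    negative: set = field(default_factory=set)  # atoms asserted false
    closed: set = field(default_factory=set)    # predicates declared complete

    def status(self, predicate, entity):
        atom = (predicate, entity)
        if atom in self.positive and atom in self.negative:
            # A4 assumes consistency; conflicts are out of scope here.
            raise ValueError("view is inconsistent on this atom")
        if atom in self.positive:
            return Status.TRUE
        if atom in self.negative:
            return Status.FALSE
        # Predicate-scoped CWA: absence licenses the negative only if closed.
        return Status.FALSE if predicate in self.closed else Status.UNDETERMINED


view = View(positive={("sustainable", "dress-1")},
            negative={("sustainable", "dress-2")},
            closed={"in_stock"})

print(view.status("sustainable", "dress-3"))  # open predicate
print(view.status("in_stock", "dress-3"))     # closed predicate
```

The same missing atom gets two different statuses in one view, because the regime is a per-predicate profile rather than a global switch.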

With A4 in hand, the merge problem becomes tractable. When Supplier A's data enters the system, it carries its completeness profile: sustainable is closed. When Supplier C's data enters, it carries a different profile: sustainable is open. The merged representation preserves both:

$(p_A, \pi_A)$: sustainable = NULL for item X, witnessed by Supplier A, regime: CWA for sustainable
$(p_C, \pi_C)$: sustainable = NULL for item Y, witnessed by Supplier C, regime: OWA for sustainable

A downstream query can now distinguish: X is not sustainable (closed-world inference); Y's sustainability is unknown (open-world uncertainty). The system can partition results by epistemic status and let the user choose which partition to include.

Remark (On Logic Selection)

The choice between CWA and OWA is sometimes framed as a "logic choice"—different logics with different inference rules. This framing is correct but can be misleading. In knowledge representation, CWA is often modeled as a completeness axiom added to an otherwise open-world theory, or as a closure operation on the minimal model. The practical effect is the same: what you infer from absence depends on what you assume about completeness.
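One way to write the completeness-axiom reading is as a closure of the view's assertions. In the sketch below, $T$ stands for the set of sentences the view asserts and $C_U$ for the set of predicates declared complete in $U$; both symbols are introduced here for illustration and are not part of the definition of A4.

$$\mathrm{CWA}_{C_U}(T) \;=\; T \,\cup\, \{\, \neg P(t) \;:\; P \in C_U,\ P(t) \text{ a ground atom},\ T \nvdash P(t) \,\}$$

When $C_U$ contains every predicate, this is the global closed-world closure (Reiter 1978); when $C_U$ is empty, nothing is added and the theory stays open.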

We use "inference regime" rather than "logic" to emphasize that the choice is about what the source claims to know, not about the fundamental rules of reasoning. A source can be complete for some predicates and incomplete for others. The regime is a profile, not a global setting.

SQL's Incomplete Remedy

SQL attempted to handle uncertainty with three-valued logic. Instead of TRUE and FALSE, SQL uses TRUE, FALSE, and UNKNOWN. NULL values propagate as UNKNOWN through most operations. Comparisons involving NULL yield UNKNOWN. Boolean operations follow Kleene's three-valued truth tables (Kleene 1952, §64).
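The Kleene connectives are short enough to write out. In this sketch `None` plays the role of UNKNOWN, and the function names `k_not`, `k_and`, and `k_or` are chosen here for illustration.

```python
def k_not(a):
    """Kleene negation: UNKNOWN stays UNKNOWN."""
    return None if a is None else (not a)


def k_and(a, b):
    """Kleene conjunction: a definite FALSE wins over UNKNOWN."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def k_or(a, b):
    """Kleene disjunction: a definite TRUE wins over UNKNOWN."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


for a in (True, False, None):
    for b in (True, False, None):
        print(a, b, k_and(a, b), k_or(a, b))
```

A WHERE clause keeps a row only when its condition evaluates to TRUE, so both FALSE and UNKNOWN rows are dropped.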

The attempt was well-intentioned. Two-valued logic cannot distinguish "definitely false" from "unknown," and the distinction matters. Three-valued logic provides a third category for uncertainty.

But three values are not enough. SQL's UNKNOWN conflates "unknown whether true or false" with "not applicable" with "absent from source." The conflation produces counterintuitive behavior.

Gotcha 1: NOT IN with NULL

SELECT * FROM items WHERE category NOT IN (SELECT category FROM banned)

If the banned table contains any NULL value in the category column, this query returns zero rows, even for items whose category is clearly not among the non-null banned categories. For such an item, the comparison against NULL yields UNKNOWN; NOT IN is TRUE only if every comparison is FALSE; so a single UNKNOWN poisons the entire result.

This behavior surprises even experienced SQL developers. The fix—using NOT EXISTS instead of NOT IN, or filtering out NULLs from the subquery—is a workaround for a semantic confusion that the language cannot express cleanly.
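The gotcha and both workarounds can be reproduced with Python's built-in `sqlite3` module. The table contents below are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER, category TEXT);
    CREATE TABLE banned (category TEXT);
    INSERT INTO items VALUES (1, 'dress'), (2, 'fur'), (3, 'scarf');
    INSERT INTO banned VALUES ('fur'), (NULL);
""")

# NOT IN: the NULL in the subquery turns every non-match into UNKNOWN.
not_in = conn.execute(
    "SELECT id FROM items "
    "WHERE category NOT IN (SELECT category FROM banned)").fetchall()

# Workaround 1: NOT EXISTS only asks whether a matching row exists.
not_exists = conn.execute(
    "SELECT id FROM items i WHERE NOT EXISTS "
    "(SELECT 1 FROM banned b WHERE b.category = i.category) "
    "ORDER BY id").fetchall()

# Workaround 2: filter NULLs out of the subquery.
filtered = conn.execute(
    "SELECT id FROM items WHERE category NOT IN "
    "(SELECT category FROM banned WHERE category IS NOT NULL) "
    "ORDER BY id").fetchall()

print(not_in, not_exists, filtered)
```

Both workarounds quietly decide that a NULL in banned means "no ban recorded", which is itself an interpretive choice the query cannot declare.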

Gotcha 2: NULL = NULL yields UNKNOWN

SELECT * FROM items WHERE color = color

This query returns only rows where color IS NOT NULL. The comparison NULL = NULL yields UNKNOWN, not TRUE. Every row with a NULL color fails the WHERE clause.

The behavior follows from the semantics—if we don't know what NULL is, we can't know whether it equals itself—but it violates the reflexivity of equality that users expect. The workaround—using IS NOT DISTINCT FROM or COALESCE—is again a patch over a deeper problem.
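The same behavior can be checked in SQLite. One assumption to flag: the sketch uses SQLite's `IS` operator, which treats two NULLs as equal, as a stand-in for the standard `IS NOT DISTINCT FROM`; the rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, color TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(1, "red"), (2, None), (3, "blue")])

# NULL = NULL evaluates to NULL (UNKNOWN), which Python receives as None.
null_eq = conn.execute("SELECT NULL = NULL").fetchone()[0]

# Ordinary equality drops the row with a NULL color.
eq_ids = [r[0] for r in conn.execute(
    "SELECT id FROM items WHERE color = color ORDER BY id")]

# Null-safe comparison keeps it.
null_safe_ids = [r[0] for r in conn.execute(
    "SELECT id FROM items WHERE color IS color ORDER BY id")]

print(null_eq, eq_ids, null_safe_ids)
```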

No completeness declaration. SQL provides three-valued logic but leaves completeness assumptions to schema design and application convention. Different constructs—NOT IN, NOT EXISTS, outer joins, IS NULL—embody different practical commitments without user control. The query engine doesn't know whether a column is closed or open; it applies syntactic rules that may or may not match the intended semantics.

No per-source regime. Even if SQL had completeness declarations, they would be per-column or per-table, not per-source. A merged table combining sources with different completeness profiles would have one regime for all rows. The semantic distinction would still be lost.

Three-valued logic is a bandage on a wound that requires surgery (Zaniolo 1984). The surgery is explicit epistemic status: tracking not just what is absent but what inference regime governs the absence.

The Fashion Catalog with Epistemic Status

Return to the catalog with the full machinery in hand.

For the sustainable attribute across three suppliers, the system maintains:

Supplier A (CWA for sustainable):

Items with sustainable = TRUE: confirmed sustainable
Items with sustainable = FALSE: confirmed not sustainable
Items with sustainable = NULL: not sustainable (closed-world inference)

Supplier B (no sustainable field):

All items: sustainable outside vocabulary—not unknown, not false, but undefined in this source

Supplier C (OWA for sustainable):

Items with sustainable = TRUE: confirmed sustainable
Items with sustainable = FALSE: confirmed not sustainable
Items with sustainable = NULL: sustainability unknown

When a user queries "Show me sustainable dresses," the system can now respond:

"Results are partitioned by epistemic status (counts illustrative):

Sustainable (confirmed): ~1,200 items—sustainable = TRUE from any source.

Not sustainable (confirmed): ~900 items—sustainable = FALSE from any source.

Not sustainable (CWA inference): ~3,400 items—sustainable = NULL from Supplier A, which claims completeness.

Sustainability unknown: ~8,900 items—sustainable = NULL from Supplier C, which does not claim completeness.

Sustainability not in vocabulary: ~12,800 items—from Supplier B, whose feed doesn't include this attribute.

Which partition(s) would you like to include?"

The distinction matters operationally: a system that surfaces epistemic commitments lets users choose, while a system that hides them chooses for them. A user who wants only confirmed sustainable items gets a clean answer. A user who is willing to include unknowns can make that choice explicitly. A user who wants to exclude Supplier A's CWA inferences (perhaps suspecting the supplier over-claims completeness) can do so.
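The partitioned response can be sketched as a lookup from source profile and stored value to a partition label. The `PROFILE` table, the `partition` and `select` helpers, and the five sample rows are all hypothetical; a deployed system would carry these profiles as provenance rather than as a hard-coded dictionary.

```python
from collections import Counter

# Hypothetical per-source profile for the `sustainable` predicate.
PROFILE = {"A": "closed", "B": "no_field", "C": "open"}


def partition(source, value):
    """Map a (source, stored value) pair to its epistemic partition."""
    profile = PROFILE[source]
    if profile == "no_field":
        return "not in vocabulary"
    if value is True:
        return "sustainable (confirmed)"
    if value is False:
        return "not sustainable (confirmed)"
    if profile == "closed":
        return "not sustainable (CWA inference)"
    return "sustainability unknown"


def select(rows, wanted):
    """Keep only the rows whose partition the user opted into."""
    return [(s, v) for s, v in rows if partition(s, v) in wanted]


rows = [("A", True), ("A", None), ("B", None), ("C", False), ("C", None)]
counts = Counter(partition(s, v) for s, v in rows)

print(counts)
print(select(rows, {"sustainable (confirmed)", "sustainability unknown"}))
```

The user's choice is expressed as the `wanted` set, so including unknowns or excluding Supplier A's inferences is an explicit argument rather than a hidden default.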

The system has not hidden the complexity; it has structured it.

Touchstones Advanced

T5 (Negation/Absence): This touchstone asked whether a system can correctly handle negation—whether it can distinguish "known to be false" from "not known to be true."

Chapter 1 foreshadowed T5 as "negation without witnesses." Chapter 2 advanced it to "logic is part of provenance." Chapter 4 resolves T5 in principle: epistemic status makes the distinction explicit. CWA and OWA are declared inference rules scoped to predicates. A system that tracks completeness profiles can distinguish negative facts from mere absences.

The resolution is "in principle" because the operational question remains: how do you actually track completeness profiles in deployed systems? That's Part V's engineering concern. The conceptual machinery is complete.

T8 (Uncertainty/Value): "Best restaurant in Berlin" is not a factual predicate. Its truth depends on preference context: best for whom? By what criteria? Under what constraints?

T8 shows that inference regimes extend beyond CWA/OWA. Some predicates are inherently non-factual—their truth varies with context in a way that makes single-valued answers inappropriate. The predicate "best" has different epistemic status in different views because different views encode different preference functions.

A4 handles T8 partially. The machinery of views and regimes can express that "best restaurant" is undetermined in one view and TRUE for a specific restaurant in another (the view that encodes a particular preference function). Full resolution requires fibered predicates (Chapter 12)—predicates whose meaning varies systematically with context, with explicit fiber structure.

For now, T8 is advanced: the framework acknowledges context-dependence. Resolution is deferred.

Consequence

The relational empire's wound is now fully diagnosed.

Chapter 3 showed that vocabulary is frozen: the schema cannot grow on demand, and escape hatches surrender certification. Chapter 4 shows that even within a fixed vocabulary, the meaning of absence varies across sources. Two sources can share a signature and still disagree on what an empty cell means. The disagreement is not about values; it is about completeness—about what the sources claim to know.

You cannot safely merge views without declaring both vocabulary and inference regime.

The meaning of absence is not stored in the data; it is stored in the logic.

The string empire produces any sentence but cannot track commitments. The retrieval layer grounds claims but cannot type provenance. The schema empire enforces constraints but cannot grow vocabulary. And even with a fixed, typed, provenance-carrying vocabulary, sources differ on what absence means—and SQL's three-valued logic cannot express the difference.

We have now diagnosed four failures. The question is no longer whether these failures exist. It is whether a single formal object can address all of them—commitment discipline, provenance typing, vocabulary flexibility, and explicit epistemic status. Chapter 5 states what that object must satisfy: the Coherence Requirement.

Litmus Cases

Case	Name	Chapter 4 Status
T5	Negation/Absence	Resolved (in principle): epistemic status explicit; CWA/OWA predicate-scoped
T8	Uncertainty/Value	Advanced: non-factual predicates acknowledged; full resolution in Chapter 12

T5 Progression:

Chapter 1: Foreshadowed as "negation without witnesses"
Chapter 2: Advanced to "logic is part of provenance; CWA/OWA as source metadata"
Chapter 4: Resolved—epistemic status is formal; inference regime is part of local theory

T8 (Uncertainty/Value): "Best restaurant in Berlin" depends on preference context. A4 provides the framework for context-indexed truth; fibered predicates (Chapter 12) provide the full resolution.

The touchstones sharpen. The failures have names, and the names point toward the objects we must build.