Time Crystallized in Weights

We can think of capital, indeed, as frozen knowledge or knowledge imposed on the material world in the form of improbable arrangements.

Kenneth Boulding, 'The Economics of the Coming Spaceship Earth' (1966)

Replication is cheap once access exists; discovery is expensive when it requires sustained search on scarce infrastructure. A trained model can be distributed as a file. Producing that file at frontier performance requires an extended, energy-dissipating optimization process whose costs are paid in advance, irreversible in the thermodynamic sense and sunk in the accounting sense.

The cost of running a trained model (inference) is only part of the story. The cost that typically dominates frontier advancement is the cost of producing the model in the first place: the search process that discovered, out of a vast space of parameterizations explored along a highly constrained optimization path, a particular arrangement of weights that happens to be useful.

This is where the concept of thermodynamic depth becomes relevant. The term was introduced by Seth Lloyd and Heinz Pagels in a 1988 paper that asked a deceptively simple question: what makes one physical state more complex than another? (Seth Lloyd and Heinz Pagels, "Complexity as thermodynamic depth," Annals of Physics 188, no. 1 (1988): 186–213.) Their answer was that complexity is a property of history, not of the state itself. In their framing, a state is deep if arriving there typically requires substantial irreversible processing, and thus entropy production, along the way. Thermodynamic depth measures how much irreversible processing was needed to arrive at a given configuration.
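
In one standard gloss of their definition (a simplification of the paper's full construction), the depth of a state is proportional to the negative log-probability of the trajectory that actually produced it:

\[
\mathcal{D}(s) \;\propto\; -k \,\ln P(\gamma_s)
\]

where \(\gamma_s\) is the process history that led to \(s\). Improbable histories make deep states; a state that any number of casual processes could have produced is shallow.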

Consider diamond and coal. Both are built from carbon atoms; what differs is the arrangement. The difference in their properties (hardness, transparency, conductivity) is a consequence of that structural difference. Diamond forms under extreme pressure and temperature; coal reflects a different, lower-pressure pathway from organic precursors. Diamond can be synthesized, but only by imposing conditions and doing work that coal does not require. The atoms are the same; the history is different.

The scale of that work is quantifiable. Producing a one-carat synthetic diamond by high-pressure high-temperature synthesis requires sustained conditions of roughly 5 gigapascals and 1,500°C, with total energy input on the order of hundreds of kilowatt-hours. The thermodynamic depth is real but bounded: the lattice is regular, the bonding uniform, and once the requisite conditions are imposed, the structure assembles by physical necessity. A neural network's depth is different in kind: not because the physical conditions are more extreme (they are not) but because the structure is found by search rather than dictated by chemistry.
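
As a rough sanity check on those magnitudes (the 500 kWh figure below is an assumed midpoint of "hundreds of kilowatt-hours," not a measured value):

```python
# Back-of-envelope energetics for a one-carat HPHT diamond.
# Assumes 500 kWh total input (an illustrative midpoint; real
# figures vary by process and vendor).
KWH_TO_J = 3.6e6            # joules per kilowatt-hour
CARAT_G = 0.2               # one carat in grams
MOLAR_MASS_C = 12.01        # g/mol for carbon
AVOGADRO = 6.022e23

energy_j = 500 * KWH_TO_J                       # ~1.8e9 J
atoms = CARAT_G / MOLAR_MASS_C * AVOGADRO       # ~1.0e22 atoms
print(f"total input: {energy_j:.2e} J")
print(f"energy per atom: {energy_j / atoms:.2e} J")  # ~1.8e-13 J
```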

A trained neural network is analogous. The weights of a model, the billions of numerical parameters that determine its behavior, are not random. They are the product of a search process that explored a vast space of possible configurations and settled into one that minimizes a loss function. The exploration is local and path-dependent, but it is still a costly filtering process over degrees of freedom too large to enumerate. The search is not omniscient: it is an optimization constrained by architecture, data, and objective. What crystallizes is not truth but a usable compressive structure that survives deployment.

That search process is called training, and it is expensive: it requires feeding enormous quantities of data through the network, computing gradients, adjusting weights, and repeating the process millions or billions of times. Each step dissipates energy. The final configuration of weights is the residue of all that work.
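
A minimal sketch of that loop, at toy scale (plain NumPy, a two-parameter linear model; everything that makes real training expensive is stripped away). The structure is just measure, adjust, repeat, and on a physical machine every pass through it dissipates energy:

```python
import numpy as np

# Toy gradient descent: fit y = w*x + b to noisy data by repeated
# measure-and-adjust. Frontier training runs the same loop shape over
# billions of parameters and trillions of tokens.
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 3.0 * x + 1.0 + rng.normal(scale=0.1, size=1000)

w, b, lr = 0.0, 0.0, 0.05
for step in range(500):
    err = (w * x + b) - y
    grad_w = 2 * np.mean(err * x)   # dL/dw for mean squared error
    grad_b = 2 * np.mean(err)       # dL/db
    w -= lr * grad_w                # each update is one small, paid-for step
    b -= lr * grad_b

print(f"w = {w:.3f}, b = {b:.3f}")  # converges near (3.0, 1.0)
```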

The cost is physical. Training a frontier language model requires thousands of GPUs running continuously for weeks or months, consuming megawatts of power and generating corresponding quantities of waste heat. Public estimates for frontier training runs vary widely, but some place them in the high tens to several hundreds of millions of dollars, depending on whether one counts only compute or also associated infrastructure and labor. These are invoices paid in joules and dollars, and they represent an expenditure that is irreversible in the thermodynamic sense and sunk in the accounting sense.

GPT-3 provides a concrete benchmark. Patterson et al. estimated the training run at approximately 1,300 MWh, about \(4.7 \times 10^{12}\) joules, spread across a computation that explored a parameter space of dimension \(\sim 10^{11}\) (David Patterson et al., "Carbon Emissions and Large Neural Network Training," arXiv preprint, 2021). The per-parameter cost averages to roughly 27 joules, an amount that appears modest until multiplied by 175 billion. Subsequent frontier models have pushed training energy at least an order of magnitude higher, and the frontier continues to advance: each generation's thermodynamic expenditure becomes sunk cost before the next generation's training begins.
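
The arithmetic behind the per-parameter figure is easy to reproduce:

```python
# Reproducing the per-parameter figure from the Patterson et al. estimate.
MWH_TO_J = 3.6e9                  # joules per megawatt-hour
energy_j = 1300 * MWH_TO_J        # ~4.7e12 J for the full training run
params = 175e9                    # GPT-3 parameter count

print(f"training energy: {energy_j:.2e} J")
print(f"joules per parameter: {energy_j / params:.1f}")  # ~26.7 J
```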

The result is a file: once it exists, it can be transmitted and duplicated at negligible marginal cost. But the fixed cost of producing the model in the first place is enormous, and that fixed cost can create durable advantage when access is controlled and the frontier keeps moving. In the current regime, the copy is downstream of a thermodynamic expenditure that had to be paid at least once.

Charles Bennett, in work roughly contemporaneous with Lloyd and Pagels, developed a related concept he called "logical depth" (Charles H. Bennett, "Logical depth and physical complexity," in The Universal Turing Machine: A Half-Century Survey, ed. Rolf Herken (Oxford: Oxford University Press, 1988), 227–257). Logical depth measures the computational resources required to produce a given output from a minimal description of it. In Bennett's framing, a random string is shallow because there is no shortcut to producing it: you just have to list the digits. A string that encodes the first million digits of pi has high logical depth because there is a short program that can generate it, but running that program takes a long time. The depth is in the computation.
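
A toy contrast, assuming mpmath is available (any arbitrary-precision library would do): the random string's shortest description is essentially the string itself, while pi's description is a short program whose cost lies in the running.

```python
import random
import time
from mpmath import mp

# "Shallow": a string standing in for an incompressible one. Its shortest
# description is (essentially) the string itself, and restating it from
# that description is instant. (A PRNG's output is technically
# compressible to seed + program; it is only an illustration here.)
shallow = "".join(random.choice("0123456789") for _ in range(10_000))

# "Deep": pi has a very short description, but turning that description
# into digits takes real computation.
t0 = time.perf_counter()
mp.dps = 10_000                    # work at 10,000 decimal digits
deep = mp.nstr(mp.pi, 10_000)
elapsed = time.perf_counter() - t0

print(f"shallow string: {len(shallow)} digits, no computation to restate")
print(f"pi to 10,000 digits: recomputed in {elapsed:.4f}s from a short program")
```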

Bennett's point was that depth, not mere complexity, is what distinguishes interesting structures from random ones. A random configuration of matter has high entropy but low depth. It can be produced by any number of uncontrolled processes. A crystal has low entropy and low depth. A living organism, or a trained neural network, is different: it is the product of an extended, selective, information-rich process that could not have been shortcut.

The distinctive value at the frontier does not reside in the bits themselves (the raw information content of the weight file) but in the thermodynamic depth of the process that produced them. The file you can copy is a compressed residue of that history, and the residue has value precisely because the history was costly.

There is an obvious counterargument: if the file can be copied, and copying is cheap, then why does the original cost matter? Once the model exists, anyone who obtains a copy gets the benefit of the search without paying for it. The fixed cost is sunk, and competition should drive the price toward marginal cost, which is essentially zero.

This argument is correct as far as it goes, but it misses the structure of the problem. Access to the model is not automatic. The weights of frontier models are closely held. Even when released, terms of use may restrict commercial applications. Control over the weights, and the ability to serve them at scale, is a form of property right, and property rights can sustain prices above marginal cost. Running the model at scale requires infrastructure that is not free. Inference consumes compute, and compute consumes energy. A model that costs nothing to copy still costs money to deploy, and that cost scales with usage.

More fundamentally, so long as the frontier keeps moving, a model trained today will be superseded by a model trained tomorrow, with more parameters, more data, more compute, and better performance. The thermodynamic expenditure is not a one-time cost but an ongoing race. The company that stops investing in training falls behind; the company that keeps investing stays at the frontier. The sunk cost of a trained model is only sunk until the next generation arrives.

Distillation and imitation can transfer capability at far lower cost than first-principles training. A smaller model can be trained to mimic the outputs of a larger one, compressing "good enough" capability into fewer parameters for a fraction of the original expenditure. This compresses rents within a generation, but it does not eliminate the cost of advancing the frontier: distillation typically keeps the follower downstream of the leader's expenditure. It accelerates diffusion without removing the search cost for whoever reaches the frontier first.
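
A minimal sketch of soft-target distillation in the style of Hinton et al., assuming PyTorch; the temperature and mixing weight are illustrative defaults, not canonical values.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend of soft-target KL (mimic the teacher) and hard-label CE."""
    # Soften both distributions; T^2 rescales gradients to match CE scale.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(soft_student, soft_teacher, reduction="batchmean")
    kl = kl * temperature ** 2

    ce = F.cross_entropy(student_logits, labels)  # ground-truth signal
    return alpha * kl + (1 - alpha) * ce

# Toy usage: batch of 4 examples, 10 classes.
student = torch.randn(4, 10, requires_grad=True)
teacher = torch.randn(4, 10)             # frozen teacher outputs
labels = torch.tensor([0, 3, 7, 1])
loss = distillation_loss(student, teacher, labels)
loss.backward()                          # gradients flow to the student only
print(loss.item())
```

The follower pays only for the student's updates; the teacher's expenditure is inherited, not repeated, which is exactly why distillation diffuses capability without advancing the frontier.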

The practical meaning of "intelligence," in this context, is uncertainty reduction that survives selection. A model is intelligent insofar as it can take an input (a question, an image, a situation) and produce an output that reduces uncertainty in ways that hold up under feedback and use. The reduction is valuable because it saves time, avoids errors, or enables actions that would otherwise be impossible. And it survives selection because the model, having been trained on vast quantities of data, has learned patterns that generalize to new cases.
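
One way to make this concrete, treating "uncertainty reduction" informally as an expected drop in entropy about the variable the user cares about:

\[
\Delta U \;=\; H(A) \;-\; H\!\left(A \mid M(x)\right)
\]

where \(A\) is the action-relevant unknown and \(M(x)\) is the model's output on input \(x\); averaged over inputs, this is the mutual information between the two. "Survives selection" then means \(\Delta U\) stays positive under deployment feedback, not merely on the training distribution.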

But the uncertainty reduction is not free. It was purchased at the cost of the training process, which itself was a form of selection: the gradient descent algorithm explored the space of possible models and selected, step by step, the configurations that minimized the loss. Training is the internal selection procedure; deployment is the external one. The intelligence in the model is the crystallized residue of internal selection, but durable value accrues only where internal and external selection align. A model that minimizes training loss but fails to reduce user uncertainty on deployment has depth without value.

In an economy where intelligence is a factor of production, the scarce resource is not replicable bits, but the capacity to produce new intelligence. That capacity is grounded in the physical infrastructure of training: the chips, the power, the cooling, the data, the engineering talent, and the organizational capability to coordinate all of these at scale. The model file is the output; the training pipeline is the asset.

This is the sense in which energy is being structured into computation and computation is being structured into intelligence. Each layer of structure represents an irreversible transformation: energy into computation, computation into parameter updates, parameter updates into uncertainty-reducing outputs. Capability tends to correlate with the depth of the search process. Value accrues when that capability meets a scarce need under defensible terms.

Intelligence is physical. It has a thermodynamic cost. That cost is falling, but it is not approaching zero, and the gap between what the frontier demands and what the infrastructure can deliver is widening, not narrowing. If this is correct, the economics of intelligence are not the economics of software. They are the economics of energy: who controls the joules, who converts them most efficiently, and who can sustain the expenditure long enough for the search to converge.