Are We Building Data Products We No Longer Need?

Most of the data products an organisation publishes today (the curated tables, the certified dashboards, the nightly extracts) exist to make a number trustworthy. Where a metric is used in many places, an organisation wants one accurate, agreed version of it, so it names an owner to define, refine, and publish the correct figure, wraps governance around it, and stands behind its quality. A whole discipline grew up to deliver this: the data marketplace, the certified product, the lineage and access controls behind each one. There is a lower-order reason too: producing the figure once and reusing it spares teams from repeatedly rebuilding the same insight by hand. Reliability is the headline; saved effort is the dividend.

Take the lower-order reason first. When an agent can reconstruct a figure on demand from a well-formed source, the effort of building and rebuilding it by hand largely disappears — the dividend of saved labour can be had without maintaining a published copy at all. The headline reason is more interesting. Reliability, accuracy, and governance do not evaporate when the artefact does; the demand for a trustworthy number is, if anything, sharper when an agent is acting on it unsupervised. What changes is not whether we govern, but what we govern.

The object of governance moves from the published number to the rule that produces it — from the data product to the semantic layer beneath it.

Govern the definition and the source it draws on, and the number becomes a derivation you can trust without pinning a copy of it in place. Yet we keep building the products themselves with the same energy we always did, as though the artefact and the trust were the same thing. They are not.

I am not arguing that the published estate should vanish overnight. I am arguing that we should stop building by default, and start choosing. A small set of figures genuinely needs to be pinned and published; a much larger set is built out of momentum, because rebuilding was once laborious and producing the artefact is simply what we have always done.

The convenience layer, examined

Why does this hold? A properly designed source — an append-only ledger, an event-dated CRM record — already carries its own history. Point-in-time reconstruction is intrinsic to it: nothing was overwritten, so the past can be rebuilt by reading events up to a date. In my work I have repeatedly seen teams maintain elaborate published snapshots to answer “what did this look like in March?” when the source itself could already answer that question faithfully, had anyone asked it to.
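A minimal sketch of what "reading events up to a date" can look like, assuming a hypothetical append-only ledger held as a list of dated, signed entries. The names `LedgerEvent`, `balance_as_of`, and the sample accounts are illustrative assumptions, not any particular system's schema.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LedgerEvent:
    """One immutable ledger entry (hypothetical schema, for illustration)."""
    account: str
    event_date: date
    amount_cents: int  # signed: positive credits, negative debits


def balance_as_of(events, account, as_of):
    """Rebuild a balance by summing every event dated on or before `as_of`.

    Nothing in the ledger is overwritten, so the same function answers both
    "what is it now?" and "what did it look like in March?".
    """
    return sum(
        e.amount_cents
        for e in events
        if e.account == account and e.event_date <= as_of
    )


# Illustrative data only.
LEDGER = [
    LedgerEvent("acme", date(2024, 1, 15), 100_000),
    LedgerEvent("acme", date(2024, 3, 10), -25_000),
    LedgerEvent("acme", date(2024, 4, 2), 40_000),
    LedgerEvent("globex", date(2024, 2, 20), 70_000),
]

march_view = balance_as_of(LEDGER, "acme", date(2024, 3, 31))
year_end_view = balance_as_of(LEDGER, "acme", date(2024, 12, 31))
```

No snapshot of the March position is stored anywhere in this sketch; the March view is derived from the same events as the year-end view, only with an earlier cut-off date.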

Where the source is well-formed, then, the published product is doing less than we assume. It is not the keeper of truth. It is a convenience layer sitting on top of truth — and a convenience layer is exactly what on-demand computation dissolves.

The data-quality debt underneath

If that were the whole story, stopping would be a clean efficiency saving and the decision would be easy. It is not the whole story.

The published data product has, in many organisations, quietly subsidised poor source data quality by concealing it. A curated report smooths, reconciles, and patches on the way to a clean figure. Stop producing the report and let an agent read the source directly, and whatever was wrong underneath is now exposed — at the speed and scale of every question anyone cares to ask. Stopping does not deliver a free saving so much as call in a deferred debt. The cost was always there; the published layer was paying the interest on our behalf.

This is why the shift is not a simple subtraction. Once the rule and its source are what we govern — not the published number, but the semantic layer beneath it — two things have to be trustworthy that the published report used to let us take on faith: the definition itself, agreed and held, and the quality of the raw source it computes over. The question “is our revenue figure correct?” stops being answered by certifying a report and starts being answered by trusting the source and the definition behind it. The governance does not lighten; it points at harder things.
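To make those two objects of trust concrete, here is a hedged sketch in which the governed thing is a definition (with an owner and a version) and the derivation refuses to run over a source that fails basic checks. Everything named here (`MetricDefinition`, `source_issues`, `derive`, the invoice fields, and the two checks chosen) is a hypothetical illustration rather than a recommended standard.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass(frozen=True)
class MetricDefinition:
    """The governed object: the rule, not a published copy of its result."""
    name: str
    owner: str
    version: str
    rule: Callable[[List[Row]], int]


def source_issues(rows):
    """Return a list of quality problems found in the raw source.

    Two illustrative checks only: missing amounts and duplicate invoice ids.
    """
    issues = []
    seen = set()
    for i, row in enumerate(rows):
        if row.get("amount_cents") is None:
            issues.append(f"row {i}: missing amount_cents")
        if row.get("invoice_id") in seen:
            issues.append(f"row {i}: duplicate invoice_id {row.get('invoice_id')}")
        seen.add(row.get("invoice_id"))
    return issues


def derive(metric, rows):
    """Compute the figure on demand, but only over a source that passes checks."""
    issues = source_issues(rows)
    if issues:
        # Surface the problem at the source instead of patching it on the way out.
        raise ValueError(f"{metric.name} v{metric.version}: {issues}")
    return metric.rule(rows)


REVENUE = MetricDefinition(
    name="recognised_revenue",
    owner="finance-data",  # assumed owning team
    version="1.0",
    rule=lambda rows: sum(r["amount_cents"] for r in rows if r["status"] == "paid"),
)

clean = [
    {"invoice_id": "A1", "amount_cents": 12_000, "status": "paid"},
    {"invoice_id": "A2", "amount_cents": 5_000, "status": "void"},
    {"invoice_id": "A3", "amount_cents": 8_000, "status": "paid"},
]
dirty = clean + [{"invoice_id": "A3", "amount_cents": None, "status": "paid"}]

revenue = derive(REVENUE, clean)
```

Calling `derive(REVENUE, dirty)` raises rather than returning a tidy number, which is the design choice being illustrated: the error is exposed where it occurred instead of being reconciled away inside a report.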

Hallucination and the same concealment

The same mechanism shows up in a problem we tend to treat as separate: hallucination. Current research locates one of its causes in the data: when a model draws on context that is thin, conflicting, or ungoverned, it fills the gap with something plausible rather than something true. This is not the only cause, and it is not solved by feeding the model more; an accurate, well-bounded semantic layer matters far more than a larger one. But it points somewhere useful. The confidence we can place in an agent’s answer is, in good part, the confidence we have built into the semantic layer beneath it: the definitions it reasons over and the source it draws from. Govern that well, looking actively for risk and error rather than waiting for it to surface, and the agent’s confidence starts to be earned rather than merely fluent.

Fluent-but-wrong is the whole of the hallucination problem.

Seen this way, the published data product does not remove hallucination so much as hide it. When a figure is published, it quietly becomes the truth itself, rather than one reading of the context and raw data it was drawn from — and once it is the truth, no one goes back to interrogate what produced it. A human error baked into the calculation is smoothed into a clean, signed number and travels downstream uncontested. Let an agent read the source directly and the error is exposed instead of absorbed; uncomfortable, but honest, and now open to being fixed at the place it occurred. I should be careful not to claim too much here. The model itself matters as well as the semantics, and a governed definition that is confidently wrong is worse than an ungoverned one, because it manufactures conviction at scale. Correctness and confidence are not the same thing, and a system tuned for the second can drift from the first.

A decision rather than a default

So the discipline is selection. Some figures genuinely must be pinned and formally asserted — regulatory submissions, audited statements, anything carrying an external obligation. Others can sensibly be computed live. The sensible landing point is almost always a blend; what matters is that an organisation chooses the blend on purpose, rather than inheriting it by habit and building on momentum ever since.
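One way to make the choice explicit rather than inherited is to write the selection rule down. The sketch below is an assumption-laden illustration: the `Figure` fields, the `choose_mode` function, and in particular the third branch (keep pinning where the source is not yet well-formed) are my own placeholders for a policy each organisation would have to set for itself.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    PINNED = "pinned"  # published and formally asserted
    LIVE = "live"      # computed on demand from the governed definition


@dataclass(frozen=True)
class Figure:
    name: str
    external_obligation: bool  # e.g. regulatory submission, audited statement
    source_well_formed: bool   # e.g. append-only, event-dated, quality-checked


def choose_mode(figure):
    """Return (mode, reason) so every figure carries a recorded justification."""
    if figure.external_obligation:
        return Mode.PINNED, "external obligation"
    if figure.source_well_formed:
        return Mode.LIVE, "source can answer directly"
    # Assumption: until the source is repaired, the published copy stays.
    return Mode.PINNED, "source not yet trustworthy"


# Hypothetical figures, for illustration only.
figures = [
    Figure("regulatory_capital", external_obligation=True, source_well_formed=True),
    Figure("weekly_active_accounts", external_obligation=False, source_well_formed=True),
    Figure("legacy_region_sales", external_obligation=False, source_well_formed=False),
]

decisions = {f.name: choose_mode(f) for f in figures}
```

The value of the sketch is less in the rule itself than in the recorded reason: each pinned figure has to say whether it is pinned by obligation or by unresolved source quality, which is precisely the distinction that habit tends to blur.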

Questions worth taking into the room

I offer this as a perspective, not a prescription — and I am genuinely interested in where it breaks against other people’s experience. A few questions seem worth putting on the table before AI forces the issue:

What figures must we still formally assert, and why: obligation, or habit?

Where are our sources genuinely well-formed enough to be trusted without a report standing in front of them, and where are they not?

And if the published layer has been concealing data-quality debt, how large is that debt, and who has been paying its interest?

One question I am holding more loosely, as a hypothesis rather than a claim: if we govern the semantic layer with correctness as the explicit goal, actively hunting risk and error rather than certifying outputs after the fact, does an enterprise build, over time, a data capability that grows more correct the longer it runs? Or does strong governance simply entrench whatever definitions it started with, propagating early errors with greater conviction?

I do not think there is a single right answer to any of these. I suspect the value is in asking the questions together, early, and I would be glad to hear where this reasoning meets, or fails to meet, what others are seeing.

A structured perspective on a fast-moving problem — offered to start a conversation, not settle one.