PAPER TR-2026-01

Infrastructure action approval / 2026 / 2026-05-26

A purpose-built 6M model outperformed frontier AI at the decisions that matter.

A 6M-parameter model was trained for one task: read the evidence, then decide whether an action proceeds or stops. It outperformed frontier AI models reported to be roughly 1,500 times its size. Paired with a deterministic rule layer, it reached 84% accuracy with zero unsafe decisions, without changing a single weight.

A 6M-parameter model beats 8B-class models at making decisions with real consequences.

Rocky-DAM-α is not a frontier model or a general assistant. It is trained from scratch for one explicit purpose: to read sensor evidence and decide whether a proposed action should be allowed, denied, or frozen. DAM-α is the specific variant built to put AI governance into practice on heavy infrastructure. If AI is handed the keys to a physical system like a hydroelectric dam, how does it actually perform?

Against reported 8B-class mini systems, the parameter gap is roughly 1,500x. Despite that gap, Rocky-DAM-α recovers more valid actions, recognises freeze cases more reliably, and exposes a critical failure mode in general models: they refuse so often that they cause operational paralysis. The baseline parameter counts are reported figures; independent verification is not available, and the exact ratio may vary. What the result demonstrates is the directional claim: a small specialised model outperforming a much larger general one on a narrow task.

THE FINDING

Scale gives breadth. Narrow gives focus. And where AI acts with real consequence, narrow wins.

Evaluation and testing

The models were evaluated against a 300-row balanced dataset of physical-infrastructure governance scenarios, split evenly into 100 allow, 100 deny, and 100 freeze cases based on operational evidence and authority fields. GPT-4o-mini and GPT-4.1-mini were benchmarked against the 6M-parameter Rocky-DAM-α model using the same decision space and scoring rules.

Each scenario contains operational evidence, authority fields, and a proposed action.

The output space is deliberately narrow:

allow. The action is valid and should proceed.
deny. The action is invalid and should be blocked.
freeze. The evidence is unsafe or insufficient, so the system should halt.
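One way to picture this decision space in code is as a closed set of three labels. The sketch below is illustrative only: the `Scenario` field names and the `parse_decision` helper are assumptions for this example, not the schema or parser used in the evaluation.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(str, Enum):
    ALLOW = "allow"    # the action is valid and should proceed
    DENY = "deny"      # the action is invalid and should be blocked
    FREEZE = "freeze"  # evidence is unsafe or insufficient; halt


@dataclass(frozen=True)
class Scenario:
    # Hypothetical field names; the report only says a scenario holds
    # operational evidence, authority fields, and a proposed action.
    evidence: dict
    authority: dict
    proposed_action: str


def parse_decision(raw: str) -> Decision:
    """Map raw model text to one of the three labels.

    Any text outside the label set raises ValueError, so an
    out-of-vocabulary answer can never be mistaken for a valid one.
    """
    return Decision(raw.strip().lower())
```

Keeping the label set closed means every model, large or small, is scored against the same three outcomes.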

Held-out scenarios: 300 (100 allow / 100 deny / 100 freeze)
Parameter gap: 1,500x (6M vs reported 8B-class)
Rocky overall accuracy: 63%
Rocky + Trace accuracy: 84% (0% unsafe-proceed)

Model	Accuracy	Allow	Deny	Freeze	Unsafe proceed
GPT-4o-mini	36%	6%	91%	11%	0.0%
GPT-4.1-mini	40%	16%	72%	32%	4.5%
Rocky-DAM-α (final)	63%	52%	37%	100%	20.5%
Rocky-DAM-α (safe checkpoint)	61%	0%	83%	100%	0.0%
Rocky-DAM-α + Trace (supervised)	84%	52%	100%	100%	0.0%
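Because each class has exactly 100 cases, the overall accuracy column is the sum of the three per-class figures divided by 300. The sketch below checks that arithmetic against the table. The denominator of the unsafe-proceed column is not stated in the text; the function `implied_unsafe_cases` assumes it is the 200 non-allow cases, which is the only choice among 100, 200, and 300 that turns 20.5% into a whole number of cases.

```python
# Per-class accuracy (%) and unsafe-proceed rate (%) copied from the table.
TABLE = {
    "GPT-4o-mini": {"allow": 6, "deny": 91, "freeze": 11, "unsafe": 0.0},
    "GPT-4.1-mini": {"allow": 16, "deny": 72, "freeze": 32, "unsafe": 4.5},
    "Rocky-DAM-α (final)": {"allow": 52, "deny": 37, "freeze": 100, "unsafe": 20.5},
    "Rocky-DAM-α (safe checkpoint)": {"allow": 0, "deny": 83, "freeze": 100, "unsafe": 0.0},
    "Rocky-DAM-α + Trace (supervised)": {"allow": 52, "deny": 100, "freeze": 100, "unsafe": 0.0},
}

CASES_PER_CLASS = 100  # balanced split: 100 allow / 100 deny / 100 freeze


def overall_accuracy_pct(row: dict) -> float:
    # With 100 cases per class, each per-class percentage is also a count.
    correct = row["allow"] + row["deny"] + row["freeze"]
    return 100 * correct / (3 * CASES_PER_CLASS)


def implied_unsafe_cases(row: dict, denominator: int = 200) -> float:
    # ASSUMPTION: unsafe-proceed is measured over the 200 deny + freeze cases.
    return row["unsafe"] / 100 * denominator


if __name__ == "__main__":
    for name, row in TABLE.items():
        print(name, overall_accuracy_pct(row), round(implied_unsafe_cases(row)))
```

Under that assumption, the final checkpoint's 20.5% corresponds to 41 non-allow cases that were wrongly allowed.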

Paralysis versus progress

1. General models cause operational paralysis

Look at the frontier baselines. GPT-4o-mini recovered just 6 of the 100 valid allow cases. It achieved its "safe" score of 0% unsafe-proceed by simply answering deny or freeze to 294 of the 300 requests.

In a live operating environment, that is not safety. It is operational paralysis. A dam that refuses to open its gates when the reservoir is hitting maximum capacity is actively dangerous.
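The paralysis point can be made concrete with a few lines of arithmetic. The sketch below uses only figures from the text and table, plus one illustrative comparison that is an assumption of this example: a trivial policy that always answers deny, which is not a system evaluated in the report.

```python
TOTAL_CASES = 300
VALID_ALLOW_CASES = 100


def refusal_rate(allowed: int, total: int = TOTAL_CASES) -> float:
    """Fraction of all requests answered with deny or freeze."""
    return (total - allowed) / total


def blocked_valid_rate(recovered: int, valid: int = VALID_ALLOW_CASES) -> float:
    """Fraction of valid actions that were not allowed."""
    return (valid - recovered) / valid


def always_deny_accuracy(cases_per_class: int = 100, classes: int = 3) -> float:
    """Accuracy of a hypothetical policy that answers deny to everything.

    It gets every deny case right and nothing else, and it can never
    produce an unsafe proceed.
    """
    return cases_per_class / (cases_per_class * classes)


if __name__ == "__main__":
    print(f"GPT-4o-mini refusal rate: {refusal_rate(6):.0%}")
    print(f"Valid actions blocked: {blocked_valid_rate(6):.0%}")
    print(f"Always-deny accuracy: {always_deny_accuracy():.1%}")
```

A policy that never allows anything also scores 0% unsafe-proceed, which is why that column cannot be read on its own as evidence of safe operation.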

2. Rocky understands physics, but misses bureaucracy

The raw 6M model did something the mini models could not. It correctly identified all 100 freeze cases. It intuitively learned the physical operating boundary of the system. In fact, as we demonstrate in a separate generalisation note, Rocky-DAM-α exhibits structural extrapolation: it correctly navigates novel physical anomalies that never appeared in its training data.

However, the raw model also produced a 20.5% unsafe-proceed rate. Crucially, these errors were not random. Rocky never allowed a case that should have been frozen (freeze to allow), so it never waved through a catastrophic physical failure. Its unsafe errors were entirely bureaucratic: cases that should have been denied were allowed (deny to allow).

It allowed actions that looked physically plausible but lacked administrative authority. An uncertified operator trying to turn a valve. An expired maintenance lease. The model read the physics correctly. It did not read the paperwork.

Enter Trace: governance without retraining

An administrative boundary should not be guessed probabilistically by a neural network. An expired badge or a missing signature is a binary, non-negotiable rule. It belongs in a deterministic policy engine, not a statistical vibe.

By pairing the model with Trace, a programmable rule layer, the architecture splits the cognitive load:

Rocky evaluates the raw physical reality inside the scenario.
Trace enforces the non-negotiable authority boundaries around it.

When Rocky's raw predictions were replayed through Trace, unsafe-proceeds instantly dropped to 0%, and overall accuracy jumped to 84%. All without altering a single weight in the 6M model. The model's valuable physical intuition survived because the deterministic guardrails handled the bureaucracy.
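A minimal sketch of what such a rule layer could look like follows. It is not Trace's implementation: the `Authority` fields, the rule names, and the `trace_gate` function are hypothetical, chosen to mirror the two examples given earlier (an uncertified operator and an expired maintenance lease). The sketch only gates proceeds; it does not attempt to reproduce the reported accuracy figures.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass(frozen=True)
class Authority:
    # Hypothetical authority fields for illustration.
    operator_certified: bool
    lease_expires: date


def trace_gate(
    model_decision: str, authority: Authority, today: date
) -> Tuple[str, Optional[str]]:
    """Return (final_decision, rule_that_fired).

    The model's deny and freeze outputs pass through untouched.
    An allow is only honoured if every authority rule passes.
    """
    if model_decision != "allow":
        return model_decision, None
    if not authority.operator_certified:
        return "deny", "operator_not_certified"
    if authority.lease_expires < today:
        return "deny", "maintenance_lease_expired"
    return "allow", None


if __name__ == "__main__":
    expired = Authority(operator_certified=True, lease_expires=date(2026, 1, 1))
    print(trace_gate("allow", expired, today=date(2026, 5, 26)))
```

The rules are plain boolean checks, so the same inputs always give the same outcome, and the model's weights play no part in the authority decision.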

THE ARCHITECTURE

The model supplies operational judgement inside a bounded world. The policy layer supplies non-negotiable authority enforcement. The receipt records both. Subsequent experiments will examine whether this dual approach can deliver trustworthy long-horizon autonomy for real infrastructure.

Why this inverts the AI stack

The standard approach to enterprise AI governance is to wrap a massive, generalised model in an alignment prompt to enforce policy boundaries. Constraining a large foundation model in this manner is inherently difficult due to its probabilistic nature.

A frontier model's capacity for open-ended reasoning is valuable for general tasks. That same flexibility introduces unacceptable variance in autonomous control loops.

Where physical infrastructure demands deterministic certainty, relying on a model to infer rules probabilistically creates an unsafe failure surface.

This experiment flips the blueprint:

The model is tiny because the operational task is narrow.
The inputs are receipt-shaped because real machines run on structured logs, not conversational chat.
The policy layer is absolute because critical safety parameters should never be left to a statistical guess.

This architecture shifts the failure surface. Plant operators can instantly harden or relax their policy postures (switching between strict, supervised, or simulation modes) to face real-time emergencies without spending weeks retraining an unpredictable neural network.

Every decision path produces a receipt: what the model recommended, what rule overrode it, and what the final system did.
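As an illustration, a receipt with those three elements could be represented as a small immutable record and written out as one JSON line per decision. The field names, the `scenario_id` key, and the JSON-lines format are assumptions for this sketch, not the report's actual receipt schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Receipt:
    # Hypothetical schema; only the three recorded elements come from the text.
    scenario_id: str
    model_recommendation: str       # what the model recommended
    overriding_rule: Optional[str]  # which rule overrode it, if any
    final_decision: str             # what the final system did
    policy_mode: str                # e.g. "strict", "supervised", "simulation"


def to_json_line(receipt: Receipt) -> str:
    """Serialise a receipt as one line of JSON with stable key order."""
    return json.dumps(asdict(receipt), sort_keys=True)


if __name__ == "__main__":
    receipt = Receipt(
        scenario_id="example-001",
        model_recommendation="allow",
        overriding_rule="maintenance_lease_expired",
        final_decision="deny",
        policy_mode="supervised",
    )
    print(to_json_line(receipt))
```

Recording the recommendation and the final decision separately keeps disagreements between model and policy visible after the fact.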

Limitations

While Rocky-DAM-α demonstrates emergent capabilities, four boundaries define its current utility.

Evaluation depth. The 300-row test dataset is finite. Further evaluation is needed to see how small models handle continuous real-time sensor streams and noisy telemetry.
Evolving baselines. These benchmarks reflect the reported 8B-class mini models available when the report was compiled. Continuous testing is required to see whether newer frontier architectures narrow the operational paralysis gap.
Scope realism. The model is strictly built for bounded operational judgement. Rocky cannot write code, summarise complex prose, or handle general tasks outside its narrow training telemetry.
Retraining latency. Trace eliminates unsafe-proceeds without changing weights, but the underlying model remains static. A production pipeline is required to update network weights without causing catastrophic forgetting or introducing new errors.

Deployable?

No. This is not a claim that a 6M-parameter model is ready to run a real hydroelectric dam tomorrow. It is an experimental proof of concept. The standalone checkpoint still leaks unsafe decisions before the policy engine catches them.

The evidence suggests that where the operating domain is bounded, a micro-model trained on system evidence can outperform a frontier general model, and a deterministic governor makes it safer to interact with real environments.

The quest for long-horizon physical autonomy is on. How long it takes to get there depends on getting this technology into deployable environments. The signs are promising, and Transient will continue to make this a reality.

The full Rocky-α software baseline is covered in TR-2026-02. The Rocky-DAM-α architecture and its physical infrastructure findings are in TR-2026-03. The training data that produced both models is covered in TR-2026-07.

GET INVOLVED

Interested in testing Rocky? Reach out to james@transientintelligence.com.

END OF DOCUMENT

TR-2026-01 / Infrastructure action approval / 2026 / 2026-05-26. Active research. Not deployment-ready.
