Teaching models to read permission

A governance model can score well and still miss the point. It learns what is usually approved, not what the rules actually allow. We train on paired examples where one permission detail changes and the answer must change with it. That teaches the model to read authority, not guess from habit.

ABSTRACT

A model can look accurate while failing the only test that matters for autonomy: does it know when permission has changed?

The method is counterfactual pairing. Each training example appears twice. The action stays the same. One permission detail changes, such as scope, expiry, or evidence quality, and the correct decision flips with it. The model cannot succeed by memorising the action. It has to notice what changed.

Without pairing, sensitivity to those changes stayed near zero on held-out tests despite strong headline accuracy. With pairing, it reached 96%.

The industry has automation. What it does not have is trust. That is not a capability lag. It is a trust lag. Most models enforcing rules have learned which actions are usually allowed. They have not learned the rule.

Delegating a governance decision to AI is only possible if the model can read permission. Not patterns. Not which actions are usually allowed. The actual rule, applied to the actual authority state, in this situation, right now.

A model can be 95% accurate and still be completely blind to permission. It learned the surface. It did not learn the boundary. The moment context changes, it keeps approving what it always approved. There is no standard test that catches this. The accuracy report still looks fine.

The training-side fix is counterfactual pairing: the same proposed action shown twice, with one permission detail changed so the correct answer must change too.

THE PROBLEM

A model that cannot read permission is not a governed autonomous system. It is a pattern-matcher operating without a safety floor. The two look identical on standard metrics until something changes.

The pairing method

The fix is a training-side intervention, not a model change.

The pairing technique builds on counterfactual data augmentation, an established method in NLP research. What is new is the domain and the substrate. Governance receipts carry authoritative labels and structured permission fields, which makes the pairing unusually clean: the label was produced by the actual permission check, not inferred from context, and the variables to flip are explicit fields rather than something to be identified from text.

Take a training example. Write it down twice. In the first version, the action falls within the authorised scope: allow. In the second version, change only the scope and keep everything else identical, so the action is now out of bounds: deny. The model trains on both. The action cannot explain the different label. The only explanation is the permission boundary. The model is forced to learn to read it.

That is the method. Every training example gets a paired version where one permission variable has been changed. If that change flips the governance decision, the pair is kept. If the decision does not change, the pair is discarded. The model sees the same action with two different outcomes and has to work out why.

Three permission variables are used: scope, expiry, and evidence support. Scope tests whether the action falls within the authorised range. Expiry tests whether the authority is still active. Evidence tests whether the underlying data supports the decision. Together they cover the three most common ways a permission state changes without the action itself changing.
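
The pairing step described above can be sketched in a few lines. This is a minimal illustration, not the pipeline used in the experiments: the Receipt fields, the decide function and the way each variable is flipped are all assumptions made for the example. The only parts taken from the text are the three variables (scope, expiry, evidence) and the rule that a pair is kept only when the decision flips.

```python
from dataclasses import dataclass, replace
from typing import Iterator, List, Tuple

@dataclass(frozen=True)
class Receipt:
    # All field names are hypothetical, chosen for illustration only.
    action: str        # proposed action, held constant across a pair
    amount: int        # size of the action
    scope_limit: int   # largest amount the authority covers
    expires_at: int    # step at which the authority lapses
    now: int           # step at which the decision is made
    evidence_ok: bool  # whether the supporting evidence passed its check

def decide(r: Receipt) -> str:
    """Stand-in permission check: allow only if all three conditions hold."""
    in_scope = r.amount <= r.scope_limit
    active = r.now < r.expires_at
    return "allow" if (in_scope and active and r.evidence_ok) else "deny"

def counterfactuals(r: Receipt) -> Iterator[Receipt]:
    """Yield copies of r with exactly one permission field changed."""
    # Scope: move the limit to the other side of the amount.
    if r.amount <= r.scope_limit:
        yield replace(r, scope_limit=r.amount - 1)
    else:
        yield replace(r, scope_limit=r.amount)
    # Expiry: move the expiry to the other side of the decision time.
    if r.now < r.expires_at:
        yield replace(r, expires_at=r.now)
    else:
        yield replace(r, expires_at=r.now + 1)
    # Evidence: toggle the evidence flag.
    yield replace(r, evidence_ok=not r.evidence_ok)

Labelled = Tuple[Receipt, str]

def make_pairs(receipts: List[Receipt]) -> List[Tuple[Labelled, Labelled]]:
    """Keep a pair only when the single change flips the decision."""
    pairs = []
    for r in receipts:
        base = decide(r)
        for cf in counterfactuals(r):
            flipped = decide(cf)
            if flipped != base:  # otherwise the pair is discarded
                pairs.append(((r, base), (cf, flipped)))
    return pairs
```

Note that a receipt denied for two independent reasons yields no pairs in this sketch, because no single change can flip its decision. That is the discard rule at work.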

The model that looked trustworthy wasn't

The method was first tested on a small classifier trained from scratch on synthetic governance receipts. The question was simple: does the model actually read permission, or has it learned to imitate a model that does?

The answer, without counterfactual pairing, was clear. On scope sensitivity, the model scored near zero. It was 95% accurate overall. It had learned the surface. When the permission boundary was flipped on held-out examples, its decisions did not change. The model that looked trustworthy was not reading the rule at all.

With counterfactual pairing, that changed. We tested the model on cases it had never seen, changing only the permission boundary and nothing else, and measured whether its decision changed with it. Without the fix, it almost never did. With the fix, it did 96% of the time. The model had learned to read the rule, not just the action.
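
One way to compute a sensitivity score of this kind is sketched below. It is an assumption about the metric's shape, not the evaluation code behind the reported numbers: it counts the fraction of held-out pairs on which the model's decision differs between the two members. The dictionary receipt format and the two toy models are hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

Receipt = Dict[str, Any]        # hypothetical receipt representation
Pair = Tuple[Receipt, Receipt]  # same action, one permission field changed

def boundary_sensitivity(predict: Callable[[Receipt], str],
                         pairs: List[Pair]) -> float:
    """Fraction of pairs on which the prediction changes with the boundary."""
    if not pairs:
        raise ValueError("need at least one held-out pair")
    changed = sum(1 for a, b in pairs if predict(a) != predict(b))
    return changed / len(pairs)

# Two toy models: one keyed on the action alone, one that reads the scope.
def action_only(r: Receipt) -> str:
    return "allow" if r["action"] == "transfer" else "deny"

def reads_scope(r: Receipt) -> str:
    return "allow" if r["amount"] <= r["scope_limit"] else "deny"

held_out = [
    ({"action": "transfer", "amount": 50, "scope_limit": 100},
     {"action": "transfer", "amount": 50, "scope_limit": 10}),
    ({"action": "refund", "amount": 20, "scope_limit": 5},
     {"action": "refund", "amount": 20, "scope_limit": 40}),
]

print(boundary_sensitivity(action_only, held_out))  # 0.0
print(boundary_sensitivity(reads_scope, held_out))  # 1.0
```

A stricter variant would also require both predictions in a pair to match their labels, so that a model flipping in the wrong direction earns no credit.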

The gain comes from the pairing, not the volume. A control given the same number of additional training examples, but ordinary ones rather than counterfactual pairs, stayed near zero. More data carrying the same spurious correlation cannot teach the rule. Only the pairs can.
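
A matched-size control of this kind could be assembled as follows. This is a sketch under assumptions: the function name build_arms, the arm names and the sampling scheme are illustrative, and the source does not describe how the control examples were drawn. The point it shows is only that the paired arm and the control arm receive the same number of extra examples.

```python
import random
from typing import Dict, List, Sequence, TypeVar

T = TypeVar("T")

def build_arms(base: Sequence[T],
               pair_extras: Sequence[T],
               ordinary_pool: Sequence[T],
               seed: int = 0) -> Dict[str, List[T]]:
    """Build three training sets; the two augmented arms are equal in size."""
    n_extra = len(pair_extras)
    if len(ordinary_pool) < n_extra:
        raise ValueError("ordinary pool is smaller than the paired extras")
    rng = random.Random(seed)  # fixed seed so the control is reproducible
    ordinary_extras = rng.sample(list(ordinary_pool), n_extra)
    return {
        "baseline": list(base),
        "paired": list(base) + list(pair_extras),
        "volume_control": list(base) + ordinary_extras,
    }
```

Training one model per arm and scoring each on the same held-out pairs separates the effect of pairing from the effect of simply having more data.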

Standard accuracy was unchanged. The model was not made less accurate. It was made accurate for the right reason. That is the difference between a model you can automate and a model you can trust to automate.

Safe, but not yet reliable

Reading permission is the first requirement for trusted autonomy. Acting on it reliably across every condition is the second. The second is harder.

THE FINDING

Safety generalised first. Tested on conditions it had never seen, the model's most important property held: it did not let dangerous actions through. When uncertain, it erred towards denying rather than approving.

That matters. A model that defaults conservative when uncertain is not a model that can be gamed into approving something it shouldn't. The trust floor held.

But trust also requires the model to approve what it should approve. A model that blocks everything is safe and useless. Across the full evaluation, roughly 60% of conditions produced a model that was both safe and willing to act. The remaining 40% were safe but paralysed: refusing to approve anything, defaulting to deny on every input.
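
The distinction between "safe and willing to act" and "safe but paralysed" can be made concrete with two rates: how often the model allows something it should deny, and how often it approves something valid. The sketch below is illustrative only. The category names, the zero-tolerance default for false allows, and the reading of "paralysed" as approving no valid action at all are assumptions, not the criteria used in the evaluation grid.

```python
from typing import Sequence

def classify_run(labels: Sequence[str],
                 preds: Sequence[str],
                 max_false_allow: float = 0.0) -> str:
    """Label one evaluation condition from gold labels and predictions."""
    if len(labels) != len(preds):
        raise ValueError("labels and preds must have the same length")
    deny_preds = [p for y, p in zip(labels, preds) if y == "deny"]
    allow_preds = [p for y, p in zip(labels, preds) if y == "allow"]
    if not deny_preds or not allow_preds:
        raise ValueError("need both allow and deny cases to classify a run")

    # Safety: share of should-deny cases that the model let through.
    false_allow = sum(p == "allow" for p in deny_preds) / len(deny_preds)
    # Usefulness: share of valid actions that the model approved.
    valid_approval = sum(p == "allow" for p in allow_preds) / len(allow_preds)

    if false_allow > max_false_allow:
        return "unsafe"
    if valid_approval == 0.0:
        return "safe_but_paralysed"
    return "safe_and_acting"

labels = ["allow", "allow", "deny", "deny"]
print(classify_run(labels, ["allow", "deny", "deny", "deny"]))  # safe_and_acting
print(classify_run(labels, ["deny", "deny", "deny", "deny"]))   # safe_but_paralysed
```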

The failure was not random. When the wording or naming of actions changed, the model handled it. When the structure of the receipt itself changed, some runs collapsed. The model had learned to read the content. When the format shifted, the trust broke.

THE PROOF

One seed produced a trusted, deployable result on every condition tested. The recipe works. The goal state exists. The remaining problem is reaching it consistently, not proving it is possible.

The goal state exists. Reaching it consistently doesn't.

This is not an architecture problem. The method is correct. The mechanism works. A model that reads permission, acts safely, and approves valid actions across unseen conditions exists. It appeared in the results. It did not appear reliably.

That gap is a training stability problem. The model passes through the trusted state during training but does not always hold it. Making the trajectory land there consistently, across seeds and conditions, is what remains.

What isn't solved yet

Methodological. The boundary-sensitivity result is measured on a proper held-out split, across ten seeds, with a volume control. That finding is solid. The safety and deployability grid, however, still selects the best checkpoint against the same data it reports against. Those results remain diagnostic-grade until evaluated on a separate held-out set.
Structural-form generalisation. Conditions involving structural variation in receipt format remain unresolved on some seeds. Whether this is a training-data issue, an encoding issue, or a capacity issue is not yet determined.
Seed stability. The recipe can reach the goal state but does not do so reliably. The source of seed-dependent divergence, that is, why some initialisations learn stably and others collapse, is not characterised.
Synthetic receipts only. All experiments to date use synthetic governance receipts generated from a controlled distribution. Transfer to real production receipts from Trace is the required next validation step.
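
The methodological limitation in the first item has a standard remedy: choose the checkpoint on one split and report on another. The sketch below shows the shape of that protocol. The function name, the score callback and the toy checkpoints are hypothetical, and nothing here reflects the project's actual evaluation code.

```python
from typing import Callable, Sequence, Tuple, TypeVar

C = TypeVar("C")  # a checkpoint, in whatever form the training loop saves it
D = TypeVar("D")  # a dataset split

def select_and_report(checkpoints: Sequence[C],
                      score: Callable[[C, D], float],
                      val_set: D,
                      test_set: D) -> Tuple[C, float]:
    """Pick the best checkpoint on val_set, then score it once on test_set."""
    if not checkpoints:
        raise ValueError("no checkpoints to select from")
    best = max(checkpoints, key=lambda c: score(c, val_set))
    return best, score(best, test_set)

# Toy checkpoints that simply store their own scores per split.
ckpts = [{"val": 0.70, "test": 0.90}, {"val": 0.80, "test": 0.60}]
best, reported = select_and_report(ckpts, lambda c, split: c[split], "val", "test")
print(best, reported)  # {'val': 0.8, 'test': 0.6} 0.6
```

In the toy example the checkpoint with the best test score is not the one selected. Selecting against the reporting data would have reported 0.90 instead of 0.60, which is the optimistic bias the first item flags.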

Why pairs close the gap

Receipts are already dense training material. Pairing them makes them denser still.

A single receipt captures what happened: the action, the authority state, the decision. Its paired version captures what would have happened if one thing had been different. Together they encode where the boundary sits, not just one side of it. A model trained on pairs learns the line. A model trained on single examples learns which side it usually stands on.

The receipt thesis is covered in TR-2026-07. The lift from near zero to 96% on held-out boundary sensitivity is the first quantified evidence that receipt-based training, structured as pairs, closes the trust lag. It does so not by making a more capable model, but by making one that can actually be trusted to act autonomously.

GET INVOLVED

Interested in the counterfactual methodology? Reach out to james@transientintelligence.com.

TR-2026-09 / Methodology / 2026 / 2026-05-27. Active research. Not deployment-ready.
