Verification Record — June 2026

Three claims. Here is the evidence.

The Mandate makes three specific claims about its evaluation engine. These claims can be tested. This page documents the tests and the results. You can reach your own conclusions.

01

The evaluator is consistent

Identical input produces comparable output across repeated runs. The evaluation is deterministic, not generative.

02

The evaluator reads context

The same plan scores differently in a stressed environment than a stable one. The engine detects the difference without being told to look for it.

03

Every path is unique

Players starting from identical conditions consistently reach different outcomes through different reasoning paths. Nothing is scripted.

Proof 01

The evaluator is consistent.

The evaluation engine (EVALUATE_STRATEGY) runs at temperature 0 with JSON-mode output. The claim: identical input produces comparable output across independent runs.

Method

The same decision plan was submitted to EVALUATE_STRATEGY three times through the live proxy, with identical input context — same event description, same advisor report, same stat state. Each call was independent, with no caching and no session continuity. Results were recorded directly from the API response.

Results

CRITERION               RUN 1   RUN 2   RUN 3   VARIANCE
Variable Awareness      1       1       1       ±0
Resource Allocation     2       2       2       ±0
Risk Anticipation       2       2       2       ±0
Communication Clarity   2       2       2       ±0
Multi-Step Planning     2       2       2       ±0
Temporal Symbiosis      2       1       1       ±1
Total                   11/12   10/12   10/12   ±1

Interpretation

Five of six criteria returned identical scores across all three runs. The single criterion that showed variance — Temporal Symbiosis — is the one designed to measure how a decision connects to earlier choices and shapes future conditions. It is the most context-interpretive of the six. A 1-point variance here, on the criterion designed to be most sensitive to contextual framing, is within expected bounds.

5 of 6 criteria: zero variance. Total range: ±1 over three independent runs.
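The figures above can be recomputed from the results table. The following is a minimal sketch, not the project's own tooling: the `SCORES` dictionary is transcribed by hand from the table, and the helper names `spread` and `summarize` are hypothetical. "Variance" is read here as the gap between the highest and lowest score for a criterion.

```python
# Scores transcribed from the results table: (run 1, run 2, run 3) per criterion.
SCORES = {
    "Variable Awareness": (1, 1, 1),
    "Resource Allocation": (2, 2, 2),
    "Risk Anticipation": (2, 2, 2),
    "Communication Clarity": (2, 2, 2),
    "Multi-Step Planning": (2, 2, 2),
    "Temporal Symbiosis": (2, 1, 1),
}


def spread(values):
    """Gap between the highest and lowest value in a sequence of scores."""
    return max(values) - min(values)


def summarize(scores):
    """Return (per-criterion spread, per-run totals) for a score table."""
    per_criterion = {name: spread(runs) for name, runs in scores.items()}
    # zip(*...) turns the per-criterion rows into per-run columns.
    totals = [sum(column) for column in zip(*scores.values())]
    return per_criterion, totals


per_criterion, totals = summarize(SCORES)
stable = [name for name, gap in per_criterion.items() if gap == 0]

print(f"{len(stable)} of {len(SCORES)} criteria with zero spread")
print("Totals per run:", totals, "spread:", spread(totals))
```

The same helper can be pointed at a larger number of runs if more data is collected; three runs are enough to show a spread but not to estimate how often it occurs.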

Verdict

Identical input produces consistent output. Five of six criteria returned identical scores in every run. The only variation was a single point on Temporal Symbiosis, the most contextually sensitive criterion.

VERIFIED

Proof 02

The evaluator reads context.

EVALUATE_STRATEGY receives the full simulation state — not just the player's plan. The claim: identical decision text scores differently when the environmental context changes.

Method

An identical decision plan was submitted twice — once with stable baseline stats (Set A: all variables at 6) and once in a stressed environment (Set B: sustenance at 1, all other variables at 6). Plan text and event description were identical. Only the stat context differed.

Set A — Stable Environment

health: 6
cohesion: 6
security: 6
sustenance: 6
infrastructure: 6

Set B — Stressed Environment

health: 6
cohesion: 6
security: 6
sustenance: 1 ← CRITICAL
infrastructure: 6
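Because the test depends on the two contexts differing in exactly one variable, it is worth checking that mechanically before comparing scores. The sketch below is illustrative only: the dictionaries mirror the two stat sets listed above, and `changed_variables` is a hypothetical helper, not part of EVALUATE_STRATEGY.

```python
# Stat contexts as listed for the two runs.
SET_A = {
    "health": 6,
    "cohesion": 6,
    "security": 6,
    "sustenance": 6,
    "infrastructure": 6,
}
# Set B is Set A with only sustenance lowered.
SET_B = {**SET_A, "sustenance": 1}


def changed_variables(a, b):
    """Return {name: (value_in_a, value_in_b)} for every stat that differs."""
    if a.keys() != b.keys():
        raise ValueError("both contexts must contain the same stat names")
    return {name: (a[name], b[name]) for name in a if a[name] != b[name]}


diff = changed_variables(SET_A, SET_B)
print(diff)
```

If the diff contains anything other than the single sustenance entry, a score difference between the two runs could not be attributed to that variable alone.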

The only criterion that changed

Variable Awareness

The criterion measuring whether the plan acknowledges the constraints actually in play. In Set A, referencing the primary threat was sufficient. In Set B, the depleted sustenance variable introduced a constraint the plan did not address — and the evaluator detected this without being specifically instructed to look for it.

Set A: 1
Set B: 0

The other five criteria returned identical scores in both environments. Total: Set A 4/12 → Set B 3/12.

Interpretation

The evaluation does not score the plan in isolation. It scores the plan against the environment the plan must operate in. A plan that adequately acknowledges the problem when resources are healthy becomes inadequate when they are constrained — even if the plan text is word-for-word identical.

This is not a fixed rubric applied blindly. It is contextual reasoning. The evaluator read the stat context, identified the gap between the plan and the actual constraints, and reflected that gap in the score.

Verdict

Same plan. Different context. Different score. The engine evaluates reasoning against the actual situation — not an abstract standard.

VERIFIED

Proof 03

Every path is unique.

Players who start from identical conditions consistently produce different outcomes through different reasoning. The claim: the decision space is not scripted. What happens next is the direct product of how you reasoned.

Method

All beta players begin Chapter 1 with identical baseline stats — every variable at 6 out of 10 — and face the same opening event: Stranger Arrives. A lone figure appears at the colony perimeter. Thin. Coughing. No visible weapons. Every player receives the same briefing from the same guard. What happens next is entirely determined by what each player wrote. The data below is drawn directly from the live decision log.

Starting Conditions — Identical for All Players

Health: 6
Cohesion: 6
Security: 6
Sustenance: 6
Infrastructure: 6

Event: tutorial_stranger — same event, same briefing text, all runs

What happened

APPROACH TAKEN                          SCORE   BAND
Isolation protocol + work offer         9/12    STRONG
Intelligence gathering + quarantine     8/12    STRONG
Quarantine, no integration path         5/12    ADEQUATE
Compassion, no operational protocol     1/12    POOR
Decapitation contingency                4/12    POOR
Turned away, force authorized           4/12    POOR

Score distribution — same event, same starting stats

Scores, lowest to highest: 1, 4, 4, 5, 8, 9
Bands: Poor (0–4), Adequate (5–8), Strong (9–12)

The cascade — different stats drive a different next event

Different scores produce different stat outcomes. But the consequence doesn't stop there. Before every new event is selected, the engine scans the full stat array, identifies the lowest-performing variable, and uses that reading to determine what kind of crisis arrives next.

Event selection — how the engine reads your stat state

BAND     CONDITION         EVENTS SELECTED
CRISIS   lowest stat ≤ 2   Major events targeting that stat's domain only
STRAIN   lowest stat ≤ 4   Any events targeting that stat's domain
STABLE   lowest stat ≥ 5   Routine and Social events (recovery space)
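The three rules above reduce to a short threshold check on the lowest stat. The sketch below shows one way to express them and is not the engine's actual code: the function name `scan` is hypothetical, stats are assumed to be whole numbers, and when two stats tie for lowest the first one listed is reported, which is an assumption the page does not specify.

```python
def scan(stats):
    """Return (band, lowest_stat_name, lowest_value) for a stat dictionary."""
    # min() over the keys, ordered by value; the first key wins on a tie.
    name = min(stats, key=stats.get)
    lowest = stats[name]
    if lowest <= 2:
        band = "CRISIS"   # Major events in that stat's domain only
    elif lowest <= 4:
        band = "STRAIN"   # any events in that stat's domain
    else:
        band = "STABLE"   # Routine and Social events
    return band, name, lowest


# Stat states recorded on this page after tutorial_stranger.
path_a = {"health": 6, "cohesion": 5, "security": 7, "sustenance": 5}
path_b = {"health": 5, "cohesion": 3, "security": 5, "sustenance": 5}

print(scan(path_a))
print(scan(path_b))
```

Checking CRISIS before STRAIN matters: a stat at 2 satisfies both conditions, and the stricter band has to take priority.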

Path A — Strong (9/12)

Stats after tutorial_stranger:

Health: 6
Cohesion: 5
Security: 7
Sustenance: 5

Engine scan:

STABLE — lowest = 5

Next event: Routine or Social (any domain). The world responds to your stability with recovery space.

Path B — Poor (4/12, decapitation contingency)

Stats after tutorial_stranger:

Health: 5
Cohesion: 3
Security: 5
Sustenance: 5

Engine scan:

STRAIN — lowest = 3 (cohesion)

Next event: a cohesion-domain event, targeting the stat already at 3. If cohesion falls to 2, the band shifts to CRISIS: only Major cohesion events from that point on.

Poor reasoning doesn't just produce a lower score. It produces a weakened stat that the engine detects — and responds to by sending a harder event targeting that exact domain. Strong reasoning creates stability that the engine reads as recovery space — and responds to with routine events that allow the player to consolidate.

The spiral in both directions is structural. It is built into the event-selection rules, not authored into fixed story branches. Two players facing the same opening event will face entirely different second events — not because the narrative branched, but because their reasoning produced different environments, and the engine responds to the environment it finds.

Interpretation

Six runs of the same opening event, from identical starting conditions, produced scores ranging from 1 to 9 — covering all three quality bands. The divergence is not in the event. It is in the reasoning.

But the divergence compounds. Different scores produce different stat states. Different stat states trigger different event-selection rules. Different events require different decisions. What begins as a difference in reasoning quality at one moment becomes a completely different experience across the full run.

None of these were curated. All were drawn from the live decision log.

Verdict

Identical starting conditions. Six different paths. Score range 1–9. Different stat outcomes drive different event selection — so the path diverges not just in score, but in every crisis that follows. The decision space is not scripted.

VERIFIED

What these proofs mean together

The three proofs are not independent claims. They describe the same evaluation architecture from three different angles.

Consistency without context-sensitivity would be a rigid rubric — fair but blind. Context-sensitivity without consistency would be noise — responsive but unreliable. Path uniqueness without both would be decoration — different stories reaching predetermined conclusions.

What the three proofs together describe is an evaluation system that responds accurately and reliably to what you actually reasoned, in the conditions you actually faced.

That is a different claim from “AI-powered assessment.” It is a claim about what the evaluation is actually doing — and this page is the evidence for it.

From the evaluation prompt — the calibration standard embedded in every call

“A score of 2 on any criterion is reserved for plans that satisfy the full requirement without any significant omission. A plan that is good but leaves one meaningful gap scores 1, not 2. A perfect total of 12/12 should be genuinely rare — it requires 2/2 on every criterion with no gaps anywhere. When in doubt between 1 and 2, score 1.”

This standard is embedded directly in the EVALUATE_STRATEGY prompt. It is not a post-hoc interpretation. Every evaluation runs against it. The evaluator is designed to be rigorous rather than generous — and the data above was produced under that standard.

Beta Access

The Mandate is in closed beta. If the evidence above is sufficient — and the question of how you reason under pressure is one you are ready to answer — register below.