Verification Record — June 2026
Three claims. Here is the evidence.
The Mandate makes three specific claims about its evaluation engine. These claims can be tested. This page documents the tests and the results. You can reach your own conclusions.
The evaluator is consistent
Identical input produces comparable output across repeated runs. The evaluation is configured for repeatability, not creative variation.
The evaluator reads context
The same plan scores differently in a stressed environment than in a stable one. The engine detects the difference without being told to look for it.
Every path is unique
Players starting from identical conditions consistently reach different outcomes through different reasoning paths. Nothing is scripted.
Proof 01
The evaluator is consistent.
The evaluation engine (EVALUATE_STRATEGY) runs at temperature 0 with JSON-mode output. The claim: identical input produces comparable output across independent runs.
Method
The same decision plan was submitted to EVALUATE_STRATEGY three times through the live proxy, with identical input context — same event description, same advisor report, same stat state. Each call was independent, with no caching and no session continuity. Results were recorded directly from the API response.
Results
| CRITERION | RUN 1 | RUN 2 | RUN 3 | RANGE (MAX − MIN) |
|---|---|---|---|---|
| Variable Awareness | 1 | 1 | 1 | 0 |
| Resource Allocation | 2 | 2 | 2 | 0 |
| Risk Anticipation | 2 | 2 | 2 | 0 |
| Communication Clarity | 2 | 2 | 2 | 0 |
| Multi-Step Planning | 2 | 2 | 2 | 0 |
| Temporal Symbiosis | 2 | 1 | 1 | 1 |
| Total | 11/12 | 10/12 | 10/12 | 1 |
Interpretation
Five of six criteria returned identical scores across all three runs. The single criterion that varied, Temporal Symbiosis, measures how a decision connects to earlier choices and shapes future conditions. It is the most context-interpretive of the six. It scored 2 in Run 1 and 1 in Runs 2 and 3, so one score out of 18 departed from the others, and it did so on the criterion most sensitive to contextual framing.
5 of 6 criteria: zero variance. Total range: 1 point (10 to 11 out of 12) over three independent runs.
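The spread figures can be recomputed from the results table. The sketch below is only an illustration of that arithmetic: the scores are copied from the table, while the function names are invented here and are not part of EVALUATE_STRATEGY.

```python
# Scores copied from the results table: criterion -> (run 1, run 2, run 3).
RUNS = {
    "Variable Awareness": (1, 1, 1),
    "Resource Allocation": (2, 2, 2),
    "Risk Anticipation": (2, 2, 2),
    "Communication Clarity": (2, 2, 2),
    "Multi-Step Planning": (2, 2, 2),
    "Temporal Symbiosis": (2, 1, 1),
}


def score_range(scores):
    """Spread between the highest and lowest value in a sequence of scores."""
    return max(scores) - min(scores)


def run_totals(runs):
    """Sum each run's column to get that run's total out of 12."""
    return [sum(column) for column in zip(*runs.values())]


# Per-criterion spread, the criteria with no spread, and the per-run totals.
ranges = {name: score_range(scores) for name, scores in RUNS.items()}
stable = [name for name, spread in ranges.items() if spread == 0]
totals = run_totals(RUNS)

print(f"{len(stable)} of {len(RUNS)} criteria had zero spread")
print("Run totals:", totals, "spread:", score_range(totals))
```

With the table's data this reports five stable criteria and run totals of 11, 10 and 10.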
Verdict
Identical input produces consistent output. Across three runs the engine was repeatable at the criterion level, with an observed 1-point spread on the most contextually sensitive criterion.
Proof 02
The evaluator reads context.
EVALUATE_STRATEGY receives the full simulation state — not just the player's plan. The claim: identical decision text scores differently when the environmental context changes.
Method
An identical decision plan was submitted twice — once with stable baseline stats (Set A: all variables at 6) and once in a stressed environment (Set B: sustenance at 1, all other variables at 6). Plan text and event description were identical. Only the stat context differed.
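As a rough sketch of how such a controlled pair could be built and checked, the snippet below constructs two requests and confirms that only one stat differs. The request field names and most variable names are assumptions for illustration; the page names only sustenance and cohesion, and the real payload shape is not documented here.

```python
# Hypothetical variable names: only "sustenance" and "cohesion" appear on this
# page; "security" and "health" are placeholders.
VARIABLES = ["sustenance", "cohesion", "security", "health"]


def baseline_stats(value=6):
    """Set A: every variable at the same baseline value."""
    return {name: value for name in VARIABLES}


def build_request(plan, event, stats):
    """Bundle the plan with its context. Field names are assumed."""
    return {"plan": plan, "event": event, "stats": dict(stats)}


def context_diff(request_a, request_b):
    """Return every text field or stat that differs between two requests."""
    diff = {}
    for field in ("plan", "event"):
        if request_a[field] != request_b[field]:
            diff[field] = (request_a[field], request_b[field])
    for name in request_a["stats"]:
        a, b = request_a["stats"][name], request_b["stats"][name]
        if a != b:
            diff[name] = (a, b)
    return diff


plan = "placeholder plan text"
event = "placeholder event description"
set_a = build_request(plan, event, baseline_stats())
set_b = build_request(plan, event, {**baseline_stats(), "sustenance": 1})

print(context_diff(set_a, set_b))  # {'sustenance': (6, 1)}
```

Checking the diff before submitting guards against an accidental change to the plan or event text, which would otherwise confound the comparison.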
[Score panels: Set A (Stable Environment) and Set B (Stressed Environment)]
The only criterion that changed: Variable Awareness
This criterion measures whether the plan acknowledges the constraints actually in play. In Set A, referencing the primary threat was sufficient. In Set B, the depleted sustenance variable introduced a constraint the plan did not address, and the evaluator detected this without being specifically instructed to look for it.
Interpretation
The evaluation does not score the plan in isolation. It scores the plan against the environment the plan must operate in. A plan that adequately acknowledges the problem when resources are healthy becomes inadequate when they are constrained — even if the plan text is word-for-word identical.
This is not a fixed rubric applied blindly. It is contextual reasoning. The evaluator read the stat context, identified the gap between the plan and the actual constraints, and reflected that gap in the score.
Verdict
Same plan. Different context. Different score. The engine evaluates reasoning against the actual situation — not an abstract standard.
Proof 03
Every path is unique.
Players who start from identical conditions consistently produce different outcomes through different reasoning. The claim: the decision space is not scripted. What happens next is the direct product of how you reasoned.
Method
All beta players begin Chapter 1 with identical baseline stats — every variable at 6 out of 10 — and face the same opening event: Stranger Arrives. A lone figure appears at the colony perimeter. Thin. Coughing. No visible weapons. Every player receives the same briefing from the same guard. What happens next is entirely determined by what each player wrote. The data below is drawn directly from the live decision log.
Starting Conditions — Identical for All Players
Event: tutorial_stranger — same event, same briefing text, all runs
What happened
| APPROACH TAKEN | SCORE | BAND |
|---|---|---|
| Isolation protocol + work offer | 9/12 | STRONG |
| Intelligence gathering + quarantine | 8/12 | STRONG |
| Quarantine — no integration path | 5/12 | ADEQUATE |
| Compassion — no operational protocol | 1/12 | POOR |
| Decapitation contingency | 4/12 | POOR |
| Turned away — force authorized | 4/12 | POOR |
Score distribution — same event, same starting stats
The cascade — different stats drive a different next event
Different scores produce different stat outcomes. But the consequence doesn't stop there. Before every new event is selected, the engine scans the full stat array, identifies the lowest-performing variable, and uses that reading to determine what kind of event arrives next.
Event selection — how the engine reads your stat state
Path A — Strong (9/12)
Engine scan of stats after tutorial_stranger: STABLE (lowest = 5)
Next event: Routine or Social (any domain). The world responds to your stability with recovery space.
Path B — Poor (4/12, decapitation contingency)
Engine scan of stats after tutorial_stranger: STRAIN (lowest = 3, cohesion)
Next event: Cohesion-domain crisis — targeting the stat already at 3. If cohesion falls to 2, the band shifts to CRISIS: only Major cohesion events from that point on.
Poor reasoning doesn't just produce a lower score. It produces a weakened stat that the engine detects — and responds to by sending a harder event targeting that exact domain. Strong reasoning creates stability that the engine reads as recovery space — and responds to with routine events that allow the player to consolidate.
The spiral in both directions is structural. It is built into the event-selection rules, not authored into fixed story branches. Two players facing the same opening event will face entirely different second events — not because the narrative branched, but because their reasoning produced different environments, and the engine responds to the environment it finds.
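The selection rule described in this section can be sketched roughly as follows. This is a reading of the page, not the engine's code: the page shows STABLE at a lowest stat of 5, STRAIN at 3, and CRISIS at 2, so the exact band boundaries (in particular where a stat of 4 falls), the tie-breaking rule, and all function and field names are assumptions.

```python
def classify_band(lowest):
    """Map the lowest stat to a band. Boundaries are assumed from the
    examples on this page (5 -> STABLE, 3 -> STRAIN, 2 -> CRISIS)."""
    if lowest <= 2:
        return "CRISIS"
    if lowest <= 4:
        return "STRAIN"
    return "STABLE"


def select_next_event(stats):
    """Scan the stat array, find the weakest variable, and describe
    which kind of event the next selection should draw from."""
    # min() returns the first lowest key on ties; the real tie-break is unknown.
    domain = min(stats, key=stats.get)
    band = classify_band(stats[domain])
    if band == "STABLE":
        return {"band": band, "domain": "any", "kinds": ["Routine", "Social"]}
    if band == "STRAIN":
        return {"band": band, "domain": domain, "kinds": ["Crisis"]}
    return {"band": band, "domain": domain, "kinds": ["Major"]}


# Illustrative stat arrays: only the lowest values match the page.
path_a = {"sustenance": 6, "cohesion": 5}
path_b = {"sustenance": 6, "cohesion": 3}

print(select_next_event(path_a))
print(select_next_event(path_b))
```

Under these assumptions Path A draws from Routine or Social events in any domain, while Path B draws a cohesion-domain crisis. Dropping cohesion to 2 would restrict selection to Major cohesion events.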
Interpretation
Six runs of the same opening event, from identical starting conditions, produced scores ranging from 1 to 9 — covering all three quality bands. The divergence is not in the event. It is in the reasoning.
But the divergence compounds. Different scores produce different stat states. Different stat states trigger different event-selection rules. Different events require different decisions. What begins as a difference in reasoning quality at one moment becomes a completely different experience across the full run.
None of these were curated. All were drawn from the live decision log.
Verdict
Identical starting conditions. Six different paths. Score range 1–9. Different stat outcomes drive different event selection — so the path diverges not just in score, but in every crisis that follows. The decision space is not scripted.
What these proofs mean together
The three proofs are not independent claims. They describe the same evaluation architecture from three different angles.
Consistency without context-sensitivity would be a rigid rubric — fair but blind. Context-sensitivity without consistency would be noise — responsive but unreliable. Path uniqueness without both would be decoration — different stories reaching predetermined conclusions.
What the three proofs together describe is an evaluation system that responds accurately and reliably to what you actually reasoned, in the conditions you actually faced.
That is a different claim from “AI-powered assessment.” It is a claim about what the evaluation is actually doing — and this page is the evidence for it.
From the evaluation prompt — the calibration standard embedded in every call
“A score of 2 on any criterion is reserved for plans that satisfy the full requirement without any significant omission. A plan that is good but leaves one meaningful gap scores 1, not 2. A perfect total of 12/12 should be genuinely rare — it requires 2/2 on every criterion with no gaps anywhere. When in doubt between 1 and 2, score 1.”
This standard is embedded directly in the EVALUATE_STRATEGY prompt. It is not a post-hoc interpretation. Every evaluation runs against it. The evaluator is designed to be rigorous rather than generous — and the data above was produced under that standard.
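The scoring arithmetic behind that standard is simple enough to sketch: six criteria, each scored 0, 1, or 2, for a maximum of 12. The snippet below is an illustrative check, not the engine's code. The criterion names come from the Proof 01 table, the 0 to 2 range is inferred from the totals on this page, and the function names are invented.

```python
CRITERIA = (
    "Variable Awareness",
    "Resource Allocation",
    "Risk Anticipation",
    "Communication Clarity",
    "Multi-Step Planning",
    "Temporal Symbiosis",
)


def total_score(scores):
    """Validate each criterion score and return the total out of 12."""
    for name in CRITERIA:
        value = scores[name]  # raises KeyError if a criterion is missing
        if value not in (0, 1, 2):
            raise ValueError(f"{name}: expected 0, 1 or 2, got {value!r}")
    return sum(scores[name] for name in CRITERIA)


def is_perfect(scores):
    """A perfect result requires 2 on every criterion."""
    return total_score(scores) == 2 * len(CRITERIA)


# Run 1 from the Proof 01 table: one criterion at 1, the rest at 2.
run_1 = dict.fromkeys(CRITERIA, 2)
run_1["Variable Awareness"] = 1

print(total_score(run_1), is_perfect(run_1))  # 11 False
```

A single criterion scored at 1 is enough to keep a plan off 12/12, which is why the standard treats a perfect total as rare.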
Beta Access
The Mandate is in closed beta. If the evidence above is sufficient — and the question of how you reason under pressure is one you are ready to answer — register below.