How we built accurate digital twins.
A digital twin isn't a clever prompt. It's a researched, fact-checked persona that runs on a language model and is then measured against what the real person actually said. Here's the whole method.
What a twin is made of
How we built them, step by step
The same repeatable pipeline works for any public figure with a body of work:
1 · Gather a sourced position library
We collect the person's real stances across every domain (China, Ukraine, the Middle East, the global order, AI, and more) from their books, columns, interviews and risk reports — each tied to a real citation.
2 · Fact-check every position
An independent pass verifies each quote and stance against its source. This caught misattributions and post-cutoff events before anything was locked in. For real living people, nothing is fabricated.
3 · Capture the reasoning frame and voice
A twin needs more than conclusions — it needs how the person reasons (Haass's "wars of choice vs. necessity"; Bremmer's "G-Zero") and how they speak. That's what makes a debate feel real.
4 · Assemble into a system prompt + guardrails
Everything becomes the instructions handed to the base model, with an explicit "documented vs. extrapolated" rule so the twin flags when it's reasoning beyond the record.
5 · Validate before trusting it
We test the twin against held-out reality (below) — and only then put it on stage.
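Steps 2 and 4 can be sketched in a few lines of Python. This is a minimal illustration, not the production brief: the `Position` fields, the `GUARDRAIL` wording, and `build_system_prompt` are all hypothetical names, and the data is placeholder text rather than anyone's real positions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    domain: str      # topic area, e.g. "AI"
    stance: str      # paraphrased position
    citation: str    # where the stance was sourced
    verified: bool   # set by the independent fact-check pass


# Hypothetical wording for the "documented vs. extrapolated" rule.
GUARDRAIL = (
    "If a question is covered by the positions below, answer from them and "
    "tag the answer [DOCUMENTED]. Otherwise reason from the frame and tag the "
    "answer [EXTRAPOLATED]."
)


def build_system_prompt(name, frame, voice, positions):
    """Assemble the persona brief from fact-checked material only."""
    checked = [p for p in positions if p.verified]  # drop anything unverified
    lines = [
        f"You are a simulated digital twin of {name}.",
        f"Reasoning frame: {frame}",
        f"Voice: {voice}",
        GUARDRAIL,
        "Positions:",
    ]
    lines += [f"- [{p.domain}] {p.stance} (source: {p.citation})" for p in checked]
    return "\n".join(lines)


# Placeholder data, not real positions of any person.
library = [
    Position("AI", "placeholder stance A", "placeholder citation A", True),
    Position("Trade", "placeholder stance B", "placeholder citation B", False),
]
prompt = build_system_prompt(
    "Example Expert", "placeholder frame", "placeholder voice", library
)
```

Filtering on `verified` at assembly time means an unchecked stance cannot reach the model even if it is still sitting in the library.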
The brain
The persona brief has no intelligence of its own; it's instructions. The actual thinking is done by a base large language model. The twin is portable — the same brief runs on different models, and a more capable model gives a more faithful twin.
Live on this site
Groq — Llama 3.3 70B. Fast and free-tier friendly, so the chat is snappy.
Validated on
Claude Opus. The fidelity scores below were measured there; the live Groq model is a notch lower, which is exactly the "brain matters" effect.
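Portability can be pictured as one brief sent unchanged to several backends. The sketch below assumes a backend is any callable that takes a system prompt and a question; `Backend`, `run_on_all`, and the two stubs are hypothetical stand-ins, and real backends would wrap each provider's own client.

```python
from typing import Callable, Dict

# Assumption: a backend is any callable taking (system_prompt, question) -> text.
Backend = Callable[[str, str], str]


def run_on_all(brief: str, question: str, backends: Dict[str, Backend]) -> Dict[str, str]:
    """Send the identical brief and question to every backend."""
    return {name: call(brief, question) for name, call in backends.items()}


# Stub backends for illustration only; they ignore the brief.
def stub_fast(system_prompt: str, question: str) -> str:
    return f"[fast] {question}"


def stub_strong(system_prompt: str, question: str) -> str:
    return f"[strong] {question}"


answers = run_on_all(
    "persona brief goes here",
    "example question",
    {"fast": stub_fast, "strong": stub_strong},
)
```

Because only the backend changes, any difference in fidelity between two runs can be attributed to the model rather than to the brief.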
How we prove it's accurate
We don't let the AI grade itself. We ask each twin questions the real person has answered, hide the real answer, and let the twin respond blind. A separate judge then compares the twin's answer to what the person actually said. The twin and the judge are two independent calls that don't share memory; the answer key only ever goes to the judge.
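A rough sketch of that blind loop, assuming the twin and the judge are each exposed as a plain function. `blind_eval`, `stub_twin`, and `stub_judge` are hypothetical names; the stub judge uses exact string matching only so the example runs, whereas a real judge would be a separate model call comparing meaning.

```python
def blind_eval(items, twin, judge):
    """Score a twin on held-out questions; return (fidelity, per-item results)."""
    results = []
    for item in items:
        # Call 1: the twin sees the question only, never the answer key.
        twin_answer = twin(item["question"])
        # Call 2: a separate judge gets the question, the twin's answer, and the key.
        match = judge(item["question"], twin_answer, item["real_answer"])
        results.append(
            {"question": item["question"], "twin_answer": twin_answer, "match": bool(match)}
        )
    fidelity = sum(r["match"] for r in results) / len(results) if results else 0.0
    return fidelity, results


# Stubs for illustration only.
seen_by_twin = []


def stub_twin(question):
    seen_by_twin.append(question)  # record exactly what the twin was shown
    return "yes" if "first" in question else "no"


def stub_judge(question, twin_answer, real_answer):
    return twin_answer == real_answer  # stand-in for a semantic comparison


held_out = [
    {"question": "first placeholder question", "real_answer": "yes"},
    {"question": "second placeholder question", "real_answer": "yes"},
]
fidelity, results = blind_eval(held_out, stub_twin, stub_judge)
```

The separation is structural: `real_answer` is passed to `judge` and nowhere else, so the twin cannot see the key regardless of how it is prompted.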
Open-book vs. closed-book
We ran two versions. Full grounding (open-book) keeps the positions in the brief; it tests whether the twin holds its briefing faithfully and in voice. Frame-only (closed-book) deletes the specific positions and keeps only the worldview and reasoning method; it tests whether the person's way of thinking alone reaches their real views. The gap between the two shows how much of the twin is a genuine model of the person and how much is memorized briefing. From his frame alone, Haass's twin reached his real views 82% of the time (he's a doctrine thinker); Bremmer's reached his 50% of the time (his value is specific, timely forecasts, so his twin must be kept freshly grounded).
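The ablation and the gap can be expressed roughly as follows. The brief is assumed to be a dictionary with a `positions` key, and `frame_only`, `fidelity_rate`, and `grounding_gap` are hypothetical helpers. The tallies are invented for illustration and are not the results reported above.

```python
def frame_only(brief):
    """Return a copy of the brief with the specific positions removed."""
    return {key: value for key, value in brief.items() if key != "positions"}


def fidelity_rate(matches, total):
    """Fraction of held-out questions where the twin matched the real view."""
    if total <= 0:
        raise ValueError("total must be positive")
    return matches / total


def grounding_gap(open_book, closed_book):
    """How much fidelity depends on the briefed positions rather than the frame."""
    return open_book - closed_book


brief = {
    "frame": "placeholder frame",
    "voice": "placeholder voice",
    "positions": ["placeholder position"],
}
closed_brief = frame_only(brief)  # worldview and voice only

# Hypothetical tallies for illustration, not measured results.
open_rate = fidelity_rate(9, 10)
closed_rate = fidelity_rate(6, 10)
gap = grounding_gap(open_rate, closed_rate)
```

A small gap suggests the frame carries most of the person's views; a large gap suggests the twin leans on its briefing and needs regular refreshing.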
Other ways to validate an AI agent
Held-out testing is one method. A serious program stacks several — here's the toolkit, and what each one catches:
| Technique | What it does | Catches |
|---|---|---|
| Held-out positions (used) | Blind answers scored against sourced reality. | Drift, contradictions, recall gaps. |
| Frame-only / ablation (used) | Strip the grounding; see what the model regenerates alone. | How much is real modeling vs. memorized text. |
| Base-model comparison (used) | Same brief on different LLMs. | Where model choice actually matters. |
| Human expert review | A subject-matter expert grades a sample. | The gold standard; subtle errors AI judges miss. |
| Discrimination ("Turing") test | Can people tell the twin from the real transcript? | Whether it's convincing, not just correct. |
| Temporal back-test | Build only from material before a date; test on later real statements. | Genuine prediction vs. memorization — the strongest test. |
| Adversarial red-team | Try to push it off-brand or into fabrication. | Failure modes, jailbreaks, hallucination. |
| Distributional match | For audience/segment agents: compare the answer distribution to a real human panel. | Whether a crowd of agents mirrors a real crowd. |
| Consistency & calibration | Re-ask, reorder, and check confidence vs. correctness. | Instability, order-bias, over-confidence. |
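As one example from the table, the temporal back-test comes down to splitting dated material at a cutoff. This sketch assumes each statement carries a `date` field; `temporal_split` and the sample records are hypothetical.

```python
from datetime import date


def temporal_split(statements, cutoff):
    """Split statements into build material (before cutoff) and test material."""
    build = [s for s in statements if s["date"] < cutoff]
    test = [s for s in statements if s["date"] >= cutoff]
    return build, test


# Placeholder records, not real statements.
statements = [
    {"date": date(2022, 3, 1), "text": "placeholder statement 1"},
    {"date": date(2023, 6, 15), "text": "placeholder statement 2"},
    {"date": date(2024, 1, 10), "text": "placeholder statement 3"},
]
build_set, test_set = temporal_split(statements, cutoff=date(2023, 1, 1))
```

The twin would be built from `build_set` only and then scored on `test_set`. The split is only meaningful if the base model's own training data also predates the cutoff; otherwise the model may already have seen the later statements.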
What it is — and isn't
✓ It is
Grounded in real public statements, with sources · tested for fidelity, not just asserted · faithful to positions, voice, and reasoning · a repeatable method for any expert.
✗ It isn't
The real person · live (positions are a snapshot that needs refreshing) · infallible (AI-judged; human spot-check advised) · a forecaster of real-world outcomes (it reproduces their views, not the truth).
AI digital-twin simulation grounded in public statements. Not affiliated with, authored by, or endorsed by Richard Haass or Ian Bremmer. Positions are paraphrased by a language model and may not reflect their current views.