Under the hood

How we built accurate digital twins.

A digital twin isn't a clever prompt. It's a researched, fact-checked persona that runs on a language model and is then measured against what the real person actually said. Here's the whole method.

What a twin is made of

1 · Base LLM

The brain. A frontier AI model supplies the raw intelligence and language.

2 · Grounded persona

A researched, sourced brief of the person's real positions, reasoning frame and voice.

3 · Guardrails

Rules separating what they've actually said from anything the AI extrapolates.

A faithful twin

An AI that thinks and talks like the specific person — and can be tested.

How we built them, step by step

The same repeatable pipeline works for any public figure with a body of work:

1 · Gather a sourced position library

We collect the person's real stances across every domain (China, Ukraine, the Middle East, the global order, AI, and more) from their books, columns, interviews and risk reports — each tied to a real citation.

2 · Fact-check every position

An independent pass verifies each quote and stance against its source. This caught misattributions and post-cutoff events before anything was locked in. For real living people, nothing is fabricated.
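As a rough illustration, one cheap automated pre-check that could sit in front of the fact-checking pass is a verbatim match of each quote against its source text. This is a minimal sketch under assumptions: the function names and the (quote, source text) pairing are hypothetical, and a failed match only means "send to a human", not "wrong". It cannot check paraphrased stances at all.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, straighten curly quotes, and collapse runs of whitespace."""
    text = (
        text.replace("\u201c", '"')
        .replace("\u201d", '"')
        .replace("\u2019", "'")
    )
    return re.sub(r"\s+", " ", text).strip().lower()


def quote_in_source(quote: str, source_text: str) -> bool:
    """True if the quote appears verbatim in the source after normalization."""
    return normalize(quote) in normalize(source_text)


def needs_human_review(entries):
    """Return the quotes that could not be matched to their source.

    `entries` is an iterable of (quote, source_text) pairs (hypothetical shape).
    """
    return [
        quote
        for quote, source_text in entries
        if not quote_in_source(quote, source_text)
    ]
```

Anything this check flags still needs the independent verification described above; the sketch only narrows where a reviewer looks first.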

3 · Capture the reasoning frame and voice

A twin needs more than conclusions — it needs how the person reasons (Haass's "wars of choice vs. necessity"; Bremmer's "G-Zero") and how they speak. That's what makes a debate feel real.

4 · Assemble into a system prompt + guardrails

Everything becomes the instructions handed to the base model, with an explicit "documented vs. extrapolated" rule so the twin flags when it's reasoning beyond the record.
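To make the assembly step concrete, here is a minimal sketch of how a brief and a "documented vs. extrapolated" rule might be combined into one system prompt. The `Position` fields, the `[documented]` / `[extrapolated]` labels, the guardrail wording and the prompt layout are all assumptions for illustration, not the actual brief format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Position:
    topic: str    # domain the stance belongs to
    stance: str   # paraphrase of the documented position
    source: str   # citation the stance is tied to


# Hypothetical guardrail wording; the real rule text is not shown on this page.
GUARDRAIL = (
    "When a question is covered by the documented positions, answer from them "
    "and label the answer [documented]. When you reason beyond the record, "
    "label the answer [extrapolated] and say so plainly."
)


def build_system_prompt(
    name: str, frame: str, voice: str, positions: List[Position]
) -> str:
    """Assemble persona, frame, voice, positions, and guardrail into one prompt."""
    lines = [
        f"You are a simulation of {name}.",
        "",
        "Reasoning frame:",
        frame,
        "",
        "Voice:",
        voice,
        "",
        "Documented positions:",
    ]
    for p in positions:
        lines.append(f"- [{p.topic}] {p.stance} (source: {p.source})")
    lines += ["", "Guardrail:", GUARDRAIL]
    return "\n".join(lines)
```

Keeping each position as a structured record with its citation means the same library can feed both the prompt and the later validation step.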

5 · Validate before trusting it

We test the twin against held-out reality (below) — and only then put it on stage.

The brain

The persona brief has no intelligence of its own; it's instructions. The actual thinking is done by a base large language model. The twin is portable — the same brief runs on different models, and a more capable model gives a more faithful twin.

Live on this site

Groq — Llama 3.3 70B. Fast and free-tier friendly, so the chat is snappy.

Validated on

Claude Opus. The fidelity scores below were measured there; the live Groq model scores a notch lower, which is exactly the "brain matters" effect.

How we prove it's accurate

We don't let the AI grade itself. We ask each twin questions the real person has already answered, hide the real answer, and let the twin respond blind. A separate judge then compares the twin's answer to what the person actually said. The twin and the judge are two independent calls that don't share memory, and the answer key only ever goes to the judge.
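The blind setup can be sketched as a small harness in which the twin function receives only the question and the judge function alone receives the answer key. This is a sketch under assumptions: the verdict labels, the report fields and the stub callables are hypothetical, and in practice `twin` and `judge` would each wrap a separate model call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class HeldOutItem:
    question: str
    real_answer: str  # the answer key; only the judge ever sees this


# Hypothetical verdict labels a judge might return.
VERDICTS = ("reproduced", "missed", "contradicted")


def evaluate(
    items: List[HeldOutItem],
    twin: Callable[[str], str],
    judge: Callable[[str, str, str], str],
) -> Dict[str, float]:
    """Score a twin blind: the twin gets the question, the judge gets the key."""
    counts = {v: 0 for v in VERDICTS}
    for item in items:
        twin_answer = twin(item.question)  # no answer key passed here
        verdict = judge(item.question, twin_answer, item.real_answer)
        if verdict not in counts:
            raise ValueError(f"unexpected verdict: {verdict!r}")
        counts[verdict] += 1
    total = len(items)
    reproduced_pct = 100 * counts["reproduced"] / total if total else 0.0
    return {
        "reproduced_pct": reproduced_pct,
        "contradictions": counts["contradicted"],
    }


# Stubs so the sketch runs without any API access.
def stub_twin(question: str) -> str:
    return "placeholder twin answer"


def stub_judge(question: str, twin_answer: str, real_answer: str) -> str:
    return "reproduced" if twin_answer == real_answer else "missed"


items = [
    HeldOutItem("Q1", "placeholder twin answer"),
    HeldOutItem("Q2", "something else"),
]
report = evaluate(items, stub_twin, stub_judge)
```

Because the twin's signature takes only the question, the answer key cannot leak into its prompt by accident.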

95%

Haass — positions reproduced · 0 contradictions

100%

Bremmer — positions reproduced · 0 contradictions

Open-book vs. closed-book

We ran two versions. Full grounding (open-book) keeps the positions in the brief; it tests whether the twin holds its briefing faithfully and in voice. Frame-only (closed-book) deletes the specific positions and keeps only the worldview and reasoning method; it tests whether the person's way of thinking alone reaches their real views.

The gap between the two shows how much of the twin is a genuine model of the person and how much is memorized briefing. Haass reached his real views 82% of the time from his frame alone (he's a doctrine thinker). Bremmer reached his 50% of the time (his value lies in specific, timely forecasts, so his twin must be kept freshly grounded).
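The two variants and the gap between them can be sketched in a few lines. The brief layout and function names are hypothetical. The usage lines assume the headline scores (95% and 100%) are the full-grounding results, which this page does not state outright.

```python
from typing import List


def make_brief(frame: str, positions: List[str], include_positions: bool) -> str:
    """Build an open-book brief (with positions) or a frame-only brief."""
    parts = ["Reasoning frame:", frame]
    if include_positions:
        parts.append("Documented positions:")
        parts.extend(f"- {p}" for p in positions)
    return "\n".join(parts)


def grounding_gap(open_book_pct: float, frame_only_pct: float) -> float:
    """Percentage points the twin loses when the positions are removed."""
    return open_book_pct - frame_only_pct


open_book = make_brief("placeholder frame", ["placeholder position"], True)
frame_only = make_brief("placeholder frame", ["placeholder position"], False)

# Assumption: the headline figures are the open-book scores.
haass_gap = grounding_gap(95.0, 82.0)
bremmer_gap = grounding_gap(100.0, 50.0)
```

A small gap suggests the frame carries most of the person's views; a large gap suggests the twin depends on its position library and needs regular refreshing.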

Other ways to validate an AI agent

Held-out testing is one method. A serious program stacks several — here's the toolkit, and what each one catches:

Technique	What it does	Catches
Held-out positions (used)	Blind answers scored against sourced reality.	Drift, contradictions, recall gaps.
Frame-only / ablation (used)	Strip the grounding; see what the model regenerates alone.	How much is real modeling vs. memorized text.
Base-model comparison (used)	Same brief on different LLMs.	Where model choice actually matters.
Human expert review	A subject-matter expert grades a sample.	The gold standard; subtle errors AI judges miss.
Discrimination ("Turing") test	Can people tell the twin from the real transcript?	Whether it's convincing, not just correct.
Temporal back-test	Build only from material before a date; test on later real statements.	Genuine prediction vs. memorization — the strongest test.
Adversarial red-team	Try to push it off-brand or into fabrication.	Failure modes, jailbreaks, hallucination.
Distributional match	For audience/segment agents: compare the answer distribution to a real human panel.	Whether a crowd of agents mirrors a real crowd.
Consistency & calibration	Re-ask, reorder, and check confidence vs. correctness.	Instability, order-bias, over-confidence.
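Of the techniques not yet used, the temporal back-test has the most mechanical setup: split the dated record at a cutoff, build the twin from the earlier part, and hold the later part out for scoring. A minimal sketch, assuming each statement carries a date; the `Statement` type and its field names are hypothetical.

```python
from datetime import date
from typing import List, NamedTuple, Tuple


class Statement(NamedTuple):
    said_on: date
    text: str


def temporal_split(
    statements: List[Statement], cutoff: date
) -> Tuple[List[Statement], List[Statement]]:
    """Return (build_set, test_set): strictly before the cutoff vs. on or after."""
    build_set = [s for s in statements if s.said_on < cutoff]
    test_set = [s for s in statements if s.said_on >= cutoff]
    return build_set, test_set


record = [
    Statement(date(2020, 1, 15), "placeholder early statement"),
    Statement(date(2022, 6, 1), "placeholder later statement"),
]
build_set, test_set = temporal_split(record, cutoff=date(2021, 1, 1))
```

One caveat with this design: the base model's own training data may already include statements made after the cutoff, so the split controls only what goes into the brief.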

What it is — and isn't

✓ It is

Grounded in real public statements, with sources · tested for fidelity, not just asserted · faithful to positions, voice, and reasoning · a repeatable method for any expert.

✗ It isn't

The real person · live (positions are a snapshot that needs refreshing) · infallible (it is AI-judged; a human spot-check is advised) · a forecaster of real-world outcomes (it reproduces their views, not the truth).

AI digital-twin simulation grounded in public statements. Not affiliated with, authored by, or endorsed by Richard Haass or Ian Bremmer. Positions are paraphrased by a language model and may not reflect their current views.
