In plain English

How we build a synthetic person.

Teaching a computer to react the way a real group of customers would — so a brand can test a campaign before spending money sending it to millions.

What is a "synthetic person"?

Imagine a focus group you can assemble in seconds, ask anything, and never have to pay or tire out. That's the idea.

We build a small set of computer "people" that react to a message (an email, an SMS, a WhatsApp message) the way a real group of your customers would. A brand can then see which version of a message lands best before sending the real thing to millions.

The golden rule: it's a crowd, not one person

A group of customers is never one type of person. Some are bargain-hunters, some are loyal fans, some are about to leave. So we never build a single "average" customer — we build a small crowd of the real types, in the same mix as your real audience.
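
To make "the same mix as your real audience" concrete, here is a minimal sketch of splitting a small crowd across customer types in proportion to their share of the audience. The segment names, the percentages and the function name `allocate_crowd` are made up for illustration; this is one simple way to do it (largest-remainder rounding), not a description of the production system.

```python
def allocate_crowd(segment_shares, crowd_size):
    """Split crowd_size synthetic people across segments in proportion to their shares.

    segment_shares: dict of segment name -> weight (any positive numbers).
    Uses largest-remainder rounding so the counts always add up to crowd_size.
    """
    total = sum(segment_shares.values())
    exact = {name: crowd_size * share / total for name, share in segment_shares.items()}
    counts = {name: int(value) for name, value in exact.items()}  # round down first
    leftover = crowd_size - sum(counts.values())
    # Hand the remaining seats to the segments that lost the most by rounding down.
    by_remainder = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# Hypothetical audience mix, in percent.
audience = {"bargain_hunter": 50, "loyal_fan": 30, "about_to_leave": 20}
crowd = allocate_crowd(audience, 20)
```

With this made-up mix, a crowd of 20 comes out as 10 bargain-hunters, 6 loyal fans and 4 customers about to leave, so no type is averaged away.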

Why we do it this way

If you build one "average" customer, you lose the disagreement that actually decides whether a campaign works — the bargain-hunter and the loyal fan want opposite things. A crowd keeps that tension.

What the research found

When you ask an AI to be "the average person," it tends to make everyone sound the same and hides the real spread of opinions. Researchers measured this directly and warned against it — so we deliberately keep the crowd diverse.

Bisbee and colleagues, Political Analysis, 2024.
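
One simple way to check for this "everyone sounds the same" problem is to compare how spread out the synthetic answers are against how spread out real answers are. The sketch below is an illustration under assumptions: the scores are invented, and `dispersion_ratio` is a hypothetical helper name, not a measure taken from the cited study.

```python
from statistics import pstdev


def dispersion_ratio(synthetic_scores, real_scores):
    """Spread of synthetic answers divided by spread of real answers.

    Around 1.0 means a similar spread; well below 1.0 means the synthetic
    crowd is more bunched together than real people are.
    """
    real_spread = pstdev(real_scores)
    if real_spread == 0:
        raise ValueError("real scores have no spread to compare against")
    return pstdev(synthetic_scores) / real_spread


# Invented 1-to-5 reactions, for illustration only.
synthetic = [3, 3, 4, 3, 3, 4]
real = [1, 5, 4, 2, 3, 5]
ratio = dispersion_ratio(synthetic, real)
```

Here the ratio comes out well below 1, which is the warning sign: the synthetic crowd agrees with itself far more than the real one does.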

What we feed it (the data)

A synthetic person is only as real as what we build it from. Three kinds of information, from least to most useful:

What the research found

Why "what they do" matters most: when this was tested on 2,000 real online stores, synthetic customers built from generic descriptions barely worked — but synthetic customers built from each store's real shopping behaviour predicted the winning options well.

SimGym, 2026 (arxiv.org/abs/2602.01443).

How we turn that data into a personality

How we make them talk naturally

We don't ask a synthetic person to "rate this 1 to 5" — that turns out to be unreliable. Instead we ask, "What do you think of this?" and read the feeling in their answer.
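
Here is a toy sketch of what "reading the feeling in their answer" can look like: compare the free-text answer with one reference sentence per point on a 1-to-5 scale, and turn the similarities into a spread of likely ratings. Everything specific here is an assumption for illustration: the anchor sentences are invented, `toy_embed` is a crude word-count stand-in for a real text-embedding model, and the published method may differ in its details.

```python
import re
from collections import Counter
from math import sqrt

# Hypothetical anchor sentences, one per point on a 1-to-5 scale.
ANCHORS = [
    "i hate this and would never buy it",
    "i do not like this much",
    "this is okay i am not sure",
    "i like this and might buy it",
    "i love this and would definitely buy it",
]


def toy_embed(text):
    """Stand-in for a real embedding model: a simple word-count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rating_distribution(answer):
    """Share of belief on each rating 1..5, from similarity to each anchor."""
    sims = [cosine(toy_embed(answer), toy_embed(anchor)) for anchor in ANCHORS]
    shifted = [s - min(sims) for s in sims]  # least similar anchor gets zero weight
    total = sum(shifted)
    if total == 0:  # equally similar to every anchor: no information
        return [1 / len(ANCHORS)] * len(ANCHORS)
    return [s / total for s in shifted]


def expected_rating(answer):
    """Average rating implied by the distribution."""
    return sum(point * p for point, p in enumerate(rating_distribution(answer), start=1))


keen = expected_rating("I love this offer and would definitely buy it")
cold = expected_rating("I hate this and would never open it")
```

Even this crude version scores the keen answer higher than the cold one. A real setup would swap `toy_embed` for proper sentence embeddings, so that words like "love" and "like" land close together instead of being treated as unrelated.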

What the research found

A study run with Colgate showed that asking an AI for a number gives poor results — but reading the words of its answer matches real people about 90% as well as real people match themselves. So we read words, not numbers.

"Semantic Similarity Rating," PyMC Labs & Colgate, 2025 (arxiv.org/abs/2510.08338).

How we know it's real — not made up

This is the part most people skip, and it's the most important one.

Before we trust a synthetic person, we test it: we ask it about things the real group has already done, and check whether it gets them right. If it doesn't match reality, we don't use it. A confident-sounding answer that's wrong is worse than no answer.
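
The test described above can be sketched as a simple gate: replay past campaigns whose real winners are already known, count how often the synthetic crowd picked the same winner, and refuse to use it below a cut-off. The function names and the 0.65 cut-off are assumptions chosen for illustration, not the actual acceptance rule.

```python
def backtest_accuracy(predicted_winners, actual_winners):
    """Fraction of past tests where the synthetic crowd picked the real winner."""
    if not predicted_winners or len(predicted_winners) != len(actual_winners):
        raise ValueError("need one prediction per past test, and at least one test")
    hits = sum(p == a for p, a in zip(predicted_winners, actual_winners))
    return hits / len(predicted_winners)


def is_trustworthy(predicted_winners, actual_winners, min_accuracy=0.65):
    """Gate: only use the synthetic crowd if it clears the (hypothetical) cut-off."""
    return backtest_accuracy(predicted_winners, actual_winners) >= min_accuracy


# Invented history: which version won in four past campaigns.
predicted = ["A", "B", "A", "C"]
actual = ["A", "B", "C", "C"]
usable = is_trustworthy(predicted, actual)
```

In this invented history the crowd called 3 of 4 past winners, so it clears a 0.65 cut-off but would fail a stricter 0.8 one.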

What the research found

Why we always work with groups and test them: copying one specific individual is still weak — a 2025 Columbia study found a computer "twin" of a single person matched that person only loosely. But predicting a whole group is far more reliable. So we stay at the group level and prove it against real outcomes.

"Digital Twins as Funhouse Mirrors," Columbia, 2025 (arxiv.org/abs/2509.19088).

What it's good for — and what it isn't

✓ Good at

"Which of these 5 subject lines will my customers like most?"

Ranking options, spotting likely winners and losers, and explaining why.

✗ Not good at

"This will get exactly 4.2% clicks."

Exact numbers. We never promise those.

What the research found

Why: AIs are reliably good at saying which option is better, but unreliable at the exact number. So we use them to rank and screen — keep the likely winners, drop the likely losers — and let the real campaign confirm the precise numbers.

Consistent across studies, e.g. Li & Ji, 2026, and SimGym, 2026.

The numbers, in plain English

If someone walks you through this work, a few statistics show up. Here's what each one actually means — no maths needed.

Correlation (the "r" number): how much two things move together
The full scale runs from -1 to 1, where a negative score means the two things move in opposite directions; the values that matter here sit between 0 and 1. 0 = no connection at all (random). 1 = they move in perfect lockstep. Rule of thumb: ~0.2 is weak (barely related), ~0.6 is genuinely useful, ~0.85 is strong. So "the agents matched reality at 0.64" means a real, useful signal: not perfect, but well above guessing.
"r ≈ 0.2": the weak score
This is what you get trying to copy one specific individual. It's weak, which is exactly why we never predict single people. We always work with groups, where the numbers get much better.
Ranking agreement (Spearman): did we get the order right?
Same scale, but it only asks: did we put the options in the right order? If we rank 5 subject lines and reality ranks them in nearly the same order, this is high, even if our exact scores differ. This is the one that matters for "which one will win."
Directional accuracy: did we call the winner?
The simplest one: did we get the direction right ("A will beat B")? 69% means right about 7 times out of 10. A coin toss is 50%, so this is clearly better than chance, but not magic.
Top pick precision: the one marketers care about
Did our top pick actually land among the real best few? In plain terms: "did the tool point me at a winner?"
Spread match: not just the average
Did we capture the whole mix of reactions (the lovers and the haters), not just the middle? Getting the average right but missing the spread can hide the people who'll unsubscribe.
"Too bunched together" (under-dispersion): a known AI flaw
AIs tend to make everyone sound too similar, so the answers cluster too tightly. Real people disagree more than that. We deliberately widen it back out so the crowd feels real.
The "ceiling" (test-retest): why 100% is impossible
Ask a real person the same question twice, a week apart, and they don't give the identical answer. So "perfect" doesn't exist. We measure how close the AI gets to being as consistent as a real human is with themselves; "90% of that ceiling" is about as good as it can get.
Calibration: tuning it to reality
Adjusting the AI's raw output using a little real data so its numbers line up with what actually happens, like sighting in a scope before you rely on it.
Confidence band: a range, not a fake-exact number
Instead of pretending to know "4.2%", we give an honest range: "somewhere around 3–5%." A single precise-looking number usually hides how unsure the tool really is.
"Significant": real, or just luck?
When a result is "significant," it means a difference that large would be very unlikely to show up by chance alone. "Not significant" means it could just be random noise, so don't bet on it.
Lift: how much it moved the needle
The size of the change a new version caused, e.g. "the new subject line lifted clicks by 23%." It's the payoff you're hunting for.
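
For readers who do want to see the arithmetic, several of the scores above (correlation, ranking agreement, directional accuracy, top pick precision and lift) can be sketched in a few lines. The function names and the example numbers are invented for illustration; this is a plain-Python sketch of the standard definitions, not the tooling actually used.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Correlation "r": -1 (opposite) through 0 (unrelated) to 1 (lockstep)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """Rank from 1 (smallest) upward; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(xs, ys):
    """Ranking agreement: correlation of the ranks, ignoring the exact scores."""
    return pearson(ranks(xs), ranks(ys))


def directional_accuracy(predicted, actual):
    """Share of option pairs where "A will beat B" was called correctly."""
    pairs = [(a, b) for a, b in combinations(range(len(actual)), 2) if actual[a] != actual[b]]
    if not pairs:
        raise ValueError("need at least two options with different real outcomes")
    correct = sum((predicted[a] - predicted[b]) * (actual[a] - actual[b]) > 0 for a, b in pairs)
    return correct / len(pairs)


def top_pick_in_top_k(predicted, actual, k=2):
    """Did the predicted best option land among the k best in reality?"""
    top_pick = max(range(len(predicted)), key=lambda i: predicted[i])
    best_k = sorted(range(len(actual)), key=lambda i: actual[i], reverse=True)[:k]
    return top_pick in best_k


def lift(new_rate, old_rate):
    """Relative change caused by the new version (0.23 means +23%)."""
    return (new_rate - old_rate) / old_rate


# Invented example: synthetic scores and real click rates for 5 subject lines.
synthetic_scores = [3.9, 3.1, 4.4, 2.5, 3.6]
real_click_rates = [0.041, 0.035, 0.048, 0.022, 0.030]
order_match = spearman(synthetic_scores, real_click_rates)
```

In this invented example only one pair of subject lines is in the wrong order, so both the ranking agreement and the directional accuracy come out at 0.9, and the top pick is the real winner.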

The whole idea in one line: build a small crowd from real data, let them react in their own words, and never trust them until they've been tested against reality.

This describes AI simulations built from data — useful stand-ins for testing, not real people, and not a crystal ball. Predictions can be wrong and should be confirmed with a real test before any big decision.