Can an LLM Be a Method Actor?
- Empatix Consulting
- 3 days ago
We gave 600 AI personas a standardized personality test. The results say more about the model than the personas.
Three things are happening in market research right now that anyone in the insights business should be paying attention to.

First, synthetic respondents are now a real product category. In March 2025, Qualtrics launched Edge Audiences — synthetic panels built on a fine-tuned model trained on millions of actual survey responses — with Booking.com, Google Labs, and Loop Earplugs among the early adopters. Evidenza, Fairgen, Toluna, PersonaPanels, and over a dozen other platforms are selling AI-generated respondents as a commercial service. This isn’t a concept deck anymore. It’s a line item.
Second, the methodology gap is enormous. The best platforms are training custom models on proprietary survey data and publishing validation studies: Evidenza’s partnership with EY reported 95% correlation with traditional brand survey results, and Qualtrics has published head-to-head comparisons showing its fine-tuned model outperforms off-the-shelf LLMs on survey tasks. But the distance between what these purpose-built systems can do and what you’d get from prompting ChatGPT with a persona description is vast, and not widely understood.
Third, the industry is converging on “hybrid.” Nobody credible is claiming synthetic will replace human panels. The emerging consensus is synthetic for speed and breadth on early-stage exploration, human for validation and depth on high-stakes decisions. Qualtrics frames it as “test five concepts synthetically, validate the top two with humans.” Fairgen positions its product as a statistical booster that doubles your effective sample size for niche subgroups. The question isn’t synthetic or human; it’s where in your workflow each one fits.
We wanted to understand the fundamentals. Not by evaluating any vendor’s platform, but by asking a simpler question: what actually happens when you give an off-the-shelf LLM a persona and a validated psychometric instrument? Not because that’s how the best platforms work — it isn’t — but because it’s the starting point for understanding what these tools can and can’t do. Think of it as a naive investigation: we used the cheapest, most accessible approach possible to surface the kinds of questions you should be asking before you buy anything.
We generated synthetic personas using gpt-4o-mini and administered the Mini-IPIP Big Five personality inventory — a validated 20-item survey used across thousands of psychology studies. Then we ran the numbers the same way we would for real panel data: descriptive stats, treatment effects, analysis of variance, and discriminant validity checks across three experiments for a total of 600 surveys.
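For readers who want to see what scoring an inventory like this involves, here is a minimal Python sketch. The item-to-trait key and the set of reverse-keyed items are illustrative assumptions, not the published Mini-IPIP key, and `score_inventory` is a hypothetical helper rather than the code we ran. It assumes each response is an integer from 1 to 5.

```python
from statistics import mean

TRAITS = ["extraversion", "agreeableness", "conscientiousness",
          "neuroticism", "openness"]

# ASSUMPTION: illustrative key only. Items cycle through the five traits,
# giving four items per trait. Replace with the published key for real use.
ITEM_TRAIT = {i: TRAITS[(i - 1) % 5] for i in range(1, 21)}

# ASSUMPTION: illustrative set of reverse-keyed items, not the official one.
REVERSED = {6, 7, 8, 10, 14, 15, 16, 17, 18, 19, 20}


def score_inventory(responses, scale_max=5):
    """Return the mean score per trait from a {item_number: response} dict."""
    by_trait = {t: [] for t in TRAITS}
    for item, value in responses.items():
        if not 1 <= value <= scale_max:
            raise ValueError(f"item {item}: response {value} out of range")
        if item in REVERSED:
            value = scale_max + 1 - value  # flip reverse-keyed items
        by_trait[ITEM_TRAIT[item]].append(value)
    return {t: mean(v) for t, v in by_trait.items()}


# Example: a persona that answers 4 to every one of the 20 items.
scores = score_inventory({i: 4 for i in range(1, 21)})
```

Reverse-keyed items are the reason a persona that agrees with everything does not top out every trait: agreeing with a negatively worded item lowers the score.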
Here’s what we found.
The Model Has a Personality (And It’s Not Yours)
Before testing whether the model could simulate different people, we needed to know what its default personality looked like. So we ran 100 personas through the Big Five with no personality instructions — just demographics and a backstory — and compared the results to published US adult norms.
Two things jump out immediately. The model is extraordinarily agreeable, nearly maxing out a 5-point scale, and strikingly low in anxiety. These are exactly the traits you’d expect from a system trained to be helpful, harmless, and pleasant to talk to. The model doesn’t have a neutral personality. It has a specific one: an agreeable, emotionally stable, conscientious optimist.
The variance problem is arguably worse than the mean problem. In the general population, agreeableness has a standard deviation of 0.63. In our synthetic sample, it’s 0.17, nearly four times narrower than reality. Neuroticism’s spread is six times too narrow. Every persona the model generates lives in the calm, agreeable corner of human personality space, regardless of what backstory you give it.
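The arithmetic behind the agreeableness comparison is simple enough to check by hand. A small sketch, using only the two standard deviations quoted above (`compression_ratio` is a hypothetical helper name):

```python
def compression_ratio(population_sd, synthetic_sd):
    """How many times narrower the synthetic spread is than the population's."""
    if synthetic_sd <= 0:
        raise ValueError("synthetic_sd must be positive")
    return population_sd / synthetic_sd


# Agreeableness standard deviations quoted in the text.
ratio = compression_ratio(0.63, 0.17)
print(round(ratio, 1))       # 3.7, i.e. "nearly four times"

# On the variance scale the gap is the square of the SD ratio.
print(round(ratio ** 2, 1))  # 13.7
```

Whether you quote the gap in standard deviations or in variances changes the headline number a lot, so it is worth asking which one a vendor is reporting.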
The model’s own personality bleeds through every costume it puts on.
It Follows the Script — But Some Knobs Are Stickier Than Others
Next we tested the basics: if you explicitly tell the model “this persona is highly neurotic” or “this persona has low agreeableness,” do the survey scores follow? We ran five treatment groups, 50 personas each.
Every treatment shifted scores in the expected direction. All five were statistically significant. But the knobs don’t all turn equally.
The pattern is telling. Conscientiousness is the easiest dial to turn — tell the model a persona is disorganized and the scores drop dramatically. But neuroticism barely moves. Even when explicitly told to be neurotic, the treatment group averaged just 2.08 — a hair above the baseline of 2.02. The model knows it’s supposed to be anxious. It just... isn’t.
This is the alignment fingerprint in action. Agreeableness and emotional stability are exactly the traits that the model’s safety training reinforces. It will play a slob before it’ll play a worrier.
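How hard a knob turns is usually summarized with an effect size such as Cohen’s d, the measure used in the notes at the end of this piece. Here is a minimal sketch of the pooled-standard-deviation form. The scores are made up for illustration and are not data from our experiments.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(treatment, control):
    """Standardized mean difference using the pooled sample standard deviation."""
    n1, n2 = len(treatment), len(control)
    pooled_var = (
        (n1 - 1) * variance(treatment) + (n2 - 1) * variance(control)
    ) / (n1 + n2 - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)


# ASSUMPTION: toy scores for illustration only, not results from the study.
control = [2.0, 2.25, 1.75, 2.0, 2.0]
treated = [2.25, 2.5, 2.0, 2.25, 2.25]

d = cohens_d(treated, control)
print(round(d, 2))  # 1.41
```

Note that d depends on the spread as well as the gap in means. When variance is compressed, even a small shift in raw scores can produce a respectable d, so it helps to look at the raw means alongside it.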
The Traits Bleed Into Each Other
Something we didn’t expect showed up in the traits we didn’t manipulate.
When we told personas to be highly open, their conscientiousness dropped and their extraversion rose — even though we never mentioned those traits. When we made them highly extraverted, conscientiousness dropped. Low conscientiousness pushed extraversion up.
The model appears to carry an implicit personality theory: open people are less organized, extraverts are less disciplined, disagreeable people are less social. Some of these correlations match published research on how real personality traits covary. Others look more like stereotypes.
For anyone trying to use LLM personas as a stand-in for real survey respondents, this matters. You’re not just getting the trait you asked for — you’re getting a whole constellation of assumptions bundled with it.
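One way to check for this kind of spillover in your own data is to compare each non-target trait against the baseline group. The sketch below is hypothetical: the function name, the 0.2-point threshold, and the numbers are all illustrative assumptions, not our analysis code or results.

```python
from statistics import mean


def spillover(baseline, treated, target, threshold=0.2):
    """Return {trait: mean shift} for non-target traits that moved.

    baseline and treated map trait name -> list of persona scores.
    A trait is reported when its mean shifts by at least `threshold` points.
    """
    shifts = {}
    for trait in baseline:
        if trait == target:
            continue  # the manipulated trait is expected to move
        delta = mean(treated[trait]) - mean(baseline[trait])
        if abs(delta) >= threshold:
            shifts[trait] = round(delta, 2)
    return shifts


# ASSUMPTION: toy numbers for illustration only.
baseline = {
    "openness": [4.0, 4.0],
    "conscientiousness": [4.5, 4.5],
    "extraversion": [3.0, 3.0],
}
high_open = {
    "openness": [4.75, 4.75],
    "conscientiousness": [4.0, 4.0],
    "extraversion": [3.5, 3.5],
}

moved = spillover(baseline, high_open, target="openness")
print(moved)  # {'conscientiousness': -0.5, 'extraversion': 0.5}
```

A fixed threshold is the crudest possible rule. With real data you would want a significance test or an effect size per trait instead.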
Artists and Accountants: The Method Acting Test
The first two experiments tested whether the model could follow instructions. This one tests whether it can inhabit a role.
We generated 250 personas across five occupations — Artist, Accountant, ER Nurse, Sales Director, Startup Founder — with no personality hints. Just a job title, some demographics, and a backstory. Then we gave them the same personality inventory.
The question: does the model’s implicit understanding of “what kind of person becomes a nurse” align with decades of occupational psychology research?
Short answer: directionally, yes. Artists scored highest in openness. Accountants scored highest in conscientiousness. ER Nurses led in agreeableness. Sales Directors and Startup Founders scored higher in extraversion. All five traits showed significant differences across occupations.³
But the effect sizes tell a different story. The differences are real but small — the model is whispering when it should be speaking. An Artist’s openness score of 4.50 versus an Accountant’s 4.11 is statistically significant, but in human data you’d expect a much wider gap. The model’s compressed variance means all five occupations are variations on the same agreeable, emotionally stable template. The occupational “costumes” are recognizable, but thin.
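The gap between "significant" and "large" is easier to see with the mechanics in front of you. Below is a minimal sketch of a one-way ANOVA that returns both the F statistic and eta squared, the share of total variance explained by group membership. Eta squared is our choice for this illustration, and the data are toy values rather than scores from the study.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F statistic, eta squared) for a list of score lists."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)

    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)

    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_sq = ss_between / (ss_between + ss_within)
    return f_stat, eta_sq


# ASSUMPTION: toy data, three groups of three scores each.
groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
f_stat, eta_sq = one_way_anova(groups)
print(f_stat, eta_sq)  # 3.0 0.5
```

Turning F into a p-value requires the F distribution, which the standard library does not provide. A statistics package such as SciPy would be the usual route.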
We came away with a simple three-part framework for how well the model does persona work:
It knows the script. Give it explicit personality instructions and the scores move in the right direction — for most traits.
It’s learning to method-act. Occupational stereotypes come through without any prompting, which suggests the model has internalized real patterns about how personality and profession relate.
It can’t improvise. The model’s own personality — agreeable, stable, conscientious — is always the loudest voice in the room. Variance is compressed, some traits resist manipulation, and the “character” never fully takes over from the “actor.”
So What?
In the general population, about a third of people score above 3.6 on neuroticism. In our synthetic sample, the maximum was 2.6. The entire synthetic population lives in the calm half of the human distribution.
Neuroticism’s spread is six times too narrow. If you used this data to model a real market, you’d systematically underestimate the diversity of your customers.
And the trait resistance is a design problem masquerading as a data problem. The model can play an introvert, a slob, or even a jerk — but it can’t play someone who’s anxious, insecure, or emotionally volatile. The training that makes the model helpful also makes it constitutionally incapable of inhabiting certain personality types. It’s a method actor who refuses to take dark roles.
This is, remember, the naive version — a general-purpose LLM with persona prompting. The commercial platforms are building on top of much more sophisticated foundations: fine-tuned models trained on millions of real survey responses, proprietary validation frameworks, and hybrid workflows that combine synthetic speed with human depth. Our experiments don’t tell you whether Qualtrics Edge or Fairgen or Evidenza have solved these problems. What they tell you is that the problems are real, and worth asking about.
Questions Worth Asking
If you’re evaluating synthetic data for your research program — or just trying to figure out where it fits — here are the questions we’d start with:
On the methodology: When a vendor reports high correlation between synthetic and human results, ask what types of questions drove that correlation. Our experiments suggest attitudinal and psychographic items (Likert scales, agreement statements) are the sweet spot. Behavioral recall, open-ended responses, and anything requiring genuine lived experience are where models struggle. A 95% correlation on brand perception questions is a very different claim than a 95% correlation on behavioral frequency questions.
On the economics: How much of your current research budget goes to early-stage screening — concept tests, message checks, rough segmentation — versus high-stakes decisions that inform major investments? That ratio tells you how much synthetic upside you actually have. The use case isn’t replacing your annual brand tracker. It’s running twenty concept screens in the time and budget it used to take to run two.
On the variance: Ask to see distributional comparisons, not just mean comparisons. Two datasets can have nearly identical means but wildly different variance structures — and it’s the variance that matters for segmentation, targeting, and understanding the full range of your market. If the synthetic data compresses your population into a narrow band around the average, you’ll miss the edges where the most interesting customers live.
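To make the variance point concrete, here is a small sketch comparing two samples that share a mean but not a shape. It reports the mean, the standard deviation, and a hand-rolled two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical distributions). The samples are invented for illustration.

```python
from bisect import bisect_right
from statistics import mean, stdev


def ks_statistic(a, b):
    """Largest gap between the two empirical CDFs (two-sample KS statistic)."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# ASSUMPTION: toy samples with identical means and very different spread.
human = [1.0, 2.0, 3.0, 4.0, 5.0]
synthetic = [2.75, 2.875, 3.0, 3.125, 3.25]

print(mean(human), mean(synthetic))  # 3.0 3.0
print(round(stdev(human), 2), round(stdev(synthetic), 2))
ks = ks_statistic(human, synthetic)
```

A mean-only comparison would call these two samples identical. The standard deviations and the KS statistic show they are not.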
On the training data: Is the model fine-tuned on actual survey response data, or is it a general-purpose LLM with demographic conditioning? The performance gap between these two approaches is significant. Our experiment used the latter. The commercial platforms claiming high correlation are using the former. These are not the same thing, and the distinction matters.
Notes
¹ Effect sizes conventionally considered “large” in behavioral research (Cohen’s d > 0.8).
² Effect sizes in the “small to medium” range (Cohen’s d 0.4–0.5) — the instruction moved scores, but not by much.
³ All five traits showed significant between-group differences (one-way ANOVA, all p < 0.05). Extraversion, conscientiousness, and openness were highly significant (p < 0.0001).



