Silicon sampling, a term coined by Argyle et al. (2023), is the use of LLMs to generate synthetic responses that simulate how subpopulations would answer survey questions.
The technique is used in two distinct ways. The first is group-level simulation: conditioning an LLM on demographic profiles to reproduce the distribution of responses a given subpopulation would give — for example, how Republicans aged 30–45 would answer a question about immigration policy. The goal is not to simulate any individual but to approximate the aggregate opinion of a group. The second is individual prediction (sometimes called a “digital twin”): constructing a model of a specific person from their survey history or demographic attributes and predicting how they would answer new questions.
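The group-level variant is typically implemented by writing a short first-person backstory from the demographic profile, sampling the model repeatedly, and tallying the completions into a response distribution. Below is a minimal sketch of that pipeline; the profile fields and prompt wording are illustrative assumptions, not the richer ANES-derived backstory template Argyle et al. (2023) actually use, and the sampled answers stand in for real model output.

```python
from collections import Counter

def build_persona_prompt(profile: dict, question: str) -> str:
    """Compose a first-person prompt from a demographic profile.
    Fields and wording are illustrative only."""
    backstory = (
        f"I am {profile['age']} years old. I identify as {profile['gender']}. "
        f"Politically, I consider myself a {profile['party']}. "
        f"I live in {profile['state']}."
    )
    return f"{backstory}\nQuestion: {question}\nAnswer:"

def tally_responses(answers: list[str]) -> dict[str, float]:
    """Turn repeated sampled completions into an estimated response
    distribution for the conditioned subgroup."""
    counts = Counter(a.strip().lower() for a in answers)
    total = sum(counts.values())
    return {option: n / total for option, n in counts.items()}

# Stand-ins for completions sampled repeatedly from an LLM.
sampled = ["Agree", "Agree", "agree", "Disagree", "Neither agree nor disagree"]
print(build_persona_prompt(
    {"age": 38, "gender": "a man", "party": "Republican", "state": "Texas"},
    "Do you support increasing legal immigration?"))
print(tally_responses(sampled))
```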
Proposed applications include replacing human survey respondents entirely, piloting survey instruments, pretesting question framings, stress-testing measurement instruments, and generating hypotheses about group differences before collecting primary data.
| Study | Goal | Key finding | Verdict |
|---|---|---|---|
| Argyle et al. (2023) | Replace human polling samples | Conditioning GPT-3 on ANES demographic profiles reproduced vote choice distributions with r = 0.90–0.94 across three election years; fails for independents | Mixed |
| Sun et al. (2024) | Replace human polling samples | Group-level demographic sampling reproduced ANES 2020 vote choice (KL-divergence = 0.0004); fails for sensitive topics and sorted partisan subgroups | Mixed |
| Lyman et al. (2025) | Assess effect of RLHF on fidelity | Follow-up to Argyle et al. (2023) examining the trade-off between RLHF alignment and algorithmic fidelity; alignment suppresses minority opinion expression | Negative |
| Bisbee et al. (2024) | Replace human survey respondents | Aggregate feeling thermometer scores fall within one SD of ANES, but 48% of regression coefficients differ significantly and 32% flip sign; results not reproducible across model versions | Negative |
| Cerina & Duch (2023) | Compare annotation vs simulation | LLMs better suited to annotating existing survey data than simulating responses; lack of explainability makes them unsuitable for silicon sampling | Negative |
| Neumann et al. (2025) | Establish reliability thresholds | No model tested passed all quality control checks for a minimum reliability threshold for silicon sampling | Negative |
| Karanjai et al. (2025) | Improve simulation accuracy with RAG | Prompting alone achieves reasonable adherence; RAG improves further, but results may reflect training data memorisation rather than generalisation | Mixed |
| K. Lee et al. (2025) | Generalise silicon sampling cross-culturally | LLMs can replicate key ideological and demographic patterns in Korean survey data but overemphasise ideological differences on contentious issues | Mixed |
| Xu et al. (2025) | Simulate opinion on legal questions | Models are better aligned with their training data than with actual survey respondents on US Supreme Court questions | Negative |
| S. Lee et al. (2024) | Simulate climate opinion by demographic group | Including demographic and issue-related covariates improves alignment; models perform poorly for non-Hispanic Black Americans | Mixed |
| Cao et al. (2025) | Cross-national opinion simulation via fine-tuning | Fine-tuning on World Values Survey data outperforms zero-shot by 34% (1-JSD); all models produce less diverse predictions than real human data | Mixed |
| Durmus et al. (2023) | Simulate global public opinion | Models default to US/European perspectives; responses shift when prompted with country information but often reflect cultural stereotypes | Negative |
| Boelaert et al. (2025) | Replace survey respondents for opinion research | Models cannot replace survey respondents for opinion research; response distributions show a systematic machine bias that varies randomly across topics | Negative |
| Maier et al. (2025) | Replace human panels for consumer testing | Synthetic consumers achieve 90% of human test-retest reliability for ranking personal care products; response distributions are too narrow for absolute measurement | Mixed |
| Kaiser et al. (2025) | Simulate consumer preference rankings | Models approximate aggregate rankings but produce overly positive results and reduced variance compared to real consumers | Mixed |
| Aher et al. (2022) | Replicate classic human subject experiments | LLMs successfully replicated findings from economic, psycholinguistic, and social psychology experiments but showed a hyper-accuracy distortion absent in real human data | Mixed |
| Peng et al. (2025) | Build individual digital twins | Across 19 pre-registered studies and 164 outcomes, mean r = 0.20 between digital twin and human responses; twins significantly under-dispersed in 94% of outcomes | Negative |
| Kim & Lee (2024) | Predict individual responses via fine-tuning | Fine-tuning on GSS data achieves AUC = 0.87 for questions in training data, falling to AUC = 0.73 for truly unasked questions | Mixed |
| Amini (2025) | Predict individual electoral choices | Survey Transfer Learning (non-LLM) achieves 93% accuracy for US electoral outcomes and substantially outperforms LLM-based methods on the same task | Negative |
| Dillion et al. (2023) | Assess when LLMs can substitute for humans | Reviews conditions under which LLMs can substitute for human participants; concludes substitution is appropriate only for tasks with objectively correct answers, not for subjective opinion measurement | Negative |
| Santurkar et al. (2023) | Assess whose opinions LLMs represent | Every human demographic subgroup is more representative of US public opinion than any LM tested; RLHF worsens group-level alignment | Negative |
| Li et al. (2025) | Assess opinion distribution accuracy | LLMs assign extreme probability to the modal answer in 80–100% of subgroups on abortion and immigration questions (vs. 30–40% in ANES); structural inconsistency across levels of aggregation | Negative |
| Park et al. (2024) | Measure opinion diversity in LLM outputs | LLMs express a single dominant opinion with very high probability on most questions, suppressing the diversity of thought present in human populations | Negative |
| Tjuatja et al. (2024) | Test whether LLMs replicate response biases | LLMs do not exhibit human-like response biases across a range of survey design effects; models give uniform answers where humans show high variation | Negative |
| Beck et al. (2024) | Assess robustness of sociodemographic prompting | Sociodemographic prompting is not robust; prompt formulation and model choice produce large variance in answers, with more than half of labels incorrectly classified | Negative |
| Cheng et al. (2023) | Assess quality of social group representations | Models fail to capture the multidimensionality of social groups and instead produce caricatures that perpetuate stereotypes | Negative |
| Morocho et al. (2026) | Test whether persona conditioning improves alignment | Persona conditioning produces no consistent improvement in aggregate alignment and frequently worsens subgroup fidelity for underrepresented strata | Negative |
| Gao et al. (2025) | Simulate game-theoretic behavior | Nearly all advanced approaches fail to replicate human behavior distributions in a game-theoretic task; failure causes are diverse and unpredictable across models and prompt variations | Negative |
| Suh et al. (2025) | Fine-tune for public opinion prediction | Fine-tuning on scaled public opinion survey data improves the ability to predict response distributions and generalises to unseen populations and question topics | Positive |
| Lu et al. (2025) | Simulate human online navigation behavior | LLM agents show similar outcomes to humans in website navigation tasks but use more goal-directed strategies; hyper-accuracy distortion observed in behavioral simulation | Mixed |
Three review papers have synthesised the empirical literature on silicon sampling. Their overall conclusions converge but differ in emphasis.
Silicon samples are not reliable substitutes for human respondents, especially in policy settings (Wihbey & D’Alonzo, 2025). LLMs are useful complements for early-stage tasks — refining survey questions, pretesting framings, and exploratory concept testing — but the most defensible approach is a hybrid pipeline that keeps human samples as the gold standard for final data collection (Wihbey & D’Alonzo, 2025).
Results vary considerably across domains, and silicon samples hold the most promise in upstream phases of the research process, such as qualitative pretesting and pilot studies, rather than in main studies (Sarstedt et al., 2024). Fine-tuning models on group-specific data can backfire, pushing them to produce caricatures rather than authentic group representations (Sarstedt et al., 2024).
LLMs may be appropriate participants in a narrow set of circumstances: topics where explicit situational features drive human judgment (such as moral scenarios with a clear intentional agent), tedious or high-volume tasks, early research stages such as hypothesis generation and item piloting, and Western English-speaking samples where training data coverage is strongest (Dillion et al., 2023). Even so, LLMs collapse diversity into a single modal opinion and are better at approximating group averages than capturing within-group variation (Dillion et al., 2023).
The point of agreement across all three reviews is that silicon sampling is more defensible as a design tool than as a data collection method. The point of divergence is how much weight to give the positive cases: whether promising results in narrow domains (US electoral politics, moral judgment, consumer concept ranking) are evidence of a useful technique or anomalies that should not be generalised.
The more consistent positive results cluster around a few conditions.
Politically structured outcomes with large group margins. When the question being simulated has a strong and stable relationship with demographic predictors — as US vote choice does — LLMs can reproduce aggregate distributions with high accuracy (Argyle et al., 2023; Sun et al., 2024). The model is effectively recovering well-documented group-level correlations that are heavily represented in its training data. Results degrade for groups with weaker predictable structure (independents, mixed partisans) and for sensitive topics where training data is sparse or suppressed (Sun et al., 2024).
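Alignment at this aggregate level is usually scored by comparing the simulated answer-option shares against the observed survey shares, for instance with the KL divergence reported by Sun et al. (2024). A minimal sketch follows; the vote-choice shares are invented purely for illustration:

```python
import numpy as np

def kl_divergence(p_human, q_model, eps=1e-9):
    """KL(P_human || Q_model) over a shared set of answer options.
    A small eps avoids division by zero for unchosen options."""
    p = np.asarray(p_human, dtype=float) + eps
    q = np.asarray(q_model, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Hypothetical two-party vote-choice shares for one demographic subgroup.
human = [0.62, 0.38]   # observed survey proportions
model = [0.60, 0.40]   # proportions from repeated LLM sampling
print(kl_divergence(human, model))   # near zero = close aggregate match
```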
Tasks with structurally determined answers. Moral scenarios, economic games, and psycholinguistic tasks — where human responses are driven by explicit situational features — are the conditions where LLM output correlates most strongly with human judgment (Aher et al., 2022; Dillion et al., 2023). The model is detecting the structural signal in the prompt. Divergence appears when competing intuitions are in play and the correct response is not deducible from surface features (Dillion et al., 2023).
Ranking rather than absolute measurement. Synthetic respondents produce rankings of consumer product concepts that correlate with human rankings, but absolute scores are systematically inflated and distributions are too narrow for measurement use (Kaiser et al., 2025; Maier et al., 2025). Comparisons within a set of stimuli are more reliable than any single rating.
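The distinction can be made concrete by scoring the two uses separately: rank agreement across a set of concepts versus agreement in absolute level and spread. A small sketch with invented scores that mimic the reported pattern (ordering preserved, means inflated, variance compressed):

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical mean liking scores for five product concepts (1-10 scale).
human_scores     = np.array([6.1, 4.8, 7.3, 5.2, 6.6])
synthetic_scores = np.array([8.2, 7.5, 8.9, 7.7, 8.4])  # inflated, compressed

rho, _ = spearmanr(human_scores, synthetic_scores)
print(f"rank correlation: {rho:.2f}")                     # ordering recovered
print(f"mean shift: {synthetic_scores.mean() - human_scores.mean():+.2f}")
print(f"spread ratio: {synthetic_scores.std() / human_scores.std():.2f}")
```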
Fine-tuned models on domain-specific data. Fine-tuning on survey data improves performance and generalises to unseen questions and populations (Suh et al., 2025). However, fine-tuning on group-specific profiles can backfire, pushing models to produce caricatures rather than authentic variation within groups (Sarstedt et al., 2024).
The same studies that document positive results in narrow conditions also reveal a consistent set of failure modes that are structural, not fixable by scaling or prompting.
Under-dispersion. Every model tested produces less diverse predictions than real human data (Cao et al., 2025; Li et al., 2025; Park et al., 2024; Peng et al., 2025). LLMs assign extremely high probability to a single modal answer — often above 0.99 for a given option — whereas humans show genuine within-group variation even on contentious topics (Santurkar et al., 2023). This is not incidental: alignment training pushes models toward consensus-oriented, socially desirable responses regardless of demographic conditioning (Lyman et al., 2025).
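One simple way to quantify under-dispersion is to compare the entropy of the model's answer distribution with the human distribution for the same subgroup and item; the distributions below are invented for illustration:

```python
import numpy as np

def entropy(dist):
    """Shannon entropy (nats) of a response distribution; lower values
    mean the distribution is concentrated on a single modal answer."""
    p = np.asarray(dist, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# Hypothetical 4-option item for one demographic subgroup.
human_dist = [0.45, 0.30, 0.15, 0.10]     # genuine within-group variation
model_dist = [0.99, 0.005, 0.003, 0.002]  # near-deterministic modal answer
print(f"human entropy: {entropy(human_dist):.2f}")
print(f"model entropy: {entropy(model_dist):.2f}")
```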
Structural inconsistency. The opinion distribution an LLM predicts for a demographic group changes depending on how finely the persona is specified (Li et al., 2025). Real survey data trivially satisfies the constraint that group estimates should be consistent across levels of aggregation; LLMs systematically violate it. This means silicon sampling can produce coherent-looking aggregate estimates that have no stable relationship to underlying subgroup opinion.
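The consistency constraint itself is just a mixture identity: the population-weighted average of the distributions predicted for fine-grained subgroups should equal the distribution predicted directly for the coarser group. A sketch of the check, with hypothetical numbers chosen to show a violation:

```python
import numpy as np

def aggregate(subgroup_dists, weights):
    """Population-weighted mixture of subgroup response distributions.
    In real survey data this equals the distribution estimated directly
    for the coarser group; Li et al. (2025) report that LLM predictions
    violate this identity."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return (w[:, None] * np.asarray(subgroup_dists, dtype=float)).sum(axis=0)

# Hypothetical: two finely specified personas vs. one coarse persona.
fine_dists = [[0.97, 0.03], [0.05, 0.95]]  # predictions for the fine subgroups
weights = [0.5, 0.5]                       # their population shares
coarse_direct = [0.70, 0.30]               # prediction for the coarse group
print(aggregate(fine_dists, weights))      # [0.51, 0.49] != coarse_direct
```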
Training data skew. Models default to US and European perspectives on cross-national topics and perform poorly for groups underrepresented in training data — non-Hispanic Black Americans, older adults, non-Western populations (Durmus et al., 2023; S. Lee et al., 2024; Santurkar et al., 2023). Prompting with country or demographic information shifts responses but often toward stereotypes rather than genuine group distributions (Cheng et al., 2023; Durmus et al., 2023).
Prompt sensitivity. Results vary substantially with prompt formulation and model version, making findings difficult to replicate or reproduce (Beck et al., 2024; Bisbee et al., 2024). More than half of sociodemographic labels are incorrectly classified in some prompt configurations (Beck et al., 2024).
Individual prediction fails. Digital twins that simulate individual survey respondents achieve a mean correlation of r = 0.20 with actual human responses across a large pre-registered study set, and are significantly under-dispersed in 94% of outcomes (Peng et al., 2025). Non-LLM survey imputation methods substantially outperform LLMs on the same individual-prediction task (Amini, 2025).
Use silicon sampling for design, not data collection. The clearest applications are upstream: generating survey items, pretesting question framing, exploring concept space before fielding human samples (Dillion et al., 2023; Sarstedt et al., 2024; Wihbey & D’Alonzo, 2025). Treat silicon samples as a way to stress-test a questionnaire, not as a substitute for it.
Validate against human data before drawing conclusions. Even in domains where group-level alignment looks acceptable, regression coefficients and subgroup estimates are unreliable (Bisbee et al., 2024). If silicon sampling is used in a study, at minimum report the model and version, the full prompting approach, and a sensitivity check across prompt variants.
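A basic sensitivity check of the kind suggested here can be as simple as estimating the same item's response distribution under several prompt formulations and reporting the worst-case disagreement. The variant names and distributions below are hypothetical stand-ins for distributions estimated by repeated sampling:

```python
import itertools
import numpy as np

def total_variation(p, q):
    """Total variation distance between two response distributions."""
    return 0.5 * float(np.abs(np.asarray(p, float) - np.asarray(q, float)).sum())

# Hypothetical: the same item asked under three prompt formulations.
variants = {
    "first_person_backstory": [0.55, 0.30, 0.15],
    "third_person_profile":   [0.70, 0.20, 0.10],
    "interview_style":        [0.40, 0.45, 0.15],
}
worst = max(total_variation(p, q)
            for p, q in itertools.combinations(variants.values(), 2))
print(f"max pairwise total variation across prompt variants: {worst:.2f}")
```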
Do not use silicon sampling to measure minority opinion or within-group variation. The systematic under-dispersion and training data skew make LLMs poorly suited for studying groups that are underrepresented in training corpora or whose opinions diverge from the modal position (S. Lee et al., 2024; Lyman et al., 2025; Santurkar et al., 2023). This is precisely the research context where human samples are most needed and hardest to replace.
Aher, G., Arriaga, R. I., & Kalai, A. T. (2022). Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies. arXiv. https://doi.org/10.48550/arXiv.2208.10264
Amini, A. (2025). Survey transfer learning: Recycling data with silicon responses. arXiv. https://doi.org/10.48550/arXiv.2501.06577
Argyle, L. P., Busby, E. C., Fulda, N., Gubler, J. R., Rytting, C., & Wingate, D. (2023). Out of one, many: Using language models to simulate human samples. Political Analysis, 31(3), 337–355. https://doi.org/10.1017/pan.2023.2
Beck, T., Schuff, H., Lauscher, A., & Gurevych, I. (2024). Sensitivity, Performance, Robustness: Deconstructing the Effect of Sociodemographic Prompting. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), 2589–2615. https://doi.org/10.18653/v1/2024.eacl-long.159
Bisbee, J., Clinton, J., Dorff, C., Kenkel, B., & Larson, J. (2024). Synthetic replacements for human survey data? The perils of large language models. Political Analysis, 32(3), 401–416. https://doi.org/10.1017/pan.2023.27
Boelaert, J., Coavoux, S., Ollion, É., Petev, I., & Präg, P. (2025). Machine Bias. How Do Generative Language Models Answer Opinion Polls? Sociological Methods & Research, 54(3), 1156–1196. https://doi.org/10.1177/00491241251330582
Cao, Y., Liu, H., Arora, A., Augenstein, I., Röttger, P., & Hershcovich, D. (2025). Specializing large language models to simulate survey response distributions for global populations. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 3141–3154. https://doi.org/10.18653/v1/2025.naacl-long.162
Cerina, R., & Duch, R. (2023). Artificially Intelligent Opinion Polling. arXiv. https://doi.org/10.48550/arXiv.2309.06029
Cheng, M., Piccardi, T., & Yang, D. (2023). CoMPosT: Characterizing and Evaluating Caricature in LLM Simulations. arXiv. https://doi.org/10.48550/arXiv.2310.11501
Dillion, D., Tandon, N., Gu, Y., & Gray, K. (2023). Can AI language models replace human participants? Trends in Cognitive Sciences, 27(7), 597–600. https://doi.org/10.1016/j.tics.2023.04.008
Durmus, E., Nguyen, K., Liao, T. I., Schiefer, N., Askell, A., Bakhtin, A., Chen, C., Hatfield-Dodds, Z., Hernandez, D., Joseph, N., Lovitt, L., McCandlish, S., Sikder, O., Tamkin, A., Thamkul, J., Kaplan, J., Clark, J., & Ganguli, D. (2023). Towards Measuring the Representation of Subjective Global Opinions in Language Models. arXiv. https://doi.org/10.48550/arXiv.2306.16388
Gao, Y., Lee, D., Burtch, G., & Fazelpour, S. (2025). Take caution in using LLMs as human surrogates. Proceedings of the National Academy of Sciences, 122(24), e2501660122. https://doi.org/10.1073/pnas.2501660122
Kaiser, C., Kaiser, J., Manewitsch, V., Rau, L., & Schallner, R. (2025). Simulating Human Opinions with Large Language Models: Opportunities and Challenges for Personalized Survey Data Modeling. Adjunct Proceedings of the 33rd ACM Conference on User Modeling, Adaptation and Personalization, 82–86. https://doi.org/10.1145/3708319.3733685
Karanjai, R., Shor, B., Austin, A., Kennedy, R., Lu, Y., Xu, L., & Shi, W. (2025). Synthesizing Public Opinions with LLMs: Role Creation, Impacts, and the Future to eDemocracy. arXiv. https://doi.org/10.48550/arXiv.2504.00241
Kim, J., & Lee, B. (2024). AI-augmented surveys: Leveraging large language models and surveys for opinion prediction. arXiv. https://doi.org/10.48550/arXiv.2305.09620
Lee, K., Park, J., Choi, S., & Lee, C. (2025). Ideology and Policy Preferences in Synthetic Data: The Potential of LLMs for Public Opinion Analysis. Media and Communication, 13, 9677. https://doi.org/10.17645/mac.9677
Lee, S., Peng, T.-Q., Goldberg, M. H., Rosenthal, S. A., Kotcher, J. E., Maibach, E. W., & Leiserowitz, A. (2024). Can large language models estimate public opinion about global warming? An empirical assessment of algorithmic fidelity and bias. PLOS Climate, 3(8), e0000429. https://doi.org/10.1371/journal.pclm.0000429
Li, D., Li, L., & Qiu, H. S. (2025). ChatGPT is not A Man but Das Man: Representativeness and Structural Consistency of Silicon Samples Generated by Large Language Models. arXiv. https://doi.org/10.48550/arXiv.2507.02919
Lu, Y., Huang, J., Han, Y., Yao, B., Bei, S., Gesi, J., Xie, Y., Wang, Z., He, Q., & Wang, D. (2025). Can LLM Agents Simulate Multi-Turn Human Behavior? Evidence from Real Online Customer Behavior Data. arXiv. https://doi.org/10.48550/arXiv.2503.20749
Lyman, A., Hepner, B., Argyle, L. P., Busby, E. C., Gubler, J. R., & Wingate, D. (2025). Balancing Large Language Model Alignment and Algorithmic Fidelity in Social Science Research. Sociological Methods & Research, 54(3), 1110–1155. https://doi.org/10.1177/00491241251342008
Maier, B. F., Aslak, U., Fiaschi, L., Rismal, N., Fletcher, K., Luhmann, C. C., Dow, R., Pappas, K., & Wiecki, T. V. (2025). LLMs reproduce human purchase intent via semantic similarity elicitation of Likert ratings. arXiv. https://doi.org/10.48550/arXiv.2510.08338
Morocho, E. E. T., Cima, L., Fagni, T., Avvenuti, M., & Cresci, S. (2026). Assessing the reliability of persona-conditioned LLMs as synthetic survey respondents. arXiv. https://doi.org/10.48550/arXiv.2602.18462
Neumann, T., De-Arteaga, M., & Fazelpour, S. (2025). Should you use LLMs to simulate opinions? Quality checks for early-stage deliberation. arXiv. https://doi.org/10.48550/arXiv.2504.08954
Park, P. S., Schoenegger, P., & Zhu, C. (2024). Diminished diversity-of-thought in a standard large language model. Behavior Research Methods, 56(6), 5754–5770. https://doi.org/10.3758/s13428-023-02307-x
Peng, T., Gui, G., Merlau, D. J., Fan, G. J., Sliman, M. B., Brucks, M., Johnson, E. J., Morwitz, V., Althenayyan, A., Bellezza, S., Donati, D., Fong, H., Friedman, E., Guevara, A., Hussein, M., Jerath, K., Kogut, B., Kumar, A., Lane, K., … Toubia, O. (2025). A mega-study of digital twins reveals strengths, weaknesses and opportunities for further improvement. arXiv. https://doi.org/10.48550/arXiv.2509.19088
Santurkar, S., Durmus, E., Ladhak, F., Lee, C., Liang, P., & Hashimoto, T. (2023). Whose opinions do language models reflect? Proceedings of the 40th International Conference on Machine Learning.
Sarstedt, M., Adler, S. J., Rau, L., & Schmitt, B. (2024). Using large language models to generate silicon samples in consumer and marketing research: Challenges, opportunities, and guidelines. Psychology & Marketing, 41(6), 1254–1270. https://doi.org/10.1002/mar.21982
Suh, J., Jahanparast, E., Moon, S., Kang, M., & Chang, S. (2025). Language Model Fine-Tuning on Scaled Survey Data for Predicting Distributions of Public Opinions. arXiv. https://doi.org/10.48550/arXiv.2502.16761
Sun, S., Lee, E., Nan, D., Zhao, X., Lee, W., Jansen, B. J., & Kim, J. H. (2024). Random silicon sampling: Simulating human sub-population opinion using a large language model based on group-level demographic information. Findings of the Association for Computational Linguistics: ACL 2024.
Tjuatja, L., Chen, V., Wu, T., Talwalkar, A., & Neubig, G. (2024). Do LLMs Exhibit Human-like Response Biases? A Case Study in Survey Design. Transactions of the Association for Computational Linguistics, 12, 1011–1026. https://doi.org/10.1162/tacl_a_00685
Wihbey, J., & D’Alonzo, S. (2025). AI simulations of audience attitudes and policy preferences: Silicon Sampling guidance for communications practitioners. SSRN. https://doi.org/10.2139/ssrn.5533958
Xu, S., Santosh, T. Y. S. S., Elazar, Y., Vogel, Q., Plank, B., & Grabmair, M. (2025). Better Aligned with Survey Respondents or Training Data? Unveiling Political Leanings of LLMs on U.S. Supreme Court Cases. arXiv. https://doi.org/10.48550/arXiv.2502.18282