
Synthetic healthcare data, finally native to Python.

PySynthea is a Python-native reimplementation of Synthea, designed to make synthetic longitudinal EHR generation installable, extensible, reproducible, and usable directly inside the data science stack where modern healthcare AI work already happens.

Published May 21, 2026
Authors: Roberto Cruz, David Rey-Blanco
Reading time: ~9 min


Roberto Cruz, TietAI
David Rey-Blanco, TietAI

[Figure: High-level PySynthea architecture showing the generator, person model, module engine, and export layer.]

System architecture. PySynthea organizes synthetic patient generation around a Python Generator, deterministic Person instances, a GMF-compatible module engine, and export layers for FHIR, JSON, CSV, and analytics-friendly formats.

01 - What is PySynthea

A Synthea-style simulator built for Python-first healthcare AI.

Synthetic healthcare data lets teams prototype, teach, test, and benchmark without exposing protected health information. Synthea made this practical for longitudinal EHR data, but many researchers now work almost entirely in Python, pandas, Jupyter, PyTorch, Dask, and PySpark.

PySynthea keeps the core Synthea model: modular disease state machines, demographic population sampling, longitudinal patient histories, and standards-conformant export. The difference is operational. It is designed to install with standard Python tooling, run without a JVM, expose a clean Python API, and put generated records directly into the workflows used for healthcare data science and applied AI.

The central idea is simple: synthetic healthcare data generation should be a Python library, not a separate runtime that data scientists have to bridge through files and shell commands.

02 - Why it matters

Less setup friction, faster reproducible research.

Real EHR data is constrained by HIPAA, GDPR, institutional review, contractual restrictions, and re-identification risk. Synthetic data does not replace real clinical validation, but it gives teams a shareable, reproducible substrate for early experimentation and engineering.

pip. Install through normal Python workflows, including pip and uv.
GMF. Runs original Synthea Generic Module Framework JSON modules.
FHIR R4. Exports standards-conformant bundles for interoperability testing.
Feeds pandas, tensors, notebooks, and distributed analytics pipelines.

Design goals

Goal	What it means
Accessibility	Python-native install and use, with no external JVM runtime.
Interoperability	Natural integration with pandas, NumPy, PyTorch, Dask, PySpark, Airflow, and Jupyter.
Reproducibility	Deterministic generation from explicit seeds and configuration.
Extensibility	Documented extension points for state types, exporters, modules, and simulation behavior.
Scalability	Interactive notebook cohorts and larger parallel batch jobs share the same API.
Standards fidelity	FHIR R4, CSV, JSON, and Synthea-compatible identifiers remain first-class outputs.
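The reproducibility goal can be illustrated with a minimal sketch that uses only the Python standard library. It is not PySynthea's API: `person_rng` and `sample_person` are hypothetical names, and the demographic fields are placeholders. The sketch assumes each person gets a private random stream derived from a master seed and a person index, so a record does not depend on the order in which people are generated.

```python
import hashlib
import random


def person_rng(master_seed: int, person_index: int) -> random.Random:
    """Derive an independent, reproducible RNG for one synthetic person."""
    # Hash the (seed, index) pair so neighbouring indices get unrelated streams.
    digest = hashlib.sha256(f"{master_seed}:{person_index}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def sample_person(master_seed: int, person_index: int) -> dict:
    """Draw a toy demographic record; the fields are illustrative only."""
    rng = person_rng(master_seed, person_index)
    return {
        "index": person_index,
        "sex": rng.choice(["F", "M"]),
        "birth_year": rng.randint(1930, 2020),
    }


# Same seed, generated in forward and in reverse order.
forward = [sample_person(42, i) for i in range(5)]
backward = [sample_person(42, i) for i in reversed(range(5))][::-1]

print(forward == backward)  # True: each record depends only on (seed, index)
```

Because every record is a pure function of the seed and the index, a single patient can be regenerated in isolation for debugging without rerunning the whole cohort.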

03 - How it works

Generator, world model, module engine, exporters.

PySynthea is organized as a single Python distribution with four primary packages: the simulation engine, the healthcare world model, export adapters, and helper utilities for configuration and shared behavior.

The Generator coordinates demographic sampling, simulation, module execution, and export. Each Person carries deterministic randomness, attributes, active module state, and a HealthRecord. Modules are loaded from Synthea GMF JSON and executed as typed state machines, including encounters, condition onsets, medication orders, observations, procedures, care plans, devices, diagnostic reports, immunizations, and terminal states.
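To make the module engine concrete, here is a small sketch of a state-machine interpreter for a heavily simplified GMF-style module. It is an illustration under stated assumptions, not PySynthea's engine: it handles only the `Initial`, `Simple`, `ConditionOnset`, and `Terminal` state types with `direct_transition` and `distributed_transition`, and the toy module, `next_state`, and `run_module` are hypothetical.

```python
import json
import random
from typing import Dict, List

# A toy module in the general shape of Synthea GMF JSON (simplified subset).
MODULE_JSON = """
{
  "name": "Toy Condition",
  "states": {
    "Initial": {"type": "Initial", "direct_transition": "Maybe_Onset"},
    "Maybe_Onset": {
      "type": "Simple",
      "distributed_transition": [
        {"distribution": 0.3, "transition": "Onset"},
        {"distribution": 0.7, "transition": "Terminal"}
      ]
    },
    "Onset": {
      "type": "ConditionOnset",
      "codes": [{"system": "SNOMED-CT", "code": "44054006",
                 "display": "Diabetes mellitus type 2 (disorder)"}],
      "direct_transition": "Terminal"
    },
    "Terminal": {"type": "Terminal"}
  }
}
"""


def next_state(state: Dict, rng: random.Random) -> str:
    """Resolve the transition out of a state."""
    if "direct_transition" in state:
        return state["direct_transition"]
    # distributed_transition: pick a branch by cumulative probability.
    roll = rng.random()
    cumulative = 0.0
    for branch in state["distributed_transition"]:
        cumulative += branch["distribution"]
        if roll < cumulative:
            return branch["transition"]
    # Guard against floating-point shortfall in the cumulative sum.
    return state["distributed_transition"][-1]["transition"]


def run_module(module: Dict, rng: random.Random) -> List[Dict]:
    """Walk the module from Initial to Terminal and collect emitted events."""
    states = module["states"]
    name = "Initial"
    record: List[Dict] = []
    while True:
        state = states[name]
        if state["type"] == "ConditionOnset":
            record.append({"kind": "condition", "code": state["codes"][0]["code"]})
        if state["type"] == "Terminal":
            return record
        name = next_state(state, rng)


module = json.loads(MODULE_JSON)
print(run_module(module, random.Random(1)))
```

A full engine would also track simulated time, delays, guards, and encounters; the sketch only shows how JSON states can drive typed behavior per person.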

The generation pipeline

Step 1: Initialize. Load configuration, seed the random number generator, and load providers, payers, resources, and modules.
Step 2: Sample. Draw demographics, location, birth date, and social context from configured distributions.
Step 3: Simulate. Advance disease modules through a longitudinal timestep-based engine.
Step 4: Emit. Write encounters, conditions, medications, procedures, observations, and costs.
Step 5: Order. Preserve temporal coherence across decades of simulated patient history.
Step 6: Export. Produce FHIR, CSV, JSON, Parquet, or database-ready outputs.
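The six steps can be sketched end to end in a few dozen lines. This is a toy pipeline for illustration only: the function names, the weekly timestep, the per-step encounter probability, and the birth-date window are all assumptions, and none of it reflects PySynthea's actual classes or defaults.

```python
import json
import random
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List


@dataclass
class Person:
    seed: int
    birth_date: date
    events: List[dict] = field(default_factory=list)


def initialize(seed: int) -> random.Random:
    # Step 1: a real engine would also load providers, payers, and modules.
    return random.Random(seed)


def sample(rng: random.Random, seed: int) -> Person:
    # Step 2: draw a birth date uniformly from a fixed window (toy distribution).
    offset = rng.randrange(365 * 40)
    return Person(seed=seed, birth_date=date(1950, 1, 1) + timedelta(days=offset))


def simulate(person: Person, rng: random.Random, end: date, step_days: int = 7) -> None:
    # Steps 3 and 4: advance in fixed timesteps and emit events as they occur.
    today = person.birth_date
    while today <= end:
        if rng.random() < 0.002:  # hypothetical per-step encounter probability
            person.events.append({"type": "encounter", "date": today.isoformat()})
        today += timedelta(days=step_days)


def order(person: Person) -> None:
    # Step 5: keep the history sorted by date (ISO dates sort lexicographically).
    person.events.sort(key=lambda event: event["date"])


def export(person: Person) -> str:
    # Step 6: serialize to JSON; other formats would hang off the same record.
    return json.dumps({"birth_date": person.birth_date.isoformat(),
                       "events": person.events})


def generate(seed: int, end: date = date(2025, 1, 1)) -> str:
    rng = initialize(seed)
    person = sample(rng, seed)
    simulate(person, rng, end)
    order(person)
    return export(person)


print(generate(7) == generate(7))  # True: same seed, same exported history
```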

04 - Outputs and workflows

From notebook cohorts to FHIR test data and data lakes.

Because generation runs in-process, records can be inspected immediately as Python objects, materialized as DataFrames, converted to model-ready sequences, written to a FHIR server, or partitioned for distributed jobs.

Format	Primary use
FHIR R4	FHIR server loading, validation, interoperability testing, and SMART-on-FHIR development.
CSV	Flat resource tables for pandas, spreadsheets, and database bulk loading.
JSON	Nested patient documents for debugging, document databases, and complete history inspection.
Parquet	Columnar analytics for data lakes, PySpark, Dask, and lakehouse workflows.
Relational DB	Research warehouses, test databases, and downstream schema mappings through SQLAlchemy or pandas.
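As an illustration of the FHIR R4 row, the sketch below maps a toy patient dictionary to a minimal `collection` Bundle containing one Patient and its Condition resources. The input shape and the `to_fhir_bundle` helper are hypothetical, the resources carry only a handful of fields, and the output has not been run through a FHIR validator; a production exporter would add identifiers, encounters, statuses, and profiles.

```python
import json
import uuid

SNOMED = "http://snomed.info/sct"


def to_fhir_bundle(patient: dict) -> dict:
    """Map a toy patient dict to a minimal FHIR R4 collection Bundle."""
    # Name-based UUIDs keep regenerated bundles identical for the same input.
    patient_url = "urn:uuid:" + str(
        uuid.uuid5(uuid.NAMESPACE_URL, "patient/" + patient["id"]))
    entries = [{
        "fullUrl": patient_url,
        "resource": {
            "resourceType": "Patient",
            "gender": patient["gender"],
            "birthDate": patient["birth_date"],
        },
    }]
    for i, cond in enumerate(patient["conditions"]):
        cond_url = "urn:uuid:" + str(
            uuid.uuid5(uuid.NAMESPACE_URL, f"condition/{patient['id']}/{i}"))
        entries.append({
            "fullUrl": cond_url,
            "resource": {
                "resourceType": "Condition",
                # Link the condition back to the patient entry in this bundle.
                "subject": {"reference": patient_url},
                "code": {"coding": [{"system": SNOMED,
                                     "code": cond["code"],
                                     "display": cond["display"]}]},
                "onsetDateTime": cond["onset"],
            },
        })
    return {"resourceType": "Bundle", "type": "collection", "entry": entries}


example = {
    "id": "demo-1",
    "gender": "female",
    "birth_date": "1980-04-02",
    "conditions": [{"code": "44054006",
                    "display": "Diabetes mellitus type 2 (disorder)",
                    "onset": "2015-06-01"}],
}

print(json.dumps(to_fhir_bundle(example), indent=2))
```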

Representative workflows

Notebook-based cohort generation. Generate a small cohort, inspect records, and iterate on parameters in a few cells.
Custom patient construction. Build a deterministic patient for teaching, debugging, or unit-testing modules.
Parallel batch generation. Run independent patients across threads, processes, Dask workers, or PySpark partitions.
ML benchmarking. Regenerate fixed-seed cohorts to compare models under controlled synthetic distributions.
Interoperability testing. Produce controlled FHIR bundles for end-to-end software and integration tests.
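The parallel batch workflow depends on patients being independent of one another. The sketch below shows one way that property could be exploited: patient indices are split into contiguous partitions, each partition is generated separately, and the merged cohort is the same regardless of how many partitions are used. All names are hypothetical and the "patient" is a placeholder record. Threads are used only to keep the example self-contained; CPU-bound pure-Python work would normally go to processes, Dask workers, or PySpark partitions following the same pattern.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import List


def generate_patient(master_seed: int, index: int) -> dict:
    # String seeds are hashed deterministically by random.Random.
    rng = random.Random(f"{master_seed}:{index}")
    return {"index": index, "n_encounters": rng.randint(0, 50)}


def generate_partition(master_seed: int, indices: range) -> List[dict]:
    return [generate_patient(master_seed, i) for i in indices]


def generate_cohort(master_seed: int, size: int, partitions: int) -> List[dict]:
    # Split patient indices into contiguous chunks, one per partition.
    chunk = -(-size // partitions)  # ceiling division
    ranges = [range(start, min(start + chunk, size))
              for start in range(0, size, chunk)]
    with ThreadPoolExecutor(max_workers=partitions) as pool:
        # map() yields results in input order, so the merged cohort is stable.
        parts = pool.map(lambda r: generate_partition(master_seed, r), ranges)
        return [patient for part in parts for patient in part]


one = generate_cohort(2026, 100, partitions=1)
many = generate_cohort(2026, 100, partitions=7)

print(one == many)  # True: the result does not depend on the partitioning
```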

05 - Tradeoffs and future work

A Python foundation for the next generation of synthetic clinical simulation.

PySynthea is an independent engine, so exact semantic parity with upstream Java Synthea remains a continuing validation target. The goal is to keep Synthea GMF modules as the shared clinical content source while making the engine easier for Python users to extend, profile, test, and embed.

The paper also makes a clear distinction between Synthea-style synthetic data and learned generative models. PySynthea is valuable for producing structurally realistic, longitudinal, standards-conformant records, and for education, reproducibility, and pipeline testing. It is not a substitute for clinical validation on real cohorts, nor for a learned model that reproduces the distribution of a specific institution.

PySynthea lowers the cost of using synthetic EHR data in healthcare AI, but responsible deployment still requires validation, governance, and a clear understanding of what synthetic data can and cannot prove.

Research directions

GPU acceleration. Explore vectorized and batched simulation through CUDA, JAX, or PyTorch.
LLM-assisted modules. Draft disease modules from clinical guidelines with human-in-the-loop review.
Multi-agent patient simulation. Couple patient trajectories through outbreaks, households, and system congestion.
Reinforcement learning environments. Expose synthetic patient state and treatment actions for clinical RL benchmarks.
Multimodal synthetic data. Link structured EHR histories with synthetic imaging and clinical text.
