Synthetic healthcare data, finally native to Python.

PySynthea is a Python-native reimplementation of Synthea, designed to make synthetic longitudinal EHR generation installable, extensible, reproducible, and usable directly inside the data science stack where modern healthcare AI work already happens.

Published: May 21, 2026
Authors: Roberto Cruz (TietAI), David Rey-Blanco (TietAI)
Reading time: ~9 min
[Figure: High-level PySynthea architecture showing the generator, person model, module engine, and export layer.]

System architecture. PySynthea organizes synthetic patient generation around a Python Generator, deterministic Person instances, a GMF-compatible module engine, and export layers for FHIR, JSON, CSV, and analytics-friendly formats.

A Synthea-style simulator built for Python-first healthcare AI.

Synthetic healthcare data lets teams prototype, teach, test, and benchmark without exposing protected health information. Synthea made this practical for longitudinal EHR data, but many researchers now work almost entirely in Python, pandas, Jupyter, PyTorch, Dask, and PySpark.

PySynthea keeps the core Synthea model: modular disease state machines, demographic population sampling, longitudinal patient histories, and standards-conformant export. The difference is operational. It is designed to install with standard Python tooling, run without a JVM, expose a clean Python API, and put generated records directly into the workflows used for healthcare data science and applied AI.

The central idea is simple: synthetic healthcare data generation should be a Python library, not a separate runtime that data scientists have to bridge through files and shell commands.

Less setup friction, faster reproducible research.

Real EHR data is constrained by HIPAA, GDPR, institutional review, contractual restrictions, and re-identification risk. Synthetic data does not replace real clinical validation, but it gives teams a shareable, reproducible substrate for early experimentation and engineering.

  • pip. Install through normal Python workflows, including pip and uv.
  • GMF. Runs original Synthea Generic Module Framework JSON modules.
  • FHIR R4. Exports standards-conformant bundles for interoperability testing.
  • ML. Feeds pandas, tensors, notebooks, and distributed analytics pipelines.

Design goals

| Goal | What it means |
| --- | --- |
| Accessibility | Python-native install and use, with no external JVM runtime. |
| Interoperability | Natural integration with pandas, NumPy, PyTorch, Dask, PySpark, Airflow, and Jupyter. |
| Reproducibility | Deterministic generation from explicit seeds and configuration. |
| Extensibility | Documented extension points for state types, exporters, modules, and simulation behavior. |
| Scalability | Interactive notebook cohorts and larger parallel batch jobs share the same API. |
| Standards fidelity | FHIR R4, CSV, JSON, and Synthea-compatible identifiers remain first-class outputs. |

Generator, world model, module engine, exporters.

PySynthea is organized as a single Python distribution with four primary packages: the simulation engine, the healthcare world model, export adapters, and helper utilities for configuration and shared behavior.

The Generator coordinates demographic sampling, simulation, module execution, and export. Each Person carries deterministic randomness, attributes, active module state, and a HealthRecord. Modules are loaded from Synthea GMF JSON and executed as typed state machines, including encounters, condition onsets, medication orders, observations, procedures, care plans, devices, diagnostic reports, immunizations, and terminal states.
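To make the module engine concrete, here is a minimal sketch of how a GMF-style JSON module might be walked as a state machine with a per-person random stream. The module content is a toy written for this example, `ToyPerson`, `next_state`, and `run_module` are hypothetical names, and only a small subset of GMF is handled (Initial, Simple, ConditionOnset, and Terminal states with `direct_transition` and `distributed_transition`). It is not PySynthea's actual engine or API.

```python
import json
import random
from dataclasses import dataclass, field

# Toy module in the shape of a Synthea GMF file (illustrative content only).
TOY_MODULE = json.loads("""
{
  "name": "Toy Diabetes",
  "states": {
    "Initial": {"type": "Initial", "direct_transition": "Risk_Check"},
    "Risk_Check": {
      "type": "Simple",
      "distributed_transition": [
        {"distribution": 0.3, "transition": "Onset"},
        {"distribution": 0.7, "transition": "Terminal"}
      ]
    },
    "Onset": {
      "type": "ConditionOnset",
      "codes": [{"system": "SNOMED-CT", "code": "44054006", "display": "Diabetes"}],
      "direct_transition": "Terminal"
    },
    "Terminal": {"type": "Terminal"}
  }
}
""")


@dataclass
class ToyPerson:
    """Hypothetical stand-in for a person: own RNG plus a list of condition codes."""
    seed: int
    conditions: list = field(default_factory=list)

    def __post_init__(self):
        # Each person owns a private random stream, so results do not
        # depend on how many other people were simulated before.
        self.rng = random.Random(self.seed)


def next_state(state: dict, rng: random.Random) -> str:
    """Pick the next state name from a direct or distributed transition."""
    if "direct_transition" in state:
        return state["direct_transition"]
    draw, cumulative = rng.random(), 0.0
    options = state["distributed_transition"]
    for option in options:
        cumulative += option["distribution"]
        if draw < cumulative:
            return option["transition"]
    return options[-1]["transition"]  # guard against floating-point rounding


def run_module(module: dict, person: ToyPerson) -> list:
    """Walk the module from Initial to Terminal and return the visited path."""
    name, path = "Initial", []
    while True:
        state = module["states"][name]
        path.append(name)
        if state["type"] == "ConditionOnset":
            person.conditions.extend(c["code"] for c in state["codes"])
        if state["type"] == "Terminal":
            return path
        name = next_state(state, person.rng)


if __name__ == "__main__":
    for seed in (1, 2, 3):
        person = ToyPerson(seed)
        print(seed, run_module(TOY_MODULE, person), person.conditions)
```

Because each `ToyPerson` draws only from its own seeded stream, rerunning the module with the same seed reproduces the same path, which is the property the deterministic Person design is meant to provide.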

The generation pipeline

  1. Initialize. Load configuration, providers, payers, resources, and modules, and seed the random number generator.
  2. Sample. Draw demographics, location, birth date, and social context from configured distributions.
  3. Simulate. Advance disease modules through a longitudinal timestep-based engine.
  4. Emit. Write encounters, conditions, medications, procedures, observations, and costs.
  5. Order. Preserve temporal coherence across decades of simulated patient history.
  6. Export. Produce FHIR, CSV, JSON, Parquet, or database-ready outputs.
[Figure: End-to-end PySynthea generation pipeline from initialization through demographic sampling, longitudinal simulation, and export.]
End-to-end pipeline. Each synthetic patient is initialized, demographically sampled, simulated across a longitudinal timeline, and exported in one or more interoperable formats.
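The six pipeline steps can be sketched in a few dozen lines of plain Python. Everything here is an assumption made for illustration: the `Patient` and `Event` classes, the function names, the one-year timestep, and the toy probabilities are not taken from PySynthea, and the "disease module" is reduced to a single random onset.

```python
import json
import random
from dataclasses import dataclass, field, asdict


@dataclass
class Event:
    year: int
    kind: str    # "condition" or "encounter" in this toy example
    detail: str


@dataclass
class Patient:
    patient_id: int
    birth_year: int
    gender: str
    events: list = field(default_factory=list)


def initialize(seed: int) -> dict:
    """Step 1: collect configuration, including the cohort seed."""
    return {"seed": seed, "end_year": 2025, "onset_risk": 0.02}


def sample(ctx: dict, patient_id: int):
    """Step 2: draw demographics from (toy) distributions with a per-patient RNG."""
    rng = random.Random(f"{ctx['seed']}:{patient_id}")
    patient = Patient(patient_id, rng.randint(1940, 2005), rng.choice(["female", "male"]))
    return patient, rng


def simulate(ctx: dict, patient: Patient, rng: random.Random) -> None:
    """Steps 3 and 4: advance one-year timesteps and emit events."""
    has_condition = False
    for year in range(patient.birth_year, ctx["end_year"] + 1):
        if not has_condition and rng.random() < ctx["onset_risk"]:
            has_condition = True
            patient.events.append(Event(year, "condition", "toy chronic condition"))
        # Patients with the condition are seen every year, others about half the time.
        if has_condition or rng.random() < 0.5:
            patient.events.append(Event(year, "encounter", "wellness visit"))


def order(patient: Patient) -> None:
    """Step 5: keep the history in chronological order (stable sort)."""
    patient.events.sort(key=lambda e: e.year)


def export(patients: list) -> str:
    """Step 6: serialize to JSON, standing in for any export format."""
    return json.dumps([asdict(p) for p in patients], indent=2)


def generate(seed: int, count: int) -> str:
    ctx = initialize(seed)
    patients = []
    for patient_id in range(count):
        patient, rng = sample(ctx, patient_id)
        simulate(ctx, patient, rng)
        order(patient)
        patients.append(patient)
    return export(patients)


if __name__ == "__main__":
    print(generate(seed=7, count=2)[:400])
```

Calling `generate` twice with the same seed and count returns the same JSON string, since every random draw comes from a stream seeded by the cohort seed and the patient index.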

From notebook cohorts to FHIR test data and data lakes.

Because generation runs in-process, records can be inspected immediately as Python objects, materialized as DataFrames, converted to model-ready sequences, written to a FHIR server, or partitioned for distributed jobs.
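As one illustration of the "model-ready sequences" step, the sketch below turns per-patient event histories into fixed-length integer sequences. The record layout (a list of `(date, code)` tuples per patient), the function names, and the padding convention are all assumptions made for this example; a real pipeline would likely hand the result to NumPy or PyTorch.

```python
from datetime import date

PAD, UNK = 0, 1  # reserved token ids (a convention chosen for this sketch)


def build_vocab(histories: dict) -> dict:
    """Assign an integer id to every distinct code, in sorted order."""
    codes = sorted({code for events in histories.values() for _, code in events})
    return {code: i + 2 for i, code in enumerate(codes)}


def encode(events: list, vocab: dict, max_len: int) -> list:
    """Sort events by date, map codes to ids, keep the latest max_len, left-pad."""
    ordered = sorted(events, key=lambda e: e[0])
    ids = [vocab.get(code, UNK) for _, code in ordered][-max_len:]
    return [PAD] * (max_len - len(ids)) + ids


# Toy histories keyed by patient id; the codes are example strings.
histories = {
    "patient-a": [(date(2019, 3, 1), "44054006"), (date(2012, 6, 9), "38341003")],
    "patient-b": [(date(2020, 1, 15), "38341003")],
}
vocab = build_vocab(histories)
sequences = {pid: encode(ev, vocab, max_len=4) for pid, ev in histories.items()}
print(sequences)
```

Left-padding keeps the most recent events at the end of every sequence, which is a common choice for next-event prediction; right-padding would work equally well with a matching attention mask.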

| Format | Primary use |
| --- | --- |
| FHIR R4 | FHIR server loading, validation, interoperability testing, and SMART-on-FHIR development. |
| CSV | Flat resource tables for pandas, spreadsheets, and database bulk loading. |
| JSON | Nested patient documents for debugging, document databases, and complete history inspection. |
| Parquet | Columnar analytics for data lakes, PySpark, Dask, and lakehouse workflows. |
| Relational DB | Research warehouses, test databases, and downstream schema mappings through SQLAlchemy or pandas. |
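To show how one in-memory record can feed two of these formats, the following sketch wraps a toy patient in a minimal FHIR R4 Bundle and flattens the same data into a CSV table. The input record shape, the function names, and the CSV column names are hypothetical. The Bundle, Patient, and Condition fields follow FHIR R4, but the output is deliberately minimal and has not been validated against any profile.

```python
import csv
import io
import json
import uuid


def to_fhir_bundle(record: dict) -> dict:
    """Wrap one toy patient record in a FHIR R4 collection Bundle."""
    patient_url = f"urn:uuid:{record['id']}"
    entries = [{
        "fullUrl": patient_url,
        "resource": {
            "resourceType": "Patient",
            "id": record["id"],
            "gender": record["gender"],
            "birthDate": record["birth_date"],
        },
    }]
    for cond in record["conditions"]:
        # Deterministic resource id derived from the patient id and the condition.
        cond_id = str(uuid.uuid5(uuid.UUID(record["id"]), cond["code"] + cond["onset"]))
        entries.append({
            "fullUrl": f"urn:uuid:{cond_id}",
            "resource": {
                "resourceType": "Condition",
                "id": cond_id,
                "subject": {"reference": patient_url},
                "code": {"coding": [{
                    "system": "http://snomed.info/sct",
                    "code": cond["code"],
                    "display": cond["display"],
                }]},
                "onsetDateTime": cond["onset"],
            },
        })
    return {"resourceType": "Bundle", "type": "collection", "entry": entries}


def to_conditions_csv(records: list) -> str:
    """Flatten conditions into one CSV table, one row per condition."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["patient_id", "code", "display", "onset"])
    for record in records:
        for cond in record["conditions"]:
            writer.writerow([record["id"], cond["code"], cond["display"], cond["onset"]])
    return buffer.getvalue()


# Toy record; the identifier and dates are made up for the example.
record = {
    "id": "6f1c2a9e-3b7d-4c58-9a41-2d5e8f0b1c23",
    "gender": "female",
    "birth_date": "1972-08-14",
    "conditions": [
        {"code": "44054006", "display": "Diabetes mellitus type 2", "onset": "2015-04-02"},
    ],
}

if __name__ == "__main__":
    print(json.dumps(to_fhir_bundle(record), indent=2))
    print(to_conditions_csv([record]))
```

Deriving the Condition id with `uuid.uuid5` rather than `uuid.uuid4` keeps the export reproducible: the same record always yields the same bundle.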

Representative workflows

  • Notebook-based cohort generation. Generate a small cohort, inspect records, and iterate on parameters in a few cells.
  • Custom patient construction. Build a deterministic patient for teaching, debugging, or unit-testing modules.
  • Parallel batch generation. Run independent patients across threads, processes, Dask workers, or PySpark partitions.
  • ML benchmarking. Regenerate fixed-seed cohorts to compare models under controlled synthetic distributions.
  • Interoperability testing. Produce controlled FHIR bundles for end-to-end software and integration tests.
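The parallel and fixed-seed workflows above both rest on one idea: if each patient's random stream depends only on the cohort seed and the patient's index, the cohort is identical no matter how work is split across workers. The sketch below illustrates that idea with the standard library. `patient_seed`, `generate_patient`, and `generate_cohort` are hypothetical names, and the "simulation" is a placeholder that draws two numbers.

```python
import hashlib
import random
from concurrent.futures import ThreadPoolExecutor


def patient_seed(master_seed: int, index: int) -> int:
    """Derive a stable per-patient seed from the cohort seed and patient index."""
    digest = hashlib.sha256(f"{master_seed}:{index}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def generate_patient(master_seed: int, index: int) -> dict:
    """Toy stand-in for simulating one patient; uses only its own RNG."""
    rng = random.Random(patient_seed(master_seed, index))
    return {
        "index": index,
        "birth_year": rng.randint(1940, 2005),
        "encounters": rng.randint(0, 40),
    }


def generate_cohort(master_seed: int, size: int, workers: int = 1) -> list:
    """Generate patients concurrently; the result does not depend on `workers`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in submission order, whatever the completion order.
        return list(pool.map(lambda i: generate_patient(master_seed, i), range(size)))


if __name__ == "__main__":
    serial = generate_cohort(master_seed=2026, size=100, workers=1)
    parallel = generate_cohort(master_seed=2026, size=100, workers=8)
    print(serial == parallel)
```

Threads are used here only to keep the example self-contained. For CPU-bound simulation, a process pool, Dask workers, or PySpark partitions would be the more likely choice, and the same seed derivation carries over unchanged.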

A Python foundation for the next generation of synthetic clinical simulation.

PySynthea is an independent engine, so exact semantic parity with upstream Java Synthea remains a continuing validation target. The goal is to keep Synthea GMF modules as the shared clinical content source while making the engine easier for Python users to extend, profile, test, and embed.

The paper also makes a clear distinction between Synthea-style synthetic data and learned generative models. PySynthea is valuable for generating structured, longitudinal, standards-conformant records and for education, reproducibility, and pipeline testing. It is not a substitute for clinical validation on real cohorts, nor for a learned model that reproduces the distribution of a specific institution.

PySynthea lowers the cost of using synthetic EHR data in healthcare AI, but responsible deployment still requires validation, governance, and a clear understanding of what synthetic data can and cannot prove.

Research directions

  • GPU acceleration. Explore vectorized and batched simulation through CUDA, JAX, or PyTorch.
  • LLM-assisted modules. Draft disease modules from clinical guidelines with human-in-the-loop review.
  • Multi-agent patient simulation. Couple patient trajectories through outbreaks, households, and system congestion.
  • Reinforcement learning environments. Expose synthetic patient state and treatment actions for clinical RL benchmarks.
  • Multimodal synthetic data. Link structured EHR histories with synthetic imaging and clinical text.