
A multi-agent medical AI that beats the flagship.

MDIA is a clinical reasoning system built by TietAI. On HealthBench Professional, OpenAI's most rigorous medical benchmark drawn from real clinician conversations, MDIA scores 0.627, surpassing OpenAI's own ChatGPT for Clinicians (0.590) and the physician-written baseline (0.437), using OpenAI's own grader.

Published: May 15, 2026
Authors: Roberto Cruz, David Rey-Blanco
Reading time: ~8 min


Roberto Cruz, TietAI

David Rey-Blanco, TietAI

MDIA v1.0.53 outperforms ChatGPT for Clinicians on HealthBench Professional under GPT-5.4 grading.

Overall comparison. Each bar shows the rubric score on HealthBench Professional (n = 525) under OpenAI's own GPT-5.4 grader. MDIA v1.0.53 reaches 0.627, 3.7 points above ChatGPT for Clinicians and 19 points above the physician-written baseline.

01 - What is MDIA

A team of specialised AI clinicians, not a single chatbot.

Most medical chatbots are a single large language model with a fancy prompt. MDIA is different. It is a coordinated graph of seven agents, each with its own role and tools, that work together to read a patient's question, look things up in trusted medical sources, decide which specialist should answer, write a response, and double-check it for safety.

We built MDIA on TietAI's Hydra Platform, an internal agent-orchestration engine, using off-the-shelf Google Gemini models as the underlying reasoners. No fine-tuning, no proprietary medical data, no special model access. All of the performance gains come from the way the agents are wired together and the engineering work behind them.

The headline finding: under OpenAI's own grader, on OpenAI's own benchmark, a well-orchestrated multi-agent system using a general-purpose LLM can beat a domain-specific model fine-tuned by the company that wrote the benchmark.

02 - The numbers

What we measured.

HealthBench Professional contains 525 real clinical conversations spanning 21 specialties, from gastroenterology to neurology to oncology, graded by a detailed rubric written by physicians. Every system in the chart below was run on the same 525 cases and graded by the same OpenAI grader.

0.627: MDIA v1.0.53 score on HealthBench Professional (GPT-5.4 grader)

+3.7pp: vs ChatGPT for Clinicians (0.590), OpenAI's best published system

+14.6pp: vs GPT-5.4 used directly without an agent architecture (0.481)

+19.0pp: vs the physician-written baseline (0.437) on the same 525 cases

The full ranking

System	Score
MDIA v1.0.53 (TietAI - this work)	0.627
ChatGPT for Clinicians (OpenAI)	0.590
GPT-5.4 single-agent baseline	0.481
Claude Opus 4.7	0.470
GPT-5	0.462
GPT-5.2	0.459
Gemini 3.1 Pro	0.438
Physician-written baseline	0.437
Grok 4.20	0.361

All scores on the full HealthBench Professional benchmark (n = 525), graded by OpenAI's GPT-5.4-2026-03-05. Reference scores from OpenAI, 2026.
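The percentage-point margins quoted above follow directly from the ranking table. A minimal sketch of that arithmetic, with scores copied from the table (the function name `margin_pp` is our own, chosen for illustration):

```python
# Scores copied from the ranking table (HealthBench Professional, n = 525).
SCORES = {
    "MDIA v1.0.53": 0.627,
    "ChatGPT for Clinicians": 0.590,
    "GPT-5.4 single-agent baseline": 0.481,
    "Physician-written baseline": 0.437,
}


def margin_pp(system: str, reference: str, scores: dict = SCORES) -> float:
    """Gap between two rubric scores, expressed in percentage points."""
    return round((scores[system] - scores[reference]) * 100, 1)


for reference in list(SCORES)[1:]:
    print(f"MDIA vs {reference}: +{margin_pp('MDIA v1.0.53', reference)} pp")
```

Rubric scores lie on a 0 to 1 scale, so a gap of 0.037 reads as 3.7 percentage points.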

How performance evolved

Score progression across MDIA versions.

**Twenty-six versions of trial-and-error.** MDIA's score climbed from roughly 0.52 to 0.63 over six months, mostly through engineering work, not bigger models. The two largest jumps came from a multi-turn fix in v1.0.40, which corrected how follow-up questions were passed to the agent, and a length-control update in v1.0.53, which taught the synthesiser to write shorter, denser answers.

03 - How it works

Seven agents, one shared memory.

When a question arrives, say a clinician asking about a 38-week pregnant patient on ACE inhibitors, MDIA does not answer right away. It walks the question through a pipeline of small, specialised agents, each focused on one job.

MDIA architecture: intake, router, three specialty reasoners, generalist reasoner, output synthesiser, verifier.

**The MDIA graph.** A question flows from intake, which gathers evidence using 14 medical tools, through a router that picks the right specialist, to a synthesiser and a final safety verifier.

The seven roles

Step 1 - Intake

Reads the case and pulls evidence from 14 medical tools: PubMed, DailyMed, UMLS, ICD-10, drug-safety checks and more.

Step 2 - Router

Decides which specialist should handle this case: GI, ophthalmology, neurology, or the generalist.

Step 3 - Specialist reasoner

A specialty-tuned agent that knows the right scores, time windows and red flags for that domain.

Step 4 - Output

Turns the specialist's brief into a clear, length-controlled answer for the clinician.

Step 5 - Verifier

A final safety and formatting pass catches contradictions, missing warnings and over-long answers.
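The five steps above can be pictured as a chain of functions that each read and update one shared state object. The sketch below is a toy illustration under that assumption, not the Hydra Platform or MDIA's actual graph. Every name in it (`CaseState`, `SPECIALTY_KEYWORDS`, `run`) is hypothetical. A keyword lookup stands in for the router and stub strings stand in for tool calls and model output.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class CaseState:
    """Shared memory handed from agent to agent (hypothetical structure)."""
    question: str
    evidence: list = field(default_factory=list)
    specialty: str = "generalist"
    brief: str = ""
    answer: str = ""
    warnings: list = field(default_factory=list)


# Toy routing rules; a stand-in for whatever the real router does.
SPECIALTY_KEYWORDS = {
    "gi": ("diarrhea", "colitis", "liver"),
    "ophthalmology": ("eye", "retina", "vision"),
    "neurology": ("stroke", "seizure", "headache"),
}


def intake(state: CaseState) -> CaseState:
    # A real intake step would call medical tools; this appends a stub.
    state.evidence.append(f"[stub evidence for: {state.question}]")
    return state


def router(state: CaseState) -> CaseState:
    text = state.question.lower()
    for specialty, keywords in SPECIALTY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            state.specialty = specialty
            break  # otherwise the default "generalist" is kept
    return state


def reasoner(state: CaseState) -> CaseState:
    state.brief = f"{state.specialty} brief from {len(state.evidence)} evidence item(s)"
    return state


def output(state: CaseState) -> CaseState:
    state.answer = state.brief[:3000]  # cap taken from the 3,000-character target
    return state


def verifier(state: CaseState) -> CaseState:
    if not state.answer:
        state.warnings.append("empty answer")
    return state


PIPELINE: list = [intake, router, reasoner, output, verifier]


def run(question: str) -> CaseState:
    state = CaseState(question=question)
    for agent in PIPELINE:
        state = agent(state)
    return state
```

Calling `run("New seizure two days after a stroke")` routes to the neurology stub, while a question with no matching keyword falls through to the generalist.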

What actually moved the score

Listening to the whole conversation. 22% of cases in the benchmark have follow-up questions. The standard evaluation harness silently dropped that context. Fixing it gave us +6 points overnight without changing the agent at all.
A specialised drug-safety check. Before writing anything prescriptive, MDIA verifies the recommendation against patient context, such as no loperamide when there is fever, and no ACE inhibitors in pregnancy.
Specialty routing. Instead of one prompt to rule them all, three dedicated reasoners handle the specialties where a generalist was systematically weakest.
Reliability engineering. Five low-level fixes in our agent runtime dropped the rate of blank or failed responses from 3.8% to 0.2%.
Tighter answers. The grader penalises long responses. Teaching the synthesiser to write 2,000 to 3,000 character answers added another point.
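Two of the items above are easy to illustrate in a few lines: passing the whole conversation instead of only the last message, and checking a drug against patient context before recommending it. The sketch below is illustrative only. The `Turn` type, the function names and the rule table are assumptions, and the two rules are just the examples named in the text, not a clinical reference.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str      # "user" or "assistant"
    content: str


def last_turn_only(conversation: list) -> str:
    """The failure mode described above: earlier turns are silently dropped."""
    return conversation[-1].content


def full_context(conversation: list) -> str:
    """Pass every turn to the agent, in order, with role labels."""
    return "\n".join(f"{turn.role}: {turn.content}" for turn in conversation)


# Illustrative rules only, taken from the two examples in the text.
CONTRAINDICATIONS = {
    "loperamide": {"fever"},
    "ace inhibitor": {"pregnancy"},
}


def safety_flags(drug: str, patient_context: set) -> list:
    """Return one warning per contraindication found in the patient context."""
    conflicts = CONTRAINDICATIONS.get(drug.lower(), set()) & patient_context
    return [f"{drug}: contraindicated with {conflict}" for conflict in sorted(conflicts)]


conversation = [
    Turn("user", "My patient is 38 weeks pregnant."),
    Turn("assistant", "Noted."),
    Turn("user", "Can she stay on her ACE inhibitor?"),
]
print(full_context(conversation))
print(safety_flags("ACE inhibitor", {"pregnancy"}))
```

In this toy conversation the pregnancy is mentioned only in the first turn, so a harness that forwards just the final message gives the agent no way to see the contraindication.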

04 - What this means

Architecture matters more than model size, but graders are fragile.

The popular narrative on AI progress is "bigger model, better answers." Our results tell a different story for the clinical setting: most of the lift came from how the system was assembled, not from a larger or fine-tuned base model. A small team with a well-engineered agent graph can outperform a domain-specific model released by a major lab.

But there is an important caveat: the score depends a lot on who is doing the grading. When we re-graded the same MDIA responses with Google's Gemini 2.5 Pro instead of OpenAI's GPT-5.4, the score rose from 0.627 to 0.658. That is a 3.1-point swing on the same answers. Robust evaluation needs multiple independent graders, not a single judge, especially when the company writing the benchmark is also the company building one of the models being compared.

Headline benchmark scores are useful, but they are not clinical validation. A 3.7-point edge under one grader can disappear under another. We publish our full per-sample outputs so anyone can independently regrade them.
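One simple way to report grader sensitivity is to show the spread across graders alongside any single score. A minimal sketch, using only the two aggregate MDIA scores reported here (the helper `grader_summary` is a hypothetical name, not part of any published evaluation framework):

```python
from statistics import mean

# Aggregate MDIA scores on the same responses, as reported in the text.
REPORTED = {"GPT-5.4": 0.627, "Gemini 2.5 Pro": 0.658}


def grader_summary(scores_by_grader: dict) -> dict:
    """Mean score across graders and the max-min spread in percentage points."""
    values = list(scores_by_grader.values())
    return {
        "mean": mean(values),
        "spread_pp": round((max(values) - min(values)) * 100, 1),
    }


summary = grader_summary(REPORTED)
print(f"mean = {summary['mean']:.4f}, spread = {summary['spread_pp']} pp")
```

With more graders the same function applies unchanged. A spread comparable in size to the margin between two systems signals that the ranking may depend on the judge.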

05 - Questions

Frequently asked.

Does this mean MDIA is ready for clinical use?

No. Benchmark performance is a technical indicator, not regulatory approval. MDIA has not been validated in prospective clinical trials and should not be used for direct patient care without further evaluation, clinician oversight, and the relevant regulatory clearances.

Did you fine-tune a model on medical data?

No. MDIA uses off-the-shelf Google Gemini models. The performance comes from the agent architecture, the medical tools the agents can call, and engineering work on the underlying platform, not from training data.

How can the result be reproduced?

We have published the per-sample grader transcripts and evaluation framework in the TietAI Evals Public repository. The full graph definitions, prompts and engine fixes are available to other research teams on request.

Is the 3.7-point margin meaningful if it is within bootstrap noise?

Bootstrap resampling gives sigma approximately 0.023, so a 3.7-point gap is about 1.6 sigma: directionally consistent but not statistically decisive at p < 0.05. OpenAI does not publish confidence intervals for ChatGPT for Clinicians, which makes a formal significance test impossible. We treat the lead as meaningful but not as a knockout result; the more interesting story is the gap to the GPT-5.4 single-agent baseline (+14.6 pp).
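For readers who want to rerun this kind of check on the published per-sample outputs, the sketch below shows one common way to bootstrap the standard deviation of a mean score. It is an assumption about the general technique, not the exact procedure used for the paper, and the per-sample data in it is random placeholder data. Only the last line uses reported figures (0.627, 0.590 and sigma of about 0.023).

```python
import random
from statistics import stdev


def bootstrap_sigma(per_sample_scores: list, n_resamples: int = 1000, seed: int = 0) -> float:
    """Standard deviation of the mean score across bootstrap resamples."""
    rng = random.Random(seed)
    n = len(per_sample_scores)
    resampled_means = [
        sum(rng.choices(per_sample_scores, k=n)) / n  # sample n cases with replacement
        for _ in range(n_resamples)
    ]
    return stdev(resampled_means)


def gap_in_sigmas(score_a: float, score_b: float, sigma: float) -> float:
    """How many standard deviations separate two aggregate scores."""
    return (score_a - score_b) / sigma


# Placeholder data: 525 uniform random values, NOT the real per-sample scores.
_rng = random.Random(42)
placeholder_scores = [_rng.random() for _ in range(525)]
placeholder_sigma = bootstrap_sigma(placeholder_scores)

# Reported figures from the text: a gap of 0.037 against sigma of about 0.023.
z = gap_in_sigmas(0.627, 0.590, 0.023)
print(f"gap = {z:.2f} sigma")
```

The reported gap comes out near 1.6 sigma, below the roughly 1.96 sigma that a two-sided test at p < 0.05 would require, which matches the cautious reading above.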

What is next?

Four things: connecting MDIA to a curated guideline retrieval system (RAG) to plug the remaining knowledge gaps; testing Claude as the specialist reasoner; building an automated harness that prevents regressions during prompt iteration; and a cross-system regrade if OpenAI publishes per-sample outputs for ChatGPT for Clinicians.
