Research - Clinical coding search

Small curated data make effective clinical coders.

Clinical coding is a fine-grained retrieval problem: short clinical mentions must map to the right ICD-10/CIE-10 code despite subtle differences in severity, anatomy, temporality, and etiology. We show that compact task-specific retrievers, trained with LLM-generated supervision and clinically meaningful hard negatives, can outperform public embedding models and BM25 on CodiESP and DISTEMIST.

Published: April 10, 2026
Authors: David Rey-Blanco, Roberto Cruz
Reading time: ~8 min


David Rey-Blanco, TietAI

Roberto Cruz, TietAI

Two-stage ICD retrieval workflow with a bi-encoder producing top-k candidates and a cross-encoder reranking them.

Two-stage ICD retrieval. A multilingual bi-encoder retrieves candidate CIE-10 codes, then a Spanish-tuned cross-encoder reranks the shortlist to recover clinically relevant distinctions such as laterality, severity, etiology, and episode type.

01 - The question

Can compact task-specific retrievers outperform broad public embeddings?

Sentence embeddings work well for broad semantic search, but ICD coding is not broad semantic search. The model must map short clinical mentions to a controlled vocabulary where neighboring codes often differ by a qualifier: anatomical site, laterality, severity, encounter type, or etiology.

The paper asks whether high-quality synthetic supervision can close that gap for Spanish clinical coding. We use a frontier LLM to generate multilingual examples grounded in the ICD-10 hierarchy, fine-tune a Spanish biomedical bi-encoder, then train a cross-encoder reranker on listwise groups with clinically plausible hard negatives.

The headline finding: task-aligned supervision and clinically meaningful negatives matter more than embedding size or generic English biomedical pretraining for Spanish ICD-10/CIE-10-ES retrieval.

02 - Method

A bi-encoder for recall, a cross-encoder for exact-code precision.

The retriever uses a standard but carefully aligned two-stage architecture. Stage 1 encodes each query and CIE-10 description independently and ranks by cosine similarity. Stage 2 jointly reads each query-code pair and reorders the top candidates with a listwise objective.
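The two stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the vectors are hand-written stand-ins for bi-encoder embeddings, `toy_pair_score` is a hypothetical token-overlap scorer standing in for a trained cross-encoder, and the codes and descriptions are toy placeholders rather than CIE-10-ES entries.

```python
import math
from typing import Callable, Dict, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Vector, code_vecs: Dict[str, Vector], k: int) -> List[str]:
    """Stage 1: rank independently encoded codes by cosine similarity, keep top k."""
    ranked = sorted(code_vecs, key=lambda c: cosine(query_vec, code_vecs[c]), reverse=True)
    return ranked[:k]


def rerank(query: str, shortlist: List[str], descriptions: Dict[str, str],
           pair_score: Callable[[str, str], float]) -> List[str]:
    """Stage 2: score each (query, description) pair jointly and reorder the shortlist."""
    return sorted(shortlist, key=lambda c: pair_score(query, descriptions[c]), reverse=True)


def toy_pair_score(query: str, description: str) -> float:
    """Hypothetical stand-in for a cross-encoder: counts shared tokens."""
    return float(len(set(query.split()) & set(description.split())))


# Toy placeholders (assumptions for illustration only).
descriptions = {
    "CODE_A": "fractura de femur derecho",
    "CODE_B": "fractura de femur izquierdo",
    "CODE_C": "fractura de tibia derecha",
}
code_vecs = {"CODE_A": [0.9, 0.1], "CODE_B": [0.8, 0.3], "CODE_C": [0.2, 0.9]}
query_text = "fractura femur izquierdo"
query_vec = [1.0, 0.0]

shortlist = retrieve(query_vec, code_vecs, k=2)
final = rerank(query_text, shortlist, descriptions, toy_pair_score)
print(shortlist, final)
```

In this toy case the first stage puts the right-femur code ahead of the left-femur code, and the pairwise scorer corrects the order. That is the kind of laterality distinction the reranker is meant to recover.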

The training data was not treated as raw synthetic output. We curated the generated examples through rule-based validation, semantic deduplication, language consistency checks, and manual spot inspection before using them to train the bi-encoder and the cross-encoder reranker. That curation step is central to the paper: the goal is not simply to make more data, but to create examples and hard negatives that reflect the clinical distinctions encoded in ICD-10/CIE-10.
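A curation pass of this kind might look roughly like the sketch below. It covers only two of the steps named above and simplifies both: the rule-based check is a hypothetical code-shape regex plus a non-empty text test, and deduplication compares accent-stripped, lowercased text, which is a much cruder stand-in for semantic deduplication. The field names `text` and `code` and the function `curate` are assumptions for illustration.

```python
import re
import unicodedata
from typing import Dict, List

# Assumed code shape: letter, digit, alphanumeric, optional dot plus 1-4 alphanumerics.
CODE_SHAPE = re.compile(r"^[A-Z]\d[0-9A-Z](\.[0-9A-Z]{1,4})?$")


def normalize(text: str) -> str:
    """Lowercase, strip accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def curate(examples: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop malformed examples and near-verbatim duplicates, keeping first occurrences."""
    seen = set()
    kept = []
    for ex in examples:
        text = ex.get("text", "").strip()
        code = ex.get("code", "")
        if not text or not CODE_SHAPE.match(code):
            continue  # rule-based validation failed
        key = (normalize(text), code)
        if key in seen:
            continue  # duplicate after normalization
        seen.add(key)
        kept.append(ex)
    return kept


# Toy generated examples (illustrative only).
raw = [
    {"text": "Neumonía bacteriana", "code": "J15.9"},
    {"text": "neumonia  bacteriana", "code": "J15.9"},  # duplicate after normalization
    {"text": "", "code": "J15.9"},                      # empty mention
    {"text": "dolor", "code": "??"},                    # malformed code
]
curated = curate(raw)
print(len(curated))
```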

19,502 synthetic bi-encoder examples across 17 ICD-10 chapters.

Languages used for synthetic supervision: EN, ES, CA, IT, PT, and FR.

10,628 Spanish listwise cross-encoder groups, each with one positive and hard negatives.

Evaluation resolutions: full code, 3-character category, and chapter.
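To make the listwise groups concrete, the sketch below builds one group from a positive code plus sibling codes in the same 3-character category, then scores it with a softmax cross-entropy loss. Both choices are assumptions: the source does not specify how negatives are sampled beyond being clinically plausible neighbors, nor which listwise objective is used. The helper names and the code strings are illustrative only.

```python
import math
from typing import List


def category(code: str) -> str:
    """3-character category of a code, e.g. 'S72.1' -> 'S72'."""
    return code.replace(".", "")[:3]


def build_group(positive: str, vocabulary: List[str], n_neg: int) -> List[str]:
    """One positive followed by up to n_neg sibling codes from the same category."""
    siblings = [c for c in vocabulary if c != positive and category(c) == category(positive)]
    return [positive] + siblings[:n_neg]


def listwise_loss(scores: List[float], positive_index: int = 0) -> float:
    """Softmax cross-entropy over the group: -log softmax(scores)[positive_index]."""
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[positive_index]


# Illustrative codes, used only for their shape.
vocabulary = ["S72.0", "S72.1", "S72.2", "S82.0"]
group = build_group("S72.0", vocabulary, n_neg=2)

uninformed = listwise_loss([0.0, 0.0, 0.0])  # reranker cannot tell siblings apart
confident = listwise_loss([5.0, 0.0, 0.0])   # positive scored well above its siblings
print(group, round(uninformed, 4), round(confident, 4))
```

Drawing negatives from the same category forces the scorer to rely on the qualifier that separates siblings, because the shared disease family no longer helps.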

Retrieval strategies compared

Strategy	Role in the study
TietAI Cross-Encoder	Reranks the bi-encoder shortlist and optimizes exact-code ordering.
TietAI Bi-Encoder	Spanish biomedical dense retrieval backbone fine-tuned on synthetic supervision.
BM25 (Postgres FTS)	Lexical full-text baseline over CIE-10-ES descriptions.
ST MiniLM-L6-v2	General-purpose public sentence-transformer baseline.
ST BioBERT	English biomedical sentence-transformer baseline.
ST MPNet-v2	Larger 768-dimensional general sentence-transformer baseline.

03 - Results

F1 and MAP@10 put the ranking quality in focus.

On CodiESP v4, the TietAI cross-encoder reaches F1 = 0.709 and MAP@10 = 0.747 at exact-code resolution. On DISTEMIST, it reaches F1 = 0.776 and MAP@10 = 0.812. The updated paper treats MAP@10 as the ranking metric and F1 as the top-1 decision metric.

CodiESP exact-code F1 for the TietAI Cross-Encoder: 0.709

CodiESP exact-code MAP@10 for the TietAI Cross-Encoder: 0.747

DISTEMIST exact-code F1 for the TietAI Cross-Encoder: 0.776

DISTEMIST exact-code MAP@10 for the TietAI Cross-Encoder: 0.812

CodiESP v4 F1 and MAP@10

Model	F1 exact	F1 category	MAP@10 exact	MAP@10 category
TietAI Cross-Encoder	0.709	0.823	0.747	0.851
TietAI Bi-Encoder	0.359	0.617	0.461	0.694
BM25 (Postgres FTS)	0.239	0.376	0.322	0.471
ST MiniLM-L6-v2	0.225	0.371	0.287	0.426
ST BioBERT	0.193	0.314	0.252	0.373
ST MPNet-v2	0.161	0.317	0.226	0.376

CodiESP F1 and MAP@10 at exact-code and category levels.

CodiESP F1 and MAP@10. The TietAI cross-encoder leads every metric. MAP@10 is higher than F1, showing that many top-1 misses still contain the correct code in the top-10 shortlist.
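The gap between MAP@10 and the top-1 metric can be seen in a small worked example. The sketch assumes one common convention for average precision at k, normalizing by min(number of gold codes, k). The paper's exact normalization is not stated here, so treat the function as illustrative. With a single gold code per query, it reduces to 1/rank when the gold code is in the top 10 and 0 otherwise.

```python
from typing import List, Set, Tuple


def average_precision_at_k(ranked: List[str], gold: Set[str], k: int = 10) -> float:
    """AP@k, normalized by min(|gold|, k) (an assumed convention)."""
    hits = 0
    total = 0.0
    for rank, code in enumerate(ranked[:k], start=1):
        if code in gold:
            hits += 1
            total += hits / rank
    denom = min(len(gold), k)
    return total / denom if denom else 0.0


def map_at_k(runs: List[Tuple[List[str], Set[str]]], k: int = 10) -> float:
    """Mean of AP@k over all queries."""
    return sum(average_precision_at_k(r, g, k) for r, g in runs) / len(runs)


# Two toy queries: gold code at rank 1, and gold code at rank 4.
runs = [
    (["g1", "a", "b", "c"], {"g1"}),
    (["a", "b", "c", "g2"], {"g2"}),
]
map10 = map_at_k(runs, k=10)
top1_accuracy = sum(r[0] in g for r, g in runs) / len(runs)
print(map10, top1_accuracy)
```

The second query is a top-1 miss but still earns partial credit of 0.25 for placing the gold code at rank 4, so MAP@10 ends above top-1 accuracy.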

CodiESP exact-code top-1 precision, recall and F1

Model	Precision	Recall	F1
TietAI Cross-Encoder	0.709	0.709	0.709
TietAI Bi-Encoder	0.359	0.359	0.359
BM25 (Postgres FTS)	0.254	0.227	0.239
ST MiniLM-L6-v2	0.225	0.225	0.225
ST BioBERT	0.193	0.193	0.193
ST MPNet-v2	0.161	0.161	0.161

CodiESP top-1 precision, recall, and F1 at exact-code resolution.

CodiESP top-1 exact-code precision, recall and F1. Dense retrievers emit one prediction per query, so the three metrics coincide. BM25 diverges because it returns no candidate for about 11% of CodiESP queries.
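A short sketch shows why abstentions pull precision and recall apart. It assumes precision is computed over answered queries and recall over all queries, with one gold code per query; `top1_prf` and the toy predictions are hypothetical. The last lines derive the answer rate implied by the BM25 row of the table (recall divided by precision), which agrees with the roughly 11% of unanswered queries.

```python
from typing import List, Optional, Tuple


def top1_prf(predictions: List[Optional[str]], gold: List[str]) -> Tuple[float, float, float]:
    """Top-1 precision, recall, and F1 where None marks a query with no candidate."""
    answered = sum(p is not None for p in predictions)
    correct = sum(p == g for p, g in zip(predictions, gold))
    precision = correct / answered if answered else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = ["a", "b", "c", "d"]
dense = top1_prf(["a", "x", "c", "y"], gold)      # always answers
lexical = top1_prf(["a", None, "c", "y"], gold)   # abstains on one query

# Answer rate implied by the BM25 row: recall / precision.
implied_answer_rate = 0.227 / 0.254
print(dense, lexical, round(1 - implied_answer_rate, 3))
```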

CodiESP precision at k by retrieval model for exact-code and category matches.

CodiESP Precision@k. Precision decays with k because each query usually has one gold code, but the model ordering is stable at both exact-code and category resolutions.

CodiESP recall at k by retrieval model for exact-code and category matches.

CodiESP Recall@k. The cross-encoder reaches R@10 = 0.813 exact and 0.903 category, while the bi-encoder closes much of the gap by deeper ranks.
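The opposite trends of Precision@k and Recall@k follow directly from the definitions when a query has a single gold code. The sketch below uses the standard set-based definitions on a toy ranking; the function names and the ranking are illustrative.

```python
from typing import List, Set


def precision_at_k(ranked: List[str], gold: Set[str], k: int) -> float:
    """Fraction of the top-k candidates that are gold codes."""
    return len(set(ranked[:k]) & gold) / k


def recall_at_k(ranked: List[str], gold: Set[str], k: int) -> float:
    """Fraction of gold codes found in the top-k candidates."""
    return len(set(ranked[:k]) & gold) / len(gold)


# Toy ranking of 10 candidates with the single gold code at rank 3.
ranked = ["n1", "n2", "gold", "n3", "n4", "n5", "n6", "n7", "n8", "n9"]
gold = {"gold"}

for k in (1, 3, 10):
    print(k, precision_at_k(ranked, gold, k), recall_at_k(ranked, gold, k))
```

With one gold code, Precision@k can never exceed 1/k, so it falls as k grows even for a perfect ranker, while Recall@k can only stay flat or rise.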

CodiESP top-1 accuracy by hierarchy

CodiESP top-1 accuracy at exact-code, category, and chapter levels.

CodiESP hierarchical accuracy. The same top-1 decision improves from exact code to category and chapter, reflecting how often models find the right disease family even when the full code is wrong.
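The three resolutions can be illustrated by scoring one prediction at each level. This is a simplified sketch: the category is taken as the first three characters of the code, and the chapter lookup is a deliberately partial, letter-based table covering only three chapters where the first letter is enough. A real mapping needs code ranges, since some letters span two chapters. The function names and example codes are illustrative.

```python
from typing import Dict

# Partial, illustrative lookup: ICD-10 chapters IX, X, and XI by first letter.
CHAPTER_BY_LETTER = {"I": "IX", "J": "X", "K": "XI"}


def category(code: str) -> str:
    """3-character category, e.g. 'I21.0' -> 'I21'."""
    return code.replace(".", "")[:3]


def chapter(code: str) -> str:
    """Chapter for the toy lookup; raises KeyError for letters it does not cover."""
    return CHAPTER_BY_LETTER[code[0]]


def match_levels(predicted: str, gold: str) -> Dict[str, bool]:
    """Whether a top-1 prediction matches the gold code at each resolution."""
    return {
        "exact": predicted == gold,
        "category": category(predicted) == category(gold),
        "chapter": chapter(predicted) == chapter(gold),
    }


gold_code = "I21.0"
sibling = match_levels("I21.9", gold_code)         # wrong code, right category
same_chapter = match_levels("I50.9", gold_code)    # wrong category, right chapter
other_chapter = match_levels("J18.9", gold_code)   # wrong at every level
print(sibling, same_chapter, other_chapter)
```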

DISTEMIST F1 and MAP@10

Model	F1 exact	F1 category	MAP@10 exact	MAP@10 category
TietAI Cross-Encoder	0.776	0.818	0.812	0.846
TietAI Bi-Encoder	0.603	0.690	0.682	0.747
BM25 (Postgres FTS)	0.431	0.470	0.504	0.537

DISTEMIST precision at k by retrieval model for exact-code and category matches.

DISTEMIST Precision@k. Precision drops as k grows, but the task-specific dense models remain ahead of the BM25 lexical baseline at both exact-code and category levels.

DISTEMIST recall at k by retrieval model for exact-code and category matches.

DISTEMIST Recall@k. The bi-encoder closes much of the gap at deeper ranks, but the cross-encoder remains best at every k.

DISTEMIST top-1 accuracy at exact-code, category, and chapter levels.

DISTEMIST hierarchical accuracy. Accuracy rises from exact code to category and chapter, while the model ordering stays consistent: cross-encoder, bi-encoder, then BM25.

04 - Interpretation

The bottleneck is task alignment, not embedding scale.

The public-baseline ordering is the important negative result. ST BioBERT has biomedical pretraining, but it is English-centric. ST MPNet-v2 has larger embeddings, but it still trails ST MiniLM-L6-v2 on CodiESP at exact-code resolution (F1 0.161 vs 0.225, MAP@10 0.226 vs 0.287). Neither scale nor generic domain pretraining solves Spanish clinical-code retrieval by itself.

ICD retrieval requires models to separate sibling concepts that are semantically close but operationally different. The bi-encoder captures chapter-level topic information well, but exact-code ranking needs a reranker trained on hard negatives drawn from neighboring CIE-10 codes.

Task alignment matters. Synthetic examples must reflect the ICD hierarchy, not just broad biomedical relatedness.
Hard negatives matter. Neighboring codes teach the model to distinguish etiology, anatomy, severity, and episode type.
Reranking matters. The cross-encoder adds the qualifier-level discrimination required by CIE-10-ES guidelines.
Small data can be enough. A compact synthetic corpus can train a competitive retriever when examples are aligned with the target vocabulary structure.
Biomedical pretraining is not automatic transfer. English biomedical encoders still struggle when both query and target vocabulary are Spanish.

For production clinical coding search, the practical lesson is to train for the target vocabulary, target language, and target ranking problem. The useful unit is not a bigger embedding model; it is a retriever trained on the distinctions coders actually need to make.
