
Small curated data make effective clinical coders.

Clinical coding is a fine-grained retrieval problem: short clinical mentions must map to the right ICD-10/CIE-10 code despite subtle differences in severity, anatomy, temporality, and etiology. We show that compact task-specific retrievers, trained with LLM-generated supervision and clinically meaningful hard negatives, can outperform public embedding models and BM25 on CodiESP and DISTEMIST.

Published: April 10, 2026
Authors: David Rey-Blanco (TietAI), Roberto Cruz (TietAI)
Reading time: ~8 min
[Figure: Two-stage ICD retrieval workflow with a bi-encoder producing top-k candidates and a cross-encoder reranking them.]

Two-stage ICD retrieval. A multilingual bi-encoder retrieves candidate CIE-10 codes, then a Spanish-tuned cross-encoder reranks the shortlist to recover clinically relevant distinctions such as laterality, severity, etiology, and episode type.

Can compact task-specific retrievers outperform broad public embeddings?

Sentence embeddings work well for broad semantic search, but ICD coding is not broad semantic search. The model must map short clinical mentions to a controlled vocabulary where neighboring codes often differ by a qualifier: anatomical site, laterality, severity, encounter type, or etiology.

The paper asks whether high-quality synthetic supervision can close that gap for Spanish clinical coding. We use a frontier LLM to generate multilingual examples grounded in the ICD-10 hierarchy, fine-tune a Spanish biomedical bi-encoder, then train a cross-encoder reranker on listwise groups with clinically plausible hard negatives.

The headline finding: task-aligned supervision and clinically meaningful negatives matter more than embedding size or generic English biomedical pretraining for Spanish ICD-10/CIE-10-ES retrieval.

A bi-encoder for recall, a cross-encoder for exact-code precision.

The retriever uses a standard but carefully aligned two-stage architecture. Stage 1 encodes each query and CIE-10 description independently and ranks by cosine similarity. Stage 2 jointly reads each query-code pair and reorders the top candidates with a listwise objective.
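
The two stages described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the three-dimensional vectors stand in for bi-encoder embeddings, `token_overlap` is a hypothetical stand-in for the trained cross-encoder, and the Spanish descriptions are illustrative paraphrases rather than official CIE-10-ES text.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Vector, code_vecs: Dict[str, Vector], k: int) -> List[str]:
    """Stage 1: rank every code by cosine similarity and keep the top k."""
    ranked = sorted(code_vecs, key=lambda c: cosine(query_vec, code_vecs[c]), reverse=True)
    return ranked[:k]


def rerank(query: str, shortlist: List[str], descriptions: Dict[str, str],
           pair_score: Callable[[str, str], float]) -> List[str]:
    """Stage 2: score each (query, description) pair jointly and reorder."""
    return sorted(shortlist, key=lambda c: pair_score(query, descriptions[c]), reverse=True)


def token_overlap(query: str, description: str) -> float:
    """Toy pair scorer (assumption): shared tokens, in place of a cross-encoder."""
    return float(len(set(query.lower().split()) & set(description.lower().split())))


# Illustrative data only: hand-made vectors and paraphrased descriptions.
DESCRIPTIONS = {
    "E11.9": "diabetes mellitus tipo 2 sin complicaciones",
    "E10.9": "diabetes mellitus tipo 1 sin complicaciones",
    "I10": "hipertensión esencial (primaria)",
}
CODE_VECS = {
    "E11.9": [0.80, 0.20, 0.00],
    "E10.9": [0.85, 0.10, 0.05],
    "I10": [0.00, 0.10, 0.90],
}

shortlist = retrieve([0.9, 0.1, 0.0], CODE_VECS, k=2)
reranked = rerank("diabetes tipo 2", shortlist, DESCRIPTIONS, token_overlap)
```

In this toy setup the first stage places the sibling code E10.9 ahead of E11.9, and the pair scorer restores the correct order because it sees the qualifier "2" in the query and the description together. That is the kind of error the second stage is meant to repair.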

We did not use the raw synthetic output directly as training data. We curated the generated examples through rule-based validation, semantic deduplication, language consistency checks, and manual spot inspection before using them to train the bi-encoder and the cross-encoder reranker. That curation step is central to the paper: the goal is not simply to make more data, but to create examples and hard negatives that reflect the clinical distinctions encoded in ICD-10/CIE-10.
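
A curation pass of this kind might look like the sketch below. Everything specific in it is an assumption made for illustration: the length bounds, the code-shape regular expression, the 0.8 threshold, and the use of token-level Jaccard similarity in place of the embedding-based semantic deduplication the text refers to. The language check only validates the declared language tag; real language identification and the manual spot inspection are out of scope here.

```python
import re
from typing import Dict, List, Set

# The six languages named in the article; everything else here is assumed.
ALLOWED_LANGS = {"en", "es", "ca", "it", "pt", "fr"}
CODE_SHAPE = re.compile(r"^[A-Z]\d[0-9A-Z](\.[0-9A-Z]{1,4})?$")


def is_valid(example: Dict[str, str], known_codes: Set[str]) -> bool:
    """Rule-based checks: text length, declared language, code shape, known code."""
    text = example["text"].strip()
    return (
        3 <= len(text) <= 200
        and example["lang"] in ALLOWED_LANGS
        and CODE_SHAPE.match(example["code"]) is not None
        and example["code"] in known_codes
    )


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity, a crude proxy for semantic similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def curate(examples: List[Dict[str, str]], known_codes: Set[str],
           threshold: float = 0.8) -> List[Dict[str, str]]:
    """Drop invalid examples, then greedily drop near-duplicates per code."""
    kept: List[Dict[str, str]] = []
    for ex in examples:
        if not is_valid(ex, known_codes):
            continue
        is_duplicate = any(
            k["code"] == ex["code"] and jaccard(k["text"], ex["text"]) >= threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(ex)
    return kept
```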

  • 19,502 synthetic bi-encoder examples across 17 ICD-10 chapters.
  • 6 languages used for synthetic supervision: EN, ES, CA, IT, PT, and FR.
  • 10,628 Spanish listwise cross-encoder groups, each with one positive and hard negatives.
  • 3 evaluation resolutions: full code, 3-character category, and chapter.
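
Each listwise group pairs one positive code with several hard negatives. The article does not state which listwise loss is used, so the sketch below assumes one common choice: softmax cross-entropy over the reranker scores of the group, with the positive as the target. The function name and signature are hypothetical.

```python
import math
from typing import Sequence


def listwise_softmax_loss(scores: Sequence[float], positive_index: int) -> float:
    """Negative log-probability of the positive under a softmax over the group.

    `scores` holds one reranker score per candidate in the group (the positive
    plus its hard negatives). Subtracting the maximum keeps exp() stable.
    """
    top = max(scores)
    log_partition = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_partition - scores[positive_index]


# With four equally scored candidates the loss equals log(4); it shrinks as
# the positive pulls ahead of its hard negatives.
uniform_loss = listwise_softmax_loss([0.0, 0.0, 0.0, 0.0], positive_index=0)
confident_loss = listwise_softmax_loss([5.0, 0.0, 0.0, 0.0], positive_index=0)
```

Under this assumed objective, a hard negative that scores close to the positive contributes far more to the loss than an easy one, which is why the choice of negatives matters as much as the number of groups.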

Retrieval strategies compared

| Strategy | Role in the study |
| --- | --- |
| TietAI Cross-Encoder | Reranks the bi-encoder shortlist and optimizes exact-code ordering. |
| TietAI Bi-Encoder | Spanish biomedical dense retrieval backbone fine-tuned on synthetic supervision. |
| BM25 (Postgres FTS) | Lexical full-text baseline over CIE-10-ES descriptions. |
| ST MiniLM-L6-v2 | General-purpose public sentence-transformer baseline. |
| ST BioBERT | English biomedical sentence-transformer baseline. |
| ST MPNet-v2 | Larger 768-dimensional general sentence-transformer baseline. |

F1 and MAP@10 put the ranking quality in focus.

On CodiESP v4, the TietAI cross-encoder reaches F1 = 0.709 and MAP@10 = 0.747 at exact-code resolution. On DISTEMIST, it reaches F1 = 0.776 and MAP@10 = 0.812. The updated paper treats MAP@10 as the ranking metric and F1 as the top-1 decision metric.

  • 0.709: CodiESP exact-code F1 for the TietAI Cross-Encoder.
  • 0.747: CodiESP exact-code MAP@10 for the TietAI Cross-Encoder.
  • 0.776: DISTEMIST exact-code F1 for the TietAI Cross-Encoder.
  • 0.812: DISTEMIST exact-code MAP@10 for the TietAI Cross-Encoder.
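
For readers less familiar with the ranking metric, the sketch below shows one common definition of average precision at k and its mean over queries. Normalization conventions differ between toolkits, and the article does not say which one it uses; this version divides by min(number of gold codes, k), which is an assumption.

```python
from typing import List, Sequence, Set


def average_precision_at_k(ranked: Sequence[str], gold: Set[str], k: int = 10) -> float:
    """Average of precision values at each rank (within the top k) that hits a gold code."""
    hits = 0
    total = 0.0
    for rank, code in enumerate(ranked[:k], start=1):
        if code in gold:
            hits += 1
            total += hits / rank
    denom = min(len(gold), k)
    return total / denom if denom else 0.0


def mean_average_precision_at_k(rankings: List[Sequence[str]],
                                golds: List[Set[str]], k: int = 10) -> float:
    """MAP@k: the mean of per-query AP@k."""
    if not rankings:
        return 0.0
    return sum(average_precision_at_k(r, g, k) for r, g in zip(rankings, golds)) / len(rankings)
```

When a query has a single gold code, AP@10 reduces to the reciprocal of the rank at which that code appears (or zero if it is outside the top 10). A correct top-1 prediction scores 1.0 and a correct rank-2 prediction scores 0.5, so MAP@10 can never fall below top-1 accuracy in that setting.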

CodiESP v4 F1 and MAP@10

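| Model | F1 exact | F1 category | MAP@10 exact | MAP@10 category |
| --- | --- | --- | --- | --- |
| TietAI Cross-Encoder | 0.709 | 0.823 | 0.747 | 0.851 |
| TietAI Bi-Encoder | 0.359 | 0.617 | 0.461 | 0.694 |
| BM25 (Postgres FTS) | 0.239 | 0.376 | 0.322 | 0.471 |
| ST MiniLM-L6-v2 | 0.225 | 0.371 | 0.287 | 0.426 |
| ST BioBERT | 0.193 | 0.314 | 0.252 | 0.373 |
| ST MPNet-v2 | 0.161 | 0.317 | 0.226 | 0.376 |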
[Figure: CodiESP F1 and MAP@10 at exact-code and category levels.]
CodiESP F1 and MAP@10. The TietAI cross-encoder leads on every metric. MAP@10 exceeds F1 because, for many queries where the top-1 prediction is wrong, the correct code still appears within the top-10 shortlist.

CodiESP exact-code top-1 precision, recall and F1

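| Model | Precision | Recall | F1 |
| --- | --- | --- | --- |
| TietAI Cross-Encoder | 0.709 | 0.709 | 0.709 |
| TietAI Bi-Encoder | 0.359 | 0.359 | 0.359 |
| BM25 (Postgres FTS) | 0.254 | 0.227 | 0.239 |
| ST MiniLM-L6-v2 | 0.225 | 0.225 | 0.225 |
| ST BioBERT | 0.193 | 0.193 | 0.193 |
| ST MPNet-v2 | 0.161 | 0.161 | 0.161 |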
[Figure: CodiESP top-1 precision, recall, and F1 at exact-code resolution.]
CodiESP top-1 exact-code precision, recall and F1. Dense retrievers emit one prediction per query, so the three metrics coincide. BM25 diverges because it returns no candidate for about 11% of CodiESP queries.
[Figure: CodiESP precision at k by retrieval model for exact-code and category matches.]
CodiESP Precision@k. Precision decays with k because each query usually has one gold code, but the model ordering is stable at both exact-code and category resolutions.
[Figure: CodiESP recall at k by retrieval model for exact-code and category matches.]
CodiESP Recall@k. The cross-encoder reaches R@10 = 0.813 at exact-code resolution and 0.903 at category resolution, while the bi-encoder closes much of the gap at deeper ranks.
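
The opposite trends of the two curves follow directly from the definitions. The sketch below uses the standard formulas and assumes the ranked list contains no duplicate codes; the function name is hypothetical.

```python
from typing import Sequence, Set, Tuple


def precision_recall_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> Tuple[float, float]:
    """Precision@k divides hits by k; Recall@k divides hits by the number of gold codes."""
    hits = sum(1 for code in ranked[:k] if code in gold)
    precision = hits / k if k else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# One gold code, found at rank 3: recall jumps to 1.0 at k = 3 and stays there,
# while precision can be at most 1/k and keeps shrinking as k grows.
ranked = ["A", "B", "C", "D"]
curve = [precision_recall_at_k(ranked, {"C"}, k) for k in (1, 3, 4)]
```

With a single gold code per query, Precision@k is simply Recall@k divided by k, so a falling precision curve alongside a rising recall curve is expected rather than a sign of degraded ranking.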

CodiESP top-1 accuracy by hierarchy

[Figure: CodiESP top-1 accuracy at exact-code, category, and chapter levels.]
CodiESP hierarchical accuracy. The same top-1 decision improves from exact code to category and chapter, reflecting how often models find the right disease family even when the full code is wrong.
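
Scoring one prediction at the three resolutions can be sketched as follows. The category is taken as the first three characters of the code. The chapter lookup is a deliberately tiny toy table keyed on the first letter and covering only two letters; real chapters are defined by code ranges, and some letters span more than one chapter, so `toy_chapter` is an assumption for illustration only.

```python
from typing import Callable, Dict, Optional


def normalize(code: str) -> str:
    """Uppercase and drop the dot so 'e11.9' and 'E119' compare equal."""
    return code.replace(".", "").upper()


def category(code: str) -> str:
    """3-character category, for example 'E11' for 'E11.9'."""
    return normalize(code)[:3]


def toy_chapter(cat: str) -> Optional[str]:
    """Toy chapter lookup (assumption): two letters only, not a real mapping."""
    return {"E": "endocrine", "I": "circulatory"}.get(cat[0])


def match_levels(pred: str, gold: str,
                 chapter_of: Callable[[str], Optional[str]]) -> Dict[str, bool]:
    """Compare one prediction with the gold code at each resolution."""
    pred_chapter = chapter_of(category(pred))
    return {
        "exact": normalize(pred) == normalize(gold),
        "category": category(pred) == category(gold),
        "chapter": pred_chapter is not None and pred_chapter == chapter_of(category(gold)),
    }


same_category = match_levels("E11.65", "E11.9", toy_chapter)
same_chapter = match_levels("E10.9", "E11.9", toy_chapter)
different = match_levels("I10", "E11.9", toy_chapter)
```

Because an exact match implies a category match, and a category match implies a chapter match, accuracy can only stay the same or rise when moving from exact code to category to chapter.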

DISTEMIST F1 and MAP@10

| Model | F1 exact | F1 category | MAP@10 exact | MAP@10 category |
| --- | --- | --- | --- | --- |
| TietAI Cross-Encoder | 0.776 | 0.818 | 0.812 | 0.846 |
| TietAI Bi-Encoder | 0.603 | 0.690 | 0.682 | 0.747 |
| BM25 (Postgres FTS) | 0.431 | 0.470 | 0.504 | 0.537 |
[Figure: DISTEMIST F1 and MAP@10 at exact-code and category levels.]
DISTEMIST F1 and MAP@10. The relative ordering of cross-encoder, bi-encoder, and BM25 is unchanged on an independent Spanish clinical corpus.
[Figure: DISTEMIST precision at k by retrieval model for exact-code and category matches.]
DISTEMIST Precision@k. Precision drops as k grows, but the task-specific dense models remain ahead of the BM25 lexical baseline at both exact-code and category levels.
[Figure: DISTEMIST recall at k by retrieval model for exact-code and category matches.]
DISTEMIST Recall@k. The bi-encoder closes much of the gap at deeper ranks, but the cross-encoder remains best at every k.
[Figure: DISTEMIST top-1 accuracy at exact-code, category, and chapter levels.]
DISTEMIST hierarchical accuracy. Accuracy rises from exact code to category and chapter, while the model ordering stays consistent: cross-encoder, bi-encoder, then BM25.

The bottleneck is task alignment, not embedding scale.

The ordering of the public baselines is the important negative result. ST BioBERT has biomedical pretraining, but it is English-centric. ST MPNet-v2 has larger 768-dimensional embeddings, but it still trails ST MiniLM-L6-v2 on CodiESP (exact-code F1 of 0.161 versus 0.225). Neither scale nor generic domain pretraining solves Spanish clinical-code retrieval by itself.

ICD retrieval requires models to separate sibling concepts that are semantically close but operationally different. The bi-encoder captures chapter-level topic information well, but exact-code ranking needs a reranker trained on hard negatives drawn from neighboring CIE-10 codes.
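
One plausible way to draw hard negatives from neighboring codes is sketched below. The article does not describe the selection procedure, so the rule used here is an assumption: prefer siblings that share the positive's 3-character category, then fall back to other codes that share its first letter. The function name and the group layout are hypothetical.

```python
import random
from typing import Dict, Iterable, List, Union


def build_listwise_group(query: str, positive: str, all_codes: Iterable[str],
                         n_negatives: int, rng: random.Random) -> Dict[str, Union[str, List[str]]]:
    """Build one group: a query, its positive code, and nearby hard negatives."""
    prefix = positive[:3]
    others = sorted(c for c in all_codes if c != positive)  # sorted for reproducibility
    siblings = [c for c in others if c[:3] == prefix]
    neighbors = [c for c in others if c[:3] != prefix and c[0] == positive[0]]
    rng.shuffle(siblings)
    rng.shuffle(neighbors)
    # Same-category siblings come first because they differ only by a qualifier.
    negatives = (siblings + neighbors)[:n_negatives]
    return {"query": query, "positive": positive, "negatives": negatives}


codes = {"E11.9", "E11.65", "E11.21", "E10.9", "I10"}
group = build_listwise_group("diabetes tipo 2", "E11.9", codes,
                             n_negatives=3, rng=random.Random(0))
```

In this toy vocabulary the two E11 siblings are always selected first, E10.9 fills the remaining slot, and the unrelated I10 is never used. Negatives of that kind force the reranker to attend to the qualifier rather than the general topic.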

  • Task alignment matters. Synthetic examples must reflect the ICD hierarchy, not just broad biomedical relatedness.
  • Hard negatives matter. Neighboring codes teach the model to distinguish etiology, anatomy, severity, and episode type.
  • Reranking matters. The cross-encoder adds the qualifier-level discrimination required by CIE-10-ES guidelines.
  • Small data can be enough. A compact synthetic corpus can train a competitive retriever when examples are aligned with the target vocabulary structure.
  • Biomedical pretraining is not automatic transfer. English biomedical encoders still struggle when both query and target vocabulary are Spanish.

For production clinical coding search, the practical lesson is to train for the target vocabulary, target language, and target ranking problem. The useful unit is not a bigger embedding model; it is a retriever trained on the distinctions coders actually need to make.