šŸ„ Clinical Reasoning Hub

Clinical Reasoning Labs for Medical Diagnostic Accuracy

Advancing clinical reasoning in compact language models through structured diagnostic methodology and evidence-based training


Mission

Clinical Reasoning Hub develops specialized medical AI models that demonstrate how structured training methodology can dramatically improve diagnostic reasoning in parameter-efficient architectures. Our research focuses on a fundamental question:

Can an 8B-parameter model, trained with the right clinical reasoning framework, approach the diagnostic accuracy of models 10–80× its size?

Our results suggest yes — with the right approach, compact models can achieve clinically meaningful performance.


Research Approach

Our training methodology is built on three core pillars:

Structured Clinical Reasoning Chains — Models are trained on multi-phase diagnostic reasoning that mirrors real clinical decision-making: gathering evidence, generating differentials, weighing likelihood ratios, arriving at diagnoses, and self-correcting through verification. This is not simple question-answer memorization — it is structured thinking.

Evidence-Grounded Training Curricula — Training data is curated from diverse medical knowledge sources including USMLE-style reasoning, knowledge-graph-grounded clinical pathways, peer-reviewed evidence synthesis, clinical case discussions, and examination content spanning multiple international medical education systems.

Base Model Quality as a Multiplier — We validate our methodology across different base architectures to confirm that improvements are driven by training methodology, not base model artifacts. The same pipeline applied to stronger base models produces compounding gains, confirming genuine capability transfer.
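
The multi-phase reasoning-chain format described in the first pillar might look like the following illustrative training example. The schema, field names, and clinical content here are assumptions for illustration, not the Hub's actual data format:

```python
# Illustrative sketch of a multi-phase clinical reasoning training example.
# The phase names mirror the pipeline described above; the schema itself
# is an assumption, not Clinical Reasoning Hub's actual data format.
example = {
    "case": "55-year-old with acute chest pain radiating to the left arm...",
    "reasoning_phases": [
        {"phase": "evidence_gathering",
         "content": "Key findings: substernal chest pain, diaphoresis, ECG changes."},
        {"phase": "differential_generation",
         "content": "Differentials: acute coronary syndrome, PE, aortic dissection."},
        {"phase": "likelihood_weighing",
         "content": "ACS most likely given risk factors and ECG findings."},
        {"phase": "diagnosis",
         "content": "Working diagnosis: NSTEMI."},
        {"phase": "verification",
         "content": "Re-check: troponin trend consistent; PE less likely without hypoxia."},
    ],
    "final_answer": "NSTEMI",
}

# Serialize the phases into one supervised target, one tagged section per phase,
# so the model learns to emit the full structured chain rather than a bare answer.
target = "\n".join(
    f"<{p['phase']}>\n{p['content']}" for p in example["reasoning_phases"]
)
```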


Models

| Model | Base Architecture | Params | Medical Avg | Key Strength |
|---|---|---|---|---|
| Diagnostic-Reasoning-Q3X1 | Qwen3-8B | 8B | 76.4% | Strongest overall — 89.7% Professional Medicine |
| Diagnostic-Medicine-R1 | DeepSeek-R1-Distill-Llama-8B | 8B | 64.5% | Validated methodology on a reasoning-distilled architecture |

Both models are fine-tuned using QLoRA (rank 128, alpha 256) on approximately 92K curated medical training examples, with a stabilization phase to prevent catastrophic forgetting.
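
As a rough sketch of what those hyperparameters mean: a LoRA update adds a low-rank product to a frozen base weight, scaled by alpha/r (here 256/128 = 2.0). A minimal NumPy illustration with toy dimensions (the matrix sizes below are arbitrary; in QLoRA the base weight would be a quantized, frozen transformer projection):

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 256, 256   # toy layer dimensions, not model values
r, alpha = 128, 256      # rank and alpha quoted in the card

W = rng.standard_normal((d_out, d_in))     # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection, zero-initialized

scaling = alpha / r                        # 256 / 128 = 2.0
W_adapted = W + scaling * (B @ A)          # LoRA-adapted weight: W' = W + (alpha/r) B A

# With B zero-initialized, the adapted weight starts identical to the base,
# so fine-tuning begins from the pretrained behavior.
assert np.allclose(W_adapted, W)
```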


Benchmark Results

All evaluations performed using lm-evaluation-harness v0.4.11 with zero-shot log-likelihood scoring on official test splits. No benchmark contamination — training data contains no benchmark test questions.
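
In outline, zero-shot log-likelihood scoring for multiple choice sums the model's token log-probabilities for each answer continuation and selects the highest-scoring option. A simplified sketch with hard-coded stand-in log-probs (the real harness queries the model once per option):

```python
# Simplified sketch of zero-shot log-likelihood multiple-choice scoring.
# The per-token log-probs below are dummy stand-ins for values a model
# would return for each candidate answer continuation.
def score_option(token_logprobs):
    """Log-likelihood of an answer continuation = sum of its token log-probs."""
    return sum(token_logprobs)

# Dummy per-token log-probs for four options of one MedQA-style item.
options = {
    "A": [-2.1, -0.9, -1.4],
    "B": [-0.6, -0.4, -0.8],  # highest total log-likelihood under the model
    "C": [-1.9, -2.2, -0.7],
    "D": [-3.0, -1.1, -1.6],
}

prediction = max(options, key=lambda k: score_option(options[k]))
print(prediction)  # -> B
```

Note the harness also reports length-normalized variants of this metric; the sketch shows only the raw log-likelihood selection rule.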

Diagnostic-Reasoning-Q3X1 (Flagship)

| Benchmark | Base Model | Q3X1 (Ours) | Δ Improvement |
|---|---|---|---|
| Professional Medicine | 58.9% | 89.7% | +30.8% |
| Medical Genetics | 58.6% | 88.0% | +29.4% |
| Clinical Knowledge | 61.0% | 86.4% | +25.4% |
| Anatomy | 57.6% | 79.3% | +21.7% |
| MedQA (USMLE-style) | 43.9% | 66.3% | +22.4% |
| PubMedQA | 48.3% | 66.6% | +18.3% |
| MedMCQA | 37.3% | 58.6% | +21.3% |
| Overall Average | 52.2% | 76.4% | +24.2% |
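
As a quick consistency check, the Overall Average figures match an unweighted mean of the seven per-benchmark scores:

```python
# Consistency check: the Overall Average rows are the unweighted mean
# of the seven per-benchmark scores listed for the flagship model.
base = [58.9, 58.6, 61.0, 57.6, 43.9, 48.3, 37.3]
q3x1 = [89.7, 88.0, 86.4, 79.3, 66.3, 66.6, 58.6]

base_avg = round(sum(base) / len(base), 1)  # 52.2
q3x1_avg = round(sum(q3x1) / len(q3x1), 1)  # 76.4
delta = round(q3x1_avg - base_avg, 1)       # +24.2
```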

Diagnostic-Medicine-R1

| Benchmark | Base Model | R1 (Ours) | Δ Improvement |
|---|---|---|---|
| Professional Medicine | 58.9% | 73.5% | +14.6% |
| Anatomy | 57.6% | 73.3% | +15.7% |
| Clinical Knowledge | 61.0% | 71.3% | +10.3% |
| PubMedQA | 48.3% | 68.0% | +19.7% |
| Medical Genetics | 58.6% | 62.0% | +3.4% |
| MedQA (USMLE-style) | 43.9% | 56.8% | +12.9% |
| MedMCQA | 37.3% | 46.4% | +9.1% |
| Overall Average | 52.2% | 64.5% | +12.3% |

Cross-Version Progression

Our iterative development shows consistent, compounding improvements:

| Version | Base Architecture | Avg Accuracy | Δ vs Base |
|---|---|---|---|
| Base (untuned) | DeepSeek-R1-Distill-Llama-8B | 52.2% | — |
| V7 | DeepSeek-R1-Distill-Llama-8B | 57.6% | +5.4% |
| V8 (Diagnostic-Medicine-R1) | DeepSeek-R1-Distill-Llama-8B | 64.5% | +12.3% |
| V9 (Diagnostic-Reasoning-Q3X1) | Qwen3-8B | 76.4% | +24.2% |

Context: 8B-Class Comparison

| Model | MedQA | Prof. Medicine | Clinical Knowledge |
|---|---|---|---|
| Diagnostic-Reasoning-Q3X1 (Ours) | 66.3% | 89.7% | 86.4% |
| MedReason-8B (published) | 61.7% | — | — |
| Qwen3-8B (base, untuned) | 43.9% | 58.9% | 61.0% |
| DeepSeek-R1-Distill-Llama-8B (base) | ~43.9% | ~58.9% | ~61.0% |

Our Q3X1 model's MedQA score of 66.3% exceeds published 8B medical models and approaches the performance of several 14B-class models.


Key Findings

Methodology transfers across architectures. Applied to both DeepSeek-R1-Distill-Llama-8B (V8) and Qwen3-8B (V9), the same training pipeline produces substantial gains, confirming the approach is not architecture-dependent.

Base model quality acts as a multiplier. Qwen3-8B's stronger pretraining foundation (36T tokens vs 15T) amplifies the effect of our clinical reasoning training, producing nearly double the improvement (+24.2% vs +12.3%).

Structured reasoning training disproportionately helps clinical domains. The largest gains are in Professional Medicine (+30.8%) and Medical Genetics (+29.4%) — domains where step-by-step clinical reasoning is most valuable.

Compact models can punch far above their weight. An 8B model scoring 89.7% on Professional Medicine demonstrates that parameter count is not the limiting factor for clinical reasoning — training methodology is.


Training Overview

| Detail | Value |
|---|---|
| Method | QLoRA (rank 128, alpha 256) with cosine decay |
| Hardware | Single NVIDIA H100 80GB |
| Training Data | ~92K curated medical reasoning examples |
| Training Phases | 3 epochs main + 1 epoch stabilization |
| Format | Multi-phase structured clinical reasoning chains |
| Q3X1 Training Time | ~16.3 hours total |
| R1 Training Time | ~10.2 hours total |
| Precision | BF16 |

The training data spans USMLE-style reasoning, knowledge-graph-grounded clinical pathways, evidence synthesis from biomedical literature, clinical textbook cases, and international medical examination content with real explanations. The stabilization phase consolidates gains and prevents catastrophic forgetting of base model capabilities.
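
The cosine decay noted in the training overview can be sketched as follows. The peak learning rate and step count below are illustrative assumptions; the card states only that cosine decay was used:

```python
import math

# Sketch of a cosine learning-rate decay over a training phase.
# peak_lr and total_steps are illustrative assumptions, not values
# taken from the training card.
def cosine_lr(step, total_steps, peak_lr=2e-4, min_lr=0.0):
    progress = step / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

total = 1000
assert cosine_lr(0, total) == 2e-4           # starts at the peak
assert abs(cosine_lr(total, total)) < 1e-12  # decays to ~0 by the end
```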


Intended Use

These models are released for research purposes in medical AI, clinical NLP, and healthcare reasoning evaluation. They are intended to advance the study of clinical reasoning in language models.

āš ļø These models are NOT intended for clinical decision-making, medical diagnosis, or patient care. They are research artifacts that demonstrate training methodology effectiveness. Any clinical application would require extensive additional validation, regulatory review, and institutional oversight.


Limitations

These models have been evaluated primarily on English-language multiple-choice medical benchmarks and are not validated on open-ended clinical scenarios, multi-turn dialogue, or real patient data. Performance on rare diseases and edge cases has not been characterized. The models may exhibit hallucination or confident incorrect reasoning typical of language models. They are not a substitute for qualified medical professionals.


License

The Qwen3-8B-based model (Diagnostic-Reasoning-Q3X1) is released under the Apache 2.0 License, subject to the base model's license terms. The DeepSeek-R1-Distill-Llama-8B-based model (Diagnostic-Medicine-R1) is subject to the Meta Llama 3.1 Community License.

Both models permit commercial use under their respective base model license terms. Users are responsible for ensuring compliance with all applicable license conditions.


Citation

@misc{clinical-reasoning-hub-2026,
  title={Structured Clinical Reasoning Training for Compact Medical Language Models},
  author={Clinical Reasoning Hub},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/Clinical-Reasoning-Hub}
}
