From Prompts to Probes
Original research · Monash MAI
Foundation models like FLAIR can already diagnose diabetic retinopathy and glaucoma from a single fundus photo, often without ever being trained on the specific dataset. This thesis tests how far that goes. Four models, three public benchmarks, two protocols (zero-shot and few-shot linear probing), all on a laptop. The answer to “which one is best?” is more interesting than I expected.
1. Why this matters
More than 43 million people are blind, and the majority live in places with too few ophthalmologists. Training a board-certified eye specialist takes well over a decade. Training an AI to flag the most common screen-able diseases — diabetic retinopathy and glaucoma — takes a lot less.
The hard part isn’t the model architecture. It’s the labels. Annotating a fundus image requires a specialist whose time is the scarce resource. Inter-grader agreement is moderate even among specialists. So the practical question is not “can a deep network learn this?” (it can). The practical question is: can a foundation model pre-trained on enough medical data already do this, with little or no task-specific supervision?
If the answer is yes, a lot of screening becomes possible in places where it currently isn’t.
2. What I tested
I evaluated four foundation models under two protocols.
Zero-shot. The model has never seen the test dataset. I describe each disease in words — using prompts validated by a clinical supervisor — and the model decides whether each image matches each description. The whole pipeline is a vision encoder, a text encoder, and cosine similarity.
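The zero-shot decision rule can be sketched in a few lines. This is a minimal illustration, assuming the image and prompt embeddings have already been produced by the encoders; the function names and the toy 3-dimensional vectors are made up for the example and are not the thesis code.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def zero_shot_predict(image_emb, prompt_embs):
    """Return (best_label, similarities) for one image embedding.

    prompt_embs maps each class label to the text embedding of its prompt.
    """
    img = l2_normalize(image_emb)
    sims = {}
    for label, text_emb in prompt_embs.items():
        txt = l2_normalize(text_emb)
        sims[label] = sum(a * b for a, b in zip(img, txt))
    best = max(sims, key=sims.get)
    return best, sims

# Toy embeddings, invented for illustration only.
prompt_embs = {
    "glaucoma": [0.9, 0.1, 0.0],
    "no glaucoma": [0.1, 0.9, 0.0],
}
image_emb = [2.0, 0.5, 0.1]

label, sims = zero_shot_predict(image_emb, prompt_embs)
print(label, {k: round(v, 3) for k, v in sims.items()})
```

For a ranking metric such as AUROC, the similarity to the positive-class prompt (or the difference between the two similarities) can serve as the score instead of the argmax label.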
Few-shot linear probing. The vision encoder is frozen. I add a single linear classifier on top and train it with 5 %, 10 %, or 20 % of the available labels. This is the cheapest possible adaptation: no fine-tuning, no gradient flowing through the backbone.
Both protocols use stratified 5-fold cross-validation with identical fold assignments across models, so every comparison is paired. RETFound, which is vision-only (no text encoder), participates only in the few-shot protocol.
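One way to get identical, stratified fold assignments and stratified label budgets is to derive both from a fixed seed. The sketch below is an assumption about how this could be done with the standard library; the helper names, the seed, and the toy label vector are hypothetical.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Assign each sample index to one of k folds, balancing classes across folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    fold_of = [None] * len(labels)
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            fold_of[idx] = pos % k  # deal class members round-robin into folds
    return fold_of

def stratified_subsample(train_idx, labels, fraction, seed=0):
    """Keep roughly `fraction` of each class in train_idx (at least one per class)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in train_idx:
        by_class[labels[idx]].append(idx)
    kept = []
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        n_keep = max(1, round(fraction * len(idxs)))
        kept.extend(idxs[:n_keep])
    return sorted(kept)

# Toy labels with 10 % positives, loosely mirroring a low-prevalence task.
labels = [0] * 90 + [1] * 10
fold_of = stratified_folds(labels, k=5, seed=42)

# Fold 0 is held out; the other four folds form the training pool.
train_idx = [i for i, f in enumerate(fold_of) if f != 0]
budgets = {
    frac: stratified_subsample(train_idx, labels, frac, seed=42)
    for frac in (0.05, 0.10, 0.20)
}
for frac, idxs in budgets.items():
    print(f"{frac:.0%} budget -> {len(idxs)} labeled images")
```

Because the fold assignment depends only on the labels and the seed, every model sees the same splits, which is what makes the comparisons paired.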
3. The four models
The four models span three pre-training philosophies: domain-specific contrastive (FLAIR), general biomedical contrastive (BiomedCLIP), web-scale contrastive (OpenCLIP), and self-supervised masked autoencoding (RETFound).
FLAIR
Vision-language contrastive pre-training on roughly 288,000 fundus images and clinical text pairs from 38 curated ophthalmic datasets. Built specifically for retinal imaging.
OpenCLIP
CLIP-style contrastive pre-training on about 2 billion image-text pairs from the open web. The largest encoder in the comparison and the only one pre-trained without any medical focus.
BiomedCLIP
Contrastive pre-training on PMC-15M — 15 million figure-caption pairs from PubMed Central biomedical papers. Broad medical knowledge but no retinal specialism.
RETFound
Self-supervised masked-autoencoder pre-training on 1.6 million retinal images. Rich visual representations but no text encoder, which is why it’s excluded from the zero-shot protocol.
All four were evaluated on an Apple M-series laptop using PyTorch 2.3.1 with MPS acceleration. There is no external GPU in this work, and there doesn’t need to be.
4. The benchmarks
Three public ophthalmic datasets, chosen to span binary, multi-class, and multi-disease classification with varying prevalence and imaging conditions.
| Dataset | Task | Images | Metric |
|---|---|---|---|
| MESSIDOR | Diabetic retinopathy, 4 grades (R0–R3) | 1,200 | Macro-F1 |
| REFUGE | Glaucoma, binary (~10 % positive) | 1,200 | AUROC |
| ODIR-200×3 | Normal / DR / Other, balanced 3-class | 600 | Macro-F1 |
Public benchmarks. MESSIDOR has moderate inter-grader agreement; REFUGE’s low prevalence makes precision the harder constraint; ODIR’s “Other” class tests breadth.
5. The prompts
Zero-shot classification rises and falls on prompt design. I used three templates of increasing clinical specificity, validated by my supervisor:
T1 — bare label. “A fundus photograph of diabetic retinopathy.”
T2 — clinical description. Adds the pathological signs visible in fundus photography — microaneurysms, hemorrhages, hard exudates — consistent with the diagnosis.
T3 — structured clinical template. Full template that follows ophthalmic reporting conventions, including lesion types, locations, and severity indicators.
Zero-shot performance is reported as the mean and standard deviation across T1–T3, which captures both average performance and the model’s sensitivity to prompt engineering. A clinically useful model should perform consistently across reasonable prompt formulations, not just on a single carefully tuned template.
6. Zero-shot results
Performance averaged across T1–T3 prompts. RETFound is excluded because it has no text encoder. Statistical significance was assessed via paired image-bootstrap (α = 0.05, Holm-adjusted).
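For readers unfamiliar with the Holm adjustment, here is a minimal sketch of the step-down procedure applied to a family of p-values. The p-values are invented for illustration, and the paired image-bootstrap that would produce them is not shown.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted

# Made-up raw p-values for three pairwise model comparisons.
raw = [0.010, 0.040, 0.030]
adj = holm_adjust(raw)
significant = [p <= 0.05 for p in adj]
print([round(p, 3) for p in adj], significant)
```

In this toy family only the first comparison stays significant at α = 0.05 after adjustment, even though all three raw p-values are below 0.05.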
| Model | MESSIDOR (Macro-F1) | REFUGE (AUROC) | ODIR-200×3 (Macro-F1) |
|---|---|---|---|
| FLAIR | 0.735 ± 0.049 | 0.921 ± 0.049 | 0.366 ± 0.016 |
| BiomedCLIP | 0.471 ± 0.042 | 0.649 ± 0.037 | 0.709 ± 0.032 |
| OpenCLIP | 0.353 ± 0.001 | 0.530 ± 0.046 | 0.399 ± 0.131 |
Best in column: FLAIR on MESSIDOR and REFUGE, BiomedCLIP on ODIR-200×3. Higher is better.
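FLAIR’s 0.921 AUROC on REFUGE is the headline result. The model has never been trained on the REFUGE dataset and is asked, in plain English, whether each image shows glaucoma. An AUROC of 0.921 means that, given one glaucoma image and one healthy image chosen at random, the model scores the glaucoma image higher about 92 % of the time. It measures ranking quality, not accuracy at a fixed threshold. That number isn’t a fluke of one prompt: it’s the average across all three templates, with comparable performance on each (standard deviation 0.049).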
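AUROC has a direct probabilistic reading: it is the fraction of (positive, negative) image pairs in which the positive image gets the higher score. The sketch below computes it that way on made-up scores and contrasts it with accuracy at a 0.5 threshold, which is a different quantity.

```python
def auroc(scores, labels):
    """AUROC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def accuracy_at(scores, labels, threshold=0.5):
    """Fraction of images classified correctly when flagging scores >= threshold."""
    hits = sum((s >= threshold) == (y == 1) for s, y in zip(scores, labels))
    return hits / len(labels)

# Invented scores: 3 glaucoma images (label 1) and 5 healthy images (label 0).
scores = [0.9, 0.8, 0.35, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0, 0, 0]

print(round(auroc(scores, labels), 4))  # 13 of 15 pairs ranked correctly
print(accuracy_at(scores, labels))      # 6 of 8 images correct at threshold 0.5
```

The quadratic pairwise loop is fine for illustration; a rank-based formula gives the same value more efficiently on large datasets.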
On MESSIDOR, FLAIR also leads. On ODIR-200×3 the ranking flips: BiomedCLIP wins, most likely because the “Other” class includes pathology that FLAIR’s retinal-specific pre-training doesn’t cover well, while BiomedCLIP’s broader PubMed-trained features generalise better. OpenCLIP’s ±0.001 standard deviation on MESSIDOR is telling: its predictions there are nearly invariant to the prompt, which suggests the prompts carry almost no usable signal for it, consistent with little or no retinal text-image alignment.
Pre-training domain alignment is the primary driver of zero-shot transfer quality.
7. Few-shot results
Linear probing on frozen encoders, all four models, mean across 5-fold stratified cross-validation, three label budgets (5 %, 10 %, 20 %).
MESSIDOR — diabetic retinopathy (Macro-F1)
| Model | 5 % labels | 10 % labels | 20 % labels |
|---|---|---|---|
| FLAIR | 0.647 | 0.675 | 0.700 |
| OpenCLIP | 0.611 | 0.624 | 0.648 |
| BiomedCLIP | 0.580 | 0.596 | 0.613 |
| RETFound | 0.543 | 0.579 | 0.619 |
REFUGE — glaucoma (AUROC)
| Model | 5 % labels | 10 % labels | 20 % labels |
|---|---|---|---|
| FLAIR | 0.718 | 0.843 | 0.870 |
| OpenCLIP | 0.807 | 0.853 | 0.891 |
| BiomedCLIP | 0.756 | 0.808 | 0.874 |
| RETFound | 0.664 | 0.811 | 0.836 |
ODIR-200×3 — multi-disease (Macro-F1)
| Model | 5 % labels | 10 % labels | 20 % labels |
|---|---|---|---|
| FLAIR | 0.843 | 0.873 | 0.900 |
| OpenCLIP | 0.793 | 0.871 | 0.878 |
| BiomedCLIP | 0.857 | 0.892 | 0.870 |
| RETFound | 0.650 | 0.744 | 0.820 |
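Best in column: FLAIR on MESSIDOR and OpenCLIP on REFUGE at every budget; on ODIR-200×3, BiomedCLIP at 5 % and 10 % and FLAIR at 20 %. FLAIR has the highest overall average rank across all nine cells, with OpenCLIP second, BiomedCLIP third, RETFound trailing.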
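The average-rank claim can be checked directly from the three tables. The sketch below copies the reported values, ranks the four models within each of the nine dataset-budget cells (1 = best), and averages. Tie handling is omitted because no cell contains a tie.

```python
# Few-shot results from the tables above: [5 %, 10 %, 20 %] per model.
few_shot = {
    "MESSIDOR": {
        "FLAIR": [0.647, 0.675, 0.700],
        "OpenCLIP": [0.611, 0.624, 0.648],
        "BiomedCLIP": [0.580, 0.596, 0.613],
        "RETFound": [0.543, 0.579, 0.619],
    },
    "REFUGE": {
        "FLAIR": [0.718, 0.843, 0.870],
        "OpenCLIP": [0.807, 0.853, 0.891],
        "BiomedCLIP": [0.756, 0.808, 0.874],
        "RETFound": [0.664, 0.811, 0.836],
    },
    "ODIR-200x3": {
        "FLAIR": [0.843, 0.873, 0.900],
        "OpenCLIP": [0.793, 0.871, 0.878],
        "BiomedCLIP": [0.857, 0.892, 0.870],
        "RETFound": [0.650, 0.744, 0.820],
    },
}

def average_ranks(tables):
    """Average each model's rank (1 = best) over every dataset-budget cell."""
    totals = {}
    cells = 0
    for per_model in tables.values():
        n_budgets = len(next(iter(per_model.values())))
        for b in range(n_budgets):
            ordered = sorted(per_model, key=lambda m: per_model[m][b], reverse=True)
            for rank, model in enumerate(ordered, start=1):
                totals[model] = totals.get(model, 0) + rank
            cells += 1
    return {model: total / cells for model, total in totals.items()}

ranks = average_ranks(few_shot)
for model in sorted(ranks, key=ranks.get):
    print(f"{model:<11} {ranks[model]:.2f}")
```

The resulting averages are about 1.78 for FLAIR, 1.89 for OpenCLIP, 2.56 for BiomedCLIP, and 3.78 for RETFound, so the gap between the top two models is narrow.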
The most interesting result is the REFUGE flip. OpenCLIP, which has the worst zero-shot performance on glaucoma, is the best linear-probe performer at every label budget. The interpretation is that OpenCLIP’s ViT-H/14 has very high-dimensional, generic-but-rich visual features — useful raw material for a small classifier to learn from, but without the medical alignment that lets a text prompt land cleanly.
FLAIR stays dominant on diabetic retinopathy at every budget. BiomedCLIP is the best choice on ODIR at 5 % and 10 %, but FLAIR overtakes it at 20 %. RETFound ranks last in seven of the nine cells despite being trained on 1.6 million retinal images. The most likely explanation is the pre-training objective: masked-autoencoder reconstruction produces spatially rich features, but contrastive training is better aligned with discriminative downstream tasks like classification.
Linear probing reshuffles the zero-shot ranking. The best model depends on the task and the label budget.
8. The clinical operating point
AUROC summarises discrimination at every threshold but a deployed screening tool runs at one. The clinically meaningful operating point for population-level glaucoma screening is sensitivity at a fixed specificity — how many true cases the model catches while keeping false positives below a tolerable level.
I fixed specificity at 90 %, which means accepting that roughly 10 % of healthy eyes get flagged for review. The question is then: of the actual glaucoma cases, how many does the model catch?
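A minimal sketch of how sensitivity at a fixed specificity can be computed from raw scores. The thresholding convention (flag an image when its score is strictly above the threshold) and the toy scores are assumptions made for the example, not details taken from the thesis.

```python
import math

def sensitivity_at_specificity(scores, labels, target_spec=0.90):
    """Return (sensitivity, achieved_specificity, threshold).

    An image is flagged when its score is strictly greater than the threshold.
    The threshold is the lowest negative score that still leaves at least
    target_spec of the negatives unflagged.
    """
    neg = sorted(s for s, y in zip(scores, labels) if y == 0)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    # Number of negatives that must sit at or below the threshold.
    n_needed = math.ceil(target_spec * len(neg) - 1e-9)
    threshold = neg[n_needed - 1]
    specificity = sum(s <= threshold for s in neg) / len(neg)
    sensitivity = sum(s > threshold for s in pos) / len(pos)
    return sensitivity, specificity, threshold

# Invented scores: 10 healthy eyes followed by 5 glaucoma cases.
neg_scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.70]
pos_scores = [0.30, 0.50, 0.60, 0.80, 0.90]
scores = neg_scores + pos_scores
labels = [0] * len(neg_scores) + [1] * len(pos_scores)

sens, spec, thr = sensitivity_at_specificity(scores, labels, target_spec=0.90)
print(f"threshold={thr}, specificity={spec:.2f}, sensitivity={sens:.2f}")
```

In practice the threshold would be chosen on held-out validation data and then applied unchanged to the test fold, so that the reported sensitivity is not optimistically biased.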
| Model | Sens @ Spec=90 % (5 % labels) | Sens @ Spec=90 % (20 % labels) |
|---|---|---|
| OpenCLIP | 0.436 | 0.611 |
| FLAIR | 0.289 | 0.526 |
| BiomedCLIP | 0.231 | 0.471 |
| RETFound | 0.200 | 0.414 |
Sensitivity at 90 % specificity on REFUGE. Higher is better.
OpenCLIP catches 61 % of glaucoma cases with 20 % labels at this operating point. That is below the WHO-recommended ≥ 80 % sensitivity threshold for population screening, so this is not a standalone diagnostic tool. But it is a meaningful triage capability that could reduce specialist workload by pre-filtering the most likely negative cases.
Temperature scaling improves calibration (lowering Expected Calibration Error and Area Under Calibration Error) without changing AUROC. Calibrated probability outputs are achievable post-hoc — useful for downstream clinical decision-support that depends on probability thresholds rather than raw scores.
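Temperature scaling divides every logit by one scalar T before the softmax, with T chosen to minimise negative log-likelihood on validation data. The sketch below fits T with a golden-section search over log T; the search method, the bounds, and the toy logits are assumptions made for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(row, temperature)[y]) for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)

def fit_temperature(logit_rows, labels, lo=0.05, hi=20.0, iters=60):
    """Golden-section search over log(T) for the T that minimises NLL."""
    a, b = math.log(lo), math.log(hi)
    inv_phi = (math.sqrt(5) - 1) / 2
    for _ in range(iters):
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if nll(logit_rows, labels, math.exp(c)) < nll(logit_rows, labels, math.exp(d)):
            b = d
        else:
            a = c
    return math.exp((a + b) / 2)

# Invented validation logits: confident predictions, two of them wrong.
val_logits = [[4.0, 0.0], [3.0, 0.0], [0.0, 5.0], [4.0, 0.0], [0.0, 3.0], [0.0, 4.0]]
val_labels = [0, 0, 1, 1, 1, 0]

T = fit_temperature(val_logits, val_labels)
print(f"T = {T:.2f}")
print(f"NLL before = {nll(val_logits, val_labels, 1.0):.3f}, after = {nll(val_logits, val_labels, T):.3f}")
```

Because dividing by a positive T is a monotone transformation, the ordering of scores, and therefore AUROC and the argmax prediction, is unchanged; only the confidence values move.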
9. What I learned
There is no universal best model. FLAIR wins on diabetic retinopathy. OpenCLIP wins on glaucoma when given even a few labels. BiomedCLIP wins on multi-disease classification at low budgets. The choice depends on what disease you’re screening for and how much labeled data you can afford to collect.
Domain alignment dominates zero-shot. Feature richness matters more under linear probing. FLAIR’s retinal-specific pre-training is decisive in the zero-shot regime, where everything depends on whether text and image embeddings already agree. Once a small classifier sits on top of the frozen encoder, how rich the encoder’s features are matters more, and on REFUGE OpenCLIP’s ViT-H/14, despite never being trained specifically on fundus images, gives the best probe at every budget.
Bigger isn’t always better. RETFound combines a large architecture (ViT-Large, second in size only to OpenCLIP’s ViT-H/14 in this comparison), the largest retinal-specific corpus (1.6 million images), and the longest pre-training compute budget. It still underperforms, most likely because its self-supervised objective (reconstructing masked patches) produces features that are good at the reconstruction task but less aligned with discrimination.
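5 % of labels already changes the game where zero-shot alignment is weak. For OpenCLIP and BiomedCLIP on every dataset, and for FLAIR on ODIR-200×3, the biggest jump is from zero-shot to a 5 % linear probe. The exception is FLAIR on MESSIDOR and REFUGE, where the zero-shot scores (0.735 Macro-F1 and 0.921 AUROC) are higher than any linear-probe result in the tables above. After 10 %, returns mostly diminish. The implication for screening programs is concrete: when the model is not already well aligned with the task, even a small annotation budget of a few dozen labeled images produces large gains over zero-shot.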
Calibration is fixable. Temperature scaling improves probability calibration without changing the model’s ranking ability. Clinical decision support that depends on probability thresholds doesn’t need re-training; it needs a single scalar parameter fit on a held-out validation split.
This all ran on a laptop. The whole experiment (four models, three benchmarks, three label budgets, five folds, three prompts, calibration, statistical testing) runs on a single Apple M-series laptop using MPS acceleration. There is no GPU cluster anywhere in this pipeline. Meaningful foundation-model evaluation is now accessible to individual researchers.
10. Methodology in brief
Stratified 5-fold cross-validation with identical fold assignments across all models and budgets. Within each training fold, stratified subsampling at 5 %, 10 %, and 20 % label budgets. The linear head is a logistic regression trained on frozen-encoder features. Paired Wilcoxon signed-rank tests across folds for model comparisons. BCa bootstrap confidence intervals (B = 2000) for point estimates. Temperature scaling optimised on the validation split with negative log-likelihood; calibration reported as 15-bin ECE and AUCE. Three classification metrics: Macro-F1 on the 4-class MESSIDOR and 3-class ODIR tasks, AUROC on the binary REFUGE task, and Sensitivity @ 90 % Specificity for the clinical operating point.
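As a concrete reference for the calibration metric, here is a minimal sketch of a 15-bin Expected Calibration Error. It assumes equal-width confidence bins and uses invented predictions; the exact binning convention used in the thesis is not stated here.

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """ECE: size-weighted average of |accuracy - mean confidence| per bin.

    confidences: predicted probability of the chosen class, in [0, 1].
    correct: 1 if the prediction was right, 0 otherwise.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 falls in the last bin.
        b = min(int(conf * n_bins), n_bins - 1)
        bins[b].append((conf, hit))
    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece

# Invented predictions: four at 95 % confidence (three right), two at 55 % (one right).
confidences = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
correct = [1, 1, 1, 0, 1, 0]

ece = expected_calibration_error(confidences, correct)
print(f"ECE = {ece:.3f}")
```

Here the high-confidence bin is overconfident by 0.20 and the other bin by 0.05, giving a weighted ECE of 0.15. A perfectly calibrated model would score 0.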
All datasets are publicly available de-identified benchmarks. No patient-identifiable data was used. The research followed Monash University ethics guidelines. Results are not intended for direct clinical use without further validation on the populations they would be deployed in.
11. If you’re building a screening tool
For diabetic retinopathy: use FLAIR. It wins zero-shot (0.735 Macro-F1, no labels at all) and stays on top through every few-shot budget. With no labeled data it is already useful as a triage layer.
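For glaucoma: if you are training a linear probe, use OpenCLIP. With even 5 % labeled data (about 60 images), OpenCLIP’s ViT-H/14 features beat FLAIR’s probe at every label budget, by 0.089 AUROC at 5 %, 0.010 at 10 %, and 0.021 at 20 %, and they also lead at the 90 % specificity operating point. The 0.921 AUROC zero-shot FLAIR number remains the most dramatic single result in this work, and it is higher than any linear-probe AUROC in the tables above, so compare any probe against FLAIR zero-shot before adopting it.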
For multi-disease screening: start with BiomedCLIP at small label budgets (up to 10 %). Switch to FLAIR once you can afford ~20 % of the data labelled.
Always: when labels are available, compare a few-shot probe against zero-shot rather than assuming either wins (probing gives large gains everywhere in this study except for FLAIR on MESSIDOR and REFUGE); apply temperature scaling for calibrated probabilities; do not assume RETFound’s domain-specific pre-training will beat contrastive models; and validate on the local population before deployment.
12. What comes next
Three directions stand out. First, parameter-efficient fine-tuning — LoRA or adapter-based — should sit between linear probing and full fine-tuning on the cost-benefit curve. Second, ensembling FLAIR and OpenCLIP, which have complementary strengths, would be straightforward and is likely to lift both DR and glaucoma performance simultaneously. Third, the prompt-sensitivity work is just a starting point: a clinician-validated library of prompts, rather than three templates, would let downstream users tune the model for their setting without changing weights.
The thesis goes deeper on all of this — literature review, full methodology, calibration plots, statistical tests, and the appendix on threats to validity and reproducibility.
“Foundation models can already screen retinal disease at clinically meaningful operating points. The hard work isn’t building a bigger model; it’s choosing the right one for the task and the budget.”
This page summarises a 78-page Master of AI thesis at Monash University, supervised by Dr Yasmeen George. The full thesis — literature review, methodology, results, discussion, references, and appendices — is below.
Read the full thesis
From Prompts to Probes: Zero-Shot and Few-Shot Transfer of Foundation Models for Retinal Fundus Classification. FIT5128 final thesis, Monash University Faculty of Information Technology · ~7,900 words · literature review, methodology, results, discussion, references, and appendices.