Master’s thesis  ·  Monash University  ·  2025
FLAIR · OpenCLIP · BiomedCLIP · RETFound · Vision-Language · Foundation Models · Computer Vision · Ophthalmology · Zero-Shot · Linear Probing · PyTorch

From Prompts to Probes.
Original research · Monash MAI

Zero-shot and few-shot transfer of foundation models for retinal fundus disease classification — four models, three benchmarks, no GPU cluster.
Pugalenthi Magendran
Master of AI, Monash  ·  Supervised by Dr Yasmeen George  ·  2025
12 min summary · 78-page thesis
4 models compared · 3 public benchmarks · 0.921 best AUROC (zero-shot) · 5-fold cross-validation

Foundation models like FLAIR can already flag diabetic retinopathy and glaucoma from a single fundus photo, often without ever being trained on the target dataset. This thesis tests how far that goes: four models, three public benchmarks, two protocols (zero-shot and few-shot linear probing), all on a laptop. The answer to “which one is best?” is more interesting than I expected.

1. Why this matters

More than 43 million people are blind, and the majority live in places with too few ophthalmologists. Training a board-certified eye specialist takes well over a decade. Training an AI to flag the most common screen-able diseases — diabetic retinopathy and glaucoma — takes a lot less.

The hard part isn’t the model architecture. It’s the labels. Annotating a fundus image requires a specialist whose time is the scarce resource. Inter-grader agreement is moderate even among specialists. So the practical question is not “can a deep network learn this?” (it can). The practical question is: can a foundation model pre-trained on enough medical data already do this, with little or no task-specific supervision?

If the answer is yes, a lot of screening becomes possible in places where it currently isn’t.

2. What I tested

I evaluated four foundation models under two protocols.

Zero-shot. The model has never seen the test dataset. I describe each disease in words — using prompts validated by a clinical supervisor — and the model decides whether each image matches each description. The whole pipeline is a vision encoder, a text encoder, and cosine similarity.
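The scoring step can be sketched in a few lines, assuming the two encoders have already produced embeddings. The vectors below are made-up toy values and `zero_shot_predict` is a hypothetical helper name, not code from the thesis.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_predict(image_emb, prompt_embs):
    """Return (best_label, scores), where scores maps label -> cosine similarity."""
    scores = {label: cosine(image_emb, emb) for label, emb in prompt_embs.items()}
    return max(scores, key=scores.get), scores

# Toy 3-D embeddings standing in for text-encoder outputs (illustrative only).
prompt_embs = {
    "glaucoma": [0.9, 0.1, 0.2],
    "healthy": [0.1, 0.8, 0.3],
}
image_emb = [0.8, 0.2, 0.1]  # stand-in for a vision-encoder output

label, scores = zero_shot_predict(image_emb, prompt_embs)
print(label, scores)
```

The predicted class is simply the prompt whose embedding points most nearly in the same direction as the image embedding. No parameter is trained at any point.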

Few-shot linear probing. The vision encoder is frozen. I add a single linear classifier on top and train it with 5 %, 10 %, or 20 % of the available labels. This is the cheapest possible adaptation: no fine-tuning, no gradient flowing through the backbone.
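As a rough illustration of what a linear probe amounts to, here is a minimal binary logistic regression trained on fixed feature vectors. The features are synthetic stand-ins for frozen-encoder outputs, and the plain batch gradient descent, learning rate, and epoch count are assumptions chosen for the sketch, not the thesis settings.

```python
import math
import random

def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit binary logistic regression (weights + bias) by batch gradient descent."""
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y  # gradient of the log-loss with respect to the logit
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    """Predict class 1 when the logit is positive."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Synthetic "frozen-encoder" features: two Gaussian blobs (not real embeddings).
rng = random.Random(0)
features = ([[rng.gauss(-1, 0.5), rng.gauss(-1, 0.5)] for _ in range(50)]
            + [[rng.gauss(1, 0.5), rng.gauss(1, 0.5)] for _ in range(50)])
labels = [0] * 50 + [1] * 50

w, b = train_linear_probe(features, labels)
accuracy = sum(predict(w, b, x) == y for x, y in zip(features, labels)) / len(labels)
print(f"train accuracy: {accuracy:.2f}")
```

Only `w` and `b` are updated. The encoder that produced the features never receives a gradient, which is what keeps this adaptation cheap.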

Both protocols use stratified 5-fold cross-validation with identical fold assignments across models, so every comparison is paired. RETFound, which is vision-only (no text encoder), participates only in the few-shot protocol.
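One way to get identical, class-balanced folds for every model is to compute the assignment once from the labels with a fixed seed and reuse it. The sketch below assumes a simple round-robin deal within each class. The seed and the helper name `stratified_folds` are illustrative choices.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=5, seed=42):
    """Assign each sample index to one of k folds, balancing classes across folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    fold_of = [None] * len(labels)
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            fold_of[idx] = pos % k  # deal indices round-robin within each class
    return fold_of

# Toy labels with 10 % positives, loosely mirroring REFUGE's prevalence.
labels = [1] * 120 + [0] * 1080
fold_of = stratified_folds(labels)

fold_sizes = [fold_of.count(f) for f in range(5)]
positives_per_fold = [
    sum(1 for i, fo in enumerate(fold_of) if fo == f and labels[i] == 1)
    for f in range(5)
]
print(fold_sizes)          # [240, 240, 240, 240, 240]
print(positives_per_fold)  # [24, 24, 24, 24, 24]
```

Because the assignment depends only on the labels and the seed, every model is scored on exactly the same test images in each fold, which is what makes the comparisons paired.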

3. The four models

The four models span four pre-training philosophies: domain-specific contrastive (FLAIR), general biomedical contrastive (BiomedCLIP), web-scale contrastive (OpenCLIP), and self-supervised masked autoencoding (RETFound).

FLAIR

Retinal specialist
ResNet-50 + BioClinicalBERT

Vision-language contrastive pre-training on roughly 288,000 fundus images and clinical text pairs from 38 curated ophthalmic datasets. Built specifically for retinal imaging.

OpenCLIP

Web-scale generalist
ViT-H/14, LAION-2B

CLIP-style contrastive pre-training on about 2 billion image-text pairs from the open web. The largest encoder in the comparison, and the only one pre-trained without any medical focus.

BiomedCLIP

Medical generalist
ViT-B/16 + PubMedBERT

Contrastive pre-training on PMC-15M — 15 million figure-caption pairs from PubMed Central biomedical papers. Broad medical knowledge but no retinal specialism.

RETFound

Self-supervised eye model
ViT-Large, MAE

Self-supervised masked-autoencoder pre-training on 1.6 million retinal images. Rich visual representations but no text encoder, which is why it’s excluded from the zero-shot protocol.

All four were evaluated on an Apple M-series laptop using PyTorch 2.3.1 with MPS acceleration. There is no external GPU in this work, and there doesn’t need to be.

4. The benchmarks

Three public ophthalmic datasets, chosen to span binary, multi-class, and multi-disease classification with varying prevalence and imaging conditions.

Dataset | Task | Images | Metric
MESSIDOR | Diabetic retinopathy, 4 grades (R0–R3) | 1,200 | Macro-F1
REFUGE | Glaucoma, binary (~10 % positive) | 1,200 | AUROC
ODIR-200×3 | Normal / DR / Other, balanced 3-class | 600 | Macro-F1

Public benchmarks. MESSIDOR has moderate inter-grader agreement; REFUGE’s low prevalence makes precision the harder constraint; ODIR’s “Other” class tests breadth.
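Macro-F1, the metric for the two multi-class tasks, is the unweighted mean of per-class F1 scores, so a rare class counts as much as a common one. A minimal sketch on made-up predictions (the labels below are illustrative, not thesis data):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

y_true = ["normal", "normal", "dr", "dr", "other", "other"]
y_pred = ["normal", "dr", "dr", "dr", "other", "normal"]
score = macro_f1(y_true, y_pred)
print(round(score, 3))  # 0.656
```

Here the per-class F1 values are 0.8 (dr), 0.5 (normal), and 0.667 (other), and the macro average gives each of them equal weight.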

5. The prompts

Zero-shot classification stands or falls on prompt design. I used three templates of increasing clinical specificity, validated by my supervisor:

T1 — bare label. “A fundus photograph of diabetic retinopathy.”

T2 — clinical description. Adds the pathological signs visible in fundus photography — microaneurysms, hemorrhages, hard exudates — consistent with the diagnosis.

T3 — structured clinical template. Full template that follows ophthalmic reporting conventions, including lesion types, locations, and severity indicators.

Zero-shot performance is reported as the mean and standard deviation across T1–T3, which captures both average performance and the model’s sensitivity to prompt engineering. A clinically useful model should perform consistently across reasonable prompt formulations, not just on a single carefully tuned template.
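The reported figure is just a mean and a spread over the three templates. A small sketch with hypothetical per-template scores follows. Whether the thesis uses the sample or the population standard deviation is not stated here, so the choice of `stdev` below is an assumption.

```python
from statistics import mean, stdev

# Hypothetical per-template scores for one model on one dataset (not thesis values).
scores = {"T1": 0.70, "T2": 0.74, "T3": 0.78}

avg = mean(scores.values())
spread = stdev(scores.values())  # sample std (n - 1); pstdev gives the population version
print(f"{avg:.3f} +/- {spread:.3f}")
```

A small spread means the model gives similar answers however the disease is described. A large one means the result depends heavily on prompt wording.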

6. Zero-shot results

Performance averaged across T1–T3 prompts. RETFound is excluded because it has no text encoder. Statistical significance was assessed via paired image-bootstrap (α = 0.05, Holm-adjusted).
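A paired image-bootstrap with Holm adjustment can be sketched as follows. This is one common convention, not necessarily the exact procedure in the thesis: for brevity it compares per-image correctness (accuracy) rather than Macro-F1 or AUROC, and it uses a percentile-style two-sided p-value. All data are synthetic.

```python
import random

def paired_bootstrap_p(correct_a, correct_b, n_boot=2000, seed=0):
    """Two-sided p-value for the accuracy difference between two models.
    The same resampled image indices are used for both models (the pairing)."""
    rng = random.Random(seed)
    n = len(correct_a)
    at_or_below = at_or_above = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(correct_a[i] - correct_b[i] for i in idx)
        at_or_below += diff <= 0
        at_or_above += diff >= 0
    return min(1.0, 2 * min(at_or_below, at_or_above) / n_boot)

def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted

# Synthetic per-image correctness flags (1 = correct) for three hypothetical models.
model_a = [1] * 90 + [0] * 10
model_b = [1] * 60 + [0] * 40
model_c = [1] * 88 + [0] * 12

raw = [
    paired_bootstrap_p(model_a, model_b),
    paired_bootstrap_p(model_a, model_c),
    paired_bootstrap_p(model_b, model_c),
]
adjusted = holm_adjust(raw)
significant = [p < 0.05 for p in adjusted]
print(adjusted, significant)
```

Holm multiplies the smallest p-value by the number of comparisons, the next by one fewer, and so on. This controls the family-wise error rate while being less conservative than a flat Bonferroni correction.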

Model | MESSIDOR (Macro-F1) | REFUGE (AUROC) | ODIR-200×3 (Macro-F1)
FLAIR | 0.735 ± 0.049 | 0.921 ± 0.049 | 0.366 ± 0.016
BiomedCLIP | 0.471 ± 0.042 | 0.649 ± 0.037 | 0.709 ± 0.032
OpenCLIP | 0.353 ± 0.001 | 0.530 ± 0.046 | 0.399 ± 0.131

Best in column: FLAIR on MESSIDOR and REFUGE, BiomedCLIP on ODIR-200×3. Higher is better.

FLAIR’s 0.921 AUROC on REFUGE is the headline result. The model has never been trained on the REFUGE dataset and is asked, in plain English, whether each image shows glaucoma. An AUROC of 0.921 means that, given one glaucoma image and one healthy image picked at random, the model scores the glaucoma image higher about 92 % of the time. It measures ranking quality, not accuracy at a fixed threshold. That number isn’t a fluke of one prompt: it’s the average across all three templates, with a standard deviation of 0.049.
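AUROC has a direct pairwise reading: it is the fraction of (positive, negative) image pairs in which the positive image receives the higher score. A minimal sketch on toy scores (illustrative values only):

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy scores: 2 glaucoma images (label 1) and 4 healthy ones (label 0).
scores = [0.9, 0.4, 0.8, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 0, 0, 0]
print(auroc(scores, labels))  # 0.875
```

Seven of the eight positive-negative pairs are ordered correctly, giving 0.875. Nothing in the calculation depends on choosing a decision threshold.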

On MESSIDOR, FLAIR similarly leads. On ODIR-200×3 the ranking flips: BiomedCLIP wins, most likely because the “Other” class includes pathology that FLAIR’s retinal-specific pre-training doesn’t cover well, while BiomedCLIP’s broader PubMed-trained features generalise better. OpenCLIP’s ±0.001 standard deviation on MESSIDOR is telling: its output barely changes with the prompt, which, together with a Macro-F1 of only 0.353, suggests it has learned no meaningful retinal alignment between text and image.

Pre-training domain alignment is the primary driver of zero-shot transfer quality.

7. Few-shot results

Linear probing on frozen encoders, all four models, mean across 5-fold stratified cross-validation, three label budgets (5 %, 10 %, 20 %).

MESSIDOR — diabetic retinopathy (Macro-F1)

Model | 5 % labels | 10 % labels | 20 % labels
FLAIR | 0.647 | 0.675 | 0.700
OpenCLIP | 0.611 | 0.624 | 0.648
BiomedCLIP | 0.580 | 0.596 | 0.613
RETFound | 0.543 | 0.579 | 0.619

REFUGE — glaucoma (AUROC)

Model | 5 % labels | 10 % labels | 20 % labels
FLAIR | 0.718 | 0.843 | 0.870
OpenCLIP | 0.807 | 0.853 | 0.891
BiomedCLIP | 0.756 | 0.808 | 0.874
RETFound | 0.664 | 0.811 | 0.836

ODIR-200×3 — multi-disease (Macro-F1)

Model | 5 % labels | 10 % labels | 20 % labels
FLAIR | 0.843 | 0.873 | 0.900
OpenCLIP | 0.793 | 0.871 | 0.878
BiomedCLIP | 0.857 | 0.892 | 0.870
RETFound | 0.650 | 0.744 | 0.820

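Best in column: FLAIR on MESSIDOR at every budget; OpenCLIP on REFUGE at every budget; on ODIR-200×3, BiomedCLIP at 5 % and 10 % and FLAIR at 20 %. FLAIR has the highest overall average rank across all nine cells (1.78), with OpenCLIP second (1.89), BiomedCLIP third (2.56), and RETFound trailing (3.78).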
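The average-rank ordering can be checked directly from the three few-shot tables. The sketch below copies those numbers, ranks the four models within each of the nine dataset-budget cells (1 = best), and averages. The helper name `average_ranks` is illustrative.

```python
# Few-shot scores copied from the tables, in budget order (5 %, 10 %, 20 %).
results = {
    "MESSIDOR": {
        "FLAIR": [0.647, 0.675, 0.700], "OpenCLIP": [0.611, 0.624, 0.648],
        "BiomedCLIP": [0.580, 0.596, 0.613], "RETFound": [0.543, 0.579, 0.619],
    },
    "REFUGE": {
        "FLAIR": [0.718, 0.843, 0.870], "OpenCLIP": [0.807, 0.853, 0.891],
        "BiomedCLIP": [0.756, 0.808, 0.874], "RETFound": [0.664, 0.811, 0.836],
    },
    "ODIR-200x3": {
        "FLAIR": [0.843, 0.873, 0.900], "OpenCLIP": [0.793, 0.871, 0.878],
        "BiomedCLIP": [0.857, 0.892, 0.870], "RETFound": [0.650, 0.744, 0.820],
    },
}

def average_ranks(results):
    """Mean rank per model over every (dataset, budget) cell; 1 is best."""
    totals, cells = {}, 0
    for per_model in results.values():
        n_budgets = len(next(iter(per_model.values())))
        for b in range(n_budgets):
            ordered = sorted(per_model, key=lambda m: per_model[m][b], reverse=True)
            for rank, model in enumerate(ordered, start=1):
                totals[model] = totals.get(model, 0) + rank
            cells += 1
    return {m: t / cells for m, t in totals.items()}

avg_rank = average_ranks(results)
for model, r in sorted(avg_rank.items(), key=lambda kv: kv[1]):
    print(f"{model}: {r:.2f}")
```

The gap between FLAIR and OpenCLIP is a single rank position out of nine cells, so the top two are close. The gap to RETFound is much wider.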

The most interesting result is the REFUGE flip. OpenCLIP, which has the worst zero-shot performance on glaucoma, is the best linear-probe performer at every label budget. The interpretation is that OpenCLIP’s ViT-H/14 has very high-dimensional, generic-but-rich visual features — useful raw material for a small classifier to learn from, but without the medical alignment that lets a text prompt land cleanly.

FLAIR stays dominant on diabetic retinopathy at every budget. BiomedCLIP is the best low-budget choice on ODIR but FLAIR overtakes it at 20 %. RETFound ranks last in seven of the nine cells and never better than third, despite being trained on 1.6 million retinal images. The most likely explanation is the pre-training objective: masked-autoencoder reconstruction produces spatially rich features, but contrastive training is better aligned with discriminative downstream tasks like classification.

Linear probing reshuffles the zero-shot ranking. The best model depends on the task and the label budget.

8. The clinical operating point

AUROC summarises discrimination at every threshold but a deployed screening tool runs at one. The clinically meaningful operating point for population-level glaucoma screening is sensitivity at a fixed specificity — how many true cases the model catches while keeping false positives below a tolerable level.

I fixed specificity at 90 %, which means accepting that roughly 10 % of healthy eyes get flagged for review. The question is then: of the actual glaucoma cases, how many does the model catch?

Model | Sens @ Spec = 90 % (5 % labels) | Sens @ Spec = 90 % (20 % labels)
OpenCLIP | 0.436 | 0.611
FLAIR | 0.289 | 0.526
BiomedCLIP | 0.231 | 0.471
RETFound | 0.200 | 0.414

Sensitivity at 90 % specificity on REFUGE. Higher is better.
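Sensitivity at a fixed specificity can be read off by sweeping the decision threshold upward until the specificity target is met. The sketch below uses toy scores, and the helper name and tie-handling convention (predict positive when the score is strictly above the threshold) are assumptions.

```python
def sensitivity_at_specificity(scores, labels, target_spec=0.90):
    """Return (threshold, specificity, sensitivity) at the lowest threshold
    whose specificity meets the target. Predict positive when score > threshold."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    for thr in sorted(set(scores)):
        spec = sum(s <= thr for s in neg) / len(neg)
        if spec >= target_spec:
            sens = sum(s > thr for s in pos) / len(pos)
            # Thresholds ascend, so the first hit keeps sensitivity as high as possible.
            return thr, spec, sens
    return None

# Toy scores: 10 healthy eyes (label 0) and 4 glaucoma cases (label 1).
neg_scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.70]
pos_scores = [0.30, 0.60, 0.80, 0.90]
scores = neg_scores + pos_scores
labels = [0] * len(neg_scores) + [1] * len(pos_scores)

thr, spec, sens = sensitivity_at_specificity(scores, labels)
print(thr, spec, sens)  # 0.45 0.9 0.75
```

In this toy case one healthy eye in ten is flagged, and three of the four glaucoma cases are caught.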

OpenCLIP catches 61 % of glaucoma cases with 20 % labels at this operating point. That is below the WHO-recommended ≥ 80 % sensitivity threshold for population screening, so this is not a standalone diagnostic tool. But it is a meaningful triage capability that could reduce specialist workload by pre-filtering the most likely negative cases.

Temperature scaling improves calibration (lowering Expected Calibration Error and Area Under Calibration Error) without changing AUROC. Calibrated probability outputs are achievable post-hoc — useful for downstream clinical decision-support that depends on probability thresholds rather than raw scores.
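Expected Calibration Error compares how confident the model is with how often it is right, bin by bin. A minimal binary sketch with 15 equal-width bins follows. The convention of binning by confidence in the predicted class is an assumption, and the probabilities are toy values.

```python
def expected_calibration_error(probs, labels, n_bins=15):
    """Binary ECE: bin-size-weighted gap between mean confidence and accuracy."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p  # confidence in the predicted class
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, pred == y))
    n = len(probs)
    ece = 0.0
    for members in bins:
        if members:
            mean_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(ok for _, ok in members) / len(members)
            ece += len(members) / n * abs(mean_conf - accuracy)
    return ece

# Toy over-confident model: always 95 % sure, but right only 6 times out of 8.
probs = [0.95, 0.95, 0.95, 0.95, 0.05, 0.05, 0.05, 0.05]
labels = [1, 1, 1, 0, 0, 0, 0, 1]
ece = expected_calibration_error(probs, labels)
print(round(ece, 3))  # 0.2
```

The model claims 95 % confidence but achieves 75 % accuracy, so the ECE is 0.20. A perfectly calibrated model would score 0.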

9. What I learned

There is no universal best model. FLAIR wins on diabetic retinopathy. OpenCLIP is the best linear probe on glaucoma at every label budget, although FLAIR’s zero-shot AUROC (0.921) remains the highest glaucoma score overall. BiomedCLIP wins on multi-disease classification at low budgets. The choice depends on what disease you’re screening for and how much labeled data you can afford to collect.

Domain alignment dominates zero-shot. Feature richness matters more under linear probing. FLAIR’s retinal-specific pre-training is decisive in the zero-shot regime, where everything depends on whether text and image embeddings already agree. Once a small classifier sits on top of the frozen encoder, what matters is how rich the encoder’s features are. On REFUGE, OpenCLIP’s ViT-H/14 supplies the strongest features despite having no retina-specific pre-training. Size alone does not decide it, though: FLAIR’s much smaller ResNet-50 still leads at every MESSIDOR budget.

Bigger isn’t always better. Of the two retina-specific models, RETFound has the larger architecture (ViT-Large versus FLAIR’s ResNet-50) and the larger retinal corpus (1.6 million images versus roughly 288,000). It still underperforms, most likely because its self-supervised objective, reconstructing masked patches, produces features that are good at reconstruction but less aligned with discrimination.

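5 % of labels already changes the game. For most model-dataset pairs, the biggest jump is from zero-shot to a 5 % linear probe: on ODIR-200×3, FLAIR goes from 0.366 to 0.843 Macro-F1. The exception is FLAIR on MESSIDOR and REFUGE, where its zero-shot scores (0.735 and 0.921) stay above its own probes at every budget. After 10 %, returns diminish. The implication for screening programs is concrete: where zero-shot is weak, even a small annotation budget of a few dozen labeled images produces large gains.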

Calibration is fixable. Temperature scaling improves probability calibration without changing the model’s ranking ability. Clinical decision support that depends on probability thresholds doesn’t need re-training; it needs a single scalar parameter fit on a held-out validation split.
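Fitting that single scalar can be sketched as a one-dimensional search for the temperature T that minimises negative log-likelihood on validation logits. The binary setting, the grid search, and the toy logits below are assumptions for illustration. The thesis states only that T is optimised with NLL on a validation split.

```python
import math

def nll(logits, labels, temperature):
    """Mean binary negative log-likelihood of sigmoid(logit / T)."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = 1.0 / (1.0 + math.exp(-z / temperature))
        p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(logits)

def fit_temperature(logits, labels):
    """Grid-search the scalar T in (0, 10] that minimises validation NLL."""
    grid = [0.05 * k for k in range(1, 201)]
    return min(grid, key=lambda t: nll(logits, labels, t))

# Toy over-confident validation logits: large magnitudes, two of eight wrong.
val_logits = [6.0, 5.0, -6.0, -5.0, 4.0, -4.0, 5.0, -5.0]
val_labels = [1, 1, 0, 0, 0, 1, 1, 0]

T = fit_temperature(val_logits, val_labels)
print(f"T = {T:.2f}")
print(f"NLL before: {nll(val_logits, val_labels, 1.0):.3f}")
print(f"NLL after:  {nll(val_logits, val_labels, T):.3f}")
```

Dividing every logit by the same positive T preserves their order, which is why ranking metrics such as AUROC are unchanged while the probabilities become less extreme.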

This all ran on a laptop. The whole experiment (four models, three benchmarks, three label budgets, five folds, three prompts, calibration, statistical testing) runs on a single Apple M-series laptop using MPS acceleration. There is no GPU cluster anywhere in this pipeline. Meaningful foundation-model evaluation is now accessible to individual researchers.

10. Methodology in brief

Stratified 5-fold cross-validation with identical fold assignments across all models and budgets. Within each training fold, stratified subsampling at 5 %, 10 %, and 20 % label budgets. The linear head is a logistic regression trained on frozen-encoder features. Paired Wilcoxon signed-rank tests across folds for model comparisons. BCa bootstrap confidence intervals (B = 2000) for point estimates. Temperature scaling optimised on the validation split with negative log-likelihood; calibration reported as 15-bin ECE and AUCE. Three classification metrics: Macro-F1 on the 4-class MESSIDOR and 3-class ODIR tasks, AUROC on the binary REFUGE task, and Sensitivity @ 90 % Specificity for the clinical operating point.
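To make the label budgets concrete, here is a sketch of stratified subsampling within one training fold. It assumes a 1,200-image dataset with 10 % positives, a training fold holding 80 % of it, and per-class rounding with at least one sample per class. Those details and the helper name are illustrative assumptions.

```python
import random
from collections import defaultdict

def stratified_subsample(train_indices, labels, fraction, seed=0):
    """Keep roughly `fraction` of each class, with at least one sample per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in train_indices:
        by_class[labels[i]].append(i)
    chosen = []
    for y in sorted(by_class):
        idxs = by_class[y]
        k = max(1, round(fraction * len(idxs)))
        chosen.extend(rng.sample(idxs, k))
    return sorted(chosen)

# Indices 0-119 are positives, 120-1199 negatives.
labels = [1] * 120 + [0] * 1080
# One training fold: 96 positives + 864 negatives = 960 images.
train_indices = list(range(0, 96)) + list(range(120, 984))

budgets = {}
for fraction in (0.05, 0.10, 0.20):
    subset = stratified_subsample(train_indices, labels, fraction)
    n_pos = sum(labels[i] for i in subset)
    budgets[fraction] = (len(subset), n_pos)
    print(f"{fraction:.0%}: {len(subset)} labelled images, {n_pos} positive")
```

Under these assumptions the 5 % budget is 48 images, of which only 5 are positive, which illustrates how little supervision the smallest budget provides on a low-prevalence task.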

All datasets are publicly available de-identified benchmarks. No patient-identifiable data was used. The research followed Monash University ethics guidelines. Results are not intended for direct clinical use without further validation on the populations they would be deployed in.

11. If you’re building a screening tool

For diabetic retinopathy: use FLAIR. It wins zero-shot (0.735 Macro-F1, no labels at all) and stays on top through every few-shot budget. With no labeled data it is already useful as a triage layer.

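For glaucoma: if you are training a linear probe, use OpenCLIP. With even 5 % labeled data (about 50 to 60 images), OpenCLIP’s ViT-H/14 features beat FLAIR’s probe at every label budget, in both AUROC and sensitivity at 90 % specificity. Benchmark it against zero-shot FLAIR first, though: the 0.921 zero-shot AUROC is the most dramatic single result in this work and is still higher than the best probe in the tables (0.891).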

For multi-disease screening: start with BiomedCLIP at small label budgets (up to 10 %). Switch to FLAIR once you can afford ~20 % of the data labeled.

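Always: if any labels are available, train a few-shot probe and compare it with zero-shot rather than assuming either wins (probing helps in most cells, but FLAIR’s zero-shot scores on MESSIDOR and REFUGE beat its own probes); apply temperature scaling for calibrated probabilities; do not assume RETFound’s domain-specific pre-training will beat contrastive models; and validate on the local population before deployment.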

12. What comes next

Three directions stand out. First, parameter-efficient fine-tuning — LoRA or adapter-based — should sit between linear probing and full fine-tuning on the cost-benefit curve. Second, ensembling FLAIR and OpenCLIP, which have complementary strengths, would be straightforward and is likely to lift both DR and glaucoma performance simultaneously. Third, the prompt-sensitivity work is just a starting point: a clinician-validated library of prompts, rather than three templates, would let downstream users tune the model for their setting without changing weights.

The thesis goes deeper on all of this — literature review, full methodology, calibration plots, statistical tests, and the appendix on threats to validity and reproducibility.

“Foundation models can already screen retinal disease at clinically meaningful operating points. The hard work isn’t building a bigger model; it’s choosing the right one for the task and the budget.”


This page summarises a 78-page Master of AI thesis at Monash University, supervised by Dr Yasmeen George. The full thesis — literature review, methodology, results, discussion, references, and appendices — is below.


From Prompts to Probes: Zero-Shot and Few-Shot Transfer of Foundation Models for Retinal Fundus Classification. FIT5128 final thesis, Monash University Faculty of Information Technology · ~7,900 words · literature review, methodology, results, discussion, references, and appendices.