Forge
Generates training data from documents, scores examples with LLM-as-judge evaluation, detects benchmark contamination, fine-tunes locally with LoRA on Apple Silicon, and runs paired bootstrap significance tests. 184 tests, +16.8% ROUGE-1 gain.
View Project