Agent Workflow Lab
Experiments, prototypes, and technical build logs across AI systems, agents, software trust, research tools, and applied machine learning.
A living workspace for things I am testing, building, breaking, and improving. Some experiments become products, some become essays, and some become lessons.
Six workstreams I am running right now. Each lab is a focused investigation with its own success metric, failure mode, and lesson log.
Experiments with LLM agents, tool use, memory, planning, and evaluation loops. Measuring what makes an agent reliably finish a task versus impressively start one.
Exploring evidence layers, audit trails, signed build records, and verifiable software systems. The thesis: trust is becoming a first-class layer of the AI stack.
Foundation model experiments for retinal disease classification: zero-shot inference, linear probing, and low-data evaluation. 0.92 AUROC with 5 to 20 percent of labels.
Testing retrieval quality, hallucination control, chunking strategies, and answer evaluation methods. Building a small bench you can actually trust.
Experiments using AI coding agents to build, refactor, debug, and ship production interfaces. This very portfolio is one of the artefacts. A log of what works and what does not.
Small tools for evaluating startup ideas, AI use cases, cold email offers, and workflow automation opportunities. A founder utility belt, not a startup.
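As an illustration of the retrieval-quality lab above, one metric a small bench could start with is recall@k: the share of known-relevant documents that appear in the top k results. This is a minimal sketch, not the bench itself. The function names, the document IDs, and the shape of the test cases are all assumptions made for the example.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant document IDs found in the top-k retrieved IDs."""
    if not relevant:
        # No ground truth for this query; treat it as zero rather than divide by zero.
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(cases: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over (retrieved, relevant) pairs; 0.0 for an empty bench."""
    if not cases:
        return 0.0
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in cases) / len(cases)


# Hypothetical bench: two queries with hand-labelled relevant document IDs.
bench = [
    (["doc1", "doc7", "doc3"], {"doc1", "doc3"}),
    (["doc9", "doc2", "doc4"], {"doc4"}),
]
score = mean_recall_at_k(bench, k=2)
```

Recall@k only checks whether the right material was retrieved, so it says nothing about hallucination in the final answer; that would need a separate answer-level check.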
Every lab moves through the same six steps. The point is not to be heroic. The point is to learn something specific and write it down.
Find a sharp question worth answering with code.
Sketch the smallest system that could test it.
Make the thing real enough to expose flaws.
Push it until it breaks. Note what broke first.
Write the result so it survives the week.
Promote it to a product or an essay, or close the file.
Three problems I keep returning to. Most active labs are slices of these.
Reliability beats raw capability. Most demos work once. The interesting work is making them finish a task ten times in a row.
Software is increasingly generated, copied, and signed. Trust may become the next layer between code and the people who use it.
Atlases, field manuals, structured Q&A, learning paths. Tools that compress the time from confused to capable.
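The "ten times in a row" bar from the first focus area can be sketched as a tiny harness. This is an illustrative sketch only: `task` stands in for any agent run that returns True on success, and both function names are hypothetical, not part of any lab's actual code.

```python
from typing import Callable


def consecutive_successes(task: Callable[[], bool], runs: int = 10) -> int:
    """Run `task` up to `runs` times and count successes before the first failure."""
    streak = 0
    for _ in range(runs):
        try:
            ok = task()
        except Exception:
            # A crash counts as a failed run, the same as returning False.
            ok = False
        if not ok:
            break
        streak += 1
    return streak


def is_reliable(task: Callable[[], bool], runs: int = 10) -> bool:
    """True only if the task succeeded on every one of `runs` consecutive attempts."""
    return consecutive_successes(task, runs) == runs
```

Counting the streak rather than the overall success rate is a deliberate choice here: it records how far the agent got before the first failure, which is where the lesson log starts.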
If any of this overlaps with what you are working on, I am open to focused collaboration and contract work.
Contact me