Writings
Picking the smallest model that does the job
Draft writing — full body publishes via the editorial workflow.
Most teams pay for a model one size larger than the task needs. This post walks through the eval workflow we use to pick the smallest model and serving stack that still meets the task’s quality bar under a fixed VRAM budget, comparing vLLM, SGLang, and llama.cpp head-to-head on the same prompts and the same hardware.
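The selection rule itself can be sketched in a few lines of Python. This is a minimal illustration, not the post's actual harness: the `Candidate` type, the `pick_smallest` function, and the choice of measured peak VRAM as the meaning of "smallest" are all assumptions, and the numbers in the usage example are made-up placeholders rather than benchmark results.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one model + serving-stack pairing."""
    name: str        # illustrative label, e.g. "medium/engine-b"
    vram_gb: float   # peak VRAM observed while serving the eval prompts
    quality: float   # score on the task's eval set, in [0, 1]


def pick_smallest(
    candidates: Iterable[Candidate],
    vram_budget_gb: float,
    quality_bar: float,
) -> Optional[Candidate]:
    """Return the lowest-VRAM candidate that fits the budget and meets the bar.

    Returns None when no candidate satisfies both constraints.
    """
    eligible = [
        c for c in candidates
        if c.vram_gb <= vram_budget_gb and c.quality >= quality_bar
    ]
    # "Smallest" is assumed here to mean lowest measured VRAM footprint.
    return min(eligible, key=lambda c: c.vram_gb, default=None)


# Placeholder numbers for illustration only; not measured results.
candidates = [
    Candidate("large/engine-a", vram_gb=40.0, quality=0.93),
    Candidate("medium/engine-b", vram_gb=20.0, quality=0.91),
    Candidate("small/engine-c", vram_gb=8.0, quality=0.84),
]

choice = pick_smallest(candidates, vram_budget_gb=24.0, quality_bar=0.90)
print(choice.name if choice else "no candidate meets the bar")
```

With these placeholder inputs the largest candidate exceeds the budget and the smallest misses the quality bar, so the middle one is selected. Holding the prompts and hardware fixed across candidates is what makes the `quality` and `vram_gb` fields comparable.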