v0.12.4 — Ollama - Product release notes & changelog tool

What’s Changed

Flash attention is now enabled by default for Qwen 3 and Qwen 3 Coder
Fixed minor memory estimation issues when scheduling models on NVIDIA GPUs
Fixed an issue where keep_alive in the API would accept different values for the /api/chat and /api/generate endpoints
Fixed tool calling rendering with qwen3-coder
More reliable and accurate VRAM detection
OLLAMA_FLASH_ATTENTION can now be overridden to 0 for models that have flash attention enabled by default
macOS 12 Monterey and macOS 13 Ventura are no longer supported
Fixed crash where templates were not correctly defined
Fix memory calculations on NVIDIA iGPUs
AMD gfx900 and gfx906 (MI50, MI60, etc) GPUs are no longer supported via ROCm. We’re working to support these GPUs via Vulkan in a future release.

Full Changelog: https://github.com/ollama/ollama/compare/v0.12.3…v0.12.4-rc3