Update Oct 3, 2025 tracked by Updatify
v0.12.4
What’s Changed
- Flash attention is now enabled by default for Qwen 3 and Qwen 3 Coder
- Fixed minor memory estimation issues when scheduling models on NVIDIA GPUs
-
Fixed an issue where
keep_alivein the API would accept different values for the/api/chatand/api/generateendpoints -
Fixed tool calling rendering with
qwen3-coder - More reliable and accurate VRAM detection
-
OLLAMA_FLASH_ATTENTIONcan now be overridden to0for models that have flash attention enabled by default - macOS 12 Monterey and macOS 13 Ventura are no longer supported
- Fixed crash where templates were not correctly defined
- Fix memory calculations on NVIDIA iGPUs
- AMD gfx900 and gfx906 (MI50, MI60, etc) GPUs are no longer supported via ROCm. We’re working to support these GPUs via Vulkan in a future release.
New Contributors
- @Fachep made their first contribution in https://github.com/ollama/ollama/pull/12412
Full Changelog: https://github.com/ollama/ollama/compare/v0.12.3…v0.12.4-rc3