What speed do you get at Q8 on AMD Ryzen™ AI Max+ 395

#14
by akierum - opened

Hello, since running this Q8 on RTX 3090s needs 5 GPUs, I want to ask: what speeds do you get with an AMD AI Max+ 395?

At the start of the context, and with the context filled to, say, 100k tokens?

akierum changed discussion title from What speed do you get at Q6 on AMD Ryzen™ AI Max+ 395 to What speed do you get at Q8 on AMD Ryzen™ AI Max+ 395

I am getting around 200 t/s prompt processing (pp) and 30 t/s token generation (tg) at 0 context; it slows to a crawl at around 20-30K context, to less than 50 pp and 7.5 tg.

Very usable up to 20-30k tokens.
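If anyone wants to reproduce these numbers, here is a minimal llama-bench sketch (the model filename and sizes are placeholders; adjust to your setup):

```bash
# Measure prompt processing (pp) at several prompt sizes and
# token generation (tg). The model path is a placeholder.
# Newer llama.cpp builds also accept -d <depth> to measure
# tg with a pre-filled context, which matches the slowdown
# being discussed here.
llama-bench -m GLM-4.5-Air-Q8_0.gguf \
  -ngl 99 \
  -p 512,16384,32768 \
  -n 128
```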

Thank you. Coding requires at least 128k context: the prompt in Cline, Roo Code, or Kilo Code is 30K minimum, with the model reading 4 source files (js, html, css, json) and 4 documentation files (md files like API_DOCUMENTATION.md, README.md, technical_specification.md). So this AMD Ryzen™ AI Max+ 395 is not usable for coding :(

Sadly, a GPU is still the only way for long context. If you only get 100 t/s prompt processing, a 30k prompt takes 30,000 / 100 = 300 s, i.e. 5 minutes...

Well, this new Kwaipilot KAT-Dev is superior to GLM4.5-Air; Qwen3-coder 30B seems like a toy by comparison. But it was hard to make it work.

V7 is Kwaipilot KAT-Dev, V2 is GLM4.5-Air

I found this one a bit more performant on my Strix Halo machine - https://huggingface.co/bartowski/cerebras_GLM-4.5-Air-REAP-82B-A12B-GGUF. I'm using Q6_K and it's my go-to for coding tasks at the moment.
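If it helps, here's a sketch of pulling just the Q6_K files with huggingface-cli (the --include glob is an assumption about the repo's file naming):

```bash
# Download only the Q6_K shards of the REAP GGUF repo;
# the glob pattern is a guess at the quant file names.
huggingface-cli download bartowski/cerebras_GLM-4.5-Air-REAP-82B-A12B-GGUF \
  --include "*Q6_K*" \
  --local-dir ./glm-4.5-air-reap
```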

How did you get Kwaipilot KAT-Dev up and running?

I was able to get it working, but had to use the prompt template from here - https://huggingface.co/Kwaipilot/KAT-Dev-72B-Exp. It's decent, but pretty bad at tool calling: it constantly fails with Roo Code, and I found that GLM Air REAP is much more consistent. Here are the llama-server args I gave it: https://github.com/blake-hamm/bhamm-lab/blob/main/kubernetes/manifests/apps/ai/models/helm-green.yaml#L142 (probably still needs some tweaking; would love any feedback, as I'm still new to llama.cpp in general).
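For reference, a minimal sketch of this kind of llama-server invocation with a custom chat template (model path, context size, and template filename are placeholders; the linked manifest has the actual args):

```bash
# Serve KAT-Dev with a custom Jinja chat template;
# the .gguf and .jinja paths are placeholders.
llama-server -m KAT-Dev-Q6_K.gguf \
  -c 65536 \
  -ngl 99 \
  --jinja \
  --chat-template-file kat-dev.jinja \
  --host 0.0.0.0 --port 8080
```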

If you find this helpful for comparison (not Q8, but an MXFP4 quant): I got roughly 13 tok/sec running GLM-4.5-Air (specifically, "noctrex/GLM-4.5-Air-Derestricted-MXFP4_MOE-GGUF") with 131K context on a GMKtec EVO-X2 (AMD AI Max+ 395, Windows, ROCm) using LM Studio. Not fast, but quite usable.

Q6 is the minimum requirement for coding, I think, after testing Q8 and Q8-XL quants of other LLMs.

For local coding dreamers, an M4 Max with 128 GB is your best bet.

I think an M3 Ultra with 256 GB is the minimum requirement, as the context window needs memory too. But the price is very bad for what it gives you in terms of productivity.
