What speed do you get at Q8 on AMD Ryzen™ AI Max+ 395

#14
by akierum - opened

Hello, since running this Q8 on RTX 3090s needs 5 GPUs, I want to ask: what speeds do you get with an AMD AI Max+ 395?

At the start of the context, and with the context filled to, say, 100k tokens?

akierum changed discussion title from What speed do you get at Q6 on AMD Ryzen™ AI Max+ 395 to What speed do you get at Q8 on AMD Ryzen™ AI Max+ 395

I am getting around 200 t/s prompt processing (pp) and 30 t/s token generation (tg) at 0 context; it slows to a crawl at around 20-30K context, to less than 50 pp and 7.5 tg.

Very usable up to 20-30k tokens.
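If anyone wants to reproduce these numbers, here is a minimal llama-bench sketch (the model filename and sizes are placeholders; adjust to your setup):

```bash
# Measure prompt processing (pp) at several prompt sizes and
# token generation (tg). The model path is a placeholder.
# Newer llama.cpp builds also accept -d <depth> to measure
# tg with a pre-filled context, which matches the slowdown
# being discussed here.
llama-bench -m GLM-4.5-Air-Q8_0.gguf \
  -ngl 99 \
  -p 512,16384,32768 \
  -n 128
```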

Thank you. Coding requires at least 128k context: the prompt in Cline, Roo Code, or Kilo Code is 30K minimum, with the model reading 4 source files (js, html, css, json) and 4 documentation files (md files like API_DOCUMENTATION.md, README.md, technical_specification.md). So this AMD Ryzen™ AI Max+ 395 is not usable for coding :(

Sadly, a GPU is still the only way for long context. If you only get 100 t/s prompt processing, a 30k prompt takes 30,000 / 100 = 300 s, i.e. 5 minutes...

Well, this new Kwaipilot KAT-Dev is superior to GLM4.5-Air; Qwen3-coder 30B seems like a toy by comparison. But it was hard to make it work.

V7 is Kwaipilot KAT-Dev, V2 is GLM4.5-Air

I found this one a bit more performant on my Strix Halo machine - https://huggingface.co/bartowski/cerebras_GLM-4.5-Air-REAP-82B-A12B-GGUF. I'm using Q6_K and it's my go-to for coding tasks at the moment.
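If it helps, here's a sketch of pulling just the Q6_K files with huggingface-cli (the --include glob is an assumption about the repo's file naming):

```bash
# Download only the Q6_K shards of the REAP GGUF repo;
# the glob pattern is a guess at the quant file names.
huggingface-cli download bartowski/cerebras_GLM-4.5-Air-REAP-82B-A12B-GGUF \
  --include "*Q6_K*" \
  --local-dir ./glm-4.5-air-reap
```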

How did you get Kwaipilot KAT-Dev up and running?

I was able to get it working, but had to use the prompt template from here - https://huggingface.co/Kwaipilot/KAT-Dev-72B-Exp. It's decent, but pretty bad at tool calling: it constantly fails with Roo Code, and I found that GLM Air REAP is much more consistent. Here are the llama-server args I gave it: https://github.com/blake-hamm/bhamm-lab/blob/main/kubernetes/manifests/apps/ai/models/helm-green.yaml#L142 (probably still needs some tweaking; would love any feedback, as I'm still new to llama.cpp in general).
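For reference, a minimal sketch of this kind of llama-server invocation with a custom chat template (model path, context size, and template filename are placeholders; the linked manifest has the actual args):

```bash
# Serve KAT-Dev with a custom Jinja chat template;
# the .gguf and .jinja paths are placeholders.
llama-server -m KAT-Dev-Q6_K.gguf \
  -c 65536 \
  -ngl 99 \
  --jinja \
  --chat-template-file kat-dev.jinja \
  --host 0.0.0.0 --port 8080
```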

If you find this helpful for comparison (not Q8, but an MXFP4 quant): I got roughly 13 tok/sec running GLM-4.5-Air (specifically, "noctrex/GLM-4.5-Air-Derestricted-MXFP4_MOE-GGUF") with 131K context on a GMKtec EVO-X2 (AMD AI Max+ 395, Windows, ROCm) using LM Studio. Not fast, but quite usable.

Q6 is the minimum requirement for coding, I think, after testing Q8 and Q8-XL quants of other LLMs.

For local coding dreamers, an M4 Max with 128 GB is your best bet.

I think an M3 Ultra with 256 GB is the minimum requirement, as the context window needs memory too. But the price is very bad for what it gives you in terms of productivity.
