# Qwen3-ASR 0.6B – GGUF (ggml-quantised)
GGUF / ggml conversions of `Qwen/Qwen3-ASR-0.6B` for use with the `qwen3-asr-main` CLI from CrispStrobe/CrispASR.
Qwen3-ASR 0.6B is Alibaba's speech-LLM ASR model:
- 30 languages + 22 Chinese dialects with automatic language detection
- 6.42 % avg WER on the HuggingFace Open ASR Leaderboard
- Apache-2.0 licence
- Speech-LLM architecture: a Whisper-style audio encoder (2D-conv subsampler + 18-layer Transformer + projector head, 896 → 1024) feeds audio frames into a stock Qwen3 0.6B LLM (28 layers, GQA 16/8, head_dim=128, Q-norm/K-norm, SwiGLU, RoPE θ=1e6) via embedding splice at the `<|audio_pad|>` placeholder positions in a ChatML prompt. The LLM autoregressively generates the transcript.
This is the first speech-LLM in the CrispASR family; every other model in the set uses a dedicated CTC, transducer, or encoder-decoder architecture. The Qwen3-ASR runtime ships with a persistent KV cache, so each per-token decode step does O(1) work against the cache instead of an O(N) full re-forward over the sequence.
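Why KV caching gives identical output to full re-forwards can be seen in a toy numpy sketch (single head, pre-projected tokens; this is an illustration, not the runtime's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 6  # toy head dim and sequence length

def attend(q, K, V):
    """One causal attention query over a set of cached keys/values."""
    s = (K @ q) / np.sqrt(d)            # (t,) attention scores
    w = np.exp(s - s.max()); w /= w.sum()
    return w @ V                         # (d,) context vector

X = rng.standard_normal((T, d))          # token vectors (already projected)

# Full re-forward: recompute attention over the whole prefix at every step.
full = np.stack([attend(X[t], X[:t + 1], X[:t + 1]) for t in range(T)])

# KV-cache decode: append one K/V row per token and reuse the cache.
K_cache, V_cache, incr = [], [], []
for t in range(T):
    K_cache.append(X[t]); V_cache.append(X[t])   # O(1) append per token
    incr.append(attend(X[t], np.array(K_cache), np.array(V_cache)))
incr = np.stack(incr)

assert np.allclose(full, incr)   # identical results, no full re-forward
```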
## Files

| File | Size | Notes |
|---|---|---|
| qwen3-asr-0.6b.gguf | 1.88 GB | F16 |
| qwen3-asr-0.6b-q8_0.gguf | 961 MB | Q8_0, near-lossless |
| qwen3-asr-0.6b-q4_k.gguf | 676 MB | Q4_K – recommended default, faster than realtime on a 4-core CPU |
All quantisations produce the correct transcript on `samples/jfk.wav`:

> And so, my fellow Americans, ask not what your country can do for you; ask what you can do for your country.
The mel filterbank from `WhisperFeatureExtractor` is baked into the GGUF as `audio.mel_filters` (along with the Hann window as `audio.mel_window`), so the C++ runtime computes the log-mel spectrogram natively, without needing torch / librosa / scipy at inference time.
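For orientation, the log-mel front-end can be sketched in numpy with the stated parameters (n_fft=400, hop=160, 128 mel bins at 16 kHz). This is a rough approximation only: it uses a simplified HTK-scale filterbank and a symmetric Hann window, whereas the GGUF ships Whisper's exact filters and window, so values will not match the runtime bit-for-bit.

```python
import numpy as np

SR, N_FFT, HOP, N_MELS = 16000, 400, 160, 128

def mel_filterbank(sr, n_fft, n_mels):
    # Simplified HTK-scale triangular filters; the GGUF carries the exact
    # WhisperFeatureExtractor filters in audio.mel_filters instead.
    hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    pts = hz(np.linspace(mel(0.0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l: fb[i, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c: fb[i, c:r] = (r - np.arange(c, r)) / (r - c)
    return fb

def log_mel(audio):
    # Hann-windowed STFT -> power spectrum -> mel projection -> Whisper-style log.
    win = np.hanning(N_FFT)
    n_frames = 1 + (len(audio) - N_FFT) // HOP
    frames = np.stack([audio[i * HOP:i * HOP + N_FFT] * win for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, N_FFT)) ** 2        # (T, 201)
    mels = mel_filterbank(SR, N_FFT, N_MELS) @ power.T     # (128, T)
    logm = np.log10(np.maximum(mels, 1e-10))
    logm = np.maximum(logm, logm.max() - 8.0)              # dynamic-range clamp
    return (logm + 4.0) / 4.0                              # Whisper normalisation

spec = log_mel(np.random.default_rng(0).standard_normal(SR))  # 1 s of noise
assert spec.shape == (128, 98)
```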
## Quick Start

```bash
# 1. Build the runtime
git clone https://github.com/CrispStrobe/CrispASR
cd CrispASR
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc) --target qwen3-asr-main

# 2. Download a quantisation
huggingface-cli download cstr/qwen3-asr-0.6b-GGUF \
  qwen3-asr-0.6b-q4_k.gguf --local-dir .

# 3. Transcribe
./build/bin/qwen3-asr-main \
  -m qwen3-asr-0.6b-q4_k.gguf \
  -f your-audio.wav -t 8
```

Audio must be 16 kHz mono 16-bit PCM WAV. Pre-convert with:

```bash
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
```
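If you want to sanity-check a file's header before feeding it to the CLI, Python's stdlib `wave` module is enough (a convenience sketch, not part of the runtime; `check_wav` is a hypothetical helper name):

```python
import os, struct, tempfile, wave

def check_wav(path):
    """True iff the file is 16 kHz mono 16-bit PCM WAV, as the CLI expects."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == 16000
                and w.getnchannels() == 1
                and w.getsampwidth() == 2)   # 2 bytes/sample = 16-bit PCM

# Self-check on a 10 ms silent clip written with the same stdlib module.
path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(path, "wb") as w:
    w.setnchannels(1); w.setsampwidth(2); w.setframerate(16000)
    w.writeframes(struct.pack("<160h", *([0] * 160)))
ok = check_wav(path)
assert ok
```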
## Performance

Measured on `samples/jfk.wav` (11 seconds), Apple-class 4-core CPU:
| Variant | Mel | Encoder | Prefill | Decode/tok | Total |
|---|---|---|---|---|---|
| F16 | 250 ms | 2660 ms | 3032 ms | 151 ms | 10.3 s |
| Q8_0 | 236 ms | 2459 ms | 2840 ms | 137 ms | 9.5 s |
| Q4_K | 246 ms | 2851 ms | 2721 ms | 118 ms | 9.3 s |
Q4_K runs slightly faster than realtime with no quality loss on this clip.
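The real-time factor (RTF) behind that claim is just total processing time over audio duration, computed from the "Total" column above:

```python
# RTF = processing time / audio duration; RTF < 1 means faster than realtime.
audio_s = 11.0                                      # samples/jfk.wav length
total_s = {"F16": 10.3, "Q8_0": 9.5, "Q4_K": 9.3}   # "Total" column above
rtf = {k: round(v / audio_s, 2) for k, v in total_s.items()}
assert rtf == {"F16": 0.94, "Q8_0": 0.86, "Q4_K": 0.85}
```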
## Architecture
| Component | Details |
|---|---|
| Audio encoder | 18-layer Whisper-style pre-LN Transformer, d=896, heads=14, head_dim=64, FFN=3584 |
| Conv subsampler | 3 × Conv2D stride-2 (1→480→480→480), then linear (480·16 = 7680 → 896); the three stride-2 convs downsample the mel frames 8× in time |
| Projector | ln_post → proj1 (896→896) → GELU → proj2 (896→1024) |
| LLM | Qwen3 0.6B: 28 layers, hidden=1024, 16 Q heads / 8 KV heads (GQA), head_dim=128, FFN=3072, SwiGLU, RMSNorm, per-head Q-norm / K-norm, NEOX-style RoPE θ=1e6 |
| Vocab | 151 936 tokens (Qwen2 BPE, GPT-2 byte encoding) |
| Audio | 16 kHz mono, 128 mel bins, n_fft=400, hop=160, win=400 (matches WhisperFeatureExtractor) |
| Audio injection | encoder output embeddings spliced in at the `<\|audio_pad\|>` placeholder positions of the ChatML prompt |
| Parameters | ~900 M |
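The GQA layout in the table can be sanity-checked with a shape-only numpy sketch (random toy tensors, not the real weights, and no causal mask): 8 KV heads are shared by 16 query heads, and the concatenated attention output is 16 × 128 = 2048 wide before o_proj maps it back to the 1024-d hidden state.

```python
import numpy as np

T, d_model, hd = 5, 1024, 128       # toy sequence length, model dim, head dim
n_q, n_kv = 16, 8                   # GQA: 16 query heads share 8 KV heads
rng = np.random.default_rng(0)

x  = rng.standard_normal((T, d_model))
Wq = rng.standard_normal((d_model, n_q * hd)) / 32
Wk = rng.standard_normal((d_model, n_kv * hd)) / 32
Wv = rng.standard_normal((d_model, n_kv * hd)) / 32
Wo = rng.standard_normal((n_q * hd, d_model)) / 32   # o_proj: 2048 -> 1024

q = (x @ Wq).reshape(T, n_q, hd)
# Each KV head is reused by n_q // n_kv = 2 query heads.
k = (x @ Wk).reshape(T, n_kv, hd).repeat(n_q // n_kv, axis=1)
v = (x @ Wv).reshape(T, n_kv, hd).repeat(n_q // n_kv, axis=1)

scores = np.einsum("tqd,sqd->qts", q, k) / np.sqrt(hd)
w = np.exp(scores - scores.max(-1, keepdims=True))
w /= w.sum(-1, keepdims=True)
ctx = np.einsum("qts,sqd->tqd", w, v).reshape(T, n_q * hd)  # (T, 2048)
out = ctx @ Wo                                               # (T, 1024)
assert ctx.shape == (T, 2048) and out.shape == (T, 1024)
```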
## Implementation notes (correctness)
The C++ runtime is verified to F16 numerical precision against the PyTorch reference at every architectural boundary on samples/jfk.wav:
| Stage | Diff metric | Result |
|---|---|---|
| Conv front-end (per-chunk Conv2D + flatten + linear) | max abs vs `conv_out.npy` | 1.43e-4 |
| Full audio encoder (18 layers + projector) | per-row cosine sim vs `proj2_out.npy` | mean 1.000000, min 0.999999 |
| Qwen3 LLM forward (28 layers, no audio) | per-position cosine sim vs `llm_logits.npy` | mean 0.999999, top-1 9/9 |
| End-to-end (audio → spliced embeds → LLM → greedy decode) | reproduced reference token sequence | 26 / 26 |
| Mel filterbank (C++ STFT vs `WhisperFeatureExtractor`) | max abs vs `mel_input.npy` | 2.2e-2 |
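The per-row cosine-similarity metric used in these checks amounts to a few lines of numpy (a generic sketch; the reference `.npy` dumps themselves come from the PyTorch side):

```python
import numpy as np

def rowwise_cosine(a, b, eps=1e-12):
    """Per-row cosine similarity between two (rows, dim) activation dumps."""
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + eps
    return num / den

a = np.random.default_rng(0).standard_normal((4, 16))
b = a + 1e-6 * np.random.default_rng(1).standard_normal((4, 16))  # near-identical
sims = rowwise_cosine(a, b)
assert sims.shape == (4,) and sims.min() > 0.999999
```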
### Bugs that would have been hours of debugging
A few non-obvious gotchas the port had to handle:
- `ggml_permute` semantics are inverted from the obvious reading: `permute(t, p0, p1, p2, p3)` means "source axis i goes to NEW position p_i", not "new axis i comes from source axis p_i".
- PyTorch hooks fire pre-GELU when registered on an `nn.Conv2d` module; the `F.gelu` is applied externally in the forward function.
- `cu_seqlens` is GPU-only: `eager_attention_forward` (used on CPU) ignores `cu_seqlens` and does standard full self-attention. The "windowed attention" path only kicks in for FlashAttention2 on GPU. Don't apply the windowed mask on CPU; the reference produces full-attention output.
- `WhisperFeatureExtractor.mel_filters` shape is `(n_freqs=201, n_mels=128)`, not `(n_mels, n_freqs)` as the parameter ordering might suggest.
- Qwen3 attention output width is `head_dim × n_q_heads = 2048`, not `d_model = 1024`. The o_proj is `(2048 → 1024)`, so the attention output is reshaped to `(2048, T)` before o_proj.
- mrope sidestep: Qwen3-ASR uses interleaved multi-modal RoPE with `mrope_section=[24,20,20]`. For text-only or 1D-position input (which includes our spliced audio frames), the three mrope sections all receive identical position_ids and collapse to standard 1D RoPE. The simpler RoPE matches the reference perfectly for our use case.
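The `ggml_permute` inversion in the first bullet is easy to demonstrate in numpy: ggml's `perm` is the inverse of `np.transpose`'s argument, so `np.transpose(t, np.argsort(perm))` reproduces ggml's behaviour. (A toy illustration of the permutation semantics only; real ggml additionally stores dims with `ne[0]` fastest, which is ignored here.)

```python
import numpy as np

def ggml_style_permute(t, perm):
    # ggml semantics: source axis i goes TO new position perm[i],
    # i.e. numpy's transpose called with the inverse permutation.
    return np.transpose(t, np.argsort(perm))

t = np.zeros((2, 3, 4, 5))
perm = (2, 0, 3, 1)                 # axis 0 -> pos 2, axis 1 -> pos 0, ...
out = ggml_style_permute(t, perm)
assert out.shape == (3, 5, 2, 4)    # new pos 0 holds old axis 1 (len 3), etc.

naive = np.transpose(t, perm)       # the "obvious reading"
assert naive.shape == (4, 2, 5, 3)  # differs - hence the gotcha
```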
See qwen3-asr-todo.md in the runtime repo for the complete work log.
## How this was made

- The HF safetensors model was converted to GGUF F16 by `models/convert-qwen3-asr-to-gguf.py`. All 612 tensors map cleanly. The mel filterbank (from `WhisperFeatureExtractor.mel_filters`) and Hann window are baked into the GGUF as `audio.mel_filters` / `audio.mel_window`.
- Quantised variants are produced by `cohere-quantize` (the same llama.cpp-style quantiser used for the other GGUF releases in this family).
- Inference is implemented in `src/qwen3_asr.{h,cpp}`: the encoder and the LLM each run as one ggml graph, with a persistent F32 KV cache `(head_dim, max_ctx, n_kv_heads, n_layers)` shared between prefill and per-token decode steps.
## Reference implementation

predict-woo/qwen3-asr.cpp (MIT) was read for architecture discovery and tensor name mapping. No source code was vendored; the CrispASR runtime is a re-implementation in this repo's existing FastConformer / cohere-style ggml infrastructure, sharing structures with the four other ASR runtimes in the family.
## Supported languages
ar cs da de el en es fa fi fil fr hi hu id it ja ko mk ms nl pl pt ro ru sv th tr vi yue zh plus 22 Chinese dialects (auto-detected at inference time).
## Attribution

- Original model: `Qwen/Qwen3-ASR-0.6B` (Apache-2.0), Alibaba Cloud Qwen team.
- GGUF conversion + ggml runtime: CrispStrobe/CrispASR (community contribution).
- Reference implementation: predict-woo/qwen3-asr.cpp (MIT), used for architecture discovery only; no code vendored.
## Related

- C++ runtime: CrispStrobe/CrispASR
- Sister releases in the same family:
  - `cstr/cohere-transcribe-03-2026-GGUF` – Cohere Transcribe 2B (Open ASR Leaderboard #1)
  - `cstr/parakeet-tdt-0.6b-v3-GGUF` – Parakeet TDT 600M (free word timestamps)
  - `cstr/canary-1b-v2-GGUF` – Canary 978M (speech translation)
  - `cstr/canary-ctc-aligner-GGUF` – universal multilingual forced aligner
## License
Apache-2.0, inherited from the base model.