# Qwen3-4B-Instruct-2507 (NVFP4 Quantized)

*A High-Performance 4B Instruct Model with 256K Context & NVFP4 Quantization*
## 🔍 Overview

Qwen/Qwen3-4B-Instruct-2507 is an optimized, instruction-tuned version of the Qwen3-4B language model, quantized to NVFP4 precision using NVIDIA Model Optimizer. This version delivers state-of-the-art instruction-following and long-context performance in a highly efficient, NVIDIA-optimized NVFP4 format.
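For reference, checkpoints like this are typically produced with TensorRT-LLM's ModelOpt-based post-training quantization example script. The sketch below is illustrative only; the script path and the `nvfp4` qformat value are assumptions that vary across TensorRT-LLM releases, so check the quantization README shipped with your release.

```bash
# Hedged sketch: post-training NVFP4 quantization via TensorRT-LLM's
# quantize.py example, which wraps NVIDIA Model Optimizer (script path
# and qformat value are assumptions; verify against your release).
python examples/quantization/quantize.py \
  --model_dir Qwen/Qwen3-4B-Instruct-2507 \
  --qformat nvfp4 \
  --output_dir ./Qwen3-4B-Instruct-2507-NVFP4
```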
## 🚀 Serving with TensorRT-LLM

```bash
docker run \
  --gpus all \
  --rm \
  --ipc=host \
  --ulimit memlock=-1:-1 \
  --ulimit stack=67108864 \
  --shm-size=64G \
  -p 8050:8000 \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  -w /app/tensorrt_llm \
  nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc5 \
  trtllm-serve \
    OPENZEKA/Qwen3-4B-Instruct-2507-NVFP4 \
    --host 0.0.0.0 \
    --port 8000 \
    --backend pytorch \
    --max_batch_size 128 \
    --tp_size 1 \
    --kv_cache_free_gpu_memory_fraction 0.2
```
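Once the container is up, the server speaks an OpenAI-compatible API on the mapped host port (8050 above). A minimal smoke test follows; the served model name is an assumption, so confirm it with a GET to /v1/models first.

```bash
# Minimal smoke test against the OpenAI-compatible chat endpoint exposed
# by trtllm-serve (model name is an assumption; check GET /v1/models).
curl -s http://localhost:8050/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "OPENZEKA/Qwen3-4B-Instruct-2507-NVFP4",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64
  }'
```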
- Quantized by: OPENZEKA
## Qwen3‑4B‑Instruct‑2507: Full‑Precision vs. NVFP4 Quantized Performance Comparison
The Qwen3‑4B‑Instruct‑2507 model is a compact yet powerful language model with 4 billion parameters. In this benchmark, the model’s full‑precision version is compared with NVIDIA’s NVFP4‑quantized version. Tests were performed on an NVIDIA DGX Spark system using the vLLM inference engine.
### Test Configuration

| Parameter | Value |
|---|---|
| Prompt length | ~128 tokens (64 distinct prompts) |
| Maximum response length | 128 tokens |
| Concurrency levels | 2, 4, 8, 16, 32 |
| Measured metrics | TTFT (Time to First Token, ms); ITL (Inter‑Token Latency, ms); TPS (Tokens Per Second); Latency (total response time, s); Throughput (RPS, requests per second) |
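As an illustration of the first metric (this is not the benchmark harness used for the numbers below), client-side TTFT can be approximated by timing a streaming request until the first SSE chunk arrives. A minimal sketch against the endpoint above, assuming GNU date for millisecond timestamps:

```bash
# Rough client-side TTFT estimate: milliseconds from sending a streaming
# request until the first "data:" SSE chunk (requires GNU date's %N).
start=$(date +%s%3N)
curl -sN http://localhost:8050/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "OPENZEKA/Qwen3-4B-Instruct-2507-NVFP4",
       "messages": [{"role": "user", "content": "Hello"}],
       "max_tokens": 128, "stream": true}' |
while IFS= read -r line; do
  [[ "$line" == data:* ]] || continue
  echo "TTFT: $(( $(date +%s%3N) - start )) ms"
  break
done
```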
### General Observations

NVFP4 quantization dramatically reduces model size and memory footprint while delivering striking performance gains. The quantized version:
- Reduces TTFT by roughly 35–50%, depending on concurrency.
- Roughly doubles TPS.
- Cuts total latency by nearly half.
- Increases throughput (RPS) by about 2×.

These gains are most pronounced at low and medium concurrency, but even at the highest level tested (32) the quantized model maintains a clear advantage.
### Detailed Comparison Tables

#### Full Precision
| Concurrency | TTFT Mean (ms) | ITL Mean (ms) | TPS Mean | Latency Mean (s) | RPS |
|---|---|---|---|---|---|
| 2 | 90.73 | 41.61 | 23.81 | 5.37 | 0.19 |
| 4 | 92.13 | 42.19 | 23.48 | 5.45 | 0.18 |
| 8 | 138.83 | 43.13 | 22.79 | 5.62 | 0.18 |
| 16 | 169.09 | 44.34 | 22.06 | 5.80 | 0.17 |
| 32 | 186.25 | 47.94 | 20.39 | 6.28 | 0.16 |
#### NVFP4 Quantized
| Concurrency | TTFT Mean (ms) | ITL Mean (ms) | TPS Mean | Latency Mean (s) | RPS |
|---|---|---|---|---|---|
| 2 | 43.89 | 19.64 | 50.42 | 2.54 | 0.39 |
| 4 | 59.53 | 20.22 | 48.71 | 2.63 | 0.38 |
| 8 | 77.06 | 20.99 | 46.66 | 2.74 | 0.36 |
| 16 | 90.63 | 22.15 | 44.07 | 2.91 | 0.34 |
| 32 | 114.21 | 24.72 | 39.30 | 3.26 | 0.31 |
### Key Findings

- **First‑token speed (TTFT):** The quantized model produces its first token in ~44 ms at the lowest concurrency, whereas the full‑precision model needs ~91 ms. At concurrency 32 the figures are 114 ms versus 186 ms, so the quantized model remains about 1.6× faster to first token even under load.
- **Token generation speed (TPS & ITL):** ITL stays around 20–25 ms for the quantized model, while it rises to 42–48 ms at full precision. Consequently, TPS enjoys a near-constant ~2× boost (e.g., 23.8 → 50.4 at low concurrency, 20.4 → 39.3 at high concurrency); see the quick ratio check after this list.
- **Total latency:** Generating a 128‑token response takes 2.5–3.3 s with the quantized model versus 5.4–6.3 s at full precision, a nearly two-fold reduction that is highly significant for user experience.
- **Throughput (RPS):** The quantized version processes roughly twice as many concurrent requests on the same hardware. This translates into major cost-efficiency and scalability benefits for production deployments.
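The headline ratios follow directly from the tables; as a quick sanity check (values copied from the concurrency 2 and 32 rows above):

```bash
# Speedup ratios computed from the benchmark tables.
# Higher-is-better metric (TPS): NVFP4 / full precision.
# Lower-is-better metric (TTFT): full precision / NVFP4.
awk 'BEGIN {
  printf "TPS speedup  @2:  %.2fx\n", 50.42 / 23.81;   # ~2.12x
  printf "TPS speedup  @32: %.2fx\n", 39.30 / 20.39;   # ~1.93x
  printf "TTFT speedup @2:  %.2fx\n", 90.73 / 43.89;   # ~2.07x
  printf "TTFT speedup @32: %.2fx\n", 186.25 / 114.21; # ~1.63x
}'
```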
### Conclusion

NVFP4 quantization delivers substantial speed and efficiency improvements for the Qwen3‑4B‑Instruct‑2507 model. This benchmark measures performance only; output quality was not evaluated here and should be assessed separately. On the performance evidence, the quantized model is clearly the preferred choice for real-time applications, chatbots, and high-traffic inference services.
These results further demonstrate how NVIDIA’s low‑precision quantization technologies (FP4/NVFP4) can be highly effective for small‑to‑medium sized models, providing a compelling option for deploying performant, cost‑effective AI services.
Base model: [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507)