# Qwen3-4B-Instruct-2507 (NVFP4 Quantized)

*A High-Performance 4B Instruct Model with 256K Context & NVFP4 Quantization*
## 🔍 Overview

Qwen/Qwen3-4B-Instruct-2507 is an optimized, instruction-tuned version of the Qwen3-4B language model, quantized to NVFP4 precision using NVIDIA Model Optimizer. This version delivers state-of-the-art instruction-following and long-context performance in a highly efficient, NVIDIA-optimized NVFP4 format.
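For reference, checkpoints like this are typically produced with TensorRT-LLM's ModelOpt-based post-training quantization example script. The sketch below is illustrative only; the script path and the `nvfp4` qformat value are assumptions that vary across TensorRT-LLM releases, so check the quantization README shipped with your release.

```bash
# Hedged sketch: post-training NVFP4 quantization via TensorRT-LLM's
# quantize.py example, which wraps NVIDIA Model Optimizer (script path
# and qformat value are assumptions; verify against your release).
python examples/quantization/quantize.py \
  --model_dir Qwen/Qwen3-4B-Instruct-2507 \
  --qformat nvfp4 \
  --output_dir ./Qwen3-4B-Instruct-2507-NVFP4
```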
## 🚀 Serving with TensorRT-LLM

```bash
docker run \
  --gpus all \
  --rm \
  --ipc=host \
  --ulimit memlock=-1:-1 \
  --ulimit stack=67108864 \
  --shm-size=64G \
  -p 8050:8000 \
  -v $HOME/.cache/huggingface:/root/.cache/huggingface \
  -w /app/tensorrt_llm \
  nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc5 \
  trtllm-serve \
    OPENZEKA/Qwen3-4B-Instruct-2507-NVFP4 \
    --host 0.0.0.0 \
    --port 8000 \
    --backend pytorch \
    --max_batch_size 128 \
    --tp_size 1 \
    --kv_cache_free_gpu_memory_fraction 0.2
```
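Once the container is up, the server speaks an OpenAI-compatible API on the mapped host port (8050 above). A minimal smoke test follows; the served model name is an assumption, so confirm it with a GET to /v1/models first.

```bash
# Minimal smoke test against the OpenAI-compatible chat endpoint exposed
# by trtllm-serve (model name is an assumption; check GET /v1/models).
curl -s http://localhost:8050/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "OPENZEKA/Qwen3-4B-Instruct-2507-NVFP4",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64
  }'
```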
- Quantized by: OPENZEKA
## Qwen3‑4B‑Instruct‑2507: Full‑Precision vs. NVFP4 Quantized Performance Comparison
The Qwen3‑4B‑Instruct‑2507 model is a compact yet powerful language model with 4 billion parameters. In this benchmark, the model’s full‑precision version is compared with NVIDIA’s NVFP4‑quantized version. Tests were performed on an NVIDIA DGX Spark system using the vLLM inference engine.
### Test Configuration

| Parameter | Value |
|---|---|
| Prompt length | ~128 tokens (64 distinct prompts) |
| Maximum response length | 128 tokens |
| Concurrency levels | 2, 4, 8, 16, 32 |
| Measured metrics | TTFT (Time to First Token, ms); ITL (Inter‑Token Latency, ms); TPS (Tokens Per Second); Latency (total response time, s); Throughput (RPS, requests per second) |
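As an illustration of the first metric (this is not the benchmark harness used for the numbers below), client-side TTFT can be approximated by timing a streaming request until the first SSE chunk arrives. A minimal sketch against the endpoint above, assuming GNU date for millisecond timestamps:

```bash
# Rough client-side TTFT estimate: milliseconds from sending a streaming
# request until the first "data:" SSE chunk (requires GNU date's %N).
start=$(date +%s%3N)
curl -sN http://localhost:8050/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "OPENZEKA/Qwen3-4B-Instruct-2507-NVFP4",
       "messages": [{"role": "user", "content": "Hello"}],
       "max_tokens": 128, "stream": true}' |
while IFS= read -r line; do
  [[ "$line" == data:* ]] || continue
  echo "TTFT: $(( $(date +%s%3N) - start )) ms"
  break
done
```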
### General Observations

NVFP4 quantization dramatically reduces model size and memory footprint while delivering striking performance gains. The quantized version:
- Reduces TTFT by roughly 35–50%, depending on concurrency.
- Roughly doubles TPS.
- Cuts total latency by nearly half.
- Increases throughput (RPS) by about 2×.

These gains are most pronounced at low and medium concurrency, but even at the highest level tested (32) the quantized model maintains a clear advantage.
### Detailed Comparison Tables

#### Full Precision
| Concurrency | TTFT Mean (ms) | ITL Mean (ms) | TPS Mean | Latency Mean (s) | RPS |
|---|---|---|---|---|---|
| 2 | 90.73 | 41.61 | 23.81 | 5.37 | 0.19 |
| 4 | 92.13 | 42.19 | 23.48 | 5.45 | 0.18 |
| 8 | 138.83 | 43.13 | 22.79 | 5.62 | 0.18 |
| 16 | 169.09 | 44.34 | 22.06 | 5.80 | 0.17 |
| 32 | 186.25 | 47.94 | 20.39 | 6.28 | 0.16 |
#### NVFP4 Quantized
| Concurrency | TTFT Mean (ms) | ITL Mean (ms) | TPS Mean | Latency Mean (s) | RPS |
|---|---|---|---|---|---|
| 2 | 43.89 | 19.64 | 50.42 | 2.54 | 0.39 |
| 4 | 59.53 | 20.22 | 48.71 | 2.63 | 0.38 |
| 8 | 77.06 | 20.99 | 46.66 | 2.74 | 0.36 |
| 16 | 90.63 | 22.15 | 44.07 | 2.91 | 0.34 |
| 32 | 114.21 | 24.72 | 39.30 | 3.26 | 0.31 |
### Key Findings

- **First‑token speed (TTFT):** The quantized model produces its first token in ~44 ms at the lowest concurrency, whereas the full‑precision model needs ~91 ms. At concurrency 32 the figures are 114 ms versus 186 ms, so the quantized model remains about 1.6× faster to first token even under load.
- **Token generation speed (TPS & ITL):** ITL stays around 20–25 ms for the quantized model, while it rises to 42–48 ms at full precision. Consequently, TPS enjoys a near-constant ~2× boost (e.g., 23.8 → 50.4 at low concurrency, 20.4 → 39.3 at high concurrency); see the quick ratio check after this list.
- **Total latency:** Generating a 128‑token response takes 2.5–3.3 s with the quantized model versus 5.4–6.3 s at full precision, a nearly two-fold reduction that is highly significant for user experience.
- **Throughput (RPS):** The quantized version processes roughly twice as many concurrent requests on the same hardware. This translates into major cost-efficiency and scalability benefits for production deployments.
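The headline ratios follow directly from the tables; as a quick sanity check (values copied from the concurrency 2 and 32 rows above):

```bash
# Speedup ratios computed from the benchmark tables.
# Higher-is-better metric (TPS): NVFP4 / full precision.
# Lower-is-better metric (TTFT): full precision / NVFP4.
awk 'BEGIN {
  printf "TPS speedup  @2:  %.2fx\n", 50.42 / 23.81;   # ~2.12x
  printf "TPS speedup  @32: %.2fx\n", 39.30 / 20.39;   # ~1.93x
  printf "TTFT speedup @2:  %.2fx\n", 90.73 / 43.89;   # ~2.07x
  printf "TTFT speedup @32: %.2fx\n", 186.25 / 114.21; # ~1.63x
}'
```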
### Conclusion

NVFP4 quantization delivers substantial speed and efficiency improvements for the Qwen3‑4B‑Instruct‑2507 model. This benchmark measures performance only; output quality was not evaluated here and should be assessed separately. On the performance evidence, the quantized model is clearly the preferred choice for real-time applications, chatbots, and high-traffic inference services.
These results further demonstrate how NVIDIA’s low‑precision quantization technologies (FP4/NVFP4) can be highly effective for small‑to‑medium sized models, providing a compelling option for deploying performant, cost‑effective AI services.
Base model: [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507)