# Qolda-AVL
Qolda-AVL is a 5B audio-vision-language model designed to operate in Kazakh, Russian, and English. The model extends Qwen3-VL with an audio branch built on a fine-tuned Whisper encoder and a dedicated audio projection module. All three modalities are adapted to Kazakh through a staged training pipeline, with the audio branch covering speech recognition, speech translation, audio classification, and environmental sound captioning.
To improve audio feature injection into the language backbone, we apply the DeepStack mechanism to the audio branch, mirroring the vision processing pipeline of Qwen3-VL 💜
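As a rough illustration of the idea, DeepStack-style injection can be sketched as below. This is a toy sketch with made-up names and plain-list "tensors", not the actual Qolda-AVL implementation: features from several encoder levels are added to the hidden states of the first few decoder layers at the token positions reserved for audio, instead of being merged only once at the input embeddings.

```python
# Toy sketch of DeepStack-style feature injection (illustrative only; the
# function name, shapes, and list-based tensors are not the real model code).
def deepstack_inject(layer_hidden, audio_levels, audio_positions):
    """layer_hidden[k][pos] is the hidden vector of layer k at token pos.
    audio_levels[k] holds one feature vector per audio position, taken from
    encoder level k. Features are added element-wise, layer by layer."""
    for k, level_feats in enumerate(audio_levels):
        for pos, feat in zip(audio_positions, level_feats):
            layer_hidden[k][pos] = [
                h + f for h, f in zip(layer_hidden[k][pos], feat)
            ]
    return layer_hidden
```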
The model is our step towards omni-modal systems for the Kazakh language.
The name "Qolda" reflects both its design and purpose in Kazakh: "in hand" (қолда) for its compact accessibility, and "to support" (қолдау) for its assistive nature.
## Evaluation Results
The model is evaluated in three languages on benchmarks covering language, vision, and audio.
The evaluation results will be added soon...
## Model Usage

### 1. Transformers inference
To run inference with `transformers`, complete the preliminary setup:
```shell
uv venv venv
source venv/bin/activate
uv pip install torch accelerate transformers
```
Then initialize the model and processor:
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model = AutoModelForCausalLM.from_pretrained(
    "issai/Qolda-AVL-5B",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("issai/Qolda-AVL-5B", trust_remote_code=True)
```
Then define the `messages` list according to the modalities you need:
**Language:**

```python
# Prompt (Kazakh): "Find the derivative of y = (ln x)^2.
# Answer in JSON format: {'answer': '...'}"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "y = (lnx)^2 функциясының туындысын тап. JSON форматында жауап бер: {'answer': '...'}"},
        ],
    }
]
```
**Vision-Language:**

```python
# Prompt (Kazakh): "Describe the image in detail.
# How many horses do you see and what colors are they?"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "assets/sample_image.jpg"},  # or provide a link to the image
            {"type": "text", "text": "Суретті егжей-тегжейлі сипаттап бер. Неше жылқы көріп тұрсың және олардың түстері қандай?"},
        ],
    }
]
```
**Audio-Language:**

**Note:** The model was not trained to answer questions posed directly in the audio. Alongside the audio, provide a detailed text instruction describing the task you want performed on it.
```python
# The first prompt line (Kazakh) means: "Solve the math problem."
prompt = """Математикалық есепті шеш.
Respond ONLY with this JSON format: {"explanation": "<your step-by-step reasoning>", "answer": <integer or float number>}
The answer must be a number (integer or float). No text, no units, just the number.
"""

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "assets/sample_audio.wav"},  # or provide a link to the audio
            {"type": "text", "text": prompt},
        ],
    }
]
```
**Audio-Vision-Language:**

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "assets/question_audio.wav"},
            {"type": "image", "image": "assets/sample_image.jpg"},
            {"type": "text", "text": "Answer the question"},
        ],
    }
]
```
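The layouts above differ only in which content entries are present. A small convenience helper like the following (our own sketch, not part of the model's API) can assemble the `messages` list for any combination of modalities:

```python
# Convenience helper (illustrative, not part of the Qolda-AVL API) that
# builds the `messages` list for any combination of audio, image, and text.
def build_messages(text, image=None, audio=None):
    content = []
    if audio is not None:
        content.append({"type": "audio", "audio": audio})
    if image is not None:
        content.append({"type": "image", "image": image})
    content.append({"type": "text", "text": text})
    return [{"role": "user", "content": content}]
```

For example, `build_messages("Answer the question", image="assets/sample_image.jpg", audio="assets/question_audio.wav")` reproduces the audio-vision-language layout above.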
Finally, pass the messages to the model for inference:
```python
inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

generated_ids = model.generate(
    **inputs,
    max_new_tokens=4096,
    temperature=0.7,
    top_p=0.95,
    top_k=20,
    do_sample=True,
    repetition_penalty=1.0,
)

# Strip the prompt tokens so only the newly generated text is decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)
print(output_text[0])
```
### 2. vLLM inference

Alternatively, you can run the model via a vLLM server. Note that the model requires our custom vLLM fork. First, complete the preliminary setup:
```shell
uv venv venv
source venv/bin/activate

# Install this fork (precompiled binaries)
git clone https://github.com/IS2AI/vLLM-Qolda-AVL.git
cd vLLM-Qolda-AVL
VLLM_USE_PRECOMPILED=1 uv pip install -e .
```
Then start the OpenAI-compatible server (adjust the parameters to your hardware):
```shell
vllm serve issai/Qolda-AVL-5B \
  --served-model-name qolda-avl \
  --trust-remote-code \
  --tensor-parallel-size 4 \
  --dtype bfloat16 \
  --max-model-len 16384 \
  --limit-mm-per-prompt '{"audio": 1, "image": 1}'
```
To run inference, you can use the following code:
```python
import base64
from pathlib import Path

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",
)

def encode_base64(path: str | Path) -> str:
    """Read a binary file (audio or image) and return it as base64 text."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

audio_b64 = encode_base64("assets/sample_audio.wav")

stream = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": audio_b64,
                        "format": "wav",
                    },
                },
                {
                    "type": "text",
                    "text": (
                        "Analyze the voice in the audio and identify the speaker's "
                        "gender (male or female). Also transcribe what is said. "
                        "Return your answer as JSON in the following format: "
                        '{"answer": "<male or female>", '
                        '"transcription": "<transcription>"}'
                    ),
                },
            ],
        }
    ],
    max_tokens=4096,
    temperature=0.7,
    top_p=0.8,
    stream=True,
    stream_options={"include_usage": True},
)

text = ""
usage = None
for chunk in stream:
    if chunk.usage:
        usage = chunk.usage
    if chunk.choices and chunk.choices[0].delta.content:
        token = chunk.choices[0].delta.content
        print(token, end="", flush=True)
        text += token
```
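With `stream_options={"include_usage": True}`, the final chunk carries token counts in `usage`. A small helper like the following (our own sketch, not part of the OpenAI client) can turn those counts, together with a wall-clock timing of the loop, into a throughput summary:

```python
# Hypothetical helper for summarizing a finished stream. On OpenAI-compatible
# servers the `usage` object exposes prompt_tokens and completion_tokens;
# `elapsed_s` is the wall-clock time you measured around the streaming loop.
def summarize_usage(prompt_tokens, completion_tokens, elapsed_s):
    total = prompt_tokens + completion_tokens
    return {
        "total_tokens": total,
        "tokens_per_second": completion_tokens / elapsed_s if elapsed_s > 0 else 0.0,
    }
```

After the loop finishes, call it as `summarize_usage(usage.prompt_tokens, usage.completion_tokens, elapsed)`.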
## License
Apache License 2.0
## Citation
Paper coming soon...
Base model: Qwen/Qwen3-VL-4B-Thinking