Qolda-AVL

Qolda-AVL is a 5B audio-vision-language model designed to operate in Kazakh, Russian, and English. The model extends Qwen3-VL with an audio branch built on a fine-tuned Whisper encoder and a dedicated audio projection module. All three modalities are adapted to Kazakh through a staged training pipeline, with the audio branch covering speech recognition, speech translation, audio classification, and environmental sound captioning.

To improve audio feature injection into the language backbone, we apply the DeepStack mechanism to the audio branch, mirroring the vision processing pipeline of Qwen3-VL. 💜

Qolda-AVL architecture
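
As an illustration only (hypothetical module and dimension names; the released code may differ), DeepStack-style injection can be sketched as projecting encoder features taken at several depths and adding them to the hidden states of the first few language-model layers at the audio-token positions:

import torch
import torch.nn as nn

class DeepStackAudioInjector(nn.Module):
    """Illustrative sketch of DeepStack-style injection for the audio branch.
    Hypothetical names and shapes; not the released implementation."""

    def __init__(self, audio_dim: int, llm_dim: int, num_levels: int = 3):
        super().__init__()
        # One projection per injected level (one level per targeted LLM layer).
        self.projections = nn.ModuleList(
            nn.Linear(audio_dim, llm_dim) for _ in range(num_levels)
        )

    def inject(self, hidden_states, level, audio_features, audio_mask):
        # hidden_states: (batch, seq, llm_dim) output of one early LLM layer
        # audio_features: (num_audio_tokens, audio_dim) encoder features for this level
        # audio_mask: (batch, seq) bool mask marking the audio-token positions
        hidden_states = hidden_states.clone()
        hidden_states[audio_mask] = hidden_states[audio_mask] + self.projections[level](audio_features)
        return hidden_states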

The model is a step towards omni-modal systems for the Kazakh language.

The name "Qolda" reflects both its design and purpose in Kazakh: "in hand" (қолда) for its compact accessibility, and "to support" (қолдау) for its assistive nature.

Evaluation Results

The model is evaluated in three languages on benchmarks covering language, vision, and audio.

The evaluation results will be added soon...

Model Usage

1. Transformers inference

To run inference with transformers, first complete the preliminary setup:

uv venv venv
source venv/bin/activate
uv pip install torch accelerate transformers
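
Optionally, you can verify the environment before loading the model (a quick sanity check, not required):

# Confirm the installed versions and that a GPU is visible.
import torch
import transformers

print("transformers:", transformers.__version__)
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())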

Then initialize the model and processor:

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model = AutoModelForCausalLM.from_pretrained(
    "issai/Qolda-AVL-5B",
    trust_remote_code=True,
    dtype=torch.bfloat16,
    device_map="auto"
)

processor = AutoProcessor.from_pretrained("issai/Qolda-AVL-5B", trust_remote_code=True)

Depending on the required modalities, define the messages list:

Language:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "y = (lnx)^2 функциясының туындысын тап. JSON форматында жауап бер: {'answer': '...'}"},
        ],
    }
]

Vision-Language:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "assets/sample_image.jpg"}, # or provide link to the image
            {"type": "text", "text": "Суретті егжей-тегжейлі сипаттап бер. Неше жылқы көріп тұрсың және олардың түстері қандай?"},
        ],
    }
]

Audio-Language:

Note: The model was not trained to answer questions posed directly in the audio. Provide a detailed text instruction alongside the audio describing the task you want performed on it.

prompt = """Математикалық есепті шеш.
Respond ONLY with this JSON format: {"explanation": "<your step-by-step reasoning>", "answer": <integer or float number>}
The answer must be a number (integer or float). No text, no units, just the number.
"""

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "assets/sample_audio.wav"}, # or provide link to the audio
            {"type": "text", "text": prompt}
        ],
    }
]

Audio-Vision-Language:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "assets/question_audio.wav"},
            {"type": "image", "image": "assets/sample_image.jpg"},
            {"type": "text", "text": "Answer the question"},
        ],
    }
]

Finally, pass the messages to the model for inference:

inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

generated_ids = model.generate(
    **inputs,
    max_new_tokens=4096,
    temperature=0.7,
    top_p=0.95,
    top_k=20,
    do_sample=True,
    repetition_penalty=1.0,
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True)

print(output_text[0])
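
For convenience, the same template/generate/decode steps can be wrapped in a small helper and reused with any of the messages lists above (run_inference is our own name for this sketch, not part of the released code):

def run_inference(messages, max_new_tokens=4096):
    # Apply the chat template and move the tensors to the model's device.
    inputs = processor.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True,
        return_dict=True, return_tensors="pt",
    ).to(model.device)

    # Sample a response with the generation settings shown above.
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        temperature=0.7,
        top_p=0.95,
        top_k=20,
        do_sample=True,
    )

    # Keep only the newly generated tokens and decode them.
    trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]

print(run_inference(messages))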

2. vLLM inference

Alternatively, you can run the model via a vLLM server. Note that this requires our custom vLLM fork. First, complete the preliminary setup:

uv venv venv
source venv/bin/activate

# Install this fork (precompiled binaries)
git clone https://github.com/IS2AI/vLLM-Qolda-AVL.git
cd vLLM-Qolda-AVL
VLLM_USE_PRECOMPILED=1 uv pip install -e .

Then start the OpenAI-compatible server (adjust parameters to your settings):

vllm serve issai/Qolda-AVL-5B \
    --served-model-name qolda-avl \
    --trust-remote-code \
    --tensor-parallel-size 4 \
    --dtype bfloat16 \
    --max-model-len 16384 \
    --limit-mm-per-prompt '{"audio": 1, "image": 1}'
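
Once the server is up, you can confirm it is reachable and see the served model name (assuming the default port 8000):

from openai import OpenAI

# List the models exposed by the running vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
for m in client.models.list().data:
    print(m.id)  # should print the name passed via --served-model-name, e.g. "qolda-avl"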

To run inference, you can use the following code:

import base64
from pathlib import Path

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1", 
    api_key="EMPTY"
)

def encode_audio_base64(path: str | Path) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def encode_image_base64(path: str | Path) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

audio_path = "assets/sample_audio.wav"
audio_b64 = encode_audio_base64(audio_path)

stream = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": audio_b64,
                        "format": "wav",
                    },
                },
                {
                    "type": "text",
                    "text": (
                        "Analyze the voice in the audio and identify the speaker's "
                        "gender (male or female). Also transcribe what is said. "
                        "Return your answer as JSON in the following format: "
                        '{"answer": "<male or female>",'
                        '"transcription": "<transcription>"}'
                    ),
                },
            ],
        }
    ],
    max_tokens=4096,
    temperature=0.7,
    top_p=0.8,
    stream=True,
    stream_options={"include_usage": True},
)

text = ""
usage = None
for chunk in stream:
    if chunk.usage:
        usage = chunk.usage
    if chunk.choices and chunk.choices[0].delta.content:
        token = chunk.choices[0].delta.content
        print(token, end="", flush=True)
        text += token
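
After the stream finishes, text holds the full response and usage (if reported by the server) holds the token counts, for example:

print()  # newline after the streamed output
if usage is not None:
    # Token accounting reported by the server for this request.
    print(f"prompt tokens: {usage.prompt_tokens}, completion tokens: {usage.completion_tokens}")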

License

Apache License 2.0

Citation

Paper coming soon...