Supported models

Table of contents

  1. Gemma 4 (Google)
    1. Server command
    2. shunt config
    3. Sampling profile
  2. Qwen (Alibaba)
    1. Server command
    2. shunt config
    3. Sampling profile
  3. Key server flags
  4. Adding a custom model

shunt auto-detects the loaded model from the server’s /v1/models response and applies the appropriate decoding strategy, sampling parameters, and thinking configuration for that model family. No manual tuning is required.


Gemma 4 (Google)

Recommended default. The 12B variant runs on ~8 GB VRAM.

Model Quant VRAM
gemma-4-12b-it UD-Q4_K_XL ~8 GB
gemma-4-26B-A4B-it UD-Q4_K_M ~17 GB

The 26B-A4B is a mixture-of-experts model — 26B total parameters, 4B active per forward pass.

Server command

# 12B
llama-server -hf unsloth/gemma-4-12b-it-GGUF:UD-Q4_K_XL \
  --jinja -ngl 999 -fa on -c 8192

# 26B-A4B
llama-server -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_M \
  --jinja -ngl 999 -fa on -c 8192

shunt config

endpoint = "http://localhost:8080"
model    = "gemma-4-12b"

Sampling profile

Parameter Value
max_tokens 8192
thinking_budget_tokens 2048
temperature 1.0
top_p 0.95
top_k 64

Qwen (Alibaba)

Three practical options covering 6.5 GB to 22 GB VRAM. The 35B-A3B is a mixture-of-experts model with only 3B active parameters per forward pass.

Model Quant VRAM
Qwen3.5-9B UD-Q4_K_XL ~6.5 GB
Qwen3.6-27B Q4_K_S ~16 GB
Qwen3.6-35B-A3B UD-Q4_K_M ~22 GB

Server command

# Qwen3.5-9B
llama-server -hf unsloth/Qwen3.5-9B-GGUF:UD-Q4_K_XL \
  --jinja -ngl 999 -fa on -c 8192

# Qwen3.6-27B
llama-server -hf unsloth/Qwen3.6-27B-GGUF:Q4_K_S \
  --jinja -ngl 999 -fa on -c 8192

# Qwen3.6-35B-A3B (MoE)
llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_M \
  --jinja -ngl 999 -fa on -c 8192

shunt config

endpoint = "http://localhost:8080"
model    = "qwen3.6-27b"   # or qwen3.5-9b

Sampling profile

Parameter Value
max_tokens 32768
temperature 0.7
content_temperature 0.6
top_p 0.8
top_k 20
presence_penalty 1.5

Key server flags

Flag Purpose
-hf <repo:quant> Download and run directly from Hugging Face
--jinja Jinja2 chat template — required for tool-call formatting
-ngl 999 Offload all layers to GPU
-fa Flash attention — significant speedup on CUDA
-c 8192 Context window size
-sm layer Tensor parallel across multiple GPUs

Adding a custom model

Implement ModelMatcher in crates/shunt-infer/src/registry.rs:

pub struct MyModelMatcher;
impl ModelMatcher for MyModelMatcher {
    fn match_model(&self, id: &str) -> Option<ModelProfile> {
        id.contains("my-model").then_some(ModelProfile {
            max_tokens: 8192,
            temperature: 0.7,
            ..Default::default()
        })
    }
}

shunt-infer now uses one OpenAI-compatible tool-calling path for chat servers: strict native function tools over /v1/chat/completions. Model matchers only provide model traits such as token budget and sampling defaults.

Register it in ModelRegistry::with_defaults():

registry.register(MyModelMatcher)