Available Providers

QueryMT ships with 15 providers split across two types: WASM (API-based, cloud services) and Native (local inference, runs models on your hardware). This page covers how to configure each type, how to pick the right build variant for your hardware, and provides copy-pasteable configuration recipes.

Provider Repository (`repo.query.mt`)

QueryMT maintains an always-up-to-date provider repository at https://repo.query.mt.

latest.json — updated on every push to main (default)
stable.json — updated on every tagged release

If no providers config exists at ~/.qmt/providers.toml (or .json / .yaml), QueryMT automatically fetches latest.json and caches it to ~/.qmt/providers.json on first run.

# Refresh to the latest provider list
qmt update

Override the source URL:

QMT_PROVIDERS_URL=https://repo.query.mt/stable.json qmt update

Note

The provider repository currently covers WASM providers only. Native providers must be configured manually in your providers config.

WASM Providers (API-based)

WASM providers call remote APIs. They are sandboxed WebAssembly modules that run on any platform and require no local hardware beyond an internet connection. All 12 WASM providers are listed in the provider repository, so you don't need to configure them manually unless you want to pin a version or customise settings.

Name	Description	API Key Env Var
`anthropic`	Anthropic Claude models	`ANTHROPIC_API_KEY`
`openai`	OpenAI GPT / o-series models	`OPENAI_API_KEY`
`codex`	OpenAI Codex / coding-optimised models	`OPENAI_API_KEY`
`google`	Google Gemini models	`GEMINI_API_KEY`
`mistral`	Mistral AI models	`MISTRAL_API_KEY`
`groq`	Groq-hosted models (OpenAI-compatible)	`GROQ_API_KEY`
`ollama`	Ollama local server (OpenAI-compatible)	—
`openrouter`	OpenRouter multi-provider gateway	`OPENROUTER_API_KEY`
`alibaba`	Alibaba Cloud Qwen models (OpenAI-compatible)	`DASHSCOPE_API_KEY`
`moonshot`	Moonshot AI Kimi models (OpenAI-compatible)	`MOONSHOT_API_KEY`
`kimi-code`	Kimi Code specialised coding model	`MOONSHOT_API_KEY`
`xai`	xAI Grok models (OpenAI-compatible)	`XAI_API_KEY`

Minimal WASM configuration

You only need to specify a provider if you want to override defaults or pin a version. Most users can skip this entirely.

anthropicopenaigooglemistralgroqollamaopenrouteralibabamoonshotkimi-codexaicodex

[[providers]]
name = "anthropic"
path = "oci://ghcr.io/querymt/anthropic:latest"

[[providers]]
name = "openai"
path = "oci://ghcr.io/querymt/openai:latest"

[[providers]]
name = "google"
path = "oci://ghcr.io/querymt/google:latest"

[[providers]]
name = "mistral"
path = "oci://ghcr.io/querymt/mistral:latest"

[[providers]]
name = "groq"
path = "oci://ghcr.io/querymt/groq:latest"

[[providers]]
name = "ollama"
path = "oci://ghcr.io/querymt/ollama:latest"

[[providers]]
name = "openrouter"
path = "oci://ghcr.io/querymt/openrouter:latest"

[[providers]]
name = "alibaba"
path = "oci://ghcr.io/querymt/alibaba:latest"

[[providers]]
name = "moonshot"
path = "oci://ghcr.io/querymt/moonshot:latest"

[[providers]]
name = "kimi-code"
path = "oci://ghcr.io/querymt/kimi-code:latest"

[[providers]]
name = "xai"
path = "oci://ghcr.io/querymt/xai:latest"

[[providers]]
name = "codex"
path = "oci://ghcr.io/querymt/codex:latest"

Native Providers (local inference)

Native providers run LLMs locally on your machine. They are distributed as platform-specific shared libraries (.so / .dylib / .dll) packaged into OCI images. Unlike WASM providers, QueryMT resolves native providers at download time by inspecting the OCI image index for a manifest that matches your OS and architecture. If a match is found, the native library is downloaded; if not, QueryMT falls back to the WASM variant if one exists.

There are three native providers:

Name	Underlying engine	Description
`llama-cpp`	llama.cpp	GGUF models, vision support, broad hardware compatibility
`izwi`	izwi-core	Efficient local inference with Flash Attention
`mrs`	mistral.rs	High-performance inference for Mistral and compatible architectures

Native providers must be configured manually

Native providers are not included in the repo.query.mt provider repository. You must add them to your ~/.qmt/providers.toml explicitly.

Feature tags

Each native provider is published with multiple OCI tags corresponding to different hardware backends. The tag you specify in path controls which binary is downloaded.

Tag suffix	Hardware	Availability
`latest` / `latest-default`	CPU only	Linux (x86_64, arm64), macOS (x86_64, arm64), Windows (x86_64)
`latest-metal`	macOS Metal GPU	macOS (x86_64, arm64)
`latest-accelerate`	macOS Accelerate framework	macOS (x86_64, arm64) — izwi, mrs only
`latest-vulkan`	Vulkan GPU	Linux (x86_64, arm64), Windows (x86_64) — llama-cpp only
`latest-cuda12.8`	NVIDIA CUDA 12.8	Linux (x86_64, arm64), Windows (x86_64) — llama-cpp only
`latest-cuda12.8-sm80`	NVIDIA CUDA SM 80	Linux x86_64 — izwi, mrs only
`latest-cuda12.8-sm86`	NVIDIA CUDA SM 86	Linux x86_64 — izwi, mrs only
`latest-cuda12.8-sm87`	NVIDIA CUDA SM 87 (Jetson Orin)	Linux arm64 — izwi, mrs only
`latest-cuda12.8-sm89`	NVIDIA CUDA SM 89	Linux x86_64 — izwi, mrs only
`latest-cuda12.8-sm120`	NVIDIA CUDA SM 120	Linux x86_64 — izwi, mrs only

Version-pinned variants follow the same pattern with the crate version prepended: 0.1.0-metal, 0.1.0-cuda12.8-sm89, etc.

llama-cpp CUDA vs. izwi/mrs CUDA

llama-cpp uses a single combined cuda12.8 build that covers all NVIDIA GPUs via JIT compilation at runtime — no SM selection needed. izwi and mrs compile Flash Attention kernels ahead of time, which requires a specific SM architecture to be selected at build time.

NVIDIA GPU to SM architecture

Use this table to find the correct cuda12.8-smXX suffix for your GPU.

GPU	Architecture	SM	Tag suffix
RTX 3050 / 3060 / 3070 / 3080 / 3090	Ampere	86	`cuda12.8-sm86`
RTX 3050 Ti (laptop)	Ampere	86	`cuda12.8-sm86`
A10 / A40	Ampere	86	`cuda12.8-sm86`
A30 / A100	Ampere	80	`cuda12.8-sm80`
RTX 4060 / 4070 / 4070 Ti / 4080 / 4090	Ada Lovelace	89	`cuda12.8-sm89`
RTX 4060 Ti / 4070 Super / 4080 Super	Ada Lovelace	89	`cuda12.8-sm89`
L4 / L40 / L40S	Ada Lovelace	89	`cuda12.8-sm89`
RTX 5070 / 5070 Ti / 5080 / 5090	Blackwell	120	`cuda12.8-sm120`
B100 / B200 / GB200	Blackwell	120	`cuda12.8-sm120`
Jetson AGX Orin / Orin NX / Orin Nano	Ampere (aarch64)	87	`cuda12.8-sm87`

Not sure which SM your GPU is?

Run nvidia-smi --query-gpu=compute_cap --format=csv,noheader to print your GPU's compute capability (e.g. 8.9 = SM 89).

Choosing a feature tag

What OS are you on?
├── macOS Apple Silicon (M1/M2/M3/M4)  → latest-metal
├── macOS Intel
│   ├── izwi / mrs                     → latest-accelerate
│   └── llama-cpp                      → latest-default
├── Linux
│   ├── NVIDIA GPU
│   │   ├── llama-cpp                  → latest-cuda12.8
│   │   └── izwi / mrs                 → latest-cuda12.8-sm{XX}  (see table above)
│   ├── AMD / Intel GPU                → latest-vulkan  (llama-cpp only)
│   └── CPU only                       → latest  (or latest-default)
└── Windows
    ├── NVIDIA GPU                     → latest-cuda12.8  (llama-cpp only)
    └── CPU only                       → latest

Provider reference

llama-cpp

Wraps llama.cpp. Supports GGUF models, vision/multimodal models, and streaming. Broadest hardware support of the three native providers.

Supported platforms:

OS	Architecture	CPU	Metal	Accelerate	Vulkan	CUDA 12.8
Linux	x86_64	✓	—	—	✓	✓
Linux	arm64	✓	—	—	✓	✓
macOS	x86_64	—	✓	—	—	—
macOS	arm64	—	✓	—	—	—
Windows	x86_64	✓	—	—	—	✓

Configuration:

macOS (Metal)Linux / Windows (CPU)Linux (CUDA)Linux (Vulkan)Windows (CUDA)

[[providers]]
name = "llama-cpp"
path = "oci://ghcr.io/querymt/llama-cpp:latest-metal"

[providers.config]
model = "/path/to/model.gguf"
n_ctx = 4096
n_gpu_layers = 99

[[providers]]
name = "llama-cpp"
path = "oci://ghcr.io/querymt/llama-cpp:latest"

[providers.config]
model = "/path/to/model.gguf"
n_ctx = 4096
n_gpu_layers = 0

[[providers]]
name = "llama-cpp"
path = "oci://ghcr.io/querymt/llama-cpp:latest-cuda12.8"

[providers.config]
model = "/path/to/model.gguf"
n_ctx = 4096
n_gpu_layers = 99
flash_attention = "enabled"

[[providers]]
name = "llama-cpp"
path = "oci://ghcr.io/querymt/llama-cpp:latest-vulkan"

[providers.config]
model = "/path/to/model.gguf"
n_ctx = 4096
n_gpu_layers = 99

[[providers]]
name = "llama-cpp"
path = "oci://ghcr.io/querymt/llama-cpp:latest-cuda12.8"

[providers.config]
model = "/path/to/model.gguf"
n_ctx = 4096
n_gpu_layers = 99
flash_attention = "enabled"

Key config options:

Option	Type	Description
`model`	string	Path to GGUF model file, or `owner/repo:filename` HuggingFace reference
`n_ctx`	integer	Context window size (default: model native size)
`n_gpu_layers`	integer	Layers to offload to GPU. `0` = CPU only, `99` = all layers
`flash_attention`	string	`"auto"` \| `"enabled"` \| `"disabled"`
`kv_cache_type_k`	string	KV cache key quantization: `"f16"`, `"q8_0"`, `"q4_0"`
`kv_cache_type_v`	string	KV cache value quantization: `"f16"`, `"q8_0"`, `"q4_0"`
`max_tokens`	integer	Maximum tokens to generate (default: 256)
`temperature`	float	Sampling temperature. `0` = greedy
`mmproj_path`	string	Path to multimodal projection file (vision models only)

For vision model configuration and the full option reference see the llama-cpp provider README.

izwi

Wraps izwi-core. Efficient local inference with Flash Attention support. CUDA builds are compiled per SM architecture for maximum performance.

Supported platforms:

OS	Architecture	CPU	Metal	Accelerate	CUDA 12.8
Linux	x86_64	✓	—	—	SM 80, 86, 89, 120 (+ Flash Attn)
Linux	arm64	✓	—	—	SM 87 (+ Flash Attn)
macOS	x86_64	—	✓	✓	—
macOS	arm64	—	✓	✓	—
Windows	x86_64	✓	—	—	—

Configuration:

macOS (Metal)macOS (Accelerate)Linux / Windows (CPU)Linux (CUDA — RTX 30xx / A10 / A40)Linux (CUDA — RTX 40xx / L4 / L40)Linux (CUDA — A100)Linux (CUDA — RTX 50xx / Blackwell)Linux arm64 (CUDA — Jetson Orin)

[[providers]]
name = "izwi"
path = "oci://ghcr.io/querymt/izwi:latest-metal"

[providers.config]
model = "/path/to/model"

[[providers]]
name = "izwi"
path = "oci://ghcr.io/querymt/izwi:latest-accelerate"

[providers.config]
model = "/path/to/model"

[[providers]]
name = "izwi"
path = "oci://ghcr.io/querymt/izwi:latest"

[providers.config]
model = "/path/to/model"

# SM 86: RTX 3060, 3070, 3080, 3090, A10, A40
[[providers]]
name = "izwi"
path = "oci://ghcr.io/querymt/izwi:latest-cuda12.8-sm86"

[providers.config]
model = "/path/to/model"

# SM 89: RTX 4060, 4070, 4080, 4090, L4, L40, L40S
[[providers]]
name = "izwi"
path = "oci://ghcr.io/querymt/izwi:latest-cuda12.8-sm89"

[providers.config]
model = "/path/to/model"

# SM 80: A30, A100
[[providers]]
name = "izwi"
path = "oci://ghcr.io/querymt/izwi:latest-cuda12.8-sm80"

[providers.config]
model = "/path/to/model"

# SM 120: RTX 5070, 5080, 5090, B100, B200
[[providers]]
name = "izwi"
path = "oci://ghcr.io/querymt/izwi:latest-cuda12.8-sm120"

[providers.config]
model = "/path/to/model"

# SM 87: Jetson AGX Orin, Orin NX, Orin Nano
[[providers]]
name = "izwi"
path = "oci://ghcr.io/querymt/izwi:latest-cuda12.8-sm87"

[providers.config]
model = "/path/to/model"

mrs

Wraps mistral.rs. High-performance inference for Mistral and compatible architectures. CUDA x86_64 builds include both Flash Attention and cuDNN.

Supported platforms:

OS	Architecture	CPU	Metal	Accelerate	CUDA 12.8
Linux	x86_64	✓	—	—	SM 80, 86, 89, 120 (+ Flash Attn + cuDNN)
Linux	arm64	✓	—	—	SM 87 (+ cuDNN, no Flash Attn)
macOS	x86_64	—	✓	✓	—
macOS	arm64	—	✓	✓	—
Windows	x86_64	✓	—	—	—
Windows	arm64	✓	—	—	—

Configuration:

macOS (Metal)macOS (Accelerate)Linux / Windows (CPU)Linux (CUDA — RTX 30xx / A10 / A40)Linux (CUDA — RTX 40xx / L4 / L40)Linux (CUDA — A100)Linux (CUDA — RTX 50xx / Blackwell)Linux arm64 (CUDA — Jetson Orin)

[[providers]]
name = "mrs"
path = "oci://ghcr.io/querymt/mrs:latest-metal"

[providers.config]
model = "/path/to/model"

[[providers]]
name = "mrs"
path = "oci://ghcr.io/querymt/mrs:latest-accelerate"

[providers.config]
model = "/path/to/model"

[[providers]]
name = "mrs"
path = "oci://ghcr.io/querymt/mrs:latest"

[providers.config]
model = "/path/to/model"

# SM 86: RTX 3060, 3070, 3080, 3090, A10, A40
[[providers]]
name = "mrs"
path = "oci://ghcr.io/querymt/mrs:latest-cuda12.8-sm86"

[providers.config]
model = "/path/to/model"

# SM 89: RTX 4060, 4070, 4080, 4090, L4, L40, L40S
[[providers]]
name = "mrs"
path = "oci://ghcr.io/querymt/mrs:latest-cuda12.8-sm89"

[providers.config]
model = "/path/to/model"

# SM 80: A30, A100
[[providers]]
name = "mrs"
path = "oci://ghcr.io/querymt/mrs:latest-cuda12.8-sm80"

[providers.config]
model = "/path/to/model"

# SM 120: RTX 5070, 5080, 5090, B100, B200
[[providers]]
name = "mrs"
path = "oci://ghcr.io/querymt/mrs:latest-cuda12.8-sm120"

[providers.config]
model = "/path/to/model"

# SM 87: Jetson AGX Orin, Orin NX, Orin Nano
# Note: cuDNN included; Flash Attention not available on aarch64
[[providers]]
name = "mrs"
path = "oci://ghcr.io/querymt/mrs:latest-cuda12.8-sm87"

[providers.config]
model = "/path/to/model"

Complete configuration recipes

Drop one of these into ~/.qmt/providers.toml and adjust paths as needed.

Cloud APIs only

No config needed — repo.query.mt handles this automatically on first run. If you want explicit control:

[[providers]]
name = "openai"
path = "oci://ghcr.io/querymt/openai:latest"

[[providers]]
name = "anthropic"
path = "oci://ghcr.io/querymt/anthropic:latest"

[[providers]]
name = "google"
path = "oci://ghcr.io/querymt/google:latest"

Cloud + local llama-cpp on macOS Apple Silicon

[[providers]]
name = "openai"
path = "oci://ghcr.io/querymt/openai:latest"

[[providers]]
name = "anthropic"
path = "oci://ghcr.io/querymt/anthropic:latest"

[[providers]]
name = "llama-cpp"
path = "oci://ghcr.io/querymt/llama-cpp:latest-metal"

[providers.config]
model = "/path/to/model.gguf"
n_ctx = 8192
n_gpu_layers = 99
flash_attention = "auto"

Fully local on Linux with NVIDIA RTX 40-series

# llama-cpp: single CUDA build, all SM supported at runtime
[[providers]]
name = "llama-cpp"
path = "oci://ghcr.io/querymt/llama-cpp:latest-cuda12.8"

[providers.config]
model = "/path/to/model.gguf"
n_ctx = 8192
n_gpu_layers = 99
flash_attention = "enabled"

# izwi: SM 89 for RTX 4060/4070/4080/4090
[[providers]]
name = "izwi"
path = "oci://ghcr.io/querymt/izwi:latest-cuda12.8-sm89"

[providers.config]
model = "/path/to/model"

# mrs: SM 89 for RTX 4060/4070/4080/4090
[[providers]]
name = "mrs"
path = "oci://ghcr.io/querymt/mrs:latest-cuda12.8-sm89"

[providers.config]
model = "/path/to/model"

Fully local on macOS Apple Silicon

[[providers]]
name = "llama-cpp"
path = "oci://ghcr.io/querymt/llama-cpp:latest-metal"

[providers.config]
model = "/path/to/model.gguf"
n_ctx = 8192
n_gpu_layers = 99

[[providers]]
name = "izwi"
path = "oci://ghcr.io/querymt/izwi:latest-metal"

[providers.config]
model = "/path/to/model"

[[providers]]
name = "mrs"
path = "oci://ghcr.io/querymt/mrs:latest-metal"

[providers.config]
model = "/path/to/model"

Fully local on Linux (CPU only)

[[providers]]
name = "llama-cpp"
path = "oci://ghcr.io/querymt/llama-cpp:latest"

[providers.config]
model = "/path/to/model.gguf"
n_ctx = 4096
n_gpu_layers = 0

[[providers]]
name = "mrs"
path = "oci://ghcr.io/querymt/mrs:latest"

[providers.config]
model = "/path/to/model"

Available Providers

Provider Repository (repo.query.mt)

WASM Providers (API-based)

Minimal WASM configuration

Native Providers (local inference)

Feature tags

NVIDIA GPU to SM architecture

Choosing a feature tag

Provider reference

llama-cpp

izwi

mrs

Complete configuration recipes

Cloud APIs only

Cloud + local llama-cpp on macOS Apple Silicon

Fully local on Linux with NVIDIA RTX 40-series

Fully local on macOS Apple Silicon

Fully local on Linux (CPU only)

Further reading

Provider Repository (`repo.query.mt`)