Local AI text-to-speech with named speakers, voice cloning and voice design, powered by GGML. C++17 port of Qwen3-TTS 12 Hz (Qwen team, Alibaba). 11 languages with Mandarin dialects, 24 kHz mono output, runs on CPU, CUDA, Metal, Vulkan.
- Named speakers from the CustomVoice checkpoints, with per-speaker Mandarin dialect overrides (eric -> sichuan, dylan -> beijing)
- Zero shot voice cloning from a reference clip, x-vector only or in-context with a matching transcript
- Voice design from a free text attribute instruction (gender, age, pitch, style)
- Streaming synthesis : autoregressive frame loop with chunked codec decode over a rolling left context, low latency chunk callback API
- Two stage generation : the Talker LM emits the semantic codebook, a code predictor MTP head emits the 15 acoustic codes per frame, both KV cached
- Seedable Philox PRNG and an HF aligned sampling chain (repetition penalty -> temperature -> top-k -> top-p -> multinomial)
- Q8_0 and Q4_K_M quantisation of the Qwen3 talker backbone (0.6B and 1.7B), the RVQ codec paths kept at F32
- Two CLI tools :
qwen-tts(text -> WAV) andqwen-codec(WAV <-> RVQ codes)
git clone --recurse-submodules https://github.com/ServeurpersoCom/qwentts.cpp.git
cd qwentts.cpp
./buildcuda.sh # NVIDIA GPU
./buildvulkan.sh # AMD/Intel GPU (Vulkan)
./buildcpu.sh # CPU only
./buildall.sh # all backends, runtime DL loading
NVCC_CCBIN=g++-13 ./buildcuda.sh # rolling release distros (Arch w/ GCC 16, etc.)
Pre-converted GGUFs are available on Hugging Face :
https://huggingface.co/Serveurperso/Qwen3-TTS-GGUF
Drop them in models/ and skip to the quick start. To convert from
the original checkpoints :
./checkpoints.sh # hf download Qwen/Qwen3-TTS-12Hz-* -> checkpoints/
./convert.py # F32 GGUFs (one talker per mode/size + tokenizer) -> models/
./quantize.sh # BF16 / Q8_0 / Q4_K_M ; RVQ codebooks and projections stay F32
Two GGUFs load together : a talker
(qwen-talker-{size}-{mode}-{variant}.gguf, LM plus code predictor MTP
head plus optional speaker encoder) and a shared tokenizer
(qwen-tokenizer-12hz-{variant}.gguf, SEANet + ConvNeXt + DAC v2 +
RVQ). Modes are base, customvoice and voicedesign ; sizes are
0.6B and 1.7B (voicedesign is 1.7B only).
Each block is the command run by the matching script in examples/.
Default voice (base.sh) :
./build/qwen-tts \
--model models/qwen-talker-1.7b-base-Q8_0.gguf \
--codec models/qwen-tokenizer-12hz-Q8_0.gguf \
--lang English -o out.wav < prompt.txt
Voice cloning (clone.sh, Base, reference WAV plus its transcript) :
./build/qwen-tts \
--model models/qwen-talker-1.7b-base-Q8_0.gguf \
--codec models/qwen-tokenizer-12hz-Q8_0.gguf \
--ref-wav ref.wav --ref-text ref.txt \
--lang English -o out.wav < prompt.txt
Pre-encoded reference (clone.sh): qwen-codec --talker encodes a reference
WAV into two compact latents in one pass, the .spk speaker embedding and
the .rvq ICL codes, bit-identical to what the --ref-wav path computes
internally. Passing them via --ref-spk / --ref-rvq skips the speaker
encoder and the codec encode on every synthesis:
build/qwen-codec --model models/qwen-tokenizer-12hz-Q8_0.gguf \
--talker models/qwen-talker-1.7b-base-Q8_0.gguf -i ref.wav
build/qwen-tts \
--model models/qwen-talker-1.7b-base-Q8_0.gguf \
--codec models/qwen-tokenizer-12hz-Q8_0.gguf \
--ref-spk ref.spk --ref-rvq ref.rvq --ref-text ref.txt \
--lang English -o out.wav < prompt.txt
Named speaker (customvoice.sh, CustomVoice) :
./build/qwen-tts \
--model models/qwen-talker-1.7b-customvoice-Q8_0.gguf \
--codec models/qwen-tokenizer-12hz-Q8_0.gguf \
--speaker vivian \
--lang English -o out.wav < prompt.txt
Speakers : serena, vivian, uncle_fu, ryan, aiden, ono_anna, sohee, eric (sichuan dialect), dylan (beijing dialect).
Voice design (tts.sh, VoiceDesign, attribute instruction) :
./build/qwen-tts \
--model models/qwen-talker-1.7b-voicedesign-Q8_0.gguf \
--codec models/qwen-tokenizer-12hz-Q8_0.gguf \
--instruct "male, young adult, moderate pitch" \
--lang English -o out.wav < prompt.txt
The CLI tools are thin wrappers over a public ABI. Single-header, single-name-prefix, plain C linkage so that C, C++, Python ctypes, Rust bindgen and Go cgo all consume it the same way.
#include "qwen.h"
struct qt_init_params iparams;
qt_init_default_params(&iparams);
iparams.talker_path = "models/qwen-talker-1.7b-base-Q8_0.gguf";
iparams.codec_path = "models/qwen-tokenizer-12hz-Q8_0.gguf";
struct qt_context * q = qt_init(&iparams);
struct qt_tts_params params;
qt_tts_default_params(¶ms);
params.text = "Hello world.";
params.lang = "English";
struct qt_audio audio = { 0 };
qt_synthesize(q, ¶ms, &audio);
/* audio.samples, audio.n_samples, audio.sample_rate, audio.channels */
qt_audio_free(&audio);
qt_free(q);tests/abi-c.c is built with -std=c99 -Wall -Werror -pedantic on
every build (the test-abi-c target), so any regression that breaks
plain C consumability fails the build, not just an opt-in target.
For a binding-friendly shared library (libqwen.so / .dll / .dylib),
configure with cmake -DQWEN_SHARED=ON .... The shared target exports
only the qt_* symbols ; every internal pipeline_* and backend_*
stays hidden inside the .so. The static libqwen-core.a is the default
build artefact and the one the bundled CLI tools link against.
See docs/ARCHITECTURE.md for the model, the GGUF layout, the inference pipeline, every CLI flag, the public API reference and the validation results.
MIT. See LICENSE.
Upstream model : Qwen3-TTS by Alibaba / Qwen team, Apache 2.0. Audio codec : Qwen3-TTS-Tokenizer-12Hz (Qwen team), Apache 2.0.