llama + spec: MTP Support #22673
Conversation
|
Nice, I think this is a fresh start better than my WIP #18886 (that I still never find the time to continue) There were some other attempts to add MTP support but they all heavily rely on host <--> device data copy. I assume you tried addressed this, right? (Maybe there was a discussion somewhere but I wasn't aware of) |
ngxson
left a comment
There was a problem hiding this comment.
(not a review, but opening some discussions)
|
@ngxson yes the h2d was discussed with GG, he's working on a refactor which will allow us to share tensors between two llama context |
|
Great work, this should massively bridge the TG gap with vLLM, or maybe even surpass it together with tensor-parallel. |
|
in my opinion Qwen 3.6 is the most important thing that happened in open source models in a long time, this is going to be so valuable. ngram could be set to match only very strong and long candidates - for large repetitive paraphrasing |
|
" idea is that MTP should automatically start and we shouldn't need to distribute the MTP gguf separately but also it has it's own context/kv-cache etc." -> Does this mean MTP needs additional resources (RAM/VRAM?) If so, there should always be an option to remain to disable it. Right now on my system (6 GB VRAM, 32 GB RAM), speculative decoding just makes things much slower even on very small draft models because of that exact reason, they need own context and kv-cache. Such low to midrange systems already operate on the edge in terms of memory. |
|
I'm getting garbage responses running this PR on the Vulkan backend with an R9700 using llama-server. I'm using the GGUF you linked above. Interestingly, draft acceptance is only 0.01282. Prompt: "Hello!" |
|
@cmp-nct I'm not sure, but could be possible @Dampfinchen as of right now it is opt-in via @mbednarek360 |
|
Might it be possible/useful to run the draft model on a second GPU? Given that MTP weights model are relatively small this might provide a useful speedup on systems with a dedicated high-VRAM "AI" GPU with a cheaper low-VRAM "normal" GPU used for display output, etc... possibly prevent some degree of resource contention. |
|
Thank you, we are eagerly awaiting this to become stable, here automated test results for my machine; __
Result:
|
|
@cturan Thanks for testing, I'm aware of the issue for the prefill and will work on a fix. |
|
Might be a long shot, but any chance of supporting MTP with a reduced vocabulary? MTP layers are rather chonky and reducing token embeddings might help users with less VRAM by filtering out certain languages. Obviously the full model will still be able to produce those tokens if need be so it won't be gimped. |
|
Working on taking this for a spin with the Q4_K_M quant of Qwen3.6-35BA3B. I was gonna try to start from unsloth's quant since they already perform really well, but of course they don't have any mtp layers. @am17an Think it would work if I just "steal" the layers from your q8 quant and merge them into the unsloth quant? (add blk.40 and bump some top-level config like block_count and kv_count) |
|
only a quick test run, 1x 5090 qwen3.6-27b mtp 3, q4_0 quantized, kv also q4_0 same model, same config (except mtp) prompt „create a flappy bird clone“ (I‘m not creative, sorry) Great Speedup! |
|
this is a game changer, on Strix Halo with the q8 Qwen 3.6 35B3A jumping from 40 to 70 tg at low context and for the 27B from 12 to 25 tg(with layer split 7900 xtx and strix halo 50,50) for coding. We need this one to master asap together with turbo4, it performs very well and without any issues. Good job |
|
On a 3060 Laptop 6GB vram + 64GB ram running your provided Qwen 3.6 35A3B gguf there is a reasonable speed up.
raw resultsspec-draft-n-max 4
spec-draft-n-max 3
spec-draft-n-max 2
spec-draft-n-max 1
no mtp
|
|
Crashes when using srv params_from_: Chat format: peg-native
slot get_availabl: id 0 | task -1 | selected slot by LRU, t_last = -1
srv get_availabl: updating prompt cache
srv load: - looking for better prompt, base f_keep = -1.000, sim = 0.000
srv update: - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 262144 tokens, 8589934592 est)
srv get_availabl: prompt cache update took 0.01 ms
slot launch_slot_: id 0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id 0 | task 0 | processing task, is_child = 0
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 262144, n_keep = 0, task.n_tokens = 356
slot update_slots: id 0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_tokens = 352, batch.n_tokens = 352, progress = 0.988764
/root/llama.cpp/ggml/src/ggml-backend-meta.cpp:1013: GGML_ASSERT(split_state.ne[j] * tensor->src[i]->ne[src_ss[i].axis] == sum * tensor->ne[split_state.axis]) failed
/root/llama.cpp/build/bin/libggml-base.so.0(+0x1b25b)[0x74b4b4ca925b]
/root/llama.cpp/build/bin/libggml-base.so.0(ggml_print_backtrace+0x21f)[0x74b4b4ca96df]
/root/llama.cpp/build/bin/libggml-base.so.0(ggml_abort+0x152)[0x74b4b4ca98b2]
/root/llama.cpp/build/bin/libggml-base.so.0(+0x41506)[0x74b4b4ccf506]
/root/llama.cpp/build/bin/libggml-base.so.0(+0x3d579)[0x74b4b4ccb579]
/root/llama.cpp/build/bin/libggml-base.so.0(+0x41adb)[0x74b4b4ccfadb]
/root/llama.cpp/build/bin/libggml-base.so.0(ggml_gallocr_alloc_graph+0x474)[0x74b4b4cbff54]
/root/llama.cpp/build/bin/libggml-base.so.0(ggml_backend_sched_alloc_graph+0x111)[0x74b4b4cc6351]
/root/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0xe8)[0x74b4b44dac08]
/root/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context6decodeERK11llama_batch+0x37b)[0x74b4b44d912b]
/root/llama.cpp/build/bin/libllama.so.0(llama_decode+0x10)[0x74b4b44da780]
/root/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context21handle_mtp_for_ubatchEiPKiS1_P11ggml_tensor+0x20d)[0x74b4b44da9bd]
/root/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context14process_ubatchERK12llama_ubatch14llm_graph_typeP22llama_memory_context_iR11ggml_status+0x142)[0x74b4b44dac62]
/root/llama.cpp/build/bin/libllama.so.0(_ZN13llama_context6decodeERK11llama_batch+0x37b)[0x74b4b44d912b]
/root/llama.cpp/build/bin/libllama.so.0(llama_decode+0x10)[0x74b4b44da780]
llama-server(+0xf846e)[0x63c5e42c046e]
llama-server(+0x172971)[0x63c5e433a971]
llama-server(+0x5842c)[0x63c5e422042c]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x74b4b3c29d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x74b4b3c29e40]
llama-server(+0x58cd5)[0x63c5e4220cd5]
Aborted``` |
|
Tested on 3x RTX3060 12Gb. Sorry I don't have the VRAM for your Q8, I used RDson/Qwen3.6-27B-MTP-Q4_K_M-GGUF which was quantized with ik_llama's MTP. Prompt: "Write a simple minimal hash table implementation in C99." Three runs with no MTP, avg generation 18.51 tok/sec: Three runs with MTP, avg generation 32.24 tok/sec: Result 74% speedup. Wow! Thank you for your work. You will make many users happy with this. What an exciting PR! One small hiccup. On my initial attempt I got the error message: Adding |
This comment has been minimized.
This comment has been minimized.
|
|
|
@ggerganov @am17an somewhere between merged PRs #23234 and #23333 something made models need more ram to fit into the same setup 15 May 19:46 build_4_BFMTP it seems to be MTP related, i tested the none MTP model from unsloth and i didn't see the problem. And the model has been doing this ever since that update happened build_5
|
|
Adding the first gfx1150 (AMD Radeon 890M / Strix Point APU, Ryzen AI 9 HX 470) data point. Tested on Setup
./llama-server -m Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
-ngl 999 -c 8192 -fa on -np 1 \
-ctk q8_0 -ctv q8_0 --jinja --no-mmap \
--spec-type draft-mtp --spec-draft-n-max 2n-max sweep
Best config (mtp-d2-r1) — per-prompt detail {
"n_requests": 9,
"total_predicted": 1728,
"total_draft": 1316,
"total_draft_accepted": 1057,
"aggregate_accept_rate": 0.8032,
"wall_s_total": 73.0
}Observations
Two minor things worth filing separately if reproducible
|
|
@AbdulrahmanHashem Do you have a consistent repro of the garbage generation? If yes, please open an issue with detailed information about your hardware, model, logs, etc. |
sadly i can't get enough time to make the issue but i just made a build and it's alittle better with when it comes to using more VRAM but it still uses at least about 250 mb more ram at the moment i tested it on my own projects and i haven't seen garbage again. Update : it just gave me garbage again. |
|
Great! The current branch has been merged into the main branch. But how do I use it after merging into the main branch? After compiling with the latest code of master branch, llama-server --spec-type does not support draft-mtp. |
|
For somebody interested in a full comparison on a coding benchmark to see how much benefit MTP gives you, I run SWE Verified mini on Strix Halo and AMD R9700: https://pi-local-coding-bench.dev/ In a nutshell, there is a measurable improvement in average time to complete tasks, more than I expected given the negative performance hit with prompt processing: Strix Halo / Qwen 3.6 35B-A3B UD_Q8-K-XL:
R9700 / Qwen 3.6 27B UD_Q4-K-XL:
I also observed an improvement in task completion - is it just random or does MTP change some of the sampling strategy? |
It should be random. You need to run the eval multiple times to reduce the variance of the result. |
Yup, that was what I thought. I need to find the time to do it, even single runs of the full benchmark can take half a day! |
|
Yes, you need to distribute it on many machines. Btw, without MTP, the Qwen3.x models should support parallel processing efficiently. So depending on the max context needed for these tasks, you can run requests in parallel on a single server. The more requests you can batch, the better. However, batching with MTP enabled using a recurrent model (i.e. Qwen3.x) is currently not optimized, so you won't benefit from parallel processing on a single machine in that case. The only way atm is to scale the machines. |
|
Thanks @ggerganov , I'm definitely doing that... Interestingly a comment on my video on this seems to think instead MTP might actually improve performance:
But, this will be clear after I've re-run all benchmarks. |
What I would have liked to see is to restrict n_rs_seq to one slot only. |
|
@kyuz0 did you limit the parallel agents for the non-mtp use case? Because MTP currently limits you to 1 concurrent agent, while I noticed that qwen3.6-35B likes to use several parallel agents in opencode (as opposed to qwen3.5-122B) which makes things faster on its own. |
@darkbasic the benchmark was done with pi, no sub agents, so both MTP and non-MTP had 1 agent thread. |
|
@kyuz0 that explains it. Also keep in mind that opencode tends to bloat the context much more if it doesn't know the model. You can fake it to a known one by running llama-server with |




Overview
This PR adds support for MTP (Multi Token Prediction) heads. I tested this on Qwen3.6 27B and Qwen3.6 35BA3B but in principle it should work for any MTP model. I've posted the detailed results below, but typically I see a steady-state acceptance of around 75% with 3 draft tokens, which is more than >2x speed-up over baseline. The design decisions I took to get to this stage are as follows:
ubatchTip
MTP is compatible with Vision input and Tensor/Pipeline Parallelism
Note
Prompt processing (PP) speed typically takes a negative hit when MTP is enabled mainly due to Device-To-Host (D2H) embedding transfers. It's something to be optimized in the future.
Note
Parallel decoding with MTP is supported, but not fully optimized yet.
Performance
A simple bench for testing various prompts is here: https://gist.github.com/am17an/228edfb84ed082aa88e3865d6fa27090. Posting the results below:
Performance on DGX Spark 🧵
No MTP (baseline)
./llama-server -m ../qwen3.6-q8_0.gguf -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"MTP --spec-draft-max-n 3
./llama-server -m ../qwen3.6-q8_0-mtp.gguf -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type draft-mtp --spec-draft-n-max 3MTP --spec-draft-max-n 2
./llama-server -m ../qwen3.6-q8_0-mtp.gguf -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type draft-mtp --spec-draft-n-max 2Draft model (Qwen3.5 0.8B) with spec-draft-n-max 16 with partial rollback
llama-server -m ../qwen3.6/Qwen3.6-27B-Q8_0.gguf -hfd unsloth/Qwen3.5-0.8B-GGUF:Q8_0 --spec-draft-n-max 16 -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"Master with draft model with spec-draft-n-max 64 with no partial rollback
llama-server -m ../qwen3.6/Qwen3.6-27B-Q8_0.gguf -hfd unsloth/Qwen3.5-0.8B-GGUF:Q8_0 --spec-draft-n-max 64 -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"How to use
I've uploaded the GGUF which I made by using the
convert_hf_to_gguf.pychanges in this PR. Here is another GGUF for the MoE (35BA3B) modelThese are some sample commands to get started with MTP:
Models
Quality check
The results from 4 runs of the AIME2026 eval (4x30 questions in total) with MTP enabled, using llama-eval, are within expectation and match the reported value by Qwen team.
Full data: aime2026-qwen3.6-27b-mtp-q4_k-x4.json.html
Next Steps until merge
mtpdraft-mtpTODOs after merge
ngramcompatibility withmtp--spec-draft-p-minsupport formtpbatch size > 1 + n_rs_seq(sample patch)n_rs_seq > 0(currently the multi-seq states are not contiguous in memory so cannot be batched together)Requirements