feat: generic chat_template_kwargs (model config + per-request metadata) #10359
Merged
Adds the ChatTemplateKwargs model-config map and RequestMetadata carrier, plus ResolveChatTemplateKwargs which layers the config map under coerced request metadata. Foundation for generic jinja chat-template kwargs (issue #10329). Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
gRPCPredictOpts now merges per-request client metadata over the server-derived enable_thinking/reasoning_effort (reaching all backends via the standalone keys) and serialises the resolved chat_template_kwargs map into a JSON blob for llama.cpp, written last so a client cannot clobber it. Issue #10329. Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The OpenAI request metadata field was parsed but unused; stamp it onto the per-request ModelConfig so gRPCPredictOpts forwards it as chat_template_kwargs overrides. Issue #10329. Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…cks) Replace the per-key enable_thinking/reasoning_effort handling in both the streaming and non-streaming chat paths with a single block that parses the chat_template_kwargs JSON blob resolved by the Go layer and merges every key into body_json. New jinja template levers (e.g. preserve_thinking) now need no C++ change. Issue #10329. Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Issue #10329. Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
…kwargs blob Issue #10329. Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Adds an ECHO_PREDICT_METADATA marker to the mock-backend that echoes the received PredictOptions.Metadata, and an app_test.go spec that drives a real /v1/chat/completions request (model chat_template_kwargs + per-request metadata override) and asserts the exact metadata + chat_template_kwargs blob the REST layer forwards to gRPC. Locks the REST->gRPC contract against regressions. Issue #10329. Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
chat_template_kwargs is a free-form map[string]any (like engine_args, already on the list), not a scalar the config UI registry can surface, so it is exempt from the registry-entry requirement. Fixes the TestAllFieldsHaveRegistryEntries failure introduced by the new field. Issue #10329. Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
What
Closes #10329.
Adds a generic way to pass arbitrary jinja chat-template variables (e.g. Qwen3's `preserve_thinking`) to backends, so a new template lever no longer needs a hardcoded block in `grpc-server.cpp` (as `enable_thinking` in #8973 and `reasoning_effort` in #10184 each did).

Two sources, no new API surface:

- Model config: a `chat_template_kwargs:` map (typed).
- Per request: the `metadata` field. String values; `"true"`/`"false"` are coerced to booleans, anything else stays a string:

      {
        "model": "qwen3",
        "messages": [{"role": "user", "content": "hi"}],
        "metadata": { "preserve_thinking": "true", "enable_thinking": "false" }
      }

How
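A minimal sketch of the per-request coercion rule stated above, for illustration only. The helper name `coerceMetadataValue` is hypothetical and not taken from the PR, and the exact, case-sensitive match on `"true"`/`"false"` is an assumption.

```go
package main

import "fmt"

// coerceMetadataValue is a hypothetical helper (not the PR's actual code).
// It mirrors the stated rule: "true"/"false" become booleans, and any other
// string is passed through unchanged. Case-sensitive matching is an assumption.
func coerceMetadataValue(v string) any {
	switch v {
	case "true":
		return true
	case "false":
		return false
	default:
		return v
	}
}

func main() {
	// Request metadata is string-valued, as in the JSON example above.
	meta := map[string]string{
		"preserve_thinking": "true",
		"enable_thinking":   "false",
		"reasoning_effort":  "high",
	}
	// Iterate in a fixed order so the printed output is stable.
	for _, k := range []string{"preserve_thinking", "enable_thinking", "reasoning_effort"} {
		fmt.Printf("%s = %#v\n", k, coerceMetadataValue(meta[k]))
	}
}
```

Because only the two literal strings are coerced, a numeric-looking value such as `"3"` would stay a string under this sketch, which is consistent with typed values being a config-only feature.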
Precedence (low -> high): config map < server reasoning levers (`enable_thinking`/`reasoning_effort`) < per-request `metadata`.

- `core/config` - `ModelConfig.ChatTemplateKwargs` (YAML) + `RequestMetadata` (request-scoped carrier) + `ResolveChatTemplateKwargs(meta)`, which layers the config map under the coerced metadata and skips the reserved `chat_template_kwargs` key.
- `core/backend/options.go` (`gRPCPredictOpts`) - merges client `RequestMetadata` over the server-derived levers (so a per-request `enable_thinking`/`reasoning_effort` override reaches every backend via the standalone metadata keys), then serialises the resolved map into a single `metadata["chat_template_kwargs"]` JSON blob, written last so a client cannot clobber it.
- `core/http/middleware/request.go` - stamps the request `metadata` onto the per-request config.
- `backend/cpp/llama-cpp/grpc-server.cpp` - replaces the two per-key `enable_thinking`/`reasoning_effort` blocks (streaming + non-streaming) with one generic block that parses the blob and merges every key into `body_json["chat_template_kwargs"]`. New template levers now need no C++ change.

The standalone `metadata["enable_thinking"]`/`["reasoning_effort"]` keys are still emitted (sglang, mlx-vlm, mlx-distributed, vllm-omni read them); other backends receive the new `chat_template_kwargs` metadata key and harmlessly ignore it. `enable_thinking` reaches llama.cpp as a real JSON bool (preserving the old `== "true"` behaviour); `reasoning_effort` stays a string.

Notes / by design
- A YAML `chat_template_kwargs` value is folded only into the llama.cpp blob, while per-request `metadata` keys also become standalone gRPC metadata keys (so they reach the Python backends). This asymmetry is intentional: the feature is llama.cpp/jinja-centric, and typed (non-boolean) values are YAML-only.
- Moves the `enable_thinking`/`reasoning_effort` chat_template_kwargs construction from C++ into Go; llama.cpp output is unchanged, but the gRPC metadata now carries a `chat_template_kwargs` blob whenever a reasoning lever or kwarg is active.

Test plan
- `go test ./core/config/ ./core/backend/ ./core/http/middleware/` - green (resolver precedence/coercion/reserved-key, `gRPCPredictOpts` blob + client-override + anti-clobber + omit, middleware metadata wiring).
- `golangci-lint` (new-from-merge-base) - 0 issues; gofmt clean.

Assisted-by: Claude:claude-opus-4-8
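For reviewers, the precedence and anti-clobber ordering described under How can be sketched roughly as follows. This is not the code from the PR: `resolveKwargs`, `buildMetadata`, and `coerce` are hypothetical names, the server levers are modelled as a plain string map, and skipping the reserved key when copying client metadata is an assumption made to keep the sketch simple.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// reservedKey is the metadata key that carries the resolved JSON blob.
const reservedKey = "chat_template_kwargs"

// coerce turns "true"/"false" into booleans; other strings pass through.
func coerce(v string) any {
	switch v {
	case "true":
		return true
	case "false":
		return false
	}
	return v
}

// resolveKwargs layers the three sources, lowest precedence first:
// config map < server reasoning levers < per-request metadata.
func resolveKwargs(config map[string]any, levers, request map[string]string) map[string]any {
	out := make(map[string]any)
	for k, v := range config {
		out[k] = v // typed values come from config only
	}
	for k, v := range levers {
		out[k] = coerce(v)
	}
	for k, v := range request {
		if k == reservedKey {
			continue // the client may not inject a blob of its own
		}
		out[k] = coerce(v)
	}
	return out
}

// buildMetadata emits the standalone keys, then writes the blob last.
func buildMetadata(config map[string]any, levers, request map[string]string) (map[string]string, error) {
	md := make(map[string]string)
	for k, v := range levers {
		md[k] = v
	}
	for k, v := range request {
		if k != reservedKey { // assumption: drop a client-supplied blob
			md[k] = v // client override of a server lever
		}
	}
	kwargs := resolveKwargs(config, levers, request)
	if len(kwargs) == 0 {
		return md, nil // nothing active: omit the blob entirely
	}
	blob, err := json.Marshal(kwargs) // map keys are emitted in sorted order
	if err != nil {
		return nil, err
	}
	md[reservedKey] = string(blob) // written last
	return md, nil
}

func main() {
	md, err := buildMetadata(
		map[string]any{"preserve_thinking": false},
		map[string]string{"enable_thinking": "true"},
		map[string]string{"preserve_thinking": "true", "enable_thinking": "false"},
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(md[reservedKey])
}
```

In this sketch the request values win for both keys, so the blob carries `enable_thinking` as a JSON `false` while the standalone `enable_thinking` metadata key keeps the string `"false"`.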