refactor(tinygrad): reuse tinygrad.apps.llm instead of vendored Transformer #9380

Merged: mudler merged 1 commit into master from chore/tinygrad-upstream on Apr 16, 2026

mudler commented on Apr 16, 2026


Drop the 295-line vendor/llama.py fork in favor of tinygrad.apps.llm, which now provides the Transformer blocks, GGUF loader (incl. Q4/Q6/Q8 quantization), KV-cache and generate loop we were maintaining ourselves.

What changed:

  • New vendor/appsllm_adapter.py (~90 LOC): an HF -> GGUF-native state-dict keymap, a Transformer kwargs builder, an _embed_hidden helper, and a hard rejection of qkv_bias models. Qwen2 / Qwen2.5 are no longer supported, because the apps.llm Transformer hard-codes bias=False on the Q/K/V projections.
  • backend.py routes both safetensors and GGUF paths through apps.llm.Transformer. Generation now delegates to its (greedy-only) generate(); Temperature / TopK / TopP / RepetitionPenalty are still accepted on the wire but ignored — documented in the module docstring.
  • Jinja chat render now passes enable_thinking=False so Qwen3's reasoning preamble doesn't eat the tool-call token budget on small models.
  • Embedding path uses _embed_hidden (block stack + output_norm) rather than the custom embed() method we were carrying on the vendored Transformer.
  • test.py gains TestAppsLLMAdapter covering the keymap rename, tied embedding fallback, unknown-key skipping, and qkv_bias rejection.
  • Makefile fixtures move from Qwen/Qwen2.5-0.5B-Instruct to Qwen/Qwen3-0.6B (apps.llm-compatible) and tool_parser from qwen3_xml to hermes (the HF chat template emits hermes-style JSON tool calls).
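
The keymap, tied-embedding fallback, unknown-key skipping and qkv_bias rejection described above can be pictured with a minimal sketch. It assumes llama.cpp-style GGUF tensor names (token_embd, blk.N.attn_q, output_norm, output) and HF Llama/Qwen3-style parameter names; the helper name hf_to_gguf_keys and the rename table are hypothetical and are not the actual contents of vendor/appsllm_adapter.py.

```python
import re

# Hypothetical per-layer rename table: HF Llama/Qwen3-style suffixes ->
# llama.cpp-style GGUF suffixes. Assumed for illustration only.
_LAYER_MAP = {
    "input_layernorm.weight": "attn_norm.weight",
    "self_attn.q_proj.weight": "attn_q.weight",
    "self_attn.k_proj.weight": "attn_k.weight",
    "self_attn.v_proj.weight": "attn_v.weight",
    "self_attn.o_proj.weight": "attn_output.weight",
    "self_attn.q_norm.weight": "attn_q_norm.weight",
    "self_attn.k_norm.weight": "attn_k_norm.weight",
    "post_attention_layernorm.weight": "ffn_norm.weight",
    "mlp.gate_proj.weight": "ffn_gate.weight",
    "mlp.up_proj.weight": "ffn_up.weight",
    "mlp.down_proj.weight": "ffn_down.weight",
}
# Hypothetical table for tensors that live outside the block stack.
_GLOBAL_MAP = {
    "model.embed_tokens.weight": "token_embd.weight",
    "model.norm.weight": "output_norm.weight",
    "lm_head.weight": "output.weight",
}
_LAYER_RE = re.compile(r"^model\.layers\.(\d+)\.(.+)$")
_QKV_BIAS_RE = re.compile(r"\.self_attn\.[qkv]_proj\.bias$")


def hf_to_gguf_keys(state_dict):
    """Rename an HF state dict to GGUF-native keys (hypothetical helper)."""
    # Hard rejection: a Transformer built with bias=False on Q/K/V has no
    # slot for these tensors (e.g. Qwen2 / Qwen2.5 checkpoints).
    if any(_QKV_BIAS_RE.search(k) for k in state_dict):
        raise ValueError("qkv_bias models are not supported")
    out = {}
    for key, tensor in state_dict.items():
        if key in _GLOBAL_MAP:
            out[_GLOBAL_MAP[key]] = tensor
            continue
        m = _LAYER_RE.match(key)
        if m and m.group(2) in _LAYER_MAP:
            out[f"blk.{m.group(1)}.{_LAYER_MAP[m.group(2)]}"] = tensor
        # Any other key (e.g. rotary caches) is silently skipped.
    # Tied embeddings: a checkpoint without lm_head reuses token_embd.
    if "output.weight" not in out and "token_embd.weight" in out:
        out["output.weight"] = out["token_embd.weight"]
    return out
```

For example, a tied-embedding checkpoint with a stray rotary buffer maps model.layers.0.self_attn.q_proj.weight to blk.0.attn_q.weight, drops model.layers.0.self_attn.rotary_emb.inv_freq, and fills output.weight from the embedding. Checking for bias tensors before renaming keeps the rejection explicit instead of letting unmapped biases fall through the unknown-key path.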
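
From the caller's side, "greedy-only, sampling knobs accepted but ignored" behaves like the sketch below. The function greedy_generate, its next_logits callback and its parameter names are hypothetical stand-ins chosen to mirror the wire fields; this is not the signature of apps.llm's generate(), and KV-cache handling is deliberately left out.

```python
def greedy_generate(next_logits, prompt_ids, max_new_tokens, eos_id,
                    temperature=0.7, top_k=40, top_p=0.9,
                    repetition_penalty=1.1):
    """Greedy decoding; sampling parameters are accepted but unused."""
    # Kept in the signature for request compatibility, then discarded.
    del temperature, top_k, top_p, repetition_penalty
    ids = list(prompt_ids)
    generated = []
    for _ in range(max_new_tokens):
        logits = next_logits(ids)
        # Argmax over the vocabulary (first index wins on ties).
        tok = max(range(len(logits)), key=logits.__getitem__)
        if tok == eos_id:
            break
        generated.append(tok)
        ids.append(tok)
    return generated
```

Because the argmax never looks at temperature or top-k/top-p, two requests that differ only in those fields produce identical output, which is the behaviour the module docstring documents.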

Verified with the docker-backed targets:
  • test-extra-backend-tinygrad: 5/5 PASS
  • test-extra-backend-tinygrad-embeddings: 3/3 PASS
  • test-extra-backend-tinygrad-whisper: 4/4 PASS
  • test-extra-backend-tinygrad-sd: 3/3 PASS

Description

This PR fixes #

Notes for Reviewers

Signed commits

  • Yes, I signed my commits.
mudler merged commit a0cbc46 into master on Apr 16, 2026 (38 of 40 checks passed).
mudler deleted the chore/tinygrad-upstream branch on April 16, 2026 at 20:41.
localai-bot added the enhancement (New feature or request) label on May 9, 2026.