Gemma 4 Fine-tuning Guide
Train Gemma 4 by Google with Unsloth.
You can now train Google's Gemma 4 12B, E2B, E4B, 26B-A4B and 31B with Unsloth. Unsloth supports all vision, text, audio and RL fine-tuning for Gemma 4.
Unsloth trains Gemma 4 ~1.5x faster with ~60% less VRAM than FA2 setups (no accuracy loss)
We fixed many universal bugs for Gemma 4 training (not derived from Unsloth).
Gemma 4 E2B trains on 8GB VRAM. E4B requires 10GB VRAM.
Fine-tune Gemma 4 via our free Google Colab notebooks:
You can run and train Gemma 4 for free with a UI in our Unsloth Studio✨ notebook:
You can view more notebooks here.
You can also train Gemma 4 with reinforcement learning (RL) on 9GB VRAM.
Gemma 4 E2B LoRA works on 8-10GB VRAM. E4B LoRA requires 17GB VRAM.
31B QLoRA works with 22GB and 26B-A4B LoRA needs >40GB
Exporting/saving models to GGUF etc. and full fine-tuning (FFT) works as well.
🐛 Bug fixes + Tips
If you see Gemma-4 E2B and E4B having a loss of 13-15, this is perfectly normal - this is a common quirk of multimodal models. This also happened on Gemma-3N, Llama Vision, Mistral vision models and more.
Gemma 26B and 31B have lower loss at 1-3 or lower. Vision will be 2x higher so 3-5
🍇Gradient accumulation might inflate your losses


If you see losses higher than 13-15 (like 100 or 300) most likely gradient accumulation is not being accounted properly - we have fixed this as part of Unsloth and Unsloth Studio.
To read more about gradient accumulation see our gradient accumulation bug fix blog: https://unsloth.ai/blog/gradient
⁉️IndexError on Gemma-4 31B and 26B-A4B inference
You might see this error when doing inference with 31B and 26B:
The culprit is below:
Where Gemma-4 31B and 26B-A4B ship with num_kv_shared_layers = 0. In Python, -0 == 0, so layer_types[:-0] collapses to layer_types[:0] == []. The cache is built with zero layer slots and the very first attention forward crashes inside Cache.update.
⛔ use_cache = True generation was gibberish for E2B, E4B
use_cache = True generation was gibberish for E2B, E4BSee issue "[Gemma 4] use_cache=False corrupts attention computation, producing garbage logits #45242"
Gemma-4 E2B and E4B share KV state across layers (num_kv_shared_layers = 20 and 18). The cache is the only place where early layers stash KV for later layers to reuse. When use_cache=False (as every QLoRA tutorial sets, and as gradient_checkpointing=True forces), Gemma4TextModel.forward skips cache construction, so the KV-shared layers fall through to recomputing K and V locally from the current hidden states. The logits become garbage and training loss diverges.
Before (unsloth/gemma-4-E2B-it, prompt "What is 1+1?"):
After our fix:
📻Audio float16 overflow
Gemma4AudioAttention uses config.attention_invalid_logits_value = -1e9 in a masked_fill call. On fp16 (Tesla T4), -1e9 overflows the fp16 max of 65504, causing:
This was due to self.config.attention_invalid_logits_value :
💡Tips for Gemma-4
If you want to preserve reasoning ability, you can mix reasoning-style examples with direct answers (keep a minimum of 75% reasoning). Otherwise you can emit it fully. Use
gemma-4for the non thinking chat-template andgemma-4-thinkingfor the thinking variant. Use the thinking one for the larger 26B and 31B ones, and the non thinking one for the small ones.To enable thinking mode, use
enable_thinking = True / Falseintokenizer.apply_chat_templateThinking enabled:
Will print
<bos><|turn>system\n<|think|><turn|>\n<|turn>user\nWhat is 2+2?<turn|>\n<|turn>model\nThinking disabled:
Will print
<bos><|turn>user\nWhat is 2+2?<turn|>\n<|turn>model\n<|channel>thought\n<channel|>Gemma 4 is powerful for multilingual fine-tuning as it supports 140 languages.
It is recommended to train E4B QLoRA rather than E2B LoRA as the E4B is bigger and the quantization accuracy difference is miniscule. Gemma 4 E4B LoRA is even better.
After fine-tuning, you can export to GGUF (for llama.cpp/Unsloth/Ollama/etc.)
⚡Quickstart
🦥 Unsloth Studio Guide
Gemma 4 can be run and fine-tuned in Unsloth Studio, our new open-source web UI for local AI.
With Unsloth Studio, you can run models locally on MacOS, Windows, Linux and train NVIDIA GPUs. Intel, MLX and AMD training support coming this month.

Train Gemma 4
On first launch you will need to create a password to secure your account and sign in again later. You’ll then see a brief onboarding wizard to choose a model, dataset, and basic settings. You can skip it at any time.
Search for Gemma 4 in the search bar and select your desired model and dataset. Next, adjust your hyperparameters, context length as desired.

🦥 Unsloth Core (code-based) Guide
We made free notebooks for Gemma 4:
And for reinforcement learning (RL): E2B (RL GRPO)
We also made notebooks for the larger Gemma 4 models but they need A100:
Gemma-4-26B-A4B - A100 GPU
Gemma-4-31B - A100 GPU
Below is a standalone Gemma-4-26B-A4B-it text SFT recipe. This is text only - see also our vision fine-tuning section for more details.
If you OOM:
Drop
per_device_train_batch_sizeto 1 and/or reducemax_seq_length.Keep
use_gradient_checkpointing="unsloth"on (it’s designed to reduce VRAM use and extend context length).
Loader example for MoE (bf16 LoRA):
Once loaded, you’ll attach LoRA adapters and train similarly to the SFT example above.
Reinforcement Learning (RL)
You can now train Gemma 4 with RL, GSPO, GRPO etc with our free notebook.
Gemma 4 E2B RL works on 9GB.
The notebook's goal is to make Gemma 4 learn to solve Sudoku puzzles using GRPO.
The model will devise a strategy to fill in empty cells, and we'll reward it for correct placements and completing valid puzzles.
You can run Gemma 4 RL with Unsloth even though it is not supported by vLLM, by setting fast_inference=False when loading the model:

MoE fine-tuning (26B-A4B)
The 26B-A4B model is the speed / quality middle ground in the Gemma 4 lineup. Since it is an MoE model with only a subset of parameters active per token, a conservative fine-tuning approach is:
use LoRA rather than full fine-tuning
prefer 16-bit / bf16 LoRA if memory allows
start with shorter contexts and smaller ranks first
scale up only after the pipeline is stable
If your goal is the highest quality and you have more memory, use 31B instead.
Multimodal fine-tuning (E2B / E4B)
Because E2B and E4B support image and audio, they are the main Gemma 4 variants for multimodal fine-tuning.
load the multimodal model with
FastVisionModelkeep
finetune_vision_layers = Falsefirstfine-tune only the language, attention, and MLP layers
enable vision or audio layers later if your task needs it
Gemma 4 Multimodal LoRA example:
Image example format
Remember: for Gemma 4 multimodal prompts, put the image before the text instruction.
Audio example format
Audio is for E2B / E4B only. Keep clips short and task-specific.
Saving / export fine-tuned model
You can view our specific inference / deployment guides for Unsloth Studio, llama.cpp, vLLM, llama-server, Ollama or SGLang.
Save to GGUF
Unsloth supports saving directly to GGUF:
Or push GGUFs to Hugging Face:
If the exported model behaves worse in another runtime, Unsloth flags the most common cause: wrong chat template / EOS token at inference time (you must use the same chat template you trained with).
For more details read our inference guides:
Gemma 4 data best practices
Gemma 4 has a few formatting details you need to keep in mind.
1. Use standard chat roles
Gemma 4 uses the standard:
systemuserassistant
This means your SFT dataset should be written in regular chat format rather than older Gemma-specific role formats.
2. Thinking mode is explicit
If you want to preserve thinking-style behavior during SFT:
keep the format consistent
decide whether you want to train on visible thought blocks or on final answers only
do not mix multiple incompatible thought formats in the same dataset
For most production assistants, the simplest setup is to fine-tune on the final visible answer only.
3. Multi-turn rule
For multi-turn conversations, only keep the final visible answer in the conversation history. Do not feed earlier thought blocks back into later turns.
Last updated
Was this helpful?




