For the complete documentation index, see llms.txt. This page is also available as Markdown.

Gemma 4 Fine-tuning Guide

Train Gemma 4 by Google with Unsloth.

You can now train Google's Gemma 4 12B, E2B, E4B, 26B-A4B and 31B with Unsloth. Unsloth supports all vision, text, audio and RL fine-tuning for Gemma 4.

  • Unsloth trains Gemma 4 ~1.5x faster with ~60% less VRAM than FA2 setups (no accuracy loss)

  • We fixed many universal bugs for Gemma 4 training (not derived from Unsloth).

  • Gemma 4 E2B trains on 8GB VRAM. E4B requires 10GB VRAM.

QuickstartBug Fixes + Tips

Fine-tune Gemma 4 via our free Google Colab notebooks:

You can run and train Gemma 4 for free with a UI in our Unsloth Studio✨ notebook:

You can view more notebooks here.

  • You can also train Gemma 4 with reinforcement learning (RL) on 9GB VRAM.

  • Gemma 4 E2B LoRA works on 8-10GB VRAM. E4B LoRA requires 17GB VRAM.

  • 31B QLoRA works with 22GB and 26B-A4B LoRA needs >40GB

  • Exporting/saving models to GGUF etc. and full fine-tuning (FFT) works as well.

🐛 Bug fixes + Tips

🍇Gradient accumulation might inflate your losses

If you see losses higher than 13-15 (like 100 or 300) most likely gradient accumulation is not being accounted properly - we have fixed this as part of Unsloth and Unsloth Studio.

To read more about gradient accumulation see our gradient accumulation bug fix blog: https://unsloth.ai/blog/gradient

⁉️IndexError on Gemma-4 31B and 26B-A4B inference

You might see this error when doing inference with 31B and 26B:

The culprit is below:

Where Gemma-4 31B and 26B-A4B ship with num_kv_shared_layers = 0. In Python, -0 == 0, so layer_types[:-0] collapses to layer_types[:0] == []. The cache is built with zero layer slots and the very first attention forward crashes inside Cache.update.

use_cache = True generation was gibberish for E2B, E4B

See issue "[Gemma 4] use_cache=False corrupts attention computation, producing garbage logits #45242"

Gemma-4 E2B and E4B share KV state across layers (num_kv_shared_layers = 20 and 18). The cache is the only place where early layers stash KV for later layers to reuse. When use_cache=False (as every QLoRA tutorial sets, and as gradient_checkpointing=True forces), Gemma4TextModel.forward skips cache construction, so the KV-shared layers fall through to recomputing K and V locally from the current hidden states. The logits become garbage and training loss diverges.

Before (unsloth/gemma-4-E2B-it, prompt "What is 1+1?"):

After our fix:

📻Audio float16 overflow

Gemma4AudioAttention uses config.attention_invalid_logits_value = -1e9 in a masked_fill call. On fp16 (Tesla T4), -1e9 overflows the fp16 max of 65504, causing:

This was due to self.config.attention_invalid_logits_value :

💡Tips for Gemma-4

  1. If you want to preserve reasoning ability, you can mix reasoning-style examples with direct answers (keep a minimum of 75% reasoning). Otherwise you can emit it fully. Use gemma-4 for the non thinking chat-template and gemma-4-thinking for the thinking variant. Use the thinking one for the larger 26B and 31B ones, and the non thinking one for the small ones.

  2. To enable thinking mode, use enable_thinking = True / False in tokenizer.apply_chat_template

    Thinking enabled:

    Will print <bos><|turn>system\n<|think|><turn|>\n<|turn>user\nWhat is 2+2?<turn|>\n<|turn>model\n

    Thinking disabled:

    Will print <bos><|turn>user\nWhat is 2+2?<turn|>\n<|turn>model\n<|channel>thought\n<channel|>

  3. Gemma 4 is powerful for multilingual fine-tuning as it supports 140 languages.

  4. It is recommended to train E4B QLoRA rather than E2B LoRA as the E4B is bigger and the quantization accuracy difference is miniscule. Gemma 4 E4B LoRA is even better.

  5. After fine-tuning, you can export to GGUF (for llama.cpp/Unsloth/Ollama/etc.)

⚡Quickstart

🦥 Unsloth Studio Guide

Gemma 4 can be run and fine-tuned in Unsloth Studio, our new open-source web UI for local AI.

With Unsloth Studio, you can run models locally on MacOS, Windows, Linux and train NVIDIA GPUs. Intel, MLX and AMD training support coming this month.

1

Install Unsloth

Run in your terminal:

MacOS, Linux, WSL:

Windows PowerShell:

2

Launch Unsloth

MacOS, Linux, WSL and Windows:

Then open http://localhost:8888 in your browser.

3

Train Gemma 4

On first launch you will need to create a password to secure your account and sign in again later. You’ll then see a brief onboarding wizard to choose a model, dataset, and basic settings. You can skip it at any time.

Search for Gemma 4 in the search bar and select your desired model and dataset. Next, adjust your hyperparameters, context length as desired.

4

Monitor training progress

After you click start training, you will be able to monitor and observe the training progress of the model. The training loss should be steadily decreasing. Once done, the model will be automatically saved.

5

Export your fine-tuned model

Once done, Unsloth Studio allows you to export the model to GGUF, safetensor etc formats.

6

Compare fine-tuned model vs original model

Click on Compare Mode to compare the LoRA adapter and the original model.

🦥 Unsloth Core (code-based) Guide

We made free notebooks for Gemma 4:

And for reinforcement learning (RL): E2B (RL GRPO)

We also made notebooks for the larger Gemma 4 models but they need A100:

Gemma-4-26B-A4B - A100 GPU

Gemma-4-31B - A100 GPU

If you'd like to do GRPO, it works in Unsloth if you disable fast vLLM inference and use Unsloth inference instead. Follow our Vision RL notebook examples.

Below is a standalone Gemma-4-26B-A4B-it text SFT recipe. This is text only - see also our vision fine-tuning section for more details.

If you OOM:

  • Drop per_device_train_batch_size to 1 and/or reduce max_seq_length.

  • Keep use_gradient_checkpointing="unsloth" on (it’s designed to reduce VRAM use and extend context length).

Loader example for MoE (bf16 LoRA):

Once loaded, you’ll attach LoRA adapters and train similarly to the SFT example above.

Reinforcement Learning (RL)

You can now train Gemma 4 with RL, GSPO, GRPO etc with our free notebook.

Gemma 4 E2B RL works on 9GB.

The notebook's goal is to make Gemma 4 learn to solve Sudoku puzzles using GRPO.

The model will devise a strategy to fill in empty cells, and we'll reward it for correct placements and completing valid puzzles.

You can run Gemma 4 RL with Unsloth even though it is not supported by vLLM, by setting fast_inference=False when loading the model:

MoE fine-tuning (26B-A4B)

The 26B-A4B model is the speed / quality middle ground in the Gemma 4 lineup. Since it is an MoE model with only a subset of parameters active per token, a conservative fine-tuning approach is:

  • use LoRA rather than full fine-tuning

  • prefer 16-bit / bf16 LoRA if memory allows

  • start with shorter contexts and smaller ranks first

  • scale up only after the pipeline is stable

If your goal is the highest quality and you have more memory, use 31B instead.

Multimodal fine-tuning (E2B / E4B)

Because E2B and E4B support image and audio, they are the main Gemma 4 variants for multimodal fine-tuning.

  • load the multimodal model with FastVisionModel

  • keep finetune_vision_layers = False first

  • fine-tune only the language, attention, and MLP layers

  • enable vision or audio layers later if your task needs it

Gemma 4 Multimodal LoRA example:

Image example format

Remember: for Gemma 4 multimodal prompts, put the image before the text instruction.

Audio example format

Audio is for E2B / E4B only. Keep clips short and task-specific.

Saving / export fine-tuned model

You can view our specific inference / deployment guides for Unsloth Studio, llama.cpp, vLLM, llama-server, Ollama or SGLang.

Save to GGUF

Unsloth supports saving directly to GGUF:

Or push GGUFs to Hugging Face:

If the exported model behaves worse in another runtime, Unsloth flags the most common cause: wrong chat template / EOS token at inference time (you must use the same chat template you trained with).

For more details read our inference guides:

Gemma 4 data best practices

Gemma 4 has a few formatting details you need to keep in mind.

1. Use standard chat roles

Gemma 4 uses the standard:

  • system

  • user

  • assistant

This means your SFT dataset should be written in regular chat format rather than older Gemma-specific role formats.

2. Thinking mode is explicit

If you want to preserve thinking-style behavior during SFT:

  • keep the format consistent

  • decide whether you want to train on visible thought blocks or on final answers only

  • do not mix multiple incompatible thought formats in the same dataset

For most production assistants, the simplest setup is to fine-tune on the final visible answer only.

3. Multi-turn rule

For multi-turn conversations, only keep the final visible answer in the conversation history. Do not feed earlier thought blocks back into later turns.

Last updated

Was this helpful?