Gemma 4 QAT
Run Google Gemma 4 QAT models locally, including E2B, E4B, 12B, 26B-A4B, and 31B.
Gemma 4 QAT (Quantization-Aware Training) is Google DeepMind’s new Gemma 4 variants designed to reduce memory requirements while preserving model quality. This makes it possible to run larger models, such as Gemma 4 26B-A4B, locally on consumer GPUs with as little as 16GB of RAM.
Gemma 4 QAT is trained with quantization in mind, allowing 4-bit format to have ~72% lower memory usage with near original performance. 2 special mobile quants of E2B and E4B are also provided which uses a mixture of quant widths.
Converting to Q4_0 from QAT naively gets only 70.2% top-1 % accuracy for 26B-A4B. We applied our Unsloth Dynamic method to push it up to 85.6% (+15.6%) whilst also being 200MB smaller!
Gemma 4 QAT includes: E2B, E4B, 12B, 26B-A4B, and 31B. They are multimodal, hybrid-thinking models that support 140+ languages and up to 256K context.
Gemma-4-E2B QAT runs on 3GB RAM, E4B on 5GB, 12B on 7GB, 26-A4B on 15GB and 31B on 18GB.
We name our Gemma 4 QAT GGUFs as UD-Q4_K_XL as we found q4_0 to degrade accuracy despite being bigger. See our Gemma 4 QAT GGUFs.
To compare int4 quantization, see the original vs. QAT size differences below. QAT uses ~72% less memory whilst retaining nearly all its original accuracy:

E2B
2.62 GB
9.31 GB
71.86%
E4B
4.22 GB
15.1 GB
72.05%
12B
6.72 GB
23.8 GB
71.76%
26B A4B
14.2 GB
50.5 GB
71.88%
31B
17.3 GB
61.4 GB
71.82%
Usage Guide
Gemma 4 QAT variants for E2B and E4B are designed for phones and laptops, while the larger 26B-A4B and 31B QAT models now work on laptops rather than just strong home GPUs.
There is only one GGUF file for each Gemma 4 model because we found that precisions higher than the uploaded UD-Q4_K_XL version degrade accuracy rather than improve it. Use the original non QAT Q4_0 quants here.

Hardware requirements
Table: Gemma 4 QAT Inference GGUF recommended hardware requirements (units = total memory: RAM + VRAM, or unified memory).
E2B QAT
3 GB
E4B QAT
5 GB
12B QAT
7 GB
26B A4B QAT
15 GB
31B QAT
18 GB
Recommended Settings
The QAT checkpoints use the same recommended Gemma 4 settings:
temperature = 1.0top_p = 0.95top_k = 64
Gemma 4's max context is 128K for E2B, E4B and 256K for 12B, 26B A4B, 31B.
QAT Analysis
We found that naively converting the QAT Q4_0 checkpoint to Q4_0 in llama.cpp land actually degraded accuracy and was not actually aligned with the BF16 QAT lattice for Q4_0. We applied our Unsloth dynamic method to force a better agreement between the llama.cpp compatible Q4_0 format and the true BF16 QAT Q4_0 format, and managed to both make the quants smaller (Q6_K wasn't needed for embeddings), and also more accurate!

Below is a table of KLD and Top 1% accuracy and Disk space. You can see our versions dramatically improve on 99.9% KLD and mean KLD. E2B for example has a mean KLD of 0.00173 vs 0.05109 (29x better relatively) for a naive Q4_0 quantization, and ours is even 22% smaller!
The main issue is converting from QAT BF16 to llama.cpp's Q4_0 format is not lossless. llama.cpp uses F16 scales, whilst QAT BF16 uses BF16 scales, and the scales are not determined optimally in llama.cpp land.
Naive conversion gets 24.77% byte exactness to BF16 QAT, whilst we found we can push it to 99.96% using some hacks!
E2B
Unsloth
2.62
0.0557
0.00173
98.16
E2B
Q4_0
3.35
1.0513
0.05109
89.29
E4B
Unsloth
4.22
0.0536
0.00121
98.54
E4B
Q4_0
5.15
0.6722
0.03778
90.94
26B
Unsloth
14.25
2.7087
0.09788
85.63
26B
Q4_0
14.44
4.5420
0.36094
70.20
31B
Unsloth
17.29
1.3659
0.01403
96.67
31B
Q4_0
17.65
3.0030
0.09349
87.91
12B
Unsloth
6.72
9.2740
0.13288
88.76
12B
Q4_0
6.98
14.7323
0.50702
74.08
Mobile Mixture QAT
The Gemma-4 team also released special mobile mixture QAT versions of Gemma-4-E2B-it and Gemma-4-E4B-it. We also faithfully converted them to llama.cpp compatible format, and also recovered nearly all accuracy as well. We used TQ2_0 for the 2-bit layers and did a negative scaler.
We made UD-Q2_K_XL quants for both E2B and E4B.
Size
2.19 GB
3.22 GB
2-bit (TQ2_0) tensors
61 (incl. deep MLP)
2 (embeddings only)
Mean KLD vs BF16
0.00409
0.00102
Top-1 %
97.82%
98.76%
Base PPL
~103
42.4
See gemma-4-E2B-it-qat-GGUF and gemma-4-E4B-it-qat-GGUF for UD-Q2_K_XL.
Run Gemma 4 QAT Tutorials
Because Gemma 4 GGUFs comes in several sizes, the recommended starting point for the small models is 8-bit and the larger models is Dynamic 4-bit. Gemma 4 GGUFs:
🦥 Unsloth Studio Guide🦙 Llama.cpp Guide
You can run and train Gemma 4 QAT for free with a UI in our Unsloth Studio✨ notebook:
🦥 Unsloth Studio Guide
Gemma 4 QAT can now be run and trained in Unsloth Studio, our new open-source web UI for local AI. Unsloth Studio lets you run models locally on MacOS, Windows, Linux and:
Search, download, run GGUFs and safetensor models
Self-healing tool calling + web search
Code execution (Python, Bash)
Automatic inference parameter tuning (temp, top-p, etc.)
Fast CPU + GPU inference via llama.cpp
Train LLMs 2x faster with 70% less VRAM

Search and download Gemma 4 QAT
On first launch you will need to create a password to secure your account and sign in again.
Then go to the Studio Chat tab and search for Gemma 4 in the search bar and download your desired model and quant.
Run Gemma 4 QAT
Inference parameters should be auto-set when using Unsloth Studio, however you can still change it manually. You can also edit the context length, chat template and other settings.
For more information, you can view our Unsloth Studio inference guide.

🦙 Llama.cpp Guide
For this guide there is no need to select quantization type since there is only one: UD-Q4_K_XL. See: Gemma 4 QAT collection. For these tutorials, we will using llama.cpp for fast local inference, especially if you have a CPU.
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference. For Apple Mac / Metal devices, set -DGGML_CUDA=OFF then continue as usual - Metal support is on by default.
If you want to use llama.cpp directly to load models, you can follow commands below, according to each model. UD-Q4_K_XL is the ONLY quantization type. You can also download via Hugging Face (step 3). This is similar to ollama run . Use export LLAMA_CACHE="folder" to force llama.cpp to save to a specific location.
26B-A4B:
31B:
E4B:
E2B:
Download the model via (after installing pip install huggingface_hub hf_transfer ). You can choose UD-Q4_K_XL or other quantized versions like Q8_0 . If downloads get stuck, see: Hugging Face Hub, XET debugging
Then run the model in conversation mode (with vision mmproj-F16):
Llama-server deployment
To deploy Gemma-4 on llama-server, use:
To disable thinking / reasoning, use --chat-template-kwargs '{"enable_thinking":false}'
If you're on Windows Powershell, use: --chat-template-kwargs "{\"enable_thinking\":false}"
Use 'true' and 'false' interchangeably.
Last updated
Was this helpful?

