Gemma 4 QAT

Run Google Gemma 4 QAT models locally, including E2B, E4B, 12B, 26B-A4B, and 31B.

Gemma 4 QAT (Quantization-Aware Training) models are Google DeepMind’s new Gemma 4 variants, designed to reduce memory requirements while preserving model quality. This makes it possible to run larger models, such as Gemma 4 26B-A4B, locally on consumer GPUs with as little as 16GB of RAM.

Gemma 4 QAT is trained with quantization in mind, so the 4-bit format uses ~72% less memory with near-original performance. Two special mobile quants of E2B and E4B are also provided, which use a mixture of quant widths.

Naively converting the QAT checkpoint to Q4_0 gets only 70.2% top-1 accuracy for 26B-A4B. We applied our Unsloth Dynamic method to push it up to 85.6% (+15.4 percentage points) whilst also being about 200MB smaller!

Gemma 4 QAT includes: E2B, E4B, 12B, 26B-A4B, and 31B. They are multimodal, hybrid-thinking models that support 140+ languages and up to 256K context.


Gemma-4-E2B QAT runs on 3GB of RAM, E4B on 5GB, 12B on 7GB, 26B-A4B on 15GB, and 31B on 18GB.

We name our Gemma 4 QAT GGUFs UD-Q4_K_XL because we found that Q4_0 degrades accuracy despite being bigger. See our Gemma 4 QAT GGUFs.

To compare int4 quantization, see the original vs. QAT size differences below. QAT uses ~72% less memory whilst retaining nearly all its original accuracy:

| Gemma 4 | QAT (int4) GGUF | Original BF16 | Size reduction |
| --- | --- | --- | --- |
| E2B | 2.62 GB | 9.31 GB | 71.86% |
| E4B | 4.22 GB | 15.1 GB | 72.05% |
| 12B | 6.72 GB | 23.8 GB | 71.76% |
| 26B A4B | 14.2 GB | 50.5 GB | 71.88% |
| 31B | 17.3 GB | 61.4 GB | 71.82% |
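The percentage column can be reproduced from the two size columns. The sketch below is only an arithmetic check; the dictionary and function names are illustrative, and the sizes are copied from the table above.

```python
# Sizes in GB, copied from the table above: (QAT int4 GGUF, original BF16).
SIZES_GB = {
    "E2B": (2.62, 9.31),
    "E4B": (4.22, 15.1),
    "12B": (6.72, 23.8),
    "26B A4B": (14.2, 50.5),
    "31B": (17.3, 61.4),
}


def size_reduction_percent(qat_gb: float, bf16_gb: float) -> float:
    """Return how much smaller the QAT file is than the BF16 file, in percent."""
    return (1.0 - qat_gb / bf16_gb) * 100.0


for name, (qat_gb, bf16_gb) in SIZES_GB.items():
    print(f"{name}: {size_reduction_percent(qat_gb, bf16_gb):.2f}% smaller")
```

Every row lands within rounding of the ~72% figure quoted above.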

Usage Guide

Gemma 4 QAT variants for E2B and E4B are designed for phones and laptops, while the larger 26B-A4B and 31B QAT models now work on laptops rather than just strong home GPUs.

There is only one GGUF file for each Gemma 4 QAT model because we found that precisions higher than the uploaded UD-Q4_K_XL version degrade accuracy rather than improve it. Use the original non-QAT Q4_0 quants here.

Hardware requirements

Table: Gemma 4 QAT Inference GGUF recommended hardware requirements (units = total memory: RAM + VRAM, or unified memory).

| Gemma 4 QAT | Requirements |
| --- | --- |
| E2B QAT | 3 GB |
| E4B QAT | 5 GB |
| 12B QAT | 7 GB |
| 26B A4B QAT | 15 GB |
| 31B QAT | 18 GB |
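As a rough aid for choosing a model, the table can be turned into a lookup. This is a minimal sketch: the function name is illustrative, the numbers are the recommended totals from the table, and it does not account for the extra memory that long contexts need for the KV cache.

```python
# Recommended total memory (RAM + VRAM, or unified memory) in GB, from the table above.
REQUIREMENTS_GB = {
    "E2B QAT": 3,
    "E4B QAT": 5,
    "12B QAT": 7,
    "26B A4B QAT": 15,
    "31B QAT": 18,
}


def models_that_fit(total_memory_gb: float) -> list[str]:
    """Return every model whose recommended memory fits, smallest first."""
    fitting = [
        (need, name)
        for name, need in REQUIREMENTS_GB.items()
        if need <= total_memory_gb
    ]
    return [name for need, name in sorted(fitting)]


print(models_that_fit(16))
```

With a 16 GB budget the largest entry returned is 26B A4B QAT, which matches the 16GB figure mentioned in the introduction.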

Recommended Settings

The QAT checkpoints use the same recommended Gemma 4 settings:

temperature = 1.0
top_p = 0.95
top_k = 64
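To make the three settings concrete, here is a small pure-Python sketch of what temperature, top_k and top_p do to a next-token distribution. It is an illustration only, not the code any particular runtime uses, and the order in which a real runtime applies these steps may differ. The function names are hypothetical.

```python
import math
import random


def filter_probs(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token_id: prob} after temperature, top-k, then top-p filtering."""
    # Temperature rescales logits before the softmax; 1.0 leaves them unchanged.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(weights)
    probs = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)

    # Top-k: keep only the k most likely tokens.
    probs = probs[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for p, i in probs:
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalise the survivors so they sum to 1.
    norm = sum(p for p, _ in kept)
    return {i: p / norm for p, i in kept}


def sample(logits, rng=random, **settings):
    """Draw one token id from the filtered distribution."""
    dist = filter_probs(logits, **settings)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]
```

With temperature = 1.0 the logits are untouched, so the recommended settings only trim the tail of the distribution: at most 64 candidates, and only as many as are needed to cover 95% of the probability mass.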

Gemma 4's max context is 128K for E2B and E4B, and 256K for 12B, 26B A4B, and 31B.

QAT Analysis

We found that naively converting the QAT Q4_0 checkpoint to llama.cpp's Q4_0 format degraded accuracy and was not aligned with the BF16 QAT lattice for Q4_0. We applied our Unsloth Dynamic method to force better agreement between the llama.cpp-compatible Q4_0 format and the true BF16 QAT Q4_0 format. This made the quants both smaller (Q6_K wasn't needed for embeddings) and more accurate!

Below is a table of KLD, top-1 accuracy, and disk space. Our versions dramatically improve both 99.9% KLD and mean KLD. E2B, for example, has a mean KLD of 0.00173 vs 0.05109 for a naive Q4_0 quantization (about 29.5x lower), and ours is even 22% smaller!
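For readers unfamiliar with these metrics, the sketch below shows one common way to compute them from per-token probability distributions. The definitions are assumptions, not taken from the text: "99.9% KLD" is read as the 99.9th percentile of per-token KL divergence, and "Top-1 %" as how often the quantized model's most likely token matches the BF16 reference. The function names are illustrative, and the quantized probabilities are assumed to be strictly positive.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two probability lists over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=values.__getitem__)


def compare(reference, quantized):
    """Compare per-token distributions from a reference and a quantized model."""
    klds = sorted(kl_divergence(p, q) for p, q in zip(reference, quantized))
    same_top1 = sum(argmax(p) == argmax(q) for p, q in zip(reference, quantized))
    # Nearest-rank percentile: smallest value with >= 99.9% of samples at or below it.
    rank = max(1, math.ceil(0.999 * len(klds)))
    return {
        "mean_kld": sum(klds) / len(klds),
        "kld_99_9": klds[rank - 1],
        "top1_percent": 100.0 * same_top1 / len(klds),
    }


# Two token positions over a three-token vocabulary (made-up numbers).
reference = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
quantized = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]]
print(compare(reference, quantized))
```

Mean KLD summarises typical drift, while the high percentile exposes the worst-case tokens, which is why the two can move quite differently between models.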

The main issue is that converting from QAT BF16 to llama.cpp's Q4_0 format is not lossless. llama.cpp uses F16 scales, whilst QAT BF16 uses BF16 scales, and the scales are not determined optimally in llama.cpp land.
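The two 16-bit formats do not represent the same set of numbers, which is one way a scale can change when it moves from BF16 to F16. The sketch below only demonstrates that general property using Python's standard library; it is not the conversion code used for these quants, and the text does not say which scale values are affected in practice. The helper names are illustrative.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite x to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Keep the top 16 bits of the float32 pattern, rounding on the discarded half.
    bits += 0x7FFF + ((bits >> 16) & 1)
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def to_f16(x: float) -> float:
    """Round a finite x to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# A BF16 value inside F16's normal range survives the trip unchanged,
# because F16 has more mantissa bits (10) than BF16 (7).
in_range = to_bf16(0.01)
print(in_range, to_f16(in_range) == in_range)

# A very small BF16 value falls into F16's subnormal range, where F16 can
# only store multiples of 2**-24, so it gets rounded.
tiny = 201 * 2.0 ** -26  # exactly representable in BF16
print(tiny, to_f16(tiny), to_f16(tiny) == tiny)
```

BF16 trades mantissa precision for a much wider exponent range, so values that are exact in BF16 can be rounded, flushed towards zero, or overflow when stored as F16.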

Naive conversion gets 24.77% byte exactness to BF16 QAT, whilst we found we can push it to 99.96% using some hacks!

| Model | Method | Disk (GB) | 99.9% KLD | Mean KLD | Top-1 % |
| --- | --- | --- | --- | --- | --- |
| E2B | Unsloth | 2.62 | 0.0557 | 0.00173 | 98.16 |
| E2B | Q4_0 | 3.35 | 1.0513 | 0.05109 | 89.29 |
| E4B | Unsloth | 4.22 | 0.0536 | 0.00121 | 98.54 |
| E4B | Q4_0 | 5.15 | 0.6722 | 0.03778 | 90.94 |
| 26B | Unsloth | 14.25 | 2.7087 | 0.09788 | 85.63 |
| 26B | Q4_0 | 14.44 | 4.5420 | 0.36094 | 70.20 |
| 31B | Unsloth | 17.29 | 1.3659 | 0.01403 | 96.67 |
| 31B | Q4_0 | 17.65 | 3.0030 | 0.09349 | 87.91 |
| 12B | Unsloth | 6.72 | 9.2740 | 0.13288 | 88.76 |
| 12B | Q4_0 | 6.98 | 14.7323 | 0.50702 | 74.08 |
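The headline comparisons can be recomputed from the table. The sketch below is a simple arithmetic check; the numbers are copied from the table above and the names are illustrative.

```python
# (disk GB, mean KLD, top-1 %) per model and method, copied from the table above.
RESULTS = {
    "E2B": {"Unsloth": (2.62, 0.00173, 98.16), "Q4_0": (3.35, 0.05109, 89.29)},
    "E4B": {"Unsloth": (4.22, 0.00121, 98.54), "Q4_0": (5.15, 0.03778, 90.94)},
    "26B": {"Unsloth": (14.25, 0.09788, 85.63), "Q4_0": (14.44, 0.36094, 70.20)},
    "31B": {"Unsloth": (17.29, 0.01403, 96.67), "Q4_0": (17.65, 0.09349, 87.91)},
    "12B": {"Unsloth": (6.72, 0.13288, 88.76), "Q4_0": (6.98, 0.50702, 74.08)},
}


def summarize(rows):
    """Return (mean-KLD ratio, top-1 gain in points, disk saving in percent)."""
    u_disk, u_kld, u_top1 = rows["Unsloth"]
    q_disk, q_kld, q_top1 = rows["Q4_0"]
    return q_kld / u_kld, u_top1 - q_top1, (1.0 - u_disk / q_disk) * 100.0


for model, rows in RESULTS.items():
    kld_ratio, top1_gain, disk_saving = summarize(rows)
    print(
        f"{model}: mean KLD {kld_ratio:.1f}x lower, "
        f"top-1 {top1_gain:+.2f} points, disk {disk_saving:.1f}% smaller"
    )
```

The disk saving is largest for E2B and E4B and much smaller for the bigger models, while the top-1 gain is largest for 26B and 12B.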

Mobile Mixture QAT

The Gemma 4 team also released special mobile mixture QAT versions of Gemma-4-E2B-it and Gemma-4-E4B-it. We faithfully converted them to a llama.cpp-compatible format and recovered nearly all of their accuracy. We used TQ2_0 for the 2-bit layers, with a negative scale.

We made UD-Q2_K_XL quants for both E2B and E4B.

| Metric | E2B mobile | E4B mobile |
| --- | --- | --- |
| Size | 2.19 GB | 3.22 GB |
| 2-bit (TQ2_0) tensors | 61 (incl. deep MLP) | 2 (embeddings only) |
| Mean KLD vs BF16 | 0.00409 | 0.00102 |
| Top-1 % | 97.82% | 98.76% |
| Base PPL | ~103 | 42.4 |

See gemma-4-E2B-it-qat-GGUF and gemma-4-E4B-it-qat-GGUF for UD-Q2_K_XL.

Run Gemma 4 QAT Tutorials

Because Gemma 4 GGUFs come in several sizes, the recommended starting point is 8-bit for the small models and Dynamic 4-bit for the larger models. Gemma 4 GGUFs:

You can run and train Gemma 4 QAT for free with a UI in our Unsloth Studio✨ notebook:

Google Colab: colab.research.google.com

🦥 Unsloth Studio Guide

Gemma 4 QAT can now be run and trained in Unsloth Studio, our new open-source web UI for local AI. Unsloth Studio runs locally on macOS, Windows, and Linux, and lets you:

Search, download, run GGUFs and safetensor models
Self-healing tool calling + web search
Code execution (Python, Bash)
Automatic inference parameter tuning (temp, top-p, etc.)
Fast CPU + GPU inference via llama.cpp
Train LLMs 2x faster with 70% less VRAM

Install Unsloth

Run in your terminal:

MacOS, Linux, WSL:

Windows PowerShell:

Launch Unsloth

MacOS, Linux, WSL and Windows:

Then open http://127.0.0.1:8888 (or your specific URL) in your browser.

Search and download Gemma 4 QAT

On first launch you will need to create a password to secure your account and sign in again.

Then go to the Studio Chat tab and search for Gemma 4 in the search bar and download your desired model and quant.

Run Gemma 4 QAT

Inference parameters should be auto-set when using Unsloth Studio, but you can still change them manually. You can also edit the context length, chat template, and other settings.

For more information, you can view our Unsloth Studio inference guide.

🦙 Llama.cpp Guide

For this guide there is no need to select a quantization type since there is only one: UD-Q4_K_XL. See: Gemma 4 QAT collection. For these tutorials, we will be using llama.cpp for fast local inference, especially on CPU.

Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference. For Apple Mac / Metal devices, set -DGGML_CUDA=OFF then continue as usual - Metal support is on by default.

If you want to use llama.cpp directly to load models, you can follow the commands below for each model. UD-Q4_K_XL is the ONLY quantization type. You can also download via Hugging Face (step 3). This is similar to ollama run. Use export LLAMA_CACHE="folder" to force llama.cpp to save to a specific location.

26B-A4B:

31B:

E4B:

E2B:

Download the model from Hugging Face (after running pip install huggingface_hub hf_transfer). You can choose UD-Q4_K_XL or other quantized versions like Q8_0. If downloads get stuck, see: Hugging Face Hub, XET debugging
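One way to do this from Python is sketched below, using snapshot_download from huggingface_hub with filename patterns so that only the chosen quant and the vision projector are fetched. The helper names are hypothetical, the filename patterns assume that file names contain the quant name and "mmproj-F16", and the repo id in the comment is a guess: the repo name gemma-4-E2B-it-qat-GGUF appears earlier in this guide, but the "unsloth/" owner prefix is assumed.

```python
def qat_patterns(quant: str = "UD-Q4_K_XL", with_vision: bool = True) -> list[str]:
    """Glob patterns selecting one quant, plus the vision projector if wanted."""
    patterns = [f"*{quant}*"]
    if with_vision:
        # Assumed file naming, based on the "mmproj-F16" mentioned in this guide.
        patterns.append("*mmproj-F16*")
    return patterns


def download(repo_id: str, local_dir: str) -> str:
    """Download only the matching files and return the local folder path."""
    # Imported lazily so qat_patterns() works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        local_dir=local_dir,
        allow_patterns=qat_patterns(),
    )


# Example call (assumed repo id, requires network access):
# download("unsloth/gemma-4-E2B-it-qat-GGUF", "gemma-4-E2B-it-qat-GGUF")
```

Filtering with allow_patterns avoids pulling every file in the repository when only one quant is needed.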

Then run the model in conversation mode (with vision mmproj-F16):

Llama-server deployment

To deploy Gemma-4 on llama-server, use:

To disable thinking / reasoning, use --chat-template-kwargs '{"enable_thinking":false}'

If you're on Windows PowerShell, use: --chat-template-kwargs "{\"enable_thinking\":false}"

Set the value to true or false to turn thinking on or off.
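The two quoting styles above differ only in how the shell is told to pass the JSON through untouched. If you build the command from a script, a small helper can produce both forms. This is a sketch using only the Python standard library; the function names are illustrative.

```python
import json
import shlex


def chat_template_kwargs(enable_thinking: bool) -> str:
    """Compact JSON value for --chat-template-kwargs."""
    return json.dumps({"enable_thinking": enable_thinking}, separators=(",", ":"))


def posix_arg(enable_thinking: bool) -> str:
    """Quote the JSON for bash, zsh and other POSIX shells."""
    return shlex.quote(chat_template_kwargs(enable_thinking))


def powershell_arg(enable_thinking: bool) -> str:
    """Quote the JSON the way the PowerShell example above does."""
    return '"' + chat_template_kwargs(enable_thinking).replace('"', '\\"') + '"'


print("--chat-template-kwargs", posix_arg(False))
print("--chat-template-kwargs", powershell_arg(False))
```

Note that JSON booleans are lowercase (true, false), which json.dumps handles when given Python's True or False.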


