This model was contributed to Hugging Face Transformers on 2026-04-02.

Gemma4

Overview

Gemma 4 is a multimodal model with pretrained and instruction-tuned variants, available in E2B, E4B, 31B and 26B-A4B (MoE) parameter sizes. Gemma 4 models provide the following capabilities:

Reasoning: All models in the family are designed as highly capable reasoners, with configurable thinking modes.
Extended Multimodalities: Processes Text, Image with variable aspect ratio and resolution support (all models), Video, and Audio (featured natively on the E2B and E4B models).
Increased Context Window: Small models feature a 128K context window, while the other models support 256K.
Enhanced Coding & Agentic Capabilities: Achieves notable improvements in coding benchmarks alongside built-in function-calling support, powering highly capable autonomous agents.
Native System Prompt Support: Gemma 4 introduces built-in support for the system role, enabling more structured and controllable conversations.

You can find all the original Gemma 4 checkpoints under the Gemma 4 release.

Gemma4 Vision Model

The key vision change from previous Gemma releases is that images of different sizes are processed within a fixed token budget. Unlike many models that squash every image into a fixed square (such as 224×224), Gemma 4 preserves the image's natural aspect ratio and resizes it to fit that budget. The resized image must satisfy two constraints:

The total number of pixels must fit within a patch budget
Both height and width must be divisible by 48 (= patch size 16 × pooling kernel 3)
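The two constraints above can be sketched as a small sizing helper. This is only an illustration under stated assumptions, not the image processor's actual algorithm: `fit_to_patch_budget` is a hypothetical name, the helper rounds each side down to a multiple of 48, and it will upscale images that are smaller than the budget.

```python
import math

PATCH_SIZE = 16                               # from the constraint list above
POOLING_KERNEL = 3                            # from the constraint list above
SIDE_MULTIPLE = PATCH_SIZE * POOLING_KERNEL   # 48


def fit_to_patch_budget(height: int, width: int, max_patches: int) -> tuple[int, int]:
    """Hypothetical helper: choose a (height, width) that roughly keeps the
    aspect ratio, is divisible by 48 on both sides, and fits the patch budget."""
    if max_patches < POOLING_KERNEL ** 2:
        raise ValueError("budget is smaller than a single 48x48 tile")
    max_pixels = max_patches * PATCH_SIZE * PATCH_SIZE
    # Uniform scale factor that would make the image area equal the pixel budget.
    scale = math.sqrt(max_pixels / (height * width))
    # Round each side down to a multiple of 48, but never below one 48-pixel tile.
    target_h = max(SIDE_MULTIPLE, math.floor(height * scale / SIDE_MULTIPLE) * SIDE_MULTIPLE)
    target_w = max(SIDE_MULTIPLE, math.floor(width * scale / SIDE_MULTIPLE) * SIDE_MULTIPLE)
    # Clamping a very thin side up to 48 can push the area over budget;
    # in that case shrink the longer side until the area fits again.
    if target_h * target_w > max_pixels:
        if target_h >= target_w:
            target_h = max(SIDE_MULTIPLE, (max_pixels // target_w) // SIDE_MULTIPLE * SIDE_MULTIPLE)
        else:
            target_w = max(SIDE_MULTIPLE, (max_pixels // target_h) // SIDE_MULTIPLE * SIDE_MULTIPLE)
    return target_h, target_w


# A 1080x1920 frame under a 2,520-patch budget.
print(fit_to_patch_budget(1080, 1920, max_patches=2520))  # (576, 1056)
```

In this example the result covers 36 × 66 = 2,376 patches, which stays under the 2,520-patch budget while keeping the aspect ratio close to 16:9.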

Important

Gemma 4 does not apply the standard ImageNet mean/std normalization that many other vision models use. The model's own patch embedding layer handles the final scaling internally (shifting values to the [-1, 1] range).
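As a rough illustration of the difference, the sketch below contrasts ImageNet-style normalization with a plain shift to [-1, 1]. It assumes the pixel values reaching the patch embedding are already rescaled to [0, 1], and both function names are hypothetical; the real scaling happens inside the model's patch embedding layer.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)  # standard ImageNet statistics, NOT used by Gemma 4
IMAGENET_STD = (0.229, 0.224, 0.225)


def imagenet_normalize(rgb: tuple[float, float, float]) -> tuple[float, ...]:
    """What many other vision models do: per-channel (x - mean) / std."""
    return tuple((x - m) / s for x, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))


def shift_to_signed_unit_range(pixels: list[float]) -> list[float]:
    """Assumed form of the internal scaling: map [0, 1] to [-1, 1] via 2 * (x - 0.5)."""
    return [2.0 * (p - 0.5) for p in pixels]


print(shift_to_signed_unit_range([0.0, 0.5, 1.0]))  # [-1.0, 0.0, 1.0]
```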

The number of "soft tokens" (aka vision tokens) an image processor can produce is configurable. The supported options are outlined below and the default is 280 soft tokens per image.

Soft Tokens	Patches (before pooling)	Approx. Image Area
70	630	~161K pixels
140	1,260	~323K pixels
280	2,520	~645K pixels
560	5,040	~1.3M pixels
1,120	10,080	~2.6M pixels
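The table rows follow from the patch size and pooling kernel given earlier: each soft token pools a 3 × 3 group of patches, and each patch covers 16 × 16 pixels. A minimal sketch of that arithmetic (`budget_for` is a hypothetical helper name):

```python
PATCH_SIZE = 16      # pixels per patch side
POOLING_KERNEL = 3   # patches pooled per soft-token side


def budget_for(soft_tokens: int) -> tuple[int, int]:
    """Return (patches before pooling, pixel area) for a soft-token budget."""
    patches = soft_tokens * POOLING_KERNEL ** 2
    pixels = patches * PATCH_SIZE ** 2
    return patches, pixels


for soft_tokens in (70, 140, 280, 560, 1120):
    patches, pixels = budget_for(soft_tokens)
    print(f"{soft_tokens:>5} soft tokens -> {patches:>6,} patches -> {pixels:>9,} pixels")
```

For the default of 280 soft tokens this gives 2,520 patches and 645,120 pixels, matching the ~645K entry in the table.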

To encode positional information for each patch in the image, Gemma 4 uses a learned 2D position embedding table. The table stores up to 10,240 positions per axis, which allows the model to handle very large images, and each position is a learned vector with the same dimension as the patch embedding. Gemma 4 also uses 2D RoPE, which independently rotates half of the attention head dimensions by the x-axis position and the other half by the y-axis position. This allows the model to understand spatial relationships like "above," "below," "left of," and "right of."
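The split between the two axes can be sketched in a few lines of framework-free Python. This is an illustration of the general 2D RoPE idea, not the Gemma 4 implementation: the function names, the base frequency of 10000, and the way dimensions are paired within each half are all assumptions.

```python
import math


def rotate_pairs(vec: list[float], position: float, base: float = 10000.0) -> list[float]:
    """Plain 1D RoPE: rotate each (vec[i], vec[i + half]) pair by a
    position-dependent angle. The base and pairing scheme are assumptions."""
    half = len(vec) // 2
    out = list(vec)
    for i in range(half):
        angle = position * base ** (-i / half)
        cos, sin = math.cos(angle), math.sin(angle)
        a, b = vec[i], vec[i + half]
        out[i] = a * cos - b * sin
        out[i + half] = a * sin + b * cos
    return out


def apply_rope_2d(head: list[float], x: float, y: float) -> list[float]:
    """Rotate the first half of the head dimensions by the patch's x position
    and the second half by its y position."""
    if len(head) % 4 != 0:
        raise ValueError("head dimension must be divisible by 4")
    half = len(head) // 2
    return rotate_pairs(head[:half], x) + rotate_pairs(head[half:], y)


query = [0.1, -0.2, 0.3, 0.4, 0.5, -0.6, 0.7, 0.8]
rotated = apply_rope_2d(query, x=3, y=5)
# Changing only y leaves the x half of the vector untouched.
print(rotated[:4] == apply_rope_2d(query, x=3, y=9)[:4])  # True
```

Because each half is a pure rotation, the dot product between a rotated query and a rotated key depends only on the x and y offsets between the two patches, which is what lets attention pick up relative layout.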

Per-Layer Embeddings (PLE)

Gemma 4 introduces a Per-Layer Embeddings (PLE) system that feeds an auxiliary residual signal into each decoder layer, rather than relying solely on a single shared embedding at the input.

PLE combines two components that are summed and scaled by 1/√2 before being fed to each decoder layer:

Token-identity (get_per_layer_inputs): looks up input_ids in embed_tokens_per_layer, a Gemma4TextScaledWordEmbedding that multiplies by √(hidden_size_per_layer_input). The packed output is reshaped from [batch, seq, num_hidden_layers * hidden_size_per_layer_input] to [batch, seq, num_hidden_layers, hidden_size_per_layer_input].
Context-aware (project_per_layer_inputs): projects inputs_embeds through per_layer_model_projection (a Linear layer), scales by 1/√(hidden_size), reshapes to [batch, seq, num_hidden_layers, hidden_size_per_layer_input], and normalizes with per_layer_projection_norm (RMSNorm).

When both components are available, the final per-layer input is (token_identity + context_aware) * (1/√2). For multimodal inputs where input_ids are not available, only the context-aware projection is used.
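The combination described above can be traced for a single token position with plain Python lists. This is a simplified sketch, not the model code: `per_layer_inputs`, `split_per_layer`, and `rms_norm` are hypothetical helpers, the RMSNorm here has no learned weight, and `token_row` and `projection` stand in for one row of `embed_tokens_per_layer` and the weight of `per_layer_model_projection`.

```python
import math
from typing import Optional


def rms_norm(vec: list[float], eps: float = 1e-6) -> list[float]:
    """RMSNorm without a learned weight (a simplification)."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / rms for v in vec]


def split_per_layer(packed: list[float], num_layers: int, ple_dim: int) -> list[list[float]]:
    """Reshape a packed [num_layers * ple_dim] vector to [num_layers][ple_dim]."""
    return [packed[i * ple_dim:(i + 1) * ple_dim] for i in range(num_layers)]


def per_layer_inputs(
    token_row: Optional[list[float]],   # packed embedding row, or None if input_ids are unavailable
    inputs_embed: list[float],          # [hidden_size]
    projection: list[list[float]],      # [num_layers * ple_dim][hidden_size]
    num_layers: int,
    ple_dim: int,
) -> list[list[float]]:
    hidden_size = len(inputs_embed)
    # Context-aware component: project, scale by 1/sqrt(hidden_size), reshape, normalize.
    projected = [
        sum(w * x for w, x in zip(row, inputs_embed)) / math.sqrt(hidden_size)
        for row in projection
    ]
    context_aware = [rms_norm(c) for c in split_per_layer(projected, num_layers, ple_dim)]
    if token_row is None:
        # Without input_ids, only the context-aware projection is used.
        return context_aware
    # Token-identity component: scaled lookup, then reshape to one vector per layer.
    scaled = [v * math.sqrt(ple_dim) for v in token_row]
    token_identity = split_per_layer(scaled, num_layers, ple_dim)
    # Sum both components and scale by 1/sqrt(2).
    scale = 1.0 / math.sqrt(2.0)
    return [
        [(t + c) * scale for t, c in zip(t_layer, c_layer)]
        for t_layer, c_layer in zip(token_identity, context_aware)
    ]


projection = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
ple = per_layer_inputs([1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0], projection, num_layers=2, ple_dim=2)
print(len(ple), len(ple[0]))  # 2 2
```

Each decoder layer then receives its own `ple_dim`-sized slice of the result.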

Usage examples

The examples below demonstrate how to generate text based on an image, first with [Pipeline] and then with the [AutoModel] class.

from transformers import pipeline


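# Use a separate name so the imported `pipeline` factory is not shadowed.
pipe = pipeline(
    task="image-text-to-text",
    model="google/gemma-4-E2B-it",
)
# The call returns the generated results for the given image and prompt.
pipe(
    images="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
    text="<|image|>\n\nWhat is shown in this image?"
)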

from transformers import AutoModelForImageTextToText, AutoProcessor


model = AutoModelForImageTextToText.from_pretrained(
    "google/gemma-4-E2B-it",
    device_map="auto",
    attn_implementation="sdpa"
)
processor = AutoProcessor.from_pretrained(
    "google/gemma-4-E2B-it",
    padding_side="left"
)

messages = [
    {
        "role": "user", "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"},
            {"type": "text", "text": "What is shown in this image?"},
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

output = model.generate(**inputs, max_new_tokens=50, cache_implementation="static")
print(processor.decode(output[0][input_len:], skip_special_tokens=True))

Function calling

from transformers import AutoModelForCausalLM, AutoProcessor


WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_n_day_weather_forecast",
        "description": "Get an N-day weather forecast",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA",
                },
                "format": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "The temperature unit to use",
                },
                "num_days": {
                    "type": "integer",
                    "description": "The number of days to forecast",
                },
            },
            "required": ["location", "format", "num_days"],
        },
    },
}

messages = [
    {
        "role": "user",
        "content": "What's the weather like the next 3 days in San Francisco, CA (using F)?",
    },
]

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-E2B-it",
    device_map="auto",
    attn_implementation="sdpa"
)
processor = AutoProcessor.from_pretrained(
    "google/gemma-4-E2B-it",
    padding_side="left"
)

text = processor.apply_chat_template(
    messages,
    tools=[WEATHER_TOOL],
    tokenize=False,
    add_generation_prompt=True,
)

inputs = processor(text=text, return_tensors="pt").to(model.device)
input_len = inputs["input_ids"].shape[-1]

outputs = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(outputs[0][input_len:], skip_special_tokens=False))

Audio (E2B and E4B Only)

from transformers import AutoModelForMultimodalLM, AutoProcessor


messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Please transcribe the following audio:"},
            {
                "type": "audio",
                "url": "https://huggingface.co/datasets/eustlb/audio-samples/resolve/main/dude_where_is_my_car.wav",
            },
        ],
    }
]

model = AutoModelForMultimodalLM.from_pretrained(
    "google/gemma-4-E2B-it",
    device_map="auto",
    attn_implementation="sdpa"
)
processor = AutoProcessor.from_pretrained(
    "google/gemma-4-E2B-it",
    padding_side="left"
)

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device, dtype=model.dtype)

input_len = inputs["input_ids"].shape[-1]

outputs = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(outputs[0][input_len:], skip_special_tokens=False))

Gemma4AudioConfig

[[autodoc]] Gemma4AudioConfig

Gemma4VisionConfig

[[autodoc]] Gemma4VisionConfig

Gemma4TextConfig

[[autodoc]] Gemma4TextConfig

Gemma4Config

[[autodoc]] Gemma4Config

Gemma4AudioFeatureExtractor

[[autodoc]] Gemma4AudioFeatureExtractor - __call__

Gemma4ImageProcessorPil

[[autodoc]] Gemma4ImageProcessorPil - preprocess

Gemma4ImageProcessor

[[autodoc]] Gemma4ImageProcessor - preprocess

Gemma4VideoProcessor

[[autodoc]] Gemma4VideoProcessor - preprocess

Gemma4Processor

[[autodoc]] Gemma4Processor - __call__

Gemma4PreTrainedModel

[[autodoc]] Gemma4PreTrainedModel - forward

Gemma4AudioModel

[[autodoc]] Gemma4AudioModel - forward

Gemma4VisionModel

[[autodoc]] Gemma4VisionModel - forward

Gemma4TextModel

[[autodoc]] Gemma4TextModel - forward

Gemma4ForCausalLM

[[autodoc]] Gemma4ForCausalLM

Gemma4Model

[[autodoc]] Gemma4Model - forward

Gemma4ForConditionalGeneration

[[autodoc]] Gemma4ForConditionalGeneration - forward
