Apache 2.0 License Enables Unrestricted Commercial Deployment
Gemma 4's standout feature is its pure Apache 2.0 license, allowing full modification, fine-tuning, and commercial deployment without custom restrictions such as non-compete clauses. This addresses past frustrations with Gemma 3's more limited custom license and positions it competitively against Llama and Qwen. Built on Gemini 3 research, these models trickle flagship innovations down into open weights, letting builders ship production AI features like local coding assistants or on-device agents without legal hurdles.
Native Multimodality and Reasoning Boost Agentic Workflows
All four models integrate vision, audio, long chain-of-thought reasoning, and function calling at the architecture level—not as bolted-on prompts. Reasoning spans text, images, and audio (on the edge models), improving scores on benchmarks such as MMMU Pro and SWE-Bench Pro. Function calling supports multi-turn agentic flows with multiple tools, outperforming instruction-following hacks. The edge models (E2B, E4B) handle ASR, speech-to-text translation (e.g., English to Japanese), and interleaved multi-image inputs for video or OCR. The workstation models excel at code generation, completion, and correction, covering 140 languages in pre-training and 35 in fine-tuning.
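A multi-turn agentic flow boils down to a loop of model-emitted tool calls, local dispatch, and results fed back as new turns. The sketch below is a generic, framework-free illustration of that pattern; the `get_weather` tool and the message shapes are invented for the example and are not a specific Gemma 4 API.

```python
# Minimal sketch of one turn in a function-calling loop.
# The tool, its schema, and the message format are illustrative assumptions.
import json

def get_weather(city: str) -> str:
    """Toy tool: returns a canned forecast."""
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}  # tool registry the dispatcher consults

messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]

# Pretend the model emitted this structured tool call after seeing the
# tool schemas in its prompt (a real model returns this as model output):
tool_call = {"name": "get_weather", "arguments": {"city": "Tokyo"}}
messages.append({"role": "assistant", "tool_calls": [tool_call]})

# Dispatch the call locally and append the result for the next model turn:
result = TOOLS[tool_call["name"]](**tool_call["arguments"])
messages.append({"role": "tool", "name": tool_call["name"], "content": result})

print(json.dumps(messages, indent=2))
```

With multiple tools, the model may chain several such turns before producing a final answer, which is what distinguishes native function calling from single-shot instruction-following tricks.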
Optimized Architectures for Edge Efficiency and Workstation Power
Workstation tier: a 31B dense model (fewer layers, value normalization, and attention optimized for 256K context) and a 26B MoE (128 tiny experts with 3.8B-4B active parameters plus a shared expert, delivering 27B-class intelligence at roughly 4B compute cost). Both support 256K context and native aspect-ratio vision encoders for document understanding. Edge tier: E2B and E4B with 128K context, a compressed audio encoder (305M parameters, 87MB, versus the prior 681M/390MB, with 40ms frames for responsive transcription), and a 150M vision encoder (down from 300-350M). QAT checkpoints preserve quality at low precision; run the edge models on T4 GPUs or phones, and the workstation models on H100s, RTX 6000s, or serverless Cloud Run with G4 GPUs (96GB VRAM).
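The parameter figures above translate directly into rough memory budgets. A back-of-envelope sketch, using only the numbers quoted in this section (31B dense, 26B MoE with ~4B active, 96GB VRAM on G4 GPUs):

```python
# Back-of-envelope weight-memory math for the workstation models.
# All parameter counts come from the summary above; precisions are
# illustrative (bf16 = 2 bytes/param, int4 QAT = 0.5 bytes/param).

def weight_gib(params_b: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GiB for a given parameter count (billions)."""
    return params_b * 1e9 * bytes_per_param / 2**30

dense_bf16 = weight_gib(31, 2)     # 31B dense at bf16
moe_int4 = weight_gib(26, 0.5)     # 26B MoE at 4-bit QAT
active_bf16 = weight_gib(4, 2)     # ~4B active params per token

print(f"31B dense, bf16:   {dense_bf16:.1f} GiB")
print(f"26B MoE, int4 QAT: {moe_int4:.1f} GiB")
print(f"~4B active, bf16:  {active_bf16:.1f} GiB touched per token")
```

The 31B dense model at bf16 lands under 60 GiB of weights, which fits comfortably in a 96GB G4 GPU with room left for the KV cache of a long context; the QAT MoE checkpoint is small enough for far more modest hardware.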
Hands-On Usage Yields Immediate Results
Enable thinking via the chat template (`enable_thinking=True`) for better outputs on tasks such as proposing deep learning use cases in finance. Process images and videos with `AutoProcessor` for detailed scene breakdowns (e.g., "girl on beach with dog"). Audio demos transcribe dual voices accurately or translate speech end-to-end. Base and instruction-tuned versions on Hugging Face suit fine-tuning; expect strong results from solid base models, a clear step up from Gemma 3's 32K context, older encoders, and text/vision-only inputs.
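The image-description flow above can be sketched as a chat-template request. This is a hypothetical sketch assuming a Hugging Face Transformers-style API: the model id, the image URL, and the `enable_thinking` template kwarg are assumptions drawn from this summary, not a confirmed interface. Only the message payload is constructed and checked here; the model call is shown as comments.

```python
# Hypothetical multimodal chat payload for an image-description task.
# The content-part format mirrors Transformers-style multimodal messages;
# the model id and enable_thinking kwarg are assumptions, not confirmed API.

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "beach_scene.jpg"},  # placeholder image
            {"type": "text", "text": "Describe this scene in detail."},
        ],
    }
]

# With a real checkpoint, the flow would look roughly like (not run here):
# from transformers import AutoProcessor, AutoModelForImageTextToText
# processor = AutoProcessor.from_pretrained("<gemma-4 model id>")  # placeholder
# model = AutoModelForImageTextToText.from_pretrained("<gemma-4 model id>")
# inputs = processor.apply_chat_template(
#     messages, add_generation_prompt=True, enable_thinking=True,
#     tokenize=True, return_dict=True, return_tensors="pt",
# )
# outputs = model.generate(**inputs, max_new_tokens=256)

# Sanity-check the payload shape locally:
assert messages[0]["role"] == "user"
assert {part["type"] for part in messages[0]["content"]} == {"image", "text"}
print("message payload ok")
```

Audio inputs follow the same pattern with an audio content part in place of the image one.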