Zamba2-VL: Hybrid Mamba2-Transformer Vision-Language Models

Hybrid Architecture for Efficiency

Zyphra’s Zamba2-VL family (1.2B, 2.7B, and 7B parameters) targets the latency bottlenecks of dense Transformer-based Vision-Language Models (VLMs). In place of a dense Transformer backbone, the models use a hybrid design: Mamba2 state-space layers, which run in time linear in sequence length, interleaved with shared Transformer attention blocks. The hybrid retains the in-context retrieval ability of attention while gaining the computational efficiency of state-space models (SSMs).
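The interleaving described above can be pictured as a layer schedule. The following is a minimal sketch, not Zamba2-VL's actual configuration: the layer count, the interleaving period, the names `LayerSpec` and `build_hybrid_schedule`, and the reading of "shared" as one set of attention weights reused at every attention slot are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerSpec:
    kind: str         # "mamba2" or "shared_attention"
    param_group: str  # layers with the same group reuse one set of weights


def build_hybrid_schedule(num_mamba_layers: int, attention_every: int) -> list:
    """Place one shared attention block after every `attention_every` Mamba2 layers.

    Both arguments are hypothetical knobs, not published Zamba2-VL settings.
    """
    if num_mamba_layers <= 0 or attention_every <= 0:
        raise ValueError("both arguments must be positive")
    schedule = []
    for i in range(num_mamba_layers):
        # Each Mamba2 layer owns its own parameters.
        schedule.append(LayerSpec("mamba2", f"mamba2_{i}"))
        if (i + 1) % attention_every == 0:
            # Assumption: every attention slot points at the same parameter group.
            schedule.append(LayerSpec("shared_attention", "attn_shared"))
    return schedule


def count_unique_param_groups(schedule) -> int:
    """Number of distinct weight sets the schedule needs."""
    return len({spec.param_group for spec in schedule})


schedule = build_hybrid_schedule(num_mamba_layers=12, attention_every=6)
print([spec.kind for spec in schedule])
print(len(schedule), "layer slots,", count_unique_param_groups(schedule), "weight sets")
```

Under these assumptions the schedule has more layer slots than weight sets, which is the sense in which shared attention adds retrieval capacity without a proportional increase in parameters.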

Performance and Latency Gains

The primary advantage of Zamba2-VL is its efficiency on long-context multimodal inputs. In a dense Transformer, attention compute grows quadratically with sequence length and the KV cache grows linearly, so both become expensive when high-resolution images or video produce long token sequences. Zamba2-VL reduces this cost because its Mamba2 layers carry a fixed-size recurrent state rather than a per-token cache, leaving attention cost only in the interleaved shared blocks. The result is a time-to-first-token (TTFT) approximately an order of magnitude lower than comparable Transformer-based models on 32k-token prefill tasks. The models lag behind larger baselines on complex reasoning benchmarks such as MMMU and MathVista, but they are competitive on perception-heavy tasks such as document understanding (DocVQA) and visual counting (PixMoCount).
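To make the scaling concrete: attention compute grows with the square of the sequence length, a KV cache grows linearly with it, and a recurrent state does not grow at all. The sketch below uses rough textbook cost formulas and hypothetical dimensions (layer counts, head sizes, and state size are illustrative assumptions, not Zamba2-VL measurements), and it ignores causal masking and kernel-level optimizations.

```python
def attention_score_flops(seq_len: int, num_layers: int, d_model: int) -> int:
    """Rough prefill FLOPs for QK^T plus the attention-weighted sum over V.

    About 4 * n^2 * d per layer; projections and causal masking are ignored.
    """
    return 4 * seq_len * seq_len * d_model * num_layers


def kv_cache_bytes(seq_len: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Keys and values: two tensors per layer, one row per cached token."""
    return 2 * num_layers * seq_len * num_kv_heads * head_dim * bytes_per_value


def ssm_state_bytes(num_layers: int, num_heads: int, head_dim: int,
                    state_dim: int, bytes_per_value: int = 2) -> int:
    """Recurrent state per layer; note that seq_len is not an argument."""
    return num_layers * num_heads * head_dim * state_dim * bytes_per_value


# Hypothetical dimensions, chosen only to show the trend.
for n in (4_096, 32_768):
    flops = attention_score_flops(n, num_layers=32, d_model=2048)
    cache = kv_cache_bytes(n, num_layers=32, num_kv_heads=8, head_dim=128)
    state = ssm_state_bytes(num_layers=32, num_heads=32, head_dim=64, state_dim=128)
    print(f"n={n}: attention FLOPs={flops:.3e}, KV cache={cache} B, SSM state={state} B")
```

Going from 4k to 32k tokens multiplies the attention term by 64 and the cache by 8, while the state term is unchanged. A hybrid model still pays the attention cost in its attention blocks, only in fewer layers.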

Deployment and Implementation

Designed for on-device and edge use cases, the 1.2B and 2.7B variants are optimized for scenarios such as invoice parsing, receipt digitization, and inventory management. The models are available on Hugging Face and require a custom fork of the transformers library (v4.57.1) to support the optimized Mamba2 CUDA kernels. The vision encoder is the Vision Transformer from Qwen2.5-VL, which uses 2D rotary position embeddings and dynamic-resolution processing to handle images of varying size and aspect ratio.
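Dynamic-resolution processing means the number of visual tokens depends on the input image size, and 2D rotary embeddings are typically driven by a (row, column) index per token rather than a single flat position. The sketch below illustrates both ideas. The patch size of 14, the 2x2 patch merge, and the helper names `patch_grid` and `position_ids_2d` are assumptions for illustration and are not confirmed for Zamba2-VL; real preprocessors usually also clamp the total pixel count, which is omitted here.

```python
def patch_grid(height: int, width: int, patch_size: int = 14, merge_size: int = 2):
    """Snap an image to the nearest multiple of the token unit and return (rows, cols).

    One visual token covers a (patch_size * merge_size)-pixel square.
    Defaults are assumed values, not a confirmed Zamba2-VL configuration.
    """
    unit = patch_size * merge_size
    # Integer round-half-up, with at least one token per axis.
    rows = max(1, (height + unit // 2) // unit)
    cols = max(1, (width + unit // 2) // unit)
    return rows, cols


def position_ids_2d(rows: int, cols: int):
    """Row-major list of (row, col) indices, one pair per visual token."""
    return [(r, c) for r in range(rows) for c in range(cols)]


for h, w in [(224, 224), (1120, 840)]:
    rows, cols = patch_grid(h, w)
    ids = position_ids_2d(rows, cols)
    print(f"{h}x{w} -> grid {rows}x{cols}, {len(ids)} visual tokens, last id {ids[-1]}")
```

Because the token count scales with image area, a few large document scans can account for thousands of tokens, which is where prefill latency on long sequences matters.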
