Hybrid Architecture for Efficiency
Zyphra’s Zamba2-VL family (1.2B, 2.7B, and 7B parameters) targets the latency bottlenecks of dense Transformer-based Vision-Language Models (VLMs). In place of a dense Transformer backbone, the models use a hybrid design: Mamba2 state-space layers, which compute in time linear in sequence length, interleaved with shared Transformer attention blocks. The hybrid retains the in-context retrieval capability of attention while gaining the computational efficiency of state-space models (SSMs).
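To make the interleaving concrete, here is a minimal sketch of how such a layer stack could be laid out. The names `LayerSpec` and `hybrid_layout`, the layer count, and the attention interval are all hypothetical and chosen for illustration; they are not the actual Zamba2-VL configuration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class LayerSpec:
    kind: str        # "mamba2" or "shared_attention"
    weights_id: int  # layers with the same id reuse one set of parameters


def hybrid_layout(num_mamba_layers: int, attn_every: int) -> list[LayerSpec]:
    """Place one shared attention block after every `attn_every` Mamba2 layers.

    Illustrative only: the real model's depth and interleaving pattern are
    not given in the text above.
    """
    layout: list[LayerSpec] = []
    for i in range(num_mamba_layers):
        # Each Mamba2 layer owns its parameters (ids 1, 2, 3, ...).
        layout.append(LayerSpec("mamba2", weights_id=i + 1))
        if (i + 1) % attn_every == 0:
            # Every attention slot points at the same parameters (id 0).
            layout.append(LayerSpec("shared_attention", weights_id=0))
    return layout


# Hypothetical sizes: 12 Mamba2 layers, attention after every 6th.
layout = hybrid_layout(num_mamba_layers=12, attn_every=6)
attention_slots = [layer for layer in layout if layer.kind == "shared_attention"]
unique_weight_sets = {layer.weights_id for layer in layout}
```

In this toy layout the attention block appears twice in the stack but contributes only one set of weights, which is the sense in which the attention blocks are "shared".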
Performance and Latency Gains
The primary advantage of Zamba2-VL is its latency on long-context multimodal inputs. In a dense Transformer, attention compute grows quadratically with sequence length and the KV cache grows linearly, and both become costly when high-resolution images or video produce many visual tokens. The Mamba2 layers in Zamba2-VL instead carry a fixed-size recurrent state, so only the interleaved shared attention blocks pay these costs. The result is a time-to-first-token (TTFT) roughly an order of magnitude lower than that of comparable Transformer-based models on 32k-token prefill tasks. The models lag behind larger baselines on complex reasoning benchmarks such as MMMU and MathVista, but they are competitive on perception-heavy tasks such as document understanding (DocVQA) and visual counting (PixMoCount).
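The scaling argument can be checked with a back-of-the-envelope cost model. The sketch below is an illustration under assumed dimensions (layer counts, head sizes, and state sizes are hypothetical, not Zamba2-VL's published configuration), and it uses standard first-order estimates rather than measured numbers.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Memory for cached keys and values: linear in sequence length."""
    # Factor 2 covers one key tensor plus one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


def attention_prefill_flops(seq_len, num_layers, num_heads, head_dim):
    """Approximate FLOPs for QK^T and attention-times-V: quadratic in length."""
    # Each of the two matmuls costs about 2 * seq_len^2 * head_dim per head.
    return 4 * num_layers * num_heads * head_dim * seq_len ** 2


def ssm_state_bytes(num_layers, d_inner, d_state, bytes_per_value=2):
    """Recurrent state of a state-space layer: independent of sequence length."""
    return num_layers * d_inner * d_state * bytes_per_value


# Hypothetical dimensions, chosen only to compare growth rates.
dims = dict(num_layers=32, num_kv_heads=8, head_dim=128)
short, long = 4_096, 32_768

cache_growth = kv_cache_bytes(long, **dims) / kv_cache_bytes(short, **dims)
flops_growth = (
    attention_prefill_flops(long, num_layers=32, num_heads=32, head_dim=128)
    / attention_prefill_flops(short, num_layers=32, num_heads=32, head_dim=128)
)
state_bytes = ssm_state_bytes(num_layers=32, d_inner=4096, d_state=64)
```

Going from 4k to 32k tokens multiplies the KV cache by 8 and the attention prefill FLOPs by 64, while the state-space state stays the same size. A hybrid with only a few attention blocks pays the growing costs on those blocks alone.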
Deployment and Implementation
Designed for on-device and edge use cases, the 1.2B and 2.7B variants are optimized for scenarios like invoice parsing, receipt digitization, and inventory management. The models are available on Hugging Face and require a custom fork of the transformers library (v4.57.1) to support the optimized Mamba2 CUDA kernels. The architecture integrates the Vision Transformer from Qwen2.5-VL, utilizing 2D rotary position embeddings and dynamic-resolution processing to handle diverse visual inputs effectively.
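Dynamic-resolution processing means the number of visual tokens depends on the image size, which is what ties image resolution to context length. The sketch below estimates that count. The helper `visual_token_count` is hypothetical, and the defaults (14-pixel patches, 2x2 patch merging) are assumptions based on the published Qwen2.5-VL vision encoder; the text above does not confirm that Zamba2-VL keeps these exact values, and the real preprocessor also applies pixel-budget limits that are omitted here.

```python
def visual_token_count(height, width, patch_size=14, merge_size=2):
    """Estimate visual tokens for an image under dynamic-resolution patching.

    Assumed scheme: each side is rounded to a multiple of
    patch_size * merge_size, the image is cut into patch_size squares,
    and merge_size x merge_size neighbouring patches become one token.
    """
    unit = patch_size * merge_size  # pixels covered by one token per side
    # Round each side to the nearest multiple of `unit`, keeping at least one.
    rows = max(1, round(height / unit))
    cols = max(1, round(width / unit))
    return rows * cols


# Example sizes: a small square image and a larger scanned page.
small = visual_token_count(448, 448)
page = visual_token_count(1344, 896)
```

Under these assumptions a 448x448 image yields 256 tokens and a 1344x896 page yields 1536, so a handful of high-resolution pages can account for a large share of a long prefill.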