#computer-vision
COMPASS: Improving Compositional Control in Multimodal Models
COMPASS introduces a unified framework that uses a shared 'expert token' to bridge composition perception and generation, enabling precise layout control in multimodal models.
OmniPath: Automating Wheelchair Accessibility Audits with AI
OmniPath improves accessibility mapping by fusing OpenStreetMap data with high-density LiDAR to identify physical barriers like slope and surface discontinuities that standard maps ignore.
SpatialClaw: Using Code as an Action Interface for Spatial Reasoning
SpatialClaw is a training-free agent framework that improves spatial reasoning in VLMs by treating Python code—rather than structured tool calls—as the primary interface for perception and geometric tasks.
Qwen-RobotSuite: Three Foundation Models for Embodied AI
The Qwen team has released a suite of three specialized foundation models—RobotManip, RobotWorld, and RobotNav—designed to address data fragmentation in robotics through unified action representations, language-conditioned world modeling, and scalable navigation interfaces.
Edge-Based Computer Vision for Industrial Food Waste Reduction
Mill uses custom-tuned Gemma models on Nvidia Jetson hardware to process high-frame-rate video at the edge, turning food waste data into actionable procurement insights for commercial kitchens.
ByteDance's Lance: A Unified 3B Model for Vision and Video
Lance is an open-source, 3B parameter unified model that natively integrates image and video understanding, generation, and editing within a single jointly trained framework.