Open Models Surpass Closed with Full Customization
Open-weight models like GLM-4.1 now lead the Artificial Analysis Intelligence Index over closed models, and the gap has narrowed with each release. Full weight access enables quantization (e.g., a 4-bit GGUF of the large Gemma-2 fits in the 24 GB of VRAM on an L4 GPU), fine-tuning, and edge/browser deployment without data exfiltration; this matters amid cloud performance drops and breaches. The Hugging Face Hub (nearing 3M models) centralizes the workflow: filter for agentic models (VLMs for screenshot-based computer use, or pure LLMs), compare them via benchmark datasets (SWE-bench Pro for coding, AIME, humanities exams; GLM-4.1 tops SWE-bench), and route inference to the fastest or cheapest providers (Groq, Cerebras) with visibility into tool use.
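The quantization fit claim comes down to simple arithmetic: a quantized model's weight footprint is roughly parameter count times bits per weight. A minimal sketch, taking ~27B as the large Gemma-2 size and assuming a 15% runtime overhead factor (the overhead constant is an illustration, not a measured value):

```python
def quantized_size_gb(params_billions: float, bits_per_weight: float,
                      overhead: float = 1.15) -> float:
    """Rough footprint: params * bits/8 bytes of weights, plus runtime overhead."""
    return params_billions * bits_per_weight / 8 * overhead

def fits(params_billions: float, bits: float, vram_gb: float) -> bool:
    """Does the quantized model fit in the given VRAM budget?"""
    return quantized_size_gb(params_billions, bits) <= vram_gb

# A 27B model at 4-bit: ~13.5 GB of weights (~15.5 GB with overhead), fits in 24 GB.
print(fits(27, 4, 24))   # True
print(fits(27, 16, 24))  # False: fp16 needs ~62 GB
```

The same check explains why fp16 inference on the same card requires either a smaller model or offloading.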
Traces repos store and explore agent sessions (Cody, Cline, Pi): upload session paths, view the parsed interactions in the dataset viewer, and retrain on them later. Together, these features cut the friction of selecting among 3M models.
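A traces repo is ultimately a dataset of session records. A minimal sketch of flattening one session into per-turn rows for a dataset viewer; the JSON schema here is hypothetical, not the actual trace format of any of these agents:

```python
import json

# Hypothetical session record; real agent trace schemas will differ.
session = json.loads("""
{
  "agent": "pi",
  "model": "glm-4.1",
  "turns": [
    {"role": "user", "content": "fix the failing test"},
    {"role": "assistant", "content": "reading test output", "tool": "bash"},
    {"role": "assistant", "content": "patched the assertion"}
  ]
}
""")

def flatten(session: dict) -> list[dict]:
    """One row per turn, carrying session-level metadata so rows are filterable."""
    return [
        {"agent": session["agent"], "model": session["model"],
         "turn": i, "role": t["role"], "tool": t.get("tool"),
         "content": t["content"]}
        for i, t in enumerate(session["turns"])
    ]

rows = flatten(session)
print(len(rows), rows[1]["tool"])  # 3 bash
```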
Local Coding Agents Run Seamlessly on Open Models
Serve agents locally via the llama.cpp server, LM Studio, or Ollama; the Hub's 'Apps' tab lists compatible local tools, and model cards show GGUF quants, which hardware they fit, and the 2-3 line launch commands (e.g., `ollama run <hub-id>` launches the llama-agent binary). Favorites: Pi (simple setup, runs remote or local) and Hermes agents (superior memory, Slack/WhatsApp integration; one autonomously fixed the speaker's own Slack bug). Pair them with GLM-4.1; upcoming Gemma-2 and Minimax support is rumored. Vibe-check a model via inference providers before deploying locally.
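The launch commands from a model card typically look like the following sketch; the repo id and quant tag are placeholders, so substitute whatever the card actually lists:

```shell
# Serve a Hub GGUF locally with llama.cpp's server (-hf pulls from the Hub)
llama-server -hf <user>/<model>-GGUF:Q4_K_M --port 8080

# Or pull the same quant through Ollama's hf.co passthrough
ollama run hf.co/<user>/<model>-GGUF:Q4_K_M
```

Both assume the respective binary is already installed; the first download caches the weights locally.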
Skills Turn Agents into AI Engineers
Hugging Face Skills plug the Hub into agents (Claude Code, Gemini): the CLI skill manages repos, jobs, and demos; the Dataset skill explores data via the viewer API; the Gradio skill builds demos. The standout is the LLM Trainer skill, which fine-tunes LLMs, VLMs, and object detectors (it handles bounding boxes): prompt "train Qwen2-VL on LLaVA Instruct Mix", and the agent asks about batch size and validation split, computes VRAM and job cost, launches the run on HF infra (or locally), and pushes the output to the Hub.
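The VRAM and cost arithmetic the skill automates can be sketched by hand: full fine-tuning in mixed precision holds fp16 weights and gradients plus fp32 Adam optimizer states, roughly 16 bytes per parameter before activations. The activation overhead factor and hourly rate below are illustrative assumptions, not actual HF Jobs figures:

```python
def finetune_vram_gb(params_billions: float, bytes_per_param: float = 16.0,
                     activation_overhead: float = 1.2) -> float:
    """fp16 weights (2) + fp16 grads (2) + fp32 Adam m/v + master weights (12)
    comes to ~16 B/param; activations add a workload-dependent multiplier."""
    return params_billions * bytes_per_param * activation_overhead

def job_cost_usd(gpu_hours: float, hourly_rate: float = 4.0) -> float:
    # hourly_rate is a placeholder, not an actual HF Jobs price
    return gpu_hours * hourly_rate

vram = finetune_vram_gb(7)  # ~134 GB: a 7B full fine-tune needs multiple GPUs
print(round(vram, 1), job_cost_usd(6))
```

The same arithmetic is why LoRA-style methods, which freeze the base weights, fit the job on far smaller cards.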
The MCP server exposes models, datasets, spaces, jobs, and semantic search; toggle 'dynamic spaces' to query the full app store (e.g., "generate baklava of yarn" calls Qwen2 image gen). A colleague's pipeline: the agent picks the top OCR-bench model (Chandram), writes the job script via skills, OCRs 30K papers on HF Jobs with HF Buckets (cheaper than S3), and hosts the result without manual infra math. The outcome: agents handle end-to-end ML workflows; what took days of calculation is now a prompt.
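The "infra math" the agent replaces is throughput arithmetic, sketched here with illustrative numbers (pages per paper, seconds per page, and the GPU rate are assumptions, not measured HF Jobs figures):

```python
def batch_ocr_estimate(n_docs: int, pages_per_doc: float,
                       sec_per_page: float, gpu_hourly_usd: float) -> dict:
    """GPU-hours and cost for an OCR batch at a given per-page latency."""
    hours = n_docs * pages_per_doc * sec_per_page / 3600
    return {"gpu_hours": round(hours, 1),
            "cost_usd": round(hours * gpu_hourly_usd, 2)}

# 30K papers, ~10 pages each, 0.5 s/page on one GPU at a hypothetical $2/h:
print(batch_ocr_estimate(30_000, 10, 0.5, 2.0))
```

Scaling out to N GPUs divides wall-clock time by N but leaves the GPU-hour cost roughly unchanged, which is the trade-off the agent weighs when sizing the job.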