#quantization
LLM Inference: mmap Loading & Quantization Deep Dive
Efficient LLM inference hinges on mmap for lazy, on-demand weight loading (e.g., sub-10-second startup in llama.cpp) and on quantization schemes such as GGUF K-Quants or AWQ/EXL2, which shrink 15 GB models while preserving quality by keeping salient weights at higher precision.
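As a rough illustration of the "salient weights at higher precision" idea from the summary, here is a toy sketch in NumPy: quantize a weight matrix to int8 with a single scale, but restore the largest-magnitude (salient) weights to full precision afterward. This is a simplified stand-in, not the actual GGUF K-Quant or AWQ algorithm; the function name, the salient fraction, and the per-tensor (rather than per-group) scale are all assumptions made for brevity.

```python
import numpy as np

def quantize_keep_salient(weights, salient_frac=0.01, bits=8):
    """Toy mixed-precision quantization (illustrative only, not AWQ/K-Quants):
    round weights to signed int8, then restore the top salient_frac
    largest-magnitude weights to their original fp32 values."""
    flat = weights.ravel().astype(np.float32)
    k = max(1, int(len(flat) * salient_frac))
    salient_idx = np.argsort(np.abs(flat))[-k:]       # outlier positions
    qmax = 2 ** (bits - 1) - 1                        # 127 for int8
    scale = max(float(np.abs(flat).max()) / qmax, 1e-8)
    q = np.clip(np.round(flat / scale), -qmax - 1, qmax).astype(np.int8)
    deq = q.astype(np.float32) * scale                # dequantize
    deq[salient_idx] = flat[salient_idx]              # keep outliers exact
    return deq.reshape(weights.shape), scale
```

Real schemes quantize per group (e.g., blocks of 32 weights in GGUF) and store the high-precision values separately, but the trade-off is the same: a handful of exact weights buys back most of the quality lost to rounding.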