#prompt-engineering
Making LLM Self-Evolution Safe with Held-Out Selection
RSEA improves LLM agent performance by recursively evolving natural-language artifacts while using a strict held-out validation gate to prevent performance regression.
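The held-out gate described in this summary can be illustrated with a minimal sketch. It is not RSEA's actual implementation: the function `evolve_with_gate` and its `propose` and `score` callbacks are hypothetical names introduced here, and the only assumption is that a candidate artifact replaces the incumbent when it scores strictly higher on data the proposal step never sees.

```python
def evolve_with_gate(artifact, propose, score, train_set, held_out_set, rounds=5):
    """Evolve an artifact, accepting a candidate only if it wins on held-out data.

    propose(artifact, train_set) -> candidate   (hypothetical callback)
    score(artifact, dataset)     -> float       (hypothetical callback)
    """
    best = artifact
    best_score = score(best, held_out_set)
    for _ in range(rounds):
        # The proposal step may only look at the training split.
        candidate = propose(best, train_set)
        # The gate: evaluate on the held-out split and keep the candidate
        # only when it strictly improves on the incumbent.
        candidate_score = score(candidate, held_out_set)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score
```

Because rejected candidates are discarded, the returned score can never fall below the starting artifact's held-out score, which is the regression guarantee the summary refers to.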
Building Great Agent Skills: The Missing Manual
To escape 'skill hell,' developers must treat agent skills as structured, maintainable code by optimizing triggers, minimizing context bloat, using 'leading words' for steering, and aggressively pruning irrelevant instructions.
AI Engineer
Improving LLM Planning with Symbolic Feedback Loops
To solve LLM planning errors in long-horizon tasks, this framework uses symbolic verification to provide corrective, interpretable feedback, forcing the model to iteratively refine its plans.
Personality Prompting in Multi-Agent Teams: Impact vs. Task Structure
Personality manipulation in LLM agents significantly alters communication style but only degrades performance in open-ended or competitive tasks, while having negligible impact on structured coding tasks.
The Promptware Kill Chain: Securing AI Agents
Promptware is a new class of malware that exploits the lack of separation between instructions and data in LLMs. To defend against it, builders must adopt a zero-trust architecture, treating AI agents as untrusted, hostile runtimes rather than benign assistants.
IBM Technology
Controlling LLM Output: Deterministic vs. Stochastic Generation
LLMs produce a probability distribution over next tokens. Setting temperature to 0 makes generation effectively deterministic (greedy decoding), while top-k and top-p sampling constrain the randomness of next-token selection by limiting the pool of candidate tokens.
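As a rough illustration of these decoding controls, the sketch below samples one token from a toy logit table. The function `sample_next_token` and the convention that temperature 0 means greedy argmax are assumptions made for this example, not any particular library's API.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Pick one token from a dict mapping token -> logit (illustrative only)."""
    rng = rng or random.Random()
    # Assumption: temperature 0 is treated as greedy decoding (argmax).
    if temperature == 0:
        return max(logits, key=logits.get)
    # Softmax over temperature-scaled logits, shifted by the max for stability.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda pair: pair[1], reverse=True)
    # Top-k: keep only the k most probable tokens.
    if top_k is not None:
        ranked = ranked[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability reaches p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, prob in ranked:
            kept.append((tok, prob))
            cumulative += prob
            if cumulative >= top_p:
                break
        ranked = kept
    tokens = [tok for tok, _ in ranked]
    weights = [prob for _, prob in ranked]
    # random.choices renormalizes the remaining weights.
    return rng.choices(tokens, weights=weights, k=1)[0]
```

With `top_k=1`, or a `top_p` small enough that only the most likely token survives, the result is also deterministic; larger values narrow the candidate pool but leave the choice random.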
The Mechanics and Risks of AI Prompt Injection
AI agents cannot distinguish between developer instructions and untrusted data, making them vulnerable to prompt injection attacks where hidden text in web pages overrides system commands.
Stop Writing Tone Instructions: Use a 4-Layer AI Architecture
Stop relying on a single system prompt for brand voice. Instead, use a four-layer architecture—Immutable Identity, Situational Mode, Example-Anchored Voice, and a Deterministic Veto—to separate instructions from verification.
Improving LLM Ethical Reasoning with Narration-of-Thought
Narration-of-Thought (NoT) is an inference-time prompting scaffold that forces LLMs to explicitly identify stakeholders and uncertainties before committing to a decision, significantly reducing common ethical reasoning failures.
Instruction Bleed: The Hidden Risk of Prompt Composition
Compositional Behavioral Leakage (CBL) occurs when prompt modules interfere with each other within a shared context window, causing silent, sub-threshold shifts in agent behavior that standard QA often misses.
Building AI-Powered Apps: A Low-Code Guide for Small Teams
Small teams can modernize legacy applications by leveraging 'vibe coding' and managed database AI features like hybrid search and vector embeddings, allowing them to implement semantic capabilities without needing a team of AI experts.
Google Cloud Tech
The Miranda Hypothesis: Why Persona Evals Fail
Current persona-based AI benchmarks measure 'convincingness' rather than historical fidelity, leading to 'Miranda distortion' where models prioritize culturally dominant narratives (like the Hamilton musical) over primary documentary records.
Verifying LLM Reasoning Traces with VeryTrace
VeryTrace improves LLM reliability by formalizing natural language reasoning into a structured, compilable DSL, enabling automated verification and error repair without domain-specific training.
AI Agents vs. Social Engineering: The Future of Trust
AI-native operating systems may finally solve social engineering by removing humans from routine trust decisions, though this shifts the battlefield to AI-agent manipulation and prompt injection.
Building Functional Personas with AI for User-Centric Decisions
Move beyond static, demographic-heavy personas by using AI to synthesize research into 'functional' personas focused on user goals, tasks, and objections, then making them interactive via custom chatbots.
Smashing Magazine
Optimizing LLM Skills with Microsoft SkillOpt
Microsoft SkillOpt provides an automated pipeline to iteratively improve LLM prompt-based skills through a cycle of rollout, reflection, and validation, allowing developers to quantitatively measure performance gains against a baseline.
How AI Memory Tools Introduce Bias and Degrade Accuracy
Research shows that AI memory systems often fail to distinguish between relevant context and irrelevant user preferences, causing models to become sycophantic and prioritize user-fed misconceptions over objective accuracy.
Optimizing Long-Horizon AI Agents via Context Engineering
The paper demonstrates that reducing context noise in long-horizon LLM agents significantly improves performance and reliability, challenging the 'more context is better' paradigm.
Anthropic's Mythos-Class Models: Fable 5 and Mythos 5 Explained
Anthropic has introduced the 'Mythos-class' model tier, featuring Claude Fable 5 (general release with safety classifiers) and Claude Mythos 5 (limited, unrestricted release). Both models offer 1M token context windows and advanced reasoning capabilities.
Diagnosing Instruction Hierarchy Failures in Reasoning LLMs
Reasoning models often fail when instructions conflict or are poorly prioritized; this research identifies the structural causes of these hierarchy breakdowns and proposes methods to repair them.
Automating Prompt Optimization with GEPA Reflective Evolution
GEPA automates prompt engineering by using a reflection model to iteratively refine prompts based on structured feedback from a deterministic evaluation pipeline.
AI Comprehension Over Generation: The 'Catch Me Up' Workflow
In complex, legacy codebases, the primary value of AI is not code generation but comprehension. By using structured prompts to build mental models before planning or implementation, developers can avoid 'slop' and maintain high code quality.
AI Engineer
Qwen3.7-Max: Reasoning-First Agent Model with 1M Context
Alibaba's Qwen3.7-Max is a text-only reasoning model featuring a 1M-token context window and an 'extended-thinking' mode designed for complex, multi-step agentic workflows and code refactoring.
Long Context vs. Cache Augmented Generation (CAG)
Long context is best for one-off document analysis, while Cache Augmented Generation (CAG) and prompt caching optimize performance and cost for repeated queries against stable knowledge bases by reusing pre-computed KV caches.
Scaling Coding Agents: Lessons from Building Langfuse Skills
To make coding agents reliable, move away from static pre-training context toward dynamic, search-based documentation retrieval and rigorous evaluation, while carefully defining target functions to avoid optimizing away reliability.
AI Engineer
Optimizing System Prompts via Embedding by Elicitation
The paper introduces 'Embedding by Elicitation,' a method that uses Bayesian Optimization to dynamically refine system prompts by learning latent representations, overcoming the limitations of static prompt engineering.
Building Long-Running AI Agents: Harnesses and Adversarial Loops
To build agents that run for hours without losing coherence, move beyond single-session loops. Use adversarial 'generator-critic' architectures, structured handoffs, and persistent state files to maintain focus and quality over long horizons.
AI Engineer
Wider Harness: 6D Framework for Digital Workers
Evolve task agents into digital workers that handle recurring functions by using a 6D harness (Identity, Context, Capability, Conduct, Cognition, Governance). Onboard them like hires rather than deploying them like tasks.
Poetiq Meta-System Auto-Builds Harnesses Boosting All LLMs on LCB Pro
Poetiq’s Meta-System uses recursive self-improvement to automatically generate model-agnostic inference harnesses, lifting every tested LLM's LiveCodeBench Pro score without fine-tuning—e.g., Gemini 3.1 Pro from 78.6% to 90.9%, GPT 5.5 High to 93.9%.
Chess Coach Pipeline: Engines + Detectors + LLM Translator
LLMs fail at chess because they hallucinate. The fix is to use Stockfish for evaluation, tactical and positional detectors for concepts, and the LLM only to translate the results into natural language, achieving sub-3s latency without errors.
AI Engineer