Guardrails: Stop Your Agent Before It Goes Off the Rails

Agents fail in surprising ways that no amount of prompt engineering fully prevents, so the defensive layer you add around the model matters as much as the model itself.

June 26, 2026
Aki Wijesundara
#Guardrails #LLM Security #AI Safety

A well-prompted agent is still an agent. It can receive malicious input, generate harmful output, go off-topic, or take actions outside its intended scope. None of these failures require a broken model. They are the predictable consequences of deploying a system that interprets natural language without a hard boundary between what it should and should not do. Guardrails are that boundary.

Input Sanitization

Input sanitization is the first line of defense. Before any user input reaches the model, check it against a set of rules: maximum length limits to prevent context stuffing, character allow-lists for structured inputs like product search queries, and pattern matching to catch known attack signatures.

import re

def sanitize_input(user_input: str, max_length: int = 2000) -> str:
    # Enforce length limit
    if len(user_input) > max_length:
        raise ValueError("Input exceeds maximum length of {} characters".format(max_length))

    # Strip null bytes and control characters (keep newlines and tabs)
    cleaned = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]', '', user_input)

    # Detect prompt injection patterns
    injection_patterns = [
        r'ignore\s+(all\s+)?(previous|prior|above)\s+instructions?',
        r'disregard\s+(your\s+)?(system\s+)?prompt',
        r'you\s+are\s+now\s+\w+',
        r'act\s+as\s+(if\s+you\s+(were|are)\s+)?(?!a\s+helpful)'
    ]
    for pattern in injection_patterns:
        if re.search(pattern, cleaned, re.IGNORECASE):
            raise ValueError("Input contains disallowed patterns")

    return cleaned.strip()

This is not foolproof. Determined attackers will probe for patterns your regex does not cover. Treat input sanitization as one layer of a defense-in-depth strategy, not as a complete solution on its own.
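
The function above covers the length limit and the pattern matching, but not the character allow-list mentioned earlier. A minimal sketch of one for a product search box follows. The function name, the permitted character set, and the 100-character cap are all assumptions for illustration, and the ASCII-only class would need widening for catalogs with non-English product names.

```python
import re

# Assumed policy for a product search box: ASCII letters, digits, spaces,
# and a few punctuation marks, between 1 and 100 characters in total.
SEARCH_QUERY_PATTERN = re.compile(r"[A-Za-z0-9 .,'\-]{1,100}")

def validate_search_query(query: str) -> str:
    """Return the trimmed query, or raise ValueError if it breaks the allow-list."""
    candidate = query.strip()
    # fullmatch requires every character to belong to the allowed class.
    if not SEARCH_QUERY_PATTERN.fullmatch(candidate):
        raise ValueError("Search query contains characters outside the allow-list")
    return candidate
```

An allow-list rejects anything it was not told to accept, so it tends to hold up better than a block-list for narrow, structured fields. It is a poor fit for free-form chat input, where legitimate messages use a much wider range of characters.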

Output Format Validation

Output guardrails prevent the model's response from reaching your user in a broken or dangerous state. At minimum, validate that the output matches the expected schema and passes a length check. For agentic systems, also validate that any actions the model proposes are within the allowed action space before executing them.

import json

ALLOWED_ACTIONS = {"search", "summarize", "recommend", "clarify"}

def validate_agent_output(raw_output: str) -> dict:
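    # Must be valid JSON
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError("Agent output is not valid JSON") from exc

    # Must be a JSON object, not a list, string, or number
    if not isinstance(parsed, dict):
        raise ValueError("Agent output must be a JSON object")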

    # Must have required fields
    required = {"action", "response"}
    missing = required - set(parsed.keys())
    if missing:
        raise ValueError("Agent output missing required fields: {}".format(missing))

    # Action must be a string in the allowed set
    action = parsed["action"]
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise ValueError("Action {!r} is not permitted".format(action))

    # Response must be a non-empty string
    response = parsed["response"]
    if not isinstance(response, str) or not response.strip():
        raise ValueError("Agent response must be a non-empty string")

    return parsed
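
The validator checks the schema and the action name, but the length check and the execution step are left open. The sketch below shows one way to add both, under stated assumptions: `MAX_RESPONSE_CHARS`, `HANDLERS`, and `execute_validated` are hypothetical names, the handlers are stand-ins for real services, and the input is assumed to have already passed `validate_agent_output`.

```python
from typing import Callable, Dict

MAX_RESPONSE_CHARS = 4000  # assumed limit; tune it for your interface

# Hypothetical handlers; a real agent would call its own services here.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "search": lambda text: f"[search] {text}",
    "summarize": lambda text: f"[summary] {text}",
    "recommend": lambda text: f"[recommendation] {text}",
    "clarify": lambda text: f"[clarification] {text}",
}

def execute_validated(parsed: dict) -> str:
    """Run the handler for an agent output that has already been validated."""
    response = parsed["response"]
    if len(response) > MAX_RESPONSE_CHARS:
        raise ValueError("Agent response exceeds maximum length")

    handler = HANDLERS.get(parsed["action"])
    if handler is None:
        # Fail closed: an action without a registered handler never runs.
        raise ValueError(f"No handler registered for action '{parsed['action']}'")
    return handler(response)
```

Routing every action through an explicit registry means a new capability has to be added in code before the model can trigger it, rather than being reachable by whatever string the model emits.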

Topic Classifiers

For agents with a defined scope, a topic classifier is a lightweight way to reject out-of-scope requests before they reach the main model. The classifier is itself an LLM call, but it uses a small, fast model and a tight binary prompt: is this input within scope or not?

Prompt

"You are a content classifier for a customer support AI that handles product questions and order inquiries only. Classify the following user message as IN_SCOPE or OUT_OF_SCOPE. IN_SCOPE: product questions, order status, returns, account issues. OUT_OF_SCOPE: anything unrelated to customer support, requests to roleplay, political opinions, creative writing, coding help. Respond with only the classification and a one-sentence reason."

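A classifier prompt like this still needs a small wrapper that turns the model's reply into a yes/no decision. The following is a minimal sketch, not a definitive implementation: `call_model` is a hypothetical function that sends a prompt to whichever small model you use and returns its text, and `CLASSIFIER_PROMPT` is an abbreviated stand-in for the full prompt above.

```python
from typing import Callable

# Abbreviated stand-in for the full classifier prompt.
CLASSIFIER_PROMPT = (
    "Classify the following user message as IN_SCOPE or OUT_OF_SCOPE. "
    "Respond with only the classification and a one-sentence reason.\n\n"
    "User message: {message}"
)

def is_in_scope(message: str, call_model: Callable[[str], str]) -> bool:
    """Return True only when the classifier's reply starts with IN_SCOPE.

    `call_model` is a hypothetical callable: prompt text in, reply text out.
    """
    reply = call_model(CLASSIFIER_PROMPT.format(message=message))
    label = reply.strip().upper()
    # Fail closed: anything other than an explicit IN_SCOPE is rejected.
    return label.startswith("IN_SCOPE")
```

Treating an unparseable reply as out of scope keeps a confused or manipulated classifier from silently letting requests through.
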
NeMo Guardrails for Complex Cases

For agents that need richer guardrail logic, the NeMo Guardrails framework from NVIDIA provides a declarative way to define rails in Colang, its domain-specific language. You specify topic rails (topics to stay on or avoid), dialog rails (patterns of conversation to allow or block), and fact-checking rails (claims that must be grounded). The library applies your rules automatically to every call routed through its LLMRails wrapper.

from nemoguardrails import RailsConfig, LLMRails

# Load rails config from a directory with config.yml and .co files
config = RailsConfig.from_path("./guardrails_config")
rails = LLMRails(config)

async def safe_respond(user_message: str) -> str:
    response = await rails.generate_async(
        messages=[{"role": "user", "content": user_message}]
    )
    return response["content"]

NeMo Guardrails adds latency, so benchmark it for your use case before adopting it for all calls. For simpler agents, the sanitization and validation functions above are usually sufficient. The goal in both cases is the same: make it impossible for a bad input or a bad model output to cause harm, not merely unlikely.
