
Building Secure AI Agents: A Default-Deny Architecture
"I'm going to let the agent manage the temp directory," our devops engineer said. "It'll be fine."
The prompt was simple: "Check the /tmp/logs directory and delete files older than 7 days."
The agent (GPT-4) looked at the directory. It saw a lot of logs. It decided to be thorough. It interpreted "delete files older than 7 days" as "delete files and directories that look old."
It found a directory called /tmp/logs/archive. It deleted it. That directory was a symlink to our S3 backup mount.
We lost 3 months of logs in 4 seconds.
The Fundamental Problem With AI Agents
AI agents are not intelligent actors making reasoned decisions. They are language models executing tool calls based on pattern matching. When you give an agent access to delete_file, it doesn't understand the consequences of deletion the way a human does. It doesn't check for symlinks. It doesn't consider whether the directory might be important. It just sees a path and a tool, and it calls the tool because that's what the prompt implied it should do.
The mental model that gets people in trouble is treating agents like competent employees. They're not. They're more like shell scripts that can improvise. They follow instructions literally, they don't ask clarifying questions when they should, and they optimize for task completion without considering side effects.
You cannot trust an agent's judgment. You can only trust its capabilities—and you must limit them.
The Default-Deny Architecture
After the log incident, we designed a security architecture for AI agents based on a simple principle borrowed from network security: default deny. An agent starts with zero permissions and must explicitly be granted access to each capability.
Layer 1: Tool Classification
Not all tools are equally dangerous. We classify every tool into four risk tiers:
Blocked (immediate rejection, no exceptions): 18 tools that should never be available to an AI agent in production, including:
- Shell execution: execute_shell, run_command, bash, subprocess, os.system
- Destructive operations: delete_file, rm, kill_process, chmod
- Outbound communication: send_email, http_post, upload_file
Review Required (needs human approval or explicit policy): 6 tools that are sometimes legitimate but always risky:
- write_file, create_file, modify_file (can overwrite critical data)
- send_message, post, api_call (can exfiltrate data or trigger external actions)
Allowed (safe for autonomous execution): 15 tools that are read-only or side-effect-free, including:
- read_file, list_directory, search, query, get
- fetch, format, parse, summarize, translate
Custom (user-defined risk assessment): Any tool not in the above lists gets a risk score based on its arguments, the current session context, and custom validators defined by the developer.
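As a minimal sketch, the tier lookup is just set membership with an explicit fall-through to the custom tier. The sets below use only the tool names listed above; the full production lists are longer:

```python
BLOCKED = {"execute_shell", "run_command", "bash", "subprocess",
           "os.system", "delete_file", "rm", "kill_process", "chmod",
           "send_email", "http_post", "upload_file"}
REVIEW_REQUIRED = {"write_file", "create_file", "modify_file",
                   "send_message", "post", "api_call"}
ALLOWED = {"read_file", "list_directory", "search", "query", "get",
           "fetch", "format", "parse", "summarize", "translate"}

def classify_tool(tool_name: str) -> str:
    """Map a tool to its risk tier. Anything unknown falls through
    to the custom tier for a scored, context-aware assessment."""
    if tool_name in BLOCKED:
        return "blocked"
    if tool_name in REVIEW_REQUIRED:
        return "review_required"
    if tool_name in ALLOWED:
        return "allowed"
    return "custom"
```

Note that the default branch is "custom", not "allowed": an unrecognized tool gets a risk assessment, never a free pass. That is what makes the architecture default-deny rather than default-allow.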
Layer 2: Argument Validation
Even a "safe" tool becomes dangerous with the wrong arguments. Our argument validator inspects every parameter of every tool call for:
Path traversal:

# Agent requests:
tool_call("read_file", {"path": "../../../etc/passwd"})
# Validator: BLOCKED - path traversal detected

Shell injection:

# Agent requests:
tool_call("search", {"query": "users; rm -rf /"})
# Validator: BLOCKED - shell metacharacter detected

Sensitive file access:

# Agent requests:
tool_call("read_file", {"path": "/home/user/.ssh/id_rsa"})
# Validator: BLOCKED - sensitive path (.ssh directory)

SQL injection:

# Agent requests:
tool_call("query", {"sql": "SELECT * FROM users; DROP TABLE users"})
# Validator: BLOCKED - SQL injection pattern detected

We also apply Unicode normalization (NFKC) before validation. This prevents the fullwidth-character bypass, where an attacker writes ｒｍ　－ｒｆ (fullwidth Unicode characters) instead of rm -rf. After normalization, both resolve to the same ASCII representation.
Layer 3: Sequence Analysis
Individual tool calls might be safe, but the sequence of calls can reveal an attack pattern.
We track three dangerous sequences:
Privilege escalation: read → write → execute
If an agent reads a file, writes a modified version, then tries to execute it, that's a classic privilege escalation pattern—even if each individual call looks benign.
Data exfiltration: read → read → read → send
Three consecutive reads followed by an outbound communication (email, HTTP POST, file upload) suggests the agent is gathering data to exfiltrate.
Excessive writes: More than 5 consecutive write operations trigger a review. Legitimate agents rarely need to modify many files in rapid succession.
Layer 4: Velocity Limiting
Even if every individual tool call is valid, we enforce rate limits to prevent runaway agents:
- 30 calls per minute per agent
- 100 calls per session (session TTL: 1 hour)
These limits catch infinite loops (a common failure mode when agents get confused) and sustained automated attacks.
Layer 5: Risk Scoring
Every tool call receives a composite risk score from 0.0 to 1.0:
risk_score = (argument_validation × 0.50)
           + (sequence_validation × 0.30)
           + (custom_validator × 0.20)
           + (schema_violation_penalty × 0.30)

The risk score maps to an action:
- SAFE (0.0-0.2): Execute immediately.
- LOW (0.2-0.4): Execute with logging.
- MEDIUM (0.4-0.6): Execute with logging and alert.
- HIGH (0.6-0.8): Queue for human review.
- CRITICAL (0.8-1.0): Block immediately.
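Putting the formula and the action bands together (the clamp to 1.0 is our assumption, since the schema penalty can push the weighted sum past the top of the scale):

```python
def risk_action(arg_score: float, seq_score: float,
                custom_score: float, schema_penalty: float):
    """Composite risk score per the weights above; each input in [0, 1].
    Returns (action, score)."""
    score = (arg_score * 0.50 + seq_score * 0.30 +
             custom_score * 0.20 + schema_penalty * 0.30)
    score = min(score, 1.0)  # penalty term can exceed the 0-1 scale
    if score < 0.2:
        return "SAFE", score       # execute immediately
    if score < 0.4:
        return "LOW", score        # execute with logging
    if score < 0.6:
        return "MEDIUM", score     # execute with logging and alert
    if score < 0.8:
        return "HIGH", score       # queue for human review
    return "CRITICAL", score       # block immediately
```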
The Human Circuit Breaker
For any high-stakes action—refunds over $50, deleting files, sending emails to more than one person—the agent must pause and request human approval.
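As a sketch, the threshold check can be a rule table keyed by tool name. The rule set, argument shapes, and dollar units below are illustrative assumptions, not the PromptGuard API:

```python
# Hypothetical high-stakes rules: refunds over $50, any file deletion,
# email to more than one recipient.
REVIEW_RULES = {
    "issue_refund": lambda args: args.get("amount", 0) > 50,
    "delete_file": lambda args: True,
    "send_email": lambda args: len(args.get("to", [])) > 1,
}

def needs_human_approval(tool_name: str, arguments: dict) -> bool:
    """Circuit-breaker check: True means the agent must pause and
    wait for a human decision before executing."""
    rule = REVIEW_RULES.get(tool_name)
    return bool(rule and rule(arguments))
```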
We integrate with webhook alerting to make this practical:
{
"event": "tool_call_review",
"agent_id": "support-agent-v3",
"tool_name": "issue_refund",
"arguments": {"amount": 5000, "order_id": "ORD-12345"},
"risk_score": 0.85,
"risk_level": "CRITICAL",
"reason": "Refund amount exceeds $50 threshold",
"session_context": {
"total_calls": 12,
"previous_tools": ["get_order", "get_customer", "get_refund_policy"]
}
}

This alert goes to Slack (or any webhook endpoint). The human can approve or deny. The agent waits. No autonomous execution of irreversible actions.
Why This Matters More Than Prompt Engineering
We see teams spend weeks crafting the perfect system prompt for their agent:
"You are a helpful assistant. Never delete important files. Always ask before making changes. Be careful with sensitive data."
This is security theater. The system prompt is a suggestion written in natural language, interpreted by a probabilistic model. It will be followed most of the time, but "most of the time" is not a security guarantee.
In contrast, the default-deny architecture is deterministic. The agent physically cannot call delete_file because it's on the blocked list. It physically cannot access /etc/passwd because the path traversal check runs before the tool executes. It physically cannot issue a $5,000 refund without human approval because the risk threshold requires it.
Deterministic controls beat probabilistic instructions. Every time.
Integration With PromptGuard
All of this agent security is built into the PromptGuard platform. You can validate tool calls through the SDK:
import os

from promptguard import PromptGuard

pg = PromptGuard(api_key=os.environ["PROMPTGUARD_API_KEY"])
# Before executing any tool call from your agent
validation = pg.agent.validate_tool(
agent_id="my-agent",
tool_name="delete_file",
arguments={"path": "/tmp/logs/old.log"},
session_id="session-abc"
)
if validation.allowed:
execute_tool("delete_file", {"path": "/tmp/logs/old.log"})
else:
print(f"Blocked: {validation.reason}")
print(f"Risk: {validation.risk_level} ({validation.risk_score})")

You can also monitor agent behavior in aggregate:
# Get agent statistics
stats = pg.agent.stats(agent_id="my-agent")
print(f"Total calls: {stats.total_tool_calls}")
print(f"Blocked calls: {stats.total_blocked_calls}")
print(f"Risk score: {stats.risk_score}")

Conclusion
The AI agent hype cycle is real, and agents are genuinely useful. But every agent is one creative prompt injection away from executing the worst possible interpretation of its instructions.
The fix isn't better prompts. It's better architecture. Default deny. Argument validation. Sequence analysis. Human circuit breakers. Velocity limits.
Treat your agents like what they are: untrusted code execution environments with natural language interfaces. If your agent runs as root, you're one prompt injection away from disaster.