
Building Secure AI Agents: A Default-Deny Architecture
"I'm going to let the agent manage the temp directory," our devops engineer said. "It'll be fine."
The prompt was simple: "Check the /tmp/logs directory and delete files older than 7 days."
The agent (GPT-4) looked at the directory. It saw a lot of logs. It decided to be thorough. It interpreted "delete files older than 7 days" as "delete files and directories that look old."
It found a directory called /tmp/logs/archive. It deleted it. That directory was a symlink to our S3 backup mount.
We lost 3 months of logs in 4 seconds.
The Fundamental Problem With AI Agents
AI agents are not intelligent actors making reasoned decisions. They are language models executing tool calls based on pattern matching. When you give an agent access to delete_file, it doesn't understand the consequences of deletion the way a human does. It doesn't check for symlinks. It doesn't consider whether the directory might be important. It just sees a path and a tool, and it calls the tool because that's what the prompt implied it should do.
The mental model that gets people in trouble is treating agents like competent employees. They're not. They're more like shell scripts that can improvise. They follow instructions literally, they don't ask clarifying questions when they should, and they optimize for task completion without considering side effects.
You cannot trust an agent's judgment. You can only trust its capabilities—and you must limit them.
The Default-Deny Architecture
After the log incident, we designed a security architecture for AI agents based on a simple principle borrowed from network security: default deny. An agent starts with zero permissions and must explicitly be granted access to each capability.
Layer 1: Tool Classification
Not all tools are equally dangerous. We classify every tool into four risk tiers:
Blocked (immediate rejection, no exceptions): 18 tools that should never be available to an AI agent in production, including:
- Shell execution: execute_shell, run_command, bash, subprocess, os.system
- Destructive operations: delete_file, rm, kill_process, chmod
- Outbound communication: send_email, http_post, upload_file
Review Required (needs human approval or explicit policy): 6 tools that are sometimes legitimate but always risky:
- write_file, create_file, modify_file (can overwrite critical data)
- send_message, post, api_call (can exfiltrate data or trigger external actions)
Allowed (safe for autonomous execution): 15 tools that are read-only or side-effect-free, including:
- read_file, list_directory, search, query, get
- fetch, format, parse, summarize, translate
Custom (user-defined risk assessment): Any tool not in the above lists gets a risk score based on its arguments, the current session context, and custom validators defined by the developer.
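As a minimal sketch, the tier lookup is just set membership with an explicit fall-through to the custom tier. The sets below use only the tool names listed above; the full production lists are longer:

```python
BLOCKED = {"execute_shell", "run_command", "bash", "subprocess",
           "os.system", "delete_file", "rm", "kill_process", "chmod",
           "send_email", "http_post", "upload_file"}
REVIEW_REQUIRED = {"write_file", "create_file", "modify_file",
                   "send_message", "post", "api_call"}
ALLOWED = {"read_file", "list_directory", "search", "query", "get",
           "fetch", "format", "parse", "summarize", "translate"}

def classify_tool(tool_name: str) -> str:
    """Map a tool to its risk tier. Anything unknown falls through
    to the custom tier for a scored, context-aware assessment."""
    if tool_name in BLOCKED:
        return "blocked"
    if tool_name in REVIEW_REQUIRED:
        return "review_required"
    if tool_name in ALLOWED:
        return "allowed"
    return "custom"
```

Note that the default branch is "custom", not "allowed": an unrecognized tool gets a risk assessment, never a free pass. That is what makes the architecture default-deny rather than default-allow.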
Layer 2: Argument Validation
Even a "safe" tool becomes dangerous with the wrong arguments. Our argument validator inspects every parameter of every tool call for:
Path traversal:

# Agent requests:
tool_call("read_file", {"path": "../../../etc/passwd"})
# Validator: BLOCKED - path traversal detected

Shell injection:

# Agent requests:
tool_call("search", {"query": "users; rm -rf /"})
# Validator: BLOCKED - shell metacharacter detected

Sensitive file access:

# Agent requests:
tool_call("read_file", {"path": "/home/user/.ssh/id_rsa"})
# Validator: BLOCKED - sensitive path (.ssh directory)

SQL injection:

# Agent requests:
tool_call("query", {"sql": "SELECT * FROM users; DROP TABLE users"})
# Validator: BLOCKED - SQL injection pattern detected

We also apply Unicode normalization (NFKC) before validation. This prevents the fullwidth-character bypass, where an attacker writes ｒｍ　－ｒｆ (fullwidth Unicode characters) instead of rm -rf. After normalization, both resolve to the same ASCII representation.
Layer 3: Sequence Analysis
Individual tool calls might be safe, but the sequence of calls can reveal an attack pattern.
We track three dangerous sequences:
Privilege escalation: read → write → execute
If an agent reads a file, writes a modified version, then tries to execute it, that's a classic privilege escalation pattern—even if each individual call looks benign.
Data exfiltration: read → read → read → send
Three consecutive reads followed by an outbound communication (email, HTTP POST, file upload) suggests the agent is gathering data to exfiltrate.
Excessive writes: More than 5 consecutive write operations trigger a review. Legitimate agents rarely need to modify many files in rapid succession.
Layer 4: Velocity Limiting
Even if every individual tool call is valid, we enforce rate limits to prevent runaway agents:
- 30 calls per minute per agent
- 100 calls per session (session TTL: 1 hour)
These limits catch infinite loops (a common failure mode when agents get confused) and sustained automated attacks.
Layer 5: Risk Scoring
Every tool call receives a composite risk score from 0.0 to 1.0:
risk_score = (argument_validation × 0.50)
           + (sequence_validation × 0.30)
           + (custom_validator × 0.20)
           + (schema_violation_penalty × 0.30)

The risk score maps to an action:
- SAFE (0.0-0.2): Execute immediately.
- LOW (0.2-0.4): Execute with logging.
- MEDIUM (0.4-0.6): Execute with logging and alert.
- HIGH (0.6-0.8): Queue for human review.
- CRITICAL (0.8-1.0): Block immediately.
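Putting the formula and the action bands together (the clamp to 1.0 is our assumption, since the schema penalty can push the weighted sum past the top of the scale):

```python
def risk_action(arg_score: float, seq_score: float,
                custom_score: float, schema_penalty: float):
    """Composite risk score per the weights above; each input in [0, 1].
    Returns (action, score)."""
    score = (arg_score * 0.50 + seq_score * 0.30 +
             custom_score * 0.20 + schema_penalty * 0.30)
    score = min(score, 1.0)  # penalty term can exceed the 0-1 scale
    if score < 0.2:
        return "SAFE", score       # execute immediately
    if score < 0.4:
        return "LOW", score        # execute with logging
    if score < 0.6:
        return "MEDIUM", score     # execute with logging and alert
    if score < 0.8:
        return "HIGH", score       # queue for human review
    return "CRITICAL", score       # block immediately
```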
The Human Circuit Breaker
For any high-stakes action—refunds over $50, deleting files, sending emails to more than one person—the agent must pause and request human approval.
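As a sketch, the threshold check can be a rule table keyed by tool name. The rule set, argument shapes, and dollar units below are illustrative assumptions, not the PromptGuard API:

```python
# Hypothetical high-stakes rules: refunds over $50, any file deletion,
# email to more than one recipient.
REVIEW_RULES = {
    "issue_refund": lambda args: args.get("amount", 0) > 50,
    "delete_file": lambda args: True,
    "send_email": lambda args: len(args.get("to", [])) > 1,
}

def needs_human_approval(tool_name: str, arguments: dict) -> bool:
    """Circuit-breaker check: True means the agent must pause and
    wait for a human decision before executing."""
    rule = REVIEW_RULES.get(tool_name)
    return bool(rule and rule(arguments))
```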
We integrate with webhook alerting to make this practical:
{
"event": "tool_call_review",
"agent_id": "support-agent-v3",
"tool_name": "issue_refund",
"arguments": {"amount": 5000, "order_id": "ORD-12345"},
"risk_score": 0.85,
"risk_level": "CRITICAL",
"reason": "Refund amount exceeds $50 threshold",
"session_context": {
"total_calls": 12,
"previous_tools": ["get_order", "get_customer", "get_refund_policy"]
}
}

This alert goes to Slack (or any webhook endpoint). The human can approve or deny. The agent waits. No autonomous execution of irreversible actions.
Why This Matters More Than Prompt Engineering
We see teams spend weeks crafting the perfect system prompt for their agent:
"You are a helpful assistant. Never delete important files. Always ask before making changes. Be careful with sensitive data."
This is security theater. The system prompt is a suggestion written in natural language, interpreted by a probabilistic model. It will be followed most of the time, but "most of the time" is not a security guarantee.
In contrast, the default-deny architecture is deterministic. The agent physically cannot call delete_file because it's on the blocked list. It physically cannot access /etc/passwd because the path traversal check runs before the tool executes. It physically cannot issue a $5,000 refund without human approval because the risk threshold requires it.
Deterministic controls beat probabilistic instructions. Every time.
Integration With PromptGuard
All of this agent security is built into the PromptGuard platform. You can validate tool calls through the SDK:
import os

from promptguard import PromptGuard

pg = PromptGuard(api_key=os.environ["PROMPTGUARD_API_KEY"])
# Before executing any tool call from your agent
validation = pg.agent.validate_tool(
agent_id="my-agent",
tool_name="delete_file",
arguments={"path": "/tmp/logs/old.log"},
session_id="session-abc"
)
if validation.allowed:
execute_tool("delete_file", {"path": "/tmp/logs/old.log"})
else:
print(f"Blocked: {validation.reason}")
print(f"Risk: {validation.risk_level} ({validation.risk_score})")

You can also monitor agent behavior in aggregate:
# Get agent statistics
stats = pg.agent.stats(agent_id="my-agent")
print(f"Total calls: {stats.total_tool_calls}")
print(f"Blocked calls: {stats.total_blocked_calls}")
print(f"Risk score: {stats.risk_score}")

Conclusion
The AI agent hype cycle is real, and agents are genuinely useful. But every agent is one creative prompt injection away from executing the worst possible interpretation of its instructions.
The fix isn't better prompts. It's better architecture. Default deny. Argument validation. Sequence analysis. Human circuit breakers. Velocity limits.
Treat your agents like what they are: untrusted code execution environments with natural language interfaces. If your agent runs as root, you're one prompt injection away from disaster.