Why We Built a Transparent AI Firewall
When you deploy an LLM application, you are effectively giving every user a command line to your backend.
If you're building a chatbot, that command line is restricted: it can only generate text. If you're building an agent with tools, that command line has sudo. It can read files, query databases, send emails, and issue refunds. The only thing standing between a user and those capabilities is a system prompt that says "please don't do bad things."
That's not security. That's a suggestion.
The industry's answer to this problem has been "safety layers": proprietary APIs that return a binary safe or unsafe verdict. We tried using them. We hated them. Here's why, and what we built instead.
The Black Box Problem
Imagine a Web Application Firewall (WAF) that didn't tell you why it blocked a request. It just returned 403 Forbidden. You check your logs-nothing. You check the vendor dashboard-"Threat Detected." You call support-"Our model flagged it as suspicious."
How do you debug that? You don't. You turn the WAF off, eat the risk, and move on.
That is the state of most AI security tools today. They are proprietary models hidden behind APIs. You send your user's prompt, and you pray the vendor's definition of "unsafe" matches yours.
We saw three fundamental problems with this approach:
1. No explainability. When a legitimate user gets blocked, you can't tell them why. You can't tell your engineering team why. You can't file a meaningful bug report. The vendor's model is a black box, and you're building your product on top of it.
2. No auditability. You're trusting a third party with every prompt your users send. Are they training on your data? Are they logging it? You don't know, because you can't see the code. For regulated industries (healthcare, finance, defense), this is a non-starter.
3. No customization. Every application has different security requirements. A creative writing tool and a banking chatbot have opposite definitions of "safe content." Black-box tools can't be tuned for your specific use case.
What We Built
PromptGuard is an AI firewall designed for transparency. It sits between your application and your LLM provider as a transparent proxy, with open-source SDKs and fully explainable decisions.
Your App → PromptGuard → OpenAI / Anthropic / Gemini / etc.

Integration is one line of code: change your base_url. No SDK required, no code refactoring, no middleware to write.
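For an OpenAI-compatible client, the swap amounts to pointing requests at a different base URL. Here is a minimal sketch using only the Python standard library; the proxy URL (`api.promptguard.co/v1`) and model name are illustrative assumptions, not documented endpoints, and the request is built but not sent:

```python
import json
import urllib.request

# Hypothetical proxy endpoint -- the real URL comes from your PromptGuard account.
PROMPTGUARD_BASE_URL = "https://api.promptguard.co/v1"

def build_chat_request(api_key: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat request routed through the proxy.

    Only the base URL differs from a direct provider call; the path,
    headers, and JSON body are unchanged.
    """
    body = json.dumps({
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{PROMPTGUARD_BASE_URL}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request("sk-example", "Hello!")
```

The same one-line change works with any OpenAI-compatible SDK that accepts a custom base URL.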
Fourteen Threat Detectors
We don't run one generic "is this safe?" model. We run fourteen specialized detectors, each focused on a specific threat type:
- Prompt Injection: Catches direct overrides ("ignore instructions"), roleplay manipulation, delimiter injection, and semantic evasion techniques — backed by more than 1,000 detection patterns, including open-source community rules for agent-layer threats.
- Jailbreak Detection (LLM): LLM-powered analysis across 7 attack categories — character obfuscation, competing objectives, lexical, semantic, context, structure obfuscation, and multi-turn escalation.
- PII Detection: Identifies 39+ types of personally identifiable information using layered regex, checksum validation (Luhn, ABA, Verhoeff, and more), ML-based named entity recognition for unstructured PII like names and addresses, and encoded PII detection — then replaces them with safe tokens.
- Data Exfiltration: Detects attempts to extract system prompts, training data, user data, or knowledge base contents.
- Toxicity: Five-model ML ensemble covering hate speech, violence, self-harm, sexual content, and harassment.
- API Key Detection: Catches leaked credentials (OpenAI, AWS, GitHub, Google, generic API keys and tokens) with Shannon entropy analysis.
- Fraud Detection: Identifies social engineering patterns like wire transfer scams, gift card scams, and credential harvesting.
- Malware Detection: Blocks destructive commands (rm -rf, format c:), reverse shells, and encoded PowerShell payloads.
- Tool Injection Detection: Catches indirect prompt injection in agentic tool calls and outputs for LLM agent architectures.
- MCP Server Security: Server allow/block-listing, argument schema validation, and resource access policies for MCP-based agents.
- URL Filtering: Allow-list/block-list, CIDR matching, scheme restriction, and credential injection blocking.
- Multimodal Content Safety: Image content analysis via Cloud Vision or Azure Content Safety, with OCR-based PII detection.
- Security Groundedness: Detects hallucinated CVEs, fabricated compliance claims, and invented security statistics.
- Content Safety Classification: LLM-based harmful intent detection via an open-weight safety classifier.
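Some of these detectors reduce to compact, auditable logic. The Shannon entropy analysis behind API key detection, for instance, can be sketched as follows; the length and entropy thresholds here are illustrative assumptions, not PromptGuard's actual values:

```python
import math
from collections import Counter

def shannon_entropy(token: str) -> float:
    """Bits of entropy per character; 0.0 for an empty string."""
    if not token:
        return 0.0
    n = len(token)
    counts = Counter(token)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def looks_like_secret(token: str, min_len: int = 20, min_entropy: float = 4.0) -> bool:
    """Flag long, high-entropy tokens as candidate leaked credentials."""
    return len(token) >= min_len and shannon_entropy(token) >= min_entropy

looks_like_secret("hello world this is prose")   # repetitive, low entropy -> False
looks_like_secret("sk-9fQ2xL7mVp0aZtR4cN8bKw")   # API-key-shaped, high entropy -> True
```

In practice entropy is one signal layered on top of provider-specific key prefixes and formats, which keeps false positives on ordinary prose low.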
The Hybrid Architecture
Most AI security tools are just another LLM call. They send your prompt to GPT-4 with "is this safe?" and add 500ms to your latency.
We take a different approach. Our detection pipeline combines:
- Regex patterns for known attack signatures (fast, deterministic, zero false negatives on known patterns)
- A multi-model ML ensemble of specialized classifiers — covering injection, content moderation, toxicity, and adversarial hate speech — running in parallel for semantic understanding
- Confidence calibration via Platt scaling, so our confidence scores are actually calibrated probabilities
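Platt scaling fits a two-parameter logistic curve to raw detector scores using held-out labeled data; applying it at inference time is a single sigmoid. A minimal sketch, with illustrative parameter values rather than anything fit on real data:

```python
import math

# Illustrative parameters -- in practice A and B are fit on a held-out
# validation set by minimizing log loss over (raw_score, label) pairs.
A, B = -4.0, 2.0

def calibrate(raw_score: float) -> float:
    """Map a raw detector score to a calibrated probability via Platt scaling."""
    return 1.0 / (1.0 + math.exp(A * raw_score + B))

calibrate(0.9)  # confident raw score -> probability near 1
calibrate(0.1)  # weak raw score -> probability near 0
```

The payoff is that a reported confidence of 0.9 can be read as "roughly 9 in 10 such flags are true positives," rather than an uncalibrated model score.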
Total overhead: approximately 150ms. Not zero, but an order of magnitude faster than an LLM-based security check.
Three Decisions, Not Two
Most security tools give you ALLOW or BLOCK. We add a third: REDACT.
If a user says "My SSN is 123-45-6789, can you help with my tax return?", the intent is legitimate. Blocking them would be hostile. Instead, we replace the SSN with [SSN_REDACTED], forward the sanitized prompt to the LLM, and the user gets their answer without exposing sensitive data.
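The REDACT path can be sketched in a few lines. The SSN pattern below is deliberately simplified for illustration; as described above, the real detector layers regex with checksum validation and NER:

```python
import re

# Simplified SSN pattern for illustration only -- production detection
# layers regex with checksum and ML-based entity recognition.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact_ssn(prompt: str) -> tuple[str, bool]:
    """Replace SSN-shaped tokens with a safe placeholder.

    Returns the sanitized prompt and whether anything was redacted.
    """
    redacted, n = SSN_RE.subn("[SSN_REDACTED]", prompt)
    return redacted, n > 0

text, hit = redact_ssn("My SSN is 123-45-6789, can you help with my tax return?")
# text == "My SSN is [SSN_REDACTED], can you help with my tax return?"
```

The sanitized prompt is what gets forwarded to the LLM; the original never leaves the firewall.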
Every decision is returned with full metadata:
- X-PromptGuard-Event-ID: Unique trace ID for debugging
- X-PromptGuard-Decision: ALLOW, BLOCK, or REDACT
- X-PromptGuard-Confidence: Calibrated confidence score (0.0-1.0)
- X-PromptGuard-Threat-Type: Which threat category was detected
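Client code can branch on these headers directly. A sketch of one way to do it; the threat-type and event-ID values shown are illustrative:

```python
def route_response(headers: dict[str, str]) -> str:
    """Decide what the application does based on PromptGuard's decision headers."""
    decision = headers.get("X-PromptGuard-Decision", "ALLOW")
    if decision == "BLOCK":
        threat = headers.get("X-PromptGuard-Threat-Type", "unknown")
        event = headers.get("X-PromptGuard-Event-ID", "n/a")
        # Surface the event ID so users can quote it in a support ticket.
        return f"blocked: {threat} (event {event})"
    if decision == "REDACT":
        return "forwarded with redactions"
    return "forwarded"

route_response({
    "X-PromptGuard-Decision": "BLOCK",
    "X-PromptGuard-Threat-Type": "prompt_injection",
    "X-PromptGuard-Event-ID": "evt_123",
})
# -> "blocked: prompt_injection (event evt_123)"
```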
Six Use-Case Presets
Every application has different security needs. A support bot needs strict PII protection. A code assistant needs to allow technical content that looks like injection. A creative writing tool needs relaxed toxicity thresholds.
We ship six composable presets: support_bot, code_assistant, rag_system, creative_writing, data_analysis, and default. Each preset can be combined with a strictness level (strict, moderate, permissive) to fine-tune thresholds across all detectors simultaneously.
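Conceptually, a preset plus a strictness level resolves to a table of per-detector thresholds. A toy sketch of that composition with made-up numbers; the real presets tune far more parameters per detector, and the scaling rule here is an assumption for illustration:

```python
# Illustrative base thresholds per preset -- lower means more sensitive.
BASE_THRESHOLDS = {
    "support_bot":      {"pii": 0.50, "toxicity": 0.60, "injection": 0.70},
    "code_assistant":   {"pii": 0.70, "toxicity": 0.70, "injection": 0.85},
    "creative_writing": {"pii": 0.70, "toxicity": 0.90, "injection": 0.75},
}

# Strictness scales every threshold at once: strict lowers them, permissive raises them.
STRICTNESS_SCALE = {"strict": 0.8, "moderate": 1.0, "permissive": 1.2}

def resolve_thresholds(preset: str, strictness: str = "moderate") -> dict[str, float]:
    """Combine a preset with a strictness level into per-detector thresholds."""
    scale = STRICTNESS_SCALE[strictness]
    return {name: min(base * scale, 1.0) for name, base in BASE_THRESHOLDS[preset].items()}

resolve_thresholds("support_bot", "strict")["pii"]  # 0.4 -> flags PII more eagerly
```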
Beyond Detection
PromptGuard isn't just a scanner. It's a complete security platform:
- Multi-provider routing with automatic failover across OpenAI, Anthropic, Gemini, Mistral, Groq, and Azure OpenAI
- Bot detection with behavioral analysis (rate limiting, timing analysis, payload fingerprinting, session analysis, reputation scoring)
- Red team testing with 20 built-in attack vectors you can run from the API or dashboard
- AI agent security with tool call validation, argument inspection, sequence analysis, and velocity limiting
- Webhook and email alerting when threats are detected (Slack-compatible)
- Shadow mode / A/B testing for safely evaluating detection model changes without affecting production traffic
- Feedback-driven model recalibration that automatically adjusts confidence thresholds based on false positive/negative reports
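Of the agent-security features above, velocity limiting is the simplest to illustrate: at heart it is a sliding-window counter over recent tool calls. A minimal sketch with illustrative limits:

```python
import time
from collections import deque

class VelocityLimiter:
    """Sliding-window cap on tool calls per agent session (illustrative)."""

    def __init__(self, max_calls: int, window_s: float):
        self.max_calls = max_calls
        self.window_s = window_s
        self.calls = deque()  # timestamps of recent allowed calls

    def allow(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        # Drop calls that have aged out of the window.
        while self.calls and now - self.calls[0] > self.window_s:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            return False
        self.calls.append(now)
        return True

limiter = VelocityLimiter(max_calls=3, window_s=1.0)
[limiter.allow(now=t) for t in (0.0, 0.1, 0.2, 0.3)]  # -> [True, True, True, False]
```

A burst of tool calls faster than any human-driven session would produce is a strong signal of a runaway or hijacked agent loop.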
Why Transparency Matters
We designed PromptGuard for one simple reason: security tools must be auditable.
If you're trusting a tool to sit between your users and your AI, you need to understand its decisions. You need to see why a request was blocked, what detector triggered, and what confidence threshold was applied.
When PromptGuard blocks a request, you can trace the decision through the security engine to the specific detector and pattern that triggered. Every decision includes a human-readable reason, threat type, confidence score, and event ID. No black box. No "trust us." No mystery.
Our SDKs are open source (MIT licensed), so you can inspect exactly how the client-side integration works. Enterprise customers who self-host get full access to the platform source code for audit purposes.
The Free Tier
We offer 10,000 requests per month for free. Not a trial. Not "free for 14 days." Free forever.
We do this because we believe AI security shouldn't be gated by budget. A solo developer building a side project deserves the same injection protection as an enterprise team. The free tier includes all detectors with full ML detection, the dashboard, and the full proxy functionality.
If you need custom policies, webhook alerting, or advanced analytics, the paid plans start at $99/month.
Try It
You can use the hosted version at promptguard.co: swap your base_url and you're done.
Enterprise customers can self-host the entire stack in their VPC; contact sales@promptguard.co for access. No data leaves your infrastructure. We never see your prompts.
If your AI application has users, it has an attack surface. The question isn't whether to add security; it's whether your security layer will explain itself when things go wrong.
Ours does. Every blocked request comes with a full explanation.
Continue Reading

- We Benchmarked Our Detection Engine Against 2,369 Samples from 7 Peer-Reviewed Datasets. Here Are the Results. PromptGuard's multi-layered detection achieves F1 = 0.887 with 99.1% precision across TensorTrust (ICLR 2024), In-the-Wild Jailbreaks (ACM CCS 2024), JailbreakBench (NeurIPS 2024), XSTest (NAACL 2024), and more — with 100% evasion robustness where standalone classifiers achieve only 80%.
- The LiteLLM Compromise: What a Three-Hour Window Reveals About AI Infrastructure Security. A technical analysis of the LiteLLM supply chain attack: how a compromised security scanner led to credential theft across the AI ecosystem, what the three-stage payload did, and what it means for anyone building on LLM infrastructure.
- One MCP Server to Secure Every AI Tool. PromptGuard's MCP server works with Cursor, Claude, VS Code Copilot, Windsurf, Cline, Roo Code, Continue, Zed, Goose, Lovable, and every other MCP-compatible tool. Here's how one install protects everything.