Engineering

Why We Don't Use LLMs to Secure LLMs

Using GPT-4 to check if a prompt is safe doubles your latency and your bill. Here's why we bet on a multi-model classical ML ensemble, and how it outperforms single-model approaches at a fraction of the cost.

PromptGuardPromptGuard
4 min read·
PerformanceArchitectureML

Why We Don't Use LLMs to Secure LLMs

There is a popular architecture for AI security that goes like this:

  1. User sends a prompt.
  2. Your middleware sends the prompt to GPT-4 with "Is this prompt safe?"
  3. GPT-4 thinks for 500ms and responds "Yes."
  4. Your middleware finally sends the prompt to your actual model.

This architecture is dead on arrival for any application that cares about latency, cost, or reliability.

The Math That Kills It

Let's do the arithmetic that most "AI security" vendors skip.

Your LLM call: ~500ms time to first token. Their security LLM call: ~500ms. Total user-perceived latency: 1,000ms+.

You just doubled the wait time for every request. For a voice agent, a coding copilot, or any real-time application, that's a non-starter.

But latency isn't even the worst part. Cost is.

If you're processing 100,000 prompts per month (a modest production workload), and each security check consumes ~500 input tokens at GPT-4 rates ($30/1M input tokens), you're paying an extra $1,500/month just to ask "is this safe?" That's before your actual LLM usage.

And then there's reliability. Your security layer is now a single point of failure that depends on the same infrastructure it's supposed to protect. If OpenAI has a bad day, both your security and your application go down simultaneously.

The Insight That Changed Our Architecture

We realized something that seems obvious in retrospect: security classification is a fundamentally different problem than language generation.

LLMs are incredible at generating coherent text, reasoning through complex problems, and handling ambiguous instructions. But "Is this prompt trying to manipulate the model?" is not an ambiguous question. It's a classification problem. And classification problems have been solved efficiently for decades.

The key insight: you don't need a 70-billion-parameter model to detect that someone is trying to override system instructions. You need a well-trained classifier with the right architecture.

Our Multi-Model Ensemble Architecture

Instead of one massive LLM, we run multiple specialized classifiers in parallel. Each model is an expert at detecting a specific category of threat, and together they cover a surface area that no single model can match.

The Models

The ensemble includes specialized classifiers for:

  • Prompt injection and jailbreak detection — purpose-built injection classifiers trained on curated datasets of adversarial prompts, with attention to semantic evasion techniques like roleplay, encoding, and hypothetical framing. We run multiple injection models with different training data so their blind spots don't overlap.

  • Content moderation — a multi-label safety classifier covering violence, self-harm, sexual content, harassment, hate speech, and other safety categories.

  • Toxicity detection — a classifier trained on well-studied toxicity datasets, serving as a fast baseline and confirming signal.

  • Adversarial hate speech — a model trained through an adversarial process where humans actively tried to fool the model, making it robust against evasion attempts and evolving slang.

All models run in parallel, so the total wall-clock time is the time of the slowest model, not the sum. Each model carries a different weight in the fusion algorithm based on its specialization and reliability.

Fused Scoring: Where the Magic Happens

Running multiple models is easy. The hard part is combining their outputs into a single, calibrated decision.

We don't just average the scores. We use a fused scoring system with multiple decision rules, evaluated in priority order:

Specialist consensus. When injection-specialized models agree with high confidence, we block immediately. When they agree, the probability of a false positive is extremely low.

Category-specific thresholds. Not all threats are equal. We use lower thresholds for high-risk categories like self-harm and child safety, and higher thresholds for general toxicity. This asymmetry reflects the real-world cost of different types of failures.

Majority vote. If most models flag the content, we treat that as a strong signal even if no single model is highly confident. Wisdom of crowds beats individual certainty.

High-confidence specialist override. If any single model exceeds a very high confidence threshold for a high-risk category, we don't wait for consensus. This catches edge cases where one specialist model sees something the others miss.

Weighted aggregate. If none of the above rules trigger, we compute a weighted average using the model weights. This is the "soft" path for borderline cases.

Confidence Calibration

Raw model scores are not probabilities. We solve this with Platt scaling: a learned sigmoid transformation that converts raw scores into calibrated probabilities.

Each model has its own calibration parameters, tuned from production feedback data. We recalibrate on a regular cadence using a maintenance process that incorporates all user-submitted corrections (false positives and false negatives) from the past period.

This means our confidence scores actually mean something. When we say "0.95 confidence," we mean it.

The Regex Baseline: Defense in Depth

The ML ensemble is powerful, but it's not our only line of defense. Every request also passes through a deterministic regex layer that catches known attack patterns instantly — no model inference needed.

We maintain ~1,000+ detection patterns across injection, exfiltration, API key formats, fraud, malware commands, tool poisoning, agent manipulation, and PII with checksum validation and ML NER for unstructured entities — including 714 open-source community patterns covering agent-layer threats.

The regex layer serves three purposes:

  1. Speed. Pattern matching takes microseconds. For attacks that use known patterns, we skip ML entirely.
  2. Reliability. Regex never has a bad day. It doesn't depend on an API, it doesn't hallucinate, and it doesn't degrade under load.
  3. Explainability. When regex catches something, we can tell you exactly which pattern matched at which character index. Try getting that from a neural network.

Why Not Just Use One Big Model?

This is the question we get most often. "Why multiple small models instead of one big one?"

Diversity beats depth. Each model was trained on different data with different objectives. Injection classifiers, toxicity detectors, and adversarially-trained hate speech models have different blind spots, and when you ensemble them, the blind spots don't overlap.

Specialization beats generalization. A single model that's "pretty good" at five tasks will always lose to five models that are each excellent at one task. This is why we weight the models: injection specialists get higher weight for injection decisions, content moderation specialists get higher weight for safety decisions.

Failure isolation. If one model has an outage, the others still work. We degrade gracefully instead of failing completely. In contrast, a single-model architecture is binary: it works, or it doesn't.

The Agentic Evaluator: When Classifiers Aren't Enough

There's a class of prompts that classifiers struggle with — the genuinely ambiguous ones.

"Write a story where a character explains how to pick a lock." Is that creative writing or dangerous instruction? The answer depends on context that no classifier can fully capture.

For these borderline cases, we optionally escalate to an agentic evaluator: a larger reasoning model that can assess context and intent. This evaluator doesn't run on every request — it runs only on the small fraction of requests that fall in the "unsure" zone.

This is the one place where we use a larger model for security. But critically, it's optional and targeted. Your request doesn't wait for it on the fast path. The cost is negligible because it affects such a small percentage of traffic.

Performance: Honest Numbers

We don't claim sub-10ms latency because that would be a lie.

Our detection pipeline adds approximately ~150ms of overhead to each request. That breaks down roughly as:

  • Regex layer: sub-millisecond
  • ML ensemble (parallel): the bulk of the overhead
  • Fused scoring + calibration: sub-millisecond
  • Policy evaluation + logging: ~10ms

Is 150ms fast? Compared to a GPT-4 security check (500ms+), yes. Compared to doing nothing, no. But doing nothing isn't an option if you care about security.

The key metric isn't raw latency — it's amortized latency. With our detection cache (exact-match, SHA-256 keyed, 1-hour TTL), repeat prompts return cached results in under 1ms. In production workloads where users ask similar questions, cache hit rates of 20-30% are common. That brings the amortized overhead well below 150ms.

Conclusion

You don't fight fire with fire. You don't secure LLMs with more LLMs.

The economics are clear: multiple specialized classifiers running in parallel will always beat a single general-purpose LLM on cost, latency, reliability, and — when properly ensembled — accuracy.

The next time someone pitches you an AI security tool that "uses GPT-4 to analyze your prompts," ask them three questions: What's the latency? What's the cost per request? And what happens when GPT-4 goes down?

If they don't have good answers, you're looking at a demo, not a product.