
How We Detect 39+ Types of PII With Layered Regex, Checksum Validation, and ML NER

PII detection is easy if you don't care about false positives. If you do, it's a nightmare. Here's how we built a high-precision PII detector using layered regex, Luhn and checksum validation, ML-based named entity recognition, encoded PII detection, preset-based sensitivity, and synthetic data replacement.

If you search for "how to detect credit card numbers in Python," you'll find this regex:

r'\b(?:\d[ -]*?){13,16}\b'

If you deploy this regex to production, you will block:

  1. Git commit hashes
  2. UUIDs
  3. Product SKUs
  4. Timestamps
  5. Phone numbers that happen to be 13+ digits
  6. Any sequence of numbers in a math question

You will annoy every user who triggers any of them. And they will complain, or worse, they'll just leave.
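To see why, here's a quick demonstration of the naive pattern matching digit runs that are obviously not cards (the inputs are hypothetical):

```python
import re

# The naive pattern from the search results: any 13-16 digit run,
# optionally broken up by spaces or hyphens.
naive = re.compile(r'\b(?:\d[ -]*?){13,16}\b')

# None of these are credit cards, but all of them match.
false_positives = [
    "1699999999999",             # millisecond Unix timestamp
    "order SKU 4006381333931",   # 13-digit product code
    "+44 20 7946 0958 123",      # long phone-like digit string
]
for text in false_positives:
    print(bool(naive.search(text)))  # True for every one
```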

PII detection is a precision problem. It's easy to catch everything that might be PII. It's brutally hard to catch everything that is PII without also catching things that aren't.

Here's how we built a PII detector that handles 39+ data types with precision high enough for production use—using a hybrid of regex, checksum validation, and ML-based named entity recognition.

Why We Use a Hybrid Approach for PII

This might seem surprising coming from a team that uses a 5-model ML ensemble for injection detection. Why not use ML for all PII detection?

Because PII falls into two categories that need different tools:

Structured PII has patterns. A credit card number has a specific format (13-19 digits, Luhn-valid). A Social Security Number is XXX-XX-XXXX. An email has an @ and a domain. For these, regex is faster, deterministic, and auditable. Running 25+ regex patterns against a prompt takes microseconds. For compliance-sensitive applications (HIPAA, PCI-DSS), the deterministic explanation—"Matched pattern \d{3}-\d{2}-\d{4} at position 47"—is far more useful than "the model's internal weights produced a score of 0.73."

Unstructured PII needs ML. Names, physical addresses, organization names, and medical conditions don't follow predictable formats. "John" is both a common name and a common word. "Springfield" is both a city and a TV reference. Regex can't tell the difference—but a named entity recognition (NER) model trained on contextual cues can. We run a lightweight NER model that detects person names, locations, organizations, and other unstructured entities with context-aware precision.

The result: regex handles the structured types (fast, deterministic, auditable), ML NER handles the unstructured types (contextual, flexible), and both run in a single pass.

The 39+ PII Types

Here's every type our detector handles, with the validation logic that keeps precision high.

Tier 1: Always Detected (All Presets)

These types are detected even on the most permissive preset because the risk of exposure is always high.

1. Social Security Number (SSN)

Pattern: r'\b\d{3}-\d{2}-\d{4}\b'
Label: [SSN_REDACTED]

The strict format (XXX-XX-XXXX) eliminates most false positives. We don't match bare 9-digit numbers because those are too ambiguous.
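As a minimal sketch, the strict pattern plugs straight into a substitution pass, and the hyphen requirement is exactly what keeps bare digit runs out:

```python
import re

SSN = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')

def redact_ssn(text: str) -> str:
    # Replace every strict-format SSN with its redaction label.
    return SSN.sub('[SSN_REDACTED]', text)

print(redact_ssn("My SSN is 123-45-6789."))    # label replaces the match
print(redact_ssn("Order #123456789 shipped"))  # bare 9 digits: untouched
```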

2. Credit Card Number

Pattern: r'\b(?:4\d{3}|5[1-5]\d{2}|3[47]\d{2}|6(?:011|5\d{2}))'
        r'[-\s]?\d{4,6}[-\s]?\d{4,5}[-\s]?\d{0,4}\b'
Label: [CARD_REDACTED]
Validation: Luhn algorithm

The regex matches Visa (4xxx), Mastercard (51-55xx), Amex (34/37xx), and Discover (6011/65xx) formats. But the real precision comes from the Luhn checksum—a mathematical validation that eliminates random number sequences. If the Luhn check fails, we don't flag it, even if the format matches.

We validate card lengths of 13, 14, 15, 16, and 19 digits to cover all major card networks.
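Here's a self-contained sketch of that two-step check, pairing the format regex above with a standard Luhn implementation (the test numbers are well-known dummies, not real cards):

```python
import re

CARD = re.compile(
    r'\b(?:4\d{3}|5[1-5]\d{2}|3[47]\d{2}|6(?:011|5\d{2}))'
    r'[-\s]?\d{4,6}[-\s]?\d{4,5}[-\s]?\d{0,4}\b'
)

def luhn_valid(number: str) -> bool:
    digits = [int(d) for d in re.sub(r'\D', '', number)]
    if len(digits) not in (13, 14, 15, 16, 19):
        return False
    checksum = 0
    # Double every second digit from the right; subtract 9 if it exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

def looks_like_card(text: str) -> bool:
    # Step 1: format match narrows candidates. Step 2: checksum confirms.
    m = CARD.search(text)
    return bool(m) and luhn_valid(m.group())

print(looks_like_card("4111 1111 1111 1111"))  # True: Luhn-valid test number
print(looks_like_card("4111 1111 1111 1112"))  # False: format matches, checksum fails
```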

3. Passport Number

Pattern: r'\b[A-Z]{1,2}\d{6,9}\b'
Label: [PASSPORT_REDACTED]

4. Driver's License

Pattern: r'\b[A-Z]\d{4,8}[-\s]?\d{0,5}\b'
Label: [DL_REDACTED]

5. IBAN (International Bank Account Number)

Pattern: r'\b[A-Z]{2}\d{2}\s?\d{4}\s?\d{4}\s?\d{4}(?:\s?\d{4}){0,4}\b'
Label: [IBAN_REDACTED]

Tier 2: Detected on Moderate and Strict Presets

These types have slightly higher false positive risk but are important for most applications.

6. Email Address

Pattern: r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
Label: [EMAIL_REDACTED]

7. US Phone Number

Pattern: r'\b(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b'
Label: [PHONE_REDACTED]

8. International Phone Number

Pattern: r'\b\+\d{1,3}[-.\s]?\d{1,4}[-.\s]?\d{2,4}[-.\s]?\d{2,4}(?:[-.\s]?\d{2,4})?\b'
Label: [PHONE_REDACTED]

9. IPv4 Address

Pattern: r'\b(?:\d{1,3}\.){3}\d{1,3}\b'
Label: [IP_REDACTED]

10. IPv6 Address

Pattern: r'\b(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}\b'
Label: [IP_REDACTED]

11. Date of Birth

Pattern: r'\b(?:\d{1,2}[/-]\d{1,2}[/-]\d{2,4}|\d{4}[/-]\d{1,2}[/-]\d{1,2})\b'
Label: [DOB_REDACTED]

12. Medicare ID

Pattern: r'\b\d{10,11}[A-Za-z]?\b'
Label: [MEDICARE_REDACTED]

13. NHS Number (UK)

Pattern: r'\b\d{3}\s?\d{3}\s?\d{4}\b'
Label: [NHS_REDACTED]

Tier 3: Only Detected on Strict Preset

14. US ZIP Code

Pattern: r'\b\d{5}(?:-\d{4})?\b'
Label: [ZIP_REDACTED]

ZIP codes carry high false positive risk (any 5-digit number matches the format), so they're detected only on the strict preset, where maximum data protection outweighs user friction.

Tier 4: Additional Structured Types (Regex + Checksum)

Beyond the core 14, we detect 10+ additional structured PII types with format-specific validation:

#  | Type                                | Validation                                 | Label
15 | Bank Routing Number (ABA)           | 9-digit format + ABA checksum              | [ROUTING_REDACTED]
16 | Tax ID / EIN                        | XX-XXXXXXX format                          | [TAX_ID_REDACTED]
17 | Vehicle Identification Number (VIN) | 17-char format + transliteration checksum  | [VIN_REDACTED]
18 | DEA Number                          | 2 letters + 7 digits + DEA checksum        | [DEA_REDACTED]
19 | MAC Address                         | XX:XX:XX:XX:XX:XX format                   | [MAC_REDACTED]
20 | Bitcoin Address                     | Base58Check with version prefix + checksum | [CRYPTO_REDACTED]
21 | Ethereum Address                    | 0x + 40 hex chars                          | [CRYPTO_REDACTED]
22 | Canadian SIN                        | 9-digit format + Luhn                      | [SIN_REDACTED]
23 | Australian TFN                      | 8-9 digits + TFN checksum                  | [TFN_REDACTED]
24 | Indian Aadhaar                      | 12-digit format + Verhoeff checksum        | [AADHAAR_REDACTED]
25 | ITIN                                | 9XX-XX-XXXX format                         | [ITIN_REDACTED]

Each of these uses the same principle as our credit card detection: format matching narrows the candidates, and checksum or structural validation eliminates false positives.

Tier 5: Unstructured PII (ML NER)

These types don't follow predictable formats, so regex can't catch them reliably. We use a lightweight named entity recognition model that runs alongside the regex patterns:

#   | Type                    | Detection Method | Label
26  | Person Name             | NER (contextual) | [NAME_REDACTED]
27  | Physical Address        | NER (contextual) | [ADDRESS_REDACTED]
28  | Organization Name       | NER (contextual) | [ORG_REDACTED]
29  | Medical Condition       | NER (contextual) | [MEDICAL_REDACTED]
30  | Medication Name         | NER (contextual) | [MEDICATION_REDACTED]
31+ | Additional entity types | NER (contextual) | Various labels

The NER model runs only on presets that enable it (moderate and strict). On the permissive preset, we skip NER entirely to avoid false positives on names in creative writing and code contexts.

Encoded PII Detection

Attackers and users sometimes encode PII to evade detection—intentionally or accidentally. We detect PII embedded in common encodings:

  • Base64-encoded PII: We decode Base64 segments and re-scan the decoded content for PII patterns.
  • URL-encoded PII: Percent-encoded strings (%40 for @, etc.) are decoded before scanning.
  • Hex-encoded PII: Hexadecimal strings are decoded and scanned.
  • ROT13 and simple substitution ciphers: Common obfuscation techniques are reversed before scanning.

Encoded PII detection runs on moderate and strict presets. It adds minimal overhead because encoding detection is itself a regex pass—we only decode and re-scan when we find a plausible encoded segment.
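The decode-and-rescan idea can be sketched with Base64 and the email pattern from Tier 2 (the segment heuristic below is illustrative, not our production detector):

```python
import base64
import re

EMAIL = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
# Plausible Base64 segment: a long-enough run of Base64 alphabet characters.
B64_SEGMENT = re.compile(r'\b[A-Za-z0-9+/]{16,}={0,2}')

def scan_with_decoding(text: str) -> list[str]:
    findings = EMAIL.findall(text)
    for segment in B64_SEGMENT.findall(text):
        try:
            decoded = base64.b64decode(segment, validate=True).decode('utf-8')
        except Exception:
            continue  # not valid Base64, or not text: skip it
        # Re-scan the decoded content with the same PII patterns.
        findings.extend(EMAIL.findall(decoded))
    return findings

encoded = base64.b64encode(b"contact john@example.com today").decode()
print(scan_with_decoding(f"here: {encoded}"))  # finds the hidden email
```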

Checksum Validation: Beyond Luhn

Luhn is just one of several checksum algorithms we use to keep precision high. Each checksum eliminates a class of false positives:

Algorithm           | Used For                       | What It Validates
Luhn                | Credit cards, Canadian SIN     | Digit check sequence
ABA checksum        | Bank routing numbers           | Weighted digit sum mod 10
Verhoeff            | Indian Aadhaar numbers         | Dihedral-group checksum (catches all single-digit and adjacent-transposition errors)
VIN transliteration | Vehicle identification numbers | Character-to-digit mapping + position-weighted check
DEA checksum        | DEA registration numbers       | Alternating sum of digits mod 10
TFN checksum        | Australian Tax File Numbers    | Weighted digit sum mod 11

The pattern is the same everywhere: match the format first (fast), validate the checksum second (still fast), and only flag it as PII if both pass. This two-step approach is why our false positive rate stays low even with 39+ entity types.
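For a concrete example beyond Luhn, the ABA routing checksum is a weighted digit sum with repeating weights 3, 7, 1 (the passing example below is a published, checksum-valid routing number):

```python
def aba_valid(routing: str) -> bool:
    # ABA routing checksum: weights 3, 7, 1 repeat across the 9 digits,
    # and the weighted sum must be divisible by 10.
    if len(routing) != 9 or not routing.isdigit():
        return False
    weights = (3, 7, 1, 3, 7, 1, 3, 7, 1)
    total = sum(w * int(d) for w, d in zip(weights, routing))
    return total % 10 == 0

print(aba_valid("021000021"))  # True: checksum-valid routing number
print(aba_valid("123456789"))  # False: 9 digits, but the checksum fails
```

This is why a random 9-digit number in a prompt has only about a 1-in-10 chance of surviving the checksum step even after matching the format.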

Preset-Based Sensitivity

Not every application needs the same PII protection level. A healthcare chatbot needs to catch everything. A creative writing tool probably doesn't need to redact ZIP codes.

We use three presets that control which PII types are active:

Preset     | Types Detected                                               | Use Case
Strict     | All 39+ types (regex + checksum + NER + encoded PII)         | Healthcare, finance, government
Moderate   | ~30 types (regex + checksum + NER; no ZIP, no encoded PII)   | General applications, support bots
Permissive | 5 types (SSN, credit card, passport, driver's license, IBAN) | Creative writing, code assistants

The preset is configured per-project in the dashboard. You can also override individual types via the project's preset_overrides configuration.

Redaction vs. Blocking

Here's a design decision that makes a huge difference in user experience: we redact instead of blocking.

When we detect PII in a prompt, we don't reject the entire request. We replace the PII with descriptive tokens and forward the sanitized version to the LLM.

Original prompt:

My SSN is 123-45-6789 and my email is john@example.com.
Can you help me fill out my tax return?

Redacted prompt (sent to LLM):

My SSN is [SSN_REDACTED] and my email is [EMAIL_REDACTED].
Can you help me fill out my tax return?

The LLM receives the redacted version. It can still understand the intent ("user needs help with tax return") without seeing the sensitive data. The user gets their answer. The PII never reaches the LLM provider.

This is fundamentally different from blocking:

  • Blocking says: "You did something wrong. Try again." The user is frustrated and confused.
  • Redaction says: "We protected your data and processed your request." The user gets what they wanted.
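A minimal redaction pass over two of the patterns above shows the difference in practice (a sketch, not the production pipeline):

```python
import re

# Each (pattern, label) pair runs over the prompt; matches are replaced
# in place and the sanitized prompt is forwarded instead of rejected.
PATTERNS = [
    (re.compile(r'\b\d{3}-\d{2}-\d{4}\b'), '[SSN_REDACTED]'),
    (re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'),
     '[EMAIL_REDACTED]'),
]

def redact(prompt: str) -> str:
    for pattern, label in PATTERNS:
        prompt = pattern.sub(label, prompt)
    return prompt  # sent to the LLM; the original never leaves the boundary

print(redact("My SSN is 123-45-6789 and my email is john@example.com."))
# My SSN is [SSN_REDACTED] and my email is [EMAIL_REDACTED].
```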

Synthetic Data Replacement

For applications where the redaction tokens ([SSN_REDACTED]) would confuse the LLM, we offer synthetic data replacement.

Instead of replacing PII with tokens, we generate realistic-looking fake data that preserves the format:

PII Type    | Original            | Synthetic Replacement
Email       | john@example.com    | user_847@placeholder.com
Phone       | (555) 123-4567      | (555) 000-0001
SSN         | 123-45-6789         | 000-00-0000
Credit Card | 4111 1111 1111 1111 | 4000 0000 0000 0002

The synthetic data preserves:

  • Format: The LLM sees correctly formatted data, so it can still reason about structure.
  • Consistency: The same PII replaced multiple times in the same prompt gets the same synthetic value, so references remain coherent.
  • Invalidity: Synthetic values are intentionally invalid (SSNs starting with 000, Luhn-invalid card numbers) so they can't be confused with real data.

This is particularly useful for RAG applications where the LLM needs to reference specific data points in its response. The response will contain the synthetic data, which the application can optionally reverse-map to the real values if needed.
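The consistency property can be sketched with a per-prompt mapping (the placeholder domain and counter scheme below are illustrative):

```python
import re
from itertools import count

EMAIL = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')

def synthesize_emails(text: str) -> str:
    # Same original value -> same synthetic value, so cross-references
    # inside the prompt stay coherent for the LLM.
    mapping: dict[str, str] = {}
    counter = count(1)

    def replace(m: re.Match) -> str:
        original = m.group()
        if original not in mapping:
            mapping[original] = f"user_{next(counter)}@placeholder.com"
        return mapping[original]

    return EMAIL.sub(replace, text)

text = "Mail john@example.com; cc jane@example.com; confirm with john@example.com."
print(synthesize_emails(text))
# Both mentions of john@example.com map to the same synthetic address.
```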

API Key Detection: A Separate Detector

PII isn't the only sensitive data that leaks into prompts. We also detect API keys and credentials with a separate detector that handles 10 patterns:

Pattern             | Example
OpenAI API keys     | sk-proj-..., sk-...
AWS Access Keys     | AKIA...
Google OAuth tokens | ya29....
Google API keys     | AIza...
GitHub PATs         | ghp_...
GitHub OAuth        | gho_...
Generic API keys    | api_key = "..."
Bearer tokens       | Authorization: Bearer ...

All detected credentials are replaced with [API_KEY_REDACTED]. This detector runs on both inputs and outputs, catching credentials that users accidentally paste into prompts and credentials that the LLM might hallucinate from its training data.
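As an illustration, a few of these prefixes are distinctive enough to match with short patterns (these are sketches, not the production detector's exact regexes):

```python
import re

# Illustrative patterns for three of the credential formats listed above.
KEY_PATTERNS = [
    re.compile(r'\bsk-[A-Za-z0-9_-]{20,}'),  # OpenAI-style secret keys
    re.compile(r'\bAKIA[0-9A-Z]{16}\b'),     # AWS access key IDs
    re.compile(r'\bghp_[A-Za-z0-9]{36}\b'),  # GitHub personal access tokens
]

def redact_keys(text: str) -> str:
    for pattern in KEY_PATTERNS:
        text = pattern.sub('[API_KEY_REDACTED]', text)
    return text

# Dummy key built from repeated characters, not a real credential.
print(redact_keys("use key AKIA" + "A" * 16 + " for S3"))
# use key [API_KEY_REDACTED] for S3
```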

Output Scanning: The Other Side

PII detection runs on both directions:

Input scanning catches PII that users are sending to the LLM. This prevents the data from reaching the LLM provider's servers.

Output scanning catches PII that the LLM generates in its response. This can happen when:

  • The LLM hallucinates realistic-looking PII
  • The LLM echoes back PII from its training data
  • A RAG system injects PII from retrieved documents into the response

For streaming responses, output scanning runs in real-time as chunks arrive. If PII is detected in a stream chunk, we can redact it inline without cutting the entire stream.
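One way to handle PII that straddles a chunk boundary is a small holdback buffer, sketched here for a single pattern (the buffer size and overall approach are illustrative):

```python
import re
from typing import Iterable, Iterator

SSN = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')
HOLDBACK = 12  # longer than the longest partial match we might split

def redact_stream(chunks: Iterable[str]) -> Iterator[str]:
    # Buffer a short tail of the stream so a pattern split across two
    # chunks is still seen whole, then emit the redacted prefix.
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        safe = SSN.sub('[SSN_REDACTED]', buffer)
        if len(safe) > HOLDBACK:
            yield safe[:-HOLDBACK]
            buffer = safe[-HOLDBACK:]
        else:
            buffer = safe
    yield SSN.sub('[SSN_REDACTED]', buffer)

# The SSN is split across two chunks, but the joined output is redacted.
chunks = ["My SSN is 123-4", "5-6789, thanks."]
print("".join(redact_stream(chunks)))
# My SSN is [SSN_REDACTED], thanks.
```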

Conclusion

PII detection isn't about finding patterns. It's about finding the right patterns with enough precision that you don't destroy the user experience.

Our approach is deliberately layered: regex patterns for structured PII, checksum validation (Luhn, ABA, Verhoeff, and more) for precision, ML NER for unstructured entities like names and addresses, encoded PII detection for obfuscation attempts, and preset-based sensitivity to match each application's needs. Redaction instead of blocking, always.

The structured detections remain fully auditable and deterministic—when we redact a credit card number, you can trace the decision to a specific regex pattern and a passing Luhn check. The ML NER layer adds coverage for unstructured PII that regex can't reach, while keeping false positives manageable through context-aware classification.

If you're not looking at context, you're not detecting PII—you're just grepping for digits. And if you're blocking instead of redacting, you're not protecting users—you're punishing them for having personal information.