
How We Detect 39+ Types of PII With Layered Regex, Checksum Validation, and ML NER

PII detection is easy if you don't care about false positives. If you do, it's a nightmare. Here's how we built a high-precision PII detector using layered regex, Luhn and checksum validation, ML-based named entity recognition, encoded PII detection, preset-based sensitivity, and synthetic data replacement.

If you search for "how to detect credit card numbers in Python," you'll find this regex:

r'\b(?:\d[ -]*?){13,16}\b'

If you deploy this regex to production, you will block:

  1. Git commit hashes
  2. UUIDs
  3. Product SKUs
  4. Timestamps
  5. Phone numbers that happen to be 13+ digits
  6. Any sequence of numbers in a math question

You will annoy every user who triggers any of them. And they will complain, or worse, they'll just leave.
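To see why, here's a quick demonstration of the naive pattern matching digit runs that are obviously not cards (the inputs are hypothetical):

```python
import re

# The naive pattern from the search results: any 13-16 digit run,
# optionally broken up by spaces or hyphens.
naive = re.compile(r'\b(?:\d[ -]*?){13,16}\b')

# None of these are credit cards, but all of them match.
false_positives = [
    "1699999999999",             # millisecond Unix timestamp
    "order SKU 4006381333931",   # 13-digit product code
    "+44 20 7946 0958 123",      # long phone-like digit string
]
for text in false_positives:
    print(bool(naive.search(text)))  # True for every one
```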

PII detection is a precision problem. It's easy to catch everything that might be PII. It's brutally hard to catch everything that is PII without also catching things that aren't.

Here's how we built a PII detector that handles 39+ data types with precision high enough for production use—using a hybrid of regex, checksum validation, and ML-based named entity recognition.

Why We Use a Hybrid Approach for PII

This might seem surprising coming from a team that uses a 5-model ML ensemble for injection detection. Why not use ML for all PII detection?

Because PII falls into two categories that need different tools:

Structured PII has patterns. A credit card number has a specific format (13-19 digits, Luhn-valid). A Social Security Number is XXX-XX-XXXX. An email has an @ and a domain. For these, regex is faster, deterministic, and auditable. Running 25+ regex patterns against a prompt takes microseconds. For compliance-sensitive applications (HIPAA, PCI-DSS), the deterministic explanation—"Matched pattern \d{3}-\d{2}-\d{4} at position 47"—is far more useful than "the model's internal weights produced a score of 0.73."

Unstructured PII needs ML. Names, physical addresses, organization names, and medical conditions don't follow predictable formats. "John" is both a common name and a common word. "Springfield" is both a city and a TV reference. Regex can't tell the difference—but a named entity recognition (NER) model trained on contextual cues can. We run a lightweight NER model that detects person names, locations, organizations, and other unstructured entities with context-aware precision.

The result: regex handles the structured types (fast, deterministic, auditable), ML NER handles the unstructured types (contextual, flexible), and both run in a single pass.

The 39+ PII Types

Here's every type our detector handles, with the validation logic that keeps precision high.

Tier 1: Always Detected (All Presets)

These types are detected even on the most permissive preset because the risk of exposure is always high.

1. Social Security Number (SSN)

Pattern: r'\b\d{3}-\d{2}-\d{4}\b'
Label: [SSN_REDACTED]

The strict format (XXX-XX-XXXX) eliminates most false positives. We don't match bare 9-digit numbers because those are too ambiguous.
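As a minimal sketch, the strict pattern plugs straight into a substitution pass, and the hyphen requirement is exactly what keeps bare digit runs out:

```python
import re

SSN = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')

def redact_ssn(text: str) -> str:
    # Replace every strict-format SSN with its redaction label.
    return SSN.sub('[SSN_REDACTED]', text)

print(redact_ssn("My SSN is 123-45-6789."))    # label replaces the match
print(redact_ssn("Order #123456789 shipped"))  # bare 9 digits: untouched
```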

2. Credit Card Number

Pattern: r'\b(?:4\d{3}|5[1-5]\d{2}|3[47]\d{2}|6(?:011|5\d{2}))'
        r'[-\s]?\d{4,6}[-\s]?\d{4,5}[-\s]?\d{0,4}\b'
Label: [CARD_REDACTED]
Validation: Luhn algorithm

The regex matches Visa (4xxx), Mastercard (51-55xx), Amex (34/37xx), and Discover (6011/65xx) formats. But the real precision comes from the Luhn checksum—a mathematical validation that eliminates random number sequences. If the Luhn check fails, we don't flag it, even if the format matches.

We validate card lengths of 13, 14, 15, 16, and 19 digits to cover all major card networks.
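Here's a self-contained sketch of that two-step check, pairing the format regex above with a standard Luhn implementation (the test numbers are well-known dummies, not real cards):

```python
import re

CARD = re.compile(
    r'\b(?:4\d{3}|5[1-5]\d{2}|3[47]\d{2}|6(?:011|5\d{2}))'
    r'[-\s]?\d{4,6}[-\s]?\d{4,5}[-\s]?\d{0,4}\b'
)

def luhn_valid(number: str) -> bool:
    digits = [int(d) for d in re.sub(r'\D', '', number)]
    if len(digits) not in (13, 14, 15, 16, 19):
        return False
    checksum = 0
    # Double every second digit from the right; subtract 9 if it exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

def looks_like_card(text: str) -> bool:
    # Step 1: format match narrows candidates. Step 2: checksum confirms.
    m = CARD.search(text)
    return bool(m) and luhn_valid(m.group())

print(looks_like_card("4111 1111 1111 1111"))  # True: Luhn-valid test number
print(looks_like_card("4111 1111 1111 1112"))  # False: format matches, checksum fails
```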

3. Passport Number

Pattern: r'\b[A-Z]{1,2}\d{6,9}\b'
Label: [PASSPORT_REDACTED]

4. Driver's License

Pattern: r'\b[A-Z]\d{4,8}[-\s]?\d{0,5}\b'
Label: [DL_REDACTED]

5. IBAN (International Bank Account Number)

Pattern: r'\b[A-Z]{2}\d{2}\s?\d{4}\s?\d{4}\s?\d{4}(?:\s?\d{4}){0,4}\b'
Label: [IBAN_REDACTED]

Tier 2: Detected on Moderate and Strict Presets

These types have slightly higher false positive risk but are important for most applications.

6. Email Address

Pattern: r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
Label: [EMAIL_REDACTED]

7. US Phone Number

Pattern: r'\b(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b'
Label: [PHONE_REDACTED]

8. International Phone Number

Pattern: r'\b\+\d{1,3}[-.\s]?\d{1,4}[-.\s]?\d{2,4}[-.\s]?\d{2,4}(?:[-.\s]?\d{2,4})?\b'
Label: [PHONE_REDACTED]

9. IPv4 Address

Pattern: r'\b(?:\d{1,3}\.){3}\d{1,3}\b'
Label: [IP_REDACTED]

10. IPv6 Address

Pattern: r'\b(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}\b'
Label: [IP_REDACTED]

11. Date of Birth

Pattern: r'\b(?:\d{1,2}[/-]\d{1,2}[/-]\d{2,4}|\d{4}[/-]\d{1,2}[/-]\d{1,2})\b'
Label: [DOB_REDACTED]

12. Medicare ID

Pattern: r'\b\d{10,11}[A-Za-z]?\b'
Label: [MEDICARE_REDACTED]

13. NHS Number (UK)

Pattern: r'\b\d{3}\s?\d{3}\s?\d{4}\b'
Label: [NHS_REDACTED]

Tier 3: Only Detected on Strict Preset

14. US ZIP Code

Pattern: r'\b\d{5}(?:-\d{4})?\b'
Label: [ZIP_REDACTED]

ZIP codes carry high false positive risk (any 5-digit number matches the format), so they're detected only on the strict preset, where maximum data protection outweighs user friction.

Tier 4: Additional Structured Types (Regex + Checksum)

Beyond the core 14, we detect 10+ additional structured PII types with format-specific validation:

#  | Type                                | Validation                                 | Label
15 | Bank Routing Number (ABA)           | 9-digit format + ABA checksum              | [ROUTING_REDACTED]
16 | Tax ID / EIN                        | XX-XXXXXXX format                          | [TAX_ID_REDACTED]
17 | Vehicle Identification Number (VIN) | 17-char format + transliteration checksum  | [VIN_REDACTED]
18 | DEA Number                          | 2 letters + 7 digits + DEA checksum        | [DEA_REDACTED]
19 | MAC Address                         | XX:XX:XX:XX:XX:XX format                   | [MAC_REDACTED]
20 | Bitcoin Address                     | Base58Check with version prefix + checksum | [CRYPTO_REDACTED]
21 | Ethereum Address                    | 0x + 40 hex chars                          | [CRYPTO_REDACTED]
22 | Canadian SIN                        | 9-digit format + Luhn                      | [SIN_REDACTED]
23 | Australian TFN                      | 8-9 digits + TFN checksum                  | [TFN_REDACTED]
24 | Indian Aadhaar                      | 12-digit format + Verhoeff checksum        | [AADHAAR_REDACTED]
25 | ITIN                                | 9XX-XX-XXXX format                         | [ITIN_REDACTED]

Each of these uses the same principle as our credit card detection: format matching narrows the candidates, and checksum or structural validation eliminates false positives.

Tier 5: Unstructured PII (ML NER)

These types don't follow predictable formats, so regex can't catch them reliably. We use a lightweight named entity recognition model that runs alongside the regex patterns:

#   | Type                    | Detection Method | Label
26  | Person Name             | NER (contextual) | [NAME_REDACTED]
27  | Physical Address        | NER (contextual) | [ADDRESS_REDACTED]
28  | Organization Name       | NER (contextual) | [ORG_REDACTED]
29  | Medical Condition       | NER (contextual) | [MEDICAL_REDACTED]
30  | Medication Name         | NER (contextual) | [MEDICATION_REDACTED]
31+ | Additional entity types | NER (contextual) | Various labels

The NER model runs only on presets that enable it (moderate and strict). On the permissive preset, we skip NER entirely to avoid false positives on names in creative writing and code contexts.

Encoded PII Detection

Attackers and users sometimes encode PII to evade detection—intentionally or accidentally. We detect PII embedded in common encodings:

  • Base64-encoded PII: We decode Base64 segments and re-scan the decoded content for PII patterns.
  • URL-encoded PII: Percent-encoded strings (%40 for @, etc.) are decoded before scanning.
  • Hex-encoded PII: Hexadecimal strings are decoded and scanned.
  • ROT13 and simple substitution ciphers: Common obfuscation techniques are reversed before scanning.

Encoded PII detection runs on moderate and strict presets. It adds minimal overhead because encoding detection is itself a regex pass—we only decode and re-scan when we find a plausible encoded segment.
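The decode-and-rescan idea can be sketched with Base64 and the email pattern from Tier 2 (the segment heuristic below is illustrative, not our production detector):

```python
import base64
import re

EMAIL = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
# Plausible Base64 segment: a long-enough run of Base64 alphabet characters.
B64_SEGMENT = re.compile(r'\b[A-Za-z0-9+/]{16,}={0,2}')

def scan_with_decoding(text: str) -> list[str]:
    findings = EMAIL.findall(text)
    for segment in B64_SEGMENT.findall(text):
        try:
            decoded = base64.b64decode(segment, validate=True).decode('utf-8')
        except Exception:
            continue  # not valid Base64, or not text: skip it
        # Re-scan the decoded content with the same PII patterns.
        findings.extend(EMAIL.findall(decoded))
    return findings

encoded = base64.b64encode(b"contact john@example.com today").decode()
print(scan_with_decoding(f"here: {encoded}"))  # finds the hidden email
```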

Checksum Validation: Beyond Luhn

Luhn is just one of several checksum algorithms we use to keep precision high. Each checksum eliminates a class of false positives:

Algorithm           | Used For                       | What It Validates
Luhn                | Credit cards, Canadian SIN     | Digit check sequence
ABA checksum        | Bank routing numbers           | Weighted digit sum mod 10
Verhoeff            | Indian Aadhaar numbers         | Dihedral-group checksum (catches all single-digit and adjacent-transposition errors)
VIN transliteration | Vehicle identification numbers | Character-to-digit mapping + position-weighted check
DEA checksum        | DEA registration numbers       | Alternating sum of digits mod 10
TFN checksum        | Australian Tax File Numbers    | Weighted digit sum mod 11

The pattern is the same everywhere: match the format first (fast), validate the checksum second (still fast), and only flag it as PII if both pass. This two-step approach is why our false positive rate stays low even with 39+ entity types.
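For a concrete example beyond Luhn, the ABA routing checksum is a weighted digit sum with repeating weights 3, 7, 1 (the passing example below is a published, checksum-valid routing number):

```python
def aba_valid(routing: str) -> bool:
    # ABA routing checksum: weights 3, 7, 1 repeat across the 9 digits,
    # and the weighted sum must be divisible by 10.
    if len(routing) != 9 or not routing.isdigit():
        return False
    weights = (3, 7, 1, 3, 7, 1, 3, 7, 1)
    total = sum(w * int(d) for w, d in zip(weights, routing))
    return total % 10 == 0

print(aba_valid("021000021"))  # True: checksum-valid routing number
print(aba_valid("123456789"))  # False: 9 digits, but the checksum fails
```

This is why a random 9-digit number in a prompt has only about a 1-in-10 chance of surviving the checksum step even after matching the format.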

Preset-Based Sensitivity

Not every application needs the same PII protection level. A healthcare chatbot needs to catch everything. A creative writing tool probably doesn't need to redact ZIP codes.

We use three presets that control which PII types are active:

Preset     | Types Detected                                               | Use Case
Strict     | All 39+ types (regex + checksum + NER + encoded PII)         | Healthcare, finance, government
Moderate   | ~30 types (regex + checksum + NER; no ZIP, no encoded PII)   | General applications, support bots
Permissive | 5 types (SSN, credit card, passport, driver's license, IBAN) | Creative writing, code assistants

The preset is configured per-project in the dashboard. You can also override individual types via the project's preset_overrides configuration.

Redaction vs. Blocking

Here's a design decision that makes a huge difference in user experience: we redact instead of blocking.

When we detect PII in a prompt, we don't reject the entire request. We replace the PII with descriptive tokens and forward the sanitized version to the LLM.

Original prompt:

My SSN is 123-45-6789 and my email is john@example.com.
Can you help me fill out my tax return?

Redacted prompt (sent to LLM):

My SSN is [SSN_REDACTED] and my email is [EMAIL_REDACTED].
Can you help me fill out my tax return?

The LLM receives the redacted version. It can still understand the intent ("user needs help with tax return") without seeing the sensitive data. The user gets their answer. The PII never reaches the LLM provider.

This is fundamentally different from blocking:

  • Blocking says: "You did something wrong. Try again." The user is frustrated and confused.
  • Redaction says: "We protected your data and processed your request." The user gets what they wanted.
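A minimal redaction pass over two of the patterns above shows the difference in practice (a sketch, not the production pipeline):

```python
import re

# Each (pattern, label) pair runs over the prompt; matches are replaced
# in place and the sanitized prompt is forwarded instead of rejected.
PATTERNS = [
    (re.compile(r'\b\d{3}-\d{2}-\d{4}\b'), '[SSN_REDACTED]'),
    (re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'),
     '[EMAIL_REDACTED]'),
]

def redact(prompt: str) -> str:
    for pattern, label in PATTERNS:
        prompt = pattern.sub(label, prompt)
    return prompt  # sent to the LLM; the original never leaves the boundary

print(redact("My SSN is 123-45-6789 and my email is john@example.com."))
# My SSN is [SSN_REDACTED] and my email is [EMAIL_REDACTED].
```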

Synthetic Data Replacement

For applications where the redaction tokens ([SSN_REDACTED]) would confuse the LLM, we offer synthetic data replacement.

Instead of replacing PII with tokens, we generate realistic-looking fake data that preserves the format:

PII Type    | Original            | Synthetic Replacement
Email       | john@example.com    | user_847@placeholder.com
Phone       | (555) 123-4567      | (555) 000-0001
SSN         | 123-45-6789         | 000-00-0000
Credit Card | 4111 1111 1111 1111 | 4000 0000 0000 0002

The synthetic data preserves:

  • Format: The LLM sees correctly formatted data, so it can still reason about structure.
  • Consistency: The same PII replaced multiple times in the same prompt gets the same synthetic value, so references remain coherent.
  • Invalidity: Synthetic values are intentionally invalid (SSNs starting with 000, Luhn-invalid card numbers) so they can't be confused with real data.

This is particularly useful for RAG applications where the LLM needs to reference specific data points in its response. The response will contain the synthetic data, which the application can optionally reverse-map to the real values if needed.
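The consistency property can be sketched with a per-prompt mapping (the placeholder domain and counter scheme below are illustrative):

```python
import re
from itertools import count

EMAIL = re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')

def synthesize_emails(text: str) -> str:
    # Same original value -> same synthetic value, so cross-references
    # inside the prompt stay coherent for the LLM.
    mapping: dict[str, str] = {}
    counter = count(1)

    def replace(m: re.Match) -> str:
        original = m.group()
        if original not in mapping:
            mapping[original] = f"user_{next(counter)}@placeholder.com"
        return mapping[original]

    return EMAIL.sub(replace, text)

text = "Mail john@example.com; cc jane@example.com; confirm with john@example.com."
print(synthesize_emails(text))
# Both mentions of john@example.com map to the same synthetic address.
```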

API Key Detection: A Separate Detector

PII isn't the only sensitive data that leaks into prompts. We also detect API keys and credentials with a separate detector that handles 10 patterns:

Pattern             | Example
OpenAI API keys     | sk-proj-..., sk-...
AWS Access Keys     | AKIA...
Google OAuth tokens | ya29....
Google API keys     | AIza...
GitHub PATs         | ghp_...
GitHub OAuth        | gho_...
Generic API keys    | api_key = "..."
Bearer tokens       | Authorization: Bearer ...

All detected credentials are replaced with [API_KEY_REDACTED]. This detector runs on both inputs and outputs, catching credentials that users accidentally paste into prompts and credentials that the LLM might hallucinate from its training data.
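As an illustration, a few of these prefixes are distinctive enough to match with short patterns (these are sketches, not the production detector's exact regexes):

```python
import re

# Illustrative patterns for three of the credential formats listed above.
KEY_PATTERNS = [
    re.compile(r'\bsk-[A-Za-z0-9_-]{20,}'),  # OpenAI-style secret keys
    re.compile(r'\bAKIA[0-9A-Z]{16}\b'),     # AWS access key IDs
    re.compile(r'\bghp_[A-Za-z0-9]{36}\b'),  # GitHub personal access tokens
]

def redact_keys(text: str) -> str:
    for pattern in KEY_PATTERNS:
        text = pattern.sub('[API_KEY_REDACTED]', text)
    return text

# Dummy key built from repeated characters, not a real credential.
print(redact_keys("use key AKIA" + "A" * 16 + " for S3"))
# use key [API_KEY_REDACTED] for S3
```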

Output Scanning: The Other Side

PII detection runs on both directions:

Input scanning catches PII that users are sending to the LLM. This prevents the data from reaching the LLM provider's servers.

Output scanning catches PII that the LLM generates in its response. This can happen when:

  • The LLM hallucinates realistic-looking PII
  • The LLM echoes back PII from its training data
  • A RAG system injects PII from retrieved documents into the response

For streaming responses, output scanning runs in real-time as chunks arrive. If PII is detected in a stream chunk, we can redact it inline without cutting the entire stream.
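One way to handle PII that straddles a chunk boundary is a small holdback buffer, sketched here for a single pattern (the buffer size and overall approach are illustrative):

```python
import re
from typing import Iterable, Iterator

SSN = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')
HOLDBACK = 12  # longer than the longest partial match we might split

def redact_stream(chunks: Iterable[str]) -> Iterator[str]:
    # Buffer a short tail of the stream so a pattern split across two
    # chunks is still seen whole, then emit the redacted prefix.
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        safe = SSN.sub('[SSN_REDACTED]', buffer)
        if len(safe) > HOLDBACK:
            yield safe[:-HOLDBACK]
            buffer = safe[-HOLDBACK:]
        else:
            buffer = safe
    yield SSN.sub('[SSN_REDACTED]', buffer)

# The SSN is split across two chunks, but the joined output is redacted.
chunks = ["My SSN is 123-4", "5-6789, thanks."]
print("".join(redact_stream(chunks)))
# My SSN is [SSN_REDACTED], thanks.
```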

Conclusion

PII detection isn't about finding patterns. It's about finding the right patterns with enough precision that you don't destroy the user experience.

Our approach is deliberately layered: regex patterns for structured PII, checksum validation (Luhn, ABA, Verhoeff, and more) for precision, ML NER for unstructured entities like names and addresses, encoded PII detection for obfuscation attempts, and preset-based sensitivity to match each application's needs. Redaction instead of blocking, always.

The structured detections remain fully auditable and deterministic—when we redact a credit card number, you can trace the decision to a specific regex pattern and a passing Luhn check. The ML NER layer adds coverage for unstructured PII that regex can't reach, while keeping false positives manageable through context-aware classification.

If you're not looking at context, you're not detecting PII—you're just grepping for digits. And if you're blocking instead of redacting, you're not protecting users—you're punishing them for having personal information.