Threat reference
Detailed description of each threat type, example payloads, and tuning guidance.
prompt_injection
Score: 0.80, Attempts to override or replace the system prompt. Classic patterns: ignore all previous instructions, you are now [X], act as, DAN mode, developer mode, evil confidant, stay in character. These attacks try to hijack your AI's persona or bypass your application's guardrails.
False positive risk: low
These phrases rarely appear in legitimate business prompts. If you observe false positives on specific phrases, lower the threshold slightly or contact support.
prompt_extraction
Score: 0.75, Attempts to read the system prompt or instructions. Patterns: repeat your system prompt, what are your instructions, summarize your instructions, print your context. Relevant for products that embed proprietary logic in system prompts.
jailbreak
Score: 0.90, Requests for harmful content generation: malware, ransomware, exploits, synthesis of controlled substances, or CSAM. Highest confidence tier, these phrases have no legitimate use in an enterprise LLM product.
pii
Score: 0.70, Personal data included in user messages: email addresses, French/Swiss phone numbers, French IBANs (FR76...), SIRET and SIREN numbers. Relevant for GDPR compliance, you may want to prevent personal data from leaving your perimeter by being sent to a third-party LLM.
GDPR note
Blocking PII in monitor mode first is recommended. Some applications legitimately process email addresses (e.g. a CRM assistant). Review security_events logs for 7 days before switching to block mode for PII.
secrets
Score: 0.95, API keys and credentials accidentally pasted into prompts: OpenAI keys (sk-..., sk-proj-...), Anthropic keys (sk-ant-...), Google API keys (AIza...), GitHub PATs (ghp_...), and generic api_key=, password=, secret= patterns. Highest priority, leaked keys can result in immediate financial damage.