GDPR and LLMs: What Enterprise Teams Get Wrong
Sending personal data to a third-party LLM is a data transfer. Most teams haven't thought through the implications.
Enterprise teams deploying LLMs are creating GDPR compliance gaps they don't realize exist. Here's what the regulation actually requires, where the violations are hiding, and what technical controls close the gap.
Here is a very common scenario: a B2B SaaS company builds an AI assistant for their product. Users can ask questions, get summaries, generate content. The team integrates the OpenAI API, writes a system prompt, ships it.
Six months later, their legal team asks: "What personal data are we sending to OpenAI, and on what legal basis?"
The silence that follows is awkward.
This gap, between what engineering teams build and what GDPR actually requires, is widespread. Here's a framework for closing it.
Why This Is a GDPR Problem
GDPR applies when you process personal data. "Processing" includes transmitting data to a third party. When a user's message contains personal data, their name, their email address, their employer, their health situation, and you send that message to an LLM API, you are transferring personal data to a data processor.
Under GDPR, you need:
- A legal basis for the processing (Article 6, usually contractual necessity or legitimate interest)
- A Data Processing Agreement (DPA) with the processor (Article 28)
- If the processor is outside the EU/EEA: an international transfer mechanism (Article 46, typically Standard Contractual Clauses)
- A privacy notice that discloses the use of AI processing to your users
Most teams have checked box 2 (OpenAI, Anthropic, and Google all offer DPAs). Boxes 1, 3, and 4 are where the gaps are.
Where Personal Data Actually Ends Up
The naive assumption is: "we're only sending generic prompts, no personal data." In practice, this is almost never true.
User messages are the most obvious source. If your product lets users type free text, they will include personal data. They paste their CV, describe their medical situation, mention their colleague's name, ask questions about specific customers.
But the less obvious sources are more dangerous:
RAG-retrieved documents: if your AI retrieves documents to provide context, those documents often contain personal data. Names, emails, contracts, invoices, meeting notes, all of this gets assembled into the context window and transmitted to the model.
Conversation history: multi-turn conversations accumulate context. Personal data mentioned five turns ago is still in the window being sent with every subsequent request.
System prompts: sometimes the system prompt itself contains personal data, a user's profile, their subscription tier, their past behavior. If this is assembled from a database and included in every call, it's personal data being transmitted on every request.
The Technical Controls That Matter
Compliance isn't just about contracts. Technical controls are what actually prevent personal data from leaving your perimeter unintentionally.
PII detection at the gateway
The most effective control is scanning user messages (and optionally retrieved content) for personal data before transmission to the LLM API.
A GDPR-relevant scanner should detect:
- Email addresses
- Phone numbers (French:
+33/0033/0[1-9]formats; Swiss:+41prefix) - IBANs (French IBANs start with
FR76and are 27 characters) - National identifiers (SIRET 14-digit, SIREN 9-digit)
In monitor mode, you get visibility into how much personal data is flowing through your system without blocking anything. This is the right starting point, you may discover that your RAG pipeline is pulling in far more personal data than you realized.
In block mode, you prevent the transmission entirely and return an error to the application, which can prompt the user to remove the personal data.
Prompt hashing for audit trails
When a threat event is logged, storing the full prompt text creates a new data retention problem, now you have a copy of the personal data in your security logs.
The correct approach is to hash the prompt (SHA-256) and store only the hash. This lets you correlate events with specific requests (for incident response) without retaining the raw text. It also means your security logs themselves don't become a GDPR liability.
Retention configuration
GDPR's data minimization principle applies to your security event logs too. Configure a retention window, typically 30 to 90 days, and enforce it with an automated purge job. The default 90-day window is appropriate for most enterprise use cases; you can lower it to 30 days if your DPA with the model provider requires shorter retention.
The DPA and International Transfer Gap
Most teams have signed the model provider's DPA. Fewer have verified that the DPA covers their actual use case.
Key questions to check:
Where is data processed? OpenAI, Anthropic, and Google process data in the United States. If your users are in the EU/EEA, this is an international transfer. You need a valid transfer mechanism, Standard Contractual Clauses (SCCs) are the standard path. All three providers offer them, but you may need to request the specific SCC documentation.
Does the provider use your data for training? This is the question everyone asks. For API access with a signed DPA, all major providers offer zero-data-retention or explicit opt-out from training. Verify this is configured for your account, not just assumed.
What's the subprocessor list? When the model provider uses subprocessors (cloud infrastructure, CDN, etc.), those subprocessors become part of the data flow. Your DPA should cover subprocessors.
The Privacy Notice Gap
If your product uses AI to process user data, GDPR requires you to disclose this in your privacy notice. The disclosure needs to cover:
- What AI systems you use and who provides them
- What categories of personal data are processed by the AI
- The legal basis for the processing
- Whether the data is used for training (and if so, on what basis)
- How users can exercise their rights (access, deletion, portability)
"We use AI" is not sufficient. "We transmit message content to Anthropic for processing under their API terms, subject to their privacy policy and our Data Processing Agreement" is the right level of specificity.
Building a Defensible Position
The teams that end up in regulatory trouble aren't the ones that made technical mistakes. They're the ones that built systems without a compliance framework and then couldn't demonstrate adequate controls when asked.
A defensible position requires:
- PII detection in your pipeline, so you can demonstrate that personal data is monitored and optionally blocked
- An audit trail, immutable logs of security events, retained for the appropriate window
- A DPA with the model provider, covering your specific use case and data categories
- SCCs or equivalent, for international transfers
- A privacy notice, that accurately describes your AI processing
None of these require stopping your LLM deployment. They require instrumenting it correctly.
The gateway is the right place to implement controls 1 and 2. Controls 3-5 are legal and process work that the gateway can support with the audit data it generates.
No credit card required
Was this useful?
Comments
Be the first to comment.