How much training data can be extracted from an LLM?

Simple sampling attacks recover 35% of all known memorized data. Model size correlates with extraction: GPT-Neo 6B leaks 65% of test sequences vs. 20% for GPT-Neo 125M. Using divergence attacks, 16.9% of responses contain memorized PII, and 85.8% of that PII is authentic. Recent research shows extraction risk metrics underestimate the true threat by up to 2.14x.

What is differential privacy for LLMs and does it work?

Differential privacy (DP-SGD) adds calibrated random noise to gradients during training, providing mathematical guarantees about individual data point privacy. Lower epsilon means more privacy but worse accuracy. Practical epsilon values of 3-8 provide reasonable privacy with acceptable accuracy loss. Very strict epsilon values (under 1) degrade model quality significantly without proportionally improving protection against real-world attacks.

Can you remove specific training data from a deployed model?

Not practically. Model weights are complex integrations of the entire training dataset. Even deleting specific data, the learned patterns persist in weights. Complete retraining from scratch is the only fully effective approach, but it's prohibitively expensive. Source-free unlearning (UC Riverside, 2025) is a promising research direction that operates without the original training data, but scalable solutions remain elusive.

How does GDPR right to erasure apply to AI models?

Article 17 gives individuals the right to have their data erased, but it lacks clear technical guidelines for AI systems. If a person's data was used to train a model, erasing it from storage doesn't remove it from model weights. The legal requirement exists. The technical ability to comply at scale doesn't. This creates an unresolved tension between privacy law and ML engineering reality.

What AI systems remember: training data extraction, memorization, and privacy leakage

Q: Does fine-tuning make memorization worse?

Yes. RLHF and SFT both increase memorization risk. The fine-tuning process concentrates model updates on specific data patterns, amplifying the model's retention of that data. LoRA (Low-Rank Adaptation) reduces memorization better than full fine-tuning because efficient parameter updates concentrate changes and reduce unintended side effects.

8 minute read

“We deleted the customer’s data from the database. The model still remembers it.”

TL;DR

Simple sampling recovers 35% of memorized training data. GPT-Neo 6B leaks 65% of test sequences. 16.9% of responses contain memorized PII, 85.8% authentic. Fine-tuning amplifies memorization. Differential privacy helps at epsilon 3-8 but degrades accuracy. Machine unlearning remains impractical at scale. The GDPR right to erasure doesn’t have a technical implementation for model weights. For how memorized data gets extracted through prompt injection, see Indirect prompt injection.

A RAM module with thermal imaging overlay showing warm active regions against cool inactive sectors

How does memorization actually work?

LLMs memorize training data through three mechanisms, ranging from exact reproduction to subtle pattern retention.

Verbatim memorization means the model generates an exact substring from its training set. Ask a model to complete a specific passage and it produces the next sentences word-for-word from the training corpus. Nicholas Carlini and colleagues demonstrated this against GPT-2 in their seminal 2021 USENIX Security paper, extracting verbatim training data through carefully crafted completion prompts. The key finding: models memorize more than safety evaluations detect, and the extractable content includes personal information, copyrighted text, and sensitive data.

Template extraction captures semantic patterns rather than exact strings. The model doesn’t reproduce text verbatim but generates content that follows the structure, key phrases, and distinctive patterns of memorized data. This is harder to detect through simple string matching but still leaks meaningful information about the training data.

Approximate extraction produces subtle reformulations that evade verbatim detection while revealing training content. Carlini’s 2025 research on defense bypasses found that definitions of memorization focused on verbatim matching are too narrow: even when defenses perfectly prevent verbatim reproduction, data leakage persists through style-transfer prompts and minimal modifications.

The extraction rates scale with model size. GPT-Neo 6B yielded 65% of test sequences through extraction. GPT-Neo 125M yielded 20%. Larger models memorize more and leak more. This is a direct tension with the industry trend toward ever-larger models.

What does the data show about extraction risk?

The numbers are worse than most organizations assume, and recent research suggests they’re underestimates.

Finding	Value	Source
Training data recovered by simple sampling	35% of memorized data	Carlini et al.
GPT-Neo 6B test sequence extraction	65%	Extraction benchmark
GPT-Neo 125M test sequence extraction	20%	Extraction benchmark
Responses containing memorized PII	16.9% of 15,000 responses	Divergence attack study
Authenticity of leaked PII	85.8%	Divergence attack study
Inference-time data exposure	13% of prompts	Comparative analysis
Training-time extraction rate	0.00001%	Comparative analysis
Extraction rate underestimation	Up to 2.14x	Sequence-level analysis, 2025

The inference-time versus training-time distinction matters. 13% of generative AI prompts leak sensitive data at inference time (through the conversation) compared to 0.00001% through deliberate training data extraction. The practical privacy risk is overwhelmingly at inference time: users paste credentials, share personal information, and discuss sensitive topics. But training data extraction provides the attacker with data they never directly shared, which changes the threat model entirely.

Recent 2025 research found that the standard Extraction Rate metric underestimates the true leakage threat by up to 2.14x. Between 30.4% and 41.5% of memorized sequences are easier to extract than aggregate metrics suggest when analyzed at the sequence level rather than averaged across the corpus.

Prompts can also amplify or suppress memorization. Research shows that specifically crafted attack prompts increase leakage by up to 9.3%, while suppress prompts decrease extraction by up to 97.7%. This suggests fine-grained control over memorization through instruction design is possible.

Does fine-tuning make this worse?

Yes. RLHF and SFT both increase memorization risk. This is counterintuitive: you’d expect fine-tuning on a small dataset to not affect memorization of the pre-training data. But the fine-tuning process reshapes the model’s internal representations in ways that can make previously latent memories more accessible.

The mechanism: fine-tuning concentrates model updates on specific data patterns. Full fine-tuning (updating all parameters) creates the highest memorization risk because it modifies the most weights. The good news: LoRA (Low-Rank Adaptation) reduces memorization better than full fine-tuning. LoRA restricts parameter updates to low-rank matrices, concentrating changes and reducing unintended memorization side effects.

If you’re fine-tuning with private or sensitive data, LoRA with differential privacy is the current best practice for the privacy-accuracy tradeoff.

What can differential privacy actually do?

Differential privacy (DP) provides a mathematical guarantee: the model’s outputs shouldn’t change significantly whether any single individual’s data is included in the training set or not. DP-SGD implements this by clipping gradients (limiting the influence of any single training example) and adding calibrated random noise.

The privacy-accuracy tradeoff is controlled by epsilon (ε):

ε ≤ 1: Strict privacy. Significant accuracy degradation. Appropriate for compliance-critical applications.
ε = 3-8: Practical range. Reasonable privacy with acceptable accuracy loss. Where most production deployments should land.
ε > 10: Weak privacy guarantee. Limited practical protection against extraction attacks.

2025 research found something important: altering epsilon has minimal impact on membership inference attacks in many practical scenarios. Setting epsilon unnecessarily small degrades accuracy without proportionally improving privacy against real-world attacks. This means organizations should calibrate epsilon based on their actual threat model, not default to the smallest possible value.

Google Research has published guidance on fine-tuning LLMs with user-level differential privacy, demonstrating that DP-SGD with LoRA (parameter-efficient fine-tuning plus differential privacy) achieves the best privacy-accuracy tradeoff. The combination restricts both which parameters are updated and how much any single data point can influence those updates.

Can you actually delete data from a model?

Not in any practical sense. This is the core tension between privacy regulation and ML engineering.

Model weights are complex, distributed integrations of the entire training dataset. A person’s data doesn’t live in a specific location in the model that can be deleted. It’s diffused across billions of parameters, entangled with every other training example that influenced those same parameters. Even removing the original data and reprocessing doesn’t guarantee the learned patterns are gone.

The only fully effective approach is retraining from scratch on a dataset that excludes the target data. For a model that cost millions of dollars to train, this is prohibitively expensive and impractical for individual erasure requests.

Machine unlearning research is trying to solve this. UC Riverside proposed “source-free unlearning” (September 2025): a certified unlearning method that operates without the original training data, reducing the computational cost of compliance. This is a promising direction but not yet scalable to production LLMs.

The GDPR right to erasure (Article 17) creates the legal obligation. The technical ability to comply at scale doesn’t exist yet. The European Data Protection Board (EDPB) has acknowledged this gap in their 2025 guidance on effective implementation of data subject rights in AI systems, but no clear technical standard has emerged.

This creates a real compliance risk for organizations that fine-tune models on personal data. The data can be deleted from the fine-tuning dataset, but its influence remains in the model weights. Whether this satisfies Article 17 is legally untested at the time of writing.

Key takeaways

Simple sampling extracts 35% of memorized training data. Larger models memorize and leak more.
16.9% of LLM responses contain memorized PII, 85.8% of which is authentic
Fine-tuning (RLHF, SFT) increases memorization risk. LoRA reduces it relative to full fine-tuning
Differential privacy with epsilon 3-8 provides practical protection. Smaller epsilon degrades accuracy without proportional privacy improvement
Machine unlearning remains computationally impractical. Retraining from scratch is the only fully effective approach
GDPR Article 17 requires data erasure. Model weights don’t forget. The technical compliance gap is real and legally untested.
Standard extraction metrics underestimate true leakage by up to 2.14x

FAQ

How much training data can be extracted?

35% through simple sampling. GPT-Neo 6B leaks 65% of test sequences. 16.9% of responses contain memorized PII, 85.8% authentic. Extraction metrics underestimate true risk by up to 2.14x at the sequence level.

Does fine-tuning make memorization worse?

Yes. RLHF and SFT increase memorization risk. LoRA reduces it relative to full fine-tuning by concentrating parameter updates. For sensitive data, combine LoRA with differential privacy.

What epsilon should I use for differential privacy?

3-8 for most production deployments. Below 1 degrades accuracy significantly without proportional privacy improvement against real attacks. Calibrate to your threat model, not the theoretical minimum.

Can you delete data from a deployed model?

Not practically. Model weights are distributed integrations of training data. Source-free unlearning is promising but not yet scalable. Complete retraining from scratch on a filtered dataset is the only fully effective approach.

Article 17 requires data erasure. Model weights don’t forget. The technical compliance gap is real and legally untested. Organizations fine-tuning on personal data face unresolved compliance risk.

Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch

What AI systems remember: training data extraction, memorization, and privacy leakage

TL;DR

How does memorization actually work?

What does the data show about extraction risk?

Does fine-tuning make this worse?

What can differential privacy actually do?

Can you actually delete data from a model?

Key takeaways

FAQ

How much training data can be extracted?

Does fine-tuning make memorization worse?

What epsilon should I use for differential privacy?

Can you delete data from a deployed model?

Related across topics

Share on

TL;DR

How does memorization actually work?

What does the data show about extraction risk?

Does fine-tuning make this worse?

What can differential privacy actually do?

Can you actually delete data from a model?

Key takeaways

FAQ

How much training data can be extracted?

Does fine-tuning make memorization worse?

What epsilon should I use for differential privacy?

Can you delete data from a deployed model?

How does GDPR apply to memorized training data?

Related across topics

Prompt Injection Defense

Share on