Indirect Prompt Injection
Indirect Prompt Injection
Summary
Indirect prompt injection is the root category from which tool-result injection descends. An attacker plants instructions in data that the agent will later retrieve — a webpage, email, document, ticket, calendar invite, PDF, image with hidden text — and waits. When the agent reads that data, the instructions enter its context on the same footing as the user’s own request. Unlike direct prompt injection, the attacker never speaks to the LLM; they speak to a document that the LLM will someday read. The canonical paper is Greshake et al. 2023, and every year since has produced fresh production demonstrations.
How it works
- The attacker writes an instruction payload — plain text, hidden HTML, zero-width characters, white-on-white text in a PDF, alt-text in an image, metadata in a calendar invite.
- The payload lands somewhere the agent is likely to encounter: a public webpage, a shared Google Doc, an email inbox, a Jira ticket, a wiki page, a product review.
- A legitimate user asks the agent a legitimate question whose answer requires reading that data: “summarise my inbox”, “review this PR”, “what’s on my calendar”.
- The retrieval step pulls the poisoned content into the agent’s context.
- The LLM cannot tell the difference between “the user asked me to…” and “this email told me to…”. It may follow the injected instructions — visiting a URL, exfiltrating data, forwarding emails, calling tools.
Greshake et al. formalised this in 2023 and demonstrated it against Bing’s GPT-4-powered Chat, GPT-4 code completion, and synthetic agents. The paper’s threat taxonomy — data theft, worming, ecosystem contamination, unauthorised API calls — has held up.
Real-world example
Greshake et al., “Not what you’ve signed up for”, arXiv 2302.12173, February 2023. The canonical paper. Submitted 23 February 2023, final revision 5 May 2023, published at the 16th ACM Workshop on AI and Security (AISec ‘23). Authors: Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, Mario Fritz. Demonstrated working exploits against Bing Chat (then GPT-4-powered), GPT-4 code completion, and synthetic agents. Showed remote control of the model at inference time, persistent compromise, data theft, worming between documents, and denial of service. Established that processing retrieved prompts is equivalent to arbitrary code execution of the LLM’s tool-use surface. (arxiv.org/abs/2302.12173; Black Hat USA 2023 whitepaper, accessed 19-04-2026.)
ChatGPT Operator data exfiltration via Hacker News page, Johann Rehberger, February 2025. Discussed in the sibling attack page, but belongs here too: the injection payload lived in a GitHub issue title the agent navigated to, not in anything the user typed. The agent extracted a private email address from the user’s logged-in Hacker News session and leaked it through a textarea field. (simonwillison.net, 17-02-2025, accessed 19-04-2026.)
The “lethal trifecta”, Simon Willison, June 2025. Willison, who coined “prompt injection” in 2022, formalised the conditions under which indirect injection becomes catastrophic: the agent has (1) access to private data, (2) exposure to untrusted content, and (3) the ability to communicate externally. Any agent combining these three is exploitable. MCP encourages users to mix-and-match tools in exactly this combination. (simonwillison.net, 16-06-2025, accessed 19-04-2026.)
OpenAI’s December 2025 statement. OpenAI publicly acknowledged that prompt injection against AI browsers “may never be fully solved”, a position echoed by other labs. (techcrunch.com, 22-12-2025; fortune.com, 23-12-2025, accessed 19-04-2026.)
Impact
- Exfiltration of any private data the agent can see — email contents, documents, chat histories, source code, secrets.
- Unauthorised actions taken “as the user”: sent emails, forwarded invites, transferred funds, approved PRs.
- Persistence: an injected instruction can tell the agent to plant the same payload in new documents (worming).
- Information-ecosystem contamination — the agent produces summaries shaped by the attacker’s narrative.
- Reputation damage when the agent posts attacker-dictated content under the user’s identity.
Detection
- Retrieval tools returning payloads containing imperative second-person text (“you must”, “ignore”, “now do X”).
- Hidden-character anomalies: zero-width spaces, Unicode tag characters, CSS-hidden text, white-on-white.
- Agent tool calls whose target has no provenance in the user’s original request.
- Any tool call that sends data outward shortly after a tool call that read externally-authored content.
- Divergence between user intent (extracted from the original prompt) and actual tool-call graph.
Prevention
Architecturally, the answer is to break the lethal trifecta. At the transport layer, that means enforcing that a single agent session cannot simultaneously read untrusted content, access private data, and send data outward without an explicit policy decision.
Example Intercept policy for a mixed-capability agent:
version: "1"
description: "Break the lethal trifecta at the transport layer"
default: "allow"
tools:
web_fetch:
rules:
- name: "mark session as tainted once external content is read"
state:
counter: "tainted_reads"
window: "hour"
increment: 1
send_email:
rules:
- name: "block outbound email after tainted reads"
conditions:
- path: "state.web_fetch.tainted_reads"
op: "lte"
value: 0
on_deny: "Session has read untrusted web content; outbound email blocked"
read_private_repo:
rules:
- name: "block private reads after tainted reads"
conditions:
- path: "state.web_fetch.tainted_reads"
op: "lte"
value: 0
on_deny: "Session has read untrusted web content; private-repo access blocked"
post_webhook:
rules:
- name: "allowlist outbound destinations"
conditions:
- path: "args.url"
op: "regex"
value: "^https://hooks\\.internal\\.example\\.com/"
on_deny: "Outbound webhook target not on allowlist"
The three action: deny style gates are the practical implementation of the lethal-trifecta principle: once the session has consumed untrusted content, its ability to read private data or call outbound tools is revoked for the remainder of the session.
Combine with:
- Per-session token scoping so the agent only sees data for the current task.
- Content filters that strip instruction-like patterns from retrieved documents before they reach the model.
- Egress allowlists on any tool that can send data outside the trust boundary.
Sources
- Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection — Greshake et al., arXiv:2302.12173 — accessed 19-04-2026
- Black Hat USA 2023 whitepaper (PDF) — accessed 19-04-2026
- Published version, AISec ‘23, ACM DL — accessed 19-04-2026
- The lethal trifecta for AI agents — Simon Willison, 16-06-2025 — accessed 19-04-2026
- ChatGPT Operator: Prompt Injection Exploits & Defenses — Simon Willison, 17-02-2025 — accessed 19-04-2026
- OpenAI says AI browsers may always be vulnerable to prompt injection attacks — TechCrunch, 22-12-2025 — accessed 19-04-2026
- Prompt Injection — Wikipedia — accessed 19-04-2026
Related attacks
- Prompt Injection via Tool Results
- Confused Deputy
- Destructive Action Autonomy
Protect your agent in 30 seconds
Scans your MCP config and generates enforcement policies for every server.
npx -y @policylayer/intercept init