Research
Detection · Updated May 1, 2026 · 8 min read

From Prompt Injection to Tool Execution

Prompt injection becomes an operational risk when hidden instructions influence a tool action. The defensive surface has to connect content detection to execution.

Thesis

The strongest prompt-injection response ties content detection to the next action the agent tries to take.

prompt injection · detections · tool use

Technical readout

Detection tied to the next action

Prompt-injection controls should score content and then re-evaluate severity when the agent attempts a concrete operation.

content origin

Did the instruction come from the user, a file, a web page, tool output, or copied text?

Classify the source surface and attach it to the session so downstream actions can inherit that risk.

instruction conflict

Does the content ask the agent to ignore instructions, reveal secrets, change tools, or exfiltrate data?

Record the detection category and confidence without making a final enforcement decision from text alone.

next action

What operation did the agent attempt after the suspicious content entered context?

Join the detection to the following file, command, fetch, write, or MCP event in the same session.

asset sensitivity

Would the proposed action touch secrets, private repos, internal hosts, credentials, or customer data?

Escalate from observe to warn or block when suspicious context meets a sensitive target.
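The four questions above can be sketched as a small session-scoped evaluator. This is a minimal illustration, not a real API: the names (`Detection`, `Session`, `evaluate_action`) and the sensitive-path list are assumptions made for the example.

```python
from dataclasses import dataclass, field

# Assumed sensitive targets for the sketch; a real system would use policy.
SENSITIVE_PREFIXES = ("/etc/", "~/.ssh/", "secrets/", ".env")

@dataclass
class Detection:
    origin: str        # "user" | "file" | "web" | "tool_output" | "clipboard"
    category: str      # e.g. "instruction_conflict", "exfiltrate"
    confidence: float

@dataclass
class Session:
    detections: list = field(default_factory=list)

    def record(self, detection: Detection) -> None:
        # Score the content and attach it to the session, but make no
        # enforcement decision from text alone.
        self.detections.append(detection)

    def evaluate_action(self, action: str, target: str) -> str:
        # Re-evaluate severity when the agent attempts a concrete operation.
        suspicious = any(d.confidence >= 0.5 for d in self.detections)
        sensitive = target.startswith(SENSITIVE_PREFIXES)
        if suspicious and sensitive:
            return "block"    # suspicious context meets a sensitive target
        if suspicious:
            return "warn"
        return "observe"
```

The point of the sketch is the escalation: the same detection yields "warn" against a benign file but "block" against a credential path.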

Injection is a chain, not a string match

A malicious instruction embedded in a README, issue, web page, tool output, or copied snippet is only the first step. The security impact appears when the agent follows that instruction into a file read, command, web request, commit, or external tool call.

A detector that only scores text can create noise. A runtime system can ask the next question: what action is the agent attempting after seeing suspicious content?

Policy response should depend on action risk

Some matches should warn and preserve context. Others should block immediately, especially when the next step touches secrets, sensitive paths, external destinations, or high-impact MCP tools.

That is why detections need to connect cleanly to policy packs. The category explains what was recognized. The policy determines whether the response is observe, warn, or block for the organization and surface.
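A policy pack can be as simple as a lookup from detection category and action risk to a response. The shape below is an assumption for illustration, not a real schema.

```python
# Illustrative policy pack: (detection category, action risk) -> response.
POLICY_PACK = {
    ("instruction_conflict", "touches_secrets"): "block",
    ("instruction_conflict", "external_fetch"): "warn",
    ("instruction_conflict", "read_public"): "observe",
}

def decide(category: str, action_risk: str, default: str = "observe") -> str:
    # The category explains what was recognized; the pack decides
    # whether the response is observe, warn, or block.
    return POLICY_PACK.get((category, action_risk), default)
```

Keeping the mapping in data rather than code lets each organization tune the response per surface without changing the detector.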

The signal should survive normalization

Prompt injection can appear in many places: a user prompt, a file body, a command argument, a fetched document, or the response from a connected tool. A strong system preserves where the signal came from while still reducing the event into a common policy decision.

That common shape lets teams write controls such as warn on suspicious content in public repos, block when the next action reads secrets, and record the full session for review.
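One way to picture that common shape is a normalized event that keeps the origin while exposing the fields controls are written against. Field names here are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NormalizedEvent:
    origin: str       # where the signal came from: "user", "file", "web", ...
    surface: str      # e.g. "public_repo", "internal_host"
    category: str     # detection category
    next_action: str  # the operation the agent attempted
    target: str       # affected asset

def evaluate(event: NormalizedEvent) -> str:
    # Controls written once against the common shape.
    if event.next_action == "read_file" and "secret" in event.target:
        return "block"
    if event.surface == "public_repo":
        return "warn"
    return "observe"
```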

Operators need the full story

A useful alert should show the suspicious pattern, the tool action, the affected asset, the policy response, and the surrounding session timeline. Without that context, analysts have to guess whether the match mattered.
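An alert payload carrying that full story might look like the following; the keys are assumptions for the example, not a fixed format.

```python
def build_alert(pattern, action, asset, response, timeline):
    # Bundle everything an analyst needs to judge whether the match mattered.
    return {
        "suspicious_pattern": pattern,   # the text that was recognized
        "tool_action": action,           # the operation the agent attempted
        "affected_asset": asset,         # what that operation would touch
        "policy_response": response,     # observe / warn / block
        "session_timeline": timeline,    # surrounding events for review
    }
```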

Agent security gets better when detections are not hidden in a static catalog. They should appear in policy design, runtime activity, investigations, and fleet reporting.

Technical model

Injection chain

A text match becomes security-relevant when it influences a concrete action.

Signals in the model

Untrusted content

README, issue, webpage, tool output, copied snippet

Model interpretation

Instruction conflict, hidden goal, session context

Proposed action

Read secret, run command, call tool, fetch URL, write file

Policy response

Observe, warn, block, preserve evidence

Detection quality improves when the next action determines severity.
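The model above reduces to a short severity function over its four signals. The thresholds are illustrative assumptions.

```python
def severity(content_untrusted: bool, instruction_conflict: bool,
             action_proposed: bool, asset_sensitive: bool) -> str:
    # A text match becomes security-relevant only when it reaches an action.
    if not (content_untrusted and instruction_conflict):
        return "none"       # no suspicious signal in context
    if not action_proposed:
        return "observe"    # suspicious text, no concrete operation yet
    if asset_sensitive:
        return "block"      # chain completed against a sensitive asset
    return "warn"           # chain completed against a low-risk target
```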