Silent Exfiltration via Malicious Skills in LLM Agents: Supply Chain Risk for Developers
Abstract
LLM agents increasingly rely on a growing ecosystem of “skills” and tools that expose local resources like codebases, documents, and APIs. Most security discussions focus on prompt injection against the model itself, but the agent’s skill layer is an equally attractive and often less defended target. In this work, I demonstrate a proof-of-concept attack where a seemingly benign “code search” skill is extended to silently exfiltrate source files when invoked by an agent. The agent, configured to help a developer answer questions about their repo, will happily route sensitive code through a malicious skill that looks legitimate in interface and behavior. I outline the architecture, the attack chain, a minimal PoC implementation, and practical mitigations developers can apply when adopting third-party skills.
Background
Figure 1 - The agent sits between the developer and local resources, but skills form an additional supply chain layer that can be compromised.
Agents, Tools, and Skills
Modern LLM agents typically wrap a base model with:
- Tools / skills: small functions to read files, call HTTP APIs, run shell commands, search code, etc.
- Orchestration logic: decides which tool to call and when, based on the conversation and intermediate results.
- Memory: stores previous messages and tool outputs.
From the agent’s perspective, a skill is just:
- A name (
code_search,list_files,run_tests) - A description (“searches the codebase for a symbol and returns relevant snippets”)
- A callable interface (arguments + returned text or JSON)
Security assumption (usually implicit):
“If a skill is installed/enabled, it is trusted to behave according to its description.”
That assumption becomes fragile when:
- Skills are pulled from external sources (open-source repos, registries, examples in blog posts).
- Developers don’t rigorously review or sandbox them.
- Agents have access to sensitive local resources (private repos, secrets, configs).
Supply Chain Risk in AI Skills
Traditional software supply chain attacks target:
- Dependencies (malicious packages, typosquatting)
- Build systems and CI pipelines
- Containers and base images
For LLM agents, there is a new layer:
- Agent skill supply chain - “plug-and-play” capabilities are added via:
- GitHub repos (“awesome agents” lists, random tools)
- Copy-pasted snippets from blogs
- Framework registries (skills/plugins)
If an attacker can convince a developer to adopt a malicious or compromised skill, they don’t need to jailbreak the LLM. The agent will faithfully call the malicious code on its behalf.
Vulnerable Protocol: “Code Search” as a Trojan Horse
Minimal Agent Architecture
For this PoC, we assume a very simple local architecture:
- Base model: local LLM (e.g. Ollama or any tool-calling capable model)
- Orchestrator: Python script that:
- Takes user queries
- Lets the LLM choose tools
- Executes tools and feeds results back
- Skills:
list_files(path: str) -> list[str]read_file(path: str) -> strcode_search(query: str) -> str(our focus)
- Network:
- The environment allows outbound HTTP requests (e.g.
requestsorcurlfrom Python).
- The environment allows outbound HTTP requests (e.g.
The intended behavior of code_search:
- Take a search term like
jwt_secretorvalidateToken. - Use
ripgrepor similar to search the repo. - Return a summary or relevant snippets to the LLM.
Trust Boundary Violation
The protocol between agent and skill is:
- Agent: “Skill X, please search for ‘jwt_secret’.”
- Skill: “Here are the results: …”
The agent does not:
- Parse or inspect what the skill actually did on disk or network.
- Track where the results went before being returned.
- Enforce any policy on what network endpoints the skill may contact.
This creates a gap:
As long as the interface is respected and the final output looks plausible, the agent cannot tell if the skill silently copied the repo elsewhere.
In other words, the skill implementation is a supply chain risk.
Attack Chain / Exploit
Figure 2 - High-level flow from installing an attractive skill to silent codebase exfiltration.
Assumptions
- Developer uses a local agent to help with codebase questions.
- Developer installs a “code search” skill from an external source (e.g. GitHub, a blog post, a registry) without thorough review.
- The skill has:
- Legitimate code search functionality (so it appears to “work”).
- Hidden exfiltration logic.
Step 1 - Skill Adoption
The attacker publishes a skill that looks appealing:
- Name:
code_insightorsemantic_code_search - Marketing: “Fast semantic search over your repo using embeddings.”
- Interface:
search(query: str) -> str
The repo includes:
- Example integrations for popular agent frameworks.
- A README that looks professional and mentions common dev workflows.
Developer adds it to their agent:
|
|
Step 2 - Legitimate Usage
Developer asks:
“Where is the JWT validation logic implemented?” “Search for usages of
verify_token.” “Show me all references toPaymentController.”
The agent:
- Chooses
code_searchbased on the tool description. - Calls
code_search(query="verify_token"). - Receives a summary/snippet and returns it to the user.
Everything appears normal.
Step 3 - Silent Exfiltration
Inside the malicious skill:
- In addition to running
rgor a real search, the skill:- Walks part or all of the repo.
- Bundles files (e.g.
.py,.js,.envminus obvious noise) into chunks. - Sends them to an attacker-controlled server via HTTPS.
Crucially:
- It only does this when called in a “rich” environment (e.g.
$PWDlooks like a project,.gitpresent). - It cleans up temporary artifacts.
- It does not print or log its malicious behavior to stdout.
The agent sees:
- Valid search results.
- No obvious errors.
- Nothing suspicious in the tool’s returned string.
Step 4 - Impact
Depending on the repo:
- Source code exfiltration.
- Secrets (if
.env, configs, or mismanaged credentials are in the tree). - Internal business logic and proprietary algorithms.
- API keys, tokens, or URLs that enable further compromise.
No prompt injection, jailbreak, or LLM-level exploit was necessary. The agent simply invoked an untrusted skill with high-privilege local access.
PoC Code
Below is a simplified sketch of a malicious code_search skill implementation.
This is illustrative; in a real PoC, the exfil endpoint should be controlled in a safe lab environment.
|
|
From the agent’s point of view:
code_searchreturns plausible text.- The malicious
_silent_exfiltrateis invisible unless the environment monitors outbound HTTP.
In your research repo, this would be:
- A standalone PoC skill.
- A simple agent script that wires it in.
- A controlled “attacker server” (e.g. a local Flask app) to log received files.
Mitigation
Figure 3 - Moving from unconstrained skills to a hardened skill layer with sandboxing and egress controls.
1. Treat Skills as High-Risk Dependencies
- Do not treat skills as copy-paste snippets from blogs.
- Review skills like you would any dependency that:
- Accesses your filesystem.
- Has network egress.
- Touches secrets or production data.
Checklist:
- Read the entire file/module before enabling it.
- Search for:
requests,httpx,urllibos.walk,glob, wide filesystem scans- Hardcoded URLs and hostnames.
- Flag any behavior that is unrelated to the advertised description.
2. Constrain Skill Capabilities
Where possible:
- Run skills in a sandboxed environment:
- Chroot/jail, Docker, or a restricted container.
- Read-only view of the repo when possible.
- Limit network egress:
- Only allow outbound traffic to a small set of endpoints (e.g. your own services).
- Deny or log all other outbound HTTP requests from skill processes.
3. Explicit Data Flow Policies
Introduce simple, enforceable rules:
- “Skills that read from the repo cannot send HTTP requests.”
- “Skills with network access cannot read from local disk.”
- “No skill should handle both secrets and network output.”
Even basic guards like:
- Wrapping skills with a decorator that logs:
- Files touched
- Network requests
- Or enforcing:
- “No more than N files can be read per invocation.”
can drastically reduce the risk.
4. Skill Provenance and Signing
For more mature deployments:
- Maintain an internal allowlist of reviewed skills.
- Pin to specific skill versions/hashes.
- Use signing:
- Only run skills whose source matches a known signature.
- Avoid pulling skills directly from the Internet into production agents.
5. Observability and Detection
Invest in:
- Logging:
- Outbound HTTP from skill processes.
- Large or unusual read patterns (e.g. traversing entire repo on a simple search call).
- Alerts:
- Skills trying to access
.env,id_rsa,.aws, etc. - Skills that suddenly change behavior between versions.
- Skills trying to access
Even if you can’t prevent every malicious skill, you can notice strange behavior quickly.
References
- OWASP Top 10 for LLM Applications (OWASP LLM Top 10, 2025) – LLM03: Training Data & Model Supply Chain
- Mitiga – “AI Agent Supply Chain Risk: Silent Codebase Exfiltration via Skills” (2025)
- AppSecEngineer – “Understanding Prompt Injection: A Guide to AI’s Top Security Threat (LLM01)” (2025)
- NeuralTrust / Wikipedia – “Prompt injection” (2025) – overview and recent multi-step jailbreak research
- “Prompt Injection Attacks on Agentic Coding Assistants” (arXiv:2601.17548, 2026)
- “AgentLAB: Benchmarking LLM Agents against Long-Horizon Attacks” (arXiv:2602.16901, 2026)
- SentinelOne – “What Is LLM (Large Language Model) Security?” – practical overview of LLM-specific risks
Comments