> file writes
> construct a `curl`
I am not a security researcher, but this combination does not align with "safe" to me.
More practically, if you are using a coding agent, you explicitly want it to be able to write new code and execute that code (how else can it iterate?). So even if you block Bash, you still need to give it access to a language runtime, and that language runtime can do ~everything Bash can do. Piping data to and from the LLM, without a runtime, is a totally different, and much limited, way of using LLMs to write code.
Much better to allow full Bash but run in a sandbox that controls file and network access.
> ReadFile ../other-project/thing
> Oh, I'm jailed by default and can't read other-project. I'll cat what I want instead
> !cat ../other-project/thing
It's surreal how often they ask you to run a command they could easily run, and how often they run into their own guardrails and circumvent them
It doesn't mean we can't try, but one has to understand the nature of the problem. Prompt injection isn't like SQL injection, it's like a phishing attack - you can largely defend against it, but never fully, and at some point the costs of extra protection outweigh the gain.
You're missing the point.
An agent system consists of an LLM plus separate "agentive" software that can a) receive your input and forward it to the LLM; b) receive text output by the LLM in response to your prompt; c) ... do other stuff, all in a loop. The actual model can only ever output text.
No matter what text the LLM outputs, it is the agent program that actually runs commands. The program is responsible for taking the output and interpreting it as a request to "use a tool" (typically, as I understand it, by noticing that the LLM's output is JSON following a schema, and extracting command arguments etc. from it).
Prompt injection is a technique for getting the LLM to output text that is dangerous when interpreted by the agent system, for example, "tool use requests" that propose to run a malicious Bash command.
You can clearly see where the threat occurs if you implement your own agent, or just study the theory of that implementation, as described in previous HN submissions like https://news.ycombinator.com/item?id=46545620 and https://news.ycombinator.com/item?id=45840088 .
I am not sure it is reasonably possible to determine which Bash commands are malicious. This is especially so given the multitude of exploits latent in the systems & software to which Bash will have access in order to do its job.
It's tough to even define "malicious" in a general-purpose way here, given the risk tolerances and types of systems where agents run (e.g. dedicated, container, naked, etc.). A Bash command could be malicious if run naked on my laptop and totally fine if run on a dedicated machine.
> Prompt injection is a technique for getting the LLM to output text that is dangerous when interpreted by the agent system, for example, "tool use requests" that propose to run a malicious Bash command.
One of the things Claude can do is write its own tools, even its own programming languages. There's no fundamental way to make it impossible to run something dangerous, there is only trust.
It's remarkable that these models are now good enough that people can get away with trusting them like this. But, as Simon has himself said on other occasions, this is "normalisation of deviance". I'm rather the opposite: as I have minimal security experience but also have a few decades of watching news about corporations suffering leaks, I am absolutely not willing to run in YOLO mode at this point, even though I already have an entirely separate machine for claude with the bare minimum of other things logged in, to the extent that it's a separate github account specifically for untrusted devices.
And it is simply easier to whitelist directories than individual commands. Unix utilities weren't created with fine-grained capabilities and permissions in mind. Wherever you add a new script or utility to a whitelist, you have to actively think whether any new combination may lead to privileges escalation or unintended effects.
Use the original container, the OS user, chown, chmod, and run agents on copies of original data.