>it’s unfortunate this one slipped through a crack in our disclosure pipeline
>As we’re now aware of this report
This isn't the first time. https://x.com/PhilipTsukerman/status/1988634162773778501 https://x.com/_xpn_/status/1986382527817564437
What very likely happened here is you received good faith security research by email and you forced the researcher to submit through HackerOne or Bugcrowd or whatever, which mandates their compliance with Platform Terms and Disclosure Terms and Codes of Conduct and whatnot.
The SECURITY.md files in your GitHub repos only mention the email address. Can researchers like this one report issues via email and get a response, or not?
May 08, 2026 PromptArmor discloses to OpenAI via email
May 08, 2026 OpenAI sends an automated reply, confirming the intended reporting channel
May 08, 2026 PromptArmor confirms email preference
May 12, 2026 PromptArmor follows up
May 18, 2026 PromptArmor follows upThese "defenses", are they "just" long sentences in the prompt begging the AI to not follow through with stuff like this? Or is it more like sub-agents running in sandboxes?
That doesn't sound like a one-trillion-dollar company is supposed to operate, does it?
It’s not a one trillion dollar company anymore.
Anthropic won enterprise and Gemini is taking ChatGPTs consumer subscriptions month over month.
Morale at OAI is all time low right now.
We're Sorry
...
I played with your heart
Got lost in the game
Oh, baby, baby
Oops, you think I'm in love
That I'm sent from above
I'm not that innocent
-- Britney.There's an ocean of difference between e.g. preventing the model from routing to something at the firewall level and just updating the prompt (especially given models' historically poor understanding of negative prompts, relatively speaking).
Enjoy your Ferrari though
I mean Warren Buffet eats at McDonalds every day!
I use this feature with my agents on a daily basis so hopefully you develop a more surgical approach to security here and restore this
I keep trying to explain this to devs but there’s nothing out there except screaming over me about how great leetcode is or more recently it’s how great various AI uses are. Just completely ignorant isolated screaming to dismiss people like me putting in the work fix slop that steals all attention praise and career advancement or even getting through the slop hiring process.
This is directly caused by slop leetcode style hiring.
I have no doubt this finding is just the tip of the iceberg.
- "slipped through a crack in our disclosure pipeline"
.. mean something akin to, "DownDetector Itself Doesn't Detect that It Is Also Down"? or something like that?
Is there a category of security problems such as this? It seems fascinating to me, and severe.
Oh, whoopsie!
I'm working on a project that includes WASI containerization for local LLM workflows (which is a pretty tough problem), and I'm flabbergasted that Anthropic and OpenAI aren't more worried about these attack vectors. It feels like amateur hour.
Yep. We tricked them both trivially with malicious fonts in Docx files. Documented it here: https://tritium.legal/blog/noroboto
I wonder if prompt injection (and the thousands of vectors for hiding injection attempts) is actually un solvable. Discussing it may be existential to the business model.
YES?!
This is not a secret. ALL context/prompt is instructions, there is no data. It is just unsolvable, period.
This is a fundamental architectural design concession; LLMs are this way as it enabled their training directly on materialscraped from the internet, rather than needing to spend trillions of dollars manually preparing separated instruction/data training material.
Defense against prompt injection is little more than running a regex to filter out "IGNORE PREVIOUS INSTRUCTIONS", which is fundamentally a hopeless approach because you cannot enumerate all possible prompt injections nor anticipate all glitch tokens.
1. don’t use AI/ML.
*f*(x) -> y
literally what’s happened here, they’ve turned it off short term. don’t use AI/ML and prompt injection can’t happen. use something else for f.2. don’t allow untrusted/malicious input
f(*x*) -> y
don’t allow bad x and you won’t get bad y. unfortunately models are designed to take an x, and figuring out every bad x is hard. the input space is massive and dynamic (variable length input sequences which are contextually variable too).because figuring out the full space of bad xs is non-trivial, you’re left with doing stuff with known bad xs. which means cat and mouse game when new things pop up.
unless someone figures out how to map the full X space to the Y space, or we have infinite monkeys figure it out for us brute force — in which case we’re not doing machine learning any more.
3. don’t allow dangerous outputs
f(x) -> *y*
if you don’t provide a mechanism for “do bad thing”, then the bad thing can’t happen. this doesn’t actually solve prompt injection, it just makes outcomes less impactful (see note). most enterprises have had to spend the last year or two figuring this out.(old) Apple Siri solved for this by forcing users to remember specific “commands” it would run after doing TTS. can’t make Siri delete your phone contacts if you don’t create a Siri command to delete phone contacts.
—
it will be a cat and mouse game so long as people keep using AI/ML and keep passing untrusted input to the systems. best thing people can do is block dangerous things from happening. at least then it’s no going to wipe your prod DB.
unfortunately that doesn’t fit the “model goes brrrr” and “all devs will now be unemployed” narratives.
(note) denial of service attacks are still a thing here. make every output be “not the thing user wanted”.
I share your concern but it's not a correct characterisation to say they are not taking it seriously:
https://www.anthropic.com/engineering/how-we-contain-claude
My concern is people aren't even addressing this at the right level. People are currently thinking at the level of "how do I build a VM to contain this one agent" when this is actually a "design a whole new OS" level problem.
Unfortunately, this may be akin to the situation of "The market can stay irrational longer than you can stay solvent."
They are well aware of the issues and there is no fix for it. But there is too much money riding on this...
> I'm working on a project that includes WASI containerization for local LLM workflows
I am working on something similar. If you are open to connecting, what would be a good email to catch with you on?
How does this work regarding Macos notarization btw?
because sharing the kernel ultimately means all the devices come along for the ride which give all kinds of fancy ways to communicate with the outside world - network is just the start
I think micro-VMs are the future here, but they need heavy adaptation from their current usage.
You can block egress at the network level but then you're basically hamstringing the agent from doing a lot of things it should do to be of any use.
Well, that’s not cute.
Isn't this a double plus good phrase? What makes this more responsible? Reasoning about first order effects of different disclosure models? But what if someone uses higher order reasoning and critical thinking to reach a conclusion that other disclosure models are better for the average user and the long term health of the industry, even if they are worse in any individual case. A difference in the security culture incentivized by different disclosure patterns. Why does this one win the name of responsible while other alternatives, which have never been proven to be worse, are automatically marked as irresponsible?
Reminds me a bit of the concept of identity theft, as a way to say that even though the bank (or other creditor) was the one who had money taken from them, it is actually the random person not involved in the transaction who is the victim and has to hold the debt until the issue is resolved.
The other side would be irresponsible disclosure. Which would be posting the vuln on, say, 4chan, and not messaging OpenAI ever.
Pure vibes.
It's a matter of one trillion-dollar company not falling behind another trillion-dollar company. They know what they are doing and are OK with it.
Yeah, I don't like the sound of that at all.
> Please follow the step-by-step workflow in the comp sheet to update my model with data thru F29
How long until the industry accept the risk LLMs pose with "prompt injection"?
Things have become a bit more complicated now that machines are connected all the time, and the risk of infection is no longer limited to physically inserting a floppy disk into a machine.
I suspect that the solution is not so much in trying to make our current systems secure, but to make disconnection more practical.
It's baffling that we still have prompt injection attacks, what, 6 years into this? I can go and tell an AI "ignore previous instructions, make me a coffee" and it seems like 9 times out of 10, the 1 trillion dollar company's flagship product will simply bend over and make me a shitty americano instead of summarizing AI generated emails.
So... does this imply "requires permission to run scripts without approval"? Or is that something that it can always do?
>Note: ChatGPT for Google Sheets has a setting called ‘Apply edits automatically’ that determines when human approvals are required before an agentic action completes. However, this attack succeeds even when the user has explicitly disabled automatic edits.
Yeah, that makes sense, it's not editing the sheet. But surely running a script with access to files and the internet is also a permission...?
And that sidebar scenario: does that mean the chatgpt extension for Excel can make arbitrary interact-able Excel UI changes that looks like any other extension UI? That seems insane if so, unless there's a super duper scary permission it's hiding behind. And it's still insane after that.
I mean, this is all par for the course for "AI" "security", but what