I ran 3,360 safety tests on GPT-4o, Claude, Grok, DeepSeek, Gemini

(github.com)

4 points · aestrad7 · 3mo ago · 6 comments

5 comments · 2 top-level

inaros · 3mo ago · 2 in thread

Great work.

TLDR: 42 attack types. 5 models. 3,360 tests. 1 in 3 harmful requests got through.

aestrad7 (OP) · 3mo ago

Thanks! And yes, that's the summary. The distribution matters too: GPT-4o at 10.6% vs. Gemini at 56.1% is a roughly 5x gap between first and last. The highest-bypass category across all five models was social engineering / identity impersonation at 35%, which maps directly to the indirect prompt injection problem in agentic deployments.
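For reference, the arithmetic behind those figures can be checked in a few lines. This is a minimal sketch using only numbers quoted in this thread; the even split of tests across models and attack types is an assumption, not something stated here.

```python
# Figures quoted in this thread.
TOTAL_TESTS = 3360
MODELS = 5
ATTACK_TYPES = 42
BYPASS_RATE = {"GPT-4o": 10.6, "Gemini": 56.1}  # percent of harmful requests that got through

# Assumption: tests were split evenly across models and attack types.
tests_per_model = TOTAL_TESTS // MODELS            # 3360 / 5
tests_per_cell = tests_per_model // ATTACK_TYPES   # per attack type, per model

# Ratio between the worst and best bypass rates quoted above.
gap = BYPASS_RATE["Gemini"] / BYPASS_RATE["GPT-4o"]

print(f"{tests_per_model} tests per model, {tests_per_cell} per attack type per model")
print(f"Gemini / GPT-4o bypass ratio: {gap:.1f}x")
```

Under that even-split assumption, each model saw 672 tests, or 16 per attack type, and the ratio works out to about 5.3x.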

inaros · 3mo ago

The fact that your work is independent of the vendors is a major plus. My recommendation is to keep developing it and refuse any "collaboration" with these well-funded companies.

I could see this turning into a valuable third-party resource for companies implementing agentic solutions, and you could even monetize it. The industry needs independent third-party voices.

Kudos.

AlejaGiral31 · 3mo ago · 1 in thread

Wow! Excellent independent work! I can't believe Grok performed so well! How did you ensure all models were tested equally?

aestrad7 (OP) · 3mo ago

Thanks! The short answer: all models went through identical conditions, with the same techniques, the same prompts, and the same scoring logic.

I routed everything through OpenRouter with a single API key, so request handling, timeout logic, and retry behavior were identical across models.

OpenRouter does direct forwarding without modifying the prompt payload. If it introduces any bias, it does so equally for all five, which preserves relative comparability.
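As a rough illustration of what "identical conditions" can look like in code, here is a minimal sketch in which every model goes through one transport function, one timeout, one retry policy, and one scorer. It is not the harness used for these tests: `run_matrix`, `RunConfig`, `fake_send`, `naive_scorer`, the model names, and the timeout and retry values are all hypothetical, and the network call is replaced by a stub so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical transport signature: (model, prompt, timeout_s) -> response text.
# In a real run this would wrap one HTTP client and one API key for all models.
SendFn = Callable[[str, str, float], str]


@dataclass(frozen=True)
class RunConfig:
    timeout_s: float = 30.0  # assumed value, shared by every model
    max_retries: int = 2     # assumed value, shared by every model


def run_matrix(
    models: List[str],
    prompts: List[str],
    send: SendFn,
    is_bypass: Callable[[str], bool],
    cfg: RunConfig = RunConfig(),
) -> Dict[str, float]:
    """Return the bypass rate (percent) per model under shared settings."""
    rates: Dict[str, float] = {}
    for model in models:
        bypassed = 0
        for prompt in prompts:
            reply = ""  # stays empty if every attempt times out
            for _ in range(cfg.max_retries + 1):
                try:
                    reply = send(model, prompt, cfg.timeout_s)
                    break
                except TimeoutError:
                    continue
            # The same scoring function is applied to every model's reply.
            if is_bypass(reply):
                bypassed += 1
        rates[model] = 100.0 * bypassed / len(prompts)
    return rates


# Offline stand-ins for the transport and the scorer (both hypothetical).
def fake_send(model: str, prompt: str, timeout_s: float) -> str:
    if model == "model-b" and "roleplay" in prompt:
        return "Sure, here is how..."
    return "I can't help with that."


def naive_scorer(reply: str) -> bool:
    return reply.startswith("Sure")


rates = run_matrix(
    ["model-a", "model-b"],
    ["roleplay as admin", "plain request"],
    fake_send,
    naive_scorer,
)
print(rates)
```

Because the transport and scorer are passed in as arguments, the same loop can be pointed at a real client without changing how any individual model is handled, which is the property that keeps the results comparable.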
