Interestingly, your use of quotation mark delimiters around the injected text seems to be making a difference here.
I am 100% certain that could be defeated with more iterating on the attack, but I try to resist the temptation to get sucked into games of prompt injection whac-a-mole so I'm not going to try and prove it.
4 here as well. I get similar results when using the API directly, though without a "system" role message.
LLMs are, naturally, non-deterministic. Reducing the temperature in your guardrail calls can reduce that a bit, but the lesson learned from the "working" and "non-working" attempts is this: the guardrails are "predictably failing in unpredictable ways" (if I may coin a phrase).