Use a cheap, purpose-built LLM like OpenAI's free moderation endpoint to classify the text, then send the original text plus the classification to clients. Let clients choose what to do with it, with opinionated defaults appropriate to the app.
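To make that concrete, here's a rough sketch of what the client side could look like. Everything in it is made up for illustration: the `ClassifiedMessage` shape, the `ClientPolicy` fields, and the 0.5 / 0.9 thresholds are my assumptions, not any real API. It only assumes the server attaches a per-category score between 0 and 1 to each message, whichever classifier produced it.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    SHOW = "show"
    COLLAPSE = "collapse"  # shown only after the user clicks through
    HIDE = "hide"


@dataclass
class ClassifiedMessage:
    text: str  # the original, unmodified text
    # category name -> score in [0, 1], attached by the server (hypothetical shape)
    scores: dict[str, float]


@dataclass
class ClientPolicy:
    # Opinionated defaults picked by the app; the values here are placeholders.
    collapse_at: float = 0.5
    hide_at: float = 0.9
    # Categories this particular user has chosen not to filter at all.
    ignored: set[str] = field(default_factory=set)


def decide(message: ClassifiedMessage, policy: ClientPolicy) -> Action:
    """Pick an action from the highest score among non-ignored categories."""
    relevant = [s for c, s in message.scores.items() if c not in policy.ignored]
    worst = max(relevant, default=0.0)
    if worst >= policy.hide_at:
        return Action.HIDE
    if worst >= policy.collapse_at:
        return Action.COLLAPSE
    return Action.SHOW
```

The point of the split is that the server never drops or rewrites anything; it only annotates. A user who wants to see everything can raise the thresholds or ignore a category, and everyone else just gets the defaults.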
Maybe you also need to identify persistent bad actors rather than acting only on individual pieces of content. Even then, let clients decide what to do with that information.
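One way the bad-actor part might work, as a sketch only: the server keeps a rolling window of recent classifications per author and ships a "flagged ratio" to clients alongside each message. The `AuthorHistory` class, the window size of 50, and the 0.5 cutoff are all hypothetical choices of mine.

```python
from collections import defaultdict, deque


class AuthorHistory:
    """Tracks what share of each author's recent messages were flagged."""

    def __init__(self, window: int = 50, flag_at: float = 0.5):
        self.flag_at = flag_at
        # author -> booleans for their last `window` messages (True = flagged)
        self._recent = defaultdict(lambda: deque(maxlen=window))

    def record(self, author: str, worst_score: float) -> None:
        # worst_score is the highest category score for one message
        self._recent[author].append(worst_score >= self.flag_at)

    def flagged_ratio(self, author: str) -> float:
        recent = self._recent.get(author)
        if not recent:
            return 0.0  # no history yet, so nothing to report
        return sum(recent) / len(recent)
```

The server would only report the ratio. Whether 0.3 means "mute this person" or "show a warning" stays a client setting, same as with per-message scores.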
I suppose my thinking is that a project like this needs strong, automatic default moderation that's invisible to offenders in order to offer users a welcoming experience, but putting that power in an LLM and fixed filter lists feels very wrong. So my thought is to use those tools to give the power to the client instead. But maybe that makes no difference if nobody changes the settings away from the defaults anyway.