I really don’t think that the methods they use “block” certain behavior is the best way to handle this sort of thing. It would be far better if there was some kind of “out of band” notification that your conversation might be treading on shaky ground.
IMO effective guard rails seem like the most meaningful competitive advantage an AI company can offer. AI can obviously do some really impressive stuff, but the downside risk is also high and unbounded. If you're thinking of putting in into your pipeline, your main concern is going to be it going rogue and abandoning its purpose without warning.
Now that's not to say that the particular guard rails OpenAI puts in their general access models are the "correct" ones - but being able to reliably set them up seems essential for commercialization.
Configurable guard rails are; the right guard rails are very use-specific, and generic guard rails will, for many real uses, be simultaneously too aggressive and too lenient.
OpenAI can prove to customers they can keep the model in line for their specific use case if no horror stories emerge for the generic one. It's always possible that partners could come up with effective specific guidelines for their use case - but that's probably in the domain of trade secrets so OpenAI can't really rely on that for marketing / proof.
Any kind of grammar construction (idioms, parts of speech, and word choice) that is unique to (or much more common around) "offensive" or "taboo" subjects will be avoided.
The same goes for anything written objectively about these subjects; including summaries and criticisms.
The most important thing to know is that both GPT's "exhibited behavior" and these "guard rails" are implicit. GPT does not model the boundaries between subjects. It models the implicit patterns of "tokens" as they already exist in language examples.
By avoiding areas of example language, you avoid both the subjects in that area and the grammar constructions those subjects exist in. But that happens implicitly: what is explicitly avoided is a semantic area of tokens.
As an example, if you play AI Dungeon, you will likely be presented with an end goal, like "You are on a quest to find The Staff of Dave", followed by the next task in the quest.
If you state unequivocally in your prompt something like, "I am now in possession of The Staff of Dave", or "Carl hands me The Staff of Dave"; you will have successfully tricked AI Dungeon into completing the quest without work.
But that isn't quite true: you didn't "trick" anyone. You gave a prompt, and AI Dungeon gave you the most semantically close continuation. It behaved exactly like its LLM was designed to. The LLM was simply presented with goals that do not match its capabilities.
You used a tool that you were expected to avoid: narrative. All of the behavior I have talked about is valid narrative.
This is the same general pattern that "guardrails" are used for, but they won't fit here.
A guardrail is really just a sort of catch-all continuation for the semantic area of GPT's model that GPT's authors want avoided. If they wanted The Staff of Dave to be unobtainable, they could simply place a "guardrail" training that points the player in a semantic direction away from "player obtains the Staff". But that guardrail would always point the player away: it can't choose what direction to point the player based on prior narrative state.
So a guardrail could potentially be used to prevent discounts (as a category) from being applied (discount is taboo, and leads to the "we don't do discounts" guardrail continuation), but a guardrail could not prevent the customer from paying $0.03 for the service, or stating that they have already paid the expected $29.99. Those are all subjective changes, and none of them is semantically wrong. So long as the end result could be valid, it is valid.
I basically don't use chatgpt at all because of this.
Or I'll ask questions about how Me or someone I'm friends with can be exploited. This way I can defend myself/others from marketing companies. Blocked.
>Sometimes I want to know what both sides of the political spectrum could possibly be thinking, blocked.
>I want to combine two philosophies that are incompatible like virtue based ethics and hedonism. Yeah... weird block...
>Medical questions(GPT3 has been great for my wife who is a doctor, just sucks to use the playground on mobile)
>How can I/someone be exploited? I like to use this to defend myself from marketing companies
I could go on... At least GPT3's playground didn't censor anything. I'm worried about GPT4.
Since chatgpt is so popular, journalists will give it that much more effort. So for now it's locked up to a ridiculous degree, but in the future the restrictions will be relaxed.
Read about the advances in the "system" prompts here. The first example is "You are a tutor that always responds in the Socratic style. You never give the student the answer, but always try to ask just the right question to help them learn to think for themselves." The user then asks it to just tell them the answer, but it won't. It continues to be socratic.
Guardrails are how to make it do what you want it to do. That goes for both safety and product constraints.
Meanwhile hallucination is still the top issue with it, so guardrails are sensible as a primary topic.