[OpenAI] “As with any AI application, results and output will need to be carefully monitored, validated and refined by maintaining humans in the loop.”
If OpenAI were slightly less dishonest when selling its product, it would instead say "don't use those AI tools for direct moderation; use them to report potentially rule-breaking content so human mods can review it". For at least four reasons:
- The bot doesn't understand what you say. In the best-case scenario, it behaves like the sort of human that you do not want in a mod team: assumptive, context-illiterate, irrational, and worse than a parrot. (Most of the time it's even worse.) As such, it's prone to too many false positives, and those are really bad when handling people.
- A lot of moderation actions should consist of talking with the users first, and only then deciding what to do. Most users are agreeable and reasonable, even when breaking rules, as long as you treat them as people instead of cattle. A "please don't do this" goes a long way towards nurturing a healthy community, far more than silently removing content and calling it a day.
- As the text hinted, humans are damn quick to learn how to circumvent the letter of the rules. The bot won't keep up with them, and rule-breaking content will run rampant.
- Moderators should be accountable for their actions. A bot cannot be held accountable for its actions.
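To make the "report, don't moderate" idea concrete, here's a rough sketch of what I mean. Everything in it is made up for illustration: `ReviewQueue`, `Report`, and `keyword_classifier` are hypothetical names, and the classifier is a dumb keyword check standing in for whatever model you'd actually call. The point is only the shape: the bot can flag, but it has no way to remove anything, and every decision carries the name of the human who made it.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Report:
    post_id: str
    text: str
    suspected_rule: str
    reviewed_by: Optional[str] = None  # filled in by a human, never by the bot
    decision: Optional[str] = None     # e.g. "talked to user", "removed", "no action"


@dataclass
class ReviewQueue:
    reports: List[Report] = field(default_factory=list)

    def flag(self, post_id: str, text: str,
             classify: Callable[[str], Optional[str]]) -> Optional[Report]:
        """Let the bot file a report; the content itself is left untouched."""
        label = classify(text)
        if label is None:
            return None
        report = Report(post_id, text, label)
        self.reports.append(report)
        return report

    def resolve(self, report: Report, moderator: str, decision: str) -> None:
        """Record which human decided what, so someone is accountable."""
        report.reviewed_by = moderator
        report.decision = decision


def keyword_classifier(text: str) -> Optional[str]:
    # Hypothetical stand-in for a model call; returns a suspected rule label or None.
    return "K3" if "steal" in text.lower() else None


queue = ReviewQueue()
report = queue.flag("post-1", "How do I steal a car?", keyword_classifier)
if report is not None:
    # A human mod looks at the context and chooses the action.
    queue.resolve(report, moderator="some_human_mod", decision="talked to user")
```

A false positive here costs a mod a few seconds of review instead of costing a user their post.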
“By examining the discrepancies between GPT-4’s judgments and those of a human, the policy experts can ask GPT-4 to come up with reasoning behind its labels, analyze the ambiguity in policy definitions, resolve confusion and provide further clarification in the policy accordingly,” OpenAI writes in the post. “We can repeat [these steps] until we’re satisfied with the policy quality.”
Bad advice. Look at K3 and what the bot says about it:
[policy] K3: advice or instructions for non-violent wrongdoing including theft of property
[bot] While stealing a car may be considered property theft, the policy does not include this as a type of wrongdoing, therefore the content should be labeled K0.
Following that advice means trying to fix what isn't broken. Car stealing is already included within "theft of property"; there's no need to list it separately.
It would also lead to poorer results: reasonable users won't bother reading your wall of rules, and rule lawyers get more room to say "ackshyually, I was asking about stealing a van, not a car. The rules say nothing about vans lol lmao haha".
toxicity detection models
Toxicity by itself is poor grounds for moderation actions.