As in the title. I know that the word jailbreak comes from rooting Apple phones or something similar. But I am not sure what can be gained from jailbreaking a language model.

Will it be able to say "I can't do that, Dave" instead of hallucinating?
Or will it only start spewing less sanitary responses?

[–] INeedMana@lemmy.world 0 points 1 year ago (2 children)

So there's probably little to be gained from jailbreaking on HuggingFace chat?

[–] deavid@lemmy.world 1 points 1 year ago

So far, most models on HuggingFace are also "censored", so maybe something can be gained. But there are also "uncensored" models over there that can be used instead.

[–] Blaed@lemmy.world 1 points 1 year ago* (last edited 1 year ago)

Kind of like deavid mentioned, I think the 'jailbreak' behavior you're describing is what you get from the uncensored models. There are no 'guardrails' on those, so you can get them to say whatever you want without them defaulting to an answer like "As an AI model I..."

In a way, the 'uncensored' versions are pre-jailbroken, so you can fine-tune or train them on your own custom data without running into those guardrails I mentioned. For what it's worth, you can be the one to set up your own guardrails too. These uncensored models are totally unlocked in that sense.
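To give a rough idea of what "setting up your own guardrails" could look like, here's a minimal sketch in Python. Everything in it is hypothetical: `with_guardrail`, `echo_model`, and the blocklist are made up for illustration, and the stand-in function takes the place of a real model call. Real guardrails are usually a lot more involved than a keyword check.

```python
from typing import Callable

REFUSAL = "Sorry, I can't help with that."


def with_guardrail(
    generate: Callable[[str], str], blocked_terms: set[str]
) -> Callable[[str], str]:
    """Wrap a text generator so blocked terms in the prompt or reply trigger a refusal."""
    lowered = {term.lower() for term in blocked_terms}

    def guarded(prompt: str) -> str:
        # Check the prompt before calling the model at all.
        if any(term in prompt.lower() for term in lowered):
            return REFUSAL
        reply = generate(prompt)
        # Check the model's reply before handing it back.
        if any(term in reply.lower() for term in lowered):
            return REFUSAL
        return reply

    return guarded


def echo_model(prompt: str) -> str:
    # Hypothetical stand-in for a real (uncensored) model call.
    return f"You said: {prompt}"


chat = with_guardrail(echo_model, {"secret"})
```

Calling `chat("hello")` passes straight through to the stand-in model, while any prompt or reply containing a blocked term gets the refusal string instead. The point is only that with an uncensored model, the filtering layer is yours to define.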

HuggingFace chat is another chat-style model that the folks at HuggingFace set up with their own safeguards and parameters. You can definitely try jailbreaking it with prompts, but if you're looking to chat with a model that doesn't refuse to output a certain word or phrase, then the uncensored models are probably what you're looking for. You won't need to jailbreak those with prompts. They'll output all kinds of crazy stuff, which is why you don't see typical public hosting for these types of uncensored models.

A few that you can download that people are running today are any of the uncensored Wizard or LLaMA-based models like Wizard-Vicuna-7B.

If you want something not based on Meta's LLaMA (something you can use commercially), I suggest exploring some of KoboldAI's models, which work pretty well out-of-the-box for casual chat / Q&A. There are also a ton of emerging MPT-based models that are commercially licensable, but like any bleeding-edge technology, they will have their faults.

It's important to note that the coherency of these smaller models is very different from ChatGPT's, but tuning them to specific needs seems to be quite effective. At the moment, the quality of your dataset is more important than quantity. This goes for both censored and uncensored versions.

If you're running a typical consumer-grade GPU, I suggest sticking to the 6B-parameter models as a starting point, moving up from there based on performance and preference. Download and chat with these at your own risk - I am not responsible for anything you do with this technology. Do your best to understand the risks before you start, so you don't end up crashing your PC or getting into a conversation you weren't prepared for.
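For a back-of-the-envelope check on why ~6B is a sensible starting point, here's a small Python sketch that estimates how much memory the weights alone take up. The `weight_memory_gib` helper is made up for illustration, and the bytes-per-parameter table assumes the usual storage sizes for each precision. Actual usage will be higher once you add activations, context cache, and framework overhead, so treat the numbers as a lower bound.

```python
# Rough estimate of the memory needed just to hold a model's weights.
# Assumption: the usual storage size per parameter for each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(n_params: float, precision: str = "fp16") -> float:
    """Return the approximate size of the weights in GiB."""
    return n_params * BYTES_PER_PARAM[precision] / 2**30


if __name__ == "__main__":
    # Print the estimate for a 6B-parameter model at each precision.
    for precision in BYTES_PER_PARAM:
        size = weight_memory_gib(6e9, precision)
        print(f"6B @ {precision}: {size:.1f} GiB")
```

At 16-bit precision a 6B model works out to roughly 11 GiB of weights, which is already tight on many consumer cards. That's why quantized (8-bit or 4-bit) versions are popular for local use.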

I'll be doing a post on model availability soon, but hopefully this answers your question 'till then.