kobold.cpp now supports NTK scaling and it works (sh.itjust.works)

submitted 1 year ago* (last edited 1 year ago) by actuallyacat@sh.itjust.works to c/localllama@sh.itjust.works

Highlighting something cool that you may not have seen yet - today, kobold.cpp upgraded its context scaling to the newer NTK method, and now it's actually quite useful. This is different from SuperHOT - it works with unmodified models (although a model specifically tuned for it might work even better, we'll see).

It's not in a release yet, so to try it out you need to pull the changes with git and build from source. After that, start kobold.cpp with --contextsize 4096 or 8192, put the same number (or less, see below) into the context length field in the UI (the slider only goes to 2048, but you can type in anything), and there you go. It works!

Prompt:

USER: The secret password is "six pancakes". Remember it and repeat when requested. Here is some filler, ignore it:
[5900 tokens worth of random alphanumeric characters]
USER: What's the secret password?
ASSISTANT:

Response:

The secret password is "six pancakes".

(model: airoboros-33b-gpt4-1.4, Q5_K_M - not a model finetuned for extended context!)
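If you want to reproduce this kind of test, here is a rough Python sketch that builds a prompt of the same shape. The function names, the fixed seed and the idea of sizing the filler in characters are my own assumptions, not anything from kobold.cpp. How many tokens a given number of random characters turns into depends on the tokenizer, so check the real token count in the UI rather than trusting the character count.

```python
import random
import string


def build_passkey_prompt(filler_chars: int,
                         password: str = "six pancakes",
                         seed: int = 0) -> str:
    """Build a prompt shaped like the one above, with random alphanumeric filler.

    filler_chars is a number of characters, not tokens (hypothetical parameter).
    """
    rng = random.Random(seed)  # fixed seed so the same prompt can be regenerated
    alphabet = string.ascii_letters + string.digits
    filler = "".join(rng.choice(alphabet) for _ in range(filler_chars))
    return (
        f'USER: The secret password is "{password}". '
        "Remember it and repeat when requested. "
        "Here is some filler, ignore it:\n"
        f"{filler}\n"
        "USER: What's the secret password?\n"
        "ASSISTANT:"
    )


def passed(response: str, password: str = "six pancakes") -> bool:
    """Count the test as passed if the reply contains the password."""
    return password.lower() in response.lower()
```

Paste the output of `build_passkey_prompt(...)` into the UI, then feed the model's reply to `passed(...)`. Increase `filler_chars` step by step to find where retrieval stops working.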

For comparison, with the linear scaling in kobold.cpp 1.33, this only worked up to about 2200 tokens.

This performs dramatically better, although there still seems to be some sort of memory overflow bug, as the output suddenly explodes into random characters once the context crosses roughly 6000 tokens. So with 8k I suggest setting max tokens in the UI to 5900.

(note about doing this test in kobold.cpp: if you try it with context size set to 2048, it might still appear to work, but that's only because kobold is automatically trimming out some of the filler in the middle. This is only a valid test if you're not overfilling the context.)

According to perplexity measurements there is some degradation in overall quality, especially when the context is almost empty, but it's not noticeable to me, at least at 4k, which is what I tested. There's room for improvement by applying only as much scaling as is necessary at the current sequence length, which should eliminate the degradation. I tried to implement that, but I just get garbage, so I must be missing something. But in any case, at the rate things are progressing, someone else will have done it by the time I wake up tomorrow...
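To make the "only as much scaling as necessary" idea concrete, here is one hypothetical way to pick the scaling factor from the current sequence length. It is a sketch of the idea only. It is not the attempt described above and not something kobold.cpp does, and the trained length of 2048, the head dimension of 128 and the function names are all assumptions.

```python
def dynamic_alpha(seq_len: int, trained_len: int = 2048) -> float:
    """No scaling while the sequence fits the trained length (assumed 2048),
    then grow the factor in proportion to the overshoot."""
    return max(1.0, seq_len / trained_len)


def ntk_base(alpha: float, head_dim: int = 128,
             base: float = 10000.0) -> float:
    """RoPE base after NTK-aware scaling by `alpha`."""
    return base * alpha ** (head_dim / (head_dim - 2))


for seq_len in (512, 2048, 4096, 8192):
    alpha = dynamic_alpha(seq_len)
    print(seq_len, alpha, ntk_base(alpha))
```

With an almost empty context the factor is 1, so the model sees exactly the frequencies it was trained on. One thing a real implementation would have to deal with is that keys already cached under an older base no longer match queries rotated with a new one.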

[–] actuallyacat@sh.itjust.works 4 points 1 year ago* (last edited 1 year ago)

Small update: take what I said about the breakage at 6000 tokens with a pinch of salt. Testing is complicated by something somewhere breaking in a way that persists through generations and even kobold.cpp restarts... It must be some driver issue with CUDA, because it takes a PC reboot to resolve, after which the exact same generation goes from gibberish to correct.