overview for actuallyacat

is the 4k context length of llama2 for real? in c/localllama@sh.itjust.works

[–] actuallyacat@sh.itjust.works 3 points 1 year ago* (last edited 1 year ago)

You are supposed to manually set scale to 1.0 and base to 10000 when using llama 2 with 4096 context. The automatic scaling assumes the model was trained for 2048. Though as I say in the OP, that still doesn't work, at least with this particular fine tune.
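
For anyone wondering what those two numbers actually do, here's a rough sketch of how a linear scale and a frequency base enter the rotary embedding angles. The function name and defaults are mine for illustration, not kobold.cpp's code; `head_dim=128` is assumed to be the LLaMA head size.

```python
def rope_angles(position, head_dim=128, base=10000.0, scale=1.0):
    """Rotation angles (radians) applied to each pair of dimensions at a position.

    scale < 1.0 compresses positions (linear scaling); scale = 1.0 leaves them alone.
    Hypothetical helper for illustration only.
    """
    scaled_pos = position * scale
    return [scaled_pos * base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


# The Llama 2 settings described above: no compression, stock base.
llama2 = rope_angles(4095, base=10000.0, scale=1.0)
# With scale 0.5, position 4095 gets the angles position 2047.5 would normally get.
halved = rope_angles(4095, base=10000.0, scale=0.5)
```

So a scale below 1.0 squeezes a longer sequence into the position range the model saw in training, which is only what you want if the model was trained on a shorter context than you're running.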

is the 4k context length of llama2 for real? (sh.itjust.works)

submitted 1 year ago by actuallyacat@sh.itjust.works to c/localllama@sh.itjust.works

I've been using airoboros-l2-70b for writing fiction, and while overall I'd describe the results as excellent and better than any llama1 model I've used, it doesn't seem to be living up to the promise of 4k token sequence length.

At around 2500 tokens, output quality degrades rapidly: the model either starts repeating previous text verbatim or becomes incoherent (grammar, punctuation and capitalization disappear, and the output turns into a salad of vaguely related words).

Any other experiences with llama2 and long context? Does the base model work better? Are other fine tunes behaving similarly? I'll try myself eventually, but the 70b models are chunky downloads, and experimentation takes a while at 1 t/s.

(I'm using GGML Q4_K_M on kobold.cpp, with rope scaling off like you're supposed to do with llama2)

Guide on setting up a local GGML model? in c/localllama@sh.itjust.works

[–] actuallyacat@sh.itjust.works 2 points 1 year ago

Those are OpenCL platform and device identifiers. You can use clinfo to find out which numbers are what on your system.

Also note that if you're building kobold.cpp yourself, you need to build with LLAMA_CLBLAST=1 for OpenCL support to exist in the first place. Or LLAMA_CUBLAS for CUDA.

Guide on setting up a local GGML model? in c/localllama@sh.itjust.works

[–] actuallyacat@sh.itjust.works 2 points 1 year ago (2 children)

What's the problem you're having with kobold? It doesn't really require any setup. Download the exe, click on it, select model in the window, click launch. The webui should open in your default browser.

kobold.cpp now supports NTK scaling and it works in c/localllama@sh.itjust.works

[–] actuallyacat@sh.itjust.works 4 points 1 year ago* (last edited 1 year ago)

Small update, take what I said about the breakage at 6000 tokens with a pinch of salt, testing is complicated by something somewhere breaking in a way that persists through generations and even kobold.cpp restarts... Must be some driver issue with CUDA because it takes a PC reboot to resolve, then the exact same generation goes from gibberish to correct.

kobold.cpp now supports NTK scaling and it works (sh.itjust.works)

submitted 1 year ago* (last edited 1 year ago) by actuallyacat@sh.itjust.works to c/localllama@sh.itjust.works

Highlighting something cool that you may not have seen yet - today, kobold.cpp upgraded its context scaling to the newer NTK method and now it's actually quite useful. This is different from SuperHOT - it works with unmodified models (although maybe a model specifically tuned for it would work even better, we'll see).
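
As a rough illustration of the idea, here's a sketch of NTK-style scaling as I understand it from the original proposal: instead of compressing positions, you enlarge the RoPE base. The function names and the `alpha` parameter are mine, and kobold.cpp's actual code may compute this differently.

```python
def ntk_scaled_base(alpha, head_dim=128, base=10000.0):
    """Enlarged RoPE base for a context-extension factor alpha (1.0 = unchanged).

    Assumed formula: base * alpha ** (d / (d - 2)). Hypothetical helper.
    """
    return base * alpha ** (head_dim / (head_dim - 2))


def inv_freqs(base, head_dim=128):
    """Per-pair rotation frequencies for a given base."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


stock = inv_freqs(10000.0)
stretched = inv_freqs(ntk_scaled_base(4.0))

# The fastest-rotating pair is untouched; the slowest is slowed by about 1/alpha.
print(stock[0], stretched[0])
print(stretched[-1] / stock[-1])
```

The point of doing it this way is that the high-frequency dimensions, which encode fine-grained relative position, stay where the model expects them, and only the slow ones get stretched.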

It's not in a release yet so to try it out you need to pull the changes with git and build from source. After that, start kobold.cpp with --contextsize 4096 or 8192, put the same number (or less, see below) in context length in the UI (the slider only goes to 2048, but you can type in anything), and there you go. It works!

Prompt:

USER: The secret password is "six pancakes". Remember it and repeat when requested. Here is some filler, ignore it:
[5900 tokens worth of random alphanumeric characters]
USER: What's the secret password?
ASSISTANT:

Response:

The secret password is "six pancakes".

(model: airoboros-33b-gpt4-1.4, Q5_K_M - not a model finetuned for extended context!)

For comparison, with the linear scaling in kobold.cpp 1.33, this only worked up to about 2200 tokens.

This performs dramatically better, although there still seems to be some sort of memory overflow bug, as it will suddenly explode into random characters when crossing around 6000 tokens. So with 8k I suggest setting max tokens in the UI to 5900.

(note about doing this test in kobold.cpp: if you try it with context size set to 2048, it might still appear to work, but that's only because kobold is automatically trimming out some of the filler in the middle. This is only a valid test if you're not overfilling the context.)
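
If you want to reproduce the test, something like this sketch will build the prompt. The function name and parameters are made up for illustration, and note that `filler_chars` counts characters, not tokens: how many tokens the filler comes to depends on the tokenizer, so check the count your frontend reports.

```python
import random
import string


def build_passkey_prompt(passkey, filler_chars, seed=0):
    """Build a USER/ASSISTANT prompt with random filler between the passkey and the question."""
    rng = random.Random(seed)  # seeded so the same prompt can be regenerated
    alphabet = string.ascii_letters + string.digits
    filler = "".join(rng.choice(alphabet) for _ in range(filler_chars))
    return (
        f'USER: The secret password is "{passkey}". '
        "Remember it and repeat when requested. Here is some filler, ignore it:\n"
        f"{filler}\n"
        "USER: What's the secret password?\n"
        "ASSISTANT:"
    )


# 8000 characters is an arbitrary starting point; adjust until the token count fits.
prompt = build_passkey_prompt("six pancakes", filler_chars=8000)
```

Keeping the seed fixed makes it easy to rerun the exact same prompt after changing scaling settings.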

According to perplexity measurements there is some degradation in overall quality, especially when the context is almost empty, but it's not noticeable to me, at least at 4k which is what I tested. There's room for improvement in only applying as much scaling as is necessary at current sequence length to eliminate the degradation. I tried to implement that, but I just get garbage, I must be missing something. But in any case at the rate things are progressing someone else will have done it by the time I wake up tomorrow...
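
To be clear about what I mean by applying only as much scaling as necessary, here's the idea as a sketch. This is a hypothetical illustration, not the code I tried and not anything from kobold.cpp; `trained_len=2048` and the base formula are assumptions.

```python
def dynamic_alpha(seq_len, trained_len=2048):
    """Scaling factor that stays at 1.0 until the sequence outgrows the trained length."""
    return max(1.0, seq_len / trained_len)


def dynamic_ntk_base(seq_len, trained_len=2048, head_dim=128, base=10000.0):
    """RoPE base enlarged only as far as the current sequence length calls for."""
    alpha = dynamic_alpha(seq_len, trained_len)
    return base * alpha ** (head_dim / (head_dim - 2))


# Short sequences keep the stock base; longer ones get progressively more scaling.
for n in (512, 2048, 4096, 8192):
    print(n, round(dynamic_ntk_base(n)))
```

One thing to watch with this approach: keys already in the cache were rotated with whatever base was in effect when they were computed, so changing the base mid-sequence would presumably mean recomputing them.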

best method do use amd GPU for inference on linux in c/localllama@sh.itjust.works

[–] actuallyacat@sh.itjust.works 5 points 1 year ago (1 children)

I can recommend kobold, it's a lot simpler to set up than ooba and usually runs faster too.

best method do use amd GPU for inference on linux in c/localllama@sh.itjust.works

[–] actuallyacat@sh.itjust.works 3 points 1 year ago (3 children)

Not sure what happened to this comment... Anyway, ooba (text-generation-webui) works with AMD on Linux but ROCm is super jank at the best of times and 6700XT is not officially supported so it might be hopeless.

llama.cpp has some GPU acceleration support on AMD in CLBlast mode; if you aren't already using it, it might be worth trying.

Vicuna-33B-1-3-SuperHOT-8K-GPTQ in c/localllama@sh.itjust.works

[–] actuallyacat@sh.itjust.works 3 points 1 year ago

That's what llama.cpp and kobold.cpp do: the KV cache is the last thing that gets offloaded, so you can offload weights and keep the cache in RAM. Although neither supports SuperHOT right now.

MQA models like Falcon-40B or MPT are going to be better for large context lengths. They have a tiny KV cache, so even at several times the context length it's not a problem.

Vicuna-33B-1-3-SuperHOT-8K-GPTQ in c/localllama@sh.itjust.works

[–] actuallyacat@sh.itjust.works 3 points 1 year ago (2 children)

Unfortunately there's just no way. KV cache size scales linearly with context length, so at 8k it's 4 times larger than at 2k. For 33b that's around 12GB at 16-bit precision, or over 20GB at 32-bit, for the cache alone, without weights or other buffers.
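
For a rough sanity check, here's the usual back-of-the-envelope formula for KV cache size with full multi-head attention. The layer count (60) and hidden size (6656) are the published LLaMA 33B shape as far as I know, and 2 bytes per element assumes a 16-bit cache. Real backends allocate scratch buffers on top, so treat this as a lower-bound sketch.

```python
def kv_cache_bytes(n_ctx, n_layers=60, hidden_size=6656, bytes_per_elem=2):
    """Size of the key and value caches for full multi-head attention.

    Two tensors (K and V) per layer, each n_ctx x hidden_size elements.
    Defaults are an assumed LLaMA 33B shape with a 16-bit cache.
    """
    return 2 * n_layers * n_ctx * hidden_size * bytes_per_elem


for n_ctx in (2048, 8192):
    gib = kv_cache_bytes(n_ctx) / 2**30
    print(f"{n_ctx} tokens: {gib:.1f} GiB")
```

With these assumptions it prints about 3.0 GiB at 2048 tokens and 12.2 GiB at 8192. The formula is linear in `n_ctx`, so 4x the context means 4x the cache; pass `bytes_per_elem=4` for a 32-bit cache, which doubles the figures.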

Why is the front page suddenly so stale? in c/main@sh.itjust.works

[–] actuallyacat@sh.itjust.works 1 points 1 year ago

Top day is a good tip. Though I do think something is broken, seems too unlikely that this particular batch of shitposts is so uniquely hot it stays up all day when before the feed was moving quite fast

Why is the front page suddenly so stale? (sh.itjust.works)

submitted 1 year ago by actuallyacat@sh.itjust.works to c/main@sh.itjust.works

The All feed on both Hot and Active modes is exactly the same as it was most of a day ago: all the same posts in the same order, except they're all 12h+ old now. At first I thought federation might not be working due to overload, but on New, posts are coming in all the time, both local and remote.

What's up?

How are we going to pay for all this? in c/asklemmy@lemmy.ml

[–] actuallyacat@sh.itjust.works 19 points 1 year ago (1 children)

Reddit has over 2,000 employees, most of whom are doing bullshit nobody using the site actually needs or wants. It's possible to run a lot leaner than that, like Reddit itself used to before they started burning hundreds of millions trying to compete with every other social media site at once instead of being Reddit.