LocalLLaMA

A community for discussing LLaMA, the large language model created by Meta AI.

This is intended to be a replacement for r/LocalLLaMA on Reddit.

founded 1 year ago

Vicuna-33B-1-3-SuperHOT-8K-GPTQ (huggingface.co)

submitted 1 year ago by notfromhere@lemmy.one to c/localllama@sh.itjust.works

[–] notfromhere@lemmy.one 2 points 1 year ago (2 children)

Has anyone gotten 8K context with a 33B model on a 4090?

[–] actuallyacat@sh.itjust.works 3 points 1 year ago (1 children)

Unfortunately there's just no way. KV cache size scales linearly with context length (it's the attention score matrix that grows with the square), so at 8K it's 4 times larger than at 2K. For 33B with an fp16 cache that's roughly 12GB for the cache alone, on top of the weights and other buffers.
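For a rough sense of the numbers, here is a back-of-envelope estimate of KV cache size. It is only a sketch: the LLaMA 33B shape used here (60 layers, 52 attention heads, head dimension 128) and the fp16 cache are assumptions, and real runtimes add their own buffers on top. Under those assumptions the cache grows in direct proportion to context length.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Approximate KV cache size: one key and one value vector per layer per token."""
    # The leading 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


GIB = 1024 ** 3

# Assumed LLaMA 33B shape: 60 layers, 52 heads, head_dim 128, fp16 cache.
cache_2k = kv_cache_bytes(60, 52, 128, 2048)
cache_8k = kv_cache_bytes(60, 52, 128, 8192)

print(f"2K context: {cache_2k / GIB:.2f} GiB")  # 3.05 GiB
print(f"8K context: {cache_8k / GIB:.2f} GiB")  # 12.19 GiB
print(f"ratio: {cache_8k / cache_2k:.0f}x")     # 4x
```

Add the 4-bit weights on top of the 8K figure and a 24GB card is over budget, which matches the out-of-memory reports in this thread.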

[–] notfromhere@lemmy.one 1 points 1 year ago (1 children)

Wow that’s crazy. Would it be possible to offload the KV cache onto system RAM and keep model weights in VRAM or would that just slow everything down too much? I guess that’s kind of what llama.cpp does with GPU offload of layers? I’m still trying to figure out how this stuff actually works.

[–] actuallyacat@sh.itjust.works 3 points 1 year ago

That's what llama.cpp and koboldcpp do. The KV cache is the last thing that gets offloaded, so you can offload the weights to the GPU and keep the cache in system RAM. Neither supports SuperHOT right now, though.

Multi-query attention (MQA) models like Falcon-40B are going to be better for large context lengths. They share key/value heads across query heads, so the KV cache is tiny and even at 8K context it's not a problem.
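To illustrate why fewer key/value heads shrink the cache, here is a small comparison. The model shapes are assumptions for illustration only (a LLaMA 33B-like model with 52 KV heads of dimension 128, and a Falcon-40B-like model with 8 shared KV heads of dimension 64, both with 60 layers and an fp16 cache). The 8K context length is hypothetical for the second model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Approximate KV cache size; only the KV heads count, not the query heads."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


GIB = 1024 ** 3
CONTEXT = 8192

# Assumed shapes, not taken from the thread.
full_mha = kv_cache_bytes(n_layers=60, n_kv_heads=52, head_dim=128, context_len=CONTEXT)
shared_kv = kv_cache_bytes(n_layers=60, n_kv_heads=8, head_dim=64, context_len=CONTEXT)

print(f"52 KV heads: {full_mha / GIB:.2f} GiB")   # 12.19 GiB
print(f" 8 KV heads: {shared_kv / GIB:.2f} GiB")  # 0.94 GiB
print(f"reduction: {full_mha / shared_kv:.0f}x")  # 13x
```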

[–] simple@lemmy.mywire.xyz 2 points 1 year ago (1 children)

I tried with WizardLM uncensored, but 8K seems to be too much for a 4090; it runs out of VRAM and dies.

I also tried with just 4K, but that doesn't seem to work either.

When I run it with 2K, it doesn't crash but the output is garbage.

[–] notfromhere@lemmy.one 2 points 1 year ago (1 children)

I hope llama.cpp supports SuperHOT at some point. I never use GPTQ but may need to make an exception to try out the larger context sizes. Are you using exllama? Curious why you’re getting garbage output.

[–] simple@lemmy.mywire.xyz 1 points 1 year ago (1 children)

Yeah, llama.cpp with SuperHOT support would be great, and yes, I'm using exllama with the oobabooga UI. I found out why I'm getting garbage output at 2K: it seems SuperHOT 8K models have a massive increase in perplexity when run with 2K context.

(The higher the perplexity, the worse the output quality.)
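For anyone unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood the model assigns to each token. A minimal sketch, assuming you already have per-token natural-log probabilities from some evaluation run (the function name and the sample values are made up for illustration):

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood per token (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# A model that gives every token probability 0.25 is as uncertain as a
# uniform choice among 4 options, so its perplexity is about 4.
print(perplexity([math.log(0.25)] * 4))
```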

So I'll need to figure out if I can get at least 4K running without running out of VRAM.

Also, there is a new PR for exllama that uses a different method of getting higher context (not SuperHOT) and has a smaller perplexity increase. So that might be a better alternative.

[–] notfromhere@lemmy.one 1 points 1 year ago (1 children)

I read the guy’s blog post on SuperHOT and it sounded like it didn’t increase perplexity and kept perplexity super low with large contexts. I could have read it wrong but I thought it wasn’t supposed to increase perplexity.

[–] simple@lemmy.mywire.xyz 2 points 1 year ago

The increase in perplexity is very small, but there is still some with 8K context. With 2K, though, it seems to be much larger. I could be misunderstanding something myself, but my little test with 2K context does suggest there's something going on with 2K contexts on SuperHOT models.
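One thing that may be relevant here: as I understand it, SuperHOT 8K models are fine-tuned with rotary position indices scaled down by 0.25 (2048 / 8192), so they expect the loader to apply the same scaling at inference. The sketch below only illustrates that idea; `rope_angles` is a hypothetical helper, the base of 10000 and head dimension of 128 are assumptions, and whether a mismatched scale explains the 2K results above is a guess rather than something tested.

```python
import math


def rope_angles(position, head_dim, scale=1.0, base=10000.0):
    """Rotation angles RoPE would apply at one position.

    scale < 1 compresses positions (linear position interpolation).
    """
    scaled_pos = position * scale
    return [scaled_pos * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


# With scale 0.25, position 8188 gets the same angles that position 2047
# gets without scaling, so 8K positions stay inside the original 2K range.
compressed = rope_angles(8188, head_dim=128, scale=0.25)
original = rope_angles(2047, head_dim=128)
print(all(math.isclose(a, b) for a, b in zip(compressed, original)))  # True
```

If the model was tuned on compressed positions but the loader runs it with no scaling, every token sits at a position the fine-tune never saw in that form, which could plausibly hurt quality even at short context.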