Generative Artificial Intelligence

Quantized model issues (sopuli.xyz)

submitted 11 months ago by SpiderShoeCult@sopuli.xyz to c/gai@sopuli.xyz

Hey, so first off, this is my first time dabbling with LLMs, and most of what I know I found myself by rummaging through GitHub repos.

I have a fairly modest setup: an older gaming laptop with an RTX 3060 video card with 6 GB of VRAM. I run everything inside WSL2.

I have had some success running FastChat with the Vicuna 7B model, but it's extremely slow: roughly 1 word every 2-3 seconds of output, and that's with --load-8bit, since without it I get a CUDA OOM error. It starts faster, at 1-2 words per second, but slows to a crawl later on (I suspect that's because it also uses a bit of the 'Shared video RAM', according to Task Manager).

So I heard about quantization, which is supposed to compress models at the cost of some accuracy. I tried ready-quantized models (compatible with the FastChat implementation) from huggingface.co, but I ran into an issue: whenever I ask something, the output gets repeated a lot. Say I type 'hello'; I get 'Hello!' 200 times in response. I then tried quantizing a model myself with exllamav2 (using some .parquet wikitext files, also from Hugging Face, for calibration) and running it through FastChat, but the problem persists: endless repeated output. Generation itself is faster, though, so at least that part is going well.
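For what it's worth, here is a rough back-of-the-envelope check on the memory side of the --load-8bit slowdown. It is only a sketch: it counts weights alone (no KV cache, activations, or CUDA overhead), takes "7B" to mean exactly 7 billion parameters, and the function name `weight_memory_gib` is made up for illustration rather than taken from any library.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 7e9  # assumption: "7B" treated as exactly 7 billion parameters
VRAM_GIB = 6.0  # the 6 GB laptop RTX 3060 mentioned above

for bits in (16, 8, 4):
    need = weight_memory_gib(N_PARAMS, bits)
    verdict = "fits" if need < VRAM_GIB else "does not fit"
    print(f"{bits:>2}-bit: ~{need:.1f} GiB of weights, {verdict} in {VRAM_GIB:.0f} GiB of VRAM")
```

By this estimate the 8-bit weights alone come out slightly above 6 GiB, which would be consistent with the spill into shared video RAM, while a 4-bit quantization should leave some headroom. The real footprint will be higher once the context grows.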

Any ideas on what I'm doing wrong?