this post was submitted on 21 Feb 2024
278 points (95.1% liked)

Technology

[–] QuadratureSurfer@lemmy.world 47 points 7 months ago (3 children)

Direct link to the GitHub repo:
https://github.com/nickbild/local_llm_assistant?tab=readme-ov-file

It's a small model by comparison. If you want something that's offline and actually comparable to ChatGPT 3.5, you'll want the Mixtral 8x7B model instead (running on a beefy machine):

https://mistral.ai/news/mixtral-of-experts/

[–] Deceptichum@kbin.social 32 points 7 months ago (2 children)

Sick, I only need 90 GB of VRAM!

[–] QuadratureSurfer@lemmy.world 15 points 7 months ago (1 children)

I've got it running with a 3090 and 32GB of RAM.

There are some models that let you run with hybrid system RAM and VRAM (it will just be slower than running it exclusively with VRAM).
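For a rough sense of where the ~90 GB figure comes from, and why a 3090 plus system RAM can still work, here is a back-of-the-envelope sketch. It assumes the 46.7B total parameter count given in Mistral's Mixtral announcement and counts weights only (no KV cache or runtime overhead). The function name and the quantization levels are illustrative choices, not taken from any particular tool.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


MIXTRAL_TOTAL_PARAMS = 46.7e9  # assumption: total parameters per Mistral's announcement
GPU_VRAM_GB = 24               # RTX 3090

for bits in (16, 8, 4):
    total = weight_memory_gb(MIXTRAL_TOTAL_PARAMS, bits)
    # Whatever does not fit in VRAM has to sit in (slower) system RAM.
    spill = max(0.0, total - GPU_VRAM_GB)
    print(f"{bits:>2}-bit: ~{total:.1f} GB of weights, ~{spill:.1f} GB beyond 24 GB of VRAM")
```

At 16 bits per weight this lands a little above 90 GB, which matches the joke above. Real quantized files tend to be somewhat larger than the plain 4-bit estimate, and the context cache needs room too, so some spill into system RAM is likely even at 4-bit.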

[–] Deceptichum@kbin.social 16 points 7 months ago (1 children)

Yeah but damn does it get slow.

I always find it interesting how text is so much slower than image generation. I can do a 1024x1024 image in probably 20s, but I get like 1 word a second with text.
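One rough way to see why spilling into system RAM hurts so much: when generating one token at a time, the active weights have to be read from memory for every token, so memory bandwidth puts a ceiling on tokens per second. The sketch below is only that ceiling, not a benchmark. It assumes Mixtral's 12.9B active parameters per token (from Mistral's announcement), 4-bit weights, and nominal spec bandwidths for an RTX 3090 and dual-channel DDR4-3200; the function name is made up for illustration.

```python
def max_tokens_per_second(active_params: float, bits_per_weight: float,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if every active weight is read once per token."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


MIXTRAL_ACTIVE_PARAMS = 12.9e9  # assumption: parameters used per token, per Mistral

# Assumption: nominal spec-sheet bandwidths, real-world throughput is lower.
for name, bandwidth in (("RTX 3090 VRAM", 936.0), ("dual-channel DDR4-3200", 51.2)):
    bound = max_tokens_per_second(MIXTRAL_ACTIVE_PARAMS, 4, bandwidth)
    print(f"{name}: at most ~{bound:.0f} tokens/s")
```

Under these assumptions the system RAM ceiling is more than an order of magnitude below the VRAM one, so any layers that live in system RAM drag the whole generation down.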

[–] aBundleOfFerrets@sh.itjust.works 5 points 7 months ago

Languages are complex and, more importantly, much less forgiving of errors.

[–] DarkThoughts@fedia.io 1 points 7 months ago (2 children)

Hopefully we see more specialized hardware for this, like expansion cards with pretty much just tensor cores and their own RAM.

[–] Deceptichum@kbin.social 1 points 7 months ago (2 children)

I’d love to see some consumer-level AI hardware. Sadly, it all seems to be designed for server farms, and by the time it ages out into consumer prices it’s so obsolete there’s no point in getting it.

[–] raldone01@lemmy.world 1 points 6 months ago (1 children)

Do they want consumer AI cards to exist though?

Think about the data!

[–] Deceptichum@kbin.social 1 points 6 months ago (1 children)

Card makers? They only want money. If there's enough consumer-level demand, they will make them.

[–] raldone01@lemmy.world 1 points 6 months ago

I guess you're right.

[–] DarkThoughts@fedia.io 1 points 7 months ago

It's not quite consumer level, I'd say, but Coral.ai has some small accelerators based on Google's Edge TPU.

[–] topinambour_rex@lemmy.world 1 points 7 months ago

Graphics cards without video outputs have existed for a while.

[–] mesamunefire@lemmy.world 10 points 7 months ago (1 children)

Nice! That's a cool project, I'll have to give it a try. I love the idea of self-hosting local LLMs. I've been playing around with https://lmstudio.ai/ and it downloads models directly from Hugging Face.

[–] mtw@lemmy.blahaj.zone 2 points 7 months ago

There's also ollama, which seems to be similar. Not sure if LM Studio is open source, but ollama is.
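For anyone curious what talking to a local ollama instance looks like, here is a minimal sketch using only the Python standard library. It assumes ollama is running on its default local address and that a model named `mistral` has already been pulled; the helper names `build_request` and `generate` are made up for this example.

```python
import json
import urllib.request

# Assumption: ollama's default local address and its /api/generate endpoint.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str, url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming generate request for a local ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str) -> str:
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]


# Usage, with the server running and the model pulled:
#   print(generate("mistral", "Explain VRAM in one sentence."))
```

Everything stays on your own machine, since the request only goes to localhost.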

[–] DarkThoughts@fedia.io 1 points 7 months ago (1 children)

I tried llamafile for text gen too, but I couldn't get ROCm to work properly with it so it would run on my GPU without having to build it myself, which I'm really not into. And CPU text gen is waaaaaay too slow for anything. A Mixtral response took ~250 seconds or so for ~1k context tokens; I think Mistral was about 52 seconds or something around that number.

https://github.com/Mozilla-Ocho/llamafile Mixtral is definitely beefy, Mistral is quite a bit faster, and there are a few even smaller prebuilt ones. But the smaller you go, the less complex the responses will be. I think llamafile is a good step in the right direction, but it's still not a good out-of-the-box experience yet. At least I got further with it than with oobabooga, which is the recommendation for SillyTavern; oobabooga would just crash whenever it generated anything, without even giving me an error.

[–] Flumpkin 0 points 7 months ago (1 children)

How fast are they with a good GPU?

[–] DarkThoughts@fedia.io 0 points 7 months ago (1 children)

Did you miss the first part where I explained that I couldn't get it to run on my GPU? I only have a 6650 XT anyway, but even that would be significantly faster than my CPU. How much faster I can't say without trying it, but I suspect that with longer chats and consequently larger context sizes it would still be too slow to be really usable, unless you're okay waiting ages for a response.

[–] Flumpkin 1 points 7 months ago

Sorry, I'm just curious in general how fast these local LLMs are. Maybe someone else can give some rough info.