this post was submitted on 15 Jun 2023
164 points (99.4% liked)


My first experience with Lemmy was thinking that the UI was beautiful, and lemmy.ml (the first instance I looked at) was asking people not to join because they already had 1500 users and were struggling to scale.

1500 users just doesn't seem like much. It seems like the type of load you could handle with a Raspberry Pi in a dusty corner.

Are the Lemmy servers struggling to scale because of the federation process / protocols?

Maybe I underestimate how much compute goes into hosting user-generated content? Users generate very little text, but uploaded pictures take more space. Users are generating millions of bytes of content and it's overloading computers that can handle billions of bytes with ease. What happened? Am I missing something here?

Or maybe the code is just inefficient?

Which brings me to the title's question: Does Lemmy benefit from using Rust? None of the problems I can imagine are related to code execution speed.

If the federation process and protocols are inefficient, then everything is being built on sand. Popular protocols are hard to change. How often does HTTP change? Almost never. The language used for the code doesn't matter in this case.

If the code is just inefficient, well, inefficient Rust is probably slower than efficient Python or JavaScript. Could the complexity of Rust have pushed the devs towards a simpler but less efficient solution that ends up being slower than garbage collected languages? I'm sure this has happened before, but I don't know anything about the Lemmy code.

Or, again, maybe I'm just underestimating the amount of compute required to support 1500 users sharing a little bit of text and a few images?

[–] TauZero@mander.xyz 8 points 1 year ago (12 children)

I agree, hearing about scaling issues so early into adoption is concerning. Lemmy advocates say "horizontal scaling is already built in! just add more instances!", but that doesn't explain the problem.

It's all just text! By my guess too, a server handling text alone should easily support a thousand concurrent users and hundreds of thousands of daily users. A Raspberry Pi should handle thousands. I've heard the bottleneck is the database? In that case Rust is not to blame, Postgres is.

But my fear is that the data structures are implemented in a trivial way. If you have a good reddit-sized thread with a thousand comments, but you store each comment as a separate database entry, then every pageview will trigger a thousand database lookups! The way I imagined making a reddit clone is that I would store the comments as a flat list with some tree data on top, such that serving a single page with 1000 comments is no different than streaming a 100K text file. I'll go take a look at how Lemmy does it currently once I get the courage!
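A minimal sketch of the "flat list with some tree data on top" idea, purely for illustration. The `Comment` fields, the sample thread, and the `render` function are all hypothetical, and nothing here reflects how Lemmy actually stores comments. The assumption is that comments are kept in display order, so a depth number per comment is enough to encode the tree.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    depth: int   # nesting level; 0 means a top-level reply
    author: str
    text: str


# Hypothetical thread stored in display (pre-order) order: each reply
# directly follows its parent, so the depth field alone encodes the tree.
thread = [
    Comment(0, "alice", "Is the database the bottleneck?"),
    Comment(1, "bob", "Probably, but measure first."),
    Comment(2, "alice", "Fair point."),
    Comment(0, "carol", "Text alone should be cheap to serve."),
]


def render(comments):
    """Render the whole thread in one sequential pass, with no per-comment lookup."""
    return "\n".join(f"{'  ' * c.depth}{c.author}: {c.text}" for c in comments)


print(render(thread))
```

The trade-off in this layout is on the write side: a new reply has to be inserted in the middle of the list, and re-sorting the thread (for example by score) means rewriting the order.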

[–] heartlessevil@lemmy.one 6 points 1 year ago (10 children)

> If you have a good reddit-sized thread with a thousand comments, but you store each comment as a separate database entry, then every pageview will trigger a thousand database lookups!

No, it wouldn't. That's called the N+1 query problem, and it can be avoided by writing more efficient queries.

[–] TauZero@mander.xyz 2 points 1 year ago (9 children)

Could you explain more how this works? I see how you should be reducing the number of SQL queries from N+1:

SELECT p.comment_ids FROM posts p WHERE p.post_id = 79
-> (5, 13, 42, 57)
SELECT c.text FROM comments c WHERE c.comment_id = 5
SELECT c.text FROM comments c WHERE c.comment_id = 13
SELECT c.text FROM comments c WHERE c.comment_id = 42
SELECT c.text FROM comments c WHERE c.comment_id = 57

down to 1 query:

SELECT c.text FROM comments c WHERE c.parent_post = 79

(Or something like this; I don't know SQL, sorry.) But wouldn't the database still have to look up each comment record on the backend? Yes, they are all indexed and hashed, but if you have a thousand comments, or even ten thousand (which reddit handles perfectly fine!), aren't 10000 fetches from a hashtable still slower than fetching a 10000-long array? And what if you've been running your reddit clone for years and have accumulated so many gigs of content that it no longer fits in memory and has to be stored on disk? Aren't you looking at 10000 disk reads in the worst-case scenario with a hashtable?
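The two query patterns compared above can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the table and column names (`comments`, `comment_id`, `parent_post`, `text`) follow the guesses in this comment rather than Lemmy's real schema, the sample rows are made up, SQLite stands in for Postgres, and the list of ids comes from the `comments` table itself instead of a `posts.comment_ids` column. The query count is tallied by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments ("
    "comment_id INTEGER PRIMARY KEY, parent_post INTEGER, text TEXT)"
)
# Index on the post id, so the single-query version does not scan every comment.
conn.execute("CREATE INDEX idx_comments_post ON comments (parent_post)")
conn.executemany(
    "INSERT INTO comments (comment_id, parent_post, text) VALUES (?, ?, ?)",
    [
        (5, 79, "first"),
        (13, 79, "second"),
        (42, 79, "third"),
        (57, 79, "fourth"),
        (60, 80, "comment on another post"),
    ],
)


def fetch_n_plus_one(post_id):
    """N+1 pattern: one query for the ids, then one query per comment."""
    ids = [
        row[0]
        for row in conn.execute(
            "SELECT comment_id FROM comments WHERE parent_post = ? "
            "ORDER BY comment_id",
            (post_id,),
        )
    ]
    texts = []
    for comment_id in ids:
        row = conn.execute(
            "SELECT text FROM comments WHERE comment_id = ?", (comment_id,)
        ).fetchone()
        texts.append(row[0])
    return texts, 1 + len(ids)  # number of queries issued


def fetch_single(post_id):
    """Single query: the database returns every matching row in one round trip."""
    rows = conn.execute(
        "SELECT text FROM comments WHERE parent_post = ? ORDER BY comment_id",
        (post_id,),
    ).fetchall()
    return [row[0] for row in rows], 1


texts_a, queries_a = fetch_n_plus_one(79)
texts_b, queries_b = fetch_single(79)
print(queries_a, queries_b, texts_a == texts_b)
```

Both functions return the same comments; the difference is how many separate statements the application sends. Whether the matching rows sit next to each other on disk is a separate question of storage layout and caching, which this sketch does not address.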

[–] JustinFTL@kbin.social 3 points 1 year ago* (last edited 1 year ago)

@TauZero you would use a join so that one call would fetch all comment rows for that post:

`SELECT p.post_id, p.title, p.description, c.text
FROM posts p JOIN comments c ON p.post_id = c.post_id
WHERE p.post_id = 79`

This would return a list of all comments for post 79, with the post id, title, and description repeated in every row. This isn't a perfect example of optimized SQL, and I have no idea if these are the correct table and field names, as I didn't look them up. But it's a simple example of a join.
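For anyone who wants to try the join, here is a small runnable sketch using Python's built-in `sqlite3` module. The schema, the `comment_id` column, and the sample rows are assumptions invented for the demo (as noted above, the real table and field names were not looked up), and the `ORDER BY` is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and sample data, just enough to run the join.
conn.executescript(
    """
    CREATE TABLE posts (
        post_id INTEGER PRIMARY KEY, title TEXT, description TEXT
    );
    CREATE TABLE comments (
        comment_id INTEGER PRIMARY KEY,
        post_id INTEGER REFERENCES posts(post_id),
        text TEXT
    );
    INSERT INTO posts VALUES (79, 'Example post', 'Example description');
    INSERT INTO posts VALUES (80, 'Another post', 'Another description');
    INSERT INTO comments VALUES (5, 79, 'first');
    INSERT INTO comments VALUES (13, 79, 'second');
    INSERT INTO comments VALUES (42, 79, 'third');
    INSERT INTO comments VALUES (60, 80, 'comment on another post');
    """
)

# One statement returns every comment for post 79, with the post's own
# columns repeated on each row.
rows = conn.execute(
    """
    SELECT p.post_id, p.title, p.description, c.text
    FROM posts p JOIN comments c ON p.post_id = c.post_id
    WHERE p.post_id = 79
    ORDER BY c.comment_id
    """
).fetchall()

for row in rows:
    print(row)
```

Each returned tuple carries the post columns followed by one comment's text, which is the repetition described above.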
