Can you keep this thread posted please? Or you can share a PR link so I can follow up the progress there. Am very interested.
Removed the hashtags. The automated post didn't appear as I'd hoped it to. So I ended up actually copy-pasting things over ๐คทโโ๏ธ https://mastodon.social/@bahmanm/110899110596116547
So I can just boost the automated post in Mastodon instead of copy-pasting everything over.
Hopefully it didn't clutter the post.
I'm not impressed. Talk is cheap, show me the code.
Well said ๐
I bookmarked your reply to come back to it whenever this discussion comes up for me!
potential to reuse
I have a feeling that it's going to make a noticeable difference; it's way cheaper than a JOIN ... GROUP BY
query.
order they are updated in so that the deepest child gets count updated first
Given the declarative nature of SQL, I'm afraid that's not possible - at least to my knowledge.
But worry not! That's why there are stored procedures in almost every RDBMS; to add an imperative flare to the engine.
In purely technical terms, Implementing what you're thinking about is rather straight-forward in a stored procedure using a CURSOR
. This could be possibly the quickest win (plus the idea of COUNT(*)
if applicable.)
Now, I'd like to suggest a possibly longer route which I think may be more scalable. The idea is based around the fact that comments themselves are utterly more important than the number of child comments.
- The first priority should be to ensure
INSERT/UPDATE/SELECT
are super quick oncomment
andpost
. - The second priority should be to ensure
child_count
is eventually correctly updated when (1) happens. - The last priority should be to try to make (2) as fast as we can while making sure (3) doesn't interfere w/ (1) and (2) performance.
Before rambling on, I'd like to ask if you think the priorities make sense? If they do, I can elaborate on the implementation.
First off, IIRC, COUNT(*)
used to be slightly faster (~10-15%) than COUNT(some_column)
in PG. There's a chance that recent versions of PG have fixed this inconsistency but still worth benchmarking.
Now to the query:
To my mind, we've already got comment_aggregate
which is supposed to store the result of the query shared above, right? Why do we need to run that SELECT
again instead of simply:
-- pseudo-code
SELECT
ca.id, ca.child_count, ca.path
FROM
comment_aggregate ca
WHERE
ca.post_id = :post_id
I think I'm confusing matters here b/c I don't know lemmy's DB structure. Is there a link to an ERD/SQL/... you could share so I could take a look and leave more educated replies?
I don't think you're supposed to post business ads in this community (and most others); hence the downvotes.
Perhaps a line in the community description could help w/ this in the future?
DISCLAIMER: I've never looked at lemmy's code base. ๐คฆโโ๏ธ
I think no matter any possible optimisation to the query (if any), the current design may not be going to scale very well given it traverses all the comment X comment
space every time a comment is added.
To my mind, it works well when there are many shallow comments (ie little nesting/threading) which might not be the best strategy for the content lemmy serves.
Can you share the structures of comment_aggregates
and comment
? I feel there's a good opportunity for denormalisation there which may mean better performance.
That said, here's one concrete idea that crossed my mind and could be worth benchmarking:
- Create an
AFTER UPDATE
trigger oncomment_aggregates
which updates a comment's immediate parent(s)child_count
(basically increment it by 1.) - Re-write the posted query to only update the immediate parents of the comment being added.
That results in the trigger being run exactly m
times where m
is the number of comments of the subtree where the new comment was just added to.
Does that make sense?
๐