Fedi admins of Lemmy: how do you keep your servers up to date without increasing downtime?
(lemmy.blahaj.zone)
One way to solve this problem is to run multiple servers, with one of them being the production server. The other server is usually the failover backup, but when it’s time to upgrade, that backup can go down for the work. When it’s ready, you use a load balancer to send all requests to the upgraded server. That switchover is effectively instantaneous: one millisecond requests are being handled by the old server, the next millisecond they’re going to the new one.
It’s kind of like teleporting a new car into place as a way of avoiding a pit stop in a race. Terrible analogy but there you go.
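A minimal sketch of that switchover in Python, assuming an in-process stand-in for the load balancer. `BlueGreenRouter`, `old_server` and `new_server` are hypothetical names for illustration, not any real load balancer's API:

```python
import threading


class BlueGreenRouter:
    """Toy load balancer: sends every request to whichever backend is active."""

    def __init__(self, active, standby):
        self._lock = threading.Lock()
        self._active = active
        self._standby = standby

    def handle(self, request):
        # Pick the current backend under the lock, then call it outside the
        # lock so a slow request never blocks the swap.
        with self._lock:
            backend = self._active
        return backend(request)

    def swap(self):
        # The cutover itself is a single reference swap.
        with self._lock:
            self._active, self._standby = self._standby, self._active


# Hypothetical backends standing in for the old and the upgraded server.
def old_server(request):
    return f"v1 handled {request}"


def new_server(request):
    return f"v2 handled {request}"


router = BlueGreenRouter(active=old_server, standby=new_server)
before = router.handle("GET /")
router.swap()
after = router.handle("GET /")
```

Requests already in flight finish on the old backend, and everything after `swap()` lands on the new one. A real load balancer would also need health checks and connection draining, which this sketch leaves out.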
How do you handle the database schema changes during updates? Have two databases and disable replication during the update? How do you sync changes that occur to the backup while/after the main is upgraded?
I should preface by saying I’m reporting what I’ve heard from other engineers. I myself have not worked on high-uptime systems.
The only high-uptime system I built was very simple and that simplicity was the strategy. Basically it was built once, never changed, then was up continuously from 2017 to 2022 before it was retired.
I’m not sure how I’d handle a schema change. I’m tempted to say I could have a new db with the migrations applied, then replay the log of all transactions from my production database into the new database, applying a pure function to each transaction to alter it as necessary to match the new structure.
In this way I’d have a lagging real-time copy of the database ready to swap in.
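Here is a rough sketch of that replay idea, assuming the transaction log is just a list of dicts and the new database is a dict keyed by record id. The schema change (splitting `name` into `first_name` and `last_name`) and every function name here are made up for illustration:

```python
def migrate_event(event):
    """Pure function: map one old-schema log entry to the new schema.

    Hypothetical schema change: 'name' becomes 'first_name' + 'last_name'.
    """
    if event["op"] == "delete":
        return {"op": "delete", "id": event["id"]}
    first, _, last = event["name"].partition(" ")
    return {"op": "upsert", "id": event["id"],
            "first_name": first, "last_name": last}


def replay(log, new_db, position=0):
    """Apply log entries from `position` onward; return the next position."""
    for event in log[position:]:
        migrated = migrate_event(event)
        if migrated["op"] == "delete":
            new_db.pop(migrated["id"], None)
        else:
            new_db[migrated["id"]] = {
                "first_name": migrated["first_name"],
                "last_name": migrated["last_name"],
            }
    return len(log)


# Production log in the old schema.
log = [
    {"op": "upsert", "id": 1, "name": "Ada Lovelace"},
    {"op": "upsert", "id": 2, "name": "Alan Turing"},
]
new_db = {}
position = replay(log, new_db)

# Production keeps writing while the copy catches up.
log.append({"op": "delete", "id": 2})
log.append({"op": "upsert", "id": 3, "name": "Grace Hopper"})
lag = len(log) - position  # entries the copy has not seen yet

position = replay(log, new_db, position)
```

`lag` is the number of transactions the copy is behind at that moment. In a real setup the log would come from the database's own replication stream rather than a Python list.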
Then it would be a matter of applying enough resources to get that gap tiny, but any moment of switchover is going to have some unprocessed transactions.
That log of unprocessed transactions, assuming I can’t eliminate it, will need to be played into the new database before at least some subset of its related data is valid, meaning a time gap between switchover and availability.
I could maybe get the time gap scoped to only a few records, so that most records would work flawlessly for users but these particular records would have their own “locked for maintenance” screens.
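One way to picture that scoping, as a sketch only: treat every record id that still appears in the unprocessed tail of the log as locked, and serve everything else normally. The log format and the names `pending_ids` and `read_record` are assumptions for illustration:

```python
def pending_ids(log, position):
    """Ids of records touched by log entries not yet applied to the new db."""
    return {event["id"] for event in log[position:]}


def read_record(new_db, record_id, locked):
    """Serve a record from the new db unless it is still waiting on the log."""
    if record_id in locked:
        return {"status": "locked_for_maintenance"}
    if record_id not in new_db:
        return {"status": "not_found"}
    return {"status": "ok", "record": new_db[record_id]}


# Hypothetical state at the moment of switchover: two log entries applied,
# one (touching record 2) still unprocessed.
log = [
    {"op": "upsert", "id": 1},
    {"op": "upsert", "id": 2},
    {"op": "upsert", "id": 2},
]
position = 2
new_db = {1: {"title": "first post"}, 2: {"title": "second post"}}

locked = pending_ids(log, position)
ok_response = read_record(new_db, 1, locked)
locked_response = read_record(new_db, 2, locked)
```

Record 1 is served as usual while record 2 shows the maintenance state until the backlog drains. This only covers direct reads; anything derived from a locked record (counts, listings) would need the same check.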
Or I could just silently update the UI for any data that changes as a result of being generated in the post-migration schema, and make it clear to my users that my system isn’t guaranteed to be up to date except right after a page refresh. I.e., sometimes changes in data can take a few seconds to be reflected at the edges.
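A toy version of that "stale until refreshed" behaviour, assuming the server exposes a version counter that the client polls. `Server` and `Client` are hypothetical classes, not part of Lemmy or any framework:

```python
class Server:
    """Holds the data plus a counter that is bumped on every write."""

    def __init__(self):
        self.version = 0
        self.data = {}

    def write(self, key, value):
        self.data[key] = value
        self.version += 1

    def snapshot(self):
        # Return a copy so the client's view cannot change behind its back.
        return self.version, dict(self.data)


class Client:
    """Keeps a local view that may be stale until the next poll."""

    def __init__(self, server):
        self.server = server
        self.version, self.view = server.snapshot()

    def poll(self):
        """Refresh the view if the server moved on; return True if it did."""
        version, data = self.server.snapshot()
        if version == self.version:
            return False
        self.version, self.view = version, data
        return True


server = Server()
server.write("title", "old title")
client = Client(server)

server.write("title", "new title")   # e.g. rewritten by the migration
stale_title = client.view["title"]   # client has not polled yet
refreshed = client.poll()
fresh_title = client.view["title"]
```

Between the write and the next `poll()` the client shows the old value, which is the few-second window described above. Push notifications would shrink that window but not remove it.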
This is all me speculating though, I haven’t done it myself.
I think I could probably produce a system that works well with no downtime, as long as users understand that “sometimes there’s stale data and you gotta refresh”. The few data mismatches would be resolvable down to a small UI update delay, based on polling or push notifications.
But I don’t know if I could produce a system that would go through a schema update, with no downtime, while also never displaying incorrect data.
Maybe in a high-reliability scenario the best thing would be to put up a banner saying “system migration in progress; data error window is open until this message is gone” and have the page auto refresh.
There could be something much simpler I’m missing though. I’d love to hear from some devops or sysadmins who can speak about it from experience.
Not much of an addition, but you're absolutely right. Most systems that are expected to be highly available have standard maintenance windows with an agreement in place, and no critical use of the system may be scheduled during that regular period. Any deployments are limited to that window, in case a rollback, data sync, etc. is necessary.
All of that is in addition to the type of high availability stuff you're describing.