this post was submitted on 31 Jul 2023


Well hello again, I have just learned that the host that recently had both NVMe drives fail upon drive replacement now has new problems: the filesystem reports permanent data errors affecting the databases of both the Matrix server and the Telegram bridge.

I have just rented a new machine and am about to restore the database snapshot of 26 July, just in case. All the troubleshooting of the past few days was very exhausting; however, I will try to do this, or at least prepare it, within the next few hours.

Update

After a rescan the errors have gone away; however, the drives logged errors too. The question now is whether the data integrity can be trusted.

Status August 1st

Well ... good question ... optimizations were made last night, the restore was successful and ... we are back to debugging outgoing federation :(


The new hardware will also be a bit more powerful ... and yes, I have not forgotten that I wanted to update that database. It's just that I was busy debugging federation problems.

[–] Haui@discuss.tchncs.de 1 points 1 year ago (10 children)

You’re very welcome. Hetzner is generally a good host afaik. It does depend on the configuration, I suppose. Are you using a shared VPS or something else? If the storage is guaranteed (as in not custom hardware), they are technically responsible for its condition. A host I’m working with (also located at Hetzner, but in Falkenstein) does 2 backups a day, which also prevents having to revert far back.

[–] milan@discuss.tchncs.de 2 points 1 year ago (9 children)

On Hetzner it's all dedicated servers – out goes an ax51-nvme, in comes an ax102. They tried swapping a connector cable to bring the NVMe(s) back to life, and I was wondering if this could have something to do with the SMART errors logged and the temporary zpool errors. However, I think the CPU upgrade is at least very welcome for the Matrix server 😅

[–] Haui@discuss.tchncs.de 1 points 1 year ago (8 children)

Hm. In that case I’m not sure what their obligations are. It’s very rare that I hear of NVMe drives downright failing.

If your SMART error rates start going up, that is a clear indicator that something is gonna happen. I have a graph on my server showing the error rates. Actually, there is a "bad sectors" or "reallocated sectors" reading that should be more telling. Once that goes up, it's critical, I think.
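A rough sketch of what such a check could look like for NVMe drives, assuming the report comes from smartmontools' JSON output (`smartctl -j -A /dev/nvme0`) and that the health values sit under `nvme_smart_health_information_log`. NVMe drives report media errors and spare capacity rather than reallocated sectors. The function name `nvme_health_warnings`, the 90% wear limit and the sample values are made up for illustration, not taken from the setup discussed here.

```python
import json


def nvme_health_warnings(report, max_percentage_used=90):
    """Return a list of human-readable warnings for one parsed smartctl JSON report.

    `report` is the dict parsed from `smartctl -j -A <device>` (assumed layout).
    `max_percentage_used` is an arbitrary illustrative wear limit.
    """
    log = report.get("nvme_smart_health_information_log", {})
    warnings = []

    # Any non-zero critical_warning bitfield means the drive itself flags a problem.
    if log.get("critical_warning", 0) != 0:
        warnings.append("critical_warning flag is set")

    # Media errors count unrecovered data integrity errors.
    media_errors = log.get("media_errors", 0)
    if media_errors > 0:
        warnings.append(f"{media_errors} media errors logged")

    # Spare capacity below the vendor threshold means the drive is running out of reserve blocks.
    spare = log.get("available_spare")
    threshold = log.get("available_spare_threshold")
    if spare is not None and threshold is not None and spare < threshold:
        warnings.append(f"available spare {spare}% is below threshold {threshold}%")

    # percentage_used is the vendor's estimate of consumed endurance.
    used = log.get("percentage_used", 0)
    if used >= max_percentage_used:
        warnings.append(f"estimated wear at {used}%")

    return warnings


# Made-up sample report, only to show the call.
sample = json.loads(
    '{"nvme_smart_health_information_log": '
    '{"critical_warning": 0, "media_errors": 3, "available_spare": 100, '
    '"available_spare_threshold": 10, "percentage_used": 12}}'
)
for warning in nvme_health_warnings(sample):
    print(warning)
```

Feeding the same values into a graph over time, rather than checking them once, is what makes a slow rise visible.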

I didn’t even know you also ran a Matrix server. I recently started looking into Matrix but I can’t really say anything yet. Is it federated as well? Or do you need to make a new account for each one?

[–] milan@discuss.tchncs.de 1 points 1 year ago* (last edited 1 year ago) (1 children)

Yes, it is federated – however, since there is no SSO on the Lemmy instance, you need to make a new account, just like you need separate accounts with different email providers. :) It is also a different federation protocol: Matrix vs ActivityPub. For more cool stuff, check out https://tchncs.de :3

[–] Haui@discuss.tchncs.de 1 points 1 year ago

Cool! Thanks! I will check it out.
