crashdoom

joined 1 year ago

Firstly, the maintenance for Pawb.Social services has been completed and we've upgraded Lemmy to 0.19.6-beta.13, and our Mastodon instances to v4.3.0.

Thanks for your patience.


and now, onto the "Toastify is awesome" (because it is. Totally.) postmortem:

Short summary: Crash did a stupid with the config and emails weren’t getting sent (oops). Emails should now be working again, and registration shouldn't require admin intervention.

Longer story:

Lemmy has several configuration options that need to be updated to correspond to the specific instance that's running it. For example, here's a snippet of Pawb.Social's config:

{
  database: {
    host: "the.best.db.url.internal"
    user: "pawbsocial_lemmy"
    database: "pawbsocial_lemmy"
    password: "pawbsocial_do_you_really_think_i_would_leak_this"
    pool_size: 128
  }
  hostname: "pawb.social"
  pictrs: {
    url: "http://pawb-social-pictrs:8080/"
    api_key: "sekrit"
  }
  email: {
    smtp_server: "email-smtp.us-east-2.amazonaws.com:587"
    smtp_login: "pls_send_emails"
    smtp_password: "jeff_im_begging_you"
    smtp_from_address: "Pawb.Social Lemmy <{{domain}}>"
    tls_type: "starttls"
  }
}

Can you spot the problem? (No, it's not my awful password, or the pleading I'm doing with AWS to send emails so y'all stop having email problems.)

Originally, Lemmy was deployed through docker-compose, which used variable replacement to change something like {{domain}} to pawb.social automatically. Crash moved away from docker-compose after repeated issues deploying Lemmy changes, and opted for a more bare Docker configuration.

Can you guess what didn’t get changed? …Yeah, the {{domain}} part was never replaced with pawb.social, so Lemmy correctly caught the error and threw an exception. The unfortunate part is that the exception caused the whole comment process to close its connection to the web proxy server, which returned a 502 Bad Gateway and handed the UI HTML instead of a JSON response. The response handler in the UI then tried to figure out what was wrong, failed to parse the HTML as JSON, and fell back to the default error message “Toastify is awesome!” (which it is. Objectively.)
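
For the curious, the fix was just replacing the placeholder with a real address; roughly like this (the exact from-address below is illustrative, not necessarily the one we actually use):

email: {
  # before: the template placeholder was never rendered
  smtp_from_address: "Pawb.Social Lemmy <{{domain}}>"
  # after: a hard-coded address, e.g.
  smtp_from_address: "Pawb.Social Lemmy <noreply@pawb.social>"
}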

HOPEFULLY, this does fix the issue. I can’t guarantee it, but I was able to replicate it by replying to someone else’s comment (which usually triggers an email notification), and after making the change, the reply went through with a happy green confirmation toast.

[–] crashdoom@pawb.social 6 points 2 weeks ago (1 children)

Thank you for doing your part, and thank you for the cute sticker!! :3

[–] crashdoom@pawb.social 3 points 1 month ago (1 children)

Also got this myself… Seems like the API is throwing errors despite working (…what?! >.>). Taking a look!

[–] crashdoom@pawb.social 5 points 1 month ago

If anyone has been struggling to get a verification email:

  • Please ensure you check your spam folder (We’ve been having some mail delivery issues with AWS)
  • Send an email to network@pawb.social using the same email you created your account with, and include your username, and we’ll manually approve it in the database

We’re working to figure out what’s going on, but it seems to be a mix of issues.

[–] crashdoom@pawb.social 1 points 1 month ago

Investigating this! Apologies for the inconvenience!

[–] crashdoom@pawb.social 3 points 1 month ago (1 children)

Looks like I had an off-by-one error; it's actually fixed in the 0.19.6 beta. Will look at upgrading to that to fix the issue, just reviewing the changes to make sure there's nothing else major there.

 

We've successfully upgraded Lemmy from v0.19.3 to v0.19.5 and have a bunch of changes to share!

Bug Fixes

  • Previously, users were reporting that the web client kept logging them out. This was suspected to be due to a misconfiguration of the cookie storing the session information and should now be fixed.

  • Some communities were slow to load. Federation changes in v0.19.4 and v0.19.5 should rectify some of these issues. Additionally, we've increased the number of database connections for Lemmy and will raise this further if we identify any more slowdowns.

(Longer term, we're planning to upgrade the database hardware which should improve the performance for all services)

Local Only Communities

When creating or editing your communities, you'll now see a "Visibility" option that allows you to choose between "Public" and "Local Only".

The "Public" setting will continue to operate as normal and federate your community to all other instances we federate with.

"Local Only" will disable federation for your community and restrict it to being viewed by local, logged-in users only; logged-out users may not be able to see your community until they log in.

Image Proxying

Once we've established that the database is no longer under substantial load and the performance issues previously experienced are fixed, we'll be enabling an image proxying mode which will allow us to locally cache images from remote instances rather than serving them directly.

This will allow us to improve the performance of loading images without applying additional load to other instances, and prevent remote bad actors from trying to scrape IP addresses.

Other changes

There are a handful of other miscellaneous changes that you can view in the full change logs:

[–] crashdoom@pawb.social 2 points 1 month ago

👋 This isn't just you! We're aware of the issue and believe the latest version of Lemmy should fix it; it seems to be an issue with the cookies being set with Strict instead of Lax, causing them not to be sent by your browser.
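
For the curious, the difference comes down to the SameSite attribute on the auth cookie; roughly speaking (the cookie name and value here are illustrative, not the exact header Lemmy sends):

Set-Cookie: jwt=<token>; SameSite=Strict   (browser withholds the cookie on cross-site navigations, so you appear logged out)
Set-Cookie: jwt=<token>; SameSite=Lax      (cookie is sent on normal top-level navigations, so the session sticks)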

Full announcement for maintenance to perform the upgrade: https://pawb.social/post/14286683

 

We’re aware of an issue affecting our Lemmy service (pawb.social) that seems to be causing users to be randomly logged out. We believe we’ve identified the issue thanks to a bug fix in a newer version of the UI code.

We need to verify there are no other breaking changes for the upgrade process, and we expect to perform maintenance this Sunday, Sept 22nd, at noon US Mountain Time (https://everytimezone.com/s/7abd4955). We’ll provide an update tomorrow to confirm the maintenance window.

Thank you to everyone who brought this to our attention and we apologize for the inconvenience.

[–] crashdoom@pawb.social 2 points 9 months ago

Looks really, really well done! :D

 

tl;dr summary furry.engineer and pawb.fun will be down for several hours this evening (5 PM Mountain Time onward) as we migrate data from the cloud to local storage. We'll post updates via our announcements channel at https://t.me/pawbsocial.


In order to reduce costs and expand our storage pool, we'll be migrating data from our existing Cloudflare R2 buckets to local replicated network storage, and from Proxmox-based LXC containers to Kubernetes pods.

Currently, according to Mastodon, we're using about 1 TB of media storage, but according to Cloudflare, we're using nearly 6 TB. This appears to be due to Cloudflare R2's implementation of the underlying S3 protocol that Mastodon uses for cloud-based media storage, which prevents Mastodon from properly cleaning up files that are no longer used.

As part of the move, we'll be creating / using new Docker-based images for Glitch-SOC (the fork of Mastodon we use) and hooking that up to a dedicated set of database nodes and replicated storage through Longhorn. This should allow us to seamlessly move the instances from one Kubernetes node to another for performing routine hardware and system maintenance without taking the instances offline.
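
For a rough idea of what "replicated storage through Longhorn" looks like in Kubernetes terms (the names and size below are illustrative, not our actual manifests), the media volume becomes a PersistentVolumeClaim backed by Longhorn's storage class, which replicates the data across nodes so a pod can be rescheduled elsewhere and re-attach to the same data:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: glitch-soc-media          # illustrative name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn      # Longhorn provides the replicated volumes
  resources:
    requests:
      storage: 2Ti                # illustrative size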

We're planning to roll out the changes in several stages:

  1. Taking furry.engineer and pawb.fun down for maintenance to prevent additional media being created.

  2. Initiating a transfer from R2 to the new local replicated network storage, starting with locally generated user content and then remote media. (This will happen in parallel with the other stages, so some media may be unavailable until the transfer fully completes.)

  3. Exporting and re-importing the databases from their LXC containers to the new dedicated database servers.

  4. Creating and deploying the new Kubernetes pods, and bringing one of the two instances back online, pointing at the new database and storage.

  5. Monitoring for any media-related issues, and bringing the second instance back online.

We'll be beginning the maintenance window at 5 PM Mountain Time (4 PM Pacific Time) and have no ETA at this time. We'll provide updates through our existing Telegram announcements channel at https://t.me/pawbsocial.

During this maintenance window, furry.engineer and pawb.fun will be unavailable until the maintenance has concluded. Our Lemmy instance at pawb.social will remain online, though you may experience longer-than-normal load times due to high network traffic.


Finally and most importantly, I want to thank those who have been donating through our Ko-Fi page as this has allowed us to build up a small war chest to make this transfer possible through both new hardware and the inevitable data export fees we'll face bringing content down from Cloudflare R2.

Going forward, we're looking into providing additional fediverse services (such as Pixelfed) and extending our data retention length to allow us to maintain more content for longer, but none of this would be possible if it weren't for your generous donations.

Lemmy v0.19.3 (pawb.social)
submitted 9 months ago* (last edited 9 months ago) by crashdoom@pawb.social to c/pawbsocial_announcements@pawb.social
 

We've updated to Lemmy v0.19.3!

For a full change log, see the updates below:

Major changes

Improved Post Ranking

There is a new scaled sort which takes into account the number of active users in a community, and boosts posts from less-active communities to the top. Additionally there is a new controversial sort which brings posts and comments to the top that have similar amounts of upvotes and downvotes. Lemmy’s sorts are detailed here.

Instance Blocks for Users

Users can now block instances. Similar to community blocks, it means that any posts from communities which are hosted on that instance are hidden. However, the block doesn’t affect users from the blocked instance; their posts and comments can still be seen normally in other communities.

Two-Factor Auth Rework

Previously 2FA was enabled in a single step which made it easy to lock yourself out. This is now fixed by using a two-step process, where the secret is generated first, and then 2FA is enabled by entering a valid 2FA token. It also fixes the problem where 2FA can be disabled without passing any 2FA token. As part of this change, 2FA is disabled for all users. This allows users who are locked out to get into their account again.

New Federation Queue

Outgoing federation actions are processed through a new persistent queue. This means that actions don’t get lost if Lemmy is restarted. It is also much more performant, with separate senders for each target instance. This avoids problems when instances are unreachable. Additionally, it supports horizontal scaling across different servers. The endpoint /api/v3/federated_instances contains details about the federation state of each remote instance.

Remote Follow

Another new feature is support for remote follow. When browsing another instance where you don’t have an account, you can click the subscribe button and enter the domain of your home instance in the popup dialog. It will automatically redirect you to your home instance where it fetches the community and presents a subscribe button. Here is a video showing how it works.

Moderation

Reports are now resolved automatically when the associated post/comment is marked as deleted. This reduces the amount of work for moderators. There is a new log for image uploads which stores the uploader. For now it is used to delete all user uploads when an account is purged. Later, the list can be used for other purposes and made available through the API.

[–] crashdoom@pawb.social 8 points 11 months ago (2 children)

Apologies for the confusion and delay! We've gone ahead and updated to 0.18.5!

[–] crashdoom@pawb.social 1 points 1 year ago* (last edited 1 year ago)

Side note: Approved the local https://pawb.social/u/CommunityBoost account just now.

 

Currently, we’re running the Ubiquiti Dream Machine directly as the modem via PPPoE, but there appears to be an intermittent issue with the software implementation that results in periodic downtimes of a few minutes while it reconnects.

We’re looking at switching this back out for the ISP-provided router in pass-through mode to negate the PPPoE connectivity drops.

We don’t expect this to take longer than 1 hour to switch over and test for reliability before bringing the services back up.

We’ll be performing this maintenance around 11 AM US Mountain Time, and will provide updates via the Telegram channel at https://t.me/pawbsocial.

 

One of the data storage systems (CEPH) encountered a critical failure when Proxmox lost connection to several of its nodes, ultimately resulting in the CEPH configuration being cleared by the Proxmox cluster consensus mechanism. No data, except ElasticSearch, was stored on CEPH.

When the connection was lost to the other nodes, a split-brain occurred (when nodes disagree on which changes are authoritative and which should be dropped). As we tried to recluster all of the nodes, a resolution occurred that resulted in the ceph.conf file being wiped and the data on CEPH being unrecoverable.

Thankfully, we’ve suffered no significant data loss, with the exception of having to rebuild the Mastodon ElasticSearch indexes from 6 AM this morning to present.

I’d like to profusely apologize for the inconvenience, but we felt it necessary at the time to offline all services as part of our disaster recovery plan to ensure no damage occurred to the containers while we investigated.

 

We've migrated all of the Pawb.Social (Lemmy) media from within the container to the new local NAS.

We'll be performing the same updates for pawb.fun and furry.engineer over the course of the next week, minimizing downtime where possible.

We will need to take the services offline once the initial transfer is completed to perform a final check for any new data, and transition to the local NAS, but we'll post an announcement prior to the final maintenance for each instance.

These changes are to allow us longer and cheaper retention of remote and local media across all of our services, and to allow us to begin backing up media in case of a catastrophic failure.

 

With the upgrade to Mastodon 4.2.0, the way the search system works is changing to allow broader searches of posts made across the fediverse.

Cutting to the chase: Full-text search across opted-in posts will be possible in the next few days.


In more depth, the system will allow you to search across your own posts and posts from others that you've interacted with, irrespective of their opt-in status for full-text search, but it will additionally allow you to search ANY posts made by users who opt in to having their posts indexed.

To that end, a new setting called "Include public posts in search results" has been added to the Public Profile > Privacy and Reach page; it is disabled by default. Turning it on will allow any user across the Fediverse to search and find your posts using full-text search.


To allow our users to decide if this is a feature they want to use on their posts going forward, we won't be enabling ElasticSearch until Friday Sept. 22.


Additionally, new keywords will be added to the search system:

  • from:me
  • before:2022-11-01, after:2022-11-01, during:2022-11-01
  • language:fr
  • has:poll
  • in:library for searching only in posts you have written or interacted with

These will be enabled on Friday along with the rest of the changes to the search system.
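
As a quick illustrative example (not an officially documented query), combining these with regular search terms should look something like:

from:me has:poll after:2022-11-01

i.e., your own posts containing a poll, made after that date.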


If you've got any questions, please let us know!

[–] crashdoom@pawb.social 2 points 1 year ago (1 children)

I've been thinking about this for a while, especially seeing several of the admin and maintainer communities for Mastodon / Lemmy taking place on Matrix. But, at this time I don't think we're in a position to be able to spin one up due to existing work we need to finish before taking on new responsibilities.

So, not an outright no, but in the future, yes. If this is something folks do want us to provide, please do add your feedback here and updoot. :3

 

EDIT Aug 2, 1:53 PM US Mountain Time: Maintenance has now concluded and all services are back online!


We're preparing the server rack for an additional Proxmox node to be added to support HA failover and have identified the need to offline and move some equipment around to complete this work.

As such, we'll have a temporary downtime of up to 1 hour starting at 1 PM US Mountain Time today, August 2nd (1 hour from this post), though we expect the work to update the node, move it, and bring it back online to take no longer than 30 minutes.

We're hoping these changes will help us maintain a more stable service going forward, spread out the Sidekiq processes (the ones that receive / send data across the Fediverse), and allow us to perform maintenance as necessary without having to offline the service in the future.

We apologize for the short notice as we had hoped to perform this maintenance without any downtime.


Maintenance window: 1 PM - 2 PM US Mountain Time, August 2nd, 2023

ETA: 30 minutes to 1 hour

During the maintenance window, the following services will be unavailable:

  • pawb.fun
  • pawb.social
  • furry.engineer

We’ll have updates throughout the maintenance window available at https://t.me/pawbsocial.

 

Cleared The Shifting Oubliettes of Lyhe Ghiah with a group of four, hitting every single exclamation mark (though two were secret sweeper trolls)!

Love the fact the devs are willing to troll you. FFXIV is awesome XD

 

Around 11:30 PM, one of the core GFCI breakers tripped, resulting in the UPS array running until just after midnight before dropping entirely.

We had not set up monitoring on the APC UPS as we hadn’t anticipated that as a failure point due to the reliability of the Colorado electrical grid, and didn’t anticipate the breaker itself flipping after months of continued use.

All services should be online again and working to catch up on any dropped content overnight, and we apologize again for the inconvenience.

We’ll be trying to identify the root cause throughout the day, working on the electrical connections, and setting up alerting through the APC UPS to detect any persistent A/C loss.

[–] crashdoom@pawb.social 1 points 1 year ago

Personally, I pronounce it “POB” because W’s are hard :p

Though, I totally didn’t realize when I chose the domain that Pawb also means “everybody” in Welsh, as @match@pawb.social mentioned, but I think it makes it all the better :3

[–] crashdoom@pawb.social 6 points 1 year ago (2 children)

We're using the Ansible playbook deployment, and I ended up giving the pictrs service a restart through docker (docker restart <id> and you can get the id by using docker ps).

It didn't seem to be out of space or even offline, it just locked up and stopped responding to both new uploads and existing image requests.
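
For anyone else who runs into this, the exact commands were just (the container ID will be whatever docker ps shows for pictrs on your deployment):

docker ps                      # find the pictrs container ID
docker restart <container_id>  # restart only the pictrs container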
