this post was submitted on 02 Sep 2024
Asklemmy
I took down the home page of one of the top 5 websites for around 5 minutes.
There were two existing functions that were written by a different team: an `encode` method that took the name of something (only used internally, never shown to the user) and returned a numeric identifier for it, and a `decode` method that did the opposite.
Some existing code already used `encode`, but I had to use `decode` in my new code. I added the code, rolled it out to 80% of employees, and it seemed to work fine. The next day, I rolled it out to 5% of the public and it still seemed okay.
Once I rolled it out to everyone, it all broke.
Turns out that while the `encode` function used a static map built at build time (and was thus just an O(1) lookup at runtime), `decode` connected to a database that was only ever designed for internal use. The DB only had ten replicas, which was nowhere near enough to handle hundreds of thousands of concurrent users.
Luckily, it's commonplace to use feature flags for changes, which is how I could roll it out just to employees initially. The devops team were able to find stack traces of the error in the prod logs, find my code, find the commit that added it, find the name of the killswitch, and disable my code before I even noticed that there was a problem. No code rollback needed.
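
To illustrate the difference between the two lookups, here is a minimal sketch of what a `decode` backed by the same kind of static map might look like. The table contents and the names `NAME_TO_ID` and `ID_TO_NAME` are made up for illustration; the real functions and their data are not shown in this story, and this is not how the other team's code actually worked.

```python
# Hypothetical name-to-ID table. In the story, the real map was generated
# at build time; these entries are invented for the example.
NAME_TO_ID = {"home": 1, "profile": 2, "settings": 3}

# Invert the table once at import time so that decoding is also a plain
# dictionary lookup instead of a database query per request.
ID_TO_NAME = {ident: name for name, ident in NAME_TO_ID.items()}


def encode(name: str) -> int:
    """Return the numeric identifier for an internal name."""
    return NAME_TO_ID[name]


def decode(ident: int) -> str:
    """Return the internal name for a numeric identifier."""
    return ID_TO_NAME[ident]
```

With both directions held in memory, neither call touches shared infrastructure, so the load on any backing database does not grow with the number of concurrent users.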
That was probably 7 years ago now. Thankfully I haven't made any mistakes as large as that one again!
Always use feature flags for major changes, especially if they're risky!
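
For anyone who hasn't used them, here is a rough sketch of a feature flag with a staged rollout and a killswitch, similar in spirit to what's described above. The `FeatureFlag` class, its fields, and the hashing scheme are all assumptions made for the example, not the system that site actually used.

```python
import hashlib


class FeatureFlag:
    """Hypothetical flag with separate employee/public rollout percentages."""

    def __init__(self, name: str, employee_percent: float = 0.0,
                 public_percent: float = 0.0) -> None:
        self.name = name
        self.employee_percent = employee_percent
        self.public_percent = public_percent
        self.killed = False  # the killswitch: overrides every percentage

    def _bucket(self, user_id: str) -> float:
        # Hash flag name + user ID so each user lands in a stable bucket
        # in the range [0, 100) for this particular flag.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        return int(digest[:8], 16) / 0x100000000 * 100

    def is_enabled(self, user_id: str, is_employee: bool = False) -> bool:
        if self.killed:
            return False
        percent = self.employee_percent if is_employee else self.public_percent
        return self._bucket(user_id) < percent

    def kill(self) -> None:
        # Disable the new code path for everyone without a code rollback.
        self.killed = True


def render_home(user_id: str, flag: FeatureFlag) -> str:
    # New code is only reached when the flag allows it for this user.
    if flag.is_enabled(user_id):
        return "new code path"
    return "old code path"
```

Raising `public_percent` step by step gives the staged rollout, and calling `kill()` sends every user back to the old path immediately. Note that a small-percentage rollout would not have caught a capacity problem like the one above, which only shows up at full load, so the killswitch is the part that matters most here.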