this post was submitted on 07 Jul 2024

239 points (92.2% liked)

I hate Clouds - a personal perspective on why I think Clouds suck (loudwhisper.me)

submitted 4 months ago by loudwhisper@infosec.pub to c/technology@lemmy.world

I hope this won't be counted as some form of self-promotion, even though I am sharing a post from my own blog.

As a tech worker who works in a Cloud shop, I wanted to elaborate on the many reasons why I find working with Clouds terrible, from multiple points of view.

I tried to organize my thoughts in a (relatively long) post, in which both technical aspects and political aspects (which are very related) are covered.

I am sure many people will have different perspectives, and this could also be a nice prompt for a discussion.

[–] ctkatz@lemmy.ml 15 points 4 months ago (3 children)

there are too many points of failure for me to ever be comfortable using the cloud as a primary storage option.

i've always maintained this opinion when "the cloud" started being touted as being the future. and yet more corporations (including mine) are reliant on it. i mean sure, i can log in on my home computer and have some access to stuff as though i were physically at the office but that convenience ain't worth the headache if the main storage site crashes.

[–] figjam@midwest.social 5 points 4 months ago (1 children)

there are too many points of failure for me to ever be comfortable using the cloud as a primary storage option.

If everything that you run is local, as in the same physical location, and there is no requirement for external or internet access, then sure. Not everyone has that luxury. Otherwise, there are the same number of points of failure in a non-cloud configuration. You just feel more comfortable with those because you have direct hands-on control.

[–] corsicanguppy@lemmy.ca 1 points 4 months ago (2 children)

You just feel more comfortable with those because you have direct hands on control.

You write "actually following best practice instead of faking it and lying" funny.

[–] figjam@midwest.social 2 points 4 months ago

You write "actually following best practice instead of faking it and lying" funny.

Are you implying that the various cloud vendors lie about the way they configure their environments or admins don't have emotional biases or something else entirely?

[–] schizo@forum.uncomfortable.business 2 points 4 months ago (1 children)

There are places that actually do that?

Can you provide a list, because I'd like to work there.

(I do not have 25 years of sysadmin angst over nobody ever doing shit right until after it's on fire.)

[–] loudwhisper@infosec.pub 1 points 4 months ago

Proton runs fully on their own hardware, they have some positions open!

[–] IphtashuFitz@lemmy.world 4 points 4 months ago (1 children)

Having done everything from building my own servers 30 years ago to managing hundreds of servers in data centers to now managing hundreds of instances and other services in AWS, I’ll gladly stick with AWS. The hardware management alone makes it well worth the overhead.

25 or so years ago I had to troubleshoot a hardware issue in a SCSI-based server with 6 hard drives in it. A drive appeared to be failing so I replaced it and immediately another drive failed, then another, and so on. After almost a full day of troubleshooting we realized the power supply was actually the culprit and could no longer provide sufficient power to the full set of hard drives.

20 years ago while managing 700+ servers in a datacenter we had to manage a recall of about 400 of them thanks to the capacitor plague that caused a handful of our servers to literally burst into flames.

Hardware failures like the above and dozens of others were mitigated in most cases thanks to redundancies in the software we wrote. But dealing with hardware failures and the resulting software recovery was a real PITA.

With AWS I may occasionally have a Linux instance lock up due to a hardware failure but it’s usually fairly easy to reboot the instance and have it migrate to new hardware. It’s also trivial to migrate a server to run with more (or fewer) CPUs, more (or less) RAM, etc. with only a couple of minutes of downtime.

The more advanced services AWS offers like object storage, queues, databases, etc. are even more resilient. We occasionally get notified that a replica for one of these services had failed or was determined to be on hardware that was failing, and it was automatically replaced with a new replica.

I’d much rather work this way than the way I did 20+ years ago.

[–] loudwhisper@infosec.pub 2 points 4 months ago (1 children)

Why not outsource just the hardware then? Dedicated servers with Kubernetes slapped on them. Hardware failure is mitigated for the most part, and the full effort goes into making the cluster as resilient as possible, for 1/5 of the cost of AWS. If machines burn, it's not your problem anymore (you can have them spread over multiple sites, DCs, rooms, racks).

[–] IphtashuFitz@lemmy.world 3 points 4 months ago

We did that (with Rackspace) for years before migrating to AWS. AWS is still far better from a service & flexibility perspective.

My employer's website has certain times of the year where we see a huge increase in web traffic. When we had a hosted solution it took weeks of preparation to provision additional web servers to handle that load. We had to submit formal requests for additional servers, document how to wire them into our network & required firewall rules, etc. Then we had to wait an arbitrary number of days for them to do the work. And then we had to repeat that whole process when we no longer needed the additional capacity.

With AWS we just define an auto scaling group and additional web servers are spun up automatically when demand is high, and freed up again when no longer needed. Even if we didn’t use auto scaling we could easily automate this sort of thing via terraform or other tools and spin up additional instances in minutes instead of days.
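
For readers who have not used auto scaling, the decision it automates can be sketched in a few lines. This is a simplified proportional rule written for illustration, not AWS's actual algorithm; the function name `desired_instances`, the 50% CPU target, and the group bounds are all assumptions.

```python
import math


def desired_instances(current: int, cpu_percent: float,
                      target_percent: float = 50.0,
                      min_size: int = 2, max_size: int = 20) -> int:
    """Return how many web servers to run so average CPU moves toward the target.

    Hypothetical helper: a proportional rule, not a provider's real policy.
    """
    if current <= 0:
        return min_size
    # Scale the fleet in proportion to how far the metric is from the target.
    wanted = math.ceil(current * cpu_percent / target_percent)
    # Never go below or above the bounds configured for the group.
    return max(min_size, min(max_size, wanted))


if __name__ == "__main__":
    print(desired_instances(4, 90.0))   # busy period: scale out
    print(desired_instances(8, 20.0))   # quiet period: scale in
```

With 4 servers at 90% average CPU the sketch asks for 8, and when load drops to 20% on 8 servers it shrinks back to 4. In the hosted setup described above, each of those steps was a formal request followed by a wait of days.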

[–] Tja@programming.dev 4 points 4 months ago (2 children)

If the storage "crashes" it doesn't matter if it's in the cloud or on-prem.

With the cloud you get two substantial advantages:

the storage is built so it doesn't break so easily. I trust AWS engineers more than Mike, no matter how cool Mike is to hang out with. Additionally, if the storage breaks while Mike is on vacation we're screwed, with the cloud you get a whole team 24/7 on it.
you can prevent data loss with backups or multi-region setups with a few clicks/terraform lines. Try telling the PO that you need to rent datacenter space in Helsinki and Singapore for redundancy...

Of course all this costs big bucks, but technically it's superior, easier and less risky.

[–] corsicanguppy@lemmy.ca 6 points 4 months ago (1 children)

trust AWS engineers more than Mike, no matter how cool Mike is

AWS engineers' first responsibility is to shareholders
Mike's responsibility is to your same boss.

They are not the same.

Bonus: you can see Mike's certs are real.

[–] Tja@programming.dev 0 points 4 months ago

It's not about responsibility (and only the C-suite reports to the shareholders, not Mike), it's about capability, visibility, tooling and availability.

[–] nexusband@lemmy.world 4 points 4 months ago (1 children)

the storage is built so it doesn’t break so easily. I trust AWS engineers more than Mike, no matter how cool Mike is to hang out with. Additionally, if the storage breaks while Mike is on vacation we’re screwed, with the cloud you get a whole team 24/7 on it.

That's easily mitigated just by following established standards. Redundancy is cheaper than anything else in the aftermath and documentation can be done easily with automation.

you can prevent data loss with backups or multi-region setups with a few clicks/terraform lines. Try telling the PO that you need to rent datacenter space in Helsinki and Singapore for redundancy…

You don't, you rent rack space in a location far enough away but close enough to get the data in a few hours.

It's not superior, easier or less risky, it's just a shift in responsibility. And in most cases, it's so expensive that a second or third on-site engineer is paid for.

[–] Tja@programming.dev 0 points 4 months ago (2 children)

And what is simpler and faster, renting rack space on another continent (and buying, shipping, racking and initializing) or editing your terraform file?

[–] nexusband@lemmy.world 3 points 4 months ago (2 children)

Why on another continent? Except maybe VDI, some direct calls to some LLM or some insane scales, there's nothing really that needs those round trip times.

[–] ErrorCode@lemmy.world 1 points 4 months ago

Also data rules / data privacy. Some things need to have the original in Europe; China & Russia also need their data separated from others.

[–] Tja@programming.dev 0 points 4 months ago

Because the customer demands it.

[–] loudwhisper@infosec.pub 2 points 4 months ago (2 children)

Not OP, but they are comparable efforts, especially since it's a relatively infrequent activity. You can rent dedicated boxes with off-the-shelf hardware almost instantly, if you don't want to deal with the hardware procurement, and often you can do that via APIs as well. And of course both options are much, much, much cheaper than the Cloud solution.

For sure speed in general is something the Cloud provides. I would say it's a very bad metric though in this context.

[–] nexusband@lemmy.world 1 points 4 months ago

I would say it’s a very bad metric though in this context.

Full-ACK.

[–] Tja@programming.dev 0 points 4 months ago (1 children)

My last customer (global insurance company) provisions several systems a day. Now moving to hundreds via Jenkins. Frequency is environment dependent.

[–] loudwhisper@infosec.pub 1 points 4 months ago (1 children)

If your compute needs expand that much every day, and possibly shrink on others, then your use case is one that can benefit from Cloud (I covered this in the post).

That said, if provisioning means recycling, then it's obviously not a problem.

This is a very rare requirement. Most companies' load is fairly stable and relatively predictable, which means that with proper capacity planning, increasing compute resources is something that happens rarely too. So rarely that even a lead time for hardware is acceptable.

So if I may ask (and you can tell), what is the purpose of provisioning that many systems each day? Are they continuously expanding?

[–] Tja@programming.dev 1 points 4 months ago* (last edited 4 months ago) (1 children)

Agree to disagree. Banking, telecommunications, insurance, automotive, retail are all industries where I have seen wild load fluctuations. The only applications where I have seen constant load are simulations: weather, oil & gas, scientific. That's where it makes sense to deploy your own hardware. For all else, serverless or elastic provisioning makes economic sense.

Edit to answer the last question: to test variable loads, in the last one. Imagine a hurricane comes around and they have to recalculate a bunch of risk components. But it can be as simple as running CI/CD tests.

[–] loudwhisper@infosec.pub 1 points 4 months ago (1 children)

Systems are always overspecced, obviously. Many companies in those industries are dinosaurs which run on very outdated systems (like banks) after all, and they all existed before Cloud was a thing.

I also can't speak for other industries, but I work in fintech and banks have a very predictable load, to the point that their numbers are almost fixed (and I am talking about UK big banks, not small ones).

I imagine retail and automotive are similar: they have so much data that their average load is almost 100% predictable, which allows for good capacity planning, and their audience is so wide that it's very unlikely to have global spikes.

Industries that have variable load are those who do CPU-intensive (or memory-intensive) tasks and have very variable customers: media (streaming), AI (training), etc.

I also worked in the gaming industry, and while there are huge peaks, the jobs are not so resource intensive as to need anything other than good capacity planning.

I assume however everybody has their own experiences, so I am not aiming to convince you or anything.

[–] Tja@programming.dev 1 points 4 months ago (1 children)

Banking is extremely variable. Instant transactions are periodic, I don't know any bank that runs them globally on one machine to compensate for time zones. Batches happen at a fixed time, are idle most of the day. Sure you can pay MIPS out of the ass, but you're much more cost effective paying more for peak and idling the rest of the day.

My experience is with banks (including UK) that are modernizing, and cloud for most apps brings brutal savings if done right, or moderate savings if getting better HA/RTO.

Of course if you migrate to the cloud because the CTO said so, and you lift and shift your 64 core monstrosity that does 3M operations a day, you're going to end up more expensive. But that should have been a lambda function that would cost 5 bucks a day tops. That however requires effort, which most people avoid and complain later.
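
To make the "5 bucks a day" claim checkable, here is a back-of-the-envelope sketch of pay-per-invocation pricing. The rates, the 100 ms duration and the 512 MB memory size are illustrative assumptions, not figures from this thread and not a quote of any provider's current price list, and the helper `daily_function_cost` is hypothetical.

```python
def daily_function_cost(invocations: int, avg_duration_s: float, memory_gb: float,
                        price_per_million_requests: float = 0.20,
                        price_per_gb_second: float = 0.0000166667) -> float:
    """Estimate one day's bill for a function billed per request and per GB-second.

    Both default rates are assumed placeholders; replace them with real pricing.
    """
    # Flat charge per request.
    request_cost = invocations / 1_000_000 * price_per_million_requests
    # Compute charge: seconds of execution multiplied by configured memory.
    compute_cost = invocations * avg_duration_s * memory_gb * price_per_gb_second
    return request_cost + compute_cost


if __name__ == "__main__":
    # 3M operations a day, assumed 100 ms each at 512 MB.
    print(round(daily_function_cost(3_000_000, 0.1, 0.5), 2))
```

Under these assumptions 3M operations a day come to roughly 3.10 per day. The result scales linearly with duration and memory, so a slower or larger function changes the picture quickly, and the sketch leaves out everything around the function (networking, storage, and so on).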

[–] loudwhisper@infosec.pub 1 points 4 months ago (1 children)

Instant transactions are periodic, I don’t know any bank that runs them globally on one machine to compensate for time zones.

Ofc they don't run them on one machine. I know that UK banks have DCs only in the UK. Also, the daily pattern is almost identical every day. You spec to handle the peaks, and you are good. Even if your systems are at 20% half the day every day, you are still saving tons of money.
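
The disagreement here boils down to one ratio, which a small sketch can make explicit. Assume a hypothetical daily pattern (20 units of load for half the day, 100 for the other half) and made-up unit prices: owning capacity for the peak costs peak times hours, while elastic capacity costs the sum of hourly demand at a higher unit price. None of these numbers come from a real bank, and both function names are invented for the example.

```python
def fixed_vs_elastic(hourly_demand, owned_unit_hour_cost, elastic_unit_hour_cost):
    """Return (cost of owning peak capacity all day, cost of paying per used unit-hour)."""
    peak = max(hourly_demand)
    # Owned capacity is sized for the peak and paid for every hour, used or not.
    fixed = peak * len(hourly_demand) * owned_unit_hour_cost
    # Elastic capacity is paid only for what each hour actually needs.
    elastic = sum(hourly_demand) * elastic_unit_hour_cost
    return fixed, elastic


def break_even_premium(hourly_demand):
    """Price ratio (elastic / owned) below which elastic capacity is cheaper."""
    return max(hourly_demand) * len(hourly_demand) / sum(hourly_demand)


if __name__ == "__main__":
    # Hypothetical day: 20% load for 12 hours, full load for the other 12.
    demand = [20] * 12 + [100] * 12
    print(fixed_vs_elastic(demand, 1, 3))
    print(break_even_premium(demand))
```

With this pattern elastic capacity only wins if its unit price is less than about 1.67 times the owned unit price; at 3 times it costs 4320 against 2400. Spikier load raises that threshold, flatter load lowers it.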

Batches happen at a fixed time, are idle most of the day.

That's between banks; from customer to bank they are not batched. Also now most circuits are going toward instant payments, so the payments are settled more frequently between banks.

My experience is with banks (including UK) that are modernizing, and cloud for most apps brings brutal savings if done right, or moderate savings if getting better HA/RTO.

I want to see this happening. I work for one and I see how our company is literally bleeding from cloud costs.

But that should have been a lambda function that would cost 5 bucks a day tops

One of the most expensive products, for high loads at least. Plus you need to sign things with HSMs etc., and you want a secure environment, perhaps. So I would say...it depends.

Obviously I agree with you that you need to design rationally and not just make a dumb translation of the architecture. But you are paying for someone else to do the work plus the service. Cloud is going to help you delegate some responsibilities, but it can't be cheaper, especially in the long run, since you are not capitalizing anything.

[–] Tja@programming.dev 1 points 4 months ago (1 children)

Not only can it be cheaper, it is cheaper in most cases... when designed correctly. And if you compare TCO, not hardware vs IaaS.

It can also be much more expensive of course, but that's almost always a skill issue.

[–] loudwhisper@infosec.pub 2 points 4 months ago* (last edited 4 months ago)

In most cases! Sorry, I simply don't believe it. Once you operate for 5, 10, 20 years not having capitalized anything is expensive as hell, even without the skill issue (which is not a great argument, as it is the case for almost anything).

It's almost always the case with rent vs invest.
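
A rough way to frame rent vs invest is to accumulate both costs year by year and see where the lines cross. The sketch below does that; every figure in it (yearly cloud bill, hardware purchase, running costs, a 5-year refresh cycle) is hypothetical and chosen only to show the mechanics, not taken from the articles cited in the post.

```python
def cumulative_costs(years, cloud_per_year, hardware_capex,
                     owned_opex_per_year, refresh_every_years=5):
    """Return a list of (year, cumulative cloud cost, cumulative owned cost)."""
    rows = []
    cloud_total = 0.0
    owned_total = 0.0
    for year in range(1, years + 1):
        cloud_total += cloud_per_year
        # Hardware is bought up front in year 1 and again at every refresh.
        if (year - 1) % refresh_every_years == 0:
            owned_total += hardware_capex
        owned_total += owned_opex_per_year
        rows.append((year, cloud_total, owned_total))
    return rows


def break_even_year(rows):
    """First year in which owning has cost no more than renting, or None."""
    for year, cloud_total, owned_total in rows:
        if owned_total <= cloud_total:
            return year
    return None


if __name__ == "__main__":
    # All inputs are made up for illustration.
    rows = cumulative_costs(10, cloud_per_year=120_000,
                            hardware_capex=200_000, owned_opex_per_year=40_000)
    print(break_even_year(rows))
    print(rows[-1])
```

With these made-up inputs owning becomes cheaper in year 3, and after 10 years the totals are 1.2M against 0.8M. Different inputs can move or remove the crossover, which is why real numbers matter.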

Do you have some numbers?

I cite a couple of articles in the post, and here is a nice list of companies and orgs that run outside the Cloud (it's a bit old!) or decided to move away. Many big companies with their own DC, which is not surprising, but also smaller (Wikipedia!).

37signals also showed a huge amount of savings (it's one of the two links in the post) from moving away from the cloud. Do you have any similar data that shows the opposite (like "we saved X after going cloud")? I am genuinely curious.

Edit: here is another one https://tech.ahrefs.com/how-ahrefs-saved-us-400m-in-3-years-by-not-going-to-the-cloud-8939dd930af8 Looking solely at the compute resources, there was an order of magnitude of difference between cloud costs and hosting costs (x11). Basically a value comparable (in reality double) to the whole revenue of the company.