BirAdam 11 hours ago

Just to be honest for a bit here... we also should be asking what kind of scale?

Quite a while ago, before containers were a thing at all, I did systems for some very large porn companies. They were doing streaming video at scale before most, and about the only other outfit working on video at that scale was YouTube.

The general setup for the largest players in that space was haproxy in front of nginx in front of several PHP servers in front of a MySQL database that had one primary r/w with one read only replica. Storage (at that time) was usually done with glusterfs. This was scalable enough at the time for hundreds of thousands of concurrent users, though the video quality was quite a bit lower than what people expect today.

Today at AWS, it is easily possible for people to spend a multiple of the cost of that hardware setup every month for far less compute power and storage.

  • sgarland 11 hours ago

    THANK YOU. People look at me like I’m insane when I tell them that their overly-complicated pipeline could be easily handled by a couple of beefy servers. Or at best, they’ll argue that “this way, they don’t have to manage infrastructure.” Except you do - you absolutely do. It’s just been partially abstracted away, and some parts like OS maintenance are handled (not that that was ever the difficult part of managing servers), but you absolutely need to configure and monitor the specific XaaS you’re renting.

    • resonious an hour ago

      I'll play devil's advocate a little bit here. But to be clear, I hate AWS and all of their crazy concepts and exorbitant pricing, so ultimately I think I'm on your side.

      OS maintenance honestly is a bit hard for me. I need to know what to install for monitoring, I need to maintain scripts or Ansible playbooks. I need to update these and make sure they don't break my setup.

      And the big kicker is compliance. I always work under SOC2, ISO27001, PCI-DSS, HIPAA, you name it. These require even more things like intrusion detection, antivirus, very detailed logging, backups, backup testing, web application firewall. When you just use AWS Lambda with DynamoDB, the compliance burden goes down a lot.

      • sgarland 37 minutes ago

        Yes, you need to write Ansible initially. But honestly, it’s not that much for your average application server. Turn on unattended-upgrades with anything critical to your application blacklisted, and you won’t have to touch it other than to bump version pins whenever you make a new golden image.

        Re: compliance, other than SOC2 being a giant theater of bullshit, agreed that it adds additional work. My point is that the claim of “not having to manage infrastructure” is highly misleading. You get to skip some stuff, yes, but you are paying through the nose in order to avoid writing some additional config files.

    • vidarh 5 hours ago

      I do consulting in this space, and I'm torn: I make much more money managing infrastructure for clients who insist on AWS. But it's much more enjoyable to work with people who know how to keep it simple.

      • yoyohello13 5 hours ago

        I worked on a project for my company (a low volume basic web app) and I suggested we could just start the whole thing on one server. They brought on some Azure consultants and the project ballooned out to months of work and all kinds of services. I’m convinced most of the consultants were just piling on services so they could make more money.

        • xmprt 4 hours ago

          If you hire hammer experts then you're going to end up using a lot of hammers in your construction. The Azure experts aren't pitching Azure because they're trying to sell more Azure products. They do it because that's all they know and most likely because you don't know it so you'll be likely to come back to them for support when things inevitably need to evolve.

          • bunderbunder 2 hours ago

            Also, the more you use your cloud vendor's various services in your code, the more subject you are to vendor lock-in.

            I won't name any names, but I'm pretty sure this is a big part of the reason why a specific cloud vendor pushed so very hard for us to push a bunch of data into their highly advanced NoSQL big data solution, when the data in question was perfectly happy continuing indefinitely to exist as a few tens of megabytes of CSV files that were growing at a rate of a couple kilobytes per day.

        • vidarh 5 hours ago

          It's probably true. The biggest challenge with doing the right thing in this space is that the sales job is hard, time-consuming and expensive, and the sales effort is much the same whether the project is small or huge, so it's a lot easier to make it profitable if you let projects balloon like that.

          I've been offering to help people cut costs for a while, and it's a shockingly hard sell even with offers of guarantees, so we're deemphasizing it to focus more on selling more complex DevOps assistance and AI advice instead... Got to eat (well, I do much better than that, but anyway), but I refuse to over engineer things just to make more money.

          • yoyohello13 5 hours ago

            I don't necessarily even blame the contractors. When the bosses look askance at simple solutions what can you do? It's weirdly harder to sell people on something simple than on something complex. They assume the simple solution must be missing something important.

        • treis 5 hours ago

          I joke with my boss that all our shit ends up running on a single server in some Amazon data center. It's probably not true but if you add up everything we do it's pretty close to one big server.

          • selcuka an hour ago

            It's probably not true, but the end result would be the same even if it was true.

      • baxtr 2 hours ago

        What I’ve always found concerning about these managed setups is that the “platform” teams could never explain, in simple terms, how the application was actually deployed.

        It was so complex I gave up after a while. That’s never a good sign.

    • gaoshan 9 hours ago

      Anyone that says "they don’t have to manage infrastructure", I would invite to deal with a multi-environment terraform setup and then tell me again what they don't have to manage.

      • darkwater 8 hours ago

        Those are the ones that also usually tell you you can just stitch together a few SaaS products and it's magic.

        • galaxyLogic 6 hours ago

          It's much the same mindset as: "Vibe-coding can do it for you so you don't have to program"

          • darkwater 6 hours ago

            Yep. Low-effort, shallow knowledge, risk-taking guys.

          • ta12653421 6 hours ago

            You are an outdate Boomer!!! I have 37 agents doing that for me!!!!!11^

            LOL

      • simianwords 7 hours ago

        While terraform is not ideal, it is much, much easier to deal with managed services in AWS than with on-premises bare-metal servers.

        Most people are biased because they like dealing with the kinds of issues that come with on-prem.

        They like dealing with the performance regressions, heat maps, kernel issues, etc. Because why not? You are a developer and you need some way to exercise your skills. AWS takes that away and makes you focus on the product. Issues arising from AWS only require talking to support. Most developers get into this industry for the love of solving these problems, not for solving product requirements.

        AWS takes away what devs like and brings in more "actual" work.

        • ndriscoll 7 hours ago

          > AWS takes that away and makes you focus on the product. Issues arising from AWS only requires you talking to support.

          Not my experience at all. e.g. NLBs don't support ICMP which has broken some clients of the application I work on. When we tried to turn on preserve-client-ip so we could get past the ephemeral port limit, it started causing issues with MSS negotiation, breaking some small fraction of clients. This stuff is insanely hard to debug because you can't get onto the loadbalancer to do packet captures (nor can AWS support). Loadbalancing for long-lived connections works poorly.

          Lambda runs into performance issues immediately for a web application server because it's just an entirely broken architecture for that use-case (it's basically the exact opposite of user-mode threads to scale: let's use an entire VM per request!). For some reason they encourage people to do it anyway. Lord help you if you have someone with some political capital in your org that wants to push for that.

          RDS also runs into performance issues the moment you actually have some traffic. A baremetal server is orders of magnitude more capable.

          ipv6 support is still randomly full of gaps (or has only very recently been fixed, except you might have to do things like recreate your production EKS cluster, oops) which leads to random problems that you have to architect around. Taken with NAT gateway being absurdly expensive, you end up having to invert sensible architectures or go through extra proxy layers that just complicate things.

          AWS takes basic skills around how to build/maintain backend systems and makes half of your knowledge useless/impossible to apply, instead upgrading all of your simple tuning tasks into architectural design problems. The summary of my last few years has basically been working around problems that almost entirely originate from trying to move software into EKS and dealing with random constraints that would take minutes to fix baremetal.

          • icedchai 6 hours ago

            I agree that building your backend on Lambda is terrible for many reasons: slow starts, request / response size restrictions, limitations in "layer" sizes, etc.

            RDS, however, I have found to be rock solid. What have you run into?

            • mystifyingpoi 6 hours ago

              The parent compares RDS to bare metal, which I think isn't a fair comparison at all, especially since we don't know the specs of either.

              I found RDS to be rock solid too, although performance issues are often resolved by developers submitting a PR that bumps the instance size x2, because "why not". On bare metal it's often impossible to upgrade the CPU just like that, so people have to fix performance issues elsewhere, which leads to a better outcome in the end.

              • vidarh 5 hours ago

                RDS works great, but it's far easier to scale a bare metal setup to an extent that makes RDS look like an expensive toy, because you have far more hardware options.

                RDS is a good option if you want convenience and simplicity, though.

            • crdrost 6 hours ago

              I don't know too much about the performance side of RDS, but the backup model is absolutely a headache. It's at the point where I'd rather pg_dump into gz and upload to s3.
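
              Roughly what I mean, as a sketch in Python rather than a shell one-liner; the bucket, connection string, and key layout are made up, and it assumes pg_dump on PATH plus boto3 with working credentials:

                # Hypothetical DIY backup: pg_dump -> gzip -> S3.
                import gzip
                import io
                import subprocess
                from datetime import datetime, timezone

                import boto3

                BUCKET = "my-db-backups"              # illustrative bucket name
                DSN = "postgresql://app@db-host/app"  # illustrative connection string

                def backup_to_s3() -> str:
                    key = datetime.now(timezone.utc).strftime("app/%Y-%m-%d-%H%M%S.sql.gz")
                    # Run pg_dump and capture the plain-text dump.
                    dump = subprocess.run(["pg_dump", "--no-owner", DSN],
                                          check=True, capture_output=True)
                    # Compress in memory; for a big database you'd stream instead.
                    body = io.BytesIO(gzip.compress(dump.stdout))
                    boto3.client("s3").upload_fileobj(body, BUCKET, key)
                    return key

                if __name__ == "__main__":
                    print("uploaded", backup_to_s3())

              You'd still want restore tests and retention rules, but at least every moving part is visible.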

        • theamk 6 hours ago

          > performance regressions, heat maps, kernel issues etc.

          > AWS takes that away and makes you focus on the product.

          ha ha ha no. Have been dealing with kernel issues on my AWS machines for a long time. They lock up under certain kinds of high load. AWS support is useless. Experimenting with kernel versions leads to performance regressions.

          AWS is great if your IT/purchasing department is inefficient. Getting a new AWS machine is instant, compared to getting purchasing to approve new machine and IT allocating it to you. But all the low-level stuff is still there.

          • geoduck14 4 hours ago

            >AWS is great if your IT/purchasing department is inefficient

            Fwiw, I think a lot of companies have this problem.

            • san1t1 4 hours ago

              I think the conversation has turned from "Can we spend more?" to "Can you please try and spend less?"

        • sgarland 36 minutes ago

          I have literally never met a dev who wanted to deal with any kind of infrastructure.

        • bcrosby95 6 hours ago

          We have both AWS and colocated servers. The server workload mostly scales with the machine count not the user count. And you can get a lot done with very few servers these days.

        • vidarh 5 hours ago

          That was barely true a decade ago. It's total nonsense today, when it's trivial to ensure all your servers have IPMI or similar and it can all be automated apart from the couple-of-times-a-year component swap-outs.

          But it's also the wrong comparison: there's rarely a reason to go on premises and take responsibility for the hardware yourself - renting bare metal servers is usually the sweet spot, and means someone else does the annoying bits for you while you still get the simplicity and lower cost.

          As someone contracted to manage systems for people, I consistently make more money from people who overengineer their cloud setups than from people with bare metal servers. It tends to require far more maintenance to keep an AWS setup sane, secure, and not bankrupting you.

        • vanviegen 7 hours ago

          Uh, no I don't in fact like dealing with any of that. And I've never had to in 20 years of managing some mid scale service on bare metal. (Though of course there have been issues of various sorts.)

          I think you may have it backwards: people like tinkering with complex cloud stuff, even if they don't need it.

      • toast0 3 hours ago

        I've certainly done some things where outsourcing hosting meant I didn't have to manage infrastructure. For services running on vm instances in gcp vs services running on bare metal managed hosts, there's not a whole lot of difference in terms of management IMHO.

        But any infrastructure that the product I support uses is infrastructure I need to manage; having it outside my control just makes it that much harder to manage. If it's outside my control, the people who control it had better do a much better job than I would at managing it, otherwise it's going to be a much bigger pain.

    • adamcharnock 2 hours ago

      I totally agree. So much complexity for generally no good reason [0]. I saw so much of this that I ended up starting a company doing the exact opposite. I figured I could do it better and cheaper, so that's now what we do!

      If anyone wants to bail out of AWS et al and onto a few beefy servers, save some money, and gain a DevOps team in the process, then drop us an email (adam at domain in bio).

      [0] My pet theory about the real reason: the hyper-scalers hire all the engineers who have the skills to deploy-to-a-few-beefy-servers, and then charge a 10x multiplier for compute. Companies can then choose between impossible hiring, or paying more. Paying more is easier to stomach, and plenty of rationalisations are available.

      • sgarland 32 minutes ago

        > My pet theory about the real reason: the hyper-scalers hire all the engineers who have the skills to deploy-to-a-few-beefy-servers, and then charge a 10x multiplier for compute.

        This is also my pet theory, and it’s maddening. They’ve successfully convinced an entire generation of devs that physical servers are super scary and they shouldn’t ever have to look at them.

    • BobbyTables2 9 hours ago

      Have always felt the same.

      I’ve seen an entire company proudly proclaim a modern multicore Xeon with 32GB RAM can do basic monitoring tasks that should have been possible with little more than an Arduino.

      Except the 32GB Xeon was far too slow for their implementation...

      • icedchai 3 hours ago

        Let me guess: database tables with no indexes, full scans everywhere?

      • wltr 5 hours ago

        I swear, before I finished reading your comment, this thought jumped into my mind: ‘oh my, they do host everything with a computer similar to my [pretty old by the way, but still beefy] for-work computer! Impressive!’

        Which, I still believe, is perfectly possible to do.

        Then, I was ‘what?!’

      • MattGaiser 6 hours ago

        How did they implement it? That's horrendous.

    • macNchz 9 hours ago

      Working on various teams operating on infrastructure that ranged from a rack in the back of the office, a few beefy servers in a colo, a fleet of Chef-managed VMs, GKE, ECS, and various PaaSes, what I've liked the most about the cloud and containerized workflows is that they wind up being a forcing function for reproducibility, at least to a degree.

      While it's absolutely 100% possible to have a "big beefy server architecture" that's reasonably portable, reproducible, and documented, it takes discipline and policy to avoid the "there's a small issue preventing {something important}, I can fix it over SSH with this one-liner and totally document it/add it to the config management tooling later once we've finished with {something else important}" pattern, and once people have been doing that for a while it's a total nightmare to unwind down the line.

      Sometimes I want to smash my face into my monitor the 37th time I push an update to some CI code and wait 5 minutes for it to error out, wishing I could just make that band-aid fix, but at the end of the day I can't forget to write down what I did, since it's in my Dockerfile or deploy.yaml or entrypoint.sh or Terraform or whatever.

      • darkwater 8 hours ago

        You have to remove admin rights from your admins then, because scrappy enough DevOps/platform engineers/whatever will totally hand-edit your AWS infra or Kubernetes deployments. I suffered that first hand. And it's even worse than in the old days, because at least back in the day it was expected.

        • sgarland 26 minutes ago

          Nah, just run Puppet or similar. You’re welcome to run your command to validate what you already tested in stage, but if you don’t also push a PR that changes the IaC, it’s getting wiped out in a few minutes.

          I hate not having root access. I don’t want to have to request permission from someone who has no idea how to do what I want to do. Log everything, make everything auditable, and hold everyone accountable - if I fuck up prod, my name will be in logs, and there will be a retro, which I will lead - but don’t make me jump through hoops to do what I want, because odds are I’ll instead find a way around them, because you didn’t know what you were doing when you set up your security system.

        • jkaplowitz 7 hours ago

          Or at least you have to automatically destroy and recreate all nodes / VMs / similar every N days, so that nobody can pretend that any truly unavoidable hand-edits during emergency situations will persist. Possibly also control access to the ability to do hand edits behind a break-glass feature that also notifies executives or schedules a postmortem meeting about why it was necessary to do that.

          • vidarh 5 hours ago

            I know of at least one organisation that'd automatically wipe every instance on (ssh-)user logout, so you could log in to debug, but nothing you did would persist at all. I quite like that idea, though sometimes being able to e.g. delay the wipe for up to X hours might be slightly easier to deal with for genuinely critical emergency fixes.

            But, yes, gating it behind notifications would also be great.

            • sgarland 25 minutes ago

              Was this Mozilla?

            • 9dev 5 hours ago

              That sounds like the kind of thing that’s amazing, until it isn’t and you know exactly why your day just got a lot worse.

          • rustystump 6 hours ago

            Oh no, it ran out of disk space because of a bug! I will run a command on that instance to free space rather than fix the bug. Oh no, the error now happens half of the time; better debug for hours only to find out someone fixed only a single instance…

            I will never understand the argument for cloud, other than bragging rights about burning money and then saving money that never shoulda been getting burned to begin with.

        • Aeolun 4 hours ago

          But then your next deployment goes, and it all rolls back, right?

          And then it's their fault, right?

          I might have mild trauma from people complaining their artisanal changes to our environment weren’t preserved.

        • kune 3 hours ago

          In my org nobody has admin rights except in emergencies, but we are ending up with a directory full of GitHub workflows and nobody knows which of them are currently supposed to work.

          Nothing beats people knowing what they are doing and cleaning up behind them.

      • tracker1 5 hours ago

        I'm still a pretty big fan of Docker (compose) behind Caddy as a reverse-proxy... I think that containers do offer a lot in terms of application support... even if it's a slightly bigger hoop to get started with in some ways.

        • vidarh 5 hours ago

          I'm working on an app server that's auto-deploying itself behind Caddy with DNS/SSL auto-config. Caddy is amazing, and there really should be no reason for complex setups for most people these days... I've worked on some huge systems, but most systems can run in trivially simple setups given modern hardware.

    • chamomeal 6 hours ago

      Docker compose on a couple nice VPS’s can do a LOT

    • citizenpaul 2 hours ago

      The problem with onsite or colo is always the same: you have to keep fighting the same battle again and again and again. In 5 years, when the servers need replacing, you have to make the case all over again, even though you have already proven it saves orders of magnitude in costs.

      I've never once been rewarded for saving 100k+ a month even though I have done exactly that. I have been punished by having to constantly re justify the decision though. I just don't care anymore. I let the "BIG BRAIN MBA's" go ahead and set money on fire in the cloud. It's easier for me. Now I get to hire a team of "cloud architects" to do the infra. At eye bleeding cost increases for a system that will never ever see more than a few thousand users.

    • frde_me 4 hours ago

      On the other hand, I know a lot of people who spend more time / salary messing around with their infra than the couple hundred bucks they've saved by not pressing a couple of buttons on Vercel / Cloudflare.

      There's a time and place for just deploying quickly to a cloud provider versus trying to manage your infra. It's a nuanced tradeoff that rarely has a clear winner.

    • yibg 5 hours ago

      I think it depends on what you are optimizing for. If you are a VC funded startup trying to get to product market fit, spending a bit more on say AWS probably makes sense so you can be “agile”. The opportunity cost there might outweigh infrastructure cost. If you are bootstrapped and cost matters a lot, then different story.

    • skydhash 10 hours ago

      I look at what I can do with an old Mac mini (2011) and it's quite good. I think the only issue with hardware is technical maintenance, but at the scale of a small company, that would probably just mean a support contract with Dell and co.

      • amluto 7 hours ago

        Small companies should never forget to ask Dell, etc for discounts. The list prices at many of these companies are aspirational and, even at very small scale, huge discounts are available.

    • immibis 5 hours ago

      You can get a server now with, like, five hundred cores and fifty terabytes of RAM. It's expensive, but you can get one.

      A used server with sixty cores and one terabyte of RAM is a lot cheaper. Couple thousand bucks. I mean, that's still a lot of bucks, but a terabyte for only four digits?

    • huflungdung 10 hours ago

      What I say is that we massively underestimate just how fast computers are these days

      • hombre_fatal 6 hours ago

        On the other hand, there is a real crossroad that pops up that HNers tend to dismiss.

        A common story is that since day one you just have lightweight app servers handling http requests doing 99% I/O. And your app servers can be deployed on a cheap box anywhere since they're just doing I/O. Maybe they're on Google Cloud Run or a small cluster of $5 VPS. You've built them so that they have zero deps on the machine they're running on.

        But then one day you need to do some sort of computations.

        One incremental option is to create a worker that can sit on a machine that can crunch the tasks and a pipeline to feed it. This can be seen as operationally complex compared to one machine, but it's also simple in other ways.

        Another option is to do everything on one beefy server where your app servers just shell out the work on the same machine. This can be operationally simple in some ways, but not necessarily in all ways.

      • vidarh 5 hours ago

        Most younger devs just have no concept of how limited the hardware we used to run services on was...

        I used to run a webmail system with 2m accounts on hardware with less total capacity (ram, disk, CPU throughput) than my laptop...

        What's more: It was a CGI (so new process for every request), and the storage backend spawned separate processes per user.

      • f1shy 7 hours ago

        In 2010 I was managing 100 servers, with many Oracle and Postgres DBs, PHP, Apache, all on Solaris and Sun HW. I was constantly amazed at how unable people were to make even roughly correct estimates. I had a discussion with my boss: he wanted to buy 8 servers, I argued one was more than enough. The system, after growing massively, was still managing the load in 2020 with just 3 servers. So I would argue this was true not only today, but already 15 years ago.

      • ahartmetz 10 hours ago

        Indeed - they are incredibly fast, it's just buried under layers upon layers of stuff

      • robotresearcher an hour ago

        No worries, another fifteen layers of software abstraction will soak that up pronto.

    • hrimfaxi 10 hours ago

      Depending on your regulatory environment, it can be cost-effective to not have to maintain your own data center with 24/7 security response, environmental monitoring, fire suppression systems, etc. (of course, the majority of businesses are probably not interested in things like SOC 2)

      • wongarsu 9 hours ago

        This argument comes up a lot, but it feels a bit silly to me. If you want a beefy server you start out by renting one. $150/month will give you a server with a 24-core Xeon and 256GB of RAM, in a data center with everything you mentioned plus a 24/7 hands-on technician you can book. Preferably rent two servers, because reliability. Once you outgrow renting servers you start renting rack space in a certified data center with all the same amenities. Once you outgrow that you start renting entire racks, then rows of racks or small rooms inside the DC. Then you start renting portions of the DC. Once you have outgrown that you have to seriously worry about maintaining your own data center. But at that point you have so much scale that this will be the least of your worries.

        • hrimfaxi 8 hours ago

          > This argument comes up a lot, but it feels a bit silly to me. If you want a beefy server you start out with renting one. $150/month will give you a server with 24 core Xeon and 256GB of RAM, in a data center with everything you mentined plus a 24/7 hands-on technician you can book.

          What's the bandwidth and where can I rent one of these??

          • wongarsu 7 hours ago

            Hetzner [1]. Bandwidth is 1 GBit/s. You can also get 10 GBit/s, that's hidden away a bit instead of being mentioned on the order page [2]

            1: https://www.hetzner.com/dedicated-rootserver/matrix-ex

            2: https://docs.hetzner.com/robot/dedicated-server/network/10g-...

            • coder543 7 hours ago

              I have wished for years that Hetzner would offer their bare metal servers in the U.S., and not just Hetzner Cloud.

              • Aeolun 4 hours ago

                Here is US Hetzner: https://ioflood.com/

                Their prices have come down a lot. I used them when the servers still cost $200 a piece, but their support at the time was fantastic.

                • christophilus 2 hours ago

                  Wow. No joke. I haven’t heard of them, but I like their blurb, and those are Hetzner like prices. Now, I just need to find a use for that much beef.

            • hrimfaxi 7 hours ago

              How is that any different from cloud?

              This whole thread was a response to

              > Today at AWS, it is easily possible for people to spend a multiple of the cost of that hardware setup every month for far less compute power and storage.

              suggesting we use a few beefy servers, but if we are renting them from the cloud we're back where we started.

              • wongarsu 7 hours ago

                The difference from the big clouds is that an equivalent instance at AWS costs 10x as much. If you go with few beefy servers AWS offers very little value for the money they charge, they only make sense for "cloud native" architectures. But if you rent raw servers from traditional hosters you can get prices much closer to the amortized costs of running them yourself, with the added convenience of having them in a certified data center with 24/7 security, backup power, etc.

                If you want more control than that, colo is also pretty cheap [1]. But I'd consider that a step above what 95% of people need

                https://www.hetzner.com/colocation

                • hrimfaxi 6 hours ago

                  For me the comparison was not against AWS specifically but against the cloud in general; AWS was just an example. Which was the whole reason why I brought up compliance and stuff: it is much cheaper to have someone else handle that for you (even if it is Hetzner!). That was my whole point.

            • deaux 6 hours ago

              Not ideal when a large part of your userbase is in APAC.

        • throwaway894345 7 hours ago

          I'm a lot less concerned about CPU and ram and a lot more concerned about replicated object storage (across data centers). High end GPUs are also pretty important.

      • orev 8 hours ago

        The only companies directly dealing with that type of stuff are the ones already at such a scale where they need to actually build their own data centers. Everyone else is just renting space somewhere that already takes care of those things and you just need to review their ISO/SOC reports.

        This kind of argument comes from the cloud provider marketing playbook, not reality.

      • BirAdam 9 hours ago

        This is handled by colo.

  • walkabout 7 hours ago

    Around 2013 I was handling bursts up to thousands of requests per second for multi-megabyte file downloads with dynamic authentication using just PHP5, Apache2, and Haproxy, with single-node MySQL (or may have been MariaDB, by then?) as the database, and Redis for caching. On a single mid-range rented server. And Haproxy was only there for operational convenience, you could cut it out and it'd work just as well. No CDN. Rock solid.

    My joke but not-actually-a-joke is that the Cloud is where you send a workload that's fast on your laptop, if you need it to be way slower. The performance of these fussy, over-complicated, hard-to-administer[1] systems is truly awful for the price.

    [1] They're hypothetically simpler and easier to administer, but I've never seen this in the wild. If anything, we always seem to end up with more hours dedicated to care & feeding of this crap, and more glitchiness and failures, than we would with a handful of rented servers with maybe a CDN in front.

    • kazinator 7 hours ago

      > My joke but not-actually-a-joke is that the Cloud is where you send a workload that's fast on your laptop, if you need it to be way slower.

      Not to forget: where you send a workload that is free on your laptop, in order to be charged for it.

  • donatj 10 hours ago

    Exactly this! The educational product I work on is used by hundreds of thousands of students a day, and the secret to our success is how simple our architecture is. PHP monoliths + Cache (Redis/Memcached) scale super wide basically for free. We don't really think about scalability, it just happens.
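
    The pattern is basically just cache-aside. A minimal sketch (in Python rather than PHP; key names, TTL, and the db helper are illustrative, assuming the redis client package):

      # Cache-aside: check the cache, fall back to the DB, repopulate with a TTL.
      import json
      import redis  # assumes the `redis` package is installed

      cache = redis.Redis(host="localhost", port=6379)

      def get_assignment(assignment_id: int, db) -> dict:
          key = f"assignment:{assignment_id}"
          cached = cache.get(key)
          if cached is not None:
              return json.loads(cached)                 # cache hit: no DB work at all
          row = db.fetch_assignment(assignment_id)      # hypothetical DB helper
          cache.setex(key, 300, json.dumps(row))        # 5-minute TTL, tune per use case
          return row

    Most read traffic never touches the database, so "scaling" mostly means adding more stateless app servers behind the load balancer.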

    I have a friend whose startup had a super complicated architecture that was falling apart at 20 requests per second. I used to be his boss a lifetime ago and he brought me in for a meeting with his team to talk about it. I was just there flabbergasted at "Why is any of this so complicated?!" It was hundreds of microservices, many of them black boxes they'd paid for but had no access to the source. Your app is essentially an async chat app, a fancy forum. It could have been a simple CRUD app.

    I basically told my friend I couldn't help, if I can't get to the source of the problematic nodes. They'll need to talk to the vendor. I explained that I'd probably rewrite it from the ground up. They ran out of runway and shut down. He's an AI influencer now...

  • dig1 10 hours ago

    > The general setup for the largest players in that space was haproxy in front of nginx in front of several PHP servers in front of a MySQL database that had one primary r/w with one read only replica.

    You'd be surprised that the most stable setups today are run this way. The problem is that this way it's hard to attract investors; they'll assume you are running on old or outdated tech. Everything should be serverless, agentic and, at least on paper, hyperscalable, because that sells further.

    > Today at AWS, it is easily possible for people to spend a multiple of the cost of that hardware setup every month for far less compute power and storage.

    That is actually the goal of the hyperscalers: they are charging you a premium for way inferior results. Also, the article stated a very cold truth: "every engineer wants a fashionable CV that will help her get the next job", and you definitely won't get a job if you say: "I moved everything off AWS and put it behind haproxy on one bare-metal box for a $100/mo infra bill".

    • bootsmann 8 hours ago

      > The problem is that this way it's hard to attract investors; they'll assume you are running on old or outdated tech. Everything should be serverless, agentic and, at least on paper, hyperscalable, because that sells further.

      Investors don't give a shit about your stack

      • cjblomqvist 7 hours ago

        Many do. For most it's not the biggest concern (that would be quite weird). AFAIK it's mostly about reducing risk (avoiding complete garbage/duct-taped setups).

        Source: I know a person who does tech DD for investors, and I've also been asked this question in DD processes.

  • tracker1 5 hours ago

    Exactly... it was a lot different when a typical server had 2-4 CPUs and cost more than a luxury car... today you get hundreds of simultaneous threads and upwards of a terabyte of RAM for even less, not counting inflation.

    You can go a very, very, very long way on 2-3 modern servers with a fast internet connection and a good backup strategy.

    Even with a traditional RDBMS like MS-SQL/PostgreSQL, you aren't bottlenecked by 1-2GHz CPUs and spinning-rust hard drives anymore. You can easily get to millions of users for a typical site/app with a couple of servers, the second one just there as a read replica/for redundancy. As much as I happen to like some of the ergonomics of Mongo from a developer standpoint, or appreciate the scale of Cassandra/ScyllaDB or even Cockroach... it's just not always necessary early on, or ever.

    I've historically been more than happy to reach for RabbitMQ or Redis when you need queueing or caching... but that's still so much simpler than where some microservice architectures have gone. And while I appreciate what Apollo and GraphQL bring to the table, it's over the top for the vast majority of applications.

  • Arch-TK 7 hours ago

    I've seen an application, a 95% CRUD application, which had about 100-1000 users across the UK, users who would only be using it from 9am-5:30pm and, at that, barely interacting with it. This was backed by literally the most sophisticated and complex architecture I have ever seen in my entire life.

    There were 3 instances of Cognito, plus RDS, DynamoDB and S3. The entire architecture diagram would only be legible on an A2 (heck, maybe even A1) page. And that was the high-level diagram. The central A4 part of it was a bunch of micro-services handling different portions of this CRUD application.

    This company could afford a system architect as well as a team of developers to work on this full time.

    I was genuinely baffled, but this company was in an extremely lucrative industry, so I guess in this case it's fine to just take some of your profits and burn them.

    • peacebeard 5 hours ago

      Akin to buying a high performance sports car and never driving it. Maybe it has social value, maybe you just feel good having it.

  • ahoka 10 hours ago

    Are those over-engineered systems even actually scalable? I know teams who designed a CQRS architecture using message queues and a distributed NoSQL database and failed to sustain 10 req/s for reads in something that is basically a CRUD application. Heck, once someone literally said "But we use Kafka, why aren't we fast?!".

    • arealaccount 10 hours ago

      Exactly this. Every time I see Kafka or similar, it's a web of 10M microprocesses that take more time in invocation alone than if you just ran the program in one go.

      • _kb 10 hours ago

        How very kafkaesque.

    • Aeolun 4 hours ago

      Eh, they scale between $1000 and $10000 per month fairly easily. I’m not sure about the requests though.

    • sgarland 10 hours ago

      I watched in amusement as the architecture team at $JOB eagerly did a PoC of a distributed RDBMS, only to eventually conclude that the latency was too high. Gee… if only someone had told you that would happen when you mentioned the idea. Oh wait.

  • hinkley 5 hours ago

    I’m as likely to talk about human scale as hardware scale, and one of the big issues with human scale is what the consequences are of having the wrong team size in either direction.

    When you reduce the man hours per customer you can get farther down your backlog. You can carve people off for new prospective business units. You can absorb the effects of a huge sale or bad press better because you aren’t trying to violate Brooks’ Law nor doing giant layoffs that screw your business numbers.

    You have time for people to speculate on big features or to work more on reducing costs further. If you don't tackle this work early you end up with the Red Queen problem: running as fast as you can just to stay still.

  • JackSlateur 6 hours ago

    It is a mistake to confuse scalability with resiliency.

    Yes, we can run Twitter on a single server (https://thume.ca/2023/01/02/one-machine-twitter/). No, we do not want to run Twitter on a single server.

    • patapong 6 hours ago

      I would argue that even resiliency is a metric that should not be overemphasized in the early stages of development. I would rather have a system that suffers occasional outages than one with perfect resiliency but added complexity, with its trade-offs in cost and thus development velocity. I think the risk of not getting to product-market fit quickly enough in the early stages is bigger than the risk of losing customers over short outages - except of course if the selling point is resiliency.

      Of course this should not be overdone, but there is something to be said for single server + backup setups, and rewriting for scale + resiliency once traction has been established.

    • abujazar 3 hours ago

      It's much easier to build a resilient system with a simple architecture. E.g. run the application on a decent VM or even bare metal server and mirror the whole system between a few different data centers.

  • CableNinja 11 hours ago

    I thought I knew about scaled deployments before I started working where I do now. After starting here, I realized I had no idea what an environment of huuuuge scale actually was. I'd been part of multi-site deployments and scaled infra, but it was basically potatoes comparatively. We have a team whose platform we, in IT, call the DoS'er of the company. It's responsible for processing hundreds of thousands of test runs a day, and the data is fed to a plethora of services afterwards. The scale is so large that they are able to take down critical services, or deeply impact them, purely due to throughput, if a developer goes too far (like, say, uploading a million small logs to an S3 bucket every minute).

    We have also been contacted by AWS asking us what the hell we are doing, for a specific set of operations. We do a huge prep for some operations, and the prep feeds massive amounts of data through some AWS services, so much so that they thought we were under attack or had been compromised. Nope, just doin' data ingestion!

  • aeyes 11 hours ago

    The architecture you describe is ok because in the end it is a fairly simple website. Little user interaction, limited amount of content (at most a few million records), few content changes per day. The most complex part is probably to have some kind of search engine but even with 10 million videos an ElasticSearch index is probably no larger than 1GB.

    The only problem is that there is a lot of video data.

    • ben_w 11 hours ago

      This is probably also true for 98% of startups.

      I think most people don't realise that "10 million" records is small, for a computer.

      (That said, I have had to deal with code that included an O(n^2) de-duplication where the test data had n ~= 20,000, causing app startup to take 20 minutes; the other developer insisted there was no possible way to speed this up, later that day I found the problem, asked the CTO if there was a business reason for that de-duplication, removed the de-duplication, and the following morning's stand-up was "you know that 20 minute startup you said couldn't possibly be sped up? Yeah, well, I sped it up and now it takes 200ms")

      • phkahler 10 hours ago

        I thought you were going to say you reduced O(n^2) to O(n*log(n)), but you just deleted the operation. Normally I'd say that's great, but just how much duplicate data is being left around now? Is that OK?

        • ben_w 10 hours ago

          Each element was about, oh I can't remember exactly, perhaps 50 bytes? It wasn't a constant value, there could in theory be a string in there, but those needed to be added manually and when you have 20,000 of them, nobody would.

          Also, it was overwhelmingly likely that none of the elements were duplicates in the first place, and the few exceptions were probably exactly one duplicate.

          • hedora 10 hours ago

            I'm kind of surprised no one just searched for "deduplication algorithm". If it was absolutely necessary to get this 1MB dataset to be smaller (when was this? Did it need to fit in L2 on a Pentium 3 or something?), then it could probably have been deduped + loaded in 300-400ms.

            Most engineers that I've worked with that die on a premature optimization molehill like you describe also make that molehill as complicated as possible. Replacing the inside of the nested loop with a hashtable probe certainly fits the stereotype.
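
            For concreteness, the contrast in Python; records and the key function are made-up stand-ins for whatever the real data was:

              # The O(n^2) shape: every element gets a linear scan over everything kept so far.
              def dedupe_quadratic(records):
                  unique = []
                  for r in records:
                      if r not in unique:          # list membership test -> ~n^2/2 comparisons
                          unique.append(r)
                  return unique

              # The "hashtable probe" version: one pass, O(n), order-preserving.
              def dedupe_linear(records, key=lambda r: r):
                  seen = set()
                  unique = []
                  for r in records:
                      k = key(r)                   # key() must return something hashable
                      if k not in seen:
                          seen.add(k)
                          unique.append(r)
                  return unique

            At n ≈ 20,000 the quadratic version is doing on the order of 2x10^8 element comparisons, which is why either deleting it or replacing it collapses minutes into milliseconds.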

            • ben_w 10 hours ago

              > I'm kind of surprised no one just searched for "deduplication algorithm".

              Fair.

              To set the scene a bit: the other developer at this point was arrogant, not at all up to date with even the developments of his preferred language, did not listen to or take advice from anyone.

              I think a full quarter of my time there was just fire-fighting yet another weird thing he'd done.

              > If it was absolutely necessary to get this 1MB dataset to be smaller

              It was not, which is why my conversation with the CTO to check on if it was still needed was approximately one or two sentences from each of us. It's possible this might have been important on a previous pivot of the thing, at least one platform shift before I got there, but not when I got to it.

    • gf000 11 hours ago

      As opposed to what problem?

      Like, I honestly have trouble listing many business problems/areas that would fail to scale with their expected user count, given reasonable hardware and technical competence.

      Like YouTube and Facebook are absolute outliers. Famously, stackoverflow used to run on a single beefy machine (and the reason they changed their architecture was not due to scaling issues), and "your" startup ain't needing more scale than SO.

      • bccdee 7 hours ago

        Scaling to a lot of reads is relatively easy, but you get into weird architectural territory once you hit a certain volume of writes. Anything involving monitoring or real-time event analysis can get hairy. That's when stuff like kafka becomes really valuable.

    • bobdvb 10 hours ago

      In streaming your website is typically totally divorced from your media serving. Media serving is just a question of cloud storage and pointing at an hls/dash manifest in that object store. Once it starts playing the website itself does almost nothing. Live streaming adds more complexity but it's still not much of a website problem.

      Maintaining the media lifecycle (receiving, transcoding, making it available and removing it) is the big task, but that's not real-time; it's best-effort batch/event processing.

      The biggest challenge with streaming is maintaining the content catalogue, which isn't just a few million records but rich metadata about the lifecycle and content relationships. Then user management and payments also tend to have significant overhead, especially when you're talking about international payment processing.

      • BirAdam 9 hours ago

        This was before HTML5 and before the browser magically handled a lot of this… so there was definitely a bit more to it. Every company also wanted statistics on where people scrub to and all of that. It wasn't super simple, but yeah, it also wasn't crazy complex. The point is, scale is achievable without complex infra.

CaptainOfCoit 12 hours ago

I've seen startups killed because of one or two "influential" programmers deciding they need to start architecting the project for 1000TPS and 10K daily users, as "that's the proper way to build scalable software", while the project itself hasn't even found product-market fit yet and barely has users. Inevitably, the project needs to make a drastic change, which is now so painful to do because it no longer fits the perfect vision the lead(s) had.

Cue programmers blaming the product team for "always changing their mind" as they discover what users actually need, and the product team blaming developers for being hesitant to do changes, and when programmers agree, it takes a long time to undo the perfect architecture they've spent weeks fine-tuning against some imaginary future user-base.

  • hamburglar 6 hours ago

    I was part of a small team that built a $300M company on Ruby and MySQL that made every scaling mistake you can possibly make. This was also the right decision because it forced us to stay lean and focus on what we needed right now, as opposed to getting starry-eyed about what it was going to be like when we had 10 million users. At every order of magnitude, we had sudden emergencies where some new part of the system had become a bottleneck, and we scrambled like crazy to rearchitect things to accommodate. It was hard, and it was fun. And it was frugal. We eventually hit over 10 million users before I left, and I can’t say I regret the painful approach one bit.

    • stavros an hour ago

      I also imagine you were pretty agile by not having tons of complexity to grapple with every time you wanted to add a new feature.

  • smoe 9 hours ago

    In my opinion, if those influential programmers actually architected around some concrete metrics like 1,000 TPS and 10K daily users, they would end up with much simpler systems.

    The problem I see is much more about extremely vague notions of scalability, trends, best practices, clean code, and so on. For example we need Kafka, because Kafka is for the big boys like us. Not because the alternatives couldn’t handle the actual numbers.

    CV-driven development is a much bigger issue than people picking overly ambitious target numbers.

  • stavros 10 hours ago

    > 1000TPS and 10K daily users

    I absolutely agree with your point, but I want to point out, like other commenters here, that the numbers should be much larger. We think that, because 10k daily users is a big deal for a product, they're also a big deal for a small server, but they really aren't.

    It's fantastic that our servers nowadays can easily handle multiple tens of thousands of daily users on $100/mo.

    • hamdingers 6 hours ago

      Users/TPS aren't the right metric in the first place. I have a webhook glue side project that I didn't even realize had ~8k daily users/~300tps until I set up Cloudflare analytics. As a Go program doing trivial work, the load is dwarfed by the cpu/memory usage of all my seedbox-related software (which has 1 user, not even every day).

      • CaptainOfCoit 6 hours ago

        > Users/TPS aren't the right metric in the first place.

        This was my initial point :) Don't focus on trying to achieve some metrics, focus on making sure to build the right thing.

    • thewebguyd 3 hours ago

      > We think that, because 10k daily users is a big deal for a product, they're also a big deal for a small server, but they really aren't.

      Yeah, we seem to forget just how fast computers are nowadays. Obviously it varies with the complexity of the app & what other tech you are using, but for simpler things 10k daily users could be handled by a reasonably powerful desktop sitting under my desk without breaking a sweat.

  • strken 11 hours ago

    I've seen senior engineers get fired and the business suffer a setback because they didn't have any way to scale beyond a single low spec VPS from a budget provider, and their system crashed when a hall full of students tried to sign up together during a demo and each triggered 200ms of bcrypt CPU activity.

    • nasmorn 10 hours ago

      This seems weird. I have a lot of experience with Rails, which is considered super slow, but the scenario you describe is trivial. Just get a bigger VPS and change a single env var. Even if you fucked up everything else, like file storage etc., you can still do that. If you build your whole application in a way where you can’t scale anything, you should be fired. That is not even that easy to do.

      • hedora 10 hours ago

        People screw up the bcrypt thing all the time. Pick a single threaded server stack (and run on one core, because Kubernetes), then configure bcrypt so brute forcing 8 character passwords is slow on an A100. Configure kubernetes to run on a medium range CPU because you have no load. Finally, leave your cloud provider's HTTP proxy's timeout set to default.

        The result is 100% of auth requests timeout once the login queue depth gets above a hundred or so. At that point, the users retry their login attempts, so you need to scale out fast. If you haven't tested scale out, then it's time to implement a bcrypt thread pool, or reimplement your application.

        But at least the architecture I described "scales".
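
        To make the arithmetic concrete, a small sketch in Python (assumes the bcrypt package; cost factors, timings, and the pool size are all illustrative):

          import time
          from concurrent.futures import ThreadPoolExecutor

          import bcrypt

          def seconds_per_hash(rounds: int, samples: int = 3) -> float:
              # Time a few hashes at a given cost factor; each +1 roughly doubles the work.
              salt = bcrypt.gensalt(rounds=rounds)
              start = time.perf_counter()
              for _ in range(samples):
                  bcrypt.hashpw(b"correct horse battery staple", salt)
              return (time.perf_counter() - start) / samples

          for rounds in (10, 12, 14):
              t = seconds_per_hash(rounds)
              print(f"cost={rounds}: {t * 1000:.0f} ms/hash, ~{1 / t:.1f} logins/s per core")

          # The "bcrypt thread pool" mitigation: keep hashing off the request thread so a
          # burst of logins queues in the pool instead of stalling everything else.
          hash_pool = ThreadPoolExecutor(max_workers=4)

          def check_password_async(password: bytes, stored_hash: bytes):
              return hash_pool.submit(bcrypt.checkpw, password, stored_hash)

        On a typical core that prints a few hundred milliseconds per hash at the higher cost factors, i.e. single-digit logins per second per worker, which is exactly the queue-depth math above.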

        • achierius 4 hours ago

          "because Kubernetes"? Is this assuming that you're running your server inside of a Kubernetes instance (and if so, is Kubernetes going to have problems with more than one thread?), or is there some other reason why it comes into this?

        • ericwood 5 hours ago

          Fond memories of a job circa 2013 on a very large Rails app where CI times were sped up by a factor of 10 when someone realized bcrypt was misconfigured when running tests and slowing things down every time a user was created through a factory.

      • strken 9 hours ago

        Of course you should be fired for doing that! I meant the example as an illustration of how "you don't need to scale" thinking turns into A-grade bullshit.

        You do, in fact, need to scale to trivial numbers of users. You may even need to scale to a small number of users in the near future.

        • g8oz 9 hours ago

          I'm not seeing how your example proves that a beefy-server, cloud-free architecture cannot handle the workload that most companies will encounter. The example you give of an underspecified VPS is not what is being discussed in the article.

          • strken 9 hours ago

            I was responding to CaptainOfCoit, who was writing about premature optimisation killing companies. The article's proposed architecture seems fine and is similar to things I've done, but it's not an excuse to completely avoid thinking about future traffic patterns.

    • esafak 6 hours ago

      I will never forget the time my university's home-grown Web-based registration system crashed at the beginning of the semester, and the entirety of the university's student body had to form a line in order to have their registration entered manually. I waited a whole day, and they did not get round to me by night, so I had to wait the next day too.

      • dwaltrip 5 hours ago

        “Knowing what’s reasonable” matters.

        If you have a product that’s being deployed for a new school year, yeah you should be prepared for any one-time load for that time period.

        Many products don’t have the “school year just started” spikes. But some do.

        It requires careful thought, pragmatism, and business sense to balance everything and achieve the most with the available resources.

    • sgarland 11 hours ago

      That’s a skill issue, not an indictment on the limitations of the architecture. You can spin up N servers and load-balance them, as TFA points out. If the server is a snowflake and has nothing in IaC, again, not an architectural issue, but a personnel / knowledge issue.

      • strken 9 hours ago

        The architecture in TFA is fine, and sounds preferable to microservices for most use cases.

        I am worried by the talk of 10k daily users and a peak of 1000TPS being too much premature optimisation. Those numbers are quite low. You should know your expected traffic patterns, add a margin of error, and stress test your system to make sure it can handle the traffic.

        I disagree that self-inflicted architectural issues and personnel issues are different.

    • CaptainOfCoit 11 hours ago

      Wonder which one happens more often? Personally I haven't worked in the kind of "find the person to blame" culture which would lead to something like that, so I haven't witnessed what you're talking about, but I believe you that it does happen in some places.

    • ipsento606 10 hours ago

      > they didn't have any way to scale beyond a single low spec VPS from a budget provider

      they couldn't redeploy to a high-spec VPS instead?

    • kunley 11 hours ago

      I frankly don't believe that in a workplace where the userbase can be characterized as a "hall full of students" anyone was fired overnight. Doesn't happen at these places. Reprimanded, maybe.

      • hedora 9 hours ago

        More frequently, anyone that sounded the alarm about this was let go months ago, so the one that'd be fired is the one in charge of the firing.

        Instead, they celebrate "learning from running at scale" or some nonsense.

  • jstimpfle 7 hours ago

    Something that does not scale to 10k users is likely so badly architected that it would actually be faster to iterate on if it were better architected, and hence more scalable and more maintainable.

    • o11c 7 hours ago

      For reference, in 1999 10K was still considered a (doable) challenge ... but they were talking "simultaneous" not "per day".

      The modern equivalent challenge is 10 million simultaneous users per machine.

  • the8472 11 hours ago

    1000TPS isn't that much? Engineer for low latency: with a 10ms CPU budget per request, that'd be 10 cores if it were CPU-bound, and less in practice, since part of the time is usually spent in IO wait.

    • CaptainOfCoit 11 hours ago

      > 1000TPS isn't that much?

      Why does that matter? My argument is: Engineer for what you know, leave the rest for when you know better, which isn't before you have lots of users.

      • the8472 11 hours ago

        What I'm saying is that "building for 1000TPS" is not what gets you an overengineered 5-layer microservice architecture. If you build for a good user experience (which includes low latency) you get that not-that-big scale without sharding.

    • hedora 9 hours ago

      I doubt much time would be in I/O wait if this was really a scale up architecture. Ignoring the 100's of GB of page cache, it should be sitting on NVMe drives, where a write is just a PCIe round trip, and a read is < 1ms.

    • drob518 11 hours ago

      And with CPUs now being shipped with 100+ cores, you can brute force that sucker a long way.

  • systems 10 hours ago

    Clearly this project failed either because

      1. it scaled for a very specific use case, or because
      2. it hadn't even found product-market fit

    Blaming the failure on designing for scale seems misplaced; you can scale while remaining agile and open to change.

  • otabdeveloper4 11 hours ago

    > 1000TPS and 10K daily users

    That is not a lot. You can host that on a Raspberry Pi.

    • byroot 4 hours ago

      That entirely depends on what these transactions are meant to do.

      I always find these debates weird. How can you compare one app's TPS with another's?

    • pja 9 hours ago

      Not if you’re going to be “web scale” (tm) you can’t.

      • hedora 9 hours ago

        You can host it on 8 raspberry pi's: Three for etcd, three for minio/ceph, and two for Kubernetes workers.

        (16 if you need geo replication.)

      • moffkalast 5 hours ago

        You put one Mongo shard on each Pi, they are the secret ingredient in the web scale sauce.

  • throwaway894345 5 hours ago

    On the flip side, I've seen a project fail because it was built on the unvalidated assumption that the naive architecture would scale to real-world loads, only to find that a modest real-world workload exceeded the assumed targets by a factor of 100. You really do need technical leadership with good judgment and experience; we can't substitute it with facile "assume low scale" or "assume large scale" axioms.

  • th0ma5 7 hours ago

    You simply can't get the software or support for a lot of smaller solutions. It can sometimes be easier to do the seemingly more difficult thing, and sometimes that's because all the money goes to those more-difficult-seeming technical problems and solutions.

yobbo 12 hours ago

Many startup business models have no chance of becoming profitable unless they reach a certain scale, but they might have less than 1% probability of reaching that scale. Making it scalable is easy work since it is deterministic, but growing customers is not.

Another perspective is that the de facto purpose of startups (and of projects at random companies) may actually be work experience and rehearsal for the day the founders and their friends get to interview at an actual FAANG.

I think the author's “dress for the job you want, not the job you have” nails it.

  • stavros 10 hours ago

    Unfortunately, you can't really get experience from solving hypothetical problems. The actual problems you'll encounter are different, and while you can get experience in a particular "scalable" stack, it won't be worth its maintenance cost for a company that doesn't need it.

  • nicoburns 11 hours ago

    I guess the work is deterministic, but it often (unintentionally) makes the systems being developed non-deterministic!

    • potatolicious 10 hours ago

      Ah yes. I once worked at a startup that insisted on Mongo despite not having anywhere near the data volume for it to make any sense at all. Like, we're talking 5 orders of magnitude off of what one would reasonably expect to need a Mongo deployment.

      I was but a baby engineer then, and the leads would not countenance anything as pedestrian as MySQL/Postgres.

      Anyway, fast forward a bit and we were tasked with building an in-house messaging service. And at that point Mongo's eventual consistency became a roaring problem. Users would get notifications that they had a new message, and then when they tried to read it it was... well... not yet consistent.

      We ended up implementing all kinds of ugly UX hacks to work around this, but really we could've run the entire thing off of sqlite on a single box and users would've been able to read messages instantaneously, so...

      • nicoburns 10 hours ago

        I've seen similar with Firebase. Luckily I took over as tech lead at this company, so I was able to migrate us to Postgres. Amusingly, as well as being more reliable, the Postgres version (on a single small database instance) was also much faster than the previous Firebase-based version (due to it enabling JOINs in the database rather than in application code).

        • potatolicious 10 hours ago

          Funnily enough prior to this startup I had worked at a rainforest-themed big tech co where we ran all kinds of stuff on MySQL without issue, at scales that dwarfed what this startup was up to by 3-4 orders of magnitude.

          I feel like that's kind of the other arm of this whole argument: on the one hand, you ain't gonna need that "scalable" thing. On the other hand, the "unscalable" thing scales waaaaaay higher than you are led to believe.

          A single primary instance with a few read-only mirrors gets you a reaaaaaaally long way before you have to seriously think about doing something else.

          • toast0 8 hours ago

            > On the other hand, the "unscalable" thing scales waaaaaay higher than you are led to believe.

            Agreeing with you... Any reasonable database will scale pretty far if you put in a machine with 160 cores and 3 TB of RAM. And that's just a single socket board.

            There's no reason to do anything other than get bigger machines until you're near or at the limits of single socket. Dual socket and cpu generations should cover you for long enough to move to something else if you need to. Sharding a traditional database works pretty well in a lot of cases, and it mostly feels like the regular database.

            • nicoburns 8 hours ago

              That, and a lot of companies don't have the scale they think they have.

              The Postgres database for a company I worked for (that was very concerned about scaling when they interviewed me because their inefficient "nosql" solution was slow) ran very happily on a machine with 2 shared CPU cores and 4GB RAM.

      • andoando 2 hours ago

        Dawg, Tinder was operating at 20M DAU with all its DBs based on Dynamo. Probably still is.

        And yeah, there were a ton of those issues, but yolo

      • walkabout 7 hours ago

        I watched a company you've probably heard of burn stupid amounts of money because one guy there was trying to build a personal brand as a Graph Database Expert, and another had fallen hard for Neo4j's marketing. Stability issues, stupid bugs, a weak feature set, and mediocre performance for most of the stuff they wanted to do (Neo4j, at least at that time, was tuned to perform some graph-related operations very fast, but it was extremely easy to find other graph-related operations it's terrible at, and they weren't exactly obscure things) all stretched project development times to like 2x what they needed to be, with absolutely zero benefit. So fucking dumb.

        Meanwhile all they needed was... frankly, probably SQLite. For their particular use case, having each client of theirs based around a single portable file actually would have been a big win for them. Their data for each client were tiny, like put-it-all-in-memory-on-an-RPi2 tiny. But no, "it's graphs so we need a graph database! Everything's graphs when you think about it, really! (So says Neo4j's marketing material, anyway)"

  • ahartmetz 10 hours ago

    >“dress for the job you want, not the job you have”

    I don't think I should dress down any further :>

closeparen 6 hours ago

>Modules cannot call each other, except through specific interfaces (for our Python monolith, we put those in a file called some_module/api.py, so other modules can do from some_module.api import some_function, SomeClass and call things that way).

This is a solution to a large chunk of what people want out of microservices. There are just two problems, both of which feel tractable to a language/runtime that really wanted to solve them:

1. If the code implementing the module API is private, it must all be colocated in one package. If it is public, then anyone can import it, breaking the module boundary. You need a visibility system that can say "this subtree of packages can cooperate with each other, but code outside the subtree can only use the top-level package."

2. If a change to module A has a problem, you must roll back the entire monolith, preventing a good change in module B from reaching users. You need a way to change the deployed version of different modules independently. Short of microservices, they could be separate processes doing some kind of IPC, or you need a runtime with hot reloading (and a reasonably careful backwards-compatibility story).
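
To make the quoted convention concrete, the layout looks roughly like this (the internal module and helper names are made up), and the last line is exactly the loophole in problem 1:

    # some_module/api.py -- by convention, the only file other modules import from
    # (internal names below are hypothetical, just to sketch the idea)
    from some_module._impl import some_function, SomeClass   # re-export the public surface

    __all__ = ["some_function", "SomeClass"]

    # another_module/service.py
    from some_module.api import some_function          # fine: goes through the boundary
    # from some_module._impl import _secret_helper     # nothing but convention stops this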

  • daxfohl 6 hours ago

    Though even the solution to 1 doesn't solve "ACLs", which distribution does. If you want to ensure your module is only called by an approved list of upstream modules, public / private isn't granular enough. (You can solve it with attributes or some build tools, but it's ad-hoc and complex, doesn't integrate with the editor, etc. I've always thought something more granular and configurable should've been built into Java/OOP theory from the start).
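
    Something like this ad-hoc, runtime-only check is roughly the best you can do in stock Python, which is exactly the problem (all names below are made up):

        import functools
        import inspect

        def allow_callers(*allowed_prefixes):
            """Ad-hoc 'ACL': reject calls from modules not on the allowlist."""
            def decorator(fn):
                @functools.wraps(fn)
                def wrapper(*args, **kwargs):
                    caller = inspect.stack()[1].frame.f_globals.get("__name__", "")
                    if not any(caller == p or caller.startswith(p + ".")
                               for p in allowed_prefixes):
                        raise PermissionError(f"{caller!r} may not call {fn.__name__}")
                    return fn(*args, **kwargs)
                return wrapper
            return decorator

        @allow_callers("billing", "orders")   # hypothetical approved upstream modules
        def charge_card(amount_cents: int) -> str:
            return f"charged {amount_cents} cents"

    It works, but it's invisible to the editor and trivially easy to bypass, which is why it never feels like a real language feature.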

    That said, 2 is really the big problem. As things scale, this tends to cause problems on every deployment and slow the whole company down, cause important new features to get blocked by minor unrelated changes, require a lot of extra feature-flag maintenance, etc. 90% of the time, that should be the gating factor by which you decide when to split a service into multiple physical components.

    As the author said, an additional reason is that sometimes it's prudent to distribute for physical-scale reasons (conflicts between subservice A needing high throughput, B needing low latency, C needing high availability, and D needing high IOPS, such that VMs satisfying every characteristic blow your budget, or are impossible in memory-managed languages), but usually I see this done way too early, based more on ideals than on numbers.

    • closeparen 26 minutes ago

      Exactly. There should be no problem having tens of thousands of stateless, I/O bound database-transaction-wrapper endpoints in the same service. You're not going to run out of memory to hold the binary or something. If you want to partition physical capacity between groups of endpoints, you can probably accomplish that at the load balancer.

      Having the latent capability to serve endpoint A in the binary is not interfering with endpoint B's QPS unless it implies some kind of crazy background job or huge in-memory dataset. Even in this case, monoliths normally have a few different components according to function: API, DB, cache, queue, background worker, etc. You can group workloads by their structure even if their business purposes are diverse.

  • ants_everywhere 5 hours ago

    > This is a solution to a large chunk of what people want out of microservices.

    Yeah, just put some gRPC services in a single pod and have them communicate over unix domain socket. You now have a single unit of modules that can only communicate over IPC using a well-defined type-safe boundary. As a bonus you can set resource quotas separately according to the needs of the module.

    Want to scale? Rewrite some config and expose a few internal services and have them communicate over TCP.

  • aiono 6 hours ago

    > 1. If the code implementing the module API is private, it must all be colocated in one package. If it is public, then anyone can import it, breaking the module boundary. You need a visibility system that can say "this subtree of packages can cooperate with each other, but code outside the subtree can only use the top-level package."

    This is easily achieved in Scala with a particular use of package paths: qualified access modifiers like private[somePackage] let you make symbols visible only under a given package path.

    • closeparen 6 hours ago

      Unfortunately the thought pattern “modern software development is way too complex, we need something radically simpler… I know! I’ll use Scala!” is not a thing.

  • eximius 6 hours ago

    Visibility systems are great!

    > If a change module A has a problem, you must roll back the entire monolith, preventing a good change in module B from reaching users.

    eh. In these setups you really want to be fixing forward for the reason you describe - so you revert the commit for feature A or turn off the feature flag for it or something. You don't really want to be reverting deployments. If you have to, well, then it's probably worth the small cost of feature B being delayed. But there are good solutions to shipping multiple features at the same time without conflicting.

    • closeparen 5 hours ago

      There is a point in commit throughput at which finding a working combination of reverts becomes unsustainable. I've seen it. Feature flagging can delay this point but probably not prevent it unless you’re isolating literally every change with its own flag (at which point you have a weird VCS).

jwr 12 hours ago

I don't get this scalability craze either. Computers are stupid fast these days and unless you are doing something silly, it's difficult to run into CPU speed limitations.

I've been running a SaaS for 10 years now. Initially on a single server, after a couple of years moved to a distributed database (RethinkDB) and a 3-server setup, not for "scalability" but to get redundancy and prevent data loss. Haven't felt a need for more servers yet. No microservices, no Kubernetes, no AWS, just plain bare-metal servers managed through ansible.

I guess things look different if you're using somebody else's money.

  • danielmarkbruce 7 hours ago

    It's not about scalability. It's about copying what the leaders in a space do, regardless of whether it makes sense or not. It's pervasive in most areas of life.

  • drob518 11 hours ago

    One of the silliest things you can do to cripple your performance is build something that is artificially over distributed, injecting lots of network delays between components, all of which have to be transited to fulfill a single user request. Monoliths are fast. Yes, sometimes you absolutely have to break something into a standalone service, but that’s rare.

    • hedora 9 hours ago

      I've noticed a strong correlation between artificially over-distributing and not understanding things like the CAP theorem. So you end up with a slow system that has added a bunch of unsolvable distributed-systems problems on its fast path.

      (Most distributed systems problems are solvable, but only if the person that architected the system knows what they're doing. If they know what they're doing, they won't over-distribute stuff.)

      • drob518 7 hours ago

        Yes, that too. If you look at the commits for Heisenbugs associated with the system, you have a good chance of seeing artificial waits injected to “fix” things.

      • Groxx 6 hours ago

        You can solve just about any distributed systems problem by accepting latency, but nobody wants to accept latency :)

        ...despite the vast majority of latency issues being extremely low-hanging fruit, like "maybe don't have tens of megabytes of data required to do first paint on your website" or "hey maybe have an index in that database?".

        • hedora 5 hours ago

          Well, yeah, but the people that create the issues typically solve them by just corrupting the crap out of app state and adding manual ops procedures.

    • gowld 5 hours ago

      There's no need to deploy separate services on separate machines.

  • intrasight 7 hours ago

    I ran a SaaS for 10 years. Two products. Profitable from day 1 as customers paid $500/month and it ran on a couple of EC2 instances as well as a small RDS database.

    Another thing one has to consider is the market size and timeframe window of your SaaS. No sense in building for scalability if the business opportunity is only 100 customers and only for a few years.

  • ben_w 10 hours ago

    > unless you are doing something silly, it's difficult to run into CPU speed limitations.

    Yes, but it's not difficult to do something silly without even noticing until too late. Implicitly (and unintentionally) calling something with the wrong big-O, for example.

    That said, anyone know what's up with the slow deletion of Safari history? Clearly O(n), but as shown in this blog post still only deleted at a rate of 22 items in 10 seconds: https://benwheatley.github.io/blog/2025/06/19-15.56.44.html

    • phkahler 10 hours ago

      >> Yes, but it's not difficult to do something silly without even noticing until too late. Implicitly (and unintentionally) calling something with the wrong big-O, for example.

      On a non-scalable system you're going to notice that big-O problem and correct it quickly. On a scalable system you're not going to notice it until you get your AWS bill.

      • hedora 9 hours ago

        Also, instead of having a small team of people to fight scalable infrastructure configuration, you could put 1-2 full time engineers on performance engineering. They'd find big-O and constant factor problems way before they mattered in production.

        Of course, those people's weekly status reports would always be "we spent all week tracking down a dumb mistake, wrote one line of code and solved a scaling problem we'd hit at 100x our current scale".

        That's equivalent to waving a "fire me" flag at the bean counters and any borderline engineering managers.

  • floating-io 11 hours ago

    For how many users, and at what transaction rate?

    Not disagreeing that you can do a lot on a lot less than in the old days, but your story would be much more impactful with that information. :)

  • crazygringo 11 hours ago

    Scalability isn't just about CPU.

    It's just as much about storage and IO and memory and bandwidth.

    Different types of sites have completely different resource profiles.

    • sreekanth850 10 hours ago

      Microservices are not a solution for scalability. There are multiple options for building scalable software; even a monolith, or a modular monolith with a properly load-balanced setup, will drastically reduce the complexity compared to microservices and still get you massive scale. The only bottleneck will be the DB.

      • hedora 9 hours ago

        Microservices take an organizational problem:

        The teams don't talk, and always blame each other

        and add distributed-systems problems plus additional organizational problems:

        Each team implements one half of dozens of bespoke network protocols, but they still don't talk, and still always blame each other. Also, they now have access to weaponizable uptime and latency metrics, since each team "owns" the server half of one network endpoint but not the client half.

  • JackSlateur 6 hours ago

    Is it about scalability, or about resiliency ?

    • worldsayshi 5 hours ago

      Or is it about outsourcing problems?

      There's a lot of off the shelf microservices that can solve difficult problems for me. Like keycloak for user management. Isn't that a good reason?

      Or Grafana for log visualization?

      Should I build that into the monolith too? Or should I just skip it?

thelastgallon 2 hours ago

Everything is scalable because, for an enterprising executive/manager/employee, the most important problem they are trying to solve is scaling their empire. Pay is based on how many reports you have and how big your budget is. Instead of 2 x $20K beefy servers, you get to spend tens or hundreds of millions on cloud, then dozens of $500K/year SRE/DevOps/Cloud 'experts' running multi-cloud hybrid Kubernetes thingies. There will be SO many internal automation tools and big bonuses for developing those tools (wins all around), and yet more and more hiring!

Once other executives understand that you can scale your team massively, they will hire you to scale even greater heights! More #winning!

There is a human element to everything, and perverse incentives. Understanding these explains most things that seem baffling at first.

  • roncesvalles 2 hours ago

    Exactly. Once you become a people manager, the only metric that matters in your career is "number of people below you".

radarsat1 12 hours ago

> scalability needs a whole bunch of complexity

I am not sure this is true. Complexity is a function of architecture. Scalability can be achieved through abstraction; it doesn't necessarily imply a highly coupled architecture. In fact, scalability benefits from decoupling as much as possible, which effectively reduces complexity.

If you have a simple job to do that fits in an AWS Lambda, why not deploy it that way? Scalability is essentially free. But the real advantage is that by writing it as a Lambda you are forced to think of it in stateless terms. On the other hand, if it suddenly needs to coordinate with 50 other Lambdas or services, then you have complexity -- and usually scalability will suffer in this case, as things become more and more synchronous and interdependent.
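
As a sketch of what that "stateless" shape looks like in practice (the event fields here are hypothetical, not from the article):

    import json

    def handler(event, context):
        # everything the function needs arrives in the event; nothing durable lives here
        order_id = event["order_id"]
        total = sum(item["price"] for item in event["items"])
        # durable results would go to an external store (S3, DynamoDB, ...), not local state
        return {"statusCode": 200, "body": json.dumps({"order_id": order_id, "total": total})}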

> The monolith is composed of separate modules (modules which all run together in the same process).

It's of course great to have a modular architecture, but whether or not the modules run in the same process should be an implementation detail. Barriers should be explicit. By writing it all to depend on local, synchronous, same-process logic, you are likely building in all sorts of implicit barriers that will become hidden dangers when you suddenly do need to scale. And that, by the way, is one of the reasons to think about scaling in advance: when the need comes, it comes quickly.

It's not that you should scale early. But if you're designing a system architecture, I think it's better to think about scaling, not because you need it, but because doing so forces you to modularize, decouple, and make synchronization barriers explicit. If done correctly, this will lead to a better, more robust system even when it's small.

Just like premature optimization -- it's better not to get caught up doing it too early, but you still want to design your system so that you'll be able to do it later when needed, because that time will come, and the opportunity to start over is not going to come as easily as you might imagine.

  • saidinesh5 11 hours ago

    > If you have a simple job to do that fits in an AWS Lambda, why not deploy it that way, scalability is essentially free. But the real advantage is that by writing it as a Lambda you are forced to think of it in stateless terms.

    What you are describing is already the example of premature optimization. The moment you are thinking of a job in terms of "fits in an AWS Lambda" you are automatically stuck with "Use S3 to store the results" and "use a queue to manage the jobs" decisions.

    You don't even know if that job is the bottleneck that needs to scale. For all you know, a simple monolithic script deployed onto a VM/server would be a lot simpler. Just use the RAM/filesystem as the cache and write the results to the filesystem/database. When the time comes to scale, you know exactly which parts of your monolith are the bottleneck that needs to be split. For all you know, you can simply replicate your monolith, shard the inputs, and the scaling is already done. Or just use the DB's replication functionality.

    To put things into perspective, even a cheap Raspberry Pi / entry-level cloud VM gives you thousands of Postgres queries per second. Most startups I worked at NEVER hit that number. Yet their deployment stories started off with "let's use Lambdas, S3, etc.". That's just added complexity. And a lot of bills - if it weren't for the "free cloud credits".
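
    As a rough sketch of that boring setup (the paths, table, and "work" here are made up):

        import sqlite3
        from pathlib import Path

        CACHE_DIR = Path("cache")                    # local filesystem as the cache
        DB = sqlite3.connect("results.db")           # SQLite (or Postgres) as the store
        DB.execute("CREATE TABLE IF NOT EXISTS results (job_id TEXT PRIMARY KEY, output TEXT)")

        def run_job(job_id: str, payload: str) -> str:
            cached = CACHE_DIR / f"{job_id}.out"
            if cached.exists():                      # cache hit: read straight from disk
                return cached.read_text()
            output = payload.upper()                 # stand-in for the real work
            CACHE_DIR.mkdir(exist_ok=True)
            cached.write_text(output)
            with DB:                                 # commit the result to the database
                DB.execute("INSERT OR REPLACE INTO results VALUES (?, ?)", (job_id, output))
            return output

        print(run_job("job-1", "hello"))

    When one part of this actually becomes the bottleneck, that's the part you split out.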

    • bpicolo 10 hours ago

      > The moment you are thinking of a job in terms of "fits in an AWS Lambda" you are automatically stuck with "Use S3 to store the results" and "use a queue to manage the jobs" decisions.

      I think the most important constraint you inherit is that inputs/outputs must always be < 6 MB in size. It makes sense as a limitation for Lambda's scalability, but you will definitely dread it the moment a 6.1 MB use case makes sense for your application.

      • hedora 9 hours ago

        The counterargument to this point is also incredibly weak: It forces you to have clean interfaces to your functions, and to think about where the application state lives, and how it's passed around inside your application.

        That's equivalent to paying attention in software engineering 101. If you can't get those things right on one machine, you're going to be in world of hurt dealing with something like lambda.

        • daxfohl 5 hours ago

          I'd say the real advantage is that if you need to change it, you don't have to deploy your monolith. Of course, the relative benefit of that is situationally dependent, but I was recently burned by a team that built a replication handler we needed into their monolith; every time it had a bug we had to wait, because the monolith only got deployed once a week. I begged them to put it into a Lambda, but every week it was "we'll get it right next week", for months. So it does happen.

  • CaptainOfCoit 12 hours ago

    > It's of course great to have a modular architecture, but whether or not they run in the same process should be an implementation detail

    It should be, but I think "microservices" somehow screwed up that. Many developers think "modular architecture == separate services communicating via HTTP/network that can be swapped", failing to realize you can do exactly what you're talking about. It doesn't really matter what the barrier is, as long as it's clear, and more often than not, network seems to be the default barrier when it doesn't have to be.

    • worldsayshi 5 hours ago

      > network seems to be the default barrier when it doesn't have to be.

      But if you want to use off the shelf solutions to your problems it often is. You can't very well do 'from keycloak import login_page'.

  • dapperdrake 11 hours ago

    The complexity that makes money is all the essential complexity of the problem domain. The "complexity in the architecture" can only add to that (and often does).

    This is the part that is about math as a language for patterns as well as research for finding counter-examples. It’s not an engineering problem yet.

    Once you have product-market fit, then it becomes an engineering problem.

mannyv 7 hours ago

Actually, scalability is cheap. Our AWS bill until recently was around $160-$200 a month. Getting that level of HA and performance ourselves would require at least 20 boxes in two data centers.

Dev/test/prod with an HA db and a backend that never dies. I’ve built those on bare iron and they’re expensive.

If you’re going for saas and customers that don’t care about your infrastructure then a hetzner box is fine.

But really, creating resilient infrastructure is super cheap now.

Agree with “make what your customers want,” but many customers actually want a service that doesn’t barf.

  • n_u 6 hours ago

    Could you explain this some more? How are your costs so low in comparison? Are you using serverless?

    • bigbuppo 5 hours ago

      That's how the cloudy platforms get you. They're very cheap on the low end, until they're not.

      • mannyv 4 hours ago

        No, it's because we changed the way we process our metrics.

        Previously we processed our metrics by consolidating them into multidimensional entries on a minute basis.

        We moved to single metric second-based collection, because it was getting too complicated to process and because we wanted second-by-second measurement to measure engagement more granularly. That increased our data retention tremendously. We're still under the cost for the other timestream products, but we'll be adjusting how we do that in a quarter or two.

      • etothepii 4 hours ago

        I've heard this a few times. Can you explain a bit more why you think that's a problem?

        I've always made the assumption that once they become "not cheap" you now have the cost to offset investment against.

    • mannyv 4 hours ago

      It depends on you understanding your app and how things need to be structured. We have what essentially is a video CMS, so we have two parts: a management UI that end-users use and a backend that actually delivers the video and collects metrics.

      They are essentially two products, and are designed that way; if the management UI barfed the backend would continue along forever.

      You can combine management and delivery in one app, but that makes delivery more fragile and will be slower because presumably it has to invoke a lot of useless stuff just to deliver bytes. I remember working with a spring app that essentially built and destroyed the whole spring runtime just to serve a request, which was an unbelievably dumb thing to do. Spring became the bottleneck, and for most requests there was actually no work done; 99% of the time was in spring doing spring things.

      So really, once you separate the delivery and management it becomes easier to figure out the minimum amount of stuff you need. Redis, because you need to cache a bunch of metadata and handle lots of connections. Mysql, because you need a persistent store. Lambda, as a thin layer between everything. And a CDN, because you don't want to serve stuff out of AWS if you can help it. SQS for what essentially becomes job control. And for metric collection we use fastly with synthetic logging.

      To be fair, our AWS cost was low but our CDN cost is like $1800/mo for some number of PB/mo (5? 10? I forget).

      In the old days this would require at least (2 DB + 2 App server + 2 NAS) * 2 locations = 12 boxes. If we were going to do the networking ourselves we'd add 4 f5s. Ideally we'd have the app server, redis, and the various lambdas on different boxes, so 2 redis + 2 runners = 8 more servers. If we didn't use f5s we'd have 2 reverse proxies as the front end at each location. Each box would have 2 PSUs, at least a raid 1, dual NICs, and ECC. I think the lowest end Dell boxes with those features are like $5k each? Today I'd probably just stuff some 1TB SSDs in them and mirror them instead of going SAS. The NAS would be hard to spec because you have to figure out how much storage you need and they can be a pain to reconfigure. You don't want to spend too much up front, but you also don't want to have downtime while you add some more drive space.

      Having built this out, it's not as easy as you'd think. I've been lucky enough to have built this sort of thing a few times. It's fun to do, but maintaining it can be a PITA. If you don't believe in documentation your deployment will fail miserably because you did something out of order.

acron0 12 hours ago

Ugh, there is just something so satisfying about developer cynicism. It gives me that warm, fuzzy feeling.

I basically agree with most of what the author is saying here, and my feeling is that most developers are at least aware that they should resist technical self-pleasure in pursuit of making sure the business/product they're attached to is actually performing. Are there really people out there who still reach for Meta-scale by default? Who start with microservices?

  • lpapez 11 hours ago

    > Are there really people out there who still reach for Meta-scale by default? Who start with microservices?

    Anecdotally, in the last three greenfield projects I was a part of, the Architects (distinct people in every case) began the project along the lines of "let us define the microservices to handle our domains".

    Every one of those projects failed, in my opinion not primarily owing to bad technical decisions - but they surely didn't help either by making things harder to pivot, extend and change.

    Clean Code ruined a generation of engineers IMO.

    • robertlagrant 11 hours ago

      I think this sounds more like Domain Driven Design than Clean Code.

      • ahoka 10 hours ago

        It kinda started with Clean Code. I remember some old colleagues walking around with the book in their hand and deleting ten year old comments in every commit they made: "You see, we don't need that anymore, because the code describes itself". It made a generation (generations?) of software developers think that all the architectural patterns were found now, we can finally do real engineering and just have to find the one that fits for the problem at hand! Everyone asked the SOLID principles during interviews, because that's how real engineers design! I think "cargo cult" was getting used at that time too to describe this phenomenon.

        • sarchertech 10 hours ago

          It was (is) bad. The worst part is that the majority of people pushing it haven't even read Clean Code. They've read a blog post by a guy who read a blog post by a guy who skimmed the book.

  • worldsayshi 5 hours ago

    I don't buy the idea that people mainly reach for microservices for scalability or "pleasure" reasons though.

    I personally reach for it to outsource some problems by using off the shelf solutions. I don't want to reinvent the wheel. And if everyone else is doing it in a certain way I want to do it in the same way to try to stand on the shoulders of giants and not reinvent everything.

    But that's probably the wrong approach then...

  • wilkommen 6 hours ago

    Yes, there are still people who start with microservices, unfortunately. There are where I work.

  • dwoldrich 3 hours ago

    I needed to build an internal admin console, not super-scalable, just a handful of business users to start. The SQL database it would access was on-premises, but might move to the cloud in future. Authorized users needed single sign-on to their Azure-based active directory accounts for login. I wanted to do tracing of user requests with OpenTelemetry or something like.

    At this point in my career, why wouldn't I reach for microservices to supply the endpoints that my frontend calls out to? Microservices are straightforward to implement with NodeJS (or any other language, for that matter.) I get very straightforward tracing and Azure SSO support in NodeJS. For my admin console, I figured I would need one backend-for-frontend microservice that the frontend would connect to and a domain service for each domain that needed to be represented (with only one domain to start). We picked server technologies and frameworks that could easily port to the cloud.

    So two microservices to implement a secure admin console from scratch, is that too many? I guess I lack the imagination to do the project differently. I do enjoy the "API First" approach and the way it lets me engage meaningfully with the business folks to come up with a design before we write any code. I like how it's easy to unit/functional test with microservices, very tidy.

    Perhaps what makes a lot/most of microservice development so gross is misguided architectural and deployment goals. Like, having a server/cluster per deployed service is insane. I deploy all of my services monolithically until a service has some unique security or scaling needs that require it to separate from the others.

    Similarly, it seems common for microservices teams to keep multiple git repos, one for each service. Why?! Some strange separation-of-concerns/purity ideals. Code reuse, testing, pull requests, and atomic releases suffer needless friction unless everything is kept in a monorepo, as the OP implied.

    Also, when teams build microservices in such a way that services must call other services, they completely miss the point of services - that's just creating a distributed monolith (slow!).

    I made a rule on my team that the only service type that can call another service is aggregation services like my backend-for-frontend which could launch downstream calls in parallel and aggregate the results for the caller. This made the architecture very flat with the minimum number of network hops and with as much parallelism as possible so it would stay performant. Domain services owned their data sources, no drama with backend data.

    I see a lot of distributed monolith drama and abuse of NoSQL data sources giving microservices a bad reputation.

maxamillion 2 hours ago

I love this blog post so much. <3

Stop solving problems you don't have. You have no users, you don't need to support a million of them. You make no money, you don't need to burn through all your cloud credits so fast. Just chill, start with a couple Linodes, Droplets, or Lightsail instances and see how far you can get with a simple app, api, load balancer, and a database. You'd be shocked how far that stretches when you're talking about servicing real paying customers.

drob518 11 hours ago

> The first problem every startup solves is scalability. The first problem every startup should solve is “how do we have enough money to not go bust in two months”, but that’s a hard problem, whereas scalability is trivially solvable by reading a few engineering blogs, and anyway it’s not like anyone will ever call you out on it, since you’ll go bust in two months.

I laughed. I cried. Having a back full of microservices scars, I can attest that everything said here is true. Just build an effin monolith and get it done.

Animats 5 hours ago

You can get an awful lot done with Go, MySQL or MariaDB, FCGI, and Apache or Nginx. You start with one server on shared hosting for a few dollars a month. Then scale up to a dedicated server, if the load appears. Then scale up to multiple servers with a load balancer. Replicate the database.

On the load side, if you're accumulating "statistics" about user behavior, do you really need them for every user? Maybe only one user in a hundred. Or a thousand.
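
The sampling idea is nearly a one-liner (the stats sink here is just a stand-in):

    import random

    def maybe_record(event: dict, sample_rate: float = 0.01) -> None:
        # keep behaviour stats for roughly 1 user in 100; multiply counts by
        # 1/sample_rate when reporting
        if random.random() < sample_rate:
            print("stats:", event)   # real code would write to a stats table or log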

When you exceed the limits of that, you're a big company and can afford AWS.

A few years ago, we had those guys with the liquid meal (not Juicero, the Soylent guys) boasting about their "infrastructure". Not for making the product, for their web site. From their financials, you could calculate that they were doing about four sales a minute. Their "infrastructure" could run on a Raspberry Pi.

  • roncesvalles an hour ago

    The problem with a VPS/baremetal solution is always high availability. Once you introduce HA, you may as well go into the cloud since HA'ing stateful stuff using VPS is a major pain.

    Also, you can get a lot done with a serverless (FaaS/PaaS) solution and a simple DB like DynamoDB.

    • foldr 8 minutes ago

      I don't disagree with this, but I think people sometimes miss the fact that even cheap commodity hardware is way more reliable than most software. Your site may go down more often due to bugs in your HA configuration than it would have gone down because of your VPS dying once a year or so. If you are not a huge operation, the sweet spot for HA is probably something very simple (e.g. two servers behind a load balancer, or something like that).

abujazar 12 hours ago

I've seen my share of insanely over-engineered Azure locked-in applications that could easily have been run on an open source stack on a $20 VM.

  • hedora 9 hours ago

    But what if payroll grows to 100M internal users?

    • roncesvalles an hour ago

      I sense sarcasm but just to add on, most software problems actually have very predictable natural upper bounds. There are only X number of people in the world growing at a well-defined rate. There are only X number of people in your country, or in your TAM (e.g. if you're a restaurant) etc. This is especially true for B2B.

      The need to accommodate runaway scale (unbounded N and unbounded rate of growth of N) is actually quite rare.

    • WJW 6 hours ago

      Seems like a great problem to have. Surely one of those millions of employees can be used to change the current system at that point. Until then, no reason to overengineer it.

    • abujazar 8 hours ago

      Yea, and what if my country suddenly grows by 10000%?

jrochkind1 6 hours ago

> No, no you don’t, you can deploy your modules to their own repos and work on them that way, though then you lose the nice property of being able to atomically deploy changes across your entire codebase when an API changes.

That's not all you lose. You also lose being able to have a single git SHA to describe the state of the entire system at any time. And, lose the naturalness of running CI on the entire system at a given state (and knowing what CI ran on what state), although you can rig that up unnaturally of course.

  • cestith 3 hours ago

    I hear you about rigging those repos together artificially.

    I used to work somewhere that had several different systems written as monoliths that eventually needed to interact more closely with each other. There was an in-house ticket system (since made a support service wrapped around ZenDesk, but ZD wouldn’t replicate all its functionality). There was an in-house employee management system. There was a first-party CRM. There was a homegrown e-commerce store. They’d built their own licensing servers for their software. Eventually the CRM was managing large customer licenses, the ticket system linked to the store to sell priority support, and the store was selling licenses to smaller customers.

    So instead of making these things all support APIs that supported the other applications, people started copying libraries around from service to service. Then to make sure those libraries didn’t fall out of date, the (at the time rsync) deployment process for each of those apps was changed to require a pull from every one of those repos, then a push from that staging server to the production servers. Then security did a PCI-DSS internal audit, and the developers couldn’t just get onto a staging server and make direct changes to production.

    So I, as the lead SRE at the time, wrote a builder web app that takes a config file per project. It holds the data on the repos and the default tag to pull from each. The web app allows the developer to update to a different commit or tag for any particular repo involved. Then a single button pulls everything and serially takes production servers out of rotation, updates them, and puts them back into production. It’s something that could have been avoided many different ways including using a monorepo for those systems.

gampleman 11 hours ago

Hilariously written but also too true.

At one startup I worked at, we had 2 Kubernetes clusters and a rat's nest of microservices for an internal tool that, had we actually been successful at delivering sufficient value, would have been used by at most 100 employees (and those would have been unlikely to be concurrent). And this was an extremely highly valued company at the time.

Another place I worked at we were paying for 2 dev ops engineers (and those guys don't come cheap) to maintain our deployment cluster for 3 apps which each had a single customer (with a handful of users). This whole operation had like 20 people and an engineering team of 8.

  • radiator 11 hours ago

    This sounds just about right: I have read that Kubernetes is the Greek term for "more containers than customers".

  • andoando 11 hours ago

    We have the same shit and it's super annoying too, because in addition I can't do shit without going through the devops team, even though we're 5 engineers.

  • jeffrallen 4 hours ago

    I work at a place with 8 k8s clusters. We needed to evolve from generation 2 to generation 3 because of "manageability" or something. Gen 3 needed two clusters instead of one. Now we have 8 * (1 + 2) = 24 clusters.

    Happy days.

  • Thiez 10 hours ago

    What were these dev ops engineers doing all day? Surely you can only polish a cluster so much before it's done and there is nothing left to do?

    • gampleman 10 hours ago

      You should have seen the architecture they came up with... it had ALL the bells and whistles you could possibly imagine and cost an absolute fortune.

      Of course they eventually got bored and quit. And then it became really annoying since no one else understood anything about it.

    • jeffrallen 4 hours ago

      It takes approximately 3 months to get it "just right". Luckily, k8s (and the CNI and auth sidecars, and...) releases every 2.8 months.

treve 9 hours ago

A bit of an alternative take on this: I talk to a lot of folks at small start-ups (in Toronto, if that matters), and it seems like most people actually get this right and understand not to bring in complexity until later. Things like microservices seem to be mostly understood as a tool that isn't really meant to solve a real scalability problem and is a massive liability early on.

The exceptions are usually just inexperienced people at the helm. My feeling is, hire someone with adequate experience and this is likely not an issue.

I do think architecture astronauts tend to talk a lot more about their houses of cards, which makes it seem like these set ups are more popular than they are.

liampulles 6 hours ago

Early in my career I was given the opportunity (in between consulting jobs) to make an MVP for a revamp of our internal employee management system.

I seized the opportunity to deploy my own Kubernetes cluster, and create a sidecar to help with user authentication (because of course we'd need a common way to do this for the multi-language suite of microservices we'd be building). I used up pretty much the entirety of my time designing and architecting how this colossus was going to work, and by the end I realized how foolish the whole endeavor was.

That was really an instructive failure - at my next job, I got everyone behind turning our team's microservices back into a modular monolith, and it worked very well.

mmcnl 5 hours ago

Scalability is not purely technical. It's also organizational. For all its drawbacks, the microservices architecture is easier to scale from an organizational perspective.

  • whstl 5 hours ago

    Only when the service boundaries and interfaces are built with this in mind.

    A service that is isolated enough it could be another company? Sure, this scales. But do company hierarchies and organization practices help this happen? I haven't seen it outside of places like Amazon where there was a mandate for it to be that way.

    What companies end up with in practice are services so tightly coupled with the rest of the company that they require a mishmash of API requests in both directions and endless coordination. Aka a distributed monolith. All the problems with none of the advantages.

sagyam 9 hours ago

I have read and watched these articles and videos where people seem to have a problem with microservices, Kubernetes, cloud providers, or anything that's not a PHP server sitting behind nginx on a $5 VPS. I have also seen the front-end analogue of these types of posts, where anything that is not written using HTML, CSS, and jQuery is unnecessary bloat. I will soon write a blog post which I think will cover more points and nuances on both sides. For now, here are some of my scattered thoughts.

- If deploying your MVP to EKS is overengineering, then signing a year-long lease for bare metal is hubris. Both think one day they will need it, but only one of them can undo that decision.

- Don't compare your JBOD to a multi-region replicated, CDN-enabled object store that can shrug off a DDoS attack. One protects you from those egress fees, and the other protects you from a disaster. They are not comparable.

- A year from now, the startup you work for may not exist. Being able to write that you have experience with that trendy technology on your resume sure sounds nice. Given the layoffs we are seeing right now, putting our interest above the company's may be a good idea.

- Yes, everyone knows modern CPUs are very fast, and paying $300/mo for an 8-core machine feels like a ripoff, but unless you are in the business of renting GPUs and selling tokens, compute was never your cost center; it was always humans. For some companies, not being able to meet your SLA due to talent attrition is scarier than the cloud bill.

I know these are one-sided arguments, and I said I would cover both sides with more nuance. I need some time to think through all the arguments, especially on the frontend side. I will soon write a blog.

  • rvitorper 8 hours ago

    I thought capitalism was about adding value, not conflict of interest

alpine01 10 hours ago

There's a now famous Harvard lecture video on YouTube of Zuckerberg earlier in the Facebook days, where he walks through the issues they hit early on.

https://www.youtube.com/watch?v=xFFs9UgOAlE

I watched it ages ago, but I seem to remember one thing I liked: each time they changed the architecture, it was to solve a problem they had, or were beginning to have. They stayed away from pre-optimization and instead tackled problems as they appeared, rather than imagining problems long before (or whether) they occurred.

It's a bit like the "perfect is the enemy of done" concept - you could spend 2-3x the time making it much more scalable, but that might have an opportunity cost which weakens you somewhere else or makes it harder/more expensive to maintain and support.

Take it with a pinch of salt, but I thought it seemed like quite a good level-headed approach to choosing how to spend time/money early on, when there's a lot of financial/time constraints.

gamerDude 10 hours ago

When I'm working with new developers I always have to convince them to simplify their setup. Why are we on autoscaled, pay-by-the-query infra when we are serving a few people? Then they complain about how expensive it is. I had someone tell me that their costs were $1500/mo when they were still in the demo stage. I asked them why they weren't hosting on a single small server for $20, and they responded that it didn't matter because they were using free credits.

Except that those free credits will go away and you'll find yourself not wanting to do all the work to move it over when it would've been easier to do so when you just had that first monolith server up.

I think free credits and hyped up technology is to blame. So, basically a gamed onboarding process that gets people to over-engineer and spend more.

mjr00 11 hours ago

> The first problem every startup should solve is “how do we have enough money to not go bust in two months”, but that’s a hard problem, whereas scalability is trivially solvable by reading a few engineering blogs [...] Do you know what the difference between Google and your startup is? It’s definitely not scalability, you’ve solved that problem. It’s that Google has billions upon billions with which to pay for that scalability, which is really good because scalability is expensive.

Too true. Now that I've stepped into an "engineering leadership" role and spend as much time looking at finances as I do at code, I've formed the opinion that in 99.999% of cases, engineering problems are really business problems. If you could throw infinite time and money at the technical challenges, they'd no longer be challenging. But businesses, especially startups, don't have infinite (or even "some") money and time, so the challenge is doing the best engineering work you can, given time and budget constraints.

> The downsides [of the monolith approach]

I like the article's suggestion of using explicitly defined API boundaries between modules, and that's a good approach for a monolith. However, one massive downside cannot be ignored: by having a single monolith, you now have an implicit dependency on the same runtime and dependency versions across all parts of your code. What I mean is, all your code is going to share the same Python version and the same libraries (particularly true in Python, where having multiple versions of a library dependency is not a common or well-supported use case). This means that if you're working on Module A and you realize you need a new feature from Pandas 2.x, but the rest of the code is on Pandas 1.x... well, you can't upgrade unless you go and fix Modules B, C, D ... Z to work with Pandas 2.

This won't be an issue at the start, but it's worth pointing out. Being forced to upgrade a core library or language runtime and finding out it's a multi-month disruptive project can be brutal.

DrScientist 11 hours ago

Isn't it as simple as the following?

Break your code into modules/components that have a defined interface between them. That interface only passes data - not code with behaviour - and signals that method calls may fail to complete (i.e. throw exceptions).

i.e. the interface could be a network call in the future.

Allow easy swapping of interface implementations by passing them into constructors, using factories, or using a dependency injection framework if you must.

That's it - you can then start with everything in-process and the rapid development that allows, but if you need to, you can later split things into networked microservices - any complexity that arises from the network aspect is hidden behind the proxy, with the ultimate escape hatch of the exception.
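
A minimal sketch of that shape, using a made-up auth component: a data-only result type, an interface that advertises failure, an in-process implementation, and a remote proxy you can swap in later without the callers noticing:

    from dataclasses import dataclass

    @dataclass
    class AuthResult:                # only data crosses the boundary
        user_id: str
        ok: bool

    class AuthError(Exception):      # the interface advertises that calls may fail
        pass

    class AuthService:               # the interface
        def check(self, token: str) -> AuthResult:
            raise NotImplementedError

    class InProcessAuth(AuthService):
        def check(self, token: str) -> AuthResult:
            return AuthResult(user_id="u123", ok=(token == "valid"))

    class RemoteAuthProxy(AuthService):
        def __init__(self, base_url: str):
            self.base_url = base_url
        def check(self, token: str) -> AuthResult:
            # would make an HTTP/RPC call here; network failures surface as AuthError
            raise AuthError("remote auth unavailable")

    def handle_request(auth: AuthService, token: str) -> str:
        try:
            result = auth.check(token)
        except AuthError:
            return "auth temporarily unavailable"
        return "welcome" if result.ok else "denied"

    # start in-process; later pass RemoteAuthProxy("http://auth.internal") instead
    print(handle_request(InProcessAuth(), "valid"))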

Have I missed something?

  • williamdclt 10 hours ago

    You're not missing something, but you're assuming that it's easy to know ahead of time where the module boundaries should be and what the interfaces should look like. This is very far from easy, if possible at all (eg google "abstraction boundaries are optimization boundaries").

    Also, most of these interfaces you'll likely never need. It's a cost of initial development, and the indirection is a cost on maintainability of your code. It's probably (although not certainly) cheaper to refactor to introduce interfaces as needed, rather than always anticipate a need that might never come.

  • crazygringo 11 hours ago

    You're not missing much, but I don't understand why you're just basically repeating what the article already says. Except the article also says to use a monorepo.

    • stavros 10 hours ago

      No, I'm saying you don't need to use a monorepo! The repo discussion is a bit orthogonal, and up to you to decide whether you want a single repo or multiple repos with modules/libraries that get deployed together.

    • DrScientist 11 hours ago

      I think I've added a couple of elements to make it possible to scale your auth service if you need to. Easily swappable implementations and making sure the interfaces advertise that calls may simply fail.

      Even so it's still very simple.

      To scale your auth service you just write a proxy to a remote implementation and pass that in - any load balancing etc is hidden behind that same interface and none of the rest of the code cares.

      • crazygringo 11 hours ago

        Good point! Sorry if I was being ungenerous.

        I like the idea of the remote implementation being proxied -- not sure I've come across that pattern before.

  • 8note 4 hours ago

    The swap from interface to network call is still non-trivial.

    You get new problems that are qualitatively different from before, like timeouts, which can break the assumptions in the rest of your code about, say, whether state was updated or not, and in what order. You also then get to deal with thundering herds and circuit breakers and so on.

  • goodpoint 6 hours ago

    Yes, you are missing the cost of complexity and network calls. You are describing a distributed monolith. It does not help.

pigcat 10 hours ago

My friend is the first dev hire at a startup where they prematurely over-engineered for scalability. The technical founders had recently exited a previous startup, and their rationale was that it makes a future acquisition easier, since a potential acquirer will weigh scalability in their evaluation of the code (and maybe even conflate it with quality). In fact, it was a regret from their first startup that they hadn't baked in scalability earlier. I remain skeptical of the decision, but I'm curious whether there's any truth to the idea that acquirers weigh scalability in their scorecards.

  • danielmarkbruce 6 hours ago

    Sure, if the acquirer thinks the product is going to sell a lot.

    A relatively common plan (it doesn't always work) for large enterprise software companies is to buy a product and then use their very large sales force to sell it into all their existing customers. If that's the plan, you have to make sure the product will work with all the increased usage.

    I'd still suggest it's far better to optimize for building the right product - the "is this going to scale" problem is one of the nicest problems you can face.

mattbillenstein 4 hours ago

I think more engineers would benefit from "you can just build stuff" thinking. Like you don't have to use all the complicated whizz-bang tech that everyone else is using.

You can build a boring backend on Linux VMs without containers using open-source software - it's simpler or at least a different level of complexity compared to the big clouds and orchestration systems like k8s, and honestly, it's just more fun to work on - I almost never write yaml - it's a joy.

I wrote my own deployment system using this idea - machines, roles, software and services that map to those roles, idempotent operations, and a constantly-connected async rpc system to orchestrate it all. Written from scratch in a language I like with a config language I like. My deploys are often < 10s (if I'm not waiting on webpack to build the UI) and all connect up to a chatops channel in Slack. I understand it because I wrote it all. Will it scale to infinity? Definitely not, but it's good enough for my uses.

So, I guess - just build stuff using simple primitives. Write simple software - modules and functions and a lot of stateless code. Use postgres for persistence - it's really that good. Use nginx and dns load balancing - tried and true simple architecture.
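
To give a feel for how small that shape can be, here's a rough Go sketch (not my actual system - the table, env var, and port are made up): a stateless handler with all state in Postgres, so any number of copies can sit behind nginx or DNS round-robin.

    package main

    import (
        "database/sql"
        "encoding/json"
        "log"
        "net/http"
        "os"

        _ "github.com/lib/pq" // Postgres driver; pgx would work just as well
    )

    func main() {
        db, err := sql.Open("postgres", os.Getenv("DATABASE_URL"))
        if err != nil {
            log.Fatal(err)
        }

        // Stateless handler: all state lives in Postgres, so any number of these
        // processes can run behind nginx / DNS load balancing.
        http.HandleFunc("/users/count", func(w http.ResponseWriter, r *http.Request) {
            var n int
            if err := db.QueryRowContext(r.Context(), "SELECT count(*) FROM users").Scan(&n); err != nil {
                http.Error(w, "db error", http.StatusInternalServerError)
                return
            }
            json.NewEncoder(w).Encode(map[string]int{"users": n})
        })

        log.Fatal(http.ListenAndServe(":8080", nil))
    }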

afiodorov 6 hours ago

I've found that building my side projects to be "scalable" is a practical side effect of choosing the most cost-effective hosting.

When a project has little to no traffic, the on-demand pricing of serverless is unbeatable. A static site on S3 or a backend on Lambda with DynamoDB will cost nothing under the AWS free tier. A dedicated server, even a cheap one, is an immediate and fixed $8-10/month liability.

The cost to run a monolith on a VPS only becomes competitive once you have enough users to burn through the very generous free tiers, which for many side projects is a long way off. The primary driver here is minimizing cost and operational overhead from day one.

arthurofbabylon 6 hours ago

I agree, of course. Choose an architecture for the pragmatic reality, not some fantastical non-reality.

However, I appreciate the craft. Some of these unnecessary optimizations (rather, “introduced complexities”) are vestigial accoutrements that come alongside generally good software design. Not all, but some. So I tolerate a fair amount of fanciness in myself and others when it coincides with solid intent and healthy output.

That said, we should absolutely not tolerate the presence of appurtenances of complexity at the architectural layer – that is a place reserved for pure 100% pragmatism.

scuff3d 5 hours ago

A problem not addressed by the article is customer expectations. If you work for a company that does contracting, you have to deal with customers who have enough knowledge to know all the buzz words but not enough knowledge to actually know what they're asking for. If you don't give them Kubernetes and micro services they don't want to pay you.

zem an hour ago

as an aside to the last bit about monorepos, from a dev point of view monorepos are pretty awesome. you never need to deal with complicated merge commits and you always have a consistent state of the world, enforced by ci. i would imagine maintaining the tooling and infrastructure to support said monorepo is pretty hard and painful, but as a user I love working in one.

andersmurphy 6 hours ago

A VPS, caddy, sqlite and Litestream is really all you need in my experience. Helps if your language has decent support for real and green threads (java/go/clojure).

shenenee 7 hours ago

Man is 100% right; the insanity we put ourselves through for all kinds of hypotheticals is beyond mental

koito17 6 hours ago

In 2015, a single server with 40 cores and 128 GB of RAM was able to handle 2 million WebSocket connections running an Elixir program[0].

One of my hot takes is that a gaming PC has fast enough hardware to serve thousands of clients with a static Rust binary and SQLite. Pair with Litestream, and you have easy-to-test, continuous backups. It's nice being able to test backups by just running `litestream restore` and then running a single binary on my development machine. Additionally, when the backend is a single static binary, you gain the opportunity to test the entire system in CI without maintaining an ad-hoc cloud environment or hierarchy of mock services.

The points of contention, for me personally, would be managing deployments and observability.

Of course, at my workplace, I wouldn't dare to suggest this kind of architecture, but as others have mentioned, a single machine can go a long way, and I doubt most of my projects will ever need to scale beyond 40 cores and 128 GB of RAM.

[0] https://x.com/chris_mccord/status/659430661942550528

  • rvitorper 3 hours ago

    Can’t argue with that. Machines are pretty capable, and elixir is awesome as well

incorrecthorse 7 hours ago

Plot twist: It's not actually scalable because no amount of tools and buzzwords can compensate for the lack of experience in proper architecture for scaling.

zokier 10 hours ago

In lot of contexts scaling down is far more important than scaling up. In that sense scalability is cost-optimization; instead of provisioning fixed capacity that is enough for (predicted) peak loads, you can scale based on actual demand and save money or have higher utilization.

hsn915 12 hours ago

I think it was around 2015 when everything was basically AWS and Kubernetes

The turning point might have been Heroku? Prior to Heroku, I think people just assumed you deploy to a VPS. Heroku taught people to stop thinking about the production environment so much.

I think people were so inspired by it that they wanted to mimic it for other languages. It got more people curious about AWS.

Ironically, while the point of Heroku was to make deployment easy and done with a single command, the modern deployment story on cloud infrastructure is so complicated that most teams need to hold a one-hour meeting with several developers "hands on deck" and go through a very manual process.

So it might seem counterintuitive to suggest that the trend was started by Heroku, because the result is the exact opposite of the inspiration.

JohnMakin 11 hours ago

If you’ve ever been in a situation where you do suddenly face scale and have to rip apart a legacy monolith that was built without scale in mind, you’ll chuckle at this article. It’s extremely painful.

  • stavros 38 minutes ago

    I'm not saying "you're never going to scale", I'm saying "you aren't going to scale right now". There's a happy medium between overengineering everything and making everything so shoddily that you can never grow.

  • sgarland 10 hours ago

    Legitimately asking, how? The only bottleneck should be the DB, and if you can saturate a 128-core DB, I want to see your queries and working set size. Not saying it can’t happen, but it’s rare that someone has actually maxed out MySQL or Postgres without there being some serious schema and query flaws, or just poor / absent tuning.

    • JohnMakin 10 hours ago

      You’re thinking purely in terms of app performance. have you ever seen a terrible db schema? Having to suddenly iterate fast with a brittle codebase that doesnt really allow that ive seen bring teams to their knees for a year+.

      I’ve seen monoliths because of their sheer size and how much crap and debt is packed into them, build and deploy processes taking several hours if not an entire day for some fix that could be ci/cd’d in seconds if it wasn’t such a ball of mud. Then, what tends to happen, is the infrastructure around it tends to compensate heavily for it, which turns into its own ball of mud. Nothing wrong with properly scaled monoliths but it’s a bit naive, in my personal experience, to just scoff at scale when your business succeeding relies on scale at some point. Don’t prematurely optimize, but don’t be oblivious to future scenarios, because they can happen quicker than you think

      • whstl 5 hours ago

        That was the reality of the fintech I worked at.

        The schema wasn't really a problem, but the sheer amount of queries per request. Often a user opening a page or clicking a button would cause 100-200 database queries, including updates. This would prevent strategies such as "just replicating the data somewhere". It was so badly architected that every morning the app would stop responding due to users doing their morning routine operations. And they only had around 300 employees.

        And this was just an internal app, the B2C part was already isolated because we couldn't afford to be offline.

        The solution I started working on was something similar to the strangler fig pattern, replacing parts of the API with new code that talked directly to the ORM. Naturally this didn't make the people who wrote the legacy code happy, but at least the outages stopped.

        • JohnMakin 17 minutes ago

          sounds so similar to a fintech situation I was in once that I swear I was gonna say we worked at the same place, but the size of the company is wrong. I've now seen pretty similar things since, enough to say it's probably everywhere

      • sgarland 4 hours ago

        > Have you ever seen a terrible db schema?

        I am a DBRE, so yes, unfortunately most days I see terrible schemata.

        > Having to suddenly iterate fast with a brittle codebase that doesn't really allow it - I've seen that bring teams to their knees for a year+.

        IME, the “let’s move fast” mindset causes further problems, because it’s rare that a dev has any inkling about proper data modeling, let alone RDBMS internals. What I usually see are heavily denormalized tables, UUIDs everywhere, and JSON taking the place of good modeling practices. Then they’re surprised when I tell them the issue can’t be fixed with yet another index, or a query rewrite. Turns out when you have the largest instance the cloud provider has, and your working set still doesn’t fit into memory, you’re gonna have a bad time.

    • VirusNewbie 6 hours ago

      For e-commerce, sure. But for telecom or IoT, it doesn't take a large company to easily overrun the limits of what Postgres can do.

tesdinger 11 hours ago

> you usually only have one database

What if I use the cloud? I don't even know how many servers my database runs on. Nor do I care. It's liberating not having to think about it at all.

  • sgarland 10 hours ago

    And your cloud providers thank you for giving their executives a third yacht.

_ZeD_ 12 hours ago

Honestly, in my experience, the only good reason to have microservices in a "software solution" is to be able to match 1 service -> 1 maintainer/team and have a big (read "nested", with multiple levels of middle-managers) group of teams, each of which may have different goals. In this way it's very easy to "map" a manager/team to a "place" in the solution map, with very explicit and documented interactions between them

  • hsn915 11 hours ago

    Nice story, but at the places I've seen that make use of services, there's never a "1 service -> 1 team" mapping. It's more like 20 services distributed among 3 teams, and some services are "shared" by all teams

    • rvitorper 36 minutes ago

      I can relate to this

  • nycdotnet 12 hours ago

    This is Conway’s Law: You ship your org chart.

lambdaone 11 hours ago

I had a client with a system just like this. EBS, S3, RDS, Cognito, the lot. It cost $00s per month under almost no load, and was a maintenance nightmare - which was the real problem, not the cost, as it stopped working altogether eventually. A bit of hacking later, it all fits on a single VM that costs ~$10/month to run and is far easier to build, deploy and maintain.

rednafi 5 hours ago

CV padding is a real thing. Too many engineers love joining moonshot programs at large companies. The reason - being able to play with cool toys with the least amount of accountability.

Google has Google problems. So unless you are operating at that scale, blindly adopting their tech won’t solve your problems. But it might bring you a raise.

pdhborges 11 hours ago

Scale articles are too focused on architecture. What about the business problems that come with scale? At a certain scale, rare events are common, many cases cease to be fixable by some random process that involves humans, and you have to handle a lot more business scenarios in your code.

esher 11 hours ago

I can relate - running a small hosting business. People come up with overly complex solutions. They solve problems that they wish they had. For instance: HA setups are complex. If not done correctly, as in most cases, people don't gain the additional '9' from the SLA.

dt3ft 7 hours ago

Because Microsoft, Google and Amazon, to name a few, sell the story of the cloud to decision makers who can sign a subscription contract. Developers go with the flow, not daring to question the setup. Meanwhile, they host their own servers on an r-pi and ship side projects. Devs are not at fault here. It's the management.

  • stavros 6 hours ago

    Having been on both sides, nope, it's 100% the devs pushing for this. I've had innumerable discussions trying to convince developers to simplify things, but their reaction was basically "how can we go against best practice?! Unthinkable!". It was a huge uphill battle.

    Meanwhile, management didn't have an opinion, it was up to the teams to architect things how they liked, and they liked sprawling microservice everything, because they wanted to own 100% of their code, any other concern be damned.

shwaj 4 hours ago

I recall hearing, more than a decade ago IIRC, that facebook.com ran as a monolith for much longer than you would have expected. Perhaps someone with direct experience can comment?

Havoc 11 hours ago

Comes down to knowing when to stop. You don’t really want to DIY your own orchestrator etc. So better off just using kubernetes. But then not going too far down that rabbit hole.

ie yes kubernetes but the simplest vanilla version of it you can manage

  • apf6 6 hours ago

    Simplest version of Kubernetes is zero Kubernetes. You can instead run your service using a process manager like PM2 or similar. I think even using Docker is overkill for a lot of small teams.

  • sgarland 10 hours ago

    I wouldn’t start with K8s, and I’ve administered it at multiple companies. Unless every one of your initial hires is a SWE turned SRE, you’re in for a bad time (and you don’t need it).

    I’d personally start with Linux services on some VMs, but Docker Compose is also valid. There are plenty of wrappers around Compose to add features if you’d like.

okaleniuk 11 hours ago

Most startups actually plan to survive for more than 2 months. And it makes total sense to think about scalability, reliability, and performance while it's still possible to change your whole stack every other week. Not forgetting about other things such as securing your cash flow, growing your talent pool, protecting your IP, etc. Finding a good balance between multiple foci is exactly the job of a founder. Of course, it's a hard job; that's why we don't see many successful startups to begin with.

xg15 6 hours ago

> The first problem every startup solves is scalability. The first problem every startup should solve is “how do we have enough money to not go bust in two months”

Why is the second question the devs' responsibility? Shouldn't it be the founders'?

  • bcrosby95 6 hours ago

    Because how the devs decide to build the system influences whether you will have enough money to not go bust in two months.

    • xg15 6 hours ago

      Good point, but unless the only dev is also a founder, I'd think they aren't the ones with the responsibility to come up with a business model.

      They definitely have the responsibility to write software so it most efficiently serves that business model.

oooyay 5 hours ago

The architecture the author is describing is called SOA, and to me it is much more optimal. There are variations of SOA that can occur in a monolith or as separate services, but at the end of the day it stresses the separation at interfaces that most people like microservices for. Microservices are really only architecturally necessary if certain parts of your application have outlier performance characteristics where it wouldn't make sense to scale the whole thing, for any number of reasons (eg: database connection pooling or some other metric).

As for why microservices got so popular, I think the answer lies in writing. The more a pattern is written about, the more likely it is to be repeated and modeled.

Finally, the author also goes after cloud usage in general. We're a decade and a half away from the first of these convergences. I was a systems engineer in those days and I remember how terrible they were. Software engineers requested a virtual machine, a pool of network resources, firewall rules, etc., and eventually they got what they needed. The primary motivator of the cloud wasn't scale per se; to me it was that a competent developer now has an API to request those things through. They're operating on machine time instead of human time.

Some people extrapolate the same argument above and replace cloud with virtual machines and containers: "Why do I need containers when I can simply operate this load balancer, some VMs in an autoscaling group, and a managed database?!" Again, we quickly forget that for many software engineers, immersing themselves in image pipelines, operating system maintenance, and networking details bogs them down in releasing the thing they really care about - the thing that logically gets them paid. Containers, again, traded that complexity to another team that solves it once for many people and lets software engineers live a relatively less complex life from their perspective.

There's an old adage I used to hear in the Marines that goes something like, "If you're not in the infantry then you're serving it. If you aren't doing that then you should question exactly what it is that you do here." The same can be said for software and those outside of product - we end up living to serve those that are building more closely on the product itself.

That's how I contextualize this history/evolution anyway.

sreekanth850 11 hours ago

Frontend folks weren't happy with how simple things were, so after seeing microservices, they invented microfrontends, and balance in complexity was restored.

dev_l1x_be 6 hours ago

Scalability is a solved problem as of 2025; this is why. Do we need it? Probably not, in most cases anyway. Most of the time people do it because they like complex systems.

charlimangy 10 hours ago

Working in modern architectures that can scale is pretty important for developers that want to have attractive resumes. Given that your startup has a 9 out of 10 chance of failing you're going to need another job. If you want people to stay you have to give them the security of keeping up with at least some of the latest fashions.

aubanel 7 hours ago

Great article! It seems like the whole industry has overfitted to the system design interviews of FAANG, and thus focused on extreme scaling needs that few companies actually have.

jcarrano 9 hours ago

I think part of the problem is (some) programmers being unable to draw clear encapsulation boundaries when writing a monolith. I'm not even referring to imposing a discipline for a whole team, but the ability to design a clean internal API and stick to it oneself.

right2copy 5 hours ago

if copyright didn't exist, i'm not so sure that our current data systems would be considered "scalable" - scalability is a relative term.

vbezhenar 9 hours ago

Everything is scalable, because it became very easy to write scalable software. I guess that's the reason.

nate_martin 6 hours ago

>The first problem every startup solves is scalability.

I don't think this is true at all. The first problem they solve is typically finding product market fit, which startups will do by sacrificing scalability and quality for speed of execution.

TickleSteve 11 hours ago

scale vertically before horizontally...

- scaling vertically is cheaper to develop

- scaling horizontally gets you further.

What is correct for your situation depends on your human, financial and time resources.

dzonga 7 hours ago

There's also the 'myth' of the modular monolith.

tobyhinloopen 10 hours ago

It isn't hard to make something a bit scalable, but it is very hard to make it scalable _later_.

ecshafer 5 hours ago

The topic of this blog post is the same thing that DHH has been railing on for the last few years. Run your own infrastructure, keep things simple.

j45 5 hours ago

Because when the cloud initially came out (2000-2010), Linux had a few things to improve, including networking and virtualization.

The public cloud provided a way to avoid all of this headache.

Once it got figured out, the cloud wasn't the only way to scale anymore.

Except more people than not might not know that about the cloud today.

Linux has gotten orders of magnitude more efficient and effective, so has the public cloud, and by extension, so has scaling (and self-hosting).

deadbabe 6 hours ago

Unfortunately software engineering suffers from the opposite problem of many other engineering disciplines.

Any idiot could build a bridge by just overbuilding everything; an engineer helps you build the minimum viable bridge.

In software, it’s the opposite. Idiots can easily roll out products and services with crude and basic code. You only need true engineers for the high volume high performance stuff. And if that’s not what you’re doing – you don’t need engineers. The logical conclusion is then to fire them.

ReptileMan 10 hours ago

That is a helluva verbose way to quote Knuth ...

llm_nerd 11 hours ago

This piece is written with a pretty cliche dismissive tone that assumes that everything everyone else does is driven by cargo-culting if not outright ignorance. That people make these choices because they're just rushing to chase the latest trend.

They're just trying to be cool, you see.

Here's the thing, though: Almost every choice that leads to scalability also leads to reliability. These two patterns are effectively interchangeable. Having your infra costs be "$100 per month" (a claim that usually comes with a massive disclaimer, as an aside) but then falling over for a day because your DB server crashed is a really, really bad place to be.

  • crazygringo 11 hours ago

    > Almost every choice that leads to scalability also leads to reliability.

    Empirically, that does not seem to be the case. Large scalable systems also go offline for hours at a time. There are so many more potential points of failure due to the complexity.

    And even with a single regular server, it's very easy to keep a live replica backup of the database and point to that if the main one goes down. Which is a common practice. That's not scaling, just redundancy.

    • llm_nerd 11 hours ago

      >Empirically, that does not seem to be the case.

      Failures are astonishingly, vanishingly rare. Like it's amazing at this point how reliable almost every system is. There are a tiny number of failures at enormous scale operations (almost always due to network misconfigurations, FWIW), but in the grand scheme of things we've architected an outrageously reliable set of platforms.

      >That's not scaling, just redundancy.

      In practice it almost always is scaling. No one wants to pay for a whole separate server just to apply shipped logs to. I mean, the whole premise of this article is that you should get the most out of your spend, so in that case much better is two hot servers. And once you have two hot...why not four, distributed across data centers. And so on.

      • crazygringo 11 hours ago

        > Failures are astonishingly, vanishingly rare

        You and I must be using different sites and different clouds.

        There's a reason isitdownrightnow.com exists. And why HN'ers are always complaining about service status pages being hosted on the same services.

        By your logic, AWS and Azure should fail once in a millennium, yet they regularly bring down large chunks of the internet.

        Literally last week: https://cyberpress.org/microsoft-azure-faces-global-outage-i...

  • sgarland 10 hours ago

    A distributed monolith - which is what nearly all places claiming to run microservices actually have - has the uptime of all of its services multiplied together.

    Even if you do truly have a microservices architecture, you’ve also now introduced a great deal of complexity, and unless you have some extremely competent infra / SRE folk on staff, that’s going to bite you. I have seen this over and over and over again.

    People make these choices because they don’t understand computing fundamentals, let alone distributed systems, but the Medium blogs and ChatGPT have assured them that they do.

    • dinkleberg 5 hours ago

      This is the truth. I work with an application that has nearly 100 microservices and it seems like at any given point in time at least one is busted. Is it going to impact what you’re doing? Maybe. Maybe not.

      But if it was just a monolith and had proper startup checks, when they roll out a new version and it fails, just kill it right there. Leave the old working version up.

      Monoliths have their issues too. But doing microservices correctly is quite the job.

  • okaleniuk 11 hours ago

    Yes, reliability comes from the same ground that scalability does, and yes, people are mostly chasing the latest trend. One does not contradict the other.

    • llm_nerd 11 hours ago

      >yes people are mostly chasing the latest trend

      https://www.youtube.com/watch?v=b2F-DItXtZs

      15 years ago people were making the same "chasing trends" complaints. In that case there absolutely were people cargo culting, but people are still whining about this a decade and a half later, when it's quite literally just absolutely basic best practice.

  • blueflow 11 hours ago

    > Here's the thing, though: Almost every choice that leads to scalability also leads to reliability.

    How is that supposed to happen without k8s involved somehow?

    • 97nomad 11 hours ago

      There are a lot of tools that don't need k8s to be scalable and reliable, starting from stateless services and simple load balancers and ending with actor systems like those in Erlang or Akka.

bogwog 7 hours ago

TL;DR: premature optimization is the root of all evil

reactordev 11 hours ago

I read this, and have the opposite experience. Your monolith will fester as developers step on each other's toes. You aren't solving for scalability, you're solving for sovereignty. Giving other teams the ability to develop their own service without needing to conform to your archaic grey beard architecture restrictions and your lack of understanding what a pod is or how to get your logs from your cloud.

No, this whole article reads like someone who is crying that they no longer have their AS/200. Bye. The reason people use AWS and all those 3rd-party services is so they don't have to reinvent the wheel, which this author seems hell-bent on.

Why are we using TCP when a Unix file is fine… why are we using databases when a directory and files are fine? Why are we scaling when we aren't Google, when my single machine can serve a webpage? Why am I getting paid to be an engineer while eschewing all the things that we have advanced over the last two decades?

Yeah, these are not the right questions. The real question should be: “Now that we have scale what are we gonna do with it?”

  • sgarland 10 hours ago

    > Giving other teams the ability to develop their own service without needing to conform to your archaic grey beard architecture restrictions

    IME at many different SaaS companies, the only one that had serious reliability was the one that had “archaic grey beard architecture restrictions.” Devs want to use New Shiny X? Put a formal request before the architectural review committee; they’ll read it, then explain how what the team wants already exists in a different form.

    I don’t know why so many developers - notably, not system design experts, nor having any background in infrastructure - think that they know better than the gray beards. They’ve seen some shit.

    > and your lack of understanding what a pod is or how to get your logs from your cloud.

    No one said the gray beards don’t know this. At the aforementioned company, we ran hybrid on-prem and AWS, and our product was hybrid K8s and traditional Linux services.

    Re: cloud logs, every time I’ve needed logs, it has consistently been faster for me to ssh onto the instance (assuming it wasn’t ephemeral) and use ripgrep. If I don’t know where the logs were emitted from, I’ll find that first, then ssh. The only LaaS I’ve used that was worth a damn was Sumologic, but I have no idea how they are now, as that was years ago.

    • fragmede 10 hours ago

      Splunk was (and is) the gold standard for centralized logging. The problem with it now is mainly that it's crazy expensive, though the operational engineering burden in order to run it well is non-zero and has to be accounted for. But being able to basically grep across all logs on the whole fleet, and then easily being able to visualize those results, made me never want to go back to having to ssh somewhere and run grep manually. I could write a script to ssh to all the app servers, grab the past 15 minutes of requests, extract their IPs, and plot them on a map to see which countries are hot, but that would be annoying enough that I'd really have to want to do that.

      Meanwhile if you have Splunk, you specify the logfile name and how to extract the IP and then append "| iplocation clientip | geostats count by Country" to see which countries requests are coming from, for example. Or append "| stats count by http_version" and then click pie chart and get a visualization that breaks down how much traffic is still on HTTP 1.1, who's on 1.2, whos is on 2, and who's moved to QUIC/3.

  • sarchertech 10 hours ago

    >step on each others toes

    Which leads us to a huge problem I’ve seen over the past few decades.

    Too many developers for the task at hand. It’s easier for large companies to hire 100 developers with a lower bar that may or may not be a great fit than it is to hire 5 experts.

    Then you have 100 developers that you need to keep busy, and not all of them can be busy 100% of the time because most people aren't good at making their own impactful work. Then instead of trying to actually find naturally separate projects for some of them to do, you attempt to artificially break up your existing project in a way that 100 developers can work on together (and enforce those boundaries through a network).

    This artificial separation fixes some issues (merge conflicts, some deployment issues), but it causes others (everything is a distributed system now, multi stage and multi system deployments required for the smallest changes, massive infrastructure, added network latency everywhere).

    That’s not to say that some problems aren’t really so big that you need a huge number of devs, but the vast majority aren’t.

    > they don’t have to reinvent the wheel

    Everything is a trade off, but we shouldn’t discount the cost of using generic solutions in place of bespoke ones.

    Generic solutions are never going to be as good of a fit as something designed to do exactly what you need. Sometimes the tradeoff is worth it. Sometimes it isn't. Like when you need to horizontally scale just to handle the overhead. Or when you have to maintain a fork of a complex system that does way more than you need.

    It’s the same problem as hiring 100 generic devs instead of 5 experts. Sometimes worth it. Sometimes not.

    There’s another issue here too. If not enough people are reinventing the wheel we get stuck in local optima.

    The worst part is that not enough people spend enough time even thinking about these issues to make informed decisions regarding the tradeoffs they are making.