Design Principle: Earn Your Scale

Push run-of-the-mill stacks as far as they’ll take you. Learn their limits, then make the trade-offs of specialty infrastructure only when you’ve truly outgrown simple, proven paths. That’s how your app grows sustainably and how you actually build the experience to run systems at scale.

This post continues my Design Principles series. Each principle captures a lesson I’ve learned about building and running software systems over the years. The goal isn’t to prescribe one “right” way, but to highlight patterns that keep complexity in check and make systems more sustainable.

One area where complexity sneaks in quickly is in the technologies we choose. Every system is built on many decisions, and while most choices look fine from the outside, some are harder to justify. That’s where I find myself squinting, wondering why someone would pick a certain tool given what I know about their needs.

There are so many trade-offs behind the scenes that choices which don’t affect you directly are rarely worth worrying about, and the variety of approaches to building software is a good thing. But I have to admit I turn into Old Man Yells at Cloud when I hear someone reach for Kafka, CockroachDB, or the latest shiny “Web3DB”, not because of real needs, but because they assume relational databases or well-worn queueing backends can’t scale. They’ve read the marketing material, listened to the hype, and made choices based on theory instead of experience. Those choices ripple outward: others see them and repeat them, feeding the cycle.

Take Amazon SQS. A FIFO queue supports about 300 API calls per second per action (send, receive, and delete). With batching, up to 10 messages per call, you can hit ~3,000 messages per second, and in some regions high-throughput mode pushes it even further. That’s far more than most SaaS apps will ever need, and you don’t have to start at that level. You can scale into it gradually without re-architecting anything.
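To make the batching point concrete, here’s a minimal sketch of how you’d group messages into SQS-sized batches before sending. The batching logic is plain Python; the boto3 call is shown in comments and assumes a queue URL you’d supply yourself:

```python
# Sketch: batching messages for SQS send_message_batch (SQS allows at
# most 10 messages per call). Queue URL and boto3 wiring are assumed,
# not shown running here.

def chunk(messages, size=10):
    """Split a list of messages into SQS-sized batches of at most `size`."""
    return [messages[i:i + size] for i in range(0, len(messages), size)]

def to_entries(batch):
    """Build SendMessageBatch entries; Id must be unique within one batch."""
    return [{"Id": str(i), "MessageBody": body} for i, body in enumerate(batch)]

# In production this loop would use boto3 (hypothetical wiring):
#   sqs = boto3.client("sqs")
#   for batch in chunk(messages):
#       sqs.send_message_batch(QueueUrl=queue_url, Entries=to_entries(batch))

if __name__ == "__main__":
    msgs = [f"job-{n}" for n in range(25)]
    print([len(b) for b in chunk(msgs)])  # prints [10, 10, 5]
```

Ten messages per call is what turns 300 calls per second into ~3,000 messages per second, with no change to the rest of your architecture.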

Start with a queue (SQS or RabbitMQ) for jobs and simple async work. If you later need durable replay, multiple independent consumers, and very high partitioned throughput, that’s when Kafka or Kinesis start to make sense.

The same is true of PostgreSQL. On AWS RDS, benchmarks show 5,000–7,000 transactions per second on decently sized instances. With the right indexes, storage, memory, and IOPS configuration, you can get a long way before you outgrow it. And remember, PostgreSQL has been around for decades. It’s been battle-tested at scales much larger than your app will likely ever see. If you do start to hit those limits, AWS Aurora is often the next step. It’s still Postgres at the core, but with a distributed, managed storage layer that can scale reads and handle much higher throughput while keeping the operational model familiar.

Only when you need global distribution, multi-region consistency, or the ability to survive entire data centers going offline without losing availability does it make sense to move to a distributed SQL database like CockroachDB.

Of course, specialty tools like Kafka and CockroachDB exist for a reason. They are deeply impressive pieces of engineering and absolutely the right choice when you hit their sweet spot. As your service continues to grow, you should keep pushing run-of-the-mill stacks as far as they’ll take you, but also recognize that there may come a tipping point where the trade-offs favor a shift. That point does exist; it’s just that the number of services that truly reach it is much smaller than people think.

Before adopting specialty tools, take an honest look at your current and near-future needs. Can SQS or another familiar queueing backend still serve your workload? Can a properly tuned RDS PostgreSQL handle your current traffic and 2-5x growth? Often the answer is yes. And remember: specialty infrastructure isn’t just more scalable. It is more expensive, harder to maintain, harder to debug, and comes with its own “hidden” failure modes. Fewer companies use it, which means fewer people to hire and less support to lean on. There’s a reason many teams start running Kafka themselves and eventually give in to managed services from Confluent.
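That honest look can start as a back-of-the-envelope calculation. A sketch with illustrative numbers (your measured peak throughput and the cited RDS range would go here; nothing below is a benchmark):

```python
# Back-of-the-envelope headroom check (all numbers illustrative).
current_tps = 500        # hypothetical: your measured peak transactions/sec
growth_factor = 5        # plan for the high end, 5x growth
projected_tps = current_tps * growth_factor

rds_ceiling_tps = 5000   # low end of the tuned-RDS PostgreSQL range
headroom = rds_ceiling_tps / projected_tps

print(projected_tps, headroom)  # prints 2500 2.0, still 2x headroom at 5x growth
```

If even the pessimistic version of this math leaves you with headroom, the familiar stack still serves you, and the specialty tool can wait.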

It’s okay to not have scaling experience. Everyone starts from zero. But when you don’t have that experience, that is exactly when you need to develop a pragmatic, critical eye for technology choices. Running Kafka or CockroachDB in a service that doesn’t truly need them won’t make you an expert. It only adds operational overhead without teaching you the lessons that come from actually pushing a system to its limits. Real scaling experience is earned the same way your app earns scale: by growing into it. There’s no shame in not having it yet. Scaling experience comes naturally as your app grows, and every challenge you face along the way makes you better prepared for the next one.