Any founder with a product in production knows the pain of waking up at 2 a.m. to outage alerts, or of fielding support requests during family dinner. It can be rough, and scaling support is challenging. Here are some things you can do to make your systems more resilient.
Remove complexity.
Simple systems are just easier to maintain. They fail less often, and in less interesting ways. Interesting failures are time-consuming, expensive, and annoying to deal with. So you should do everything you can to make sure your failures are as boring as possible! A couple of ways you might remove complexity:
- Do less. Example: deciding whether to add another service to handle rate limiting and JWT validation, or to handle both in your API code? Just take the path of least resistance. Your proxy likely has rate limiting built in, and you can do JWT validation with about 13 lines of code. Repeat yourself a couple of times if you have to, and optimize later. You’ll have one less service to worry about!
- Avoid jumping on bandwagons. Just because the industry has standardized on an orchestrator and everyone’s doing it doesn’t mean you have to. Your architectural decisions could mean the difference between leaving the runway or running out of it.
- Ask “what is the problem we’re trying to solve?” Always bring your engineering efforts back to the end goal you’re trying to achieve. Make sure you and your team are spending the least amount of time, energy, and cash you can to reach a minimum functional product.
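To make the "do less" example concrete, here is a rough sketch of in-process JWT validation using only the Python standard library. It assumes HS256-signed tokens and a shared secret, and it only checks the signature and the exp claim; the function name validate_jwt is made up for illustration, and a real API may need more checks (issuer, audience, key rotation).

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url_decode(segment: str) -> bytes:
    # JWT segments are base64url-encoded with the "=" padding stripped.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def validate_jwt(token: str, secret: bytes, now: Optional[float] = None) -> dict:
    """Return the claims of a valid HS256 token, or raise ValueError."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(signature_b64)
    except (ValueError, TypeError) as exc:
        raise ValueError("malformed token") from exc
    if not isinstance(header, dict) or not isinstance(claims, dict):
        raise ValueError("malformed token")
    # Assumption: this API only ever issues HS256 tokens.
    if header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    signed_part = f"{header_b64}.{payload_b64}".encode()
    expected = hmac.new(secret, signed_part, hashlib.sha256).digest()
    # Constant-time comparison avoids leaking signature bytes via timing.
    if not hmac.compare_digest(signature, expected):
        raise ValueError("bad signature")
    current = time.time() if now is None else now
    if "exp" in claims and claims["exp"] <= current:
        raise ValueError("token expired")
    return claims
```

Calling validate_jwt(token, secret) at the top of a request handler keeps the check in your API code, with no extra service to deploy or monitor.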
Automate provisioning and deployment.
DevOps and automation are non-functional work. While non-functional things don’t directly bring in more revenue, they can make your life easier and reduce engineering effort (expense). Rather than pointing, clicking, or typing the same CLI commands over and over, automate! It’s so nice to type git push and know your app is being deployed for you while you refill your coffee cup. Start small and expand what you automate as you get closer to launching (hint: things like automated testing, code coverage, static analysis, etc. should probably come later).
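Starting small can be as modest as putting the commands you already type by hand into one script that a git hook or CI job runs for you. The sketch below is one way to do that; run_steps is a made-up helper, and the step names and commands in the trailing comment are placeholders, not a recommended pipeline.

```python
import subprocess
import sys


def run_steps(steps):
    """Run (name, argv) steps in order and stop at the first failure.

    Returns the names of the steps that completed successfully.
    """
    completed = []
    for name, argv in steps:
        print(f"==> {name}")
        result = subprocess.run(argv)
        if result.returncode != 0:
            print(f"step failed: {name} (exit {result.returncode})", file=sys.stderr)
            break
        completed.append(name)
    return completed


# Hypothetical usage; replace the commands with whatever you run by hand today:
# run_steps([
#     ("build", ["docker", "build", "-t", "appname-api", "."]),
#     ("deploy", ["./deploy.sh"]),
# ])
```

Stopping at the first failed step keeps failures boring: a broken build never reaches the deploy step.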
Instrument your systems.
Collecting metrics and logs about your services is critical for two reasons:
- You’ll know what’s actually happening.
- You’ll find out what your users value.
The obvious purpose of metrics and logging is to diagnose issues and trace problems to their sources. It’s an absolute must. You can also use this data to analyze how your users are interacting with your services, and how components of your architecture interact with each other as you scale or during usage peaks or troughs. Incorporate these data points in your continuous improvement cycles.
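As a starting point, a small decorator can record both kinds of data at once: how often an operation is called (what users value) and how often it fails or how long it takes (what is actually happening). This is a minimal sketch; the in-memory STATS dictionary and the instrumented name are illustrative stand-ins for whatever metrics and logging tools you actually use.

```python
import json
import logging
import time
from collections import defaultdict
from functools import wraps

logger = logging.getLogger("metrics")

# In-memory counters; a hypothetical stand-in for a real metrics backend.
STATS = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})


def instrumented(name):
    """Decorator that counts calls and errors and times each invocation."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            ok = True
            try:
                return func(*args, **kwargs)
            except Exception:
                ok = False
                raise  # record the failure, but never swallow it
            finally:
                elapsed = time.perf_counter() - start
                entry = STATS[name]
                entry["calls"] += 1
                entry["errors"] += 0 if ok else 1
                entry["total_seconds"] += elapsed
                # One JSON object per line is easy to search and aggregate later.
                logger.info(json.dumps({"op": name, "ok": ok, "seconds": round(elapsed, 6)}))
        return wrapper
    return decorator
```

Decorating a handler with @instrumented("create_order") (a made-up operation name) would then give you a call count, an error count, and cumulative latency for that operation.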
Alert (and take action!) on negative conditions.
Once you’ve started collecting metrics and logs for your apps and infrastructure, set up some key alerts and automations to respond to negative conditions in your environment. You want to make sure these alerts represent conditions users will care about, and supply this context within the content of the alert. Providing only technical information, like server names and performance figures, does nothing to compel or help a person to respond. Here is an example of a high-quality alert:
Title: "API Error Rate Spike Detected"
Severity: High
Description: "The error rate for the appname-api has increased to 2.5%,
surpassing the acceptable threshold of 1%. This may lead to user-facing
errors, and needs to be addressed promptly to minimize the impact on users."
Actions and Resources:
- Examine the error logs <link to logging tool>.
- Investigate recent deployments <link to CI/CD tooling>.
- Coordinate with the development team <link to Slack channel>.
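An alert like the one above could be produced by a check along these lines. It is only a sketch: the Alert class and check_error_rate function are made up for illustration, the 1% threshold comes from the example, and the link placeholders are left unfilled on purpose.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Alert:
    title: str
    severity: str
    description: str
    actions: List[str] = field(default_factory=list)


def check_error_rate(service: str, errors: int, requests: int,
                     threshold: float = 0.01) -> Optional[Alert]:
    """Return an Alert when errors/requests exceeds threshold, otherwise None."""
    if requests <= 0:
        return None  # no traffic, so there is no error rate to judge
    rate = errors / requests
    if rate <= threshold:
        return None
    return Alert(
        title="API Error Rate Spike Detected",
        severity="High",  # assumption: every breach is treated as High
        description=(
            f"The error rate for the {service} has increased to {rate:.1%}, "
            f"surpassing the acceptable threshold of {threshold:.0%}. "
            "This may lead to user-facing errors, and needs to be addressed "
            "promptly to minimize the impact on users."
        ),
        actions=[
            "Examine the error logs <link to logging tool>.",
            "Investigate recent deployments <link to CI/CD tooling>.",
            "Coordinate with the development team <link to Slack channel>.",
        ],
    )
```

For instance, check_error_rate("appname-api", errors=25, requests=1000) describes a 2.5% error rate against a 1% threshold, while 5 errors in 1000 requests returns None and pages no one.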
You can also create automations that will take action for you when conditions change. Some examples include:
- Your orchestrator rescheduling a service when it goes down, or adding more copies when a node crashes.
- Autoscaling API containers for increased inbound connections.
- Restarting critical services which have crashed, logging the event, and alerting engineers to investigate the root cause.
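The last item in that list can be sketched as a small self-healing routine. Everything here is hypothetical: heal, is_healthy, restart, and notify are placeholder names, and in practice your orchestrator or process supervisor may already provide this behavior.

```python
import logging
from typing import Callable

logger = logging.getLogger("watchdog")


def heal(service: str,
         is_healthy: Callable[[], bool],
         restart: Callable[[], None],
         notify: Callable[[str], None],
         max_restarts: int = 3) -> bool:
    """Restart a crashed service, log each attempt, and alert engineers.

    Returns True if the service is healthy when the function returns.
    """
    if is_healthy():
        return True
    for attempt in range(1, max_restarts + 1):
        logger.warning("%s is down, restart attempt %d", service, attempt)
        restart()
        if is_healthy():
            # Recovery is not the end of the story: someone should find out why.
            notify(f"{service} crashed and was restarted (attempt {attempt}); "
                   "investigate the root cause.")
            return True
    notify(f"{service} is still down after {max_restarts} restarts; "
           "manual action needed.")
    return False
```

Capping the number of restarts avoids a crash loop that hides the underlying problem, and the notification goes out whether or not the restart worked.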
Tackling even a few of these areas will drastically improve the resiliency of your products and platforms.
If you need help with product development and software delivery, reach out.