Writing Software That Lasts

2023-04-21

Here I’ll present some topics that I believe are important when developing software so that it’s highly resilient and lasts for many years. The topics are in no particular order and reflect my opinion, grounded in my experience as a developer.

Treat this text as a collection of ideas I’ve gathered over the years. I tried to keep them as generic as possible, but I don’t claim they apply to every scenario. I think they fit best if your team is small and you need to ship fast.

That said, many of the topics are specific to Go, my favorite language at the moment.

Focus on simplicity

Write the software in the simplest way you can, with as few layers and abstractions as possible. The reason is that absolutely nothing comes for free: every new layer and every new abstraction has a price, usually paid in development time, maintenance time, complexity, and, worse, downtime.

The software should do strictly what’s necessary to fulfill its purpose, nothing beyond that.

Keep the stack small

The fewer tools your software needs to run, the better.

For example, it might seem easy to solve potential scalability problems with a queue and workers that consume it. But beyond often being unnecessary, adding a queue means you’ve added several things to your infrastructure that you didn’t have to worry about before. Now you have to deal with network connections, credentials to read from and write to the queue service, and so on.

Of course this doesn’t mean you’ll never use queues or other similar structures, but you need to think very carefully about whether it’s really necessary, because the burden of complexity, code, and administrative problems that have nothing to do with programming will increase considerably.

Reduce the attack surface

Keeping the stack small also has strong implications for security. Ask any SRE specialist whether they’d rather deal with the security of several interconnected services or take care of a small set of services.

Forget the ghost of scalability

You know that saying that premature optimization is the root of all evil? Optimizing for scalability without measuring, just because “I think we need it,” is far worse. A REST API written in Go and storing data in a database like PostgreSQL, both well written, can handle an incredible number of connections per second even on a simple machine with 1GB of RAM and 1 core. If the database is on a separate machine, or better yet if it’s a managed service, the load this small server can handle is enormous.

It’s very common for a system to be designed with several microservices, queues, an in-memory key/value store, a vault service to centralize the keys for the various services, and so on.

Then you spend a fortune on infrastructure and an enormous amount of time dealing with the orchestration of all that. You’ll need a security specialist to make sure everything is properly configured and nothing leaks.

And in the end you serve only a handful of simultaneous connections.

Prefer the monolith

Monolith, microservices, FaaS, these are just architectural choices. They all have advantages and disadvantages, but if you can, choose the monolith. The reason is that communication between the parts of the system is much simpler in a monolith.

The other options include communication layers that need to be managed and that affect security and performance, and we have a strong tendency to underestimate those costs.

I know how tempting it is to have several microservices, each with one responsibility. But if they’re poorly designed, you end up with a “distributed monolith”: you have all the problems of a monolith, except now everything is distributed, and instead of getting the advantages of microservices you’re left only with the disadvantages.

When designing service layers, always think about whether the cost of dealing with security, networking, and so on is worth the separation into distinct services, and always be very critical.

For example, it makes perfect sense for a large company with several products to have a microservice specialized in authentication that all the others consult. That service sits behind a firewall, has load balancing, several redundant instances, and so on.

But that’s usually not what we have here in Brazil. What I see most are small companies and startups with fairly moderate load.

Divide the system by responsibilities

A good way to organize the system and create independent libraries that can be reused is to divide the system by responsibilities, with each piece of software having only one responsibility.

This also applies if you decide to create microservices: each one should have only one responsibility. It may sound obvious put this way, but day to day, with demands coming in and the ever-growing pressure for delivery speed, it’s important not to give in to the temptation of taking shortcuts.

Even within a monolith, libraries that accumulate different responsibilities are bad.

Avoid dividing the team

Dividing the system by responsibilities is good, but that doesn’t apply to the team. It may not be possible depending on the size of the company, but it’s better to have a single team where everyone shares information quickly.

Communication

Communication has a major impact on software quality. The lack of communication easily leads to duplicated effort; that’s how Windows ended up with 6 different audio controls. Good communication ensures that everyone understands and knows the company’s priorities and the difficulties the other developers are facing.

It’s fine to have separate teams if the company is big enough, but you have to remember that communication between teams is more costly than communication between members of the same team. In fact, the product owner, product manager, scrum master, or whatever the role of the person who passes demands to the developers is called, should stay as close as possible to the devs, preferably right there with them all day.

There are ways to ease the communication problem. For example, one thing that works very well is having everyone in the same chat room. Even with people physically apart, you just open the audio and talk to them. Beyond strengthening the team’s bonds, this avoids the “telephone game,” avoids scheduling useless meetings, and so on.

I’ve tried this dynamic at two different companies with great success. You start your day, join the chat room, mute your microphone, and keep working; if someone calls you it’s the same dynamic as calling a person at the next desk.

Prefer tested and boring technology

We have a tendency to want to use the newest and shiniest technologies. Programmers love novelty and technology. So they watch a talk and get eager to put the new technology into anything.

This is terrible for the resilience of the software being built. Ideally every technology used should be boring, in the sense that it brings nothing new, the programmers master it almost completely, it has been battle-tested countless times, and there’s no risk of it being abandoned or taking an unexpected turn in the coming years.

For example, if you pick up a good PostgreSQL book from five years ago it’s probably still fully relevant today, a good Bash book easily holds up for a decade, and a good C book is eternal. And I really hope Go follows the same path.

Operating system commands

Speaking of Bash, learn to use the operating system commands well. Often it’s better to write a small script and use what already exists than to develop it yourself.

Queues and workers

Rather than taking on the whole weight of a queue service, use the database you’re probably already running anyway.

In my case, where the chosen database is PostgreSQL, mainly for historical reasons, I use select… for update together with skip locked, so that several workers can process records safely without waiting on each other’s row locks. I also add a simple field to record the current status of the record that needs to be processed, and that’s it.

Another interesting mechanism, especially when you need to consume third-party services to process the record, is to create a field with the processing history.

This is a text field where you append things like the result of an API call, errors, and so on. It’s quite nice to have the error in the same record that’s causing it.
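A minimal sketch of this pattern in Go, using only database/sql. The table and column names (jobs, status, payload, history) and the helpers processOne and outcome are hypothetical, the $1-style placeholders assume a PostgreSQL driver registered elsewhere, and history is assumed to be NOT NULL with an empty default.

```go
package main

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
)

// claimSQL locks one pending row. SKIP LOCKED makes concurrent workers
// pass over rows another transaction already holds instead of waiting.
// Hypothetical schema: jobs(id, status, payload, history NOT NULL DEFAULT '').
const claimSQL = `
SELECT id, payload
FROM jobs
WHERE status = 'pending'
ORDER BY id
LIMIT 1
FOR UPDATE SKIP LOCKED`

// finishSQL stores the new status and appends a note to the history field.
const finishSQL = `
UPDATE jobs
SET status = $2, history = history || $3
WHERE id = $1`

// outcome maps a handler result to the status and history note to store.
func outcome(result string, err error) (status, note string) {
	if err != nil {
		return "failed", err.Error() + "\n"
	}
	return "done", result + "\n"
}

// processOne claims one job, runs handle on its payload, and records the
// result in the same row. It reports false when no job was available.
func processOne(ctx context.Context, db *sql.DB, handle func(string) (string, error)) (bool, error) {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return false, err
	}
	defer tx.Rollback() // has no effect once Commit has succeeded

	var id int64
	var payload string
	err = tx.QueryRowContext(ctx, claimSQL).Scan(&id, &payload)
	if errors.Is(err, sql.ErrNoRows) {
		return false, nil // nothing pending, or every pending row is locked
	}
	if err != nil {
		return false, err
	}

	status, note := outcome(handle(payload))
	if _, err := tx.ExecContext(ctx, finishSQL, id, status, note); err != nil {
		return false, err
	}
	if err := tx.Commit(); err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	// processOne needs a registered driver and a live database,
	// so this sketch only prints the claim query.
	fmt.Println(claimSQL)
}
```

The row stays locked while handle runs, so a slow third-party call keeps the transaction open for that long. Whether that is acceptable depends on the workload.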

If the processing doesn’t need to be fast, it’s better to call the worker via crontab than to have it loop waiting for some record to be processed. The big advantage of using crontab is that your program doesn’t stay in memory all the time; it always starts in a clean state, and you run no risk of a memory leak growing to the point of becoming a problem.

FTS

Instead of reaching for Elasticsearch, consider that both PostgreSQL and SQLite are great at full-text search. Recently I needed to create a small log server to collect logs from several systems and, most importantly, to be able to retrieve that information easily.

I used SQLite and a small Go server and the result was very good. A single FTS-type table receives the records with one field for the log itself, another to store the date, and another for the name of the system that produced the log.

With this small system it became very simple to search for occurrences, errors, and to get a view of the overall state of the system.
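A log table like the one described could look roughly like this. The sketch assumes SQLite's FTS5 module (the text does not say which FTS version was used) and a SQLite driver registered with database/sql; the table, column, and function names are made up for illustration.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"strings"
)

// schemaSQL creates one FTS5 table: the log text is indexed, while the
// date and system name are stored but kept out of the index (assumption).
const schemaSQL = `
CREATE VIRTUAL TABLE IF NOT EXISTS logs
USING fts5(message, created_at UNINDEXED, system UNINDEXED)`

// searchSQL returns the best-ranked matches for an FTS5 query string.
const searchSQL = `
SELECT created_at, system, message
FROM logs
WHERE logs MATCH ?
ORDER BY rank
LIMIT 50`

type entry struct {
	CreatedAt, System, Message string
}

// quotePhrase wraps user input in an FTS5 string literal so that operators
// typed by the user are matched as text; embedded quotes are doubled.
func quotePhrase(s string) string {
	return `"` + strings.ReplaceAll(s, `"`, `""`) + `"`
}

// search runs a phrase query and collects the matching log entries.
func search(ctx context.Context, db *sql.DB, phrase string) ([]entry, error) {
	rows, err := db.QueryContext(ctx, searchSQL, quotePhrase(phrase))
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var out []entry
	for rows.Next() {
		var e entry
		if err := rows.Scan(&e.CreatedAt, &e.System, &e.Message); err != nil {
			return nil, err
		}
		out = append(out, e)
	}
	return out, rows.Err()
}

func main() {
	// Without a driver and a database file, only the quoting can be shown.
	fmt.Println(quotePhrase(`disk "full"`))
}
```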

Don’t use panic

When programming in Go, handle every error. I know how repetitive it is; sometimes we handle errors that, if they happen, mean the operating system itself is in trouble. But I can’t tell you how many times I skipped handling an error that seemed impossible to happen, and of course it happened.

Don’t recover after a panic

Sometimes we write systems that can’t go down; that’s the justification for the system recovering from a panic.

But the panic exists to tell you that something went very wrong and needs your attention. Ideally, when a program panics, the right thing to do is write a test that reproduces the problem and then handle the error.

To keep the system up, use a service manager like supervisor, for example.

That way, over time your system will become more and more resilient.

Handle the error where it occurs

When I programmed in C++ I would disable try/catch directly in gcc with the -fno-exceptions flag; it’s always preferable to handle errors where they occur. In Go, for example, never use a function to hide the if err != nil. That creates an indirection that makes the code less explicit.
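As an illustration of keeping the check explicit and next to the call that can fail, here is a small sketch. parsePort is a hypothetical helper, not something from a real project.

```go
package main

import (
	"fmt"
	"strconv"
)

// parsePort converts a string to a TCP port number. Each failure is
// handled right where it happens and wrapped with local context.
func parsePort(s string) (int, error) {
	n, err := strconv.Atoi(s)
	if err != nil {
		return 0, fmt.Errorf("parse port %q: %w", s, err)
	}
	if n < 1 || n > 65535 {
		return 0, fmt.Errorf("port %d out of range 1-65535", n)
	}
	return n, nil
}

func main() {
	port, err := parsePort("8080")
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Println("listening on port", port)
}
```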

Never trust the data you receive

It doesn’t matter if it’s a REST API or internal functions of your program, never trust the user, even if that user is you in the future. All data must be validated.
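One simple way to apply this is to give each input type a validation method and call it before doing anything else with the data. The createUser type and its limits below are invented for the example.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// createUser is a hypothetical request payload.
type createUser struct {
	Name string
	Age  int
}

// validate rejects values the rest of the program should never see.
func (r createUser) validate() error {
	if strings.TrimSpace(r.Name) == "" {
		return errors.New("name is required")
	}
	if len(r.Name) > 100 {
		return errors.New("name is longer than 100 bytes")
	}
	if r.Age < 0 || r.Age > 150 {
		return fmt.Errorf("age %d is out of range", r.Age)
	}
	return nil
}

func main() {
	req := createUser{Name: " ", Age: 30}
	if err := req.validate(); err != nil {
		fmt.Println("invalid request:", err)
		return
	}
	fmt.Println("request accepted")
}
```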

Always validate interface types

When a Go function receives a parameter that’s an interface, the compiler itself ensures it’s of the correct type. The real problem is when we use an empty interface and then convert it to a specific type with a type assertion. If you use the single-value form of the assertion and the value isn’t of the type you expect, or if it’s nil, the program will panic.

Besides, an interface value is internally a pair of pointers (one to the type, one to the data), so it can be nil.
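The two-value form of the type assertion tests the type without panicking, and it also covers the nil case. A small sketch, with describe as a made-up function name:

```go
package main

import "fmt"

// describe accepts any value and only treats it as a string after the
// two-value type assertion has confirmed that it is one.
func describe(v any) string {
	s, ok := v.(string)
	if !ok {
		// Reached for other types and for a nil interface alike.
		return fmt.Sprintf("not a string: %T", v)
	}
	return "string: " + s
}

func main() {
	fmt.Println(describe("hello"))
	fmt.Println(describe(42))
	fmt.Println(describe(nil))
}
```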

Always check for nil

Whenever you receive a pointer or an interface, check whether it’s nil before using or trying to convert it.

A good practice for detecting these problems is running linters on the code. golangci-lint applies several linters that detect these and many other problems.
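The nil checks themselves are short. One detail worth showing is that an interface holding a nil pointer is not itself equal to nil, so the pointer still has to be checked after the conversion. The user type and function names here are hypothetical.

```go
package main

import "fmt"

type user struct{ Name string }

// greeting checks the pointer before dereferencing it.
func greeting(u *user) string {
	if u == nil {
		return "hello, guest"
	}
	return "hello, " + u.Name
}

// nameOf converts an interface to *user and then still checks for nil,
// because an interface can wrap a nil *user without being nil itself.
func nameOf(v any) (string, bool) {
	u, ok := v.(*user)
	if !ok || u == nil {
		return "", false
	}
	return u.Name, true
}

func main() {
	var missing *user
	fmt.Println(greeting(missing))
	fmt.Println(nameOf(missing))
	fmt.Println(nameOf(&user{Name: "Ana"}))
}
```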

Always check the length of arrays

Programmers are used to checking the length of arrays (or slices, in Go) in for loops, but they often forget to check the array they receive when they’re only interested in the first item.
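In Go, indexing an empty or nil slice panics at runtime, so the check is a single line before the access. firstTag is an illustrative name only.

```go
package main

import "fmt"

// firstTag returns the first element and reports whether there was one.
// len works on a nil slice too, so a single check covers both cases.
func firstTag(tags []string) (string, bool) {
	if len(tags) == 0 {
		return "", false
	}
	return tags[0], true
}

func main() {
	if tag, ok := firstTag([]string{"go", "sql"}); ok {
		fmt.Println("first tag:", tag)
	}
	if _, ok := firstTag(nil); !ok {
		fmt.Println("no tags")
	}
}
```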

Don’t break your API

When creating an API, be minimalist. Avoid adding extra features that might seem interesting during implementation; keep a lean specification. If you overload your API with features, it becomes more costly to make changes and updates to your code in the future.

Don’t have versions of the same API

Few things are as much work as maintaining two different versions of the same API. But things change, and it’s often unavoidable that you’ll need to change something that potentially breaks the API and you’ll have to maintain two (or more) versions. Instead, try to design the API so that it can take additions; usually there’s no problem adding an extra field to a call’s response or even adding a new feature, as long as the existing ones keep working and stay tested. The devs (internal and the clients’) will thank you.
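To see why an added field is usually harmless, consider a client written against the older response shape. With Go's encoding/json, unknown fields are ignored by default, so the old client keeps working. The order types and the currency field are invented for this sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// orderV1 is the response shape an existing client was written against.
type orderV1 struct {
	ID    int    `json:"id"`
	Total string `json:"total"`
}

// order is the server's current response; Currency was added later.
type order struct {
	ID       int    `json:"id"`
	Total    string `json:"total"`
	Currency string `json:"currency"`
}

// oldClientDecode parses a response the way the older client would.
// json.Unmarshal skips fields the target struct does not declare.
func oldClientDecode(body []byte) (orderV1, error) {
	var o orderV1
	err := json.Unmarshal(body, &o)
	return o, err
}

func main() {
	body, err := json.Marshal(order{ID: 7, Total: "19.90", Currency: "BRL"})
	if err != nil {
		fmt.Println("encode:", err)
		return
	}
	old, err := oldClientDecode(body)
	if err != nil {
		fmt.Println("decode:", err)
		return
	}
	fmt.Printf("%+v\n", old)
}
```

Clients in other languages, or Go clients that opt into strict decoding, may behave differently, so this is worth confirming for the consumers you actually have.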

Tests are essential

Tests are essential, but focus first on testing features, not code.

In compiled languages the test coverage doesn’t need to be very high, because the compiler has already caught the syntax errors, the type errors, and many oversights.

In a scripting language you need to be more rigorous.

Any function that has SQL inside it needs to be fully tested. The reason is that even if you’re using a compiled language, the SQL code is a little piece of interpreted code inside your compiled program, and the compiler has no way of knowing whether that snippet is valid.

Keep the corporate card funded

It has nothing to do with programming, but I put this point here because I’m sure several senior programmers will laugh when they read it and remember the times they spent looking for the reason the system went down, only to find out it was because the person in charge of the corporate card let it run out of funds.

Cesar Gimenes