How to do a good RCA

Larry Kiniu
Aug 6, 2021

RCA stands for Root Cause Analysis. In life and work, we always encounter problems. Usually, problems present themselves as symptoms.
You might have a headache, your car might not start in the morning, or your NodeJS processes might still be running in your task manager after you hit Ctrl+C in your VSCode integrated terminal (oddly specific! 😅).
While you can get rid of these problems quickly by treating the symptoms, e.g. taking a painkiller, taking the bus to work or just restarting your computer, it would get pretty annoying and frustrating to have to do this all the time (and it’s probably not healthy to take painkillers that frequently). The best thing would be to find the root cause of the problem and fix that permanently.

A little background

I’m a software engineer at Microsoft, and while my parents think all I do is this 👇🏾,

[Image: hacker]

one of my team’s core priorities is to create and foster best-in-class engineering systems and developer experience. Recently, a colleague pointed out that hitting Ctrl+C in the VSCode integrated terminal was not behaving as expected.
They were running a command on our JS monorepo to test something, and when they tried to stop it, the terminal showed that the command had stopped, but in the task manager the NodeJS processes were still running! On trying to rerun the command, they would get the infamous EADDRINUSE NodeJS error.

[Image: EADDRINUSE]
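If you have never seen this error up close, here is a minimal sketch of how it can arise. It assumes the leftover process is still holding the port that the rerun wants, which is my reading of the symptoms rather than a confirmed diagnosis. It is written in Python rather than NodeJS to keep it short, and the function names are made up for illustration. The error code comes from the operating system, so NodeJS reports the same condition under the same name.

```python
import errno
import socket


def try_bind(port: int) -> socket.socket:
    """Bind a TCP listener to 127.0.0.1:port and return it (caller closes it)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
        sock.listen()
    except OSError:
        sock.close()
        raise
    return sock


def reproduce_address_in_use() -> int:
    """Return the errno seen when a second listener asks for an occupied port."""
    first = try_bind(0)  # port 0 asks the OS for any free port
    port = first.getsockname()[1]
    try:
        # Stands in for rerunning the command while the old process lives on.
        second = try_bind(port)
    except OSError as exc:
        return exc.errno
    else:
        second.close()
        return 0
    finally:
        first.close()


if __name__ == "__main__":
    code = reproduce_address_in_use()
    print(errno.errorcode.get(code, "no error"))
```

The first listener plays the role of the process that survived Ctrl+C, and the second plays the rerun. Once the first socket is closed, the port is free again, which is why killing the stray processes by hand makes the symptom go away.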

So how do you go about this?

Replicate, Rationalize & Reach out

The first step I normally take is to try and replicate the problem. This not only lets you experience the pain and frustration that someone else is going through, but also helps you understand the exact steps that got them into that situation.
Imagine you had a tummy-ache. If you go to the doctor, he or she will ask what you ate yesterday. If you had eaten 3-day-old unrefrigerated Chinese food, then he or she will probably get a good idea of how things played out. Fortunately for the doctors, they don’t have to eat the 3-day-old Chinese food that had not been refrigerated. Unfortunately for software engineers, we usually have to (comparatively speaking).

[Image: Chinese food]

Once you have managed to replicate the problem on your own, you can start rationalizing about why things turned out the way they did, drawing on past experience. You can create a few hypotheses and test them out to confirm your suspicions. If you are a new engineer (new is relative: even though I have worked as a software engineer for the past 7 years, I’m relatively new at Microsoft as of the time of writing this), you can reach out to colleagues or search the internet to find out whether others have experienced the same problem and, with luck, get an idea of what could be causing it and possibly a solution. Stack Overflow is a pretty good place to start.

Change, Catalogue & Confirm

So now you have a few theories to follow. What next, you ask? Well, at this point, you can start tweaking the variables. What are variables? These are the factors that can change. For instance, in our example, instead of running your project in VSCode, try running it in a standalone terminal and see if the behaviour persists. If it does, you can be fairly confident that the IDE is not what is causing the problem. Change the NodeJS version and see what happens. Change the project, run a different command, etc. While doing all this, it is advisable to catalogue all the actions and results. Write down what variables you changed, what actions you took and what the results were. This will be especially helpful when you hit a dead-end and don’t know what to do next.
One more thing you might want to do is to confirm all your actions & results. As the famous saying goes, ‘measure twice, cut once’.
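A notebook or a text file is enough for the catalogue, but if you like structure, here is a minimal sketch of the idea in Python. The class and method names are hypothetical, and the results recorded in the usage part are made-up placeholders, not what I actually observed. The rule it encodes is the one described above: if the problem persists after you change a variable, that variable is probably not the cause.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Trial:
    variable: str     # the single factor that was changed, e.g. "terminal"
    value: str        # what it was changed to
    reproduced: bool  # did the problem still occur after the change?


@dataclass
class ExperimentLog:
    trials: list[Trial] = field(default_factory=list)

    def record(self, variable: str, value: str, reproduced: bool) -> None:
        """Write down one change and what happened."""
        self.trials.append(Trial(variable, value, reproduced))

    def _results(self) -> dict[str, list[bool]]:
        # Group outcomes by variable, keeping the order they were first tried.
        grouped: dict[str, list[bool]] = {}
        for trial in self.trials:
            grouped.setdefault(trial.variable, []).append(trial.reproduced)
        return grouped

    def ruled_out(self) -> list[str]:
        """Variables where every change left the problem in place."""
        return [v for v, seen in self._results().items() if all(seen)]

    def suspects(self) -> list[str]:
        """Variables where at least one change made the problem disappear."""
        return [v for v, seen in self._results().items() if not all(seen)]


if __name__ == "__main__":
    log = ExperimentLog()
    # Placeholder outcomes for illustration only.
    log.record("terminal", "standalone terminal", reproduced=True)
    log.record("command", "a different command", reproduced=False)
    print("ruled out:", log.ruled_out())
    print("suspects:", log.suspects())
```

Recording the same variable twice is a cheap way to do the confirming: a variable only stays in the ruled-out list if the problem showed up every time you changed it.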

[Image: to-do list]

Articulate, Alleviate or Adjourn

If you have gotten this far, you probably have a pretty good idea of what is causing the problem (or at least a hunch). At this point, I like to articulate the problem as I have understood it in an Azure DevOps work item.
I give it a title, which is a summary of the problem at hand, and a description. Now the description should be a bit detailed. It should include the steps to reproduce the problem, the cause of the problem and what steps can be taken to mitigate it or to solve it (more permanently).
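To make that structure concrete, here is a small sketch that assembles a title and the three parts of the description into plain text you could paste into a work item. It does not call any Azure DevOps API. The `WorkItem` class and its field names are my own invention for illustration, and the sample values in the usage part are loosely based on the example in this article.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class WorkItem:
    title: str              # one-line summary of the problem
    repro_steps: list[str]  # exact steps that trigger the problem
    cause: str              # what you believe the root cause is
    mitigation: str         # short-term workaround or permanent fix

    def description(self) -> str:
        """Render the three parts of the description as plain text."""
        steps = "\n".join(
            f"{number}. {step}"
            for number, step in enumerate(self.repro_steps, start=1)
        )
        return (
            "Steps to reproduce:\n"
            f"{steps}\n\n"
            "Cause:\n"
            f"{self.cause}\n\n"
            "Mitigation or fix:\n"
            f"{self.mitigation}"
        )


if __name__ == "__main__":
    item = WorkItem(
        title="NodeJS processes keep running after Ctrl+C in the VSCode terminal",
        repro_steps=[
            "Run the command in the VSCode integrated terminal",
            "Press Ctrl+C to stop it",
            "Rerun the same command",
        ],
        cause="Related to a 3rd party open-source package (illustrative wording)",
        mitigation="End the leftover NodeJS processes before rerunning",
    )
    print(item.title)
    print(item.description())
```

Keeping the three parts as separate fields makes it obvious when one of them is still empty, which is usually a sign that the investigation is not finished yet.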
If the problem falls under my team’s domain, I can take it up and mitigate it or solve it if possible. Some problems might be outside my control, for instance when a 3rd party dependency needs to be fixed (although in that case you can raise a GitHub issue or, if you are adventurous, create a PR with a fix).
While we software engineers pride ourselves on solving these kinds of issues (my online alias is the Bug Slayer, literally slaying bugs 🐱‍👤), solutions for some problems can elude us and it’s ok to punt on them until we have more clarity.

[Image: Azure DevOps work item]

Conclusion

I hope that this gives more clarity on how you can go about finding the root cause of a problem in your project. The real knack of a good software engineer is treating the root cause rather than just the symptoms. If you do this enough times, you are guaranteed to become an invaluable engineer in your team and organisation. 👇🏾

[Image: hacker]

And for those who are curious to know if I managed to solve the NodeJS problem, unfortunately not yet (as of writing this). It was related to a 3rd party open-source package and someone had beaten me to opening the issue on GitHub.
In this article, we talked about how to do an RCA on your local dev environment. What if you encounter problems on your DevOps pipelines (whether in Azure or AWS)? How would you go about it? Will these steps help in such a case? Do you have more tips and steps for doing an RCA?
Reach out with any questions, and see you in the next article.

Larry Kiniu

Sometimes I code, but most times I solve problems.