When Success is an Exception

"Everything fails all the time"

The famous words of Werner Vogels remind us again and again: never take things for granted.

Every experienced IT professional has hurt herself at some time or other. All the experts across the globe have reminded us in many different ways. Everyone knows that error handling is absolutely important. Yet, this understanding does not translate into code. It is unfortunate that most of our applications have a very bad error handling mechanism - if any.

Root Cause

I have seen new developers learn from a variety of tutorials. And I feel the root of the problem is right there. Most of the introductory tutorials and videos available to us (developed by well-meaning experts) do not talk about errors. Somehow, error handling is left out, to be learnt "when you grow up". Unfortunately, we never grow up and our code does not handle errors.

Due to this mindset, we think only in the space of the happy scenarios - anything else is an error. Our user stories, design, test cases and the code - all focus on the happy scenario. Because of our limited training, this "anything else" is too wide to be conceived or handled appropriately. So it remains incomplete. And the outcome is that our applications do what they should not, and fail when they should not.


Developing for Errors

As they say, the greatest hurdle in exception handling is that we call it an exception. We have to understand that error is not an exception. Error is the normal behavior. Success is an exception.

That also does not mean going overboard with flooding the code with error handling. That would leave the business logic buried under unmanageable debris. Each line of code cannot check all the errors all over again.

Error handling has to be delegated. But we should know what we are doing when we do so. I think the right way to do it is by defining the error space properly. If we are clear on that, the error handling will naturally show up in our design and code.

This is an attempt to enumerate the different aspects and types of errors that can trouble an application. It is based on what I have faced and what I remember from my years of experience. I know it is incomplete. Please add more in the comments so that we can build a good collection here.

Primarily, we can have four types of errors. Let's check them one by one.

Input Errors
Dependency Errors
System Errors
Internal Errors

Input Errors

A typical microservice gets an input request from another service. It reacts to such an input, processes it and perhaps returns a response. This input may or may not be what we expect. It could have any of these problems:

wrong data
wrong format
wrong source of data
out of sequence
delayed data

And of course, it could also be what we really need there. But as we said, that is an exception. We can check that later.

We cannot pack all the resilience into a single microservice. That will defeat the microservice architecture. A service has to depend upon its neighbors - to provide what it needs. It is not wrong to believe in the authorization/authentication service for ensuring security, or in the caller service to provide the right data in the right format. But, we should know that these services can fail. They can have errors, and we have to break the chain when that happens.

There is a fine line between the two. There is a difference between implementing the complete authorization/authentication in every service, and ensuring that the services are resilient when the authorization/authentication service fails. There is a difference between verifying the input data, and rebuilding the input data within the service.

Each architect has to identify this line in the context of the given service. However, from my experience, it is always better to push very hard towards the safer side of the line. Some lazy developers and irrational managers are going to erode it anyway.

People tend to ignore the time and sequence of events. In a live system, data coming late or out of sequence is as bad as wrong data. This can be a major problem when inputs come in from two different services. We have to ensure the correct sequence. It is not enough to just hope for it. The time and sequence validation has to be a part of input validation. So it has to be a part of the data architecture itself.
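To make this concrete, here is a minimal sketch of input validation that covers source, format, data, delay and sequence in one place. Everything in it is an assumption for illustration: the Event fields, the TRUSTED_SOURCES and MAX_AGE_SECONDS values, and the rule that each source numbers its requests 1, 2, 3 with no gaps. A real service would pick its own rules from its data architecture.

```python
from dataclasses import dataclass


class InputError(Exception):
    """Raised when a request fails input validation."""


@dataclass(frozen=True)
class Event:
    source: str      # name of the calling service
    sequence: int    # per-source counter set by the sender
    sent_at: float   # sender timestamp, seconds since the epoch
    payload: dict


# Hypothetical policy values; a real service would load them from config.
TRUSTED_SOURCES = {"orders", "billing"}
MAX_AGE_SECONDS = 30.0


class InputValidator:
    def __init__(self):
        self._last_sequence = {}  # source -> last accepted sequence number

    def validate(self, event, now):
        """Raise InputError naming the first problem found; return None if clean."""
        if event.source not in TRUSTED_SOURCES:
            raise InputError("wrong source")
        if not isinstance(event.payload, dict) or "amount" not in event.payload:
            raise InputError("wrong format")
        amount = event.payload["amount"]
        if isinstance(amount, bool) or not isinstance(amount, (int, float)) or amount <= 0:
            raise InputError("wrong data")
        if now - event.sent_at > MAX_AGE_SECONDS:
            raise InputError("delayed data")
        # Assumed rule: each source counts up by one, starting at 1.
        expected = self._last_sequence.get(event.source, 0) + 1
        if event.sequence != expected:
            raise InputError("out of sequence")
        self._last_sequence[event.source] = event.sequence
```

Note that the sequence check only works because the sender puts a counter and a timestamp in the request. That is why time and sequence belong in the data architecture and cannot be bolted on later.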

Dependency Errors

A microservice rarely does the whole job by itself. Often it has to invoke APIs in other services. It is possible that those services are not stable. In such a case, the response from that service may not be what we want it to be. Such a response has to be validated.

Response validation could mean a simple error code check, or validating the response body itself. This depends upon the overall architecture, and the logical distance between the two services. Of course, it makes no sense to rebuild the entire response in order to validate it. I don't want to mention the other extreme. But that is what most developers do.
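As a rough illustration of the middle ground, the sketch below checks the error code first, then the shape of the body, then one cheap consistency rule. The pricing response, its field names (item_id, price, currency) and the function name are all hypothetical. The point is only that we verify the response without recomputing it.

```python
class DependencyError(Exception):
    """Raised when a downstream response cannot be trusted."""


def validate_price_response(status_code, body, requested_item_id):
    """Return the body if it passes all checks, else raise DependencyError."""
    # Cheapest check first: the error code.
    if status_code != 200:
        raise DependencyError(f"unexpected status {status_code}")
    # Shape check: the fields we are about to use must be present.
    if not isinstance(body, dict):
        raise DependencyError("body is not an object")
    missing = {"item_id", "price", "currency"} - set(body)
    if missing:
        raise DependencyError(f"missing fields: {sorted(missing)}")
    # Consistency check: the answer must match the question we asked.
    if body["item_id"] != requested_item_id:
        raise DependencyError("response is for a different item")
    price = body["price"]
    if isinstance(price, bool) or not isinstance(price, (int, float)) or price < 0:
        raise DependencyError("price is not a non-negative number")
    return body
```

How deep to go depends on the logical distance between the two services. For a close neighbor, the status check alone may be enough.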

But the point is that no service should take other services for granted. They are going to fail at some time. Our service should be able to discover this and break the chain when that happens. Any API call has to be made with the understanding that it will fail - and it could also succeed.

When a downstream service fails, it could result in several problems:

Data Loss
Incorrect response
Service down
Delay

We have to make sure we minimize these for our service - when a downstream service fails. The input request may be important. Even if we are not able to fulfil it, the contents of the request may be important. Is it relevant after some time? Should we save it or discard it? The architect has to answer these questions - to decide the right approach based on the context of the service.
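One possible answer to "save it or discard it" is to park the request for a limited time and drop it once it is no longer relevant. The sketch below assumes exactly that policy. The class name, the size limit and the time-to-live are hypothetical, and the store is in memory only, so it would need a durable queue behind it to survive a restart.

```python
from collections import deque


class ParkedRequests:
    """Holds requests we could not fulfil, until they go stale."""

    def __init__(self, max_size, ttl_seconds):
        self._queue = deque()   # (expires_at, request), oldest first
        self._max_size = max_size
        self._ttl = ttl_seconds

    def park(self, request, now):
        """Save the request. Returns False if the store is full and it was discarded."""
        self._drop_stale(now)
        if len(self._queue) >= self._max_size:
            return False
        self._queue.append((now + self._ttl, request))
        return True

    def drain(self, now):
        """Return the requests that are still relevant, oldest first, and empty the store."""
        self._drop_stale(now)
        fresh = [request for _, request in self._queue]
        self._queue.clear()
        return fresh

    def _drop_stale(self, now):
        # The TTL is constant, so the oldest entry always expires first.
        while self._queue and self._queue[0][0] <= now:
            self._queue.popleft()
```

When the downstream service is back, drain() hands over whatever is still worth replaying. Everything else has already been discarded on purpose, not by accident.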

If an error in the downstream service is not identified, our service will end up returning an incorrect response. This error can potentially cascade through the entire system, causing havoc. As we saw before, validating the response does not mean building the response all over again. But, the data architecture should provide a way to accurately verify the response of any API. This is important for breaking the chain.

Now what do we do when we find out that the downstream service is not stable and will not process our requests? Shut ourselves down? That is certainly not the solution. A microservice should be resilient. It should be able to survive, stay alive and resume when the downstream system is back. The service should have the ability to sense when the other service is back. It should have a way to handle the requests when the other service is not available. It should be able to give a meaningful response to any request.
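The usual name for this behavior is a circuit breaker. Below is a minimal, single-threaded sketch of one, not a production implementation. The thresholds are hypothetical, the clock is injectable only to make it testable, and "fallback" stands for whatever meaningful response the service can give on its own (a cached value, a clear error, and so on).

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, retry_after_seconds=10.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._retry_after = retry_after_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, downstream, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._retry_after:
                # Circuit is open: answer immediately, do not touch downstream.
                return fallback()
            # Cooling-off period is over: let this one call through as a probe.
        try:
            result = downstream()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._threshold:
                # Open the circuit, or re-open it after a failed probe.
                self._opened_at = self._clock()
            return fallback()
        # Success: downstream is back, resume normal operation.
        self._failures = 0
        self._opened_at = None
        return result
```

The probe after the cooling-off period is how the service senses that the other side is back, without anyone restarting anything.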

Any microservice should follow the principle of "Fail Fast and Fail Safe". Any failure should not result in delay. In fact, it should mean an instant error response. This could mean adding timeouts, liveness checks, or any of those patterns. But one service should not delay a response just because a downstream service is running slow. This chain has to be broken. In the ideal scenario, we should try our best to avoid synchronous chaining. But when it is essential, this error has to be handled and managed with all caution.
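Here is a small sketch of the timeout part, using only the standard library. The function names, the result dictionary and the slow_downstream stand-in are hypothetical. One caveat I should state loudly: a thread that has already started cannot be killed, so the slow call keeps running in the background. The sketch only guarantees that our caller gets an answer on time.

```python
import concurrent.futures
import time

# A shared pool. A per-call "with" block would wait for the slow task on exit,
# which would defeat the whole point of the timeout.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=8)


def call_with_deadline(downstream, timeout_seconds):
    """Run downstream() but give up waiting after timeout_seconds."""
    future = _POOL.submit(downstream)
    try:
        return {"ok": True, "value": future.result(timeout=timeout_seconds)}
    except concurrent.futures.TimeoutError:
        future.cancel()  # only helps if the task has not started yet
        return {"ok": False, "error": "downstream timed out"}
    except Exception as exc:
        return {"ok": False, "error": f"downstream failed: {exc}"}


def slow_downstream():
    """Hypothetical stand-in for a dependency that is running slow."""
    time.sleep(0.5)
    return "late"
```

Calling call_with_deadline(slow_downstream, 0.05) comes back with the timeout error long before the half second is over, so the delay is not passed up the chain.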

System Errors

Everything can fail - not just my code. Services can fail for a variety of reasons. An upgraded OS could throw tantrums. A network connection can fail. A nerd may push in the wrong version of code... anything can go wrong! The world around may not stop with this. Will our code continue? It should.

Internal Errors

Enough of blaming. We tried blaming the downstream service, and also the input data. Then we blamed the OS, network, and all that we could. Nothing helped. The problem is in our own service! That is when we try to look within. Isn't that too late? The errors should be checked when we code.

Any microservice should be split into components that work in smaller chunks of tasks. These independent logical blocks should be guarded against each other - with strong error handling. The data and command flow among these components should be validated.
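One way to picture such guarding is a small wrapper at each component boundary that validates what comes out and contains whatever blows up inside. This is only a sketch under my own assumptions: the guarded decorator, StageError, and the two toy stages (parse_quantity and price_for, with a made-up unit price) are all hypothetical names.

```python
import functools


class StageError(Exception):
    """A failure contained at a component boundary."""


def guarded(stage_name, check_output):
    """Wrap a component so its failures and bad outputs stop at its boundary."""
    def decorate(stage):
        @functools.wraps(stage)
        def wrapper(value):
            try:
                result = stage(value)
            except Exception as exc:
                raise StageError(f"{stage_name} failed: {exc!r}") from exc
            if not check_output(result):
                raise StageError(f"{stage_name} produced invalid output: {result!r}")
            return result
        return wrapper
    return decorate


@guarded("parse", lambda r: isinstance(r, int))
def parse_quantity(text):
    return int(text)


@guarded("price", lambda r: r >= 0)
def price_for(quantity):
    return quantity * 25  # hypothetical unit price, in cents


def handle(text):
    """Run the two stages; any contained failure becomes a clear error response."""
    try:
        return {"ok": True, "total_cents": price_for(parse_quantity(text))}
    except StageError as exc:
        return {"ok": False, "error": str(exc)}
```

With this in place, an input we never imagined (say, a negative quantity) is stopped at the pricing boundary instead of flowing out as a negative total.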

The business logic we have developed worked for the inputs that we had imagined. Of course, we have missed scenarios. What will happen when those inputs are fed into our code? Will it go wrong? How wrong can it go? How can we restrict the potential harm caused by a code error in this service?

The architect and developer should continuously ask these questions. This cannot be done in a code review session on the last day of the sprint. It has to be an integral part of the thought process. Error handling should become a habit. Then it will naturally seep into each line of our code.

Until then, it is limited to blogs and conferences.

Summary

The whole idea in error handling is that we should doubt everything. It may be a well-established S3 service on the AWS cloud or it could be code written by the newly hired intern. Potentially, anything can fail.

Some errors have a higher probability than others. And some errors have a higher cost than others. Usually, the actual cost and probability are much higher than we imagine. It is impossible to plan for and handle all possible errors. We have to identify the ones that we can ignore. And even if we decide to ignore some errors, we should have a detailed analysis of what errors we are planning to ignore.
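A simple way to keep that analysis honest is to write the errors down with an estimated probability and cost, and rank them by the product of the two. The sketch below assumes that crude "probability times cost" measure. The Risk fields, the triage function and the threshold are hypothetical, and the numbers you feed in are your own estimates, not facts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    probability: float  # estimated chance of occurring in a given period, 0..1
    cost: float         # estimated cost per occurrence, in any consistent unit

    @property
    def exposure(self):
        return self.probability * self.cost


def triage(risks, ignore_below):
    """Split risks into (handle, ignore), each ranked by exposure, highest first."""
    ranked = sorted(risks, key=lambda r: r.exposure, reverse=True)
    handle = [r for r in ranked if r.exposure >= ignore_below]
    ignore = [r for r in ranked if r.exposure < ignore_below]
    return handle, ignore
```

The "ignore" list is the useful part. It is the written record of the errors we decided not to handle, which is very different from the errors we never thought about.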