In last month’s issue, we tackled the Pets vs. Cattle discussion within the cloud-native movement. In this issue, we tackle the notion of stateless-only cloud-native apps and inspect their practicality.
What is Stateless?
First, let’s explain what is typically meant by stateless. It means that when a service is called, you pass it a set of inputs and then based on those it will return a result. It does this every time you call it. It doesn’t remember what you asked it to do the last time you called it.
As a good example, consider a REST service deployed across multiple OS instances, where a load balancer distributes traffic equally to all of the REST service instances without special regard to any one of them. In this case, an instance can be shut down and later returned to service without major disruption. In a nutshell, this is a stateless service.
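To make the load-balanced example above concrete, here is a minimal sketch in Python. The names `quote`, `Instance`, and `RoundRobinBalancer` are hypothetical and exist only for illustration; a real deployment would use an actual load balancer rather than an in-process class.

```python
from itertools import cycle


def quote(weight_kg: int, cents_per_kg: int) -> int:
    """Stateless handler: the result depends only on the inputs."""
    return weight_kg * cents_per_kg


class Instance:
    """One deployed copy of the service; it keeps no per-request memory."""

    def __init__(self, name: str) -> None:
        self.name = name

    def handle(self, weight_kg: int, cents_per_kg: int) -> int:
        return quote(weight_kg, cents_per_kg)


class RoundRobinBalancer:
    """Hypothetical balancer that treats every instance the same."""

    def __init__(self, instances: list) -> None:
        self.instances = list(instances)
        self._next = cycle(self.instances)

    def call(self, weight_kg: int, cents_per_kg: int) -> int:
        # Any instance may answer; the caller cannot tell which one did.
        return next(self._next).handle(weight_kg, cents_per_kg)

    def remove(self, name: str) -> None:
        # Simulate shutting an instance down; the rest keep serving.
        self.instances = [i for i in self.instances if i.name != name]
        self._next = cycle(self.instances)
```

Because every instance computes the same answer from the same inputs, removing one from the pool does not change the result of later calls. What this sketch does not model is a request that is in flight at the moment an instance goes away.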
Now of course many will say that there is zero disruption to the business when this happens, but in reality some mid-flight requests will probably be lost. When mid-flight transactions are lost because a service is shut down, an error-handling mechanism is typically activated to sweep up the failed transactions/requests and clean them up (in the enterprise space this pattern is sometimes called a compensator function). One may be tempted to say that, since the service is stateless, there is no disruption, but the activation of the compensator function is itself a form of disruption. Here is why: between the time the request failed and the time the compensator remedies it, someone was affected, perhaps the user/customer or, if the failure was masked from the customer, the internal processes. Therefore, it is not true to say there was zero disruption. In fact, how much disruption there is depends on how smartly your compensation mechanism is coded to deal with the failure quickly.
Services coded this way, with the claim that the service does not have to remember much because it is completely stateless, are slightly misleading from an overall enterprise perspective. These services are indeed stateless during their own execution, but not within the overall workflow of the enterprise transaction: the transaction that goes beyond the individual service call and has to traverse other systems, especially when failures must be handled. When the service delegates failure handling to the compensator service, it has in essence shifted the responsibility to another part of the system. The service itself seems stateless, but from a business transaction perspective the service and its associated compensator service pair together to form the state context. As such, we humbly believe that you can never escape state management/persistence in any real enterprise business logic.
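The pairing of a stateless service with a compensator can be sketched as follows. This is an illustrative sketch only: the `Ledger` class, the status strings, and the `instance_up` flag are assumptions made for the example, not part of any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Hypothetical record of every request and its status.

    This is where the state of the business transaction actually lives.
    """

    status: dict = field(default_factory=dict)

    def start(self, request_id: str) -> None:
        self.status[request_id] = "in_flight"

    def finish(self, request_id: str) -> None:
        self.status[request_id] = "done"


def stateless_service(request_id: str, ledger: Ledger, instance_up: bool) -> None:
    """Handle one request; the service itself remembers nothing between calls."""
    ledger.start(request_id)
    if not instance_up:
        # The instance was shut down mid-flight: the request is simply lost.
        return
    ledger.finish(request_id)


def compensator(ledger: Ledger) -> list:
    """Sweep requests left in flight and mark them as compensated."""
    swept = [rid for rid, s in ledger.status.items() if s == "in_flight"]
    for rid in swept:
        ledger.status[rid] = "compensated"
    return swept
```

The service function holds no state of its own, yet the compensator can only do its job because something (here the ledger) remembers which requests never finished. The state has moved; it has not disappeared.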
Now, building stateless services is very attractive because it is quite liberating not having to remember state (liberating in the sense that you do not have to provide fault-tolerance infrastructure), but this means you have to reconstruct state every time a service call is made. What do we mean by "reconstructing state"? We mean that, since there is no internal (persistent) cache within the service that can track state, you have to either recalculate everything from scratch on every service call or fetch the data needed to build state from somewhere else, a time-consuming practice. Every service call is constructed from static and dynamic data parts used within the service’s business logic definition. The static part should be cached, either internally or externally (depending on the use case, it may make sense to cache internally in some cases and externally in others). Some architects have relaxed this “stateless-only MSA” claim and now propose delegating state to a caching service (remote but nearby). This can work, but it is obviously not as optimal from a performance perspective as a service with an internal cache. On the other hand, it can be highly scalable.
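The difference between rebuilding the static part on every call and keeping it in an internal cache can be sketched with Python's standard `functools.lru_cache`. The function names, the product data, and the `BUILD_COUNT` counter are hypothetical and only serve to make the rebuilds visible.

```python
from functools import lru_cache

# Counts how often the static part is rebuilt (for illustration only).
BUILD_COUNT = {"static": 0}


def build_static_part(product_id: str) -> dict:
    """Stand-in for an expensive rebuild, e.g. fetching reference data."""
    BUILD_COUNT["static"] += 1
    return {"product_id": product_id, "base_price_cents": 1000}


def stateless_call(product_id: str, quantity: int) -> int:
    """Rebuilds the static part on every call, then applies the dynamic part."""
    static = build_static_part(product_id)
    return static["base_price_cents"] * quantity


@lru_cache(maxsize=128)
def cached_static_part(product_id: str) -> dict:
    """Internal cache: the static part is built once per product_id."""
    return build_static_part(product_id)


def stateful_call(product_id: str, quantity: int) -> int:
    """Reuses the cached static part; only the dynamic part changes per call."""
    static = cached_static_part(product_id)
    return static["base_price_cents"] * quantity
```

Three calls to `stateless_call("p1", ...)` trigger three rebuilds, while three calls to `stateful_call("p1", ...)` trigger one. The price of the second approach is that the cache is now state the service must not lose or let go stale.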
The somewhat modern term that MSA architects use to describe these distributed caching services is persistence services, a layer to which the stateless service delegates state, which is great! So how is a remote caching service or a persistence layer any different from what we have been doing all along? It is not. It is just that the original thought leaders of the cloud-native movement envisioned a world that is stateless, and have now learnt that you cannot really do anything practically useful in the enterprise without state.
There is a perceived notion that stateless services are less of a burden on infrastructure, which is true to some extent. However, you pay a substantial cost in transaction response time if you are not careful. This is particularly true if you are not leveraging state (a service-internal cache, a distributed cache, and/or persistence services) to speed up your ongoing transactions. Stateful services, for example, may maintain a near/edge cache, a service-plugged cache, or even a remote cache, where the cached data is the service’s state. As we mentioned earlier, this removes the need to reconstruct the static part of your business logic every time, and hence greatly improves service response time.
Of course, if you choose not to have state, infrastructure cost is lower. On the other hand, if you maintain some sort of state, you have to make your platform much more robust to make sure you don’t lose that state in the event of failures. The one caveat is that stateless services have to make additional calls, because they need to reconstruct the static part of the result every time. So while the infrastructure costs are perceived to be lower, they eventually end up being higher, especially because of the multiple trips through the load balancer. You have essentially shifted complexity to another layer.
When you go from a monolithic architecture to MSA, you get a substantial number of peer-to-peer distributed service calls, so what used to be a small set of calls under a monolithic setup may become thousands more calls with MSA, which eventually leads to additional compute utilization. It typically takes time to get to that level, but it does eventually manifest itself as increased network traffic and hence increased CPU utilization. The effects on infrastructure components may vary, but if you’re not careful, the load balancer, for example, may all of a sudden become a bottleneck due to the immense number of calls.
A compromise would be to have stateless services communicate with stateful caching services (or distributed call queues) to avoid having to reconstruct the static part of the transaction every time. You still incur an additional network hop (or several) compared with a stateful service’s plugged-in cache, but again it is better than not having state at all. The key to all of this is that MSA brings new complexity that traditional infrastructure simply cannot handle unless you use a cloud-native platform and a highly tuned virtualized layer, and in many cases a containerized virtual platform.
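The compromise described above resembles what is commonly called the cache-aside pattern. Below is a minimal sketch, assuming an in-memory `RemoteCache` class as a stand-in for a real distributed cache; the class, its `hops` counter, and the customer data are all hypothetical.

```python
class RemoteCache:
    """In-memory stand-in for a remote (but nearby) caching service."""

    def __init__(self) -> None:
        self._data = {}
        self.hops = 0  # each get/put would be a network round trip

    def get(self, key: str):
        self.hops += 1
        return self._data.get(key)

    def put(self, key: str, value: dict) -> None:
        self.hops += 1
        self._data[key] = value


def load_static_part(customer_id: str) -> dict:
    """Hypothetical expensive rebuild of the static part of a transaction."""
    return {"customer_id": customer_id, "discount_pct": 10}


def handle_order(cache: RemoteCache, customer_id: str, amount_cents: int) -> int:
    """Stateless handler: state lives in the cache, not in the service."""
    static = cache.get(customer_id)
    if static is None:
        # Cache miss: rebuild once, then store it for every other instance.
        static = load_static_part(customer_id)
        cache.put(customer_id, static)
    return amount_cents * (100 - static["discount_pct"]) // 100
```

In this sketch the first order for a customer costs two hops (a miss plus a store) and each later order costs one, which is the extra network hop the paragraph above refers to. Any service instance can handle any order, because none of them holds the cached data itself.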
Stateless Services Attributes
You have to build state every time
Doesn’t recall much
Cost of each call is higher
It is liberated from having to remember everything
Low infrastructure requirement during its early life, but it eventually ends up costing more unless you use an intelligent platform that can redistribute services for better resource utilization
Easier to deploy a new version of the app/service
Stateful Services Attributes
Builds once, recalls from a cache
Has the burden of remembering state, even after failures
Response time of each call is eventually lower
Higher infrastructure requirement
We believe it is a myth that cloud-native needs to be stateless. The notion that services must be stateless in order to be more cloud-native does not work in practical enterprise scenarios; you always have state somewhere. It is therefore likely that practical enterprise-grade cloud-native applications are a mix of roughly 80% stateless and 20% stateful services. The actual mix will vary with time: as you learn more about the interactions of the deployed services, or as they grow in popularity, you will adjust the percentages to suit, but they are never just stateless.