What is the thought behind the Tensai™ product range?
Tensai™ is a framework that enables end-to-end automation for build, release and run across applications, infrastructure, cloud and employee experience services. The three objectives we want to achieve through Tensai™ are agility, efficiency and experience.
The thought behind building the platform was: “How do we bring all three of these elements together for our customers across their IT landscape?” There are many products on the market for specific uses, but there’s still a gap when it comes to system-wide integration that provides meaningful insight to both the technical and business sides of the operation. Tensai™ is designed to produce meaningful real-time insights and boost productivity in how IT supports the business.
Tell us more about the components of Tensai™. What use does it make of AI?
Let’s start from the build cycle – we call it Tensai™ for Build. Then comes the release, and then we have it supported with Tensai™ for Digital Assurance. These are the three elements that focus on the agility part. Next comes the run part, where Tensai™ for AI Ops is key. It’s basically a platform that supports complete maintenance and sustenance across different platforms and integrates well with Cloud Ops. Last but not least, we have Tensai™ for User Experience, to enrich the employee experience.
Where does pre-emptive pattern recognition come into play?
That’s a key feature in Tensai™ for AI Ops. This platform digests data from multiple sources. It has an intelligent monitoring and management layer, where all the data gets collected – from the infrastructure, applications and so on. If there’s an anomaly, it starts analysing the event, judging it against the overall context. For instance, if a particular event happened at, say, 5pm, the system will analyse the behaviour of the other components that are linked to that infrastructure or application. For example, you might have a website where the response time spiked. We deploy what we like to call a “neighbourhood behaviour watch” to find answers to questions like: “Were any other systems or network devices impacted around the same time?” or “Were there other applications that could potentially be affected?” We do this using time-series-based algorithms. This helps us isolate the incident and tells us how severe the issue is, how likely downtime is, and so on.
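To make the “neighbourhood behaviour watch” idea concrete, here is a minimal sketch in Python. It is not Tensai™’s actual algorithm, which the interview does not describe; it simply assumes each linked component exposes one evenly sampled metric series and flags a neighbour when its readings around the event deviate strongly from its own readings outside that window. The function name, parameters and sample figures are all hypothetical.

```python
from statistics import mean, pstdev


def impacted_neighbours(series_by_component, neighbours, event_index,
                        window=2, z_limit=3.0, min_sigma=0.5):
    """Return the neighbours whose metric deviated around the event.

    series_by_component: {name: [values]} sampled at the same timestamps.
    event_index: position in the series where the original anomaly occurred.
    window, z_limit, min_sigma: illustrative tuning knobs (assumptions).
    """
    impacted = []
    for name in neighbours:
        values = series_by_component[name]
        lo = max(0, event_index - window)
        hi = min(len(values), event_index + window + 1)
        # Baseline = everything outside the window around the event.
        baseline = values[:lo] + values[hi:]
        if len(baseline) < 2:
            continue  # not enough history to judge this neighbour
        mu = mean(baseline)
        sigma = max(pstdev(baseline), min_sigma)  # avoid dividing by ~0
        peak = max(abs(v - mu) / sigma for v in values[lo:hi])
        if peak >= z_limit:
            impacted.append(name)
    return impacted


# Hypothetical readings: the website's response time spiked at index 6.
series = {
    "db":      [20, 22, 21, 19, 20, 21, 85, 88, 22, 20, 21, 19],
    "cache":   [50, 51, 49, 50, 50, 50, 51, 49, 50, 51, 50, 49],
    "network": [5, 6, 5, 4, 5, 6, 5, 30, 6, 5, 4, 5],
}
result = impacted_neighbours(series, ["db", "cache", "network"], event_index=6)
print(result)
```

In this made-up data the database and the network device moved around the same time as the event while the cache stayed flat, which is the kind of signal that helps narrow down where an incident started.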
Another thing we do is dynamic thresholding. Gone are the days when you could just define a static threshold, such as CPU utilisation above 80 per cent. That doesn’t work any more because there could be a time of the day, or a day of the week, when utilisation is supposed to go high because some processing is happening. The system learns a complete chart of the application’s behavioural pattern, and only if an anomaly is severe enough to cause an impact does it raise an alert.
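As a rough illustration of dynamic thresholding, the sketch below learns a separate baseline for every hour of the week and only flags a reading that sits well above what is normal for that slot. This is an assumed, simplified approach (mean plus a multiple of the standard deviation per slot), not a description of how Tensai™ implements it; the function names, the `k` and `min_std` parameters and the sample data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, pstdev


def learn_baseline(samples):
    """Group (timestamp, value) samples by (weekday, hour) and
    return {(weekday, hour): (mean, standard deviation)}."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[(ts.weekday(), ts.hour)].append(value)
    return {slot: (mean(vals), pstdev(vals)) for slot, vals in buckets.items()}


def is_anomalous(baseline, ts, value, k=3.0, min_std=1.0):
    """True if value exceeds the learned threshold for this time slot.

    k and min_std are illustrative tuning parameters (assumptions).
    """
    slot = (ts.weekday(), ts.hour)
    if slot not in baseline:
        return False  # no history for this slot, so do not alert
    mu, sigma = baseline[slot]
    return value > mu + k * max(sigma, min_std)


# Hypothetical history: four weeks of hourly CPU readings, normally 30 per
# cent, but 90 per cent every Monday at 02:00 when a batch job runs.
start = datetime(2024, 1, 1)  # a Monday
history = []
for h in range(24 * 7 * 4):
    ts = start + timedelta(hours=h)
    busy = ts.weekday() == 0 and ts.hour == 2
    history.append((ts, 90.0 if busy else 30.0))

baseline = learn_baseline(history)
monday_batch = is_anomalous(baseline, datetime(2024, 2, 5, 2), 92.0)
tuesday_afternoon = is_anomalous(baseline, datetime(2024, 2, 6, 14), 92.0)
print(monday_batch, tuesday_afternoon)
```

The same 92 per cent reading is treated as normal during the Monday batch window and as an anomaly on a Tuesday afternoon, which a single static 80 per cent threshold cannot express.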
And to be clear, this is still pre-emptive – the anomaly itself is enough to kick the system into action without human intervention at that stage?
Absolutely. The way we look at it, the best incident is the incident that never happened. Hence, as soon as the system detects an anomaly that needs to be actioned, the automation engine kicks in to analyse and resolve it before any human intervention is needed.
Glitches and anomalies happen fairly often. How does the software avoid overwhelming the user with the reports?
That’s a very good question. The intelligence grid, the event correlation engine, the dynamic thresholding – all these actively help eliminate false positives. We’ve actually reduced false positives by as much as 84 per cent for a client. Earlier, command centres would be busy using multiple tools, monitoring and classifying a constant stream of alerts, not knowing where a real incident was happening. But now there’s no need for humans to distinguish between false positives and real alerts. Only the real alerts will reach you, and the back end can then be integrated with the likes of ServiceNow or any other service management tool you use.
Can you give a couple of real-life examples of the system being put to good use?
Let me break this into two components. When we talk about IT, there is incident management and there is service request management – workflow-based changes like account creation, patching, commissioning or decommissioning of servers, and so on. On the incident part, as soon as our monitoring tools detect abnormal behaviour, the system recognises it and suggests how to fix it. We also managed to automate the solution for up to 40 per cent of the remaining incidents, using our self-healing bot. We have a repository of bots, which we are continuously expanding, for incidents such as disk space issues, CPU utilisation issues and issues that employees might encounter directly, like printer configuration and password resets – all of these are automated.
To answer your question, every organisation we work with is quite specific on the workflow side, but what we are doing there is automating the complete chain. For instance, if you have to commission or decommission a particular server, or run any similar workflow, there are multiple components that need to be stitched together. We have automated the whole chain there. For one of our customers, a ten-member team used to do nothing but this work; now our bots do that job without any manual effort.
How directly does this impact the business operation?
Very directly. The intelligence grid gets built into the system, which really helps customers derive meaningful insights. At the end of the day, we need to know not just how a server is behaving but how the front-end service is getting impacted and what this means for the business – that’s something the CIO would want to know.
We have created service maps, so at the highest level you will know which service is going to be impacted. And if you can help us put a monetary value on that service, then we can even create insights into the revenue loss caused by downtime. On a more positive side, we can monitor your product: the number of policies being issued per minute, if, say, you’re an insurance company, and the revenue impact of that. These are just two of the countless insights that can be drawn using our platform.
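The arithmetic behind that kind of insight can be sketched in a few lines of Python. The service map, component names and revenue figures below are entirely hypothetical, and the linear revenue-per-minute model is an assumption made for illustration rather than anything stated about the platform.

```python
def impacted_services(service_map, component):
    """Return the services that depend on the given component."""
    return sorted(name for name, deps in service_map.items() if component in deps)


def downtime_revenue_loss(service_map, revenue_per_minute, component, downtime_minutes):
    """Estimate revenue lost per service if `component` is down.

    Assumes revenue is lost at a constant per-minute rate (a simplification).
    """
    return {
        name: revenue_per_minute[name] * downtime_minutes
        for name in impacted_services(service_map, component)
    }


# Hypothetical service map for an insurer: service -> components it relies on.
service_map = {
    "policy-issuance": {"web-01", "db-01"},
    "claims-portal": {"web-02", "db-01"},
    "intranet": {"web-03"},
}
# Hypothetical monetary value of each service, per minute of availability.
revenue_per_minute = {
    "policy-issuance": 1200.0,
    "claims-portal": 300.0,
    "intranet": 0.0,
}

loss = downtime_revenue_loss(service_map, revenue_per_minute, "db-01", 15)
print(loss)
```

Starting from a failing server, the map points to the front-end services at risk, and the monetary value turns minutes of downtime into a figure a CIO can act on.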
Hemant Vijh is executive vice president and practice head (Digital IT Ops) at Hexaware