Mehdi Daoudi is CEO and co-founder of Catchpoint.
The field of application performance management (APM) is rapidly transforming, driven by a trifecta of increasing user demands, higher potential for lost revenue and greater reputational risk.
Traditional APM solutions have focused on correlating website and application performance levels to internal data center elements, helping to identify what is causing an online service to grow sluggish or become unavailable altogether. While this internal focus remains important, today's organizations are increasingly incorporating external third-party elements and infrastructures beyond their firewall in the never-ending pursuit to outdo competitors with faster, more feature-rich websites.
These include third-party services (everything from social network plug-ins to photo display, video services, and analytics and marketing tags) as well as infrastructures (CDNs, the cloud and DNS providers). With user performance (speed and availability) increasingly influenced by these elements, APM is morphing into what Gartner calls digital experience management (DEM), in which the user experience is the ultimate metric. DEM is triggering significant changes in the more established discipline of IT issue resolution, including the following:
Greater Coverage Area: The IT issue resolution surface area is expanding to include both internal and external elements, precisely correlating them with website and application performance levels in real time. Ideally, information on these elements should be available in a single view, giving IT and DevOps teams a quick summary and contextual information without the need to toggle between multiple screens.
There may be times when an offending element is outside the firewall and beyond an organization's direct control – for example, a slow regional ISP or a poorly performing cloud service. Even in instances like these, knowledge is power: wasted war-room time can be avoided, SLAs can be enforced, and proactive communication with impacted users can begin.
Advanced Analytics: IT issue resolution consists of four phases: detecting the problem, diagnosing it (identification), fixing it, and verifying that performance levels have returned to normal. According to Puppet Labs' 2017 State of DevOps Report, the speed of this end-to-end process is a major factor distinguishing high-performing teams (who keep the process under an hour) from medium and low performers (who can take anywhere from a day to a full week).
Phase two – identification – is the lengthiest, most arduous leg of this race for most organizations, and the one most primed for improvement. The issue isn't that too little data, or the wrong kind of data, is being collected. Rather, the challenge is the increased complexity of today's IT infrastructure, which has resulted in massive amounts of performance data generated by internal and third-party systems. If performance problems are the tip of an iceberg, advanced analytics are the key to quickly identifying the root cause lying far below the surface.
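To make this concrete, one simple form of such analytics is correlating a user-facing metric against the response times of each dependency sampled alongside it. The sketch below is a hypothetical, deliberately minimal illustration (the function names and data are invented, and real DEM platforms use far richer models): it ranks third-party dependencies by how strongly their latency tracks overall page load time.

```python
# Hypothetical sketch: rank third-party dependencies by how strongly their
# response times correlate with overall page load time. Real analytics
# engines are far more sophisticated; this only illustrates the idea.

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def rank_suspects(page_load_ms, dependencies):
    """dependencies: {name: [response_ms, ...]} sampled per page load."""
    scores = {name: pearson(series, page_load_ms)
              for name, series in dependencies.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Invented example data: the analytics tag spikes whenever the page is
# slow; the CDN stays flat.
page_load = [820, 790, 1450, 805, 1600, 810]
deps = {
    "analytics-tag": [95, 90, 610, 92, 700, 96],
    "cdn":           [40, 42, 41, 39, 43, 40],
}
print(rank_suspects(page_load, deps)[0][0])  # → analytics-tag
```

Here correlation alone flags the likeliest culprit in milliseconds, whereas a human scanning dashboards for each dependency would spend most of the diagnosis phase doing the same comparison by eye.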
Becoming More Proactive and Predictive: According to a research study conducted by Enterprise Management Associates, even with a variety of toolsets in place, 36 percent of teams find out about performance-related problems via calls from users. In a socially networked culture, where disgruntled customers can vent their grievances to thousands in mere seconds, this paradigm needs to change.
Having more advanced analytics will help, by making it easier and faster to identify root issues while reducing both false positives (which often lead to ignored alerts) and false negatives (which let issues go undetected for long stretches). However, more predictive capabilities are also needed, allowing teams to see the likely impact of a third-party issue on both user experience and business metrics such as revenue. This helps organizations avoid being caught flat-footed, while also giving them an opportunity to prioritize and get ahead of problems, ideally before users are impacted.
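A basic version of this predictive capability is trend projection: fit a line to recent response-time samples and estimate when it will cross the SLA threshold. The sketch below is a hypothetical illustration with invented function names and data, not any vendor's method:

```python
# Hypothetical sketch: project a degrading response-time trend forward
# with a least-squares line to estimate when an SLA threshold will be
# breached, so teams can act before users are impacted.

def linear_fit(ys):
    """Fit y = a + b*x over x = 0..n-1; return (intercept, slope)."""
    n = len(ys)
    xs = range(n)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def minutes_until_breach(samples_ms, sla_ms):
    """Minutes until the trend crosses the SLA (one sample per minute),
    or None if the trend is flat or improving."""
    a, b = linear_fit(samples_ms)
    if b <= 0:
        return None
    t = (sla_ms - a) / b          # x where the fitted line hits the SLA
    return max(0.0, t - (len(samples_ms) - 1))

# Invented data: response time creeping up ~20 ms/minute toward a 1000 ms SLA.
samples = [700, 720, 740, 760, 780, 800]
print(round(minutes_until_breach(samples, 1000)))  # → 10
```

An alert fired on this projection arrives while there is still headroom, rather than after the threshold is crossed, which is the difference between getting ahead of a problem and war-rooming it.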
Selective Artificial Intelligence (AI): Like many areas of technology, AI is securing a foothold in performance monitoring, including digital performance assistants with natural language processing capabilities. These are positive developments, considering that IT issue resolution has typically required highly trained (and highly paid) experts to sift through and make sense of vast amounts of performance data.
AI can certainly play a valuable role in automating these processes and helping find patterns across amassed data. But IT issue resolution isn't a completely "hands-off" process just yet, and it's doubtful it ever will be. AI's real value is in empowering humans to take appropriate action: issuing an earnest or even humorous apology to users in the event of an outage, for example, or negotiating with a contracted third-party service on needed performance improvements. These are things machines simply can't learn or be taught to do.
We live in an age where managing IT incidents is getting more and more difficult, due to a proliferation of performance-impacting elements. The past few months have provided several examples: the AWS S3 outage in February, which caused problems for hundreds of sites, and Lululemon's 22-hour site outage, which was blamed on IBM cloud services.
It is no longer safe to assume that just because your data center systems are humming along nicely, users are having an excellent experience. Unless they are consistently monitored, external third-party services and systems represent a huge (and potentially costly) blind spot. As APM evolves into DEM to address this reality, related disciplines must change in lockstep. The advances described here will be critical as IT issue resolution aligns to DEM by focusing on the ultimate metric: the user experience.
Opinions expressed in the article above do not necessarily reflect the opinions of Data Center Knowledge and Penton.