Tuesday, September 16, 2008

Service Outage Avoidance - The mother of all metrics

In my role at Vigilant, as both a consultant and an executive, I have had the opportunity to interview hundreds of operational IT managers and directors. In most cases the number one metric they were managed by was "Availability" or "System up-time".

The dialogue gets very interesting when you ask them how they collect, report and respond to those metrics. Here are the shortfalls of managing by up-time alone:
  • Up-time is very rarely measured from the end user's standpoint. You immediately put IT on the defensive when you state that the system was available on the network while the end user was unable to execute business on it.
  • The reported metric only gives credit for how quickly IT personnel were able to find and fix outages. Those outages are typically caused by poor release practices or change management, which are IT functions anyway.
A new approach to consider is how I measured my operation as an IT director, and what Vigilant consultants call "Service Outage Avoidance" (not to be abbreviated SOA, or real confusion sets in).
This metric marries component availability to end-user availability. You can accomplish this by monitoring a system's network and server components for availability along with the end user's behavior. When an outage occurs at the component level, yet the service stays up for the end user because of your superior availability design, you have avoided a service outage.
Availability metrics should then be broken into the following six categories:
  • Network (link status, utilization, drop/error rates)
  • Server (OS stats, CPU, HD, memory)
  • Application (DB, J2EE, .NET, etc.)
  • Business Logic (code interfaces, connectors, ETL, etc.)
  • Business Process (transactions, order counts, etc.)
  • End-User (real-time screen to screen, refresh, errors, etc.)
The "Service Outage Avoidance" metric shows the percentage of the reporting period during which a component was down while the service stayed available to the end user (e.g. 4 months of aggregate SAN downtime on the email system during 12 months of end-user availability).
Your next management report will then show something like this:
Email Services - Service Outage Avoidance: 33%
What this metric means is that a component was down for 33% of the period (4 of 12 months), but due to proper design and management we avoided having a business impact.
In other words, "You know how we weren't sure if it was worth it to build in all that fail-over and redundancy? Well, here is how valuable that decision to spend was."
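To make the metric concrete, here is a minimal sketch in Python of how it could be computed from monitoring data. It is only an illustration under stated assumptions: the function names, the representation of outages as (start, end) hour offsets, and the sample numbers are all hypothetical and do not come from any Vigilant tool. It also assumes the intervals within each list do not overlap one another.

```python
def total_overlap(a, b):
    """Total hours during which an interval in `a` overlaps an interval in `b`.

    Assumes the intervals inside each list do not overlap each other.
    """
    return sum(
        max(0, min(a_end, b_end) - max(a_start, b_start))
        for a_start, a_end in a
        for b_start, b_end in b
    )


def service_outage_avoidance(component_down, user_down, period_hours):
    """Share of the reporting period in which a component was down
    while the end user could still use the service."""
    component_total = sum(end - start for start, end in component_down)
    impacted = total_overlap(component_down, user_down)  # user felt these hours
    avoided = component_total - impacted                 # user never noticed these
    return avoided / period_hours


# Hypothetical 30-day reporting period (720 hours), offsets in hours.
component_down = [(100, 160), (400, 520)]  # e.g. SAN outages seen by component monitoring
user_down = [(150, 170)]                   # outages seen by end-user monitoring
soa = service_outage_avoidance(component_down, user_down, 720)
print(f"Service Outage Avoidance: {soa:.1%}")
```

The point of the design is that the two data sets come from different monitors: component outages from network and server monitoring, and user-facing outages from end-user monitoring. Only component downtime that the end user never felt counts toward the metric.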

If you can put a dollar value on up-time, you can then calculate the ROI of that redundancy.
e.g. Up-time value of email for 1 month = $1 million
Cost of redundancy = $1M
1-year ROI = 300%
(4 months * $1M = $4M return; $4M return - $1M investment = $3M net; $3M net / $1M investment = 300%)
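The same arithmetic can be written as a small Python function so it can be rerun with your own figures. This is just a sketch of the calculation above; the function and parameter names are made up for illustration.

```python
def redundancy_roi(monthly_uptime_value, avoided_outage_months, investment):
    """ROI of redundancy: net return divided by the amount invested."""
    gross_return = monthly_uptime_value * avoided_outage_months
    net_return = gross_return - investment
    return net_return / investment


# Figures from the email example: $1M per month of up-time,
# 4 months of avoided outage, $1M spent on redundancy.
roi = redundancy_roi(1_000_000, 4, 1_000_000)
print(f"1-year ROI: {roi:.0%}")  # prints "1-year ROI: 300%"
```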
                                                                       

In my next blog - Don't underestimate the infrastructure