Monday, December 1, 2008

Real cost savings in IT spending

IT cost savings are all the rage here in the States. With budgets tightening and stock prices falling, everyone, and I mean everyone, is looking for ways to cut costs while somehow maintaining the same level of business operation and quality.

Well, here are my five fitness tips for trimming the fat in IT spending, ways you can create significant value while reducing costs.

1) Stop doing things twice, and use automation. Quality assurance has long been viewed as a necessary evil. The reality, however, is that it is business critical if you want to avoid major customer satisfaction issues and maintain security and compliance coverage. Human capital is always the largest expense in performing quality checks, yet most systems could be designed and built with error checking and quality validation as part of the build itself. If your development organization is writing code and your QA team is then checking that code by hand, you are wasting money. Get these teams to an agile development workshop so they can start thinking about optimization techniques that eliminate teams of business analysts testing manually or writing ineffective regression automation scripts. A small sketch of what I mean follows.
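Here is a minimal sketch, in Python, of what "quality validation as part of the build" can look like: a business rule written once as code and re-checked automatically on every build. The order rule and names are hypothetical stand-ins for whatever your analysts check by hand today.

```python
# A minimal sketch of shifting a manual QA check into the build itself.
# The validation rule and names are hypothetical placeholders.
import unittest


def validate_order(order: dict) -> bool:
    """Return True only if the order passes the basic business rules."""
    return (
        order.get("quantity", 0) > 0
        and order.get("unit_price", 0) >= 0
        and bool(order.get("customer_id"))
    )


class OrderValidationRegression(unittest.TestCase):
    """Runs on every build, so no analyst has to re-check these cases by hand."""

    def test_valid_order_passes(self):
        self.assertTrue(validate_order(
            {"customer_id": "C100", "quantity": 2, "unit_price": 19.99}))

    def test_zero_quantity_fails(self):
        self.assertFalse(validate_order(
            {"customer_id": "C100", "quantity": 0, "unit_price": 19.99}))


if __name__ == "__main__":
    unittest.main()
```

Once a rule like this lives in the build, the manual re-check of that rule disappears from every release cycle.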

2) "Do it yourself" only where it makes sense. Too many organizations have leveraged internal staff in a figure it out mode. While this build longer-term internal knowledge capital and can work towards job satisfaction and retention, it ultimately is more expensive. Where the function is core to your business model then it makes sense. Otherwise look to an Software as a service, or Manages Service Provider who can give you economies of scale with point expertise.
Most providers have demonstrated ROI in a 2 year period in both CapEx and OpEx.

3) Tune business processes, not just systems. In the past nine years of managing performance engineering teams, I have learned one thing: it is always faster to fix how people use a system than to tweak a poorly designed system. So before embarking on expensive load testing and system tuning efforts, evaluate the end users' usage of the system through operational profiling. Monitor and shadow users for a week. You will be surprised how many time-saving tips you can bring to the business community without spending any additional cash.

4) Monitor the end user's experience, not just the infrastructure. OK, you'll have to spend some money up front. However, the payback can be significant if transactions are properly captured and labeled. Monitoring the performance and availability of critical transactions allows you to focus your "break-fix" dollars on the processes that actually drive the business. Many organizations have embarked on fixing system issues that did very little for the operational side of the business. By knowing where a gain in the environment will actually affect the business, you can plan and spend with significantly greater value. A quick sketch follows.
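Purely as an illustration, here is a bare-bones Python sketch of what capturing and labeling critical transactions might look like. The URLs, labels, and threshold are hypothetical, and in practice you would use a purpose-built monitoring product rather than a script, but the idea is the same: measure the transactions the business cares about, by name.

```python
# Time a few labeled, business-critical transactions and flag the ones
# that breach an agreed response-time target.
# The URLs, labels, and threshold below are hypothetical placeholders.
import time
import urllib.request

CRITICAL_TRANSACTIONS = {
    "Submit order": "https://erp.example.com/orders/new",
    "Check inventory": "https://erp.example.com/inventory/lookup",
}
THRESHOLD_SECONDS = 3.0  # agreed response-time target for these transactions


def measure(label: str, url: str) -> None:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=30) as response:
            response.read()
        elapsed = time.monotonic() - start
        status = "SLOW" if elapsed > THRESHOLD_SECONDS else "OK"
        print(f"{label}: {elapsed:.2f}s [{status}]")
    except Exception as err:
        print(f"{label}: FAILED ({err})")


if __name__ == "__main__":
    for label, url in CRITICAL_TRANSACTIONS.items():
        measure(label, url)
```

Run something like this on a schedule and the "break-fix" conversation shifts from servers to named business transactions.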

5) Integrate request management and time tracking. Last but certainly not least is optimizing the people. Project management, time tracking, defect tracking, and service desk tools remain disjointed in many organizations. This lack of clarity around issuing and fulfilling requests leaves many, many loopholes in personnel accountability and management. In this down economy, and during the holiday season, people lose focus more easily, get distracted, and lose the ambition to take full responsibility. Providing a consolidated view of "work orders" and "work plans" will help keep your most expensive asset working optimally.

While there are many more areas within IT to cut, I believe these are the key ways in which you can reduce costs while still maintaining the highest level of quality.

Next month's blog: Is now the time to start an IT Service Management initiative? I'll discuss the pros and cons of taking on a project like this in this economy.

Wednesday, October 1, 2008

Don't underestimate the infrastructure

We all would love to live in a world where extremely complex and sophisticated technology was just simple and easy. Commodity IT, where I can just turn on my PC and the world is at my fingertips. Sorry to be the bearer of bad news, but...



Things break - That's life.

We wish they wouldn't.



Like my good friend and colleague Jon Land says, "When I booted up my 3270, things just worked. Unless the Mainframe was down." Of course, that was until we tried to connect that system to corporate networks, the Internet, and now mobile applications. Then all of a sudden we are back in a world of chaos.



The analogy of comparing plumbing and IT infrastructure services drives a lot of IT folks crazy, but I like it.

Let's compare this:


  • Plumbers cost $60-$80/hr, here in Massachusetts anyway. Yet the average IT infrastructure consultant is about $40-$55/hr.

  • The cost of copper is driving plumbing prices through the roof. In case you didn't realize it, guess what is in those rubber Ethernet cables... you guessed it, copper. Material costs are through the roof in IT as well, and energy costs are off the charts now that our operating environments are more dense.

  • Recycling and conservation. Water costs are going up, and so is the cost of recycling old equipment.

  • The quality of water in larger facilities is less than desirable, and the costs to upgrade and improve it are significant. Likewise, the performance of networks and systems in larger organizations is often not acceptable, and significant costs are needed to upgrade and improve them.

I could go on with this comparison, but the bottom line is that having turn-key (or turn-faucet) results requires solid planning, a total approach to design and the understanding that when things break it's going to be messy.


A failure to incorporate your infrastructure needs into your enterprise architecture will result in a continual failure to get the services you need, when you need them.


Like the old adage says: Proper preparation prevents poor performance.

Next up: Can we really spend less and get more? I will share with you the top five areas of IT spending that could probably be cut in half.

Tuesday, September 16, 2008

Service Outage Avoidance - The mother of all metrics

In my role at Vigilant as both a consultant and an executive I have had the opportunity to interview hundreds of operational IT managers and directors. In most cases the number one metric they were managed by was "Availability" or "System up-time".

What turns into a very interesting dialogue is talking to them about how they collect those metrics, report them, and respond to them. Here are the shortfalls of taking this approach:
  • Up-time is very rarely measured from the end user's standpoint. So you immediately put IT on the defensive when you state that the system was available on the network, yet the end user was not able to execute business on the system.
  • This reported metric only gives credibility to how quickly IT personnel were able to find and fix the outages. And outages are typically caused by poor release practices or change management, which are IT functions anyway.
A new approach worth considering is how I measured my operation as an IT director, and what Vigilant consultants call "Service Outage Avoidance" (not to be abbreviated SOA, or real confusion sets in).
This metric marries component availability to end-user availability. You accomplish this by monitoring a system's network and server components for availability along with the end users' behavior. When an outage occurs at the component level, yet the service stays up for the end user thanks to your superior availability design, you have avoided a service outage.
Availability metrics should then be broken into the following six categories:
  • Network      (link status, utilization, drop/error rates)
  • Server         (OS stats, CPU, disk, memory)
  • Application  (DB, J2EE, .Net, etc.)
  • Business Logic    (code interfaces, connectors, ETL, etc.)
  • Business Process  (transactions, order counts, etc.)
  • End User      (real-time screen-to-screen times, refreshes, errors, etc.)
"Service Outage Avoidance" metric shows the percentage of downtime of a component where end-user was available.  (i.e.  4months of aggregate downtime of SAN on Email system during 12 months of end user availability)
Your next management report then will show something like this:
Email Services - Service Outage Avoidance: 25%
What this metric means is that we had an impact at a component level of 25%, but due to proper design and management we avoided having a business impact.
In other words: "You know how we weren't sure it was worth it to build in all that fail-over and redundancy? Well, here is how valuable that spending decision was."

If you can put a value on up-time, you can calculate the ROI. For example:
Up-time value of email for one month = $1 million.
Cost of redundancy = $1M.
One-year ROI = 300%.
(4 months × $1M = $4M return; $4M - $1M investment = $3M net; $3M net / $1M investment = 300%.)
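For those who like to see the arithmetic spelled out, here is a tiny Python sketch of the same calculation; the figures are the ones from the example above, so substitute your own.

```python
# The ROI arithmetic from the email example above, as a small script.
# Swap in your own up-time value, avoided downtime, and redundancy cost.

def redundancy_roi_percent(avoided_downtime_months: float,
                           uptime_value_per_month: float,
                           redundancy_cost: float) -> float:
    """ROI of the redundancy spend, expressed as a percentage."""
    gross_return = avoided_downtime_months * uptime_value_per_month
    net_return = gross_return - redundancy_cost
    return 100.0 * net_return / redundancy_cost


if __name__ == "__main__":
    # 4 months of avoided outage, $1M of up-time value per month, $1M redundancy cost
    roi = redundancy_roi_percent(4, 1_000_000, 1_000_000)
    print(f"One-year ROI: {roi:.0f}%")  # 300%
```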
                                                                       

In my next blog - Don't underestimate the infrastructure

Tuesday, August 5, 2008

Customers - Are they always right?

So you finally get a chance to look at your customer surveys. Disappointingly, after all the coaching, training, and process around your service desk, your customers are still complaining.

"What's the deal?" you say to yourself. "I thought if we put this Service Management process stuff in place customers would be happy; at least that is what Matt told me. Last time I listen to that idiot."

While not taking my advice is often a good thing, it's not that the processes did not work; it's that the implementation did not take into account the most important aspect of Service Management. It's Customer Service Management, not Process Service Management. That means you have to take the customer's unique circumstances into consideration.

So while Customers are not always "Right", they can be made to feel like they are getting the service they are paying for, by listening to them and acknowledging their intelligence and frustration.

Too many groups are putting in Incident and Request models that are purely focused on the workflow, not the communication. The customer's request or incident needs to be resolved, yes, but customers also need to feel like they are getting individual attention. Let me share an example with you.

Recently I purchased a VHS-to-DVD copier. After successfully making my third or fourth copy, the device stopped reading the VHS tapes. The picture was snowy but the sound was clear. I called tech support, whereupon the technician, following his troubleshooting script, told me that I needed to plug the "yellow" jack from the back of the device into my television.

I explained that I was using the component cables, which were red, blue, and green, and that my TV did not have a "yellow" jack to plug into. The technician insisted that the device could not work unless the "yellow" jack was plugged in. Mind you, I had told him several times that I had successfully made copies and that nothing had changed with my physical connections. Needless to say, it was a horrible experience, and I ended up returning a product that was probably fine because someone would not listen to the customer.

A call to my Internet provider, Comcast, had the completely opposite outcome. After troubleshooting why my connection was not working, I finally called Comcast. Now, I know it's been a while since I've gotten my hands dirty with technology, but I still feel pretty comfortable troubleshooting network issues. Upon plugging my laptop directly into the cable modem, I realized the problem was with their cable modem. So I called Comcast tech support and explained my situation and the steps I had taken. The Comcast rep told me, "Can I put you on hold one moment? You clearly have taken some steps to isolate this; let me see if I can pick up where you left off." Literally within two minutes the modem was up and running and my Internet was back. He apologized for the inconvenience and then explained that they would add my device to their monitoring solution so they would be notified should this happen again.

Now that is Customer Service.

Next on my hit list of topics: The mother of all Metrics "Service Outage Avoidance" - how this one set of metrics can be the key to your next raise.

Tuesday, July 8, 2008

Service Catalogs are the key to demonstrating Value

What is a Service Catalog? Simply put, it is a system or set of documentation that allows people to preview the services they can obtain from you and the expectations they can have in getting those services (time, cost, quality, etc.).
Do we need a Service Catalog? Do you need a resume to get a job? No, but if you want the right job, want to get paid fairly for the abilities you bring, and want to set the right expectations, then you will want a clearly articulated resume.
Same thing with the IT Service Catalog. If you want the business to appreciate the value IT brings to the organization, and you want to ensure that staff, suppliers, and costs are adequately budgeted for, then you must present your capabilities to the business. The Service Catalog is where you publish and present what IT will do, and thus what it will not do. On its face, the business will not necessarily want IT to have this. If you do not currently have an IT Service Catalog, the business can ask you for whatever it wants, and IT has to scramble either to justify why it can't be done or to figure it out. If there is no cost allocation in place for IT resources, then in the eyes of the business stakeholder IT is a free resource, and we all know the value of free: zero. Free has no value.

Thus, to really drive the value of IT services, IT must put in place a definitive "what we do, how we do it, and how much it costs" communication platform. More advanced organizations are using this information to build an online IT ordering site where people can order account setups, email boxes, new laptops, PDAs and BlackBerrys, and other enablement services. These sites typically hang off the Service Desk platform so that people can order services without having to interact with a service request person. This can lead to tremendous cost savings, and it also leaves the business more in control. As a result, many organizations are finding the business more willing to fund the Service Catalog under the umbrella of self-service optimization and cost efficiency.
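To make the idea less abstract, here is a minimal sketch, in Python, of the kind of information a single Service Catalog entry might capture. Every service name, price, and target below is a hypothetical placeholder; a real catalog would live in your Service Desk platform, not in a script.

```python
# A minimal sketch of what Service Catalog entries might capture.
# All fields and values are hypothetical; the point is that each service
# states what it is, how it is requested, what it costs, and how fast it arrives.

SERVICE_CATALOG = [
    {
        "service": "New employee laptop",
        "description": "Standard-build laptop with corporate image and VPN access",
        "how_to_request": "Self-service portal > Hardware > Laptop",
        "fulfillment_target": "3 business days",
        "cost_allocation": "$1,200 one-time, charged to requesting cost center",
        "approvals_required": ["Hiring manager"],
    },
    {
        "service": "Shared mailbox",
        "description": "Departmental email box with delegated access",
        "how_to_request": "Self-service portal > Email > Shared mailbox",
        "fulfillment_target": "1 business day",
        "cost_allocation": "$5/month per mailbox",
        "approvals_required": ["Department head"],
    },
]

for entry in SERVICE_CATALOG:
    print(f"{entry['service']}: {entry['fulfillment_target']}, {entry['cost_allocation']}")
```

Even a list this simple answers the two questions the business always asks: how long, and how much.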

Next blog: "Is the customer always right?" I'll share some tech support stories to show the difference between a customer focused support person and a person who answers the phone and follows a script.

Tuesday, June 17, 2008

SLAs - why are they needed?

Service Level Agreements, or SLAs. For those who have been able to develop them with metrics that are meaningful and achievable, they love them. For everyone else, they are a nightmare. What makes a good SLA? In my experience, it takes very little to make a good SLA. First, it needs to be understandable. If you don't understand the commitment of service you need to perform, then the actions you need to take to improve will be a mystery.
For example, measuring a person at a fast food restaurant on the quality of their hamburger is a nice thought, but what does it mean? It's not like they can change the type of beef or bread. So rather than just saying "improve the quality," the manager needs to put in measures that the employee can affect: time on the shelf less than 10 minutes, bread no older than 5 days, and so on.

Those are factors that the employee can watch and adjust, ultimately improving the SLA. Which brings me to the second factor: it has to be measurable. If you cannot measure it, you cannot manage it. If you cannot time how long the hamburger sits on the shelf, or track the age of the bread, you cannot determine its quality.

If you look at the standard SLAs in place that are not on the nightmare side of the house, they are things like 99.999% up-time. If you asked most IT folks what that meant, you would get different answers. Some would say the server was up 99.999% of the time over the course of a year. Others might say 99.999% means that application services are unavailable to users for no more than about five minutes over the course of a year.

The first is very measurable but not of high value. The second is extremely valuable but difficult to measure.
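If you have never done the arithmetic, here is a quick Python sketch that converts an up-time percentage into the downtime it actually permits over a year; five nines leaves barely five minutes.

```python
# Convert an availability target into the downtime it allows per year.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes of downtime permitted per year at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100.0)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        print(f"{target}% up-time allows ~{allowed_downtime_minutes(target):.1f} "
              "minutes of downtime per year")
```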

So when establishing your agreements on the level of service required, it is important to determine what you can do and what the business needs, and then negotiate the middle ground. The more the business needs, the more IT will need to deliver and the higher the cost. Overpromising on an SLA that the IT department cannot hit does not help anyone, so it is crucial for IT to establish what its capabilities look like. My next blog will cover what a Service Catalog is and why it is needed for true SLA management.

Monday, June 9, 2008

The Art of Triage

To many, troubleshooting seems to be a gift that you either have or you don't. For instance, my father is a mechanic. When he owned his own repair shop, he would hire young guys who would spend hours troubleshooting a problem, but within minutes of my old man getting involved he could diagnose the cause. The timing belt, carburetor issues, whatever it was, he was quick to pinpoint it. Inevitably, once the problem was found there was the "oh, of course" from the junior grease monkey.

I was too clumsy to be a mechanic, so my dad fired me and forced me into Computers. However, I didn’t forget what I had learned about troubleshooting.

First, troubleshooting is not something you are born with. It is a skill that is honed from three common factors:
1) What you know
2) What you don’t know
3) What you are learning

When you piece these three factors together, you create the framework for discovery. Adding a negative and a positive approach then leads you down the path that good troubleshooters simply call the process of elimination.

Do you know what is working? Do you know what is not working?
What don’t you know is working? What don’t you know is failing?
What have I proved with this step? What have I disproved with this step?

So when it comes to troubleshooting complex systems, the same principle applies. You just need to analyze them in layers. Here are the layers that VIGILANT has documented as the logical points to eliminate.

Infrastructure: Hardware, Networking, Operating Systems
Application: 3rd party application services
System Interfaces: Connectivity between dependent systems
Business logic: Business rules that cause transactions to operate differently
Business Process: The way the end-user is executing the transaction
Business Service: Dependency on data or other elements for success

For really complex issues, take each of these tiers, apply the three principles of discovery to them, and you will find the problem is not as much of a black hole as you thought it was.

Tuesday, June 3, 2008

Performance Engineering Tips - A solid plan leads to better results

Creating a performance plan has many challenges, and creating a realistic load test has to be one of the greatest. Load and performance testing is often seen as a nice-to-have. Yet no one ever says, "yeah, I'm OK if my applications run slower." As load and data volumes increase, though, that is exactly what happens: mid-stream in business operations, applications can suddenly start to slow down, usually without warning and almost always with a detrimental impact on the business.

How can you keep this from happening? A better test plan is the place to start.
  1. Review the types of activities that the users will be performing. We call these transactions.
  2. Review the location and number of users. Take network speeds into consideration.
  3. Review the number of transactions that will be performed.

Many IT performance testers simply look at user count and business transactions. Failing to account for network conditions and transaction volume will produce an inadequate simulation. A back-of-the-envelope sketch of sizing a test from these reviews follows.
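Here is a hypothetical back-of-the-envelope Python sketch of how those three reviews turn into load test numbers. The transaction mix, user count, and link speeds are placeholders; the point is that arrival rate, pacing, and bandwidth all come from the plan, not the tool.

```python
# Sizing a load test from transaction mix, user count, and network conditions.
# All figures below are hypothetical placeholders for your own plan.

PEAK_HOUR_TRANSACTIONS = {      # expected peak-hour volume per transaction type
    "Search catalog": 6000,
    "Place order": 1200,
    "Run report": 300,
}
CONCURRENT_USERS = 250
LINK_SPEED_KBPS = {"HQ LAN": 100_000, "Branch WAN": 1_544}  # bandwidth to emulate

for name, per_hour in PEAK_HOUR_TRANSACTIONS.items():
    per_second = per_hour / 3600
    # pacing: how often each simulated user must start this transaction
    pacing_seconds = CONCURRENT_USERS / per_second
    print(f"{name}: {per_second:.2f}/sec target, "
          f"one every {pacing_seconds:.0f}s per virtual user")

print("Bandwidth profiles to emulate:", LINK_SPEED_KBPS)
```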

The better the simulation - the more valuable the predicted operation of the system when it goes live.

Wednesday, May 28, 2008

Why are we still fighting fires?

"We have spent so much money on monitoring tools and consulting, why are we still fighting fires?" I heard this from a recent prospect. The simple answer is because we live in a complex and high demand world. As IT proffesionals we are trying to do a lot with a little; little training, little vision, little strategy, little focus. Organizational process alone is not going to cut it. If we want to fight fires, then we have to campaign like Smokey the Bear. "Only you can prevent forest fires!" was his motto. What is yours?

How about these? "Only you can prevent outages from unplanned changes!" - by implementing more rigorous controls around change and release management.
"Only you can prevent unnecessary downtime from distracting priorities!" - by implementing better incident management procedures, you can avoid the "SWAT call" mentality that takes the engineer's attention off restoring the service and diverts it into CYA reporting for management.
"Only you can prevent business impact from IT service failure!" - by properly planning and validating your infrastructure through capacity and load testing, fail-over validation, and application profiling, you can ensure your build and release management processes catch issues before your users do.

Why are you still fighting fires? Probably because your processes are like disconnected piles of twigs and your culture is overly reactive.

Tuesday, May 27, 2008

The value of visual monitoring systems

It still amazes me how many clients I sit with at lunch or in a business meeting whose pager or cell phone goes off and who, without even looking at the message, just clear it. Then they smile and say, "stupid monitoring system." I've started to joke with them that they must have stock in Duracell. Is the monitoring system stupid, or is it the way they are using it?
The reason people ignore the messages is false alarms. Why are they false to begin with?

The whole idea of paging people is great for getting their attention while they are out and about. With the advancements in cell phone technology, why are we not using more visual alerting capabilities? When is RIM (maker of the BlackBerry) going to step up to the table and give us some visual icons to alert us IT folks to system issues? The device can tell me I have a text message or voicemail. How about a full disk or a critical application failure? A few simple icons would let me see at a glance whether I need to look at the alert details or not.