Tuesday, September 29, 2009

Effective Problem Management

The Problem Management process can be divided into two types: Reactive and Proactive.
Proactive Problem Management focuses on evaluating trends in the environment before a problem ticket is associated with them. In other words, nothing is broken... yet.
Your goal, in this case, is to identify the risks in the environment where things could break (capacity limits, known bugs, etc.).

This posting is going to focus on Reactive Problem Management: the "all hands on deck" P1 situation.

I am going to walk you through how I manage triage and problem situations with my clients. Using these steps, I have been able to show up at a client's site and, within a day or two, resolve issues they had been trying to resolve for weeks, if not months. That's not meant as a brag, just evidence that the right methods and tools make the difference.
So let's walk through my approach; I will also call out the specific tools I use.

Rule 1) Cool heads prevail.
The fact that this is a Problem Management process means that you already have the system back on-line or you have an effective workaround. Therefore, while time is by no means unlimited, rushing to a solution at this stage will cause more damage.

Step 1) Take the time to define the problem.
· Bullet list the symptoms.
· Bullet list the steps to resolve.
· Weed out the things that are most likely unrelated.

Step 2) Define what the resolved situation should look like.
· Be specific. “Users will have it operating like it was before” is not acceptable.
· Identify measurable points for a graph or chart. You will have to prove progress; otherwise you will chase subjectivity. (A small sketch of what "measurable" can look like follows this list.)
· Reaffirm the resolution point. As you start to fix issues, others will interject with additional symptoms. Stay focused or else you will waste time on futile efforts to resolve unrelated issues.
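To show what I mean by measurable, here is a small Python sketch with made-up metric names and targets: resolution criteria you can actually graph and check, instead of arguing over "like it was before".

# Illustrative resolution criteria: metric names and targets are invented,
# just to show the difference between "like before" and something you can graph.
RESOLUTION_CRITERIA = {
    "customer_lookup_p95_seconds": 2.0,   # 95th percentile response time
    "order_error_rate_percent": 0.5,
    "app_server_cpu_percent": 70.0,
}

def is_resolved(current_readings, criteria=RESOLUTION_CRITERIA):
    # Resolved only when every measured value is at or below its target.
    return all(current_readings.get(metric, float("inf")) <= target
               for metric, target in criteria.items())

print(is_resolved({"customer_lookup_p95_seconds": 1.7,
                   "order_error_rate_percent": 0.2,
                   "app_server_cpu_percent": 65.0}))   # prints True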

Rule 2) Start wide and drill down

Step 3) Deploy tools that will help you get your arms around the environment as a whole.
I will typically start off with a network analyzer to get a sense of traffic and communications. Here are a few of my go-to tools:
Compuware Network Vantage and if possible Client Vantage Agentless
Network Instruments Observer
Others might be NetQoS or NetScout.
Of course you can use any NetFlow or sFlow data analyzer, but it's a lot more difficult to correlate that data.
Look for these key things (a quick capture-triage sketch follows this list):
· Are the servers that are supposed to talk to each other actually talking, and over the protocols they are supposed to use?
· Is the amount of data flowing between servers what you expected?
· Are there any blatant errors? DNS failures, TCP retransmissions at a high rate, TCP connections refused, TCP failures, etc.
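If all you have is a raw capture, even Wireshark's command-line cousin, tshark, will surface the blatant errors with a few display filters. Here is a rough Python sketch, assuming a file called capture.pcap and tshark on the path; the filter strings are standard Wireshark display filters.

# Rough triage of a packet capture using tshark (Wireshark's CLI).
import subprocess

CHECKS = {
    "TCP retransmissions": "tcp.analysis.retransmission",
    "TCP resets":          "tcp.flags.reset == 1",
    "DNS error responses": "dns.flags.rcode != 0",
}

def count_matches(pcap, display_filter):
    # Count how many packets in the capture match the display filter.
    out = subprocess.run(["tshark", "-r", pcap, "-Y", display_filter],
                         capture_output=True, text=True, check=True)
    return len([line for line in out.stdout.splitlines() if line.strip()])

for label, flt in CHECKS.items():
    print(f"{label}: {count_matches('capture.pcap', flt)} packets")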

Next I deploy some basic server monitoring tools:
I look for CPU, Memory, Disk I/O, Context Switching, Page file Use.
I don't get overly concerned about the detail.
Here are my go-to tools:
HP SiteScope (because it's agentless and easy to get set up)
Typically the client has something: SolarWinds, MOM, or anything that can monitor CPU at least every minute and memory every 5 minutes.
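If the client has nothing in place, even a quick script can cover the basics while the real tooling gets set up. Here is a minimal Python sketch using the psutil package; the one-minute interval and the counters polled are just my usual starting point, not gospel.

# A minimal server-health poller (assumes the psutil package is installed).
import time
import psutil

def snapshot():
    # Collect the handful of counters I care about during triage.
    vm = psutil.virtual_memory()
    io = psutil.disk_io_counters()
    return {
        "cpu_pct":       psutil.cpu_percent(interval=1),
        "mem_pct":       vm.percent,
        "swap_pct":      psutil.swap_memory().percent,   # "page file" use
        "disk_read_mb":  io.read_bytes / 2**20,
        "disk_write_mb": io.write_bytes / 2**20,
        "ctx_switches":  psutil.cpu_stats().ctx_switches,
    }

while True:
    print(time.strftime("%H:%M:%S"), snapshot())
    time.sleep(60)   # CPU at least every minute is plenty for triage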

Rule 3) Focus on facts - not perceptions or politics
Problem situations get heated fast. Someone is to blame for this impact to the business, and the headhunters come out. You'll need to muster some boldness for these next steps, but this is where the real special sauce kicks in.

Step 4) Get an isolated environment.
(Go for physical iron if you can; VMs are tougher to trace.)
While I let my broader monitoring tools bake in, I take over an isolated test environment. Usually it's a QA or Dev box that is running the production version of the code.
I set up network captures and any application monitoring tools, and have a Business Analyst walk through a Vital Business Function (VBF).
We call this profiling, and the objective here is to understand the interrelationships between the client interface and the backend requests.

My go-to tool here is Compuware Application Vantage. (An awesome tool that makes this job a whole lot easier.) Compuware has been great about letting me lease it when I've needed it.
You can also use Wireshark or other sniffers, but it's tough to piece the threads together.

For application monitoring of J2EE or .NET you can use CA Wily, Compuware Vantage Analyzer, or HP Diagnostics. However, if I am in an isolated environment, I prefer to use Compuware DevPartner for Java or DevPartner for .NET. The code profiling is much more extensive, and it is a lot easier to trace potential capacity bottlenecks.
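To make the profiling idea concrete, here is a minimal Python sketch (not tied to any of the tools above) that rolls the requests captured during a VBF walkthrough up into per-tier counts and longest durations. The records are invented; in practice you would export them from whatever capture tool you used.

# Summarize a walkthrough capture into a per-tier profile.
from collections import defaultdict

# (tier, duration in seconds) for each backend request seen during the walkthrough
walkthrough = [
    ("HTTP", 0.30), ("HTTP", 0.25),
    ("XML",  0.02), ("XML",  0.04),
    ("SQL",  0.80), ("SQL",  1.50), ("SQL",  0.20),
]

profile = defaultdict(lambda: {"count": 0, "total_s": 0.0, "longest_s": 0.0})
for tier, duration in walkthrough:
    p = profile[tier]
    p["count"] += 1
    p["total_s"] += duration
    p["longest_s"] = max(p["longest_s"], duration)

for tier, p in profile.items():
    print(f"{tier}: {p['count']} requests @ {p['total_s']:.2f}s total, "
          f"longest {p['longest_s']:.2f}s")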

Step 5) Map the profile to production
Once you have conducted a successful profile, you have the ultimate baseline of capacity and performance for troubleshooting the production issue. You now have intimate knowledge of how much traffic, and of which protocols, each end-user request generates in the back end. You can now start to analyze the captured data to determine whether the system is scaling the way the profile says it should.

For example: let's say the profile of your end-user request for looking up a customer equates to:
10 HTTP requests @ 2 seconds to the Web servers
20 XML requests @.05 seconds from web server to app server
50 SQL requests @ 3 seconds from the app server to the database, with the longest taking 1.5 seconds

You simply start to filter: HTTP > 2 seconds, XML > 0.05 seconds, and SQL > 1.5 seconds.
You compare the output against the profiled requests and you can start to map the production transactions.
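Here is a small Python sketch of that filtering step. The thresholds come from the profile above; the production records are made up, and in practice they would be exported from your capture or APM tool.

# Flag production requests that exceed the profiled baseline.
THRESHOLDS = {          # longest acceptable duration per tier, in seconds
    "HTTP": 2.0,        # client to web server
    "XML":  0.05,       # web server to app server
    "SQL":  1.5,        # app server to database
}

# Each production record: (tier, duration in seconds, description)
production_requests = [
    ("HTTP", 0.40, "GET /customer/lookup"),
    ("SQL",  4.20, "SELECT ... FROM customer_history"),
    ("XML",  0.03, "getCustomerProfile"),
    ("SQL",  1.10, "SELECT ... FROM customer"),
]

def outliers(requests, thresholds):
    # Keep only the requests that exceed their tier's profiled baseline.
    return [(tier, dur, desc) for tier, dur, desc in requests
            if dur > thresholds.get(tier, float("inf"))]

for tier, dur, desc in outliers(production_requests, THRESHOLDS):
    print(f"{tier}: {dur:.2f}s  {desc}  <-- exceeds profiled baseline")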

If you are running in a J2EE or .NET environment, many good tools will do this for you. They accomplish it by tracing from the web session to the app session and the database session. However, it is important to note that these tools can require restarts of your systems to start tracing, and they add overhead on the server.



Step 6) Present findings (evidence) and save speculation for others.
I frequently refer to my team as the Crime Scene Investigators (CSI). We produce the evidence and save the guilty or innocent verdict for someone else.

What do I mean by this? Simply that network, server, or app issues will all manifest themselves in different ways. Rushing to blame, or trying to prove someone is at fault, can often skew your judgment. Continue to focus on the analysis; the problem is what it is. When you find it, regardless of who's to blame, the question becomes who will fix it and how it will get fixed. That is why Step 7 is the most important of all.

Step 7) Document, document, document.
I cannot stress this enough. Document both the negative and the positive findings. Take screenshots of the analysis tools. Annotate them on the fly, and remember to include in your annotation the tool, trace, date, time, person, and environment in which the analysis was performed.

My tool of choice here is hands-down SnagIt from TechSmith. It is a fantastic tool for grabbing the right graph, chart, or error message and then, right there, adding comments, circling the main point, and drawing arrows to highlight correlations. I have to say 50% or more of my time on a triage is spent in SnagIt.

I hope these steps will help you with your resolution of production issues. You know who to call if you get stuck. :)

So after you have gone through all of this, what are you going to do with all that valuable data?
Next month, I will take on the subject of configuration management and its role within the knowledge base.

Tuesday, September 1, 2009

Is now the time for a Service Management Initiative?

Yes, I'm still alive.

Very sorry I have not posted in a long time (or, as we say here in Boston, "a wicked long time").
As many of you may have heard, Vigilant was acquired by Inforonics in Littleton, MA. It's been a lot of work to get the acquisition completed, but we are excited about the new opportunities.
You can read more about that here: Inforonics Acquires Vigilant

Though I've delayed posting on this particular topic, that may be a good thing.
I'll admit that earlier in the year I would have written this quite a bit differently than I do today. It's been a wild ride here in the US with our economy, and the signs of recovery are still blinking (imho). So my "If I were CIO" strategy is a bit different than in the past.

So what about the IT Service Management strategy that was so vigorously trumpeted in 2008?
From what I have seen, there are two clear areas that every company needs to focus on, and they need to do it now:
Configuration and Problem Management.
(Yes, I know, I didn't say Change. I'm shocked as well, but like I said, this article has nine months of my experience behind it, so as opposed to the speculation I would have written in January, I'm writing based on what I have seen.)

Why Configuration and Problem?
First off - the real issue is Problem Management. Companies are really bad at it! When you are working on a problem and making no progress, the business is suffering. Companies cannot afford downtime and slowdowns from technology issues, ever - especially in this economy. They need to have systems up and running fast.

So what's the problem with their problem management?
Asset mapping and documentation. I've worked on several major performance and availability problems for clients this year. Serious, revenue-impacting issues! In each and every case, the operational deficiency, and thus the reason to bring our team in, was a gap in understanding of how the technology really supported the business operation.

That is why, if companies are going to invest anything in their Service Management initiatives this year, I truly believe it has to be in Configuration Management. Config Management is not just about the assets. It's about the business services, and how the assets support them. If we (IT, the custodians of operational business technology assets) are going to add value to the business, we need to ensure that we have a handle on what the state of our operations truly is. We need to not only identify the asset relationships, but also ensure that we can determine their health and their ability to perform on an ongoing basis.

Yes, of course Change Management comes into play here. However, Change Management alone is not getting the job done. In each of the organizations I referred to above there was a CAB, RFCs, all that jazz. However, there was no record of truth or current health state for the CIs. Changes were being made against assumed configurations, without any understanding of their current state of health (in other words, no integration with event management). So changes were being requested and made against incorrect information and unstable CIs. Hence the problem kept getting worse, not better. (My analogy for the clients is "stacking cue balls": each change caused another break, just as each ball stacked makes the ones below it topple.)

With a well-thought-out CMS strategy, including health monitoring and CI capacity analysis tools, Problem Management becomes a lot easier for organizations. A clear picture of the assets in relation to each other helps the process of elimination, provides a direct plan of attack for isolating root cause, and provides helpful information for getting the right people involved.
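To illustrate the idea (and only the idea), here is a toy Python sketch of CIs mapped to the business service they support, each carrying a health state, so a problem can be narrowed down by walking the map. All names and states are invented; a real CMS would obviously go far beyond this.

# Toy CI map: walk a service's dependencies and surface the unhealthy ones.
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class CI:
    name: str
    health: str = "green"                       # green / yellow / red
    depends_on: list[CI] = field(default_factory=list)

def suspects(service: CI) -> list[CI]:
    # Depth-first walk of the dependency map, returning unhealthy CIs.
    found, stack = [], [service]
    while stack:
        ci = stack.pop()
        if ci.health != "green":
            found.append(ci)
        stack.extend(ci.depends_on)
    return found

# The "Customer Lookup" VBF and the infrastructure it rides on.
db  = CI("customer-db", health="red")
app = CI("app-server-1", depends_on=[db])
web = CI("web-server-1", depends_on=[app])
svc = CI("Customer Lookup (VBF)", depends_on=[web])

for ci in suspects(svc):
    print("Investigate first:", ci.name, "-", ci.health)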

This is why companies struggle with Problem Management. They expect the PM process to give them root cause, but that will never happen if the proper data is not collected and managed in a meaningful way.

For my next posting I'm going to break down the Problem Management process we've used to isolate faults quickly. I promise it won't take me nine months to write it. :)