The Problem Management process can be divided into two types: Reactive and Proactive.
Proactive Problem Management focuses on evaluating trends that are happening in the environment before a problem ticket is associated with them. In other words, nothing is broken... yet.
Your goal, in this case, is to identify risks in the environment where things could break (capacity constraints, known bugs, and so on).
This posting is going to focus on Reactive Problem Management. The "all hands on deck" P1 situation.
I am going to walk you through how I manage triage and problem situations with my clients. Using these steps, I have been able to show up at a client's site and, within a day or two, resolve issues they had been trying to resolve for weeks, if not months. That's not meant as a brag, just evidence that the right methods and tools make the difference.
So let's walk through my approach and I will also specifically call out the tools I use.
Rule 1) Cool heads prevail.
The fact that this is a Problem Management process means the system is already back online or you have an effective workaround. Therefore, while time is by no means unlimited, rushing to a solution at this stage will cause more damage than it prevents.
Step 1) Take the time to define the problem.
· Bullet list the symptoms.
· Bullet list the steps to resolve.
· Weed out the things that are most likely unrelated.
Step 2) Define what the resolved situation should look like.
· Be specific. “Users will have it operating like it was before” is not acceptable.
· Identify measurable points for a graph or chart. You will have to prove progress. Otherwise you will chase subjectivity.
· Reaffirm the resolution point. As you start to fix issues, others will interject with additional symptoms. Stay focused or else you will waste time on futile efforts to resolve unrelated issues.
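One way to keep the resolution point objective is to write the agreed criteria down as numbers. Here is a minimal sketch in Python; the metric names and target values are hypothetical, not from any specific engagement:

```python
# Hypothetical example: encode the agreed resolution criteria as
# measurable thresholds so progress can be charted, not debated.
RESOLUTION_CRITERIA = {
    "avg_login_time_sec": 3.0,     # assumed target values for illustration
    "orders_per_hour_min": 500,
    "error_rate_pct_max": 0.5,
}

def is_resolved(measurements):
    """Return (resolved, failing) given the current measured values."""
    failing = []
    if measurements["avg_login_time_sec"] > RESOLUTION_CRITERIA["avg_login_time_sec"]:
        failing.append("avg_login_time_sec")
    if measurements["orders_per_hour"] < RESOLUTION_CRITERIA["orders_per_hour_min"]:
        failing.append("orders_per_hour")
    if measurements["error_rate_pct"] > RESOLUTION_CRITERIA["error_rate_pct_max"]:
        failing.append("error_rate_pct")
    return (len(failing) == 0, failing)
```

When someone interjects with a new symptom, you can ask whether it moves any of these numbers; if not, it goes on a separate list.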
Rule 2) Start wide and drill down.
Step 3) Deploy tools that will help you get your arms around the environment as a whole.
I will typically start off with a network analyzer to get a sense of traffic and communications. Here are a few of my go-to tools:
Compuware Network Vantage and if possible Client Vantage Agentless
Network Instruments Observer
Others might be NetQoS or NetScout.
Of course, you can use any NetFlow or sFlow data analyzer, but it's a lot more difficult to correlate the data with these.
Look for these key things:
· Are the servers that you thought were supposed to talk with each other talking the protocol they are supposed to?
· Is the amount of data in between servers what you expected?
· Are there any blatant errors? DNS failures, TCP retransmissions at a high rate, TCP connections refused, TCP Failures, etc...
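The checks above amount to comparing observed conversations against an expected communication matrix. Here is a rough Python sketch of that idea; the flow-record format, host names, and expected matrix are all assumptions for illustration, not the output of any particular analyzer:

```python
# Hypothetical sketch: audit exported flow records (e.g. from a NetFlow
# collector) against the conversations you *expect* to see.
EXPECTED = {
    # (src, dst): (expected protocol, max MB per capture window)
    ("web01", "app01"): ("HTTP", 500),
    ("app01", "db01"):  ("SQL", 2000),
}

def audit_flows(flows):
    """flows: list of dicts with src, dst, proto, mbytes, retrans_pct."""
    findings = []
    for f in flows:
        key = (f["src"], f["dst"])
        if key not in EXPECTED:
            findings.append(f"unexpected conversation {key} ({f['proto']})")
            continue
        proto, max_mb = EXPECTED[key]
        if f["proto"] != proto:
            findings.append(f"{key}: expected {proto}, saw {f['proto']}")
        if f["mbytes"] > max_mb:
            findings.append(f"{key}: {f['mbytes']} MB exceeds {max_mb} MB")
        if f.get("retrans_pct", 0) > 2.0:  # illustrative retransmission ceiling
            findings.append(f"{key}: high TCP retransmission rate")
    return findings
```

An empty findings list means the traffic matches expectations; anything returned is a lead worth drilling into.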
Next I deploy some basic server monitoring tools:
I look for CPU, memory, disk I/O, context switching, and page file use.
I don't get overly concerned about the detail.
Here are my go-to tools:
HP SiteScope (because it's agentless and easy to set up)
Typically the client already has something: SolarWinds, MOM, or anything that can monitor CPU at least every minute and memory every 5 minutes.
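Since the point at this stage is a coarse screen rather than deep detail, the evaluation can be as simple as counting threshold breaches over time. A minimal Python sketch; the thresholds are illustrative starting points I'm assuming here, not vendor guidance:

```python
# Hypothetical sketch: coarse health screen over sampled server metrics.
THRESHOLDS = {
    "cpu_pct": 85.0,                # sustained CPU utilization
    "mem_pct": 90.0,                # memory in use
    "disk_io_wait_pct": 20.0,       # time spent waiting on disk
    "ctx_switches_per_sec": 50000,  # context switching
    "pagefile_pct": 50.0,           # page file use
}

def screen(samples):
    """samples: list of {metric: value} dicts taken over time.
    Flag a metric only if it breaches its threshold in most samples,
    since triage cares about sustained pressure, not a single spike."""
    flagged = []
    for metric, limit in THRESHOLDS.items():
        breaches = sum(1 for s in samples if s.get(metric, 0) > limit)
        if breaches > len(samples) / 2:
            flagged.append(metric)
    return flagged
```

Whatever monitoring product feeds the samples, requiring sustained breaches keeps one noisy reading from sending you down the wrong path.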
Rule 3) Focus on facts - not perceptions or politics.
Problems get heated fast. Someone is to blame for this impact to the business, and the headhunters come out. You'll need to muster some boldness for these next steps, but this is where the real special sauce kicks in.
Step 4) Get an isolated environment.
(Go for physical iron if you can; VMs are tougher to trace.)
While I let my broader monitoring tools bake in, I set up an isolated test environment. Usually it's a QA or dev box running the production version of the code.
I set up network captures and any application monitoring tools, then have a Business Analyst walk through a Vital Business Function (VBF).
We call this profiling, and the objective here is to understand the interrelationships between the client interface and the backend requests.
My go-to tool here is Compuware Application Vantage. (Awesome tool and makes this job a whole lot easier.) Compuware has been great to let me lease this when I've needed it.
You can also use Wireshark or other sniffers, but it's tough to piece the threads together.
For application monitoring of J2EE or .NET, you can use CA Wily, Compuware Vantage Analyzer, or HP Diagnostics. However, in an isolated environment I prefer Compuware DevPartner for Java or DevPartner for .NET. The code profiling is much more extensive, and it is a lot easier to trace potential capacity bottlenecks.
Step 5) Map Profile to production
Once you have conducted a successful profile, you have the ultimate baseline of capacity and performance for troubleshooting the production issue. You now have intimate knowledge of how much traffic, of which protocols, each end-user request causes in the back end. You can now start to analyze the captured data to determine whether you are getting scalability.
For example: Let's say the profile of your End-User request of Looking up a customer equates to:
10 HTTP requests @ 2 seconds to the Web servers
20 XML requests @.05 seconds from web server to app server
50 SQL requests @ .3 seconds from app server to database server, with the longest taking 1.5 seconds
You simply start to filter: HTTP > 2 seconds, XML > .05 seconds, and SQL > 1.5 seconds.
You compare the output against the profiled requests and you can start to map the production transactions.
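The filtering step above can be sketched in a few lines of Python. The baseline numbers mirror the worked example; the capture record format and the sample production data are assumptions for illustration:

```python
# Hypothetical sketch of Step 5: apply the profiled baseline as filters
# over captured production requests to surface the outliers.
BASELINE = {
    "HTTP": 2.0,   # slowest profiled HTTP request (seconds)
    "XML":  0.05,  # slowest profiled XML request
    "SQL":  1.5,   # longest profiled SQL request
}

def outliers(requests):
    """requests: list of (protocol, duration_sec) tuples from the capture."""
    return [(proto, dur) for proto, dur in requests
            if dur > BASELINE.get(proto, float("inf"))]

# Assumed sample of captured production traffic.
production = [("HTTP", 1.8), ("HTTP", 4.2), ("XML", 0.04),
              ("SQL", 0.3), ("SQL", 6.1)]
slow = outliers(production)  # everything here failed the baseline
```

Each surviving request is a production transaction that exceeds its profiled baseline and is worth mapping back to the end-user action that generated it.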
If you are running in a J2EE or .NET environment, many good tools will do this for you by tracing from the web session to the app session to the DB session. However, it is important to note that these tools can require reboots of your systems to start tracing, and they add overhead on the server.
Step 6) Present findings (evidence) and save speculation for others.
I frequently refer to my team as the Crime Scene Investigators (CSI). We produce the evidence and save the guilty or innocent verdict for someone else.
What do I mean by this? Simply that network, server, and app issues all manifest themselves in different ways. Rushing to blame, or trying to prove someone is at fault, can skew your judgment. Continue to focus on the analysis - the problem is what it is. When you find it, regardless of who's to blame, the question becomes who will fix it and how it will get fixed. That is why Step 7 is the most important of all.
Step 7) Document, document, document.
I cannot stress this enough. Document both the negative and the positive findings. Take screenshots of the analysis tools. Annotate them on the fly, and remember to include in your annotation the tool, trace, date, time, person, and environment in which the analysis was performed.
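A cheap way to make that metadata habit stick is a consistent stamp on every capture. A tiny Python sketch; the naming convention and field order are my own assumptions, not a standard:

```python
# Hypothetical sketch: build a consistent annotation/filename stamp so
# every screenshot carries the tool, date/time, person, and environment.
from datetime import datetime

def stamp(tool, trace, person, env, when=None):
    """Compose a sortable stamp like 20100102-0304_prod_observer_trace42_jdoe."""
    when = when or datetime.now()
    return f"{when:%Y%m%d-%H%M}_{env}_{tool}_{trace}_{person}"
```

Months later, when someone asks where a chart came from, the answer is in the name itself.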
Tool of choice here is hands-down SnagIt from TechSmith. It's a fantastic tool for grabbing the right graph, chart, or error message and then, right there, adding comments, circling the main point, and drawing arrows to highlight correlations. I'd say 50% or more of my time on a triage is spent in SnagIt.
I hope these steps will help you with your resolution of production issues. You know who to call if you get stuck.
So after you have gone through all of this, what are you going to do with all of this valuable data?
Next month, I will take on the subject of configuration management and its role within the knowledgebase.