Tuesday, June 17, 2008

SLAs: why are they needed?

Service Level Agreements, or SLAs. Those who have been able to develop them with meaningful, achievable metrics love them. For everyone else, they are a nightmare. What makes a good SLA? In my experience, it takes very little. First, it needs to be understandable. If you don't understand the service commitment you are expected to meet, then the actions you need to take to improve will be a mystery.
For example, if a person at a fast food restaurant were measured on the quality of their hamburgers, that is a good thought, but what does it mean? It's not like they can change the type of beef or bread. So rather than just telling them to improve the quality, the manager needs to put in measures that the employee can affect: time on the shelf less than 10 minutes, bread no older than 5 days, and so on.

Those are factors that the employee can watch and adjust, ultimately improving performance against the SLA. Which brings me to the second factor: it has to be measurable. If you cannot measure it, you cannot manage it. If you cannot time the hamburger on the shelf or check the age of the bread, you cannot use them as indicators of quality.

If you look at the standard SLAs in place that are not on the nightmare side of the house, they are things like 99.999% uptime. If you asked most IT folks what that meant, you would get different answers. Some would say the server is up 99.999% of the time over the course of a year. Others might say 99.999% means that application services are available to all users, with no more than about 5 minutes of downtime over the course of a year.

The first is very measurable, but not of high value. The second is extremely valuable but difficult to measure.
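
To make the numbers behind an uptime target concrete, here is a minimal sketch that turns an availability percentage into a yearly downtime budget. The function name and the 365-day year are my own assumptions for illustration, not part of any SLA standard.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; assumes a non-leap year


def allowed_downtime_minutes(availability_pct: float,
                             period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Return the downtime budget, in minutes, for an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period_minutes * (100 - availability_pct) / 100


# Print the yearly budget for a few common targets.
for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.1f} minutes per year")
```

At 99.999% the budget comes to a little over five minutes a year. The arithmetic is the easy part; deciding what counts as "down" (the server, or the service as users see it) is what the agreement has to spell out.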

So when establishing your agreements on the level of service required, it is important to determine what you can do and what the business needs, then negotiate the middle ground. The more the business needs, the more IT will need to deliver and the higher the cost. Overpromising on an SLA that the IT department cannot hit does not help anyone, so it is crucial for IT to establish what its capabilities look like. My next post will cover what a Service Catalog is and why it is needed for true SLA management.

Monday, June 9, 2008

The Art of Triage

To many, troubleshooting seems to be a gift that either you have or you don't. For instance, my father is a mechanic. When he owned his own repair shop he would hire young guys who would spend hours troubleshooting a problem, but within minutes of my old man getting involved he would diagnose the cause. The timing belt, carburetor issues, whatever it was, he was quick to pinpoint it. Inevitably, once the problem was found there was the “oh, of course” from the junior grease monkey.

I was too clumsy to be a mechanic, so my dad fired me and forced me into computers. However, I didn’t forget what I had learned about troubleshooting.

First, troubleshooting is not something you are born with. It is a skill that is harnessed based on 3 common factors:
1) What you know
2) What you don’t know
3) What you are learning

When you piece these 3 factors together you create the framework for discovery. Applying both a negative and a positive approach to each of them then leads you down the path that good troubleshooters simply call the process of elimination.

Do you know what is working? Do you know what is not working?
What don’t you know is working? What don’t you know is failing?
What have I proved with this step? What have I disproved with this step?

So when it comes to troubleshooting complex systems, the same principle applies. You just need to analyze them in layers. Here are the layers that VIGILANT has documented as the logical points to eliminate.

Infrastructure: Hardware, Networking, Operating Systems
Application: 3rd party application services
System Interfaces: Connectivity between dependent systems
Business logic: Business rules that cause transactions to operate differently
Business Process: The way the end-user is executing the transaction
Business Service: Dependency on data or other elements for success

For really complex issues, take each of these tiers and apply the 3 principles of discovery to them, and you will find the problem is not as much of a black hole as you thought it was.
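
The layer-by-layer elimination above can be sketched in a few lines of code. This is only an illustration under my own assumptions: the triage function and the example checks are hypothetical, and real checks would ping hosts, query services, or replay a transaction.

```python
from typing import Callable, Dict, List, Optional

# Layers in the order listed above, lowest first.
LAYERS = [
    "Infrastructure",
    "Application",
    "System Interfaces",
    "Business logic",
    "Business Process",
    "Business Service",
]


def triage(checks: Dict[str, Callable[[], Optional[bool]]]) -> Dict[str, List[str]]:
    """Sort layers into proved working, proved failing, and unknown.

    Each check returns True (proved working), False (proved failing),
    or None (nothing proved yet). A layer without a check is unknown.
    """
    result: Dict[str, List[str]] = {"working": [], "failing": [], "unknown": []}
    for layer in LAYERS:
        check = checks.get(layer)
        outcome = check() if check else None
        if outcome is True:
            result["working"].append(layer)
        elif outcome is False:
            result["failing"].append(layer)
        else:
            result["unknown"].append(layer)
    return result


# Hypothetical checks standing in for real diagnostics.
example_checks = {
    "Infrastructure": lambda: True,
    "Application": lambda: True,
    "System Interfaces": lambda: False,
}

report = triage(example_checks)
print(report)
```

The "unknown" bucket matters as much as the other two: it is the list of things you have not yet proved either way, which is where the next troubleshooting step should go.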

Tuesday, June 3, 2008

Performance Engineering Tips - A solid plan leads to better results

Creating a performance plan has many challenges, and creating a realistic load test has to be one of the greatest. Load and performance testing is often seen as a nice-to-have. However, no one ever says "yeah, I'm OK if my applications run slower". As load and data capacity increase, this is exactly what happens. Midstream in the operation of the business, applications can suddenly start to slow down. This usually happens without warning and almost always has a detrimental impact on the business.

How can you keep this from happening? A better test plan is the place to start.
  1. Review the types of activities that the users will be performing. We call these transactions.
  2. Review the location and number of users. Take into consideration network speeds.
  3. Review the number of transactions that will be performed.

Many IT performance testers simply look at user count and business transactions. Failing to understand the network conditions and the volume of transactions will produce an inadequate simulation.
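
One way to keep all three review items in front of you is to write the plan down as data before scripting any load. The sketch below is a hypothetical structure, not the format of any particular load-testing tool; the class names, field names, and figures are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Transaction:
    name: str
    per_user_per_hour: float  # how often one user performs it (assumed rate)


@dataclass
class UserGroup:
    location: str
    users: int
    network_latency_ms: float  # assumed round-trip latency to the application


@dataclass
class LoadTestPlan:
    transactions: List[Transaction]
    user_groups: List[UserGroup]

    def total_users(self) -> int:
        return sum(group.users for group in self.user_groups)

    def transactions_per_hour(self) -> float:
        # Simplification: every user performs every transaction at its stated rate.
        per_user = sum(t.per_user_per_hour for t in self.transactions)
        return per_user * self.total_users()


plan = LoadTestPlan(
    transactions=[Transaction("login", 1), Transaction("search", 12)],
    user_groups=[
        UserGroup("head office", 200, 5),
        UserGroup("remote branch", 50, 120),
    ],
)
print(plan.total_users(), plan.transactions_per_hour())
```

Keeping latency on each user group makes it harder to forget that 50 users on a slow link can behave very differently from 50 users on the local network.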

The better the simulation, the more valuable the prediction of how the system will operate when it goes live.