Diagnosing Complex Applications - Answering the Tough Questions
Tuesday, August 17, 2010 at 9:37AM
One question I constantly come across in my professional career typically starts with an organization's leadership and rests on a very basic premise: what is going on with their business applications? This seems like a very straightforward question. In truth, it is not as easy to answer as one would imagine.
Many companies have grown organically, which, while beneficial, carries operational costs to consider. One of the most challenging is managing complex systems. This includes appraisal and diagnosis, and especially triage when applications critical to the business are having issues.
Generally speaking, almost all the organizations I have worked with have the same set of problems:
- Missing or out-of-date metrics. One cannot measure anything if nothing has been defined. This is where most businesses fail.
- Threshold Goals. Once metrics have been determined, thresholds have to be defined. These are basic boundaries that determine three basic states: healthy, not-so-healthy, and in jeopardy. These can also be characterized as zones: green, yellow, and red. These boundaries help establish what the business expects from its applications and operations.
- Growth expectations. Businesses expect growth. However, asking them to come up with an expectation in order to create a model is something most do not want to do. This is a tough balancing act, typically around "planned" growth events. In truth, if something does really well, all previous growth projections tend to become irrelevant, since the scale changes, say, from tens of thousands to millions. Regardless, a growth model needs to be in place.
- Holistic analysis. Only a very small handful of organizations really see this as a key practice for complex applications and systems. Most think of only a handful of elements, not the system in its entirety. It is absolutely essential to look at the complete spectrum and be able to analyze everything that can impact a business. This means hardware, software, network, web traffic, and user-based activity. All of it.
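To make the threshold idea concrete, here is a minimal sketch of how the green, yellow, and red zones might be encoded. The metric names and boundary values are hypothetical, chosen only for illustration, and the sketch assumes metrics where a higher value is worse.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Boundaries for one metric where higher values are worse."""
    yellow_at: float  # value at which the metric leaves the green zone
    red_at: float     # value at which the metric enters the red zone

    def zone(self, value: float) -> str:
        if value >= self.red_at:
            return "red"
        if value >= self.yellow_at:
            return "yellow"
        return "green"


# Hypothetical metrics and boundaries, for illustration only.
THRESHOLDS = {
    "response_time_ms": Threshold(yellow_at=500, red_at=2000),
    "cpu_percent": Threshold(yellow_at=70, red_at=90),
}


def assess(sample: dict[str, float]) -> dict[str, str]:
    """Map each measured metric to a zone; metrics with no threshold are flagged."""
    return {
        name: THRESHOLDS[name].zone(value) if name in THRESHOLDS else "undefined"
        for name, value in sample.items()
    }
```

Any metric that comes back as "undefined" is a reminder of the first problem on the list: a value is being collected, but nobody has said what normal looks like for it.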
So why these basics? It actually comes back to my training at Toyota. In order to diagnose what is wrong, you need to know what is normal. So the basic process that I go through includes:
- Get existing information. Whether it comes from existing tools, logs, or elsewhere, it is important to get what is available.
- Target data to answer key questions. These data points go by names ranging from Key Performance Indicators to Business Activity to Business Metrics. Yet their purpose is the same: identifying the key elements the business needs in order to answer its questions.
- Identify what is missing. Invariably there are elements that are missing. These need to be identified and then tackled in order of precedence.
By following this basic formula, holistic diagnosis and analytics can be automated and evolved over time.
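The "identify what is missing" step lends itself to a simple automated check. The following is only a sketch under assumed names: the business questions, the data point names, and the `find_gaps` helper are all hypothetical, not taken from any particular tool.

```python
# Hypothetical business questions mapped to the data points needed to answer
# them, listed in order of precedence (most important first).
REQUIRED = [
    ("Can we handle the promotion?",
     ["concurrent_users", "transactions_per_min", "cpu_percent"]),
    ("Why are account inquiries slow?",
     ["response_time_ms", "db_query_ms"]),
]


def find_gaps(required, available):
    """Return (question, missing data points) pairs, preserving precedence order."""
    gaps = []
    for question, needed in required:
        missing = [name for name in needed if name not in available]
        if missing:
            gaps.append((question, missing))
    return gaps


# Data points that existing tools and logs already provide (assumed).
available = {"cpu_percent", "response_time_ms", "concurrent_users"}
for question, missing in find_gaps(REQUIRED, available):
    print(f"{question} -> missing: {', '.join(missing)}")
```

Because the list of questions is kept in precedence order, the output doubles as a to-do list of which gaps to tackle first.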
So what sorts of scenarios does this cover? Some of the basic ones include:
- Capacity. The company is going to have some major event and wants to know if they can handle it. This is not simply a question about any one part of a complex system, but about the complete domain itself. Can it handle the extra users? Can it handle the traffic? Can it handle the business transactions? Can it handle the fallout? What is most likely to break? When? Where? How is that handled? All of these smaller questions are wrapped around the initial one.
- Triage and Diagnosis. Another very common issue involves problems that have impacted a business. Why is problem X happening? What are the symptoms? How are symptoms winnowed to potential causes? How are potential causes vetted to actual causes? What are the expected impacts of potential solutions? How fast can potential solutions be turned around? How much can triage address vs. long-term care? Being able to effectively manage all aspects of a problem enables the business to rapidly identify barriers to its growth and operations, saving money, cutting costs, and capitalizing on opportunities.
- Business opportunities. With all the diagnostics in place, analysis quickly moves into business opportunity analysis. What are customers doing? What are the various business units doing? When are they doing it? Why are they doing it? Is there something we are not doing? Is there something we can do better? Once a business has the ability to look at its systems in a holistic manner, all sorts of interesting patterns emerge that are of interest to any business leadership.
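For the capacity scenario, even a crude growth model gives the business something to argue with. The sketch below assumes simple compound growth per period and a single known capacity figure; real systems have many limits, so treat the function name, parameters, and numbers as illustrative assumptions only.

```python
def periods_until_exceeded(current_load, growth_rate, capacity, max_periods=120):
    """Return the first period in which projected load exceeds capacity.

    Assumes compound growth: load is multiplied by (1 + growth_rate) each period.
    Returns None if capacity is not exceeded within max_periods.
    """
    load = current_load
    for period in range(1, max_periods + 1):
        load *= 1 + growth_rate
        if load > capacity:
            return period
    return None


# Hypothetical numbers: 50,000 users today, room for 100,000.
print(periods_until_exceeded(50_000, 0.10, 100_000))  # steady 10% growth per period
print(periods_until_exceeded(50_000, 9.0, 100_000))   # a promotion that does really well
```

The second call is the case mentioned earlier: when something takes off and the scale jumps by an order of magnitude, the planned headroom disappears in a single period.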
When a business is presented with this sort of proposal, it can seem daunting and, in many cases, especially from the operations side of the house, redundant. However, the analytics are not designed to replace existing investments; rather, they are a way to look at what exists and to identify and plug gaps.
For example, many businesses have raw infrastructure data in the form of CPU, memory, disk, network activity, web traffic, etc. However, this data is almost never compared to business application operations, which track groups of activity in relation to one another. For example, I have often asked an operations expert what the link traffic is for a user who is inquiring about their account. They can give me all sorts of raw data but cannot put it together. If I ask the specific application expert, they can tell me the path but not the application components. If I ask a developer, they can tell me the functionality and application components but rarely the actual business case. Seen in this light, the problem is very clear: each domain is responsible for its own area. However, in most cases there is no one to put them all together.
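One way to start putting the pieces together is to line up infrastructure samples with the time windows in which business activities ran. This is a minimal sketch, assuming both data sets carry comparable numeric timestamps; the activity names, metric names, and the `infrastructure_by_activity` helper are hypothetical.

```python
from collections import defaultdict


def infrastructure_by_activity(activities, samples):
    """Average each infrastructure metric over the window of each business activity.

    activities: list of (activity_name, start, end) with numeric timestamps
    samples:    list of (timestamp, metric_name, value)
    Returns a dict keyed by (activity_name, metric_name).
    """
    totals = defaultdict(lambda: [0.0, 0])  # (activity, metric) -> [sum, count]
    for name, start, end in activities:
        for ts, metric, value in samples:
            if start <= ts <= end:
                entry = totals[(name, metric)]
                entry[0] += value
                entry[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Hypothetical data: two business activities and a handful of CPU samples.
activities = [("account_inquiry", 0, 10), ("quarterly_close", 20, 30)]
samples = [
    (2, "cpu_percent", 40.0),
    (8, "cpu_percent", 60.0),
    (15, "cpu_percent", 10.0),  # falls outside both windows, so it is ignored
    (25, "cpu_percent", 90.0),
]
print(infrastructure_by_activity(activities, samples))
```

The nested loop is fine for a sketch; with real volumes the same join would be done in a database or a time-series tool, but the idea of keying raw data by business activity stays the same.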
Once everything is put together, a business can actually see, for every business activity, its impact on their technical infrastructure, personnel, and operations. They can also combine important business events, such as quarterly close, specific product promotions, and anything else, into a complete high-level view of what happens to their organization when those events occur: how busy the application is, how many users are actually assisting in the endeavors, whether reports are being executed before, during, or after the event, how many business operations are being executed, which partners are being used the most, etc.
Being able to answer tough questions means diagnosing and analyzing complex applications and systems from a different and more relevant perspective. It also means being open to the idea that having existing tools and perspectives available does not mean you can answer the tough questions.

