Who is Altan Khendup?

A professional technologist who dabbles in innovative and interesting uses of technology, Mongolian history, philosophy, and cooking ethnic foods.

Often described as part philosopher, scholar, technologist, and mentor, Altan likes engaging in stimulating conversations with professionals, tackling problems in a hands-on and collaborative manner with technology, and enjoying the company of good friends and family.

 

Thursday, June 11, 2009

Application Health - A Holistic View

One of the greatest challenges for technology professionals is answering how "well" an application is doing. Usually everyone agrees on a measurement that is commonly defined as the health of the application, complete with historical perspective. Yet the devil is in the details: in many cases an application's health may in fact be in jeopardy, and in spite of numerous tools and voluminous data there is no clear picture of why things are going wrong.

It is during this frenzy that mistakes are made. Over the years I have found that looking at a problem holistically, rather than relying only on what the standard tools provide, leads to better answers or at least to different paths.

Experience plays a large role in how many technology professionals assess what to do, as it does in most professions. However, the technology world faces ever-increasing complexity, and every aspect of a business application is under growing pressure to deliver value. While tools are evolving, many do not provide the complete picture of what is needed. This is not to say that these tools cannot deliver the necessary data; rather, they often fall short in bridging the divide between business functions and operational metrics.

This gap is where I find the most mistakes are made. Solutions are proposed without any real proof or indication of what the actual issue is. It is just not possible to propose effective changes without knowing where the problem actually lies. 

A holistic review of the available data allows for faster, more effective solutions and incurs less cost by minimizing the effort applied to a problem.

A good example is a case I was involved with where there were concerns about the growing CPU consumption of an application. Overall business metrics pointed to a steady growth cycle but nothing overly large. Development had implemented a series of refactored code changes that decreased unnecessary executions of functions, which should have made them more efficient. System administrators and database administrators were at a similar loss in that they did not see any additional issues either. All tools were reporting normal operations. At this point it was fairly clear that while everyone had the proper data, it was not telling them where a potential issue could be.

So I gathered the data from all teams and organized it according to each business function. I aggregated the data across all the clusters and created a holistic graph depicting discrete application function processes against their CPU consumption. The result looked like this...

Summary of Application Function with CPU

Using a fairly large window of time, the graph confirmed not only the overall health metric of increased CPU consumption but also the corresponding CPU for each function. The graph clearly showed increases for most operations across the board, with only a few exceptions. Given that there had been changes to improve the efficiency of many functions, it was necessary to provide a different view of the same data so that each function could be seen properly in relation to its CPU consumption. That graph looked like this...

Individual Function with their CPU

This graph was even more revealing. It indicated that great progress had indeed been made in many of the functions. At the same time, several functions were showing growth in their CPU consumption. The greatest increases occurred in the Case Query and in functions related to messaging, so that is where I decided to focus my attention. Both of these areas had been improved to reduce unnecessary executions.

I started the investigation with the Case Query.
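The two views described so far, a total per function across clusters and then each function's own trend, can be sketched in a few lines of Python. This is only an illustration of the approach under stated assumptions: the cluster names, the "Login" function, the months, and every number in `samples` are invented for the example and are not the data from this case.

```python
from collections import defaultdict

# Hypothetical samples: (cluster, business function, month, CPU seconds).
# All values are invented for illustration; they are NOT the case data.
samples = [
    ("cluster-a", "Case Query", "2009-03", 400.0),
    ("cluster-b", "Case Query", "2009-03", 300.0),
    ("cluster-a", "Case Query", "2009-04", 500.0),
    ("cluster-b", "Case Query", "2009-04", 400.0),
    ("cluster-a", "Login", "2009-03", 200.0),
    ("cluster-b", "Login", "2009-03", 200.0),
    ("cluster-a", "Login", "2009-04", 150.0),
    ("cluster-b", "Login", "2009-04", 150.0),
]


def aggregate_by_function(rows):
    """Sum CPU across all clusters for each (function, month) pair."""
    totals = defaultdict(float)
    for _cluster, function, month, cpu in rows:
        totals[(function, month)] += cpu
    return dict(totals)


def relative_change(totals, function, start, end):
    """Fractional change in one function's CPU between two months."""
    before = totals[(function, start)]
    after = totals[(function, end)]
    return (after - before) / before


totals = aggregate_by_function(samples)

# Per-function view: which functions improved and which got worse?
for function in sorted({name for name, _month in totals}):
    change = relative_change(totals, function, "2009-03", "2009-04")
    print(f"{function}: {change:+.1%}")
```

Summing across clusters first gives the summary view; dividing each function's change by its own starting point gives the individual view, which is what lets a function that is getting worse stand out even when others are improving.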

Case Query executions and CPU consumption

As the next level of drill-down demonstrated, the improvement had indeed worked: the number of executions was down from a high of approximately 96M to about 68M, a reduction of about 29%. However, the CPU had increased over the same period, rising from 7% to 9% of the whole server set.
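The mismatch becomes clearer when CPU is expressed per execution. The sketch below uses only the approximate figures quoted above, and it assumes that the capacity of the server set was the same in both periods so that the two CPU percentages are comparable; the function names are illustrative.

```python
def fractional_reduction(before, after):
    """Fraction by which a count fell between two periods."""
    return (before - after) / before


def cpu_per_execution(cpu_share, executions):
    """CPU share of the server set attributed to a single execution."""
    return cpu_share / executions


# Approximate figures quoted in the text for the Case Query.
executions_before, executions_after = 96e6, 68e6
cpu_before, cpu_after = 0.07, 0.09  # share of the whole server set

reduction = fractional_reduction(executions_before, executions_after)
ratio = (cpu_per_execution(cpu_after, executions_after)
         / cpu_per_execution(cpu_before, executions_before))

print(f"Executions fell by {reduction:.0%}")
print(f"CPU per execution changed by a factor of {ratio:.2f}")
```

Under that assumption, each execution was costing roughly 1.8 times as much CPU as before, which points away from call volume and toward the work done inside each call.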

The next area to investigate was messaging. Since all messaging activity was based on the generator functions, I drilled into that core area in more depth as well.

Message Generator and CPU Consumption

The improvements were similar in that the number of message generators being requested fell from about 140M to 100M, yet the CPU increased marginally, by some 0.25%, and that increase had taken place only in the last few weeks of the most recent month. In the prior month, the CPU had actually fallen as expected.

Presenting the data in this manner helped the teams to focus more on the code itself and other potential areas of the code base. Routine changes to maintenance were ruled out since not all the functions were being impacted. After a careful review it was revealed that a core function relied on by many of the higher functions was indeed the culprit, using a very inefficient algorithm to manage work and requests. It was rewritten and deployed, resulting in the decreases in CPU consumption that were originally planned.

This is obviously a very simple example. However, it demonstrates that objective measurement, combined with reviewing an area of concern from a more holistic view that incorporates both functions and detailed data, can provide additional perspectives on problems affecting a company's application health. This matters especially in today's economy, where containing costs while increasing business value is critical to a company's bottom line.