Entries in Holistic problem solving (6)

Tuesday, Aug 17, 2010

Diagnosing Complex Applications - Answering the Tough Questions

One question I come across constantly in my professional career typically starts with the leadership of an organization and rests on a very basic premise - what is going on with their business applications? This seems like a very straightforward question. In truth, however, it is not as easy to answer as one would imagine.

Many companies have grown organically, which, while beneficial, carries some operational costs. One of the most challenging is managing complex systems. This includes appraisal and diagnosis, and especially triage when applications critical to the business are having issues.

Generally speaking almost all the organizations that I have had experience with typically have the same set of problems:

 

  • Missing or out-of-date metrics. One cannot measure anything if nothing has been defined. This is where most businesses fail. 
  • Threshold Goals. Once metrics have been determined, thresholds have to be defined. These are basic boundaries that determine three basic states: healthy, not-so-healthy, and in jeopardy. These can also be characterized as zones: green, yellow and red. These boundaries help establish what the business expects from their applications and operations.
  • Growth expectations. Businesses expect growth. However, asking them to come up with an expectation from which to create a model is something most do not want to do. This is a tough balancing act, typically around "planned" growth events. In truth, if something does really well, all previous growth projections tend to become irrelevant, since the scale changes, say, from tens of thousands to millions. Regardless, a growth model needs to be in place.
  • Holistic analysis. Only a very small handful of organizations really see this as a key practice for complex applications and systems. Most think of only a handful of elements, not the system in its entirety. It is absolutely essential to look at the complete spectrum of options and be able to analyze everything that can impact a business. This means hardware, software, network, web traffic, and user-based activity. All of it.
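
As a rough illustration of the threshold idea above, here is a minimal Python sketch that maps a measured value onto the green, yellow and red zones. The `Threshold` class, the `zone` function and the response-time numbers are all hypothetical, chosen only for illustration; real boundaries have to come from what the business expects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Boundaries for one metric (hypothetical structure, higher is worse)."""
    yellow: float  # start of the not-so-healthy zone
    red: float     # start of the in-jeopardy zone


def zone(value: float, threshold: Threshold) -> str:
    """Map a measured value onto the green / yellow / red zones."""
    if value >= threshold.red:
        return "red"
    if value >= threshold.yellow:
        return "yellow"
    return "green"


# Assumed example metric: average response time in seconds.
response_time = Threshold(yellow=2.0, red=5.0)
for sample in (0.8, 3.1, 7.4):
    print(sample, zone(sample, response_time))
```

The sketch assumes a metric where larger values are worse; a metric such as available capacity would need the comparisons reversed.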

 

So why these basics? It actually comes back to my training at Toyota. In order to diagnose what is wrong, you need to know what is normal. So the basic process that I go through includes:

  • Get existing information. Whether it be from existing tools, logs, etc. It is important to get what is available.
  • Target data to answer key questions. These data points range in names from Key Performance, to Business Activity, to Business Metrics, etc. Yet their purpose is the same - identifying key elements that the business is looking for to answer their questions.
  • Identify what is missing. Invariably there are elements that are missing. These need to be identified and then tackled in order of precedence.

By following this basic formula, holistic diagnosis and analytics can be automated and evolved over time.

So what sorts of scenarios does this cover? Some of the basic ones include:

  • Capacity. The company is going to have some major event and wants to know if they can handle it. This is not just a question about any one part of a complex system, but rather about the complete domain itself. Can it handle the extra users? Can it handle the traffic? Can it handle the business transactions? Can it handle the fallout? What is most likely to break? When? Where? How is that handled? All of these smaller questions are wrapped around the initial one.
  • Triage and Diagnosis. Another very common issue is around problems that have impacted a business. Why is X problem happening? What are the symptoms? How are symptoms winnowed to potential causes? How are potential causes vetted to actual causes? What are expected impacts to potential solutions? How fast can potential solutions be turned around? How much can triage address vs. long term care? Being able to effectively manage all aspects of a problem enables the business to rapidly identify barriers to their growth and operations saving money, cutting costs and capitalizing on opportunities.
  • Business opportunities. With all the diagnostics in place, analysis quickly moves into business opportunity analysis. What are customers doing? What are the various business units doing? When are they doing it? Why are they doing it? Is there something we are not doing? Is there something we can do better? Once a business has the ability to look at their system in a holistic manner, all sorts of interesting patterns emerge that are of interest to any business leadership.

When a business is presented with this sort of proposal it can seem daunting, and in many cases, especially on the operations side of the house, it is considered redundant. However, the analytics are not designed to replace existing investments; rather, they are a way to look at what exists and to identify and plug gaps.

For example, many businesses have raw infrastructure data in the form of CPU, memory, disk, network activity, web traffic, etc. However, this data is almost never compared to business application operations, which track groups of activity in relation to one another. For example, I have often asked an operations expert what the link-traffic is for a user who is inquiring about their account. They can give me all sorts of raw data but cannot put it together. If I ask the specific application expert, they can tell me the path but not the application components. If I ask a developer, they can tell me the functionality and application components, but rarely can tell me the actual business case. When put in this light the problem is very clear: each domain is responsible for its individual area of responsibility. However, in most cases there is no one to put them all together.

Once these are put together, a business can actually see, for every business activity, its impact on their technical infrastructure, personnel, and operations. They can also then put important business events such as quarterly close, specific product promotions, and anything else together and get a complete high-level view of what happens to their organization when that occurs: how busy is their application, how many users are actually assisting in the endeavors, are reports being executed before/during/after the event, how many business operations are being executed, what partners are being used the most, etc.
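
To make the idea of putting raw infrastructure data next to business activity a little more concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the per-minute CPU samples, the activity names and the `cpu_by_activity` helper are hypothetical, and the result only shows what the CPU looked like while each activity was running, not how much of it that activity caused.

```python
from collections import defaultdict

# Hypothetical samples: (minute, cpu_percent) from infrastructure monitoring.
cpu_samples = [(0, 35.0), (1, 42.0), (2, 78.0), (3, 81.0)]

# Hypothetical business activity log: (minute, activity_name).
activity_log = [
    (0, "account_inquiry"), (1, "account_inquiry"),
    (2, "quarterly_close"), (2, "account_inquiry"),
    (3, "quarterly_close"),
]


def cpu_by_activity(cpu_samples, activity_log):
    """Average the CPU observed during the minutes each activity was running."""
    cpu_at = dict(cpu_samples)
    readings = defaultdict(list)
    for minute, activity in activity_log:
        if minute in cpu_at:  # skip activity with no matching infrastructure sample
            readings[activity].append(cpu_at[minute])
    return {name: sum(vals) / len(vals) for name, vals in readings.items()}


for name, avg in cpu_by_activity(cpu_samples, activity_log).items():
    print(f"{name}: {avg:.1f}% average CPU while active")
```

Even a crude join like this starts to answer the question that no single domain expert can: what the infrastructure is doing while a given business activity is under way.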

Being able to answer tough questions means diagnosing and analyzing complex applications and systems in a different and more relevant perspective. It also means being open to the idea that while you may have existing tools and perspectives available, it does not mean you can answer tough questions.

Friday, Feb 26, 2010

PeopleSoft Logs - Finding Hidden Nuggets

In almost all shops that I go to with problems related to performance or business operations due to technical issues, log files are an important and valuable source of data and information. Once one gets past the basic questions such as "What are the symptoms?" or "What kind of application are you running?" many details start to emerge. One of the biggest challenges to any technical analysis involves large production applications that are running either a) their own frameworks and solutions or b) others' frameworks and solutions. Usually group A has lots of internal experts who, while often extremely brilliant, have very little expertise in what happens to their solutions once they have been running in production for a while. The benefit of a group A is that its experts are all close by, knowledgeable and often very open minded about finding a solution. Group B happens to be the more classic kind, who run technologies not of their own creation. They could be running PeopleSoft, Oracle, SAP, IBM, or even some of their own internal solutions mixed with any or all other types of solutions. These complex environments have both internal experts and external experts, in the form of the vendors themselves, to call upon.

In complex system interactions there are many places for useful information to hide. One of these places is the little gems known as log files. Pretty much every application and system under the sun produces them and they are often captured for analysis. Some of the more advanced shops have their own tools, again either developed in-house or acquired, that parse the logs and present information. Others are a little less sophisticated. However, log files are often an excellent way to look at events at various layers of a system to determine whether or not there are problems.

Often times log files are hard to read. This is especially true for those produced by applications from vendors whose formats and data vary wildly from one to another. A PeopleSoft application is no exception. There are several different logs produced that one has to consider for a variety of situations:

  • Web Log. This is the log of the typical web server, whether it be WebSphere or another server. It holds the core information for all web interactions, which include both standard users and automated web interactions.
  • PeopleSoft Web Log. These are maintained on the PeopleSoft side, usually kept in the webserv directory. They contain the stderr, the stdout and also the application gateway logs, which contain web activity for a PeopleSoft application.
  • PeopleSoft middleware logs. These are a collection of logs used to describe what is occurring within the layers of the PeopleSoft application. They include:
    • Application Server Logs - Used to describe what is occurring within the application servers.
    • Tuxedo Logs - Used to describe what the transaction servers are actually doing from a resource and request level.
    • Dump Files - Usually only occurring when something strange has happened, these files can appear usually in the application server logs for extreme system events.
    • REN (Realtime Event Notification) Server Logs - Similar to Application server logs but for REN events.
    • Process Scheduler Logs - A small collection of logs for each part of the process scheduler that describes what is happening at each level.
  • PeopleSoft Application Data. These are the tables and constructs such as the messaging constructs (PSAPMSG*) or the process scheduler (PSPRCS*) that have useful information that may or may not be inside the previously mentioned logs.

The first thing to notice is how many logs there are, and secondly how dispersed they are. They are spread across an entire infrastructure. Another point to consider is that often some hint of a problem can be found within these logs, but only if you set the logging levels to something meaningful. I am not talking about high levels of detail, but even summary information has its limits. Constant and perpetual logging in a running production environment is typically seen as a "No-No" in terms of performance. However, logging takes little overhead if set properly, and background processes that periodically "clean and archive" the data for processing can minimize the disk worries. The overhead for a PeopleSoft environment is usually anywhere from 5-10%. This has to be weighed against the time it takes to identify a production issue.

In my experience being proactive means having a good logging level, a means of capturing all that information, analyzing it, and presenting it such that you have in addition to all of your existing tools a holistic and historical view of what your application is doing relative to the business transactions.

Some issues, such as those with web-based transactions using the integration gateway, require traversing several different tiers. The reason is simple: the information about the layers of the integrations is not all stored in the same place. Each layer has a different story to tell and can be of great help in determining problems. Based on this I use the following process to look at integration issues:

  • Is the integration synchronous or asynchronous? They have different structures in the database in terms of what to look for.
    • Synchronous integrations are typically not logged unless one sets that up in the PeopleSoft web screens (called PIA). Setting these to at least "Log Header" will store the information in the database in the PSIBLOGHDR and PSPUBCON. Setting it to "Log Detail" is a lot more overhead and typically unnecessary unless you want to look at the contents of a message itself.
    • Asynchronous integrations are usually logged. They are stored in different structures such as the PSAPMSGSUBCON.
    • Errors for either, if the PeopleSoft developers have put in error handling, will appear in the PSAPMSGPUBERR or PSAPMSGSUBERR. Overall errors will be placed in the PSIBLOGHDR.
  • Next the actual logs may contain information. Usually integration transactions run in their own application servers separate from standard transactions. Why? Because putting them both in a single domain usually overloads the application servers in all but the smallest of shops. Even if they are in the same domain, the logs will still contain additional information about their events.
  • Tuxedo will be able to present information on the messaging as well in terms of requests processed, requests currently waiting and the workload being processed. This is informative in the historical context against business transactions to see if there are any load issues occurring.
  • The gateway itself can produce information about transactions. There may be errors connecting to a service, there may be formatting issues, there may even be Java memory issues for a particular request. These can be seen in the PSIGW for any PeopleSoft integrations.
  • The web server. If there are insufficient memory resources for Java or other issues, they will appear in these logs.
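
Because each tier tells a different part of the story, it often helps to line the events up on a single timeline around the moment of a failure. The following is a minimal Python sketch of that idea. The log line format (`YYYY-MM-DD HH:MM:SS message`), the tier names and the sample messages are assumptions made for illustration; they are not the actual PeopleSoft, Tuxedo or web server formats, which would each need their own parser.

```python
import heapq
from datetime import datetime, timedelta


def parse(tier, lines):
    """Yield (timestamp, tier, message) from 'YYYY-MM-DD HH:MM:SS message' lines."""
    for line in lines:
        stamp, message = line[:19], line[20:]
        yield datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), tier, message


def timeline(sources, around, window=timedelta(seconds=30)):
    """Merge per-tier events (each already in time order) near a moment of interest."""
    merged = heapq.merge(*(parse(tier, lines) for tier, lines in sources.items()))
    return [event for event in merged if abs(event[0] - around) <= window]


# Hypothetical log excerpts, one list per tier.
sources = {
    "web": ["2010-02-26 10:00:05 request received",
            "2010-02-26 10:05:00 request received"],
    "gateway": ["2010-02-26 10:00:06 connect timeout"],
    "appserver": ["2010-02-26 09:00:00 domain booted",
                  "2010-02-26 10:00:07 service error"],
}

for when, tier, message in timeline(sources, datetime(2010, 2, 26, 10, 0, 6)):
    print(when, tier, message)
```

The sketch assumes that the clocks on all tiers are synchronized and that each log is already in time order; in practice both assumptions need checking before trusting the merged view.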

Most of the time getting the basics is pretty easy if you have all the access, scripts, SQL, and information readily at hand. Typically this is not the case. I find that many shops have different people covering different areas, each area can have different requirements, and these requirements may or may not produce the level of data needed to make a determination. All of this takes time. In a dev or QA function, time may be flexible. In production outage situations time is not on your side.

Having spent a lot of time helping companies solve production issues for their mission-critical applications, I find that the processes are not really all that different from one application to the next. Typically the challenge is finding the correlated data, piecing it together quickly and efficiently, and having the proper automated tools in place that can quickly answer a question. If people are being used for all issues, the length of time to find a problem is very high and the response is usually not proactive enough to address production problems. Many implementations fail to properly consider the amount of on-going development needed to meet the on-going SLA commitments for their projects.

Good luck on your own implementations!

 

Friday, Jan 29, 2010

Interviewing Questions - Memorization vs. Problem Solving

I am typically asked to interview very senior candidates. This is mostly due to my experience, but also because I tend to be more honest about the challenges being faced in an organization and the delicate balance between technical, management, and leadership skills and personality that will be necessary for a great organizational fit.

When interviewing candidates for positions I look at it as an investment. Firstly I am spending a lot of time and consequently money in the process, and the person will be with the organization for a long period of time, potentially up to 5 years if the match is good, longer if the match is great.

Personally and professionally I find very little value in what I call "memorization" questions for technical professionals. Most of the managers I meet ask for a basic set of questions for vetting purposes. However, I see these questions as ineffective and wasteful more than anything else. I care less for someone who has memorized every buzzword or specific syntax for languages or commands; this is what things like Unix man pages and Google searches are for. Just because someone is good at memorizing things does not make them a good fit for creating solutions.

When interviewing I look for the following:

  • Are they leaders? Senior positions are not followers but rather leaders in their own right. So I ask questions about initiative taking, risk management, and investments.
  • Can they think outside of the box? In almost all of my situations the initiatives being undertaken have not really been done before. As a result, there is no "best practice" solution that someone can look up to solve the problem. The senior positions have to have the ability to see this situation and move ahead, figuring things out along the way. This includes new approaches, new viewpoints, new technical implementations, etc. It can even be a new way to do something with older approaches.
  • Do they ask for permission rather than for forgiveness? In many cases what I am looking for are the necessary leaders and initiative takers that will sometimes bend rules. Not important ones mind you such as regulatory compliance, but rather set-in-stone mind sets that need to be challenged. I want the person to be independent so the questions along this line are what I am looking for.
  • Can they play well with others? Regardless of how stellar an individual is, they are still working on a team. From a 2-person startup to a 300,000-person Fortune 50 company the needs are the same. This is very important to determine since prima donnas often derail efforts rather than build them.
  • Do they have what I call the "I am a hammer, everything is a nail" perspective? In my experience specialists are needed for certain specific tasks, but leaders need to be demonstrably better at cutting across boundaries than specialists. Especially in technical professional ranks, I often come across people that are very good at a certain set of technologies: all Java, all C, all Python, all systems, all Oracle, etc. The questions that I ask determine whether or not they can cross multiple specialities and, using generalist thinking, quickly understand, grasp and apply what they have learned to the problems at hand.

Often times memorizers make very poor fits for effectively solving problems. They wait for requirements as opposed to going out and getting them, they are geniuses with certain sets of tools but horrible when learning new ones, etc. Any organization that is looking to grow needs people who see and build towards the future, not just what is in front of them for the moment.

This process has worked out very well. I have had professionals who had done very poorly at the memorization questions and, at my insistence, were brought back for another session. In many instances, these individuals that I helped to hire turned into excellent and capable contributors and leaders, delivering outstanding value to the organizations that hired them.

 

 

Wednesday, Sep 02, 2009

Application Consolidation - The Enterprise Quandary

One of the most natural actions for any company is to cut costs during tough economic times. One of the most common actions in large companies is what I call the dreaded "consolidation". Basically this is the activity that attempts to reduce a company's large portfolio of applications to a smaller set. Logically and fiscally this makes sense in terms of elimination. If you can actually reduce your portfolio through true elimination you can really save on costs.

However, this is not as straightforward as it appears. I am going to forego the larger management and operational views of the activity and rather focus on the lower level implementation which tends to be glossed over in these discussions.

When large companies speak of consolidation, they are usually looking at making savings on redundancy. This is especially true of companies that have gone through mergers. For example, a company that has merged several times may in fact have many redundant applications such as ordering, provisioning, HR, etc. At some point consolidation focuses on these applications and attempts to merge them into one.

The issue for large companies happens to be the size and hence complexity of this approach. In my experience this exercise is fairly ponderous even for smaller companies. When you get to organizations with tens of thousands of employees or more, this is not so simple, mostly because the amount of data in the applications is vast, as usually are the infrastructure and support personnel needed to keep them afloat. One classic case I have run into is the predicament related to ordering. In many cases, especially in merged organizations, ordering applications tend to address different segments of the customer base. For example, in a large telecom there would be one ordering application for broadband, one for wireless and one for land-line. The ideal would be to consolidate all of these into one platform.

Now comes the hard reality. In this scenario, each application services tens of millions of customers and billions of transactions, covers hundreds if not thousands of servers across multiple data centers, holds hundreds of terabytes of data, and more than likely has hundreds of individuals servicing it. Consolidating these applications into one platform is daunting, mostly because such a move pushes companies into an area that is not necessarily their strength: expert technical infrastructure management. How do you move all the data? How do you merge the applications such that functions are available to all parties without having to extensively retrain thousands of customer service representatives? If you leverage existing hardware or reduce it, how do you manage such a large distributed application on big iron from big vendors? For example, consider just the database level, which may be from a major vendor such as Oracle. In some of these cases consolidation pushes the limits of the vendor's ability to deliver a solution, since merging all the applications could mean well over 800TB worth of data. Imagine the sheer magnitude of moving this data en masse, which is how most shops know how to tackle a problem of that size.

Usually after hitting the brick wall of reality, many compromises are made to applications, sometimes spanning years, which result in paring systems down enough that they can actually be shut down and rolled off. However, this works for only the smallest of applications or the least used ones. The larger ones which are still being used account for the greatest potential for savings and present the greatest of challenges. So in practice most companies tend to limit their consolidations to the smaller or medium sized applications which, while achieving savings, do not deliver as much as first thought.

Hence many technical groups within these large companies are intrigued by cloud, data management, and the large scale infrastructures common to certain internet companies. Unlike some of their smaller cousins, larger companies do not have the luxury of growing into the problem. They are there already. The hope is that by looking at more innovative approaches, their own technical professionals will be able to implement solutions that can best address the business need to cut costs while ensuring their applications can continue to meet the growth and functional demands.

Thursday, Jun 11, 2009

Application Health - A Holistic View

One of the greatest challenges for technology professionals is to answer how "well" an application is doing. Usually everyone has an agreed upon measurement that is commonly defined as the health of the application, complete with historical perspectives. Yet the devil is in the details: in many cases an application's health may in fact be in jeopardy and, in spite of numerous tools and voluminous data, there is no clear picture as to why things are going wrong.

It is during this frenzy that mistakes are made. Over the years I have found that looking at a problem holistically rather than what is simply provided for in standard tools leads to better answers or at least different paths. 

Experience plays a large role for many technology professionals in their assessment of how to do things. This is not uncommon for many professionals. However, the technology world is faced with ever increasing complexity so every aspect of a business application is faced with growing pressures to deliver value. While tools are evolving many simply do not always provide the complete picture of what is needed. This is not to say that these tools will not be able to deliver necessary data, however often times they are not fully capable in terms of bridging the divide between business functions and operational metrics. 

This gap is where I find the most mistakes are made. Solutions are proposed without any real proof or indication of what the actual issue is. It is just not possible to propose effective changes without knowing where the problem actually lies. 

Holistic review of the data available allows for faster, more effective solutions and incurs less cost by minimizing the amount of effort that is applied to a problem.

A good example would be a case that I was involved with where there were concerns about the growing CPU consumption of an application. Overall business metrics pointed to a steady growth cycle but nothing overly large. Development had implemented a series of re-factored code changes that decreased unnecessary executions of functions, making them in fact more efficient. System administrators and database administrators were at a similar loss in that they did not see any additional issues either. All tools were reporting normal operations. At this point it was fairly clear that while everyone had the proper data, it was not telling them where a potential issue could be.

So I gathered the data from all teams and organized them according to each business function. I aggregated the data across all the clusters and created a holistic graph depicting discrete application function processes against their CPU consumption. The result looked like this...
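
The aggregation step can be sketched in a few lines of Python. This is only an illustration under assumptions: the record layout, the cluster names, the months and the CPU figures are all hypothetical, and `cpu_share_by_function` is a made-up helper rather than the tooling actually used in this case.

```python
from collections import defaultdict

# Hypothetical records: (month, cluster, business_function, cpu_seconds).
records = [
    ("2009-04", "cluster-a", "Case Query", 120.0),
    ("2009-04", "cluster-b", "Case Query", 100.0),
    ("2009-04", "cluster-a", "Message Generator", 80.0),
    ("2009-05", "cluster-a", "Case Query", 150.0),
    ("2009-05", "cluster-b", "Case Query", 130.0),
    ("2009-05", "cluster-a", "Message Generator", 70.0),
]


def cpu_share_by_function(records):
    """Sum CPU across clusters, then express each function as a share of its month."""
    totals = defaultdict(float)        # (month, function) -> cpu_seconds
    month_totals = defaultdict(float)  # month -> cpu_seconds
    for month, _cluster, function, cpu in records:
        totals[(month, function)] += cpu
        month_totals[month] += cpu
    return {key: cpu / month_totals[key[0]] for key, cpu in totals.items()}


for (month, function), share in sorted(cpu_share_by_function(records).items()):
    print(f"{month} {function}: {share:.0%} of CPU")
```

Keying the totals by business function rather than by server or cluster is the point of the exercise: it is the view that none of the individual teams' tools provided.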

Summary of Application Function with CPU

Using a fairly large window of time, the graph not only confirmed the overall health metric of increased CPU consumption but also showed the corresponding CPU for each function. The graph clearly showed increases for most operations across the board except for a few. Given that there had been changes to improve the efficiency of many functions, it was necessary to provide a different view of the same data so that each function could be seen properly in relation to its CPU consumption. That graph looked like this...

Individual Function with their CPU

This graph was even more revealing. It indicated that there had indeed been great progress made in many of the functions. At the same time, several functions were demonstrating increased growth in their CPU consumption. The greatest increases occurred in the Case Query and functions related to messaging. So that is where I decided to focus my attention. Both of these areas had been improved to reduce unnecessary executions.

I started the investigation with the Case Query.

Case Query executions and CPU consumption

As the next level of drill down demonstrated, the improvement had indeed worked in that the number of executions was down, reduced from a high of approximately 96M to about 68M, a reduction of about 29%. However, it also appeared that the CPU had increased, rising from 7% to 9% of the whole server set.
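
A quick back-of-the-envelope calculation shows why these two numbers together were a red flag. The Python sketch below uses only the figures quoted above; treating the CPU share divided by the execution count as a proxy for cost per execution is my own simplification, and it assumes the size of the server set did not change over the period.

```python
def relative_change(before, after):
    """Fractional change from `before` to `after` (negative means a decrease)."""
    return (after - before) / before


executions_before, executions_after = 96e6, 68e6
cpu_before, cpu_after = 7.0, 9.0  # percent of the whole server set

execution_change = relative_change(executions_before, executions_after)

# Assumed proxy: share of CPU consumed per execution.
cost_before = cpu_before / executions_before
cost_after = cpu_after / executions_after
cost_change = relative_change(cost_before, cost_after)

print(f"executions: {execution_change:+.1%}")            # -29.2%
print(f"CPU share per execution: {cost_change:+.1%}")    # +81.5%
```

In other words, under that assumption each Case Query execution was consuming roughly 80% more CPU than before, even though far fewer of them were running.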

The next area to investigate happened to involve the messaging. Since all messaging activity was based on the generator functions I drilled into that core area in more depth as well.

Message Generator and CPU Consumption

The improvements were similar in that the number of message generators being requested fell from about 140M to 100M, yet the CPU increased marginally by some 0.25%, and that had taken place only in the past few weeks of the most recent month. In the prior month, the CPU had actually fallen as expected.

Presenting the data in this manner helped the teams to focus more on the code itself and other potential areas of the code base. Routine changes to maintenance were ruled out since not all the functions were being impacted. After a careful review it was revealed that a core function relied on by many of the higher functions was indeed the culprit, using a very inefficient algorithm to manage work and requests. It was re-written and deployed, resulting in the favorable decreases in CPU consumption which were originally planned.

This is obviously a very simple example. However, it demonstrates that objective measurement and reviewing an area of concern from a more holistic view that incorporates both functions and detailed data can help to provide additional perspectives on problems affecting a company's application health. This matters especially in today's economy, where improving cost containment while increasing business value is critical to a company's bottom line.