Who is Altan Khendup?

A professional technologist who dabbles in innovative and interesting uses of technology, Mongolian history, philosophy, and cooking ethnic foods.

Often described as part philosopher, scholar, technologist, and mentor, Altan likes engaging in stimulating conversations with professionals, tackling problems with technology in a hands-on and collaborative manner, and enjoying the company of good friends and family.

 

Entries in Holistic problem solving (6)

Tuesday
Aug 17, 2010

Diagnosing Complex Applications - Answering the Tough Questions

One question comes up constantly in my professional career, and it typically starts with an organization's leadership asking about a very basic premise: what is going on with their business applications? This seems like a very straightforward question. In truth, however, it is not as easy to answer as one would imagine.

Many companies have grown organically, which, while beneficial, carries operational costs. One of the most challenging is managing complex systems. This includes appraisal and diagnosis, and especially triage when applications critical to the business are having issues.

Almost all of the organizations that I have worked with have the same set of problems:

 

  • Missing or out-of-date metrics. One cannot measure anything if nothing has been defined. This is where most businesses fail. 
  • Threshold Goals. Once metrics have been determined, thresholds have to be defined. These are basic boundaries that determine 3 basic states: healthy, not-so-healthy, and in jeopardy. These can also be characterized as zones: green, yellow and red. These boundaries help establish what the business expects from its applications and operations.
  • Growth expectations. Businesses expect growth. However, most do not want to commit to an expectation that can be turned into a model. This is a tough balancing act, typically around "planned" growth events. In truth, if something does really well, all previous growth projections tend to become irrelevant, since the scale itself changes, say from tens of thousands to millions. Regardless, a growth model needs to be in place.
  • Holistic analysis. Only a very small handful of organizations really see this as a key practice for complex applications and systems. Most think of only a handful of elements, not the system in its entirety. It is absolutely essential to look at the complete spectrum and be able to analyze everything that can impact a business. This means hardware, software, network, web traffic, and user-based activity. All of it.
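The green, yellow and red zones described above can be sketched in a few lines of Python. This is only an illustration: the `Threshold` class, the `classify` function, the metric names and the boundary values are all hypothetical, and the sketch assumes metrics where a higher value is worse. Real boundaries have to come from the business.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Boundaries for one metric where higher values are worse."""
    yellow_at: float  # value at which the metric leaves the green zone
    red_at: float     # value at which the metric enters the red zone


def classify(value: float, threshold: Threshold) -> str:
    """Return the zone ("green", "yellow" or "red") for a measured value."""
    if value >= threshold.red_at:
        return "red"
    if value >= threshold.yellow_at:
        return "yellow"
    return "green"


# Hypothetical metrics and boundaries, for illustration only.
THRESHOLDS = {
    "page_response_seconds": Threshold(yellow_at=2.0, red_at=5.0),
    "cpu_percent": Threshold(yellow_at=70.0, red_at=90.0),
}

if __name__ == "__main__":
    sample = {"page_response_seconds": 3.1, "cpu_percent": 45.0}
    for name, value in sample.items():
        print(name, value, classify(value, THRESHOLDS[name]))
```

Keeping the boundaries in one table, separate from the classification logic, means the business can revise them without anyone touching code.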

 

So why these basics? It actually comes back to my training at Toyota. In order to diagnose what is wrong, you need to know what is normal. So the basic process that I go through includes:

  • Get existing information. Whether it comes from existing tools, logs, or elsewhere, it is important to get what is available.
  • Target data to answer key questions. These data points range in names from Key Performance, to Business Activity, to Business Metrics, etc. Yet their purpose is the same - identifying key elements that the business is looking for to answer their questions.
  • Identify what is missing. Invariably there are elements that are missing. These need to be identified and then tackled in order of precedence.

Following this basic formula, holistic diagnosis and analytics can be automated and evolved over time.
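The "identify what is missing" step lends itself to a small automated check. The sketch below is a minimal illustration, assuming the business has already listed the metrics it needs along with a precedence for each; the `find_gaps` function and every metric name in it are hypothetical.

```python
def find_gaps(required, available):
    """Return the required metrics that are not collected yet.

    required:  dict mapping metric name -> precedence (1 = most important)
    available: iterable of metric names that existing tools already provide
    The result is ordered by precedence, so the first gap is tackled first.
    """
    have = set(available)
    missing = [name for name in required if name not in have]
    return sorted(missing, key=lambda name: (required[name], name))


# Hypothetical inputs, for illustration only.
REQUIRED = {
    "orders_per_minute": 1,
    "login_response_seconds": 2,
    "batch_queue_depth": 3,
}
AVAILABLE = ["cpu_percent", "login_response_seconds"]

if __name__ == "__main__":
    for name in find_gaps(REQUIRED, AVAILABLE):
        print("missing:", name)
```

Rerunning a check like this as new data sources come online is one way the diagnosis can evolve over time.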

So what sorts of scenarios does this cover? Some of the basic ones include:

  • Capacity. The company is going to have some major event and wants to know if they can handle it. This is not just a question about any one part of a complex system but rather about the complete domain itself. Can it handle the extra users? Can it handle the traffic? Can it handle the business transactions? Can it handle the fallout? What is most likely to break? When? Where? How is that handled? All of these smaller questions are wrapped around the initial one.
  • Triage and Diagnosis. Another very common issue is around problems that have impacted a business. Why is X problem happening? What are the symptoms? How are symptoms winnowed to potential causes? How are potential causes vetted to actual causes? What are expected impacts to potential solutions? How fast can potential solutions be turned around? How much can triage address vs. long term care? Being able to effectively manage all aspects of a problem enables the business to rapidly identify barriers to their growth and operations saving money, cutting costs and capitalizing on opportunities.
  • Business opportunities. With all the diagnostics in place, analysis quickly moves into business opportunity analysis. What are customers doing? What are the various business units doing? When are they doing it? Why are they doing it? Is there something we are not doing? Is there something we can do better? Once a business has the ability to look at its system in a holistic manner, all sorts of interesting patterns emerge that are of interest to any business leadership.
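For the capacity scenario, the question "what is most likely to break?" can be approximated by projecting the expected load onto every component and ranking the results. The following is a rough sketch under strong assumptions: load is taken to scale linearly with the number of users, and the component names, capacities and per-user loads are all invented for illustration.

```python
def find_bottlenecks(capacity, load_per_user, expected_users):
    """Rank components by projected utilization, worst first.

    capacity:       dict component -> maximum sustainable load
    load_per_user:  dict component -> load generated by one user
    expected_users: number of users expected during the event
    Returns a list of (component, utilization); above 1.0 means overloaded.
    """
    results = []
    for component, limit in capacity.items():
        projected = load_per_user[component] * expected_users
        results.append((component, projected / limit))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Hypothetical figures, for illustration only.
CAPACITY = {
    "web_requests_per_sec": 500.0,
    "app_transactions_per_sec": 200.0,
    "db_queries_per_sec": 1500.0,
}
LOAD_PER_USER = {
    "web_requests_per_sec": 0.05,
    "app_transactions_per_sec": 0.01,
    "db_queries_per_sec": 0.2,
}

if __name__ == "__main__":
    for component, utilization in find_bottlenecks(CAPACITY, LOAD_PER_USER, 8000):
        print(f"{component}: {utilization:.0%} of capacity")
```

A linear model is the simplest possible growth model. As noted earlier, a runaway success changes the scale entirely, so the per-user figures should be measured again rather than extrapolated.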

A proposal of this sort can seem daunting to a business and in many cases, especially on the operations side of the house, is considered redundant. However, the analytics are not designed to replace existing investments; rather, they are a way to look at what exists and to identify and plug gaps.

For example, many businesses have raw infrastructure data in the form of CPU, memory, disk, network activity, web traffic, etc. However, this data is almost never compared to business application operations, which track groups of activity in relation to one another. I have often asked an operations expert: what is the link traffic for a user who is inquiring about their account? They can give me all sorts of raw data but cannot put it together. If I ask the specific application expert, they can tell me the path but not the application components. If I ask a developer, they can tell me the functionality and application components, but can rarely tell me the actual business case. Seen in this light, the problem is very clear: each domain is responsible for its own area. In most cases, however, there is no one to put them all together.

Once the pieces are put together, a business can actually see, for every business activity, its impact on their technical infrastructure, personnel, and operations. They can then line up important business events such as quarterly close, specific product promotions, and anything else, and get a complete high-level view of what happens to their organization when each occurs: how busy is their application, how many users are actually assisting in the endeavors, are reports being executed before/during/after the event, how many business operations are being executed, what partners are being used the most, etc.
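Putting raw infrastructure data next to business activity is mostly a matter of joining the two on time. A minimal sketch, assuming CPU samples and business activity windows are already available as timestamped records; the `average_cpu_by_activity` function, the record shapes and the sample values are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def average_cpu_by_activity(cpu_samples, activities):
    """Average the CPU readings that fall inside each business activity.

    cpu_samples: list of (timestamp, cpu_percent)
    activities:  list of (activity_name, start, end)
    Returns a dict activity_name -> mean CPU percent during that activity.
    Activities with no samples in their window are left out.
    """
    readings = defaultdict(list)
    for name, start, end in activities:
        for timestamp, cpu in cpu_samples:
            if start <= timestamp <= end:
                readings[name].append(cpu)
    return {name: mean(values) for name, values in readings.items()}


# Hypothetical data, for illustration only.
CPU_SAMPLES = [
    (datetime(2010, 8, 17, 9, 0), 20.0),
    (datetime(2010, 8, 17, 10, 0), 80.0),
    (datetime(2010, 8, 17, 10, 30), 90.0),
    (datetime(2010, 8, 17, 12, 0), 30.0),
]
ACTIVITIES = [
    ("quarterly_close", datetime(2010, 8, 17, 9, 45), datetime(2010, 8, 17, 11, 0)),
    ("account_inquiry", datetime(2010, 8, 17, 11, 30), datetime(2010, 8, 17, 12, 30)),
]

if __name__ == "__main__":
    for name, cpu in average_cpu_by_activity(CPU_SAMPLES, ACTIVITIES).items():
        print(name, cpu)
```

The same join works for memory, disk, network or web traffic. The hard part in practice is getting the business activity windows recorded at all.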

Being able to answer tough questions means diagnosing and analyzing complex applications and systems in a different and more relevant perspective. It also means being open to the idea that while you may have existing tools and perspectives available, it does not mean you can answer tough questions.

Friday
Feb 26, 2010

PeopleSoft Logs - Finding Hidden Nuggets

In almost all shops that I visit for performance or business-operations problems caused by technical issues, log files are an important and valuable source of data and information. Once one gets past the basic questions such as "What are the symptoms?" or "What kind of application are you running?" many details start to emerge. One of the biggest challenges in any technical analysis is large production applications, which run either a) the shop's own frameworks and solutions or b) someone else's frameworks and solutions. Group A usually has lots of internal experts who, while often extremely brilliant, have very little expertise in what happens to their solutions after they have run in production for a while. The benefit of a group A is that its experts are all close by, knowledgeable, and often very open-minded about finding a solution. Group B is the more classic kind, running technologies not of its own creation. They could be running PeopleSoft, Oracle, SAP, IBM, or even some of their own internal solutions mixed with any or all other types of solutions. These complex environments have both internal experts and external experts, in the form of the vendors themselves, to call upon.

In complex system interactions there are many places for useful information to hide. One of these places is the little gems known as log files. Pretty much every application and system under the sun produces them, and they are often captured for analysis. Some of the more advanced shops have their own tools, either developed in-house or acquired, that parse the logs and present information. Others are a little less sophisticated. However, log files are often an excellent way to look at events at various layers of a system to determine whether or not there are problems.

Often times log files are hard to read. This is especially true for those produced by applications from vendors whose formats and data vary wildly from one to another. A PeopleSoft application is no exception. There are several different logs produced that one has to consider for a variety of situations:

  • Web Log. This is the log of the typical web server, whether it be WebSphere or another. It holds the core information for all web interactions, which include both standard users and automated web interactions.
  • PeopleSoft Web Log. These are maintained on the PeopleSoft side usually kept in the webserv directory. It contains the stderr, the stdout and also the application gateway logs which contain web activity for a PeopleSoft application.
  • PeopleSoft middleware logs. These are a collection of logs used to describe what is occurring within the layers of the PeopleSoft application. They include:
    • Application Server Logs - Used to describe what is occurring within the application servers.
    • Tuxedo Logs - Used to describe what the transaction servers are actually doing from a resource and request level.
    • Dump Files - Usually only occurring when something strange has happened, these files can appear usually in the application server logs for extreme system events.
    • REN (Realtime Event Notification) Server Logs - Similar to Application server logs but for REN events.
    • Process Scheduler Logs - A small collection of logs for each part of the process scheduler that describes what is happening at each level.
  • PeopleSoft Application Data. These are the tables and constructs such as the messaging constructs (PSAPMSG*) or the process scheduler (PSPRCS*) that have useful information that may or may not be inside the previously mentioned logs.

The first thing to notice is how many logs there are, and the second is how dispersed they are. They are spread across an entire infrastructure. Another point to consider is that some hint of a problem can often be found within these logs, but only if you set the logging levels to something meaningful. I am not talking about high levels of detail, but even summary information has its limits. Constant logging in a running production environment is typically seen as a "No-No" in terms of performance. However, logging takes little overhead if set properly, and background processes that periodically "clean and archive" the data for processing can minimize the disk worries. The overhead for a PeopleSoft environment is typically anywhere from 5-10%. This has to be weighed against the time it takes to identify a production issue.

In my experience, being proactive means having a good logging level and a means of capturing all that information, analyzing it, and presenting it, so that in addition to all of your existing tools you have a holistic and historical view of what your application is doing relative to the business transactions.
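One simple way to get a combined view out of dispersed logs is to merge them into a single timeline. The sketch below is illustrative only. It assumes every log line starts with a timestamp in the form `YYYY-MM-DD HH:MM:SS` and that each log is already in time order. Real PeopleSoft, Tuxedo and web server logs each have their own formats, so a real version would need one parser per source. The function names and sample lines are hypothetical.

```python
import heapq
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed format; real logs differ


def parse_lines(source, lines):
    """Yield (timestamp, source, message) for each line that can be parsed."""
    for line in lines:
        stamp, message = line[:19], line[20:].rstrip()
        try:
            when = datetime.strptime(stamp, TIMESTAMP_FORMAT)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        yield when, source, message


def merge_timeline(logs):
    """Merge several logs, each already in time order, into one timeline.

    logs: dict mapping a source name to an iterable of log lines
    """
    streams = [parse_lines(source, lines) for source, lines in logs.items()]
    return list(heapq.merge(*streams))


# Hypothetical log lines, for illustration only.
LOGS = {
    "web": [
        "2010-02-26 10:15:02 GET /account",
        "2010-02-26 10:15:09 HTTP 500",
    ],
    "appserver": [
        "2010-02-26 10:15:05 request timed out",
        "not a timestamped line",
    ],
}

if __name__ == "__main__":
    for when, source, message in merge_timeline(LOGS):
        print(when, source, message)
```

Reading the tiers side by side in time order often shows which layer reported trouble first.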

Issues with web-based transactions that use the integration gateway are one kind of problem that requires traversing several different tiers. The reason is simple: the information about the layers of an integration is not all stored in the same place. Each layer has a different story to tell and can be of great help in determining problems. Based on this, I use the following process to look at integration issues:

  • Is the integration synchronous or asynchronous? They have different structures in the database in terms of what to look for.
    • Synchronous integrations are typically not logged unless one sets that up in the PeopleSoft web screens (called PIA). Setting these to at least "Log Header" will store the information in the database in the PSIBLOGHDR and PSPUBCON. Setting it to "Log Detail" adds a lot more overhead and is typically unnecessary unless you want to look at the contents of a message itself.
    • Asynchronous integrations are usually logged. They are stored in different structures such as the PSAPMSGSUBCON.
    • Errors for either kind, if the PeopleSoft developers have put in error handling, will appear in the PSAPMSGPUBERR or PSAPMSGSUBERR. Overall errors will be placed in the PSIBLOGHDR.
  • Next the actual logs may contain information. Usually integration transactions run in their own application servers separate from standard transactions. Why? Because putting them both in a single domain usually overloads the application servers in all but the smallest of shops. Even if they are in the same domain, the logs will still contain additional information about their events.
  • Tuxedo will be able to present information on the messaging as well in terms of requests processed, requests currently waiting and the workload being processed. This is informative in the historical context against business transactions to see if there are any load issues occurring.
  • The gateway itself can produce information about transactions. There may be errors connecting to a service, there may be formatting issues, there may be even Java memory issues for a particular request. These can be seen in the PSIGW for any PeopleSoft integrations.
  • The web server. If there are insufficient memory resources for Java, or other issues, they will appear in these logs.
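The process above can be captured as a small checklist generator, so that whoever is on call walks the tiers in the same order every time. This is only a sketch. The table and log names are copied from the steps above and have not been verified against any particular PeopleSoft release, and the `integration_checklist` function itself is hypothetical.

```python
from typing import List


def integration_checklist(synchronous: bool) -> List[str]:
    """Return the places to check, in order, for an integration issue."""
    if synchronous:
        # Only populated if logging is set to at least "Log Header" in PIA.
        first = "Database: PSIBLOGHDR and PSPUBCON (needs Log Header or higher)"
    else:
        first = "Database: PSAPMSGSUBCON"
    return [
        first,
        "Database errors: PSAPMSGPUBERR, PSAPMSGSUBERR, and PSIBLOGHDR",
        "Application server logs for the integration domain",
        "Tuxedo: requests processed, requests waiting, workload",
        "Integration gateway: PSIGW",
        "Web server logs: Java memory and other issues",
    ]


if __name__ == "__main__":
    for step, place in enumerate(integration_checklist(synchronous=True), start=1):
        print(step, place)
```

Once each step has a script or query behind it, the checklist becomes the skeleton of an automated diagnostic.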

Most of the time getting the basics is pretty easy if you have all the access, scripts, SQL, and information readily at hand. Typically this is not the case. I find that many shops have different people covering different areas, each area can have different requirements, and those requirements may or may not produce the level of data needed to make a determination. All of this takes time. In a dev or QA function, time may be flexible. In production outage situations, time is not on your side.

Having spent a lot of time helping companies solve production issues for their mission-critical applications, I have found that the processes are not really all that different from one application to the next. Typically the challenge is finding the correlated data, piecing it together quickly and efficiently, and having the proper automated tools in place that can quickly answer a question. If people are being used for all issues, the length of time to find a problem is very high and usually not proactive enough to address production problems. Many implementations fail to properly consider the amount of on-going development needed to meet the on-going SLA commitments for their projects.

Good luck on your own implementations!

 

Friday
Jan 29, 2010

Interviewing Questions - Memorization vs. Problem Solving

I am typically asked to interview very senior candidates. This is mostly due to my experience, but also because I tend to be more honest about the challenges being faced in an organization and about the delicate balance of technical, management, and leadership skills and personality that will be necessary for a great organizational fit.

When interviewing candidates for positions I look at it as an investment. I am spending a lot of time, and consequently money, in the process, and the person will be with the organization for a long period of time: potentially up to 5 years if the match is good, longer if the match is great.

Personally and professionally I find very little value in what I call "memorization" questions for technical professionals. Most of the managers I meet ask for a basic set of questions for vetting purposes. However, I see these questions as ineffective and wasteful more than anything else. I care little for someone who has memorized every buzzword or the specific syntax of languages or commands; this is what things like Unix man pages and Google searches are for. Just because someone is good at memorizing things does not make them a good fit for creating solutions.

When interviewing I look for the following:

  • Are they leaders? Senior positions are not followers but rather leaders in their own right. So I ask questions about initiative taking, risk management, and investments.
  • Can they think outside of the box? In almost all of my situations the initiatives being undertaken have not really been done before. As a result, there is no "best practice" solution that someone can look up to solve the problem. People in senior positions have to have the ability to see this situation and move ahead, figuring things out along the way. This includes new approaches, new viewpoints, new technical implementations, etc. It can even be a new way to do something with older approaches.
  • Do they ask for permission rather than for forgiveness? In many cases what I am looking for are the necessary leaders and initiative takers who will sometimes bend rules. Not important ones, mind you, such as regulatory compliance, but rather set-in-stone mindsets that need to be challenged. I want the person to be independent, so my questions along this line probe for that.
  • Can they play well with others? Regardless of how stellar an individual is, they are still working on a team. From a 2-person startup to a 300,000-person Fortune 50 company, the needs are the same. This is very important to determine since prima donnas often derail efforts rather than build them.
  • Do they have what I call the "I am a hammer, everything is a nail" perspective? In my experience specialists are needed for certain specific tasks, but leaders need to be demonstrably better than specialists at cutting across boundaries. Especially in the technical professional ranks, I often come across people who are very good at a certain set of technologies: all Java, all C, all Python, all systems, all Oracle, etc. The questions that I ask determine whether or not they can cross multiple specialties and, using generalist thinking, quickly understand, grasp and apply what they have learned to the problems at hand.

Often times memorizers make very poor fits for effectively solving problems. They wait for requirements as opposed to going out and getting them, they are geniuses with certain sets of tools but horrible when learning new ones, etc. Any organization that is looking to grow needs people who see and build towards the future, not just what is in front of them for the moment.

This process has worked out very well. I have had professionals who did very poorly on the memorization questions and who, at my insistence, were brought back for another session. In many instances, these individuals that I helped to hire turned into excellent and capable contributors and leaders, delivering outstanding value to the organizations that hired them.