Everyday Innovation - Using Hadoop with PeopleSoft Part 1
Tuesday, January 18, 2011 at 2:09PM
Over the course of my professional career, it has been very interesting to see how organizations face challenges to their businesses. I typically find myself involved in projects from the technical viewpoint, since technology plays a very important role in today's marketplace. Many organizations are unable to bridge the gap between delivering new, modern offerings and maintaining their legacy assets. This is mostly because the two groups view problems in fundamentally different ways: legacy teams are often heavily grounded in risk-averse, tried-and-true approaches, whereas the other groups tend to be more progressive.
On all of the projects I work on, I apply a mixture of experience and open-mindedness, approaching issues from multiple points of view for the benefit of the organization and its customers. One such endeavor that I have used across multiple companies has been applying Hadoop to legacy applications such as Oracle/PeopleSoft. Hadoop is an excellent platform for data-intensive distributed computing, and of the many legacy applications that large organizations operate, products such as Oracle/PeopleSoft are quite common.
Oracle/PeopleSoft produces a lot of very interesting data, not just in the form of the standard metadata and transactional data, but also operational data such as the logs found in the various tiers. The most obvious challenges for any organization with such data are that it comes in diverse formats, is located on different servers, takes up a lot of space, and is hard to work with. However, the benefits of this data are many and include capacity planning, holistic event correlation, business activity analysis, geolocation analysis, and many others.
In almost every organization, analyzing business data in legacy Oracle/PeopleSoft consists of SQL-oriented operations against production copies. While this is fine for the database itself, the equally valuable logging and operational data is often left behind due to space constraints. This is where Hadoop comes in.
At the beginning of my endeavors with Hadoop, I used the Apache open source version. However, as time progressed, legacy organizations became less comfortable with such implementations. Thankfully, organizations such as Cloudera have emerged that not only provide excellent support for Hadoop, but have also created bundled implementations that can be more easily introduced into companies via support and training programs.
In this first post, I will be demonstrating the power of Hadoop with simple examples based on the Cloudera distribution and Oracle/PeopleSoft CRM 9.
The Hadoop deployment is a development instance running 4 nodes distributed across 4 different data centers in the continental United States. In this particular case, the nodes in the cluster have a total HDFS capacity of some 300GB. This is a very small cluster, with each node having 4 CPUs, 4GB of RAM and only 200GB of disk. Obviously not all of the disk space on the nodes is allocated to Hadoop in this example, but that can be easily adjusted based on the resources available in your organization for the cluster.
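To make the sizing concrete, the figures quoted above can be worked through with a little shell arithmetic. This is only a back-of-the-envelope sketch: the variable names are made up for illustration, and it assumes the 300GB is spread evenly across the four nodes, which the description does not actually state.

```shell
#!/bin/sh
# Figures taken from the cluster description; the even split is an assumption.
nodes=4
disk_per_node_gb=200
hdfs_total_gb=300          # "some 300GB" across the whole cluster

# Average space each node contributes to HDFS.
per_node_gb=$((hdfs_total_gb / nodes))

# Share of each node's disk given to Hadoop.
# Shell arithmetic is integer-only, so 37.5 is truncated to 37.
share_pct=$((per_node_gb * 100 / disk_per_node_gb))

echo "HDFS space per node: ${per_node_gb} GB"
echo "Share of each disk:  about ${share_pct}%"
```

Under that assumption each node contributes about 75GB, a little over a third of its disk, which leaves plenty of headroom to grow the allocation later.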
I will not go into the full details of installing and configuring Hadoop, as this process is covered in great detail at both Cloudera's website and the Apache Hadoop website, depending on your choice of implementation.
After installing, configuring and starting your Hadoop cluster, please ensure that you have access to the commands, as can be seen below...
Hadoop command line interface
There are numerous commands that can be executed. If one has a systems administration background, many of these will be somewhat familiar, with obvious differences due to the distributed nature of Hadoop.
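A quick way to confirm that the commands are reachable from your shell is sketched here. It assumes only that the installation puts the hadoop launcher on the PATH; the hadoop_status variable is a name made up for this example.

```shell
#!/bin/sh
# Confirm the hadoop launcher is reachable before going any further.
if command -v hadoop >/dev/null 2>&1; then
    hadoop_status="found"
    hadoop version      # prints the installed Hadoop version and build details
else
    hadoop_status="missing"
    echo "hadoop is not on the PATH; check the installation or your shell profile" >&2
fi
```

Running hadoop with no arguments at all prints the list of available top-level commands.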
Each of the commands in turn has additional layers of assistance. For example, one of the more common operations involves the Hadoop filesystem...
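As a small illustration of that layered help, the filesystem command can describe its own subcommands. This is a sketch rather than a full tour; the subcommand variable is just a placeholder name, and ls is picked only as a familiar example.

```shell
#!/bin/sh
subcommand="ls"   # any fs subcommand name can be substituted here

if command -v hadoop >/dev/null 2>&1; then
    hadoop fs -help "$subcommand"   # detailed help for one fs subcommand
    hadoop fs -ls /                 # list the root of the distributed file system
else
    echo "hadoop is not on the PATH; nothing to show" >&2
fi
```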
Hadoop FS command
The FS command is a useful one to learn, as it can be used in scripts to deal with basic data elements on the file system. For example, let's take one of the example Hadoop Java programs ...
Example Java program on the file system
As one can see from this brief display, it is a straightforward example of raw Java code.
One of the simpler capabilities of Hadoop is to take any element from the file system, such as this Java program, and interact with it via Hadoop through command line operations.
In the following example, I will place the Java program onto the Hadoop distributed file system (HDFS) via the FS put command ...
Placing the Java program onto HDFS
As can be seen in this example, the HDFS listing indicates that the program called sample.java has been copied into Hadoop.
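For readers without the screenshot to hand, the put and the follow-up listing look roughly like the sketch below. The destination is an assumption: a relative HDFS path is used, which Hadoop resolves against the user's HDFS home directory, and the actual target directory used in this post may differ.

```shell
#!/bin/sh
local_file="sample.java"   # file name from the post, expected in the current directory
hdfs_path="sample.java"    # assumed destination, relative to the HDFS home directory

if command -v hadoop >/dev/null 2>&1 && [ -f "$local_file" ]; then
    # Copy the local file into HDFS; this fails if the destination already exists.
    hadoop fs -put "$local_file" "$hdfs_path"
    # List the new entry to confirm it arrived.
    hadoop fs -ls "$hdfs_path"
else
    echo "skipping: hadoop or $local_file is not available here" >&2
fi
```

The local copy stays where it is; put copies the file rather than moving it.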
While there are additional command line parameters that one can use to verify this operation, another way to look at the operation is via the web interface that comes with the Cloudera distribution.
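One command line check, sketched here under the assumption that the file was placed at a relative HDFS path named sample.java, is to stream the HDFS copy back and compare it with the local original.

```shell
#!/bin/sh
local_file="sample.java"   # local original
hdfs_path="sample.java"    # assumed HDFS location, relative to the HDFS home directory

if command -v hadoop >/dev/null 2>&1 && [ -f "$local_file" ]; then
    # Stream the HDFS copy and compare it byte for byte with the local file.
    if hadoop fs -cat "$hdfs_path" | cmp -s - "$local_file"; then
        echo "HDFS copy matches the local file"
    else
        echo "HDFS copy differs or could not be read" >&2
    fi
else
    echo "skipping check: hadoop or $local_file is not available here" >&2
fi
```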
Cloudera Hadoop Distributed File System Web Interface
As can be seen, the HDFS web interface provides some basic information about the Hadoop cluster such as when it was started, the version, and the capacity of the cluster.
The web interface also provides a way to navigate the file system via the link named "Browse the filesystem". In this case, I followed the links until I reached the directory level that resembles my command line...
Hadoop File System Listing
From this perspective I can see the program I placed onto HDFS via the FS command, which was called sample.java. Selecting that link from this listing gives me more details about the actual item in the directory...
Java code on the HDFS
As can be seen, the web interface actually displays the contents of the file. It also provides some simple commands such as "download the file" and "tail the file". From this perspective I can also see that the example Java program fits into a single Hadoop data block, which has been successfully replicated to all 4 nodes across my network. From that single Hadoop FS put, the data I placed into HDFS has moved across all 4 data centers and is now consumable by Hadoop jobs, all in a matter of a minute or two.
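The same block and replica information is also available from the command line through fsck. The sketch below assumes the file landed in the conventional HDFS home directory, /user/ followed by your login name; adjust the path to wherever the file was actually placed.

```shell
#!/bin/sh
# Assumed absolute HDFS path: /user/<login name>/sample.java
hdfs_path="/user/$(id -un)/sample.java"

if command -v hadoop >/dev/null 2>&1; then
    # Report the file, its blocks, and the datanodes holding each replica.
    hadoop fsck "$hdfs_path" -files -blocks -locations
else
    echo "hadoop is not on the PATH; cannot run fsck" >&2
fi
```

For a file this small the report should show a single block, with one location listed per replica.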
This capability can be used as a simple backup to make data easily replicated and distributed across an organization's infrastructure. However that is by no means the only capability that Hadoop brings to the table.
In my next post, I will demonstrate how, by building on this simple concept, Hadoop can be used to obtain information that answers various questions.

