What the heck is Hadoop?

By Anthony Cecchini, September 19, 2013

TAKE NOTE (Insights into SAP solutions and Emerging Technology)

Every day, people send 150 billion new email messages. The number of mobile devices already exceeds the world’s population and is growing. With every keystroke and click, we are creating new data at a blistering pace.

This brave new world is a potential treasure trove for data scientists and analysts, who can comb through massive amounts of data for new insights, research breakthroughs, undetected fraud or other yet-to-be-discovered purposes. But it also presents a problem for traditional relational databases and analytics tools, which were not built to handle the volume of data being created. Another challenge is the mix of sources and formats, which include XML, log files, objects, text, binary and more.

“We have a lot of data in structured databases, traditional relational databases now, but we have data coming in from so many sources that trying to categorize that, classify it and get it entered into a traditional database is beyond the scope of our capabilities,” said Jack Collins, director of the Advanced Biomedical Computing Center at the Frederick National Laboratory for Cancer Research.

Enter Apache Hadoop, an open-source, distributed programming framework that relies on parallel processing to store and analyze tremendous amounts of structured and unstructured data. Although Hadoop is far from the only big-data tool, it is one that has generated remarkable buzz and excitement in recent years. And it offers a possible solution for IT leaders who are realizing that they will soon be buried in more data than they can efficiently manage and use.

Why it matters…

Data is the new natural resource, and Hadoop is one of the first enterprise tools that lets us create value from data at this scale. Take the Frederick laboratory, whose databases contain scientific knowledge about cancer genes, including the expression levels of a gene and what chromosome it is on. New projects seek to mine literature, scientific articles, results of clinical trials and adverse-event databases for related or useful connections. Other researchers are exploring whether big-data analysis of patient blogs, Google searches and Twitter feeds can also provide useful correlations.

The fundamentals…

Hadoop evolved out of Google researchers’ work on the MapReduce framework, which Yahoo programmers brought into the open-source Apache environment. Core Hadoop consists of the Hadoop Distributed File System (HDFS) for storage and the MapReduce framework for processing. Queries migrate to the data rather than pulling the data into the analysis, yielding fast load times but potentially slower queries. In addition, writing Hadoop queries demands more programming skill than user-friendly SQL does, so developers have released additional software with colorful names such as Cassandra, HBase, Hive, Pig and ZooKeeper to make it easier to program Hadoop and perform complex analyses.
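To make the MapReduce idea more concrete, here is a minimal single-machine sketch of the three phases (map, shuffle, reduce) applied to a word count. It is plain Python and does not use Hadoop or any of its APIs; the function names map_phase, shuffle, reduce_phase and word_count are made up for this illustration. On a real cluster, the map and reduce steps would run in parallel on the nodes that hold the data.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big insights from data"]))
```

The point of splitting the work this way is that each map call only needs its own line and each reduce call only needs the values for its own key, which is what allows a framework to spread the work across many machines.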

Hadoop appeals to IT leaders because of the improved performance, scalability, flexibility, efficiency, extensibility and fault tolerance it offers, said Glenn Tamkin, a software engineer at the NASA Center for Climate Simulation. Users can simply dump all their data into the framework without taking time to reformat it, which lifts a huge burden.

Skeptical? Look at how the Defense Department is using Hadoop to provide real-time tactical information in support of battlefield missions and intelligence operations. Or the genome sequencing that can now be accomplished in a few minutes instead of hours.

For more information on Hadoop, see the official Apache website: https://hadoop.apache.org/

UNDER DEVELOPMENT (Information for ABAP Developers)

The SAP IDoc Technology

Standard SAP sends and receives data through IDocs built from standard delivered segments, message types and fields. Sometimes these fields are not sufficient for a specific end-to-end business scenario. In such cases, we can extend the standard IDoc by adding new segments with completely new structures. We create a new structure and insert it into the delivered IDoc structure, producing a new IDoc that satisfies the requirement. This new IDoc is called an Extended IDoc.

Q&A (Post your questions to Facebook or Twitter and get them answered)

Q. I have to find out why an Account is missing in SAP. The account should have been migrated by IDoc. I know the IDoc type, segment and the field name for the account number. How can I find the IDoc to see if it was migrated (status 53) or failed (status 51 / 60)?

A. The first place I would start is the sending SAP system. Check whether the IDoc was successfully transferred to the receiver. In the “Sender System”, execute WE02 with the message type and direction 1 (outbound). Try to supply the date on which the account was created, or a range of probable dates.

Once you see that the IDoc is present on the outbound side, check its final status. It should be 12, or at least 03 (with status 03 it is not guaranteed that the IDoc has reached the receiver; it can also be stuck in the SM58 tRFC queue).

Now copy the dates, message type and other details from this IDoc and use them in WE02 on the “Receiver SAP System” to determine whether your IDoc is present. If it is not there, check with the middleware team as to why the IDoc has not reached the receiver, or ask the NetWeaver administrators (BASIS team) for help.

If the IDoc is present, use WE09. Enter the message type, probable dates if possible, the segment name, the field name and the business object value (the account number). Restrict the search to a date range, say one month; if you do not find the IDoc, change the dates and search again. This way you can search iteratively while keeping each search performant.
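The decision path in this answer can be summarized as a small lookup. The sketch below is plain Python, not ABAP, and it calls no SAP API; the function name next_step and the advice strings are hypothetical, written only to restate the steps above. It assumes direction 1 means outbound and direction 2 means inbound, and it covers only the status codes mentioned in this Q&A.

```python
def next_step(direction, status):
    """Suggest the next check for an IDoc, given its direction and status.

    direction: 1 for outbound (sender system), 2 for inbound (receiver system).
    status: two-character IDoc status code as a string, e.g. "03" or "53".
    """
    if direction == 1:
        if status == "12":
            return "Dispatched: look for the IDoc in WE02 on the receiver system."
        if status == "03":
            return "Passed to port: check SM58 for a stuck tRFC call, then the receiver."
        return "Not sent: investigate the outbound IDoc in the sender system."
    if direction == 2:
        if status == "53":
            return "Posted: the account should exist; verify with WE09 by field value."
        if status in ("51", "60"):
            return "Failed: review the error on the inbound IDoc before reprocessing."
        return "Not final: check the inbound IDoc again once processing completes."
    raise ValueError("direction must be 1 (outbound) or 2 (inbound)")


if __name__ == "__main__":
    print(next_step(1, "03"))
    print(next_step(2, "51"))
```

Any other status code falls through to the generic advice for its direction, so the helper is a reading aid for this scenario rather than a complete status reference.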