A Different Approach to Big Data
Big Data projects require a methodical approach. Fortunately, there's a lot of help out there.
- By Dan Kusnetzky
I was speaking with a journalist recently about the challenges enterprises face when they decide to take on a Big Data project in the hopes of learning more about their customers, business operations or some other topic. Big Data projects are based upon a different concept than traditional Business Intelligence (BI).
BI projects usually begin when an enterprise realizes that something is going on that needs to be tracked, analyzed and managed. For the most part, the enterprise knows what needs to be tracked, how to analyze the data and how to report on that analysis so it can make more informed business decisions.
Big Data projects are different. They typically begin when the enterprise observes customer, partner, supplier or industry behavior that it doesn't understand; often, it doesn't even know the proper questions to ask.
A retail company might start wondering why some of its sales locations are performing poorly. The possible causes of the decline are vast. It's wise to begin looking at operations, point-of-sale, weather, economic and other types of data; but it isn't clear what questions should be asked. Is a specific store's poor sales showing due to its location, the hours it's open, the mix of products it offers -- or is it something else, like the weather?
A healthcare company might start wondering why certain classes of patients do well after treatment and others don't. A large number of factors contribute to that outcome, and it isn't clear which ones matter. What a person eats, drinks or does on a daily basis can make all the difference.
Elements of a Big Data Solution
Here are a few elements of a Big Data solution that must be considered if the project is to produce useful results, and not just well-processed, well-reported data that carries the guise of credibility but is really warmed-over garbage.
Types of Data
A Big Data project needs to be able to gobble up static objects and files such as documents, spreadsheets and even presentations. It must be able to consume ongoing streams of data coming from POS devices, smartphones, tablets and many other types of intelligent devices. Social media data must be harvested from sources such as LinkedIn and Twitter. Don't forget data found on Web pages, manufacturing data from the enterprise's own systems, and data from external sources ranging from stock reporting systems to weather services.
Transformations, Normalization and Other Magic
Data items coming from different sources are likely to be in formats designed to satisfy the needs of the original application, not the use intended by the Big Data application. You'd be amazed at how many different formats data items such as ID numbers, currency amounts and dates can arrive in. All data items must be transformed from their current format into one designed for Big Data analysis. Knowing what to transform, and what the proper final form should be, can be a major challenge all by itself.
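To make the transformation problem concrete, here's a minimal Python sketch that normalizes currency strings and customer IDs into a single canonical form. The input formats, heuristics and function names are hypothetical illustrations, not part of any particular Big Data toolchain; a real pipeline would handle many more cases.

```python
import re
from decimal import Decimal

def normalize_currency(raw: str) -> Decimal:
    """Convert assorted currency strings (e.g. '$1,234.56' or
    '1.234,56 EUR') to a Decimal in one canonical form."""
    cleaned = re.sub(r"[^\d.,\-]", "", raw.strip())
    if "," in cleaned and "." in cleaned:
        # Whichever separator comes last is the decimal mark.
        if cleaned.rfind(",") > cleaned.rfind("."):
            cleaned = cleaned.replace(".", "").replace(",", ".")
        else:
            cleaned = cleaned.replace(",", "")
    elif "," in cleaned:
        # A lone comma followed by exactly two digits is a decimal mark;
        # otherwise commas are thousands separators.
        head, _, tail = cleaned.rpartition(",")
        if len(tail) == 2:
            cleaned = head.replace(",", "") + "." + tail
        else:
            cleaned = cleaned.replace(",", "")
    return Decimal(cleaned)

def normalize_id(raw: str) -> str:
    """Reduce IDs like 'cust-00042', 'C 42' or '0042' to one canonical form."""
    digits = re.sub(r"\D", "", raw)
    return digits.lstrip("0") or "0"
```

Even this toy example shows why the transformation step is hard: the "right" canonical form depends on how the Big Data application will use the data, not on how the source system stored it.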
Modeling, Frameworks and Simulations
There are an amazing number of modeling, data-framework and simulation tools available today. One list I've read includes tools for regression, neural networks, data clustering, decision trees and more. Selecting the proper tool for each type of data is critical.
Once the proper data is selected, formatted, transformed and analyzed, systems can offer predictions about what will come next. Since these predictions come from models, they must be tested and measured against real-world data. Eventually, a model will emerge that provides useful guidance.
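As a toy illustration of testing a model against real-world data, here's a pure-Python ordinary-least-squares fit trained on one slice of hypothetical store data (daily foot traffic vs. sales) and then measured against a held-out slice. The data and variable names are invented for the example; a real project would use far richer models and far more data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    return a, my - a * mx

def mean_abs_error(xs, ys, a, b):
    """Average absolute prediction error on held-out data."""
    return sum(abs(y - (a * x + b)) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical training slice: daily foot traffic vs. sales.
train_x, train_y = [10, 20, 30, 40], [55, 105, 155, 205]
# Held-out "real world" observations the model never saw.
test_x, test_y = [25, 35], [130, 180]

a, b = fit_line(train_x, train_y)
err = mean_abs_error(test_x, test_y, a, b)
```

The point of the held-out slice is exactly the testing step described above: a model that fits its training data but scores a large error on fresh observations hasn't yet earned the right to guide decisions.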
Reporting and Visualization
Most people are not very good at teasing out important insights from massive columns of numbers. They are pretty good at looking at a figure or graph and spotting something that sticks out as different. Presenting the information in the proper form can make the difference between mounds of useless data and something offering quick, clear insight.
Dan's Take: How Long Is a Piece of String?
Our conversation went on for about an hour, and I believe I finally was able to help the journalist understand the complexities hidden in even the simplest Big Data project. Getting to the proper questions so that good answers can be developed often requires different skills and expertise than the enterprise has on hand.
When asked what tools would be the best to consider, I had to point out that the question is very much like asking "How long is a piece of string?" The answer depends on which string is in question. The best tools depend on what the enterprise is trying to do.
Some typical approaches used include:
- Apache Hadoop, or a commercial distribution such as Cloudera, Hortonworks or MapR. These tools can process huge amounts of data using distributed processing and storage.
- A flexible, distributed data store. These can include a NoSQL database such as MongoDB, CouchDB, Apache Cassandra and so on.
- Analytical tools. This area is exploding. Cloudera, Hortonworks, MapR and several others come to mind. Others such as SparkCognition have jumped into the competition with tools based upon machine learning and machine intelligence.
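The steps above can be sketched as a minimal pure-Python version of the map/shuffle/reduce pattern that Hadoop-class platforms implement at scale, assuming a trivial word-count job over two data splits. In a real cluster the phases run in parallel across many machines; here they run sequentially just to show the shape of the computation.

```python
from collections import defaultdict

def map_phase(chunk):
    """Mapper: emit (word, 1) pairs for one split of the data set."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group intermediate pairs by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the counts for each key."""
    return {key: sum(values) for key, values in groups.items()}

# Stand-ins for splits that would live on different cluster nodes.
chunks = ["big data big questions", "big answers"]
pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(pairs))
```

Because each mapper sees only its own split and each reducer sees only one key's values, the same program scales from two strings on a laptop to petabytes on a cluster, which is the core appeal of this class of tools.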
What are the best tools and approaches for your Big Data project? I'd suggest speaking with colleagues who have already delivered a successful project to learn what they did and what they learned. It would be good to hear about false starts and failed projects, too, if they're willing to share. It would also be worth speaking with your preferred technology suppliers; most of them offer packaged tools, systems and professional services.
Daniel Kusnetzky, a reformed software engineer and product manager, founded Kusnetzky Group LLC in 2006. He's literally written the book on virtualization and often comments on cloud computing, mobility and systems software. He has been a business unit manager at a hardware company and head of corporate marketing and strategy at a software company.