If you want to dig into big data, the first thing you will hear about is Hadoop, or, if you are after near-realtime processing, Spark or Storm, or any of the other big data frameworks around. They promise to solve all of your problems, but what they don’t tell you is the cost they come with.
The decision whether to introduce big data analytics into your business processes is no longer a yes-or-no question. You have to set up ways to tap into the data your own business creates, and possibly into third-party data as well, such as data from social media networks. The majority of companies have already wrapped their best heads around this, as can be gleaned from this report by KPMG.
It remains a challenge, though, for a larger organization to change its business processes accordingly, maybe even its business model. It begins with the slow process of making your employees aware of collecting relevant data, continues with the actual analysis, and is only finished once measures have been put in place to derive decisions from the results.
Others can better elaborate on how to change company culture and ensure analytics results are fed into the decision-making process. I would like to focus on an integral part that is key to the technological adoption of big data: how to integrate a solution into your existing systems. The above-mentioned report highlights this as one of the major results from the survey:
The vast majority (85 percent) also said they were struggling with implementing the correct solutions to accurately analyze and interpret their existing data. (KPMG report, p. 5)
This is an issue we have encountered many times talking to customers. Most of the time an existing business intelligence system is in place, as well as data warehousing and all kinds of databases. These systems are so vital to everyday business that even a small request such as diverting some of the data flow to another system makes hardcore developers and system administrators flinch.
Considering the even more radical route of implementing a complete system in your own data center quickly turns into a full-blown project. Part of the issue here is that even though all those shiny frameworks promise the fastest way to crunch your data, they all forget to mention something: to make them useful, you will need not only your own Data Scientist, but also an operations expert to run your cluster.
While the operations part is easy to understand, why do you need one of those hard-to-come-by Data Scientists? Because those tools are mostly frameworks: they help you do the calculations very quickly, but they do not provide actual solutions to your common big data issues! To be fair, there are projects that aim to implement machine learning methods and other results from academia. However, they do not give you the solution you might be looking for. An expert is still needed to select the right method, create an implementation that works well with your existing infrastructure, and ensure results are available to you either programmatically or in readable form (reports).
There is such a demand for solving this problem that our friends at The Data Guild have built a whole business around it, beginning with a status-quo analysis and an understanding of your business questions, up to the point of implementing a specific solution. However, they don’t remove the obstacle that is inherently there: you will still end up with a framework-based implementation, though probably with much less pain.
For certain recurring requirements you will find SaaS providers such as New Relic or Sumo Logic that take in your data and provide you with a nice way to analyze it. But analyzing it is still hard work on your side. What all these solutions have in common, with the notable exception of The Data Guild, is that they are very much like a database: you pour your data in and can then program queries to retrieve insights.
Is there a solution? It depends. With streamdrill we offer a different way to solve big data problems, mostly for realtime decisions. But we also take away the pain of having to implement algorithms on top of a framework by providing ready-to-use algorithms, packaged into solutions. That algorithmic approach gives us an advantage as well as a challenge:
Achieving realtime speed requires us to discuss the expected results upfront. There is no ‘just dump it in’ and see what we can find later on.
Dragging the ‘think about it’ part right to the front of the decision-making process helps keep the solution small. Most of the time we can run the whole thing on a laptop. It also allows us to offer a basic solution that is up and running in just a few days (including the adapter necessary to tap into the data), giving us a first look at the properties of the data stream.
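To make the ‘decide upfront, stay small’ idea concrete, here is a minimal sketch in Python of an exponentially decaying counter that tracks which items are currently trending in a stream. This illustrates the general style of realtime algorithm discussed here, not streamdrill’s actual API; the class name, method names, and half-life parameter are all my own illustrative choices. Note how the questions are fixed in advance: you decide what to count and over which time horizon before the data arrives, and in exchange the state stays tiny.

```python
import time


class DecayingCounter:
    """Counts events with exponential decay, so recent activity
    dominates the ranking. Illustrative sketch only."""

    def __init__(self, half_life_seconds=3600.0):
        self.half_life = half_life_seconds
        self.counts = {}  # item -> (score, timestamp of last update)

    def _decay(self, score, last_t, now):
        # Halve the score for every half-life that has elapsed.
        return score * 0.5 ** ((now - last_t) / self.half_life)

    def add(self, item, now=None):
        now = time.time() if now is None else now
        score, last_t = self.counts.get(item, (0.0, now))
        self.counts[item] = (self._decay(score, last_t, now) + 1.0, now)

    def top(self, k, now=None):
        now = time.time() if now is None else now
        ranked = sorted(
            ((self._decay(s, t, now), item)
             for item, (s, t) in self.counts.items()),
            reverse=True)
        return [(item, score) for score, item in ranked[:k]]
```

For example, with a half-life of 60 seconds, an item seen twice at time 0 has score 2.0 immediately and score 1.0 a minute later, so a burst of recent events outranks a larger but older total. The state is one small tuple per item, which is why this kind of solution can run on a laptop instead of a cluster.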
But then why ‘it depends’? Because we are not trying to completely replace existing big data frameworks. They have their uses: if you really do have to do heavy lifting, there is no way around them. One of our customers, for example, is considering using the results from streamdrill as additional parameters for the time-consuming number crunching.
The frameworks provide ways to distribute work. Most of the time that is an unnecessary and costly feature, considered only because the big players use it. If you can avoid it, you might not only get better results, but get them much quicker, too.