One of the questions I get asked by potential customers is, “How Big is Big Data? Is it hundreds of gigabytes, terabytes or even petabytes?” I’m sure by now most of you are familiar with the three V’s of Big Data (Volume, Velocity and Variety, in case you weren’t), but most often, and understandably so, Big Data is treated as synonymous with volume. Even though quantity is an attribute of Big Data, it isn’t necessarily the most challenging or interesting aspect. Volume is the most tangible of these attributes, but like so many other metrics in high-tech, it is a relative term and is continually being revised upwards. It’s amazing how our perception of data quantity has changed over time. You need only look back a decade to see that tens of gigabytes would then have been considered Big Data.
Managing the ever-growing quantity of data is a challenge in and of itself, but storage solutions are prevalent and improving. The real challenge lies not in how much data you have, but in how well you are leveraging it to learn, improve and, in turn, grow your business. I’ve noticed that many companies do some sort of analytics, mostly to corroborate what they already know, such as their customers’ purchasing behaviors: the obvious “analytics.” The data isn’t telling them anything new; it merely confirms what they already suspect. This sort of analytics is important from planning, logistics and operational perspectives and can help you streamline essential business processes. However, it is the non-obvious artifacts in your data that could prove to be a potential goldmine. Here’s where I feel your choice of analytic tools may be holding you back.
Now I don’t believe in analyzing data for the sake of it, because if you can’t tie it back to a process that assists your business, then it’s not worth it. Having helped improve several business processes, I make it my main objective to discover hidden relationships and then leverage them to better some aspect of the business, such as more effective targeted marketing, reduced inventory or more accurate forecasting. I would be remiss if I led the reader to believe that this is a trivial process. It isn’t. Discovering something new rarely is, but it can be rewarding.
This brings me back to my earlier statement about having the right set of tools. Although the current breed of analytic tools is good, one problem I often hear about from customers is that many of these tools are not interactive enough at their (Big) data volumes. Furthermore, data discovery is an iterative process, and it can be time consuming. Since you don’t know what you’re looking for, or only have a vague hypothesis, it is imperative that you prove or disprove each hypothesis quickly and move on. The productivity of your data scientists would improve manyfold if results came back in seconds instead of hours or, in some complex cases, days! The tools should help productivity, not get in the way.
Relational databases are great at transaction processing and I doubt they’re going away anytime soon, but when it comes to mining relationships, their rigidity can be detrimental to data discovery. Without getting into the mechanics of it, looking for hidden relationships requires complex queries that may need to examine information from several different data sources. Not surprisingly, the performance of these queries depends on how the data is laid out. In other words, you need some knowledge of the database schema if you want these queries to perform well. And since you really don’t know what you’re looking for, it is very likely that you don’t have the most efficient schema for it. Designing a new schema is not a trivial task either and can take several weeks. The structure of your database shouldn’t dictate expressivity or what you can ask of it. Not that any platform is limitation-free, but such fundamental obstacles make the difficult task of data discovery even more painful, because now you also have to navigate the limitations of your database.
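To make the point about query complexity a little more concrete, here is a minimal sketch in Python using the standard `sqlite3` module. Everything in it is an illustrative assumption rather than anything taken from a real customer system: the `knows` table, the sample names, and the helpers `build_demo_db`, `hop_query` and `reachable_in` are all hypothetical. It only shows that, in a plain relational layout, following a relationship one step further means adding one more self-join to the query.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory table of who-knows-whom links (made-up data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE knows (src TEXT, dst TEXT)")
    conn.executemany(
        "INSERT INTO knows VALUES (?, ?)",
        [("ann", "bob"), ("bob", "cat"), ("cat", "dan"), ("ann", "eve")],
    )
    return conn


def hop_query(hops):
    """Build SQL that finds everyone exactly `hops` links away from a start person.

    Each extra hop adds one more self-join on the same table.
    """
    joins = "".join(
        f" JOIN knows k{i} ON k{i}.src = k{i - 1}.dst" for i in range(2, hops + 1)
    )
    return f"SELECT DISTINCT k{hops}.dst FROM knows k1{joins} WHERE k1.src = ?"


def reachable_in(conn, start, hops):
    """Run the generated query and return the matching names, sorted."""
    rows = conn.execute(hop_query(hops), (start,)).fetchall()
    return sorted(row[0] for row in rows)


if __name__ == "__main__":
    conn = build_demo_db()
    for hops in (1, 2, 3):
        print(hops, reachable_in(conn, "ann", hops))
        print("   ", hop_query(hops))
```

With only four rows this runs instantly, but the generated SQL grows by one join per hop, and at real data volumes how well those joins perform depends on the indexes and layout you happened to choose in advance.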
Now if you want to, you can force a square peg through a round hole with a big enough sledgehammer, but the question is, why? I’m a firm believer in using the right tool for the right purpose, and no matter how fond I am of relational databases, I’ve learnt from practice that relationship mining is not their forte.
So if you’re a fan of the rule of 3’s as I am, these are the attributes you should consider in your next analytics platform:
- Ability to handle the ever-growing volume of data
- Interactivity at Big Data volumes, without having to work around the limitations of the tools
- Ease of relationship discovery
In other words, your analytics platform has to be scalable, quick and productive, and it has to make the task of understanding your data easier.