
SAS: Responsible Use of Big Data: Evaluating New Data Sources

Posted: 3 June 2014 | Author: Kelly McGuire

At the beginning of the year, I released my 14 actions for 2014.  I outlined a list of actions that hotels can and should take right now to ensure they are set up for success in the years to come.  Action #4 cautioned analytic hospitality executives to carefully evaluate new data sources.  I thought this action in particular was worth some additional discussion.

In this “big data” era, new data sources are cropping up every day, from internal sources and third-party data resellers. With all of this activity, plus constant messages from big data vendors and technology experts about the value of capturing everything you can get your hands on (I recognize that I am part of this as well, of course), it’s tempting to think that you can just shove all of that new data into a database and you’re good to go. Regardless of how inexpensive storage space is getting and how fast processing is becoming, capturing, storing and analyzing data still takes resources, both technology and human capital. Further, the wrong kind of data, used in the wrong way, will simply add overhead and noise to your analysis, rather than providing any additional insight.

There are myriad detailed technical and analytical methodologies for assessing and transforming data to make it useful for reporting and analysis, which I won’t go into here. In this post, I will provide some business-oriented suggestions for how to think about a new data source, and discuss potential problems that could arise from throwing too much data at a problem.

As I’ve said many times before, the first important step in evaluating a potential new data source is to determine what business value you will gain from accessing that data. You should clearly and specifically define not just the insight you expect to gain from that data source, but also who will benefit from that insight and how the company will take action. Assess how the data could contribute to an existing business analysis, improve a decision-making process, or help you gain new insight. Knowing the “fit” at the level of business value will help you justify the investment in acquisition.

Once you understand the potential business value, you need to be sure the data can actually deliver. The second step is to understand the characteristics of the data source. Ask the following questions:

  • What is the data? Make sure that someone in the organization has a clear understanding of the data fields, how they are calculated, what level of detail is available and what they mean. You will also need to understand how this data relates to other data in the organization. For example, if you are looking at time series data, does the level of detail and the intervals match any related sources? Also determine whether the data is unique, or highly related or correlated to another source.
  • How is the data collected? Understanding where the data comes from will give you a sense of how reliable it is. If it is heavily driven by user entry, then you need to assess the business process around the data collection. User-driven data is notoriously unreliable unless it has a tight business process around it.
  • How often is the data updated and how? Your systems will need to be set up to receive and store the data in a timely fashion. If the data comes too fast, and the ETL process takes too long, it might be useless by the time you are able to access it. For example, tweets or geo-location data are stale almost as soon as they are created, so if you aren’t able to process them in time to use them, it’s not worth the trouble. Further, if the data delivery process is unreliable (as in it frequently doesn’t show up, or shows up with missing values, etc.), and you are counting on it for a critical piece of insight, you may want to look elsewhere.
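
For the analysts on your team, the timeliness and reliability questions above can be turned into a quick screening check. The following is only a minimal sketch in Python: the function names, the field names and the sample records are all hypothetical, and the thresholds you apply to the results are a business decision rather than anything built into the code.

```python
from datetime import timedelta

def is_fresh_enough(shelf_life, delivery_lag, etl_duration):
    """True if the data is still useful once delivery and ETL have finished."""
    return delivery_lag + etl_duration < shelf_life

def missing_rate(records, required_fields):
    """Share of records where any required field is absent or empty (None)."""
    if not records:
        return 0.0
    incomplete = sum(
        1 for record in records
        if any(record.get(field) is None for field in required_fields)
    )
    return incomplete / len(records)

# Hypothetical feed: geo-location pings that are only useful for 5 minutes,
# arriving 2 minutes late and taking 10 minutes to load.
geo_feed_usable = is_fresh_enough(
    shelf_life=timedelta(minutes=5),
    delivery_lag=timedelta(minutes=2),
    etl_duration=timedelta(minutes=10),
)

# Hypothetical sample of delivered records, invented for illustration.
sample = [
    {"guest_id": 1, "rate": 120.0},
    {"guest_id": 2, "rate": None},   # value missing
    {"guest_id": 3},                 # field missing entirely
    {"guest_id": 4, "rate": 99.0},
]
sample_missing_rate = missing_rate(sample, ["guest_id", "rate"])

print("geo feed usable:", geo_feed_usable)
print("share of incomplete records:", sample_missing_rate)
```

A feed that fails the freshness check, or that shows a high share of incomplete records over several deliveries, is a candidate for the “look elsewhere” pile.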

Finally, determine whether you will need any additional technology or resources to manage the data source. Unstructured text data can be highly valuable to the organization, but it’s large, and it requires some specialized analytics to interpret. There are also human capital implications for adding new data sources. Do you have enough people available to manipulate and analyze the data so that it can be effectively used by decision makers? Obviously, if you need to make an investment in new technology and new resources, more work is required around my first point: understanding the business value.

If you are just interested in using the new data source for reporting, or descriptive statistics, the previously outlined steps will keep you out of trouble.  Throwing more data at a predictive modeling or forecasting analysis is trickier.  I am going to introduce some statistical concepts that you should be aware of as you are thinking of incorporating more data into an advanced analytic application.

Some of you may be familiar with Occam’s razor. It is a principle of reasoning attributed to the 14th-century philosopher William of Ockham, which basically states that “simpler explanations are, other things being equal, generally better than more complex ones.” Many statisticians follow this guidance, believing that you should always select the simplest hypothesis until simplicity can be traded for predictive power. Occam’s razor cautions us that simply throwing more data at a statistical problem might not necessarily generate a better answer.

In fact, statistical analysis bears this out in some cases.  Note that when I talk about “more data” in the next few paragraphs, I am talking about more “predictor variables” not more observations within the same data set.  Generally speaking, more observations will help to increase the reliability of results, since they will help to detect patterns in the data with greater confidence.

Two different statistical phenomena can occur in predictive analysis with the addition of predictor variables to a model. In both cases, the addition of variables decreases the reliability or predictive power of the model. I’m only going to define them at a very high level here, so that you can verify with your analysts whether there’s a concern. There has been plenty of research on both of these issues, if you want more information.

The first issue to watch out for is multicollinearity. This happens in a multiple regression analysis when two or more predictor variables are highly correlated, and thus do not provide any unique or independent information to the model. Examples of things that tend to be highly correlated could be height and weight, years of education and income, or time spent at work and time spent with family. The real danger in multicollinearity is that it makes the coefficient estimates for the individual predictor variables less reliable. So, if all you care about is the predictions from the model as a whole, it’s not that big of a deal. However, if you care about things like which variable has the biggest impact on overall guest value, or on likelihood to respond, then you do have to watch out for multicollinearity.
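
One common way analysts screen for this is the variance inflation factor (VIF), which measures how much a predictor’s coefficient variance is inflated by its correlation with the other predictors. The sketch below is a simplified illustration in Python, not a full diagnostic: it covers only the two-predictor case, where the VIF reduces to 1 / (1 - r²) with r the correlation between the two predictors. The guest data and variable names are invented for illustration. A frequently quoted rule of thumb treats a VIF above roughly 5 to 10 as a warning sign, though that cutoff is a convention rather than a hard rule.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def vif_two_predictors(xs, ys):
    """Variance inflation factor when the model has exactly two predictors."""
    r = pearson(xs, ys)
    return 1.0 / (1.0 - r * r)

# Hypothetical guest data, invented for illustration only.
nights_stayed = [1, 2, 3, 4, 5, 6, 7, 8]
room_spend = [150, 310, 440, 610, 740, 905, 1040, 1210]  # tracks nights closely
spa_visits = [2, 0, 1, 3, 0, 2, 1, 1]                    # largely unrelated

vif_redundant = vif_two_predictors(nights_stayed, room_spend)
vif_independent = vif_two_predictors(nights_stayed, spa_visits)

print(f"nights vs room spend: VIF = {vif_redundant:.1f}")
print(f"nights vs spa visits: VIF = {vif_independent:.2f}")
```

In this made-up example, room spend adds almost nothing that nights stayed does not already tell the model, so including both would make it hard to say which one is driving guest value. With more than two predictors, the VIF for each variable comes from regressing it on all the others, which is a job for your statistical software.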

The second thing to watch out for is overfitting, which happens when there are too many parameters relative to the number of observations. When this happens, the model ends up describing random error, not the real underlying relationships. Every data sample has some noise, so if you try to drive out too much error, you become very good at modeling the past, but bad at predicting the future. This is the biggest danger of overfitting a model. It is particularly problematic in machine learning algorithms, or really any models that learn over time (like revenue management forecasting, for example).
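
To make this concrete, here is a small illustrative sketch in Python. It assumes a made-up relationship (y = 2x + 1) with hand-picked noise, and compares a two-parameter straight line against a polynomial that has as many parameters as there are observations, so it passes through every historical point exactly. All of the numbers and function names are hypothetical and exist only to show the pattern.

```python
def fit_line(xs, ys):
    """Ordinary least-squares straight line (two parameters)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_interpolant(xs, ys):
    """Lagrange polynomial through every point (as many parameters as points)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total
    return predict

def mse(model, xs, ys):
    """Mean squared error of a model on a set of observations."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical history: y = 2x + 1 plus hand-picked noise.
train_x = [0, 1, 2, 3, 4, 5]
train_y = [1.5, 2.6, 5.3, 6.4, 9.4, 10.7]
# Hypothetical new observations from the same relationship.
test_x = [0.5, 1.5, 2.5, 3.5, 4.5]
test_y = [1.8, 4.3, 5.9, 8.2, 9.7]

simple = fit_line(train_x, train_y)
flexible = fit_interpolant(train_x, train_y)

for name, model in [("straight line", simple), ("6-parameter polynomial", flexible)]:
    print(f"{name}: past error = {mse(model, train_x, train_y):.4f}, "
          f"new-data error = {mse(model, test_x, test_y):.4f}")
```

The flexible model reproduces the past perfectly because it has memorized the noise, and it pays for that on the new observations, where the simple line does better. This is why analysts hold back some data for testing rather than judging a model only on how well it fits history.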

So, what is the bottom line here?  Don’t assume more is better, prove it!
