Careful: Your big data analytics may be polluted by data scientist bias


Credit: pzAxe/Shutterstock

Expectations surrounding the future of big data range from the merely huge to the absolutely enormous – a reflection, perhaps, of both its real potential and the massive hype. There is no dispute, though, that companies can reap big benefits from exploring patterns in the data they already generate and collect. Further, depending on the algorithms used, machine learning can even serve as a real-world crystal ball. There are countless examples, but two striking ones are the story of Target’s ability to predict pregnancies by analyzing customer consumption patterns and well-known mathematician Nate Silver’s correct prediction of the winner in all 50 states in last November’s presidential election.

But big data can only ever be as good as the machine learning used to provide insight, and even the most sophisticated machine learning techniques aren’t omniscient – the old adage “garbage in, garbage out” sums up the problem. Businesses planning to invest in big data science, in the hope of reaping the wealth of insights available, must at all costs avoid introducing bias into the process – or risk jeopardizing everything.

Data bias syndrome

Data bias comes in many forms. It can come from poorly defined business domain objectives. It can come from opting to gather data that are easy to collect rather than data that are most informative. Data scientists can also receive data that have already been biased by the domain experts’ incorrect assumptions. (As a footnote, the recent austerity economics Excel scandal shows how a minute data error can have cascading and devastating effects.)

Likewise, data scientists themselves are not immune to bias. Some run afoul of their own preconceived notions about the business domain – too much knowledge can cause one to filter out data that may actually be helpful. Scientists with deep experience in a particular data set may come to rely too heavily on pre-existing algorithms without re-examining their validity for a particular use case.

Finally, data quantity is a common problem. Intelligent learning requires abundant data, and often the data available are not sufficient to draw accurate conclusions – a problem known as data sparsity. This may sound unbelievable considering that data volume is doubling every two years, according to an EMC study, but there’s a difference between a dense data set populated by similar data points and the far more diverse sets of user data points we find in the real world. In these cases, the gaps in the data are filled by machine learning algorithms that may be inherently biased by the assumptions the data scientist made when designing them. The trick is to find the right balance between unbiased data exploration and data exploitation.
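To make the sparsity point concrete, here is a minimal Python sketch. The ratings table, the function names and the two fill-in rules are all hypothetical, chosen only for illustration. It measures how sparse a small user-by-product table is, then shows how the assumption used to fill the gaps changes the answer.

```python
# Hypothetical ratings: rows are users, columns are products.
# None marks a value that was never observed.
ratings = [
    [5, None, None, 1],
    [None, 4, None, None],
    [None, None, None, 2],
]


def sparsity(matrix):
    """Fraction of cells with no observed value."""
    cells = [v for row in matrix for v in row]
    return sum(v is None for v in cells) / len(cells)


def column_mean(matrix, col, fill):
    """Mean of one column after replacing missing values with `fill`."""
    values = [row[col] if row[col] is not None else fill for row in matrix]
    return sum(values) / len(values)


# Average of the values that were actually observed.
observed = [v for row in ratings for v in row if v is not None]
global_mean = sum(observed) / len(observed)

print(f"sparsity: {sparsity(ratings):.2f}")
# Assumption A (illustrative): a missing rating means "no interest", so fill with 0.
print(f"product 0, gaps filled with 0: {column_mean(ratings, 0, 0):.2f}")
# Assumption B (illustrative): a missing rating looks like the average rating.
print(f"product 0, gaps filled with global mean: {column_mean(ratings, 0, global_mean):.2f}")
```

Both fill-in rules are defensible, yet they give noticeably different averages for the same product. The difference comes entirely from the designer's assumption, not from the data.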

Removing bias

As companies bring data science in-house or purchase tools that act as a data abstraction layer, the need to address data bias becomes more immediate. The smart move is to build bias-quelling tactics into the data science process itself. Here’s how:

  • Employ domain experts. Rely on them to help select relevant data and explore which features, inputs and outputs produce the best results. If heuristics are used to gain insights into smaller data sets, the data scientist should work with the domain expert to test the heuristics and ensure they actually produce better results. Like a pitcher and catcher in a baseball game, they are on the same team, with the same goal, but each brings different skill sets to complementary roles.
  • Look for white spaces. Data scientists who work with one data set for long periods of time risk complacency, making it easier to introduce bias that reinforces preconceived notions. Don’t settle for what you have; instead, look for the “white spaces” in your data sets and search for alternate sources to supplement “sparse data.”
  • Open a feedback loop. This will help data scientists react to changing business requirements with modified models that can be accurately applied to the new business conditions. Applying Lean Startup-like continuous delivery methodologies to your big data approach will help you keep your model fresh.
  • Encourage your data scientists to explore. If you can afford your own team of data scientists, be sure they have the space and autonomy to explore freely. Some equate big data to the solar system, so get out there and explore this uncharted universe!
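As one illustration of testing a heuristic with a domain expert, the Python sketch below compares a rule of thumb against a simple baseline on held-out records. The records, the category names and both prediction rules are hypothetical; this is a sketch of the idea, not a prescribed method.

```python
from collections import Counter

# Hypothetical held-out records: (last category viewed, category actually bought).
held_out = [
    ("shoes", "shoes"),
    ("shoes", "socks"),
    ("hats", "hats"),
    ("socks", "shoes"),
    ("hats", "shoes"),
]

# Hypothetical purchases seen earlier, used only to build the baseline.
training_purchases = ["shoes", "shoes", "hats", "socks", "shoes"]
most_popular = Counter(training_purchases).most_common(1)[0][0]


def hit_rate(predict, records):
    """Share of records where the predicted category matches the purchase."""
    hits = sum(predict(viewed) == bought for viewed, bought in records)
    return hits / len(records)


def baseline(viewed):
    # Ignore the shopper's behavior and always predict the most popular category.
    return most_popular


def heuristic(viewed):
    # Domain expert's rule of thumb: shoppers buy what they last viewed.
    return viewed


print("baseline hit rate:", hit_rate(baseline, held_out))
print("heuristic hit rate:", hit_rate(heuristic, held_out))
```

On this made-up sample the plausible-sounding heuristic does worse than the naive baseline, which is exactly the kind of result a joint check is meant to surface before the rule is built into a model.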

Whatever you do, don’t ignore the issue: The last thing you want to do is implement a system that develops and propagates data, only to learn it’s hopelessly biased. If you don’t solve this problem sooner rather than later, your organization will miss out on what many analysts are calling the next frontier for innovation.

Haowen Chan is currently a principal scientist at Baynote, a provider of personalization solutions for online retailers. Robin D. Morris is a senior data scientist at Baynote; he is also associate adjunct professor in the Department of Applied Math and Statistics at the University of California, Santa Cruz.





Great points – no amount of Sarbanes Oxley will counteract something you aren’t even aware of in your big data analytics observations


I should hope a data scientist would not so recklessly use the term “bias” as the authors.

Carla Gentry

Derek, after 15 years in the business you do form bias but it’s based on experience IMO!

Carla Gentry

But yes, have to agree with you on that one (as far as this “authors” use of it) Thanks!

Robin Morris

Carla – Thanks for your comments. To address your first comment, we agree a data scientist’s job is very difficult (you’re preaching to the choir, here), and that personal domain expertise is usually valid and takes a considerable amount of time to develop. However, data scientists and their companies can benefit from interacting with nontechnical domain experts with years of direct field experience. The title wasn’t meant to say that all data scientists produce bad analysis; just to call attention to a potential problem, since more and more companies are hiring data scientists and incorporating big data analytics into the business process.

Carla Gentry

I think I’d rather have a Data Scientist who might have some bias (based on experience) over a Lay Man who knows nothing about data. It takes years to become an experienced Data Scientist – so was the title just an eye catcher?

Carla Gentry – Data Scientist at Analytical-Solution

Joseph Budner Elad

Excellent article. It highlights why it is crucial to put the analysis and the analytical tools in the hands of the domain experts, and not rely on data scientists alone (and, besides, there are not enough data scientists to do all the analysis).

It is also very important to give the domain experts (or the business analysts, in the case of business problems) capabilities that combine free-form exploration of the data with machine-driven discovery from the data, and that allow the analyst to go back and forth between the two as they gain insight and make new discoveries from their data.
