Sure, more data scientists would be great. But Scott Brave, of Baynote, says the better solution is to create analytics products that are so easy to use that you don’t even need a data scientist.

photo: Sergey Nivens/Shutterstock.com

Virtually every article about big data today turns to the notion that the country is suffering from a crucial shortage of data scientists. A much-talked-about 2011 McKinsey & Co. survey pointed out that many organizations lack both the skilled personnel needed to mine big data for insights and the structures and incentives required to use big data to make informed decisions and act on them.

What seems to be missing from all of these discussions, though, is a dialogue about how to steer around this bottleneck and make big data directly accessible to business leaders. We have done it before in the software industry, and we can do it again.

To accomplish this goal, it’s helpful to understand the data scientist’s role in big data. Currently, big data is a melting pot of distributed data architectures and tools like Hadoop, NoSQL, Hive and R. In this highly technical environment, data scientists serve as the gatekeepers and mediators between these systems and the people who run the business – the domain experts.

While difficult to generalize, there are three main roles served by the data scientist: data architecture, machine learning, and analytics. While these roles are important, the fact is that not every company actually needs a highly specialized data team of the sort you’d find at Google or Facebook. The solution then lies in creating fit-to-purpose products and solutions that abstract away as much of the technical complexity as possible, so that the power of big data can be put into the hands of business users.

By way of example, think back to the web content management revolution at the turn of the century. Websites were all the rage, but the domain experts were continually banging their heads against the wall – we had an IT bottleneck. Every new piece of content had to be scheduled and sometimes hard-coded by the IT elite. So how was it resolved? We generalized and abstracted the basic needs into web content management systems and made them easy for non-techies to use. As long as you didn’t need anything too crazy, the problem was solved easily, and the bottleneck averted.

Let’s dig a little deeper into the three main roles of today’s data scientist, using online commerce as a backdrop.

Data Architecture

The key to reducing complexity is to limit scope. Nearly every ecommerce business is interested in capturing user behavior – engagements, purchases, offline transactions and social data – and almost every one of them has a catalog and customer profiles.

Limiting scope to this basic functionality would allow us to create templates for the standard data inputs, making both data capture and connecting the pipes much simpler. We’d also need to find meaningful ways to package the different data architectures and tools, which currently include Hadoop, HBase, Hive, Pig, Cassandra and Mahout. These packages should be fit for purpose. It comes down to the 80/20 rule: 80 percent of big data use cases – which is all most ecommerce businesses need – can be achieved with 20 percent of the effort and technology.
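As a sketch of what such input templates might look like, here is a minimal Python version. The event types, field names, and `validate` helper are hypothetical illustrations, not any particular product’s API:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical templates for the standard ecommerce inputs the article
# lists: user engagements and purchases (catalog and profile records
# would follow the same pattern).

@dataclass
class Engagement:
    user_id: str
    item_id: str
    action: str            # e.g. "view", "add_to_cart", "share"
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

@dataclass
class Purchase:
    user_id: str
    item_id: str
    price: float
    channel: str = "online"   # "online" or "offline"

def validate(event: Engagement) -> bool:
    """A template guarantees downstream pipelines a known shape."""
    return bool(event.user_id and event.item_id and event.action)
```

Because every event arrives in a known shape, the plumbing between capture and the Hadoop-style backend becomes configuration rather than custom engineering.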

Machine Learning

Surely we need data scientists in machine learning, right? Well, if you have very customized needs, perhaps. But most of the standard challenges that require big data, like recommendation engines and personalization systems, can be abstracted out. For example, a large part of the job of a data scientist is crafting “features,” which are meaningful combinations of input data that make machine learning effective. As much as we’d like to think that all data scientists have to do is plug data into the machine and hit “go,” the reality is people need to help the machine by giving it useful ways of looking at the world.

On a per-domain basis, however, feature creation could be templatized, too. Every commerce site has a notion of buy flow and user segmentation, for example. What if domain experts could directly encode their ideas and representations of their domains into the system, bypassing the data scientist as middleman and translator?
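A minimal sketch of how such feature templates might be expressed, assuming a hypothetical `FEATURE_TEMPLATES` registry that a domain expert could edit directly without touching the learning code:

```python
from typing import Callable

FeatureFn = Callable[[list], float]

# Each entry is a named rule over a user's raw event history; the
# names and logic here are illustrative, not a real product's schema.
FEATURE_TEMPLATES: dict[str, FeatureFn] = {
    # "Buy flow" depth: how far into view -> cart -> purchase a user got.
    "buy_flow_depth": lambda events: float(
        max((["view", "add_to_cart", "purchase"].index(e["action"]) + 1
             for e in events
             if e["action"] in ("view", "add_to_cart", "purchase")),
            default=0)),
    # Simple segmentation signal: share of events that are purchases.
    "purchase_rate": lambda events: (
        sum(e["action"] == "purchase" for e in events) / len(events)
        if events else 0.0),
}

def featurize(events: list) -> dict[str, float]:
    """Turn raw events into the feature vector the model consumes."""
    return {name: fn(events) for name, fn in FEATURE_TEMPLATES.items()}
```

The machine learning pipeline only ever sees the output of `featurize`, so a domain expert can add or adjust a rule without understanding the model behind it.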

Analytics

It’s never easy to automatically surface the most valuable insights from data. There are ways, however, to provide domain-specific lenses that allow business experts to experiment – much like a data scientist would. This seems to be the easiest problem to solve, as a variety of domain-specific analytics products are already on the market.

But these products are still more constrained and less accessible to domain experts than they could be. There is definitely room for a friendlier interface. We also need to take into consideration how the machine learns from the results that analytics deliver. This is the critical feedback loop, and business experts want to provide modifications into that loop. This is another opportunity to provide a templatized interface.
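One way such a feedback loop might look in miniature – all names, scores, and numbers here are illustrative, and a real system would learn from far richer signals:

```python
# Minimal sketch of the loop described above: the system ranks items
# from learned scores, a business expert injects a modification (here,
# a boost for a promoted category), and observed outcomes flow back in
# as new training signal.

learned_scores = {"sku1": 0.9, "sku2": 0.7, "sku3": 0.4}
item_category = {"sku1": "shoes", "sku2": "hats", "sku3": "hats"}

# Expert-provided modification to the loop, no data scientist required.
business_boosts = {"hats": 0.25}

def rank(scores, categories, boosts):
    """Order items by learned score plus any expert-supplied boost."""
    adjusted = {item: s + boosts.get(categories[item], 0.0)
                for item, s in scores.items()}
    return sorted(adjusted, key=adjusted.get, reverse=True)

def record_feedback(scores, clicks, lr=0.1):
    """Nudge scores toward what users actually clicked."""
    for item in scores:
        target = 1.0 if item in clicks else 0.0
        scores[item] += lr * (target - scores[item])
    return scores
```

The point of the template is the separation of concerns: the expert edits `business_boosts`, while `record_feedback` keeps the machine learning from being frozen out of the loop.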

As we learned in the CMS space, these solutions won’t solve every problem every time. But applying a technology solution to the broader set of data issues will relieve the data scientist bottleneck. Once domain experts are able to work directly with machine learning systems, we may enter a new age of big data where we learn from each other. Maybe then, big data will actually solve more problems than it creates.

Scott Brave is co-founder and CTO of Baynote, an e-tail and e-commerce advisory business. He is also an editor of the “International Journal of Human-Computer Studies” (Amsterdam: Elsevier) and co-author of “Wired for Speech: How Voice Activates and Advances the Human-Computer Relationship” (Cambridge, MA: MIT Press).



  1. thx for composing but this is axiomatic !

  2. Palantir Technologies. The solution is already out there, just not applied in this market.

    1. Cory, Palantir Technologies is impressive, but please don’t shill – or perhaps you’re just an enthusiastic forward deployed engineer ;)

      1. I shill you not ;). Take it as you will.

  3. exactly

  4. Interesting insights!

    In addition to developing data scientists, we need to make analyzing data and deriving insights a way of life. It starts with whoever is in charge asking a simple question:

    “Where is the data that supports what you propose and say?”

    If all leaders – from board members and senior management to line managers – persistently ask the above simple question all the time, we will very quickly build a culture of analytics and fact-based decision making, similar to what great scientists have done in the past and still do. : )

  5. Exactly Scott – while I’m a huge fan of Russell Jurney’s ‘agile data’ concept (http://www.slideshare.net/russell_jurney/hortonworks-roadshow), how many companies have the capacity/knowledge/finance to even consider hiring a team of Data Scientists to answer what are likely relatively ‘basic’ business intelligence questions?

    What’s much more likely to happen is for a company’s existing Business Analysts to be trained up on a set of tools, such as Platfora, Datameer, Karmasphere or Pentaho’s Instaview, or even better – continue using the tools they know (SAS, R) through transparent connectors to Hadoop (RHADOOP). Odds are they will be able to deliver insight that in 99% of the cases will be ‘good enough’.

    The 1% left likely has custom needs, domain-specific questions, requirements for specialized hardware/software and are likely to ‘roll their own’ regardless.

    1. True. But why not also package together some best practices on a per-domain basis?

  6. Dhiraj Kumar (MBA, PMP,MIS, TOGAF, ITIL) Sunday, December 23, 2012

    Agree !

  7. Robert Hawkins Sunday, December 23, 2012

    There are two ideas that have been around for years, and I think they apply here. One is that we are drowning in data, i.e., producing data, information and reports that no one wants or reads. We are actually using more copy and printer paper than ever before. The other is that if you torture the data long enough, it will confess. People often take data or information and study and twist it to mean what they want, while conveniently leaving out parts of it to make their case.

    1. The David Ogilvy analogy comes to mind: the drunken sailor who uses research (we now call it data) as a lamp post to lean against so as not to fall, rather than looking ahead to move confidently forward.

  8. Obviously there are enough motivations and benefits to make big data easy to use. However, using the web content management tool as an analogy to big data is problematic: the quality of web content can be assessed by almost everyone, but the quality of a big data product must be explained properly by a scientific mind. An untrained person is usually confused or misled by the metrics, which leads to costly wrong decisions.

    Well, don’t get me wrong. I am not saying that data scientists are the smartest and others are stupid. The world of big data is like the parable of the “blind men and the elephant”. We are all blind, and the only way we learn is to read the numbers. If what you read is the tail, you may think the elephant is like a rope. That’s the cruel nature of data science, and it is why scientific minds are required.

  9. I do not think that data scientists are domain experts. Data Scientists and domain experts need to work together to build realistic models that are self-learning and constantly changing as and when business reality changes.

    1. I agree. But one of the issues with big data is that, in a perverse way, it is sometimes bundled in ways that make it too linear, and a lot of the insights that could have been derived simply aren’t.

    2. You do realize that the majority of machine learning techniques can be applied with no knowledge of the domain? In fact, if you analyze the output of Kaggle competitions you will see that domain knowledge is not required, as discussed here in October: http://gigaom.com/data/why-becoming-a-data-scientist-might-be-easier-than-you-think/

      1. No. Expressing your real-world situation as a machine learning problem requires both domain specific knowledge and an understanding of machine learning. Here’s a simple example: you’re making a dating website, and you want to decide what people to suggest as matches. What is an instance? Is each historical pairing an instance? Is each person an instance? How do you select your labels? Is there a way to view this as a classification problem?

        In practice, there is more to Machine Learning than taking a list of instances in a standardized format and applying black-box algorithms.

  10. Well presented.
