
Turning data scientists into action heroes: The rise of self-service Hadoop

Mike is chief operating officer at Altiscale.

The unfortunate truth about data science professionals is that they spend a shockingly small amount of time actually exploring data. Instead, they devote much of their time to wrangling data, and pour resources into the tedious work of prepping and managing it.

While Hadoop excels at turning massive amounts of data into valuable insights, it’s also notorious for sucking up resources. In fact, the skills and effort it demands are serious bottlenecks to big data success: research firm Gartner predicts that through 2018, 70 percent of Hadoop deployments will not meet cost savings and revenue generation objectives due to skills and integration challenges.

Whether they are stuck in a queue behind higher-priority jobs or functioning as Hadoop operations staff (building their own clusters, accessing data sources, and running and troubleshooting jobs), data scientists are wasting time on administrative tasks. Sure, some heavy lifting is necessary to analyze data successfully. But it isn’t the best use of a data scientist’s time, and it’s a drain on an organization’s resources.

So how can data scientists stop serving as substitute Hadoop administrators and become analytics action heroes?

Just as the business intelligence industry has moved to a more self-service model, so is the Hadoop industry. Operational challenges are moving to the background, so that data scientists are free to spend more time building models, exploring data, testing hypotheses, and developing new analytics.

Self-service Hadoop solutions simplify, streamline, and automate the steps needed to create a data exploration environment. Self-service is achieved when a provider (one who runs and operates a scalable, secure Hadoop environment) delivers a data science platform for the analytics team.

With a self-service environment, data scientists can focus on data analysis, confident that the data and Hadoop operations are well taken care of. These environments can also be kept separate from production, so that test data science jobs don’t interfere with a production Hadoop environment that is core to business operations. That separation reduces the risk of operational mishaps.

As self-service Hadoop rises, organizations will realize the benefits of analytics action heroes and their superpower contributions. Here are a few reasons why:

  • Faster understanding of trends and correlations that drive business action: Self-service tools eliminate the complex and time-consuming steps of procuring and provisioning hardware, installing and configuring Hadoop, and managing clusters in production. By automating the handling of issues that customers run into in production, such as job failures, resource contention, performance optimization, and infrastructure upgrades, they let data analytics projects run with more ease and speed.
  • Freedom to take risks with more agile data science and analytics teams: Using the latest self-service technology in the Hadoop ecosystem, organizations can gain a competitive edge that was not previously possible. Teams can experiment with advanced technology in a production environment, without the overhead associated with maintaining an on-premises solution. This allows data scientists to develop cutting-edge products that leverage features in the most advanced software available.
  • Increased time for Hadoop experts to focus on value-added tasks: Operational stability frees up internal resources so Hadoop experts can focus on unearthing data insights and on other value-added tasks such as data modeling. Simply put, with more time spent examining the data rather than wrangling it, organizations can uncover insights that drive business forward and deliver on the true promise of big data.

Hadoop has enormous potential to drive business forward. Yet it can quickly become a drain on internal operational resources when running in production and at scale. To fully realize big data’s potential, organizations need to devote more time to data science and less to Hadoop infrastructure. Self-service tools make this a reality.