10 Comments

Summary:

If we want the big data revolution to scale, then we need to make it as easy as Netscape made web surfing. Here are 7 startups making that happen.


Ford’s data chief joined many other top executives who are bemoaning the lack of simple tools to solve big data problems — namely the fact that running Hadoop clusters or performing analytics is still a job that requires a specialist. If we want the big data revolution to scale, then we need to make it as easy as Netscape made web surfing. Here are 7 startups making that happen.

Ford’s data chief, John Ginder, gave an interview to ZDNet in which he said:

“That’s a great endpoint I’d love us to move toward,” said Ginder, “but there aren’t enough of us and there aren’t enough of those tools out there to enable us to do that yet. We have our own specialists who are working with the tools and developing some of our own in some cases and applying them to specific problems. But, there is this future state where we’d like to be where all that data would be exposed. [And] where data specialists — but not computer scientists — could go in and interrogate it and look for correlations that might not have been able to look at before. That’s a beautiful future state, but we’re not there yet.”

Datahero: This startup is all about visualization — namely, making it easy to take data and turn it into pretty pictures that can generate new understanding or convince someone to take action. Users bring their data files and Datahero does the rest.

Prior Knowledge: Relative newcomer Prior Knowledge is the brainchild of MIT grads who wanted to let non-data-scientists play around with data. The company offers a service that lets people upload their data and hook into PK’s database API. The service then assesses the information for correlations and helps app developers build predictive models. It has raised $1.4 million in funding from Founders Fund and angels.
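The "assess for correlations" step is easy to picture with a few lines of plain Python. This is a generic sketch of the kind of statistic such a service computes, not Prior Knowledge's actual API; the dataset is invented for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy dataset: ad spend vs. sales, perfectly correlated for illustration.
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [10.0, 20.0, 30.0, 40.0]
print(pearson(ad_spend, sales))  # ≈ 1.0 (up to float rounding)
```

A service like this would run such a statistic across every pair of uploaded columns and surface the strongest relationships to the user.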

Platfora: Hadoop is everyone’s favorite big data batch-processing platform, but it’s not easy enough for everyone to use. Like others, Platfora wants to make Hadoop so easy even I could use it, through an intuitive user interface that has advanced data science functions built in, rather than making users write queries. It has raised $5.7 million and its product will be out next year.

ClearStory: Big names back this startup, which is also a service as opposed to software. Google Ventures, Andreessen Horowitz, and Khosla Ventures have funded ClearStory, which aims to funnel data from a variety of sources (including Hadoop!) into one place, where employees can then use a GUI to interact with and visualize that data.

Karmasphere: The Karmasphere product is designed to ease the process of developing Hadoop workloads and applications, even from the desktop. It lets users write SQL-like queries while also connecting their favorite BI tools and analytics software to perform analysis.

Datameer: Like others on this list, Datameer is out to make Hadoop more relatable to non-nerds. In this case it does so by providing a familiar spreadsheet overlay so businesspeople can analyze their Hadoop jobs, and it then lets people create visualizations and draw correlations. It’s closest to Karmasphere, but its latest feature, which allows someone to run it on a single machine, is a differentiator.

BigML: Much like Prior Knowledge, BigML is a startup that combines data with machine learning to give ordinary people the smarts to answer questions with their data. It hopes to let people do machine learning in four easy steps: set up a data source; create a dataset; create a model; and generate predictions. It’s in private-beta mode right now.
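The four-step flow reads roughly like this in code. This is a hypothetical sketch — the function names are invented for illustration and are not BigML's real API — with a trivial majority-class model standing in for the actual machine learning.

```python
# Hypothetical "source -> dataset -> model -> prediction" pipeline;
# names are invented for illustration, not BigML's actual API.
from collections import Counter

def create_source(csv_text):
    """Step 1: register raw data (here, just split CSV text into rows)."""
    return [line.split(",") for line in csv_text.strip().splitlines()]

def create_dataset(source):
    """Step 2: turn raw rows into (features, label) pairs."""
    return [(row[:-1], row[-1]) for row in source]

def create_model(dataset):
    """Step 3: 'train' a stand-in model that predicts the most common label."""
    majority = Counter(label for _, label in dataset).most_common(1)[0][0]
    return {"majority_label": majority}

def create_prediction(model, features):
    """Step 4: generate a prediction for new input."""
    return model["majority_label"]

csv_text = "sunny,hot,play\nrainy,cold,stay\nsunny,mild,play"
model = create_model(create_dataset(create_source(csv_text)))
print(create_prediction(model, ["sunny", "mild"]))  # → play
```

The appeal of the four-step framing is that each stage is a single call, so a user never has to think about clusters, feature engineering, or algorithm selection.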


  1. What is Hadoop???

  2. Mark Slusar Friday, July 6, 2012

Don’t “ditch your data scientists”; give them tougher problems to solve and pull ahead of the pack.

  3. Justin Donaldson Friday, July 6, 2012

    Hi Stacey, thanks for the story! Can we please get a link to http://www.bigml.com on the relevant paragraph?

  4. Nitin Borwankar Saturday, July 7, 2012

There are a number of fundamental problems with trying to route around the data scientist shortage:

a) A lot of this data is confidential and it is not going to leave the LAN. The more valuable it is, the more this applies. Conversely, if a company doesn’t care about uploading the data to an unknown third party, how valuable can this data be? Bottom line: this “outsourced data science” is a solution for the long tail. But the money is at the other end of the power-law curve, i.e., the enterprise, which is not uploading its data “family jewels” anytime soon.

b) It takes a lot of bandwidth to upload any meaningful amount of data. AWS allows you to send in hard drives in the mail, which they will then upload locally. This should tell you that the bottleneck is not a lack of data scientists; it’s the thin pipe between the ground and the cloud.

c) We are at a structural inflection point: we need to integrate data-driven thinking into everything we do, and we can’t delegate this. There is no silver bullet here.
There is a chasm, and only companies that are in it for the marathon rather than the sprint will survive.
Good luck to the sprinters.

    Did I already say there was no short cut?
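The bandwidth bottleneck raised in point (b) above is easy to quantify with back-of-the-envelope arithmetic. The link speeds below are illustrative assumptions, not figures from the comment:

```python
# Back-of-the-envelope upload times for 1 TB of data at assumed link speeds.
TB_BITS = 8 * 10**12  # 1 terabyte expressed in bits

for name, bits_per_sec in [("10 Mbps office uplink", 10 * 10**6),
                           ("100 Mbps fast uplink", 100 * 10**6)]:
    hours = TB_BITS / bits_per_sec / 3600
    print(f"{name}: {hours:.1f} hours")
```

At an assumed 10 Mbps uplink, a single terabyte takes over 220 hours (more than nine days) to transfer, which is why mailing hard drives can beat the network.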

  5. Morgan Goeller Sunday, July 8, 2012

    Don’t forget Cetas (http://www.cetas.net), just acquired by VMware. Very cool technology that takes a huge amount of the heavy lifting out of data science and can be deployed either on-site or in the cloud.

  6. BLUF: a typically provocative title that doesn’t accurately reflect the issues at hand.

    From the article:

    Ford’s data chief joined many other top executives who are bemoaning the lack of simple tools to solve big data problems — namely the fact that running Hadoop clusters or performing analytics is still a job that requires a specialist.

    Assumption #1: it’s possible to “dumb down” the analytics tools to the point that a non-specialist can effectively replace a specialist.

    My response: the basic tools of carpentry (hammer, nail gun, etc.) are easy enough to use, but nobody would want me to build a house for them. Such tasks are better left in the hands of a specialist, one who is skilled in understanding the problem space, formulating a solution, and applying the tools to get the job done.

    To use another analogy, easy-to-use software like TurboTax hasn’t put all CPAs out of work.

    I mean, honestly, is the author really suggesting that we roll-back the historical drive towards division of labor, or is she just bemoaning the fact that it’s hard to find & hire good people who are capable of doing the advanced analytical work he needs? General-purpose computer programming isn’t something that a non-specialist can perform, but good programmers aren’t as hard to find as good data scientists (a relatively new title that serves as an umbrella term for a very broad set of specialties).

    If we want to [sic] big data revolution to scale, then we need to make it as easy as Netscape made the web surfing experience.

    Assumption #2: performing advanced analytics on big data is somehow comparable in any way to browsing the web for photos of LOLcats.

    My response: not every problem lends itself to being solved by an “easy button.” Easy-to-use tools are great, but you still need to be a skilled practitioner (i.e., a specialist) in order to use them effectively.

    =====

It seems that the author of this article somewhat misinterpreted what Ford’s data chief actually said:

    I’d love us to move toward [...] where data specialists — but not computer scientists — could go in and interrogate it and look for correlations that might not have been able to look at before. That’s a beautiful future state, but we’re not there yet.

To me, what he is saying is that he wants data specialists (who are not computer scientists by training) to be able to do the work currently being performed by data scientists, who are largely computer scientists with an aptitude for analyzing big data.

    This already exists in some ways within the business intelligence (BI) community. Really good BI tools exist to enable analysts (who are not computer scientists) to analyze business operations and add value to organizations.

    The question is whether it is possible to build general-purpose data analytics tools capable of analyzing any kind of data as well as BI tools analyze business-related data. I suppose it’s possible, but a more likely scenario involves lots of different specialized tools focused on different problem domains (e.g., Orion for IC applications, RedOwl for corporate communications), since each domain has different kinds of data problems to solve.

    The tools outlined in the article could be useful, especially for a non-computer scientist “data specialist” to perform some initial exploration and analysis of data (cf. TurboTax), but there’ll still be a need for someone (i.e., a traditional data scientist) to build the custom tools necessary for more advanced analytics.

Stacey, great insight. It is worth mentioning the HPCC Systems open source offering, which provides a single platform that is easy to install, manage, and code for. Its built-in Machine Learning libraries and integration with Pentaho for BI make it easy for users who do not hold a PhD or carry a title like “Data Scientist” to analyze Big Data. I believe HPCC is better than Hadoop and commercial offerings: it has a real-time data analytics and delivery engine (Roxie) and runs on the Amazon cloud like a charm through the One Click portal. For more info visit: hpccsystems.com

  8. It’s critical to have intuitive analysis tools that business users can use to make better decisions: companies employing data-driven decision making outperform other companies. Data scientists do have an invaluable role in bringing together data acumen, analysis expertise and business understanding. There aren’t enough data scientists. Metamarkets is providing data science-as-a-service, a SaaS solution enabling data scientists to extend their reach and provide intuitive analytics to business users within their organizations.

  9. Steve Ardire Sunday, July 29, 2012

    And here’s another way to ditch your data scientists….

Use a semantic data model (RDF) that encodes contextual data relationships, enabling machines and knowledge workers to discover meaning for sharing and reuse across applications and platforms, without data relationships having to be predefined as with SQL, NoSQL, or Hadoop (hence the need for data scientists).

So if you want to know which semtech startups have their act together, then ping me ;)
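The idea in the comment above — relationships carried as data rather than baked into a schema — can be shown with a toy in-memory triple store. This is only an illustration of the RDF triple pattern (all facts are subject–predicate–object triples), not a real semantic-web stack, and the example facts are invented:

```python
# Toy triple store: every fact is a (subject, predicate, object) triple,
# so new relationship types can be added and queried without changing a schema.
triples = [
    ("ford", "employs", "john_ginder"),
    ("john_ginder", "holds_title", "data_chief"),
    ("ford", "industry", "automotive"),
]

def query(s=None, p=None, o=None):
    """Match triples against a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

print(query(s="ford"))          # all facts about ford
print(query(p="holds_title"))   # all title relationships, any subject
```

Because queries are pattern matches over triples rather than joins over fixed tables, a new predicate is just another row of data — the schema never has to be declared up front.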

  10. Data science will likely follow the path of “desktop publishing” – taking a high skilled task and reducing it to an easy to use tool that gets most people’s work done reasonably well. The Social Media Research Foundation (http://www.smrfoundation.org) has been developing the NodeXL project that performs many “data scientist” tasks with a few clicks. Designed for people who are usually making pie charts and line charts, NodeXL supports the collection, analysis, visualization and publication of social media networks. See: http://nodexl.codeplex.com

    Regards,
    Marc

Comments have been disabled for this post