
Summary:

A new startup called Trifacta, founded by UC Berkeley professor Joe Hellerstein and Stanford professor Jeffrey Heer, wants to eliminate much of the hassle of making messy data usable. The company combines machine learning and human-computer interaction, and has raised $4.3 million from Accel Partners.

Ask a group of data scientists about the toughest part of their job, and many will probably tell you it’s not the math but the work required to turn raw data into something their software can work with.

A new San Francisco startup called Trifacta wants to solve this problem — not just for data scientists, but for all data analysts — and is banking on some of the best minds in data management and human-computer interaction to make it happen.

The problem Trifacta is trying to solve is one that folks in the data world have taken to calling “data munging.” Wikipedia defines it as “the process of converting or mapping data from one ‘raw’ form into another format that allows for more convenient consumption of the data,” and it takes up a lot of time for anyone trying to work with new and perhaps unique types of data. All the data tools in the world, from Tableau to R to Hadoop, aren’t of any use if they simply can’t make sense of the data they’re being asked to analyze or visualize.
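To make that definition concrete, here is a minimal sketch of the kind of hand-written munging code analysts often end up producing. It is not Trifacta's approach or product. The sample data, the column names and the helper functions (`parse_date`, `parse_money`, `munge`) are all hypothetical and chosen only for illustration. It takes a small "raw" export with inconsistent dates, stray whitespace and currency symbols, and maps it into clean, typed records.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw export: inconsistent date formats, stray whitespace,
# currency symbols and a missing value. None of this comes from the article.
RAW = """name, signup_date ,revenue
 Alice ,2012-10-04,"$1,200.50"
BOB,10/04/2012,300
carol ,2012-10-05,
"""

# Date formats we assume might appear in the raw data.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(text):
    """Try each known format and return an ISO date string, or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def parse_money(text):
    """Strip currency symbols and thousands separators; empty becomes None."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    return float(cleaned) if cleaned else None


def munge(raw_text):
    """Convert the raw CSV text into a list of clean, typed records."""
    reader = csv.reader(io.StringIO(raw_text), skipinitialspace=True)
    header = [column.strip() for column in next(reader)]
    records = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        fields = dict(zip(header, row))
        records.append({
            "name": fields["name"].strip().title(),
            "signup_date": parse_date(fields["signup_date"]),
            "revenue": parse_money(fields["revenue"]),
        })
    return records


clean = munge(RAW)
for record in clean:
    print(record)
```

Even in this toy case, every new quirk in the source data (another date format, another currency) means another hand-written rule, which is the kind of time sink the article describes.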

And, like most things, data munging gets harder with scale (just ask NASA). The idea of big data is traditionally premised on the “three Vs” of volume, variety and velocity, which means companies trying to create a big data strategy are going to ask their analysts to do more, faster, and with lots of new data sources such as sensors, social media and mobile phones. Something has to give, and it’s not going to be the technology, which is getting better with every step. It’s going to be the humans trying to keep up.

“The real bottleneck is in people rather than the tools they’re using,” said Trifacta co-founder and CEO Joe Hellerstein, who is also a professor of computer science at the University of California, Berkeley, and a technical adviser to several data-focused companies including EMC Greenplum, Platfora and SurveyMonkey. Although the costs of storage and computing are becoming commoditized, he added, “the cost of human attention is not.”


Making life easier by design

However, Trifacta is betting it can ease the pain by making it easier than ever for analysts to get their data formatted and get down to business analyzing it. According to co-founder and chief experience officer Jeffrey Heer, Trifacta blends advanced concepts such as machine learning with the cutting edge in human-computer interaction in order to make the process highly intuitive but also highly intelligent, learning as it goes what type of data it might be dealing with. It should appeal to data scientists who presently write code to solve all their formatting problems, as well as to everyday users who just like to poke around at data, he said.


Building such a product takes a team with a wide variety of skills. While Hellerstein is the hardcore computer scientist, Heer is a human-computer interaction professor at Stanford University who has helped develop a number of open source data-visualization projects such as Protovis and D3.js and, along with Hellerstein, a data-munging program called Data Wrangler. CTO Sean Kandel is a financial analyst who studied analyst behavior and productivity at Stanford.

Trifacta’s advisers include New York Times visualization specialist (and former Heer student) Michael Bostock, Cloudera co-founder and chief scientist Jeff Hammerbacher, Greylock Partners data scientist in residence DJ Patil and Tim O’Reilly.

It’s not sexy, but it’s very necessary

Although Trifacta wants to span the spectrum in terms of appeal, its real profit center should be the vast middle: everyday data analysts. These folks and their employers will be overwhelmed by the volume, variety and expectations that come along with big data, and will pay the price in terms of wasted time and money. That’s a big addressable market, so it’s no wonder ecosystem partners and investors have already lined up behind Trifacta.

The company has raised $4.3 million from Accel Partners, as well as additional funding from X/Seed Capital, Data Collective, and angel investors Dave Goldberg, Venky Harinarayan and Anand Rajaraman. Big players within the data ecosystem, including Tableau and Cloudera, are already on board as supporters, citing the improved utility of their products when users face fewer barriers to actually doing analysis.

“Everyone’s building the [big data] freeways with Hadoop and NoSQL databases, and everyone wants access to the freeway,” said Accel partner Ping Li, who heads the firm’s Big Data Fund. He sees Trifacta as an on-ramp that lets companies actually start mining from new data sources, thus clearing up a major bottleneck. “Until that happens,” he said, “the big data wave is going to hit a wall.”


CEO Hellerstein doesn’t mind acting as the on-ramp, even if it means Trifacta likely won’t get all the glory of its somewhat more exciting peers. “People think technology is all about building rocket ships,” he said, “but technology is most useful for building things like washing machines that remove a lot of drudgery from everyday life.” For data analysts, then, Trifacta could mean no more washing their data with rocks down by the river.

Feature image courtesy of Shutterstock user Andrea Danti.

  1. George Balayan Thursday, October 4, 2012

    the link to Trifacta (“called Trifacta wants to solve this”) is broken

    1. Derrick Harris Thursday, October 4, 2012

      Thanks, fixed.

  2. Great piece on Trifacta Derrick. Certainly a real trend and an interesting venture. And great to see the industry broadly adopting the “3Vs” of Big Data that Gartner first introduced over a dozen years ago. For future reference/attribution, here’s a link to the piece I wrote back in 2001 first defining these 3-dimensional data challenges: http://blogs.gartner.com/doug-laney/deja-vvvue-others-claiming-gartners-volume-velocity-variety-construct-for-big-data/. –Doug Laney, VP Research, Gartner, @doug_laney

    1. Chunting Xiang Monday, October 29, 2012

      Great vision Doug.

  3. Thomas Johansson Thursday, October 4, 2012

    This is already what the company Expertmaker is doing. I checked them out at a hackathon they held with Vodafone on the theme of information overload.
    They already have a network of thousands of users working with their desktop tools, which provide data cleansing, analytics, data mining, etc., spanning almost all types of AI (like image recognition, advanced text classification, etc.) in order to transform databases and structured or unstructured data into predictions, decision models, feature extractors, etc.

  4. Where is this presence in the back end of all things, the foundation of the infrastructure… I want to build that part!

  5. Random-ness: I do apologize that this is a bit fragmented in terms of the complete idea… (I can provide some of my rough outline notes to potentially help clarify my position here.) The kernel tuning is only one aspect, but I cite it to show the depth and breadth of what I am researching. So ‘self healing’ and ‘aware’ allude to complete machine learning based on a larger scope, hence the inclusion of kernel ‘hacking.’ The kernel itself can actually lend itself nicely to self healing if its ‘syslog’ output, for example, were made to be more meaningfully parseable… its cryptic output should be replaced with more elegant solution possibilities, all from the kernel perspective. If that fails, then it is handed to the application side to be dealt with in an algorithmic manner, as is typical for most front-end implementations of machine learning. I’m not sure starting from scratch is my goal with the kernel; probably more module amendment or extensions, unless that proves ineffective or inefficient. The cash value from where I am coming from would be a more complete standalone solution, one that trivializes its own management by including that as part of the offering. Imagine building a base system and deploying it within your infrastructure, where it can immediately survey and learn from ingestion and/or from peer nodes in place and begin or supplement the decision/remediation scope. I also see this more holistically as a replacement for current management models, finally allowing us all to move away from the mundane (my opinion) aspects of, in some cases, endless troubleshooting to root cause analysis, and making that ordeal a “non-event” so we can all focus on more important things… innovation to facilitate further innovation is how I see this effort… Sure we could all still ride horses, but we don’t. Plug in to that!

