Summary:

Big data and the horsepower needed to generate, store and manage it are all great. Now we need to make sure our analyses are reproducible, says an AWS principal data scientist.

Matt Wood of Amazon Web Services at Structure:Data 2013 (photo: Albert Chau)

Much has been done to bring big data closer to the people who need it. The advent of public cloud infrastructure has decimated the cost of collecting, maintaining and processing vast amounts of data. The next frontier is making that data reproducible, said Matt Wood, principal data scientist for Amazon Web Services, at GigaOM’s Structure:Data 2013 event Wednesday.

In short, it’s great to get a result from your number crunching, but if the result is different the next time out, there’s a problem. No self-respecting scientist would think of submitting the findings of a trial or experiment unless she could show that they would be the same after multiple runs.

“Much of today’s statistical modeling and predictive analytics is beautiful but unique. It’s impossible to repeat; it’s snowflake data science,” Wood told attendees in New York. “Reproducibility becomes a key arrow in the quiver of the data scientist.”

That means making sure people can reproduce, reuse and remix their data, which provides a “tremendous amount of value,” Wood noted.
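
The point about multiple runs lends itself to a concrete illustration. As a rough sketch (not from Wood’s talk), the Python snippet below shows one basic reproducibility practice: fixing the random seed and recording it alongside the output, so a second run of the same analysis returns identical numbers. The `run_analysis` function and the seed value are hypothetical stand-ins for a real modeling pipeline.

```python
# Illustrative sketch only: pin the random seed and record it with the result
# so the same analysis can be re-run and produce identical numbers.
import json
import random

SEED = 42  # hypothetical seed; the point is that it is fixed and recorded


def run_analysis(seed):
    rng = random.Random(seed)
    # Stand-in for a real model: draw a sample and compute its mean.
    sample = [rng.gauss(0, 1) for _ in range(1000)]
    return sum(sample) / len(sample)


result = run_analysis(SEED)

# Persist the inputs needed to reproduce the result along with the result itself.
with open("result.json", "w") as f:
    json.dump({"seed": SEED, "mean": result}, f)

# A second run with the same seed yields the same mean -- the result is reproducible.
assert run_analysis(SEED) == result
```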

For more on Wood, check out this Derrick Harris post.

And check out the rest of our Structure:Data 2013 coverage here. A video embed of the session follows below:


