Session Name: What We Have Is Not Good Enough: Connecting The Numbers.
Jason Hoffman 00:08
Good morning, everybody. It’s nice to see there’s at least some of the seats filled in, despite the fact it’s 8:30 in the morning.
Jason Hoffman 00:17
I remember being a young scientist and coming up with some seemingly simple ideas. About 10 or 12 years ago, we would decide to look at a ton of samples across, say, a ton of patients and a ton of images, and I remember calling up our storage sales guy – I won’t name the company, and I’m also not going to name the units of the storage because it would somewhat age me – and I basically said, “I need 12 of those,” and he said, “We’re only shipping four this year.” So, consider the type of data that comes out of an effort as simple as, say, sequencing the genome of a human being. We did spend a ten-year project sequencing the genomes of about ten people, and now, of course, there’s the 1000 Genomes Project, which is a data set of about 200 terabytes in size. That is processed terabytes; it’s actually ten times that when you look at all the raw data, and that’s only for 1,000 people. So, you begin to think about very simple things like, what if I wanted to sequence 10 million genomes? That’s about 20 exabytes worth of data, just from sitting there and saying, “I’m going to go and collect all the raw data for this given scientific activity.”
Jason Hoffman 01:43
The storage industry, though, only shipped 16 exabytes in 2012. So, we’re still about four exabytes short of addressing, say, only 10 million people. All storage devices together shipped about 1,600 exabytes; that means the storage that goes in your phone, the storage that goes in an iPad, the storage that goes in a TV. There are a couple of things that are important here. One is that we still have singular ideas out there that can in fact consume an entire industry in our space. The idea that, even now, you could take on a single effort and consume the entire storage industry as a result is something that we should all keep in mind. It’s not odd for there to be telescopes and astronomy projects that could generate a petabyte to an exabyte a day. So, the inability to even collect, store, and move that degree of data for a singular project still exists. Also, keep in mind that only about 1% of that capacity ships in what you’d call storage. So, 99% of the capacity that we ship per year still ships inside of a device.
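The back-of-envelope arithmetic in the remarks above can be checked with a short sketch. It uses only the figures quoted in the talk; the function name, the variable names, and the choice of decimal units (1 exabyte = 1,000,000 terabytes) are assumptions made for illustration.

```python
# Rough check of the storage figures quoted in the talk.
# Assumption: decimal units, 1 EB = 1,000,000 TB.
TB_PER_EB = 1_000_000


def genome_storage_estimate(target_genomes: int) -> dict:
    """Estimate raw storage for `target_genomes` and compare it to 2012 shipments."""
    processed_tb = 200            # processed data set for 1,000 people, in TB
    raw_multiplier = 10           # raw data is about ten times the processed size
    people_in_dataset = 1_000

    # Raw terabytes per sequenced genome, scaled up from the 1,000-person data set.
    raw_tb_per_genome = processed_tb * raw_multiplier / people_in_dataset

    needed_eb = raw_tb_per_genome * target_genomes / TB_PER_EB

    shipped_storage_eb = 16       # shipped as dedicated storage in 2012
    all_devices_eb = 1_600        # capacity across all storage devices in 2012

    return {
        "raw_tb_per_genome": raw_tb_per_genome,
        "needed_eb": needed_eb,
        "shortfall_eb": needed_eb - shipped_storage_eb,
        "storage_share": shipped_storage_eb / all_devices_eb,
    }


estimate = genome_storage_estimate(10_000_000)
print(estimate)
```

Under these assumptions, 10 million genomes come to about 2 terabytes of raw data each, or roughly 20 exabytes in total, which is 4 exabytes more than the 16 exabytes shipped as storage, and that 16 exabytes is about 1% of the 1,600 exabytes shipped across all devices.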
Jason Hoffman 03:07
I talked about genomics, we talked about astronomy, we talked about a few of these types of things, and there are in fact only three places that we collect data from. The canonical place we collect data from as scientists is nature. Data from nature is the highest resolution data available.
Jason Hoffman 03:27
We also collect data from humans. The photos we take, the videos we take, the texts that we write – meaning everything from books, to tweets, to blog posts, to liking things on Facebook – and now, even more, we’re collecting a lot of sound, too; push to talk, tons of audio. So, if you think of data from humans, we could approximately estimate that it’s just about everything in, let’s say, Google, plus Facebook, plus enterprise file and mail, plus all the cumulative activities that we basically want to do.
Jason Hoffman 04:01
The other critical one now, of course, is data from machines. Whether it be server logs, what phones are doing, wearable sensors, or cars, this becomes the third and very often forgotten type of data source. At the same time, we’re now thinking of an application abstraction going forward that is a world of devices with the network as the backplane, and that is the way we have to address the world. So we start dealing with a situation where the entire data set of human-generated content is actually very small compared to the data set of machine-generated content, which of course is still far smaller than what you have in nature.
Jason Hoffman 04:46
The key thing to keep in mind is that it is in fact machine data that is the big data trend. It is both machine and human learning. Machine data is the big driver in terms of what we’re doing. It’s correlating across human activities and the machines that we wear and sensors, and then, of course, tying back to the very sort of thing that we are. So, when you look at things like the impact of wearables and human health, that’s something that cuts across humans, machines, and nature. And even at the machine data level, we deal with the idea that if we want to collect all of the instrumentation data from our cars, that exhausts the entire current storage industry. If we want to collect all the instrumentation data off of airplanes, that exhausts the entire storage industry. In fact, all we typically do now is collect 30 seconds when things go wrong, and we hope that it’s still there if a plane crashes; that’s all we collect.
Jason Hoffman 05:39
So, we have human data, we have machine data, and we’ve been doing natural data in the sciences for a number of years; what is this driving? It’s driving basically another convergent event. I typically tend to think that industry trends start happening when fundamentals converge. The fundamentals in our industry are pretty simple. We have data that we care about, we do compute on it, and we send and receive it over a network; that’s all we do. We literally stick stuff on hard drives, stick stuff in memory, put things on a CPU, and send and receive it over a network. What we’re almost constantly doing in our industry is cycling between innovations in one of these three, and we’re typically sort of jamming these three together. For example, the phrase “The network is the computer” really emerges as the slogan for both the PC and the internet cycle; that was compute and network converging, where the PC meets the internet and we start sharing our own personal data. In fact, network storage was really about data and network convergence. Right now, with human-generated content and a data set from machines that is so much larger than what we’ve ever dealt with in our lives, we have to sit down and say, “Well, what happens when compute and data begin to converge?” We’ve had some indication of these types of things. The popularity of the old Sun Thumper was around the concept of being able to do compute directly on your data. Things like an Exadata are about doing compute directly on your data. A lot of what drives Hadoop is doing compute directly on your data.
Jason Hoffman 07:22
So, a lot of this comes down to: how do we start handling data sets that are a billion times bigger than our current data sets? How do we have storage that’s a billion times bigger than our current storage? How do we actually start asking ad hoc questions directly on extremely large data sets, and make that democratized, so it’s not purely in the hands of, say, a company like Google or Facebook, or the like?
Jason Hoffman 07:44
So, it will be a large expansion of our industry. Every profit pool that exists now will basically be up for grabs over the next 30 years. Think of 1985 and 1986: remember, in 1986 Oracle and Microsoft IPO’d within a day of one another, as companies with market values of only about 200 to 400 million dollars. We saw the last 30 years of progression of that compute and network convergence in the PC push. The 30-year trend around 2015 is just like the time between 1985 and 1990. There were a lot of software companies that existed; the first sort of pure software company IPO’d in 1958, and there was a whole bunch that existed from then on, and all of them were basically acquired or went out of business between 1985 and 1990 if they pre-dated Oracle or Microsoft.
Jason Hoffman 08:44
With that, thank you everyone. Enjoy the conference and take care.