

Los Alamos National Laboratory is trying to build an exascale computer, which would be 1,000 times faster than Cray’s Jaguar supercomputer and could process one billion billion calculations per second. The man in charge of executing that vision, however, sees a big obstacle to building a computer with 1 million nodes running between 1 million and 1 billion cores. That problem is resilience.

Photo: Gary Grider of the HPC Division at Los Alamos National Laboratory, Garth Gibson of Panasas, and Rich Brueckner of inside-BigData at Structure:Data 2012. (c) 2012 Pinar Ozger, pinar@pinarozger.com

Speaking at GigaOM’s Structure:Data conference, Los Alamos HPC deputy division leader Gary Grider said that an exascale computer would have so many parts that some element would constantly be failing.

“It wouldn’t be worth building if it didn’t stay working for more than a minute,” Grider said. “Resilience is absolutely a must. The way you get answers to science is you run problems on these things for six months or more. If the machine is going to die every few minutes, that’s going to be tough sledding. We’ve got to figure out how to deal with resilience in a pretty fundamental way between now and then.”
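To get a rough sense of why the machine would otherwise "die every few minutes," here is a minimal back-of-the-envelope sketch. The per-node reliability figure is an assumption for illustration, not a number from the panel: if each of 1 million nodes fails independently about once every five years, the system as a whole can expect a failure somewhere every few minutes.

```python
# Back-of-the-envelope system reliability estimate (illustrative assumptions only).
# With N independent components, the system-level mean time between failures
# is roughly the per-component MTBF divided by N.

MINUTES_PER_YEAR = 365 * 24 * 60

def system_mtbf_minutes(node_mtbf_years: float, num_nodes: int) -> float:
    """Approximate system MTBF in minutes for num_nodes independent nodes."""
    return node_mtbf_years * MINUTES_PER_YEAR / num_nodes

# Assumed: each node fails about once every 5 years; 1 million nodes (per the article).
print(f"{system_mtbf_minutes(5, 1_000_000):.1f} minutes between failures")
# -> roughly 2.6 minutes between a failure somewhere in the machine
```

That is why Grider frames resilience as a design problem rather than a hardware-quality problem: a six-month science run has to survive thousands of component failures along the way.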

Grider and Los Alamos’s technology partners have between 6 and 10 years to work on the problem, and the national lab won’t be alone. According to inside-BigData president Rich Brueckner, who moderated the “Faster Memory, Faster Compute” panel on which Grider spoke, countries from all over the world are in an exascale race. Brueckner said it’s just as likely that Russia, Japan, China, India or the European Union will develop the first exascale machine as the U.S.


  1. I don’t understand the concern with resiliency here. With any system built to this scale, no single component should decide availability, only capacity. Unless the application is dependent on all components running all the time, which would be a very poor design strategy. Also, if specific workloads require a minimum capacity to run effectively, then you would model potential component failure rates and build that into the total number (i.e., if 1 million is the minimum number and the yearly failure rate is 3%, then you would build to a spec of 1.03 million).
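A minimal sketch of the overprovisioning arithmetic the comment describes; the 1 million node minimum and 3% yearly failure rate are the commenter’s example figures, and the simple multiplier is only one way to model the spare capacity, not a definitive provisioning method.

```python
import math

def nodes_to_build(min_capacity_nodes: int, yearly_failure_rate: float) -> int:
    """Naive overprovisioning: add enough spare nodes to cover the expected
    yearly failures so that capacity, not availability, absorbs the losses."""
    return math.ceil(min_capacity_nodes * (1 + yearly_failure_rate))

# Commenter's example: 1 million nodes minimum, 3% yearly failure rate.
print(nodes_to_build(1_000_000, 0.03))  # -> 1030000, i.e. build to ~1.03 million nodes
```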

