Summary:

We are moving from the Information Age to the Insight Age, and as part of that shift we need a compute architecture that will handle the storage and processing required, all without requiring a power plant hooked up to every data center. What architecture will win?

We are moving from the Information Age to the Insight Age, as Parthasarathy Ranganathan, an HP Labs distinguished technologist, told me. As part of that shift we need a computing architecture that will handle both the storage of data and the heavy processing required to analyze that data, and we need to do it all without requiring a power plant hooked up to every data center.

The shift is a move from creating scads of information in a format that can be stored cheaply, to being able to process and analyze that information more cheaply as well (all the while adding new layers of data thanks to a proliferation of devices and networks). The challenge is that under the current computing paradigm, adding more processing is problematic both because it’s becoming more difficult to cram more transistors onto a chip and because those chips and their surrounding servers are sucking up an increasing amount of power.

With Power Consumption, The Question is, “How Low Can You Go?”

“Data is expanding faster than Moore’s Law and that’s a compelling problem that we’re trying to solve,” Ranganathan said. It’s apparently a problem that Intel’s Kirk Skaugen, vice president and general manager of the chipmaker’s Data Center Group, is thinking about too. Skaugen said in a speech last week at Interop that there were 150 exabytes of traffic on the Internet in 2009 and 245 exabytes in 2010, and that the Internet could hit 1,000 exabytes of traffic by 2015 thanks to more than one billion people joining the web.
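To put those traffic figures next to Moore’s Law, here is a minimal sketch that turns them into compound annual growth rates. The helper name `annual_growth` is hypothetical, and reading Moore’s Law as a doubling every two years is an assumption for illustration, not something Skaugen or Ranganathan stated. Note also that Skaugen’s numbers describe Internet traffic, which is not necessarily the same thing as the stored data Ranganathan is talking about.

```python
def annual_growth(start, end, years):
    """Compound annual growth rate that takes `start` to `end` over `years` years."""
    if start <= 0 or end <= 0 or years <= 0:
        raise ValueError("start, end and years must all be positive")
    return (end / start) ** (1.0 / years) - 1.0


# Internet traffic figures cited in Skaugen's speech, in exabytes.
traffic_2009 = 150.0
traffic_2010 = 245.0
traffic_2015 = 1000.0

growth_2009_2010 = annual_growth(traffic_2009, traffic_2010, 1)
growth_2010_2015 = annual_growth(traffic_2010, traffic_2015, 5)

# Assumption (not from the article): Moore's Law as a doubling every two years.
moore_per_year = annual_growth(1.0, 2.0, 2)

print(f"Traffic growth 2009 to 2010: {growth_2009_2010:.1%} per year")
print(f"Implied growth 2010 to 2015: {growth_2010_2015:.1%} per year")
print(f"Assumed Moore's Law pace:    {moore_per_year:.1%} per year")
```

Swapping in a different doubling period for Moore’s Law (18 months is another common reading) changes the comparison, which is why that figure is kept as an explicit assumption.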

Want to hook one of these to your data center?

That’s a lot of bandwidth. But it’s also a lot of data and a lot of compute demand. Listening to Skaugen’s speech, it appears that Intel’s primary function will be to convince the people who build the machines that process those exabytes of data that their machines should run newer and more energy-efficient Intel processors. But is Intel’s architecture, with its upsell to tri-gate 3-D transistors, the right choice for the future of computing and big data?

As I noted before, Intel’s much-vaunted 3-D transistor advancement is cool, but it only gets us so far in cramming more transistors on a chip and reducing the energy needed. For example, a 22 nanometer chip using the 3-D transistor structure consumes about 50 percent less energy than the current-generation Intel chip, and less than an Intel chip using the older flat transistor design would at 22 nanometers (squeezing in more transistors also helps reduce the power consumption). And when we’re talking about adding a billion more people to the web, or transitioning to the next generation of supercomputing, a 50 percent reduction in energy consumption on the CPU won’t be enough on its own.

The original, "flat" transistor at 32nm.

For example, scientists at the Department of Defense estimate (GigaOM Pro sub. req’d) that getting to the next generation of supercomputer with current architectures could require two power plants to serve every exascale computer. Reducing that to one is great, but it’s not good enough. This is why the folks at ARM think they have an opportunity and why the use of GPUs in high performance computing is on the rise.

A New Architecture for a New Era

But there is more to this trend than merely eking out more performance for less power. There is also a subtler shift toward matching processors to workloads, an acknowledgment that generic x86 CPUs might not be the best solution for all workloads, especially in a cloud world. For example, startups are already trying to build optimized gear for companies such as Facebook or Google, which can then run their own software on top of these optimized platforms.

Facebook's vanity-free server.

Don’t believe this is coming? Take a look at Facebook’s Open Compute efforts. Facebook kept the same x86 processors made by Intel and AMD, but it was willing to question everything about the architecture of the servers and data centers those chips were housed in. And that willingness to question everything is occurring at firms all over the world that are dealing with massive compute needs, a trend that Intel can’t help but find worrying and that folks such as Ranganathan at HP see as their big chance.

“Historically there is evidence that each killer app has an influence on the architectures that are preceded by the special purpose alternatives,” Ranganathan said. “So asking what instruction set for the processor, or if you want powerful or wimpy processors or special purpose processors are all legitimate architectural questions that we need to answer.”

HP’s answer is its concept of nanostores, chips that tie the memory and the processor together using a completely new kind of circuit called a memristor. The basic premise for HP is that 80 percent of the energy inside a data center is tied to moving data from memory to the processor and then back again. We’re already seeing the trend of moving memory closer to the processor (that’s what the addition of Flash inside the data center is about) to speed up computing.

But instead of next-door neighbors inside a box, HP essentially wants processing and memory married and in the same bed. HP won’t give a timeline for when this vision will become reality, but in 2010 it announced a manufacturing partnership with Hynix to build such chips.
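A rough way to see why HP goes after data movement rather than the processor alone is to run the arithmetic on that 80 percent premise. The sketch below is illustrative only: the function name `remaining_energy` is hypothetical, the reduction factors are made-up inputs rather than HP figures, and treating the other 20 percent as the share a more efficient CPU could address is a simplifying assumption.

```python
def remaining_energy(share, reduction_factor):
    """Fraction of the original energy still used after the component that
    accounts for `share` of it becomes `reduction_factor` times cheaper."""
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must be between 0 and 1")
    if reduction_factor <= 0:
        raise ValueError("reduction_factor must be positive")
    return (1.0 - share) + share / reduction_factor


# HP's premise from the article: 80 percent of energy goes to moving data.
MOVEMENT_SHARE = 0.8
# Simplifying assumption: everything else (20 percent) is the compute share.
COMPUTE_SHARE = 1.0 - MOVEMENT_SHARE

# Hypothetical reduction factors, chosen only to show the shape of the curve.
for factor in (2, 10, 100):
    left = remaining_energy(MOVEMENT_SHARE, factor)
    print(f"Data movement {factor}x cheaper: {left:.0%} of original energy")

# For contrast, halving only the assumed compute share.
cpu_only = remaining_energy(COMPUTE_SHARE, 2)
print(f"Compute 2x cheaper, movement unchanged: {cpu_only:.0%} of original energy")
```

Under these assumptions, no improvement to data movement alone can push total energy below the remaining 20 percent, and halving only the compute share barely moves the total, which is the intuition behind tying memory and processor together.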

So Where’s Intel in This Architecture?

So when Skaugen gets up at Interop to push Intel’s 3-D transistors and the incredible inflows of data coming online, he’s also making a pitch for Intel’s relevance, because big data processing is one of the areas where a general purpose CPU makes a lot of sense. Folks may adopt GPUs for better supercomputing or data visualizations, and ARM may keep its upward momentum into more and more mobile computers or win some server designs in webscale businesses that can see a use case. But just crunching the numbers associated with big data could become Intel’s game to lose.

There are plenty of folks hoping Intel will lose it (or at least that they will stand to gain) — not just Ranganathan at HP, but also the guys building 100-core chips at Tilera, or those hoping that the mathematical affinities inside digital signal processors might make them a good choice for data. It’s a topic I can’t wait to explore with Ranganathan, folks from Intel, Tilera and others at our Structure 2011 event on June 22 and 23. Because just like the steam engines and trains of the Industrial Age had to give way to the tools of the Information Age, the PCs and current servers used today will become a footnote as we pass into the Insight Age.

  1. Alan Wilensky Sunday, May 15, 2011

    That is one big set of generalities for a publication as prestigious as Om’s brand. The architectural challenge of the next decade is not between RISC and CISC (ARM vs. Intel) or load/store vs. vector (general CPU vs. graphics processor), but will require a departure from the Harvard architecture of registers, code segments, stacks, and memory access. The basic architecture was defined long ago by Von Neumann and Turing, and refined by some really smart people, but the next major phase of computational efficiency will come from the world of physics, number theory, and left field. Who and what will supplant the Harvard architecture? That is one big hurdle to overcome, with the largest legacy library of tools, OSes and design discipline to replace. The program-counter-based register load/store architecture is shared by ARM and Intel. One relies on simple instructions that execute fast and with as few cycles as possible for a load or store, most often 1 or 2, while the other has made the complex instruction that uses tens of cycles to accomplish more. RISC and CISC are converging: ARM makes chips with CISC-like instructions, and Intel has created chips that trim down a few families of opcodes for better video and vector math handling. The video processor companies have started to create architectures to handle procedural code, but it’s hard to integrate that business into a general computing architecture, as the long lag in getting the tools to market has shown.

    1. Joshua Goldbard (ThePBXGuy) Alan Wilensky Sunday, May 15, 2011

      Hey,

      Thanks for this comment. I think both your viewpoint and the viewpoint in the article have a place in the discussion. Computer architecture, power usage, and networking all have a big part to play in the next generation of computing technology. The best would be a marriage of the three, but that’s hard to even conceptualize.

      I’m excited to see solutions to these large problems. As Arrington would say “this industry has room for Disruption”; don’t they all?

      Cheers,
      Joshua

    2. Stacey Higginbotham Alan Wilensky Monday, May 16, 2011

      Hey Alan, it has been a while :) I agree that the tools and eventual code will have to change, and there’s plenty of room for more articles on that. I didn’t for instance get into the ways that HP is trying to architect its nanostores to work with today’s programming models. However, if the article promotes quality discussion from folks like yourself on the way forward, that’s just more fodder for posts to come.

  2. a great way to reduce power consumption and compute loads would be to not waste those watt hours publishing articles stating that a 50% power reduction isn’t enough. think of all the power saved by not having to route the packets around the country or world just to inform people that the leading chip maker cut power by 50% but it’s not enough. speaking of… thin clients (sorry… cloud computers) aren’t necessarily more efficient when you count the 24/7 power draw of the network they need to connect to.

  3. Over half of my career was spent writing compilers and interpreters for a variety of machines and languages from Cobol and RPG to C/C++ and Java. One machine ran C rather slowly compared to how well it did on the older 3rd generation languages. Upon looking deeper I found one reason was that it was able to perform data moves from one location to another very quickly for moves where the string length was known at compile time or even execution time. The data actually did not have to be relocated into the processor at all but remained in the memory subsystem. Many C/C++ programs operate on null-terminated strings of unknown length, and the data movement for such strings goes through registers one byte at a time, although some processors have special instructions to deal with this issue. Thus not all of the problem is in the machine design; some is found in modern programming language design. All my laptop and desktop machines run 64-bit Wintel operating systems, but unfortunately most of the programs I run are 32-bit. With more general registers available in 64-bit mode, once the app providers switch to compiled 64-bit code, that code will be more efficient and make less use of memory.

    I strongly agree that we need an architecture and programming languages that make efficient use of said architecture to handle the amounts of data that are being processed. At the very least memory to memory moves, compares… should occur down in the memory subsystem but possibly even more radical change is needed.
    Dave W

    1. Stacey Higginbotham gingoro Monday, May 16, 2011

      Sounds like I need to investigate innovation on the programming side as well.

  4. sevenOdouble Monday, May 16, 2011

    Take a look at the POET technology from Opel Solar International Inc, more specifically the ODIS subsidiary!!

  5. Definitely this problem can and should be solved on both the hardware and software fronts. It is a huge improvement in efficiency when the hardware can cut power by 50% (like Intel’s new process), but it will be even more impressive when we can improve software efficiency on the order of 10x. For example, a program written in Basic may take 30 seconds to run, but the same program written in assembly language may only take 0.5 second to complete. At a similar power draw, that cuts the energy used per task by a factor of 60. I am not suggesting that everyone program in assembly, since that would take 10x the amount of time to write each program. But there must be *something* that can be done to improve the efficiency of compilers to save power. I have several computers at home, each running a different version of Windows, ranging from Windows XP to Windows Vista. What I have found is that the Windows XP system only takes about 20 seconds to boot up, while Windows Vista takes about 3 minutes to do the same. I do not know what the Vista operating system is doing during those 3 minutes, but as far as I am concerned, both operating systems allow me to get online to check email and watch YouTube videos, so both operating systems get the job done. From my point of view, the Vista system becomes a total waste of electrical energy. This makes me believe that the Vista system has huge room for improvement. Now based on this same idea, there must be millions of other application programs out there that can be improved / accelerated in the same way. These improvements can easily result in 10x gains in power savings. A new processor architecture combining RISC & CISC plus adding memory inside the CPU/GPU will definitely help with speed and power efficiency, and this may result in another 20% to 50% gain in power saving, but more efficient software code can improve things by an order of magnitude or more. And guess what? We need to do both to handle the hundreds of exabytes of data floating in the Cloud in the coming years.
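The run-time figures in the comment above can be turned into energy-per-task numbers with a quick sketch. It assumes energy is simply power multiplied by time and that both versions of a task draw the same power; the 100-watt figure and the helper name `energy_joules` are hypothetical and only serve to make the arithmetic concrete.

```python
import math


def energy_joules(power_watts, seconds):
    """Energy used by a task that draws `power_watts` for `seconds`."""
    if power_watts < 0 or seconds < 0:
        raise ValueError("power and time must be non-negative")
    return power_watts * seconds


# Hypothetical constant power draw; the comment gives no wattage.
ASSUMED_POWER_WATTS = 100.0

# Run times from the comment: Basic vs. assembly, and Vista vs. XP boot.
basic_energy = energy_joules(ASSUMED_POWER_WATTS, 30.0)
assembly_energy = energy_joules(ASSUMED_POWER_WATTS, 0.5)
vista_boot_energy = energy_joules(ASSUMED_POWER_WATTS, 3 * 60.0)
xp_boot_energy = energy_joules(ASSUMED_POWER_WATTS, 20.0)

program_ratio = basic_energy / assembly_energy
boot_ratio = vista_boot_energy / xp_boot_energy

print(f"Basic vs. assembly: {program_ratio:.0f}x energy per run "
      f"({math.log10(program_ratio):.2f} orders of magnitude)")
print(f"Vista vs. XP boot:  {boot_ratio:.0f}x energy per boot "
      f"({math.log10(boot_ratio):.2f} orders of magnitude)")
```

Because the assumed wattage cancels out of both ratios, the result depends only on the run times, as long as the equal-power assumption holds.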

  6. carpetbomberz Saturday, May 28, 2011

    In terms of optimizing CPUs for their respective workloads, I always thought and secretly hoped Field Programmable Gate Arrays (Xilinx for instance) might work. But it’s a nontrivial task to synthesize a computer core, much less try to do it on a workload that changes constantly. My feeling was that so-called evolutionary algorithms could look at the workload and re-optimize the FPGA’s routing and placement until over time you had a really good processor architecture for workload X. I know AMD once had an auxiliary socket on their motherboards for a co-processor which could have been an FPGA. But it was definitely a roll-your-own type of solution and not geared towards commodity servers. If some combo of a self-optimizing algorithm and a reconfigurable processor could be put together and a ‘universe’ built up around them, maybe that’s another way forward. As always it’s hard to adapt these technologies easily to practical, data center type work. Research is one thing, production another.