
As semiconductors try to get faster without breaking the laws of physics (not that researchers aren’t trying that, too), multicore processors have become all the rage. Quad-core chips are commonplace in servers nowadays, and six-core chips have been launched this year. But after a certain point, adding more processor cores doesn’t improve performance on certain problems, because there isn’t enough memory and bandwidth in the right places on the chip.

This is becoming an issue in the supercomputing world right now, and it’s even rearing its head in multicore embedded chips for devices from base stations to routers. It’s a problem that will meander down the computing food chain over the next few years to affect servers and even smartphones, possibly changing the way chips are designed.

The issue is that, while some tasks can be broken up so each core solves part of the problem in parallel, other computing problems require each core to pull in more information before it can do its share of the work. That information sits in memory that may not even be located on the chip.

It’s kind of like a mom dealing with the demands of half a dozen kids; it’s hard to process them and even harder to fulfill them thanks to physical limits, like only having one pair of hands. And unlike kids, chips can’t shout louder to have their information heard; they still have to go through a defined path on the silicon to get from the memory to the cores. As more cores try to request more information, the memory on the chip isn’t enough for all the data, and the channels of communication between the cores and the memory become bottlenecks.
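
To make the bottleneck concrete, here is a minimal Go sketch (my own illustration, not from the article; the slice size and worker counts are arbitrary). Every worker streams through the same large slice, so the work is memory-bound: past the point where the memory channels saturate, doubling the workers stops shortening the timings.

```go
// bandwidth.go: each worker sums a disjoint chunk of one large slice.
// The arithmetic is trivial; the cost is moving bytes from memory,
// so throughput is capped by memory bandwidth, not by core count.
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

func main() {
	data := make([]int64, 1<<25) // ~256 MB, far larger than any cache
	for i := range data {
		data[i] = int64(i)
	}

	for workers := 1; workers <= runtime.NumCPU(); workers *= 2 {
		start := time.Now()
		var wg sync.WaitGroup
		chunk := len(data) / workers
		for w := 0; w < workers; w++ {
			wg.Add(1)
			go func(lo, hi int) {
				defer wg.Done()
				var sum int64
				for _, v := range data[lo:hi] {
					sum += v // one memory read per add
				}
				_ = sum
			}(w*chunk, (w+1)*chunk)
		}
		wg.Wait()
		fmt.Printf("%2d workers: %v\n", workers, time.Since(start))
	}
}
```

On a typical machine the early doublings help and the later ones flatten out, which is exactly the gap between core count and usable bandwidth described above.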

Daniel Reed, Microsoft’s scalable and multicore computing strategist, calls this a hidden problem that’s just as big as the challenges of developing code that optimizes multicore chips.

Chip firms are aware of the issue. Intel’s Nehalem processor for servers adds more memory on the chip for the multiple cores and tries to improve communications on the chip. Firms such as Texas Instruments (TXN) have tweaked the designs of their ARM-based chips for cell phones to address the issue as well. Freescale has created a “fabric” inside some of its multicore embedded chips so they can share information more efficiently across a variety of cores.

But it’s possible that a straight redesign on the processor side is what’s needed. SiCortex, which makes a specially designed chip for the high-performance computing market, questions whether merely adding more memory, as Intel is doing, is the way to solve the issue. Its solution is closer to creating a communication fabric inside the device that scales as cores are added.

The contention is that adding more memory is kind of like bringing Dad in to help handle the kids’ multiple demands: it doesn’t scale if you keep adding more kids. As chipmakers look for alternative ways to handle this, startups could lead the way to a solution, whether MetaRAM, which is pioneering dense memory; Acumem, which has a tool to spot the bottlenecks that slow an application down; or those designing their own chips, such as SiCortex.
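
One loose way to see the scaling argument is the toy Go simulation below. It is an analogy, not SiCortex’s actual design, and the 10-microsecond service cost is invented: a single shared request channel behaves like one memory path that every core queues behind, while one channel and controller per core plays the role of the fabric, so service capacity grows with the core count.

```go
// fabric.go: compares serialized access through one "bus" with
// per-core channels, a stand-in for a scalable on-chip fabric.
package main

import (
	"fmt"
	"sync"
	"time"
)

const requestsPerCore = 1000

// serve drains one request channel, paying a fixed cost per access.
func serve(reqs <-chan int, done *sync.WaitGroup) {
	defer done.Done()
	for range reqs {
		time.Sleep(10 * time.Microsecond)
	}
}

// sharedBus funnels every core's requests through a single channel.
func sharedBus(cores int) time.Duration {
	start := time.Now()
	reqs := make(chan int)
	var served, sent sync.WaitGroup
	served.Add(1)
	go serve(reqs, &served) // one controller for everyone
	for c := 0; c < cores; c++ {
		sent.Add(1)
		go func() {
			defer sent.Done()
			for i := 0; i < requestsPerCore; i++ {
				reqs <- i
			}
		}()
	}
	sent.Wait()
	close(reqs)
	served.Wait()
	return time.Since(start)
}

// fabric gives each core its own channel and controller.
func fabric(cores int) time.Duration {
	start := time.Now()
	var served sync.WaitGroup
	for c := 0; c < cores; c++ {
		reqs := make(chan int)
		served.Add(1)
		go serve(reqs, &served)
		go func(ch chan<- int) {
			for i := 0; i < requestsPerCore; i++ {
				ch <- i
			}
			close(ch)
		}(reqs)
	}
	served.Wait()
	return time.Since(start)
}

func main() {
	for _, cores := range []int{1, 2, 4, 8} {
		fmt.Printf("%d cores  bus: %-12v fabric: %v\n",
			cores, sharedBus(cores), fabric(cores))
	}
}
```

The bus time grows roughly linearly with core count while the fabric time stays flat, which is the property that matters once core counts keep climbing.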

  1. More cores need better threading models that do not require programmers to think too much about semaphores and pending / unpending processes. We are WAY behind the curve in native code compilation for anything beyond four cores.

    This is a far more pressing issue than inter-core communication, which can be readily addressed with on-chip caches and registers.
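
    Below is a minimal sketch of such a model, using Go channels purely as a stand-in for what a better compiler or runtime might give native code (the worker count and workload are invented): work and results flow through channels, the runtime does the scheduling, and no semaphore or hand-rolled pend/unpend appears in user code.

    ```go
    // pipeline.go: a worker pool with no explicit locking in user code.
    package main

    import (
    	"fmt"
    	"sync"
    )

    func main() {
    	jobs := make(chan int)
    	results := make(chan int)

    	var wg sync.WaitGroup
    	for w := 0; w < 8; w++ { // one worker per assumed core
    		wg.Add(1)
    		go func() {
    			defer wg.Done()
    			for j := range jobs {
    				results <- j * j // stand-in for real work
    			}
    		}()
    	}
    	go func() { wg.Wait(); close(results) }() // close results once workers finish

    	go func() {
    		for j := 0; j < 100; j++ {
    			jobs <- j
    		}
    		close(jobs)
    	}()

    	sum := 0
    	for r := range results {
    		sum += r
    	}
    	fmt.Println("sum of squares 0..99 =", sum) // 328350
    }
    ```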

  2. Well put, Alan. Programming has always been the biggest Achilles’ heel of multiprocessors. SMP has been around for decades and decades. Countless papers have been written about memory bandwidth and its closely related cousin, inter-processor communications, for years and years. These problems are very well understood. What has not progressed as quickly is software technology. We have a hard enough time getting scantily trained programmers to write simple event loops for Windows. Something is going to have to be put in place that allows these non-academic coders to write code quickly without having to worry about cache coherence, mutual exclusion, and memory and I/O bandwidth. Once that is in place, the smart people can go off and optimize it. But I don’t see the programming problem going away anytime soon.

  3. In the days of the LSI-11 (a PDP-11 on a board) we had SMP VME cubes running multithreaded Pascal: 10 CPUs for a lab real-time chromatography system. I wrote the FORTH executive with native OS threads, pending and unpending processes.

    I said to one of the geniuses: “You guys can’t see the big product picture like I can, but I can’t see the multithreading landmines.” So one real genius took my code, which was in production and just about real-world bulletproof, read the listing, and ran coverage analysis. He came to me:

    “This module deadlocks under these rare circumstances.” Hmm… I hadn’t seen that; it had never happened. I took it up with the founders: “My production code has a rare potential SMP deadlock that was not foreseeable at design time; it will cost x man-hours to modify.”

    “Shit, no way,” said the boss. “It will never come to pass.” “OK,” I said. “Your company.”

    The system deadlocked and returned bad production data the next day for a major client. In my defense, I was 25 at the time and new to threaded applications, and my code boundaries were interdependent with the peripheral group, the geniuses who were CS and chem grads.

    Beware the threads. More cores bring diminishing returns until the compilers and JITs can take whatever we throw at them and make the best of, say, 8-24 cores.
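
    For anyone who has not stepped on one of these landmines, here is a minimal, hypothetical reconstruction in Go of the bug described above (the lock names are invented): two threads take the same two locks in opposite orders, and under a rare interleaving each waits on the other forever.

    ```go
    // deadlock.go: the classic lock-ordering trap. Run it enough times and
    // the Go runtime aborts with "all goroutines are asleep - deadlock!".
    package main

    import "sync"

    var instrument, logbook sync.Mutex

    func worker1() {
    	instrument.Lock()
    	logbook.Lock() // blocks if worker2 already holds logbook
    	logbook.Unlock()
    	instrument.Unlock()
    }

    func worker2() {
    	logbook.Lock()
    	instrument.Lock() // blocks if worker1 already holds instrument
    	instrument.Unlock()
    	logbook.Unlock()
    }

    func main() {
    	for i := 0; i < 100000; i++ { // hammer until the rare interleaving hits
    		var wg sync.WaitGroup
    		wg.Add(2)
    		go func() { defer wg.Done(); worker1() }()
    		go func() { defer wg.Done(); worker2() }()
    		wg.Wait()
    	}
    	// The fix is boring and absolute: always acquire locks in one global order.
    }
    ```

    It almost never fires under test, which is exactly why it survives into production until the worst possible day.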

  4. The IO-memory-CPU bottleneck was exactly why Map/Reduce got so popular. After a point, it doesn’t matter how fast your CPU is; you just can’t ship that much data to the processor. So instead of worrying about shipping the data faster on better hardware (where cost grows exponentially with performance), the solution is to chunk up the data and send it to N cheap processors, achieving near-linear scaling. While this is a great way to build a dotcom company, it’s a huge blow to hardware research, since people are taking a step backward toward commodity hardware.

    Glad to see some steps in the right direction here.
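
    A toy illustration of the pattern in Go (the shards and words are invented): chunk the data, map each chunk on its own cheap worker in parallel, then reduce the partial results, so no single processor ever has to see all the data at once.

    ```go
    // mapreduce.go: word count in the map/reduce style.
    package main

    import (
    	"fmt"
    	"strings"
    )

    // mapPhase counts words in one shard and emits a partial result.
    func mapPhase(shard string, out chan<- map[string]int) {
    	counts := map[string]int{}
    	for _, w := range strings.Fields(shard) {
    		counts[w]++
    	}
    	out <- counts
    }

    func main() {
    	shards := []string{ // stand-ins for chunks of a much larger corpus
    		"the quick brown fox",
    		"the lazy dog and the quick cat",
    		"brown dog quick fox",
    	}

    	partials := make(chan map[string]int, len(shards))
    	for _, s := range shards {
    		go mapPhase(s, partials) // map: one shard per worker
    	}

    	total := map[string]int{} // reduce: merge the partial counts
    	for range shards {
    		for w, n := range <-partials {
    			total[w] += n
    		}
    	}
    	fmt.Println(total)
    }
    ```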

  5. [...] Intel processors, which have a technology used to speed up access to memory at the chip level as a way to reduce computation bottlenecks associated with multicore [...]

  6. [...] more information to process from the memory. Under Intel’s previous architecture, they had to access that memory outside of the chip — and do it one by one. Intel has addressed this by including integrated memory on the latest [...]

  7. [...] shrinking the environmental footprint of computing. We’re taxing current technology, from the silicon to the software, with massive data demands, and technology like this is a piece of the solution, [...]

  8. james braselton Sunday, May 31, 2009

    Hi there. Well, Apple has a Mac Pro with an 8-core CPU, 32 gigabytes of RAM, and up to 4 terabytes of storage, or optional 10,000 and 15,000 RPM hard drives for faster access.

  9. Good work, thanks from Turkey.

