Monday, July 21, 2014

Bottlenecks: DRAM & Moore's Law

In addition to Moore's Law slowing, there are bottlenecks between memory devices and the CPU.




The article below asks:
"How long will it take to find a technology so fundamentally different from and better than anything we have today that we can do away with the DRAM latency and power consumption bottleneck?"

There is a need for -
"A new RAM technology that cut main memory accesses by an order of magnitude would be reason enough to reevaluate the entire balance of resources on a microprocessor. If accessing main memory was as fast as accessing the CPUs cache, you might not need cache on the CPU die or package at all — or at least, you wouldn't need anything beyond L1 and maybe a small L2."

More about the Moore's Law bottleneck in this March 2012 post: Moore's Law End? (Next semiconductors gen. cost $10 billion)

Ron
Insightful, timely, and accurate semiconductor consulting.
Semiconductor information and news at - 
http://www.maltiel-consulting.com/




DRAM is pretty amazing stuff. The basic structure of the RAM we still use today was invented more than forty years ago and, just like its CPU cousin, it has continually benefited from huge improvements in fabrication technology and density. Less than ten years ago, 2GB of RAM was considered plenty for a typical desktop system — today, a high-end smartphone offers the same amount of memory at a fifth of the power consumption.
After decades of scaling, however, modern DRAM is starting to hit a brick wall. Much as the CPU gigahertz race ran out of steam, the high latency and power consumption of DRAM have become among the most significant bottlenecks in modern computing. As supercomputers move towards exascale, there are serious doubts about whether DRAM is actually up to the task, or whether a whole new memory technology is required. Clearly there are some profound challenges ahead — and there's disagreement about how to meet them.

What’s really wrong with DRAM?

A few days ago, Vice ran an article that actually does a pretty good job of talking about potential advances in the memory market, but includes a graph I think is fundamentally misleading. That’s not to sling mud at Vice — do a quick Google search, and you’ll find this picture has plenty of company:
DRAM scaling
The point of this image is ostensibly to demonstrate how DRAM performance has grown at a much slower rate than CPU performance, thereby creating an unbridgeable gap between the two systems. The problem is, this graph no longer properly illustrates CPU performance or the relationship between it and memory. Moore's law has stopped functioning at anything like its historic level for CPUs or DRAM, and "memory performance" is simply too vague to accurately describe the problem.
The first thing to understand is that modern systems have vastly improved the bandwidth-per-core ratio compared to where we sat 14 years ago. In 2000, a fast P3 or Athlon system had a 64-bit memory bus connected to an off-die memory controller clocked at 133MHz. Peak bandwidth was 1.06GB/s while CPU clocks were hitting 1GHz. Today, a modern processor from AMD or Intel is clocked at 3-4GHz, while modern RAM runs at 1066MHz (2133MHz effective for DDR3) — roughly 17GB/sec peak per 64-bit channel. Meanwhile, we've long since added multiple memory channels, brought the memory controller on die, and clocked it at full CPU speed as well.
DDR memory data rates
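For a quick sanity check on those bandwidth figures, here is a minimal sketch of the peak-bandwidth arithmetic, using the bus widths and transfer rates quoted above (the dual-channel case is illustrative, not a figure from the article):

```python
# Peak theoretical bandwidth for the memory configurations described above.
# peak bytes/s = transfers per second * bus width in bytes * channels

def peak_bandwidth_gbs(transfers_mts, bus_width_bits=64, channels=1):
    """Theoretical peak bandwidth in GB/s for a DDR-style memory bus."""
    return transfers_mts * 1e6 * (bus_width_bits / 8) * channels / 1e9

# Circa-2000 system: single 64-bit channel at 133 MT/s.
print(peak_bandwidth_gbs(133))                # ~1.06 GB/s
# DDR3-2133: 1066MHz clock, 2133 MT/s effective, one 64-bit channel.
print(peak_bandwidth_gbs(2133))               # ~17 GB/s per channel
# Dual-channel DDR3-2133 (channel count illustrative).
print(peak_bandwidth_gbs(2133, channels=2))   # ~34 GB/s
```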
The problem isn't memory bandwidth — it's memory latency and memory power consumption. As we've previously discussed, DDR4 actually moves the dial backwards as far as the former is concerned, while improving the latter only modestly. It now looks as though the first generation of DDR4 will have some profoundly terrible latency characteristics; Micron is selling DDR4-2133 timed at 15-15-15-50. For comparison, DDR3-2133 can be bought at 11-11-11-27 — and that's not even the highest-end premium RAM. This latency hit means DDR4 won't actually match DDR3's performance for quite some time, as shown here:

This is where the original graph does have a point — latency has only improved modestly over the years, and we’ll be using DDR4-3200 before we get back to DDR3-1600 latencies. That’s an obvious issue — but it’s actually not the problem that’s holding exascale back. The problem for exascale is that DRAM power consumption is currently much too high for an exascale system.
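Putting numbers on that latency comparison: CAS timings can be converted to absolute nanoseconds. Below is a minimal sketch using the Micron DDR4-2133 and DDR3-2133 timings above; the DDR3-1600 CL11 and DDR4-3200 CL22 values are assumed typical timings, not figures from the article:

```python
# Convert CAS timings to absolute first-word latency in nanoseconds.
# One DDR clock period is 2000 / data_rate nanoseconds (two transfers per clock),
# so latency_ns = 2000 * CAS_cycles / data_rate_in_MT_per_s.

def cas_latency_ns(cas_cycles, data_rate_mts):
    return 2000.0 * cas_cycles / data_rate_mts

print(cas_latency_ns(11, 2133))   # DDR3-2133 CL11            -> ~10.3 ns
print(cas_latency_ns(15, 2133))   # DDR4-2133 CL15 (Micron)   -> ~14.1 ns
print(cas_latency_ns(11, 1600))   # DDR3-1600 CL11 (assumed)  -> ~13.8 ns
print(cas_latency_ns(22, 3200))   # DDR4-3200 CL22 (assumed)  -> ~13.8 ns
```

With those assumed timings, DDR4-3200 only just catches up to DDR3-1600 in absolute latency, which is exactly the point made above.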
The current goal is to build an exascale supercomputer within a 20MW power envelope, sometime between 2018 and 2020. Exascale describes a system with at least an exaflop of processing power and perhaps hundreds of petabytes of RAM (current systems max out at around 30 petaflops and only a couple of petabytes of RAM). If today's best DDR3 were used for the first exascale systems, the DRAM alone would consume 54MW of power. Clearly massive improvements are needed. So how do we find them?

Reinvent the wheel — or iterate like crazy

There are two ways to attack this problem, and they both have their proponents. One method is to keep building on the existing approaches that have given us DDR4 and the Hybrid Memory Cube. It's reasonably likely that we can squeeze a great deal of additional improvement out of the basic DRAM structure by stacking dies, further optimizing trace layouts, using through-silicon vias (TSVs), and adopting 3D designs. According to a recent research paper, this could cut the RAM power consumption of a supercomputer with 100 petabytes of memory from 52MW (assuming standard DDR3-1333) to well below 10MW, depending on the precise details of the technology.
While 100PB is just one tenth of the way to exascale, reducing the RAM's power consumption by an order of magnitude is unquestionably on the right track.
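As a rough cross-check on those numbers, here is a minimal sketch of the power arithmetic; the per-gigabyte power density is my own assumption, chosen to roughly reproduce the 52MW baseline, not a figure from the paper:

```python
# Back-of-the-envelope DRAM power for a machine with 100 PB of main memory.
# ASSUMPTION: ~0.5 W per GB of active DDR3, a round figure chosen to land
# near the 52 MW baseline quoted above, not a value taken from the paper.

GB_PER_PB = 1e6              # decimal units: 1 PB = one million GB
WATTS_PER_GB_DDR3 = 0.5      # assumed active power density for standard DDR3

capacity_pb = 100
dram_power_mw = capacity_pb * GB_PER_PB * WATTS_PER_GB_DDR3 / 1e6   # watts -> MW
print(f"DRAM power at DDR3 densities: ~{dram_power_mw:.0f} MW")     # ~50 MW vs. a 20 MW total budget

# An order-of-magnitude improvement in per-bit power brings this under 10 MW,
# which is what the stacking/TSV approaches described above are aiming for.
print(f"With 10x better DRAM: ~{dram_power_mw / 10:.0f} MW")
```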
DRAM types
The other, more profound challenge is finding a complete DRAM replacement. You may have noticed that while we cover new approaches and alternatives to conventional storage technologies, virtually all the proposed methods address the shortcomings of NAND storage — not DRAM. There's a good reason for that — DRAM has survived more than 40 years precisely because it's been very, very hard to beat.
The argument for reinventing the wheel is anchored in concepts like memristors, MRAM, FeRAM, and a host of other potential next-generation technologies. Some of them have the potential to replace DRAM altogether, while others, like phase change memory, would be used as a further buffer between DRAM and NAND. The big-picture fact that Vice does get right is that discovering a new memory technology that was faster and lower power than DRAM really would change the fundamental nature of computing — over time.
It's easy to forget that the trends we're talking about today have been true for decades. Eleven years ago, computer scientist David Patterson presented a paper entitled Latency Lags Bandwidth, in which he measured improvements in bandwidth against improvements in latency across CPUs, DRAM, LAN, and hard drives (SSDs weren't a thing at that time). What he found is summarized below:
Latency lags bandwidth
In every case — and in a remarkably consistent fashion — latency improved by only 20-30% in the time it took bandwidth to double. This is a problem we've been dealing with for decades — it's been addressed via branch prediction, instruction sets, and ever-expanding caches. It's been observed that we add one layer of cache roughly every 10 years, and we're on track to keep that pace with Intel's 128MB eDRAM cache on certain Haswell processors.
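That ratio compounds quickly. A minimal sketch of the arithmetic, taking 25% as the midpoint of the range Patterson measured:

```python
# If latency improves ~25% each time bandwidth doubles (the midpoint of the
# 20-30% range above), the gap between the two compounds rapidly.
LATENCY_GAIN_PER_DOUBLING = 1.25

for doublings in (1, 5, 10):
    bandwidth_gain = 2 ** doublings
    latency_gain = LATENCY_GAIN_PER_DOUBLING ** doublings
    print(f"{doublings:2d} bandwidth doublings: bandwidth x{bandwidth_gain}, latency only x{latency_gain:.1f}")
# After 10 doublings, bandwidth is ~1000x better while latency is only ~9x better.
```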
A new main memory with even half the latency of standard DRAM would give programmers an opportunity to revisit decades of assumptions about how microprocessors should be built. A new RAM technology that cut main memory access times by an order of magnitude would be reason enough to reevaluate the entire balance of resources on a microprocessor. If accessing main memory was as fast as accessing the CPU's cache, you might not need cache on the CPU die or package at all — or at least, you wouldn't need anything beyond L1 and maybe a small L2.
How long will it take to find a technology so fundamentally different from and better than anything we have today that we can do away with the DRAM latency and power consumption bottleneck? Given that such a fundamental breakthrough would be vital to our ability to reach exascale computing and beyond, I hope it's soon.
