The Nehalem Preview: Intel Does It Again
by Anand Lal Shimpi on June 5, 2008 12:05 AM EST- Posted in
- CPUs
A Quick Path to Memory
Our investigation begins with the most visibly changed part of Nehalem's architecture: the memory subsystem. Nehalem implements a very Phenom-like memory hierarchy consisting of small, fast individual L1 and L2 caches for each of its four cores and then a single, larger shared L3 cache feeding the entire chip.
Nehalem's L1 cache, despite being seemingly unchanged from Penryn, does grow in latency; it now takes 4 cycles to access vs. 3. The L2 cache is now only 256KB per core instead of being 24x the size in Penryn and thus can be accessed in only 11 cycles down from 15 (Penryn added an additional clock cycle over Conroe to access L2).
CPU / CPU-Z Latency | L1 Cache | L2 Cache | L3 Cache |
Nehalem (2.66GHz) | 4 cycles | 11 cycles | 39 cycles |
Core 2 Quad Q9450 - Penryn - (2.66GHz) | 3 cycles | 15 cycles | N/A |
The L3 cache is quite possibly the most impressive, requiring only 39 cycles to access at 2.66GHz. The L3 cache is a very large 8MB cache, 4x the size of Phenom's L3, yet it can be accessed much faster. In our testing we found that Phenom's L3 cache takes a similar 43 cycles to access but at much lower clock speeds (2.0GHz). If we put these numbers into relative terms it takes 21.5 ns to get a request back from Phenom's L3 vs. 14.6 ns with Nehalem's - that's nearly 50% longer in Phenom.
While Intel did a lot of tinkering with Nehalem's caches, the inclusion of a multi-channel on-die DDR3 memory controller was the most apparent change. AMD has been using an integrated memory controller (IMC) since 2003 on its K8 based microprocessors and for years Intel has resisted doing the same, citing complexities in choosing what memory to support among other reasons for why it didn't follow in AMD's footsteps.
With clock speeds increasing and up to 8 cores (including GPUs) making their way into Nehalem based CPUs in the coming year, the time to narrow the memory gap is upon us. You can already tell that Nehalem was designed to mask the distance between the individual CPU cores and main memory with its cache design, and the IMC is a further extension of the philosophy.
The motherboard implementation of our 2.66GHz system needed some work so our memory bandwidth/latency numbers on it were way off (slower than Core 2), luckily we had another platform at our disposal running at 2.93GHz which was working perfectly. We turned to Everest Ultimate 4.50 to give us memory bandwidth and latency numbers from Nehalem.
Note that these figures are from a completely untuned motherboard and are using DDR3-1066 (dual-channel on the Core 2 system and triple-channel on the Nehalem system):
CPU / Everest Ultimate 4.50 | Memory Read | Memory Write | Memory Copy | Memory Latency |
Nehalem (2.93GHz) | 13.1 GB/s | 12.7 GB/s | 12.0 GB/s | 46.9 ns |
Core 2 Extreme QX9650 - Penryn - (3.00GHz) | 7.6 GB/s | 7.1 GB/s | 6.9 GB/s | 66.7 ns |
Memory accesses on Conroe/Penryn were quick due to Intel's very aggressive prefetchers, memory accesses on Nehalem are just plain fast. Nehalem takes a little over 2/3 the time to complete a memory request as Penryn, and although we didn't have time to run comparable Phenom numbers I believe Nehalem's DDR3 memory controller is faster than Phenom's DDR2 controller.
Memory bandwidth is obviously greater with three DDR3 channels, Everest measured around a 70% increase in read bandwidth. While we don't have the memory bandwidth figures here, Gary measured a 10% difference in WinRAR performance (a test that's highly influenced by memory bandwidth and latency) between single-channel and triple-channel Nehalem configurations.
While we didn't really expect Intel to somehow do wrong with Nehalem's memory architecture, it's important to point out that it is very well implemented. Intel managed to change the cache structure and introduce an integrated memory controller while making both significantly faster than what AMD managed despite a four-year headstart.
In short: Nehalem can get data out of memory quick like bunnies.
108 Comments
View All Comments
kilkennycat - Thursday, June 5, 2008 - link
Isn't 6GB of RAM a pretty sweet spot for desktop 64-bit applications, whatever about servers?jimmysmitty - Thursday, June 5, 2008 - link
Well I have been waiting for Nehalem. I gave in and decided to build a rig with the Q6600 but kinda sad now.Anwways. Crank the Planet, hes not showing fanboyism. He stated Intel has been promising 20-30% increase with Nehalem. They are seeing 20-50% from these benchmarks. Take 21 and divide it by 14 that gives you 1.5. That means that the AMD Phenoms latency is about 50% slower.
If anything you are showing fanboyism. Nehalem is showing to be one hell of a chip and you are just angry that AMD has nothing to compare to it. Even after AMD finishes absorbing ATI whats next, K10.5 aka Deneb? Thats just a 45nm refresh (just like Penryn was for Conroe). Unless there are some major changes in the architecture it will just, hopefully, make Phenom run at higher clocks and cooler.
Other than that I can't wait to see what this does for games. I know that most games are more GPU dependant but I myself play mainly Valve games using Source and thats very CPU dependant and already runs great on my Q6600 but I want to see what this game will do for their particle and physics system...
Nehemoth - Thursday, June 5, 2008 - link
Please, Please, Please Intel I would to have this monsters chip in our servers without the annoying FBD, I don't want hoty FBD bring me normal DDR2 (without FBD) or DDR3.Just what I ask.
Griswold - Thursday, June 5, 2008 - link
I'm a big fan of multi-core systems, but I'm not blind to reality: Why no single threaded benchmarks, but only benchmarks that scale very good with more cores/SMT? By the time these things will be on the market, most applikations will still be single threaded and you know it...I just want to know how much faster it is per clock per core.
Anand Lal Shimpi - Thursday, June 5, 2008 - link
Interestingly enough, none of our standard CPU benchmarks are single threaded at all - even the most benign ones are multithreaded (including the games). I did run some single thread Cinebench numbers though:Nehalem - 3015
Q9450 - 2396
bradley - Thursday, June 5, 2008 - link
Why is there such a large discrepancy between previous single-threaded Cinebench tests from six months ago: where the Q9450 scored a 2944, or a mere 2.4% decrease, compared to the current 2396, or a more substantial 20.5% decrease.http://www.anandtech.com/printarticle.aspx?i=3153">http://www.anandtech.com/printarticle.aspx?i=3153
I too believe single-threaded benches should be the foundation of any meaningful and relevant cpu review, if time indeed was permitting. To me this is the greatest objective real-world equalizer. There just isn't enough multi-threaded software out there, much less software able to run all eight cores. I would also like to emphasize that unlike server chips, desktop Nehalems will only have two memory channels. And as I understand, hyper threading also will only make an appearance in server and enthusiast chipsets. So already this makes an accurate comparison difficult enough.
Finally, I understand the avg visitor will treat this like any good entertainment, where one is meant to suspend his-her disbelief. Still I have a hard time believing anyone has the ability to abscond away such important chips from a huge corporation like Intel. "Without Intel's approval, supervision, blessing or even desire - we went ahead and snagged us a Nehalem (actually, two) and spent some time with them." That initial premise does make anything coming after less impactful, or seemingly less than straightforward.
Certainly if history has taught us anything, we know final shipping silicon is sometimes quite different from test chips. We should also assume it's a lot easier to create ond one chip than manufacture hundreds of thousands on a large scale. Nothing is ever a given, which makes it hard to draw much of a conclusion. Interesting preview nonetheless.
SiliconDoc - Monday, July 28, 2008 - link
Shhhhh... gosh we have to have core hype ... and the multicore testers have to optimize for the coming chips... geeze they have to make a living somehow...( You sir, are exactly correct, but we live in a strange world nowadays where the truth is so evident it must be hidden most of the time for various other reasons... )
Gosh, you want to crash the whole economy with that sane and rational talk ?
What are you an anarchist ? ( yes I'm kidding, that was a big high five to you)
Anand Lal Shimpi - Thursday, June 5, 2008 - link
Ignore those numbers (check page 6 of the comments for an explanation), the Q9450 comes in at 2931 vs. Nehalem's 3015.-A
pnyffeler - Thursday, June 5, 2008 - link
I'm not a Mac person, but I think Mac's may benefit from this technology even more than Vista. As I recall from a previous Anandtech article, Mac's have an excellent memory management system, which very direct benefit in increasing memory size. The increased bandwidth could make the snazzy OS even better...Visual - Thursday, June 5, 2008 - link
It is great that your "clock for clock" comparisons to the penryn in encoding and rendering are showing an improvement... but could that improvement be from the doubled amount of virtual processors that are visible? Are all of these benchmarks using eight or four threads on the nehalem?