Despite being extremely well prepared, with Nehalem, motherboards, coolers and memory in hand well before launch, the run-up to the NDA lift of Intel's Core i7 processors was stressful. There was so much to test: multi-GPU compatibility with X58, memory controller performance, general application performance, overclocking, Hyper-Threading and more.
We're all still hard at work on sorting out the details. Gary is working on an X58 motherboard roundup and has been testing 12GB memory configurations for the past several days (as well as working with board vendors to improve performance/compatibility with 12GB, but I'll let him tell you about that), Derek is working on multi-GPU performance and Kris has been working on an overclocking guide. What have I been up to? Well, I've been trying to answer a few lingering questions about Nehalem.
What I've got today are the first results of the questions I've been asking. I've spent the past week looking at power efficiency and memory latency, and talking to some of Intel's finest on the phone about Nehalem. And I'm back to report, so gather 'round for Nehalem: The Unwritten Chapters.
The Uncore
I got a little more detail from Intel on the uncore clock. Just like Phenom, Intel’s Core i7 is divided into an area called the “core” and an area called the “uncore”. The core contains the individual processor cores and their L1/L2 caches, while the uncore houses the memory controller and the shared L3 cache. In our review I mentioned that the uncore runs at 2.66GHz, which is true, but only for the Core i7-965. The Core i7-940 and 920 both run the uncore at 2.13GHz.
The uncore clock is defined by Intel just like the core clock is - Intel sets it based on yield and performance targets. As I mentioned in the launch review, the uncore clock runs at a simple multiplier of the bclk (133MHz): 20x for the i7-965 and 16x for the i7-940/920. The uncore also runs at its own voltage (1.20V) and that voltage doesn't scale up/down.
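To make the multiplier arithmetic concrete, here is a minimal sketch that reproduces the two stock uncore clocks from the bclk and multipliers quoted above. The function and constant names are made up for illustration, and the 133MHz bclk is the rounded figure used in this article.

```python
# Minimal sketch: uncore clock = multiplier x bclk.
# BCLK_MHZ and uncore_clock_ghz are illustrative names, not Intel terminology.
BCLK_MHZ = 133  # base clock as quoted in the article (rounded)


def uncore_clock_ghz(multiplier, bclk_mhz=BCLK_MHZ):
    """Return the uncore clock in GHz for a given bclk multiplier."""
    return multiplier * bclk_mhz / 1000


# Stock uncore multipliers quoted in the article.
STOCK_MULTIPLIERS = [("i7-965", 20), ("i7-940/920", 16)]

for name, mult in STOCK_MULTIPLIERS:
    print(f"{name}: {mult} x {BCLK_MHZ}MHz = {uncore_clock_ghz(mult):.2f}GHz")
```

By the same arithmetic, the 2.93GHz and 3.2GHz uncore settings tried later would correspond to 22x and 24x, though that is an inference from the numbers rather than something Intel stated.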
On Intel’s own X58 board the uncore clock is configured on the memory settings page and is simply called UCLK:
I took the i7-965, ran it at 2.66GHz to simulate an i7-920, and varied the uncore clock to measure the impact on L3 cache and memory latency:
Core Clock | Uncore Clock | L3 Latency | Main Memory Latency | x264 HD Benchmark | Cinebench XCPU Benchmark |
2.66GHz | 2.93GHz | 34 cycles | 143 cycles | 72.8 fps | 13456 |
2.66GHz | 2.66GHz | 36 cycles | 148 cycles | 73.0 fps | 13429 |
2.66GHz | 2.13GHz | 41 cycles | 159 cycles | 72.7 fps | 13182 |
At a 2.66GHz uncore clock things seem to hit a sweet spot, although the translation to real-world performance just isn't there. Perhaps in a very memory intensive test we'd see something more pronounced, but even the x264 HD encoding test showed no performance difference between the three uncore clock speeds.
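For a rough sense of scale, the sketch below converts the latencies in the table to nanoseconds and compares each row against the 2.13GHz uncore setting. It assumes the cycle counts are measured in core clock cycles at 2.66GHz, which the table does not state explicitly, and all names in it are illustrative.

```python
# Rough sketch: put the table's numbers on a common footing.
# Assumption: latencies are counted in core clock cycles at 2.66GHz.
CORE_CLOCK_GHZ = 2.66

# (uncore GHz, L3 latency cycles, memory latency cycles, x264 fps, Cinebench)
RESULTS = [
    (2.93, 34, 143, 72.8, 13456),
    (2.66, 36, 148, 73.0, 13429),
    (2.13, 41, 159, 72.7, 13182),
]


def cycles_to_ns(cycles, clock_ghz=CORE_CLOCK_GHZ):
    """Convert a cycle count to nanoseconds at the given clock."""
    return cycles / clock_ghz


def percent_change(new, old):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


baseline = RESULTS[-1]  # the 2.13GHz uncore, i.e. the simulated i7-920

for uncore, l3, mem, fps, cinebench in RESULTS:
    print(
        f"uncore {uncore:.2f}GHz: "
        f"L3 {cycles_to_ns(l3):.1f}ns ({percent_change(l3, baseline[1]):+.1f}%), "
        f"memory {cycles_to_ns(mem):.1f}ns ({percent_change(mem, baseline[2]):+.1f}%), "
        f"x264 {percent_change(fps, baseline[3]):+.1f}%, "
        f"Cinebench {percent_change(cinebench, baseline[4]):+.1f}%"
    )
```

Going from a 2.13GHz to a 2.93GHz uncore cuts L3 latency by about 17% and memory latency by about 10%, yet Cinebench moves by only about 2% and x264 barely moves at all.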
Surprisingly enough, I couldn’t get the i7-965’s uncore to hit 3.2GHz - Vista would bluescreen before I could even get to the desktop (note that the Intel X58 board I was using did not support adjusting the uncore voltage, so it remained at stock). As the table above shows, increases in uncore frequency aren't nearly as useful as increasing the CPU frequency. Intel recognized this performance relationship as well and chose to optimize the uncore for power consumption, not clock speed, which means that the uncore won't be able to clock as high as the core itself. You could always increase the voltage a lot to try and boost uncore speed but right now it's not looking like the tradeoff would be worth it as you'd increase power quite a bit.
23 Comments
lemonadesoda - Wednesday, November 19, 2008 - link
Anand. Fantastic article, but:
1./ You didnt mention whether your tests were on 32bit or 64bit. We know that 32bit Core 2 is more efficient due to microcode fusion, whereas that isnt true for 64bit. On i7, opcode fusion is there on 64bit.
2./ I think you should execute a CPU HALT to observe deep down idle. This figure, say 110W, should then be SUBTRACTED from all other results. Why? Because this is essentially the mainboard/HDD/system power draw excluding the CPU. I see from your figures that the power used (as a delta from idle) on i7 is actually HIGHER than QX9770. So I actually have a very different view than you. I think x58 is much more efficient, and that internal memory controller is less power than older northbridge. But when the i7 is crunching, it is using more power AT THE CPU than the QX9770.
prodystopian - Monday, November 10, 2008 - link
While this limit is a non-issue for anyone getting a X58 motherboard, what about those looking for the e2xxx of this generation? When looking for a cheap CPU to heavily OC to get an extreme Price/performance, it would be best to pair with a cheap motherboard such as the next P series (not X). I'm assuming we don't know whether this BIOS switch will be on the P series motherboards, but if it is not, that is where the real problem occurs.
Live - Sunday, November 9, 2008 - link
I don't know if this has been answered yet but what are the advantage of the i7-965 higher QPI? Can you overclock the QPI and if so dose it make a difference?
Live - Sunday, November 9, 2008 - link
Live I think you meant to write:
I don't know if this has been answered yet, but what is the advantage of the i7-965 higher QPI? Can you overclock the QPI and if so does it make a difference?
CEO Ballmer - Saturday, November 8, 2008 - link
Made for Vista! http://fakesteveballmer.blogspot.com
Rev1 - Saturday, November 8, 2008 - link
Maybe im missing something but being that the multiplier was not unlocked how did he get it that high?
frazz - Saturday, November 8, 2008 - link
Surely CPU power at a fixed voltage is proportional to the square of the voltage, not the cube? I thought the formula was this:
Power dissipation = C.V^2.f where C is the capacitance being switched per clock cycle
frazz - Saturday, November 8, 2008 - link
Sorry I meant CPU power at a fixed FREQUENCY is proportional to the square of the voltage. D'oh.
HolyFire - Saturday, November 8, 2008 - link
I agree. This surely was a misinterpretation of Intel's slide, which actually meant: If the frequency is increased proportionally to the voltage, the power will go like voltage cubed. But for a fixed frequency, power goes like voltage squared.
In either case, I find that slide a little suspicious, as I have not yet seen any theoretical or experimental result suggesting that frequency should be linearly proportional to voltage.
ltcommanderdata - Friday, November 7, 2008 - link
Great article. It's nice to see someone do a more in depth analysis of Nehalem's characteristics rather than just printing a bunch of benchmarks.
In regards to your Hyperthreading tests, it might be interesting to isolate the causes of HT performance increases in Nehalem. HT quite often was a hindrance for Netburst and it would be interesting to see whether the cause was primarily HT's implementation in Netburst or just due to the maturity of HT compatible software at the time. It's an odd coincidence that the last processor to carry HT, besides Atom, was the Pentium Extreme Edition 965 while the first desktop processor to reintroduce HT is again numbered 965 as part of the Core i7 family.
For instance, you could try to compare the speedup that 965EE receives going from 2 to 4 threads against the i7-965 doing the same. It would also be interesting to see if HT's performance delta improves going from Windows XP to Windows Vista, which would imply that Vista's scheduler is smarter about dispatching tasks to logical cores that don't share resources.
And in regards to mobile Nehalem, I agree that the power consumption improvements could really benefit notebooks, but it's kind of curious that Nehalem won't come to notebooks until Q3 2009. I believe previous Core 2 rollouts for Merom and Penryn were pretty fast, like a quarter spread between the desktop, notebook, and UP/DP server markets, but this looks to be a 3 quarter spread. I wonder what the delay is? With a Q3 2009 mobile Nehalem launch, they might as well just wait a quarter and do a strong roll out of Westmere on mobile first.