Opening the Kimono: Intel Details Nehalem and Tempts with Larrabee
by Anand Lal Shimpi on March 17, 2008 5:00 PM EST - Posted in
- CPUs
Nehalem supports QPI and features an integrated memory controller, as well as a large, shared, inclusive L3 cache.
Nehalem is a modular architecture, allowing Intel to ship configurations with 2 to 8 cores, some with integrated graphics, and with various memory controller configurations.
Nehalem allows for 33% more micro-ops in flight than Penryn (128 micro-ops vs. 96). This increase was achieved simply by enlarging the re-order window and other such buffers throughout the pipeline.
With more micro-ops in flight, Nehalem can extract greater instruction level parallelism (ILP); the larger window also accommodates the additional micro-ops that come from each core now handling two threads at once.
Despite supporting more micro-ops in flight, Nehalem makes no significant changes to the decoder or front end: it is still fundamentally the same 4-issue design we saw introduced with the first Core 2 microprocessors. The next re-evaluation of this front end will most likely come two years from now with the 32nm "tock" processor, codenamed Sandy Bridge.
Nehalem also improves unaligned cache access performance. In SSE there are two types of load/store instructions: one for data aligned to a 16-byte boundary, and one for unaligned data. On current Core 2 based processors, the aligned instructions execute faster than the unaligned ones. Every now and then a compiler would emit an unaligned instruction for data that was in fact aligned, resulting in a performance penalty. Nehalem fixes this case (through some circuit tricks): unaligned instructions operating on aligned data are now just as fast.
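For those who don't write SSE by hand, the distinction shows up at the intrinsic level: _mm_load_ps maps to the aligned load instruction (MOVAPS) and _mm_loadu_ps to the unaligned one (MOVUPS). A minimal sketch in C (the buffer and variable names are ours, purely for illustration):

```c
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void)
{
    /* 16-byte aligned buffer (GCC syntax for the alignment attribute). */
    float data[8] __attribute__((aligned(16))) = { 0, 1, 2, 3, 4, 5, 6, 7 };

    /* Aligned load (MOVAPS): requires a 16-byte aligned address, faults otherwise. */
    __m128 a = _mm_load_ps(data);

    /* Unaligned load (MOVUPS): legal on any address. On Core 2 it was slower
       even when the address happened to be aligned; Nehalem makes this case
       run at full speed. */
    __m128 u = _mm_loadu_ps(data);

    float out[4];
    _mm_storeu_ps(out, _mm_add_ps(a, u));
    printf("%.1f %.1f %.1f %.1f\n", out[0], out[1], out[2], out[3]);
    return 0;
}
```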
In many applications (e.g. video encoding) you walk through a stream of data byte by byte. If an access crosses a cache line boundary (lines are 64 bytes) and an instruction needs data from both sides of that boundary, you incur a latency penalty for the unaligned cache access. Nehalem significantly reduces this penalty, so algorithms such as motion estimation get noticeably faster (hence the improvement in video encode performance).
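Motion estimation is a good concrete case: a sum-of-absolute-differences (SAD) search compares a 16-byte reference block against candidate blocks at every byte offset in a search window, so a sizeable fraction of the loads inevitably straddle a 64-byte line. A hedged sketch of such an inner loop (names and layout are invented for illustration, not taken from any particular encoder):

```c
#include <emmintrin.h>   /* SSE2: _mm_loadu_si128, _mm_sad_epu8 */
#include <stdint.h>

/* Sum of absolute differences between a 16-byte reference block and a
   candidate block starting at an arbitrary byte offset. Because 'cand'
   can begin at any byte, roughly one unaligned load in four crosses a
   64-byte cache line -- exactly the access pattern whose penalty
   Nehalem reduces. */
static uint32_t sad16(const uint8_t *ref, const uint8_t *cand)
{
    __m128i r = _mm_loadu_si128((const __m128i *)ref);
    __m128i c = _mm_loadu_si128((const __m128i *)cand);
    __m128i s = _mm_sad_epu8(r, c);   /* two partial sums, one per 8 bytes */
    return (uint32_t)_mm_cvtsi128_si32(s) +
           (uint32_t)_mm_extract_epi16(s, 4);
}
```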
Nehalem also introduces a second level branch predictor per core. This new predictor augments the one that sits in the processor pipeline, much as an L2 cache backs an L1 cache. The second level predictor has a much larger set of branch history to draw on, but because its history table is much larger, it is also slower. The first level predictor works as it always has, predicting branches as best it can, while the new second level predictor evaluates the same branches in parallel. There will be cases where the first level predictor makes a prediction based on the type of branch but doesn't really have the historical data to make a highly accurate call; the second level predictor, with its larger history window, has higher accuracy and can catch these mispredicts on the fly and correct them before a significant penalty is incurred.
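Intel hasn't disclosed how the two predictors are built, but the general idea of a small, fast predictor backed by a larger, slower one with more history can be sketched in software. Everything below (table sizes, indexing scheme, 2-bit counters) is a toy model of ours, not Nehalem's actual hardware:

```c
#include <stdbool.h>
#include <stdint.h>

#define L1_BITS  8                    /* small, fast first-level table     */
#define L2_BITS 14                    /* larger, slower second-level table */

static uint8_t  l1[1 << L1_BITS];     /* 2-bit saturating counters */
static uint8_t  l2[1 << L2_BITS];
static uint32_t ghist;                /* global branch-history register */

/* The first-level guess is available immediately; the history-rich
   second-level guess arrives later and can override it before the branch
   resolves, avoiding a full mispredict penalty. */
static bool predict(uint32_t pc, bool *fast_guess)
{
    *fast_guess = l1[pc & ((1u << L1_BITS) - 1)] >= 2;
    uint32_t idx = (pc ^ ghist) & ((1u << L2_BITS) - 1);
    return l2[idx] >= 2;
}

static void update(uint32_t pc, bool taken)
{
    uint8_t *c1 = &l1[pc & ((1u << L1_BITS) - 1)];
    uint8_t *c2 = &l2[(pc ^ ghist) & ((1u << L2_BITS) - 1)];
    if (taken) { if (*c1 < 3) ++*c1; if (*c2 < 3) ++*c2; }
    else       { if (*c1 > 0) --*c1; if (*c2 > 0) --*c2; }
    ghist = (ghist << 1) | (taken ? 1u : 0u);
}
```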
The renamed return stack buffer is also a very important enhancement to Nehalem. Mispredicts in the pipeline can result in incorrect data being populated into Penryn's return stack (a data structure that keeps track of where in memory the CPU should begin executing after working on a function). A return stack with renaming support prevents corruption in the stack, so as long as the calls/returns are properly paired you'll always get the right data out of Nehalem's stack even in the event of a mispredict.
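A return stack is easy to picture in software. The toy model below (ours, not Intel's circuit) pushes a predicted return address on every call and pops it on every return; the checkpoint/restore pair stands in for what renaming buys, namely the ability to roll the stack back after a mispredicted path has speculatively popped entries:

```c
#include <stdint.h>

#define RSB_DEPTH 16                      /* return stacks are typically small */

typedef struct {
    uint64_t addr[RSB_DEPTH];             /* predicted return addresses */
    unsigned top;                         /* top-of-stack pointer */
} rsb_t;

/* On a predicted CALL: push the address of the instruction after it. */
static void rsb_push(rsb_t *r, uint64_t ret_addr)
{
    r->top = (r->top + 1) % RSB_DEPTH;
    r->addr[r->top] = ret_addr;
}

/* On a predicted RET: pop the most recently pushed address. */
static uint64_t rsb_pop(rsb_t *r)
{
    uint64_t a = r->addr[r->top];
    r->top = (r->top + RSB_DEPTH - 1) % RSB_DEPTH;
    return a;
}

/* Stand-in for renaming: snapshot the stack pointer at a branch and put it
   back on a mispredict. This recovers entries the wrong path merely popped;
   real renaming also preserves entries the wrong path overwrote with its
   own pushes. */
static unsigned rsb_checkpoint(const rsb_t *r)        { return r->top; }
static void     rsb_restore(rsb_t *r, unsigned saved) { r->top = saved; }
```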
53 Comments
pugster - Monday, March 24, 2008 - link
Intel's Core 2 Duo is probably good for business, but the OS doesn't need anything more than 2 cores running at an average of 2GHz. I know there are people out there who want the latest and greatest for games, but more and more people would rather buy a game console like the PS2 than put money down for a GeForce 9800. It seems the only way for Intel to make money is making new products like Silverthorne or going back into the flash memory race.
PlasmaBomb - Thursday, March 20, 2008 - link
Since it is based on Penryn, isn't 16 MB of cache an odd number? Shouldn't that be 18 MB (i.e. 3 dual cores at 6 MB each)?
IntelUser2000 - Sunday, March 23, 2008 - link
PlasmaBomb, Penryn has 6MB of L2, not L3. Dunnington has 16MB of L3 in addition to whatever L2 it will have. Please read!
perzy - Wednesday, March 19, 2008 - link
Larrabee, that's good news. Finally some competition in the graphics department! Let's face it, right now you can get 2 Xbox 360s and an iPod for the price of one fast graphics card... that can't be right.
AcaClone - Tuesday, March 18, 2008 - link
What can I say...
AcaClone - Tuesday, March 18, 2008 - link
On second thought - I guess it is possible that the demo software is indeed multithreaded, but that only one thread is running when left idle??
ajg - Tuesday, March 18, 2008 - link
The slide showing "Intel: The Architecture for Life" is a page lifted from AMD's slide "Diversifying Platform Design Tracks", link below:
http://www.tgdaily.com/index.php?option=com_conten...
The CPU architecture is no different. I guess you can't expect an old dog to come up with new tricks?
clnee55 - Tuesday, March 18, 2008 - link
Yes, AMD said it but couldn't do it. Easier said than done.
micha90210 - Tuesday, March 18, 2008 - link
Is that possible? There's a limit in XP of 3.25GB of RAM. XP can't handle 16GB... is that picture real?
oldhoss - Tuesday, March 18, 2008 - link
I'd venture to guess either XP Pro x64, or Windows 2003 Server.