Out of Order execution: AMD versus Intel
To make this article more accessible and to make the differences between the AMD K8 and the new Intel Core architecture clearer, I tried to draw both CPU diagrams in the same style. Here's the Core architecture overview:
And here's the K8 architecture:
There are a few obvious differences. Core has bigger OoO buffers: the 96-entry ROB is, also thanks to macro-op fusion, quite a bit bigger than the 72-entry macro-op buffer of K8. The P6 architecture could reorder only 40 instructions; this was doubled to 80 in the P-M architecture (Banias, Dothan, Yonah), and it has now been increased further to 96 for the Core architecture. We've created a table which compares the most important architectural details of several current CPU families:
The Core architecture uses a central reservation station, while the Athlon uses distributed schedulers. The advantage of a central reservation station is better utilization; distributed schedulers, however, allow more entries. NetBurst also uses distributed schedulers.
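The utilization argument can be illustrated with a toy model. This is only a sketch: the function names and all entry counts are hypothetical and not taken from any real CPU, and the model only looks at allocation (no micro-op ever issues and frees its entry). It counts how many micro-ops of an integer-heavy burst can be accepted before allocation stalls.

```python
from collections import Counter

def accepted_before_stall_unified(ops, capacity):
    """Count micro-ops accepted by one shared reservation station before it fills."""
    return min(len(ops), capacity)

def accepted_before_stall_distributed(ops, capacities):
    """Count micro-ops accepted before the queue for the next op's type is full."""
    used = Counter()
    for accepted, kind in enumerate(ops):
        if used[kind] >= capacities[kind]:
            return accepted  # this op's own queue is full, so allocation stalls
        used[kind] += 1
    return len(ops)

# Hypothetical sizes: one shared 12-entry pool versus two 8-entry queues.
burst = ["int"] * 10 + ["fp"] * 2  # integer-heavy burst of 12 micro-ops
unified = accepted_before_stall_unified(burst, capacity=12)
distributed = accepted_before_stall_distributed(burst, {"int": 8, "fp": 8})
print(unified, distributed)  # 12 8
```

In this made-up case the distributed design has more entries in total (16 versus 12) but stalls on the ninth integer micro-op while its FP queue sits empty, whereas the shared pool accepts the whole burst.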
Using a central reservation station is another clear example of how Core is in fact the "P8", the second big improvement of the P6 architecture. Just like the P6 architecture, it uses a Reservation Station (RS) and allocates a specific execution unit to execute the micro-op. After execution, the micro-op results are stored in the ROB entry for that micro-op. This aspect of the Core design is clearly taken from the Yonah, Dothan and P6 architectures.
The biggest differences are not immediately visible on the diagrams above. Previous Intel architectures can only perform one branch prediction every two cycles, but Core can sustain one branch prediction per cycle. The Athlon 64 can also perform one branch prediction per cycle.
Another impressive area is Core's SSE multimedia power. Three very powerful 128-bit SSE/SSE2/SSE3 units are available, and two of them are symmetric. Core will outperform the Athlon 64 vastly when it comes to 128-bit SSE2/3 processing.
On K8, 128-bit SSE instructions are decoded into two separate 64-bit instructions. Each Athlon 64 SSE unit can only do one 64-bit instruction at a time, so the Core architecture has at least twice the processing power here. With 64-bit FP, Core can do 4 Double Precision FP calculations per cycle, while the Athlon 64 can do 3.
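The "twice the processing power" figure follows directly from the datapath widths. As a minimal sketch (the helper names are invented for illustration, and the model assumes a single, fully pipelined unit that accepts one internal operation per cycle), the number of cycles needed to start a run of 128-bit instructions can be computed like this:

```python
import math

def internal_ops(instr_bits, datapath_bits):
    """Number of internal operations one packed instruction is split into."""
    return math.ceil(instr_bits / datapath_bits)

def cycles_for(n_instr, instr_bits, datapath_bits):
    """Cycles one unit needs to start n_instr packed instructions,
    assuming it accepts one internal operation per cycle."""
    return n_instr * internal_ops(instr_bits, datapath_bits)

k8 = cycles_for(100, instr_bits=128, datapath_bits=64)     # 64-bit unit, ops split in two
core = cycles_for(100, instr_bits=128, datapath_bits=128)  # full-width 128-bit unit
print(k8, core, k8 / core)  # 200 100 2.0
```

This only models the per-unit splitting described above; it says nothing about how many units each CPU has or which operations each unit supports.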
When it comes to integer execution resources, the Core architecture is an improvement over the Pentium 4 and Dothan CPUs, and is at the same level (if we only look at the number of execution units) as the Athlon 64. The Athlon 64 seems to have a small advantage when it comes to calculating addresses: it has 3 AGUs compared to Core's 2. This could give the Athlon 64 an advantage in some less common integer workloads such as decryption algorithms. The deeper, more flexible (memory disambiguation, see further) out-of-order buffers and the bigger, faster L2 cache of Core should negate this small advantage in most integer workloads.
87 Comments
Betwon - Tuesday, May 2, 2006
The data cannot be used for Core, because it did not use the smart prefetcher.
The advanced smart prefetchers of Core's L1D have decreased the miss rates very much. In fact, the data cache of Core is much more efficient than K8's.
Compared with Core's smart cache, K8's 64KB L1D is like an idiot.
Spoonbender - Tuesday, May 2, 2006
How do you know? As the article said, the prefetching *might* in some cases decrease performance, even if it'll usually be an advantage. But I don't really think you have enough information to make a valid comparison. My point was simply that generally speaking, a 64KB, 2-way associative cache will have better hit rates than a 32KB 8-way associative. Of course, having fancy prefetching is always a good thing, but its effect *is* limited. If it was a huge improvement, people would have done that 8 years ago, instead of just messing with cache size and associativity.
Betwon - Tuesday, May 2, 2006
Your information is too old and should be updated now.
Prefetchers give a big improvement in reducing the miss rate.
The miss rate is reduced by about 30-90%.
Good prefetcher tech is one of the most important performance factors.
http://www.hpcaconf.org/hpca11/slides/hpca_inst_sl...
Betwon - Tuesday, May 2, 2006
Who is James E. Smith? I think that you should know him.
Data Cache Prefetching Using a Global History Buffer: the prefetcher brings a great performance improvement, from 20-110%!
abstract
http://ieeexplore.ieee.org/search/freesrchabstract...
Of course, you can download the full-text PDF file if you have an IEEE member account. I can download and view it, but cannot release it.
slides ppt
http://www.ece.gatech.edu/~leehs/ECE7102/slides/ka...
Sunrise089 - Monday, May 1, 2006
flak flak flak
Seriously - props to the author on a good article, but if I had one comment it would be that there are length issues in trying to provide the amount of background needed for this sort of article. I think it's best to either just draw the comparisons between the two chips, or do a full-length, many-thousands-of-words write-up on the technical importance of the various topics. I read the article, and while it is well written and informative in its conclusions, I cannot say all the background was enough to make me really understand the concepts better. For example, I already knew what out-of-order execution was, but only being able to read a few hundred words more on it didn't allow me to learn enough to understand all of the reasons why the K8 had a disadvantage in that area, and if all you wanted was for me to understand that it did indeed have the disadvantage, you could have just said so.
JohanAnandtech - Monday, May 1, 2006
It is indeed an issue I struggled with. Writing full-length articles on these subjects doesn't sound like a good idea to me: I personally do not like lengthy articles either. So I tried to keep a balance between being technical and keeping it understandable.
Anyway, just ask about the points where you were lost. Especially on the OoO matters: it is much more interesting than "AMD has a disadvantage". Basically, reordering happens between the decoding and the execute phase.
Pushing loads forward helps in two ways:
1. Whenever a load fails to get its data from the L1-cache, the CPU has to find other instructions to execute. As loads are very common, filling those gaps is easier when you can move loads ahead of other loads than when you cannot.
2. If a load gets pushed forward and an L1-cache miss for that load occurs, it isn't that bad. This is very simplified, but assume the load has been pushed 5 cycles forward, and your L2-cache latency is 10, you only have to wait 5 cycles instead of 10.
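The arithmetic in point 2 can be written down as a tiny sketch. The function name is made up for illustration, and the model is as simplified as the example above: it only looks at how far the load was moved up and how long the miss takes.

```python
def stall_cycles(miss_latency, cycles_hoisted):
    """Cycles the consumer of a load still has to wait after an L1 miss,
    given how many cycles earlier the load was started."""
    return max(0, miss_latency - cycles_hoisted)

print(stall_cycles(10, 0))   # load not moved: wait the full 10 cycles
print(stall_cycles(10, 5))   # load moved 5 cycles forward: wait 5 cycles
print(stall_cycles(10, 12))  # moved further than the latency: no wait at all
```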
Furen - Monday, May 1, 2006
I'll be the grammar nazi today, lol.
Last page, paragraph 5: "[...] increasing the wideness of each unit [...]"
Width, perhaps? "Wideness" refers to either quality or state (neither of which is discrete) while "width" also applies to measurable fact (128 bits wide, for example). You can talk about the wideness of the units, for example, but you cannot talk about increasing their wideness...
Great article, by the way, it's been long since I've read such an enjoyable article.
emboss - Monday, May 1, 2006
Just a quick note ... on page 4 you have the table with the execution unit details. There's a couple things incorrect (IMO) in the numbers.
First, you list the number of double precision FLOPs per cycle. Double precision can be done with SSE, so in the K8 you can do 2 DP ADDs and 2 DP MULs every two cycles (due to the 64-bit wide datapaths), a total of 2 DP FLOPs per cycle.
Core can do two SSE operations per cycle (the two symmetric units), giving it a total of 4 DP FLOPs per cycle. The third SSE unit does not handle FP ops, but instead handles shuffles and the like.
Obviously, double both of these numbers if you want a "peak" single precision FLOPs per cycle.
If instead you meant extended precision (64 bit precision, 80 bit floats) x87 operations, it's exactly the same concept as above, since Core apparently has combined SSE/x87 units (and a fully pipelined FMUL, unlike the P4). This gives both the K8 and Core 2 EP FLOPs per cycle.
Finally, you have the number of SSE units for the K7 wrong. The K7, like the K8, has two SSE units (FADD and FMUL), and the same 64 bit datapath as the K8. Of course, the K7 cannot handle SSE2, so must use x87 instructions for double precision (ie: two DP FLOPs per cycle).
Apart from that, very nice article! I've been trying to optimise SSE code for the Core processor and have had to do things by trial and error thanks to the complete void of any decent documentation from Intel. One thing in particular was that I was finding "odd" performance properties with SSE that pointed towards it having two FMUL units. Being symmetric units explains a lot!
JarredWalton - Monday, May 1, 2006
See above note regarding Core Duo versus Core "Conroe". (Nice naming scheme, Intel. *grumble*) I will let Johan take care of the rest of your comment as appropriate. (His knowledge of the low level details of all of the microarchitectures discussed here definitely surpasses mine!)
Unfortunately, it's not particularly surprising to find out that optimal code for Core Duo may need to be slightly tweaked in order to extract the most performance from Conroe. Still, they ought to be similar enough that you own by optimizing for Core Duo. The flipside is the optimal code for Conroe could very likely run worse on Core Duo and other processors. Such is the price of progress, I guess.
prx99 - Monday, May 1, 2006
Core is not the first x86 having 4 decoders. That was AMD's K5.
I remember a statement from AMD that in some design they considered adding one more decoder. It turned out to actually slow down the design because the amount of clock speed lost was not compensated for by the smaller amount of performance gained.
In my interpretation the fusion is done past the initial decoding, so there is no way more than 4 x86 ops can be decoded in a clock cycle (I'm referring to the "4+1" figure). The profit from fusion is not in the decoding stage but in the out of order engine.
At AMD, the "1 branch per cycle" rule is limited to branches seen by the predictor. A branch which is generally not taken is invisible to the prediction engine and therefore free.
The original P4 indeed had an L1 latency of 2. The major P4 redesign in Prescott however increased it to 3.
Load/store reordering is already done by the P4, but the penalty from a misprediction is fairly high. This is the drawback of any kind of prediction, whether branches or memory access: It speeds up things when being correct, but slows them down quite a bit more when not. This was the general picture seen in the P4: many applications were sped up by some amount, but some suffered greatly because they systematically fooled the P4's engines.
Gruss, Andreas