Barcelona Architecture: AMD on the Counterattack
by Anand Lal Shimpi on March 1, 2007 12:05 AM EST. Posted in CPUs
Core Tune-up
While the most significant-sounding improvements were rolled into the SSE128 changes in Barcelona, they are merely the tip of the iceberg. The laundry list of improvements to Barcelona starts with the branch predictor.
In general, the accuracy of a CPU's branch predictor determines how wide and how deep a design you can make. The average number of instructions executed before the predictor mispredicts governs how many instructions you can have in flight, which in turn controls how many execution units you can realistically keep fed on a regular basis. The K8's branch predictor was quite good and very well optimized for its architecture, but there were some advancements Intel introduced in the Pentium M and Pentium 4 that AMD could stand to benefit from.
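To put rough numbers on that relationship, here is a minimal Python sketch. It assumes one branch every five instructions on average, a figure chosen purely for illustration and not taken from AMD or Intel; the function name is likewise hypothetical.

```python
def instructions_between_mispredicts(accuracy, instructions_per_branch=5.0):
    """Average number of instructions executed between two mispredicted branches.

    accuracy: fraction of branches predicted correctly, in [0, 1).
    instructions_per_branch: assumed average branch spacing (illustrative only).
    """
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    # On average one branch in every 1 / (1 - accuracy) is mispredicted.
    return instructions_per_branch / (1.0 - accuracy)


for acc in (0.90, 0.95, 0.98):
    window = instructions_between_mispredicts(acc)
    print(f"{acc:.0%} accurate: about {window:.0f} instructions per mispredict")
```

Under these assumptions, moving from 90% to 95% accuracy roughly doubles the run of useful instructions between mispredicts, which is why small accuracy gains matter to wide, deep designs.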
Barcelona adds a 512-entry indirect predictor which, believe it or not, predicts indirect branches. An indirect branch is one whose target is not encoded in the branch instruction itself but is read from a register or memory location at run time; in other words, a branch with multiple possible targets. Instead of branching directly to a label indicated by the branch instruction, an indirect branch sends the CPU to a location that holds the address of the instruction it should branch to.
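As a rough illustration of why a dedicated indirect predictor helps, the Python sketch below models a toy target table. Only the 512-entry size comes from the article; the indexing scheme, the use of the previous target as history, the class name, and the addresses are all assumptions made for the example, since neither AMD nor Intel has published its algorithm.

```python
class IndirectPredictor:
    """Toy indirect-branch predictor: each table slot remembers one target."""

    def __init__(self, entries=512):
        self.entries = entries
        self.table = [None] * entries
        self.mispredicts = 0

    def predict_and_update(self, branch_addr, actual_target, history=0):
        # Hypothetical index function: mix the branch address with some history.
        idx = (branch_addr ^ history) % self.entries
        predicted = self.table[idx]
        if predicted != actual_target:
            self.mispredicts += 1
        self.table[idx] = actual_target  # remember the most recent target
        return predicted


BRANCH_ADDR = 0x4000                 # made-up address of one indirect branch
TARGETS = [0x1010, 0x2020] * 50      # the branch alternates between two targets


def run(targets, use_history):
    predictor = IndirectPredictor()
    previous_target = 0
    for target in targets:
        history = previous_target if use_history else 0
        predictor.predict_and_update(BRANCH_ADDR, target, history)
        previous_target = target
    return predictor.mispredicts


print("last target only:", run(TARGETS, use_history=False))
print("with target history:", run(TARGETS, use_history=True))
```

Remembering only the last target fails on every one of the 100 alternating branches, while mixing in the previous target lets the table learn the pattern after a short warm-up. Real predictors are considerably more elaborate, but the principle is the same.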
Intel added an indirect predictor to its Pentium M processor based on the idea that the more you could limit the number of mispredicted branches, the more efficient your processor could be (thus lowering power consumption). The indirect predictor also made its way into Prescott in order to help minimize the performance deficit incurred by further pipelining the NetBurst architecture.
In Prescott, the simple addition of an indirect predictor resulted in over a 12% reduction in mispredicted branches in SPEC CPU2000. While details of how AMD and Intel differ in their predictor algorithms aren't public, we can expect similarly large improvements in areas where indirect branches are common. In the 253.perlbmk test of SPEC CPU2000 the reduction in mispredicted branches with Prescott was significant, reaching almost 55%. With Barcelona, fewer mispredicted branches mean higher overall IPC and greater efficiency from both a power and a performance standpoint. AMD doesn't have the incredibly deep pipeline to worry about that Intel did with Prescott, but the efficiency improvements should be significant.
The inclusion of an indirect predictor wasn't the only crystal ball improvement in Barcelona; the size of the return stack in the new core is double what it was in K8. In very deep call chains, such as code that nests many subroutine calls (recursive functions are a common case), the CPU will eventually run out of room to keep track of where it has been. Once it starts losing track of return addresses, it loses the ability to predict the return branches that depend on those addresses. Barcelona helps alleviate the problem by doubling the size of the return stack. These sorts of improvements are generally driven by profiling the behavior of software commonly used on a manufacturer's CPU, so we asked AMD what software or scenario drove this improvement in Barcelona. AMD wouldn't give us a concrete example other than to say that the return stack size improvements were made at the request of a "large software vendor".
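The effect of return stack depth can be sketched with a small Python model. The stack here is a circular buffer that silently overwrites its oldest entries once full; that organization, the 12 and 24 entry sizes, and the addresses are assumptions for illustration, as the article only states that the size doubled.

```python
class ReturnStack:
    """Toy return-address stack backed by a fixed-size circular buffer."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size
        self.top = 0  # outstanding calls; wraps into slots via modulo

    def push(self, return_addr):
        # When the stack is full this overwrites the oldest entry.
        self.slots[self.top % self.size] = return_addr
        self.top += 1

    def pop(self):
        if self.top == 0:
            return None
        self.top -= 1
        return self.slots[self.top % self.size]


def mispredicted_returns(depth, stack_size):
    """Nest `depth` calls from distinct call sites, then unwind them all."""
    ras = ReturnStack(stack_size)
    true_returns = [0x1000 + 4 * i for i in range(depth)]  # made-up addresses
    for addr in true_returns:
        ras.push(addr)
    misses = 0
    for addr in reversed(true_returns):
        if ras.pop() != addr:
            misses += 1
    return misses


for size in (12, 24):
    print(f"{size}-entry stack, 20 nested calls:",
          mispredicted_returns(20, size), "mispredicted returns")
```

With 20 nested calls, the smaller stack loses the 8 oldest return addresses and mispredicts those returns, while the doubled stack holds the whole chain.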
The final improvement to the K8's branch prediction came through the usual channels - Barcelona now tracks more branches than its predecessor. There's no mystic science to branch prediction; a processor simply looks at how branches have behaved in the past and bases its predictions on that historical data. The more historical data that is present, the more accurate a branch predictor becomes. When the K8 was designed it was built on a 130nm manufacturing process; with the first incarnation of Barcelona set to debut at 65nm, AMD definitely has the die space to track more branch history data.
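A small Python sketch shows why tracking more branches helps. It uses a table of 2-bit saturating counters indexed by branch address, a textbook scheme chosen for illustration and not a description of the K8 or Barcelona predictor; the table sizes and the branch pattern are assumptions, and the pattern is deliberately a worst case for the smaller table.

```python
def count_mispredicts(table_entries, branches, rounds):
    """Run `branches` repeatedly through a table of 2-bit saturating counters."""
    counters = [1] * table_entries  # 0-1 predict not taken, 2-3 predict taken
    misses = 0
    for _ in range(rounds):
        for addr, taken in branches:
            idx = addr % table_entries  # branches may share (alias to) a slot
            predicted_taken = counters[idx] >= 2
            if predicted_taken != taken:
                misses += 1
            if taken:
                counters[idx] = min(3, counters[idx] + 1)
            else:
                counters[idx] = max(0, counters[idx] - 1)
    return misses


# 64 made-up branches: each is always taken or always not taken, in blocks of 16.
branches = [(addr, (addr // 16) % 2 == 0) for addr in range(64)]

for entries in (16, 64):
    print(f"{entries}-entry table:",
          count_mispredicts(entries, branches, rounds=10),
          "mispredicts out of", 64 * 10)
```

In the 16-entry table, branches with opposite behavior share counters and keep undoing each other's history, so every prediction is wrong. In the 64-entry table each branch has its own counter and only the first encounter of each taken branch is mispredicted.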
83 Comments
agaelebe - Friday, March 2, 2007 - link
Wow! A lot of discussion in here. And, by the way, very interesting article.
I'm a software engineer from Brazil and I'm planning to change my PC this year.
I've been using AMD processors since the K6.
Today I have an XP Mobile 2500+ (@2.2GHz), 1GB RAM, 200GB and an AGP 6600GT.
My PC is not very slow, but I'm thinking of going dual core to speed things up (office applications, web development and some games).
I can run some of the newest games, but not at high graphics settings.
I expect that my PC can run C&C 3 (I already ran the demo at 1024 medium; it had some crashes, although it didn't run slowly).
So, today I'm thinking of 3 options:
1) Stay with this computer and wait until AMD launches its new architecture (I intend to go with an average-priced Kuma)
2) Go with an Intel Core 2 Duo (E6300 or E6400). They're not expensive, and for games I can easily overclock and gain more performance.
3) Buy a good AM2 board and a cheap Athlon X2 (3600), wait for the new AMD processors, and then change only the processor.
Here in Brazil the taxes are too high, so I'm planning on buying a PC with these specs:
- CORE 2 Duo e6300/6400 or X2 3600/3800
- mid-tier motherboard (
- 2 x 1gb DDR 800 4-4-4-12
- 2 x 250 gb
- X1950pro 256 or 512
- 500watts power
So the prices are below:
e6300 box US$ 300 (same price for a X2 4200+ box)
X2 3800 box US$ 220
motherboard: US$ 220
ram: US$ 400
video: US$ 450
DVD: US$ 70
case: US$ 150
HDs : US$ 250
Power: us$ 180
So I plan to spend about 2000 dollars (sadly, I could buy this same PC in the US for half the price).
My new PC should not use too much power, so I can leave it turned on all day long (max 150 watts at idle without monitor); otherwise I'll keep my old computer turned on just for downloading stuff.
So, if someone has an opinion, I'd like to "hear" it. You can suggest other options too, or make some comments about the specs I'm choosing now.
I had a Pentium 75 and after that only AMD CPUs... Should I now surrender to the Core 2 Duo, or believe that AMD can really beat it by the end of 2008?
And thanks for the cooperation and patience.
Zebo - Saturday, March 3, 2007 - link
Athlon 64 AM2s aren't exactly slow, so if you're an AMD fan just get one, like a 3800+ or 3600+, and overclock it. It will be at least 4x faster than what you have now and will accept the K8L Agena core later. It will be cheaper than a C2D by about $50 USD, and you'll also pay little for a GeForce 6100 motherboard, which is only $50 USD. Overall, expect the AM2 system to be about $100 USD cheaper.
Keep in mind that C2D is 20% faster clock for clock in most apps, so it's not exactly a quantum leap getting a C2D. The gap gets a lot larger when overclocking, since C2Ds overclock higher: 3.2GHz is common on air vs. only 2.8GHz for AM2. So, at the end of the day, a C2D setup can be about 40% faster across most benchmarks. That is getting significant, and it's why enthusiasts are buying C2Ds.
agaelebe - Friday, March 2, 2007 - link
And, as always, sorry for the errors and not so good writing...
Kiijibari - Thursday, March 1, 2007 - link
Hi,
never heard of that before, does anybody know what it is?
So far I see 2 pad areas in the die photo, therefore I assume that it would also be 2 interfaces, e.g. x8 PCIe like Sun uses?
bb
Kiijibari
mino - Friday, March 2, 2007 - link
It should be some management/coordination stuff (can't remember the name of that bus).
Every northbridge and CPU has that.
davecason - Thursday, March 1, 2007 - link
Anand,
Great article! I know it took a lot of time and I wanted you to know I really appreciate your effort. It is the kind of article that keeps me coming back to your site.
-Dave
yyrkoon - Thursday, March 1, 2007 - link
Page 5, paragraph 4 'pretty significantly'. Well is it, or is it not it ?
http://www.wikihow.com/Avoid-Colloquial-%28Informa...
Aside from my gripe concerning writing style, good article :)
trisweb2 - Friday, March 16, 2007 - link
Usually we criticize writing style based on a whole experience... obviously Anand is one of the best technical review writers on the Internet; if you bother to read his articles more fully perhaps you'd realize that. The colloquial writing sometimes brings it to a more personal level that a reader can better relate to and understand -- it works especially well in this case, where it's a future design, we really don't know how it's going to perform. That he can guess and say "pretty significantly" tells me he understands the uncertainty of the situation, and the language communicates that fact perfectly well. It would be more confusing if he said it would impact performance "significantly" as you want him to, as that would imply that he was more certain than he might actually have been.
Masters are allowed to bend the rules, and Anand is one, so lay off.
yyrkoon - Thursday, March 1, 2007 - link
*Is it, or is it not*
/me hangs head in shame
baronzemo78 - Thursday, March 1, 2007 - link
Any rough guess as to how Barcelona will compete with Core2 in gaming? Many articles have shown how Core2 gets you a slight FPS boost in games that aren't graphics card limited. I'm curious how Barcelona will fit in with the overall picture of DX10 cards like G80 and R600.