AMD Core Counts and Bulldozer: Preparing for an APU World
by Anand Lal Shimpi on November 30, 2009 12:00 AM EST
Posted in CPUs
Last week Johan posted his thoughts from a server/HPC standpoint on AMD's roadmap. Much of my analysis was limited to desktop/mobile, so if you're making million-dollar server decisions then his article is better suited to your needs.
He also unveiled a couple of details about AMD's Bulldozer architecture that I thought I'd call out in greater detail. Johan has been working on a CMP vs. SMT article so I'll try not to step on his toes too much here.
It all started about two weeks ago when I got a request from AMD to have a quick conference call about Bulldozer. I get these sorts of calls for one of two reasons. Either:
1) I did something wrong, or
2) Intel did something wrong.
This time it was the former. I hate when it's the former.
It's called a Module
This is the Bulldozer building block, what AMD is calling a Bulldozer Module:
AMD refers to the module as being two tightly coupled cores, which starts the path of confusing terminology. A few of you wondered how AMD was going to be counting cores in the Bulldozer era; I took your question to AMD via email:
Also, just to confirm, when your roadmap refers to 4 bulldozer cores that is four of these cores:
http://images.anandtech.com/reviews/cpu/amd/FAD2009/2/bulldozer.jpg
Or does each one of those cores count as two? I think it's the former but I just wanted to confirm.
AMD responded:
Anand,
Think of each twin Integer core Bulldozer module as a single unit, so correct.
I took that to mean that my assumption was correct and 4 Bulldozer cores meant 4 Bulldozer modules. It turns out there was a miscommunication and I was wrong. Sorry about that :)
Inside the Bulldozer Module
There are two independent integer cores on a single Bulldozer module. Each one has its own L1 instruction and data cache (thanks Johan), as well as scheduling/reordering logic. AMD is also careful to mention that the integer throughput of one of these integer cores is greater than that of the Phenom II's integer units.
Intel's Core architecture uses a unified scheduler fielding all instructions, whether integer or floating point. AMD's architecture uses independent integer and floating point schedulers. While Bulldozer doubles up on the integer schedulers, there's only a single floating point scheduler in the design.
Behind the FP scheduler are two 128-bit wide FMACs. AMD says that each thread dispatched to the core can take one of the 128-bit FMACs or, if one thread is purely integer, the other can use all of the FP execution resources to itself.
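To make the sharing rule concrete, here is a minimal sketch of how the two FMACs might be divided between a module's two threads. It is a coarse, static model of the description above, not AMD's actual scheduling logic (which presumably arbitrates dynamically); the function names and the boolean "thread uses FP" flag are assumptions made for illustration.

```python
# Coarse model of FP sharing inside one Bulldozer module, based only on the
# description above: two 128-bit FMACs sit behind a single FP scheduler.
# Names and the static "uses FP" flag are hypothetical, for illustration only.

FMAC_WIDTH_BITS = 128   # per the article
FMACS_PER_MODULE = 2    # per the article


def fmacs_per_thread(thread_uses_fp):
    """Return how many FMACs each of the module's two threads can use.

    thread_uses_fp: a pair of booleans, one per thread.
    """
    a, b = thread_uses_fp
    if a and b:
        return (1, 1)                    # both threads do FP: one FMAC each
    if a:
        return (FMACS_PER_MODULE, 0)     # thread B is purely integer
    if b:
        return (0, FMACS_PER_MODULE)     # thread A is purely integer
    return (0, 0)                        # neither thread touches the FP units


def fp_bits_per_thread(thread_uses_fp):
    """FP execution width, in bits, available to each thread."""
    return tuple(n * FMAC_WIDTH_BITS for n in fmacs_per_thread(thread_uses_fp))


if __name__ == "__main__":
    for mix in [(True, True), (True, False), (False, False)]:
        print(mix, fmacs_per_thread(mix), fp_bits_per_thread(mix))
```

In this model a lone FP thread sees the full 256 bits of FMAC width, while two FP threads see 128 bits apiece.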
AMD believes that 80%+ of all normal server workloads are purely integer operations. On top of that, the additional integer core on each Bulldozer module doesn't cost much die area. If you took a four module (eight core) Bulldozer CPU and stripped out the additional integer core from each module you would end up with a die that was 95% of the size of the original CPU. The combination of the two made AMD's design decision simple.

AMD has come back to us with a clarification: the 5% figure was incorrect. AMD is now stating that the additional core in Bulldozer requires approximately an additional 50% die area. That's less than a complete doubling of die size for two cores, but still much more than something like Hyper-Threading.
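The arithmetic behind the two figures can be sketched as follows. AMD does not say what the 50% is measured against, so the baseline here (a version of the same unit with a single integer core) is an assumption, and the two figures may not share a baseline at all. The function names are hypothetical.

```python
# Back-of-the-envelope area arithmetic for the figures quoted above.
# Assumption: "overhead" is the extra area of the second integer core,
# expressed as a fraction of a hypothetical single-integer-core baseline.


def relative_area(extra_core_overhead):
    """Area with two integer cores, relative to the one-core baseline."""
    return 1.0 + extra_core_overhead


def area_per_core(extra_core_overhead):
    """Area attributable to each core, relative to the one-core baseline."""
    return relative_area(extra_core_overhead) / 2.0


def overhead_from_stripped_fraction(stripped_fraction):
    """Convert "stripping the extra cores leaves X of the die" to an overhead."""
    return 1.0 / stripped_fraction - 1.0


if __name__ == "__main__":
    cases = [
        ("original 95% claim", overhead_from_stripped_fraction(0.95)),
        ("corrected ~50% figure", 0.50),
    ]
    for label, overhead in cases:
        print(f"{label}: overhead {overhead:.1%}, "
              f"total {relative_area(overhead):.2f}x, "
              f"per core {area_per_core(overhead):.2f}x")
```

Under that assumption the corrected figure works out to 1.5x the baseline area for two cores, or 0.75x per core, which is the same 25% per-core saving that one of the commenters below arrives at.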
94 Comments
mattclary - Monday, December 14, 2009
[quote]Anand,
Think of each twin Integer core Bulldozer module as a single unit, so correct.
[/quote]
It's no wonder you misinterpreted what he said. This is vague at best! "Is it either or? - Correct!"
aj28 - Thursday, December 3, 2009
Alright, so here's the quote from the article. Take note of the parts in bold...[QUOTE]Also, just to confirm, when your roadmap refers to 4 bulldozer cores that is four of these cores:
http://images.anandtech.com/reviews/cpu/amd/FAD200...
Or does each one of those cores count as two? I think it's the former but I just wanted to confirm.[/QUOTE]
And AMD's response...
[QUOTE]Think of each twin Integer core Bulldozer module as a single unit, so correct.[/QUOTE]
So to me this reads, "Correct, the former, meaning..."
[QUOTE]...when your roadmap refers to 4 bulldozer cores that is four of these cores:
http://images.anandtech.com/reviews/cpu/amd/FAD200...[/QUOTE]
There's a good chance that the majority is correct and I am in fact wrong, but... Well, that's just how I read their response. I feel there is a good chance of some more confusion afoot, much like the percentages being thrown around in the original article.
aj28 - Thursday, December 3, 2009
I think it's also worth noting that I fail at quoting... Evidently... Sorry!
swindelljd - Wednesday, December 2, 2009
I bet Oracle is salivating over the new core count technique since it is sure to create a huge surge in their revenue because they charge per core on the x86 platform.
Sivar - Tuesday, December 1, 2009
If FP performance is given the backseat, it could impact game performance for well multi-threaded games.
JumpingJack - Wednesday, December 2, 2009
Depends on how effectively the designers are able to share the FP in this arrangement, but yeah -- gaming will be a question mark. I am pretty confident it will be better not worse.
nirmv - Tuesday, December 1, 2009
From what I understand, AMD figured out how to reduce core size by 25% without impacting performance.
Each pair of cores will now share the same fetch/decode units (using SMT like Intel) and the same FP unit (doubled to 256 bits, so it's actually two 128-bit units), but keep separate Int units like before. So they share half of the logic of two cores, and now use 150% of the die area of one core for 2 cores, or in other words save 25% of each core (75% * 2 = 150%).
But it will still have 1/2 the throughput of Sandy Bridge in FP, and it will still have 1/2 the fetch/decode bandwidth because it uses one unit for two cores instead of one per core.
Nevertheless it looks like a wise decision in terms of power/performance. So nice, but it won't give AMD the performance crown.
Seramics - Tuesday, December 1, 2009
From the way it seems, I'm afraid the badly delayed, highly anticipated, much hyped processor that is AMD's only hope to retake the performance crown from Intel will fall short of expectations. Unless they really come up with a competitive and powerful processor, I'm afraid the AMD we know from the A64 days will continue to be history until the next major architecture after Bulldozer, which could well be 5 years or so after 2011. AMD to be a budget player till then.
Alberto - Tuesday, December 1, 2009
Bulldozer seems too late against Intel's upcoming offerings.
An eight core Bulldozer will be clearly slower than an eight core Sandy Bridge, in both integer and FP.
This CPU implementation seems designed to fight Nehalem (two 128-bit units, both possibly utilized by one core only).
Sandy Bridge will have two times the FP power and threads per die, assuming the article is right.
The only way to be competitive is to treat a single "block" as a monolithic core. Intel can answer with 50% more cores per die, delivering better overall integer and FP performance.
Still, we don't know what the new integer performance of the Sandy Bridge integer unit will be. I believe it will be higher than in Nehalem.
epobirs - Tuesday, December 1, 2009
I don't buy this claim that FP will be eliminated from CPUs in favor of doing it all on a GPU. There are too many situations where FP is still needed on a per core basis with a primarily integer load. About two minutes after the first systems ship with no integrated FP in the CPU (Bulldozer SX?) there will be engineers thinking themselves clever by proposing to boost FP performance by integrating it into the CPU die!
What will happen instead is the FP and onboard low-end graphics solution will merge. The monster GPUs will be there for high-end FP as needed and the die area consumed by the FP and IGA minimized so as to be beneath concern. FP may be external to the cores but they won't be sold without at least one FP/IGA module in the mix. That way you have a chip that is versatile for a wide range of different boxes but also cost competitive.