Intel Enterprise and Corporate Roadmap - Q3'05 Update
by Jarred Walton on September 12, 2005 12:00 AM EST- Posted in
- CPUs
Enterprise Server Roadmap
The last sector is one that few people will actually encounter, but we always find the Itanium roadmaps to be interesting. Intel has repeatedly stated that the desktop sector isn't really ready for 64-bit processors and OSes, and we have to say that they're right. XP-64 is available, but we can't actually recommend anyone switch to it just yet. Driver support is lacking, and performance isn't any better for most applications. Intel is also wrong, of course: we need 64-bit consumer processors before we'll get 64-bit consumer OSes and applications. The 32-bit 386 was launched years before we ever saw a proper 32-bit OS, for example. Intel wanted the 64-bit consumer world to come from the 64-bit server world, but that has changed and IA64 may now be stuck as an enterprise-only platform.
Talk of multiple cores on future processors does give IA64 a chance, though. For instance, we could always see IA64 show up as a secondary core, and over time it might move to play a more central role. That may or may not happen, but regardless of what the desktop enthusiasts might think of Itanium, Intel continues to update the processor line. In case this wasn't already made clear, all of the Itanium chips support 64-bit applications, but they do not support x86-64 applications - at least not natively. Itanium 2 doesn't support x86 code natively either, but runs it through a combination of software emulation and hardware features. The inability to run x86 code with high performance has been a drawback of the IA64 platform since its inception, but the market for Itanium generally isn't too concerned about x86 compatibility.
As with the server and workstation parts, Intel has also announced a change in name for the Itanium parts. Future Itanium 2 chips will join the model number club, with the 9000 series currently used for the Montecito-based Itanium 2 chips. Ranging from 9010 up through the 9055, the planned Itanium 2 parts cover a large range of cache sizes, features, clock speeds, and performance. At the low end, the 9010 and 9011 are single core versions of Montecito, with cache sizes and bus speeds that should help them take over the place of last generation's Madison 6MB parts. The model numbers are meant to reflect increasing features and overall performance for the platform, so even though the 9010/9011 are clocked higher than some of the other Montecito chips, the lack of dual cores and the limited L3 cache give them a lower model number. The dual core additions and other technology in Montecito will have a dramatic impact on computational power, with Intel stating up to a 2.5X increase relative to Madison-9M. It's not clear whether that's in regard to the 9040 chip or the 9055, though - we would guess they're giving the best case scenario with the 9055.
Some of the other interesting aspects of the Itanium lineup are the bus speeds. With the later model Madison cores supporting up to 667FSB, it's a little odd to see the new cores continuing to support 400 and 533 FSB speeds. The lower speeds may simply be for compatibility with current generation Itanium motherboards. The lack of certain features is also a bit odd. DBS is the Itanium equivalent of Intel Speedstep technology from the desktop and mobile sector. We're not sure why the 9030/9031 and 9010/9011 chips do not include this feature. HyperThreading is also disabled on the same processors, so it may simply be a case of reduced features for a reduced price.
Other features are included on all of the Montecito cores. Pellston is a useful addition, as it allows the processor to detect L3 cache errors and dynamically disable faulty cache lines without requiring a reboot or any administrative intervention. That should certainly increase yields, as the large L3 cache comprises the majority of the die space. While the loss of usable cache might seem like a bad thing, Intel has stated that they will replace processors that develop more than 90 bad cache lines. Even with the smallest 6MB cache chips, that's still less than 0.1% of the total L3 cache. Virtualization Technology is also present in all the new cores, giving hardware support for certain features that allow multiple OSes to run concurrently. Finally, we have the Foxton technology with the listed clock speeds. Foxton allows the processors to run at higher speeds under certain load conditions when the core temperature is below a threshold. Basically, think of it as mild overclocking for the enterprise sector. Most of the chips only support a 200MHz "overclocked" speed, but that's still an 11 to 17% overclock. (The 9030/9031 only sport a 100MHz speed increase - a 6% increase.) The only thing we don't know is how often the higher clock speed can actually be reached. We've seen automatic overclocking tools from others that in practice rarely managed more than a 1% average speed increase.
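For anyone who wants to check the Foxton percentages, here is a minimal sketch of the arithmetic. The base clocks (1.2GHz, 1.8GHz and 1.6GHz) are assumptions inferred from the quoted percentages rather than values taken from the roadmap, and the function name is purely illustrative.

```python
def foxton_boost_percent(base_mhz: float, boost_mhz: float) -> float:
    """Return the extra Foxton clock speed as a percentage of the base clock."""
    return 100.0 * boost_mhz / base_mhz


# (base clock, Foxton headroom) pairs in MHz.
# The base clocks are ASSUMED for illustration; only the 200MHz and 100MHz
# headroom figures come from the text.
assumed_parts = [
    (1200.0, 200.0),  # slowest assumed base clock with a 200MHz boost
    (1800.0, 200.0),  # fastest assumed base clock with a 200MHz boost
    (1600.0, 100.0),  # assumed base clock for a part with a 100MHz boost
]

for base_mhz, boost_mhz in assumed_parts:
    pct = foxton_boost_percent(base_mhz, boost_mhz)
    print(f"{base_mhz:.0f}MHz + {boost_mhz:.0f}MHz = {pct:.2f}% faster")
```

The same fixed 200MHz step is worth proportionally more on a slower part, which is why the range runs from roughly 11% at the top of the line to roughly 17% at the bottom.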
Montecito is the first Itanium design to switch to 90nm process technology. The switch has allowed Intel to dramatically increase the amount of L3 cache available on the top models. If the Pentium 6xx line's 2MB of cache seems like a lot - there's a law of diminishing returns in respect to cache sizes - the up to 24MB of L3 cache probably seems like hubris. However, increasing cache sizes can help enterprise class servers quite a bit. These systems might have as much as 16 or even 64 GB of RAM, so a paltry 2MB cache can in fact be inadequate for large data sets. Transistor count for the 9055 chip is a whopping 1.7 billion transistors - over seven times the size of Smithfield and Presler! What's really amazing is that this is done on a 90nm process. With mainstream chips scaling to 230 million transistors and more on the 65nm process, we have to wonder where Intel will take the Itanium line when they transition it to the new process. Anyone need an Itanium 2 with 48MB of L3 cache? It might be in the works if the blooming L3 sizes of Montecito are any indicator of future trends. Like it or not, you still have to at least respect the ability of any company that can manage to create such a massive processor as the Itanium.
21 Comments
IntelUser2000 - Monday, September 12, 2005 - link
Itanium either supports hardware emulation OR software translation. The difference between emulation and translation may seem to be minimal, but translation has much better performance than emulation. While the hardware emulation just emulates instructions, the software translator dynamically optimizes the code on the fly to improve performance.
Hardware emulation is NOT present on Montecito in favor of IA-32EL (software translation).
IntelUser2000 - Monday, September 12, 2005 - link
The MAJOR difference between Foxton and *OTHER* dynamic overclocking is that Foxton is implemented in HARDWARE, while other dynamic overclocking is based on SOFTWARE.
I guess you guys may be referring to the dynamic overclocking in MSI's D.O.T. or the one in the ATI Catalyst driver. But they are software based. 30 million of the LOGIC transistors are dedicated to JUST Foxton technology.
Foxton isn't just dynamic overclocking. If the power consumption exceeds the set threshold, it clocks the CPU down until its equal or under the threshold point. Unlike conventional overclocking, Foxton FINDS the right point where it won't damage the CPU, while providing the maximum clockspeed the design can provide.
OCing Prescott to 6GHz is not safe point, BTW.
Foxton responds extremely fast on demand and power consumption. The hardware feature for Foxton is extensive for power management, basing it on power consumption, temperature, workload.
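The threshold behavior described in this comment can be pictured with a toy control loop. This is only a sketch under loud assumptions: the cubic power model, the 50MHz step, the 1.8GHz to 2.0GHz range and the function names are all hypothetical, and nothing here reflects Intel's actual Foxton logic.

```python
def estimated_power(freq_mhz: float, activity: float, base_mhz: float = 1800.0) -> float:
    """Toy model (an assumption): relative power scales with the workload's
    activity factor and with the cube of frequency, normalized so that a
    worst-case workload (activity 1.0) at the base clock uses exactly 1.0."""
    return activity * (freq_mhz / base_mhz) ** 3


def next_frequency(freq_mhz: float, activity: float, base_mhz: float = 1800.0,
                   max_mhz: float = 2000.0, step_mhz: float = 50.0,
                   power_limit: float = 1.0) -> float:
    """One control step: back off when over the limit, speed up when there is room."""
    if estimated_power(freq_mhz, activity, base_mhz) > power_limit:
        # Over budget: clock down, but never below the base clock in this toy.
        return max(base_mhz, freq_mhz - step_mhz)
    faster = freq_mhz + step_mhz
    if faster <= max_mhz and estimated_power(faster, activity, base_mhz) <= power_limit:
        # The next step up still fits in the power budget.
        return faster
    return freq_mhz


def settle(activity: float, steps: int = 20) -> float:
    """Run the loop from the base clock and return where it ends up."""
    freq = 1800.0
    for _ in range(steps):
        freq = next_frequency(freq, activity)
    return freq


for activity in (1.0, 0.8, 0.6):
    print(f"activity {activity}: settles at {settle(activity):.0f}MHz")
```

The point of the sketch is only the shape of the behavior: a worst-case workload gets no extra speed, while lighter workloads climb until they hit either the power limit or the maximum clock. The exact settling points depend entirely on the assumed power model.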
JarredWalton - Monday, September 12, 2005 - link
Good points, and obviously I wasn't trying to get into the deep details of Itanium. I have a question for you, though, as you seem to know plenty about Itanium: Intel currently has IA-32EL; is there an IA-EM64T-EL in the works? (It might be called something else, but basically EM64T emulation for Itanium?)
Even though Foxton is hardware based, we still don't know how it actually performs in practice - at least, I don't. (I probably never will, as I haven't even used an Itanium system other than to poke around a bit at some tradeshows.) The 9055 can run as high as 2.0 GHz under load - in practice, can you actually reach that speed most of the time, or is it more like 1.80 GHz for a bit, then 2.0 GHz for a bit, and maybe 1.90 GHz in between?
Also, are you sure about the "30 million transistors" part? That's larger than the entire Itanium Merced core (not counting the L3 cache). I suppose if you're talking about all the debugging and monitoring transistors, 30 million might be possible, but I didn't think all of that was lumped under "Foxton"?
IntelUser2000 - Monday, September 12, 2005 - link
I think there is plan for EM64T extension to IA-32EL. I heard from Inquirer that Montvale may have that, but either I could have misunderstood it/or its a rumor. Its just software support so I guess Intel can put it whenever they want to.
For Foxton speeds, it depends. From what I understand, there is a thing called a power virus (a power virus is a malicious computer program that executes a specific instruction mix in order to establish the maximum power rating for a given CPU), and if a number for power virus is 1.0 (meaning 100% of maximum power), for Linpack its 0.8, specfp2k is 0.7, specint2k is 0.65, TpmC is 0.6. Since TpmC is furthest away from the power virus figure, it would reach maximum speed all the time, for 9055, that is 2.0GHz. For speccpu2k, it may be 1.9GHz, and for Linpack it may be 1.8GHz. So for some programs, there may be no benefit AT ALL, while others may get the maximum.
Foxton can sample every 8uS to change voltage and frequency.
Yes, I am sure about the Foxton hardware transistor count part. It uses custom 32-bit DSP with its own RAM to process the data necessary for Foxton. I was sort of surprised but yeah, around 30 million. Sorry I couldn't give the link, I'll send you somehow, give me info of how, but I do remember clearly. Merced has 25 million transistors including 96KB L2, without it that's around 20 million I guess, but Mckinley is actually simpler and has less logic transistors than Merced, which according to some, its around 15-17 million transistors.
Montecito has 64 million transistors NOT including L2. 64-30=34 million/2=17 million transistors, which is right on mark for
IntelUser2000 - Wednesday, September 14, 2005 - link
http://66.102.7.104/search?q=cache:fZ7OTmmmXrgJ:ww...
Well, I was KINDA right.
Though, yes that doesn't mean they are all for Foxton. Maybe, I don't know.
Itanium Merced has 25.4 million transistors. ~6 million of that is dedicated to x86 hardware emulator. Which leaves with 19.4 million transistors. W/O including 96KB L2, it would be around 14-15 million transistors for Merced core logic.
IntelUser2000 - Wednesday, September 14, 2005 - link
OTOH, I think the site could be wrong. It doesn't make sense with other Montecito papers saying it consumes less than 0.5W and takes less than 0.5% die size. I give up haha.
Jimw18600 - Monday, September 12, 2005 - link
Your definition of HTT is a little skewed. It doesn't enable processing multiple threads; that was always there, whether they were earmarked or not. What it does do, is instead of flushing the instruction buffer back to the missed branch, it restarts the broken thread and continues the rest forward. Broken threads are simply tossed out and resources are reclaimed in the last stage in the pipeline; completed threads are retired. And by the way, the reason Intel was forced to go to HTT was they were heading for 31-stage pipelines. If you were still back at 12-15 stages, HTT didn't have that much to offer.
JarredWalton - Monday, September 12, 2005 - link
My definition of HTT was actually taken directly from the roadmap. That's how Intel describes it, and obviously a 1 sentence summary leaves out a lot of details. HTT does allow the concurrent execution of more than one thread, but resource contention makes it difficult to say exactly how HTT will affect performance.
One interesting point about SMT in general is that POWER5 doesn't have 20 to 31 pipeline stages and yet it still benefits from the IBM SMT design. This is purely a hunch on my part, but I wouldn't be at all surprised to see some form of HT come out for Conroe/Woodcrest in the future. Trouble filling all four issue slots from one thread? SMT could help out. We'll see if Intel does that or not in the future.
Note: HTT was actually present (but disabled) since Northwood for sure. Some people suspect that it was actually present in an early form in Willamette. Just because Conroe doesn't currently show any HT support, doesn't mean there's not some deactivated features awaiting further testing. :)
IntelUser2000 - Monday, September 12, 2005 - link
From what I understand, modern single thread processors like the early Northwood P4's can execute multiple threads, but not ALL simultaneously. Since today's processors are fast enough anyway, it SEEMS like multi-tasking. The OS decides how to devote the time to the CPUs I guess.
HT, makes use of the otherwise idle units, since it will give basically double demand to the CPU. None of the thread can make full advantage of the CPU(say 15%), but second thread makes it more efficient by taking 20% advantage of the CPU, which is 33% better throughput. It is more complex than that, but I think that explanation is enough.
Power 4/5 issue rate is 5-wide, which is quite a lot. It also has 17-stage pipeline, which is close to Pentium 4 Willamette/Northwood. Wide and deep, with lots of bandwidth and enough execution units, its perfect for SMT.
coomar - Monday, September 12, 2005 - link
kind of difficult to read the confidential
virtualization sounds interesting