Aceshardware

(not so) temporary home for the aceshardware community
 Post subject: Praising the Power 6 design
PostPosted: Fri Mar 21, 2008 7:10 pm 

Joined: Fri Mar 21, 2008 4:07 pm
Posts: 74
Something has bothered me for the past few days. Having closely followed the discussions here, on the original Aces and on RWT about Power6, I found the following posts to be lots of smoke with no actual data:
KTE wrote:
K10h is a 12-stage pipeline, 65nm, 283mm², 463M-transistor, 23.x FO4 delay design. It is not made for high clocks in any way; AMD intended, as presented at one of the global IEEE 2006 conferences, to reach 2-2.8GHz with Barcelona at its rated supply Vdd. Intel Core 2 is a 21 FO4 depth design AFAIK, and Penryn is at FO4 ~18; it is supposed to have been reduced substantially since HKMG integration.

The IBM Power6 is neither the lowest nor the only architecture with a 13 FO4 inversion delay; it just happens to be very well tuned for absolute speed and performance. The P3 had a 15 FO4 depth, Willamette P4 was at FO4 8-10, the Alpha 21264 had 15 FO4, and so on. None of those could achieve what IBM did.

However, I don't think the IBM Power6 bears any relevance to desktop computers. It is a major success for its HPC market and trumped anything any competitor had to offer in 2007, including Harpertown and Itanium 2 Montecito. It's the only CPU to hold all 4 major industry records in one go: transactions, Java, throughput and floating point. It beat Harpertown 3.16GHz, 8 cores vs 8 cores, in Int too. Best in SAP, TPC-C OLTP, OASO, SPECjbb2005, Linpack HPC and so on, last I checked in late 2007. For instance, in TPC-C:

Bull Escala PL1660R, 16-core IBM Power6 4.7GHz - 1,616,162 tpmC
NEC Express5800/1320Xf, 32-core Intel Dual-Core Itanium 2 9050 1.6GHz - 1,245,516 tpmC
Bull Escala PL1660R, 4-core IBM Power6 4.7GHz - 404,462 tpmC
HP ProLiant ML370 G5, 8-core Intel Xeon X5460 3.16GHz - 275,149 tpmC

As you can see, it trumps anything for what it was designed to do.
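
Per core, the gap is even plainer (a quick sketch; tpmC depends heavily on the whole system, storage and software stack, so per-core division is only indicative):

Code:
# tpmC per core for the TPC-C submissions quoted above (indicative only).
results = {
    "Power6 4.7GHz, 16 cores":     (1616162, 16),
    "Itanium 2 9050 1.6GHz, 32c":  (1245516, 32),
    "Power6 4.7GHz, 4 cores":      (404462, 4),
    "Xeon X5460 3.16GHz, 8 cores": (275149, 8),
}
for name, (tpmc, cores) in results.items():
    print(f"{name}: {tpmc / cores:,.0f} tpmC/core")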

What it does show is that those who usually guess SOI is the only clocking restriction are wrong: if you look at IBM's technical documentation, the IBM Power6 scales to 6.1GHz with low-LpolySi tuning, on air, using SOI at a 1.3V Vdd supply. That is far more than anything else out there, including HKMG 45nm CPUs. The official 3.2GHz IBM Power6 is rated for less than 100W TDP at 65nm SOI, a big achievement. The 4.7GHz part is rated for a maximum of 160W TDP with a massive 790M transistors inside a big 341mm² package, a wowzer achievement, especially at the same pipeline depth, instructions per cycle and latch cycle overhead as at 90nm.

No other chip from AMD/Intel at 65nm or 45nm can do sub-200W TDP at those specs or temperatures (sub 60C air, with a 105C limit). To compare, the Power5+ 1.9GHz is a 200W TDP CPU at 389mm² and 276M transistors. Intel's "Montecito" Itanium 9000 running at 1.6GHz is less than half as powerful as the IBM Power6 with a hefty 104W TDP, which just shows how brilliant the engineering on Power6 really is; and yet you forget the +25W minimum TDP of the memory controller in Power6 that Itanium 2 doesn't have.

Comparing Kentsfield (2.67GHz MCM, 65nm, 130W TDP, a 286mm², 582M-transistor package) to the IBM Power6 (4.7GHz, 160W) gives us:

Clock: 2.67GHz vs 4.7GHz
TDP: 130W (+35W NB) vs 160W
Transistors: 582M vs 790M
Clock per watt: 20.5MHz/W vs 29.38MHz/W
Power density: 0.455W/mm² vs 0.469W/mm²
Power per transistor: 0.22W/MilT vs 0.20W/MilT
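
(Those ratios follow directly from the specs just quoted; a minimal sanity-check sketch, treating TDP as the power figure:)

Code:
# Recompute the quoted efficiency ratios from the spec numbers above.
chips = {
    "Kentsfield": {"ghz": 2.67, "tdp_w": 130, "mm2": 286, "mtrans": 582},
    "Power6":     {"ghz": 4.70, "tdp_w": 160, "mm2": 341, "mtrans": 790},
}
for name, c in chips.items():
    print(f"{name}: {c['ghz'] * 1000 / c['tdp_w']:.2f} MHz/W, "
          f"{c['tdp_w'] / c['mm2']:.3f} W/mm2, "
          f"{c['tdp_w'] / c['mtrans']:.2f} W/MilT")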

In all respects it is a far better CPU at 65nm, but it isn't a chip intended for the desktop market, so comparisons with our market shouldn't be used to judge absolute numerical performance, although electrically you can compare.

Getting these speeds on any architecture is not just about one timing: jitter, skew, latch and wiring delays all add up, increasing total delay and lowering your clock performance per cycle. IBM mainly had to employ very high speed, low-delay wiring and, above all, Dual Stress Liners within the transistors. Look at the low Vdd it needed for high frequencies: at 0.9V, Fmax is 3GHz.

Anyway, as for L3 cache, many CPUs have had it, but AFAIR no more than one Intel SKU in the desktop market before the K10h architectural lineup (?). Alpha EV5 was the first that I can recall, with others such as Power4, UltraSPARC IV+, Madison/9M, etc. It mainly helps when your L1 and L2 are saturated by high memory access or large matrix/array applications, such as databases. It was always mainly a server design bonus, hence not featured much on the desktop, but now that seems to be changing, led quite obviously by AMD with their MPU+IMC design. Not that Intel didn't know this before AMD; they just couldn't produce a chip below 45nm with it.

Some have also mentioned that the AMD K10h L3 cache is different from the upcoming "Nehalem" (it's not called Nehalem), i.e. inclusive rather than exclusive. This isn't entirely true either; AMD's design is not specifically inclusive or exclusive, but a bit of both. Also, the L3 is 20% of the die in K10h.

Well, the additions for Nehalem are good on paper, but fingers crossed, as Native+IMC is very difficult to get running as a design without problems, especially at your first go. 45nm HKMG helps a lot, but not as much as 32nm would. I'm fearing the prices on these: clocks are far harder to get, yields much lower, defect rates very high, and hence price is where it'll bite us in the hind, unless AMD has something Intel fears by 13th October '08.

*The cache arrangement is nearly exactly the same as AMD K10h, no doubt, though Intel chose to keep it mainly inclusive. The L3 cache, by the nature of redundant access, is slower than L2 and L1 but far quicker than RAM access. That large L3 will only help with large matrix applications, mainly in video/imaging/large gaming/server apps, but beyond 8 or so MB of data access they might have serious performance scaling issues when all caches are full of the same duplicated data; the latency will build. That's the problem with keeping it inclusive: they need speed and very low latency for it.
*An IMC built in limits the current delta between IMC and core and holds back overclocking/speeds. Each motherboard PWM design now has to provide separate power for the IMC and not just the separate cores. Same for voltmods.
*An IMC also increases power/TDP considerably, especially with triple channel memory support. You can add 30-60W of minimum-to-maximum power here at just 2.0-2.8GHz clocks, maybe even more with SMT and QPI support being internal. Maximum theoretical AC and DC power consumption becomes much higher in individual latch testing.
*Triple channel memory is essentially needed; I reckon it's a clever move, because a Native+IMC design suffers from low real bandwidth, worse for write/copy bandwidth than for read. Individual DRAM access by each core is the best way to go; it should improve, or at least keep level, write/copy bandwidth and improve read bandwidth over current Penryn. Just having an IMC plus 3 controllers doesn't guarantee this at all, though.
*4 vs 3 instructions executed at a time means it will obviously be quicker than Penryn per clock - unless something down the line holds it back, latencies being my fear. Hoping there will be major improvements here, especially with the SSE 4.2 instruction updates once they're supported in software.
*I don't like the sound of keeping a small L2 and large L3; this is more a server segment design win. L2 will be far quicker than L3 but slower than L1; the Native+IMC approach requires a large L2 for speed in smaller apps and a large L3 for speed in larger apps, but suffers from the small L2 in smaller desktop apps. Apps like SuperPi will probably see a big hit from this, not just 1M but even 32M, although not as much of a hit as the exclusive L3 on AMD K10h takes.
*Unfortunately, Native+IMC also means lower bins, lower clocks, lower overclocks, higher defect rates, higher TDPs and a much higher chance of cold bugs and of clocks being held back by the IMC, especially if they are in sync. You have to realize the nature of binning and chip sorting is more than three times as difficult with 4 cores + IMC in one package. I hope they make the IMC run on a separate clock and PLL from the cores, fully adjustable, or IMC/memory overclocking will also be poor and very hard compared to modern Core 2 overclocking, which is easy. They have only specified DDR3 800-1333 support, which gives me the shivers about these clock restrictions; since Nehalem is set for production in Q4 '08, by that time I would've expected them to add DDR3 1600 support, unless something is holding things back here.
*IMC clocking depends on the delta between IMC and memory gate currents and voltages, so this could be a very tricky area to have working with 1.5V DDR3 unless the IMC gate voltages are at the required deltas.

Can't wait to see it in action. It is a revolutionary design for Intel, completely different to their previous CPU designs: very clearly a new architecture. They've now chosen the same desktop architectural design as AMD, to compete. It's the right way to go, but bringing back SMT aka HyperThreading is not a good idea unless the single-threaded front-end performance is weak or the clocks are lower than Penryn's; that isn't a good sign for multi-threaded performance and clock speeds. We know software developers at Intel have been ringing developer ears since before Penryn about how poor multi-core parallelism is everywhere on the desktop and home market (videos on their site show performance scaling dropping off badly from 4 cores to 6), but let's hope they've improved this through the cache data fetch and eviction algorithms; the larger TLB and BTB can do this. This is mainly where SMT will help most.

As for people fighting cases of this firm vs that, you're all wrong, as all electrical designs and knowledge of anything and everything are mostly copied and passed on; it's not called copying though, it's called sharing. How you teach your child, how you know anything about computers, is mostly through reading or being told, which again is sharing by someone somewhere. If I draw a picture by watching a video of someone drawing it, that doesn't mean I copied, or that it isn't an achievement if I made it as good or better. And FWIW, K10h featured many improvements which were exactly what Core 2 received from the Pentium M.

I hope they do launch sub-120W TDP 2.8GHz quad-core Nehalem CPUs by October. It would be a major achievement to pull off with 1.1x+ Penryn performance per clock; I'm not sure how many fully appreciate how difficult that is to produce, especially at the same fabrication node as your current SKU lineup. A monstrously difficult job; go visit a fab and you'll realize it much better. Just had a little time to sit down today. :)


and

KTE wrote:
Jack, I really don't want to get into details, since it's too lengthy for the time I have set aside to post on a rumor thread, but for the sake of you and other level-headed enthusiasts, here goes :D

These are details I've known from Intel and AMD themselves directly, thus more authoritative than the online quotes I provide, apart from the Penryn Core 2 Extreme, where I was only told that the FO4 delay is reduced from Core 2 65nm (drop them an email or ask David/s at RWT, they should know, as the ISSCC and IEEE 2006/7 conferences did make brief mention of them). The fact that neither manufacturer has liked releasing such vital data online since the P4 days means we're not going to find much on it; the little documentation that does exist does not cover either architecture's engineering in such depth, for competitive reasons, until it's old. You only ever hear tid-bits through journals and studies now, which isn't the important detailed engineering, and very few daily journalistic sources can catch on to, or even understand, the real details which matter in engineering (they're not exactly educated to).

The only thing they tend to do is feed extravagant hype 12-18 months early to suit the intended extremists, who do a good enough payroll job each time to propel things in favor of their obsession, as the manufacturer intended to begin with; and then you have unintelligent corner-lurking individuals reacting like their mother is being held hostage by one of the manufacturers, waging childish tantrums on anyone who speaks even slightly admonishingly or not-so-perfectly of that particular manufacturer, regardless of accuracy or their own knowledge limitations, be it about Intel or AMD. A sad case I wish had never existed online since '98, since we're only interested in the architectures when discussing, and I know I don't favour any manufacturer in any product, just whatever is cheap and okay for my intended tasks in the end, as most of the sane will. They just want our money. It just spoils forums and the usefulness of discussion.

Searching for those two-figure FO4 numbers online would require more time than I'm able to spend right now, with an intermittently broken network connection for about 7 days now, even if they do exist; however, I will try to get you some mentions of those FO4 depths, no doubt, specifically for Core 2 and K10h, later.

Ok, it wasn't that bad actually; I just scanned to approximate how hard it might be to find, and it took less than 20 seconds for K10h:
The K10h inverter delay is mentioned [end of page 2 and start of page 3]: http://www.hypertransport.org/docs/news/RWTech_Inside_Barcelona_05-16-07.pdf
This document mentions a few FO4 depths, including that of Core 2 (the only one I've found so far online): http://www.springerlink.com/index/q88838k207r37554.pdf
This document mentions them for many more CPUs: http://www.realworldtech.com/page.cfm?ArticleID=RWT081502231107&p=2
These comparative graphs also look accurate to me, judging by all the lower FO4s I know to be correct: http://www-vlsi.stanford.edu/group/chart/cycleFO4.pdf, http://www-vlsi.stanford.edu/group/chart/ClockFrequency.pdf, http://www-vlsi.stanford.edu/group/chart/PowerDensity.pdf

I'll try and get some word from Intel on Penryn FO4 for you specifically and let you know the full reply by PM (you can then post it if you want, since I don't have any need to post in this thread after my first post and this to answer your enthusiastically put genuine request).
Yep, exactly. My point in focus wasn't to compare clocks between any of them at all; you and I both know there are major variables which would make that inaccurate. Rather, it was that FO4 doesn't dictate frequency@TDP alone: a wide variety of features, the whole architectural design and the choice of materials can limit and affect this greatly. If Intel CPUs can clock greatly, I would never say it is because of one circuitry factor alone. It has been like this since NetBurst, which went from 16+ down to 8 FO4 depths (not sure of the maximum, but it was above 16 for sure, and some PEs wager 6 is the lowest FO4 they had), and Core 2 has a fairly conservative FO4 above 20 to begin with, yet it still can clock high; although with high TDPs at 65nm, it still is very good.

I'll quickly explain a little for the benefit of genuine and sane-minded knowledge seekers. In any modern microprocessor, the slowest pipeline stage is what determines your maximum operable frequency more than anything else. In VHDL terms, the critical path is where the major problem for clocking arises, as the delays add up there. The biggest factors affecting a CPU's maximum clock frequency at a constant FO4 delay are:
a) Microarchitecture
b) Process Variation and Accessibility
c) Logic Styles
d) Timing Overheads
e) Cell designs
f) Wiring Size
g) Floorplan and Placement

Now, even more so than these are the FO4_latch (incl. clock skew and jitter delays), FO4_logic and subsequently FO4_pipeline delays, which designate the depth of the critical path through logic in one pipeline stage. They greatly affect a CPU's clock frequency within desktop TDPs, as well as how much of the CPU surface can be covered in one processor cycle. The most paramount of those parameters affect the critical path lengths and critical path delays (i.e. register propagation delay). Even subthreshold leakage, gate direct tunneling leakage, junction leakage and gate-induced drain leakage greatly affect a CPU's clocking at a given transistor Vdd and Tox. Array, latch and clock power are the primary components of power dissipation in CPUs too; a modern CPU's power is given by P = P_dynamic + P_leakage, and leakage for SiO2 is supposed to be as much as 40% of the power used, especially as the fabrication node decreases. Decreasing the threshold voltage of any transistor increases the leakage current exponentially (i.e. decreasing the threshold voltage by 100mV increases the leakage current by a factor of 10), and decreasing transistor length increases the leakage current as well. This again poses huge real-life clock frequency barriers to CPUs, beyond theoretical simulations, when you shift process size [more on it here: http://books.google.com/books?id=86oXI7MWw8AC&pg=PA20&source=gbs_toc_r&cad=0_0&sig=0rZNwM3RS1iIPG7PCLLr9Edbb-I].
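
(A quick check on that leakage rule of thumb; a minimal sketch, where the 100mV/decade subthreshold slope is simply the figure implied by the sentence above:)

Code:
# Subthreshold leakage rule of thumb: I_leak scales as 10^(-Vth / S),
# with S = 100 mV/decade implied by the "100mV -> 10x" statement above.
def leakage_multiplier(vth_drop_mv, s_mv_per_decade=100.0):
    """Factor by which leakage current grows when Vth drops by vth_drop_mv."""
    return 10 ** (vth_drop_mv / s_mv_per_decade)

print(leakage_multiplier(100))  # 10.0  -> one decade more leakage
print(leakage_multiplier(200))  # 100.0 -> two decades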
The fan-out-of-four inverter delay becomes an ideal metric to compare and estimate clocking, being based entirely on technology scaling; i.e. if you keep the same architecture but just lower the FO4 depth, it is bound to clock better if design/TDP does not restrict this. The ratio of a CPU's FO4 delay to the minimal signal delay for any CMOS is node independent, and you can calculate the limit by the formula Fmax ≈ 1/(π·Trise), where Trise is the FO4 depth multiplied by τFO4 (one FO4 delay). Such that, for a given technology node, FO4 13 at 65nm for a CMOS has a maximum theoretical limit of ~7.5GHz, while at 18nm it has a maximum theoretical frequency of ~11.5GHz. This is where FO4 delay becomes paramount for CMOS, alone, all things kept constant. If you decrease the FO4 depth, as is commonly done in engineering to find the maximum theoretical frequency of the circuitry, to one FO4, then the maximum clock frequency possible at 65nm is ~90GHz, whilst at 18nm it's ~225GHz. The industry standard is to measure energy efficiency between FO4s of different CPU designs in power-performance space (some do this as Energy·Delay²). In this respect, an electrical assessment, you will see Power6 outperform NetBurst, K8, Core 2 and K10h for efficiency.
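
(Those limits are straightforward to reproduce; a minimal sketch, where the per-FO4 delay is back-solved from the quoted 13-FO4/65nm figure rather than independently sourced:)

Code:
import math

def fmax_ghz(fo4_depth, tau_fo4_ps):
    """Theoretical Fmax ~ 1/(pi * Trise), with Trise = depth * one FO4 delay."""
    trise_s = fo4_depth * tau_fo4_ps * 1e-12
    return 1.0 / (math.pi * trise_s) / 1e9

# Back-solve tau from the quoted 65nm limit: 13 FO4 -> ~7.5GHz.
tau_65nm_ps = 1e12 / (math.pi * 13 * 7.5e9)    # ~3.3 ps per FO4

print(fmax_ghz(13, tau_65nm_ps))  # 7.5 GHz, the quoted 13 FO4 limit at 65nm
print(fmax_ghz(1, tau_65nm_ps))   # ~97 GHz, near the quoted ~90GHz one-FO4 limit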

As I mentioned, Power6 is not a desktop CPU, nor does it compare to desktop CPUs in light desktop workloads or the applications they are made to drive; but among 65nm CPUs, it sure is electrically much better engineering than K8, P4 or Core 2 for absolute performance/MHz/TDP. It doesn't produce the most gigaflops and throughput among them for no reason, and all the while it stays sub-60C on air cooling despite being a circuit designed for heavy temperatures (100C+ was the burn-in testing).

Some excellent and authoritative sources for such knowledge are: Proceedings of the Advanced Metallization Conference 2007; IEEE Transactions on Computers (e.g. "Integrated Analysis of Power and Performance for Pipelined Microprocessors"); IEEE Transactions on Electron Devices; IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems; IEEE Transactions on Very Large Scale Integration (VLSI) Systems; "Standby and Active Leakage Current Control and Minimization in CMOS VLSI Circuits"; "Inductance Calculations: Working Formulas and Tables" (Research Triangle Park); "Inductance Calculations in a Complex Circuit Environment" (IBM J. Res. Develop.); and so on. :)


From what I understand, FO4 delay isn't the biggest problem; chips are more likely to be power limited than circuit-delay limited.

Sorry for the extremely long posts, but I had to ask the higher authority on these matters. :P

The original thread is here: http://www.xtremesystems.org/forums/sho ... ost2851682

Am I missing something?


 
 
 Post subject:
PostPosted: Fri Mar 21, 2008 8:32 pm 

Joined: Wed Aug 29, 2007 3:55 pm
Posts: 1006
Location: Great white north
The Power6 is a product of its corporate culture - IBM Micro and IBM's proprietary server platform divisions. Both are trying to carve out a future for themselves in a computer industry that is increasingly Intel dominated. At the same time, the costs of staying in the chip business keep going up, and even huge semicos are going fabless for their digital products. So they want to stand out with bigger mega-giga numbers than anyone else - stuff that gets reported in the WSJ, not just the tech press. IBM needs the wow factor to claim a degree of technical leadership and show that leading edge MPUs aren't yet Intel's one man show.

There are two battles going on. The first is simply RISC vs IPF, and that is an issue of performance, performance/cost, and performance/power for business critical servers with high RAS. The second battle is within IBM itself and is much subtler. IBM is a company whose center of mass is increasingly centered around software and services. Over the past ten years IBM has spun out hardware divisions where IBM could no longer differentiate itself because the products had become increasingly commoditized - DRAM, PCs, printers, hard disks etc.

It isn't good enough anymore for IBM Micro and the pSeries folks to simply turn out a high performance Power processor. IBM Micro has to justify its own existence in the brave new world where TI, the third largest integrated device manufacturer after Intel and Samsung, thinks it can no longer stay on the process development treadmill and will turn to foundries like TSMC for its post-45 nm digital process needs. A mega-giga part like Power6 is a showpiece - higher frequency than anything else out there - and provides IBM Micro a pretext to claim it wouldn't be possible if IBM outsourced process development and wafer manufacturing to an outside firm like TSMC. We know of course that high frequency isn't the only way to high performance; in fact, the second generation NetBurst microarchitecture showed it wasn't even a good way when computational power efficiency is sacrificed.

Later this year Intel will release the 65 nm bulk CMOS Tukwila, and it will likely easily outperform the 65 nm SOI CMOS Power6 on the benchmarks of most interest to buyers of business critical servers, despite running at less than half its clock frequency and having less than half its socket level bandwidth. IBM might have created a better product and a closer competitor to Tukwila if Power6 had been a quad design based on a Power5 core worked over to improve performance/power, but then it wouldn't have had the mega-giga for headlines in the WSJ or given IBM Micro a measure of bragging rights to help justify its continued existence. ;-)


 Post subject:
PostPosted: Fri Mar 21, 2008 11:14 pm 

Joined: Fri Mar 21, 2008 4:07 pm
Posts: 74
Paul DeMone wrote:
The Power6 is a product of its corporate culture - IBM Micro and IBM's proprietary server platform divisions. [...]


I fully agree with you on the economic side of the high-end server arena. I, for one, patiently await the day when Sun will embrace IPF (I'd say within the next 12 months).

But is there anything special about Power6 to warrant claims like:

-"it sure is electrically much better engineering than K8, P4 or Core 2";
-"No other chip from AMD/Intel at 65nm or 45nm can do sub-200W TDP at those specs ... Itanium 9000 running at 1.6GHz is less than half as powerful as the IBM Power6 with a hefty 104W TDP, which just shows how brilliant the engineering on Power6 really is";
-"In all respects it is a far better CPU at 65nm" (vs. Kentsfield)?


 
 Post subject:
PostPosted: Sat Mar 22, 2008 12:20 am 

Joined: Wed Aug 29, 2007 3:55 pm
Posts: 1006
Location: Great white north
There is no magic in Power6 beyond fully competent circuit and physical design. Its designers simply made design and product engineering choices at every stage that enhanced clock frequency:

- within any given power budget, a dual core device can clock faster than a quad core device of equal issue width (a toy model of this point is sketched below).

- within any given power budget, an in-order device can clock faster than an OOOE device of equal issue width.

- if you accept that only the least leaky X% of all manufactured devices will be considered for the top bin, then within any given power budget, the smaller X is, the faster the top bin can clock. (Considering how long IBM took to ship 1000 p570s, maybe 5k processors in total, it seems that X is a pretty small number ;-)
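
A toy first-order model of that first point (a sketch only; it assumes dynamic power scales as cores × V² × f with voltage tracking frequency, the constants are arbitrary, and none of this is from the post):

Code:
# Toy model: P = k * n_cores * V^2 * f, with V assumed proportional to f,
# so P ~ n_cores * f^3 and, at a fixed budget, f ~ (P / n_cores)^(1/3).
def max_clock(power_budget_w, n_cores, k=1.0):
    return (power_budget_w / (k * n_cores)) ** (1.0 / 3.0)

ratio = max_clock(100, 2) / max_clock(100, 4)
print(ratio)  # ~1.26: the dual core clocks ~26% higher in the same budget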


 
 Post subject:
PostPosted: Sat Mar 22, 2008 1:59 am 

Joined: Fri Oct 05, 2007 7:46 am
Posts: 167
Savantu -- what KTE wrote was not incorrect; Power6 was a good design overall, but with different design goals.

When IBM published their journal issue on Power6, it revealed a great deal of the philosophy that went into the design: scaling to high clocks was a high priority, and they did so at the sacrifice of power.

This was my original point when looking at the Power6 design methodologies.... Needless to say, Apple will not be switching back to Power6 ... this simply is not a mainstream chip.

Paul is also correct above; IBM is also 'chest-thumping' to an extent by pushing the clocking envelope.

KTE is a little ambiguous on how power limits clocking; in his context he is correct ... but given an infinite heat sink (i.e. taking power concerns out of the equation), clocking is fundamentally limited by the slowest circuit. What he said was not incorrect.

If I were to argue with him, though, I would take exception to "it sure is electrically much better engineering than K8, P4 or Core 2 is for absolute Performance/MHz/TDP." This is not quite correct: parametrically, Intel's 65 nm process was ahead of IBM's ... to get to these clocks IBM had to scale their gate oxide thickness 10-15% thinner (again, sacrificing leakage in pursuit of clock).

Cedar Mill (a 65 nm NetBurst core) would clock up to 4.2-4.5 GHz at lower power and with thicker gates.

Jack


 
 Post subject:
PostPosted: Sat Mar 22, 2008 8:45 am 

Joined: Fri Mar 21, 2008 4:07 pm
Posts: 74
JumpingJack wrote:
..

KTE is a little ambiguous on how power limits clocking, in his context he is correct .. but, given an infinite heat sink (i.e. take power concerns out of the equation), fundamentally clocking is limited by the slowest circuit. What he said was not incorrect.


But are we anywhere near the point where the slowest circuit kills frequency? I don't remember Power6 having extremely low latency caches or doing a lot of work per cycle (although it has a lot of execution units). From the look of it, circuit delay isn't what's holding Power or any other CPU back (except Itanium with its L1s); it's all about power.

Quote:
If I were to argue with him, though, I would take exception to "it sure is electrically much better engineering than K8, P4 or Core 2 is for absolute Performance/MHz/TDP." This is not quite correct: parametrically, Intel's 65 nm process was ahead of IBM's ... to get to these clocks IBM had to scale their gate oxide thickness 10-15% thinner (again, sacrificing leakage in pursuit of clock).

Cedar Mill (a 65 nm NetBurst core) would clock up to 4.2-4.5 GHz at lower power and with thicker gates.

Jack


The performance per watt of Core is better IMHO than Power's. We're talking about a 65W TDP device that clocks at 3GHz.


 
 Post subject:
PostPosted: Sat Mar 22, 2008 4:37 pm 

Joined: Sat Mar 22, 2008 4:10 pm
Posts: 4
savantu wrote:
The performance per watt of Core is better IMHO than Power's. We're talking about a 65W TDP device that clocks at 3GHz.


And is that 3GHz chip from Intel faster than the Power6 in relevant benchmarks (e.g. processing of heavy financial transactions)?

How can you talk about performance without actually knowing how Power6 performs?


 
 Post subject:
PostPosted: Sat Mar 22, 2008 5:00 pm 

Joined: Wed Aug 29, 2007 3:55 pm
Posts: 1006
Location: Great white north
savantu wrote:
From the look of it, circuit delay isn't what's holding Power or any other CPU back (except Itanium with its L1s)


That comment about IPF L1 is a load of IPF-bashing horse**** originated by Linus at RWT. The IPF L1s are so fast because the ISA doesn't require an effective address offset adder in the read critical path.

More than half of early stepping Montecitos ran above 2 GHz, with some running at over 2.35 GHz, when power was not constrained to the 104W TDP of production SKUs. Tukwila runs at 2.4 GHz at 1.2V but apparently will be released at 2 GHz with a maximum Vcore of 1.15V to limit device power (the "uncore" in Tukwila contributes up to 70W on its own). These are clear indications this uarch line is not speedpath limited in current processes and device configurations.


 
 Post subject:
PostPosted: Sat Mar 22, 2008 5:15 pm 

Joined: Wed Aug 29, 2007 3:55 pm
Posts: 1006
Location: Great white north
Phenom wrote:
savantu wrote:
The performance per watt of Core is better IMHO than Power's. We're talking about a 65W TDP device that clocks at 3GHz.


And is that 3GHz chip from Intel faster than the Power6 in relevant benchmarks (e.g. processing of heavy financial transactions)?

How can you talk about performance without actually knowing how Power6 performs?


The C2D microarchitecture's design target is primarily client computing, i.e. PCs. Of the major platform independent benchmark suites, the one that probably best reflects client computing is SPECint_base2006:

3.0 GHz C2D (E6850) - 20.2
4.7 GHz Power6 (p570) - 17.8

(Note: there are higher submissions for C2D @ 3 GHz, but these are auto-parallel runs.)

SPECint is also one of the major benchmark suites that responds best to clock frequency as a major lever for achieving higher scores. In the case of Power6, however, frequency clearly isn't everything. Obviously Power6 wasn't designed for client computing (sorry Apple :-P), but even taking that into account, the difference in performance/power in the comparison above is staggeringly large, nearly three to one.
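
For reference, that ratio follows from the scores above plus the TDP figures mentioned upthread (a minimal sketch; the 65W E6850 and 160W 4.7GHz Power6 TDPs come from earlier posts, and TDP is only a rough proxy for real power draw):

Code:
# SPECint_base2006 per watt, using the scores above and TDPs quoted upthread.
c2d_per_watt = 20.2 / 65.0       # 3.0GHz C2D E6850, 65W TDP
power6_per_watt = 17.8 / 160.0   # 4.7GHz Power6, 160W TDP
print(c2d_per_watt / power6_per_watt)  # ~2.79, i.e. nearly three to one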


 
 Post subject:
PostPosted: Sun Mar 23, 2008 11:05 pm 

Joined: Fri Oct 05, 2007 7:46 am
Posts: 167
savantu wrote:
But are we anywhere near the point where the slowest circuit kills frequency? [...]

The performance per watt of Core is better IMHO than Power's. We're talking about a 65W TDP device that clocks at 3GHz.


I haven't studied the perf/watt detail of Power6; in fact, about 10 minutes of googling turned up fruitless in finding any kind of power measurement for a Power6 top-bin part ... the data is hard to find.

Strictly speaking, though, a lot of variables go into determining the highest bin clock; in consumer (and mainstream server) parts, the power envelope is limiting, so I agree with your statement in that context.

However, with an infinite heat sink, where power is not limiting, the physical limit of clocking is determined by the weakest circuit (functional block), and this can arise from many sources.... As a rule of thumb (or first approximation), relative statements can be made simply by ascertaining the designed FO4 depths and the delay time of one FO4 stage.

This is where we are arguing from two different angles: my assertion is that an AMD K10h (or K8, for that matter) would not achieve IBM Power6-like clocks regardless of how much energy is dissipated, because it is a more complex beast overall. IBM constrained their designers to a low-latency, in-order design ... and this is where KTE marvels at the device, because they did some nice circuit design to achieve that goal and maximize IPC -- a very different approach from what Intel did with, for example, simply lengthening the pipeline.
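
That rule of thumb is easy to illustrate with the FO4 depths KTE quoted earlier (a sketch only; it assumes a comparable per-FO4 delay on both processes, which is exactly the kind of variable that differs in practice):

Code:
# First-order estimate: clock scales inversely with FO4 depth per stage,
# holding the per-FO4 delay (i.e. the process) constant.
def relative_clock_ghz(f_known_ghz, fo4_known, fo4_other):
    return f_known_ghz * fo4_known / fo4_other

# Power6 reaches 4.7GHz at 13 FO4; a 23 FO4 design like K10h would land near:
print(relative_clock_ghz(4.7, 13, 23))  # ~2.7 GHz, close to AMD's 2-2.8GHz target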


 
 Post subject:
PostPosted: Sun Mar 23, 2008 11:06 pm 

Joined: Fri Oct 05, 2007 7:46 am
Posts: 167
Paul DeMone wrote:
The C2D microarchitecture's design target is primarily client computing, i.e. PCs. [...] the difference in performance/power in the comparison above is staggeringly large, nearly three to one.


What you said ;)

