Alberto
Joined: 04 Sep 2007 Posts: 111 Location: Italy
|
Posted: Thu Mar 27, 2008 7:52 pm Post subject: |
|
|
| jack wrote: | | Pjotr wrote: | | Johan wrote: | | I really would like to understand the effects of the cache a bit more. Considering that each core in Barcelona has only a 512 KB cache before it has to go to a slow L3, I can imagine that is far from optimal for most desktop apps. Even Shanghai will not change this. Penryn with its huge low-latency L2 cache is really ideal for desktop apps. |
And where would that put Nehalem, going away from the big L2 to smaller L2 with a separate L3 instead? Perhaps it is a better design when you are not limited by a FSB. |
That will depend on L3 latency. However, based on previous designs, it should be quite decent. Certainly better than Barcelona's L3, whose latency is quite horrible.
|
Is it horrible? Who says that? It's an L3 designed to be a 6MB L3.
There is an Intel slide that claims 33 cycles for the 32nm Sandy Bridge L3, and that cache doesn't seem FAST either; maybe the real question is how much a large L3 cache matters in a CPU with one or two MCs on die, at least in consumer applications.
Alberto.
Last edited by Alberto on Fri Mar 28, 2008 1:34 pm; edited 1 time in total |
|
 |
jack
Joined: 27 Jun 2007 Posts: 333
|
Posted: Thu Mar 27, 2008 9:03 pm Post subject: |
|
|
| EaS wrote: | | Some of the first details on IPC gains that AMD expects going from Barcelona to Shanghai |
Finally a benchmark which includes 45nm dual cores in comparison! Too bad that they didn't include 3GHz/$183 Wolfdale results. 2.66GHz E8200 still performed quite well, overtaking low-end Phenoms in some multithreaded benchmarks and being far ahead in gaming benchmarks.
|
|
 |
redpriest
Joined: 30 Aug 2007 Posts: 52
|
Posted: Thu Mar 27, 2008 9:14 pm Post subject: |
|
|
Take a look at the Sysmark score in the Q9300 review - the one with 3MB L2 cache per core-pair. Notice how it craters going from 4MB to 3MB in E-Learning? Productivity is also similarly affected.
|
|
 |
Opteron
Joined: 16 Mar 2008 Posts: 55
|
Posted: Thu Mar 27, 2008 11:00 pm Post subject: |
|
|
| Alberto wrote: | | jack wrote: |
That will depend on L3 latency. However, based on previous designs, it should be quite decent. Certainly better than Barcelona's L3, whose latency is quite horrible.
|
Is it horrible? Who says that? It's an L3 designed to be a 6MB L3.
There is an Intel slide that claims 33 cycles for the 32nm Sandy Bridge L3, and that cache doesn't seem FAST either; maybe the real question is how much a large L3 cache matters in a CPU with one or two MCs on die, at least in consumer applications. |
It is stated here:
http://www.digit-life.com/articles3/cpu/rmma-phenom.html
I guess one can use the adjective "horrible" in that context.
The good news for Shanghai is the die shrink. Therefore my hopes are that the northbridge and thus the L3 will get a speed bump, too. But we have to wait ...
However, the impact of a faster L3 does not seem to be enormous, as some tests with the new X4 9850 show:
http://www.xbitlabs.com/articles/cpu/display/phenom-x4-9850_4.html#sect0
cheers
Opteron
|
|
 |
jack
Joined: 27 Jun 2007 Posts: 333
|
Posted: Fri Mar 28, 2008 8:57 am Post subject: |
|
|
Yes, 20ns latency for 2MB L3 is quite horrible, considering that 65nm Conroe has less than 5ns latency for 4MB L2 (although this is shared only by two cores).
|
|
 |
redpriest
Joined: 30 Aug 2007 Posts: 52
|
Posted: Fri Mar 28, 2008 10:44 am Post subject: |
|
|
So what do you think Intel's latency of 109 cycles on their Tulsa L3 implementation was? Fast?
BTW, Penryn's L2 latency appears to be variable. 13-14 cycles best case, I've seen up to 17-18 cycles worst case.
|
|
 |
Opteron
Joined: 16 Mar 2008 Posts: 55
|
Posted: Fri Mar 28, 2008 11:10 am Post subject: |
|
|
| redpriest wrote: | So what do you think Intel's latency of 109 cycles on their Tulsa L3 implementation was? Fast? |
Of course, if you take its size into account (16MB) it was quite good ^^
| Quote: | BTW, Penryn's L2 latency appears to be variable. 13-14 cycles best case, I've seen up to 17-18 cycles worst case. |
As noted in the article I posted above, worst-case latency for the K10 L3 is 47-48 cycles ...
cheers
Opteron
|
|
 |
Paul DeMone
Joined: 29 Aug 2007 Posts: 459 Location: Great white north
|
Posted: Fri Mar 28, 2008 2:15 pm Post subject: |
|
|
| Opteron wrote: | | redpriest wrote: | So what do you think Intel's latency of 109 cycles on their Tulsa L3 implementation was? Fast?
| Of course, if you take its size into account (16MB) it was quite good ^^ Opteron |
Presuming the 109 cycle figure is correct it is not impressive.
The 12 MB L3 in Montvale (90 nm) = 14 cycles = 8.4 ns @ 1.667 GHz
The 16 MB L3 in Tulsa (65 nm) = 109 cycles = 31 ns @ 3.5 GHz
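The conversion behind these two figures is just cycles divided by clock frequency. A minimal sketch, using only the numbers quoted above (the helper name `cycles_to_ns` is made up for illustration):

```python
def cycles_to_ns(cycles: float, clock_ghz: float) -> float:
    """Convert a latency given in core clock cycles to nanoseconds."""
    # One cycle at f GHz lasts 1/f ns, so N cycles last N/f ns.
    return cycles / clock_ghz


# Figures as quoted in the post; they are not independently verified here.
montvale_ns = cycles_to_ns(14, 1.667)  # 12 MB L3, 90 nm
tulsa_ns = cycles_to_ns(109, 3.5)      # 16 MB L3, 65 nm

print(f"Montvale L3: {montvale_ns:.1f} ns")
print(f"Tulsa L3:    {tulsa_ns:.1f} ns")
```

Comparing in nanoseconds rather than cycles removes the clock-speed difference between the two chips from the picture.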
|
|
 |
jokerman
Joined: 22 Aug 2007 Posts: 25
|
Posted: Fri Mar 28, 2008 3:33 pm Post subject: |
|
|
Tulsa at Hot Chips
| Quote: |
Core to LLC load to use time: ~35 ns
- The technology allows a 16 MB cache to have < 9 ns access time
- All of the load-to-use stages and clock crossings cost a lot
|
slide 23
http://www.hotchips.org/archives/hc18/3_Tues/HC18.S9/HC18.S9T1.pdf
still, Tulsa showed ~70 % higher performance than Paxville.
|
|
 |
Alberto
Joined: 04 Sep 2007 Posts: 111 Location: Italy
|
Posted: Fri Mar 28, 2008 5:15 pm Post subject: |
|
|
| Paul DeMone wrote: | | Opteron wrote: | | redpriest wrote: | So what do you think Intel's latency of 109 cycles on their Tulsa L3 implementation was? Fast?
| Of course, if you take its size into account (16MB) it was quite good ^^ Opteron |
Presuming the 109 cycle figure is correct it is not impressive.
The 12 MB L3 in Montvale (90 nm) = 14 cycles = 8.4 ns @ 1.667 GHz
The 16 MB L3 in Tulsa (65 nm) = 109 cycles = 31 ns @ 3.5 GHz |
Tulsa's L3 is half speed, if I remember correctly your answer to a question of mine over a year ago, so it is more like 62ns. Something is wrong.
Alberto.
|
|
 |
Opteron
Joined: 16 Mar 2008 Posts: 55
|
Posted: Fri Mar 28, 2008 5:20 pm Post subject: |
|
|
| Paul DeMone wrote: | Presuming the 109 cycle figure is correct it is not impressive.
The 12 MB L3 in Montvale (90 nm) = 14 cycles = 8.4 ns @ 1.667 GHz
The 16 MB L3 in Tulsa (65 nm) = 109 cycles = 31 ns @ 3.5 GHz |
I do not know where the 109 cycles come from, but Tulsa's cache just runs at half the core clock. Furthermore, it is a traditional, synchronous design. For such an approach, access times were good.
Itanium uses an asynchronous approach. This is better in any case, but I do not know if the mentioned 14 cycles are also worst-case or not.
I guess Intel is smarter now with Nehalem and will use the "Itanium" caches for Nehalem ;-)
cheers
Opteron
|
|
 |
Paul DeMone
Joined: 29 Aug 2007 Posts: 459 Location: Great white north
|
Posted: Fri Mar 28, 2008 5:20 pm Post subject: |
|
|
| Alberto wrote: |
Tulsa's L3 is half speed, if I remember correctly your answer to a question of mine over a year ago, so it is more like 62ns. Something is wrong.
Alberto. |
The L3 latency will be reported in terms of CPU clock cycles. If the
L3 does run at half the processor frequency ( which is very likely)
that will be accounted for.
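To illustrate the point with the numbers from this thread: if the 109 cycles are already counted in core clocks, the half-speed L3 changes how many L3-domain cycles that is, not the elapsed time. The sketch below is a simplification that assumes the 3.5 GHz core clock used earlier in the thread and treats the whole access as if it could be expressed in either clock domain; the variable names are illustrative only.

```python
CORE_GHZ = 3.5               # core clock used for the Tulsa figure in this thread
UNCORE_GHZ = CORE_GHZ / 2    # L3/uncore assumed to run at half the core clock

latency_core_cycles = 109    # latency as reported, in core clock cycles

# Elapsed time follows from the core clock, since that is the unit of the report.
latency_ns = latency_core_cycles / CORE_GHZ

# The same elapsed time expressed in half-speed (uncore) cycles.
latency_uncore_cycles = latency_ns * UNCORE_GHZ

# The double-counting mistake: treating 109 as if it were uncore cycles.
double_counted_ns = latency_core_cycles / UNCORE_GHZ

print(f"{latency_ns:.1f} ns = {latency_uncore_cycles:.1f} uncore cycles")
print(f"double-counted: {double_counted_ns:.1f} ns")
```

The double-counted value lands at roughly the 62 ns figure raised in the question, which is where the apparent discrepancy comes from.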
|
|
 |
Opteron
Joined: 16 Mar 2008 Posts: 55
|
Posted: Fri Mar 28, 2008 5:25 pm Post subject: |
|
|
| Paul DeMone wrote: | The L3 latency will be reported in terms of CPU clock cycles. If the L3 does run at half the processor frequency (which is very likely) that will be accounted for. |
Yes it is half CPU clock:
| Quote: | Figure 5.3.4 shows the clock distribution map. Separate PLLs
and clock distribution trees drive each core and the associated L2
cache. A third PLL drives the uncore half-frequency clock. The
FSB uses the external bus clock (200MHz) and the quad-pumped
version (800MHz). The three PLLs are grouped together on the
left side of the die and the differential clock input is routed to
three pairs of C4 bumps inside the package. The uncore clock is
distributed through a balanced tree embedded in nine vertical
spines. De-skew circuits controlled by on-die fuses [3] reduce the
uncore clock skew to less than 11ps. To ensure that the uncore
logic is not in the full chip critical timing path, a 5% margin is
added to the uncore timing-verification flow. | Source: IEEE Presentation 2006
cheers
Opteron
|
|
 |
Paul DeMone
Joined: 29 Aug 2007 Posts: 459 Location: Great white north
|
Posted: Fri Mar 28, 2008 5:31 pm Post subject: |
|
|
| Opteron wrote: | | Itanium uses an asynchronous approach. |
The Madison 9M and all earlier IPF chips didn't. The 9 MB L3 in the
Madison also ran at 14 cycles at up to 1.667 GHz. The Montecito L3
used elements of asynch operation because it was designed to 1) be
33% bigger than the Madison 9M's L3, 2) operate at over 2.0 GHz,
and 3) keep latency at 14 cycles.
| Quote: | | This is better in any case, but I do not know if the mentioned 14 cycles are also worst-case or not. |
In IPF chips L3 accesses are initiated out of the L2 miss queue and can
take place out of order. The best case is 14 cycles and that reflects the
performance of the L3 itself (SRAM, global signal paths etc). When an
L3 access takes longer it is because of queueing delays and contention
between multiple access requests.
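A toy model of that last point, purely for illustration: the 14-cycle best case comes from the post, while the `issue_interval` parameter (how often the L3 can accept a new access) and the in-order service are assumptions of this sketch, not properties of any IPF chip. The real L3 services requests out of order, which this model ignores.

```python
BEST_CASE_CYCLES = 14  # unloaded L3 latency quoted in the post


def observed_latencies(arrivals, issue_interval):
    """Latency seen by each request, in cycles, under a simple FIFO model.

    arrivals: cycle numbers at which requests leave the L2 miss queue.
    issue_interval: hypothetical number of cycles between accepted accesses.
    """
    next_free = 0  # earliest cycle at which the L3 can accept another access
    latencies = []
    for t in sorted(arrivals):
        start = max(t, next_free)           # wait if the L3 is still busy
        next_free = start + issue_interval
        latencies.append((start - t) + BEST_CASE_CYCLES)
    return latencies


# Three requests arriving together contend; a later lone request does not.
print(observed_latencies([0, 0, 0, 20], issue_interval=2))  # [14, 16, 18, 14]
```

Only the waiting time grows under contention; the array access itself stays at the best-case figure.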
|
|
 |
Alberto
Joined: 04 Sep 2007 Posts: 111 Location: Italy
|
Posted: Fri Mar 28, 2008 5:35 pm Post subject: |
|
|
| Paul DeMone wrote: | | Alberto wrote: |
Tulsa's L3 is half speed, if I remember correctly your answer to a question of mine over a year ago, so it is more like 62ns. Something is wrong.
Alberto. |
The L3 latency will be reported in terms of CPU clock cycles. If the
L3 does run at half the processor frequency ( which is very likely)
that will be accounted for. |
So the LARGE Tulsa cache is not so bad: without power concerns and at full speed it would have a latency of around 15ns. Yet IPF looks better ;-).
Thanks.
Alberto.
|
|