Aceshardware
(not so) temporary home for the aceshardware community
 

Phenom review is available
Alberto



Joined: 04 Sep 2007
Posts: 111
Location: Italy

Posted: Thu Mar 27, 2008 7:52 pm

jack wrote:
Pjotr wrote:
Johan wrote:
I really would like to understand the effects of the cache a bit more. Considering that each core in Barcelona has only a 512 KB cache before it has to go to a slow L3, I can imagine that is far from optimal for most desktop apps. Even Shanghai will not change this. Penryn, with its huge low-latency L2 cache, is really ideal for desktop apps.


And where would that put Nehalem, going away from the big L2 to smaller L2 with a separate L3 instead? Perhaps it is a better design when you are not limited by a FSB.


That will depend on L3 latency. However, based on previous designs, it should be quite decent. Certainly better than Barcelona's L3, whose latency is quite horrible.



Is it horrible? Who says that? It is an L3 designed to be a 6MB L3.
There is an Intel slide that claims 33 cycles for the 32nm Sandy Bridge L3, and that cache doesn't seem FAST either. Maybe the real question is how much a large L3 cache matters in a CPU with one or two MCs on die, at least in consumer applications.

Alberto.


Last edited by Alberto on Fri Mar 28, 2008 1:34 pm; edited 1 time in total
jack



Joined: 27 Jun 2007
Posts: 333

Posted: Thu Mar 27, 2008 9:03 pm

EaS wrote:
Some of the first details on IPC gains that AMD expects going from Barcelona to Shanghai


Finally a benchmark which includes 45nm dual cores in comparison! Too bad that they didn't include 3GHz/$183 Wolfdale results. 2.66GHz E8200 still performed quite well, overtaking low-end Phenoms in some multithreaded benchmarks and being far ahead in gaming benchmarks.
redpriest



Joined: 30 Aug 2007
Posts: 52

Posted: Thu Mar 27, 2008 9:14 pm

Take a look at the Sysmark score in the Q9300 review - the one with 3MB L2 cache per core-pair. Notice how it craters going from 4MB to 3MB in E-Learning? Productivity is also similarly affected.
Opteron



Joined: 16 Mar 2008
Posts: 55

Posted: Thu Mar 27, 2008 11:00 pm

Alberto wrote:
jack wrote:


That will depend on L3 latency. However, based on previous designs, it should be quite decent. Certainly better than Barcelona's L3, whose latency is quite horrible.



Is it horrible? Who says that? It is an L3 designed to be a 6MB L3.
There is an Intel slide that claims 33 cycles for the 32nm Sandy Bridge L3, and that cache doesn't seem FAST either. Maybe the real question is how much a large L3 cache matters in a CPU with one or two MCs on die, at least in consumer applications.


It is stated here:
http://www.digit-life.com/articles3/cpu/rmma-phenom.html

I guess one can use the adjective "horrible" in that context.

The good news for Shanghai is the die shrink. Therefore my hopes are that the northbridge and thus the L3 will get a speed bump, too. But we have to wait ...

However, the impact of a faster L3 does not seem to be enormous, as some tests with the new X4 9850 show:
http://www.xbitlabs.com/articles/cpu/display/phenom-x4-9850_4.html#sect0


cheers

Opteron
jack



Joined: 27 Jun 2007
Posts: 333

Posted: Fri Mar 28, 2008 8:57 am

Yes, 20ns latency for 2MB L3 is quite horrible, considering that 65nm Conroe has less than 5ns latency for 4MB L2 (although this is shared only by two cores).
redpriest



Joined: 30 Aug 2007
Posts: 52

Posted: Fri Mar 28, 2008 10:44 am

So what do you think Intel's latency of 109 cycles on their Tulsa L3 implementation was? Fast?

BTW, Penryn's L2 latency appears to be variable. 13-14 cycles best case, I've seen up to 17-18 cycles worst case.
Opteron



Joined: 16 Mar 2008
Posts: 55

Posted: Fri Mar 28, 2008 11:10 am

redpriest wrote:
So what do you think Intel's latency of 109 cycles on their Tulsa L3 implementation was? Fast?
Of course, if you take its size into account (16MB) it was quite good ^^
Quote:
BTW, Penryn's L2 latency appears to be variable. 13-14 cycles best case, I've seen up to 17-18 cycles worst case.
As it is noted in the article I posted above, worst case latency for K10 L3 is 47-48 cycles ...

cheers

Opteron
Paul DeMone



Joined: 29 Aug 2007
Posts: 459
Location: Great white north

Posted: Fri Mar 28, 2008 2:15 pm

Opteron wrote:
redpriest wrote:
So what do you think Intel's latency of 109 cycles on their Tulsa L3 implementation was? Fast?
Of course, if you take its size into account (16MB) it was quite good ^^ Opteron


Presuming the 109 cycle figure is correct it is not impressive.

The 12 MB L3 in Montvale (90 nm) = 14 cycles = 8.4 ns @ 1.667 GHz

The 16 MB L3 in Tulsa (65 nm) = 109 cycles = 31 ns @ 3.5 GHz
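
The conversion behind those two lines is just cycles divided by clock frequency. A minimal sketch in Python, using only the cycle counts and clocks quoted above (the function name `cycles_to_ns` is made up for illustration, and the input figures are taken from the post as-is, not independently verified):

```python
def cycles_to_ns(cycles: float, clock_ghz: float) -> float:
    """Convert a latency in core clock cycles to nanoseconds.

    One cycle at f GHz lasts 1/f ns, so latency_ns = cycles / f.
    """
    if clock_ghz <= 0:
        raise ValueError("clock_ghz must be positive")
    return cycles / clock_ghz


# Figures as quoted in the post; treat them as assumptions.
montvale_ns = cycles_to_ns(14, 1.667)   # 12 MB L3, 90 nm
tulsa_ns = cycles_to_ns(109, 3.5)       # 16 MB L3, 65 nm

print(f"Montvale L3: {montvale_ns:.1f} ns")
print(f"Tulsa L3:    {tulsa_ns:.1f} ns")
```

The same helper can be reused for any of the cycle counts mentioned in this thread, provided the clock of the tested part is known.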
jokerman



Joined: 22 Aug 2007
Posts: 25

Posted: Fri Mar 28, 2008 3:33 pm

Tulsa at Hot Chips

Quote:

Core to LLC load to use time: ~35 ns
- The technology allows a 16 MB cache to have < 9 ns access time
- All of the load-to-use stages and clock crossings cost a lot


slide 23
http://www.hotchips.org/archives/hc18/3_Tues/HC18.S9/HC18.S9T1.pdf

Still, Tulsa showed ~70% higher performance than Paxville.
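
Taking the two slide figures at face value, the split between the raw cache access and everything around it can be worked out as below. This is only a back-of-the-envelope sketch: it treats "< 9 ns" as an upper bound on the array access, so the overhead it prints is a lower bound, and the variable names are invented for illustration.

```python
LOAD_TO_USE_NS = 35.0   # "~35 ns" core-to-LLC load-to-use time from the slide
ARRAY_ACCESS_NS = 9.0   # "< 9 ns" access time, used here as an upper bound

# Whatever is not spent in the cache array itself is spent in the
# load-to-use pipeline stages and clock-domain crossings.
overhead_ns = LOAD_TO_USE_NS - ARRAY_ACCESS_NS
overhead_share = overhead_ns / LOAD_TO_USE_NS

print(f"stages and clock crossings: at least {overhead_ns:.0f} ns "
      f"({overhead_share:.0%} of the load-to-use time)")
```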
Alberto



Joined: 04 Sep 2007
Posts: 111
Location: Italy

Posted: Fri Mar 28, 2008 5:15 pm

Paul DeMone wrote:
Opteron wrote:
redpriest wrote:
So what do you think Intel's latency of 109 cycles on their Tulsa L3 implementation was? Fast?
Of course, if you take its size into account (16MB) it was quite good ^^ Opteron


Presuming the 109 cycle figure is correct it is not impressive.

The 12 MB L3 in Montvale (90 nm) = 14 cycles = 8.4 ns @ 1.667 GHz

The 16 MB L3 in Tulsa (65 nm) = 109 cycles = 31 ns @ 3.5 GHz


Tulsa's L3 is half speed, if I remember correctly your answer to a question of mine over a year ago, so it is more like 62ns. Something is wrong.

Alberto.
Opteron



Joined: 16 Mar 2008
Posts: 55

Posted: Fri Mar 28, 2008 5:20 pm

Paul DeMone wrote:
Presuming the 109 cycle figure is correct it is not impressive.

The 12 MB L3 in Montvale (90 nm) = 14 cycles = 8.4 ns @ 1.667 GHz

The 16 MB L3 in Tulsa (65 nm) = 109 cycles = 31 ns @ 3.5 GHz


I do not know where the 109 cycles come from, but Tulsa's cache just runs at half the core clock. Furthermore, it is a traditional, synchronous design. For such an approach, access times were good.

Itanium uses an asynchronous approach. This is better in any case, but I do not know if the mentioned 14 cycles are also worst-case or not.

I guess Intel is smarter now with Nehalem and will use the "Itanium" caches for Nehalem ;-)

cheers

Opteron
Paul DeMone



Joined: 29 Aug 2007
Posts: 459
Location: Great white north

Posted: Fri Mar 28, 2008 5:20 pm

Alberto wrote:

Tulsa's L3 is half speed, if I remember correctly your answer to a question of mine over a year ago, so it is more like 62ns. Something is wrong.

Alberto.


The L3 latency will be reported in terms of CPU clock cycles. If the
L3 does run at half the processor frequency ( which is very likely)
that will be accounted for.
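
To illustrate the point about units: when a latency is counted in core clock cycles, a half-speed L3 simply shows up as a larger core-cycle count, and the time in ns follows from the core clock alone. A rough sketch, assuming the 109 cycle and 3.5 GHz figures quoted earlier in the thread and a half-speed L3 clock (the constant names are hypothetical):

```python
CORE_CLOCK_GHZ = 3.5        # core clock quoted in the thread
L3_CLOCK_RATIO = 0.5        # assumed: L3/uncore runs at half the core clock
LATENCY_CORE_CYCLES = 109   # latency as reported, counted in core cycles

# The elapsed time is fixed by the core-cycle count and the core clock.
latency_ns = LATENCY_CORE_CYCLES / CORE_CLOCK_GHZ

# The same interval expressed in cycles of the slower L3 clock.
l3_clock_ghz = CORE_CLOCK_GHZ * L3_CLOCK_RATIO
latency_l3_cycles = latency_ns * l3_clock_ghz

# Dividing the core-cycle count by the L3 clock counts the slowdown twice.
double_counted_ns = LATENCY_CORE_CYCLES / l3_clock_ghz

print(f"{latency_ns:.1f} ns = {latency_l3_cycles:.1f} L3-clock cycles")
print(f"double-counted: {double_counted_ns:.1f} ns")
```

The double-counted value is where a figure of roughly 62 ns would come from; it only applies if the 109 cycles were counted in L3 clocks rather than core clocks.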
Opteron



Joined: 16 Mar 2008
Posts: 55

Posted: Fri Mar 28, 2008 5:25 pm

Paul DeMone wrote:
The L3 latency will be reported in terms of CPU clock cycles. If the L3 does run at half the processor frequency (which is very likely) that will be accounted for.
Yes it is half CPU clock:
Quote:
Figure 5.3.4 shows the clock distribution map. Separate PLLs
and clock distribution trees drive each core and the associated L2
cache. A third PLL drives the uncore half-frequency clock. The
FSB uses the external bus clock (200MHz) and the quad-pumped
version (800MHz). The three PLLs are grouped together on the
left side of the die and the differential clock input is routed to
three pairs of C4 bumps inside the package. The uncore clock is
distributed through a balanced tree embedded in nine vertical
spines. De-skew circuits controlled by on-die fuses [3] reduce the
uncore clock skew to less than 11ps. To ensure that the uncore
logic is not in the full chip critical timing path, a 5% margin is
added to the uncore timing-verification flow.
Source: IEEE Presentation 2006

cheers

Opteron
Paul DeMone



Joined: 29 Aug 2007
Posts: 459
Location: Great white north

Posted: Fri Mar 28, 2008 5:31 pm

Opteron wrote:
Itanium uses an asynchronous approach.


The Madison 9M and all earlier IPF chips didn't. The 9 MB L3 in the
Madison also ran at 14 cycles at up to 1.667 GHz. The Montecito L3
used elements of asynch operation because it was designed to 1) be
33% bigger than the Madison 9M's L3, 2) operate at over 2.0 GHz,
and 3) keep latency at 14 cycles.

Quote:
This is better in any case, but I do not know if the mentioned 14 cycles are also worst-case or not.


In IPF chips L3 accesses are initiated out of the L2 miss queue and can
take place out of order. The best case is 14 cycles and that reflects the
performance of the L3 itself (SRAM, global signal paths etc). When an
L3 access takes longer it is because of queueing delays and contention
between multiple access requests.
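
The difference between best-case and observed latency can be pictured with a toy queueing model. This is not how the IPF L3 is actually implemented; it assumes, purely for illustration, that the cache can start one new access every `issue_interval` cycles (an invented parameter) and that every access then takes the 14 cycle best case.

```python
BASE_LATENCY = 14  # best-case L3 latency in cycles, from the post


def observed_latencies(arrival_cycles, issue_interval=4):
    """Toy model: the L3 starts one new access every `issue_interval` cycles.

    `issue_interval` is a made-up parameter, not a documented figure.
    Returns the latency each request sees: queueing delay + BASE_LATENCY.
    """
    next_free = 0
    latencies = []
    for arrival in sorted(arrival_cycles):
        start = max(arrival, next_free)   # wait if the cache is still busy
        latencies.append(start - arrival + BASE_LATENCY)
        next_free = start + issue_interval
    return latencies


print(observed_latencies([0, 100, 200]))  # well-spaced requests
print(observed_latencies([0, 1, 2]))      # a burst of back-to-back misses
```

Well-spaced requests all see the best case, while requests in a burst see progressively longer latencies even though the cache itself is no slower.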
Alberto



Joined: 04 Sep 2007
Posts: 111
Location: Italy

Posted: Fri Mar 28, 2008 5:35 pm

Paul DeMone wrote:
Alberto wrote:

Tulsa's L3 is half speed, if I remember correctly your answer to a question of mine over a year ago, so it is more like 62ns. Something is wrong.

Alberto.


The L3 latency will be reported in terms of CPU clock cycles. If the
L3 does run at half the processor frequency ( which is very likely)
that will be accounted for.


So the LARGE Tulsa cache is not so bad; without power concerns and at full speed it would have a latency of around 15ns. Yet IPF looks better ;-).

Thanks.

Alberto.
All times are GMT + 1 Hour