Aceshardware

(not so) temporary home for the aceshardware community

All times are UTC + 1 hour



 Post subject:
PostPosted: Thu Mar 27, 2008 7:52 pm 
Offline

Joined: Tue Sep 04, 2007 8:13 am
Posts: 111
Location: Italy
jack wrote:
Pjotr wrote:
Johan wrote:
I really would like to understand the effects of the cache a bit more. Considering that each core in Barcelona has only a 512 KB cache before it has to go to a slow L3, I can imagine that is far from optimal for most desktop apps. Even Shanghai will not change this. Penryn with its huge low-latency L2 cache is really ideal for desktop apps.


And where would that put Nehalem, going away from the big L2 to smaller L2 with a separate L3 instead? Perhaps it is a better design when you are not limited by a FSB.


That will depend on L3 latency. However, based on previous designs, it should be quite decent. Certainly better than Barcelona's L3, whose latency is quite horrible.



Is it horrible? Who says that? It's an L3 designed to be a 6MB L3.
There is an Intel slide that claims 33 cycles for the 32nm Sandy Bridge L3, and that cache doesn't seem FAST; maybe the real question is how much a large L3 cache matters in a CPU with one or two MCs on die, at least in consumer applications.

Alberto.


Last edited by Alberto on Fri Mar 28, 2008 1:34 pm, edited 1 time in total.

 Post subject:
PostPosted: Thu Mar 27, 2008 9:03 pm 
Offline

Joined: Wed Jun 27, 2007 1:38 pm
Posts: 479
EaS wrote:
Some of the first details on IPC gains that AMD expects going from Barcelona to Shanghai


Finally a benchmark which includes 45nm dual cores in comparison! Too bad that they didn't include 3GHz/$183 Wolfdale results. 2.66GHz E8200 still performed quite well, overtaking low-end Phenoms in some multithreaded benchmarks and being far ahead in gaming benchmarks.


 Post subject:
PostPosted: Thu Mar 27, 2008 9:14 pm 
Offline

Joined: Thu Aug 30, 2007 6:50 pm
Posts: 68
Take a look at the Sysmark score in the Q9300 review - the one with 3MB L2 cache per core-pair. Notice how it craters going from 4MB to 3MB in E-Learning? Productivity is also similarly affected.


 Post subject:
PostPosted: Thu Mar 27, 2008 11:00 pm 
Offline

Joined: Sun Mar 16, 2008 3:20 pm
Posts: 82
Alberto wrote:
jack wrote:

That will depend on L3 latency. However, based on previous designs, it should be quite decent. Certainly better than Barcelona's L3, whose latency is quite horrible.



Is it horrible? Who says that? It's an L3 designed to be a 6MB L3.
There is an Intel slide that claims 33 cycles for the 32nm Sandy Bridge L3, and that cache doesn't seem FAST; maybe the real question is how much a large L3 cache matters in a CPU with one or two MCs on die, at least in consumer applications.


It is stated here:
http://www.digit-life.com/articles3/cpu ... henom.html

I guess one can use the adjective "horrible" in that context.

The good news for Shanghai is the die shrink. Therefore my hopes are that the northbridge and thus the L3 will get a speed bump, too. But we have to wait ...

However, the impact of a faster L3 does not seem to be enormous, as some tests with the new X4 9850 show:
http://www.xbitlabs.com/articles/cpu/di ... html#sect0


cheers

Opteron


 Post subject:
PostPosted: Fri Mar 28, 2008 8:57 am 
Offline

Joined: Wed Jun 27, 2007 1:38 pm
Posts: 479
Yes, 20ns latency for 2MB L3 is quite horrible, considering that 65nm Conroe has less than 5ns latency for 4MB L2 (although this is shared only by two cores).


 Post subject:
PostPosted: Fri Mar 28, 2008 10:44 am 
Offline

Joined: Thu Aug 30, 2007 6:50 pm
Posts: 68
So what do you think Intel's latency of 109 cycles on their Tulsa L3 implementation was? Fast?

BTW, Penryn's L2 latency appears to be variable. 13-14 cycles best case, I've seen up to 17-18 cycles worst case.


 Post subject:
PostPosted: Fri Mar 28, 2008 11:10 am 
Offline

Joined: Sun Mar 16, 2008 3:20 pm
Posts: 82
redpriest wrote:
So what do you think Intel's latency of 109 cycles on their Tulsa L3 implementation was? Fast?
Of course, if you take its size into account (16MB) it was quite good ^^
Quote:
BTW, Penryn's L2 latency appears to be variable. 13-14 cycles best case, I've seen up to 17-18 cycles worst case.
As noted in the article I posted above, worst-case latency for the K10 L3 is 47-48 cycles ...

cheers

Opteron


 Post subject:
PostPosted: Fri Mar 28, 2008 2:15 pm 
Offline

Joined: Wed Aug 29, 2007 3:55 pm
Posts: 829
Location: Great white north
Opteron wrote:
redpriest wrote:
So what do you think Intel's latency of 109 cycles on their Tulsa L3 implementation was? Fast?
Of course, if you take its size into account (16MB) it was quite good ^^ Opteron


Presuming the 109 cycle figure is correct it is not impressive.

The 12 MB L3 in Montvale (90 nm) = 14 cycles = 8.4 ns @ 1.667 GHz

The 16 MB L3 in Tulsa (65 nm) = 109 cycles = 31 ns @ 3.5 GHz
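
The conversions above follow from latency in ns = cycles / clock in GHz. A minimal Python sketch that reproduces them; the function name `cycles_to_ns` is made up for illustration, and the inputs are simply the figures quoted in this post, not independently verified:

```python
def cycles_to_ns(cycles: float, core_clock_ghz: float) -> float:
    """Convert a latency in core clock cycles to nanoseconds.

    One cycle at f GHz lasts 1/f ns, so latency_ns = cycles / f.
    """
    if core_clock_ghz <= 0:
        raise ValueError("clock frequency must be positive")
    return cycles / core_clock_ghz


# Figures as quoted in the post above (assumed, not re-measured).
montvale_ns = cycles_to_ns(14, 1.667)   # 12 MB L3, 90 nm
tulsa_ns = cycles_to_ns(109, 3.5)       # 16 MB L3, 65 nm

print(f"Montvale L3: {montvale_ns:.1f} ns")
print(f"Tulsa L3:    {tulsa_ns:.1f} ns")
```

Keeping the clock as an explicit argument makes it easy to redo the comparison if a different clock speed turns out to be the right one for either chip.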


 Post subject:
PostPosted: Fri Mar 28, 2008 3:33 pm 
Offline

Joined: Wed Aug 22, 2007 9:24 am
Posts: 28
Tulsa at Hot Chips

Quote:
Core to LLC load to use time: ~35 ns
- The technology allows a 16 MB cache to have < 9 ns access time
- All of the load-to-use stages and clock crossings cost a lot


slide 23
http://www.hotchips.org/archives/hc18/3 ... 8.S9T1.pdf

Still, Tulsa showed ~70% higher performance than Paxville.
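
To put the slide's two numbers side by side: if the array itself can be accessed in under 9 ns while the core sees about 35 ns, then at least 26 ns is spent outside the array. A small Python sketch of that arithmetic, treating 9 ns as an upper bound and 35 ns as exact (both are simplifications of the slide's "<" and "~"); the constant names are illustrative only:

```python
LOAD_TO_USE_NS = 35.0       # "~35 ns" core-to-LLC load-to-use time from the slide
ARRAY_ACCESS_MAX_NS = 9.0   # "< 9 ns" access time, used here as an upper bound

# Everything that is not raw array access: load-to-use stages, clock crossings.
overhead_min_ns = LOAD_TO_USE_NS - ARRAY_ACCESS_MAX_NS
overhead_min_share = overhead_min_ns / LOAD_TO_USE_NS

print(f"overhead: at least {overhead_min_ns:.0f} ns "
      f"({overhead_min_share:.0%} of load-to-use time)")
```

Under these assumptions roughly three quarters of the latency comes from the path to and from the cache rather than from the SRAM array itself.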


 Post subject:
PostPosted: Fri Mar 28, 2008 5:15 pm 
Offline

Joined: Tue Sep 04, 2007 8:13 am
Posts: 111
Location: Italy
Paul DeMone wrote:
Opteron wrote:
redpriest wrote:
So what do you think Intel's latency of 109 cycles on their Tulsa L3 implementation was? Fast?
Of course, if you take its size into account (16MB) it was quite good ^^ Opteron


Presuming the 109 cycle figure is correct it is not impressive.

The 12 MB L3 in Montvale (90 nm) = 14 cycles = 8.4 ns @ 1.667 GHz

The 16 MB L3 in Tulsa (65 nm) = 109 cycles = 31 ns @ 3.5 GHz


Tulsa's L3 runs at half speed, if I correctly remember your answer to a question of mine over a year ago, so it is more like 62ns. Something is wrong.

Alberto.


 Post subject:
PostPosted: Fri Mar 28, 2008 5:20 pm 
Offline

Joined: Sun Mar 16, 2008 3:20 pm
Posts: 82
Paul DeMone wrote:
Presuming the 109 cycle figure is correct it is not impressive.

The 12 MB L3 in Montvale (90 nm) = 14 cycles = 8.4 ns @ 1.667 GHz

The 16 MB L3 in Tulsa (65 nm) = 109 cycles = 31 ns @ 3.5 GHz


I do not know where the 109 cycles come from, but Tulsa's cache just runs at half the core clock. Furthermore, it is a traditional, synchronous design. For such an approach, access times were good.

Itanium uses an asynchronous approach. This is better in any case, but I do not know if the mentioned 14 cycles are also worst-case or not.

I guess Intel is smarter now with Nehalem and will use the "Itanium" caches for Nehalem ;-)

cheers

Opteron


 Post subject:
PostPosted: Fri Mar 28, 2008 5:20 pm 
Offline

Joined: Wed Aug 29, 2007 3:55 pm
Posts: 829
Location: Great white north
Alberto wrote:
Tulsa's L3 runs at half speed, if I correctly remember your answer to a question of mine over a year ago, so it is more like 62ns. Something is wrong.

Alberto.


The L3 latency will be reported in terms of CPU clock cycles. If the
L3 does run at half the processor frequency (which is very likely)
that will be accounted for.
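
As a quick check of this point, here is a Python sketch. It assumes the 109-cycle and 3.5 GHz figures quoted earlier in the thread and an L3 clocked at exactly half the core clock; the variable names are illustrative only. Counting the same interval in L3 clocks halves the cycle count but doubles the cycle time, so the nanosecond figure does not change. The ~62 ns figure only appears if the 109 cycles are read as L3 cycles:

```python
CORE_CLOCK_GHZ = 3.5                # core clock quoted earlier in the thread
L3_CLOCK_GHZ = CORE_CLOCK_GHZ / 2   # assumption: L3 at exactly half the core clock
LATENCY_CORE_CYCLES = 109           # latency as reported, in core clock cycles

# Wall-clock latency computed from the core-cycle figure.
latency_ns = LATENCY_CORE_CYCLES / CORE_CLOCK_GHZ

# The same interval expressed in (slower) L3 clock cycles.
latency_l3_cycles = latency_ns * L3_CLOCK_GHZ

# Converting back from L3 cycles gives the same wall-clock time.
latency_ns_from_l3 = latency_l3_cycles / L3_CLOCK_GHZ

# The mistaken reading: treating 109 as L3 cycles doubles the result.
mistaken_ns = LATENCY_CORE_CYCLES / L3_CLOCK_GHZ

print(f"{latency_l3_cycles:.1f} L3 cycles = {latency_ns:.1f} ns")
print(f"{mistaken_ns:.1f} ns if the 109 cycles were L3 cycles")
```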


 Post subject:
PostPosted: Fri Mar 28, 2008 5:25 pm 
Offline

Joined: Sun Mar 16, 2008 3:20 pm
Posts: 82
Paul DeMone wrote:
The L3 latency will be reported in terms of CPU clock cycles. If the L3 does run at half the processor frequency (which is very likely) that will be accounted for.
Yes it is half CPU clock:
Quote:
Figure 5.3.4 shows the clock distribution map. Separate PLLs
and clock distribution trees drive each core and the associated L2
cache. A third PLL drives the uncore half-frequency clock. The
FSB uses the external bus clock (200MHz) and the quad-pumped
version (800MHz). The three PLLs are grouped together on the
left side of the die and the differential clock input is routed to
three pairs of C4 bumps inside the package. The uncore clock is
distributed through a balanced tree embedded in nine vertical
spines. De-skew circuits controlled by on-die fuses [3] reduce the
uncore clock skew to less than 11ps. To ensure that the uncore
logic is not in the full chip critical timing path, a 5% margin is
added to the uncore timing-verification flow.
Source: IEEE Presentation 2006

cheers

Opteron


 Post subject:
PostPosted: Fri Mar 28, 2008 5:31 pm 
Offline

Joined: Wed Aug 29, 2007 3:55 pm
Posts: 829
Location: Great white north
Opteron wrote:
Itanium uses an asynchronous approach.


The Madison 9M and all earlier IPF chips didn't. The 9 MB L3 in the
Madison also ran at 14 cycles at up to 1.667 GHz. The Montecito L3
used elements of asynchronous operation because it was designed to 1)
be 33% bigger than the Madison 9M's L3, 2) operate at over 2.0 GHz,
and 3) keep latency at 14 cycles.

Quote:
This is better in any case, but I do not know if the mentioned 14 cycles are also worst-case or not.


In IPF chips L3 accesses are initiated out of the L2 miss queue and can
take place out of order. The best case is 14 cycles and that reflects the
performance of the L3 itself (SRAM, global signal paths etc). When an
L3 access takes longer it is because of queueing delays and contention
between multiple access requests.


 Post subject:
PostPosted: Fri Mar 28, 2008 5:35 pm 
Offline

Joined: Tue Sep 04, 2007 8:13 am
Posts: 111
Location: Italy
Paul DeMone wrote:
Alberto wrote:
Tulsa's L3 runs at half speed, if I correctly remember your answer to a question of mine over a year ago, so it is more like 62ns. Something is wrong.

Alberto.


The L3 latency will be reported in terms of CPU clock cycles. If the
L3 does run at half the processor frequency (which is very likely)
that will be accounted for.


So the LARGE Tulsa cache is not so bad: without power concerns and at full speed it would have a latency of around 15ns. Yet IPF looks better ;-).

Thanks.

Alberto.

