I really would like to understand the effects of the cache a bit more. Considering that each core in Barcelona has only a 512 KB cache before it has to go to a slow L3, I can imagine that is far from optimal for most desktop apps. Even Shanghai will not change this. Penryn with its huge low-latency L2 cache is really ideal for desktop apps.
And where would that put Nehalem, which moves away from the big L2 to a smaller L2 with a separate L3 instead? Perhaps it is a better design when you are not limited by an FSB.
That will depend on L3 latency. However, based on previous designs, it should be quite decent. Certainly better than Barcelona's L3, whose latency is quite horrible.
Is it horrible? Who says that? It's an L3 designed to be a single 6MB L3.
There is an Intel slide that claims 33 cycles for the 32nm Sandy Bridge L3, and that cache doesn't seem FAST either; maybe the real question is how much a large L3 cache matters in a CPU with one or two MCs on die, at least in consumer applications.
Alberto.
Last edited by Alberto on Fri Mar 28, 2008 1:34 pm, edited 1 time in total.
Some of the first details on IPC gains that AMD expects going from Barcelona to Shanghai
Finally a benchmark which includes 45nm dual cores in comparison! Too bad that they didn't include 3GHz/$183 Wolfdale results. 2.66GHz E8200 still performed quite well, overtaking low-end Phenoms in some multithreaded benchmarks and being far ahead in gaming benchmarks.
Take a look at the Sysmark score in the Q9300 review - the one with 3MB L2 cache per core-pair. Notice how it craters going from 4MB to 3MB in E-Learning? Productivity is also similarly affected.
I guess one can use the adjective "horrible" in that context.
The good news for Shanghai is the die shrink. Therefore my hopes are that the northbridge and thus the L3 will get a speed bump, too. But we have to wait ...
Yes, 20ns latency for 2MB L3 is quite horrible, considering that 65nm Conroe has less than 5ns latency for 4MB L2 (although this is shared only by two cores).
Core to LLC load-to-use time: ~35 ns - The technology allows a 16 MB cache to have < 9 ns access time - All of the load-to-use stages and clock crossings cost a lot
Presuming the 109-cycle figure is correct, it is not impressive.
The 12 MB L3 in Montvale (90 nm) = 14 cycles = 8.4 ns @ 1.667 GHz
The 16 MB L3 in Tulsa (65 nm) = 109 cycles = 31 ns @ 3.5 GHz
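As a quick sanity check on the two figures above: a cycle count converts to time by dividing by the clock frequency in GHz. A minimal sketch, assuming both cycle counts are expressed in core clock cycles; the function name `cycles_to_ns` is made up for illustration.

```python
def cycles_to_ns(cycles: float, clock_ghz: float) -> float:
    """Convert a latency in clock cycles to nanoseconds.

    One cycle at f GHz lasts 1/f ns, so latency_ns = cycles / f.
    """
    if clock_ghz <= 0:
        raise ValueError("clock_ghz must be positive")
    return cycles / clock_ghz


# Cycle counts and clock speeds are the ones quoted in the posts above.
montvale_ns = cycles_to_ns(14, 1.667)
tulsa_ns = cycles_to_ns(109, 3.5)

print(f"Montvale L3: {montvale_ns:.1f} ns")
print(f"Tulsa L3:    {tulsa_ns:.1f} ns")
```

Both results land on the numbers given in the post (roughly 8.4 ns and 31 ns), so the arithmetic is consistent as long as the cycles are counted at the core clock.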
I do not know where the 109 cycles come from, but Tulsa's cache just runs at half the core clock. Furthermore, it is a traditional, synchronous design. For such an approach, the access times were good.
Itanium uses an asynchronous approach. This is better in any case, but I do not know if the mentioned 14 cycles are also worst-case or not.
I guess Intel is smarter now with Nehalem and will use the "Itanium" caches for Nehalem ;-)
Joined: Wed Aug 29, 2007 3:55 pm Posts: 829 Location: Great white north
Alberto wrote:
Tulsa's L3 runs at half speed, if I correctly remember your answer to a question of mine over a year ago, so it is more like 62 ns. Something is wrong.
Alberto.
The L3 latency will be reported in terms of CPU clock cycles. If the
L3 does run at half the processor frequency (which is very likely),
that will be accounted for.
Yes, it runs at half the CPU clock:
Quote:
Figure 5.3.4 shows the clock distribution map. Separate PLLs and clock distribution trees drive each core and the associated L2 cache. A third PLL drives the uncore half-frequency clock. The FSB uses the external bus clock (200MHz) and the quad-pumped version (800MHz). The three PLLs are grouped together on the left side of the die and the differential clock input is routed to three pairs of C4 bumps inside the package. The uncore clock is distributed through a balanced tree embedded in nine vertical spines. De-skew circuits controlled by on-die fuses [3] reduce the uncore clock skew to less than 11ps. To ensure that the uncore logic is not in the full chip critical timing path, a 5% margin is added to the uncore timing-verification flow.
Opteron wrote:
Itanium uses an asynchronous approach.
The Madison 9M and all earlier IPF chips didn't. The 9 MB L3 in the Madison also ran at 14 cycles at up to 1.667 GHz. The Montecito L3 used elements of asynchronous operation because it was designed to 1) be 33% bigger than the Madison 9M's L3, 2) operate at over 2.0 GHz, and 3) keep latency at 14 cycles.
Quote:
This is better in any case, but I do not know if the mentioned 14 cycles are also worst-case or not.
In IPF chips L3 accesses are initiated out of the L2 miss queue and can
take place out of order. The best case is 14 cycles and that reflects the
performance of the L3 itself (SRAM, global signal paths etc). When an
L3 access takes longer it is because of queueing delays and contention
between multiple access requests.
So the LARGE Tulsa cache is not so bad: without power concerns and at full speed, it would have a latency of around 15 ns. Yet IPF looks better ;-).
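The back-and-forth above comes down to which clock the 109 cycles are counted in. A rough sketch of the three readings, assuming the 3.5 GHz core clock and half-frequency uncore mentioned in the thread; the "full speed" case simply assumes latency halves when the cache clock doubles, which is a simplification and not a measured figure.

```python
CORE_GHZ = 3.5               # core clock quoted in the thread
UNCORE_GHZ = CORE_GHZ / 2    # uncore (L3) runs at half the core clock
L3_CYCLES = 109              # latency figure quoted in the thread

# Reading 1: the 109 cycles are core cycles, so the half-speed L3
# is already accounted for in the count.
as_core_cycles_ns = L3_CYCLES / CORE_GHZ

# Reading 2: the 109 cycles are uncore cycles, which doubles the time.
as_uncore_cycles_ns = L3_CYCLES / UNCORE_GHZ

# Hypothetical full-speed L3, assuming latency scales inversely with
# the cache clock (an assumption made only for this estimate).
full_speed_estimate_ns = as_core_cycles_ns / 2

print(f"counted in core cycles:   {as_core_cycles_ns:.1f} ns")
print(f"counted in uncore cycles: {as_uncore_cycles_ns:.1f} ns")
print(f"full-speed estimate:      {full_speed_estimate_ns:.1f} ns")
```

The first reading gives about 31 ns, the second about 62 ns, and the full-speed estimate lands near the 15 ns mentioned above.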