Aceshardware

(not so) temporary home for the aceshardware community
 Post subject:
PostPosted: Sun Oct 07, 2007 5:03 pm 

Joined: Fri Oct 05, 2007 7:46 am
Posts: 167
Hans de Vries wrote:

The L2 and L3 caches have the same density. The local access to an array
is actually a small part of the total access time. (L3 = 16 of these arrays.)

Regards, Hans


This layout looks very odd for a couple of reasons...

1. The overall die is symmetric about only one axis. This is very 'weird', as it does not minimize the worst-case path lengths between cores.

2. The L2/L3 array is not shared between all 4 cores, or at least that is what would be inferred. Are you certain this is an L2/L3 array and not just 2 shared L2 arrays?

However, this would make sense if this were ultimately scalable to octo-core.

@DavidC1 -- an octo-core in this layout is not as bad as we might think, depending on how Intel plans to arrange and use cache. The octo-core could simply mirror the cores (not the cache) to the other side: at about 30 mm^2 per core, that adds roughly 120 mm^2 (not 265 mm^2) to the die size, giving 385 mm^2 ... still HUGE, but not massive. For HPC and high-margin/low-volume applications, this is not unthinkable. GPU dies are larger than that.
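
A quick back-of-the-envelope check of the numbers above (a minimal Python sketch; the 265 mm^2 quad-core figure and the ~30 mm^2 per core are this post's estimates, not official data):

Code:
# Rough die-area arithmetic for the hypothetical octo-core layout.
# Inputs are the estimates quoted above, not official figures.
quad_die_mm2 = 265.0   # estimated Nehalem quad-core die size
core_mm2 = 30.0        # estimated area of one core (cache not duplicated)
extra_cores = 4        # mirror four more cores onto the other side

octo_die_mm2 = quad_die_mm2 + extra_cores * core_mm2
print(f"octo-core estimate: {octo_die_mm2:.0f} mm^2")  # -> 385 mm^2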

Jack


 
 
 Post subject:
PostPosted: Sun Oct 07, 2007 6:37 pm 

Joined: Tue Jul 31, 2007 1:25 pm
Posts: 285
It looks like it's two dual-core dies and a north bridge printed onto one piece of silicon.

That would be ridiculous, of course, so it can't be what they're doing.

But it does look like it...

Imagine the religious battles we could have over whether or not this meant it was a "true" quad core!

:-)


Last edited by AtWork on Sun Oct 07, 2007 6:40 pm, edited 2 times in total.

 
 Post subject:
PostPosted: Sun Oct 07, 2007 6:37 pm 

Joined: Sun Oct 07, 2007 6:22 pm
Posts: 119
Hans,

tsk, tsk, tsk... your analysis of the L2/L3 seems... well... incorrect.

Given that both Penryn and Nehalem share the same 45 nm process
you can take 2 MB of the 6 MB L2 in Penryn, including the necessary
part of the tags that reside in the center, and then scale the Nehalem
die until the two match.

When you do that, you will end up with 2 MB of cache per core -- the
chunk you have listed as 0.5 MB L2 will actually be the tags.

Given that Intel has only stated "8 MB shared total", it ends up being
L2 cache, not L3 cache, no?
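
The scaling argument in numbers (a minimal Python sketch; the 2 MB-per-core match is this post's measurement claim, and equal 45 nm SRAM density is the stated assumption):

Code:
# Area-matching inference: if one Nehalem per-core cache block scales
# to the same area as 2 MB of Penryn's L2 (equal 45 nm SRAM density,
# so equal area implies equal capacity), then:
mb_per_core_block = 2
cores = 4
print(f"{mb_per_core_block} MB x {cores} cores = {mb_per_core_block * cores} MB")
# -> 8 MB, consistent with Intel's stated "8 MB shared total"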

Next, if you compare the cores after this scaling exercise, then you'll
see identical L1 caches -- no increase in size. There is an interesting
difference below the L1d though, where Nehalem sports two greenish
trapezoids.

Last but not least, you can clearly see the two CSI^H^H^HQPI links in
the lower left and right corner, fuses and GPIOs above the left link,
and some more I/O pads above the right link -- the latter is a bit odd:
are these divided between the two links?


 
 Post subject:
PostPosted: Sun Oct 07, 2007 8:05 pm 

Joined: Fri Oct 05, 2007 7:46 am
Posts: 167
AtWork wrote:
It looks like it's two dual-core dies and a north bridge printed onto one piece of silicon.

That would be ridiculous, of course, so it can't be what they're doing.

But it does look like it...

Imagine the religious battles we could have over whether or not this meant it was a "true" quad core!

:-)


Hmmmmmmm, maybe not so ridiculous. Intel may simply port the shared L2/core design over and, with the advantage of the IMC, leave each L2 discrete to 2 cores. They could be optimizing for single-threaded performance while sharing L2 cache over 2 cores to decrease coherency traffic.

Who knows... but it does not sound enormously ridiculous to me -- odd perhaps, but not impossible.... Also, I do find it odd that the cores are not more symmetric about the IMC, as is the case with Barcelona.

But yeah, the fanboy debates would be laughable.
Jack


Last edited by JumpingJack on Sun Oct 07, 2007 11:35 pm, edited 1 time in total.

 
 Post subject:
PostPosted: Sun Oct 07, 2007 9:29 pm 

Joined: Tue Aug 07, 2007 11:57 am
Posts: 304
[Image: updated die annotation]


Some updates.

Regards, Hans


 
 Post subject:
PostPosted: Sun Oct 07, 2007 9:37 pm 

Joined: Tue Aug 07, 2007 11:57 am
Posts: 304
JumpingJack wrote:
Hans de Vries wrote:

The L2 and L3 caches have the same density. The local access to an array
is actually a small part of the total access time. (L3 = 16 of these arrays.)

Regards, Hans


2. The L2/L3 array is not shared between all 4 cores, or at least that is what would be inferred. Are you certain this is an L2/L3 array and not just 2 shared L2 arrays?


The L3 cache is almost 'per definition' shared by all processors. The
L2 caches are private per core and 'shared' between the two threads,
with probably a lot of circuits doubled for the second thread (prefetchers,
read/write buffers).



Regards, Hans


 
 Post subject:
PostPosted: Sun Oct 07, 2007 10:39 pm 

Joined: Fri Oct 05, 2007 7:46 am
Posts: 167
Hans de Vries wrote:
JumpingJack wrote:
Hans de Vries wrote:

The L2 and L3 caches have the same density. The local access to an array
is actually a small part of the total access time. (L3 = 16 of these arrays.)

Regards, Hans


2. The L2/L3 array is not shared between all 4 cores, or at least that is what would be inferred. Are you certain this is an L2/L3 array and not just 2 shared L2 arrays?


The L3 cache is almost 'per definition' shared by all processors. The
L2 caches are private per core and 'shared' between the two threads,
with probably a lot of circuits doubled for the second thread (prefetchers,
read/write buffers).



Regards, Hans


Thanks Hans, I yield to your expertise....

Double the prefetchers though?? What are your thoughts -- isn't this overkill, or do you expect it is needed to enable tracking two different threads into a (comparatively) smaller cache pool?


 
 Post subject:
PostPosted: Sun Oct 07, 2007 10:40 pm 

Joined: Tue Aug 07, 2007 11:57 am
Posts: 304
[email protected] wrote:
Hans,

tsk, tsk, tsk... your analysis of the L2/L3 seems... well... incorrect.

Given that both Penryn and Nehalem share the same 45 nm process
you can take 2 MB of the 6 MB L2 in Penryn, including the necessary
part of the tags that reside in the center, and then scale the Nehalem
die until the two match.

When you do that, you will end up with 2 MB of cache per core -- the
chunk you have listed as 0.5 MB L2 will actually be the tags.

Given that Intel has only stated "8 MB shared total", it ends up being
L2 cache, not L3 cache, no?


Intel's tags are generally only 10-11% of the size of the cache tiles,
not 25%. Note that there are vertical bands between the 0.5 MB tiles
which are about the required 10%. It's the total transistor count of
731M which hints at the 10 MB total cache size:
2x Merom with 8 MB = ~580M, 2x Penryn with 12 MB = ~820M.
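
A quick sanity check of that interpolation (a minimal Python sketch; the ~580M and ~820M totals are the estimates above, and the linear per-MB transistor cost is an assumption):

Code:
# Linear fit of transistor count vs. cache size, using the thread's
# estimates: 2x Merom = ~580M @ 8 MB, 2x Penryn = ~820M @ 12 MB.
m_per_mb = (820 - 580) / (12 - 8)         # ~60M transistors per MB of cache
implied_mb = 8 + (731 - 580) / m_per_mb   # cache implied by Nehalem's 731M
print(f"~{m_per_mb:.0f}M per MB; implied cache ~{implied_mb:.1f} MB")
# -> ~60M per MB; implied cache ~10.5 MB (consistent with ~10 MB total)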

[email protected] wrote:
Next, if you compare the cores after this scaling exercise, then you'll
see identical L1 caches -- no increase in size. There is an interesting
difference below the L1d though, where Nehalem sports two greenish
trapezoids.


Well, the actual L1 D cache SRAM tiles have a very different overall
form compared to Penryn: almost 80% wider, but less tall, so there's
room for confusion here. I did drop this point for the time being.

The greenish trapezoids did catch my attention. This area should
contain bus/cache interface stuff like read and write buffers, but I
played with L1 D related options as well. The wires obscure everything
in this area.

[email protected] wrote:
Last but not least, you can clearly see the two CSI^H^H^HQPI links in
the lower left and right corner, fuses and GPIOs above the left link,
and some more I/O pads above the right link -- the latter is a bit odd:
are these divided between the two links?


Indeed. Sometimes you see four links and sometimes two links. So I
presumed the links can be logically split.


Regards, Hans


 
 Post subject:
PostPosted: Mon Oct 08, 2007 2:01 am 

Joined: Sun Jul 22, 2007 12:53 am
Posts: 256
So will the L3 run at a flat clock rate (based off the FSB) for all processors? I don't mean that it always runs at one speed, just that it is always in sync with the FSB.


 
 Post subject:
PostPosted: Mon Oct 08, 2007 2:36 am 

Joined: Sun Oct 07, 2007 6:22 pm
Posts: 119
> Intel's tags are generally only 10-11% of the size of the cache tiles,
> not 25%.

Yes, if you measure Penryn's L2, then it's ~12%.

And no, I didn't say that all of the area that you consider to be L2
is used as tags -- only half of it is. Look closely. It appears to me
as if the other half is what used to be the four rectangular blocks
between Penryn's core(s) and its L2 array.

> Note that there are vertical bands between the 0.5MB tiles
> which are about the required 10%.

The same bands exist in Penryn... so... nope.

> It's the total transistor count of 731M which hints at the 10 MB
> total cache size: 2x Merom with 8 MB = ~580M, 2x Penryn with
> 12 MB = ~820M.

I'd expect ~500M for 8M of cache, plus ~100M for four cores.
This leaves ~130M for the I/O (NB, MC, DDR, 2xCSI, etc.).

Looking back at K8 or K8L, that doesn't sound too far off.

By contrast, 10M of cache (>600M) plus four cores (~100M) do
leave 30M or less for the I/O, which looks wrong... simply from
the standpoint of how much area is consumed by that logic.
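
Making the two budgets explicit (a minimal Python sketch; every per-block transistor count here is this post's rough estimate, not Intel data):

Code:
# Two competing transistor budgets for Nehalem's 731M total.
# Per-block counts are the post's rough estimates.
total_m = 731
cores_m = 100                              # ~100M for four cores
for cache_mb, cache_m in [(8, 500), (10, 600)]:
    io_m = total_m - cache_m - cores_m     # leftover for NB, MC, DDR, 2x CSI
    print(f"{cache_mb} MB cache -> ~{io_m}M left for I/O")
# -> 8 MB leaves ~131M (plausible vs. K8/K8L); 10 MB leaves ~31M (too little)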

Last but not least, a 512K L2 isn't significantly faster than a 2M
L2, but a shared L3 adds quite a bit of latency. I don't see why
Intel would go for a 3-level scheme in the desktop variant.

> The greenish trapezoids did catch my attention. This area should
> contain bus/cache interface stuff like read and write buffers, but I
> played with L1 D related options as well. The wires obscure everything
> in this area.

Multipliers? Relocated FP/SSE units?

> Indeed. Sometimes you see four links and sometimes two links. So I
> presumed the links can be logically split.

This die represents the mobile/desktop/DP variant -- two links.
They clearly show in the lower left/right corners.
The odd part is what's "above" the right link.
It looks like it can be split in two.
So maybe one half for each link.
Though one half would be on the "wrong" side of the die.
Shouldn't matter much though -- it's just a packaging issue.


 
 Post subject:
PostPosted: Mon Oct 08, 2007 3:05 am 

Joined: Thu Jul 26, 2007 12:28 pm
Posts: 261
[email protected] wrote:
Last but not least, a 512K L2 isn't significantly faster than a 2M
L2, but a shared L3 adds quite a bit of latency. I don't see why
Intel would go for a 3-level scheme in the desktop variant.


Perhaps because 8MB L2 shared between four cores would be too slow?
Same reasoning as AMD with Barcelona I suppose.


 
 Post subject:
PostPosted: Mon Oct 08, 2007 5:46 am 

Joined: Sun Oct 07, 2007 6:22 pm
Posts: 119
> Perhaps because 8MB L2 shared between four cores would be too slow?

Banias had 1M/SC @ 9 clocks
Dothan had 2M/SC @ 10 clocks
Yonah had 2M/DC @ 14 clocks
Merom had 4M/DC @ 14 clocks
Penryn has 6M/DC @ 15 clocks

It should be possible to build 8M/DC @ 16 clocks, and 8M/QC @ no more than 20 clocks.

In fact, it should be possible to build 8M so that each core has 2M @ no more than 14 clocks, with the other 6M being farther away by 1-2 clocks per "hop".

Think Larrabee.
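
A sketch of that sliced arrangement (minimal Python; the 14-clock local latency and the 1-2 clocks per hop are the figures proposed above, while the linear 4 x 2 MB slice layout is an assumption):

Code:
# NUCA-style model: each core sees its local 2 MB slice at 14 clocks,
# plus 1-2 extra clocks per hop to each more distant slice.
# (Latency figures from the post; the linear slice layout is assumed.)
local_clocks = 14
slices = 4                                   # 4 x 2 MB = 8 MB total
for hop_cost in (1, 2):
    lats = [local_clocks + hop_cost * hop for hop in range(slices)]
    print(f"{hop_cost} clk/hop -> slice latencies {lats}")
# -> 1 clk/hop: [14, 15, 16, 17];  2 clk/hop: [14, 16, 18, 20]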


 
 Post subject:
PostPosted: Mon Oct 08, 2007 7:07 am 

Joined: Wed Sep 19, 2007 9:05 pm
Posts: 18
Hans, I always thought 'you da man' regarding die analysis, and I have been following/referencing your articles for years now.

But in this case I'm with [email protected] -- he seems to have way more supporting evidence for his theory.


 
 Post subject:
PostPosted: Mon Oct 08, 2007 4:58 pm 

Joined: Tue Aug 07, 2007 11:57 am
Posts: 304
[email protected] wrote:

I'd expect ~500M for 8M of cache, plus ~100M for four cores.
This leaves ~130M for the I/O (NB, MC, DDR, 2xCSI, etc.).

Looking back at K8 or K8L, that doesn't sound too far off.

By contrast, 10M of cache (>600M) plus four cores (~100M) do
leave 30M or less for the I/O, which looks wrong... simply from
the standpoint of how much area is consumed by that logic.


Look again. This is off by a huge factor:

Comparing Merom with Penryn gives <= 476M transistors for 8 MB
of cache and ~106M transistors for 4 cores (here we already
include all the previous bus logic as compensation for the grown
cores).

This leaves you with about 150M transistors to explain -- the
equivalent of 6 cores -- and this should all be found in the North
Bridge/Switch (which is about half the size of a single core) and
the I/O paths/circuits???
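
That residual, as arithmetic (a minimal Python sketch; the <= 476M and ~106M figures are the Merom/Penryn-derived estimates above):

Code:
# Subtract the Merom/Penryn-derived estimates from Nehalem's 731M
# total (estimates from the post, not official data).
total_m = 731
cache_m = 476                 # <= 476M for 8 MB of cache
cores_m = 106                 # ~106M for 4 cores (incl. old bus logic)
residual_m = total_m - cache_m - cores_m
core_equiv = residual_m / (cores_m / 4)
print(f"residual ~{residual_m}M, ~{core_equiv:.1f} core-equivalents")
# -> residual ~149M, ~5.6 core-equivalents left for the NB/switch and I/O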


Regards, Hans


 
 Post subject:
PostPosted: Mon Oct 08, 2007 10:37 pm 

Joined: Tue Sep 18, 2007 10:27 pm
Posts: 48
[email protected] wrote:
> Perhaps because 8MB L2 shared between four cores would be too slow?

Banias had 1M/SC @ 9 clocks
Dothan had 2M/SC @ 10 clocks
Yonah had 2M/DC @ 14 clocks
Merom had 4M/DC @ 14 clocks
Penryn has 6M/DC @ 15 clocks



Is Anandtech wrong here?

They claim Conroe is 13 clocks, and Penryn is actually down to 12 clocks.

http://www.anandtech.com/cpuchipsets/in ... i=3069&p=3


 