Aceshardware
http://aceshardware.freeforums.org/

Finally an image of Shanghai
http://aceshardware.freeforums.org/finally-an-image-of-shanghai-t405-105.html
Page 8 of 10

Author:  who? [ Sat Mar 22, 2008 8:20 am ]
Post subject: 

Eric Bron wrote:

hey why do you think I'm not experimenting? trust me, I carefully experimented with SSE4.1 when it was something new; my conclusion is that it's worthless for any 3D purpose



Based on your few previous postings, it is not obvious that you even tried. Not knowing that SoA or SoS mainly use Packet instructions makes it more than suspicious; I'll classify you in the "beau parleur" (smooth talker) section, sorry.
And if you had ever really tried, you'd know that with threaded code SoA is a catastrophe ... and it is required to move to SoS

For those interested, think about how many streams SoA opens in memory, and how many that is for 3D vectors x Matrix, especially with 4 cores ... now, think about the prefetcher pattern when you have so many streams open ... it is impossible to get! SoA is really not multicore friendly, good luck with this!


who?

Author:  Eric Bron [ Sat Mar 22, 2008 8:27 am ]
Post subject: 

who? wrote:
Based on your few previous postings, it is not obvious that you even tried. Not knowing that SoA or SoS mainly use Packet instructions


read again the one where I say "anyway ADDPS/MULPS will be used with a SoA or SoS layout", and hey please call them "packed" like everybody else not "PACK" or "Packet"

Author:  who? [ Sat Mar 22, 2008 8:30 am ]
Post subject: 

Eric Bron wrote:
who? wrote:
Based on your few previous postings, it is not obvious that you even tried. Not knowing that SoA or SoS mainly use Packet instructions


read again the one where I say "anyway ADDPS/MULPS will be used with a SoA or SoS layout", and hey please call them "packed" like everybody else not "PACK" or "Packet"


LOL! I guess you can't connect "packed instruction" and "pack instructions"; when speaking float, it is pretty obvious.
Dude!

who?

Author:  Eric Bron [ Sat Mar 22, 2008 8:40 am ]
Post subject: 

who? wrote:
For those interested, think about how many streams SoA opens in memory, and how many that is for 3D vectors x Matrix, especially with 4


you really talk as if you don't understand what loop fission is; ever heard of the L0 icache on the Core uarch?

who? wrote:
cores ... now, think about the prefetcher pattern
who?


as most people know there are many prefetchers *per core*; with your 3D example you have 3x more fetch streams but 3x fewer cache misses per stream, so on average it is neutral if you are concerned about external RAM bandwidth requirements (though as you said in the past memory bandwidth is not important, and we all know you'll change your "mind" when Nehalem is out)
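
A rough way to put numbers on the stream-count argument, as a sketch only: it assumes 64-byte cache lines and 4-byte floats, counts input streams only, and compares an interleaved x,y,z layout (labelled AoS here) with three separate x, y, z arrays (SoA). The function names are made up for illustration, and the sketch says nothing about how the hardware prefetchers actually behave.

```python
import math

CACHE_LINE_BYTES = 64  # assumption: typical x86 cache line size
FLOAT_BYTES = 4        # single-precision float
COMPONENTS = 3         # x, y, z of a 3D vector


def input_streams(layout, threads=1):
    """Sequential read streams open at once (inputs only, one array walk per thread)."""
    per_thread = {"AoS": 1, "SoA": COMPONENTS}[layout]
    return per_thread * threads


def lines_per_stream(layout, n_vectors):
    """Cache lines each stream has to pull in for n_vectors 3D vectors."""
    floats_per_stream = n_vectors * (COMPONENTS if layout == "AoS" else 1)
    return math.ceil(floats_per_stream * FLOAT_BYTES / CACHE_LINE_BYTES)


def total_lines(layout, n_vectors):
    """Total cache lines fetched, summed over the streams of one thread."""
    return input_streams(layout) * lines_per_stream(layout, n_vectors)


if __name__ == "__main__":
    n = 16384
    for layout in ("AoS", "SoA"):
        print(layout, input_streams(layout, threads=4),
              lines_per_stream(layout, n), total_lines(layout, n))
```

Under these assumptions, 4 threads open 12 input streams with SoA against 4 with AoS, while the total number of cache lines fetched is identical. Whether the prefetchers keep up with the extra streams is the part only timing data can settle.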

more generally, if you continue to post your bold statements without timing information it looks more and more like you have something to hide about how worthless non-standard "SSE4.1" is

Author:  Eric Bron [ Sat Mar 22, 2008 8:45 am ]
Post subject: 

who? wrote:
LOL! I guess you can't connect "packed instruction" and "pack instructions"


I have corrected it, and I told you that I'm sorry; look at "my bad" in the previous post concerned. Try to keep things focused and hey, why not, try to improve the way you talk to your customers

Author:  Phenom [ Sat Mar 22, 2008 4:19 pm ]
Post subject:  Re: Time for an update.

Hans de Vries wrote:
Time for an update with Shanghai pictures and a Nehalem picture not so obscured by wiring.

Regards, Hans


Hans, could "Bridge to 2nd Die?" really be a PCIe link toward an on-package GPU, like on this schematic?
Image
BTW, a petition to "who?" and Eric: stop! Please stop!

Author:  Hans de Vries [ Sun Mar 23, 2008 6:44 am ]
Post subject:  Re: Time for an update.

Phenom wrote:
Hans, could "Bridge to 2nd Die?" really be a PCIe link toward an on-package GPU, like on this schematic?
Image


Yes. As for the 8-core version: after having seen Dunnington,
I think we may expect another monster monolithic die (~700 mm2)
rather than a dual-die package.


Regards, Hans

Author:  no@spam.com [ Sun Mar 23, 2008 7:49 am ]
Post subject:  Re: Time for an update.

> > Hans, could "Bridge to 2nd Die?" really be a PCIe link toward an on-package GPU, like on this schematic?
> > Image
> Yes.

OTOH it could just be a debug port for QPI.

(While you could hook a probe to a QPI link, doing
so is pretty intrusive. So I wouldn't be surprised if
Intel had added the ability to mirror one of the two
links to a debug port. Ideally they'd have added it
in a way where it could be used as a debug port or
as a 3rd link, but judging from what's known so far
it does not look like the 45nm QC is gonna support
more than DP -- only the 8-core (OC) will.)

Author:  alavo03 [ Sun Mar 23, 2008 7:42 pm ]
Post subject:  About Phenom latencies.

Speaking of cache latencies on the Phenom 9700 (2.4 GHz cores / 2 GHz NB & L3), my own measurements look as follows:

L1 3 cycles
L2 minimum 9, maximum 15 cycles
L3 minimum 20, maximum 48 cycles

The new hardware prefetcher, which now acts not only on RAM but also on L2 & L3 (a first for AMD), makes the L2 & L3 latencies vary.

I've tested the influence of the NB/L3 clock on performance and the results aren't impressive... Going from 1.8 to 2 GHz, WinRAR shows a 4.5% improvement (tested with a 9600 BE @ 2.3 GHz). This is a corner case, as everyone knows how much WinRAR needs low latency.

The same can be said about the clock of the DDR2 interface. DDR2-1066, with its 17 GB/s of bandwidth, saturates the 64-bit L3 bus @ 1.8 GHz (BW = 14.4 GB/s).
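
The two bandwidth figures can be reproduced with a quick calculation. This is only a sketch: it assumes the 17 GB/s refers to dual-channel DDR2-1066 with 64-bit channels and that the L3 bus moves 64 bits per NB clock; `peak_bandwidth_gbs` is a made-up helper name.

```python
def peak_bandwidth_gbs(width_bits, rate_gtps, channels=1):
    """Peak bandwidth in GB/s: bytes per transfer x transfers per second x channels."""
    return width_bits / 8 * rate_gtps * channels


# Assumption: dual-channel DDR2-1066, 64 bits per channel, 1066 MT/s
ddr2_1066 = peak_bandwidth_gbs(64, 1.066, channels=2)

# Assumption: 64-bit L3 path, one transfer per NB clock at 1.8 GHz
l3_bus = peak_bandwidth_gbs(64, 1.8)

if __name__ == "__main__":
    print(f"DDR2-1066 dual channel: {ddr2_1066:.1f} GB/s")
    print(f"L3 bus @ 1.8 GHz:       {l3_bus:.1f} GB/s")
```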

Remember the Athlon 64's 64-bit L2 @ core speed, which limited the bandwidth of the dual DDR2-800 interface in low-clocked CPUs?

Regards. Carlos.

Author:  redpriest [ Fri Mar 28, 2008 10:52 am ]
Post subject: 

alavo03, those are some good measurements. NB clock does matter on Barcelona to some extent, but is probably of limited value past 2.4 GHz.

Author:  Gabriele Svelto [ Fri Mar 28, 2008 11:25 am ]
Post subject: 

redpriest wrote:
alavo03, those are some good measurements. NB clock does matter on Barcelona to some extent, but is probably of limited value past 2.4 GHz.

Did someone measure how much the L3 and memory latency increase when having the NB/L3 run out-of-sync with the cores compared to in-sync? For example by measuring latencies of a 2.0 GHz K10 with the NB/L3 at 2.0GHz and 1.8GHz. I am curious about it because the NB/L3 was supposed (according to the rumors) to be able to run at the same frequency as the cores and possibly at a higher frequency (granted that it can vary independently from the cores' clock). The fact that NB/L3 speeds have been systematically lower than the core frequencies on all the available Phenoms and Opterons makes me think that AMD failed to deliver on that specific part and can possibly explain why they expect additional IPC gains from Shanghai once it's fixed. Just speculation on my part though.

Author:  Opteron [ Fri Mar 28, 2008 11:56 am ]
Post subject: 

Gabriele Svelto wrote:
Did someone measure how much the L3 and memory latency increase when having the NB/L3 run out-of-sync with the cores compared to in-sync? For example by measuring latencies of a 2.0 GHz K10 with the NB/L3 at 2.0GHz and 1.8GHz.

Have a look at xbitlabs; I posted the link in the other Phenom thread:

http://aceshardware.freeforums.org/phen ... .html#5670

We are also talking there about the L3 latency, funny ^^

cheers

Opteron

Author:  mocad_tom [ Fri Mar 28, 2008 12:18 pm ]
Post subject: 

http://aceshardware.freeforums.org/what ... .html#4259
Quote:
In terms of AMD I think there are two things to separate from each other.
We should talk about the Northbridge+IMC+L3-Cache-Team and the Core-Team.

As far as I can see neither performed very well, each in very different areas.

The Core-Team didn't want to risk too much - now they have an IPC improvement, but not as much as they wanted (I think a 4-issue wide processor would have been the better decision).

The Northbridge-Team had a lot of big design challenges to fight against. Furthermore it was not possible to raise the clock speeds into the regions they wanted.

I think that Barcelona has a lot of headroom - when all speed-paths are optimised. The differently clockable areas on the die are very impressive - in my opinion a little bit too flexible. I hope that they will change this approach to a more static model.

Something like this for example:
CPU1 3GHz
CPU2 1.5GHz
CPU3 1.5GHz
CPU4 0.75GHz
NB 3GHz

I think it is not really helpful when you want to reach a L3-Cache-Cell as fast as possible and you have to go through oddly clocked areas. A lot of performance is killed here.


I mentioned this point in December as well. I still think that AMD has to go one step back in terms of flexibility to get lower L3-Cache-Latencies. Only even dividers like full frequency (3GHz) / half frequency (1.5GHz) / quarter frequency (750MHz) should be allowed.

If you have 2.8GHz -> 1.4, 0.7
If you have 2.6GHz -> 1.3, 0.65

I think this type of flexibility is enough and you can integrate less complex buffer-stages. If NB & Core are running at the same frequency you can short-circuit this buffer-stage and have lower latencies.
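
The divider scheme could be sketched like this. The helper names, and modelling the buffer bypass as a simple equality check, are illustrative assumptions and not anything AMD has described.

```python
ALLOWED_DIVIDERS = (1, 2, 4)  # full, half and quarter frequency


def allowed_clocks(full_ghz):
    """Clock steps (GHz) a block could run at under the fixed-divider scheme."""
    return [full_ghz / d for d in ALLOWED_DIVIDERS]


def can_bypass_buffer(core_ghz, nb_ghz):
    """Hypothetical rule: the buffer stage is short-circuited only when both clocks match."""
    return core_ghz == nb_ghz


if __name__ == "__main__":
    for full in (3.0, 2.8, 2.6):
        print(full, allowed_clocks(full))
    print(can_bypass_buffer(3.0, 3.0), can_bypass_buffer(1.5, 3.0))
```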

I wonder how Nehalem will fight against this problem; very few details are out about all this stuff.

greets,
tom

Author:  EduardoS [ Sat Mar 29, 2008 12:11 am ]
Post subject: 

Gabriele Svelto wrote:
Did someone measure how much the L3 and memory latency increase when having the NB/L3 run out-of-sync with the cores compared to in-sync? For example by measuring latencies of a 2.0 GHz K10 with the NB/L3 at 2.0GHz and 1.8GHz. I am curious about it because the NB/L3 was supposed (according to the rumors) to be able to run at the same frequency as the cores and possibly at a higher frequency (granted that it can vary independently from the cores' clock). The fact that NB/L3 speeds have been systematically lower than the core frequencies on all the available Phenoms and Opterons makes me think that AMD failed to deliver on that specific part and can possibly explain why they expect additional IPC gains from Shanghai once it's fixed. Just speculation on my part though.

My guess is no more than 3 cycles... Being "out-of-sync" isn't that bad.
And for AMD it allows them to control each core clock and L3 clock independently.

Author:  JumpingJack [ Sat Mar 29, 2008 5:23 am ]
Post subject: 

Gabriele Svelto wrote:
redpriest wrote:
alavo03, those are some good measurements. NB clock does matter on Barcelona to some extent, but is probably of limited value past 2.4 GHz.

Did someone measure how much the L3 and memory latency increase when having the NB/L3 run out-of-sync with the cores compared to in-sync? For example by measuring latencies of a 2.0 GHz K10 with the NB/L3 at 2.0GHz and 1.8GHz. I am curious about it because the NB/L3 was supposed (according to the rumors) to be able to run at the same frequency as the cores and possibly at a higher frequency (granted that it can vary independently from the cores' clock). The fact that NB/L3 speeds have been systematically lower than the core frequencies on all the available Phenoms and Opterons makes me think that AMD failed to deliver on that specific part and can possibly explain why they expect additional IPC gains from Shanghai once it's fixed. Just speculation on my part though.


Ooooh, this is a good question and a good experiment to run, I will run it and report back to you the numbers.

As you probably know, Kanter addressed the circuit design for the NB/L3 clock domain: K10 implements FIFO buffers to absorb the clock skew. I am curious whether these buffers are adding more latency than AMD originally planned.

EDIT: Here is the data

CPU/NB (GHz) -- L1/L2/L3 latency (Everest, ns) -- L1/L2/L3 latency (CPUID, cycles)
1.8/1.8 -- 1.6/5.1/10.8 -- 3/15/51
1.8/2.0 -- 1.6/5.1/9.8  -- 3/15/48 (oddly, CPUID reports 4 levels of cache at these settings, one L1 @ 8K and one L1 @ 64K, weird)
2.0/2.0 -- 1.5/4.6/9.2  -- 3/16/50
2.0/1.8 -- 1.5/4.6/9.8  -- 3/16/45
2.0/1.6 -- 1.5/4.6/10.5 -- 3/15/55

Not using great SW to measure this... L3 is very variable: the 2.0/1.8 run varied between 43 and 53 over about 6 runs of the CPUID latency test, so I tried to capture the average.

Nothing earth shattering -- neither tool showed anything abnormal beyond ordinary clock scaling in the time domain.
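
To compare the Everest numbers across settings in clock terms, the nanosecond figures can be converted to core cycles (cycles = ns x GHz). A small sketch using only the Everest L3 values from the table above; `ns_to_cycles` is a made-up helper, and no attempt is made to reconcile the result with the CPUID tool's cycle counts.

```python
def ns_to_cycles(latency_ns, clock_ghz):
    """Convert a latency in nanoseconds to clock cycles at the given frequency."""
    return latency_ns * clock_ghz


# (CPU GHz, NB GHz, Everest L3 latency in ns), copied from the table
everest_l3 = [
    (1.8, 1.8, 10.8),
    (1.8, 2.0, 9.8),
    (2.0, 2.0, 9.2),
    (2.0, 1.8, 9.8),
    (2.0, 1.6, 10.5),
]

if __name__ == "__main__":
    for cpu, nb, ns in everest_l3:
        print(f"{cpu}/{nb}: {ns} ns = {ns_to_cycles(ns, cpu):.1f} core cycles")
```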

Memory access latency went as follows (using DDR2-800 CL5), by CPU/NB clock:

1.8/1.8 - 75.8
1.8/2.0 - 72.6
2.0/2.0 - 72.2
2.0/1.8 - 67.5
2.0/1.6 - 69.1
All measured with Everest (not a great tool for this), and a beta version at that: 4.20 1283 Beta.

All times are UTC + 1 hour