Aceshardware

(not so) temporary home for the aceshardware community

All times are UTC + 1 hour





 Post subject:
PostPosted: Sat Mar 22, 2008 8:20 am 
Offline

Joined: Sat Sep 01, 2007 8:01 am
Posts: 652
Eric Bron wrote:

Hey, why do you think I'm not experimenting? Trust me, I experimented carefully with SSE4.1 when it was something new; my conclusion is that it's worthless for any 3D purpose.



Based on your few previous posts, it is not obvious that you even tried. Not knowing that SoA or SoS mainly use Packet instructions makes it more than suspicious; I'll classify you in the "beau parleur" (smooth talker) section, sorry.
And if you had ever really tried, you'd know that with threaded code SoA is a catastrophe ... and you have to move to SoS.

For those interested: think about how many streams SoA opens in memory, and how many that is for 3D vectors x matrix, especially with 4 cores ... now think about the prefetcher pattern when you have so many streams open ... it is impossible to get! SoA is really not multicore friendly, good luck with this!


who?


 Post subject:
PostPosted: Sat Mar 22, 2008 8:27 am 
Offline

Joined: Fri Aug 31, 2007 10:08 pm
Posts: 208
Location: Switzerland
who? wrote:
Based on your few previous posting, that is not obvious that you even tried. Not knowing that SoA or SoS mainly use Packet instructions


Read again the post where I say "anyway ADDPS/MULPS will be used with a SoA or SoS layout", and hey, please call them "packed" like everybody else, not "PACK" or "Packet".


 Post subject:
PostPosted: Sat Mar 22, 2008 8:30 am 
Offline

Joined: Sat Sep 01, 2007 8:01 am
Posts: 652
Eric Bron wrote:
who? wrote:
Based on your few previous posting, that is not obvious that you even tried. Not knowing that SoA or SoS mainly use Packet instructions


read again the one where I say "anyway ADDPS/MULPS will be used with a SoA or SoS layout", and hey please call them "packed" like everybody else not "PACK" or "Packet"


LOL! I guess you can't connect "packed instructions" and "pack instructions"; when talking about floats, it is pretty obvious.
Dude!

who?


 Post subject:
PostPosted: Sat Mar 22, 2008 8:40 am 
Offline

Joined: Fri Aug 31, 2007 10:08 pm
Posts: 208
Location: Switzerland
who? wrote:
For those interested, thinking about how many streams SoA open in memory, and how many it is for 3D vectors x Matrix, especially with 4


You really talk as if you don't understand what loop fission is. Ever heard of the L0 icache on the Core uarch?

who? wrote:
cores ... now, think about the prefetcher pattern
who?


As most people know, there are many prefetchers *per core*. With your 3D example you have 3x more fetch streams but 3x fewer cache misses per stream, so on average it is neutral if you are concerned about external RAM bandwidth requirements (though as you said in the past, memory bandwidth is not important; we all know you'll change your "mind" when Nehalem is out).
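A rough sketch of that stream arithmetic, assuming 64-byte cache lines, single-precision floats and a plain sequential walk over the data (the function name and the "interleaved" label for the x,y,z-together layout are made up for illustration, they are not from this thread):

```python
import math

CACHE_LINE = 64   # bytes per cache line (assumption)
FLOAT_SIZE = 4    # bytes per single-precision float

def lines_touched(n_vectors, layout):
    """Return the number of cache lines fetched per stream for n 3D vectors."""
    if layout == "interleaved":
        # x, y, z stored together: one stream carrying 12 bytes per vector
        return [math.ceil(n_vectors * 3 * FLOAT_SIZE / CACHE_LINE)]
    if layout == "soa":
        # separate x[], y[], z[] arrays: three streams, 4 bytes per vector each
        per_stream = math.ceil(n_vectors * FLOAT_SIZE / CACHE_LINE)
        return [per_stream] * 3
    raise ValueError(f"unknown layout: {layout}")

n = 1_048_576
inter = lines_touched(n, "interleaved")
soa = lines_touched(n, "soa")
print("interleaved:", len(inter), "stream(s),", sum(inter), "lines in total")
print("soa:        ", len(soa), "stream(s),", sum(soa), "lines in total")
```

Under these assumptions the total number of lines pulled from RAM is the same for both layouts; only the number of concurrent streams differs. It says nothing about whether the prefetchers actually keep up with three streams per thread, which is the point being disputed.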

More generally, if you continue to post your bold statements without timing information, it looks more and more like you have something to hide about how worthless the non-standard "SSE4.1" is.


Last edited by Eric Bron on Sat Mar 22, 2008 9:03 am, edited 1 time in total.

 Post subject:
PostPosted: Sat Mar 22, 2008 8:45 am 
Offline

Joined: Fri Aug 31, 2007 10:08 pm
Posts: 208
Location: Switzerland
who? wrote:
LOL! I guess, you can't connect "packed instruction" and "pack instructions"


I have corrected it, and I told you that I'm sorry; look at "my bad" in the post concerned. Try to keep things focused and hey, why not, try to improve the way you talk to your customers.


 Post subject: Re: Time for an update.
PostPosted: Sat Mar 22, 2008 4:19 pm 
Offline

Joined: Sat Mar 22, 2008 4:10 pm
Posts: 4
Hans de Vries wrote:
Time for an update with Shanghai pictures and a Nehalem picture not so obscured by wiring.

Regards, Hans


Hans, could "Bridge to 2nd Die?" really be a PCIe link to an on-package GPU, like on this schematic?
Image
BTW, a petition to "who?" and Eric: stop! Please stop!


 Post subject: Re: Time for an update.
PostPosted: Sun Mar 23, 2008 6:44 am 
Offline

Joined: Tue Aug 07, 2007 11:57 am
Posts: 181
Phenom wrote:
Hans could "Bridge to 2nd Die?" really be PCIe link toward on package GPU like on this schematic
Image


Yes. As for the 8-core version: after having seen Dunnington,
I think we may expect another monster monolithic die (~700 mm2)
rather than a dual-die package.


Regards, Hans


 Post subject: Re: Time for an update.
PostPosted: Sun Mar 23, 2008 7:49 am 
Offline

Joined: Sun Oct 07, 2007 6:22 pm
Posts: 105
> > Hans could "Bridge to 2nd Die?" really be PCIe link toward on package GPU like on this schematic
Image
> Yes.

OTOH it could just be a debug port for QPI.

(While you could hook a probe to a QPI link, doing
so is pretty intrusive. So I wouldn't be surprised if
Intel had added the ability to mirror one of the two
links to a debug port. Ideally they'd have added it
in a way where it could be used as a debug port or
as a 3rd link, but judging from what's known so far
it does not look like the 45nm QC is gonna support
more than DP -- only the 8-core (OC) will.)


 Post subject: About Phenom latencies.
PostPosted: Sun Mar 23, 2008 7:42 pm 
Offline

Joined: Sun Mar 23, 2008 7:11 pm
Posts: 18
Location: Tarragona, Spain
Speaking of cache latencies on the Phenom 9700 (2.4 GHz cores / 2 GHz NB & L3), my own measurements are as follows:

L1 3 cycles
L2 minimum 9, maximum 15 cycles
L3 minimum 20, maximum 48 cycles

The new hardware prefetcher, which now acts not only on RAM but also on L2 & L3 (a first for AMD), makes the L2 & L3 latencies vary.

I've tested the influence of the NB/L3 clock on performance and the results aren't impressive... Going from 1.8 to 2 GHz gives a 4.5% gain in WinRAR (tested with a 9600 BE @ 2.3 GHz). This is a corner case, as everyone knows how much WinRAR needs low latency.

The same can be said about the clock of the DDR2 interface. DDR2-1066, with 17 GB/s of bandwidth, saturates the 64-bit L3 bus @ 1.8 GHz (BW = 14.4 GB/s).

Remember the Athlon 64's 64-bit L2 @ core speed, which limited the bandwidth of the dual DDR2-800 interface on low-clocked CPUs?
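For reference, the two bandwidth figures above can be reproduced with a small sketch. It assumes one transfer per clock on the 64-bit L3 bus and a dual-channel DDR2-1066 setup with 64-bit channels; the function name is only illustrative:

```python
def peak_bandwidth_gb_s(bus_width_bits, transfers_per_second, channels=1):
    """Peak bandwidth in GB/s (10^9 bytes/s): bytes per transfer x transfers/s x channels."""
    return bus_width_bits / 8 * transfers_per_second * channels / 1e9

# L3 path: 64-bit bus at 1.8 GHz, one transfer per clock (assumption)
l3_bus = peak_bandwidth_gb_s(64, 1.8e9)

# DDR2-1066: 1066 MT/s per 64-bit channel, two channels (dual channel assumed)
ddr2_1066 = peak_bandwidth_gb_s(64, 1066e6, channels=2)

print(f"L3 bus:    {l3_bus:.1f} GB/s")
print(f"DDR2-1066: {ddr2_1066:.1f} GB/s")
```

These are theoretical peaks only (about 14.4 GB/s against about 17.1 GB/s); sustained numbers on real hardware will be lower on both sides.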

Regards. Carlos.


 Post subject:
PostPosted: Fri Mar 28, 2008 10:52 am 
Offline

Joined: Thu Aug 30, 2007 6:50 pm
Posts: 68
alavo03, those are some good measurements. NB clock does matter on Barcelona to some extent, but it is probably of limited value past 2.4 GHz.


 Post subject:
PostPosted: Fri Mar 28, 2008 11:25 am 
Offline

Joined: Wed Jun 27, 2007 10:19 am
Posts: 325
Location: Milano, Italy
redpriest wrote:
alavo03, those are some good measurements. NB clock does matter on Barcelona to some extent, but probably of limited value past 2.4 ghz.

Did anyone measure how much the L3 and memory latency increase when the NB/L3 runs out-of-sync with the cores compared to in-sync? For example, by measuring the latencies of a 2.0 GHz K10 with the NB/L3 at 2.0 GHz and at 1.8 GHz. I am curious about it because the NB/L3 was supposed (according to the rumors) to be able to run at the same frequency as the cores, and possibly at a higher one (given that it can vary independently of the cores' clock). The fact that NB/L3 speeds have been systematically lower than the core frequencies on all the available Phenoms and Opterons makes me think that AMD failed to deliver on that specific part, which could explain why they expect additional IPC gains from Shanghai once it's fixed. Just speculation on my part though.


 Post subject:
PostPosted: Fri Mar 28, 2008 11:56 am 
Offline

Joined: Sun Mar 16, 2008 3:20 pm
Posts: 82
Gabriele Svelto wrote:
Did someone measure how much the L3 and memory latency increase when having the NB/L3 run out-of-sync with the cores compared to in-sync? For example by measuring latencies of a 2.0 GHz K10 with the NB/L3 at 2.0GHz and 1.8GHz.
Have a look at xbitlabs, I posted the link in the other Phenom thread:

http://aceshardware.freeforums.org/phen ... .html#5670

We are also talking there about the L3 latency, funny ^^

cheers

Opteron


 Post subject:
PostPosted: Fri Mar 28, 2008 12:18 pm 
Offline

Joined: Fri Sep 07, 2007 8:41 am
Posts: 12
http://aceshardware.freeforums.org/what ... .html#4259
Quote:
In terms of AMD I think there are two things to separate from each other.
We should talk about the Northbridge+IMC+L3-Cache team and the Core team.

As far as I can see, neither performed very well, each in a very different area.

The Core team didn't want to risk too much - now they have an IPC improvement, but not as much as they wanted (I think a 4-issue wide processor would have been the better decision).

The Northbridge team had a lot of big design challenges to fight against. Furthermore, it was not possible to raise the clock speeds into the regions they wanted.

I think that Barcelona has a lot of headroom - when all speed paths are optimised. The differently clockable areas on the die are very impressive - in my opinion a little bit too flexible. I hope that they will change this approach to a more static model.

Something like this for example:
CPU1 3GHz
CPU2 1.5GHz
CPU3 1.5GHz
CPU4 0.75GHz
NB 3GHz

I don't think it is really helpful when you want to reach an L3 cache cell as fast as possible and have to go through oddly clocked areas. A lot of performance is lost here.


I mentioned this point in December as well. I still think that AMD has to take a step back in terms of flexibility to get lower L3 cache latencies. Only even dividers, i.e. full frequency (3 GHz) / half frequency (1.5 GHz) / quarter frequency (750 MHz), should be allowed.

If you have 2.8 GHz -> 1.4, 0.7
If you have 2.6 GHz -> 1.3, 0.65

I think this type of flexibility is enough, and you can integrate less complex buffer stages. If NB and core are running at the same frequency, you can bypass this buffer stage and get lower latencies.
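The divider scheme proposed above is easy to tabulate. A minimal sketch, assuming the only allowed dividers are 1, 2 and 4 (the function name is hypothetical, nothing here reflects how AMD actually generates its clocks):

```python
def domain_clocks(full_ghz, dividers=(1, 2, 4)):
    """Clock options (GHz) when only full, half and quarter frequency are allowed."""
    return [full_ghz / d for d in dividers]

for full in (3.0, 2.8, 2.6):
    options = ", ".join(f"{f:g}" for f in domain_clocks(full))
    print(f"{full:g} GHz -> {options}")
```

With ratios restricted to powers of two, every slower domain completes exactly one cycle per two or four cycles of the full-speed clock, which is what would let the crossing logic stay simple in this proposal.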

I wonder how Nehalem will deal with this problem; very few details are out about all this stuff.

greets,
tom


 Post subject:
PostPosted: Sat Mar 29, 2008 12:11 am 
Offline

Joined: Sat Mar 22, 2008 5:10 pm
Posts: 220
Gabriele Svelto wrote:
Did anyone measure how much the L3 and memory latency increase when the NB/L3 runs out-of-sync with the cores compared to in-sync? For example, by measuring the latencies of a 2.0 GHz K10 with the NB/L3 at 2.0 GHz and at 1.8 GHz. I am curious about it because the NB/L3 was supposed (according to the rumors) to be able to run at the same frequency as the cores, and possibly at a higher one (given that it can vary independently of the cores' clock). The fact that NB/L3 speeds have been systematically lower than the core frequencies on all the available Phenoms and Opterons makes me think that AMD failed to deliver on that specific part, which could explain why they expect additional IPC gains from Shanghai once it's fixed. Just speculation on my part though.

My guess is no more than 3 cycles... Being "out-of-sync" isn't that bad.
And for AMD it allows them to control each core clock and L3 clock independently.


 Post subject:
PostPosted: Sat Mar 29, 2008 5:23 am 
Offline

Joined: Fri Oct 05, 2007 7:46 am
Posts: 142
Gabriele Svelto wrote:
redpriest wrote:
alavo03, those are some good measurements. NB clock does matter on Barcelona to some extent, but probably of limited value past 2.4 ghz.

Did anyone measure how much the L3 and memory latency increase when the NB/L3 runs out-of-sync with the cores compared to in-sync? For example, by measuring the latencies of a 2.0 GHz K10 with the NB/L3 at 2.0 GHz and at 1.8 GHz. I am curious about it because the NB/L3 was supposed (according to the rumors) to be able to run at the same frequency as the cores, and possibly at a higher one (given that it can vary independently of the cores' clock). The fact that NB/L3 speeds have been systematically lower than the core frequencies on all the available Phenoms and Opterons makes me think that AMD failed to deliver on that specific part, which could explain why they expect additional IPC gains from Shanghai once it's fixed. Just speculation on my part though.


Ooooh, this is a good question and a good experiment to run, I will run it and report back to you the numbers.

As you probably know, Kanter covered the circuit design for the NB/L3 clock domain: K10 implements FIFO buffers to absorb the clock skew. I am curious whether these buffers add more latency than AMD originally planned.

EDIT: Here is the data

CPU/NB (GHz) -- L1/L2/L3 latency (Everest, ns) -- L1/L2/L3 latency (CPUID, cycles)
1.8/1.8 -- 1.6/5.1/10.8 -- 3/15/51
1.8/2.0 -- 1.6/5.1/9.8  -- 3/15/48 (oddly, CPUID reports 4 levels of cache at these settings: one L1 @ 8K and one L1 @ 64K, weird)
2.0/2.0 -- 1.5/4.6/9.2  -- 3/16/50
2.0/1.8 -- 1.5/4.6/9.8  -- 3/16/45
2.0/1.6 -- 1.5/4.6/10.5 -- 3/15/55

Not using great SW to measure this... L3 is very variable: the 2.0/1.8 run varied between 43 and 53 over about 6 runs of the CPUID latency test, so I tried to capture the average.

Nothing earth shattering -- neither tool showed anything abnormal beyond ordinary clock scaling in the time domain.

Memory access latency came out as follows (using DDR2-800 CL5; CPU/NB - Everest latency in ns):

1.8/1.8 - 75.8
1.8/2.0 - 72.6
2.0/2.0 - 72.2
2.0/1.8 - 67.5
2.0/1.6 - 69.1
All measured with Everest (not a great tool to use), and a beta at that: 4.20 1283 Beta.
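To put the Everest cache times on the same scale as the CPUID cycle counts, the nanosecond figures can be converted to core cycles. A small sketch, assuming the first number of each CPU/NB pair is the core clock in GHz and that the Everest values are in ns (the dictionary simply copies the cache table from this post; the function name is made up):

```python
# (CPU GHz, NB GHz) -> Everest L1/L2/L3 latency in ns, copied from the table in this post
everest_ns = {
    (1.8, 1.8): (1.6, 5.1, 10.8),
    (1.8, 2.0): (1.6, 5.1, 9.8),
    (2.0, 2.0): (1.5, 4.6, 9.2),
    (2.0, 1.8): (1.5, 4.6, 9.8),
    (2.0, 1.6): (1.5, 4.6, 10.5),
}

def ns_to_core_cycles(latency_ns, cpu_ghz):
    """cycles = time (ns) x frequency (GHz), since 1 GHz means one cycle per ns."""
    return latency_ns * cpu_ghz

for (cpu, nb), latencies in everest_ns.items():
    cycles = [round(ns_to_core_cycles(t, cpu), 1) for t in latencies]
    print(f"{cpu}/{nb}: L1/L2/L3 = {cycles} core cycles")
```

The two tools use different access patterns, so the converted Everest numbers need not match the CPUID column; the conversion only makes the rows comparable.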

