hey why do you think I'm not experimenting? trust me, I carefully experimented with SSE4.1 when it was something new; my conclusion is that it's worthless for any 3D purpose
Based on your few previous postings, it is not obvious that you even tried. Not knowing that SoA or SoS mainly use Packet instructions makes it more than suspicious; I'll classify you in the "beau parleur" (smooth talker) section, sorry.
And if you had ever really tried, you'd know that with threaded code SoA is a catastrophe ... and it is required to move to SoS
For those interested, think about how many streams SoA opens in memory, and how many that is for 3D vectors x Matrix, especially with 4 cores ... now, think about the prefetcher pattern when you have so many streams open ... it is impossible to get! SoA is really not multicore friendly, good luck with this!
Based on your few previous postings, it is not obvious that you even tried. Not knowing that SoA or SoS mainly use Packet instructions
read again the one where I say "anyway ADDPS/MULPS will be used with a SoA or SoS layout", and hey please call them "packed" like everybody else not "PACK" or "Packet"
LOL! I guess you can't connect "packed instruction" and "pack instructions"; when speaking of floats, it is pretty obvious.
Dude!
For those interested, think about how many streams SoA opens in memory, and how many that is for 3D vectors x Matrix, especially with 4
you really talk as if you don't understand what loop fission is, ever heard of the L0 icache on the Core uarch ?
who? wrote:
cores ... now, think about the prefetcher pattern who?
as most people know there are many prefetchers *per core*; with your 3D example you have 3x more fetch streams but 3x fewer cache misses per stream, so on average it is neutral if you are concerned about external RAM bandwidth requirements (though as you said in the past memory bandwidth is not important, though we all know you'll change your "mind" when Nehalem is out)
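The "3x more streams, 3x fewer misses per stream" argument can be checked with simple arithmetic. The sketch below is only an illustration under stated assumptions: 64-byte cache lines, 4-byte floats, a purely streaming read where every line is fetched exactly once, and a hypothetical single interleaved x,y,z stream as the baseline (the thread does not spell out the alternative layout precisely). The names `lines_per_layout` and `ceil_div` are made up for this example.

```python
CACHE_LINE_BYTES = 64  # assumption: 64-byte cache lines
FLOAT_BYTES = 4        # single-precision float


def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def lines_per_layout(n_vectors, components=3):
    """Count cache lines fetched when reading n_vectors 3D vectors once.

    Returns a dict mapping layout name -> (streams, lines_per_stream).
    Assumes a streaming read with no reuse, so lines fetched == misses.
    """
    interleaved_bytes = n_vectors * components * FLOAT_BYTES
    soa_bytes_per_stream = n_vectors * FLOAT_BYTES
    return {
        # hypothetical baseline: x, y, z stored together in one stream
        "interleaved": (1, ceil_div(interleaved_bytes, CACHE_LINE_BYTES)),
        # SoA: one separate stream per component
        "soa": (components, ceil_div(soa_bytes_per_stream, CACHE_LINE_BYTES)),
    }


if __name__ == "__main__":
    for name, (streams, lines) in lines_per_layout(1024).items():
        print(name, streams, "stream(s),", lines, "lines each,",
              streams * lines, "lines total")
```

Under these assumptions the total number of lines fetched is the same for both layouts; only the number of concurrent streams differs. It says nothing about how well the prefetchers track those streams, which is the point actually in dispute here and needs timing data.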
more generally, if you continue to post your bold statements without timing information, it looks more and more like you have something to hide about how worthless non-standard "SSE4.1" is
Last edited by Eric Bron on Sat Mar 22, 2008 9:03 am, edited 1 time in total.
LOL! I guess, you can't connect "packed instruction" and "pack instructions"
I have corrected it, and I told you that I'm sorry; look at "my bad" in the previous concerned post. Try to keep things focused and hey, why not, try to improve the way you talk to your customers
Time for an update with Shanghai pictures and a Nehalem picture not so obscured by wiring.
Regards, Hans
Hans, could "Bridge to 2nd Die?" really be a PCIe link toward an on-package GPU, like on this schematic?
BTW
petition towards "who?" and Eric: stop! Please stop!
Hans could "Bridge to 2nd Die?" really be PCIe link toward on package GPU like on this schematic
Yes. As for the 8-core version: after having seen Dunnington,
I think we may expect another monster monolithic die (~700 mm2)
rather than a dual-die package.
> > Hans could "Bridge to 2nd Die?" really be PCIe link toward on package GPU like on this schematic
> Yes.
OTOH it could just be a debug port for QPI.
(While you could hook a probe to a QPI link, doing
so is pretty intrusive. So I wouldn't be surprised if
Intel had added the ability to mirror one of the two
links to a debug port. Ideally they'd have added it
in a way where it could be used as a debug port or
as a 3rd link, but judging from what's known so far
it does not look like the 45nm QC is gonna support
more than DP -- only the 8-core (OC) will.)
Joined: Sun Mar 23, 2008 7:11 pm Posts: 18 Location: Tarragona, Spain
Speaking of cache latencies on Phenom 9700 (2.4 GHz cores / 2 GHz NB & L3), my own measurements look as follows:
L1 3 cycles
L2 minimum 9, maximum 15 cycles
L3 minimum 20, maximum 48 cycles
The new hardware prefetcher now acts not only on RAM but also on L2 & L3 (a first for AMD), which makes the L2 & L3 latencies vary.
I've tested the influence of the NB/L3 clock on performance and the results aren't impressive... Going from 1.8 to 2 GHz gives a 4.5% increase in WinRAR (tested with a 9600 BE @ 2.3 GHz). This is a corner case, as everyone knows WinRAR's need for low latency.
The same can be said about the clock of the DDR2 interface. DDR2 1066 with 17 GB/s of bandwidth saturates the 64-bit L3 bus @ 1.8 GHz (BW = 14.4 GB/s).
Remember the Athlon64's 64-bit L2 @ core speed that limited the bandwidth of the dual DDR2 800 interface in low-clocked CPUs?
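For anyone who wants to redo the bandwidth arithmetic, here is a minimal sketch. It assumes one transfer per clock on the 64-bit L3 bus (which is what the 14.4 GB/s figure implies) and two 64-bit DDR2 channels at 1066 MT/s (which is what roughly 17 GB/s implies); neither assumption is stated explicitly in the post. The function name is made up for illustration.

```python
def peak_bandwidth_gb_s(bus_width_bits, giga_transfers_per_s, channels=1):
    """Peak bandwidth in GB/s: bytes per transfer * transfers per second."""
    return (bus_width_bits / 8) * giga_transfers_per_s * channels


# Assumption: the 64-bit L3 bus moves one transfer per NB clock at 1.8 GHz.
l3_bus = peak_bandwidth_gb_s(64, 1.8)

# Assumption: dual-channel DDR2-1066, i.e. two 64-bit channels at 1.066 GT/s.
ddr2_1066 = peak_bandwidth_gb_s(64, 1.066, channels=2)

print(f"L3 bus:    {l3_bus:.1f} GB/s")
print(f"DDR2-1066: {ddr2_1066:.1f} GB/s")
print("memory peak exceeds L3 bus peak:", ddr2_1066 > l3_bus)
```

These are theoretical peaks only; sustained bandwidth on real hardware will be lower on both sides.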
Joined: Wed Jun 27, 2007 10:19 am Posts: 331 Location: Milano, Italy
redpriest wrote:
alavo03, those are some good measurements. NB clock does matter on Barcelona to some extent, but probably of limited value past 2.4 GHz.
Did someone measure how much the L3 and memory latency increase when having the NB/L3 run out-of-sync with the cores compared to in-sync? For example by measuring latencies of a 2.0 GHz K10 with the NB/L3 at 2.0 GHz and 1.8 GHz. I am curious about it because the NB/L3 was supposed (according to the rumors) to be able to run at the same frequency as the cores and possibly at a higher frequency (granted that it can vary independently from the cores' clock). The fact that NB/L3 speeds have been systematically lower than the core frequencies on all the available Phenoms and Opterons makes me think that AMD failed to deliver on that specific part and can possibly explain why they expect additional IPC gains from Shanghai once it's fixed. Just speculation on my part though.
Did someone measure how much the L3 and memory latency increase when having the NB/L3 run out-of-sync with the cores compared to in-sync? For example by measuring latencies of a 2.0 GHz K10 with the NB/L3 at 2.0GHz and 1.8GHz.
Have a look at xbitlabs, I posted the link in the other Phenom thread:
In terms of AMD I think there are two things to separate from each other. We should talk about the Northbridge+IMC+L3-Cache team and the Core team.
As far as I can see, neither performed very well, each in a very different area.
The Core team didn't want to risk too much; now they have an IPC improvement, but not as much as they wanted (I think a 4-issue-wide processor would have been the better decision).
The Northbridge team had a lot of big design challenges to fight against. Furthermore, it was not possible to raise the clock speeds into the regions they wanted.
I think that Barcelona has a lot of headroom once all speed paths are optimised. The differently clockable areas on the die are very impressive, but in my opinion a little bit too flexible. I hope that they will change this approach to a more static model.
Something like this for example: CPU1 3GHz CPU2 1.5GHz CPU3 1.5GHz CPU4 0.75GHz NB 3GHz
I think it is not really helpful when you want to reach a L3-Cache-Cell as fast as possible and you have to go through oddly clocked areas. A lot of performance is killed here.
I mentioned this point in December as well. I still think that AMD has to go one step back in terms of flexibility to get lower L3 cache latencies. Only even dividers like full frequency (3 GHz) / half frequency (1.5 GHz) / quarter frequency (750 MHz) should be applicable.
If you have a 2.8GHz -> 1.4 , 0.7
If you have a 2.6GHz -> 1.3 , 0.65
I think this type of flexibility is enough and you can integrate less complex buffer stages. If NB & core are running at the same frequency you can short-circuit this buffer stage and have lower latencies.
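The divider scheme proposed above (full / half / quarter only) is easy to tabulate. This is just a sketch of the arithmetic; the function name and the choice of three levels are illustrative, not anything AMD has described.

```python
def clock_steps(full_ghz, levels=3):
    """Return the full, half, quarter, ... clocks using power-of-two dividers."""
    return [full_ghz / (2 ** i) for i in range(levels)]


for full in (3.0, 2.8, 2.6):
    steps = ", ".join(f"{f:g} GHz" for f in clock_steps(full))
    print(f"{full:g} GHz part -> {steps}")
```

With only power-of-two ratios, every slower domain ticks on an edge of the fastest clock, which is the property the proposal relies on to simplify the buffer stages.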
I wonder how Nehalem will fight against this problem; very few details are out about all this stuff.
Did someone measure how much the L3 and memory latency increase when having the NB/L3 run out-of-sync with the cores compared to in-sync? For example by measuring latencies of a 2.0 GHz K10 with the NB/L3 at 2.0 GHz and 1.8 GHz. I am curious about it because the NB/L3 was supposed (according to the rumors) to be able to run at the same frequency as the cores and possibly at a higher frequency (granted that it can vary independently from the cores' clock). The fact that NB/L3 speeds have been systematically lower than the core frequencies on all the available Phenoms and Opterons makes me think that AMD failed to deliver on that specific part and can possibly explain why they expect additional IPC gains from Shanghai once it's fixed. Just speculation on my part though.
My guess is no more than 3 cycles... Being "out-of-sync" isn't that bad.
And for AMD it allows them to control each core clock and L3 clock independently.
alavo03, those are some good measurements. NB clock does matter on Barcelona to some extent, but probably of limited value past 2.4 GHz.
Did someone measure how much the L3 and memory latency increase when having the NB/L3 run out-of-sync with the cores compared to in-sync? For example by measuring latencies of a 2.0 GHz K10 with the NB/L3 at 2.0 GHz and 1.8 GHz. I am curious about it because the NB/L3 was supposed (according to the rumors) to be able to run at the same frequency as the cores and possibly at a higher frequency (granted that it can vary independently from the cores' clock). The fact that NB/L3 speeds have been systematically lower than the core frequencies on all the available Phenoms and Opterons makes me think that AMD failed to deliver on that specific part and can possibly explain why they expect additional IPC gains from Shanghai once it's fixed. Just speculation on my part though.
Ooooh, this is a good question and a good experiment to run. I will run it and report the numbers back to you.
As you probably know, Kanter addressed the circuit design for the NB/L3 clock domain: K10 implements FIFO buffers to absorb the clock skew. I am curious whether these buffers are adding more latency than AMD originally planned.
EDIT: Here is the data
CPU/NB clock (GHz) -- L1/L2/L3 latency (Everest, ns) -- L1/L2/L3 latency (CPUID, cycles)
1.8/1.8 -- 1.6/5.1/10.8 -- 3/15/51
1.8/2.0 -- 1.6/5.1/9.8 -- 3/15/48 (oddly, CPUID reports 4 levels of cache at these settings: one L1 at 8K and one L1 at 64K, weird)
2.0/2.0 -- 1.5/4.6/9.2 -- 3/16/50
2.0/1.8 -- 1.5/4.6/9.8 -- 3/16/45
2.0/1.6 -- 1.5/4.6/10.5 -- 3/15/55
Not using great SW to measure this... L3 is very variable, the 2.0/1.8 run varied between 43 and 53 over about 6 runs for CPUID latency test, tried to capture the average.
Nothing earth-shattering: neither tool showed anything beyond ordinary clock scaling in the time domain.
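To put the Everest nanosecond figures and the CPUID cycle counts on the same scale, the ns values can be multiplied by the core clock. The sketch below does that for the CPU/NB rows above; it assumes the Everest numbers are plain nanoseconds and that converting with the core clock (rather than the NB clock) is the fair comparison, and the names `MEASUREMENTS`, `ns_to_core_cycles` and `convert_all` are made up for this example.

```python
# (core GHz, NB GHz, (L1, L2, L3) Everest latency in ns), copied from the post
MEASUREMENTS = [
    (1.8, 1.8, (1.6, 5.1, 10.8)),
    (1.8, 2.0, (1.6, 5.1, 9.8)),
    (2.0, 2.0, (1.5, 4.6, 9.2)),
    (2.0, 1.8, (1.5, 4.6, 9.8)),
    (2.0, 1.6, (1.5, 4.6, 10.5)),
]


def ns_to_core_cycles(latency_ns, core_ghz):
    """Convert a latency in ns to core clock cycles (1 GHz = 1 cycle per ns)."""
    return latency_ns * core_ghz


def convert_all(rows):
    """Return (core, nb, (L1, L2, L3) in core cycles, rounded to 0.1)."""
    return [
        (core, nb, tuple(round(ns_to_core_cycles(ns, core), 1) for ns in lat))
        for core, nb, lat in rows
    ]


for core, nb, cycles in convert_all(MEASUREMENTS):
    print(f"{core}/{nb} -- L1/L2/L3 in core cycles: "
          + "/".join(str(c) for c in cycles))
```

The two tools will not necessarily agree after conversion, since they may be measuring different access patterns; the conversion only makes the comparison possible.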
Memory access latency went as such (using DDR2-800 CL5)
1.8/1.8 - 75.8
1.8/2.0 - 72.6
2.0/2.0 - 72.2
2.0/1.8 - 67.5
2.0/1.6 - 69.1
All measured with Everest (not a great tool to use), and a beta at that: 4.20 1283 Beta.