CPU/NB clock (GHz) -- L1/L2/L3 latency (Everest, ns) -- L1/L2/L3 latency (CPUID, cycles)
1.8/1.8 -- 1.6/5.1/10.8 -- 3/15/51
1.8/2.0 -- 1.6/5.1/9.8 -- 3/15/48 (oddly, CPUID reports 4 levels of cache at these settings: one L1 at 8K and one L1 at 64K. Weird.)
2.0/2.0 -- 1.5/4.6/9.2 -- 3/16/50
2.0/1.8 -- 1.5/4.6/9.8 -- 3/16/45
2.0/1.6 -- 1.5/4.6/10.5 -- 3/15/55
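Since the Everest figures are in nanoseconds and the CPUID figures are in cycles, a quick conversion makes the two comparable. This is only a minimal sketch: it assumes the first number of each CPU/NB pair is the core clock in GHz and that "cycles" means core cycles, neither of which is stated explicitly above. The names `ROWS`, `ns_to_cycles` and `compare` are made up for this illustration.

```python
# Rows copied from the measurements above:
# (core GHz, NB GHz, Everest L1/L2/L3 in ns, CPUID L1/L2/L3 in cycles)
ROWS = [
    (1.8, 1.8, (1.6, 5.1, 10.8), (3, 15, 51)),
    (1.8, 2.0, (1.6, 5.1, 9.8), (3, 15, 48)),
    (2.0, 2.0, (1.5, 4.6, 9.2), (3, 16, 50)),
    (2.0, 1.8, (1.5, 4.6, 9.8), (3, 16, 45)),
    (2.0, 1.6, (1.5, 4.6, 10.5), (3, 15, 55)),
]


def ns_to_cycles(latency_ns, clock_ghz):
    """Convert nanoseconds to clock cycles (1 GHz = 1 cycle per ns)."""
    return latency_ns * clock_ghz


def compare(rows):
    """Return (core, nb, Everest latencies in core cycles, CPUID cycles) per row."""
    result = []
    for core_ghz, nb_ghz, everest_ns, cpuid_cycles in rows:
        # Assumption: Everest latencies are converted using the core clock.
        everest_cycles = tuple(
            round(ns_to_cycles(t, core_ghz), 1) for t in everest_ns
        )
        result.append((core_ghz, nb_ghz, everest_cycles, cpuid_cycles))
    return result


if __name__ == "__main__":
    for core, nb, everest, cpuid in compare(ROWS):
        print(f"{core}/{nb} GHz  Everest (cycles): {everest}  CPUID (cycles): {cpuid}")
```

Under those assumptions the 1.8/1.8 row works out to roughly 2.9/9.2/19.4 cycles for Everest against 3/15/51 for CPUID, so the two tools agree on L1 but are far apart on L2 and L3.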
It seems Barcelona's prefetcher fooled Everest and CPUID... I'm not sure whether it is possible to disable the prefetcher. Anyway, RightMark is a bit more complete.
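A common way for latency tests to sidestep the prefetcher without disabling it is to walk a randomly ordered pointer chain, where each load depends on the previous one and the next address cannot be predicted. The sketch below only shows how such a chain can be built (Sattolo's algorithm, which produces a single cycle through all slots). It is not how Everest, CPUID or RightMark are known to work, and Python is far too slow to time cache latency itself; a real test would run the chase loop in C or assembly. All function names here are hypothetical.

```python
import random


def single_cycle_permutation(n, seed=0):
    """Build next-index links forming ONE cycle over n slots (Sattolo's algorithm)."""
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself, so no short cycles
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt, steps, start=0):
    """Follow the chain for `steps` dependent hops and return the final index."""
    idx = start
    for _ in range(steps):
        idx = nxt[idx]  # each hop depends on the result of the previous one
    return idx


def visited_slots(nxt, start=0):
    """Return the set of slots touched in one full trip around the chain."""
    seen = set()
    idx = start
    for _ in range(len(nxt)):
        seen.add(idx)
        idx = nxt[idx]
    return seen


if __name__ == "__main__":
    links = single_cycle_permutation(16)
    print(links)
    print(len(visited_slots(links)))
```

Because the chain is a single cycle, a walk of n hops touches every slot exactly once before returning to the start, so the working set size is controlled exactly by n.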
Also, it is going to be interesting to see who gets to FMA first -- Intel, with the successor to Sandy Bridge, or AMD, with SSE5 in Bulldozer.
And then there is the SSE5 vs AVX vs Larrabee conundrum. Will AMD adopt AVX? Will it drop SSE5 in favor of AVX? Will Larrabee introduce yet another extension for the same functionality, and how will developers cope with having to develop/test/support all those different code paths?
Would it be possible for AVX and SSE5 to implement some common subset of each other? Given that they do essentially the same things, it seems logical to have an overlap in opcodes. After looking at the two instruction set listings, I thought there was some overlap (similar names, similar descriptions, AVX able to handle both 128 and 256 bits and 3 or 4 operands), but that doesn't look like the case. Could someone verify that they are disjoint?
Joined: Sun Mar 23, 2008 7:11 pm Posts: 18 Location: Tarragona, Spain
The latency measurements from my previous post were taken with RMMA, which I think is the best memory and microarchitecture test suite. http://cpu.rightmark.org/download.shtml
The frequency of the NB/L3/memory controllers in Barcelona Opterons does not influence performance that much. I've tested NB frequencies of 1.8, 2.0, 2.2 and 2.4 GHz, and the best result is a 9% increase in WinRAR/7-Zip.
The latencies of L3 are reasonable considering:
- The asynchronous nature of the chip.
- The two clock domains with complex dividers.
- The semi-exclusive nature of L3 with two narrow 128-bit buses (one for each direction, to and from L2).
- The high associativity of 32 ways, which also adds latency.
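To get a feel for how much of the L3 latency sits in each clock domain, one can fit a very crude two-term model to the Everest L3 figures posted earlier in this thread for a 2.0 GHz core (9.2, 9.8 and 10.5 ns at NB clocks of 2.0, 1.8 and 1.6 GHz). The model, latency = fixed part + NB cycles / NB clock, is my own simplification for illustration and ignores synchronization and arbitration effects; the function names are hypothetical.

```python
def fit_two_domain_model(point_a, point_b):
    """Solve latency_ns = fixed_ns + nb_cycles / f_nb from two (f_nb GHz, ns) points."""
    (f_a, t_a), (f_b, t_b) = point_a, point_b
    nb_cycles = (t_b - t_a) / (1.0 / f_b - 1.0 / f_a)
    fixed_ns = t_a - nb_cycles / f_a
    return fixed_ns, nb_cycles


def predict_latency_ns(fixed_ns, nb_cycles, f_nb_ghz):
    """Evaluate the fitted model at another NB clock."""
    return fixed_ns + nb_cycles / f_nb_ghz


if __name__ == "__main__":
    # Fit on the 2.0 GHz and 1.6 GHz NB points, then check the 1.8 GHz point.
    fixed_ns, nb_cycles = fit_two_domain_model((2.0, 9.2), (1.6, 10.5))
    print(f"fixed part: {fixed_ns:.1f} ns, NB part: {nb_cycles:.1f} NB cycles")
    print(f"predicted at 1.8 GHz NB: {predict_latency_ns(fixed_ns, nb_cycles, 1.8):.2f} ns")
```

With these inputs the fit gives about 4.0 ns of fixed latency plus about 10.4 NB cycles, and it predicts roughly 9.78 ns at a 1.8 GHz NB clock, close to the 9.8 ns measured. Three data points are far too few to draw firm conclusions, but the numbers are at least consistent with most of the L3 latency scaling with the NB clock.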
I don't think Shanghai changes many of these facts. I expect it to be a die shrink without many novelties.
A big step forward would be a totally synchronous Shanghai, but what about the power considerations? The current thermal envelopes probably don't allow designs like this.
It seems Intel's Nehalem L3 will be asynchronous too, with significant latency. We'll see.
As a result, the load to use latency for Nehalem varies depending on the relative frequency and phase alignment of the cores and the L3 itself and the latency of arbitration for access to the L3. In the best case, i.e. phase aligned operation and frequencies that differ by an integer multiple, Nehalem’s L3 load to use latency is somewhere in the range of 30-40 cycles according to Intel architects.
That is in much the same range as the Barcelona L3. I'm wondering how the small L2 caches will work in that context. I think Intel will face more problems than they expected. Core 2 profits a lot from its big, low-latency L2.
I'm aware of the excellent Real World Technologies article, but I mean real measurements, not estimates from the manufacturer. Still, 30-40 cycles in the best case seems quite high considering Intel's expertise in caches.
From AMD's Barcelona we can learn a few facts about this kind of complex asynchronous device. Putting the L3/NB/memory controllers in their own clock domain creates latency problems, aggravated by the fact that the core clock changes according to the performance requirements of the system.
>seems quite high considering the expertise in caches from Intel
Even their expertise cannot fight against physical laws.
I think the decision to integrate the L2 so seamlessly into the lower right corner of the core could be wrong. It is not possible to go to bigger L2 caches without changing the position of some pipeline stages. Montreal will have 1 MB of dedicated L2 cache per core instead of 512 KB (Shanghai and Barcelona), and it won't change the look of the core.
I'm quite sure that benchmarks like SuperPi won't run that fast on Nehalem compared to Penryn (on an IPC basis), because of the slower cache hierarchy (small L2 cache, slow L3 cache).
..."“We’ll start the production ramp in the summertime and start to ship products in volume in Q4 2008,” said Dirk Meyer, president and chief operating officer at AMD during a conference call with financial analysts."...
That is not a delay; it is just one of the first truly upfront statements about the 45nm schedule.
Anything you heard from official sources about mid-2008 was cleverly deceptive, trying to make it look like they were less far behind than they really are. Phrases like "we'll start up the ramp" and such, which are quite meaningless to outsiders and therefore can always be wormed out of if anybody would actually call them out for it.
Those with good will towards AMD (or less experience with their marketing) could interpret those mid-2008 statements as them closing the process technology gap with Intel by up to half a year. But the realists already knew it would very likely be the end of the year, especially when word got out that they only got first silicon for Shanghai in Q1.
This is a recurring phenomenon by the way. They did it with 65nm and 45nm and they'll do it again with 32nm. Look out for it next time - always remember that "in the second half of the year" can mean December 31st without it being a lie ;).
Hans, could "Bridge to 2nd Die?" really be a PCIe link to an on-package GPU, like on this schematic?
Yes. As far as the 8-core version is concerned: after having seen Dunnington, I think we may expect another monster monolithic die (~700 mm2) rather than a dual-die package.