 |
Aceshardware (not so) temporary home for the aceshardware community
|
| Author |
Message |
Hans de Vries
Joined: 07 Aug 2007 Posts: 89
|
|
|
 |
DavidC1
Joined: 15 Aug 2007 Posts: 32
|
Posted: Fri Jun 06, 2008 1:58 am Post subject: |
|
|
Well, apparently the benchmark tests were done on a motherboard with memory performance issues.
"We had access to a 2.66GHz Nehalem for the longest time, unfortunately the motherboard it was paired with had some serious issues with memory performance. Not only was there no difference between single and triple channel memory configurations, memory latency was high."
"Unfortunately we didn't have access to the more mature platform for very long at all, meaning the majority of our tests had to be run on the first setup (never fear, Nehalem is fast enough that it didn't end up mattering)."
"The motherboard implementation of our 2.66GHz system needed some work so our memory bandwidth/latency numbers on it were way off (slower than Core 2), luckily we had another platform at our disposal running at 2.93GHz which was working perfectly."
So the version that tested 20-50% faster in multi-threaded apps was running on a motherboard whose problems dragged memory latency/bandwidth below Core 2 levels. With a properly working board and no such bug, we are going to see even bigger performance gains.
|
|
|
 |
inf64
Joined: 04 Sep 2007 Posts: 69
|
Posted: Fri Jun 06, 2008 2:01 am Post subject: |
|
|
Anand just updated the article with a correction... Seems all the Penryn results were shown lower than they usually are:
http://www.xtremesystems.org/forums/showpost.php?p=3040010&postcount=202
Spreadsheet error or not, it now seems Nehalem is roughly the same speed per clock as 45nm Core 2 in single-threaded scenarios (most desktop apps and games)... In multithreading, though, it gives a nice boost over Penryn. But Anand dropped the ball on this one (as he has in the past, so no surprise here).
http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3326&p=7
| Quote: | | We also ran the single-threaded Cinebench test to see how performance improved on an individual core basis vs. Penryn (Updated: The original single-threaded Penryn Cinebench numbers were incorrect, we've included the correct ones): |
| Quote: | | Cinebench shows us only a 2% increase in core-to-core performance from Penryn to Nehalem at the same clock speed. For applications that don't go out to main memory much and can stay confined to a single core, Nehalem behaves very much like Penryn. Remember that outside of the memory architecture and HT tweaks to the core, Nehalem's list of improvements are very specific (e.g. faster unaligned cache accesses). |
|
|
|
 |
who?
Joined: 01 Sep 2007 Posts: 540
|
|
|
 |
redpriest
Joined: 30 Aug 2007 Posts: 58
|
Posted: Fri Jun 06, 2008 4:39 am Post subject: |
|
|
who, for years pretty much any "analysis" done by lots of websites was inaccurate. I can remember Quake 3 in particular being lauded for higher performance because of "better SSE2 instruction execution". Sadly, if anyone had bothered to do some casual analysis of Quake 3, they would have found that:
1) id had invented their own VM
2) their VM didn't issue SSE2 instructions
|
|
|
 |
who?
Joined: 01 Sep 2007 Posts: 540
|
Posted: Fri Jun 06, 2008 6:59 am Post subject: |
|
|
| redpriest wrote: | who, for years pretty much any "analysis" done by lots of websites was inaccurate. I can remember Quake 3 in particular being lauded for higher performance because of "better SSE2 instruction execution". Sadly, if anyone had bothered to do some casual analysis of Quake 3, they would have found that:
1) id had invented their own VM
2) their VM didn't issue SSE2 instructions |
Heuuuuuu, the links please??? When did I say that? I don't think you remember well!
It does not stop you from "fixing" your benchmark! lol!
I invite you to go there (http://www.intel.com/support/performancetools/vtune/), download VTune, and look for the critical path. You'll figure out that Radix 16 is a bigger deal than the memory subsystem in this, but the Cinebench workload certainly DOES NOT fit into the cache. (Help yourself and run VTune instead of spitting randomly, as you ALWAYS DO.)
who?
Last edited by who? on Fri Jun 06, 2008 7:13 am; edited 1 time in total |
|
|
 |
jack
Joined: 27 Jun 2007 Posts: 359
|
Posted: Fri Jun 06, 2008 6:59 am Post subject: |
|
|
I think the reason is simple: Yorkfield is cheaper to manufacture, and since it's unlikely that the fastest Yorkfields will be facing competition anytime soon, it makes sense to sell Nehalem only at the high end for H1 2009.
|
|
|
 |
redpriest
Joined: 30 Aug 2007 Posts: 58
|
Posted: Fri Jun 06, 2008 6:54 pm Post subject: |
|
|
| who? wrote: |
Heuuuuuu, the links please??? When did I say that? I don't think you remember well!
It does not stop you from "fixing" your benchmark! lol!
I invite you to go there (http://www.intel.com/support/performancetools/vtune/), download VTune, and look for the critical path. You'll figure out that Radix 16 is a bigger deal than the memory subsystem in this, but the Cinebench workload certainly DOES NOT fit into the cache. (Help yourself and run VTune instead of spitting randomly, as you ALWAYS DO.)
who? |
So I make a comment about other people's analysis, and you respond with a personal slam and an inaccurate slur. Do you understand English or do I have to spell it out for you? You know as well as I do that websites aren't going to go digging much beyond just the benchmark numbers and they're just going to state what they think is correct rather than do any digging themselves.
For the record, I never "fixed" any benchmark. For you to suggest as much is pretty much retarded considering your employer's tendency to fix other benchmarks. Sysmark much?
|
|
|
 |
redpriest
Joined: 30 Aug 2007 Posts: 58
|
Posted: Fri Jun 06, 2008 6:56 pm Post subject: |
|
|
And btw, if you're going to suggest "fixing" anything I suggest you bring out an example.
|
|
|
 |
who?
Joined: 01 Sep 2007 Posts: 540
|
Posted: Sat Jun 07, 2008 5:17 am Post subject: |
|
|
| redpriest wrote: | | And btw, if you're going to suggest "fixing" anything I suggest you bring out an example. |
EXAMPLE: "Fixing" the memory test of ScienceMark, to, and I quote you "trick the prefetcher"...
you don't remember posting this in 2006?
who?
|
|
|
 |
dkanter
Joined: 20 Sep 2007 Posts: 59
|
Posted: Sat Jun 07, 2008 5:31 am Post subject: |
|
|
| who? wrote: | | redpriest wrote: | | And btw, if you're going to suggest "fixing" anything I suggest you bring out an example. |
EXAMPLE: "Fixing" the memory test of ScienceMark, to, and I quote you "trick the prefetcher"...
you don't remember posting this in 2006?
who? |
If the test was meant to measure latency to memory (which it probably was), there's nothing wrong with trying to fool the prefetcher.
I have a reasonable idea what the problem with Barcelona's L3 is, and the interesting question is whether Shanghai will improve latency.
DK
|
|
|
 |
who?
Joined: 01 Sep 2007 Posts: 540
|
Posted: Sat Jun 07, 2008 10:16 am Post subject: |
|
|
| dkanter wrote: | | who? wrote: | | redpriest wrote: | | And btw, if you're going to suggest "fixing" anything I suggest you bring out an example. |
EXAMPLE: "Fixing" the memory test of ScienceMark, to, and I quote you "trick the prefetcher"...
you don't remember posting this in 2006?
who? |
If the test was meant to measure latency to memory (which it probably was), there's nothing wrong with trying to fool the prefetcher.
I have a reasonable idea what the problem with Barcelona's L3 is, and the interesting question is whether Shanghai will improve latency.
DK |
I know :)...
You know my humour... kind of Acid.
who?
|
|
|
 |
JumpingJack
Joined: 05 Oct 2007 Posts: 124
|
Posted: Sat Jun 07, 2008 11:26 am Post subject: |
|
|
| dkanter wrote: | | who? wrote: | | redpriest wrote: | | And btw, if you're going to suggest "fixing" anything I suggest you bring out an example. |
EXAMPLE: "Fixing" the memory test of ScienceMark, to, and I quote you "trick the prefetcher"...
you don't remember posting this in 2006?
who? |
If the test was meant to measure latency to memory (which it probably was), there's nothing wrong with trying to fool the prefetcher.
I have a reasonable idea what the problem with Barcelona's L3 is, and the interesting question is whether Shanghai will improve latency.
DK |
David,
I would like to hear your hypothesis ...
Myself, I thought of a couple ... one, it might have something to do with the clock skew over several clock domains -- the FIFO buffers used to absorb that skew are probably adding a great deal of latency.
Another possibility is simply high latency caused by RC delay over such a large die and cache pool .... there was a great deal of hoopla about L2 latency in Brisbane when AMD made the 90 to 65 nm transition, which, combined with the high interest in ultra-low-k materials for 45 nm, may imply that AMD/IBM are hitting RC limitations.
Speculation on my part for certain....
Jack
|
|
|
 |
redpriest
Joined: 30 Aug 2007 Posts: 58
|
Posted: Sat Jun 07, 2008 8:46 pm Post subject: |
|
|
Brisbane's latency really didn't change much compared to 90nm K8. What changed was the bandwidth available for a cacheline transfer, and how often it was available. Common cache/memory test benchmarks would saturate that bandwidth, and since there would be a delay before the next cacheline transfer from L2, this would show up as a latency increase.
To correctly measure the "best case" latency from L2 on 65nm K8, you would have to insert a delay before the next pointer chase - a simple register-to-register add would take care of this.
While performance indeed suffers, many cacheline transfers outside of streaming benchmarks aren't bunched together, so performance isn't affected as badly as the many-cycle increase that the cache utility reports would suggest.
|
|
|
 |
JumpingJack
Joined: 05 Oct 2007 Posts: 124
|
Posted: Sat Jun 07, 2008 8:56 pm Post subject: |
|
|
| redpriest wrote: | Brisbane's latency really didn't change much compared to 90nm K8. What changed was the bandwidth available for a cacheline transfer, and how often it was available. Common cache/memory test benchmarks would saturate that bandwidth, and since there would be a delay before the next cacheline transfer from L2, this would show up as a latency increase.
To correctly measure the "best case" latency from L2 on 65nm K8, you would have to insert a delay before the next pointer chase - a simple register-to-register add would take care of this.
While performance indeed suffers, many cacheline transfers outside of streaming benchmarks aren't bunched together, so performance isn't affected as badly as the many-cycle increase that the cache utility reports would suggest. |
Fair enough, your post gives a better explanation why latency measurements varied between different apps, however, there were mixed reports in that initial Brisbane reviews were measuring as much as 5 or 6 or even 8 cycles lost. Lost Circuits did a more detailed analysis and found 2 cycles lost.... AMD issued statements to a couple of reviewers that Brisbane did indeed lose two cycles -- AMD explained it as 'reserved the capability for increasing cache if needed' -- which, in my opinion, was a dodge.
http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2893 see page 3.
| Quote: | AMD has given us the official confirmation that L2 cache latencies have increased, and that it purposefully did so in order to allow for the possibility of moving to larger cache sizes in future parts. AMD stressed that this wasn't a pre-announcement of larger cache parts to come, but rather a preparation should the need be there to move to a vastly larger L2. Thankfully the performance delta isn't huge, at least in the benchmarks that we saw, so AMD's decision isn't too painful - especially as it comes with the benefit of a cooler running core that draws less power; ideally we'd like the best of all worlds but we'll take what we can get. Note that none of AMD's current roadmaps show any larger L2 parts (other than the usual 2x1MB offerings), which tells us one of two things: either AMD has some larger L2 parts that it's planning on releasing or AMD is being completely honest with the public in saying that the larger L2 parts will only be released if necessary.
|
Latency measurements from CPUID and Sciencemark were off, per AMD and confirmed via Lost Circuits, but there were indeed a couple of cycles lost in the shrink.
This could be from many factors, but it certainly wasn't from increased Si real estate due to a larger cache -- Brisbane never taped out anything larger than 512 KB/core, down from some Windsor products that had 1 MB/core.
EDIT: what I find interesting, and the empirical information suggesting a weak backend, is their focus on ultra-low-k materials at 45 nm. Add to that the 2 metal layers AMD added for Barcelona, and it all points to a move to relieve backend RC delay.
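To put the RC argument in symbols: the usual first-order textbook estimate for an unrepeated on-chip wire is sketched below. This is a generic approximation, not a figure from AMD or IBM, and the symbols ($r$, $c$, $L$, $\rho$, $w$, $t$, $k$) are introduced here only for illustration.

$$
t_{\text{wire}} \approx 0.38\, r\, c\, L^{2}, \qquad r = \frac{\rho}{w\,t}, \qquad c \propto k\,\varepsilon_{0}
$$

Here $r$ and $c$ are resistance and capacitance per unit length, $L$ is the wire length, $w$ and $t$ are the wire width and thickness, $\rho$ is the metal resistivity, and $k$ is the relative dielectric constant of the insulator. A shrink reduces $w$ and $t$, so $r$ goes up. A lower-k dielectric pulls $c$ back down, and extra metal layers leave room to route long wires on thicker, lower-resistance upper levels. Both would be consistent with a process that is fighting backend RC delay, though that remains speculation.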
Jack
|
|
|
 |
|
|
|
|